# Unsupervised Learning with Apache Spark

Post on 19-Aug-2014


## Description

Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Clustering algorithms, for example, try to partition elements of a dataset into related groups. Dimensionality reduction algorithms search for a simpler representation of a dataset. Spark's MLlib module contains implementations of several unsupervised learning algorithms that scale to huge datasets. In this talk, we'll dive into uses and implementations of Spark's K-means clustering and singular value decomposition (SVD).

Bio: Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.

## Transcript

- Data scientist at Cloudera. Recently led Apache Spark development at Cloudera. Before that, a committer on Apache Hadoop. Before that, studied combinatorial optimization and distributed systems at Brown.
- How many kinds of stuff are there? Why is some stuff not like the others? How do I contextualize new stuff? Is there a simpler way to represent this stuff?
- Learn the hidden structure of your data. Interpret new data as it relates to this structure.
- Clustering: partition data into categories. Dimensionality reduction: find a condensed representation of your data.
- Designing a system for processing huge data in parallel, and taking advantage of it with algorithms that work well in parallel.
- Reading and summing a file of numbers (each HDFS partition of `bigfile.txt` is processed in parallel; the partial sums are combined at the driver):

```scala
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()
```
- The same job, with the parsed numbers cached in memory so later actions can reuse them:

```scala
val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache().sum()
```
- [Diagram: with `numbers` cached, subsequent jobs read the partitions from memory instead of recomputing them from `bigfile.txt` in HDFS.]
- MLlib's algorithms:
  - Supervised
    - Classification (discrete): logistic regression (and regularized variants), linear SVM, naive Bayes, random decision forests (soon)
    - Regression (continuous): linear regression (and regularized variants)
  - Unsupervised
    - Clustering (discrete): K-means
    - Dimensionality reduction / matrix factorization (continuous): principal component analysis / singular value decomposition, alternating least squares
- Anomalies as data points far away from any cluster
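This anomaly criterion can be sketched in plain Scala (hypothetical helper names; on an RDD, Spark would compute the nearest center with `KMeans.findClosest`):

```scala
// A point is anomalous if even its nearest cluster center is farther
// away than some threshold.
def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def isAnomaly(centers: Seq[Array[Double]], p: Array[Double], threshold: Double): Boolean =
  centers.map(c => math.sqrt(squaredDist(c, p))).min > threshold
```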
- Clustering with MLlib's KMeans:

```scala
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble))

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
```
- Alternate between two steps:
  - Assign each point to a cluster based on the existing centers
  - Recompute each cluster center from the points assigned to it
- The same two steps, in parallel:
  - Assign each point to a cluster based on the existing centers (each data point is processed independently)
  - Recompute the cluster centers from the points in each cluster (partial sums averaged across partitions)
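The alternation can be sketched on plain Scala collections (a hypothetical local version; Spark performs the assignment per partition and merges the partial sums across partitions):

```scala
def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def closest(centers: Seq[Array[Double]], p: Array[Double]): Int =
  centers.indices.minBy(i => squaredDist(centers(i), p))

// One Lloyd iteration: assign every point to its nearest center, then
// recompute each center as the mean of its assigned points (an empty
// cluster keeps its old center).
def lloydStep(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Seq[Array[Double]] = {
  val byCluster = points.groupBy(p => closest(centers, p))
  centers.indices.map { i =>
    byCluster.get(i) match {
      case Some(ps) => Array.tabulate(ps.head.length)(d => ps.map(_(d)).sum / ps.length)
      case None     => centers(i)
    }
  }
}
```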
- The heart of MLlib's K-means (reconstructed from Spark's `KMeans` source; `centers`, `costAccum`, `mergeContribs`, `epsilon`, and friends come from the surrounding implementation): each partition computes per-center sums and counts, `reduceByKey` merges them, and the driver recomputes the centers:

```scala
// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)
  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }
  val contribs = for (j <- 0 until k) yield (j, (sums(j), counts(j)))
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()

// Update the cluster centers; stop when no center moves more than epsilon
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value
```
- K-means is very sensitive to the initial set of centers chosen. The best existing algorithm for choosing centers (k-means++) is highly sequential.
- Start with a random point from the dataset. Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen. Repeat until all initial centers are chosen.
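The k-means++ seeding loop can be sketched locally (hypothetical names; a sketch, not MLlib's implementation):

```scala
import scala.util.Random

def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// k-means++ seeding: each new center is sampled with probability
// proportional to its squared distance from the nearest chosen center.
def kMeansPlusPlusInit(points: IndexedSeq[Array[Double]], k: Int,
                       rng: Random): Seq[Array[Double]] = {
  val centers = scala.collection.mutable.ArrayBuffer(points(rng.nextInt(points.length)))
  while (centers.length < k) {
    // Squared distance from each point to its closest chosen center
    val d2 = points.map(p => centers.map(c => squaredDist(c, p)).min)
    // Roulette-wheel sample proportional to d2
    var r = rng.nextDouble() * d2.sum
    var i = 0
    while (i < points.length - 1 && r > d2(i)) { r -= d2(i); i += 1 }
    centers += points(i)
  }
  centers.toSeq
}
```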
- The resulting initial clustering has expected cost within an O(log k) factor of the optimal cost.
- Requires k passes over the data
- K-means||: do only a few (~5) passes. Sample m points on each pass, oversampling relative to k. Then run k-means++ on the sampled points to find the initial centers.
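One sampling pass of this idea can be sketched locally (hypothetical names; a sketch of the oversampling step, not Spark's k-means|| implementation):

```scala
import scala.util.Random

def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// One pass: take each point with probability proportional to its cost
// (squared distance to the nearest chosen center), scaled by an
// oversampling factor so many candidates are drawn per pass.
def samplePass(points: Seq[Array[Double]], centers: Seq[Array[Double]],
               oversample: Double, rng: Random): Seq[Array[Double]] = {
  val d2 = points.map(p => centers.map(c => squaredDist(c, p)).min)
  val total = d2.sum
  points.zip(d2)
    .filter { case (_, d) => rng.nextDouble() < oversample * d / total }
    .map(_._1)
}
```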
- Back to the MLlib algorithm map: next, dimensionality reduction and matrix factorization (principal component analysis / singular value decomposition, alternating least squares).
- Select a basis for your data that is orthonormal and maximizes variance along its axes.
- Find dominant trends
- Find a lower-dimensional representation that lets you visualize the data. Feature learning: find a representation that's good for clustering or classification. Latent semantic analysis.
- PCA with MLlib (the projection is written here with `RowMatrix.multiply` rather than the slide's raw Breeze arithmetic):

```scala
val data: RDD[Vector] = ...
val mat = new RowMatrix(data)

// Compute the top 5 principal components
val principalComponents = mat.computePrincipalComponents(5)

// Project the data into the principal-component subspace
val transformed = mat.multiply(principalComponents)
```
- Center the data. Find the covariance matrix. Its eigenvectors are the principal components.
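The centering and covariance steps can be sketched on local collections (a hypothetical sketch; the eigendecomposition itself is left to a linear-algebra library):

```scala
// Center the rows, then form the covariance matrix
// C = (1 / (m - 1)) * X^T X from the centered data.
// The eigenvectors of C are the principal components.
def covariance(rows: Seq[Array[Double]]): Array[Array[Double]] = {
  val m = rows.length
  val n = rows.head.length
  val mean = Array.tabulate(n)(j => rows.map(_(j)).sum / m)
  val centered = rows.map(r => Array.tabulate(n)(j => r(j) - mean(j)))
  Array.tabulate(n, n)((i, j) =>
    centered.map(r => r(i) * r(j)).sum / (m - 1))
}
```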
- [Diagrams: an m × n data matrix, distributed across partitions by rows, is reduced to an n × n covariance (Gram) matrix; each partition computes an n × n contribution, and the contributions are summed.]
- MLlib's Gramian computation: each row contributes a rank-1 update (via `dspr`) to the upper-triangular packed Gram matrix, aggregated across the RDD:

```scala
def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  val nt: Int = n * (n + 1) / 2

  // Compute the upper triangular part of the gram matrix.
  val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) => {
      RowMatrix.dspr(1.0, v, U.data)
      U
    },
    combOp = (U1, U2) => U1 += U2
  )

  RowMatrix.triuToFull(n, GU.data)
}
```
- n^2 must fit in memory
- n^2 must fit in memory. Not yet implemented: an EM algorithm can do it with O(kn) memory, where k is the number of principal components.
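The n^2 constraint is easy to make concrete (back-of-envelope arithmetic with hypothetical helper names; 8 bytes per double):

```scala
// Memory for a dense n x n Gram/covariance matrix of doubles.
def gramBytes(n: Long): Long = n * n * 8L

// Memory for the O(kn) EM alternative holding k components of length n.
def emBytes(n: Long, k: Long): Long = k * n * 8L
```

With n = 100,000 features the full Gram matrix needs 80 GB, while the top five components under the O(kn) scheme need only 4 MB.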