Unsupervised Learning with Apache Spark


Author: db-tsai

Post on 19-Aug-2014


Category: Engineering



DESCRIPTION

Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Clustering algorithms, for example, try to partition elements of a dataset into related groups. Dimensionality reduction algorithms search for a simpler representation of a dataset. Spark's MLlib module contains implementations of several unsupervised learning algorithms that scale to huge datasets. In this talk, we'll dive into uses and implementations of Spark's K-means clustering and singular value decomposition (SVD). Bio: Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.

TRANSCRIPT

  • Data scientist at Cloudera. Recently led Apache Spark development at Cloudera. Before that, a committer on Apache Hadoop. Before that, studied combinatorial optimization and distributed systems at Brown.
  • How many kinds of stuff are there? Why is some stuff not like the others? How do I contextualize new stuff? Is there a simpler way to represent this stuff?
  • Learn the hidden structure of your data. Interpret new data as it relates to this structure.
  • Clustering: partition data into categories. Dimensionality reduction: find a condensed representation of your data.
  • Designing a system for processing huge data in parallel. Taking advantage of it with algorithms that work well in parallel.
  • [Diagram: bigfile.txt in HDFS becomes a lines RDD, then a numbers RDD, each spread across partitions; sum() returns a value to the driver]

    val lines = sc.textFile("bigfile.txt")
    val numbers = lines.map(x => x.toDouble)
    numbers.sum()
  • [Diagram: the same pipeline, now with the numbers RDD cached in memory across its partitions]

    val lines = sc.textFile("bigfile.txt")
    val numbers = lines.map(x => x.toDouble)
    numbers.cache().sum()
  • [Diagram: on subsequent actions, the cached numbers partitions are reused directly; nothing is reread from HDFS]
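    A minimal sketch of why the cache matters, reusing the numbers RDD from above (stats() is a standard action on RDD[Double]):

    val s1 = numbers.sum()   // first action: reads from HDFS and populates the cache
    val s2 = numbers.stats() // later actions: served from the in-memory partitions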
  • MLlib coverage:
    Supervised, discrete: classification (logistic regression and regularized variants, linear SVM, naive Bayes, random decision forests coming soon)
    Supervised, continuous: regression (linear regression and regularized variants)
    Unsupervised, discrete: clustering (K-means)
    Unsupervised, continuous: dimensionality reduction / matrix factorization (principal component analysis / singular value decomposition, alternating least squares)
  • Anomalies as data points far away from any cluster
  • import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val data = sc.textFile("kmeans_data.txt")
    val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

    // Cluster the data into two classes using KMeans
    val numIterations = 20
    val numClusters = 2
    val clusters = KMeans.train(parsedData, numClusters, numIterations)
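    Tying this back to the anomaly use case above, a minimal sketch (not from the talk) that flags points far from their assigned center; distToCentroid and threshold are made-up names for illustration:

    import org.apache.spark.mllib.clustering.KMeansModel
    import org.apache.spark.mllib.linalg.Vector

    def distToCentroid(point: Vector, model: KMeansModel): Double = {
      val center = model.clusterCenters(model.predict(point))
      math.sqrt(point.toArray.zip(center.toArray).map { case (a, b) => (a - b) * (a - b) }.sum)
    }

    val threshold = 5.0 // assumed cutoff; in practice, derive it from the distance distribution
    val anomalies = parsedData.filter(p => distToCentroid(p, clusters) > threshold)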
  • Alternate between two steps:
    1. Assign each point to a cluster based on the existing centers
    2. Recompute the cluster centers from the points in each cluster
  • Alternate between two steps:
    1. Assign each point to a cluster based on the existing centers (process each data point independently)
    2. Recompute the cluster centers from the points in each cluster (average across partitions)
  • // Find the sum and count of points mapping to each center
    val totalContribs = data.mapPartitions { points =>
      val k = centers.length
      val dims = centers(0).vector.length

      val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
      val counts = Array.fill(k)(0L)

      points.foreach { point =>
        val (bestCenter, cost) = KMeans.findClosest(centers, point)
        costAccum += cost
        sums(bestCenter) += point.vector
        counts(bestCenter) += 1
      }

      val contribs = for (j <- 0 until k) yield {
        (j, (sums(j), counts(j)))
      }
      contribs.iterator
    }.reduceByKey(mergeContribs).collectAsMap()

    // Update the cluster centers and costs
    var changed = false
    var j = 0
    while (j < k) {
      val (sum, count) = totalContribs(j)
      if (count != 0) {
        sum /= count.toDouble
        val newCenter = new BreezeVectorWithNorm(sum)
        if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
          changed = true
        }
        centers(j) = newCenter
      }
      j += 1
    }
    if (!changed) {
      logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
    }
    cost = costAccum.value
  • K-means is very sensitive to the initial set of centers chosen. The best existing algorithm for choosing centers (K-means++) is highly sequential.
  • K-means++: start with a random point from the dataset. Pick the next one at random, with probability proportional to its squared distance from the closest center already chosen. Repeat until all initial centers are chosen.
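    A minimal single-machine sketch of that procedure (not MLlib's implementation; the names here are assumptions for illustration):

    import scala.util.Random

    def sqDist(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

    def kMeansPlusPlus(points: IndexedSeq[Array[Double]], k: Int, rng: Random): Seq[Array[Double]] = {
      val centers = scala.collection.mutable.ArrayBuffer(points(rng.nextInt(points.length)))
      while (centers.length < k) {
        // Weight each point by its squared distance to the closest chosen center
        val weights = points.map(p => centers.map(c => sqDist(p, c)).min)
        // Sample the next center with probability proportional to its weight
        var r = rng.nextDouble() * weights.sum
        val idx = weights.indexWhere { w => r -= w; r <= 0 }
        centers += points(if (idx >= 0) idx else points.length - 1)
      }
      centers.toSeq
    }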
  • The initial set of centers has expected cost within a factor of O(log k) of the optimal cost
  • Requires k passes over the data
  • K-means|| (the scalable variant): do only a few (~5) passes over the data. Sample m points on each pass, oversampling relative to k. Then run K-means++ on the sampled points to find the initial centers.
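    In MLlib this initialization is the default; with the Spark 1.x API it can also be requested explicitly (a sketch, reusing parsedData from the earlier example):

    import org.apache.spark.mllib.clustering.KMeans

    val model = KMeans.train(
      parsedData,                // RDD[Vector]
      2,                         // number of clusters
      20,                        // max iterations
      1,                         // number of runs
      KMeans.K_MEANS_PARALLEL)   // "k-means||" initialization mode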
  • (The MLlib overview again; turning now to the unsupervised, continuous quadrant: dimensionality reduction via principal component analysis / singular value decomposition)
  • Select a basis for your data that is orthonormal and maximizes variance along its axes
  • Find dominant trends
  • Find a lower-dimensional representation that lets you visualize the data. Feature learning: find a representation that's good for clustering or classification. Latent Semantic Analysis.
  • import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val data: RDD[Vector] = ...
    val mat = new RowMatrix(data)

    // Compute the top 5 principal components
    val principalComponents = mat.computePrincipalComponents(5)

    // Project the data into the subspace the components span
    val transformed = mat.multiply(principalComponents)
  • 1. Center the data
    2. Compute the covariance matrix
    3. Its eigenvectors are the principal components
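    Those three steps as a minimal local sketch in Breeze (for illustration only; MLlib's distributed implementation follows below):

    import breeze.linalg._
    import breeze.stats.mean

    // Rows of x are data points; returns the top k principal components as columns.
    def pca(x: DenseMatrix[Double], k: Int): DenseMatrix[Double] = {
      // 1. Center each feature (column) at zero mean
      val colMeans: DenseVector[Double] = mean(x(::, *)).t
      val centered = x(*, ::) - colMeans
      // 2. Sample covariance matrix (n x n)
      val cov = (centered.t * centered) / (x.rows - 1).toDouble
      // 3. Eigenvectors of the symmetric covariance matrix;
      //    eigSym returns eigenvalues in ascending order, so take the last k
      val es = eigSym(cov)
      es.eigenvectors(::, (cov.cols - k) until cov.cols).copy
    }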
  • [Diagram: an m × n data matrix produces an n × n covariance matrix]
  • [Diagram: the data matrix is split row-wise into blocks across partitions]
  • [Diagram: each partition computes an n × n contribution from its block of rows]
  • [Diagram: the per-partition n × n contributions are summed into the full n × n matrix]
  • def computeGramianMatrix(): Matrix = {
      val n = numCols().toInt
      // Number of entries in the packed upper triangle
      val nt: Int = n * (n + 1) / 2

      // Compute the upper triangular part of the gram matrix by summing
      // the outer product of each row into a packed buffer
      val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
        seqOp = (U, v) => {
          // BLAS dspr: U += 1.0 * v * v^T (packed upper triangle)
          RowMatrix.dspr(1.0, v, U.data)
          U
        },
        combOp = (U1, U2) => U1 += U2
      )

      // Expand the packed upper triangle into a full symmetric matrix
      RowMatrix.triuToFull(n, GU.data)
    }
  • n^2 must fit in memory
  • Not yet implemented: an EM algorithm can do it in O(kn) memory, where k is the number of principal components
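    The EM algorithm itself isn't shown in the deck; as a related illustration of avoiding the n × n matrix entirely, here is a power-iteration sketch in Breeze (an assumption for illustration, not MLlib code):

    import breeze.linalg.{DenseMatrix, DenseVector, norm}

    // Power iteration: finds the top principal component of centered data
    // using only matrix-vector products, so no n x n matrix is ever built.
    def topComponent(centered: DenseMatrix[Double], iters: Int = 100): DenseVector[Double] = {
      var v = DenseVector.rand(centered.cols)
      for (_ <- 0 until iters) {
        // (X^T X) v computed as X^T (X v): O(n) extra memory
        val w = centered.t * (centered * v)
        v = w / norm(w)
      }
      v
    }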