Unsupervised Learning with Apache Spark
● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, committing on Apache Hadoop
● Before that, studied combinatorial optimization and distributed systems at Brown
● How many kinds of stuff are there?
● Why is some stuff not like the others?
● How do I contextualize new stuff?
● Is there a simpler way to represent this stuff?
● Learn hidden structure of your data
● Interpret new data as it relates to this structure
● Clustering
○ Partition data into categories
● Dimensionality reduction
○ Find a condensed representation of your data
● Designing a system for processing huge data in parallel
● Taking advantage of it with algorithms that work well in parallel
[Diagram: bigfile.txt on HDFS is loaded into the lines RDD and mapped to the numbers RDD, each split into partitions; the driver collects the final sum]

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()
[Diagram: the same pipeline, with the numbers RDD marked for caching so its partitions are kept in memory]

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache()
  .sum()
[Diagram: on a subsequent action, the sum is computed from the cached numbers partitions without rereading bigfile.txt from HDFS]
Supervised / Discrete: Classification
● Logistic regression (and regularized variants)
● Linear SVM
● Naive Bayes
● Random Decision Forests (soon)

Supervised / Continuous: Regression
● Linear regression (and regularized variants)

Unsupervised / Discrete: Clustering
● K-means

Unsupervised / Continuous: Dimensionality reduction, matrix factorization
● Principal component analysis / singular value decomposition
● Alternating least squares
● Anomalies as data points far away from any cluster
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble)))

// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)
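The anomaly-detection idea above can be shown with a minimal local sketch (plain Scala rather than the MLlib API): score each point by its squared distance to the nearest center, and treat the largest scores as anomaly candidates. The helper names and the threshold are made up for illustration; in practice the threshold is problem-specific.

```scala
// Squared Euclidean distance between two points
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// A point's anomaly score is its squared distance to the closest center
def anomalyScore(point: Array[Double], centers: Seq[Array[Double]]): Double =
  centers.map(c => sqDist(point, c)).min

// Flag points whose score exceeds a (problem-specific) threshold
def anomalies(points: Seq[Array[Double]],
              centers: Seq[Array[Double]],
              threshold: Double): Seq[Array[Double]] =
  points.filter(p => anomalyScore(p, centers) > threshold)
```

On an RDD the same scoring would be a `map` over the points using the trained model's centers.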
● Alternate between two steps:
○ Assign each point to a cluster based on existing centers
○ Recompute cluster centers from the points in each cluster
● Alternate between two steps:
○ Assign each point to a cluster based on existing centers
■ Process each data point independently
○ Recompute cluster centers from the points in each cluster
■ Average across partitions
// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)
  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }
  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()
// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value
● K-means is very sensitive to the initial set of centers chosen.
● The best existing algorithm for choosing centers is highly sequential.
● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its squared distance from the closest already-chosen center
● Repeat until all initial centers are chosen
● The initial clustering has expected cost within a factor of O(log k) of the optimal cost
● Requires k passes over the data
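The steps above can be sketched in local, non-distributed Scala (the function and helper names here are hypothetical, not MLlib API):

```scala
import scala.util.Random

// Squared Euclidean distance between two points
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// k-means++ seeding: the first center is uniform at random; each
// subsequent center is sampled with probability proportional to its
// squared distance from the closest center chosen so far.
def kMeansPlusPlusInit(points: IndexedSeq[Array[Double]],
                       k: Int, rng: Random): Seq[Array[Double]] = {
  val centers = scala.collection.mutable.ArrayBuffer(points(rng.nextInt(points.length)))
  while (centers.length < k) {
    // One full pass over the data per chosen center: hence k passes
    val d2 = points.map(p => centers.map(c => sqDist(p, c)).min)
    var r = rng.nextDouble() * d2.sum
    var i = 0
    while (i < points.length - 1 && r > d2(i)) { r -= d2(i); i += 1 }
    centers += points(i)
  }
  centers.toSeq
}
```

The inner pass over all points for every new center is what makes the method sequential and expensive at scale.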
● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-Means++ on the sampled points to find initial centers
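One oversampling pass can be sketched as follows (a local illustration under assumed names, not the MLlib implementation; `l` is the oversampling factor). The key property is that each point decides independently whether to join the candidate set, so the pass parallelizes trivially across partitions.

```scala
// Squared Euclidean distance between two points
def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// One oversampling pass: keep each point independently with
// probability l * d^2 / phi, where d is its distance to the closest
// current center and phi is the total cost over all points.
def oversamplePass(points: Seq[Array[Double]], centers: Seq[Array[Double]],
                   l: Double, rng: scala.util.Random): Seq[Array[Double]] = {
  val d2 = points.map(p => centers.map(c => sqDist(p, c)).min)
  val phi = d2.sum
  points.zip(d2).collect { case (p, d) if rng.nextDouble() < l * d / phi => p }
}
```

A few such passes gather roughly l points each; K-Means++ is then run on this small candidate set to pick the final k initial centers.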
● Select a basis for your data that
○ Is orthonormal
○ Maximizes variance along its axes
● Find dominant trends
● Find a lower-dimensional representation that lets you visualize the data
● Feature learning - find a representation that’s good for clustering or classification
● Latent Semantic Analysis
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val data: RDD[Vector] = ...
val mat = new RowMatrix(data)

// Compute the top 5 principal components
val principalComponents = mat.computePrincipalComponents(5)

// Project the data into the subspace spanned by those components
val transformed = mat.multiply(principalComponents)
● Center the data
● Find the covariance matrix
● Its eigenvectors are the principal components
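For intuition, the three steps can be carried out directly on 2-D data. This is an illustrative local sketch, not MLlib code; for a 2x2 covariance matrix the leading eigenvector has a closed form via the major-axis angle theta = atan2(2*cxy, cxx - cyy) / 2.

```scala
// Returns a unit vector along the first principal component of 2-D data
def principalComponent2D(data: Seq[Array[Double]]): Array[Double] = {
  val m = data.length.toDouble
  // Step 1: center the data
  val meanX = data.map(_(0)).sum / m
  val meanY = data.map(_(1)).sum / m
  val centered = data.map(p => (p(0) - meanX, p(1) - meanY))
  // Step 2: form the covariance matrix [[cxx, cxy], [cxy, cyy]]
  val cxx = centered.map { case (x, _) => x * x }.sum / m
  val cxy = centered.map { case (x, y) => x * y }.sum / m
  val cyy = centered.map { case (_, y) => y * y }.sum / m
  // Step 3: the eigenvector of the largest eigenvalue is the major axis
  val theta = math.atan2(2 * cxy, cxx - cyy) / 2
  Array(math.cos(theta), math.sin(theta))
}
```

For points lying along the line y = x, this returns (1/sqrt(2), 1/sqrt(2)), the direction of maximum variance.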
[Diagram: the m x n data matrix, spread across partitions as blocks of rows, is reduced to an n x n covariance (Gram) matrix by summing each partition's local n x n contribution]
def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  val nt: Int = n * (n + 1) / 2

  // Compute the upper triangular part of the gram matrix.
  val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) => {
      RowMatrix.dspr(1.0, v, U.data)
      U
    },
    combOp = (U1, U2) => U1 += U2
  )

  RowMatrix.triuToFull(n, GU.data)
}
● n^2 must fit in memory
● Not yet implemented: an EM algorithm can do it with O(kn) memory, where k is the number of principal components
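The slide refers to an EM algorithm; as a related illustration of the memory-saving idea, the leading principal component can be found by power iteration over the rows, which touches only O(n) memory per pass and never materializes the n x n covariance matrix. A minimal local sketch, assuming the data is already centered (the function name is made up for illustration):

```scala
// Power iteration on the implicit covariance: each pass computes
// w = sum_i x_i * (x_i . v), i.e. (X^T X) v, row by row, then
// normalizes. Only the n-dimensional vectors v and w are stored.
def leadingComponent(data: Seq[Array[Double]], iters: Int = 100): Array[Double] = {
  val n = data.head.length
  var v = Array.tabulate(n)(i => if (i == 0) 1.0 else 0.0)
  for (_ <- 0 until iters) {
    val w = new Array[Double](n)
    for (x <- data) {
      val proj = x.zip(v).map { case (a, b) => a * b }.sum // x . v
      for (j <- 0 until n) w(j) += x(j) * proj             // w += x * (x . v)
    }
    val norm = math.sqrt(w.map(a => a * a).sum)
    for (j <- 0 until n) v(j) = w(j) / norm
  }
  v
}
```

Each pass fits naturally into a distributed aggregate over row partitions, which is what makes the O(kn) approach attractive when n^2 is too large for memory.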