Unsupervised Learning with Apache Spark
DESCRIPTION

Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Clustering algorithms, for example, try to partition elements of a dataset into related groups. Dimensionality reduction algorithms search for a simpler representation of a dataset. Spark's MLlib module contains implementations of several unsupervised learning algorithms that scale to huge datasets. In this talk, we'll dive into uses and implementations of Spark's K-means clustering and Singular Value Decomposition (SVD).

Bio: Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and recently led Cloudera's Apache Spark development.

TRANSCRIPT


● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, committing on Apache Hadoop
● Before that, studying combinatorial optimization and distributed systems at Brown












● How many kinds of stuff are there?
● Why is some stuff not like the others?
● How do I contextualize new stuff?
● Is there a simpler way to represent this stuff?

● Learn hidden structure of your data
● Interpret new data as it relates to this structure

● Clustering
○ Partition data into categories
● Dimensionality reduction
○ Find a condensed representation of your data

● Designing a system for processing huge data in parallel
● Taking advantage of it with algorithms that work well in parallel




[Diagram: bigfile.txt on HDFS is read into a lines RDD split across partitions; mapping produces a numbers RDD, and the sum is returned to the driver]

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()


[Diagram: the same pipeline, but the numbers RDD is cached before summing]

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.cache()
  .sum()

[Diagram: a later action reuses the cached numbers partitions; the sum is computed from memory and returned to the driver, without re-reading bigfile.txt]
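The difference between the two runs is the cache() call: once the first action materializes the numbers RDD, its partitions stay in executor memory, so later actions are served from the cache instead of re-reading bigfile.txt from HDFS. A minimal sketch of the pattern (the second action and the derived mean are illustrative, not from the slides):

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map(_.toDouble)
numbers.cache()              // mark the RDD to be kept in memory once computed
val total = numbers.sum()    // first action: reads from HDFS and fills the cache
val count = numbers.count()  // later action: served from the cached partitions
val mean = total / count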



Discrete vs. continuous, supervised vs. unsupervised:
● Supervised, discrete: Classification (logistic regression and regularized variants, linear SVM, Naive Bayes, random decision forests (soon))
● Supervised, continuous: Regression (linear regression and regularized variants)
● Unsupervised, discrete: Clustering (K-means)
● Unsupervised, continuous: Dimensionality reduction / matrix factorization (principal component analysis / singular value decomposition, alternating least squares)




● Anomalies as data points far away from any cluster






import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)
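A hedged follow-up, assuming the Spark 1.x MLlib API: the returned KMeansModel can score the clustering and assign new points, which also gives a simple way to flag anomalies as points far from every center (the threshold and helper below are hypothetical, not part of MLlib):

// Within-set sum of squared errors: a rough measure of clustering quality
val wssse = clusters.computeCost(parsedData)

// Assign a point to its nearest cluster
val clusterId = clusters.predict(parsedData.first())

// Flag anomalies: points whose squared distance to their center exceeds a cutoff
def sqDistArr(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

val threshold = 9.0  // hypothetical cutoff
val anomalies = parsedData.filter { p =>
  sqDistArr(p.toArray, clusters.clusterCenters(clusters.predict(p)).toArray) > threshold
}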

● Alternate between two steps (a minimal local sketch follows below):
○ Assign each point to a cluster based on existing centers
○ Recompute cluster centers from the points in each cluster
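One iteration of those two steps on a single machine, as a plain-Scala sketch (data, points, and helper names are illustrative; the distributed MLlib version comes next):

def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def lloydStep(points: Seq[Array[Double]],
              centers: Seq[Array[Double]]): Seq[Array[Double]] = {
  // Step 1: assign each point to its closest existing center
  val assignments =
    points.groupBy(p => centers.indices.minBy(i => squaredDistance(p, centers(i))))
  // Step 2: recompute each center as the mean of the points assigned to it
  centers.indices.map { i =>
    assignments.get(i) match {
      case Some(ps) =>
        Array.tabulate(centers(i).length)(d => ps.map(_(d)).sum / ps.size)
      case None => centers(i) // a center with no assigned points stays where it is
    }
  }
}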






● Alternate between two steps:
○ Assign each point to a cluster based on existing centers
■ Process each data point independently
○ Recompute cluster centers from the points in each cluster (see the MLlib excerpt below)
■ Average across partitions
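The excerpt below is from MLlib's KMeans implementation: within each partition it finds the closest center for every point and accumulates per-center sums and counts (BDV and BV are the Breeze dense and generic vector types MLlib uses internally; costAccum is a Spark accumulator tracking the total cost), then merges the per-partition contributions with reduceByKey.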

// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length

  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)

  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }

  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()

// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value


● K-means is very sensitive to the initial set of center points chosen.
● The best existing algorithm for choosing initial centers is highly sequential.


● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen
● Repeat until all k initial centers are chosen (a minimal sketch of this sampling follows below)
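A minimal local sketch of this D²-weighted sampling (plain Scala on in-memory data, with hypothetical names; the real MLlib code works on RDDs):

import scala.util.Random

def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Choose k initial centers by D^2-weighted sampling (k-means++ seeding)
def kMeansPlusPlusInit(points: IndexedSeq[Array[Double]],
                       k: Int,
                       rng: Random): IndexedSeq[Array[Double]] = {
  // First center: a uniformly random point
  var centers = IndexedSeq(points(rng.nextInt(points.size)))
  while (centers.size < k) {
    // Weight each point by its squared distance to the closest chosen center
    val weights = points.map(p => centers.map(c => sqDist(p, c)).min)
    // Sample the next center with probability proportional to its weight
    var r = rng.nextDouble() * weights.sum
    var i = 0
    while (i < points.size - 1 && r >= weights(i)) { r -= weights(i); i += 1 }
    centers = centers :+ points(i)
  }
  centers
}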

● The resulting initial clustering has expected cost within an O(log k) factor of the optimal cost

● Requires k passes over the data

● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-means++ on the sampled points to find the initial centers (see the usage sketch below)
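MLlib exposes this scheme as the "k-means||" initialization mode. A hedged sketch of selecting it explicitly, assuming the Spark 1.x KMeans builder API and reusing the parsedData RDD[Vector] from the earlier example:

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(2)
  .setMaxIterations(20)
  .setInitializationMode(KMeans.K_MEANS_PARALLEL) // the "k-means||" scheme described above
  .setInitializationSteps(5)                      // number of oversampling passes
  .run(parsedData)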










Discrete vs. continuous, supervised vs. unsupervised:
● Supervised, discrete: Classification (logistic regression and regularized variants, linear SVM, Naive Bayes, random decision forests (soon))
● Supervised, continuous: Regression (linear regression and regularized variants)
● Unsupervised, discrete: Clustering (K-means)
● Unsupervised, continuous: Dimensionality reduction / matrix factorization (principal component analysis / singular value decomposition, alternating least squares)

● Select a basis for your data that
○ Is orthonormal
○ Maximizes variance along its axes


● Find dominant trends

● Find a lower-dimensional representation that lets you visualize the data
● Feature learning - find a representation that’s good for clustering or classification
● Latent Semantic Analysis

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val data: RDD[Vector] = ...
val mat = new RowMatrix(data)
// Compute the top 5 principal components
val principalComponents = mat.computePrincipalComponents(5)
// Project the rows into the principal component subspace
val projected = mat.multiply(principalComponents)
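The same RowMatrix also exposes the singular value decomposition mentioned in the talk description; a hedged sketch, assuming the Spark 1.x MLlib API and the mat defined above:

// Compute the top 5 singular values and vectors; computeU = true also materializes U
val svd = mat.computeSVD(5, computeU = true)
val U = svd.U // distributed RowMatrix of left singular vectors
val s = svd.s // top singular values (local vector)
val V = svd.V // local matrix whose columns are the right singular vectors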

● Center data
● Find covariance matrix
● Its eigenvectors are the principal components (a toy local sketch follows below)
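A toy local illustration of those three steps using Breeze (the data values are hypothetical; this is not the distributed MLlib code, which is sketched next):

import breeze.linalg.{DenseMatrix, DenseVector, eigSym}

// Small m x n data matrix (hypothetical values)
val X = DenseMatrix((1.0, 2.0), (3.0, 4.0), (5.0, 7.0))

// Center the data: subtract each column's mean
val mean = DenseVector.tabulate(X.cols)(j => (0 until X.rows).map(i => X(i, j)).sum / X.rows)
val Xc = DenseMatrix.tabulate(X.rows, X.cols)((i, j) => X(i, j) - mean(j))

// Covariance matrix and its eigendecomposition
val cov = (Xc.t * Xc) * (1.0 / (X.rows - 1))
val es = eigSym(cov)
val eigenvalues = es.eigenvalues   // variance captured along each axis
val eigenvectors = es.eigenvectors // columns with the largest eigenvalues are the top principal components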

[Diagrams: an m × n data matrix yields an n × n covariance (Gramian) matrix; the data's rows are split across partitions, each partition computes an n × n contribution, and the per-partition n × n matrices are summed into a single n × n result]
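The excerpt below is from MLlib's RowMatrix: it aggregates the upper-triangular half of the Gramian matrix A^T A, applying a BLAS-style rank-1 update (dspr) for each row within a partition, summing the per-partition packed triangles, and finally expanding the packed triangle into a full matrix.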

def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  val nt: Int = n * (n + 1) / 2

  // Compute the upper triangular part of the gram matrix.
  val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) => {
      RowMatrix.dspr(1.0, v, U.data)
      U
    },
    combOp = (U1, U2) => U1 += U2
  )

  RowMatrix.triuToFull(n, GU.data)
}


● n^2 must fit in memory
● Not yet implemented: an EM algorithm could compute the top components with O(kn) memory, where k is the number of principal components