Transcript
Page 1: Unsupervised Learning with Apache Spark
Page 2: Unsupervised Learning with Apache Spark

● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, committing on Apache Hadoop
● Before that, studying combinatorial optimization and distributed systems at Brown

Page 3: Unsupervised Learning with Apache Spark
Page 4: Unsupervised Learning with Apache Spark
Page 5: Unsupervised Learning with Apache Spark
Page 6: Unsupervised Learning with Apache Spark
Page 7: Unsupervised Learning with Apache Spark
Page 8: Unsupervised Learning with Apache Spark
Page 9: Unsupervised Learning with Apache Spark
Page 10: Unsupervised Learning with Apache Spark
Page 11: Unsupervised Learning with Apache Spark
Page 12: Unsupervised Learning with Apache Spark
Page 13: Unsupervised Learning with Apache Spark
Page 14: Unsupervised Learning with Apache Spark

● How many kinds of stuff are there?
● Why is some stuff not like the others?
● How do I contextualize new stuff?
● Is there a simpler way to represent this stuff?

Page 15: Unsupervised Learning with Apache Spark

● Learn hidden structure of your data
● Interpret new data as it relates to this structure

Page 16: Unsupervised Learning with Apache Spark

● Clustering
  ○ Partition data into categories
● Dimensionality reduction
  ○ Find a condensed representation of your data

Page 17: Unsupervised Learning with Apache Spark

● Designing a system for processing huge data in parallel

● Taking advantage of it with algorithms that work well in parallel

Page 18: Unsupervised Learning with Apache Spark
Page 19: Unsupervised Learning with Apache Spark
Page 20: Unsupervised Learning with Apache Spark
Page 21: Unsupervised Learning with Apache Spark

[Diagram: bigfile.txt stored on HDFS is read into a partitioned lines RDD, mapped to a numbers RDD, and the sum is returned to the Driver]

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()

Page 22: Unsupervised Learning with Apache Spark
Page 23: Unsupervised Learning with Apache Spark

[Diagram: the same pipeline, now caching the numbers RDD partitions in memory across the cluster]

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache().sum()

Page 24: Unsupervised Learning with Apache Spark

[Diagram: lines and numbers RDD partitions, with the sum collected at the Driver]

Page 25: Unsupervised Learning with Apache Spark
Page 26: Unsupervised Learning with Apache Spark
Page 27: Unsupervised Learning with Apache Spark

               Discrete                             Continuous
Supervised     Classification                       Regression
               ● Logistic regression (and           ● Linear regression (and
                 regularized variants)                regularized variants)
               ● Linear SVM
               ● Naive Bayes
               ● Random Decision Forests (soon)
Unsupervised   Clustering                           Dimensionality reduction,
               ● K-means                            matrix factorization
                                                    ● Principal component analysis /
                                                      singular value decomposition
                                                    ● Alternating least squares

Page 28: Unsupervised Learning with Apache Spark

               Discrete                             Continuous
Supervised     Classification                       Regression
               ● Logistic regression (and           ● Linear regression (and
                 regularized variants)                regularized variants)
               ● Linear SVM
               ● Naive Bayes
               ● Random Decision Forests (soon)
Unsupervised   Clustering                           Dimensionality reduction,
               ● K-means                            matrix factorization
                                                    ● Principal component analysis /
                                                      singular value decomposition
                                                    ● Alternating least squares

Page 29: Unsupervised Learning with Apache Spark
Page 30: Unsupervised Learning with Apache Spark
Page 31: Unsupervised Learning with Apache Spark

● Anomalies as data points far away from any cluster

Page 32: Unsupervised Learning with Apache Spark
Page 33: Unsupervised Learning with Apache Spark
Page 34: Unsupervised Learning with Apache Spark
Page 35: Unsupervised Learning with Apache Spark
Page 36: Unsupervised Learning with Apache Spark
Page 37: Unsupervised Learning with Apache Spark

val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble))

// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)
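Not on the slide, but a typical next step with the resulting KMeansModel is to score the clustering and assign points to clusters (a sketch assuming the same-era MLlib API as the KMeans.train call above, where computeCost and predict operate on Array[Double] points):

// Within-set sum of squared errors: lower means a tighter clustering
val wssse = clusters.computeCost(parsedData)
println("WSSSE = " + wssse)

// Cluster index assigned to a single point
val firstAssignment = clusters.predict(parsedData.first())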

Page 38: Unsupervised Learning with Apache Spark

● Alternate between two steps:
  ○ Assign each point to a cluster based on existing centers
  ○ Recompute cluster centers from the points in each cluster
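A minimal local sketch of one such iteration in plain Scala (an illustration, not the MLlib implementation; squaredDistance is a helper defined here):

def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def lloydIteration(points: Seq[Array[Double]],
                   centers: Array[Array[Double]]): Array[Array[Double]] = {
  // Step 1: assign each point to the closest existing center.
  val assigned =
    points.groupBy(p => centers.indices.minBy(i => squaredDistance(p, centers(i))))
  // Step 2: recompute each center as the mean of the points assigned to it.
  centers.indices.map { i =>
    assigned.get(i) match {
      case Some(ps) =>
        Array.tabulate(centers(i).length)(d => ps.map(_(d)).sum / ps.size)
      case None => centers(i)   // keep a center that attracted no points
    }
  }.toArray
}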

Page 39: Unsupervised Learning with Apache Spark
Page 40: Unsupervised Learning with Apache Spark
Page 41: Unsupervised Learning with Apache Spark
Page 42: Unsupervised Learning with Apache Spark
Page 43: Unsupervised Learning with Apache Spark
Page 44: Unsupervised Learning with Apache Spark

● Alternate between two steps:
  ○ Assign each point to a cluster based on existing centers
    ■ Process each data point independently
  ○ Recompute cluster centers from the points in each cluster
    ■ Average across partitions

Page 45: Unsupervised Learning with Apache Spark

// Find the sum and count of points mapping to each center.
// BDV / BV are the MLlib source's aliases for Breeze's DenseVector / Vector;
// costAccum is a Spark accumulator collecting the total clustering cost.
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length

  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)

  // Assignment step: each point contributes to the running sum and count of
  // its closest center, independently of every other point.
  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }

  // Emit one (centerIndex, (sum, count)) pair per center for this partition.
  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()  // mergeContribs adds (sum, count) pairs across partitions

Page 46: Unsupervised Learning with Apache Spark

// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble   // turn the per-cluster sum into a mean
    val newCenter = new BreezeVectorWithNorm(sum)
    // Convergence check: did this center move by more than epsilon?
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}
if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value

Page 47: Unsupervised Learning with Apache Spark
Page 48: Unsupervised Learning with Apache Spark

● K-Means is very sensitive to the initial set of centers chosen.
● The best existing algorithm for choosing centers is highly sequential.

Page 49: Unsupervised Learning with Apache Spark
Page 50: Unsupervised Learning with Apache Spark

● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its distance from the closest center already chosen
● Repeat until all initial centers are chosen
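A minimal local sketch of that seeding procedure in plain Scala (an illustration, not the MLlib implementation; the standard K-Means++ formulation weights points by squared distance):

import scala.util.Random

def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def kMeansPlusPlusInit(points: IndexedSeq[Array[Double]], k: Int,
                       rng: Random): IndexedSeq[Array[Double]] = {
  // Start with one point chosen uniformly at random.
  var centers = IndexedSeq(points(rng.nextInt(points.size)))
  while (centers.size < k) {
    // Weight each point by its squared distance to the closest center already chosen...
    val weights = points.map(p => centers.map(c => sqDist(p, c)).min)
    // ...and sample the next center with probability proportional to that weight.
    var r = rng.nextDouble() * weights.sum
    val i = weights.indexWhere { w => r -= w; r <= 0 }
    centers = centers :+ points(if (i >= 0) i else points.size - 1)
  }
  centers
}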

Page 51: Unsupervised Learning with Apache Spark

● The initial clustering has an expected cost within an O(log k) factor of the optimal cost

Page 52: Unsupervised Learning with Apache Spark

● Requires k passes over the data

Page 53: Unsupervised Learning with Apache Spark

● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-Means++ on the sampled points to find the initial centers
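A rough sketch of that oversampling idea on an RDD (a hypothetical helper under assumed parameters, not the actual MLlib initialization code):

import org.apache.spark.rdd.RDD
import scala.util.Random

def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def kMeansParallelInit(data: RDD[Array[Double]], k: Int,
                       passes: Int = 5, oversampling: Double = 2.0): Array[Array[Double]] = {
  var centers = data.takeSample(withReplacement = false, num = 1, seed = Random.nextLong())
  for (_ <- 1 to passes) {
    val current = centers
    // Cost of each point = squared distance to its closest center so far.
    val costs = data.map(p => current.map(c => sqDist(p, c)).min)
    val totalCost = costs.sum()
    // Keep each point with probability proportional to its cost, scaled so that
    // roughly oversampling * k new candidates are added per pass.
    val sampled = data.zip(costs).filter { case (_, cost) =>
      Random.nextDouble() < oversampling * k * cost / totalCost
    }.map(_._1).collect()
    centers ++= sampled
  }
  centers   // the small candidate set that K-Means++ is then run on locally
}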

Page 54: Unsupervised Learning with Apache Spark
Page 55: Unsupervised Learning with Apache Spark
Page 56: Unsupervised Learning with Apache Spark
Page 57: Unsupervised Learning with Apache Spark
Page 58: Unsupervised Learning with Apache Spark
Page 59: Unsupervised Learning with Apache Spark
Page 60: Unsupervised Learning with Apache Spark
Page 61: Unsupervised Learning with Apache Spark
Page 62: Unsupervised Learning with Apache Spark
Page 63: Unsupervised Learning with Apache Spark

               Discrete                             Continuous
Supervised     Classification                       Regression
               ● Logistic regression (and           ● Linear regression (and
                 regularized variants)                regularized variants)
               ● Linear SVM
               ● Naive Bayes
               ● Random Decision Forests (soon)
Unsupervised   Clustering                           Dimensionality reduction,
               ● K-means                            matrix factorization
                                                    ● Principal component analysis /
                                                      singular value decomposition
                                                    ● Alternating least squares

Page 64: Unsupervised Learning with Apache Spark

● Select a basis for your data that
  ○ Is orthonormal
  ○ Maximizes variance along its axes

Page 65: Unsupervised Learning with Apache Spark
Page 66: Unsupervised Learning with Apache Spark

● Find dominant trends

Page 67: Unsupervised Learning with Apache Spark

● Find a lower-dimensional representation that lets you visualize the data

● Feature learning - find a representation that’s good for clustering or classification

● Latent Semantic Analysis

Page 68: Unsupervised Learning with Apache Spark

val data: RDD[Vector] = ...
val mat = new RowMatrix(data)

// compute the top 5 principal components (an n x 5 local matrix)
val principalComponents = mat.computePrincipalComponents(5)

// project the data into the 5-dimensional subspace
val transformed = mat.multiply(principalComponents)

Page 69: Unsupervised Learning with Apache Spark

● Center the data
● Find the covariance matrix
● Its eigenvectors are the principal components
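A minimal single-machine sketch of those three steps using Breeze (an illustration of the math, not the distributed MLlib implementation):

import breeze.linalg._
import breeze.stats.mean

// data: m x n matrix, one observation per row; returns the top-k principal components
def localPca(data: DenseMatrix[Double], k: Int): DenseMatrix[Double] = {
  // 1. Center each column (feature) at zero.
  val colMeans = mean(data(::, *)).t
  val centered = data(*, ::) - colMeans

  // 2. n x n covariance matrix.
  val cov = (centered.t * centered) * (1.0 / (data.rows - 1))

  // 3. Eigenvectors of the covariance matrix are the principal components.
  //    eigSym returns eigenvalues in ascending order, so take the last k columns.
  val es = eigSym(cov)
  es.eigenvectors(::, (cov.cols - k) until cov.cols)
}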

Page 70: Unsupervised Learning with Apache Spark

[Diagram: an m × n Data matrix reduced to an n × n covariance matrix]

Page 71: Unsupervised Learning with Apache Spark

[Diagram: the m × n Data matrix split row-wise into partitions]

Page 72: Unsupervised Learning with Apache Spark

[Diagram: each Data partition produces its own n × n contribution]

Page 73: Unsupervised Learning with Apache Spark

[Diagram: the per-partition n × n contributions are combined into a single n × n matrix]

Page 74: Unsupervised Learning with Apache Spark


Page 75: Unsupervised Learning with Apache Spark


Page 76: Unsupervised Learning with Apache Spark


Page 77: Unsupervised Learning with Apache Spark

def computeGramianMatrix(): Matrix = {
  val n = numCols().toInt
  // Packed storage for the upper triangle: n * (n + 1) / 2 entries.
  val nt: Int = n * (n + 1) / 2

  // Compute the upper triangular part of the gram matrix by aggregating a
  // BLAS rank-1 update (dspr) of each row into a shared packed buffer.
  // BDV is the MLlib source's alias for Breeze's DenseVector.
  val GU = rows.aggregate(new BDV[Double](new Array[Double](nt)))(
    seqOp = (U, v) => {
      RowMatrix.dspr(1.0, v, U.data)
      U
    },
    combOp = (U1, U2) => U1 += U2   // partial triangles simply add together
  )

  // Expand the packed upper triangle into a full symmetric n x n matrix.
  RowMatrix.triuToFull(n, GU.data)
}

Page 78: Unsupervised Learning with Apache Spark


Page 79: Unsupervised Learning with Apache Spark

● n^2 must fit in memory

Page 80: Unsupervised Learning with Apache Spark

● n^2 must fit in memory
● Not yet implemented: an EM algorithm can do it with O(kn) memory, where k is the number of principal components
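For illustration, a minimal local sketch of one such EM procedure (a Roweis-style EM for PCA written with Breeze; this is an assumed formulation, not code from MLlib). It only ever stores the n × k basis W and the k-dimensional projections, never an n × n covariance matrix:

import breeze.linalg.{DenseMatrix, inv}

// Y: n x m matrix of centered data, one observation per column; returns an n x k basis
// whose columns approximately span the top-k principal subspace.
def emPca(Y: DenseMatrix[Double], k: Int, iterations: Int = 50): DenseMatrix[Double] = {
  var W = DenseMatrix.rand(Y.rows, k)
  for (_ <- 1 to iterations) {
    // E-step: project the data onto the current basis (k x m).
    val X = inv(W.t * W) * (W.t * Y)
    // M-step: re-fit the basis to those projections (n x k).
    W = (Y * X.t) * inv(X * X.t)
  }
  W
}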

