![Page 1: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/1.jpg)
Scalable and Flexible Machine Learning With Scala
Bay Area Scala Enthusiasts Meetup
March 11, 2013
LinkedIn
![Page 2: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/2.jpg)
Who are we?
@BigDataSc @ccsevers
![Page 3: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/3.jpg)
Stuff you will see today …
Different types of data scientists – Comparison of different approaches to develop machine learning flows
Code
The five tool tool – Why Scala (and its ecosystem) is the best tool to develop machine learning flows (Hint: MapReduce is functional)
Some more code
Machine Learning examples – Real life (well … almost) examples of different machine learning problems
Even more code
![Page 4: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/4.jpg)
“Good data scientists understand, in a deep way, that the heavy lifting of cleanup and preparation is not something that gets in the way of solving the problem – it is the problem!”
DJ Patil – Founding member of the LinkedIn data science team
![Page 5: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/5.jpg)
The data funnel
Real data is an awful, terrible mess
Cleaning often is a process of operating on data, excluding some data, bucketing data and calculating aggregates about the data
These blocks form the basis of most data flows
Generate: map, flatMap, for
Exclude: filter
Bucket: group, groupBy, groupWith
Aggregate: sum, reduce, foldLeft
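The four funnel stages can be sketched with plain Scala collections. This is only an illustration: the `Event` record, the comma-separated input format and the "buy" filter are assumptions made up for the example, not something from the talk.

```scala
object DataFunnel {
  // Hypothetical record type, used only for this illustration.
  final case class Event(user: String, action: String, amount: Double)

  // Parse one "user,action,amount" line; malformed lines yield None.
  def parse(line: String): Option[Event] =
    line.split(",").map(_.trim) match {
      case Array(user, action, amount) =>
        scala.util.Try(amount.toDouble).toOption.map(a => Event(user, action, a))
      case _ => None
    }

  // Generate: turn raw lines into records (flatMap drops the bad rows).
  def generate(lines: Seq[String]): Seq[Event] =
    lines.flatMap(l => parse(l))

  // Exclude: keep only the records we care about.
  def exclude(events: Seq[Event]): Seq[Event] =
    events.filter(_.action == "buy")

  // Bucket: group the records by key.
  def bucket(events: Seq[Event]): Map[String, Seq[Event]] =
    events.groupBy(_.user)

  // Aggregate: reduce every bucket to a single number.
  def aggregate(buckets: Map[String, Seq[Event]]): Map[String, Double] =
    buckets.map { case (user, es) => user -> es.map(_.amount).sum }

  def run(lines: Seq[String]): Map[String, Double] =
    aggregate(bucket(exclude(generate(lines))))
}
```

Each stage is an ordinary function, so the stages can be tested one at a time before they are chained.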
![Page 6: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/6.jpg)
There are many ways to develop data flows
![Page 7: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/7.jpg)
The mixer
![Page 8: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/8.jpg)
The Mixer Word Count
# wordcount.py
from org.apache.pig.scripting import *

@outputSchema("b: bag{ w: chararray}")
def tokenize(words):
    return words.split(" ")

script = """
A = load './input.txt';
B = foreach A generate flatten(tokenize((chararray)$0)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into './wordcount' using AvroStorage("schema");
"""
Pig.compile(script).bind().runSingle()
{"schema": {
"type": "record",
"name": "WordCount",
"fields": [
{
"name": "word",
"type": "string"
},
{
"name": "count",
"type": "int"
}]}}
![Page 9: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/9.jpg)
The Mixer Data Scientist
Too many occurrences of code inside strings
Three different languages inside a single file
User Defined Functions (UDFs) vs. Language Support
Not real Python, but Jython (which is missing some libraries)
This is just a simple word count!
![Page 10: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/10.jpg)
The Mixer Data Scientist
Pig is great at extract, transform, load (ETL) …
… as long as you want to use a function that is already part of the included library …
… or you get someone else to write it for you (hello, DataFu!)
Realistically you will need to maintain a Pig code base and a code base in some language which can run on the JVM
Pig Latin is a bit funky, missing a lot of core programming language features
Pig Latin is interpreted so you get (limited) type and syntax checking only at runtime
![Page 11: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/11.jpg)
The Expert
![Page 12: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/12.jpg)
The Expert Word Count
hadoop fs -get input.txt input.txt
cp /mnt/hadoop/input.txt ~/MyProjects/WordCount/input.txt
#!/usr/bin/perl
use strict;
use warnings;
my %count_of;
while (my $line = <>) { # read from file or STDIN
    foreach my $word (split /\s+/, $line) {
        $count_of{$word}++;
    }
}
print "All words and their counts: \n";
for my $word (sort keys %count_of) {
    print "'$word': $count_of{$word}\n";
}
__END__
![Page 13: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/13.jpg)
The Scalable Expert – Hadoop Streaming
Lets you use any language you want.
Same issues as Java MapReduce with regards to multiple passes, complicated joins, etc.
Always reading from stdin and writing to stdout.
Easy to test out on local data
– cat myfile.txt | mymapper.sh | sort | myreducer.sh
Actual data may not be as nice.
No type checking on input or output can lead to problems.
The main reason to do this is so you can use a nice interpreted language to do your processing.
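To make the stdin/stdout contract concrete, here is a minimal sketch of a streaming-style word count written as two functions over lines of text. The object and function names are made up for this example, and `runLocally` merely imitates the `cat | mapper | sort | reducer` pipeline in memory; it is not how Hadoop Streaming itself is invoked.

```scala
object StreamingWordCount {
  // Mapper: emit one "word<TAB>1" line per token.
  def mapper(lines: Iterator[String]): Iterator[String] =
    lines.flatMap(_.split("\\s+").iterator).filter(_.nonEmpty).map(w => s"$w\t1")

  // Reducer: input arrives sorted, so lines with the same key are adjacent.
  def reducer(sorted: Iterator[String]): Iterator[String] = {
    val in = sorted.buffered
    new Iterator[String] {
      def hasNext: Boolean = in.hasNext
      def next(): String = {
        val key = in.head.split("\t")(0)
        var total = 0L
        // Consume every adjacent line that carries the current key.
        while (in.hasNext && in.head.split("\t")(0) == key) {
          total += in.next().split("\t")(1).toLong
        }
        s"$key\t$total"
      }
    }
  }

  // In-memory stand-in for: cat myfile.txt | mapper | sort | reducer
  def runLocally(lines: Seq[String]): List[String] =
    reducer(mapper(lines.iterator).toList.sorted.iterator).toList
}
```

Everything crossing the mapper/reducer boundary is an untyped string, which is exactly where the type-checking problems mentioned above come from.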
![Page 14: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/14.jpg)
The craftsman
![Page 15: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/15.jpg)
The Craftsman Word Count
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
context.write(word, one);
}
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
context.write(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
}
}
![Page 16: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/16.jpg)
The Craftsman Data Scientist
If you like Java it works fine …
… until you want to do more than one pass, a complicated join or anything fancy.
Cascading solves many of these problems for you but it is still very verbose
![Page 17: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/17.jpg)
We need a better tool
A five tool tool!
http://en.wikipedia.org/wiki/Willie_Mays
http://en.wikipedia.org/wiki/Five-tool_player
![Page 18: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/18.jpg)
The Pragmatic Data Scientist
Agile – Iterates quickly
Productive – Uses the right tool for the right job
Correct – Tests as much as he can before the job is even submitted
Scalable – Can handle real world problems
Simple – Single language to represent Operations, UDFs and Data
![Page 19: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/19.jpg)
![Page 20: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/20.jpg)
Agility – Data is complex
![Page 21: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/21.jpg)
Agility – Try before you buy
scala> 1 to 10
res0: Range.Inclusive = Range(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> 1 until 10
res1: Range = Range(1, 2, 3, 4, 5, 6, 7, 8, 9)
scala> res0.slice(3, 5)
res3: scala.collection.immutable.IndexedSeq[Int] = Vector(4, 5)
scala> res0.groupBy(_ % 2)
res4: Map[Int, IndexedSeq[Int]] =
Map(1 -> Vector(1, 3, 5, 7, 9), 0 -> Vector(2, 4, 6, 8, 10))
![Page 22: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/22.jpg)
Productivity – Don't reinvent the wheel
![Page 23: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/23.jpg)
Productivity – Have the work done for you
Python Collections Operators: map, reduce, filter, sum, min/max
Scala Collections Operators: foreach, map, flatMap, collect, find, takeWhile, dropWhile, filter, withFilter, filterNot, splitAt, span, partition, groupBy, forall, exists, count, fold, reduce, sum, product, min/max
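A few of the less familiar operators from the Scala list, applied to one small list. The input values are arbitrary and only serve the illustration.

```scala
object CollectionOps {
  val xs: List[Int] = List(3, 8, 1, 9, 4, 7)

  // partition: split by a predicate, keeping the original order in both halves.
  val (small, large) = xs.partition(_ < 5)

  // span: longest prefix that satisfies the predicate, plus everything after it.
  val (prefix, rest) = xs.span(_ != 9)

  // takeWhile: only the prefix part of span.
  val leading: List[Int] = xs.takeWhile(_ < 9)

  // collect: filter and map in one pass through a partial function.
  val evensDoubled: List[Int] = xs.collect { case x if x % 2 == 0 => x * 2 }

  // find / forall / product: first match, universal check, numeric fold.
  val firstBig: Option[Int] = xs.find(_ > 8)
  val allPositive: Boolean = xs.forall(_ > 0)
  val product: Int = xs.product
}
```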
![Page 24: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/24.jpg)
Correctness – how to keep your sanity
![Page 25: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/25.jpg)
Scalability – works on more than your machine
Integrates with Hadoop (more than just streaming)
Has the support of scalable libraries
Parallel by design – not just for M/R flows
![Page 26: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/26.jpg)
Simplicity
Paco Nathan, Evil Mad Scientist, Concurrent Inc., @pacoid, says:
– “[Scalding] code is compact, simple to understand”
– “nearly 1:1 between elements of conceptual flow diagram and function calls”
– “Cascalog and Scalding DSLs leverage the functional aspects of MapReduce, helping to limit complexity in process”
Scala is a functional tool for a fundamentally functional job
![Page 27: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/27.jpg)
Let’s count some words
Hadoop basics
![Page 28: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/28.jpg)
Let’s count some words
This is the “Hello, World!” of anything tangentially related to Hadoop.
Let’s try it in Scala first without any Hadoop stuff.
val myLines : Seq[String] = ... // get some stuff
val myWords = myLines.flatMap(w => w.split("\\s+"))
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.mapValues(x=>x.size)
Now write out the words somehow
val countedWords = myLines.flatMap(_.split("\\s+"))
.groupBy(identity)
.mapValues(_.size)
![Page 29: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/29.jpg)
Let’s count a lot of words
I’ve gone to the trouble of rewriting this example to run in Hadoop. Here it is:
val myLines : TypedPipe[String] = TextLine(args("input"))
val myWords = myLines.flatMap(w => w.split("\\s+"))
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.mapValueStream(x => Iterator(x.size))
We can make this even better.
val countedWords = myWordsGrouped.size
countedWords.write(TypedTsv[(String,Long)](output))
![Page 30: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/30.jpg)
Something for nothing
Other people have already done the hard work to make the previous example run
The previous example is using Scalding, a Scala library to write (mainly) Hadoop MapReduce jobs.
https://github.com/twitter/scalding
It even has its own Twitter account, @scalding
Created by:
– Avi Bryant @avibryant
– Oscar Boykin @posco
– Argyris Zymnis @argyris
Tweet them now and tell them how awesome it is … I’ll wait
![Page 31: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/31.jpg)
Side by side comparison of local and Hadoop
Local (Scala collections):
val myWords = myLines.flatMap(w => w.split("\\s+"))
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.mapValues(x => x.size)
Hadoop (Scalding):
val myWords = myLines.flatMap(w => w.split("\\s+"))
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.size
There are some small differences, mainly due to how the underlying Hadoop process needs to happen.
![Page 32: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/32.jpg)
Why does this work?
Scala has support for embedded domain specific languages (DSLs)
Scalding includes a couple DSLs for specifying Cascading (and by extension Hadoop) workflows.
Info about Cascading: http://www.cascading.org/
One of the Scalding DSLs, the Typed one, is designed to be very close to the standard Scala collections API
It’s not a perfect mapping due to how Cascading and Hadoop work, but in general it is very easy to write your code locally, change a couple small bits, and have it run on a Hadoop cluster
Scalding also has a local mode if you want the syntactic sugar without fussing with Hadoop
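The mechanism behind such an embedded DSL can be sketched in a few lines: an implicit conversion wraps a low-level handle in a richer class, so collection-style methods appear to exist on the original type. `RawPipe`, `RichRawPipe` and the method names below are hypothetical stand-ins invented for this example; they are not Scalding or Cascading classes.

```scala
object MiniDsl {
  import scala.language.implicitConversions

  // Hypothetical low-level handle (plays the role a Cascading Pipe plays in Scalding).
  final case class RawPipe(records: List[String])

  // Wrapper that adds collection-style operators on top of the raw handle.
  final class RichRawPipe(val pipe: RawPipe) {
    def flatMapRecords(f: String => Iterable[String]): RawPipe =
      RawPipe(pipe.records.flatMap(f))
    def filterRecords(p: String => Boolean): RawPipe =
      RawPipe(pipe.records.filter(p))
  }

  // The implicit conversion is what makes the extra methods show up on RawPipe.
  implicit def toRich(pipe: RawPipe): RichRawPipe = new RichRawPipe(pipe)

  // Reads like collections code, although RawPipe itself defines none of these methods.
  def words(pipe: RawPipe): RawPipe =
    pipe.flatMapRecords(_.split("\\s+").toList).filterRecords(_.nonEmpty)
}
```

The compiler inserts `toRich(pipe)` wherever a method is missing on `RawPipe`, which is why user code stays close to the collections API.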
![Page 33: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/33.jpg)
DSLs for everyone!
We’re showing you Scalding in this talk, but there are others that are similar.
– Scoobi: https://github.com/NICTA/scoobi
– Scrunch: https://github.com/cloudera/crunch/tree/master/scrunch
All three attempt to make code written against Scala collections work (almost) seamlessly in Hadoop.
More on DSLs: http://www.scala-lang.org/node/1403
Some guts:
![Page 34: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/34.jpg)
Fields based DSL
From com.twitter.scalding.Dsl
/**
* This object has all the implicit functions and values that are used
* to make the scalding DSL.
*
* It's useful to import Dsl._ when you are writing scalding code outside
* of a Job.
*/
object Dsl extends FieldConversions with TupleConversions with GeneratedTupleAdders with java.io.Serializable {
  implicit def pipeToRichPipe(pipe : Pipe) : RichPipe = new RichPipe(pipe)
  implicit def richPipeToPipe(rp : RichPipe) : Pipe = rp.pipe
}
![Page 35: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/35.jpg)
Typed DSL
From com.twitter.scalding.TDsl
/** implicits for the type-safe DSL
* import TDsl._ to get the implicit conversions from Grouping/CoGrouping to Pipe,
* to get the .toTypedPipe method on standard cascading Pipes.
* to get automatic conversion of Mappable[T] to TypedPipe[T]
*/
object TDsl extends Serializable with GeneratedTupleAdders {
  implicit def pipeTExtensions(pipe : Pipe) : PipeTExtensions = new PipeTExtensions(pipe)
  implicit def mappableToTypedPipe[T](mappable : Mappable[T])
      (implicit flowDef : FlowDef, mode : Mode, conv : TupleConverter[T]) : TypedPipe[T] = {
    TypedPipe.from(mappable)(flowDef, mode, conv)
  }
}
![Page 36: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/36.jpg)
Algebird – It’s like algebra and a bird
We did something fancy in the previous example:
val countedWords = myWordsGrouped.size
val countedWords = myWordsGrouped.mapValues(x => 1L).sum
val countedWords = myWordsGrouped.mapValues(x => 1L).reduce(implicit mon: Monoid[Long])((l,r) => mon.plus(l,r))
Scalding uses Algebird extensively to make your life easier.
Algebird can also be used outside of Scalding with no trouble.
Algebird has your favorite things like monoids, monads, bloom filters, count-min sketches, hyperloglogs, etc.
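The monoid idea can be sketched without any library. The `Monoid` trait and the instances below are written from scratch for this example and are not Algebird's actual definitions; they only show why a generic `sum` can count words once each word is turned into a one-entry map.

```scala
object MonoidSketch {
  // A monoid: an associative `plus` together with an identity element `zero`.
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // Maps combine key by key, using the monoid of their values.
  implicit def mapMonoid[K, V](implicit v: Monoid[V]): Monoid[Map[K, V]] =
    new Monoid[Map[K, V]] {
      val zero: Map[K, V] = Map.empty
      def plus(l: Map[K, V], r: Map[K, V]): Map[K, V] =
        r.foldLeft(l) { case (acc, (k, x)) =>
          acc.updated(k, v.plus(acc.getOrElse(k, v.zero), x))
        }
    }

  // One generic sum works for every type that has a monoid.
  def sum[T](items: Iterable[T])(implicit m: Monoid[T]): T =
    items.foldLeft(m.zero)(m.plus)

  // Word count is then just "sum the one-entry maps".
  def countWords(words: Seq[String]): Map[String, Long] =
    sum(words.map(w => Map(w -> 1L)))
}
```

Because `plus` is associative, partial sums can be computed on different machines and merged in any grouping, which is what makes this shape a good fit for MapReduce.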
![Page 37: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/37.jpg)
Counting words with some extra information
Sometimes we want to know some information about the contexts that words occurred in. At eBay, this is often the category that a term appeared in.
Let’s count words and calculate the entropy of the category distribution for each word.
– If you’re unfamiliar with this type of entropy just think of it as a measure of how concentrated the distribution is.
– If you really like formulas it is: $H = -\sum_i p(x_i) \log p(x_i)$
http://en.wikipedia.org/wiki/Entropy_%28information_theory%29
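Before moving to Hadoop, the same calculation can be tried on plain Scala collections. This is a local sketch under assumptions: input rows are (line, category id) pairs, the natural logarithm is used, and the object and function names are invented for the example.

```scala
object CategoryEntropy {
  // Entropy (natural log) of a distribution given as raw counts.
  def entropy(counts: Iterable[Long]): Double = {
    val total = counts.sum.toDouble
    val weighted = counts.filter(_ > 0).map { c =>
      val p = c / total
      p * math.log(p)
    }
    -weighted.sum
  }

  // (line, category) rows -> word -> (total count, entropy over categories)
  def wordStats(rows: Seq[(String, Int)]): Map[String, (Long, Double)] =
    rows
      .flatMap { case (line, cat) =>
        line.split("\\s+").filter(_.nonEmpty).map(w => (w, cat)).toList
      }
      .groupBy(_._1)
      .map { case (word, pairs) =>
        // Per-category counts for this word.
        val dist = pairs.groupBy(_._2).map { case (cat, ps) => cat -> ps.size.toLong }
        word -> ((dist.values.sum, entropy(dist.values)))
      }
}
```

A word that appears in a single category gets entropy 0, and a word split evenly over two categories gets ln 2, roughly 0.69.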
![Page 38: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/38.jpg)
More code
case class MyAvroOutput(word: String, count: Long,
entropy: Double) extends AvroRecord
TypedTsv[(String,Int)]
.flatMap{case(line,cat) => line.split("\\s+").map(x =>
(x,Map(cat->1L)))}
.group
.sum
.map{ case(word, dist) =>
val total: Double = dist.values.sum
val entropy = (-1)*dist.values.map{ count =>
(count/total)*math.log(count/total)}.sum
MyAvroOutput(word,total.toLong,entropy)
}
.write(PackedAvroSource[MyAvroOutput](output))
Math is great
![Page 39: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/39.jpg)
The reason why you are here
Machine Learning Examples
![Page 40: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/40.jpg)
How much should we charge for Titanic insurance?
Classification case study
![Page 41: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/41.jpg)
Titanic II case study
We want to sell life insurance to passengers of Titanic II
All we have is data from Titanic I
We have to be able to explain why we charge the prices we do (damn regulators!)
http://commons.wikimedia.org
![Page 42: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/42.jpg)
Titanic I Data
Cabin class – e.g. 1st, 2nd, 3rd ..
Name – String
Age – Integer
Embark place – String
Destination – String
Room – Integer
Ticket – Integer
Gender – Male or Female
![Page 43: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/43.jpg)
Titanic Model
http://www.dtreg.com/
![Page 44: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/44.jpg)
Classifier code
object Titanic {
def main(args: Array[String]) = {
// parse data
val reader = new CSVReader(new FileReader(
"src/main/data/titanic.csv"))
val passengers = reader.readAll.tail.map(Passenger(_))
val instances = passengers.map(_.getInstance).toSet
// build tree
val treeBuilder = new TreeBuilder
val tree = treeBuilder.buildTree(instances)
// print tree
tree.dump(System.out)
}
}
![Page 45: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/45.jpg)
Titanic Model
http://www.dtreg.com/
![Page 46: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/46.jpg)
Let’s cluster some eBay keywords.
Clustering case study
![Page 47: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/47.jpg)
Motivation
eBay, like any large site, has a massive number of unique queries every day
Identifying groups of queries based on user behavior might help us to understand the individual queries better
For queries we are unsure of we can even try and match them into a cluster that contains queries we know a lot about.
We can use behavioral things like:
– number of searches
– number of clicks
– number of subsequent bids, buys
– number of exits
– etc
![Page 48: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/48.jpg)
Let’s use Mahout
Apache Mahout, http://mahout.apache.org/, @ApacheMahout, is a powerful machine learning and data mining library that works with Hadoop.
It has a ton of great stuff in it, but many of the drawbacks of using Java MapReduce apply.
It uses some proprietary data formats (is your data in VectorWritable SequenceFiles?)
Luckily for us, there are some nice things that work as standalone pieces.
Coming in release 0.8, there is an excellent single pass k-means clustering algorithm we can use.
![Page 49: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/49.jpg)
Let’s use Mahout, inside Scalding
lazy val clust = new StreamingKMeans(new FastProjectionSearch(new EuclideanDistanceMeasure,5,10),
args("sloppyclusters").toInt, (10e-6).asInstanceOf[Float])
var count = 0;
val sloppyClusters =
TextLine(args("input"))
.map{ str =>
val vec = str.split("\t").map(_.toDouble)
val cent = new Centroid(count, new DenseVector(vec))
count += 1
cent
}
.toPipe('centroids)
// This won't work with the current build, coming soon though
.unorderedFoldTo[StreamingKMeans,Centroid]('centroids->'clusters)(clust){(cl,cent) =>
cl.cluster(cent); cl}
.toTypedPipe[StreamingKMeans](Dsl.intFields(Seq(0)))
.flatMap(c => c.iterator.asScala.toIterable)
![Page 50: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/50.jpg)
Let’s use Mahout, inside Scalding
val finalClusters = sloppyClusters.groupAll
.mapValueStream{centList =>
lazy val bclusterer = new BallKMeans(new BruteSearch(
new EuclideanDistanceMeasure),
args("numclusters").toInt, 100)
bclusterer.cluster(centList.toList.asJava)
bclusterer.iterator.asScala
}
.values
![Page 51: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/51.jpg)
Results
These are primarily eBay head queries. Remember that the clustering algorithm knows nothing about the text in the query.
Sample groups:
– chanel, tory burch, diamond ring, kathy van zeeland handbags, ...
– ipad 4th generation, samsung galaxy s iii, iphone 4 s, nexus 4, ipad mini, ...
– kohls coupons, lowes coupons
– jcrew, cole haan, diesel, banana republic, gucci, burberry, brooks brothers, …
– ferrari, utility trailer, polaris ranger, porsche 911, dump truck, bmw m3, chainsaw, rv, chevelle, vw bus, dodge charger, ...
– paypal account, ebay.com, apple touch icon precomposed.png, paypal, undefined, ps3%2520games, michael%2520kors
![Page 52: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/52.jpg)
Clustering Takeaway
There are some excellent libraries that exist, and even fit the functional model
Scala and Scalding will help you work around the rough edges and integrate them into your data flow, rather than having to create new data flows
Being able to prototype locally and in the Scala REPL saves massive amounts of developer time
![Page 53: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/53.jpg)
Using LinkedIn endorsement data to rank Scala experts
Matrix API case study
![Page 54: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/54.jpg)
LinkedIn Endorsements
![Page 55: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/55.jpg)
Page Rank Algorithm
http://commons.wikimedia.org
![Page 56: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/56.jpg)
Prepare Data
def prepareData = {
  // read endorsements and transform to edges
  val ends = readFile[Endorsement]("endorsements")
    .filter(_.skill == "Scala")
    .map(e => (e.sender, e.recipient, 1))
    .write(TSV("edges"))
}
def getDominantEigenVector = { … } // outputs to "ranks" (memberId, rank)
def getMembers = {
// get Bay Area members
val members = readLatest[Member]("members")
.filter(_.getRegionCode == 84)
.groupBy(_.getMemberId.toLong)
// join ranks and members
readFile[Ranks]("ranks").withReducers(10).join(members).toTypedPipe
.map{ case (id, ((_, rank), m)) =>
(rank, m.getMemberId, m.getFirstName, m.getLastName, m.getHeadline) }
.groupAll.sortBy(_._1).reverse.values
.write(TextLine("talk/scalaRanks"))
}
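The body of `getDominantEigenVector` is elided on the slide, so the following is not the talk's implementation. It is a generic power-iteration PageRank on plain Scala collections, included only to show what the "ranks" output represents. The damping factor of 0.85, the fixed iteration count and the even redistribution of rank from nodes without outgoing edges are assumptions of this sketch.

```scala
object PageRankSketch {
  // edges: (sender, recipient) pairs, read as "sender endorses recipient".
  def pageRank(edges: Seq[(Int, Int)],
               damping: Double = 0.85,
               iterations: Int = 50): Map[Int, Double] = {
    val nodes = edges.flatMap { case (s, r) => Seq(s, r) }.distinct
    val n = nodes.size
    val outLinks: Map[Int, Seq[Int]] =
      edges.groupBy(_._1).map { case (s, es) => s -> es.map(_._2) }
    val start: Map[Int, Double] = nodes.map(v => v -> 1.0 / n).toMap

    (1 to iterations).foldLeft(start) { (ranks, _) =>
      // Rank held by nodes without outgoing edges is spread evenly over all nodes.
      val dangling = nodes.filterNot(outLinks.contains).map(ranks).sum
      // Every node passes its rank to the nodes it endorses, split evenly.
      val shares = for {
        (s, targets) <- outLinks.toSeq
        t <- targets
      } yield t -> ranks(s) / targets.size
      val received: Map[Int, Double] =
        shares.groupBy(_._1).map { case (t, ss) => t -> ss.map(_._2).sum }
      nodes.map { v =>
        v -> ((1 - damping) / n + damping * (received.getOrElse(v, 0.0) + dangling / n))
      }.toMap
    }
  }
}
```

The ranks sum to 1 after every iteration, and the vector they converge to is the dominant eigenvector of the damped transition matrix.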
![Page 57: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/57.jpg)
Matrix API
mat.mapValues( func ) : Matrix
mat.filterValues( func ) : Matrix
mat.getRow( ind ) : RowVector
mat.reduceRowVectors{ f } : RowVector
mat.sumRowVectors : RowVector
mat.mapRows{ func } : Matrix
mat.topRowElems( k ) : Matrix
mat.rowL1Normalize : Matrix
mat.rowL2Normalize : Matrix
rowMeanCentering : Matrix
rowSizeAveStdev : Matrix
matrix1 * matrix2 : Matrix
matrix / scalar(Scalar) : Matrix
elemWiseOp( mat2 ){ func }
mat1.hProd( matrix2 ) : Matrix
mat1.zip( mat2/r/c ) : Matrix
matrix.nonZerosWith( sclr )
matrix.trace : Scalar
matrix.sum : Scalar
matrix.transpose : Matrix
mat.diagonal : DiagonalMatrix
![Page 58: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/58.jpg)
Time for Results!
Endorsements Page Rank
![Page 59: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/59.jpg)
[Results list: ranks 1 through 10; the ranked entries did not survive extraction]
![Page 60: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/60.jpg)
[Results list: ranks 13, 28, 35, 38 and 48; the ranked entries did not survive extraction]
![Page 61: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/61.jpg)
Summary
Only one slide left!
![Page 62: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/62.jpg)
Stuff you have seen today …
There are many ways to develop machine learning programs; none of them is perfect
Scala, which reflects the 20 years of evolution since Java's invention, and Scalding, which does the same for vanilla MapReduce, are a much better alternative
Machine learning is fun and not necessarily complicated
![Page 63: Scalable and Flexible Machine Learning With Scala @ LinkedIn](https://reader033.vdocuments.pub/reader033/viewer/2022061223/54c67f254a79598d528b461f/html5/thumbnails/63.jpg)