Sparkling Random Ferns. From an academic paper to spark-packages.org. Piotr Jan Dendek, Mateusz Fedoryszak

Upload: spark-summit

Post on 12-Feb-2017

Page 1: Sparkling Random Ferns by  P Dendek and M Fedoryszak

Sparkling Random Ferns. From an academic paper to spark-packages.org

Piotr Jan Dendek, Mateusz Fedoryszak

The Agenda
1. How did it start?
2. What is the Random Ferns algorithm?
3. How did implementation, evaluation and publishing go?

Motivations
• Random Ferns is a popular classification algorithm in the image processing field
• Our colleague Miron Kursa implemented this algorithm as part of his research [1] and published it as an R package called rFerns
• We decided to empower the Spark community with this method by making it available as a Spark package

THE ALGORITHM

The Algorithm
• Random Ferns
  – An example of supervised learning
  – Solves classification problems
  – A kind of ensemble algorithm

Posterior Probability
• Hypothetically we can learn conditional probabilities:
• Where the classifier is described as
• Not suitable: intractable and memory consuming
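The formulas on this slide were images and did not survive in the transcript. A plausible reconstruction, assuming N binary features f_1, ..., f_N and classes C_k (the notation is borrowed from the classifier formula that appears later in the deck, so treat it as an assumption rather than the authors' exact wording):

```latex
P(C_k \mid f_1, f_2, \dots, f_N), \qquad
H(\mathbf{f}) \equiv \operatorname*{argmax}_{k} \; P(C_k \mid f_1, f_2, \dots, f_N)
```

With N binary features, a full table over all feature combinations has 2^N entries per class, which is why learning it directly is memory consuming.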

Naïve Bayes Classifier
• Naïve, as it ignores dependencies among features
• Often classifies quite successfully
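The slide's own formula is not preserved. For orientation, the textbook Naïve Bayes decision rule, written with the same symbols the deck uses later (N features, classes C_k), would be:

```latex
H(\mathbf{f}) \equiv \operatorname*{argmax}_{k} \; P(C_k) \prod_{n=1}^{N} P(f_n \mid C_k)
```

Each feature contributes its own factor, so only N small tables per class are needed instead of one table with 2^N entries.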

Randomness in classifiers
• Goals to reach:
  – Avoid overfitting
  – Build classifiers faster
• Ways to introduce randomness:
  – Item sampling with replacement
  – Feature sampling

Random Ferns
• Each classifier (fern) l in [1; L] has its own set of S features
• Assume that the classifiers are independent
• Then classify items
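The three steps above can be written out as follows. The last line matches the classifier formula shown later in the deck; the first two lines are a reconstruction under the assumption that fern l uses S features f_{l,1}, ..., f_{l,S}:

```latex
\begin{aligned}
F_l &= \{ f_{l,1}, f_{l,2}, \dots, f_{l,S} \}, \quad l = 1, \dots, L \\
P(f_1, \dots, f_N \mid C_k) &\approx \prod_{l=1}^{L} P(F_l \mid C_k) \\
H(\mathbf{f}) &\equiv \operatorname*{argmax}_{k} \; P(C_k) \prod_{l=1}^{L} P(F_l \mid C_k)
\end{aligned}
```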

Random Ferns
• Less-naïve Bayes
• From the Random Forests perspective:

[Figure: three ferns drawn as binary trees of depth 3 in which all nodes at the same depth test the same feature; the three ferns use features (A, B, C), (A, B, D) and (C, D, E)]

THE IMPLEMENTATION

Bagging

Number of times each item of the initial set is drawn for each fern:

          Item 1  Item 2  Item 3  Item 4  Item 5
Fern 1       2       0       2       1       0
Fern 2       1       1       0       1       2
Fern 3       1       1       1       1       1

Big data bagging
• How many times would a data point be sampled?
  – The count follows a Binomial distribution
  – With big data, the Binomial distribution tends to the Poisson distribution [2]
• Simulate sampling using the Poisson distribution
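A minimal sketch of Poisson-based bagging in plain Scala. It assumes each fern draws n items with replacement from n items, so each item's count is Binomial(n, 1/n), approximated by Poisson(1). The object and method names are hypothetical and are not taken from the sparkling-ferns code.

```scala
import scala.util.Random

object PoissonBagging {
  /** Draws one Poisson(lambda) variate with Knuth's multiplication method. */
  def poisson(lambda: Double, rng: Random): Int = {
    val limit = math.exp(-lambda)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  /** For every fern, how many times each item is used (0 means left out). */
  def bagWeights(numItems: Int, numFerns: Int, seed: Long): Vector[Vector[Int]] = {
    val rng = new Random(seed)
    Vector.fill(numFerns)(Vector.fill(numItems)(poisson(1.0, rng)))
  }
}
```

Because every item's weight is drawn independently, no global pass over the dataset is needed to decide which items a fern sees, which is what makes this convenient for distributed data.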

Binarisation

Note: each fern has its own binarisers

Categorical features
– Get a random subset of categories
– A given category either belongs to this set or not

Continuous features
– Get two random feature values from the training set
– Use their mean as the threshold
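The two binarisers described above could look roughly like this. It is a sketch with hypothetical names: encoding categories as integers 0 until numCategories and using a strict `>` comparison against the threshold are assumptions, not details confirmed by the slides.

```scala
import scala.util.Random

object Binarisers {
  /** Continuous feature: threshold at the mean of two random training values. */
  def continuous(trainingValues: IndexedSeq[Double], rng: Random): Double => Boolean = {
    val a = trainingValues(rng.nextInt(trainingValues.size))
    val b = trainingValues(rng.nextInt(trainingValues.size))
    val threshold = (a + b) / 2.0
    value => value > threshold
  }

  /** Categorical feature: membership in a random subset of the categories. */
  def categorical(numCategories: Int, rng: Random): Int => Boolean = {
    // Each category is kept with probability 1/2 (assumed sampling scheme).
    val subset = (0 until numCategories).filter(_ => rng.nextBoolean()).toSet
    category => subset.contains(category)
  }
}
```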

Binarisation: implementation

Categorical features
– Trivial, as we have user-supplied category info

Continuous features
– Assign every value a random float
– Reduce by taking the two values with the greatest floats assigned
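The random-key trick for continuous features can be sketched on an ordinary Scala collection. The names are hypothetical and the input is assumed to be non-empty; since `top2` only merges partial results, the same function could presumably be handed to a distributed reduce such as Spark's `RDD.reduce`.

```scala
import scala.util.Random

object TwoRandomValues {
  type Tagged = (Float, Double) // (random key, feature value)

  /** Merges two partial results, keeping the two entries with the greatest keys. */
  def top2(left: List[Tagged], right: List[Tagged]): List[Tagged] =
    (left ++ right).sortWith(_._1 > _._1).take(2)

  /** Picks two values at random via random keys and returns their mean. */
  def threshold(values: Seq[Double], rng: Random): Double = {
    val tagged = values.map(v => List((rng.nextFloat(), v)))
    val picked = tagged.reduce((a, b) => top2(a, b))
    picked.map(_._2).sum / picked.size
  }
}
```

Every value gets an independent random key, so the two entries with the greatest keys are a uniform random pair, and the reduction never needs to hold more than two candidates per partial result.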

Probabilities

$$H(\mathbf{f}) \equiv \operatorname*{argmax}_{k} P(C_k) \prod_{l=1}^{L} P(F_l \mid C_k)$$

What’s that?

$P(F_l \mid C_k)$
• $F_l$ is a combination of binary feature values used by fern $l$
• For a fern of height $S$ there are $2^S$ distinct values of $F_l$
• You may think of it as the fern mapping each object into one of $2^S$ buckets
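A small sketch of the bucket idea, assuming a fern of height S yields S binary values per object. The name `bucketIndex` is hypothetical; the function packs the bits into an integer in the range 0 until 2^S.

```scala
object FernBucket {
  /** Packs S binary feature values into a bucket index in [0, 2^S). */
  def bucketIndex(bits: Seq[Boolean]): Int =
    bits.foldLeft(0)((acc, bit) => (acc << 1) | (if (bit) 1 else 0))
}
```

For example, `FernBucket.bucketIndex(Seq(true, false, true))` evaluates to 5, the binary number 101.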

$P(F_l \mid C_k)$
• Probability of an object of class $C_k$ falling into bucket $F_l$
• Count of objects of class $C_k$ falling into bucket $F_l$ divided by the count of objects of class $C_k$

$$P(F_l \mid C_k) = \frac{|F_l \cap C_k|}{|C_k|}$$
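Once the per-fern tables are known, classification is an argmax over classes. The sketch below evaluates the product in log space to avoid underflow; the table layout and all names are assumptions made for illustration, and no smoothing of zero probabilities is included.

```scala
object FernClassifier {
  /**
   * H(f) = argmax_k P(C_k) * prod_l P(F_l | C_k), computed with logarithms.
   * priors(k) is P(C_k); tables(l)(bucket)(k) is P(F_l = bucket | C_k);
   * buckets(l) is the bucket the object falls into for fern l.
   */
  def classify(priors: Vector[Double],
               tables: Vector[Vector[Vector[Double]]],
               buckets: Vector[Int]): Int = {
    val scores = priors.indices.map { k =>
      val logLikelihood = tables.indices.map(l => math.log(tables(l)(buckets(l))(k))).sum
      math.log(priors(k)) + logLikelihood
    }
    // Index of the class with the highest score.
    scores.zipWithIndex.reduce((a, b) => if (b._1 > a._1) b else a)._2
  }
}
```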

Reduction
• The most important part of training is counting objects
• Sounds similar to… counting words!
• We have reduced classifier building to the best-known big data problem
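The word-count analogy can be sketched on local collections as follows. The key layout (fern id, bucket, class label) and the names are assumptions; on Spark the counting step would presumably be the usual `map` to `(key, 1)` followed by `reduceByKey(_ + _)`.

```scala
object FernCounting {
  // One training observation as seen by one fern: (fern id, bucket, class label).
  type Key = (Int, Int, Int)

  /** Word-count style reduction: number of objects per (fern, bucket, class). */
  def countKeys(observations: Seq[Key]): Map[Key, Int] =
    observations.groupBy(identity).map { case (key, group) => key -> group.size }

  /** P(F_l | C_k) = |F_l ∩ C_k| / |C_k| for every key that was observed. */
  def conditionalProbabilities(observations: Seq[Key],
                               classSizes: Map[Int, Int]): Map[Key, Double] =
    countKeys(observations).map { case (key @ (_, _, label), n) =>
      key -> n.toDouble / classSizes(label)
    }
}
```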

Memory

Q: How many probabilities do we need to compute?
A: About $2^S$ per fern for every class

That means a binary classifier of 100 20-feature ferns will weigh over 1.5GB
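The 1.5GB figure can be checked with a little arithmetic, assuming one probability is stored per (fern, bucket, class) and each probability takes 8 bytes (a double). Both are assumptions that happen to reproduce the number on the slide.

```scala
object FernMemory {
  /** Bytes needed to store one probability per (fern, bucket, class). */
  def modelBytes(numFerns: Int, fernHeight: Int, numClasses: Int, bytesPerValue: Int): Long =
    numFerns.toLong * (1L << fernHeight) * numClasses * bytesPerValue
}
```

`FernMemory.modelBytes(100, 20, 2, 8)` gives 1,677,721,600 bytes, that is about 1.56 GiB.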

THE EVALUATION

Accuracy et al.
• Evaluation on the Iris and Car datasets as an integration test
• Iris:
  – 10 ferns, 3 features per fern (out of 4)
  – Accuracy: 98%
• Car:
  – 20 ferns, 4 features per fern (out of 6)
  – Accuracy: 90%

Dataset
• Million Song Dataset – Year Prediction
  – Not quite a classification task, but big (0.5M items)
  – Task: given 90 real-valued features, predict the publication year (ranging from 1922 to 2011)
  – For the sake of demonstration, let’s just pretend it is a classification problem

Model Training Code

    val raw = sc.textFile(…)
    val lp = raw.map(parseIntoLabeledPoints(_))
    val data = splitIntoTrainTest(lp)
    val numFerns = 90
    val numFeatures = 10
    val model = FernForest.train(data.train, numFerns, numFeatures, Map.empty)
    // One Boolean per test item: true when the prediction matches the label
    val correct = data.test.map(lp => model.predict(lp.features) == lp.label)

Model Training Time
• Where:
  – is the number of features
  – is the number of items in a dataset

Model Training Time
• Training time is linear
  – in the number of features (unlike Random Forests)
  – in the number of samples

[Figure: estimated training time [min] (axis 25 to 50) against number of features (10 to 20)]
[Figure: training time [min] (axis 0 to 12) against the sampled fraction of the 0.5M-item dataset (0% to 60%)]

THE PACKAGE

Our toolbox

How can you help your users?
• Simplify discovery
  – Register at spark-packages.org
• Simplify utilisation
  – Publish artifacts to the Central Repository

spark-packages.org
• An index of packages for Apache Spark
• The Spark community keeps an eye on it
• An ideal place if you want to extend Spark
• You can register any GitHub-hosted Spark project

The Central Repository
• Apache Maven retrieves all components from the Central Repository by default
  – so does Apache Spark
  – and many other build systems
• Are your artifacts there yet?

Getting to the Central

Sonatype provides OSSRH
– free repository
– for open source software
– store snapshot artifacts
– promote releases to the Central Repository

Checklist:
1. Register [3] at Sonatype OSSRH
2. Generate a GPG key (if you don’t have one yet)
3. Alter [4] your build.sbt
4. Build and sign your artifacts
5. Stage [5] the release at OSSRH and promote it to the Central Repository
6. Voilà!
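Step 3 of the checklist might look roughly like the fragment below. It follows the publishing settings described in the sbt guide cited as [4]; the coordinates are taken from the `--packages` example in this deck, and the repository URLs are the OSSRH endpoints documented at the time, so check the current Sonatype documentation before relying on them. The authors' actual build.sbt is not shown in the slides.

```scala
// build.sbt (sketch, not the authors' actual file)
organization := "pl.edu.icm"
name := "sparkling-ferns"
version := "0.2.0"

// Publish a Maven-style POM, as the Central Repository expects.
publishMavenStyle := true

// Snapshots and releases go to different OSSRH repositories.
publishTo := {
  val nexus = "https://oss.sonatype.org/"
  if (isSnapshot.value) Some("snapshots" at nexus + "content/repositories/snapshots")
  else Some("releases" at nexus + "service/local/staging/deploy/maven2")
}
```

Signing (step 4) is usually handled by the sbt-pgp plugin, which adds a `publishSigned` task.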

Things are smooth now

$SPARK_HOME/bin/spark-shell \
  --packages pl.edu.icm:sparkling-ferns_2.10:0.2.0

THANK YOU! QUESTIONS?

http://spark-packages.org/package/CeON/sparkling-ferns

@pjden

@mfedoryszak

/piotrdendek

/mfedoryszak

Mateusz Fedoryszak
This may turn out to be one of the slides displayed the longest: when I have a spare moment I will add a GitHub link and our Twitter handles here :)
Piotr Dendek
Great point!
References
[1] "rFerns: An Implementation of the Random Ferns Method for General-Purpose Machine Learning", M. Kursa, DOI: 10.18637/jss.v061.i10
[2] "Proof that the Binomial Distribution tends to the Poisson Distribution", https://youtu.be/ceOwlHnVCqo
[3] "OSSRH Guide", Sonatype, http://central.sonatype.org/pages/ossrh-guide.html
[4] "Deploying to Sonatype", Sbt, http://www.scala-sbt.org/release/docs/Using-Sonatype.html
[5] "Releasing the Deployment", Sonatype, http://central.sonatype.org/pages/releasing-the-deployment.html