Sparkling Random Ferns. From an academic paper to spark-packages.org
Piotr Jan Dendek, Mateusz Fedoryszak
The Agenda
1. How it started
2. What is the Random Ferns algorithm?
3. How did the implementation, evaluation and publishing go?
Motivations
• Random Ferns is a popular classification algorithm in the image processing field
• Our colleague Miron Kursa implemented this algorithm as part of his research [1] and published it as an R package called rFerns
• We decided to empower the Spark community with this method by making it available as a Spark package
THE ALGORITHM
The Algorithm
• Random Ferns
– An example of supervised learning
– Solves classification problems
– A kind of ensemble algorithm
Posterior Probability
• Hypothetically we could learn the full conditional probabilities P(C_k | f_1, …, f_N)
• The classifier would then be described as H(f) ≡ argmax_k P(C_k | f_1, …, f_N)
• Not suitable: intractable and memory-consuming
Naïve Bayes Classifier
• Naïve because it ignores dependencies among features
• Often classifies quite successfully
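The naïve assumption can be written out explicitly. Using the classes C_k and features f_1, …, f_N from the surrounding slides, conditioning on the class lets the posterior factorise:

```latex
% Full posterior: hard to estimate directly.
P(C_k \mid f_1, \dots, f_N)

% Naive Bayes: assume the features are independent given the class.
H(\boldsymbol{f}) \equiv \operatorname*{arg\,max}_k \; P(C_k) \prod_{i=1}^{N} P(f_i \mid C_k)
```

This has the same shape as the fern classifier shown later in the talk, with single features f_i in place of feature groups F_l.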
Randomness in Classifiers
• Goals to reach:
– Avoid overfitting
– Build classifiers faster
• Ways to introduce randomness:
– Item sampling with replacement
– Feature sampling
Random Ferns
• Each of the L classifiers (ferns) has its own set of S features
• Assume that the classifiers are independent of one another
• Then classify items by combining their votes: H(f) ≡ argmax_k P(C_k) ∏_{l=1}^{L} P(F_l | C_k)
Random Ferns
• Less-naïve Bayes
• From the Random Forests perspective: a fern is a decision tree that applies the same test at every node of a given level

[Diagram: three example ferns drawn as depth-3 trees; within each tree, all nodes at the same depth share one test (e.g. A at the root, B at the second level, C at the third).]
THE IMPLEMENTATION
Bagging

[Diagram: from the initial set, each of Fern 1, Fern 2 and Fern 3 draws its own bootstrap sample. The rows of counts (e.g. 2 0 2 1 0) show how many times each item of the initial set was drawn for each fern.]
Big Data Bagging
• How many times would a data point be sampled?
– The count follows a Binomial distribution
– For big data, the Binomial distribution tends to the Poisson distribution [2]
• So: simulate sampling by drawing each item's count from a Poisson distribution
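The Poisson trick can be sketched in plain Scala without Spark. The `poisson` helper below is illustrative (Knuth's sampling algorithm), not code from the package; the idea is that each item's bootstrap count for a fern is drawn from Poisson(1) instead of sampling n items with replacement.

```scala
import scala.util.Random

// Knuth's algorithm for drawing a Poisson(lambda)-distributed count.
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0
  var p = rng.nextDouble()
  while (p > limit) {
    k += 1
    p *= rng.nextDouble()
  }
  k
}

val rng = new Random(42)

// Bootstrap weight per item: how many times it enters one fern's sample.
val items = Seq("a", "b", "c", "d", "e")
val weights = items.map(item => (item, poisson(1.0, rng)))
```

On an RDD the same draw runs inside a `map`, so no shuffle is needed to build each fern's sample.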
Binarisation
Note: each fern has its own binarisers.

Categorical features:
— Get a random subset of categories
— A given category either belongs to this set or not

Continuous features:
— Get two random feature values from the training set
— Use their mean as the threshold
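The two binarisation rules can be sketched as below; the helper names (`continuousBinariser`, `categoricalBinariser`) are hypothetical, not the package's actual API.

```scala
import scala.util.Random

val rng = new Random(13)

// Continuous feature: pick two random training values, threshold at their mean.
def continuousBinariser(training: IndexedSeq[Double], rng: Random): Double => Boolean = {
  val a = training(rng.nextInt(training.size))
  val b = training(rng.nextInt(training.size))
  val threshold = (a + b) / 2.0
  (x: Double) => x > threshold
}

// Categorical feature: membership in a random subset of the known categories.
def categoricalBinariser(categories: Seq[String], rng: Random): String => Boolean = {
  val subset = categories.filter(_ => rng.nextBoolean()).toSet
  (c: String) => subset.contains(c)
}

val contBin = continuousBinariser(IndexedSeq(1.0, 2.0, 3.0, 4.0), rng)
val catBin  = categoricalBinariser(Seq("red", "green", "blue"), rng)
```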
Binarisation — implementation

Categorical features:
— Trivial, as we have user-supplied category info

Continuous features:
— Assign every value a random float
— Reduce by keeping the two values with the greatest floats assigned
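The "greatest random floats" trick can be shown locally; a local `reduce` stands in for an RDD reduce, and all names are illustrative. Tagging each value with a random float and keeping the two largest tags picks two values uniformly at random in a single distributed pass.

```scala
import scala.util.Random

val rng = new Random(7)
val values = Seq(3.2, 1.5, 4.8, 2.9, 0.7, 5.1)

// Tag every value with a random float, then keep the two largest tags in one
// reduce; in Spark the same map/reduce pair runs across partitions.
val topTwo = values
  .map(v => List((rng.nextDouble(), v)))
  .reduce((a, b) => (a ++ b).sortBy(-_._1).take(2))

val sampled   = topTwo.map(_._2)
val threshold = sampled.sum / 2.0  // mean of the two sampled values
```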
Probabilities

H(f) ≡ argmax_k P(C_k) ∏_{l=1}^{L} P(F_l | C_k)

What is P(F_l | C_k)?
P(F_l | C_k)
• F_l is the combination of binary feature values used by fern l
• For a fern of height S there are 2^S distinct values of F_l
• You may think of it as fern l mapping each object into one of 2^S buckets
P(F_l | C_k)
• The probability of an object of class C_k falling into bucket F_l
• The count of objects of class C_k falling into bucket F_l, divided by the count of objects of class C_k:

P(F_l | C_k) = |F_l ∩ C_k| / |C_k|
Reduction
• The most important part of training is counting objects
• Sounds similar to… counting words!
• We have reduced classifier building to the best-known big data problem
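A local sketch of the word-count analogy, with an illustrative `Key` type standing in for a (fern, bucket, class) triple; in Spark the same counting would be a `map` to `(key, 1L)` followed by `reduceByKey(_ + _)`.

```scala
// An object's training contribution is just a key to count.
case class Key(fern: Int, bucket: Int, label: Double)

val observations = Seq(
  Key(0, 3, 1.0), Key(0, 3, 1.0), Key(0, 1, 0.0), Key(1, 2, 1.0)
)

// Word count in disguise: group identical keys and count them.
val counts: Map[Key, Int] =
  observations.groupBy(identity).map { case (k, group) => (k, group.size) }
```

Dividing these counts by the per-class totals yields exactly the P(F_l | C_k) = |F_l ∩ C_k| / |C_k| estimates from the previous slide.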
Memory
Q: How many probabilities do we need to compute?
A: About 2^S · K per fern, where S is the fern height and K is the number of classes
That means a binary classifier of 100 20-feature ferns will weigh over 1.5 GB
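The 1.5 GB figure checks out on the back of an envelope, assuming one 8-byte double per stored probability:

```scala
// 100 ferns of height 20, binary classification, 8-byte doubles.
val buckets = math.pow(2, 20).toLong  // 2^S buckets for fern height S = 20
val classes = 2L                      // binary classifier
val ferns   = 100L
val bytes   = buckets * classes * ferns * 8L
val gigabytes = bytes.toDouble / (1024L * 1024L * 1024L)  // ≈ 1.56 GB
```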
THE EVALUATION
Accuracy et al.
• Evaluation on the Iris and Car datasets as an integration test
• Iris:
– 10 ferns, 3 features per fern (out of 4)
– Accuracy: 98%
• Car:
– 20 ferns, 4 features per fern (out of 6)
– Accuracy: 90%
Dataset
• Million Song Dataset – Year Prediction
– Not quite a classification task, but big (0.5M items)
– Task: given 90 real-number features, predict the publication year (ranging from 1922 to 2011)
– For the sake of demonstration, let's just pretend it is a classification problem
Model Training Code

val raw = sc.textFile(…)
val lp = raw.map(parseIntoLabeledPoints(_))
val data = splitIntoTrainTest(lp)
val numFerns = 90
val numFeatures = 10
val model = FernForest.train(data.train, numFerns, numFeatures, Map.empty)
val correct = data.test.map(lp => model.predict(lp.features) == lp.label)
Model Training Time
• The dominant factors are:
– the number of features
– the number of items in the dataset
Model Training Time

[Plot: estimated training time [min] (y-axis, 25–50) against the number of features (x-axis, 10–20).]

• Training time is linear
– in the number of features (unlike Random Forests)
– in the number of samples
[Plot: training time [min] (y-axis, 0–12) against sample size (x-axis, 0%–60% of the 0.5M-item dataset).]
THE PACKAGE
Our toolbox
How can you help your users?
• Simplify discovery
– Register at spark-packages.org
• Simplify utilisation
– Publish artifacts to the Central Repository
spark-packages.org
• An index of packages for Apache Spark
• The Spark community keeps an eye on it
• An ideal place if you want to extend Spark
• You can register any GitHub-hosted Spark project
The Central Repository
• Apache Maven retrieves all components from the Central Repository by default
– so does Apache Spark
– and many other build systems
• Are your artifacts there yet?
Getting to the Central

Sonatype provides OSSRH:
– a free repository
– for open source software
– to store snapshot artifacts
– and to promote releases to the Central Repository

Checklist:
1. Register [3] at Sonatype OSSRH
2. Generate a GPG key (if you don't have one yet)
3. Alter [4] your build.sbt
4. Build and sign your artifacts
5. Stage [5] the release at OSSRH and promote it to the Central Repository
6. Voilà!
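For step 3, a build.sbt fragment along the lines of the sbt "Deploying to Sonatype" guide [4] might look like this; treat it as a sketch under sbt 0.13 conventions, not this project's actual build file:

```scala
// Hypothetical build.sbt fragment: snapshots and releases go to different
// OSSRH repositories, and releases can then be promoted to Central.
publishMavenStyle := true

publishTo := {
  val nexus = "https://oss.sonatype.org/"
  if (isSnapshot.value)
    Some("snapshots" at nexus + "content/repositories/snapshots")
  else
    Some("releases" at nexus + "service/local/staging/deploy/maven2")
}
```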
Things are smooth now
$SPARK_HOME/bin/spark-shell \
  --packages pl.edu.icm:sparkling-ferns_2.10:0.2.0
THANK YOU! QUESTIONS?
http://spark-packages.org/package/CeON/sparkling-ferns
@pjden
@mfedoryszak
/piotrdendek
/mfedoryszak
References
[1] M. Kursa, "rFerns: An Implementation of the Random Ferns Method for General-Purpose Machine Learning", DOI: 10.18637/jss.v061.i10
[2] "Proof that the Binomial Distribution tends to the Poisson Distribution", https://youtu.be/ceOwlHnVCqo
[3] "OSSRH Guide", Sonatype, http://central.sonatype.org/pages/ossrh-guide.html
[4] "Deploying to Sonatype", sbt, http://www.scala-sbt.org/release/docs/Using-Sonatype.html
[5] "Releasing the Deployment", Sonatype, http://central.sonatype.org/pages/releasing-the-deployment.html