Sparkling Random Ferns. From an academic paper to spark-packages.org
Piotr Jan Dendek, Mateusz Fedoryszak
The Agenda
1. How it started
2. What is the Random Ferns algorithm?
3. How did the implementation, evaluation and publishing go?
Motivations
• Random Ferns is a popular classification algorithm in the image processing field
• Our colleague Miron Kursa implemented this algorithm as part of his research [1] and published it as an R package called rFerns
• We decided to empower the Spark community with this method by making it available as a Spark package
THE ALGORITHM
The Algorithm
• Random Ferns
– An example of supervised learning
– Solves classification problems
– A kind of ensemble algorithm
Posterior Probability
• Hypothetically we could learn the full conditional probabilities P(C_k | f_1, …, f_N)
• The classifier would then be described as H(f) ≡ argmax_k P(C_k | f_1, …, f_N)
• Not suitable: intractable and memory-consuming
Naïve Bayes Classifier
• Naïve because it ignores dependencies among features
• Often classifies quite successfully
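The naïve assumption can be written out explicitly. Using the classes C_k and features f_1, …, f_N from the surrounding slides, conditioning on the class lets the posterior factorise:

```latex
% Full posterior: hard to estimate directly.
P(C_k \mid f_1, \dots, f_N)

% Naive Bayes: assume the features are independent given the class.
H(\boldsymbol{f}) \equiv \operatorname*{arg\,max}_k \; P(C_k) \prod_{i=1}^{N} P(f_i \mid C_k)
```

This has the same shape as the fern classifier shown later in the talk, with single features f_i in place of feature groups F_l.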
Randomness in Classifiers
• Goals to reach:
– Avoid overfitting
– Build classifiers faster
• Ways to introduce randomness:
– Item sampling with replacement
– Feature sampling
Random Ferns
• Each of the L classifiers (ferns) has its own set of S features
• Assume that the classifiers are independent of one another
• Then classify items by combining their votes: H(f) ≡ argmax_k P(C_k) ∏_{l=1}^{L} P(F_l | C_k)
Random Ferns
• Less-naïve Bayes
• From the Random Forests perspective: a fern is a decision tree that applies the same test at every node of a given level

[Diagram: three example ferns drawn as depth-3 trees; within each tree, all nodes at the same depth share one test (e.g. A at the root, B at the second level, C at the third).]
THE IMPLEMENTATION
Bagging

[Diagram: from the initial set, each of Fern 1, Fern 2 and Fern 3 draws its own bootstrap sample. The rows of counts (e.g. 2 0 2 1 0) show how many times each item of the initial set was drawn for each fern.]
Big Data Bagging
• How many times would a data point be sampled?
– The count follows a Binomial distribution
– For big data, the Binomial distribution tends to the Poisson distribution [2]
• So: simulate sampling by drawing each item's count from a Poisson distribution
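The Poisson trick can be sketched in plain Scala without Spark. The `poisson` helper below is illustrative (Knuth's sampling algorithm), not code from the package; the idea is that each item's bootstrap count for a fern is drawn from Poisson(1) instead of sampling n items with replacement.

```scala
import scala.util.Random

// Knuth's algorithm for drawing a Poisson(lambda)-distributed count.
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0
  var p = rng.nextDouble()
  while (p > limit) {
    k += 1
    p *= rng.nextDouble()
  }
  k
}

val rng = new Random(42)

// Bootstrap weight per item: how many times it enters one fern's sample.
val items = Seq("a", "b", "c", "d", "e")
val weights = items.map(item => (item, poisson(1.0, rng)))
```

On an RDD the same draw runs inside a `map`, so no shuffle is needed to build each fern's sample.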
Binarisation
Note: each fern has its own binarisers.

Categorical features:
— Get a random subset of categories
— A given category either belongs to this set or not

Continuous features:
— Get two random feature values from the training set
— Use their mean as the threshold
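The two binarisation rules can be sketched as below; the helper names (`continuousBinariser`, `categoricalBinariser`) are hypothetical, not the package's actual API.

```scala
import scala.util.Random

val rng = new Random(13)

// Continuous feature: pick two random training values, threshold at their mean.
def continuousBinariser(training: IndexedSeq[Double], rng: Random): Double => Boolean = {
  val a = training(rng.nextInt(training.size))
  val b = training(rng.nextInt(training.size))
  val threshold = (a + b) / 2.0
  (x: Double) => x > threshold
}

// Categorical feature: membership in a random subset of the known categories.
def categoricalBinariser(categories: Seq[String], rng: Random): String => Boolean = {
  val subset = categories.filter(_ => rng.nextBoolean()).toSet
  (c: String) => subset.contains(c)
}

val contBin = continuousBinariser(IndexedSeq(1.0, 2.0, 3.0, 4.0), rng)
val catBin  = categoricalBinariser(Seq("red", "green", "blue"), rng)
```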
Binarisation — implementation

Categorical features:
— Trivial, as we have user-supplied category info

Continuous features:
— Assign every value a random float
— Reduce by keeping the two values with the greatest floats assigned
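The "greatest random floats" trick can be shown locally; a local `reduce` stands in for an RDD reduce, and all names are illustrative. Tagging each value with a random float and keeping the two largest tags picks two values uniformly at random in a single distributed pass.

```scala
import scala.util.Random

val rng = new Random(7)
val values = Seq(3.2, 1.5, 4.8, 2.9, 0.7, 5.1)

// Tag every value with a random float, then keep the two largest tags in one
// reduce; in Spark the same map/reduce pair runs across partitions.
val topTwo = values
  .map(v => List((rng.nextDouble(), v)))
  .reduce((a, b) => (a ++ b).sortBy(-_._1).take(2))

val sampled   = topTwo.map(_._2)
val threshold = sampled.sum / 2.0  // mean of the two sampled values
```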
Probabilities

H(f) ≡ argmax_k P(C_k) ∏_{l=1}^{L} P(F_l | C_k)

What is P(F_l | C_k)?
P(F_l | C_k)
• F_l is the combination of binary feature values used by fern l
• For a fern of height S there are 2^S distinct values of F_l
• You may think of it as fern l mapping each object into one of 2^S buckets
P(F_l | C_k)
• The probability of an object of class C_k falling into bucket F_l
• The count of objects of class C_k falling into bucket F_l, divided by the count of objects of class C_k:

P(F_l | C_k) = |F_l ∩ C_k| / |C_k|
Reduction
• The most important part of training is counting objects
• Sounds similar to… counting words!
• We have reduced classifier building to the best-known big data problem
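A local sketch of the word-count analogy, with an illustrative `Key` type standing in for a (fern, bucket, class) triple; in Spark the same counting would be a `map` to `(key, 1L)` followed by `reduceByKey(_ + _)`.

```scala
// An object's training contribution is just a key to count.
case class Key(fern: Int, bucket: Int, label: Double)

val observations = Seq(
  Key(0, 3, 1.0), Key(0, 3, 1.0), Key(0, 1, 0.0), Key(1, 2, 1.0)
)

// Word count in disguise: group identical keys and count them.
val counts: Map[Key, Int] =
  observations.groupBy(identity).map { case (k, group) => (k, group.size) }
```

Dividing these counts by the per-class totals yields exactly the P(F_l | C_k) = |F_l ∩ C_k| / |C_k| estimates from the previous slide.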
Memory
Q: How many probabilities do we need to compute?
A: About 2^S · K per fern, where S is the fern height and K is the number of classes
That means a binary classifier of 100 20-feature ferns will weigh over 1.5 GB
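The 1.5 GB figure checks out on the back of an envelope, assuming one 8-byte double per stored probability:

```scala
// 100 ferns of height 20, binary classification, 8-byte doubles.
val buckets = math.pow(2, 20).toLong  // 2^S buckets for fern height S = 20
val classes = 2L                      // binary classifier
val ferns   = 100L
val bytes   = buckets * classes * ferns * 8L
val gigabytes = bytes.toDouble / (1024L * 1024L * 1024L)  // ≈ 1.56 GB
```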
THE EVALUATION
Accuracy et al.
• Evaluation on the Iris and Car datasets as an integration test
• Iris:
– 10 ferns, 3 features per fern (out of 4)
– Accuracy: 98%
• Car:
– 20 ferns, 4 features per fern (out of 6)
– Accuracy: 90%
Dataset
• Million Song Dataset – Year Prediction
– Not quite a classification task, but big (0.5M items)
– Task: given 90 real-number features, predict the publication year (ranging from 1922 to 2011)
– For the sake of demonstration, let's just pretend it is a classification problem
Model Training Code

val raw = sc.textFile(…)
val lp = raw.map(parseIntoLabeledPoints(_))
val data = splitIntoTrainTest(lp)
val numFerns = 90
val numFeatures = 10
val model = FernForest.train(data.train, numFerns, numFeatures, Map.empty)
val correct = data.test.map(lp => model.predict(lp.features) == lp.label)
Model Training Time
• The dominant factors are:
– the number of features
– the number of items in the dataset
Model Training Time

[Plot: estimated training time [min] (y-axis, 25–50) against the number of features (x-axis, 10–20).]

• Training time is linear
– in the number of features (unlike Random Forests)
– in the number of samples
[Plot: training time [min] (y-axis, 0–12) against sample size (x-axis, 0%–60% of the 0.5M-item dataset).]
THE PACKAGE
Our toolbox
How can you help your users?
• Simplify discovery
– Register at spark-packages.org
• Simplify utilisation
– Publish artifacts to the Central Repository
spark-packages.org
• An index of packages for Apache Spark
• The Spark community keeps an eye on it
• An ideal place if you want to extend Spark
• You can register any GitHub-hosted Spark project
The Central Repository
• Apache Maven retrieves all components from the Central Repository by default
– so does Apache Spark
– and many other build systems
• Are your artifacts there yet?
Getting to the Central

Sonatype provides OSSRH:
– a free repository
– for open source software
– to store snapshot artifacts
– and to promote releases to the Central Repository

Checklist:
1. Register [3] at Sonatype OSSRH
2. Generate a GPG key (if you don't have one yet)
3. Alter [4] your build.sbt
4. Build and sign your artifacts
5. Stage [5] the release at OSSRH and promote it to the Central Repository
6. Voilà!
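For step 3, a build.sbt fragment along the lines of the sbt "Deploying to Sonatype" guide [4] might look like this; treat it as a sketch under sbt 0.13 conventions, not this project's actual build file:

```scala
// Hypothetical build.sbt fragment: snapshots and releases go to different
// OSSRH repositories, and releases can then be promoted to Central.
publishMavenStyle := true

publishTo := {
  val nexus = "https://oss.sonatype.org/"
  if (isSnapshot.value)
    Some("snapshots" at nexus + "content/repositories/snapshots")
  else
    Some("releases" at nexus + "service/local/staging/deploy/maven2")
}
```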
Things are smooth now
$SPARK_HOME/bin/spark-shell \
  --packages pl.edu.icm:sparkling-ferns_2.10:0.2.0
THANK YOU! QUESTIONS?
http://spark-packages.org/package/CeON/sparkling-ferns
@pjden
@mfedoryszak
/piotrdendek
/mfedoryszak
References
[1] M. Kursa, "rFerns: An Implementation of the Random Ferns Method for General-Purpose Machine Learning", DOI: 10.18637/jss.v061.i10
[2] "Proof that the Binomial Distribution tends to the Poisson Distribution", https://youtu.be/ceOwlHnVCqo
[3] "OSSRH Guide", Sonatype, http://central.sonatype.org/pages/ossrh-guide.html
[4] "Deploying to Sonatype", sbt, http://www.scala-sbt.org/release/docs/Using-Sonatype.html
[5] "Releasing the Deployment", Sonatype, http://central.sonatype.org/pages/releasing-the-deployment.html