machine learning summary for caltech2
TRANSCRIPT
National Aeronautics and Space Administration
Jet Propulsion Laboratory
California Institute of Technology
Machine Learning and Instrument Autonomy Group (398E)
Contact (Supervisor): Tara Estlin
Presenter: Lukas Mandrake
© 2015 California Institute of Technology. Government sponsorship acknowledged.
The Machine Learning Group at JPL:Squeezing More Out Of Your Data
Outline of Talk
• Machine Learning Definition
• Amenable Data Types
• Curse of Dimensionality
• Common Techniques
• Example Applications
What is Machine Learning? (Why do we need it?)
How Scientists Spend Brainpower
[Notional graph: "% Thinking Spent" vs. available data volume (kB, MB, GB, TB). Electrical sensors for data taking, computers for analytics, ML for building models: this leaves humans free to interpret.]
The World Today Without ML
Scatter data:
[Notional scatter plot of an important variable: what you hope for vs. what you get]
What you do: pick 2 dimensions and plot. Linear regressions, correlation coefficients, R². Higher-order curve fits. Separate & identify sub-populations. Make & compare models. Throw out outliers. Flag what's potentially interesting. Wonder if you picked the best X-axis.

Spectral data:
[Notional spectrum: counts vs. bins (λ, ν)]
Fit slopes. Subtract / divide backgrounds. Take ratios of known frequencies. Look for familiar peaks. Make & compare models. Hope you know which frequencies are most informative.

Does it work? YES! It's how we got here. But it takes a LOT of human time, and that's the currency of merit.
ML: Fusion of Statistics and CompSci
• Replicate what humans do while analyzing data
– Builds and auto-tunes models
– Forces analyst to make metrics clear(er)
• Subjective decisions become principled (or at least repeatable) and FAST!
• Handles higher dimensions than humans can visualize
• Replicates over huge datasets
• In the end, it extends human analysis. No one gets replaced; they get augmented!
What Kinds of Science Data Does It Help Most? (new ones found all the time)
Data: Boon and Bane
Hyperspectral imagers: pictures with high-res spectra at each pixel
4000 frequencies x 5e6 pixels x 3000 images
A human can select a pixel to study its spectrum, or make an image of a particular frequency (or ratio). Maybe take a dot product with a desired spectrum.
Most of image(s) remain unexamined! Needle in haystack problem.
Image databases: How to search?
Martian crater expert only wants images with craters, but would take hundreds of grad students to label them all. Even if you do, each student’s criteria will be different (fatigue).
This classification problem is ubiquitous in image and video processing
Maybe want to find list of “things that aren’t anything you taught me” to look for interesting new landforms not expected.
Compare nearly overlapping images at different times and find “what’s changed” for dynamics study.
Just too large for even a team to tackle.
Data: Boon and Bane
Metadata-Rich Datasets:
Observations aren't just numbers: each carries a whole vector of associated data
- Estimates of T, P, aerosols, cloudiness, H2O content…
- 10 to 1000 such parameters
Look for correlated trends, sub-category description / separation, anomaly discovery, and primary correlates to unwanted behavior
Fundamentally >2D relationships will be missed by simple plotting
Object databases: What’s in there?
Martian rock database records dozens of properties of each rock examined… thousands or millions of them
What rock types group together?
What stands out as unique? Why was it unique? Given we now know that, what's the next most unique thing?
Curse of Dimensionality
• Unit-sphere volumes increase until dim = 5, then go to zero as d → ∞
• A sphere inscribed within a cube eventually removes no volume from the cube
• A smaller sphere within a larger one removes no volume
• In high dimensions, near-zero probability mass lies within 1 stdev of a Gaussian ("Losing the Middle")
…
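The shrinking-sphere effect is easy to check numerically. A minimal sketch using the standard unit-ball volume formula V_d = π^(d/2) / Γ(d/2 + 1) (the dimension range here is arbitrary):

```python
import math

def unit_ball_volume(d):
    """Volume of the unit ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

volumes = {d: unit_ball_volume(d) for d in range(1, 21)}
peak_dim = max(volumes, key=volumes.get)  # dimension where volume peaks
```

Running this shows the volume rising to a maximum at d = 5 and then collapsing toward zero as d grows.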
Effects on Analysis
Example: Analyze a genetics dataset
• Data: 1e6 samples, each a gene snippet 1e5 base-pairs long
• "Find the gene locations that correlate with an observed condition"
• An already-processed dataset with only 1e2 base-pairs analyzed just fine
Each base-pair location has ~10 samples; large D drives data coverage → 0.
Regressions detonate as (ever more likely) correlated genes cause singularities.
Even without correlations, the distance function is diluted: everybody is very far from everybody.
This dilutes the meaning of all regressions and nearest-neighbor comparisons.
The space is too large to be searched; it requires exponential sampling.
Errors! Singular matrices! Meaningless results!
How to Kill the Curse?
• Identify the most informative dimensions or mixtures of them: "feature selection"
• Requires a search over the number of dimensions D, which can take time
• But once informative features are recognized, everything else is faster and easier
• Fundamental to Machine Learning, usually as a first step
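A minimal univariate feature-selection sketch: rank features by correlation with the target and keep the top few. The dataset is invented for illustration (20 noise features plus one, at index 0, that actually drives the target):

```python
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy data: feature 0 carries the signal, the rest are pure noise.
n_samples, n_features = 200, 21
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [row[0] * 2 + random.gauss(0, 0.1) for row in X]

# Rank features by |correlation| with the target; keep the top 3.
scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
top3 = sorted(range(n_features), key=lambda j: -scores[j])[:3]
```

Feature 0 dominates the ranking; the subsequent analysis can then proceed in 3 dimensions instead of 21.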
How Do You Know You Need Machine Learning?
When to Use Machine Learning
• Operational Scenarios
• Data > Transmission
– Autonomously prioritize by "interest"
– "Interest" can be specified, calculated in-situ, or anomalous
• Comm Delay > Decision Time
– Autonomous decision making
– Pre-defined or anomalous triggers
– Plan/schedule response and follow-up
• Volume > Analysis Time
– Identify uninteresting / interesting data and sub-populations
– Identify anomalies, test models
• Data collection capability > data storage capability
– Autonomously decide what to retain
What Sorts of Questions Does Machine Learning Answer?
What Kind of Models Can It Build?
Kinds of Solution Methods

Unsupervised "Data Mining"
• Algorithm studies data independently
• User does not "help" algorithm understand
• No user assumptions to corrupt results
• No human expertise either

Supervised "Learning"
• Human provides labeled examples to "learn"
• Human selects algorithm / model for generalization
• Algorithm figures out how to generalize labels
• Resulting tuned model reveals structure of data
• Produces a useful system for replicating the labeling

Other Methods
• Might involve humans as part of the learning cycle
• Might seek feedback to make new labeled data
• Might use evolution to figure out best ML parameters
• Might maintain a multi-goal output space
Unsupervised "Data Mining": No-Label Learning
Unsupervised Data Mining: Finding Hidden Structure in Unlabeled Data

Clustering: "Are there sub-populations in my data?"
• Defines n clusters to which all observations are assigned as members
• Easy to see in 2D, harder in high dimensions
• Sub-populations can guide the analyst to independent analysis
• May correspond to physically meaningful populations
• Must provide distance metric, algorithm, parameters, data filtration, and the parameter n

PCA (Principal Component Analysis): "What combination of dims explains my data variance?"
• New axes based on linear combinations of the original dimensions
• Axes ordered by the data variance they explain
• Works if data variance is all "interesting" (rare)
• Dimension reduction: take the first n axes until 99% of variance is explained
• Only captures linear correlations between dimensions

HMM (Hidden Markov Model): "What statistical model produced my data?"
• Pertains to time-series or sequential data
• Assumes a probabilistic model that depends only on the last state
• Constructs the most likely model that would explain the dataset
• Can reveal hidden relations and driving processes
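A minimal clustering sketch: k-means with k = 2 on toy 1-D data (all values invented, with two obvious sub-populations near 0 and near 10; the min/max initialization is a crude simplification for the two-cluster case):

```python
# Toy 1-D data with two obvious sub-populations.
data = [0.1, 0.3, -0.2, 0.4, 10.2, 9.8, 10.5, 9.9]

def kmeans2(points, iters=20):
    centers = [min(points), max(points)]  # crude initialization for k = 2
    for _ in range(iters):
        groups = [[], []]
        for p in points:  # assign each point to its nearest center
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[nearest].append(p)
        centers = [sum(g) / len(g) for g in groups]  # recompute centers
    return centers

centers = kmeans2(data)
```

The recovered centers sit at the means of the two sub-populations, which an analyst could then study independently.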
Unsupervised Data Mining: Finding Hidden Structure in Unlabeled Data

Rules Learning: "What events tend to co-occur?"
• Discovers strong, simple rules between dimensions
• Useful for figuring out hidden relations in large datasets
• Can also be helpful to remove correlated features
• Gives potential rules for interpretation and investigation

Segmentation: "What regions best describe my image?"
• Groups pixels/samples into larger regions of similar nature
• Expensive, slow analysis may then be done per region
• Averaging across a region may reduce noise in the "super pixel"
• Helps image recognition and classification tasks
• Focuses analysis on complex areas vs. boring stretches

DOGO (Data Ordering through Genetic Optimization): "Order my data by its quality / utility"
• Specify a metric to maximize/minimize
• Finds features that, via filtration, optimize the metric
• Constructs a sliding filter that monotonically reduces the metric
• Inverts the filter to produce a data ordering from most to least trusted
• Useful when data isn't merely "good" or "bad"
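A minimal rules-learning sketch: measure how often events co-occur via support and confidence, the standard association-rule quantities. The transactions and event names are invented for illustration:

```python
# Toy transactions: each set lists "events" observed together.
transactions = [
    {"dust", "high_temp"}, {"dust", "high_temp"}, {"dust", "high_temp", "wind"},
    {"wind"}, {"high_temp"}, {"dust", "high_temp"}, {"wind", "dust", "high_temp"},
]

def support(items):
    """Fraction of transactions containing all of the given items."""
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """Confidence of the rule a -> b, i.e. P(b | a)."""
    return support(a | b) / support(a)

conf = confidence({"dust"}, {"high_temp"})
```

In this toy data, "dust" always co-occurs with "high_temp" (confidence 1.0), while the reverse rule is weaker: a candidate relation to hand to an analyst for interpretation.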
Supervised "Machine Learning": Labels Provided
Supervised Machine Learning: Algorithms Learning from Humans

LDA (Linear Discriminant Analysis): "What hyperplane would best separate my labels?"
• Like PCA, but works to separate labels, not to explain variance
• Returns a vector of how useful each dimension was for separation
• Vulnerable to correlated features
• Surprisingly powerful for simple classification

SVM (Support Vector Machine): "What set of samples best defines label separation?"
• Same idea as LDA: make a separating hyperplane
• Pays attention only to the most confusing examples
• Creates a "basis" set of support vectors, the most informative samples
• Gives an idea of data importance: which samples change the answer

Neural Network: "Predict my labels, I don't care how"
• Define layer geometry: # of hidden layers, input types, output types
• Train on input data and user labels; this defines the weights
• A black-box predictor is now online
• Monte Carlo stimulation of the inputs can maximize an output concept's signal
• Hard to get insight from the network itself
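A minimal separating-hyperplane sketch in the LDA spirit: project onto the difference of the two class means and threshold at the midpoint (this assumes identity covariance, so it is a simplification of true LDA; the 2-D points and class names are invented):

```python
# Two labeled 2-D classes, invented for illustration.
class_a = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (0.9, 1.1)]
class_b = [(3.0, 3.1), (3.2, 2.9), (2.9, 3.0), (3.1, 3.2)]

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

ma, mb = mean(class_a), mean(class_b)
w = (mb[0] - ma[0], mb[1] - ma[1])  # separating direction (mean difference)
threshold = sum(wi * (ai + bi) / 2 for wi, ai, bi in zip(w, ma, mb))

def predict(p):
    """Project onto w; above the midpoint threshold -> class 'b'."""
    return "b" if p[0] * w[0] + p[1] * w[1] > threshold else "a"
```

The weight vector w also reveals how useful each dimension was for the separation, as the slide describes for LDA.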
Supervised Machine Learning: Algorithms Learning from Humans

Decision Tree: "Play 20 questions to separate my labels."
• Set the # of tree branches allowed
• Learns the best series of questions to isolate the provided labels
• Directly interpretable by domain experts… no black box
• Extremely fast to evaluate once trained

Naïve Bayes: "What's the probability of belonging to each label?"
• Assumes non-correlated dimensions (often works despite this)
• Needs a relatively small number of input labels
• Learns the distribution of training labels independently
• Predicts the probability of a sample being in each label category
• Can easily have an "I don't know" response added

Nearest Neighbor: "Use comparables to predict the label"
• User picks how many neighbors to consider
• For each sample to predict, scans all input training data
• Finds the k nearest neighbors by distance metric, then averages their labels
• Learns no structure, uses no models, just a distance metric
• Can be slow to evaluate if there is a lot of labeled data
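A minimal nearest-neighbor sketch: no model is learned, prediction just scans all labeled samples, as described above. The toy 2-D samples and "rock"/"soil" labels are invented for illustration:

```python
# Toy labeled training data: (2-D point, label).
train = [((1.0, 1.0), "rock"), ((1.2, 0.9), "rock"),
         ((4.0, 4.2), "soil"), ((3.9, 4.1), "soil"), ((4.1, 3.8), "soil")]

def knn_predict(point, k=3):
    """Find the k nearest labeled samples and take a majority vote."""
    dist = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    nearest = sorted(train, key=lambda s: dist(s[0], point))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)  # majority vote
```

Note the cost: every prediction sorts the whole training set, which is exactly why the slide warns it can be slow with lots of labeled data.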
Supervised Machine Learning: Algorithms Learning from Humans

Random Forest: "Make lots of little trees and take a vote."
• Trains hundreds of small trees on subsets of the label data
• In the final prediction, lets them vote on the final output
• Overcomes the Decision Tree's tendency to overfit data
• Gives up the Decision Tree's strength of interpretability

Boosting / Ensemble: "Combine algorithms to improve."
• Iteratively trains lots of weak / simple methods (any mix will do)
• A larger optimization (e.g. genetic) twiddles all their parameters
• The larger optimization learns weights to combine their answers
• Takes a lot of processing power and input data, but gives great results
• Can reveal things about the data from which predictors were selected
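The voting idea above can be sketched in a few lines: combine several deliberately weak classifiers by majority vote. The three hypothetical threshold rules and the "a"/"b" labels are invented for illustration (a real forest would learn its trees from data):

```python
# Three weak "classifiers": hypothetical hand-written threshold rules.
def clf_x(p): return "b" if p[0] > 2.0 else "a"
def clf_y(p): return "b" if p[1] > 2.0 else "a"
def clf_sum(p): return "b" if p[0] + p[1] > 4.0 else "a"

def vote(p):
    """Majority vote over the ensemble's individual predictions."""
    preds = [clf(p) for clf in (clf_x, clf_y, clf_sum)]
    return max(set(preds), key=preds.count)
```

Even when one rule disagrees, the majority carries the prediction, which is the mechanism that damps the overfitting of any single tree.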
Mixed Cases: Minimal or Interactive Human Support
Or: methods that work with both
Interesting Mixtures: Algorithms that work with or without labels, or that are interactive and can generate them

Genetic Algorithm: "Iteratively optimize my (gene) model"
• Define a "gene" of all parameters you want the GA to optimize
• Define the goal metric(s) the GA should try to maximize/minimize
• GAs handle mixed input and arbitrary goal metrics
• Not really learning, but useful in similar situations. Slow.

Active Learning: "What should I have labeled to help?"
• Initialize the system with a supervised or unsupervised method
• Have the system predict and display results to the user
• User corrects errors and addresses confusing examples
• Iterate between prediction and feedback until results look good
• Can easily over-fit, so hold-out tests are important here
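A minimal genetic-algorithm sketch: the "gene" here is a single parameter x, and the goal metric is an invented toy fitness f(x) = -(x - 3)², which peaks at x = 3. Selection keeps the best half, mutation perturbs copies of them:

```python
import random

random.seed(1)

def fitness(x):
    """Toy goal metric to maximize; its peak is at x = 3."""
    return -(x - 3.0) ** 2

population = [random.uniform(-10, 10) for _ in range(20)]
for generation in range(60):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                    # selection (keep best half)
    children = [random.choice(parents) + random.gauss(0, 0.5)  # mutation
                for _ in range(10)]
    population = parents + children

best = max(population, key=fitness)
```

Because parents carry over unchanged (elitism), the best gene never degrades, and the population drifts toward the metric's peak; a real GA would also add crossover between parents.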
Travel Advisory: Data Preparation
Some concepts to save you from harm

Over-Fitting
• Number of samples is not >> degrees of freedom
• Gets you great results… on your training set
• Can't generalize! The predictor fails on new data.
• Use cross-validation and/or a simpler model
Train / Test Split
• Train on 20% of the data, test on 80%? Vice versa? 50/50?
• Depends on data volume and algorithm need
• Data structure also matters: how heterogeneous is it?

Cross-Validation
• Automated way to explore all possible train/test splits
• A second level withholds data from both test & train
• Takes lots of data and time
• Actually tests generalization
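A minimal k-fold cross-validation sketch: partition sample indices into k folds, each serving once as the held-out test set while the rest train (index bookkeeping only; any model could be plugged in):

```python
def kfold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

splits = kfold_indices(10, 5)
```

Every sample appears in exactly one test fold, so each prediction is made by a model that never saw that sample: that is what "actually tests generalization" means above.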
Label Imbalance
• If 5% of your labels are "yes" and 95% are "no"…
• Just guess "no" all the time, and you're right 95% of the time!
• This baseline can bias some training algorithms
Normalization
• Should you normalize all inputs between 0-1?
• Perhaps they should have mean 0, STD = 1 instead?
• Not if it's a spectrum where relative intensity matters
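The mean-0 / STD-1 option above is z-score normalization; a minimal per-feature sketch (the input values are arbitrary):

```python
def zscore(values):
    """Rescale a list of values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

normed = zscore([2.0, 4.0, 6.0, 8.0])
```

As the slide warns, applying this per-bin to a spectrum would destroy the relative intensities between bins, so choose the scheme to fit the data's physics.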
Evaluation Metric
• 4 metrics: true positives, true negatives, false positives, false negatives (TPOS, TNEG, FPOS, FNEG)
• Application specific
• Make a trade-off curve by running the algorithm with different parameters
• Receiver Operating Characteristic (ROC)
• Just means "How often do you miss?" vs. "How often do you hallucinate?"
• The curve says what your options are. You pick what you can tolerate most.
[Notional ROC curve: True Positives vs. False Positives]
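Counting the four outcome metrics from predictions against truth is a few lines; the rates below are one point on the ROC trade-off curve (the label vectors are invented):

```python
def confusion(y_true, y_pred, positive=1):
    """Count true/false positives/negatives for one class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
true_positive_rate = tp / (tp + fn)   # complement of "how often do you miss"
false_positive_rate = fp / (fp + tn)  # "how often do you hallucinate"
```

Sweeping the algorithm's threshold parameter and replotting these two rates traces out the full ROC curve described above.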
Example Applications at JPL
(Just to be cool)
Image Data Classification
TextureCam: A Smart Instrument (Random Forest)
• Automatically identifies and classifies rocks
Dr. David Thompson
[Six GPS-state map panels: March 11, 2011 at 0500, 0530, 0600, 0630, and 0700 UTC, and March 13, 2011 at 1300 UTC. Before the rupture: nominal states. Rupture initiation and propagation: ~1.5-hour timescale propagation of state changes. Rupture completion: nominal states. Two days later: growth of a feature near the triple junction (near Tokyo).]
Time Series Data Anomaly Detection
GPS Time Series Anomaly Detection (Hidden Markov Models)
• Looks for earthquake behavior people miss
Dr. Robert Granat
Real-Time Image Data Anomaly Detection
AEGIS (Autonomous Exploration for Gathering Increased Science): Segmentation
[Figure: full Pancam view; AEGIS autonomously delivers the 13F Pancam image]
• Notices "interesting" objects while driving/scanning
• Takes higher-resolution images for later analysis
• Operational on MER
Dr. Tara Estlin
Real-Time Soundings Data Prioritization
DOGO (Data Ordering for Genetic Optimization): Genetic Algorithm
• Prioritizes incoming soundings by usefulness for further analysis
• Lets retrieval algorithms initially work only on the cleanest data
• Will be operational in the OCO-2 DAC to meet Level 1 requirements
• Advises scientists on which data to include in their analysis
Dr. Lukas Mandrake
Image Database Anomaly Detection & Classification
Landmarks (Boost + SVM)
• Automatically recognize, outline, and classify Martian landmarks
• The HiRISE database = tens of thousands of huge, high-resolution images
• How to search for your field of interest?
• What are the statistics on various landforms?
Dr. Kiri Wagstaff
Fast, Real-Time Series Data Anomaly Detection
V-FASTR: Transient detection at the VLBA (Random Forests)
Supernovae & Pulsars
[Figure: a transient signal; separating signals in the parameter space]
• Recognizes a brand-new supernova in a few seconds
Dr. Umaa Rebbapragada
Summary
• Machine Learning is for everyone!
• Relatively simple algorithms are lying around for use
• Can help researchers understand their data initially
• Can help drill down into sub-populations
• Can automate monotonous labeling tasks
• Available in
– Python (scikit-learn, Orange)
– Java (Weka)
– Matlab (Statistics, Neural Net, Fuzzy Logic Toolboxes)
– Most languages (OpenCV)
Or just drop us an email! We love to collaborate.