machine learning summary for caltech2
TRANSCRIPT
National Aeronautics and Space Administration
Jet Propulsion Laboratory
California Institute of Technology
Machine Learning and Instrument Autonomy Group (398E)
Contact (Supervisor): Tara Estlin
Presenter: Lukas Mandrake
© 2015 California Institute of Technology. Government sponsorship acknowledged.
The Machine Learning Group at JPL:Squeezing More Out Of Your Data
Outline of Talk
• Machine Learning Definition
• Amenable Data Types
• Curse of Dimensionality
• Common Techniques
• Example Applications
What is Machine Learning? (Why do we need it?)
How Scientists Spend Brainpower
[Notional graph: "% Thinking Spent" vs. available data volume (kB, MB, GB, TB). Electrical sensors for data taking, computers for analytics, ML for building models: this leaves humans free to interpret.]
The World Today Without ML
Scatter data:
[Notional scatter plot of an important variable: what you hope for vs. what you get]
What you do: pick 2 dimensions and plot. Linear regressions, correlation coefficients, R². Higher-order curve fits. Separate & identify sub-populations. Make & compare models. Throw out outliers. Flag what's potentially interesting. Wonder if you picked the best X-axis.

Spectral data:
[Notional spectrum: counts vs. bins (λ, ν)]
Fit slopes. Subtract / divide backgrounds. Take ratios of known frequencies. Look for familiar peaks. Make & compare models. Hope you know which frequencies are most informative.

Does it work? YES! It's how we got here. But it takes a LOT of human time, and that's the currency of merit.
ML: Fusion of Statistics and CompSci
• Replicate what humans do while analyzing data
– Builds and auto-tunes models
– Forces analyst to make metrics clear(er)
• Subjective decisions become principled (or at least repeatable) and FAST!
• Handles higher dimensions than humans can visualize
• Replicates over huge datasets
• In the end, it extends human analysis. No one gets replaced; they get augmented!
What Kinds of Science Data Does It Help Most? (new ones found all the time)
Data: Boon and Bane
Hyperspectral imagers: pictures with high-res spectra at each pixel
4000 frequencies x 5e6 pixels x 3000 images
A human can select a pixel to study its spectrum, or make an image of a particular frequency (or ratio). Maybe take a dot product with a desired spectrum.
Most of image(s) remain unexamined! Needle in haystack problem.
Image databases: How to search?
Martian crater expert only wants images with craters, but would take hundreds of grad students to label them all. Even if you do, each student’s criteria will be different (fatigue).
This classification problem is ubiquitous in image and video processing
Maybe want to find list of “things that aren’t anything you taught me” to look for interesting new landforms not expected.
Compare nearly overlapping images at different times and find “what’s changed” for dynamics study.
Just too large for even a team to tackle.
Data: Boon and Bane
Metadata-Rich Datasets:
Observations aren't just numbers: each carries a whole vector of associated data
- Estimates of T, P, aerosols, cloudiness, H2O content…
- 10 to 1000 such parameters
Look for correlated trends, sub-category description / separation, anomaly discovery, and primary correlates to unwanted behavior
Fundamentally >2D relationships will be missed by simple plotting
Object databases: What’s in there?
Martian rock database records dozens of properties of each rock examined… thousands or millions of them
What rock types group together?
What stands out as unique? Why was it unique? Given we now know that, what's the next most unique thing?
Curse of Dimensionality
• Unit-sphere volumes increase until dim = 5, then go to zero as d → ∞
• A sphere inscribed within a cube eventually removes no volume from the cube
• A smaller sphere within a larger one removes no volume
• In high dimensions, near-zero probability mass lies within 1 stdev of a Gaussian ("Losing the Middle")
…
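The shrinking-sphere effect is easy to check numerically. A minimal sketch using the standard unit-ball volume formula V_d = π^(d/2) / Γ(d/2 + 1) (the dimension range here is arbitrary):

```python
import math

def unit_ball_volume(d):
    """Volume of the unit ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

volumes = {d: unit_ball_volume(d) for d in range(1, 21)}
peak_dim = max(volumes, key=volumes.get)  # dimension where volume peaks
```

Running this shows the volume rising to a maximum at d = 5 and then collapsing toward zero as d grows.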
Effects on Analysis
Example: Analyze a genetics dataset
• Data: 1e6 samples, each a gene snippet 1e5 base-pairs long
• "Find the gene locations that correlate with an observed condition"
• An already-processed dataset with only 1e2 base-pairs analyzed just fine
Each base-pair location has ~10 samples; large D drives data coverage → 0.
Regressions detonate as (ever more likely) correlated genes cause singularities.
Even without correlations, the distance function is diluted: everybody is very far from everybody.
This dilutes the meaning of all regressions and nearest-neighbor comparisons.
The space is too large to be searched; it requires exponential sampling.
Errors! Singular matrices! Meaningless results!
How to Kill the Curse?
• Identify the most informative dimensions or mixtures of them: "feature selection"
• Requires a search over the number of dimensions D, which can take time
• But once informative features are recognized, everything else is faster and easier
• Fundamental to Machine Learning, usually as a first step
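A minimal univariate feature-selection sketch: rank features by correlation with the target and keep the top few. The dataset is invented for illustration (20 noise features plus one, at index 0, that actually drives the target):

```python
import random

random.seed(0)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy data: feature 0 carries the signal, the rest are pure noise.
n_samples, n_features = 200, 21
X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
y = [row[0] * 2 + random.gauss(0, 0.1) for row in X]

# Rank features by |correlation| with the target; keep the top 3.
scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
top3 = sorted(range(n_features), key=lambda j: -scores[j])[:3]
```

Feature 0 dominates the ranking; the subsequent analysis can then proceed in 3 dimensions instead of 21.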
How Do You Know You Need Machine Learning?
When to Use Machine Learning
• Operational Scenarios
• Data > Transmission
– Autonomously prioritize by "interest"
– "Interest" can be specified, calculated in-situ, or anomalous
• Comm Delay > Decision Time
– Autonomous decision making
– Pre-defined or anomalous triggers
– Plan/schedule response and follow-up
• Volume > Analysis Time
– Identify uninteresting / interesting data and sub-populations
– Identify anomalies, test models
• Data collection capability > data storage capability
– Autonomously decide what to retain
What Sorts of Questions Does Machine Learning Answer?
What Kind of Models Can It Build?
Kinds of Solution Methods

Unsupervised "Data Mining"
• Algorithm studies data independently
• User does not "help" algorithm understand
• No user assumptions to corrupt results
• No human expertise either

Supervised "Learning"
• Human provides labeled examples to "learn"
• Human selects algorithm / model for generalization
• Algorithm figures out how to generalize labels
• Resulting tuned model reveals structure of data
• Produces a useful system for replicating the labeling

Other Methods
• Might involve humans as part of the learning cycle
• Might seek feedback to make new labeled data
• Might use evolution to figure out best ML parameters
• Might maintain a multi-goal output space
Unsupervised "Data Mining": No-Label Learning
Unsupervised Data Mining: Finding Hidden Structure in Unlabeled Data

Clustering: "Are there sub-populations in my data?"
• Defines n clusters to which all observations are assigned as members
• Easy to see in 2D, harder in high dimensions
• Sub-populations can guide the analyst to independent analysis
• May correspond to physically meaningful populations
• Must provide distance metric, algorithm, parameters, data filtration, and the parameter n

PCA (Principal Component Analysis): "What combination of dims explains my data variance?"
• New axes based on linear combinations of the original dimensions
• Axes ordered by the data variance they explain
• Works if data variance is all "interesting" (rare)
• Dimension reduction: take the first n axes until 99% of variance is explained
• Only captures linear correlations between dimensions

HMM (Hidden Markov Model): "What statistical model produced my data?"
• Pertains to time-series or sequential data
• Assumes a probabilistic model that depends only on the last state
• Constructs the most likely model that would explain the dataset
• Can reveal hidden relations and driving processes
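A minimal clustering sketch: k-means with k = 2 on toy 1-D data (all values invented, with two obvious sub-populations near 0 and near 10; the min/max initialization is a crude simplification for the two-cluster case):

```python
# Toy 1-D data with two obvious sub-populations.
data = [0.1, 0.3, -0.2, 0.4, 10.2, 9.8, 10.5, 9.9]

def kmeans2(points, iters=20):
    centers = [min(points), max(points)]  # crude initialization for k = 2
    for _ in range(iters):
        groups = [[], []]
        for p in points:  # assign each point to its nearest center
            nearest = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            groups[nearest].append(p)
        centers = [sum(g) / len(g) for g in groups]  # recompute centers
    return centers

centers = kmeans2(data)
```

The recovered centers sit at the means of the two sub-populations, which an analyst could then study independently.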
Unsupervised Data Mining: Finding Hidden Structure in Unlabeled Data

Rules Learning: "What events tend to co-occur?"
• Discovers strong, simple rules between dimensions
• Useful for figuring out hidden relations in large datasets
• Can also be helpful to remove correlated features
• Gives potential rules for interpretation and investigation

Segmentation: "What regions best describe my image?"
• Groups pixels/samples into larger regions of similar nature
• Expensive, slow analysis may then be done per region
• Averaging across a region may reduce noise in the "super pixel"
• Helps image recognition and classification tasks
• Focuses analysis on complex areas vs. boring stretches

DOGO (Data Ordering through Genetic Optimization): "Order my data by its quality / utility"
• Specify a metric to maximize/minimize
• Finds features that, via filtration, optimize the metric
• Constructs a sliding filter that monotonically reduces the metric
• Inverts the filter to produce a data ordering from most to least trusted
• Useful when data isn't merely "good" or "bad"
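A minimal rules-learning sketch: measure how often events co-occur via support and confidence, the standard association-rule quantities. The transactions and event names are invented for illustration:

```python
# Toy transactions: each set lists "events" observed together.
transactions = [
    {"dust", "high_temp"}, {"dust", "high_temp"}, {"dust", "high_temp", "wind"},
    {"wind"}, {"high_temp"}, {"dust", "high_temp"}, {"wind", "dust", "high_temp"},
]

def support(items):
    """Fraction of transactions containing all of the given items."""
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(a, b):
    """Confidence of the rule a -> b, i.e. P(b | a)."""
    return support(a | b) / support(a)

conf = confidence({"dust"}, {"high_temp"})
```

In this toy data, "dust" always co-occurs with "high_temp" (confidence 1.0), while the reverse rule is weaker: a candidate relation to hand to an analyst for interpretation.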
Supervised "Machine Learning": Labels Provided
Supervised Machine Learning: Algorithms Learning from Humans

LDA (Linear Discriminant Analysis): "What hyperplane would best separate my labels?"
• Like PCA, but works to separate labels, not to explain variance
• Returns a vector of how useful each dimension was for separation
• Vulnerable to correlated features
• Surprisingly powerful for simple classification

SVM (Support Vector Machine): "What set of samples best defines label separation?"
• Same idea as LDA: make a separating hyperplane
• Pays attention only to the most confusing examples
• Creates a "basis" set of support vectors, the most informative samples
• Gives an idea of data importance: which samples change the answer

Neural Network: "Predict my labels, I don't care how"
• Define layer geometry: # of hidden layers, input types, output types
• Train on input data and user labels; this defines the weights
• A black-box predictor is now online
• Monte Carlo stimulation of the inputs can maximize an output concept's signal
• Hard to get insight from the network itself
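A minimal separating-hyperplane sketch in the LDA spirit: project onto the difference of the two class means and threshold at the midpoint (this assumes identity covariance, so it is a simplification of true LDA; the 2-D points and class names are invented):

```python
# Two labeled 2-D classes, invented for illustration.
class_a = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (0.9, 1.1)]
class_b = [(3.0, 3.1), (3.2, 2.9), (2.9, 3.0), (3.1, 3.2)]

def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

ma, mb = mean(class_a), mean(class_b)
w = (mb[0] - ma[0], mb[1] - ma[1])  # separating direction (mean difference)
threshold = sum(wi * (ai + bi) / 2 for wi, ai, bi in zip(w, ma, mb))

def predict(p):
    """Project onto w; above the midpoint threshold -> class 'b'."""
    return "b" if p[0] * w[0] + p[1] * w[1] > threshold else "a"
```

The weight vector w also reveals how useful each dimension was for the separation, as the slide describes for LDA.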
Supervised Machine Learning: Algorithms Learning from Humans

Decision Tree: "Play 20 questions to separate my labels."
• Set the # of tree branches allowed
• Learns the best series of questions to isolate the provided labels
• Directly interpretable by domain experts… no black box
• Extremely fast to evaluate once trained

Naïve Bayes: "What's the probability of belonging to each label?"
• Assumes non-correlated dimensions (often works despite this)
• Needs a relatively small number of input labels
• Learns the distribution of training labels independently
• Predicts the probability of a sample being in each label category
• Can easily have an "I don't know" response added

Nearest Neighbor: "Use comparables to predict the label"
• User picks how many neighbors to consider
• For each sample to predict, scans all input training data
• Finds the k nearest neighbors by distance metric, then averages their labels
• Learns no structure, uses no models, just a distance metric
• Can be slow to evaluate if there is a lot of labeled data
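A minimal nearest-neighbor sketch: no model is learned, prediction just scans all labeled samples, as described above. The toy 2-D samples and "rock"/"soil" labels are invented for illustration:

```python
# Toy labeled training data: (2-D point, label).
train = [((1.0, 1.0), "rock"), ((1.2, 0.9), "rock"),
         ((4.0, 4.2), "soil"), ((3.9, 4.1), "soil"), ((4.1, 3.8), "soil")]

def knn_predict(point, k=3):
    """Find the k nearest labeled samples and take a majority vote."""
    dist = lambda p, q: ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    nearest = sorted(train, key=lambda s: dist(s[0], point))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)  # majority vote
```

Note the cost: every prediction sorts the whole training set, which is exactly why the slide warns it can be slow with lots of labeled data.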
Supervised Machine Learning: Algorithms Learning from Humans

Random Forest: "Make lots of little trees and take a vote."
• Trains hundreds of small trees on subsets of the label data
• In the final prediction, lets them vote on the final output
• Overcomes the Decision Tree's tendency to overfit data
• Gives up the Decision Tree's strength of interpretability

Boosting / Ensemble: "Combine algorithms to improve."
• Iteratively trains lots of weak / simple methods (any mix will do)
• A larger optimization (e.g. genetic) twiddles all their parameters
• The larger optimization learns weights to combine their answers
• Takes a lot of processing power and input data, but gives great results
• Can reveal things about the data from which predictors were selected
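The voting idea above can be sketched in a few lines: combine several deliberately weak classifiers by majority vote. The three hypothetical threshold rules and the "a"/"b" labels are invented for illustration (a real forest would learn its trees from data):

```python
# Three weak "classifiers": hypothetical hand-written threshold rules.
def clf_x(p): return "b" if p[0] > 2.0 else "a"
def clf_y(p): return "b" if p[1] > 2.0 else "a"
def clf_sum(p): return "b" if p[0] + p[1] > 4.0 else "a"

def vote(p):
    """Majority vote over the ensemble's individual predictions."""
    preds = [clf(p) for clf in (clf_x, clf_y, clf_sum)]
    return max(set(preds), key=preds.count)
```

Even when one rule disagrees, the majority carries the prediction, which is the mechanism that damps the overfitting of any single tree.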
Mixed Cases: Minimal or Interactive Human Support
Or: methods that work with both
Interesting Mixtures: Algorithms that work with or without labels, or that are interactive and can generate them

Genetic Algorithm: "Iteratively optimize my (gene) model"
• Define a "gene" of all parameters you want the GA to optimize
• Define the goal metric(s) the GA should try to maximize/minimize
• GAs handle mixed input and arbitrary goal metrics
• Not really learning, but useful in similar situations. Slow.

Active Learning: "What should I have labeled to help?"
• Initialize the system with a supervised or unsupervised method
• Have the system predict and display results to the user
• User corrects errors and addresses confusing examples
• Iterate between prediction and feedback until results look good
• Can easily over-fit, so hold-out tests are important here
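A minimal genetic-algorithm sketch: the "gene" here is a single parameter x, and the goal metric is an invented toy fitness f(x) = -(x - 3)², which peaks at x = 3. Selection keeps the best half, mutation perturbs copies of them:

```python
import random

random.seed(1)

def fitness(x):
    """Toy goal metric to maximize; its peak is at x = 3."""
    return -(x - 3.0) ** 2

population = [random.uniform(-10, 10) for _ in range(20)]
for generation in range(60):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                    # selection (keep best half)
    children = [random.choice(parents) + random.gauss(0, 0.5)  # mutation
                for _ in range(10)]
    population = parents + children

best = max(population, key=fitness)
```

Because parents carry over unchanged (elitism), the best gene never degrades, and the population drifts toward the metric's peak; a real GA would also add crossover between parents.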
Travel Advisory: Data Preparation
Some concepts to save you from harm

Over-Fitting
• Number of samples is not >> degrees of freedom
• Gets you great results… on your training set
• Can't generalize! The predictor fails on new data.
• Use cross-validation and/or a simpler model
Train / Test Split
• Train on 20% of the data, test on 80%? Vice versa? 50/50?
• Depends on data volume and algorithm need
• Data structure also matters: how heterogeneous is it?

Cross-Validation
• Automated way to explore all possible train/test splits
• A second level withholds data from both test & train
• Takes lots of data and time
• Actually tests generalization
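A minimal k-fold cross-validation sketch: partition sample indices into k folds, each serving once as the held-out test set while the rest train (index bookkeeping only; any model could be plugged in):

```python
def kfold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering all samples."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

splits = kfold_indices(10, 5)
```

Every sample appears in exactly one test fold, so each prediction is made by a model that never saw that sample: that is what "actually tests generalization" means above.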
Label Imbalance
• If 5% of your labels are "yes" and 95% are "no"…
• Just guess "no" all the time, and you're right 95% of the time!
• This baseline can bias some training algorithms
Normalization
• Should you normalize all inputs between 0-1?
• Perhaps they should have mean 0, STD = 1 instead?
• Not if it's a spectrum where relative intensity matters
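The mean-0 / STD-1 option above is z-score normalization; a minimal per-feature sketch (the input values are arbitrary):

```python
def zscore(values):
    """Rescale a list of values to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

normed = zscore([2.0, 4.0, 6.0, 8.0])
```

As the slide warns, applying this per-bin to a spectrum would destroy the relative intensities between bins, so choose the scheme to fit the data's physics.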
Evaluation Metric
• 4 metrics: true positives, true negatives, false positives, false negatives (TPOS, TNEG, FPOS, FNEG)
• Application specific
• Make a trade-off curve by running the algorithm with different parameters
• Receiver Operating Characteristic (ROC)
• Just means "How often do you miss?" vs. "How often do you hallucinate?"
• The curve says what your options are. You pick what you can tolerate most.
[Notional ROC curve: True Positives vs. False Positives]
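Counting the four outcome metrics from predictions against truth is a few lines; the rates below are one point on the ROC trade-off curve (the label vectors are invented):

```python
def confusion(y_true, y_pred, positive=1):
    """Count true/false positives/negatives for one class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
true_positive_rate = tp / (tp + fn)   # complement of "how often do you miss"
false_positive_rate = fp / (fp + tn)  # "how often do you hallucinate"
```

Sweeping the algorithm's threshold parameter and replotting these two rates traces out the full ROC curve described above.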
Example Applications at JPL
(Just to be cool)
Image Data Classification
TextureCam: A Smart Instrument (Random Forest)
• Automatically identifies and classifies rocks
Dr. David Thompson
[Six GPS-state map panels: March 11, 2011 at 0500, 0530, 0600, 0630, and 0700 UTC, and March 13, 2011 at 1300 UTC. Before the rupture: nominal states. Rupture initiation and propagation: ~1.5-hour timescale propagation of state changes. Rupture completion: nominal states. Two days later: growth of a feature near the triple junction (near Tokyo).]
Time Series Data Anomaly Detection
GPS Time Series Anomaly Detection (Hidden Markov Models)
• Looks for earthquake behavior people miss
Dr. Robert Granat
Real-Time Image Data Anomaly Detection
AEGIS (Autonomous Exploration for Gathering Increased Science): Segmentation
[Figure: full Pancam view; AEGIS autonomously delivers the 13F Pancam image]
• Notices "interesting" objects while driving/scanning
• Takes higher-resolution images for later analysis
• Operational on MER
Dr. Tara Estlin
Real-Time Soundings Data Prioritization
DOGO (Data Ordering for Genetic Optimization): Genetic Algorithm
• Prioritizes incoming soundings by usefulness for further analysis
• Lets retrieval algorithms initially work only on the cleanest data
• Will be operational in the OCO-2 DAC to meet Level 1 requirements
• Advises scientists on which data to include in their analysis
Dr. Lukas Mandrake
Image Database Anomaly Detection & Classification
Landmarks (Boost + SVM)
• Automatically recognize, outline, and classify Martian landmarks
• The HiRISE database = tens of thousands of huge, high-resolution images
• How to search for your field of interest?
• What are the statistics on various landforms?
Dr. Kiri Wagstaff
Fast, Real-Time Series Data Anomaly Detection
V-FASTR: Transient detection at the VLBA (Random Forests)
Supernovae & Pulsars
[Figure: a transient signal; separating signals in the parameter space]
• Recognizes a brand-new supernova in a few seconds
Dr. Umaa Rebbapragada
Summary
• Machine Learning is for everyone!
• Relatively simple algorithms are lying around for use
• Can help researchers understand their data initially
• Can help drill down into sub-populations
• Can automate monotonous labeling tasks
• Available in
– Python (scikit-learn, Orange)
– Java (Weka)
– Matlab (Statistics, Neural Net, Fuzzy Logic Toolboxes)
– Most languages (OpenCV)
Or just drop us an email! We love to collaborate.