Big Data Algorithms with Medical Applications
idke.ruc.edu.cn/ccfbigdata/slides/cyx.pdf
TRANSCRIPT
Big Data Algorithms with Medical Applications
Yixin Chen
Outline
Challenges to big data algorithms
Clinical Big Data
Our new algorithms
Small data vs. Big data
General patterns
VS
Specific patterns
Small data vs. Big data
Causality vs. Association
Domain knowledge vs. Data knowledge
Small data vs. Big data Models
[Figure: model quality versus data size, with curves labeled Big Data and Small Data]
Modeling techniques
Parametric vs. non-parametric
Efficiency, interpretability, accuracy
Efficiency of big data models
High efficiency
- Parallelization (constant speedup)
- Algorithmic improvements (e.g. $O(N^3)$ vs. $O(N^2)$)
Large-scale manifold learning: Maximum Variance Correction (Chen et al. ICML’13)
Outline
Challenges to big data algorithms
Clinical Big Data
Our new algorithms
The need for clinical prediction
• ICU direct costs per day for survivors are between six and seven times those for non-ICU care.
• Unlike ICU patients, general hospital ward (GHW) patients are not under extensive electronic monitoring and nursing care.
• Clinical studies have found that 4–17% of patients undergo cardiopulmonary or respiratory arrest while in the GHW.
Goal: Let Data Speak!
Sudden deteriorations (e.g. septic shock, cardiopulmonary or respiratory arrest) of GHW patients can often be severe and life-threatening.
Goal: Provide early detection and intervention based on data mining to prevent these serious, often life-threatening events.
• Using both clinical data and wireless body sensor data
• An NSF/NIH-funded clinical trial at Washington University/Barnes-Jewish Hospital
Clinical Data: high-dimensional real-time time-series data
34 vital signs: pulse, temperature, oxygen saturation, shock index, respirations, blood pressure, …
[Figure: vital-sign time series; x-axis: time (seconds)]
Previous Work
Main problem: Most previous work uses a snapshot method that takes all the features at a given time as input to a model, discarding the temporal evolution of the data.
Medical data mining
• Medical knowledge
- SCAP and PSI
- Acute Physiology Score, Chronic Health Score, and APACHE score are used to predict renal failures
- Modified Early Warning Score (MEWS)
• Machine learning methods
- Decision trees
- Neural networks
- SVM
Machine learning task
[Figure: bar chart comparing Non-ICU and ICU counts; axis from 0 to 30,000]
Challenges:
• Classification of high-dimensional time series data
• Irregular data gaps
• Measurement errors
• Class imbalance
Solution based on existing techniques
• Temporal feature extraction
• Bootstrap aggregating (bagging)
• Exploratory under-sampling
• Feature selection
• Exponential moving average smoothing
• Basic classifier
(Mao et al. KDD’12)
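Two of the listed components can be sketched in a few lines. This is an illustrative sketch only: the function names, the `alpha` value, and the use of `None` for missing readings are assumptions, and the under-sampling shown is plain random under-sampling, a simpler stand-in for the exploratory under-sampling named on the slide.

```python
import random

def ema_smooth(values, alpha=0.3):
    """Exponential moving average over one vital-sign series.

    Missing readings (None) carry the previous smoothed value forward;
    leading gaps stay None until the first real reading arrives.
    """
    smoothed = []
    prev = None
    for v in values:
        if v is None:
            cur = prev
        elif prev is None:
            cur = v
        else:
            cur = alpha * v + (1 - alpha) * prev
        smoothed.append(cur)
        prev = cur
    return smoothed

def undersample_indices(labels, rng):
    """Keep every minority-class row plus an equal-sized random draw
    from the majority class, to reduce class imbalance."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))

series = ema_smooth([1.0, None, 3.0], alpha=0.5)
balanced = undersample_indices([0, 0, 0, 0, 1, 1], random.Random(0))
```

Bagging would repeat the under-sampling step with different seeds and average the resulting classifiers.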
Desired Classifier Properties
• Nonlinear classification ability
• Interpretability
• Support for mixed data types
• Efficiency
• Multi-class classification
Linear SVM and Logistic Regression: interpretable and efficient, but linear
SVM with RBF kernels: nonlinear, but not interpretable and inefficient
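Desired Classifier Properties

| Property | kNN | NB | NN | LR | Linear SVM | Kernel SVM |
|---|---|---|---|---|---|---|
| Nonlinear classification ability | Y | N | Y | N | N | Y |
| Interpretability | N | Y | N | Y | Y | N |
| Direct support for mixed data types | Y | Y | N | N | N | N |
| Efficiency | Y | Y | Y | Y | Y | N |
| Multi-class classification | Y | Y | Y | Y | N | N |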
Random kitchen sinks (RKS)
Random nonlinear feature transformation → Parametric, linear classifier
1. Transform each input $x$ into $\exp(-i\, w_k^\top x)$, $k = 1, \dots, K$, with $w_k$ drawn from a Gaussian distribution $p(w)$
2. Learn a linear model $\sum_k \alpha_k \exp(-i\, w_k^\top x)$
Theory: based on the Fourier transform, RKS converges to RBF-SVM for large $K$
Efficient, but not interpretable
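The feature map in step 1 can be sketched as follows. This uses the real-valued cosine variant of random Fourier features, which is equivalent in expectation to the complex exponential on the slide, not the slide's exact form. The function names, the `sigma` parameterization of the RBF kernel, and the seed are assumptions for illustration.

```python
import math
import random

def make_rks_features(dim, num_features, sigma, seed=0):
    """Return a function mapping x to K random cosine features whose
    inner products approximate exp(-||x - y||^2 / (2 sigma^2))."""
    rng = random.Random(seed)
    # w_k ~ N(0, I / sigma^2), b_k ~ Uniform[0, 2*pi]
    ws = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
          for _ in range(num_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def transform(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]

    return transform

def rbf_kernel(x, y, sigma):
    """Exact RBF kernel, used here only to check the approximation."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

phi = make_rks_features(dim=2, num_features=5000, sigma=1.0, seed=0)
x, y = [0.2, -0.4], [0.5, 0.1]
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = rbf_kernel(x, y, 1.0)
```

Step 2 then amounts to fitting any linear classifier on `phi(x)`; the approximation error shrinks roughly as one over the square root of the number of features.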
Outline
Challenges to big data algorithms
Clinical Big Data
Our new algorithms
Key Idea: Hybrid Model
Non-parametric, nonlinear feature transformation → Parametric, linear classifier
Efficiency, interpretability, nonlinearity
Desired Classifier Properties

| Property | kNN | NB | NN | LR | Linear SVM | Kernel SVM | DLR |
|---|---|---|---|---|---|---|---|
| Nonlinear classification ability | Y | N | Y | N | N | Y | Y |
| Interpretability | N | Y | N | Y | Y | N | Y |
| Direct support for mixed data types | Y | Y | N | N | N | N | Y |
| Efficiency | Y | Y | Y | Y | Y | N | Y |
| Multi-class classification | Y | Y | Y | Y | N | N | Y |

DLR: Density-based Logistic Regression (Chen et al., KDD’13)
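Logistic Regression
Each instance has D features: $x = (x_1, \dots, x_D)$
Training dataset: $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$, $y^{(i)} \in \{0, 1\}$
Assume: $p(y = 1 \mid x) = \dfrac{1}{1 + \exp(-\tau(x))}$, where $\tau(x) = w_0 + \sum_{d=1}^{D} w_d x_d$
Optimization: maximize the overall log likelihood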
Problem with linear models
If we set $\tau(x) = w_0 + \sum_{d=1}^{D} w_d \phi_d(x)$, what should $\phi_d(x)$ be?
Insights on τ(x) (Logistic regression)
On the other hand:
Hence: LR:
Factorization in DLR
Assumption:
, where
DLR Feature Transformation
is an increasing function of
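A sketch of how a per-feature transformation of this kind can be derived. It assumes the features are conditionally independent given the label (a naive-Bayes-style factorization); that assumption, and the symbols $w_0$, $w_d$, and $\phi_d$ as written here, are illustrative and may not match the exact factorization used in the talk.

```latex
% Log-odds under conditional independence, p(x | y) = \prod_d p(x_d | y):
\ln\frac{p(y=1 \mid x)}{p(y=0 \mid x)}
  = \ln\frac{p(y=1)}{p(y=0)}
  + \sum_{d=1}^{D} \ln\frac{p(x_d \mid y=1)}{p(x_d \mid y=0)}

% Apply Bayes' rule to each factor, p(x_d | y) = p(y | x_d)\, p(x_d) / p(y):
\ln\frac{p(y=1 \mid x)}{p(y=0 \mid x)}
  = \sum_{d=1}^{D} \ln\frac{p(y=1 \mid x_d)}{p(y=0 \mid x_d)}
  - (D-1)\,\ln\frac{p(y=1)}{p(y=0)}

% Spreading the constant evenly over the D features gives one candidate transform:
\phi_d(x) = \ln\frac{p(y=1 \mid x_d)}{p(y=0 \mid x_d)}
  - \frac{D-1}{D}\,\ln\frac{p(y=1)}{p(y=0)},
\qquad
\tau(x) = w_0 + \sum_{d=1}^{D} w_d\, \phi_d(x)
```

With $w_0 = 0$ and every $w_d = 1$ this reduces to the naive-Bayes log-odds; learning the weights relaxes the independence assumption. Each $\phi_d(x)$ is increasing in $p(y=1 \mid x_d)$.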
Conditional Probability Estimation
Numerical $x_d$: kernel density estimation
Categorical $x_d$: smoothed histogram
Kernel density estimation
Training dataset:
where $h$ is the kernel bandwidth
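A minimal sketch of one common way to estimate $p(y=1 \mid x_d)$ for a single numerical feature with kernel density estimation: a Gaussian-kernel weighted vote over the training points (Nadaraya-Watson form). The function names, the Gaussian kernel choice, and the fallback to the class prior are assumptions for illustration, not taken from the slides.

```python
import math

def gaussian_kernel(u):
    # The 1/sqrt(2*pi) constant is omitted because it cancels in the ratio.
    return math.exp(-0.5 * u * u)

def kde_conditional_prob(x_query, xs, ys, bandwidth):
    """Estimate p(y = 1 | x_d = x_query) from one numerical feature.

    xs: training values of the feature, ys: binary labels (0 or 1),
    bandwidth: kernel bandwidth h for this feature.
    """
    weights = [gaussian_kernel((x_query - x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        # Query is far from all training data: fall back to the class prior.
        return sum(ys) / len(ys)
    return sum(w for w, y in zip(weights, ys) if y == 1) / total

xs = [0.0, 0.1, 0.2, 2.0, 2.1, 2.2]
ys = [0, 0, 0, 1, 1, 1]
p_low = kde_conditional_prob(0.1, xs, ys, bandwidth=0.3)
p_high = kde_conditional_prob(2.1, xs, ys, bandwidth=0.3)
```

Each query touches every training point, so the cost of one estimate grows linearly with the training set size.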
DLR Learning
Maximize the overall log likelihood
Objective:
A function of $w$ and $h$
Overview of DLR
• Initialize h and w
• Update w
• Calculate new feature vector
• Update h
• Converged? If no, repeat
Optimization
• Fix w and optimize h (steepest gradient descent)
• Fix h and optimize w (using an LR solver)
• Repeat until convergence
[Figure panels: initial h, iteration 1, iteration 2, iteration 3]
Interpretability
DLR:
For example, $y = 1$ represents a particular disease and $x_d$ represents the blood pressure (BP) of a patient.
On disease level: ranking $w_d$ can identify the risk factors of this disease.
On patient level: $\phi_d(x)$ indicates the abnormality of the patient's BP; $w_d \phi_d(x)$ indicates the extent to which BP contributes to the disease.
Kernel
Ideal kernel:
RBF kernel: does not consider the label information
DLR Kernel
DLR kernel:
indicates same label
indicates different label
DLR on example data
[Figure: test data classified by original LR and by density-based LR]
Accuracy on UCI Datasets
[Figure: accuracy on numerical and categorical datasets, with the better direction marked]
Training Time
[Figure: training time on numerical and categorical datasets, with the better direction marked]
Results on clinical data
Accuracy: LR: 0.9141, SVM: 0.9194, DLR: 0.9204
Early alert when the patient appears normal to the best doctors in the world
DLR for real large data
Estimation with kernel density smoothing:
• Still too slow for big data
• Testing time grows as the training set gets larger
Estimation with a histogram:
• No curse of dimensionality for estimation
• Ultra-fast training and testing
DLR with Bins
Not smooth Not enough data
Histogram KDE Smoothing
where is the number of label in bin i is the number of instances in bin i
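A sketch of the binned estimate with smoothing across neighboring bins. Equal-width bins, a Gaussian weight measured in bin indices, the `bandwidth_bins` parameter, and all function names are assumptions for illustration; the slide's own formula is not reproduced here.

```python
import math

def bin_counts(xs, ys, lo, hi, num_bins):
    """Count positive-label and total instances per equal-width bin."""
    width = (hi - lo) / num_bins
    pos = [0] * num_bins
    tot = [0] * num_bins
    for x, y in zip(xs, ys):
        # Clamp so values at or beyond the edges land in the end bins.
        i = min(max(int((x - lo) / width), 0), num_bins - 1)
        tot[i] += 1
        if y == 1:
            pos[i] += 1
    return pos, tot

def smoothed_bin_probs(pos, tot, bandwidth_bins=1.0):
    """Smooth per-bin estimates of p(y = 1 | bin) by pooling counts from
    nearby bins with Gaussian weights, so sparse or empty bins borrow
    strength from their neighbors."""
    n = len(tot)
    probs = []
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            k = math.exp(-0.5 * ((i - j) / bandwidth_bins) ** 2)
            num += k * pos[j]
            den += k * tot[j]
        probs.append(num / den if den > 0 else 0.5)
    return probs

pos, tot = bin_counts([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], 0.0, 1.0, 4)
probs = smoothed_bin_probs(pos, tot)
```

At test time an estimate is a single table lookup by bin index, independent of the training set size.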
Different Number of Bins
5 bins 20 bins 100 bins
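Results on accuracy

| Method | Splice (1K) | Mush (8K) | w5a (10K) | w8a (50K) | Adult (30K) | kddcup (1.26M) |
|---|---|---|---|---|---|---|
| linearSVM | 75 | 100 | 98.15 | 98.57 | 60.03 | 99.99 |
| LR | 77 | 99.87 | 97.67 | 98.24 | 84.80 | 99.99 |
| RBF SVM | 80 | 99.23 | 97.14 | 97.20 | 75.29 | N/A |
| DLR-b | 88 | 99.95 | 98.26 | 98.55 | 85.54 | 99.99 |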
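Results on efficiency

| Method | Splice (1K) | Mush (8K) | w5a (10K) | w8a (50K) | Adult (30K) | kddcup (1.26M) |
|---|---|---|---|---|---|---|
| linearSVM | 0.12 | 0.56 | 1.16 | 15 | 2847 | 81.70 |
| LR | 0.15 | 0.21 | 0.18 | 0.7 | 2.89 | 55.66 |
| RBF SVM | 0.09 | 1.63 | 1.60 | 29 | 217 | N/A |
| DLR-b | 0.22 | 0.32 | 2.65 | 7.6 | 0.6 | 17.93 |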
Feature Selection Ability
DLR:
• l1-regularization: $\text{loss}(w) + c \sum_d |w_d|$ (non-smooth optimization)
• However, in DLR, we can simply use $c \sum_d w_d$ along with constraints $w_d \ge 0$ (smooth optimization)
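The constrained form can be sketched with projected gradient descent: the penalty $c \sum_d w_d$ adds a constant $c$ to each weight's gradient, and the constraint $w_d \ge 0$ is enforced by clipping at zero after every step. The solver choice, function names, step size, and toy data below are assumptions for illustration, not the optimizer used in the talk.

```python
import math

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_nonneg_l1_lr(features, labels, c=0.1, lr=0.5, steps=500):
    """Minimize mean log loss + c * sum(w) subject to w >= 0.

    features: rows of already-transformed feature values,
    labels: 0/1. The bias term is neither penalized nor constrained.
    """
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for phi, y in zip(features, labels):
            err = sigmoid(b + sum(wi * pi for wi, pi in zip(w, phi))) - y
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * phi[j] / n
        b -= lr * grad_b
        # Gradient step on loss + c * w_j, then projection onto w_j >= 0.
        w = [max(0.0, wj - lr * (gj + c)) for wj, gj in zip(w, grad_w)]
    return w, b

# Toy data: the first feature separates the classes, the second is noise.
features = [[1.0, 0.3], [1.2, -0.3], [0.8, 0.3], [1.0, -0.3],
            [-1.0, 0.3], [-1.2, -0.3], [-0.8, 0.3], [-1.0, -0.3]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
w, b = fit_nonneg_l1_lr(features, labels)
```

Weights clipped to exactly zero correspond to discarded features, which is how the penalty performs feature selection.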
Top features selected by DLR
standard deviation of heart rate
ApEn of heart rate
Energy of oxygen saturation
LF of oxygen saturation
LF of heart rate
DFA of oxygen saturation
Mean of heart rate
HF of heart rate
Inertia of heart rate
Homogeneity of heart rate
Energy of heart rate
Linear correlation of heart rate and oxygen saturation
Conclusions on DLR
DLR satisfies all the following:
• Nonlinear classification ability
• Support for mixed data types
• Interpretability
• Efficiency
• Multi-class classification
Try it out! http://www.cse.wustl.edu/~wenlinchen/project/DLR/
Big Data Algorithms
• Hybrid!
- Non-parametric + parametric
- Association + causality
- Generative + discriminative
- Balance accuracy and speed
• For real big data, get rid of heavy machinery
- Let accuracy grow with data size
• A linear model would suffice with enough nonlinearity/randomness
Thank you
Challenges of the big data era:
McKinsey Global Institute report: big data talent is scarce
Talent
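| Property | kNN | NB | NN | LR | Linear SVM | Kernel SVM | Random Kitchen Sinks |
|---|---|---|---|---|---|---|---|
| Nonlinear classification ability | Y | N | Y | N | N | Y | Y |
| Interpretability | N | Y | N | Y | Y | N | N |
| Direct support for mixed data types | Y | Y | N | N | N | N | N |
| Efficiency | Y | Y | Y | Y | Y | N | Y |
| Multi-class classification | Y | Y | Y | Y | N | N | N |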
RKS: Linear model over nonlinear features
RBF SVM: k(x,x’) =
Gaussian Naive Bayes Assumption:
Gaussian:
LR and GNB
Both GNB and LR express in a linear model
GNB learns under the GNB assumption
LR learns using maximum likelihood of the data
Assumption:
Motivation
NB LR
Assumption:
Motivation GNB Assumption:
Factorizing by
Factorizing by Naïve Bayes