LOW/HIGH-RISK DIABETES
GROUP SEGMENTATION
USING α-TREES
Anurekha Ramakrishnan¹, Yubin Park², Joydeep Ghosh²
¹Dept. of Statistics and Scientific Computation
²Dept. of Electrical and Computer Engineering
The University of Texas at Austin
AMA-IEEE Medical Technology Conference 2011
Ramakrishnan/Park/Ghosh
Barriers to Machine Learning Adoption in Healthcare
• Class-imbalance
– Target ratios are often extremely skewed.
• Mismatch with Performance Metrics
– Misclassification rates may not be relevant: asymmetric costs are involved.
– "Sensitivity/Specificity" or "Lift" should be a part of learning goals.
• Interpretation of Results
– Simple AND/OR Rules (in Natural Language) are desirable.
• We suggest a possible solution for these problems using:
– Modified α-Trees,
– Disjunctive Combination of Rules.
Objectives
Original Data: Healthcare Data (e.g. BRFSS, ‘+’ class: 12%)
→ α-Tree segmentation →
• Low Risk Group (e.g. <3%, <1% ‘+’ class)
• High Risk Group (e.g. >12% ‘+’ class)
Other Requirements:
1. Interpretable segmentation: AND, OR rules in natural language.
2. Extensive coverage using simple rules.
Note: These objectives differ from traditional machine learning objectives. They are based on observations of many failed Medical Decision Support systems.
BRFSS Dataset
• Behavioral Risk Factor Surveillance System
– URL: http://www.cdc.gov/brfss/
• The world's largest ongoing telephone health survey, conducted since 1984.
• Tracks health conditions and risk behaviors in the United States.
• Contains information on a variety of diseases
– e.g. diabetes, hypertension, cancer, asthma, HIV, etc.
• More than 400,000 records per year.
• Many states use BRFSS data to support health-related legislative efforts.
α-Tree¹
• A Decision Tree Algorithm (like CART, C4.5)
– Decision criterion: α-Divergence.
– Generalizes C4.5.
– Robust performance in class-imbalance settings.
– Tree growth stops when a Low/High-risk group is obtained (modified α-Tree).
• Different α values result in different decision rules.
– Decision trees provide greedy solutions (sub-optimal solutions).
– By disjunctively combining the solutions from different α-Trees, we can approach a better solution.
– Python Code available (http://www.ideal.ece.utexas.edu/~yubin/)
1. Y. Park and J. Ghosh, “Compact Ensemble Trees for Imbalanced Data,” in 10th International Workshop on Multiple Classifier Systems, Italy, June 2011.
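A minimal sketch of how an α-divergence split criterion might be computed, to make the "generalizes C4.5" bullet concrete. It assumes the criterion is the α-divergence between the joint distribution of (feature, class) and the product of their marginals, and it assumes α > 0; the function names `alpha_divergence` and `split_score` are illustrative and are not taken from the authors' Python code.

```python
import math
from collections import Counter


def alpha_divergence(p, q, alpha):
    """Alpha-divergence D_alpha(p || q) of two discrete distributions.

    Assumed form: sum(a*p + (1-a)*q - p**a * q**(1-a)) / (a*(1-a)).
    Requires alpha > 0 and q > 0 in every cell.
    """
    if alpha <= 0:
        raise ValueError("this sketch only handles alpha > 0")
    if abs(alpha - 1.0) < 1e-9:
        # Limit alpha -> 1 is the KL divergence KL(p || q).
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    total = sum(
        alpha * pi + (1 - alpha) * qi - (pi ** alpha) * (qi ** (1 - alpha))
        for pi, qi in zip(p, q)
    )
    return total / (alpha * (1 - alpha))


def split_score(xs, ys, alpha):
    """Score a feature: divergence between joint P(x, y) and P(x) * P(y)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    joint, product = [], []
    for x in px:
        for y in py:
            joint.append(pxy[(x, y)] / n)           # empirical joint
            product.append(px[x] * py[y] / n ** 2)  # product of marginals
    return alpha_divergence(joint, product, alpha)
```

Under these assumptions, α = 1 reduces the score to the mutual information between feature and class (the information gain used by C4.5), and a feature that is independent of the class scores zero for any α.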
3-Phase Diagram
Original Data (BRFSS data)
→ Rule Generation using α-Trees (α=0.1, α=1, α=2.0, ...)
→ Disjunctive Combination
→ Extensive Rule Set
Example: the High-risk group is defined as the group with a Diabetes Rate above 24%, i.e. twice the rate of the normal population.
Rule 1: RFHYPE5 = 1 & AGE_G >= 5.0 & RFHLTH = 2 & BMI4CAT >= 2.0 (from α=0.1)
OR Rule 2: RFHYPE5 ≠ 1 & RFHLTH = 1 & BMI4CAT >= 2.9 & PNEUVAC3 = 1 (from α=1.0)
OR Rule 3: RFHYPE5 = 2 & RFHLTH ≠ 1 (from α=1.5)
OR …
These combined rules extract High-risk Diabetes Segments (>24%).
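The disjunctive combination can be sketched as OR-ing rule predicates and then measuring coverage and diabetes rate on the selected records. The three rules are copied from the example above; the records are invented toy values for illustration only (not BRFSS data), and the helper names `combine_or` and `coverage_and_rate` are hypothetical.

```python
# Toy records: field names follow the slide, values are invented.
records = [
    {"RFHYPE5": 1, "AGE_G": 5, "RFHLTH": 2, "BMI4CAT": 2.0, "PNEUVAC3": 2, "diabetes": 1},
    {"RFHYPE5": 2, "AGE_G": 3, "RFHLTH": 1, "BMI4CAT": 3.0, "PNEUVAC3": 1, "diabetes": 0},
    {"RFHYPE5": 2, "AGE_G": 6, "RFHLTH": 2, "BMI4CAT": 2.0, "PNEUVAC3": 2, "diabetes": 1},
    {"RFHYPE5": 1, "AGE_G": 2, "RFHLTH": 1, "BMI4CAT": 1.0, "PNEUVAC3": 2, "diabetes": 0},
    {"RFHYPE5": 1, "AGE_G": 4, "RFHLTH": 1, "BMI4CAT": 2.0, "PNEUVAC3": 2, "diabetes": 0},
    {"RFHYPE5": 2, "AGE_G": 5, "RFHLTH": 2, "BMI4CAT": 3.0, "PNEUVAC3": 1, "diabetes": 0},
]


def rule_1(r):  # from alpha = 0.1
    return (r["RFHYPE5"] == 1 and r["AGE_G"] >= 5.0
            and r["RFHLTH"] == 2 and r["BMI4CAT"] >= 2.0)


def rule_2(r):  # from alpha = 1.0
    return (r["RFHYPE5"] != 1 and r["RFHLTH"] == 1
            and r["BMI4CAT"] >= 2.9 and r["PNEUVAC3"] == 1)


def rule_3(r):  # from alpha = 1.5
    return r["RFHYPE5"] == 2 and r["RFHLTH"] != 1


def combine_or(rules):
    """Disjunctive combination: a record matches if any single rule matches."""
    return lambda r: any(rule(r) for rule in rules)


def coverage_and_rate(rows, predicate, label="diabetes"):
    """Return (fraction of rows selected, positive rate among selected rows)."""
    selected = [r for r in rows if predicate(r)]
    coverage = len(selected) / len(rows)
    rate = sum(r[label] for r in selected) / len(selected) if selected else 0.0
    return coverage, rate


single = coverage_and_rate(records, rule_3)
combined = coverage_and_rate(records, combine_or([rule_1, rule_2, rule_3]))
```

Comparing `single` with `combined` shows the intended effect: each added rule can only extend the covered segment, while the rate of the combined segment has to be re-checked against the risk threshold.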
Example Tree Structure
[Figure: example tree with Yes/No branches. Node tests shown: RFHLTH = 2?, RFHYPE5 = 2?, BMI4CAT < 2.9, PNEUVAC3 = 2?, RACE2 = 2?, RACE2 = 1?, SEX = 2?, INCOMG >= 4?, PNEUVAC3 = 1, RFCHOL = 2?, AGE_G < 5?. Percentages shown at nodes: 29.2%, 12.25%, 16.55%, 19.3%, 25.3%, 32.97%, 19.39%, 31.14%, 6.7%, 16.7%.]
When α=2.0, a total of five High-risk segmentation rules are extracted.
Different α values result in different tree structures.
Results for Twice Higher Diabetes Rate Group (High-risk)
Resultant Rules from α-Trees.
1. RFHYPE5 = 2 & RFHLTH ≠ 1
2. RFHYPE5 ≠ 2 & RFHLTH = 2 & RFCHOL = 2
3. …
English Translation
Segment 1: They have high blood pressure and consider themselves unhealthy (including not responding to this question).
Segment 2: They have high cholesterol and consider themselves unhealthy, but they don’t have high blood pressure.
…
[Figure: coverage plot; y-axis "Coverage" from 0 to 0.3, x-axis from 1 to 14.]
Results for Four-times Lower Diabetes Rate Group (Low-risk)
[Figure: coverage plot; y-axis "Coverage" from 0.42 to 0.51.]
Resultant Rules from α-Trees.
1. RFHYPE5 ≠ 2 and RFHLTH ≠ 2 and PNEUVAC3 ≠ 1
2. RFHYPE5 = 1 and RFHLTH ≠ 2 and AGE_G < 5.0
3. …
English Translation
Segment 1: They don’t have high blood pressure and consider themselves healthy. They had a pneumonia shot at least once in their lifetime.
Segment 2: They have high blood pressure, but consider themselves healthy and are under 50 yrs of age.
…
Appendix A
• α-Divergence
• Special cases
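For reference, one commonly used form of the α-divergence between distributions p and q, with its standard special cases. This is given as an assumption about the definition the slide refers to; the authors' exact notation is not shown in this transcript.

```latex
D_\alpha(p \,\|\, q) =
  \frac{\int \alpha\, p(x) + (1-\alpha)\, q(x) - p(x)^{\alpha}\, q(x)^{1-\alpha} \, dx}
       {\alpha (1-\alpha)}

\lim_{\alpha \to 1} D_\alpha(p \,\|\, q) = KL(p \,\|\, q), \qquad
\lim_{\alpha \to 0} D_\alpha(p \,\|\, q) = KL(q \,\|\, p)

D_{1/2}(p \,\|\, q) = 2 \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx, \qquad
D_{2}(p \,\|\, q) = \frac{1}{2} \int \frac{\left( p(x) - q(x) \right)^2}{q(x)} \, dx
```

In this form, α = 1/2 is proportional to the squared Hellinger distance and α = 2 to the χ² divergence.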
Appendix B
• Modified α-Tree Algorithm
• Input: BRFSS (input data), α (parameter)
• Output: Low/High-risk group extraction rules
• Select the best feature, i.e. the one that maximizes the α-divergence criterion.
– If (no such feature)
or (number of data points < cut-off size)
or (This group is a low/high-risk group)
then stop its growth.
– Else
Segment the input data based on the best feature.
Recursively run Modified α-Tree Algorithm( segmented data, α)
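The steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: features are treated as categorical with one branch per value, the criterion is assumed to be the α-divergence between the joint (feature, class) distribution and the product of its marginals, α > 0, groups below the cut-off size are dropped without emitting a rule, and the thresholds `low`, `high`, and `min_size` are hypothetical parameters. The toy rows are invented.

```python
import math
from collections import Counter


def alpha_divergence(p, q, alpha):
    """D_alpha(p || q) for discrete p, q with q > 0; assumes alpha > 0."""
    if abs(alpha - 1.0) < 1e-9:  # KL limit at alpha = 1
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    s = sum(alpha * pi + (1 - alpha) * qi - pi ** alpha * qi ** (1 - alpha)
            for pi, qi in zip(p, q))
    return s / (alpha * (1 - alpha))


def criterion(rows, feature, label, alpha):
    """Divergence between joint P(feature, label) and P(feature) * P(label)."""
    n = len(rows)
    px = Counter(r[feature] for r in rows)
    py = Counter(r[label] for r in rows)
    pxy = Counter((r[feature], r[label]) for r in rows)
    cells = [(pxy[(x, y)] / n, px[x] * py[y] / n ** 2) for x in px for y in py]
    return alpha_divergence([c[0] for c in cells], [c[1] for c in cells], alpha)


def modified_alpha_tree(rows, features, alpha, label="diabetes",
                        low=0.03, high=0.24, min_size=2, path=()):
    """Return [(conditions, positive_rate, size)] for low/high-risk groups."""
    # Stop: number of data points below the cut-off size (no rule emitted).
    if len(rows) < min_size:
        return []
    rate = sum(r[label] for r in rows) / len(rows)
    # Stop: this group is already a low/high-risk group; emit its rule.
    if rate < low or rate > high:
        return [(path, rate, len(rows))]
    # Select the feature with the maximum alpha-divergence criterion.
    scored = [(criterion(rows, f, label, alpha), f) for f in features]
    scored = [(s, f) for s, f in scored if s > 1e-12]
    if not scored:  # Stop: no informative feature is left.
        return []
    _, best = max(scored)
    rest = [f for f in features if f != best]
    rules = []
    # Segment the data on the best feature and recurse on each segment.
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        rules += modified_alpha_tree(subset, rest, alpha, label, low, high,
                                     min_size, path + ((best, value),))
    return rules


# Invented toy rows: 20 records, 4 positives.
rows = (
    [{"RFHLTH": 2, "RFHYPE5": 2, "diabetes": 1}] * 3
    + [{"RFHLTH": 2, "RFHYPE5": 2, "diabetes": 0}] * 1
    + [{"RFHLTH": 1, "RFHYPE5": 2, "diabetes": 1}] * 1
    + [{"RFHLTH": 1, "RFHYPE5": 2, "diabetes": 0}] * 5
    + [{"RFHLTH": 1, "RFHYPE5": 1, "diabetes": 0}] * 10
)
rules = modified_alpha_tree(rows, ["RFHYPE5", "RFHLTH"], alpha=2.0)
```

Each returned entry is a conjunction of (feature, value) conditions, which maps directly onto the AND rules shown on the results slides; running the function with several α values and OR-ing the resulting rules gives the disjunctive combination.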