LOW/HIGH-RISK DIABETES GROUP SEGMENTATION USING α-TREES

Anurekha Ramakrishnan (1), Yubin Park (2), Joydeep Ghosh (2)

(1) Dept. of Statistics and Scientific Computation, (2) Dept. of Electrical and Computer Engineering

The University of Texas at Austin

AMA-IEEE Medical Technology Conference 2011




Barriers to Machine Learning Adoption in Healthcare

• Class imbalance

– Target ratios are often extremely skewed.

• Mismatch with performance metrics

– Misclassification rates may not be relevant.

– Asymmetric costs are involved.

– "Sensitivity/Specificity" or "Lift" should be part of the learning goals.

• Interpretation of Results

– Simple AND/OR Rules (in Natural Language) are desirable.

• We suggest a possible solution for these problems using:

– Modified α-Trees,

– Disjunctive Combination of Rules.
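To make the metric mismatch above concrete, here is a minimal sketch on synthetic labels with a 12% positive rate (the rate quoted later for BRFSS diabetes). The function names and both example classifiers are illustrative assumptions, not part of the original talk.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Accuracy, sensitivity, specificity and lift of the flagged group."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    n = tp + fp + tn + fn
    base_rate = (tp + fn) / n
    flagged = tp + fp
    precision = tp / flagged if flagged else 0.0
    return {
        "accuracy": (tp + tn) / n,
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "lift": precision / base_rate if base_rate else 0.0,
    }


# Synthetic data (hypothetical): 12 positives out of 100 records.
y_true = [1] * 12 + [0] * 88
always_negative = [0] * 100                          # never flags anyone
targeted = [1] * 8 + [0] * 4 + [1] * 12 + [0] * 76   # flags 20, 8 of them positive

print(summarize(y_true, always_negative))
print(summarize(y_true, targeted))
```

The classifier that never flags anyone has the higher accuracy but zero sensitivity and zero lift, while the targeted one trades some accuracy for a flagged group whose positive rate is several times the base rate.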


Objectives

[Diagram: original data → α-Tree segmentation → low-risk group and high-risk group]

• Original data: healthcare data (e.g. BRFSS, ‘+’ class: 12%)

• Low-risk group (e.g. <3%, <1% ‘+’ class)

• High-risk group (e.g. >12% ‘+’ class)

Other requirements:

1. Interpretable segmentation: AND/OR rules in natural language.

2. Extensive coverage using simple rules.

Note: These objectives differ from traditional machine learning objectives. They are based on observations of many failed medical decision support systems.


BRFSS Dataset

• Behavioral Risk Factor Surveillance System

– URL: http://www.cdc.gov/brfss/

• The largest ongoing telephone health survey, conducted since 1984.

• Tracks health conditions and risk behaviors in the United States.

• Contains information on a variety of diseases

– e.g. diabetes, hypertension, cancer, asthma, HIV, etc.

• More than 400,000 records per year.

• Many states use BRFSS data to support health-related legislative efforts.


α-Tree¹

• A Decision Tree Algorithm (e.g. CART, C4.5)

– Decision criterion: α-Divergence.

– Generalizes C4.5.

– Robust performance in class-imbalance settings.

– Growth stops as soon as a low/high-risk group is obtained (modified α-Tree).

• Different α values result in different decision rules.

– Decision trees provide greedy (sub-optimal) solutions.

– By disjunctively combining the solutions from different α-Trees, we can approach a better solution.

– Python Code available (http://www.ideal.ece.utexas.edu/~yubin/)

1. Y. Park and J. Ghosh, “Compact Ensemble Trees for Imbalanced Data,” in 10th International Workshop on Multiple Classifier Systems, Italy, June 2011.
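The splitting criterion described on this slide can be sketched as follows. This is a minimal illustration, not the authors' released code: it assumes the criterion is the α-divergence between the joint distribution of (feature value, class) and the product of their marginals, using the standard discrete form of the divergence. At α = 1 this reduces to mutual information, i.e. information gain. The function names are hypothetical.

```python
import math
from collections import Counter


def alpha_divergence(p, q, alpha):
    """D_alpha(p || q) for discrete distributions, assuming alpha > 0 and q > 0.

    Uses sum(alpha*p + (1-alpha)*q - p**alpha * q**(1-alpha)) / (alpha*(1-alpha)),
    with the KL(p || q) limit at alpha = 1.
    """
    if alpha <= 0:
        raise ValueError("this sketch only handles alpha > 0")
    total = 0.0
    for pi, qi in zip(p, q):
        if abs(alpha - 1.0) < 1e-9:
            # Limit alpha -> 1: generalized KL, with 0 * log(0) taken as 0.
            total += (pi * math.log(pi / qi) if pi > 0 else 0.0) - pi + qi
        else:
            total += (alpha * pi + (1 - alpha) * qi
                      - pi ** alpha * qi ** (1 - alpha)) / (alpha * (1 - alpha))
    return total


def split_score(feature_values, labels, alpha):
    """Score a candidate feature: divergence of the joint from independence."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    fx = Counter(feature_values)
    fy = Counter(labels)
    p, q = [], []
    for x in fx:
        for y in fy:
            p.append(joint[(x, y)] / n)          # observed joint probability
            q.append(fx[x] / n * fy[y] / n)      # product of marginals
    return alpha_divergence(p, q, alpha)


labels = [0, 1, 0, 1]
print(split_score([0, 0, 1, 1], labels, alpha=1.0))  # feature independent of class
print(split_score([0, 1, 0, 1], labels, alpha=1.0))  # feature identical to class
print(split_score([0, 1, 0, 1], labels, alpha=2.0))
```

A feature that is independent of the class scores zero for every α, while different α values weight the same dependence differently, which is why they can rank candidate features differently and yield different trees.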


3-Phase Diagram

[Diagram: original data (BRFSS data) → rule generation using α-Trees (α = 0.1, α = 1, α = 2.0, ...) → disjunctive combination → extensive rule set]

Example: the high-risk group is defined as any group with a diabetes rate above 24%, i.e. twice the rate of the general population.

Rule 1: RFHYPE5 = 1 & AGE_G >= 5.0 & RFHLTH = 2 & BMI4CAT >= 2.0 (from α = 0.1)

OR Rule 2: RFHYPE5 ≠ 1 & RFHLTH = 1 & BMI4CAT >= 2.9 & PNEUVAC3 = 1 (from α = 1.0)

OR Rule 3: RFHYPE5 = 2 & RFHLTH ≠ 1 (from α = 1.5)

OR …

These combined rules extract High-risk Diabetes Segments (>24%).
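The disjunctive combination above can be sketched in a few lines. Rule 1 and Rule 3 are copied from the slide; the four records, the `DIABETES` column name, and the helper functions are made up for illustration and are not BRFSS data.

```python
def rule_1(r):
    """Rule 1 from the slide: a conjunction (AND) of simple tests."""
    return (r["RFHYPE5"] == 1 and r["AGE_G"] >= 5.0
            and r["RFHLTH"] == 2 and r["BMI4CAT"] >= 2.0)


def rule_3(r):
    """Rule 3 from the slide."""
    return r["RFHYPE5"] == 2 and r["RFHLTH"] != 1


def any_rule(rules):
    """Disjunctive (OR) combination: a record matches if any rule fires."""
    return lambda record: any(rule(record) for rule in rules)


def coverage_and_rate(records, predicate, target="DIABETES"):
    """Fraction of records matched, and the positive rate among the matches."""
    matched = [r for r in records if predicate(r)]
    coverage = len(matched) / len(records)
    rate = sum(r[target] for r in matched) / len(matched) if matched else 0.0
    return coverage, rate


# Hypothetical records, only to exercise the code.
records = [
    {"RFHYPE5": 2, "RFHLTH": 2, "AGE_G": 6, "BMI4CAT": 3, "DIABETES": 1},
    {"RFHYPE5": 1, "RFHLTH": 2, "AGE_G": 5, "BMI4CAT": 2, "DIABETES": 0},
    {"RFHYPE5": 1, "RFHLTH": 1, "AGE_G": 2, "BMI4CAT": 1, "DIABETES": 0},
    {"RFHYPE5": 2, "RFHLTH": 1, "AGE_G": 4, "BMI4CAT": 2, "DIABETES": 0},
]

combined = any_rule([rule_1, rule_3])
print(coverage_and_rate(records, rule_1))
print(coverage_and_rate(records, combined))
```

Because each rule comes from a different tree, OR-ing them can only widen coverage; whether the combined segment still exceeds the risk threshold has to be checked on the data.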


Example Tree Structure

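[Tree diagram with Yes/No branches. Node tests shown: RFHLTH = 2?, RFHYPE5 = 2?, BMI4CAT < 2.9, PNEUVAC3 = 2?, PNEUVAC3 = 1, RACE2 = 2?, RACE2 = 1?, SEX = 2?, INCOMG >= 4?, RFCHOL = 2?, AGE_G < 5?. Percentages shown at nodes: 29.2%, 12.25%, 16.55%, 19.3%, 25.3%, 32.97%, 19.39%, 31.14%, 6.7%, 16.7%. Parts of the tree are elided with "…".]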

When α = 2.0, a total of five high-risk segmentation rules are extracted.

Different α values result in different tree structures.


Results for the Group with Twice the Diabetes Rate (High-risk)

Resultant rules from α-Trees:

1. RFHYPE5 = 2 & RFHLTH ≠ 1

2. RFHYPE5 ≠ 2 & RFHLTH = 2 & RFCHOL = 2

3. …

English Translation

Segment 1: They have high blood pressure and consider themselves unhealthy (including those who did not respond to this question).

Segment 2: They have high cholesterol and consider themselves unhealthy, but they do not have high blood pressure.

…

[Plot: coverage (y-axis, 0 to 0.3) against 1 to 14 on the x-axis]
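The x-axis label of the coverage plot did not survive in the transcript. Assuming it counts how many rules have been OR-ed together so far, a cumulative coverage curve could be computed as in this sketch. The masks below are invented for illustration.

```python
def cumulative_coverage(masks):
    """masks[k][i] is True when rule k matches record i.

    Returns, for each k, the fraction of records matched by at least one
    of the rules 0..k (i.e. coverage of their disjunction).
    """
    n = len(masks[0])
    covered = [False] * n
    curve = []
    for mask in masks:
        covered = [c or m for c, m in zip(covered, mask)]
        curve.append(sum(covered) / n)
    return curve


# Hypothetical match masks for three rules over five records.
masks = [
    [True, False, False, False, False],
    [True, True, False, False, False],
    [False, False, False, True, False],
]
print(cumulative_coverage(masks))
```

The curve is non-decreasing by construction, and a rule that only matches already-covered records adds nothing to it.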


Results for the Group with a Four-times Lower Diabetes Rate (Low-risk)

[Plot: coverage (y-axis, 0.42 to 0.51)]

Resultant rules from α-Trees:

1. RFHYPE5 ≠ 2 and RFHLTH ≠ 2 and PNEUVAC3 ≠ 1

2. RFHYPE5 = 1 and RFHLTH ≠ 2 and AGE_G < 5.0

3. …

English Translation

Segment 1: They don’t have high blood pressure and consider themselves healthy. They had a pneumonia shot at least once in their lifetime.

Segment 2: They have high blood pressure, but consider themselves healthy and are under 50 years of age.

…


Appendix A

• α-Divergence

• Special cases
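The formulas on this slide were images and are missing from the transcript. As a hedged reconstruction, assuming the slide used the standard definition of the α-divergence between two discrete distributions p and q (this form, and the choice of special cases listed, are our assumption rather than a copy of the slide):

```latex
% Alpha-divergence (alpha \neq 0, 1)
D_\alpha(p \,\|\, q) =
  \frac{\sum_x \left[ \alpha\, p(x) + (1-\alpha)\, q(x)
        - p(x)^{\alpha}\, q(x)^{1-\alpha} \right]}{\alpha (1-\alpha)}

% Special cases
\lim_{\alpha \to 1} D_\alpha(p \,\|\, q) = \mathrm{KL}(p \,\|\, q)
\qquad
\lim_{\alpha \to 0} D_\alpha(p \,\|\, q) = \mathrm{KL}(q \,\|\, p)

D_{1/2}(p \,\|\, q) = 2 \sum_x \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2
\qquad
D_{2}(p \,\|\, q) = \frac{1}{2} \sum_x \frac{\left( p(x) - q(x) \right)^2}{q(x)}
```

Here α = 1/2 is proportional to the squared Hellinger distance and α = 2 to the Pearson χ² statistic. When p is the joint distribution of a feature and the class and q is the product of their marginals, the α → 1 case is the mutual information, i.e. information gain.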


Appendix B

• Modified α-Tree Algorithm

• Input: BRFSS (input data), α (parameter)

• Output: low/high-risk group extraction rules

• Select the best feature, i.e. the one that maximizes the α-divergence criterion.

– If (no such feature exists) or (number of data points < cut-off size) or (this group is a low/high-risk group), then stop growing this branch.

– Else, segment the input data on the best feature and recursively run the Modified α-Tree Algorithm (segmented data, α) on each segment.
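The recursion above can be sketched as runnable Python. This is a minimal illustration under stated assumptions, not the authors' released code: it splits on every distinct value of a categorical feature (the example tree in the talk appears to use binary Yes/No tests), scores features by the α-divergence between the joint distribution and the product of marginals, and uses hypothetical names (`grow`, `HYPER`, `UNHEALTHY`, `DIAB`) and synthetic rows.

```python
import math
from collections import Counter


def alpha_div(p, q, alpha):
    """Alpha-divergence between discrete p and q; assumes alpha > 0, q > 0."""
    total = 0.0
    for pi, qi in zip(p, q):
        if abs(alpha - 1.0) < 1e-9:  # KL limit at alpha = 1
            total += (pi * math.log(pi / qi) if pi > 0 else 0.0) - pi + qi
        else:
            total += (alpha * pi + (1 - alpha) * qi
                      - pi ** alpha * qi ** (1 - alpha)) / (alpha * (1 - alpha))
    return total


def split_score(rows, feature, target, alpha):
    """Divergence of the (feature, target) joint from independence."""
    n = len(rows)
    joint = Counter((r[feature], r[target]) for r in rows)
    fx = Counter(r[feature] for r in rows)
    fy = Counter(r[target] for r in rows)
    p = [joint[(x, y)] / n for x in fx for y in fy]
    q = [fx[x] * fy[y] / n ** 2 for x in fx for y in fy]
    return alpha_div(p, q, alpha)


def grow(rows, features, target, alpha, low, high, min_size, path=()):
    """Return leaves as (rule, label, size, rate).

    rule is a tuple of (feature, value) equality tests joined by AND.
    """
    rate = sum(r[target] for r in rows) / len(rows)
    if rate <= low:                      # low-risk group reached: stop
        return [(path, "low", len(rows), rate)]
    if rate >= high:                     # high-risk group reached: stop
        return [(path, "high", len(rows), rate)]
    scored = [(split_score(rows, f, target, alpha), f) for f in features]
    scored = [(s, f) for s, f in scored if s > 1e-12]
    if len(rows) < min_size or not scored:   # too small, or no useful feature
        return [(path, "other", len(rows), rate)]
    _, best = max(scored)
    rest = [f for f in features if f != best]
    leaves = []
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        leaves += grow(subset, rest, target, alpha, low, high, min_size,
                       path + ((best, value),))
    return leaves


def make_rows(hyper, unhealthy, n, positives):
    """Synthetic rows: the first `positives` of n are labeled positive."""
    return [{"HYPER": hyper, "UNHEALTHY": unhealthy, "DIAB": int(i < positives)}
            for i in range(n)]


rows = (make_rows(1, 1, 10, 6) + make_rows(1, 0, 10, 2)
        + make_rows(0, 1, 10, 1) + make_rows(0, 0, 30, 0))
leaves = grow(rows, ["HYPER", "UNHEALTHY"], "DIAB", alpha=1.0,
              low=0.03, high=0.5, min_size=5)
for rule, label, size, rate in leaves:
    print(label, rule, size, round(rate, 3))
```

Each leaf labeled "low" or "high" is one conjunctive rule; running `grow` with several α values and OR-ing the resulting rules gives the disjunctive rule set described earlier in the talk.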