text classification for healthcare information support

Post on 04-Jan-2016

Page 1: Text Classification for Healthcare Information Support

Text Classification for Healthcare Information Support

Rey-Long Liu (劉瑞瓏)

Dept. of Medical Informatics

Tzu Chi University, Taiwan

Page 2: Text Classification for Healthcare Information Support

Background

Text categorization (TC) as a fundamental component for information processing
– Many TC techniques were developed

Unfortunately, high-quality TC is often an unrealizable ideal
– Very high precision
– Very high recall

Page 3: Text Classification for Healthcare Information Support

Background (Cont.)

An application scenario: healthcare information support

[Figure: application scenario diagram. Components: General Users (e.g. patients), Healthcare Professionals, Information Gathering Systems, Classified Information Base, and High-Quality TC. Labeled flows: Inquiry, Classified Inquiry, Query, Consultancy, Information Gathered, Relevant Information Classified, Classified Information, and Classification Confirmation.]

Page 4: Text Classification for Healthcare Information Support

Outline

Interaction as an approach to high-quality TC
– Main consideration: reducing the amount of interaction
– Criteria & straightforward interaction strategies

An intelligent interaction strategy: COM (Content Overlapping Measurement)

Empirical evaluation
– Chinese cancer texts classification

Conclusion

Page 5: Text Classification for Healthcare Information Support

Interaction for High-Quality TC

Interaction with the user
– Possibly a “final” approach
– More application scenarios

Information recommendation & archiving
– Definitely relevant vs. potentially relevant

Main consideration
– Reducing the number of interactions

Page 6: Text Classification for Healthcare Information Support

Interaction for High-Quality TC (Cont.)

Evaluation criteria

– Confirmation Precision (CP): related to the cognitive load on users

$$\mathrm{CP} = \frac{\#\ \text{wrong decisions identified}}{\#\ \text{decisions identified as potentially wrong}} = \frac{\#\ \text{necessary confirmations conducted}}{\#\ \text{confirmations conducted}}$$

– Confirmation Recall (CR): related to the quality of TC

$$\mathrm{CR} = \frac{\#\ \text{wrong decisions identified}}{\#\ \text{wrong decisions that should be identified}} = \frac{\#\ \text{necessary confirmations conducted}}{\#\ \text{confirmations that should be conducted}}$$
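The two criteria can be sketched in a few lines of Python. This is only an illustration: the function name, the use of sets of decision ids, and the convention of returning 0.0 for an empty denominator are assumptions, not part of the slides.

```python
def confirmation_metrics(flagged, wrong):
    """Return (CP, CR) for one run.

    flagged: set of decision ids the system sent to the user for confirmation
    wrong:   set of decision ids on which the classifier was actually wrong
    Both argument names are hypothetical; the slides define only the ratios.
    """
    # A confirmation is "necessary" when it targets a decision that was wrong.
    necessary = flagged & wrong
    # Assumed convention: an empty denominator yields 0.0.
    cp = len(necessary) / len(flagged) if flagged else 0.0
    cr = len(necessary) / len(wrong) if wrong else 0.0
    return cp, cr


# Toy data: 4 confirmations conducted, 3 wrong decisions, 2 of them caught.
flagged = {"d1", "d2", "d3", "d4"}
wrong = {"d2", "d4", "d7"}
cp, cr = confirmation_metrics(flagged, wrong)
print(cp, cr)
```

A low CP means the user is asked about many decisions that were already correct, while a low CR means wrong decisions slip through without being confirmed.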

Page 7: Text Classification for Healthcare Information Support

Interaction for High-Quality TC (Cont.)

Straightforward interaction strategies

Uniform Confirmation (UC): Preferring CR

(A) Setting two thresholds to identify the DOA range for confirmation (o: positive validation document; x: negative validation document)

[Figure: validation documents (o, x) ordered along a DOA axis from Min DOA to Max DOA, with the Rejection Threshold (RT) and the Acceptance Threshold (AT) marked]

(B) Confirmation strategy:

Prob = 0 (when DOA(d, c) > AT)

Prob = 0 (when DOA(d, c) < RT)

Prob = 1.0 (when RT ≤ DOA(d, c) ≤ AT)
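As a minimal sketch, the UC rule reduces to a range check on the DOA value. The function and parameter names below are hypothetical, and the boundaries are assumed to be inclusive.

```python
def uc_confirmation_prob(doa, rt, at):
    """Uniform Confirmation: probability of asking the user to confirm.

    doa: the classifier's DOA value for document d and category c
    rt:  rejection threshold
    at:  acceptance threshold
    """
    if rt > at:
        raise ValueError("rejection threshold must not exceed acceptance threshold")
    # Every document whose DOA falls inside [RT, AT] is confirmed;
    # documents outside that range are never confirmed.
    return 1.0 if rt <= doa <= at else 0.0


print(uc_confirmation_prob(0.5, rt=0.3, at=0.8))
print(uc_confirmation_prob(0.9, rt=0.3, at=0.8))
```

Because every document in the range is confirmed, widening the range raises CR at the cost of more confirmations, which is why the slide describes UC as preferring CR.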

Page 8: Text Classification for Healthcare Information Support

Interaction for High-Quality TC (Cont.)

Probabilistic Confirmation (PC): Preferring CP

(A) Tuning a threshold in the hope of optimizing F1 (o: positive validation document; x: negative validation document)

[Figure: validation documents (o, x) ordered along a DOA axis from Min DOA to Max DOA, with the classifier’s threshold (T) marked]

(B) Confirmation strategy:

Prob = 0 (when DOA(d, c) = Min)

Prob = 0 (when DOA(d, c) = Max)

Prob = 1.0 (when DOA(d, c) = threshold)
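The slide fixes the confirmation probability only at three points (Min DOA, Max DOA, and the threshold). The sketch below assumes linear interpolation between those points, which is a guess rather than something the slides state; the function names are hypothetical as well.

```python
import random


def pc_confirmation_prob(doa, min_doa, max_doa, threshold):
    """Probabilistic Confirmation: probability of asking for confirmation.

    Returns 0 at min_doa and max_doa and 1.0 at the threshold.
    ASSUMPTION: the probability changes linearly in between.
    """
    if not (min_doa < threshold < max_doa):
        raise ValueError("expected min_doa < threshold < max_doa")
    # Clamp DOA values that fall outside the range seen during validation.
    doa = min(max(doa, min_doa), max_doa)
    if doa <= threshold:
        return (doa - min_doa) / (threshold - min_doa)
    return (max_doa - doa) / (max_doa - threshold)


def should_confirm(prob, rng=random):
    """Draw one confirm / do-not-confirm decision with the given probability."""
    return rng.random() < prob


p = pc_confirmation_prob(0.2, min_doa=0.0, max_doa=1.0, threshold=0.4)
print(p, should_confirm(p))
```

Documents far from the threshold are rarely confirmed, so fewer confirmations are wasted on decisions the classifier is likely to get right, in line with the slide's note that PC prefers CP.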

Page 9: Text Classification for Healthcare Information Support

ICCOM: Interactive Confirmation by COM

[Figure: ICCOM architecture. Training: the training documents for classifier building go through Feature Selection and Classifier Building to produce the Underlying Classifier; the training documents for threshold tuning (validation) are used for Threshold Tuning of the classifier, for (1) Content Overlap Measurement (COM), and for (2) Threshold Tuning based on Content Overlapping. Testing: each Incoming Document is handled by the Underlying Classifier (Classification) and by (3) Content Overlap Measurement (COM); ICCOM outputs Classified/Filtered Documents and Documents to be Confirmed.]

Page 10: Text Classification for Healthcare Information Support

ICCOM: Interactive Confirmation by COM (content overlapping measurement)

Procedure COM(c, d), where
(1) c is a category,
(2) d is a document for thresholding or testing

Return: Degree of content overlap (DCO) between d and c

Begin
(1) DCO = 0;
(2) For each term t that is positively correlated with c but does not appear in d, do
    (2.1) DCO = DCO - χ²(t, c);
(3) For each term t that is negatively correlated with c but appears in d, do
    (3.1) DCO = DCO - (number of occurrences of t in d) × χ²(t, c);
(4) Return DCO;
End.
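The procedure above can be sketched in Python as follows. The data structures are assumptions made for illustration: the document is a term-to-count mapping, the χ² strengths are precomputed in a dictionary, and the positively and negatively correlated terms arrive as two sets. The example terms and scores are invented.

```python
from collections import Counter


def com(doc_term_counts, chi2_scores, positive_terms, negative_terms):
    """Degree of content overlap (DCO) between a document d and a category c.

    doc_term_counts: mapping term -> number of occurrences in d
    chi2_scores:     mapping term -> chi-square strength of (t, c)
    positive_terms:  terms positively correlated with c
    negative_terms:  terms negatively correlated with c
    """
    dco = 0.0
    # Step (2): penalize expected terms that are missing from d.
    for t in positive_terms:
        if doc_term_counts.get(t, 0) == 0:
            dco -= chi2_scores[t]
    # Step (3): penalize unexpected terms, once per occurrence in d.
    for t in negative_terms:
        occurrences = doc_term_counts.get(t, 0)
        if occurrences > 0:
            dco -= occurrences * chi2_scores[t]
    return dco


# Hypothetical terms and scores, for illustration only.
chi2_scores = {"tumor": 4.0, "liver": 2.5, "cough": 3.0}
doc = Counter({"tumor": 2, "cough": 2})
dco = com(doc, chi2_scores, positive_terms={"tumor", "liver"}, negative_terms={"cough"})
print(dco)  # -8.5: "liver" is missing (-2.5), "cough" appears twice (-6.0)
```

DCO is never positive: 0 means no mismatch between d and c was found, and more negative values indicate less content overlap.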

Page 11: Text Classification for Healthcare Information Support

ICCOM: Interactive Confirmation by COM (content overlapping measurement, cont.)

| Purpose | Features | Underlying classifier | COM |
|---|---|---|---|
| To discriminate c from others | Features that correlate with c | Considered | Not considered |
| To discriminate c from others | Features that correlate with other categories | Considered | Not considered |
| To validate content overlapping | Features that appear in c but do not appear in d | Not considered | Considered |
| To validate content overlapping | Features that do not appear in c but appear in d | Not considered | Considered |

Page 12: Text Classification for Healthcare Information Support

ICCOM: Interactive Confirmation by COM (content overlapping measurement, cont.)

N: total number of documents,

A: # documents that are in c and contain t,

B: # documents that are not in c but contain t,

C: # documents that are in c but do not contain t, and

D: # documents that are not in c and do not contain t.

$$\chi^2(t, c) = \frac{N \times (AD - BC)^2}{(A+B)(A+C)(B+D)(C+D)}$$

A term t is “positively correlated” with c if AD > BC; otherwise it is “negatively correlated”.
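As a small worked sketch, the χ² strength and the correlation direction can be computed directly from the four counts. The function names and the choice to return 0.0 when the denominator is zero are assumptions; the counts in the example are invented.

```python
def chi_square(a, b, c, d):
    """Chi-square strength of a term t with respect to a category.

    a: documents in the category that contain t        (A)
    b: documents not in the category that contain t    (B)
    c: documents in the category without t             (C)
    d: documents not in the category without t         (D)
    """
    n = a + b + c + d  # N: total number of documents
    denominator = (a + b) * (a + c) * (b + d) * (c + d)
    if denominator == 0:
        # Assumed convention: a degenerate table carries no evidence.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def correlation(a, b, c, d):
    """Direction of the correlation, following the AD > BC rule."""
    return "positive" if a * d > b * c else "negative"


# Invented counts: N = 100, AD - BC = 1200 - 200 = 1000.
score = chi_square(30, 10, 20, 40)
print(round(score, 4), correlation(30, 10, 20, 40))
```

Here the score is 100 × 1000² / (40 × 50 × 50 × 60) ≈ 16.67, and the term counts as positively correlated because AD = 1200 exceeds BC = 200.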

Page 13: Text Classification for Healthcare Information Support

ICCOM: Interactive Confirmation by COM (thresholding)

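[Figure: ICCOM thresholding. Validation documents (o: positive; x: negative) are ordered along a DOA axis from Min DOA to Max DOA, with the Rejection Threshold (RT) and the classifier’s threshold (T) marked. Beyond the Rejection region, COM is invoked to compute DCO, and the Positive Confirmation Threshold (PCT) and the Negative Confirmation Threshold (NCT) separate the documents into Acceptance, Confirmation, and Rejection.]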

Page 14: Text Classification for Healthcare Information Support

ICCOM: Interactive Confirmation by COM (collaboration with the classifier)

Procedure InteractiveHighQualityTC(c, d, T, RT, PCT, NCT), where
(1) c is a category,
(2) d is the document to be processed,
(3) T is the classifier’s threshold for c,
(4) RT is the rejection threshold for c,
(5) PCT is the positive confirmation threshold for c, and
(6) NCT is the negative confirmation threshold for c.

Return: A decision (acceptance, rejection, or confirmation) for d with respect to c.

Begin
(1) DOAd = Invoke the classifier to compute DOA of d with respect to c;
(2) If (DOAd ≤ RT), Return “rejection”;
(3) Else
    (3.1) DCOd = Invoke COM to compute DCO of d with respect to c;
    (3.2) If (DOAd ≥ T)
        (3.2.1) If (DCOd ≥ PCT), Return “acceptance”;
        (3.2.2) Return “confirmation”;
    (3.3) Else
        (3.3.1) If (DCOd ≤ NCT), Return “rejection”;
        (3.3.2) Return “confirmation”;
End.
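The decision logic of this procedure can be sketched as below. The direction and strictness of each comparison are my reading of the procedure and should be treated as assumptions. The function takes the DOA value and a callable that computes DCO (both hypothetical interfaces), so that COM runs only when the document is not rejected outright. The threshold values in the usage lines are invented.

```python
def interactive_high_quality_tc(doa, compute_dco, t, rt, pct, nct):
    """Return "acceptance", "rejection", or "confirmation" for one document.

    doa:         the classifier's DOA for d with respect to c
    compute_dco: zero-argument callable returning DCO of d with respect to c
    t, rt:       classifier threshold and rejection threshold
    pct, nct:    positive and negative confirmation thresholds
    """
    # Step (2): clearly low DOA is rejected without consulting COM.
    if doa <= rt:
        return "rejection"
    # Step (3.1): COM is invoked only for the remaining documents.
    dco = compute_dco()
    if doa >= t:
        # The classifier would accept; accept only if content overlap agrees.
        return "acceptance" if dco >= pct else "confirmation"
    # The classifier would reject; reject only if content overlap agrees.
    return "rejection" if dco <= nct else "confirmation"


# Invented thresholds for illustration.
thresholds = dict(t=0.5, rt=0.2, pct=-1.0, nct=-5.0)
print(interactive_high_quality_tc(0.7, lambda: -0.5, **thresholds))  # acceptance
print(interactive_high_quality_tc(0.7, lambda: -3.0, **thresholds))  # confirmation
print(interactive_high_quality_tc(0.3, lambda: -6.0, **thresholds))  # rejection
```

The user is consulted only when the classifier and COM disagree, which is how the strategy tries to keep the number of confirmations down.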

Page 15: Text Classification for Healthcare Information Support

Empirical Evaluation

Chinese disease (cancer) texts
– 16 types of cancers (e.g. liver cancer, lung cancer, etc.) top-ranked by the Department of Health in Taiwan
– Collected by sending cancer names to “知識+” (Knowledge+) in Yahoo! Taiwan
– For each cancer, there are 5 subcategories: cause, symptom, curing, side-effect, and prevention
– Therefore, we have 80 (16 × 5) categories with 2850 documents
– 90% for training; 10% for testing
– 2-fold cross validation (classifier building vs. thresholding)

Page 16: Text Classification for Healthcare Information Support

Empirical Evaluation (cont.)

Classification of cancer information

| Fold | Best F1 by RO | F1 by RO+PC | CP of RO+PC | F1 by RO+UC | CP of RO+UC | F1 by RO+ICCOM | CP of RO+ICCOM |
|---|---|---|---|---|---|---|---|
| 1st fold | 0.3485 (FS=1500) | 0.8413 | 0.0969 | 0.9610 | 0.0848 | 0.9607 | 0.1117 |
| 2nd fold | 0.3270 (FS=1500) | 0.7823 | 0.1037 | 0.9656 | 0.0725 | 0.9433 | 0.1166 |

Page 17: Text Classification for Healthcare Information Support

Empirical Evaluation (cont.)

Classification of 40 symptom descriptions without cancer names

| Fold | Best F1 by RO | F1 by RO+PC | CP of RO+PC | F1 by RO+UC | CP of RO+UC | F1 by RO+ICCOM | CP of RO+ICCOM |
|---|---|---|---|---|---|---|---|
| 1st fold | 0.8919 (FS=300) | 0.9610 | 0.0676 | 0.9744 | 0.1017 | 0.9610 | 0.1429 |
| 2nd fold | 0.8718 (FS=300) | 0.9620 | 0.1000 | 0.9750 | 0.0580 | 0.9744 | 0.1569 |

Note: For the 40 test symptom documents, RO+ICCOM conducts 35 and 51 confirmations in the 1st and 2nd folds, respectively.
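As a quick arithmetic check, the reported CP values can be combined with the confirmation counts in the note to estimate how many confirmations were actually necessary. This assumes that each CP value was computed over exactly the confirmations counted in the note, which the slides do not state explicitly; the variable names are illustrative.

```python
# (CP of RO+ICCOM, number of confirmations conducted), as reported for each fold.
reported = {"1st fold": (0.1429, 35), "2nd fold": (0.1569, 51)}

# CP = necessary confirmations / confirmations conducted, so
# necessary ≈ CP × conducted (rounded, because CP is reported to 4 decimals).
implied_necessary = {
    fold: round(cp * conducted) for fold, (cp, conducted) in reported.items()
}
print(implied_necessary)

# Sanity check: the rounded counts reproduce the reported CP values.
for fold, (cp, conducted) in reported.items():
    assert round(implied_necessary[fold] / conducted, 4) == cp
```

Under that assumption, roughly 5 of the 35 confirmations in the 1st fold and 8 of the 51 in the 2nd fold corrected a wrong decision.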

Page 18: Text Classification for Healthcare Information Support

Conclusion

High-quality TC is essential but often unrealizable

Interactive confirmation may be one final resort
– Information recommendation & archiving
– Healthcare information support

COM as a classifier-independent strategy for interaction

Page 19: Text Classification for Healthcare Information Support

Thank you!