identifying ra patients from the electronic medical records at partners healthcare robert plenge,...
TRANSCRIPT
Identifying RA patients from the electronic medical records at Partners HealthCare
Robert Plenge, M.D., Ph.D.
VA Hospital
July 20, 2010
HARVARDMEDICAL SCHOOL
July 2010: >30 RA risk loci
20031978 1987 20052004
PTPN22
2008
“shared epitope”hypothesis
HLADR4
2007
PADI4 CTLA4
TNFAIP3STAT4TRAF1-C5IL2-IL21
CD40CCL21CD244IL2RBTNFRSF14PRKCQPIP4K2CIL2RAAFF3
Latest GWAS in 25,000 case-control samples with replication in 20,000 additional samples
2009
RELBLKTAGAPCD28TRAF6PTPRCFCGR2APRDM1CD2-CD58
Together explain ~35% of the genetic burden of
diseaseIL6STSPRED25q21RBPJIRF5CCR6PXK
2010 (Q2)
Genetic predictors of response to anti-TNF
therapy in RA
PTPRC/CD45 allelen=1,283 patients
P=0.0001
Cui et al (2010) Arth & Rheum
Options for clinical + DNA
design Clinical
data
DNA Sample size
cost
clinical trial
+++ +++ + $$$
registry ++ +++ ++ $$
claims data
+ n/a +++ $
EMR ++ +++ +++ $
• Narrative data = free-form written text– info about symptoms, medical history,
medications, exam, impression/plan
• Codified data = structured format– age, demographics, and billing codes
Content of EMRs
EMRs are increasingly utilized!
Gabriel (1994) Arthritis and Rheumatism
Conclusion: The sole reliance on such databases for the diagnosis of RA can result in substantial misdiagnosis.
…but EMR data are “dirty”
4 million patients
31,171 patients
ICD9 RA and/or CCP checked(goal = high sensitivity)
3,585 RA patients
Classification algorithm(goal = high PPV)
• Natural language processing (NLP)– disease terms (e.g., RA, lupus)– medications (e.g., methotrexate)– autoantibodies (e.g., CCP, RF)– radiographic erosions
• Codified data– ICD9 disease codes– prescription medications– laboratory autoantibodies
Our library of RA phenotypes
Qing Zeng
Concept/term Accuracy of concept
presence of erosion 88% seropositive 96% CCP positive 98.7% RF positive 99.3% etanercept 100% methotrexate 100%
Guergana Savova
• Natural language processing (NLP)– disease terms (e.g., RA, lupus)– medications (e.g., methotrexate)– autoantibodies (e.g., CCP, RF)– radiographic erosions
• Codified data– ICD9 disease codes– prescription medications– laboratory autoantibodies
Our library of RA phenotypes
Shawn Murphy
‘Optimal’ algorithm to classify RA: NLP + codified data
Regression model with a penalty parameter (to avoid over-fitting)
Codified data NLP data
Tianxi Cai, Kat Liao
High PPV with adequate sensitivity
Model PPV (SE) Sensitivity (SE)
Codified + NLP 0.93 (0.02) ✪ 0.63 (0.06)
NLP only 0.89 (0.02) 0.56 (0.05)
Codified only 0.88 (0.02) 0.51 (0.05)
✪392 out of 400 (98%) had definite or possible RA!
This means more patients!
Model PPV (SE) Sensitivity (SE)
Codified + NLP 0.93 (0.02) ✪ 0.63 (0.06)
NLP only 0.89 (0.02) 0.56 (0.05)
Codified only 0.88 (0.02) 0.51 (0.05)
~25% more subjects with the complete algorithm:
3,585 subjects (3,334 with true RA)3,046 subjects (2,680 with true RA)
Liao et (2010) Arth. Care Research
Characteristics i2b2 RA CORRONA
total number 3,585 7,971
Mean age (SD) 57.5 (17.5) 58.9 (13.4)
Female (%) 79.9 74.5
Anti-CCP(%) 63 N/A
RF (%) 74.4 72.1
Erosions (%) 59.2 59.7
MTX (%) 59.5 52.8
Anti-TNF (%) 32.6 22.6
Clinical features of patients
CCP has an OR = 1.5 for predicting erosions
4 million patients
31,171 patients
ICD9 RA and/or CCP checked(goal = high sensitivity)
3,585 RA patients
Classification algorithm(goal = high PPV)
Discarded blood for DNA
OR similar in EMR cohort
0.7
0.9
1.1
1.3
1.5
1.7
1.9
2.1
2.3
2.5GWAS
ACPA+ Current Study
SNP (ordered by chromosome and position)
Od
ds
Ra
tio
PTPN
22
CD2,
CD58
RE
L
SPRE
D2
AFF3
STAT
4 CD28 CT
LA4
PXK
RBPJ
C5O
RF13
ANKR
D55,
IL6S
T
HLA
PRDM
1
TNFA
IP3
TNFA
IP3
TAGA
P CCR6
IRF5
BLK
CCL2
1CC
L21
TRAF
1, C
5
IL2RA
IL2RA
PRKC
Q
KIF5
A, P
IP4K
2C
CD40
IL2RB
1,500 RA multi-ethnic RA cases and 1,500 matched controls
4 million patients
31,171 patients
ICD9 RA and/or CCP checked(goal = high sensitivity)
3,585 RA patients
Classification algorithm(goal = high PPV)
Clinical subsets
Discarded blood for DNA
Non-responder to anti-TNF therapy
NLP+codified data, together with statistical modeling, to define treatment
response
Responder to anti-TNF therapy
NLP+codified data, together with statistical modeling, to define treatment
response
Responder to anti-TNF therapy
5-year NIH grant as part of the PharmacoGenomics
Research Network (PGRN)
Options for clinical + DNA
design Clinical data
DNA Sample size
cost
clinical trial
+++ +++ + $$$
registry ++ +++ ++ $$
claims data
+ n/a +++ $
EMR ++ +++ +++ $
Conclusion: NLP + codified data, together with appropriate statistical modeling, can yield accurate clinical data.
Options for clinical + DNA
design Clinical data
DNA Sample size
cost
clinical trial
+++ +++ + $$$
registry ++ +++ ++ $$
claims data
+ n/a +++ $
EMR ++ +++ +++ $
Conclusion: Genetic studies in our EMR cohort yield effect sizes similar to traditional cohorts.