BiTeM / SIBtex @ TREC CDS 2014
Post on 10-Aug-2015
Full-texts representation with MeSH, co-citations network reranking
BiTeM/SIBtex group
J Gobeill (me), A Gaudinat, E Pasche and P Ruch
University of Applied Sciences, Swiss Institute of Bioinformatics,
Hospitals and University of Geneva
The BiTeM / SIBtex group
• Text Mining and Bibliomics (P Ruch)
  Strong focus on clinical and biological data
  heg (training librarians) and SIB (assisting biocurators)
• Long history of participation in TREC campaigns: Genomics, Chemical IR, Medical Records…
• Translational medicine projects (EU FP7 Programme)
Khresmoi: multimodal medical search engine
MD-Paedigree: retrieval of similar cases for clinicians
The CDS Track 2014
• Clinical Decision Support: «retrieval of biomedical articles relevant for answering generic clinical questions about medical records»
Ex. query: «25-year-old woman with fatigue, hair loss, weight gain, and cold intolerance for 6 months»
Collection: subset of PubMed Central
Strategies for TREC CDS 2014
Document Representation
1. Classical document representation with text
2. Document representation with MeSH
3. Target-specific semantic enrichment with MeSH
Reranking
4. Boosting based on article types
5. Exploitation of the co-citations network
IR performed with Okapi BM25
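As a reminder of what Okapi BM25 computes, here is a minimal sketch of the scoring function. The parameters k1 = 1.2 and b = 0.75 are common defaults and the smoothed IDF is one common variant; neither is taken from our actual configuration, and the two toy documents are invented for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" keeps it positive for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy documents (invented), scored against the example query without numbers.
docs = [
    "hypothyroidism causes fatigue hair loss and weight gain".split(),
    "cortical atrophy on brain mri".split(),
]
query = "fatigue hair loss weight gain".split()
scores = bm25_scores(query, docs)
```

The first document shares five terms with the query and gets a positive score; the second shares none and scores zero.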
BiTeM official results
[Chart: official results; label: our baseline]
Creating a baseline
1. Classical document representation with text
[Diagram: text index, search engine]
• Two different indexing levels:
  • Document
  • Section
  Run 2 vs run 4: document > section (+65%)
• Query representation (R-Prec):
  • Numbers removed (no age)
  • Only description: 0.169
  • Only summaries: 0.170
  • Both: 0.185 (+10%)
  Signal/noise ratio: better with more information
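The query representation above can be sketched as follows. This is only an illustration: the function name `build_query` is hypothetical, and dropping every token that contains a digit is one possible reading of "numbers removed", not necessarily what our system did.

```python
import re

def build_query(*fields):
    """Concatenate topic fields and drop every token that contains a digit."""
    text = " ".join(fields).lower()
    # Split on whitespace and basic punctuation; hyphenated tokens stay whole.
    tokens = re.findall(r"[^\s,.;:]+", text)
    return [t for t in tokens if not any(ch.isdigit() for ch in t)]

# The example query from the track description, used as a single field.
example = ("25-year-old woman with fatigue, hair loss, weight gain, "
           "and cold intolerance for 6 months")
query = build_query(example)
```

Passing both the description and the summary as separate arguments gives the "Both" configuration, which simply concatenates the two fields.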
Creating a complementary view
2. Document representation with MeSH
[Diagram: MeSH index, search engine]
MeSH for PMC 2649306:
D008569 Memory Disorders
D001921 Brain
D001284 Atrophy
D001706 Biopsy
D005911 Gliosis
2. Document representation with MeSH
• Two possible sources:
  • Collected from MEDLINE when there is a PMID
  • Extracted from documents with a categorizer (strict mapping)
• Two possible integrations between original text and MeSH:
  • Building separate indexes then combining runs
  • Merging both representations into one unique document
MeSH concepts found:
D008568 Memory
D008569 Memory Disorders
D007866 Leg
D009068 Movement
D001921 Brain
D001284 Atrophy
D001706 Biopsy
D005911 Gliosis
<topic number="8"><summary>62-year-old man with
progressive memory loss and involuntary leg movements. Brain MRI
reveals cortical atrophy, and cortical biopsy shows vacuolar gray matter
changes with reactive astrocytosis.</summary>
Example of MeSH mapping
D013035: Muscular Spasm?
Some good (power of synonyms)
Some broad
Some missing (too ambiguous)
D002540: Cerebral Cortex?
D008279: MH = Magnetic Resonance Imaging? Medical Research Institute? Moderate Renal Insufficiency?
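A strict dictionary mapping of this kind can be sketched in a few lines. The MeSH IDs and preferred terms below come from the example above; the surface forms used as synonyms, the `LEXICON` table and the `map_mesh` function are illustrative assumptions, not our categorizer.

```python
import re

# Toy lexicon: surface form -> (MeSH ID, preferred term).
# IDs and terms are from the slide; the surface forms are assumed.
LEXICON = {
    "memory": ("D008568", "Memory"),
    "memory loss": ("D008569", "Memory Disorders"),
    "leg": ("D007866", "Leg"),
    "brain": ("D001921", "Brain"),
    "atrophy": ("D001284", "Atrophy"),
    "biopsy": ("D001706", "Biopsy"),
    "astrocytosis": ("D005911", "Gliosis"),
}

def map_mesh(text):
    """Return MeSH concepts whose surface form occurs as a whole word or phrase."""
    lowered = text.lower()
    found = []
    for surface, concept in LEXICON.items():
        pattern = rf"\b{re.escape(surface)}\b"
        if re.search(pattern, lowered) and concept not in found:
            found.append(concept)
    return found

topic = ("62-year-old man with progressive memory loss and involuntary leg "
         "movements. Brain MRI reveals cortical atrophy, and cortical biopsy "
         "shows vacuolar gray matter changes with reactive astrocytosis.")
concepts = map_mesh(topic)
```

The whole-word constraint is what makes the mapping strict: "astrocytosis" reaches Gliosis through a synonym, while an ambiguous abbreviation such as "MRI" is simply left unmapped because it is not in the lexicon.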
Top 15 MeSH in benchmark
MEDLINE MeSH in docs: Humans, Animals, Female, Male, Adult, Middle Aged, Mice, Aged, Adolescent, Molecular Sequence Data, Rats, Young Adult, Time Factors, Child, Signal Transduction
Extracted MeSH in docs: Cells, Ficus (because of «fig»), Patients, Time, Genes, Therapeutics, Methods, Role, Humans, Disease, Volition, Mice, Attention, DNA, Population
Extracted MeSH in topics: Women, History, Pain, Blood, Physical Examination, Female, Blood Pressure, Pressure, Dyspnea, Family, Thorax, Urine, Fever, Male, Emergencies
Results for MeSH representation
• Best R-Prec 0.143 for MeSH representation (vs 0.211 for text)
  o MeSH concepts collected from MEDLINE not useful (best R-Prec 0.028)
  o Only 53% of documents had MeSH terms in MEDLINE
• Complementarity for finding relevant documents (thanks to qrel):
  • Low complementarity
  • Combination: 0.211 -> 0.213
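Combining a text run and a MeSH run can be sketched as a weighted sum of normalized scores. The slides do not say how the runs were actually combined, so the min-max normalization, the weight of 0.8 and the toy scores below are all assumptions.

```python
def min_max(run):
    """Rescale the scores of one run to the range [0, 1]."""
    lo, hi = min(run.values()), max(run.values())
    span = hi - lo
    return {doc: ((s - lo) / span if span else 1.0) for doc, s in run.items()}

def combine_runs(text_run, mesh_run, weight=0.8):
    """Weighted sum of two normalized runs; a missing document counts as 0."""
    text_n, mesh_n = min_max(text_run), min_max(mesh_run)
    docs = set(text_n) | set(mesh_n)
    fused = {
        d: weight * text_n.get(d, 0.0) + (1 - weight) * mesh_n.get(d, 0.0)
        for d in docs
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy runs (invented scores): document ID -> retrieval score.
text_run = {"PMC1": 12.0, "PMC2": 9.0, "PMC3": 3.0}
mesh_run = {"PMC2": 5.0, "PMC4": 4.0, "PMC3": 1.0}
ranking = combine_runs(text_run, mesh_run)
```

When the second run mostly retrieves the same relevant documents as the first, a fusion like this can only reorder them slightly, which fits the small gain reported above.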
Favoring target types
MeSH for PMC 2649306:
D008569 Memory Disorders
D001921 Brain
D001284 Atrophy
D005911 Gliosis
D001706 Biopsy
Target tokens added: MeSHtargetDiagnosis, MeSHtargetDiagnosis, MeSHtargetTest
Do relevant documents for diagnosis deal more with diagnosis ?
3. Target-specific semantic enrichment with MeSH
• In UMLS, each MeSH term has Semantic Types (ex: T060 Diagnostic Procedure)
Focus on targets (diagnosis, treatments and tests)
• Specific words (ex: «MeSHtargetDiag») are added in docs and queries
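The enrichment step can be sketched as two lookups followed by appending tokens. Both tables below are hypothetical stand-ins for data derived from UMLS: the assignment of semantic types to targets and of MeSH terms to semantic types is illustrative and has not been checked against UMLS, and `MeSHtargetTreatment` is an assumed token name.

```python
# Hypothetical mapping from UMLS semantic types to target tokens.
SEMANTIC_TYPE_TO_TOKEN = {
    "T060": "MeSHtargetTest",       # Diagnostic Procedure (assumed to mean "test")
    "T047": "MeSHtargetDiagnosis",  # Disease or Syndrome (assumed mapping)
    "T061": "MeSHtargetTreatment",  # Therapeutic or Preventive Procedure (assumed)
}

# Hypothetical MeSH ID -> semantic types table (illustrative only).
MESH_TO_SEMANTIC_TYPES = {
    "D001706": ["T060"],  # Biopsy
    "D008569": ["T047"],  # Memory Disorders
}

def enrich(text, mesh_ids):
    """Append one target token per MeSH term whose semantic type maps to a target."""
    tokens = []
    for mesh_id in mesh_ids:
        for sem_type in MESH_TO_SEMANTIC_TYPES.get(mesh_id, []):
            token = SEMANTIC_TYPE_TO_TOKEN.get(sem_type)
            if token:
                tokens.append(token)
    return " ".join([text] + tokens)

enriched = enrich("cortical biopsy shows gliosis", ["D008569", "D001921", "D001706"])
```

Because the same tokens are added to queries, a diagnosis topic carrying `MeSHtargetDiagnosis` will match documents in proportion to how many diagnosis-type MeSH terms they contain.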
Target      % docs that have at least 1   Average number in documents
Test        83 %                          16
Diagnosis   86 %                          41
Treatment   86 %                          24
Small improvement, only for section indexing
In the qrel…
Set                                               Aver. Diagnosis MeSH   Aver. Test MeSH   Aver. Treatment MeSH
All collection                                    41                     16                24
Relevant for diagnosis (1|2 for queries 1..10)    108                    41                41
Relevant for test (1|2 for queries 11..20)        107                    41                33
Relevant for treatment (1|2 for queries 21..30)   114                    47                52
All relevant documents:
  o Are quite similar, with no distinction between targets
  o But have 2 to 3 times more target MeSH terms
  o ... but it's also the case for documents with 0 in the qrel
4. Boosting based on article types
Promoting some article types
Are some article types more likely to be relevant ?
Distribution of article types
Article type       in docs   in qrel   in our runs
research-article   74.3 %    52.2 %    37.9 %
case-report        4.0 %     20.4 %    41.5 %
review-article     6.9 %     17.9 %    10.9 %
Other              2.6 %     3.2 %     3.6 %
brief-report       1.1 %     1.5 %     0.9 %
• Strategy: to promote review and case-based articles (boosting)
• Intuition was good… but the strategy failed!
• In reality… the IR engine already promoted these types!
Top 5
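For reference, boosting by article type amounts to multiplying each Retrieval Status Value by a type-dependent factor. The factor of 1.2 and the toy run below are hypothetical; as noted above, the strategy did not help in practice.

```python
# Hypothetical boost factors for the promoted article types.
BOOSTS = {"review-article": 1.2, "case-report": 1.2}

def boost_by_type(run, article_types, boosts=BOOSTS):
    """Multiply each RSV by the boost of its article type (1.0 if not boosted)."""
    boosted = {
        doc: rsv * boosts.get(article_types.get(doc), 1.0)
        for doc, rsv in run.items()
    }
    return sorted(boosted.items(), key=lambda kv: kv[1], reverse=True)

# Toy run (invented scores) and article types.
run = {"PMC1": 10.0, "PMC2": 9.0, "PMC3": 5.0}
types = {"PMC1": "research-article", "PMC2": "case-report", "PMC3": "review-article"}
ranking = boost_by_type(run, types)
```

In this toy run the case report overtakes the research article. If case reports are already over-represented in the run, as in the distribution table, such a boost has little left to gain.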
5. Exploitation of the co-citations network
Promoting citations
Are citations of retrieved documents relevant ?
• E is the set of retrieved documents
• RSV_e is the Retrieval Status Value of doc_e
• We boost each citation of doc_e by + α × RSV_e
• 50% of documents cite another one in the collection (avg 3.8 citations)
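The boosting rule can be sketched as a single pass over the retrieved set. Two points are assumptions of this sketch rather than statements about our system: the boost uses the original RSVs only (no propagation of boosted scores), and a cited document that was not retrieved enters the ranking with a base score of 0.

```python
def boost_citations(run, citations, alpha=0.1):
    """Add alpha * RSV_e to every document cited by a retrieved document e."""
    boosted = dict(run)
    for doc_e, rsv_e in run.items():  # original RSVs, single pass
        for cited in citations.get(doc_e, []):
            boosted[cited] = boosted.get(cited, 0.0) + alpha * rsv_e
    return sorted(boosted.items(), key=lambda kv: kv[1], reverse=True)

# Toy run (invented scores) and citation lists: citing doc -> cited docs.
run = {"PMC1": 10.0, "PMC2": 9.5, "PMC3": 9.0}
citations = {"PMC1": ["PMC3"], "PMC2": ["PMC3", "PMC9"]}
ranking = boost_citations(run, citations)
```

Here PMC3 is cited by the two best-ranked documents, so it gains 0.1 × 10.0 + 0.1 × 9.5 and moves to the top of the ranking.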
Results
• With α = 0.1, slight improvement
  • +10% for R-Prec
  • +20% for infNDCG
• In TREC Chem 2010 Prior Art task, + 150% for MAP
Conclusions
“what is important is to have fought well”
• A lot of strategies, but not much better than Terrier baseline
• Section indexing: never again
• MeSH not complementary… Better when inferred by a k-NN?
• Relevant docs talk about test, diag and treatment altogether.
• Maybe we have to start working from the baseline run…