The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Text and data mining for Biomedical Research Dr. Jean-Fred Fontaine Max Delbrück Center for Molecular Medicine, Berlin

Upload: liber-europe

Post on 06-May-2015


DESCRIPTION

Presentation by Jean-Fred Fontaine (MDC Berlin) from the 'Perfect Swell' workshop on text and data mining, held on 27 September 2013.

TRANSCRIPT

Page 1: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Text and data mining for Biomedical Research

Dr. Jean-Fred Fontaine, Max Delbrück Center for Molecular Medicine, Berlin

Page 2: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Scientific project and biomedical literature

Project design
Analysis
Experiments
Communication

• Methods
• Explanations
• New hypotheses

• State of the art
• Innovative ideas

• Technologies
• State of the art
• Explanations
• Open hypotheses
• Perspectives

Page 3: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Data growth

[Figure panels: Literature growth; Molecular data growth]

Page 4: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Accessibility

Krallinger et al. (2010) Methods Mol Biol.

18 M (all)
9.7 M: text mining of abstracts
8.6 M
2.4 M: freely readable
1.8 M
0.2 M: text mining of full texts*

* PMC Open Access subset (2012): 249,108 full texts (Ortuno et al., 2013)
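The figures on this slide can be turned into shares of the total with a few lines of arithmetic. The sketch below is only an illustration: it assumes that 18 M is the total number of records and that each label belongs to the number printed next to it on the slide. The two unlabeled tiers (8.6 M and 1.8 M) are left out because the slide text does not say what they count.

```python
# Counts in millions, copied from the slide; unlabeled tiers are omitted.
TOTAL_M = 18.0
tiers = {
    "abstracts available for text mining": 9.7,
    "freely readable": 2.4,
    "full texts available for text mining": 0.2,
}

def share_of_total(count_m: float, total_m: float = TOTAL_M) -> float:
    """Return count as a percentage of the total, rounded to one decimal."""
    return round(100.0 * count_m / total_m, 1)

shares = {label: share_of_total(count) for label, count in tiers.items()}
for label, pct in shares.items():
    print(f"{label}: {pct}% of all records")
```

Under these assumptions, roughly half of the records can be mined at the abstract level, while only about one in a hundred can be mined as full text.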

Page 5: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Document retrieval

Alzheimer’s disease?

By date

Fontaine et al. (2009) Nucleic Acids Res.
http://cbdm.mdc-berlin.de/tools/medlineranker/

By relevance

Medline Ranker
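The contrast between listing documents by date and by relevance can be illustrated with a toy scorer. This is a minimal sketch under stated assumptions, not the published Medline Ranker algorithm: word weights are smoothed log-odds learned from a handful of example texts, and all texts, function names, and parameters are invented for illustration.

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens; a stand-in for real linguistic preprocessing."""
    return set(re.findall(r"[a-z]+", text.lower()))

def train_weights(relevant, background, smoothing=1.0):
    """Log-odds weight per word: positive if more frequent in relevant texts."""
    rel = Counter(w for t in relevant for w in tokens(t))
    bg = Counter(w for t in background for w in tokens(t))
    weights = {}
    for w in set(rel) | set(bg):
        p_rel = (rel[w] + smoothing) / (len(relevant) + 2 * smoothing)
        p_bg = (bg[w] + smoothing) / (len(background) + 2 * smoothing)
        weights[w] = math.log(p_rel / p_bg)
    return weights

def score(text, weights):
    """Sum of the weights of the words present in the text."""
    return sum(weights.get(w, 0.0) for w in tokens(text))

def rank_by_relevance(docs, weights):
    """Sort (year, text) pairs by score, highest first."""
    return sorted(docs, key=lambda d: score(d[1], weights), reverse=True)

# Invented training examples and documents.
relevant = ["amyloid plaques in alzheimer disease", "tau protein and alzheimer dementia"]
background = ["surgical wound infection rates", "plant defence signalling pathways"]
docs = [
    (2013, "wound infection after surgery"),
    (2009, "amyloid and tau in dementia"),
    (2011, "plant protein signalling"),
]

weights = train_weights(relevant, background)
by_date = sorted(docs, key=lambda d: d[0], reverse=True)
by_relevance = rank_by_relevance(docs, weights)
```

With this toy data the newest document comes first when sorting by date, while the 2009 document on amyloid and tau comes first when sorting by relevance.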

[Figure: Citations in PubMed®, 1940-2008; axis range 0 to 25,000,000]

Page 6: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Discovery of gene-disease associations

Database mining

Fontaine et al. (2011) Nucleic Acids Res.
http://cbdm.mdc-berlin.de/tools/genie

Medline Ranker / Génie

Rank 20,000 genes
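One plausible way to combine database links with relevance-ranked abstracts is to ask, for each gene, whether its linked abstracts are enriched in relevant ones. The sketch below uses a hypergeometric tail probability for that purpose. It is an assumption made for illustration and is not taken from the slide; the gene names, abstract identifiers, and function names are all invented.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """P(X >= k) for a hypergeometric draw: n abstracts linked to the gene,
    K relevant abstracts among N in total, k of the gene's abstracts relevant."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

def rank_genes(gene_to_abstracts, relevant_ids, n_total):
    """Return (gene, p-value) pairs sorted from most to least enriched."""
    scored = []
    for gene, abstracts in gene_to_abstracts.items():
        k = len(abstracts & relevant_ids)
        p = enrichment_p(k, len(abstracts), len(relevant_ids), n_total)
        scored.append((gene, p))
    return sorted(scored, key=lambda pair: pair[1])

# Invented toy data: abstract identifiers 1..20, five of them flagged as relevant.
relevant_ids = {1, 2, 3, 4, 5}
gene_to_abstracts = {
    "GENE_A": {1, 2, 3, 10},     # 3 of 4 linked abstracts are relevant
    "GENE_B": {4, 11, 12, 13},   # 1 of 4
    "GENE_C": {14, 15, 16, 17},  # 0 of 4
}

ranking = rank_genes(gene_to_abstracts, relevant_ids, n_total=20)
for gene, p in ranking:
    print(f"{gene}: p = {p:.3f}")
```

The same loop would apply unchanged to a list of 20,000 genes; only the link table and the set of relevant abstracts would grow.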

Page 7: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Discovery of gene- and drug-disease associations

Frijters et al. (2010) PLoS Comput Biol.

[Figure panels: Before 2007; After 2007]

Page 8: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Semantic analysis

Knowledge bases

Van Landeghem et al. (2013) PLoS One.

Page 9: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Network construction

Miljkovic et al. (2012) PLoS One.

Modelling Plant Defence Response

Page 10: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Trends

Palidwor & Andrade-Navarro (2010) J Biomed Discov Collab.

http://www.ogic.ca/mltrends/

Page 11: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Surveillance of Surgical Site Infections

Campillo-Gimenez et al. (2013) Stud Health Technol Inform.

University Hospital of Rennes, France
SSI secondary to neurosurgery
Electronic Patient Records: ICD10 codes, free text

2008-2009 relevant records
Classification
2010 medical reports

                  Conventional    ICD10    Full-text
                  surveillance    codes    medical reports
TRUE positive                3       11                 12
FALSE positive               0      219                 18
FALSE negative              10        2                  1
TRUE negative             1212      993               1194
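The confusion counts on this slide can be summarised as sensitivity, specificity, and precision. The sketch below assumes that the three numbers in each row correspond, in order, to conventional surveillance, ICD10 codes, and full-text medical reports; this reading is supported by the fact that each column then sums to 1225 reports, but it remains an assumption.

```python
# Confusion counts copied from the slide: (TP, FP, FN, TN) per method.
counts = {
    "Conventional surveillance": (3, 0, 10, 1212),
    "ICD10 codes": (11, 219, 2, 993),
    "Full-text medical reports": (12, 18, 1, 1194),
}

def metrics(tp, fp, fn, tn):
    """Return sensitivity, specificity and precision as fractions."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

results = {name: metrics(*c) for name, c in counts.items()}
for name, m in results.items():
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Under this reading, mining the full-text reports detects 12 of 13 infections (sensitivity about 0.92) with a precision of 0.40, whereas ICD10 codes alone reach a similar sensitivity (11 of 13) but a precision below 0.05.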

Page 12: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Disease Correlations from Electronic Patient Records

Roque et al. (2011) PLoS Comput Biol.

Patient records
ICD10 codes: Manual, Text Mining
Avg. ICD10 codes: Manual 2.7; Text Mining 9.5

Co-morbidity: 93 / 802 unexpected
Ex. Alopecia and Migraine (THRA, ESR1, HR)
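Once each patient record carries a set of ICD10 codes, disease correlations can be explored by comparing how often two codes co-occur with how often they would co-occur by chance. The sketch below computes a simple observed/expected ratio. It is an illustrative assumption, not the scoring used by Roque et al., and the patient records are invented.

```python
from collections import Counter
from itertools import combinations

def comorbidity_lift(records):
    """Observed / expected co-occurrence for each pair of codes.
    records: list of sets of ICD10 codes, one set per patient."""
    n = len(records)
    single = Counter(code for rec in records for code in rec)
    pairs = Counter(p for rec in records for p in combinations(sorted(rec), 2))
    lift = {}
    for (a, b), observed in pairs.items():
        expected = single[a] * single[b] / n  # expected count if codes were independent
        lift[(a, b)] = observed / expected
    return lift

# Invented toy records. L63 (alopecia areata), G43 (migraine), I10 (hypertension)
# and E11 (type 2 diabetes) are real ICD10 codes; the patients are not.
records = [
    {"L63", "G43"}, {"L63", "G43", "I10"}, {"I10"}, {"I10", "E11"},
    {"E11"}, {"G43"}, {"I10"}, {"E11", "I10"},
]

lift = comorbidity_lift(records)
for pair, value in sorted(lift.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(value, 2))
```

A ratio well above 1 flags a pair that appears together more often than independence would predict; in real data such a ratio would need a significance test and correction for multiple comparisons.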

Page 13: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Summary

Computers and biomedical literature and data
  • Generation
  • Storage
  • Analysis

Text and data mining
  • Useful from project start to finish
  • Broad and critical applications
      • Information retrieval
      • Information extraction
      • Knowledge databases
      • Knowledge discovery
  • Limited by text availability

Page 14: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Challenges

Accuracy in some applications
  • Ambiguity, complex sentences, document context, novelty
  • “Protein A and its partners”

From abstracts to full texts
  • Current methods optimized for short texts (abstracts)
  • Figures and tables
  • Supplementary information

File format
  • The PDF problem
  • XML: structured format (Abstract, Introduction, Results, Methods, Discussion, References, ...)
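The advantage of a structured XML format over PDF is that sections can be addressed directly. The sketch below parses an invented miniature article whose element names loosely follow JATS-style markup; the document, the element layout, and the function name are assumptions for illustration, and real publisher XML is considerably richer.

```python
import xml.etree.ElementTree as ET

# Invented miniature article; element names loosely follow JATS-style markup.
ARTICLE_XML = """
<article>
  <front><abstract><p>Short summary of the study.</p></abstract></front>
  <body>
    <sec><title>Introduction</title><p>Background text.</p></sec>
    <sec><title>Methods</title><p>Protocol text.</p><p>More protocol.</p></sec>
    <sec><title>Results</title><p>Findings text.</p></sec>
  </body>
</article>
"""

def paragraphs_text(element):
    """Join the text of all direct <p> children of an element."""
    return " ".join("".join(p.itertext()).strip() for p in element.findall("p"))

def sections(xml_text):
    """Map each section title to the concatenated text of its paragraphs."""
    root = ET.fromstring(xml_text)
    out = {}
    abstract = root.find("./front/abstract")
    if abstract is not None:
        out["Abstract"] = paragraphs_text(abstract)
    for sec in root.findall("./body/sec"):
        title = sec.findtext("title", default="(untitled)")
        out[title] = paragraphs_text(sec)
    return out

parsed = sections(ARTICLE_XML)
for title, text in parsed.items():
    print(f"{title}: {text}")
```

With such a mapping, a mining pipeline can, for example, restrict information extraction to the Results section, which is hard to do reliably from PDF.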


Page 15: The researcher perspective, Jean-Fred Fontaine, MDC Berlin

Needs

Copyright
  • Teach scientists
  • Unify licenses

Availability
  • All significant documents: articles, reviews, case reports, letters
  • The main structured text (XML)
  • No figures (or optional)
  • Texts mostly useless for readers
  • Supplements: optional
  • No fancy user interface or webservice: FTP/P2P + compressed XML

Communicating research results
  • Open Access
  • As text
  • As data: standardized list of facts, standardized figures data and tables

# articles    Compressed file size*
1             13 KB
1M            12 GB
20M           250 GB

* Projections based on PMC Open Access 2012
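The projected archive sizes can be checked with simple arithmetic. The sketch below assumes that size scales linearly with the number of articles and that 1 GB equals 1024 × 1024 KB; the slide does not state how its projections were computed, so both points are assumptions.

```python
KB_PER_ARTICLE = 13  # compressed size of one article, from the table

def projected_gb(n_articles, kb_per_article=KB_PER_ARTICLE):
    """Projected archive size in GB, assuming 1 GB = 1024 * 1024 KB."""
    return n_articles * kb_per_article / (1024 * 1024)

projections = {n: projected_gb(n) for n in (1_000_000, 20_000_000)}
for n, gb in projections.items():
    print(f"{n:,} articles: about {gb:.1f} GB")
```

Under these assumptions the results are about 12.4 GB for one million articles and about 248 GB for twenty million, in line with the 12 GB and 250 GB given in the table.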
