ICIC 2014 Finding Answers in the Data – The Future Role of Text and Data Mining (TDM)

Kim Zwollo, General Manager, RightsDirect

Andrew Hinton, Linguamatics

Upload: dr-haxel-congress-and-event-management-gmbh

Post on 11-Jun-2015


DESCRIPTION

Vast amounts of new information and data are generated every day through scientific research. More and more of this data is stored in rapidly growing but siloed databases, creating “Big Data” challenges. New technologies such as text and data mining make it possible to search these sources efficiently and to build knowledge by applying analytics across them. Research-intensive companies in the pharmaceutical and chemical industries are exploring text and data mining (TDM) techniques to glean new insights from patents, clinical data, scientific literature, and other data sources. These insights are seen as critical to accelerating drug and product discovery. As researchers adopt TDM techniques, easy, centralized access to TDM-ready full-text content from multiple publishers becomes increasingly important.

What will be the future role of TDM in 2014 and beyond? What are the major TDM trends, and what solutions are companies looking for to accelerate their R&D efforts? Drawing on the experience gathered in a text and data mining pilot program successfully run by RightsDirect’s parent company, Copyright Clearance Center (CCC), in 2013, RightsDirect’s General Manager Kim Zwollo gives an overview of current market needs, options, and trends in text and data mining. Using CCC’s TDM solution as an example, the presentation outlines the critical success factors in technology and business models that a comprehensive approach to text and data mining requires.

TRANSCRIPT

Page 1

Finding Answers in the Data – The Future Role of Text and Data Mining

Kim Zwollo, General Manager, RightsDirect

Andrew Hinton, Linguamatics

Page 2

Making Copyright Work – CCC and RightsDirect

Rightsholders

600+ million rights from:

• Publishers

• Authors

• Creators

Content Users

• 35,000 companies

• Employees worldwide

• Users in 180 countries

• Licensing Solutions

• Rights Management

• Content Delivery

• Copyright Education

10/15/2014

Page 3

Overview

• What is Text and Data Mining

• Why text mining is useful

• Technology Trends

• Information Retrieval Challenges

• Publisher perspective

• Emerging solutions

• Use cases from Linguamatics

Page 4

What is Text and Data Mining

Interpret Meaning, Identify & Extract

• Facts

• Relationships

• Assertions

Linguamatics 2014

Page 5

Application Areas for text mining

• Protein-Protein Interactions

• Vocabulary Development

• Target Identification & Prioritization

• Conference Abstract Mining

• Key Opinion Leader Identification

• Safety/Tox

• In-licensing Opportunities

• Gene Profiling

• Systems Biology

• Mining FDA Drug Labels

• Extracting Numerical and Experimental Data

• Mutations and Gene Expression

• Sentiment Analysis in Social Media

• Workflow Integration

• Mining Electronic Medical Records

• Clinical Trial Analysis

• Patent Analysis

• Biomarker Discovery

• Competitive Intelligence

• Drug Repositioning

Page 6

“Drug Discovery” Process

• Goal: Develop new treatments for diseases through hypothesis formation.

• Methodology:

– Keyword/Database Searching

– Review Literature

– Find relationships

– Develop hypothesis

– Test

– Product development

Etc.

Page 7

Analyzing Article Sets

Page 8

Problem: Too Much Research

• 53M Records in Scopus

• 800,000 Journal Articles published per year


http://altmetrics.org/manifesto/ October 26, 2010

Page 9

Even within one disease area…

Lots of disorders …

• Angina

• Acute coronary syndrome

• Alexia

• Anomic aphasia

• Aortic dissection

• Aortic regurgitation

• Aortic stenosis

• Apoplexy

• Apraxia

• Arrhythmias

• Asymmetric septal hypertrophy (ASH)

• Atherosclerosis

• Atrial flutter

• Atrial septal defect

• Atrioventricular canal defect

• Atrioventricular septal defect

• Avascular necrosis

– Etc…

Lots of documents…

• 35,000+ on Improve Circulation

• 7,000+ per disease area

Page 10

Literature Based Discovery

Don Swanson (1924-2012)

[1986] Blood viscosity served as a bridge between the topics of Raynaud’s disease and dietary fish oil.

[Diagram: A, B, C]
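The bridging idea on this slide can be sketched in a few lines of Python. This is only a toy illustration, not how any TDM product works: it assumes the slide's A, B, and C stand for two topics joined by an intermediate term, and the article sets, the second bridging term, and the function names are invented for the example. Only the blood viscosity link comes from the slide.

```python
# Hypothetical toy corpus: each record is the set of concepts found in one article.
# Only the "blood viscosity" bridge comes from the slide; the rest is invented.
articles = [
    {"raynaud's disease", "blood viscosity"},
    {"raynaud's disease", "platelet aggregation"},
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"fish oil", "triglycerides"},
]


def cooccurring_terms(term, corpus):
    """Return every concept that shares at least one article with `term`."""
    linked = set()
    for concepts in corpus:
        if term in concepts:
            linked |= concepts - {term}
    return linked


def bridging_terms(a, c, corpus):
    """Find terms that co-occur with both topics, provided the two topics
    never appear together in the same article (a not-yet-stated link)."""
    if any(a in concepts and c in concepts for concepts in corpus):
        return set()  # the two topics are already directly connected
    return cooccurring_terms(a, corpus) & cooccurring_terms(c, corpus)


bridges = bridging_terms("raynaud's disease", "fish oil", articles)
print(sorted(bridges))
```

A real system would work on extracted concepts from millions of records and would rank candidate bridges rather than just intersecting sets.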

Page 11

Information Retrieval and Discovery Process

*http://www.jisc.ac.uk/reports/value-and-benefits-of-text-mining

Software Platforms for TDM

Information Retrieval

Knowledge Discovery

Page 12

Challenges for Text Mining Researchers

• Many sources of content

• Many formats

• Difficult to obtain full-text in XML

• Difficult to integrate content into TDM software.

• Hard to negotiate and manage licenses and feeds from all publishers.

Page 13

STM Publisher Perspective

• Concern about disruptive nature of TDM to subscription business

• Access problem, more than a copyright problem

• Technical challenges with formats and authentication

• More industry education needed

• Top STM Publishers are making their content available for mining

Page 14

Background: Timeline

• JISC paper May 2011

• First PDR-TDM meeting Nov 2011

• CCC TDM Event – March 2012

• CCC White Paper on TDM Issues and Solutions – May 2012

• CCC Pilot 2013

• Second PDR-TDM meeting Nov 2013

• Content acquisition 2014

• Launch CCC service for mining full text (2015)

Page 15

Helping TDM Researchers

[Diagram: Publisher 1, Publisher 2, Publisher 3; <XML>]

Rightsholders provide CCC with a feed of their content in XML

Page 16

Helping TDM Researchers

[Diagram: Company A, Company B, Company C]

Companies provide CCC with information about their subscriptions and holdings, using our automated tools in DirectPath.

Page 17

Helping TDM Researchers

[Diagram: Company A, Company B, Company C; Publisher 1, Publisher 2, Publisher 3; <XML>]

Companies request article sets for each TDM project.

CCC manages access based on subscription information.
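As a rough illustration of access managed by subscription information, the sketch below filters a requested article set down to what a company's holdings cover. Everything in it is an assumption made for the example: the `Article` record, the identifiers, the journal names, and `entitled_article_set` are hypothetical and do not describe CCC's actual service or data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Article:
    doi: str        # placeholder identifiers, not real DOIs
    publisher: str
    journal: str


# Hypothetical holdings: company -> set of (publisher, journal) it subscribes to.
subscriptions = {
    "Company A": {("Publisher 1", "Journal X"), ("Publisher 2", "Journal Y")},
    "Company B": {("Publisher 3", "Journal Z")},
}

# Hypothetical pool of XML articles supplied through publisher feeds.
feed = [
    Article("doi-001", "Publisher 1", "Journal X"),
    Article("doi-002", "Publisher 2", "Journal Y"),
    Article("doi-003", "Publisher 3", "Journal Z"),
]


def entitled_article_set(company, requested_dois, feed, subscriptions):
    """Return only the requested articles covered by the company's subscriptions."""
    holdings = subscriptions.get(company, set())
    wanted = set(requested_dois)
    return [
        art for art in feed
        if art.doi in wanted and (art.publisher, art.journal) in holdings
    ]


project_set = entitled_article_set(
    "Company A", ["doi-001", "doi-003"], feed, subscriptions
)
print([art.doi for art in project_set])
```

Here Company A asks for two articles but receives only the one published in a journal it subscribes to.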

Page 18

Looking Ahead: Emerging Solutions for Information Retrieval

• Open Access Content

• Publisher-specific capabilities for delivering content (Elsevier and others)

• Industry-wide content access solutions by intermediaries

– CrossRef

– CCC

– PLS

Page 19

A Look at a Text Mining Application

A presentation by Linguamatics

Andrew Hinton, Linguamatics

Page 20

About Linguamatics

Boston, Cambridge

I2E: agile, scalable, real-time NLP-based text mining

Fact extraction and knowledge synthesis

• Fortune 500

• Pharma/Biotech (including 17 of the top 20)

• Healthcare (including Kaiser Permanente)

• Government (including FDA)

Software, Consulting, Hosted Content

Page 21

Linguistic Processing Using NLP

• Groups words into meaningful units

• Morphology allows search for different forms of words

We find that p42mapk phosphorylates c-Myb on serine and threonine. Purified recombinant p42 MAPK was found to phosphorylate Wee1.

[Figure labels: sentences; morphology - different forms; noun groups match entities; verb groups match actions]
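The two example sentences can be used for a small sketch of what matching different word forms means in practice. This is a deliberately naive regular-expression stand-in, not the linguistic processing that I2E performs: the sentence splitter, the pattern, and the normalised names are all assumptions chosen for this example.

```python
import re

text = (
    "We find that p42mapk phosphorylates c-Myb on serine and threonine. "
    "Purified recombinant p42 MAPK was found to phosphorylate Wee1."
)

# Naive sentence splitter: break after sentence-final punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text)

# One pattern covers spelling variants of the entity ("p42mapk", "p42 MAPK")
# and inflected forms of the verb ("phosphorylates", "phosphorylate").
relation = re.compile(
    r"\b(p42\s?mapk)\b.*?\b(phosphorylat\w*)\s+([\w-]+)",
    re.IGNORECASE,
)

facts = []
for sentence in sentences:
    match = relation.search(sentence)
    if match:
        # Normalise the surface forms to one entity name and one verb lemma.
        facts.append(("p42 MAPK", "phosphorylate", match.group(3)))

print(facts)
```

Both sentences yield the same kind of fact even though the entity is spelled differently and the verb appears in different forms, which is the point the slide makes about morphology and noun and verb groups.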

Page 22

Unique capabilities of Text Mining

Use-Cases

Page 23

Biomarker Discovery - Genes

• Gene (from Entrez)

• Complex linguistic relationship

• Disease (from MedDRA)

• Relevant sentence extracted with terms highlighted

• Link to source document

Page 24

Categorizing Relationships

Use of NLP allows accurate and precise identification of biomarker relationships

Page 25

Patent Applications and Grants: Companies vs. Diseases

[Bar chart: patent applications and grants (axis 0 to 25,000) for Abbott, AZ, Bayer, BMS, GSK, Roche, Merck, Novartis, and Pfizer, broken down by disease category: Virus Diseases; Substance-Related Disorders; Stomatognathic Diseases; Skin and Connective Tissue Diseases; Respiratory Tract Diseases; Parasitic Diseases; Otorhinolaryngologic Diseases; Occupational Diseases; Nutritional and Metabolic Diseases; Nervous System Diseases; Neoplasms; Musculoskeletal Diseases; Mental Disorders; Male Urogenital Diseases; Immune System Diseases; Hemic and Lymphatic Diseases; Female Urogenital Diseases and Pregnancy Complications; Eye Diseases; Endocrine System Diseases]

Page 26

Find properties

Melting Points for Exemplified Compounds

Output to e.g. Excel
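A minimal sketch of this kind of property extraction, assuming melting points appear in patterns such as "m.p. 142-144 °C": the passages, values, and column names below are invented placeholders, and a regular expression is only a crude substitute for a text mining platform. The result is written as CSV text, which spreadsheet tools such as Excel can open.

```python
import csv
import io
import re

# Invented patent-style snippets; the compounds and values are placeholders.
passages = [
    "Example 12: the title compound was obtained as a white solid, m.p. 142-144 °C.",
    "Example 13 gave a pale yellow powder (melting point 98 °C).",
    "Example 14 was isolated as an oil.",
]

pattern = re.compile(
    r"Example\s+(\d+).*?"                    # example number
    r"(?:m\.p\.|melting point)\s*"           # property label
    r"(\d+(?:\.\d+)?)"                       # lower (or only) value
    r"(?:\s*[-–]\s*(\d+(?:\.\d+)?))?"        # optional upper value of a range
    r"\s*°\s*C",
    re.IGNORECASE,
)

rows = []
for passage in passages:
    m = pattern.search(passage)
    if m:
        low = float(m.group(2))
        high = float(m.group(3)) if m.group(3) else low
        rows.append({"example": int(m.group(1)), "mp_low_c": low, "mp_high_c": high})

# Write CSV text that a spreadsheet application can import.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["example", "mp_low_c", "mp_high_c"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Passages without a melting point, like the third one, simply produce no row.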

Page 27

Linking from Definitions to Table Values

Connecting information found in different parts of the document, for example finding a compound as “Example 12” in a patent and linking it to a table where numerical data is reported

Patent document

Combined into a row of data in the structured results table

Patent Data from IFI Claims Direct

Page 28

Claim Chain Information

• For information in claims, we often want to work back along the chain of claims to see what the current claim depends on

[Screenshot labels: Compounds; Treats: cervical cancer; Peptide Seq: Residues 33-176]
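Working back along a claim chain can be sketched as follows. The claim texts are invented for illustration (they only borrow the slide's "cervical cancer" and "residues 33-176" labels), the function names are hypothetical, and the sketch assumes each dependent claim refers to exactly one earlier claim, which real patents do not guarantee.

```python
import re

# Invented claim text for illustration only.
claims = {
    1: "A compound comprising a peptide sequence of residues 33-176.",
    2: "The compound of claim 1, wherein the peptide is glycosylated.",
    3: "A composition comprising the compound according to claim 2.",
    4: "A method of treating cervical cancer using the composition of claim 3.",
}

CLAIM_REF = re.compile(r"\bclaim\s+(\d+)", re.IGNORECASE)


def parent_claim(number, claims):
    """Return the claim number that `number` refers back to, or None."""
    match = CLAIM_REF.search(claims[number])
    return int(match.group(1)) if match else None


def claim_chain(number, claims):
    """Walk back from a claim to the independent claim it ultimately depends on."""
    chain = [number]
    seen = {number}
    while (parent := parent_claim(chain[-1], claims)) is not None:
        if parent in seen or parent not in claims:
            break  # guard against circular or dangling references
        chain.append(parent)
        seen.add(parent)
    return chain


print(claim_chain(4, claims))
```

With the chain in hand, a fact found in claim 4 (the treated disease) can be joined to a fact found in claim 1 (the peptide residues).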

Page 29

Benefits of Text Mining Using Full Text

• Analysis of PubMed Central records

• Look for mentions of analytical chemistry techniques

• Identify concepts in abstract vs. body

Many more mentions of experimental techniques in full text compared to abstract alone!

Analytical Chemistry Techniques Section

Page 30

Thank You!