Source: J. Kreulen. This is where I work: the IBM Almaden Research Center

Post on 07-Apr-2016


1

Source: J. Kreulen

This is where I work: the IBM Almaden Research Center

2

Why am I here?

(To explain)

- what we are doing with computer curation (text and image analytics)

- why it matters to the scientific community

- how it can affect your work and give you a competitive advantage

3

Computer Curation of Patents & Scientific Literature

[Information Analytics] [Transforming Information into Value]

Stephen K. Boyer, Ph.D.

SBoyer@us.ibm.com

408-858-5544

4

The Problem

All content and no discovery?

5

The Question

Can we use computers to "read" documents, identify critical entities, and make meaningful associations that can help us with our work?

6

Patents and scientific papers contain molecular data in a variety of forms

For example:

As text: chemical names in the text of the document

As bitmap images: chemical structure drawings found in the document

7

 a) (2P/4S)-4-[4-Amino-5-(4-benzyloxy-phenyl)pyrrolo[2,3-d]pyrimidin-7-yl]-2-hydroxymethyl-pyrrolidine-1-carboxylic acid tert-butyl ester prepared analogously to Example 18 starting from (2R/4S)-4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester 2-ethyl ester (Example 20a). 1 H-NMR (CDCl3, ppm): 8.52 (s, 1H), 7.52-7.32 (m, 7H), 7.1 (d, 2H), 6.95 (d,1 H), 5.50 (m, 1H), 5.13 (s, 2H), 4.62-4.42 (m, 2H), 4.28 (m, 2H), 4.10 (m, 1H), 3.95-3.70 (m, 1H), 2.75 (m, 1H), 2.50 (m, 1H),1.49 (s, 9H).     b) (2R/4S)-{4-[4-Amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidin-2-yl}-methanol: 0.100 g of (2R/4S)4-[4-amino-5-(4-benzyloxy-phenyl)-pyrrolo[2,3-d]pyrimidin-7-yl]-pyrrolidine-1,2-dicarboxylic acid 1-tert-butyl ester is dissolved in 4 ml of tetrahydrofuran; 10 ml of 4M hydrogen chloride in diethyl ether are added, and stirring is carried out for 1 hour at room temperature. The product is filtered off and dried under a high vacuum. The dihydrochloride of the title compound is obtained. 1 H-NMR (CD3 OD, ppm): 8.4 (s, 1H); 7.60 (s, 1H), 7.5-7.10 (m, 9H), 5.65 (m, 1H), 5.18 (s, 2H), 4.32 (m, 1H), 4.00-3.65 (m, 4H), 2.60 (m, 2H). EXAMPLE 24(2R/4S)-4-(4-Amino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester 0.130 g of (2R/4S)-4-(4-benzyloxycarbonylamino-5-phenyl-pyrrolo[2,3-d]pyrimidin-7-yl)-1-(2,2-dimethyl-propionyl)-pyrrolidine-2-carboxylic acid ethyl ester is dissolved in 8 ml of methanol, and the solution is hydrogenated over 0.030 g of palladium-on-carbon (10%) for 1 hour at normal pressure. The catalyst is removed by filtration, the filtrate is concentrated by

Can you find the key molecules in this Novartis patent?

[Chemical nomenclature can be daunting]

8

[Same Novartis patent excerpt as on the previous slide]

What is this compound?

[Figure: 2D structure drawing of the compound; only the atom labels N, O, HO, and NH2 survive from the image]

Do you know what this chemical is?

entity identification

9

Valium (Trade Name)

= Diazepam (Generic Name)

= CAS # 439-14-5 (Chemical ID #)

ALBORAL, ALISEUM, ALUPRAM , AMIPROL ,ANSIOLIN , ANSIOLISINA , APAURIN, APOZEPAM, ASSIVAL , ATENSINE , ATILEN , BIALZEPAM , CALMOCITENE, CALMPOSE , CERCINE, CEREGULART, CONDITION, DAP, DIACEPAN, DIAPAM , DIAZEMULS , DIAZEPAN , DIAZETARD , DIENPAX, DIPAM , DIPEZONA, DOMALIUM , DUKSEN, DUXEN, E-PAM, ERIDAN, EVACALM, FAUSTAN, FREUDAL , FRUSTAN, GIHITAN, HORIZON, KIATRIUM, LA-III , LEMBROL, LEVIUM, LIBERETAS , METHYL DIAZEPINONE, MOROSAN , NEUROLYTRIL NOAN NSC-77518 PACITRAN PARANTEN PAXATE PAXEL PLIDAN QUETINIL QUIATRIL QUIEVITA RELAMINAL RELANIUM RELAX RENBORIN RO 5-2807 S.A. R.L. SAROMET SEDAPAM SEDIPAM SEDUKSEN SEDUXEN , SERENACK SERENAMIN SERENZIN SETONIL SIBAZON SONACON STESOLID STESOLIN , TENSOPAM TRANIMUL TRANQDYN TRANQUASE TRANQUIRIT , TRANQUO-TABLINEN , UMBRIUM UNISEDIL USEMPAX AP VALEO VALITRAN VALRELEASE VATRAN VELIUM, VIVAL VIVOL WY-3467

=

Valium has more than 149 "names"

Problem: I need to find the information about Valium

nomenclature issues

10

There are many different chemical names for Valium

Valium = Diazepam =

7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE

7-CHLORO-1-METHYL-5-PHENYL-3H-1,4-BENZODIAZEPIN-2(1H)-ONE

7-CHLORO-1-METHYL-5-PHENYL-1,3-DIHYDRO-2H-1,4-BENZODIAZEPIN-2-ONE

7-CHLORO-1-METHYL-2-OXO-5-PHENYL-3H-1,4-BENZODIAZEPINE

1-METHYL-5-PHENYL-7-CHLORO-1,3-DIHYDRO-2H-1,4-BENZODIAZEPIN-2-ONE

7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE

7-CHLORO-1-METHYL-5-3H-1,4-BENZIODIAZEPIN-2(1H)-ONE

CAS # 439-14-5 =

entity identification

11

Problems of taxonomies and name normalization

Valium Taxonomies &

Dictionaries

Multiple documents contain Information about Valium

Diazepam

Sedapam

DIAPAM

Medline In-house database

Choose keywords

439-14-5 (Chemical ID)

Chem. Abstracts

Pereira notebook 23a

7-CHLORO-1,3-DIHYDRO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE

Patent database

7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE

The scientist simply wants information about Valium
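The name-normalization idea on this slide can be sketched with a small synonym dictionary. This is only an illustration, assuming a hand-built table: the names and the CAS number come from the slides, while `SYNONYMS`, `normalize_name`, and `same_chemical` are hypothetical helpers, not part of any IBM system.

```python
# Toy synonym dictionary. The names and the CAS number appear on the slides;
# the lookup scheme itself is an illustrative assumption.
DIAZEPAM_CAS = "439-14-5"

SYNONYMS = {
    "valium": DIAZEPAM_CAS,
    "diazepam": DIAZEPAM_CAS,
    "sedapam": DIAZEPAM_CAS,
    "diapam": DIAZEPAM_CAS,
    "7-chloro-1,3-dihydro-1-methyl-5-phenyl-2h-1,4-benzodiazepin-2-one": DIAZEPAM_CAS,
    "7-chloro-1-methyl-5-phenyl-2h-1,4-benzodiazepin-2-one": DIAZEPAM_CAS,
}


def normalize_name(name):
    """Return the canonical identifier for a chemical name, or None if unknown."""
    key = " ".join(name.split()).lower()  # collapse whitespace, ignore case
    return SYNONYMS.get(key)


def same_chemical(name_a, name_b):
    """True when both names normalize to the same known identifier."""
    id_a = normalize_name(name_a)
    return id_a is not None and id_a == normalize_name(name_b)
```

With such a table, a query for "Valium" and a document that only says "DIAPAM" resolve to the same identifier. A dictionary covers only the names someone has entered, which is why the later slides move to structure-based identifiers.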

12-13

Considerations for searching documents (or web pages) for chemical substances

Chemicals have a wide variety of trivial and official names.

A plain text search for one name cannot find documents that refer to the chemical by one of its alternative names.

Synonym expansion is insufficient.

Searching by structure could be helpful.

Source: J. Cooper / IBM

Name normalization is important

14

Finding similar structures, not just similar text!

In addition, we would like to find compounds that are supersets of the given structure.

For example: toluene and methylnaphthalene

Source J Cooper / IBM

Find documents with similar structures

Text searches will not find documents with similar structures
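The toluene / methylnaphthalene example can be made concrete with a small subgraph search. The sketch below is an assumption-laden illustration: it treats molecules as carbon-skeleton graphs only (no hydrogens, atom types, bond orders, or aromaticity), and `make_graph` and `is_substructure` are hypothetical helper names. Production chemistry toolkits do considerably more.

```python
def make_graph(n_atoms, bonds):
    """Build an undirected adjacency map {atom: set(neighbors)}."""
    adj = {i: set() for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def is_substructure(query, target):
    """True if every atom and bond of `query` maps onto distinct atoms of `target`."""
    q_atoms = list(query)

    def extend(mapping):
        if len(mapping) == len(q_atoms):
            return True
        q = q_atoms[len(mapping)]
        used = set(mapping.values())
        for t in target:
            if t in used:
                continue
            # Every already-mapped neighbor of q must be bonded to t in the target.
            if all(mapping[n] in target[t] for n in query[q] if n in mapping):
                mapping[q] = t
                if extend(mapping):
                    return True
                del mapping[q]  # backtrack
        return False

    return extend({})


# Carbon skeletons only: six-membered ring 0..5 plus a methyl carbon on atom 0.
TOLUENE = make_graph(7, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)])

# Two fused six-membered rings sharing the 4-5 bond, plus a methyl carbon on atom 0.
METHYLNAPHTHALENE = make_graph(
    11,
    [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0),
     (5, 6), (6, 7), (7, 8), (8, 9), (9, 4), (0, 10)],
)
```

Here `is_substructure(TOLUENE, METHYLNAPHTHALENE)` succeeds while the reverse does not, which is the "superset" relationship the slide describes and which no comparison of the two names as text would reveal.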

15

Computer curation now involves multiple types of analysis

• Analysis of text

• Analysis of image

• Analysis of XML files

Derived metadata

Internal data

IBM + Collaborator input

Output db to Collaborators

• Analysis of (CWU’s )

NIH

16

Paper Words

Chemical Names

Dictionary of the English Language – minus – the Dictionary of Desired Entities

toluene

[CC1=CC=CC=C1]

CH3

Name=Structure SMILES String

2D Structure

methyl benzene

Computational Resources

Blue Gene – enabled -

Summary of the whole text analysis operation for chemistry

Options to compute 300 properties per molecule

Flowchart of the whole text analysis process

(HMM, CRF, CFG)
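One plausible reading of the "dictionary minus dictionary" step is a stop-word filter: remove ordinary English words from the paper's words, but first take the desired entities out of the English dictionary so they are not discarded. The sketch below assumes that reading. The word lists are tiny made-up stand-ins, and the slide's statistical methods (HMM, CRF, CFG) are not modeled at all.

```python
# Tiny stand-ins for the two dictionaries named on the slide (assumed contents).
ENGLISH_WORDS = {"the", "was", "dissolved", "in", "and", "heated", "with",
                 "benzene", "solution", "a", "of"}
DESIRED_ENTITIES = {"benzene", "toluene", "hydrazine"}

# English dictionary minus the entities we want to keep.
STOP_WORDS = ENGLISH_WORDS - DESIRED_ENTITIES


def candidate_chemical_names(text):
    """Return the tokens that survive removal of ordinary English words."""
    tokens = [word.strip(".,;:") for word in text.split()]
    return [t for t in tokens if t and t.lower() not in STOP_WORDS]
```

Note that "benzene" survives even though it appears in the English word list, because the subtraction removes it from the stop words. Each surviving name would then go to the name-to-structure step, for example toluene to the SMILES string CC1=CC=CC=C1 shown on the slide.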

17

5-chloro-N-methyl-N-phthalimidoacetylanthranilic acid

N-aminoacetyl-5-chloro-N-methylanthranilic acid

Phosphorus pentachloride

aluminum chloride

hydrazine

7-chloro-1,3-dihydro-1-methyl-5-phenyl-2H-1,4-benzodiazepin-2-one

benzene

Chemical entities extracted from the page

Step 1: Identify the chemical entities

Step 2: Extract the chemical names

Entity extraction

18

Name Structure Program

7-CHLORO-1-METHYL-5-PHENYL-2H-1,4-BENZODIAZEPIN-2-ONE

language-free entities

SMILES strings:

c1ccccc1

  6  6  0  0  0  0  0  0  0  0999 V2000
    6.7092    5.6087    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.7076    4.5056    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6607    3.9551    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6160    4.5062    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6121    5.6136    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6583    6.1591    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  6  1  1  0  0  0  0
M  END

Connection tables

InChI=1/C6H6/c1-2-4-6-5-3-1/h1-6H

Step 3: Convert chemical names into chemical structures

Convert the chemicals into machine-readable formats!
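To show what "machine readable" means for the connection table above, here is a minimal parser sketch for the benzene block from the slide. It assumes whitespace-separated fields, which works for this small example but is not how the fixed-width V2000 format is strictly defined (atom or bond counts of 100 or more would run together). `parse_connection_table` is a hypothetical helper name.

```python
# Benzene connection table as shown on the slide (the three header lines are absent there).
BENZENE_MOLBLOCK = """\
  6  6  0  0  0  0  0  0  0  0999 V2000
    6.7092    5.6087    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.7076    4.5056    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6607    3.9551    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6160    4.5062    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    8.6121    5.6136    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.6583    6.1591    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  6  1  1  0  0  0  0
M  END
"""


def parse_connection_table(block):
    """Parse the counts, atom, and bond lines of a small V2000 connection table."""
    lines = [ln for ln in block.splitlines() if ln.strip()]
    # Counts line: the first two fields are the atom and bond counts.
    n_atoms, n_bonds = (int(x) for x in lines[0].split()[:2])
    atoms = []
    for ln in lines[1:1 + n_atoms]:
        x, y, z, symbol = ln.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    bonds = []
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, order = (int(v) for v in ln.split()[:3])
        bonds.append((first, second, order))  # 1-based atom indices, bond order
    return atoms, bonds
```

The result is six carbon atoms with 2D coordinates and six bonds alternating between order 2 and order 1, which is the same molecule that the SMILES string c1ccccc1 and the InChI on this slide describe.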

19

Background information on InChIs

Source: Prof. Peter Murray-Rust

20

IBM Servers

Medline

Patents

Web Pages

Any text

HealthCare Life Science Data warehouse

Valium

Benzene

11 Million patent documents

18 Million Medline abstracts

100 Million chemical structures

>12 Million unique

Step 4: Automate the process

Scale up and automate the process

21

Examples

Chemicals derived from text analytics

22

Large Computational Environment

Find and compute the 3D structures

Identify each disease

Identify every Medline MeSH code

Identify the occurrence of each biomarker

Equivalent to 240,000 simultaneous Google searches

Data warehouse

Compute properties, & find relationships,

Chemical & Biological information derived from text analytics

23

Current Activities …

24

Current activity: "in line" entity tagging & classification of chemical names

[Figure: annotation workflow; legend: Chemical, Target, Disease, Assay data]

Text -> [Text + Annotations] -> Annotated Text

Identify every chemical name

Convert all chemical names into their chemical structures [SMILES], then convert those SMILES into InChIs and InChIKeys (a unique identifier for the chemical)

Augment every chemical name with the term "inchikey" and the unique InChIKey for that chemical. The InChIKeys are now indexed as if they were words (text) in the document

Re-index the augmented text [InChIKeys] with SOLR

= aspirin = inchikey = BSYNRYMUTXBXSQ-UHFFFAOYSA-N

= aspirin = SMILES string = CC(=O)OC1=CC=CC=C1C(=O)O

dB SOLR index

Text Index + Annotation Index

Add the derived structures, annotations (& metadata) to our database
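The augmentation step can be sketched as inserting the InChIKey into the text right after each recognized name, so that a word index picks it up like any other token. This is a toy under stated assumptions: the aspirin key is from the slide, but `INCHIKEYS`, `augment_with_inchikeys`, and `build_index` are hypothetical, and the plain dictionary index merely stands in for SOLR.

```python
import re

# Name -> InChIKey table. The aspirin key is from the slide; the table stands in
# for the name-to-structure conversion step.
INCHIKEYS = {"aspirin": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}


def augment_with_inchikeys(text, table=INCHIKEYS):
    """Insert 'inchikey <KEY>' after each known chemical name in the text."""
    if not table:
        return text
    names = sorted(table, key=len, reverse=True)  # prefer the longest name
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b", re.IGNORECASE
    )

    def add_key(match):
        name = match.group(1)
        return f"{name} inchikey {table[name.lower()]}"

    return pattern.sub(add_key, text)


def build_index(docs):
    """Toy inverted index over augmented text: token -> set of document ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in augment_with_inchikeys(text).split():
            index.setdefault(token.strip(".,;:").lower(), set()).add(doc_id)
    return index
```

Because the key is stored as an ordinary token, a search for the InChIKey finds every document that mentions the chemical under any name the table knows, without changing the search engine itself.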

25

Aspirin

InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)

InChIKey = BSYNRYMUTXBXSQ-UHFFFAOYSA-N

SMILES = O=C(Oc1ccccc1C(=O)O)C

MOL File

26

Aspirin MOL file

  Mrv0541 03191312032D

 13 13  0  0  0  0            999 V2000
    1.4289    3.3000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.4289    2.4750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145    2.0625    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145    1.2375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4289    0.8250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.4289   -0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7145   -0.4125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.8250    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7145    1.2375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7145    2.0625    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -1.4289    0.8250    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.1434    2.0625    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  1  0  0  0  0
  4  5  4  0  0  0  0
  5  6  4  0  0  0  0
  6  7  4  0  0  0  0
  7  8  4  0  0  0  0
  8  9  4  0  0  0  0
  4  9  4  0  0  0  0
  9 10  1  0  0  0  0
 10 11  2  0  0  0  0
 10 12  1  0  0  0  0
  2 13  1  0  0  0  0
M  END

27

Current activity: "in line" entity tagging & classification for targets (= geneids)

[Figure: annotation workflow; legend: Chemical, Target, Disease, Assay data]

Text -> [Text + Annotations]

Identify all targets [gene names & their synonyms]

Augment all target names with a tag "geneid" & the NCBI unique identifier # for that target

Re-index the augmented text + geneid identifiers with SOLR

= JAK3 + aliases = geneid = NCBI ID # 3718

dB SOLR index

Annotated Text Index

Add the derived annotations (& metadata) to our master database

28

Current activity: "in line" entity tagging & classification for MeSH terms

[Figure: annotation workflow; legend: Chemical, Target, Disease, Assay data]

Text -> [Text + Annotations]

Identify all known MeSH terms [for example, diseases (C01) or signs & symptoms (C23)]

Identify & augment every occurrence of every MeSH term with a tag "MeSH" & the specific MeSH code identifier

Re-index the augmented text + the MeSH tags with SOLR

= Headache = MeSH term = C23 sign or symptom

dB SOLR index

Text Index + Annotation Index

Text = Headache: new index of the original text plus all of its associated annotated information

Add the derived annotations (& metadata) to our master database

29

A Sample of Augmented Text

“Interactions of ibogaine and D-amphetamine:[ibmentity type="drug" name="amphetamine" value="amphetamine" chebitype="neurotoxin,toxin"] in vivo microdialysis and motor behavior in rats Ibogaine, an indolalkylamine, has been proposed for use in treating stimulant addiction. In the present study we sought to determine if ibogaine had any effects on the neurochemical and motor changes induced by D-amphetamine[ibmentity type="drug" name="amphetamine" value="amphetamine" chebitype="neurotoxin,toxin"] that would substantiate the anti-addictive claim. Ibogaine (40 mg/kg, i.p.) injected 19 h prior to a D-amphetamine[ibmentity type="drug" name="amphetamine" value="amphetamine" chebitype="neurotoxin,toxin"] challenge (1.25 mg/kg, i.p.) potentiated the expected rise in extracellular dopamine[ibmentity type="drug" name="dopamine" value="dopamine" chebitype="pharmacological role,neurotransmitter agent"] levels in the striatum[ibmentity type="target" name="striatum" value="striatum" targettype="tissue"] and in the nucleus accumbens, as measured by microdialysis in freely moving rats. Using …”
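The inline tags in the sample above are regular enough to read back with two regular expressions. The sketch below assumes the tag syntax is exactly what the sample shows (square brackets, `ibmentity`, then `key="value"` pairs). `extract_entities` and `strip_tags` are hypothetical helper names.

```python
import re

# Tag syntax inferred from the sample text only; an assumption, not a specification.
TAG_RE = re.compile(r"\[ibmentity\s+([^\]]*)\]")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

# A fragment of the sample augmented text from the slide.
SAMPLE = (
    'extracellular dopamine[ibmentity type="drug" name="dopamine" value="dopamine" '
    'chebitype="pharmacological role,neurotransmitter agent"] levels in the '
    'striatum[ibmentity type="target" name="striatum" value="striatum" targettype="tissue"]'
)


def extract_entities(augmented_text):
    """Return one attribute dict per [ibmentity ...] tag, in text order."""
    return [dict(ATTR_RE.findall(m.group(1))) for m in TAG_RE.finditer(augmented_text)]


def strip_tags(augmented_text):
    """Recover the plain text by deleting the inline tags."""
    return TAG_RE.sub("", augmented_text)
```

Applied to `SAMPLE`, `extract_entities` yields a drug entity (dopamine) and a target entity (striatum), while `strip_tags` returns the original sentence fragment, so the annotations add information without losing the source text.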

30

= Chemical_"inchikey BSYNRYMUTXBXSQ-UHFFFAOYSA-N"

= Target

= Disease

= Assay data

Text

= Chemical

compound

Target 1

Target 2

Target 3

Target 1

Target 2

Target 3

= [target _gene name]

Target 4

Target 5

Compound-target associations known from the literature

Compound-target associations known from SEA or other computations

Compound-target associations known from NIH or other experimental sources (NIH HTS assay data)

dB

Overall objective: integrate [compound-target] associations derived from literature + computations + additional experimental efforts (HTS)

In-line text tagging (classification) coupled with computational & experimental data
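The integration objective can be sketched as merging (compound, target) pairs from the three kinds of sources and recording which sources support each pair. Everything in the sketch is hypothetical: the function names, the source labels, and any pairs fed to it are placeholders, not real findings.

```python
from collections import defaultdict


def integrate_associations(**sources):
    """Merge {source_name: [(compound, target), ...]} into pair -> set of source names."""
    merged = defaultdict(set)
    for source_name, pairs in sources.items():
        for compound, target in pairs:
            merged[(compound, target)].add(source_name)
    return dict(merged)


def supported_by(merged, min_sources):
    """Pairs backed by at least `min_sources` independent sources, sorted."""
    return sorted(pair for pair, srcs in merged.items() if len(srcs) >= min_sources)
```

A usage example with placeholder data: calling `integrate_associations(literature=[("compound", "Target 1")], computation=[("compound", "Target 1"), ("compound", "Target 4")])` and then `supported_by(merged, 2)` returns only the pair that both sources agree on. Keying compounds by InChIKey and targets by geneid, as the earlier slides describe, is what would make pairs from different sources comparable.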

31

Computer Curation Process Overview & integration with our collaborators

Data Sources

U.S. Patents (1976-2009)

U.S. Pre-Grants (All)

PCT & EPO Apps

Medline Abstracts (>18 M)

Selected Internet Content

In-House Content

Annotation Factory

Parse & Extract data

Annotator 1

Annotator 2

Computational Analytics (Semantic Associations)

eClassifier & Other Data Associations

Database + computed metadata

IP Database (e.g. DB2)

ADU*

ChemVerse db

User Applications

Knime or Pipeline Pilot

BIW

SIMPLE

ChemAxon Search

Cognos/DDQB/Other Apps

View selected Documents & Reports

ChemVerse

Services hosted at IBM Almaden

* ADU = Automated Data Update

32

Examples

Why this is important, and what it lets us do that we could not easily do before

33

Batch Analysis

For example: you are about to file a patent application that contains roughly 300 to 400 chemical compounds. How do you know whether any of these compounds has been patented before?

34

Paste a list of InChIKeys to be batch searched here!

35

1. Input the list of InChIKeys to be batch searched here!

2. Click run search!

36

Results from batch search of InChIKeys!

Diavan Glipazol Ibuprofen Aspirin Lotensin Imitrex Nabumetone Tessalon Sulfamethoxazole Trimethoprim Cyclobenzaprine Guaifenesin Oxymetazoline Anvitoff Dextromethorphan Lyrica Celexa

One can readily search hundreds or even thousands of compounds at a time to see whether any of them have already been patented, by whom, and for what purpose
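The batch search can be sketched as two small steps: clean the pasted list, then look each key up in an index. The format check relies on the standard InChIKey layout (14, 10, and 1 uppercase letters separated by hyphens). The index contents and the names `parse_key_list` and `batch_search` are hypothetical placeholders for the real service.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def parse_key_list(pasted):
    """Split pasted text into unique, well-formed InChIKeys (input order preserved)."""
    seen, keys = set(), []
    for token in pasted.split():
        token = token.strip(",;").upper()
        if INCHIKEY_RE.match(token) and token not in seen:
            seen.add(token)
            keys.append(token)
    return keys


def batch_search(keys, index):
    """Map each key to the sorted document ids that mention it (empty list if none)."""
    return {key: sorted(index.get(key, ())) for key in keys}
```

Given an index such as `{"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"DOC-1", "DOC-2"}}` (made-up document ids), every pasted key comes back with either its matching documents or an empty list, which is the "already patented or not" answer the slide describes for hundreds of compounds at once.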
