Processing Natural Language Comments in Biological Databases: Molecular Assemblies and Their Catalytic Functions. A Case Study. Uwe Reyle, Institute of Computational Linguistics, University of Stuttgart


Post on 13-Jan-2016


Page 1: Uwe Reyle Institute of Computational Linguistics University of Stuttgart

Processing Natural Language Comments in Biological Databases: Molecular Assemblies and Their Catalytic Functions.

A Case Study

Uwe Reyle

Institute of Computational Linguistics

University of Stuttgart

Page 2

EML: European Media Laboratory, Heidelberg

INRIA: Institut National de Recherche en Informatique et en Automatique, Grenoble

Page 3

Biological Databases

Proteins

Enzymes

Compounds

Pathways

flat files – no relational/deductive databases

made for Biologists – not for Machines

Page 4

Biological Databases

Proteins

Enzymes

Compounds

Pathways

Data-Model

Ontology

Efficient Querying

Page 5

Overview

• Genes, Proteins and Enzymes
• Swissprot Protein Database
• Two Examples
  – Semantic Processing
  – Parsing Protein Names
• Merits for
  – Coreference Resolution
  – Extraction/Detection of Molecular Assemblies

Page 6

[Diagram. Entities: Chromosome, gene, polypeptide, molecular assembly, enzyme (EC), compounds (e.g. sugar...), biochemical reactions. Processes and relations: Transcription, Translation, Posttranslational Modifications, Molecular Assembly, Catalytic Activity, Pathways.]

Page 7

[Diagram. Entities: Chromosome, gene, polypeptide, molecular assembly, enzyme (EC), compounds (e.g. sugar...), biochemical reactions. Labels from Swissprot Entries attached to the diagram: DE POS, DE INCLUDES CONTAINS, SUBUNIT, FUNCTION, CATALYTIC ACTIVITY, PATHWAY.]

Page 8

Reference Database

Proteins

Enzymes

Compounds

Pathways

Swissprot

Page 9

Swissprot vs. Medline

Papers

Medline Abstracts

Swissprot

• fact + organism + experimental context
• enormous vocabulary
• coreference = intra-document coreference + coreference to database

IE

• fact
• much smaller vocabulary
• coreference = intra-document coreference

Page 10

RECOMMENDED NAME dipeptidyl-peptidase IV

SYNONYMS
peptidase, dipeptidyl, IV
Pep X
leukocyte antigen CD26
glycylprolyl dipeptidylaminopeptidase
glycylproline-dipeptidyl-aminopeptidase
glycylproline aminopeptidase
Xaa-Pro-dipeptidyl-aminopeptidase
dipeptidyl-peptide hydrolase
lymphocyte, antigen CD26
postproline dipeptidyl aminopeptidase IV
glycylprolyl aminopeptidase
dipeptidyl-aminopeptidase IV
Gly-Pro-naphthylamidase
DPP IV/CD26
glycoprotein GP110
amino acyl-prolyl dipeptidyl aminopeptidase
dipeptidyl aminopeptidase IV
T cell triggering molecule Tp103
dipeptidyl-peptidase IV (CD26)
X-prolyl dipeptidyl aminopeptidase
X-PDAP
aminopeptidase, glycylproline

Coreference to DE-line of database entry
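Resolving a mention in a comment to the DE line of an entry can be approximated by a lookup in a synonym index. The following is only a minimal sketch: the helper names `normalize`, `build_synonym_index` and `resolve` are hypothetical, and the normalization rule (lowercase, punctuation collapsed to spaces) is an assumption, not the procedure used in the case study. The synonyms are a small sample from the list above.

```python
import re
from typing import Dict, List, Optional


def normalize(name: str) -> str:
    """Lowercase and collapse punctuation/whitespace so spelling variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def build_synonym_index(recommended: str, synonyms: List[str]) -> Dict[str, str]:
    """Map every normalized synonym (and the recommended name itself) to the recommended name."""
    index = {normalize(recommended): recommended}
    for synonym in synonyms:
        index[normalize(synonym)] = recommended
    return index


def resolve(mention: str, index: Dict[str, str]) -> Optional[str]:
    """Return the recommended name a mention refers to, or None if it is unknown."""
    return index.get(normalize(mention))


index = build_synonym_index(
    "dipeptidyl-peptidase IV",
    ["dipeptidyl aminopeptidase IV", "leukocyte antigen CD26", "X-PDAP"],
)
print(resolve("Dipeptidyl-aminopeptidase IV", index))
```

A pure dictionary lookup of this kind only works where a synonym list exists; for assembly names the later slides argue that no such dictionary can be assumed.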

Page 11

Structure of Swissprot Entries

3. The different line types
   3.1   The ID line
   3.2   The AC line
   3.3   The DT line
   3.4   The DE line
   3.5   The GN line
   3.6   The OS line
   3.7   The OG line
   3.8   The OC line
   3.9   The OX line
   3.10  The reference (RN, RP, RC, RX, RA, RT, RL) lines
   3.11  The CC line
   3.12  The DR line
   3.13  The KW line
   3.14  The FT line
   3.15  The SQ line
   3.16  The sequence data line
   3.17  The // line

Quality of information by marking: experiment, similarity, ...

Each entry refers to a polypeptide in one single organism

Page 12

An Example

ID ACCD_ECOLI STANDARD; PRT; 304 AA.
AC P08193; P78251; P76937;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
   (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
       CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
       CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
       TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
       CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
       OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.

-!- CATALYTIC ACTIVITY, PATHWAY, SIMILARITY, FEATURES, ...
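Because every line of an entry starts with a two-letter line code, the flat file can be split into fields before any linguistic processing. The sketch below assumes the standard flat-file layout in which continuation lines repeat their line code and CC topics are introduced by `-!- TOPIC:`; the function names `parse_entry` and `comment_topics` are hypothetical, and the sample entry is abridged from the example above.

```python
from collections import defaultdict
from typing import Dict, List

ENTRY = """\
ID   ACCD_ECOLI     STANDARD;      PRT;   304 AA.
AC   P08193; P78251; P76937;
DE   ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
DE   (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC   -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF
CC       BIOTIN CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE
CC       TWO SUBUNITS OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
//
"""


def parse_entry(text: str) -> Dict[str, List[str]]:
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        code, _, content = line.partition(" ")
        if len(code) == 2 and content.strip():
            fields[code].append(content.strip())
    return dict(fields)


def comment_topics(cc_lines: List[str]) -> Dict[str, str]:
    """Join CC lines into one string per topic (FUNCTION, SUBUNIT, ...)."""
    topics: Dict[str, str] = {}
    current = None
    for line in cc_lines:
        if line.startswith("-!-"):
            topic, _, rest = line[3:].partition(":")
            current = topic.strip()
            topics[current] = rest.strip()
        elif current is not None:
            topics[current] += " " + line  # continuation of the current topic
    return topics


fields = parse_entry(ENTRY)
topics = comment_topics(fields["CC"])
print(topics["SUBUNIT"])
```

The free-text value of each topic, such as the SUBUNIT comment, is what the rest of the talk is concerned with.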

Page 13

Variety of SUBUNIT-lines

• HETERODIMER...

• PP2 CONSISTS OF A COMMON HETERODIMERIC CORE ENZYME, COMPOSED OF A 36 KDA CATALYTIC SUBUNIT (SUBUNIT C) AND A 65 KDA CONSTANT REGULATORY SUBUNIT (PR65 OR SUBUNIT A), THAT ASSOCIATES WITH A VARIETY OF REGULATORY SUBUNITS. PROTEINS THAT ASSOCIATE WITH THE CORE DIMER INCLUDE THREE FAMILIES OF ...

Page 14

Subunit-lines of type NP

<DE> A-kinase anchor protein 5
<SUBUNIT> BINDING PROTEIN FOR DIMER OF PKA AND ALSO FOR PKC AND PP2B. EACH ENZYME IS INHIBITED WHEN BOUND TO THE ANCHOR PROTEIN.

<DE> Potassium-transporting ATPase alpha chain
<SUBUNIT> HETERODIMER COMPOSED OF TWO SUBUNITS, ALPHA AND BETA.

Page 15

Subunit-lines of type NP

<DE> A-kinase anchor protein 5
<SUBUNIT> BINDING PROTEIN FOR DIMER OF PKA AND ALSO FOR PKC AND PP2B. EACH ENZYME IS INHIBITED WHEN BOUND TO THE ANCHOR PROTEIN.

(AKAP5 : (PKA : PKA)), where PKA is inhibited
(PKC : AKAP5) and (PP2B : AKAP5), where PKC and PP2B are inhibited

<DE> Potassium-transporting ATPase alpha chain
<SUBUNIT> HETERODIMER COMPOSED OF TWO SUBUNITS, ALPHA AND BETA.

Page 16

Subunit-lines of type NP

<DE> A-kinase anchor protein 5
<SUBUNIT> BINDING PROTEIN FOR DIMER OF PKA AND ALSO FOR PKC AND PP2B. EACH ENZYME IS INHIBITED WHEN BOUND TO THE ANCHOR PROTEIN.

<DE> Potassium-transporting ATPase alpha chain
<SUBUNIT> HETERODIMER COMPOSED OF TWO SUBUNITS, ALPHA AND BETA.

Potassium-transporting ATPase (alpha : beta)
Potassium-transporting ATPase alpha chain (alpha : beta)

Page 17

Subunit-lines of type NP

<DE> A-kinase anchor protein 5
<SUBUNIT> BINDING PROTEIN FOR DIMER OF PKA AND ALSO FOR PKC AND PP2B. EACH ENZYME IS INHIBITED WHEN BOUND TO THE ANCHOR PROTEIN.

<DE> Potassium-transporting ATPase alpha chain
<SUBUNIT> HETERODIMER COMPOSED OF TWO SUBUNITS, ALPHA AND BETA.

Potassium-transporting ATPase (alpha : beta)
Potassium-transporting ATPase alpha chain (alpha : beta)

Task: parse recommended name

Page 18

Structure of Polypeptide Names that Refer to Subunits of Proteins

AssemblyName SubunitRef

Protein Name
Enzyme Name

{beta 1, ASHI, lacH, ...} subunit
30 kda subunit
{small, major, second largest, ...} subunit
type B catalytic subunit
subunit {alpha 3, 2 type B, ...}
iron-sulfur subunit alpha-2
{alpha, light, catalytic, ...} chain
cytochrome B-558

homolog
precursor; phrase(s)

vacuolar
soluble
anaerobic
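The AssemblyName + SubunitRef structure can be illustrated with a small pattern-based splitter. This is a sketch under strong assumptions: the modifier list is a tiny hand-picked sample of the patterns above, the function name `split_name` is hypothetical, and a single regular expression is certainly not sufficient for the full variety of names.

```python
import re
from typing import Optional, Tuple

# Hand-picked sample of subunit modifiers (an assumption, not an exhaustive list).
MODIFIERS = r"(?:alpha|beta|gamma|small|large|light|heavy|catalytic|regulatory|\d+ kda)"

# AssemblyName, then a SubunitRef of the form "[modifier] subunit|chain [modifier]".
SUBUNIT_REF = re.compile(
    rf"^(?P<assembly>.+?)\s+"
    rf"(?P<ref>(?:{MODIFIERS}\s+)?(?:subunit|chain)(?:\s+{MODIFIERS})?)$",
    re.IGNORECASE,
)


def split_name(name: str) -> Tuple[str, Optional[str]]:
    """Split a recommended name into (AssemblyName, SubunitRef); SubunitRef is None if absent."""
    match = SUBUNIT_REF.match(name.strip())
    if match is None:
        return name.strip(), None
    return match.group("assembly"), match.group("ref")


print(split_name("Potassium-transporting ATPase alpha chain"))
print(split_name("ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA"))
```

Names without a subunit reference, such as BIOTIN CARBOXYLASE, are returned unchanged with `None` as the second component.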

Page 19

Problems

• We cannot assume a dictionary of assembly names
• An AssemblyName very often ends with a highly ambiguous symbol that may also be used to start the SubunitRef expression
  - F, A1, I, II, i, ..., geneName, ...
• A nomenclature for subunits does not exist
• Contextual knowledge is needed to disambiguate, e.g., XYase A1 large chain
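The ambiguity in "XYase A1 large chain" can be made explicit by enumerating every admissible split point instead of committing to one. The sketch below is illustrative only: the sets `HEADS`, `MODIFIERS` and `AMBIGUOUS` are small assumed samples, and `candidate_splits` is a hypothetical helper; choosing between the candidates is exactly where contextual knowledge would be needed.

```python
from typing import List, Tuple

HEADS = {"chain", "subunit"}
MODIFIERS = {"alpha", "beta", "small", "large", "light", "heavy", "catalytic"}
# Symbols that may end an AssemblyName or start a SubunitRef (sample from the slide).
AMBIGUOUS = {"F", "A1", "I", "II", "i"}


def candidate_splits(name: str) -> List[Tuple[str, str]]:
    """Return all (AssemblyName, SubunitRef) readings of a name ending in a head word."""
    tokens = name.split()
    if len(tokens) < 3 or tokens[-1].lower() not in HEADS:
        return []
    splits = []
    # Walk leftwards from the word before the head; stop at the first token
    # that can only belong to the AssemblyName.
    for i in range(len(tokens) - 2, 0, -1):
        token = tokens[i]
        if token.lower() not in MODIFIERS and token not in AMBIGUOUS:
            break
        splits.append((" ".join(tokens[:i]), " ".join(tokens[i:])))
    return splits


for assembly, ref in candidate_splits("XYase A1 large chain"):
    print(f"{assembly!r} + {ref!r}")
```

For this name the function yields two readings, one in which A1 belongs to the assembly name and one in which it belongs to the subunit reference.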

Page 20

Assembly Names

• Mitogen-activated protein kinase kinase kinase
  – kinase acting on a kinase that acts on a protein kinase
  – one of these kinases is mitogen-activated, not the protein, however
• „kinase“ has 1 semantic argument, namely the molecule X that it phosphorylates

Acceptor/Donor
    Acceptor/Donor  ...
    Group           phosphoryl
    Function        transfer
Group           phosphoryl
Function        transfer
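The one-argument semantics of "kinase" can be sketched as a recursively nested frame: each trailing "kinase" wraps whatever precedes it as its Acceptor/Donor. The code below is an illustration only. It takes the name to be "Mitogen-activated protein kinase kinase kinase", as the three-level gloss suggests; the class `TransferFrame` and the functions `parse_kinase_name` and `depth_of` are hypothetical, and the modifier is deliberately left unattached because the slide notes that it applies to one of the kinases rather than to the protein.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class TransferFrame:
    """Frame for a group-transferring enzyme; the acceptor may itself be such an enzyme."""
    acceptor: Union["TransferFrame", str]
    group: str = "phosphoryl"
    function: str = "transfer"


def parse_kinase_name(name: str) -> Tuple[Optional[str], TransferFrame]:
    """Return (unattached modifier, nested frame) for '<modifier> <argument> kinase ...'."""
    tokens = name.split()
    depth = 0
    while tokens and tokens[-1].lower() == "kinase":
        tokens.pop()
        depth += 1
    modifier = None
    if tokens and tokens[0].lower().endswith("-activated"):
        modifier = tokens.pop(0)  # attachment site is left open on purpose
    if depth == 0 or not tokens:
        raise ValueError("expected '<argument> kinase [kinase ...]'")
    frame: Union[TransferFrame, str] = " ".join(tokens)  # innermost argument
    for _ in range(depth):
        frame = TransferFrame(acceptor=frame)
    return modifier, frame


def depth_of(frame: Union[TransferFrame, str]) -> int:
    """Count how many kinase frames are nested."""
    count = 0
    while isinstance(frame, TransferFrame):
        count += 1
        frame = frame.acceptor
    return count


modifier, frame = parse_kinase_name("Mitogen-activated protein kinase kinase kinase")
print(modifier, depth_of(frame))
```

Deciding which of the nested kinases the modifier belongs to would require the lexical and domain knowledge discussed in the talk; the sketch only keeps that decision open.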

Page 21

Assembly Names

CoA Carboxylase Carboxyl Transferase

Acceptor/Donor  CoA Carboxylase
Group           carboxyl
Function        transfer

Acceptor/Donor  X
Group           carboxyl
Function        transfer
ADJ-Rel         CoA Carboxylase

With ADJ-Rel {,is_expressed_by, ...}

Page 22

Semantic Relations projected from the Lexicon

• carboxyl transferase   transcarboxylase (IUPAC)
• transcarboxylation   carboxylation
• transcarboxylate   carboxylate

• phosphorylate, biotinylate, adenylylate, ...
• transphosphorylate, ...
• crossphosphorylate, ...
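Relating "transcarboxylate" to "carboxylate" amounts to a morphological analysis into prefix and base verb. A minimal sketch follows, assuming a tiny base lexicon taken from the verbs listed above; the function name `analyse_verb` is hypothetical, and reading the transferred group off the verb by dropping "-ate" is a simplification made for this example.

```python
from typing import Dict, Optional

PREFIXES = ("trans", "cross")
# Base verbs from the slide; a real lexicon would be far larger.
BASE_VERBS = {"carboxylate", "phosphorylate", "biotinylate", "adenylylate"}


def analyse_verb(verb: str) -> Optional[Dict[str, Optional[str]]]:
    """Decompose a chemical verb into prefix, base verb and transferred group."""
    v = verb.lower()
    for prefix in PREFIXES:
        base = v[len(prefix):]
        # Only accept the prefix if the remainder is a known base verb,
        # so that ordinary words like "transfer" are not decomposed.
        if v.startswith(prefix) and base in BASE_VERBS:
            return {"prefix": prefix, "base": base, "group": base[:-3]}
    if v in BASE_VERBS:
        return {"prefix": None, "base": v, "group": v[:-3]}
    return None


print(analyse_verb("transcarboxylate"))
print(analyse_verb("crossphosphorylate"))
```

Checking the remainder against the base lexicon is the design choice that keeps the prefix rule from over-applying.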

Page 23

Coreference (local)

ID ACCD_ECOLI STANDARD; PRT; 304 AA.
AC P08193; P78251; P76937;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
   (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
       CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
       CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
       TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
       CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
       OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.

Page 25

Coreference (local)

ID ACCD_ECOLI STANDARD; PRT; 304 AA.
AC P08193; P78251; P76937;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
   (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
       CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
       CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
       TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
       CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
       OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.

PP-attachment: semantics of Heterohexamer
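The lexical semantics of words like HETEROHEXAMER can be read off their morphology: an optional homo-/hetero- prefix says whether the subunits are identical, and the numeral prefix fixes how many subunits the following PP must account for. The sketch below assumes a small table of numeral prefixes; `oligomer_semantics` is a hypothetical helper name.

```python
import re
from typing import Dict, Optional, Union

COUNTS = {"mono": 1, "di": 2, "tri": 3, "tetra": 4,
          "penta": 5, "hexa": 6, "hepta": 7, "octa": 8}

OLIGOMER = re.compile(
    r"^(?P<kind>homo|hetero)?(?P<count>" + "|".join(COUNTS) + r")mer$",
    re.IGNORECASE,
)


def oligomer_semantics(word: str) -> Optional[Dict[str, Union[bool, int, None]]]:
    """Return subunit count and identity constraint for words like 'HETEROHEXAMER'."""
    match = OLIGOMER.match(word.strip().rstrip("."))
    if match is None:
        return None
    kind = match.group("kind")
    return {
        # True for homo-, False for hetero-, None when the word does not say.
        "identical_subunits": None if kind is None else kind.lower() == "homo",
        "n_subunits": COUNTS[match.group("count").lower()],
    }


print(oligomer_semantics("HETEROHEXAMER"))
print(oligomer_semantics("HOMODIMER."))
```

With the count known to be six, a reading of the SUBUNIT comment in which "IN A 2:2 COMPLEX" attaches so that the listed parts add up to six subunits can be preferred over readings that do not.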

Page 26

Coreference (non-local)

ID ACCD_ECOLI STANDARD; PRT; 304 AA.
AC P08193; P78251; P76937;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
   (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
       CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
       CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
       TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
       CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
       OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.

ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT XYZ
XYZ SUBUNIT OF ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE

...
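Finding the other entries that a SUBUNIT comment refers to requires mapping name variants onto one key. The sketch below handles just the two variations visible here, "COENZYME A" versus "COA" and "X SUBUNIT Y" versus "Y SUBUNIT OF X"; the function names `canonical` and `subunit_key` and the rewrite rules are assumptions made for illustration.

```python
import re
from typing import Optional, Tuple


def canonical(name: str) -> str:
    """Uppercase, abbreviate 'COENZYME A' to 'COA', and unify hyphens and spaces."""
    n = name.upper()
    n = re.sub(r"\bCOENZYME A\b", "COA", n)
    return re.sub(r"[-\s]+", " ", n).strip()


def subunit_key(name: str) -> Tuple[str, Optional[str]]:
    """Return (assembly, subunit) for both 'X SUBUNIT Y' and 'Y SUBUNIT OF X'."""
    n = canonical(name)
    match = re.match(r"^(?P<sub>.+) SUBUNIT OF (?P<asm>.+)$", n)
    if match is None:
        match = re.match(r"^(?P<asm>.+) SUBUNIT (?P<sub>.+)$", n)
    if match is None:
        return n, None
    return match["asm"], match["sub"]


print(subunit_key("ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA"))
print(subunit_key("BETA SUBUNIT OF ACETYL-COA CARBOXYLASE CARBOXYL TRANSFERASE"))
```

Both calls produce the same key, so a database lookup on that key would retrieve the same entry for either phrasing.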

Page 27

Coreference (non-local)

ID ACCA_ECOLI STANDARD; PRT; 318 AA.
AC P30867;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT ALPHA
   (EC 6.4.1.2).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
       CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
       CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
       TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- CATALYTIC ACTIVITY: CARBOXYBIOTIN CARBOXYL CARRIER PROTEIN +
       ACETYL-COA = BIOTIN CARBOXYL CARRIER PROTEIN + MALONYL-COA.
CC -!- PATHWAY: FIRST STEP IN LONG-CHAIN FATTY ACID SYNTHESIS.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
       CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
       OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC -!- SIMILARITY: TO THE C-TERMINUS OF MAMMALIAN PROPIONYL-COA
       CARBOXYLASE BETA CHAIN.

Page 28

Coreference (non-local)

ID BCCP_ECOLI STANDARD; PRT; 156 AA.
AC P02905;
DE BIOTIN CARBOXYL CARRIER PROTEIN OF ACETYL-COA CARBOXYLASE (BCCP).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
       CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
       CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
       TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- PATHWAY: FIRST STEP IN LONG-CHAIN FATTY ACID SYNTHESIS.
CC -!- SUBUNIT: HOMODIMER.

Page 29

Coreference (non-local)

ID ACCC_ECOLI STANDARD; PRT; 449 AA.
AC P24182;
DE BIOTIN CARBOXYLASE (EC 6.3.4.14) (A SUBUNIT OF ACETYL-COA CARBOXYLASE)
   (EC 6.4.1.2) (ACC).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
       CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
       CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
       TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- CATALYTIC ACTIVITY: ATP + BIOTIN-CARBOXYL-CARRIER PROTEIN + CO(2) =
       ADP + ORTHOPHOSPHATE + CARBOXYBIOTIN-CARBOXYL-CARRIER PROTEIN.
CC -!- PATHWAY: FIRST STEP IN LONG-CHAIN FATTY ACID SYNTHESIS.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
       CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
       OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC -!- SIMILARITY: TO OTHER BIOTIN-DEPENDENT ENZYMES AND
       CARBAMOYL-PHOSPHATE SYNTHETASES.

Page 30

Extraction

ID ACCD_ECOLI STANDARD; PRT; 304 AA.
AC P08193; P78251; P76937;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
   (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
       CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
       CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
       TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
       CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
       OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC -!- SIMILARITY: BELONGS TO THE ACCD / PCCB FAMILY.

Complex consisting of 6 subunits

Page 31

Extraction

ID ACCD_ECOLI STANDARD; PRT; 304 AA.
AC P08193; P78251; P76937;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
   (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
       CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
       CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
       TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
       CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
       OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC -!- SIMILARITY: BELONGS TO THE ACCD / PCCB FAMILY.

Acetyl-CoA Carboxylase
    Carrier Protein
    Biotin Carboxylase
    Carboxyl Transferase
        Alpha  Alpha  Beta  Beta
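The extracted assembly can be held as a small tree and checked against the count implied by HETEROHEXAMER. The sketch below encodes one reading of the diagram, with one carrier protein, one biotin carboxylase and two copies each of the alpha and beta subunits of carboxyl transferase; this stoichiometry, the tuple-based representation and the function name `count_subunits` are assumptions made for illustration.

```python
from typing import List, Tuple, Union

# A node is (name, copies) for a polypeptide or (name, [children]) for a sub-assembly.
Node = Tuple[str, Union[int, List["Node"]]]

ASSEMBLY: Node = (
    "Acetyl-CoA carboxylase",
    [
        ("Biotin carboxyl carrier protein", 1),
        ("Biotin carboxylase", 1),
        ("Carboxyl transferase", [("alpha", 2), ("beta", 2)]),
    ],
)


def count_subunits(node: Node) -> int:
    """Total number of polypeptide chains below a node."""
    _, content = node
    if isinstance(content, int):
        return content
    return sum(count_subunits(child) for child in content)


EXPECTED = 6  # from "HETEROHEXAMER"
print(count_subunits(ASSEMBLY) == EXPECTED)
```

A mismatch between the counted subunits and the expected number would signal that the SUBUNIT comment was parsed with the wrong attachment.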

Page 32

Completing the Picture

ID BIRA_ECOLI STANDARD; PRT; 321 AA.
AC P06709;
CC -!- FUNCTION: BIRA ACTS BOTH AS A BIOTIN-OPERON REPRESSOR AND AS THE
       ENZYME THAT SYNTHESIZES THE COREPRESSOR, ACETYL COA:CARBON-DIOXIDE
       LIGASE. THIS PROTEIN ALSO ACTIVATES BIOTIN TO FORM
       BIOTINYL-5'-ADENYLATE AND TRANSFERS THE BIOTIN MOIETY TO
       BIOTIN-ACCEPTING PROTEINS.
CC -!- CATALYTIC ACTIVITY: ATP + BIOTIN + APO-[ACETYL-COA:CARBON-DIOXIDE
       LIGASE (ADP FORMING)] = AMP + PYROPHOSPHATE +
       [ACETYL-COA:CARBON-DIOXIDE LIGASE (ADP FORMING)].
CC -!- SUBUNIT: MONOMER.
CC -!- SIMILARITY: WITH OTHER BACTERIAL BIRA AND WITH EUKARYOTIC BIOTIN
       APO-PROTEIN LIGASE.

= Acetyl CoA Carboxylase

Page 33

Conclusion

• Sophisticated IE must incorporate
  – Domain Ontology (EML, INRIA, IMS)
  – Lexical Semantics (IMS)
  – Morphological Analysis + Compositional Semantics (IMS)
  – Discourse Semantics (IMS)
• Work on the Lexicon of Cell-Biology
  – Organic Chemical Compounds: „Was bedeutet UREYLEN“ ("What does UREYLEN mean"; C. Gerstenberger, IMS)
  – Semantic/ontological classification of 100 chemical Verbs (Phillip Cimiano Lavin, IMS)
  – Enzyme- and Protein Names (work in progress)