Processing Natural Language Comments in Biological Databases: Molecular Assemblies and Their Catalytic Functions.
A Case Study
Uwe Reyle
Institute of Computational Linguistics
University of Stuttgart
EML European Media Laboratory
Heidelberg
INRIA Institut National de Recherche en Informatique et en Automatique
Grenoble
Biological Databases
Proteins
Enzymes
Compounds
Pathways
flat files – no relational/deductive databases
made for biologists – not for machines
Biological Databases
Proteins
Enzymes
Compounds
Pathways
Data Model
Ontology
Efficient Querying
Overview
• Genes, Proteins and Enzymes
• Swissprot Protein Database
• Two Examples
  – Semantic Processing
  – Parsing Protein Names
• Merits for
  – Coreference Resolution
  – Extraction/Detection of Molecular Assemblies
[Figure: from gene to pathway. Labels: Chromosome, gene, Transcription, Translation, polypeptide, Posttranslational Modifications, Molecular Assembly, enzyme (EC), Catalytic Activity, biochemical reactions, compounds (e.g. sugar...), Pathways]
[Figure: gene-to-pathway diagram (Chromosome, gene, polypeptide, molecular assembly, enzyme (EC), biochemical reactions, compounds (e.g. sugar...)) annotated with the Swissprot fields DE POS, DE INCLUDES CONTAINS, SUBUNIT, FUNCTION, CATALYTIC ACTIVITY, PATHWAY]
Swissprot Entries
Reference Database
Proteins
Enzymes
Compounds
Pathways
Swissprot
Swissprot vs. Medline
Papers
Medline Abstracts
Swissprot
• fact + organism + experimental context
• enormous vocabulary
• coreference = intra-document coreference + coreference to database
IE
• fact
• much smaller vocabulary
• coreference = intra-document coreference
RECOMMENDED NAME dipeptidyl-peptidase IV
SYNONYMS
peptidase, dipeptidyl, IV
Pep X
leukocyte antigen CD26
glycylprolyl dipeptidylaminopeptidase
glycylproline-dipeptidyl-aminopeptidase
glycylproline aminopeptidase
Xaa-Pro-dipeptidyl-aminopeptidase
dipeptidyl-peptide hydrolase
lymphocyte, antigen CD26
postproline dipeptidyl aminopeptidase IV
glycylprolyl aminopeptidase
dipeptidyl-aminopeptidase IV
Gly-Pro-naphthylamidase
DPP IV/CD26
glycoprotein GP110
amino acyl-prolyl dipeptidyl aminopeptidase
dipeptidyl aminopeptidase IV
T cell triggering molecule Tp103
dipeptidyl-peptidase IV (CD26)
X-prolyl dipeptidyl aminopeptidase
X-PDAP
aminopeptidase, glycylproline
Coreference to DE-line of database entry
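Resolving a mention against the database amounts to mapping each surface form to the recommended name of an entry. The following is a minimal sketch of that lookup, assuming a hand-built synonym table; `SYNONYMS`, `normalize` and `resolve` are hypothetical names introduced for illustration, and only the protein names themselves come from the slide.

```python
import re

RECOMMENDED = "dipeptidyl-peptidase IV"

# A few of the synonyms listed on the slide; a real table would hold all of them.
SYNONYMS = [
    "Pep X",
    "leukocyte antigen CD26",
    "glycylproline aminopeptidase",
    "DPP IV/CD26",
    "X-PDAP",
    "dipeptidyl aminopeptidase IV",
]

def normalize(name: str) -> str:
    """Lower-case and collapse hyphens/whitespace so spelling variants compare equal."""
    return re.sub(r"[\s\-]+", " ", name.strip().lower())

# Index: normalized surface form -> recommended name (assumed single-entry table)
INDEX = {normalize(s): RECOMMENDED for s in SYNONYMS + [RECOMMENDED]}

def resolve(mention: str):
    """Return the recommended name a mention corefers with, or None if unknown."""
    return INDEX.get(normalize(mention))
```

With this table, `resolve("x-pdap")` and `resolve("Dipeptidyl-aminopeptidase IV")` both yield the recommended name, because hyphen and case variants are normalized away before the lookup.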
Structure of Swissprot Entries
3. The different line types
3.1 The ID line
3.2 The AC line
3.3 The DT line
3.4 The DE line
3.5 The GN line
3.6 The OS line
3.7 The OG line
3.8 The OC line
3.9 The OX line
3.10 The reference (RN, RP, RC, RX, RA, RT, RL) lines
3.11 The CC line
3.12 The DR line
3.13 The KW line
3.14 The FT line
3.15 The SQ line
3.16 The sequence data line
3.17 The // line
Quality of information by marking: experiment, similarity, ...
Each entry refers to a polypeptide in one single organism
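Because every line starts with a two-letter line code, an entry can be split into fields before any linguistic processing. The sketch below is one possible reader, not the tool used in the project: `parse_entry` and `comment_topics` are hypothetical helpers, and the sample text assumes the flat-file convention that the line code is repeated on continuation lines.

```python
from collections import defaultdict

def parse_entry(text: str) -> dict:
    """Group the lines of one flat-file entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):          # entry terminator
            break
        code, _, rest = line.partition(" ")
        if len(code) == 2 and code.isupper():
            fields[code].append(rest.strip())
    return dict(fields)

def comment_topics(cc_lines: list) -> dict:
    """Split CC lines into topics such as FUNCTION or SUBUNIT ('-!- TOPIC: text')."""
    topics, current = {}, None
    for line in cc_lines:
        if line.startswith("-!-"):
            current, _, text = line[3:].strip().partition(":")
            topics[current] = text.strip()
        elif current is not None:          # continuation of the previous topic
            topics[current] += " " + line.strip()
    return topics

# Abbreviated sample entry (layout assumed, content taken from the slides).
ENTRY = """\
ID   ACCD_ECOLI     STANDARD;      PRT;   304 AA.
AC   P08193; P78251; P76937;
DE   ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
DE   (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC   -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF
CC       BIOTIN CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO
CC       SUBUNITS OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
//
"""

fields = parse_entry(ENTRY)
topics = comment_topics(fields["CC"])
```

After this step `topics["SUBUNIT"]` holds the free-text comment as one string, which is the input the later parsing steps work on.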
An Example
ID ACCD_ECOLI STANDARD; PRT; 304 AA.
AC P08193; P78251; P76937;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC -!- CATALYTIC ACTIVITY, PATHWAY, SIMILARITY, FEATURES, ...
Variety of SUBUNIT-lines
• HETERODIMER...
• PP2 CONSISTS OF A COMMON HETERODIMERIC CORE ENZYME, COMPOSED OF A 36 KDA CATALYTIC SUBUNIT (SUBUNIT C) AND A 65 KDA CONSTANT REGULATORY SUBUNIT (PR65 OR SUBUNIT A), THAT ASSOCIATES WITH A VARIETY OF REGULATORY SUBUNITS. PROTEINS THAT ASSOCIATE WITH THE CORE DIMER INCLUDE THREE FAMILIES OF ...
Subunit-lines of type NP
<DE> A-kinase anchor protein 5
<SUBUNIT> BINDING PROTEIN FOR DIMER OF PKA AND ALSO FOR PKC AND PP2B. EACH ENZYME IS INHIBITED WHEN BOUND TO THE ANCHOR PROTEIN.
(AKAP5 : (PKA:PKA)), where PKA is inhibited
PKC AKAP5 and PP2B AKAP5 where PKC and PP2B are inhibited
<DE> Potassium-transporting ATPase alpha chain
<SUBUNIT> HETERODIMER COMPOSED OF TWO SUBUNITS, ALPHA AND BETA.
Potassium-transporting ATPase (alpha : beta)
Potassium-transporting ATPase alpha chain (alpha : beta)
Task: parse recommended name
Structure of Polypeptide Names that Refer to Subunits of Proteins
AssemblyName SubunitRef
Protein Name
Enzyme Name
{beta 1, ASHI, lacH, ...} subunit
30 kDa subunit
{small, major, second largest, ...} subunit
type B catalytic subunit subunit {alpha 3, 2 type B, ...}
iron-sulfur subunit alpha-2
{alpha, light, catalytic, ...} chain cytochrome B-558
homolog
precursor; phrase(s)
vacuolar
soluble
anaerobic
Problems
• We cannot assume a dictionary of assembly names
• AssemblyName very often ends with a highly ambiguous symbol that may also be used to start the SubunitRef expression
  - F, A1, I, II, i, ..., geneName, ...
• No nomenclature of subunits exists
• Contextual knowledge is needed to disambiguate, e.g., XYase A1 large chain
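The ambiguity can be made concrete by enumerating every way a name splits into AssemblyName and SubunitRef. This is a sketch under simplifying assumptions: it only covers head-final references ending in "subunit" or "chain", and `HEADS` and `candidate_splits` are hypothetical names, not part of the original system.

```python
HEADS = {"subunit", "chain"}   # head words of SubunitRef taken from the slide

def candidate_splits(name: str) -> list:
    """Return every (AssemblyName, SubunitRef) split whose SubunitRef ends in a head word."""
    tokens = name.split()
    if not tokens or tokens[-1].lower() not in HEADS:
        return []
    # The assembly name needs at least one token, the SubunitRef at least the head.
    return [(" ".join(tokens[:i]), " ".join(tokens[i:]))
            for i in range(1, len(tokens))]

splits = candidate_splits("XYase A1 large chain")
# [('XYase', 'A1 large chain'), ('XYase A1', 'large chain'), ('XYase A1 large', 'chain')]
```

Nothing in the string itself decides whether "A1" closes the assembly name or opens the subunit reference, which is why a ranking step with contextual knowledge would have to follow.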
Assembly Names
• Mitogen-activated protein kinase kinase kinase kinase
  acting on a kinase that acts on a protein kinase
  one of these kinases is mitogen-activated, not the protein, however
• "kinase" has 1 semantic argument, namely the molecule X that it phosphorylates
Acceptor/Donor, Group phosphoryl, Function transfer
Acceptor/Donor ..., Group phosphoryl, Function transfer
CoA Carboxylase Carboxyl Transferase: Acceptor/Donor CoA Carboxylase, Group carboxyl, Function transfer
Acceptor/Donor X, Group carboxyl, Function transfer, ADJ-Rel CoA Carboxylase
With ADJ-Rel {,is_expressed_by, ...}
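Since "kinase" takes one argument, a stacked name can be interpreted by nesting one frame per occurrence. The sketch below illustrates this compositional reading; `kinase_frame` and `interpret` are hypothetical, the slot names follow the slide, and the treatment of "-activated" modifiers (set aside, attachment left open) is an assumption.

```python
def kinase_frame(acceptor):
    """Frame for 'X kinase': transfers a phosphoryl group to X."""
    return {"Function": "transfer", "Group": "phosphoryl", "Acceptor/Donor": acceptor}

def interpret(name: str):
    """Compose one frame per trailing 'kinase'; return (modifiers, meaning)."""
    tokens = name.split()
    depth = 0
    while tokens and tokens[-1].lower() == "kinase":
        tokens.pop()
        depth += 1
    # Modifiers such as "Mitogen-activated" apply to one of the kinases, not to
    # the innermost noun, so they are kept apart and their attachment is left open.
    modifiers = [t for t in tokens if t.lower().endswith("-activated")]
    meaning = " ".join(t for t in tokens if t not in modifiers)
    for _ in range(depth):
        meaning = kinase_frame(meaning)
    return modifiers, meaning

modifiers, meaning = interpret("Mitogen-activated protein kinase kinase kinase kinase")
```

The result is four nested frames around the argument "protein", with "Mitogen-activated" returned separately for a later attachment decision.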
Assembly Names
Semantic Relations projected from the Lexicon
• carboxyl transferase transcarboxylase (IUPAC)
• transcarboxylation carboxylation
• transcarboxylate carboxylate
• phosphorylate, biotinylate, adenylylate, ...
• transphosphorylate, ...
• crossphosphorylate, ...
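Relations such as transcarboxylate / carboxylate can be recovered by a small morphological analysis that strips a prefix and checks the remainder against the lexicon. A minimal sketch, assuming a toy lexicon; `PREFIXES`, `BASE_VERBS` and `analyse` are hypothetical names.

```python
PREFIXES = ("trans", "cross")
BASE_VERBS = {"phosphorylate", "carboxylate", "biotinylate", "adenylylate"}

def analyse(verb: str):
    """Split a verb into (prefix, base) if the base is a known lexicon entry."""
    v = verb.lower()
    if v in BASE_VERBS:
        return (None, v)
    for prefix in PREFIXES:
        base = v[len(prefix):]
        if v.startswith(prefix) and base in BASE_VERBS:
            return (prefix, base)
    return None   # not analysable with this lexicon
```

Checking the remainder against the lexicon keeps ordinary words such as "translate" from being analysed as a prefix plus a chemical verb.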
Coreference (local)
ID ACCD_ECOLI STANDARD; PRT; 304 AA.
AC P08193; P78251; P76937;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
PP-attachment: semantics of Heterohexamer
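The term itself carries a count that constrains how the following "OF ..." phrase can attach: a heterohexamer must be resolved into six chains of more than one kind. A sketch of that lexical decomposition, assuming a small prefix table; `PREFIX_COUNT` and `oligomer` are hypothetical names.

```python
import re

PREFIX_COUNT = {"mono": 1, "di": 2, "tri": 3, "tetra": 4,
                "penta": 5, "hexa": 6, "octa": 8}

def oligomer(term: str):
    """Decompose e.g. 'HETEROHEXAMER' into (kind, number_of_subunits)."""
    m = re.fullmatch(r"(homo|hetero)?(mono|di|tri|tetra|penta|hexa|octa)mer",
                     term.lower())
    if not m:
        return None
    return (m.group(1), PREFIX_COUNT[m.group(2)])
```

For the SUBUNIT line above, `oligomer("HETEROHEXAMER")` gives `("hetero", 6)`, so a reading of the coordinated noun phrase that yields fewer or more than six chains can be discarded.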
Coreference (non-local)
ID ACCD_ECOLI STANDARD; PRT; 304 AA.
AC P08193; P78251; P76937;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT XYZ
XYZ SUBUNIT OF ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE
...
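The two name patterns above can be turned into a search over the DE lines of other entries to find the partner subunits. This is a sketch only: `subunit_of` is a hypothetical helper, the regular expressions cover just the two patterns shown, and the DE lines are copied from the slides.

```python
import re

def subunit_of(de_line: str, assembly: str):
    """Return the subunit label if the DE line names a subunit of `assembly`, else None."""
    a = re.escape(assembly)
    patterns = (rf"^{a} SUBUNIT (?P<label>\w+)",       # "<assembly> SUBUNIT XYZ"
                rf"^(?P<label>\w+) SUBUNIT OF {a}")    # "XYZ SUBUNIT OF <assembly>"
    for pattern in patterns:
        m = re.search(pattern, de_line, flags=re.IGNORECASE)
        if m:
            return m.group("label")
    return None

DE_LINES = {
    "ACCD_ECOLI": "ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA (EC 6.4.1.2) (ACCASE BETA CHAIN).",
    "ACCA_ECOLI": "ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT ALPHA (EC 6.4.1.2).",
    "ACCC_ECOLI": "BIOTIN CARBOXYLASE (EC 6.3.4.14) (A SUBUNIT OF ACETYL-COA CARBOXYLASE) (EC 6.4.1.2) (ACC).",
}

ASSEMBLY = "ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE"
partners = {entry: subunit_of(de, ASSEMBLY) for entry, de in DE_LINES.items()}
```

ACCD_ECOLI and ACCA_ECOLI are found as the BETA and ALPHA subunits of the same assembly, while ACCC_ECOLI is not matched because it names a subunit of the larger complex rather than of carboxyl transferase.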
Coreference (non-local)
ID ACCA_ECOLI STANDARD; PRT; 318 AA.
AC P30867;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT ALPHA (EC 6.4.1.2).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- CATALYTIC ACTIVITY: CARBOXYBIOTIN CARBOXYL CARRIER PROTEIN + ACETYL-COA = BIOTIN CARBOXYL CARRIER PROTEIN + MALONYL-COA.
CC -!- PATHWAY: FIRST STEP IN LONG-CHAIN FATTY ACID SYNTHESIS.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC -!- SIMILARITY: TO THE C-TERMINUS OF MAMMALIAN PROPIONYL-COA CARBOXYLASE BETA CHAIN.
Coreference (non-local)
ID BCCP_ECOLI STANDARD; PRT; 156 AA.
AC P02905;
DE BIOTIN CARBOXYL CARRIER PROTEIN OF ACETYL-COA CARBOXYLASE (BCCP).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- PATHWAY: FIRST STEP IN LONG-CHAIN FATTY ACID SYNTHESIS.
CC -!- SUBUNIT: HOMODIMER.
Coreference (non-local)
ID ACCC_ECOLI STANDARD; PRT; 449 AA.
AC P24182;
DE BIOTIN CARBOXYLASE (EC 6.3.4.14) (A SUBUNIT OF ACETYL-COA CARBOXYLASE) (EC 6.4.1.2) (ACC).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- CATALYTIC ACTIVITY: ATP + BIOTIN-CARBOXYL-CARRIER PROTEIN + CO(2) = ADP + ORTHOPHOSPHATE + CARBOXYBIOTIN-CARBOXYL-CARRIER PROTEIN.
CC -!- PATHWAY: FIRST STEP IN LONG-CHAIN FATTY ACID SYNTHESIS.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC -!- SIMILARITY: TO OTHER BIOTIN-DEPENDENT ENZYMES AND CARBAMOYL-PHOSPHATE SYNTHETASES.
Extraction
ID ACCD_ECOLI STANDARD; PRT; 304 AA.
AC P08193; P78251; P76937;
DE ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC -!- SIMILARITY: BELONGS TO THE ACCD / PCCB FAMILY.
Complex consisting of 6 subunits
Acetyl-CoA Carboxylase
  Carrier Protein
  Biotin Carboxylase
  Carboxyl Transferase
    Alpha, Alpha, Beta, Beta
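The extracted assembly can be held as a small tree whose leaves are the polypeptide chains, which makes the "6 subunits" reading checkable. A minimal sketch, assuming a simple recursive record; the `Assembly` class is hypothetical and the component names are those of the diagram.

```python
from dataclasses import dataclass, field

@dataclass
class Assembly:
    name: str
    parts: list = field(default_factory=list)   # empty list: a single polypeptide

    def subunits(self) -> list:
        """Return the polypeptide chains (leaves) of the assembly."""
        if not self.parts:
            return [self.name]
        return [leaf for part in self.parts for leaf in part.subunits()]

carboxyl_transferase = Assembly(
    "Carboxyl Transferase",
    [Assembly("Alpha"), Assembly("Alpha"), Assembly("Beta"), Assembly("Beta")],  # 2:2 complex
)
acc = Assembly(
    "Acetyl-CoA Carboxylase",
    [Assembly("Carrier Protein"), Assembly("Biotin Carboxylase"), carboxyl_transferase],
)

chains = acc.subunits()
```

Here `len(chains)` is 6, in agreement with the heterohexamer stated in the SUBUNIT line.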
Completing the Picture
ID BIRA_ECOLI STANDARD; PRT; 321 AA.
AC P06709;
CC -!- FUNCTION: BIRA ACTS BOTH AS A BIOTIN-OPERON REPRESSOR AND AS THE ENZYME THAT SYNTHESIZES THE COREPRESSOR, ACETYL COA:CARBON-DIOXIDE LIGASE. THIS PROTEIN ALSO ACTIVATES BIOTIN TO FORM BIOTINYL-5'-ADENYLATE AND TRANSFERS THE BIOTIN MOIETY TO BIOTIN-ACCEPTING PROTEINS.
CC -!- CATALYTIC ACTIVITY: ATP + BIOTIN + APO-[ACETYL-COA:CARBON-DIOXIDE LIGASE (ADP FORMING)] = AMP + PYROPHOSPHATE + [ACETYL-COA:CARBON-DIOXIDE LIGASE (ADP FORMING)].
CC -!- SUBUNIT: MONOMER.
CC -!- SIMILARITY: WITH OTHER BACTERIAL BIRA AND WITH EUKARYOTIC BIOTIN APO-PROTEIN LIGASE.
ACETYL-COA:CARBON-DIOXIDE LIGASE (ADP FORMING) = Acetyl CoA Carboxylase
Conclusion
• Sophisticated IE must incorporate
  – Domain Ontology (EML, INRIA, IMS)
  – Lexical Semantics (IMS)
  – Morphological Analysis + Compositional Semantics (IMS)
  – Discourse Semantics (IMS)
• Work on the Lexicon of Cell Biology
  – Organic Chemical Compounds: "Was bedeutet UREYLEN" ("What does UREYLEN mean") (C. Gerstenberger, IMS)
  – Semantic/ontological classification of 100 chemical verbs (Phillip Cimiano Lavin, IMS)
  – Enzyme and Protein Names (work in progress)