
Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation

Bioinformatics, July 2003
P.W. Lord, R.D. Stevens, A. Brass and C.A. Goble
University of Manchester

Presented by 임 동혁
July 22, 2005


Contents

- Introduction
- Semantic Similarity Measures
- Validating Semantic Similarity
- Investigating Semantic and Sequence Similarity
- Semantic Searching of GO Annotated Resources
- Discussion


Introduction

- Bioinformatics resources
  - Held in the form of sequences, which are then annotated
  - Annotations are written in scientific natural language, as text
  - Human readable and understandable
  - Not easy to interpret computationally
- Ontologies
  - Provide a mechanism for capturing a view of a domain in a shareable form
  - Both accessible to humans and computationally amenable
  - Provide a set of vocabulary terms that label concepts in the domain
  - “is-a” relationship between parent and child
  - “part-of” relationship between part and whole


Gene Ontology(1/2)

- GO comprises three orthogonal taxonomies, or aspects
  - Molecular function
  - Biological process
  - Cellular component
- GO is a rapidly growing collection of about 11,000 phrases, representing terms or concepts
- Structured as a Directed Acyclic Graph (DAG)


Gene Ontology(2/2)

- Allows improved querying of databases
  - Different resources can be queried with the same term
  - Shared understanding improves retrieval consistency across resources, and recall and precision
- One obvious alternative way
  - Ask for proteins semantically similar to a query protein
- Semantic similarity
  - Taxonomy of biomedical terms
  - Ex) Medical Subject Headings (MeSH): similar content (by words)


The Gene Ontology

[Figure: fragment of the GO molecular function taxonomy. Each node shows its term, GO accession and probability p; the nodes are joined by “isa” links.]

  molecular function            GO:0003674   p = 1
  signal transducer             GO:0004871   p = 0.208
  chaperone                     GO:0003754   p = 0.0102
  receptor-associated protein   GO:0016962   p = 0.00159
  receptor                      GO:0004872   p = 0.124
  receptor signaling protein    GO:0005057   p = 0.0281
  ligand                        GO:0005102   p = 0.0460
  transmembrane receptor        GO:0004888   p = 0.0997
  photoreceptor                 GO:0009881   p = 0.000433

- Two proteins both annotated as “transmembrane receptor” (GO:0004888): similar semantic description
- One annotated as just “receptor” (GO:0004872): semantically less similar


Semantic Similarity Measure(1/3)

- Early techniques (Rada et al., 1989)
  - Path distances between terms
  - Assume that all semantic links are of equal weight
  - Poor assumption
  - Ex) “photoreceptor” and “transmembrane receptor” are semantically more closely related than “chaperone” and “signal transducer”
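
To see why equal-weight links are a poor assumption, here is a minimal sketch of edge counting on part of the GO fragment shown earlier. The parent of each term is an assumed reconstruction of the figure (the transcript lists the nodes but not which “isa” link joins which pair), and `path_to_root` and `path_distance` are hypothetical helper names.

```python
# Child -> parent "is-a" links. ASSUMED layout: the transcript does not
# preserve which nodes the figure's links connect.
PARENT = {
    "signal transducer": "molecular function",
    "chaperone": "molecular function",
    "receptor": "signal transducer",
    "transmembrane receptor": "receptor",
    "photoreceptor": "receptor",
}


def path_to_root(term):
    """Return [term, parent, grandparent, ..., root]."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def path_distance(a, b):
    """Count the links between a and b via their closest shared ancestor."""
    up_a = path_to_root(a)
    steps_from_b = {term: i for i, term in enumerate(path_to_root(b))}
    for steps_from_a, term in enumerate(up_a):
        if term in steps_from_b:
            return steps_from_a + steps_from_b[term]
    raise ValueError("terms share no ancestor")


d_specific = path_distance("photoreceptor", "transmembrane receptor")
d_general = path_distance("chaperone", "signal transducer")
```

Under this layout both pairs are siblings, so edge counting gives them the same distance even though the first pair is far more closely related.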


Semantic Similarity Measure(2/3)

- Edges could be weighted
  - The greater the distance from the root of the graph, the more specific the term
  - However, GO varies widely in the distance of nodes from the root
  - Ex) GO:0005300 is 14 terms deep, while GO:0008435 is only 3 terms deep, yet it is not significantly less semantically precise


Semantic Similarity Measure(3/3)

- Usage of terms within the corpus (Resnik, 1999)
  - Uses the notion of “information content”
  - Familiar from most internet search engines
  - Ex) “chaperone” is a more informative term than “signal transducer”
    - The former is used several times, the latter thousands of times
  - Whenever GO:0004872 occurs, GO:0004871 and GO:0003674 have also occurred (“is-a” links are considered)
  - The less often a term occurs, the more informative it is
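
The “chaperone” versus “signal transducer” comparison can be sketched numerically. This assumes information content is taken as the negative natural log of a term's probability, and it uses the probabilities printed on the GO fragment slide; `information_content` is a hypothetical helper name.

```python
import math

# Probabilities as printed on the GO fragment slide (SWISS-PROT-Human).
P = {
    "molecular function": 1.0,
    "signal transducer": 0.208,
    "chaperone": 0.0102,
}


def information_content(p):
    """Information content of a term that occurs with probability p.

    ASSUMPTION: natural log is used; another base only rescales the values.
    """
    return -math.log(p)


ic = {term: information_content(p) for term, p in P.items()}
```

With these inputs “chaperone” scores about 4.59 and “signal transducer” about 1.57, while the root carries no information at all.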


Probabilities in the Gene Ontology

Each node is annotated with its GO accession and the probability of this term occurring in the SWISS-PROT-Human database.

1. Count the number of times each concept occurs.
2. A concept occurs if its own term, or any of its child terms, occurs.
3. The probability, p(c), of each node is this count divided by the total number of times any concept occurs (so the probability of the root node is 1).


Semantic Similarity between terms

- Uses the simplest of the measures (Resnik, 1999)
  - Based on the information content of the shared parents of the two terms
  - S(c1, c2) is the set of parental concepts shared by both c1 and c2
  - GO allows multiple paths, so the minimum p(c) over S(c1, c2) is taken
  - p_ms (probability of the minimum subsumer): p_ms(c1, c2) = min { p(c) : c in S(c1, c2) }
- Similarity score between two terms: sim(c1, c2) = -ln p_ms(c1, c2)
  - As probability increases, informativeness decreases
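
A minimal sketch of this measure on part of the GO fragment shown earlier, taking the score as the negative natural log of the minimum-subsumer probability. The probabilities are the ones printed on that slide; the parent links are an assumed reconstruction of the figure, a term is assumed to subsume itself, and `subsumers`, `p_ms` and `sim` are hypothetical function names.

```python
import math

# Probabilities as printed on the GO fragment slide.
P = {
    "molecular function": 1.0,
    "signal transducer": 0.208,
    "chaperone": 0.0102,
    "receptor": 0.124,
    "transmembrane receptor": 0.0997,
    "photoreceptor": 0.000433,
}

# Term -> parents. ASSUMED layout: the transcript does not preserve the edges.
PARENTS = {
    "molecular function": [],
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "receptor": ["signal transducer"],
    "transmembrane receptor": ["receptor"],
    "photoreceptor": ["receptor"],
}


def subsumers(term):
    """The term itself (an assumption) plus all of its ancestors."""
    found = {term}
    for parent in PARENTS[term]:
        found |= subsumers(parent)
    return found


def p_ms(c1, c2):
    """Smallest p(c) over the concepts that subsume both c1 and c2."""
    shared = subsumers(c1) & subsumers(c2)
    return min(P[c] for c in shared)


def sim(c1, c2):
    """Similarity as the information content of the minimum subsumer."""
    return -math.log(p_ms(c1, c2))


s_specific = sim("photoreceptor", "transmembrane receptor")
s_general = sim("chaperone", "signal transducer")
```

Under these assumptions the two receptor terms share “receptor” (p = 0.124) and score about 2.09, while “chaperone” and “signal transducer” share only the root and score zero.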


Validating Semantic Similarity

- How do we validate such a measure?
  - A protein’s sequence relates to its function
  - Highly similar sequences should be highly semantically similar
  - Taking protein sequences in pairs and plotting sequence similarity against semantic similarity should show a relationship
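
One way to put a number on such a relationship is a correlation coefficient over the protein pairs. The sketch below uses Pearson's r as an assumption, since the slides do not say which statistic was used, and the two score lists are placeholder numbers, not values from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# PLACEHOLDER inputs, one entry per protein pair; not data from the paper.
sequence_scores = [10.0, 20.0, 30.0, 40.0]
semantic_scores = [1.0, 2.0, 3.0, 4.0]

r = pearson(sequence_scores, semantic_scores)
```

A rank-based statistic would be a reasonable alternative if the relationship is monotonic but not linear.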


Adapting the Similarity Measures to GO and SWISS-PROT

- “part-of” relationships
  - Orphan terms: linked directly to the root
  - Ex) GO:0009542, when “is-a” links alone are used
- Proteins may be annotated with more than a single term
  - WordNet: maximum similarity
  - GO: average similarity


Comparing Semantic Similarity Across GO Aspects

- There is a good correlation between sequence similarity and semantic similarity
- The correlation is greater when measured against the “molecular function” aspect


The Relationship Between Semantic Similarity and Evidence Codes

- TAS (Traceable Author Statement): regarded as the highest standard of evidence
- When only TAS GO annotations are considered, the correlation is much greater


Effect of Using Semantic Links in Semantic Similarity

- Consider only links of a single type: “is-a” or “part-of”
- Little difference between all links and “is-a” only: almost all links are of the “is-a” type (6167 / 6202)
- No links: a drop in the middle part, where proteins share similar terms (which is why links are included in the semantic similarity measure)


Analysis(1/2)

- Very high semantic similarity but little sequence similarity
  - “Polymorphic” groups
    - Two or more classes of protein involved in the same process
    - Heterodimerize or form sub-families
  - Hyper-variable protein families (arbitrary)
  - Mis-annotations
    - “x-like” in SWISS-PROT but “x” in GO
    - Spelling mistakes


Analysis(2/2) - Example


Semantic Searching of GO Annotated Resources

- Develop a search tool
  - Compares a given query protein against all the others in SWISS-PROT-Human
  - Generates a ranked list of semantically similar proteins
  - Ex) “OPSR_HUMAN”
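
A ranked search of this kind can be sketched as scoring the query against every other entry and sorting. Everything below other than the “OPSR_HUMAN” identifier is hypothetical: the other protein IDs, the term labels, and the set-overlap score, which is only a stand-in for the semantic similarity measure so that the block stays self-contained.

```python
def overlap_score(terms_a, terms_b):
    """STAND-IN similarity (Jaccard overlap), not the measure in the slides."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def rank_by_similarity(query_id, annotations, protein_sim):
    """Return (protein_id, score) pairs, most similar to the query first."""
    query_terms = annotations[query_id]
    scored = [
        (pid, protein_sim(query_terms, terms))
        for pid, terms in annotations.items()
        if pid != query_id
    ]
    # Sort by descending score; ties are broken by protein ID for stability.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


# HYPOTHETICAL annotations with placeholder term labels.
ANNOTATIONS = {
    "OPSR_HUMAN": {"t1", "t2"},
    "PROT_A": {"t1", "t2"},
    "PROT_B": {"t1"},
    "PROT_C": {"t3"},
}

ranked = rank_by_similarity("OPSR_HUMAN", ANNOTATIONS, overlap_score)
```

Because the similarity function is passed in as an argument, the stand-in can be swapped for an information-content-based measure without changing the ranking code.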


Discussion

- Investigated semantic similarity measures
  - In all cases, semantic similarity is correlated with sequence similarity
  - GO aspect: molecular function
  - Evidence code: “Traceable Author Statement”
- Future work
  - Effect of the different semantic links in ontologies
  - Co-expression as revealed by microarray experiments
  - Expect that the biological process aspect would be of great use