Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation
Bioinformatics, July 2003
P.W. Lord, R.D. Stevens, A. Brass and C.A. Goble
University of Manchester
Presented by 임 동혁, July 22, 2005
2
Contents
Introduction
Semantic Similarity Measures
Validating Semantic Similarity
Investigating Semantic and Sequence Similarity
Semantic Searching of GO Annotated Resources
Discussion
3
Introduction
Bioinformatics resources
  Held in the form of sequences, which are then annotated
  Annotation is written in scientific natural language, as text
  Human-readable and understandable
  Not easy to interpret computationally
Ontologies
  Provide a mechanism for capturing a view of a domain in a shareable form
  Both accessible to humans and computationally amenable
  Provide a set of vocabulary terms that label concepts in the domain
  “is-a” relationship between parent and child
  “part-of” relationship between part and whole
4
Gene Ontology(1/2)
GO comprises three orthogonal taxonomies, or aspects
  Molecular function
  Biological process
  Cellular component
GO is a rapidly growing collection of about 11,000 phrases, representing terms or concepts
Structured as a Directed Acyclic Graph (DAG)
5
Gene Ontology(2/2)
Allows improved querying of databases
  Different resources can be queried with the same term
  Shared understanding improves retrieval consistency across resources, as well as recall and precision
One obvious alternative way of querying
  Ask for proteins semantically similar to a query protein
Semantic similarity
  Taxonomy of biomedical terms
  Ex) Medical Subject Headings (MeSH): similar content (by words)
6
The Gene Ontology
[Figure: fragment of the molecular function taxonomy; each node shows a term, its GO accession and its probability p, and nodes are joined by “is-a” links]
Term                          GO accession  p
molecular function            GO:0003674    1
signal transducer             GO:0004871    0.208
chaperone                     GO:0003754    0.0102
receptor-associated protein   GO:0016962    0.00159
receptor                      GO:0004872    0.124
receptor signaling protein    GO:0005057    0.0281
ligand                        GO:0005102    0.0460
transmembrane receptor        GO:0004888    0.0997
photoreceptor                 GO:0009881    0.000433
Two proteins both annotated as “transmembrane receptor” (GO:0004888)
  Similar semantic description
If one of them is annotated as just “receptor” (GO:0004872)
  Semantically less similar
7
Semantic Similarity Measure(1/3)
Early techniques (Rada et al., 1989)
  Path distances between terms
  Assume that all semantic links are of equal weight
  A poor assumption
  Ex) “photoreceptor” and “transmembrane receptor” are semantically more closely related than “chaperone” and “signal transducer”
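A minimal sketch of a path-distance measure, to show why equal edge weights are a weak assumption. The “is-a” edges below are an assumption made for illustration (the links of the example figure are not recoverable from this transcript), and the names `IS_A`, `undirected` and `path_distance` are hypothetical.

```python
from collections import deque

# Assumed "is-a" edges (child -> parents), reconstructed for illustration only.
IS_A = {
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "receptor": ["signal transducer"],
    "transmembrane receptor": ["receptor"],
    "photoreceptor": ["receptor"],
}


def undirected(is_a):
    """Turn child -> parents links into an undirected adjacency map."""
    graph = {}
    for child, parents in is_a.items():
        for parent in parents:
            graph.setdefault(child, set()).add(parent)
            graph.setdefault(parent, set()).add(child)
    return graph


def path_distance(graph, start, goal):
    """Count the edges on the shortest path between two terms (breadth-first)."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        term, dist = queue.popleft()
        if term == goal:
            return dist
        for neighbour in graph.get(term, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # no path between the two terms


GRAPH = undirected(IS_A)
print(path_distance(GRAPH, "photoreceptor", "transmembrane receptor"))
print(path_distance(GRAPH, "chaperone", "signal transducer"))
```

Under the assumed edges both pairs are two edges apart, so a pure path-distance measure cannot tell the closely related pair from the loosely related one.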
8
Semantic Similarity Measure(2/3)
Edges could be weighted
  The greater the distance from the root of the graph, the more specific the term
  However, GO varies widely in the distance of nodes from the root
  Ex) GO:0005300 is 14 terms deep, while GO:0008435 is only 3 terms deep
  The shallower term is not significantly less semantically precise
9
Semantic Similarity Measure(3/3)
Usage of terms within the corpus (Resnik, 1999)
  Uses the notion of “information content”
  Familiar from most internet search engines
  Ex) “chaperone” is a more informative term than “signal transducer”
    The former is used several times, the latter thousands of times
  Whenever GO:0004872 occurs, GO:0004871 and GO:0003674 have also occurred (“is-a” links are considered)
  Terms that occur less often are more informative
10
Probabilities in the Gene Ontology
Each node is annotated with its GO accession and the probability of the term occurring in the SWISS-PROT-Human database
1. Count the number of times each concept occurs
2. A concept occurs if its term, or any of its child terms, occurs
3. The probability, p(c), of each node is this count divided by the number of times any concept occurs (so the probability of the root node is 1)
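The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the hierarchy in `PARENTS` is assumed, the `ANNOTATIONS` corpus is made-up toy data (not SWISS-PROT counts), and the function names are hypothetical.

```python
from collections import Counter

# Assumed child -> parents links; every term reaches the root.
PARENTS = {
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "receptor": ["signal transducer"],
    "transmembrane receptor": ["receptor"],
}

# Made-up corpus: one entry per annotation of some protein.
ANNOTATIONS = (
    ["transmembrane receptor"] * 3 + ["receptor"] + ["chaperone"]
)


def with_ancestors(term, parents):
    """Return the term together with every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def term_probabilities(parents, annotations):
    """p(c) = occurrences of c or any of its descendants / total annotations."""
    counts = Counter()
    for term in annotations:
        # An occurrence of a term also counts as an occurrence of each ancestor.
        for concept in with_ancestors(term, parents):
            counts[concept] += 1
    total = len(annotations)  # equals the root count when all terms reach the root
    return {concept: n / total for concept, n in counts.items()}


PROB = term_probabilities(PARENTS, ANNOTATIONS)
for concept, p in sorted(PROB.items(), key=lambda item: -item[1]):
    print(f"{concept}: {p:.2f}")
```

Using a set of ancestors per annotation means a concept reached along several paths of the DAG is still counted only once per occurrence.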
11
Semantic Similarity between terms
Uses the simplest of the measures (Resnik, 1999)
  Based on the information content of the shared parents of the two terms
  S(c1, c2) is the set of parental concepts shared by both c1 and c2
  GO allows multiple paths between terms, so the minimum p(c) over S(c1, c2) is taken
  pms (probability of the minimum subsumer): pms(c1, c2) = min { p(c) : c in S(c1, c2) }
Similarity score between two terms
  sim(c1, c2) = -ln pms(c1, c2)
  As the probability increases, informativeness decreases
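A small worked sketch of this score, using the probabilities quoted on the example slide. The “is-a” edges in `PARENTS` are an assumption made for illustration, a term is assumed to subsume itself, and the function names are hypothetical.

```python
import math

# Probabilities p(c) as quoted on the example slide.
P = {
    "molecular function": 1.0,
    "signal transducer": 0.208,
    "chaperone": 0.0102,
    "receptor": 0.124,
    "transmembrane receptor": 0.0997,
    "photoreceptor": 0.000433,
}

# Assumed child -> parents links (not recoverable from the transcript).
PARENTS = {
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "receptor": ["signal transducer"],
    "transmembrane receptor": ["receptor"],
    "photoreceptor": ["receptor"],
}


def subsumers(term):
    """The term itself plus all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, ()))
    return seen


def p_ms(c1, c2):
    """Probability of the minimum subsumer: lowest p(c) among shared subsumers."""
    shared = subsumers(c1) & subsumers(c2)
    return min(P[c] for c in shared)


def similarity(c1, c2):
    """Resnik-style score: -ln of the minimum subsumer probability."""
    return -math.log(p_ms(c1, c2))


for pair in [
    ("transmembrane receptor", "transmembrane receptor"),
    ("transmembrane receptor", "receptor"),
    ("photoreceptor", "transmembrane receptor"),
    ("chaperone", "signal transducer"),
]:
    print(pair, round(similarity(*pair), 3))
```

Under these assumptions two “transmembrane receptor” annotations score about 2.31, “transmembrane receptor” against “receptor” about 2.09, and “chaperone” against “signal transducer” scores 0 because their only shared subsumer is the root.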
12
Validating Semantic Similarity
How do we validate such a measure?
A protein’s sequence relates to its function
  Highly similar sequences should be highly semantically similar
  Taking protein sequences in pairs and plotting sequence similarity against semantic similarity should show a relationship
13
Adapting the Similarity Measures to GO and SWISS-PROT
“part-of” relationship
  Using “is-a” links alone leaves orphan terms
  Ex) GO:0009542
  Orphan terms were linked directly to the root
Proteins may be annotated with more than a single term
  WordNet: maximum similarity
  GO: average similarity
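A sketch of how term-level scores might be combined into a protein-level score, assuming the average (or maximum) is taken over all pairs of terms drawn from the two proteins. The scores in `TOY_SCORES` are made-up values for illustration, and all names are hypothetical.

```python
from itertools import product

# Made-up term-to-term scores, for illustration only.
TOY_SCORES = {
    ("transmembrane receptor", "transmembrane receptor"): 2.3,
    ("transmembrane receptor", "receptor"): 2.1,
}


def toy_term_sim(a, b):
    """Look a pair up in either order; unlisted pairs score 0."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))


def protein_similarity(terms_a, terms_b, term_sim, combine="average"):
    """Combine term-level scores over every pair of terms from two proteins."""
    scores = [term_sim(a, b) for a, b in product(terms_a, terms_b)]
    if not scores:
        raise ValueError("both proteins need at least one annotation")
    if combine == "average":
        return sum(scores) / len(scores)
    if combine == "max":
        return max(scores)
    raise ValueError(f"unknown combine mode: {combine}")


protein_a = ["transmembrane receptor", "chaperone"]
protein_b = ["transmembrane receptor", "receptor"]
print(protein_similarity(protein_a, protein_b, toy_term_sim, "average"))
print(protein_similarity(protein_a, protein_b, toy_term_sim, "max"))
```

With the maximum, one matching pair of terms dominates the score; with the average, the unrelated “chaperone” annotation pulls the score down.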
14
Comparing Semantic Similarity Across GO Aspects
There is a good correlation between sequence similarity and semantic similarity
  The correlation is greater when measured against the “molecular function” aspect
15
The Relationship Between Semantic Similarity and Evidence Codes
TAS (Traceable Author Statement): regarded as the highest standard of evidence
  When only TAS GO annotations are considered, the correlation is much greater
16
Effect of Using Semantic Links in Semantic Similarity
Consider only links of a single type: “is-a” or “part-of”
  Little difference between all links and “is-a” only: almost all links are of the “is-a” type (6167 / 6202)
  “No links” drops in the middle part: proteins there share similar terms (links are included in the semantic similarity measure)
17
Analysis(1/2)
Very high semantic similarity but little sequence similarity
  “Polymorphic” groups
    Two or more classes of protein involved in the same process
    Heterodimerize or form sub-families
  Hypervariable protein families, arbitrary
  Mis-annotations
    SWISS-PROT “x-like” but in GO “x”
    Spelling mistakes
18
Analysis(2/2) - Example
19
Semantic Searching of GO Annotated Resources
Develop a search tool
  Compares a given query protein against all the others in SWISS-PROT-Human
  Generates a ranked list of semantically similar proteins
  Ex) “OPSR_HUMAN”
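A sketch of the ranking step such a tool needs, not the tool itself. The annotations are made up (including those given to the query ID), the other protein IDs are hypothetical, and `jaccard` is a simple stand-in for the protein-level semantic similarity measure, used only to keep the example short.

```python
# Made-up annotations, for illustration only.
ANNOTATIONS = {
    "OPSR_HUMAN": {"photoreceptor", "transmembrane receptor"},
    "PROT_A": {"photoreceptor", "transmembrane receptor"},
    "PROT_B": {"transmembrane receptor"},
    "PROT_C": {"chaperone"},
}


def jaccard(terms_a, terms_b):
    """Stand-in similarity: shared terms divided by all terms."""
    union = terms_a | terms_b
    return len(terms_a & terms_b) / len(union) if union else 0.0


def rank_by_similarity(query_id, annotations, protein_sim):
    """Score every other protein against the query, most similar first."""
    query_terms = annotations[query_id]
    scored = [
        (protein_id, protein_sim(query_terms, terms))
        for protein_id, terms in annotations.items()
        if protein_id != query_id
    ]
    # Sort by descending score, breaking ties by protein ID.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored


for protein_id, score in rank_by_similarity("OPSR_HUMAN", ANNOTATIONS, jaccard):
    print(protein_id, round(score, 2))
```

Because `protein_sim` is passed in as a function, the stand-in can be replaced by an information-content measure without changing the ranking code.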
20
Discussion
Investigated semantic similarity measures
  In all cases semantic similarity is correlated with sequence similarity
  Greatest correlation by GO aspect: molecular function
  Greatest correlation by evidence code: “Traceable Author Statement”
Future work
  Effect of the different semantic links in ontologies
  Co-expression as revealed by microarray experiments
    Expect that the biological process aspect would be of great use