![Page 1: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/1.jpg)
Protein World
SARA
12-12-2002 Amsterdam
Tim Hulsen
![Page 2: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/2.jpg)
Genome sequencing
• Since 1995: sequencing of complete ‘genomes’ (DNA): the order of A/C/G/T
ACGTCATCGTAGCTAGCTAGTCGTACGTATGTGCAGTAGCATCGATCGATCAGCATGCATAC
• At this moment more than 80 genomes have been sequenced and published, of all kinds of organisms:
– Animals
– Plants
– Fungi
– Bacteria
![Page 3: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/3.jpg)
Genomes → Proteins
• ‘Transcription’ and ‘translation’ of specific regions of the genome lead to proteins, consisting of twenty types of ‘amino acids’:
ATG ACG CTG AGC TGC GGA CGT TGA -> MTLSCGR (ATG is the start codon and encodes methionine, M; TGA is a stop codon)
• Proteins are responsible for all kinds of life processes
• All the proteins that can be produced in an organism together are called the ‘proteome’
• Sequence comparisons make it possible to classify proteins
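The codon-to-amino-acid step on this slide can be sketched in a few lines of Python. This is a minimal illustration that uses the standard genetic code; the function name `translate` is an assumption made here, not something from the talk. The output includes the methionine (M) encoded by the start codon ATG and stops at the first stop codon.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order per position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate coding DNA into one-letter amino acids, up to the first stop codon."""
    dna = dna.replace(" ", "").upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATG ACG CTG AGC TGC GGA CGT TGA"))  # MTLSCGR
```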
![Page 4: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/4.jpg)
Protein families
• E.g. the GPCR family:
• Sequence comparison helps in predicting the function of new proteins
![Page 5: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/5.jpg)
Determining protein functions
• Function of 40-50% of the new proteins is unknown
• Understanding of protein functions and relationships is important for:
– Study of fundamental biological processes
– Drug design
– Genetic engineering
![Page 6: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/6.jpg)
Sequence comparison
• Smith-Waterman dynamic programming algorithm (1981): calculates the similarity/distance between two sequences:
Query   ---PLIT-LETRESV-
Subject NEQPKVTMLETRQTAD
(in the slide, similar residues are shown in bold)
• Results in a SW-score that measures how similar the two sequences are to each other
• Disadvantage: the score depends on sequence length
• After the alignments, the proteins are ‘clustered’ (divided into families) according to their similarity
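A minimal sketch of the Smith-Waterman score calculation, in Python. The scoring scheme here (fixed match/mismatch values and a linear gap penalty) is a simplifying assumption for illustration; protein comparisons in practice typically use a substitution matrix and affine gap penalties, and the slides do not state which parameters were used.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of sequences a and b.

    Assumed scoring: +match for identical residues, +mismatch otherwise,
    and a linear penalty `gap` per gap position.
    """
    previous_row = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current_row = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current_row[j] = max(
                0,                                 # start a new local alignment
                previous_row[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous_row[j] + gap,             # gap in b
                current_row[j - 1] + gap,          # gap in a
            )
            best = max(best, current_row[j])
        previous_row = current_row
    return best


print(smith_waterman_score("LETR", "LETR"))          # 8
print(smith_waterman_score("LETRLETR", "LETRLETR"))  # 16
```

The two calls show the length dependence named on the slide: both pairs are perfect matches, but the longer pair scores twice as high.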
![Page 7: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/7.jpg)
Existing databases
• Domain-based clusterings: Prosite, Pfam, ProDom, Prints, Domo, Blocks
• Protein-based clusterings: ProtoMap, COGs, Systers, PIR, CluSTr
• Structural classifications: SCOP, CATH, FSSP
Why should there be another database?
![Page 8: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/8.jpg)
Another method
• Enhanced Smith-Waterman algorithm: Monte-Carlo evaluation (Lipman et al., 1984)
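• How likely is it that two sequences are similar but not related?
• One of the two sequences is randomized and the score is recalculated (200 times). Randomization yields sequences with the same length and the same composition, but a different order
• The method leads to the calculation of the Z-value:
$$Z(A,B) = \frac{S(A,B) - \mu}{\sigma}$$
where S(A,B) is the SW-score of sequences A and B, and µ and σ are the mean and standard deviation of the scores obtained with the randomized sequences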
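The Monte Carlo procedure can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for Protein World: `sw_score` is a simplified Smith-Waterman scorer (fixed match/mismatch values, linear gap penalty), the names `z_value`, `n_shuffles` and `seed` are introduced here, and the choice to shuffle the second sequence and to use the population standard deviation is arbitrary.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Simplified Smith-Waterman local alignment score (assumed scoring scheme)."""
    previous_row = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current_row = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current_row[j] = max(0, previous_row[j - 1] + substitution,
                                 previous_row[j] + gap, current_row[j - 1] + gap)
            best = max(best, current_row[j])
        previous_row = current_row
    return best


def z_value(a, b, n_shuffles=200, seed=0):
    """Estimate Z(A,B) = (S(A,B) - mu) / sigma by shuffling sequence b."""
    rng = random.Random(seed)
    observed = sw_score(a, b)
    residues = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same length and composition, different order
        shuffled_scores.append(sw_score(a, "".join(residues)))
    mu = statistics.mean(shuffled_scores)
    sigma = statistics.pstdev(shuffled_scores)
    if sigma == 0:  # all shuffled scores identical: Z is undefined, report 0
        return 0.0
    return (observed - mu) / sigma


query = "MKTLLVAGCDEFHIKNPQRSWY"
print(z_value(query, query))
```

A sequence compared with itself should give a clearly positive Z-value, because the shuffled copies align far worse than the original.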
![Page 9: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/9.jpg)
Advantages
• The obtained Z-value is a very reliable measure of sequence similarity, compared to the SW-score:
– The SW-score depends on sequence length; the Z-value does not
– Amino acid bias does not affect the Z-value
• Independent of the database size
• Easier updating of the database, without a total recalculation
![Page 10: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/10.jpg)
Disadvantage
• LOTS of calculation time needed, especially when all proteins in all proteomes are compared to each other (“all-against-all”)!
SARA
![Page 11: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/11.jpg)
SARA calculation
• Proteomes of 82 organisms compared ‘all-against-all’ with the use of the Monte Carlo algorithm: more than 400,000 proteins!
• 21,600 CPU days (~520,000 CPU hours)
• = 21,600 PCs running in parallel for 24 hours / 1 PC running for ~60 years
• Using the supercomputer TERAS (1024-CPU SGI Origin 3800) at SARA: less than two months!
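The figures on this slide can be checked with a few lines of Python. The pair count and the ideal wall-clock time are illustrative assumptions: the slide says "more than 400,000" proteins, so 400,000 is a lower bound, and the wall-clock estimate assumes all 1024 CPUs are used continuously with perfect scaling.

```python
cpu_days = 21_600   # from the slide
cpus = 1024         # TERAS, from the slide
proteins = 400_000  # lower bound: the slide says "more than 400,000"

cpu_hours = cpu_days * 24
single_pc_years = cpu_days / 365.25
ideal_wallclock_days = cpu_days / cpus             # assumes perfect scaling
unordered_pairs = proteins * (proteins - 1) // 2   # each pair counted once

print(cpu_hours)                       # 518400
print(round(single_pc_years, 1))       # 59.1
print(round(ideal_wallclock_days, 1))  # 21.1
print(unordered_pairs)                 # 79999800000
```

The ideal figure of about three weeks is consistent with the reported "less than two months" on a shared machine.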
![Page 12: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/12.jpg)
Parties involved
• Gene-IT (Paris, France)
• SARA (Amsterdam, the Netherlands)
• CMBI (Nijmegen, the Netherlands)
• Organon (Oss, the Netherlands)
• EBI (Hinxton, UK)
![Page 13: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/13.jpg)
Supporting parties
• Financed by NCF, foundation in support of supercomputing
• Under the auspices of BioASP, the new Dutch knowledge and service center for Bioinformatics
![Page 14: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/14.jpg)
Results available through BioASP
• http://www.bioasp.nl
• Log in and click on the links ‘Research’ and ‘Protein World’
![Page 15: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/15.jpg)
Results available through BioASP
• Organism selection screen:
![Page 16: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/16.jpg)
Results available through BioASP
• Results screen:
![Page 17: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/17.jpg)
Results available through BioASP
• Alignment screen:
![Page 18: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/18.jpg)
Conclusions
• Currently the most comprehensive and most accurate data-set of protein comparisons
• A start for a maintainable and unique database of all proteins currently known
• A rich data-source for clustering, data-mining and orthology determination
![Page 19: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/19.jpg)
Orthology determination
• Orthologs: genes/proteins in different species that derive from a common ancestor
• Orthologs often have the same function
• Interesting! Information from other species could help in annotating a protein
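One common heuristic for finding candidate orthologs from pairwise comparison scores is the reciprocal best hit: two proteins from different species are paired if each is the other's highest-scoring match. The Python sketch below illustrates that heuristic only; the slides do not say which orthology method is applied to Protein World, and the function names, protein identifiers and scores are hypothetical.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query_id, subject_id) -> similarity score (e.g. a Z-value).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores between proteins of species A (a1..a3) and species B (b1, b2).
a_vs_b = {("a1", "b1"): 30, ("a1", "b2"): 12, ("a2", "b2"): 25, ("a3", "b2"): 20}
b_vs_a = {("b1", "a1"): 29, ("b2", "a2"): 26, ("b2", "a3"): 19}

print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a2', 'b2')]
```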
![Page 20: Protein World](https://reader035.vdocuments.pub/reader035/viewer/2022062802/56814634550346895db34030/html5/thumbnails/20.jpg)
Thank you for your attention
Any questions?