bioinformatics: introduction and...

39
生物信息学:导论与方法 Bioinformatics: Introduction and Methods https://www.coursera.org/course/pkubioinfo

Upload: others

Post on 02-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

生物信息学:导论与方法Bioinformatics: Introduction and Methods

https://www.coursera.org/course/pkubioinfo

Page 2: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

生物信息学:导论与方法Bioinformatics: Introduction and Methods

北京大学生物信息学中心 高歌、魏丽萍Ge Gao & Liping Wei

Center for Bioinformatics, Peking University

Page 3: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Sequence Database Search

北京大学生物信息学中心 高歌Ge Gao, Ph.D.

Center for Bioinformatics, Peking University

Page 4: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Unit 1: Sequence Databases

北京大学生物信息学中心 高歌Ge Gao, Ph.D.

Center for Bioinformatics, Peking University

Page 5: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

( Russ Altman BMI214)

STS--T

djiFdjiF

yxsjiFjiF

F

ji

1,,1

,1,1max,

00,0

01,

,1,1,1

max,

00,0

djiFdjiF

yxsjiFjiF

F

ji

Global alignment (Needleman‐Wunsch) Local alignment (Smith‐Waterman)

Copyright © Peking University

Page 6: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Sequence Alignment

“How can we determine the similarity between two sequences?”

Why is it important?• Similar sequence  Similar structure  Similar function (The “Sequence‐to‐Structure‐to‐Function Paradigm”)

• Similar sequence  Common ancestor (“Homology”)

Copyright © Peking University

Page 7: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Sequence Database Searching• Rather than do the alignment pair‐wise, it’s more often to search sequence database in a high‐throughput style.

• Or, identify similarities between– novel query sequencewhose structures and functions are usually unknown and/or uncharacterized

– sequences in (public) databaseswhose structures and functions have been elucidated and annotated.

Copyright © Peking University

Page 8: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Sequence Database Searching• The query sequence is compared/aligned with every sequence in the database

• Statistically significant hits are assumed to be related to the query sequence– Similar function/structure– Common evolutionary ancestor

Copyright © Peking University

Page 9: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

A (naïve) algorithm for database searching

Get the inputted query

Run pair‐wise global/local alignment between query and database sequence

Get a database sequence

Copyright © Peking University

Page 10: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

djiFdjiF

yxsjiFjiF

F

ji

1,,1

,1,1max,

00,0xi aligned to yj

xi aligned to a gap

yj aligned to a gap

1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

X

Y

Copyright © Peking University

Page 11: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Dynamic programming matrix

Sequence X of length mSequ

ence

Y of length n

There are nmentries in the matrix.

Each entry requires a constant number c of operation(s).

c*m*n operations needed in total, for one pair‐wise alignment.

Copyright © Peking University

Page 12: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

• Say your query sequence (HBA_HUMAN) has 142 amino acids

• Most recent release of human‐curated Swiss‐Prot protein databases contains 540,958 sequences with 192,206,270 amino acids (Sept 18th, 2013);– On average, the sequence length is 192,206,270/540,958 = 355.30 aa

• And assume your super‐fast computer can run one operation in 1μs = (0.000001s)

• Then, you will need 7.8 hr for ONE comparison!

Copyright © Peking University

Page 13: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Source: http://web.expasy.org/docs/relnotes/relstat.html 

Source: http://www.ebi.ac.uk/uniprot/TrEMBLstats

Copyright © Peking University

Page 14: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

log2(bp) = ‐1.2×103 + 0.59yR2 = 0.97, p‐value < 2.2×10‐16Doubling every 20 months

log2(bp) = ‐4.6×103 + 2.3yR2 = 0.97, p‐value < 2.2×10‐16Doubling every 5 months!

Genbank SRA

Copyright © Peking University

Page 15: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

djiFdjiF

yxsjiFjiF

F

ji

1,,1

,1,1max,

00,0xi aligned to yj

xi aligned to a gap

yj aligned to a gap

1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

X

Y

Copyright © Peking University

Page 16: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

y

x

1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

X

Y

Copyright © Peking University

Page 17: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

A GA G

0000C

4000G

0220A

0000

GAA

1,1 jiF

jiF , jiF ,1

1, jiF

d

d ji yxs ,

0

Copyright © Peking University

Page 18: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

BLAST: Intro• To make the alignment effectively, a Heuristic algorithm BLAST 

(Basic Local Alignment Search Tool) is proposed by Altschul et al in 1990.

• BLAST finds the highest scoring locally optimal alignments between a query sequence and a database. – Very fast algorithm– Can be used to search extremely large databases– Sufficiently sensitive and selective for most purposes– Robust – the default parameters just work for most cases

Copyright © Peking University

Page 19: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 20: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 21: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 22: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 23: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

protein sequence functional informationdata knowledge

(Modified from http://education.expasy.org/cours/Turin/UniProtKB_Turin.ppt)Copyright © Peking University

Page 24: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

(Modified from http://education.expasy.org/cours/Turin/UniProtKB_Turin.ppt)Copyright © Peking University

Page 25: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

MSKEKFERTKPHVNVGTIGHVDHGKTTLTAAITTVLAKTYGGAARAFDQIDNAPEEKARGITINTSHVEYDTPTRHYAHVDCPGHADYVKNMITGAAQMDGAILVVAATDGPMPQTREHILLGRQVGVPYIIVFLNKCDMVDDEELLELVEMEVRELLSQYDFPGDDTPIVRGSALKALE GDAEWEAKILELAGFLDSYIPEPERAIDKPFLLPIEDVFSISGRGTVVTGRVERGIIKVGEEVEIVGIKETQKSTCTGVEMFRKLLDEGRAGENVGVLLRGIKREEIERGQVLAKPGTIKPHTKFESEVYILSKDEGGRHTPFFKGYRPQFYFRTTDVTGTIELPEGVEMVMPGDNIKMV VTLIHPIAMDDGLRFAIREGGRTVGAGVVAKVLG

One protein sequenceOne gene

One species

Manual annotationKeywords

and Gene Ontology

Manual annotationFunction, Subcellular location,

Catalytic activity, Disease, Tissue specificty, Pathway…

Manual annotationPost-translational modifications,

variants, transmembrane domains, signal peptide…

Cross-references to over 125 databases

References

Protein and gene namesTaxonomic information

Alternative products:protein sequences produced by

alternative splicing, alternative promoter usage,

alternative initiation…

UniProtKB/Swiss-Protwww.uniprot.org

(Modified from http://education.expasy.org/cours/Turin/UniProtKB_Turin.ppt)

Copyright © Peking University

Page 26: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

2D gel2DBase-EcoliANU-2DPAGEAarhus/Ghent-2DPAGE (no server)

COMPLUYEAST-2DPAGECornea-2DPAGE DOSAC-COBS-2DPAGEECO2DBASE (no server)

OGPPHCI-2DPAGEPMMA-2DPAGERat-heart-2DPAGEREPRODUCTION-2DPAGESiena-2DPAGESWISS-2DPAGEUCD-2DPAGEWorld-2DPAGE

Family and domainGene3DHAMAPInterProPANTHERPfamPIRSFPRINTSProDomPROSITESMARTSUPFAMTIGRFAMs

Organism-specificAGDArachnoServerCGDConoServerCTDCYGDdictyBaseEchoBASEEcoGeneeuHCVdbEuPathDBFlyBaseGeneCardsGeneDB_SpombeGeneFarmGenoListGrameneH-InvDBHGNCHPA LegioListLepromaMaizeGDBMGIMIMneXtProtOrphanetPharmGKBPseudoCAPRGDSGDTAIRTubercuListWormBaseXenbaseZFIN

Protein family/groupAllergomeCAZyMEROPSPeroxiBasePptaseDBREBASETCDB

Genome annotationEnsemblEnsemblBacteriaEnsemblFungiEnsemblMetazoaEnsemblPlantsEnsemblProtistsGeneIDGenomeReviewsKEGGNMPDRTIGRUCSCVectorBase

Enzyme and pathwayBioCycBRENDAPathway_Interaction_DBReactome

OtherBindingDBDrugBank NextBio PMAP-CutDB

SequenceEMBLIPIPIRRefSeqUniGene

3D structureDisProtHSSPPDBPDBsumProteinModelPortalSMR

PTMGlycoSuiteDBPhosphoSitePhosSite

UniProtKB/Swiss-Prot:129 explicit links

and 14 implicit links!

ProteomicPeptideAtlasPRIDEProMEX

PPIDIPIntActMINTSTRING

Phylogenomic dbseggNOGGeneTreeHOGENOMHOVERGENInParanoidOMAOrthoDBPhylomeDBProtClustDB

PolymorphismdbSNP

Gene expressionArrayExpressBgeeCleanExGenevestigatorGermOnline

Ontologies GO

(Modified from http://education.expasy.org/cours/Turin/UniProtKB_Turin.ppt)

Page 27: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 28: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 29: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Summary

Domain(s)

Hits summary

Copyright © Peking University

Page 30: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 31: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

HBB_HUMAN

HBA_HUMAN

Copyright © Peking University

Page 32: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Needleman‐WunschGlobal alignment

Smith‐WatermanLocal alignment

BLAST Local alignment

Copyright © Peking University

Page 33: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 34: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 35: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 36: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 37: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Copyright © Peking University

Page 38: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

Summary Questions• Why do we need to perform a database searching?

• What’s the major challenge/obstacle when searching sequence database?

Copyright © Peking University

Page 39: Bioinformatics: Introduction and Methodsfac.ksu.edu.sa/sites/default/files/pkubioinfo-lecture_slides-2-1-1.pdf · 生物信息学:导论与方法 Bioinformatics: Introduction and

生物信息学:导论与方法Bioinformatics: Introduction and Methods

https://www.coursera.org/course/pkubioinfo