Fundamentals of Bioinformatics Lectures


Upload: lacey-conley

Post on 02-Jan-2016


Page 1: Fundamentals of Bioinformatics Lectures

23/4/20 1

Fundamentals of Bioinformatics Lectures

Lecture 1: Bioinformatics Research

http://cbb.sjtu.edu.cn/~mywu/bi217

Page 2: Fundamentals of Bioinformatics Lectures

What Is Bioinformatics?

Concepts of Bioinformatics

Page 3: Fundamentals of Bioinformatics Lectures

Biology + computer?

(Molecular) Bio-informatics

Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying "informatics" techniques (from applied math, CS, and statistics) to understand and organize the information associated with these molecules on a large scale.

Bioinformatics is a practical discipline with many applications.

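Page 4: Fundamentals of Bioinformatics Lectures

Biomolecular Information

Cell: a system that stores, replicates, transmits, and expresses genetic information

Molecule: the carrier of biological information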

Page 5: Fundamentals of Bioinformatics Lectures

Two information carriers: nucleic acid molecules and protein molecules

Page 6: Fundamentals of Bioinformatics Lectures

Page 7: Fundamentals of Bioinformatics Lectures

Protein Machines

Page 8: Fundamentals of Bioinformatics Lectures

Cell Protein Machines

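Page 9: Fundamentals of Bioinformatics Lectures

Three kinds of information

Structural information related to function

Genetic information

Evolutionary information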

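Page 10: Fundamentals of Bioinformatics Lectures

Experiment, data, information, knowledge

Collect, represent, analyze, model

Characterize, compare, infer

Applications

Genetic engineering, protein design

Disease diagnosis

Disease treatment

Drug development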

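Page 11: Fundamentals of Bioinformatics Lectures

An Interdisciplinary Field

Takes biomedical data and processes as its objects of study, applies mathematical and statistical methods, and relies on modern computer technology.

Interprets biological data, builds mathematical models, and predicts and controls biological processes.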

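Page 12: Fundamentals of Bioinformatics Lectures

An Applied Discipline Among Research Disciplines

The field is still at an early stage. Most work remains at the laboratory level, and industrial application still has many shortcomings.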

Page 13: Fundamentals of Bioinformatics Lectures

Research Ideas and Methods

General Approaches to Bioinformatics

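Page 14: Fundamentals of Bioinformatics Lectures

General Research Strategy

Pose a biological question

Mathematical description and modeling: what are the premises, the assumptions, and the core of the problem?

Computational analysis: some problems call for numerical methods, others for Monte Carlo methods

Interpret the biological meaning of the results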

Page 15: Fundamentals of Bioinformatics Lectures

Research Topics in Bioinformatics

Working Around the Central Dogma

Page 16: Fundamentals of Bioinformatics Lectures

Selected Problems in Biological Sequence Analysis

Page 17: Fundamentals of Bioinformatics Lectures

Genomes

Integrative Data

1995, H. influenzae (bacteria): 1.6 Mb & 1600 genes done

1997, yeast: 13 Mb & ~6000 genes

1998, worm: ~100 Mb with 19 K genes

1999: >30 completed genomes!

2003, human: 3 Gb & 100 K genes... (an early estimate; later annotations put the protein-coding gene count near 20,000)

Page 18: Fundamentals of Bioinformatics Lectures

1995: Bacteria, 1.6 Mb, ~1600 genes [Science 269: 496]

1997: Eukaryote, 13 Mb, ~6K genes [Nature 387: 1]

1998: Animal, ~100 Mb, ~20K genes [Science 282: 1945]

2000?: Human, ~3 Gb, ~100K genes [???]

Page 19: Fundamentals of Bioinformatics Lectures

Comparison of the human genome with the genomes of other organisms

Page 20: Fundamentals of Bioinformatics Lectures

DNA

Raw DNA Sequence

Coding or not?

Parse into genes?

4 bases: A, G, C, T

~1 K bases in a gene, ~2 M in a genome

~3 Gb in human

atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactgcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgcatcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacctgcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgttgttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatcaaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacactgaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . .

. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
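A first look at raw sequence like the fragments above is often just counting the four bases. The following is a minimal sketch of that idea, not a method taken from the lecture; the function names `base_composition` and `gc_content` are chosen here for illustration, and `fragment` is simply the first 48 bases of the sequence shown on the slide.

```python
from collections import Counter


def base_composition(seq):
    """Count A, C, G, T in a DNA string, ignoring case and any other symbols."""
    counts = Counter(seq.upper())
    return {base: counts[base] for base in "ACGT"}


def gc_content(seq):
    """Fraction of G and C among the recognized bases (0.0 for an empty input)."""
    comp = base_composition(seq)
    total = sum(comp.values())
    if total == 0:
        return 0.0
    return (comp["G"] + comp["C"]) / total


# First 48 bases of the raw sequence shown on the slide.
fragment = "atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatc"

if __name__ == "__main__":
    print(base_composition(fragment))
    print(round(gc_content(fragment), 3))
```

Counting bases says nothing yet about whether a region is coding; it is only the simplest statistic one can compute before moving on to gene finding.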

Page 21: Fundamentals of Bioinformatics Lectures

Motif discovery

Page 22: Fundamentals of Bioinformatics Lectures

Protein Sequence

20-letter alphabet: ACDEFGHIKLMNPQRSTVWY, but not BJOUXZ

Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain

>1M known protein sequences (UniProt)

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI
d3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF

d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA
d3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV

d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKA
d3dfr__ -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV
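One simple quantity to read off an alignment like the one above is the percent identity between two rows. The sketch below is an illustration under stated assumptions rather than the lecture's own method: the function name `percent_identity` is hypothetical, columns where either row has a gap (`-` or `.`) are skipped, and identity is reported over the remaining columns. The two example rows are the first block of `d1dhfa_` and `d8dfr__` from the slide.

```python
def percent_identity(row_a, row_b, gap_chars="-."):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(row_a, row_b):
        if a in gap_chars or b in gap_chars:
            continue  # skip columns that contain a gap in either row
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# First alignment block of two rows from the slide.
d1dhfa = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI"
d8dfr = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI"

if __name__ == "__main__":
    print(round(percent_identity(d1dhfa, d8dfr), 1))
```

Other conventions exist (for example, dividing by the full alignment length or by the shorter sequence), so the denominator should always be stated when reporting an identity value.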

Page 23: Fundamentals of Bioinformatics Lectures

Objects of Study

Basic building blocks
  Genome/DNA
  Genes, proteins
  Cis-regulatory elements

Basic mechanisms
  Transcription
  Splicing
  Translation (steps 1~3 = gene expression)
  Post-translational modification / protein folding

Transcriptional regulatory mechanisms and other regulatory mechanisms
  Alternative splicing
  microRNAs
  DNA/protein modification

Page 24: Fundamentals of Bioinformatics Lectures

Research Topics

Sequence similarity search: find similar sequences in databases
  global search
  local search

Multiple alignment

Motif discovery

Gene finding: determining genes and gene structure

Phylogenetic analysis

Comparative studies

Page 25: Fundamentals of Bioinformatics Lectures

Structural Analysis of Biological Macromolecules

Extending Studies to 3 Dimensions

Page 26: Fundamentals of Bioinformatics Lectures

Hierarchy of Protein Structure

Tertiary

Page 27: Fundamentals of Bioinformatics Lectures

3D Structure of Protein

Alpha-helix, Beta-sheet

Loop and Turn

Turn or coil

Page 28: Fundamentals of Bioinformatics Lectures

Protein Structural Classes

Page 29: Fundamentals of Bioinformatics Lectures

Structural Analysis

Secondary structure prediction

Tertiary structure prediction

Molecular docking

Molecular dynamics (MD) simulation

Page 30: Fundamentals of Bioinformatics Lectures

Expression Data Analysis

Expression studies

Page 31: Fundamentals of Bioinformatics Lectures

Microarray data analysis

Low-expression gene

Highly expressed in treatment, but not control cells

High-expression gene

Highly expressed in control, but not treatment cells

Page 32: Fundamentals of Bioinformatics Lectures

Supervised Classification

Page 33: Fundamentals of Bioinformatics Lectures

Unsupervised Clustering

Page 34: Fundamentals of Bioinformatics Lectures

Two-dimensional electrophoresis (2D gel) image

Page 35: Fundamentals of Bioinformatics Lectures

Genome Sequence

Finding Genes in Genomic DNA
  Introns/exons/promoters

Characterizing Repeats in Genomic DNA
  Statistics
  Patterns

Duplications in the Genome
  Large-scale genomic alignment

Whole-Genome Comparisons

Finding Structural RNAs

Page 36: Fundamentals of Bioinformatics Lectures

Protein Sequence

Sequence Alignment
  Non-exact string matching, gaps
  Dynamic programming
  Local vs. global alignment
  Suboptimal alignment
  Hashing to increase speed (BLAST, FASTA)
  Amino acid substitution scoring matrices

Multiple Alignment and Consensus
  Multiple alignment
  Transitive comparisons
  HMMs, profiles
  Motifs

Scoring Schemes and Matching Statistics
  How to tell if a given alignment or match is statistically significant
  A P-value (or an E-value)?
  Score distributions (extreme value distribution)
  Low-complexity sequences

Evolutionary Issues
  Rates of mutation and change
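The "dynamic programming" and "local vs. global alignment" items can be made concrete with a small score-only sketch. It assumes a deliberately simple scoring scheme (match +1, mismatch -1, linear gap penalty -2) chosen here for illustration; real protein alignment would use a substitution matrix and usually affine gaps. The function names are hypothetical, and the recurrences are the standard Needleman-Wunsch (global) and Smith-Waterman (local) forms.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score (Needleman-Wunsch recurrence, linear gaps)."""
    prev = [j * gap for j in range(len(b) + 1)]  # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman recurrence, linear gaps)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # 0 restarts the alignment
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    print(global_score("ACGT", "AGT"))
    print(local_score("TTTACGTTT", "GGACGGG"))
```

The only differences between the two functions are the boundary row (gap penalties versus zeros), the extra `0` in the maximum, and where the answer is read from (the last cell versus the best cell anywhere). Both keep only two rows, so they return scores but not the alignment itself; recovering the alignment would require storing the full table for traceback.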

Page 37: Fundamentals of Bioinformatics Lectures

Sequence -> Structure

Secondary Structure "Prediction"
  Via propensities
  Neural networks, genetic algorithms
  Simple statistics
  TM-helix finding
  Assessing secondary structure prediction

Structure Prediction: Protein vs. RNA

Tertiary Structure Prediction
  Fold recognition
  Threading
  Ab initio
  (Quaternary structure prediction)

Direct Function Prediction
  Active site identification

Relation of Sequence Similarity to Structural Similarity

Page 38: Fundamentals of Bioinformatics Lectures

Structures

Structure Comparison
  Basic protein geometry and least-squares fitting
    Distances, angles, axes, rotations
    Calculating a helix axis in 3D via fitting a line
    LSQ fit of 2 structures
  Molecular graphics

Calculation of Volume and Surface
  How to represent a plane
  How to represent a solid
  How to calculate an area
  Hinge prediction
  Packing measurement

Structural Alignment
  Aligning sequences on the basis of 3D structure
  DP does not converge, unlike for sequences; what to do?
  Other approaches: distance matrices, hashing

Fold Library

Docking and Drug Design as Surface Matching
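The "distances, angles" items reduce to a few lines of vector arithmetic on atomic coordinates. The sketch below uses only the standard library; the function names are chosen here for illustration. Note that `rmsd` compares two coordinate lists as given and does not perform the least-squares superposition mentioned on the slide: a full LSQ fit would first find the optimal rotation and translation, which is not shown.

```python
import math


def distance(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def angle_deg(a, b, c):
    """Angle in degrees at vertex b formed by the points a-b-c."""
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    cos_theta = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_theta))


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired points, without superposition."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(distance(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    print(distance((0, 0, 0), (3, 4, 0)))
    print(angle_deg((1, 0, 0), (0, 0, 0), (0, 1, 0)))
```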

Page 39: Fundamentals of Bioinformatics Lectures

DBs/Surveys

Relational Database Concepts
  Keys, foreign keys
  SQL, OODBMS, views, forms, transactions, reports, indexes
  Joining tables, normalization
    Natural join as "where" selection on cross product
    Array referencing (perl/dbm)
  Forms and reports
  Cross-tabulation

DB Interoperation

What Are the Units?
  What are the units of biological information for organization?
    sequence, structure
    motifs, modules, domains
  How classified: folds, motions, pathways, functions?

Clustering and Trees
  Basic clustering
    UPGMA
    single-linkage
    multiple linkage
  Other methods
    Parsimony, maximum likelihood
  Evolutionary implications

Visualization of Large Amounts of Information

The Bias Problem
  sequence weighting
  sampling
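The phrase "natural join as 'where' selection on cross product" can be illustrated without a database at all. In the sketch below, tables are plain lists of dictionaries; the table contents (`genes`, `folds`) and the function name `natural_join` are invented purely for illustration. The join forms every pair of rows (the cross product) and keeps only the pairs that agree on all shared column names (the "where" selection).

```python
from itertools import product


def natural_join(left, right):
    """Natural join of two tables given as lists of dicts."""
    joined = []
    for l_row, r_row in product(left, right):  # cross product of the two tables
        shared = set(l_row) & set(r_row)  # column names present in both rows
        if all(l_row[col] == r_row[col] for col in shared):  # the "where" selection
            joined.append({**l_row, **r_row})
    return joined


# Hypothetical example tables that share the column "gene_id".
genes = [
    {"gene_id": 1, "symbol": "geneA"},
    {"gene_id": 2, "symbol": "geneB"},
    {"gene_id": 3, "symbol": "geneC"},
]
folds = [
    {"gene_id": 1, "fold": "foldX"},
    {"gene_id": 1, "fold": "foldY"},
    {"gene_id": 3, "fold": "foldZ"},
]

if __name__ == "__main__":
    for row in natural_join(genes, folds):
        print(row)
```

This direct form costs time proportional to the product of the table sizes; database engines get the same result far faster by using indexes or hashing on the key, which is one reason keys and indexes appear on the same slide.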

Page 40: Fundamentals of Bioinformatics Lectures

Mining

Information Integration and Fusion
  Dealing with heterogeneous data

Dimensionality Reduction (PCA etc.)

Page 41: Fundamentals of Bioinformatics Lectures

(Func) Genomics

Expression Analysis
  Time courses, clustering
  Measuring differences
  Identifying regulatory regions

Large-Scale Cross-Referencing of Information

Function Classification and Orthologs

The Genomic vs. Single-Molecule Perspective

Genome Comparisons
  Ortholog families, pathways
  Large-scale censuses
  Frequent words analysis
  Genome annotation
  Identification of interacting proteins

Networks
  Global structure and local motifs

Structural Genomics
  Folds in genomes, shared & common folds
  Bulk structure prediction

Genome Trees
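"Frequent words analysis" usually means counting every overlapping word of length k (a k-mer) in a sequence and asking which occur most often. The following is a minimal sketch of that counting step; the function names `kmer_counts` and `most_frequent_kmers` are chosen here for illustration, and no statistical correction for base composition is attempted.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count all overlapping words of length k in a sequence (case-insensitive)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def most_frequent_kmers(seq, k):
    """Return the k-mers with the highest count, in alphabetical order."""
    counts = kmer_counts(seq, k)
    if not counts:
        return []  # sequence shorter than k
    top = max(counts.values())
    return sorted(word for word, n in counts.items() if n == top)


if __name__ == "__main__":
    print(kmer_counts("acgacgtt", 3).most_common(2))
    print(most_frequent_kmers("acgacgtt", 3))
```

Whether a frequent word is biologically interesting depends on how often it would be expected by chance given the sequence composition, which ties this back to the scoring and significance questions raised for alignments.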

Page 42: Fundamentals of Bioinformatics Lectures

Simulation

Molecular Simulation
  Geometry -> Energy -> Forces
  Basic interactions, potential energy functions
    Electrostatics
    VDW forces
    Bonds as springs
  How does structure change over time?
    How to measure the change in a vector (gradient)
  Molecular dynamics & MC
  Energy minimization

Parameter Sets

Number Density

Simplifications
  Poisson-Boltzmann equation
  Lattice models and simplification
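The chain "geometry -> energy -> forces" and the "bonds as springs" item can be sketched for a single bond. The example assumes a harmonic potential E(r) = (1/2) k (r - r0)^2, so the force is its negative derivative, F(r) = -k (r - r0). The spring constant `k`, rest length `r0`, and step size are arbitrary illustrative values, not parameters from any force field, and `minimize` is a plain steepest-descent loop written for this example.

```python
def bond_energy(r, k=100.0, r0=1.5):
    """Harmonic bond energy: E(r) = 0.5 * k * (r - r0)**2."""
    return 0.5 * k * (r - r0) ** 2


def bond_force(r, k=100.0, r0=1.5):
    """Force along the bond: F(r) = -dE/dr = -k * (r - r0)."""
    return -k * (r - r0)


def minimize(r, step=0.005, iterations=200, k=100.0, r0=1.5):
    """Steepest descent: repeatedly move the bond length along the force."""
    for _ in range(iterations):
        r += step * bond_force(r, k, r0)
    return r


if __name__ == "__main__":
    stretched = 2.0
    print(bond_energy(stretched), bond_force(stretched))
    print(minimize(stretched))
```

Energy minimization follows the force downhill until it vanishes, whereas molecular dynamics would use the same force in Newton's equations to follow the motion over time. With these illustrative values the product `step * k` is 0.5, so each iteration halves the distance to `r0`; a step that is too large would overshoot and diverge.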

Page 43: Fundamentals of Bioinformatics Lectures

Page 44: Fundamentals of Bioinformatics Lectures

Systems biology: E-cell

Page 45: Fundamentals of Bioinformatics Lectures

Metabolic network

Page 46: Fundamentals of Bioinformatics Lectures

Protein-protein interaction (PPI)

Page 47: Fundamentals of Bioinformatics Lectures

Nature 406 (2000)

Page 48: Fundamentals of Bioinformatics Lectures

Application 1: Novel Drug Design

Understanding How Structures Bind to Small Molecules (Function)

Designing Inhibitors

Docking, Structure Modeling

Page 49: Fundamentals of Bioinformatics Lectures

The Informatics in Bioinformatics

Page 50: Fundamentals of Bioinformatics Lectures

Commonly Used Computing Techniques

Database technology: building and searching databases

Statistics

Data mining

Artificial intelligence

Pattern recognition

Machine learning

Page 51: Fundamentals of Bioinformatics Lectures

Commonly Used Algorithms

General numerical methods

Divide-and-conquer

Dynamic programming

Simulated annealing

Genetic algorithm

Hidden Markov model (HMM)

Bayesian statistical methods

Clustering methods

Classification methods

Ant colony system

Hill-climbing algorithm

Page 52: Fundamentals of Bioinformatics Lectures

Reading Lists

1. R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, chapter 1. Cambridge University Press.

2. H. Ji and W. H. Wong (2006). Computational biology: toward deciphering gene regulatory information in mammalian genomes. Biometrics, 62: 645-663.

3. D. B. Searls (2002). The language of genes. Nature, 420: 211-217.

4. E. Nabieva et al. (2005). Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21 (suppl.): i302-i310.

5. M. Eisen et al. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci., 95: 14863-14868.

6. M. Bengtsson et al. (2005). Gene expression profiling in single cells from the pancreatic islets of Langerhans reveals lognormal distribution of mRNA levels. Genome Res., 15: 1388-1392.

7. J. Dong et al. (2007). Understanding network concepts in modules. BMC Systems Biology, 1: 24.

8. J. Thompson et al. (2001). Analysis of mutations at residues A2451 and G2447 of 23S rRNA in the peptidyltransferase active site of the 50S ribosomal subunit. Proc. Natl. Acad. Sci., 98: 9002-9007.

Page 53: Fundamentals of Bioinformatics Lectures

Assignments

Summarize reading (1) in 1~2 A4 pages, in no fewer than 500 words, in either Chinese or English.

For each of readings (2)~(8), extract the following items:

a) Biological problems

b) Approaches or algorithms

c) Conclusion

d) Your own opinions

Page 54: Fundamentals of Bioinformatics Lectures

Programming assignments

1. Write a regular expression to extract all of the carbon atom position data from a PDB file, and then print this data out.

2. Retrieve a gene of interest in FASTA format. Parse the text file to obtain the reverse complement of the sequence portion alone, and then write this new string to a new file called revcomp.fasta.
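One possible starting point for both assignments is sketched below; it is an illustration under stated assumptions, not a reference solution. For the PDB task it assumes the fixed-column PDB layout in which x, y, z occupy columns 31-54 and the element symbol is right-justified in columns 77-78, so files that leave the element columns blank would need a different pattern. For the FASTA task it assumes a single-record file using only A, C, G, T (other symbols pass through unchanged). All function names and the `input_path` parameter are hypothetical.

```python
import re

# ATOM/HETATM records whose element field (columns 77-78) is carbon.
CARBON_ATOM = re.compile(
    r"^(?:ATOM  |HETATM)"   # columns 1-6: record name
    r".{24}"                # columns 7-30: serial, atom name, residue, chain, number
    r"(.{8})(.{8})(.{8})"   # columns 31-54: x, y, z coordinates
    r".{22}"                # columns 55-76: occupancy, B-factor, padding
    r" C",                  # columns 77-78: element symbol, right-justified
    re.MULTILINE,
)


def carbon_positions(pdb_text):
    """Return (x, y, z) tuples for every carbon atom record in the text."""
    return [tuple(float(v) for v in m.groups()) for m in CARBON_ATOM.finditer(pdb_text)]


COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def revcomp_fasta(fasta_text, width=60):
    """Reverse-complement a single-record FASTA string, keeping its header line."""
    lines = fasta_text.strip().splitlines()
    header = lines[0]
    seq = "".join(line.strip() for line in lines[1:])
    rc = reverse_complement(seq)
    body = "\n".join(rc[i:i + width] for i in range(0, len(rc), width))
    return f"{header}\n{body}\n"


def write_revcomp(input_path, output_path="revcomp.fasta"):
    """Read a FASTA file and write its reverse complement to output_path."""
    with open(input_path) as handle:
        text = handle.read()
    with open(output_path, "w") as handle:
        handle.write(revcomp_fasta(text))
```

To print the carbon coordinates, one could loop over `carbon_positions(open(path).read())` for some PDB file `path`. Anchoring the pattern on the element columns rather than the atom name avoids mistaking calcium (`CA` in the element field) for an alpha carbon (`CA` in the atom-name field).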

Page 55: Fundamentals of Bioinformatics Lectures

Application 2: Finding Homologs

Page 56: Fundamentals of Bioinformatics Lectures

Application 3: Overall Genome Characterization

Overall Occurrence of a Certain Feature in the Genome
  e.g. how many kinases in yeast

Compare Organisms and Tissues
  Expression levels in cancerous vs. normal tissues

Databases, Statistics

Page 57: Fundamentals of Bioinformatics Lectures

Reading Assignments

After-class reading

Page 58: Fundamentals of Bioinformatics Lectures

Commonly Used Reference Books

1. Andreas Baxevanis and Francis Ouellette (eds.), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 3rd Edition, Wiley & Sons, 2005.

2. R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

3. Gibas, Cynthia, and Per Jambeck. Developing Bioinformatics Computer Skills. O'Reilly, 2001.

Page 59: Fundamentals of Bioinformatics Lectures

That's all for lecture 1