Fundamentals of Bioinformatics: Lecture Series

23/4/20 1

Lecture 1: Bioinformatics Research

http://cbb.sjtu.edu.cn/~mywu/bi217

What Is Bioinformatics?

Concepts of Bioinformatics

Biology + computer?

(Molecular) Bio-informatics

Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying "informatics" techniques (applied math, CS, and statistics) to understand and organize the information associated with these molecules on a large scale.

Bioinformatics is a practical discipline with many applications.

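Biomolecular information

Cell

Molecule

A system that stores, replicates, transmits, and expresses genetic information

Carriers of biological information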

Two information carriers: nucleic acid molecules and protein molecules

Protein Machines

Cell Protein Machines

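Three kinds of information: structural information related to function, genetic information, evolutionary information

Experiments, data, information, knowledge

Collection, representation, analysis, modeling

Characterization, comparison, inference

Applications

Genetic engineering, protein design

Disease diagnosis

Disease treatment

New drug development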

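An interdisciplinary field

Takes biomedical data and processes as its objects of study; applies mathematical and statistical methods; relies on modern computing techniques

Interprets biological data; builds mathematical models; predicts and controls biological processes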

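An applied discipline within a research-oriented field

The field is still at an early stage; most work remains at the laboratory level; industrial applications still have many shortcomings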

Research Approaches and Methods

General Approaches to Bioinformatics

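General research strategy

Pose a biological problem

Mathematical description and modeling: what are the premises, the assumptions, and the core of the problem?

Computational analysis: some problems call for numerical methods, others for Monte Carlo methods

Interpret the biological significance of the results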
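The computational analysis step can be illustrated with a toy question: how many times is a short motif expected to occur in a random DNA sequence? The sketch below answers it twice, once with a closed-form calculation and once with a Monte Carlo estimate. The motif "ATG", the sequence length of 500, and the uniform base model are assumptions chosen only for illustration.

```python
import random

def count_occurrences(seq, motif):
    """Count (possibly overlapping) occurrences of motif in seq."""
    k = len(motif)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == motif)

def expected_count_exact(n, k):
    """Expected occurrences of one fixed k-mer in a uniform random sequence of length n."""
    # Each of the n - k + 1 windows matches with probability (1/4)^k.
    return (n - k + 1) / 4 ** k

def expected_count_monte_carlo(n, motif, trials=2000, seed=0):
    """Estimate the same expectation by simulating random sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seq = "".join(rng.choice("ACGT") for _ in range(n))
        total += count_occurrences(seq, motif)
    return total / trials

exact = expected_count_exact(500, 3)
estimate = expected_count_monte_carlo(500, "ATG")
print(f"exact={exact:.3f}  monte_carlo={estimate:.3f}")
```

When a closed form exists, as here, it is the better tool; simulation becomes useful once the model (for example, non-uniform base composition or dependencies between positions) makes the exact calculation awkward.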

Research Topics in Bioinformatics

Working Around Central Dogma

Selected Problems in Biological Sequence Analysis

Genomes

Integrative Data

1995, H. influenzae (HI, bacteria): ~1.8 Mb, ~1700 genes, done [Science 269: 496]

1997, yeast (eukaryote): 13 Mb, ~6000 genes [Nature 387: 1]

1998, worm (animal): ~100 Mb, ~19-20K genes [Science 282: 1945]

1999: >30 completed genomes

2003, human: ~3 Gb, ~100K genes as estimated at the time [???]; later annotations put the number of human protein-coding genes at roughly 20,000

Comparing the human genome with the genomes of other organisms

DNA

Raw DNA Sequence

Coding or not?

Parse into genes?

4 bases: AGCT

~1 K in a gene, ~2 M in genome

~3 Gb Human

atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactgcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgcatcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacctgcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgttgttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatcaaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacactgaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . .

. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
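One simple first pass at the "coding or not?" question is to scan raw DNA for open reading frames (ORFs): a start codon followed, in the same frame, by a stop codon. The sketch below is a minimal illustration under several assumptions that are not from the slides: it uses the standard stop codons TAA/TAG/TGA, looks only at the forward strand, and the `min_codons` threshold is a hypothetical parameter. Real gene finders use far richer statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=10):
    """Return (start, end, frame) for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `end` is exclusive and the codon count includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # close it and keep scanning
    return orfs

# Toy input: a short made-up sequence with one ORF in frame 2.
example = "ccATGgcaattaaaattggtTAAgg"
print(find_orfs(example, min_codons=3))  # [(2, 23, 2)]
```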

Motif discovery

Protein Sequence

20 letter alphabet ACDEFGHIKLMNPQRSTVWY but not BJOUXZ

Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain

>1M known protein sequences (uniprot)

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF

d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI
d3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF

d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA
d3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV

d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKA
d3dfr__ -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV

Objects of study

Basic building blocks: Genome/DNA; Genes, proteins; Cis-regulatory elements

Basic mechanisms: Transcription; Splicing; Translation (steps 1~3 = gene expression); Post-translational modification / protein folding

Transcriptional regulatory mechanisms and other regulatory mechanisms: Alternative splicing; microRNAs; DNA/protein modification

Research topics

Sequence similarity search: find similar sequences in databases (global search, local search)

Multiple alignment

Motif discovery

Gene finding: determining genes and gene structure

Phylogenetic analysis

Comparative studies

Structural Analysis of Biological Macromolecules

Extending Studies to 3-Dimensions

Tertiary

Hierarchy of Protein Structure

3D Structure of Protein

Alpha-helix Beta-sheet

Loop and Turn

Turn or coil

Protein Structural Classes

Structural Analysis

Secondary structure prediction

Tertiary structure prediction

Molecular docking

Molecular dynamics (MD) simulation

Expression Data Analysis

Expression studies

Microarray data analysis

Low-expression gene

Highly expressed in treatment, but not control cells

High-expression gene

Highly expressed in control, but not treatment cells

Supervised Classification

Unsupervised Clustering

Two-dimensional electrophoresis map

Genome Sequence

Finding Genes in Genomic DNA: introns/exons/promoters

Characterizing Repeats in Genomic DNA: statistics, patterns

Duplications in the Genome: large-scale genomic alignment

Whole-Genome Comparisons

Finding Structural RNAs

Protein Sequence

Sequence Alignment: non-exact string matching, gaps; Dynamic Programming; Local vs Global Alignment; Suboptimal Alignment; Hashing to increase speed (BLAST, FASTA); Amino acid substitution scoring matrices

Multiple Alignment and Consensus: Multiple alignment; Transitive Comparisons; HMMs, Profiles; Motifs

Scoring schemes and Matching statistics: How to tell if a given alignment or match is statistically significant; A P-value (or an e-value)?; Score Distributions (extreme value distribution); Low Complexity Sequences

Evolutionary Issues: Rates of mutation and change
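The dynamic programming approach to global alignment listed above can be sketched with the Needleman-Wunsch recurrence. The version below is a minimal illustration, not a production aligner: the scoring values (match +1, mismatch -1, linear gap penalty -2) are assumptions chosen for the example, whereas real protein alignment would use a substitution matrix and usually affine gap penalties.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against only gaps
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against only gaps

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill: each cell is the best of substitution, gap in b, gap in a.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))

print(global_align("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

Local alignment (Smith-Waterman) uses the same table but clamps each cell at zero and starts the traceback from the highest-scoring cell instead of the corner.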

Sequence -> Structure

Secondary Structure "Prediction": via Propensities; Neural Networks, Genetic Alg.; Simple Statistics; TM-helix finding; Assessing Secondary Structure Prediction

Structure Prediction: Protein v RNA

Tertiary Structure Prediction: Fold Recognition; Threading; Ab initio; (Quaternary structure prediction)

Direct Function Prediction: Active site identification

Relation of Sequence Similarity to Structural Similarity

Structures

Structure Comparison: Basic Protein Geometry and Least-Squares Fitting (distances, angles, axes, rotations); Calculating a helix axis in 3D via fitting a line; LSQ fit of 2 structures; Molecular Graphics

Calculation of Volume and Surface: How to represent a plane; How to represent a solid; How to calculate an area; Hinge prediction; Packing Measurement

Structural Alignment: Aligning sequences on the basis of 3D structure; DP does not converge, unlike sequences, what to do?; Other Approaches: Distance Matrices, Hashing

Fold Library

Docking and Drug Design as Surface Matching

DBs/Surveys

Relational Database Concepts: Keys, Foreign Keys; SQL, OODBMS, views, forms, transactions, reports, indexes; Joining Tables, Normalization; Natural Join as "where" selection on cross product; Array Referencing (perl/dbm); Forms and Reports; Cross-tabulation

DB interoperation

What are the Units? What are the units of biological information for organization? sequence, structure; motifs, modules, domains; How classified: folds, motions, pathways, functions?

Clustering and Trees: Basic clustering (UPGMA, single-linkage, multiple linkage); Other Methods (Parsimony, Maximum likelihood); Evolutionary implications

Visualization of Large Amounts of Information

The Bias Problem: sequence weighting, sampling
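Of the clustering methods named above, UPGMA is compact enough to sketch in full: repeatedly merge the two closest clusters and set the distance from the merged cluster to every other cluster to the size-weighted average of the old distances. The code below is a minimal illustration; the function name, the nested-tuple tree format, and the toy distance matrix are assumptions made for the example.

```python
from itertools import combinations

def upgma(labels, matrix):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns nested tuples (left, right, height), where height is half the
    distance at which the two subtrees were merged.
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}      # id -> (tree, leaf count)
    dist = {(i, j): matrix[i][j] for i, j in combinations(range(n), 2)}
    next_id = n
    while len(clusters) > 1:
        (i, j), d_ij = min(dist.items(), key=lambda kv: kv[1])  # closest pair
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        del dist[(i, j)]
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            # Size-weighted average keeps every leaf pair equally weighted.
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        clusters[next_id] = ((tree_i, tree_j, d_ij / 2), n_i + n_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree

# Toy distances: A and B are close, C is equally far from both.
labels = ["A", "B", "C"]
matrix = [[0, 2, 6],
          [2, 0, 6],
          [6, 6, 0]]
print(upgma(labels, matrix))  # ('C', ('A', 'B', 1.0), 3.0)
```

UPGMA implicitly assumes a constant rate of change along all lineages, which is one reason the parsimony and maximum-likelihood methods listed above are often preferred for phylogenies.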

Mining

Information integration and fusion: dealing with heterogeneous data

Dimensionality Reduction (PCA etc)

(Func) Genomics

Expression Analysis: Time Courses clustering; Measuring differences; Identifying Regulatory Regions

Large scale cross referencing of information

Function Classification and Orthologs

The Genomic vs. Single-molecule Perspective

Genome Comparisons: Ortholog Families, pathways; Large-scale censuses; Frequent Words Analysis; Genome Annotation; Identification of interacting proteins

Networks: Global structure and local motifs

Structural Genomics: Folds in Genomes, shared & common folds; Bulk Structure Prediction

Genome Trees

Simulation

Molecular Simulation: Geometry -> Energy -> Forces; Basic interactions, potential energy functions; Electrostatics; VDW Forces; Bonds as Springs; How structure changes over time?; How to measure the change in a vector (gradient); Molecular Dynamics & MC; Energy Minimization

Parameter Sets

Number Density

Simplifications: Poisson-Boltzmann Equation; Lattice Models and Simplification
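The chain "Geometry -> Energy -> Forces" and the idea of bonds as springs can be made concrete with a single harmonic bond. The sketch below computes the energy from the geometry (a bond length), derives the force as the negative gradient, and performs steepest-descent energy minimization. The equilibrium length `r0`, force constant `k`, and step size are hypothetical values chosen for illustration, not parameters from any real force field.

```python
def bond_energy(r, r0=1.5, k=100.0):
    """Harmonic bond energy: E = 0.5 * k * (r - r0)^2."""
    return 0.5 * k * (r - r0) ** 2

def bond_force(r, r0=1.5, k=100.0):
    """Force along the bond: F = -dE/dr = -k * (r - r0)."""
    return -k * (r - r0)

def minimize(r, step=0.001, iterations=200):
    """Steepest descent: repeatedly move r a small step along the force."""
    for _ in range(iterations):
        r += step * bond_force(r)
    return r

r_start = 2.0                      # stretched bond
r_final = minimize(r_start)
print(f"start: r={r_start}, E={bond_energy(r_start):.3f}")
print(f"final: r={r_final:.4f}, E={bond_energy(r_final):.6f}")
```

Molecular dynamics uses the same forces but integrates Newton's equations of motion over time instead of sliding downhill, and a full potential adds angle, torsion, electrostatic, and van der Waals terms.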

Systems biology: E-cell

Metabolism network

Protein-protein interaction (PPI)

Nature 406 2000

Application 1: Novel Drug Design

Understanding How Structures Bind to Small Molecules (Function)

Designing Inhibitors

Docking, Structure Modeling

The Informatics in Bioinformatics

Common computing techniques

Database technology: building and searching databases

Statistics

Data mining

Artificial intelligence

Pattern recognition

Machine learning

Common algorithms

General numerical methods

Divide-and-conquer

Dynamic programming

Simulated annealing

Genetic algorithm

Hidden Markov model (HMM)

Bayesian approaches

Clustering methods

Classification methods

Ant colony system

Hill-climbing algorithm

Reading List

1. R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, chapter 1. Cambridge University Press.

2. H. Ji and W. H. Wong (2006). Computational biology: toward deciphering gene regulatory information in mammalian genomes, Biometrics, 62: 645–663.

3. D. B. Searls (2002). The language of genes. Nature, 420: 211-217

4. E. Nabieva et al (2005). Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21 (suppl.): i302-i310

5. M. Eisen et al (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci., 95: 14863-14868

6. M. Bengtsson et al (2005). Gene Expression Profiling in Single Cells from the Pancreatic Islets of Langerhans Reveals Lognormal Distribution of mRNA Levels. Genome Res., 15: 1388-1392

7. J. Dong et al (2007). Understanding Network Concepts in Modules. BMC Systems Biology, 1: 24

8. J. Thompson et al (2001). Analysis of Mutations at Residues A2451 and G2447 of 23S rRNA in the Peptidyltransferase Active Site of the 50S Ribosomal Subunit. PNAS, 98: 9002-9007

Assignments

Summarize reading (1) in 1~2 A4 pages, in no fewer than 500 words, in either Chinese or English.

For each of readings (2)~(8), extract the following items:

a) Biological problems

b) Approaches or algorithms

c) Conclusion

d) Your own opinions

Programming assignments

1. Write a regular expression that extracts the position data of all carbon atoms from a PDB file, then print the data.

2. Retrieve a gene of interest in FASTA format. Parse the text file to obtain the reverse complement of the sequence portion only (excluding the header line), then write the new string to a new file called revcomp.fasta.
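As one possible starting point for the second task, the sketch below separates the FASTA header from the sequence, computes the reverse complement, and writes the result to revcomp.fasta. It assumes a single-record FASTA file containing only A, C, G, T (other symbols are left unchanged); the helper names and the 60-column line width are illustrative choices, not requirements of the assignment.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def read_fasta(text):
    """Split single-record FASTA text into (header, sequence)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    header = lines[0].lstrip(">")
    sequence = "".join(lines[1:])      # sequence may span several lines
    return header, sequence

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def write_revcomp(in_path, out_path="revcomp.fasta", width=60):
    """Read one FASTA record and write its reverse complement."""
    with open(in_path) as handle:
        header, seq = read_fasta(handle.read())
    rc = reverse_complement(seq)
    with open(out_path, "w") as out:
        out.write(f">{header} reverse complement\n")
        for i in range(0, len(rc), width):
            out.write(rc[i:i + width] + "\n")

print(reverse_complement("ATGC"))  # GCAT
```

Calling `write_revcomp("mygene.fasta")`, where mygene.fasta is a placeholder for the downloaded file, would produce revcomp.fasta in the working directory.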

Application 2: Finding Homologs

Application 3: Overall Genome Characterization

Overall Occurrence of a Certain Feature in the Genome: e.g. how many kinases in Yeast

Compare Organisms and Tissues: Expression levels in Cancerous vs Normal Tissues

Databases, Statistics

Reading Assignments

After-class reading

Commonly used reference books

1. Andreas Baxevanis and Francis Ouellette (eds.), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 3rd Edition, Wiley & Sons, 2005.

2. R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press.

3. Cynthia Gibas and Per Jambeck. Developing Bioinformatics Computer Skills. O'Reilly, 2001.

That’s all for lecture 1
