Fundamentals of Bioinformatics Lecture Series
23/4/20 1
Fundamentals of Bioinformatics Lecture Series
Lecture 1: Bioinformatics Research
http://cbb.sjtu.edu.cn/~mywu/bi217
What Is Bioinformatics?
Concepts of Bioinformatics
Biology + computer?
(Molecular) Bio-informatics
Bioinformatics is conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying "informatics" techniques (from applied math, CS, and statistics) to understand and organize the information associated with these molecules on a large scale.
Bioinformatics is a practical discipline with many applications.
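Biomolecular Information
Cell
Molecule
A system that stores, replicates, transmits, and expresses genetic information
Carriers of biological information
Two Kinds of Information Carriers
Nucleic acid molecules
Protein molecules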
Protein Machines
Cell Protein Machines
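Three Kinds of Information
Structural information related to function
Genetic information
Evolutionary information
Experiment -> Data -> Information -> Knowledge
Collect, represent, analyze, model
Characterize, compare, infer
Applications
Genetic engineering
Protein design
Disease diagnosis
Disease treatment
New drug development
An Interdisciplinary Field
Takes biomedical data and processes as its objects of study
Applies mathematical and statistical methods
Draws on modern computer technology and methods
Interprets biological data
Builds mathematical models
Predicts and controls biological processes
An Applied Discipline Within the Research Sciences
The field is still at an early stage
Most work remains at the laboratory level
Industrial applications still fall short in many respects
Research Ideas and Methods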
General Approaches to Bioinformatics
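General Research Strategy
Pose a biological question
Mathematical description and modeling: what are the premises, the assumptions, and the core of the problem?
Computational analysis: some problems call for numerical methods, others for Monte Carlo methods
Interpret the biological significance of the results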
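To make the Monte Carlo step concrete, here is a minimal sketch. The question it answers (how often a short motif turns up by chance in random DNA), the function name `motif_hit_probability`, and the uniform base frequencies are all assumptions chosen for illustration; they are not taken from the lecture.

```python
import random

def motif_hit_probability(motif, seq_len, trials=20000, seed=0):
    """Estimate P(a random DNA sequence of length seq_len contains motif).

    Assumes each base is drawn independently and uniformly from A, C, G, T,
    which is a simplification of real genomic composition.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    hits = 0
    for _ in range(trials):
        seq = "".join(rng.choice("ACGT") for _ in range(seq_len))
        if motif in seq:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    # Hypothetical example: chance occurrence of a TATA-like hexamer in 500 bp
    print(motif_hit_probability("TATAAA", 500, trials=1000))
```

For a one-letter motif the exact answer is 1 - 0.75^n, which gives a quick sanity check on the sampler before it is applied to cases with no closed form.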
Research Topics in Bioinformatics
Working Around Central Dogma
Selected Problems in Biological Sequence Analysis
Genomes
Integrative Data
1995, HI (bacteria): 1.6 Mb & 1600 genes done
1997, yeast: 13 Mb & ~6000 genes
1998, worm: ~100 Mb with 19 K genes
1999: >30 completed genomes!
2003, human: 3 Gb & 100 K genes... (the estimate at the time; later annotation puts the count near 20 K protein-coding genes)
1995: Bacteria, 1.6 Mb, ~1600 genes [Science 269: 496]
1997: Eukaryote, 13 Mb, ~6K genes [Nature 387: 1]
1998: Animal, ~100 Mb, ~20K genes [Science 282: 1945]
2000?: Human, ~3 Gb, ~100K genes [???]
Comparison of the Human Genome with the Genomes of Other Organisms
DNA
Raw DNA Sequence
Coding or not?
Parse into genes?
4 bases: AGCT
~1 K in a gene, ~2 M in genome
~3 Gb Human
atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatcgtattccgtgcagcacaacaccgtgatgacattgaagttgtaggtattaacgacttaatcgacgttgaatacatggcttatatgttgaaatatgattcaactcacggtcgtttcgacggcactgttgaagtgaaagatggtaacttagtggttaatggtaaaactatccgtgtaactgcagaacgtgatccagcaaacttaaactggggtgcaatcggtgttgatatcgctgttgaagcgactggtttattcttaactgatgaaactgctcgtaaacatatcactgcaggcgcaaaaaaagttgtattaactggcccatctaaagatgcaacccctatgttcgttcgtggtgtaaacttcaacgcatacgcaggtcaagatatcgtttctaacgcatcttgtacaacaaactgtttagctcctttagcacgtgttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactgcaactcaaaaaactgtggatggtccatcagctaaagactggcgcggcggccgcggtgcatcacaaaacatcattccatcttcaacaggtgcagcgaaagcagtaggtaaagtattacctgcattaaacggtaaattaactggtatggctttccgtgttccaacgccaaacgtatctgttgttgatttaacagttaatcttgaaaaaccagcttcttatgatgcaatcaaacaagcaatcaaagatgcagcggaaggtaaaacgttcaatggcgaattaaaaggcgtattaggttacactgaagatgctgttgtttctactgacttcaacggttgtgctttaacttctgtatttgatgcagacgctggtatcgcattaactgattctttcgttaaattggtatc . . .
. . . caaaaatagggttaatatgaatctcgatctccattttgttcatcgtattcaacaacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
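As a rough illustration of the "coding or not?" question for raw sequence like the above, the sketch below counts bases and scans the three forward reading frames for ATG...stop stretches. The function names and the `min_codons` threshold are assumptions for illustration only. Real gene finders rely on statistical models and also scan the reverse strand, which this sketch does not do.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def base_counts(seq):
    """Count A, C, G, T in a DNA string (case-insensitive)."""
    seq = seq.upper()
    return {base: seq.count(base) for base in "ACGT"}

def gc_fraction(seq):
    """Fraction of A/C/G/T positions that are G or C."""
    counts = base_counts(seq)
    total = sum(counts.values())
    return (counts["G"] + counts["C"]) / total if total else 0.0

def find_orfs(seq, min_codons=30):
    """Return (start, end) index pairs of ATG...stop stretches, forward strand only.

    `end` is exclusive and includes the stop codon; min_codons counts the
    codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

if __name__ == "__main__":
    # First bases of the sequence shown on the slide
    prefix = "atggcaattaaaattggtatcaatggttttggtcgtatcggccgtatc"
    print(base_counts(prefix), gc_fraction(prefix))
    print(find_orfs(prefix, min_codons=5))
```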
Motif discovery
Protein Sequence
20 letter alphabet ACDEFGHIKLMNPQRSTVWY but not BJOUXZ
Strings of ~300 aa in an average protein (in bacteria), ~200 aa in a domain
>1M known protein sequences (UniProt)
d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPWN-LPADLAWFKRNTL--------NKPVIMGRHTWESI
d3dfr__ TAFLWAQDRDGLIGKDGHLPWH-LPDDLHYFRAQTV--------GKIMVVGRRTYESF
d1dhfa_ LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI
d8dfr__ LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI
d4dfra_ ISLIAALAVDRVIGMENAMPW-NLPADLAWFKRNTLD--------KPVIMGRHTWESI
d3dfr__ TAFLWAQDRNGLIGKDGHLPW-HLPDDLHYFRAQTVG--------KIMVVGRRTYESF
d1dhfa_ VPEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ VPEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ ---G-RPLPGRKNIILS-SQPGTDDRV-TWVKSVDEAIAACGDVP------EIMVIGGGRVYEQFLPKA
d3dfr__ ---PKRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLDQ----ELVIAGGAQIFTAFKDDV
d1dhfa_ -PEKNRPLKGRINLVLSRELKEPPQGAHFLSRSLDDALKLTEQPELANKVDMVWIVGGSSVYKEAMNHP
d8dfr__ -PEKNRPLKDRINIVLSRELKEAPKGAHYLSKSLDDALALLDSPELKSKVDMVWIVGGTAVYKAAMEKP
d4dfra_ -G---RPLPGRKNIILSSSQPGTDDRV-TWVKSVDEAIAACGDVPE-----.IMVIGGGRVYEQFLPKA
d3dfr__ -P--KRPLPERTNVVLTHQEDYQAQGA-VVVHDVAAVFAYAKQHLD----QELVIAGGAQIFTAFKDDV
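One simple way to compare two rows of an alignment like the one above is percent identity over the columns where neither row has a gap. The sketch below is illustrative only: the helper name `percent_identity` and the choice to ignore gapped columns are assumptions, and other conventions for the denominator exist.

```python
def percent_identity(row_a, row_b):
    """Percent of aligned, gap-free column pairs whose residues are identical."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # skip columns with a gap in either row
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

if __name__ == "__main__":
    # First-block rows for d1dhfa_ and d8dfr__, copied from the slide
    d1dhfa = "LNCIVAVSQNMGIGKNGDLPWPPLRNEFRYFQRMTTTSSVEGKQ-NLVIMGKKTWFSI"
    d8dfr = "LNSIVAVCQNMGIGKDGNLPWPPLRNEYKYFQRMTSTSHVEGKQ-NAVIMGKKTWFSI"
    print(round(percent_identity(d1dhfa, d8dfr), 1))
```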
Objects of Study
Basic building blocks: genome/DNA; genes, proteins; cis-regulatory elements
Basic mechanisms: transcription; splicing; translation (steps 1-3 = gene expression); post-translational modification / protein folding
Transcriptional regulatory mechanisms and other regulatory mechanisms: alternative splicing; microRNAs; DNA/protein modification
Research Topics
Sequencing
Similarity search: find similar sequences in databases (global search, local search)
Multiple alignment
Motif discovery
Gene finding: determining genes and gene structure
Phylogenetic analysis
Comparative studies
Structural Analysis of Biological Macromolecules
Extending Studies to 3-Dimensions
Tertiary
Hierarchy of Protein Structure
Protein 3D Structure
Alpha-helix Beta-sheet
Loop and Turn
Turn or coil
Protein Structural Classes
Structural Analysis
Secondary structure prediction
Tertiary structure prediction
Molecular docking
Molecular dynamics (MD) simulation
Expression Data Analysis
Expression studies
Microarray data analysis
Low-expression gene
Highly expressed in treatment, but not control cells
High-expression gene
Highly expressed in control, but not treatment cells
Supervised Classification
Unsupervised Clustering
Two-dimensional electrophoresis gel image
Genome Sequence
Finding Genes in Genomic DNA: introns/exons/promoters
Characterizing Repeats in Genomic DNA: statistics, patterns
Duplications in the Genome: large-scale genomic alignment
Whole-Genome Comparisons
Finding Structural RNAs
Protein Sequence
Sequence Alignment: non-exact string matching, gaps; dynamic programming; local vs. global alignment; suboptimal alignment; hashing to increase speed (BLAST, FASTA); amino acid substitution scoring matrices
Multiple Alignment and Consensus: multiple alignment; transitive comparisons; HMMs, profiles; motifs
Scoring Schemes and Matching Statistics: how to tell if a given alignment or match is statistically significant; a P-value (or an E-value)?; score distributions (extreme value distribution); low-complexity sequences
Evolutionary Issues: rates of mutation and change
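The dynamic-programming and local-vs-global items can be sketched in a few lines. The scoring values below (match +1, mismatch -1, linear gap -2) and the function name `alignment_score` are illustrative assumptions. Practical tools use amino acid substitution matrices and usually affine gap penalties, and this sketch returns only the score, not the alignment itself.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of strings a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch recurrence).
    local=True:  local alignment (Smith-Waterman recurrence, scores floored at 0).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalised
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            cell = max(diag, up, left)
            if local:
                cell = max(cell, 0)         # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]

if __name__ == "__main__":
    print(alignment_score("GATTACA", "GCATGCT"))
    print(alignment_score("TTGATTACATT", "GATTACA", local=True))
```

The table has (len(a)+1) x (len(b)+1) cells, so the cost is quadratic; that is the motivation for the hashing heuristics (BLAST, FASTA) listed above.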
Sequence -> Structure
Secondary Structure "Prediction": via propensities; neural networks, genetic algorithms; simple statistics; TM-helix finding; assessing secondary structure prediction
Structure Prediction: Protein vs. RNA
Tertiary Structure Prediction: fold recognition; threading; ab initio; (quaternary structure prediction)
Direct Function Prediction: active site identification
Relation of Sequence Similarity to Structural Similarity
Structures
Structure Comparison: basic protein geometry and least-squares fitting; distances, angles, axes, rotations; calculating a helix axis in 3D by fitting a line; LSQ fit of 2 structures; molecular graphics
Calculation of Volume and Surface: how to represent a plane; how to represent a solid; how to calculate an area; hinge prediction; packing measurement
Structural Alignment: aligning sequences on the basis of 3D structure; unlike with sequences, DP does not converge, so what to do?; other approaches: distance matrices, hashing
Fold Library
Docking and Drug Design as Surface Matching
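The item on calculating a helix axis by fitting a line can be sketched as follows. This is one possible approach, not necessarily the one the lecture has in mind: the least-squares line passes through the centroid along the principal eigenvector of the coordinate covariance matrix, found here by power iteration. The name `fit_line` is hypothetical, and the idealized helix coordinates (2.3 Å radius, 1.5 Å rise, 100 degrees per residue) are textbook alpha-helix values used only as test data.

```python
import math

def fit_line(points, iterations=100):
    """Least-squares line through 3D points: returns (centroid, unit direction)."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - centroid[k] for k in range(3)] for p in points]
    # 3x3 covariance matrix of the centered coordinates
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(3)]
           for i in range(3)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    # The start vector is arbitrary; it must not be orthogonal to the axis.
    v = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("cannot fit a line: degenerate point set or start vector")
        v = [x / norm for x in w]
    return centroid, v

# Idealized C-alpha trace of a 20-residue helix running along the z axis
helix = [(2.3 * math.cos(math.radians(100 * i)),
          2.3 * math.sin(math.radians(100 * i)),
          1.5 * i) for i in range(20)]

if __name__ == "__main__":
    center, axis = fit_line(helix)
    print(center, axis)
```

For a short helix the fitted direction is tilted slightly away from the true axis, because a partial turn does not average out evenly around the axis.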
DBs/Surveys
Relational Database Concepts: keys, foreign keys; SQL, OODBMS, views, forms, transactions, reports, indexes; joining tables, normalization; natural join as "where" selection on cross product; array referencing (perl/dbm); forms and reports; cross-tabulation
DB interoperation
What Are the Units? What are the units of biological information for organization: sequence, structure; motifs, modules, domains. How classified: folds, motions, pathways, functions?
Clustering and Trees: basic clustering (UPGMA, single-linkage, multiple linkage); other methods (parsimony, maximum likelihood); evolutionary implications
Visualization of Large Amounts of Information
The Bias Problem: sequence weighting; sampling
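The basic clustering methods named above can be illustrated with a small agglomerative routine. This is a sketch under assumptions: the function name `agglomerate`, the toy distance matrix, and the straightforward cubic-time search are for illustration, not an efficient or canonical implementation. With `linkage="average"` the distance between clusters is the mean of all member-to-member distances, which is the UPGMA rule; `linkage="single"` uses the minimum.

```python
def agglomerate(dist, labels, linkage="average"):
    """Agglomerative clustering on a symmetric distance matrix.

    Returns the merges in order as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a tuple of labels.
    """
    clusters = [(i,) for i in range(len(labels))]
    merges = []

    def cluster_distance(c1, c2):
        pairs = [dist[i][j] for i in c1 for j in c2]
        return min(pairs) if linkage == "single" else sum(pairs) / len(pairs)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        a, b = clusters[x], clusters[y]
        merges.append((tuple(labels[i] for i in a),
                       tuple(labels[i] for i in b), d))
        # Replace the two merged clusters with their union
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [a + b]
    return merges

# Hypothetical pairwise distances among four sequences A-D
toy = [[0, 2, 6, 10],
       [2, 0, 6, 10],
       [6, 6, 0, 4],
       [10, 10, 4, 0]]

if __name__ == "__main__":
    for merge in agglomerate(toy, ["A", "B", "C", "D"]):
        print(merge)
```

The merge order and merge heights together define the tree; dividing each UPGMA merge distance by two gives the usual branch-point height in an ultrametric tree.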
Mining
Information integration and fusion: dealing with heterogeneous data
Dimensionality Reduction (PCA etc.)
(Func) Genomics
Expression Analysis: time courses; clustering; measuring differences; identifying regulatory regions
Large-scale cross-referencing of information
Function Classification and Orthologs
The Genomic vs. Single-molecule Perspective
Genome Comparisons: ortholog families, pathways; large-scale censuses; frequent words analysis; genome annotation; identification of interacting proteins
Networks: global structure and local motifs
Structural Genomics: folds in genomes, shared & common folds; bulk structure prediction
Genome Trees
Simulation
Molecular Simulation: geometry -> energy -> forces; basic interactions, potential energy functions; electrostatics; VDW forces; bonds as springs; how does structure change over time?; how to measure the change in a vector (gradient); molecular dynamics & MC; energy minimization
Parameter Sets
Number Density
Simplifications: Poisson-Boltzmann equation; lattice models and simplification
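The chain geometry -> energy -> forces, with bonds treated as springs, can be sketched for a single bond. Everything numeric here is an assumption for illustration: the force constant, equilibrium length, step size, and function names are hypothetical, and a real force field adds angle, torsion, electrostatic, and van der Waals terms. The minimizer is plain steepest descent, which moves each atom along the force (the negative energy gradient).

```python
import math

def bond_energy(p, q, k, r0):
    """Harmonic bond energy E = 0.5 * k * (r - r0)**2 for atoms at p and q."""
    r = math.dist(p, q)
    return 0.5 * k * (r - r0) ** 2

def bond_forces(p, q, k, r0):
    """Forces on p and q, i.e. the negative gradient of the bond energy.

    Assumes the atoms do not coincide (r > 0).
    """
    r = math.dist(p, q)
    # dE/dr = k * (r - r0); the unit vector from p to q is (q - p) / r
    scale = k * (r - r0) / r
    f_p = [scale * (qi - pi) for pi, qi in zip(p, q)]
    f_q = [-f for f in f_p]  # equal and opposite
    return f_p, f_q

def minimize(p, q, k, r0, step=0.001, iterations=1000):
    """Steepest-descent minimization; stable only while step * k < 1."""
    p, q = list(p), list(q)
    for _ in range(iterations):
        f_p, f_q = bond_forces(p, q, k, r0)
        p = [x + step * f for x, f in zip(p, f_p)]
        q = [x + step * f for x, f in zip(q, f_q)]
    return p, q

if __name__ == "__main__":
    # Illustrative values: a bond stretched to 2.0 with equilibrium length 1.5
    start_p, start_q = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
    print(bond_energy(start_p, start_q, k=300.0, r0=1.5))
    end_p, end_q = minimize(start_p, start_q, k=300.0, r0=1.5)
    print(math.dist(end_p, end_q))
```

Molecular dynamics uses the same forces but integrates Newton's equations of motion over time instead of sliding downhill to the nearest minimum.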
Systems biology: E-cell
Metabolic network
Protein-protein interaction (PPI)
Nature 406 2000
Application 1: Novel Drug Design
Understanding How Structures Bind to Small Molecules (Function)
Designing Inhibitors
Docking, Structure Modeling
The Informatics in Bioinformatics
Common Computing Techniques
Database technology: building and searching databases
Statistics
Data mining
Artificial intelligence
Pattern recognition
Machine learning
Common Algorithms
General numerical methods
Divide-and-conquer
Dynamic programming
Simulated annealing
Genetic algorithms
Hidden Markov models (HMMs)
Bayesian statistical methods
Clustering methods
Classification methods
Ant colony system
Hill-climbing
Reading Lists
1. R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, chapter 1. Cambridge University Press.
2. H. Ji and W. H. Wong (2006). Computational biology: toward deciphering gene regulatory information in mammalian genomes. Biometrics, 62: 645-663.
3. D. B. Searls (2002). The language of genes. Nature, 420: 211-217.
4. E. Nabieva et al. (2005). Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics, 21 (suppl.): i302-i310.
5. M. Eisen et al. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci., 95: 14863-14868.
6. M. Bengtsson et al. (2005). Gene Expression Profiling in Single Cells from the Pancreatic Islets of Langerhans Reveals Lognormal Distribution of mRNA Levels. Genome Res., 15: 1388-1392.
7. J. Dong et al. (2007). Understanding Network Concepts in Modules. BMC Systems Biology, 1: 24.
8. J. Thompson et al. (2001). Analysis of Mutations at Residues A2451 and G2447 of 23S rRNA in the Peptidyltransferase Active Site of the 50S Ribosomal Subunit. PNAS, 98: 9002-9007.
Assignments
Summarize reading (1) in 1-2 A4 pages, in no fewer than 500 words, in either Chinese or English.
Extract the following items from each of readings (2)-(8):
a) Biological problems
b) Approaches or algorithms
c) Conclusion
d) Your own opinions
Programming assignments
1. Write a regular expression to extract all of the carbon atom position data from a PDB file, then print the data.
2. Retrieve a gene of interest in FASTA format. Parse the text file to obtain the reverse complement of the sequence portion alone, then write the new string to a new file called revcomp.fasta.
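One possible starting point for both tasks is sketched below; it is not a reference solution. It assumes standard fixed-column PDB ATOM records, in which the atom name occupies columns 13-16 and a one-letter element such as carbon begins in column 14, and a FASTA file holding a single record. The helper names and the input file name `gene.fasta` are hypothetical.

```python
import re

# Fixed-column ATOM record: serial (cols 7-11), atom name (13-16), altLoc (17),
# residue name (18-20), chain (22), residue number (23-26), x/y/z (31-54).
# " C.." requires a blank in column 13 and "C" in column 14, i.e. a carbon name.
CARBON_ATOM = re.compile(
    r"^ATOM  .{5} ( C..).(.{3}) (.)(.{4}).   (.{8})(.{8})(.{8})",
    re.MULTILINE,
)

def carbon_positions(pdb_text):
    """Return (atom, residue, chain, res_number, x, y, z) per carbon ATOM record."""
    found = []
    for m in CARBON_ATOM.finditer(pdb_text):
        name, res, chain, num, x, y, z = m.groups()
        found.append((name.strip(), res, chain, int(num),
                      float(x), float(y), float(z)))
    return found

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each base, then reverse; other symbols (e.g. N) pass through."""
    return seq.translate(COMPLEMENT)[::-1]

def revcomp_fasta(fasta_text, width=60):
    """Reverse-complement the sequence of a single-record FASTA string."""
    lines = fasta_text.strip().splitlines()
    header = lines[0]                      # the ">" description line is kept as-is
    seq = "".join(line.strip() for line in lines[1:])
    rc = reverse_complement(seq)
    body = "\n".join(rc[i:i + width] for i in range(0, len(rc), width))
    return f"{header}\n{body}\n"

# Usage sketch (file names other than revcomp.fasta are placeholders):
#   with open("gene.fasta") as handle:
#       result = revcomp_fasta(handle.read())
#   with open("revcomp.fasta", "w") as handle:
#       handle.write(result)
```

Matching on fixed columns rather than whitespace matters because adjacent PDB fields can run together with no space between them, for example large negative coordinates.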
Application 2: Finding Homologs
Application 3: Overall Genome Characterization
Overall Occurrence of a Certain Feature in the Genome: e.g. how many kinases in yeast
Compare Organisms and Tissues: expression levels in cancerous vs. normal tissues
Databases, Statistics
Reading Assignments
After-class reading assignments
Common Reference Books
1. Andreas Baxevanis and Francis Ouellette (eds.), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 3rd Edition, Wiley & Sons, 2005.
2. R. Durbin, S. R. Eddy, A. Krogh and G. Mitchison (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press
3. Gibas, Cynthia, and Per Jambeck. Developing Bioinformatics Computer Skills. O'Reilly, 2001.
That’s all for lecture 1