An Introduction to Several Computational Problems in Bioinformatics
Kui Lin (林魁)
Laboratory of Computational Molecular Biology
[http://cmb.bnu.edu.cn]
College of Life Sciences, Beijing Normal University
Outline
Introduction to bioinformatics
Genome annotation
Genome evolution
Metagenomics
Acknowledgements
From U.S. Department of Energy Human Genome Program. http://www.ornl.gov/hgmis
Is Biology an Informational Science?
The HGP changed how we view & practice biology.
Biology is an informational science.
Digital genome
Environmental signals
Biology has become a cross-disciplinary science.
Bioinformatics as an intersecting discipline
Mathematical sciences Computer science
Life sciences
Developing the high throughput technologies and
computational/mathematical tools required for this new
biology.
Why? Where? What? How?
• Why: What should these huge datasets be produced for?
Biological background is needed.
• Where: Raw data need to be stored; IT platforms are required.
• What: Patterns in the datasets that can be analyzed using
computers. Various data models and their respective
algorithms are needed.
• How: Different resources need to be integrated.
What is Bioinformatics?
• The field of biology specializing in developing hardware and software to store and analyze the huge amounts of data being generated by life scientists. (NIH)
• More than 20 different definitions can be found on Google!
• Computational Biology?
• Computational Molecular Biology?
Data integration
Various molecular biology databases
Bioinformatics applications
Data analysis
Key Challenge of Bioinformatics
The world of biology is very different from what it
was even ten years ago.
To bridge the considerable gap between technical
data production and its use by scientists for
biological discovery.
Clients
Intranet and/or Internet
Servers
Browsers
Light applications
WWW servers
Database servers
Intensive computing servers
HTML/XML
PERL/C/C++/Java (BioPerl)
MySQL
Bioinformatics tools
Statistical analysis (R)
HPC with MPI
Schematic platform for bioinformatics applications
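Next-generation genome sequencing (NGS) technologies
3730xl
Pyrosequencing: 100 million bp
Sequencing by synthesis: 1.5 billion bp
Sequencing by ligation: 2 billion bp
Sanger (dideoxynucleotides): 100,000 bp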
Technology     Read length    Throughput    Sequencing cost
Sanger         Long           Low           High
454            Middle         Middle        Middle
(unlabeled)    Short          High          Low
Trade-off between read length and sequencing cost
With the new technology
• New scientific questions emerge
• Existing questions can be answered in a way that
was not considered before.
Sequence assembly by the shotgun approach
• The master sequence is reconstructed from short sequences,
simply by examining the sequences for
overlaps.
• No prior knowledge of the genome is needed.
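The overlap idea can be illustrated with a toy greedy assembler. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats longer than the overlaps); the function names and the `min_len` parameter are hypothetical, and real assemblers use far more elaborate graph-based methods.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    contigs = [r for r in contigs
               if not any(r != s and r in s for s in contigs)]
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy example: three overlapping reads cut from a short master sequence.
master = "CGGTTGAAAGCGGTAGCGTCCATGCG"
reads = [master[0:12], master[7:19], master[14:26]]
print(greedy_assemble(reads))
```

Note that the first and third reads share a spurious 4-base overlap ("AGCG"); the greedy choice of the largest overlap avoids it here, but repeats of this kind are exactly what makes real assembly hard.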
CGGTTGAAAGCGGTAGCGTCCATGCGTATTACTCTTGAGCGGTCGAACCTTCTGAAATCGCTGAACCACGTCCACCGGGTCGTCGAGCGTCGCAACACGATCCCGATCCTGTCCAACGTTCTGCTGCGCGCCTCCGGCGCCAATCTGGACATGAAGGCGACCGACCTCGATCTGGAAATCACCGAAGCGACCCCGGCCATGGTGGAGCAGGCTGGCGCCACCACCGTACCGGCACACCTGCTTTACGAAATCGTGCGCAAGCTGCCGGATGGTTCCGAAGTGCTTCTGGCGACCAACCCGGACGGCTCCTCCATGACCGTTGCGTCCGGCCGCTCGAAATTCTCGCTGCAATGCCTGCCGGAAGCGGATTTCCCTGACCTCACCGCCGGCACCTTCAGCCACACCTTCAAACTGAAGGCGGCCGATCTGAAGATGCTGATCGACCGGACGCAGTTTGCGATTTCGACCGAAGAGACGCGTTATTACCTGAACGGCATTTTCTTCCACACCATCGAAAGCAATGGCGAGCTGAAACTGCGCGCCGTCGCCACCGACGGTCACCGCCTTGCGCGTGCTGACGTCGATGCGCCCTCCGGCTCCGAAGGCATGCCGGGCATCATCATTCCGCGCAAGACCGTCGGTGAACTGCAGAAGCTGATGGACAATCCGGAACTGGAAGTCACAGTCGAAGTCTCGGATGCGAAGATCCGCCTGGCCATCGGTTCCGTCGTTCTGACCTCGAAGCTGATCGACGGCACCTTTCCCGATTATCAGCGCGTCATCCCAACCGGCAACGACAAGGAAATGCGCGTCGATTGCCAGACCTTCGCCCGGGCAGTGGACCGTGTTTCGACGATTTCTTCCGAGCGCGGCCGCGCCGTGAAGCTGGCGCTAACTGACGGCCAGTTGACGCTGACCGTCAACAATCCCGACTCGGGAAGTGCTACCGAAGAAGTGGCCGTTGGCTACGACAATGATTCGATGGAAATCGGCTTCAATGCCAAATATCTCCTCGACATCACGTCGCAGCTCTCCGGCGAAGATGCGATTTTTCTGCTGGCGGATGCGGGTTCGCCAACACTGGTTCGCGATACCGCCGGCGACGACGCACTCTATGTTCTGATGCCGATGCGCGTTTAAAACCGACCGTTTTCTTCAATTTTTCCAGAAACGCCGGTGGATCGCTTCATCGGCGTTTTTTGATTCGGCGAACAGGTGGCTCTACCCGTAACTGAATTTTCTCAGTTACGACATTTTGCCTTGTTTTTGCGCCAAATGGGATCAACAGTACGTAACAATTTTTTGACAATGACCAATACATCCGAGGGGAATCATGGCACTCAACCTGAAGCAACGGCTTGAACAAAAATTTGAGGAAGAAATCCGCTTTTTCAAAGGTATGGTCAGCCAGCCGAAAAAAGTCGGCGCCATTGTCCCGACGGTTCCGTCGTTCTGACCTCGAAGCTGATCGACGGCACCTTTCCCGATTATCAGCGCGTCATCCCAACCGGCAACGACAAGGAAATGCGCGTCGATTGCCAGACCTTCGCCCGGGCAGTGGACCGTGTTTCGACGATTTCTTCCGAGCGCGGCCGCGCCGTGAAGCTGGCGCTAACTGACGGCCAGTTGACGCTGACCGTCAACAATCCCGACTCGGGAAGTGCTACCGAAGAAGTGGCCGTTGGCTACGACAATGATTCGATGGAAATCGGCTTCAATGCCAAATATCTCCTCGACATCACGTCGCAGCTCTCCGGCGAAGATGCGATTTTTCTGCTGGCGGATGCGGGTTCGCCAACACTGGTTCGCGATACCGCCGGCGACGACGCACTCTATGTTCTGATGCCGATGCGCGTTTAAAACCGACCGTTTTCTTCAATTTTTCCAGAAACGCCGGTGGATCGCTTCATCGGCGTTTTTTGATTCGGCGAACAGGTGGCTCTACCCGTAACTGAATTTTCTCAGTTACGACATTTTGCCTTGTTTTTGCGCCAAATGGG
ATCAACAGTACGTAACAATTTTTTGACAATGACCAATACATCCGAGGGGAATCATGGCACTCAACCTGAAGCAACGGCTT
Where is a gene?
Three Layers of Genome Annotation. From Stein, L. 2001. Nature Reviews Genetics 2:493-503
1. Transcriptional control sites
2. Non-coding RNAs
3. mRNAs
• Evidence-based prediction: cis-alignment, trans-alignment
• Ab initio / de novo prediction
Some models
• Dynamic programming
• Hidden Markov Models (HMMs)
• Conditional random field (CRF)
• Support vector machines (SVMs)
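As a small illustration of how two of these models fit together, the sketch below decodes a two-state HMM with the Viterbi algorithm, which is itself a dynamic programming recursion. Everything specific here is an assumption made for illustration: the state names "N" (non-coding) and "C" (coding), and all the probabilities, are invented and not taken from any real gene finder.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for obs, computed in log space."""
    # scores[t][s]: best log-probability of any path ending in state s at t.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
               for s in states}]
    back = []  # back[t][s]: best predecessor of state s at position t + 1
    for x in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda p: scores[-1][p] + math.log(trans_p[p][s]))
            col[s] = (scores[-1][prev] + math.log(trans_p[prev][s])
                      + math.log(emit_p[s][x]))
            ptr[s] = prev
        scores.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return "".join(reversed(path))


# Hypothetical parameters: the "coding" state C is GC-rich, N is AT-rich.
states = ("N", "C")
start_p = {"N": 0.5, "C": 0.5}
trans_p = {"N": {"N": 0.9, "C": 0.1}, "C": {"N": 0.1, "C": 0.9}}
emit_p = {
    "N": {"A": 0.35, "T": 0.35, "G": 0.15, "C": 0.15},
    "C": {"A": 0.15, "T": 0.15, "G": 0.35, "C": 0.35},
}
print(viterbi("AAAAAAAAGGGGGGGGAAAAAAAA", states, start_p, trans_p, emit_p))
```

Real gene-finding HMMs have many more states (exons in each frame, introns, splice sites), but the decoding step follows the same recursion.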
Cucumber Genome Annotation Project
The Institute of Vegetables and Flowers,
Chinese Academy of Agricultural Sciences
Laboratory of Computational Molecular Biology,
Beijing Normal University
CMB genome annotation pipeline
Inputs: genomic sequence, ESTs/cDNAs, UniProt proteins
Tools: RepeatMasker, EVM, Rfam, PseudoPipe
Outputs: repeats, protein-coding genes, pseudogenes, ncRNA genes
Functional annotation: protein homology, domain annotation (InterProScan), mapping to Gene Ontology, mapping to KEGG
Storage: MySQL DBMS
WWW service + visualization: GBrowse
CMB genome annotation pipeline: GBrowse
Depending on the state of the sequencing project, genomic coordinates along a chromosome may change dramatically from assembly to assembly.
Phylogenetics
• Evolutionary theory states that groups of
similar organisms are descended from a
common ancestor.
• Phylogenetic systematics (cladistics) is a
method of taxonomic classification that groups
organisms based on their evolutionary history.
• It was developed by Willi Hennig, a German
entomologist, in 1950.
Major reasons to use phylogenetics
Understand the lineage of different species
Organizing principle to sort species into a taxonomy
Understand how various functions evolved
Understand forces and constraints on evolution
Perform multiple sequence alignment
Predict gene function (phylogenetic footprint)
Species/Gene Trees
Species tree (how are my species related?)
• contains only one representative from each species
• when did speciation take place?
• all nodes indicate speciation events
Gene tree (how are my genes related?)
• often contains a number of genes from a single species
• nodes relate either to speciation or gene duplication events
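One simple way to tell the two kinds of gene-tree node apart is a species-overlap heuristic: if the two subtrees below a node share a species, treat the node as a duplication; otherwise treat it as a speciation. The sketch below assumes a rooted binary gene tree written as nested tuples, with hypothetical leaf names of the form "species|gene". It is an illustration of the idea, not the method used in the talk.

```python
def label_events(node):
    """Label internal nodes of a gene tree as speciation or duplication.

    node is either a leaf string "species|gene" or a pair (left, right).
    Returns (species_set, leaf_names, events), where events is a list of
    (leaf_names_under_node, kind) in post-order.
    """
    if isinstance(node, str):  # leaf
        species = node.split("|")[0]
        return {species}, [node], []
    left_sp, left_leaves, left_ev = label_events(node[0])
    right_sp, right_leaves, right_ev = label_events(node[1])
    # Shared species below both children implies a duplication.
    kind = "duplication" if left_sp & right_sp else "speciation"
    leaves = left_leaves + right_leaves
    events = left_ev + right_ev + [(tuple(leaves), kind)]
    return left_sp | right_sp, leaves, events


# Toy gene family: one duplication before the human/mouse split.
gene_tree = ((("human|A1", "mouse|A1"), ("human|A2", "mouse|A2")), "fly|A")
for leaves, kind in label_events(gene_tree)[2]:
    print(kind, leaves)
```

In a species tree every internal node would come out as a speciation; it is the duplication nodes that make a gene tree differ.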
Species tree
Phylogenomics: Genome trees
Explore genome evolution based on large data sets of DNA or protein sequences.
Using entire genomes to infer a species tree (Eisen and Fraser 2003).
Uses the maximum amount of genetic information and averages out anomalies.
Has become the standard for reconstructing reliable phylogenies (Ciccarelli et al. 2006; Daubin et al. 2002).
Phylogenomics and the tree of life
From Delsuc, F., et al. (2005) Phylogenomics and the reconstruction of the tree of life. Nat. Rev. Genet. 6, 361-375
Taxonomic resolution of some of the novel approaches
• Creating ever-more robust phylogenies on the basis of diverse data sets.
Evolutionary theory is evolving
100 trillion microbial cells
The dominant form of life on Earth
~1,000 Gbp of microbial genome sequence per gram of soil!
Metagenomics offers a way forward
• Who is out there?
• What are they doing?
• What is being done by the community?
Why genomics is not enough
• Most microbes cannot be cultured
• Microbial diversity and variation have no limits
Definition:
Both a set of research techniques & a research field.
16S rRNA-based analysis: fast, efficient, and widely applied
“Who is out there?” Developed and refined over many years (Olsen et al. 1986)
Renaissance currently (Tringe & Hugenholtz 2008)
High conservation of 16S rRNA; variable regions
Databases containing large numbers of rRNA gene sequences (>200,000) (Cole et al. 2005; Medini et al. 2008), encountering the limitations of existing tools
hsp60 gene: more sensitive than SSU rRNA
Cartoon of the general structure of the bacterial 16S rRNA gene
Who is there?
Analysis of diversity in the human gut microbial community based on surveys of a limited number of humans.
“strain-level”
“species”
“genus”
“family”
Broad-range PCR amplification and sequencing of microbial 16S rRNA genes
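Grouping 16S rRNA reads into taxonomic levels is often done by clustering them at a sequence-identity cutoff. The sketch below is a minimal greedy clustering under stated assumptions: the sequences are already aligned to equal length, the function names are hypothetical, and the thresholds in the usage example are arbitrary illustrations rather than values from the slides.

```python
def identity(a, b):
    """Fraction of identical positions in two equal-length aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs, threshold):
    """Greedy clustering of sequences into operational taxonomic units (OTUs).

    seqs maps read names to aligned sequences. Each read joins the first
    OTU whose seed sequence it matches at >= threshold identity; otherwise
    it becomes the seed of a new OTU.
    """
    otus = []  # list of (seed sequence, member names)
    for name, seq in seqs.items():
        for seed, members in otus:
            if identity(seed, seq) >= threshold:
                members.append(name)
                break
        else:
            otus.append((seq, [name]))
    return [members for _, members in otus]


# Made-up toy reads: a and b differ at one position out of 20.
toy = {
    "a": "ACGTACGTACGTACGTACGT",
    "b": "ACGTACGTACGTACGTACGA",
    "c": "TTTTTTTTTTGGGGGGGGGG",
}
print(greedy_otus(toy, 0.90))  # looser cutoff: a and b fall into one OTU
print(greedy_otus(toy, 0.99))  # stricter cutoff: every read is its own OTU
```

Tightening the cutoff moves the grouping from broader levels toward strain-level resolution, at the price of many more clusters. The result of this greedy scheme also depends on input order.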
Microbial diversity in environmental samples
Observation: clusters of very closely related sequences at the tips of phylogenetic trees, separated by relatively long branches.
Why? We need the evolutionary and ecological mechanisms that explain this pattern.
Functional analysis of complex microbial communities (EGTs)
From Turnbaugh et al. 2009. A core gut microbiome in obese and lean twins. Nature
Relative abundance of major phyla and relative abundance of categories of function.
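Relative abundance is simply each category's share of the assigned reads. The sketch below assumes each read has already been assigned a label (a phylum or a functional category); the function name is hypothetical and the counts in the example are made up for illustration, not data from the cited study.

```python
from collections import Counter


def relative_abundance(assignments):
    """Map each label to its fraction of all assigned reads."""
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Invented example: ten reads assigned to three phyla.
reads = ["Firmicutes"] * 6 + ["Bacteroidetes"] * 3 + ["Actinobacteria"]
for phylum, share in relative_abundance(reads).items():
    print(f"{phylum}: {share:.0%}")
```

The same function applies unchanged to functional categories, which makes taxonomic and functional profiles of the same samples directly comparable.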
Pathways and subnetworks reflect the adaptation of microbial communities across environments and habitats.
From Gianoulis et al. 2009. PNAS
Resolving strain-level heterogeneity
Sequence divergence
Gene content difference
Multiple strain sequence types
Gene insertion
Gene rearrangement
Allen & Banfield (2005) Community genomics in microbial ecology and evolution. Nat Rev Microbiol, 3, 489-498
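Planned work
• Comparative analysis of community genomes across different habitats (or samples),
to clarify specifically how changes in key environmental factors lead to variation in community composition;
• Combining gene expression, metabolic, and other data to explore the relationships between the genomes of different communities and major ecosystem processes (such as nitrogen fixation, carbon degradation, denitrification, and ammonium removal by anaerobic microbes).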
Acknowledgements
Beijing Normal University
All members of LCMB, BNU
http://cmb.bnu.edu.cn
Comments and suggestions?