An Introduction to Bioinformatics. Cui Qinghua, Department of Medical Informatics, Peking University Health Science Center. 11-16, 2008


An Introduction to Bioinformatics

Cui Qinghua, Department of Medical Informatics, Peking University Health Science Center

11-16, 2008

Introduction of basic concepts

Bioinformatics: a definition by NIH (1995)

Bioinformatics is defined as a scientific discipline that encompasses all aspects of biological information acquisition, processing, storage, distribution, analysis and interpretation, that combines the tools and techniques of mathematics, computer science and biology with the aim of understanding the biological significance of a variety of data.

Bioinformatics: the term

• Bio-informatics

• Computational biology

• Biological computing

Data

★ Large-scale and high-throughput

★ High-dimensional

★ Non-linear

★ Noisy

★ Unequally distributed

Bioinformatics: what is most important?

• Algorithms?

• Data?

• Questions!

Bioinformatics: misconceptions

Biology: computational, theoretical, experimental

• Can it do everything?
• Biology / informatics

Sequences & Structures

Alignment

• blastall
– blastp
– blastn
– blastx
– tblastn
– tblastx

• ClustalX

E < 10^-20
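A cutoff such as E < 10^-20 can be applied to tabular BLAST output after the search. The sketch below assumes the standard 12-column tabular format (`-m 8` in legacy blastall, `-outfmt 6` in BLAST+), in which column 11 holds the E-value. The function name `filter_hits` and the sample hits are made up for illustration.

```python
def filter_hits(lines, max_evalue=1e-20):
    """Keep tabular BLAST hits whose E-value is below max_evalue."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])  # column 11 of the 12-column format
        if evalue < max_evalue:
            kept.append((fields[0], fields[1], evalue))
    return kept


# Hypothetical hits: query, subject, %identity, length, mismatches, gap opens,
# q.start, q.end, s.start, s.end, E-value, bit score
sample = [
    "q1\tsA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-105\t380",
    "q1\tsB\t45.0\t80\t44\t2\t10\t89\t5\t84\t2e-05\t48.1",
    "q2\tsC\t72.3\t150\t41\t1\t1\t150\t3\t152\t3e-21\t101",
]
hits = filter_hits(sample)
print(hits)
```

Only the hits against sA and sC pass the cutoff here; the weak hit against sB is dropped.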

Evolution: constructing phylogenetic trees

• PHYLIP
• ClustalW
• PAML
• MEGA (Kumar et al., Briefings in Bioinformatics 2004)

Selection

• Coding region: Ka, Ks (dN, dS), Ka/Ks (dN/dS)
– PAML
– KaKs_Calculator
– K-Estimator
– MEGA
– Database: UCSC or ENSEMBL

• Non-coding region
– Ralph Haygood (Nature Genetics 2007)

• Recent populations
– LRH test (Sabeti et al., Nature 2002)
– iHS test (Voight et al., PLoS Biology 2006)
– XP-EHH (Sabeti et al., Nature 2007)
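For coding regions, the Ka/Ks (dN/dS) ratio is usually read as follows: a ratio above 1 suggests positive selection, a ratio near 1 suggests neutral evolution, and a ratio below 1 suggests purifying selection. The sketch below only interprets Ka and Ks values that a tool such as PAML has already estimated; the function name and the tolerance around 1 are arbitrary choices for illustration, not part of any of the listed tools.

```python
def classify_selection(ka, ks, tol=0.05):
    """Interpret a Ka/Ks ratio; tol is an arbitrary band around 1."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    omega = ka / ks
    if omega > 1 + tol:
        label = "positive selection"
    elif omega < 1 - tol:
        label = "purifying selection"
    else:
        label = "approximately neutral"
    return omega, label


# Hypothetical estimates for three genes
for ka, ks in [(0.02, 0.10), (0.30, 0.10), (0.10, 0.10)]:
    print(ka, ks, classify_selection(ka, ks))
```

In practice a ratio is tested for significance rather than compared with a fixed band, so this is only a reading aid.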

Evolution: an application

• Recent positive selection
– SLC24A5, SLC45A2: skin pigment, European population
– LARGE, DMD: Lassa fever virus, African population
– EDAR, EDA2R: development of hair, teeth and exocrine glands, Asian population (Sabeti, Nature 2007).

Alternative Splicing (AS)

• Predicted from ESTs
• Predicted from cDNA clones
• Prediction of tissue-specific AS
• Splicing graphs and EST assembly problem

Functional Domain

• TF binding sites
– TRANSFAC: a TF binding site database
– TESS: a web-based program

• Exons, introns, 5’UTR, 3’UTR
– UCSC

• Promoter
– CorePromoter

• Motif
– Weeder

• RNA family
– Rfam

• Protein domain
– Pfam: database
– InterPro: database
– HMMER: a program based on HMM

Finding genes

Sequence mutations

Huang et al., Science 2007

Gymnopoulos et al., PNAS 2007

• Tool: SIFT & Sapred
• Conservation score?
• Near functional sites?
• Similarity score?
• Surface?
• ………

PIK3CA

Modeling structures

• RNAfold
• RNAstructure

Modeling structures

• Homology modeling
– ESyPred3D
– Swiss Model

• Ab initio prediction
– Rosetta

• Single mutation modeling
– Modeller

• Visualization
– PyMOL

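Optimization algorithms

Objective: max (or min) Y = f(x)
Constraint: x >= 0
Solution: find x = ?

Objective, constraint, solution

Deterministic optimization algorithms vs. intelligent optimization

Genetic algorithms, simulated annealing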
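As a minimal sketch of simulated annealing, one of the heuristic methods named above, the code below minimizes a function f(x) under the constraint x >= 0. The objective (x - 2)^2, the step size, the cooling schedule and the seed are all arbitrary assumptions chosen so the example is small and reproducible.

```python
import math
import random


def simulated_annealing(f, x0, steps=2000, t0=1.0, cooling=0.99, seed=0):
    """Minimize f(x) subject to x >= 0 with a basic simulated annealing loop."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    t = t0
    for _ in range(steps):
        # Propose a nearby point and clip it to enforce x >= 0
        candidate = max(0.0, x + rng.uniform(-0.5, 0.5))
        fc = f(candidate)
        delta = fc - fx
        # Always accept improvements; accept worse moves with prob. exp(-delta/t)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x, fx = candidate, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling  # lower the temperature
    return best_x, best_f


best_x, best_f = simulated_annealing(lambda x: (x - 2.0) ** 2, x0=5.0)
print(best_x, best_f)
```

Early on, the high temperature lets the search accept worse points and escape local minima; as the temperature falls, it behaves more and more like a greedy search.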

DNA microarray data analysis

Overall microarray workflow (taken from Schena & Davis):

Biological Question → Sample Preparation → Microarray Reaction → Microarray Detection → Data Analysis & Modelling

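Microarray data matrix M: rows are genes g1, g2, ..., gi, ..., gN and columns are samples s1, s2, s3, ..., sj, ..., sM. Row Gi is the gene profile of gene i, column Aj is the array profile of sample j, and Mi,j is the entry for gene i in sample j.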

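Data preprocessing
• Missing data
– Causes
  • Image contamination
  • Insufficient image resolution
  • Dust or scratches on the slide
– Ways to handle missing data
  • Discard the data point (which also throws away useful information!)
  • Repeat the experiment (too expensive!)
  • Replace it with some value, such as the sample mean
  • K-nearest neighbors estimation
  • Singular value decomposition (SVD) estimation

• Normalization
– Log transformation
– Linear regression
– Scaling + shifting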
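Two of the preprocessing steps listed above, K-nearest neighbors estimation of missing values and the log transformation, can be sketched in plain Python. This is a simplified illustration under stated assumptions: missing entries are marked with `None`, the distance between genes is a root-mean-square difference over the samples both genes have, and the intensity values are invented.

```python
import math


def knn_impute(matrix, k=2):
    """Fill None entries with the mean of the k nearest genes (rows)."""
    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            neighbours = []
            for other_i, other in enumerate(matrix):
                if other_i == i or other[j] is None:
                    continue
                # Compare the two genes on samples observed in both rows
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                neighbours.append((dist, other[j]))
            neighbours.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in neighbours[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


def log2_transform(matrix):
    """Log-transform positive intensities."""
    return [[math.log2(v) for v in row] for row in matrix]


# Hypothetical intensities: 4 genes (rows) x 3 samples (columns)
raw = [
    [100.0, 200.0, None],
    [110.0, 210.0, 400.0],
    [90.0, 190.0, 380.0],
    [1000.0, 50.0, 10.0],
]
imputed = knn_impute(raw, k=2)
print(imputed[0])
print(log2_transform(imputed)[0])
```

The missing value in the first gene is filled from the two genes with the most similar profiles, not from the dissimilar fourth gene, which is the advantage over using a plain sample mean.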

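Pattern classification of microarray data

Preprocessing → feature extraction → machine learning → decision

Training samples → classifier; new sample → classifier → decision

[Figure: a classifier F maps an input X to an output Y = F(X)]

[Figure: two groups G1 and G2 in the (x1, x2) plane, separated by the line L: c1x1 + c2x2 - c = 0]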
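The separating line L: c1x1 + c2x2 - c = 0 can be learned from labeled training samples. The sketch below uses the classic perceptron rule as one possible way to find c1, c2 and c; the slides do not say which training method is meant, and the sample points are invented and chosen to be linearly separable.

```python
def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Learn c1, c2, c so that sign(c1*x1 + c2*x2 - c) matches labels (+1/-1)."""
    c1 = c2 = c = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            score = c1 * x1 + c2 * x2 - c
            if y * score <= 0:  # misclassified (or on the line): update
                c1 += lr * y * x1
                c2 += lr * y * x2
                c -= lr * y
                errors += 1
        if errors == 0:
            break  # every training sample is on the correct side
    return c1, c2, c


def classify(point, c1, c2, c):
    """Return 'G1' on the positive side of L, 'G2' otherwise."""
    x1, x2 = point
    return "G1" if c1 * x1 + c2 * x2 - c > 0 else "G2"


# Hypothetical, linearly separable training samples (+1 = G1, -1 = G2)
points = [(2.0, 3.0), (3.0, 3.5), (2.5, 4.0), (0.5, 0.5), (1.0, 0.2), (0.2, 1.0)]
labels = [1, 1, 1, -1, -1, -1]
c1, c2, c = train_perceptron(points, labels)
print(classify((3.0, 3.0), c1, c2, c), classify((0.3, 0.3), c1, c2, c))
```

The two calls at the end play the role of new samples: once the line is learned, a decision needs only the sign of c1x1 + c2x2 - c.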

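Pattern classification algorithms
• Linear classifiers
• Neural networks
• Nearest neighbor
• Bayesian classifiers
• Hidden Markov model classifiers
• Decision trees
• Support vector machines

Pattern clustering of microarray data
• Hierarchical clustering
• K-means clustering
• Fuzzy C-means clustering
• Self-organizing maps
• Replicator dynamics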

(Cui, 2004)
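K-means, one of the clustering methods listed above, alternates between assigning each profile to its nearest centroid and moving each centroid to the mean of its members. The following is a minimal sketch for two-dimensional points; seeding the centroids with the first k points is a simplification made here so the result is deterministic, and the data are invented.

```python
def kmeans(points, k, iterations=100):
    """Plain K-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return assignment, centroids


# Hypothetical expression of 6 genes in two samples
genes = [(1.0, 1.1), (5.0, 5.2), (0.9, 1.0), (5.1, 4.9), (1.2, 0.8), (4.8, 5.0)]
assignment, centroids = kmeans(genes, k=2)
print(assignment)
print(centroids)
```

Real analyses usually start from several random initializations, because K-means can converge to different solutions depending on the seeds.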

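Gene expression feature extraction
• Features that distinguish men from women
– Hair length?
– Skin smoothness?
– Voice?
– Height?
– Strength?
– Clothing?
– Posture?
– XX/XY

• Differentially expressed genes
• Gene set or pathway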

• PCA

• SVD

• ISOMAP

• MDS

Characterizing gene relationships
• Static relationship

– Pearson’s correlation

– Spearman’s correlation

– Mutual information

– Other similarity metric

• Dynamic relationship
– Dynamic regression (Cui, 2005)

– Window based correlation
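The correlation measures listed above can be sketched in plain Python. Pearson's correlation measures linear association, Spearman's correlation is Pearson's correlation computed on ranks, and a window-based correlation recomputes Pearson's correlation over a sliding window. The function names, the window width and the two example profiles are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            result[idx] = avg_rank
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman's correlation is Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def windowed_pearson(x, y, width):
    """Pearson's correlation in each sliding window of the given width."""
    return [pearson(x[i:i + width], y[i:i + width])
            for i in range(len(x) - width + 1)]


# Hypothetical profiles: gene_b rises monotonically but not linearly with gene_a
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
r_pearson = pearson(gene_a, gene_b)
r_spearman = spearman(gene_a, gene_b)
r_windows = windowed_pearson(gene_a, gene_b, width=3)
print(r_pearson, r_spearman, r_windows)
```

Because the relationship is monotone but curved, Spearman's correlation is essentially 1 while Pearson's is lower. Note that `pearson` divides by zero for a constant profile, so constant windows would need to be handled separately.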

Gene expression networks
• Pearson’s correlation

– Hard threshold

– Weighted

• Mutual information

• Bayesian network
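A correlation-based gene expression network with a hard threshold can be sketched as follows: two genes are connected when the absolute value of their Pearson's correlation reaches a cutoff. The cutoff of 0.8, the gene names and the profiles are invented. For the weighted variant, raising |r| to a power is one common choice of edge weight and is shown here only as an assumption about what "weighted" may refer to.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpression_network(profiles, threshold=0.8):
    """Hard threshold: connect two genes when |r| >= threshold."""
    edges = {}
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if abs(r) >= threshold:
            edges[(g1, g2)] = r
    return edges


def soft_weight(r, beta=6):
    """Weighted alternative (assumed form): edge weight |r| ** beta."""
    return abs(r) ** beta


# Hypothetical expression profiles over four samples
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 3.0],
}
edges = coexpression_network(profiles)
print(sorted(edges))
```

Here g1, g2 and g3 end up fully connected, including the strongly negative g1-g3 pair, while g4 stays isolated because its correlations fall below the cutoff.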

Computational Systems Biology

What is Systems Biology?

• Not a new concept!
• Systems biology is an emergent field that aims at system-level understanding of biological systems (Kitano 2002).

• To understand biology at the system level, we must examine the structure and dynamics of cellular and organismal function, rather than the characteristics of isolated parts of a cell or organism.

Why Systems Biology?

http://www.newvisions.ucsb.edu/background/images/elephant.gif

[Figure: nodes A, B, C, D, E with labels ++, +, _, 0]

Why Computational Systems Biology?

• Golden opportunity, now!

★ More than 16 international meetings in 2006

★ More than 10 books in the past two years

★ Journals: Molecular Systems Biology (Nature & EMBO), BMC Systems Biology, IET Systems Biology, EURASIP Journal on Bioinformatics and Systems Biology, etc.

Large-scale, high-throughput data

Fields of Computational Systems Biology?

• Biological networks construction, such as gene regulatory networks, cellular signaling networks, metabolic networks, protein-protein interaction networks, genetic interaction networks, gene co-expression networks, literature networks.

Fields of Computational Systems Biology?

• Properties of systems, such as topology, robustness, tolerance.

Albert et al., Nature 2000

Fields of Computational Systems Biology?

Goh et al., PNAS 2007

p53 region, TGFβ region, Ras region

Cui et al., MSB 2007

• Biological questions at the systems level, such as diseases, evolution and medicine.

An application: microRNA-disease systems biology

[Figure: diagram linking diseases D1, D2, D3 and microRNAs M1, M2, M3, M4]

Human microRNA disease network

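My suggestions, and problems I need your help with

My Suggestions

First, read through the relevant references once and record the relevant data.

Second, go through these slides once, or consult a bioinformatics professional, to see whether there are problems that bioinformatics alone can solve.

Third, check whether the data in the literature you read could themselves be analyzed with bioinformatics, for example by meta-analysis or systems biology.

Fourth, new knowledge, bioinformatics included, is not hard; you will appreciate this once you have completed a project yourself!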

Work for which we need experimental validation
• The functions of mir-423 and mir-608, which are under recent positive selection

– SLC24A5, SLC45A2, skin pigment, Europe population

– LARGE, DMD, Lassa fever virus, Africa population

– EDAR, EDA2R, the development of hair, teeth and exocrine glands, Asia population (Sabeti, Nature 2007).

• Experimental validation of a potential liver-disease related microRNA: miR-149
– SNP: rs2292832; CEU and YRI: 80% C, 20% U; CHB and JPT: 20% C, 80% U.

– Host gene is GPC1 (Glypican 1, a heparan sulfate proteoglycan), which is overexpressed in pancreatic cancer; another member of this gene family (GPC3) is a liver cancer marker.

– GPC1 is a receptor for heparin-binding growth factors
– Not expressed in liver / expressed in liver

– Targets HEV and HGV

– Free energy: C: -54.9; U: -52.7

Work for which we need experimental validation


• Cardiovascular
– miR-1
– miR-133
– miR-199a
– miR-21
– miR-23a
– miR-23b
– miR-208

• Liver (miR-122)
• Kidney
• Brain
• Lung
• ………

Thank you all. Comments and guidance are welcome.

Cui Qinghua: 15801250611, 82801585
Email: cuiqinghua@bjmu.edu.cn
