from laboratory to hospital – the new challenge of bioinformatics researchers 唐傳義...

From Laboratory to Hospital – The New Challenge of Bioinformatics Researchers

唐傳義 (Chuan Yi Tang), Department of Computer Science, National Tsing Hua University


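Partner hospitals and related projects
• Cancer: Linkou Chang Gung oral cancer team; Linkou Chang Gung gynecologic cancer team; ITRI Biomedical Center (National Cheng Kung University Hospital esophageal cancer team, Chiayi Christian Hospital team)
• Infectious diseases: Chang Gung virology center (influenza virus, enterovirus); Hsinchu General Hospital and Zhudong Veterans Hospital (Acinetobacter baumannii); NTHU Department of Life Science (Helicobacter pylori, Candida albicans)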

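Collaborations are built through talks, co-taught courses, and jointly executed projects

Starting with the story of "culling hens with low egg-laying capacity"

• Collect four protein samples from Taiwan country chickens at different developmental stages and, based on laboratory analysis, identify the protein samples and their concentrations

• Record the number of eggs laid by each Taiwan country chicken

• Based on protein concentrations and egg production rate, design a screening method to raise the egg production rate of Taiwan country chickens

• A method has been found that can predict at 14 weeks of age whether a hen will be a low producer; a patent application is pending

• Collaborators: Animal Technology Institute and chicken farms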

Egg production rate of TRFCC (n=157). (A) Total egg number of all hens, (B) hens in four groups (Group I, Group II, Group III, Group IV)

[Figure: x-axis, week of age (25 to 48); y-axis, egg production rate (%)]

Lack of association of relative protein levels with total egg numbers

[Figure: relative protein level (0 to 6) plotted against total egg number (30 to 130)]
(A) Vitellogenin: 24w (r=0.23, p<0.01); 35w (r=0.53, p<0.01)
(B) Apo A-I: 24w (r=0.14); 35w (r=-0.52, p<0.01)

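Rank-combination screening method (a geometric nearest-neighbor problem)

• Convert the 4 protein concentrations of two batches of hens into ranks. From a batch of known low-producing hens, derive their rank-combination codes; then use the rank transformation to search the other batch for hens with similar rank-combination codes and predict them to be low producers.

• Inspecting the serum-protein rank-combination codes at 14 weeks (before laying begins) already reveals a strong regularity with future low producers.

• The rank-combination screening method can cull 19.5% of the hens at 14 wk, and these include 78.8% of the "50% low-producing" hens.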
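The rank-combination idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patented method: the hen names and concentrations are made up, ranks are divided by batch size so that batches of different sizes are comparable, and "similar" is taken to mean a small maximum rank difference (a hypothetical `radius` parameter).

```python
from typing import Dict, List, Sequence, Tuple

def to_ranks(values: Sequence[float]) -> List[int]:
    """Rank values within one batch: 1 = lowest (ties broken by input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def rank_codes(batch: Dict[str, List[float]]) -> Dict[str, Tuple[float, ...]]:
    """Map each hen to a tuple of rank fractions, one per protein (assumed encoding)."""
    hens = list(batch)
    n_proteins = len(batch[hens[0]])
    columns = [to_ranks([batch[h][p] for h in hens]) for p in range(n_proteins)]
    n = len(hens)
    return {h: tuple(columns[p][i] / n for p in range(n_proteins))
            for i, h in enumerate(hens)}

def predict_low_producers(known_codes: List[Tuple[float, ...]],
                          candidate_codes: Dict[str, Tuple[float, ...]],
                          radius: float) -> List[str]:
    """Flag candidates whose code lies within `radius` of any known low producer."""
    flagged = []
    for hen, code in candidate_codes.items():
        nearest = min(max(abs(a - b) for a, b in zip(code, ref)) for ref in known_codes)
        if nearest <= radius:
            flagged.append(hen)
    return flagged
```

Because only ranks are compared, the two batches do not need to share a measurement scale, which is presumably why ranks rather than raw concentrations are used.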

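Clinical Medical Information Mining and Translational Medicine

Instructors: Prof. 唐傳義; Assistant Prof. 謝文萍, Institute of Statistics; Director 閻紫宸, Department of Nuclear Medicine; Director 賴瓊慧, Gynecologic Cancer Research Center; Director 陸清松, Neuroscience Research Center; Director 李宗海, Stroke Center

Class time: S2S3

Course description: From the integrated perspective of clinical medicine, medical statistics, and bioinformatics, this course explores how to carry out clinical medical information mining and translational medicine research in the post-genomic era. It focuses on combining clinical electronic medical record data, clinical medical imaging data, and gene database analysis for clinical oncology and hereditary diseases. The teaching team brings together clinicians and professors of information science and biostatistics. Through interdisciplinary hands-on projects, the course introduces information techniques from biostatistics and systems biology to add value to precious clinical data and to support more effective medical decision-making models.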

Cdc2 cutoff 1; NDRG1 cutoff 1; EF1A cutoff 2

Biomarker Visualization: Functional Interaction Linkage Map

[Figure: data sources integrated into the map, with record counts 11,612; 6,434; 353,043; 11,402; 34,435; 169,093; 59,852; 594,111. Labels include SNP and dbEST.]

http://www.ncbi.nlm.nih.gov/

http://www.genome.jp/kegg/

http://www.hprd.org/index_html

http://www.ebi.ac.uk/intact

http://dip.doe-mbi.ucla.edu/dip/Main.cgi

KIT--- C-KIT,

FLT1--- VEGFR-1,

KDR --- VEGFR-2

ERBB2 --- HER-2/neu

EGFR --- EGFR

ESR1 --- ER

PGR --- PR

The ratio of common neighbors

         KIT    FLT1   KDR    ERBB2  EGFR   ESR1   PGR
KIT      1      0.107  0.142  0.137  0.084  0.022  0.01
FLT1            1      0.268  0.102  0.055  0.016  0
KDR                    1      0.102  0.069  0.027  0.024
ERBB2                         1      0.231  0.043  0.026
EGFR                                 1      0.057  0.026
ESR1                                        1      0.088
PGR                                                1
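The slides do not say how the ratio of common neighbors is defined. One common choice, assumed here purely for illustration, is the Jaccard ratio of the two genes' interaction-partner sets; the partner names in the usage note are hypothetical.

```python
from typing import Dict, Set

def common_neighbor_ratio(network: Dict[str, Set[str]], a: str, b: str) -> float:
    """Shared interaction partners of a and b divided by all their partners.

    Assumption: Jaccard ratio |N(a) & N(b)| / |N(a) | N(b)|, with a and b
    themselves excluded from each other's neighbor sets.
    """
    neighbors_a = network.get(a, set()) - {b}
    neighbors_b = network.get(b, set()) - {a}
    union = neighbors_a | neighbors_b
    if not union:
        return 0.0
    return len(neighbors_a & neighbors_b) / len(union)
```

For example, if KIT interacts with {P1, P2, P3} and KDR with {P2, P3, P4}, the ratio under this definition is 2/4 = 0.5.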

Complementarity and trust

Chang Gung–Tsing Hua (長清) project: Oral Cancer, with 閻紫宸
• Imaging and Clinical Data
• Psycho-social Study and Supportive Care
• Informatics
• Epidemiology and Translational Study
• Systems Biology Approach: 4 M’s Paradigm
• Tsing Hua team; Chang Gung team

GRP78 knockdown inhibits cell invasion

[Figure: number of invaded cells for Vector, Scramble, and siRNA-1 in (A) FADU and (B) Detroit cells. P-values shown: **P=0.0013 (first panel), P=0.016 (second panel).]

Systematic analyses flow:
• Biomarkers; expression profiles
• Significantly changed genes (up- or down-regulation)
• Clustering; functional module finding (map to FILM)
• Pathways finding (map to KEGG)
• Pathway cross linking (map to FILM)
• Disease network; drug-target network
• Survival analyses & data mining
• Pathway prediction: missing link, missing gene

hsa2099

hsa6256

hsa7153

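• Residual specimens
• Other biomarker measurements
• Serum immunological biomarker measurements
• Medical record data
• Gene chip (microarray) measurements
• Chang Gung head and neck cancer research team; Chang Gung gynecologic cancer research team; Chang Gung Cancer Clinical and Research Center
• Chang Gung University, Department of Biochemistry and Biomedical Engineering
• Chang Gung University, Department of Medical Biotechnology and Laboratory Science
• Chang Gung, Department of Clinical Pathology
• National Tsing Hua University, Department of Computer Science: data mining, survival analysis, systems biology analysis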

Sample Classification

There are 112 patients in total.


Data provided by CGMH:

Analysis Plan

Association Differentially expressed genes

Genetic components of expression

[Diagram labels: DNA loci (L1, L2, …, Ln; R1, R2, …, Rn); genes g1, g2, …, gn; Traits I and Traits II (W1, W2, …, Wn); causal genes; reactive genes]

Biomarker Finding Plan
• Platforms: SNP 6.0, Exon array, Tissue array
• Analyses: association study, LOH analysis, copy number analysis, QTL analysis, biological network analysis
• Outputs: biomarkers at DNA level, biomarkers at DNA/mRNA level, biomarkers at protein level
• Diagnosis chip design

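Clinical records and biomedical databases
• FILM database (gene functional network)
• Disease association and drug-target interaction network database
• Biomedical databases

Clinical records and biomedical data analysis platform
• Gene Expression analysis module
• Survival analysis module
• SNP Array analysis module

Analysis platform
• Text Mining module
• Pathway Analysis module
• Biomarker and Drug Target Prediction module
• Gene Module Analysis module

Clinical knowledge mining platform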

New Module: Next-generation Sequencing Data Analysis

• Roche 454 GS-FLX System

• ABI SOLiD sequencing system

• Illumina Solexa 1G Genome Analyzer

Reads: short (35~100 bp), low error rate, high coverage, and low cost.

Genome

Reads

Re-sequencing Problem Definition

• We are given a text T = t1 t2 … tn, a set of patterns P1, P2, …, PN, and a constant k. We are asked to find all occurrences of each Pj in T with at most k errors (Hamming distance).
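The problem statement can be made concrete with a brute-force baseline. This is a minimal sketch for illustration only: it scans every window of the text and is far too slow for real genomes, which is why the indexed aligners listed next exist. The function name is hypothetical.

```python
from typing import Dict, List

def find_k_mismatch(text: str, patterns: List[str], k: int) -> Dict[str, List[int]]:
    """For each pattern, list the 0-based start positions in text where the
    pattern matches with at most k substitutions (Hamming distance <= k)."""
    hits: Dict[str, List[int]] = {}
    for pattern in patterns:
        m = len(pattern)
        positions = []
        for start in range(len(text) - m + 1):
            mismatches = 0
            for x, y in zip(pattern, text[start:start + m]):
                if x != y:
                    mismatches += 1
                    if mismatches > k:
                        break  # this window already exceeds the error bound
            if mismatches <= k:
                positions.append(start)
        hits[pattern] = positions
    return hits
```

The running time is proportional to the text length times the total pattern length, so with millions of reads an index over the genome or over the reads is needed.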

Algorithms

• Indexing Genome with Hash Tables: SOAP …

• Indexing Reads with Hash Tables: MAQ, ZOOM, SeqMap and RMAP, …

• Indexing Genome with Suffix Array/BWT: Bowtie, …

Indexing Genome with Suffix Array/BWT

• The Bowtie algorithm is the fastest of these.

This is probably the fastest short read aligner to date.

Length   Program   CPU time
36 bp    Bowtie    6 m 15 s
36 bp    Maq       3 h 52 m 54 s
36 bp    SOAP      16 h 44 m 3 s
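To show why a BWT index makes lookups fast, here is a toy FM-index with backward search. It is a sketch under simplifying assumptions, not Bowtie's implementation: the suffix array is built by naive sorting, the occurrence table is stored uncompressed, and only exact matches are handled (Bowtie layers backtracking on top of this kind of search to allow mismatches). All names are illustrative.

```python
from typing import Dict, List, Tuple

def build_fm_index(genome: str) -> Tuple[List[int], Dict[str, int], Dict[str, List[int]]]:
    """Build a suffix array, first-column offsets, and occurrence counts."""
    text = genome + "$"  # "$" sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to "$"
    counts: Dict[str, int] = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    first: Dict[str, int] = {}  # number of characters smaller than ch
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}  # occ[c][i] = count of c in bwt[:i]
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, first, occ

def exact_match(pattern: str, sa: List[int], first: Dict[str, int],
                occ: Dict[str, List[int]]) -> List[int]:
    """Backward search: narrow the suffix-array interval one character at a time."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])
```

Each pattern character costs a constant number of table lookups, so the search time depends on the read length rather than on the genome length.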

As Quick as Bowtie and with Ability of Alignment Distance

• If there are many indels (insertions or deletions) when aligning sequencing data to the reference sequence, the results of alignment with Hamming distance are not acceptable (<40% of reads mapped).

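Delete (the reads lack one base of the reference):
Reference: …ACGGATAGCTAGCTAGCATCAGGGCAGATCA…
Reads:     TAGCTAGCTGCATCAGGG
           GCTAGCTGCATCAGGGCA
           AGCTGCATCAGGGCAGAT

Insert (+G: the reads carry one extra base):
Reference: …ACGGATAGCTAGCTCATCAGGGCAGATCA…
Reads:     TAGCTAGCTGCATCAGGG
           GCTAGCTGCATCAGGGCA
           AGCTGCATCAGGGCAGAT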
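The effect of a single indel on the two distance measures can be checked directly. The sketch below uses a read and reference fragment from the deletion example; the helper names are illustrative, and the edit distance is the textbook dynamic program rather than any particular aligner's code.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # match or substitute
        prev = cur
    return prev[-1]

read = "TAGCTAGCTGCATCAGGG"        # read missing one A
window = "TAGCTAGCTAGCATCAGGG"     # matching stretch of the reference

print(hamming(read, window[:len(read)]))  # 7: every base after the indel is shifted
print(edit_distance(read, window))        # 1: a single deleted base
```

One missing base shifts everything downstream, so a mapper with a small Hamming error bound rejects a read that is only one edit away from the reference.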

WorkFlow

Filtering:
• NGS data
• Mapping algorithm (Bowtie) (trim 17 bps and set error bound = 2)
• Mark mapped regions (depth = 5)

Progressive Mapping:
• Unmapped reads
• Trim 3 bps
• Map reads to marked regions with Hamming distance = 1
• Score and modify the ref-sequence
• SNP found? (Yes / No)

hsa-miR-99a

hsa-miR-99a target

IGF1R(3480)

SMARCD1(6602)

WISP2(8839)

HADHB(3032)

Glucocorticoid receptor regulatory network(NCI)

Beta oxidation of palmitoyl-CoA to myristoyl-CoA(Reactome) mitochondrial fatty acid beta-oxidation of unsaturated fatty acids(Reactome) Fatty acid elongation in mitochondria(KEGG)

IGF1 pathway(NCI) E-cadherin signaling events(NCI) Plasma membrane estrogen receptor signaling(NCI) Integrins in angiogenesis(NCI)

5959

7878

22

1313

Green: miRNA target; Red: mRNA target; Purple: miRNA + mRNA target

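Questions infectious disease physicians often ask
• Sequencing new local drug-resistant strains (NGS assembly)

• Understanding the resistance mechanisms of drug-resistant strains (NGS re-sequencing, SNP finding)

• Using past data to find host specificity and virulence
• Designing new drugs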

HP Experiments (Edit distance)

Reference sequence size ≒ 1.6M
Read number: 3,074,139
Read size = 76 bp
error < 6

Reference           Ref 1  Ref 2  Ref 3  Ref 4  Ref 5  Ref 6  Ref 7
% of reads mapped   59%    59%    55%    57%    50%    57%    67%

Genome Sequence Modification

Ref seq:          …   A   C   C   G   A   T   C  …
A score              25   0   1   0   1   0   1
C score               1  30  35   0  40   1  32
G score               0   1   0  40   0   0   0
T score               1   0   0   0   0   1   0
Deletion score        0   0   0   0   0  30   0
Insertion score: +A 31

Modified sequence: …ACACGCTC…

WorkFlow
• Short read data
• Map reads to the ref-sequence with small edit distance
• Score and output the modified ref-sequence
• Indel found? No: End. Yes: map reads to the modified ref-sequence with small edit distance

error < 6

Run      % of reads mapped   # of deletions   # of insertions   Genome coverage
Run 1    61.6%               497              435               86.6%
Run 2    71.2%               235              234               88.9%
Run 3    72.7%               132              157               89.6%
Run 4    73.2%               78               82                90%
Run 5    73.5%               37               74                90.2%
…
Run 16   73.9%               13               9                 90.4%

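Columns: pI; substrate specificity; antiviral activity; antibacterial activity for prokaryote; neurotoxicity

• RNase 1 (human pancreatic RNase) (HPR): pI 9.1; antiviral: No; antibacterial: No; neurotoxicity: No
• RNase 2 (eosinophil-derived neurotoxin) (EDN): pI 9.2; substrate: viral genomic DNA; antiviral: Yes (antiviral activity against RSV and HIV studied in vitro); antibacterial: Yes (E. coli); neurotoxicity: Yes (requires active ribonuclease activity)
• RNase 3 (eosinophil cationic protein) (ECP) (potent anti-parasitic agent): pI 10.8; antiviral: Yes (against RSV); antibacterial: Yes (E. coli); neurotoxicity: Yes (much lower than EDN)
• RNase 4: pI 9.3; substrate: uridine-preferring; antiviral: No; antibacterial: No; neurotoxicity: No
• RNase 5 (angiogenin) (ANG): pI 9.73; substrate: tRNA-specific, cytidine-preferring; antiviral: No; antibacterial: No; neurotoxicity: No
• RNase 6 (k6): pI 9.49; remaining entries: No, No
• RNase 7: pI 10.5; substrate: unknown; remaining entries: No, Yes
• RNase 8: pI 8.6; substrate: unknown; remaining entries: No, No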

Specialized software needed for finding function: RNase proteins as an example

Multiple Sequence Alignment

• Given a set of sequences, the MSA problem is to find an alignment of the sequences such that some objective function is minimized

• e.g. the sum-of-pairs score

S1: ATTCG
S2: AGTCG
S3: ATCAG

S’1: A T – T C – G
S’2: A – G T C – G
S’3: A T – – C A G

Cost = 8
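The sum-of-pairs cost of the alignment above can be recomputed with a few lines. The scoring scheme is an assumption, since the slide does not state it: 0 for identical symbols (including gap against gap) and 1 for any mismatch or residue-against-gap pair. Under that assumption the alignment scores 8, in agreement with the stated cost.

```python
from itertools import combinations
from typing import Sequence

def pair_cost(x: str, y: str) -> int:
    """Assumed unit costs: 0 if the symbols are equal (gap/gap included), else 1."""
    return 0 if x == y else 1

def sum_of_pairs(alignment: Sequence[str]) -> int:
    """Sum pair_cost over every pair of rows in every column of the alignment."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return sum(pair_cost(x, y)
               for column in zip(*alignment)
               for x, y in combinations(column, 2))

aligned = ["AT-TC-G",   # S'1
           "A-GTC-G",   # S'2
           "AT--CAG"]   # S'3
print(sum_of_pairs(aligned))  # 8
```

Finding the alignment that minimizes this score is the hard part; the function above only evaluates a given alignment.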

MSA of 7 RNases (by Workbench3.2)

• 1ONC: Pancreatic Ribonuclease; source: Rana pipiens; native; pI 8.96; 4 disulfide bonds; substrate/ligand: SO4; X-ray; J Mol Biol 236, pp. 1141 (1994)
• 1RCN: Ribonuclease A; source: bovine (Bos taurus) pancreas; native; pI 8.64; 4 disulfide bonds; substrate/ligand: DNA (5'-D(Aptpapap)-3'); X-ray; J Biol Chem 269, pp. 21526 (1994)
• 1DE3: Ribonuclease α-Sarcin; source: Aspergillus giganteus; recombinant; pI 9.17; 2 disulfide bonds; NMR; J Mol Biol 299, pp. 1061 (2000)
• 1BC4: Ribonuclease; source: Rana catesbeiana; native; pI 9.20; 4 disulfide bonds; NMR; J Mol Biol 283, pp. 231 (1998)

Multiple Sequence Alignment with Constraints

MuSiC (Bioinformatics, 2004)

MuSiC-ME (Memory Efficient, Bioinformatics, 2005)

RE-MuSiC (Regular Expression, NAR, 2006)

Multiple Sequence Alignment with Constraint

• Input: (1) multiple Protein/DNA/RNA sequences and (2) several constraints (represented by regular expressions), with each consisting of known functionally, structurally or evolutionarily related residues/nucleotides of the input sequences.

• Output: an optimal multiple sequence alignment under the condition that the constrained amino acids/nucleotides are aligned together in the alignment.

CMSA: Constrained Multiple Sequence Alignment Problem

• Input: a set of k sequences along with an ordered set of r constraints (C1, …, Cr) and an error ratio ε with 0 ≤ ε < 1

• Output: an optimal CMSA, say A, in which r disjoint bands B1, …, Br are in A such that d(Ci, Bi(Sj)) ≤ ε · l(Ci) for all 1 ≤ i ≤ r and 1 ≤ j ≤ k.
– band: a block of consecutive columns in A
– d(x, y): the Hamming distance between x and y
– Bi(Sj): the fragment of Sj whose bases are all in Bi
– l(Ci): the length of Ci

Example of CMSA

• Input: 6 RNA sequences along with 11 constraints and error ratio ε = 0

Example of CMSA

• Input: 6 RNA sequences along with 11 constraints and error ratio ε = 0.5

Web Interface of MuSiC

http://genome.life.nctu.edu.tw/MUSIC/

Syntax of Regular Expression
• The IUPAC codes for the amino acids and nucleotides are used in the regular expression.
• “-”: separates the elements of a regular expression.
• “[]”: the amino acids (or nucleotides) that are allowed to appear at a given position.
• “{}”: the amino acids (or nucleotides) that are not accepted at a given position.
• Repetition of an element is indicated by appending, immediately following that element, an integer or a pair of integers in parentheses.

Example: G-[AG]-x(4)-{AG}-x(4,5)
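To make the syntax concrete, the sketch below translates such a constraint into a Python `re` pattern. It is not RE-MuSiC's own parser. The helper name is hypothetical, and reading `x` as "any residue" is an assumption taken from the PROSITE convention that this syntax resembles.

```python
import re

ELEMENT = re.compile(r"(\[[A-Za-z]+\]|\{[A-Za-z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?")

def constraint_to_regex(constraint: str) -> str:
    """Translate e.g. 'G-[AG]-x(4)-{AG}-x(4,5)' into a Python regular expression."""
    parts = []
    for element in constraint.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"cannot parse element: {element!r}")
        symbol, low, high = match.groups()
        if symbol in ("x", "X"):
            core = "."                            # assumed: x means any residue
        elif symbol.startswith("{"):
            core = "[^" + symbol[1:-1] + "]"      # {..}: residues not accepted
        else:
            core = symbol                         # single letter or [..] as-is
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(core + repeat)
    return "".join(parts)

pattern = constraint_to_regex("G-[AG]-x(4)-{AG}-x(4,5)")
print(pattern)  # G[AG].{4}[^AG].{4,5}
```

With the translated pattern, `re.search(pattern, sequence)` locates candidate constraint sites in a single sequence; aligning those sites across sequences is the part the constrained-alignment tool handles.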

RE-MuSiC: Multiple Sequence Alignment with Regular Expression Constraints

RE-MuSiC was published in Nucleic Acids Research (Vol. 35, pp. W639-644, 2007)

A software tool for constrained multiple sequence alignment

Too many false positives !

MSA of RNase1~RNase6
- 3 active sites (His42, Lys65, His155)
- 4 disulfide bonds (Cys50, Cys64, Cys82, Cys89, Cys98, Cys110, Cys123, Cys138)
- MSA showed that 11 residues were conserved in RNase3, RNase2 (functionally related enzyme) and RNase1~RNase4 (sequence related proteins).

Clustering of 8 RNases

• Group1: ECP (RNase3)
• Group3: EDN (RNase2), which is functionally related to ECP
• Group2: RNase1, RNase4, RNase5 and RNase6, which have no toxicity
• Group4: RNase7 and RNase8 have toxicity, but their toxicity is still unknown.

Algorithm

There are five possible cases for each aligned position:
(1) If the residues of rat imidase, of the functionally identical or related proteins (group3 or group4, respectively), and of the sequence related proteins (group2) are all different, the score is set to zero.
(2) If the residues of all sequences are the same, the score is set to 1.
(3) If the residue is common to rat imidase and the proteins of group3 or group4 but differs from that of group2, the score is set to 3.
(4) If the residue is common to imidase and the group2 proteins but differs from that of group3 or group4, the score is set to –2.
(5) If the residue is common to the sequence related proteins and the functionally related proteins but differs from that of imidase, the score is set to zero.
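The five scoring cases can be written as a small per-column function. This is a simplified sketch, not the authors' program: each group is represented by a single aligned row (for instance a consensus), whereas the real method votes over whole groups, and the example sequences are made up.

```python
from typing import List

def column_score(target: str, functional: str, sequence_related: str) -> int:
    """Score one aligned column from the target's point of view."""
    if target == functional == sequence_related:
        return 1    # case 2: the same residue everywhere
    if target == functional:
        return 3    # case 3: shared with the functional group only
    if target == sequence_related:
        return -2   # case 4: shared with the sequence related group only
    return 0        # cases 1 and 5: the target agrees with neither group

def vote(target_row: str, functional_row: str, related_row: str) -> List[int]:
    """Apply column_score along three equally long aligned rows."""
    if not (len(target_row) == len(functional_row) == len(related_row)):
        raise ValueError("rows must be aligned to the same length")
    return [column_score(t, f, s)
            for t, f, s in zip(target_row, functional_row, related_row)]

print(vote("PKAC", "PKGD", "PRAD"))  # [1, 3, -2, 0]
```

Positions with the highest scores are those where the target agrees with its functional relatives but not with proteins that are merely similar in sequence, which is the signal the voting is designed to pick up.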

Voting score (1)

Rat imidase

Aye from G3

Blackball from G2

Vscore

RNase: Comparison of MSA and our method (2)

- The first row is the amino acid sequence of ECP; the second and third rows give the total scores and their corresponding ranks, respectively.
- Green residues: the top 5 scores in our method
- Red residues: 3 active sites and 4 disulfide bonds of RNase proteins
- Pro3 was verified to be associated with ECP’s toxicity by biological experiments.

FAVFAT
• Revealing the desired features of a target enzyme or protein by voting on three different property groups aligned by a three-profile alignment method (accepted by BMC Genomics, 2010)

• Three properties
– Target (sequence of interest)
– Property A (functionally related sequences)
– Property ~A (non-functional sequences)

• Goals:
– Identifying amino acid residues critical for Human Enterovirus 71
– Identifying function- and species-associated sites for Influenza A virus

Schematic diagram of the influenza virus replication cycle

Our approach


• 3D-QSAR (Pharmacophore) model design

• Chemical compound inference

• Drug synthesis and validation

Structure of Neuraminidase protein

Pharmacophore Generation
• A series of inhibitors
• Conformation generation
• Pharmacophore (hypothesis)

Drug Screening
• Build feature model: training inhibitors for the feature model of protein X (features: HBA, HY, RA)
• X’s spatial features
• Database: natural compounds, ~90,000 cpds; synthetic compounds, ~5,000,000 cpds
• X’s spatial feature search
• X’s spatial recognition


Inhibitor candidates of Protein X

Chemical Compound Inference Problem

Fujiwara et al. proposed a sequential branch-and-bound algorithm to solve this problem.

H. Fujiwara, J. Wang, L. Zhao, H. Nagamochi, and T. Akutsu, “Enumerating Treelike Chemical Graphs with Given Path Frequency”, J. Chem. Inf. Model., 2008, 48(7), pp. 1345-1357.

The algorithm proposed by Fujiwara et al. cannot deal with the ring structures of chemical compounds. Moreover, the computation time increases significantly as the number of atoms grows.

In this study, we propose a Balanced Multi-Process Parallel Algorithm for the Chemical Compound Inference Problem.

BMPBB-CCI
• Balanced Multi-Process Branch-and-Bound Algorithm for the Chemical Compound Inference Problem

• The goals of BMPBB-CCI include
– Reducing computation time via parallel computing
– Taking care of ring structures in the CCI problem

New directions for the future
• Metagenomics

NGS analysis

GPU solution

• Cancer Genomics

SNP, indel, translocation detection

• Experiment Design

Introduction (Penn State project)
Here, we illustrate a scenario of microbial community profiling.

Fig. 1. The scenario of collecting samples from a car and the sequencing process.

Windshield Genomics

Sources

GPU

• A quiet revolution and potential build-up
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Until last year, programmed through graphics API
– GPU in every PC and workstation: massive volume and potential impact

[Figure: GFLOPS over GPU generations. G80 = GeForce 8800 GTX; NV40 = GeForce 6800 Ultra; NV35 = GeForce FX 5950 Ultra; NV30 = GeForce FX 5800]

Why GPU? Massively Parallel Processor

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007. ECE 498AL1, University of Illinois, Urbana-Champaign

Genome Rearrangements and Evolutionary Trees

ROBIN (Bioinformatics, 2005)

SPRING (Nucleic Acids Research, 2006)

Genome Rearrangements

Human X: 4 6 1 7 2 3 5 8

Mouse X

4 6 7 1 2 3 5 8

4 1 2 3 5 6 7 8

1 2 3 4 5 6 7 8

Genome rearrangement by block-interchanges
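Each step in the permutation sequence above can be read as one block-interchange, that is, swapping two non-overlapping blocks of genes. The sketch below replays those three steps. The function name and the choice of block boundaries are illustrative, and this is not the ROBIN or SPRING code.

```python
from typing import List

def block_interchange(perm: List[int], i: int, j: int, k: int, l: int) -> List[int]:
    """Swap the blocks perm[i:j] and perm[k:l], where i < j <= k < l."""
    if not (0 <= i < j <= k < l <= len(perm)):
        raise ValueError("blocks must be non-empty, ordered, and non-overlapping")
    return perm[:i] + perm[k:l] + perm[j:k] + perm[i:j] + perm[l:]

start = [4, 6, 1, 7, 2, 3, 5, 8]
step1 = block_interchange(start, 2, 3, 3, 4)   # swap [1] and [7]
step2 = block_interchange(step1, 1, 3, 3, 7)   # swap [6 7] and [1 2 3 5]
step3 = block_interchange(step2, 0, 1, 1, 4)   # swap [4] and [1 2 3]
print(step1)  # [4, 6, 7, 1, 2, 3, 5, 8]
print(step2)  # [4, 1, 2, 3, 5, 6, 7, 8]
print(step3)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Three swaps sort the permutation, so the block-interchange distance between the first and last gene orders is at most 3.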

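Evolutionary relationships among three common human-pathogenic Vibrio species: Vibrio vulnificus (創傷弧菌), Vibrio parahaemolyticus (腸炎弧菌), and Vibrio cholerae (霍亂弧菌)

First matrix:
                      V. vulnificus   V. parahaemolyticus   V. cholerae
V. vulnificus         -               39                    69
V. parahaemolyticus   39              -                     65
V. cholerae           69              65                    -

Second matrix:
                      V. vulnificus   V. parahaemolyticus   V. cholerae
V. vulnificus         -               3                     6
V. parahaemolyticus   3               -                     7
V. cholerae           6               7                     -

Panel labels: Chromosome 2, Chromosome 1

The results were published in J. Computational Biology (Vol. 12, pp. 102-112, 2005)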

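Mining orchid flower-shape genes with reverse-engineering techniques

Experimental tool: RNAi

RNAi (RNA interference)

dsRNA is cut by a cellular nuclease specific to double-stranded RNA into short double-stranded RNAs of 21-23 base pairs, called siRNA (small interfering RNA).

siRNA forms a complex with certain cellular enzymes and proteins, called the RNA-induced silencing complex (RISC).

RISC recognizes mRNA that has a sequence homologous to the siRNA and cleaves that mRNA at a specific site.

• Introducing small double-stranded RNA fragments homologous to a target gene triggers the RNAi mechanism and suppresses expression of the target gene, providing a new tool for studying gene function.

• If the introduced double-stranded RNA fragment is homologous to segments of several genes, the expression of several genes can be suppressed at once.

• By analyzing orchid gene sequences, find candidate double-stranded RNA sequences that can suppress several genes at once.

• Use the selected double-stranded RNA sequences in RNAi experiments on orchids, observe which traits change, and quickly narrow down the set of genes possibly related to each trait.

• Run a second round of sequence analysis on the genes that pass the screen and repeat the RNAi experiments until the number of target genes is small enough to test one by one.

Mining flowering function: orchid genetic engineering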
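The sequence-analysis step, finding short fragments shared by several genes, can be sketched as a k-mer index. This is a simplified illustration under stated assumptions: only exact matches on the given strand are considered, the gene names and sequences are made up, and a realistic siRNA design would use 21-23 nt fragments and tolerate a few mismatches.

```python
from collections import defaultdict
from typing import Dict, List

def shared_kmers(genes: Dict[str, str], k: int = 21, min_genes: int = 2) -> Dict[str, List[str]]:
    """Return every k-mer that occurs in at least min_genes genes,
    mapped to the sorted names of the genes containing it."""
    owners = defaultdict(set)
    for name, seq in genes.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)
    return {kmer: sorted(names)
            for kmer, names in owners.items()
            if len(names) >= min_genes}

# Toy example with k = 5 so the shared fragments are easy to see.
toy = {"g1": "AAACCCGGGTTT", "g2": "TTTCCCGGGAAA", "g3": "ACGTACGTACGT"}
print(shared_kmers(toy, k=5))
```

Each shared k-mer is a candidate fragment whose double-stranded RNA could silence all the genes listed with it, which is how one experiment can cover several genes.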

According to similarity, find the center sequences and determine the group of each

• S1 is the center of a group G if S1 has no second neighbor (Sec_nei_num(S1) = 0).

• If there exists a subsequence F with HD(F, S1) = 5, then F is a Far_neighbor(5) of S1.

G(S1) = {S1, S2, S3, S4, S5}

[Diagram: center S1 surrounded by S2, S3, S4, and S5, with a far neighbor F; edge labels include 4 and 5]

[Diagram: TF No.1 … No.272 screened against 7700 genes]
• siRNA from TF No.21: No.21, No.13, No.130, No.152 (Level 1)
• Level 2: No.13, No.15, No.21, No.194, No.171
• Other labelled TFs: No.74, No.112, No.168, No.176
• siRNA from TF No.176: PR1 ---
• siRNA from TF No.13: No.13; PR1
• Labels: PR1; PR1 relative

[Figure: three bar charts with axes from 0 to 120]

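Acknowledgements
• All members of the laboratory
• Linkou Chang Gung oral cancer team of Drs. 閻紫宸 and 廖俊達
• Laboratory of Prof. 鄭恩加, Chang Gung (medical biotechnology)
• Linkou Chang Gung gynecologic cancer team of Dr. 賴瓊慧
• Laboratory of Prof. 盧錦隆, National Chiao Tung University (bioinformatics)
• Laboratory of Prof. 林俊淵, Chang Gung University (computer science)
• Prof. 謝文萍, Institute of Statistics, National Tsing Hua University
• Laboratory of Prof. 王雯靜, Life Science, National Tsing Hua University
• Laboratory of Prof. 張大慈, Life Science, National Tsing Hua University
• Dr. 劉明麗, Yuanpei University (medical technology)
• Laboratory of Prof. 葉信宏, National Taiwan University (plant pathology and microbiology)
• Dr. 李仁權, Animal Technology Institute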
