from laboratory to hospital – the new challenge of bioinformatics researchers 唐傳義...
Post on 22-Dec-2015
From Laboratory to Hospital – The New Challenge of Bioinformatics Researchers
唐傳義, Department of Computer Science, National Tsing Hua University
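Collaborating hospitals and related projects
• Cancer: Linkou Chang Gung oral cancer team; Linkou Chang Gung gynecologic cancer team; ITRI Biomedical Center (NCKU Hospital esophageal cancer team, Chiayi Christian Hospital team)
• Infectious diseases: Chang Gung virus center (influenza virus, enterovirus); Hsinchu General Hospital and Zhudong Veterans Hospital (Acinetobacter baumannii); NTHU Department of Life Science (Helicobacter pylori, Candida albicans)
Collaborations are built through talks, jointly taught courses, and jointly executed projects
Starting from the story of "culling hens with low egg-laying capacity"
• Collect samples of four proteins from Taiwan country chickens at different developmental stages, and determine the protein samples and their concentrations from laboratory analysis results
• Record the number of eggs laid by each Taiwan country chicken
• Based on protein concentrations and egg production rate, design a screening method to raise the egg production rate of Taiwan country chickens
• A method has been found that can predict at 14 weeks of age whether a hen will become a low producer; patent pending
• Collaborators: Animal Technology Institute and chicken farms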
Egg production rate of TRFCC (n=157). (A) Total egg number of all hens, (B) hens in four groups
[Figure: egg production rate (%) versus week of age (weeks 25 to 48); panel (B) plots Group I, Group II, Group III, and Group IV separately]
Lack of association of relative protein levels with total egg numbers
[Figure: relative protein level versus total egg number. (A) Vitellogenin: 24w (r=0.23, p<0.01), 35w (r=0.53, p<0.01). (B) Apo A-I: 24w (r=0.14), 35w (r=-0.52, p<0.01)]
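Rank-combination screening method (a geometric nearest-neighbor problem)
• Convert the concentrations of the 4 proteins in two batches of hens into ranks. From a batch with known low-producing hens, derive their rank-combination codes; then use the rank transformation to search the other batch for hens with similar rank-combination codes and predict them to be low producers
• Checking the rank-combination codes of serum proteins at 14 weeks (before laying begins) already reveals a strong regularity with future low producers
• The rank-combination screening method can cull 19.5% of hens at 14 wk, and these include 78.8% of the "50% low-producing" hens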
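The slides do not spell out how rank-combination codes are compared, so the following is only a minimal sketch of the idea under stated assumptions: ranks are normalised by batch size so that batches of different sizes are comparable, and "similar" means a small Chebyshev distance to the code of any known low producer. The function names, the `radius` parameter, and the toy concentrations are all hypothetical.

```python
from typing import List, Sequence, Tuple


def to_ranks(values: Sequence[float]) -> List[int]:
    """Rank 1 is the smallest value; ties are broken by original order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def rank_codes(batch: List[Sequence[float]]) -> List[Tuple[float, ...]]:
    """Turn each hen's protein concentrations into within-batch rank fractions."""
    n, n_proteins = len(batch), len(batch[0])
    cols = [to_ranks([hen[p] for hen in batch]) for p in range(n_proteins)]
    return [tuple(cols[p][h] / n for p in range(n_proteins)) for h in range(n)]


def predict_low_producers(known_batch, known_low_idx, new_batch, radius):
    """Flag hens in new_batch whose rank code lies near a known low producer's code."""
    known = rank_codes(known_batch)
    low_codes = [known[i] for i in known_low_idx]
    flagged = []
    for h, code in enumerate(rank_codes(new_batch)):
        # Chebyshev distance: largest per-protein difference in rank fraction (assumed metric).
        if any(max(abs(a - b) for a, b in zip(code, low)) <= radius for low in low_codes):
            flagged.append(h)
    return flagged


# Toy data (invented): 4 hens x 4 proteins per batch; hen 0 of the first batch is a known low producer.
known_batch = [[1.0, 9.0, 2.0, 8.0], [5.0, 5.0, 5.0, 5.0], [9.0, 1.0, 8.0, 2.0], [7.0, 3.0, 7.0, 3.0]]
new_batch = [[10, 90, 20, 80], [90, 10, 80, 20], [50, 50, 50, 50], [70, 30, 70, 30]]
flagged = predict_low_producers(known_batch, [0], new_batch, radius=0.1)
```

Working on ranks rather than raw concentrations makes the comparison insensitive to batch-to-batch differences in measurement scale, which is presumably why the slide converts to ranks first.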
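Clinical Medical Information Mining and Translational Medicine
Instructors: Prof. 唐傳義; Assistant Prof. 謝文萍 (Institute of Statistics); Director 閻紫宸 (Department of Nuclear Medicine); Director 賴瓊慧 (Gynecologic Cancer Research Center); Director 陸清松 (Neuroscience Research Center); Director 李宗海 (Stroke Center)
Class time: S2S3
Course description: From the integrated perspective of clinical medicine, medical statistics, and bioinformatics, this course explores how to carry out clinical medical information mining and translational medicine research in the post-genomic era. The course focuses on combining clinical electronic medical records, clinical medical imaging information, and gene-bank analysis for clinical cancer medicine and hereditary diseases. It brings together clinicians and professors of computer science and biostatistics, and uses cross-disciplinary hands-on research projects to introduce information technologies from biostatistics and systems biology, adding value to precious clinical medical information in order to support more effective medical decision-making models.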
Cdc2 cutoff 1; NDRG1 cutoff 1; EF1A cutoff 2
Biomarker Visualization: Functional Interaction Linkage Map
[Figure: data sources of the map, including SNP and dbEST; counts shown in the figure: 11,612; 6,434; 353,043; 11,402; 34,435; 169,093; 59,852; 594,111]
Gene symbols and protein names:
KIT: C-KIT
FLT1: VEGFR-1
KDR: VEGFR-2
ERBB2: HER-2/neu
EGFR: EGFR
ESR1: ER
PGR: PR

The ratio of common neighbors (upper triangle):
Gene  | KIT | FLT1  | KDR   | ERBB2 | EGFR  | ESR1  | PGR
KIT   | 1   | 0.107 | 0.142 | 0.137 | 0.084 | 0.022 | 0.01
FLT1  |     | 1     | 0.268 | 0.102 | 0.055 | 0.016 | 0
KDR   |     |       | 1     | 0.102 | 0.069 | 0.027 | 0.024
ERBB2 |     |       |       | 1     | 0.231 | 0.043 | 0.026
EGFR  |     |       |       |       | 1     | 0.057 | 0.026
ESR1  |     |       |       |       |       | 1     | 0.088
PGR   |     |       |       |       |       |       | 1
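The slides do not define "the ratio of common neighbors". One plausible reading, shown here purely as an assumption, is the Jaccard index of the two genes' neighbor sets in the interaction network: shared neighbors divided by all neighbors of either gene. The toy network and the neighbor names `N1` to `N4` are made up for illustration.

```python
from typing import Dict, Set


def common_neighbor_ratio(graph: Dict[str, Set[str]], a: str, b: str) -> float:
    """Jaccard index of the neighbor sets of a and b (assumed definition)."""
    na = graph.get(a, set()) - {b}  # ignore a direct a-b edge
    nb = graph.get(b, set()) - {a}
    union = na | nb
    if not union:
        return 0.0
    return len(na & nb) / len(union)


# Hypothetical neighbor sets, not taken from the real network.
toy_network = {
    "KIT": {"N1", "N2", "N3"},
    "KDR": {"N2", "N3", "N4"},
}
ratio = common_neighbor_ratio(toy_network, "KIT", "KDR")
```

Under this definition a gene compared with itself scores 1, which agrees with the diagonal of the matrix; other normalisations (for example dividing by the smaller neighbor set) would also give a unit diagonal.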
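Complementarity and trust
Chang Gung and Tsing Hua joint project (長清計畫): Oral Cancer, with 閻紫宸
[Figure: project components: Imaging and Clinical Data; Psycho-social Study and Supportive Care; Informatics; Epidemiology and Translational Study; Systems Biology Approach: 4 M's Paradigm. Participants: Tsing Hua team, Chang Gung team]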
GRP78 knockdown inhibits cell invasion
[Figure: number of invaded cells for Vector, Scramble, and siRNA-1 in (A) FADU and (B) Detroit cells; reported P values: P=0.0013 (**) and P=0.016]
Systematic analysis flow:
• Biomarkers; expression profiles
• Significantly changed genes
• Clustering; functional module finding (map to FILM)
• Pathway finding (map to KEGG)
• Pathway cross-linking (map to FILM)
• Disease network; drug-target network
• Up- or down-regulation
• Survival analyses & data mining
• Pathway prediction: missing link, missing gene
hsa2099
hsa6256
hsa7153
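Residual specimens
Other biomarker measurements
Serum immune biomarker measurements
Medical record data
Gene chip (microarray) measurements
Chang Gung head and neck cancer research team; Chang Gung gynecologic cancer research team; Chang Gung cancer clinical and research center
Chang Gung University, Department of Biochemistry and Biomedical Engineering
Chang Gung University, Department of Medical Biotechnology and Laboratory Science
Chang Gung Department of Clinical Pathology
National Tsing Hua University, Department of Computer Science
Data mining, survival analysis, systems biology analysis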
Sample Classification
There are 112 patients in total.
Data provided by CGMH:
Analysis Plan
Association Differentially expressed genes
Genetic components of expression
[Figure: DNA loci (L1 … Ln, R1 … Rn) linked to Traits I (g1 … gn) and Traits II (W1 … Wn), distinguishing causal genes from reactive genes]
Biomarker Finding Plan
• SNP 6.0: association study, LOH analysis, copy number analysis, QTL analysis, biological network analysis
• Exon array: association study, LOH analysis, copy number analysis, QTL analysis, biological network analysis
• Tissue array
• Biomarkers at DNA level; biomarkers at DNA/mRNA level; biomarkers at protein level
• Diagnosis chip design
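Clinical records and biomedical databases
• FILM database (gene functional network)
• Disease association and drug-target interaction network database
• Biomedical databases
Clinical record and biomedical data analysis platform
• Gene Expression analysis module
• Survival analysis module
• SNP Array analysis module
Analysis platform
• Text Mining module
• Pathway Analysis module
• Biomarker and Drug Target Prediction module
• Gene Module Analysis module
Clinical knowledge mining platform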
New Module: Next-generation Sequencing Data Analysis
• Roche 454 GS-FLX System
• ABI SOLID sequencing system
• Illumina Solexa 1G Genome Analyzer
Read: short (35~100bp), small error rate, high coverage and low cost.
Genome
Reads
Re-sequencing Problem Definition
• We are given a text $T = t_1 t_2 \dots t_n$, a set of patterns $P_1, P_2, \dots, P_N$, and a constant $k$. We are asked to find all occurrences of each $P_j$ in $T$ with at most $k$ errors (Hamming distance).
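To make the problem statement concrete, here is a deliberately naive reference solution: slide every pattern along the text and keep the positions where the Hamming distance is at most k. It is only a baseline sketch for checking faster indexed methods, not how any of the aligners named in this talk work; the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def hamming_within(a: str, b: str, k: int) -> bool:
    """True if equal-length strings a and b differ in at most k positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > k:
                return False  # stop early once the budget is exceeded
    return True


def find_occurrences(text: str, patterns: Sequence[str], k: int) -> Dict[int, List[int]]:
    """Map each pattern index j to the 0-based start positions where P_j matches with <= k mismatches."""
    hits = {}
    for j, p in enumerate(patterns):
        m = len(p)
        hits[j] = [i for i in range(len(text) - m + 1) if hamming_within(text[i:i + m], p, k)]
    return hits


hits = find_occurrences("ACGTACGT", ["ACG", "AGG"], k=1)
```

The running time is proportional to the text length times the total pattern length, which is why real read mappers index either the genome or the reads.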
Algorithms
• Indexing Genome with Hash Tables: SOAP …
• Indexing Reads with Hash Tables: MAQ, ZOOM, SeqMap and RMAP, …
• Indexing Genome with Suffix Array/BWT: Bowtie, …
Indexing Genome with Suffix Array/BWT
• Bowtie is the fastest of these programs.
This is probably the fastest short read aligner to date.
CPU time for 36 bp reads:
Program | CPU time
Bowtie | 6 m 15 s
Maq | 3 h 52 m 54 s
SOAP | 16 h 44 m 3 s
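The suffix array/BWT indexing idea can be illustrated with a toy FM-index and exact backward search. This is a teaching sketch only: it builds the suffix array by naive sorting and stores full occurrence tables, whereas production aligners such as Bowtie use compressed structures and also handle mismatches. It assumes the text contains no `$` character; all names are hypothetical.

```python
from typing import Dict, List, Tuple


def build_fm_index(text: str) -> Tuple[List[int], Dict[str, int], Dict[str, List[int]]]:
    """Return (suffix array, C table, Occ table) for text + '$' (naive construction)."""
    text += "$"  # sentinel, assumed to sort before every other character
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] is '$' when i == 0
    alphabet = sorted(set(text))
    # C[c] = number of characters in the text that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def exact_match(pattern: str, sa, C, occ) -> List[int]:
    """Backward search: process the pattern right to left, narrowing a suffix-array interval."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, C, occ = build_fm_index("ACGTACGT")
positions = exact_match("ACG", sa, C, occ)
```

Each pattern character costs a constant number of table lookups regardless of genome size, which is the main reason this family of indexes is fast for short reads.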
As Quick as Bowtie, with Support for Alignment (Edit) Distance
• If there are many indels (insertions or deletions) when aligning sequencing data to a reference sequence, the alignment results obtained with Hamming distance are not acceptable (<40% of reads mapped).
Deletion (reads lack a base present in the reference):
Reference: …ACGGATAGCTAGCTAGCATCAGGGCAGATCA…
Reads: TAGCTAGCTGCATCAGGG
GCTAGCTGCATCAGGGCA
AGCTGCATCAGGGCAGAT
Insertion (reads carry an extra G, +G):
Reference: …ACGGATAGCTAGCTCATCAGGGCAGATCA…
Reads: TAGCTAGCTGCATCAGGG
GCTAGCTGCATCAGGGCA
AGCTGCATCAGGGCAGAT
WorkFlow (phases: Filtering; Progressive Mapping)
• Input: NGS data
• Mapping algorithm (Bowtie): trim 17 bps and set error bound = 2
• Mark mapped regions (depth = 5)
• Unmapped reads: trim 3 bps
• Map reads to marked regions with Hamming distance = 1
• Score and modify the ref-sequence
• SNP found? (Yes / No)
hsa-miR-99a
hsa-miR-99a targets: IGF1R (3480), SMARCD1 (6602), WISP2 (8839), HADHB (3032)
Associated pathways:
• Glucocorticoid receptor regulatory network (NCI)
• Beta oxidation of palmitoyl-CoA to myristoyl-CoA (Reactome)
• Mitochondrial fatty acid beta-oxidation of unsaturated fatty acids (Reactome)
• Fatty acid elongation in mitochondria (KEGG)
• IGF1 pathway (NCI)
• E-cadherin signaling events (NCI)
• Plasma membrane estrogen receptor signaling (NCI)
• Integrins in angiogenesis (NCI)
[Figure: counts shown in the diagram: 5959, 7878, 22, 1313. Legend: green = miRNA target; red = mRNA target; purple = miRNA + mRNA target]
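Questions infectious disease physicians often ask
• Sequence new local drug-resistant strains (NGS assembly)
• Understand the resistance mechanisms of drug-resistant strains (NGS re-sequencing, SNP finding)
• Use past data to find host specificity and virulence
• Design new drugs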
HP Experiments (Edit distance)
Reference sequence size ≈ 1.6M; read number: 3,074,139; read size = 76 bp
% of reads mapped (error < 6):
Ref 1 | Ref 2 | Ref 3 | Ref 4 | Ref 5 | Ref 6 | Ref 7
59% | 59% | 55% | 57% | 50% | 57% | 67%
Genome Sequence Modification
Ref Seq:         … A  C  C  G  A  T  C …
A score:           25 0  1  0  1  0  1
C score:           1  30 35 0  40 1  32
G score:           0  1  0  40 0  0  0
T score:           1  0  0  0  0  1  0
Deletion score:    0  0  0  0  0  30 0
Insertion score:   +A 31
Modified sequence: …ACACGCTC…
WorkFlow
• Input: short read data
• Map reads to the ref-sequence with small edit distance
• Score and output the modified ref-sequence
• Indel found? No: end. Yes: map reads to the modified ref-sequence with small edit distance, then score again
Results (error < 6):
Run | % of reads mapped | # of deletions | # of insertions | Genome coverage
Run 1 | 61.6% | 497 | 435 | 86.6%
Run 2 | 71.2% | 235 | 234 | 88.9%
Run 3 | 72.7% | 132 | 157 | 89.6%
Run 4 | 73.2% | 78 | 82 | 90%
Run 5 | 73.5% | 37 | 74 | 90.2%
…
Run 16 | 73.9% | 13 | 9 | 90.4%
Columns: RNase | pI | substrate specificity | antiviral activity | antibacterial activity for prokaryote | neurotoxicity
RNase 1 (human pancreatic RNase, HPR) | 9.1 | (not given) | No | No | No
RNase 2 (eosinophil-derived neurotoxin, EDN) | 9.2 | viral genomic DNA | Yes (antiviral activity against RSV and HIV studied in vitro) | Yes (E. coli) | Yes (required active ribonuclease activity)
RNase 3 (eosinophil cationic protein, ECP; potent anti-parasitic agent) | 10.8 | (not given) | Yes (against RSV) | Yes (E. coli) | Yes (much lower than EDN)
RNase 4 | 9.3 | uridine-preferring | No | No | No
RNase 5 (angiogenin, ANG) | 9.73 | tRNA-specific, cytidine-preferring | No | No | No
RNase 6 (k6) | 9.49 | remaining cells as extracted: No, No
RNase 7 | 10.5 | unknown | remaining cells as extracted: No, Yes
RNase 8 | 8.6 | unknown | remaining cells as extracted: No, No
Specialized software needed for finding function: RNase proteins as an example
Multiple Sequence Alignment
• Given a set of sequences, the MSA problem is to find an alignment of the sequences such that some objective function is minimized
• e.g., the sum-of-pairs score
S1: ATTCG
S2: AGTCG
S3: ATCAG
S'1: A T - T C - G
S'2: A - G T C - G
S'3: A T - - C A G
MSA2
Cost = 8
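The cost of 8 for the three-row alignment above can be reproduced with a sum-of-pairs computation, assuming unit cost for a mismatch and for a residue aligned against a gap, and zero cost for a match or a gap aligned against a gap. The slide does not state the cost scheme, so these weights are an assumption that happens to give the same total.

```python
from itertools import combinations
from typing import Sequence


def sp_cost(alignment: Sequence[str], mismatch: int = 1, gap: int = 1) -> int:
    """Sum-of-pairs cost: add up pairwise column costs over all pairs of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == y:
                continue  # match, or gap against gap: cost 0
            total += gap if "-" in (x, y) else mismatch
    return total


rows = ["AT-TC-G", "A-GTC-G", "AT--CAG"]
cost = sp_cost(rows)
```

Columns 2, 3, 4 and 6 each contribute 2 and the remaining columns contribute 0, giving 8.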
MSA of 7 RNases (by Workbench3.2)
PDB ID | 1ONC | 1RCN | 1DE3 | 1BC4
RNase name | Pancreatic Ribonuclease | Ribonuclease A | Ribonuclease α-Sarcin | Ribonuclease
Source | Rana pipiens | Bovine (Bos taurus) pancreas | Aspergillus giganteus | Rana catesbeiana
Recombinant/Native | Native | Native | Recombinant | Native
pI | 8.96 | 8.64 | 9.17 | 9.20
Disulfide bonds | 4 | 4 | 2 | 4
Substrate/ligand | SO4 | DNA (5'-D(ApTpApAp)-3') | (not given) | (not given)
X-ray/NMR | X-ray | X-ray | NMR | NMR
Reference | J Mol Biol 236, pp. 1141 (1994) | J Biol Chem 269, pp. 21526 (1994) | J Mol Biol 299, pp. 1061 (2000) | J Mol Biol 283, pp. 231 (1998)
Multiple Sequence Alignment with Constraints
MuSiC (Bioinformatics, 2004)
MuSiC-ME (Memory Efficient, Bioinformatics, 2005)
RE-MuSiC (Regular Expression, NAR, 2006)
Multiple Sequence Alignment with Constraint
• Input: (1) multiple Protein/DNA/RNA sequences and (2) several constraints (represented by regular expressions), with each consisting of known functionally, structurally or evolutionarily related residues/nucleotides of the input sequences.
• Output: an optimal multiple sequence alignment in the condition that the constrained amino acids/ nucleotides should be aligned together in the alignment.
CMSA: Constrained Multiple Sequence Alignment Problem
• Input: a set of $k$ sequences $S_1, \dots, S_k$ along with an ordered set of $r$ constraints $(C_1, \dots, C_r)$ and an error ratio $\varepsilon$ with $0 \le \varepsilon < 1$
• Output: an optimal CMSA, say $A$, in which $r$ disjoint bands $B_1, \dots, B_r$ are in $A$ such that $d(C_i, B_i(S_j)) \le \varepsilon \cdot \lambda(C_i)$ for all $1 \le i \le r$ and $1 \le j \le k$.
– band: a block of consecutive columns in $A$
– $d(x, y)$: the Hamming distance between $x$ and $y$
– $B_i(S_j)$: the fragment of $S_j$ whose bases are all in $B_i$
– $\lambda(C_i)$: the length of $C_i$ (also denoted by $\lambda_i$)
Example of CMSA
• Input: 6 RNA sequences along with 11 constraints and error ratio $\varepsilon = 0$
Example of CMSA
• Input: 6 RNA sequences along with 11 constraints and error ratio $\varepsilon = 0.5$
Web Interface of MuSiC
http://genome.life.nctu.edu.tw/MUSIC/
Syntax of Regular Expression
• The IUPAC codes for the amino acids and nucleotides are used in the regular expression.
• "-": separates the elements of a regular expression.
• "[]": the amino acids (or nucleotides) that are allowed to appear at a given position.
• "{}": the amino acids (or nucleotides) that are not accepted at a given position.
• Repetition of an element is indicated by appending, immediately following that element, an integer or a pair of integers in parentheses.
Example: G-[AG]-x(4)-{AG}-x(4,5)
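As an illustration of this syntax, the sketch below translates such a constraint into a Python `re` pattern. It is not RE-MuSiC's own parser: the function name is hypothetical, and it assumes that `x` stands for any residue, as in the PROSITE convention that this notation resembles (the slide uses `x` in its example without defining it).

```python
import re

# One element: 'x', a single letter, [allowed], or {forbidden}, optionally followed by (n) or (n,m).
_ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def to_python_regex(pattern: str) -> str:
    """Translate e.g. 'G-[AG]-x(4)' into an equivalent Python regular expression string."""
    out = []
    for element in pattern.split("-"):
        m = _ELEMENT.match(element)
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."  # assumed wildcard: any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # residues not accepted at this position
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        out.append(core)
    return "".join(out)


regex = to_python_regex("G-[AG]-x(4)-{AG}-x(4,5)")
matches = re.fullmatch(regex, "GACCCCTCCCC") is not None
```

Here the test string is G, then A, four arbitrary residues, one residue that is neither A nor G, and four more residues, so it satisfies the example constraint.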
RE-MuSiC: Multiple Sequence Alignment with Regular Expression Constraints
RE-MuSiC was published in Nucleic Acids Research (Vol. 35, pp. W639-644, 2007)
A software tool for constrained multiple sequence alignment
Too many false positives!
MSA of RNase1~RNase6
- 3 active sites (His42, Lys65, His155)
- 4 disulfide bonds (Cys50, Cys64, Cys82, Cys89, Cys98, Cys110, Cys123, Cys138)
- MSA showed that 11 residues were conserved in RNase3, RNase2 (functionally related enzyme) and RNase1~RNase4 (sequence-related proteins).
Clustering of 8 RNases
• Group1: ECP (RNase3)
• Group3: EDN (RNase2), which is functionally related to ECP
• Group2: RNase1, RNase4, RNase5 and RNase6 don't have toxicity.
• Group4: RNase7 and RNase8 have toxicity, but their toxicity is still unknown.
Algorithm
There are five possible cases at each aligned position:
(1) If the residues of rat imidase, of the functionally identical or related proteins (group3 or group4, respectively), and of the sequence-related proteins (group2) are all different, the score is set to zero.
(2) If the residues of all sequences are the same, the score is set to 1.
(3) If the residue is common to rat imidase and the proteins of group3 or group4 but differs from that of group2, the score is set to 3.
(4) If the residue is common to imidase and the group2 proteins but differs from that of group3 or group4, the score is set to –2.
(5) If the residue is common to the sequence-related proteins and the functionally related proteins but differs from that of imidase, the score is set to zero.
[Figure: Voting score (1): rat imidase; "aye" from G3; "blackball" from G2; Vscore]
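The five cases of the voting scheme can be written as a small per-column function. This sketch makes a simplifying assumption that is not in the slides: each group is represented by a single residue per aligned column (for example a consensus), whereas the real method votes over profiles of several sequences. The function names and the toy sequences are hypothetical.

```python
from typing import List


def column_score(target: str, functional: str, related: str) -> int:
    """Score one aligned column from the target, functional-group and sequence-related-group residues."""
    if target == functional == related:
        return 1   # case (2): all the same
    if target == functional:
        return 3   # case (3): shared with the functional group only
    if target == related:
        return -2  # case (4): shared with the sequence-related group only
    return 0       # cases (1) and (5): target agrees with neither group


def voting_scores(target_seq: str, functional_seq: str, related_seq: str) -> List[int]:
    """Apply column_score along three equally long aligned sequences."""
    if not (len(target_seq) == len(functional_seq) == len(related_seq)):
        raise ValueError("aligned sequences must have equal length")
    return [column_score(t, f, r) for t, f, r in zip(target_seq, functional_seq, related_seq)]


# Invented toy alignment: one column for each of cases (2), (3), (4), (5) and (1).
scores = voting_scores("ACDEH", "ACFGI", "AXDGK")
```

Positions with the highest scores are those where the target agrees with the functionally related proteins and disagrees with the merely sequence-related ones, which is the signal the method is looking for.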
RNase: Comparison of MSA and our method (2)
- The first row is the amino acid sequence of ECP; the second and third rows give the total scores and their corresponding ranks, respectively.
- Green residues: the top 5 scores in our method
- Red residues: 3 active sites and 4 disulfide bonds of RNase proteins
- Pro3 was verified to be associated with ECP's toxicity by biological experiments.
FAVFAT
• Revealing the desired features of a target enzyme or protein by voting on three different property groups aligned by a three-profile alignment method (accepted by BMC Genomics, 2010)
• Three properties
– Target (sequence of interest)
– Property A (function-related sequences)
– Property ~A (non-function sequences)
• Goal:
– Identifying amino acid residues critical for Human Enterovirus 71
– Identifying function- and species-associated sites for Influenza A virus
Schematic diagram of the influenza virus replication cycle
Our approach
• 3D-QSAR (Pharmacophore) model design
• Chemical compound inference
• Drug synthesis and validation
Structure of Neuraminidase protein
Pharmacophore Generation
• A series of inhibitors
• Conformation generation
• Pharmacophore (hypothesis)
Drug Screening
• Build feature model: training inhibitors for the feature model of protein X (features: HBA, HY, RA)
• X's spatial feature search over the database: natural compounds (~90,000 cpds) and synthetic compounds (~5,000,000 cpds)
• X's spatial recognition
• Inhibitor candidates of Protein X
[Figure: chemical structures of the training inhibitors and candidate compounds]
Chemical Compound Inference Problem
Fujiwara et al. proposed a sequential branch-and-bound algorithm to solve this problem.
H. Fujiwara, J. Wang, L. Zhao, H. Nagamochi, and T. Akutsu, “Enumerating Treelike Chemical Graphs with Given Path Frequency”, J. Chem. Inf. Model., 2008, 48(7), pp. 1345-1357.
The algorithm proposed by Fujiwara et al. cannot deal with the ring structures of chemical compounds. Moreover, its computation time increases significantly as the number of atoms grows.
In this study, we proposed a Balanced Multi-Process Parallel Algorithm for Chemical Compound Inference Problem.
BMPBB-CCI
• Balanced Multi-Process Branch-and-Bound Algorithm for Chemical Compound Inference Problem
• The goals of BMPBB-CCI include
– Reduce computation time via parallel computing
– Take care of ring structures in the CCI problem
New directions for the future
• Metagenomics
NGS analysis
GPU solution
• Cancer Genomics
SNP, indel, translocation detection
• Experiment Design
Introduction (Penn State project)
Here, we illustrate a scenario of microbial community profiling.
Fig. 1. The scenario of collecting samples from a car and the sequencing process.
Windshield Genomics
Sources
GPU
• A quiet revolution and potential build-up
– Calculation: 367 GFLOPS vs. 32 GFLOPS
– Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
– Until last year, programmed through graphics API
– GPU in every PC and workstation: massive volume and potential impact
[Figure: GFLOPS across GPU generations. G80 = GeForce 8800 GTX; G71 = GeForce 7900 GTX; G70 = GeForce 7800 GTX; NV40 = GeForce 6800 Ultra; NV35 = GeForce FX 5950 Ultra; NV30 = GeForce FX 5800]
Why GPU? Massively Parallel Processor
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL1, University of Illinois, Urbana-Champaign
Genome Rearrangements and Evolutionary Trees
ROBIN (Bioinformatics, 2005)
SPRING (Nucleic Acids Research, 2006)
Genome Rearrangements
4 6 1 7 2 3 5 8 Human X
Mouse X
4 6 7 1 2 3 5 8
4 1 2 3 5 6 7 8
1 2 3 4 5 6 7 8
Genome rearrangement by block interchanges
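The sequence of permutations above can be reproduced with three block interchanges, where one operation swaps two non-overlapping blocks of consecutive genes. The sketch below only replays these three steps with hand-chosen block boundaries; it does not compute a minimum series of block interchanges, which is what tools such as ROBIN address. The function name and index convention (half-open Python slices) are choices made for this illustration.

```python
from typing import List


def block_interchange(perm: List[int], i: int, j: int, k: int, l: int) -> List[int]:
    """Swap the blocks perm[i:j] and perm[k:l]; requires 0 <= i <= j <= k <= l <= len(perm)."""
    if not (0 <= i <= j <= k <= l <= len(perm)):
        raise ValueError("blocks must be ordered and non-overlapping")
    return perm[:i] + perm[k:l] + perm[j:k] + perm[i:j] + perm[l:]


start = [4, 6, 1, 7, 2, 3, 5, 8]
step1 = block_interchange(start, 2, 3, 3, 4)  # swap [1] with [7]
step2 = block_interchange(step1, 1, 3, 3, 7)  # swap [6, 7] with [1, 2, 3, 5]
step3 = block_interchange(step2, 0, 1, 1, 4)  # swap [4] with [1, 2, 3]
```

When the two blocks are adjacent, as in all three steps here, a block interchange is the same as a transposition.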
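Evolutionary relationships among three common human-pathogenic Vibrio species
First distance matrix:
Species | Vibrio vulnificus | Vibrio parahaemolyticus | Vibrio cholerae
Vibrio vulnificus | - | 39 | 69
Vibrio parahaemolyticus | 39 | - | 65
Vibrio cholerae | 69 | 65 | -
Second distance matrix:
Species | Vibrio vulnificus | Vibrio parahaemolyticus | Vibrio cholerae
Vibrio vulnificus | - | 3 | 6
Vibrio parahaemolyticus | 3 | - | 7
Vibrio cholerae | 6 | 7 | -
Panel labels: Chromosome 2; Chromosome 1
Results published in J. Computational Biology (Vol. 12, pp. 102-112, 2005)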
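Mining orchid flower-shape genes with reverse-engineering techniques
Experimental tool: RNAi
RNAi (RNA interference)
dsRNA is cleaved by a cellular nuclease specific to double-stranded RNA into short double-stranded RNAs of 21-23 base pairs, called siRNA (small interfering RNA)
siRNA forms a complex with certain cellular enzymes and proteins, called the RNA-induced silencing complex (RISC)
RISC recognizes mRNA that has sequence homologous to the siRNA and cleaves that mRNA at a specific site
• Introducing small double-stranded RNA fragments homologous to a target gene triggers the RNAi mechanism and suppresses expression of the target gene, providing a new tool for probing gene function.
• If the introduced small double-stranded RNA is homologous to fragments of several genes, the expression of multiple genes can be suppressed at once.
• By analyzing orchid gene sequences, find candidate double-stranded RNA sequences that can suppress the expression of multiple genes at once.
• Use the selected double-stranded RNA sequences in RNAi experiments on orchids, observe which traits change, and quickly narrow down the set of genes possibly related to those traits.
• Run a second round of sequence analysis on the already-filtered candidate genes and repeat the RNAi experiment until the number of target genes is small enough to test one by one.
Flowering function mining: orchid genetic engineering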
According to similarity, find the center sequences and determine their groups
• S1 is the center of a group G if S1 has no second neighbor (Sec_nei_num(S1) = 0).
• If there exists a subsequence F with HD(F, S1) = 5, then F is a Far_neighbor(5) of S1.
[Figure: group G(S1) = {S1, S2, S3, S4, S5} with center S1; edge labels of 4 between S1 and its group members, and F at distance 5]
[Figure: two-level siRNA screening over transcription factors TF No.1 … No.272 and 7700 genes (146 shown in the diagram); labeled TFs include No.74, No.112, No.168, No.176. Level 1: siRNA from TF No.21 affects No.21, No.13, No.130, No.152. Level 2: No.13, No.15, No.21, No.194, No.171. Follow-up siRNAs from TF No.176 and TF No.13 are compared by relative PR1 level (bar charts, axis 0 to 120)]
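Acknowledgements
• All members of the laboratory
• Linkou Chang Gung oral cancer team of Drs. 閻紫宸 and 廖俊達
• Laboratory of Prof. 鄭恩加, Chang Gung medical technology
• Linkou Chang Gung gynecologic cancer team of Dr. 賴瓊慧
• Laboratory of Prof. 盧錦隆, NCTU bioinformatics
• Laboratory of Prof. 林俊淵, Chang Gung University computer science
• Prof. 謝文萍, NTHU Institute of Statistics
• Laboratory of Prof. 王雯靜, NTHU life science
• Laboratory of Prof. 張大慈, NTHU life science
• Dr. 劉明麗, Yuanpei medical technology
• Laboratory of Prof. 葉信宏, NTU plant pathology and microbiology
• Dr. 李仁權, Animal Technology Institute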