Dan Geiger, Computer Science Department, Technion. PEDTOOL: Gene hunting based on high-throughput computing

Post on 22-Dec-2015


1

Dan Geiger
Computer Science Department, Technion

PEDTOOL: Gene hunting based on high-throughput computing

2

Searching for genes that predispose to or cause diseases

Why search?

1. Prenatal testing for high-risk populations
2. Risk assessment and adapting lifestyle to risk factors
3. Finding the mutant proteins and developing drugs
4. Understanding basic biological processes

How can one search?

1. Finding families in which a disease is passed down from generation to generation
2. Taking a simple blood sample from several affected and healthy individuals
3. Laboratory analysis of the DNA across all chromosomes
4. Analysis by algorithmic means. I will emphasize three computational problems.

4

Usage of our system in Israeli Hospitals

Rabin Hospital, by Motti Shochat’s group
• New locus for mental retardation (2003)
• Infantile bilateral striatal necrosis (2004)

Soroka Hospital, by Ohad Birk’s group
• Lethal congenital contractural syndrome (2004)
• Congenital cataract (2005)

Rambam Hospital, by Eli Shprecher’s group
• Congenital recessive ichthyosis (2005)
• CEDNIK syndrome (2005)

Galil Ma’aravi Hospital, by Tzipi Falik’s group
• Familial Onychodysplasia and dysplasia of distal phalanges
• Familial juvenile hypertrophy of the breast (2005)

5

Steps in Gene Hunting

1. Linkage analysis (10^6~10^7 bp)
2. Identify genes (10^4~10^5 bp)
3. Resequencing (10^0 bp)

6

Recombination During Meiosis

[Figure: recombination during meiosis in a male or female parent, producing recombinant gametes]

7

Family Pedigree

8

Familial Onychodysplasia and dysplasia of distal phalanges (ODP)

[Pedigree figure; labeled individuals: III-15, IV-7, IV-10]

9

Familial juvenile hypertrophy of the breast (JHB)

[Pedigree figure; labeled individual: IV-3]

10

Marker Information Added (genetic markers)

Id      dad     mom     sex  aff  Marker 1  Marker 2
III-21  II-10   II-11   f    h    0 0       0 0
II-5    I-3     I-4     f    h    155 157   1 3
III-7   II-4    II-5    f    a    155 157   1 1
III-13  II-4    II-5    m    a    151 155   1 1
III-14  II-1    II-2    f    h    151 155   2 3
III-15  II-4    II-5    m    a    151 155   1 1
III-16  II-10   II-11   f    h    151 159   1 4
III-5   II-4    II-5    f    h    151 155   1 1
IV-1    III-13  III-14  f    h    151 155   1 3
IV-2    III-13  III-14  f    a    151 155   1 3
IV-3    III-13  III-14  f    a    155 155   1 3
...

[Figure: a chromosome pair carrying markers M1 and M2]

11

Maximum Likelihood Evaluation: Two-Point Analysis (Task 1)

[Figure: III-15 (a) and III-16 (h) with their genotypes at markers M1 to M4, and candidate disease-locus positions D1 and D2 at recombination fraction θ]

Marker  III-15   III-16
M1      151,159  151,155
M2      202,209  202,202
M3      139,141  139,146
M4      1,2      3,3

The first computational problem: find a value of θ that maximizes Pr(data|θ,Mode-Of-Inheritance).

Here, data means the data of one marker at a time.

LOD score (to quantify how confident we are):

Z(θ) = log10[Pr(data|θ) / Pr(data|θ=½)].
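As a rough illustration of Task 1, the sketch below evaluates Z(θ) on a grid under a strong simplifying assumption that is not in the slides: every meiosis is phase-known, so Pr(data|θ) reduces to θ^r (1-θ)^(n-r) for r recombinants in n meioses. Real pedigrees need the full likelihood computation; the function names are hypothetical.

```python
from math import log10

def likelihood(theta, recombinants, meioses):
    """Pr(data | theta), ASSUMING phase-known meioses (a simplification)."""
    return theta ** recombinants * (1 - theta) ** (meioses - recombinants)

def lod(theta, recombinants, meioses):
    """Z(theta) = log10[ Pr(data | theta) / Pr(data | theta = 1/2) ]."""
    numerator = likelihood(theta, recombinants, meioses)
    if numerator == 0.0:
        return float("-inf")  # data impossible at this theta
    return log10(numerator / likelihood(0.5, recombinants, meioses))

def best_theta(recombinants, meioses, grid=None):
    """Grid search for the theta in [0, 0.5] that maximizes the likelihood."""
    grid = grid or [x / 100 for x in range(0, 51)]
    return max(grid, key=lambda t: likelihood(t, recombinants, meioses))

# Hypothetical data: 1 recombinant observed in 10 informative meioses.
theta_hat = best_theta(1, 10)
z_max = lod(theta_hat, 1, 10)
```

With no recombinants in n meioses, this simplified model gives Z(0) = n·log10(2), so roughly ten fully informative meioses are needed to reach a LOD of 3.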

12

Results of Two-Point Analysis

Marker information  Recombination fraction
Id  Name    0.00    0.01    0.05    0.10    0.20
9   m9      3.96    3.90    3.62    3.27    2.51

13

Results of Two-Point Analysis

Marker information  Recombination fraction
Id  Name    0.00    0.01    0.05    0.10    0.20
4   m4    -14.82   -1.57   -0.13    0.42    0.72
9   m9      3.96    3.90    3.62    3.27    2.51
13  m13    -2.91   -2.31   -1.37   -0.86   -0.37

14

Results of Two-Point Analysis

Marker information  Recombination fraction
Id  Name    0.00    0.01    0.05    0.10    0.20
4   m4    -14.82   -1.57   -0.13    0.42    0.72
5   m5      3.67    3.60    3.35    3.02    2.31
6   m6      2.27    2.23    2.08    1.86    1.38
9   m9      3.96    3.90    3.62    3.27    2.51
10  m10     1.96    2.20    2.42    2.35    1.92
11  m11     1.09    1.08    1.04    0.98    0.80
12  m12    -0.84   -0.56   -0.14    0.03    0.14
13  m13    -2.91   -2.31   -1.37   -0.86   -0.37

15

Maximum Likelihood Evaluation Approach (Task 2)

Most probable haplotype configuration of some or all persons:

Which alleles came from the mother and which from the father?

The second computational problem: argmax Pr(h_1, h_2, …, h_{2n-1}, h_{2n} | data, θ, MOI)

For each person, there are 2^k possible haplotypes, where k is the number of markers considered.
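To make the 2^k count concrete, here is a minimal sketch that enumerates every way of assigning one person's alleles to the paternal and maternal haplotypes. The genotype is the one shown for III-15 in the Task 1 figure; the function name and data layout are illustrative assumptions, and the count is exactly 2^k only when all k markers are heterozygous.

```python
from itertools import product

def phase_assignments(genotype):
    """Enumerate (paternal, maternal) haplotype pairs for one person.

    genotype: list of unordered allele pairs, one pair per marker.
    """
    configs = []
    for flips in product((False, True), repeat=len(genotype)):
        # A flip swaps which allele of the pair is taken as paternal.
        paternal = tuple(b if f else a for (a, b), f in zip(genotype, flips))
        maternal = tuple(a if f else b for (a, b), f in zip(genotype, flips))
        configs.append((paternal, maternal))
    return configs

# III-15 at markers M1..M4, all heterozygous, so k = 4.
iii_15 = [(151, 159), (202, 209), (139, 141), (1, 2)]
configs = phase_assignments(iii_15)
```

Here `len(configs)` is 16. The search space for a whole pedigree is the product of these counts over all persons, which is why Task 2 is posed as an argmax over a joint distribution and not solved by enumeration.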

16

Results of Haplotyping Analysis (Affected persons)

ID      M3   M4   M5   M6   M9   M10  M11  M12  M13
III-7   1    2    1    2    1 x  2    2    2    2
        3    3 x  3    1    3    3    1    2    2
IV-3    1    5    2    2    5    2    1    2    3
        4    4    3    1    3    3    1    2    2
IV-4    3    3    1 x  2    5    2    1    2    3
        4    4    3    1    3    3    1    2    2
IV-7    1    3    1    5    6 x  2    4    1    3
        4    4    3    1    3    3    1    2    2
IV-10   1    3    1    4    3    2    4    1    3
        4    4    3    1    3    3    1    2    2

17

Results of Haplotyping Analysis (Healthy persons)

ID      M3   M4   M5   M6   M9   M10  M11  M12  M13
II-5    2    1    2    1    2    2    2    2    2
        1    2    1    2    1    1    1    1    1
III-14  1    5    2    2    5    2    1    2    3
        3    3    1    3    4    1    1    3    2
III-16  1    3    1    4    3    2    4    1    3
        1    3    1    5    6    4    3    1    4
IV-6    1    3    1    5    6    4    3    1    4
        4    4 x  1    2    1    1    1    1 x  2
IV-8    1    3    1    4    3    2    4    1    3
        1    2    1    2    1 x  3    1    2    2

18

Maximum Likelihood Evaluation: Multipoint Analysis (Task 3)

[Figure: III-15 (a) and III-16 (h) with their genotypes at markers M1 to M4, and a candidate disease-locus position D1 at recombination fraction θ]

Marker  III-15   III-16
M1      151,159  151,155
M2      202,209  202,202
M3      139,141  139,146
M4      1,2      3,3

The third computational problem: find a value of θ that maximizes Pr(data|θ,MOI).

Data now means considering several markers at once.

19

Results of Multipoint Analysis

Position in centi-Morgans  Ln(Likelihood)  LOD
 0.0000 (Marker 3)         -216.0217       -14.74
 0.5500                    -192.2385        -4.41
 1.1000 (Marker 4)         -216.0210       -14.74
 3.6000                    -176.3810         2.47
 6.1000 (Marker 5)         -174.3392         3.35
 8.6500                    -173.9743         3.51
11.2000 (Marker 6)         -173.7030         3.63
16.5500                    -173.3106         3.80
21.9000 (Marker 9)         -172.9497         3.96
25.2500                    -173.6540         3.65
28.6000 (Marker 10)        -177.5622         1.95
40.3001                    -178.9946         1.33
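The two numeric columns are related by the LOD definition: LOD = (ln L(position) − ln L(unlinked)) / ln 10. The slide does not list the unlinked ln-likelihood, so the sketch below infers it from the Marker 9 row (an assumption) and checks that a few other rows agree to within the printed precision.

```python
from math import log

# (position in cM, ln-likelihood, LOD) for a few rows of the multipoint table
rows = [
    (0.0, -216.0217, -14.74),
    (6.1, -174.3392, 3.35),
    (21.9, -172.9497, 3.96),
    (28.6, -177.5622, 1.95),
    (40.3001, -178.9946, 1.33),
]

def implied_baseline(ln_like, lod):
    """ln-likelihood of the unlinked hypothesis implied by one table row."""
    return ln_like - lod * log(10)

def lod_from_ln(ln_like, baseline):
    """Convert a natural-log likelihood into a base-10 LOD score."""
    return (ln_like - baseline) / log(10)

# ASSUMPTION: the baseline is not in the slides; infer it from Marker 9.
baseline = implied_baseline(-172.9497, 3.96)
recomputed = [(pos, lod_from_ln(ln_l, baseline)) for pos, ln_l, _ in rows]
peak_position = max(rows, key=lambda r: r[2])[0]
```

Under this reading the likelihood peaks at Marker 9 (21.9 cM), matching the strongest two-point result.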


20

The Computational Task

Computing P(data|θ) for a specific value of θ:

$$P(\text{data} \mid \theta) = \sum_{x_k} \cdots \sum_{x_3} \sum_{x_1} \prod_{i=1}^{n} P(x_i \mid pa_i)$$

This problem is equivalent to finding the best order for sum-product operations for high-dimensional matrices:

$$Y_{ij} = \sum_{m} \sum_{n} \sum_{l} \sum_{k} A_{ikl} \, B_{kjm} \, C_{lmn}$$
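A small sketch of why the order matters for Y_ij = Σ A_ikl B_kjm C_lmn: summing all four indices in one pass costs on the order of D^6 multiplications, while eliminating n, then l, then k and m together costs on the order of D^4. The index size D, the random tables, and the dictionary representation are assumptions chosen to keep the example dependency-free.

```python
from itertools import product
import random

D = 3  # assumed size of every index
idx = range(D)
rng = random.Random(0)
A = {t: rng.random() for t in product(idx, repeat=3)}  # A[i, k, l]
B = {t: rng.random() for t in product(idx, repeat=3)}  # B[k, j, m]
C = {t: rng.random() for t in product(idx, repeat=3)}  # C[l, m, n]

def naive():
    """Sum over k, l, m, n in one pass; counts multiplications."""
    Y, mults = {}, 0
    for i, j in product(idx, repeat=2):
        total = 0.0
        for k, l, m, n in product(idx, repeat=4):
            total += A[i, k, l] * B[k, j, m] * C[l, m, n]
            mults += 2
        Y[i, j] = total
    return Y, mults

def ordered():
    """Eliminate n first, then l, then k and m; counts multiplications."""
    mults = 0
    # C1[l, m] = sum_n C[l, m, n]  (additions only)
    C1 = {(l, m): sum(C[l, m, n] for n in idx) for l, m in product(idx, repeat=2)}
    AC = {}  # AC[i, k, m] = sum_l A[i, k, l] * C1[l, m]
    for i, k, m in product(idx, repeat=3):
        total = 0.0
        for l in idx:
            total += A[i, k, l] * C1[l, m]
            mults += 1
        AC[i, k, m] = total
    Y = {}  # Y[i, j] = sum_{k, m} AC[i, k, m] * B[k, j, m]
    for i, j in product(idx, repeat=2):
        total = 0.0
        for k, m in product(idx, repeat=2):
            total += AC[i, k, m] * B[k, j, m]
            mults += 1
        Y[i, j] = total
    return Y, mults

Y_naive, cost_naive = naive()
Y_ordered, cost_ordered = ordered()
```

Both functions return the same Y up to floating-point rounding; only the multiplication count differs, and the gap widens rapidly as D grows.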

21

Stochastic Greedy Ordering Algorithm(s)

• Iteration i:
  – the three indices yielding minimal table size are found.
  – a coin (biased according to the resulting table size) is flipped to choose between them.
• The algorithm is repeated many times, stopping early if a low-cost elimination sequence is found.

Repeat these steps with several cost functions.
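The steps above can be sketched as follows. This is not the Pedtool implementation: the bias (weights proportional to 1/table size), the cost function (largest intermediate table), the fixed number of runs, and all function names are assumptions made for illustration.

```python
import random
from math import prod

def table_size(index, factors, dim):
    """Size of the table created when `index` is summed out."""
    scope = set().union(*(f for f in factors if index in f)) - {index}
    return prod(dim[v] for v in scope)

def eliminate(index, factors):
    """Merge all factors mentioning `index` into one factor without it."""
    touched = [f for f in factors if index in f]
    rest = [f for f in factors if index not in f]
    return rest + [frozenset(set().union(*touched) - {index})]

def stochastic_greedy_order(factors, to_sum, dim, rng):
    """One randomized run; returns (order, largest table created)."""
    factors = [frozenset(f) for f in factors]
    remaining, order, cost = set(to_sum), [], 0
    while remaining:
        # Up to three candidates with the smallest resulting table.
        ranked = sorted(remaining, key=lambda v: (table_size(v, factors, dim), v))
        best = ranked[:3]
        sizes = [table_size(v, factors, dim) for v in best]
        # Biased coin: ASSUMED weights proportional to 1 / table size.
        pick = rng.choices(best, weights=[1.0 / s for s in sizes])[0]
        cost = max(cost, table_size(pick, factors, dim))
        factors = eliminate(pick, factors)
        remaining.remove(pick)
        order.append(pick)
    return order, cost

def best_order(factors, to_sum, dim, runs=50, seed=0):
    """Repeat the randomized run and keep the cheapest sequence found."""
    rng = random.Random(seed)
    results = (stochastic_greedy_order(factors, to_sum, dim, rng) for _ in range(runs))
    return min(results, key=lambda r: r[1])

# The example Y_ij = sum A_ikl B_kjm C_lmn, every index of assumed size 10.
factors = [{"i", "k", "l"}, {"k", "j", "m"}, {"l", "m", "n"}]
dim = {v: 10 for v in "ijklmn"}
order, cost = best_order(factors, "klmn", dim)
```

For this small example the cheapest sequences sum out n first and never build a table larger than 10^3 entries, whereas starting with k, l, or m would create a 10^4-entry table.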

22

When intermediate tables become too large for a given RAM, computation virtually halts:

$$Y_{ij} = \sum_{m} \sum_{n} \sum_{l} \sum_{k} A_{ikl} \, B_{kjm} \, C_{lmn}$$

$$Y_{iljm} = \sum_{k} A_{ikl} \, B_{kjm}$$

But we can fix the value of the index m, namely, condition on m’s value, and do each part as a separate job:

$$Y^{m}_{ilj} = \sum_{k} A_{ikl} \, B^{m}_{kj}$$
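A minimal sketch of the conditioning step: instead of building the four-index table Y_iljm at once, each job fixes one value of m and builds only the three-index slice for that value. The index size, the random tables, and the function names are assumptions; the jobs are run in a loop here, whereas in a distributed setting each could go to a different machine.

```python
from itertools import product
import random

D = 3  # assumed size of every index
idx = range(D)
rng = random.Random(1)
A = {t: rng.random() for t in product(idx, repeat=3)}  # A[i, k, l]
B = {t: rng.random() for t in product(idx, repeat=3)}  # B[k, j, m]

def full_table():
    """Y[i, l, j, m] = sum_k A[i, k, l] * B[k, j, m]: D**4 entries at once."""
    return {
        (i, l, j, m): sum(A[i, k, l] * B[k, j, m] for k in idx)
        for i, l, j, m in product(idx, repeat=4)
    }

def job(m):
    """One independent job: the slice of Y with the index m fixed."""
    B_m = {(k, j): B[k, j, m] for k, j in product(idx, repeat=2)}
    return {
        (i, l, j): sum(A[i, k, l] * B_m[k, j] for k in idx)
        for i, l, j in product(idx, repeat=3)
    }

# Each job holds only D**3 entries and needs no data from the other jobs.
slices = {m: job(m) for m in idx}
full = full_table()
```

Conditioning trades memory for the number of jobs: D jobs of size D^3 replace one table of size D^4, and the slices can be combined afterwards.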

23

The Pedtool System

• Divides the computation of a single likelihood across hundreds of computers.
• Uses Condor at the UW-Madison research pool.
• Simple user interface, used by novices.
• Able to compute a highly inbred pedigree with 250 individuals sent by NIH.

$$Y_{ij} = \sum_{m} \sum_{n} \sum_{l} \sum_{k} A_{ikl} \, B_{kjm} \, C_{lmn}$$

Faster by 1-5 orders of magnitude than other linkage programs.

24

Running time improvements

Files  No. of Loci  Run Time V1.0  Run Time V1.1  Run Time V1.4  Run Time Online
A6     12              2.72           1.26           2.36        local
A7     14              1.84           1.36           1.48        local
A8     18              4.32           3.14           0.51        local
A9     37           8231.04         265.32          28.56        local
A10    38           9871.33        3543.46          36.12        local

A11        40  57.85    local
Mira-46    14  >6000m*  100m
Ginat-115   1  >1500m   70m
Eric-105    1  40m      3m
Eric-105    2  >1000m   15m

bioinfo.cs.technion.ac.il/pedtool

25

The Main Goals of Future Research

• Efficiency
• Simplicity
• Availability online to all Israeli researchers
• More functionalities

bioinfo.cs.technion.ac.il/pedtool

26

Acknowledgements

Students:
• Ma’ayan Fishelson, Ph.D. (graduated 2004)
• Dmitry Rusakov, Ph.D. (graduated 2004)
• Anna Tzemach, M.Sc.
• Nickolay Dovgolevsky, B.Sc. (graduated 2004)
• Mark Silberstein, M.Sc.
• Julia Stolin
• Edward Vitkin

Collaborators from medical genetics:
• Motti Shochat and Tami Shochat (Rabin)
• Ohad Birk and Rivka Ophir (Soroka)
• Tzipi Falik and Morad Khayat (Galil Ma’aravi)

Collaborators from distributed systems:
• Assaf Schuster

Pedtool is to be hosted by DSL at the CS/Technion and supported by IBM, ISF, and the Israeli Science Ministry.