exact and approximation algorithms for dna tag set design ion mandoiu and dragos trinca computer...

33
Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Post on 21-Dec-2015

214 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Exact and Approximation Algorithms for

DNA Tag Set Design

Ion Mandoiu and Dragos Trinca

Computer Science & Engineering Department

University of Connecticut

Page 2: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Overview

Background on universal tag arrays Tag set design problem

- c-h code formulation- Integer program and LP-approximation- Cycle packing formulation

Experimental results Conclusions

Page 3: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Watson-Crick Complementarity

• Four nucleotide types: A,C,G,T

• A’s paired with T’s (2 hydrogen bonds)

• C’s paired with G’s (3 hydrogen bonds)

Page 4: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

DNA Microarrays• Exploit Watson-Crick complementarity to simultaneously

perform a large number of substring tests

• Used in a variety of genomic analyses– Transcription (gene expression) analysis – Single Nucleotide Polymorphism (SNP) genotyping– Genomic-based microorganism identification

• Common microarray formats involve direct hybridization between labeled DNA/RNA sample and DNA probes attached to a glass slide

Page 5: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Direct Hybridization Experiment

Images courtesy of Affymetrix.

Labeled DNA/RNA sample hybridized to array of probes

Laser activation of fluorescent labels

Optical scanning used to identify

probes with complements in the

mixture

Page 6: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Limitations of Common Array Formats

• Arrays of cDNAs– Probes obtained by reverse transcription from Expressed

Sequence Tags (ESTs)

– Inexpensive, but can only be used for transcription analysis

• Oligonucleotide arrays– Short (20-60bp) synthetic DNA probes

– Flexible, but expensive unless produced in large quantities

Page 7: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Universal Tag Arrays

• [Brenner 97, Morris et al. 98] “Programmable” oligonucleotide arrays – Array consists of application independent oligonucleotides called

tags

– Two-part “reporter” probes: aplication specific primers ligated to antitags

– Detection carried by a sequence of reactions separately involving the primer and the antitag part of reporter probes

Page 8: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

+

Mix reporter probes with genomic DNA

Universal Tag Array Experiment

Solution phase hybridization

Single-Base Extension

Solid phase hybridization

Page 9: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Universal Tag Array Advantages

• Cost effective– Same array used in many analyses economies of scale

• Easy to customize– Only need to synthesize new set of reporter probes

• Reliable– Solution phase hybridization better understood than

hybridization on solid support

Page 10: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Tag Hybridization Constraints

(H1) Antitags hybridize strongly to complementary tags

(H2) No antitag hybridizes to a non-complementary tag

(H3) Antitags do not cross-hybridize to each other

t1t1 t2t2 t1 t2t1

Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H3)

Page 11: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

More Hybridization Constraints…

• Enforced during tag assignment by- Leaving some tags unassigned and distributing primers to multiple arrays [Ben-Dor et al. 03]

- Exploiting availability of multiple primer candidates [MPT05]

t1 t2t1

Page 12: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Hybridization Models: Stability

• Melting temperature models, e.g., nearest neighbor [SantaLucia 96]

• Tag weight h [Ben-Dor et al. 00]– wt(A)=wt(T)=1, wt(C)=wt(G)=2

– Equivalent to “2-4 rule” melting temperature model

• Tag length = l [Affymetrix]– Combined with additional constraints on GC-content, etc.

Page 13: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

• Hamming distance model, e.g. [Marathe et al. 01]– Models rigid DNA strands

• LCS/edit distance model, e.g., [Torney et al. 03] – Models infinitely elastic DNA strands

• c-token model [Ben-Dor et al. 00]:– Derived from nucleation complex theory: duplex formation

requires formation of nucleation complex between perfectly complementary substrings

– Nucleation complex must have weight c

Hybridization Models: Non-Interaction

Page 14: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

c-h Code Problem

• c-token: left-minimal DNA string of weight c, i.e.,– w(x) c

– w(x’) < c for every proper suffix x’ of x

• A set of tags is called c-h code if(C1) Every tag has weight h

(C2) Every c-token is used at most once

• c-h code problem: given c and h, find maximum cardinality c-h code

Page 15: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Previous Work

• [Ben-Dor et al.00]- Constructive upper-bound on c-h code size based on token tail-weight- Approximation algorithm based on DeBruijn sequences

• [MPT05]- Upper-bound for c-h codes including antitag-to-antitag hybridization constraints- Simple alphabetic tree search heuristic

Page 16: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Token Content of a Tag

c=4 CCAGATT CC CCA CAG AGA GAT GATT

Tag sequence of c-tokensEnd pos: 2 3 4 5 6 7 c-token: CCCCACAGAGAGATGATT

Page 17: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Layered c-token graph

st

c1

cN

ll-1c/2 (c/2)+1 …

Page 18: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Integer Program Formulation

• Maximum integer flow problem w/ set capacity constraints

•O(lN)/O(hN) constraints & variables, where N = #c-tokens

Page 19: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Number of c-tokensc # c-tokens

5 208

6 568

7 1552

8 4240

9 11584

10 31648

Page 20: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Packing ILP Formulation path every for variable0/1 ps-txp

Page 21: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Garg-Konemann Algorithm1. x 0; y // yi are variables of the dual LP

2. Find min weight s-t path p, where weight(v) = yi for every vVi

3. While weight(p) < 1 do

M maxi |p Vi|

xp xp + 1/M

For every i, yi yi( 1 + * |p Vi|/M )

Find min weight s-t path p, where weight(v) = yi for vVi

4. For every p, xp xp / (1 - log1+)

[GK98] The algorithm computes a factor (1- )2 approximation to the optimal LP solution with (N/)* log1+N shortest path computations

Page 22: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

LP Based c-h Code Construction

1. Run Garg-Konemann and store the minimum weight paths in a list

2. Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags

3. Mark used c-tokens and run the alphabetic tree search algorithm to select additional tags

Page 23: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Experimental Results (h=15)

Page 24: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Periodic Tags

• Key observation: c-token uniqueness constraint in c-h code formulation is too strong– A c-token should not appear in two different tags, but

can be repeated within a tag!

• A tag t is called periodic if it is the prefix of () for some – Periodic strings make better use of c-tokens (t uses at

most || c-tokens)

Page 25: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

c-token factor graph, c=4 (incomplete)

CC

AAG AAC

AAAA

AAAT

Page 26: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Cycle Packing Problem

• Vertex-Disjoint Cycle Packing Problem: Given directed graph G, find maximum number of vertex disjoint directed cycles in G

Theorem 1: APX-hard even for regular directed graphs with in-degree and out-degree 2

• [Salavatipour and Verstraete 05] For general graphs:– Quasi-NP-hard to approximate within (log1- n)

– O(n1/2) approximation algorithm

Page 27: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Tag Set Design Algorithm1. Construct c-token factor graph G

2. T{}

3. For all cycles C defining periodic tags, in increasing

order of cycle length, do

• Add to T the tag defined by C

• Remove C from G

4. Perform an alphabetic tree search and add to T tags

with no c-tokens in common with T

5. Return T

Page 28: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Experimental Results

h

Page 29: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Antitag-to-Antitag Hybridization

• Formalization in c-token hybridization model:(C3) No two (anti)tags contain complementary substrings of

weight c

• Cycle packing and tree search extend easily

Page 30: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Results w/ Extended Constraints

Page 31: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Herpes B Gene Expression Assay

Tm # poolsPool size

500 tags 1000 tags 2000 tags

# arrays % Util. # arrays % Util. # arrays % Util.

60 14461 4 94.06 2 97.20 1 72.30

5 4 96.13 2 100.00 1 72.30

67 15601 4 96.53 2 98.70 1 78.00

5 4 98.00 2 99.90 1 78.00

70 15221 4 96.73 2 98.90 1 76.10

5 4 97.80 2 99.80 1 76.10

Tm # poolsPool size

500 tags 1000 tags 2000 tags

# arrays % Util. # arrays % Util. # arrays % Util.

60 14461 4 82.26 3 65.35 2 57.05

5 4 88.26 3 70.95 2 63.55

67 15601 4 86.33 3 69.70 2 61.15

5 4 91.86 3 76.00 2 67.20

70 15221 4 88.46 3 73.65 2 65.40

5 4 92.26 2 91.10 2 70.30

GenFlex Tags

Periodic Tags

Page 32: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Conclusions• Use of periodic tags yields significant increase in tag

set size and enables higher multiplexing rates in tag assignment

• Algorithms extend to more accurate hybridization models– Monotonic melting temperature c-tokens factor graph

• Other applications of non-interacting DNA tag sets – Lab-on-chip, DNA-mediated assembly, DNA computing

[Brenneman&Condon 02]

• Open problems– Settle approximation complexity of disjoint cycle packing– Improved tag set design algorithms?

Page 33: Exact and Approximation Algorithms for DNA Tag Set Design Ion Mandoiu and Dragos Trinca Computer Science & Engineering Department University of Connecticut

Acknowledgments

• UCONN Research Foundation