exact and approximation algorithms for dna tag set design ion mandoiu and dragos trinca computer...

Post on 21-Dec-2015

214 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Exact and Approximation Algorithms for

DNA Tag Set Design

Ion Mandoiu and Dragos Trinca

Computer Science & Engineering Department

University of Connecticut

Overview

Background on universal tag arrays Tag set design problem

- c-h code formulation- Integer program and LP-approximation- Cycle packing formulation

Experimental results Conclusions

Watson-Crick Complementarity

• Four nucleotide types: A,C,G,T

• A’s paired with T’s (2 hydrogen bonds)

• C’s paired with G’s (3 hydrogen bonds)

DNA Microarrays• Exploit Watson-Crick complementarity to simultaneously

perform a large number of substring tests

• Used in a variety of genomic analyses– Transcription (gene expression) analysis – Single Nucleotide Polymorphism (SNP) genotyping– Genomic-based microorganism identification

• Common microarray formats involve direct hybridization between labeled DNA/RNA sample and DNA probes attached to a glass slide

Direct Hybridization Experiment

Images courtesy of Affymetrix.

Labeled DNA/RNA sample hybridized to array of probes

Laser activation of fluorescent labels

Optical scanning used to identify

probes with complements in the

mixture

Limitations of Common Array Formats

• Arrays of cDNAs– Probes obtained by reverse transcription from Expressed

Sequence Tags (ESTs)

– Inexpensive, but can only be used for transcription analysis

• Oligonucleotide arrays– Short (20-60bp) synthetic DNA probes

– Flexible, but expensive unless produced in large quantities

Universal Tag Arrays

• [Brenner 97, Morris et al. 98] “Programmable” oligonucleotide arrays – Array consists of application independent oligonucleotides called

tags

– Two-part “reporter” probes: aplication specific primers ligated to antitags

– Detection carried by a sequence of reactions separately involving the primer and the antitag part of reporter probes

+

Mix reporter probes with genomic DNA

Universal Tag Array Experiment

Solution phase hybridization

Single-Base Extension

Solid phase hybridization

Universal Tag Array Advantages

• Cost effective– Same array used in many analyses economies of scale

• Easy to customize– Only need to synthesize new set of reporter probes

• Reliable– Solution phase hybridization better understood than

hybridization on solid support

Tag Hybridization Constraints

(H1) Antitags hybridize strongly to complementary tags

(H2) No antitag hybridizes to a non-complementary tag

(H3) Antitags do not cross-hybridize to each other

t1t1 t2t2 t1 t2t1

Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H3)

More Hybridization Constraints…

• Enforced during tag assignment by- Leaving some tags unassigned and distributing primers to multiple arrays [Ben-Dor et al. 03]

- Exploiting availability of multiple primer candidates [MPT05]

t1 t2t1

Hybridization Models: Stability

• Melting temperature models, e.g., nearest neighbor [SantaLucia 96]

• Tag weight h [Ben-Dor et al. 00]– wt(A)=wt(T)=1, wt(C)=wt(G)=2

– Equivalent to “2-4 rule” melting temperature model

• Tag length = l [Affymetrix]– Combined with additional constraints on GC-content, etc.

• Hamming distance model, e.g. [Marathe et al. 01]– Models rigid DNA strands

• LCS/edit distance model, e.g., [Torney et al. 03] – Models infinitely elastic DNA strands

• c-token model [Ben-Dor et al. 00]:– Derived from nucleation complex theory: duplex formation

requires formation of nucleation complex between perfectly complementary substrings

– Nucleation complex must have weight c

Hybridization Models: Non-Interaction

c-h Code Problem

• c-token: left-minimal DNA string of weight c, i.e.,– w(x) c

– w(x’) < c for every proper suffix x’ of x

• A set of tags is called c-h code if(C1) Every tag has weight h

(C2) Every c-token is used at most once

• c-h code problem: given c and h, find maximum cardinality c-h code

Previous Work

• [Ben-Dor et al.00]- Constructive upper-bound on c-h code size based on token tail-weight- Approximation algorithm based on DeBruijn sequences

• [MPT05]- Upper-bound for c-h codes including antitag-to-antitag hybridization constraints- Simple alphabetic tree search heuristic

Token Content of a Tag

c=4 CCAGATT CC CCA CAG AGA GAT GATT

Tag sequence of c-tokensEnd pos: 2 3 4 5 6 7 c-token: CCCCACAGAGAGATGATT

Layered c-token graph

st

c1

cN

ll-1c/2 (c/2)+1 …

Integer Program Formulation

• Maximum integer flow problem w/ set capacity constraints

•O(lN)/O(hN) constraints & variables, where N = #c-tokens

Number of c-tokensc # c-tokens

5 208

6 568

7 1552

8 4240

9 11584

10 31648

Packing ILP Formulation path every for variable0/1 ps-txp

Garg-Konemann Algorithm1. x 0; y // yi are variables of the dual LP

2. Find min weight s-t path p, where weight(v) = yi for every vVi

3. While weight(p) < 1 do

M maxi |p Vi|

xp xp + 1/M

For every i, yi yi( 1 + * |p Vi|/M )

Find min weight s-t path p, where weight(v) = yi for vVi

4. For every p, xp xp / (1 - log1+)

[GK98] The algorithm computes a factor (1- )2 approximation to the optimal LP solution with (N/)* log1+N shortest path computations

LP Based c-h Code Construction

1. Run Garg-Konemann and store the minimum weight paths in a list

2. Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags

3. Mark used c-tokens and run the alphabetic tree search algorithm to select additional tags

Experimental Results (h=15)

Periodic Tags

• Key observation: c-token uniqueness constraint in c-h code formulation is too strong– A c-token should not appear in two different tags, but

can be repeated within a tag!

• A tag t is called periodic if it is the prefix of () for some – Periodic strings make better use of c-tokens (t uses at

most || c-tokens)

c-token factor graph, c=4 (incomplete)

CC

AAG AAC

AAAA

AAAT

Cycle Packing Problem

• Vertex-Disjoint Cycle Packing Problem: Given directed graph G, find maximum number of vertex disjoint directed cycles in G

Theorem 1: APX-hard even for regular directed graphs with in-degree and out-degree 2

• [Salavatipour and Verstraete 05] For general graphs:– Quasi-NP-hard to approximate within (log1- n)

– O(n1/2) approximation algorithm

Tag Set Design Algorithm1. Construct c-token factor graph G

2. T{}

3. For all cycles C defining periodic tags, in increasing

order of cycle length, do

• Add to T the tag defined by C

• Remove C from G

4. Perform an alphabetic tree search and add to T tags

with no c-tokens in common with T

5. Return T

Experimental Results

h

Antitag-to-Antitag Hybridization

• Formalization in c-token hybridization model:(C3) No two (anti)tags contain complementary substrings of

weight c

• Cycle packing and tree search extend easily

Results w/ Extended Constraints

Herpes B Gene Expression Assay

Tm # poolsPool size

500 tags 1000 tags 2000 tags

# arrays % Util. # arrays % Util. # arrays % Util.

60 14461 4 94.06 2 97.20 1 72.30

5 4 96.13 2 100.00 1 72.30

67 15601 4 96.53 2 98.70 1 78.00

5 4 98.00 2 99.90 1 78.00

70 15221 4 96.73 2 98.90 1 76.10

5 4 97.80 2 99.80 1 76.10

Tm # poolsPool size

500 tags 1000 tags 2000 tags

# arrays % Util. # arrays % Util. # arrays % Util.

60 14461 4 82.26 3 65.35 2 57.05

5 4 88.26 3 70.95 2 63.55

67 15601 4 86.33 3 69.70 2 61.15

5 4 91.86 3 76.00 2 67.20

70 15221 4 88.46 3 73.65 2 65.40

5 4 92.26 2 91.10 2 70.30

GenFlex Tags

Periodic Tags

Conclusions• Use of periodic tags yields significant increase in tag

set size and enables higher multiplexing rates in tag assignment

• Algorithms extend to more accurate hybridization models– Monotonic melting temperature c-tokens factor graph

• Other applications of non-interacting DNA tag sets – Lab-on-chip, DNA-mediated assembly, DNA computing

[Brenneman&Condon 02]

• Open problems– Settle approximation complexity of disjoint cycle packing– Improved tag set design algorithms?

Acknowledgments

• UCONN Research Foundation

top related