exact and approximation algorithms for dna tag set design ion mandoiu and dragos trinca computer...
Post on 21-Dec-2015
214 views
TRANSCRIPT
Exact and Approximation Algorithms for
DNA Tag Set Design
Ion Mandoiu and Dragos Trinca
Computer Science & Engineering Department
University of Connecticut
Overview
Background on universal tag arrays Tag set design problem
- c-h code formulation- Integer program and LP-approximation- Cycle packing formulation
Experimental results Conclusions
Watson-Crick Complementarity
• Four nucleotide types: A,C,G,T
• A’s paired with T’s (2 hydrogen bonds)
• C’s paired with G’s (3 hydrogen bonds)
DNA Microarrays• Exploit Watson-Crick complementarity to simultaneously
perform a large number of substring tests
• Used in a variety of genomic analyses– Transcription (gene expression) analysis – Single Nucleotide Polymorphism (SNP) genotyping– Genomic-based microorganism identification
• Common microarray formats involve direct hybridization between labeled DNA/RNA sample and DNA probes attached to a glass slide
Direct Hybridization Experiment
Images courtesy of Affymetrix.
Labeled DNA/RNA sample hybridized to array of probes
Laser activation of fluorescent labels
Optical scanning used to identify
probes with complements in the
mixture
Limitations of Common Array Formats
• Arrays of cDNAs– Probes obtained by reverse transcription from Expressed
Sequence Tags (ESTs)
– Inexpensive, but can only be used for transcription analysis
• Oligonucleotide arrays– Short (20-60bp) synthetic DNA probes
– Flexible, but expensive unless produced in large quantities
Universal Tag Arrays
• [Brenner 97, Morris et al. 98] “Programmable” oligonucleotide arrays – Array consists of application independent oligonucleotides called
tags
– Two-part “reporter” probes: aplication specific primers ligated to antitags
– Detection carried by a sequence of reactions separately involving the primer and the antitag part of reporter probes
+
Mix reporter probes with genomic DNA
Universal Tag Array Experiment
Solution phase hybridization
Single-Base Extension
Solid phase hybridization
Universal Tag Array Advantages
• Cost effective– Same array used in many analyses economies of scale
• Easy to customize– Only need to synthesize new set of reporter probes
• Reliable– Solution phase hybridization better understood than
hybridization on solid support
Tag Hybridization Constraints
(H1) Antitags hybridize strongly to complementary tags
(H2) No antitag hybridizes to a non-complementary tag
(H3) Antitags do not cross-hybridize to each other
t1t1 t2t2 t1 t2t1
Tag Set Design Problem: Find a maximum cardinality set of tags satisfying (H1)-(H3)
More Hybridization Constraints…
• Enforced during tag assignment by- Leaving some tags unassigned and distributing primers to multiple arrays [Ben-Dor et al. 03]
- Exploiting availability of multiple primer candidates [MPT05]
t1 t2t1
Hybridization Models: Stability
• Melting temperature models, e.g., nearest neighbor [SantaLucia 96]
• Tag weight h [Ben-Dor et al. 00]– wt(A)=wt(T)=1, wt(C)=wt(G)=2
– Equivalent to “2-4 rule” melting temperature model
• Tag length = l [Affymetrix]– Combined with additional constraints on GC-content, etc.
• Hamming distance model, e.g. [Marathe et al. 01]– Models rigid DNA strands
• LCS/edit distance model, e.g., [Torney et al. 03] – Models infinitely elastic DNA strands
• c-token model [Ben-Dor et al. 00]:– Derived from nucleation complex theory: duplex formation
requires formation of nucleation complex between perfectly complementary substrings
– Nucleation complex must have weight c
Hybridization Models: Non-Interaction
c-h Code Problem
• c-token: left-minimal DNA string of weight c, i.e.,– w(x) c
– w(x’) < c for every proper suffix x’ of x
• A set of tags is called c-h code if(C1) Every tag has weight h
(C2) Every c-token is used at most once
• c-h code problem: given c and h, find maximum cardinality c-h code
Previous Work
• [Ben-Dor et al.00]- Constructive upper-bound on c-h code size based on token tail-weight- Approximation algorithm based on DeBruijn sequences
• [MPT05]- Upper-bound for c-h codes including antitag-to-antitag hybridization constraints- Simple alphabetic tree search heuristic
Token Content of a Tag
c=4 CCAGATT CC CCA CAG AGA GAT GATT
Tag sequence of c-tokensEnd pos: 2 3 4 5 6 7 c-token: CCCCACAGAGAGATGATT
Layered c-token graph
st
c1
cN
ll-1c/2 (c/2)+1 …
Integer Program Formulation
• Maximum integer flow problem w/ set capacity constraints
•O(lN)/O(hN) constraints & variables, where N = #c-tokens
Number of c-tokensc # c-tokens
5 208
6 568
7 1552
8 4240
9 11584
10 31648
Packing ILP Formulation path every for variable0/1 ps-txp
Garg-Konemann Algorithm1. x 0; y // yi are variables of the dual LP
2. Find min weight s-t path p, where weight(v) = yi for every vVi
3. While weight(p) < 1 do
M maxi |p Vi|
xp xp + 1/M
For every i, yi yi( 1 + * |p Vi|/M )
Find min weight s-t path p, where weight(v) = yi for vVi
4. For every p, xp xp / (1 - log1+)
[GK98] The algorithm computes a factor (1- )2 approximation to the optimal LP solution with (N/)* log1+N shortest path computations
LP Based c-h Code Construction
1. Run Garg-Konemann and store the minimum weight paths in a list
2. Traversing the list in reverse order, pick tags corresponding to paths if they are feasible and do not share c-tokens with already selected tags
3. Mark used c-tokens and run the alphabetic tree search algorithm to select additional tags
Experimental Results (h=15)
Periodic Tags
• Key observation: c-token uniqueness constraint in c-h code formulation is too strong– A c-token should not appear in two different tags, but
can be repeated within a tag!
• A tag t is called periodic if it is the prefix of () for some – Periodic strings make better use of c-tokens (t uses at
most || c-tokens)
c-token factor graph, c=4 (incomplete)
CC
AAG AAC
AAAA
AAAT
Cycle Packing Problem
• Vertex-Disjoint Cycle Packing Problem: Given directed graph G, find maximum number of vertex disjoint directed cycles in G
Theorem 1: APX-hard even for regular directed graphs with in-degree and out-degree 2
• [Salavatipour and Verstraete 05] For general graphs:– Quasi-NP-hard to approximate within (log1- n)
– O(n1/2) approximation algorithm
Tag Set Design Algorithm1. Construct c-token factor graph G
2. T{}
3. For all cycles C defining periodic tags, in increasing
order of cycle length, do
• Add to T the tag defined by C
• Remove C from G
4. Perform an alphabetic tree search and add to T tags
with no c-tokens in common with T
5. Return T
Experimental Results
h
Antitag-to-Antitag Hybridization
• Formalization in c-token hybridization model:(C3) No two (anti)tags contain complementary substrings of
weight c
• Cycle packing and tree search extend easily
Results w/ Extended Constraints
Herpes B Gene Expression Assay
Tm # poolsPool size
500 tags 1000 tags 2000 tags
# arrays % Util. # arrays % Util. # arrays % Util.
60 14461 4 94.06 2 97.20 1 72.30
5 4 96.13 2 100.00 1 72.30
67 15601 4 96.53 2 98.70 1 78.00
5 4 98.00 2 99.90 1 78.00
70 15221 4 96.73 2 98.90 1 76.10
5 4 97.80 2 99.80 1 76.10
Tm # poolsPool size
500 tags 1000 tags 2000 tags
# arrays % Util. # arrays % Util. # arrays % Util.
60 14461 4 82.26 3 65.35 2 57.05
5 4 88.26 3 70.95 2 63.55
67 15601 4 86.33 3 69.70 2 61.15
5 4 91.86 3 76.00 2 67.20
70 15221 4 88.46 3 73.65 2 65.40
5 4 92.26 2 91.10 2 70.30
GenFlex Tags
Periodic Tags
Conclusions• Use of periodic tags yields significant increase in tag
set size and enables higher multiplexing rates in tag assignment
• Algorithms extend to more accurate hybridization models– Monotonic melting temperature c-tokens factor graph
• Other applications of non-interacting DNA tag sets – Lab-on-chip, DNA-mediated assembly, DNA computing
[Brenneman&Condon 02]
• Open problems– Settle approximation complexity of disjoint cycle packing– Improved tag set design algorithms?
Acknowledgments
• UCONN Research Foundation