dynamic programming: one algorithmic key to many biological locks

Dynamic programming: one algorithmic key to many biological locks. Mikhail Gelfand, RTCB, IITP, RAS and FBB, MSU, 2010-2011


Dynamic programming: one algorithmic key to many biological locks. Mikhail Gelfand, RTCB, IITP, RAS and FBB, MSU, 2010-2011. BIOINFORMATICS FOR BIOLOGISTS, Pavel Pevzner and Ron Shamir, eds. (Cambridge University Press, 2011)


Page 1: Dynamic programming: one algorithmic key to many biological locks

Dynamic programming: one algorithmic key to many biological locks

Mikhail Gelfand

RTCB, IITP, RAS and FBB, MSU

2010-2011

Page 2: Dynamic programming: one algorithmic key to many biological locks

BIOINFORMATICS FOR BIOLOGISTS

Pavel Pevzner and Ron Shamir, eds. (Cambridge University Press, 2011)

Ch. 4. DYNAMIC PROGRAMMING: ONE ALGORITHMIC KEY FOR MANY BIOLOGICAL LOCKS

Mikhail Gelfand

Research and Training Center “Bioinformatics” of the Institute for Information Transmission Problems, RAS, and Faculty of Bioengineering and Bioinformatics, M.V. Lomonosov Moscow State University

Page 3: Dynamic programming: one algorithmic key to many biological locks

Alignment

Three (of many) alignments of two sequences. Plus denotes a match; dot, a mismatch, minus, a gap. (a) Two matches, five mismatches, (b) three matches, one mismatch, two gaps of size three (six indels, that is one-nucleotide insertions/deletions), (c) four matches, two gaps of size three (six indels).

Page 4: Dynamic programming: one algorithmic key to many biological locks

The number of alignments is large

# of alignments of two sequences of length N ~ $(1+\sqrt{2})^{2N+1}/\sqrt{N}$

at N = 1000 # ≈ $10^{767}$

# of elementary particles in the Universe ≈ $10^{80}$

at N = 100 # ≈ $10^{76}$

assume 1 operation per alignment, $10^{12}$ operations per second

=> need $10^{57}$ years

=> we cannot consider them one by one

Page 5: Dynamic programming: one algorithmic key to many biological locks

Gene recognition

Segmentation of a genomic fragment into protein-coding and non-coding regions, based on differences in the statistical properties of these regions.

Difficult in eukaryotes due to the existence of introns, non-coding regions within genes.

Page 6: Dynamic programming: one algorithmic key to many biological locks

Toy example

How many operations are needed to calculate

∑i=1…m, j=1…n xi∙yj =

= x1∙y1 + x1∙y2 + … + x1∙yn +

+ x2∙y1 + x2∙y2 + … + x2∙yn +

+ … +

+ xm∙y1 + xm∙y2 + … + xm∙yn

Naïve answer: mn multiplications and mn–1 additions

Page 7: Dynamic programming: one algorithmic key to many biological locks

but rewrite as…

(x1 + x2 + … + xm) ∙ (y1 + y2 + … + yn) =

= ∑i=1…m xi ∙ ∑j=1…n yj

and it becomes m+n–2 additions and just 1 multiplication
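The saving can be checked on a small input. The sketch below is only an illustration: the function names `naive_sum` and `factored_sum` are made up here, and each function returns the value together with the number of multiplications it performed.

```python
def naive_sum(xs, ys):
    """Sum of all products x_i * y_j, computed term by term."""
    total = 0
    multiplications = 0
    for x in xs:
        for y in ys:
            total += x * y
            multiplications += 1
    return total, multiplications


def factored_sum(xs, ys):
    """Same value via (x_1 + ... + x_m) * (y_1 + ... + y_n): one multiplication."""
    return sum(xs) * sum(ys), 1


xs = [1, 2, 3]
ys = [4, 5]
print(naive_sum(xs, ys))     # (54, 6)
print(factored_sum(xs, ys))  # (54, 1)
```

Both calls return the same value; only the order of calculation, and hence the operation count, differs.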

Page 8: Dynamic programming: one algorithmic key to many biological locks

Quiz

How many multiplications do we need to calculate

$x_1^{y_1} \cdot x_1^{y_2} \cdot \ldots \cdot x_1^{y_n} \cdot x_2^{y_1} \cdot x_2^{y_2} \cdot \ldots \cdot x_2^{y_n} \cdot \ldots \cdot x_m^{y_1} \cdot x_m^{y_2} \cdot \ldots \cdot x_m^{y_n} = \prod_{i=1}^{m} \prod_{j=1}^{n} x_i^{y_j}$

if we are (a) naïve? (b) sophisticated? (c) What if, in addition to multiplication, we have an operation “taking to the power”? (d) What if we may perform not only multiplication, but also addition?

Page 9: Dynamic programming: one algorithmic key to many biological locks

Lesson

Restructuring the order of calculations using properties of the data may sharply decrease the number of operations

Page 10: Dynamic programming: one algorithmic key to many biological locks

Graphs

Vertices/nodes: v_1, v_2, …, v_n

Arcs/edges: directed pairs of vertices, a_m(v_i, v_j)

[Figure: an example graph that contains cycles and has multiple sources and sinks]

Page 11: Dynamic programming: one algorithmic key to many biological locks

“Bad” graphs and non-graphs

[Figure: examples labeled “multiple arcs”, “loop”, “multiple components”, “not a graph (hanging arc)”, and “undirected graph”]

Page 12: Dynamic programming: one algorithmic key to many biological locks

Sources, sinks, paths, cycles

Source is a vertex that is not an end vertex for any arc.

Sink is a vertex that is not a start vertex for any arc.

Walk w of length N is an ordered set of N arcs w = (a_1, …, a_N) such that the end vertex of arc a_n = (b_n, e_n) coincides with the start vertex of arc a_{n+1}, that is, e_n = b_{n+1}, for all n = 1, …, N–1.

[Figure: three example graphs, labeled “no source and sink”, “multiple sources and sinks”, and “one source and one sink”, with example walks:]

w = (a(v1,v3), a(v3,v2), a(v2,v4), a(v4,v3), a(v3,v1), a(v1,v3))

w = (a(v4,v5), a(v5,v3))

w = (a(v2,v1))

Page 13: Dynamic programming: one algorithmic key to many biological locks

Sources, sinks, paths, cycles

In a graph without loops and multiple arcs, each walk may also be defined as an ordered set of vertices w = (v_1, …, v_{N+1}) such that for each pair of adjacent vertices v_n, v_{n+1} there is an arc a_n = (v_n, v_{n+1}), n = 1, …, N.

[Figure: the same three example graphs, labeled “no source and sink”, “multiple sources and sinks”, and “one source and one sink”, with example walks:]

w = (v1,v3,v2,v4,v3,v1,v3)

w = (v4,v5,v3)

w = (v2,v1)

Page 14: Dynamic programming: one algorithmic key to many biological locks

Sources, sinks, paths, cycles

A path is a walk in which no arc is passed twice.

A cycle is a path in which the end vertex of the last arc a_N coincides with the start vertex of the first arc a_1, that is, e_N = b_1.

An acyclic graph contains no cycles.

[Figure: the same three example graphs, labeled “no source and sink”, “multiple sources and sinks”, and “one source and one sink”; two of them are marked as acyclic graphs and one as a cyclic graph. Example paths:]

p = c = (v1,v3,v2,v4,v3,v1)

p = (v4,v5,v3)

p = (v2,v1)

Page 15: Dynamic programming: one algorithmic key to many biological locks

Quiz

(a) Draw all acyclic connected oriented graphs with three vertices (up to vertex labels).

(b) How many oriented graphs will there be if we label vertices with symbols A, B and C?

(c) Prove that in an acyclic graph there is at least one source and at least one sink.

(d) Draw sinks and sources in the graphs of (a).

Page 16: Dynamic programming: one algorithmic key to many biological locks

Problem

Consider an acyclic graph with one source and one sink. Assign to each arc a number called its weight. For a given path, its path score is defined as the sum of the weights of its arcs.

Given a weighted acyclic graph, find the highest scoring path from the source to the sink.

Page 17: Dynamic programming: one algorithmic key to many biological locks

Observation

If two subpaths P and Q end at the same vertex v, and the score of P is larger than the score of Q, then for all pairs of paths P* and Q* that start with P and Q, respectively, and coincide after v, the score of P* is higher than the score of Q*.

Hence, we do not need to consider all paths, as it is sufficient to construct the highest scoring subpath from the source to each vertex, finishing at the sink.

[Figure: subpaths P and Q meet at vertex v and continue as P* and Q*; P > Q implies P* > Q*]

Page 18: Dynamic programming: one algorithmic key to many biological locks

Let’s do it for this graph

[Figure: the example weighted acyclic graph with one source and one sink]

Page 19: Dynamic programming: one algorithmic key to many biological locks

[Figure: Step 1 and Step 2 of the forward process on the example graph, with the vertex scores computed so far]

Page 20: Dynamic programming: one algorithmic key to many biological locks

[Figure: Step 3 and Step 4 of the forward process on the example graph]

Page 21: Dynamic programming: one algorithmic key to many biological locks

[Figure: Step 5 and Step 6 of the forward process on the example graph]

Page 22: Dynamic programming: one algorithmic key to many biological locks

[Figure: Step 7 and Step 8 of the forward process on the example graph]

Page 23: Dynamic programming: one algorithmic key to many biological locks

[Figure: Step 9 of the forward process, followed by backtracing from the sink along the marked arcs]

Page 24: Dynamic programming: one algorithmic key to many biological locks

Quiz

At what steps did we have more than one vertex with all incoming arcs processed?

Page 25: Dynamic programming: one algorithmic key to many biological locks

Algorithm

Data types and definitions:

vertices: v, u, Source, Sink;

arcs: (v,u), a;

start vertex of arc a: Beginning_vertex(a);

weight of arc (v,u): W(v,u);

path: BestPath; // defined as a set of arcs

the highest score of a subpath ending at v: Score(v);

the highest score of a subpath coming through (v,u) and ending at u: Top_score(v,u);

the last arc of the highest scoring subpath ending at u: Last_arc(u).

Page 26: Dynamic programming: one algorithmic key to many biological locks

Initialize:
  for each vertex v: Score(v) := minus_infinity;
  Score(Source) := 0.   // the empty subpath ending at the source
Forward process:
  while there are unprocessed vertices:
    v := arbitrary unprocessed vertex with all incoming arcs processed;
    for each arc (v,u):   // consider all arcs starting at v
      Top_score(v,u) := Score(v) + W(v,u);
      if Top_score(v,u) > Score(u)   // subpath coming through v is better than the current best subpath ending at u
      then:   // update the data for u
        Score(u) := Top_score(v,u);
        Last_arc(u) := (v,u);
      endif;
      mark (v,u) as processed;
    endfor;
    mark v as processed;
  endwhile.
Backtracing:
  BestPath := empty_set;   // initialize
  v := Sink;   // go from the sink backwards by marked arcs
  until v = Source:
    add Last_arc(v) to BestPath;   // add the last arc of the best path ending at the current vertex
    v := Beginning_vertex(Last_arc(v));   // go to the start vertex of this arc
  enduntil.
Output BestPath.
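The forward process and backtracing can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name `best_path`, the representation of arcs as `(v, u, weight)` triples, and the small example graph are all assumptions made here. The sketch also assumes an acyclic graph in which the sink is reachable from the source, and it sets the score of the source to 0 before the forward process starts.

```python
from collections import defaultdict


def best_path(arcs, source, sink):
    """Highest-scoring path in a weighted acyclic graph.

    arcs: list of (v, u, weight) triples.
    Returns (score, list of vertices from source to sink).
    """
    outgoing = defaultdict(list)
    unprocessed_in = defaultdict(int)   # number of unprocessed incoming arcs
    vertices = {source, sink}
    for v, u, w in arcs:
        outgoing[v].append((u, w))
        unprocessed_in[u] += 1
        vertices.update((v, u))

    score = {v: float("-inf") for v in vertices}
    score[source] = 0
    last = {}                           # last[u]: predecessor of u on its best subpath

    # Forward process: take any vertex whose incoming arcs are all processed.
    ready = [v for v in vertices if unprocessed_in[v] == 0]
    while ready:
        v = ready.pop()
        for u, w in outgoing[v]:
            if score[v] + w > score[u]:
                score[u] = score[v] + w
                last[u] = v
            unprocessed_in[u] -= 1
            if unprocessed_in[u] == 0:
                ready.append(u)

    # Backtracing: follow the stored predecessors from the sink to the source.
    path = [sink]
    while path[-1] != source:
        path.append(last[path[-1]])
    return score[sink], path[::-1]


# Hypothetical example graph (not the one from the slides).
arcs = [("s", "a", 2), ("s", "b", 5), ("a", "b", 4), ("a", "t", 1), ("b", "t", 3)]
print(best_path(arcs, "s", "t"))  # (9, ['s', 'a', 'b', 't'])
```

Each arc is examined exactly once in the inner loop, which matches the idea that the work grows with the number of arcs.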

Page 27: Dynamic programming: one algorithmic key to many biological locks

The number of operations

The limiting procedure is processing vertices and adding arcs to paths, and we consider each arc only once

Hence the number of operations is linear in the number of arcs A: the run time of the algorithm is O (A)

Page 28: Dynamic programming: one algorithmic key to many biological locks

Greedy algorithm

Start at the source and select the highest-weighted arc at each step.

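It does not work: on the example graph the greedy path scores 13, whereas the highest score is 20 (13 < 20).

[Figure: the example graph with the path chosen by the greedy algorithm]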
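The failure of the greedy strategy is easy to reproduce. The sketch below uses a hypothetical four-vertex graph (not the graph from the slides) stored as a dictionary of outgoing arcs; `greedy_path` and `best_score` are names invented for this illustration, and the code assumes every vertex can reach the sink.

```python
def greedy_path(outgoing, source, sink):
    """Follow the heaviest outgoing arc at every step."""
    v, score, path = source, 0, [source]
    while v != sink:
        u, w = max(outgoing[v], key=lambda arc: arc[1])
        score += w
        path.append(u)
        v = u
    return score, path


def best_score(outgoing, v, sink):
    """Exhaustive best score from v to the sink (acceptable for a tiny graph)."""
    if v == sink:
        return 0
    return max(w + best_score(outgoing, u, sink) for u, w in outgoing[v])


outgoing = {
    "s": [("a", 5), ("b", 2)],
    "a": [("t", 1)],
    "b": [("t", 10)],
}
print(greedy_path(outgoing, "s", "t"))  # (6, ['s', 'a', 't'])
print(best_score(outgoing, "s", "t"))   # 12
```

The greedy choice of the arc with weight 5 locks the path out of the arc with weight 10, so the locally best step is not globally best.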

Page 29: Dynamic programming: one algorithmic key to many biological locks

Quiz

(a) Construct the simplest possible graph in which the greedy algorithm yields the highest scoring path.

(b) Construct a graph with three vertices in which the greedy algorithm does not yield the highest scoring path.

(c) Construct a graph with three vertices in which the greedy algorithm does yield the highest scoring path.

(d) Assign new weights to the arcs of the above graph so that the greedy algorithm will yield the highest scoring path.

Page 30: Dynamic programming: one algorithmic key to many biological locks

Quiz cont’d

(e) Write an algorithm for construction of the path with the maximum number of arcs and apply it to the above graph.

Hint: do not change the algorithm, set proper arc weights.

(f) Modify the maximum score algorithm so as to construct the path with the minimal score and find this path for the above graph.

(g) Provide a greedy algorithm for finding the path of minimal score in a graph, and apply it to the above graph.

(h) For the above graph, find the path with the minimal number of arcs.

Page 31: Dynamic programming: one algorithmic key to many biological locks

Lesson

The generic dynamic programming algorithm may be applied to different problems. The common feature of these problems is that each one can be decomposed into an ordered set of smaller subproblems, and to solve a more complex subproblem one needs to know only the solutions of the simpler ones, but not the entire set of possibilities.

Page 32: Dynamic programming: one algorithmic key to many biological locks

Note

There exist path optimization problems that cannot be solved efficiently by dynamic programming.

Traveling salesman problem. Given a non-oriented graph with weighted arcs, we need to construct the lowest scoring path passing through all the vertices (the salesman needs to visit all cities with travel time between the cities given by the arc weights, while spending the least amount of time traveling).

All cities need to be visited in a single trip => NP-complete problem.

No efficient algorithms are known. Most computer scientists believe that for all NP-complete problems the number of operations required to provide an optimal solution is exponential in the problem size.

Page 33: Dynamic programming: one algorithmic key to many biological locks

Alignment

Given two symbol sequences (nucleotides or amino acids) of lengths M and N, set a correspondence between these sequences so that some symbols are set in pairs, matching or mismatching, whereas other symbols are ignored (indels). The order of corresponding symbols in the subsequences should coincide.

The alignment score is the sum of match premiums r per matching pair minus the sum of mismatch penalties p per mismatching pair and deletion penalties q per ignored symbol.

The goal is to construct the highest scoring alignment.
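The scoring rule can be written down directly. In this sketch an alignment is represented, as an assumption of the example, by two gapped strings of equal length with `-` marking an ignored symbol; the function name `alignment_score` and the sample sequences are hypothetical.

```python
def alignment_score(top, bottom, r, p, q):
    """Score of an alignment given as two equal-length gapped strings.

    r: match premium, p: mismatch penalty, q: penalty per ignored symbol.
    """
    if len(top) != len(bottom):
        raise ValueError("gapped sequences must have equal length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score -= q          # indel
        elif a == b:
            score += r          # matching pair
        else:
            score -= p          # mismatching pair
    return score


# Five matches, one mismatch, one indel: 5*10 - 3 - 5
print(alignment_score("GATTACA", "GA-TACC", r=10, p=3, q=5))  # 42
```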

Page 34: Dynamic programming: one algorithmic key to many biological locks

Quiz

What are the scores of the alignments

Page 35: Dynamic programming: one algorithmic key to many biological locks

Reduction to the optimal path problem

Construct a graph.

Vertices correspond to pairs of positions (endpoints of partial alignments).

Outgoing arcs (for each vertex) are of three types:

• match (weight r) or mismatch (weight (–p)); total M∙N arcs

• deletion in the 1st sequence (weight (–q)); total M∙(N+1) arcs

• deletion in the 2nd sequence (weight (–q)); total (M+1)∙N arcs
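A sketch of the resulting computation: vertices are pairs `(i, j)` of positions, and the best score at each vertex is taken over its (at most three) incoming arcs. The function name `global_alignment`, the gapped-string output, and the example sequences are assumptions of this illustration; processing the vertices row by row is one valid order in which all incoming arcs of a vertex are handled before the vertex itself.

```python
def global_alignment(s, t, r, p, q):
    """Best global alignment of s and t as a best path in the alignment graph."""
    M, N = len(s), len(t)
    score = [[float("-inf")] * (N + 1) for _ in range(M + 1)]
    back = [[None] * (N + 1) for _ in range(M + 1)]
    score[0][0] = 0  # the source vertex

    for i in range(M + 1):
        for j in range(N + 1):
            incoming = []
            if i > 0 and j > 0:  # match or mismatch arc
                w = r if s[i - 1] == t[j - 1] else -p
                incoming.append((score[i - 1][j - 1] + w, (i - 1, j - 1)))
            if i > 0:            # s[i-1] is left unpaired
                incoming.append((score[i - 1][j] - q, (i - 1, j)))
            if j > 0:            # t[j-1] is left unpaired
                incoming.append((score[i][j - 1] - q, (i, j - 1)))
            if incoming:
                score[i][j], back[i][j] = max(incoming)

    # Backtracing from the sink (M, N) to the source (0, 0).
    top, bottom = [], []
    i, j = M, N
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        top.append(s[i - 1] if pi < i else "-")
        bottom.append(t[j - 1] if pj < j else "-")
        i, j = pi, pj
    return score[M][N], "".join(reversed(top)), "".join(reversed(bottom))


print(global_alignment("ACGT", "AGT", r=10, p=3, q=5))  # (25, 'ACGT', 'A-GT')
```

When several alignments share the best score, which one is returned depends on how ties are broken; the example above has a unique optimum.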

Page 36: Dynamic programming: one algorithmic key to many biological locks

Alignment graph

[Figure: the alignment graph; columns are labeled g, e, l, a, f, n, d and rows are labeled g, a, l, a, f, n, d]

Page 37: Dynamic programming: one algorithmic key to many biological locks

Alignment graph with weights

[Figure: the same alignment graph with arc weights: r on match arcs, p on mismatch arcs, and q on deletion arcs]

Page 38: Dynamic programming: one algorithmic key to many biological locks

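Paths for the three alignments

[Figure: the alignment graph (columns g, e, l, a, f, n, d; rows g, a, l, a, f, n, d) with the paths corresponding to the three alignments]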

Page 39: Dynamic programming: one algorithmic key to many biological locks

Variants

• Hanging-end alignment (genome assembly)

– zero-weight arcs from the source to the top and left “perimeter” and from the right and bottom perimeter to the sink

• Local alignment

– zero-weight arcs from the source to all internal vertices and from internal vertices to the sink
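For local alignment, the extra zero-weight arcs translate into two small changes in the recurrence: a zero-weight arc from the source means that the best score at any vertex is never below 0, and a zero-weight arc to the sink means that the answer is the maximum over all vertices. The sketch below computes only the score; the function name `local_alignment_score` and the example sequences are hypothetical.

```python
def local_alignment_score(s, t, r, p, q):
    """Best local alignment score of s and t (score only, no backtracing)."""
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            w = r if s[i - 1] == t[j - 1] else -p
            cur[j] = max(
                0,                # zero-weight arc from the source
                prev[j - 1] + w,  # match or mismatch
                prev[j] - q,      # s[i-1] left unpaired
                cur[j - 1] - q,   # t[j-1] left unpaired
            )
            best = max(best, cur[j])  # zero-weight arc to the sink
        prev = cur
    return best


# The shared fragment ACGT gives four matches.
print(local_alignment_score("TTACGTAA", "CCACGTGG", r=10, p=3, q=5))  # 40
```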

Page 40: Dynamic programming: one algorithmic key to many biological locks

Weights

• Amino-acid substitution weight matrices

– evolutionary

  • PAM (sure alignment of closely related proteins, take matrix to the power)

  • BLOSUM (alignment of conservative regions in distantly related proteins)

– based on physical and chemical properties of residues

• Deletion penalty

– affine penalties (opening and extension penalties)

• Structural alignment as the gold standard

Page 41: Dynamic programming: one algorithmic key to many biological locks

Quiz

For the above alignments, assuming match premium r=10, what combinations of mismatch and deletion penalties would yield optimal alignments (a), (b), and (c)?

Page 42: Dynamic programming: one algorithmic key to many biological locks

Multiple alignment

• triple alignment: cubic graph

– etc.

• for K sequences of length N requires $O(N^K)$ operations

• soon becomes unworkable

• progressive alignment

– all pairwise alignments, distance matrices

– guide tree

– alignment of partial alignments

Page 43: Dynamic programming: one algorithmic key to many biological locks

Lesson

Weights matter. The same graph with differently assigned arc weights will yield different types of alignment.

Page 44: Dynamic programming: one algorithmic key to many biological locks

Gene recognition

Define a gene as a sequence fragment consisting of exons and introns.

The boundaries between them are donor sites (between exons and introns, usually GT) and acceptor sites (between introns and exons, usually AG).

Each exon and intron is assigned a weight, measuring coding affinity (respectively, non-coding affinity) of its sequence.

The gene’s score is the sum of weights of constituent exons and introns.

The goal is, given a sequence and a set of candidate donor and acceptor sites, to construct the highest-scoring exon–intron structure for a gene.

Page 45: Dynamic programming: one algorithmic key to many biological locks

Construct a graph

actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga

actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga

(a)

(b)

Page 46: Dynamic programming: one algorithmic key to many biological locks

Complexity

Assume even distribution of sites (leave out details)

=> O(L) vertices, $O(L^2)$ arcs

Can we do better?

Page 47: Dynamic programming: one algorithmic key to many biological locks

It makes sense to assume that the segment weights are additive (we assume that for exons anyhow). Then we have just O(L) arcs.

actgagactgcagacggacgtacggcactgacgtataagccccacagtccttacgtctga

actgagactgcagACGGACGTACGGCACTGACgtataagCCCCACAGTCCTTACgtctga

(a)

(b)

(a)

(b)

Page 48: Dynamic programming: one algorithmic key to many biological locks

Quiz

There are two paths in the segment graph that describe exon–intron structures not represented in the exon–intron graph. What are they? What arcs need to be added to the exon–intron graph to represent these structures?

Page 49: Dynamic programming: one algorithmic key to many biological locks

Lesson

Structure matters. The same problem may be represented by different graphs, and the conceptually simplest representation is not necessarily the most efficient one.

Page 50: Dynamic programming: one algorithmic key to many biological locks

Return to the toy problem

calculate

$\prod_{i=1}^{m} \prod_{j=1}^{n} (x_i + y_j)$

The standard trick would not work because

x∙z + y∙z = (x + y)∙z (used before) holds, but

(x+z)∙(y+z) = x∙y + z generally does not.

Quiz. When does (x+z)∙(y+z) = x∙y + z hold?

Page 51: Dynamic programming: one algorithmic key to many biological locks

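DP, generic statement. 1. Path weights

Let $\otimes$ be the operation of calculating the path score S given arc weights W. We require that the associative rule hold:

$(a \otimes b) \otimes c = a \otimes (b \otimes c)$.

Hence we can simply write $a \otimes b \otimes c$.

The path weight (formerly $S(P) = \sum_{a \in P} W(a)$) becomes $S(P) = \bigotimes_{a \in P} W(a)$.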

Page 52: Dynamic programming: one algorithmic key to many biological locks

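DP, generic statement. 2. Graph score

Let Ψ be the set of all paths and $\oplus$ the operation of selecting the path. We require that $\oplus$ obey the associative and commutative rules for combining paths:

$(a \oplus b) \oplus c = a \oplus (b \oplus c)$ and $a \oplus b = b \oplus a$.

The graph score is defined as

$\Omega = \bigoplus_{P \in \Psi} S(P)$

(for the optimal path problem, $\Omega = \max_{P \in \Psi} S(P)$).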

Page 53: Dynamic programming: one algorithmic key to many biological locks

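DP, generic statement. 3. Transitivity

To use dynamic programming, we need the distributive law for the path operation $\otimes$ and the selection operation $\oplus$:

$(a \oplus b) \otimes c = (a \otimes c) \oplus (b \otimes c)$ and $c \otimes (a \oplus b) = (c \otimes a) \oplus (c \otimes b)$.

This is a generalization of the property used for calculating the optimal path: max (x + z, y + z) = max (x, y) + z.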
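One way to see the generalization at work is to pass the two operations to the forward process as parameters. In the sketch below, `times` combines arc weights along a path, `plus` combines alternative paths, and `one` is the neutral element of `times`; these parameter names, the function name `graph_score`, and the example graph are all assumptions of this illustration. The graph is assumed to be acyclic with the sink reachable from the source.

```python
import operator


def graph_score(arcs, source, sink, times, plus, one):
    """Generic forward process over a weighted acyclic graph.

    arcs: list of (v, u, weight) triples.
    """
    outgoing, unprocessed_in = {}, {}
    for v, u, w in arcs:
        outgoing.setdefault(v, []).append((u, w))
        outgoing.setdefault(u, [])
        unprocessed_in[u] = unprocessed_in.get(u, 0) + 1
        unprocessed_in.setdefault(v, 0)

    score = {source: one}
    ready = [v for v in unprocessed_in if unprocessed_in[v] == 0]
    while ready:
        v = ready.pop()
        for u, w in outgoing[v]:
            if v in score:  # skip vertices not reachable from the source
                through_v = times(score[v], w)
                score[u] = plus(score[u], through_v) if u in score else through_v
            unprocessed_in[u] -= 1
            if unprocessed_in[u] == 0:
                ready.append(u)
    return score[sink]


# Hypothetical example graph.
arcs = [("s", "a", 2), ("s", "b", 5), ("a", "b", 4), ("a", "t", 1), ("b", "t", 3)]

# Path score = sum of weights, selection = max: the optimal path problem.
print(graph_score(arcs, "s", "t", operator.add, max, 0))           # 9

# Path score = product of weights, selection = sum over all paths.
print(graph_score(arcs, "s", "t", operator.mul, operator.add, 1))  # 41
```

The same loop serves both problems; only the pair of operations changes, which is possible because both pairs satisfy the distributive law.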

Page 54: Dynamic programming: one algorithmic key to many biological locks

DP, algorithm

Page 55: Dynamic programming: one algorithmic key to many biological locks

Problem (physics of polymers)

Linear polymer chain of L+1 monomers k = 0, …, L.

Each monomer assumes one of N states: σ(k) ∈ {σ_i | i = 1, …, N}.

Energy of interactions between adjacent monomers is defined by an N×N matrix ξ(σ_i, σ_j) (measured in kT units).

Chain conformation P is defined by the states of the monomers {σ(0), σ(1), …, σ(L)}.

Exponent of energy: $S(P) = \exp(-E(P)) = \prod_{k=1}^{L} \exp(-\xi(\sigma(k-1), \sigma(k)))$.

Ψ is the set of all conformations.

Calculate the partition function of the set of all conformations, $\Omega = \sum_{P \in \Psi} S(P)$.

Page 56: Dynamic programming: one algorithmic key to many biological locks

Graph construction and reduction to DP

Vertices correspond to monomer states, so that their number is (L+1)∙N+2 (two additional vertices are the source and the sink, corresponding to the virtual start and end of the chain).

Arcs link vertices corresponding to adjacent monomers.

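Arc weights are the exponents of the interaction energies, exp(–ξ(σ_i, σ_j)), so that the product of the arc weights along a path equals S(P).

Paths through this graph exactly correspond to the chain conformations.

The path operation $\otimes$ is ordinary multiplication, and the selection operation $\oplus$ is addition.

The path score is the product of arc weights.

The total graph score is the sum of these products.

Standard DP solves the problem.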
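A small numerical check of this reduction, as a sketch under stated assumptions: the interaction matrix `xi` below is made up, states are indexed 0, …, N–1, and the arcs from the virtual source and to the virtual sink are taken to have weight 1. `partition_function_dp` walks along the chain keeping one number per state; `partition_function_direct` enumerates every conformation.

```python
import itertools
import math


def partition_function_dp(xi, L):
    """Sum of exp(-E) over all conformations, monomer by monomer."""
    N = len(xi)
    # z[j]: total weight of all partial chains whose last monomer is in state j
    z = [1.0] * N  # monomer 0: any state, no interaction yet
    for _ in range(L):
        z = [sum(z[i] * math.exp(-xi[i][j]) for i in range(N)) for j in range(N)]
    return sum(z)


def partition_function_direct(xi, L):
    """The same sum computed by enumerating all N**(L+1) conformations."""
    N = len(xi)
    total = 0.0
    for conf in itertools.product(range(N), repeat=L + 1):
        energy = sum(xi[conf[k - 1]][conf[k]] for k in range(1, L + 1))
        total += math.exp(-energy)
    return total


xi = [[0.0, 1.0], [1.0, 0.5]]  # hypothetical energies in kT units
print(math.isclose(partition_function_dp(xi, 6), partition_function_direct(xi, 6)))  # True
```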

Page 57: Dynamic programming: one algorithmic key to many biological locks

Quiz

(a) How many operations shall we need?

(b) How many operations shall we need if we calculate the partition function directly?

(c) Provide an algorithm for calculating the number of paths in a graph. Hint: invent suitable arc weights and reduce to the previous problem.

(d) What will Ω be if both $\otimes$ and $\oplus$ are the operation of taking the maximum?

Page 58: Dynamic programming: one algorithmic key to many biological locks

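Problem

Calculate the minimum energy and the number of conformations with the minimum energy.

Arc weights are pairs [1, ξ], with ξ as defined previously.

Path scores are pairs [n, ε], where ε is the energy, and n is the number of conformations having this energy.

When two systems are combined, the resulting energy is the sum of the systems’ energies, whereas the number of states is the product of the numbers of states. Hence

$[n_1, \varepsilon_1] \otimes [n_2, \varepsilon_2] = [n_1 n_2, \varepsilon_1 + \varepsilon_2]$,

$[n_1, \varepsilon_1] \oplus [n_2, \varepsilon_2] = [n_1, \varepsilon_1]$ if $\varepsilon_1 < \varepsilon_2$; $[n_2, \varepsilon_2]$ if $\varepsilon_1 > \varepsilon_2$; $[n_1 + n_2, \varepsilon_1]$ if $\varepsilon_1 = \varepsilon_2$

solves the problem.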
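The pair arithmetic can be sketched as two small functions. Everything named here is an assumption of the example: `combine` joins two systems (numbers of states multiply, energies add), and `select` is one natural reading of how alternatives are merged, keeping the lower energy and adding the counts when the energies are equal. Integer energies are used so that the equality test is exact; with floating-point energies a tolerance would be needed.

```python
def combine(a, b):
    """Join two systems: numbers of states multiply, energies add."""
    return (a[0] * b[0], a[1] + b[1])


def select(a, b):
    """Keep the lower energy; on a tie, add the numbers of conformations."""
    if a[1] < b[1]:
        return a
    if b[1] < a[1]:
        return b
    return (a[0] + b[0], a[1])


def ground_states(xi, L):
    """(number of minimum-energy conformations, minimum energy) for L+1 monomers."""
    N = len(xi)
    best = [(1, 0)] * N  # monomer 0: one conformation per state, energy 0
    for _ in range(L):
        new = []
        for j in range(N):
            acc = None
            for i in range(N):
                cand = combine(best[i], (1, xi[i][j]))  # arc weight [1, xi]
                acc = cand if acc is None else select(acc, cand)
            new.append(acc)
        best = new
    total = best[0]
    for b in best[1:]:
        total = select(total, b)
    return total


xi = [[1, 0], [0, 2]]  # hypothetical integer energies
print(ground_states(xi, 2))  # (2, 0): conformations 0-1-0 and 1-0-1
```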

Page 59: Dynamic programming: one algorithmic key to many biological locks

Lesson

Generalizations are useful

Page 60: Dynamic programming: one algorithmic key to many biological locks

Note

Not all problems that can be solved by dynamic programming have a simple graph representation. For example, reconstruction of the secondary structure of an RNA molecule given its sequence can be decomposed into simpler, embedded problems and can be solved by a variant of the dynamic programming algorithm, but in the language of this chapter it requires slightly more complicated objects called hypergraphs.

Page 61: Dynamic programming: one algorithmic key to many biological locks

Thank you

• Mikhail Roytberg

• Andrei Mironov

• Anatoly Rubinov

• Pavel Pevzner
