Bioinformatics: Phylogenetic Trees. LU, 2014, Juris Vīksna


Page 1: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Bioinformatics

Phylogenetic trees

LU, 2014, Juris Vīksna

Page 2: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Phylogenetic trees

Today:

• Hierarchical clustering and dendrograms
• Types of phylogenetic trees
• The "molecular clock"
• Methods for tree construction
  - from distance matrices
  - from character matrices
• Some further problems related to tree construction
• Programs for constructing and visualizing phylogenetic trees

Page 3: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

“Higher” organisms

“Lower” organisms

A phylogenetic tree is a hierarchical, graphical representation of relationships

Haeckel's "Tree of Life"

[Adapted from M.Thomas]

Page 4: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

What is clustering?

DOGS!!

CATS!!

PETS!!

[Adapted from V.Olman]

Page 5: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Clustering and classification

[Adapted from R.B.Altman]

Page 6: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

What is a cluster?

The "natural" definition is quite broad:

• there is a clear boundary between the elements of two clusters

• a high concentration of elements compared with the background

• the elements of one cluster are distant from the others

[Adapted from V.Olman]

Page 7: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Hierarchical clustering

Initially each object is in its own cluster. At each subsequent step two clusters are merged into one (until only a single cluster remains).

[Adapted from Y.Guo]

Page 8: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Hierarchical clustering - variants

• single-link clustering (also called the connectedness or minimum method): we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster. If the data consist of similarities, we consider the similarity between one cluster and another cluster to be equal to the greatest similarity from any member of one cluster to any member of the other cluster.

• complete-link clustering (also called the diameter or maximum method): we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.

• average-link clustering: we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.

[Adapted from Y.Guo]
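The three linkage rules differ only in how the pairwise distances between two clusters are reduced to a single number. A minimal Python sketch of naive agglomerative clustering, assuming a small hypothetical four-object matrix (`DIST`) and a hypothetical helper `agglomerate`; it is meant as an illustration, not as an efficient implementation:

```python
from itertools import combinations

# Pairwise distances for four objects; an assumed toy example.
DIST = {
    frozenset("ab"): 2, frozenset("ac"): 5, frozenset("ad"): 6,
    frozenset("bc"): 3, frozenset("bd"): 5, frozenset("cd"): 4,
}

def mean(values):
    """Average-link reduction."""
    return sum(values) / len(values)

def agglomerate(dist, linkage):
    """Merge the two closest clusters until one remains.

    linkage reduces the list of member-to-member distances to one number:
    min = single-link, max = complete-link, mean = average-link.
    Returns the list of merges as (cluster_a, cluster_b, height).
    """
    items = sorted({x for pair in dist for x in pair})
    clusters = [(x,) for x in items]
    merges = []

    def cluster_distance(p, q):
        return linkage([dist[frozenset((x, y))] for x in p for y in q])

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pq: cluster_distance(*pq))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges

single = agglomerate(DIST, min)
complete = agglomerate(DIST, max)
average = agglomerate(DIST, mean)
```

The merge heights are what a dendrogram draws on its vertical axis, so comparing `single` and `complete` shows how the choice of linkage changes both the heights and the order of the merges.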

Page 9: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Single-Link Method

Distance matrix (Euclidean distance):

    b  c  d
a   2  5  6
b      3  5
c         4

(1) Merge a and b (distance 2):

      c  d
a,b   3  5
c        4

(2) Merge (a,b) and c (distance 3):

        d
a,b,c   4

(3) Merge (a,b,c) and d (distance 4): a,b,c,d

[Adapted from Y.Guo]

Page 10: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Complete-Link Method

Distance matrix (Euclidean distance):

    b  c  d
a   2  5  6
b      3  5
c         4

(1) Merge a and b (distance 2):

      c  d
a,b   5  6
c        4

(2) Merge c and d (distance 4):

      c,d
a,b   6

(3) Merge (a,b) and (c,d) (distance 6): a,b,c,d

[Adapted from Y.Guo]

Page 11: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Compare Dendrograms

[Figure: two dendrograms over the leaves a, b, c, d on a height axis from 0 to 6; left: Single-Link, right: Complete-Link]

[Adapted from Y.Guo]

Page 12: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Nadler and Smith, Pattern Recognition Engineering, 1993

Single-link vs Complete-link

[Adapted from Y.Guo]

Page 13: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

OC - a hierarchical clustering program

http://www.compbio.dundee.ac.uk/Software/OC/oc.html

Example input:

7
First
Second
Third
Fourth
Fifth
Sixth
Seventh
100.0 100.0 50.0 33.0 25.0 20.0
100.0 50.0 50.0 33.0 33.0
100.0 33.0 20.0 25.0
33.0 20.0 25.0
100.0 100.0
100.0

Example output:

## 0 20 2
Entity Score: 20 Number of members: 2
0 6
## 1 20 2
Entity Score: 20 Number of members: 2
2 5
## 2 20 3
Entity Score: 20 Number of members: 3
3 2 5
## 3 25 5
Entity Score: 25 Number of members: 5
0 6 3 2 5
## 4 33 6
Entity Score: 33 Number of members: 6
4 0 6 3 2 5
## 5 33 7
Entity Score: 33 Number of members: 7
1 4 0 6 3 2 5

Page 14: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Phylogenetic trees

[Adapted from E.Willasen]

Page 15: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Phylogenetic trees

• Single origin to all species
• Also describes evolution of DNA
• Leaves - contemporary
• Internal nodes - ancestral
• Tree may be rooted/unrooted
• Branch length - distance between sequences

[Adapted from I.Pe’er]

Page 16: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Cladistics and phenetics

Cladistic approach: Trees are drawn based on the conserved characters

Phenetic approach: Trees are based on some measure of distance between the leaves

Molecular phylogenies are inferred from molecular (usually sequence) data, using either a cladistic (e.g. gene order) or a phenetic approach

[Adapted from C.Seoighe]

Page 17: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Clade: A set of species which includes all of the species derived from a single common ancestor

Clade

[Adapted from C.Seoighe]

Page 18: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Pagurus bernhardus

Pagurus acadianus

Ellasochirus tenuimanus

Labidochirus splendescens

Lithodes aequispina

Paralithodes camtschatica

Pagurus pollicaris (NE)

Pagurus pollicaris (GU)

Pagurus longicarpus (NE)

Pagurus longicarpus (GU)

Clibanarius vittatus

Coenobita sp.

Artemia salina

[Figure: cladogram of the species listed above, with internal nodes labelled t1, t2 and t3; no time scale]

Cladogram: shows relative recency of common descent.

• Does not imply that ancestors on the same line necessarily speciated at the same time.
• t1 can be before or after t2 but not before t3.

Tree types: cladograms

[Adapted from E.Willasen]

Page 19: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Pagurus bernhardus

Pagurus acadianus

Ellasochirus tenuimanus

Labidochirus splendescens

Lithodes aequispina

Paralithodes camtschatica

Pagurus pollicaris (NE)

Pagurus pollicaris (GU)

Pagurus longicarpus (NE)

Pagurus longicarpus (GU)

Clibanarius vittatus

Coenobita sp.

Artemia salina

[Figure: phylogram of the species listed above; scale bar 0.05]

Phylogram (additive tree: branch lengths can be summed): shows relative recency of common descent, and branch lengths = amount of change.

Tree types: phylograms

[Adapted from E.Willasen]

Page 20: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Pagurus bernhardus

Pagurus acadianus

Ellasochirus tenuimanus

Labidochirus splendescens

Lithodes aequispina

Paralithodes camtschatica

Pagurus pollicaris (NE)

Pagurus pollicaris (GU)

Pagurus longicarpus (NE)

Pagurus longicarpus (GU)

Clibanarius vittatus

Coenobita sp.

Artemia salina

[Figure: ultrametric tree of the species listed above; axis ticks 0.00, 0.05, 0.10, 0.15]

Ultrametric tree (linearized tree): the amount of change can be scaled to time; scale = time.

Tree types: ultrametric trees

[Adapted from E.Willasen]

Page 21: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Molecular clock

(Emile Zuckerkandl, Linus Pauling, ~1960)

Accepted DNA mutations happen at a constant rate.

Thus, the number of mutations is proportional to the time of evolution.

But mutation frequencies can differ between proteins:

fibrinopeptides > hemoglobin > cytochrome c

For longer proteins the mutation frequency can differ between regions.

Neutral Theory of Molecular Evolution (Motoo Kimura): natural selection vs. random genome changes.

Page 22: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Models of evolutionary change

[Figure: the four bases, with A and G labelled as purines and T and C as pyrimidines; transitions are the changes A↔G and T↔C, transversions are the changes between a purine and a pyrimidine]

The simplest model, Jukes-Cantor, assumes all probabilities of change to be equal. For it to be realistic:
• the base frequencies must be equal
• the rates of change must be equal

Kimura 2-parameter model: one might expect the ts / tv ratio to be 4 / 8 = 0.5, but transitions are usually more common. The Kimura model allows for unequal rates of transitions and transversions.

[Adapted from E.Willasen]
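Both models are commonly used to turn the observed fraction of differing sites between two aligned sequences into an estimated evolutionary distance. The sketch below uses the standard textbook correction formulas, which are not given on the slide: d = -3/4 ln(1 - 4p/3) for Jukes-Cantor and d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)) for the Kimura two-parameter model, where p is the fraction of differing sites, P the fraction of transitions and Q the fraction of transversions. The function names are hypothetical, and the sequences are assumed to be gap-free and of equal length.

```python
import math

PURINES = {"A", "G"}

def jukes_cantor(s1, s2):
    """Jukes-Cantor distance: every substitution is equally likely."""
    p = sum(a != b for a, b in zip(s1, s2)) / len(s1)
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(s1, s2):
    """Kimura 2-parameter distance: transitions and transversions
    are counted separately and may have different rates."""
    n = len(s1)
    transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    P, Q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
```

Both formulas break down (the logarithm's argument becomes non-positive) when the sequences are too divergent, which is one reason distance estimates become unreliable for very divergent data.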

Page 23: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

(a) Cladogram showing the phylogenetic relationships between four species.

(b) Relationships of the same four species represented as a set of nested parentheses.

(c) Evolutionary relationships of the same four species with nine synapomorphies (shared, derived characters) plotted on the branches.

Tree representations

[Adapted from M.Thomas]

Page 24: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Applications

Using Phylogeny to Understand Gene Duplication and Loss

A. A gene tree.
B. The gene tree superimposed on a species tree, allowing identification of the duplication and loss events.

[Adapted from M.Thomas]

Page 25: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Applications

[Adapted from R.Shamir]

Page 26: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Distance matrices

    A    B    C    D    E    F    G
A   -    63   94   111  67   23   107
B   63   -    79   96   16   58   92
C   94   79   -    47   83   89   43
D   111  96   47   -    100  106  20
E   67   16   83   100  -    62   96
F   23   58   89   106  62   -    102
G   107  92   43   20   96   102  -

MIN and MAX matrices:
• MIN matrix - the point in time at which the divergence occurred
• MAX matrix - the time elapsed since the divergence

Page 27: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Data for constructing phylogenetic trees

Laboratory methods.
- Hybridize a mixture of DNA from two organisms, then test at what temperature the hybridized strands separate. Bird evolution (Sibley, Ahlquist, 1986). The resulting data were markedly ultrametric.

Sequence-based methods.
- Compute something similar to the edit distance (ED), taking into account repeated mutations and mutations that do not change the protein.

Page 28: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Character state matrices

Taxa C1 C2 C3 C4 C5 C6

A 0 0 0 1 1 0

B 1 1 0 0 0 0

C 0 0 0 1 1 1

D 1 0 1 0 0 0

E 0 0 0 1 0 0

Page 29: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Data for constructing phylogenetic trees

- "have wings"; "walk on four legs"...

- DNA contains a specific subsequence...

- a specific nucleotide at a fixed DNA position

- whether the expression of a protein (gene) is regulated by a specific protein (mice and humans have very similar proteins but very different regulation...)

- similarity between gaps in a multiple alignment of sequences from different organisms (in this way one can demonstrate that fungi are closer to animals than to plants)

Page 30: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Character state matrices - example

[Adapted from M.Thomas]

Page 31: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Data - orthologs and paralogs

[Adapted from R.Shamir]

Page 32: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Data - orthologs and paralogs

[Adapted from R.Shamir]

Page 33: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Methods for tree construction

Distance-based methods

Maximum parsimony methods - MP

Maximum likelihood methods - ML

Page 34: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Ultrametric matrices and trees

[Adapted from D.Gusfield]

Page 35: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Ultrametric matrices and trees

A "less formal" definition:

An n×n distance matrix is ultrametric if it has a corresponding ultrametric tree, i.e. if one can build a rooted tree with weighted edges whose leaf set is the set of matrix rows {1,...,n} and in which, for every pair of rows i, j, the distances from i and from j to their nearest common ancestor are both M(i,j).

There is a "simple" test for whether a matrix is ultrametric: a symmetric matrix is ultrametric if, for every triple i, j, k, at least two of the values M(i,j), M(j,k), M(i,k) are equal to the maximum of the three.
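The triple test translates directly into code. A minimal sketch, assuming the matrix is given as a list of lists; the function name `is_ultrametric` is hypothetical. It runs in O(n^3) time, which is fine for checking small examples by hand:

```python
from itertools import combinations

def is_ultrametric(m):
    """Three-point test: in every triple the two largest of
    M(i,j), M(j,k), M(i,k) must be equal."""
    n = len(m)
    # The test is stated for symmetric matrices only.
    if any(m[i][j] != m[j][i] for i in range(n) for j in range(n)):
        return False
    for i, j, k in combinations(range(n), 3):
        smallest, middle, largest = sorted((m[i][j], m[j][k], m[i][k]))
        if middle != largest:
            return False
    return True
```

For example, `[[0, 2, 5], [2, 0, 5], [5, 5, 0]]` passes the test, while changing one of the 5s to 4 makes it fail.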

Page 36: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Constructing ultrametric trees

There is a simple O(n^2) algorithm.

[Adapted from D.Gusfield]

Page 37: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Constructing ultrametric trees

Which ultrametric tree corresponds to the given matrix?

A B C D E F G H I

A 0 8 5 9 8 7 5 9 9

B 0 8 9 6 8 8 9 9

C 0 9 8 7 2 9 9

D 0 9 9 9 2 3

E 0 8 8 9 9

F 0 7 9 9

G 0 9 9

H 0 3

I 0
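One way to answer the question by machine is to split the leaves recursively: the largest value in the matrix labels the root, and leaves whose mutual distance is below that value stay in the same subtree. The sketch below assumes the matrix really is ultrametric (it does not check this) and is a simple recursive illustration rather than the O(n^2) algorithm; `NAMES`, `UPPER`, `D` and `build` are hypothetical names.

```python
NAMES = "ABCDEFGHI"
# Upper triangle of the 9x9 matrix, row by row, starting at the diagonal.
UPPER = [
    [0, 8, 5, 9, 8, 7, 5, 9, 9],
    [0, 8, 9, 6, 8, 8, 9, 9],
    [0, 9, 8, 7, 2, 9, 9],
    [0, 9, 9, 9, 2, 3],
    [0, 8, 8, 9, 9],
    [0, 7, 9, 9],
    [0, 9, 9],
    [0, 3],
    [0],
]

# Expand to a symmetric lookup table D[x, y].
D = {}
for i, row in enumerate(UPPER):
    for offset, value in enumerate(row):
        j = i + offset
        D[NAMES[i], NAMES[j]] = D[NAMES[j], NAMES[i]] = value

def build(leaves):
    """Return a leaf name, or (label, [subtrees]) for an internal node."""
    if len(leaves) == 1:
        return leaves[0]
    top = max(D[a, b] for a in leaves for b in leaves)
    groups = []
    for x in leaves:
        for group in groups:
            # In an ultrametric matrix "distance < top" is an equivalence
            # relation, so comparing with one representative is enough.
            if D[x, group[0]] < top:
                group.append(x)
                break
        else:
            groups.append([x])
    return (top, [build(group) for group in groups])

TREE = build(list(NAMES))
```

Each internal node carries the matrix value shared by all pairs of leaves that first meet at that node, so the labels decrease along every path from the root to a leaf.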

Page 38: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Non-ultrametric data?

One can try to find the "smallest change" that makes the data ultrametric.

If values may only be decreased, there is a polynomial solution.

If values may be decreased or increased, there is a polynomial solution that minimizes the maximum change.

If values may only be increased, the problem is NP-complete.

Page 39: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Additive matrices and trees

[Adapted from D.Gusfield]

Page 40: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Additive matrices and trees

A "less formal" definition:

An n×n distance matrix is additive if it has a corresponding additive tree (phylogram), i.e. if one can build a tree with weighted edges whose vertex set contains {1,...,n} (all rows of the matrix) and in which, for every pair of rows i, j, the distance between leaves i and j is M(i,j).
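A common way to test additivity without building the tree is the four-point condition, a standard result that is not stated on the slide: for every four leaves i, j, k, l, the two largest of the sums M(i,j)+M(k,l), M(i,k)+M(j,l), M(i,l)+M(j,k) must be equal. A minimal O(n^4) sketch, assuming a symmetric matrix given as a list of lists; `is_additive` and the tolerance `tol` are hypothetical choices.

```python
from itertools import combinations

def is_additive(m, tol=1e-9):
    """Four-point condition for a symmetric distance matrix."""
    n = len(m)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted((m[i][j] + m[k][l],
                       m[i][k] + m[j][l],
                       m[i][l] + m[j][k]))
        # The two largest pair sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Distances read off a tree with leaf edges A:1, B:2, C:4, D:5
# and an internal edge of length 3 separating {A,B} from {C,D}.
TREE_LIKE = [
    [0, 3, 8, 9],
    [3, 0, 9, 10],
    [8, 9, 0, 9],
    [9, 10, 9, 0],
]
```

Here `is_additive(TREE_LIKE)` holds because the matrix was generated from a tree; real sequence-derived distances usually fail the test, at least slightly.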

Page 41: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Additive trees - construction

Problem: given a symmetric n×n matrix D with 0 on the diagonal and positive values elsewhere, find an additive tree corresponding to D or determine that none exists.

O(n^2) algorithms are known. The problem can be reduced to the ultrametric tree construction problem.

Every ultrametric matrix is additive. An additive matrix is ultrametric if it has an additive tree in which one of the vertices is at the same distance from all leaves.

If a compact additive tree exists for D, then it is the unique minimum spanning tree of the graph G(D).

Page 42: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Additive trees - construction

[Adapted from D.Gusfield]

Page 43: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

"Rooting" a tree

In an unrooted tree the direction of evolution is unknown

The root is the hypothesized ancestor of the sequences in the tree

The root can either be placed on a branch or at a node

You should start by viewing an unrooted tree

[Adapted from C.Seoighe]

Page 44: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

"Rooting" a tree - example

[Adapted from C.Seoighe]

Page 45: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

"Rooting" a tree - example

[Adapted from C.Seoighe]

Page 46: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

What to do with non-additive distance matrices?

One can try to define the "best" possible tree and then construct it according to the chosen definition.

All "reasonable" definitions of the best tree lead to an NP-complete problem.

In practice only heuristic solutions are normally used, i.e. algorithms that seem "reasonable" but do not guarantee that the result has any particular properties.

Fitch-Margoliash trees - "least squares" fit

Constructing optimal trees requires a more or less exhaustive search; the algorithm provides a heuristic "approximation".

Page 47: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

UPGMA method

    A    B     C     D     E
A   -    0.08  0.19  0.70  0.65
B        -     0.17  0.75  0.70
C              -     0.80  0.60
D                    -     0.12
E                          -

• find the shortest distance, 0.08
• group OTUs (AB)
• A and B each has branch length 0.04 (because the sum is 0.08)
• find the next shortest distance, 0.12 (DE) - distance level 0.06
• find the next shortest distance, 0.17 (BC) - but B has been 'used'
• so d = (0.19 + 0.17) / 2 = 0.18 - distance level 0.09
• and finally d = (0.70+0.65+0.75+0.70+0.80+0.60) / 6 = 0.70 - distance level 0.35

[Figure: the resulting tree; A and B join at level 0.04, D and E at 0.06, C joins (A,B) at 0.09, and the two groups join at 0.35]

[Adapted from E.Willasen]

UPGMA = Unweighted Pair Group Method with Arithmetic mean
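The UPGMA steps can be sketched in a few lines of Python. The sketch assumes the five-taxon matrix with A-B = 0.08, A-C = 0.19, A-D = 0.70, A-E = 0.65, B-C = 0.17, B-D = 0.75, B-E = 0.70, C-D = 0.80, C-E = 0.60, D-E = 0.12; the names `PAIRS` and `upgma` are hypothetical, and the naive recomputation of cluster averages is chosen for readability rather than speed.

```python
from itertools import combinations

PAIRS = {
    "AB": 0.08, "AC": 0.19, "AD": 0.70, "AE": 0.65,
    "BC": 0.17, "BD": 0.75, "BE": 0.70,
    "CD": 0.80, "CE": 0.60, "DE": 0.12,
}
DIST = {frozenset(pair): value for pair, value in PAIRS.items()}

def upgma(leaves, dist):
    """Return the joins as (cluster_name, level), where level is half
    the average distance between the two clusters being joined."""
    def average(p, q):
        total = sum(dist[frozenset((x, y))] for x in p for y in q)
        return total / (len(p) * len(q))

    clusters = [(leaf,) for leaf in leaves]
    joins = []
    while len(clusters) > 1:
        p, q = min(combinations(clusters, 2), key=lambda pq: average(*pq))
        merged = tuple(sorted(p + q))
        joins.append(("".join(merged), average(p, q) / 2))
        clusters = [c for c in clusters if c not in (p, q)] + [merged]
    return joins

JOINS = [(name, round(level, 2)) for name, level in upgma("ABCDE", DIST)]
```

The levels returned are node heights above the leaves, so every leaf ends up at the same distance from the root; this is the sense in which UPGMA assumes a molecular clock.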

Page 48: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

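UPGMA trees are additive

    A    B     C     D     E
A   -    0.08  0.19  0.70  0.65
B        -     0.17  0.75  0.70
C              -     0.80  0.60
D                    -     0.12
E                          -

[Figure: the UPGMA tree for this matrix, with branches labelled 0.04 (A, B), 0.06 (D, E), 0.09 (C and the A,B group) and 0.35 (the two groups below the root)]

• additive: distances between nodes can be summed

• the distance from A to E is (0.04+0.09+0.35+0.35+0.06) = 0.89 when the labels in the figure are summed as branch lengths (if the labels are read as node heights instead, the path from A to E is 2 × 0.35 = 0.70, the average distance UPGMA computed between the two groups)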

[Adapted from E.Willasen]

Page 49: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

UPGMA method

[Adapted from R.Shamir]

Produces additive trees as a result, but will not in general construct the "correct" additive tree for an additive matrix :)

Page 50: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

UPGMA algorithm

[Adapted from R.Shamir]

Page 51: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Neighbour Joining method

    A    B     C     D     E
A   -    0.08  0.19  0.70  0.65
B        -     0.17  0.75  0.70
C              -     0.80  0.60
D                    -     0.12
E                          -

[Figure: a tree with the same shape as the UPGMA tree, except that the branches to A and B are unequal (0.06 and 0.02); the other labels are 0.06 (D, E), 0.09 and 0.35]

• as opposed to UPGMA: allows for unequal rates of change on two sister lineages

• the original distance matrix is transformed so that the branch length between each pair of nodes is adjusted, based on the mean distance from all other nodes

[Adapted from E.Willasen]

Page 52: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Neighbour Joining method

[Adapted from R.Shamir]

Page 53: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Neighbour Joining algorithm

For each node i, the distance from the rest of the tree is estimated by

$$u_i = \frac{1}{N-2} \sum_{k \neq i} D_{i,k}$$

Choose the nodes i and j for which $D_{i,j} - u_i - u_j$ is smallest.

In the distance matrix, join i and j ((ij) is the new node). Compute the branch lengths from i and j to (ij):

$$d_{i,(ij)} = \tfrac{1}{2} D_{i,j} + \tfrac{1}{2}(u_i - u_j), \qquad d_{j,(ij)} = \tfrac{1}{2} D_{i,j} + \tfrac{1}{2}(u_j - u_i)$$

Compute the distances between the new cluster and each other cluster:

$$D_{(ij),k} = \frac{D_{i,k} + D_{j,k} - D_{i,j}}{2}$$

[Adapted from U.Stege]
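The neighbour-joining steps can be sketched as follows, using the usual formulation with u_i = sum of D(i,k) over k ≠ i divided by N-2, the selection criterion D(i,j) - u_i - u_j, branch lengths D(i,j)/2 ± (u_i - u_j)/2, and the update D((ij),k) = (D(i,k) + D(j,k) - D(i,j))/2. The function name, the way new nodes are named, and the toy matrix `EXAMPLE` (distances read off a tree with leaf edges A:1, B:2, C:4, D:5 and an internal edge of length 3) are assumptions made for illustration.

```python
from itertools import combinations

EXAMPLE = {
    frozenset("AB"): 3, frozenset("AC"): 8, frozenset("AD"): 9,
    frozenset("BC"): 9, frozenset("BD"): 10, frozenset("CD"): 9,
}

def neighbour_joining(names, dist):
    """Return the tree as a list of edges (node, neighbour, length)."""
    d = dict(dist)
    nodes = list(names)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        # u[i]: average distance from i to the rest of the tree.
        u = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]])
        new = f"({i},{j})"
        dij = d[frozenset((i, j))]
        edges.append((i, new, 0.5 * dij + 0.5 * (u[i] - u[j])))
        edges.append((j, new, 0.5 * dij + 0.5 * (u[j] - u[i])))
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))  # last remaining edge
    return edges

EDGES = neighbour_joining("ABCD", EXAMPLE)
LENGTHS = {node: length for node, _, length in EDGES}
```

On an additive matrix such as `EXAMPLE` the procedure recovers the branch lengths of the generating tree, including the unequal branches to A and B that UPGMA could not represent.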

Page 54: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Neighbour Joining method

[Adapted from R.Shamir]

Page 55: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

The parsimony principle

[Adapted from E.Willasen]

A general guideline to reasonable thinking:

Occam's razor: go for the simplest explanations!

Applied in phylogeny and evolutionary thinking: minimize the number of assumptions about evolutionary changes (steps)

The question ”How many times did this trait evolve” can only be answered with reference to a hypothesis of phylogeny on the groups where the trait can be observed

Page 56: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Example - fish

bichirs

gars

bowfin

coelocanths

lungfish

placoderms

primitive rayfinned

elasmobranchs

[Adapted from E.Willasen]

Page 57: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Example - fish

Hypothesis 1 on the evolution of lungs and swim bladder (phylogeny derived from other characters)

[Figure: tree of placoderms, elasmobranchs, bichirs, primitive rayfinned fishes, gars, bowfins, modern rayfinned fishes, coelacanths, lungfish and tetrapods, with gains and losses of lungs and swimbladder marked on the branches]

• lungs evolved from swimbladder 3 times
• swimbladder evolved from lungs 2 times
• lungs lost with no replacement 1 time

Sum: 6 evolutionary steps

[Adapted from E.Willasen]

Page 58: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Example - fish

Hypothesis 2 on the evolution of lungs and swim bladder

[Figure: the same tree of placoderms, elasmobranchs, bichirs, primitive rayfinned fishes, gars, bowfins, modern rayfinned fishes, coelacanths, lungfish and tetrapods, with fewer gains and losses of lungs and swimbladder marked on the branches]

• swimbladder evolved from lungs 2 times
• lungs lost with no replacement 1 time

Sum: 3 evolutionary steps

[Adapted from E.Willasen]

Page 59: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Example - fish

Parsimonious evaluation

Hypothesis 1:
• lungs evolved from swimbladder 3 times
• swimbladder evolved from lungs 2 times
• lungs lost with no replacement 1 time
Sum: 6 evolutionary steps

Hypothesis 2:
• swimbladder evolved from lungs 2 times
• lungs lost with no replacement 1 time
Sum: 3 evolutionary steps

Hypothesis 2 is preferred under the principle of parsimony

Parsimony is also used to select one or more trees from the set of possible trees

[Adapted from E.Willasen]

Page 60: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Perfect phylogenetic trees

An n×m character matrix M has a corresponding perfect phylogenetic tree if there exists a rooted tree whose leaf set is {1,...,n}, whose set of edge labels contains {1,...,m}, and in which, for every i, the path from the root to leaf i contains edge j if and only if M(i,j)=1.

Page 61: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

The perfect phylogeny problem

[Adapted from D.Gusfield]

Page 62: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Perfect phylogenetic trees - construction

An O(nm) algorithm for constructing perfect phylogenetic trees:

- treat each column of M as a binary number and sort these numbers in non-decreasing order
- for each row, "write down" the string of columns in which this row has a 1
- "lay" these strings into a tree, starting from the root and "merging" identical characters

Why must the columns be sorted?

- to ensure that characters located closer to the root are considered first
- if "row" A is a "subset" of "row" B, then A is considered first
- if A and B are disjoint, the order does not matter

Page 63: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Perfect phylogenetic trees - construction

[Adapted from D.Gusfield]

O(nm) algorithm.

Page 64: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Perfect phylogenetic trees - construction

a b c d e f g h i j

A 0 0 1 0 0 0 0 1 0 0

B 1 0 1 0 1 0 1 1 0 0

C 0 0 0 0 0 1 0 0 0 0

D 0 1 0 1 0 1 0 0 0 1

E 0 1 0 0 0 1 0 0 0 0

F 0 0 1 0 0 0 1 1 0 0

G 0 0 0 0 0 0 0 1 0 0

H 0 1 0 1 0 1 0 0 1 1

I 0 0 1 0 1 0 1 1 0 0

J 0 1 0 0 0 1 0 0 0 1
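The construction can be tried on this matrix with a short sketch. It sorts the columns, reads off for each row the string of characters in which the row has a 1, and checks that every character always follows the same predecessor, which is equivalent to the character labelling exactly one edge of the tree. Two assumptions are worth stating: the columns are visited from the largest binary number to the smallest, so that a character whose set of taxa contains another's is met first, and the function and variable names are hypothetical.

```python
TAXA = "ABCDEFGHIJ"
CHARACTERS = "abcdefghij"
MATRIX = [
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],  # A
    [1, 0, 1, 0, 1, 0, 1, 1, 0, 0],  # B
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],  # C
    [0, 1, 0, 1, 0, 1, 0, 0, 0, 1],  # D
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],  # E
    [0, 0, 1, 0, 0, 0, 1, 1, 0, 0],  # F
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0],  # G
    [0, 1, 0, 1, 0, 1, 0, 0, 1, 1],  # H
    [0, 0, 1, 0, 1, 0, 1, 1, 0, 0],  # I
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 1],  # J
]

def perfect_phylogeny(matrix, taxa, characters):
    """Return (parent, paths) or None if no perfect phylogeny exists.

    parent[c] is the character on the edge just above c's edge
    (None if c's edge starts at the root); paths[t] lists the
    characters on the path from the root to taxon t.
    """
    # Column as a binary number (first row = most significant bit),
    # largest first.
    order = sorted(range(len(characters)),
                   key=lambda j: tuple(row[j] for row in matrix),
                   reverse=True)
    parent, paths = {}, {}
    for taxon, row in zip(taxa, matrix):
        previous, path = None, []
        for j in order:
            if row[j] == 1:
                c = characters[j]
                # A character must always hang below the same edge.
                if parent.setdefault(c, previous) != previous:
                    return None
                previous = c
                path.append(c)
        paths[taxon] = path
    return parent, paths

RESULT = perfect_phylogeny(MATRIX, TAXA, CHARACTERS)
```

When a result is returned, taxa that share a prefix of their paths share the corresponding edges of the tree, so the tree can be drawn directly from `paths`.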

Page 65: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

More on character matrices

Perfect phylogenetic trees were constructed under the assumption that the "common ancestor" had the value "0" for every character.

This is not always a very realistic assumption (especially if the characters are derived from DNA/protein sequences).

Camin-Sokal method: allows only changes 0→1
Wagner method: allows both 0→1 and 1→0

Page 66: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Perfectly phylogenetic data?

Real data usually do not admit a perfect phylogenetic tree. The "relaxed" phylogeny problem is to find the most parsimonious tree...

Page 67: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Maximum parsimony - example

[Figure: two alternative trees for the taxa F, G, H and X, with character changes (numbers of toes, teeth, ribs and vertebrae, lobe shape, leg length) marked on the branches. The branch step counts shown are 1, 4, 1 for the first tree (tree length: 6 steps) and 5, 2, 5 for the second (tree length: 12 steps).]

[Adapted from M.Thomas]

Page 68: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Maximum parsimony. Steiner trees.

G = (X,E) is an undirected graph with edge weights. N is a subset of the vertex set X. A Steiner tree ST for the set N is a connected subtree of G that contains all vertices of N (and possibly others).

Problem: given G and N, find a Steiner tree of minimum weight. (There is also the special case in which every edge has weight 1.)

The problem is NP-complete (even if all edge weights are 1). There is a polynomial 2-approximation.

If X=N, then ST is a minimum spanning tree of G (!); this is easily solved in time O(E log(E)) (and, with a little more effort, even in O(E)).

Page 69: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Parsimony - application to character matrices

Edge weights - the number of characters that must be changed to turn one matrix row into another (Hamming distance).

N = the set of matrix rows
X = the set of all binary m-dimensional vectors

Problem - construct a minimum-weight Steiner tree whose leaf set is N

Exact solutions can be found for N=10 or slightly larger (depending on the character matrix)
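Because the exact Steiner tree is hard to find, a quick upper bound is often useful. The sketch below computes Hamming distances between the rows of a small character matrix and builds a minimum spanning tree over the rows alone (Prim's algorithm); by the usual spanning-tree argument its weight is at most twice the weight of an optimal Steiner tree. The five rows are an assumed example with taxa A to E and six binary characters, and the function names are hypothetical.

```python
ROWS = {
    "A": (0, 0, 0, 1, 1, 0),
    "B": (1, 1, 0, 0, 0, 0),
    "C": (0, 0, 0, 1, 1, 1),
    "D": (1, 0, 1, 0, 0, 0),
    "E": (0, 0, 0, 1, 0, 0),
}

def hamming(u, v):
    """Number of characters in which two rows differ."""
    return sum(a != b for a, b in zip(u, v))

def spanning_tree(rows):
    """Prim's algorithm on the complete graph over the rows,
    weighted by Hamming distance. Returns (total_weight, edges)."""
    names = list(rows)
    in_tree = {names[0]}
    total, edges = 0, []
    while len(in_tree) < len(names):
        weight, u, v = min(
            (hamming(rows[u], rows[v]), u, v)
            for u in in_tree for v in names if v not in in_tree
        )
        in_tree.add(v)
        total += weight
        edges.append((u, v, weight))
    return total, edges

WEIGHT, EDGES = spanning_tree(ROWS)
```

A true Steiner tree may do better by introducing intermediate vectors (hypothetical ancestors) that are not rows of the matrix, which is exactly what makes the problem hard.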

Page 70: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Lifted Alignment

"Lifted alignment" - every internal node is assigned the same sequence as one of its children

"Lifted alignment" gives a tree whose weight is at most twice the minimum

"Lifted alignment" can be found in polynomial time

[Adapted from D.Gusfield]

Page 71: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Large Parsimony Problem

[Adapted from R.Shamir]

Exhaustive search through all possible trees (up to 12 taxa)

Branch-and-bound methods - more intelligent search, trying to avoid provably non-optimal trees (up to 25 taxa)

Heuristic methods (useful for around 100 or so taxa). They build an initial tree by some greedy strategy and then try to improve it by various recombination methods:

- NNI (Nearest Neighbour Interchange)
- SPR (Subtree Pruning and Regrafting)
- TBR (Tree Bisection and Reconnection)

Page 72: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

LPP - Exhaustive search

Page 73: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

LPP - Exhaustive search

Page 74: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

LPP - Branch and Bound

Page 75: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

LPP - Greedy algorithm

Page 76: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

LPP - NNI

[Adapted from R.Shamir]

Page 77: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

LPP - NNI

Page 78: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

LPP - SPR

Page 79: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

LPP - TBR

Page 80: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Issues & problems with parsimony

• Multiple trees may be the most parsimonious (have the same tree length)
• A consensus tree can be constructed to visualize the congruity & discontinuity between these
• Branch lengths (and, therefore, rates of change) cannot be accurately estimated
• No explicit model of change is used, even when one might be well supported
• The most parsimonious tree(s) may not be the true tree

[Adapted from M.Thomas]

Page 81: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Maximum likelihood trees

• assumes a particular evolutionary model
• technically uses character matrices, however they need to be associated with sensible evolutionary rates (e.g. changes of single nucleotides can be used as characters)

Page 82: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Models of (nucleotide sequence) evolution

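[Figure: three substitution diagrams over the bases A, C, G, T. Jukes Cantor: all six substitution rates are labelled a. General: the six rates are labelled a, b, c, d, e, f. Kimura: shown without rate labels in this transcript.]

$$S(\varepsilon) = \begin{pmatrix}
1-3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\
\alpha\varepsilon & 1-3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\
\alpha\varepsilon & \alpha\varepsilon & 1-3\alpha\varepsilon & \alpha\varepsilon \\
\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & 1-3\alpha\varepsilon
\end{pmatrix}$$

NB! There is a non-negligible probability that a nucleotide will remain unchanged even after a long period of evolution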

Page 83: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Maximum likelihood trees

Page 84: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Maximum likelihood trees

• the same exhaustive search method as for the most parsimonious trees
• NNI, SPR, TBR etc. can also be used
• it is harder to find good branch-and-bound and greedy strategies

Page 85: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Comparison of methods

• Inconsistency
• Neighbour Joining (NJ) is very fast but depends on accurate estimates of distance. This is more difficult with very divergent data
• Parsimony suffers from Long Branch Attraction. This may be a particular problem for very divergent data
• NJ can suffer from Long Branch Attraction
• Parsimony is also computationally intensive
• Codon usage bias can be a problem for MP and NJ
• Maximum Likelihood is the most reliable but depends on the choice of model and is very slow
• Methods may be combined

[Adapted from C.Seoighe]

Page 86: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Evolutionary models - choice of parameters

Factors that Affect Phylogenetic Inference

1. Relative base frequencies (A,G,T,C)
2. Transition/transversion ratio
3. Number of substitutions per site
4. Number of nucleotides (or amino acids) in sequence
5. Different rates in different parts of the molecule
6. Synonymous/non-synonymous substitution ratio
7. Substitutions that are uninformative or obfuscatory
   1. Parallel substitutions
   2. Convergent substitutions
   3. Back substitutions
   4. Coincidental substitutions

In general, the more factors that are accounted for by the model (i.e., more parameters), the larger the error of estimation. It is often best to use fewer parameters by choosing the simpler model.

[Adapted from M.Thomas]

Page 87: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Some further problems

The small parsimony problem

Merging trees

OneTree algorithm - finds the merged tree in polynomial time (if such a tree exists)
MinCutSupertree algorithm - produces a result even if no "correct" merge exists

Page 88: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Bootstrapping

[Figure: two consensus trees of Ochromonas, Symbiodinium, Prorocentrum, Loxodes, Spirostomumum, Tetrahymena, Euplotes, Tracheloraphis and Gruberia with bootstrap support values on the branches; one is a majority-rule consensus with minority components, the other keeps only the branches supported at 71 and 59]

[Adapted from C.Seoighe]

Bootstrapping is a statistical technique that can use random resampling of data to determine sampling error for tree topologies
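The resampling step itself is simple: alignment columns are drawn with replacement until a pseudo-alignment of the original length is obtained, the analysis is repeated on each replicate, and the fraction of replicates that reproduce a grouping is reported as its support. In the sketch below the "analysis" is deliberately trivial (which pair of sequences is closest by Hamming distance) so that the example stays self-contained; a real study would rebuild a whole tree per replicate. All names and the toy alignment are assumptions.

```python
import random
from itertools import combinations

ALIGNMENT = {
    "A": "AAAAAAAA",
    "B": "AAAAAAAA",
    "C": "CCCCCCCC",
}

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def closest_pair(alignment):
    """Stand-in for tree building: the pair of most similar sequences."""
    return min(combinations(sorted(alignment), 2),
               key=lambda p: hamming(alignment[p[0]], alignment[p[1]]))

def support(alignment, pair, replicates=200, seed=0):
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    hits = sum(closest_pair(bootstrap_replicate(alignment, rng)) == pair
               for _ in range(replicates))
    return hits / replicates
```

Values such as 71 or 59 on a consensus tree are percentages of this kind: the share of replicate trees in which the corresponding branch appeared.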

Page 89: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Bootstrapping

Page 90: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Bootstrapping

Page 91: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Phylogenetics resources

http://evolution.genetics.washington.edu/phylip/software.html

Page 92: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Information about phylogenetics algorithms

http://www.icp.ucl.ac.be/~opperd/private/phylogeny.html

Page 93: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Software tools

Construction of phylogenetic trees

• PHYLIP (phylogeny inference package by Felsenstein)
• PAUP* (phylogenetic analysis using parsimony by Swofford): GUI only for Macs :( and by now only a commercial version :(

Multi-purpose packages, including construction of trees

DAMBE (Data Analysis in Molecular Biology and Evolution)

Tree visualisation

TreeView

Page 94: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

PHYLIP - a phylogenetics program package

http://evolution.genetics.washington.edu/phylip.html

Page 95: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

PHYLIP - a phylogenetics program package

http://evolution.genetics.washington.edu/phylip.html

Page 96: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

PHYLIP - methods for character matrices

PENNY. Finds all most parsimonious phylogenies for discrete-character data with two states, for the Wagner, Camin-Sokal, and mixed parsimony criteria using the branch-and-bound method of exact search. May be impractical (depending on the data) for more than 10-11 species.

PARS. Multistate discrete-characters parsimony method. Up to 8 states (as well as "?") are allowed. Cannot do Camin-Sokal or Dollo Parsimony. Can cope with multifurcations, reconstruct ancestral states, use character weights, and infer branch lengths.

FACTOR. Takes discrete multistate data with character state trees and produces the corresponding data set with two states (0 and 1). Written by Christopher Meacham. This program was formerly used to accommodate multistate characters in MIX, but this is less necessary now that PARS is available.

Page 97: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

PHYLIP - methods for character matrices

CLIQUE. Finds the largest clique of mutually compatible characters, and the phylogeny which they recommend, for discrete character data with two states. The largest clique (or all cliques within a given size range of the largest one) are found by a very fast branch and bound search method. The method does not allow for missing data. For such cases the T (Threshold) option of PARS or MIX may be a useful alternative. Compatibility methods are particularly useful when some characters are of poor quality and the rest of good quality, but when it is not known in advance which ones are which.

MOVE. Interactive construction of phylogenies from discrete character data with two states (0 and 1). Evaluates parsimony and compatibility criteria for those phylogenies and displays reconstructed states throughout the tree. This can be used to find parsimony or compatibility estimates by hand.

Page 98: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

DAMBE - Data Analysis in Molecular Biology and Evolution

http://dambe.bio.uottawa.ca/dambe.asp

Page 99: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

Tree formats

.PHB files:

((((Hippopotamus:0.70000,Human:41.80000):1.25000,Horse:39.75000):1.09375,Pig:40.53125):0.46875,Rabit:41.90625,(Bovin:39.00000,Chick:25.00000):1.09375);

.TRE files:

#nexus
begin host;
tree host=(a,(b,c));
endblock;
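The parenthesised notation used in both file types (the Newick format) is easy to read with a small recursive parser. The sketch below handles only the basic form seen here, that is nested parentheses, optional names and optional `:length` suffixes; quoted labels, comments and the surrounding NEXUS block are not handled. The function names and the tuple representation `(name, length, children)` are assumptions.

```python
def parse_newick(text):
    """Parse a basic Newick string into nested (name, length, children)."""
    text = "".join(text.split())  # drop whitespace and line breaks
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1
            children.append(node())
            while text[pos] == ",":
                pos += 1
                children.append(node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
        start = pos
        while pos < len(text) and text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return (name, length, children)

    return node()

def leaf_names(tree):
    """Leaf labels from left to right."""
    name, _, children = tree
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]

PHB = ("((((Hippopotamus:0.70000,Human:41.80000):1.25000,Horse:39.75000)"
       ":1.09375,Pig:40.53125):0.46875,Rabit:41.90625,"
       "(Bovin:39.00000,Chick:25.00000):1.09375);")
```

For instance, `leaf_names(parse_newick(PHB))` lists the seven taxa in file order, and `parse_newick("(a,(b,c));")` gives the nested structure of the small NEXUS tree.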

Page 100: Bioinformātika Filoģenētiskie koki LU, 2014, Juris Vīksna

TreeView - tree visualization/editing

http://taxonomy.zoology.gla.ac.uk/rod/treeview.html