Post on 01-Jan-2017
Quartet Inference from SNP Data Under the Coalescent Model
Julia Chifman and Laura Kubatko
By Shashank Yaduvanshi
Estimating Species Tree from Gene Sequences
• Input: Alignments from multiple genes
• Output: Unified species tree
• Challenges:
– Every gene has its own phylogeny
– Gene trees might differ from the species tree due to incomplete lineage sorting (ILS), horizontal gene transfer, etc.
Phylogeny Estimation Methods under the Coalescent Model
• The coalescent model is used to model ILS in gene trees
• Summary based methods
– Quartet based methods
• Concatenation methods
• Co-estimation methods
Summary Based Methods
• First, estimate an independent gene tree for each gene using methods like RAxML
• Second, combine the gene trees into a species tree using methods like ASTRAL
• Computationally efficient for large data sets
• Estimation error in the gene trees lowers the overall accuracy
Quartet Based Methods
• Estimate the most likely true quartet tree for each set of four taxa using multi-gene sequences
• Combine all (or a subset) of these quartet trees using a supertree method to get the species tree
• Works on the entire data set at once while still remaining computationally efficient
Concatenation Methods
• Concatenate all gene sequence alignments to get one long sequence alignment for each taxon
• Estimate the species tree directly from these long alignments with methods such as maximum likelihood (ML)
• Ignores differences among the gene trees of different genes
Co-estimation Methods
• Co-estimate the gene trees and the species tree from the sequence alignments with methods such as Bayesian inference
• Generally higher accuracy than other methods
• Computationally inefficient for large datasets
Estimating Quartet Trees
• Most methods seen so far are distance-based or ML-based
• This paper introduces a new measure, the SVD score, which is based on the frequencies of quartet site patterns across all gene alignments
• SVD scores can be used to estimate the most likely quartet tree for any quartet of taxa
Important Concepts
• $p_{ijkl} = P(X_1 = i, X_2 = j, X_3 = k, X_4 = l)$: the probability that taxa 1 to 4 show states i, j, k, l at a site
• A SPLIT of a taxon set L is a bipartition of L into two non-overlapping subsets $L_1$ and $L_2$, denoted $L_1|L_2$
• VALID SPLIT $L_1|L_2$ for tree T: there is some edge in T whose removal results in the same bipartition $L_1|L_2$. If no such edge exists, the split is INVALID
• For taxon quartets, we consider splits into two groups of two taxa. There are 3 such possible splits for each quartet.
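The three two-versus-two splits of a quartet can be listed mechanically. A minimal sketch follows; the function name `quartet_splits` is made up for illustration and does not come from the paper or the slides.

```python
def quartet_splits(taxa):
    """Return the three 2|2 splits of a four-taxon set as pairs of pairs."""
    a, b, c, d = taxa
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


for left, right in quartet_splits(("1", "2", "3", "4")):
    print("".join(left) + "|" + "".join(right))
# prints 12|34, 13|24 and 14|23, one per line
```

Fixing the first taxon and choosing its partner is what keeps the count at three: every other arrangement is one of these splits with the two sides swapped.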
Flattening
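The slide content for this heading did not survive, so the following is only a sketch of the usual construction as we understand it: for a split such as 12|34, the 256 site-pattern probabilities are arranged in a 16 x 16 matrix whose rows are indexed by the states of the first pair of taxa and whose columns are indexed by the states of the second pair. The helper name `flattening`, the state ordering "ACGT", and the choice to skip columns containing gaps are all assumptions made for illustration.

```python
import numpy as np

STATES = "ACGT"  # assumed ordering of the four nucleotide states
INDEX = {s: i for i, s in enumerate(STATES)}


def flattening(columns, split):
    """Build a 16 x 16 matrix of observed site-pattern frequencies.

    columns: iterable of 4-character strings, one per alignment site,
             giving the states of taxa 0..3 at that site.
    split:   ((a, b), (c, d)) as taxon positions; rows are indexed by the
             states of (a, b), columns by the states of (c, d).
    """
    (a, b), (c, d) = split
    flat = np.zeros((16, 16))
    for col in columns:
        if any(ch not in INDEX for ch in col):
            continue  # assumption: ignore sites with gaps or ambiguity codes
        row = 4 * INDEX[col[a]] + INDEX[col[b]]
        column = 4 * INDEX[col[c]] + INDEX[col[d]]
        flat[row, column] += 1
    total = flat.sum()
    return flat / total if total else flat


sites = ["AACC", "AACC", "ACGT", "A-CT"]  # toy data; the last site is skipped
F = flattening(sites, ((0, 1), (2, 3)))
print(F.shape, F[0, 5])  # pattern AA|CC sits in row 0, column 5
```

Changing the `split` argument to `((0, 2), (1, 3))` or `((0, 3), (1, 2))` rearranges the same counts into the matrices for the other two splits.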
Important Concepts
• The RANK of a matrix A is the size of the largest collection of linearly independent columns (or rows) of A.
• SVD: The singular value decomposition of a matrix A is the factorization of A into the product of three matrices, $A = U D V^T$, where the columns of U and V are orthonormal and the matrix D is diagonal with non-negative real entries.
• Rank(A) equals the number of non-zero diagonal elements (singular values) in D.
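The link between rank and singular values can be checked numerically. The sketch below uses NumPy's `np.linalg.svd`; the helper `numerical_rank` and its relative tolerance are illustrative choices, since in floating point "non-zero" has to mean "above some small threshold".

```python
import numpy as np


def numerical_rank(matrix, rel_tol=1e-10):
    """Count singular values that are non-negligible relative to the largest one."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)  # sorted, largest first
    threshold = rel_tol * singular_values[0]
    return int(np.sum(singular_values > threshold)), singular_values


rng = np.random.default_rng(0)
# A 16 x 16 matrix built as a product through a 10-dimensional space has rank at most 10.
A = rng.random((16, 10)) @ rng.random((10, 16))
rank, s = numerical_rank(A)
print(rank)
print(s[10:])  # the six trailing singular values are zero up to rounding error
```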
Theorem
• [Chifman and Kubatko, 2014]. Let C denote the class of coalescent models under the four-state GTR model on a four-taxon binary species tree. For a valid split $L_1|L_2$, $\mathrm{rank}(\mathrm{Flat}_{L_1|L_2}(P)) \le 10$ for all distributions P arising from C. For a non-valid split $L_1|L_2$, $\mathrm{rank}(\mathrm{Flat}_{L_1|L_2}(P)) > 10$.
Approximation to Flattening
Finding the Best Split
• Calculate $\mathrm{Flat}_{L_1|L_2}(P')$, the flattening built from the observed site-pattern frequencies P', for all three possible splits.
• Calculate the rank of each of these three matrices. The true split will have rank <= 10.
• Getting the site-pattern counts and calculating the rank is not computationally intensive
• Can be run in parallel for different quartets
SVD Scores
• An SVD score of 0 implies $\mathrm{rank}(\mathrm{Flat}_{L_1|L_2}) \le 10$, hence $L_1|L_2$ is a valid split
• An SVD score > 0 implies $\mathrm{rank}(\mathrm{Flat}_{L_1|L_2}) > 10$, hence $L_1|L_2$ is an invalid split
• Choose the split with the lowest SVD score
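The slide that defined the score did not survive, so the sketch below rests on an assumption: the score is taken to be the Frobenius distance from the flattening to the nearest matrix of rank 10, that is, the square root of the sum of the squared 11th through 16th singular values. That reading is consistent with the bullets above (a score of 0 exactly when the rank is at most 10), but it is our reconstruction rather than something these slides state. The names `svd_score` and `best_split` are illustrative, and the matrices are synthetic stand-ins, not real flattenings.

```python
import numpy as np


def svd_score(flat, rank=10):
    """Assumed score: distance from `flat` to the nearest matrix of the given rank."""
    singular_values = np.linalg.svd(flat, compute_uv=False)  # sorted, largest first
    return float(np.sqrt(np.sum(singular_values[rank:] ** 2)))


def best_split(flattenings):
    """flattenings: dict mapping a split label to its 16 x 16 matrix."""
    scores = {label: svd_score(m) for label, m in flattenings.items()}
    return min(scores, key=scores.get), scores


# Synthetic stand-ins: one matrix of rank <= 10 and two matrices of full rank.
rng = np.random.default_rng(1)
low_rank = rng.random((16, 10)) @ rng.random((10, 16))
candidates = {
    "12|34": low_rank / low_rank.sum(),
    "13|24": rng.random((16, 16)) / 128.0,
    "14|23": rng.random((16, 16)) / 128.0,
}
winner, scores = best_split(candidates)
print(winner, scores)
```

With real data none of the three scores is exactly zero because of sampling noise, which is why the rule is "lowest score" rather than "score equal to zero".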
Suitable Data
• SVD scores are applicable to data where each site evolves independently, coming from a different locus
• However, the authors claim that this method also works well when each locus produces multiple sites, on both simulated and real-world data.
• Bootstrapping for a dataset consisting of M aligned sites:
– Re-sample columns with replacement M times
– Calculate SVD scores of the three splits for this data matrix
– Repeat this procedure B times
– Each bootstrap matrix votes for a particular split. The total number of votes for each split is its bootstrap support
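The bootstrap steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: states are coded as integers 0..3, each row of `sites` holds one alignment column for the four taxa, the score is assumed to be the root sum of squares of the six smallest singular values, and the toy alignment is constructed artificially so that 12|34 is the clear winner.

```python
import numpy as np

SPLITS = {
    "12|34": ((0, 1), (2, 3)),
    "13|24": ((0, 2), (1, 3)),
    "14|23": ((0, 3), (1, 2)),
}


def svd_score(sites, split, rank=10):
    """sites: integer array of shape (M, 4), one row per alignment column."""
    (a, b), (c, d) = split
    flat = np.zeros((16, 16))
    rows = 4 * sites[:, a] + sites[:, b]
    cols = 4 * sites[:, c] + sites[:, d]
    np.add.at(flat, (rows, cols), 1)  # accumulate site-pattern counts
    s = np.linalg.svd(flat / len(sites), compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))  # assumed form of the score


def bootstrap_support(sites, n_boot=100, seed=0):
    rng = np.random.default_rng(seed)
    votes = dict.fromkeys(SPLITS, 0)
    m = len(sites)
    for _ in range(n_boot):
        resampled = sites[rng.integers(0, m, size=m)]  # M columns, with replacement
        scores = {name: svd_score(resampled, sp) for name, sp in SPLITS.items()}
        votes[min(scores, key=scores.get)] += 1  # one vote per bootstrap matrix
    return votes


# Toy alignment: taxa 1 and 2 always agree, and taxa 3 and 4 always agree.
rng = np.random.default_rng(42)
x = rng.integers(0, 4, size=200)
y = rng.integers(0, 4, size=200)
sites = np.column_stack([x, x, y, y])
votes = bootstrap_support(sites, n_boot=20)
print(votes)
```

The vote counts divided by B give the bootstrap support for each split.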
Experiments
• Simulation Study
• Rattlesnake Multi-Locus Data
• Soybean SNP Data
Simulation Study
[Figure: model species tree on taxa 1, 2, 3, 4 with branch lengths labeled x]
• Generate a sample of g gene trees from the model species tree ((1:x,2:x):x,(3:x,4:x):x) under the coalescent model using the program COAL (Degnan and Salter), where x is the length of each branch.
• Generate sequence data of length n on each gene tree under a specified substitution model.
• Construct the flattening matrix for each of the three possible splits, and compute $\mathrm{SVD}(L_1|L_2)$ for each.
• Repeat 1000 times and record $\mathrm{SVD}(L_1|L_2)_k$, $k = 1, 2, \ldots, 1000$, for each split. For each of the 1000 datasets, generate B bootstrapped datasets and record $\mathrm{SVD}(L_1|L_2)_{k,b}$ for each split.
• Branch length x = 0.5, 1, 2
• g = 5000, n = 1: Simulate SNP data, one site per gene
• g = 10, n = 500: Simulate multiple sites per gene
• Substitution model: the Jukes–Cantor model (JC69) and the GTR model with a proportion of invariant sites and with gamma-distributed mutation rates across sites (GTR + I + Γ)
• n = 1, g = 1000, 5000, 10000: Check runtime for quartets
Results
• In all cases, there is good separation between the SVD scores of the valid split and those of the other two splits. The SVD score can be a good measure for finding the correct quartet tree for each quartet
• Longer branch lengths result in better separation of SVD scores for quartets.
• As expected, unlinked SNP data shows better separation than data with multiple sites per gene.
• Runtime is less than linear in the total number of site patterns. However, this runtime is only for quartets. Runtime for general n-taxon datasets is discussed later.
Results
• Experiments only cover a specific topology. Other quartet topologies with different branch lengths need to be tested, since certain topologies are known to be difficult to estimate
• Runtime is only measured for quartets. Running this in combination with quartet aggregation methods to estimate the species tree for n taxa is discussed later
• Other suitable values of g and n should be analyzed.
Rattlesnake Data
• Using SVD scores and Quartet MaxCut (QMC) on a dataset previously analyzed by Kubatko et al.
• 52 sequences with 8466 aligned nucleotide positions each in the complete data matrix
• Method:
– Randomly sample 20000 quartets from the 52 sequences
– Use SVD scores to infer the true quartet relationship for each quartet
– Apply QMC to get the species tree from the quartet trees
Results
• Produces similar findings on the rattlesnake data compared to the original analysis in Kubatko et al. (2011)
• The original analysis took ~10 days using BEAST, while using SVD scores took ~1 day without parallelizing
• 20000 quartets were sampled out of $\binom{52}{4} = 270725$ total quartets. Why random sampling? Using quartets that are more reliable may be better. The runtime for using all quartets or other sampling strategies should be analyzed
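The quartet count and the sampling fraction are easy to verify. The sketch below also shows one simple way to draw distinct random quartets without enumerating all of them; whether the original analysis sampled without repeats is not stated in these slides, so that part is an assumption, and `sample_quartets` is a made-up helper name.

```python
import math
import random

n_sequences = 52
total_quartets = math.comb(n_sequences, 4)
print(total_quartets)  # 270725


def sample_quartets(n_taxa, k, seed=0):
    """Draw k distinct quartets of taxon indices uniformly at random."""
    if k > math.comb(n_taxa, 4):
        raise ValueError("cannot sample more quartets than exist")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < k:
        # sorting makes each quartet a canonical tuple so repeats are detected
        chosen.add(tuple(sorted(rng.sample(range(n_taxa), 4))))
    return sorted(chosen)


quartets = sample_quartets(n_sequences, 20000)
fraction = len(quartets) / total_quartets
print(len(quartets), round(fraction, 4))
```

So the analysis used roughly 7% of all possible quartets, which is the context for the question about sampling strategy.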
Soybean Data
Soybean SNP Data
• Previously published SNP dataset originally analyzed by Lam et al. (2010)
• Compared with computation using SNAPP, which is suitable for SNP data
• SNAPP infers the species tree using the coalescent model and is designed for biallelic data consisting of unlinked SNPs. It bypasses gene trees and infers the species tree directly by Bayesian MCMC.
Results
• Produced results in agreement with the original findings
• SNAPP failed to converge even after 28 days.
• The SVD Quartets method with 100 bootstrap samples and 20000 quartets sampled per replicate required 600 hrs.
• Need to compare with other likelihood-based methods that perform better than SNAPP.
Conclusion
• SVD Quartets is an efficient algorithm that estimates the quartet tree for a set of four taxa
• Can be combined with a supertree method to get the species tree from multiple gene alignments without calculating gene trees explicitly
• Experiments so far lack breadth and depth. There is scope for more intensive experiments and for comparison with other methods that solve the same problem
Questions?