Quartet Inference from SNP Data Under the Coalescent Model
Julia Chifman and Laura Kubatko
By Shashank Yaduvanshi
Estimating Species Tree from Gene Sequences
• Input: Alignments from multiple genes
• Output: Unified species tree
• Challenges:
  – Every gene has its own phylogeny
  – Gene trees might vary from the species tree due to ILS, horizontal gene transfer, etc.
Phylogeny Estimation Methods under the Coalescent Model
• Used to model ILS in gene trees
• Summary based methods
  – Quartet based methods
• Concatenation methods
• Co-estimation methods
Summary Based Methods
• First estimate independent gene trees for each gene using methods like RAxML
• Second step is combining the gene trees to get the species tree by methods like ASTRAL
• Computationally efficient for large data sets
• Estimation error in gene trees will lower the overall accuracy
Quartet Based Methods
• Estimate the most likely true quartet tree for each set of four taxa using multi-gene sequences
• Combine all (or a subset) of these quartet trees using a supertree method to get the species tree
• Works on the entire data together while still remaining computationally efficient
Concatenation Methods
• Concatenate all gene sequence alignments to get one long sequence alignment for each taxon
• Get the species tree using these long alignments directly with methods such as ML
• Ignores differences in the gene trees for different genes
Co-estimation Methods
• Co-estimate gene trees and the species tree with methods such as Bayesian inference
• Generally higher accuracy than other methods
• Computationally inefficient for large datasets
Estimating Quartet Trees
• Most methods seen so far are distance-based or ML-based
• This paper introduces a new measure, SVD scores, that is based on the frequency of quartet patterns amongst all gene alignments
• SVD scores can be used to estimate the most likely quartet tree for any quartet of taxa
Important Concepts
• $p_{ijkl} = P(X_1 = i, X_2 = j, X_3 = k, X_4 = l)$: the probability that, at a given site, taxa 1 to 4 show states i, j, k, l
• A SPLIT of a taxon set L is a bipartition of L into two non-overlapping subsets $L_1$ and $L_2$, denoted $L_1|L_2$. VALID SPLIT $L_1|L_2$ for tree T: removing some edge of T produces the same bipartition $L_1|L_2$. If no such edge exists, the split is INVALID
• For taxon quartets, we consider only splits into two groups of two. There are 3 such splits for each quartet: 12|34, 13|24 and 14|23.
Flattening
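A minimal sketch of how a flattening matrix could be built, assuming the usual construction in which the rows of the 16x16 matrix are indexed by the joint states of the two taxa in L1 and the columns by the joint states of the two taxa in L2. The names `pattern_tensor`, `flattening` and `SPLITS` are hypothetical helpers for illustration, not part of any published software, and entries are estimated here from observed site-pattern frequencies.

```python
import numpy as np

STATES = "ACGT"
IDX = {s: i for i, s in enumerate(STATES)}

# The three splits of a quartet, as pairs of 0-based taxon indices.
SPLITS = {
    "12|34": ((0, 1), (2, 3)),
    "13|24": ((0, 2), (1, 3)),
    "14|23": ((0, 3), (1, 2)),
}

def pattern_tensor(seqs):
    """Relative frequency of each site pattern (i, j, k, l) for four aligned sequences."""
    counts = np.zeros((4, 4, 4, 4))
    for column in zip(*seqs):
        if all(c in IDX for c in column):  # skip gaps and ambiguity codes
            counts[tuple(IDX[c] for c in column)] += 1
    total = counts.sum()
    return counts / total if total else counts

def flattening(tensor, split):
    """16x16 matrix: rows = states of the taxa in L1, columns = states of the taxa in L2."""
    (a, b), (c, d) = split
    return np.transpose(tensor, (a, b, c, d)).reshape(16, 16)

# Toy alignment: taxa 1 and 2 agree at every site, as do taxa 3 and 4.
tensor = pattern_tensor(["AC", "AC", "GT", "GT"])
flat_12_34 = flattening(tensor, SPLITS["12|34"])
```

In this toy example the pattern AAGG lands in row "AA" (index 0) and column "GG" (index 10) of the 12|34 flattening, while under 13|24 the same pattern lands on the diagonal at row "AG", column "AG".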
Important Concepts
• The RANK of a matrix A is the size of the largest collection of linearly independent columns (or rows) of A.
• SVD: The singular value decomposition of a matrix A is the factorization of A into the product of three matrices $A = UDV^T$, where the columns of U and V are orthonormal and the matrix D is diagonal with non-negative real entries (the singular values).
• Rank(A) equals the number of non-zero diagonal elements (singular values) in D.
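As a small illustration of the link between rank and singular values, the sketch below counts the singular values that exceed a tolerance. The function name `numerical_rank` and the tolerance `tol` are illustrative assumptions: in floating point, singular values that should be zero usually come out as tiny positive numbers, so some cutoff is needed.

```python
import numpy as np

def numerical_rank(A, tol=1e-10):
    """Number of singular values of A larger than tol."""
    singular_values = np.linalg.svd(A, compute_uv=False)
    return int(np.sum(singular_values > tol))

# An outer product of two vectors has rank 1; the identity has full rank.
rank_one = np.outer([1.0, 2.0, 3.0], [1.0, 2.0])
print(numerical_rank(rank_one))
print(numerical_rank(np.eye(3)))
```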
Theorem
• [Chifman and Kubatko, 2014]. Let C denote the class of coalescent models under the four-state GTR model on a four-taxon binary species tree. For a valid split $L_1|L_2$, $\operatorname{rank}(\mathrm{Flat}_{L_1|L_2}(P)) \le 10$ for all distributions P arising from C. For a non-valid split $L_1|L_2$, $\operatorname{rank}(\mathrm{Flat}_{L_1|L_2}(P)) > 10$.
Approximation to Flattening
Finding the Best Split
• Calculate $\mathrm{Flat}_{L_1|L_2}(P')$ for all three possible splits, where P' holds the site-pattern frequencies counted from the data.
• Calculate the rank of each of these three matrices. The true split will have rank <= 10.
• Getting these counts and calculating the rank is not computationally intensive
• Can be run in parallel for different quartets
SVD Scores
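• SVD score 0 implies $\operatorname{rank}(\mathrm{Flat}_{L_1|L_2}(P')) \le 10$, hence $L_1|L_2$ is a valid split
• SVD score > 0 implies $\operatorname{rank}(\mathrm{Flat}_{L_1|L_2}(P')) > 10$, hence $L_1|L_2$ is an invalid split
• Choose the split with the lowest SVD score (with observed frequencies, scores are rarely exactly 0)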
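A hedged sketch of the scoring step. It assumes the SVD score of a 16x16 flattening is the square root of the sum of squares of its six smallest singular values, which is the Frobenius distance to the nearest matrix of rank at most 10. The formula itself did not survive in this transcript, so treat that definition as an assumption to check against Chifman and Kubatko (2014). The names `svd_score` and `best_split` are illustrative.

```python
import numpy as np

def svd_score(flat, rank=10):
    """Assumed score: Frobenius distance from `flat` to the nearest matrix of rank <= `rank`."""
    s = np.linalg.svd(flat, compute_uv=False)  # singular values in descending order
    return float(np.sqrt(np.sum(s[rank:] ** 2)))

def best_split(flattenings):
    """flattenings: dict mapping a split label to its 16x16 flattening matrix."""
    scores = {label: svd_score(flat) for label, flat in flattenings.items()}
    return min(scores, key=scores.get), scores

# Synthetic stand-ins for two flattenings (not real site-pattern data).
rng = np.random.default_rng(0)
low_rank = rng.random((16, 10)) @ rng.random((10, 16))  # rank <= 10 by construction
full_rank = rng.random((16, 16))                          # rank 16 with probability 1
winner, scores = best_split({"12|34": low_rank, "13|24": full_rank})
```

Here the low-rank matrix scores essentially zero and is selected, which mirrors the rule of choosing the split with the lowest score.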
Suitable Data
• SVD scores are applicable to data where each site evolves independently, coming from a different locus
• However, the authors claim that the method also works well when each locus produces multiple sites, on both simulated and real-world data.
• Bootstrapping for a dataset consisting of M aligned sites
  – Re-sample columns with replacement M times
  – Calculate SVD scores of the three splits for this data matrix
  – Repeat this procedure B times
  – Each bootstrap matrix votes for a particular split. Total votes for each split is its bootstrap support
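The bootstrap procedure above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the alignment is a hypothetical integer array of shape (4, M) with entries 0 to 3 encoding A, C, G, T, and the score is assumed to be the root sum of squares of the six smallest singular values of the flattening.

```python
import numpy as np

# Axis orders that put the taxa of L1 first and the taxa of L2 last.
SPLITS = {"12|34": (0, 1, 2, 3), "13|24": (0, 2, 1, 3), "14|23": (0, 3, 1, 2)}

def svd_score(flat, rank=10):
    s = np.linalg.svd(flat, compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)))

def split_scores(aln):
    """SVD score of each split for an integer alignment of shape (4, M)."""
    counts = np.zeros((4, 4, 4, 4))
    np.add.at(counts, tuple(aln), 1)          # count every site pattern
    freqs = counts / aln.shape[1]
    return {label: svd_score(np.transpose(freqs, axes).reshape(16, 16))
            for label, axes in SPLITS.items()}

def bootstrap_support(aln, n_boot=100, seed=0):
    """Votes per split over n_boot column-resampled copies of the alignment."""
    rng = np.random.default_rng(seed)
    m = aln.shape[1]
    votes = dict.fromkeys(SPLITS, 0)
    for _ in range(n_boot):
        cols = rng.integers(0, m, size=m)     # M columns drawn with replacement
        scores = split_scores(aln[:, cols])
        votes[min(scores, key=scores.get)] += 1
    return votes

# Toy data: taxa 1 and 2 are identical, taxa 3 and 4 are identical.
rng = np.random.default_rng(1)
x = rng.integers(0, 4, size=2000)
y = rng.integers(0, 4, size=2000)
toy_alignment = np.array([x, x, y, y])
votes = bootstrap_support(toy_alignment, n_boot=20)
```

On this deliberately extreme toy alignment every replicate should vote for 12|34, because that flattening has rank at most 4 while the other two are close to full rank.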
Experiments
• Simulation Study
• Rattlesnake Multi-Loci Data
• Soybean SNP Data
Simulation Study
[Figure: model species tree ((1,2),(3,4)) on taxa 1 to 4, with every branch labelled with length x]
Simulation Study
• Generate a sample of g gene trees under the coalescent model from the model species tree ((1:x,2:x):x,(3:x,4:x):x), where x is the length of each branch, using the program COAL (Degnan and Salter).
• Generate sequence data of length n on each gene tree under a specified substitution model.
• Construct the flattening matrix for each of the three possible splits, and compute $\mathrm{SVD}(L_1|L_2)$ for each
• Repeat 1000 times and record $\mathrm{SVD}(L_1|L_2)_k$, $k = 1, 2, \ldots, 1000$, for each split. For each of the 1000 datasets, generate B bootstrapped datasets and record $\mathrm{SVD}(L_1|L_2)_{k,b}$ for each split.
Simulation Study
• x (branch length) = 0.5, 1, 2
• g = 5000, n = 1: simulate SNP data, one site per gene
• g = 10, n = 500: simulate multiple sites per gene
• Substitution model: Jukes–Cantor model (JC69) and the GTR model with a proportion of invariant sites and with gamma-distributed mutation rates across sites (GTR + I + Γ)
• n = 1, g = 1000, 5000, 10000: check runtime for quartets
Results
• In all cases, there is good separation between the SVD score of the valid split and those of the other two splits, so the SVD score can be a good measure for finding the correct quartet tree for each quartet
• Longer branch lengths result in better separation of SVD scores for quartets.
• As expected, unlinked SNP data has better separation than data with multiple sites per gene.
• Runtime is less than linear in the total number of site patterns. However, this runtime is only for quartets. Runtime for general n-taxon datasets is discussed later.
Results
• Experiments cover only one specific topology. Other quartet topologies with different branch lengths should be tested, since certain topologies are known to be difficult to estimate
• Runtime is measured only for quartets. Running this in combination with quartet aggregation methods to estimate a species tree for n taxa is discussed later
• Other suitable values of g and n should be analyzed.
Rattlesnake Data
• Using SVD scores and QMC on a dataset previously analyzed by Kubatko et al.
• 52 sequences with 8466 aligned nucleotide positions each in the complete data matrix
• Method
  – Randomly sample 20000 quartets from the 52 sequences
  – Use SVD scores to infer the true quartet relationship for each quartet
  – Apply QMC to get the species tree from the quartet trees
Results
• Produces findings on the rattlesnake data similar to the original analysis in Kubatko et al. (2011)
• The original analysis took ~10 days using BEAST, while using SVD scores took ~1 day without parallelizing
• 20000 quartets sampled out of $\binom{52}{4} = 270725$ total quartets. Why random sampling? Using quartets that are more reliable may be better. Runtime for using all quartets or other sampling strategies should be analyzed
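For reference, the quartet count and a uniform random sample of distinct quartets can be reproduced with a few lines. The function `sample_quartets` is a hypothetical helper, and simple rejection of duplicates is only one possible sampling strategy, not necessarily the one used in the paper.

```python
import math
import random

n_taxa = 52
total_quartets = math.comb(n_taxa, 4)
print(total_quartets)  # 270725

def sample_quartets(n_taxa, k, seed=0):
    """Draw k distinct quartets (sorted 4-tuples of taxon indices) uniformly at random."""
    if k > math.comb(n_taxa, 4):
        raise ValueError("cannot sample more quartets than exist")
    rng = random.Random(seed)
    seen = set()
    while len(seen) < k:
        seen.add(tuple(sorted(rng.sample(range(n_taxa), 4))))
    return sorted(seen)

quartets = sample_quartets(n_taxa, 20000)
fraction_sampled = len(quartets) / total_quartets
```

With 20000 of 270725 quartets, the sample covers a little over 7% of all quartets.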
Soybean SNP Data
• Previously published SNP dataset originally analyzed by Lam et al. (2010)
• Compared with computation using SNAPP, which is suitable for SNP data
• SNAPP infers the species tree using the coalescent model and is designed for biallelic data consisting of unlinked SNPs. It bypasses gene trees and estimates the species tree directly from the markers with Bayesian MCMC.
Results
• Produced results in agreement with the original findings
• SNAPP failed to converge even after 28 days.
• The SVD Quartets method with 100 bootstrap samples and 20000 quartets sampled per replicate required 600 hrs.
• Need to compare with other likelihood-based methods that perform better than SNAPP.
Conclusion
• SVD Quartets is an efficient algorithm that estimates quartet trees for a 4-taxon set
• Can be combined with a supertree method to get a species tree from multiple gene alignments without calculating gene trees explicitly
• Experiments so far lack breadth and depth. There is scope for more intensive experiments and for comparison with other methods solving the same problem
Questions?