Quartet Inference from SNP Data Under the Coalescent Model, Julia Chifman and Laura Kubatko. By Shashank Yaduvanshi

Post on 01-Jan-2017

TRANSCRIPT

Page 1: Quartet Inference from SNP Data Under the Coalescent Model

Quartet Inference from SNP Data Under the Coalescent Model

Julia Chifman and Laura Kubatko

By Shashank Yaduvanshi

Page 2: Quartet Inference from SNP Data Under the Coalescent Model

Estimating Species Trees from Gene Sequences

• Input: Alignments from multiple genes

• Output: Unified species tree

• Challenges:
– Every gene has its own phylogeny
– Gene trees might differ from the species tree due to ILS (incomplete lineage sorting), horizontal gene transfer, etc.

Page 3: Quartet Inference from SNP Data Under the Coalescent Model

Phylogeny Estimation Methods under the Coalescent Model

• The coalescent model is used to model ILS in gene trees

• Summary-based methods
– Quartet-based methods

• Concatenation methods

• Co-estimation methods

Page 4: Quartet Inference from SNP Data Under the Coalescent Model

Summary-Based Methods

• First, estimate a gene tree independently for each gene using methods like RAxML

• Second, combine the gene trees into a species tree using methods like ASTRAL

• Computationally efficient for large datasets

• Estimation error in the gene trees lowers the overall accuracy

 

Page 5: Quartet Inference from SNP Data Under the Coalescent Model

Quartet-Based Methods

• Estimate the most likely true quartet tree for each set of 4 taxa using multi-gene sequences

• Combine all (or a subset) of these quartet trees using a supertree method to get the species tree

• Works on the entire dataset at once while still remaining computationally efficient

Page 6: Quartet Inference from SNP Data Under the Coalescent Model

Concatenation Methods

• Concatenate all gene sequence alignments to get one long sequence alignment for each taxon

• Get the species tree directly from these long alignments with methods such as ML

• Ignores differences in the gene trees for different genes
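The concatenation step above can be sketched in a few lines. This is only an illustration: the function name `concatenate`, the dict-based alignment format, and the gap padding for taxa missing from a gene are assumptions, not something the slides specify.

```python
def concatenate(alignments):
    """Join per-gene alignments (dicts: taxon -> sequence) into one sequence per taxon."""
    taxa = sorted(set().union(*alignments))
    result = {taxon: "" for taxon in taxa}
    for alignment in alignments:
        length = len(next(iter(alignment.values())))
        for taxon in taxa:
            # Assumption: a taxon missing from a gene is padded with gaps.
            result[taxon] += alignment.get(taxon, "-" * length)
    return result


gene1 = {"A": "ACGT", "B": "ACGA"}
gene2 = {"A": "GG", "B": "GC"}
supermatrix = concatenate([gene1, gene2])
print(supermatrix)  # {'A': 'ACGTGG', 'B': 'ACGAGC'}
```

The resulting supermatrix is then analyzed as if all sites shared one tree, which is exactly the simplification the last bullet points out.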

Page 7: Quartet Inference from SNP Data Under the Coalescent Model

Co-estimation Methods

• Co-estimate gene trees and the species tree with methods such as Bayesian inference

• Generally higher accuracy than other methods

• Computationally inefficient for large datasets

Page 8: Quartet Inference from SNP Data Under the Coalescent Model

Estimating Quartet Trees

• Most methods seen so far are distance-based or ML-based

• This paper introduces a new measure, SVD scores, which is based on the frequency of quartet patterns among all gene alignments

• SVD scores can be used to estimate the most likely quartet tree for any quartet of taxa

Page 9: Quartet Inference from SNP Data Under the Coalescent Model

Important Concepts

• $p_{ijkl} = P(X_1 = i, X_2 = j, X_3 = k, X_4 = l)$, the probability that taxa 1 to 4 show states i, j, k and l at a site

• A SPLIT of a taxon set L is a bipartition of L into two non-overlapping subsets L1 and L2, denoted L1|L2. A split L1|L2 is VALID for a tree T if some edge in T induces the same bipartition L1|L2. If no such edge exists, the split is INVALID.

• For quartets of taxa, we consider the splits that divide the four taxa into two groups of two. There are 3 such splits for each quartet.
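As a small illustration of the definitions above, the sketch below enumerates the three two-versus-two splits of a quartet and checks which one is valid for the tree ((1,2),(3,4)). The helper names (`quartet_splits`, `induced_split`, `label`) are made up for this example.

```python
def quartet_splits(taxa):
    """Return the three 2|2 splits of a set of four taxa."""
    a, b, c, d = taxa
    pairings = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    return [frozenset({frozenset(left), frozenset(right)}) for left, right in pairings]


def induced_split(cherry_one, cherry_two):
    """Split induced by the internal edge of the quartet tree (cherry_one, cherry_two)."""
    return frozenset({frozenset(cherry_one), frozenset(cherry_two)})


def label(split):
    """Readable name for a split, such as '12|34'."""
    sides = sorted("".join(str(t) for t in sorted(side)) for side in split)
    return "|".join(sides)


splits = quartet_splits([1, 2, 3, 4])
valid = induced_split((1, 2), (3, 4))  # the tree ((1,2),(3,4))
for s in splits:
    print(label(s), "valid" if s == valid else "invalid")
```

Only 12|34 is reported as valid here, because the single internal edge of a four-taxon binary tree induces exactly one two-versus-two split.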

   

Page 10: Quartet Inference from SNP Data Under the Coalescent Model

Flattening
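A minimal sketch of a flattening, assuming the usual construction for four nucleotide states: for a split L1|L2, the 16 rows are indexed by the joint states of the two taxa in L1, the 16 columns by the joint states of the two taxa in L2, and each entry holds the corresponding site pattern probability. The function name, the dict input format and the row/column ordering are illustrative choices.

```python
from itertools import product

STATES = "ACGT"
# Fixed ordering of the 16 state pairs: AA, AC, AG, AT, CA, ...
PAIR_INDEX = {pair: i for i, pair in enumerate(product(STATES, repeat=2))}


def flattening(pattern_probs, split):
    """Build the 16x16 flattening for a split.

    pattern_probs: dict mapping a 4-letter site pattern (states of taxa 0..3)
                   to its probability.
    split: ((r1, r2), (c1, c2)), taxon indices for the rows and the columns.
    """
    (r1, r2), (c1, c2) = split
    flat = [[0.0] * 16 for _ in range(16)]
    for pattern, prob in pattern_probs.items():
        row = PAIR_INDEX[(pattern[r1], pattern[r2])]
        col = PAIR_INDEX[(pattern[c1], pattern[c2])]
        flat[row][col] += prob
    return flat


probs = {"AACC": 0.5, "ACAC": 0.5}
flat_12_34 = flattening(probs, ((0, 1), (2, 3)))
flat_13_24 = flattening(probs, ((0, 2), (1, 3)))
print(flat_12_34[0][5], flat_13_24[1][1])  # 0.5 0.5
```

The same 256 probabilities are simply rearranged for each split; only the assignment of taxa to rows and columns changes.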

Page 11: Quartet Inference from SNP Data Under the Coalescent Model

Important Concepts

• The RANK of a matrix A is the size of the largest collection of linearly independent columns (or rows) of A.

• SVD: The singular value decomposition of a matrix A is the factorization of A into the product of three matrices, $A = U D V^T$, where the columns of U and V are orthonormal and the matrix D is diagonal with non-negative real entries (the singular values).

• Rank(A) equals the number of non-zero diagonal elements (singular values) in D.
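The relation between rank and singular values can be checked numerically. The sketch below uses NumPy's `numpy.linalg.svd`; the example matrix and the tolerance `tol` are arbitrary choices for illustration, since in floating point "non-zero" has to mean "above a small threshold".

```python
import numpy as np

# Row 2 is twice row 1, so only two rows are linearly independent.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 1.0, 1.0]])

U, d, Vt = np.linalg.svd(A)  # A = U @ diag(d) @ Vt, d sorted in decreasing order

tol = 1e-10  # assumed threshold for treating a singular value as zero
rank = int(np.sum(d > tol))

print(rank)                                  # 2
print(np.allclose(U @ np.diag(d) @ Vt, A))   # True
```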

Page 12: Quartet Inference from SNP Data Under the Coalescent Model

Theorem  

• [Chifman and Kubatko, 2014]. Let C denote the class of coalescent models under the four-state GTR model on a four-taxon binary species tree. For a valid split L1|L2, $\mathrm{rank}(\mathrm{Flat}_{L_1|L_2}(P)) \le 10$ for all distributions P arising from C. For a non-valid split L1|L2, $\mathrm{rank}(\mathrm{Flat}_{L_1|L_2}(P)) > 10$.

Page 13: Quartet Inference from SNP Data Under the Coalescent Model

Approximation to Flattening
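In practice the true site pattern probabilities are unknown, so they have to be estimated from data. A minimal sketch, assuming the estimate is simply the relative frequency of each observed site pattern in the alignment of the four taxa; the function name and the choice to skip columns containing gaps or ambiguity codes are assumptions made for this example.

```python
from collections import Counter


def pattern_frequencies(seqs):
    """Relative frequency of each site pattern in four aligned sequences."""
    columns = ["".join(col) for col in zip(*seqs)]
    # Assumption: columns with gaps or ambiguity codes are ignored.
    columns = [c for c in columns if set(c) <= set("ACGT")]
    counts = Counter(columns)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


seqs = ["AAGT",
        "AAGA",
        "ACG-",
        "ACGC"]
freqs = pattern_frequencies(seqs)
print(sorted(freqs))  # ['AAAA', 'AACC', 'GGGG']
```

These frequencies take the place of the probabilities $p_{ijkl}$ when the flattening matrices are filled in.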

Page 14: Quartet Inference from SNP Data Under the Coalescent Model

Finding the Best Split

• Calculate $\mathrm{Flat}_{L_1|L_2}(P')$ for all three possible splits, where P' is the distribution estimated from the observed site pattern counts.

• Calculate the rank of each of these three matrices. The true split should have rank ≤ 10.

• Getting these counts and calculating the rank is not computationally intensive

• Can be run in parallel for different quartets

Page 15: Quartet Inference from SNP Data Under the Coalescent Model

SVD Scores

• An SVD score of 0 implies $\mathrm{rank}(\mathrm{Flat}_{L_1|L_2}) \le 10$, hence L1|L2 is a valid split

• An SVD score > 0 implies $\mathrm{rank}(\mathrm{Flat}_{L_1|L_2}) > 10$, hence L1|L2 is an invalid split

• Choose the split with the lowest SVD score
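The slides do not reproduce the score formula. The sketch below assumes the score is the Frobenius distance from the estimated flattening to the nearest rank-10 matrix, i.e. the square root of the sum of squares of the 6 smallest singular values, which is consistent with the bullets above (score 0 exactly when the rank is at most 10). All function names and the toy frequencies are illustrative.

```python
from itertools import product

import numpy as np

STATES = "ACGT"
PAIR_INDEX = {pair: i for i, pair in enumerate(product(STATES, repeat=2))}
SPLITS = {
    "12|34": ((0, 1), (2, 3)),
    "13|24": ((0, 2), (1, 3)),
    "14|23": ((0, 3), (1, 2)),
}


def flattening(freqs, split):
    """16x16 matrix of pattern frequencies: rows = first pair, columns = second pair."""
    (r1, r2), (c1, c2) = split
    flat = np.zeros((16, 16))
    for pattern, f in freqs.items():
        row = PAIR_INDEX[(pattern[r1], pattern[r2])]
        col = PAIR_INDEX[(pattern[c1], pattern[c2])]
        flat[row, col] += f
    return flat


def svd_score(flat, k=10):
    """Assumed score: distance to the nearest rank-k matrix (singular values beyond the k-th)."""
    s = np.linalg.svd(flat, compute_uv=False)  # sorted in decreasing order
    return float(np.sqrt(np.sum(s[k:] ** 2)))


def best_split(freqs):
    scores = {name: svd_score(flattening(freqs, sp)) for name, sp in SPLITS.items()}
    return min(scores, key=scores.get), scores


# Toy data: taxa 1 and 2 always agree, taxa 3 and 4 always agree.
freqs = {x + x + y + y: 1 / 16 for x in STATES for y in STATES}
best, scores = best_split(freqs)
print(best)  # 12|34
```

In this toy case the 12|34 flattening has rank 1 and a score of essentially zero, while the other two flattenings are diagonal with full rank 16 and a clearly positive score.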

Page 16: Quartet Inference from SNP Data Under the Coalescent Model

Suitable Data

• SVD scores are applicable to data where each site evolves independently, coming from a different locus

• However, the authors claim that this method also works well when each locus produces multiple sites, on both simulated and real-world data.

• Bootstrapping for a dataset consisting of M aligned sites
– Re-sample columns with replacement M times
– Calculate the SVD scores of the three splits for this data matrix
– Repeat this procedure B times
– Each bootstrap matrix votes for a particular split. The total number of votes for each split is its bootstrap support
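The bootstrap procedure above can be sketched as follows. To keep the example short and dependency-free, the split-selection step is passed in as a function, and the demo uses `toy_choose_split`, a deliberately simple stand-in that counts columns supporting each split; it is not the SVD score. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def bootstrap_support(columns, choose_split, B=100, seed=0):
    """Resample M columns with replacement B times and count the votes per split."""
    rng = random.Random(seed)
    M = len(columns)
    votes = Counter()
    for _ in range(B):
        resampled = rng.choices(columns, k=M)  # M draws with replacement
        votes[choose_split(resampled)] += 1
    return votes


def toy_choose_split(columns):
    """Stand-in for the SVD-score step: pick the split supported by the most columns."""
    support = {"12|34": 0, "13|24": 0, "14|23": 0}
    for a, b, c, d in columns:
        if a == b and c == d and a != c:
            support["12|34"] += 1
        elif a == c and b == d and a != b:
            support["13|24"] += 1
        elif a == d and b == c and a != b:
            support["14|23"] += 1
    return max(support, key=support.get)


columns = ["AACC"] * 30 + ["ACAC"] * 5 + ["AAAA"] * 65
votes = bootstrap_support(columns, toy_choose_split, B=200)
print(votes.most_common(1)[0][0])  # 12|34
```

Replacing `toy_choose_split` with a function that builds the three flattenings and returns the split with the lowest SVD score gives the procedure described in the bullets.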

Page 17: Quartet Inference from SNP Data Under the Coalescent Model

Experiments  

• Simulation Study

• Rattlesnake Multi-Locus Data

•  Soybean  SNP  Data  

Page 18: Quartet Inference from SNP Data Under the Coalescent Model

Simulation Study

[Figure: four-taxon model species tree on taxa 1, 2, 3 and 4, with each branch labeled x]

Page 19: Quartet Inference from SNP Data Under the Coalescent Model

Simulation Study

• Generate a sample of g gene trees under the coalescent model from the model species tree ((1:x,2:x):x,(3:x,4:x):x), where x is the length of each branch, using the program COAL (Degnan and Salter).

• Generate sequence data of length n on each gene tree under a specified substitution model.

• Construct the flattening matrix for each of the three possible splits, and compute SVD(L1|L2) for each

• Repeat 1000 times and record $\mathrm{SVD}(L_1|L_2)_k$, k = 1, 2, ..., 1000, for each split. For each of the 1000 datasets, generate B bootstrapped datasets and record $\mathrm{SVD}(L_1|L_2)_{k,b}$ for each split.

Page 20: Quartet Inference from SNP Data Under the Coalescent Model

Simulation Study

• x (branch length) = 0.5, 1, 2

• g = 5000, n = 1: Simulate SNP data, one site per gene

• g = 10, n = 500: Simulate multiple sites per gene

• Substitution model: Jukes–Cantor model (JC69) and the GTR model with a proportion of invariant sites and with gamma-distributed mutation rates across sites (GTR + I + Γ)

• n = 1, g = 1000, 5000, 10000: Check runtime for quartets

Page 21: Quartet Inference from SNP Data Under the Coalescent Model

Results  

Page 22: Quartet Inference from SNP Data Under the Coalescent Model

Results  

Page 23: Quartet Inference from SNP Data Under the Coalescent Model

Results  

Page 24: Quartet Inference from SNP Data Under the Coalescent Model

Results

• In all cases, there is good separation between the SVD scores of the valid split and those of the other two splits. The SVD score can be a good measure for finding the correct quartet tree for each quartet

• Longer branch lengths result in better separation of SVD scores for quartets.

• As expected, unlinked SNP data has better separation than data with multiple sites per gene.

• Runtime is less than linear in the total number of site patterns. However, this runtime is only for quartets. Runtime for general n-taxon datasets is discussed later.

Page 25: Quartet Inference from SNP Data Under the Coalescent Model

Results  

• Experiments only cover one specific topology. Other quartet topologies with different branch lengths need to be tested, as certain topologies are known to be difficult to estimate

• Runtime is only measured for quartets. Running this in combination with quartet aggregation methods to estimate a species tree for n taxa is discussed later

• Other suitable values of g and n should be analyzed.

Page 26: Quartet Inference from SNP Data Under the Coalescent Model

Rattlesnake Data

Page 27: Quartet Inference from SNP Data Under the Coalescent Model

Rattlesnake Data

• SVD scores and QMC (Quartet MaxCut) applied to a dataset previously analyzed by Kubatko et al.

• 52 sequences with 8466 aligned nucleotide positions each in the complete data matrix

• Method
– Randomly sample 20000 quartets from the 52 sequences
– Use SVD scores to infer the true quartet relationship for each quartet
– Apply QMC to get the species tree from the quartet trees

Page 28: Quartet Inference from SNP Data Under the Coalescent Model

Results  

• Produces findings on the rattlesnake data similar to those of the original analysis in Kubatko et al. (2011)

• The original analysis took ~10 days using BEAST, while using SVD scores took ~1 day without parallelization

• 20000 quartets were sampled out of $\binom{52}{4} = 270725$ total quartets. Why random sampling? Using quartets that are more reliable may be better. Runtime should be analyzed for using all quartets or other sampling strategies
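The quartet count quoted above is easy to verify, and random sampling of quartets can be sketched in a few lines. Whether the original analysis sampled quartets without repetition is not stated in the slides; the sketch below assumes distinct quartets, and the seed is arbitrary.

```python
import math
import random

n_taxa = 52
total = math.comb(n_taxa, 4)  # number of ways to choose 4 of 52 sequences
print(total)  # 270725

rng = random.Random(1)
sampled = set()
while len(sampled) < 20000:
    # Assumption: each quartet is a sorted tuple of 4 distinct sequence
    # indices, and no quartet is sampled twice.
    sampled.add(tuple(sorted(rng.sample(range(n_taxa), 4))))

print(len(sampled) / total)  # fraction of all quartets that is analyzed
```

With these numbers roughly 7% of all possible quartets are analyzed, which is the background to the question about sampling strategy raised in the last bullet.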

 

Page 29: Quartet Inference from SNP Data Under the Coalescent Model

Soybean  Data  

Page 30: Quartet Inference from SNP Data Under the Coalescent Model

Soybean SNP Data

• Previously published SNP dataset originally analyzed by Lam et al. (2010)

• Compared with computation using SNAPP, which is suitable for SNP data

• SNAPP infers the species tree using the coalescent model and is designed for biallelic data consisting of unlinked SNPs. It bypasses gene trees and computes species tree likelihoods directly from the markers within a Bayesian MCMC analysis.

   

Page 31: Quartet Inference from SNP Data Under the Coalescent Model

Results  

• Produced results in agreement with the original findings

• SNAPP failed to converge even after 28 days.

• The SVD Quartets method with 100 bootstrap samples and 20000 quartets sampled per replicate required 600 hrs.

• Need to compare with other methods that are better than SNAPP.

Page 32: Quartet Inference from SNP Data Under the Coalescent Model

Conclusion

• SVD Quartets is an efficient algorithm that estimates the quartet tree for a set of 4 taxa

• Can be combined with a supertree method to get a species tree from multiple gene alignments without calculating gene trees explicitly

• Experiments so far lack breadth and depth. There is scope for more intensive experiments and for comparison with other methods solving the same problem

Page 33: Quartet Inference from SNP Data Under the Coalescent Model

Questions?