Taxonomic identification and phylogenetic profiling · tandy.cs.illinois.edu/nam_michigan.pdf · 2015-05-18

Taxonomic identification and phylogenetic profiling

Nam-phuong Nguyen
Carl R. Woese Institute for Genomic Biology
University of Illinois at Urbana-Champaign

Joint work with Siavash Mirarab, Mihai Pop, and Tandy Warnow

Metagenomics

•  Culture-independent method for studying a microbiome
•  Extract genetic material directly from the environment
•  Applications to biofuel production, agriculture, human health
•  Sequencing technology produces millions of short reads from unknown species
•  Fundamental steps in the analysis are identifying the taxon of each read and estimating a population profile of the sample

[Image courtesy of Human Microbiome Project]

Taxonomic Identification and Profiling

•  Taxonomic identification
    •  Objective: Given a query sequence, identify the taxon (species, genus, family, etc.) of the sequence
    •  Classification problem
•  Taxonomic profiling
    •  Objective: Given a set of query sequences collected from a sample, estimate the population profile of the sample
    •  Estimation problem
    •  Can be solved via taxonomic identification

Taxonomic Identification Methods

•  Sequence similarity search
    •  Classifies by finding the most similar sequence
    •  Classifies fragments from any region of the genome
    •  BLAST
•  Composition-based methods
    •  Typically use k-mers
    •  Classify fragments from any region of the genome
    •  PhymmBL, NBC
•  Phylogeny-based methods
    •  Classify fragments by using a phylogeny
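To make the composition-based idea concrete, here is a minimal sketch that compares k-mer count profiles. It assumes a toy nearest-profile rule based on cosine similarity; the function names, the reference sequences, and the choice of k are all hypothetical, and this is not how PhymmBL or NBC are implemented.

```python
from collections import Counter
from math import sqrt

def kmer_counts(seq, k=3):
    """Count the overlapping k-mers in a DNA string."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine(a, b):
    """Cosine similarity between two k-mer count profiles."""
    dot = sum(a[x] * b[x] for x in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(read, references, k=3):
    """Return the reference label whose k-mer profile is closest to the read."""
    read_profile = kmer_counts(read, k)
    return max(
        references,
        key=lambda name: cosine(read_profile, kmer_counts(references[name], k)),
    )

# Hypothetical reference sequences, for illustration only.
references = {
    "taxonA": "ACGTACGTACGTACGT",
    "taxonB": "GGGCCCGGGCCCGGGCCC",
}
label = classify("ACGTACG", references)
```

Because the profile ignores where a k-mer occurs, a fragment from any region of a genome can be scored, which matches the bullet above.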

Phylogeny-based taxonomic identification

Fragmentary reads (60-200 bp long):
ACCG, CGAG, CGG, GGCT, TAGA, GGGGG, TCGAG, GGCG, GGG, ..., ACCT

Known full-length gene sequences, and an alignment and a tree (500-10,000 bp long):
ACT..TAGA..A (species5)
AGC...ACA (species4)
TAGA...CTT (species3)
TAGC...CCA (species2)
AGG...GCAT (species1)


Phylogenetic Placement

•  Input: (Backbone) alignment and tree on full-length sequences, and a query sequence (short read)
•  Output: Placement of the query sequence on the backbone tree
•  Use placement to infer the relationship between the query sequence and the full-length sequences in the backbone tree
•  Applications in metagenomic analysis
    •  Millions of reads
    •  Reads from different genomes mixed together
    •  Use placement to identify each read

Align Sequence

[Figure: backbone tree with leaves S1, S4, S2, S3]

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = TAAAAC

Align Sequence

[Figure: backbone tree with leaves S1, S4, S2, S3]

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------

Place Sequence

[Figure: backbone tree with leaves S1, S4, S2, S3; Q1 is placed on the tree]

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------
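The extended alignment shown on these slides keeps the backbone rows unchanged and pads the query with gaps to the backbone width. A minimal sanity check, using the rows from the slide; the helper name `is_valid_extension` is hypothetical and not part of any placement tool:

```python
# Backbone alignment and query rows copied from the slide.
backbone = {
    "S1": "-AGGCTATCACCTGACCTCCA-AA",
    "S2": "TAG-CTATCAC--GACCGC--GCA",
    "S3": "TAG-CT-------GACCGC--GCT",
    "S4": "TAC----TCAC--GACCGACAGCT",
}
raw_query = "TAAAAC"
aligned_query = "-------T-A--AAAC--------"

def is_valid_extension(backbone, aligned_query, raw_query):
    """Check that the aligned query fits the backbone and keeps its residues."""
    widths = {len(row) for row in backbone.values()}
    # All backbone rows and the aligned query must share one width.
    same_width = len(widths) == 1 and len(aligned_query) in widths
    # Removing the gaps must give back the original read, in order.
    same_residues = aligned_query.replace("-", "") == raw_query
    return same_width and same_residues

ok = is_valid_extension(backbone, aligned_query, raw_query)
```

This only verifies the form of the extended alignment; choosing where the gaps go is the job of the aligner, and choosing the edge is the job of the placement step.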

Phylogenetic Placement

•  Align each query sequence to backbone alignment:
    •  HMMALIGN (Eddy, Bioinformatics 1998)
    •  PaPaRa (Berger and Stamatakis, Bioinformatics 2011)
•  Place each query sequence into backbone tree, using extended alignment:
    •  pplacer (Matsen et al., BMC Bioinformatics 2010)
    •  EPA (Berger et al., Systematic Biology 2011)


HMMER and PaPaRa results

[Figure: results plotted against increasing rate of evolution. Backbone size: 500; 5000 fragments; 20 replicates]

Old approach using single HMM

[Figure: a single HMM (HMM 1) covering a backbone tree with large evolutionary diameter]

New approach

[Figure: the backbone tree split into regions of smaller evolutionary diameter, each with its own HMM (HMM 1, HMM 2, HMM 3, HMM 4)]
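The slides do not spell out how the backbone tree is divided into the regions that get their own HMMs. The sketch below assumes one plausible scheme: cut the tree at the edge that splits the leaves most evenly, and recurse until every subset has at most `max_leaves` sequences. The tree, the function names, and the size limit are hypothetical.

```python
def side_of_edge(adj, start, blocked):
    """Nodes reachable from `start` once the edge start-blocked is cut."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nbr in adj[node]:
            if nbr in seen or (node == start and nbr == blocked):
                continue
            seen.add(nbr)
            stack.append(nbr)
    return seen

def decompose(adj, leaves, max_leaves):
    """Split the leaf set of an unrooted tree into subsets of bounded size.

    `adj` maps each node to its neighbours, `leaves` is the set of sequence
    names, and `max_leaves` must be at least 1.
    """
    if len(leaves) <= max_leaves:
        return [sorted(leaves)]
    best_side, best_gap = None, None
    for u in adj:
        for v in adj[u]:
            side = side_of_edge(adj, u, v)
            # How unevenly does cutting u-v divide the leaves?
            gap = abs(len(leaves) - 2 * len(side & leaves))
            if best_gap is None or gap < best_gap:
                best_side, best_gap = side, gap
    subsets = []
    for part in (best_side, set(adj) - best_side):
        sub_adj = {n: [m for m in adj[n] if m in part] for n in part}
        subsets.extend(decompose(sub_adj, leaves & part, max_leaves))
    return subsets

# Hypothetical four-leaf backbone: (S1, S2) on one side, (S3, S4) on the other.
tree = {
    "S1": ["X"], "S2": ["X"], "S3": ["Y"], "S4": ["Y"],
    "X": ["S1", "S2", "Y"], "Y": ["S3", "S4", "X"],
}
subsets = decompose(tree, {"S1", "S2", "S3", "S4"}, max_leaves=2)
```

Each resulting subset would then get its own profile HMM built from the corresponding rows of the backbone alignment. The quadratic edge search is kept for readability and would need replacing for large trees.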

SEPP (10% rule) Simulated Results

[Figure: results plotted against increasing rate of evolution. Backbone size: 500; 5000 fragments; 20 replicates]

Using SEPP

Fragmentary unknown reads (60-200 bp long):
ACCG, CGAG, CGG, GGCT, TAGA, GGGGG, TCGAG, GGCG, GGG, ..., ACCT

Known full-length sequences, and an alignment and a tree (500-10,000 bp long):
ACT..TAGA..A (species5)
AGC...ACA (species4)
TAGA...CTT (species3)
TAGC...CCA (species2)
AGG...GCAT (species1)

[Figure: a read placed on the tree. ML placement: 40%]

Taxonomic Identification using Phylogenetic Placement: Adding Uncertainty

[Figure: the same reads and backbone tree as on the previous slide, now with two candidate placements. ML placement: 40%; 2nd highest likelihood placement: 38%]


TIPP

Nguyen et al. Bioinformatics 2014

TIPP for Taxonomic Profiling

•  Marker-based abundance profiler
    •  Uses a collection of single copy housekeeping genes
    •  Only fragments binned to marker genes are classified
•  Profiling algorithm
    •  Bin fragments to marker genes
    •  Classify fragments binned to each marker
    •  Pool all classified reads
    •  Estimate abundance profile on pooled reads
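The last two steps of the profiling algorithm (pooling and abundance estimation) can be sketched as follows. This assumes binning and classification have already produced one taxon label per fragment, with `None` for fragments left unclassified; the marker names and labels are hypothetical, and the sketch is not TIPP's actual code.

```python
from collections import Counter

def abundance_profile(classified_by_marker):
    """Pool classified fragments across markers and return relative abundances."""
    pooled = Counter()
    for labels in classified_by_marker.values():
        # Unclassified fragments (None) do not contribute to the profile.
        pooled.update(label for label in labels if label is not None)
    total = sum(pooled.values())
    if total == 0:
        return {}
    return {taxon: count / total for taxon, count in pooled.items()}

# Hypothetical classification results for two marker genes.
classified = {
    "marker1": ["A", "A", None],
    "marker2": ["A", "B"],
}
profile = abundance_profile(classified)
```

Fragments that were never binned to a marker simply do not appear in the input, which mirrors the slide's point that only fragments binned to marker genes are classified.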

5/14/14

Taxonomic Profiling Experimental Design

•  Datasets
    •  Easy conditions (low error rates, known genomes)
    •  Hard conditions (novel genomes, high error rates)
•  Methods
    •  Marker-based: TIPP, MetaPhyler, mOTU, MetaPhlAn
    •  Genome-based: NBC, PhymmBL
•  Measured distance to true profile as error metric


“Easy”  genome  datasets  

“Hard” genome datasets

High indel datasets containing known genomes

Note: NBC, MetaPhlAn, and MetaPhyler cannot classify any sequences from at least one of the high indel long sequence datasets. mOTU terminates with an error message on all the high indel datasets.


“Novel”  genome  datasets  

Note:  mOTU  terminates  with  an  error  message  on  the  long  fragment  datasets  and  high  indel  datasets.  

Summary

•  TIPP: marker-based taxonomic identification and classification method through phylogenetic placement
•  Very robust to sequencing errors and novel genomes
•  Results in overall more accurate profiles
•  Accurate profiles can be obtained by classifying reads from the marker genes

Acknowledgements

Siavash Mirarab, Tandy Warnow, Mihai Pop, Bo Liu

Supported by NSF DEB 0733029

University of Alberta

SEPP/TIPP/UPP

SEPP/UPP/TIPP site: https://github.com/smirarab/sepp/
Instructions for installing UPP: https://github.com/smirarab/sepp/blob/master/tutorial/upp-tutorial.md
Instructions for installing TIPP: https://github.com/smirarab/sepp/blob/master/tutorial/tipp-tutorial.md

References:
1) N. Nguyen, S. Mirarab, K. Kumar, and T. Warnow. Ultra-large alignments using phylogeny-aware profiles. Proceedings of Research in Computational Biology (RECOMB) 2015, and to appear in Genome Biology 2015.
2) N. Nguyen, S. Mirarab, B. Liu, M. Pop, and T. Warnow. TIPP: Taxonomic Identification and Phylogenetic Profiling. Bioinformatics, 2014, 30 (24): 3548-3555.
3) S. Mirarab, N. Nguyen, and T. Warnow. SEPP: SATe-Enabled Phylogenetic Placement. Pacific Symposium on Biocomputing, 2012.

Place Sequence

[Figure: backbone tree with leaves S1, S4, S2, S3 and query sequences Q1, Q2, Q3 placed on it]

S1 = -AGGCTATCACCTGACCTCCA-AA
S2 = TAG-CTATCAC--GACCGC--GCA
S3 = TAG-CT-------GACCGC--GCT
S4 = TAC----TCAC--GACCGACAGCT
Q1 = -------T-A--AAAC--------

Query sequences are aligned and placed independently


16S Identification

[Figure: taxa A and B with their 16S gene copies]

True Abundance A: 67% B: 33%

Estimated Abundance A: 50% B: 50%

Single copy gene

[Figure: taxa A and B with a single copy gene]

True Abundance A: 67% B: 33%

Estimated Abundance A: 67% B: 33%
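The two abundance slides can be reproduced with a small calculation. The sketch assumes that reads are sampled in proportion to the number of gene copies present, that the sample holds two cells of A and one of B, and that B carries two 16S copies per genome while A carries one; these counts are inferred to match the percentages on the slides and are not stated there.

```python
def estimated_abundance(cells, copies_per_genome):
    """Relative abundance inferred from reads of one gene.

    Assumes the read count for a taxon is proportional to
    (number of cells) x (copies of the gene per genome).
    """
    reads = {taxon: cells[taxon] * copies_per_genome[taxon] for taxon in cells}
    total = sum(reads.values())
    return {taxon: count / total for taxon, count in reads.items()}

cells = {"A": 2, "B": 1}        # true abundance: A 67%, B 33%
copies_16s = {"A": 1, "B": 2}   # assumed 16S copy numbers
single_copy = {"A": 1, "B": 1}  # a single copy marker gene

from_16s = estimated_abundance(cells, copies_16s)
from_single_copy = estimated_abundance(cells, single_copy)
```

Under these assumptions the 16S estimate comes out at 50% each, while the single copy gene recovers the true 67% and 33% split, which is the motivation for profiling with single copy marker genes.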

TIPP: Taxonomic identification and Phylogenetic Profiling

•  Developers: Nguyen, Mirarab, Pop, and Warnow
•  SEPP takes the best extended alignment and finds the ML placement.
•  Modify SEPP to use uncertainty:
    •  Take as many alignments as necessary to reach the alignment support threshold
    •  Classify the query sequence at the node with sufficient placement support
•  Nguyen et al. Bioinformatics 2014
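The placement-support half of this idea can be sketched as follows: start at the most specific rank and move up until the candidate placements agreeing on one taxon sum to at least the threshold. The 40% and 38% values come from the earlier slide; the lineages, the third placement, and the threshold values are hypothetical, and the alignment support threshold is not modelled here.

```python
from collections import defaultdict

def classify_with_support(placements, threshold):
    """Return the most specific taxon whose summed placement support
    reaches `threshold`, or None if no rank qualifies.

    `placements` is a non-empty list of (lineage, probability) pairs,
    where a lineage is a tuple ordered from the broadest rank to the
    most specific one.
    """
    depth = max(len(lineage) for lineage, _ in placements)
    for level in range(depth, 0, -1):
        support = defaultdict(float)
        for lineage, prob in placements:
            if len(lineage) >= level:
                # Placements sharing the same lineage prefix pool their support.
                support[lineage[:level]] += prob
        best = max(support, key=support.get)
        if support[best] >= threshold:
            return best[-1]
    return None

# Hypothetical lineages; 0.40 and 0.38 are the two placements from the slide.
placements = [
    (("Bacteria", "Proteobacteria", "species1"), 0.40),
    (("Bacteria", "Proteobacteria", "species2"), 0.38),
    (("Bacteria", "Firmicutes", "species3"), 0.22),
]
taxon = classify_with_support(placements, threshold=0.75)
```

With a threshold of 0.75 neither species-level placement has enough support on its own, so the read is classified at the rank where the two top placements agree.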