xin zhou - saturday closing plenary

Taxon diversity analysis for bulk insect samples using Illumina Hi-seq platform Xin ZHOU, Shanlin LIU, Yiyuan LI, Qing YANG, and Xu SU Department of Science and Technology Environmental Genomics Research Group BGI, China Adelaide, Australia, 3 December 2011


TRANSCRIPT

Page 1: Xin Zhou - Saturday Closing Plenary

Taxon diversity analysis for bulk insect samples using Illumina Hi-seq platform

Xin ZHOU, Shanlin LIU, Yiyuan LI,

Qing YANG, and Xu SU

Department of Science and Technology

Environmental Genomics Research Group

BGI, China

Adelaide, Australia, 3 December 2011

Page 2: Xin Zhou - Saturday Closing Plenary

Problem. Solutions?

Opt.1: ......zzzzZZZZZ

Opt.2: morph sorting → indiv. ID → … → Opt.1

Opt.3: morph sorting → indiv. barcoding → … → Opt.1

Opt.4: grinding up → NGS → CLUSTERING/BLAST → DIVERSITY!

Zhou et al. 2011, 4th International Barcode of Life Conference

Page 3: Xin Zhou - Saturday Closing Plenary

Environmental barcoding of bulk insects

| Sample | Marker | Platform |
|---|---|---|
| aquatic insects | mini-barcode (130 bp) | 454 |
| bat diet (insects) | COI fragment, 157 bp | 454 |
| Malaise trap (insects) | COI fragment, ~400 bp | 454 |

Biodiversity soup: metabarcoding of arthropods for rapid biodiversity assessment and biomonitoring, Yu D.W. et al., in review

Page 4: Xin Zhou - Saturday Closing Plenary

Major NGS platforms applicable in environmental barcoding

| NGS platform | Read length | Data/run (GB) | Run time | Requirement of library construction |
|---|---|---|---|---|
| 454 platform (GS FLX Titanium XL+) | ~400 bp | 0.7 | 23 hr. | Yes |
| Illumina platform (Hi-Seq 2000) | 150 bp PE reads | 600 | 14 d. | Yes |
| Illumina platform (Mi-Seq) | 150 bp PE reads | 2 | 27 hr. | Yes |
| Ion Torrent | 200 bp | ~1 | 3.5 hr. | No |

Illumina Hi-Seq:

• higher throughput

• less $ / bp

• increasing read length

• variety of bioinformatics tools available from genomic pipelines

Page 5: Xin Zhou - Saturday Closing Plenary

Sequencing capacity at BGI

• 28 Illumina GAIIx

• 137 Illumina Hi-Seq 2000

• 25 Life Tech SOLiD 4

• 16 ABI 3730XL

• 110 MegaBACEs

• 2 Illumina iScan

• 1 Roche 454

• 1 Ion Torrent

• 1 Illumina Mi-Seq

Data production:

• 100 Gb / day (2009)

• >5 Tb / day (end of 2010)

• >1500X human genome / day
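As a rough sanity check on the data-production figures, daily output can be converted into human-genome equivalents. This is only a back-of-the-envelope sketch: the genome size of about 3.2 Gb is an assumption that does not appear on the slide, and the function name is illustrative.

```python
# Assumption (not from the slide): a haploid human genome of roughly 3.2 Gb.
HUMAN_GENOME_BP = 3.2e9


def genome_equivalents_per_day(bases_per_day, genome_bp=HUMAN_GENOME_BP):
    """Return how many times the genome is covered by one day of output."""
    return bases_per_day / genome_bp


fold_2009 = genome_equivalents_per_day(100e9)  # "100 Gb / day (2009)"
fold_2010 = genome_equivalents_per_day(5e12)   # ">5 Tb / day (end of 2010)"
```

Under that assumed genome size, 5 Tb per day comes out above 1500-fold, in line with the last bullet.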

Page 6: Xin Zhou - Saturday Closing Plenary

What I am NOT going to talk about:

• Primer optimization

• Systematic comparisons of NGS platforms

• Quantitative diversity analysis

What I AM going to talk about:

• Can Illumina NGS be used in diversity analysis?


Page 7: Xin Zhou - Saturday Closing Plenary

Can Illumina NGS be used in diversity analysis?

Sequencing error rate

Read length

Page 8: Xin Zhou - Saturday Closing Plenary

Sequencing error rate

Recent improvement in sequencing quality using Illumina’s V3 chemistry

(even at 100 bp, only about 10% of the base calls have an error rate >1%)

No indel issue in homopolymers

Sequencing quality keeps increasing

Rare nucleotide errors can be easily corrected by:

• increasing sequencing depth

• paired-end (PE) sequencing

• setting stringent matching criteria in the overlapping fragment by allowing only >99% identity

PE sequencing enables forming sequence contigs: insert size 250 nt, 150 bp read from each end
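The paired-end geometry on this slide (two 150 bp reads from a 250 nt insert) implies a 50 bp overlap that can be checked against the >99% identity criterion before the reads are joined into a contig. The following is a minimal sketch, not the authors' pipeline: it assumes the insert size is known exactly, whereas real mergers locate the overlap by alignment, and the function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def merge_pair(fwd, rev, insert_size, min_identity=0.99):
    """Join a read pair into one contig, or return None if the overlap fails.

    fwd: forward read, 5'->3' on the top strand.
    rev: reverse read as sequenced (5'->3' on the bottom strand).
    insert_size: assumed exact fragment length (hypothetical simplification).
    """
    rev_rc = reverse_complement(rev)
    overlap = len(fwd) + len(rev_rc) - insert_size
    if overlap <= 0 or overlap > min(len(fwd), len(rev_rc)):
        return None  # reads do not overlap, or insert is shorter than a read
    fwd_tail = fwd[-overlap:]
    rev_head = rev_rc[:overlap]
    matches = sum(a == b for a, b in zip(fwd_tail, rev_head))
    # The slide allows "only >99% identity", so the comparison is strict.
    if matches / overlap <= min_identity:
        return None
    return fwd + rev_rc[overlap:]
```

With a 50 bp overlap, a single mismatch already gives 98% identity, so in this configuration the criterion amounts to requiring a perfect match across the overlap.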

Page 9: Xin Zhou - Saturday Closing Plenary

Read length

Read length keeps increasing

Shotgun reads can be further assembled into longer fragments (“shotgun” assembly strategy used in genome sequencing projects)

150PE enables contig read of 250 bp (insert size 250 nt, 150 bp read from each end)

Option of scaffold assembly

Page 10: Xin Zhou - Saturday Closing Plenary

Illumina environmental barcoding (Illumina e-barcoding)

PCR based (full-length COI):

• Full-length COI barcode PE sequencing: Lib1 (658 bp, 150PE)

• COI amplicons shotgun PE sequencing: Lib2 (200 bp, 150PE)

PCR free (full-length COI without PCR bias):

• Mitochondrial shotgun PE sequencing

Page 11: Xin Zhou - Saturday Closing Plenary

Approach #1: PCR-based

Sample information

| | Mock | XSBN (provided by Yu et al.) |
|---|---|---|
| # Specimens | 23 | 292 |
| # Haplotypes (2%) | 12 | 230 |
| PCR primers | LepF1/LepR1 | Customized |
| Sequence length | 658 bp | 700 bp |

Soup protocol: DNA extracted individually and mixed for PCR

Sequencing library details: Full length (658 bp) + shotgun library (~200 bp)

Sequencing protocol: 150PE

Page 12: Xin Zhou - Saturday Closing Plenary

Approach #1: PCR-based

Pre-analysis data filtering

| Lib 1 | Mock | XSBN |
|---|---|---|
| Raw data | 1.67G | 4.04G |
| Filtering adapter | 1.60G | 1.28G |
| High quality (Q20) | 0.35G | 0.50G |
| # Reads (primer removed) | 1,081,997 | 1,150,477 |
| # Unique reads (abundance > 1) | 36,618 | 45,444 |
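The last row of the filtering table (unique reads with abundance > 1) corresponds to a dereplication step. A minimal sketch of what such a step might look like is given below; it assumes reads are already primer-trimmed strings, and the function name is illustrative rather than taken from the authors' pipeline.

```python
from collections import Counter


def unique_reads(reads, min_abundance=2):
    """Collapse identical reads and keep those seen at least min_abundance times.

    The default of 2 mirrors the slide's "abundance > 1" criterion.
    Returns a dict mapping each retained sequence to its read count.
    """
    counts = Counter(reads)
    return {seq: n for seq, n in counts.items() if n >= min_abundance}
```

Singletons are dropped because a sequence observed only once is hard to tell apart from a sequencing or PCR error.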

Page 13: Xin Zhou - Saturday Closing Plenary

OTU filtering workflow

Steps: Unique reads (abundance > 1); OTU cluster (98%); Remove chimera; Compared to reads of Lib 2; Alignment

Counts at successive steps:

Mock: 36,618 → 784 → 490 → 119 → 44

XSBN: 45,444 → 4,189 → 3,887 → 403 → 399
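The OTU clustering step at 98% can be illustrated with a greedy, abundance-sorted clustering sketch. This is a generic technique substituted for illustration, not the program used in the talk, and it assumes equal-length, pre-aligned sequences so that identity reduces to a position-by-position comparison.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sketch assumes equal-length, pre-aligned sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(read_counts, threshold=0.98):
    """Greedy centroid clustering of unique reads.

    read_counts: dict mapping sequence -> abundance.
    Sequences are visited from most to least abundant; each one joins the
    first centroid it matches at >= threshold, otherwise it founds a new OTU.
    Returns a list of (centroid_sequence, total_abundance) tuples.
    """
    centroids = []  # each entry is [sequence, total abundance]
    ordered = sorted(read_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for seq, n in ordered:
        for centroid in centroids:
            if identity(seq, centroid[0]) >= threshold:
                centroid[1] += n
                break
        else:
            centroids.append([seq, n])
    return [(seq, n) for seq, n in centroids]
```

Visiting abundant sequences first makes them the centroids, so low-abundance error variants tend to be absorbed rather than founding OTUs of their own.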

Page 14: Xin Zhou - Saturday Closing Plenary

Results

Sanger reference vs. NGS OTUs (BLAST at 100% identity)

| Sample | Primers | Shared | Sanger reference only | NGS OTUs only |
|---|---|---|---|---|
| Mock | LepF1/R1 | 8 | 4 | 36 |
| XSBN | Customized primers | 198 | 32 | 197 |

Page 15: Xin Zhou - Saturday Closing Plenary

Mock: Sanger reference vs. NGS OTUs (8 shared, 4 Sanger reference only, 36 NGS OTUs only)

False negative (4): not found in raw data (likely due to primer failure)

“False positive”? (36):

• 31 can be found in our total sample, from which our mock samples were assembled

• 5 likely to be PCR errors

Page 16: Xin Zhou - Saturday Closing Plenary

XSBN: Sanger reference vs. NGS OTUs (198 shared, 32 Sanger reference only, 197 NGS OTUs only)

Sanger reference only (32):

• 17 not found in raw data (primer failure)

• 15 were lost in data filtering

Cross-sample contamination?

[Bar chart: mean + SE for group1 and group2]

Page 17: Xin Zhou - Saturday Closing Plenary

XSBN: Sanger reference vs. NGS OTUs

| | Shared | Sanger reference only | NGS OTUs only |
|---|---|---|---|
| All OTUs | 198 | 32 | 197 |
| After removal of sequences with abundance <10 | 181 | 49 | 84 |

Significantly fewer false positives

Slight drop in true positives
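The trade-off described on this slide can be put into numbers. The sketch below reads the flattened Venn labels as shared / Sanger-only / NGS-only counts (198 / 32 / 197 before and 181 / 49 / 84 after the abundance filter); that reading is an interpretation of the figure text, supported by the two Sanger-side counts summing to the 230 reference haplotypes. Treating every NGS-only OTU as a false positive is also an assumption, since the talk itself puts "false positive" in question marks.

```python
def detection_stats(shared, reference_only, ngs_only):
    """Return (recall, precision) against the Sanger reference.

    recall    = shared / all reference haplotypes
    precision = shared / all NGS OTUs (assumes NGS-only OTUs are errors)
    """
    recall = shared / (shared + reference_only)
    precision = shared / (shared + ngs_only)
    return recall, precision


before = detection_stats(198, 32, 197)  # all OTUs
after = detection_stats(181, 49, 84)    # abundance < 10 removed
```

Under these assumptions, recall falls from about 0.86 to 0.79 while precision rises from about 0.50 to 0.68, matching the slide's qualitative summary.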

Page 18: Xin Zhou - Saturday Closing Plenary

What’s next?

Approach #1: PCR-based

Illumina e-barcoding

• Obtaining full-length barcodes via shotgun read assembly (new program in development: “SOAPbarcode”)

• New algorithm to filter out false positive OTUs

Page 19: Xin Zhou - Saturday Closing Plenary

Approach #2: PCR-free method

Workflow:

• Individual barcoding

• Total MT isolation & DNA extraction

• Shotgun sequencing

• Reference based method

• Reference independent method

Page 20: Xin Zhou - Saturday Closing Plenary

Building reference library: individual barcoding

1. 89 individuals;

2. 84 reference barcodes;

3. 39 OTUs (2%);

| Taxon group | # OTUs |
|---|---|
| Lepidoptera | 25 |
| Diptera | 7 |
| Hemiptera | 4 |
| Hymenoptera | 2 |
| Psocoptera | 1 |
| Total | 39 |

Page 21: Xin Zhou - Saturday Closing Plenary

Total MT isolation & DNA extraction

Sample mixture → Total MT isolation → MT DNA extraction

Page 22: Xin Zhou - Saturday Closing Plenary

Shotgun sequencing

Insert size: 200 bp; Read length: 100 bp PE

| Metric | Percentage of base pairs |
|---|---|
| Q20 (sequencing error rate < 1%) | 96.2% |
| Q30 (sequencing error rate < 0.1%) | 92.9% |
| GC content | 38.0% |
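The Q20 and Q30 labels follow from the Phred definition, Q = -10 log10(p), where p is the per-base error probability. The short sketch below shows the conversion and how the fraction of bases at or above a threshold could be computed. The ASCII offset of 33 is an assumption (some Illumina pipelines of that period used 64), and the function names are illustrative.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def decode_qualities(qual_string, offset=33):
    """Decode a FASTQ quality string; offset=33 is an assumed encoding."""
    return [ord(ch) - offset for ch in qual_string]


def fraction_at_least(quals, threshold):
    """Fraction of bases with quality >= threshold (e.g. 20 or 30)."""
    if not quals:
        return 0.0
    return sum(q >= threshold for q in quals) / len(quals)
```

For example, `phred_to_error(20)` gives 0.01 and `phred_to_error(30)` gives 0.001, the 1% and 0.1% error rates quoted in the table.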

Page 23: Xin Zhou - Saturday Closing Plenary

Pre-analysis

Raw data: 2.45G

After filtering: 2.20G

Ratio of high quality reads: 89.91%

Data filtering:

1. Adaptor contamination removal;

2. Quality control: in each read, only allowing <10 bp with seq. error rate >1%
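The quality-control rule (fewer than 10 bases per read with an error rate above 1%) can be expressed as a simple per-read predicate. This is a sketch of that rule only: it assumes quality scores are already decoded to Phred integers, and the function name is hypothetical.

```python
def passes_quality(quals, min_q=20, max_low_quality=10):
    """Return True if the read has fewer than max_low_quality weak bases.

    A base counts as weak when its Phred score is below min_q; with
    min_q=20 that means an error rate above 1% (Q20 itself is exactly 1%).
    """
    low = sum(q < min_q for q in quals)
    return low < max_low_quality
```

Applied to a 100 bp read, nine bases below Q20 are tolerated and a tenth causes the read to be discarded.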

Page 24: Xin Zhou - Saturday Closing Plenary

Approach #2: PCR-free method

Method 1: Reference based

Blast reads to reference barcodes; confident identification is made only when:

1. Best BLAST hit >98% identity;

2. Reference coverage >90%;

[Diagram: reads mapped to Reference 1 (coverage: 100%, correct mapping) and to Reference 2 (coverage: 30%, incorrect mapping)]

| Taxon groups | # OTUs |
|---|---|
| Lepidoptera | 20 |
| Diptera | 2 |
| Hemiptera | 3 |
| Psocoptera | 1 |
| Total | 26 |
| Not found | 13 |
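The two acceptance criteria (best hit above 98% identity, reference coverage above 90%) can be sketched as a filter over already-parsed BLAST hits. The tuple layout, function names and example coordinates below are assumptions made for illustration; the sketch also assumes each hit is already the best hit for its read.

```python
from collections import defaultdict


def covered_fraction(intervals, ref_len):
    """Fraction of a reference covered by the union of 1-based inclusive intervals."""
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        start = max(start, current_end + 1)  # skip positions already counted
        if end >= start:
            covered += end - start + 1
            current_end = end
    return covered / ref_len


def detected_references(hits, ref_lengths, min_identity=98.0, min_coverage=0.90):
    """Return reference IDs meeting both criteria.

    hits: iterable of (ref_id, percent_identity, ref_start, ref_end),
          one best hit per read (hypothetical pre-parsed format).
    ref_lengths: dict mapping ref_id -> reference length in bp.
    """
    by_ref = defaultdict(list)
    for ref_id, pct_identity, start, end in hits:
        if pct_identity > min_identity:
            by_ref[ref_id].append((min(start, end), max(start, end)))
    return {
        ref_id
        for ref_id, intervals in by_ref.items()
        if covered_fraction(intervals, ref_lengths[ref_id]) > min_coverage
    }
```

The coverage requirement is what separates the two cases in the diagram: a reference hit only across a conserved 30% stretch is rejected even when the identity of those hits is high.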

Page 25: Xin Zhou - Saturday Closing Plenary

Potential sources of failure in detecting taxa

Taxon specific? or bio-mass (size & number)?

Page 26: Xin Zhou - Saturday Closing Plenary

Failures in taxon detection

Taxon bias?

| Taxon group | # Total OTUs | # OTUs missing |
|---|---|---|
| Lepidoptera | 25 | 5 |
| Diptera | 7 | 5 |
| Hymenoptera | 2 | 2 |
| Hemiptera | 4 | 1 |
| Psocoptera | 1 | 0 |
| Total | 39 | 13 |
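To make the taxon-bias question easier to judge, the counts in the table can be turned into per-order missing fractions. The snippet below only re-expresses the slide's own numbers; the variable and function names are illustrative.

```python
# (total OTUs, OTUs missing) per order, copied from the table above.
otu_counts = {
    "Lepidoptera": (25, 5),
    "Diptera": (7, 5),
    "Hymenoptera": (2, 2),
    "Hemiptera": (4, 1),
    "Psocoptera": (1, 0),
}


def missing_fractions(counts):
    """Return the fraction of OTUs that went undetected for each order."""
    return {order: missing / total for order, (total, missing) in counts.items()}


fractions = missing_fractions(otu_counts)
total_otus = sum(total for total, _ in otu_counts.values())
total_missing = sum(missing for _, missing in otu_counts.values())
```

Lepidoptera lose 20% of their OTUs, while Diptera lose about 71% and Hymenoptera all of them. With so few OTUs in the smaller orders, these fractions are suggestive at best.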

Page 27: Xin Zhou - Saturday Closing Plenary

Failures in taxon detection

OR bio-mass (body size, # individuals)?

Readily detected: average length > 5 mm

Missing: average length < 5 mm

Page 28: Xin Zhou - Saturday Closing Plenary

Approach #2: PCR-free method

Method 2: Reference independent

(Will we be able to identify diversity without reference MT genomes for the targeted species?)

Workflow:

1. Assembly of COI gene using genome assembly program (SOAPdenovo);

2. Annotation using ~240 MT genomes downloaded from GenBank;

Page 29: Xin Zhou - Saturday Closing Plenary

PCR-free reference-independent: results

• 23/31 falling in standard COI barcode region (mostly >600 bp);

• 1 of 23 is not in our reference barcodes (Insecta; Lepidoptera; Pyralidae);

• Multiple genes obtained simultaneously;

• 1 nearly complete mitochondrial genome (~15k bp);

• 3 fragments >6000 bp;

Page 30: Xin Zhou - Saturday Closing Plenary

Reference independent

• 23/31 falling in standard COI barcode region (mostly >600 bp);

• 1 of 23 was not present in our reference barcodes (Insecta; Lepidoptera; Pyralidae);

Number of individuals we collected: 89 individuals

5 individuals failed in Sanger sequencing

Barcode references: 39 OTUs (84 individuals)

Reference based: 26 OTUs

Reference independent: 23 OTUs

3 OTUs not detected in reference independent method because:

(1) sequencing depth is too low (<10X) to allow for reliable assembly

(2) relatively small body size

Page 31: Xin Zhou - Saturday Closing Plenary

PCR-free method

Multiple MT genes obtained simultaneously

| Gene | Number |
|---|---|
| ATP6 | 29 |
| ATP8 | 4 |
| COX1 | 31 |
| COX2 | 33 |
| COX3 | 31 |
| CYTB | 31 |
| ND1 | 35 |
| ND2 | 34 |
| ND3 | 24 |
| ND4 | 30 |
| ND4L | 16 |
| ND5 | 30 |
| ND6 | 24 |

Page 32: Xin Zhou - Saturday Closing Plenary

PCR-free method

• 1 nearly complete mitochondrial genome (~15k bp);

• 3 fragments longer than 6k bp;

[Figure: assembled mitochondrial fragments, with the barcode region marked]

Page 33: Xin Zhou - Saturday Closing Plenary

What’s next?

Approach #2: PCR-free method

Currently:

• MT DNA 5-10% after isolation;

• Non-target DNA affects MT assembly (e.g., bacteria & genomic DNA);

• Taxonomic/biomass bias

Potential solutions:

1. Wet-lab protocol optimization

• Pre-sorting insects by body size

• Alternative MT isolation methods

2. Increase sequencing depth

Page 34: Xin Zhou - Saturday Closing Plenary

Conclusions

• Illumina Hi-Seq delivers performance comparable to other NGS platforms in analyzing bulk insect samples, with potential advantages in achieving higher sensitivity at lower cost;

• Deep sequencing capacity enables a novel PCR-free approach, which may eventually solve biases caused by DNA amplification;

• It shares issues with other NGS platforms (non-quantitative, inflation of OTUs, etc.);

• Methodology optimization is much needed in many details of the pipeline;

• Collaborative and synergistic efforts made by the community would greatly advance the progress.

Page 35: Xin Zhou - Saturday Closing Plenary

Acknowledgements

Collaborators:

Douglas W. Yu, Kunming Institute of Zoology, Chinese Academy of Sciences

Mehrdad Hajibabaei, Shadi Shokralla, University of Guelph

Owain Edwards, CSIRO Ecosystem Sciences

LU Jianliang, WU Qiong, AN Sainan, ZHOU Yizhuang, ZHAO Jing

Funder:

Page 36: Xin Zhou - Saturday Closing Plenary


Thanks for your attention!
