detecting somatic mutation - ensemble approach

Detecting Somatic Mutations in Impure Cancer Sample - Ensemble Approach. 김광중, 홍창범, KT GenomeCloud. 2015 Korea Genome Organization (KOGO) Winter Symposium, Workshop on Bioinformatics Analysis Using NGS Data, 2015.2.4~2.5

Upload: hong-changbum

Post on 16-Jul-2015


Page 1: Detecting Somatic Mutation - Ensemble Approach

Detecting Somatic Mutations in Impure Cancer Sample

- Ensemble Approach

김광중, 홍창범

KT GenomeCloud

2015 Korea Genome Organization (KOGO) Winter Symposium, Workshop on Bioinformatics Analysis Using NGS Data

2015.2.4~2.5  

Page 2: Detecting Somatic Mutation - Ensemble Approach

Overview


Challenge
Literature Search
Mutation callers
Comparison of mutation callers
Simple consensus approach
Integrated/Ensemble Approach
Summary

Page 3: Detecting Somatic Mutation - Ensemble Approach

Challenge

Motivation

Sequencing reads from tumor samples are diluted by normal cells

Lower signal-to-noise ratio: SNVs at about 5% allele frequency cannot be called with high significance

The genomes of primary tumors are genetically heterogeneous, with frequent rearrangements, copy number alterations, and subclones

Need highly sensitive and specific mutation-calling methods
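To make the dilution effect concrete, here is a minimal sketch, assuming a heterozygous somatic SNV at a copy-neutral diploid locus (an assumption not stated on the slide). Under that assumption the expected variant allele frequency is purity x clonal fraction / 2, so a clonal mutation in a sample with 10% tumor content is expected at about 5%. The function names are hypothetical.

```python
from math import comb


def expected_vaf(purity, clonal_fraction=1.0):
    """Expected variant allele frequency of a heterozygous somatic SNV.

    Assumes a diploid, copy-neutral locus (illustrative assumption).
    """
    return purity * clonal_fraction / 2.0


def prob_at_least(depth, vaf, min_alt_reads):
    """Binomial probability of observing at least min_alt_reads variant reads."""
    p_fewer = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


if __name__ == "__main__":
    vaf = expected_vaf(purity=0.10)  # 10% tumor content, clonal mutation
    for depth in (30, 50, 100, 250):
        # Chance of seeing 5 or more variant reads at this depth
        print(depth, round(prob_at_least(depth, vaf, 5), 3))
```

The binomial model ignores sequencing error and mapping artifacts, so it gives an optimistic upper bound on detectability rather than a caller's actual sensitivity.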

Page 4: Detecting Somatic Mutation - Ensemble Approach

Terminologies


Challenge
Literature Search
Mutation Callers
Comparison of mutation callers
Simple consensus approach
Integrated/Ensemble Approach
Summary

Page 5: Detecting Somatic Mutation - Ensemble Approach

Comparison of mutation callers

Literature Search

Page 6: Detecting Somatic Mutation - Ensemble Approach

Comparison of mutation callers-Data Sets

Literature Search

whole genome sequencing (WGS): melanoma sample and matched blood; 90% tumor content; paired-end reads, ~50x coverage

whole exome sequencing (WES) tumor samples

18 lung tumor-normal pairs: 70~80% tumor content; paired-end reads, 63x

cell lines: 7 lung cancer cell lines; paired-end reads, 233x

Page 7: Detecting Somatic Mutation - Ensemble Approach

Comparison of mutation callers-Experiments

Literature Search

sSNV detection tools

Validation: PCR and direct sequencing of genomic DNA, on only deleted, functionally important sSNVs

Simulation: 10 tumor-normal pairs (WES), 100x coverage

Page 8: Detecting Somatic Mutation - Ensemble Approach

Comparison of mutation callers- Results (1)

Literature Search

Page 9: Detecting Somatic Mutation - Ensemble Approach

Comparison of mutation callers- Results (2)

Literature Search

Synthetic data: 10 tumor-normal pairs (WES), 100x coverage

Page 10: Detecting Somatic Mutation - Ensemble Approach

Comparison of mutation callers- Summary of Findings

Literature Search

VarScan2 performed the best; MuTect follows
VarScan2: better at higher allele frequencies
MuTect: more sensitive at low allele frequencies

strand-bias filtering is useful: it eliminates many false positives, a common problem with Illumina sequencing data

still a challenge: how to discern sSNVs from normal alternate alleles?
to call ultra-rare sSNVs: targeted deep sequencing is recommended over WES or WGS
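The slide does not say which strand-bias test the compared callers use. As one common choice, the sketch below applies a two-sided Fisher's exact test to the 2x2 table of reference/variant reads by forward/reverse strand; the function name and the use of this particular test are assumptions for illustration.

```python
from math import comb


def strand_bias_pvalue(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test on strand counts (illustrative sketch).

    A small p-value means variant reads are distributed across strands
    differently from reference reads, which suggests an artifact.
    """
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    if total == 0:
        return 1.0
    n_alt = alt_fwd + alt_rev
    n_fwd = ref_fwd + alt_fwd
    n_rev = total - n_fwd

    def table_prob(k):
        # Hypergeometric probability that k of the variant reads are forward
        return comb(n_fwd, k) * comb(n_rev, n_alt - k) / comb(total, n_alt)

    lowest = max(0, n_alt - n_rev)
    highest = min(n_alt, n_fwd)
    observed = table_prob(alt_fwd)
    # Sum all tables that are no more likely than the observed one
    p_value = sum(
        prob
        for prob in (table_prob(k) for k in range(lowest, highest + 1))
        if prob <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    print(strand_bias_pvalue(50, 50, 10, 10))  # balanced support
    print(strand_bias_pvalue(50, 50, 20, 0))   # variant reads on one strand only
```

A call would then be discarded when the p-value falls below a chosen cutoff; the cutoff itself is a tuning decision that the slides do not specify.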

Page 11: Detecting Somatic Mutation - Ensemble Approach

Simple consensus approach

Literature Search

Page 12: Detecting Somatic Mutation - Ensemble Approach

Simple consensus approach-Data Sets

Literature Search

whole exome sequencing (WES)
27 ovarian tumors and their matched germline samples
HiSeq 2000 sequencer, using 100 bp paired-end reads
mean coverage ranged from 102~225x in tumor and 119~118x in germline

validation: Sanger sequencing

somatic SNV detection programs: JointSNVMix2, MuTect, and SomaticSniper
implement sophisticated detection algorithms
used in major tumor sequencing studies

Page 13: Detecting Somatic Mutation - Ensemble Approach

Simple consensus approach-Identification of somatic SNVs

Literature Search

MuTect (v 1.0.27783): only the default parameter set was applied; calls not labeled as ‘REJECT’ were kept

JointSNVMix2 (v 0.7.5): default prior genotype probabilities used for the training set; calls with a joint probability of 0.9999 or greater were kept

SomaticSniper (v 1.0.0): joint genotyping mode (-J option); default prior probability of a somatic mutation (0.01); reads with a mapping quality of 0 were filtered; predictions with a ‘somatic score’ of 40 or greater were kept

SAMtools mpileup: mapping qualities, directionality, depth of reads

Page 14: Detecting Somatic Mutation - Ensemble Approach

Simple consensus approach-Prediction Results

Literature Search

Filters applied to the predictions:
total read depth of 8 or greater in both the tumor and the normal (T/N)
mutant allele frequency of ≥20% in tumor and ≤5% in germline
mutant allele supported by reads mapping in both forward and reverse orientations
variant called only in the tumor (with the exception of the BRAF V600 and KRAS G12/13 hotspots)

Results:
combined total of 9,226 somatic SNV predictions
median of 321 predictions per sample (range 147~695)
SomaticSniper and JointSNVMix2: most mutations per sample (median 171, 173)
MuTect was more conservative (median 115)
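The read-level filters listed on this slide can be sketched as a single predicate. This is an illustration under stated assumptions, not the published implementation: the germline allele frequency is read as an upper bound (at most 5%), the hotspot exception is modeled as waiving only that germline rule, and the function and parameter names are hypothetical.

```python
def passes_read_filters(
    tumor_depth,
    normal_depth,
    tumor_alt_fwd,
    tumor_alt_rev,
    normal_alt,
    is_hotspot=False,
    min_depth=8,
    min_tumor_af=0.20,
    max_normal_af=0.05,
):
    """Return True if a candidate site passes the depth, frequency and strand rules."""
    # Total read depth of 8 or greater in both tumor and normal
    if tumor_depth < min_depth or normal_depth < min_depth:
        return False
    # Mutant allele frequency of at least 20% in the tumor
    tumor_alt = tumor_alt_fwd + tumor_alt_rev
    if tumor_alt / tumor_depth < min_tumor_af:
        return False
    # Mutant allele seen on both forward and reverse strands
    if tumor_alt_fwd == 0 or tumor_alt_rev == 0:
        return False
    # Little or no mutant allele in the germline (assumed upper bound),
    # waived for known hotspots such as BRAF V600 or KRAS G12/13
    if normal_alt / normal_depth > max_normal_af and not is_hotspot:
        return False
    return True


if __name__ == "__main__":
    print(passes_read_filters(40, 30, 6, 5, 0))   # well supported
    print(passes_read_filters(40, 30, 11, 0, 0))  # single-strand support
```

In practice the per-strand counts would come from a pileup of the tumor and normal BAM files, which is where SAMtools mpileup fits in.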

Page 15: Detecting Somatic Mutation - Ensemble Approach

Simple consensus approach-Properties of Predictions

Literature Search

Non-reference allele frequency in germline:
SomaticSniper (S) and JointSNVMix2 (J): tolerate a substantial number of reads with non-reference alleles, letting a significant number of germline variants into the call set
MuTect (M): much more stringent on evidence for non-reference alleles

Non-reference allele frequency in tumor:
predictions made by only one or two programs have a lower proportion of non-reference reads: allelic ratios too low to be predicted as somatic by every program, but enough support to rise above the thresholds of at least one program

Page 16: Detecting Somatic Mutation - Ensemble Approach

Simple consensus approach-Validation Results

Literature Search

Page 17: Detecting Somatic Mutation - Ensemble Approach

Simple consensus approach-Filtering Results

Literature Search

Filters applied: taking consensus with GATK Unified Genotyper; mate-pair rescue read filtering; minimum read depth of 10 in both the tumor and germline

2 true positive

Page 18: Detecting Somatic Mutation - Ensemble Approach

Simple consensus approach-Summary of Findings

Literature Search

Powerful method for increasing the validation rate while maintaining maximum sensitivity

Similar effects are likely to influence other bioinformatics classification problems

Prove effective for a variety of genomics and bioinformatics analyses
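A consensus over several callers can be sketched as a simple vote. The example below counts how many callers report each variant and keeps those at or above a threshold; the threshold of two callers, the variant key layout, and the toy call sets are all assumptions for illustration.

```python
from collections import Counter


def consensus_calls(callsets, min_callers=2):
    """Count caller support per variant and keep well-supported variants.

    callsets maps a caller name to a set of (chrom, pos, ref, alt) tuples.
    Returns a dict mapping each kept variant to its number of supporting callers.
    """
    votes = Counter()
    for calls in callsets.values():
        votes.update(set(calls))  # each caller votes at most once per variant
    return {variant: n for variant, n in votes.items() if n >= min_callers}


if __name__ == "__main__":
    # Hypothetical call sets, not data from the study
    calls = {
        "MuTect": {("17", 100, "A", "T"), ("17", 200, "C", "G")},
        "JointSNVMix2": {("17", 100, "A", "T")},
        "SomaticSniper": {("17", 100, "A", "T"), ("17", 200, "C", "G"), ("17", 300, "G", "A")},
    }
    for variant, support in sorted(consensus_calls(calls).items()):
        print(variant, support)
```

Raising `min_callers` trades sensitivity for a higher expected validation rate, which is the balance this slide describes.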

Page 19: Detecting Somatic Mutation - Ensemble Approach

Integrated/Ensemble Approach


Challenge
Literature Search
Mutation callers
Comparison of mutation callers
Simple consensus approach
Integrated/Ensemble Approach
Summary

Page 20: Detecting Somatic Mutation - Ensemble Approach

Integrated/Ensemble Approach

Ensemble: using multiple learning algorithms to obtain better predictive performance (three somatic SNV callers: SomaticSniper, MuTect, and VarScan2)

Integrated: for better performance, we will use additional filtering
GATK Unified Genotyper: filtering for SNVs predicted in the tumor but not in the germline
Scoring system: helps us identify strong and relevant mutation candidates

Page 21: Detecting Somatic Mutation - Ensemble Approach

Integrated/Ensemble Approach

Workflow (diagram):

Subject: tumor.bam, normal.bam

MuTect -> somatic.vcf
SAMtools mpileup -> VarScan2 -> somatic.vcf
SomaticSniper -> somatic.vcf

GATK -> tumor.vcf, normal.vcf

Consensus somatic.vcf -> filtered (GATK) somatic.vcf -> validated (GATK) somatic.vcf

Validation: checked against the COSMIC and CCLE validated somatic list
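The stages of this workflow can be sketched with set operations once each VCF has been reduced to a set of variant keys. This is a simplified illustration: treating the consensus as the variants reported by every caller is an assumption, and the function name and toy inputs are hypothetical.

```python
def integrated_filter(caller_sets, gatk_tumor, gatk_normal, known_somatic):
    """Apply the consensus, GATK and validation stages to sets of variant keys.

    caller_sets: list of sets, one per somatic caller.
    gatk_tumor / gatk_normal: variants GATK called in the tumor / normal BAM.
    known_somatic: validated somatic variants (for example from COSMIC or CCLE).
    Returns the surviving set after each stage.
    """
    if not caller_sets:
        consensus = set()
    else:
        # Assumed definition: variants reported by every caller
        consensus = set.intersection(*(set(calls) for calls in caller_sets))
    not_germline = consensus - set(gatk_normal)  # drop calls also seen in the normal
    filtered = not_germline & set(gatk_tumor)    # keep calls GATK confirms in the tumor
    validated = filtered & set(known_somatic)    # overlap with the validation list
    return {
        "consensus": consensus,
        "not_germline": not_germline,
        "filtered": filtered,
        "validated": validated,
    }


if __name__ == "__main__":
    # Hypothetical variant keys (chrom, pos), not data from the slides
    mutect = {("17", 1), ("17", 2), ("17", 3), ("17", 4)}
    varscan = {("17", 2), ("17", 3), ("17", 4), ("17", 5)}
    sniper = {("17", 2), ("17", 3), ("17", 4), ("17", 6)}
    stages = integrated_filter(
        [mutect, varscan, sniper],
        gatk_tumor={("17", 2), ("17", 3)},
        gatk_normal={("17", 4)},
        known_somatic={("17", 3)},
    )
    for name, variants in stages.items():
        print(name, len(variants))
```

Keeping every intermediate set makes it easy to report how many candidates each stage removes.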

Page 22: Detecting Somatic Mutation - Ensemble Approach

Integrated/Ensemble Approach: Data Sets (CGHub)

Integrated/Ensemble Approach

https://cghub.ucsc.edu/datasets/benchmark_download.html

CGHub (Cancer Genomics Hub): a resource of the National Cancer Institute for the Cancer Genome Atlas (TCGA) consortium and related projects

Page 23: Detecting Somatic Mutation - Ensemble Approach

Integrated/Ensemble Approach: TCGA Benchmark 4

Integrated/Ensemble Approach

Three parts to the mutation calling exercise, derived from grade 3 breast ductal carcinomas (breast cancer):
HCC1143 (50x) vs. HCC1143 BL (60x)
HCC1954 (58x) vs. HCC1954 BL (71x)
Simulated normal contamination and subclone expansion for both

Total: 28 .bam files, ~4.3 TB

Page 24: Detecting Somatic Mutation - Ensemble Approach

Integrated/Ensemble Approach: download using GeneTorrent

Integrated/Ensemble Approach

GeneTorrent: client software for downloading sequence data from CGHub’s repository
two main programs: gtdownload and cgquery

get a key
public key: https://cghub.ucsc.edu/software/downloads/cghub_public.key
TCGA key: requires approval to access the restricted data from the ICGC-DACO

download uuid (xml file)

Page 25: Detecting Somatic Mutation - Ensemble Approach

Validation Data Sets

Integrated/Ensemble Approach

COSMIC: Catalogue of Somatic Mutations in Cancer, Cell Lines Project, Wellcome Trust Sanger Institute, http://cancer.sanger.ac.uk/cancergenome/projects/cell_lines/

CCLE: Cancer Cell Line Encyclopedia, Broad Institute and Novartis Institute for Biomedical Research, http://www.broadinstitute.org/ccle/home

Page 26: Detecting Somatic Mutation - Ensemble Approach

Validation Data Sets (18)

Integrated/Ensemble Approach

17:5445207-5445207 17:7577538-7577538 17:10411982-10411982 17:43364293-43364293 17:47892946-47892946 17:67538038-67538038 17:67012449-67012449 17:48538716-48538716 17:27936181-27936181 17:79650824-79650824 17:79638782-79638782 17:76528554-76528554 17:6683197-6683197 17:73235515-73235515 17:39505636-39505636 17:33310040-33310040 17:56083818-56083818 17:37374298-37374298
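The validation positions above use a `chrom:start-end` notation. A minimal sketch of parsing them and checking which calls fall on a validated position follows; the helper names and the example calls are hypothetical.

```python
def parse_region(text):
    """Parse 'chrom:start-end' into (chrom, start, end) with integer coordinates."""
    chrom, span = text.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def validated_hits(calls, regions):
    """Return the calls (chrom, pos) that fall inside any validated region."""
    return [
        (chrom, pos)
        for chrom, pos in calls
        if any(chrom == r_chrom and start <= pos <= end for r_chrom, start, end in regions)
    ]


if __name__ == "__main__":
    # Two of the 18 positions listed on the slide
    regions = [parse_region(r) for r in ("17:5445207-5445207", "17:7577538-7577538")]
    # Hypothetical caller output
    calls = [("17", 7577538), ("17", 12345), ("1", 7577538)]
    print(validated_hits(calls, regions))
```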

Page 27: Detecting Somatic Mutation - Ensemble Approach

Java Application: version

Integrated/Ensemble Approach

Java version: Java 6 and Java 7 are both used in many systems

Select Java version: use “update-alternatives --config java”

MuTect runs on Java 6 / GATK runs on Java 7 :-(

Page 28: Detecting Somatic Mutation - Ensemble Approach

Java Application: running options

Integrated/Ensemble Approach

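-Xmx7g: sets the maximum heap size of the Java program

The default memory size set for running a Java program is 64M; “java.lang.OutOfMemoryError” occurs when the heap size is insufficient

-Djava.io.tmpdir=/tmp: sets a system property value; here it sets the temporary directory that Java will use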

java [-java_options] -jar jarfile [jarfile_options]

java -Xmx10g -Djava.io.tmpdir=/tmp -jar muTect-1.1.4.jar --analysis_type MuTect --reference_sequence human_g1k_v37_decoy.fasta
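When the tools are driven from a script, the `java [-java_options] -jar jarfile [jarfile_options]` pattern can be assembled as an argument list. The helper below is a hypothetical convenience, not part of any of the tools; it only builds the command and does not run it.

```python
import shlex


def java_command(jar, tool_args, max_heap="10g", tmpdir="/tmp"):
    """Build a java command line as an argv list.

    -Xmx sets the maximum heap size; -Djava.io.tmpdir sets the JVM temp directory.
    """
    return [
        "java",
        f"-Xmx{max_heap}",
        f"-Djava.io.tmpdir={tmpdir}",
        "-jar",
        jar,
        *tool_args,
    ]


if __name__ == "__main__":
    cmd = java_command(
        "muTect-1.1.4.jar",
        ["--analysis_type", "MuTect", "--reference_sequence", "human_g1k_v37_decoy.fasta"],
    )
    print(shlex.join(cmd))  # shell-quoted form, suitable for logging
```

Passing the list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.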

Page 29: Detecting Somatic Mutation - Ensemble Approach

SomaticSniper v1.0.4

Integrated/Ensemble Approach

bam-somaticsniper -J -F vcf -n HCC1143_Normal -t HCC1143_Tumor -f ${gatk_b37} ${input_bam1} ${input_bam2} HCC1143_chr17_somaticsniper.vcf

Page 30: Detecting Somatic Mutation - Ensemble Approach

MuTect v1.1.4

Integrated/Ensemble Approach

# input_bam1 is the tumor BAM, input_bam2 the matched normal; -L 17 restricts calling to chromosome 17
java -jar muTect-1.1.4.jar \
  --analysis_type MuTect \
  --reference_sequence ${gatk_b37} \
  --input_file:normal ${input_bam2} \
  --input_file:tumor ${input_bam1} \
  --out HCC1143_chr17_mutect.out \
  --vcf HCC1143_chr17_mutect.vcf \
  --coverage_file HCC1143_chr17_mutect.cov.wig.txt \
  -nt 7 \
  --normal_sample_name normal \
  --tumor_sample_name tumor \
  -L 17

Page 31: Detecting Somatic Mutation - Ensemble Approach

VarScan2 v2.3.7

Integrated/Ensemble Approach

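# Step 1: joint pileup, normal BAM first, then tumor BAM
samtools mpileup -f ${gatk_b37} -Q 20 -q 20 -B ${input_bam2} ${input_bam1} > hcc1143_chr17.mpileup

# Step 2: somatic calling on the joint pileup
java -jar VarScan.v2.3.7.jar somatic hcc1143_chr17.mpileup HCC1143_chr17.varscan --mpileup 1 --output-vcf 1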

Page 32: Detecting Somatic Mutation - Ensemble Approach

GATK UnifiedGenotyper

Integrated/Ensemble Approach

# Germline calls on the normal BAM (input_bam2), chromosome 17 only
java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
  -L 17 \
  -o hcc1143_chr17.gatk.normal.vcf \
  -I ${input_bam2} \
  --genotype_likelihoods_model BOTH \
  -minIndelFrac 0.2 \
  --min_base_quality_score 17 \
  --standard_min_confidence_threshold_for_calling 30.0 \
  --standard_min_confidence_threshold_for_emitting 30.0 \
  --baq CALCULATE_AS_NECESSARY \
  --baqGapOpenPenalty 30.0 \
  --defaultBaseQualities -1 \
  --validation_strictness STRICT \
  --interval_merging ALL \
  -R ${gatk_b37} \
  -nt 7

Page 33: Detecting Somatic Mutation - Ensemble Approach

GATK SelectVariants

Integrated/Ensemble Approach

Select variants from a VCF source
discordance: select all calls missed in mycalls, but present in Hiscalls
concordance: select all calls made by both mycalls and Hiscalls
selectType MNP/SNP: select only multi-allelic SNPs and MNPs
intervals: restrict the output VCF to a set of intervals

Page 34: Detecting Somatic Mutation - Ensemble Approach

Ensemble approach - results & rank score

Integrated/Ensemble Approach

Filtered count per caller (total variant count/filtered count):
SomaticSniper: 2,381/624
MuTect: 132,239/4,318
VarScan2: 89,986/1,457

Concordance calls (204 variants):
total: 460 variants
after excluding GATK germline calls: 324 variants
after requiring GATK calls in the cancer sample: 204 variants

Validation count (total variant count/validated count):
SomaticSniper: 2,381/9 (+4)
MuTect: 132,239/13 (+8)
VarScan2: 89,986/6 (+1)
filtered consensus: 204/5

[Figure: call subsets labeled with rank scores 1 to 5]

Page 35: Detecting Somatic Mutation - Ensemble Approach

Summary


Challenge
Literature Search
Mutation callers
Comparison of mutation callers
Simple consensus approach
Integrated/Ensemble Approach
Summary

Page 36: Detecting Somatic Mutation - Ensemble Approach

Summary


Identifying somatic changes from tumor and matched normal sequence requires accurate detection of somatic point mutations with low allele frequencies in impure and heterogeneous cancer samples

Mutations called by multiple tools are of higher confidence than mutations called by a single tool

Utilizing multiple callers can be a powerful way to construct a list of final calls for one’s research

Capable of running multiple tools in parallel, providing faster total run-time
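Because the callers read the same BAM files independently, they can be launched side by side. The sketch below runs several commands concurrently and collects their exit codes; the helper name is hypothetical, and the demo uses placeholder Python commands rather than the real callers so that it runs anywhere.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(commands, max_workers=3):
    """Run each argv list in its own process and return {label: exit code}."""

    def run(argv):
        # check=False: report the exit code instead of raising on failure
        return subprocess.run(argv, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {label: pool.submit(run, argv) for label, argv in commands.items()}
        return {label: future.result() for label, future in futures.items()}


if __name__ == "__main__":
    # Placeholders standing in for the MuTect, VarScan2 and SomaticSniper commands
    demo = {
        "caller_a": [sys.executable, "-c", "print('caller A done')"],
        "caller_b": [sys.executable, "-c", "print('caller B done')"],
    }
    print(run_in_parallel(demo))
```

Total run-time is then bounded by the slowest caller rather than the sum of all callers, provided the machine has enough memory and cores for the concurrent jobs.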

Page 37: Detecting Somatic Mutation - Ensemble Approach

References

Wang, Q., Jia, P., Li, F., Chen, H., Ji, H., Hucks, D., et al. (2013). Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Medicine, 5(10), 91. doi:10.1186/gm495
Goode, D. L., Hunter, S. M., Doyle, M. A., Ma, T., Rowley, S. M., Choong, D., et al. (2013). A simple consensus approach improves somatic mutation prediction accuracy. Genome Medicine, 5(9), 90. doi:10.1186/gm494
Roberts, N. D., Kortschak, R. D., Parker, W. T., Schreiber, A. W., Branford, S., Scott, H. S., et al. (2013). A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics, 29(18), 2223–2230. doi:10.1093/bioinformatics/btt375
Xu, H., DiCarlo, J., Satya, R. V., Peng, Q., & Wang, Y. (2014). Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics, 15(1), 244. doi:10.1186/1471-2164-15-244
Kim, S. Y., Jacob, L., & Speed, T. P. (2014). Combining calls from multiple somatic mutation-callers. BMC Bioinformatics, 15(1), 154. doi:10.1186/1471-2105-15-154
Löwer, M., Renard, B. Y., de Graaf, J., Wagner, M., Paret, C., Kneip, C., et al. (2012). Confidence-based Somatic Mutation Evaluation and Prioritization. PLoS Computational Biology, 8(9), e1002714. doi:10.1371/journal.pcbi.1002714
Ding, J., Bashashati, A., Roth, A., Oloumi, A., Tse, K., Zeng, T., et al. (2012). Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics, 28(2), 167–175. doi:10.1093/bioinformatics/btr629
Fischer, A., Vázquez-García, I., Illingworth, C. J. R., & Mustonen, V. (2014). High-definition reconstruction of clonal composition in cancer. Cell Reports, 7(5), 1740–1752. doi:10.1016/j.celrep.2014.04.055
Frampton, G. M., Fichtenholtz, A., Otto, G. A., Wang, K., Downing, S. R., He, J., et al. (2013). Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nature Biotechnology, 31(11), 1023–1031. doi:10.1038/nbt.2696
Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., et al. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology, 31(3), 213–219. doi:10.1038/nbt.2514
Roth, A., Ding, J., Morin, R., Crisan, A., Ha, G., Giuliany, R., et al. (2012). JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics, 28(7), 907–913. doi:10.1093/bioinformatics/bts053
Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., et al. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research, 22(3), 568–576. doi:10.1101/gr.129684.111
Larson, D. E., Harris, C. C., Chen, K., Koboldt, D. C., Abbott, T. E., Dooling, D. J., et al. (2012). SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics, 28(3), 311–317. doi:10.1093/bioinformatics/btr665