detecting somatic mutation - ensemble approach
Detecting Somatic Mutations in Impure Cancer Sample
- Ensemble Approach
김광중, 홍창범
KT GenomeCloud
2015 Korea Genome Organization (한국유전체학회) Winter Symposium: Workshop on Bioinformatics Analysis Using NGS Data
2015.2.4~2.5
Overview
Challenge
Literature Search
- Mutation callers
- Comparison of mutation callers
- Simple consensus approach
Integrated/Ensemble Approach
Summary
Challenge
Motivation
Sequencing reads from tumor samples are diluted by reads from normal cells
- lower signal-to-noise ratio: SNVs at an allele frequency of about 5% cannot be called with high significance
The genomes of primary tumors are genetically heterogeneous, with frequent rearrangements, copy number alterations, and subclones
Highly sensitive and specific mutation-calling methods are needed
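To make the dilution effect concrete, the following is a minimal sketch of the expected variant allele frequency (VAF) of a clonal somatic SNV as a function of tumor purity. The function name, the default of one mutant copy, and the diploid copy numbers are assumptions for illustration; they are not taken from the slides.

```python
def expected_vaf(purity, mutant_copies=1, tumor_copy_number=2, normal_copy_number=2):
    """Expected fraction of reads carrying a clonal somatic variant.

    Assumes (for illustration only) that every tumor cell carries the variant
    on `mutant_copies` of its `tumor_copy_number` copies of the locus, and that
    normal cells carry only the reference allele.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    mutant_alleles = purity * mutant_copies
    total_alleles = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    return mutant_alleles / total_alleles


if __name__ == "__main__":
    for purity in (0.9, 0.5, 0.1):
        print(f"purity={purity:.1f}  expected VAF={expected_vaf(purity):.3f}")
```

Under these assumptions, a heterozygous variant in a sample with 10% tumor content is expected at a VAF of about 5%, which is the regime the slide describes as hard to call.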
Terminologies
Comparison of mutation callers
Literature Search
Comparison of mutation callers-Data Sets
Whole-genome sequencing (WGS)
- melanoma sample and matched blood
- 90% tumor content
- paired-end reads, ~50x coverage
Whole-exome sequencing (WES) tumor samples
- 18 lung tumor-normal pairs
- 70~80% tumor content
- paired-end reads, 63x
Cell lines
- 7 lung cancer cell lines
- paired-end reads, 233x
Comparison of mutation callers-Experiments
sSNV detection tools
Validation
- PCR and direct sequencing of genomic DNA
- on only deleted, functionally important sSNVs
Simulation
- 10 tumor-normal pairs (WES), 100x coverage
Comparison of mutation callers- Results (1)
Comparison of mutation callers- Results (2)
Synthetic data: 10 tumor-normal pairs (WES), 100x coverage
Comparison of mutation callers- Summary of Findings
VarScan2 performed the best; MuTect follows
- VarScan2: better at higher allele frequencies
- MuTect: more sensitive at low allele frequencies
Strand-bias filtering is useful
- eliminates many false positives
- strand bias is a common problem with Illumina sequencing data
Still a challenge
- how to discern sSNVs from normal alternate alleles?
- to call ultra-rare sSNVs, targeted deep sequencing is recommended over WES or WGS
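One common way to quantify strand bias is Fisher's exact test on the forward/reverse read counts of the reference and alternate alleles. The sketch below assumes that choice; the slides do not say which strand-bias test the compared callers use, and the function name and the example cutoff are hypothetical.

```python
from math import comb


def strand_bias_pvalue(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Two-sided Fisher's exact test p-value for a 2x2 strand table.

    Rows are alleles (reference, alternate); columns are read orientation
    (forward, reverse). A small p-value suggests the alternate allele is
    supported disproportionately by one strand.
    """
    row_ref = ref_fwd + ref_rev
    row_alt = alt_fwd + alt_rev
    col_fwd = ref_fwd + alt_fwd
    total = row_ref + row_alt
    if total == 0:
        return 1.0
    denominator = comb(total, col_fwd)

    def table_probability(x):
        # Hypergeometric probability that x reference reads are on the forward strand.
        return comb(row_ref, x) * comb(row_alt, col_fwd - x) / denominator

    observed = table_probability(ref_fwd)
    lowest = max(0, col_fwd - row_alt)
    highest = min(row_ref, col_fwd)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(
        p
        for p in (table_probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical cutoff for illustration; tune it for real data.
    cutoff = 0.01
    balanced = strand_bias_pvalue(20, 20, 8, 7)
    biased = strand_bias_pvalue(20, 20, 15, 0)
    print(f"balanced: p={balanced:.3g}, flagged={balanced < cutoff}")
    print(f"biased:   p={biased:.3g}, flagged={biased < cutoff}")
```

A candidate whose alternate reads all come from one orientation, while the reference reads are split evenly, gets a small p-value and would be flagged.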
Simple consensus approach
Simple consensus approach-Data Sets
Whole-exome sequencing (WES)
- 27 ovarian tumors and their matched germline samples
- HiSeq 2000 sequencer, using 100 bp paired-end reads
- mean coverage ranged from 102~225x in tumors and 119~118x in germline samples
Validation
- Sanger sequencing
Somatic SNV detection programs
- JointSNVMix2, MuTect, and SomaticSniper
- implement sophisticated detection algorithms
- used in major tumor sequencing studies
Simple consensus approach-identification somatic SNVs
MuTect (v 1.0.27783)
- only the default parameter set was applied
- calls not labeled 'REJECT' were kept
JointSNVMix2 (v 0.7.5)
- default prior genotype probabilities used for the training set
- joint probability of 0.9999 or greater
SomaticSniper (v 1.0.0)
- joint genotyping mode (-J option)
- default prior probability of a somatic mutation (0.01)
- reads with a mapping quality of 0 were filtered
- predictions with a 'somatic score' of 40 or greater
SAMtools mpileup
- mapping qualities
- directionality
- depth of reads
Filtering criteria
- total read depth of 8 or greater in both the tumor and the normal (T/N)
- mutant allele frequency of ≥20% in tumor and ≥5% germline
- mutant allele supported by reads mapping in both forward and reverse orientations
- variant call in only one tumor (with the exception of the BRAF V600 and KRAS G12/13 hotspots)
Predictions
- combined total of 9,226 somatic SNV predictions
- median of 321 predictions per sample (range 147~695)
- SomaticSniper and JointSNVMix2: most mutations per sample (median 171, 173)
- MuTect was more conservative (median 115)
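The per-caller cutoffs listed on this slide (MuTect calls not labeled 'REJECT', JointSNVMix2 joint probability of 0.9999 or greater, SomaticSniper somatic score of 40 or greater) can be sketched as a single predicate. The dictionary keys below are hypothetical field names chosen for illustration; they are not the callers' actual output column names.

```python
def passes_caller_threshold(caller, call):
    """Return True when a raw prediction meets the cutoff listed for its caller.

    `call` is a hypothetical dict of already-parsed values; the key names
    ("judgement", "joint_probability", "somatic_score") are assumptions.
    """
    if caller == "MuTect":
        # Keep everything MuTect did not label as REJECT.
        return call["judgement"] != "REJECT"
    if caller == "JointSNVMix2":
        # Keep calls with a joint somatic probability of 0.9999 or greater.
        return call["joint_probability"] >= 0.9999
    if caller == "SomaticSniper":
        # Keep calls with a somatic score of 40 or greater.
        return call["somatic_score"] >= 40
    raise ValueError(f"unknown caller: {caller}")


if __name__ == "__main__":
    examples = [
        ("MuTect", {"judgement": "KEEP"}),
        ("JointSNVMix2", {"joint_probability": 0.999}),
        ("SomaticSniper", {"somatic_score": 52}),
    ]
    for caller, call in examples:
        print(caller, call, passes_caller_threshold(caller, call))
```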
Simple consensus approach-Prediction Results
Simple consensus approach-Properties of Predictions
Non-reference allele frequency in germline
- SomaticSniper (S) and JointSNVMix2 (J): tolerate a substantial number of reads with non-reference alleles, which lets a significant number of germline variants into the call set
- MuTect (M): much more stringent on evidence for non-reference alleles
Non-reference allele frequency in tumor
- predictions made by only one or two programs have a lower proportion of non-reference reads
- they lack sufficient allelic ratios to be predicted as somatic by every program, but have enough support to rise above the thresholds of at least one program
Simple consensus approach-Validation Results
Simple consensus approach-Filtering Results
Filters applied
- taking the consensus with GATK UnifiedGenotyper
- mate-pair rescue read filtering
- minimum read depth of 10 in both the tumor and germline
2 true positive
Simple consensus approach-Summary of Findings
Powerful method for increasing the validation rate while maintaining maximum sensitivity
Similar effects are likely to influence other bioinformatics classification problems
Prove effective for a variety of genomics and bioinformatics analyses
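A simple consensus across callers can be sketched as a vote count over variant keys. This is a minimal illustration: the `(chrom, pos, ref, alt)` key layout and the default of two supporting callers are assumptions, not parameters stated on the slides.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_callers=2):
    """Return the variants reported by at least `min_callers` callers.

    `calls_by_caller` maps a caller name to an iterable of hashable variant
    keys, here assumed to be (chrom, pos, ref, alt) tuples.
    """
    votes = Counter()
    for variants in calls_by_caller.values():
        # Use a set so a caller cannot vote twice for the same variant.
        votes.update(set(variants))
    return {variant for variant, count in votes.items() if count >= min_callers}


if __name__ == "__main__":
    # Made-up variant keys for illustration only.
    calls = {
        "MuTect": [("17", 100, "A", "T"), ("17", 200, "G", "C")],
        "JointSNVMix2": [("17", 100, "A", "T"), ("17", 300, "C", "G")],
        "SomaticSniper": [("17", 100, "A", "T"), ("17", 200, "G", "C")],
    }
    print(sorted(consensus_calls(calls)))
    print(sorted(consensus_calls(calls, min_callers=3)))
```

Raising `min_callers` trades sensitivity for a higher expected validation rate, which is the trade-off the summary above refers to.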
Integrated/Ensemble Approach
Ensemble
- using multiple learning algorithms to obtain better predictive performance (three somatic SNV callers: SomaticSniper, MuTect, and VarScan2)
Integrated
- for better performance, we use additional filtering
- GATK UnifiedGenotyper: keep SNVs predicted in the tumor but not in the germline
- scoring system: helps us identify strong and relevant mutation candidates
Integrated/Ensemble Approach: Workflow (diagram)
Subject: tumor.bam, normal.bam
Somatic callers: MuTect -> somatic.vcf; VarScan2 (input from SAMtools mpileup) -> somatic.vcf; SomaticSniper -> somatic.vcf
GATK: tumor.vcf, normal.vcf
Consensus somatic.vcf -> filtered (GATK) somatic.vcf -> validated (GATK) somatic.vcf
Validation: COSMIC, CCLE validated somatic list
Integrated/Ensemble Approach: Data Sets (CGHub)
https://cghub.ucsc.edu/datasets/benchmark_download.html
CGHub (Cancer Genomics Hub)
- a resource of the National Cancer Institute for the Cancer Genome Atlas (TCGA) consortium and related projects
Integrated/Ensemble Approach: TCGA Benchmark 4
Three parts to the mutation calling exercise
- cell lines derived from grade 3 breast ductal carcinomas (breast cancer)
- HCC1143 (50x) vs. HCC1143 BL (60x)
- HCC1954 (58x) vs. HCC1954 BL (71x)
- simulated normal contamination and subclone expansion for both
Total: 28 .bam files, ~4.3 TB
Integrated/Ensemble Approach: download using GeneTorrent
GeneTorrent
- client software for downloading sequence data from CGHub's repository
- two main programs: gtdownload and cgquery
Get the public key
- public key: https://cghub.ucsc.edu/software/downloads/cghub_public.key
- TCGA key: requires approval to access the restricted data from the ICGC-DACO
Download by UUID (XML file)
Validation Data Sets
COSMIC (Catalogue of Somatic Mutations in Cancer)
- Cell Lines Project, Wellcome Trust Sanger Institute
- http://cancer.sanger.ac.uk/cancergenome/projects/cell_lines/
CCLE (Cancer Cell Line Encyclopedia)
- Broad Institute and Novartis Institutes for Biomedical Research
- http://www.broadinstitute.org/ccle/home
Validation Data Sets (18)
17:5445207-5445207
17:7577538-7577538
17:10411982-10411982
17:43364293-43364293
17:47892946-47892946
17:67538038-67538038
17:67012449-67012449
17:48538716-48538716
17:27936181-27936181
17:79650824-79650824
17:79638782-79638782
17:76528554-76528554
17:6683197-6683197
17:73235515-73235515
17:39505636-39505636
17:33310040-33310040
17:56083818-56083818
17:37374298-37374298
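The validation positions above use the `chrom:start-end` interval notation. As a minimal sketch, they can be parsed and compared against a caller's output as follows; the function names and the `(chrom, pos)` representation of called variants are assumptions for illustration.

```python
def parse_region(region):
    """Parse a 'chrom:start-end' string into (chrom, start, end), 1-based inclusive."""
    chrom, span = region.split(":")
    start_text, end_text = span.split("-")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start is after end in region: {region}")
    return chrom, start, end


def count_validated(called_positions, validated_regions):
    """Count distinct called (chrom, pos) pairs that fall inside any validated region."""
    regions = [parse_region(region) for region in validated_regions]
    return sum(
        1
        for chrom, pos in set(called_positions)
        if any(chrom == c and start <= pos <= end for c, start, end in regions)
    )


if __name__ == "__main__":
    # Two of the 18 validation positions listed above.
    validated = ["17:7577538-7577538", "17:43364293-43364293"]
    # Hypothetical caller output; the second position is made up.
    called = [("17", 7577538), ("17", 12345)]
    print(count_validated(called, validated))
```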
Java Application: version
Java version
- Java 6 and Java 7 are both used in many systems
Select the Java version
- use "update-alternatives --config java"
MuTect runs on Java 6; GATK runs on Java 7 :-(
Java Application: running options
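-Xmx7g
- sets the maximum heap size of the Java program (here 7 GB)
- on older JVMs the default heap size for running a Java program is only 64 MB
- "java.lang.OutOfMemoryError" occurs when the heap size is insufficient
-Djava.io.tmpdir=/tmp
- sets a system property value
- here, the temporary directory that Java will use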
java [-java_options] -jar jarfile [jarfile_options]
java -Xmx10g -Djava.io.tmpdir=/tmp -jar muTect-1.1.4.jar --analysis_type MuTect --reference_sequence human_g1k_v37_decoy.fasta
SomaticSniper v1.0.4
# input_bam1 = tumor BAM, input_bam2 = normal BAM
bam-somaticsniper -J -F vcf \
  -n HCC1143_Normal -t HCC1143_Tumor \
  -f ${gatk_b37} \
  ${input_bam1} ${input_bam2} \
  HCC1143_chr17_somaticsniper.vcf
MuTect v1.1.4
java -jar muTect-1.1.4.jar \
  --analysis_type MuTect \
  --reference_sequence ${gatk_b37} \
  --input_file:normal ${input_bam2} \
  --input_file:tumor ${input_bam1} \
  --out HCC1143_chr17_mutect.out \
  --vcf HCC1143_chr17_mutect.vcf \
  --coverage_file HCC1143_chr17_mutect.cov.wig.txt \
  -nt 7 \
  --normal_sample_name normal \
  --tumor_sample_name tumor \
  -L 17
VarScan2 v2.3.7
# Step 1: build a joint pileup (normal BAM first, then tumor BAM)
samtools mpileup -f ${gatk_b37} -Q 20 -q 20 -B ${input_bam2} ${input_bam1} > hcc1143_chr17.mpileup
# Step 2: call somatic variants from the joint pileup
java -jar VarScan.v2.3.7.jar somatic hcc1143_chr17.mpileup HCC1143_chr17.varscan --mpileup 1 --output-vcf 1
GATK UnifiedGenotyper
java -jar GenomeAnalysisTK.jar \
  -T UnifiedGenotyper \
  -L 17 \
  -o hcc1143_chr17.gatk.normal.vcf \
  -I ${input_bam2} \
  --genotype_likelihoods_model BOTH \
  -minIndelFrac 0.2 \
  --min_base_quality_score 17 \
  --standard_min_confidence_threshold_for_calling 30.0 \
  --standard_min_confidence_threshold_for_emitting 30.0 \
  --baq CALCULATE_AS_NECESSARY \
  --baqGapOpenPenalty 30.0 \
  --defaultBaseQualities -1 \
  --validation_strictness STRICT \
  --interval_merging ALL \
  -R ${gatk_b37} \
  -nt 7
GATK SelectVariants
Select variants from a VCF source
- discordance: select all calls missed in mycalls, but present in Hiscalls
- concordance: select all calls made by both mycalls and Hiscalls
- selectType SNP/MNP (with restrictAllelesTo MULTIALLELIC): select only multi-allelic SNPs and MNPs
- intervals (-L): restrict the output VCF to a set of intervals
Ensemble approach - results & rank score
Each filtered count (total variant count / filtered count)
- SomaticSniper: 2,381 / 624
- MuTect: 132,239 / 4,318
- VarScan2: 89,986 / 1,457
Concordance call (204 variants)
- total: 460 variants
- excluding GATK germline calls: 324 variants
- restricting to GATK calls in the cancer sample: 204 variants
Validation count (total variant count / validated count)
- SomaticSniper: 2,381 / 9 (+4)
- MuTect: 132,239 / 13 (+8)
- VarScan2: 89,986 / 6 (+1)
- filtered consensus: 204 / 5
Rank scores shown in the figure: 1, 2, 3, 4, 5
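The concordance steps reported on this slide (consensus among the somatic callers, then excluding GATK germline calls, then keeping only variants GATK also calls in the cancer sample) can be sketched with set operations. This sketch assumes the consensus is the intersection of all three callers, which the slides do not state explicitly; the function name and the `(chrom, pos)` keys are hypothetical.

```python
def integrated_consensus(mutect, varscan2, somaticsniper, gatk_tumor, gatk_normal):
    """Apply the three filtering stages and return the variant set after each one."""
    # Stage 1: variants reported by all three somatic callers (assumed intersection).
    concordant = set(mutect) & set(varscan2) & set(somaticsniper)
    # Stage 2: drop anything GATK also calls in the matched normal (likely germline).
    not_germline = concordant - set(gatk_normal)
    # Stage 3: keep only variants GATK independently calls in the tumor.
    confirmed = not_germline & set(gatk_tumor)
    return {
        "concordant": concordant,
        "not_germline": not_germline,
        "confirmed": confirmed,
    }


if __name__ == "__main__":
    # Made-up positions for illustration only.
    stages = integrated_consensus(
        mutect={("17", 1), ("17", 2), ("17", 3), ("17", 4)},
        varscan2={("17", 1), ("17", 2), ("17", 3)},
        somaticsniper={("17", 1), ("17", 2), ("17", 3), ("17", 5)},
        gatk_tumor={("17", 1), ("17", 3)},
        gatk_normal={("17", 3)},
    )
    for name, variants in stages.items():
        print(name, len(variants))
```

Each stage can only shrink the candidate set, mirroring the 460, 324, and 204 variant counts reported above.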
Summary
Identifying somatic changes from tumor and matched normal sequence requires accurate detection of somatic point mutations with low allele frequencies in impure and heterogeneous cancer samples
Mutations called by multiple tools are of higher confidence than mutations called by a single tool
Utilizing multiple callers can be a powerful way to construct a list of final calls for one's research
The tools can be run in parallel, which shortens the total run time
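Because the callers are independent command-line programs, they can be launched side by side. The following is a minimal sketch using Python's standard library; the function name is hypothetical, and the demo uses trivial placeholder commands rather than the real caller invocations.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(commands, max_workers=3):
    """Run each command as a separate process and return {name: exit code}.

    `commands` maps a label (for example a caller name) to an argument list.
    Threads are sufficient here because each worker only waits on a child process.
    """
    def run(argv):
        return subprocess.run(argv, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(run, argv) for name, argv in commands.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    # Placeholder commands standing in for the real caller command lines.
    demo = {
        name: [sys.executable, "-c", "pass"]
        for name in ("SomaticSniper", "MuTect", "VarScan2")
    }
    print(run_in_parallel(demo))
```

With this layout the total wall-clock time is roughly that of the slowest caller, provided the machine has enough memory and cores for all of them at once.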
References
Wang, Q., Jia, P., Li, F., Chen, H., Ji, H., Hucks, D., et al. (2013). Detecting somatic point mutations in cancer genome sequencing data: a comparison of mutation callers. Genome Medicine, 5(10), 91. doi:10.1186/gm495
Goode, D. L., Hunter, S. M., Doyle, M. A., Ma, T., Rowley, S. M., Choong, D., et al. (2013). A simple consensus approach improves somatic mutation prediction accuracy. Genome Medicine, 5(9), 90. doi:10.1186/gm494
Roberts, N. D., Kortschak, R. D., Parker, W. T., Schreiber, A. W., Branford, S., Scott, H. S., et al. (2013). A comparative analysis of algorithms for somatic SNV detection in cancer. Bioinformatics, 29(18), 2223–2230. doi:10.1093/bioinformatics/btt375
Xu, H., DiCarlo, J., Satya, R. V., Peng, Q., & Wang, Y. (2014). Comparison of somatic mutation calling methods in amplicon and whole exome sequence data. BMC Genomics, 15(1), 244. doi:10.1186/1471-2164-15-244
Kim, S. Y., Jacob, L., & Speed, T. P. (2014). Combining calls from multiple somatic mutation-callers. BMC Bioinformatics, 15(1), 154. doi:10.1186/1471-2105-15-154
Löwer, M., Renard, B. Y., de Graaf, J., Wagner, M., Paret, C., Kneip, C., et al. (2012). Confidence-based Somatic Mutation Evaluation and Prioritization. PLoS Computational Biology, 8(9), e1002714. doi:10.1371/journal.pcbi.1002714
Ding, J., Bashashati, A., Roth, A., Oloumi, A., Tse, K., Zeng, T., et al. (2012). Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics, 28(2), 167–175. doi:10.1093/bioinformatics/btr629
Fischer, A., Vázquez-García, I., Illingworth, C. J. R., & Mustonen, V. (2014). High-definition reconstruction of clonal composition in cancer. Cell Reports, 7(5), 1740–1752. doi:10.1016/j.celrep.2014.04.055
Frampton, G. M., Fichtenholtz, A., Otto, G. A., Wang, K., Downing, S. R., He, J., et al. (2013). Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing. Nature Biotechnology, 31(11), 1023–1031. doi:10.1038/nbt.2696
Cibulskis, K., Lawrence, M. S., Carter, S. L., Sivachenko, A., Jaffe, D., Sougnez, C., et al. (2013). Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nature Biotechnology, 31(3), 213–219. doi:10.1038/nbt.2514
Roth, A., Ding, J., Morin, R., Crisan, A., Ha, G., Giuliany, R., et al. (2012). JointSNVMix: a probabilistic model for accurate detection of somatic mutations in normal/tumour paired next-generation sequencing data. Bioinformatics, 28(7), 907–913. doi:10.1093/bioinformatics/bts053
Koboldt, D. C., Zhang, Q., Larson, D. E., Shen, D., McLellan, M. D., Lin, L., et al. (2012). VarScan 2: Somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Research, 22(3), 568–576. doi:10.1101/gr.129684.111
Larson, D. E., Harris, C. C., Chen, K., Koboldt, D. C., Abbott, T. E., Dooling, D. J., et al. (2012). SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics, 28(3), 311–317. doi:10.1093/bioinformatics/btr665