Genomics Method Seminar - BreakDancer January 21, 2015 Sora Kim Researcher sora15@yuhs.ac Yonsei Biomedical Science Institute Yonsei University College of Medicine


Page 1: Genomics Method Seminar - BreakDancer January 21, 2015 Sora Kim Researcher sora15@yuhs.ac Yonsei Biomedical Science Institute Yonsei University College

Genomics Method Seminar - BreakDancer

January 21, 2015

Sora Kim

sora15@yuhs.ac

Yonsei Biomedical Science Institute
Yonsei University College of Medicine


Today’s paper

• Ken Chen, PhD
– Assistant Professor, Department of Bioinformatics and Computational Biology, Division of Quantitative Sciences, The University of Texas MD Anderson Cancer Center, Houston, TX

– Dr. Chen has designed, developed, and co-developed a set of computational tools such as BreakDancer, TIGRA, CREST, BreakTrans, BreakFusion, PolyScan, SomaticSniper, and VarScan


Conceptual Overview


Structural Variation

Hurles ME, Trends Genet (2008) 24: 238–245


Structural Variation

Hurles ME, Trends Genet (2008) 24: 238–245


Structural variation sequence signatures

Can Alkan, Nature Reviews Genetics (2011) 12, 363-376


Structural variation sequence signatures

Can Alkan, Nature Reviews Genetics (2011) 12, 363-376


BreakDancer Overview


BreakDancer

• BreakDancer consists of two complementary algorithms
– BreakDancerMax
• provides genome-wide detection of five types of structural variants: deletions, insertions, inversions, and intra-/inter-chromosomal translocations
– BreakDancerMini
• focuses on detecting small indels (typically 10-100 bp) that are not routinely detected by BreakDancerMax

• In a family- or population-based study, pooling samples enhanced the detection of common variants.

• In a paired tumor and normal sample study, pooling improved the specificity of somatic variant prediction through effective elimination of inherited variants.


BreakDancerMax

1. BreakDancerMax starts with the map files produced by MAQ.

2. Read pairs mapped to a reference genome with sufficient mapping quality are independently classified into six types: normal, deletion, insertion, inversion, intrachromosomal translocation and interchromosomal translocation.

• This classification process is based on
a. the separation distance and alignment orientation between the paired reads
b. the user-specified threshold
c. the empirical insert size distribution estimated from the alignment of each library contributing genome coverage


BreakDancerMax

3. The algorithm then searches for genomic regions that anchor substantially more anomalous read pairs (ARPs) than expected on average.

4. A putative structural variant is derived from the identification of one or more regions that are interconnected by at least two ARPs.

5. The confidence score is estimated for each variant based on a Poisson model that takes into consideration the number of supporting ARPs, the size of the anchoring regions and the coverage of the genome.
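
As a rough illustration of step 2, the sketch below labels a single read pair from its separation distance and orientation. The `ReadPair` fields, the order of the rules and the `lower`/`upper` thresholds are assumptions made for this example (forward/reverse paired-end reads, leftmost mate listed first); BreakDancerMax's actual rules may differ.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical container for one mapped read pair (leftmost mate first)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify(pair: ReadPair, lower: float, upper: float) -> str:
    """Assign one of six labels using simplified, assumed rules."""
    if pair.chrom1 != pair.chrom2:
        return "CTX"  # mates map to different chromosomes
    if pair.strand1 == pair.strand2:
        return "INV"  # both mates on the same strand
    if pair.strand1 == "-":
        return "ITX"  # mates point away from each other
    distance = abs(pair.pos2 - pair.pos1)
    if distance > upper:
        return "DEL"  # mates farther apart than the library allows
    if distance < lower:
        return "INS"  # mates closer together than expected
    return "normal"


# Example pair; the thresholds are taken from the bam2cfg output shown later.
pair = ReadPair("22", 51119695, "+", "22", 51121322, "-")
print(classify(pair, lower=110.35, upper=173.67))
```

In this toy version the thresholds would come from the per-library insert size distribution, which is why each library needs its own `lower` and `upper` values.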


Confidence score estimation

• The accuracy of the score depends on many factors:
– whether the set of reads is an unbiased sampling of the genome and all alleles
– whether the reads are mapped to the correct locations
– whether the amount of observed evidence is sufficient

• One of the primary signals for the presence of a structural variant is the clustering of ARPs.
– It is therefore important to measure the degree of clustering from the perspective of both depth and breadth.


Confidence score estimation

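• The authors assumed that, under the null hypothesis of no variant, the genomic locations of inserts of one particular type are uniformly distributed, so the number of such inserts falling in a given region follows a Poisson distribution.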

• For studies that define more than one insert type, the number of inserts at a particular location forms a mixture Poisson distribution with each mixture component representing one of the insert types.


Confidence score estimation

• The statistic that summarizes the degree of clustering of a particular insert type is the probability of having more than the observed number of inserts in a given region:

$$P(n_i > k_i) = 1 - \sum_{n=0}^{k_i} \frac{\lambda_i^{n} e^{-\lambda_i}}{n!}, \qquad \lambda_i = \frac{s N_i}{G}$$

• $n_i$ denotes a Poisson random variable with mean equal to $\lambda_i$
• $i$ is the type of the insert
• $k_i$ is the number of observed type $i$ inserts
• $s$ represents the cumulative size of the regions that the ARPs anchor to
• $N_i$ is the total number of type $i$ inserts in the entire dataset and $G$ the length of the reference genome
• $N_i$ is counted directly from the data without assuming any form of insert size distribution.
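
The Poisson tail probability described on this slide can be sketched in a few lines. The version below assumes the Poisson mean is the anchoring-region size multiplied by the genome-wide density of that insert type (total count divided by genome length); the function names and the example numbers are illustrative and are not taken from BreakDancer's source.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """Return P(n > k) for n ~ Poisson(lam)."""
    if lam <= 0.0:
        return 0.0
    if k < lam:
        # The tail is large here, so 1 - CDF loses little precision.
        term = math.exp(-lam)  # pmf at n = 0
        cdf = term
        for n in range(1, k + 1):
            term *= lam / n  # pmf(n) from pmf(n - 1)
            cdf += term
        return max(0.0, 1.0 - cdf)
    # The tail is small: sum pmf(k + 1), pmf(k + 2), ... directly to avoid
    # cancellation. Terms shrink monotonically because n > lam.
    n = k + 1
    term = math.exp(-lam + n * math.log(lam) - math.lgamma(n + 1))
    total = 0.0
    while term > total * 1e-17:
        total += term
        n += 1
        term *= lam / n
    return total


def arp_cluster_pvalue(k: int, region_size: int, total_inserts: int,
                       genome_length: int) -> float:
    """P value for seeing more than k inserts of one type in a region."""
    lam = region_size * total_inserts / genome_length  # expected count
    return poisson_tail(k, lam)


# Hypothetical numbers: 3 ARPs anchored in 300 bp, 50,000 ARPs of this type
# genome-wide, 3 Gbp reference.
print(arp_cluster_pvalue(3, 300, 50_000, 3_000_000_000))
```

Summing the upper tail directly matters in practice because the interesting P values are tiny, and computing them as one minus a number close to one would discard most of their significant digits.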


Confidence score estimation

• This probabilistic scoring system can conveniently integrate information from multiple libraries from the same or different individuals using Fisher's method, assuming that the m libraries are produced independently:

$$P = P\left(\chi^2_{2m} \ge -2 \sum_{j=1}^{m} \log P_j\right)$$

• $\chi^2_{2m}$ denotes a chi-square distribution with 2m degrees of freedom
• $P_j$ is the P value obtained from the j-th library
• The combined P value is converted to the Phred scale: $Q = -10 \log_{10} P$
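
A minimal sketch of combining per-library P values with Fisher's method and converting the result to the Phred scale. It relies on the closed-form survival function of a chi-square distribution with an even number of degrees of freedom, so no external library is needed; the function names are illustrative, and every P value is assumed to be strictly greater than zero.

```python
import math
from typing import Sequence


def chi2_sf_even(x: float, m: int) -> float:
    """P(X >= x) for a chi-square variable with 2m degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!.
    """
    half = x / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, m):
        term *= half / j
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values: Sequence[float]) -> float:
    """Combine independent P values (each > 0) with Fisher's method."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even(statistic, len(p_values))


def phred(p: float) -> float:
    """Convert a P value to the Phred scale."""
    return -10.0 * math.log10(p)


combined = fisher_combine([0.01, 0.01])  # two hypothetical libraries
print(combined, phred(combined))
```

With a single library the combined P value equals the input, which is a quick sanity check on the formula.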


BreakDancerMini

1. BreakDancerMini analyzes the normally mapped read pairs that were ignored by BreakDancerMax.

2. A genomic region of size equivalent to the mean insert size is classified as either normal or anomalous based on a sliding window test that examines the difference between the separation distances of read pairs mapped within the window and those in the entire genome.

3. A confidence score is assigned based on the significance value of the sliding window test.


The sliding window test

• A sliding window test is applied to identify anomalous regions that contain read pairs significantly different from those in the entire genome.

• By default, BreakDancerMini uses a fixed window size (in bp) derived from the quantities below and a step size of 1 bp
– $\mu$ (mean) and $\sigma$ (s.d.) are estimated from the separation distance of normally and confidently (mapping quality > 40) mapped read pairs
– $l$ is the average read length.

• A two-sample Kolmogorov-Smirnov (KS) test statistic is computed for each window; in its standard form:

$$D_{n,n'} = \sup_x \left| F_n(x) - F_{n'}(x) \right|$$

• $F_n$ and $F_{n'}$ are the empirical cumulative distribution functions (ECDFs) estimated from the normal reads in the window and in the entire genome, respectively
• $n$ and $n'$ are the number of reads in each set
• $x$ is the separation distance from 1 bp to a maximum size (~300 bp)
• sup denotes the supremum of the set
• $D_{n,n'}$ objectively measures the difference between the two ECDFs in terms of both location and shape.


Variant calling based on local assembly

• A local assembly of the breakpoints within a suspected variant region can confirm the existence of the structural variant, precisely define the breakpoint locations and determine any inserted sequences that may be present (tools: MAQ, Velvet, Phrap).

• If the derived contig sequences cumulatively covered over 75% of the region from which the reads were extracted, the contigs were aligned to a region of the human reference sequence containing the structural variant and 1,000 bp of flanking sequence on either side using cross-match.

• A variant was called if there was a gap, or if the tumor and the normal contigs contained consistent breakpoints.
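
The 75% coverage criterion can be illustrated with a small interval-union calculation. How contigs are placed on the region is not described in the slides, so the half-open `(start, end)` intervals below are a hypothetical representation chosen for this sketch.

```python
from typing import Iterable, Tuple


def covered_fraction(region_start: int, region_end: int,
                     contigs: Iterable[Tuple[int, int]]) -> float:
    """Fraction of [region_start, region_end) covered by the union of contigs."""
    covered = 0
    cursor = region_start  # everything left of the cursor is already counted
    for start, end in sorted(contigs):
        start = max(start, cursor)   # skip the part already counted
        end = min(end, region_end)   # ignore anything beyond the region
        if end > start:
            covered += end - start
            cursor = end
    return covered / (region_end - region_start)


# Two overlapping hypothetical contigs on a 1,000 bp region.
fraction = covered_fraction(0, 1000, [(0, 400), (300, 800)])
print(fraction, fraction > 0.75)
```

Overlapping contigs are counted once, which is what "cumulatively covered" is taken to mean here.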


SV Detection

• BreakDancer bam2cfg
– Computes the insert size distribution and generates the BreakDancer configuration file
– Command
• /BIO/app/breakdancer-1.1.2/perl/bam2cfg.pl -c 4 -q 35 -h /BIO/ewha/SAMPLES/NA12878.chrom22.bam > NA12878.chrom22.cfg
• -c : Cutoff in units of standard deviation
• -q : Minimum mapping quality
• -h : Plot insert size histogram for each BAM library


SV Detection

• BreakDancer bam2cfg
– Output

readgroup:ERR001719 platform:ILLUMINA map:/BIO/ewha/SAMPLES/NA12878.chrom22.bam readlen:36.00 lib:g1k-sc-NA12878-CEU-1 num:10001 lower:110.35 upper:173.67 mean:140.02 std:10.52 SWnormality:-17.69 exe:samtools view

• upper = mean + std * c
• The histogram should not be bimodal
• std / mean < 0.3
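
The quality checks on the configuration line can be automated. The sketch below parses the whitespace-separated `key:value` tokens exactly as they appear in this transcript and tests the std / mean < 0.3 rule of thumb; the function names are illustrative and the parser is not part of BreakDancer.

```python
from typing import Dict


def parse_cfg_line(line: str) -> Dict[str, str]:
    """Split one bam2cfg line into a dict of key -> value strings."""
    fields: Dict[str, str] = {}
    key = None
    for token in line.split():
        name, sep, value = token.partition(":")
        if sep:
            key = name
            fields[key] = value
        elif key is not None:
            # A token without ':' continues the previous value,
            # e.g. "exe:samtools view".
            fields[key] += " " + token
    return fields


def library_looks_usable(fields: Dict[str, str], max_ratio: float = 0.3) -> bool:
    """Apply the std / mean < 0.3 rule of thumb from the slide."""
    return float(fields["std"]) / float(fields["mean"]) < max_ratio


line = ("readgroup:ERR001719 platform:ILLUMINA "
        "map:/BIO/ewha/SAMPLES/NA12878.chrom22.bam readlen:36.00 "
        "lib:g1k-sc-NA12878-CEU-1 num:10001 lower:110.35 upper:173.67 "
        "mean:140.02 std:10.52 SWnormality:-17.69 exe:samtools view")
cfg = parse_cfg_line(line)
print(cfg["lib"], library_looks_usable(cfg))
```

For this library std / mean is roughly 0.075, comfortably below the 0.3 limit. The bimodality check still needs the histogram produced by the -h option.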


SV Detection

• breakdancer-max
– Calls SVs by detecting clusters of read pairs that show an abnormal insert size or orientation
– Command
• /BIO/app/breakdancer-1.1.2/cpp/breakdancer-max -c 4 -q 35 -r 2 NA12878.chrom22.cfg > NA12878.chrom22.out
• -c : Cutoff in units of standard deviation
• -q : Minimum mapping quality
• -r : Minimum number of read pairs required to establish a connection


SV Detection

• breakdancer-max
– Output

22 51119695 3+0- 22 51121322 0+3- DEL 1481 74 3 NA12878.chrom22.bam|3 0.16

– Columns
• 1. Chromosome 1
• 2. Position 1
• 3. Orientation 1
• 4. Chromosome 2
• 5. Position 2
• 6. Orientation 2
• 7. Type of SV
• 8. Size of SV
• 9. Confidence score
• 10. Total number of supporting read pairs
• 11. Number of supporting read pairs from each BAM file/library
• 12. Estimated allele frequency

– SV types
• DEL (deletion)
• INS (insertion)
• INV (inversion)
• ITX (intra-chromosomal translocation)
• CTX (inter-chromosomal translocation)
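
For downstream filtering it helps to turn each output row into a typed record. The sketch below maps the twelve columns listed on this slide onto a named tuple; the field names are my own, and the per-library column is kept as a raw string because its exact multi-library layout is not shown here.

```python
from typing import NamedTuple


class SVCall(NamedTuple):
    chrom1: str
    pos1: int
    orient1: str
    chrom2: str
    pos2: int
    orient2: str
    sv_type: str
    size: int
    score: float
    num_reads: int
    reads_per_library: str  # kept unparsed, e.g. "NA12878.chrom22.bam|3"
    allele_frequency: float


def parse_sv_line(line: str) -> SVCall:
    """Parse one whitespace-separated breakdancer-max result row."""
    cols = line.split()
    if len(cols) != 12:
        raise ValueError(f"expected 12 columns, got {len(cols)}")
    return SVCall(
        cols[0], int(cols[1]), cols[2],
        cols[3], int(cols[4]), cols[5],
        cols[6], int(cols[7]), float(cols[8]), int(cols[9]),
        cols[10], float(cols[11]),
    )


row = ("22 51119695 3+0- 22 51121322 0+3- DEL 1481 74 3 "
       "NA12878.chrom22.bam|3 0.16")
call = parse_sv_line(row)
print(call.sv_type, call.size, call.score, call.num_reads)
```

A typical next step would be to keep only calls above a chosen score and read-pair count, for example `call.score >= 30 and call.num_reads >= 2`; those cutoffs are arbitrary examples rather than recommendations from the slides.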


Discussion

• It may be beneficial to incorporate the mapping quality rather than applying a fixed threshold.

• There is evidence suggesting that integrating read depth may help improve segmentation and genotyping, although an effective integration method is yet to be discovered.

• Some types of structural variants, such as inversions and translocations, appeared to be more difficult to detect and validate.

• Many putative predictions overlapped with regions of tandem or inverted repeats and required further sequence analysis and filtering, or the use of additional longer reads and longer inserts.

• The BreakDancerMini code will not be included in the coming releases. Pindel is recommended instead for detecting intermediate-size indels (10-80 bp).