ababcdfghiejkl - stanford university...idoia ochoa, himanshu asnani, dinesh bharadia, mainak...

1
ABabcdfghiejkl A new lossy compression algorithm for quality scores based on rate distortion theory Idoia Ochoa, Himanshu Asnani, Dinesh Bharadia, Mainak Chowdhury, Tsachy Weissman and Golan Yona Electrical Engineering, Stanford University Introduction Next Generation Sequencing has reduced the time and cost required for sequencing. As a result, large amounts of sequencing data are being generated. A typical sequencing data file may occupy tens or even hundreds of gigabytes of disk space, prohibitively large for many users, who have to download, store and analyze the data. This data is presented in the widely accepted FASTQ format, which consists of millions of entries. Each entry is composed of both the nucleotide sequence and per-base quality scores that indicate the level of confidence in the readout of these sequences. Example of an entry in the FASTQ format: @SRR001666.1 GATTTGGGGTTCAAAGCAGTGCAAGC + IIIHIIHABBBAA=2))!!!(!!!(( Motivation Quality scores account for about half of the required disk space, and therefore the compression of the quality scores can significantly reduce storage requirements and speed up analysis and transmission of sequencing data. Unlike header lines and nucleotide sequences, quality scores are particularly difficult to compress, due to their higher entropy and larger alphabet. Quality scores are important and very useful in many downstream applications such as trimming (used to re- move untrusted regions), alignment or Single Nucleotide Polymorphism (SNP) detection, among others. However, they significantly increase the size of the files storing raw sequencing data. With this in mind, we design a lossy compression algorithm for the quality scores, that minimizes the Mean Square Error (MSE) for a given rate (bits per quality score), specified by the user. Quality Scores A quality score Q is the integer mapping of P (the probability that the corresponding base call is incorrect). The higher the quality score, the higher the reliability of the corresponding base. It is normally repre- sented in the following standards: Sanger or Phred scale: Q = -10 log 10 P . Solexa scale: Q = -10 log 10 P 1-P . Different NGS technologies use different scales, Phred + 33, Phred + 64 and Solexa + 64 being the most common ones. For example, Phred + 33 corresponds to values of Q in the range [33 : 73]. The compression Method Denote the quality score sequences presented in a FASTQ file by {Q i } N i=1 , where Q i =[Q i (1),...,Q i (n)]. We assume each quality score vector is i.i.d. as P Q ∼N (μ Q , Σ Q ). This is justified by the fact that, given a vector source with a particular covariance structure, the gaussian multivariate source is the least com- pressible and, further, a code designed under the gaussian assumption will perform at least as well on any other source of the same covariance. Due to the correlation of quality scores within an entry, Σ Q is not in general a diagonal matrix. We then perform the singular value decomposition (Σ Q = V SV T ), and generate a set of new quality vectors {Q 0 i } N i=1 , where Q 0 i = V T (Q i - μ Q ). Due to the gaussian nature of the vectors, we have Q 0 i ∼N (0,S ), where S is diagonal. We can now apply a well known result on rate distortion theory, and optimally allocate nR bits to reduce the MSE by solving the following optimization problem: min ρ=[ρ 1 ,··· n ] 1 n n X j =1 σ 2 j 2 -2ρ j (1) s.t. n X j =1 ρ j nR, (2) where ρ j represents the number of bits allocated to the j th position of Q 0 i , for i [1 : N ]. Finally, we normalize each component of Q 0 i , for i [1 : N ], to a unit variance Gaussian and map them to decision regions representable in ρ j bits. The decision regions that minimize the MSE for different values of ρ and their representative values are found offline from a Lloyd Max procedure on a scalar gaussian distribution with mean zero and variance one. To improve the performance, we allow clustering prior to compression. Simulation Results Data used for simulations: Numer of Reads Length of each read Quality Scores PhiX (Virus) 13,310,768 100 66:98 SRR089526 (H. Sapiens) 23,892,841 48 33:73 Rate vs MSE: Alignment with the Burrows Wheeler Aligner (BWA) followed by SNP calling with Samtools for data SRR089526: T.P., F.P. and F.N. stand for true positives (detected both with the original FASTQ file and the reconstructed one), false positive (detected only with the reconstruced FASTQ file) and false negative (detected only with the original FASTQ file), respectively. The selectivity parameter is computed as T.P./(T.P. + F.P.), and sensitivity as T.P./(T.P. + F.N.). Four clusters R MSE T.P. F.P F.N. Selectivity (%) Sensitivity (%) 0 14.25 54708 4868 5719 91.82 90.53 0.25 6.99 58963 4722 1464 92.58 97.57 0.50 5.76 59110 4480 1317 92.95 97.82 1.00 3.99 59318 4028 1109 93.64 98.16 2.00 1.91 59592 3229 835 94.85 98.61 Software Availability The software is freely available for download at: http://www.stanford.edu/iochoa/QualComp.html Acknowledgement This work is supported in part by La Caixa Fellowship and by the NSF Center for Science of Information.

Upload: others

Post on 05-Oct-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ABabcdfghiejkl - Stanford University...Idoia Ochoa, Himanshu Asnani, Dinesh Bharadia, Mainak Chowdhury, Tsachy Weissman and Golan Yona Electrical Engineering, Stanford University Introduction

                  ABabcdfghiejklAnew lossycompressionalgorithmforqualityscoresbasedon ratedistortion theory

Idoia Ochoa, Himanshu Asnani, Dinesh Bharadia, Mainak Chowdhury, Tsachy Weissman and Golan YonaElectrical Engineering, Stanford University

IntroductionNext Generation Sequencing has reduced the time and costrequired for sequencing.

As a result, large amounts of sequencing data are beinggenerated. A typical sequencing data file may occupy tens oreven hundreds of gigabytes of disk space, prohibitively largefor many users, who have to download, store and analyzethe data.

This data is presented in the widely accepted FASTQ format,which consists of millions of entries. Each entry is composedof both the nucleotide sequence and per-base quality scoresthat indicate the level of confidence in the readout of thesesequences.

Example of an entry in the FASTQ format:

@SRR001666.1GATTTGGGGTTCAAAGCAGTGCAAGC+IIIHIIHABBBAA=2))!!!(!!!((

MotivationQuality scores account for about half of the required diskspace, and therefore the compression of the quality scorescan significantly reduce storage requirements and speed upanalysis and transmission of sequencing data.

Unlike header lines and nucleotide sequences, quality scoresare particularly difficult to compress, due to their higherentropy and larger alphabet.

Quality scores are important and very useful in manydownstream applications such as trimming (used to re-move untrusted regions), alignment or Single NucleotidePolymorphism (SNP) detection, among others. However,they significantly increase the size of the files storing rawsequencing data.

With this in mind, we design a lossy compression algorithmfor the quality scores, that minimizes the Mean Square Error(MSE) for a given rate (bits per quality score), specified bythe user.

Quality Scores

A quality score Q is the integer mapping of P (the probability that thecorresponding base call is incorrect). The higher the quality score, thehigher the reliability of the corresponding base. It is normally repre-sented in the following standards:

• Sanger or Phred scale: Q = −10 log10 P .

• Solexa scale: Q = −10 log10P

1−P .

Different NGS technologies use different scales, Phred + 33, Phred + 64and Solexa + 64 being the most common ones.For example, Phred + 33 corresponds to values of Q in the range [33 : 73].

The compression Method

Denote the quality score sequences presented in a FASTQ file by {Qi}Ni=1,where Qi = [Qi(1), . . . , Qi(n)].

We assume each quality score vector is i.i.d. as PQ ∼ N (µQ,ΣQ).

This is justified by the fact that, given a vector source with a particularcovariance structure, the gaussian multivariate source is the least com-pressible and, further, a code designed under the gaussian assumptionwill perform at least as well on any other source of the same covariance.

Due to the correlation of quality scores within an entry, ΣQ is notin general a diagonal matrix. We then perform the singular valuedecomposition (ΣQ = V SV T ), and generate a set of new quality vectors{Q′i}Ni=1, where Q′i = V T (Qi − µQ). Due to the gaussian nature of thevectors, we have Q′i ∼ N (0, S), where S is diagonal.

We can now apply a well known result on rate distortion theory, andoptimally allocate nR bits to reduce the MSE by solving the followingoptimization problem:

minρ=[ρ1,··· ,ρn]

1

n

n∑j=1

σ2j 2−2ρj (1)

s.t.n∑j=1

ρj ≤ nR, (2)

where ρj represents the number of bits allocated to the jth position ofQ′i, for i ∈ [1 : N ].

Finally, we normalize each component of Q′i, for i ∈ [1 : N ], to a unitvariance Gaussian and map them to decision regions representable in ρjbits.

The decision regions that minimize the MSE for different values of ρand their representative values are found offline from a Lloyd Maxprocedure on a scalar gaussian distribution with mean zero and varianceone.

To improve the performance, we allow clustering prior to compression.

Simulation Results

• Data used for simulations:

Numer of Reads Length of each read Quality ScoresPhiX (Virus) 13,310,768 100 66:98

SRR089526 (H. Sapiens) 23,892,841 48 33:73

• Rate vs MSE:

• Alignment with the Burrows Wheeler Aligner (BWA) followed by SNP calling with Samtools for dataSRR089526:

T.P., F.P. and F.N. stand for true positives (detected both with the original FASTQ file and the reconstructedone), false positive (detected only with the reconstruced FASTQ file) and false negative (detected only with theoriginal FASTQ file), respectively.

The selectivity parameter is computed as T.P./(T.P. + F.P.), and sensitivity as T.P./(T.P. + F.N.).

Four clustersR MSE T.P. F.P F.N. Selectivity (%) Sensitivity (%)0 14.25 54708 4868 5719 91.82 90.53

0.25 6.99 58963 4722 1464 92.58 97.570.50 5.76 59110 4480 1317 92.95 97.821.00 3.99 59318 4028 1109 93.64 98.162.00 1.91 59592 3229 835 94.85 98.61

Software AvailabilityThe software is freely available for download at:

• http://www.stanford.edu/∼iochoa/QualComp.html

Acknowledgement• This work is supported in part by La Caixa Fellowship and by the NSF Center for Science of Information.