Advancing Science with DNA Sequence Sequence Clustering MGM Workshop September 26, 2011 Reducing Search Space in Protein and DNA/RNA Sequence Analysis Denis Kaznadzey, GBP


Advancing Science with DNA Sequence

Sequence Clustering

MGM Workshop, September 26, 2011

Reducing Search Space in Protein and DNA/RNA Sequence Analysis

Denis Kaznadzey, GBP


Sequence clustering

To deal with a huge variety of individual ‘objects’:

- Classify into groups of essentially similar objects

- When new data arrives, assign objects to existing groups

- Classify ‘leftovers’

- Occasionally review entire classification

Problem: What is ‘essentially similar’?

• Finding properties that are important (ontological relevancy)

• Does classification reflect reality in any way?


Sequence clustering

Taxonomical Classification vs. Continuity of the Great Chain of Being

Even if reductionist, classification is a tool for studying the world, and biology in particular.

When data is incomplete, any classification is a convention. At the same time, it is an approximation of a “reality”.

Carl Linnaeus Georges Buffon


Sequence clustering

In modern biology, the most abundant type of data is sequence:

• Genomic DNA

• RNA (through RNA-Seq)

• Derived proteins

The primary feature is primary structure, but classification criteria depend on the application.


Sequence Clustering

Select applications in genomic sciences:

Genome assembly: binning, scaffolding

Transcriptomics: EST (read) clustering

Protein function and evolution studies: protein families

Phylogenetic profiling: OTUs


Sequence Clustering

In Metagenomics:

Primary tasks:

• Assess diversity

• Find genes

• Predict functions

• Predict pathways

• Estimate capabilities

Based on sequence comparison.


Sequence Clustering

- Any clustering is based on the distance in some metric.

- Initial clustering is based on pair-wise distances.

- Subsequent classification is based on distances from object to clusters:

    - Representative

    - Set of representatives (all members, at the extreme)

    - Other measure, which may be unrelated to the initial one.
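The object-to-cluster options listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy Hamming metric, the choice of the first member as "the" representative, and the function names do not come from the talk.

```python
def hamming(a, b):
    """Toy metric (assumed for illustration): mismatches between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("hamming() needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def dist_to_representative(obj, cluster, dist=hamming):
    """Distance to a single representative (arbitrarily, the first member)."""
    return dist(obj, cluster[0])


def dist_to_all_members(obj, cluster, dist=hamming):
    """The extreme case: every member is a representative; use the closest one."""
    return min(dist(obj, member) for member in cluster)


def assign(obj, clusters, to_cluster):
    """Index of the cluster with the smallest object-to-cluster distance."""
    return min(range(len(clusters)), key=lambda i: to_cluster(obj, clusters[i]))


clusters = [["AAAAAA", "AAATTT"], ["CCCCCC", "CCCCCG"]]
new_seq = "CCATTT"
by_representative = assign(new_seq, clusters, dist_to_representative)
by_all_members = assign(new_seq, clusters, dist_to_all_members)
```

In this toy data the two measures disagree: the single representative sends the new sequence to the second cluster, while comparing against all members sends it to the first. This is why the choice of object-to-cluster measure matters.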


Sequence Clustering

Once a distance measure is chosen and distances are obtained / computed:

• There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology)

    • K-means, average linkage, complete linkage, single linkage, iterative, SOM, etc.

• However, options for large-volume clustering are limited by the performance of the algorithms.

• Single-linkage can be computed very efficiently

• (The method for pledging new sequences to clusters may be computationally more intensive)


Sequence clustering

Most efficient clustering: transitive-closure based.

• Requires ‘boolean’ distances (two sequences are either linked or not linked)

• Requires the number of nodes to be known

• Space ~ NodesNo

• Run-time (worst) ~ EdgesNo * AveClustSize

• Run-time (average) ~ EdgesNo * log2(AveClustSize)


Sequence clustering

Practical Transitive Closure algorithm:

Allocate array of sequence numbers A [0..N], initialized so that A [i] = i

Phase I: connect linked vertices through vertex of smallest index

    For each edge (m, n):
        While A [n] != n:
            n = A [n]
        While A [m] != m:
            m = A [m]
        A [max (m, n)] = min (m, n)

Phase II: propagate smallest indices as cluster identifiers

    For each n from 0 to N:
        If A [n] != A [A [n]]:
            A [n] = A [A [n]]

Phase III: collect clusters. (Implementation dependent)

    Count number of distinct cluster “id”s => M (1 pass)
    Allocate array of sizes; count size of each cluster (1 pass)
    Allocate array of clusters; fill it in (1 pass)

Example (7 sequences, edges added in order):

    start      0 1 2 3 4 5 6
    +(1,3)     0 1 2 1 4 5 6
    +(5,6)     0 1 2 1 4 5 5
    +(6,1)     0 1 2 1 4 1 5
    Phase II   0 1 2 1 4 1 1

    Clusters: (0); (1,3,5,6); (2); (4)
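The three phases can be sketched in Python roughly as follows. This is a minimal sketch, not the speaker's implementation: the function name is hypothetical, and Phase III uses a dictionary instead of the three array passes described on the slide, since that phase is stated to be implementation dependent.

```python
from collections import defaultdict


def transitive_closure_clusters(n_nodes, edges):
    """Cluster nodes 0..n_nodes-1 connected by 'boolean' links (edges)."""
    a = list(range(n_nodes))  # A[i] = i: every node starts as its own cluster

    # Phase I: connect linked vertices through the vertex of smallest index.
    for m, n in edges:
        while a[n] != n:      # walk up to the current root of n
            n = a[n]
        while a[m] != m:      # walk up to the current root of m
            m = a[m]
        a[max(m, n)] = min(m, n)

    # Phase II: propagate smallest indices as cluster identifiers.
    # After Phase I every pointer goes to a smaller-or-equal index, so a
    # single ascending pass is enough: a[a[n]] is already final.
    for n in range(n_nodes):
        a[n] = a[a[n]]

    # Phase III: collect clusters (dictionary-based variant, an assumption).
    clusters = defaultdict(list)
    for node, cluster_id in enumerate(a):
        clusters[cluster_id].append(node)
    return a, [clusters[cid] for cid in sorted(clusters)]


labels, clusters = transitive_closure_clusters(7, [(1, 3), (5, 6), (6, 1)])
```

Run on the slide's example edges (1,3), (5,6), (6,1), the sketch ends with the same array as the last row of the trace and the same four clusters.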


Sequence clustering

Computing ‘boolean’ distances:

• Threshold-based

• Additional rules (match arrangement)

Example: read/EST clustering, using % identity + length + arrangement.
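A threshold-plus-arrangement rule of this kind might look like the sketch below. The slide does not give concrete rules, so the `Hit` fields, the threshold values, and the particular arrangement test (the match must reach, within a small overhang, an end of one of the two sequences on each side) are all assumptions chosen for illustration. Coordinates are taken to be 0-based and half-open.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise match; all field names are hypothetical."""
    identity: float  # percent identity over the aligned region, 0..100
    q_start: int     # match start on the query
    q_end: int       # match end on the query (exclusive)
    q_len: int       # full query length
    s_start: int     # match start on the subject
    s_end: int       # match end on the subject (exclusive)
    s_len: int       # full subject length


def is_linked(hit, min_identity=95.0, min_length=40, max_overhang=5):
    """Boolean distance: True if the two sequences should be linked."""
    aligned = min(hit.q_end - hit.q_start, hit.s_end - hit.s_start)
    if hit.identity < min_identity or aligned < min_length:
        return False
    # Arrangement rule (assumed): on each side of the match, at least one
    # of the two sequences must have (almost) no unaligned tail, as in an
    # overlap or a containment. A purely internal match is rejected.
    left_ok = min(hit.q_start, hit.s_start) <= max_overhang
    right_ok = min(hit.q_len - hit.q_end, hit.s_len - hit.s_end) <= max_overhang
    return left_ok and right_ok


overlap = Hit(98.0, 60, 100, 100, 0, 40, 100)    # end of query on start of subject
internal = Hit(98.0, 30, 70, 100, 30, 70, 100)   # match in the middle of both
```

With these assumed defaults, `is_linked(overlap)` accepts the pair, while `is_linked(internal)` rejects it even though identity and length pass. That is the role the arrangement rule plays next to the plain thresholds.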


Computing similarity measure:

- Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc.

- Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee

- K-mer statistics: CD-HIT, USEARCH, MUSCLE

- Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ

- Suffix arrays: Bowtie, BWT

- Position-specific scoring matrix: PSI-BLAST, IMPALA

- Hidden Markov Models: HMMER, HHsearch/HHpred, SAM
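As an illustration of the cheapest family in this list, k-mer statistics, here is a minimal sketch that scores two sequences by the Jaccard index of their k-mer sets. It is a generic textbook measure chosen for brevity, not the scoring actually used by CD-HIT, USEARCH or MUSCLE, and the function names are hypothetical.

```python
def kmers(seq, k):
    """Set of all overlapping k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k=3):
    """Shared k-mers divided by all distinct k-mers of the two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0  # nothing to compare; treated as dissimilar by convention
    return len(ka & kb) / len(ka | kb)


same = kmer_jaccard("ACGTAC", "ACGTAC")
unrelated = kmer_jaccard("AAAAAA", "CCCCCC")
shifted = kmer_jaccard("ACGT", "CGTA")  # k-mer sets {ACG, CGT} and {CGT, GTA}
```

No alignment is computed, only set operations on short words. This is why measures of this kind remain affordable on large data sets, at the cost of sensitivity.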


Sequence clustering

Computing distances is harder than clustering. (Bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)

- For large data sets only k-mer and suffix array measures are practical.

- However: incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.

- For boolean distance, iterative similarity detection is possible: fast binning -> slow comparing. (No off-the-shelf implementations(?))
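The incremental / greedy idea can be sketched as follows: each sequence is compared only with the current cluster representatives, never with every other sequence, so the full distance matrix is not computed. This is a generic sketch in the spirit of the bundled tools, not the algorithm of any one of them. The longest-first ordering and the `similar` callback (any boolean similarity test, possibly an expensive one) are assumptions.

```python
def greedy_cluster(seqs, similar):
    """Greedy incremental clustering.

    seqs    -- list of sequences
    similar -- callable (seq, representative_seq) -> bool, supplied by the caller
    Returns a list of clusters, each a list of indices into seqs; the first
    index in each cluster is its representative.
    """
    representatives = []  # index of the representative of each cluster
    members = []          # members[c] = indices pledged to cluster c

    # Longest sequences first, so representatives tend to cover their members.
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    for idx in order:
        for c, rep in enumerate(representatives):
            if similar(seqs[idx], seqs[rep]):
                members[c].append(idx)  # pledge to the first matching cluster
                break
        else:
            # No representative matched: this sequence founds a new cluster.
            representatives.append(idx)
            members.append([idx])
    return members
```

The number of calls to `similar` is at most the number of sequences times the number of clusters. When clusters are few relative to sequences, that is far fewer calls than the all-against-all comparison.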


Sequence clustering

Boolean distance clustering killer:

CLUSTER AGGREGATION.

In large clusters, even a small number of random links leads to huge conglomerates.
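A toy demonstration of the effect, under assumed data: two unrelated 50-member families, each internally linked as a chain, are merged into one 100-member conglomerate by a single spurious link. The helper below is a compact union-find written for this illustration only.

```python
def largest_cluster(n_nodes, edges):
    """Size of the largest connected cluster under boolean links."""
    root = list(range(n_nodes))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]  # path halving keeps the walks short
            x = root[x]
        return x

    for m, n in edges:
        rm, rn = find(m), find(n)
        root[max(rm, rn)] = min(rm, rn)

    sizes = {}
    for x in range(n_nodes):
        r = find(x)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values())


family_a = [(i, i + 1) for i in range(0, 49)]    # nodes 0..49
family_b = [(i, i + 1) for i in range(50, 99)]   # nodes 50..99
spurious = [(10, 75)]                            # one random cross-family link

before = largest_cluster(100, family_a + family_b)
after = largest_cluster(100, family_a + family_b + spurious)
```

Here `before` is 50 and `after` is 100: transitive closure has no notion of link strength, so one false edge is enough to fuse the two families.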


Common causes:

1) Contamination with standard constructs

2) Repeats

3) Chimeras

4) Spurious similarities (low-complexity zones, etc.)


Sequence clustering

Fighting aggregation

- Vector / adapter trimming:

    - Lucy, Figaro, etc. Integrated in many assembly suites (Newbler, Velvet, AMOS, CLCbio, etc.)

- Low complexity detection / masking:

    - SEG, DUST, FastQC, WindowMasker, etc.; often integrated in search tools.


Sequence clustering

- Repeat detection / masking:

    - Regular (tandem) repeats:

        - Pre-search masking: based on structure (IMEx, SRF) or on a database (TRDB)

        - Post-search detection based on similarity properties (multiple parallel threads)

    - Irregular (long) repeats:

        - Database based: RepeatMasker

        - De novo: RepeatScout, orrb, PILER, etc. Require genome as input, construct database.


Sequence clustering

Detecting chimeric sequences:

• Abundance-based: Perseus, UCHIME

    • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangement are more frequent.

• Specific to 16S: ChimeraSlayer, Bellerophon

    • Chimera ‘arms’ are closer to their originating phyla than the entire chimera is


Sequence clustering

Detecting chimeric sequences

• Similarity coverage based: Mira assembler


Sequence clustering

Detecting chimeric sequences

• Similarity graph topology based: dchim

[Figure panels: alignment view; connectivity view]


Protein clusters: various criteria

- Primary structure similarity

- Close evolutionary relationship

- Similarity in physical properties

- 3-D structure similarity

- Similar fold arrangement

- Domain structure similarity

- Common or similar functions

- etc.


Sequence clustering

Functional and structural classifications in IMG


Sequence clustering

A direct similarity measure by edit distance is not sensitive enough for evolutionarily distant species.

Position-specific scoring matrices and profile HMMs provide better sensitivity, but are MUCH SLOWER.

For individual genomes (10^3 to 5x10^4 proteins), they could be used with massively parallel computations (while the number of genomes is within thousands).

For metagenomes, they cannot be used with foreseeable computing resources.


Sequence clustering

Functional annotation of metagenome genes through protein clusters (under development):

- Build a set of functionally homogeneous clusters of similar proteins for annotated genomes

- Build HMMs for each cluster, compose a model database

- Pledge metagenome proteins to clusters by matching them to models

- Cluster unpledged proteins, build models, update the model database

- Balance the model database by creating a model tree: aggregating small related clusters and dissecting large ones

- Perform hierarchical searches through the profile tree


Sequence clustering

Clustering reduces the search space, but adds another level of indirection, which is a source of errors, and complexity, which consumes effort.

It improves only searches within the parameter space used for clustering (structure-based clusters are not useful when searching for a certain codon usage, etc.)


However, for proteins, which form dense relationship networks, clustering is a great tool.


Thank you!