Advancing Science with DNA Sequence

Sequence Clustering

MGM Workshop, September 26, 2011

Reducing Search Space in Protein and DNA/RNA Sequence Analysis

Denis Kaznadzey, GBP

Sequence clustering

To deal with a huge variety of individual ‘objects’:

- Classify into groups of essentially similar objects

- When new data arrives, assign objects to existing groups

- Classify ‘leftovers’

- Occasionally review the entire classification

Problem: What is ‘essentially similar’?

• Finding properties that are important (ontological relevancy)

• Does classification reflect reality in any way?

Sequence clustering

Taxonomical Classification vs. Continuity of the Great Chain of Being

Even if reductionist, classification is a tool to study the world, and biology in particular.

When data is incomplete, any classification is a convention. At the same time, it is an approximation of a “reality”.

[Portraits: Carl Linnaeus; Georges Buffon]

Sequence clustering

In modern biology, the most abundant type of data is sequence:

• Genomic DNA

• RNA (through RNA-Seq)

• Derived Proteins

The primary feature is primary structure, but classification criteria depend on the application.

Sequence Clustering

Select Applications in Genomic Sciences:

Genome Assembly: Binning, Scaffolding

Transcriptomics: EST (read) clustering

Protein Function and Evolution studies: Protein families

Phylogenetic profiling: OTUs

Sequence Clustering

In Metagenomics:

Primary tasks:

• Assess diversity

• Find genes

• Predict functions

• Predict pathways

• Estimate capabilities

Based on sequence comparison.

Sequence Clustering

- Any clustering is based on the distance in some metric.

- Initial clustering is based on pair-wise distances.

- Subsequent classification is based on distances from object to cluster:

  - Representative

  - Set of representatives (all members, at the extreme)

  - Other measure, which may be unrelated to the initial one.
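A minimal sketch of the first two object-to-cluster options, assuming a toy Hamming distance on equal-length strings and assuming that the first member of a cluster acts as its representative. The function names are illustrative and not taken from the slides.

```python
from statistics import mean

def hamming(a, b):
    """Toy distance: number of mismatching positions (equal-length strings)."""
    return sum(x != y for x, y in zip(a, b))

def dist_to_representative(seq, cluster, dist=hamming):
    # Assumption: cluster[0] is the representative of the cluster.
    return dist(seq, cluster[0])

def dist_to_members(seq, cluster, dist=hamming, combine=min):
    # combine=min gives single linkage, max complete linkage,
    # mean average linkage over all members (the "extreme" case).
    return combine([dist(seq, member) for member in cluster])

cluster = ["ACGT", "ACGA", "TCGA"]
print(dist_to_representative("TCGA", cluster))      # distance to "ACGT" only
print(dist_to_members("TCGA", cluster, combine=min))
print(dist_to_members("TCGA", cluster, combine=mean))
```

Comparing against one representative costs one distance computation per cluster, while comparing against all members costs one per sequence, which is the trade-off behind the list above.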

Sequence Clustering

Once a distance measure is chosen and the distances are obtained / computed:

• There is a HUGE variety of clustering methods (clustering / classification is a very elaborate methodology)

  • K-means, average linkage, complete linkage, single linkage, iterative, SOM, etc.

• However, options for large-volume clustering are limited by the performance of the algorithms.

• Single linkage can be computed very efficiently

• (The method for pledging new sequences to clusters may be computationally more intensive)

Sequence clustering

Most efficient clustering: transitive-closure based.

• Requires ‘boolean’ distances (two sequences are either linked or not linked)

• Requires the number of nodes to be known

• Space ~ NodesNo

• Run-time (worst) ~ EdgesNo * AveClustSize

• Run-time (average) ~ EdgesNo * log2(AveClustSize)

Sequence clustering

Practical Transitive Closure algorithm:

Allocate array of sequence numbers A[0..N-1] for N sequences, initialized so that A[i] = i

Phase I: connect linked vertices through the vertex of smallest index

    For each edge (m, n):

        While A[n] != n:

            n = A[n]

        While A[m] != m:

            m = A[m]

        A[max(m, n)] = min(m, n)

Phase II: propagate smallest indices as cluster identifiers

    For each n from 0 to N-1:

        If A[n] != A[A[n]]:

            A[n] = A[A[n]]

Phase III: collect clusters (implementation dependent)

    Count number of distinct cluster “id”s => M (1 pass)

    Allocate array of sizes; count size of each cluster (1 pass)

    Allocate array of clusters; fill it in (1 pass)

Example (7 sequences, state of A after each step):

    Initial:         0 1 2 3 4 5 6

    +(1,3):          0 1 2 1 4 5 6

    +(5,6):          0 1 2 1 4 5 5

    +(6,1):          0 1 2 1 4 1 5

    After Phase II:  0 1 2 1 4 1 1

Clusters: (0); (1,3,5,6); (2); (4)
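The three phases can be sketched in Python as follows. This is one possible rendering of the pseudocode, not the author's implementation; the function name and the dictionary used in Phase III are assumptions made for brevity (the slide describes a three-pass array version).

```python
def transitive_closure_clusters(n_nodes, edges):
    """Cluster nodes 0..n_nodes-1 given a list of (m, n) links."""
    a = list(range(n_nodes))          # a[i] == i: every node is its own root

    # Phase I: connect linked vertices through the vertex of smallest index.
    for m, n in edges:
        while a[n] != n:              # walk up to the root of n
            n = a[n]
        while a[m] != m:              # walk up to the root of m
            m = a[m]
        a[max(m, n)] = min(m, n)      # larger root now points to smaller root

    # Phase II: parents always have smaller indices, so one ascending pass
    # leaves every node pointing directly at its cluster identifier.
    for n in range(n_nodes):
        a[n] = a[a[n]]

    # Phase III: collect clusters (here with a dict instead of three passes).
    clusters = {}
    for node, cluster_id in enumerate(a):
        clusters.setdefault(cluster_id, []).append(node)
    return a, list(clusters.values())

ids, clusters = transitive_closure_clusters(7, [(1, 3), (5, 6), (6, 1)])
print(ids)       # [0, 1, 2, 1, 4, 1, 1]
print(clusters)  # [[0], [1, 3, 5, 6], [2], [4]]
```

The printed values reproduce the worked example above: the same three edges and the same final array.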

Sequence clustering

Computing ‘boolean’ distances:

• Threshold-based

• Additional rules (match arrangement)

Example (read/EST clustering): % identity + length + arrangement
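A threshold-based link test might look like the sketch below. All threshold values are hypothetical, and the 'arrangement' rule is approximated here by requiring the match to cover most of the shorter sequence; the actual arrangement rules used for read/EST clustering are not given in the slides.

```python
def is_linked(identity, aligned_len, len_a, len_b,
              min_identity=0.95, min_len=40, min_coverage=0.8):
    """Return True if two sequences count as linked (boolean distance).

    identity    -- fraction of identical positions in the alignment (0..1)
    aligned_len -- length of the aligned region
    len_a/len_b -- full lengths of the two sequences
    The three min_* thresholds are illustrative assumptions.
    """
    shorter = min(len_a, len_b)
    return (identity >= min_identity
            and aligned_len >= min_len
            and aligned_len / shorter >= min_coverage)

print(is_linked(0.98, 100, 120, 300))  # long, near-identical, well covered
print(is_linked(0.98, 50, 200, 300))   # good identity but a short local match
```

The resulting True/False values are exactly the edges that a transitive-closure clustering consumes.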

Computing similarity measure:

- Edit distance or (ungapped) statistics P-value: BLAST, FASTA, needle, water, etc.

- Adjusted edit distance through progressive alignment: Clustal, MUSCLE, T-Coffee

- K-mer statistics: CD-HIT, USEARCH, MUSCLE

- Suffix trees (and probabilistic suffix trees): MUMmer, REPuter, CLUSEQ

- Suffix arrays: Bowtie, BWT

- Position-specific scoring matrix: PSI-BLAST, IMPALA

- Hidden Markov Models: HMMER, HHsearch/HHpred, SAM
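As an illustration of the k-mer statistics entry, the sketch below computes a Jaccard index over k-mer sets. This is a generic, simplified measure chosen for brevity; it is an assumption and not the statistic implemented by CD-HIT, USEARCH, or MUSCLE.

```python
def kmers(seq, k=3):
    """Set of all overlapping substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_jaccard(a, b, k=3):
    """Shared k-mers divided by total distinct k-mers (0.0 .. 1.0)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 0.0                    # both sequences shorter than k
    return len(ka & kb) / len(ka | kb)

print(kmer_jaccard("ACGTACGT", "ACGTACGT"))  # identical sequences
print(kmer_jaccard("AAAA", "CCCC"))          # nothing in common
```

No alignment is computed, which is why measures of this kind scale to large data sets.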

Sequence clustering

Computing distances is harder than clustering. (Bundled solutions: BLASTCLUST, CD-HIT, UCLUST, CLUSEQ)

- For large data sets, only k-mer and suffix array measures are practical.

- However, incremental / greedy approaches can be used to avoid computing the entire distance matrix. This makes the use of sensitive similarity measures possible.

- For boolean distance, iterative similarity detection is possible: fast binning -> slow comparing. (No off-the-shelf implementations(?))
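The greedy idea and the 'fast binning -> slow comparing' idea can be combined in a small sketch. It assumes shared k-mer counts as the fast filter and Python's `difflib.SequenceMatcher` ratio as a stand-in for a sensitive (slow) comparison; the thresholds and the longest-first ordering are illustrative choices, not something the slides prescribe.

```python
from difflib import SequenceMatcher

def kmer_set(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def greedy_cluster(seqs, k=4, min_shared=2, min_ratio=0.85):
    """Assign each sequence to the first matching representative.

    Returns clusters as lists of indices into seqs; the first index in each
    list is the representative. Only sequence-vs-representative comparisons
    are made, so the full distance matrix is never computed.
    """
    reps = []       # (representative sequence, its k-mer set)
    clusters = []   # member indices, parallel to reps
    order = sorted(range(len(seqs)), key=lambda i: -len(seqs[i]))
    for i in order:
        s, ks = seqs[i], kmer_set(seqs[i], k)
        for c, (rep, rep_kmers) in enumerate(reps):
            if len(ks & rep_kmers) < min_shared:
                continue              # fast binning: skip unrelated clusters
            if SequenceMatcher(None, rep, s).ratio() >= min_ratio:
                clusters[c].append(i) # slow comparison confirmed the link
                break
        else:
            reps.append((s, ks))      # no match: start a new cluster
            clusters.append([i])
    return clusters

seqs = ["ACGTACGTACGT", "ACGTACGTACGA", "TTTTGGGGCCCC", "ACGTACGTACG"]
print(greedy_cluster(seqs))
```

The slow comparison runs only for pairs that pass the k-mer filter, which is what makes a sensitive measure affordable.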

Sequence clustering

Boolean distance clustering killer:

CLUSTER AGGREGATION.

In large clusters, even a small number of random links leads to huge conglomerates.
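A toy demonstration of the effect, assuming two unrelated 50-member families and a single spurious link between them. The data and the helper name `cluster_sizes` are made up for illustration.

```python
def cluster_sizes(n_nodes, edges):
    """Single-linkage (transitive closure) cluster sizes, largest first."""
    a = list(range(n_nodes))
    for m, n in edges:
        while a[n] != n:
            n = a[n]
        while a[m] != m:
            m = a[m]
        a[max(m, n)] = min(m, n)      # merge through the smaller root
    for n in range(n_nodes):
        a[n] = a[a[n]]                # propagate cluster identifiers
    sizes = {}
    for cluster_id in a:
        sizes[cluster_id] = sizes.get(cluster_id, 0) + 1
    return sorted(sizes.values(), reverse=True)

family_a = [(i, i + 1) for i in range(0, 49)]    # nodes 0..49
family_b = [(i, i + 1) for i in range(50, 99)]   # nodes 50..99
spurious = [(10, 75)]                            # e.g. a shared adapter

print(cluster_sizes(100, family_a + family_b))             # [50, 50]
print(cluster_sizes(100, family_a + family_b + spurious))  # [100]
```

One false edge out of 99 is enough to fuse both families, because boolean linkage has no notion of how weak a connection is.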

Common causes:

1) Contamination with standard constructs

2) Repeats

3) Chimeras

4) Spurious similarities (low-complexity zones, etc.)

Sequence clustering

Fighting aggregation

- Vector / adapter trimming:

  - Lucy, Figaro, etc. Integrated in many assembly suites (Newbler, Velvet, AMOS, CLCbio, etc.)

- Low-complexity detection / masking:

  - SEG, DUST, FastQC, WindowMasker, etc.; often integrated in search tools.
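To show what low-complexity masking does, here is a simplified sketch based on Shannon entropy in a sliding window. It is not the SEG or DUST algorithm; the window size, entropy threshold, and mask character are assumptions chosen for the example.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Entropy in bits of the symbol composition of a string."""
    counts = Counter(window)
    total = len(window)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=1.0, mask_char="N"):
    """Replace every window whose entropy is below min_entropy."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)

seq = "ACGTGCATGCTA" + "A" * 12 + "ACGTGCATGCTA"
print(mask_low_complexity(seq))
```

Windows that only partly overlap the poly-A run can also fall below the threshold, so a few flanking bases are masked as well; real tools handle region boundaries more carefully.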

Sequence clustering

- Repeat detection / masking:

  - Regular (tandem) repeats:

    - Pre-search masking: based on structure (IMEx, SRF) or on a database (TRDB)

    - Post-search detection based on similarity properties (multiple parallel threads)

  - Irregular (long) repeats:

    - Database-based: RepeatMasker

    - De novo: RepeatScout, orrb, PILER, etc. Require a genome as input and construct a database.

Sequence clustering

Detecting chimeric sequences:

• Abundance-based: Perseus, UCHIME

  • Chimeras undergo fewer amplification cycles, so chimera segments in their native arrangement are more frequent.

• Specific to 16S: ChimeraSlayer, Bellerophon

  • Chimera ‘arms’ are closer to their originating phyla than the entire chimera is.

Sequence clustering

Detecting chimeric sequences:

• Similarity coverage based: Mira assembler

Sequence clustering

Detecting chimeric sequences:

• Similarity graph topology based: dchim

[Figure panels: alignment view; connectivity view]

Protein Clusters: various criteria

- Primary structure similarity

- Close evolutionary relationship

- Similarity in physical properties

- 3-D structure similarity

- Similar fold arrangement

- Domain structure similarity

- Common or similar functions

- etc.

Sequence clustering

Functional and structural classifications in IMG

Sequence clustering

Direct similarity measurement by edit distance is not sensitive enough for evolutionarily distant species.

Position-specific scoring matrices and profile HMMs provide better sensitivity, but are MUCH SLOWER.

For individual genomes (10^3 to 5x10^4 proteins), they could be used with massively parallel computations (while the number of genomes is within thousands).

For metagenomes, they cannot be used with foreseeable computing resources.

Sequence clustering

Functional annotation of metagenome genes through protein clusters (under development):

- Build a set of functionally homogeneous clusters of similar proteins for annotated genomes

- Build HMMs for each cluster, compose model database

- Pledge metagenome proteins to clusters by matching to models

- Cluster unpledged proteins, build models, update model database.

- Balance model database by creating model tree: aggregating small relative clusters and dissecting large ones.

- Perform hierarchical searches through profiles tree.

Sequence clustering

Clustering reduces the search space, but it adds another level of indirection, which is a source of errors, and complexity, which consumes effort.

It improves only searches within the parameter space used for clustering (structure-based clusters are not useful when searching for a certain codon usage, etc.)

However, for proteins, which form dense relationship networks, clustering is a great tool.

Thank you!