Clustering Genes: W-curve + TSP

Douglas Cork, Steven Lembark

Upload: workhorse-computing

Post on 30-Jun-2015


Page 1: Clustering Genes: W-curve + TSP

Douglas Cork, Steven Lembark

Page 2: Clustering Genes: W-curve + TSP

HIV-1, W-curves, & Shoe Leather

● Existing genetics tools fail on HIV-1.
● They make assumptions based on “normal” DNA that fail on HIV, cancer, or plants.
● Correlation tools look at evolution, not state.

● We are working on tools for clinical analysis.
● The W-curve abstracts DNA into geometry.
● The TSP clusters genes rather than trying to impute inheritance.

Page 3: Clustering Genes: W-curve + TSP

Sequences Inform Treatment

● Treating HIV requires sequencing it to choose appropriate drugs:
● HIV-1 evolves drug resistance in months.
● Multiple strains in a single patient are common, either from multiple sources or from evolution.
● Crossover recombination is relatively common due to cross-infected cells.

Page 4: Clustering Genes: W-curve + TSP

Problem: HIV is Hard to Analyze

● HIV is a non-correcting retrovirus.
● Evolves 10,000 times faster than humans or influenza: one new strain per patient per day.
● Genomes for wild types range from 8349 to 9829 bases, making localized comparisons difficult.

● The single FDA-approved algorithm directing treatment from sequence handles only type B; the U.S. Army has 15%+ non-B infections.

Page 5: Clustering Genes: W-curve + TSP

The Current Tools

● BLAST, FASTA, ClustalW perform alignment.
● Table-driven analysis of base transitions.
● Score the entire sequence with a single value.

● Graphical tools are designed to display inheritance rather than state.
● Output is difficult to read in a clinical setting.

Page 6: Clustering Genes: W-curve + TSP

Phenogram of Drug-Resistant and Random Samples

● Tries to show ancestry, not state.

● Not very good for visual identification of which patients are drug resistant.

Page 7: Clustering Genes: W-curve + TSP

Trees are not particularly helpful either.

Page 8: Clustering Genes: W-curve + TSP

ClustalW of gp120

● Difficult to compare sequences visually.

● Not useful for large numbers of sequences.

● Gaps make analysis difficult.

HIVHXB2CG        TGATCTGTAGTGCTACAGAAAAATTGTGGGTCACAGTCTATTATGGGGTACCTGTGTGGA
AY736838-gp120_  -------------------------------TACAGTTTATTATGGGGTGCCTGTGTGGA

HIVHXB2CG        AGGAAGCAACCACCACTCTATTTTGTGCATCAGATGCTAAAGCATATGATACAGAGGTAC
AY736838-gp120_  GAGATGCAGATACCACCCTATTTTGTGCATCAGATGCCAAGGCACATGAGACAGAAGTGC

HIVHXB2CG        ATAATGTTTGGGCCACACATGCCTGTGTACCCACAGACCCCAACCCACAAGAAGTAGTAT
AY736838-gp120_  ACAATGTCTGGGCCACACATGCCTGTGTACCCACAGACCCCAACCCACAAGAAATACACC

HIVHXB2CG        TGGTAAATGTGACAGAAAATTTTAACATGTGGAAAAATGACATGGTAGAACAGATGCATG
AY736838-gp120_  TGGAAAATGTAACAGAAAATTTTAACATGTGGAAAAATAACATGGTAGAGCAGATGCAGG

HIVHXB2CG        AGGATATAATCAGTTTATGGGATCAAAGCCTAAAGCCATGTGTAAAATTAACCCCACTCT
AY736838-gp120_  AGGATGTAATCAGTTTATGGGATCAAAGTCTAAAGCCATGTGTAAAGTTAACTCCTCTCT

HIVHXB2CG        GTGTTAGTTTAAAGTGCAC------TGATTTGAAGAATGATACTAATACCAATAGTAGTA
AY736838-gp120_  GCGTTACTTTAAATTGTACCAATGCTAATTTGACCAATGGCAGTAGCAAAACCAATGTCT

HIVHXB2CG        GCGGGAGAATGATAATGGAGAAAGGAGAGATAAAAAACTGCTCTTTCAATATCAGCACAA
AY736838-gp120_  CTAACATAATAGGAAATATAACAGATGAAGTAAGAAACTGTACTTTTAATATGACCACAG

HIVHXB2CG        GCATAAGAGGTAAGGTGCAGAAAGAATATGCATTTTTTTATAAACTTGATATAATACCAA
AY736838-gp120_  AACTAACAGATAAGAAGCAGAAGGTCCATGCACTCTTTTATAAGCTTGATATAGTACAAA

HIVHXB2CG        T---AGATAATGATACTACCAGC---TATAAGTTGACAAGTTGTAACACCTCAGTCATTA
AY736838-gp120_  TTGAAGATAAGAAGAATAGTAGTGAGTATAGGTTAATAAATTGTAATACTTCAGTCATTA

HIVHXB2CG        CACAGGCCTGTCCAAAGGTATCCTTTGAGCCAATTCCCATACATTATTGTGCCCCGGCTG
AY736838-gp120_  AGCAGGCTTGTCCAAAGATATCCTTTGATCCAATTCCTATACATTATTGTACTCCAGCTG

HIVHXB2CG        GTTTTGCGATTCTAAAATGTAATAATAAGACGTTCAATGGAACAGGACCATGTACAAATG
AY736838-gp120_  GTTATGCGATTTTAAAGTGTAATGATAAGAATTTCAATGGGACAGGGCCATGTAAAAATG

HIVHXB2CG        TCAGCACAGTACAATGTACACATGGAATTAGGCCAGTAGTATCAACTCAACTGCTGTTAA
AY736838-gp120_  TCAGCTCAGTACAATGCACACATGGAATTAAGCCAGTGGTATCAACTCAATTGCTGTTAA

HIVHXB2CG        ATGGCAGTCTAGCAGAAGAAGAGGTAGTAATTAGATCTGTCAATTTCACGGACAATGCTA
AY736838-gp120_  ATGGCAGTCTAGCAGAAGAAGAGATAATAATCAGATCTGAAGATCTCACAAACAATGCCA

HIVHXB2CG        AAACCATAATAGTACAGCTGAACACATCTGTAGAAATTAATTGTACAAGACCCAACAACA
AY736838-gp120_  AAACCATAATAGTGCACCTTAATAAATCTGTAGAAATCAATTGTACCAGACCCTCCAACA

HIVHXB2CG        ATACAAGAAAAAGAATCCGTATCCAGAGAGGACCAGGGAGAGCATTTGTTACAATAGGAA
AY736838-gp120_  ATACAAGAACAAGTATAACTAT------AGGACCAGGACGAGTATTCTATAGAACAGGAG

HIVHXB2CG        A---AATAGGAAATATGAGACAAGCACATTGTAACATTAGTAGAGCAAAATGGAATAACA
AY736838-gp120_  ATATAATAGGAAATATAAGAAAAGCATATTGTGAGATTAATGGAACAAAATGGAATAAAG

HIVHXB2CG        CTTTAAAACAGATAGCTAGCAAATTAAGAGAACAATTTGGAAATAATAAAACAATAATCT
AY736838-gp120_  TTTTAAAACAGGTAACTGAAAAATTAAAAGAGCACTTT------AATAAGACAATAATCT

HIVHXB2CG        TTAAGCAATCCTCAGGAGGGGACCCAGAAATTGTAACGCACAGTTTTAATTGTGGAGGGG
AY736838-gp120_  TTCAACCACCCTCAGGAGGAGATCTAGAAATTACAATGCATCATTTTAATTGTAGAGGGG

HIVHXB2CG        AATTTTTCTACTGTAATTCAACACAACTGTTTAATAGTACTTGGTTTAATAGTACTTGGA
AY736838-gp120_  AATTTTTCTATTGCAATACAACAAAACTGTTTAATAATATTTGCCTAGGAAATG---AAA

HIVHXB2CG        GTACTGAAGGGTCAAATAACACTGAAGGAAGTGACACAATCACCCTCCCATGCAGAATAA
AY736838-gp120_  CCATGGCGGGGTGTAATGACACT---------------ATCACACTTCCATGCAAGATAA

Page 9: Clustering Genes: W-curve + TSP

New Tools

● Clinical vs. evolutionary.
● Avoid assumptions that break current tools.
● Suitable for a repeatable process in clinics or data mining in research.
● We are using:

● W-curve for analysis.
● TSP for clustering.
● R for data management & display.

Page 10: Clustering Genes: W-curve + TSP

W-curve

● Geometric abstraction of DNA.
● Generated by a simple state machine.
● Geometry allows alignment at a finer scale than character strings.
● Avoids assumptions about transition probabilities by taking the figure as-is.

Page 11: Clustering Genes: W-curve + TSP

W-curve Generator is a State Machine

● C, A, T, G are assigned to corners of a square.
● Successive points move halfway to the next base's corner.
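The state machine described on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the particular corner assignment in `CORNERS`, the starting point at the origin, and the function name `w_curve` are hypothetical. The slides state only that the four bases sit at the corners of a square, that each point moves halfway toward the next base's corner, and (on the “CG” example slide) that Z advances in single steps.

```python
# Hypothetical corner layout; the slides do not say which base sits on which corner.
CORNERS = {
    "A": (-1.0, 1.0),
    "C": (1.0, 1.0),
    "G": (1.0, -1.0),
    "T": (-1.0, -1.0),
}


def w_curve(sequence):
    """Return a list of (x, y, z) points, one per base in the sequence."""
    x, y = 0.0, 0.0  # assumed starting point: the centre of the square
    points = []
    for z, base in enumerate(sequence.upper(), start=1):
        corner_x, corner_y = CORNERS[base]
        # Move halfway from the current point to the corner for this base.
        x = (x + corner_x) / 2.0
        y = (y + corner_y) / 2.0
        points.append((x, y, float(z)))  # Z advances one step per base
    return points


cg_points = w_curve("CG")
print(cg_points)
```

With this layout the two points for “CG” are halfway to C, then halfway from there to G.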

Page 12: Clustering Genes: W-curve + TSP

W-curve for “CG”

● Curve shown in blue.

● Halfway to C, then G, in X-Y; single steps in Z.

● Cylindrical storage simplifies comparison.
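Cylindrical storage can be illustrated with a short conversion from (x, y, z) to (r, θ, z). This is only a sketch: the function name `to_cylindrical` is hypothetical, and the sample points assume one possible corner layout (C at (1, 1), G at (1, -1)), which the slides do not specify.

```python
import math


def to_cylindrical(points):
    """Convert (x, y, z) points to (r, theta, z), with theta in radians."""
    return [(math.hypot(x, y), math.atan2(y, x), z) for x, y, z in points]


# Points for "CG" under the assumed corner layout: halfway to C, then to G.
cg_xyz = [(0.5, 0.5, 1.0), (0.75, -0.25, 2.0)]
cg_cyl = to_cylindrical(cg_xyz)
for r, theta, z in cg_cyl:
    print(f"r={r:.4f} theta={theta:.4f} z={z}")
```

Stored this way, each point carries its distance from the axis and its angle around it, so two curves can be compared at the same Z on radius and angle separately.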

Page 13: Clustering Genes: W-curve + TSP

W-curve of Wild HIV-1 POL Gene

Page 14: Clustering Genes: W-curve + TSP

W-curves of Wild & Drug Resistant Pol

Page 15: Clustering Genes: W-curve + TSP

Detail of Wild & Drug Resistant Pol

Page 16: Clustering Genes: W-curve + TSP

Distance Metric

● Bases are arranged in a square to minimize the effects of SNPs.

● Synonymous SNPs are usually in the same quadrant.

● Points within the same quadrant have a small difference; opposite quadrants get a larger one.
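A small numerical sketch shows how the square layout shapes the difference between two curves. It assumes plain Euclidean distance in the X-Y plane between points at the same position, and a hypothetical corner layout; the slides do not give the exact metric or the layout, so `CORNERS`, `curve_xy` and `pointwise_distances` are illustrative names only.

```python
import math

# Hypothetical corner layout; not taken from the slides.
CORNERS = {
    "A": (-1.0, 1.0),
    "C": (1.0, 1.0),
    "G": (1.0, -1.0),
    "T": (-1.0, -1.0),
}


def curve_xy(sequence):
    """X-Y points of a curve built by moving halfway to each base's corner."""
    x, y = 0.0, 0.0
    points = []
    for base in sequence:
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2.0, (y + corner_y) / 2.0
        points.append((x, y))
    return points


def pointwise_distances(seq_a, seq_b):
    """Euclidean distance between the two curves at each shared position."""
    return [
        math.hypot(ax - bx, ay - by)
        for (ax, ay), (bx, by) in zip(curve_xy(seq_a), curve_xy(seq_b))
    ]


adjacent = pointwise_distances("C", "G")[0]   # corners sharing a side
opposite = pointwise_distances("C", "T")[0]   # diagonally opposite corners
decay = pointwise_distances("CAAA", "GAAA")   # one substitution, then identical bases
print(adjacent, opposite, decay)
```

Under these assumptions a single substitution produces a difference that halves with every identical base that follows, and a substitution between opposite corners starts out larger than one between adjacent corners.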

Page 17: Clustering Genes: W-curve + TSP

Comparison Produces “Chunks”

● Comparison yields a list of chunks.
● Curves are aligned within the chunk.
● Summing chunks gives a single value for two curves.
● Analyzing them in detail allows mining local similarities and variations.
● Grouping allows examination of crossover-recombination events.
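The idea of chunks and a summed score can be sketched as follows. The slides do not say how chunks are delimited or how curves are aligned within them, so everything here is hypothetical: the threshold rule, the `tolerance` parameter, the dictionary layout, and the sample distances. The sketch also ignores insertions and deletions, which a real comparison would have to handle.

```python
def chunk_differences(distances, tolerance=0.1):
    """Group consecutive positions into runs that are all similar or all different.

    `distances` holds one curve-to-curve distance per position; a position
    counts as similar when its distance is at most `tolerance` (an assumed rule).
    """
    chunks = []
    for index, distance in enumerate(distances):
        similar = distance <= tolerance
        if chunks and chunks[-1]["similar"] == similar:
            chunks[-1]["length"] += 1
            chunks[-1]["total"] += distance
        else:
            chunks.append(
                {"start": index, "length": 1, "similar": similar, "total": distance}
            )
    return chunks


def total_score(chunks):
    """Sum the per-chunk totals into a single value for the pair of curves."""
    return sum(chunk["total"] for chunk in chunks)


# Made-up per-position distances, for illustration only.
sample = [0.0, 0.0, 1.0, 0.5, 0.0]
sample_chunks = chunk_differences(sample)
for chunk in sample_chunks:
    print(chunk)
print("score:", total_score(sample_chunks))
```

Keeping the chunk list rather than only the total is what allows local inspection: a long similar chunk followed by a long different one is the kind of pattern that would be examined for a crossover event.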

Page 18: Clustering Genes: W-curve + TSP

Clustering: Traveling Salesman Problem

● The TSP is simple to describe, hard to solve:
● Starting and finishing in the same city.
● Visit a list of cities once each.
● Minimize the distance (cost).

● Optimal solutions will cluster the nearby cities.
● The problem was always in defining the clusters.
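To make the problem statement concrete, here is a brute-force solver for a tiny distance matrix. It is a sketch for illustration only: exhaustive search is feasible for a handful of cities and nothing more, the slides do not say which solver the authors use, and the four-city matrix is made up.

```python
from itertools import permutations


def tour_length(tour, dist):
    """Length of a closed tour that returns to its starting city."""
    return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))


def shortest_tour(dist):
    """Try every ordering that starts at city 0 and keep the shortest."""
    others = range(1, len(dist))
    candidates = ((0,) + rest for rest in permutations(others))
    return min(candidates, key=lambda tour: tour_length(tour, dist))


# Made-up symmetric distances: cities 0 and 1 are close, as are 2 and 3.
DIST = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]

best = shortest_tour(DIST)
best_length = tour_length(best, DIST)
print(best, best_length)
```

The shortest tour visits the two close pairs back to back, which is the sense in which an optimal tour clusters nearby cities. What the tour alone does not say is where one cluster ends and the next begins.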

Page 19: Clustering Genes: W-curve + TSP

Take a Walk and Cluster Your Genes

● Climer & Zhang, 2004.
● Method for detecting N clusters:

● Add N dummy cities to the distance map.
● Each one has the same, small distance to all other cities (we use 2-20).
● Dummy cities end up in the inter-cluster gaps.

● The process is trivial to implement: just add that many rows and columns to the original comparison matrix.
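Adding the rows and columns, then cutting the tour at the dummy cities, might look like the sketch below. Several things are assumptions rather than details from the slides: the function names, the brute-force solver (usable only for toy inputs), the made-up four-city matrix, and in particular the large dummy-to-dummy distance, which is chosen here so that the dummies do not sit next to each other in the tour.

```python
from itertools import permutations


def add_dummy_cities(dist, n_dummies, dummy_distance, dummy_to_dummy):
    """Return a copy of dist with n_dummies extra rows and columns appended."""
    n_real = len(dist)
    size = n_real + n_dummies
    out = [[0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                continue
            if i < n_real and j < n_real:
                out[i][j] = dist[i][j]          # original comparison value
            elif i >= n_real and j >= n_real:
                out[i][j] = dummy_to_dummy      # assumed: keep dummies apart
            else:
                out[i][j] = dummy_distance      # same small distance to every real city
    return out


def shortest_tour(dist):
    """Exhaustive search; only practical for a handful of cities."""
    def length(tour):
        return sum(dist[a][b] for a, b in zip(tour, tour[1:] + tour[:1]))
    return min(((0,) + rest for rest in permutations(range(1, len(dist)))), key=length)


def split_at_dummies(tour, n_real):
    """Cut the circular tour at each dummy city and return the real-city runs."""
    first_dummy = next(i for i, city in enumerate(tour) if city >= n_real)
    rotated = tour[first_dummy + 1:] + tour[:first_dummy + 1]  # now ends on a dummy
    clusters, current = [], []
    for city in rotated:
        if city >= n_real:
            if current:
                clusters.append(current)
            current = []
        else:
            current.append(city)
    return clusters


# Made-up distances: cities 0 and 1 are close, as are 2 and 3.
REAL = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]
augmented = add_dummy_cities(REAL, n_dummies=2, dummy_distance=2, dummy_to_dummy=1000)
tour = shortest_tour(augmented)
clusters = split_at_dummies(tour, n_real=len(REAL))
print(tour, clusters)
```

In this toy case the cheapest place for each dummy is the expensive hop between the two pairs, so cutting the tour at the dummies recovers the two clusters.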

Page 20: Clustering Genes: W-curve + TSP

Displaying the Tour

● Mapping the tour onto a circle gives a good view of the distances.

● Coloring simplifies inspection.
● Black dots for dummy cities.
● Single type at the top (e.g. wild type).
● Color successive data points using the “rainbow” sequence with a large number of colors.
● Sequences more alike get more similar colors.
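The layout rules on this slide can be sketched with the standard library alone. The authors use R for display, so this Python version is only an illustration; the function name `layout_on_circle`, the clockwise direction, and the hue spacing are assumptions, and the matrix and tour are made-up toy data with two dummy cities (4 and 5).

```python
import colorsys
import math


def layout_on_circle(tour, dist, n_real, top_city=0):
    """Place tour cities on a unit circle, spaced by distance travelled.

    Returns (city, x, y, rgb) tuples. Dummy cities (index >= n_real) are
    black; real cities take successive rainbow hues in tour order.
    """
    start = tour.index(top_city)
    ordered = list(tour[start:]) + list(tour[:start])  # top_city comes first
    hops = [dist[a][b] for a, b in zip(ordered, ordered[1:] + ordered[:1])]
    total = sum(hops)
    placed, travelled, real_seen = [], 0.0, 0
    for city, hop in zip(ordered, hops):
        # Start at the top of the circle and move clockwise.
        angle = math.pi / 2 - 2 * math.pi * travelled / total
        if city >= n_real:
            rgb = (0.0, 0.0, 0.0)
        else:
            rgb = colorsys.hsv_to_rgb(real_seen / n_real, 1.0, 1.0)
            real_seen += 1
        placed.append((city, math.cos(angle), math.sin(angle), rgb))
        travelled += hop
    return placed


# Toy data: four real cities plus dummy cities 4 and 5.
DIST = [
    [0, 1, 10, 10, 2, 2],
    [1, 0, 10, 10, 2, 2],
    [10, 10, 0, 1, 2, 2],
    [10, 10, 1, 0, 2, 2],
    [2, 2, 2, 2, 0, 1000],
    [2, 2, 2, 2, 1000, 0],
]
TOUR = [0, 1, 4, 2, 3, 5]

layout = layout_on_circle(TOUR, DIST, n_real=4)
for city, x, y, rgb in layout:
    print(city, round(x, 3), round(y, 3), rgb)
```

Because hues are assigned in tour order, neighbours in the tour get neighbouring colors, and the arc between two points is proportional to the distance between them.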

Page 21: Clustering Genes: W-curve + TSP
Page 22: Clustering Genes: W-curve + TSP
Page 23: Clustering Genes: W-curve + TSP

Example with 8 D-R, 100 Samples

Page 24: Clustering Genes: W-curve + TSP

Multiple uses for color sequence.

● Track individual over time.
● Progression through colors shows history.
● Clustering highlights progression towards drug resistance.

● Track sample population.
● Recycling the colors from one initial tour helps show changes in successive graphs.
● Simplifies tracking progression in anonymous populations found in HIV treatment centers.

Page 25: Clustering Genes: W-curve + TSP

Visualizing W-curves

● We use a WebGL-based package, “WebCurve”.
● Developed at IIT as a web-friendly solution for examining 3D geometry.
● Gracefully handles displaying 100+ sequences at 10K bases each on a notebook computer.
● Available from GitHub; the archive includes a web server and code to generate files for display.

Page 26: Clustering Genes: W-curve + TSP

Summary

● W-curve and TSP allow us to cluster genes.
● Provides a more useful output in a clinical setting.
● Color coding the TSP results allows tracking changes in a population or the progression of an individual over time.