all kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes

All kmers are not created equal: finding the signal from the noise in large-scale metagenomes. Will Trimble, metagenomic annotation group, Argonne National Laboratory. BEACON seminar, April 23, 2014, MSU

Upload: wltrimbl

Post on 22-Jun-2015


DESCRIPTION

Talk by Will Trimble of Argonne National Laboratory, on April 23, 2014, at MSU's BEACON Center for the Study of Evolution in Action on visualizing and interpreting the redundancy spectrum of long kmers in high-throughput sequence data.

TRANSCRIPT

Page 1: All kmers are not created equal: recognizing the signal from the noise in large-scale metagenomes

All kmers are not created equal: finding the signal from the noise in large-scale metagenomes.

Will Trimble, metagenomic annotation group, Argonne National Laboratory

BEACON seminar, April 23, 2014, MSU


Page 3

Apology: I speak biology with an accent

• I spent six years in dark rooms with lasers
• Now I use computers to analyze high-throughput sequence data.
• I introduce myself as an applied mathematician.
• Finding scoring functions to answer questions with ambiguous data
• Shoveling data from the data-producing machine into the data-consuming furnace.


Page 5

Outline

• Sequences are different (math)
• How much did my sequencing run give me? kmerspectrumanalyzer (graphs)
• How much did I sample? nonpareil-k (graphs)
• Pretty pictures: thumbnailpolish (micrographs)


Page 7

Sequences are different

• Sequencing produces sequences. Sequences are qualitatively different from all other data types.

@HWI-ST1035:125:D1K4CACXX:8:1101:1168:2214 1:N:0:CGATGT
CAAACAGTTCCATCACATGGCCTAAGCTCATATCTTTTAACTCAGACCATTCAATATTCTCATTTAATTGATCTTCGTGTTGTTCATTTTCCTGTGCTTCA
+
@@@DFDFDFHHHHIIIIEHIIIHDHIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIHIIIIIIIIIIIIFIHHIIHHHHHFFFFFDFEEEEEEDD
@HWI-ST1035:125:D1K4CACXX:8:1101:1190:2224 1:N:0:CGATGT
CAGCAAGAACGGATTGGCTGTGTAGGTGCGAAATTATTGTATCCAAATAATACGGTCCAACACGCAGGCGTTATTTTAGGATTAGGTGGTGTCGCTGGACA
+
CCCFFFFFHHFHFGIEHIJJHGCHEH:CFHHIGGGGIJB?BDFGHII<CGBFDBGFFHHIIGEHFFBDDDBB?DDCCCDDDCDDDC>@B<B<C@DDDDBDC
@HWI-ST1035:125:D1K4CACXX:8:1101:1339:2184 1:N:0:CGATGT
CTGGTTTAGTTTGCCTCAGTTACCATTAGTTAACTTTTCTTTCCAATTTGATTGGCCAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGATGTATCT
+
BCCFDFFFHDFHHIJJJJHIJJJJJJJJJJJJIIJJJJJJJJJJJIJJJHJJJFHIJJJJIJIIJJJIJIJJJIHHEHFFFFFEEEEEECDDDDDDDECCD

• Instrument readings, spectra, micrographs: not categorical (10^7 channels)
• Low-throughput categorical data: categories are sound (10^3 channels)
• High-throughput sequence data: categorization is an art (10^11 channels)

Page 8

Sequences are different

• Sequencing produces sequences. Sequences are qualitatively different from all other data types.
• Each sequence is an information-rich (possibly corrupted) quotation from the catalog of genetic polymers.

Page 9

Searching

What is this sequence?
>mystery_sequence
CTAAGCACTTGTCTCCTGTTTACTCCCCTGAGCTTGAGGGGTTAACATGAAGGTCATCGATAGCAGGATAATAATACAGTA

Who wrote this line?
“be regarded as unproved until it has been checked against more exact results”

We know what to do with these puzzles. You go to this website, and type it in…


Page 11

Searching

How long do reads need to be to recognize them?

To do what? To place on a reference genome? This can be turned into a math problem that I will illustrate with a search engine analogy.

Page 12

How long do reads need to be?

Information (Shannon, 1949, BSTJ) is a quantitative summary of the uncertainty of a probability distribution, that is, of a model of the data.

Profound applicability in pattern matching + modeling

Logarithmic measurements have units!

$$H = \sum_i p_i \log_2\left(\frac{1}{p_i}\right)$$
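As a small illustration of the entropy formula on this slide, here is a minimal Python sketch. The function names and the example distributions are made up for illustration; only the formula itself comes from the slide. With a base-2 logarithm the result is in bits.

```python
import math
from collections import Counter

def shannon_entropy_bits(probabilities):
    """H = sum_i p_i * log2(1 / p_i); zero-probability outcomes contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

def empirical_entropy_bits(symbols):
    """Entropy of the observed symbol frequencies in a string or list."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return shannon_entropy_bits(c / total for c in counts.values())

# Uniform model over A, C, G, T: the "model of ignorance" for one base.
uniform_dna = shannon_entropy_bits([0.25] * 4)

# A hypothetical skewed base composition carries less uncertainty per base.
skewed_dna = shannon_entropy_bits([0.7, 0.1, 0.1, 0.1])
```

The uniform case gives 2 bits per base, which is where the "2 bits per base pair" figure used later in the talk comes from; any less flat distribution gives a smaller number.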

Page 13

A word on the sign of the entropy

• A popular straw man among mathematicians and CS people is the “random sequence model”: a uniform categorical distribution over all 4^L sequences.
• When we learn something (say we collect some genomes and expect our new sequences to look like them) we implicitly construct a less flat distribution. Models always have less entropy than the model of ignorance.



Page 16

How long do phrases need to be?

Exercise: Pick a book from your bookshelf. Pick an arbitrary page and arbitrary line.

for n in 1..10
    type the first n words into google books, quoted.
    break if google identifies your book.

Most often takes 4 words

Page 17

How long do phrases need to be?

• Information content of English words: H_word is ca. 12 bits per word.
• Size of google books? Big libraries have a few 10^7 books, each one has 10^5 indexed words, so a database size of 10^12 words.
  log2(database size) = log2(10^12) = log2(2^39.9) ≈ 40 bits
• So we expect on average 40 / 12 = 3.3, rounded up to 4 words, to be enough to find a phrase in google’s index. Try it.

Not all phrases are equally distinctive.
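The arithmetic on this slide can be checked in a few lines. This is only a sketch of the back-of-the-envelope estimate; the function name is hypothetical, and the 12 bits per word and 10^12 indexed words are the slide's round numbers, not measurements.

```python
import math

def units_needed(database_size, bits_per_unit):
    """Bits needed to address one item in the database, and how many
    units (words, bases, ...) are needed to supply that many bits."""
    bits_to_address = math.log2(database_size)
    return bits_to_address, math.ceil(bits_to_address / bits_per_unit)

# 10^12 indexed words, about 12 bits of information per English word.
bits, words = units_needed(1e12, 12)
print(f"{bits:.1f} bits to address the index, so about {words} words")
```

Rounding up matters here: 3.3 words is not enough, so the estimate is 4, in line with what the exercise usually finds.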

Page 18

How long do reads need to be?

• Maximum information content of ℓ base pairs: H_read = 2ℓ bits per length-ℓ sequence.
• Most long kmers are distinct: for a genome of size G (ca. 10^10 bp), log2(G) = log2(10^10) = log2(2^33.2), rounded up to 34 bits.
• So we expect that when 2ℓ > 34 bits, we should be able to place any sequence.
• That means we need at least 17 base pairs (seems small) to deliver mail anywhere in the genome.
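The same estimate for reads can be done with exact integer arithmetic. This is a sketch of the idealized "random sequence" bound only; the function name is hypothetical, and real genomes with repeats need longer reads than this.

```python
def min_read_length(genome_size, alphabet_size=4):
    """Smallest length L for which alphabet_size**L exceeds genome_size,
    i.e. there are more possible L-mers than positions to address."""
    length = 0
    addresses = 1
    while addresses <= genome_size:
        addresses *= alphabet_size
        length += 1
    return length

# Genome size G of about 10^10 bp, as on the slide.
human_scale = min_read_length(10**10)
print(human_scale)
```

With four letters each base contributes at most 2 bits, so this is the same calculation as dividing 34 bits by 2 bits per base.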


Page 19

The data deluge

• There were some technological breakthroughs in the mid-2000s that led to inexpensive collection of 10s of Gbytes of sequence data at once.
• The data has outgrown some favorite algorithms from the 1990s (BLAST)


Page 21

Picture, if you will, a HiSeq flowcell

• Pairs of microbial genomes
• Microbial transcriptomes + replicates
• Environmental isolate genomes
• Environmental extract sequencing
• Preparation-intensive sequencing
• Eukaryotic sequencing
• Eukaryotic sequencing for variants

What’s in there?

Let’s count kmers!


Page 23

The kmer spectrum.

[Figure: kmer spectrum of a microbial genome. x-axis: 21mer abundance, from rare to abundant; y-axis: number of kmers.]

• low-abundance errors
• peak contains most of genome
• high-abundance peak contains multicopy genes
• really high abundance stuff often artifacts
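To make "count kmers" and "kmer spectrum" concrete, here is a minimal in-memory sketch. It is not how kmerspectrumanalyzer or any production kmer counter works: the function names and toy reads are made up, and it ignores reverse complements, ambiguous bases, and memory limits.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def kmer_spectrum(counts):
    """Map abundance -> number of distinct kmers seen exactly that many times."""
    return Counter(counts.values())

# Toy reads and a tiny k, purely for illustration (the talk uses 21mers).
reads = ["ACGTACGT", "CGTACGTA", "ACGTTTTT"]
spectrum = kmer_spectrum(count_kmers(reads, k=4))
for abundance in sorted(spectrum):
    print(abundance, spectrum[abundance])
```

The spectrum is the histogram plotted on the slide: abundance on the x-axis, number of distinct kmers with that abundance on the y-axis.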

Page 24

Ranked kmer spectrum

[Figure: ranked kmer spectrum. x-axis: kmer rank (cumulative sum of number of kmers); y-axis: 21mer abundance, from rare to abundant.]

Page 25

Ranked kmers consumed

[Figure: ranked kmers consumed. x-axis: 21mer abundance, from rare to abundant; y-axis: fraction of observed kmers.]

data fraction is unusually stable
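The two ranked views can be derived from the spectrum alone. The sketch below assumes the spectrum is a mapping from abundance to number of distinct kmers, and assumes that "kmers consumed" means the cumulative fraction of all kmer observations, accumulated from the most abundant kmers down; the function name and toy numbers are hypothetical.

```python
def ranked_spectrum(spectrum):
    """Return rows of (abundance, cumulative rank, cumulative fraction of
    observations), walking from the most abundant kmers to the rarest."""
    total_observations = sum(a * n for a, n in spectrum.items())
    rank = 0
    consumed = 0
    rows = []
    for abundance in sorted(spectrum, reverse=True):
        n_distinct = spectrum[abundance]
        rank += n_distinct                  # cumulative sum of number of kmers
        consumed += abundance * n_distinct  # observations accounted for so far
        rows.append((abundance, rank, consumed / total_observations))
    return rows

# Toy spectrum: a genome peak at 20x, a few repeats at 40x, singleton errors.
toy_spectrum = {1: 1000, 20: 5000, 40: 100}
rows = ranked_spectrum(toy_spectrum)
for row in rows:
    print(row)
```

Plotting abundance against rank gives the ranked kmer spectrum; plotting the cumulative fraction against abundance gives the "fraction of data" curve.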

Page 26

Different kinds of data have different spectra



Page 29

Redundancy is good

• OMG! Check out these three sequences! I’ve found the fourth, fifth, and sixth domains of life.
• OMG! I see this sequence 10 million times.
• OMG! There are more than 10 billion distinct 31mers in my dataset. I only have 128 Gbytes of memory.
• Error correction and diginorm somewhat amusingly strive for opposite ends.

Abundance-based inferences are better in the high-abundance part of the data.

Page 30

kmerspectrumanalyzer: infer genome size and depth

$$P_{NO}(x; c, \{a_n\}, s) = \sum_n a_n \, \mathrm{NB}_{pdf}(x; \mu = cn, \alpha = s/n)$$

Generalization of mixed-Poisson model to estimate how much sequence is in each peak.
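A mixture of negative-binomial peaks of this shape can be evaluated with the standard library alone. This is a sketch, not the kmerspectrumanalyzer code, and it rests on labelled assumptions: the negative binomial is parameterized by mean mu and dispersion alpha with variance mu + alpha * mu^2, the peaks are indexed n = 1, 2, 3, ..., and the weights a_n are treated as mixture proportions. Fitting c, a_n, and s to data is not shown.

```python
import math

def nb_pmf(x, mu, alpha):
    """Negative binomial pmf with mean mu and dispersion alpha.
    Assumed parameterization: variance = mu + alpha * mu**2."""
    r = 1.0 / alpha
    p = r / (r + mu)
    log_pmf = (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
               + r * math.log(p) + x * math.log1p(-p))
    return math.exp(log_pmf)

def mixture_pmf(x, c, a, s):
    """sum_n a_n * NB(x; mu = c*n, alpha = s/n) for n = 1..len(a)."""
    return sum(a_n * nb_pmf(x, c * n, s / n)
               for n, a_n in enumerate(a, start=1))

# Hypothetical values: 30x single-copy coverage, 90% single-copy sequence,
# 10% two-copy sequence, shape parameter 0.02.
weights = [0.9, 0.1]
peak_at_30 = mixture_pmf(30, c=30.0, a=weights, s=0.02)
peak_at_60 = mixture_pmf(60, c=30.0, a=weights, s=0.02)
```

In this reading, c sets where the single-copy peak sits, each further peak sits at a multiple of c, and the fitted weights say how much sequence belongs to each copy number.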

Page 31

[Figure (Fig 2): estimated genome size (kb) against complete genome size (kb), both axes from 0 to 10000.]

Counting kmers tells you genome size

…for single genomes, most of the time.

so much for calibration data

Page 32

The kink does measure error

[Figure: kmer spectra of artificial E. coli data with varying substitution errors: 10%, 5.5%, 4%, 3%, 1.7%, 1%, 0.5%, 0.3%, 0.1%.]

Page 33

But I want to sequence everything! Ok, we can count kmers in everything too.

kmerspectrumanalyzer summarizes distribution, estimates genome size, coverage depth


Page 35

How much novelty is in my dataset?

How many sequences do you need to see before you start seeing the same ones over and over again? Initially, everything is novel, but there will come a point at which more than half of your new observations are already in the catalog. We can calculate this efficiently using the kmer spectrum.

$$\mathrm{Nonuniquefraction}(\epsilon; \{r\}, \{n\}) = \sum_i \frac{n_i \cdot r_i}{\sum_j n_j \cdot r_j} \cdot \frac{1 - \mathrm{Poiss}_{cdf}(\epsilon \cdot r_i, 1)}{1 - \mathrm{Poiss}_{cdf}(\epsilon \cdot r_i, 0)}$$
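Here is a sketch of how such a curve could be computed from the kmer spectrum. It is not the nonpareil-k implementation. It assumes the formula on this slide is an observation-weighted sum over abundance classes (n_i distinct kmers at abundance r_i) of the ratio P(seen at least twice) / P(seen at least once), with counts at subsampling fraction epsilon modelled as Poisson with mean epsilon * r_i. Reading the last two factors as a ratio is an assumption, and all names are hypothetical.

```python
import math

def poisson_cdf(lam, k):
    """P(X <= k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * sum(lam ** i / math.factorial(i) for i in range(k + 1))

def nonunique_fraction(eps, spectrum):
    """Observation-weighted chance that a kmer seen at least once in a
    subsample of relative size eps is seen at least twice.
    spectrum maps abundance r -> number of distinct kmers n."""
    total = sum(n * r for r, n in spectrum.items())
    result = 0.0
    for r, n in spectrum.items():
        lam = eps * r
        seen_at_least_once = 1.0 - poisson_cdf(lam, 0)
        seen_at_least_twice = 1.0 - poisson_cdf(lam, 1)
        if seen_at_least_once > 0:
            result += (n * r / total) * seen_at_least_twice / seen_at_least_once
    return result

# Toy spectrum: 1000 singleton kmers plus 5000 kmers at abundance 20.
toy_spectrum = {1: 1000, 20: 5000}
for eps in (0.01, 0.1, 1.0):
    print(eps, round(nonunique_fraction(eps, toy_spectrum), 3))
```

Sweeping eps from small values up to 1 traces out a rarefaction-style curve without re-counting kmers for each subsample, which is why the spectrum makes the calculation cheap.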

Page 36

[Figure panels: Nonpareil: model of sequence coverage. Nonpareil-k: kmer rarefaction, summary of sequence diversity.]

• Nonpareil: uses subset-against-all alignment to find out how much of the dataset is unique.
• Nonpareil-k: crunches the kmer spectrum to approximate the unique fraction, 300x faster.


Page 38

Nonpareil-k: stratify datasets by coverage distribution

• most of dataset likely contained in assembly
• assembly is likely to miss or attenuate the large unique fraction of dataset.

Page 39

kmer spectra reveal sequencing problems

• Amok PCR: seemingly random sequences
• Amok MDA: 10 Gbases of sequence, one gene
• PCR duplicates: entire sequencing run was 50x exact- and near-exact duplicate reads
• Unusually high error rate: indicated by low fraction of “solid” kmers (for isolate genomes)
• Contaminated samples: 95% E. coli, 5% E. faecalis

Page 40

[Figure 1c: PCO2 vs Alpha Diversity. x-axis: eigen_vectors[, "PCO2"], from -6e-04 to 4e-04; y-axis: color_matrix[, "alpha-diversity"], from 0 to 600.
All: y = -259839.54*x + 209.62; R^2 = 0.29
Gut: y = -275950.37*x + 118.73; R^2 = 0.78
Oral: y = -369610.24*x + 298.39; R^2 = 0.7]

[Figure 1d]

HMP / quantile norm / euclidean / colored by alpha

MG-RAST API, R package matR

Hey kid, you want some unlabeled data?

Page 41

[Figure 2a]

[Figure 2b]

Hey kid, you want some pretty ordinations?

Page 42

Generalities from the kmer counting mines

• Many datasets have as much as 5-45% of the sequence yield in adapters.
• FEW DATASETS have well-separated abundance peaks (of the sort metavelvet was engineered to find)
• Diverse datasets have a featureless, geometric relationship between kmer rank and kmer abundance.
• Shannon entropy is oversensitive to errors. Higher-order Rényi entropy is more stable.
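The last bullet can be illustrated with a toy calculation. The sketch below uses the textbook definition of Rényi entropy of order q, H_q = log2(sum_i p_i^q) / (1 - q), with Shannon entropy as the q = 1 limit, computed from a kmer spectrum (abundance mapped to number of distinct kmers). The function name and the toy spectra are hypothetical, and this is not taken from any of the tools named in the talk.

```python
import math

def renyi_entropy_bits(spectrum, order):
    """Rényi entropy (bits) of the kmer abundance distribution.
    order 0: log2(richness); order 1: Shannon; order 2: -log2(Simpson sum)."""
    total = sum(a * n for a, n in spectrum.items())
    if order == 1:
        return sum(n * (a / total) * math.log2(total / a)
                   for a, n in spectrum.items())
    power_sum = sum(n * (a / total) ** order for a, n in spectrum.items())
    return math.log2(power_sum) / (1 - order)

# A clean 20x dataset, and the same data plus 20000 singleton error kmers.
clean = {20: 5000}
noisy = {20: 5000, 1: 20000}

shifts = {q: renyi_entropy_bits(noisy, q) - renyi_entropy_bits(clean, q)
          for q in (0, 1, 2)}
for q, shift in shifts.items():
    print(f"H{q} moves by {shift:.2f} bits when errors are added")
```

In this toy case the added error kmers move H0 the most and H2 the least, which is the sense in which higher-order entropies are more stable.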


Page 44

kmer statistical summaries

• H0 kmer richness (VERY BAD)
• H1 Shannon entropy (BAD)
• H2 Rényi entropy / Simpson index (GOOD)
• observation-weighted coverage (BAD)
• observation-weighted size (BAD)
• observation-median coverage (GOOD)
• observation-median size (GOOD)
• fraction in top 100 kmers (USEFUL)
• fraction unique (OK but requires size correction)

Most of these give answers which vary so strongly with sampling depth as to be unusable. Observation-weighted fraction-of-data metrics behave fairly well. Fractions of the data with particular properties are stable with respect to sampling.
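Two of the fraction-of-data summaries can be sketched from the spectrum. The definitions here are guesses at what the slide labels mean, not the talk's exact definitions: "observation-median coverage" is taken as the abundance at which half of all kmer observations have been accounted for, and "fraction in top 100 kmers" as the share of observations belonging to the 100 most abundant distinct kmers. Names and toy numbers are hypothetical.

```python
def observation_median_coverage(spectrum):
    """Abundance a such that at least half of all kmer observations come
    from kmers with abundance <= a (assumed definition)."""
    total = sum(a * n for a, n in spectrum.items())
    running = 0
    for abundance in sorted(spectrum):
        running += abundance * spectrum[abundance]
        if 2 * running >= total:
            return abundance

def fraction_in_top_kmers(spectrum, top=100):
    """Share of all observations held by the `top` most abundant kmers."""
    total = sum(a * n for a, n in spectrum.items())
    remaining = top
    observations = 0
    for abundance in sorted(spectrum, reverse=True):
        take = min(remaining, spectrum[abundance])
        observations += take * abundance
        remaining -= take
        if remaining == 0:
            break
    return observations / total

# Toy spectrum: singleton errors, a 20x genome peak, ten very abundant kmers.
toy_spectrum = {1: 20000, 20: 5000, 1000: 10}
median_cov = observation_median_coverage(toy_spectrum)
top_fraction = fraction_in_top_kmers(toy_spectrum, top=100)
```

Both numbers are weighted by observations rather than by distinct kmers, so a flood of rare error kmers has little effect on them.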

Page 45

thumbnailpolish

http://www.mcs.anl.gov/~trimble/flowcell/

Page 46

Sometimes the sequencer has a bad day.

Page 47

Metagenomic annotation group: Folker Meyer, Elizabeth Glass, Narayan Desai, Kevin Keegan, Adina Howe, Wolfgang Gerlach, Wei Tang, Travis Harrison, Jared Bishof, Dan Braithwaite, Hunter Matthews, Sarah Owens

Formerly of Yale: Howard Ochman, David Williams

Georgia Tech: Kostas Konstantinidis, Luis Rodriguez-Rojas

Page 48

Observation: Most scientists seem to be self-taught in computing.

Observation: Most scientists waste a lot of time using computers inefficiently.

Adina and I volunteer with

Page 49

We teach scientists how to get more done

• Woods Hole
• Tufts
• U. Chicago
• U. Chicago
• UIC