Primary analysis tutorial deprecated


Upload: phillip-andrew-richmond

Post on 08-Jan-2017


Bioinformatics Primary Analysis Tutorial

Phil Richmond, PRA, Dowell Lab

University of Colorado, BioFrontiers Institute

Outline

• Intro
  – Things that will be covered
  – Things that won't be covered
• Workflow
• Mapping with Bowtie
• File Conversion with Samtools
• Visualization with IGV
• Extras

Sequencing

• There are many different types of sequencing, including 454, Illumina, SOLiD, Ion Torrent, and more.
• If you are interested in each type of sequencing…

Things that will be covered

• The primary analysis that I will walk through is a "bare bones" analysis, meant to take your reads from the Illumina sequencer to a visualizer, along with some organizational practices:
  – Mapping (Bowtie/BWA)
  – File format conversion
  – Visualization

Things that won't be covered

• Pre- and post-processing steps that I'm leaving out include:
  – FastX analysis of raw reads, adapter clipping, etc.
  – PCR duplicate marking (Illumina) on raw reads
  – Base Quality Score Recalibration (GATK) on mapped reads
  – Local realignment around indels on mapped reads
• Any secondary or tertiary analysis or scripting techniques
  – Secondary analysis by personal appointment
  – Scripting techniques by joining Dave Knox's Python class

Login to Tuxedo

• Log in with the -X option to enable X11 forwarding.
• On a PC, see me for separate instructions to pipe the visualization.
• ssh -X <username>@<hostname>

Working Directory

• We will be working in /data/Tutorial/<Student>
  – cd /data/Tutorial/Phil/
• The necessary files for the tutorial are in /data/Tutorial/Files/
  – Parent113010.fa is the reference (E. coli) genome
  – Parent120710.gff is the annotation file
  – Sample1_single.fastq is the reads file we are working with

Organization

• In your own directory (/data/Tutorial/<Student>/) create the following sub-directories:
  – Genome/
    • Keep the fasta and gff files here
  – Bowtie/
    • Keep the Bowtie alignments, and post-processing of Bowtie alignments, here
  – Fastq/
    • Keep the raw fastq files here
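The directory layout above can be created in one step. This is a minimal sketch, assuming a POSIX shell; `base_dir` is a hypothetical variable that defaults to a temporary directory so the example can be run anywhere. On the tutorial server it would point at /data/Tutorial/<Student>/ instead.

```shell
# Hypothetical base directory (an assumption of this sketch). On the tutorial
# server this would be /data/Tutorial/<Student>/; a temporary directory is
# used by default so nothing real is touched.
base_dir="${BASE_DIR:-$(mktemp -d)}"

# Create the three sub-directories named on the slide.
# -p makes the command safe to re-run if they already exist.
mkdir -p "$base_dir/Genome" "$base_dir/Bowtie" "$base_dir/Fastq"

# List what was created.
ls "$base_dir"
```

Setting `BASE_DIR` in the environment before running the snippet switches it to a real location.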

Workflow

Raw Reads (Fastq) → Mapping (Bowtie) → Mapped Reads (SAM) → File Conversion (SAMTOOLS) → Binary Mapped Reads (SORTED.BAM) → Visualization (IGV)


Fastq file

• File extension .fastq or .fq
• Example:
  @Read_identifier_and_flowcell_info    (Read ID)
  ACGTCCGGTTNNN…    (Read Sequence)
  +    (Read QV ID)
  B$!?NP\\\[%&C…    (Read QV Sequence)
• For more info on ASCII encoding of QV scores, see Wikipedia.
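Because each FASTQ record is four lines, simple line arithmetic is enough to inspect a file. The sketch below assumes an unwrapped FASTQ file (exactly four lines per read) and uses two made-up reads written to a temporary file; the read names and sequences are illustrative only.

```shell
# Write a two-record example FASTQ file (made-up reads, for illustration only).
fastq_file="$(mktemp)"
cat > "$fastq_file" <<'EOF'
@read_1
ACGTCCGGTT
+
IIIIIIIIII
@read_2
TTGACCGTNN
+
IIIIIIII!!
EOF

# Each record is exactly four lines, so the read count is lines / 4.
read_count=$(awk 'END { print NR / 4 }' "$fastq_file")
echo "reads: $read_count"

# Print the ID and length of each read: line 1 of a record is the ID
# (without the leading "@"), line 2 is the sequence.
awk 'NR % 4 == 1 { id = substr($0, 2) }
     NR % 4 == 2 { print id "\t" length($0) }' "$fastq_file"
```

Counting by line position rather than by searching for "@" matters because "@" is also a valid quality character and can appear at the start of a quality line.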


Mapping the Short Reads

• Taking each read and mapping it to a reference genome
  – Bowtie

Reference: TGCATGCATGCATGCATGCATGCATGCATGCATGCAAAAAGCATGCATGCA

Read: TGCATGAATGCAAAAAGCATGCA
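To make the idea of mapping concrete, here is a brute-force sketch that slides the short sequence from this slide along the long one and keeps the position with the fewest mismatches. This is not how Bowtie works internally (Bowtie searches a Burrows-Wheeler index rather than scanning); the variable names are hypothetical, and the sketch only illustrates what "the best place for a read on the reference" means.

```shell
# The two sequences shown on the slide: the long one is treated as the
# reference and the short one as the read (an assumption of this sketch).
ref="TGCATGCATGCATGCATGCATGCATGCATGCATGCAAAAAGCATGCATGCA"
read_seq="TGCATGAATGCAAAAAGCATGCA"

# Try every position where the whole read fits on the reference and count
# the mismatching bases at that position. No gaps are considered.
best=$(awk -v ref="$ref" -v rd="$read_seq" 'BEGIN {
    n = length(ref); m = length(rd)
    best_mm = m + 1; best_pos = 0
    for (pos = 1; pos + m - 1 <= n; pos++) {
        mm = 0
        for (i = 1; i <= m; i++)
            if (substr(ref, pos + i - 1, 1) != substr(rd, i, 1)) mm++
        # Keep the first position with the lowest mismatch count.
        if (mm < best_mm) { best_mm = mm; best_pos = pos }
    }
    print best_pos, best_mm
}')

best_pos=${best% *}   # text before the space: 1-based position
best_mm=${best#* }    # text after the space: mismatch count
echo "best position (1-based): $best_pos, mismatches: $best_mm"
```

A scan like this costs time proportional to reference length times read length for every read, which is why real mappers index the genome first.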

Bowtie-Build Command

• In order to map the reads to a genome, you must acquire the genome in FASTA (.fasta or .fa) format and then index it.
• bowtie-build -f <in.fasta> <out_prefix>
  – $ bowtie-build SGDv4.fasta SGDv4_bowtie

Bowtie command

• Now we map the reads back to the reference we just indexed.
• bowtie <reference_in.prefix> -q <in.fastq> -S <out.SAM> 2> <out.stderr>
  – $ bowtie /data/Tutorial/Phil/Genome/Bowtie_index/SGDv3_bowtie -q Sample1.fastq -S Sample1_bowtie.sam 2> Sample1_bowtie.stderr

Sam File

• Tab delimited
• http://genome.sph.umich.edu/wiki/SAM
• Open example SAM
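Since SAM is tab delimited, standard text tools can summarize it. The sketch below builds a tiny made-up SAM file (the reference name, positions, and reads are illustrative, not tutorial data) and counts mapped reads by checking the 0x4 "unmapped" bit of the FLAG column, which is field 2.

```shell
# Build a small example SAM file with printf so the tabs are explicit.
# All values here are made up for illustration.
sam_file="$(mktemp)"
{
  printf '@HD\tVN:1.0\tSO:unsorted\n'
  printf '@SQ\tSN:chr1\tLN:51\n'
  printf 'read_1\t0\tchr1\t25\t255\t10M\t*\t0\t0\tACGTCCGGTT\tIIIIIIIIII\n'
  printf 'read_2\t16\tchr1\t3\t255\t10M\t*\t0\t0\tTTGACCGTAA\tIIIIIIIIII\n'
  printf 'read_3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGGGGGGG\tIIIIIIIIII\n'
} > "$sam_file"

# Skip header lines (they start with "@"), then test bit 0x4 of the FLAG
# column: int(flag / 4) % 2 is 1 when the read is unmapped.
summary=$(awk -F '\t' '
    /^@/ { next }
    { total++ }
    int($2 / 4) % 2 == 1 { unmapped++ }
    END { printf "%d %d\n", total, total - unmapped }
' "$sam_file")

total_reads=${summary% *}
mapped_reads=${summary#* }
echo "total reads: $total_reads, mapped: $mapped_reads"
```

The same awk filter can be pointed at a real Bowtie SAM output file in place of `$sam_file`.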


Samtools Commands

• samtools view -bS <in.sam> -o <out.bam>
  – $ samtools view -bS Sample1_bowtie.sam -o Sample1_bowtie.bam
• samtools sort <in.bam> <out.sorted>
  – $ samtools sort Sample1_bowtie.bam Sample1_bowtie.sorted
• samtools index <in.sorted.bam>
  – $ samtools index Sample1_bowtie.sorted.bam
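The mapping and conversion commands from the last few slides can be chained in one script driven by a single sample name. This is a sketch under stated assumptions: the `run` and `pipeline` helpers and the `DRY_RUN` switch are hypothetical names introduced here, the sample and index names are taken from the slide examples, and by default the script only prints the commands so it can be tried on a machine without bowtie or samtools.

```shell
# Sample and index names follow the slide examples; adjust as needed.
sample="Sample1"
index="SGDv4_bowtie"

# DRY_RUN=1 (the default, an assumption of this sketch) only prints each
# command. Set DRY_RUN=0 where bowtie and samtools are installed.
DRY_RUN="${DRY_RUN:-1}"

# Print a command, and execute it unless this is a dry run.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@" || exit 1
    fi
}

pipeline() {
    # Map reads to the indexed reference and write SAM output.
    # (The slides also capture bowtie's stderr with 2>; omitted here.)
    run bowtie "$index" -q "${sample}.fastq" -S "${sample}_bowtie.sam"
    # Convert SAM to BAM, sort it, then index the sorted BAM for IGV.
    # The sort call uses the <in.bam> <out.prefix> form shown on the slide;
    # newer samtools releases expect: samtools sort -o <out.bam> <in.bam>
    run samtools view -bS "${sample}_bowtie.sam" -o "${sample}_bowtie.bam"
    run samtools sort "${sample}_bowtie.bam" "${sample}_bowtie.sorted"
    run samtools index "${sample}_bowtie.sorted.bam"
}

pipeline
```

Deriving every file name from `sample` keeps the intermediate files consistent, which avoids the easy mistake of sorting or indexing the wrong BAM.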


IGV

• Located at /data2/IGV/
• Several different versions are available; I recommend either:
  – /data2/IGV/IGV_2.1.19/igv.jar
  – /data2/IGV/IGV_1.5.64/igv.jar
• To run IGV:
  – java -Xmx5g -jar <igv.jar>
  – $ java -Xmx5g -jar /data2/IGV/IGV_1.5.64/igv.jar &

IGV: Creating a genome

• Refer to the instructions on the sheet.

Bowtie and Bfast IGV

[Figure: Bowtie, Bfast, and Gene tracks]

Advantages to Bfast Gapped Mapping

[Figure: Bowtie, Bfast, and Gene tracks]

Bfast Mapping Loosely

[Figure: Bowtie, Bfast, and Gene tracks]

If you are getting the hang of it quickly…

• Try going through the next few commands

BWA Paired end

• /usr/local/src/bwa-0.6.2/bwa index -a is <in.fasta>
• Map each read in the pair independently
• /usr/local/src/bwa-0.6.2/bwa aln <reference.prefix> <in_1.fq> > <out.sai>
• Finalize the mapping by converting both the .sai and the .fq files (for both reads) into a final SAM alignment:
• /usr/local/src/bwa-0.6.2/bwa sampe <reference.prefix> <in_1.sai> <in_2.sai> <in_1.fq> <in_2.fq> > <out_paired.sam>

Bowtie Unique Mapping

• Investigate the different Bowtie options:
  – Look at -m (suppress reads that have more than this many reportable alignments) and -v (maximum number of mismatches allowed across the whole read; -n sets the mismatches allowed in the seed)

TopHat Spliced Mapping

• /usr/local/src/tophat-2.0.4.Linux_x86_64/tophat -G <in.gff> -o <output_directory> <bowtie_index> <in.fastq>

The end… for now.