primary analysis tutorial deprecated

Bioinformatics Primary Analysis Tutorial

Phil Richmond, PRA Dowell Lab

University of Colorado, BioFrontiers Institute

Outline

•  Intro
   –  Things that will be covered
   –  Things that won’t be covered
•  Workflow
•  Mapping with Bowtie
•  File Conversion with Samtools
•  Visualization with IGV
•  Extras

Sequencing

•  There are many different types of sequencing including 454, Illumina, SOLiD, IonTorrent, and more.

•  If you are interested in each type of sequencing…

Things that will be covered

•  The primary analysis that I will walk through is a “bare bones” analysis, meant to take your reads from the Illumina sequencer to a visualizer, along with some organizational practices:
   –  Mapping (Bowtie/BWA)
   –  File format conversion
   –  Visualization

Things that won’t be covered

•  Post/preprocessing steps that I’m leaving out include:
   –  FastX analysis of raw reads, adapter clipping, etc.
   –  PCR duplicate marking (Illumina) on raw reads
   –  Base Quality Score Recalibration (GATK) on mapped reads
   –  Local realignment around indels on mapped reads

•  Any secondary or tertiary analysis or scripting techniques
   –  Secondary analysis by personal appointment
   –  Scripting techniques by joining Dave Knox’s Python class

Login to Tuxedo

•  Log in with the -X option to enable X11 forwarding.
•  On a PC, see me for separate instructions to pipe the visualization.

•  ssh -X [email protected]

Working Directory

•  We will be working in /data/Tutorial/<Student>
   –  cd /data/Tutorial/Phil/

•  The necessary files for the tutorial are in /data/Tutorial/Files/
   –  Parent113010.fa is the reference (E. coli) genome
   –  Parent120710.gff is the annotation file
   –  Sample1_single.fastq is the reads file we are working with

Organization

•  In your own directory (/data/Tutorial/<Student>/) create the following sub-directories:
   –  Genome/
      •  Keep the fasta and gff files here
   –  Bowtie/
      •  Keep the Bowtie alignments, and post-processing of Bowtie alignments, here
   –  Fastq/
      •  Keep the raw fastq files here
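The directory layout above can be created in one step. This is a minimal sketch: BASE_DIR is a hypothetical variable that defaults to a temporary directory so the example is safe to try anywhere; on the tutorial server you would set it to /data/Tutorial/<Student>/.

```shell
# Hypothetical base directory; on the tutorial server this would be
# /data/Tutorial/<Student>/ rather than a temporary directory.
BASE_DIR="${BASE_DIR:-$(mktemp -d)}"

# -p creates missing parents and does not complain if a directory exists.
mkdir -p "$BASE_DIR/Genome" "$BASE_DIR/Bowtie" "$BASE_DIR/Fastq"

# List what was created.
ls "$BASE_DIR"
```

Running it twice is harmless, which makes it convenient to keep at the top of a per-project setup script.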

Workflow

Raw Reads (Fastq) → Mapping (Bowtie) → Mapped Reads (SAM) → File Conversion (SAMTOOLS) → Binary Mapped Reads (SORTED.BAM) → Visualization (IGV)

Fastq file

•  File extension .fastq or .fq
•  Example:
   @Read_identifier_and_flowcell_info     (Read ID)
   ACGTCCGGTTNNN…                         (Read Sequence)
   +                                      (Read QV ID)
   B$!?NP\\\[%&C…                         (Read QV Sequence)

•  For more info on ASCII encoding of QV scores, see Wikipedia
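To make the four-line record structure concrete, here is a small sketch that counts reads and decodes one quality string. The two reads are made up for illustration, and the decoding assumes the Phred+33 ASCII offset (older Illumina pipelines used +64, so check your own data). Counting lines divided by four also assumes sequences are not wrapped across lines.

```shell
# Write a tiny, made-up FASTQ file with two reads.
FASTQ_FILE="$(mktemp)"
printf '%s\n' \
  '@read1' 'ACGTN' '+' 'II5!#' \
  '@read2' 'GGCATT' '+' 'IIIIII' > "$FASTQ_FILE"

# Each read occupies exactly four lines, so reads = lines / 4.
READ_COUNT=$(awk 'END { print NR / 4 }' "$FASTQ_FILE")
echo "reads: $READ_COUNT"

# Decode the quality line of the first read (line 4), assuming Phred+33.
FIRST_QUALS=$(awk '
  BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
  NR == 4 {
    out = ""
    for (j = 1; j <= length($0); j++) {
      sep = (j > 1) ? " " : ""
      out = out sep (ord[substr($0, j, 1)] - 33)
    }
    print out
    exit
  }' "$FASTQ_FILE")
echo "read1 qualities: $FIRST_QUALS"
```

Note how the N base in read1 carries the lowest possible score, which is a common pattern for bases the sequencer could not call.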



Mapping the Short Reads

•  Taking each read and mapping it to a reference genome
   –  Bowtie

TGCATGCATGCATGCATGCATGCATGCATGCATGCAAAAAGCATGCATGCA

TGCATGAATGCAAAAAGCATGCA
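As a rough illustration of what "mapping" means, the sketch below slides the shorter sequence along the longer one and reports the position with the fewest mismatches. It assumes the longer sequence above is the reference and the shorter one is the read. This brute-force scan is only a teaching aid; it is not how Bowtie works internally (Bowtie uses an index of the genome to avoid scanning every position).

```shell
REFERENCE="TGCATGCATGCATGCATGCATGCATGCATGCATGCAAAAAGCATGCATGCA"
QUERY="TGCATGAATGCAAAAAGCATGCA"

# Try every start position and count mismatching bases at each one.
BEST=$(awk -v ref="$REFERENCE" -v query="$QUERY" 'BEGIN {
  n = length(ref); m = length(query)
  best_mm = m + 1; best_pos = 0
  for (pos = 1; pos + m - 1 <= n; pos++) {
    mm = 0
    for (k = 1; k <= m; k++)
      if (substr(ref, pos + k - 1, 1) != substr(query, k, 1)) mm++
    if (mm < best_mm) { best_mm = mm; best_pos = pos }
  }
  # Print the 1-based start position and its mismatch count.
  print best_pos, best_mm
}')
echo "best position and mismatches: $BEST"
```

For these two sequences the read lands at position 25 of the reference with a single mismatch (an A in the read where the reference has a C).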

Bowtie-Build Command

•  In order to map the reads to a genome, you must acquire the genome in the .fasta (.fa) format, and then index it.

•  bowtie-build -f <in.fasta> <out_prefix>
   –  $ bowtie-build SGDv4.fasta SGDv4_bowtie

Bowtie command

•  Now we map the reads back to the reference we just indexed.

•  bowtie <reference_in.prefix> -q <in.fastq> -S <out.SAM> 2> <out.stderr>
   –  $ bowtie /data/Tutorial/Phil/Genome/Bowtie_index/SGDv3_bowtie -q Sample1.fastq -S Sample1_bowtie.sam 2> Sample1_bowtie.stderr
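Putting the index and mapping steps together for the tutorial files might look like the sketch below. The index prefix and output prefix are hypothetical names chosen to fit the Genome/, Bowtie/ and Fastq/ layout, and the script assumes it is run from your own tutorial directory. The run helper is also my own addition: it only prints the commands unless bowtie and both input files are actually present.

```shell
# Assumed layout: run from /data/Tutorial/<Student>/ with Genome/, Bowtie/
# and Fastq/ sub-directories. Index and output prefixes are hypothetical.
GENOME_FASTA="Genome/Parent113010.fa"
INDEX_PREFIX="Genome/Bowtie_index/Parent113010_bowtie"
READS="Fastq/Sample1_single.fastq"
OUT_PREFIX="Bowtie/Sample1_single_bowtie"

# Execute only when bowtie and both inputs exist; otherwise just print.
if command -v bowtie >/dev/null 2>&1 && [ -f "$GENOME_FASTA" ] && [ -f "$READS" ]; then
  DRY_RUN=0
  STDERR_LOG="$OUT_PREFIX.stderr"
  mkdir -p "$(dirname "$INDEX_PREFIX")" "$(dirname "$OUT_PREFIX")"
else
  DRY_RUN=1
  STDERR_LOG=/dev/null
fi

run() {
  # Echo the command in dry-run mode, execute it otherwise.
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Step 1: index the reference. Step 2: map the reads, keeping the
# alignment summary that bowtie writes to stderr in a log file.
run bowtie-build -f "$GENOME_FASTA" "$INDEX_PREFIX"
run bowtie "$INDEX_PREFIX" -q "$READS" -S "$OUT_PREFIX.sam" 2> "$STDERR_LOG"
```

Keeping the stderr log next to the SAM file is worth the extra redirect, since it records how many reads aligned.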

Sam File

•  Tab Delimited
•  http://genome.sph.umich.edu/wiki/SAM
•  Open Example SAM
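Because SAM is tab delimited, standard tools such as awk can read it directly. The sketch below builds a tiny, made-up SAM file (the read names, the chr1 reference name and all alignments are invented for illustration) and counts mapped versus unmapped reads using the FLAG column, where bit 0x4 marks an unmapped read.

```shell
# Write a small, made-up SAM file: two header lines and three reads.
SAM_FILE="$(mktemp)"
{
  printf '@HD\tVN:1.0\tSO:unsorted\n'
  printf '@SQ\tSN:chr1\tLN:51\n'
  printf 'read1\t0\tchr1\t25\t255\t23M\t*\t0\t0\tTGCATGAATGCAAAAAGCATGCA\tIIIIIIIIIIIIIIIIIIIIIII\n'
  printf 'read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTN\tII5!#\n'
  printf 'read3\t16\tchr1\t1\t255\t8M\t*\t0\t0\tTGCATGCA\tIIIIIIII\n'
} > "$SAM_FILE"

# Columns: 1 QNAME, 2 FLAG, 3 RNAME, 4 POS, 5 MAPQ, 6 CIGAR, ...
SAM_SUMMARY=$(awk -F '\t' '
  /^@/ { next }                               # skip header lines
  int($2 / 4) % 2 == 1 { unmapped++; next }   # FLAG bit 0x4: unmapped
  { mapped++ }
  END { printf "mapped=%d unmapped=%d\n", mapped, unmapped }
' "$SAM_FILE")
echo "$SAM_SUMMARY"
```

The same pattern extends to other quick checks, for example printing columns 3 and 4 to see where each read landed.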



Samtools Commands

•  samtools view -bS <in.sam> -o <out.bam>
   –  $ samtools view -bS Sample1_bowtie.sam -o Sample1_bowtie.bam

•  samtools sort <in.bam> <out.sorted>
   –  $ samtools sort Sample1_bowtie.bam Sample1_bowtie.sorted

•  samtools index <in.sorted.bam>
   –  $ samtools index Sample1_bowtie.sorted.bam
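The three conversion steps can be wrapped in one small function so the file names stay consistent. This is a sketch under the assumption that the installed samtools accepts the old "sort <in.bam> <out.prefix>" form used in this tutorial; newer releases expect "samtools sort -o <out.bam> <in.bam>" instead. The function names are my own, and the conversion only runs if samtools and the input file are present.

```shell
# Derive the sorted output prefix from a SAM file name.
sorted_prefix() { printf '%s\n' "${1%.sam}.sorted"; }

sam_to_sorted_bam() {
  sam="$1"
  bam="${sam%.sam}.bam"
  prefix="$(sorted_prefix "$sam")"
  # Old-style sort syntax as used in this tutorial; newer samtools
  # releases expect: samtools sort -o "$prefix.bam" "$bam"
  samtools view -bS "$sam" -o "$bam" &&
    samtools sort "$bam" "$prefix" &&
    samtools index "$prefix.bam"
}

SAM_FILE="Sample1_bowtie.sam"
if command -v samtools >/dev/null 2>&1 && [ -f "$SAM_FILE" ]; then
  sam_to_sorted_bam "$SAM_FILE"
else
  echo "samtools or $SAM_FILE not found; would create $(sorted_prefix "$SAM_FILE").bam"
fi
```

The sort step matters because indexing requires a coordinate-sorted BAM, and IGV needs the index to load a BAM file.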



IGV

•  Located at /data2/IGV/
•  Several different versions available; recommend either:
   –  /data2/IGV/IGV_2.1.19/igv.jar
   –  /data2/IGV/IGV_1.5.64/igv.jar

•  To run IGV:
   –  java -Xmx5g -jar <igv.jar>
   –  $ java -Xmx5g -jar /data2/IGV/IGV_1.5.64/igv.jar &

IGV: Creating a genome

•  Reference instructions on sheet.

Bowtie and Bfast IGV

[Figure: IGV view; tracks labeled Bowtie, Bfast, Gene]

Advantages to Bfast Gapped Mapping

[Figure: tracks labeled Bowtie, Bfast, Gene]

Bfast Mapping Loosely

[Figure: tracks labeled Bowtie, Bfast, Gene]

If you are getting the hang of it quickly…

•  Try going through the next few commands

BWA Paired end

•  /usr/local/src/bwa-0.6.2/bwa index -a is <in.fasta>

•  Map each read in the pair independently (run once per read file)
•  /usr/local/src/bwa-0.6.2/bwa aln <reference.prefix> <in_1.fq> > <out.sai>

•  Finalize the mapping by converting the .SAI and .FQ files for both reads into a final SAM alignment:

•  /usr/local/src/bwa-0.6.2/bwa sampe <reference.prefix> <in_1.sai> <in_2.sai> <in_1.fq> <in_2.fq> > <out_paired.sam>
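Written out for both mates, the paired-end steps come to three commands. The sketch below only prints them, so it is safe to run without bwa installed. It assumes the reference has already been indexed and that the read files follow a hypothetical <prefix>_1.fq / <prefix>_2.fq naming scheme; Fastq/Sample2 is a made-up prefix, not a file from this tutorial.

```shell
# Path from the slides; override BWA if it lives elsewhere.
BWA="${BWA:-/usr/local/src/bwa-0.6.2/bwa}"

# Print the paired-end commands for a reference and a read-file prefix.
# Assumes the mates are named <prefix>_1.fq and <prefix>_2.fq.
bwa_paired_plan() {
  ref="$1"
  prefix="$2"
  # One aln call per mate, each producing its own .sai file.
  for mate in 1 2; do
    echo "$BWA aln $ref ${prefix}_${mate}.fq > ${prefix}_${mate}.sai"
  done
  # sampe combines both .sai files and both read files into one SAM.
  echo "$BWA sampe $ref ${prefix}_1.sai ${prefix}_2.sai ${prefix}_1.fq ${prefix}_2.fq > ${prefix}_paired.sam"
}

BWA_PLAN=$(bwa_paired_plan Genome/Parent113010.fa Fastq/Sample2)
echo "$BWA_PLAN"
```

Once the printed commands look right, they can be pasted into the terminal or piped to sh.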

Bowtie Unique Mapping

•  Investigate the different Bowtie options:
   –  Look at -m (maximum number of mappings allowed per read; reads exceeding it are suppressed) and -v (maximum number of mismatches per read; the per-seed limit is -n)

TopHat Spliced Mapping

•  /usr/local/src/tophat-2.0.4.Linux_x86_64/tophat -G <in.gff> -o <output_directory> <bowtie_index> <in.fastq>

The end…for now.
