Primary analysis tutorial (deprecated)
Bioinformatics Primary Analysis Tutorial
Phil Richmond, PRA Dowell Lab
University of Colorado, BioFrontiers Institute
Outline
• Intro
– Things that will be covered
– Things that won’t be covered
• Workflow
• Mapping with Bowtie
• File Conversion with Samtools
• Visualization with IGV
• Extras
Sequencing
• There are many different types of sequencing including 454, Illumina, SOLiD, IonTorrent, and more.
• If you are interested in each type of sequencing…
Things that will be covered
• The primary analysis that I will walk through is a “bare bones” analysis, meant to take your reads from the Illumina sequencer to a visualizer, along with some organizational practices
– Mapping (Bowtie/BWA)
– File format conversion
– Visualization
Things that won’t be covered
• Post/preprocessing steps that I’m leaving out include:
– FastX analysis of raw reads, adapter clipping, etc.
– PCR duplicate marking (Illumina) on raw reads
– Base Quality Score Recalibration (GATK) on mapped reads
– Local Realignment around indels on mapped reads
• Any Secondary or Tertiary analysis or scripting techniques
– Secondary analysis by personal appt.
– Scripting techniques by joining Dave Knox’s Python class
Login to Tuxedo
• Login with the -X option to open an X11 viewer.
• On a PC…see me for separate instructions to pipe visualization
• ssh -X [email protected]
Working Directory
• We will be working in /data/Tutorial/<Student>
– cd /data/Tutorial/Phil/
• The necessary files for the tutorial are in /data/Tutorial/Files/
– Parent113010.fa is the reference (E. coli) genome
– Parent120710.gff is the annotation file
– Sample1_single.fastq is the reads file we are working with
Organization
• In your own directory (/data/Tutorial/<Student>/) create the following sub-directories:
– Genome/
• Keep the fasta and gff files here
– Bowtie/
• Keep the Bowtie alignments, and post-processing of Bowtie alignments, here
– Fastq/
• Keep the raw fastq files here
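The directory layout above can also be created from a script. This is a minimal sketch, not part of the original tutorial: the function name `make_layout` is made up for illustration, and the demo writes into a temporary directory instead of /data/Tutorial/<Student>/.

```python
from pathlib import Path
import tempfile

# Sub-directory names taken from the slide.
SUBDIRS = ("Genome", "Bowtie", "Fastq")


def make_layout(student_dir):
    """Create the three tutorial sub-directories and return their paths."""
    base = Path(student_dir)
    created = []
    for name in SUBDIRS:
        path = base / name
        # parents=True also creates the student directory if it is missing;
        # exist_ok=True makes the call safe to repeat.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


if __name__ == "__main__":
    # Hypothetical demo location: a throwaway temporary directory.
    with tempfile.TemporaryDirectory() as scratch:
        for path in make_layout(Path(scratch) / "Student"):
            print(path.name)
```

On Tuxedo you would pass your own directory, for example /data/Tutorial/Phil/, in place of the temporary one.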
Workflow
Raw Reads (Fastq) → Mapping (Bowtie) → Mapped Reads (SAM) → File Conversion (SAMTOOLS) → Binary Mapped Reads (SORTED.BAM) → Visualization (IGV)
Fastq file
• File extension .fastq or .fq
• Example (each read is four lines):
– Read ID: @Read_identifier_and_flowcell_info
– Read Sequence: ACGTCCGGTTNNN…
– Read QV ID: +
– Read QV Sequence: B$!?NP\\\[%&C…
• For more info on ASCII encoding of QV scores…go to Wikipedia
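To make the four-line record concrete, here is a small sketch that reads Fastq records and turns the quality string into numeric scores. It is not from the original slides. The record in `EXAMPLE` is invented, and the offset of 33 (Phred+33, used by Sanger and Illumina 1.8+ files) is an assumption about your data: older Illumina pipelines used an offset of 64.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) for each four-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (a blank line also stops this simple sketch)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' line that separates sequence and qualities
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"bad record header: {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Convert an ASCII quality string to integer scores (assumed offset: 33)."""
    return [ord(ch) - offset for ch in quality]


# Invented single-record example, not taken from the tutorial data.
EXAMPLE = "@read1\nACGTN\n+\nIIII!\n"

if __name__ == "__main__":
    for read_id, sequence, quality in read_fastq(io.StringIO(EXAMPLE)):
        print(read_id, sequence, phred_scores(quality))
```

With a real file, replace `io.StringIO(EXAMPLE)` with `open("Sample1_single.fastq")`.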
Mapping the Short Reads
• Taking each read and mapping it to a reference genome
– Bowtie
TGCATGCATGCATGCATGCATGCATGCATGCATGCAAAAAGCATGCATGCA
TGCATGAATGCAAAAAGCATGCA
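The two sequences above (reference on top, read below) can be used to illustrate what "mapping" means. The sketch below is a brute-force, ungapped search that slides the read along the reference and counts mismatches at each offset. It is only an illustration: Bowtie itself uses an index and is far faster, and the function names here are made up.

```python
# Sequences copied from the slide: reference first, then the short read.
REFERENCE = "TGCATGCATGCATGCATGCATGCATGCATGCATGCAAAAAGCATGCATGCA"
READ = "TGCATGAATGCAAAAAGCATGCA"


def mismatches(a, b):
    """Count positions where two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def best_ungapped_hit(reference, read):
    """Return (offset, mismatch_count) of the first best ungapped placement."""
    best_pos, best_mm = None, None
    for pos in range(len(reference) - len(read) + 1):
        mm = mismatches(reference[pos:pos + len(read)], read)
        if best_mm is None or mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


if __name__ == "__main__":
    pos, mm = best_ungapped_hit(REFERENCE, READ)
    # Print the read indented under the reference at its best offset.
    print(REFERENCE)
    print(" " * pos + READ)
    print(f"offset={pos} mismatches={mm}")
```

A search like this costs time proportional to reference length times read length for every read, which is why real mappers index the genome first.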
Bowtie-Build Command
• In order to map the reads to a genome, you must acquire the genome in the .fasta (.fa) format, and then index it.
• bowtie-build -f <in.fasta> <out_prefix>
– $ bowtie-build SGDv4.fasta SGDv4_bowtie
Bowtie command
• Now we map the reads back to the reference we just indexed.
• bowtie <reference_in.prefix> -q <in.fastq> -S <out.SAM> 2> <out.stderr>
– $ bowtie /data/Tutorial/Phil/Genome/Bowtie_index/SGDv3_bowtie -q Sample1.fastq -S Sample1_bowtie.sam 2> Sample1_bowtie.stderr
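If you prefer to drive the two Bowtie steps from a script, a sketch along these lines may help. The helper names (`bowtie_build_command`, `bowtie_map_command`, `run_logged`) are made up for illustration, and the file names are the ones from the slides. Only the flags shown on the slides are used, and the demo prints the commands rather than running them, so it works without Bowtie installed.

```python
import shlex
import subprocess


def bowtie_build_command(fasta, index_prefix):
    """Argument list for indexing a FASTA reference, as on the slide."""
    return ["bowtie-build", "-f", fasta, index_prefix]


def bowtie_map_command(index_prefix, fastq, sam):
    """Argument list for mapping Fastq reads (-q) with SAM output (-S)."""
    return ["bowtie", index_prefix, "-q", fastq, "-S", sam]


def run_logged(command, stderr_path):
    """Run a command, writing its stderr to a file (the slide's '2>' redirect)."""
    with open(stderr_path, "w") as log:
        subprocess.run(command, stderr=log, check=True)


if __name__ == "__main__":
    build = bowtie_build_command("SGDv4.fasta", "SGDv4_bowtie")
    mapping = bowtie_map_command("SGDv4_bowtie", "Sample1.fastq", "Sample1_bowtie.sam")
    # Dry run: show what would be executed.
    print(shlex.join(build))
    print(shlex.join(mapping))
    # To actually run the mapping (requires bowtie on your PATH):
    # run_logged(mapping, "Sample1_bowtie.stderr")
```

Keeping the stderr file is worthwhile because Bowtie writes its summary of how many reads aligned there.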
Sam File
• Tab Delimited
• http://genome.sph.umich.edu/wiki/SAM
• Open Example SAM
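Because SAM is tab delimited, an alignment line can be split into its eleven mandatory columns in a few lines of code. The sketch below is illustrative only: the line in `EXAMPLE` is invented, not output from the tutorial data, and header lines (those starting with "@") are expected to be skipped before calling the parser.

```python
# The eleven mandatory SAM columns, in order.
SAM_FIELDS = (
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
)
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Split one alignment line into a dict of named fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(SAM_FIELDS):
        raise ValueError("expected at least 11 tab-separated columns")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in INTEGER_FIELDS:
        record[key] = int(record[key])
    record["TAGS"] = columns[11:]  # optional TAG:TYPE:VALUE fields
    # Two commonly checked FLAG bits.
    record["is_unmapped"] = bool(record["FLAG"] & 0x4)
    record["is_reverse"] = bool(record["FLAG"] & 0x10)
    return record


# Invented example line: a read mapped to the reverse strand at position 101.
EXAMPLE = "read1\t16\tchrI\t101\t255\t5M\t*\t0\t0\tACGTN\tIIII!\tNM:i:0"

if __name__ == "__main__":
    record = parse_sam_line(EXAMPLE)
    print(record["QNAME"], record["RNAME"], record["POS"], record["is_reverse"])
```

For anything beyond a quick look, samtools or a dedicated library is a better choice than hand parsing.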
Samtools Commands
• samtools view -bS <in.sam> -o <out.bam>
– $ samtools view -bS Sample1_bowtie.sam -o Sample1_bowtie.bam
• samtools sort <in.bam> <out.sorted> (this is the older samtools 0.1.x syntax; newer samtools 1.x releases expect samtools sort -o <out.sorted.bam> <in.bam>)
– $ samtools sort Sample1_bowtie.bam Sample1_bowtie.sorted
• samtools index <in.sorted.bam>
– $ samtools index Sample1_bowtie.sorted.bam
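The three Samtools steps always run in the same order, with each output name derived from the previous one, so they are easy to generate from a single SAM file name. The sketch below does only that. The function name `samtools_commands` is made up, and it reproduces the older `samtools sort <in.bam> <out.prefix>` form shown on the slide, which is an assumption about the samtools version installed on your machine.

```python
import shlex


def samtools_commands(sam_path):
    """Return the view, sort and index commands for one SAM file."""
    stem = sam_path[:-4] if sam_path.endswith(".sam") else sam_path
    bam = stem + ".bam"
    sorted_prefix = stem + ".sorted"
    return [
        # SAM -> BAM (binary) conversion.
        ["samtools", "view", "-bS", sam_path, "-o", bam],
        # Sort by coordinate; legacy syntax that writes <prefix>.bam.
        ["samtools", "sort", bam, sorted_prefix],
        # Index the sorted BAM so a viewer can jump to any region.
        ["samtools", "index", sorted_prefix + ".bam"],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for command in samtools_commands("Sample1_bowtie.sam"):
        print(shlex.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` once samtools is on your PATH.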
IGV
• Located at /data2/IGV/
• Several different versions available, recommend either:
– /data2/IGV/IGV_2.1.19/igv.jar
– /data2/IGV/IGV_1.5.64/igv.jar
• To run IGV:
– java -Xmx5g -jar <igv.jar>
– $ java -Xmx5g -jar /data2/IGV/IGV_1.5.64/igv.jar &
IGV: Creating a genome
• Refer to the instructions on the sheet.
Bowtie and Bfast IGV
[Figure: tracks labeled Bowtie, Bfast, Gene]
Advantages to Bfast Gapped Mapping
[Figure: tracks labeled Bowtie, Bfast, Gene]
Bfast Mapping Loosely
[Figure: tracks labeled Bowtie, Bfast, Gene]
If you are getting the hang of it quickly…
• Try going through the next few commands
BWA Paired end
• /usr/local/src/bwa-0.6.2/bwa index -a is -f <in.fasta>
• Map each read in the pair independently (run once per fastq file)
• /usr/local/src/bwa-0.6.2/bwa aln <reference.prefix> <in_1.fq> > <out.sai>
• Finalize the mapping by converting the .SAI and .FQ files for both reads into a final SAM alignment:
• /usr/local/src/bwa-0.6.2/bwa sampe <reference.prefix> <in_1.sai> <in_2.sai> <in_1.fq> <in_2.fq> > <out_paired.sam>
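Bowtie Unique Mapping
• Investigate the different Bowtie options:
– Look at -m (maximum number of mappings allowed per read; reads with more are suppressed) and -v (maximum number of mismatches across the whole read; -n sets the mismatches allowed in the seed)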
TopHat Spliced Mapping
• /usr/local/src/tophat-2.0.4.Linux_x86_64/tophat -G <in.gff> -o <output_directory> <bowtie_index> <in.fastq>
The end…for now.