kogo 2013 rna-seq analysis

125
RNA-SEQ ANALYSIS. Junsu Ko, Sanghoon Song, Hyunmin Kim. Theragen Bio Institute. 2012. 2. 5

Upload: junsu-ko

Post on 10-May-2015


DESCRIPTION

RNA-seq analysis

TRANSCRIPT

Page 1: Kogo 2013 RNA-seq analysis

RNA-SEQ ANALYSIS

Junsu Ko, Sanghoon Song, Hyunmin Kim

Theragen Bio Institute, 2012. 2. 5

Page 2: Kogo 2013 RNA-seq analysis

CONTENTS

• NGS

• RNA-seq

• File Format

• Workflow

• Preparation

• Filtering & QC

• Mapping

• PCR Duplication

• Expression

• DEG

• Report

Page 3: Kogo 2013 RNA-seq analysis

TODAY’S KEYWORDS

NGS: Illumina, Paired-End

RNA-seq: mRNA, Reference-based

Mapping: TopHat

Expression: Cufflinks, Cuffmerge

DEG: Cuffdiff, DESeq

Design: Replicates

File Format: Fastq, BAM

Page 4: Kogo 2013 RNA-seq analysis

NEXT-GENERATION SEQUENCING

Page 5: Kogo 2013 RNA-seq analysis

SEQUENCING

Sanger (1st Generation)

Page 6: Kogo 2013 RNA-seq analysis

NEXT-GENERATION SEQUENCING

Nat Rev Genet. 2010 Jan;11(1):31-46. doi: 10.1038/nrg2626. Epub 2009 Dec 8. Sequencing technologies - the next generation. Metzker ML.

2nd Generation

3rd Generation

Page 7: Kogo 2013 RNA-seq analysis

NGS WEAKNESS AND OVERCOMING

Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y. A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012 Jul 24;13:341.

Sanger 0.001%

Nature Biotechnology 26, 1135 - 1145 (2008), Next-generation DNA sequencing, Shendure J. and Ji H.

Page 8: Kogo 2013 RNA-seq analysis

NGS

http://users.ugent.be/~avierstr/nextgen/nextgen.html

Library Construction Sequencing

RawReads

Page 9: Kogo 2013 RNA-seq analysis

GENERAL NGS ANALYSIS PROCESS

Shearer AE, Hildebrand MS, Sloan CM, Smith RJ. Deafness in the genomics era. Hear Res. 2011 Dec;282(1-2):1-9. doi: 10.1016/j.heares.2011.10.001. Epub 2011 Oct 8.

Mapping1 WGS

Low depth < NT < High depth

3

Depth(Coverage)

2

Coverage

Speed

Page 10: Kogo 2013 RNA-seq analysis

MAPPING TOOLS

• Mapper Type

• DNA

• RNA

• miRNA

• bisulphite

Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics. 2012 Dec 1;28(24):3169-77.

BWA for WGS

TopHat for RNA

Page 12: Kogo 2013 RNA-seq analysis

ILLUMINA PAIRED-END

Quinlan AR, Boland MJ, Leibowitz ML, Shumilina S, Pehrson SM, Baldwin KK, Hall IM. Genome sequencing of mouse induced pluripotent stem cells reveals retroelement stability and infrequent DNA rearrangement during reprogramming. Cell Stem Cell. 2011 Oct 4;9(4):366-73. doi: 10.1016/j.stem.2011.07.018.

Haas BJ, Zody MC.Advancing RNA-Seq analysis.Nat Biotechnol. 2010 May;28(5):421-3. doi: 10.1038/nbt0510-421.

mate-pair inner distance

http://vallandingham.me/RNA_seq_differential_expression.html

http://users.ugent.be/~avierstr/nextgen/nextgen.html

fastq_1

fastq_2

Page 13: Kogo 2013 RNA-seq analysis

SUMMARY

• NGS platform: Short Reads, Depth, Coverage

• Sequencing Protocol

• Analysis Protocol

• Mapping

• PCR duplication

• Illumina Paired-end

Page 14: Kogo 2013 RNA-seq analysis

TRANSCRIPTOME

RNA-SEQ

Page 15: Kogo 2013 RNA-seq analysis

TRANSCRIPTOME

• The complete set of transcripts in a cell, and their quantity

• The key aims of transcriptomics are:

• to catalogue all species of transcript, including mRNAs, non-coding RNAs and small RNAs

• to determine the transcriptional structure of genes, in terms of their start sites, 5′ and 3′ ends, splicing patterns and other post-transcriptional modifications

• to quantify the changing expression levels of each transcript during development and under different conditions.

Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484.

Page 16: Kogo 2013 RNA-seq analysis

ADVANTAGES OF RNA-SEQ

Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484.

Page 17: Kogo 2013 RNA-seq analysis

RNA-SEQ & MICROARRAY

Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484.

Page 18: Kogo 2013 RNA-seq analysis

RNA-SEQ

• Gene expression level

• Relative expression level in sample

• Differentially expressed gene

• Identification of alternative spliced transcripts

• Prediction of novel transcripts

• Gene Fusion

Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009 Jan;10(1):57-63. doi: 10.1038/nrg2484.

Page 19: Kogo 2013 RNA-seq analysis

RNA-SEQ VS. DNA-SEQ

 | RNA-seq | DNA-seq
Methods | Reference-based, de novo assembly | WES, WGS re-sequencing, WGS de novo
Goal | Expression, Differentially Expressed Genes, Novel transcript, Alternative splicing form, Gene fusion | SNPs, Indels, SV
Measure | Mapped Read Count | Base accuracy

Page 20: Kogo 2013 RNA-seq analysis

OVERVIEW OF A TYPICAL RNA-SEQ

Page 21: Kogo 2013 RNA-seq analysis

RNA MAPPING

Trapnell C, Salzberg SL., How to map billions of short reads onto genomes. Nat Biotechnol. 2009 May;27(5):455-7.

Oshlack A, Robinson MD, Young MD. From RNA-seq reads to differential expression results. Genome Biol. 2010;11(12):220.

Page 22: Kogo 2013 RNA-seq analysis

MAPPER

Mapper Data Seq.Plat. Input Output Cit. Cit/year Reference

MapSplice RNA I FASTA/Q SAM, BED 50 28.17 Wang et al. (2010)

MicroRazerS miRNA N FASTA/Q SAM, TSV 7 2.75 Emde et al. (2010)

mrFAST miRNA I FASTA/Q SAM 158 58.34 Alkan et al. (2009)

mrsFAST miRNA I,So FASTA/Q SAM 32 18.03 Hach et al. (2010)

Passion RNA I,4,Sa,P FASTA/Q BED - - Zhang et al. (2012)

PatMaN miRNA N FASTA TSV 38 9.36 Prufer et al. (2008)

QPALMA RNA I,4 Specific TSV 75 21.11 De Bona et al. (2008)

RNA-Mate RNA So CFASTA BED, Counts 28 10.04 Cloonan et al. (2009)

RUM RNA I,4 FASTA/Q SAM,TSV,BED 2 2.36 Grant et al. (2011)

SOAPSplice RNA I,4 FASTA/Q TSV 3 3.54 Huang et al. (2011)

SpliceMap RNA I FASTA/Q SAM, BED 63 29.80 Au et al. (2010)

Supersplat RNA N FASTA TSV 21 9.93 Bryant Jr et al. (2010)

TopHat RNA I FASTA/Q, GFF BAM 389 121.04 Trapnell et al. (2009)

Fonseca NA, Rung J, Brazma A, Marioni JC. Tools for mapping high-throughput sequencing data. Bioinformatics. 2012 Dec 1;28(24):3169-77.

The number of citations (Cit.) was obtained from Google Scholar on April 14, 2012

Page 23: Kogo 2013 RNA-seq analysis

ANALYSIS STRATEGIES

Reference-based

• Method: uses a reference genome; the transcriptome assembly can be built upon it
• Adv.: contamination or sequencing artefacts are not a major concern
• Adv.: very sensitive and can assemble transcripts of low abundance
• Adv.: can discover novel transcripts that are not present in the current annotation
• Disadv.: depends on the quality of the reference genome being used
• Depth: ~ 10x

de novo

• Method: does not use a reference genome
• Adv.: does not depend on a reference genome
• Adv.: does not depend on the correct alignment of reads to known splice sites or the prediction of novel splicing sites
• Adv.: trans-spliced transcripts can be assembled
• Disadv.: computing resources
• Disadv.: sensitive to sequencing errors
• Depth: > 30x

Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/nrg3068.

Page 24: Kogo 2013 RNA-seq analysis

REFERENCE-BASED

Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/nrg3068.

Page 25: Kogo 2013 RNA-seq analysis

REFERENCE-BASED

Martin JA, Wang Z. Next-generation transcriptome assembly. Nat Rev Genet. 2011 Sep 7;12(10):671-82. doi: 10.1038/nrg3068.

Page 26: Kogo 2013 RNA-seq analysis

SUMMARY

• Transcriptome

• RNA-seq advantages

• Process

• Analysis strategies

• Reference-based method

Page 27: Kogo 2013 RNA-seq analysis

NGS FILE FORMAT

Page 28: Kogo 2013 RNA-seq analysis

FILE FORMAT

• NGS

• Fastq

• SAM/BAM

• VCF

• Reference

• Fasta

• GTF / GFF

Page 29: Kogo 2013 RNA-seq analysis

FASTQ FORMAT

• de facto standard file format for raw reads

• File extensions: fq, fastq, fq.gz, fastq.gz

• Each record has four lines:

1: @title identifier description
2: Sequence
3: + description
4: Quality values

Sequencer -> Fastq

Paired-end: S01_1.fq, S01_2.fq
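The four-line record layout above can be sketched in Python. This is only an illustration of the format, not part of the tutorial's toolchain; the names `FastqRecord` and `read_fastq` are hypothetical, and the sketch assumes an uncompressed text stream with one record per four lines (no line wrapping).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # line 1 without the leading "@"
    sequence: str  # line 2
    quality: str   # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ text stream, four lines at a time."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or an empty line): stop reading
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Made-up record used only to exercise the parser.
example = io.StringIO("@read1/1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
```

For gzip-compressed input such as S01_1.fq.gz, the same function could be given a handle opened with `gzip.open(path, "rt")`.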

Page 30: Kogo 2013 RNA-seq analysis

QUALITY SCORE

• The base-calling error probabilities.

• Types

• Phred33 / Illumina 1.8+: score 0 ~ 93, ASCII 33 ~ 126

• Solexa / Illumina 1.0: score -5 ~ 62, ASCII 59 ~ 126

• Phred64 / Illumina 1.3 ~ 1.5: score 0 ~ 62, ASCII 64 ~ 126

http://www.asciitable.com
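As a rough illustration of how these encodings work, the sketch below converts a quality string to Phred scores by subtracting the ASCII offset, and turns a score Q into an error probability with the standard relation P = 10^(-Q/10). The function names are hypothetical, and `guess_offset` is only a heuristic: characters below ASCII 59 can occur only in Phred33 data, while the absence of such characters is taken here as a hint (not proof) of a 64-based encoding.

```python
from typing import Iterable, List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to integer scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Base-calling error probability for Phred score q."""
    return 10 ** (-q / 10)


def guess_offset(quality_strings: Iterable[str]) -> int:
    """Heuristic guess of the ASCII offset (33 or 64) from observed characters."""
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 59 else 64


scores = phred_scores("I#")        # 'I' is ASCII 73, '#' is ASCII 35
p_q20 = error_probability(20)      # Q20 corresponds to a 1% error rate
```

A real detector should scan many reads before deciding, since a handful of high-quality Phred33 reads can look like a 64-based file.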

Page 31: Kogo 2013 RNA-seq analysis

SAM / BAM FORMAT

• SAM stands for Sequence Alignment/Map format.

• TAB-delimited text format

• 11 mandatory fields

Sequencer -> Fastq -> Mapper -> SAM/BAM

Fields shown: Read Name, Flag, Reference, Position, Quality, Pos. of Mate, Length

Page 32: Kogo 2013 RNA-seq analysis

SAM / BAM

Flag

CIGAR

SAM
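The Flag field is a bitwise combination of properties. The sketch below decodes it using the bit values from the SAM specification; the short label strings and the function name `decode_flag` are chosen here for illustration only.

```python
from typing import List

# (bit, label) pairs; bit values follow the SAM specification.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> List[str]:
    """List the labels of all bits set in a SAM flag."""
    return [label for bit, label in SAM_FLAGS if flag & bit]


# 99 = 1 + 2 + 32 + 64: a properly paired first read whose mate is on the reverse strand.
flag_99 = decode_flag(99)
```

The `duplicate` bit (0x400) is the one set by duplicate-marking tools such as Picard MarkDuplicates.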

Page 33: Kogo 2013 RNA-seq analysis

TOOLS FOR SAM/BAM

• Samtools

• index

• view

• sort

• faidx

• flagstat

• tview

• mpileup

• Picard

• SortSam

• MarkDuplicates

• ......

Page 34: Kogo 2013 RNA-seq analysis

GTF (ENSEMBL)

protein_coding, mtRNA, miRNA, lincRNA, pseudogene......

Gene ID Transcript ID

Page 35: Kogo 2013 RNA-seq analysis

SUMMARY

• Fastq format

• de facto standard

• Quality Score

• Phred33/Illumina 1.8+, Illumina 1.0, Phred64/Illumina 1.3~1.5

• SAM/BAM format

• GTF

Page 36: Kogo 2013 RNA-seq analysis

WORKFLOW

Page 37: Kogo 2013 RNA-seq analysis

REFERENCE

Page 38: Kogo 2013 RNA-seq analysis

REFERENCE WORKFLOW

Tools: TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff -> CummeRbund

Data: Sample1, Sample2 -> Mapped reads -> Assembled transcripts -> Final transcriptome assembly -> Differential expression results -> Expression plots

Page 39: Kogo 2013 RNA-seq analysis

OUR WORKFLOW

Inputs: Samples, Reference, Geneset

Steps: Filtering -> Read Mapping -> Gene Structure -> Expression Level -> DEG analysis -> Report (with Duplication and Annotation as additional steps)

Tools shown: FastQC, RSeQC, TBI-toolkit, TopHat, RUM, BWA, Bowtie2, Picard, Samtools, Cufflinks, HTseq-count, Cuffmerge, Cuffdiff, DEGseq, DESeq, cummeRbund, GO

Annotation sources: UniProt, GO, KEGG

Page 40: Kogo 2013 RNA-seq analysis

PREPARATION

Page 41: Kogo 2013 RNA-seq analysis

DIRECTORY

/KOGO/RNA-seq
  ref: chr.fa, ens.gtf, mask.gtf
  inputs: S01.fq.gz, S02.fq.gz
  outputs
    S01: accepted_hits.bam, transcripts.gtf
    ......: accepted_hits.bam, transcripts.gtf
    merged_asm: merged.gtf, transcripts.gtf
    Diff-S01-S02: gene_exp.diff, isoforms_exp.diff
  scripts
  Tools

Page 42: Kogo 2013 RNA-seq analysis

SAMPLES

 | Before exercise | After exercise
Horse 1 | S01 | S02
Horse 2 | S03 | S04

Page 43: Kogo 2013 RNA-seq analysis

TOOLS

Category | Programs | Version | Homepage
QC | FastQC | 0.10.1 | http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Mapper | Bowtie2 | 2.0.5 | http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
Mapper | TopHat | 2.0.7 | http://tophat.cbcb.umd.edu
Abundance | Cufflinks | 2.0.2 | http://cufflinks.cbcb.umd.edu
Abundance | HTseq-count | - | http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
Abundance | DESeq | 1.10.1 | http://bioconductor.org/packages/release/bioc/html/DESeq.html
Annotation | goseq | 1.10.0 | http://www.bioconductor.org/packages/2.11/bioc/html/goseq.html
Tools | samtools | 0.1.18 | http://samtools.sourceforge.net
Tools | picard | 1.83 | http://picard.sourceforge.net
Tools | TBI-toolkit | 0.1 | http://dev.totalomics.kr/
Tools | R | 2.15.0 | http://www.r-project.org
Tools | Gnuplot | - | http://www.gnuplot.info

Page 44: Kogo 2013 RNA-seq analysis

TBI-TOOLKIT

• TBI NGS Toolkit

• http://dev.totalomics.kr

• Application

• TBI-toolkit-qscore

• TBI-toolkit-fq_filter

• TBI-toolkit-gtf_selector

• TBI-toolkit-fa_spliter

• TBI-toolkit-make_matrix

Page 45: Kogo 2013 RNA-seq analysis

REFERENCE

• Reference-based strategy

Name | FileType | Description
Reference | fasta | Genome Sequence
Geneset | GTF2.2/GFF3 | Reference Geneset

Optional

Name | Source | Description
Mask Geneset | Geneset | Geneset that has ncRNA information (rRNA, tRNA, and other ncRNA)
Bowtie2 Index | Reference | Index files for running bowtie2
GO information | GO | Gene ontology information for GO enrichment

Page 46: Kogo 2013 RNA-seq analysis

REFERENCE SOURCE

• Ensembl (http://www.ensembl.org)

• General file format for all species

• Geneset (GTF format)

• Constant database schema for all species

• Comprehensive annotation (GO, InterPro, Pfam, Prosite, Smart, ......)

• Automated update

• UCSC (http://genome.ucsc.edu)

• Semi-general file format for all species

• Semi-constant database schema for all species

• Gene table dump (BED format compatible)

• Annotation (Pfam, Kegg)

• Comparative Analysis

• NCBI

• Raw data bank

• GFF type geneset file

Page 47: Kogo 2013 RNA-seq analysis

ENSEMBL

ensembl.org, plants.ensembl.org, fungi.ensembl.org, metazoa.ensembl.org, protists.ensembl.org, bacteria.ensembl.org

Page 48: Kogo 2013 RNA-seq analysis

ENSEMBL

• Homo sapiens ( ftp://ftp.ensembl.org/pub/release-69 )

• fasta/homo_sapiens/

• dna/Homo_sapiens.GRCh37.69.dna.toplevel.fa.gz

• dna/Homo_sapiens.GRCh37.69.dna.chromosome.1.fa.gz

• cdna/Homo_sapiens.GRCh37.69.cdna.all.fa.gz

• gtf/homo_sapiens/Homo_sapiens.GRCh37.69.gtf.gz

• mysql/homo_sapiens_core_69_37/

• Arabidopsis thaliana ( ftp://ftp.ensemblgenomes.org/pub/release-16/plants )

• fasta/arabidopsis_thaliana

• dna/Arabidopsis_thaliana.TAIR10.16.dna.toplevel.fa.gz

• cdna/Arabidopsis_thaliana.TAIR10.16.cdna.all.fa.gz

• gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.16.gtf.gz

• mysql/arabidopsis_thaliana_core_16_69_10/

chr.fa

ens.gtf

Page 49: Kogo 2013 RNA-seq analysis

PRE-PROCESSING

• Check quality score type of input file

• Reference files

• Reference index

• Mask geneset

Page 50: Kogo 2013 RNA-seq analysis

SAMPLE QUALITY SCORE

Run)
$ cd /KOGO/RNA-seq/inputs
$ TBI-toolkit-qscore S01_1.fq.gz
Sanger(Phred33) or Illumina 1.8+
0 to 93 using ASCII 33 to 126
0:!, 1:", 2:#, 3:$, 4:%, 5:&, ......

Usage)
$ TBI-toolkit-qscore [FASTQ]
Sanger(Phred33) or Illumina 1.8+
0 to 93 using ASCII 33 to 126

Page 51: Kogo 2013 RNA-seq analysis

REFERENCE INDEX

Index for bowtie2 mapper

Usage)
$ bowtie2-build [options] <reference_in> <bt2_base>

Run)
$ cd /KOGO/RNA-seq/ref
$ bowtie2-build chr.fa chr.fa
$ ls
chr.fa.1.bt2 chr.fa.2.bt2 ......

Fasta index

Usage)
$ samtools faidx <ref.fasta>

Run)
$ cd /KOGO/RNA-seq/ref
$ samtools faidx chr.fa
$ ls
chr.fa.fai

Page 52: Kogo 2013 RNA-seq analysis

MASK GENESET

Run)
$ cd /KOGO/RNA-seq/ref
$ TBI-toolkit-gtf_selector ens.gtf mask.gtf tRNA rRNA Mt_tRNA Mt_rRNA

Usage)
$ TBI-toolkit-gtf_selector [IN GTF] [OUT GTF] [Source 1] [Source 2] ......

"...... We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file. Due to variable efficiency of mRNA enrichment methods and rRNA depletion kits, masking these transcripts often improves the overall robustness of transcript abundance estimates."

Cufflinks manual (http://cufflinks.cbcb.umd.edu/manual.html)
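What a source-based GTF selection might look like can be sketched in Python. This is not the TBI-toolkit implementation, whose internals are not shown in the slides; it is a minimal sketch that assumes the selection is made on the second GTF column (source), which in this Ensembl geneset carries values such as rRNA or Mt_tRNA. The function name `select_gtf_lines` is hypothetical.

```python
from typing import Iterable, Iterator


def select_gtf_lines(lines: Iterable[str], sources: Iterable[str]) -> Iterator[str]:
    """Yield GTF lines whose source column (column 2) is one of `sources`."""
    wanted = set(sources)
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[1] in wanted:
            yield line


# Made-up GTF lines used only to exercise the filter.
gtf = [
    '1\trRNA\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    '1\tprotein_coding\texon\t300\t400\t.\t+\t.\tgene_id "G2"; transcript_id "T2";\n',
    'MT\tMt_tRNA\texon\t10\t80\t.\t-\t.\tgene_id "G3"; transcript_id "T3";\n',
]
mask = list(select_gtf_lines(gtf, ["tRNA", "rRNA", "Mt_tRNA", "Mt_rRNA"]))
```

Writing the selected lines to mask.gtf would give a file usable with the Cufflinks -M/--mask-file option described later.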

Page 53: Kogo 2013 RNA-seq analysis

SUMMARY

• Directory

• /KOGO/RNA-seq

• Tools

• Reference

• Pre-processing

Page 54: Kogo 2013 RNA-seq analysis

FILTERING & QC


Page 55: Kogo 2013 RNA-seq analysis

FILTERING & QC

• Improving assembly accuracy

• Removing artifacts

• Sequencing adaptor

• Low quality reads

• Near-identical reads

• PCR amplification

• rRNA and other RNA

• Applications

• Filtering - TBI-toolkit, fastx-toolkit

• QC - FastQC, SolexaQC, RSeQC


Page 56: Kogo 2013 RNA-seq analysis

QUALITY CONTROL

• FastQC ( v0.10.1 )

• A quality control tool for high throughput sequence data.

• Java

• http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

• RSeQC

• RSeQC package provides a number of useful modules that can comprehensively evaluate high throughput sequence data especially RNA-seq data

• http://code.google.com/p/rseqc/

Page 57: Kogo 2013 RNA-seq analysis

FASTQC

Usages)
$ fastqc seqfile1 seqfile2 .. seqfileN
$ fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam] [-c contaminant file] seqfile1 .. seqfileN

Arguments
-f format : bam, sam, bam_mapped, sam_mapped and fastq
-t threads

Run)
$ cd /KOGO/RNA-seq/inputs
$ fastqc -f fastq -t 2 S01_1.fq.gz S01_2.fq.gz

Output)
$ firefox S01_1.fq_fastqc/fastqc_report.html
$ firefox S01_2.fq_fastqc/fastqc_report.html

Page 58: Kogo 2013 RNA-seq analysis

FASTQC

• Per Base Sequence Quality
• Per Sequence Quality Scores
• Per Base Sequence Content
• Per Base GC Content
• Per Sequence GC Content
• Per Base N Content
• Sequence Length Distribution
• Duplicate Sequences

Page 59: Kogo 2013 RNA-seq analysis

RSEQC

Page 60: Kogo 2013 RNA-seq analysis

READ FILTERING (CUTOFF)

 | RNA-seq | DNA-seq
Low Quality | N > 10%; Average QV < Q20; NT (<Q20) > 40% | N > 10%; Average QV < Q20; NT (<Q20) > 5%
Trimming | No trimming or Trimming | Trimming
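The RNA-seq cutoffs in the table can be written as a small predicate. The sketch below is an assumption about how the three criteria combine (a read is discarded if any one of them is violated) and assumes Phred33 quality strings; it is not the TBI-toolkit code, and `passes_filter` and its parameter names are hypothetical.

```python
def passes_filter(
    sequence: str,
    quality: str,
    max_n_ratio: float = 0.1,       # discard if N > 10%
    min_average_qv: float = 20.0,   # discard if average QV < Q20
    max_low_qv_ratio: float = 0.4,  # discard if bases below Q20 > 40%
    low_qv: int = 20,
    offset: int = 33,
) -> bool:
    """Return True if the read survives all three low-quality cutoffs."""
    length = len(sequence)
    if length == 0 or length != len(quality):
        return False
    scores = [ord(ch) - offset for ch in quality]
    n_ratio = sequence.upper().count("N") / length
    average_qv = sum(scores) / length
    low_ratio = sum(1 for s in scores if s < low_qv) / length
    return (
        n_ratio <= max_n_ratio
        and average_qv >= min_average_qv
        and low_ratio <= max_low_qv_ratio
    )


good = passes_filter("ACGTACGTAC", "IIIIIIIIII")      # all bases Q40
too_many_n = passes_filter("NNACGTACGT", "IIIIIIIIII")  # 20% N
too_many_low = passes_filter("ACGTACGTAC", "#####IIIII")  # 50% of bases at Q2
```

For paired-end data one would presumably drop both mates when either fails, so that the two output files stay in sync; the slides do not state how the tool handles this. Setting `max_low_qv_ratio=0.05` would correspond to the stricter DNA-seq column.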

Page 61: Kogo 2013 RNA-seq analysis

FILTERING

Usages)
$ TBI-toolkit-fq_filter [option*] seqfile_1 seqfile_2 output_1 output_2

Option)
-n N_ratio
-a integer : Average QV of read
-m NT_ratio < QV

Run)
$ cd /KOGO/RNA-seq/inputs
$ TBI-toolkit-fq_filter -n 0.1 -m 0.4 -a 20 S01_1.fq.gz S01_2.fq.gz S01_Q20_1.fq.gz S01_Q20_2.fq.gz
$ ls
S01_Q20_1.fq.gz S01_Q20_2.fq.gz S01_Q20.log S01_Q20.err
$ cat S01_Q20.log
$ less S01_Q20.err

Page 62: Kogo 2013 RNA-seq analysis

FASTQC

Run)
$ cd /KOGO/RNA-seq/inputs
$ fastqc -f fastq -t 2 S01_Q20_1.fq.gz S01_Q20_2.fq.gz

Page 63: Kogo 2013 RNA-seq analysis

SUMMARY

• Read Quality

• FastQC

• RSeQC

• Filter

Page 64: Kogo 2013 RNA-seq analysis

MAPPING READS(TOPHAT)


Page 65: Kogo 2013 RNA-seq analysis

TOPHAT

• TopHat is a fast splice junction mapper for RNA-Seq reads.

• It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.


Trapnell C, Salzberg SL. How to map billions of short reads onto genomes. Nat Biotechnol. 2009 May;27(5):455-7.

Page 66: Kogo 2013 RNA-seq analysis

USAGE

Usage
$ tophat [options] <bowtie_index_base> <reads1_1> <reads1_2>

Option | Value | Description
-o/--output-dir | string | The default is "./tophat_out".
-p/--num-threads | int | Use this many threads to align reads. The default is 1.
-r/--mate-inner-dist | int | This is the expected (mean) inner distance between mate pairs. The default is 50bp.
--mate-std-dev | int | The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp.
--library-type | fr-unstranded, fr-firststrand, fr-secondstrand | fr-unstranded: Standard Illumina; fr-firststrand: dUTP, NSR, NNSR; fr-secondstrand: Ligation, Standard SOLiD
--solexa-quals | - | Use the Solexa scale for quality values in FASTQ files.
--solexa1.3-quals | - | Phred64/Illumina 1.3~1.5
-G/--GTF | Geneset | Geneset (GTF 2.2 or GFF3 formatted file)
--rg-id | string | Read group ID
--rg-sample | string | Sample ID
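The value for -r/--mate-inner-dist is the gap between the two mates, that is, the fragment length minus both read lengths. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the fragment and read lengths used are illustrative values, since the slides do not state the library's actual insert size or read length.

```python
def mate_inner_distance(fragment_length: int, read_length: int) -> int:
    """Inner distance between mates: fragment length minus both reads."""
    return fragment_length - 2 * read_length


# Fragments selected at 300 bp with 50 bp reads at each end.
example_r = mate_inner_distance(300, 50)

# One combination (assumed, not from the slides) that would give -r 170.
tutorial_like_r = mate_inner_distance(370, 100)
```

The fragment length here excludes adapters. When it is unknown, it can be estimated after a first mapping pass from the insert-size distribution of properly paired reads.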

Page 67: Kogo 2013 RNA-seq analysis

RUN

$ cd /KOGO/RNA-seq/outputs
$ tophat -o S01 -p 1 -r 170 --library-type fr-unstranded -G ../ref/ens.gtf --rg-id S01_Q20 --rg-sample S01_Q20 ../ref/chr.fa ../inputs/S01_Q20_1.fq.gz ../inputs/S01_Q20_2.fq.gz

Category | Option | Value
Output | -o/--output-dir | /KOGO/RNA-seq/outputs/S01
Thread | -p/--num-threads | 1
Inner Distance Mean | -r/--mate-inner-dist | 170
Inner Distance SD | --mate-std-dev | 20 (default)
Library Type | --library-type | fr-unstranded (Standard Illumina)
Quality Score | | Phred33 (default)
Geneset | -G/--GTF | /KOGO/RNA-seq/ref/ens.gtf
Read Group | --rg-id, --rg-sample | S01_Q20

check

Page 68: Kogo 2013 RNA-seq analysis

ALGORITHM

Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009 May 1;25(9):1105-11.

Page 69: Kogo 2013 RNA-seq analysis

TOPHAT

• Two step method

• Extracting the transcript sequences and using Bowtie to align reads to this virtual transcriptome first.

• Only the reads that do not fully map to the transcriptome will then be mapped on the genome.

• Optimized for reads >= 75bp

• The values in the first column of the provided GTF/GFF file must match the name of the reference sequence in the Bowtie index you are using with TopHat.

Page 70: Kogo 2013 RNA-seq analysis

OUTPUT

Filename | Type | Description
accepted_hits.bam | BAM | A list of read alignments, stored as BAM (binary SAM) and coordinate-sorted
unmapped.bam | BAM | A list of unmapped reads, stored as BAM (binary SAM)
junctions.bed | UCSC BED | A track of junctions reported by TopHat
insertions.bed | UCSC BED | chromLeft refers to the last genomic base before the insertion
deletions.bed | UCSC BED | chromLeft refers to the first genomic base of the deletion

Page 71: Kogo 2013 RNA-seq analysis

SIMPLE ALIGNMENT VIEW

Usage
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools index accepted_hits.bam
$ samtools tview accepted_hits.bam ../../ref/chr.fa

Key | Desc
? | This window
Arrows | Small scroll movement
H,J,K,L | Large scroll movement
space | Scroll one screen
backspace | Scroll back one screen
g | Go to specific location
m | Color for mapping qual
n | Color for nucleotide
b | Color for base quality
. | Toggle on/off dot view
q | Exit

Example location: 25:413751

Page 72: Kogo 2013 RNA-seq analysis

MAPPING STATISTICS

Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools flagstat accepted_hits.bam
45338688 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
45338688 + 0 mapped (100.00%:-nan%)
45338688 + 0 paired in sequencing
22757885 + 0 read1
22580803 + 0 read2
39796048 + 0 properly paired (87.78%:-nan%)
42308960 + 0 with itself and mate mapped
3029728 + 0 singletons (6.68%:-nan%)
705846 + 0 with mate mapped to a different chr
92166 + 0 with mate mapped to a different chr (mapQ>=5)

Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ bam_stat.py -i accepted_hits.bam
Total Reads (Records): 45338688
QC failed: 0
Optical/PCR duplicate: 0
Non Primary Hits: 1861695
Unmapped reads: 0
Multiple mapped reads: 586067
Uniquely mapped: 42890926
Read-1: 21527100
Read-2: 21363826
Reads map to '+': 21457407
Reads map to '-': 21433519
Non-splice reads: 32872272
Splice reads: 10018654
Reads mapped in proper pairs: 38402964

Page 73: Kogo 2013 RNA-seq analysis

SUMMARY

• TopHat

• Splice junction

• Geneset

• Two step method

• accepted_hits.bam

Page 74: Kogo 2013 RNA-seq analysis

PCR DUPLICATES(OPTIONAL)


Page 75: Kogo 2013 RNA-seq analysis

PCR DUPLICATION

• Remove reads that have the same mapping coordinates.

• Tools

• samtools - rmdup

• Picard - MarkDuplicates

Run)
$ cd /KOGO/RNA-seq/outputs/S01/
$ samtools rmdup accepted_hits.bam accepted_hits.rmdup.bam

Run)
$ cd /KOGO/RNA-seq/outputs/S01/
$ java -jar /KOGO/RNA-seq/Tools/Picard/MarkDuplicates.jar INPUT=accepted_hits.bam OUTPUT=accepted_hits.mark_dup.bam ASSUME_SORTED=true REMOVE_DUPLICATES=true METRICS_FILE=accepted_hits.metric

Page 76: Kogo 2013 RNA-seq analysis

PCR DUPLICATION

samtools flagstat counts (QC-passed reads; all QC-failed counts are 0)

Metric | accepted_hits.bam | samtools | Picard (Mark) | Picard (Remove)
in total | 45338688 | 29259330 | 45338688 | 27621444
duplicates | 0 | 0 | 17717244 | 0
mapped | 45338688 | 29259330 | 45338688 | 27621444
paired | 45338688 | 29259330 | 45338688 | 27621444
read1 | 22757885 | 14717809 | 22757885 | 13820471
read2 | 22580803 | 14541521 | 22580803 | 13800973
properly paired | 39796048 (87.78%) | 24471885 (83.64%) | 39796048 (87.78%) | 24945306 (90.31%)
with itself and mate mapped | 42308960 | 26229602 | 42308960 | 26660814
singletons | 3029728 (6.68%) | 3029728 (10.35%) | 3029728 (6.68%) | 960630 (3.48%)
with mate mapped to a different chr | 705846 | 705846 | 705846 | 655922
with mate mapped to a different chr (mapQ>=5) | 92166 | 92166 | 92166 | 52614
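The relationship between the Picard (Mark) and Picard (Remove) results can be checked with a few lines of Python. The sketch below parses flagstat-style text into counts and derives the duplicate fraction; the parser is a simplified assumption about the flagstat line layout (`<passed> + <failed> <label>`), and the function name `parse_flagstat` is hypothetical.

```python
import re
from typing import Dict

FLAGSTAT_LINE = re.compile(r"^(\d+) \+ (\d+) (.+)$")
# Strip trailing "(…%…)" or "(QC-passed …)" notes, but keep "(mapQ>=5)".
TRAILING_NOTE = re.compile(r"\s*\((?:QC-passed.*|[^()]*%[^()]*)\)$")


def parse_flagstat(text: str) -> Dict[str, int]:
    """Map each flagstat label to its QC-passed read count."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        match = FLAGSTAT_LINE.match(line.strip())
        if match:
            label = TRAILING_NOTE.sub("", match.group(3))
            counts[label] = int(match.group(1))
    return counts


# First lines of the Picard (Mark) result shown above.
marked = parse_flagstat(
    "45338688 + 0 in total (QC-passed reads + QC-failed reads)\n"
    "17717244 + 0 duplicates\n"
    "39796048 + 0 properly paired (87.78%:-nan%)\n"
)
remaining = marked["in total"] - marked["duplicates"]
duplicate_fraction = marked["duplicates"] / marked["in total"]
```

The remaining count equals the Picard (Remove) total of 27621444, and roughly 39% of the mapped reads are flagged as duplicates in this sample.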

Page 77: Kogo 2013 RNA-seq analysis

EXPRESSION(CUFFLINKS)


Page 78: Kogo 2013 RNA-seq analysis

EXPRESSION & MODELING

Adam Roberts et al., Identification of novel transcripts in annotated genomes using RNA-Seq. Bioinformatics, 2011, 27:2325–2329

Page 79: Kogo 2013 RNA-seq analysis

NORMALIZATION

• Read counts need to be properly normalized to extract meaningful expression estimates

• First, RNA fragmentation during library construction causes longer transcripts to generate more reads compared to shorter transcripts present at the same abundance in the sample

• Second, the variability in the number of reads produced for each run causes fluctuations in the number of fragments mapped across samples

Garber M, Grabherr MG, Guttman M, Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011 Jun;8(6):469-77.

Page 80: Kogo 2013 RNA-seq analysis

RPKM

Reads per kilobase of transcript per million mapped reads: a relative expression level within a sample

RPKM = 10^9 × C / (N × L)

• C : the number of mappable reads that fell onto the gene's exons

• N : the total number of mappable reads in the experiment

• L : the sum of the exon lengths in base pairs

Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L., and Wold, B. (2008). Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods, 5(7):621-628.
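With C, N and L defined as above, the usual form of the measure is RPKM = 10^9 × C / (N × L): the factor 10^9 combines the per-kilobase (10^3) and per-million-reads (10^6) scalings. A minimal sketch, with made-up input numbers and a hypothetical function name:

```python
def rpkm(exon_reads: int, total_mapped_reads: int, exon_length_bp: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    if total_mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("total reads and exon length must be positive")
    return 1e9 * exon_reads / (total_mapped_reads * exon_length_bp)


# Hypothetical gene: 1,000 reads on 2,000 bp of exons, 20 million mapped reads.
value = rpkm(1000, 20_000_000, 2000)
```

Doubling the exon length at the same read count halves the value, which is how the measure compensates for the length bias described on the normalization slide.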

Page 81: Kogo 2013 RNA-seq analysis

CUFFLINKS

• Cufflinks assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples

• Cufflinks constructs a parsimonious set of transcripts that "explain" the reads observed in an RNA-Seq experiment


http://cufflinks.cbcb.umd.edu/index.html

Page 82: Kogo 2013 RNA-seq analysis

CUFFLINKS PACKAGE

• cufflinks

• assembles transcripts

• estimates their abundances

• cuffmerge

• a script called cuffmerge that you can use to merge together several Cufflinks assemblies.

• cuffdiff

• tests for differential expression

Page 83: Kogo 2013 RNA-seq analysis

USAGE

$ cufflinks [options] <aligned_reads.(sam/bam)>

Option | Value | Description
-o/--output-dir | string | Sets the name of the directory in which Cufflinks will write all of its output. The default is "./".
-p/--num-threads | int | Use this many threads to align reads. The default is 1.
-G/--GTF | geneset | Use the supplied reference annotation (a GFF file) to estimate isoform expression. It will not assemble novel transcripts.
-g/--GTF-guide | geneset | Use the supplied reference annotation (GFF) to guide RABT assembly. Output will include all reference transcripts as well as any novel genes and isoforms that are assembled.
-M/--mask-file | mask geneset | Ignore all reads that could have come from transcripts in this GTF file. We recommend including any annotated rRNA, mitochondrial transcripts other abundant transcripts you wish to ignore in your analysis in this file.
--library-type | fr-unstranded, fr-firststrand, fr-secondstrand | fr-unstranded: Standard Illumina; fr-firststrand: dUTP, NSR, NNSR; fr-secondstrand: Ligation, Standard SOLiD

Purpose: -G for quantification, -g for novel isoforms, -M for improving accuracy

Page 84: Kogo 2013 RNA-seq analysis

RUN

$ cd /KOGO/RNA-seq/outputs
$ cufflinks -o S01 -p 1 --library-type fr-unstranded -g ../ref/ens.gtf -M ../ref/mask.gtf S01/accepted_hits.bam

Category Option Value

Output -o/--output-dir /KOGO/RNA-seq/outputs/S01

Thread -p/--num-threads 1

Guide Geneset -g/--GTF-guide /KOGO/RNA-seq/ref/ens.gtf

Mask Geneset -M/--mask-file /KOGO/RNA-seq/ref/mask.gtf

Library Type --library-type fr-unstranded

Page 85: Kogo 2013 RNA-seq analysis

ALGORITHM

Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, Salzberg SL, Wold BJ, Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010 May;28(5):511-5.

Page 86: Kogo 2013 RNA-seq analysis

CUFFLINKS EXPRESSION

• FPKM

• Fragments Per Kilobase of exon per Million fragments mapped

• analogous to single-read “RPKM”

• Isoform expression estimation

• maximum likelihood estimation

• Normalization

• by total number of mapped reads

• by upper quartile method

Page 87: Kogo 2013 RNA-seq analysis

OUTPUT

File Description

transcripts.gtf The GTF file contains Cufflinks' assembled isoforms

isoforms.fpkm_tracking The estimated isoform-level expression values in the generic FPKM Tracking Format.

genes.fpkm_tracking The estimated gene-level expression values in the generic FPKM Tracking Format.

Page 88: Kogo 2013 RNA-seq analysis

TRANSCRIPTS.GTF

Col. | Name | Example | Description
1 | seqname | chrX | Chromosome or contig name
2 | source | Cufflinks | The name of the program that generated this file (always 'Cufflinks')
3 | feature | exon | The type of record (always either "transcript" or "exon")
4 | start | 77696957 | The leftmost coordinate of this record (where 1 is the leftmost possible coordinate)
5 | end | 77712009 | The rightmost coordinate of this record, inclusive
6 | score | 1000 | The most abundant isoform for each gene is assigned a score of 1000. Minor isoforms are scored by the ratio (minor FPKM/major FPKM)
7 | strand | + | Cufflinks' guess for which strand the isoform came from. Always one of "+", "-", "."
8 | frame | . | Cufflinks does not predict where the start and stop codons (if any) are located within each transcript, so this field is not used
9 | attributes | ... | See below

Page 89: Kogo 2013 RNA-seq analysis

TRANSCRIPTS.GTF

Attribute Example Description

gene_id CUFF.1 Cufflinks gene id

transcript_id CUFF.1.1 Cufflinks transcript id

FPKM 101.267 Isoform-level relative abundance in Fragments Per Kilobase of exon model per Million mapped fragments

frac 0.7647 Reserved. Please ignore, as this attribute may be deprecated in the future

conf_lo 0.07 Lower bound of the 95% confidence interval of the abundance of this isoform, as a fraction of the isoform abundance. That is, lower bound = FPKM * (1.0 - conf_lo)

conf_hi 0.1102 Upper bound of the 95% confidence interval of the abundance of this isoform, as a fraction of the isoform abundance. That is, upper bound = FPKM * (1.0 + conf_hi)

cov 100.765 Estimate for the absolute depth of read coverage across the whole transcript

full_read_support yes When RABT assembly is used, this attribute reports whether or not all introns and internal exons were fully covered by reads from the data.
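The attributes column is a semicolon-separated list of `key "value";` pairs. A minimal sketch of reading it into a dictionary is shown below; the example string is made up from the values in the table above, and the function name `parse_attributes` is hypothetical.

```python
import re
from typing import Dict

# One attribute looks like: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(column9: str) -> Dict[str, str]:
    """Parse the ninth GTF column into a key -> value dictionary."""
    return {key: value for key, value in ATTRIBUTE.findall(column9)}


attrs = parse_attributes(
    'gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "101.267"; cov "100.765";'
)
fpkm = float(attrs["FPKM"])
```

All values come back as strings, so numeric attributes such as FPKM and cov need an explicit conversion.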

Page 90: Kogo 2013 RNA-seq analysis

FPKM TRACKING FILES

Col. | Name | Example | Description
1 | tracking_id | TCONS_00000001 | A unique identifier describing the object (gene, transcript, CDS, primary transcript)
2 | class_code | = | The class_code attribute for the object, or "-" if not a transcript, or if class_code isn't present
3 | nearest_ref_id | NM_008866.1 | The reference transcript to which the class code refers, if any
4 | gene_id | NM_008866 | The gene_id(s) associated with the object
5 | gene_short_name | Lypla1 | The gene_short_name(s) associated with the object
6 | tss_id | TSS1 | The tss_id associated with the object, or "-" if not a transcript/primary transcript, or if tss_id isn't present
7 | locus | chr1:4797771-4835363 | Genomic coordinates for easy browsing to the object
8 | length | 2447 | The number of base pairs in the transcript, or '-' if not a transcript/primary transcript
9 | coverage | 43.4279 | Estimate for the absolute depth of read coverage across the object
10 | FPKM | 8.01089 | FPKM of the object in sample
11 | FPKM_lo | 7.03583 | The lower bound of the 95% confidence interval on the FPKM of the object in sample
12 | FPKM_hi | 8.98595 | The upper bound of the 95% confidence interval on the FPKM of the object in sample
13 | status | OK | OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL

Page 91: Kogo 2013 RNA-seq analysis

SIMPLE STATISTICS

$ cd /KOGO/RNA-seq/outputs/S01
# Check the highest expressed genes
$ sort -r -g -k 10 genes.fpkm_tracking | head -n 30
# Select the tracking_id and FPKM columns
$ cut -f 1,10 genes.fpkm_tracking > gene_fpkm_s
$ R
> data <- read.table("gene_fpkm_s", header=TRUE)
> fpkm_s <- as.numeric(data[,2])
> mean(fpkm_s)
> sd(fpkm_s)
> fpkm_s.log10 <- log(fpkm_s+1,10)
> bin_seq = seq(min(fpkm_s.log10-0.1),max(fpkm_s.log10+0.1),by=0.1)
> hist(fpkm_s.log10, breaks=bin_seq, xlab='log10(x+1)', ylab='Number of genes', axes=TRUE)
> boxplot(fpkm_s.log10)
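The same summary can be sketched in Python, reading the columns by header name instead of by position. This is only an illustration: the function names are hypothetical, and the three toy rows stand in for a real genes.fpkm_tracking file (only the tracking_id and FPKM columns described in the table are assumed).

```python
import csv
import io
import math
import statistics


def read_fpkm(handle, column="FPKM"):
    """Map tracking_id to FPKM from a tab-separated *.fpkm_tracking file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["tracking_id"]: float(row[column]) for row in reader}


def summarize(fpkm_values):
    """Return mean, sample standard deviation, and log10(x+1) values."""
    values = list(fpkm_values)
    logs = [math.log10(v + 1.0) for v in values]
    return statistics.mean(values), statistics.stdev(values), logs


# Toy stand-in for genes.fpkm_tracking (made-up rows, two columns only).
toy = io.StringIO("tracking_id\tFPKM\ngeneA\t0\ngeneB\t9\ngeneC\t99\n")
fpkm_by_gene = read_fpkm(toy)
mean_fpkm, sd_fpkm, log_fpkm = summarize(fpkm_by_gene.values())
print(mean_fpkm, sd_fpkm, log_fpkm)
```

The log10(x+1) transform keeps genes with FPKM 0 in the histogram, which a plain log would drop.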

Page 92: Kogo 2013 RNA-seq analysis

SUMMARY

• Expression Level

• Normalization

• RPKM (FPKM)

• Length Bias

• Cufflinks

• Isoforms

• maximum likelihood estimation
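To make the normalization and length-bias points concrete, here is a minimal sketch of the plain FPKM definition (fragments per kilobase of transcript per million mapped fragments). It is not Cufflinks' likelihood-based estimator; the function name and the counts are illustrative assumptions.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Plain FPKM: fragments / (length in kb * mapped fragments in millions)."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


# Two hypothetical transcripts in a library of 20 million mapped fragments.
# The longer one collects twice the fragments, but only because it is twice
# as long, so both end up with the same FPKM.
short_tx = fpkm(fragments=500, transcript_length_bp=1_000, total_mapped_fragments=20_000_000)
long_tx = fpkm(fragments=1_000, transcript_length_bp=2_000, total_mapped_fragments=20_000_000)
print(short_tx, long_tx)
```

Dividing by length removes the effect of longer transcripts producing more fragments; dividing by library size makes samples of different depth comparable.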

Page 93: Kogo 2013 RNA-seq analysis

CUFFMERGE

[Workflow diagram: RSeQC, FastQC, Filtering, Read Mapping, Gene Structure, Expression Level, DEG analysis, Report, Annotation, Duplication]

Page 94: Kogo 2013 RNA-seq analysis

CUFFMERGE

• Use to merge together several Cufflinks assemblies

• Automatically filters a number of transfrags that are probably artifacts

• The main purpose of this script is to make it easier to make an assembly GTF file suitable for use with Cuffdiff


Trapnell C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc. 2012 Mar 1;7(3):562-78. doi: 10.1038/nprot.2012.016.

Page 95: Kogo 2013 RNA-seq analysis

USAGE

$ cuffmerge [options] <assembly_GTF_list.txt>

Option Value Description

-o <outprefix> Write the summary stats into the text output file <outprefix>(instead of stdout)

-g/--ref-gtf geneset An optional "reference" annotation GTF. The input assemblies are merged together with the reference GTF and included in the final output.

-p/--num-threads <int> Use this many threads to merge assemblies. The default is 1.

-s/--ref-sequence <seq_dir>/<seq_fasta> This argument should point to the genomic DNA sequences for the reference. If a directory, it should contain one fasta file per contig. If a multifasta file, all contigs should be present. The merge script will pass this option to cuffcompare, which will use the sequences to assist in classifying transfrags and excluding artifacts (e.g. repeats). For example, Cufflinks transcripts consisting mostly of lower-case bases are classified as repeats. Note that <seq_dir> must contain one fasta file per reference chromosome, and each file must be named after the chromosome, and have a .fa or .fasta extension.

Page 96: Kogo 2013 RNA-seq analysis

RUN

$ cd /KOGO/RNA-seq/outputs
$ find ./ -iname transcripts.gtf > gtf_list.txt
$ cuffmerge -p 1 -g ../ref/ens.gtf -s ../ref/chr.fa gtf_list.txt

Category Option Value

Outputprefix -o /KOGO/RNA-seq/outputs

Geneset -g/--ref-gtf /KOGO/RNA-seq/ref/ens.gtf

Thread -p/--num-threads 1

Reference -s/--ref-sequence /KOGO/RNA-seq/ref/chr.fa
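The assembly list passed to cuffmerge is just a text file with one GTF path per line. The sketch below builds such a list in Python, as an alternative to the find command. The function name and the skip_dirs parameter are hypothetical; skipping merged_asm is an assumption meant to keep a previous merge result out of the list when the step is rerun.

```python
import tempfile
from pathlib import Path


def write_gtf_list(root, list_path, name="transcripts.gtf", skip_dirs=("merged_asm",)):
    """Collect per-sample assembly GTFs under root and write one path per line."""
    root = Path(root)
    paths = sorted(
        str(p)
        for p in root.rglob("*")
        if p.is_file()
        and p.name.lower() == name                       # case-insensitive, like find -iname
        and not any(part in skip_dirs for part in p.relative_to(root).parts)
    )
    Path(list_path).write_text("".join(path + "\n" for path in paths))
    return paths


# Demo on a throwaway directory tree with made-up sample folders.
with tempfile.TemporaryDirectory() as tmp:
    for sample in ("S01", "S02", "merged_asm"):
        folder = Path(tmp) / sample
        folder.mkdir()
        (folder / "transcripts.gtf").write_text("")
    found = write_gtf_list(tmp, Path(tmp) / "gtf_list.txt")
    print(len(found), "assemblies listed")
```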

Page 97: Kogo 2013 RNA-seq analysis

RUN

$ cd /KOGO/RNA-seq/outputs/merged_asm
$ less transcripts.gtf
$ less merged.gtf
$ gffread -g /KOGO/RNA-seq/ref/chr.fa -w transcripts.fa transcripts.gtf
$ head transcripts.fa
>CUFF.11.1 gene=CUFF.11
GTGCATGTAACCCAAGAAGGGTTTGGCTGGGGGCTGTGGCAGCGCCAGAGTTCTGTTCGAATCCCAATTGGGTTCTGGTCACAGATTTGGCATGGAGCAGAAGAGAGATACAGCATGGTTGAAAAGCAGTTATTGGCTAC
$ grep '>' transcripts.fa | head -n 30
>CUFF.2.1 gene=CUFF.2
>CUFF.11.1 gene=CUFF.11
>ENSGALT00000015891 gene=CUFF.11
>CUFF.12.1 gene=CUFF.12

Page 98: Kogo 2013 RNA-seq analysis

DEG ANALYSIS

[Workflow diagram: RSeQC, FastQC, Filtering, Read Mapping, Gene Structure, Expression Level, DEG analysis, Report, Annotation, Duplication]

Page 99: Kogo 2013 RNA-seq analysis

DIFFERENTIALLY EXPRESSED GENE

• Genes whose transcript abundance differs between conditions


Zhang et al., Mol Cancer Res June 2006 4; 401

Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25. doi: 10.1186/gb-2010-11-3-r25. Epub 2010 Mar 2.

Page 100: Kogo 2013 RNA-seq analysis

LENGTH BIAS

Oshlack A, Wakefield MJ. Transcript length bias in RNA-seq data confounds systems biology. Biol Direct. 2009 Apr 16;4:14.

Page 101: Kogo 2013 RNA-seq analysis

BIAS

Robinson MD, Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010;11(3):R25.

Page 102: Kogo 2013 RNA-seq analysis

REPLICATES

Technical Replicates
Source: Same samples
Purpose: The reproducibility of the results
Issue: The differences are based only on technical issues in the measurement

Biological Replicates
Source: Different samples
Purpose: A quantity from different sources under the same conditions
Issue: What is similar in your replicates and how they are different from a different set of conditions

Taylor S, Wakem M, Dijkman G, Alsarraj M, Nguyen M. A practical approach to RT-qPCR-Publishing data that conform to the MIQE guidelines. Methods. 2010 Apr;50(4):S1-5. doi: 10.1016/j.ymeth.2010.01.005.

http://wiki.answers.com/Q/What_is_defference_between_Biological_replicates_and_technical_replicates

More variance, More useful

Page 103: Kogo 2013 RNA-seq analysis

DEG METHODS

              Cuffdiff               DEGseq                 DESeq
Model         -                      Poisson                Negative binomial
Level         Isoform                Gene                   Gene
Input         geneset, BAM files     Raw read count         Raw read count
Replicates    Technical replicates   Technical replicates   Biological replicates
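The difference between the Poisson and negative binomial models is in how the variance of a count relates to its mean. A minimal sketch, assuming the usual parameterization in which the negative binomial variance is mean + dispersion * mean^2; the dispersion value used here is made up for illustration.

```python
def poisson_variance(mean):
    """Poisson counts: the variance equals the mean (sampling noise only)."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Negative binomial counts: extra variance that grows with the squared mean."""
    return mean + dispersion * mean ** 2


mean_count = 100.0
dispersion = 0.1  # hypothetical value; in practice it is estimated from replicates
print(poisson_variance(mean_count))
print(negative_binomial_variance(mean_count, dispersion))
```

With technical replicates the Poisson assumption is often adequate, while biological replicates typically show more spread than the Poisson model allows, which the dispersion term absorbs.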

Page 104: Kogo 2013 RNA-seq analysis

CUFFDIFF

• Use to find significant changes in transcript expression, splicing, and promoter use.

Usage)
$ cuffdiff [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]>

Option Value Description

-o / --output-dir <string> Sets the name of the directory in which Cuffdiff will write all of its output. The default is "./".

-L / --labels <label1,label2,...,labelN> Specify a label for each sample, which will be included in various output files produced by Cuffdiff.

-p / --num-threads <int> Number of threads to use. The default is 1.
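In the usage line, replicates of one condition are joined with commas and conditions are separated by spaces. The sketch below assembles such a command line from a dictionary; the helper name and the replicate file names are hypothetical, and only the -o, -L and -p options listed above are used. It builds the argument list without running cuffdiff.

```python
import shlex


def build_cuffdiff_command(gtf, conditions, output_dir="./", threads=1):
    """conditions maps a label to the list of alignment files of its replicates."""
    command = [
        "cuffdiff",
        "-o", output_dir,
        "-L", ",".join(conditions),        # one label per condition
        "-p", str(threads),
        gtf,
    ]
    # Replicates of the same condition are comma-joined into one argument.
    command += [",".join(replicates) for replicates in conditions.values()]
    return command


# Hypothetical replicate files for two conditions.
conditions = {
    "S01": ["S01_rep1.bam", "S01_rep2.bam"],
    "S02": ["S02_rep1.bam", "S02_rep2.bam"],
}
command = build_cuffdiff_command("merged_asm/merged.gtf", conditions, output_dir="Diff-S01-S02")
print(" ".join(shlex.quote(arg) for arg in command))
```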

Page 105: Kogo 2013 RNA-seq analysis

RUN

$ cd /KOGO/RNA-seq/outputs
$ cuffdiff -o Diff-S01-S02 -L S01,S02 -p 1 merged_asm/merged.gtf S01/accepted_hits.bam S02/accepted_hits.bam

Category Option Value

Output -o/--output-dir /KOGO/RNA-seq/outputs/Diff-S01-S02

Label -L / --labels S01,S02

Thread -p / --num-threads 1

Page 106: Kogo 2013 RNA-seq analysis

OUTPUT

Type Files Description

Genes genes.fpkm_tracking, genes.count_tracking, genes.read_group_tracking Gene [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each gene_id

Isoforms isoforms.fpkm_tracking, isoforms.count_tracking, isoforms.read_group_tracking Transcript [FPKMs, counts, read group tracking]

CDS cds.fpkm_tracking, cds.count_tracking, cds.read_group_tracking Coding sequence [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each p_id, independent of tss_id

Primary Transcripts tss_groups.fpkm_tracking, tss_groups.count_tracking, tss_groups.read_group_tracking Primary transcript [FPKMs, counts, read group tracking]. Tracks the summed [FPKMs, counts, read group tracking] of transcripts sharing each tss_id

Page 107: Kogo 2013 RNA-seq analysis

FPKM TRACKING FILES

Col. Column name Example Description
1 tracking_id TCONS_00000001 A unique identifier describing the object (gene, transcript, CDS, primary transcript)
3 nearest_ref_id NM_008866.1 The reference transcript to which the class code refers, if any
4 gene_id NM_008866 The gene_id(s) associated with the object
5 gene_short_name Lypla1 The gene_short_name(s) associated with the object
9 coverage 43.4279 Estimate for the absolute depth of read coverage across the object
10 q0_FPKM 8.01089 FPKM of the object in sample 0
13 q0_status OK OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
14 q1_FPKM 8.55155 FPKM of the object in sample 1
17 q1_status OK OK (deconvolution successful), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL

$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ cut -f 1,3,4,5,9,10,13,14,17 genes.fpkm_tracking | head

Page 108: Kogo 2013 RNA-seq analysis

OUTPUT

Type Files Description

Genes gene_exp.diff Gene differential FPKM. Tests differences in the summed FPKM of transcripts sharing each gene_id

Isoforms isoform_exp.diff Transcript differential FPKM.

CDS cds_exp.diff Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id independent of tss_id

Primary Transcripts tss_group_exp.diff Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id

Splicing splicing.diff How much differential splicing exists between isoforms processed from a single primary transcript

CDS cds.diff The amount of overloading detected among its coding sequences, i.e. how much differential CDS output exists between samples

Promoter promoter.diff The amount of overloading detected among its primary transcripts, i.e. how much differential promoter use exists between samples.

Page 109: Kogo 2013 RNA-seq analysis

GENE_EXP.DIFF

Col. Name Example Description
1 Tested id XLOC_000001 A unique identifier
2 gene Lypla1 The gene_name(s) or gene_id(s) being tested
6 Test status NOTEST OK (test successful), NOTEST (not enough alignments for testing), LOWDATA (too complex or shallowly sequenced), HIDATA (too many fragments in locus), or FAIL
7 FPKMx 8.01089 FPKM of the gene in sample x
8 FPKMy 8.551545 FPKM of the gene in sample y
9 log2(FPKMy/FPKMx) 0.06531 The (base 2) log of the fold change y/x
10 test stat 0.860902 The value of the test statistic used to compute significance of the observed change in FPKM
11 p value 0.389292 The uncorrected p-value of the test statistic
12 q value 0.985216 The FDR-adjusted p-value of the test statistic
13 significant no Can be either "yes" or "no", depending on whether the p-value passes the FDR threshold after Benjamini-Hochberg correction for multiple testing

$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ cut -f 1,2,7,8,9,10,11,12,13,14 gene_exp.diff | head
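The q value column is an FDR-adjusted p-value. The following is a generic sketch of the Benjamini-Hochberg adjustment, not Cuffdiff's own code; the function name and the four p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        q_values[index] = running_min
    return q_values


p_values = [0.01, 0.04, 0.03, 0.005]   # hypothetical uncorrected p-values
q_values = benjamini_hochberg(p_values)
significant = ["yes" if q < 0.05 else "no" for q in q_values]
print(q_values, significant)
```

A gene is then called significant when its adjusted value falls below the chosen FDR (0.05 in this sketch).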

Page 110: Kogo 2013 RNA-seq analysis

SIMPLE STATISTICS

$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ gnuplot
gnuplot> set grid
gnuplot> set zeroaxis -1
gnuplot> set xlabel 'log(FPKMs of S01)'
gnuplot> set ylabel 'log(FPKMs of S02)'
gnuplot> pl 'genes.fpkm_tracking' u (log($10)):(log($14)) w points notitle, x notitle
gnuplot> exit

$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
# Split genes on the last column (significant); matching the whole field avoids
# hits on gene names that happen to contain "yes" or "no"
$ awk '$NF == "yes"' gene_exp.diff > gene_exp.diff.yes
$ less gene_exp.diff.yes
$ awk '$NF == "no"' gene_exp.diff > gene_exp.diff.no
$ gnuplot
gnuplot> set grid
gnuplot> set zeroaxis lt 2
gnuplot> set xlabel 'log2foldchange'
gnuplot> set ylabel '-log(p-value)'
gnuplot> pl 'gene_exp.diff.no' u 10:(-log($12)) lt 0 notitle,\
 'gene_exp.diff.yes' u 10:(-log($12)) lt 1 pt 6 ps 2 t 'DE'
gnuplot> exit
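Each point of the volcano plot pairs a log2 fold change with a negative log p-value. A small sketch of how one point is derived, using the example FPKM and p-value from the GENE_EXP.DIFF table; the function names are hypothetical, and the natural log is used for the y axis because gnuplot's log() is the natural logarithm.

```python
import math


def log2_fold_change(fpkm_x, fpkm_y):
    """Base-2 log of the fold change y/x."""
    return math.log2(fpkm_y / fpkm_x)


def volcano_point(fpkm_x, fpkm_y, p_value):
    """x = log2 fold change, y = -ln(p-value), as in the gnuplot commands."""
    return log2_fold_change(fpkm_x, fpkm_y), -math.log(p_value)


# Example values from the GENE_EXP.DIFF table (FPKMx, FPKMy, p value).
x, y = volcano_point(8.01089, 8.551545, 0.389292)
print(round(x, 4), round(y, 4))
```

For these two FPKM values the base-2 log is about 0.094, whereas the natural log is about 0.0653, so the example value 0.06531 in the table appears to correspond to a natural-log fold change rather than log2.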

Page 112: Kogo 2013 RNA-seq analysis

HTSEQ-COUNT

• Counts how many reads map to each feature

• Reads may not be counted for any feature for various reasons, namely:

• no_feature: reads which could not be assigned to any feature

• ambiguous: reads which could have been assigned to more than one feature and hence were not counted for any of these

• too_low_aQual: reads which were not counted due to the -a option

• not_aligned: reads in the SAM file without alignment

• alignment_not_unique: reads with more than one reported alignment. These reads are recognized from the NH optional SAM field tag.

• If you have paired-end data, you have to sort the SAM file by read name first

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
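The no_feature and ambiguous categories can be illustrated with a toy version of the union counting mode: a read is counted for a feature only when it overlaps exactly one. This is a simplified sketch on half-open intervals with made-up coordinates, not htseq-count's implementation; it ignores strand, quality, and multi-mapping.

```python
from collections import Counter


def count_reads_union(reads, features):
    """reads: list of (start, end); features: name -> (start, end), half-open."""
    counts = Counter()
    for start, end in reads:
        overlapping = {
            name
            for name, (f_start, f_end) in features.items()
            if start < f_end and f_start < end
        }
        if not overlapping:
            counts["no_feature"] += 1        # read overlaps nothing
        elif len(overlapping) > 1:
            counts["ambiguous"] += 1         # read overlaps several features
        else:
            counts[overlapping.pop()] += 1   # unique assignment
    return counts


features = {"geneA": (100, 200), "geneB": (150, 300)}   # hypothetical coordinates
reads = [(110, 140), (160, 190), (250, 280), (400, 450)]
print(count_reads_union(reads, features))
```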

Page 113: Kogo 2013 RNA-seq analysis

HTSEQ-COUNT

If you have paired-end data, you have to sort the SAM file by read name first

Usage)
$ htseq-count [options] <sam_file> [gff_file, ensembl gtf]

Options)
-m [union,intersection-strict,intersection-nonempty]
-s, --stranded=<yes, no, or reverse> whether the data is from a strand-specific assay (default: yes)

Run)
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools sort -n accepted_hits.bam accepted_hits.nameSorted
$ samtools view accepted_hits.nameSorted.bam | htseq-count -m union -s no - ../merged_asm/merged.gtf > accepted_hits.count
$ less accepted_hits.count
# ..... for (S02, S03, S04)

Page 114: Kogo 2013 RNA-seq analysis

RUN

Run)
$ cd /KOGO/RNA-seq/outputs
$ mkdir DESeq
$ TBI-toolkit-make_matrix S01/hits.count 2 S02/hits.count 2 S03/hits.count 2 S04/hits.count 2 > DESeq/hits.mtx
$ cd DESeq
$ less hits.mtx
$ cp /KOGO/RNA-seq/scripts/DESeq.4samples.R .
$ R CMD BATCH DESeq.4samples.R

DESeq.2samples for 2 samples
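TBI-toolkit-make_matrix is an in-house tool whose behavior is not described in these slides. As an assumption of what such a step does, the sketch below merges per-sample htseq-count outputs (feature, tab, count) into one count matrix and drops the special counters listed earlier. All names and the toy counts are hypothetical.

```python
import io

# Special counters that htseq-count appends after the per-feature counts.
SPECIAL = {"no_feature", "ambiguous", "too_low_aQual", "not_aligned", "alignment_not_unique"}


def read_counts(handle):
    """Parse 'feature<TAB>count' lines, skipping the special counters."""
    counts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        feature, value = line.split("\t")
        if feature.lstrip("_") in SPECIAL:   # newer versions prefix them with "__"
            continue
        counts[feature] = int(value)
    return counts


def make_matrix(samples):
    """samples: sample name -> counts dict. Returns (header, rows)."""
    features = sorted(set().union(*(counts.keys() for counts in samples.values())))
    header = ["feature"] + list(samples)
    rows = [[f] + [samples[s].get(f, 0) for s in samples] for f in features]
    return header, rows


# Toy count files for two samples.
s01 = read_counts(io.StringIO("geneA\t10\ngeneB\t0\nno_feature\t5\n"))
s02 = read_counts(io.StringIO("geneA\t7\ngeneB\t3\nambiguous\t2\n"))
header, rows = make_matrix({"S01": s01, "S02": s02})
print("\t".join(header))
for row in rows:
    print("\t".join(str(cell) for cell in row))
```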

Page 115: Kogo 2013 RNA-seq analysis

DEG METHODS

• Cuffdiff, baySeq, DESeq, edgeR and NOISeq generated consistent results

• edgeR identified more DEGs than the other methods at the same cut-off, which might imply less control of type 1 error with this method

Nookaew I, et al. A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae. Nucleic Acids Res. 2012 Nov 1;40(20):10084-97.

Page 116: Kogo 2013 RNA-seq analysis

SUMMARY

• DEG

• Replicate

• Technical replicates

• Biological replicates

• Cuffdiff

• HTSeq-count

• DESeq

Page 117: Kogo 2013 RNA-seq analysis

REPORT (CUMMERBUND)

[Workflow diagram: RSeQC, FastQC, Filtering, Read Mapping, Gene Structure, Expression Level, DEG analysis, Report, Annotation, Duplication]

Page 118: Kogo 2013 RNA-seq analysis

CUMMERBUND

• An R package that is designed to aid and simplify the task of analyzing Cufflinks RNA-Seq output.

• R

• using SQLite

• cuffData.db

Page 119: Kogo 2013 RNA-seq analysis

CUMMERBUND DB SCHEMA

Page 120: Kogo 2013 RNA-seq analysis

RUN

Run)
$ cd /KOGO/RNA-seq/outputs/Diff-S01-S02
$ R
> library(cummeRbund)
> cuff <- readCufflinks()
> cuff
# Global statistics and Quality Control
> disp<-dispersionPlot(genes(cuff))
> disp
# Density
> dens<-csDensity(genes(cuff))
> dens
# Boxplot
> b<-csBoxplot(genes(cuff))
> b
# Volcano
> v<-csVolcanoMatrix(genes(cuff))
> v
> v<-csVolcano(genes(cuff),"S01","S02")
> v
# Pairwise Scatterplots
> s<-csScatter(genes(cuff),"S01","S02",smooth=T)
> s
# Geneset level plots
> data(sampleData)
> myGeneIds <- sampleIDs
> myGenes <- getGenes(cuff,myGeneIds)
> h<-csHeatmap(myGenes,cluster='both')
> h
# Barplot
> b <- expressionBarplot(myGenes)
> b
# Cluster
> ic <-csCluster(myGenes,k=4)
> icp <- csClusterPlot(ic)
> icp

Page 121: Kogo 2013 RNA-seq analysis

ADDITIONAL ANALYSIS

Page 122: Kogo 2013 RNA-seq analysis

VIEWER

• IGV

• Integrative Genomics Viewer

• http://www.broadinstitute.org/igv/

Run) Generate BAM index
$ cd /KOGO/RNA-seq/outputs/S01
$ samtools index accepted_hits.bam
$ ls accepted_hits.bam.bai

Page 123: Kogo 2013 RNA-seq analysis

GO ENRICHMENT

• GO annotation

• Using SwissProt

• Blastx

• Blast2Go

• InterProScan

• GO Enrichment

• GOseq

• Fisher's exact test

• DAVID
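For the Fisher's exact test item, a one-sided enrichment p-value can be computed from the hypergeometric distribution: the chance of seeing at least k annotated genes among n DEGs when K of N genes carry the GO term. A minimal sketch with made-up numbers; the function and parameter names are hypothetical, and tools such as GOseq additionally correct for length bias, which this does not.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the GO term."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical example: 20 genes in total, 5 annotated with the term,
# 4 DEGs, 3 of which are annotated.
p = enrichment_p_value(k=3, n=4, K=5, N=20)
print(p)
```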

Page 124: Kogo 2013 RNA-seq analysis

CONCLUSION

• (m)RNA-seq analysis

• Reference-based method

• NGS data analysis

• RNA-seq vs. DNA-seq

• Filtering

• Low Quality

• PCR Duplication

• Mapping

• RNA mapper

• Gene Expression

• Normalization

• DEG Analysis

• RPKM

• Replicates

Page 125: Kogo 2013 RNA-seq analysis

END