RNA-Seq: Methods and Applications
Prat Thiru
Outline
• Intro to RNA-Seq
  - Biological Questions
  - Comparison with Other Methods
  - RNA-Seq Protocol
• RNA-Seq Applications
  - Annotation
  - Quantification
  - Other Applications
• Expression Profiling Steps and Software
• Running TopHat and Cufflinks (Commands)
Goals of Sequencing the Transcriptome
• Annotation
  - Identify genes, exons, splicing events, ncRNAs, etc.
  - Novel genes or transcripts
• Quantification
  - Abundance of transcripts between different conditions
Transcriptome: RNA World
Source: http://finchtalk.geospiza.com/2009/05/small-rnas-get-smaller.html

Transcriptome: Complexity
Source: http://www.ncbi.nlm.nih.gov/books/NBK21128/
Comparison of Methods for Studying the Transcriptome

Technology | Tiling microarray | cDNA or EST sequencing | RNA-Seq

Technology specifications
Principle | Hybridization | Sanger sequencing | High-throughput sequencing
Resolution | From several to 100 bp | Single base | Single base
Throughput | High | Low | High
Reliance on genomic sequence | Yes | No | In some cases
Background noise | High | Low | Low

Application
Simultaneously map transcribed regions and gene expression | Yes | Limited for gene expression | Yes
Dynamic range to quantify gene expression level | Up to a few-hundredfold | Not practical | >8,000-fold
Ability to distinguish different isoforms | Limited | Yes | Yes
Ability to distinguish allelic expression | Limited | Yes | Yes

Practical issues
Required amount of RNA | High | High | Low
Cost for mapping transcriptomes of large genomes | High | High | Relatively low

Source: Wang, Z. et al. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics (2009)

RNA-Seq Experiment
Source: Wang, Z. et al. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics (2009)

RNA-Seq Applications - Annotation: Alternative Splicing Events
Source: Ozsolak, F. and Milos, P. RNA sequencing: advances, challenges and opportunities. Nature Reviews Genetics (2011)

RNA-Seq Applications - Annotation: Identify Known and Novel Transcripts
• Unmapped reads: novel splice junctions?
• Mapped reads: novel exon or gene? Known exons/gene
Sources:
Guttman, M. et al. Ab initio reconstruction of cell type–specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nature Biotechnology (2010)
Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology (2010)

Assembly and Mapping RNA-Seq
• Options:
  - Align and then assemble
  - Assemble and then align
• Align to
  - genome
  - transcriptome
Source: Haas, B.J. and Zody, M.C. Advancing RNA-Seq analysis. Nature Biotechnology (2010)

RNA-Seq Applications - Quantification: Expression Profiling
Source: Mortazavi, A. et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods (2008)

Need for Normalization
• More reads are mapped to a transcript if it is
  i) long
  ii) sequenced at a higher depth of coverage
• Normalize such that
  i) features of different lengths
  ii) total sequence from different conditions
  can be compared

Quantifying Expression: RPKM
• RPKM: Reads Per Kilobase per Million mapped reads
• RPKM = C / (L × N)
  C: Number of mappable reads on a feature (e.g. transcript, exon, etc.)
  L: Length of feature (in kb)
  N: Total number of mappable reads (in millions)
Source: Mortazavi, A. et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods (2008)
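
RPKM Example
Gene lengths: Gene A 600 bases, Gene B 1100 bases, Gene C 1400 bases

Sample 1 (N = 6M)
  Gene A: C = 12, RPKM = 12/(0.6*6) = 3.33
  Gene B: C = 24, RPKM = 24/(1.1*6) = 3.64
  Gene C: C = 11, RPKM = 11/(1.4*6) = 1.31

Sample 2 (N = 8M)
  Gene A: C = 19, RPKM = 19/(0.6*8) = 3.96
  Gene B: C = 28, RPKM = 28/(1.1*8) = 3.18
  Gene C: C = 16, RPKM = 16/(1.4*8) = 1.43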
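
The RPKM arithmetic can be checked with a few lines of Python. This is a minimal sketch that recomputes the example from the definition RPKM = C / (L × N), with L in kb and N in millions. The function and variable names are illustrative and are not part of any RNA-Seq tool.

```python
def rpkm(read_count, feature_length_bases, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads: C / (L * N)."""
    length_kb = feature_length_bases / 1_000          # L, in kb
    total_millions = total_mapped_reads / 1_000_000   # N, in millions
    return read_count / (length_kb * total_millions)


# Gene lengths, read counts and library sizes from the RPKM example slide.
gene_lengths = {"Gene A": 600, "Gene B": 1100, "Gene C": 1400}
samples = {
    "Sample 1": (6_000_000, {"Gene A": 12, "Gene B": 24, "Gene C": 11}),
    "Sample 2": (8_000_000, {"Gene A": 19, "Gene B": 28, "Gene C": 16}),
}

for sample, (total_reads, counts) in samples.items():
    for gene, count in counts.items():
        value = rpkm(count, gene_lengths[gene], total_reads)
        print(f"{sample} {gene}: RPKM = {value:.2f}")
```

Dividing by both length and library size is what makes Gene A comparable with the longer Gene C, and Sample 1 comparable with the more deeply sequenced Sample 2.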

Quantifying Expression: FPKM
• FPKM: Fragments Per Kilobase of transcript per Million fragments mapped
  - Analogous to RPKM, but counts fragments rather than reads.
  - The relative abundances of transcripts are described in terms of the expected biological objects (fragments) observed from an RNA-Seq experiment, which in the future may not be represented by a single read.
Source: Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology (2010)

Quantifying Expression: Normalization Methods
• Total-count (e.g. RPKM)
• Upper quartile (e.g. 75th percentile): similar to total-count, but uses the per-lane upper quartile of counts for genes with reads in at least one lane.
• Quantile: for each lane, the distribution of read counts is matched to a reference distribution defined in terms of median counts.
Source: Bullard, J. et al. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics (2010)
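
To make the upper-quartile idea concrete, here is a simplified sketch using only the Python standard library. It assumes counts are held as a dict of gene name to per-lane counts (a hypothetical layout) and that each lane has a non-zero upper quartile. It divides each lane by its own upper quartile and omits any further rescaling, so it illustrates the idea rather than reproducing the procedure in Bullard et al.

```python
from statistics import quantiles


def upper_quartile_factors(counts_by_gene):
    """Per-lane 75th percentile of counts, over genes with reads in at least one lane."""
    kept = [lanes for lanes in counts_by_gene.values() if any(c > 0 for c in lanes)]
    n_lanes = len(kept[0])
    factors = []
    for lane in range(n_lanes):
        lane_counts = [lanes[lane] for lanes in kept]
        # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
        factors.append(quantiles(lane_counts, n=4, method="inclusive")[2])
    return factors


def upper_quartile_normalize(counts_by_gene):
    """Divide every count by the upper quartile of its lane."""
    factors = upper_quartile_factors(counts_by_gene)
    return {
        gene: [count / factor for count, factor in zip(lanes, factors)]
        for gene, lanes in counts_by_gene.items()
    }


# Toy data: lane 2 is sequenced roughly twice as deeply as lane 1.
counts = {
    "g1": [0, 0],    # no reads in any lane, ignored when computing quartiles
    "g2": [0, 10],
    "g3": [10, 20],
    "g4": [20, 40],
    "g5": [30, 60],
    "g6": [40, 80],
}
print(upper_quartile_factors(counts))
print(upper_quartile_normalize(counts)["g4"])
```

In the toy data, gene g4 has twice as many reads in lane 2 only because that lane is deeper; after scaling, its two values coincide.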

RNA-Seq Applications: Gene Fusion
Source: Ozsolak, F. and Milos, P. RNA sequencing: advances, challenges and opportunities. Nature Reviews Genetics (2011)

Outline
• Intro to RNA-Seq
  - Biological Questions
  - Comparison with Other Methods
  - RNA-Seq Protocol
• RNA-Seq Applications
  - Identifying Transcripts
  - Quantification
  - Other Applications
• Expression Profiling Steps and Software
• Running TopHat and Cufflinks (Commands)

Expression Profiling Workflow
1. QC: Filter Short Reads (See Hot Topics on Mapping NGS Reads)
   • FASTX Toolkit
   • FastQC
   • R: ShortRead
2. Align and Assemble, or Assemble and Align
   • Align with TopHat, assemble with Cufflinks
3. Computational Analysis: Quantify Expression, or other applications
   • Cuffcompare, Cuffdiff
   • SAMtools, BEDtools
   • R: edgeR, DESeq
4. Visualize Data
   • IGV (See Hot Topics on IGV)
   • UCSC Genome Browser

The Tuxedo Tools
Source: http://mged12-deep-sequencing-analysis.wikispaces.com/file/view/Cole_MGED_tutorial_slides.pdf

TopHat Algorithm
Source: Trapnell, C. et al. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics (2009)

Cufflinks Algorithm
Source: Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology (2010)

Running TopHat: Align Reads
• TopHat Manual: http://tophat.cbcb.umd.edu/manual.html
• Running TopHat on Tak
  Usage: tophat [options] <bowtie_index> <reads1[,reads2,...,readsN]> [reads1[,reads2,...,readsN]]
  e.g. bsub "tophat -p 2 --solexa1.3-quals --max-multihits 5 -o s_1_TopHat_Out /nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9 s_1_sequence.txt"
  Options (see the manual for all available options):
  -o/--output-dir      Sets the name of the directory in which TopHat will write all of its output.
  --solexa-quals       Use the Solexa scale for quality values in FASTQ files.
  --solexa1.3-quals    As of the Illumina GA pipeline version 1.3, quality scores are encoded in Phred-scaled base-64. Use this option for FASTQ files from pipeline 1.3 or later.
  -p/--num-threads     Use this many threads to align reads. The default is 1.
  -g/--max-multihits   Instructs TopHat to allow up to this many alignments to the reference for a given read, and suppresses all alignments for reads with more than this many alignments. The default is 40.

TopHat Output
• Output of TopHat is a BAM file, the binary version of a Sequence Alignment/Map (SAM) file
• Use Integrative Genomics Viewer (IGV) to view the BAM file, or use SAMtools to analyze it

e.g. SAM file (one alignment per line)
WICMT-SOLEXA:1:20:670:1533# 137 chr1 3240920 3 30M * 0 0 CTGGATCTGGACCTGGACCTGGATCTATAT ::::::::::::::::-::::::::::::: NM:i:1 NH:i:2 CC:Z:chr6 CP:i:83893005
WICMT-SOLEXA:1:69:135:1285# 89 chr1 3269437 1 30M * 0 0 TGCCTAAACTTATTAAGGCAGGCCATGGGC :((/+:::(+:+':/:+++&+//':++::: NM:i:2 NH:i:4 CC:Z:chr7 CP:i:20934843
WICMT-SOLEXA:1:84:584:747# 153 chr1 3270083 0 30M * 0 0 AGCAAGTTTTTTNTTAGCCCTAGATTCCAG ::::::::::::%::::::::::::::::: NM:i:1 NH:i:5 CC:Z:= CP:i:136301734
WICMT-SOLEXA:1:75:1357:1675# 163 chr1 3522128 255 30M = 3522287 0 GTGGCTTTGTGGTCTTCACCAACCTTTCTC :::::::::::::::::::::::::::::: NM:i:1 NH:i:1
WICMT-SOLEXA:1:75:1357:1675# 83 chr1 3522287 255 30M = 3522128 0 CTGTAGGTGTAATCCTAAATTCTTATTACG :::::::::::::::::::::::::::::: NM:i:0 NH:i:1
WICMT-SOLEXA:1:8:59:283# 153 chr1 3522536 3 30M * 0 0 TTTCTGCTTTGATTATGGTACTGATGTCTG :::::::::::4:::::::::::::::::: NM:i:2 NH:i:2 CC:Z:chr5 CP:i:134317691
WICMT-SOLEXA:1:12:1161:945# 89 chr1 3523371 1 30M * 0 0 TCTACATAGCCCAAACTGGCTTTGGACTCT :::::::::::::::::::::::::::::: NM:i:0 NH:i:3 CC:Z:chr10 CP:i:117172515
WICMT-SOLEXA:1:45:1469:1826# 73 chr1 3620888 3 30M * 0 0 CAAGTATTTAATGTTTTCATTAAATTGTTT ::::::::::::::::::::::::::4::: NM:i:0 NH:i:2 CC:Z:chr11 CP:i:22903295
WICMT-SOLEXA:1:14:536:150# 73 chr1 3620943 3 30M * 0 0 CTGGAAGACAATGTCCAAAAACTCTGAATC :::::::::::::::::::::::::%::&: NM:i:1 NH:i:2 CC:Z:chr11 CP:i:22903240
WICMT-SOLEXA:1:66:646:1188# 137 chr1 3662923 0 30M * 0 0 AAAAAAAAAACACCACCCCCAACAAAAAAA +00++0+0+''0++++:00::.&:::,:,: NM:i:2 NH:i:5 CC:Z:chr10 CP:i:94881279
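
To show how a SAM record breaks down into fields, here is a minimal Python sketch that splits one alignment line into the eleven mandatory columns plus optional tags. The example line is the first record on this slide with tab delimiters restored; the field boundaries are inferred from the flattened text and are an assumption. The function name and dict keys are illustrative, and real analyses would normally use SAMtools or an existing parsing library.

```python
def parse_sam_line(line):
    """Split one SAM alignment line into mandatory fields and optional tags."""
    fields = line.rstrip("\n").split("\t")
    record = {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),      # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "rnext": fields[6],
        "pnext": int(fields[7]),
        "tlen": int(fields[8]),
        "seq": fields[9],
        "qual": fields[10],
        "tags": {},
    }
    # Optional fields have the form TAG:TYPE:VALUE, e.g. NH:i:2
    for tag in fields[11:]:
        name, value_type, value = tag.split(":", 2)
        record["tags"][name] = int(value) if value_type == "i" else value
    return record


example = "\t".join([
    "WICMT-SOLEXA:1:20:670:1533#", "137", "chr1", "3240920", "3", "30M",
    "*", "0", "0",
    "CTGGATCTGGACCTGGACCTGGATCTATAT",
    "::::::::::::::::-:::::::::::::",
    "NM:i:1", "NH:i:2", "CC:Z:chr6", "CP:i:83893005",
])

record = parse_sam_line(example)
is_reverse_strand = bool(record["flag"] & 0x10)   # FLAG bit 0x10: read on reverse strand
mate_unmapped = bool(record["flag"] & 0x8)        # FLAG bit 0x8: mate is unmapped
print(record["rname"], record["pos"], record["tags"]["NH"], is_reverse_strand, mate_unmapped)
```

The NH tag gives the number of reported alignments for the read, so it is a convenient way to separate uniquely mapped reads (NH = 1) from multi-mapped ones.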

Cufflinks: Assemble and Quantify Reads
• Cufflinks Manual: http://cufflinks.cbcb.umd.edu/manual.html
• Running Cufflinks on Tak
  Usage: cufflinks [options] <hits.bam>
  e.g. bsub "cufflinks -p 2 -o s_1_Cufflinks_Out s_1_TopHat_Out/accepted_hits.bam"
• Optional: Supply annotation in GTF format with the "-G" option
  e.g. Cufflinks will assemble and quantify using the known transcripts in the supplied GTF file:
  bsub "cufflinks -p 2 -G transcripts.gtf accepted_hits.bam"
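
When the two steps are run for many samples, it can help to build the command lines programmatically. The sketch below only assembles the TopHat and Cufflinks command strings using the options named on these slides (-p, --solexa1.3-quals, --max-multihits, -o, -G) and wraps them for bsub; it does not run anything. The helper names are hypothetical, and paths containing spaces would need proper shell quoting, which this sketch does not handle.

```python
def tophat_command(index_base, reads, out_dir, threads=2, max_multihits=5,
                   solexa13_quals=True):
    """Build a TopHat argument list for one or more single-end read files."""
    cmd = ["tophat", "-p", str(threads)]
    if solexa13_quals:
        cmd.append("--solexa1.3-quals")
    cmd += ["--max-multihits", str(max_multihits), "-o", out_dir]
    cmd += [index_base, ",".join(reads)]   # multiple read files are comma-separated
    return cmd


def cufflinks_command(bam, out_dir=None, threads=2, gtf=None):
    """Build a Cufflinks argument list; gtf is an optional annotation for -G."""
    cmd = ["cufflinks", "-p", str(threads)]
    if out_dir is not None:
        cmd += ["-o", out_dir]
    if gtf is not None:
        cmd += ["-G", gtf]
    cmd.append(bam)
    return cmd


def bsub_wrap(cmd):
    """Wrap an argument list as a quoted bsub submission string."""
    joined = " ".join(cmd)
    return f'bsub "{joined}"'


tophat = tophat_command(
    "/nfs/genomes/mouse_gp_jul_07_no_random/bowtie/mm9",
    ["s_1_sequence.txt"],
    "s_1_TopHat_Out",
)
# TopHat writes accepted_hits.bam into its output directory; Cufflinks reads it.
cufflinks = cufflinks_command("s_1_TopHat_Out/accepted_hits.bam", "s_1_Cufflinks_Out")

print(bsub_wrap(tophat))
print(bsub_wrap(cufflinks))
```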

Cufflinks Output
• Output of Cufflinks is a GTF file with assembled isoforms

e.g.
chr1 Cufflinks transcript 36321447 36330270 1000 - . gene_id "Neurl3"; transcript_id "NM_153408"; FPKM "3.7155221121"; frac "1.000000"; conf_lo "0.000000"; conf_hi "7.570660"; cov "0.649922";
chr1 Cufflinks exon 36321447 36323398 1000 - . gene_id "Neurl3"; transcript_id "NM_153408"; exon_number "1"; FPKM "3.7155221121"; frac "1.000000"; conf_lo "0.000000"; conf_hi "7.570660"; cov "0.649922";
chr1 Cufflinks exon 36325501 36325554 1000 - . gene_id "Neurl3"; transcript_id "NM_153408"; exon_number "2"; FPKM "3.7155221121"; frac "1.000000"; conf_lo "0.000000"; conf_hi "7.570660"; cov "0.649922";
chr1 Cufflinks exon 36326058 36326546 1000 - . gene_id "Neurl3"; transcript_id "NM_153408"; exon_number "3"; FPKM "3.7155221121"; frac "1.000000"; conf_lo "0.000000"; conf_hi "7.570660"; cov "0.649922";
chr1 Cufflinks exon 36330183 36330270 1000 - . gene_id "Neurl3"; transcript_id "NM_153408"; exon_number "4"; FPKM "3.7155221121"; frac "1.000000"; conf_lo "0.000000"; conf_hi "7.570660"; cov "0.649922";
chr1 Cufflinks transcript 36364578 36380874 4 + . gene_id "Arid5a"; transcript_id "NM_145996"; FPKM "0.0015751054"; frac "0.002360"; conf_lo "0.000000"; conf_hi "0.081996"; cov "0.000263";
chr1 Cufflinks exon 36364578 36364681 4 + . gene_id "Arid5a"; transcript_id "NM_145996"; exon_number "1"; FPKM "0.0015751054"; frac "0.002360"; conf_lo "0.000000"; conf_hi "0.081996"; cov "0.000263";
chr1 Cufflinks exon 36373054 36373172 4 + . gene_id "Arid5a"; transcript_id "NM_145996"; exon_number "2"; FPKM "0.0015751054"; frac "0.002360"; conf_lo "0.000000"; conf_hi "0.081996"; cov "0.000263";
chr1 Cufflinks exon 36374929 36375026 4 + . gene_id "Arid5a"; transcript_id "NM_145996"; exon_number "3"; FPKM "0.0015751054"; frac "0.002360"; conf_lo "0.000000"; conf_hi "0.081996"; cov "0.000263";
chr1 Cufflinks exon 36375333 36375498 4 + . gene_id "Arid5a"; transcript_id "NM_145996"; exon_number "4"; FPKM "0.0015751054"; frac "0.002360"; conf_lo "0.000000"; conf_hi "0.081996"; cov "0.000263";
chr1 Cufflinks exon 36375837 36380874 4 + . gene_id "Arid5a"; transcript_id "NM_145996"; exon_number "5"; FPKM "0.0015751054"; frac "0.002360"; conf_lo "0.000000"; conf_hi "0.081996"; cov "0.000263";
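
A common next step is to pull the FPKM value for each transcript out of the Cufflinks GTF. The sketch below parses the nine tab-separated GTF columns and the key "value"; attribute pairs, then collects FPKM per transcript_id. The two input lines are the transcript records from this slide with tab delimiters restored (the column boundaries are inferred from the flattened text). Function names are illustrative only.

```python
import re

# Matches attribute pairs of the form: key "value";
ATTRIBUTE_PATTERN = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line):
    """Parse one GTF line into its eight fixed columns plus an attribute dict."""
    cols = line.rstrip("\n").split("\t")
    return {
        "seqname": cols[0],
        "source": cols[1],
        "feature": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "score": cols[5],
        "strand": cols[6],
        "frame": cols[7],
        "attributes": dict(ATTRIBUTE_PATTERN.findall(cols[8])),
    }


def transcript_fpkm(lines):
    """Map transcript_id to FPKM, using only 'transcript' feature lines."""
    result = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        record = parse_gtf_line(line)
        if record["feature"] == "transcript":
            attrs = record["attributes"]
            result[attrs["transcript_id"]] = float(attrs["FPKM"])
    return result


gtf_lines = [
    'chr1\tCufflinks\ttranscript\t36321447\t36330270\t1000\t-\t.\t'
    'gene_id "Neurl3"; transcript_id "NM_153408"; FPKM "3.7155221121"; '
    'frac "1.000000"; conf_lo "0.000000"; conf_hi "7.570660"; cov "0.649922";',
    'chr1\tCufflinks\ttranscript\t36364578\t36380874\t4\t+\t.\t'
    'gene_id "Arid5a"; transcript_id "NM_145996"; FPKM "0.0015751054"; '
    'frac "0.002360"; conf_lo "0.000000"; conf_hi "0.081996"; cov "0.000263";',
]

fpkm_by_transcript = transcript_fpkm(gtf_lines)
print(fpkm_by_transcript)
```

Restricting to transcript lines avoids counting the same FPKM once per exon, since Cufflinks repeats the transcript-level value on every exon record.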

Local Resources
• For a description of available files, see /nfs/genomes/BaRC_Genomes_README.txt
• Bowtie index: /nfs/genomes/<species>/bowtie
  e.g. /nfs/genomes/mouse_gp_jul_07_no_random/bowtie
• GTF files: /nfs/genomes/<species>/gtf
  e.g. /nfs/genomes/mouse_gp_jul_07/gtf

Further Reading
• RNA-Seq
  Mortazavi, A., et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5(7):621-628 (2008)
  Wang, Z., et al. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10:57-63 (2009)
  Ozsolak, F. and Milos, P.M. RNA sequencing: advances, challenges, and opportunities. Nature Reviews Genetics 12:87-98 (2011)
• TopHat
  Trapnell, C., et al. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25(9):1105-1111 (2009)
• Cufflinks
  Trapnell, C., et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology 28(5):511-515 (2010)

Online Community Forum and Discussion
• http://seqanswers.com/