Computer Skills for Biological Research, Lecture 10
Computational Skill for Modern Biology Research
Department of Biology, Chungbuk National University
10th Lecture 2015.11.17
NGS Analysis III : RNA quantification with kallisto & DE
Syllabus
Week: Topic
Week 1: Introduction: Why do we need to learn this stuff?
Week 2: Basics of Unix and running BLAST on your PC
Week 3: Unix Command Prompt II and shell scripts
Week 4: Basics of programming (Python programming)
Week 5: Python Scripting II and sequence manipulations
Week 6: IPython Notebook and Pandas
Week 7: Basics of Next Generation Sequencing and Tutorial
Week 8:
Week 9: Next Generation Sequencing Analysis I
Week 10: Next Generation Sequencing Analysis II
Week 11: Next Generation Sequencing Analysis III
Week 12: Bioconductor I
Week 13: Bioconductor II
Week 14: Network analysis
Conventional RNA-Seq Analysis
Mapping of sequencing reads to the reference genome
Read quantification
Calculation of FPKM
Differential Expression Analysis
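To make the FPKM step concrete, here is a minimal sketch using the standard definition (fragments per kilobase of transcript per million mapped fragments). The function name and the example numbers are made up for illustration; they are not from the lecture data.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """FPKM = fragments / (transcript length in kb) / (mapped fragments in millions)."""
    length_kb = transcript_length_bp / 1e3
    total_millions = total_mapped_fragments / 1e6
    return fragments / length_kb / total_millions

# Hypothetical numbers: 500 fragments on a 2,000 bp transcript,
# in a library with 30 million mapped fragments
example = fpkm(500, 2000, 30_000_000)
print(round(example, 3))
```

Dividing by length and by library size is what lets long and short transcripts, and deep and shallow libraries, be compared on one scale.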
Bottleneck
Takes too much time
30 million paired-end reads. All processing was done using 20 cores with programs being run with 20 threads
http://arxiv.org/pdf/1505.02710v2.pdf
Even on a 20-core CPU server, it took a serious amount of time.
A more efficient way to quantify the transcriptome is needed.
kallisto: Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto/
http://arxiv.org/abs/1505.02710
Do we really need to align RNA sequencing reads to the genome?
In most cases the transcriptome is far smaller than the genome
Sometimes we only need to know which reads correspond to which isoforms
Download and install
https://pachterlab.github.io/kallisto/download.html
cd ~
wget https://github.com/pachterlab/kallisto/releases/download/v0.42.4/kallisto_mac-v0.42.4.tar.gz
tar -xvzf kallisto_mac-v0.42.4.tar.gz
Then add the kallisto directory to PATH (in ~/.bash_profile)
kallisto 0.42.4
Usage: kallisto <CMD> [arguments] ..
Where <CMD> can be one of:
    index      Builds a kallisto index
    quant      Runs the quantification algorithm
    h5dump     Converts HDF5-formatted results to plaintext
    version    Prints version information
Transcriptome index
You need to generate an index for the transcriptome
http://bio.math.berkeley.edu/kallisto/transcriptomes/
The transcriptome is just a FASTA file that contains all of the mRNA (cDNA) sequences of your genome.
Download mouse transcriptome
http://bio.math.berkeley.edu/kallisto/transcriptomes/Mus_musculus.GRCm38.rel79.cdna.all.fa.gz
Generate index
kallisto index -i mouse.idx Mus_musculus.GRCm38.rel79.cdna.all.fa.gz
-i mouse.idx: index name; last argument: transcriptome FASTA file
Now it is time to download some NGS data from the SRA.
http://sra.dnanexus.com
Input Keywords
Restrict the search type to ‘Transcriptome Analysis’
Select the runs and download the SRA URLs
Open the download_sra_urls.txt file
Add wget -c in front of each URL
bash download_sra_urls.txt
--2015-11-16 13:28:38-- ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR128/SRR1286228/SRR1286228.sra
 => ‘SRR1286228.sra’
Resolving ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)... 130.14.250.7, 2607:f220:41e:250::13
Connecting to ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)|130.14.250.7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /sra/sra-instant/reads/ByRun/sra/SRR/SRR128/SRR1286228 ... done.
==> SIZE SRR1286228.sra ... 3488089695
==> PASV ... done. ==> RETR SRR1286228.sra ... done.
Length: 3488089695 (3.2G) (unauthoritative)
SRR1286228.sra 0%[ ] 67.01K 48.8KB/s
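The "add wget -c" step can also be scripted instead of edited by hand. A minimal sketch, assuming the downloaded file holds one URL per line; the helper name add_wget_prefix is made up for illustration.

```python
def add_wget_prefix(lines):
    """Turn each non-empty URL line into a 'wget -c <url>' command line."""
    commands = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        if url.startswith("wget "):
            commands.append(url)  # already a command; leave it alone
        else:
            commands.append("wget -c " + url)
    return commands

# One URL taken from the download log above, plus a blank line
urls = [
    "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR128/SRR1286228/SRR1286228.sra",
    "",
]
script_lines = add_wget_prefix(urls)
print("\n".join(script_lines))
```

With a real file, read its lines with open("download_sra_urls.txt"), pass them through the helper, and write the result back before running bash on it. The -c option lets wget resume a partial download, which helps with multi-gigabyte .sra files.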
Download SRA
Convert them to FASTQ
fastq-dump --split-files --gzip SRR1286228.sra
Run kallisto
kallisto quant -t 4 -b 100 -o SRR1171560 -i mouse.idx SRR1171560_1.fastq.gz SRR1171560_2.fastq.gz
-o SRR1171560: directory where the output is saved
-i mouse.idx: the generated index
Last two arguments: FASTQ files (two files: paired-end)
[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 88,198
[index] number of k-mers: 82,099,631
[index] number of equivalence classes: 297,305
[quant] running in paired-end mode
[quant] will process pair 1: SRR1171560_1.fastq.gz SRR1171560_2.fastq.gz
[quant] finding pseudoalignments for the reads ...
Depending on the transcriptome and the number of reads, it takes about 5-10 min on an ordinary PC
Transcript abundance
It is a tab-separated text file. Like other bioinformatics data, you can read it in the IPython Notebook
Analyze transcript abundance with IPython and Pandas
Columns: gene name (Ensembl ID), read count on gene, tpm
TPM: transcripts per million
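For reference, TPM can be derived from the estimated counts and effective lengths that kallisto reports next to it. The sketch below uses the usual definition with made-up numbers; the function name is hypothetical.

```python
def tpm_from_counts(est_counts, eff_lengths):
    """TPM_i = (count_i / eff_len_i) / sum_j(count_j / eff_len_j) * 1e6."""
    rates = [count / length for count, length in zip(est_counts, eff_lengths)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]

# Made-up counts and effective lengths for three transcripts
tpm_values = tpm_from_counts([100, 300, 600], [1000, 1000, 2000])
print([round(value, 1) for value in tpm_values])
```

The third transcript has twice the reads of the second but is twice as long, so both get the same TPM. By construction the values in one sample sum to one million.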
Sort based on the tpm
Highly expressed
Lowly expressed
Without transcript annotation, the table is difficult to interpret.
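A minimal sketch of the read-and-sort steps in Pandas. The three rows are made up (placeholder IDs), but the column names follow kallisto's abundance.tsv layout: target_id, length, eff_length, est_counts, tpm.

```python
import io
import pandas as pd

# Made-up rows in the layout of kallisto's abundance.tsv
example_tsv = (
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "ENSMUST_A\t3262\t3083.0\t120.0\t15.2\n"
    "ENSMUST_B\t902\t723.0\t0.0\t0.0\n"
    "ENSMUST_C\t2143\t1964.0\t850.0\t168.9\n"
)

# With real data, pass the file path instead, e.g. "SRR1171560/abundance.tsv"
abundance = pd.read_csv(io.StringIO(example_tsv), sep="\t")

# Highest tpm first; use ascending=True to see the lowly expressed end
abundance_sorted = abundance.sort_values("tpm", ascending=False)
print(abundance_sorted.head())
```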
Transcript annotation data
The transcript IDs are Ensembl IDs
Go to http://asia.ensembl.org/index.html?redirect=no
Select ‘BioMart’
Select Ensembl Genes 82
Select Mouse
Then click ‘Attributes’
Add the information you want to see
Add more information
Press ‘Go’
The file will be downloaded and saved as mart_export.txt
mart_export.txt
Copy it to the directory where the IPython Notebook is
Read mart_export.txt into an annotation DataFrame
Now we have two DataFrames to connect
Merge the annotation DataFrame into abundance based on the transcript ID
Some of the data are ‘NaN’ (not available). Fill them with blanks
Save the result as ‘abundance_plus’.
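The merge, fill and save steps might look like the sketch below. Both tables are made-up stand-ins, and the annotation column names ("Ensembl Transcript ID", "Associated Gene Name", "Description") are assumptions about the BioMart export header; check the first line of your own mart_export.txt.

```python
import pandas as pd

# Made-up stand-ins; the real tables come from abundance.tsv and mart_export.txt
abundance = pd.DataFrame({
    "target_id": ["ENSMUST_A", "ENSMUST_B", "ENSMUST_C"],
    "est_counts": [120.0, 0.0, 850.0],
    "tpm": [15.2, 0.0, 168.9],
})
annotation = pd.DataFrame({
    "Ensembl Transcript ID": ["ENSMUST_A", "ENSMUST_C"],
    "Associated Gene Name": ["GeneA", "GeneC"],
    "Description": ["hypothetical protein A", None],
})

# Left join keeps every transcript in abundance, annotated or not
abundance_plus = abundance.merge(
    annotation,
    how="left",
    left_on="target_id",
    right_on="Ensembl Transcript ID",
)

# Cells without annotation come back as NaN; replace them with empty strings
abundance_plus = abundance_plus.fillna("")
print(abundance_plus)
```

To keep the result on disk, abundance_plus.to_csv("abundance_plus.txt", sep="\t", index=False) is one option; the file name here is only a suggestion.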
Some counting: number of transcripts with TPM>1 and with TPM>10
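Counting transcripts above a TPM cutoff is a one-line boolean sum in Pandas. A small sketch on made-up values:

```python
import pandas as pd

# Made-up tpm values standing in for the abundance table
abundance_plus = pd.DataFrame({
    "target_id": ["tx1", "tx2", "tx3", "tx4"],
    "tpm": [0.0, 0.5, 5.0, 250.0],
})

n_total = len(abundance_plus)
n_over_1 = int((abundance_plus["tpm"] > 1).sum())    # True counts as 1
n_over_10 = int((abundance_plus["tpm"] > 10).sum())
print(n_total, n_over_1, n_over_10)
```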
Sort based on TPM
Same as before, but this time we have gene names and descriptions
The most abundant transcripts in your sample
What if we want to find all transcripts involved in a specific biological process?
GO : Gene Ontology
Keywords and a classification system for biological entities (proteins, genes, transcripts)
http://geneontology.org
http://www.ebi.ac.uk/QuickGO/GProtein?ac=O88569
Many keywords are associated with specific proteins
1. Retrieve all transcripts and their corresponding GO term associations
2. Search using the GO term name
3. Find the list of genes annotated with the specific GO terms
4. Find those genes in the transcript abundance table
Process
GO Term – Transcript associations
Go back to Ensembl BioMart; select Gene and Mus musculus
Results and Export data
Press ‘Go’ and download the file
Open mart_export.txt
Rename it to GO.txt and copy it to the working directory where the IPython notebook is
Many terms are associated with a single gene or transcript
Read GO – Transcript Associations
Transcripts associated with the GO Term Name ‘cell cycle’
Search for rows whose ‘GO Term Name’ contains ‘cell cycle’
Find the unique transcript IDs
Save them as cellcycleDF
cellcycleDF: transcripts associated with the GO term ‘cell cycle’
abundance_plus: whole transcriptome
Join on target_id, including only the rows common to both
Result: the part of the transcriptome associated with ‘cell cycle’
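The search, unique-ID and join steps can be sketched as follows. The tables are made up, and the GO table column names ("Ensembl Transcript ID", "GO Term Name") are assumptions about the BioMart export header.

```python
import pandas as pd

# Made-up GO associations; one row per (transcript, term) pair
go = pd.DataFrame({
    "Ensembl Transcript ID": ["ENSMUST_A", "ENSMUST_A", "ENSMUST_B", "ENSMUST_C"],
    "GO Term Name": ["cell cycle", "actin binding",
                     "regulation of mitotic cell cycle", "transport"],
})
abundance_plus = pd.DataFrame({
    "target_id": ["ENSMUST_A", "ENSMUST_B", "ENSMUST_C"],
    "tpm": [15.2, 3.1, 168.9],
})

# Rows whose term name contains the phrase (case-insensitive); missing names never match
hits = go[go["GO Term Name"].str.contains("cell cycle", case=False, na=False)]

# One row per transcript, even when several matching terms are attached
cellcycleDF = pd.DataFrame({"target_id": hits["Ensembl Transcript ID"].unique()})

# Inner join keeps only transcripts present in both tables
cellcycle_abundance = abundance_plus.merge(cellcycleDF, on="target_id", how="inner")
print(cellcycle_abundance)
```

Taking unique IDs before the join matters: without it, a transcript with several matching terms would appear several times in the joined table.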
Using two GO term names
“I want to search for transcripts associated with both ‘cell cycle’ and ‘actin’”
First, generate a DataFrame containing the transcripts associated with ‘actin’
cellcycleDF, actinDF
205 Transcripts
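One way to get the overlap of two GO searches is a set intersection of transcript IDs. This is a sketch on made-up associations; the helper name and the column names are assumptions.

```python
import pandas as pd

# Made-up GO associations; one row per (transcript, term) pair
go = pd.DataFrame({
    "Ensembl Transcript ID": ["ENSMUST_A", "ENSMUST_A", "ENSMUST_B", "ENSMUST_C"],
    "GO Term Name": ["cell cycle", "actin binding", "cell cycle", "actin cytoskeleton"],
})

def transcripts_with_term(go_table, phrase):
    """Set of transcript IDs with at least one GO term name containing phrase."""
    mask = go_table["GO Term Name"].str.contains(phrase, case=False, na=False)
    return set(go_table.loc[mask, "Ensembl Transcript ID"])

cellcycle_ids = transcripts_with_term(go, "cell cycle")
actin_ids = transcripts_with_term(go, "actin")

# Transcripts annotated with both kinds of term
both = sorted(cellcycle_ids & actin_ids)
print(both)
```

Note that a substring search for "actin" also matches longer words such as "actinin", so the hit list is worth a quick look before trusting the count.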
Differential Gene Expression (DGE)
Observing one transcriptome is informative.
But comparing two or more transcriptomes is more informative.
Different stages
WT vs. mutant?
Different treatments
You may think like this:
Quantification of each sample
Sample A Sample B
Just find the list of genes that are higher in Sample A. Simple!
Not really…
Two factors
Repeat
Although RNA-Seq contains a great deal of information, a single RNA-Seq run is just ONE experiment. You need to repeat it and show statistical significance between conditions!
Multiple comparisons
“OK. We repeated treated and control three times each. Compare the TPM of each gene, do a statistical test for each gene separately, and if p<0.05, it is significantly different”
“If you compare many things simultaneously, something is bound to look different by chance”
“If you make many comparisons, you should make the significance threshold more stringent”
-> Not good
Inference of Differential Expression is not trivial
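To see why per-gene p<0.05 is not enough: with the 88,198 targets in the mouse index, testing each one at 0.05 would be expected to flag roughly 4,400 transcripts by chance even if nothing differed. One common adjustment is the Benjamini-Hochberg procedure, which controls the false discovery rate. The sketch below is a generic textbook version on made-up p-values; it is not a description of how any particular tool works internally.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q is the minimum over larger ranks
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Made-up p-values for five transcripts
pvals = [0.001, 0.04, 0.03, 0.20, 0.50]
qvals = benjamini_hochberg(pvals)
print([round(q, 4) for q in qvals])
```

In this toy example three transcripts have p<0.05, but only the first still passes a 0.05 cutoff after adjustment.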
For kallisto-generated quantification, use sleuth
http://pachterlab.github.io/sleuth/
Installation of R and RStudio
First, install R https://www.r-project.org
RStudio is a working environment (IDE) for R
https://www.rstudio.com
Launch RStudio
Install sleuth and its dependencies
source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
install.packages("devtools")
devtools::install_github("pachterlab/sleuth")
Differential Expression Datasets
Three datasets for mouse oocytes (SRR1286228, SRR1286230, SRR1286231)
http://sra.dnanexus.com/studies/SRP009468/runs
Three datasets for two-cell mouse embryos (SRR385622, SRR385623, SRR385624)
Download and convert to FASTQ
Convert to FASTQ and run kallisto
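Running kallisto on six samples by hand is error-prone, so the command lines can be generated in a loop. This sketch only builds and prints the commands; it assumes paired-end runs and the mouse.idx index from the earlier example, and the helper name is made up.

```python
def kallisto_quant_command(run_id, index="mouse.idx", threads=4, bootstraps=100):
    """Build the kallisto quant argument list for one paired-end run."""
    return [
        "kallisto", "quant",
        "-t", str(threads),
        "-b", str(bootstraps),
        "-o", run_id,              # output directory named after the run
        "-i", index,
        run_id + "_1.fastq.gz",    # read 1, as written by fastq-dump --split-files --gzip
        run_id + "_2.fastq.gz",    # read 2
    ]

runs = ["SRR1286228", "SRR1286230", "SRR1286231",   # MII oocytes
        "SRR385622", "SRR385623", "SRR385624"]      # two-cell embryos

commands = [kallisto_quant_command(run) for run in runs]
for cmd in commands:
    print(" ".join(cmd))
    # To actually run it (needs kallisto on PATH and the FASTQ files present):
    # import subprocess; subprocess.run(cmd, check=True)
```

Naming each output directory after its run ID keeps the six results apart and makes them easy to list in the sample description file.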
Then make a text file describing the samples
MII oocyte datasets (3 sets)
2-cell datasets (3 sets)
Organize the kallisto output directories like this
And save the text file as study_design.txt
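The sample description file can be written from Python as well. The layout below (tab-separated, with "sample" and "condition" columns) and the condition labels are assumptions for illustration; check which columns the analysis script you use actually expects.

```python
import csv
import io

# Assumed layout: one row per sample; column names and labels are hypothetical
samples = [
    ("SRR1286228", "MII_oocyte"),
    ("SRR1286230", "MII_oocyte"),
    ("SRR1286231", "MII_oocyte"),
    ("SRR385622", "two_cell"),
    ("SRR385623", "two_cell"),
    ("SRR385624", "two_cell"),
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["sample", "condition"])
writer.writerows(samples)
study_design_text = buffer.getvalue()
print(study_design_text)

# To save it next to the kallisto output directories:
# with open("study_design.txt", "w") as handle:
#     handle.write(study_design_text)
```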
Download this script and save it in the working directory
https://gist.github.com/madscientist01/a49574b7fba18e65818a
Change this to your working directory
study_design.txt should be in the same directory
Open the Anal.R file
If the analysis finishes without problems…
Quality Check
Variation between replicates?
Much better correlation between replicates
Higher expression
Lower expression
High Fold Change
No Change
Differentially expressed
Not Differentially expressed
q-value: false discovery rate
Log fold change
Search by gene name
Gene-level expression
One isoform
The second isoform has a much higher expression level.
Abnormality?
Download the table and you can analyze it in Pandas.
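A minimal sketch of filtering a downloaded results table in Pandas. The rows are made up, and the file format (CSV) and column names (target_id, pval, qval, b) are assumptions; check the header of the file you actually download.

```python
import io
import pandas as pd

# Made-up rows; b stands for the estimated effect size of the condition
example_csv = (
    "target_id,pval,qval,b\n"
    "ENSMUST_A,0.00001,0.0004,3.2\n"
    "ENSMUST_B,0.20,0.45,0.1\n"
    "ENSMUST_C,0.0003,0.006,-2.5\n"
)

# With real data, pass the path of the downloaded file instead
results = pd.read_csv(io.StringIO(example_csv))

# Keep transcripts passing a false discovery rate cutoff of 0.05
significant = results[results["qval"] < 0.05]

# Split by direction of change; which condition counts as "up"
# depends on the reference level used in the model
up = significant[significant["b"] > 0].sort_values("qval")
down = significant[significant["b"] < 0].sort_values("qval")
print(len(significant), list(up["target_id"]), list(down["target_id"]))
```

The filtered table can then be merged with the annotation DataFrame on target_id, the same way the abundance table was annotated.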
Continues…
Assignments
• Install kallisto and sleuth (R and RStudio)
• Download the SRA dataset SRR385622
• Run kallisto on the sample