Computer Skills for Biological Research: Lecture 10

Computational Skill for Modern Biology Research

Department of Biology, Chungbuk National University

10th Lecture 2015.11.17

NGS Analysis III : RNA quantification with kallisto & DE

Syllabus

Week / Topic

Week 1 Introduction: Why do we need to learn this stuff?

Week 2 Basics of Unix and running BLAST on your PC

Week 3 Unix Command Prompt II and shell scripts

Week 4 Basics of programming (Python programming)

Week 5 Python Scripting II and sequence manipulations

Week 6 IPython Notebook and pandas

Week 7 Basics of Next Generation Sequencing and Tutorial

Week 8

Week 9 Next Generation Sequencing Analysis I

Week 10 Next Generation Sequencing Analysis II

Week 11 Next Generation Sequencing Analysis III

Week 12 Bioconductor I

Week 13 Bioconductor II

Week 14 Network analysis

Conventional RNA-Seq Analysis

Sequencing Read Mapping on reference genome

Read Quantifications

Calculation of FPKM

Differential Expression Analysis

Bottleneck

Too time-consuming

Benchmark: 30 million paired-end reads. All processing was done using 20 cores, with programs run with 20 threads

http://arxiv.org/pdf/1505.02710v2.pdf

Even on a 20-core CPU server, it took a serious amount of time.

A more efficient way to quantify the transcriptome is needed.

kallisto: Near-optimal RNA-Seq quantification

https://pachterlab.github.io/kallisto/

http://arxiv.org/abs/1505.02710

Do we really need to align RNA sequencing reads to the genome?

The transcriptome is usually far smaller than the genome

Sometimes we only need to know which reads correspond to which isoforms

Download and install

https://pachterlab.github.io/kallisto/download.html

cd ~
wget https://github.com/pachterlab/kallisto/releases/download/v0.42.4/kallisto_mac-v0.42.4.tar.gz
tar -xvzf kallisto_mac-v0.42.4.tar.gz

Then add the kallisto directory to PATH (in ~/.bash_profile)

kallisto 0.42.4

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

index     Builds a kallisto index
quant     Runs the quantification algorithm
h5dump    Converts HDF5-formatted results to plaintext
version   Prints version information


Transcriptome index

You need to generate an index for the transcriptome

http://bio.math.berkeley.edu/kallisto/transcriptomes/

It is just a FASTA file containing all of the mRNA (cDNA) sequences annotated for your genome.

Download mouse transcriptome

http://bio.math.berkeley.edu/kallisto/transcriptomes/Mus_musculus.GRCm38.rel79.cdna.all.fa.gz

Generate index

kallisto index -i mouse.idx Mus_musculus.GRCm38.rel79.cdna.all.fa.gz

Arguments: index name (-i), then the transcriptome FASTA file

Now it is time to download some NGS data from the SRA (Sequence Read Archive).



http://sra.dnanexus.com

Enter keywords

Restrict the search type to ‘Transcriptome Analysis’

Select the runs and download the SRA URLs

Open the download_sra_urls.txt file

Add wget -c in front of each URL
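The ‘add wget -c’ step can also be scripted instead of edited by hand. A minimal Python sketch, assuming the downloaded file holds one URL per line; the helper name add_wget_prefix is made up for illustration:

```python
def add_wget_prefix(lines):
    """Prefix every non-empty URL line with 'wget -c ' (-c resumes partial downloads)."""
    out = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        if url.startswith("wget "):
            out.append(url)  # already prefixed; leave untouched
        else:
            out.append("wget -c " + url)
    return out


# One URL taken from the download log shown in this lecture, plus a blank line.
example = [
    "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR128/SRR1286228/SRR1286228.sra",
    "",
]
for command in add_wget_prefix(example):
    print(command)
```

To rewrite the real file, read its lines with open("download_sra_urls.txt"), pass them through the helper, and write the result back before running it with bash.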

bash download_sra_urls.txt
--2015-11-16 13:28:38-- ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR128/SRR1286228/SRR1286228.sra
 => ‘SRR1286228.sra’
Resolving ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)... 130.14.250.7, 2607:f220:41e:250::13
Connecting to ftp-trace.ncbi.nlm.nih.gov (ftp-trace.ncbi.nlm.nih.gov)|130.14.250.7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /sra/sra-instant/reads/ByRun/sra/SRR/SRR128/SRR1286228 ... done.
==> SIZE SRR1286228.sra ... 3488089695
==> PASV ... done. ==> RETR SRR1286228.sra ... done.
Length: 3488089695 (3.2G) (unauthoritative)

SRR1286228.sra 0%[ ] 67.01K 48.8KB/s

Download SRA

Convert them to FASTQ

fastq-dump --split-files --gzip SRR1286228.sra

Run kallisto

kallisto quant -t 4 -b 100 -o SRR1171560 -i mouse.idx SRR1171560_1.fastq.gz SRR1171560_2.fastq.gz

-o: directory where the output is saved

-i: the generated index

Last arguments: FASTQ files (two files: paired-end)

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 88,198
[index] number of k-mers: 82,099,631
[index] number of equivalence classes: 297,305
[quant] running in paired-end mode
[quant] will process pair 1: SRR1171560_1.fastq.gz SRR1171560_2.fastq.gz
[quant] finding pseudoalignments for the reads ...

Depending on the transcriptome size and the number of reads, it takes about 5-10 min on an ordinary PC

Transcript abundance

It is a tab-separated text file. Like other bioinformatics data, you can read it in an IPython Notebook

Analyze transcript abundance with IPython and pandas

Columns: transcript name (Ensembl ID), read count, tpm

TPM: transcripts per million
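To make the TPM definition concrete: each read count is divided by the transcript's effective length, and the resulting rates are rescaled so that they sum to one million. kallisto already reports TPM itself; the sketch below only illustrates the arithmetic on toy numbers (the function name and the values are made up for illustration):

```python
def tpm(est_counts, eff_lengths):
    """Convert per-transcript read counts to transcripts per million (TPM)."""
    # Reads per base of effective length: longer transcripts yield more reads,
    # so counts are normalized by length first.
    rates = [count / length for count, length in zip(est_counts, eff_lengths)]
    total = sum(rates)
    # Rescale so that all TPM values sum to one million.
    return [rate / total * 1e6 for rate in rates]


# Toy numbers (not real data): three transcripts.
toy_counts = [100, 200, 300]
toy_lengths = [1000, 1000, 3000]
values = tpm(toy_counts, toy_lengths)
print(values)
```

In this toy case the third transcript has the most reads but also three times the length, so it ends up with the same TPM as the first one (about 250,000 each, with about 500,000 for the second).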

Sort by tpm

Highly expressed

Lowly expressed
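Reading the table and sorting it takes only a few lines of pandas. A minimal sketch, using a small in-memory stand-in for the kallisto abundance file; the transcript names and values are toy data, and with a real run you would pass the path of the abundance file in the output directory to read_csv instead:

```python
import io

import pandas as pd

# Toy stand-in for kallisto's tab-separated abundance table (made-up values).
toy_file = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx_A\t1500\t1321.0\t12.0\t3.5\n"
    "tx_B\t900\t721.0\t250.0\t120.0\n"
    "tx_C\t2100\t1921.0\t0.0\t0.0\n"
)

abundance = pd.read_csv(toy_file, sep="\t")

# Highest TPM first: highly expressed transcripts at the top,
# lowly expressed ones at the bottom.
by_tpm = abundance.sort_values("tpm", ascending=False)
print(by_tpm.head())
```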

Without transcript annotation, the results are difficult to interpret.

Transcript annotation data

The transcript IDs are Ensembl IDs

Go to http://asia.ensembl.org/index.html?redirect=no

Select ‘BioMart’

Select ‘Ensembl Genes 82’

Select Mouse

Then click ‘Attributes’

Add the information you want to see

Add more information

Press ‘Go’

The file will be downloaded and saved as mart_export.txt

mart_export.txt

Copy it to the directory where the IPython Notebook is

Read mart_export.txt into an annotation DataFrame

Now we have two DataFrames to connect

Merge the annotation DataFrame into abundance based on transcript ID

Some values are ‘NaN’ (missing). Fill them with blanks

Save the result as ‘abundance_plus’.

Some counting: number of transcripts with TPM>1 and TPM>10
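The merge, fill and count steps can be sketched as follows. The two DataFrames are toy stand-ins, and the annotation column names (‘Ensembl Transcript ID’, ‘Associated Gene Name’, ‘Description’) are assumptions: use whatever headers your own mart_export.txt contains.

```python
import pandas as pd

# Toy stand-ins (made-up values) for the kallisto table and the BioMart export.
abundance = pd.DataFrame({
    "target_id": ["tx_A", "tx_B", "tx_C"],
    "tpm": [0.5, 5.0, 50.0],
})
annotation = pd.DataFrame({
    "Ensembl Transcript ID": ["tx_A", "tx_B"],
    "Associated Gene Name": ["geneA", "geneB"],
    "Description": ["toy description A", None],
})

# Left join: keep every quantified transcript, annotated or not.
abundance_plus = abundance.merge(
    annotation,
    how="left",
    left_on="target_id",
    right_on="Ensembl Transcript ID",
)

# Replace missing annotation (NaN) with empty strings, text columns only.
text_cols = ["Ensembl Transcript ID", "Associated Gene Name", "Description"]
abundance_plus[text_cols] = abundance_plus[text_cols].fillna("")

# Count transcripts above each TPM threshold.
n_tpm1 = int((abundance_plus["tpm"] > 1).sum())
n_tpm10 = int((abundance_plus["tpm"] > 10).sum())
print(n_tpm1, n_tpm10)
```

A left join is used here so that transcripts missing from the annotation are kept with blank fields rather than dropped.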

Sort based on TPM

Same as before, but this time we have gene names and descriptions

The most abundant transcripts in your sample

What if we want to find all transcripts involved in a specific biological process?

GO : Gene Ontology

Keywords and a classification system for biological entities (proteins, genes, transcripts)

http://geneontology.org

http://www.ebi.ac.uk/QuickGO/GProtein?ac=O88569

Many keywords are associated with a specific protein

Process

1. Retrieve all transcripts and their GO term associations

2. Search by GO term name

3. List the genes annotated with the specific GO term

4. Look up those genes in the transcript abundance table

GO Term – Transcript associations

Go back to Ensembl BioMart; select Genes and Mus musculus

Results and Export data

Press ‘Go’ and download the file

Open mart_export.txt

Rename it to GO.txt and copy it to the working directory where the IPython notebook is

Many terms are associated with a single gene or transcript

Read GO – Transcript Associations

Transcripts associated with the GO term name ‘cell cycle’

Search for ‘GO Term Name’ values containing ‘cell cycle’

Find the unique transcript IDs

Save them as cellcycleDF

cellcycleDF: transcripts associated with the GO term ‘cell cycle’

abundance_plus: whole transcriptome

Join on target_id, including only the rows common to both

Result: transcriptome associated with ‘cell cycle’
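The GO filtering and the join can be sketched in pandas like this. The tables are toy data, and the GO export column names (‘Ensembl Transcript ID’, ‘GO Term Name’) are assumed to match the headers in your GO.txt:

```python
import pandas as pd

# Toy GO export: one row per (transcript, GO term) pair, so IDs repeat.
go = pd.DataFrame({
    "Ensembl Transcript ID": ["tx_A", "tx_A", "tx_B", "tx_C"],
    "GO Term Name": [
        "cell cycle",
        "mitotic cell cycle",
        "actin binding",
        "cell cycle arrest",
    ],
})
abundance_plus = pd.DataFrame({
    "target_id": ["tx_A", "tx_B", "tx_C", "tx_D"],
    "tpm": [12.0, 3.0, 0.4, 80.0],
})

# Rows whose GO term name contains 'cell cycle' (case-insensitive).
hits = go[go["GO Term Name"].str.contains("cell cycle", case=False, na=False)]

# One row per unique transcript ID.
cellcycleDF = pd.DataFrame({"target_id": hits["Ensembl Transcript ID"].unique()})

# Inner join: only transcripts present in both tables survive.
cellcycle_abundance = abundance_plus.merge(cellcycleDF, on="target_id", how="inner")
print(cellcycle_abundance)
```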

Using two GO term names

“I want to search transcripts associated with ‘cell cycle’ and ‘actin’”

First, generate DataFrame containing transcripts associated with ‘actin’

Transcripts in both cellcycleDF and actinDF: 205 transcripts
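Combining two GO terms amounts to intersecting two sets of transcript IDs. A minimal sketch on toy data; the helper transcripts_with_term and the column names are assumptions for illustration, not part of the original notebook:

```python
import pandas as pd


def transcripts_with_term(go, keyword):
    """Return the set of transcript IDs whose GO term name contains keyword."""
    mask = go["GO Term Name"].str.contains(keyword, case=False, na=False)
    return set(go.loc[mask, "Ensembl Transcript ID"])


# Toy GO export (made-up associations).
go = pd.DataFrame({
    "Ensembl Transcript ID": ["tx_A", "tx_A", "tx_B", "tx_C"],
    "GO Term Name": [
        "cell cycle",
        "actin binding",
        "actin filament organization",
        "cell cycle",
    ],
})

cellcycle_ids = transcripts_with_term(go, "cell cycle")
actin_ids = transcripts_with_term(go, "actin")

# Transcripts annotated with both keywords.
both = cellcycle_ids & actin_ids
print(sorted(both))
```

The same result can be reached by merging cellcycleDF and actinDF with an inner join on the transcript ID.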

Differential Gene Expression (DGE)

Observing one transcriptome is informative,

but comparing two or more transcriptomes is more informative.

Different stages

WT vs. mutant? Different treatments?

You may think like this:

Quantification of each sample

Sample A Sample B

Just find the genes that are higher in Sample A. Simple!

Not really…

Two factors

Replicates

Although RNA-Seq contains a lot of information, a single RNA-Seq run is just ONE experiment. You need to repeat it and show statistical significance between the conditions!

Multiple comparisons

“OK. We repeated treated and control three times each. Compare the TPM of each gene, do a statistical test for each gene separately, and if p<0.05, it is significantly different.”

-> Not good

“If you compare many things simultaneously, some of them will look different by chance.”

“If you make many comparisons, you need a more stringent significance threshold.”
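To see the scale of the problem: the kallisto index used in this lecture has 88,198 targets, so testing each one at p<0.05 would flag roughly 4,400 transcripts by chance alone even if nothing truly differed. One common way to tighten the threshold is the Benjamini-Hochberg procedure, which converts p-values into false-discovery-rate-adjusted values (q-values). The sketch below illustrates that procedure on toy p-values; it is not a description of how sleuth computes its results.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    # Indices of the p-values, smallest p first.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# Toy p-values (not real data).
pvals = [0.01, 0.04, 0.03, 0.20]
qvals = benjamini_hochberg(pvals)

n_raw = sum(p < 0.05 for p in pvals)
n_adjusted = sum(q < 0.05 for q in qvals)
print(qvals, n_raw, n_adjusted)
```

With these toy values, three tests pass the raw p<0.05 cutoff but only one remains below 0.05 after adjustment.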

Inference of Differential Expression is not trivial

For kallisto-generated quantifications, use sleuth

http://pachterlab.github.io/sleuth/

Installation of R and Rstudio

First, install R https://www.r-project.org

RStudio is a development environment for R

https://www.rstudio.com

Launch RStudio

Install sleuth and its dependencies

source("http://bioconductor.org/biocLite.R")

biocLite("rhdf5")

install.packages("devtools")

devtools::install_github("pachterlab/sleuth")

Differential Expression Datasets

Three datasets for mouse oocytes (SRR1286228, SRR1286230, SRR1286231)

http://sra.dnanexus.com/studies/SRP009468/runs

Three datasets for two-cell mouse embryos (SRR385622, SRR385623, SRR385624)

Download the runs

Convert them to FASTQ and run kallisto on each
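With six runs, it is convenient to generate the commands instead of typing them. A minimal Python sketch that builds the same fastq-dump and kallisto quant commands shown earlier for each run. It assumes every run is paired-end, that the .sra files are in the current directory, and that the index is named mouse.idx; the helper name commands_for_run is made up for illustration.

```python
import shlex

OOCYTE_RUNS = ["SRR1286228", "SRR1286230", "SRR1286231"]
TWO_CELL_RUNS = ["SRR385622", "SRR385623", "SRR385624"]


def commands_for_run(run, index="mouse.idx", threads=4, bootstraps=100):
    """Build the fastq-dump and kallisto quant argument lists for one run."""
    fastq_dump = ["fastq-dump", "--split-files", "--gzip", run + ".sra"]
    kallisto = [
        "kallisto", "quant",
        "-t", str(threads),
        "-b", str(bootstraps),
        "-o", run,            # output directory named after the run
        "-i", index,
        run + "_1.fastq.gz",  # assumption: paired-end, two FASTQ files
        run + "_2.fastq.gz",
    ]
    return [fastq_dump, kallisto]


# Print the commands; nothing is executed here.
for run in OOCYTE_RUNS + TWO_CELL_RUNS:
    for cmd in commands_for_run(run):
        print(" ".join(shlex.quote(part) for part in cmd))
```

To actually run them, pass each list to subprocess.run(cmd, check=True), or paste the printed lines into a shell script.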

Then make a text file describing the samples

MII oocyte datasets (3 sets)

2-cell datasets (3 sets)

Organize the kallisto output directories like this

And save the file as study_design.txt
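The sample description file can be written by hand or generated. The sketch below produces a tab-separated table with one row per run. The exact layout the analysis script expects is not shown in the slide text, so the column names (sample, condition) and the condition labels here are assumptions; check them against the script before use.

```python
import csv
import io

# (run accession, condition label); the labels are hypothetical.
SAMPLES = [
    ("SRR1286228", "MII_oocyte"),
    ("SRR1286230", "MII_oocyte"),
    ("SRR1286231", "MII_oocyte"),
    ("SRR385622", "two_cell"),
    ("SRR385623", "two_cell"),
    ("SRR385624", "two_cell"),
]


def study_design_text(samples):
    """Return a tab-separated sample table with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample", "condition"])  # assumed column names
    writer.writerows(samples)
    return buffer.getvalue()


design = study_design_text(SAMPLES)
print(design)
```

Writing the returned text to study_design.txt in the working directory gives a file with one header line and six sample lines.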

Download this script and save it into the working directory

https://gist.github.com/madscientist01/a49574b7fba18e65818a

Change this line to your working directory

study_design.txt should be in the same directory


Open the Anal.R file

If the analysis finishes without problems…

Quality Check

Variation between replicates?

Much better correlation between replicates

Higher expression

Lower expression

High Fold Change

No Change

Differentially expressed

Not Differentially expressed

Q : False Discovery Rate

Log Fold change

Search by gene name

Gene Level Expressions

One isoform

The second isoform has a much higher expression level.

Abnormality?

Download the table and you can analyze it in pandas.

To be continued…

Assignments

• Install kallisto and sleuth (with R and RStudio)

• Download the SRA dataset SRR385622

• Run kallisto on the sample
