Computational Skills for Biological Research, Lecture 8

Computational Skill for Modern Biology Research

Department of Biology, Chungbuk National University

8th Lecture 2015.11.3

NGS Analysis I: aligning NGS reads to a reference genome

Syllabus

Week / Topic

Week 1: Introduction: Why do we need to learn this stuff?

Week 2: Basics of Unix and running BLAST on your PC

Week 3: Unix Command Prompt II and shell scripts

Week 4: Basics of programming (Python programming)

Week 5: Python Scripting II and sequence manipulations

Week 6: IPython Notebook and Pandas

Week 7: Basics of Next Generation Sequencing and Tutorial

Week 8, Week 9: Next Generation Sequencing Analysis I

Week 10: Next Generation Sequencing Analysis II

Week 11: R and statistical analysis

Week 12: Bioconductor I

Week 13: Bioconductor II

Week 14: Network analysis

What we can do with NGS data

Is there a reference sequence for your favorite organism?

Yes: Resequencing

NGS Sequencing Data -> Alignment with reference genome -> Output: variants (SNPs, structural variations) -> Association study with phenotypes

No: De novo genome sequencing

NGS Sequencing Data -> Sequence Assembly -> Output: sequence contigs -> Gene predictions, functional classifications, …

Resequencing

Reference sequence: a well-established genome sequence

We are interested in understanding genome level differences

Snyder M et al. Genes Dev. 2010;24:423-431

SNP/Indel

Phased SNP

Deletion

Insertion

Inversion

ACGTTTGGATACTGCAAACCTATG

ACGTTTGTATACTGCAAACATATG

SNP (Single Nucleotide Polymorphisms)

• A change at a single nucleotide position

• Compared with the human reference sequence, an individual human carries 3 to 4 million SNPs

• Some of them are very frequent, while others are very rare

- Common variant (20-40% frequency in populations)
- Rare variant (less than 1%)
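The two 24-base sequences shown on this slide can be compared position by position to locate the substitutions. The following is a minimal sketch, assuming the two sequences are already aligned and have the same length (no indels); the function name `find_mismatches` is introduced here for illustration only.

```python
def find_mismatches(reference, sample):
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same length (indels are not handled)")
    return [
        (i + 1, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# The two sequences from the slide
reference = "ACGTTTGGATACTGCAAACCTATG"
sample = "ACGTTTGTATACTGCAAACATATG"

for pos, ref_base, alt_base in find_mismatches(reference, sample):
    print(f"position {pos}: {ref_base} -> {alt_base}")
# position 8: G -> T
# position 20: C -> A
```

Real variant callers work from many overlapping reads rather than one sequence, so this only illustrates what a single-nucleotide difference looks like.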

SNPs vs. SNVs

Both are single-nucleotide variants

• SNP

– Known variant in the species (well characterized)
– Known variants exist at a specific frequency in populations
– Verified in a population
– Registered in dbSNP (http://www.ncbi.nlm.nih.gov/snp)

• SNV

– Specific variant found in a specific person (not well characterized)
– Very low frequency

Really a matter of frequency of occurrence

http://ccsb.stanford.edu/education/Nair_NGS.pptx

Single Nucleotide Variants

http://www.ncbi.nlm.nih.gov/snp

TGCAAACCTATG

Indel (Insertion/Deletion)

• Deletion or insertion of bases (less than 1 kb)

- 300,000-600,000 indels per person

• Large-scale structural variation (more than 2 kbp)

- more than 1,000 per person

TGCAAAC-TATG

TGCAAACC-TATG

TGCAAACCCTATG

Today, we will learn how to find these variants from NGS sequencing data

- Reference genome sequence (FASTA format)
- Sequence data (FASTQ format)

Software

- bwa, samtools, bcftools

• Most of the software is Unix-based
• For big eukaryotic genomes, it is difficult to run on an ordinary PC
• But for a small eukaryote or bacteria, it should be OK…

WorkFlow

Sequencing Data (FASTQ) + Reference Genome Sequence -> Mapping -> Alignment File (SAM format)

Some useful information about NGS data

Single Read (SR) or Paired End (PE)

Read Length

Depth of Coverage (DNA)
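Depth of coverage is commonly estimated as (number of reads × read length) / genome size. The sketch below only illustrates that arithmetic; the read count, read length and genome size are hypothetical round numbers, not values from the data set used in this lecture.

```python
def depth_of_coverage(num_reads, read_length, genome_size):
    """Average number of times each base of the genome is covered by reads."""
    return num_reads * read_length / genome_size


# Hypothetical example: 1.2 million reads of 100 bp on a 12 Mbp genome
coverage = depth_of_coverage(num_reads=1_200_000, read_length=100, genome_size=12_000_000)
print(f"{coverage:.1f}x")
```

For paired-end data, both reads of each pair count toward `num_reads`.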

SRA

http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement

Sequence Read Archive : Repository for NGS Data

http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP018525

In this study, 47 RNA sequencing experiments were performed (160.8 Gbp)

Sources, Accessions, Type of Experiments

Install SRA Toolkit

To download NGS data archived in NCBI/SRA, you need to install the SRA Toolkit

http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software

Extract the archive (in the case of a Mac)

tar -xvzf sratoolkit.2.5.4-mac64.tar.gz

cd bin

pwd

Set up the path

Add sratoolkit to your PATH: add these lines to .bash_profile in your home directory

Download sra file

Let’s download a data file (it is BIG)

prefetch ERR560539 (SRA id)

Maximum file size download limit is 20,971,520KB

2015-11-02T01:09:26 prefetch.2.5.4: 1) Downloading 'ERR560539'...

2015-11-02T01:09:26 prefetch.2.5.4: Downloading via http...

2015-11-02T01:23:08 prefetch.2.5.4: 1) 'SRR032988' was downloaded successfully

The file will be saved in ~/ncbi/public/sra

Convert sra file into FASTQ file

fastq-dump --split-files ERR560539

(ERR560539 is the SRA id)

Read 1887328 spots for ERR560539

Written 1887328 spots for ERR560539

ls

ERR560539_1.fastq ERR560539_2.fastq

Paired-end reads

Forward read: 5’ -> 3’

Reverse read: 5’ -> 3’ on the opposite strand

See end of fastq file

Quality

Sequence
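Each FASTQ record has four lines: a header starting with `@`, the sequence, a `+` separator, and one quality character per base. The following is a minimal sketch of reading such records, assuming unwrapped four-line records and Phred+33 quality encoding; the example record is made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (read name, sequence, list of Phred scores) from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip()
        # Assumption: Phred+33 encoding, so score = ASCII code - 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Hypothetical record, not taken from ERR560539
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
print(records)
# [('read1', 'ACGT', [40, 40, 20, 0])]
```

To read a real file, pass `open("ERR560539_1.fastq")` in place of the `StringIO` object.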

Size of file

About 2.9 GB

Quality control of FASTQ files using FastQC

Download and install FastQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Open the FASTQ file

Install bwa, samtools, bcftools

bwa: aligns short Illumina reads to a reference genome sequence

Genome sequence

Sequencing data

Finds matching regions and aligns the sequences

samtools: converts data formats and finds variants in concert with bcftools

Install bwa, samtools, bcftools

1. Download the source files and compile them following the instructions

2. Install via Homebrew (Mac) or apt-get (Ubuntu Linux)

https://github.com/lh3/bwa/

https://github.com/samtools/samtools/

https://github.com/samtools/bcftools

http://gatkforums.broadinstitute.org/discussion/2899/howto-install-all-software-packages-required-to-follow-the-gatk-best-practices

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Install Homebrew

brew tap homebrew/science

http://alejandrosoto.net/blog/2014/01/22/setting-up-my-mac-for-scientific-research/

brew install bwa

brew install samtools

brew install bcftools

bwa: https://github.com/lh3/bwa/archive/master.zip

samtools: https://github.com/samtools/samtools/

bcftools: https://github.com/samtools/bcftools

What we will do

Align sequencing reads to the reference genome

First, we will download the reference genome

https://support.illumina.com/sequencing/sequencing_software/igenome.html

We will use the Saccharomyces cerevisiae genome (sacCer3)

From this page, download the archive Saccharomyces_cerevisiae_UCSC_sacCer3.tar.gz

Download the genome archive and copy the genome sequence into the current directory



Extract the reference genome

tar -xvzf Saccharomyces_cerevisiae_UCSC_sacCer3.tar.gz

cp ./Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BWAIndex/genome.fa .

mv genome.fa yeast

First, you need to generate an index file for the genome sequence

bwa index yeast

You can think of the index as something like an address book of the genome, used for fast access.
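The "address book" idea can be illustrated with a toy k-mer index: every short substring of the genome is stored together with the positions where it occurs, so a read can be placed without scanning the whole sequence. This is only a simplified sketch; bwa itself uses a different and far more compact data structure (based on the Burrows-Wheeler transform), and the function names here are made up for illustration.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map every k-mer in the genome to the list of its 0-based start positions."""
    index = defaultdict(list)
    for start in range(len(genome) - k + 1):
        index[genome[start:start + k]].append(start)
    return index


def lookup(index, read, k):
    """Return candidate 0-based positions where the first k bases of the read occur."""
    return index.get(read[:k], [])


# Toy "genome": the 24-base example sequence from the SNP slide
genome = "ACGTTTGGATACTGCAAACCTATG"
index = build_kmer_index(genome, k=4)
print(lookup(index, "TGCAAAC", k=4))
# [12]
```

Building the index costs time once, but each later lookup is a single dictionary access, which is why indexing is done before alignment.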

Then, download the NGS sequence data to analyze

Saccharomyces cerevisiae isolated from wine

Download ERR560539.sra

prefetch ERR560539

fastq-dump --split-files ERR560539

Running bwa (align NGS reads to the reference)

bwa mem -t 4 yeast ERR560539_1.fastq ERR560539_2.fastq > ERR560539.sam

mem: alignment method (select this if the NGS reads are longer than 50 bp)

-t: number of threads (if the CPU of your computer or server has 4 cores, use -t 4)

The two FASTQ files contain the NGS reads

The output is saved as ERR560539.sam

The yeast alignment took 259.789 sec on a 4-core computer

SAM file

Records the location of each read in the reference

Read Name

Starting Position
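A SAM alignment line is tab-separated, with eleven mandatory columns followed by optional tags; the read name and starting position highlighted on this slide are the first and fourth columns. Below is a minimal sketch of reading one line into a dictionary. The example line is made up for illustration and is not taken from ERR560539.sam.

```python
# The eleven mandatory SAM columns, in order
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Split one SAM alignment line (not an '@' header line) into named fields."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["tags"] = columns[11:]  # optional fields such as NM:i:0
    return record


# Hypothetical alignment line
line = "read1\t99\tchrI\t1234\t60\t8M\t=\t1400\t174\tACGTTTGG\tIIIIIIII\tNM:i:0"
rec = parse_sam_line(line)
print(rec["QNAME"], rec["RNAME"], rec["POS"])
# read1 chrI 1234
```

When reading a real SAM file, skip lines that start with `@`, since those are header lines.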

Convert the SAM file to a BAM file and index it

Convert SAM to BAM (binary SAM file)

samtools view -b -@ 4 ERR560539.sam > ERR560539.bam

-b: output a BAM file

-@ 4: use 4 threads (for a 4-core CPU)

Sort the BAM file

samtools sort -@ 4 ERR560539.bam ERR560539.sorted

(With samtools 1.3 or later, the output file is given with -o: samtools sort -@ 4 -o ERR560539.sorted.bam ERR560539.bam)

Generate the index file

samtools index ERR560539.sorted.bam

Resulting files: ERR560539.sam, ERR560539.bam, ERR560539.sorted.bam, ERR560539.sorted.bam.bai

Now what?

Let’s visualize the data: Integrative Genomics Viewer (IGV)

https://www.broadinstitute.org/igv/download

https://www.broadinstitute.org/software/igv/download

In our examples, select sacCer3

Zoom in, zoom out

Select chromosome

Locations

Load the BAM file

File -> Load from File -> select ERR560539.sorted.bam

SNP

Gene

Zoom it

Reference: C, Sequenced: T

Missing in Sequenced Genome?

Low sequencing Depth

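Find variants

samtools mpileup -g -f yeast ERR560539.sorted.bam > yeast.bcf

Examines every position in the genome and checks the alignment

Computes the likelihood of an alternative allele at each position

(In recent samtools releases the -g option has been removed from samtools mpileup; bcftools mpileup provides this step.)

bcftools call -c -v yeast.bcf > yeast.vcf

Writes the variants to yeast.vcf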
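The idea behind the pileup step can be sketched with a toy example: stack the aligned reads over the reference, count the bases seen at each position, and report positions where a non-reference base dominates. This is an illustration only; the real tools compute genotype likelihoods from base and mapping qualities. The reads, thresholds and function names below are all hypothetical.

```python
from collections import Counter


def toy_pileup(reference, aligned_reads):
    """Count bases per reference position. aligned_reads: (0-based start, sequence), no gaps."""
    columns = [Counter() for _ in reference]
    for start, read in aligned_reads:
        for offset, base in enumerate(read):
            columns[start + offset][base] += 1
    return columns


def call_candidates(reference, columns, min_depth=2, min_alt_fraction=0.5):
    """Return (1-based position, ref base, alt base, depth) where an alt base dominates."""
    candidates = []
    for pos, (ref_base, counts) in enumerate(zip(reference, columns)):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        alt_base, alt_count = max(
            ((base, n) for base, n in counts.items() if base != ref_base),
            key=lambda item: item[1],
            default=(None, 0),
        )
        if alt_base and alt_count / depth >= min_alt_fraction:
            candidates.append((pos + 1, ref_base, alt_base, depth))
    return candidates


reference = "TGCAAACCTATG"
# Hypothetical reads, each carrying a T instead of C at reference position 7
reads = [(0, "TGCAAAT"), (2, "CAAATCTA"), (4, "AATCTATG")]
print(call_candidates(reference, toy_pileup(reference, reads)))
# [(7, 'C', 'T', 3)]
```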

Open yeast.vcf in nano editor

Header

Variants

DP: raw read depth. How many sequence reads support this variant?

DP4: number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases
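The four DP4 counts make it possible to see how many high-quality bases support the alternative allele compared with the reference. A minimal sketch, assuming the DP4 value is given as four comma-separated integers in the order described above; the example value is hypothetical.

```python
def alt_fraction_from_dp4(dp4_value):
    """Fraction of high-quality bases supporting the alternative allele."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4_value.split(","))
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    alt = alt_fwd + alt_rev
    return alt / total if total else 0.0


# Hypothetical DP4 value: 4 reference-supporting and 12 alt-supporting bases
print(alt_fraction_from_dp4("2,2,6,6"))
# 0.75
```

A fraction near 1.0 means almost every read carries the variant, while a low fraction suggests weak support.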

Visualize

‘Load from Files’ in IGV

Select VCF file (yeast.vcf)

Data mining from variant data

A VCF file is just a text file, so we can handle it with Unix utilities and Pandas

head -n 50 yeast.vcf

Prints the first 50 lines of yeast.vcf

Header lines start with ‘##’

We want to remove the lines that start with ##. How?

grep -v "^##" yeast.vcf > yeast2.vcf

Displays all lines except those starting with ‘##’, and saves the result as yeast2.vcf

We can do primitive filtering with grep

grep 'chrIX' yeast2.vcf | wc -l

2818

Selects variants on chrIX and counts the lines
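The same two steps can be written in Python, which also makes one limitation of the grep approach visible: a pattern matches anywhere in the line, so a search for 'chrI' would also match chrII, chrIII, chrIV and chrIX. The sketch below compares the first column exactly instead. The VCF lines are made up for illustration.

```python
def strip_meta_lines(lines):
    """Drop '##' header lines, like grep -v '^##'."""
    return [line for line in lines if not line.startswith("##")]


def count_on_chromosome(lines, chrom):
    """Count variant lines whose first column is exactly the given chromosome."""
    return sum(
        1 for line in lines
        if not line.startswith("#") and line.split("\t")[0] == chrom
    )


# Hypothetical VCF content
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrI\t100\t.\tC\tT\t50\t.\tDP=20",
    "chrIX\t200\t.\tG\tA\t60\t.\tDP=30",
    "chrIX\t300\t.\tA\tG\t70\t.\tDP=40",
]
body = strip_meta_lines(vcf_lines)
print(len(body), count_on_chromosome(body, "chrIX"), count_on_chromosome(body, "chrI"))
# 4 2 1
```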

Filtering with Pandas

Data mining using ipython Notebook

The INFO values are stored as DP=14;VDB=2.6447e-06… We want to convert them into columns of a DataFrame. How?

Define functions

DP=81;VDB=5.92922e-11;SGB=-0.693147;MQSB=1;MQ0.

Convert the string into a dictionary

{‘DP’:81, ‘VDB’:5.92922e-11, ‘SGB’:-0.693147…}
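One way to write such a conversion function is sketched below. It is not necessarily the function used in the lecture notebook: the name `parse_info` and the handling of value types are assumptions. Keys without a value (flags) are stored as `True`, and values that are not numbers are kept as strings.

```python
def parse_info(info):
    """Convert a VCF INFO string such as 'DP=81;VDB=5.9e-11' into a dictionary."""
    result = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            result[key] = True  # flag without a value
            continue
        try:
            result[key] = int(value)
        except ValueError:
            try:
                result[key] = float(value)
            except ValueError:
                result[key] = value  # e.g. comma-separated lists stay as text
    return result


parsed = parse_info("DP=81;VDB=5.92922e-11;SGB=-0.693147;MQSB=1")
print(parsed)
# {'DP': 81, 'VDB': 5.92922e-11, 'SGB': -0.693147, 'MQSB': 1}
```

Applying this function to every row of the INFO column gives a list of dictionaries, which Pandas can turn into a DataFrame with one column per key.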

View single column as series

Apply the ‘split’ function to each row

Convert to a list

Generate a DataFrame from the list

Save it as a new DataFrame named info

Select two columns in info (DP, MQ) and add them to vcf

Keep variants whose DP (read depth) is higher than 50 and whose MQ (mapping quality) is higher than 30

How many filtered SNVs are found on ‘chrI’?

Unfiltered
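The filtering logic can be sketched in plain Python so that it runs without Pandas; in the notebook the same condition would be expressed as a boolean mask on the DataFrame. The variant records below are hypothetical.

```python
# Hypothetical variants with the DP and MQ values already extracted from INFO
variants = [
    {"CHROM": "chrI", "POS": 100, "DP": 81, "MQ": 60},
    {"CHROM": "chrI", "POS": 250, "DP": 14, "MQ": 60},   # fails: low depth
    {"CHROM": "chrI", "POS": 400, "DP": 90, "MQ": 20},   # fails: low mapping quality
    {"CHROM": "chrII", "POS": 120, "DP": 75, "MQ": 45},
]


def passes(variant, min_dp=50, min_mq=30):
    """Keep a variant only if both depth and mapping quality exceed the thresholds."""
    return variant["DP"] > min_dp and variant["MQ"] > min_mq


filtered = [v for v in variants if passes(v)]
on_chr1 = [v for v in filtered if v["CHROM"] == "chrI"]
print(len(filtered), len(on_chr1))
# 2 1
```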

Save the filtered VCF data back to a file…

Save as vcf3.vcf

grep "^##" yeast.vcf > header.vcf

Extracts the header region of the VCF

cat header.vcf vcf3.vcf > filtered.vcf

Attaches the header back

Open it in IGV and compare the original variant calls with the filtered ones.

Filtered

Original

Examples of common questions

1. Find all SNVs located in exons

2. Find SNVs located in the promoter regions of genes

3. Find SNVs located in specific genes of interest

4. Select SNVs that cause loss of function in genes

…You need another set of tools to answer these questions

We will look at them in the next lectures

SNV Filtering

Pre-processing in the mapping phase and SNV filtering help minimize false positives

• Absent in dbSNP
• Exclude LOH events
• Retain non-synonymous
• Sufficient depth of read coverage
• SNV present in given number of reads
• High mapping and SNV quality
• SNV density in a given bp window
• SNV greater than a given bp from a predicted indel
• Strand balance/bias
• Concordance across various SNV callers
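As an illustration of one criterion from this list, SNV density in a given bp window, the sketch below flags SNVs that sit in an unusually dense cluster, which often points to misaligned reads. The window size, threshold and positions are hypothetical choices, not recommended values.

```python
def dense_positions(positions, window=10, max_snvs=2):
    """Return positions lying in a span shorter than `window` bp that holds more than max_snvs SNVs."""
    positions = sorted(positions)
    flagged = set()
    left = 0
    for right, pos in enumerate(positions):
        # Shrink the window from the left until it spans less than `window` bp
        while pos - positions[left] >= window:
            left += 1
        if right - left + 1 > max_snvs:
            flagged.update(positions[left:right + 1])
    return sorted(flagged)


# Hypothetical SNV positions on one chromosome
print(dense_positions([100, 103, 107, 500, 900, 905]))
# [100, 103, 107]
```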


Variant Annotation

• Interpretation of the variants actually found

• SeattleSeq

– annotation of known and novel SNPs
– includes dbSNP rs ID, gene names and accession numbers, SNP functions (e.g., missense), protein positions and amino-acid changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association

• Annovar

– Gene-based annotation
– Region-based annotation
– Filter-based annotation

http://snp.gs.washington.edu/SeattleSeqAnnotation/

http://www.openbioinformatics.org/annovar/

