Computational Skills for Biology Research, Lecture 8
TRANSCRIPT
Computational Skill for Modern Biology Research
Department of Biology, Chungbuk National University
8th Lecture 2015.11.3
NGS Analysis I: aligning NGS reads to a reference genome
Syllabus
Week  Topic
Week 1  Introduction: why do we need to learn this stuff?
Week 2  Basics of Unix and running BLAST on your PC
Week 3  Unix command prompt II and shell scripts
Week 4  Basics of programming (Python programming)
Week 5  Python scripting II and sequence manipulation
Week 6  IPython Notebook and Pandas
Week 7  Basics of next-generation sequencing and tutorial
Week 8
Week 9  Next Generation Sequencing Analysis I
Week 10  Next Generation Sequencing Analysis II
Week 11  R and statistical analysis
Week 12  Bioconductor I
Week 13  Bioconductor II
Week 14  Network analysis
What we can do with NGS data
Is there a reference sequence for your favorite organism?
Yes: Resequencing
NGS Sequencing Data -> Alignment with reference genome -> Output: variants (SNP, structural variations) -> Association study with phenotypes
No: De novo genome sequencing
NGS Sequencing Data -> Sequence Assembly -> Output: sequence contigs -> Gene predictions, functional classifications…
Resequencing
Reference sequences: well-established genome sequence
We are interested in understanding genome level differences
Snyder M et al. Genes Dev. 2010;24:423-431
SNP/Indel
Phased SNP
Deletion
Insertion
Inversion
ACGTTTGGATACTGCAAACCTATG
ACGTTTGTATACTGCAAACATATG
SNP (Single Nucleotide Polymorphisms)
• Change in Single Nucleotide Sequence
• Compared with the human reference sequence, an individual human has 3-4 million SNPs
• Some of them are very frequent, while others are very rare
- Common variant (20-40% frequency in populations)
- Rare variant (less than 1%)
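As a small illustration, the two example sequences shown above can be compared position by position to locate the single-nucleotide differences. This is only a sketch for two sequences that are already aligned and of equal length; the function name find_mismatches is made up for this example, and real variant calling works on many aligned reads rather than on one pair of strings.

```python
def find_mismatches(ref, sample):
    """Return (1-based position, ref base, sample base) for each mismatch."""
    if len(ref) != len(sample):
        raise ValueError("sequences must have the same length")
    return [
        (i + 1, r, s)
        for i, (r, s) in enumerate(zip(ref, sample))
        if r != s
    ]


# The two sequences from the slide
reference = "ACGTTTGGATACTGCAAACCTATG"
individual = "ACGTTTGTATACTGCAAACATATG"

snps = find_mismatches(reference, individual)
for pos, ref_base, alt_base in snps:
    print(f"position {pos}: {ref_base} -> {alt_base}")
# position 8: G -> T
# position 20: C -> A
```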
SNPs vs. SNVs
Both are single nucleotide variants
• SNP
- Known variant in the species (well characterized)
- Present at a specific frequency in populations
- Verified in populations
- Registered in dbSNP (http://www.ncbi.nlm.nih.gov/snp)
• SNV
- Variant found in a specific person (not well characterized)
- Very low frequency
Really a matter of frequency of occurrence
http://ccsb.stanford.edu/education/Nair_NGS.pptx
Single Nucleotide Variances
TGCAAACCTATG
Indel (Insertion/Deletion)
• Deletion or addition of bases (less than 1 kb)
- 300,000-600,000 indels per person
• Large-scale structural variation (more than 2 kbp)
- more than 1,000 per person
TGCAAAC-TATG
TGCAAACC-TATG
TGCAAACCCTATG
Today, we will learn how to find these variants from NGS sequencing data
- Reference genome sequences (FASTA format)
- Sequence data (FASTQ format)
Software
- bwa, samtools, bcftools
• Most of the software is Unix based
• For big eukaryotic genomes, it is difficult to run on an ordinary PC
• But for small eukaryotes or bacteria, it should be OK…
Workflow
Sequencing Data (FASTQ) + Reference Genome Sequence -> Mapping -> Alignment File (SAM format)
Some key information about NGS data
Single Read (SR) or Paired End (PE)
Read Length
Depth of Coverage (DNA)
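Depth of coverage is commonly estimated as the total number of sequenced bases divided by the genome size. The sketch below shows that arithmetic; the read count, read length, and genome size are made-up round numbers for illustration, not values taken from the dataset used later in this lecture.

```python
def depth_of_coverage(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Hypothetical paired-end run: 2,000,000 read pairs (so 4,000,000 reads),
# 100 bp per read, on a genome of about 12 Mb (roughly yeast-sized).
n_pairs = 2_000_000
coverage = depth_of_coverage(n_reads=2 * n_pairs,
                             read_length=100,
                             genome_size=12_000_000)
print(f"estimated coverage: {coverage:.1f}x")
```

For paired-end data each spot contributes two reads, which is why the pair count is doubled before the calculation.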
SRA
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=announcement
Sequence Read Archive : Repository for NGS Data
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP018525
In this study, 47 RNA sequencing experiments were performed (160.8 Gbp)
Sources, Accessions, Type of Experiments
Install SRA Toolkit
To download NGS data archived in NCBI/SRA, you need to download SRA Toolkit
http://www.ncbi.nlm.nih.gov/Traces/sra/?view=software
Extract the archive (Mac version shown here)
tar -xvzf sratoolkit.2.5.4-mac64.tar.gz
Set up the path
cd bin
pwd
Add the SRA Toolkit to your PATH: put the directory printed by pwd into the .bash_profile file in your home directory
Download sra file
Let's download a data file (it is BIG)
prefetch ERR560539
(ERR560539 is the SRA ID)
Maximum file size download limit is 20,971,520KB
2015-11-02T01:09:26 prefetch.2.5.4: 1) Downloading 'ERR560539'...
2015-11-02T01:09:26 prefetch.2.5.4: Downloading via http...
2015-11-02T01:23:08 prefetch.2.5.4: 1) 'SRR032988' was downloaded successfully
The file will be saved in ~/ncbi/public/sra
Convert sra file into FASTQ files
fastq-dump --split-files ERR560539
(ERR560539 is the <sra id>)
Read 1887328 spots for ERR560539
Written 1887328 spots for ERR560539
ls
ERR560539_1.fastq ERR560539_2.fastq
Paired-end reads: forward and reverse, each read 5' to 3'
See the end of the FASTQ file: each record contains a Sequence line and a Quality line
Size of file
About 2.9 GB
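To make the Sequence and Quality lines more concrete, here is a minimal sketch of reading FASTQ records in Python and decoding the quality string. It assumes the simple four-lines-per-record layout and the Phred+33 encoding used by current Illumina data; the function names and the tiny in-memory record are made up for this example.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality string) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break
        seq = handle.readline().rstrip()
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual  # drop the leading '@'


def phred_scores(qual, offset=33):
    """Convert a quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]


# A made-up record; for a real file use open("ERR560539_1.fastq") instead
example = io.StringIO("@read1\nACGT\n+\nII5#\n")
records = list(read_fastq(example))
name, seq, qual = records[0]
scores = phred_scores(qual)
print(name, seq, scores)
# read1 ACGT [40, 40, 20, 2]
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so 40 means about 1 error in 10,000 base calls and 20 means about 1 in 100.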
Quality Control of Fastq using FASTQC
Download and install FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Open Fastq file
Install bwa, samtools, bcftools
bwa: aligner for short Illumina reads against reference genome sequences
Genome sequence
Sequencing Data
Finds matches and aligns the sequences
samtools: converts data formats and finds variants in concert with bcftools
Install bwa, samtools, bcftools
1. Download source files and compile it based on the instructions
2. Install via Homebrew (Mac) or apt-get (Ubuntu linux)
https://github.com/lh3/bwa/
https://github.com/samtools/samtools/
https://github.com/samtools/bcftools
http://gatkforums.broadinstitute.org/discussion/2899/howto-install-all-software-packages-required-to-follow-the-gatk-best-practices
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Install Homebrew
brew tap homebrew/science
http://alejandrosoto.net/blog/2014/01/22/setting-up-my-mac-for-scientific-research/
brew install bwa
brew install samtools
brew install bcftools
samtools
bwa
bcftools
What we will do
Align sequencing reads to the reference genome
First, we will download the reference genome
https://support.illumina.com/sequencing/sequencing_software/igenome.html
We will use the Saccharomyces cerevisiae genome (sacCer3)
Download this file: https://support.illumina.com/sequencing/sequencing_software/igenome.html
Download the genome archive into the current directory and extract the genome sequence
Extract reference genome
tar -xvzf Saccharomyces_cerevisiae_UCSC_sacCer3.tar.gz
cp ./Saccharomyces_cerevisiae/UCSC/sacCer3/Sequence/BWAIndex/genome.fa .
mv genome.fa yeast
First, you need to generate an index file for the genome sequence
bwa index yeast
You can think of the 'index' as something like an address book for the genome, for fast access.
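The address-book idea can be sketched with a toy lookup table that maps every short substring (k-mer) of a genome to the positions where it occurs. This is only an analogy: bwa itself builds a compressed index based on the Burrows-Wheeler transform, not a plain dictionary, and the function name and the short example sequence here are made up.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Map each k-mer to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for i in range(len(genome) - k + 1):
        index[genome[i:i + k]].append(i)
    return index


# Toy "genome"; a real reference is millions of bases long
genome = "ACGTTTGGATACTGCAAACCTATG"
index = build_kmer_index(genome, k=4)

# Looking up a read fragment is now a dictionary access, not a full scan
hits = index.get("CTGC", [])
print(hits)
# [11]
```

Building the table costs time once, but afterwards every lookup is fast, which is the same trade-off that makes "bwa index" worth running before alignment.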
Then, download the NGS sequence data to analyze
Download ERR560539.sra
prefetch ERR560539
fastq-dump --split-files ERR560539
Saccharomyces cerevisiae isolated from wine
Running bwa (Align NGS reads into Reference)
bwa mem -t 4 yeast ERR560539_1.fastq ERR560539_2.fastq > ERR560539.sam
mem: alignment method (choose this if the NGS reads are longer than 50 bp)
-t 4: number of threads; if the CPU of your computer (server) has 4 cores, use -t 4
The two fastq files contain the NGS sequencing reads
The output is saved as the ERR560539.sam file
For the yeast alignment it takes 259.789 sec on a 4-core computer
Sam file
Records the location of each read in the reference file
Read Name, Starting Position
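Each alignment line in a SAM file is tab-separated, and the first fields are the read name, a bitwise flag, the reference sequence name, the 1-based starting position, the mapping quality, and the CIGAR string. The sketch below pulls those fields out of one line; the example line is made up for illustration and the function name is hypothetical.

```python
def parse_sam_line(line):
    """Extract the main mandatory fields from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "read_name": fields[0],
        "flag": flag,
        "chrom": fields[2],
        "pos": int(fields[3]),            # 1-based leftmost position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "is_unmapped": bool(flag & 4),    # flag bit 0x4
        "is_reverse": bool(flag & 16),    # flag bit 0x10
    }


# A made-up alignment line (not taken from ERR560539.sam)
line = "read1\t99\tchrI\t1500\t60\t10M\t=\t1700\t210\tACGTTTGGAT\tIIIIIIIIII"
aln = parse_sam_line(line)
print(aln["read_name"], aln["chrom"], aln["pos"])
# read1 chrI 1500
```

Header lines in a SAM file start with '@' and should be skipped before calling a parser like this.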
Convert Sam file to Bam file and indexing
Convert sam to bam (binary sam file)
samtools view -b -@ 4 ERR560539.sam > ERR560539.bam
-b: output a 'bam' file
-@ 4: use 4 threads (for a 4-core CPU)
Sort bam file
samtools sort -@ 4 ERR560539.bam ERR560539.sorted
(samtools 1.3 and later: samtools sort -@ 4 -o ERR560539.sorted.bam ERR560539.bam)
Generate index file
samtools index 941832.sorted.bam
941832.bam
941832.sam
941832.sorted.bam
941832.sorted.bam.bai
(941832 is the sample name in the slide's example; use your own sorted file, e.g. ERR560539.sorted.bam)
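If you prefer to keep the whole alignment workflow in one place, the commands from the last few slides can be assembled in a short Python script. The sketch below only builds and prints the command lines and does not run anything; the function name build_pipeline is made up, and the sort step assumes the newer samtools syntax with -o (samtools 1.3 or later).

```python
import shlex


def build_pipeline(reference, fastq1, fastq2, sample, threads=4):
    """Return (command, stdout file or None) for each step of the workflow."""
    sam = f"{sample}.sam"
    bam = f"{sample}.bam"
    sorted_bam = f"{sample}.sorted.bam"
    t = str(threads)
    return [
        # Align paired-end reads; bwa writes SAM to standard output
        (["bwa", "mem", "-t", t, reference, fastq1, fastq2], sam),
        # Convert SAM to binary BAM
        (["samtools", "view", "-b", "-@", t, sam], bam),
        # Sort by coordinate (assumes samtools 1.3 or later)
        (["samtools", "sort", "-@", t, "-o", sorted_bam, bam], None),
        # Index the sorted BAM
        (["samtools", "index", sorted_bam], None),
    ]


steps = build_pipeline("yeast", "ERR560539_1.fastq",
                       "ERR560539_2.fastq", "ERR560539")
for cmd, stdout_file in steps:
    text = shlex.join(cmd)
    if stdout_file:
        text += f" > {stdout_file}"
    print(text)
```

To actually execute a step, each command list could be passed to subprocess.run with check=True, with stdout pointed at the output file where one is given.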
Now what?
Let's visualize the data: Integrative Genomics Viewer (IGV)
https://www.broadinstitute.org/igv/download
https://www.broadinstitute.org/software/igv/download
In our examples, select sacCer3
Zoom in
Zoom out
Select chromosome
Locations
Load bam file
File->Load from file-> Select yeast.sorted.bam
SNP
Gene
Zoom it
Reference: C, Sequenced: T
Missing in Sequenced Genome?
Low sequencing Depth
Find out Variants
samtools mpileup -g -f yeast yeast.sorted.bam > yeast.bcf
Examines every position in the genome and checks the alignment
Finds the likelihood of an alternative allele
bcftools call -c -v yeast.bcf > yeast.vcf
Writes out the variants as yeast.vcf
Open yeast.vcf in nano editor
Header
Variants
DP: raw read depth
How many sequence reads support this variant?
DP4: number of high-quality ref-forward, ref-reverse, alt-forward and alt-reverse bases
Visualize
‘Load from Files’ in IGV
Select VCF file (yeast.vcf)
SNV
Data mining from variant data
A VCF file is just a text file, so we can handle it with Unix utilities and Pandas
head -n 50 yeast.vcf
Prints the first 50 lines of the yeast.vcf file
Header lines start with '##'
We want to remove the lines that start with '##'. How?
grep -v "^##" yeast.vcf > yeast2.vcf
Displays every line except those starting with '##' and saves the result as yeast2.vcf
We can do primitive filtering with grep
grep 'chrIX' yeast2.vcf | wc -l
2818
Displays the variants on chrIX and counts the lines
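The same count can be made in Python by looking only at the first column of each line. One reason to do so: grep matches the pattern anywhere in a line, so a pattern such as 'chrI' would also match chrII, chrIII, chrIV, and chrIX. The sketch below counts variants per chromosome by exact name; the function name and the three-variant demo text are made up for illustration.

```python
import io
from collections import Counter


def count_variants_by_chrom(handle):
    """Count variant lines per chromosome, skipping all '#' header lines."""
    counts = Counter()
    for line in handle:
        if line.startswith("#"):
            continue
        chrom = line.split("\t", 1)[0]
        counts[chrom] += 1
    return counts


# Made-up miniature VCF; for real data use open("yeast.vcf") instead
demo = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chrI\t100\t.\tC\tT\t50\t.\tDP=14\n"
    "chrIX\t200\t.\tG\tA\t60\t.\tDP=20\n"
    "chrIX\t300\t.\tA\tG\t70\t.\tDP=30\n"
)
counts = count_variants_by_chrom(demo)
print(counts["chrI"], counts["chrIX"])
# 1 2
```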
Filtering with Pandas
Data mining using ipython Notebook
The information is stored as DP=14;VDB=2.6447e-06… We want to convert it into columns of a DataFrame. How?
Define functions
DP=81;VDB=5.92922e-11;SGB=-0.693147;MQSB=1;MQ0…
Convert the string into a dictionary
{'DP':81, 'VDB':5.92922e-11, 'SGB':-0.693147…}
View a single column as a Series
Apply the 'split' function to each row
Convert to a list
Generate a DataFrame from the list
Save it as a new DataFrame named info
Select two columns of info (DP, MQ) and add them to vcf
Keep variants with DP (read depth) higher than 50 and MQ (mapping quality) higher than 30
How many filtered SNVs are found on 'chrI'?
Unfiltered
Save the filtered VCF data back…
Save as vcf3.vcf
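The notebook steps above can be sketched roughly as follows. This is not the exact code from the lecture notebook, which is not part of this transcript; the helper name parse_info and the three-variant demo table are made up, and the demo omits the FORMAT and sample columns that a real VCF from bcftools would contain.

```python
import io

import pandas as pd


def parse_info(info):
    """Convert 'DP=81;MQ=60;INDEL' into {'DP': 81.0, 'MQ': 60.0, 'INDEL': True}."""
    result = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if not sep:               # flag without a value, e.g. INDEL
            result[key] = True
            continue
        try:
            result[key] = float(value)
        except ValueError:        # keep non-numeric values as text
            result[key] = value
    return result


# Made-up stand-in for yeast2.vcf (the '##' lines are already removed)
demo = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chrI\t100\t.\tC\tT\t221\t.\tDP=81;MQ=60\n"
    "chrI\t250\t.\tG\tA\t30\t.\tDP=14;MQ=20\n"
    "chrII\t400\t.\tA\tG\t200\t.\tDP=120;MQ=45\n"
)
vcf = pd.read_csv(demo, sep="\t")

# One dictionary per row, then a DataFrame with one column per INFO key
info = pd.DataFrame([parse_info(s) for s in vcf["INFO"]], index=vcf.index)
vcf = vcf.join(info[["DP", "MQ"]])

# Keep variants with read depth > 50 and mapping quality > 30
filtered = vcf[(vcf["DP"] > 50) & (vcf["MQ"] > 30)]
n_chrI = int((filtered["#CHROM"] == "chrI").sum())
print(len(filtered), n_chrI)

# Write the original VCF columns back out; pass a file name such as
# "vcf3.vcf" instead of the buffer to save to disk
out = io.StringIO()
filtered.drop(columns=["DP", "MQ"]).to_csv(out, sep="\t", index=False)
```

Because the column header line begins with '#CHROM', the saved table keeps that line, and only the '##' header lines need to be attached again afterwards.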
grep "^##" yeast.vcf > header.vcf
Extract the header region of the VCF
cat header.vcf vcf3.vcf > filtered.vcf
Attach the header back
Open both in IGV and compare the original variant calls with the filtered ones.
Filtered
Original
Common Question Examples..
1. Find all SNVs located in exons
2. Find SNVs located in the promoter regions of genes
3. Find SNVs located in specific genes of interest
4. Filter for SNVs that cause loss of function in genes
… You need another set of tools to answer these questions
We will look at them in the next lectures
SNV Filtering
Pre-processing in the mapping phase and SNV filtering help minimize false positives
• Absent in dbSNP
• Exclude LOH events
• Retain non-synonymous
• Sufficient depth of read coverage
• SNV present in given number of reads
• High mapping and SNV quality
• SNV density in a given bp window
• SNV greater than a given bp from a predicted indel
• Strand balance/bias
• Concordance across various SNV callers
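Two of the criteria above, the number of reads carrying the SNV and strand balance, can be checked from the DP4 value that samtools/bcftools write into the VCF INFO field (ref-forward, ref-reverse, alt-forward, alt-reverse counts). The sketch below is one simple way to apply them; the function names and the threshold of 4 alternate reads are arbitrary choices for illustration, not recommended settings.

```python
def dp4_summary(dp4):
    """Summarize a DP4 string 'ref_fwd,ref_rev,alt_fwd,alt_rev'."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = (int(x) for x in dp4.split(","))
    alt_reads = alt_fwd + alt_rev
    total = ref_fwd + ref_rev + alt_reads
    return {
        "alt_reads": alt_reads,
        "alt_fraction": alt_reads / total if total else 0.0,
        "alt_on_both_strands": alt_fwd > 0 and alt_rev > 0,
    }


def passes_filters(dp4, min_alt_reads=4):
    """Require enough alternate reads and support from both strands."""
    summary = dp4_summary(dp4)
    return (summary["alt_reads"] >= min_alt_reads
            and summary["alt_on_both_strands"])


# Made-up DP4 values
balanced = passes_filters("2,1,10,12")    # 22 alt reads on both strands
one_strand = passes_filters("0,0,15,0")   # alt reads on forward strand only
print(balanced, one_strand)
# True False
```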
http://ccsb.stanford.edu/education/Nair_NGS.pptx
Variant Annotation
• Interpretation of the variants actually found
• SeattleSeq
- annotation of known and novel SNPs
- includes dbSNP rs ID, gene names and accession numbers, SNP functions (e.g., missense), protein positions and amino-acid changes, conservation scores, HapMap frequencies, PolyPhen predictions, and clinical association
• Annovar
- Gene-based annotation
- Region-based annotations
- Filter-based annotation
http://snp.gs.washington.edu/SeattleSeqAnnotation/
http://www.openbioinformatics.org/annovar/
http://ccsb.stanford.edu/education/Nair_NGS.pptx