Bioinformatics and genome analysis (Bioinformatica e analisi dei genomi)
Academic year 2016/2017
Pierpaolo Maisano Delser
Background
Cusco, March 2009
• Bachelor's degree (Laurea Triennale): Biological Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;
• Master's degree (Laurea Specialistica): Biomolecular and Cellular Sciences, Università degli Studi di Ferrara, Dr. Silvia Fuselli;
• PhD in Genetics, University of Leicester, Prof. Mark A. Jobling;
• Post-doctoral fellow, EPHE-MNHN, Paris, Dr. Stefano Mona;
• Research Fellow, Trinity College Dublin, Prof. Daniel Bradley.
Muséum national d'Histoire naturelle - Paris
Trinity College Dublin - Ireland
Practical information
• Theory + practice;
• Software and tools;
• Files;
• Slides on the website;
• New topics / topics already covered.
Working directory (fastq files): /home/bioinfo_file/
Reference files (reference, intervals for the coverage, genome for IGV): /home/bioinfo_file/reference/
Remember the file paths!
pwd: shows your current location
Course outline
• Next-generation sequencing (NGS): how, when, why?
• An example of NGS data management and analysis:
  • type of data;
  • files and formats;
  • programs;
  • interpretation of the results;
  • error estimation;
  • when to stop?
• Applications and/or projects on different organisms.
NGS: how, when, why?
[Figure: overview of NGS applications and analysis steps.
Applications: DNA-seq; RNA-seq; whole genome; whole transcriptome; capture (exome/custom/cancer); amplicon sequencing (fixed/custom).
Analysis steps: sequencing; unaligned reads QC; reads trimming; mapping to a reference genome or de novo assembly; mapping refinement; mapping QC / assembly QC; filtering; validation.]
NGS: how, when, why?
Question: when?
Answer: when it makes sense!
• One 400 bp amplicon in 100 individuals? → Sanger sequencing
• 50 amplicons in 100 individuals? → NGS + target capture
• Gene conversion, repeated elements, recombination breakpoints? → NGS + Sanger sequencing
Question: why?
Answer: your idea for a project!
An example of NGS data management and analysis
• Oxford Nanopore MinION/GridION
• Pacific Biosciences (PacBio)
• Ion Torrent PGM/Proton
• Roche 454
• Illumina MiSeq/HiSeq
An example of NGS data management and analysis
• project
• project: application (whole genomes? exomes? target capture? amplicon sequencing?)
• project: application: aim (SNPs, indels, repeated elements, CNVs…)
• project: application: aim: coverage (SNPs, indels, repeated elements, CNVs…)
Project:
• Carcharodon carcharias, the great white shark;
• Diploid organism;
• 82 chromosomes (41 pairs);
• Genome size ~5.2 Gb, not fully sequenced yet;
• Target capture experiment;
• Paired-end reads (250 bp).
Project:
• Target capture experiment;
[Figure: target capture; source: Meyerson M et al., 2010, Nat Rev Genet]
Single-end (SE) or paired-end (PE) sequencing.
fragment                 ========================================
fragment + adaptors   ~~~========================================~~~
SE read                  --------->
PE reads                 R1--------->                <---------R2
unknown gap                          ................
Workflow overview (input: raw reads, .fastq):
1. Fastq quality control + trimming: adapters? low quality bases?
2. Alignment to a reference genome: close reference? time limited? → bwa; distant reference? → stampy. Output: aligned reads (.sam/.bam).
3. BAM refinement: duplicate removal (picard), local realignment (GATK), base recalibration (GATK). Output: aligned reads (.sam/.bam).
4. BAM check and visualization: duplicate metrics (picard), flagstat (samtools), coverage distribution (GATK); visualization with samtools, IGV/tablet.
5. Variant calling: SNPs/indels, single/multi-sample (samtools). Output: raw variants (.vcf).
6. Variant filtering and validation: in silico vs in vitro validation; vcftools; variant score recalibration (big datasets, known SNPs/indels). Output: ready-to-use variants (.vcf).
File formats:
• .fa/.fasta: sequences
• .fastq: read data
• .sam (.sai): mapped reads
• .bam (.bai): mapped reads (binary)
• .vcf: variant information
raw reads (.fastq)
@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1
AGTCAACAACGGGAACAAAATCCTGAAGGTCATGGTATGTGTANNNNTNTTNNNNNCCNNNNNNATGTGTCNNNNNNNTNNNNNNTCTGAGTNNNNNNNCTCTCTTNNNNNNNAGTGGGTNNNNNNNGCATCCANNNAGCACGATTTTNNNNNNNTATTCAGGAGACAANNNNNNNGTGGGCANNNNNNNGTGTTGGNNNNNNNNNNNNNNGGAGAGANAAAAAANNNNNNNTGAAGTCNNNNNNNNNNNNAGCGNNANNNNNNNTCNNNNNNNNNNNNNNATCANNNNNNNNNNGGTG
+
8ACCFGFGGGCDGGGGCFGGGGGGGFGGGGGFEFFGGGFFEGG####9#::#####:9######::CD@FG#######:######,:99CF?#######::DBFDE#######4::DFG>#######+9A=D@F###88=+<FFFFGG#######++8@8;EEFG8>DG#######+6@DEFF#######*44D=,:##############*/**2:*#212/8C#######*.*2:/9############)-))##0#######,(##############0((,##########-((-
raw reads (.fastq)
Terminal: more cc_gn2_R1_trimmed.fastq OR head cc_gn2_R1_trimmed.fastq
raw reads (.fastq): fields of the header line
@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1
• M00725: instrument ID
• 28: run ID
• 000000000-AJ72K: flowcell ID
• 1: lane
• 1101: tile
• 11561:1002: coordinates of the cluster
• 1: first mate in the pair (paired-end reads)
• N: is the read filtered? No (N) or Yes (Y)
• 0: control included? 0 = No
• 1: index
The header is followed by the read (second line), a "+" separator (third line) and the quality values for each nucleotide (fourth line).
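The header fields listed on this slide can be pulled apart programmatically. The following is a minimal Python sketch, assuming the Illumina 1.8+ header layout shown in the example read; the class name `IlluminaHeader`, the function name `parse_header` and the field names are illustrative choices, not part of the course material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IlluminaHeader:
    """Fields of an Illumina 1.8+ FASTQ header (names chosen for this sketch)."""
    instrument: str
    run_id: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    mate: int
    filtered: bool
    control: int
    index_seq: str


def parse_header(line: str) -> IlluminaHeader:
    """Split a header such as '@M00725:28:...:1002 1:N:0:1' into its fields."""
    if not line.startswith("@"):
        raise ValueError("a FASTQ header line must start with '@'")
    # The part before the space identifies the cluster, the part after it the read.
    left, right = line[1:].strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = left.split(":")
    mate, filtered, control, index_seq = right.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(mate), filtered == "Y", int(control), index_seq,
    )


header = parse_header("@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1")
print(header)
```

Headers produced by other platforms or older Illumina pipelines use different layouts, so the sketch would need adapting for them.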
raw reads (.fastq): quality values for each nucleotide
Quality characters, from lowest to highest (ASCII 33 to 126):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
[Scale: Phred scores 0 to 41 aligned under the characters '!' to 'J']
Illumina 1.8+: Phred+33, raw reads typically in the range (0, 41)
raw reads (.fastq)
Phred-scale value:
$Q = -10 \log_{10} P \;\Rightarrow\; P = 10^{-Q/10}$
| Phred quality score (Q) | Probability of incorrect base call (P) | Base call accuracy |
|---|---|---|
| 10 | 1 in 10 | 90% |
| 20 | 1 in 100 | 99% |
| 30 | 1 in 1000 | 99.9% |
| 40 | 1 in 10000 | 99.99% |
| 50 | 1 in 100000 | 99.999% |
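The Phred+33 encoding and the Q-to-P conversion can be checked with a few lines of Python. This is a minimal sketch, assuming the Illumina 1.8+ offset of 33 stated on the slide; the function names are illustrative.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """P = 10^(-Q/10): probability that the base call is wrong."""
    return 10 ** (-q / 10)


# First ten quality characters of the example read shown earlier.
scores = phred_scores("8ACCFGFGGG")
print(scores)

# '#' is the character used under the N calls in the example read.
low = phred_scores("#")[0]
print(low, round(error_probability(low), 3))
```

The '#' character decodes to Q2, an error probability of about 0.63, which fits its position under the uncalled N bases of the example read.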
raw reads (.fastq)
• Open cc_gn2_R2_trimmed.fastq
Terminal: more cc_gn2_R2_trimmed.fastq OR head cc_gn2_R2_trimmed.fastq
• Are cc_gn2_R1_trimmed.fastq and cc_gn2_R2_trimmed.fastq coming from two different lanes?
• What’s the difference between the two fastq files?
raw reads (.fastq)
cc_gn2_R1_trimmed.fastq
@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1
AGTCAACAACGGGAACAAAATCCTGAAGGTCATGGTATGTGTANNNNTNTTNNNNNCCNNNNNNATGTGTCNNNNNNNTNNNNNNTCTGAGTNNNNNNNCTCTCTTNNNNNNNAGTGGGTNNNNNNNGCATCCANNNAGCACGATTTTNNNNNNNTATTCAGGAGACAANNNNNNNGTGGGCANNNNNNNGTGTTGGNNNNNNNNNNNNNNGGAGAGANAAAAAANNNNNNNTGAAGTCNNNNNNNNNNNNAGCGNNANNNNNNNTCNNNNNNNNNNNNNNATCANNNNNNNNNNGGTG
cc_gn2_R2_trimmed.fastq
@M00725:28:000000000-AJ72K:1:1101:11561:1002 2:N:0:1
CCATTTCTNNNNNNNAGGACCTNNNNNNNAGCCCTNNNNNNNNNNNNNAGNATATGANNNNNNNTCTTATTNANCCANNNTCTAGNNNNNNNCTTTCCTNNNNNNNTCTCTGANNNNNNNNNNNNNNCCCTTCCNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTNTCTCNTNNNNNNNNNNNNAAAATCCNNNNNNNNNNNNNNCCACTAANNNNNNNNNNNNNNAAGAAATAACACACNNNNNNNACAAAAANNNNNNNACAACACNNNNNNNGCATAAANNNA
Same lane, different read mate in the pair!
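The comparison of the two headers can also be done in code. This is a small Python sketch, assuming the Illumina 1.8+ header layout of the example reads; the helper name `cluster_lane_mate` is hypothetical.

```python
from typing import Tuple


def cluster_lane_mate(header: str) -> Tuple[str, str, int]:
    """Return (cluster name, lane, mate number) from an Illumina 1.8+ header."""
    name, description = header[1:].strip().split(" ", 1)
    lane = name.split(":")[3]                 # fourth colon-separated field
    mate = int(description.split(":")[0])     # 1 = first mate, 2 = second mate
    return name, lane, mate


r1 = "@M00725:28:000000000-AJ72K:1:1101:11561:1002 1:N:0:1"
r2 = "@M00725:28:000000000-AJ72K:1:1101:11561:1002 2:N:0:1"

name1, lane1, mate1 = cluster_lane_mate(r1)
name2, lane2, mate2 = cluster_lane_mate(r2)

same_cluster = name1 == name2
same_lane = lane1 == lane2
print(same_cluster, same_lane, mate1, mate2)
```

Both reads come from the same cluster on the same lane; only the mate number differs.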
1- Fastq quality control + trimming
FastQC: quality control of the raw data coming out of the sequencer
• Evaluation of the quality of the generated data;
• Basic summary statistics of the raw data;
• Several modules to evaluate different features (e.g. adapters, base quality, etc.);
• Feedback (green, orange, red): do not rely on it blindly, think about what each flag means!
Per base sequence quality: warning
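To give an idea of what a per-base quality summary measures, here is a simplified Python sketch that averages the Phred score at each read position. It is not FastQC's implementation (FastQC reports box plots per position, not only means); the function name and the toy quality strings are invented for illustration, and Phred+33 encoding is assumed.

```python
from typing import Iterable, List


def mean_quality_per_position(qualities: Iterable[str], offset: int = 33) -> List[float]:
    """Average Phred score at each read position across a set of quality strings."""
    totals: List[int] = []
    counts: List[int] = []
    for quality in qualities:
        for i, ch in enumerate(quality):
            if i == len(totals):      # first time this position is seen
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Toy quality strings (invented), with lower scores towards the 3' end.
toy = ["GGG#", "GF#", "GGGG"]
means = mean_quality_per_position(toy)
print(means)
```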
[Figure: per base sequence quality plot, with the 90-94 bp and 95-99 bp position bins highlighted]
What can we do to improve the quality at the end of the reads?
Read trimming: removal of 3' ends with low quality scores
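The idea of trimming low-quality 3' ends can be sketched in Python as follows. This is a deliberately simple rule (drop trailing bases while their score is below a threshold), not the algorithm of any specific trimming tool; the threshold of Q20, the Phred+33 offset and the toy read are assumptions made for the example.

```python
from typing import Tuple


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# Toy read (invented): the last three bases have scores 2, 11 and 2.
seq, qual = trim_3prime("ACGTACGT", "GGGGG#,#")
print(seq, qual)
```

Real trimmers usually combine such a rule with sliding windows, adapter matching and a minimum read length, so results will differ from this sketch.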
Per sequence quality score: pass
Sequence length: pass
Adapter removal
[Figure: FastQC plots flagged Failed and Warning]
[Figure: FastQC plot flagged Pass]
Overrepresented sequences
Removal of overrepresented sequences (PCR primers).
FastQC references:
• Software website: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Manual: https://insidedna.me/tool_page_assets/pdf_manual/fastqc.pdf
2- Alignment to a reference genome
Alignment: the process of determining the most likely location within the genome for the observed DNA read.
[Figure: raw reads placed on the reference genome]
2- Alignment to a reference genome
Trade-off: speed vs sensitivity. The higher the accuracy, the longer the alignment run.
Two classes of methods:
Burrows-Wheeler
• fast
• less robust at high divergence from the reference genome
• e.g. bwa
Hashing
• slow (needs more memory)
• robust at high divergence from the reference genome
• e.g. stampy
The shorter the read, the harder it is to find its location in the genome.
Large amounts of data: computationally challenging for memory and speed.
BW: https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform
Hashing: https://en.wikipedia.org/wiki/Hash_table
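To make the hashing idea more concrete, here is a toy Python sketch: the reference is cut into k-mers stored in a hash table, and each k-mer of a read votes for a candidate start position. It only illustrates the seeding principle and is not how stampy or any real aligner is implemented; the reference string, the reads and the value k = 4 are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List


def build_kmer_index(reference: str, k: int) -> Dict[str, List[int]]:
    """Hash table mapping every k-mer of the reference to its positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return dict(index)


def candidate_positions(read: str, index: Dict[str, List[int]], k: int) -> List[int]:
    """Start positions supported by the largest number of matching k-mers."""
    votes: Dict[int, int] = defaultdict(int)
    for offset in range(len(read) - k + 1):
        for pos in index.get(read[offset:offset + k], []):
            votes[pos - offset] += 1
    best = max(votes.values(), default=0)
    return sorted(start for start, n in votes.items() if n == best)


# Toy reference (invented) containing the element GATTACA twice.
reference = "GATTACAGGCCTTGATTACATTT"
index = build_kmer_index(reference, k=4)

repeat_hits = candidate_positions("GATTACA", index, k=4)    # read from the repeat
unique_hits = candidate_positions("CAGGCGTT", index, k=4)   # unique read, one mismatch
print(repeat_hits, unique_hits)
```

The read taken from the repeated element gets two equally good candidate positions, while the unique read is still placed despite its mismatch, because some of its k-mers match exactly.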
2- Alignment to a reference genome
What if there are several possible places to align your sequencing read?
This may be due to:
- Repeated elements in the genome
- Low complexity sequences
- Reference errors and gaps
MQ (mapping quality) is a Phred-scaled score of the quality of the alignment:
• high MQ: a single match
• low MQ: the probability of mapping to different locations is high, but there are no perfect multiple matches
• MQ0: a perfect multiple match
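Because MQ is Phred-scaled, it can be read as a probability that the reported position is wrong, using the same formula as for base qualities. A minimal Python sketch follows; the function name is illustrative, and how each aligner actually derives its MQ values is tool-specific and not covered here.

```python
def mapq_to_error_probability(mq: int) -> float:
    """P(wrong position) = 10^(-MQ/10) for a Phred-scaled mapping quality."""
    return 10 ** (-mq / 10)


for mq in (0, 10, 20, 60):
    print(mq, mapq_to_error_probability(mq))
```

On this scale MQ0 corresponds to a probability of 1, in line with the slide: with perfect multiple matches the aligner cannot tell which location is correct.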
2- Alignment to a reference genome
[Figure series: a reference sequence with two elements (Element 1, Element 2) compared with Sample_1; reference and sample are labelled with the number of copies they carry (1 copy or 2 copies).]
• Perfect multiple matches → MQ0
• Not a perfect match → low MQ
• False heterozygous call
• Cluster of heterozygotes
[Figure: example of a repeated element, AluSg7]
2- Alignment to a reference genome: mapping with bwa-mem
BWA provides three different algorithms:
1. BWA-backtrack: for Illumina reads up to 100 bp;
2. BWA-SW: long read support, split alignment;
3. BWA-MEM: long read support, split alignment, faster, more accurate.
The fastq files are already trimmed → adapters removed.
2- Alignment to a reference genome: mapping with bwa-mem
[Figure: split read; source: Karakoc E et al., 2012]
• paired-end alignment;
• it uses the reference genome (.fa) and the reads (.fastq) to create a SAM file;
• option to mark shorter split hits as secondary (not supplementary).
bwa mem [options] [RefSeq] [fastq1] [fastq2] > cc_gn2_R12.sam
Type bwa mem (with no arguments) to check the options.
bwa mem -M cc_ref.fa cc_gn2_R1_trimmed.fastq cc_gn2_R2_trimmed.fastq > cc_gn2_R12.sam
• -M: mark shorter split hits as secondary
• cc_ref.fa: reference genome
• cc_gn2_R1_trimmed.fastq: fastq 1
• cc_gn2_R2_trimmed.fastq: fastq 2
2- Alignment to a reference genome: from sam to bam
Convert sam-to-bam:
samtools view .. .. .. input_sam .. output_bam
• option to define that the input is a sam file;
• option to have the output in bam format;
• option to define the output file.
samtools view -Sb cc_gn2_R12.sam -o cc_gn2_R12.bam
• view: sam-to-bam conversion
• -S: input is a sam file
• -b: output in bam format
• cc_gn2_R12.sam: input file (sam)
• -o cc_gn2_R12.bam: output file (bam)
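When the mapping and conversion steps have to be repeated for several samples, the two commands can be assembled in a script. The Python sketch below only builds and prints the command lines shown on the slides; the function names are hypothetical, and the commented-out part that would actually run them assumes that bwa and samtools are installed and that the reference has already been indexed with bwa index.

```python
import shlex
from typing import List


def bwa_mem_command(reference: str, fastq1: str, fastq2: str) -> List[str]:
    """Paired-end bwa mem call; -M marks shorter split hits as secondary."""
    return ["bwa", "mem", "-M", reference, fastq1, fastq2]


def sam_to_bam_command(sam: str, bam: str) -> List[str]:
    """samtools view call converting a SAM file into a BAM file."""
    return ["samtools", "view", "-Sb", sam, "-o", bam]


align = bwa_mem_command("cc_ref.fa", "cc_gn2_R1_trimmed.fastq", "cc_gn2_R2_trimmed.fastq")
convert = sam_to_bam_command("cc_gn2_R12.sam", "cc_gn2_R12.bam")

print(shlex.join(align) + " > cc_gn2_R12.sam")
print(shlex.join(convert))

# To execute the steps instead of printing them (assumes the tools are on PATH):
# import subprocess
# with open("cc_gn2_R12.sam", "w") as out:
#     subprocess.run(align, stdout=out, check=True)
# subprocess.run(convert, check=True)
```

Recent samtools versions detect the input format automatically, so -S is accepted but no longer needed.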
![Page 54: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/54.jpg)
raw reads (.fastq)
1. Fastq quality control + trimming
• Adapters? Low quality bases?
2. alignment to a reference genome
• close reference? time limited? → bwa
• distant reference? → stampy
aligned reads (.sam/.bam)
3. bam refinement
• duplicate removal (picard)
• local realignment (GATK)
• base recalibration (GATK)
aligned reads (.sam/.bam)
4. bam check visualization
• duplicate metrics (picard)
• flagstat (samtools)
• coverage distribution (GATK)
• samtools, IGV/tablet
5. variant calling
• SNPs/indels
• single/multi-sample
• samtools
raw variants (.vcf)
6. variant filtering and validation
• in silico vs in vitro validation
• vcftools
• variant score recalibration: big datasets, known SNPs/indels
ready-to-use variants (.vcf)
![Page 55: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/55.jpg)
SAM/BAM format
SAM – sequence alignment map
BAM – binary alignment map
Standard formats for alignment.
BAM is the binary version of SAM: reduced size, easier to store and to access, but the full information is not human-readable.
![Page 56: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/56.jpg)
more cc_gn2_R12.bam
![Page 57: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/57.jpg)
BAM format
![Page 58: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/58.jpg)
more cc_gn2_R12.sam
![Page 59: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/59.jpg)
SAM format
![Page 60: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/60.jpg)
SAM and BAM files consist of two parts:
1. Header: contains information about the sample
2. Alignment: contains location and qualities for all the reads
You can find a detailed explanation in the sam/bam format specification (http://samtools.sourceforge.net/SAMv1.pdf).
![Page 61: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/61.jpg)
SAM format
Header
Alignment
![Page 62: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/62.jpg)
1. Header: contains information about the sample
Header contains:
• @HD – header line
• @SQ – Reference sequence dictionary, one per chromosome, with SN (reference sequence name) and LN (reference sequence length)
• @RG – Read group
• @PG – Program, ID (identifier)
• @CO – comment
![Page 65: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/65.jpg)
2. Alignment: contains location and qualities for all the reads
The alignment section contains one line per read, and each line has 11 mandatory columns (optional tag fields may follow):
![Page 66: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/66.jpg)
Example of one alignment line:
M00725:28:000000000-AJ72K:1:1101:18215:1102 99 cc_ref 1754677 60 300M = 1754780 309 CTCCTTCACCAGATGGATTCTCGCCTTACAGTCCTGAGGAAACTAACCGCAGAGTCAACAAAGTAATGCGAGNNNNNNNGTACTTGCTACAGCNANNNGGTCCAAATNNNTNNNTTATTGGNNNAGATGTT CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG#######::CFGGGGGGGGGG#:###:9BFGGGGG###:###::DFGDG###4+
![Page 68: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/68.jpg)
bitwise FLAG
The FLAG is a single integer, but it represents the sum of different values, each one a power of two standing for one property of the read.
Open Firefox > google.co.uk > Type “bitwise flag broad”
There is a tool online which provides a quick “translation” (https://broadinstitute.github.io/picard/explain-flags.html)
QNAME: M00725:28:000000000-AJ72K:1:1101:18215:1102
FLAG: 99
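The same “translation” can be sketched in a few lines of Python. This is only an illustration: the bit values are those listed in the SAM specification, while the function name `explain_flag` and the short descriptions are chosen here and are not part of any tool.

```python
# Bit values of the SAM FLAG field, as listed in the SAM specification.
SAM_FLAGS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def explain_flag(flag):
    """Return the description of every bit that is set in a FLAG value."""
    return [name for bit, name in sorted(SAM_FLAGS.items()) if flag & bit]


# FLAG 99 = 1 + 2 + 32 + 64
for description in explain_flag(99):
    print(description)
```

For the example read, FLAG 99 decomposes into 1 + 2 + 32 + 64: the read is paired, mapped in a proper pair, its mate is on the reverse strand, and it is the first read of the pair.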
![Page 70: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/70.jpg)
In the example read, the fields after QNAME and FLAG are: RNAME = cc_ref, POS = 1754677, MAPQ = 60, CIGAR = 300M.
![Page 71: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/71.jpg)
CIGAR string
It is a compact representation of sequence alignment. It includes:
• M – match or mismatch
• I – insertion
• D – deletion
read:  ACTCA-TGCAGT
ref:   ACTCAGTG--GT
cigar: 5M1D2M2I2M
read:  ACGTCATG---CAGT
ref:   ACG-CATGCGGCAGT
cigar: 3M1I4M3D4M
So, what is the cigar line of…?
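As a minimal sketch of how a CIGAR string follows from a gapped alignment, the function below walks the two aligned strings column by column. It assumes gaps are written as "-" and only handles the three operations listed above (M, I, D); the name `cigar_from_alignment` is hypothetical.

```python
from itertools import groupby


def cigar_from_alignment(read, ref, gap="-"):
    """Build a CIGAR string (M, I, D only) from two gapped, equal-length strings."""
    if len(read) != len(ref):
        raise ValueError("aligned strings must have the same length")
    ops = []
    for read_base, ref_base in zip(read, ref):
        if read_base == gap and ref_base == gap:
            raise ValueError("a column cannot be a gap in both sequences")
        if read_base == gap:
            ops.append("D")  # base present in the reference only: deletion
        elif ref_base == gap:
            ops.append("I")  # base present in the read only: insertion
        else:
            ops.append("M")  # aligned column: match or mismatch
    # Collapse runs of identical operations into <length><operation>.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


print(cigar_from_alignment("ACTCA-TGCAGT", "ACTCAGTG--GT"))  # 5M1D2M2I2M
```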
![Page 73: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/73.jpg)
2. Alignment: contains location and qualities for all the reads
Fields of the example read:
• QNAME: M00725:28:000000000-AJ72K:1:1101:18215:1102
• FLAG: 99
• RNAME: cc_ref
• POS: 1754677
• MAPQ: 60
• CIGAR: 300M
• MRNM: = (same reference as RNAME)
• MPOS: 1754780
• ISIZE: 309
• SEQ: CTCCTTCACCAGATGGATTCTCGCCTTACAGTCCTGAGGAAACTAACCGCAGAGTCAACAAAGTAATGCGAGNNNNNNNGTACTTGCTACAGCNANNNGGTCCAAATNNNTNNNTTATTGGNNNAGATGTT
• QUAL: CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG#######::CFGGGGGGGGGG#:###:9BFGGGGG###:###::DFGDG###4+
MRNM, MPOS and ISIZE are called RNEXT, PNEXT and TLEN in recent versions of the SAM specification.
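The column layout can be illustrated by splitting one alignment line on tabs. This is a sketch only: the field names follow the slide (MRNM, MPOS, ISIZE), the names `SamRecord` and `parse_sam_line` are hypothetical, and the SEQ and QUAL strings in the example are shortened to their first ten characters.

```python
from collections import namedtuple

# The 11 mandatory SAM columns, plus a list for any optional tag fields.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar mrnm mpos isize seq qual tags",
)


def parse_sam_line(line):
    """Split one tab-separated alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("an alignment line needs at least 11 columns")
    qname, flag, rname, pos, mapq, cigar, mrnm, mpos, isize, seq, qual = fields[:11]
    return SamRecord(
        qname, int(flag), rname, int(pos), int(mapq), cigar,
        mrnm, int(mpos), int(isize), seq, qual, fields[11:],
    )


# Example read from the slide, with SEQ and QUAL shortened for readability.
example = "\t".join([
    "M00725:28:000000000-AJ72K:1:1101:18215:1102", "99", "cc_ref",
    "1754677", "60", "300M", "=", "1754780", "309",
    "CTCCTTCACC", "CCCCCGGGGG",
])
record = parse_sam_line(example)
print(record.rname, record.pos, record.mapq, record.cigar)
```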
![Page 75: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/75.jpg)
3- Bam refinement – before starting…
The BAM is missing an RG line… what is an RG line? RG stands for read group (the @RG record of the header).
You can find a detailed explanation in the sam/bam format specification (http://samtools.sourceforge.net/SAMv1.pdf).
![Page 76: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/76.jpg)
3- Bam refinement – before starting…
picard-tools AddOrReplaceReadGroups INPUT=cc_gn2_R12.bam OUTPUT=cc_gn2_R12_rg.bam RGLB=cc_gn2 RGPL=Illumina RGPU=01 RGSM=shark
![Page 77: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/77.jpg)
3- Bam refinement – before starting…
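Sort the bam: it sorts alignments by coordinates (this adds the bam extension automatically!)
samtools sort cc_gn2_R12_rg.bam cc_gn2_R12_rg_sorted
Note: this is the syntax of older samtools releases; samtools 1.3 and later expect the output file to be given explicitly, e.g. samtools sort -o cc_gn2_R12_rg_sorted.bam cc_gn2_R12_rg.bam
Index the sorted bam file
samtools index cc_gn2_R12_rg_sorted.bam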
![Page 79: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/79.jpg)
We can “read” the header of the BAM file…
Use samtools to check the header of the BAM:
samtools view -H cc_gn2_R12_rg_sorted.bam
1. How many chromosomes are present in your header?
2. Which version of the BAM is it?
3. Is it sorted?
![Page 80: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/80.jpg)
@HD VN:1.4 SO:coordinate
@SQ SN:cc_ref LN:1784076
@RG ID:1 PU:01 LB:cc_gn2 SM:shark PL:Illumina
1. 1 chromosome (“artificial”)
2. Version 1.4 (VN)
3. Yes, by coordinate (SO)
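The three questions can also be answered programmatically. The sketch below parses header text such as the output of samtools view -H; it assumes tab-separated TAG:VALUE fields as described in the SAM specification, and the function name `parse_sam_header` is hypothetical.

```python
def parse_sam_header(text):
    """Return a list of (record_type, {tag: value}) for each header line."""
    records = []
    for line in text.splitlines():
        if not line.startswith("@"):
            continue  # not a header line
        record_type, *fields = line.split("\t")
        if record_type == "@CO":
            # Comment lines are free text, not TAG:VALUE pairs.
            records.append((record_type, {"comment": "\t".join(fields)}))
            continue
        tags = dict(field.split(":", 1) for field in fields)
        records.append((record_type, tags))
    return records


header_text = (
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:cc_ref\tLN:1784076\n"
    "@RG\tID:1\tPU:01\tLB:cc_gn2\tSM:shark\tPL:Illumina\n"
)
records = parse_sam_header(header_text)
sequences = [tags for kind, tags in records if kind == "@SQ"]
hd = next(tags for kind, tags in records if kind == "@HD")

print("reference sequences:", len(sequences))
print("format version:", hd["VN"])
print("sort order:", hd["SO"])
```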
![Page 81: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/81.jpg)
@HD VN:1.3 SO:coordinate
@SQ SN:I LN:230218
@SQ SN:II LN:813184
@SQ SN:III LN:316620
@SQ SN:IV LN:1531933
@SQ SN:IX LN:439888
@SQ SN:Mito LN:85779
@SQ SN:V LN:576874
@SQ SN:VI LN:270161
@SQ SN:VII LN:1090940
@SQ SN:VIII LN:562643
@SQ SN:X LN:745751
@SQ SN:XI LN:666816
@SQ SN:XII LN:1078177
@SQ SN:XIII LN:924431
@SQ SN:XIV LN:784333
@SQ SN:XV LN:1091291
@SQ SN:XVI LN:948066
@PG ID:bwa PN:bwa VN:0.7.10-r789 CL:bwa mem -M
![Page 82: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/82.jpg)
3- Bam refinement
Input: BAM
Three main steps:
1. Local realignment
2. Base quality recalibration
3. Duplicate removal
Output: BAM
![Page 83: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/83.jpg)
3- Bam refinement – Local realignment
Problem: Short indels in the sample relative to the reference sequence can pose difficulties for alignment programs. Indels occurring towards the ends of the reads are often not aligned correctly, introducing an excess of SNPs.

Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCT
ACTTTCGGATGCTGATCGGGATGCTTTAGCTGA
TCGGATGCTGATCGGGATGCTTTAGCTGATGCT
CTGATCGGGATGCTTTAGCTGATGCTGATGG

Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCT
ACTTTCGGATGCTGATC____ATGCTTTAGCTGA
TCGGATGCTGATC____T_GCTTTAGCTGATGCT
CTGATC____ATGCTTTAGCTGATGCTGATGG
TC_____T_GCTTTAGCTGATGCTGATGGGCTT

Ref: ACTTTCGGATGCTGATCGGGATGCTTTAGCTGATGCTGATGGGCTTTCGATCGATTTAAAAGCT
ACTTTCGGATGCTGATC____ATGCTTTAGCTGA
TCGGATGCTGATC____ATGCTTTAGCTGATGCT
CTGATC____ATGCTTTAGCTGATGCTGATGG
TC____ATGCTTTAGCTGATGCTGATGGGCTT
![Page 84: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/84.jpg)
3- Bam refinement – Local realignment
Local realignment uses the full alignment context to determine whether the indel exists.
Two-step process:
1. RealignerTargetCreator: it determines the small suspicious intervals which are likely in need of realignment
2. IndelRealigner: it runs the realignment on those intervals
Notes:
• having a list of known indels helps
![Page 87: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/87.jpg)
3- Bam refinement – Base Quality Recalibration
Each base call has an associated base call quality (Phred scale).
Rule of thumb: anything less than Q20 is not useful data.
The quality of a call depends on multiple factors (e.g. position in the read, sequence context).
How does it work?
In addition, the alignment can provide useful information. Mismatches to the reference are considered errors (unless they are described polymorphisms).
It requires a catalogue of variable sites!
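To make the Phred scale concrete: a quality Q corresponds to an error probability P = 10^(-Q/10), so Q20 means 1 wrong call in 100. The sketch below also decodes quality characters, assuming the Phred+33 encoding (an assumption about the FASTQ/SAM files used here); both function names are hypothetical.

```python
def phred_to_error_probability(q):
    """Error probability for a Phred quality: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_qualities(qual_string, offset=33):
    """Turn a QUAL string into Phred scores (assumes Phred+33 encoding)."""
    return [ord(char) - offset for char in qual_string]


for q in (10, 20, 30):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.3f}")

# Characters taken from the QUAL string of the example read.
print(decode_qualities("CG#:"))
```

Under the Phred+33 assumption, the "G" characters of the example read decode to Q38 and the "#" characters to Q2, which is why the stretches of N bases carry "#" qualities.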
![Page 88: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/88.jpg)
3- Bam refinement – Base Quality Recalibration
BAM files with variants:
123 C/A BQ cov1 cov2 cov3…
145 G/A BQ cov1 cov2 cov3…
1298 G/T BQ cov1 cov2 cov3…
1345 C/T BQ cov1 cov2 cov3…
1789 C/G BQ cov1 cov2 cov3…
List of known variants:
123 C/A
145 G/A
1345 C/T
Sites in the list of known variants (123, 145, 1345): considered as real variants
Remaining sites (1298, 1789): recalibrated BQ using different covariates
BQ: base quality
Covariates: position in the read, sequencing cycle, dinucleotide, …
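The idea can be sketched with a toy calculation: mismatches at sites that are not in the list of known variants are counted as errors, grouped by a covariate (here only the sequencing cycle), and turned into an empirical Phred quality. This is a deliberately simplified illustration, not GATK's implementation; the function name, the input layout and the +1/+2 smoothing are assumptions made for this example.

```python
import math
from collections import defaultdict


def empirical_quality_by_cycle(observations, known_sites):
    """Toy recalibration: empirical Phred quality per sequencing cycle.

    observations: iterable of (reference_position, cycle, is_mismatch).
    known_sites: positions of known variants, never counted as errors.
    """
    counts = defaultdict(lambda: [0, 0])  # cycle -> [mismatches, bases]
    for position, cycle, is_mismatch in observations:
        if position in known_sites:
            continue  # a mismatch here is treated as a real variant
        counts[cycle][1] += 1
        if is_mismatch:
            counts[cycle][0] += 1
    # Smoothed error rate (hypothetical +1/+2 smoothing), then Phred scale.
    return {
        cycle: -10 * math.log10((mismatches + 1) / (bases + 2))
        for cycle, (mismatches, bases) in counts.items()
    }


known_sites = {123, 145, 1345}
observations = (
    [(pos, 1, False) for pos in range(2000, 2098)]      # 98 matching bases, cycle 1
    + [(123, 1, True)]                                  # known variant: ignored
    + [(pos, 300, False) for pos in range(3000, 3007)]  # 7 matching bases, cycle 300
    + [(1298, 300, True)]                               # unknown site: counted as error
)
for cycle, quality in sorted(empirical_quality_by_cycle(observations, known_sites).items()):
    print(f"cycle {cycle}: empirical Q{quality:.1f}")
```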
![Page 90: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/90.jpg)
3- Bam refinement – Base Quality Recalibration
BAM files with variants:
123 C/A BQ cov1 cov2 cov3…
145 G/A BQ cov1 cov2 cov3…
1298 G/T BQ_1 cov1 cov2 cov3…
1345 C/T BQ cov1 cov2 cov3…
1789 C/G BQ_1 cov1 cov2 cov3…
123, 145, 1345: considered as real variants
1298, 1789: recalibrated BQ (BQ_1) using different covariates
BQ: base quality
Covariates: position in the read, end of read with worse calls!
Along the read: High quality >>>>>>>>>>>>> Low quality
![Page 91: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/91.jpg)
3- Bam refinement – Base Quality Recalibration
Covariate: cycle number
First cycles: higher quality
Last cycles: lower quality
![Page 92: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/92.jpg)
3- Bam refinement – Base Quality Recalibration
We will not run the Base Quality Recalibration, because of limited time and because it needs a list of known variants.
Few more details:
It supports several platforms: Illumina, SOLiD, 454, Complete Genomics, Pacific Biosciences (stated on the website) and IonTorrent (stated in the GATK forum).
You can find how to do it at:
https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php
![Page 94: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/94.jpg)
3- Bam refinement – Duplicate removal
PCR is used during library preparation.
This can result in duplicate DNA fragments in the final library prep.
What we want: information from independent fragments
What we do not want: copies of the same information coming from one fragment
[Figure: reads aligned to the Reference Sequence]
![Page 96: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/96.jpg)
3- Bam refinement – Duplicate removal
[Figure: reads stacked on the reference; at one site the reads show C, C, C, C and A. Labels: “Possible heterozygote, SNP call” and “Ref call”.]
![Page 97: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/97.jpg)
3- Bam refinement – Duplicate removal
Why is it important to do it?
• Duplicates can result in false SNP calls.
• Duplicates may fake a high coverage, thus giving high support to some variants.
• PCR-free protocols exist but require a large amount of DNA.
![Page 98: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/98.jpg)
3- Bam refinement – Duplicate removal
The number of duplicates varies according to the complexity of the library:
• whole genome experiments (<5%)
• custom enrichment ones (<30%)
It must be done after alignment and at the library level.
How does it work?
It identifies read-pairs where the outer ends map to the same position on the genome and removes all but one copy.
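The principle can be sketched as follows: read pairs are grouped by the positions of their outer ends, and all but one pair per group are reported as duplicates. This is a simplified illustration, not Picard's implementation; the dictionary layout, the function name and the choice to keep the pair with the highest `score` (for example a sum of base qualities) are assumptions made here.

```python
def find_duplicate_pairs(pairs):
    """Return the names of read pairs that duplicate another pair.

    Each pair is a dict with keys: name, chrom, start, end, strand, score.
    Pairs whose outer ends (chrom, start, end, strand) coincide form one
    group; the pair with the highest score is kept, the others are duplicates.
    """
    best = {}
    for pair in pairs:
        key = (pair["chrom"], pair["start"], pair["end"], pair["strand"])
        if key not in best or pair["score"] > best[key]["score"]:
            best[key] = pair
    kept_names = {pair["name"] for pair in best.values()}
    return [pair["name"] for pair in pairs if pair["name"] not in kept_names]


# Hypothetical read pairs: pair_a and pair_b share the same outer ends.
pairs = [
    {"name": "pair_a", "chrom": "cc_ref", "start": 100, "end": 500, "strand": "+", "score": 900},
    {"name": "pair_b", "chrom": "cc_ref", "start": 100, "end": 500, "strand": "+", "score": 950},
    {"name": "pair_c", "chrom": "cc_ref", "start": 120, "end": 500, "strand": "+", "score": 800},
]
print(find_duplicate_pairs(pairs))
```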
![Page 99: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/99.jpg)
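3- Bam refinement – Duplicate removal
picard-tools MarkDuplicates INPUT=cc_gn2_R12_rg_sorted.bam OUTPUT=cc_gn2_R12_rg_sorted_rmdup.bam METRICS_FILE=dupl_metrics.txt
• MarkDuplicates: module used (duplicate removal);
• INPUT: input file;
• OUTPUT: output file;
• METRICS_FILE: metrics file.
Note: by default MarkDuplicates only flags the duplicates in the output file; adding REMOVE_DUPLICATES=true drops them from the output.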
![Page 100: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/100.jpg)
3- Bam refinement – Duplicate removal
What do we have to do after each step?
Sort and index the newly generated BAM file.
![Page 101: Bioinformatica e analisi dei genomim.docente.unife.it/silvia.fuselli/dispense-corsi/bioinfo... · 2016. 11. 30. · Bioinformatica e analisi dei genomi Anno 2016/2017 PierpaoloMaisanoDelser](https://reader036.vdocuments.pub/reader036/viewer/2022070110/60473277372b6a0f2d0e3df1/html5/thumbnails/101.jpg)
3- Bam refinement – Duplicate removal
Sort the bam
samtools sort cc_gn2_R12_rg_sorted_rmdup.bam cc_gn2_R12_rg_sorted_rmdup_sorted
Index the sorted bam file
samtools index cc_gn2_R12_rg_sorted_rmdup_sorted.bam
3- Bam refinement – BAM check
BAM check gives us answers to several questions:
• Duplicate removal: How many duplicates do I have? Is that reasonable for my experiment?
• Alignment stats: How many of my reads mapped back to the reference? How many of these are paired in mapping? How many pairs are mapped to different chromosomes?
• Coverage: How much average coverage do I have? Is the coverage evenly distributed along my region?
3- Bam refinement – BAM check: duplicate removal
picard-tools MarkDuplicates INPUT=cc_gn2_R12_rg_sorted.bam OUTPUT=cc_gn2_R12_rg_sorted_rmdup.bam METRICS_FILE=dupl_metrics.txt
3- Bam refinement – BAM check: duplicate removal
Open the file dupl_metrics.txt (with more, cat or gedit).
1) Check the % of duplicates in the PERCENT_DUPLICATION column:
0.202646 → ~20.3%
3- Bam refinement – BAM check: duplicate removal
## net.sf.picard.metrics.StringHeader
# net.sf.picard.sam.MarkDuplicates INPUT=[/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/DD10/final_files/cc_gn2_R12_rg_sorted.bam] OUTPUT=/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/DD10/final_files/cc_gn2_R12_rg_sorted_rmdup.bam METRICS_FILE=/media/pier/pierWD/backup/freecom/work/teaching/Unife_bioinfo_122016/material/white_shark_fastq_gn2/white_shark/DD10/final_files/dupl_metrics.txt PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false ASSUME_SORTED=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## net.sf.picard.metrics.StringHeader
# Started on: Thu Nov 10 18:37:26 GMT 2016
## METRICS CLASS net.sf.picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
cc_gn2 12953 53727 379573 7026 8687 8687 0.202646
PERCENT_DUPLICATION is 0.202646: about 20.3% of the examined reads in our experiment are duplicates.
3- Bam refinement – BAM check: alignment stats
Run flagstat on the BAM file before and after BAM refinement. Can you see any difference?
BEFORE: samtools flagstat cc_gn2_R12_rg_sorted.bam
AFTER: samtools flagstat cc_gn2_R12_rg_sorted_rmdup.bam
3- Bam refinement – BAM check: alignment stats
BEFORE (cc_gn2_R12_rg_sorted.bam):
509594 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
130021 + 0 mapped (25.51%:-nan%)
509594 + 0 paired in sequencing
256086 + 0 read1
253508 + 0 read2
100420 + 0 properly paired (19.71%:-nan%)
115180 + 0 with itself and mate mapped
14841 + 0 singletons (2.91%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
AFTER (cc_gn2_R12_rg_sorted_rmdup.bam):
509594 + 0 in total (QC-passed reads + QC-failed reads)
24400 + 0 duplicates
130021 + 0 mapped (25.51%:-nan%)
509594 + 0 paired in sequencing
256086 + 0 read1
253508 + 0 read2
100420 + 0 properly paired (19.71%:-nan%)
115180 + 0 with itself and mate mapped
14841 + 0 singletons (2.91%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
MarkDuplicates ran with REMOVE_DUPLICATES=false (see the header of dupl_metrics.txt), so duplicate reads are flagged rather than deleted: the total number of reads is unchanged and only the duplicates line differs, going from 0 to 24400 once the reads are marked.
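Comparing two flagstat reports by eye is error-prone, so the comparison can also be scripted. The sketch below is in Python, which is not part of the course material, and the function names `parse_flagstat` and `diff_reports` are hypothetical. It assumes the "N + M label" line layout shown on this slide and copies only four lines from each report.

```python
import re

def parse_flagstat(text):
    """Return {category: QC-passed count} from samtools flagstat text."""
    counts = {}
    for line in text.strip().splitlines():
        m = re.match(r"(\d+) \+ (\d+) (.+)", line.strip())
        if not m:
            continue
        # Drop a trailing "(xx%:yy%)" block; other parentheses are kept
        label = re.sub(r"\s*\([^()]*%[^()]*\)$", "", m.group(3))
        counts[label] = int(m.group(1))
    return counts

def diff_reports(a, b):
    """Return {category: (count_a, count_b)} for categories that differ."""
    return {k: (a[k], b[k]) for k in a if k in b and a[k] != b[k]}

# Four lines from each of the two reports on the slide
report_unmarked = """509594 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
130021 + 0 mapped (25.51%:-nan%)
14841 + 0 singletons (2.91%:-nan%)"""

report_marked = """509594 + 0 in total (QC-passed reads + QC-failed reads)
24400 + 0 duplicates
130021 + 0 mapped (25.51%:-nan%)
14841 + 0 singletons (2.91%:-nan%)"""

unmarked = parse_flagstat(report_unmarked)
marked = parse_flagstat(report_marked)
changes = diff_reports(unmarked, marked)
print(changes)
```

Only the categories whose counts differ are reported, which makes it easy to confirm that duplicate marking changed nothing else.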
3- Bam refinement – BAM check: alignment stats
flagstat: 24400 + 0 duplicates
dupl_metrics.txt:
LIBRARY UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES PERCENT_DUPLICATION
cc_gn2 7026 8687 0.202646
Each duplicate pair counts as two reads, so the two reports agree:
(8687*2) + 7026 = 24400
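The same cross-check can be written down explicitly. This is a minimal Python sketch (Python is not used in the course) with the counts copied from dupl_metrics.txt. The formula for the duplication fraction, with every read pair counted as two reads, is an assumption that reproduces the reported PERCENT_DUPLICATION.

```python
# Counts copied from dupl_metrics.txt (library cc_gn2)
unpaired_reads_examined = 12953
read_pairs_examined = 53727
unpaired_read_duplicates = 7026
read_pair_duplicates = 8687

# A duplicate pair contributes two reads, a duplicate unpaired read one
duplicate_reads = 2 * read_pair_duplicates + unpaired_read_duplicates
examined_reads = 2 * read_pairs_examined + unpaired_reads_examined

# Assumed definition: duplicate reads over examined reads
percent_duplication = duplicate_reads / examined_reads

print(duplicate_reads)                # compare with the flagstat duplicates line
print(round(percent_duplication, 6))  # compare with PERCENT_DUPLICATION
```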
3- Bam refinement – BAM check: coverage estimation
Depth of coverage: the number of reads that span a given DNA position or sequence of interest. It is commonly written as “Yx”, where “Y” is the number of reads covering the site and “x” stands for “times” (e.g. 5x, 10x, 20x, 100x).
[Figure: example positions covered at 7x, 9x and 11x]
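To make the definition concrete, the sketch below counts how many reads cover each position of a small region. It is written in Python for illustration only. The read coordinates are invented, and the function name `depth_per_position` is hypothetical. Real tools work on the BAM file and also take filters such as mapping quality into account.

```python
def depth_per_position(reads, start, end):
    """Count the reads covering each position in [start, end] (1-based, inclusive)."""
    depth = {pos: 0 for pos in range(start, end + 1)}
    for read_start, read_end in reads:
        # Only the part of the read that falls inside the region is counted
        for pos in range(max(read_start, start), min(read_end, end) + 1):
            depth[pos] += 1
    return depth

# Hypothetical reads, each given as (first, last) aligned position
reads = [(1, 5), (3, 8), (4, 10), (9, 12)]

depth = depth_per_position(reads, 1, 12)
mean_depth = sum(depth.values()) / len(depth)

for pos in sorted(depth):
    print(pos, f"{depth[pos]}x")
print("mean depth:", round(mean_depth, 2))
```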
3- Bam refinement – BAM check: coverage estimation
GATK, coverage per base:
java -jar /opt/GATK-3.5-0/GenomeAnalysisTK.jar -I cc_gn2_R12_rg_sorted_rmdup_sorted.bam -R cc_ref.fa -T DepthOfCoverage -o cc_gn2_R12_rg_sorted_rmdup_coverage -L coverage.intervals
• -I: input file
• -R: reference sequence
• -T: module used
• -o: output
• -L: list of intervals
3- Bam refinement – BAM check: coverage estimation
GATK, coverage per base
-L coverage.intervals
The file coverage.intervals holds the list of intervals, one per line, written as contig:start-end:
cc_ref:198200-213200
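An interval string of this kind is easy to take apart when a script needs the coordinates. The Python sketch below is illustrative only, and `parse_interval` is a hypothetical helper. The length calculation assumes 1-based coordinates with both ends included.

```python
def parse_interval(interval):
    """Split 'contig:start-end' into (contig, start, end)."""
    contig, span = interval.rsplit(":", 1)
    start, end = (int(x) for x in span.split("-"))
    return contig, start, end

contig, start, end = parse_interval("cc_ref:198200-213200")

# Assumption: 1-based coordinates, both ends included
length = end - start + 1

print(contig, start, end, length)
```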
3- Bam refinement – BAM check: coverage estimation
Open the summary file: gedit cc_gn2_R12_rg_sorted_rmdup_coverage.sample_summary (or more/cat/head)
sample_id total mean granular_third_quartile granular_median granular_first_quartile %_bases_above_15
shark 64770 5.70 11 4 1 10.2
Total 64770 5.70 N/A N/A N/A
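The columns of this summary are simple statistics of the per-base depths. The Python sketch below computes comparable values from an invented list of depths. `coverage_summary` is a hypothetical helper, the quartiles come from Python's `statistics.quantiles` and need not match GATK's "granular" quartiles exactly, and whether GATK counts a depth of exactly 15 as "above 15" is not checked here.

```python
import statistics

def coverage_summary(depths, threshold=15):
    """Summarise per-base depths in the spirit of the sample_summary columns."""
    q1, median, q3 = statistics.quantiles(depths, n=4)
    above = sum(d > threshold for d in depths)  # assumption: strictly above
    return {
        "total": sum(depths),
        "mean": sum(depths) / len(depths),
        "first_quartile": q1,
        "median": median,
        "third_quartile": q3,
        "pct_bases_above": 100 * above / len(depths),
    }

# Hypothetical per-base depths for ten positions
depths = [0, 0, 1, 2, 4, 4, 6, 11, 12, 20]

summary = coverage_summary(depths)
for name, value in summary.items():
    print(name, value)
```

A mean well above the median, as in the shark summary (5.70 against 4), suggests that a few highly covered positions pull the average up.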
3- Bam refinement – BAM check: coverage estimation
Per base coverage
gedit cc_gn2_R12_rg_sorted_rmdup_coverage (or more/cat/head)
Locus Total_Depth Average_Depth_sample Depth_for_shark
cc_ref:198243 0 0.00 0
cc_ref:198244 0 0.00 0
cc_ref:198245 0 0.00 0
cc_ref:198246 0 0.00 0
cc_ref:198247 0 0.00 0
cc_ref:198248 0 0.00 0
cc_ref:198249 0 0.00 0
cc_ref:198250 0 0.00 0
cc_ref:198251 0 0.00 0
cc_ref:198252 0 0.00 0
cc_ref:198253 1 1.00 1
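The per-base table can also be read by a script, for example to find where coverage starts. The Python sketch below is illustrative. It parses three of the rows shown on the slide, splits on any whitespace (the real file is tab-separated), and `read_depths` is a hypothetical helper.

```python
import io

# Header and three rows copied from the per-base table on the slide
table = """Locus Total_Depth Average_Depth_sample Depth_for_shark
cc_ref:198243 0 0.00 0
cc_ref:198252 0 0.00 0
cc_ref:198253 1 1.00 1
"""

def read_depths(handle):
    """Yield (position, total depth) for each row of a per-base coverage table."""
    header = handle.readline().split()
    locus_col = header.index("Locus")
    depth_col = header.index("Total_Depth")
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        # Locus looks like "cc_ref:198243"; keep the number after the colon
        position = int(fields[locus_col].rsplit(":", 1)[1])
        yield position, int(fields[depth_col])

rows = list(read_depths(io.StringIO(table)))
first_covered = next(pos for pos, depth in rows if depth > 0)
print(rows)
print("first covered position:", first_covered)
```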
3- Bam refinement – BAM check: coverage estimation
Open R:
• Open a terminal;
• Type R;
3- Bam refinement – BAM check: coverage estimation
# Read the per-base coverage table (tab-separated, with a header line)
data <- read.table("cc_gn2_R12_rg_sorted_rmdup_coverage", sep="\t", header=TRUE)
names(data)
# Strip the "cc_ref:" prefix from Locus and convert the position to a number
old_col <- data$Locus
new_col <- as.numeric(gsub("cc_ref:", "", as.character(old_col)))
data["pos"] <- new_col
# Plot the coverage along the region
plot(data$pos, data$Depth_for_shark, type="l", xlab="Position (bp)", ylab="Coverage")
3- Bam refinement – BAM check: coverage estimation
[Plot: coverage (y axis) against position in bp (x axis)]
3- Bam refinement – BAM check: coverage estimation
Why is it important to check the coverage?
• To check how your experiment performed (one of the ways to assess the quality of your experiment);
• To understand how confident you can be in your data;
• To decide on filtering after variant calling;
• To look for structural variation.
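One practical use of the per-base depths is to list the stretches that fall below a chosen depth, since variants called there deserve less confidence. The Python sketch below is illustrative. The depths and the threshold of 5 are invented, and `low_coverage_runs` is a hypothetical helper.

```python
def low_coverage_runs(depths, min_depth):
    """Return (first_index, last_index) for each run with depth < min_depth (0-based)."""
    runs = []
    run_start = None
    for i, d in enumerate(depths):
        if d < min_depth:
            if run_start is None:
                run_start = i          # a new low-coverage run begins
        elif run_start is not None:
            runs.append((run_start, i - 1))
            run_start = None
    if run_start is not None:          # the last run reaches the end of the region
        runs.append((run_start, len(depths) - 1))
    return runs

# Hypothetical per-base depths and an arbitrary threshold
depths = [0, 0, 1, 6, 8, 7, 2, 1, 9, 10, 0]
runs = low_coverage_runs(depths, min_depth=5)
print(runs)
```

The same loop with the comparison reversed would list unusually deep stretches, which is one of the signals used when looking for structural variation.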
4- BAM check: visualisation
Different tools to visualise aligned NGS data:
• IGV: https://www.broadinstitute.org/igv/;
• Tablet: https://ics.hutton.ac.uk/tablet/;
• …
• Samtools tview
4- BAM check: visualisation
Samtools tview:
• Basic visualization tool;
• Terminal based;
• No need for much RAM, Java or extra packages;
4- BAM check: visualisation
samtools tview cc_gn2_R12_rg_sorted_rmdup_sorted.bam -p cc_ref:1215261
• Input: bam file (cc_gn2_R12_rg_sorted_rmdup_sorted.bam)
• -p: position to display (Chr:bp)
4- BAM check: visualisation
[Screenshot of samtools tview, with labels: Position, Reference sequence, Consensus, Read]
4- BAM check: visualisation
• . : base matching the reference, read on the positive strand;
• , : base matching the reference, read on the negative strand;
• Uppercase letters: base of a read on the positive strand (with dots on, a mismatch to the reference);
• Lowercase letters: base of a read on the negative strand (with dots on, a mismatch to the reference);
• underlined: secondary or orphan;
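The symbols in this legend can be translated back into plain bases and strands. The Python sketch below is illustrative and `decode_column` is a hypothetical helper. It assumes that dots and commas stand for the reference base, that letters are the base actually observed, and that case encodes the strand.

```python
def decode_column(symbols, ref_base):
    """Translate tview-style symbols at one site into (base, strand) pairs."""
    calls = []
    for s in symbols:
        if s == ".":                      # match, positive strand
            calls.append((ref_base.upper(), "+"))
        elif s == ",":                    # match, negative strand
            calls.append((ref_base.upper(), "-"))
        elif s.isalpha():                 # letter: observed base, case gives strand
            calls.append((s.upper(), "+" if s.isupper() else "-"))
    return calls

# Hypothetical column of six reads at a site whose reference base is G
calls = decode_column("..,T,t", "G")
print(calls)
```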
4- BAM check: visualisation
Press “ . ” to switch the dots off: every base is then shown as a letter, uppercase for reads on the positive strand and lowercase for reads on the reverse strand.
4- BAM check: visualisation
• ?: menu
• q: exit
• m: mapping quality
• n: nucleotide
• b: base quality
• .: on/off dots
The menu also explains the underlining (secondary or orphan) and the colours used for quality.
4- BAM check: visualisation
Press “ ? ” to open the menu, then “ q ” to leave it and “ m ” to colour the reads by mapping quality (MQ).
[Screenshot labels: 0 ≤ MQ ≤ 9; MQ ≥ 30]
4- BAM check: visualisation
Press “ b ” to colour the bases by base quality (BQ); press “ ? ” to see the colour key.
Quality ranges shown in the key:
BQ ≥ 30
20 ≤ BQ ≤ 29
10 ≤ BQ ≤ 19
0 ≤ BQ ≤ 9
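Base and mapping qualities are Phred-scaled, Q = -10 log10(P), where P is the probability that the base (or the mapping) is wrong. The Python sketch below converts a quality into that probability and sorts it into the four ranges listed above. The function names are hypothetical and the code is for illustration only.

```python
def error_probability(q):
    """Phred scale: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def quality_class(q):
    """Return the quality range (as listed in the colour key) that q falls into."""
    if q >= 30:
        return ">=30"
    if q >= 20:
        return "20-29"
    if q >= 10:
        return "10-19"
    return "0-9"

for q in (5, 15, 25, 35):
    print(q, quality_class(q), error_probability(q))
```

A quality of 30 therefore corresponds to one expected error in a thousand, and a quality of 20 to one in a hundred.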
4- BAM check: visualisation
Press “ n ”
What happens?
The four nucleotides are highlighted in four different colours; mapping and base quality are no longer shown.
4- BAM check: visualisation
1. Is there any polymorphic site between positions 1,174,361 and 1,174,391?
2. Is there any polymorphic site between positions 1,233,201 and 1,233,221?
If so, state the reference and alternative allele, the average base quality (BQ), the average mapping quality of the reads (MQ) and how you would call that site (i.e. homozygous reference, heterozygous, homozygous alternative).
4- BAM check: visualisation
1. Is there any polymorphic site between positions 1,174,361 and 1,174,391?
No polymorphic sites between positions 1,174,361 and 1,174,391.
2. Is there any polymorphic site between positions 1,233,201 and 1,233,221?
• Yes, there is one polymorphic site
• Reference allele: G
• Alternative allele: T
• Call: possible homozygous alternative (T/T)
• Average BQ: BQ ≥ 30 (white)
• Average MQ: MQ ≥ 30 (white)
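The reasoning behind such a call by eye can be sketched as a count of the bases seen at the site. The Python code below is a toy illustration, not the method used by samtools or any real variant caller, which rely on likelihood models and on base and mapping qualities. `naive_call` is a hypothetical helper, its `min_fraction` threshold is arbitrary, and the example bases are invented.

```python
from collections import Counter

def naive_call(bases, ref, min_fraction=0.2):
    """Call a site from observed bases; min_fraction is an arbitrary threshold."""
    counts = Counter(b.upper() for b in bases)
    depth = sum(counts.values())
    if depth == 0:
        return "no call"
    alt_counts = {b: n for b, n in counts.items() if b != ref}
    if not alt_counts:
        return "homozygous reference"
    # Most frequent non-reference base
    alt, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
    if alt_n / depth < min_fraction:
        return "homozygous reference"
    if counts[ref] / depth < min_fraction:
        return f"homozygous alternative ({alt}/{alt})"
    return f"heterozygous ({ref}/{alt})"

# Hypothetical columns of bases at a site whose reference allele is G
print(naive_call("TTTtTt", "G"))
print(naive_call("GGgTtT", "G"))
print(naive_call("GGGGgg", "G"))
```

With only a handful of reads a heterozygous site can look homozygous by chance, which is why the answer on the slide says "possible".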
4- BAM check: visualisation
1. Is there any polymorphic site between positions 4761 and 4771?
No polymorphic sites between positions 4761 and 4771.
2. Is there any polymorphic site between positions 1781 and 1791? If so, state the reference and alternative allele, the average base quality (BQ), the average mapping quality of the reads (MQ) and how you would call that site (i.e. homozygous reference, heterozygous, homozygous alternative).
• Yes, there is one polymorphic site
• Reference allele: G
• Alternative allele: T
• Call: possible homozygous alternative (T/T)
• Average BQ: BQ ≥ 30
• Average MQ: MQ ≥ 30
4- BAM check: visualisation
IGV