DNase I Seq data Analysis Strategy
Dragon Star 2013, Qian Qin, Tongji University
Workflow
• Mapping (BWA/Bowtie)
• Reads filtering and format (SAMtools/Picard)
• Peak calling (MACS/hotspot)
• Pileup (convert to bigWig); Peaks BED 1, Peaks BED 2
• 1. Down-sample by mappable reads 2. Scale by mappable reads
• 1. Data comparison (bedops, BEDTools) 2. Union BED 3. Motif discovery
• Correlation
• Filtering BedGraph, BED (BEDTools, bedClip)
• QC (qrqc, FastQC)
Warm up
Examples on DHS
He, H. H., Meyer, C. A., Chen, M. W., Jordan, V. C., Brown, M., & Liu, X. S. (2012). Genome research, 22(6), 1015–25. doi:10.1101/gr.133280.111
Neph, S., Vierstra, J., Stergachis, A. B., Reynolds, A. P., Haugen, E., Vernot, B., Thurman, R. E., et al. (2012). Nature, 489(7414), 83–90. doi:10.1038/nature11212
Convert BAM to FASTQ
• Single-end data
bamToFastq -i path_to_bam -fq output.fastq
-i input BAM file
-fq output FASTQ file
-fq2 second output FASTQ file (paired-end data)
Format instruction
• FASTQ
– http://en.wikipedia.org/wiki/FASTQ_format
• SAM/BAM
• BED, BedGraph, BigBed
• Wiggle, BigWig
• narrowPeak, broadPeak
• bed.starch
https://genome.ucsc.edu/FAQ/FAQformat.html
http://code.google.com/p/bedops/wiki/starchAndUnstarch
SAM/BAM file instruction
• BAM is compressed (binary) SAM
• FLAGS for SE:
– 0 for positive strand, 16 for negative strand, 4 for unmapped
• FLAGS for PE:
– In the string form of the flag, R means mate on reverse strand and r means read on reverse strand
– 99 pair1 + strand, 147 pair2 - strand
– 83 pair1 - strand, 163 pair2 + strand
• Common optional tags (not FLAG values):
– NM for edit distance (mismatch level)
– X? tags such as XT are reserved for tool-specific custom use
http://genome.sph.umich.edu/wiki/SAM
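The FLAG values listed above are sums of individual bits, so they can be taken apart with shell arithmetic. The sketch below is only an illustration: decode_flag is a hypothetical helper name (it is not part of samtools), and it covers only the bits mentioned on this slide, not the full SAM specification.

```shell
# decode_flag FLAG: print the names of the SAM FLAG bits that are set.
# Hypothetical helper for illustration; covers only a subset of the SAM bits.
decode_flag() {
    flag=$1
    out=""
    if [ $(( flag & 1 ))   -ne 0 ]; then out="$out paired"; fi
    if [ $(( flag & 2 ))   -ne 0 ]; then out="$out proper_pair"; fi
    if [ $(( flag & 4 ))   -ne 0 ]; then out="$out unmapped"; fi
    if [ $(( flag & 16 ))  -ne 0 ]; then out="$out read_reverse"; fi
    if [ $(( flag & 32 ))  -ne 0 ]; then out="$out mate_reverse"; fi
    if [ $(( flag & 64 ))  -ne 0 ]; then out="$out first_in_pair"; fi
    if [ $(( flag & 128 )) -ne 0 ]; then out="$out second_in_pair"; fi
    # None of the bits above set (e.g. FLAG 0): mapped single-end read on the + strand
    if [ -z "$out" ]; then out=" forward_single_end"; fi
    # Strip the leading space before printing
    echo "${out# }"
}

decode_flag 99
decode_flag 147
decode_flag 83
decode_flag 163
```

For example, 99 = 1 + 2 + 32 + 64, which is why it marks the first read of a proper pair on the + strand with its mate on the reverse strand.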
Tips on shell
du -h file
du -sh .
grep A input.fastq
grep 0 input.fastq
cut -f 5 input.sam
cut -f 3,4 input.sam | uniq | wc -l
cut -f 3,4 input.sam | grep chr21 | wc -l
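A small sketch of the cut/uniq/wc idea on toy data may help. Everything here is assumed for illustration: the five records are made up, and make_toy_sam, count_locations and count_on_chrom are hypothetical helper names. One point it tries to show is that uniq only collapses adjacent duplicate lines, so sorting first (sort -u) is usually what is wanted when counting distinct positions.

```shell
# Toy tab-separated SAM-like records: QNAME FLAG RNAME POS MAPQ (made-up values).
make_toy_sam() {
    printf 'r1\t0\tchr21\t100\t30\n'
    printf 'r2\t16\tchr21\t100\t30\n'
    printf 'r3\t0\tchr21\t250\t0\n'
    printf 'r4\t0\tchr1\t100\t30\n'
    printf 'r5\t0\tchr21\t100\t12\n'
}

# Count distinct (chromosome, position) pairs read from stdin.
# Header lines (starting with @) are skipped; sort -u removes non-adjacent duplicates too.
count_locations() {
    grep -v '^@' | cut -f 3,4 | sort -u | wc -l | tr -d ' '
}

# Count records on one chromosome; -w keeps "chr2" from also matching "chr21".
count_on_chrom() {
    grep -v '^@' | cut -f 3 | grep -c -w "$1"
}

make_toy_sam | count_locations          # distinct positions
make_toy_sam | count_on_chrom chr21     # reads on chr21
```

On this toy input, the unsorted `cut -f 3,4 | uniq | wc -l` pipeline counts one more line than count_locations, because the repeated chr21/100 position is not adjacent to its earlier copies.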
Task 1: get reads mapping location
Bowtie/Bowtie2
• Index genome
bowtie-build chr21.fa chr21
bowtie2-build chr21.fa chr21
• Single End
bowtie -S chr21 input.fastq output.sam
bowtie2 -x chr21 -U input.fastq -S output.sam
BWA
• Index genome
bwa index -a bwtsw chr21.fa
• Mapping
bwa aln -t 4 -f output.sai chr21.fa input.fastq
bwa samse -f output.sam chr21.fa output.sai input.fastq
Task 2: Alignment conversion and mapping statistics
Samtools / Picard for Conversion
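Convert BAM to SAM
samtools view -h input.bam -o output.sam
samtools view -X input.bam -o output.sam   # samtools 0.1.x: FLAG shown as a string
samtools view -x input.bam -o output.sam   # samtools 0.1.x: FLAG shown in hexadecimal
Convert SAM to BAM, then sort and merge
samtools view -bS input.sam -o output.bam
samtools sort input.bam output_sorted   # samtools 0.1.x syntax; 1.x uses: samtools sort -o output_sorted.bam input.bam
samtools merge merge.bam input1.bam input2.bam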
Samtools / Picard for read filtering and statistics
Mapping statistics
samtools flagstat input.bam
Get reliable aligned reads
samtools view -bq 1 input.bam > output.bam
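To make the MAPQ filter concrete, here is a rough awk sketch of what filtering at a minimum mapping quality and a very reduced set of mapping statistics look like on SAM text. It is not a substitute for samtools: toy_sam, filter_mapq and mapping_stats are hypothetical names, the records are made up, and the statistics are far less detailed than flagstat output.

```shell
# Toy SAM text: one header line plus QNAME FLAG RNAME POS MAPQ records (made-up values).
toy_sam() {
    printf '@SQ\tSN:chr21\tLN:1000\n'
    printf 'r1\t0\tchr21\t100\t30\n'
    printf 'r2\t16\tchr21\t180\t0\n'
    printf 'r3\t4\t*\t0\t0\n'
    printf 'r4\t0\tchr21\t400\t1\n'
}

# filter_mapq MIN: keep header lines and records whose MAPQ (column 5) is >= MIN.
filter_mapq() {
    awk -F '\t' -v min="$1" '/^@/ || $5 >= min'
}

# mapping_stats: count all reads, mapped reads (FLAG bit 4 unset) and reads with MAPQ >= 1.
mapping_stats() {
    awk -F '\t' '
        /^@/ { next }                          # skip header lines
        { total++ }
        int($2 / 4) % 2 == 0 { mapped++ }      # bit 4 (unmapped) is not set
        $5 >= 1 { reliable++ }
        END { printf "total=%d mapped=%d mapq1=%d\n", total, mapped, reliable }
    '
}

toy_sam | filter_mapq 1
toy_sam | mapping_stats
```

A MAPQ of 0 typically marks reads that map equally well to several places, which is why a threshold of 1 is a common first cut for "reliable" alignments.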
BEDTools/BEDOPS for read format conversion
Convert BAM to BED
bamToBed -i input.bam > input.bed
Merge BED files
bedops -u input1.bed input2.bed > output.bed
Equals
cat input1.bed input2.bed | sort-bed - > output.bed
Task 3: Predict open chromatin regions
Peak calling tools
• MACS14/2
– https://github.com/taoliu/MACS/
– Built into Cistrome, user-friendly
– Supports paired-end mode
• Hotspot
– Needs shell and Linux operation experience
– Many dependencies
– http://www.uwencode.org/proj/hotspot/
MACS14
macs14 -t test.bam -n test
Rscript test_model.r ## model image
Keep duplicates or not
macs14 --keep-dup all -t test.bam -n test
Model failed
macs14 --keep-dup all -t test.bam -n test --nomodel --shiftsize 73
MACS2
• Peak calling
– macs2 callpeak -t test.sam -n test
– macs2 callpeak --nomodel --shiftsize 73 -t test.sam -n test
• Down sampling
– macs2 randsample -t test.sam -n 5000 --seed 25 -o test.bed
• Filter duplicates
– macs2 filterdup -i test.bam -o test.bed
• Pileup
– macs2 pileup -i test.bam --extsize 3 -o test.bed
– sort -k1,1 -k2,2n test.bed > sort.bed
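The sort step after the pileup matters because downstream tools generally expect coordinates in numeric order within each chromosome. The sketch below uses made-up lines and hypothetical helper names (toy_bedgraph, sort_bed) to show a numeric key on the start column; with a plain text key, 1000 would sort before 200.

```shell
# Toy bedGraph-like lines: chrom, start, end, value (made-up values).
toy_bedgraph() {
    printf 'chr21\t1000\t1100\t2\n'
    printf 'chr21\t200\t300\t5\n'
    printf 'chr1\t50\t90\t1\n'
}

# sort_bed: sort stdin by chromosome name (byte order), then numerically by start.
sort_bed() {
    tab=$(printf '\t')
    LC_ALL=C sort -t "$tab" -k1,1 -k2,2n
}

toy_bedgraph | sort_bed
```

LC_ALL=C is set here only to make the chromosome ordering independent of the local language settings; that choice is an assumption of this sketch, not something the slides prescribe.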
Task 4: Replicates consistency
bedtools/bedops for comparison
Get intersection regions
bedops -i input1.bed input2.bed > output.bed
bedtools intersect -a input1.bed -b input2.bed > output.bed
Get input1.bed overlapped regions only
bedops -e 1 input1.bed input2.bed   # 1 = at least 1 bp overlap; plain -e requires 100% overlap
intersectBed -u -a input1.bed -b input2.bed
Get input1.bed complementary regions
bedops -d input1.bed input2.bed   # trims the overlapped bases out of input1.bed
intersectBed -v -a input1.bed -b input2.bed   # keeps only input1.bed intervals with no overlap
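The three comparisons above (intersection, overlapped regions only, complementary regions) can be illustrated on toy intervals with a small awk scan. This is only a sketch under stated assumptions: overlap_filter is a hypothetical helper, the coordinates are made up, the scan is quadratic and expects a non-empty second file, and it does not reproduce every detail of bedops or bedtools output.

```shell
# Toy replicate peak files in a temporary directory (made-up coordinates).
tmpdir=$(mktemp -d)
printf 'chr21\t100\t200\nchr21\t300\t400\nchr21\t900\t950\n' > "$tmpdir/rep1.bed"
printf 'chr21\t150\t350\n' > "$tmpdir/rep2.bed"

# overlap_filter MODE A.bed B.bed
#   i: print the overlapping part of each A/B pair (intersection regions)
#   u: print A intervals that overlap any B interval by at least 1 bp
#   v: print A intervals that overlap no B interval
overlap_filter() {
    awk -F '\t' -v mode="$1" '
        # First file on the command line is B: store its intervals
        NR == FNR { bc[NR] = $1; bs[NR] = $2 + 0; be[NR] = $3 + 0; n = NR; next }
        {
            as = $2 + 0; ae = $3 + 0; hit = 0
            for (i = 1; i <= n; i++) {
                # BED intervals are half-open, so this tests for >= 1 shared base
                if ($1 == bc[i] && as < be[i] && bs[i] < ae) {
                    hit = 1
                    if (mode == "i")
                        print $1 "\t" (as > bs[i] ? as : bs[i]) "\t" (ae < be[i] ? ae : be[i])
                }
            }
            if ((mode == "u" && hit) || (mode == "v" && !hit)) print
        }
    ' "$3" "$2"
}

overlap_filter i "$tmpdir/rep1.bed" "$tmpdir/rep2.bed"
overlap_filter u "$tmpdir/rep1.bed" "$tmpdir/rep2.bed"
overlap_filter v "$tmpdir/rep1.bed" "$tmpdir/rep2.bed"
```

For replicate consistency, the fraction of rep1 peaks returned in mode u gives a quick sense of how many peaks are shared with rep2.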
Task 5: Data visualization, annotation and motif discovery
MDSeqPos
Get most accessible chromatin regions
sort -r -g -k 5,5 peaks.bed > input.bed
Motif analysis
MDSeqPos.py input.bed -d -m cistrome.xml -p 0.05 hg19 -s hs
-p p-value cutoff
-s species
-d run de novo motif search or not
-m motif databases: transfac.xml, cistrome.xml
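Ranking peaks by the score column before motif analysis can be sketched as follows. The peak lines are made up, toy_peaks and top_peaks are hypothetical names, and taking only the top N peaks is an assumption of this sketch (the slide itself sorts the whole file). The -g option used here is an extension available in GNU and BSD sort.

```shell
# Toy peaks: chrom, start, end, name, score (made-up values).
toy_peaks() {
    printf 'chr21\t100\t200\tp1\t12.5\n'
    printf 'chr21\t300\t400\tp2\t250\n'
    printf 'chr21\t500\t600\tp3\t3e2\n'
    printf 'chr21\t700\t800\tp4\t40\n'
}

# top_peaks N: keep the N highest-scoring peaks from stdin.
# -k5,5 limits the key to column 5; g parses values such as 3e2; r sorts descending.
top_peaks() {
    tab=$(printf '\t')
    sort -t "$tab" -k5,5gr | head -n "$1"
}

toy_peaks | top_peaks 2
```

Restricting the key with -k5,5 avoids the sort spilling over into any columns after the score.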
Data visualization and Cistrome application annotation
IGV
• Set data ranges
• Auto scale
• Find most enriched regions
• Load wiggle and peaks BED
Get open chromatin regions near genes
RegPotential.py -t test_peaks.bed -g /mnt/Storage/data/sync_cistrome_lib/ceaslib/GeneTable/hg19 -n test -d 10000
Task summary
• Get FASTQ
• Mapping
• Get proper format
• Peak calling
• Comparison of replicate peaks
• Data visualization and motif analysis
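Faculty
• Full-time professors: Zhiwei Cao (曹志伟), Cizhong Jiang (江赐忠), Yong Zhang (张勇)
• Overseas chair professors: Shirley Liu (Harvard), Zhiping Weng (UMass), Wei Li (Baylor)
• Adjunct professor: Yixue Li (李亦学)
• Recruited with assistance: Fan Zhang (张帆), Lei Liu (刘雷)
• Talent programs and honors held by faculty: Thousand Talents Program; 973 Program Chief Scientist; Shanghai Pujiang Talent Program; Shanghai Eastern Scholar Program; Shanghai Shuguang Program; Ministry of Education New Century Excellent Talents; Shanghai Science and Technology Commission Rising-Star Program
Welcome to join us!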