Writing Data Analysis Pipeline As Ruby Gem
Shi-Gang Wang
About me
{
  name: 'Shi-Gang Wang (Sean)',
  email: '[email protected]',
  working_at: ,
  role: ['software engineer'],
  language: 'ruby',
  github: 'https://github.com/seansg'
}
Outline
❖ What is a pipeline
❖ Disassemble a pipeline
❖ Queue a pipeline
pineapple.txt
cat pineapple.txt
cat pineapple.txt | grep apple
cat pineapple.txt | grep apple | wc -l
Write scripts to do one thing
Make scripts work together
=> Pipeline
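The same idea can be sketched in Ruby: each step does one thing, and the steps are chained the way `|` chains commands in the shell. This is only an illustration; the lambda names and the sample text are assumptions, not part of CAGNUT.

```ruby
# Each step does exactly one thing, like a small command-line tool.
read_lines  = ->(text)  { text.lines(chomp: true) } # like `cat`
grep_apple  = ->(lines) { lines.grep(/apple/) }     # like `grep apple`
count_lines = ->(lines) { lines.size }              # like `wc -l`

# Proc#>> composes the steps left to right (Ruby 2.6+), like `|` in the shell.
pipeline = read_lines >> grep_apple >> count_lines

# Hypothetical content standing in for pineapple.txt.
sample = "pineapple\nbanana\napple pie\n"
result = pipeline.call(sample)
puts result
```

Swapping or adding a step only changes the composition line, which is the property the pipeline gems rely on.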
Take CAGNUT as an example
CAGNUT
❖ Computational and Analytical Gear for Nucleic acid Utilitarian Techniques
❖ DNA analysis pipeline
❖ Burrows-Wheeler Aligner (BWA) — in C
❖ Sequence Alignment/Map tools (SAMtools) — in C
❖ Genome Analysis Toolkit (GATK) — in Java
❖ Picard — in Java
❖ Generate bash scripts
A Genome Analysis Flowchart
https://software.broadinstitute.org/gatk/img/BP_workflow_3.6.png
Demo
How to write the pipeline?
How to disassemble the pipeline?
Think about the pipeline structure
Pipeline
Tools
CAGNUT Core
Think about the pipeline structure
Pipeline
CAGNUT Core
Tools
Write all parts as ruby gems
Benefits of ruby gems
❖ Reuse
❖ Debug
❖ Maintain
❖ Share
Difficulties
❖ Usage
❖ Integration of tools
❖ Execution order
❖ Automation
Prepare work — Define help
❖ A “help” command shows users how to use the pipeline's commands
Prepare work — Namespace
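A minimal sketch of namespacing, assuming each tool gem lives under a shared top-level module. The module names, the `tool_name` method, and the version string below are hypothetical and only show the layout.

```ruby
# A shared top-level module keeps every gem's constants from colliding.
module Cagnut
  # One nested module per tool gem (names are illustrative).
  module Bwa
    VERSION = '0.0.1' # placeholder version

    def self.tool_name
      'bwa'
    end
  end

  module Samtools
    def self.tool_name
      'samtools'
    end
  end
end

puts Cagnut::Bwa.tool_name
puts Cagnut::Samtools.tool_name
```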
Techniques for writing the gems
Part 1 — Tool gems
Part 2 — Pipeline gem
Part 3 — Cagnut core gem
Part 1 — Tool gems
❖ Tool written as a Singleton
❖ Tool methods written in classes
❖ Job script generation
Tool written as a Singleton
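A minimal sketch using Ruby's standard `Singleton` module, so that a tool's settings exist exactly once per process. The class name `SamtoolsConfig` and its attributes are assumptions for illustration, not the gem's actual classes.

```ruby
require 'singleton'

# Hypothetical holder for one tool's settings.
class SamtoolsConfig
  include Singleton # makes `new` private and provides `.instance`

  attr_accessor :executable, :threads

  def initialize
    @executable = 'samtools'
    @threads = 1
  end
end

first  = SamtoolsConfig.instance
second = SamtoolsConfig.instance
first.threads = 4

# Both variables point at the same object, so the change is visible everywhere.
same_object = first.equal?(second)
puts second.threads
```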
Tool methods written in classes
Get specific variables from another class
❖ Use Forwardable
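A minimal sketch of how `Forwardable` lets a tool-method class read variables held by another object without copying them. `ToolConfig`, `AlignStep`, and the file names are hypothetical.

```ruby
require 'forwardable'

# Hypothetical object that owns the shared variables.
class ToolConfig
  attr_reader :threads, :reference

  def initialize(threads:, reference:)
    @threads = threads
    @reference = reference
  end
end

# Hypothetical tool-method class that needs those variables.
class AlignStep
  extend Forwardable

  # `threads` and `reference` are forwarded to @config.
  def_delegators :@config, :threads, :reference

  def initialize(config)
    @config = config
  end

  def command(input)
    "bwa mem -t #{threads} #{reference} #{input}"
  end
end

config = ToolConfig.new(threads: 4, reference: 'ref.fa')
command = AlignStep.new(config).command('sample.fq')
puts command
```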
Job script generation
❖ Use Tilt
❖ Generic interface to multiple Ruby template engines
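A minimal sketch of template-based job script generation. To stay dependency-free it uses ERB from the standard library directly rather than Tilt, which wraps ERB and other engines behind one interface. The template text and variable names are assumptions.

```ruby
require 'erb'

# Hypothetical job script template; the real templates are not shown in the slides.
TEMPLATE = <<~SCRIPT
  #!/bin/bash
  # job: <%= job_name %>
  <%= command %>
SCRIPT

# Fill the template with per-job values.
def render_job_script(job_name:, command:)
  ERB.new(TEMPLATE).result_with_hash(job_name: job_name, command: command)
end

script = render_job_script(job_name: 'align', command: 'bwa mem ref.fa sample.fq')
puts script
```

Keeping the script text in a template means the Ruby code only supplies values, so a tool's command line can change without touching the generator.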
Part 2 — Pipeline gem
❖ Require tool gems
❖ Create workflow with tool gems
❖ Generate the job list
Require tool gems
❖ Load the Bundler environment
Create workflow with tool gems
❖ Composed of tool gems
❖ Order
❖ Dependency
Generate the job list
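One way to turn order and dependency information into a job list is a topological sort. The sketch below uses `TSort` from the standard library; it is an assumption about how this could be done, not CAGNUT's actual implementation, and the step names are hypothetical.

```ruby
require 'tsort'

# Hypothetical workflow: each step maps to the steps it depends on.
class Workflow
  include TSort

  def initialize(dependencies)
    @dependencies = dependencies
  end

  # TSort needs to know how to enumerate the nodes...
  def tsort_each_node(&block)
    @dependencies.each_key(&block)
  end

  # ...and the prerequisites of each node.
  def tsort_each_child(step, &block)
    @dependencies.fetch(step).each(&block)
  end

  # Prerequisites always come before the steps that need them.
  def job_list
    tsort
  end
end

workflow = Workflow.new(
  'align'           => [],
  'sort'            => ['align'],
  'mark_duplicates' => ['sort'],
  'call_variants'   => ['mark_duplicates']
)
job_list = workflow.job_list
puts job_list
```

`TSort` raises `TSort::Cyclic` when steps depend on each other in a loop, which catches a misconfigured workflow before any job runs.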
Part 3 — CAGNUT core gem
❖ Project template preparation
❖ Parameter handling
❖ Tool-specific method override
❖ Job control
Project template preparation
❖ Define bundle as Thor command
Parameter handling
❖ Use OptionParser
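A minimal sketch of parameter handling with `OptionParser` from the standard library. The option names, the defaults, and the `pipeline` banner are hypothetical; the generated help text also illustrates the "define help" step from the preparation slides.

```ruby
require 'optparse'

# Parse command-line arguments into a hash and return the help text too.
def parse_options(argv)
  options = { threads: 1 } # hypothetical default
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: pipeline [options]'
    opts.on('-t', '--threads N', Integer, 'Number of threads') do |n|
      options[:threads] = n
    end
    opts.on('-c', '--config FILE', 'Path to a configuration file') do |file|
      options[:config] = file
    end
  end
  parser.parse!(argv)
  [options, parser.help]
end

options, help_text = parse_options(%w[--threads 8 --config system.yml])
puts options
puts help_text
```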
Tool-specific method override
❖ One tool, one configuration
❖ Use “prepend” to override
dev.af83.com/2012/10/19/ruby-2-0-module-prepend.html
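A minimal sketch of `Module#prepend`: the prepended module sits in front of the class in the method lookup chain, so its method runs first and can still call the original through `super`. The `Aligner` class and `OutputOverride` module are hypothetical.

```ruby
# Hypothetical generic tool method.
class Aligner
  def command(input)
    "bwa mem ref.fa #{input}"
  end
end

# Hypothetical tool-specific override: reuse the original and add a redirect.
module OutputOverride
  def command(input)
    "#{super} > #{input}.sam"
  end
end

Aligner.prepend(OutputOverride)

command = Aligner.new.command('sample.fq')
puts command
puts Aligner.ancestors.first # the prepended module comes before the class
```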
Jobs Control — desktop run
❖ wait $!
❖ detach Zombie Process
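In the shell, `wait $!` blocks until the most recent background job finishes. A rough Ruby counterpart is sketched below with `Process.wait2` and `Process.detach`; the child here is a trivial Ruby one-liner standing in for a real job script, which is an assumption for the sake of a runnable example.

```ruby
require 'rbconfig'

# Start a child process (stand-in for a generated job script).
pid = Process.spawn(RbConfig.ruby, '-e', 'exit 0')

# Block until that child exits, similar to `wait $!` in bash.
_, status = Process.wait2(pid)
waited_ok = status.success?

# When the parent does not want to block, detach instead. A watcher thread
# reaps the child on exit, so it does not linger as a zombie process.
background_pid = Process.spawn(RbConfig.ruby, '-e', 'exit 0')
watcher = Process.detach(background_pid)
detached_status = watcher.value # Process::Status of the reaped child

puts waited_ok
puts detached_status.success?
```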
If the data is large, or
much larger, like the human genome
The size of the human genome is
3 x 10^9 base pairs (bp)
Each base pair takes 2 bits
(you can use 00, 01, 10, and 11 for T, G, C and A)
2 x 3 x 10^9 bits = 6 x 10^9 bits
= 7.5 x 10^8 bytes = ~700 MB
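The arithmetic above can be checked with a few lines of Ruby. The 2-bit encoding table follows the slide (00, 01, 10, 11 for T, G, C, A); the `pack` helper is a hypothetical illustration of storing bases at 2 bits each.

```ruby
BASE_PAIRS    = 3 * 10**9
BITS_PER_BASE = 2

bits  = BASE_PAIRS * BITS_PER_BASE # 6 x 10^9 bits
bytes = bits / 8                   # 7.5 x 10^8 bytes
mib   = bytes / (1024.0**2)        # about 715 MiB, i.e. roughly 700 MB

# 2-bit codes as given on the slide.
ENCODING = { 'T' => 0b00, 'G' => 0b01, 'C' => 0b10, 'A' => 0b11 }.freeze

# Pack a sequence into an integer, 2 bits per base.
def pack(sequence)
  sequence.each_char.reduce(0) { |acc, base| (acc << 2) | ENCODING.fetch(base) }
end

packed = pack('GATC') # 01 11 00 10 in binary: four bases fit in one byte
puts bytes
puts mib.round
puts packed.to_s(2).rjust(8, '0')
```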
In a perfect world: ~700 MB
(just 3 billion letters)
In the real world: ~200 GB
(right off the genome sequencer)
Crash your desktop/laptop!
Long wait …
Resource allocation
❖ Specifying the memory used by the program
❖ Using a queueing system
What is a queueing system?
Queueing System
❖ Queue: the list of waiting jobs
❖ Queueing System: waiting jobs + servers
[Diagram: waiting jobs A, B, C, D enter the system, run on Server 1, Server 2, ..., Server n, and leave as finished jobs]
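A toy model of the picture above: a queue of waiting jobs and a fixed number of servers that each take the next job when free. Threads stand in for servers here; the job names follow the diagram, and everything else is an assumption made for illustration.

```ruby
# The queue: the list of waiting jobs.
waiting = Queue.new
%w[A B C D].each { |job| waiting << job }
waiting.close # no more jobs will arrive

finished = Queue.new

# Two "servers"; each pulls the next waiting job as soon as it is free.
servers = 2.times.map do |index|
  Thread.new do
    # `pop` returns nil once the queue is closed and empty, ending the loop.
    while (job = waiting.pop)
      finished << [job, "Server #{index + 1}"]
    end
  end
end
servers.each(&:join)

finished_jobs = []
finished_jobs << finished.pop until finished.empty?
finished_jobs.each { |job, server| puts "#{job} finished on #{server}" }
```

Which server runs which job can differ between runs, which is the load balancing a real queueing system provides across machines.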
On a desktop computer
Cluster Queues
Queueing System
❖ Pros
❖ Job scheduling
❖ Load balancing
❖ Batch job execution
Queueing System
❖ Portable Batch System (PBS)
❖ Sun Grid Engine (SGE)
❖ Load Sharing Facility (LSF)
Submit jobs to Queueing System
❖ Take LSF as an example
❖ Creating a job script
❖ Submitting the job
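A minimal sketch of generating an LSF job script from Ruby. The `#BSUB` directives used (`-J` job name, `-q` queue, `-M` memory limit, `-o`/`-e` output and error files, with `%J` expanding to the job ID) are standard `bsub` options; the queue name, limit value, and command are hypothetical, and the unit of `-M` depends on how the cluster is configured.

```ruby
# Build the text of an LSF job script; nothing is submitted here.
def lsf_job_script(name:, queue:, memory_limit:, command:)
  <<~SCRIPT
    #!/bin/bash
    #BSUB -J #{name}
    #BSUB -q #{queue}
    #BSUB -M #{memory_limit}
    #BSUB -o #{name}.%J.out
    #BSUB -e #{name}.%J.err
    #{command}
  SCRIPT
end

script = lsf_job_script(
  name: 'align',
  queue: 'normal',       # hypothetical queue name
  memory_limit: 8000,    # unit is cluster-specific
  command: 'bwa mem ref.fa sample.fq > sample.sam'
)
puts script
```

After writing the text to a file such as `align.sh`, the job would typically be submitted with `bsub < align.sh`.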
Demo
❖ Submit jobs to cluster
Acknowledgement
https://cagnut.golden.io
https://goldenio.com
Thanks
Backup