writing data analysis pipeline as ruby gem

Writing Data Analysis Pipeline As Ruby Gem Shi-Gang Wang

Upload: sean-sg-wang

Post on 19-Jan-2017


TRANSCRIPT

Page 1: Writing data analysis pipeline as ruby gem

Writing Data Analysis Pipeline As Ruby Gem

Shi-Gang Wang

Page 2: Writing data analysis pipeline as ruby gem

About me {

name: 'Shi-Gang Wang (Sean)',

email: '[email protected]',

working_at: ,

role: ['software engineer'],

language: 'ruby',

github: 'https://github.com/seansg'

}

Page 3: Writing data analysis pipeline as ruby gem

Outline

❖ What is a pipeline

❖ Disassemble a pipeline

❖ Queue a pipeline

Page 4: Writing data analysis pipeline as ruby gem

?

Page 5: Writing data analysis pipeline as ruby gem
Page 6: Writing data analysis pipeline as ruby gem

pineapple.txt

Page 7: Writing data analysis pipeline as ruby gem

pineapple.txt

cat pineapple.txt

Page 8: Writing data analysis pipeline as ruby gem

pineapple.txt

cat pineapple.txt

cat pineapple.txt | grep apple

Page 9: Writing data analysis pipeline as ruby gem

pineapple.txt

cat pineapple.txt

cat pineapple.txt | grep apple

cat pineapple.txt | grep apple | wc -l

Page 10: Writing data analysis pipeline as ruby gem

Write scripts to do one thing

Make scripts work together

=> Pipeline
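The same idea can be sketched in Ruby: each step does one thing, and the steps are chained into a pipeline. This mirrors `cat pineapple.txt | grep apple | wc -l` from the earlier slides. The constant names and the sample text are made up for illustration, and `Proc#>>` assumes Ruby 2.6 or later.

```ruby
# Each step does exactly one thing (names are illustrative, not from CAGNUT).
READ_LINES  = ->(text)  { text.lines.map(&:chomp) }  # like `cat`
GREP_APPLE  = ->(lines) { lines.grep(/apple/) }      # like `grep apple`
COUNT_LINES = ->(lines) { lines.size }               # like `wc -l`

# Proc#>> composes left to right: the output of one step feeds the next.
PIPELINE = READ_LINES >> GREP_APPLE >> COUNT_LINES

# Hypothetical file contents standing in for pineapple.txt.
SAMPLE = "pineapple\nbanana\napple pie\n"

puts PIPELINE.call(SAMPLE)
```

Swapping or inserting a step only changes the composition line, which is the property the rest of the talk relies on.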

Page 11: Writing data analysis pipeline as ruby gem
Page 12: Writing data analysis pipeline as ruby gem

Take CAGNUT as an example

Page 13: Writing data analysis pipeline as ruby gem

CAGNUT

❖ Computational and Analytical Gear for Nucleic acid Utilitarian Techniques

❖ DNA analysis pipeline

❖ Burrows-Wheeler Aligner (BWA) — in C

❖ Sequence Alignment/Map tools (SAMtools) — in C

❖ Genome Analysis Toolkit (GATK) — in Java

❖ Picard — in Java

❖ Generate bash scripts

Page 14: Writing data analysis pipeline as ruby gem

A Genome Analysis Flowchart

https://software.broadinstitute.org/gatk/img/BP_workflow_3.6.png

Page 15: Writing data analysis pipeline as ruby gem
Page 16: Writing data analysis pipeline as ruby gem

Demo

Page 17: Writing data analysis pipeline as ruby gem

How to write the pipeline?

Page 18: Writing data analysis pipeline as ruby gem

How to write the pipeline?

How to disassemble the pipeline?

Page 19: Writing data analysis pipeline as ruby gem

Think about the pipeline structure

Pipeline

Tools

CAGNUT Core

Page 20: Writing data analysis pipeline as ruby gem

Think about the pipeline structure

Pipeline

CAGNUT Core

Tools

Page 21: Writing data analysis pipeline as ruby gem

Write all parts as ruby gems

Page 22: Writing data analysis pipeline as ruby gem

Benefits of ruby gems

❖ Reuse

❖ Debug

❖ Maintain

❖ Share

Page 23: Writing data analysis pipeline as ruby gem

Difficulties

❖ Usage

❖ Integration of tools

❖ Execution order

❖ Automation

Page 24: Writing data analysis pipeline as ruby gem

Prepare work — Define help

❖ A “help” command shows how to use the pipeline's commands

Page 25: Writing data analysis pipeline as ruby gem

Prepare work — Namespace

Page 26: Writing data analysis pipeline as ruby gem

Techniques for writing the gems

Part 1 — Tool gems

Part 2 — Pipeline gem

Part 3 — Cagnut core gem

Page 27: Writing data analysis pipeline as ruby gem

Part 1 — Tool gems

❖ Tool written as a Singleton

❖ Tool methods written in a class

❖ Job script generation

Pipeline

Tools

CAGNUT Core

Page 28: Writing data analysis pipeline as ruby gem

Tool written in Singleton
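A minimal sketch of a tool wrapped as a singleton, using the standard library's `Singleton` module. The class name `Samtools`, the `path` attribute, and the install path are assumptions for illustration and are not taken from the CAGNUT source.

```ruby
require 'singleton'

# Hypothetical tool wrapper: one shared instance holds the tool's settings.
class Samtools
  include Singleton

  attr_accessor :path

  def initialize
    @path = 'samtools' # assumed to be found on PATH by default
  end
end

# Every caller sees the same object, so configuration is set once.
Samtools.instance.path = '/opt/bin/samtools'
puts Samtools.instance.path
```

`Singleton` makes `new` private, so the only way to reach the tool is `Samtools.instance`.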

Page 29: Writing data analysis pipeline as ruby gem

Tool method written in class

Page 30: Writing data analysis pipeline as ruby gem

Get specific variables in other class

❖ Use Forwardable
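A sketch of how `Forwardable` can expose selected values from another object. `Config`, `Aligner`, and their attributes are hypothetical names chosen for this example.

```ruby
require 'forwardable'

# Hypothetical shared configuration object.
class Config
  attr_reader :ref_fasta, :jobs_dir

  def initialize(ref_fasta:, jobs_dir:)
    @ref_fasta = ref_fasta
    @jobs_dir = jobs_dir
  end
end

# Hypothetical tool step that needs only two values from Config.
class Aligner
  extend Forwardable

  # Delegate just the readers this class needs to the @config object.
  def_delegators :@config, :ref_fasta, :jobs_dir

  def initialize(config)
    @config = config
  end

  # Builds the command text only; nothing is executed here.
  def command(sample)
    "bwa mem #{ref_fasta} #{sample}.fastq > #{jobs_dir}/#{sample}.sam"
  end
end

config = Config.new(ref_fasta: 'ref.fa', jobs_dir: 'out')
puts Aligner.new(config).command('s1')
```

Delegating only the needed readers keeps the tool class from depending on the whole configuration object.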

Page 31: Writing data analysis pipeline as ruby gem

Job script generation

❖ Use Tilt

❖ Generic interface to multiple Ruby template engines
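A sketch of generating a job script from a template. To stay within the standard library it calls ERB directly instead of going through Tilt; ERB is one of the engines Tilt wraps. The template text, the job name, and the file names are hypothetical, and `result_with_hash` assumes Ruby 2.5 or later.

```ruby
require 'erb'

# Hypothetical job-script template; in a gem this would live in its own file.
JOB_TEMPLATE = ERB.new(<<~SCRIPT)
  #!/bin/bash
  # job: <%= job_name %>
  samtools sort -o <%= output %> <%= input %>
SCRIPT

# Renders the template with the given values and returns the script text.
def render_job(job_name:, input:, output:)
  JOB_TEMPLATE.result_with_hash(job_name: job_name, input: input, output: output)
end

puts render_job(job_name: 'sort_s1', input: 's1.bam', output: 's1.sorted.bam')
```

Keeping the script text in a template separates what the job runs from the Ruby code that decides its parameters.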

Page 32: Writing data analysis pipeline as ruby gem

Part 2 — Pipeline gem

❖ Require tool gems

❖ Create workflow with tool gems

❖ Generate the job list

Pipeline

Tools

CAGNUT Core

Page 33: Writing data analysis pipeline as ruby gem

Require tool gems

❖ Load the Bundler environment

Page 34: Writing data analysis pipeline as ruby gem

Create workflow with tool gems

❖ Composed of tool gems

❖ Order

❖ Dependency
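One way to turn step dependencies into an execution order is a topological sort. The sketch below uses the standard library's `TSort`; the `Workflow` class and the step names are hypothetical and are not taken from the CAGNUT source.

```ruby
require 'tsort'

# Hypothetical workflow: each step maps to the steps it depends on.
class Workflow
  include TSort

  def initialize(dependencies)
    @dependencies = dependencies
  end

  # TSort needs to know how to enumerate all nodes...
  def tsort_each_node(&block)
    @dependencies.each_key(&block)
  end

  # ...and how to enumerate the dependencies of one node.
  def tsort_each_child(step, &block)
    @dependencies.fetch(step).each(&block)
  end

  # Dependencies come before the steps that need them.
  def ordered_steps
    tsort
  end
end

STEPS = {
  call_variants:   [:mark_duplicates],
  align:           [],
  sort:            [:align],
  mark_duplicates: [:sort]
}.freeze

puts Workflow.new(STEPS).ordered_steps.inspect
```

If the dependencies contain a cycle, `tsort` raises `TSort::Cyclic`, which catches a misconfigured workflow before any job is generated.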

Page 35: Writing data analysis pipeline as ruby gem

Generate the job list

Page 36: Writing data analysis pipeline as ruby gem

Part 3 — CAGNUT core gem

❖ Project template preparation

❖ Parameter handling

❖ Tool-specific method overrides

❖ Job control

Pipeline

Tools

CAGNUT Core

Page 37: Writing data analysis pipeline as ruby gem

Project template preparation

❖ Define bundle as a Thor command

Page 38: Writing data analysis pipeline as ruby gem

Parameter handling

❖ Use OptionParser
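A minimal sketch of parameter handling with the standard library's `OptionParser`. The flag names, defaults, and banner text are hypothetical. `OptionParser` also provides `-h`/`--help` automatically, which covers the "define help" preparation step.

```ruby
require 'optparse'

# Parses argv and returns [options, remaining arguments, help text].
# The flags below are illustrative only.
def parse_options(argv)
  options = { threads: 1, queue: nil }

  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: pipeline [options] SAMPLE'
    opts.on('-t', '--threads N', Integer, 'Threads per job') { |n| options[:threads] = n }
    opts.on('-q', '--queue NAME', 'Queue to submit to') { |q| options[:queue] = q }
  end

  # parse (without !) leaves argv untouched and returns the non-option arguments.
  rest = parser.parse(argv)
  [options, rest, parser.help]
end

options, rest, help = parse_options(%w[-t 4 --queue short s1])
puts options.inspect
puts rest.inspect
puts help
```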

Page 39: Writing data analysis pipeline as ruby gem

Tool-specific method override

❖ One tool, one configuration

❖ Use Module#prepend to override

dev.af83.com/2012/10/19/ruby-2-0-module-prepend.html
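A sketch of overriding a tool's method with `Module#prepend`. The `Tool` class, the `HighMemoryConfig` module, and the option strings are hypothetical. Calling `prepend` as a public method assumes Ruby 2.5 or later.

```ruby
# Hypothetical generic tool with default options.
class Tool
  def options
    ['--threads 1']
  end

  def command
    "tool #{options.join(' ')}"
  end
end

# Tool-specific configuration layered on top of the defaults.
module HighMemoryConfig
  def options
    # super reaches Tool#options because the module sits in front of the class.
    super + ['--memory 8G']
  end
end

Tool.prepend(HighMemoryConfig)
puts Tool.new.command
```

Because the prepended module comes first in the ancestor chain, its `options` runs first and can still call the original through `super`, without reopening the class.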

Page 40: Writing data analysis pipeline as ruby gem

Jobs Control — desktop run

❖ wait $!

❖ Detach child processes to avoid zombie processes
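In Ruby, the two options above map roughly onto `Process.wait` (the analogue of the shell's `wait $!`) and `Process.detach`. The helper names are hypothetical, and the child command is the current Ruby interpreter so that the example does not depend on any external tool.

```ruby
require 'rbconfig'

# Runs a command and blocks until it finishes, like `cmd & wait $!` in bash.
def run_and_wait(*cmd)
  pid = Process.spawn(*cmd)
  Process.wait(pid)
  $?.exitstatus
end

# Runs a command without blocking. Process.detach starts a thread that reaps
# the child when it exits, so it never lingers as a zombie process.
def run_detached(*cmd)
  pid = Process.spawn(*cmd)
  Process.detach(pid) # returns a Thread whose value is a Process::Status
end

puts run_and_wait(RbConfig.ruby, '-e', 'exit 3')

waiter = run_detached(RbConfig.ruby, '-e', 'exit 0')
puts waiter.value.success?
```

Waiting suits steps that must finish before the next one starts; detaching suits independent jobs whose exit status is checked later or not at all.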

Page 41: Writing data analysis pipeline as ruby gem
Page 42: Writing data analysis pipeline as ruby gem

If the data is large

or much larger, like the human genome

Page 43: Writing data analysis pipeline as ruby gem

The size of the human genome is

3 × 10^9 base pairs (bp)

Page 44: Writing data analysis pipeline as ruby gem

Each base pair takes 2 bits

(you can use 00, 01, 10, and 11 for T, G, C, and A)

2 × 3 × 10^9 bits = 6 × 10^9 bits

= 7.5 × 10^8 bytes = ~700 MB
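The arithmetic can be checked in a few lines of Ruby. The only inputs are the two figures from the slides (3 × 10^9 base pairs, 2 bits per base pair); the constant names are chosen for this example.

```ruby
BASE_PAIRS    = 3 * 10**9 # size of the human genome, from the slide
BITS_PER_BASE = 2         # four bases (T, G, C, A) fit in two bits

GENOME_BITS  = BASE_PAIRS * BITS_PER_BASE # 6 × 10^9 bits
GENOME_BYTES = GENOME_BITS / 8            # 7.5 × 10^8 bytes
GENOME_MIB   = GENOME_BYTES / 1024.0**2   # same size in units of 2^20 bytes

puts GENOME_BITS
puts GENOME_BYTES
puts GENOME_MIB.round
```

Counted in units of 2^20 bytes the result is about 715, which is where the rounded "~700 MB" comes from.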

Page 45: Writing data analysis pipeline as ruby gem

In a perfect world: ~700 MB

(just 3 billion letters)

In the real world: ~200 GB

(right off the genome sequencer)

Page 46: Writing data analysis pipeline as ruby gem

Crash your desktop/laptop!

Page 47: Writing data analysis pipeline as ruby gem

Long wait …

Page 48: Writing data analysis pipeline as ruby gem

Resource allocation

Page 49: Writing data analysis pipeline as ruby gem

Resource allocation

❖ Specifying the memory used by the program

❖ Using a queueing system

Page 50: Writing data analysis pipeline as ruby gem

What is a queueing system?

Page 51: Writing data analysis pipeline as ruby gem

Queueing System

❖ Queue

  ❖ The list of waiting jobs

❖ Queueing system

  ❖ Waiting jobs + servers

[Diagram labels: waiting jobs (A, B, C, D), job, finished job, system, Server 1, Server 2, ..., Server n]

Page 52: Writing data analysis pipeline as ruby gem

In a desktop computer

Page 53: Writing data analysis pipeline as ruby gem

Cluster Queues

Page 54: Writing data analysis pipeline as ruby gem

Queueing System

❖ Pros

  ❖ Job scheduling

  ❖ Load balancing

  ❖ Batch job execution

Page 55: Writing data analysis pipeline as ruby gem

Queueing System

❖ Portable Batch System (PBS)

❖ Sun Grid Engine (SGE)

❖ Load Sharing Facility (LSF)

Page 56: Writing data analysis pipeline as ruby gem

Submit jobs to Queueing System

❖ Take LSF as an example

  ❖ Creating a job script

  ❖ Submitting the job
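A sketch of the first step, creating an LSF job script from Ruby. The `#BSUB` lines use standard `bsub` options (`-J` job name, `-q` queue, `-n` number of slots, `-o` output file, with `%J` expanding to the job ID). The helper name, the queue name, and the command are hypothetical. Nothing is submitted here; the code only builds the script text.

```ruby
# Builds the text of an LSF job script (illustrative helper, not CAGNUT's API).
def lsf_job_script(name:, queue:, cores:, command:)
  <<~SCRIPT
    #!/bin/bash
    #BSUB -J #{name}
    #BSUB -q #{queue}
    #BSUB -n #{cores}
    #BSUB -o #{name}.%J.out
    #{command}
  SCRIPT
end

script = lsf_job_script(
  name: 'align_s1',
  queue: 'normal', # hypothetical queue name; real names are site-specific
  cores: 4,
  command: 'bwa mem -t 4 ref.fa s1.fastq > s1.sam'
)
puts script
```

Once the text is written to a file such as `align_s1.sh`, the second step is typically `bsub < align_s1.sh` on a machine where LSF is installed.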

Page 57: Writing data analysis pipeline as ruby gem

Demo

❖ Submit jobs to cluster

Page 58: Writing data analysis pipeline as ruby gem

Acknowledgement

https://cagnut.golden.io

https://goldenio.com

Page 59: Writing data analysis pipeline as ruby gem

Thanks

Page 60: Writing data analysis pipeline as ruby gem

Backup