ruby on bioinformatics

Ruby Conference Taiwan 2014 Ruby on Bioinformatics Tse-Ching Ho 何澤清 @tsechingho 2014 / 4 / 26

Upload: tse-ching-ho

Post on 10-May-2015


DESCRIPTION

RubyConf Taiwan 2014

TRANSCRIPT

Page 1: Ruby on bioinformatics

Ruby Conference Taiwan 2014

Ruby on Bioinformatics

Tse-Ching Ho 何澤清

@tsechingho

2014 / 4 / 26

Page 2: Ruby on bioinformatics

Horse + Stripe = Zebra

Page 3: Ruby on bioinformatics

Biology + Informatics = Bioinformatics

Page 4: Ruby on bioinformatics

Age of Big Data

Page 5: Ruby on bioinformatics

Age of Data Science

Page 6: Ruby on bioinformatics

High-Throughput Data

❖ Big Data

❖ each file is small, but there are many files

❖ each file is large, but there are only a few files

❖ Data size in bioinformatics

❖ 1,000,000,000 records for a single subject (person) is normal
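With a billion records per subject, loading a whole file into memory is rarely an option. The following is a minimal sketch of line-by-line streaming, assuming one record per line; the method name `summarize_records` and the sample data are illustrative and do not come from the talk.

```ruby
require "stringio"

# Count records and track the longest one without holding the input in memory.
# `io` can be a File opened on a very large file; a StringIO stands in here.
def summarize_records(io)
  count = 0
  longest = 0
  io.each_line do |line|
    line = line.chomp
    next if line.empty? # skip blank lines

    count += 1
    longest = line.length if line.length > longest
  end
  { count: count, longest: longest }
end

sample = StringIO.new("ACGT\nACGTACGT\n\nAC\n")
summary = summarize_records(sample)
puts "records: #{summary[:count]}, longest: #{summary[:longest]}"
```

Because `each_line` yields one line at a time, memory use stays flat regardless of file size.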

Page 7: Ruby on bioinformatics

The Storage Demand is Increasing

from Dr. Yu-Tai Wang

Page 8: Ruby on bioinformatics

Data Size of Sequencing After 5 Years

https://www.nanoporetech.com

70,000 newborn babies × 500 GB = 35 PB

30,000 patients × 10,000 cells × 500 GB = 1.5 × 10^11 GB = 150 EB

from Dr. Yu-Tai Wang

1. Counted from current NGS data sizes

2. Does not include civil medical institutes
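The estimates above can be checked with a few lines of arithmetic. This sketch assumes decimal (SI) units, where 1 PB = 10^6 GB and 1 EB = 10^9 GB; with those units the newborn figure comes to 35 PB and the patient figure to 150 EB. The constant names are illustrative.

```ruby
# Decimal (SI) storage units expressed in gigabytes.
GB_PER_PB = 1_000_000
GB_PER_EB = 1_000_000_000

SAMPLE_GB = 500 # per-sample data size quoted on the slide

newborn_gb = 70_000 * SAMPLE_GB          # one sample per newborn
patient_gb = 30_000 * 10_000 * SAMPLE_GB # one sample per cell, per patient

puts "Newborns: #{newborn_gb} GB = #{newborn_gb / GB_PER_PB} PB"
puts "Patients: #{patient_gb} GB = #{patient_gb / GB_PER_EB} EB"
```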

Page 9: Ruby on bioinformatics

Computing Power is Required

❖ HPC

❖ Infiniband cluster

❖ Amazon EC2 cluster

❖ Hadoop cluster

❖ Many CPU cores

❖ Large memory

❖ High I/O efficiency

http://arstechnica.com/business/2012/05/amazons-hpc-cloud-supercomputing-for-the-99/

Page 10: Ruby on bioinformatics

http://arstechnica.com/business/2012/04/4829-per-hour-supercomputer-built-on-amazon-cloud-to-fuel-cancer-research/

$4,828.85 per hour

51,132 cores, 58.78 TB RAM

6,742 Amazon EC2 instances

2012

Protein simulation

Cycle Computing System

Ganglia HPC clusters

Deployed by Opscode Chef

Page 11: Ruby on bioinformatics

http://www.hpcwire.com/2013/07/08/infiniband_snaps_up_strong_super_share/

Is a 10 Gb network enough for I/O?

Embarrassingly parallel: the calculations are independent of each other.
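A small sketch of what "independent of each other" means in practice: each sequence's GC content depends only on that sequence, so the work can be split across workers with no coordination. The sequences and the `gc_content` helper are illustrative, not from the talk.

```ruby
# GC content of one sequence; it depends on nothing but its own input.
def gc_content(seq)
  return 0.0 if seq.empty?

  seq.count("GCgc").to_f / seq.length
end

sequences = %w[ACGT GGCC ATAT GCGCAT]

# One thread per sequence; no locks are needed because no state is shared.
threads = sequences.map { |seq| Thread.new { gc_content(seq) } }
results = threads.map(&:value) # Thread#value waits for the thread and returns its result

sequences.zip(results).each do |seq, gc|
  puts format("%-8s %.2f", seq, gc)
end
```

On MRI the global VM lock keeps CPU-bound threads from running simultaneously, so real speedups for this kind of work come from JRuby threads or from separate processes. The structure of the code stays the same either way.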

Page 12: Ruby on bioinformatics

http://glennklockwood.blogspot.tw/2013/12/high-performance-virtualization-sr-iov_14.html

Infiniband is good at I/O efficiency

• Interconnect speed.

• I/O performance.

• An Infiniband system delivers about 3.8 GB/s of bandwidth.

• A 10 Gb network delivers about 400 MB/s of bandwidth.
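To put the two bandwidth figures in perspective, here is a rough transfer-time estimate. It assumes sustained throughput at the quoted rates, decimal units, and a hypothetical 500 GB payload (the per-sample size used earlier in the talk); real transfers add protocol and storage overhead.

```ruby
# Seconds needed to move `size_gb` gigabytes at `mb_per_s` megabytes per second
# (decimal units, sustained throughput, no protocol overhead).
def transfer_seconds(size_gb, mb_per_s)
  size_gb * 1000.0 / mb_per_s
end

PAYLOAD_GB = 500 # hypothetical payload size

infiniband_s = transfer_seconds(PAYLOAD_GB, 3800) # about 3.8 GB/s
ten_gb_s = transfer_seconds(PAYLOAD_GB, 400)      # about 400 MB/s

puts format("Infiniband:    %.0f s", infiniband_s)
puts format("10 Gb network: %.0f s", ten_gb_s)
puts format("Slowdown:      %.1fx", ten_gb_s / infiniband_s)
```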

Page 13: Ruby on bioinformatics

Data science is about DATA!

Page 14: Ruby on bioinformatics

Data Scientist Concerns

❖ Data quality

❖ Filtering factors

❖ Statistics

❖ Visualization

❖ Interpretation

Page 15: Ruby on bioinformatics

Programmers' Concerns

❖ High-throughput data (Big Data) handling

❖ Data format / File format

❖ Data parsing

❖ Statistical tools

❖ Visualization

❖ Profit / Markets

Page 16: Ruby on bioinformatics

Biology

Page 17: Ruby on bioinformatics

http://businessintelligence.com/bi-insights/the-personalized-medicine-revolution-is-almost-here/

Page 18: Ruby on bioinformatics

A Dream of Personalized Medicine

from Dr. Yen-Hua Huang

Page 19: Ruby on bioinformatics

Genomic Disease

http://www1.imperial.ac.uk/computationalsystemsmedicine/biomolecularmedicine/personalised/

Page 20: Ruby on bioinformatics

Cure by Medicines

http://scienceroll.com/2008/04/25/personalized-medicine-real-clinical-examples/

Page 21: Ruby on bioinformatics

Personalized Medicine

http://www.genomicslawreport.com/index.php/tag/personalized-medicine/

Page 22: Ruby on bioinformatics

Personal Genomic Analysis

http://www.thecureisnow.org/index.php/our-strategy/philosophy-of-tcin/personalized-medicine

Page 23: Ruby on bioinformatics

http://www.genengnews.com/insight-and-intelligence/personalized-medicine-not-quite-there-yet/77899649/

Page 24: Ruby on bioinformatics

DNA

http://cisncancer.org/research/what_we_know/omics/personalized_medicine_02.html

Page 25: Ruby on bioinformatics

DNA Sequencing

http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/


http://www.broadinstitute.org/blog/beyond-genome-new-uses-dna-sequencers

Page 26: Ruby on bioinformatics

No Teach for Reading DNA

http://intellimedix.com

Page 27: Ruby on bioinformatics

Do The Right Things

http://www.dnadirect.com

Page 28: Ruby on bioinformatics

http://biodbnet.abcc.ncifcrf.gov/dbInfo/netGraph.php

ID Mapping of Databases

Each node is a database.

Each database has its own unique ID.

These IDs are connected as a network.

I think handling this complexity should be easy for the people sitting here.
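For programmers, this network is an ordinary graph problem. The sketch below finds the shortest chain of ID mappings between two databases with a breadth-first search. The database names are real, but the links between them are a made-up illustration rather than the actual bioDBnet graph, and `mapping_path` is a hypothetical helper.

```ruby
# Hypothetical ID-mapping links: each key can be mapped directly to the listed databases.
ID_LINKS = {
  "Ensembl Gene" => ["Entrez Gene", "UniProt"],
  "Entrez Gene" => ["Ensembl Gene", "RefSeq"],
  "UniProt" => ["Ensembl Gene", "PDB"],
  "RefSeq" => ["Entrez Gene"],
  "PDB" => ["UniProt"]
}.freeze

# Breadth-first search: returns the shortest list of databases to pass through
# when converting an ID from `from` to `to`, or nil if no route exists.
def mapping_path(links, from, to)
  queue = [[from]]
  seen = { from => true }
  until queue.empty?
    path = queue.shift
    node = path.last
    return path if node == to

    links.fetch(node, []).each do |neighbor|
      next if seen[neighbor]

      seen[neighbor] = true
      queue << (path + [neighbor])
    end
  end
  nil
end

route = mapping_path(ID_LINKS, "RefSeq", "PDB")
puts route.join(" -> ")
```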

Page 29: Ruby on bioinformatics

Bioinformatics Sites for Rubyists

Page 30: Ruby on bioinformatics

NCBI

http://www.ncbi.nlm.nih.gov

Page 31: Ruby on bioinformatics

Ensembl

http://www.ensembl.org

Page 32: Ruby on bioinformatics

Nature Biotechnology

http://www.nature.com/nbt

Page 33: Ruby on bioinformatics

PLOS Computational Biology

http://www.ploscompbiol.org

Page 34: Ruby on bioinformatics

Biostars

https://www.biostars.org

Page 35: Ruby on bioinformatics

SEQanswers

http://seqanswers.com

Page 36: Ruby on bioinformatics

Ruby Sites for Bioinformaticians

Page 37: Ruby on bioinformatics

GitHub

https://github.com/

Page 38: Ruby on bioinformatics

RubyGems.org

https://rubygems.org

Page 39: Ruby on bioinformatics

The Ruby Toolbox

https://www.ruby-toolbox.com

Page 40: Ruby on bioinformatics

Biogems.info

http://www.biogems.info

Page 41: Ruby on bioinformatics

BioRuby

http://bioruby.org

Page 42: Ruby on bioinformatics

SciRuby

http://sciruby.com

Page 43: Ruby on bioinformatics

What programming language is best for a bioinformatics beginner?

Page 44: Ruby on bioinformatics

Mapping Sequence Data

from Jui-Tse Hsu

Page 45: Ruby on bioinformatics

Simple Mapping Sequence Data

Convert to SAM

Compress to BAM

Sort, index, remove PCR duplicates (rmdup)

1. .seq -> fastq

2. Illumina score -> Phred score

1. cleaned BAM file

2. quality control, get statistics, mapped, unmapped, etc.

1. SNVs in VCFs

2. structural variants

3. copy number changes, etc.

Aligner (soap2, bwa, bowtie, etc.)

from Jui-Tse Hsu

Illumina exome sequence reads

Aligned reads

Aligned reads (SAM file)

Aligned reads (BAM file)

Useful reads data

Call variants

Visualization in browsers
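One preprocessing step in this flow, "Illumina score -> Phred score", can be sketched in a few lines. The sketch assumes the step means re-encoding FASTQ quality strings from the Phred+64 offset used by Illumina 1.3 to 1.7 pipelines to the Phred+33 (Sanger) offset most aligners expect; the talk does not say which conversion was actually used, and the method names are illustrative.

```ruby
ILLUMINA_OFFSET = 64 # Phred+64, Illumina 1.3 to 1.7
SANGER_OFFSET = 33   # Phred+33, Sanger and Illumina 1.8+

# Re-encode a Phred+64 quality string as Phred+33.
def illumina_to_sanger(quality)
  quality.each_char.map do |ch|
    score = ch.ord - ILLUMINA_OFFSET
    raise ArgumentError, "not a Phred+64 character: #{ch.inspect}" if score.negative?

    (score + SANGER_OFFSET).chr
  end.join
end

# Decode a quality string into numeric Phred scores.
def phred_scores(quality, offset = SANGER_OFFSET)
  quality.each_char.map { |ch| ch.ord - offset }
end

converted = illumina_to_sanger("hhhdB")
puts converted
puts phred_scores(converted).inspect
```

The numeric scores are unchanged by the conversion; only the character encoding moves.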

Page 46: Ruby on bioinformatics

C/C++

❖ Key algorithms

❖ Written in C/C++

❖ Foundation tools

❖ BWA

❖ Bowtie / Bowtie2

❖ samtools / bamtools

❖ GMAP / GSNAP

❖ BLAT

❖ TopHat

Page 47: Ruby on bioinformatics

http://genomebiology.com/2010/11/12/220

Analysis Pipeline

Overview of the RNA-seq analysis pipeline for detecting differential expression

Page 48: Ruby on bioinformatics

Perl

❖ First language

❖ BioPerl

❖ Ensembl

http://millionchimpanzees.blogspot.tw/2011/09/book-review-learning-perl-sixth-edition.html

Page 49: Ruby on bioinformatics

Java

❖ The good parts of Java

❖ GATK

❖ Taverna

❖ Hadoop

http://shop.oreilly.com/product/9780596803742.do

Page 50: Ruby on bioinformatics

R

❖ Statistical tools

❖ Bioconductor

❖ edgeR

❖ Data Mining and Analysis Books

http://exploringdata.github.io/data-visualization-books/analysis/

Page 51: Ruby on bioinformatics

Python

❖ Young people

❖ Galaxy

http://news.oreilly.com/2008/08/python-for-unix-and-linux-syst.html

Page 52: Ruby on bioinformatics

The Ruby Way in Bioinformatics

Page 53: Ruby on bioinformatics

What kinds of libraries do you think are important?

Page 62: Ruby on bioinformatics

Tools

❖ minimization

❖ integration

❖ quorum - Rails engine

Page 63: Ruby on bioinformatics

I Am Not an Analyst, I Am a Programmer.

Page 64: Ruby on bioinformatics

What can I get involved in?

Page 65: Ruby on bioinformatics

Pipeline / Workflow

Galaxy - Python

Taverna - Java

??? - Ruby

Page 66: Ruby on bioinformatics

Web System

❖ Data warehouse

❖ Pipeline management

❖ Coordination center

❖ Visualisation

Page 67: Ruby on bioinformatics

Cloud / Distributed / Parallel

http://www.mynamesnotmommy.com/yes-there-are-dumb-questions/question-mark/

Page 68: Ruby on bioinformatics

What Are We Doing with Ruby?

Page 69: Ruby on bioinformatics

Ensembl Virtual Machine

❖ Powered by Veewee, Vagrant and Chef

❖ Automatically builds a versioned Ensembl system (Perl)

❖ Includes databases, queuing services and analysis tools

❖ Multiple sites and multiple species in one virtual machine

❖ Helps build local and custom systems

from Tse-Ching Ho

Page 70: Ruby on bioinformatics

Ensembl Virtual Machine

Use an existing Vagrant box

Prepare SOP for Chef recipes

Provision VM with Chef recipes

Write Chef recipes

Export VM by Virtualbox

Setup Vagrantfile

Create Vagrant box by Veewee

Write definition of Vagrant box by Veewee

Ensembl VM Automation

from Tse-Ching Ho

Page 71: Ruby on bioinformatics

Ensembl Virtual Machine

Web view of Ensembl

from Tse-Ching Ho

Page 72: Ruby on bioinformatics

DR. RAW

❖ Derived from DRAW and SneakPeek

❖ Composed of C/C++, Bash, Perl, Java and Ruby

❖ Covers both DNA and RNA resequencing analysis

❖ Enhanced quality control for DNA and RNA

❖ Distributed computing pipeline

❖ Supports PBS, LSF and SGE platforms (queuing systems)

from Hannah Lin
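Supporting several queuing systems usually comes down to hiding their different submit commands behind one interface. The following is a minimal sketch of that idea, not DR. RAW's actual code: it only builds the command line, using `qsub -N` for PBS and SGE and `bsub -J` for LSF to set the job name. The `submit_command` helper and the job and script names are hypothetical.

```ruby
# Map each scheduler to a lambda that builds its submit command as an argv array.
SUBMITTERS = {
  pbs: ->(name, script) { ["qsub", "-N", name, script] },
  sge: ->(name, script) { ["qsub", "-N", name, script] },
  lsf: ->(name, script) { ["bsub", "-J", name, script] }
}.freeze

# Build (but do not run) the submit command for the given scheduler.
def submit_command(scheduler, name, script)
  builder = SUBMITTERS.fetch(scheduler) do
    raise ArgumentError, "unsupported scheduler: #{scheduler}"
  end
  builder.call(name, script)
end

%i[pbs sge lsf].each do |scheduler|
  puts submit_command(scheduler, "align_sample1", "align.sh").join(" ")
end
```

Returning an argv array rather than a string means the command can later be handed to `system(*argv)` without shell quoting problems.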

Page 73: Ruby on bioinformatics

DR. RAW

Analysis Tools

Analysis Pipeline

Quality Control

Resource Manager System

DNA QC Forward : Reverse

RNA QC Forward : Reverse

BWA-0.7.7

Samtools-0.1.19

GATK-3.1

GSNAP-13-10-25

Cufflink-13-11

FusionGene …

DNA Sequencing data

RNA Sequencing data

SGE (Sun Grid Engine)

PBS (Portable Batch System)

LSF (Load Sharing Facility)

Green: new components

Red: updated components

from Hannah Lin

Page 74: Ruby on bioinformatics

DR. RAW

Web view by Rails

from Hannah Lin

Page 75: Ruby on bioinformatics

Neo4j - JRuby Data Parser

❖ Graph database for data integration of discrete clinical research documents

❖ The original data are Excel/CSV files collected at different times by different people

❖ Neo4j is good for cleaning up such massive data sets

❖ Cooperation between biologists and programmers

from Wei-Ming Wu, Chia-Hsuan Lee

Page 76: Ruby on bioinformatics

Neo4j - JRuby Data Parser

from Wei-Ming Wu, Chia-Hsuan Lee

Page 77: Ruby on bioinformatics

Neo4j - JRuby Data Parser

from Wei-Ming Wu, Chia-Hsuan Lee

Collision Rate of Input Data: 1.3 %
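The talk does not define how a collision was counted. As one plausible reading, the sketch below treats a collision as two source sheets giving different non-empty values for the same field of the same record. It uses plain Ruby hashes rather than Neo4j, and the record IDs, field names and `merge_with_collisions` helper are all hypothetical.

```ruby
# Merge records from several sheets by :id, keeping the first non-empty value
# seen for each field and reporting every conflicting later value.
def merge_with_collisions(*sources)
  merged = {}
  collisions = []
  sources.flatten.each do |record|
    id = record.fetch(:id)
    target = (merged[id] ||= {})
    record.each do |field, value|
      next if field == :id || value.nil? || value.to_s.strip.empty?

      if target.key?(field) && target[field] != value
        collisions << { id: id, field: field, kept: target[field], rejected: value }
      else
        target[field] = value
      end
    end
  end
  [merged, collisions]
end

sheet_a = [{ id: "P001", sex: "F", age: 42 }, { id: "P002", sex: "M", age: nil }]
sheet_b = [{ id: "P001", sex: "F", age: 43 }, { id: "P002", sex: "M", age: 57 }]

merged, collisions = merge_with_collisions(sheet_a, sheet_b)

# Assumed definition: collisions divided by the number of non-empty input values.
checked = (sheet_a + sheet_b).sum { |r| r.count { |k, v| k != :id && !v.nil? } }
puts "collisions: #{collisions.size} of #{checked} values"
puts format("collision rate: %.1f %%", 100.0 * collisions.size / checked)
```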

Page 78: Ruby on bioinformatics

API Server for Third Party Firm

❖ API server based on Rails, run on JRuby

❖ ActiveRecord models for the Oracle database

❖ activerecord-oracle_enhanced-adapter gem

❖ Import Excel files into the third-party GUI client

❖ The third-party server sends XML requests to the API server

from Wei-Ming Wu, Sean Wang

Page 79: Ruby on bioinformatics

API Server for Third Party Firm

TCHC server

API server (Rails, JRuby)

CSIS (Java, Oracle)

Send data by XML

Write into database

Read data by client program

Upload data

Parse request

Third Party

Our Servers

Windows GUI

from Wei-Ming Wu, Sean Wang

Page 80: Ruby on bioinformatics

Daily Checking Rule

❖ Based on Rails, run on JRuby

❖ ActiveRecord models for the Oracle database

❖ activerecord-oracle_enhanced-adapter gem

❖ Users can define rules for checking data, usually values in filled-in forms

❖ Checking rules run daily, not before forms are filled in

from Wei-Ming Wu, Sean Wang
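A minimal sketch of what user-defined checking rules could look like, assuming each rule names one form field and supplies a predicate for its value. This is plain Ruby rather than the ActiveRecord models the project used, and the `Rule` struct, field names and value ranges are hypothetical.

```ruby
# A rule has a human-readable name, the form field it inspects, and a predicate.
Rule = Struct.new(:name, :field, :check)

RULES = [
  Rule.new("systolic pressure in range", :systolic,
           ->(v) { v.is_a?(Numeric) && v.between?(60, 250) }),
  Rule.new("visit date present", :visit_date,
           ->(v) { !v.to_s.strip.empty? })
].freeze

# Run every rule against every record and list the failures.
def violations(records, rules = RULES)
  records.flat_map do |record|
    rules.reject { |rule| rule.check.call(record[rule.field]) }
         .map { |rule| { record_id: record[:id], rule: rule.name } }
  end
end

forms = [
  { id: 1, systolic: 120, visit_date: "2014-04-26" },
  { id: 2, systolic: 400, visit_date: "" }
]

report = violations(forms)
report.each { |v| puts "record #{v[:record_id]}: failed '#{v[:rule]}'" }
```

Running the rules as a daily batch, as the slide describes, means a bad value never blocks data entry; it only shows up in the next report.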

Page 81: Ruby on bioinformatics

Daily Checking Rule

from Wei-Ming Wu, Sean Wang

Page 82: Ruby on bioinformatics

Daily Checking Rule

from Wei-Ming Wu, Sean Wang

Page 83: Ruby on bioinformatics

Daily Checking Rule

from Wei-Ming Wu, Sean Wang

Page 84: Ruby on bioinformatics

Daily Checking Rule

from Wei-Ming Wu, Sean Wang

Page 85: Ruby on bioinformatics

Patient Randomization

❖ Based on Rails, run on JRuby

❖ ActiveRecord models for the Oracle database

❖ activerecord-oracle_enhanced-adapter gem

❖ Assign patients to different groups using a randomization method

❖ Cooperation between statisticians and programmers

from Wei-Ming Wu, Sean Wang
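The talk does not say which randomization method was used. As an illustration only, the sketch below implements permuted-block randomization, a common choice in clinical trials because it keeps group sizes balanced after every block. The `block_randomize` helper, the group names and the patient IDs are hypothetical.

```ruby
# Permuted-block randomization: within each block of `block_size` patients,
# every group appears equally often, in a random order.
def block_randomize(patient_ids, groups: %w[treatment control], block_size: 4, rng: Random.new)
  unless (block_size % groups.size).zero?
    raise ArgumentError, "block_size must be a multiple of the number of groups"
  end

  per_group = block_size / groups.size
  assignments = {}
  patient_ids.each_slice(block_size) do |block|
    labels = (groups * per_group).shuffle(random: rng) # fresh permutation per block
    block.each_with_index { |id, i| assignments[id] = labels[i] }
  end
  assignments
end

ids = (1..8).map { |n| format("P%03d", n) }
assignments = block_randomize(ids, rng: Random.new(42)) # seeded so the run is reproducible
assignments.each { |id, group| puts "#{id}: #{group}" }
```

Passing the random number generator in as an argument lets a statistician fix the seed for auditing while production runs use a fresh one.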

Page 86: Ruby on bioinformatics

Patient Randomization

from Wei-Ming Wu, Sean Wang

Page 87: Ruby on bioinformatics

Patient Randomization

from Wei-Ming Wu, Sean Wang

Page 88: Ruby on bioinformatics

Patient Randomization

from Wei-Ming Wu, Sean Wang

Assign patients to treatment groups

Page 89: Ruby on bioinformatics

Database Statistics Dashboard

❖ Based on Rails, run on JRuby

❖ ActiveRecord models for the Oracle database

❖ activerecord-oracle_enhanced-adapter gem

❖ google_visualr gem for visualization

❖ Counts the number of projects, forms, fields, records and patients

from Wei-Ming Wu, Winnie Lui

Page 90: Ruby on bioinformatics

Database Statistics Dashboard

from Wei-Ming Wu, Winnie Lui

Page 91: Ruby on bioinformatics

Education

Page 92: Ruby on bioinformatics

Learning Bioinformatics

❖ http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html

❖ http://www.liacs.nl/~hoogeboo/mcb/nature_primer.html

❖ http://www.mygoblet.org - Python, R

❖ http://www.biotnet.org

Page 93: Ruby on bioinformatics

Books for Beginners

http://practicalcomputing.org

Python

Page 94: Ruby on bioinformatics

Python Book for Bioinformatics

http://shop.oreilly.com/product/9780596154516.do

Page 95: Ruby on bioinformatics

Python Is Much More Successful in Teaching than Ruby

Page 96: Ruby on bioinformatics

Do we lack a killer application written in Ruby?

http://www.witardroadbaptist.org/im-new/im-not-sure-im-ready-for-church-yet/

Page 97: Ruby on bioinformatics

We Need People!!

Page 98: Ruby on bioinformatics

Are You Ready To Be A Data Scientist

Or A Bioinformatician?

Page 99: Ruby on bioinformatics

Markets

http://www.genengnews.com/gen-articles/personalized-medicine-health-economic-aspects/4824/

Page 100: Ruby on bioinformatics

http://www.bccresearch.com/market-research/biotechnology/bioinformatics-market-technology-bio051b.html

Still developing

Does Asia have enough market share?

Page 101: Ruby on bioinformatics

Topics to Act On

❖ data generation and data management

❖ data analysis and software

❖ data processing and storage

❖ application of bioinformatics in pharma research and development

http://www.giichinese.com.tw/report/bc268909-bioinformatics-technologies-global-markets.html

Page 102: Ruby on bioinformatics

Health Care in Cloud

❖ Health promotion cloud

❖ Vaccination cloud

❖ Exercise cloud

❖ Workplace wellness

❖ Physical checkup cloud

❖ Welfare cloud

from Dr. Chi-Hung Lin

Page 103: Ruby on bioinformatics

Code For Bioinformatics

Page 104: Ruby on bioinformatics

Q & A