ruby on bioinformatics

Post on 10-May-2015

Ruby Conference Taiwan 2014

Ruby on Bioinformatics

Tse-Ching Ho (何澤清)
@tsechingho
2014/4/26

Horse + Stripe = Zebra

Biology + Informatics = Bioinformatics

Age of Big Data

Age of Data Science

High-Throughput Data

❖ Big Data

❖ file size is small but there are many files

❖ file size is large but there are just a few files

❖ Data size of bioinformatics

❖ 1,000,000,000 records for a subject (person) is normal

The Storage Demand is Increasing

from Dr. Yu-Tai Wang

Data Size of Sequencing After 5 Years

https://www.nanoporetech.com

70,000 newborn babies × 500 GB = 35 PB

30,000 patients × 10,000 cells × 500 GB = 1.5 × 10^11 GB = 150 EB

from Dr. Yu-Tai Wang

1. Counted using current NGS data sizes
2. Does not include civil medical institutes
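The estimate can be checked with a few lines of Ruby. This is only a sketch that assumes decimal units (1 PB = 10^6 GB, 1 EB = 10^9 GB); the constant and variable names are made up for illustration. Worked through, 70,000 × 500 GB is 35 million GB, which is 35 PB.

```ruby
# Storage estimate in decimal units (1 PB = 10**6 GB, 1 EB = 10**9 GB).
GB_PER_SAMPLE = 500   # size of one sequenced sample, from the slide

newborn_gb = 70_000 * GB_PER_SAMPLE            # one sample per newborn
patient_gb = 30_000 * 10_000 * GB_PER_SAMPLE   # 10,000 cells per patient

newborn_pb = newborn_gb / 1_000_000.0
patient_eb = patient_gb / 1_000_000_000.0

puts "Newborns: #{newborn_pb} PB"
puts "Patients: #{patient_eb} EB"
```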

Computing Power is Required

❖ HPC

❖ Infiniband cluster

❖ Amazon EC2 cluster

❖ Hadoop cluster

❖ Many CPU cores

❖ Large memory

❖ High I/O efficiency

http://arstechnica.com/business/2012/05/amazons-hpc-cloud-supercomputing-for-the-99/

http://arstechnica.com/business/2012/04/4829-per-hour-supercomputer-built-on-amazon-cloud-to-fuel-cancer-research/

$4,828.85 per hour
51,132 cores, 58.78 TB RAM
6,742 Amazon EC2 instances

2012
Protein simulation
Cycle Computing System
Ganglia HPC clusters
Deployed by Opscode Chef

http://www.hpcwire.com/2013/07/08/infiniband_snaps_up_strong_super_share/

Is a 10 Gb network enough for I/O?

Embarrassingly parallel: the calculations are independent of each other.
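As a minimal sketch of what "independent of each other" means in code, the example below computes the GC content of each read in its own thread. The reads and the `gc_content` helper are made up for illustration; no thread needs the result of any other.

```ruby
# GC fraction of one read; each call depends only on its own input.
def gc_content(seq)
  seq.count("GCgc").to_f / seq.length
end

reads = %w[GGCC ATAT GCAT AAAA]   # made-up reads for illustration

# One thread per read; the threads share no state.
threads = reads.map { |read| Thread.new { gc_content(read) } }
results = threads.map(&:value)    # wait for every thread and collect results

p results   # => [1.0, 0.0, 0.5, 0.0]
```

On MRI the global VM lock keeps these threads from running Ruby code on several cores at once, while JRuby threads can. For real workloads the same independence lets the work be split across processes or cluster nodes.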

http://glennklockwood.blogspot.tw/2013/12/high-performance-virtualization-sr-iov_14.html

Infiniband is good at I/O efficiency

• Interconnect speed.
• I/O performance.
• An Infiniband system has about 3.8 GB/s of bandwidth.
• A 10 Gb network has about 400 MB/s of bandwidth.
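To put the two bandwidth figures in perspective, here is a rough sketch of how long it would take to move one 500 GB sample (the per-sample size used in the storage estimate) over each link. It assumes the quoted bandwidths are sustained, which real transfers rarely achieve.

```ruby
DATA_GB          = 500.0   # one sample, size taken from the storage slide
INFINIBAND_GB_S  = 3.8     # quoted Infiniband bandwidth in GB/s
TEN_GIG_GB_S     = 0.4     # quoted 400 MB/s, expressed in GB/s

infiniband_seconds = DATA_GB / INFINIBAND_GB_S
ten_gig_seconds    = DATA_GB / TEN_GIG_GB_S
speedup            = INFINIBAND_GB_S / TEN_GIG_GB_S

puts "Infiniband: #{infiniband_seconds.round} s"
puts "10 Gb link: #{ten_gig_seconds.round} s"
puts "Ratio:      #{speedup.round(1)}x"
```

That is roughly two minutes against roughly twenty minutes per sample, under these assumptions.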

Data science is about DATA!

Data Scientist Concerns

❖ Data quality

❖ Factors of filter

❖ Statistics

❖ Visualization

❖ Interpretation

Programmer Concerns

❖ High-throughput data (Big Data) handling

❖ Data format / File format

❖ Data parsing

❖ Statistical tools

❖ Visualization

❖ Profit / Markets

Biology

http://businessintelligence.com/bi-insights/the-personalized-medicine-revolution-is-almost-here/

A Dream of Personalized Medicine

from Dr. Yen-Hua Huang

Genomic Disease

http://www1.imperial.ac.uk/computationalsystemsmedicine/biomolecularmedicine/personalised/

Cure by Medicines

http://scienceroll.com/2008/04/25/personalized-medicine-real-clinical-examples/

Personalized Medicine

http://www.genomicslawreport.com/index.php/tag/personalized-medicine/

Personal Genomic Analysis

http://www.thecureisnow.org/index.php/our-strategy/philosophy-of-tcin/personalized-medicine

http://www.genengnews.com/insight-and-intelligence/personalized-medicine-not-quite-there-yet/77899649/

DNA

http://cisncancer.org/research/what_we_know/omics/personalized_medicine_02.html

DNA Sequencing

http://www.scq.ubc.ca/genome-projects-uncovering-the-blueprints-of-biology/

http://www.broadinstitute.org/blog/beyond-genome-new-uses-dna-sequencers

No Teach for Reading DNA

http://intellimedix.com

Do The Right Things

http://www.dnadirect.com

http://biodbnet.abcc.ncifcrf.gov/dbInfo/netGraph.php

ID Mapping of Databases

Each node is a database. Each database has its own unique ID. These IDs connect the databases into a network. I think handling this complexity should be easy for the people sitting here.
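A network of databases like this can be treated as a plain graph. The sketch below uses a breadth-first search to find the shortest chain of databases through which an ID could be mapped. The `LINKS` table is a small made-up example, not the actual bioDBnet graph, and `mapping_path` is a hypothetical helper.

```ruby
# Hypothetical cross-reference links between databases (undirected).
LINKS = {
  "Ensembl Gene" => ["Entrez Gene", "UniProt"],
  "Entrez Gene"  => ["Ensembl Gene", "RefSeq"],
  "UniProt"      => ["Ensembl Gene", "PDB"],
  "RefSeq"       => ["Entrez Gene"],
  "PDB"          => ["UniProt"]
}.freeze

# Breadth-first search: returns the shortest chain of databases, or nil.
def mapping_path(links, from, to)
  queue = [[from]]
  seen = { from => true }
  until queue.empty?
    path = queue.shift
    return path if path.last == to

    links.fetch(path.last, []).each do |neighbour|
      next if seen[neighbour]

      seen[neighbour] = true
      queue << path + [neighbour]
    end
  end
  nil
end

p mapping_path(LINKS, "RefSeq", "PDB")
# => ["RefSeq", "Entrez Gene", "Ensembl Gene", "UniProt", "PDB"]
```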

Bioinformatics Sites for Rubyists

NCBI

http://www.ncbi.nlm.nih.gov

Ensembl

http://www.ensembl.org

Nature Biotechnology

http://www.nature.com/nbt

PLOS Computational Biology

http://www.ploscompbiol.org

Biostars

https://www.biostars.org

SEQanswers

http://seqanswers.com

Ruby Sites for Bioinformaticians

GitHub

https://github.com/

RubyGems.org

https://rubygems.org

The Ruby Toolbox

https://www.ruby-toolbox.com

Biogems.info

http://www.biogems.info

BioRuby

http://bioruby.org

SciRuby

http://sciruby.com

What programming language is best for a bioinformatics beginner?

Mapping Sequence Data

from Jui-Tse Hsu

Simple Mapping Sequence Data

Illumina exome sequence reads
1. .seq -> fastq
2. Illumina score -> Phred score

Aligner (soap2, bwa, bowtie, etc.)

Aligned reads

Convert to SAM

Aligned reads (sam file)

Compress to BAM

Aligned reads (bam file)

Index, Sort, Remove PCR duplicates (rmdup)

Useful reads data
1. cleaned bam file
2. quality control, get statistics, mapped, unmapped, etc.

Call variants
1. SNVs in VCFs
2. structural variants
3. copy number changes, etc.

Visualization in browsers

from Jui-Tse Hsu
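The "Illumina score -> Phred score" step can be sketched in Ruby. This assumes the reads use the Illumina 1.3 to 1.7 FASTQ encoding (ASCII offset 64) and are converted to the Sanger encoding (ASCII offset 33); older Solexa scores use a different formula and are not handled here. The quality string and helper names are made up for illustration.

```ruby
ILLUMINA_OFFSET = 64   # Illumina 1.3-1.7 FASTQ quality encoding
SANGER_OFFSET   = 33   # Sanger (Phred+33) quality encoding

# Decode a quality string into numeric Phred scores.
def phred_scores(quality, offset)
  quality.bytes.map { |byte| byte - offset }
end

# Re-encode an Illumina 1.3-1.7 quality string as Phred+33.
def illumina_to_sanger(quality)
  phred_scores(quality, ILLUMINA_OFFSET)
    .map { |score| (score + SANGER_OFFSET).chr }
    .join
end

illumina = "hhdB"   # made-up quality string
p phred_scores(illumina, ILLUMINA_OFFSET)   # => [40, 40, 36, 2]
puts illumina_to_sanger(illumina)           # prints IIE#
```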

C/C++

❖ Key Algorithms

❖ Written in C/C++

❖ Foundation Tools

❖ BWA

❖ Bowtie / Bowtie2

❖ samtools / bamtools

❖ GMAP / GSNAP

❖ BLAT

❖ Tophat
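These foundation tools do the heavy lifting, but their text output is easy to consume from a scripting language. As a sketch, the code below counts mapped, unmapped and duplicate reads from SAM lines by reading the FLAG column (bit 0x4 means unmapped, bit 0x400 means PCR or optical duplicate). The sample lines and the `sam_stats` helper are made up for illustration.

```ruby
UNMAPPED  = 0x4     # SAM FLAG bit: segment unmapped
DUPLICATE = 0x400   # SAM FLAG bit: PCR or optical duplicate

# Count reads by mapping status from an enumerable of SAM lines.
def sam_stats(lines)
  stats = Hash.new(0)
  lines.each do |line|
    next if line.start_with?("@")          # skip header lines

    flag = line.split("\t")[1].to_i        # FLAG is the second column
    stats[:total] += 1
    stats[(flag & UNMAPPED).zero? ? :mapped : :unmapped] += 1
    stats[:duplicate] += 1 unless (flag & DUPLICATE).zero?
  end
  stats
end

sam = [
  "@HD\tVN:1.0\tSO:coordinate",
  "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
  "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
  "r3\t1024\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
]
stats = sam_stats(sam)
p stats
```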

http://genomebiology.com/2010/11/12/220

Analysis Pipeline

Overview of the RNA-seq analysis pipeline for detecting differential expression

Perl

❖ First language

❖ BioPerl

❖ Ensembl

http://millionchimpanzees.blogspot.tw/2011/09/book-review-learning-perl-sixth-edition.html

Java

❖ Good parts of Java

❖ GATK

❖ Taverna

❖ Hadoop

http://shop.oreilly.com/product/9780596803742.do

R

❖ Statistical tools

❖ Bioconductor

❖ edgeR

❖ Data Mining and Analysis Books

http://exploringdata.github.io/data-visualization-books/analysis/

Python

❖ Young people

❖ Galaxy

http://news.oreilly.com/2008/08/python-for-unix-and-linux-syst.html

The Ruby Way in Bioinformatics

What kinds of libraries do you think are important?

Tools

❖ minimization

❖ integration

❖ quorum - Rails engine

I Am Not an Analyst, I Am a Programmer.

What can I get involved in?

Pipeline / Workflow

Galaxy - Python
Taverna - Java
??? - Ruby
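To give a feel for what the core of a Ruby workflow tool could look like, here is a minimal sketch that orders pipeline steps by their dependencies using `TSort` from the standard library. The `Workflow` class and the step names are hypothetical; this is not an existing tool.

```ruby
require "tsort"

# Hypothetical workflow: each step maps to the steps it depends on.
class Workflow
  include TSort

  def initialize(steps)
    @steps = steps
  end

  # TSort hook: enumerate every step.
  def tsort_each_node(&block)
    @steps.each_key(&block)
  end

  # TSort hook: enumerate the dependencies of one step.
  def tsort_each_child(step, &block)
    @steps.fetch(step).each(&block)
  end

  # Steps in an order where every dependency comes first.
  def run_order
    tsort
  end
end

steps = {
  call_variants: [:rmdup],
  rmdup: [:sort],
  sort: [:align],
  align: []
}
order = Workflow.new(steps).run_order
p order   # => [:align, :sort, :rmdup, :call_variants]
```

A real workflow system would also need job submission, retries and provenance tracking, none of which is shown here.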

Web System

❖ Data warehouse

❖ Pipeline management

❖ Coordination center

❖ Visualisation

Cloud / Distributed / Parallel

http://www.mynamesnotmommy.com/yes-there-are-dumb-questions/question-mark/

What Are We Doing with Ruby?

Ensembl Virtual Machine

❖ Powered by Veewee, Vagrant and Chef

❖ Automatically builds a versioned Ensembl system (Perl)

❖ Includes database, queuing services and analysis tools

❖ Multiple sites and multiple species in one virtual machine

❖ Helps to build local and custom systems

from Tse-Ching Ho

Ensembl Virtual Machine

Use an existing Vagrant box

Prepare SOP for Chef recipes

Provision VM with Chef recipes

Write Chef recipes

Export VM by VirtualBox

Setup Vagrantfile

Create Vagrant box by Veewee

Write definition of Vagrant box by Veewee

Ensembl VM Automation

from Tse-Ching Ho

Ensembl Virtual Machine

Web view of Ensembl

from Tse-Ching Ho

DR. RAW

❖ Derived from DRAW and SneakPeek

❖ Composed of C/C++, Bash, Perl, Java and Ruby

❖ Supports both DNA and RNA resequencing analysis

❖ Enhanced quality control for DNA and RNA

❖ Distributed computing pipeline

❖ Supports PBS, LSF and SGE platforms (queuing systems)
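Supporting several queuing systems mostly means hiding their different submit commands behind one interface. The sketch below builds, but does not run, a submit command for each platform, using `qsub -N` for PBS and SGE and `bsub -J` for LSF. The `submit_command` helper, job name and script name are hypothetical, and how DR. RAW itself submits jobs is not shown in the slides.

```ruby
require "shellwords"

# Build (but do not run) a job submission command for a queuing system.
def submit_command(scheduler, job_name, script)
  name = Shellwords.escape(job_name)
  path = Shellwords.escape(script)
  case scheduler
  when :pbs, :sge then "qsub -N #{name} #{path}"        # PBS and SGE both use qsub
  when :lsf       then "bsub -J #{name} < #{path}"      # LSF reads the script from stdin
  else raise ArgumentError, "unknown scheduler: #{scheduler}"
  end
end

%i[pbs sge lsf].each do |scheduler|
  puts submit_command(scheduler, "align_s1", "align.sh")
end
```

Real clusters usually also need queue names and resource requests, which differ between sites.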

from Hannah Lin

DR. RAW

Analysis Tools

Analysis Pipeline

Quality Control

Resource Manager System

DNA QC, Forward : Reverse

RNA QC, Forward : Reverse

BWA-0.7.7
Samtools-0.1.19
GATK-3.1

GSNAP-13-10-25
Cufflinks-13-11
FusionGene …

DNA Sequencing data

RNA Sequencing data

SGE (Sun Grid Engine)
PBS (Portable Batch System)
LSF (Load Sharing Facility)

Green: new components
Red: updated components

from Hannah Lin

DR. RAW

Web view by Rails

from Hannah Lin

Neo4j - JRuby Data Parser

❖ Graph database for data integration of discrete clinical research documents

❖ Source data are Excel/CSV files collected at different times by different people

❖ Neo4j is good for cleaning up such a massive data set

❖ Cooperation between biologist and programmer

from Wei-Ming Wu, Chia-Hsuan Lee

Neo4j - JRuby Data Parser

from Wei-Ming Wu, Chia-Hsuan Lee

Collision Rate of Input Data: 1.3 %
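The slides do not define how the collision rate was measured. As one possible reading, the sketch below treats a collision as the same record ID appearing with conflicting values across input files, and reports the share of IDs affected. The rows and the `collision_rate` helper are made up for illustration.

```ruby
# Rows as they might arrive from several spreadsheets (made-up data).
rows = [
  { id: "P001", sex: "F", birth_year: 1970 },
  { id: "P002", sex: "M", birth_year: 1965 },
  { id: "P001", sex: "F", birth_year: 1971 },   # conflicts with the first row
  { id: "P003", sex: "F", birth_year: 1980 },
  { id: "P002", sex: "M", birth_year: 1965 }    # exact duplicate, no conflict
]

# Share of keys whose rows disagree with each other (assumed definition).
def collision_rate(rows, key)
  groups = rows.group_by { |row| row[key] }
  collided = groups.count { |_, same_key| same_key.uniq.size > 1 }
  collided.to_f / groups.size
end

rate = collision_rate(rows, :id)
puts format("%.1f%%", rate * 100)   # prints 33.3%
```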

API Server for Third Party Firm

❖ API server based on Rails, run by JRuby

❖ ActiveRecord models for Oracle database

❖ activerecord-oracle_enhanced-adapter gem

❖ Import Excel files into the third-party GUI client

❖ Third-party server sends XML requests to the API server
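The "parse request" side of such an API can be sketched with REXML, which ships with Ruby. The XML layout, element names and `parse_request` helper below are all hypothetical, since the real request schema is not shown in the slides.

```ruby
require "rexml/document"

# Hypothetical request body; the real schema is not shown in the slides.
xml = <<~XML
  <request action="upload">
    <patient id="P001">
      <field name="weight">62.5</field>
      <field name="height">170</field>
    </patient>
  </request>
XML

# Turn the XML request into a plain Ruby hash.
def parse_request(xml)
  root = REXML::Document.new(xml).root
  patient = root.elements["patient"]
  fields = {}
  patient.elements.each("field") do |field|
    fields[field.attributes["name"]] = field.text
  end
  {
    action: root.attributes["action"],
    patient_id: patient.attributes["id"],
    fields: fields
  }
end

request = parse_request(xml)
p request
```

In a Rails controller the resulting hash could then be handed to the ActiveRecord models for writing into the database.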

from Wei-Ming Wu, Sean Wang

API Server for Third Party Firm

TCHC server

API server (Rails, JRuby)

CSIS (Java, Oracle)

Send data by XML

Write into database

Read data by client program

Upload data

Parse request

Third Party

Our Servers

Windows GUI

from Wei-Ming Wu, Sean Wang

Daily Checking Rule

❖ Based on Rails, run by JRuby

❖ ActiveRecord models for Oracle database

❖ activerecord-oracle_enhanced-adapter gem

❖ Users can define rules for checking data, usually values in filled-in forms

❖ Checking rules run daily rather than when forms are filled in
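User-defined checking rules can be modelled as named predicates that a daily job runs over the stored records. The sketch below uses plain hashes in place of ActiveRecord models; the `Rule` struct, the two example rules and the records are hypothetical.

```ruby
# A rule is a name plus a predicate that returns true when a record is valid.
Rule = Struct.new(:name, :check)

RULES = [
  Rule.new("age in range", ->(record) { (0..120).cover?(record[:age]) }),
  Rule.new("weight present", ->(record) { !record[:weight].nil? })
].freeze

# Returns [record id, rule name] pairs for every failed check.
def daily_check(records, rules)
  records.flat_map do |record|
    rules.reject { |rule| rule.check.call(record) }
         .map { |rule| [record[:id], rule.name] }
  end
end

records = [
  { id: 1, age: 34, weight: 70.2 },
  { id: 2, age: 150, weight: nil }
]
violations = daily_check(records, RULES)
p violations   # => [[2, "age in range"], [2, "weight present"]]
```

In production the daily run would presumably be triggered by a scheduler such as cron, which is outside this sketch.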

from Wei-Ming Wu, Sean Wang

Patient Randomization

❖ Based on Rails, run by JRuby

❖ ActiveRecord models for Oracle database

❖ activerecord-oracle_enhanced-adapter gem

❖ Assign patients to different groups by a randomization method

❖ Cooperation between statistician and programmer
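The slides do not say which randomization method was used. As an illustration, the sketch below implements permuted-block randomization, one common method, in which every block of patients contains each treatment arm equally often. The `block_randomize` helper, arm names, block size and seed are all assumptions.

```ruby
# Permuted-block randomization: each block holds every arm equally often.
def block_randomize(patient_ids, arms, block_size, rng)
  unless (block_size % arms.size).zero?
    raise ArgumentError, "block size must be a multiple of the number of arms"
  end

  block = arms * (block_size / arms.size)   # e.g. two of each arm for size 4
  patient_ids.each_slice(block_size).flat_map do |slice|
    slice.zip(block.shuffle(random: rng))   # fresh shuffle for every block
  end.to_h
end

rng = Random.new(2014)   # fixed seed so the allocation can be reproduced
assignment = block_randomize((1..8).to_a, %w[treatment control], 4, rng)
assignment.each { |patient, arm| puts "#{patient}: #{arm}" }
```

Keeping the seed lets a statistician audit or reproduce the allocation later.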

from Wei-Ming Wu, Sean Wang

Patient Randomization

from Wei-Ming Wu, Sean Wang

Assign patients to treatment groups

Database Statistics Dashboard

❖ Based on Rails, run by JRuby

❖ ActiveRecord models for Oracle database

❖ activerecord-oracle_enhanced-adapter gem

❖ google_visualr gem for visualization

❖ Count number of projects, forms, fields, records and patients

from Wei-Ming Wu, Winnie Lui

Education

Learning Bioinformatics

❖ http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html

❖ http://www.liacs.nl/~hoogeboo/mcb/nature_primer.html

❖ http://www.mygoblet.org - Python, R

❖ http://www.biotnet.org

Books for Beginners

http://practicalcomputing.org

Python

Python Book for Bioinformatics

http://shop.oreilly.com/product/9780596154516.do

Python is more successful than Ruby in teaching

Do we lack a killer application in Ruby?

http://www.witardroadbaptist.org/im-new/im-not-sure-im-ready-for-church-yet/

We Need People!!

Are You Ready to Be a Data Scientist

Or a Bioinformatician?

Markets

http://www.genengnews.com/gen-articles/personalized-medicine-health-economic-aspects/4824/

http://www.bccresearch.com/market-research/biotechnology/bioinformatics-market-technology-bio051b.html

Still developing. Does Asia have enough market share?

Topics to Take Action On

❖ data generation and data management

❖ data analysis and software

❖ data processing and storage

❖ application of bioinformatics in pharma research and development

http://www.giichinese.com.tw/report/bc268909-bioinformatics-technologies-global-markets.html

Health Care in Cloud

❖ Health promotion cloud

❖ Vaccination cloud

❖ Exercise cloud

❖ Workplace wellness

❖ Physical checkup cloud

❖ Welfare cloud

from Dr. Chi-Hung Lin

Code For Bioinformatics

Q & A
