cloudy with a touch of cheminformatics

Post on 10-May-2015


Cloudy with a Touch of Cheminformatics

Rajarshi Guha, Tyler Peryea, Dac-Trung Nguyen
NIH Center for Advancing Translational Science

Chemaxon UGM

September 26th, 2012, Wellesley, MA

Parallel computing in the cloud

• Modern cloud vendors make provisioning compute resources easy
  – Allows one to handle unpredictable loads easily
  – Pay only for what you need

• Chemistry applications don't usually have very dynamic loads

• But large scale resources are an opportunity for large scale (parallel) computations

All HPC is not equal

Legacy HPC
• Use cloud resources in the same way as a local cluster
• MIT StarCluster makes this easy to do

Cloudy HPC
• Make use of cloud capabilities
• Old algorithms, new infrastructure
• Spot instances, SNS, SQS, SimpleDB, S3, etc.

Big Data HPC
• Huge datasets
• Candidates for map-reduce
• Involves algorithm (re)design

http://www.slideshare.net/chrisdag/mapping-life-science-informatics-to-the-cloud

Big data & cheminformatics

• Computation over large chemical databases
  – PubChem, ChEMBL, GDB-13, …

• What types of computations?
  – Searches (substructure, pharmacophore, …)
  – QSAR models & predictions over large data

• Fundamentally, "big chemical data" lets us explore larger chemical spaces

Map-Reduce

[Figure: input splits 0, 1 and 2 each feed a Map task; map outputs are sorted, copied and merged into two Reduce tasks, which write Part 0 and Part 1]

map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)

Tom White, Hadoop: The Definitive Guide, 3rd Ed., O'Reilly

Counting atoms

• The chemical version of the word counting task

Input: arbitrary line numbers (K1), SMILES (V1)
  1, Nc1ccc2ncccc2c1N
  2, Cl.CC1CCc2nc3ccccc3c(C)c2C1
  ...
  152366, Nc1ccc2ncccc2c1N

MAP output: Atom Symbol (K2), Occurrence (V2)
  N 1
  N 1
  N 1
  N 1
  ...

Grouped for the reducer: Atom Symbol (K2), list (V2)
  N, list(1,1,1,1,...)
  C, list(1,1,1,1,...)

REDUCE output: Atom Symbol (K3), Count (V3)
  N, 100
  C, 5684
  ...
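The map and reduce steps of the atom counting slide can be sketched in plain Java without Hadoop. This is only an illustration under stated assumptions: the class name `AtomCountSketch` and its methods are hypothetical, and the regular-expression tokenizer is a crude stand-in for a real SMILES parser (it handles bracket atoms, Cl, Br and the one-letter organic-subset symbols, and ignores implicit hydrogens).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AtomCountSketch {
    // Assumption: a crude tokenizer, not a real SMILES parser.
    private static final Pattern ATOM =
            Pattern.compile("\\[[^\\]]+\\]|Cl|Br|[BCNOPSFI]|[bcnops]");
    // Pulls the element symbol out of a bracket atom such as [nH] or [13CH4].
    private static final Pattern BRACKET_SYMBOL =
            Pattern.compile("^\\[\\d*([A-Z][a-z]?|[a-z]+)");

    /** MAP: one SMILES (V1) gives a list of atom symbols (K2), each worth a count of 1 (V2). */
    public static List<String> map(String smiles) {
        List<String> symbols = new ArrayList<>();
        Matcher m = ATOM.matcher(smiles);
        while (m.find()) {
            String token = m.group();
            if (token.startsWith("[")) {
                Matcher b = BRACKET_SYMBOL.matcher(token);
                if (!b.find()) {
                    continue; // e.g. a wildcard atom; skipped in this sketch
                }
                token = b.group(1);
            }
            // Fold aromatic lower-case symbols into the element symbol (c becomes C).
            symbols.add(Character.toUpperCase(token.charAt(0)) + token.substring(1));
        }
        return symbols;
    }

    /** Shuffle and REDUCE: group by symbol and sum the 1s into (K3, V3). */
    public static Map<String, Integer> count(List<String> smilesLines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String smiles : smilesLines) {
            for (String symbol : map(smiles)) {
                totals.merge(symbol, 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of(
                "Nc1ccc2ncccc2c1N",
                "Cl.CC1CCc2nc3ccccc3c(C)c2C1")));
    }
}
```

In a real Hadoop job the grouping done by `TreeMap.merge` here is what the framework's shuffle phase provides between the mapper and the reducer.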

The Hadoop ecosystem

• Hadoop Common
• Hadoop Distributed Filesystem
• Map Reduce Engine
• Hive
• Hama
• Whirr
• HBase
• Pig
• Avro
• Mahout
• Flume
• Zookeeper
• Chukwa

Based on http://www.slideshare.net/informaticacorp/101111-part-3-matt-aslett-the-hadoop-ecosystem

Cheminformatics on Hadoop

• Hadoop and Atom Counting
• Hadoop and SD Files
• Cheminformatics, Hadoop and EC2
• Pig and Cheminformatics

But are cheminformatics problems really big enough to justify all of this?

package gov.nih.ncgc.hadoop;

import chemaxon.formats.MolFormatException;
import chemaxon.formats.MolImporter;
import chemaxon.license.LicenseManager;
import chemaxon.license.LicenseProcessingException;
import chemaxon.sss.search.MolSearch;
import chemaxon.sss.search.SearchException;
import chemaxon.struc.Molecule;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;

/**
 * SMARTS searching over a set of files using Hadoop.
 *
 * @author Rajarshi Guha
 */
public class SmartsSearch extends Configured implements Tool {
    private final static IntWritable one = new IntWritable(1);
    private final static IntWritable zero = new IntWritable(0);

    public static class MoleculeMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private String pattern = null;
        private MolSearch search;
        private final Text matches = new Text();

        public void configure(JobConf job) {
            // Load the ChemAxon license that was shipped via the distributed cache
            try {
                Path[] licFiles = DistributedCache.getLocalCacheFiles(job);
                BufferedReader reader = new BufferedReader(new FileReader(licFiles[0].toString()));
                StringBuilder license = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) license.append(line);
                reader.close();
                LicenseManager.setLicense(license.toString());
            } catch (IOException e) {
                // license errors are silently ignored here
            } catch (LicenseProcessingException e) {
                // license errors are silently ignored here
            }

            // Parse the SMARTS query once per mapper
            pattern = job.getStrings("pattern")[0];
            search = new MolSearch();
            try {
                Molecule queryMol = MolImporter.importMol(pattern, "smarts");
                search.setQuery(queryMol);
            } catch (MolFormatException e) {
                // an invalid pattern is silently ignored here
            }
        }

        // Emits (molecule name, 1) for a match and (molecule name, 0) otherwise
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
            Molecule mol = MolImporter.importMol(value.toString());
            matches.set(mol.getName());
            search.setTarget(mol);
            try {
                if (search.isMatching()) {
                    output.collect(matches, one);
                } else {
                    output.collect(matches, zero);
                }
            } catch (SearchException e) {
                // molecules that fail to search are skipped
            }
        }
    }

    public static class SmartsMatchReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Keeps only the molecules that matched; non-matches (0) are dropped
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
            while (values.hasNext()) {
                if (values.next().compareTo(one) == 0) {
                    result.set(1);
                    output.collect(key, result);
                }
            }
        }
    }

    public int run(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("Usage: ss <in> <out> <pattern> <license file>");
            return 2;
        }

        // Use this class (not HeavyAtomCount) to locate the job jar
        JobConf jobConf = new JobConf(getConf(), SmartsSearch.class);
        jobConf.setJobName("smartsSearch");

        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(IntWritable.class);

        jobConf.setMapperClass(MoleculeMapper.class);
        jobConf.setCombinerClass(SmartsMatchReducer.class);
        jobConf.setReducerClass(SmartsMatchReducer.class);

        jobConf.setInputFormat(TextInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);

        jobConf.setNumMapTasks(5);

        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));
        jobConf.setStrings("pattern", args[2]);

        // make the license file available via dist cache
        DistributedCache.addCacheFile(new Path(args[3]).toUri(), jobConf);

        JobClient.runJob(jobConf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new SmartsSearch(), args);
        System.exit(res);
    }
}

Simplifying Hadoop applications

• Raw Hadoop programs can be tedious to write

[Code listing: SMARTS based substructure search]

Pig & Pig Latin

• Pig Latin programs are much simpler to write and get translated to Hadoop code

• SQL-like, requires UDF to be implemented to perform non-standard tasks

[Code listings: SMARTS search in Pig Latin; UDF for SMARTS search]

A = load 'medium.smi' as (smiles:chararray);
B = filter A by gov.nih.ncgc.hadoop.pig.SMATCH(smiles, 'NC(=O)C(=O)N');
store B into 'output.txt';

package gov.nih.ncgc.hadoop.pig;

import chemaxon.formats.MolImporter;
import chemaxon.sss.search.MolSearch;
import chemaxon.sss.search.SearchException;
import chemaxon.struc.Molecule;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

import java.io.IOException;

public class SMATCH extends FilterFunc {
    // Must be instantiated; a null MolSearch would fail on the first call to exec()
    static MolSearch search = new MolSearch();

    // tuple = (target SMILES, query SMARTS); returns true when the query matches
    public Boolean exec(Tuple tuple) throws IOException {
        if (tuple == null || tuple.size() < 2) return false;
        String target = (String) tuple.get(0);
        String query = (String) tuple.get(1);
        try {
            Molecule queryMol = MolImporter.importMol(query, "smarts");
            search.setQuery(queryMol);
            search.setTarget(MolImporter.importMol(target, "smiles"));
            return search.isMatching();
        } catch (SearchException e) {
            e.printStackTrace();
        }
        return false;
    }
}

Going beyond chunking?

• All the preceding use cases are embarrassingly parallel
  – Chunking the input data and applying the same operation to each chunk
  – Very nice when you have a big cluster

Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?

Going beyond chunking?

• Applications that make use of pairwise (or higher order) calculations could benefit from a map-reduce incarnation
  – Doesn't necessarily avoid the O(N^2) barrier
  – Bioisostere identification is one case that could be rephrased as a map-reduce problem

• Map-Reduce Design Patterns

Identifying MMPs

• First step in identifying bioisosteres is to identify candidate matched molecular pairs (MMPs)
  – Naïve all pairs comparison
  – Predefined list of transformations
    • Birch et al, BMCL, 2009
  – Fragment intersection
    • Hussain et al, JCIM, 2010
  – MCS based approaches (e.g., WizePairZ)
    • Warner et al, JCIM, 2010

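Naïve Bioisostere evaluation

[Figure: N molecules compared all against all, N(N-1)/2 comparisons]

Scaffold seeding

[Figure: a seed fragment and the members that contain it]

Scaffold seeded bioisosteres

[Figure: molecules grouped by scaffold, M(M-1)/2 comparisons within each group of M members]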
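The two comparison counts above can be put side by side in a few lines of Java. This is a minimal sketch: the class name `PairCountSketch` and the series sizes in `main` are hypothetical, and it assumes for simplicity that every molecule belongs to exactly one scaffold series, which need not hold for real fragmentation output.

```java
import java.util.List;

public class PairCountSketch {
    /** Number of unordered pairs among n items: n(n-1)/2. */
    public static long allPairs(long n) {
        return n * (n - 1) / 2;
    }

    /** Sum of M(M-1)/2 over scaffold series, where each entry is a series size M. */
    public static long seededPairs(List<Integer> seriesSizes) {
        long total = 0;
        for (int m : seriesSizes) {
            total += allPairs(m);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical series sizes: three small series and one highly populated one.
        List<Integer> sizes = List.of(7, 7, 7, 100);
        long molecules = 0;
        for (int m : sizes) {
            molecules += m; // assumes disjoint series
        }
        System.out.println("naive:  " + allPairs(molecules));
        System.out.println("seeded: " + seededPairs(sizes));
    }
}
```

The single large series dominates the seeded total in this toy input, which is the effect a highly populated scaffold has on the overall cost.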

Seeded bioisosteres – MR style

MAP
• Do pairwise MCS analysis on scaffold series
• For each pair output SMIRKS transform and the pair of SMILES

REDUCE
• Collect pairs of SMILES for a given SMIRKS
• Store in DB, or
• Filter by activity, or
• …
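The shape of this map and reduce pair can be sketched in plain Java, leaving the chemistry out. Everything here is an assumption made for illustration: the class `SeededMmpSketch` is hypothetical, and the `transformOf` function passed in is a placeholder for the MCS analysis that would produce a SMIRKS string in the real workflow (the toy version in `main` just compares the last character of each SMILES).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

public class SeededMmpSketch {
    /**
     * MAP: for every pair within one scaffold series, emit
     * (transform key, [smilesA, smilesB]). A null transform means "no pair".
     */
    public static List<Map.Entry<String, List<String>>> map(
            List<String> series,
            BiFunction<String, String, String> transformOf) {
        List<Map.Entry<String, List<String>>> out = new ArrayList<>();
        for (int i = 0; i < series.size(); i++) {
            for (int j = i + 1; j < series.size(); j++) {
                String a = series.get(i);
                String b = series.get(j);
                String transform = transformOf.apply(a, b);
                if (transform != null) {
                    out.add(Map.entry(transform, List.of(a, b)));
                }
            }
        }
        return out;
    }

    /** Shuffle and REDUCE: collect all SMILES pairs that share a transform key. */
    public static Map<String, List<List<String>>> reduce(
            List<Map.Entry<String, List<String>>> emitted) {
        Map<String, List<List<String>>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : emitted) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Toy stand-in for the MCS step: NOT a real SMIRKS generator.
        BiFunction<String, String, String> toy =
                (a, b) -> a.substring(a.length() - 1) + ">>" + b.substring(b.length() - 1);
        List<String> series = List.of("c1ccccc1F", "c1ccccc1O", "c1ccccc1N");
        System.out.println(reduce(map(series, toy)));
    }
}
```

Because each scaffold series is processed independently in `map`, series can be distributed across map tasks, and the reducer only sees pairs already grouped by transform.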

Does seeding help?

[Figure: log number of pairwise comparisons vs. log number of molecules, by method: all, seeded.7, seeded.21, seeded.100]

• Doesn't bypass the O(N^2) barrier, but does reduce the constant

• Depends on how many scaffolds there are and the number of members for each scaffold

• Certainly useful when there are few members per scaffold

• Highly populated scaffolds can throw things off

Data

• Exhaustively fragmented ChEMBL 13
• Identified scaffolds with

• Ended up with 231,875 scaffolds
  – Covers 235,693 unique molecules
  – Average of 7 members per scaffold
  – 95% of scaffolds had < 21 members
  – 99.5% had < 74 members

• The remaining 0.5% are a bit problematic

[Figure annotation: Nmembers, Nscaffold, 1.8]

[Figure: log comparisons by method (All, Seeded)]

Timing experiments

[Figure: time (s) by job number, jobs 1 to 5]

• Selected 50 scaffolds with 10 or fewer members
• Configured so as to have ~5 maps
• Effective running time for the entire job is 3.8 min on Hadoop
  – Only needed 5 of 8 map slots on our "cluster"

• Takes ~6 min without Hadoop

Timing experiments

• Selected 1000 scaffolds with 20 or fewer members
  – Ran with 10 scaffolds / map

• Hadoop run time was ~2 hr
  – Most maps were fast (< 20 sec)

• Serial evaluation would be > 7 hr

[Figure: number of jobs vs. log time (s)]

A M-R workflow

• We're currently focused on just the MMP step as an MR example

• Could also include the fragmentation step as part of the workflow
  – But a pre-calculated set of scaffolds is more sensible

• Store transformations and members in HBase
• Link with activity data and apply structure & activity filters on candidate pairs

What Hadoop is not for

• Doesn't replace an actual database
• It's not uniformly fast or efficient
• Not good for ad hoc or real-time analysis
• Generally not effective unless dealing with massive datasets
• Not all algorithms are amenable to the map-reduce method

Conclusions

• Cheminformatics applications can be rehosted or rewritten to take advantage of cloud resources
  – Remotely hosted
  – Embarrassingly parallel / chunked
  – Map/reduce

• Ability to process larger structure collections lets us explore more chemical space

• "Big data" isn't really that big in chemistry

Conclusions

• Q: But are cheminformatics problems really big enough to justify all of this?

• A: Yes – virtual libraries, integrating chemical structure with other types and scales of data

• Q: Are there algorithms in cheminformatics that can employ map-reduce at the algorithmic level?

• A: Yes – especially when we consider problems with a combinatorial flavor

https://github.com/rajarshi/chem.hadoop
