The Open Science Data Cloud: Empowering the Long Tail of Science

The Open Science Data Cloud: Empowering the Long Tail of Science. Robert L. Grossman, University of Chicago and Open Cloud Consortium. October 12, 2012. A 501(c)(3) not-for-profit operating clouds for science.

Upload: robert-grossman

Post on 19-May-2015


DESCRIPTION

This is a talk I gave at the GLIF Workshop in Chicago on October 11, 2012.

TRANSCRIPT

Page 1: The Open Science Data Cloud: Empowering the Long Tail of Science

The Open Science Data Cloud: Empowering the Long Tail of Science

Robert L. Grossman
University of Chicago and Open Cloud Consortium

October 12, 2012

A 501(c)(3) not-for-profit operating clouds for science.

Page 2: The Open Science Data Cloud: Empowering the Long Tail of Science

Question 1. What is the cyberinfrastructure required to manage, analyze, archive and share big data? Call this analytic infrastructure.

Page 3: The Open Science Data Cloud: Empowering the Long Tail of Science

Question 2. What is the analogy of the GLIF* for analytic infrastructure?

*GLIF (www.glif.is), the Global Lambda Integrated Facility, is an international virtual organization that promotes the paradigm of lambda networking. GLIF provides lambdas internationally as an integrated facility to support data-intensive scientific research, and supports middleware development for lambda networking.

Page 4: The Open Science Data Cloud: Empowering the Long Tail of Science

[Figure: number of projects (10’s, 100’s, 1000’s) plotted against data size (small, medium to large, very large).]

Project types shown: individual scientists & small projects; community based science via Science as a Service; very large projects.

Infrastructure types shown: public infrastructure; shared community infrastructure; dedicated infrastructure.

Page 5: The Open Science Data Cloud: Empowering the Long Tail of Science

The long tail of data science

A few large data science projects.

Many smaller data science projects.

Page 6: The Open Science Data Cloud: Empowering the Long Tail of Science

Part 1. What Instrument Do We Use to Make Big Data Discoveries?

How do we build a “datascope”?

Page 7: The Open Science Data Cloud: Empowering the Long Tail of Science

What is big data?

TB? PB? EB? ZB?

Page 8: The Open Science Data Cloud: Empowering the Long Tail of Science

Another way:

Think of data as big if you measure it in MW, as in: Facebook’s Prineville data center is 30 MW.

opencompute.org

Page 9: The Open Science Data Cloud: Empowering the Long Tail of Science

An algorithm and computing infrastructure is “big-data scalable” if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
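
The definition above can be illustrated with a toy model. This is only a sketch: the per-rack capacity and scan rate are assumed values chosen for illustration, not figures from the talk, and `scan_time_hours` is a hypothetical helper.

```python
# Toy model of "big-data scalable": every rack adds both data and the
# processors needed to scan that data. All numbers are illustrative assumptions.

DATA_PER_RACK_TB = 100.0  # assumed amount of data held by one rack


def scan_time_hours(total_data_tb: float, racks: int,
                    rack_throughput_tb_per_hour: float = 10.0) -> float:
    """Time for `racks` racks, working in parallel, to scan `total_data_tb`."""
    return total_data_tb / (racks * rack_throughput_tb_per_hour)


for racks in (1, 2, 4, 8):
    total_tb = racks * DATA_PER_RACK_TB
    hours = scan_time_hours(total_tb, racks)
    print(f"{racks} rack(s), {total_tb:.0f} TB: {hours:.1f} h")
```

In this model the elapsed time stays constant as racks are added, because the data and the processing capacity grow together. A system that is not big-data scalable would show the time increasing with the number of racks.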

Page 10: The Open Science Data Cloud: Empowering the Long Tail of Science

Commercial Cloud Service Provider (CSP)

15 MW Data Center

100,000 servers

1 PB DRAM

100’s of PB of disk

Automatic provisioning and infrastructure management

Monitoring, network security and forensics

Accounting and billing

Customer Facing Portal

Data center network

~1 Tbps egress bandwidth

25 operators for 15 MW Commercial Cloud

Page 11: The Open Science Data Cloud: Empowering the Long Tail of Science

My vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure.

What would a global integrated facility for datascopes look like?

Page 12: The Open Science Data Cloud: Empowering the Long Tail of Science

Some Examples of Big Data Science

Discipline          Duration     Size             # Devices
HEP - LHC           10 years     15 PB/year*      One
Astronomy - LSST    10 years     12 PB/year**     One
Genomics - NGS      2-4 years    0.5 TB/genome    1000’s

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million Gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html

Page 13: The Open Science Data Cloud: Empowering the Long Tail of Science

One large instrument

Many smaller instruments

Page 14: The Open Science Data Cloud: Empowering the Long Tail of Science

Datascope – Science Cloud Service Provider (Sci CSP)

Data scientist

Sci CSP services

Page 15: The Open Science Data Cloud: Empowering the Long Tail of Science

What are some of the important differences between commercial CSPs and research-focused Sci CSPs?

Page 16: The Open Science Data Cloud: Empowering the Long Tail of Science

Science Clouds

POV
   Science CSP: Democratize access to data. Integrate data to make discoveries. Long term archive.
   Commercial CSP: As long as you pay the bill; as long as the business model holds.

Data & Storage
   Science CSP: Data intensive computing & HP storage
   Commercial CSP: Internet style scale out and object-based storage

Flows
   Science CSP: Large data flows in and out
   Commercial CSP: Lots of small web flows

Streams
   Science CSP: Streaming processing required
   Commercial CSP: NA

Accounting
   Science CSP: Essential
   Commercial CSP: Essential

Lock in
   Science CSP: Moving environment between CSPs essential
   Commercial CSP: Lock in is good

Page 17: The Open Science Data Cloud: Empowering the Long Tail of Science

Part 2. The Open Cloud Consortium’s Open Science Data Cloud

Page 18: The Open Science Data Cloud: Empowering the Long Tail of Science

www.opencloudconsortium.org

• U.S. based not-for-profit corporation.

• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.

• Manages cloud computing testbeds: Open Cloud Testbed.

 

Page 19: The Open Science Data Cloud: Empowering the Long Tail of Science

OCC Members & Partners

• Companies: Cisco, Yahoo!, Citrix, …

• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …

• Federal agencies and labs: NASA, LLNL, ORNL

• International Partners: AIST (Japan), U. Edinburgh, U. Amsterdam, …

• Partners: National Lambda Rail

Page 20: The Open Science Data Cloud: Empowering the Long Tail of Science

OCC 2011 Resources

Resource                 Type             Comments
OSDC Adler & Sullivan    Utility Cloud    1248 cores and 0.4 PB disk
OCC-Y                    Data Cloud       928 cores and 1.0 PB disk
OCC-Matsu                Mixed            1 rack
OSDC Root                Storage          0.8 PB

• OCC-Adler, Sullivan & Root will more than double in size in 2012.

Page 21: The Open Science Data Cloud: Empowering the Long Tail of Science

Bionimbus WG

bionimbus.opensciencedatacloud.org (biological data)

Page 22: The Open Science Data Cloud: Empowering the Long Tail of Science

One Million Genomes

• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.

• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).

• One million genomes is about 1000 PB or 1 EB.

• With compression, it may be about 100 PB.

• At $1000/genome, the sequencing would cost about $1B.
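
The arithmetic behind these estimates can be checked in a few lines. This is a sketch using only the figures on the slide and decimal units (1 PB = 1000 TB, 1 EB = 1000 PB); the compression ratio is not stated on the slide and is merely what the 100 PB figure implies.

```python
# Back-of-the-envelope check of the "One Million Genomes" estimates.
# Inputs are the figures from the slide; decimal storage units are assumed.

TB_PER_PB = 1000
PB_PER_EB = 1000

genomes = 1_000_000
tb_per_patient = 1           # tumor plus normal tissue, per the slide
cost_per_genome_usd = 1000
compressed_pb = 100          # slide's estimate after compression

total_pb = genomes * tb_per_patient / TB_PER_PB
total_eb = total_pb / PB_PER_EB
implied_compression = total_pb / compressed_pb   # implied, not stated on the slide
total_cost_usd = genomes * cost_per_genome_usd

print(f"Raw data: {total_pb:.0f} PB = {total_eb:.0f} EB")
print(f"Implied compression ratio: {implied_compression:.0f}x")
print(f"Sequencing cost: ${total_cost_usd / 1e9:.0f}B")
```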

Page 23: The Open Science Data Cloud: Empowering the Long Tail of Science

Big data driven discovery on 1,000,000 genomes and 1 EB of data.

Genomic-driven diagnosis

Improved understanding of genomic science

Genomic-driven drug development

Precision diagnosis and treatment. Preventive health care.

Page 24: The Open Science Data Cloud: Empowering the Long Tail of Science

Project Matsu WG: Clouds to Support Earth Science


matsu.opensciencedatacloud.org  

Page 25: The Open Science Data Cloud: Empowering the Long Tail of Science

UDR

• UDT is a high performance network transport protocol.

• UDR = rsync + UDT

• With UDR, it is easy for an average systems administrator to keep 100’s of TB of distributed data synchronized.

• We are using it to distribute c. 1 PB from the OSDC.
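
To give a sense of scale for moving about 1 PB, the following sketch estimates the ideal transfer time over 10 Gbps and 100 Gbps links, the link speeds mentioned elsewhere in the talk. It assumes decimal units and a link that is fully and continuously utilized, so real transfers would take longer; `transfer_days` and its `efficiency` parameter are hypothetical names introduced for illustration.

```python
# Ideal time to move a data set over a wide area link.
# Assumes decimal units and a fully utilized link (efficiency = 1.0).

PB = 10**15          # bytes
GBPS = 10**9         # bits per second
SECONDS_PER_DAY = 86_400


def transfer_days(data_bytes: float, link_bits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to send `data_bytes` at the given link speed."""
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


for gbps in (10, 100):
    days = transfer_days(1 * PB, gbps * GBPS)
    print(f"1 PB at {gbps} Gbps: {days:.1f} days")
```

Under these assumptions 1 PB takes a little over 9 days at 10 Gbps and a little under 1 day at 100 Gbps, which is why sustained high-throughput transport matters at this scale.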

Page 26: The Open Science Data Cloud: Empowering the Long Tail of Science

OpenFlow-Enabled Hadoop WG

• When running Hadoop, some map and reduce jobs take significantly longer than others.

• These are stragglers and can significantly slow down a MapReduce computation.

• Stragglers are common (dirty secret about Hadoop).

• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers.

• We have a testbed for a wide area version of this project.
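
Why a single straggler matters can be seen in a small model: a map or reduce phase finishes only when its slowest task finishes. The sketch below is not the working group's method; the task durations are made-up illustrative values, and it assumes the straggler is bandwidth-bound so that extra bandwidth shortens it proportionally.

```python
# Toy model of stragglers: a phase ends when its slowest task ends.
# Task durations (seconds) are hypothetical values for illustration only.


def phase_time(task_seconds):
    """Completion time of a phase whose tasks run in parallel."""
    return max(task_seconds)


def speed_up(task_seconds, index, factor):
    """Return a copy of the task list with one task made `factor` times faster.

    Assumes the task is bandwidth-bound, so its time scales with bandwidth.
    """
    tasks = list(task_seconds)
    tasks[index] = tasks[index] / factor
    return tasks


normal = [60, 62, 58, 61, 59, 60]
with_straggler = normal[:-1] + [180]          # last task runs 3x longer
helped = speed_up(with_straggler, len(with_straggler) - 1, 2.0)

print(phase_time(normal))          # 62
print(phase_time(with_straggler))  # 180
print(phase_time(helped))          # 90.0
```

In this model one slow task nearly triples the phase time, and doubling the bandwidth available to that one task recovers most of the loss without touching the other tasks.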

Page 27: The Open Science Data Cloud: Empowering the Long Tail of Science

OSDC PIRE Project

We select OSDC PIRE Fellows (US citizens or permanent residents):

• We give them tutorials and training on big data science.

• We provide them fellowships to work with OSDC international partners.

• We give them preferred access to the OSDC.

Nominate your favorite scientist as an OSDC PIRE Fellow. www.opensciencedatacloud.org (look for PIRE)

Page 28: The Open Science Data Cloud: Empowering the Long Tail of Science

Part 3. Cloud Services Operations Centers

Page 29: The Open Science Data Cloud: Empowering the Long Tail of Science

Open Science Data Cloud

3 PB (2011), 10 PB (2012); able to scale to 100 PB?

Science Cloud SW & Services: OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …

Automatic provisioning and infrastructure management

Monitoring, compliance, & security

Accounting and billing (OSDC)

Customer Facing Portal (Tukey)

Data center network

~100 Gbps bandwidth

5-12 operators to operate 1-5 MW Science Cloud

Page 30: The Open Science Data Cloud: Empowering the Long Tail of Science

Cloud Services Operations Centers (CSOC)

• The OSDC operates a Cloud Services Operations Center (or CSOC).

• It is a CSOC focused on supporting Science Clouds for researchers.

• Compare to a Network Operations Center or NOC.

• Both are an important part of cyberinfrastructure for big data science.

Page 31: The Open Science Data Cloud: Empowering the Long Tail of Science

OSDC Racks

• How quickly can we set up a rack?

• How efficiently can we operate a rack? (racks/admin)

2012 OSDC rack design (draft)

• 950 TB / rack

• 600 cores / rack
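
As a rough sizing exercise, the draft rack design can be combined with the 10 PB and 100 PB capacity targets mentioned in the talk. This is only a sketch: it assumes decimal units and treats the 950 TB per rack as fully usable, ignoring replication, RAID overhead and free-space headroom; `racks_needed` is a hypothetical helper.

```python
import math

# Draft 2012 OSDC rack design figures from the slide.
TB_PER_RACK = 950
CORES_PER_RACK = 600
TB_PER_PB = 1000  # decimal units assumed


def racks_needed(capacity_pb: float) -> int:
    """Racks required for a raw capacity target, with no overhead allowance."""
    return math.ceil(capacity_pb * TB_PER_PB / TB_PER_RACK)


for target_pb in (10, 100):
    racks = racks_needed(target_pb)
    print(f"{target_pb} PB: {racks} racks, {racks * CORES_PER_RACK} cores")
```

Under these assumptions 10 PB needs 11 racks and 100 PB needs 106 racks, which is why setup time per rack and racks per administrator are the operational questions that matter.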

Page 32: The Open Science Data Cloud: Empowering the Long Tail of Science

Essential Services for a Science CSP

• Support for data intensive computing

• Support for big data flows

• Account management, authentication and authorization services

• Health and status monitoring

• Billing and accounting

• Ability to rapidly provision infrastructure

• Security services, logging, event reporting

• Access to large amounts of public data

• High performance storage

• Simple data export and import services

Page 33: The Open Science Data Cloud: Empowering the Long Tail of Science

Please Join Us!

(Help keep us from making even more mistakes.)

Page 34: The Open Science Data Cloud: Empowering the Long Tail of Science

Acknowledgements

Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities.

Additional funding for the OSDC has been provided by the following sponsors:

• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.

• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.

• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.

• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.

• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.

The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at [email protected].

Page 35: The Open Science Data Cloud: Empowering the Long Tail of Science

For more information

• You can find some more information on my blog: rgrossman.com.

• Some of my technical papers are also available there.

• My email address is robert.grossman at uchicago dot edu.

 

Center for Research Informatics