The Open Science Data Cloud: Empowering the Long Tail of Science
DESCRIPTION
This is a talk I gave at the GLIF Workshop in Chicago on October 11, 2012.
TRANSCRIPT
The Open Science Data Cloud: Empowering the Long Tail of Science
Robert L. Grossman
University of Chicago and Open Cloud Consortium
October 12, 2012
A 501(c)(3) not-for-profit operating clouds for science.
Question 1. What is the cyberinfrastructure required to manage, analyze, archive and share big data? Call this analytic infrastructure.
Question 2. What is the analog of the GLIF* for analytic infrastructure?
*GLIF (www.glif.is), the Global Lambda Integrated Facility, is an international virtual organization that promotes the paradigm of lambda networking. GLIF provides lambdas internationally as an integrated facility to support data-intensive scientific research, and supports middleware development for lambda networking.
[Figure: The long tail of data science]

Data Size          Number    Infrastructure                    Projects
Small              1000's    Public infrastructure             Individual scientists & small projects
Medium to Large    100's     Shared community infrastructure   Community-based science via Science as a Service
Very Large         10's      Dedicated infrastructure          Very large projects

A few large data science projects; many smaller data science projects make up the long tail.
Part 1. What Instrument Do We Use to Make Big Data Discoveries?
How do we build a “datascope?”
What is big data?
TB? PB? EB? ZB?
Think of data as big if you measure it in megawatts; for example, Facebook's Prineville data center is 30 MW.
Another way:
opencompute.org
An algorithm and computing infrastructure is "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
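One compact way to state this property (my notation, not the speaker's): write T(D, R) for the time the computation takes on data of size D using R racks. Big-data scalable then means

    T(k·D, k·R) ≈ T(D, R)   for k = 1, 2, 3, …

that is, growing the data and the racks in proportion keeps the running time roughly constant.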
Commercial Cloud Service Provider (CSP): a 15 MW Data Center
• 100,000 servers, 1 PB of DRAM, and 100's of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer Facing Portal
• Data center network with ~1 Tbps egress bandwidth
• About 25 operators to run a 15 MW commercial cloud
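Two ratios worth pulling out of these figures (my arithmetic, not the speaker's): 100,000 servers / 25 operators = 4,000 servers per operator, and 15 MW / 100,000 servers = 150 W per server, including cooling and other facility overhead.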
My vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure.
What would a global integrated facility for datascopes look like?
Some Examples of Big Data Science

Discipline         Duration    Size            # Devices
HEP - LHC          10 years    15 PB/year*     One
Astronomy - LSST   10 years    12 PB/year**    One
Genomics - NGS     2-4 years   0.5 TB/genome   1000's

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
[Diagram: a datascope is a Science Cloud Service Provider (Sci CSP). It serves one large instrument or many smaller instruments, and exposes Sci CSP services to the data scientist.]
What are some of the important differences between commercial and research-focused Sci CSPs?
Science Clouds

Science CSP vs. Commercial CSP:
• POV. Science CSP: democratize access to data; integrate data to make discoveries; serve as a long term archive. Commercial CSP: data lasts as long as you pay the bill, and as long as the business model holds.
• Data & storage. Science CSP: data intensive computing & high performance storage. Commercial CSP: Internet style scale out and object-based storage.
• Flows. Science CSP: large data flows in and out. Commercial CSP: lots of small web flows.
• Streams. Science CSP: streaming processing required. Commercial CSP: NA.
• Accounting. Essential for both.
• Lock in. Science CSP: moving an environment between CSPs is essential. Commercial CSP: lock in is good.
Part 2. The Open Cloud Consortium's Open Science Data Cloud
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
OCC Members & Partners
• Companies: Cisco, Yahoo!, Citrix, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA, LLNL, ORNL
• International Partners: AIST (Japan), U. Edinburgh, U. Amsterdam, …
• Partners: National Lambda Rail
OCC 2011 Resources

Resource                Type            Comments
OSDC Adler & Sullivan   Utility Cloud   1248 cores and 0.4 PB disk
OCC-Y                   Data Cloud      928 cores and 1.0 PB disk
OCC-Matsu               Mixed           1 rack
OSDC Root               Storage         0.8 PB

• OCC-Adler, Sullivan & Root will more than double in size in 2012.
Bionimbus WG
bionimbus.opensciencedatacloud.org (biological data)
One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB, or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B. (The arithmetic is checked in the sketch below.)
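A back-of-the-envelope check of those numbers, as a minimal Python sketch (the 10:1 compression ratio is my assumption, chosen to match the ~100 PB figure above):

    # Back-of-the-envelope arithmetic for one million genomes.
    genomes = 10**6
    tb_per_genome = 1.0                  # tumor + normal tissue, per the talk

    total_tb = genomes * tb_per_genome
    total_pb = total_tb / 1000.0         # 1,000,000 TB = 1000 PB
    total_eb = total_pb / 1000.0         # 1000 PB = 1 EB

    compressed_pb = total_pb / 10.0      # assumed ~10:1 compression -> ~100 PB
    cost_usd = genomes * 1000            # $1000 per genome -> $1 billion

    print(total_pb, total_eb, compressed_pb, cost_usd)
    # 1000.0 1.0 100.0 1000000000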
Big data driven discovery on 1,000,000 genomes and 1 EB of data: genomic-driven diagnosis, improved understanding of genomic science, and genomic-driven drug development, leading to precision diagnosis and treatment and to preventive health care.
Project Matsu WG: Clouds to Support Earth Science
matsu.opensciencedatacloud.org
UDR
• UDT is a high performance network transport protocol.
• UDR = rsync + UDT.
• UDR makes it easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
• We are using it to distribute roughly 1 PB from the OSDC.
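UDR is invoked the way rsync is. A minimal sketch of driving it from Python, assuming the udr binary is on the PATH and accepts the project's rsync-style invocation (the mirror host and paths below are hypothetical):

    import subprocess

    def sync_dataset(source, destination):
        # udr wraps rsync, swapping rsync's TCP transport for UDT,
        # which performs much better on high-latency wide-area links.
        subprocess.check_call(["udr", "rsync", "-av", "--stats", source, destination])

    # Hypothetical mirror host and paths, for illustration only.
    sync_dataset("/data/osdc_public/", "mirror.example.org:/data/osdc_public/")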
OpenFlow-Enabled Hadoop WG
• When running Hadoop, some map and reduce tasks take significantly longer than others.
• These are stragglers, and they can significantly slow down a MapReduce computation.
• Stragglers are common (a dirty secret about Hadoop).
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers (a sketch of the idea follows below).
• We have a testbed for a wide area version of this project.
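The talk does not spell out the working group's mechanism; as an illustrative sketch of the idea, a monitor could flag tasks whose progress lags well behind the median, so that an OpenFlow controller can give their hosts extra bandwidth (the function, names, and threshold below are my own, not the OCC design):

    def find_stragglers(task_progress, threshold=0.7):
        # task_progress maps a task id to its fraction complete (0.0 - 1.0).
        # A task is flagged as a straggler when its progress falls below
        # `threshold` times the median progress of all running tasks.
        values = sorted(task_progress.values())
        median = values[len(values) // 2]
        return [task for task, p in task_progress.items() if p < threshold * median]

    # An OpenFlow controller would then install rules giving the stragglers'
    # hosts additional bandwidth (e.g. a higher-priority queue).
    print(find_stragglers({"reduce_01": 0.95, "reduce_02": 0.90, "reduce_03": 0.40}))
    # ['reduce_03']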
OSDC PIRE Project
We select OSDC PIRE Fellows (US citizens or permanent residents):
• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow: www.opensciencedatacloud.org (look for PIRE).
Part 3. Cloud Services Operations Centers

Open Science Data Cloud: 3 PB in 2011, 10 PB in 2012. Able to scale to 100 PB?
• Automatic provisioning and infrastructure management
• Monitoring, compliance, & security
• Accounting and billing (OSDC)
• Customer Facing Portal (Tukey)
• Data center network with ~100 Gbps bandwidth
• 5-12 operators to operate a 1-5 MW Science Cloud
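Set beside the commercial CSP slide in Part 1, this is the same pattern at roughly one tenth the scale (my comparison, drawn from the two slides): ~100 Gbps of bandwidth instead of ~1 Tbps egress, and 5-12 operators per 1-5 MW instead of 25 per 15 MW.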
Science Cloud SW & Services
OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (CSOC).
• It is a CSOC focused on supporting Science Clouds for researchers.
• Compare to a Network Operations Center, or NOC.
• Both are an important part of the cyberinfrastructure for big data science.
OSDC Racks
• How quickly can we set up a rack?
• How efficiently can we operate a rack? (racks/admin)
• 2012 OSDC rack design (draft): 950 TB / rack, 600 cores / rack
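A rough feel for what the 100 PB question in Part 3 implies under this draft design, as a minimal Python sketch (my arithmetic, not a figure from the talk):

    # How many draft-design racks does each storage target require?
    TB_PER_RACK = 950.0
    for target_pb in (3, 10, 100):   # 2011 size, 2012 size, scaling target
        racks = target_pb * 1000.0 / TB_PER_RACK
        print("%3d PB -> %5.1f racks" % (target_pb, racks))
    #   3 PB ->   3.2 racks
    #  10 PB ->  10.5 racks
    # 100 PB -> 105.3 racks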
Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
Please Join Us!
(Help keep us from making even more mistakes.)
Acknowledgements
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at [email protected].
For more information
• You can find more information on my blog: rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu.
Center for Research Informatics