The Open Science Data Cloud: Empowering the Long Tail of Science
DESCRIPTION
This is a talk I gave at the GLIF Workshop in Chicago on October 11, 2012.
TRANSCRIPT
The Open Science Data Cloud: Empowering the Long Tail of Science
Robert L. Grossman
University of Chicago and Open Cloud Consortium
October 12, 2012
A 501(c)(3) not-for-profit operating clouds for science.
Question 1. What is the cyberinfrastructure required to manage, analyze, archive and share big data? Call this analytic infrastructure.
Question 2. What is the analog of the GLIF* for analytic infrastructure?
*GLIF (www.glif.is), the Global Lambda Integrated Facility, is an international virtual organization that promotes the paradigm of lambda networking. GLIF provides lambdas internationally as an integrated facility to support data-intensive scientific research, and supports middleware development for lambda networking.
[Figure: The long tail of data science]

Data Size          Number    Infrastructure                    Projects
Small              1000's    Public infrastructure             Individual scientists & small projects
Medium to Large    100's     Shared community infrastructure   Community-based science via Science as a Service
Very Large         10's      Dedicated infrastructure          Very large projects

A few large data science projects; many smaller data science projects make up the long tail.
Part 1. What Instrument Do We Use to Make Big Data Discoveries?
How do we build a “datascope?”
What is big data?
TB? PB? EB? ZB?
Think of data as big if you measure it in megawatts; for example, Facebook's Prineville data center is 30 MW.
Another way:
opencompute.org
An algorithm and computing infrastructure is "big-data scalable" if adding a rack (or container) of data (and corresponding processors) allows you to do the same computation in the same time but over more data.
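One compact way to state this property (my notation, not the speaker's): write T(D, R) for the time the computation takes on data of size D using R racks. Big-data scalable then means

    T(k·D, k·R) ≈ T(D, R)   for k = 1, 2, 3, …

that is, growing the data and the racks in proportion keeps the running time roughly constant.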
Commercial Cloud Service Provider (CSP): a 15 MW Data Center
• 100,000 servers, 1 PB of DRAM, and 100's of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing
• Customer Facing Portal
• Data center network with ~1 Tbps egress bandwidth
• About 25 operators to run a 15 MW commercial cloud
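Two ratios worth pulling out of these figures (my arithmetic, not the speaker's): 100,000 servers / 25 operators = 4,000 servers per operator, and 15 MW / 100,000 servers = 150 W per server, including cooling and other facility overhead.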
My vote for a datascope: a (boutique) data center scale facility with a big-data scalable analytic infrastructure.
What would a global integrated facility for datascopes look like?
Some Examples of Big Data Science

Discipline         Duration    Size            # Devices
HEP - LHC          10 years    15 PB/year*     One
Astronomy - LSST   10 years    12 PB/year**    One
Genomics - NGS     2-4 years   0.5 TB/genome   1000's

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
[Diagram: a datascope is a Science Cloud Service Provider (Sci CSP). It serves one large instrument or many smaller instruments, and exposes Sci CSP services to the data scientist.]
What are some of the important differences between commercial and research-focused Sci CSPs?
Science Clouds

Science CSP vs. Commercial CSP:
• POV. Science CSP: democratize access to data; integrate data to make discoveries; serve as a long term archive. Commercial CSP: data lasts as long as you pay the bill, and as long as the business model holds.
• Data & storage. Science CSP: data intensive computing & high performance storage. Commercial CSP: Internet style scale out and object-based storage.
• Flows. Science CSP: large data flows in and out. Commercial CSP: lots of small web flows.
• Streams. Science CSP: streaming processing required. Commercial CSP: NA.
• Accounting. Essential for both.
• Lock in. Science CSP: moving an environment between CSPs is essential. Commercial CSP: lock in is good.
Part 2. The Open Cloud Consortium's Open Science Data Cloud
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: the Open Science Data Cloud.
• Manages cloud computing testbeds: the Open Cloud Testbed.
OCC Members & Partners
• Companies: Cisco, Yahoo!, Citrix, …
• Universities: University of Chicago, Northwestern Univ., Johns Hopkins, Calit2, ORNL, University of Illinois at Chicago, …
• Federal agencies and labs: NASA, LLNL, ORNL
• International Partners: AIST (Japan), U. Edinburgh, U. Amsterdam, …
• Partners: National Lambda Rail
OCC 2011 Resources

Resource                Type            Comments
OSDC Adler & Sullivan   Utility Cloud   1248 cores and 0.4 PB disk
OCC-Y                   Data Cloud      928 cores and 1.0 PB disk
OCC-Matsu               Mixed           1 rack
OSDC Root               Storage         0.8 PB

• OCC-Adler, Sullivan & Root will more than double in size in 2012.
Bionimbus WG
bionimbus.opensciencedatacloud.org (biological data)
One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB, or 1 EB.
• With compression, it may be about 100 PB.
• At $1000/genome, the sequencing would cost about $1B. (The arithmetic is checked in the sketch below.)
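A back-of-the-envelope check of those numbers, as a minimal Python sketch (the 10:1 compression ratio is my assumption, chosen to match the ~100 PB figure above):

    # Back-of-the-envelope arithmetic for one million genomes.
    genomes = 10**6
    tb_per_genome = 1.0                  # tumor + normal tissue, per the talk

    total_tb = genomes * tb_per_genome
    total_pb = total_tb / 1000.0         # 1,000,000 TB = 1000 PB
    total_eb = total_pb / 1000.0         # 1000 PB = 1 EB

    compressed_pb = total_pb / 10.0      # assumed ~10:1 compression -> ~100 PB
    cost_usd = genomes * 1000            # $1000 per genome -> $1 billion

    print(total_pb, total_eb, compressed_pb, cost_usd)
    # 1000.0 1.0 100.0 1000000000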
Big data driven discovery on 1,000,000 genomes and 1 EB of data: genomic-driven diagnosis, improved understanding of genomic science, and genomic-driven drug development, leading to precision diagnosis and treatment and to preventive health care.
Project Matsu WG: Clouds to Support Earth Science
matsu.opensciencedatacloud.org
UDR
• UDT is a high performance network transport protocol.
• UDR = rsync + UDT.
• UDR makes it easy for an average systems administrator to keep 100's of TB of distributed data synchronized.
• We are using it to distribute roughly 1 PB from the OSDC.
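UDR is invoked the way rsync is. A minimal sketch of driving it from Python, assuming the udr binary is on the PATH and accepts the project's rsync-style invocation (the mirror host and paths below are hypothetical):

    import subprocess

    def sync_dataset(source, destination):
        # udr wraps rsync, swapping rsync's TCP transport for UDT,
        # which performs much better on high-latency wide-area links.
        subprocess.check_call(["udr", "rsync", "-av", "--stats", source, destination])

    # Hypothetical mirror host and paths, for illustration only.
    sync_dataset("/data/osdc_public/", "mirror.example.org:/data/osdc_public/")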
OpenFlow-Enabled Hadoop WG
• When running Hadoop, some map and reduce tasks take significantly longer than others.
• These are stragglers, and they can significantly slow down a MapReduce computation.
• Stragglers are common (a dirty secret about Hadoop).
• Infoblox and UChicago are leading an OCC Working Group on OpenFlow-enabled Hadoop that will provide additional bandwidth to stragglers (a sketch of the idea follows below).
• We have a testbed for a wide area version of this project.
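The talk does not spell out the working group's mechanism; as an illustrative sketch of the idea, a monitor could flag tasks whose progress lags well behind the median, so that an OpenFlow controller can give their hosts extra bandwidth (the function, names, and threshold below are my own, not the OCC design):

    def find_stragglers(task_progress, threshold=0.7):
        # task_progress maps a task id to its fraction complete (0.0 - 1.0).
        # A task is flagged as a straggler when its progress falls below
        # `threshold` times the median progress of all running tasks.
        values = sorted(task_progress.values())
        median = values[len(values) // 2]
        return [task for task, p in task_progress.items() if p < threshold * median]

    # An OpenFlow controller would then install rules giving the stragglers'
    # hosts additional bandwidth (e.g. a higher-priority queue).
    print(find_stragglers({"reduce_01": 0.95, "reduce_02": 0.90, "reduce_03": 0.40}))
    # ['reduce_03']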
OSDC PIRE Project
We select OSDC PIRE Fellows (US citizens or permanent residents):
• We give them tutorials and training on big data science.
• We provide them fellowships to work with OSDC international partners.
• We give them preferred access to the OSDC.
Nominate your favorite scientist as an OSDC PIRE Fellow: www.opensciencedatacloud.org (look for PIRE).
Part 3. Cloud Services Operations Centers

Open Science Data Cloud: 3 PB in 2011, 10 PB in 2012. Able to scale to 100 PB?
• Automatic provisioning and infrastructure management
• Monitoring, compliance, & security
• Accounting and billing (OSDC)
• Customer Facing Portal (Tukey)
• Data center network with ~100 Gbps bandwidth
• 5-12 operators to operate a 1-5 MW Science Cloud
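Set beside the commercial CSP slide in Part 1, this is the same pattern at roughly one tenth the scale (my comparison, drawn from the two slides): ~100 Gbps of bandwidth instead of ~1 Tbps egress, and 5-12 operators per 1-5 MW instead of 25 per 15 MW.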
Science Cloud SW & Services
OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (CSOC).
• It is a CSOC focused on supporting Science Clouds for researchers.
• Compare to a Network Operations Center, or NOC.
• Both are an important part of the cyberinfrastructure for big data science.
OSDC Racks
• How quickly can we set up a rack?
• How efficiently can we operate a rack? (racks/admin)
• 2012 OSDC rack design (draft): 950 TB / rack, 600 cores / rack
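A rough feel for what the 100 PB question in Part 3 implies under this draft design, as a minimal Python sketch (my arithmetic, not a figure from the talk):

    # How many draft-design racks does each storage target require?
    TB_PER_RACK = 950.0
    for target_pb in (3, 10, 100):   # 2011 size, 2012 size, scaling target
        racks = target_pb * 1000.0 / TB_PER_RACK
        print("%3d PB -> %5.1f racks" % (target_pb, racks))
    #   3 PB ->   3.2 racks
    #  10 PB ->  10.5 racks
    # 100 PB -> 105.3 racks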
Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
Please Join Us!
(Help keep us from making even more mistakes.)
Acknowledgements
Major funding and support for the Open Science Data Cloud (OSDC) is provided by the Gordon and Betty Moore Foundation. This funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at [email protected].
For more information
• You can find more information on my blog: rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu.
Center for Research Informatics