Dr. Marcus D. Hanwell
@mhanwell www.kitware.com
27 March, 2014 South Bay Meetup
Big Data Visualization Frameworks and Applications at Kitware
Kitware, Inc. • Founded in 1998 by five former GE Research employees
• 118 current employees; 39 with PhDs
• Privately held, profitable from creation, no debt
• Rapidly Growing: >30% in 2011, 7M web-visitors/quarter
• Offices – Clifton Park, NY
– Carrboro, NC
– Santa Fe, NM
– Lyon, France
• 2011 Small Business Administration’s Tibbetts Award
• HPCWire Readers and Editor’s Choice
• Inc’s 5000 List since 2008
Kitware’s customers & collaborators
Over 75 academic institutions including:
• Harvard • MIT • University of California, Berkeley • Stanford University • California Institute of Technology • Imperial College London • Johns Hopkins University • Cornell University • Columbia University • Robarts Research Institute • University of Pennsylvania • Rensselaer Polytechnic Institute • University of Utah • University of North Carolina
Over 50 government agencies and labs including:
• National Institutes of Health (NIH) • National Science Foundation (NSF) • National Library of Medicine (NLM) • Department of Defense (DOD) • Department of Energy (DOE) • Defense Advanced Research Projects Agency (DARPA) • Army Research Lab (ARL) • Air Force Research Lab (AFRL) • Sandia (SNL) • Los Alamos National Labs (LANL) • Argonne (ANL) • Oak Ridge (ORNL) • Lawrence Livermore (LLNL)
Over 100 commercial companies in fields including:
• Automotive • Aircraft • Defense • Energy technology • Environmental sciences • Finance • Industrial inspection • Oil & gas • Pharmaceuticals • Publishing • 3D Mapping • Medical devices • Security • Simulation
Business Model: Open Source
• Open-source Software – Normally BSD-licensed – Collaboration platforms
• Collaborative Research and Development • Technology Integration • Services and Support • Consulting • Training and webinars
What is “Big Data”?
• We deal with two primary types – Small number of very large data elements
• Computational fluid dynamics simulations • Cosmological simulations covering billions of years
– Large number of (usually smaller) elements • Social media data, financial data, geospatial data • Over 3M compounds, 40M quantum calculations
• Different types of data differ in structure • Very different strategies are needed!
Many Small Versus Few Big
• Many small “records” – Major challenge lies in indexing, searching – Once found we can generally send to browser – Aggregation and/or summarization important
• Few big “records” – Major challenge lies in data reduction – Must work hard to do all work near the data – Can still deliver reduced data to web clients
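The two strategies can be contrasted in a minimal Python sketch. The record fields, the function names, and the stride-based reduction are assumptions made for illustration; the talk does not prescribe a specific method.

```python
from collections import Counter

def summarize_small_records(records, key):
    """Many small records: aggregate so only a summary reaches the browser."""
    return Counter(r[key] for r in records)

def reduce_big_record(samples, max_points):
    """Few big records: reduce near the data by keeping every k-th sample."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    stride = max(1, -(-len(samples) // max_points))  # ceiling division
    return samples[::stride]

# Hypothetical data: three tiny records and one large array.
records = [{"day": "Mon"}, {"day": "Tue"}, {"day": "Mon"}]
print(summarize_small_records(records, "day"))
print(len(reduce_big_record(list(range(1_000_000)), 1000)))
```

Striding is the crudest possible reduction; in practice the reduction step is usually a domain-specific filter (contouring, slicing, statistics) that runs where the data lives.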
Considerations for Data at Scale
• Key areas to be addressed: – Storage – Metadata extraction – Index – Search – Visualization – Interaction – Further calculations, simulations, etc.
Data Storage at Scale • How much data do you have? • Must all data be stored in the same place? • Existing metadata extraction techniques? • Uniform data layout/schema? • Existing index/search techniques?
– Algorithmic challenges – Open implementations that scale – Interaction with the database
What Does a Result Look Like? • Once you are done searching:
– What does a typical result look like? – How big is the resulting data? – How should the data be presented? – Is all data in the database referenced?
• Is a simple ordered list useful? • What about multidimensional result sets?
Challenges with Big Data • Storage for petabytes of data is tough
– Moving it is even harder – Extracting metadata is a challenge – Backing up and restoring isn’t any easier – Even individual results can be very large
• Mostly done in central facilities – Specialized file systems – Power, backup, redundancy, staff
The Visualization Toolkit (VTK)
• Collection of C++ libraries – Leveraged by many applications – Divided into logical areas, e.g.
• Filtering – data processing in visualization pipeline • InfoVis – informatics visualization • Widgets – 3D interaction widgets • VolumeRendering – 3D volume rendering
• Cross platform, using OpenGL • Wrapped in Python, Tcl and Java
http://www.vtk.org/
VTK Architecture
• Hybrid approach – Compiled C++ core (faster algorithms) – Interpreted applications (rapid development) – Interpreted layer generated automatically
C++ core
Interpreter
The Visualization Pipeline
• A sequence of algorithms that operate on data objects to generate geometry
[Diagram: pipeline of Source, Data, Filter, Mapper, and Actor stages, ending in Render on screen]
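The pipeline idea can be illustrated with a toy, pure-Python sketch. The class names below are invented for this example and are not the VTK API; the sketch only shows the demand-driven shape, where asking the last stage for output pulls data through the upstream stages.

```python
class Source:
    """Produces the initial data object (here, a plain list of numbers)."""
    def __init__(self, values):
        self.values = values

    def output(self):
        return list(self.values)

class ThresholdFilter:
    """Filter: keeps only values at or above a lower bound."""
    def __init__(self, upstream, lower):
        self.upstream = upstream
        self.lower = lower

    def output(self):
        return [v for v in self.upstream.output() if v >= self.lower]

class PointMapper:
    """Mapper: turns data values into drawable geometry, here (x, y) points."""
    def __init__(self, upstream):
        self.upstream = upstream

    def output(self):
        return [(x, y) for x, y in enumerate(self.upstream.output())]

source = Source([0.2, 1.5, 0.9, 3.1])
mapper = PointMapper(ThresholdFilter(source, lower=1.0))
print(mapper.output())  # prints [(0, 1.5), (1, 3.1)]
```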
ParaView
• Parallel visualization application • Open source, BSD licensed • Turn-key application wrapper around VTK • Parallel data processing and rendering
http://www.paraview.org/
ParaView is for Extremely Large Data
1 billion cell asteroid detonation simulation
! billion cell weather simulation
source: Sandia National Lab
[Diagram: data-parallel pipelines (Reader, Contour, ...) linked by MPI, each processing 1/N GB; N-component data parallelism for 1 GByte. Client, data servers, and render servers; Depth Composite; Tile Display. Client: control, display and rendering of small data.]
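Depth compositing, named in the diagram, merges partial images rendered by different processes by keeping the nearest fragment at each pixel. A minimal sketch, assuming each partial image is a flat list of colors plus a matching list of depths (this layout is an assumption, not ParaView's internal format):

```python
def depth_composite(partials):
    """Merge (colors, depths) image pairs, keeping the nearest pixel of each.

    Smaller depth means closer to the camera.
    """
    first_colors, first_depths = partials[0]
    colors = list(first_colors)
    depths = list(first_depths)
    for part_colors, part_depths in partials[1:]:
        for i, (c, d) in enumerate(zip(part_colors, part_depths)):
            if d < depths[i]:
                colors[i] = c
                depths[i] = d
    return colors, depths

# Two hypothetical render processes, each producing a two-pixel image.
image_a = (["red", "red"], [1.0, 5.0])
image_b = (["blue", "blue"], [2.0, 3.0])
print(depth_composite([image_a, image_b]))
```

Because taking the minimum depth is associative, the merge can be done pairwise in a tree across processes rather than on a single node.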
Tangelo
• Python web framework built on CherryPy • Flexible HTML5 web server architecture • Developed with a clean separation
– Application in HTML, JavaScript, CSS – Service in pure Python (+ wrapped C/C++)
• Packages several other frameworks too – Bootstrap, D3, Vega, MongoDB
• Making web apps easier to develop/deploy
http://tangelo.kitware.com/
• Python for server side, native web clients • Easily add new services (single .py file)
– Use RESTful API – JSON delivery of data – Full power of Python
• Rapid prototyping
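A single-file service might look like the sketch below. It assumes the Tangelo convention that the server calls a function named `run` in the service file and serializes its return value to JSON; the parameter names and the returned fields are hypothetical, and the convention should be checked against the Tangelo version in use.

```python
import json

def run(name="world", repeat="1"):
    """Service body for a hypothetical foo.py.

    Query-string arguments arrive as strings, so convert them explicitly.
    """
    count = int(repeat)
    return {"service": "foo", "greetings": ["hello, %s" % name] * count}

if __name__ == "__main__":
    # Stand-alone check of what a client would receive as JSON.
    print(json.dumps(run(name="Kitware", repeat="2")))
```

Keeping the service a plain function that returns JSON-serializable data makes it easy to test without starting a web server.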
Browser Tangelo
web service “foo”
index.html index.js
styles.css foo.py
ParaViewWeb – Web Enabled
• Bring 3D visualization to a web page – Targeting HPC web portal – Simple usage with basic/rigid workflow – Framework to develop 3D web applications – Must work now (no WebGL) – Support collaboration with multiple clients sharing the same visualization
• The goal was NOT to – Redo another generic ParaView client
Tangelo Powering ParaViewWeb
• We need a web front end to – Start processes – Forward communications
Visualizing Flickr Metadata
• Uses Google Maps
• Flickr data in MongoDB
• Python service retrieves data using PyMongo
• D3 layer over maps
– Geolocation
– Day of the week
– Photo (mouse hover)
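The day-of-the-week view implies a server-side grouping step. A minimal sketch, assuming each photo document carries a Unix timestamp in a field named `taken` (a hypothetical field name) and that the documents have already been fetched from the database:

```python
from collections import Counter
from datetime import datetime, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def photos_per_weekday(photos):
    """Count photo documents by the weekday (UTC) on which they were taken."""
    counts = Counter()
    for photo in photos:
        taken = datetime.fromtimestamp(photo["taken"], tz=timezone.utc)
        counts[WEEKDAYS[taken.weekday()]] += 1
    return counts

# Hypothetical documents; the timestamp is built from the date of this talk.
talk_day = datetime(2014, 3, 27, 12, 0, tzinfo=timezone.utc).timestamp()
photos = [{"taken": talk_day}, {"taken": talk_day + 86400}]
print(photos_per_weekday(photos))
```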
Enron Email Network Visualization
• enron.py retrieves emails
– Computes graph structure
• D3 force layout for viz
• Controls to:
– Slice email by time
– Change email originator
– Set number of hops
• Tool targeted at investigating social network behavior
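The three controls map naturally onto a time filter followed by a breadth-first search from the originator. The sketch below is not the talk's enron.py; the tuple layout and the choice to treat emails as undirected links are assumptions.

```python
from collections import deque

def ego_network(emails, origin, max_hops, start, end):
    """Return {address: hop count} for everyone within max_hops of origin.

    emails is an iterable of (timestamp, sender, recipient) tuples; only
    messages with start <= timestamp < end are used.
    """
    adjacency = {}
    for timestamp, sender, recipient in emails:
        if start <= timestamp < end:
            adjacency.setdefault(sender, set()).add(recipient)
            adjacency.setdefault(recipient, set()).add(sender)
    hops = {origin: 0}
    queue = deque([origin])
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops

# Hypothetical messages: (timestamp, sender, recipient).
emails = [(1, "a", "b"), (2, "b", "c"), (3, "c", "d"), (99, "a", "z")]
print(ego_network(emails, "a", max_hops=2, start=0, end=10))
```

The resulting node set and hop counts are what a force-directed layout on the client would then draw.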
Bitcoin Analysis • Uses bitcoin blockchain
– Individual transactions
• Intensity histogram with transaction volume in date/amount ranges
• Detail plot with individual transactions
• Anomaly search – Theft detection
• Study large scale behavior over time
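An intensity histogram over date and amount ranges can be sketched as a two-dimensional binning step. The daily bins and power-of-ten amount ranges below are assumptions; the talk does not state the bin definitions used.

```python
import math
from collections import Counter

def volume_histogram(transactions, day_seconds=86400):
    """Sum transaction amounts into (day index, amount decade) bins.

    transactions is an iterable of (unix_timestamp, amount) pairs; the
    decade of an amount is floor(log10(amount)).
    """
    histogram = Counter()
    for timestamp, amount in transactions:
        if amount <= 0:
            continue  # skip entries that cannot be placed on a log scale
        day = int(timestamp // day_seconds)
        decade = math.floor(math.log10(amount))
        histogram[(day, decade)] += amount
    return histogram

# Hypothetical transactions: two on day 0, one on day 1.
transactions = [(10, 5.0), (20, 7.0), (86400 + 5, 2500.0), (30, 0.0)]
print(volume_histogram(transactions))
```

Each bin's total would drive the color intensity of one histogram cell; selecting a cell then narrows the detail plot to the individual transactions in that range.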
Informatics Software Stack
[Diagram labels: VTK; OpenView; ParaViewWeb; Visomics; HTML/JavaScript; D3/Vega; Web Apps; Desktop Apps; Hospital Costs; Flickr; CharityNet; Python; Tangelo; R; Matlab; NLTK; Hadoop; Mongo; Impala; SQL; CSV/JSON; Analysis Adapters; Data Adapters]
Digital Pathology
• MongoDB used for image tiles – Store once, use multiple times – Metadata, processing status, results – Browser-based application/interaction
https://slide-atlas.org/
Arbor is an NSF-funded project to enable evolutionary biological research by making it easy for biologists to create, test, and visualize algorithms on the Tree of Life. Below is the evolutionary tree for Heliconia (Lobster Claw) plants coupled to a character matrix of observational data such as color, feature measurements, and range.
Cosmology Data Management
[Diagram labels: Supercomputer; DISC; LSST; Simulation; CosmoTools Framework; Catalyst; Simulation Input deck; CosmoTools Configuration; ParaView Server; ParaView Client; ParaViewWeb; Web Browser; Surveys; Advanced User/Developer/Scientist; Data Intensive Scalable Comp.; Database; Scientist; Experimentalist; Database]
libCosmoTools.a
Voronoi Tessellation
FOF HaloFinder
Stream Counter
CosmoTools ParaView Plugins
Caustics
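Friends-of-friends (FOF) halo finding groups particles that are connected by chains of neighbors closer than a linking length. The sketch below is a brute-force illustration using union-find, not the CosmoTools implementation; production halo finders use spatial data structures and run in parallel.

```python
import math

def friends_of_friends(points, linking_length):
    """Group point indices into halos by transitive proximity.

    Two points are friends if their distance is <= linking_length; a halo
    is a connected component of the friendship graph. O(n^2) pair checks.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= linking_length:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Hypothetical particles: a chain of three plus one isolated particle.
particles = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
print(friends_of_friends(particles, linking_length=0.6))  # prints [[0, 1, 2], [3]]
```

Particles 0 and 2 are farther apart than the linking length but still share a halo because particle 1 links them, which is the defining property of the method.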
• ANL: Salman Habib, Katrin Heitmann, Tom Peterka, Adrian Pope, Hal Finkel
• LANL: Jim Ahrens, Jon Woodring, Pat Fasel
• Kitware: George Zagaris, Berk Geveci, Casey Goodlett, Zach Mullen
UV-CDAT for Climate Visualization
• Ultrascale Visualization and Climate Data Analysis Toolkit – Collaborative effort led by LLNL – Integrate DOE’s climate modeling/measures
• Integrates a large number of tools/libs – CDAT, VTK, R, ParaView, DV3D
• Current data sets at about 3.5 petabytes – Growing to 350 petabytes to ~3 exabytes
Applications Being Developed
• Three independent applications
• Communication handled with local sockets
• Avogadro 2: Structure editing, input generation, output viewing, and analysis
• MoleQueue: Running local and remote jobs in standalone programs, and management
• MongoChem: Storage of data, searching, entry, and annotation
• Supporting frameworks (AvogadroLibs & VTK)
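Communication between independent applications over a local socket needs a message format. The sketch below assumes JSON-RPC 2.0 messages, which the slide does not state; the method name `submitJob` and its parameters are hypothetical, and the socket transport itself is left out.

```python
import itertools
import json

_request_ids = itertools.count(1)

def make_request(method, params):
    """Serialize a JSON-RPC 2.0 request with a fresh integer id."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
        "params": params,
    })

def parse_response(text):
    """Return the result of a JSON-RPC 2.0 response, or raise on error."""
    message = json.loads(text)
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "unknown error"))
    return message["result"]

# Hypothetical exchange: the method and parameter names are illustrative only.
request = make_request("submitJob", {"queue": "local", "program": "example"})
print(request)
print(parse_response('{"jsonrpc": "2.0", "id": 1, "result": {"jobId": 17}}'))
```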
http://www.openchemistry.org/
Use Cases for Open Chemistry • Researchers interested in molecules
– Various sources of starting structure
• Perform studies using various codes – Some performed locally – Others using high-performance computing – Different calculations produce different data
• How do these results get stored, analyzed? – How can previous work be indexed, reused?
MongoChem Overview • A desktop cheminformatics tool
– Chemical data exploration and analysis – Interactive, editable, and searchable database
• Leverages several open-source projects – Qt, VTK, MongoDB, Avogadro 2, Open Babel
• Designed to look at many molecules • Spots patterns, outliers; runs many jobs • Scales to studies with ~3 million structures
Architecture Overview • Native, cross-platform C++ application built with Qt and Avogadro 2 • Stores chemical data in a NoSQL MongoDB database • Uses VTK for 2D and 3D dataset visualization
Moving MongoChem to the Web
• Increasingly important to share data
• MongoDB not suitable for web directly
– Developing RESTful APIs
– Building on VTKWeb and Tangelo
– Can do more processing close to the data
• Can we develop a platform for chemists?
– Could this address materials and other areas?
– Deposition of data, curation, client-server processing, web interface and APIs
VTKWeb, Tangelo and MongoChem
• Uses VTK’s web architecture • Performs interactive 3D rendering • Runs in any modern web browser • Same MongoDB server as MongoChem • Moves more to the client JavaScript code • Using a simple, Python-based server
– Easy to add new APIs – Easy to deploy/integrate into other solutions
Why MongoDB? • SQL vs NoSQL approaches • MongoDB is implemented in C++
– Scales well by adding extra shards (nodes) – Core constructs written in C++ – Access to JavaScript in map-reduce – Memory-mapped database files – GridFS for storing large files – Clients in many languages – C, C++, Python – Large, established open-source project
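The map-reduce pattern mentioned above can be shown in a pure-Python stand-in. In MongoDB the map and reduce functions are written in JavaScript and run on the server; this sketch only mirrors the shape of the computation, and the document fields are hypothetical.

```python
from collections import defaultdict

def map_reduce(documents, map_fn, reduce_fn):
    """Apply map_fn to each document, group emitted values by key, reduce."""
    emitted = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            emitted[key].append(value)
    return {key: reduce_fn(key, values) for key, values in emitted.items()}

def emit_formula(doc):
    """Map step: emit one (formula, 1) pair per molecule document."""
    yield doc["formula"], 1

def count(key, values):
    """Reduce step: total the emitted counts for one key."""
    return sum(values)

# Hypothetical molecule documents.
molecules = [{"formula": "H2O"}, {"formula": "CH4"}, {"formula": "H2O"}]
print(map_reduce(molecules, emit_formula, count))  # prints {'H2O': 2, 'CH4': 1}
```

On a sharded cluster the map step can run on each shard independently, which is what lets this style of aggregation stay close to the data.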
JSON, BSON and NoSQL • JSON: JavaScript Object Notation • BSON: Binary JSON
– Binary-encoded serialization of JSON-like documents
• MongoDB stores BSON documents – Collections are memory-mapped BSON – Clients work directly with BSON on-the-wire
• BSON written by client can be used by server • Very little overhead reading/writing documents
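To make the binary layout concrete, here is a minimal encoder sketch for a small subset of BSON (double, UTF-8 string, boolean, and 32-bit integer fields only), following the public BSON specification. It is for illustration; real clients should use a driver's BSON library.

```python
import struct

def _cstring(text):
    """Encode a key as a NUL-terminated UTF-8 cstring."""
    data = text.encode("utf-8")
    if b"\x00" in data:
        raise ValueError("BSON keys cannot contain NUL bytes")
    return data + b"\x00"

def encode_document(doc):
    """Encode a flat dict as BSON: int32 size, elements, trailing NUL."""
    body = b""
    for key, value in doc.items():
        name = _cstring(key)
        if isinstance(value, bool):  # check bool before int (bool is an int)
            body += b"\x08" + name + (b"\x01" if value else b"\x00")
        elif isinstance(value, float):
            body += b"\x01" + name + struct.pack("<d", value)
        elif isinstance(value, str):
            data = value.encode("utf-8") + b"\x00"
            body += b"\x02" + name + struct.pack("<i", len(data)) + data
        elif isinstance(value, int):
            body += b"\x10" + name + struct.pack("<i", value)  # int32 only
        else:
            raise TypeError("unsupported type: %r" % type(value))
    # Total size counts the 4-byte length prefix and the trailing NUL.
    return struct.pack("<i", len(body) + 5) + body + b"\x00"

print(encode_document({"hello": "world"}))
```

The length prefixes are what allow a reader to skip over fields and whole documents without parsing them, which contributes to the low read/write overhead noted above.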
Nature of Data • Many documents for molecules
– Individual results are usually MBs – Small molecules, electronic structure, MD, etc.
• Materials tend to be different – Fewer documents, larger results – Fewer existing identifiers/search techniques
• Institutions maintain big disks – Move to referencing data, client-server, etc.
Clean Energy Project: Introduction • Searching for organic photovoltaics
– IBM World Community Grid – High-throughput, in-silico study – Partnered with experimental groups
• Synthesize most promising candidates
• Many views of the data – Simple numbers for many properties – 2D graphs and 3D chemical structures – 3D structures with quantum calculation output
http://cleanenergy.molecularspace.org/
Clean Energy Project: Big Data • Overall size and scope of the data:
– 2.3 million unique molecules • 22 million conformers • 150 million DFT calculations • 400TB+ of raw output data • 80GB of metadata
– Growing at just under 1TB a day – ~2.8 million unique molecules
• ~27M conformers and 185M DFT calculations • 0.5PB of raw data in the latest result set
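A quick back-of-the-envelope check puts these totals in per-item terms. The sketch uses the first set of figures above and assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), which the slide does not specify.

```python
def per_item_ratios(molecules, conformers, calculations, raw_bytes, metadata_bytes):
    """Derive per-molecule and per-calculation averages from dataset totals."""
    return {
        "conformers_per_molecule": conformers / molecules,
        "calculations_per_molecule": calculations / molecules,
        "raw_megabytes_per_calculation": raw_bytes / calculations / 1e6,
        "metadata_bytes_per_calculation": metadata_bytes / calculations,
    }

ratios = per_item_ratios(
    molecules=2.3e6,       # 2.3 million unique molecules
    conformers=22e6,       # 22 million conformers
    calculations=150e6,    # 150 million DFT calculations
    raw_bytes=400e12,      # 400 TB of raw output (decimal units assumed)
    metadata_bytes=80e9,   # 80 GB of metadata (decimal units assumed)
)
for name, value in ratios.items():
    print(name, round(value, 2))
```

Under these assumptions each molecule averages roughly 10 conformers and 65 calculations, each calculation produces about 2.7 MB of raw output, and the extracted metadata is only about 530 bytes per calculation, which is why the metadata is practical to index and search while the raw output is not.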
Clean Energy Project: Open Data • Part of the Materials Genome Initiative • Data released under CC-BY-SA license • Amazing opportunity for Open Chemistry
– Very large dataset pushing current limits – Openly-licensed, allowing us to experiment – Opportunity to improve the state-of-the-art – Molecules fit our model
• Fewer than 1024 atoms • DFT calculations with metadata extraction
Building Community
• Community around projects
• Using Kitware software process
– Ensuring quality with continuous testing
– Code contributions on the web
– Public mailing lists, bug trackers, and code review
• Promoting projects and participation
– Publication
– Conferences
– Workshops
– Social media
Software Repository
Build, Test & Package
Community Review
Developers & Users
Conclusions • Shared frameworks needed to work with data • Domain specific approaches are essential
– One size fits all rarely works well – The right frameworks can be extended/customized
• Storing, sharing, publishing, and analyzing data • Data scales increasing, client-server can help • Semantic data is an important aspect too • Questions?