TRANSCRIPT
It’s a Streaming World: Doing
Analytics on Data in Motion
MARK GREAVES
Technical Director, Analytics
National Security Directorate
Pacific Northwest National Laboratory
February 2016
PNNL-SA-116502
Context
The digital reflection of reality is sharpening thanks to:
the pervasive deployment
of sensors in our cities
the wide adoption of smart
phones (equipped with sensors)
the usage of (location-based)
social networks
the availability of datasets about the urban environment
[source E. Della Valle - http://streamreasoning.org/]
What are Data Streams Anyway?
Formally
Data streams are unbounded sequences of time-varying data elements
Less formally
An (almost) “continuous” flow of information
Key Assumptions
Recent information is more relevant because it describes the current state
of a dynamic system
Streams focus on extracting value from transient data consumed on the
fly by continuous queries
[source E. Della Valle - http://streamreasoning.org/]
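To make the definitions above concrete, here is a minimal sketch of a continuous query over a sliding time window; the window length, data shape, and running-average query are illustrative assumptions, not part of the original slides.

```python
from collections import deque

def continuous_avg(stream, window_seconds):
    """A registered continuous query: continuously answer 'average
    value over the last window_seconds' for an unbounded stream of
    (timestamp, value) pairs. Elements are forgotten as they slide
    out of the window, so memory stays bounded."""
    window = deque()   # only the transient, in-window elements
    total = 0.0
    for ts, value in stream:
        window.append((ts, value))
        total += value
        # Evict elements that have aged out of the window.
        while window and window[0][0] <= ts - window_seconds:
            _, old = window.popleft()
            total -= old
        yield ts, total / len(window)

# Usage: a toy stream of (second, reading) pairs.
readings = [(0, 10.0), (1, 12.0), (2, 14.0), (5, 20.0)]
answers = list(continuous_avg(iter(readings), window_seconds=3))
```

Note that the answer at each timestamp depends only on in-window data, which is why (as a later slide notes) summarization tasks that exceed the window size are hard for this model.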
Leveraging Streams Using Continuous
Queries over Stream Windows
“Official” streams from transportation, utilities, fire, police, parks,
cameras, events, housing, neighborhoods, emergency mgmt…
“Unofficial” streams from businesses, citizen reports, social media…
[graphic E. Della Valle - http://streamreasoning.org/: input streams from the urban system pass through a window into a registered continuous query, yielding streams of answers]
Pros and Cons of Continuous Query
Pros
Robust: Leverages mature database techniques for replication, security,
alerting, performance tuning, view creation, report generation, etc.
Deployable: Relatively cloud-friendly architecture
Predictable: Great for well-defined and/or slowly-changing problems
Cons
Classic data issues: Same old problems about noisy, incomplete,
unreliable, and heterogeneous data
Difficult to steer: Significant cost/time in formulating, testing, refining, and
updating the queries
Too low-level: More of a data reduction solution than an analytics solution
One-direction pipeline for a human consumer
No implicit support for human background knowledge
Complex summarization tasks can exceed the window size
Historical data often has to be handled separately
PNNL’s Analytics in Motion Initiative
AIM: A 5-year Lab-wide effort to advance the state of the art in
Interactive Streaming Analytics at Scale
Interactive
Humans in/on the Loop, actively steering and using their knowledge
New interaction techniques beyond the ticker tape
Address steering, cost of analytic algorithms, higher-level information, agility
Analytics
From the “what” to the “why”: sensemaking, decision support, causality
Statistical normalization/summarization/outlier techniques
AI methods that dynamically incorporate human background knowledge
~10 Coordinated R&D projects per year, with university and
commercial partners (and always looking for more)!
Experimental cloud-based testbed
Increasingly-difficult set of use cases to contextualize the R&D
Distinguished Advisory Board
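One of the statistical techniques named above (streaming normalization/summarization/outlier detection) can be sketched in a single pass with Welford's online mean/variance update; the z-score threshold of 3 and the toy data are illustrative assumptions.

```python
import math

class OnlineStats:
    """Welford's single-pass algorithm: maintains the mean and
    variance of a stream without storing the elements themselves."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def stddev(self):
        return math.sqrt(self.m2 / self.n) if self.n > 1 else 0.0

def streaming_outliers(stream, z_threshold=3.0):
    """Flag elements whose z-score against the running statistics
    exceeds the threshold, updating the stats as we go."""
    stats = OnlineStats()
    flagged = []
    for x in stream:
        if stats.n > 1 and stats.stddev() > 0:
            if abs(x - stats.mean) / stats.stddev() > z_threshold:
                flagged.append(x)
        stats.update(x)
    return flagged

# Usage: one wild value amid a steady stream.
values = [10, 11, 9, 10, 12, 100, 11, 10]
print(streaming_outliers(values))  # the 100 stands out
```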
How do we rebalance effort between humans and machines?
How can we automate the hypothesis generation and testing process?
How do we capture human insight in situ from streaming data sources?
Can we steer measurement systems automatically based on emerging knowledge?
AIM Overview
AIM is developing new techniques for interactive streaming analytics,
tracking a stream in real time and using human input to guide
computational models
Multiple classifier systems, with diverse model types (e.g., symbolic and PGMs)
Use high-level dynamic user feedback to steer the data production system, provide
model tunings/weightings/rankings, and fuse results
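A minimal sketch of the weighting/fusion idea above: combine several classifiers by weighted vote, and use feedback (here, known labels standing in for high-level user input) to multiplicatively down-weight models that err. The learning rate and the toy models are illustrative assumptions, not AIM's actual algorithms.

```python
def weighted_ensemble(models, stream, eta=0.5):
    """Multiplicative-weights fusion of several classifiers over a
    stream. Each stream element is (features, true_label); the label
    stands in for user feedback confirming or correcting an answer."""
    weights = [1.0] * len(models)
    predictions = []
    for x, label in stream:
        votes = {}
        for w, model in zip(weights, models):
            guess = model(x)
            votes[guess] = votes.get(guess, 0.0) + w
        predictions.append(max(votes, key=votes.get))
        # Feedback step: shrink the weight of every model that erred.
        weights = [w * (1.0 - eta) if model(x) != label else w
                   for w, model in zip(weights, models)]
    return predictions, weights

# Usage: two toy "models" labeling numbers as big or small.
always_big = lambda x: "big"
threshold = lambda x: "big" if x > 5 else "small"
stream = [(3, "small"), (7, "big"), (2, "small")]
preds, final_weights = weighted_ensemble([always_big, threshold], stream)
```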
Key features of AIM’s streaming model
Data is forgotten: Each model’s cache is small relative to the data volume
Single-pass: No access to the data stream beyond the sample
Cooperative user: Important problem knowledge isn’t in training data
Not the whole system: Lambda embedding
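The "data is forgotten" and "single-pass" assumptions above are exactly the setting for reservoir sampling, sketched below: keep a fixed-size uniform sample of an unbounded stream, seen once, with memory independent of stream length. The reservoir size is an illustrative parameter.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Single-pass uniform sample of size k from a stream of unknown
    length. Memory use is O(k); each element is seen exactly once and
    then forgotten unless it lands in the reservoir."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Element i survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Usage: sample 5 elements from a stream of a million integers.
sample = reservoir_sample(iter(range(1_000_000)), k=5,
                          rng=random.Random(42))
```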
AIM Programmatic Approach
Four AIM program focus areas (% of FY16 budget)
Streaming Data Characterization and Processing (20% of R&D)
Hypothesis Generation and Testing (30% of R&D)
Human-Machine Feedback (50% of R&D)
Infrastructure and Testing Environment (25% of operations)
AIM FY16 Project Layout
Streaming Data Characterization
SFE: Scalable Feature Extraction and Sampling
CA: Compressive Analysis
Hypothesis Generation and Test
SDC: Streaming Data Characterization
NOUS: Streaming Knowledge Graphs
TeMpSA: Temporal Modeling in Streaming
Analytics
SAFE: Stream Adapted Foraging for Evidence
Human-Machine Feedback
UCHD: User-Centered Hypothesis Definition
TECSSD: Towards Enabling Complex Sensemaking
from Streaming Data
Transpire: Transparent Model-Driven Discovery of
Streaming Patterns
CD: Cognitive Depletion
Infrastructure and Test
AIM Software Infrastructure
AIM Projects by Year and Technical Focus
SoI: Science of Interaction
OPA: Online Predictive Analytics
SHyRe: Scalable Hypothesis Reasoning
SFE: Scalable Feature Extraction and Sampling
CA: Compression Analysis
CD: Cognitive Depletion
PoP: Population-based Model Selection
NOUS
UCHD: User-Centered Hypothesis Definition
ASI: AIM Software Infrastructure
TeMpSA: Temporal Modeling in Streaming Analytics
SAFE: Stream Adaptive Foraging for Evidence
TECSSD: Toward Enabling Complex Sensemaking from Streaming Data
Transpire: Transparent Model-Driven Discovery
SDC: Streaming Data Characterization
[slide graphic: the projects above arranged across three technical focus areas: Symbolic Reasoning, Statistical Data Mining, and Human-Computer Interaction]
How Will AIM Measure Streaming Capability?
Insight is a tradeoff between accuracy, throughput, and utility
Accuracy: AIM systems will converge to correct interpretations under two
gold standards: compared to the known state of the world as reflected in
the data, and compared to reference static analytic algorithms running
over the total data
Hypothesis: F1 (precision/recall) measures will be greater than with
algorithms alone or humans alone
Utility: AIM systems will provide stream interpretations that usefully
support insight in users, based on their needs, tasks, roles, and interests
Hypotheses: Users will be able to usefully guide streaming
classifiers; correct human interpretations will occur earlier in the
stream with AIM
Throughput: AIM systems will ingest streams and yield judgments at
rates sufficient for the problem domain
Hypothesis: AIM will achieve insight at a rate that exceeds current
technology-aided human baseline
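The F1 measure named in the accuracy hypothesis is the standard harmonic mean of precision and recall; a minimal computation over hypothetical streaming judgments (the boolean sequences below are made-up examples):

```python
def f1_score(predicted, actual):
    """F1 = harmonic mean of precision and recall over two boolean
    label sequences (e.g., streaming judgments vs. the known state
    of the world as reflected in the data)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Usage with hypothetical judgments:
predicted = [True, True, False, True, False]
actual    = [True, False, False, True, True]
print(round(f1_score(predicted, actual), 3))  # → 0.667
```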
AIM Use Cases
NMR and Metabolomics
Goal: More rapid metabolite identification in a bioreactor
User: Operator provides important background knowledge
Stream comes from NMR machine as spectral data
Processing algorithms propose specific metabolites and track
concentration changes
Strategic Surprise (OODA Loop Modeling)
Goal: Detect Line of Business (LOB) change in export data
User: Domain expert in company
Stream a PIERS subset at high rates, producing hypotheses about LOB changes
Cloud Cyber
Goal: Streaming telemetry data from PNNL IRC and other sources for
detection of cyber exploits; LAS partnership
User: Real-time cyber defenders
Electron Microscopy
Goal: Detection of events and anomalies in real-time EM imagery
User: Microscope operator
(FY15 use cases: NMR/Metabolomics, Strategic Surprise; FY16 use cases: Cloud Cyber, Electron Microscopy)
AIM FY15 NMR Use Case
Use case goals
Build on FY14’s fast streaming compound ID from partial spectra
Prepare for a FY16 Microscopy Use Case
Evaluate imagery on-the-fly
Evaluate and communicate residual signal
to user with potential solutions
Scientist interaction and intervention within
the experiment
Participating FY15 AIM projects
CA: Compression Analysis
SFE: Scalable Feature Extraction
OPA: Online Predictive Analytics
SoI: Science of Interaction
ASI: AIM Software Infrastructure
AIM FY15 Strategic Surprise Use Case
Use case goals
Model an OODA loop with streaming data
Real (dirty) data
Employ user feedback and rationale
Provide “hello world” for projects
Straightforward to modify stream rate or data for different experimental scenarios
“Frankencompanies”
Participating FY15 AIM projects
SFE: Scalable Feature Extraction
NOUS
POP: Population-based Model Selection
OPA: Online Predictive Analytics
SHyRe: Scalable Hypothesis Reasoning
SoI: Science of Interaction
ASI: AIM Software Infrastructure
UCHD: User-Centered Hypothesis Definition
AIM FY16 Use Case: Cyber Defense
Use case parameters
Input data from cloud telemetry (Digital Signatures), other sensors and data sources
Sampling and processing algorithms specialized to cyber data
Hypotheses are indicators of attack or nonstandard system behavior
Strong cyber defender involvement to provide cyber knowledge and cyber defense
tradeoffs
We are intensively working with OGAs on streaming cyber security use cases for active mitigation
Both data stream processing and cyber defender-in-the-loop
[slide graphic: fast cyber data → detect and ID threats → possible attacks with evidence → support cyber defender tradeoffs → cyber defense actions]
AIM Infrastructure and Testing Environment
Key objectives
Provide individual AIM algorithms with common initiative data sets
Characterize integrated AIM algorithm performance and tradeoffs
Measure overall accuracy, speed, and throughput
Status
Landscape analysis settled on Kafka and LIFT/Avro
650K msgs/sec single-threaded streaming
Built on PNNL’s new Institutional Research Cloud (IRC)
Support possible USG transition and packages like Spark
Fault tolerance, topic partitioning, load balancing
Support of AIM use cases
Software assistance to individual AIM projects
Confluent coordination
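A minimal sketch of how a single-threaded msgs/sec figure like the one above can be measured; an in-memory list stands in for a real Kafka topic here, so the harness, the synthetic messages, and any resulting rate are purely illustrative.

```python
import time

def measure_throughput(messages, handler):
    """Drive messages through a handler on one thread and report the
    sustained rate in messages per second."""
    start = time.perf_counter()
    for msg in messages:
        handler(msg)
    elapsed = time.perf_counter() - start
    return len(messages) / elapsed if elapsed > 0 else float("inf")

# Usage: a no-op handler over a synthetic in-memory "topic".
topic = [f"event-{i}".encode() for i in range(100_000)]
processed = []
rate = measure_throughput(topic, processed.append)
```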
PNNL Living Laboratory
Policy and process for full lifecycle management of our own
internal data when used in research
Mark Greaves
Technical Director for Analytics
NATIONAL SECURITY DIRECTORATE
Phone: (206) 528-3300
Mobile: (206) 972-2201
www.pnnl.gov
aim.pnnl.gov