Big Data Rampage
Niko Vuokko, 13 May 2013, HIIT Seminar
The data
About that data of yours…
• Researchers generally live in a nice utopia where data just works *
* Yes, you do munge it for days, that’s nice
Reality check
What if you suddenly notice that there’s
• … corrupted JSON/XML/whatever
• … corrupted ids
• … transient ids
• … 5 different transient ids
• … text in number fields
• … new fields
• … disappeared fields
• … fields whose meaning just changed
• … but you have no idea of the new definition
• … all of these, regularly, without advance notice
• … and the bad data is coming at you at 1 GB per hour
• … and yours or someone else’s business depends on the data
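A minimal defensive-parsing sketch of the kind of armor this list implies. The field names (`amount`, `user_id`) are hypothetical, just to make the failure modes concrete:

```python
import json

def parse_event(raw):
    """Parse one raw event line, tolerating the failure modes above.
    Returns a dict on success, None on irrecoverable garbage."""
    try:
        event = json.loads(raw)            # corrupted JSON raises ValueError
    except ValueError:
        return None
    # Text in number fields: coerce defensively instead of crashing.
    try:
        event["amount"] = float(event.get("amount", 0))
    except (TypeError, ValueError):
        event["amount"] = None             # flag for downstream imputation
    # Disappeared fields: supply explicit defaults instead of KeyErrors.
    event.setdefault("user_id", None)
    return event

good = parse_event('{"user_id": "u1", "amount": "12.5"}')
bad = parse_event('{"user_id": "u1", "amount": ')   # truncated JSON -> None
```

At 1 GB per hour you cannot stop the pipeline for every bad record; the point is that every record that can fail, will.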
[Slide graphic: garbage in → you → great insights out]
The data
• Enriched from many operationally attainable sources
--> varying schemas and a complicated ID soup
• Developed by the frontline instead of an IT waterfall
--> faster process, but volatile data definitions
• Data scientists often require access to more data
--> further risk of lapses
• Big and streaming in
--> risks of discontinuity
The Big Data (please don’t shoot me for using the term)
What is big?
Human-generated
• 5K tweets / s
• 25K events / s from a mobile game (that’s 200 GB / day)
• 40K Google searches / s
Machine-generated
• 5M quotes / s in the US options market
• 120 MB / s of diagnostics from a single gas turbine
• 1 PB / s peaking from CERN LHC
What will be big?
• Human-generated data will get more detailed
• … but won’t grow much faster than the userbase
• … so it will eventually count as “small”
• Machine-generated data will keep growing with Moore’s law
• … and it’s already massive
How many of you consider this scale?
• Why not?
• We already understand CPU and memory intensive problems
• But the new world out there is data intensive
• How can research stay in touch with change and stay relevant?
The Curriculum: Retrofitting CS Studies
Software Architectures
• Single-thread performance and disk I/O are hitting a wall
• How do learning algorithms scale out of this corner?
• Stochastic methods
• Ensembles
• Online learning
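As a sketch of the online-learning point: single-pass stochastic gradient descent fits a model in constant memory, seeing each example once. This is a toy fit of y = 2x + 1 on synthetic data, not any particular library's API:

```python
import random

def sgd_linear(stream, lr=0.05):
    """Single-pass SGD for y ~ w*x + b.
    Memory stays constant no matter how long the stream is."""
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y     # prediction error on this one example
        w -= lr * err * x         # gradient step on the weight
        b -= lr * err             # gradient step on the bias
    return w, b

random.seed(0)
stream = ((x, 2.0 * x + 1.0) for x in (random.uniform(-1, 1) for _ in range(5000)))
w, b = sgd_linear(stream)         # recovers w ~ 2, b ~ 1
```

This is why stochastic and online methods scale where batch solvers hit the wall: one scan, no matrix in memory, no re-reads.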
Databases 1
• In memory: MongoDB, Exasol, Redis
• On disk (single/sharded): MySQL, PostgreSQL
• Data warehouse: Teradata, DB2, Oracle
• Distributed: HDFS, Cassandra, Riak
• Cloud: S3, Azure, GCE, OpenStack
Databases 2
• Good old OLTP
• Analytic
• Key-value stores
• Document stores
• HDFS
• What is the best choice for this job?
Data Structures and Algorithms
• Transforming data is expensive --> play it safe with data structures
• Normalization dilemma
• Algorithms must tolerate the volatile nature of data
• Data drift, errors, missing values, outliers
• Models need to be explanatory
• Attention to complexity
• The usual obvious (CPU, memory, disk scans & seeks)
• Iterations
• Model size: What is an example of this?
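One concrete tactic for the disk-scan and missing-value concerns above: Welford's online algorithm computes mean and variance in a single pass with O(1) memory, so data that is too big to re-read is scanned exactly once:

```python
import math

class StreamingStats:
    """Welford's single-pass mean/variance: one scan, O(1) memory."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        if x is None:                 # tolerate missing values
            return
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def std(self):
        """Sample standard deviation of values seen so far."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

s = StreamingStats()
for v in [10.0, None, 12.0, 11.0, 13.0]:
    s.update(v)
# s.mean == 11.5, s.std() ~ 1.29, and the None was simply skipped
```

The same running statistics also give a cheap outlier flag (is the new point many standard deviations from the running mean?) without a second pass.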
Real-time Systems
What is real-time?
Very different requirements:
• Analyst: “What’s the user count today? By source? Now? From France?”
• Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”
• Google: “Make a bid for these placements. You have 50 ms”
User Interfaces
• Operations or not, visualization is critical for acceptance
• From business concept to implementation
• What information do these users want to see?
• How does this information support decision making?
• How to visualize it with clarity yet powerfully?
Significance Testing
• Data-driven actions must be backed by numbers
• Early analytics glossed over significance
• Executive: “Can I trust these numbers? Is my decision justified?”
• Systems must act conservatively
• Trust is built slowly, but lost quickly
• Data solutions must not screw up
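For the executive's question, one standard way to back a decision with numbers is a two-proportion z-test on conversion counts (the numbers below are invented for illustration):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for H0: the two conversion rates are equal."""
    p_a, p_b = conv_a / float(n_a), conv_b / float(n_b)
    p = (conv_a + conv_b) / float(n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p * (1 - p) * (1.0 / n_a + 1.0 / n_b))
    return (p_b - p_a) / se

# Hypothetical A/B test: 200/10000 vs 260/10000 conversions
z = two_proportion_z(200, 10000, 260, 10000)          # ~2.83
significant = abs(z) > 1.96                           # 5 % two-sided threshold
```

Acting conservatively means the system does nothing until |z| clears the threshold; trust survives a missed opportunity better than a wrong call.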
Modeling Information Business Systems
• Understanding business and how to improve it with data
• Mapping: business problem --> data solution
• The most important quality of a data scientist
Contrasts
Hand-written Turing Machine vs Excel
• Average business has tons of low-hanging data fruit
• Developing and automating all that takes years (and years)
• No use for “advanced” stuff without visibility into the underlying data
• There is no shortcut
• The organization itself needs to mature
Supervised vs Unsupervised
• Decide the purpose of the analysis now, or later?
• Most often the need is already formulated
• “Here’s a standard clustering of human behavior”
• Power laws will screw things up
Ad-hoc vs Operations
• Operative data algorithms run day and night without supervision
• Can produce massive leverage and ROI to a business
• … but they are crazy hard to develop
• Ad hoc analysis can employ all the cool stuff from last month’s JMLR
• … but it can’t scale
• … and 90 % of the effort goes to communication and visualization
Computation Models
State snapshots
• User actions modify the current state in an OLTP
• Individual actions go to an offline audit log for replay
• Data algorithms need to export and import data
• Things are run in batches
• What data used to be (and still often is)
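A toy sketch of that replay idea: the current state is just the audit log re-run from the start. The `(user, delta)` schema here is made up for illustration:

```python
def replay(log):
    """Rebuild current state by re-running the audit log from scratch.
    Each entry is (user_id, delta) against that user's balance."""
    state = {}
    for user, delta in log:
        state[user] = state.get(user, 0) + delta
    return state

audit_log = [("u1", 10), ("u2", 5), ("u1", -3)]
state = replay(audit_log)     # {"u1": 7, "u2": 5}
```

Snapshots exist only because replaying the whole log in every batch gets too slow; they are a cache of this computation, not the master data.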
[Slide graphic: a stream of events condensed into successive snapshots]
Data Warehouse
• Additional endpoint specialized for analytics
• Can run surprisingly many algorithms
• … because the speed is so worth the effort
Cloud
• “Scalable SOA for computation, networking and storage”
• Really all about strict APIs
• Service dog wagging the infrastructure tail
• Public cloud very competitive for the small guys
• Hybrid clouds increasingly replace enterprise systems
Event data
• The event stream itself becomes a first-class citizen and the master record
• Needs novel storage
• Needs novel processing
• Data scientists beware! Sugar high imminent!
Stream processing
• New data is coming in all the time
• Process it online
• Data becomes somewhat disposable
• “Why bother with month-old data when there’s too much of it anyway?”
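A minimal sliding-window counter shows the disposability in practice: memory is bounded by the window, not by the stream, and anything older than the window is simply thrown away:

```python
from collections import deque

class SlidingCounter:
    """Count events in the trailing `window` seconds.
    Memory use is bounded by the window, not the stream length."""
    def __init__(self, window):
        self.window = window
        self.times = deque()

    def add(self, t):
        self.times.append(t)
        # Evict everything older than the window: old data is disposable.
        while self.times and self.times[0] <= t - self.window:
            self.times.popleft()

    def count(self):
        return len(self.times)

c = SlidingCounter(window=60)
for t in [0, 10, 30, 65, 70]:     # event timestamps in seconds
    c.add(t)
# only the events at 30, 65 and 70 remain inside the 60 s window
```

The same eviction idea generalizes to windowed sums, rates and top-k, which covers a surprising share of the analyst's and sysadmin's questions from slide 15.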
Iterative processing
• Always been the problem with large data
• Keeping state in memory is necessary, but hard
• Spark doesn’t solve this, but makes it less painful
• Common fix: don’t do iterations
Hadoop the Hairy Framework
• HDFS, ZooKeeper, MapReduce, Hive, Pig, YARN, Flume, Mahout, Bigtop, Oozie, Hue, HCatalog, Avro, Whirr, Sqoop, Impala, DataFu, …
• Premise of insanely large and/or unstructured data
• You probably don’t need it
Will Hadoop replace the Data Warehouse?
Separate the concepts: Hadoop the framework vs. MapReduce the model
• MapReduce suited for totally different tasks
• Hadoop can host a data warehouse
• … but it won’t be any easier or quicker to develop
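The MapReduce model itself fits in a few lines; a toy single-machine word count just to show the shape of task it suits (map each record to key-value pairs, then reduce per key):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: one document -> a list of (word, 1) pairs."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts per key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big insights"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
# {"big": 2, "data": 1, "insights": 1}
```

The framework's real value is running exactly this shape of computation across thousands of disks, which is a different problem from the joins and indexes a warehouse is built for.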
The Purpose
What does Big Data mean for a business?
• Answers … a lot more answers
• Better, more reliable decision making
• Treating customers as individuals instead of segments
• How to design processes (both business and social) to employ data?
Data-driven decision making
Thank you!
• Always eager to talk about this stuff, feel free to get in touch!
• Now it’s time for lots of questions!
• niko.vuokko@gmail.com
• linkedin.com/in/nikovuokko
• @nikovuokko