Big Data Rampage
Niko Vuokko, 13 May 2013, HIIT Seminar
The data
About that data of yours…
• Researchers generally live in a nice utopia where data just works *
* Yes, you do munge it for days, that’s nice
Reality check
What if you suddenly notice that there’s
• … corrupted JSON/XML/whatever
• … corrupted ids
• … transient ids
• … 5 different transient ids
• … text in number fields
• … new fields
• … disappeared fields
• … fields whose meaning just changed
• … but you have no idea of the new definition
• … all of these, regularly, without advance notice
• … and the bad data is coming at you at 1 GB per hour
• … and yours or someone else’s business depends on the data
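A minimal defensive-parsing sketch of the kind of armor this list implies. The field names (`amount`, `user_id`) are hypothetical, just to make the failure modes concrete:

```python
import json

def parse_event(raw):
    """Parse one raw event line, tolerating the failure modes above.
    Returns a dict on success, None on irrecoverable garbage."""
    try:
        event = json.loads(raw)            # corrupted JSON raises ValueError
    except ValueError:
        return None
    # Text in number fields: coerce defensively instead of crashing.
    try:
        event["amount"] = float(event.get("amount", 0))
    except (TypeError, ValueError):
        event["amount"] = None             # flag for downstream imputation
    # Disappeared fields: supply explicit defaults instead of KeyErrors.
    event.setdefault("user_id", None)
    return event

good = parse_event('{"user_id": "u1", "amount": "12.5"}')
bad = parse_event('{"user_id": "u1", "amount": ')   # truncated JSON -> None
```

At 1 GB per hour you cannot stop the pipeline for every bad record; the point is that every record that can fail, will.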
[Slide graphic: garbage in → you → great insights out]
The data
• Enriched from many operationally attainable sources
--> varying schemas and a complicated ID soup
• Developed by the frontline instead of an IT waterfall
--> faster process, but volatile data definitions
• Data scientists often require access to more data
--> further risk of lapses
• Big and streaming in
--> risks of discontinuity
The Big Data (please don’t shoot me for using the term)
What is big?
Human-generated
• 5K tweets / s
• 25K events / s from a mobile game (that’s 200 GB / day)
• 40K Google searches / s
Machine-generated
• 5M quotes / s in the US options market
• 120 MB / s of diagnostics from a single gas turbine
• 1 PB / s peaking from CERN LHC
What will be big?
• Human-generated data will get more detailed
• … but won’t grow much faster than the userbase
• … so it will eventually count as “small”
• Machine-generated data will keep growing with Moore’s law
• … and it’s already massive
How many of you consider this scale?
• Why not?
• We already understand CPU and memory intensive problems
• But the new world out there is data intensive
• How can research stay in touch with change and stay relevant?
The Curriculum: Retrofitting CS Studies
Software Architectures
• Single-thread performance and disk I/O are hitting a wall
• How do learning algorithms scale out of this corner?
• Stochastic methods
• Ensembles
• Online learning
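As a sketch of the online-learning point: single-pass stochastic gradient descent fits a model in constant memory, seeing each example once. This is a toy fit of y = 2x + 1 on synthetic data, not any particular library's API:

```python
import random

def sgd_linear(stream, lr=0.05):
    """Single-pass SGD for y ~ w*x + b.
    Memory stays constant no matter how long the stream is."""
    w, b = 0.0, 0.0
    for x, y in stream:
        err = (w * x + b) - y     # prediction error on this one example
        w -= lr * err * x         # gradient step on the weight
        b -= lr * err             # gradient step on the bias
    return w, b

random.seed(0)
stream = ((x, 2.0 * x + 1.0) for x in (random.uniform(-1, 1) for _ in range(5000)))
w, b = sgd_linear(stream)         # recovers w ~ 2, b ~ 1
```

This is why stochastic and online methods scale where batch solvers hit the wall: one scan, no matrix in memory, no re-reads.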
Databases 1
• In memory: MongoDB, Exasol, Redis
• On disk (single/sharded): MySQL, PostgreSQL
• Data warehouse: Teradata, DB2, Oracle
• Distributed: HDFS, Cassandra, Riak
• Cloud: S3, Azure, GCE, OpenStack
Databases 2
• Good old OLTP
• Analytic
• Key-value stores
• Document stores
• HDFS
• What is the best choice for this job?
Data Structures and Algorithms
• Transforming data is expensive --> play it safe with data structures
• Normalization dilemma
• Algorithms must tolerate the volatile nature of data
• Data drift, errors, missing values, outliers
• Models need to be explanatory
• Attention to complexity
• The usual obvious (CPU, memory, disk scans & seeks)
• Iterations
• Model size: What is an example of this?
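One concrete tactic for the disk-scan and missing-value concerns above: Welford's online algorithm computes mean and variance in a single pass with O(1) memory, so data that is too big to re-read is scanned exactly once:

```python
import math

class StreamingStats:
    """Welford's single-pass mean/variance: one scan, O(1) memory."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        if x is None:                 # tolerate missing values
            return
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def std(self):
        """Sample standard deviation of values seen so far."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

s = StreamingStats()
for v in [10.0, None, 12.0, 11.0, 13.0]:
    s.update(v)
# s.mean == 11.5, s.std() ~ 1.29, and the None was simply skipped
```

The same running statistics also give a cheap outlier flag (is the new point many standard deviations from the running mean?) without a second pass.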
Real-time Systems
What is real-time?
Very different requirements:
• Analyst: “What’s the user count today? By source? Now? From France?”
• Sysadmin: “Network traffic up 5x in 5 seconds! What’s going on?”
• Google: “Make a bid for these placements. You have 50 ms”
User Interfaces
• Operations or not, visualization is critical for acceptance
• From business concept to implementation
• What information do these users want to see?
• How does this information support decision making?
• How to visualize it with clarity yet powerfully?
Significance Testing
• Data-driven actions must be backed by numbers
• Early analytics glossed over significance
• Executive: “Can I trust these numbers? Is my decision justified?”
• Systems must act conservatively
• Trust is built slowly, but lost quickly
• Data solutions must not screw up
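For the executive's question, one standard way to back a decision with numbers is a two-proportion z-test on conversion counts (the numbers below are invented for illustration):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z statistic for H0: the two conversion rates are equal."""
    p_a, p_b = conv_a / float(n_a), conv_b / float(n_b)
    p = (conv_a + conv_b) / float(n_a + n_b)          # pooled rate under H0
    se = math.sqrt(p * (1 - p) * (1.0 / n_a + 1.0 / n_b))
    return (p_b - p_a) / se

# Hypothetical A/B test: 200/10000 vs 260/10000 conversions
z = two_proportion_z(200, 10000, 260, 10000)          # ~2.83
significant = abs(z) > 1.96                           # 5 % two-sided threshold
```

Acting conservatively means the system does nothing until |z| clears the threshold; trust survives a missed opportunity better than a wrong call.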
Modeling Information Business Systems
• Understanding business and how to improve it with data
• Mapping: business problem --> data solution
• The most important quality of a data scientist
Contrasts
Hand-written Turing Machine vs Excel
• Average business has tons of low-hanging data fruit
• Developing and automating all that takes years (and years)
• No use for “advanced” stuff without visibility into the underlying data
• There is no shortcut
• The organization itself needs to mature
Supervised vs Unsupervised
• Decide the purpose of the analysis now, or later?
• Most often the need is already formulated
• “Here’s a standard clustering of human behavior”
• Power laws will screw things up
Ad-hoc vs Operations
• Operative data algorithms run day and night without supervision
• Can produce massive leverage and ROI to a business
• … but they are crazy hard to develop
• Ad hoc analysis can employ all the cool stuff from last month’s JMLR
• … but it can’t scale
• … and 90 % of the effort goes to communication and visualization
Computation Models
State snapshots
• User actions modify the current state in an OLTP
• Individual actions go to an offline audit log for replay
• Data algorithms need to export and import data
• Things are run in batches
• What data used to be (and still often is)
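A toy sketch of that replay idea: the current state is just the audit log re-run from the start. The `(user, delta)` schema here is made up for illustration:

```python
def replay(log):
    """Rebuild current state by re-running the audit log from scratch.
    Each entry is (user_id, delta) against that user's balance."""
    state = {}
    for user, delta in log:
        state[user] = state.get(user, 0) + delta
    return state

audit_log = [("u1", 10), ("u2", 5), ("u1", -3)]
state = replay(audit_log)     # {"u1": 7, "u2": 5}
```

Snapshots exist only because replaying the whole log in every batch gets too slow; they are a cache of this computation, not the master data.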
[Slide graphic: a stream of events condensed into successive snapshots]
Data Warehouse
• Additional endpoint specialized for analytics
• Can run surprisingly many algorithms
• … because the speed is so worth the effort
Cloud
• “Scalable SOA for computation, networking and storage”
• Really all about strict APIs
• Service dog wagging the infrastructure tail
• Public cloud very competitive for the small guys
• Hybrid clouds increasingly replace enterprise systems
Event data
• The event stream itself becomes a first-class citizen and the master record
• Needs novel storage
• Needs novel processing
• Data scientists beware! Sugar high imminent!
Stream processing
• New data is coming in all the time
• Process it online
• Data becomes somewhat disposable
• “Why bother with month-old data when there’s too much of it anyway?”
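A minimal sliding-window counter shows the disposability in practice: memory is bounded by the window, not by the stream, and anything older than the window is simply thrown away:

```python
from collections import deque

class SlidingCounter:
    """Count events in the trailing `window` seconds.
    Memory use is bounded by the window, not the stream length."""
    def __init__(self, window):
        self.window = window
        self.times = deque()

    def add(self, t):
        self.times.append(t)
        # Evict everything older than the window: old data is disposable.
        while self.times and self.times[0] <= t - self.window:
            self.times.popleft()

    def count(self):
        return len(self.times)

c = SlidingCounter(window=60)
for t in [0, 10, 30, 65, 70]:     # event timestamps in seconds
    c.add(t)
# only the events at 30, 65 and 70 remain inside the 60 s window
```

The same eviction idea generalizes to windowed sums, rates and top-k, which covers a surprising share of the analyst's and sysadmin's questions from slide 15.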
Iterative processing
• Always been the problem with large data
• Keeping state in memory is necessary, but hard
• Spark doesn’t solve this, but makes it less painful
• Common fix: don’t do iterations
Hadoop the Hairy Framework
• HDFS, ZooKeeper, MapReduce, Hive, Pig, YARN, Flume, Mahout, Bigtop, Oozie, Hue, HCatalog, Avro, Whirr, Sqoop, Impala, DataFu, …
• Premise of insanely large and/or unstructured data
• You probably don’t need it
Will Hadoop replace the Data Warehouse?
Separate the concepts: Hadoop the framework vs. MapReduce the model
• MapReduce suited for totally different tasks
• Hadoop can host a data warehouse
• … but it won’t be any easier or quicker to develop
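The MapReduce model itself fits in a few lines; a toy single-machine word count just to show the shape of task it suits (map each record to key-value pairs, then reduce per key):

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: one document -> a list of (word, 1) pairs."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts per key."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big insights"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
# {"big": 2, "data": 1, "insights": 1}
```

The framework's real value is running exactly this shape of computation across thousands of disks, which is a different problem from the joins and indexes a warehouse is built for.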
The Purpose
What does Big Data mean for a business?
• Answers … a lot more answers
• Better, more reliable decision making
• Treating customers as individuals instead of segments
• How to design processes (both business and social) to employ data?
Data-driven decision making
Thank you!
• Always eager to talk about this stuff, feel free to get in touch!
• Now it’s time for lots of questions!
• niko.vuokko@gmail.com
• linkedin.com/in/nikovuokko
• @nikovuokko