Tour of Big Data
DESCRIPTION
Presentation at Southern California Code Camp, July 2013, in San Diego. This talk introduces basic concepts in the world of big data and data science, with a focus on relational databases, noSQL, MapReduce, machine learning, and data visualization, along with demos of MapReduce in action and Pig on Hadoop. The purpose of this presentation is to familiarize you with the terminology and concepts of data science, and to whet your appetite for further exploration of the world of big data. This presentation is adapted from a Coursera online course with a similar title and scope.
TRANSCRIPT
Tour of Big Data
Raymond Yu
SoCal Code Camp 2013
About myself
• Sr. Database Architect @ BridgePoint Education
• Blog www.yutechnet.com
• LinkedIn www.linkedin.com/in/raymondyu1
• @yutechnet
About this talk…
7/28/2013 yutechnet.com
• Inspired by “Introduction to Data Science”
on Coursera (Bill Howe, UW)
•Guided tour of topics in data science
– MapReduce, Pig
– noSQL
– Machine Learning
– Information Visualization
•Goal
Big Data
•Volume
– Size of data
•Velocity
– The latency of data processing relative to the growing
demand of interactivity
•Variety
– The diversity of sources, formats, quality, and structures
Big Data is any data that is expensive to manage and hard to
extract value from. -Michael Franklin
Where does big data come from?
• “Data exhaust” from customers
•New sensor technologies
• Individually contributed data at massive scale
•Cheap to keep data
Data Science
•Data Preparation (at scale)
•Analytics
•Communication
The ability to take data, understand it, process it,
extract value from it, visualize it, and communicate it
- Hal Varian, Google's Chief Economist
Context…
src. Introduction to Data Science course
Relational Databases
• SQL as Declarative Language
• Indexes
– Extract small result from big dataset
– Built easily and automatically used when appropriate
•Data consistency
• “Old-style” scalability
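The point about indexes being built easily and used automatically can be sketched with Python's built-in sqlite3 module. The orders table, its columns, and the data are hypothetical, made up for illustration; the exact wording of the query plan varies by SQLite version.

```python
import sqlite3

# Hypothetical in-memory table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [(f"cust{i % 1000}", float(i)) for i in range(10000)],
)

# Declarative query: it says what to fetch, not how to fetch it.
query = "SELECT amount FROM orders WHERE customer = ?"


def plan(sql, params):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(row[-1] for row in rows)


plan_before = plan(query, ("cust42",))   # full table scan
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan_after = plan(query, ("cust42",))    # same query, now an index search

result_count = len(conn.execute(query, ("cust42",)).fetchall())
print(plan_before)
print(plan_after)
```

The query text never changes; the database decides on its own to use the index once it exists.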
MapReduce
•Google paper 2004
•Hadoop 2008
•High-level programming model for large-scale parallel data processing
•Divide-and-conquer
•Mapper + Reducer
“Hello World” of MapReduce
Count word frequency in millions of documents
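A minimal single-process sketch of the word count in Python, assuming nothing about the demo code shown in the talk. The names mapper, reducer, and run_mapreduce are illustrative; a real framework such as Hadoop would run the map and reduce phases in parallel across machines and handle the grouping (shuffle) step itself.

```python
from collections import defaultdict


def mapper(doc_id, text):
    # Map phase: emit a (word, 1) pair for every word in one document.
    for word in text.lower().split():
        yield word, 1


def reducer(word, counts):
    # Reduce phase: sum all the counts emitted for one word.
    return word, sum(counts)


def run_mapreduce(documents):
    # Shuffle phase, simulated locally: group emitted values by key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in mapper(doc_id, text):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Tiny made-up corpus standing in for "millions of documents".
docs = {"d1": "big data is big", "d2": "data science"}
word_counts = run_mapreduce(docs)
print(word_counts)
```

Because each mapper call sees only one document and each reducer call sees only one word, both phases can be split across many workers, which is the divide-and-conquer idea.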
MapReduce Programming Model
src. Course slide
MapReduce in Hadoop
Pig
• An engine to execute programs on top of Hadoop
• Language layer Pig Latin
• An Apache open source project (http://pig.apache.org)
• Yahoo! 2009
Why use Pig?
In MapReduce…
In Pig Latin
Pig System Overview
Context…
src. Introduction to Data Science course
noSQL definitions
•A term to designate databases which
differ from classic relational databases
– Transactional model
– Data model
•Not much to do with SQL
• “not only SQL”
Concepts
• CAP Theorem
– Consistency
– Availability
– Partition Tolerance
• Eventual consistency
Src: blog.beany.co.kr
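Eventual consistency can be illustrated with a toy sketch: two replicas accept writes independently, may return stale data for a while, and converge once they exchange state. The Replica class, the sync function, and the last-write-wins rule based on a version number are assumptions chosen for brevity, not the mechanism of any particular product.

```python
class Replica:
    """Toy replica holding key -> (version, value). Illustrative only."""

    def __init__(self, name):
        self.name = name
        self.store = {}

    def write(self, key, value, version):
        # Last-write-wins: keep the entry with the highest version number.
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None


def sync(a, b):
    # Exchange state in both directions so the replicas converge.
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.store.items()):
            dst.write(key, value, version)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("cart", ["book"], version=1)
r2_before = r2.read("cart")                    # r2 has not seen the write yet
r2.write("cart", ["book", "pen"], version=2)   # newer write lands on r2 only
r1_stale = r1.read("cart")                     # r1 still serves the old value
sync(r1, r2)                                   # replicas exchange updates
print(r1.read("cart"), r2.read("cart"))
```

Both replicas stay available for reads and writes while out of touch; the cost is that a read may return an older value until the next sync.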
noSQL One-page Overview
Let’s walk through a few
•Column definitions
•RDBMS
•Memcache
•Dynamo
•CouchDB
• BigTable (HBase)
noSQL Common Features
• The ability to replicate and partition data over many servers (scale)
• Horizontally scale simple operation throughput over many servers
• A simple API - no query language (no SQL)
• Weaker concurrency model than ACID transactions (no transaction)
• The ability to dynamically add new attributes to data records (no schema)
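Several of these features can be sketched in a few lines of Python: a simple put/get API with no query language, records partitioned over shards by hashing the key, and records that carry different attributes without a schema. ShardedStore and its methods are hypothetical names, and in-memory dicts stand in for separate servers.

```python
import hashlib


class ShardedStore:
    """Toy key-value store partitioned by key hash. Illustrative only."""

    def __init__(self, num_shards):
        # Each dict stands in for one server.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns it.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, record):
        self.shards[self._shard_for(key)][key] = dict(record)

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = ShardedStore(num_shards=4)
store.put("user:1", {"name": "Ann"})
# A new attribute is added to one record without any schema change.
store.put("user:2", {"name": "Bob", "twitter": "@bob"})
print(store.get("user:1"), store.get("user:2"))
```

Each operation touches exactly one shard, which is why simple operations scale horizontally; anything that spans keys, such as a join or a multi-record transaction, has no such shortcut.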
Machine Learning
• Systems that automatically learn programs from data
• Prediction
– Given examples of inputs and outputs
– Learn the relationship between them
– Apply the relationship to a larger set
• Different from statistical modeling
– A large data set with a simple model trumps a small data set with a sophisticated model
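The three prediction steps can be sketched with the simplest possible model, a straight line fitted by least squares in plain Python. The training data is made up for illustration (it follows y = 2x + 1), and fit_line is an illustrative helper, not part of any library.

```python
def fit_line(xs, ys):
    # Ordinary least squares for one input variable.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Step 1: examples of inputs and outputs (made-up data).
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

# Step 2: learn the relationship between them.
slope, intercept = fit_line(train_x, train_y)

# Step 3: apply the relationship to inputs not seen during training.
predictions = [slope * x + intercept for x in [10.0, 20.0]]
print(slope, intercept, predictions)
```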
Bertin’s Visual Attributes
Data Encoding Exercise
Information Visualization
src. http://www.tableausoftware.com/public
Closing example
Src. http://commons.wikimedia.org/wiki/File:ElectoralCollege2012.svg
Nate Silver
fivethirtyeight.com
Obama’s Data-Driven Campaign
• Massive voter db
• Hadoop as ETL
• Vertica db for slice-and-dice
Questions?