
© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.

Introduction to Data Science with Hadoop

Glynn Durham, Senior Instructor, Cloudera


I will cover:

Hadoop, the Hadoop ecosystem, HDFS, MapReduce, Sqoop, Flume, Hive, Pig, Mahout, machine learning, data science using Hadoop

Terms

with a few extras:

YARN, HBase, Impala, Oozie, data products


Hadoop is:

• a platform for big data

• several Apache Software Foundation (ASF) projects

• free open source software

Major parts:

Hadoop Core

Hadoop ecosystem

Hadoop


Hadoop Core Main Features: File System and Batch Programming


Hadoop Core consists of:

– HDFS (Hadoop Distributed File System), for storage

– MapReduce, for batch programming

Hadoop Core


HDFS Writes


HDFS Reads


HDFS is good at:
– storing enormous files
– storing a lot of data reliably
– throughput on sequential writes
– throughput on sequential reads of a file or part of a file

HDFS is not good at:
– high-speed random reads of parts of a file

HDFS cannot:
– update any part of a file once written*

* but you can always write a new file, and/or delete, move, and rename files and directories

HDFS Strengths and Weaknesses


MapReduce: Programming with Simple Functions
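To make the "simple functions" idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names chosen for illustration, and the grouping step only imitates, in memory, what the framework does between the map and reduce phases.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all values seen for one key into a single result."""
    return word, sum(counts)


def run_mapreduce(lines, mapper, reducer):
    """Toy driver: map each line, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


word_counts = run_mapreduce(["the cat sat", "the cat ran"], map_fn, reduce_fn)
print(word_counts)
```

The programmer supplies only the two small functions; the driver (in Hadoop, the framework) handles grouping and distribution.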


MapReduce Chains


MapReduce at Scale


MapReduce in Hadoop


MapReduce is good at:
– processing enormous amounts of data
– scaling out as you add more machines
– continuing to completion, even when some machines die

MapReduce is not good at:
– running any algorithm you can think up
– algorithms that require shared state overall*

* but maybe you can get clever with your algorithm design

MapReduce cannot:
– run in real time: MapReduce jobs are batch jobs

MapReduce Strengths and Weaknesses


Detour: YARN, Yet Another Resource Negotiator (near future)


The Hadoop Ecosystem consists of other projects that round out Hadoop Core to make it a useful platform:
– Sqoop, for RDBMS integration
– Flume, for event ingestion
– Hive, for "SQL"-like high-level programming
– Pig, another high-level programming paradigm
– Mahout, a Java library for machine learning in Hadoop

Plus:
– HBase, a "NoSQL" database system
– Oozie, a workflow manager for Hadoop actions
– ....

Hadoop Ecosystem


Sqoop: RDBMS to Hadoop and Back


Flume: Ingesting Continuing Event Data


Detour: General File Input/Output


Java MapReduce API

MapReduce revisited: How to write MapReduce programs?

•  The most expressive technique possible

•  The most work, by far

•  (Can be easier with Hadoop Streaming: a way to use streaming programming, such as shell scripting or Python)
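As a rough illustration of the streaming style, the sketch below shows a word-count mapper and reducer in Python. It assumes the usual Hadoop Streaming convention of tab-separated key/value text records, with the reducer receiving records already sorted by key. The function names are hypothetical, and the `sorted` call is only a local stand-in for the sort the framework performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Turn raw text lines into tab-separated 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word; the input must already be sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local stand-in for the framework's sort between map and reduce.
mapped = sorted(mapper(["Big data", "big clusters"]))
streaming_result = list(reducer(mapped))
print(streaming_result)
```

In a real streaming job each function would live in its own script, reading lines from standard input and printing records to standard output.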


Hive: MapReduce as "SQL"

•  Familiar language and programming paradigm

•  Provides interface to many SQL-compliant tools


Detour: Impala, High Speed Analytics in Hadoop

•  5 to 30 times faster than Hive queries (sometimes hundreds of times faster!)

•  Cloudera exclusive offering, but Apache licensed, so it's free and open source


Impala Does Not Use MapReduce


Detour: HBase, A NoSQL Database System


HBase is a NoSQL database system:
– programmers create and use database tables
– high volume, high performance access to individual cells
– much weaker query language than SQL
– lacks ACID-compliant transactions

HBase is not strictly needed to do "data science":
– a resource hog; competes with analytical programs
– often deployed on its own separate cluster
– may be part of your organization's data storage and delivery, so you may need to get or put data into an HBase system*

* (or other NoSQL system)

Detour: A bit more about HBase


Pig: Another Language for MapReduce


Mahout is:

• a collection of algorithms, mainly focused on "the three C's" of machine learning

• written in Java

• largely implemented over Hadoop MapReduce

• invocable from the command line

• extensible, with the Java API

Mahout is not:

• a turnkey solution for doing machine learning

• always user-friendly

Mahout: Machine Learning in MapReduce


"The three C's" of machine learning:

• Classification

• Clustering

• Collaborative filtering (recommenders)

Machine Learning


Supervised Machine Learning: Classification


Machine Learning: Clustering


Machine Learning: Collaborative Filtering for Recommenders
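One simple way to picture collaborative filtering is item co-occurrence: recommend items that often appear alongside the items a user already has. The Python sketch below is a toy illustration under that assumption, not Mahout's implementation. The `baskets` data and all function names are made up for the example.

```python
from collections import Counter
from itertools import permutations


def cooccurrence_counts(baskets):
    """Count how often each ordered pair of distinct items shares a user."""
    counts = Counter()
    for items in baskets.values():
        counts.update(permutations(sorted(set(items)), 2))
    return counts


def recommend(user, baskets, top_n=2):
    """Rank items the user lacks by co-occurrence with items the user has."""
    counts = cooccurrence_counts(baskets)
    owned = set(baskets[user])
    scores = Counter()
    for (a, b), n in counts.items():
        if a in owned and b not in owned:
            scores[b] += n
    # Sort by descending score, then by name for a deterministic order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical data: which tools each user has already used.
baskets = {
    "ann": ["hive", "pig"],
    "bob": ["hive", "pig", "sqoop"],
    "cat": ["hive", "sqoop"],
    "dan": ["pig", "flume"],
}
print(recommend("ann", baskets))
```

No item attributes are needed: the recommendations come entirely from the behavior of other users.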


Simple Enterprise Deployment: Hadoop as ETL Appliance


Simple workflow within Hadoop:

1.   Clear out staging directory in HDFS

2.   Sqoop import from OLTP tables

3.   Hive (or Pig) script to transform data

4.   Sqoop export to data warehouse

Detour: Oozie, Workflow within Hadoop
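The core idea of a workflow manager, running steps in order and halting when one fails, can be sketched in a few lines of Python. This is not how Oozie is used (Oozie workflows are defined in XML): `run_workflow` and the placeholder step actions below are hypothetical, and each lambda merely stands in for the real HDFS, Sqoop, or Hive action.

```python
def run_workflow(steps):
    """Run (name, action) steps in order; stop at the first failure."""
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as error:
            return completed, f"{name} failed: {error}"
        completed.append(name)
    return completed, None


log = []
# Placeholder actions standing in for the four steps of the simple workflow.
steps = [
    ("clear-staging", lambda: log.append("staging cleared")),
    ("sqoop-import", lambda: log.append("imported")),
    ("hive-transform", lambda: log.append("transformed")),
    ("sqoop-export", lambda: log.append("exported")),
]
completed, error = run_workflow(steps)
print(completed, error)
```

Stopping at the first failure matters here: exporting to the data warehouse after a failed transform would load stale or partial data.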


Hadoop: The Bigger Picture


A data scientist will:

1.   Identify internal and external data for potential use (general data wrangling tools).

2.   Help build ingestion pipelines to obtain data for use (Flume, Sqoop, other).

3.   Examine, clean, and anonymize ingested data (Hive, Impala, Pig, Hadoop Streaming).

4.   Shape data into useful formats (Hive, Pig).

5.   Explore data sets to gain understanding of problems, trends, reality (Impala, Hive, Pig, statistical programming).

6.   Build predictive models using statistical programming, machine learning (Mahout).

7.   Contribute to data products: products in the organization that are built in large part from the data itself (Mahout, Sqoop export, general file export).

8.   Conduct experiments with data products, quantifying benefits and/or tradeoffs of system changes (Flume, Sqoop, statistical tests).

9.   Communicate results and insights to stakeholders (visualization*).

Data Science with Hadoop


Visualization: Needs Visualization Software


Thank you! Questions? Contributions?

Glynn Durham, Senior Instructor, Cloudera
