Capital One Hadoop Intro: History; ETL/Analytics Practices at LinkedIn/Netflix/Yahoo; Next-Gen ETL 2014+; Scaling Layers; Hadoop Distributions; Analytics. 1/7/2014

Upload: doug-chang

Post on 20-Aug-2015


TRANSCRIPT

Page 1: Capital One Hadoop Intro

Capital One
Hadoop Intro: History
ETL/Analytics Practices at LinkedIn/Netflix/Yahoo
Next-Gen ETL 2014+
Scaling Layers
Hadoop Distributions
Analytics
1/7/2014

Page 2: Hadoop/HBase

Original GFS requirement: store internet HTML pages on disk for analytics later.
BigTable (2002): book pages had metadata. The requirement was to return book pages to the user, with no joins (no memory for them in 2002; different now).
Latency determines the requirements (analytics/Netflix later): semi-real-time. A schema for book pages: where to store the metadata? In BigTable.
My role: I am not going to give you slides with pictures; everything presented has code behind it, with documentation.

Page 3: Big Data Failure Rate

Big data projects have a greater than 50% failure rate: after POCs, very few enter production. Why? Work habits for distributed computing. You have to write distributed computing components, and J2EE idioms don't work.
Projects fail because of performance and administration in production. For example, performance is not an issue when supporting the top 100 Ab Initio queries in Hadoop; it will be an issue at 130k queries, or perhaps at 10% of them.

Page 4: Measuring Performance in POCs

Measuring performance in POCs the wrong way means teams can't build the components.
[Diagram, the wrong setup: a single server/thread driving the NameNode (NN) and the DataNode/RegionServer (DN/RS) nodes.]

Page 5: Performance Measurement

Performance measurement: leader election, countdown latch, and testing failure/handoff with Chaos Monkey. ZooKeeper + Jetty.
[Diagram: multiple servers coordinated through ZooKeeper driving the DN/RS nodes.]
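The coordination primitives named here (leader election, countdown latch) are ZooKeeper recipes and need a live ensemble to demonstrate; as a rough in-process sketch of the same idea, a latch can release all load-generating clients at the same instant so they all measure the same window. `run_clients` and its arguments are illustrative, not from the deck:

```python
import threading
import time

def run_clients(n_clients, n_ops, op):
    """Start n_clients workers that all begin at the same instant,
    mimicking a ZooKeeper countdown-latch coordinated start."""
    start_latch = threading.Barrier(n_clients)  # in-process stand-in for a ZK latch
    results = [None] * n_clients

    def worker(idx):
        start_latch.wait()            # every client blocks here...
        t0 = time.perf_counter()      # ...then all start measuring together
        for _ in range(n_ops):
            op()
        results[idx] = time.perf_counter() - t0

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

elapsed = run_clients(4, 1000, lambda: None)
print(len(elapsed))  # 4
```

In the real harness each worker would be a separate server registered in ZooKeeper, not a thread; the point of the sketch is only the start-together discipline.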

Page 6: Hive at LinkedIn

Hive at LinkedIn (bottom left of the diagram). All three companies' setups are similar.

Page 7: LinkedIn Simple Abstractions

Teradata together with Hadoop. Multiple clusters: Prod/Dev/Research (POC?).
Hive: ad hoc queries and small ETL (lower left-hand corner).
Pig/DataFu plus enhancements for production ETL.
Multiple data stages in the green box (POC: Ab Initio data staging, with a REST API for staging).
Workflow POC: Oozie + Pig + Hive; add a web UI.
Data staging POC: the CDK as an example.

Page 8: POC Coding Style

A high-level directory with Maven subprojects; a simple archetype is OK.
Define data repositories with Avro schemas. Start with a simple file repository with files copied from the Ab Initio file system; no need to spend time reverse engineering, just copy.
Add pig and hive directories to cdk-examples.
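As a hedged sketch of the "define data repositories with Avro schemas" step: the schema below is a hypothetical Avro record (field names invented for illustration), and it is checked with a minimal stdlib-only validator rather than the real Avro library:

```python
import json

# A hypothetical Avro schema for a record copied out of the Ab Initio
# file system -- the field names are illustrative, not from the deck.
TXN_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Transaction",
  "fields": [
    {"name": "txn_id",  "type": "string"},
    {"name": "account", "type": "string"},
    {"name": "amount",  "type": "double"},
    {"name": "ts",      "type": "long"}
  ]
}
""")

PRIMITIVES = {"string": str, "double": float, "long": int}

def conforms(record, schema):
    """Minimal structural check: every schema field present with the right type."""
    for field in schema["fields"]:
        if field["name"] not in record:
            return False
        if not isinstance(record[field["name"]], PRIMITIVES[field["type"]]):
            return False
    return True

rec = {"txn_id": "t1", "account": "a9", "amount": 12.5, "ts": 1389081600}
print(conforms(rec, TXN_SCHEMA))  # True
```

The real repository would use the Avro libraries for serialization and evolution; the sketch only shows why a schema file belongs next to the copied data from day one.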

Page 9: POC Simple Extensions

Define a webserver in the CDK and create a REST API; Jersey/.../DI if you want more advanced coding styles.
The webserver graphs the performance of Hive/Pig/ETL metrics together with JVM metrics, by sending dummy queries in.
Start Nagios/Ganglia monitoring and Puppet deployment of the CDK as a learning exercise for larger scale.
Integrate the CDK into Bigtop as Capital One distribution practice.

Page 10: Netflix Block Diagram

Page 11: Simple Netflix Abstractions

http://www.slideshare.net/adrianco/netflix-architecture-tutorial-at-gluecon
An automated develop-and-deploy software process built on APIs: Perforce/Ivy/Jenkins.
Hadoop POC: GitHub, Jenkins, deploy to a demo webpage. No code sitting in an Eclipse project.

Page 12: Netflix Automated App Dev/Deploy

A REST specification makes web UIs easier: a Capital One ETL REST interface.

Page 13: Netflix Instance Config

Do the same for Capital One as an exercise to help with deployment. With Apache Bigtop, define 1) a NameNode instance and 2) a DataNode/RegionServer instance, and customize the scripts per instance.

Page 14: Netflix Security

The default is to turn off iptables/SELinux. Define Capital One POC testing? Start with auditing requirements on the test cluster (with Aravind).

Page 15: Netflix Metrics

Send dummy queries through to measure latency.
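A minimal sketch of the dummy-query idea, assuming the goal is latency percentiles; the lambda is a stand-in for a real Hive/Pig query submission:

```python
import time

def measure_latency(run_query, n_probes=200):
    """Send dummy queries through the stack and record wall-clock latency in ms."""
    samples = []
    for _ in range(n_probes):
        t0 = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return {
        "p50": samples[len(samples) // 2],
        "p95": samples[int(len(samples) * 0.95)],
        "max": samples[-1],
    }

# The lambda stands in for submitting a real query end to end.
stats = measure_latency(lambda: sum(range(1000)))
print(sorted(stats))  # ['max', 'p50', 'p95']
```

Reporting percentiles rather than an average matters here, since the tail is what an SLA is written against.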

Page 16: Netflix Scaling Layer

Do the simpler things first: a JDBC layer that manages the connection pool, plus Pig/Hive.
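A minimal sketch of the connection-pool half of such a layer, with a fake connection class standing in for a real JDBC/HiveServer connection (all names are illustrative):

```python
import queue

class ConnectionPool:
    """A minimal connection pool of the kind a JDBC scaling layer manages:
    a fixed set of connections handed out and returned through a queue."""
    def __init__(self, factory, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())

    def acquire(self, timeout=None):
        return self._pool.get(timeout=timeout)  # blocks when exhausted

    def release(self, conn):
        self._pool.put(conn)

# FakeConnection stands in for a real JDBC/HiveServer connection.
class FakeConnection:
    def execute(self, sql):
        return f"ok: {sql}"

pool = ConnectionPool(FakeConnection, size=3)
conn = pool.acquire()
print(conn.execute("SELECT 1"))  # ok: SELECT 1
pool.release(conn)
```

The blocking `acquire` is the design point: the pool, not the cluster, is where excess client demand queues up, which is what makes the layer a scaling control.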

Page 17: Yahoo Block Diagram

Yahoo block diagram: Pig, Hive, Spark, Storm.

Page 18: Yahoo Spark (cont.)

Page 19: Yahoo Next-Gen ETL

Page 20: LinkedIn/Yahoo/Netflix References

LinkedIn: Muhammed Islam, http://www.slideshare.net/mislam77/hive-at-linkedin-meetup-july-2013-at-linkedin
Yahoo: Chris Drome. Hive is for business users outside Yahoo; very similar to the previous slide.
Netflix: Jeff Magnusson. Hive is used for ad hoc queries and lightweight ETL (also on the web).

Page 21: ETL - Pig

Original Pig paper: http://infolab.stanford.edu/~usriv/papers/pig-latin.pdf
An ETL language based on relational algebra (reordering/set operations) rather than SQL queries; each step is an M/R ETL stage.
No transactional consistency or indexes (other projects have these).
A nested data model instead of the flat SQL E/R model. Why? Faster scan performance; it replaces joins (e.g., MongoDB).
Requires developing UDFs. LinkedIn: DataFu. Netflix: Lipstick for debugging Pig DAGs. You will need some debugging tool; better than Spill.
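The nested-model point can be made concrete with a small sketch (the member/position records are hypothetical): the flat layout needs a join-style lookup across two tables, while the nested layout answers the same question from a single record scan:

```python
# Flat E/R model: two tables, joined at query time.
members   = [{"member_id": 1, "name": "Ada"}]
positions = [{"member_id": 1, "title": "Engineer"},
             {"member_id": 1, "title": "Manager"}]

def titles_flat(member_id):
    # Join-style scan over the second table for every lookup.
    return [p["title"] for p in positions if p["member_id"] == member_id]

# Nested model: positions stored inside the member record, so one
# record scan replaces the join (the Pig-bag / MongoDB style).
members_nested = [
    {"member_id": 1, "name": "Ada",
     "positions": [{"title": "Engineer"}, {"title": "Manager"}]},
]

def titles_nested(member_id):
    for m in members_nested:
        if m["member_id"] == member_id:
            return [p["title"] for p in m["positions"]]
    return []

print(titles_flat(1) == titles_nested(1))  # True
```

The trade-off is the usual one: the nested copy is bigger and harder to update in place, but a scan-heavy ETL job never pays the join.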

Page 22: ETL - Pig

M/R ETL points: data is distributed across several nodes, with a merge sort of the results at the end.
Be careful sending data across the network; it doesn't scale with more users, and the network is the limitation. Google built a custom network switch with 1k+ ports, a custom TCP stack, and a modified OS.
Careful: streams scale, so do ETL with streams for real-time performance. Send results to a separate server; do not embed writes into stream POCs.
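A toy sketch of the M/R shape described above, counting the bytes that cross the "network" during the shuffle, since that is the stated bottleneck (the word-count mapper/reducer are illustrative):

```python
from collections import defaultdict

def mapreduce(records, mapper, reducer):
    """Toy M/R: map each record, shuffle by key, reduce, merge at the end.
    Counts the bytes crossing the 'network' during the shuffle, since the
    network rather than CPU is usually the limit."""
    shuffled, shuffled_bytes = defaultdict(list), 0
    for rec in records:
        for key, value in mapper(rec):
            shuffled[key].append(value)
            shuffled_bytes += len(str(key)) + len(str(value))
    out = {k: reducer(k, vs) for k, vs in sorted(shuffled.items())}
    return out, shuffled_bytes

logs = ["a b a", "b c"]
counts, moved = mapreduce(
    logs,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda k, vs: sum(vs),
)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

A combiner (pre-summing on the map side) is the standard way to shrink `moved`, which is exactly the kind of change the slide's warning pushes you toward.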

Page 23: Pig vs. M/R

Page 24: Pig Usage

Yahoo (http://www.linkedin.com/pub/chris-drome/2/a2/346): thousands of ETL jobs daily; Hive for a small user base external to Yahoo.
Netflix (http://www.linkedin.com/in/jmagnuss): thousands of jobs, at the analyst level; open-sourced Lipstick, a Pig UI debugging tool.
LinkedIn (http://www.slideshare.net/hadoopusergroup/pig-at-linkedin): thousands of jobs; open-sourced the DataFu UDFs.

Page 25: Pig POCs (~2009)

Possible Pig POCs: take the top XX queries and manually code up the Ab Initio queries. This was already completed in 2012? Which queries?
Add a JDBC-connection-style scaling layer to PigServer.java.
Out of scope for 4/30/14:
POC of Tez on Pig: https://issues.apache.org/jira/browse/PIG-3446
Apache's Pig optimizer (MR->MR->MR becomes MRRR), by writing the optimizer in a YARN AM.

Page 26: POC Quality

Turn the POCs into Bigtop integration tests and get open-source approval. Commit the changes to verify quality and accountability.

Page 27: Hive 0.11

More difficult to configure; add a MySQL metastore.
Moving to HCatalog so the metadata is accessible to other Hadoop components; access through WebHCat is in progress.
Hive Stinger using Tez, with additional in-memory optimization.
No time spent on this yet; starting 1/2014 with Hortonworks. Last day is 4/30/2014.

Page 28: Hive 0.11

Hive 0.11 POCs:
A user guide for Ab Initio programmers using Hive/Pig
Test multitenancy features with Pig/HDFS
Test JDK 1.7 features; Hadoop 2.x works with 1.7
Hive metastore/MySQL/HCatalog/WebHCat
Test cluster performance using benchpress
Next gen: 0.12-0.13; Spark/Shark HiveQL compatibility

Page 29: Next-Gen ETL Frameworks for 2014+

Faster reads/scans without using HBase. Three developments (WibiData): Dremel (Impala/Apache Drill), Spark/Shark, and Hive/Tez.
Dremel paper review: interactive analysis of web-scale datasets. It does not use M/R, for speed; 100x faster. Column schema: nested, column-oriented storage rather than rows, which is faster for some types of queries. Partition key (not in the paper).

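A minimal sketch of the row-versus-column point, without Dremel's nesting or encodings: the same records striped into per-field arrays, so a single-column query touches only that array:

```python
# Row layout: every field of every record is touched, even when the
# query only needs one column.
rows = [
    {"url": "/a", "bytes": 120, "country": "US"},
    {"url": "/b", "bytes": 80,  "country": "DE"},
    {"url": "/a", "bytes": 200, "country": "US"},
]

# Column layout (Dremel-style, minus nesting): one array per field,
# so a scan of `bytes` reads only that array.
columns = {
    "url":     [r["url"] for r in rows],
    "bytes":   [r["bytes"] for r in rows],
    "country": [r["country"] for r in rows],
}

def total_bytes_row():
    return sum(r["bytes"] for r in rows)   # touches whole records

def total_bytes_columnar():
    return sum(columns["bytes"])           # touches one column only

print(total_bytes_row() == total_bytes_columnar() == 400)  # True
```

On disk the column arrays also compress far better (runs of similar values), which is the second half of the scan-speed story.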

Page 31: Dremel Schema/Column Performance

Dremel schema/column performance; similar to Kiji without HBase? Sqoop objects.

Page 32: Impala/Drill POC

Page 33: Next-Gen ETL

Spark/Shark: distributed in-memory RDDs, for analysis and ETL.
Hive/Tez.

Page 34: Next-Gen ETL POCs (combine memory)

Goal: develop the skill for getting to higher HDFS read performance.
Stage the data; measure the effects of schema/representation on performance (Dremel nested columns).
Data with Avro schemas and partitioning strategies: partition by timestamp, partition by a custom rowkey, or partition by schema definitions.
Measure the effect of the data schema on M/R and non-M/R implementations. A conversion or staging process for the data.
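Two of the partitioning strategies above can be sketched as path builders (the paths, the bucket count, and the toy hash are assumptions, not from the deck):

```python
import datetime

def partition_by_timestamp(ts):
    """HDFS-style path partitioned by the event's UTC date."""
    d = datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc)
    return f"/data/events/year={d.year}/month={d.month:02d}/day={d.day:02d}"

def hash_bucket(key, buckets):
    # Stable toy hash; real code would use a proper hash function.
    return sum(ord(c) for c in key) % buckets

def partition_by_rowkey(account_id, buckets=16):
    """Custom rowkey partitioning: hash the key into a fixed bucket count
    so hot keys spread evenly instead of piling into one partition."""
    return f"/data/events/bucket={hash_bucket(account_id, buckets):02d}"

print(partition_by_timestamp(1389081600))  # /data/events/year=2014/month=01/day=07
print(partition_by_rowkey("acct-42"))
```

Timestamp partitions make time-range scans cheap but create hotspots on the newest partition; rowkey buckets spread the write load but force range queries to touch every bucket, which is exactly the trade-off the POC should measure.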

Page 35: Next-Gen ETL

Addition of new components into Hadoop: CDH will come with Spark/Shark, and CDH already comes with Impala; HDP status is unknown for now (clear by EOM).

Page 36: Hadoop Distributions

Create a Capital One distribution. Why? Production is 3-4x the amount of work compared to dev; make sure you are ready for production before development is completed.
Refactor the scripts (bin and sbin) to give admins and users access to the admin/user scripts.
Customize and add components (the scaling layer). Puppet/Chef scripts for cluster deployment. Real-time monitoring (not provided in CDH/HDP), with hotspot detection for long-running jobs.
Being ready for cluster deployment allows the integration of functional requirements, such as security, into functional Groovy iTests.

Page 37: Possible Hadoop Distro POCs

Beginner POCs. Goal: a smooth handoff from dev to production.
Build Apache Bigtop (a reference doc will be needed).
Add the components you are currently using that are not in the distro (e.g., MongoDB + HBase for the schema).
Add integration tests and Puppet recipes.
Learn how to apply patches and how to customize for simple modifications and production stability.

Page 38: POC Framework

Goal: contribute open-source code. Start with the documentation and the software processes first: DocBook; a Jenkins server.
http://apachebigtop.pbworks.com/w/file/49310946/Apache%20Bigtop%20%20Jenkins.docx

Page 39: POC Framework/Roadmap

Track the JIRAs! Multitenancy needs a test plan: https://issues.apache.org/jira/browse/BIGTOP-1136
A development environment using Vagrant instead of EC2, which is cheaper and easier to administer: https://issues.apache.org/jira/browse/BIGTOP-1171
Create a Capital One Hadoop user guide: https://issues.apache.org/jira/browse/BIGTOP-1157
Create a functional spec for the missing components. Include test cases for security, multiuser access, and the minimum performance needed to meet SLAs.

Page 40: Scaling

Astyanax on Cassandra (Netflix). Small companies don't have 300 users accessing HDFS; manage the clients.
Some examples: scaling involves multiple components above the cluster hardware and the Hadoop daemons. This is NOT just running CDH or HDP using Ambari or Cloudera Manager.
It provides SLAs and lets ad hoc high-priority jobs through.

Page 41: Capital One Will Need a Custom Component

Either for security, or scaling, or … even to separate batch analytics queries from ad hoc queries.
Break it down into two bigger steps: a cluster testing tool for scaling/security, then develop a multiuser client layer using that tool and measure performance and the modified use cases.

Page 42: Building a Scaling Layer

You need a tool for testing, and you need to know how to use ZooKeeper at a minimum (leader election and countdown latch); this is impossible to figure out via web searches.
Most people do their POCs incorrectly. The worst mistake is running multiple threads on a single server. The second-worst mistake is using HBase's PerformanceEvaluation.java as a reference; PE.java is not cluster-aware.
Test cluster throughput for cluster scaling.
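A sketch of the cluster-aware measurement being argued for here: per-client reports are combined over one shared wall-clock window, instead of summing rates that were measured over different windows (the report fields are illustrative):

```python
def cluster_throughput(client_reports):
    """Aggregate throughput the cluster-aware way: total ops divided by the
    wall-clock window covering all clients, not the sum of per-client rates
    measured over different windows (the single-server-threads mistake)."""
    total_ops = sum(r["ops"] for r in client_reports)
    window = max(r["end"] for r in client_reports) - min(r["start"] for r in client_reports)
    return total_ops / window

# In a real harness each report would come from a separate machine,
# with start/end stamps taken against a coordinated clock.
reports = [
    {"start": 0.0, "end": 10.0, "ops": 5000},
    {"start": 1.0, "end": 11.0, "ops": 4000},
]
print(cluster_throughput(reports))  # 9000 ops over an 11 s window
```

Using the outer window penalizes stragglers and skewed starts, which is the behavior a real multi-machine client layer would exhibit and a threads-on-one-box test hides.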

Page 43: Analytics

Review and demo (weblog targeting). Concepts to agree on first: modeling and targeting.
http://www.slideshare.net/DougChang1/demographics-andweblogtargeting-10757778

Page 44: Analytics

Analytics (WibiData): schema, model, targeting; using a database vs. HBase.

Page 45: Analytics as a Function of Latency

Analytics is a function of latency (Netflix).

Page 46: Analytics

Model iteration performance is key: O(n^2) in the number of users; a Random Forest took 6-8 hours on a MacBook.
Sponsorship from EMC: a free 1k-node cluster plus GemFire for faster model building.
Hadoop (HDFS + M/R) is for certain specific use cases: batch analysis and log analysis, e.g., click-log analysis from large disk files.
ETL: M/R ETL only, much, much slower than any commercial system.

Page 47: Analytics 2014+

Visualizations: a Tableau/Datameer POC? Data + queries?
Deep learning case studies: Google Now >> Apple Siri; deep learning models replaced Gaussian mixture models.
Background refresher on speech recognition, with deep learning as a replacement for GMMs in the acoustic model: http://www.stanford.edu/class/cs224s/2006/
POCs can be done here for innovation; this requires outside consultant assistance.

Page 48: Deliverables Available Today

Start the Capital One distribution:
Build instructions
Functional specification for the Capital One Hadoop distro

POCs planned (need approval before starting):
Data staging: functional specification for the Capital One data staging POC; functional specification for the data staging API
ETL performance POC: functional specification; the top 100 queries from Ab Initio

Page 49: Capital One Block Diagram

[Diagram: HDFS with HCatalog/schema at the base; a REST batch-ETL M/R path and a REST ad hoc M/R path; Streams/Storm for real-time analytics and real-time ETL (no M/R); all fronted by a scaling layer.]

Page 50: POCs

Data ingestion: a POC with Apache Kafka; a test fixture is needed. The current abilities may not be there.
Hadoop ETL: schema definition.
Write/read query performance of the top 10/100 Ab Initio queries. How close is the current ETL to Ab Initio? Assume this answer exists.
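One way to read the "test fixture needed" line: an in-memory broker stand-in (not the real Kafka API; all names here are invented) that lets the ingestion path be exercised before a cluster exists:

```python
import queue

class FakeBroker:
    """In-memory stand-in for a Kafka-like broker, used as a test fixture
    so the ingestion code path can be exercised without a real cluster."""
    def __init__(self):
        self.topics = {}

    def produce(self, topic, message):
        self.topics.setdefault(topic, queue.Queue()).put(message)

    def consume(self, topic):
        q = self.topics.get(topic)
        return None if q is None or q.empty() else q.get()

broker = FakeBroker()
for i in range(3):
    broker.produce("clickstream", f"event-{i}")

consumed = []
while (msg := broker.consume("clickstream")) is not None:
    consumed.append(msg)
print(consumed)  # ['event-0', 'event-1', 'event-2']
```

The fixture deliberately mirrors only produce/consume ordering per topic; partitions, offsets, and replication would be tested later against a real broker.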

Page 51: POCs

Hadoop dev to production: build the Capital One distribution with Apache Bigtop, replicating the CDH configuration with HDFS/Pig/Hive/Oozie/Flume/Spark. Leave out Impala; it is not currently in Bigtop.
Scaling: POC the intermediate layer.