TRANSCRIPT
Capital One
Hadoop Intro: History
ETL/Analytics Practices at LinkedIn/Netflix/Yahoo
Next Gen ETL 2014+
Scaling Layers
Hadoop Distributions
Analytics
1/7/2014
Hadoop/HBase
Original requirements. GFS: storing internet HTML pages on disk for analytics later. BigTable (2002): book pages had metadata; the requirement was to return book pages to the user, with no joins (memory was scarce in 2002; different now).
Latency determines requirements (analytics/Netflix later): semi-real-time. Schema for book pages: where to store the metadata? In BigTable.
My role: I'm not going to give you slides with pictures; everything presented has code behind it, with documentation.
Big data: >50% failure rate.
After POCs, very few projects enter production. Why? Work habits for distributed computing: you have to write distributed computing components, and J2EE idioms don't work.
Projects fail because of performance/administration in production.
E.g., performance is not an issue supporting the top 100 Ab Initio queries in Hadoop; 130k queries (or perhaps 10% of them) will be an issue.
Measuring performance in POCs: doing it wrong means they can't build the components.
[Diagram, "wrong": a single server/thread driving the NN and DN/RS nodes directly]
Performance measurement: leader election, countdown latch, test failure/handoff with Chaos Monkey. ZooKeeper + Jetty.
[Diagram: ZooKeeper coordinating multiple load-generating servers against the DN/RS nodes]
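The countdown-latch coordination the harness needs can be sketched locally, with `threading.Barrier` standing in for the ZooKeeper latch recipe. This only illustrates the pattern: a real harness coordinates separate servers through ZooKeeper znodes, not threads in one process, and `run_clients` and the dummy-query sleep are invented for illustration.

```python
import random
import threading
import time

def run_clients(n_clients):
    """Simulate n load-generating clients that all start a measurement
    run at the same instant -- the job a ZooKeeper countdown latch does
    for clients spread across separate servers."""
    barrier = threading.Barrier(n_clients)  # stands in for the ZK latch
    latencies = []
    lock = threading.Lock()

    def client(cid):
        barrier.wait()  # every client blocks here until all have arrived
        start = time.perf_counter()
        time.sleep(random.uniform(0.01, 0.02))  # dummy query round trip
        with lock:
            latencies.append(time.perf_counter() - start)

    threads = [threading.Thread(target=client, args=(i,)) for i in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return latencies
```

Without the barrier, early-starting clients finish before late ones begin and the measured load never hits the cluster all at once, which is exactly the POC mistake described above.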
Hive at LinkedIn (bottom left). All 3 similar.
LinkedIn Simple Abstractions
Teradata with Hadoop. Multiple clusters: Prod/Dev/Research (POC?). Hive: ad hoc and small ETL (lower left-hand corner). Pig/DataFu + enhancements for production ETL. Multiple data stages in the green box (POC: Ab Initio data staging, REST API for staging). Workflow POC: Oozie + Pig + Hive; add a Web UI. Data Staging POC: CDK as an example.
POC Coding Style
High-level directory with Maven subprojects; a simple archetype is ok.
Define data repositories with Avro schemas. Start with a simple file repository with files copied from the Ab Initio file system; no need to spend time reverse engineering, just copy.
Add pig and hive directories to cdk-examples
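The Avro-schema repository idea can be sketched without the Avro library itself. The `Transaction` record and its field names are invented stand-ins for a staged Ab Initio feed, and `conforms()` is only a minimal structural check, not real Avro validation:

```python
import json

# Hypothetical Avro schema for one staged feed; the record and field
# names are invented for illustration.
TXN_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Transaction",
  "fields": [
    {"name": "account_id", "type": "string"},
    {"name": "amount",     "type": "double"},
    {"name": "ts",         "type": "long"}
  ]
}
""")

# Map Avro primitive type names to the Python types we accept.
AVRO_TO_PY = {"string": str, "double": float, "long": int}

def conforms(record, schema):
    """Check that every declared field is present with a compatible type
    (a stand-in for real Avro schema validation)."""
    for field in schema["fields"]:
        if field["name"] not in record:
            return False
        if not isinstance(record[field["name"]], AVRO_TO_PY[field["type"]]):
            return False
    return True
```

Pinning the schema down first, even this loosely, is what lets the Pig and Hive directories read the same staged files later.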
POC Simple extensions
Define a webserver in the CDK and create a REST API. Jersey/.../DI if you want more advanced coding styles.
The webserver graphs performance of Hive/Pig/ETL metrics, with JVM metrics and by sending dummy queries in.
Start Nagios/Ganglia monitoring and Puppet deployment of the CDK as learning for larger scale.
Integrate the CDK into Bigtop for Capital One distribution practice.
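The metrics REST endpoint can be sketched with only the standard library; a JVM version would use Jersey as the slide suggests. The metric names and the `/metrics` path here are invented for illustration:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric names; real probes (dummy queries, JVM stats)
# would update this dict.
METRICS = {"hive_queries": 0, "avg_latency_ms": 0.0}

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = json.dumps(METRICS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the sketch quiet

def serve(port=0):
    """Start the endpoint on a background thread; port=0 picks a free port."""
    srv = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

A Web UI (or Nagios/Ganglia check) then just polls `/metrics` and graphs the JSON, which is the point of putting a REST interface on the CDK first.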
Netflix, Block Diagram
Simple Netflix Abstractions
http://www.slideshare.net/adrianco/netflix-architecture-tutorial-at-gluecon
Automated develop-and-deploy software process on APIs: Perforce/Ivy/Jenkins. Hadoop POC: GitHub, Jenkins, deploy to a demo webpage. No code sitting in an Eclipse project.
Netflix Automated App Dev/Deploy
A REST specification makes Web UIs easier. C1 ETL REST I/F.
Netflix Instance config
Do the same for Capital One as an exercise to help with deployment: Apache Bigtop; define 1) an NN instance, 2) a DN/RS instance; customize the scripts per instance.
Netflix Security
Default: turn off iptables/SELinux. Define Capital One POC testing? Start with auditing requirements on a test cluster (w/Aravind).
Netflix Metrics
Send dummy queries through to measure latency
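The dummy-query idea is just a probe loop with a latency summary. Here `run_query` is whatever callable issues the canned query (JDBC, REST); the function name and the median/p95 summary are illustrative choices, not any particular tool's API:

```python
import statistics
import time

def probe_latency(run_query, n=20):
    """Send n dummy queries through the stack and summarize the
    round-trip latency in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        # nearest-rank p95 over the sorted samples
        "p95_ms": samples[int(0.95 * (len(samples) - 1))],
    }
```

Running this continuously against Hive/Pig is what turns "the cluster feels slow" into a graphable number.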
Netflix scaling layer; do the simpler things first: a JDBC-managed connection pool, Pig/Hive.
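The connection-pool half of that scaling layer can be sketched as a blocking pool: at most `max_size` live connections, and callers queue for a free one instead of opening their own. `connect` stands in for whatever callable opens a JDBC/Hive session; the class is invented for illustration:

```python
import queue

class ConnectionPool:
    """Minimal blocking connection pool: the 'manage the clients' layer
    in front of Hive/Pig, so N users share max_size connections."""

    def __init__(self, connect, max_size=4):
        self._pool = queue.Queue(max_size)
        for _ in range(max_size):
            self._pool.put(connect())  # pre-open every connection

    def acquire(self, timeout=None):
        """Block until a connection is free (raises queue.Empty on timeout)."""
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        """Return a connection for the next caller to reuse."""
        self._pool.put(conn)
```

Capping and reusing connections is what keeps 300 ad hoc users from each opening their own session against the cluster.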
Yahoo Block Diagram, Pig, Hive, Spark, Storm
Yahoo Spark (cont)
Yahoo Next Gen ETL
LinkedIn/Yahoo/Netflix References
Reference: LinkedIn: Muhammed Islam http://www.slideshare.net/mislam77/hive-at-linkedin-meetup-july-2013-at-linkedin
Yahoo: Chris Drome, for outside business users. Very similar to the slide before.
Netflix: Jeff Magnusson. Hive used for ad hoc queries and lightweight ETL (on the web also).
ETL - Pig
Original Pig paper:http://infolab.stanford.edu/~usriv/papers/pig-latin.pdf
ETL language based on relational algebra (reorder/set operations) vs. SQL queries. Each step is a M/R ETL stage.
No transactional consistency or indexes (other projects have this).
Nested data model vs. the flat SQL E/R model. Why? Faster scan performance; replaces joins (e.g. MongoDB).
Requires developing UDFs. LinkedIn: DataFu. Netflix: Lipstick for debugging Pig DAGs. You will need some debugging tool; better than Spill.
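The nested-vs-flat point can be shown with plain records. In the flat E/R layout, fetching a member's skills is a join/scan over a second table; in the nested layout the child rows live inside the parent record, so one sequential read replaces the join. The member/skill data is invented for illustration:

```python
# Flat E/R layout: two tables related by member_id.
members = [{"member_id": 1, "name": "ada"}]
skills = [{"member_id": 1, "skill": "pig"},
          {"member_id": 1, "skill": "hive"}]

def skills_flat(mid):
    """Requires scanning the skills table: the join."""
    return [s["skill"] for s in skills if s["member_id"] == mid]

# Nested layout: child rows stored inside the parent record.
members_nested = [{"member_id": 1, "name": "ada",
                   "skills": ["pig", "hive"]}]

def skills_nested(mid):
    """One read of the parent record; no second table touched."""
    for m in members_nested:
        if m["member_id"] == mid:
            return m["skills"]
```

The trade-off is the usual one: the nested form scans fast and avoids the join, at the cost of duplicating data when children belong to many parents.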
ETL - Pig (cont)
M/R ETL points: data is distributed on several nodes, with a merge sort of the results at the end.
Be careful sending data across the network; it doesn't scale with more users (network limitation). Google: custom network switch, 1k+ ports; custom TCP stack; modified OS.
Careful: streams scale, so do ETL with streams for real-time performance. Send results to a separate server; do not embed writes into stream POCs.
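The "data crosses the network at the shuffle" point can be made concrete with a toy word count. The three phases below are a local sketch of the M/R model, not Hadoop's API; the shuffle step is the one that moves data between nodes in a real job:

```python
from collections import defaultdict

def map_phase(lines):
    """Each mapper emits (key, 1) pairs from its local block."""
    return [(w, 1) for line in lines for w in line.split()]

def shuffle(pairs, n_reducers=2):
    """The network-heavy step: every pair is routed to a reducer by key
    hash, then merge-sorted per reducer."""
    partitions = defaultdict(list)
    for k, v in pairs:
        partitions[hash(k) % n_reducers].append((k, v))
    return {r: sorted(p) for r, p in partitions.items()}

def reduce_phase(partitions):
    """Each reducer sums the counts for the keys routed to it."""
    counts = {}
    for part in partitions.values():
        for k, v in part:
            counts[k] = counts.get(k, 0) + v
    return counts
```

Every pair emitted by `map_phase` transits the shuffle, which is why ETL scripts that emit less intermediate data (or combine locally) scale better on a shared network.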
Pig vs M/R
Pig Usage
Yahoo (http://www.linkedin.com/pub/chris-drome/2/a2/346): thousands of ETL jobs daily; Hive for a small user base external to Yahoo.
Netflix (http://www.linkedin.com/in/jmagnuss): thousands of jobs, at the analyst level. Open sourced Lipstick, a Pig UI debugging tool.
LinkedIn (http://www.slideshare.net/hadoopusergroup/pig-at-linkedin): thousands of jobs; open sourced the DataFu UDFs.
Pig POCs (~2009)
Possible Pig POCs: top XX queries; manually code up the Ab Initio queries. This was already completed 2012? Which queries? Add a JDBC-connection-style scaling layer to PigServer.java.
Out of scope for 4/30/14:
POC Tez on Pig: https://issues.apache.org/jira/browse/PIG-3446
Apache Pig Optimizer (MR->MR->MR goes to MRRR) by writing the optimizer in a YARN AM.
POC quality
Turn the POCs into Bigtop integration tests and get open source approval. Commit changes to verify quality and accountability
Hive 0.11
More difficult to configure; add a MySQL metastore. Moving to HCatalog so metadata is accessible by other Hadoop components. Access using WebHCat (in progress).
Hive Stinger using Tez, plus additional in-memory optimization.
No time spent on this yet; starting 1/2014 with Hortonworks. Last day 4/30/2014.
Hive 0.11
Hive 0.11 POCs: a user guide for Ab Initio programmers using Hive/Pig; test multitenancy features with Pig/HDFS; test JDK 1.7 features (Hadoop 2.x works with 1.7); HiveMetastore/MySQL/HCatalog/WebHCat; test cluster performance using benchpress. Next gen: 0.12-0.13; Spark/Shark HiveQL compatibility.
Next Gen ETL Frameworks for 2014+
Faster reads/scans without using HBase. 3 developments (WibiData): Dremel (Impala/Apache Drill), Spark/Shark, Hive/Tez.
Dremel paper review, "Interactive Analysis of Web-Scale Datasets": doesn't use M/R, for speed; 100x faster. Column schema: nested column-oriented storage, not rows; faster for some types of queries. Partition key (not in the paper).
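Why column-oriented storage is faster for some queries can be shown with a toy layout. This sketch leaves out Dremel's nesting and compression; the record fields are invented for illustration:

```python
# Row layout: each record stored whole; scanning one field still
# touches (and deserializes) every field of every record.
rows = [{"id": i, "url": "u%d" % i, "bytes": i * 10} for i in range(5)]

# Column layout: each field stored contiguously, so an aggregate over
# one column reads only that column's data.
columns = {
    "id":    [r["id"] for r in rows],
    "url":   [r["url"] for r in rows],
    "bytes": [r["bytes"] for r in rows],
}

def total_bytes_row_scan():
    return sum(r["bytes"] for r in rows)  # walks every whole record

def total_bytes_column_scan():
    return sum(columns["bytes"])  # reads one contiguous column
```

Both scans return the same answer; the column scan just reads a fraction of the data, which is where the "faster for some types of queries" claim comes from. Point lookups of whole records favor the row layout instead.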
Dremel schema/column performance; similar to Kiji without HBase? Sqoop objects.
Impala/Drill POC
Next Gen ETL
Shark/Spark: distributed in-memory RDDs, for analysis and ETL.
Hive/Tez
Next Gen ETL POCs (combine mem)
Goal: develop the skill for getting higher HDFS read performance.
Stage data: schema/representation effects on performance (Dremel nested columns).
Data with Avro schemas and partition strategies: partition by timestamp, partition by a custom rowkey, partition by schema definitions.
Measure the effect of the data schema on M/R and non-M/R implementations. Conversion or staging process for the data.
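Two of the partition strategies above can be sketched as key functions. The directory-name formats (`dt=...`, `bucket=...`) and the account-id rowkey are invented for illustration:

```python
import datetime

def partition_by_timestamp(ts_epoch):
    """Daily partitions: good for time-range scans, but new writes all
    land in 'today', which can hotspot one partition."""
    d = datetime.datetime.fromtimestamp(ts_epoch, tz=datetime.timezone.utc)
    return d.strftime("dt=%Y-%m-%d")

def partition_by_rowkey(account_id, n_buckets=16):
    """Hashed custom rowkey: spreads writes evenly across buckets, at
    the cost of losing time-range locality."""
    return "bucket=%02d" % (hash(account_id) % n_buckets)
```

Measuring the same queries against both layouts is exactly the "effect of data schema on M/R and non-M/R implementations" experiment the POC describes.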
Next Gen ETL
Addition of new components into Hadoop: CDH will come with Spark/Shark; CDH comes with Impala; HDP status unknown for now (clear EOM).
Hadoop Distributions
Create a Capital One distribution. Why? Production is 3-4x the amount of work compared to dev; make sure you are ready for production before development is completed.
Refactoring of scripts (bin and sbin) to give admins and users access to admin/user scripts.
Customize and add components (scaling layer); Puppet/Chef scripts for cluster deployment; real-time monitoring (not provided in CDH/HDP), hotspot detection for long-running jobs.
Being ready for cluster deployment allows integration of functional requirements like security into functional Groovy iTests.
Possible Hadoop Distro POCs
Beginner POCs. Goal: a smooth handoff from dev to production.
Build Apache Bigtop (will need a reference doc). Add components you are currently using that are not in the distro (e.g. MongoDB + HBase for schema). Add integration tests; add Puppet recipes. Learn how to apply patches and how to customize for simple modifications and production stability.
POC framework
Goal: contribute open source code. Start with the documentation and software processes first: DocBook; Jenkins server;
http://apachebigtop.pbworks.com/w/file/49310946/Apache%20Bigtop%20%20Jenkins.docx
POC Framework/Roadmap
Track the JIRAs! Multitenancy needs a test plan: https://issues.apache.org/jira/browse/BIGTOP-1136
Development environment using Vagrant instead of EC2 (cheaper, easier to administer): https://issues.apache.org/jira/browse/BIGTOP-1171
Create a Capital One Hadoop* user guide: https://issues.apache.org/jira/browse/BIGTOP-1157
Create a functional spec for the missing components. Include test cases for security, multiuser access, and the minimum performance to meet SLAs.
Scaling
Astyanax on Cassandra (Netflix). Small companies don't have 300 users accessing HDFS: manage the clients.
Some examples. Scaling involves multiple components above the cluster hardware and the Hadoop daemons. This is NOT running CDH or HDP using Ambari or Cloudera Manager.
Gives SLA and ad hoc high-priority jobs.
Capital One will need a custom component, either for security or scaling or … even to separate batch analytics queries from ad hoc queries.
Break it down into 2 bigger steps: a cluster testing tool for scaling/security; develop a multiuser client layer using the above, and measure performance and modified use cases.
Building a scaling layer
Need a tool for testing. Need to know how to use ZooKeeper at a minimum (impossible to figure out via web searches): leader election and countdown latch.
Most people do their POCs incorrectly. The worst mistake is multiple threads on a single server; the second worst mistake is using HBase PerformanceEvaluation.java as a reference. PE.java is not cluster aware.
Test cluster throughput for cluster scaling.
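The "cluster aware" point reduces to two small pieces: spread the load generators across hosts instead of one box, and compute cluster throughput from per-host results over a shared window. Both functions and the result format are invented for illustration:

```python
def assign_clients(client_ids, hosts):
    """Place load-generating clients across hosts round-robin, so no
    single server (or single JVM's threads) generates all the load --
    the mistake made by running PE.java from one box."""
    placement = {h: [] for h in hosts}
    for i, cid in enumerate(client_ids):
        placement[hosts[i % len(hosts)]].append(cid)
    return placement

def cluster_throughput(per_host_results):
    """Cluster throughput = total ops across hosts over the shared
    measurement window, not one client's rate multiplied up."""
    total_ops = sum(r["ops"] for r in per_host_results.values())
    window = max(r["seconds"] for r in per_host_results.values())
    return total_ops / window
```

Single-server numbers flatten at the NIC or thread-scheduler limit of that one box; aggregating per-host results is what makes the measurement reflect the cluster.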
Analytics
Review and demo (weblog targeting). Concepts to agree on first: modeling and targeting.
http://www.slideshare.net/DougChang1/demographics-andweblogtargeting-10757778
Analytics (WibiData): schema, model, targeting; use a DB vs HBase.
Analytics as f(latency): Netflix.
Analytics
Model iteration performance is key. O(n^2) in the # of users; Random Forest takes 6-8h on a MacBook.
Sponsorship from EMC: a free 1k-node cluster + GemFire for faster model building.
Hadoop (HDFS + M/R) for certain specific use cases: batch analysis, log analysis; click-log analysis from large disk files. ETL: M/R ETL only, much much slower than any commercial system.
Analytics 2014+
Visualizations: Tableau/Datameer POC? Data + queries?
Deep learning case studies: Google Now >> Apple Siri; deep learning models replaced Gaussian MMs.
Background refresher, speech recognition: deep learning as a replacement for GMMs in the acoustic model, http://www.stanford.edu/class/cs224s/2006/
Can do POCs here for innovation. Requires outside consultant assistance.
Deliverables available today
Start the Capital One distribution: build instructions; functional specification for the Capital One Hadoop distro.
POCs planned, need approval before starting
Data staging: functional specification for the Capital One data staging POC; functional specification for the data staging API.
ETL performance POC: functional specification; top 100 queries from Ab Initio.
Capital One Block Diagram
[Diagram: HDFS at the base; REST: batch ETL M/R; REST: ad hoc M/R; HCatalog/schema; Streams/Storm real-time analytics; real-time ETL (no M/R); scaling layer on top]
POCs
Data ingestion: POC with Apache Kafka; a test fixture is needed. Current abilities may not be there.
Hadoop ETL: schema definition.
Write/read query performance of the top 10/100 Ab Initio queries. How close is the current ETL to Ab Initio? Assume this answer exists.
POCs
Hadoop dev->production: building the Capital One distribution with Apache Bigtop; replicate the CDH configuration with HDFS/Pig/Hive/Oozie/Flume/Spark. Leave out Impala; it is not currently in Bigtop.
Scaling: POC the intermediate layer.