Scalable Hadoop in the Cloud
Johan Gustavsson
WHO AM I?
• Johan Gustavsson(ヨハン)
• Some contributions in Hadoop, Hive…
• Software Engineer at Treasure Data, Inc.
• Hadoop Team
HIGH LEVEL OVERVIEW
CONTENT
• Basic architecture:
• Storage
• Replacing a Hadoop Cluster
• Basic job flow with Plazma
• Overview Basic job execution
• Isolation (JobClient)
• Isolation (In cluster)
• Architecture changes (PTD):
• What is PTD?
• Multiple Hadoop Versions
• Multiple Version Job submission
BASIC ARCHITECTURE
STORAGE (PLAZMA)
• Time indexed database (hourly partitioned)
• Metadata in Postgres
• Data in mpc1 files on S3 (a columnar file format with schema on read)
• A write first creates the files and uploads them to S3
• It then commits by writing the metadata to Postgres
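The write path above (files first, metadata commit second) can be sketched with in-memory stand-ins. This is only an illustration: the `ToyStore` class, the key layout, the partition string format, and the `.mpc1` suffix in the key are assumptions, not Plazma's actual implementation.

```python
from datetime import datetime, timezone


def hourly_partition(ts: datetime) -> str:
    """Return an hourly partition label for a timezone-aware timestamp (UTC)."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d-%H")


class ToyStore:
    """Hypothetical stand-in: a dict plays S3, a list plays the Postgres metadata."""

    def __init__(self):
        self.objects = {}   # object key -> bytes (the "S3" side)
        self.metadata = []  # (table, partition, key) rows (the "Postgres" side)

    def write(self, table, ts, payload, fail_before_commit=False):
        partition = hourly_partition(ts)
        key = f"{table}/{partition}/{len(self.objects)}.mpc1"
        # Step 1: upload the file. Nothing references it yet.
        self.objects[key] = payload
        if fail_before_commit:
            raise RuntimeError("crashed before commit")
        # Step 2: commit by recording the metadata row.
        self.metadata.append((table, partition, key))
        return key

    def visible_keys(self, table, partition):
        """Readers only see files that have a committed metadata row."""
        return [k for (t, p, k) in self.metadata if t == table and p == partition]
```

In this toy model a crash between the two steps leaves an orphaned object but never a visible half-written partition, which is the property the two-step order is meant to give.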
REPLACING A HADOOP CLUSTER
BASIC JOB FLOW WITH PLAZMA
• Job one runs reading from Plazma
• Shuffle uses local disk, the same as always
• Output of the first job is written to HDFS
• Second job reads from HDFS
• The final job in the DAG writes to HDFS, then the data is downloaded to a result bucket on S3
• In the case of INSERT, data is written directly to a table in Plazma
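The source and sink of each job in the flow above can be summarized in a small sketch. It assumes a simple linear chain of jobs (a real DAG can branch), and the function name and the string labels are hypothetical.

```python
def plan_io(num_jobs, is_insert=False):
    """Return a (source, sink) pair for each job in a linear chain of jobs."""
    if num_jobs < 1:
        raise ValueError("need at least one job")
    plan = []
    for i in range(num_jobs):
        # Only the first job reads from Plazma; later jobs read intermediate data.
        source = "plazma" if i == 0 else "hdfs"
        is_last = i == num_jobs - 1
        if is_last and is_insert:
            sink = "plazma"                  # INSERT: straight into a Plazma table
        elif is_last:
            sink = "hdfs+s3_result_bucket"   # written to HDFS, then copied to S3
        else:
            sink = "hdfs"                    # intermediate output stays on HDFS
        plan.append((source, sink))
    return plan
```

For example, `plan_io(3)` describes a three-job query whose result ends up in the S3 result bucket, while `plan_io(1, is_insert=True)` describes a single job that reads from and writes to Plazma.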
OVERVIEW BASIC JOB EXECUTION
ISOLATION (JOBCLIENT)
• Worker builds command line options with java properties
• Runs QueryRunner as a subprocess
• UDFs used in the query are enabled by executing CREATE TEMPORARY FUNCTION
• Databases/tables from PlazmaDB are added to the Metastore by executing CREATE DATABASE/TABLE
• The good:
• High level of isolation
• Out-of-memory (OOM) failures stay inside one job's process, so jobs are protected from each other
• The bad:
• Job setup costs are a bit high
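The JobClient isolation described in this section (Java properties on the command line, QueryRunner as a subprocess, per-job setup statements) might look roughly like the sketch below. The function names, the classpath, and the property keys are hypothetical; only the `-D` option style and the Hive statement forms are standard.

```python
import subprocess


def build_command(java_props, main_class="QueryRunner", java_bin="java", classpath="."):
    """Build a `java -Dkey=value ... -cp <classpath> <main class>` argument list."""
    cmd = [java_bin]
    for key in sorted(java_props):
        cmd.append(f"-D{key}={java_props[key]}")
    cmd += ["-cp", classpath, main_class]
    return cmd


def setup_statements(udfs, databases):
    """Per-job setup: register the UDFs and databases the query actually uses."""
    stmts = [f"CREATE TEMPORARY FUNCTION {name} AS '{cls}'" for name, cls in udfs]
    stmts += [f"CREATE DATABASE IF NOT EXISTS {db}" for db in databases]
    return stmts


def run_isolated(cmd, timeout=None):
    """Run one job in its own process; a crash there does not take down the caller."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.returncode
```

Because every job gets a fresh process and a fresh set of setup statements, this design trades startup cost for isolation, which matches the good and bad points listed above.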
ISOLATION (IN CLUSTER)
• Using Hadoop resource pools:
• 1 account 1 resource pool (not counting sub-pools)
• Max and min running containers are set based on the price plan
• Currently 6711 pools in production
• The good part:
• Relatively low cost to guarantee minimum resources
• Jobs can still burst to max if resources are free in the cluster
• The bad parts:
• Too many pools make cluster separation necessary
• The ResourceManager tends to get slow with too many pools
• Some unsafe UDFs need to be disabled
• java_method()
• reflect()
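The "guaranteed minimum, burst to max when the cluster has free resources" behaviour of the pools can be illustrated with a toy allocator. This is not the Hadoop scheduler's algorithm; the `Pool` fields and the round-robin burst are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    min_containers: int  # guaranteed by the price plan
    max_containers: int  # burst ceiling from the price plan
    demand: int          # containers the account's jobs currently want


def allocate(pools, cluster_capacity):
    """Give every pool its minimum first, then share leftovers up to each max."""
    alloc = {p.name: min(p.min_containers, p.demand) for p in pools}
    free = cluster_capacity - sum(alloc.values())
    if free < 0:
        raise ValueError("minimum guarantees exceed cluster capacity")
    progress = True
    while free > 0 and progress:
        progress = False
        for p in pools:
            ceiling = min(p.max_containers, p.demand)
            if free > 0 and alloc[p.name] < ceiling:
                alloc[p.name] += 1  # hand out one spare container at a time
                free -= 1
                progress = True
    return alloc
```

With one pool per account, the cost of the guarantee is only the sum of the minimums, while idle capacity remains available for bursts.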
ARCHITECTURE CHANGES (PTD)
WHAT IS PTD?
• Patchset Treasure Data
• Name first coined by these two: @frsyuki @tagomoris
• Still in development
• Original plan
• Base all internal Hadoop components on the latest community edition
• Simplify releases to stay on as current a version as possible
• What it’s turning into
• A complete overhaul of most things related to Hadoop
MULTIPLE HADOOP VERSIONS
• The default version is changed by changing settings in a data bag
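A data-bag-driven default could work along these lines. The JSON keys (`default_version`, `versions`), the version labels, and the install paths are all hypothetical; the talk does not show the actual data bag layout.

```python
import json

# Hypothetical data bag item; only the "id" key is standard for data bag items.
DATA_BAG = """
{
  "id": "hadoop",
  "default_version": "v2",
  "versions": {"v1": "/opt/hadoop-v1", "v2": "/opt/hadoop-v2"}
}
"""


def resolve_hadoop_home(data_bag_json, requested=None):
    """Pick the requested version, or fall back to the data bag's default."""
    bag = json.loads(data_bag_json)
    version = requested or bag["default_version"]
    if version not in bag["versions"]:
        raise ValueError(f"unknown Hadoop version: {version}")
    return version, bag["versions"][version]
```

Switching the default is then a one-line change to `default_version`, while jobs pinned to another installed version keep resolving to it.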
MULTIPLE VERSION JOB SUBMISSION
ELEPHANT SERVER
• Provides a REST API for job submission and monitoring
• All Hive/Pig related code separated from the generic worker
• A distributed in-memory queue manages job progress
• Built to support multiple versions of Hadoop/Hive/Pig…
• This could lead to the following long-term solution
JOB PRESERVING RESTARTS
• The worker polls job status from the local server
• A new instance of the server is started; it joins the Hazelcast cluster and repeatedly tries to start its REST server
• The old instance goes into shutdown mode (not starting new jobs but keeping current ones running)
• Newly submitted jobs are popped and managed by the new instance
• Once all jobs running on the old instance have finished, one way or another, it shuts down
• Once the port is free, the new instance starts its REST API
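The port handoff in the restart sequence above can be sketched with plain sockets. The class and function names are hypothetical, Hazelcast is left out entirely, and the old instance is reduced to a set of running job IDs plus the listening socket it holds.

```python
import socket
import time


def bind_when_free(port, host="127.0.0.1", retry_interval=0.05, timeout=5.0):
    """New instance: keep retrying until the REST port can be bound."""
    deadline = time.monotonic() + timeout
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen()
            return sock
        except OSError:
            sock.close()
            if time.monotonic() >= deadline:
                raise TimeoutError(f"port {port} was never released")
            time.sleep(retry_interval)


class OldInstance:
    """Old instance: stops taking new jobs, exits once running jobs are done."""

    def __init__(self, sock, running_jobs):
        self.sock = sock
        self.running = set(running_jobs)
        self.accepting = True

    def enter_shutdown_mode(self):
        self.accepting = False

    def job_finished(self, job_id):
        self.running.discard(job_id)
        if not self.accepting and not self.running:
            self.sock.close()  # releasing the port lets the new instance bind
```

The retry loop means no coordination message is needed between the two instances: the port itself acts as the signal that the old instance has drained.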
https://www.treasuredata.com/