Scalable Hadoop in the Cloud

Johan Gustavsson

WHO AM I?

• Johan Gustavsson（ヨハン）

• Some contributions to Hadoop, Hive…

• Software Engineer at Treasure Data, Inc.

• Hadoop Team

HIGH LEVEL OVERVIEW

CONTENT

• Basic architecture:

• Storage

• Replacing a Hadoop Cluster

• Basic job flow with Plazma

• Overview of basic job execution

• Isolation (JobClient)

• Isolation (In cluster)

• Architecture changes (PTD):

• What is PTD?

• Multiple Hadoop Versions

• Multiple Version Job submission

BASIC ARCHITECTURE

STORAGE (PLAZMA)

• Time indexed database (hourly partitioned)
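
As an illustration of hourly partitioning, here is a minimal sketch of how a timestamp could be mapped to its hourly partition, and how a time range could be turned into the list of partitions to scan. The function names are hypothetical and the use of UTC is an assumption; the slides do not show Plazma's actual implementation.

```python
from datetime import datetime, timedelta, timezone


def hourly_partition(ts):
    """Truncate a timezone-aware timestamp to the start of its hour (UTC assumed)."""
    return ts.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)


def partitions_for_range(start, end):
    """Return the hourly partitions overlapping the half-open range [start, end)."""
    current = hourly_partition(start)
    result = []
    while current < end:
        result.append(current)
        current += timedelta(hours=1)
    return result


if __name__ == "__main__":
    start = datetime(2017, 1, 1, 10, 15, tzinfo=timezone.utc)
    end = datetime(2017, 1, 1, 12, 0, tzinfo=timezone.utc)
    for partition in partitions_for_range(start, end):
        print(partition.isoformat())
```

With a time index like this, a query restricted to a time range only needs to touch the partitions returned for that range.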

• Metadata in Postgres

• Data in mpc1 files on S3 (columnar format file with schema on read)

• A write will create the files and write them to S3

• Then commit by writing metadata to Postgres
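
The write-then-commit order can be sketched as follows. This is a toy model, not Plazma's code: a dict stands in for S3, a list stands in for the Postgres metadata table, and the class and method names are invented for illustration. The point it shows is that a file becomes visible to readers only after its metadata row is committed.

```python
class InMemoryPlazma:
    """Toy model of the two-step write: upload the file first, commit metadata second."""

    def __init__(self):
        self.object_store = {}  # stand-in for S3: key -> bytes
        self.metadata = []      # stand-in for Postgres: committed rows

    def write(self, table, partition, payload):
        """Step 1: store the data file. It is not visible to readers yet."""
        key = f"{table}/{partition}/{len(self.object_store)}"
        self.object_store[key] = payload
        return key

    def commit(self, table, partition, key):
        """Step 2: record the file in the metadata, making it visible."""
        self.metadata.append({"table": table, "partition": partition, "key": key})

    def visible_keys(self, table):
        """Readers resolve files through the metadata only."""
        return [row["key"] for row in self.metadata if row["table"] == table]


if __name__ == "__main__":
    store = InMemoryPlazma()
    key = store.write("access_log", "2017-01-01T10", b"columnar bytes")
    print(store.visible_keys("access_log"))  # uploaded but not committed
    store.commit("access_log", "2017-01-01T10", key)
    print(store.visible_keys("access_log"))  # committed
```

In this model, a writer that fails between the two steps leaves an unreferenced file behind but never exposes partial data to readers.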

REPLACING A HADOOP CLUSTER

BASIC JOB FLOW WITH PLAZMA

• Job one runs reading from Plazma

• Shuffle uses local disk same as always

• Output of the first job is written to HDFS

• Second job reads from HDFS

• Final job in the dag writes to HDFS, then the data is downloaded to a result bucket on S3

• In case of INSERT, data is written directly to a table in Plazma
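
The routing of stage output described above can be summarized in a small sketch. The function name and the returned labels are made up for illustration; only the routing rules come from the bullets.

```python
def output_target(stage_index, stage_count, is_insert):
    """Return where a stage of the job DAG writes its output.

    Intermediate stages write to HDFS. The final stage writes to HDFS and the
    result is then copied to a result bucket on S3, unless the query is an
    INSERT, in which case the data goes directly to a Plazma table.
    """
    is_final = stage_index == stage_count - 1
    if not is_final:
        return "hdfs"
    if is_insert:
        return "plazma"
    return "hdfs-then-s3-result-bucket"


if __name__ == "__main__":
    for index in range(3):
        print(index, output_target(index, 3, is_insert=False))
```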

OVERVIEW BASIC JOB EXECUTION

ISOLATION (JOBCLIENT)

• Worker builds command-line options with Java properties

• Runs QueryRunner as a subprocess

• UDFs used in the query are enabled by executing CREATE TEMPORARY FUNCTION

• Databases/tables from PlazmaDB are added to the Metastore by executing CREATE DATABASE/TABLE

• The good:

• High level of isolation

• An OOM only takes down the job's own subprocess, so jobs are protected from each other

• The bad:

• Job setup costs are a bit high
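
A minimal sketch of the job setup step, assuming the worker passes settings as `-D` Java system properties and prepares the Hive session with DDL statements. The class name `com.example.QueryRunner`, the classpath, and the property names are placeholders, not the real ones.

```python
def build_command(main_class, properties, classpath):
    """Build the argv for launching one query in its own JVM subprocess."""
    cmd = ["java", "-cp", classpath]
    for key in sorted(properties):
        # Each setting becomes a Java system property: -Dkey=value
        cmd.append(f"-D{key}={properties[key]}")
    cmd.append(main_class)
    return cmd


def setup_statements(udfs, databases):
    """Build the statements run before the query itself."""
    statements = [
        f"CREATE TEMPORARY FUNCTION {name} AS '{class_name}'"
        for name, class_name in udfs.items()
    ]
    statements += [f"CREATE DATABASE IF NOT EXISTS {db}" for db in databases]
    return statements


if __name__ == "__main__":
    cmd = build_command(
        "com.example.QueryRunner",              # hypothetical main class
        {"job.id": "123", "account.id": "42"},  # hypothetical properties
        "/opt/worker/lib/*",                    # hypothetical classpath
    )
    print(" ".join(cmd))
    for statement in setup_statements({"my_udf": "com.example.MyUdf"}, ["sample_db"]):
        print(statement)
```

The resulting argv list can be handed to `subprocess.run`, which gives each query its own JVM and heap. That is where both the isolation and the setup cost come from.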

ISOLATION (IN CLUSTER)

• Using Hadoop resource pools:

• One resource pool per account (not counting sub-pools)

• Max and min running containers are set based on the price plan

• Currently 6711 pools in production

• The good part:

• Relatively low cost to guarantee minimum resources

• Jobs can still burst to max if resources are free in the cluster
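
The min/max behaviour can be illustrated with a simplified allocation model. This is not the Hadoop scheduler's algorithm; it is a sketch that assumes the sum of all minimums fits in the cluster and hands out spare containers round-robin. The pool names and numbers are invented.

```python
def allocate(pools, capacity):
    """Allocate containers to pools.

    pools maps a pool name to (min_containers, max_containers, demand).
    Each pool first gets its guaranteed minimum (capped by its demand); the
    remaining containers are then handed out one at a time, round-robin, to
    pools that still have demand and are below their maximum.
    """
    alloc = {name: min(demand, lo) for name, (lo, hi, demand) in pools.items()}
    free = capacity - sum(alloc.values())
    progressed = True
    while free > 0 and progressed:
        progressed = False
        for name, (lo, hi, demand) in pools.items():
            if free > 0 and alloc[name] < min(demand, hi):
                alloc[name] += 1
                free -= 1
                progressed = True
    return alloc


if __name__ == "__main__":
    pools = {
        "account_a": (2, 10, 8),  # busy account, bursts above its minimum
        "account_b": (2, 4, 1),   # mostly idle account
    }
    print(allocate(pools, capacity=8))
```

In this model the idle pool's unused guarantee is what lets the busy pool burst, which is why guaranteeing minimums is relatively cheap.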

• The bad parts:

• Too many pools means cluster separation is needed

• The ResourceManager tends to get slow with too many pools

• Some unsafe UDFs need to be disabled

• java_method()

• reflect()

ARCHITECTURE CHANGES (PTD)

WHAT IS PTD?

• Patchset Treasure Data

• Name first coined by these two: @frsyuki @tagomoris

• Still in development

• Original plan

• Base all internal Hadoop components on the latest community edition

• Simplify releases to stay as close to the current version as possible

• What it’s turning into

• A complete overhaul of most things related to Hadoop

MULTIPLE HADOOP VERSIONS

• The default version is changed by changing settings in a data bag
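
A sketch of how a default version could be resolved from such settings, assuming the data bag is a JSON-like document. The layout, key names, version numbers, and install paths below are entirely hypothetical.

```python
def resolve_version(data_bag, requested=None):
    """Pick the Hadoop version for a job.

    Uses the explicitly requested version when given, otherwise the default
    from the data bag, and returns (version, install_path).
    """
    version = requested or data_bag["default_version"]
    if version not in data_bag["versions"]:
        raise KeyError(f"unknown Hadoop version: {version}")
    return version, data_bag["versions"][version]


if __name__ == "__main__":
    # Hypothetical data bag content; the real layout is not shown in the slides.
    data_bag = {
        "default_version": "2.7",
        "versions": {
            "2.6": "/opt/hadoop-2.6",
            "2.7": "/opt/hadoop-2.7",
        },
    }
    print(resolve_version(data_bag))
    print(resolve_version(data_bag, requested="2.6"))
```

Switching the default then amounts to editing one value, while jobs that pin a version keep running against it.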

MULTIPLE VERSION JOB SUBMISSION

ELEPHANT SERVER

• Provides a REST API for job submission and monitoring

• All Hive/Pig related code is separated from the generic worker

• Distributed in-memory queue manages job progress

• Built to support multiple versions of Hadoop/Hive/Pig…

• This could lead to the following long-term solution

JOB PRESERVING RESTARTS

• Worker polls job status from the local server

• A new instance of the server is started; it joins the Hazelcast cluster and repeatedly tries to start the REST server

• The old instance goes into shutdown mode (not starting new jobs but keeping current ones running)

• Newly submitted jobs are popped and managed by the new instance

• Once all jobs running on the old instance have finished one way or another, it shuts down

• Once the port is freed, the new instance starts its REST API
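
The port handover at the heart of this restart can be sketched with plain sockets. This only models the "keep trying to bind until the old instance releases the port" part; the Hazelcast queue and the REST server itself are left out, and the function names and retry parameters are illustrative.

```python
import socket
import time


def try_bind(port, host="127.0.0.1"):
    """Try to listen on the port; return the socket, or None if it is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError:
        sock.close()
        return None


def wait_for_port(port, attempts=50, delay=0.1, host="127.0.0.1"):
    """New instance: retry until the old instance has released the port."""
    for _ in range(attempts):
        sock = try_bind(port, host)
        if sock is not None:
            return sock
        time.sleep(delay)
    return None


if __name__ == "__main__":
    old = try_bind(0)                   # old instance, OS picks a free port
    port = old.getsockname()[1]
    print("new instance can bind:", try_bind(port) is not None)
    old.close()                         # old instance finished its jobs
    new = wait_for_port(port)
    print("new instance took over:", new is not None)
    if new is not None:
        new.close()
```

Because the worker only talks to the local port, it does not need to know which instance is currently serving it.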

https://www.treasuredata.com/
