Hive Dirty/Beautiful Hacks in Treasure Data "Rejected" Hadoop Conference Japan 2016 Feb 13, 2016 Satoshi "Moris" Tagomori (@tagomoris)


TRANSCRIPT

Page 1: Hive dirty/beautiful hacks in TD

Hive Dirty/Beautiful Hacks

in Treasure Data

"Rejected" Hadoop Conference Japan 2016

Feb 13, 2016 Satoshi "Moris" Tagomori (@tagomoris)

Page 2: Hive dirty/beautiful hacks in TD

Satoshi "Moris" Tagomori (@tagomoris)

Fluentd, MessagePack-Ruby, Norikra, ...

Treasure Data, Inc.

Page 3: Hive dirty/beautiful hacks in TD
Page 4: Hive dirty/beautiful hacks in TD

http://www.treasuredata.com/

Page 5: Hive dirty/beautiful hacks in TD

What I'll talk about today

• Hive query execution & deployment in TD
• Query runner
• UDF & schema management
• InputFormat
• Controlling maps/reduces
• Time index pushdown
• Implementing INSERT INTO
• Optimizing INSERT INTO

Page 6: Hive dirty/beautiful hacks in TD

Console API

EventCollector

PlazmaDB

Worker

Scheduler

Hadoop Cluster

Presto Cluster

USERS

TD SDKs

SERVERS

DataConnector

CUSTOMER's SYSTEMS

50k/day

200k/day

12M/day (138/sec)

Treasure Data Architecture: Overview

Page 7: Hive dirty/beautiful hacks in TD

Hive query execution in TD

PlazmaDB

Worker

Hadoop Cluster

Hive CLI

MR App MR App MR App

metastore

※ all modified code lives here

※ the Hadoop cluster itself is almost unmodified

Page 8: Hive dirty/beautiful hacks in TD

Hive deployment

PlazmaDB

Worker

Hadoop Cluster

Hive CLI

MR App MR App MR App

Page 9: Hive dirty/beautiful hacks in TD

PlazmaDB

Worker

Hadoop Cluster

Hive CLI

MR App MR App MR App

Hive CLI

Hive deployment

Page 10: Hive dirty/beautiful hacks in TD

PlazmaDB

Worker

Hadoop Cluster

MR App MR App MR App

Hive CLI

Hive deployment

Page 11: Hive dirty/beautiful hacks in TD

Blue-green deployment for Hadoop clusters

PlazmaDB

Worker

Hadoop Cluster

MR App

Hive CLI

Page 12: Hive dirty/beautiful hacks in TD

PlazmaDB

Worker

Hadoop Cluster

MR App

Hive CLI

Hadoop Cluster

Blue-green deployment for Hadoop clusters

Page 13: Hive dirty/beautiful hacks in TD

PlazmaDB

Worker

Hadoop Cluster

MR App

Hive CLI

Hadoop Cluster

Hive CLI

MR App

Blue-green deployment for Hadoop clusters

Page 14: Hive dirty/beautiful hacks in TD

PlazmaDB

Worker

Hadoop Cluster Hadoop Cluster

Hive CLI

MR App

Blue-green deployment for Hadoop clusters

Page 15: Hive dirty/beautiful hacks in TD

PlazmaDB

Worker

Hadoop Cluster

Hive CLI

MR App

Blue-green deployment for Hadoop clusters

Page 16: Hive dirty/beautiful hacks in TD

Hive query execution in TD

• Hive CLI
  • Worker code (Ruby) builds the command-line options with Java properties
  • Uses an in-memory, disposable metastore
  • http://www.slideshare.net/lewuathe/maintainable-cloud-architectureofhadoop by @Lewuathe

• PlazmaDB
  • Time-indexed database (hourly partitions)
  • mpc1 columnar-format files + schema-on-read
  • http://www.slideshare.net/treasure-data/td-techplazma by @frsyuki

Page 17: Hive dirty/beautiful hacks in TD

Query Runner

• Entry point to execute Hive queries in TD
• QueryRunner extends hive.cli.CliDriver
  • forces the use of MessagePackSerDe
  • injects a QueryPlanCheck hook to prohibit SCRIPT operators
  • replaces the stdout/stderr/query_result writers
  • adds hooks to report query statistics

Page 18: Hive dirty/beautiful hacks in TD

Example: Hive job

Page 19: Hive dirty/beautiful hacks in TD

env HADOOP_CLASSPATH=test.jar:td-hadoop-1.0.jar \
    HADOOP_OPTS="-Xmx738m -Duser.name=221" \
  hive --service jar td-hadoop-1.0.jar \
    com.treasure_data.hadoop.hive.runner.QueryRunner \
    -hiveconf td.jar.version= \
    -hiveconf plazma.metadb.config={} \
    -hiveconf plazma.storage.config={} \
    -hiveconf td.worker.database.config={} \
    -hiveconf mapreduce.job.priority=HIGH \
    -hiveconf mapreduce.job.queuename=root.q221.high \
    -hiveconf mapreduce.job.name=HiveJob379515 \
    -hiveconf td.query.mergeThreshold=1333382400 \
    -hiveconf td.query.apikey=12345 \
    -hiveconf td.scheduled.time=1342449253 \
    -hiveconf td.outdir=./jobs/379515 \
    -hiveconf hive.metastore.warehouse.dir=/user/hive/221/warehouse \
    -hiveconf hive.auto.convert.join.noconditionaltask=false \
    -hiveconf hive.mapjoin.localtask.max.memory.usage=0.7 \
    -hiveconf hive.mapjoin.smalltable.filesize=25000000 \
    -hiveconf hive.resultset.use.unique.column.names=false \
    -hiveconf hive.auto.convert.join=false \
    -hiveconf hive.optimize.sort.dynamic.partition=false \
    -hiveconf mapreduce.job.reduces=-1 \
    -hiveconf hive.vectorized.execution.enabled=false \
    -hiveconf mapreduce.job.ubertask.enable=true \
    -hiveconf yarn.app.mapreduce.am.resource.mb=2048 \
    -hiveconf mapreduce.job.ubertask.maxmaps=1 \
    -hiveconf mapreduce.job.ubertask.maxreduces=1 \
    -hiveconf mapreduce.job.ubertask.maxbytes=536870912 \
    -hiveconf td.hive.insertInto.dynamic.partitioning=false \
    -outdir ./jobs/379515

Page 20: Hive dirty/beautiful hacks in TD

Schema & UDF Management

• UDF management:
  • enable Treasure Data UDFs dynamically
  • execute CREATE TEMPORARY FUNCTION before queries

• Schema on read:
  • databases/tables from PlazmaDB metadata
  • schema definitions from PlazmaDB metadata
  • execute CREATE DATABASE/TABLE before queries

Page 21: Hive dirty/beautiful hacks in TD

Example: Hive job (cont.)

Page 22: Hive dirty/beautiful hacks in TD

ADD JAR 'td-hadoop-1.0.jar';
CREATE DATABASE IF NOT EXISTS `db`;
USE `db`;
CREATE TABLE tagomoris (`v` MAP<STRING,STRING>, `time` INT)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='*,time')
TBLPROPERTIES (
  'td.storage.user'='221',
  'td.storage.database'='dfc',
  'td.storage.table'='users_20100604_080812_ce9203d0',
  'td.storage.path'='221/dfc/users_20100604_080812_ce9203d0',
  'td.table_id'='2',
  'td.modifiable'='true',
  'plazma.data_set.name'='221/dfc/users_20100604_080812_ce9203d0'
);
CREATE TABLE tbl1 (`uid` INT, `key` STRING, `time` INT)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ('msgpack.columns.mapping'='uid,key,time')
TBLPROPERTIES (
  'td.storage.user'='221',
  'td.storage.database'='dfc',
  'td.storage.table'='contests_20100606_120720_96abe81a',
  'td.storage.path'='221/dfc/contests_20100606_120720_96abe81a',
  'td.table_id'='4',
  'td.modifiable'='true',
  'plazma.data_set.name'='221/dfc/contests_20100606_120720_96abe81a'
);
USE `db`;

Page 23: Hive dirty/beautiful hacks in TD

USE `db`; CREATE TEMPORARY FUNCTION MSGPACK_SERIALIZE AS 'com.treasure_data.hadoop.hive.udf.MessagePackSerialize'; CREATE TEMPORARY FUNCTION TD_TIME_RANGE AS 'com.treasure_data.hadoop.hive.udf.GenericUDFTimeRange'; CREATE TEMPORARY FUNCTION TD_TIME_ADD AS 'com.treasure_data.hadoop.hive.udf.UDFTimeAdd'; CREATE TEMPORARY FUNCTION TD_TIME_FORMAT AS 'com.treasure_data.hadoop.hive.udf.UDFTimeFormat'; CREATE TEMPORARY FUNCTION TD_TIME_PARSE AS 'com.treasure_data.hadoop.hive.udf.UDFTimeParse'; CREATE TEMPORARY FUNCTION TD_SCHEDULED_TIME AS 'com.treasure_data.hadoop.hive.udf.GenericUDFScheduledTime'; CREATE TEMPORARY FUNCTION TD_X_RANK AS 'com.treasure_data.hadoop.hive.udf.Rank'; CREATE TEMPORARY FUNCTION TD_FIRST AS 'com.treasure_data.hadoop.hive.udf.GenericUDAFFirst'; CREATE TEMPORARY FUNCTION TD_LAST AS 'com.treasure_data.hadoop.hive.udf.GenericUDAFLast'; CREATE TEMPORARY FUNCTION TD_SESSIONIZE AS 'com.treasure_data.hadoop.hive.udf.UDFSessionize'; CREATE TEMPORARY FUNCTION TD_PARSE_USER_AGENT AS 'com.treasure_data.hadoop.hive.udf.GenericUDFParseUserAgent'; CREATE TEMPORARY FUNCTION TD_HEX2NUM AS 'com.treasure_data.hadoop.hive.udf.UDFHex2num'; CREATE TEMPORARY FUNCTION TD_MD5 AS 'com.treasure_data.hadoop.hive.udf.UDFmd5'; CREATE TEMPORARY FUNCTION TD_RANK_SEQUENCE AS 'com.treasure_data.hadoop.hive.udf.UDFRankSequence'; CREATE TEMPORARY FUNCTION TD_STRING_EXPLODER AS 'com.treasure_data.hadoop.hive.udf.GenericUDTFStringExploder'; CREATE TEMPORARY FUNCTION TD_URL_DECODE AS

Page 24: Hive dirty/beautiful hacks in TD

CREATE TEMPORARY FUNCTION TD_URL_DECODE AS 'com.treasure_data.hadoop.hive.udf.UDFUrlDecode'; CREATE TEMPORARY FUNCTION TD_DATE_TRUNC AS 'com.treasure_data.hadoop.hive.udf.UDFDateTrunc'; CREATE TEMPORARY FUNCTION TD_LAT_LONG_TO_COUNTRY AS 'com.treasure_data.hadoop.hive.udf.UDFLatLongToCountry'; CREATE TEMPORARY FUNCTION TD_SUBSTRING_INENCODING AS 'com.treasure_data.hadoop.hive.udf.GenericUDFSubstringInEncoding'; CREATE TEMPORARY FUNCTION TD_DIVIDE AS 'com.treasure_data.hadoop.hive.udf.GenericUDFDivide'; CREATE TEMPORARY FUNCTION TD_SUMIF AS 'com.treasure_data.hadoop.hive.udf.GenericUDAFSumIf'; CREATE TEMPORARY FUNCTION TD_AVGIF AS 'com.treasure_data.hadoop.hive.udf.GenericUDAFAvgIf'; CREATE TEMPORARY FUNCTION hivemall_version AS 'hivemall.HivemallVersionUDF'; CREATE TEMPORARY FUNCTION perceptron AS 'hivemall.classifier.PerceptronUDTF'; CREATE TEMPORARY FUNCTION train_perceptron AS 'hivemall.classifier.PerceptronUDTF'; CREATE TEMPORARY FUNCTION train_pa AS 'hivemall.classifier.PassiveAggressiveUDTF'; CREATE TEMPORARY FUNCTION train_pa1 AS 'hivemall.classifier.PassiveAggressiveUDTF'; CREATE TEMPORARY FUNCTION train_pa2 AS 'hivemall.classifier.PassiveAggressiveUDTF'; CREATE TEMPORARY FUNCTION train_cw AS 'hivemall.classifier.ConfidenceWeightedUDTF'; CREATE TEMPORARY FUNCTION train_arow AS 'hivemall.classifier.AROWClassifierUDTF'; CREATE TEMPORARY FUNCTION train_arowh AS 'hivemall.classifier.AROWClassifierUDTF';

Page 25: Hive dirty/beautiful hacks in TD

CREATE TEMPORARY FUNCTION train_arowh AS 'hivemall.classifier.AROWClassifierUDTF'; CREATE TEMPORARY FUNCTION train_scw AS 'hivemall.classifier.SoftConfideceWeightedUDTF'; CREATE TEMPORARY FUNCTION train_scw2 AS 'hivemall.classifier.SoftConfideceWeightedUDTF'; CREATE TEMPORARY FUNCTION adagrad_rda AS 'hivemall.classifier.AdaGradRDAUDTF'; CREATE TEMPORARY FUNCTION train_adagrad_rda AS 'hivemall.classifier.AdaGradRDAUDTF'; CREATE TEMPORARY FUNCTION train_multiclass_perceptron AS 'hivemall.classifier.multiclass.MulticlassPerceptronUDTF'; CREATE TEMPORARY FUNCTION train_multiclass_pa AS 'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF'; CREATE TEMPORARY FUNCTION train_multiclass_pa1 AS 'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF'; CREATE TEMPORARY FUNCTION train_multiclass_pa2 AS 'hivemall.classifier.multiclass.MulticlassPassiveAggressiveUDTF'; CREATE TEMPORARY FUNCTION train_multiclass_cw AS 'hivemall.classifier.multiclass.MulticlassConfidenceWeightedUDTF'; CREATE TEMPORARY FUNCTION train_multiclass_arow AS 'hivemall.classifier.multiclass.MulticlassAROWClassifierUDTF'; CREATE TEMPORARY FUNCTION train_multiclass_scw AS 'hivemall.classifier.multiclass.MulticlassSoftConfidenceWeightedUDTF'; CREATE TEMPORARY FUNCTION train_multiclass_scw2 AS 'hivemall.classifier.multiclass.MulticlassSoftConfidenceWeightedUDTF'; CREATE TEMPORARY FUNCTION cosine_similarity AS 'hivemall.knn.similarity.CosineSimilarityUDF'; CREATE TEMPORARY FUNCTION cosine_sim AS 'hivemall.knn.similarity.CosineSimilarityUDF'; CREATE TEMPORARY FUNCTION jaccard AS 'hivemall.knn.similarity.JaccardIndexUDF';

Page 26: Hive dirty/beautiful hacks in TD

CREATE TEMPORARY FUNCTION jaccard AS 'hivemall.knn.similarity.JaccardIndexUDF'; CREATE TEMPORARY FUNCTION jaccard_similarity AS 'hivemall.knn.similarity.JaccardIndexUDF'; CREATE TEMPORARY FUNCTION angular_similarity AS 'hivemall.knn.similarity.AngularSimilarityUDF'; CREATE TEMPORARY FUNCTION euclid_similarity AS 'hivemall.knn.similarity.EuclidSimilarity'; CREATE TEMPORARY FUNCTION distance2similarity AS 'hivemall.knn.similarity.Distance2SimilarityUDF'; CREATE TEMPORARY FUNCTION hamming_distance AS 'hivemall.knn.distance.HammingDistanceUDF'; CREATE TEMPORARY FUNCTION popcnt AS 'hivemall.knn.distance.PopcountUDF'; CREATE TEMPORARY FUNCTION kld AS 'hivemall.knn.distance.KLDivergenceUDF'; CREATE TEMPORARY FUNCTION euclid_distance AS 'hivemall.knn.distance.EuclidDistanceUDF'; CREATE TEMPORARY FUNCTION cosine_distance AS 'hivemall.knn.distance.CosineDistanceUDF'; CREATE TEMPORARY FUNCTION angular_distance AS 'hivemall.knn.distance.AngularDistanceUDF'; CREATE TEMPORARY FUNCTION jaccard_distance AS 'hivemall.knn.distance.JaccardDistanceUDF'; CREATE TEMPORARY FUNCTION manhattan_distance AS 'hivemall.knn.distance.ManhattanDistanceUDF'; CREATE TEMPORARY FUNCTION minkowski_distance AS 'hivemall.knn.distance.MinkowskiDistanceUDF'; CREATE TEMPORARY FUNCTION minhashes AS 'hivemall.knn.lsh.MinHashesUDF'; CREATE TEMPORARY FUNCTION minhash AS 'hivemall.knn.lsh.MinHashUDTF'; CREATE TEMPORARY FUNCTION bbit_minhash AS 'hivemall.knn.lsh.bBitMinHashUDF'; CREATE TEMPORARY FUNCTION voted_avg AS 'hivemall.ensemble.bagging.VotedAvgUDAF';

Page 27: Hive dirty/beautiful hacks in TD

CREATE TEMPORARY FUNCTION voted_avg AS 'hivemall.ensemble.bagging.VotedAvgUDAF'; CREATE TEMPORARY FUNCTION weight_voted_avg AS 'hivemall.ensemble.bagging.WeightVotedAvgUDAF'; CREATE TEMPORARY FUNCTION wvoted_avg AS 'hivemall.ensemble.bagging.WeightVotedAvgUDAF'; CREATE TEMPORARY FUNCTION max_label AS 'hivemall.ensemble.MaxValueLabelUDAF'; CREATE TEMPORARY FUNCTION maxrow AS 'hivemall.ensemble.MaxRowUDAF'; CREATE TEMPORARY FUNCTION argmin_kld AS 'hivemall.ensemble.ArgminKLDistanceUDAF'; CREATE TEMPORARY FUNCTION mhash AS 'hivemall.ftvec.hashing.MurmurHash3UDF'; CREATE TEMPORARY FUNCTION sha1 AS 'hivemall.ftvec.hashing.Sha1UDF'; CREATE TEMPORARY FUNCTION array_hash_values AS 'hivemall.ftvec.hashing.ArrayHashValuesUDF'; CREATE TEMPORARY FUNCTION prefixed_hash_values AS 'hivemall.ftvec.hashing.ArrayPrefixedHashValuesUDF'; CREATE TEMPORARY FUNCTION polynomial_features AS 'hivemall.ftvec.pairing.PolynomialFeaturesUDF'; CREATE TEMPORARY FUNCTION powered_features AS 'hivemall.ftvec.pairing.PoweredFeaturesUDF'; CREATE TEMPORARY FUNCTION rescale AS 'hivemall.ftvec.scaling.RescaleUDF'; CREATE TEMPORARY FUNCTION rescale_fv AS 'hivemall.ftvec.scaling.RescaleUDF'; CREATE TEMPORARY FUNCTION zscore AS 'hivemall.ftvec.scaling.ZScoreUDF'; CREATE TEMPORARY FUNCTION normalize AS 'hivemall.ftvec.scaling.L2NormalizationUDF'; CREATE TEMPORARY FUNCTION conv2dense AS 'hivemall.ftvec.conv.ConvertToDenseModelUDAF'; CREATE TEMPORARY FUNCTION to_dense_features AS 'hivemall.ftvec.conv.ToDenseFeaturesUDF';

Page 28: Hive dirty/beautiful hacks in TD

CREATE TEMPORARY FUNCTION to_dense_features AS 'hivemall.ftvec.conv.ToDenseFeaturesUDF'; CREATE TEMPORARY FUNCTION to_dense AS 'hivemall.ftvec.conv.ToDenseFeaturesUDF'; CREATE TEMPORARY FUNCTION to_sparse_features AS 'hivemall.ftvec.conv.ToSparseFeaturesUDF'; CREATE TEMPORARY FUNCTION to_sparse AS 'hivemall.ftvec.conv.ToSparseFeaturesUDF'; CREATE TEMPORARY FUNCTION quantify AS 'hivemall.ftvec.conv.QuantifyColumnsUDTF'; CREATE TEMPORARY FUNCTION vectorize_features AS 'hivemall.ftvec.trans.VectorizeFeaturesUDF'; CREATE TEMPORARY FUNCTION categorical_features AS 'hivemall.ftvec.trans.CategoricalFeaturesUDF'; CREATE TEMPORARY FUNCTION indexed_features AS 'hivemall.ftvec.trans.IndexedFeatures'; CREATE TEMPORARY FUNCTION quantified_features AS 'hivemall.ftvec.trans.QuantifiedFeaturesUDTF'; CREATE TEMPORARY FUNCTION quantitative_features AS 'hivemall.ftvec.trans.QuantitativeFeaturesUDF'; CREATE TEMPORARY FUNCTION amplify AS 'hivemall.ftvec.amplify.AmplifierUDTF'; CREATE TEMPORARY FUNCTION rand_amplify AS 'hivemall.ftvec.amplify.RandomAmplifierUDTF'; CREATE TEMPORARY FUNCTION addBias AS 'hivemall.ftvec.AddBiasUDF'; CREATE TEMPORARY FUNCTION add_bias AS 'hivemall.ftvec.AddBiasUDF'; CREATE TEMPORARY FUNCTION sortByFeature AS 'hivemall.ftvec.SortByFeatureUDF'; CREATE TEMPORARY FUNCTION sort_by_feature AS 'hivemall.ftvec.SortByFeatureUDF'; CREATE TEMPORARY FUNCTION extract_feature AS 'hivemall.ftvec.ExtractFeatureUDF';

Page 29: Hive dirty/beautiful hacks in TD

CREATE TEMPORARY FUNCTION extract_feature AS 'hivemall.ftvec.ExtractFeatureUDF'; CREATE TEMPORARY FUNCTION extract_weight AS 'hivemall.ftvec.ExtractWeightUDF'; CREATE TEMPORARY FUNCTION add_feature_index AS 'hivemall.ftvec.AddFeatureIndexUDF'; CREATE TEMPORARY FUNCTION feature AS 'hivemall.ftvec.FeatureUDF'; CREATE TEMPORARY FUNCTION feature_index AS 'hivemall.ftvec.FeatureIndexUDF'; CREATE TEMPORARY FUNCTION tf AS 'hivemall.ftvec.text.TermFrequencyUDAF'; CREATE TEMPORARY FUNCTION train_logregr AS 'hivemall.regression.LogressUDTF'; CREATE TEMPORARY FUNCTION train_pa1_regr AS 'hivemall.regression.PassiveAggressiveRegressionUDTF'; CREATE TEMPORARY FUNCTION train_pa1a_regr AS 'hivemall.regression.PassiveAggressiveRegressionUDTF'; CREATE TEMPORARY FUNCTION train_pa2_regr AS 'hivemall.regression.PassiveAggressiveRegressionUDTF'; CREATE TEMPORARY FUNCTION train_pa2a_regr AS 'hivemall.regression.PassiveAggressiveRegressionUDTF'; CREATE TEMPORARY FUNCTION train_arow_regr AS 'hivemall.regression.AROWRegressionUDTF'; CREATE TEMPORARY FUNCTION train_arowe_regr AS 'hivemall.regression.AROWRegressionUDTF'; CREATE TEMPORARY FUNCTION train_arowe2_regr AS 'hivemall.regression.AROWRegressionUDTF'; CREATE TEMPORARY FUNCTION train_adagrad_regr AS 'hivemall.regression.AdaGradUDTF'; CREATE TEMPORARY FUNCTION train_adadelta_regr AS 'hivemall.regression.AdaDeltaUDTF'; CREATE TEMPORARY FUNCTION train_adagrad AS 'hivemall.regression.AdaGradUDTF';

Page 30: Hive dirty/beautiful hacks in TD

CREATE TEMPORARY FUNCTION train_adagrad AS 'hivemall.regression.AdaGradUDTF'; CREATE TEMPORARY FUNCTION train_adadelta AS 'hivemall.regression.AdaDeltaUDTF'; CREATE TEMPORARY FUNCTION logress AS 'hivemall.regression.LogressUDTF'; CREATE TEMPORARY FUNCTION pa1_regress AS 'hivemall.regression.PassiveAggressiveRegressionUDTF'; CREATE TEMPORARY FUNCTION pa1a_regress AS 'hivemall.regression.PassiveAggressiveRegressionUDTF'; CREATE TEMPORARY FUNCTION pa2_regress AS 'hivemall.regression.PassiveAggressiveRegressionUDTF'; CREATE TEMPORARY FUNCTION pa2a_regress AS 'hivemall.regression.PassiveAggressiveRegressionUDTF'; CREATE TEMPORARY FUNCTION arow_regress AS 'hivemall.regression.AROWRegressionUDTF'; CREATE TEMPORARY FUNCTION arowe_regress AS 'hivemall.regression.AROWRegressionUDTF'; CREATE TEMPORARY FUNCTION arowe2_regress AS 'hivemall.regression.AROWRegressionUDTF'; CREATE TEMPORARY FUNCTION adagrad AS 'hivemall.regression.AdaGradUDTF'; CREATE TEMPORARY FUNCTION adadelta AS 'hivemall.regression.AdaDeltaUDTF'; CREATE TEMPORARY FUNCTION float_array AS 'hivemall.tools.array.AllocFloatArrayUDF'; CREATE TEMPORARY FUNCTION array_remove AS 'hivemall.tools.array.ArrayRemoveUDF'; CREATE TEMPORARY FUNCTION sort_and_uniq_array AS 'hivemall.tools.array.SortAndUniqArrayUDF'; CREATE TEMPORARY FUNCTION subarray_endwith AS 'hivemall.tools.array.SubarrayEndWithUDF'; CREATE TEMPORARY FUNCTION subarray_startwith AS 'hivemall.tools.array.SubarrayStartWithUDF'; CREATE TEMPORARY FUNCTION collect_all AS

Page 31: Hive dirty/beautiful hacks in TD

CREATE TEMPORARY FUNCTION collect_all AS 'hivemall.tools.array.CollectAllUDAF'; CREATE TEMPORARY FUNCTION concat_array AS 'hivemall.tools.array.ConcatArrayUDF'; CREATE TEMPORARY FUNCTION subarray AS 'hivemall.tools.array.SubarrayUDF'; CREATE TEMPORARY FUNCTION array_avg AS 'hivemall.tools.array.ArrayAvgGenericUDAF'; CREATE TEMPORARY FUNCTION array_sum AS 'hivemall.tools.array.ArraySumUDAF'; CREATE TEMPORARY FUNCTION to_string_array AS 'hivemall.tools.array.ToStringArrayUDF'; CREATE TEMPORARY FUNCTION map_get_sum AS 'hivemall.tools.map.MapGetSumUDF'; CREATE TEMPORARY FUNCTION map_tail_n AS 'hivemall.tools.map.MapTailNUDF'; CREATE TEMPORARY FUNCTION to_map AS 'hivemall.tools.map.UDAFToMap'; CREATE TEMPORARY FUNCTION to_ordered_map AS 'hivemall.tools.map.UDAFToOrderedMap'; CREATE TEMPORARY FUNCTION sigmoid AS 'hivemall.tools.math.SigmoidGenericUDF'; CREATE TEMPORARY FUNCTION taskid AS 'hivemall.tools.mapred.TaskIdUDF'; CREATE TEMPORARY FUNCTION jobid AS 'hivemall.tools.mapred.JobIdUDF'; CREATE TEMPORARY FUNCTION rowid AS 'hivemall.tools.mapred.RowIdUDF'; CREATE TEMPORARY FUNCTION generate_series AS 'hivemall.tools.GenerateSeriesUDTF'; CREATE TEMPORARY FUNCTION convert_label AS 'hivemall.tools.ConvertLabelUDF'; CREATE TEMPORARY FUNCTION x_rank AS 'hivemall.tools.RankSequenceUDF'; CREATE TEMPORARY FUNCTION each_top_k AS 'hivemall.tools.EachTopKUDTF'; CREATE TEMPORARY FUNCTION tokenize AS 'hivemall.tools.text.TokenizeUDF'; CREATE TEMPORARY FUNCTION is_stopword AS 'hivemall.tools.text.StopwordUDF'; CREATE TEMPORARY FUNCTION split_words AS

Page 32: Hive dirty/beautiful hacks in TD

CREATE TEMPORARY FUNCTION split_words AS 'hivemall.tools.text.SplitWordsUDF'; CREATE TEMPORARY FUNCTION normalize_unicode AS 'hivemall.tools.text.NormalizeUnicodeUDF'; CREATE TEMPORARY FUNCTION lr_datagen AS 'hivemall.dataset.LogisticRegressionDataGeneratorUDTF'; CREATE TEMPORARY FUNCTION f1score AS 'hivemall.evaluation.FMeasureUDAF'; CREATE TEMPORARY FUNCTION mae AS 'hivemall.evaluation.MeanAbsoluteErrorUDAF'; CREATE TEMPORARY FUNCTION mse AS 'hivemall.evaluation.MeanSquaredErrorUDAF'; CREATE TEMPORARY FUNCTION rmse AS 'hivemall.evaluation.RootMeanSquaredErrorUDAF'; CREATE TEMPORARY FUNCTION mf_predict AS 'hivemall.mf.MFPredictionUDF'; CREATE TEMPORARY FUNCTION train_mf_sgd AS 'hivemall.mf.MatrixFactorizationSGDUDTF'; CREATE TEMPORARY FUNCTION train_mf_adagrad AS 'hivemall.mf.MatrixFactorizationAdaGradUDTF'; CREATE TEMPORARY FUNCTION fm_predict AS 'hivemall.fm.FMPredictGenericUDAF'; CREATE TEMPORARY FUNCTION train_fm AS 'hivemall.fm.FactorizationMachineUDTF'; CREATE TEMPORARY FUNCTION train_randomforest_classifier AS 'hivemall.smile.classification.RandomForestClassifierUDTF'; CREATE TEMPORARY FUNCTION train_rf_classifier AS 'hivemall.smile.classification.RandomForestClassifierUDTF'; CREATE TEMPORARY FUNCTION train_randomforest_regr AS 'hivemall.smile.regression.RandomForestRegressionUDTF'; CREATE TEMPORARY FUNCTION train_rf_regr AS 'hivemall.smile.regression.RandomForestRegressionUDTF'; CREATE TEMPORARY FUNCTION tree_predict AS 'hivemall.smile.tools.TreePredictByStackMachineUDF';

Page 33: Hive dirty/beautiful hacks in TD

CREATE TEMPORARY FUNCTION tree_predict AS 'hivemall.smile.tools.TreePredictByStackMachineUDF'; CREATE TEMPORARY FUNCTION vm_tree_predict AS 'hivemall.smile.tools.TreePredictByStackMachineUDF'; CREATE TEMPORARY FUNCTION rf_ensemble AS 'hivemall.smile.tools.RandomForestEnsembleUDAF'; CREATE TEMPORARY FUNCTION train_gradient_boosting_classifier AS 'hivemall.smile.classification.GradientTreeBoostingClassifierUDTF'; CREATE TEMPORARY FUNCTION guess_attribute_types AS 'hivemall.smile.tools.GuessAttributesUDF'; CREATE TEMPORARY FUNCTION tokenize_ja AS 'hivemall.nlp.tokenizer.KuromojiUDF'; CREATE TEMPORARY MACRO max2(x DOUBLE, y DOUBLE) if(x>y,x,y); CREATE TEMPORARY MACRO min2(x DOUBLE, y DOUBLE) if(x<y,x,y); CREATE TEMPORARY MACRO rand_gid(k INT) floor(rand()*k); CREATE TEMPORARY MACRO rand_gid2(k INT, seed INT) floor(rand(seed)*k); CREATE TEMPORARY MACRO idf(df_t DOUBLE, n_docs DOUBLE) log(10, n_docs / max2(1,df_t)) + 1.0; CREATE TEMPORARY MACRO tfidf(tf FLOAT, df_t DOUBLE, n_docs DOUBLE) tf * (log(10, n_docs / max2(1,df_t)) + 1.0);

SELECT time, COUNT(1) AS cnt FROM tbl1 WHERE TD_TIME_RANGE(time, '2015-12-11', '2015-12-12', 'JST');

Page 34: Hive dirty/beautiful hacks in TD

After improvement :)

ADD JAR test.jar;
ADD JAR td-hadoop-1.0.jar;

CREATE DATABASE IF NOT EXISTS `dfc`;

USE `dfc`;

CREATE TABLE `contests` (`uid` INT, `key` STRING, `time` INT)
STORED BY 'com.treasure_data.hadoop.hive.mapred.TDStorageHandler'
WITH SERDEPROPERTIES ("msgpack.columns.mapping"="uid,key,time")
TBLPROPERTIES (
  "td.storage.user"="221",
  "td.storage.database"="dfc",
  "td.storage.table"="contests_20100606_120720_96abe81a",
  "td.storage.path"="221/dfc/contests_20100606_120720_96abe81a",
  "td.table_id"="4",
  "td.modifiable"="true",
  "plazma.data_set.name"="221/dfc/contests_20100606_120720_96abe81a"
);

USE `dfc`;

CREATE TEMPORARY FUNCTION TD_TIME_RANGE AS 'com.treasure_data.hadoop.hive.udf.GenericUDFTimeRange';
CREATE TEMPORARY FUNCTION TD_TIME_ADD AS 'com.treasure_data.hadoop.hive.udf.UDFTimeAdd';
CREATE TEMPORARY FUNCTION TD_TIME_FORMAT AS 'com.treasure_data.hadoop.hive.udf.UDFTimeFormat';
CREATE TEMPORARY FUNCTION TD_TIME_PARSE AS 'com.treasure_data.hadoop.hive.udf.UDFTimeParse';
CREATE TEMPORARY FUNCTION TD_SCHEDULED_TIME AS 'com.treasure_data.hadoop.hive.udf.GenericUDFScheduledTime';
CREATE TEMPORARY MACRO max2(x DOUBLE, y DOUBLE) if(x>y,x,y);
CREATE TEMPORARY MACRO min2(x DOUBLE, y DOUBLE) if(x<y,x,y);
CREATE TEMPORARY MACRO rand_gid(k INT) floor(rand()*k);
CREATE TEMPORARY MACRO rand_gid2(k INT, seed INT) floor(rand(seed)*k);
CREATE TEMPORARY MACRO idf(df_t DOUBLE, n_docs DOUBLE) log(10, n_docs / max2(1,df_t)) + 1.0;
CREATE TEMPORARY MACRO tfidf(tf FLOAT, df_t DOUBLE, n_docs DOUBLE) tf * (log(10, n_docs / max2(1,df_t)) + 1.0);

SELECT `key`, COUNT(1) FROM contests WHERE `key` IS NOT NULL GROUP BY `key`;

Page 35: Hive dirty/beautiful hacks in TD

Appendix: disabling unsafe UDFs

• A rare case where we patch Hive itself for our own purposes
• java_method(), reflect()
  • ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java

Page 36: Hive dirty/beautiful hacks in TD

Logic flow in Hive processing

• CliDriver -> StorageHandler -> InputFormat -> SerDe -> SemanticAnalyzer -> OutputFormat
• -> MapReduce Application (
  • -> Mapper ( SerDe -> RecordReader -> ... )
  • -> Shuffler
  • -> Reducer ( ... -> RecordWriter )
• )

Page 37: Hive dirty/beautiful hacks in TD

Hive -> MapReduce

Hive StorageHandler -> InputFormat -> SerDe -> MapReduce -> SerDe -> OutputFormat

Page 38: Hive dirty/beautiful hacks in TD

TDInputFormat

• It's just an ordinary Hadoop InputFormat
• TDStorageHandler specifies TDInputFormat as the InputFormat
  • gets/builds splits
  • provides the RecordReader
  • overrides FS access to read data from PlazmaDB instead of HDFS

Page 39: Hive dirty/beautiful hacks in TD

Controlling Maps/Reduces

• We need to control the numbers of maps/reduces with our own logic
  • along customers' price plans
  • to optimize performance / cluster utilization

• Maps: from splits
  • Hive: # of files (or splittable parts of files) on HDFS
  • TD: # of megabytes, built from chunks in PlazmaDB
  • calculated and overwritten in TDInputFormat

• Reduces: from total input data size & other factors
  • calculated and overwritten in TDInputFormat
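As a rough illustration of the "maps from megabytes" idea, here is a minimal sketch. The method name, the per-map size, and the plan cap are all invented for illustration; this is not Treasure Data's actual sizing logic.

```java
// Hypothetical sketch: derive the number of map tasks from the total input
// size in megabytes (built from PlazmaDB chunks), clamped by a plan limit.
public class MapCountSketch {
    // One map task per `megabytesPerMap` of input, at least 1, at most `planMaxMaps`.
    public static int numMaps(long totalMegabytes, int megabytesPerMap, int planMaxMaps) {
        int maps = (int) ((totalMegabytes + megabytesPerMap - 1) / megabytesPerMap); // ceiling division
        if (maps < 1) {
            maps = 1;
        }
        return Math.min(maps, planMaxMaps);
    }
}
```

The cap is what lets a price plan bound cluster usage per customer regardless of input size.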

Page 40: Hive dirty/beautiful hacks in TD

Appendix: How to overwrite configuration values dynamically

@InterfaceAudience.Public
@InterfaceStability.Stable
public class Configuration implements Iterable<Map.Entry<String,String>>, Writable {
    /** Configuration objects */
    private static final WeakHashMap<Configuration,Object> REGISTRY =
        new WeakHashMap<Configuration,Object>();

public void setNumReduceTasks(int n) {
    setInt(JobContext.NUM_REDUCES, n);
}

The Configuration you receive may be a copy of the original, and may not be the one actually used to build the MapReduce job...

Especially in an InputFormat :-(

Page 41: Hive dirty/beautiful hacks in TD

TDInputFormatUtils.trySetNumReduceTasks(conf, num);

@SuppressWarnings("unchecked")
public static List<JobConf> tryGetOriginalJobConfs(JobConf conf) {
    try {
        ArrayList<JobConf> list = new ArrayList<JobConf>();
        // Configuration.REGISTRY contains all copies of JobConf instances in the process.
        // This method scans all copies and tries to find the original among them to update it.
        Field f = Configuration.class.getDeclaredField("REGISTRY");
        f.setAccessible(true);
        WeakHashMap<Configuration,Object> reg = (WeakHashMap<Configuration,Object>) f.get(null);
        for (Configuration c : reg.keySet()) {
            if (c instanceof JobConf) {
                JobConf jc = (JobConf) c;
                if (jc.getCredentials() == conf.getCredentials()) {
                    // sharing the same credentials object means a cloned configuration
                    list.add(jc);
                }
            }
        }
        return list;
    } catch (Exception ex) {
        // ignore errors
    }
    return Collections.emptyList();
}

public static void trySetNumReduceTasks(JobConf conf, int num) {
    List<JobConf> jcs = tryGetOriginalJobConfs(conf);
    for (JobConf jc : jcs) {
        jc.setNumReduceTasks(num);
    }
}

Get all copies of Configuration by reflection, and overwrite all copies with specified values :-)
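The pattern is easy to reproduce outside Hadoop. In the toy class below, `Conf` stands in for Hadoop's Configuration (the real code reads the private static `Configuration.REGISTRY` field): reflection exposes the registry of live instances, and every copy gets overwritten. A sketch under that assumption, not TD's actual code.

```java
import java.lang.reflect.Field;
import java.util.Map;
import java.util.WeakHashMap;

// Toy demonstration of the reflection trick above: reach a private static
// registry of live configuration copies and update every one of them.
public class RegistryHack {
    public static class Conf {
        // Mirrors Configuration.REGISTRY: every instance registers itself here.
        private static final WeakHashMap<Conf, Object> REGISTRY = new WeakHashMap<>();
        public int reduces = -1;
        public Conf() { REGISTRY.put(this, null); }
    }

    @SuppressWarnings("unchecked")
    public static void setReducesEverywhere(int n) {
        try {
            Field f = Conf.class.getDeclaredField("REGISTRY");
            f.setAccessible(true);
            Map<Conf, Object> reg = (Map<Conf, Object>) f.get(null);
            for (Conf c : reg.keySet()) {
                c.reduces = n; // overwrite every live copy, not just the one we hold
            }
        } catch (Exception ex) {
            // ignore errors, as the slide's code does
        }
    }
}
```

Because the registry is a WeakHashMap keyed by the instances themselves, only configurations still referenced elsewhere are visited, which is exactly the set that might feed the real job.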

Page 42: Hive dirty/beautiful hacks in TD

Time Index Pushdown

• Scan only the data needed from the table
  • faster processing, fewer computing resources

• Pushing down the scan range
  • from SemanticAnalyzer to StorageHandler
  • injecting IndexAnalyzer over InputFormat

SELECT col1 FROM tablename
WHERE time > TD_SCHEDULED_TIME() - 86400
  AND time < TD_SCHEDULED_TIME()
  OR TD_TIME_RANGE(time, '2016-02-13 00:00:00 JST', '2016-02-14 00:00:00 JST')
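The static evaluation behind this pushdown boils down to converting the literal timestamps and timezone into unix seconds, yielding a [start, end) range to intersect with the hourly partitions. Below is a simplified sketch using java.time; the parsing rules and names are guesses for illustration, not the exact TD_TIME_RANGE semantics.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// Sketch: statically evaluate a TD_TIME_RANGE-style predicate into a
// [start, end) range of unix seconds that can prune hourly partitions.
public class TimeRangeSketch {
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Convert a timestamp string plus a timezone name into unix seconds.
    // Only the 'JST' alias is handled specially here, as a simplification.
    public static long toUnixSeconds(String ts, String zone) {
        ZoneId z = zone.equals("JST") ? ZoneId.of("Asia/Tokyo") : ZoneId.of(zone);
        return LocalDateTime.parse(ts, FMT).atZone(z).toEpochSecond();
    }

    // [start, end) range in unix seconds for partition pruning.
    public static long[] range(String start, String end, String zone) {
        return new long[] { toUnixSeconds(start, zone), toUnixSeconds(end, zone) };
    }
}
```

Once the range is known, the InputFormat only builds splits from PlazmaDB chunks whose hourly partition overlaps it.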

Page 43: Hive dirty/beautiful hacks in TD

IndexAnalyzer

• Called from TDInputFormat
  • InputFormat can do everything :-)
• Analyzes the operator tree
  • to create time ranges for each table

} else if (udf instanceof GenericUDFOPLessThan) {
    ExprNodeDesc left = node.getChildren().get(0);
    ExprNodeDesc right = node.getChildren().get(1);
    if (isTimeColumn(right)) {
        Long v = getLongConstant(left);
        if (v != null) { // VALUE < key
            return new TimeRange[] { new TimeRange(v + 1) };
        }
    } else if (isTimeColumn(left)) {
        Long v = getLongConstant(right);
        if (v != null) { // key < VALUE
            return new TimeRange[] { new TimeRange(0, v - 1) };
        }
    }
    return ALL_RANGES;
} else if (udf instanceof GenericUDFTimeRange) {
    // statically evaluate TIME_RANGE(time, start[, end[, timezone]])
    if (node.getChildren().size() < 2 || node.getChildren().size() > 4) {
        return ALL_RANGES;
    }

    ExprNodeDesc arg0 = node.getChildren().get(0);
    ExprNodeDesc arg1 = node.getChildren().get(1);
    ExprNodeDesc arg2 = null;
    ExprNodeDesc arg3 = null;
    if (node.getChildren().size() >= 3) {

Page 44: Hive dirty/beautiful hacks in TD

Implementing INSERT INTO

• For a normal INSERT INTO, Hive writes data into HDFS
  • FileFormat/serialization can be overridden, but the filesystem can't be

• Our tables live in PlazmaDB
  • INSERT INTO queries must write data to Plazma
  • and must handle the write-and-commit transaction

Page 45: Hive dirty/beautiful hacks in TD

TDHiveOutputFormat

• TDStorageHandler specifies it as the OutputFormat
• Replaces the ReduceSinkOperator in the operator tree
  • to override inserts so data is written into PlazmaDB
  • called from TDInputFormat

[Diagram: the original operator tree for INSERT INTO, with its ReduceSink operator, next to the modified tree with the replaced operator]

Page 46: Hive dirty/beautiful hacks in TD

InputFormat can do everything! <3

Page 47: Hive dirty/beautiful hacks in TD

ReduceSinkOp w/ One-Hour Partitioning
• All data of a partition must be read together when needed
• Rows within a partition do not need to be sorted
• partitioned by "time % 3600"
• 1 reducer per 1 partition

ExprNodeDesc hashExpr;
if (conf.getBoolean(TDConstants.TD_CK_HIVE_INSERTINTO_DYNAMIC_PARTITIONING, false)) {
    hashExpr = new ExprNodeGenericFuncDesc(TypeInfoFactory.intTypeInfo,
        new GenericPlazmaUnixtimeDataSetDynamicHashUDF(), args);
} else {
    hashExpr = new ExprNodeGenericFuncDesc(TypeInfoFactory.intTypeInfo,
        new GenericPlazmaUnixtimeDataSetKeyHashUDF(), args);
}
partnCols.add(hashExpr);

// add another MR job to the query plan to sort data by the hashExpr
Operator op = genReduceSinkPlanForSortingBucketing(analyzer, table, input,
    sortCols, sortOrders, partnCols, -1);
preventFileReduceSinkOptimizationHack(conf, analyzer, table, "time", 1);
return op;
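The "1 reducer per 1 partition" mapping boils down to routing each row by its hour bucket. A minimal sketch, assuming the partition key is derived from the row's unix time (the exact hash the UDFs above compute is not shown in full on the slides):

```java
// Sketch: route rows to reducers so that all rows of one one-hour
// partition land on the same reducer. Illustrative, not TD's code.
class HourPartitioner {
    // start of the hour containing this timestamp
    static long hourBucket(long unixTime) {
        return unixTime - (unixTime % 3600);
    }

    // same hour -> same bucket -> same reducer
    static int reducerFor(long unixTime, int numReducers) {
        return (int) ((hourBucket(unixTime) / 3600) % numReducers);
    }
}
```

Two rows with timestamps 7200 and 7250 share the hour starting at 7200, so they reach the same reducer; a row at 10800 belongs to the next hour and goes elsewhere.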

Page 48: Hive dirty/beautiful hacks in TD

Appendix: hack NOT to sort rows
• Rows in a partition do NOT have to be sorted
• All rows of a partition should be read at the same time
• There's no standard way NOT to sort rows

private static void preventFileReduceSinkOptimizationHack(Configuration conf,
        SemanticAnalyzer analyzer, Table dest_tab, String fakeSortCol, int fakeSortOrder)
        throws NoSuchFieldException, IllegalAccessException, HiveException {
    dest_tab.setSortCols(Arrays.asList(new Order[] { new Order("time", 1) }));
    conf.setBoolean("hive.enforce.sorting", true);

    // prevent BucketingSortingReduceSinkOptimizer from optimizing out the ReduceSinkOperator:
    // fake that this ReduceSinkOperator is a regular operator
    // so that BucketingSortingReduceSinkOptimizer.process doesn't optimize it out
    Field field = analyzer.getClass()
        .getDeclaredField("reduceSinkOperatorsAddedByEnforceBucketingSorting");
    field.setAccessible(true);
    List<ReduceSinkOperator> list = (List<ReduceSinkOperator>) field.get(analyzer);
    list.clear();
}

Page 49: Hive dirty/beautiful hacks in TD

Optimizing INSERT INTO
• 1 reducer per 1 partition model
  • works well in many cases :-)
  • doesn't work well for massively large data within a single hour

INSERT INTO TABLE destination
SELECT * FROM sourcetable
WHERE TD_TIME_RANGE(time, '2016-02-13 15:00:00', '2016-02-13 16:00:00', 'JST')

• 1 reducer takes a very long time in such cases
  • while the many other reducers finish immediately

• We wanna distribute!

Page 50: Hive dirty/beautiful hacks in TD

Basics of Shuffle/Reduce
• Shuffle is a global sort of rows before the reducers

• Reducer operators assume:
  • all rows are sorted
  • a disordered row marks a partition boundary

(Diagram: rows from each map go through the shuffle (global sort) into sorted partitions; the order of the partitions themselves is not sorted)
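The boundary rule can be illustrated with a toy scan: within one partition keys arrive sorted, so a key smaller than its predecessor signals the start of the next partition (toy code, not from Hive):

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration: find partition boundaries in a shuffled stream,
// where each partition's keys are sorted but the partitions are
// concatenated in arbitrary order.
class BoundaryScan {
    static List<Integer> boundaries(int[] keys) {
        List<Integer> out = new ArrayList<>();
        for (int i = 1; i < keys.length; i++) {
            if (keys[i] < keys[i - 1]) out.add(i); // order broke: new partition
        }
        return out;
    }
}
```

For the stream 1, 2, 3, 0, 5, 2, 4 the order breaks at indices 3 and 5, so the reducer sees three partitions.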

Page 51: Hive dirty/beautiful hacks in TD

INSERT INTO w/ fewer partitions, massive data
• We know these at planning time:
  • time range (# of partitions): IndexAnalyzer
  • data size: InputFormat.getSplits
  • # of reducers: InputFormat.getSplits

(Diagram: rows from every map funnel through the shuffle (global sort) into a single partition of non-sorted rows)

Page 52: Hive dirty/beautiful hacks in TD

Distribute (virtual) partitions dynamically
• PlazmaDB-level partitions are managed by the StorageHandler (and the PlazmaDB client)
• MR-level partitioning doesn't need to match PlazmaDB partitioning
• How to distribute one PlazmaDB partition to many reducers?

ExprNodeDesc hashExpr;
if (conf.getBoolean(TDConstants.TD_CK_HIVE_INSERTINTO_DYNAMIC_PARTITIONING, false)) {
    hashExpr = new ExprNodeGenericFuncDesc(TypeInfoFactory.intTypeInfo,
        new GenericPlazmaUnixtimeDataSetDynamicHashUDF(), args);
} else {
    hashExpr = new ExprNodeGenericFuncDesc(TypeInfoFactory.intTypeInfo,
        new GenericPlazmaUnixtimeDataSetKeyHashUDF(), args);
}
partnCols.add(hashExpr);

Page 53: Hive dirty/beautiful hacks in TD

/*
 * calculate Size (== 3600 / F), where F is the max number such that:
 *  - F is a factor of 3600 (3600 % F == 0, i.e. 1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 18, 20, 24, 30, 36, 40, ...)
 *  - F * H <= reduces
 */
static int calculateFactor(int reduces, long hours) {
    if (reduces <= hours) {
        return 1;
    }

    long factor = reduces / hours;
    while (factor >= 2) {
        if (3600 % factor == 0) break;
        factor -= 1;
    }
    return (int) factor;
}

static int calculatePartitioningSize(int reduces, long hours) {
    return 3600 / calculateFactor(reduces, hours);
}
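To see the rule in action, the two functions can be exercised with concrete numbers (reimplemented verbatim from the slide, wrapped in a class only so it compiles standalone):

```java
// Self-contained copy of the slide's factor/size calculation, for a quick check.
class PartitioningMath {
    static int calculateFactor(int reduces, long hours) {
        if (reduces <= hours) {
            return 1;
        }
        long factor = reduces / hours;
        while (factor >= 2) {
            if (3600 % factor == 0) break;
            factor -= 1;
        }
        return (int) factor;
    }

    static int calculatePartitioningSize(int reduces, long hours) {
        return 3600 / calculateFactor(reduces, hours);
    }
}
```

For example, 120 reducers over 24 hours give F = 5, so each one-hour partition is split into five 720-second sub-partitions. With 169 reducers, 169 / 24 = 7 is not a factor of 3600, so it falls back to F = 6 (600-second sub-partitions); with only 10 reducers over 24 hours, F = 1 and the original one-hour partitioning is kept.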

(Diagram: rows from each map shuffle (global sort) with dynamic partitioning into many Hive partitions, so a single PlazmaDB 1-hour partition is spread across several reducers)

Page 54: Hive dirty/beautiful hacks in TD

public void configure(MapredContext context) {
    JobConf conf = context.getJobConf();

    this.partitioningSize = DEFAULT_PARTITIONING_SIZE;

    int reduces = conf.getInt(MRJobConfig.NUM_REDUCES, 0);
    if (reduces < 1) {
        // dynamic partitioning requires the number of reduces,
        // because reducers generate too many files if 2 or more partitions arrive in one reduce task
        return;
    }

    int distributionFactor = conf.getInt(TDConstants.TD_CK_HIVE_INSERTINTO_DISTRIBUTION_FACTOR, 0);
    if (distributionFactor > 0) {
        if (distributionFactor <= reduces && 3600 % distributionFactor == 0) {
            this.partitioningSize = 3600 / distributionFactor;
            return;
        }
        // a distribution factor larger than reduces, or not a factor of 3600,
        // splits output into too many / too small files;
        // such values are ignored and the default rule is used
    }

    long splits = conf.getLong(TDConstants.TD_CK_QUERY_SPLIT_NUMBER, 0);
    long hours = conf.getLong(TDConstants.TD_CK_QUERY_TIME_RANGE_HOURS, 0);
    if (splits < 1 || hours < 1) {
        return; // use the default size if TDInputFormat fails to set these values
    }

    if (splits < MIN_SPLITS_TO_DISTRIBUTE || hours > MAX_HOURS_TO_DISTRIBUTE) {
        // input data too small, or already has enough time partitions to distribute by the default rule
        return;
    }
    if (reduces < hours * 2) {
        // not enough reduces to distribute time-based partitions
        return;
    }

    this.partitioningSize = calculatePartitioningSize(reduces, hours);
}
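Pulled out of the Hadoop plumbing, configure() reduces to a pure function of four numbers. A condensed sketch of that decision rule (the MIN/MAX threshold values are placeholders; the slide does not show the actual constants):

```java
// Condensed decision rule from configure(), as a pure function.
// Threshold constants are placeholders, not TD's real values.
class PartitioningDecision {
    static final int DEFAULT_PARTITIONING_SIZE = 3600;
    static final long MIN_SPLITS_TO_DISTRIBUTE = 100; // placeholder
    static final long MAX_HOURS_TO_DISTRIBUTE = 24;   // placeholder

    static int partitioningSize(int reduces, int distributionFactor,
                                long splits, long hours) {
        if (reduces < 1) return DEFAULT_PARTITIONING_SIZE;
        // an explicit, valid distribution factor wins
        if (distributionFactor > 0
                && distributionFactor <= reduces
                && 3600 % distributionFactor == 0) {
            return 3600 / distributionFactor;
        }
        // otherwise fall back to the default rule, with its guards
        if (splits < 1 || hours < 1) return DEFAULT_PARTITIONING_SIZE;
        if (splits < MIN_SPLITS_TO_DISTRIBUTE || hours > MAX_HOURS_TO_DISTRIBUTE) {
            return DEFAULT_PARTITIONING_SIZE; // too small, or already spread out
        }
        if (reduces < hours * 2) return DEFAULT_PARTITIONING_SIZE;
        return 3600 / calculateFactor(reduces, hours);
    }

    static int calculateFactor(int reduces, long hours) {
        if (reduces <= hours) return 1;
        long factor = reduces / hours;
        while (factor >= 2 && 3600 % factor != 0) factor -= 1;
        return (int) factor;
    }
}
```

Keeping the rule side-effect free like this makes each guard easy to test in isolation, which matters given how many ways the defaults can kick in.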

Page 55: Hive dirty/beautiful hacks in TD

What's important is that IT WORKS!

One day, @frsyuki said

Page 56: Hive dirty/beautiful hacks in TD

We'll improve our code step by step, along with improvements in OSS and its developer community <3

Thanks!