
1/19

DGFIndex for Smart Grid: Enhancing Hive with a Cost Effective Multi-dimensional Range Index

Yue Liu, Songlin Hu*, Tilmann Rabl, Wantao Liu, Hans-Arno Jacobsen, Kaifeng Wu, Jian Chen, Jintao Li

MIDDLEWARE SYSTEMS RESEARCH GROUP

MSRG.ORG

STATE GRID ZHEJIANG ELECTRIC POWER COMPANY

STATE GRID ELECTRIC POWER RESEARCH INSTITUTE

2/19

Outline

Big Data challenges in Smart Grid

DGFIndex design

Experiments on smart grid data

Conclusions

3/19

Big Data Challenges in Smart Grid

[Diagram labels: Smart Meter; GPRS, CDMA, 230MHz; Collector; RDBMS]

The Electricity Consumption Information Collection System of the Grid

Example: Zhejiang province has 22 million smart meters, and China State Grid requires 96 measurements per meter per day. That is 22 million x 96, or about 2.1 billion records per day in a single table.

Figure 1 Data flow in State Grid

4/19

Features of Smart Meter Data

Smart meter data features: time stamp field; append only; static schema

Query example: What is the average power consumption of users with ids in the range 100 to 1000 and dates in the range "2013-01-01" to "2013-02-01"?

Query features: multi-dimensional range queries; many aggregation queries

| UserId | PowerConsumed | TimeStamp | PATE with Rate 1 | Other Metrics |
|---|---|---|---|---|
| 24012 | 12.34 | 1332988833 | 10.45 | … |

Table 1 An example format of smart meter data
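The query example on this slide can be tried on a toy table. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for Hive; the column names follow Table 1, while the table name `meterdata`, the ISO date strings, and all row values are made up for illustration.

```python
import sqlite3

# Hypothetical toy data; SQLite stands in for Hive in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE meterdata (userId INTEGER, powerConsumed REAL, ts TEXT)"
)
rows = [
    (50, 9.0, "2013-01-10"),     # userId below the range
    (100, 10.0, "2013-01-01"),   # inside (bounds are inclusive here)
    (500, 14.0, "2013-01-15"),   # inside
    (1000, 12.0, "2013-02-01"),  # inside (bounds are inclusive here)
    (500, 99.0, "2013-03-01"),   # date after the range
]
conn.executemany("INSERT INTO meterdata VALUES (?, ?, ?)", rows)


def average_consumption(conn, user_lo, user_hi, date_lo, date_hi):
    """Two-dimensional range query with an aggregate over the matches."""
    cur = conn.execute(
        "SELECT AVG(powerConsumed) FROM meterdata "
        "WHERE userId BETWEEN ? AND ? AND ts BETWEEN ? AND ?",
        (user_lo, user_hi, date_lo, date_hi),
    )
    return cur.fetchone()[0]


avg = average_consumption(conn, 100, 1000, "2013-01-01", "2013-02-01")
print(avg)
```

Both predicates are ranges and the result is a single aggregate, which is the query shape the rest of the talk targets.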

5/19

Why Migrate to Hadoop/Hive

Limitations of RDBMS: low write throughput; weak scalability; high license cost

Hadoop and Hive: high write throughput; flexible scalability; low budget and cost effective

Hadoop/Hive is a good choice for solving the smart meter big data problem

Figure 2 Write throughput comparison of RDBMS and HDFS

16 times faster with only 1/10 cost

6/19

Indexes in Hive

Index: Compact Index, Aggregate Index, and Bitmap Index; each stores all combinations of the index dimensions together with their locations

Data partition: each partition is a directory; data is reorganized into different directories

Disadvantages for multi-dimensional range queries when the index dimensions have a large number of distinct values:

Index: large index table size

Data partition: lots of directories and small files

| Column Name | Type |
|---|---|
| Index dimension 1 | Type in base table |
| Index dimension 2 | Type in base table |
| Index dimension 3 | Type in base table |
| _bucketname | string |
| _offset | Array<bigint> |

Table 2 3-dimensional compact index
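To make the layout of Table 2 concrete, here is a minimal in-memory sketch of a compact-style index. It is a simplified model and not Hive's implementation: the function name, file names, offsets, and rows are all hypothetical. It groups offsets by the combination of index-dimension values plus the file name, so the number of entries grows with the number of distinct combinations.

```python
from collections import defaultdict


def build_compact_index(records, index_dims):
    """Map (dimension values..., bucketname) -> list of offsets.

    records is an iterable of (bucketname, offset, row_dict) tuples.
    """
    index = defaultdict(list)
    for bucketname, offset, row in records:
        key = tuple(row[d] for d in index_dims) + (bucketname,)
        index[key].append(offset)
    return dict(index)


# Hypothetical rows: file name, offset in that file, and column values.
records = [
    ("file_a", 0,   {"userId": 1, "regionId": 1, "time": 1}),
    ("file_a", 64,  {"userId": 2, "regionId": 1, "time": 1}),
    ("file_a", 128, {"userId": 3, "regionId": 1, "time": 1}),
    ("file_b", 0,   {"userId": 1, "regionId": 1, "time": 1}),
    ("file_b", 64,  {"userId": 4, "regionId": 1, "time": 1}),
]

idx3 = build_compact_index(records, ["userId", "regionId", "time"])
idx2 = build_compact_index(records, ["regionId", "time"])
print(len(idx3), len(idx2))
```

In this toy data the high-cardinality `userId` dimension gives one entry per row, while the two-dimensional index collapses to one entry per file.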

7/19

Limitations of Indexing in Hive

Limitations:

Storing combinations of the index dimensions leads to an extremely large index table

High selectivity leads to a large temporary file, which may overflow the JobTracker's memory

Poor filter effect when the values of an index dimension are scattered evenly in the data file

[Diagram labels: Predicate; Scan Index Table; MR Job; Temporary File (File:Offset); JobTracker; InputFormat.getSplits; Chosen Splits; steps numbered 1 to 6]

Figure 3 Query with current index

8/19

Recall Data Features

Smart meter data features: time stamp field; append only; static schema

Query features: multi-dimensional range queries; many aggregation queries

9/19

DGFIndex Design

Use a grid file to split the logical data space into units (GFUs)

Data in the same GFU is stored together in an HDFS file; this segment is called a Slice

Each GFU is stored as a GFUKey/GFUValue pair in a key/value store

GFUKey is the lower-left coordinate of the GFU in the data space

GFUValue consists of a header and a location

The header contains pre-computed aggregation values

The location is the start and end offset of the corresponding data segment in the HDFS file

Figure 4 DGFIndex architecture

[Diagram labels: 2-dimensional DGFIndex over dimension X and dimension Y; each GFU has a GFUKey and a GFUValue (header with aggregation values, location); the location points to a Slice in the data file on HDFS]
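The GFUKey/GFUValue layout described on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions: the class and function names, the dict used as the key/value store, the file name `data`, and the grid parameters (minimum and interval per dimension) are hypothetical and not taken from the DGFIndex code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    name: str
    minimum: int   # lower bound of the first interval (assumed value)
    interval: int  # width of every interval (assumed value)

    def cell_start(self, value: int) -> int:
        """Lower bound of the interval that contains value."""
        steps = (value - self.minimum) // self.interval
        return self.minimum + steps * self.interval


@dataclass
class GFUValue:
    header: dict   # pre-computed aggregation values, e.g. {"sum": 0.8}
    filename: str  # data file on HDFS that holds the Slice
    start: int     # start offset of the Slice
    end: int       # end offset of the Slice


def gfu_key(dims, record):
    """Join the lower-left coordinates of the record's GFU, e.g. '7_13'."""
    return "_".join(str(d.cell_start(record[d.name])) for d in dims)


dims = [Dimension("X", 1, 3), Dimension("Y", 11, 2)]
store = {}  # a plain dict stands in for the key/value store
record = {"X": 9, "Y": 14, "value": 0.8}
key = gfu_key(dims, record)
store[key] = GFUValue(header={"sum": record["value"]},
                      filename="data", start=0, end=0)
print(key, store[key])
```

Because the key is computed from the record's coordinates alone, a lookup for any point or region needs no scan of the index.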

10/19

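DGFIndex Construction

Input (file test):

| A | B | C |
|---|---|---|
| 1 | 14 | 0.1 |
| 5 | 18 | 0.5 |
| 7 | 12 | 1.2 |
| 2 | 11 | 0.5 |
| 9 | 14 | 0.8 |
| 11 | 16 | 1.3 |
| 3 | 18 | 0.9 |
| 12 | 12 | 0.3 |
| 8 | 13 | 0.2 |

Splitting policy: the grid splits dimension A at 1, 4, 7, 10, 13 and dimension B at 11, 13, 15, 17, 19, so the record A:9|B:14 maps to GFUKey 7_13.

Map phase (Mapper): Input: {36, 9|14|0.8}; Output: {7_13, 9|14|0.8}

Reduce phase (Reducer): Input: {7_13, <9|14|0.8, 8|13|0.2>}; Output: {null, <9|14|0.8, 8|13|0.2>}

Output (file test after reorganization):

| A | B | C |
|---|---|---|
| 1 | 14 | 0.1 |
| 5 | 18 | 0.5 |
| 7 | 12 | 1.2 |
| 2 | 11 | 0.5 |
| 9 | 14 | 0.8 |
| 8 | 13 | 0.2 |
| 3 | 18 | 0.9 |
| 12 | 12 | 0.3 |
| 11 | 16 | 1.3 |

DGFIndex table (GFUValue = header + location):

| GFUKey | header: sum(C) | location: filename | location: start | location: end |
|---|---|---|---|---|
| 1_13 | 0.1 | test | 0 | 0 |
| 4_17 | 0.5 | test | 9 | 9 |
| 7_11 | 1.2 | test | 18 | 18 |
| 1_11 | 0.5 | test | 27 | 27 |
| 7_13 | 1 | test | 36 | 45 |
| 1_17 | 0.9 | test | 54 | 54 |
| 10_11 | 0.3 | test | 63 | 63 |
| 10_15 | 1.3 | test | 72 | 72 |

Figure 6 DGFIndex construction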
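The construction example can be replayed in a few lines. The sketch below is a single-process simulation, not the actual MapReduce job: the function names are hypothetical, the splitting policy (minimum 1 and interval 3 for A, minimum 11 and interval 2 for B) is read off the grid axes, and Slices are written in order of first appearance, so their order and some offsets may differ from the slide.

```python
# (A, B, C) rows of the input file "test" from this slide.
RECORDS = [
    (1, 14, 0.1), (5, 18, 0.5), (7, 12, 1.2),
    (2, 11, 0.5), (9, 14, 0.8), (11, 16, 1.3),
    (3, 18, 0.9), (12, 12, 0.3), (8, 13, 0.2),
]
# Splitting policy read off the grid axes: (minimum, interval).
POLICY = {"A": (1, 3), "B": (11, 2)}


def gfu_key(a, b):
    """GFUKey = lower-left corner of the grid cell containing (a, b)."""
    a_min, a_step = POLICY["A"]
    b_min, b_step = POLICY["B"]
    a_cell = a_min + (a - a_min) // a_step * a_step
    b_cell = b_min + (b - b_min) // b_step * b_step
    return f"{a_cell}_{b_cell}"


def map_phase(records):
    """Emit (GFUKey, record) for every input record."""
    for a, b, c in records:
        yield gfu_key(a, b), (a, b, c)


def reduce_phase(pairs):
    """Write each GFU's records contiguously and build the index table."""
    groups = {}
    for key, rec in pairs:  # stands in for the shuffle step
        groups.setdefault(key, []).append(rec)
    reorganized, index, offset = [], {}, 0
    for key, recs in groups.items():
        starts = []
        for a, b, c in recs:
            line = f"{a} {b} {c}\n"
            starts.append(offset)
            reorganized.append(line)
            offset += len(line.encode("utf-8"))
        index[key] = {
            "sum(C)": sum(r[2] for r in recs),  # pre-computed header
            "filename": "test",
            "start": starts[0],  # offset of the first record in the Slice
            "end": starts[-1],   # offset of the last record in the Slice
        }
    return reorganized, index


reorganized, index = reduce_phase(map_phase(RECORDS))
for key, value in index.items():
    print(key, value)
```

The two records that fall into GFU 7_13 end up adjacent in the reorganized file, which is what lets a query read a whole Slice with one contiguous scan.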

11/19

DGFIndex Query

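Query:

SELECT SUM(C)
FROM test
WHERE A>=5 AND A<12 AND B>=12 AND B<16

Grid: dimension A is split at 1, 4, 7, 10, 13; dimension B at 11, 13, 15, 17, 19.

DGFIndex table (GFUValue = header + location):

| GFUKey | header: sum(C) | location: filename | location: start | location: end |
|---|---|---|---|---|
| 1_13 | 0.1 | test | 0 | 0 |
| 4_17 | 0.5 | test | 9 | 9 |
| 7_11 | 1.2 | test | 18 | 18 |
| 1_11 | 0.5 | test | 27 | 27 |
| 7_13 | 1 | test | 36 | 45 |
| 1_17 | 0.9 | test | 54 | 54 |
| 10_11 | 0.3 | test | 63 | 63 |
| 10_15 | 1.3 | test | 72 | 72 |

Step 1 (Hive): the GFU that lies entirely inside the query region (7_13) is answered from its pre-computed header, sum(C) = 1.0. The locations of the Slices whose GFUs only partly overlap the region are collected:

| filename | start | end |
|---|---|---|
| test | 18 | 18 |
| test | 63 | 63 |
| test | 72 | 72 |

Step 2 (InputFormat.getSplits): every split is checked for overlap with these Slices, and only overlapping splits are chosen. Here the chosen split test:0 carries splitList<start, end> = 18, 18, 63, 63, 72, 72.

Step 3 (Reader.next): the reader skips the data between the listed Slices and reads only the Slices at offsets 18, 63, and 72, which yields 1.2.

Combine: 1.0 + 1.2 = 2.2

Figure 7 DGFIndex query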
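The query walk-through can be checked with a short simulation. This is a sketch under stated assumptions: the index contents are copied from the slide, the records of each Slice are stored inline instead of being located by (filename, start, end), and the whole index is scanned for simplicity rather than looking up only the keys covered by the query region. The function name `query_sum` is hypothetical.

```python
# GFUKey (as a tuple) -> (pre-computed sum(C), records in the Slice).
INDEX = {
    (1, 13): (0.1, [(1, 14, 0.1)]),
    (4, 17): (0.5, [(5, 18, 0.5)]),
    (7, 11): (1.2, [(7, 12, 1.2)]),
    (1, 11): (0.5, [(2, 11, 0.5)]),
    (7, 13): (1.0, [(9, 14, 0.8), (8, 13, 0.2)]),
    (1, 17): (0.9, [(3, 18, 0.9)]),
    (10, 11): (0.3, [(12, 12, 0.3)]),
    (10, 15): (1.3, [(11, 16, 1.3)]),
}
A_STEP, B_STEP = 3, 2  # interval widths read off the grid axes


def query_sum(a_lo, a_hi, b_lo, b_hi):
    """SUM(C) for a_lo <= A < a_hi and b_lo <= B < b_hi."""
    total = 0.0
    slices_read = []
    for (a0, b0), (header_sum, records) in INDEX.items():
        a1, b1 = a0 + A_STEP, b0 + B_STEP  # GFU covers [a0, a1) x [b0, b1)
        if a1 <= a_lo or a0 >= a_hi or b1 <= b_lo or b0 >= b_hi:
            continue  # GFU does not overlap the query region
        inner = a_lo <= a0 and a1 <= a_hi and b_lo <= b0 and b1 <= b_hi
        if inner:
            total += header_sum  # answered from the header, no data read
        else:
            slices_read.append((a0, b0))  # boundary GFU: read and filter
            total += sum(
                c for a, b, c in records
                if a_lo <= a < a_hi and b_lo <= b < b_hi
            )
    return total, slices_read


total, slices_read = query_sum(5, 12, 12, 16)
print(total, slices_read)
```

Only the three boundary Slices are read, and the inner GFU contributes through its header alone, which is why pre-computation reduces the data read for aggregation queries.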

12/19

Advantages of DGFIndex

Smaller index size; high index read speed, selective; pre-computation

[Diagram labels: 2-dimensional DGFIndex over Dimension X and Dimension Y; each GFU has a GFUKey and a GFUValue (Header, Location) and corresponds to a Slice in the data file on HDFS]

Figure 8 DGFIndex architecture

13/19

Experiments

Comparison systems: Hive with Compact Index, HadoopDB

Environment

Hardware: 29 virtual nodes, each with 8 cores, 8GB RAM, and a 300GB disk

Software: CentOS 6.5 64-bit, JDK 1.6.0_45 64-bit, Hadoop-1.2.1, HBase-0.94.13; DGFIndex is implemented in Hive-0.10.0; HDFS replication factor is 2; mapred.task.io.sort.mb=512MB; PostgreSQL 8.4.20 for HadoopDB

Data sets and queries

Real meter data (1TB in TextFile and 890GB in RCFile, no compression) and ad-hoc queries from Zhejiang Grid

Lineitem table (518GB in TextFile and 468GB in RCFile, no compression) and Q6 from TPC-H

14/19

Index Size and Construction Time

| Dimension Name | # of distinct values |
|---|---|
| UserId | 14,000,000 |
| RegionId | 11 |
| Time | 30 |

Table 3 The number of distinct values in index dimensions

| Name | # of intervals in UserId |
|---|---|
| DGF-L | 100 |
| DGF-M | 1,000 |
| DGF-S | 10,000 |

Table 4 The number of intervals in UserId dimension

| Index Type | Table Type | # of index dimensions | Index dimensions | Size | Time (s) |
|---|---|---|---|---|---|
| Compact | RCFile | 3 | UserId, RegionId, Time | 821GB | 23,350 |
| Compact | RCFile | 2 | RegionId, Time | 7MB | 1,884 |
| DGF-L | TextFile | 3 | UserId, RegionId, Time | 0.94MB | 25,816 |
| DGF-M | TextFile | 3 | UserId, RegionId, Time | 3MB | 25,632 |
| DGF-S | TextFile | 3 | UserId, RegionId, Time | 13MB | 26,027 |

Table 5 Index size and construction time

DGFIndex construction takes slightly longer than the 3-dimensional Compact Index, but the resulting index is far smaller (0.94MB to 13MB versus 821GB)

15/19

Aggregation Query

SELECT SUM(powerConsumed)
FROM meterdata
WHERE regionId>r1 AND regionId<r2
AND userId>u1 AND userId<u2
AND time>t1 AND time<t2

Listing 5 Aggregation query

Figure 9 Point query

Figure 10 5% selectivity

| Index Type | Point | 5% | 12% |
|---|---|---|---|
| Compact | 169,395,953 | 4,756,501,768 | 6,586,886,752 |
| DGF-L | 4,347,200 | 67,678 | 100,386 |
| DGF-M | 4,258,358 | 20,280 | 31,215 |
| DGF-S | 2,291,718 | 16,122 | 23,712 |
| Accurate | 26 | 569,186,384 | 1,354,351,336 |

Table 6 Number of records that must be read after filtering by the index

For aggregation queries, DGFIndex is 2-50 times faster than Compact Index and HadoopDB

16/19

GroupBy Query

SELECT time, SUM(powerConsumed)
FROM meterdata
WHERE regionId>r1 AND regionId<r2
AND userId>u1 AND userId<u2
AND time>t1 AND time<t2
GROUP BY time

Listing 6 GroupBy query

| Index Type | Point | 5% | 12% |
|---|---|---|---|
| Compact | 169,395,953 | 4,756,501,768 | 6,586,886,752 |
| DGF-L | 4,347,200 | 681,321,681 | 1,433,931,728 |
| DGF-M | 4,258,358 | 641,128,331 | 1,401,070,456 |
| DGF-S | 2,291,718 | 572,231,864 | 1,367,754,156 |
| Accurate | 26 | 569,186,384 | 1,354,351,336 |

Table 7 Number of records that must be read after filtering by the index

Figure 12 Point query



For non-aggregation queries, DGFIndex is 2-5 times faster than Compact Index and HadoopDB, and needs to read only 1/40 to 1/5 of the data that Compact Index reads.
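The read-volume claim can be checked directly against Table 7. The snippet below is a small arithmetic sketch; the figures are copied from the table, and the helper name `read_reduction` is hypothetical.

```python
# Records read after index filtering, copied from Table 7.
COMPACT = {"point": 169_395_953, "5%": 4_756_501_768, "12%": 6_586_886_752}
DGF = {
    "DGF-L": {"point": 4_347_200, "5%": 681_321_681, "12%": 1_433_931_728},
    "DGF-M": {"point": 4_258_358, "5%": 641_128_331, "12%": 1_401_070_456},
    "DGF-S": {"point": 2_291_718, "5%": 572_231_864, "12%": 1_367_754_156},
}


def read_reduction(variant):
    """How many times fewer records a DGFIndex variant reads than Compact."""
    return {q: COMPACT[q] / DGF[variant][q] for q in COMPACT}


for name in DGF:
    ratios = read_reduction(name)
    print(name, {q: round(r, 1) for q, r in ratios.items()})
```

For DGF-L the ratios come out at roughly 39 for the point query, 7 at 5% selectivity, and 4.6 at 12% selectivity, which is in line with the 1/40 to 1/5 range stated on the slide.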

17/19

TPC-H Data Set and Q6

SELECT SUM(l_extendedprice*l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= date '[DATE]'
AND l_shipdate < date '[DATE]' + interval '1' year
AND l_discount BETWEEN [DISCOUNT]-0.01 AND [DISCOUNT]+0.01
AND l_quantity < [QUANTITY]

Listing 8 Q6 from TPC-H

| Index Type | Table Type | # of index dimensions | Size | Time (s) |
|---|---|---|---|---|
| Compact | RCFile | 3 | 189GB | 7,367 |
| Compact | RCFile | 2 (l_discount, l_quantity) | 637MB | 991 |
| DGFIndex | TextFile | 3 | 4.3MB | 10,997 |

Table 8 Index size and construction time

| Index Type | Record Number |
|---|---|
| Whole table | 4,095,002,340 |
| Compact-3D | 4,095,002,340 |
| Compact-2D | 4,095,002,340 |
| DGFIndex | 85,430,966 |
| Accurate | 77,955,077 |

Table 10 Number of records that must be read after filtering by the index

Figure 16 Q6 from TPC-H cost time

DGFIndex is also efficient on general-purpose data such as the TPC-H data set

18/19

Conclusions

A multi-dimensional range index is essential for Hive-based smart meter data processing

We propose a cost effective multi-dimensional range index for Hadoop/Hive

Experimental results show the efficiency of our DGFIndex

19/19

Thanks

Questions?
