1/19
DGFIndex for Smart Grid: Enhancing Hive with a Cost Effective Multi-dimensional Range Index
Yue Liu, Songlin Hu*, Tilmann Rabl, Wantao Liu, Hans-Arno Jacobsen, Kaifeng Wu, Jian Chen, Jintao Li
Middleware Systems Research Group (MSRG.ORG)
State Grid Zhejiang Electric Power Company (国网浙江省电力公司)
State Grid Electric Power Research Institute (国网电力科学研究院)
2/19
Outline
Big Data challenges in Smart Grid
DGFIndex design
Experiments on smart grid data
Conclusions
3/19
Big Data Challenges in Smart Grid
Figure labels: Smart Meter; GPRS, CDMA 230MHz; Collector; RDBMS
The Electricity Consumption Information Collection System of the Grid
Example: 22 million smart meters in Zhejiang province. China State Grid requires 96 measurements per meter per day, which gives about 2.1 billion new records per day (22 million × 96) in a single table.
Figure 1 Data flow in State Grid
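The volume figure on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the two input numbers come from the slide, while the variable names are chosen here for illustration.

```python
# Back-of-the-envelope estimate using the two numbers given on the slide.
METERS = 22_000_000        # smart meters in Zhejiang province
READINGS_PER_DAY = 96      # measurements required per meter per day

records_per_day = METERS * READINGS_PER_DAY
minutes_between_readings = 24 * 60 / READINGS_PER_DAY

print(f"{records_per_day:,} records per day")
print(f"one reading every {minutes_between_readings:.0f} minutes")
```

The product is 2,112,000,000, which is the "2.1 billion" quoted above, and 96 readings per day correspond to one reading every 15 minutes.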
4/19
Features of Smart Meter Data
Smart meter data features: timestamp field; append-only; static schema
Query example: What is the average power consumption for user IDs in the range 100 to 1000 and dates in the range "2013-01-01" to "2013-02-01"?
Query features: multi-dimensional range queries; many aggregation queries
UserId | PowerConsumed | TimeStamp | PATE with Rate 1 | Other Metrics
24012 | 12.34 | 1332988833 | 10.45 | …
Table 1 An example format of smart meter data
5/19
Why Migrate to Hadoop/Hive
Limitations of RDBMS: low write throughput; weak scalability; high license cost
Hadoop and Hive: high write throughput; flexible scalability; low budget and cost-effective
Hadoop/Hive is a good choice for solving the smart meter big data problem
Figure 2 Write throughput comparison of RDBMS and HDFS
Result: 16 times faster with only 1/10 of the cost
6/19
Indexes in Hive
Index: Compact Index, Aggregate Index, and Bitmap Index; each stores all combinations of the index dimension values together with their locations
Data partition: each partition is a directory; the data is reorganized into different directories
Disadvantages for multi-dimensional range queries when the index dimensions have a large number of distinct values:
Index: large index table size
Data partition: many directories and small files
Column Name | Type
Index dimension 1 | Type in base table
Index dimension 2 | Type in base table
Index dimension 3 | Type in base table
_bucketname | string
_offset | Array<bigint>
Table 2 3-dimensional compact index
7/19
Limitations of Indexing in Hive
Limitations:
Storing every combination of index dimension values leads to an extremely large index table
High selectivity (a large fraction of matching records) leads to a large temporary file, which may overflow the JobTracker's memory
Poor filtering effect when the values of an index dimension are scattered evenly across the data file
Figure 3 Query with current index (labels: Predicate; MR Job; Scan Index Table; Temporary File with File:Offset entries; JobTracker InputFormat.getSplits; Chosen Splits; steps numbered 1 to 6)
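The third limitation can be illustrated with a small Python sketch of the split selection step. It assumes, as the figure suggests, that a split is chosen whenever it contains at least one matching offset; the function name chosen_splits, the file size, and the split size are hypothetical values picked for the example, not Hive's actual code or defaults.

```python
def chosen_splits(offsets, file_size, split_size):
    """Return the (start, end) byte ranges of splits that contain a matching offset."""
    hit = {off // split_size for off in offsets if 0 <= off < file_size}
    return [(i * split_size, min((i + 1) * split_size, file_size)) for i in sorted(hit)]

FILE_SIZE = 1_000    # hypothetical file size in bytes
SPLIT_SIZE = 100     # hypothetical split size in bytes

clustered = [205, 210, 250, 290]              # all matches sit in one split
scattered = list(range(5, FILE_SIZE, 100))    # one match in every split

print(chosen_splits(clustered, FILE_SIZE, SPLIT_SIZE))       # [(200, 300)]
print(len(chosen_splits(scattered, FILE_SIZE, SPLIT_SIZE)))  # 10
```

With clustered matches only one of ten splits is read. With evenly scattered matches every split is chosen, so the index filters nothing even though only ten records match.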
8/19
Recall Data Features
Smart meter data features: timestamp field; append-only; static schema
Query features: multi-dimensional range queries; many aggregation queries
9/19
DGFIndex Design
A grid file splits the logical data space into grid file units (GFUs)
Data in the same GFU is stored together in an HDFS file; this stored segment is called a Slice
Each GFU is stored as a GFUKey/GFUValue pair in a key/value store
GFUKey is the lower-left coordinate of the GFU in the data space
GFUValue consists of a header and a location
The header contains pre-computed aggregation values
The location is the start and end offset of the corresponding data segment in the HDFS file
Figure 4 DGFIndex architecture (labels: 2-dimensional DGFIndex over dimension X and dimension Y; GFU; GFUKey; GFUValue with header (aggregation values) and location; Slice; data file on HDFS)
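A minimal Python sketch of how a record could be mapped to its GFUKey, that is, to the lower-left corner of the grid cell containing it. The grid origin and interval sizes are taken from the small example on the construction slide (A starts at 1 with interval 3, B starts at 11 with interval 2). The function name gfu_key and its parameters are hypothetical and are not the authors' implementation.

```python
def gfu_key(point, minimums, intervals):
    """Map index-dimension values to the lower-left corner of the enclosing GFU."""
    corner = []
    for value, lo, width in zip(point, minimums, intervals):
        cell = (value - lo) // width          # which interval the value falls into
        corner.append(lo + cell * width)      # lower bound of that interval
    return "_".join(str(c) for c in corner)

MINIMUMS = (1, 11)    # smallest grid line per dimension (A, B)
INTERVALS = (3, 2)    # interval size per dimension (A, B)

print(gfu_key((9, 14), MINIMUMS, INTERVALS))   # 7_13
print(gfu_key((8, 13), MINIMUMS, INTERVALS))   # 7_13
print(gfu_key((12, 12), MINIMUMS, INTERVALS))  # 10_11
```

Records (9, 14) and (8, 13) fall into the same GFU, so they share one index entry and are stored next to each other in the same Slice.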
10/19
DGFIndex Construction
Input: table test
A | B | C
1 | 14 | 0.1
5 | 18 | 0.5
7 | 12 | 1.2
2 | 11 | 0.5
9 | 14 | 0.8
11 | 16 | 1.3
3 | 18 | 0.9
12 | 12 | 0.3
8 | 13 | 0.2
Splitting policy: dimension A is split at 1, 4, 7, 10, 13; dimension B is split at 11, 13, 15, 17, 19 (example: A:9|B:14 falls in GFU 7_13)
Map phase (Mapper): Input: {36, 9|14|0.8}; Output: {7_13, 9|14|0.8}
Reduce phase (Reducer): Input: {7_13, <9|14|0.8, 8|13|0.2>}; Output: {null, <9|14|0.8, 8|13|0.2>}
Output: table test after reorganization
A | B | C
1 | 14 | 0.1
5 | 18 | 0.5
7 | 12 | 1.2
2 | 11 | 0.5
9 | 14 | 0.8
8 | 13 | 0.2
3 | 18 | 0.9
12 | 12 | 0.3
11 | 16 | 1.3
DGFIndex table
GFUKey | GFUValue header: sum(C) | GFUValue location: filename | start | end
1_13 | 0.1 | test | 0 | 0
4_17 | 0.5 | test | 9 | 9
7_11 | 1.2 | test | 18 | 18
1_11 | 0.5 | test | 27 | 27
7_13 | 1 | test | 36 | 45
1_17 | 0.9 | test | 54 | 54
10_11 | 0.3 | test | 63 | 63
10_15 | 1.3 | test | 72 | 72
Figure 6 DGFIndex construction
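The construction step can be sketched in plain Python, without Hadoop, on the nine-row test table from this slide. This is an illustration under stated assumptions, not the authors' MapReduce job: records are assumed to be 9 bytes wide (which matches the offsets 0, 9, 18, ... on the slide), "end" is taken to be the offset of the last record in a Slice (as in the 36 to 45 entry), and the names build_index, gfu_key, and RECORD_BYTES are hypothetical.

```python
from collections import defaultdict

RECORD_BYTES = 9  # assumption: fixed-width records, matching offsets 0, 9, 18, ... on the slide

def gfu_key(a, b):
    # Splitting policy from the slide: A starts at 1 with interval 3, B at 11 with interval 2.
    return f"{1 + (a - 1) // 3 * 3}_{11 + (b - 11) // 2 * 2}"

def build_index(records):
    # "Map": tag each record with its GFUKey. "Reduce": group records sharing a key.
    groups = defaultdict(list)
    for a, b, c in records:
        groups[gfu_key(a, b)].append((a, b, c))

    reorganized, index, offset = [], {}, 0
    for key, slice_records in groups.items():
        start = offset
        end = offset + (len(slice_records) - 1) * RECORD_BYTES  # offset of the Slice's last record
        header_sum = round(sum(r[2] for r in slice_records), 10)  # pre-computed sum(C)
        index[key] = {"sum_c": header_sum, "start": start, "end": end}
        reorganized.extend(slice_records)  # records of one GFU are written contiguously
        offset = end + RECORD_BYTES
    return reorganized, index

TEST = [(1, 14, 0.1), (5, 18, 0.5), (7, 12, 1.2), (2, 11, 0.5), (9, 14, 0.8),
        (11, 16, 1.3), (3, 18, 0.9), (12, 12, 0.3), (8, 13, 0.2)]

reorganized, index = build_index(TEST)
print(index["7_13"])
```

The header sums match the DGFIndex table on the slide for all eight GFUs, and 7_13 covers offsets 36 to 45. This sketch writes Slices in first-seen order, so the offsets of the last three Slices differ from the slide, where the Slice order follows the reducer output.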
11/19
DGFIndex Query
Query:
SELECT SUM(C)
FROM test
WHERE A>=5 AND A<12 AND B>=12 AND B<16
Step 1 (Hive): look up the GFUs covered by the query region in the DGFIndex table (the same table as in Figure 6). GFU 7_13 lies completely inside the query region, so its pre-computed header value sum(C) = 1.0 is used directly. The GFUs on the boundary of the query region give the Slice locations that still have to be read:
filename | start | end
test | 18 | 18
test | 63 | 63
test | 72 | 72
Step 2 (InputFormat.getSplits): of all splits, only those that overlap these locations are kept and combined into the chosen splits; split list<start, end> for test:0 is 18, 18, 63, 63, 72, 72
Step 3 (Reader.next): within file test the reader skips everything except offsets 18, 63 and 72; the records read there that satisfy the predicate contribute 1.2
Result: 1.0 + 1.2 = 2.2
Figure 7 DGFIndex query
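The query steps can be replayed in a short Python sketch on the DGFIndex table and the reorganized file shown on the slides. It is a simplified, single-process illustration under stated assumptions: GFUs fully inside the query region are answered from their header, GFUs that only partly overlap it are read from their Slice and filtered record by record, and the names query_sum, INDEX, DATA, WIDTH, and RECORD_BYTES are hypothetical.

```python
# DGFIndex table from the slide: GFUKey (A, B) -> (sum(C), start, end)
INDEX = {
    (1, 13): (0.1, 0, 0), (4, 17): (0.5, 9, 9), (7, 11): (1.2, 18, 18),
    (1, 11): (0.5, 27, 27), (7, 13): (1.0, 36, 45), (1, 17): (0.9, 54, 54),
    (10, 11): (0.3, 63, 63), (10, 15): (1.3, 72, 72),
}
# Reorganized file from the slide: byte offset -> record (A, B, C)
DATA = {
    0: (1, 14, 0.1), 9: (5, 18, 0.5), 18: (7, 12, 1.2), 27: (2, 11, 0.5),
    36: (9, 14, 0.8), 45: (8, 13, 0.2), 54: (3, 18, 0.9), 63: (12, 12, 0.3),
    72: (11, 16, 1.3),
}
WIDTH = (3, 2)     # interval size of A and B
RECORD_BYTES = 9   # assumption: fixed-width records

def query_sum(a_lo, a_hi, b_lo, b_hi):
    """SUM(C) WHERE a_lo <= A < a_hi AND b_lo <= B < b_hi."""
    from_headers, from_slices, slices_read = 0.0, 0.0, []
    for (ka, kb), (sum_c, start, end) in INDEX.items():
        overlaps = ka < a_hi and ka + WIDTH[0] > a_lo and kb < b_hi and kb + WIDTH[1] > b_lo
        if not overlaps:
            continue
        inner = a_lo <= ka and ka + WIDTH[0] <= a_hi and b_lo <= kb and kb + WIDTH[1] <= b_hi
        if inner:
            from_headers += sum_c              # answered from the pre-computed header
            continue
        slices_read.append((start, end))       # boundary GFU: its Slice must be read
        for off in range(start, end + 1, RECORD_BYTES):
            a, b, c = DATA[off]
            if a_lo <= a < a_hi and b_lo <= b < b_hi:
                from_slices += c
    return from_headers, from_slices, sorted(slices_read)

headers, slices, read = query_sum(5, 12, 12, 16)
print(headers, slices, read)
```

For the query on the slide, the header part is 1.0 (GFU 7_13), the Slices at offsets 18, 63 and 72 are read, and only the record at offset 18 satisfies the predicate and contributes 1.2, giving the total of 2.2 shown in Figure 7.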
12/19
Advantages of DGFIndex
Smaller index size; high index read speed, selective; pre-computation
Figure 8 DGFIndex architecture (labels: 2-dimensional DGFIndex over Dimension X and Dimension Y; GFU; GFUKey; GFUValue with Header and Location; Slice; data file on HDFS)
13/19
Experiments
Comparison systems: Hive with Compact Index, HadoopDB
Environment
Hardware: 29 virtual nodes, each with 8 cores, 8GB RAM, and a 300GB disk
Software: CentOS 6.5 64-bit, JDK 1.6.0_45 64-bit, Hadoop-1.2.1, HBase-0.94.13; DGFIndex is implemented in Hive-0.10.0; replication factor is 2 in HDFS; mapred.task.io.sort.mb=512MB; PostgreSQL 8.4.20 for HadoopDB
Data sets and queries:
Real meter data (1TB in TextFile and 890GB in RCFile, no compression) and ad-hoc queries from Zhejiang Grid
Lineitem table (518GB in TextFile and 468GB in RCFile, no compression) and Q6 from TPC-H
14/19
Index Size and Construction Time
Dimension Name | # of distinct values
UserId | 14,000,000
RegionId | 11
Time | 30
Table 3 The number of distinct values in index dimensions
Name | # of intervals in UserId
DGF-L | 100
DGF-M | 1,000
DGF-S | 10,000
Table 4 The number of intervals in UserId dimension
Index Type | Table Type | # of index dimensions | Index dimensions | Size | Time (s)
Compact | RCFile | 3 | UserId, RegionId, Time | 821GB | 23,350
Compact | RCFile | 2 | RegionId, Time | 7MB | 1,884
DGF-L | TextFile | 3 | UserId, RegionId, Time | 0.94MB | 25,816
DGF-M | TextFile | 3 | UserId, RegionId, Time | 3MB | 25,632
DGF-S | TextFile | 3 | UserId, RegionId, Time | 13MB | 26,027
Table 5 Index size and construction time
DGFIndex construction takes more time, but the resulting index is far smaller
15/19
Aggregation Query
SELECT SUM(powerConsumed)
FROM meterdata
WHERE regionId>r1 AND regionId<r2
  AND userId>u1 AND userId<u2
  AND time>t1 AND time<t2
Listing 5 Aggregation query
Figure 9 Point query
Figure 10 5% selectivity
Figure 11 12% selectivity
Index Type | Point | 5% | 12%
Compact | 169,395,953 | 4,756,501,768 | 6,586,886,752
DGF-L | 4,347,200 | 67,678 | 100,386
DGF-M | 4,258,358 | 20,280 | 31,215
DGF-S | 2,291,718 | 16,122 | 23,712
Accurate | 26 | 569,186,384 | 1,354,351,336
Table 6 Number of records that must be read after filtering by the index
For aggregation queries, DGFIndex is 2 to 50 times faster than Compact Index and HadoopDB
16/19
GroupBy Query
SELECT time, SUM(powerConsumed)
FROM meterdata
WHERE regionId>r1 AND regionId<r2
  AND userId>u1 AND userId<u2
  AND time>t1 AND time<t2
GROUP BY time
Listing 6 GroupBy query
Index Type | Point | 5% | 12%
Compact | 169,395,953 | 4,756,501,768 | 6,586,886,752
DGF-L | 4,347,200 | 681,321,681 | 1,433,931,728
DGF-M | 4,258,358 | 641,128,331 | 1,401,070,456
DGF-S | 2,291,718 | 572,231,864 | 1,367,754,156
Accurate | 26 | 569,186,384 | 1,354,351,336
Table 7 Number of records that must be read after filtering by the index
Figure 12 Point query
Figure 13 5% selectivity
Figure 14 12% selectivity
For non-aggregation queries, DGFIndex is 2 to 5 times faster than Compact Index and HadoopDB, and needs to read only 1/40 to 1/5 of the data that Compact Index reads.
17/19
TPC-H Data Set and Q6
SELECT SUM(l_extendedprice*l_discount) AS revenue
FROM lineitem
WHERE l_shipdate >= date '[DATE]'
  AND l_shipdate < date '[DATE]' + interval '1' year
  AND l_discount BETWEEN [DISCOUNT]-0.01 AND [DISCOUNT]+0.01
  AND l_quantity < [QUANTITY]
Listing 8 Q6 from TPC-H
Index Type | Table Type | # of index dimensions | Size | Time (s)
Compact | RCFile | 3 | 189GB | 7,367
Compact | RCFile | 2 (l_discount, l_quantity) | 637MB | 991
DGFIndex | TextFile | 3 | 4.3MB | 10,997
Table 8 Index size and construction time
Index Type | Record Number
Whole table | 4,095,002,340
Compact-3D | 4,095,002,340
Compact-2D | 4,095,002,340
DGFIndex | 85,430,966
Accurate | 77,955,077
Table 10 Number of records that must be read after filtering by the index
Figure 16 Q6 from TPC-H cost time
DGFIndex is also efficient on general-purpose data such as TPC-H
18/19
Conclusions
A multi-dimensional range index is essential for Hive-based smart meter data processing
We propose a cost-effective multi-dimensional range index for Hadoop/Hive
Experimental results show the efficiency of our DGFIndex
19/19
Thanks
Questions?