On Top-n Reverse Top-k Queries: Variants,
Algorithms, and Applications
陳良弼 Arbee L.P. Chen
National Chengchi University
9/21/2012 at NCHU
IEEE International Conference on Data Engineering (ICDE)
• A premium international conference on databases
• Inaugural conference held in Los Angeles in 1984
• Held in Taiwan in 1995
ICDE2012 Research Papers Distribution
• System Aspects
  – Privacy and Security 8%
  – Storage Management and Performance 7%
  – Entity resolution/Versioning 7%
  – Query Processing 31%
    • Top-k query 9%
    • Distributed/parallel/map-reduce 8%
    • Location-aware 5%
    • Execution Plan 5%
    • Graph indexing 4%
• Text/Web/Keyword Search 19%
• Stream/Trajectory/Sequence/Spatio-Temporal 10%
• Social Media 7%
• Uncertain Database 6%
• Data Mining 5%
Efficient Dual-Resolution Layer Indexing for Top-k Queries, ICDE2012
[Figure: hotels H1 to H9 plotted by (price, distance to the airport), each labeled with a score between 0.4 and 0.55. Points shown: (0.6, 0.2), (0.55, 0.4), (0.45, 0.6), (0.3, 0.7), (0.55, 0.3), (0.3, 0.6), (0.2, 0.7), (0.7, 0.4), (0.5, 0.5). A second panel keeps only H1, H4, H5, H6, H7 and lists these hotels with their scores.]
Answering Why-not Questions on Top-k Queries, ICDE2012
• Top-k query over p1 to p6 with attributes (Cleanliness, Deliciousness, Parking spaces)
  – Tuples: (95,80,40), (70,20,30), (50,90,60), (75,70,50), (85,60,60), (58,20,30)
  – Top-2 with weights (0.4, 0.5, 0.1); scores shown: 82, 41, 71, 69, 70, 36.2
• Why-not question: why is p5 not in my top-2 result?
  – Does p5 not exist?
  – Should I change my weights?
  – Should I revise my query to look for the top-5 hotels?
• Top-2 with weights (0.5, 0.4, 0.1); scores shown: 83.5, 46, 67, 70.5, 40, 71.7
The Min-dist Location Selection Query, ICDE2012
[Figure: clients c1 to c8, facilities f1 and f2, and candidate locations p1 and p2; each client is linked to its nearest facility. One panel shows the nearest facility distances with p1, another with p2.]
Goal: minimize the nearest facility distance
Introduction
• kNN (k-Nearest Neighbors) Queries
Assume k = 3
[Figure: query point q and its three nearest neighbors a, b, c]
kNN(q) = {a, b, c}
Introduction
• RkNN (Reverse k-Nearest Neighbors) Queries
Assume k = 3
[Figure: query point q with data points a and d]
RkNN(q) = {a, …}
Introduction
• BRkNN (Bi-chromatic Reverse k-Nearest Neighbors) Queries
Assume k = 3
[Figure: query point q with data points a and d, drawn from two types of data]
BRkNN(q) = {a, …}
Two types of data
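The three query types above can be contrasted with a brute-force sketch. It is only meant to pin down the definitions; the coordinates, the function names `knn`, `rknn`, `brknn`, and the assumption that q is not itself a member of `points` are all illustrative and do not come from the slides.

```python
from math import dist

def knn(q, points, k):
    """k points nearest to q (a point equal to q is skipped)."""
    others = [p for p in points if p != q]
    return sorted(others, key=lambda p: dist(p, q))[:k]

def rknn(q, points, k):
    """Monochromatic RkNN: points that count q among their own k nearest neighbors.
    Assumes q is not already contained in points."""
    pool = points + [q]
    return [p for p in points if q in knn(p, pool, k)]

def brknn(q, goals, conditions, k):
    """Bichromatic RkNN: condition points for which goal point q is one of
    the k nearest goal points. q must be a member of goals."""
    return [c for c in conditions if q in knn(c, goals, k)]

# Hypothetical data, chosen so that no distances tie.
points = [(1, 0), (0, 2), (5, 5), (6, 5), (5, 6)]
goals = [(0, 0), (4, 0), (10, 0)]
conditions = [(1, 0), (3, 0), (6, 0), (9, 0)]

nearest = knn((0, 0), points, 2)
reverse = rknn((0, 0), points, 2)
bichromatic = brknn((0, 0), goals, conditions, 2)
```

In the bichromatic case the query point and the answers come from different sets, which is what "two types of data" refers to.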
Application I
[Figure: shops and customers]
Which location is the best?
Top-n Reverse kNN Queries
Given two types of data, G (goal) and C (condition), retrieve the n data points from G that have the largest BRkNN values
[Figure: goal points g1, g2, g3 among condition points]
Example: n=2, k=2
BR2NN value of g1 = 4
BR2NN value of g2 = 9
BR2NN value of g3 = 5
BR2Top-2 = {g2, g3}
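A naive way to answer such a query is to compute the BRkNN value of every goal point and keep the n largest. The sketch below does exactly that; the coordinates are hypothetical (they do not reproduce the values 4, 9, 5 from the example), and `brknn_value` and `top_n_reverse_knn` are illustrative names.

```python
import heapq
from math import dist

def brknn_value(name, goals, conditions, k):
    """Number of condition points that have goal `name` among their k nearest goals."""
    count = 0
    for c in conditions:
        nearest = sorted(goals, key=lambda g: dist(goals[g], c))[:k]
        count += name in nearest
    return count

def top_n_reverse_knn(goals, conditions, k, n):
    """Return the n (goal, BRkNN value) pairs with the largest values."""
    values = {g: brknn_value(g, goals, conditions, k) for g in goals}
    return heapq.nlargest(n, values.items(), key=lambda kv: kv[1])

# Hypothetical coordinates for three goal points and five condition points.
goals = {"g1": (0, 0), "g2": (4, 0), "g3": (10, 0)}
conditions = [(1, 0), (3, 0), (6, 0), (9, 0), (11, 0)]

answer = top_n_reverse_knn(goals, conditions, k=2, n=2)
```

This costs one kNN search per condition point for every goal point, which is the work a filter-refinement approach tries to avoid.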
Voronoi Diagram of G
[Figure legend: goal points (VD-nodes) and condition points]
A Filter-Refinement Framework for Solving BRkNN Queries
Assume k = 2
• Lower-bound region of VDi (layer 0)
• Upper-bound region of VDi (layer 0 ~ layer (k-1))
[Figure: Voronoi cell VDi (layer 0) surrounded by its layer 1 cells]
Filter phase
Assume k = 2
Construct bisectors layer by layer to reduce the region
[Figure: Voronoi cell VDi with the constructed bisectors]
Refinement Phase
Assume k = 2
For a data point p, check the VDs at layer 1 ~ layer 2 to determine whether VDi is one of the 2NN of p
[Figure: data point p near VDi]
Refinement Phase
Assume k = 2
Neighbors of VDi, sorted by distance to VDi: (VD13, 1.2) (VD26, 1.4) (VD27, 1.7) (VD3, 1.7) (VD4, 1.8) (VD30, 2.1) (VD5, 2.5) (VD7, 4.8) …
[Figure: p lies 0.9 from VDi and VD30 lies 2.1 from VDi, so dist(p, VD30) > 1.2]
Pruning rule: if dist(VDi, VDj) > 2·dist(VDi, p), then VDj is farther from p than VDi and need not be checked
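The pruning rule follows from the triangle inequality: dist(p, VDj) ≥ dist(VDi, VDj) − dist(VDi, p), so once dist(VDi, VDj) exceeds 2·dist(VDi, p), VDj must be farther from p than VDi. A minimal sketch of the cut-off on the sorted neighbor list, assuming the 0.9 in the figure is dist(VDi, p); the function name `candidates_to_check` is illustrative.

```python
def candidates_to_check(neighbors, dist_to_p):
    """neighbors: (name, dist(VDi, VDj)) pairs sorted by ascending distance.
    Returns the names that survive the rule dist(VDi, VDj) > 2 * dist(VDi, p)."""
    kept = []
    for name, d in neighbors:
        if d > 2 * dist_to_p:
            break  # the list is sorted, so every later neighbor is pruned too
        kept.append(name)
    return kept

# Neighbor list of VDi as shown on the slide.
neighbors_of_vdi = [
    ("VD13", 1.2), ("VD26", 1.4), ("VD27", 1.7), ("VD3", 1.7),
    ("VD4", 1.8), ("VD30", 2.1), ("VD5", 2.5), ("VD7", 4.8),
]

# Assumption: dist(VDi, p) = 0.9, read off the figure.
to_check = candidates_to_check(neighbors_of_vdi, 0.9)
pruned = [name for name, _ in neighbors_of_vdi if name not in to_check]
```

Only the surviving neighbors need an actual distance computation against p; VD30 and everything after it in the list are skipped.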
Application II
Maximum Coverage BRkNN Queries
Retrieve 2 points from dataset G
Assume k = 2
[Figure sequence: BRkNN value = 9; BRkNN value = 8; total = 12; total = 14]
Maximum Coverage BRkNN Queries
• Given:
  – A set of goal points (G)
  – A set of condition points (C)
  – k: the k value of BRkNN
• Goal:
  – Find n points from G, g1, g2, …, gn, which maximize |∪i=1~n BRkNN(gi, G, C)|
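The difference from the top-n query is that overlapping BRkNN sets are counted once. The sketch below makes this concrete with an exhaustive search over all n-subsets; it is only practical for tiny inputs, the BRkNN sets are hypothetical (condition points are represented by integer ids), and `max_coverage` is an illustrative name.

```python
from itertools import combinations

def max_coverage(brknn_sets, n):
    """Try every n-subset of goal points and keep the one whose BRkNN sets
    cover the most distinct condition points."""
    best_combo, best_covered = (), set()
    for combo in combinations(brknn_sets, n):
        covered = set().union(*(brknn_sets[g] for g in combo))
        if len(covered) > len(best_covered):
            best_combo, best_covered = combo, covered
    return best_combo, best_covered

# Hypothetical BRkNN sets: g1 and g2 have the largest values but overlap heavily.
brknn_sets = {
    "g1": {1, 2, 3, 4, 5, 6},
    "g2": {1, 2, 3, 4, 7},
    "g3": {8, 9, 10},
}

combo, covered = max_coverage(brknn_sets, 2)
top2_by_value = sorted(brknn_sets, key=lambda g: len(brknn_sets[g]), reverse=True)[:2]
```

Here the two goal points with the largest individual values cover 7 condition points, while the best pair covers 9.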
Application III
• Find n Most Favorite Products based on Reverse Top-k Queries
Airline Fare Food
a1 0.8 0.2
a2 0.6 0.4
a3 0.4 1
a4 0.4 0.8
a5 0.4 0.6
Hotel Location Comfort Cleanness
h1 0.4 0.6 0.4
h2 0.4 0.6 0.6
h3 0.4 0.8 0.2
h4 0.6 0.6 0.2
h5 0.6 0.8 0.4
h6 1 0.2 0.6
Airlines Hotels
Package Fare Food Location Comfort Cleanness
(a1, h1) 0.8 0.2 0.4 0.6 0.4
(a1, h2) 0.8 0.2 0.4 0.6 0.6
(a1, h3) 0.8 0.2 0.4 0.8 0.2
…
(a5, h5) 0.4 0.6 0.6 0.8 0.4
(a5, h6) 0.4 0.6 1 0.2 0.6
All candidate packages
Which are the most favorite packages?
Customer Fare Food Location Comfort Cleanness
c1 0 0.2 0.5 0.1 0.2
c2 0.1 0.3 0.1 0.3 0.2
c3 0.3 0 0.1 0.3 0.3
c4 0.3 0.1 0.2 0.3 0.1
c5 0 0.1 0.3 0 0.6
Customer preferences
c1:
(a1, h1): 0.8×0 + 0.2×0.2 + 0.4×0.5 + 0.6×0.1 + 0.4×0.2 = 0.38
(a1, h2): 0.8×0 + 0.2×0.2 + 0.4×0.5 + 0.6×0.1 + 0.6×0.2 = 0.42
…
c2:
(a1, h1): 0.8×0.1 + 0.2×0.3 + 0.4×0.1 + 0.6×0.3 + 0.4×0.2 = 0.44
(a1, h2): 0.8×0.1 + 0.2×0.3 + 0.4×0.1 + 0.6×0.3 + 0.6×0.2 = 0.48
…
Customer Fare Food Location Comfort Cleanness Top-2 favorites
c1 0 0.2 0.5 0.1 0.2 {(a3, h6), (a4, h6)}
c2 0.1 0.3 0.1 0.3 0.2 {(a3, h2), (a3, h5)}
c3 0.3 0 0.1 0.3 0.3 {(a1, h2), (a1, h5)}
c4 0.3 0.1 0.2 0.3 0.1 {(a1, h5), (a2, h5), (a3, h5)}
c5 0 0.1 0.3 0 0.6 {(a3, h6), (a4, h6)}
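The weighted sums and the top-2 favorites can be recomputed from the airline, hotel, and customer tables. The sketch below is one way to do it, under two assumptions that are not stated on the slides: values are stored in tenths so that scores are exact integers (in hundredths), and packages tied with the k-th best score are all kept, which is how c4 ends up with three favorites. The names `score` and `top_k` are illustrative.

```python
from itertools import product

# Values in tenths (0.8 -> 8) so that ties are detected exactly.
AIRLINES = {"a1": (8, 2), "a2": (6, 4), "a3": (4, 10), "a4": (4, 8), "a5": (4, 6)}
HOTELS = {"h1": (4, 6, 4), "h2": (4, 6, 6), "h3": (4, 8, 2),
          "h4": (6, 6, 2), "h5": (6, 8, 4), "h6": (10, 2, 6)}
# Customer weights on (Fare, Food, Location, Comfort, Cleanness), also in tenths.
CUSTOMERS = {"c1": (0, 2, 5, 1, 2), "c2": (1, 3, 1, 3, 2), "c3": (3, 0, 1, 3, 3),
             "c4": (3, 1, 2, 3, 1), "c5": (0, 1, 3, 0, 6)}

# Every airline/hotel combination is a candidate package.
PACKAGES = {(a, h): AIRLINES[a] + HOTELS[h] for a, h in product(AIRLINES, HOTELS)}

def score(weights, attrs):
    """Weighted sum of a package's attributes, in hundredths."""
    return sum(w * x for w, x in zip(weights, attrs))

def top_k(weights, packages, k):
    """Packages whose score reaches the k-th best score (ties included)."""
    scores = {p: score(weights, attrs) for p, attrs in packages.items()}
    threshold = sorted(scores.values(), reverse=True)[k - 1]
    return {p for p, s in scores.items() if s >= threshold}

favorites = {c: top_k(w, PACKAGES, 2) for c, w in CUSTOMERS.items()}
```

For c1 the sketch returns {(a3, h6), (a4, h6)}, the same set listed in the later walk-through of the selection procedure.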
Top-k Queries (Customer’s View)
Package Fare Food Location Comfort Cleanness
(a1, h1) 0.8 0.2 0.4 0.6 0.4
(a1, h2) 0.8 0.2 0.4 0.6 0.6
(a1, h3) 0.8 0.2 0.4 0.8 0.2
…
(a5, h5) 0.4 0.6 0.6 0.8 0.4
(a5, h6) 0.4 0.6 1 0.2 0.6
All candidate packages
Customer preferences
Customer Fare Food Location Comfort Cleanness Top-2 favorites
c1 0 0.2 0.5 0.1 0.2 {(a3, h6), (a4, h6)}
c2 0.1 0.3 0.1 0.3 0.2 {(a3, h2), (a3, h5)}
c3 0.3 0 0.1 0.3 0.3 {(a1, h2), (a1, h5)}
c4 0.3 0.1 0.2 0.3 0.1 {(a1, h5), (a2, h5), (a3, h5)}
c5 0 0.1 0.3 0 0.6 {(a3, h6), (a4, h6)}
Retrieve the customers whose top-2 favorites contain (a1, h2): {c3}
The number of customers in the reverse top-k result of a product is a good estimate of how favored the product is in the market
Reverse Top-k Queries (Travel Agency’s View)
Package Fare Food Location Comfort Cleanness
(a1, h1) 0.8 0.2 0.4 0.6 0.4
(a1, h2) 0.8 0.2 0.4 0.6 0.6
…
(a1, h5) 0.8 0.2 0.6 0.8 0.4
…
(a3, h6) 0.4 1 1 0.2 0.6
…
(a5, h6) 0.4 0.6 1 0.2 0.6
All candidate packages
Customer preferences
Customer Fare Food Location Comfort Cleanness Top-2 favorites
c1 0 0.2 0.5 0.1 0.2 {(a3, h6), (a4, h6)}
c2 0.1 0.3 0.1 0.3 0.2 {(a3, h2), (a3, h5)}
c3 0.3 0 0.1 0.3 0.3 {(a1, h2), (a1, h5)}
c4 0.3 0.1 0.2 0.3 0.1 {(a1, h5), (a2, h5), (a3, h5)}
c5 0 0.1 0.3 0 0.6 {(a3, h6), (a4, h6)}
(a1, h2): {c3}
(a1, h5): {c3, c4}
(a2, h5): {c4}
(a3, h2): {c2}
(a3, h5): {c2, c4}
(a3, h6): {c1, c5}
(a4, h6): {c1, c5}
k (#packages considered by customers) = 2
n (#packages to be offered by the travel agency) = 2
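The travel agency's view is the customer's view inverted: for every package, collect the customers who rank it among their top-k. A brute-force sketch over the example tables follows; values are stored in tenths so that ties are exact, tied packages at the k-th score are all kept, and the names `top_k` and `reverse_top_k` are illustrative rather than taken from the slides.

```python
from collections import defaultdict
from itertools import product

# Values in tenths (0.8 -> 8) so that scores are exact integers.
AIRLINES = {"a1": (8, 2), "a2": (6, 4), "a3": (4, 10), "a4": (4, 8), "a5": (4, 6)}
HOTELS = {"h1": (4, 6, 4), "h2": (4, 6, 6), "h3": (4, 8, 2),
          "h4": (6, 6, 2), "h5": (6, 8, 4), "h6": (10, 2, 6)}
CUSTOMERS = {"c1": (0, 2, 5, 1, 2), "c2": (1, 3, 1, 3, 2), "c3": (3, 0, 1, 3, 3),
             "c4": (3, 1, 2, 3, 1), "c5": (0, 1, 3, 0, 6)}
PACKAGES = {(a, h): AIRLINES[a] + HOTELS[h] for a, h in product(AIRLINES, HOTELS)}

def top_k(weights, packages, k):
    """Packages whose weighted score reaches the k-th best score (ties included)."""
    scores = {p: sum(w * x for w, x in zip(weights, attrs))
              for p, attrs in packages.items()}
    threshold = sorted(scores.values(), reverse=True)[k - 1]
    return {p for p, s in scores.items() if s >= threshold}

def reverse_top_k(packages, customers, k):
    """Map each package to the set of customers whose top-k contains it."""
    result = defaultdict(set)
    for customer, weights in customers.items():
        for package in top_k(weights, packages, k):
            result[package].add(customer)
    return dict(result)

rtop = reverse_top_k(PACKAGES, CUSTOMERS, 2)
```

Packages that appear in nobody's top-2 are simply absent from `rtop`. Evaluating every customer against every package is what the pruning rules on the following slides are meant to avoid.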
Problem Definition of n-k MFP
• Given a set of component tables T1, T2, …, and Tx, which form a set of candidate products P, a set of customers C with different preferences on the products, and two positive integers k and n
• RTOPk(cp, P, C): the set of customers whose top-k favorites contain the candidate product cp
• Retrieve the minimum subset P’ of P such that |P’| ≤ n and |∪cp∈P’ RTOPk(cp, P, C)| is maximized
• Maximum coverage problem: NP-hard
• An object p is said to dominate another object q if and only if p is larger than or equal to q on all dimensions and p is larger than q on at least one dimension
• Given a set of multi-dimensional objects, the skyline consists of the objects which are not dominated by any other object
[Figure: objects plotted on attributes A1 and A2, with the skyline marked]
• Only the component tuples dominated by at most (k-1) other tuples in the same component table can be part of a top-k product for a customer c
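Dominance, the skyline, and the (k-1)-dominance filter can all be expressed through one dominance count per tuple. The sketch below applies them to the airline and hotel tables of the running example, with larger values treated as better, as in the definition above. The function names are illustrative.

```python
AIRLINES = {"a1": (0.8, 0.2), "a2": (0.6, 0.4), "a3": (0.4, 1.0),
            "a4": (0.4, 0.8), "a5": (0.4, 0.6)}
HOTELS = {"h1": (0.4, 0.6, 0.4), "h2": (0.4, 0.6, 0.6), "h3": (0.4, 0.8, 0.2),
          "h4": (0.6, 0.6, 0.2), "h5": (0.6, 0.8, 0.4), "h6": (1.0, 0.2, 0.6)}

def dominates(p, q):
    """p dominates q: at least as large on every dimension, larger on at least one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def dominance_counts(table):
    """For each tuple, the number of other tuples in the table that dominate it."""
    return {name: sum(dominates(other, t) for o, other in table.items() if o != name)
            for name, t in table.items()}

def skyline(table):
    """Tuples that no other tuple dominates."""
    return [name for name, count in dominance_counts(table).items() if count == 0]

def k_candidates(table, k):
    """Tuples dominated by at most (k-1) others; only these can appear in a top-k product."""
    return [name for name, count in dominance_counts(table).items() if count <= k - 1]

airline_counts = dominance_counts(AIRLINES)
hotel_counts = dominance_counts(HOTELS)
```

With k = 2 the filter drops a5 (dominated by a3 and a4) and h1 (dominated by h2 and h5), leaving 4 × 5 = 20 of the 30 packages.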
Airline Fare Food
…
a3 0.4 1
a4 0.4 0.8
a5 0.4 0.6
Airlines
Hotel Location Comfort Cleanness
h1 0.4 0.6 0.4
…
Hotels
Package Fare Food Location Comfort Cleanness
(a3, h1) 0.4 1 0.4 0.6 0.4
(a4, h1) 0.4 0.8 0.4 0.6 0.4
(a5, h1) 0.4 0.6 0.4 0.6 0.4
Airline Fare Food
a1(0) 0.8 0.2
a2(0) 0.6 0.4
a3(0) 0.4 1
a4(1) 0.4 0.8
a5(2) 0.4 0.6
Hotel Location Comfort Cleanness
h1(2) 0.4 0.6 0.4
h2(0) 0.4 0.6 0.6
h3(1) 0.4 0.8 0.2
h4(1) 0.6 0.6 0.2
h5(0) 0.6 0.8 0.4
h6(0) 1 0.2 0.6
Airlines Hotels
Tuples dominated by at most (k-1) = 1 other tuple, for k = 2:
Airline Fare Food
a1(0) 0.8 0.2
a2(0) 0.6 0.4
a3(0) 0.4 1
a4(1) 0.4 0.8
Hotel Location Comfort Cleanness
h2(0) 0.4 0.6 0.6
h3(1) 0.4 0.8 0.2
h4(1) 0.6 0.6 0.2
h5(0) 0.6 0.8 0.4
h6(0) 1 0.2 0.6
• For any two candidate products cp1 and cp2 in P, if cp1 dominates cp2, RTOPk(cp2, P, C) ⊆ RTOPk(cp1, P, C)
• For any candidate product cp in P, if cp ∉ Skyline(P), cp ∉ n-k MFP
[Figure: objects plotted on attributes A1 and A2]
The candidate products in the n-k MFP must be in Skyline(P)
• Consider the set of candidate products generated from Skyline(T1), Skyline(T2), …, and Skyline(Tx)
• A candidate product cp ∈ Skyline(P) if and only if cp belongs to this set [VLDB’09]
• Only the skyline tuples of each component table can be part of a candidate product in the n-k MFP
Airlines Hotels
Airline Fare Food
a1(0) 0.8 0.2
a2(0) 0.6 0.4
a3(0) 0.4 1
a4(1) 0.4 0.8
Hotel Location Comfort Cleanness
h2(0) 0.4 0.6 0.6
h3(1) 0.4 0.8 0.2
h4(1) 0.6 0.6 0.2
h5(0) 0.6 0.8 0.4
h6(0) 1 0.2 0.6
• Only the customers in RTOPk(cp, Skyline(P), C) can be members of RTOPk(cp, P, C)
Package Upper bound
(a1, h2) {c3}
(a1, h5) {c3, c4}
(a1, h6) {}
(a2, h2) {}
(a2, h5) {c4}
(a2, h6) {c1, c5}
(a3, h2) {c2}
(a3, h5) {c2, c4}
(a3, h6) {c1, c5}
The upper bounds of the remaining candidate packages
RTOPk(cp, Skyline(P), C) is an upper bound of RTOPk(cp, P, C)
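The upper-bound table can be reproduced by running the reverse top-2 computation over the skyline products only. In the sketch below the skyline tuples (a1, a2, a3 and h2, h5, h6) are taken as given; values are stored in tenths so that ties are exact, and the helper name `top_k` is illustrative.

```python
from itertools import product

# Skyline tuples of each component table, values in tenths (0.8 -> 8).
SKY_AIRLINES = {"a1": (8, 2), "a2": (6, 4), "a3": (4, 10)}
SKY_HOTELS = {"h2": (4, 6, 6), "h5": (6, 8, 4), "h6": (10, 2, 6)}
CUSTOMERS = {"c1": (0, 2, 5, 1, 2), "c2": (1, 3, 1, 3, 2), "c3": (3, 0, 1, 3, 3),
             "c4": (3, 1, 2, 3, 1), "c5": (0, 1, 3, 0, 6)}

# Candidate products built only from skyline tuples.
SKY_PRODUCTS = {(a, h): SKY_AIRLINES[a] + SKY_HOTELS[h]
                for a, h in product(SKY_AIRLINES, SKY_HOTELS)}

def top_k(weights, packages, k):
    """Packages whose weighted score reaches the k-th best score (ties included)."""
    scores = {p: sum(w * x for w, x in zip(weights, attrs))
              for p, attrs in packages.items()}
    threshold = sorted(scores.values(), reverse=True)[k - 1]
    return {p for p, s in scores.items() if s >= threshold}

# RTOPk(cp, Skyline(P), C) for every skyline product, with k = 2.
upper_bound = {p: set() for p in SKY_PRODUCTS}
for customer, weights in CUSTOMERS.items():
    for package in top_k(weights, SKY_PRODUCTS, 2):
        upper_bound[package].add(customer)
```

Ranking against 9 products instead of all 30 can only make it easier for a product to reach a customer's top-2, which is why these sets bound the true reverse top-2 results from above.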
Package Upper bound
(a1, h2) {c3}
(a1, h5) {c3, c4}
(a2, h5) {c4}
(a2, h6) {c1, c5}
(a3, h2) {c2}
(a3, h5) {c2, c4}
(a3, h6) {c1, c5}
The top-2 favorites of C3: {(a1, h5), (a1, h2)}
The top-2 favorites of C4: {(a1, h5), (a2, h5), (a3, h5)}
P’ : {(a1, h5)}
Package Upper bound
(a2, h6) {c1, c5}
(a3, h2) {c2}
(a3, h5) {c2}
(a3, h6) {c1, c5}
The top-2 favorites of C1: {(a3, h6), (a4, h6)}
The top-2 favorites of C5: {(a3, h6), (a4, h6)}
P’ : {(a1, h5), (a3, h6)}
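The walk-through above reads like a greedy selection that verifies upper bounds lazily: pick the package whose bound covers the most uncovered customers, check those customers' real top-2 favorites, and accept the package only if it is still the best. The sketch below is a reconstruction under that reading, not necessarily the authors' exact algorithm; the data are copied from the tables above, and `select_products` is an illustrative name.

```python
# Upper bounds RTOPk(cp, Skyline(P), C), with the empty entries dropped.
UPPER = {
    ("a1", "h2"): {"c3"}, ("a1", "h5"): {"c3", "c4"}, ("a2", "h5"): {"c4"},
    ("a2", "h6"): {"c1", "c5"}, ("a3", "h2"): {"c2"}, ("a3", "h5"): {"c2", "c4"},
    ("a3", "h6"): {"c1", "c5"},
}
# Top-2 favorites of each customer over all candidate packages.
TOP2 = {
    "c1": {("a3", "h6"), ("a4", "h6")},
    "c2": {("a3", "h2"), ("a3", "h5")},
    "c3": {("a1", "h5"), ("a1", "h2")},
    "c4": {("a1", "h5"), ("a2", "h5"), ("a3", "h5")},
    "c5": {("a3", "h6"), ("a4", "h6")},
}

def select_products(upper, top_k_favorites, n):
    """Greedy maximum coverage with lazily verified upper bounds."""
    bounds = {p: set(cs) for p, cs in upper.items()}
    verified, chosen, covered = set(), [], set()
    while len(chosen) < n and bounds:
        # Package whose current bound covers the most uncovered customers.
        best = max(bounds, key=lambda p: len(bounds[p] - covered))
        if not bounds[best] - covered:
            break  # nothing left to gain
        if best not in verified:
            # Replace the upper bound by the exact reverse top-k set, then re-pick.
            bounds[best] = {c for c in bounds[best] if best in top_k_favorites[c]}
            verified.add(best)
            continue
        chosen.append(best)
        covered |= bounds.pop(best)
    return chosen, covered

chosen, covered = select_products(UPPER, TOP2, 2)
```

On this data (a2, h6) is examined after (a1, h5) is accepted, but its bound {c1, c5} shrinks to the empty set once the real favorites of c1 and c5 are checked, so (a3, h6) is chosen instead. Because the underlying problem is NP-hard, a greedy choice like this one is a heuristic and not guaranteed to be optimal in general.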
Application IV
• Find Most Favorite Products by Top-k Reverse Skyline Queries
[Figure: products and user preferences (u1, u2) plotted by Year and Mileage, with k=1]
Thank you for your attention!