On Top-n Reverse Top-k Queries: Variants, Algorithms, and Applications

陳良弼 (Arbee L.P. Chen)

National Chengchi University

9/21/2012 at NCHU

IEEE International Conference on Data Engineering (ICDE)

• A premium international conference on databases

• Inaugural conference held at Los Angeles in 1984

• Held in Taiwan in 1995

ICDE2012 Research Papers Distribution

• System Aspects
– Privacy and Security 8%
– Storage Management and Performance 7%
– Entity resolution/Versioning 7%
– Query Processing 31%
    • Top-k query 9%
    • Distributed/parallel/map-reduce 8%
    • Location-aware 5%
    • Execution Plan 5%
    • Graph indexing 4%

• Text/Web/Keyword Search 19%
• Stream/Trajectory/Sequence/Spatio-Temporal 10%
• Social Media 7%
• Uncertain Database 6%
• Data Mining 5%

Efficient Dual-Resolution Layer Indexing for Top-k Queries, ICDE2012

[Figure: nine hotels H1 to H9 plotted by (price, distance to the airport). Points shown: (0.6, 0.2), (0.55, 0.4), (0.45, 0.6), (0.3, 0.7), (0.55, 0.3), (0.3, 0.6), (0.2, 0.7), (0.7, 0.4), (0.5, 0.5). Scores shown: 0.525, 0.5, 0.45, 0.45, 0.475, 0.425, 0.4, 0.55, 0.5]

Hotel (price, distance to the airport) Score
H1 (0.6, 0.2) 0.4
H4 (0.55, 0.4) 0.475
H5 (0.55, 0.3) 0.425
H6 (0.3, 0.6) 0.45
H7 (0.2, 0.7) 0.45

Answering Why-not Questions on Top-k Queries, ICDE2012

• Top-k query on (Cleanliness, Deliciousness, Parking spaces): Top-2 with weights (0.4, 0.5, 0.1)

Point (Cleanliness, Deliciousness, Parking spaces) Score
p1 (95, 80, 40) 82
p2 (70, 20, 30) 41
p3 (50, 90, 60) 71
p4 (75, 70, 50) 70
p5 (85, 60, 60) 69
p6 (58, 20, 30) 36.2

• Why-not question: Why is p5 not in my top-2 query list?
– Does p5 not exist?
– Should I change my weights?
– Should I revise my query to look for top-5 hotels?

Scores for Top-2 with weights (0.5, 0.4, 0.1)

Point Score
p1 83.5
p2 46
p3 67
p4 70.5
p5 71.7
p6 40
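The weighted-sum ranking behind this example can be sketched in a few lines of Python. This is only an illustration: the function name `top_k` is an assumption, and the tuples are copied as they appear in this transcript. With the p5 tuple as printed, the computed scores for p5 are 70 and 72.5 rather than the 69 and 71.7 shown on the slide, so the original tuple may have differed slightly. The conclusion is the same either way: p5 misses the top-2 under (0.4, 0.5, 0.1) and enters it under (0.5, 0.4, 0.1).

```python
def top_k(points, weights, k):
    """Return the ids of the k points with the highest weighted-sum score."""
    scores = {
        pid: sum(w * v for w, v in zip(weights, values))
        for pid, values in points.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# (Cleanliness, Deliciousness, Parking spaces), as printed in the transcript.
points = {
    "p1": (95, 80, 40),
    "p2": (70, 20, 30),
    "p3": (50, 90, 60),
    "p4": (75, 70, 50),
    "p5": (85, 60, 60),
    "p6": (58, 20, 30),
}

original = top_k(points, (0.4, 0.5, 0.1), 2)  # the user's original weights
revised = top_k(points, (0.5, 0.4, 0.1), 2)   # the revised weights

print(original)
print(revised)
```

Changing the weights is only one of the three possible answers to the why-not question. Raising k is another, and it would be a one-argument change in this sketch.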

The Min-dist Location Selection Query, ICDE2012

[Figure: clients c1 to c8, existing facilities f1 and f2, and candidate locations p1 and p2, with each client's nearest facility distance]

Goal: minimize the nearest facility distance

[Figure panels: nearest facility distances when the new facility is placed at p1, and when it is placed at p2]

Introduction

• kNN (k-Nearest Neighbors) Queries

Assume k = 3

[Figure: query point q and its three nearest data points a, b, and c]

kNN(q) = {a, b, c}

Introduction

• RkNN (Reverse k-Nearest Neighbors) Queries

Assume k = 3

[Figure: query point q with data points a and d]

RkNN(q) = {a, …}
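To make the difference between kNN and RkNN concrete, here is a minimal brute-force sketch. The coordinates are made up for illustration (they are not taken from the slides), and the function names `knn` and `rknn` are assumptions. A point belongs to RkNN(q) when q is among that point's own k nearest neighbors.

```python
from math import dist


def knn(q, points, k):
    """Names of the k points closest to q (q itself excluded)."""
    others = [name for name in points if name != q]
    return sorted(others, key=lambda name: dist(points[q], points[name]))[:k]


def rknn(q, points, k):
    """Names of the points that count q among their own k nearest neighbors."""
    return [name for name in points if name != q and q in knn(name, points, k)]


# Hypothetical layout: a, b, c lie near q; d, e, f, g form a distant cluster.
points = {
    "q": (0, 0), "a": (1, 0), "b": (0, 2), "c": (3, 0),
    "d": (10, 0), "e": (10, 1), "f": (11, 0), "g": (11, 1),
}

print(knn("q", points, 3))
print(rknn("q", points, 3))
```

In this layout d is not in RkNN(q) because its three nearest neighbors are all inside its own cluster. The quadratic brute force is only meant to pin down the definition.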

Introduction

• BRkNN (Bi-chromatic Reverse k-Nearest Neighbors) Queries

Assume k = 3

[Figure: two types of data; query point q with data points a and d]

BRkNN(q) = {a, …}

Application I

[Figure: shops and customers]

Which location is the best?

Top-n Reverse kNN Queries

Given two types of data, G (goal) and C (condition), retrieve the n data points from G that have the largest BRkNN values

[Figure: goal points g1, g2, and g3 among the condition points]

Example: n=2, k=2

BR2NN value of g1 = 4

BR2NN value of g2 = 9

BR2NN value of g3 = 5

BR2Top-2 = {g2, g3}
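A brute-force version of this query can be sketched as follows. It is an illustrative baseline, not the filter-refinement method from the talk: each condition point votes for its k nearest goal points, and the n goal points with the most votes are returned. The names `brknn_values` and `top_n` and the toy coordinates are assumptions. The last line reuses the BR2NN values quoted in the example.

```python
from math import dist


def brknn_values(goals, conditions, k):
    """For each goal point, count the condition points that have it among their k nearest goals."""
    counts = {g: 0 for g in goals}
    for c in conditions:
        nearest = sorted(goals, key=lambda g: dist(goals[g], c))[:k]
        for g in nearest:
            counts[g] += 1
    return counts


def top_n(values, n):
    """The n goal points with the largest BRkNN values."""
    return sorted(values, key=values.get, reverse=True)[:n]


# Hypothetical one-dimensional layout, k = 1.
goals = {"g1": (0, 0), "g2": (5, 0), "g3": (10, 0)}
conditions = [(1, 0), (4, 0), (6, 0), (9, 0)]
toy_values = brknn_values(goals, conditions, 1)

# BR2NN values quoted in the example: n = 2 selects g2 and g3.
slide_answer = top_n({"g1": 4, "g2": 9, "g3": 5}, 2)
print(toy_values, slide_answer)
```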

Voronoi Diagram of G

[Figure legend: goal points (VD-nodes) and condition points]

A Filter-Refinement Framework for Solving BRkNN Queries

Assume k = 2

Lower-bound region of VDi: layer 0

Upper-bound region of VDi: layer 0 ~ layer (k-1)

[Figure: the cell VDi as layer 0, surrounded by the layer 1 cells]

Filter phase

Assume k = 2

Construct bisectors layer by layer to reduce the region

Refinement Phase

Assume k = 2

For a data point p, we check the VDs at layer 1 ~ layer 2 to determine whether VDi is one of the 2NN of p

Distance list of VDi (distances from VDi to other VD nodes, in ascending order):

VDi: (VD13, 1.2) (VD26, 1.4) (VD27, 1.7) (VD3, 1.7) (VD4, 1.8) (VD30, 2.1) (VD5, 2.5) (VD7, 4.8)

Example: dist(VDi, p) = 0.9 and dist(VDi, VD30) = 2.1, so by the triangle inequality dist(p, VD30) ≥ 2.1 - 0.9 = 1.2 > 0.9, and VD30 cannot be closer to p than VDi

Pruning rule: if dist(VDi, VDj) > 2dist(VDi, p), then VDj is farther from p than VDi and need not be checked
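The pruning rule follows from the triangle inequality: dist(p, VDj) ≥ dist(VDi, VDj) - dist(VDi, p), which exceeds dist(VDi, p) whenever dist(VDi, VDj) > 2dist(VDi, p). A small sketch of applying it to the distance list above is given below. The function name `split_by_pruning_rule` is an assumption, and the value 0.9 is read here as dist(VDi, p).

```python
def split_by_pruning_rule(dist_list, dist_to_p):
    """Separate the VD nodes that still need an exact distance check from those pruned by the rule."""
    threshold = 2 * dist_to_p
    to_check = [name for name, d in dist_list if d <= threshold]
    pruned = [name for name, d in dist_list if d > threshold]
    return to_check, pruned


# Distance list of VDi from the slide, in ascending order.
dist_list = [
    ("VD13", 1.2), ("VD26", 1.4), ("VD27", 1.7), ("VD3", 1.7),
    ("VD4", 1.8), ("VD30", 2.1), ("VD5", 2.5), ("VD7", 4.8),
]

to_check, pruned = split_by_pruning_rule(dist_list, 0.9)
print(to_check)
print(pruned)
```

Because the list is sorted, a real implementation could stop scanning at the first entry above the threshold; the sketch scans everything for clarity.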

Application II

Maximum Coverage BRkNN Queries

Retrieve 2 points from dataset G. Assume k = 2

BRkNN value = 9

BRkNN value = 8

total = 12

total = 14

Maximum Coverage BRkNN Queries

• Given:
– A set of goal points (G)
– A set of condition points (C)
– k: the k value of BRkNN

• Goal:
– Find n points from G, g1, g2, …, gn, which maximize $\left|\bigcup_{i=1}^{n} \mathrm{BRkNN}(g_i, G, C)\right|$
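One common way to approach such a maximum coverage objective is the greedy heuristic: repeatedly pick the goal point that covers the most not-yet-covered condition points. The sketch below is an assumption about how one might do this, not necessarily the method used in the talk. The BRkNN sets are hypothetical, sized to resemble the figures quoted earlier: the two largest sets (9 and 8) overlap and cover 12 points, while another pair covers 14.

```python
def greedy_max_coverage(reverse_sets, n):
    """Greedily choose up to n keys whose sets cover as many elements as possible."""
    chosen, covered = [], set()
    for _ in range(n):
        best = max(
            (g for g in reverse_sets if g not in chosen),
            key=lambda g: len(reverse_sets[g] - covered),
            default=None,
        )
        # Stop when nothing is left or the best candidate adds no new element.
        if best is None or not reverse_sets[best] - covered:
            break
        chosen.append(best)
        covered |= reverse_sets[best]
    return chosen, covered


# Hypothetical BRkNN result sets (condition points are numbered 1..14).
reverse_sets = {
    "g1": set(range(1, 10)),   # 9 condition points
    "g2": set(range(5, 13)),   # 8 condition points, overlapping g1
    "g3": set(range(10, 15)),  # 5 condition points, disjoint from g1
}

chosen, covered = greedy_max_coverage(reverse_sets, 2)
print(chosen, len(covered))
```

The greedy heuristic is a standard approximation for maximum coverage with a (1 - 1/e) guarantee; it does not always find the optimum.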

Application III

• Find n Most Favorite Products based on Reverse Top-k Queries

Airlines

Airline Fare Food
a1 0.8 0.2
a2 0.6 0.4
a3 0.4 1
a4 0.4 0.8
a5 0.4 0.6

Hotels

Hotel Location Comfort Cleanness
h1 0.4 0.6 0.4
h2 0.4 0.6 0.6
h3 0.4 0.8 0.2
h4 0.6 0.6 0.2
h5 0.6 0.8 0.4
h6 1 0.2 0.6

All candidate packages

Package Fare Food Location Comfort Cleanness
(a1, h1) 0.8 0.2 0.4 0.6 0.4
(a1, h2) 0.8 0.2 0.4 0.6 0.6
(a1, h3) 0.8 0.2 0.4 0.8 0.2
…
(a5, h5) 0.4 0.6 0.6 0.8 0.4
(a5, h6) 0.4 0.6 1 0.2 0.6

Which are the most favorite packages?

Top-k Queries (Customer's View)

Customer preferences

Customer Fare Food Location Comfort Cleanness
c1 0 0.2 0.5 0.1 0.2
c2 0.1 0.3 0.1 0.3 0.2
c3 0.3 0 0.1 0.3 0.3
c4 0.3 0.1 0.2 0.3 0.1
c5 0 0.1 0.3 0 0.6

The score of a package for a customer is the weighted sum of its attribute values:

c1, (a1, h1): 0.8×0 + 0.2×0.2 + 0.4×0.5 + 0.6×0.1 + 0.4×0.2 = 0.38
c1, (a1, h2): 0.8×0 + 0.2×0.2 + 0.4×0.5 + 0.6×0.1 + 0.6×0.2 = 0.42
…

c2, (a1, h1): 0.8×0.1 + 0.2×0.3 + 0.4×0.1 + 0.6×0.3 + 0.4×0.2 = 0.44
c2, (a1, h2): 0.8×0.1 + 0.2×0.3 + 0.4×0.1 + 0.6×0.3 + 0.6×0.2 = 0.48
…

Customer Top-2 favorites
c1 {(a3, h6), (a4, h6)}
c2 {(a3, h2), (a3, h5)}
c3 {(a1, h2), (a1, h5)}
c4 {(a1, h5), (a2, h5), (a3, h5)}
c5 {(a3, h6), (a4, h6)}
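The top-2 favorites can be recomputed from the airline, hotel, and customer tables with a short script. This is a sketch under two assumptions that are not in the slides: all values and weights are multiplied by 10 so that scores are exact integers (a score of 38 here corresponds to 0.38), and ties at the k-th score are all kept, which is why c4 ends up with three packages. With these inputs c1's top-2 comes out as {(a3, h6), (a4, h6)}, the set quoted for C1 in the later walkthrough.

```python
from itertools import product

# Slide values multiplied by 10 (assumption, to avoid floating-point ties).
AIRLINES = {"a1": (8, 2), "a2": (6, 4), "a3": (4, 10), "a4": (4, 8), "a5": (4, 6)}
HOTELS = {
    "h1": (4, 6, 4), "h2": (4, 6, 6), "h3": (4, 8, 2),
    "h4": (6, 6, 2), "h5": (6, 8, 4), "h6": (10, 2, 6),
}
CUSTOMERS = {
    "c1": (0, 2, 5, 1, 2), "c2": (1, 3, 1, 3, 2), "c3": (3, 0, 1, 3, 3),
    "c4": (3, 1, 2, 3, 1), "c5": (0, 1, 3, 0, 6),
}

# Every (airline, hotel) pair is a candidate package with five attributes.
PACKAGES = {(a, h): AIRLINES[a] + HOTELS[h] for a, h in product(AIRLINES, HOTELS)}


def score(weights, values):
    """Weighted sum of attribute values (scaled by 100 relative to the slides)."""
    return sum(w * v for w, v in zip(weights, values))


def top_k(weights, packages, k):
    """Packages whose score is at least the k-th highest score (ties are kept)."""
    scores = {p: score(weights, v) for p, v in packages.items()}
    kth = sorted(scores.values(), reverse=True)[k - 1]
    return {p for p, s in scores.items() if s >= kth}


favorites = {c: top_k(w, PACKAGES, 2) for c, w in CUSTOMERS.items()}
for customer, packages in favorites.items():
    print(customer, sorted(packages))
```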

Reverse Top-k Queries (Travel Agency's View)

Retrieve the customers whose top-2 favorites contain (a1, h2): {c3}

The number of customers in the reverse top-k query result for a product is a good estimate of the favoring degree of the product in the market

k (#packages considered by customers) = 2

Reverse top-2 result of each package:

(a1, h2): {c3}
(a1, h5): {c3, c4}
(a2, h5): {c4}
(a3, h2): {c2}
(a3, h5): {c2, c4}
(a3, h6): {c1, c5}
(a4, h6): {c1, c5}

n (#packages to be offered by the travel agency) = 2
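A reverse top-k query can be answered naively by scoring every package for every customer. The sketch below makes the same assumptions as the slide data allows: values and weights are multiplied by 10 so that scores are exact integers, and a package counts as a top-k favorite when fewer than k packages score strictly higher, so ties are kept. The function name `reverse_top_k` is an assumption for illustration.

```python
from itertools import product

# Slide values multiplied by 10 (assumption, to keep scores exact).
AIRLINES = {"a1": (8, 2), "a2": (6, 4), "a3": (4, 10), "a4": (4, 8), "a5": (4, 6)}
HOTELS = {
    "h1": (4, 6, 4), "h2": (4, 6, 6), "h3": (4, 8, 2),
    "h4": (6, 6, 2), "h5": (6, 8, 4), "h6": (10, 2, 6),
}
CUSTOMERS = {
    "c1": (0, 2, 5, 1, 2), "c2": (1, 3, 1, 3, 2), "c3": (3, 0, 1, 3, 3),
    "c4": (3, 1, 2, 3, 1), "c5": (0, 1, 3, 0, 6),
}
PACKAGES = {(a, h): AIRLINES[a] + HOTELS[h] for a, h in product(AIRLINES, HOTELS)}


def score(weights, values):
    return sum(w * v for w, v in zip(weights, values))


def reverse_top_k(cp, packages, customers, k):
    """Customers for whom fewer than k packages score strictly higher than cp."""
    result = set()
    for name, weights in customers.items():
        own = score(weights, packages[cp])
        better = sum(score(weights, v) > own for v in packages.values())
        if better < k:
            result.add(name)
    return result


rtop = {cp: reverse_top_k(cp, PACKAGES, CUSTOMERS, 2) for cp in PACKAGES}
favored = {cp: cs for cp, cs in rtop.items() if cs}  # packages with a non-empty result
for cp, cs in favored.items():
    print(cp, sorted(cs))
```

This brute force touches every (customer, package) pair, which is exactly the cost the pruning ideas later in the talk try to avoid.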

Problem Definition of n-k MFP

• Given a set of component tables T1, T2, …, and Tx, which form a set of candidate products P, a set of customers C with different preferences on the products, and two positive integers k and n

• RTOPk(cp, P, C): the set of the customers whose top-k favorites contain the candidate product cp

• Retrieve the minimum subset P' of P such that $|P'| \le n$ and $\left|\bigcup_{cp \in P'} \mathrm{RTOP}_k(cp, P, C)\right|$ is maximized

• Maximum coverage problem: NP-hard
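On an instance as small as the travel example, the definition can be checked by exhaustive search. The sketch below is illustrative only: it uses a subset of the reverse top-2 sets listed earlier, and the function name `best_subset` is an assumption. It tries subsets in order of increasing size and replaces the current best only on strictly larger coverage, so the smallest subset achieving the maximum is kept.

```python
from itertools import combinations


def best_subset(rtop, n):
    """Exhaustively find a smallest subset of at most n products with maximum customer coverage."""
    best, best_cov = (), set()
    for size in range(1, n + 1):
        for subset in combinations(rtop, size):
            cov = set().union(*(rtop[cp] for cp in subset))
            if len(cov) > len(best_cov):  # strict: smaller subsets win ties
                best, best_cov = subset, cov
    return best, best_cov


# A subset of the reverse top-2 sets from the travel example.
RTOP = {
    ("a1", "h2"): {"c3"},
    ("a1", "h5"): {"c3", "c4"},
    ("a2", "h5"): {"c4"},
    ("a3", "h2"): {"c2"},
    ("a3", "h5"): {"c2", "c4"},
    ("a3", "h6"): {"c1", "c5"},
}

best, covered = best_subset(RTOP, 2)
print(best, sorted(covered))
```

Exhaustive search is exponential in n, which is consistent with the NP-hardness noted above and is why pruning and heuristics matter for realistic inputs.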

• An object p is said to dominate another object q if and only if p is larger than or equal to q on all dimensions and p is larger than q on at least one dimension

• Given a set of multi-dimensional objects, the skyline consists of the objects which are not dominated by any other object

[Figure: the skyline of a set of objects plotted on dimensions A1 and A2]

• Only the component tuples dominated by at most (k-1) other tuples in the same component table have the possibility of being a part of a top-k product for a customer c

Airlines

Airline Fare Food
a3 0.4 1
a4 0.4 0.8
a5 0.4 0.6

Hotels

Hotel Location Comfort Cleanness
h1 0.4 0.6 0.4

Package Fare Food Location Comfort Cleanness
(a3, h1) 0.4 1 0.4 0.6 0.4
(a4, h1) 0.4 0.8 0.4 0.6 0.4
(a5, h1) 0.4 0.6 0.4 0.6 0.4

Airlines (number of dominating tuples in parentheses)

Airline Fare Food
a1(0) 0.8 0.2
a2(0) 0.6 0.4
a3(0) 0.4 1
a4(1) 0.4 0.8
a5(2) 0.4 0.6

Hotels (number of dominating tuples in parentheses)

Hotel Location Comfort Cleanness
h1(2) 0.4 0.6 0.4
h2(0) 0.4 0.6 0.6
h3(1) 0.4 0.8 0.2
h4(1) 0.6 0.6 0.2
h5(0) 0.6 0.8 0.4
h6(0) 1 0.2 0.6

After pruning with k = 2 (only tuples dominated by at most one other tuple remain):

Airlines

Airline Fare Food
a1(0) 0.8 0.2
a2(0) 0.6 0.4
a3(0) 0.4 1
a4(1) 0.4 0.8

Hotels

Hotel Location Comfort Cleanness
h2(0) 0.4 0.6 0.6
h3(1) 0.4 0.8 0.2
h4(1) 0.6 0.6 0.2
h5(0) 0.6 0.8 0.4
h6(0) 1 0.2 0.6
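The numbers in parentheses can be reproduced by counting, for each tuple, how many other tuples in the same table dominate it. The following sketch uses the dominance definition given earlier (larger or equal on all dimensions, strictly larger on at least one); the function names are assumptions. Pruning with k keeps tuples dominated by at most k-1 others, and the special case k = 1 yields the skyline.

```python
def dominates(p, q):
    """True if p is >= q on every dimension and > q on at least one."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def domination_counts(table):
    """Number of other tuples in the table that dominate each tuple."""
    return {
        name: sum(dominates(other, values) for o, other in table.items() if o != name)
        for name, values in table.items()
    }


def prune(table, k):
    """Keep the tuples dominated by at most k-1 other tuples."""
    counts = domination_counts(table)
    return {name: values for name, values in table.items() if counts[name] <= k - 1}


AIRLINES = {"a1": (0.8, 0.2), "a2": (0.6, 0.4), "a3": (0.4, 1), "a4": (0.4, 0.8), "a5": (0.4, 0.6)}
HOTELS = {
    "h1": (0.4, 0.6, 0.4), "h2": (0.4, 0.6, 0.6), "h3": (0.4, 0.8, 0.2),
    "h4": (0.6, 0.6, 0.2), "h5": (0.6, 0.8, 0.4), "h6": (1, 0.2, 0.6),
}

print(domination_counts(AIRLINES))
print(domination_counts(HOTELS))
print(sorted(prune(AIRLINES, 2)), sorted(prune(HOTELS, 2)))  # k = 2
print(sorted(prune(AIRLINES, 1)), sorted(prune(HOTELS, 1)))  # skylines
```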

• For any two candidate products cp1 and cp2 in P, if cp1 dominates cp2, then $\mathrm{RTOP}_k(cp_2, P, C) \subseteq \mathrm{RTOP}_k(cp_1, P, C)$

• For any candidate product cp in P, if $cp \notin \mathrm{Skyline}(P)$, then cp is not in the n-k MFP

The candidate products in the n-k MFP must be in Skyline(P)

• A candidate product cp is in Skyline(P) if and only if cp is in the set of candidate products generated from Skyline(T1), Skyline(T2), …, and Skyline(Tx) [VLDB'09]

• Only the skyline tuples of each component table have the possibility of being a part of a candidate product in the n-k MFP

• Only the customers in RTOPk(cp, Skyline(P), C) can possibly be members of RTOPk(cp, P, C)

RTOPk(cp, Skyline(P), C) is an upper bound of RTOPk(cp, P, C)

The upper bounds of the remaining candidate packages

Package Upper bound
(a1, h2) {c3}
(a1, h5) {c3, c4}
(a1, h6) {}
(a2, h2) {}
(a2, h5) {c4}
(a2, h6) {c1, c5}
(a3, h2) {c2}
(a3, h5) {c2, c4}
(a3, h6) {c1, c5}

Package Upper bound
(a1, h2) {c3}
(a1, h5) {c3, c4}
(a2, h5) {c4}
(a2, h6) {c1, c5}
(a3, h2) {c2}
(a3, h5) {c2, c4}
(a3, h6) {c1, c5}

The top-2 favorites of C3: {(a1, h5), (a1, h2)}

The top-2 favorites of C4: {(a1, h5), (a2, h5), (a3, h5)}

P' : {(a1, h5)}

Package Upper bound
(a2, h6) {c1, c5}
(a3, h2) {c2}
(a3, h5) {c2}
(a3, h6) {c1, c5}

The top-2 favorites of C1: {(a3, h6), (a4, h6)}

The top-2 favorites of C5: {(a3, h6), (a4, h6)}

P' : {(a1, h5), (a3, h6)}

Application IV

• Find Most Favorite Products by Top-k Reverse Skyline Queries

[Figure: user preferences u1 and u2 and products plotted by Year and Mileage, with k=1]

Thank you for your attention!
