Bandwidth-Efficient Continuous Query Processing over DHTs
Yingwu Zhu
Background
- Instantaneous Query
- Continuous Query
Instantaneous Query (1)
- Documents are indexed: the node responsible for keyword t stores the IDs of documents containing that term (i.e., an inverted list)
- Retrieves "one-time" relevant docs; latency is a top priority
- Query Q = t1 Λ t2 ...: fetch the lists of doc IDs stored under t1, t2, ..., then intersect these lists
- E.g.: the Google search engine
Instantaneous Query (2)
[Diagram: four DHT nodes A, B, C, D, each storing an inverted list: cat: 1,4,7,19,20; dog: 1,5,7,26; cow: 2,4,8,18; bat: 1,8,31]
- Query "cat Λ dog": fetch cat's list and dog's list, then intersect them
- Result: Docs 1, 7
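The intersection step above can be sketched in a few lines of Python (a toy illustration; the dictionary stands in for per-term DHT lookups, and the term names and doc IDs mirror the example):

```python
# Toy illustration: the dictionary stands in for per-term DHT lookups;
# term names and doc IDs mirror the example above.
inverted_lists = {
    "cat": {1, 4, 7, 19, 20},
    "dog": {1, 5, 7, 26},
}

def resolve(query_terms, index):
    """Instantaneous query Q = t1 AND t2 ...: fetch each term's
    inverted list and intersect them."""
    return set.intersection(*(index[t] for t in query_terms))

print(sorted(resolve(["cat", "dog"], inverted_lists)))  # -> [1, 7]
```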
Continuous Query (1)
- Reverse the roles of documents and queries: queries are indexed
- Query Q = t1 Λ t2 ... is stored at the node responsible for one of the terms t1, t2, ...
- Question 1: How is the index term selected? (query indexing)
- "Push" new relevant docs (incrementally), enabled by "long-lived" queries
- E.g.: the Google News Alert feature
Continuous Query (2)
- Upon insertion of a new doc D = t1 Λ t2, contact the nodes responsible for the inverted query lists of D's keywords t1 and t2
- Question 2: How are these nodes (query nodes, QNs) located? (document announcement)
- Resolve the query lists into the final list of queries satisfied by D
- Question 3: What is the resolution strategy? (query resolution); e.g., Term Dialogue, Bloom filters (Infocom'06)
- Notify the owners of satisfied queries
Query Resolution: Term Dialogue
[Diagram: node A holds the inverted list for "cat" with queries Q1 = dog, Q2 = horse & dog, Q3 = horse & cow; a new doc contains cat, dog, cow; nodes C and D hold the inverted lists for "dog" and "cow"]
1. Document announcement arrives at A
2. A asks: does the doc contain "dog" & "cow"?
3. Reply: "11" (bit vector)
4. A asks: does the doc contain "horse"?
5. Reply: "0" (bit vector); notify the owner of Q1
Query Resolution: Bloom filters
[Diagram: same setup; node A holds the inverted list for "cat" with queries Q1 = dog, Q2 = horse & dog, Q3 = horse & cow; the new doc contains cat, dog, cow]
1. Doc announcement carries "10110" (a Bloom filter of the doc's terms)
2. A checks its queries against the filter; "dog" still needs confirmation, resolved via Term Dialogue
3. Reply: "1" (bit vector); notify the owner of Q1
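A minimal sketch of the Bloom-filter side of the announcement, assuming an illustrative filter of M bits with K salted hashes (the parameters and helper names are invented for this example): the doc's terms are folded into a bit vector, and a query node tests terms against it; a positive test may be a false positive, which is why a Term Dialogue round can follow.

```python
import hashlib

M, K = 64, 3  # filter size in bits and number of hash functions (illustrative)

def _positions(term):
    """K bit positions for a term, derived from salted SHA-256 hashes."""
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{term}".encode()).digest()
        yield int.from_bytes(digest[:8], "big") % M

def make_filter(terms):
    """Fold all document terms into an M-bit Bloom filter."""
    bits = 0
    for t in terms:
        for p in _positions(t):
            bits |= 1 << p
    return bits

def maybe_contains(bits, term):
    """True if the term may be in the doc; False means definitely not."""
    return all(bits >> p & 1 for p in _positions(term))

doc_filter = make_filter(["cat", "dog", "cow"])
print(maybe_contains(doc_filter, "dog"))    # -> True (terms in the doc always pass)
print(maybe_contains(doc_filter, "horse"))  # usually False; false positives are possible
```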
Motivation
- Latency is not the primary concern, but bandwidth is an important design issue
- Different query indexing schemes incur different costs; so do different query resolution strategies
- Goal: design a bandwidth-efficient continuous query system with "proper" query indexing (Question #1), document announcement (Question #2), and query resolution (Question #3) approaches
Contributions
- Novel query indexing schemes (Question #1): the focus of this talk!
- Multicast-based document announcement (Question #2): in the paper
- Adaptive query resolution (Question #3): make intelligent decisions in resolving query terms to minimize bandwidth cost (in the full tech report)
- Focus on simple keyword queries, e.g., Q = t1 Λ t2 Λ ... Λ tn
- Leverage DHTs for the location & storage of documents and continuous queries
- Query indexing: how to choose index terms for queries?
- Document announcement & query resolution: not covered in this talk!
Design
Current Indexing Schemes
- Random Indexing (RI)
- Optimal Indexing (OI)
Random Indexing (RI)
- Randomly choose a term as the index term
- Q = t1 Λ ... Λ tm: index term ti is randomly selected; Q is indexed at the DHT node responsible for ti
- Pros: simple
- Cons: popular terms are more likely to become index terms, causing load imbalance and introducing many irrelevant queries into query resolution, wasting bandwidth
Optimal Indexing (OI)
- Q = t1 Λ ... Λ tm: index term ti is deterministically chosen as the most selective term, i.e., the one with the lowest frequency
- Q is indexed at the DHT node responsible for ti
- Pros: maximizes load balance & minimizes bandwidth cost
- Cons: assumes perfect knowledge of term statistics, which is impractical, e.g., due to the large number of documents, node churn, continuous doc updates, ...
Solution 1: MHI
- Minimum Hash Indexing: order query terms by their hashes and select the term with the minimum hash as the index term
- Q = t1 Λ ... Λ tm: index term ti is deterministically chosen s.t. h(ti) < h(tx) for all x ≠ i
- Q is indexed at the DHT node responsible for ti
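MHI's index-term choice can be sketched as follows (using SHA-1 as the ordering hash is an assumption; any hash consistent across DHT nodes would do):

```python
import hashlib

def term_hash(term):
    """Ordering hash for terms (SHA-1 here is an assumption; any
    hash consistent across DHT nodes works)."""
    return int.from_bytes(hashlib.sha1(term.encode()).digest(), "big")

def mhi_index_term(query_terms):
    """Minimum Hash Indexing: the index term is the query term with
    the smallest hash; deterministic, no term statistics needed."""
    return min(query_terms, key=term_hash)

q = ["horse", "dog", "cat"]
# Every node computes the same index term for the same query:
assert mhi_index_term(q) == mhi_index_term(list(reversed(q)))
print(mhi_index_term(q))
```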
RI vs. MHI
- Terms t1 ... t7, where h(ti) < h(tj) for i < j
- Document D = {t2, t4, t5, t6}
- 3 queries, all irrelevant to D:
  - Q1 = t1 Λ t2 Λ t4
  - Q2 = t3 Λ t4 Λ t5
  - Q3 = t3 Λ t5 Λ t6
- (1) RI: Q1, Q2, and Q3 will each be considered in query resolution with probability 67% (the terms t1, t2, t3, t4, t5, and t6 must be resolved)
- (2) MHI: all of them will be filtered out, saving bandwidth! How?
MHI: filtering irrelevant queries!
[Diagram: document D = {t2, t4, t5, t6} is announced to the nodes responsible for t2, t4, t5, and t6, whose query lists are all empty ("no action"); Q1 is stored under t1, and Q2, Q3 under t3, nodes that D never contacts]
- Q1 = t1 Λ t2 Λ t4, Q2 = t3 Λ t4 Λ t5, Q3 = t3 Λ t5 Λ t6
- All three are disregarded in query resolution, saving bandwidth!
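The example can be replayed in a small simulation (the numeric suffix stands in for the hash order, since h(t1) < h(t2) < ... < h(t7) in the slide):

```python
# Replays the example: h(t1) < h(t2) < ... < h(t7), so the numeric
# suffix can stand in for the hash order.
doc = {"t2", "t4", "t5", "t6"}
queries = {
    "Q1": {"t1", "t2", "t4"},
    "Q2": {"t3", "t4", "t5"},
    "Q3": {"t3", "t5", "t6"},
}

def mhi_index(terms):
    """MHI index term: minimum hash; here suffix order == hash order."""
    return min(terms)

for name, terms in sorted(queries.items()):
    contacted = mhi_index(terms) in doc      # MHI: node hit only if index term is in D
    ri_prob = len(terms & doc) / len(terms)  # RI: chance the random index term is in D
    print(f"{name}: MHI contacted={contacted}, RI contact prob={ri_prob:.0%}")
# All three lines: MHI contacted=False, RI contact prob=67%
```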
MHI
- Pros: simple and deterministic; does not require term stats; saves bandwidth over RI (up to 39.3% saving across various query types)
- Cons: some popular terms can still become index terms when they have the minimum hash in their queries, causing load imbalance & irrelevant queries to process
Solution 2: SAP-MHI
- MHI is good but may still index queries under popular terms
- SAmPling-based MHI (SAP-MHI): sampling (a synopsis of the K popular terms) + MHI, avoiding indexing queries under the K popular terms
- Challenge: support duplicate-sensitive aggregation of popular terms, as synopses may be gossiped over multiple DHT overlay links and term frequencies may be overestimated!
- Borrows an idea from duplicate-sensitive aggregation in sensor networks
SAP-MHI: Duplicate-Sensitive Aggregation
- Goal: a synopsis of the K popular terms
- Based on the coin-tossing experiment CT(y): toss a fair coin until either the first head occurs or y tosses end with no head, and return the number of tosses
- Each node a:
  - Produces a local synopsis Sa containing K popular terms (the terms with the highest values of CT(y))
  - Gossips Sa to its neighbor nodes
  - Upon receiving a synopsis Sb from a neighbor b, aggregates Sa and Sb into a new synopsis Sa (max() operations)
- Thus, each node has a synopsis of the K popular terms after a sufficient number of gossip rounds
- Intuition: if a term appears in more documents, its value produced by CT(y) will be larger than the values of rare terms
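A sketch of the coin-tossing synopsis, assuming the coin tosses are derived from a hash of (term, doc ID) so the same occurrence always yields the same CT value; that determinism plus max() aggregation is what keeps synopses re-gossiped over multiple overlay links from inflating term counts. All names and the cap y = 32 are illustrative:

```python
import hashlib

Y = 32  # cap y on the number of coin tosses (illustrative)

def ct(term, doc_id):
    """CT(y): number of tosses until the first head, capped at y.
    The coin is derived from a hash of (term, doc_id), so the same
    occurrence always yields the same value; combined with max()
    aggregation below, re-receiving a synopsis cannot inflate counts."""
    h = int.from_bytes(
        hashlib.sha256(f"{term}:{doc_id}".encode()).digest(), "big")
    tosses = 1
    while tosses < Y and h & 1 == 0:  # low bit 0 == tails
        h >>= 1
        tosses += 1
    return tosses

def local_synopsis(local_docs, k):
    """Local synopsis S_a: keep the k terms with the highest CT values
    over this node's documents ({doc_id: [terms]})."""
    best = {}
    for doc_id, terms in local_docs.items():
        for t in terms:
            best[t] = max(best.get(t, 0), ct(t, doc_id))
    return dict(sorted(best.items(), key=lambda kv: -kv[1])[:k])

def merge(sa, sb, k):
    """Gossip aggregation: per-term max of two synopses, truncated to k."""
    out = dict(sa)
    for t, v in sb.items():
        out[t] = max(out.get(t, 0), v)
    return dict(sorted(out.items(), key=lambda kv: -kv[1])[:k])

docs = {1: ["cat", "dog"], 2: ["cat"], 3: ["cat", "cow"]}
s = local_synopsis(docs, 2)
print(merge(s, s, 2) == s)  # -> True: max() is idempotent, duplicates do no harm
```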
SAP-MHI: Indexing Example
- Query Q = t1 Λ t2 Λ t3 Λ t4 Λ t5, where h(t1) < h(t2) < h(t3) < h(t4) < h(t5)
- Synopsis S = {t1, t2}
- Q is indexed at the node responsible for t3, instead of t1
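The selection rule in this example can be sketched as follows (the suffix-as-hash ordering and the fallback when every term is popular are simplifying assumptions):

```python
def term_hash(t):
    # Illustration only: assume h(t1) < h(t2) < ... as in the example,
    # so the numeric suffix stands in for the DHT hash.
    return int(t[1:])

def sap_mhi_index_term(query_terms, synopsis):
    """SAP-MHI: skip terms in the popular-term synopsis, then apply MHI
    to the rest. Falling back to plain MHI when every term is popular
    is a simplifying assumption."""
    candidates = [t for t in query_terms if t not in synopsis]
    return min(candidates or list(query_terms), key=term_hash)

q = ["t1", "t2", "t3", "t4", "t5"]
print(sap_mhi_index_term(q, {"t1", "t2"}))  # -> t3 (not t1)
print(sap_mhi_index_term(q, set()))         # -> t1 (plain MHI)
```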
Simulations
Parameter                  Value
DHT                        1000-node Chord
Document collection        TREC-1,2-AP
Mean query size            5
# of continuous queries    100,000
# of docs                  10,000
# of unique terms          46,654
# of unique terms per doc  178
Query types                Skew, Uniform, InverSkew
Query resolution           Term Dialogue, Bloom filters
SAP-MHI vs. MHI
SAP-MHI improves load balance over MHI with increasing synopsis size K, for Skew queries.
SAP-MHI vs. MHI
[Chart: bandwidth saving (%) vs. synopsis size K (100, 500, 1000, 1500, 2000, 3000) for Skew, Uniform, and InverSkew queries]
Bloom filters are used in query resolution.
SAP-MHI vs. MHI
[Chart: bandwidth saving (%) vs. synopsis size K (100 to 3000) for Skew, Uniform, and InverSkew queries]
Term Dialogue is used in query resolution.
SAP-MHI vs. MHI
[Chart: % of queries filtered vs. synopsis size K (100 to 3000) for Skew, Uniform, and InverSkew queries]
This shows why SAP-MHI saves bandwidth over MHI!
Summary
- Focus on a simple keyword query model; bandwidth is a top priority; query indexing impacts bandwidth cost
- Goal: sift out as many irrelevant queries as possible!
- MHI and SAP-MHI; SAP-MHI is the more viable solution
  - Load is more balanced, with more bandwidth saving!
  - Sampling cost is controlled: the number of popular terms is relatively low, and the membership of the popular-term set does not change rapidly
- Document announcement & adaptive query resolution further cut bandwidth consumption (not covered in this talk)
Thank You!