TRANSCRIPT
Efficient Exact Similarity Searches using Multiple Token Orderings
Jongik Kim1 and Hongrae Lee2
1 Chonbuk National University, South Korea   2 Google Inc.
Introduction
Similarity search is important in many applications
Data cleaning
Record linkage
Near-duplicate detection
Query refinement
The focus of our work is efficient evaluation of similarity queries
Many applications invoke queries simultaneously, and usually require fast response times
We therefore need to evaluate similarity queries efficiently
[Figure: simultaneous query requests, illustrated with the misspelled query “angty bird” (a typo)]
Problem Definition
How do we measure the similarity between two strings?
Name: Bill Gates, Linus Torvalds, Steven P. Jobs, Dennis Ritchie, …
Search
Query q: Steve Jobs
Output: each string s that satisfies sim(q, s) ≥ α
The overlap similarity sim(“steve”, “steven”) is defined as |TS(“steve”) ∩ TS(“steven”)| = 3
1. Convert each string into a record, where a record is a set of tokens: tokenize each string into the set of all q-gram tokens of the string.
   q-gram: a substring of the string of length q
   TS(“steve”) = {ste, tev, eve} and TS(“steven”) = {ste, tev, eve, ven}
2. Count the number of common tokens between two records (token sets).
Why do we use the overlap similarity?
It supports many other similarity measures.
e.g. J(x, y) ≥ t ⇔ O(x, y) ≥ (t / (1 + t)) · (|x| + |y|)   (J: Jaccard similarity, O: overlap similarity)
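The tokenization, the overlap similarity, and the Jaccard-to-overlap threshold conversion above can be sketched in Python (function names are illustrative, not from the paper):

```python
def qgrams(s, q=3):
    """Token set of a string: all substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def overlap(x, y):
    """Overlap similarity: the number of common tokens."""
    return len(x & y)

def overlap_threshold(x, y, t):
    """Overlap bound equivalent to Jaccard(x, y) >= t:
    O(x, y) >= t/(1+t) * (|x| + |y|)."""
    return t / (1 + t) * (len(x) + len(y))

ts_steve = qgrams("steve")    # {'ste', 'tev', 'eve'}
ts_steven = qgrams("steven")  # {'ste', 'tev', 'eve', 'ven'}
print(overlap(ts_steve, ts_steven))  # 3
```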
Inverted Lists based Approach
ID | String  | Record (token set)
1  | area    | {ar, re, ea}
2  | artisan | {ar, rt, ti, is, sa, an}
3  | artist  | {ar, rt, ti, is, st}
4  | tisk    | {ti, is, sk}
…  | …       | …
Make inverted lists:
ar → 1, 2, 3
rt → 2, 3
ti → 2, 3, 4
is → 2, 3, 4
st → 3
sa → 2
an → 2
re → 1
ea → 1
sk → 4
Query: “artist” = {ar, rt, ti, is, st}, overlap threshold: 4

Merge the inverted lists of the query tokens to count occurrences:
record 1 → 1, record 2 → 4, record 3 → 5, record 4 → 2

Answers of the query: 2: “artisan”, 3: “artist”
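The merge-to-count approach above can be sketched as follows (a minimal Python sketch of the slide's example; names are mine):

```python
from collections import defaultdict

def build_index(records):
    """records: {id: token set}. Returns token -> sorted list of record ids."""
    index = defaultdict(list)
    for rid in sorted(records):
        for tok in records[rid]:
            index[tok].append(rid)
    return index

def merge_count_search(index, query_tokens, alpha):
    """Merge the query tokens' inverted lists, counting occurrences;
    a record counted >= alpha times shares at least alpha tokens."""
    counts = defaultdict(int)
    for tok in query_tokens:
        for rid in index.get(tok, []):
            counts[rid] += 1
    return {rid for rid, c in counts.items() if c >= alpha}

records = {1: {"ar", "re", "ea"},
           2: {"ar", "rt", "ti", "is", "sa", "an"},
           3: {"ar", "rt", "ti", "is", "st"},
           4: {"ti", "is", "sk"}}
idx = build_index(records)
print(merge_count_search(idx, {"ar", "rt", "ti", "is", "st"}, 4))  # {2, 3}
```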
Prefix Filtering based Approach
Query q = “artist” {ar, rt, ti, is, st} and overlap threshold α = 4
Inverted lists for the query:
ar → 1, 2, 3
rt → 2, 3
ti → 2, 3, 4
is → 2, 3, 4
st → 3

Sort the lists by their sizes:
st → 3
rt → 2, 3
ar → 1, 2, 3
ti → 2, 3, 4
is → 2, 3, 4
Prefix Lists: the first |TS(q)| – α + 1 lists
Suffix Lists: remaining α – 1 lists
Filtering phase (the prefix filtering): merge the prefix lists to generate candidates.
Verification phase: for each candidate, search the suffix lists to check whether the candidate is contained in each list. Binary search is used because suffix lists are usually very long.
With prefix lists st → {3} and rt → {2, 3}, the candidates are {2, 3}.
Sort the tokens by their document frequencies (the document frequency ordering).
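A minimal sketch of the two phases, assuming a prebuilt token → sorted record-id index (the index literal below reproduces the slide's example; names are illustrative):

```python
from bisect import bisect_left
from collections import defaultdict

def contains(sorted_list, rid):
    """Binary-search membership test on a sorted inverted list."""
    i = bisect_left(sorted_list, rid)
    return i < len(sorted_list) and sorted_list[i] == rid

def prefix_filter_search(index, query_tokens, alpha):
    # Sort the query tokens' inverted lists by size
    # (equivalently, sort the tokens by document frequency).
    lists = sorted((index.get(t, []) for t in query_tokens), key=len)
    n_prefix = len(query_tokens) - alpha + 1
    prefix, suffix = lists[:n_prefix], lists[n_prefix:]

    # Filtering phase: merge the prefix lists to generate candidates.
    counts = defaultdict(int)
    for lst in prefix:
        for rid in lst:
            counts[rid] += 1

    # Verification phase: probe each suffix list with binary search.
    answers = set()
    for rid, c in counts.items():
        total = c + sum(contains(lst, rid) for lst in suffix)
        if total >= alpha:
            answers.add(rid)
    return answers

idx = {"ar": [1, 2, 3], "rt": [2, 3], "ti": [2, 3, 4], "is": [2, 3, 4],
       "st": [3], "re": [1], "ea": [1], "sa": [2], "an": [2], "sk": [4]}
print(prefix_filter_search(idx, {"ar", "rt", "ti", "is", "st"}, 4))  # {2, 3}
```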
Document Frequency Ordering
General Goal: minimize the number of candidates by making use of the document frequency ordering
Query q = “artist” = {ar, rt, ti, is, st} and overlap threshold α = 4

[Figure: the query example evaluated with and without the document frequency ordering. Prefix lists: the first |TS(q)| − α + 1 lists; suffix lists: the remaining α − 1 lists. Sorting the tokens by their document frequencies makes the shortest lists the prefix lists, which shrinks the candidate set.]
We can reduce (1) the time for merging the short prefix lists and (2) the number of candidates, and hence the time for verifying candidates.
Our Observation
Query q = {w1, w2} and overlap threshold α = 2
Without partitioning, w2 is the prefix list and the number of candidates is 5.
After partitioning, w2 is the prefix list in one partition (0 candidates) and w1 is the prefix list in the other (0 candidates).
The total number of candidates is 0.
Partition
Our observation:
By partitioning a data set, we can artificially modify the document frequencies of tokens in each partition.
We evaluate a query in each partition and take the union of the results.
We can reduce the number of candidates by utilizing different token orderings among partitions.
Because partitions have different token orderings, we need to sort the tokens of a query record in each partition.
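The evaluate-per-partition-and-union idea can be sketched as follows; count_search stands in for any single-partition search algorithm, and the indexes are hypothetical:

```python
from collections import defaultdict

def count_search(index, query_tokens, alpha):
    """Plain merge-count search within one partition's inverted index."""
    counts = defaultdict(int)
    for tok in query_tokens:
        for rid in index.get(tok, []):
            counts[rid] += 1
    return {rid for rid, c in counts.items() if c >= alpha}

def partitioned_search(partition_indexes, query_tokens, alpha):
    """Evaluate the query in every partition and take the union of the results."""
    answers = set()
    for index in partition_indexes:
        answers |= count_search(index, query_tokens, alpha)
    return answers

# Two partitions, each with its own inverted index (and hence its own
# token ordering when lists are sorted by size within the partition).
p1 = {"w1": [1], "w2": [1, 2]}
p2 = {"w1": [3], "w2": [3]}
print(partitioned_search([p1, p2], {"w1", "w2"}, 2))  # {1, 3}
```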
Generalization of the Observation
Query q = “reaby” = {re, ea, ab, by} = {w1, w3, ab, by}, overlap threshold α = 2
Grouping the records of I(wp) into a partition P reduces the number of candidates by at least |I(wp)| − |I(ws) ∩ P|.
Grouping the records of I(ws) into a partition P reduces the number of candidates by at least |I(wp) − P|.

Without partitioning, the prefix list is w3 and the # of candidates is 5.
In P1, the prefix list is w1: # of candidates is 2; in P2, the prefix list is w3: # of candidates is 0.
In P1, the prefix list is w3: # of candidates is 2; in P2, the prefix list is w1: # of candidates is 0.
By grouping records containing a token w into a partition, we can benefit queries containing w
I(w): the inverted list of w,wp: a prefix token, ws: a suffix token
Pivot Set & Partitioning
A pivot set S is a set of tokens such that, for any two tokens wi and wj in S, grouping I(wi) into one partition does not affect grouping I(wj) into another partition.
[Figure: tokens w1–w7 and their inverted lists over records r1–r15]
There are many pivot sets
S1 = {w1, w3}
S2 = {w2, w3, w4}
S3 = {w3, w5}
S4 = {w5, w6}
S5 = {w2, w6}
S6 = {w3, w7}
We can benefit queries containing wi as well as queries containing wj
Questions: 1. Existence of pivot sets  2. Selection of a good pivot set
[Figure: partitions P1, P2, P3 induced by a pivot set]
orphan record: randomly select its partition
Relaxation of a Pivot Set
[Figure: tokens w1–w7 and their inverted lists over records r1–r15, with overlapping lists]
※ JC(S1, S2) = |S1 ∩ S2| / min(|S1|, |S2|)
Pivot set S is a set of tokens such that for any two tokens wi and wj in S, JC(I(wi), I(wj)) ≤ β
If JC(S1, S2) = 0.1 (with |S1| ≤ |S2|), the overlap is only 10% of S1 and less than 10% of S2: 90% of S1 and more than 90% of S2 do not overlap.
If β = 0.2, the set S = {w2, w3, w4} is a pivot set
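The relaxed pivot-set condition is straightforward to check directly; a sketch, with jc and is_pivot_set as illustrative names and hypothetical inverted lists:

```python
from itertools import combinations

def jc(s1, s2):
    """JC(S1, S2) = |S1 ∩ S2| / min(|S1|, |S2|)."""
    return len(s1 & s2) / min(len(s1), len(s2))

def is_pivot_set(tokens, inverted, beta):
    """Relaxed pivot-set test: JC(I(wi), I(wj)) <= beta for every token pair."""
    return all(jc(set(inverted[a]), set(inverted[b])) <= beta
               for a, b in combinations(tokens, 2))

# Disjoint inverted lists satisfy the condition for any beta >= 0.
inv = {"w2": [1, 2, 3], "w3": [4, 5], "w4": [6, 7, 8]}
print(is_pivot_set(["w2", "w3", "w4"], inv, 0.2))  # True
```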
Pivot Set Selection
The weight of a token w is the number of queries that contain w
Goodness of a pivot set S:
By partitioning using tokens contained in many queries, we can benefit many queries
Selecting the best pivot set is an NP-hard problem (see the paper).
We use a simple greedy algorithm (simplified version): select tokens with high weights first.
[Figure: tokens w1–w7 and their inverted lists over records r1–r15, with w1 overlapping w2 and w4]
(See the paper for the details)
Problem: by selecting the high-frequency token w1 first, we lose the chance to divide the records in I(w2) and I(w4). If we divided the records in I(w2) and I(w4) instead, however, we could benefit more queries.
We solve this problem in the partitioning algorithm.
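A sketch of the simplified greedy selection (names and data are illustrative; the example below reproduces the stated problem, where picking the heavy token w1 first blocks w2 and w4):

```python
def greedy_pivot_set(weights, inverted, beta):
    """Simplified greedy: scan tokens in decreasing weight order and keep a
    token only if its JC with every already-selected token is at most beta."""
    def jc(a, b):
        return len(a & b) / min(len(a), len(b))
    selected = []
    for tok in sorted(weights, key=weights.get, reverse=True):
        lst = set(inverted[tok])
        if all(jc(lst, set(inverted[s])) <= beta for s in selected):
            selected.append(tok)
    return selected

# w1 is heaviest and its list covers those of w2 and w4, so selecting it
# first prevents w2 and w4 from entering the pivot set.
weights = {"w1": 5, "w2": 3, "w4": 2}
inverted = {"w1": [1, 2, 3, 4], "w2": [1, 2], "w4": [3, 4]}
print(greedy_pivot_set(weights, inverted, 0.2))  # ['w1']
```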
Partitioning Algorithm
[Figure: tokens w1–w7 and their inverted lists over records r1–r15]
[Figure: partitions P1 and P2; P1 is recursively split into P11 and P12]
Local orphan record: insert it into either P11 or P12.
Partitioning algorithm (simplified version; see the paper for the details):
1. Select a pivot set.
2. Partition the records using the pivot set.
3. In each partition, recursively partition the records and handle local orphan records.
4. Balance the overhead and the benefit of partitioning using a cost model.
Note: recursive partitioning does not affect the relative document frequencies of w1 in each partition.
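The recursive scheme above can be sketched as follows; min_size is a crude stand-in for the paper's cost model, pick_pivot_set is an assumed selection callback, and orphan placement is simplified (the paper assigns orphans randomly):

```python
def partition(records, pick_pivot_set, min_size=1000):
    """Recursive partitioning (simplified sketch).
    records: {record id: token set}.
    pick_pivot_set: callback returning a list of pivot tokens for a
    record set (stands in for the cost-model-driven selection)."""
    if len(records) <= min_size:
        return [records]
    pivots = pick_pivot_set(records)
    if not pivots:
        return [records]
    parts = {w: {} for w in pivots}
    orphans = {}
    for rid, toks in records.items():
        # Assign a record to the partition of the first pivot token it contains.
        hit = next((w for w in pivots if w in toks), None)
        (parts[hit] if hit is not None else orphans)[rid] = toks
    # Local orphan records (no pivot token): here simply placed in the
    # first partition for brevity.
    parts[pivots[0]].update(orphans)
    result = []
    for w in pivots:
        # Recursing does not change the relative document frequencies
        # of the pivot token within its partition.
        result.extend(partition(parts[w], pick_pivot_set, min_size))
    return result

def pick(recs):
    """Toy pivot selection for illustration only."""
    return ["a", "b"] if len(recs) > 2 else []

records = {1: {"a"}, 2: {"b"}, 3: {"c"}}
print(partition(records, pick, min_size=2))  # [{1: {'a'}, 3: {'c'}}, {2: {'b'}}]
```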
Experiments
DATASETS AND STATISTICS

Dataset     | # records | Avg # tokens | # partitions
IMDB Actor  | 1,213,391 | 16           | ED 28, JC 12
IMDB Movie  | 1,568,891 | 19           | ED 18, JC 12
DBLP Author | 2,948,929 | 15           | ED 55, JC 55
Web Corpus  | 6,000,000 | 21           | ED 54, JC 85
Similarity functions: Jaccard similarity (thresholds 0.6, 0.7, 0.8), edit distance (thresholds 2, 3, 4)
Search algorithms:
Jaccard: SequentialMerge, DivideSkip [Li et al., ICDE '08], PPMerge [Xiao et al., WWW '08]
Edit distance: SequentialMerge, DivideSkip, EDMerge [Xiao et al., PVLDB '08]
Size filtering [Arasu et al., VLDB '06] (applied to all algorithms)
Partitioned case vs. unpartitioned case: elapsed times and number of candidates
Experiments
Jaccard similarity (DBLP Author)
Running Time Number of Candidates
Experiments
Edit distance (Web Corpus)
Running Time Number of Candidates
※ Edit distance: false positives are not removed!
Conclusions
Studied how to reduce the number of candidates for efficient similarity searches
Proposed the concept of the pivot set and partitioning technique using a pivot set
Showed benefits of the proposed technique experimentally
THANK YOU!