TRANSCRIPT
Efficient Exact Similarity Searches using Multiple Token Orderings
Jongik Kim1 and Hongrae Lee2
1 Chonbuk National University, South Korea   2 Google Inc.
Introduction
Similarity search is important in many applications
Data cleaning
Record linkage
Near-duplicate detection
Query refinement
The focus of our work is efficient evaluation of similarity queries
Many applications invoke queries simultaneously, and usually require fast response times
We therefore need to evaluate similarity queries efficiently
[Figure: simultaneous query requests, illustrated with the misspelled query “angty bird” (a typo)]
Problem Definition
How do we measure the similarity between two strings?
Name: Bill Gates, Linus Torvalds, Steven P. Jobs, Dennis Ritchie, …
Search
Query q: Steve Jobs
Output: each string s that satisfies sim(q, s) ≥ α
The overlap similarity sim(“steve”, “steven”) is defined as |TS(“steve”) ∩ TS(“steven”)| = 3
1. Convert each string into a record, where a record is a set of tokens: tokenize each string into the set of all q-gram tokens of the string.
   q-gram: a substring of the string of length q
   TS(“steve”) = {ste, tev, eve} and TS(“steven”) = {ste, tev, eve, ven}
2. Count the number of common tokens between two records (token sets).
Why do we use the overlap similarity?
It supports many other similarity measures.
e.g. J(x, y) ≥ t ⇔ O(x, y) ≥ (t / (1 + t)) · (|x| + |y|)   (J: Jaccard similarity, O: overlap similarity)
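The tokenization, the overlap similarity, and the Jaccard-to-overlap threshold conversion above can be sketched in Python (function names are illustrative, not from the paper):

```python
def qgrams(s, q=3):
    """Token set of a string: all substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def overlap(x, y):
    """Overlap similarity: the number of common tokens."""
    return len(x & y)

def overlap_threshold(x, y, t):
    """Overlap bound equivalent to Jaccard(x, y) >= t:
    O(x, y) >= t/(1+t) * (|x| + |y|)."""
    return t / (1 + t) * (len(x) + len(y))

ts_steve = qgrams("steve")    # {'ste', 'tev', 'eve'}
ts_steven = qgrams("steven")  # {'ste', 'tev', 'eve', 'ven'}
print(overlap(ts_steve, ts_steven))  # 3
```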
Inverted Lists based Approach
ID | String  | Record (token set)
1  | area    | {ar, re, ea}
2  | artisan | {ar, rt, ti, is, sa, an}
3  | artist  | {ar, rt, ti, is, st}
4  | tisk    | {ti, is, sk}
…  | …       | …
Make inverted lists:
ar → 1, 2, 3
rt → 2, 3
ti → 2, 3, 4
is → 2, 3, 4
st → 3
sa → 2
an → 2
re → 1
ea → 1
sk → 4
Query: “artist” = {ar, rt, ti, is, st}, overlap threshold: 4

Merge the inverted lists of the query tokens to count occurrences:
record 1 → 1, record 2 → 4, record 3 → 5, record 4 → 2

Answers of the query: 2: “artisan”, 3: “artist”
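The merge-to-count approach above can be sketched as follows (a minimal Python sketch of the slide's example; names are mine):

```python
from collections import defaultdict

def build_index(records):
    """records: {id: token set}. Returns token -> sorted list of record ids."""
    index = defaultdict(list)
    for rid in sorted(records):
        for tok in records[rid]:
            index[tok].append(rid)
    return index

def merge_count_search(index, query_tokens, alpha):
    """Merge the query tokens' inverted lists, counting occurrences;
    a record counted >= alpha times shares at least alpha tokens."""
    counts = defaultdict(int)
    for tok in query_tokens:
        for rid in index.get(tok, []):
            counts[rid] += 1
    return {rid for rid, c in counts.items() if c >= alpha}

records = {1: {"ar", "re", "ea"},
           2: {"ar", "rt", "ti", "is", "sa", "an"},
           3: {"ar", "rt", "ti", "is", "st"},
           4: {"ti", "is", "sk"}}
idx = build_index(records)
print(merge_count_search(idx, {"ar", "rt", "ti", "is", "st"}, 4))  # {2, 3}
```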
Prefix Filtering based Approach
Query q = “artist” {ar, rt, ti, is, st} and overlap threshold α = 4
Inverted lists for the query:
ar → 1, 2, 3
rt → 2, 3
ti → 2, 3, 4
is → 2, 3, 4
st → 3

Sort the lists by their sizes:
st → 3
rt → 2, 3
ar → 1, 2, 3
ti → 2, 3, 4
is → 2, 3, 4
Prefix Lists: the first |TS(q)| – α + 1 lists
Suffix Lists: remaining α – 1 lists
Filtering phase (the prefix filtering): merge the prefix lists to generate candidates.
Verification phase: for each candidate, search the suffix lists to check whether the candidate is contained in each list. Binary search is used because suffix lists are usually very long.
With prefix lists st → {3} and rt → {2, 3}, the candidates are {2, 3}.
Sort the tokens by their document frequencies (the document frequency ordering).
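A minimal sketch of the two phases, assuming a prebuilt token → sorted record-id index (the index literal below reproduces the slide's example; names are illustrative):

```python
from bisect import bisect_left
from collections import defaultdict

def contains(sorted_list, rid):
    """Binary-search membership test on a sorted inverted list."""
    i = bisect_left(sorted_list, rid)
    return i < len(sorted_list) and sorted_list[i] == rid

def prefix_filter_search(index, query_tokens, alpha):
    # Sort the query tokens' inverted lists by size
    # (equivalently, sort the tokens by document frequency).
    lists = sorted((index.get(t, []) for t in query_tokens), key=len)
    n_prefix = len(query_tokens) - alpha + 1
    prefix, suffix = lists[:n_prefix], lists[n_prefix:]

    # Filtering phase: merge the prefix lists to generate candidates.
    counts = defaultdict(int)
    for lst in prefix:
        for rid in lst:
            counts[rid] += 1

    # Verification phase: probe each suffix list with binary search.
    answers = set()
    for rid, c in counts.items():
        total = c + sum(contains(lst, rid) for lst in suffix)
        if total >= alpha:
            answers.add(rid)
    return answers

idx = {"ar": [1, 2, 3], "rt": [2, 3], "ti": [2, 3, 4], "is": [2, 3, 4],
       "st": [3], "re": [1], "ea": [1], "sa": [2], "an": [2], "sk": [4]}
print(prefix_filter_search(idx, {"ar", "rt", "ti", "is", "st"}, 4))  # {2, 3}
```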
Document Frequency Ordering
General Goal: minimize the number of candidates by making use of the document frequency ordering
Query q = “artist” = {ar, rt, ti, is, st} and overlap threshold α = 4

[Figure: the query example evaluated with and without the document frequency ordering. Prefix lists: the first |TS(q)| − α + 1 lists; suffix lists: the remaining α − 1 lists. Sorting the tokens by their document frequencies makes the shortest lists the prefix lists, which shrinks the candidate set.]
We can reduce (1) the time for merging the short prefix lists and (2) the number of candidates, and hence the time for verifying candidates.
Our Observation
Query q = {w1, w2} and overlap threshold α = 2
Without partitioning, w2 is the prefix list and the number of candidates is 5.
After partitioning, w2 is the prefix list in one partition (0 candidates) and w1 is the prefix list in the other (0 candidates).
The total number of candidates is 0.
Partition
Our observation:
By partitioning a data set, we can artificially modify the document frequencies of tokens in each partition.
We evaluate a query in each partition and take the union of the results.
We can reduce the number of candidates by utilizing different token orderings among partitions.
Because partitions have different token orderings, we need to sort the tokens of a query record in each partition.
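The evaluate-per-partition-and-union idea can be sketched as follows; count_search stands in for any single-partition search algorithm, and the indexes are hypothetical:

```python
from collections import defaultdict

def count_search(index, query_tokens, alpha):
    """Plain merge-count search within one partition's inverted index."""
    counts = defaultdict(int)
    for tok in query_tokens:
        for rid in index.get(tok, []):
            counts[rid] += 1
    return {rid for rid, c in counts.items() if c >= alpha}

def partitioned_search(partition_indexes, query_tokens, alpha):
    """Evaluate the query in every partition and take the union of the results."""
    answers = set()
    for index in partition_indexes:
        answers |= count_search(index, query_tokens, alpha)
    return answers

# Two partitions, each with its own inverted index (and hence its own
# token ordering when lists are sorted by size within the partition).
p1 = {"w1": [1], "w2": [1, 2]}
p2 = {"w1": [3], "w2": [3]}
print(partitioned_search([p1, p2], {"w1", "w2"}, 2))  # {1, 3}
```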
Generalization of the Observation
Query q = “reaby” = {re, ea, ab, by} = {w1, w3, ab, by}, overlap threshold α = 2
Grouping the records of I(wp) into a partition P reduces the number of candidates by at least |I(wp)| − |I(ws) ∩ P|.
Grouping the records of I(ws) into a partition P reduces the number of candidates by at least |I(wp) − P|.

Without partitioning, the prefix list is w3 and the # of candidates is 5.
In P1, the prefix list is w1: # of candidates is 2; in P2, the prefix list is w3: # of candidates is 0.
In P1, the prefix list is w3: # of candidates is 2; in P2, the prefix list is w1: # of candidates is 0.
By grouping records containing a token w into a partition, we can benefit queries containing w
I(w): the inverted list of w,wp: a prefix token, ws: a suffix token
Pivot Set & Partitioning
A pivot set S is a set of tokens such that, for any two tokens wi and wj in S, grouping I(wi) into one partition does not affect grouping I(wj) into another partition.
[Figure: tokens w1–w7 and their inverted lists over records r1–r15]
There are many pivot sets
S1 = {w1, w3}
S2 = {w2, w3, w4}
S3 = {w3, w5}
S4 = {w5, w6}
S5 = {w2, w6}
S6 = {w3, w7}
We can benefit queries containing wi as well as queries containing wj
Questions: 1. Existence of pivot sets  2. Selection of a good pivot set
[Figure: partitions P1, P2, P3 induced by a pivot set]
orphan record: randomly select its partition
Relaxation of a Pivot Set
[Figure: tokens w1–w7 and their inverted lists over records r1–r15, with overlapping lists]
※ JC(S1, S2) = |S1 ∩ S2| / min(|S1|, |S2|)
Pivot set S is a set of tokens such that for any two tokens wi and wj in S, JC(I(wi), I(wj)) ≤ β
If JC(S1, S2) = 0.1 (with |S1| ≤ |S2|), the overlap is only 10% of S1 and less than 10% of S2: 90% of S1 and more than 90% of S2 do not overlap.
If β = 0.2, the set S = {w2, w3, w4} is a pivot set
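The relaxed pivot-set condition is straightforward to check directly; a sketch, with jc and is_pivot_set as illustrative names and hypothetical inverted lists:

```python
from itertools import combinations

def jc(s1, s2):
    """JC(S1, S2) = |S1 ∩ S2| / min(|S1|, |S2|)."""
    return len(s1 & s2) / min(len(s1), len(s2))

def is_pivot_set(tokens, inverted, beta):
    """Relaxed pivot-set test: JC(I(wi), I(wj)) <= beta for every token pair."""
    return all(jc(set(inverted[a]), set(inverted[b])) <= beta
               for a, b in combinations(tokens, 2))

# Disjoint inverted lists satisfy the condition for any beta >= 0.
inv = {"w2": [1, 2, 3], "w3": [4, 5], "w4": [6, 7, 8]}
print(is_pivot_set(["w2", "w3", "w4"], inv, 0.2))  # True
```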
Pivot Set Selection
The weight of a token w is the number of queries that contain w
Goodness of a pivot set S:
By partitioning using tokens contained in many queries, we can benefit many queries
Selecting the best pivot set is an NP-hard problem (see the paper).
We use a simple greedy algorithm (simplified version): select tokens with high weights first.
[Figure: tokens w1–w7 and their inverted lists over records r1–r15, with w1 overlapping w2 and w4]
(See the paper for the details)
Problem: by selecting the high-frequency token w1 first, we lose the chance to divide the records in I(w2) and I(w4). If we divided the records in I(w2) and I(w4) instead, however, we could benefit more queries.
We solve this problem in the partitioning algorithm.
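A sketch of the simplified greedy selection (names and data are illustrative; the example below reproduces the stated problem, where picking the heavy token w1 first blocks w2 and w4):

```python
def greedy_pivot_set(weights, inverted, beta):
    """Simplified greedy: scan tokens in decreasing weight order and keep a
    token only if its JC with every already-selected token is at most beta."""
    def jc(a, b):
        return len(a & b) / min(len(a), len(b))
    selected = []
    for tok in sorted(weights, key=weights.get, reverse=True):
        lst = set(inverted[tok])
        if all(jc(lst, set(inverted[s])) <= beta for s in selected):
            selected.append(tok)
    return selected

# w1 is heaviest and its list covers those of w2 and w4, so selecting it
# first prevents w2 and w4 from entering the pivot set.
weights = {"w1": 5, "w2": 3, "w4": 2}
inverted = {"w1": [1, 2, 3, 4], "w2": [1, 2], "w4": [3, 4]}
print(greedy_pivot_set(weights, inverted, 0.2))  # ['w1']
```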
Partitioning Algorithm
[Figure: tokens w1–w7 and their inverted lists over records r1–r15]
[Figure: partitions P1 and P2; P1 is recursively split into P11 and P12]
Local orphan record: insert it into either P11 or P12.
Partitioning algorithm (simplified version; see the paper for the details):
1. Select a pivot set.
2. Partition the records using the pivot set.
3. In each partition, recursively partition the records and handle local orphan records.
4. Balance the overhead and the benefit of partitioning using a cost model.
Note: recursive partitioning does not affect the relative document frequencies of w1 in each partition.
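The recursive scheme above can be sketched as follows; min_size is a crude stand-in for the paper's cost model, pick_pivot_set is an assumed selection callback, and orphan placement is simplified (the paper assigns orphans randomly):

```python
def partition(records, pick_pivot_set, min_size=1000):
    """Recursive partitioning (simplified sketch).
    records: {record id: token set}.
    pick_pivot_set: callback returning a list of pivot tokens for a
    record set (stands in for the cost-model-driven selection)."""
    if len(records) <= min_size:
        return [records]
    pivots = pick_pivot_set(records)
    if not pivots:
        return [records]
    parts = {w: {} for w in pivots}
    orphans = {}
    for rid, toks in records.items():
        # Assign a record to the partition of the first pivot token it contains.
        hit = next((w for w in pivots if w in toks), None)
        (parts[hit] if hit is not None else orphans)[rid] = toks
    # Local orphan records (no pivot token): here simply placed in the
    # first partition for brevity.
    parts[pivots[0]].update(orphans)
    result = []
    for w in pivots:
        # Recursing does not change the relative document frequencies
        # of the pivot token within its partition.
        result.extend(partition(parts[w], pick_pivot_set, min_size))
    return result

def pick(recs):
    """Toy pivot selection for illustration only."""
    return ["a", "b"] if len(recs) > 2 else []

records = {1: {"a"}, 2: {"b"}, 3: {"c"}}
print(partition(records, pick, min_size=2))  # [{1: {'a'}, 3: {'c'}}, {2: {'b'}}]
```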
Experiments
DATASETS AND STATISTICS

Dataset     | # records | Avg # tokens | # partitions
IMDB Actor  | 1,213,391 | 16           | ED 28, JC 12
IMDB Movie  | 1,568,891 | 19           | ED 18, JC 12
DBLP Author | 2,948,929 | 15           | ED 55, JC 55
Web Corpus  | 6,000,000 | 21           | ED 54, JC 85
Similarity functions: Jaccard similarity (thresholds 0.6, 0.7, 0.8), edit distance (thresholds 2, 3, 4)
Search algorithms:
Jaccard: SequentialMerge, DivideSkip [Li et al., ICDE '08], PPMerge [Xiao et al., WWW '08]
Edit distance: SequentialMerge, DivideSkip, EDMerge [Xiao et al., PVLDB '08]
Size filtering [Arasu et al., VLDB '06] (applied to all algorithms)
Partitioned case vs. unpartitioned case: elapsed times and number of candidates
Experiments
Jaccard similarity (DBLP Author)
Running Time Number of Candidates
Experiments
Edit distance (Web Corpus)
Running Time Number of Candidates
※ Edit distance: false positives are not removed!
Conclusions
Studied how to reduce the number of candidates for efficient similarity searches
Proposed the concept of the pivot set and partitioning technique using a pivot set
Showed benefits of the proposed technique experimentally
THANK YOU!