a sublinear time approximation scheme for clustering in metric spaces author: piotr indyk ieee focs...

A Sublinear Time Approximation Scheme for Clustering in Metric Spaces

Author: Piotr Indyk

IEEE FOCS 1999

Group members
• R90922058 李秉憲
• R90725031 陳柏安
• R90725045 張緒遠
• R90725052 鄭安巽

Outline

• Introduction

• Preliminaries

• The Algorithms

• The analysis of BC

• The analysis of UC

• Sublinear time algorithm

Introduction

• k-clustering problem:

Input: a weighted graph G = (X,d) on N vertices.

Output: a partition of X into k sets S1, …, Sk such that the value of

$\sum_{i=1}^{k} \sum_{\{u,v\} \subseteq S_i} d(u,v)$

is minimized.
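The objective can be illustrated with a small sketch. The function name `clustering_cost` and the toy metric on the real line are assumptions made for this example only; the cost is the sum of d(u,v) over all unordered pairs that fall in the same cluster.

```python
from itertools import combinations

def clustering_cost(clusters, d):
    """Sum of d(u, v) over all unordered pairs {u, v} inside the same cluster."""
    return sum(d(u, v) for S in clusters for u, v in combinations(S, 2))

# Toy metric (an assumption for illustration): points on the real line.
def line_metric(u, v):
    return abs(u - v)

# Two ways of splitting the same five points into k = 2 clusters.
cost_well = clustering_cost([[0, 1, 2], [10, 11]], line_metric)
cost_badly = clustering_cost([[0, 1, 10], [2, 11]], line_metric)
```

On this toy input the split that keeps nearby points together has the smaller cost, which is what the minimization asks for.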

Introduction

• This problem is NP-complete (for k ≥ 2) and cannot be approximated to within any constant factor.

• A standard way to reduce the complexity of clustering problems is to assume that the weight function d is a metric.

Introduction

• Facts:
– Guttman-Beck showed a 2-approximation algorithm.
– de la Vega and Kenyon gave a PTAS for metric MAX-CUT.
– Unfortunately, it does not imply a PTAS for the 2-clustering problem.

Introduction

• The results of this paper focus on the case where the de la Vega-Kenyon PTAS does not work.

• If done correctly, this procedure yields a (1+ε)-approximate solution.

Preliminaries

• Let (X,d) be a metric space. For any two sets A, B ⊆ X, we define

$d(A,B) = \sum_{a \in A,\, b \in B} d(a,b)$

• We use d(A) to denote d(A,A)/2.

• We also write d(u,B) or d(B,u) for d({u},B).

Preliminaries

• We define

$\tilde{d}(A,B) = d(A,B) / (|A| \cdot |B|)$

• Observe that $\tilde{d}$ is a metric.
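A minimal sketch of these set distances may help when reading the later analysis. The helper names (`d_sets`, `d_within`, `d_point`, `d_norm`) and the toy line metric are assumptions for illustration: `d_sets` sums d(a,b) over all pairs, `d_within` is d(A) = d(A,A)/2, and `d_norm` divides d(A,B) by |A|·|B|.

```python
def d_sets(A, B, d):
    """d(A, B): sum of d(a, b) over all a in A and b in B."""
    return sum(d(a, b) for a in A for b in B)

def d_within(A, d):
    """d(A) = d(A, A) / 2, so each unordered pair inside A is counted once."""
    return d_sets(A, A, d) / 2

def d_point(u, B, d):
    """d(u, B): shorthand for d({u}, B)."""
    return d_sets([u], B, d)

def d_norm(A, B, d):
    """Normalized distance d(A, B) / (|A| * |B|); A and B must be non-empty."""
    return d_sets(A, B, d) / (len(A) * len(B))

# Toy metric (an assumption for illustration): points on the real line.
def line_metric(u, v):
    return abs(u - v)

A, B = [0, 1], [10, 12]
total = d_sets(A, B, line_metric)
inside = d_within(A, line_metric)
average = d_norm(A, B, line_metric)
from_zero = d_point(0, B, line_metric)
```

The normalized value is simply the average distance between a point of A and a point of B.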

Preliminaries

• For sets A, B of equal cardinality, we define

$d_M(A,B) = \min_{\pi: A \to B,\ \pi \text{ 1:1}} \sum_{a \in A} d(a, \pi(a))$

and

$\tilde{d}_M(A,B) = d_M(A,B) / |A|$

• Notice that both $d_M$ and $\tilde{d}_M$ satisfy the metric properties.
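The matching distance can be sketched by brute force over all one-to-one maps. This is only an illustration for very small sets (it tries |A|! permutations); the function names and the toy metric are assumptions, and a practical implementation would use a minimum-cost bipartite matching algorithm instead.

```python
from itertools import permutations

def d_matching(A, B, d):
    """Minimum over 1:1 maps pi: A -> B of the sum of d(a, pi(a)). Brute force."""
    if len(A) != len(B):
        raise ValueError("A and B must have equal cardinality")
    return min(
        sum(d(a, b) for a, b in zip(A, image))
        for image in permutations(B)
    )

def d_matching_norm(A, B, d):
    """Matching distance divided by |A|; A must be non-empty."""
    return d_matching(A, B, d) / len(A)

# Toy metric (an assumption for illustration): points on the real line.
def line_metric(u, v):
    return abs(u - v)

# Matching 0 -> 1 and 5 -> 6 costs 2; the other matching costs 10.
best = d_matching([0, 5], [6, 1], line_metric)
best_avg = d_matching_norm([0, 5], [6, 1], line_metric)
```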

Preliminaries

• For any α ∈ [0,1], u ∈ X and A ⊆ X, we define $d_\alpha(u,A)$ to be the sum of the α|A| smallest values of {d(u,a) | a ∈ A}.

• We also define

$\tilde{d}_\alpha(u,A) = d_\alpha(u,A) / (\alpha |A|)$

• The algorithms which we give in this paper are randomized and produce (1+ε)-approximate solutions with high probability.
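The trimmed sum d_α(u,A) can be sketched as follows. The names are hypothetical, and rounding α|A| down to an integer is an assumption of this sketch (the slides do not say how a fractional α|A| is handled).

```python
import math

def d_alpha(u, A, alpha, d):
    """Sum of the floor(alpha * |A|) smallest distances from u to points of A."""
    k = math.floor(alpha * len(A))  # assumption: round alpha * |A| down
    return sum(sorted(d(u, a) for a in A)[:k])

def d_alpha_norm(u, A, alpha, d):
    """d_alpha(u, A) divided by alpha * |A| (requires alpha * |A| > 0)."""
    return d_alpha(u, A, alpha, d) / (alpha * len(A))

# Toy metric (an assumption for illustration): points on the real line.
def line_metric(u, v):
    return abs(u - v)

A = [1, 2, 3, 100]
trimmed = d_alpha(0, A, 0.75, line_metric)      # keeps 1 + 2 + 3, drops the outlier 100
full = d_alpha(0, A, 1.0, line_metric)          # all four distances
trimmed_avg = d_alpha_norm(0, A, 0.75, line_metric)
```

Dropping the largest distances is what makes the later comparisons robust to a few far-away sample points.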

The Algorithms

• The result is obtained by running three algorithms in parallel: MAXCUT, BC and UC.

• We will assume that |S1| = m and |S2| = n are given to us, as otherwise we can try all N possible combinations.

The Algorithms

• MAXCUT:
– This is the algorithm of another paper for (1-ε)-approximate MAX-CUT.
– The algorithm is useful if the cut/clustering ratio is smaller than a constant value c.
– Thus we need another algorithm for the case when the cut/clustering ratio exceeds c.

The Algorithms

• BC:
– Balance ratio:

$b(S_1,S_2) = \max(|S_1|/|S_2|,\ |S_2|/|S_1|)$

– If the cut/clustering ratio is greater than c and the balance ratio is smaller than ρ, we will run BC.

The Algorithms

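• BC:
– Uniformly chooses a set T of t = O(ρ log n) points.
– Guesses $T_1 \subseteq S_1 \cap T$ and $T_2 \subseteq S_2 \cap T$ such that |T1| = |T2| = λ = O(log n).
– It checks for each point $u \in X - T_1 - T_2$ whether

$|S_1| \cdot d_{1-\alpha}(u,T_1) < |S_2| \cdot d_{1-\alpha}(u,T_2)$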


The Algorithms

• BC (contd.):
– If the above inequality holds, then u is added to R1; otherwise we add it to R2.
– The pair $(R_1 \cup T_1,\ R_2 \cup T_2)$ is returned as a solution.
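The assignment step of BC can be sketched as below, for one fixed guess of T1 and T2. This is a hedged illustration, not the paper's code: the function names are hypothetical, the test is taken to be m·d_{1-α}(u,T1) < n·d_{1-α}(u,T2) with m = |S1| and n = |S2| (the direction used in the analysis of BC), the rounding inside `d_alpha` is an assumption, and points are assumed hashable.

```python
import math

def d_alpha(u, A, alpha, d):
    """Sum of the floor(alpha * |A|) smallest distances from u to points of A."""
    k = math.floor(alpha * len(A))  # assumption: round alpha * |A| down
    return sum(sorted(d(u, a) for a in A)[:k])

def bc_assign(X, T1, T2, m, n, alpha, d):
    """One pass of BC for a fixed guess (T1, T2); m and n are the guessed |S1|, |S2|."""
    sampled = set(T1) | set(T2)
    R1, R2 = [], []
    for u in X:
        if u in sampled:
            continue  # sample points keep the side they were guessed on
        if m * d_alpha(u, T1, 1 - alpha, d) < n * d_alpha(u, T2, 1 - alpha, d):
            R1.append(u)
        else:
            R2.append(u)
    return R1 + list(T1), R2 + list(T2)

# Toy metric (an assumption for illustration): points on the real line.
def line_metric(u, v):
    return abs(u - v)

X = [0, 1, 2, 3, 20, 21, 22, 23]
cluster1, cluster2 = bc_assign(X, [0, 1], [20, 21], 4, 4, 0.5, line_metric)
```

In the full algorithm this pass would be repeated over all guesses of (T1, T2) and of the sizes, keeping the cheapest resulting partition.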

The Algorithms

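[Figure: X is partitioned into S1 and S2, and the sample T into T1 and T2; every u ∈ X − T1 − T2 is tested by comparing |S1|·d_{1-α}(u,T1) with |S2|·d_{1-α}(u,T2).]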

The Algorithms

• UC:
– Use this algorithm when b(S1,S2) > ρ; assume |S1| > ρ|S2|.
– Obtain a set T of λ random points from S1.
– Sort all points $u \in X - T$ by $d_{1-\alpha}(u,T)$ (in ascending order).
– The first |S1| − λ points from the list are added to R1; the remaining points are added to R2.
– Output (R1, R2).
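The UC steps above can be sketched as follows. The function names are hypothetical, the sample T is passed in rather than drawn (so the example is deterministic), the rounding inside `d_alpha` is an assumption, and points are assumed hashable. As on the slide, the sketch returns (R1, R2); the points of T themselves belong with the large cluster.

```python
import math

def d_alpha(u, A, alpha, d):
    """Sum of the floor(alpha * |A|) smallest distances from u to points of A."""
    k = math.floor(alpha * len(A))  # assumption: round alpha * |A| down
    return sum(sorted(d(u, a) for a in A)[:k])

def uc_assign(X, T, size_S1, alpha, d):
    """Sort X - T by d_{1-alpha}(u, T); the closest |S1| - |T| points form R1."""
    sampled = set(T)
    rest = sorted(
        (u for u in X if u not in sampled),
        key=lambda u: d_alpha(u, T, 1 - alpha, d),
    )
    cut = size_S1 - len(T)
    return rest[:cut], rest[cut:]

# Toy metric (an assumption for illustration): points on the real line.
def line_metric(u, v):
    return abs(u - v)

# Unbalanced instance: a large cluster {0..7} and a single far point.
X = list(range(8)) + [100]
R1, R2 = uc_assign(X, [0, 4], 8, 0.5, line_metric)
```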


The Algorithms

[Figure: X with the clusters S1 and S2 and the sample T; the points u ∈ X − T are sorted by d_{1-α}(u,T).]

The Algorithms

• Cut/clustering <= c: MAXCUT → Solution
• Cut/clustering > c: test b(S1,S2) < ρ
– Yes: BC → Solution
– No: UC → Solution
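The case split can be written down as a small sketch. The optimal partition is not known to the algorithm, which is why all three procedures are run; the code only shows which case of the analysis a given partition (S1, S2) falls into. The names are hypothetical, and the cut/clustering ratio is assumed to be d(S1,S2) / (d(S1) + d(S2)), as suggested by the condition d(S1,S2) >= c(d(S1)+d(S2)) used in the analysis of BC.

```python
def d_sets(A, B, d):
    """d(A, B): sum of d(a, b) over all a in A and b in B."""
    return sum(d(a, b) for a in A for b in B)

def d_within(A, d):
    """d(A) = d(A, A) / 2."""
    return d_sets(A, A, d) / 2

def cut_clustering_ratio(S1, S2, d):
    """Assumed definition: d(S1, S2) / (d(S1) + d(S2)); denominator must be > 0."""
    return d_sets(S1, S2, d) / (d_within(S1, d) + d_within(S2, d))

def balance_ratio(S1, S2):
    """b(S1, S2) = max(|S1| / |S2|, |S2| / |S1|)."""
    return max(len(S1) / len(S2), len(S2) / len(S1))

def which_case(S1, S2, d, c, rho):
    """Name of the procedure whose analysis covers the partition (S1, S2)."""
    if cut_clustering_ratio(S1, S2, d) <= c:
        return "MAXCUT"
    return "BC" if balance_ratio(S1, S2) < rho else "UC"

# Toy metric (an assumption for illustration): points on the real line.
def line_metric(u, v):
    return abs(u - v)

balanced = which_case([0, 1], [10, 11], line_metric, c=5, rho=3)
unbalanced = which_case([0, 1, 2, 3, 4, 5], [50], line_metric, c=5, rho=3)
small_cut = which_case([0, 1, 2, 3, 4, 5], [50], line_metric, c=10, rho=3)
```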

The Algorithms

• Q1: Polynomial time?
– Yes (BC, UC).

• Q2: Feasible solution?
– Yes.

The analysis of BC

• We relate the outcome of the (randomized) comparison of d(u,T1) and d(u,T2) to a certain (deterministic) property of d.

• Lemma 1: Consider any u ∈ S1. If for every set S2' ⊆ S2 such that |S2'| ≥ (1-2α)|S2| we have d(u,S2') ≥ d(u,S1), then with high probability we have

$n \cdot d_{1-\alpha}(u,T_2) > m \cdot d_{1-\alpha}(u,T_1)$

The analysis of BC(cont’d)

• Proof:
– Without loss of generality we can consider the set S2' which contains the smallest (1-2α)n elements of S2.
– Moreover, we can assume d(u,S2') = d(u,S1) = 1.
– Finally, we will assume that the largest 2αn elements of S2 are all equal.
– For λ large enough we have a significant gap in the expected values of $d_{1-\alpha}(u,T_1)$ and $d_{1-\alpha}(u,T_2)$:

$E[d_{1-\alpha}(u,T_1)] / ((1-\alpha)|T_1|) \le d(u,S_1) / |S_1|$

$E[d_{1-\alpha}(u,T_2)] / ((1-\alpha)|T_2|) \ge d(u,S_2') / |S_2'|$

$(m/\lambda)\, E[d_{1-\alpha}(u,T_1)] \le 1-\alpha \le 1$

$(n/\lambda)\, E[d_{1-\alpha}(u,T_2)] \ge (1-\alpha)/(1-2\alpha) = 1 + \alpha/(1-2\alpha) \ge 1 + \alpha/2$

– Hence $(m/\lambda)\, E[d_{1-\alpha}(u,T_1)] \le 1$ and $(n/\lambda)\, E[d_{1-\alpha}(u,T_2)] \ge 1 + \alpha/2$.
– We want to convert the expectation bounds into bounds holding with high probability.
– Applying standard tail inequalities, we obtain that $n \cdot d_{1-\alpha}(u,T_2) \ge m \cdot d_{1-\alpha}(u,T_1)$ with high probability if $\lambda = \Omega(\log n / \alpha^4)$.


• We will upper bound the additional cost incurred by assigning u ∈ S1 to R2; the opposite case can be handled in the same way.

• We want W(C) ≤ (1+ε)·OPT, i.e. W(C) − OPT ≤ ε·OPT.

• From the above Lemma, we can assume that for every u ∈ S1 which has been included in R2 (i.e. such that $n \cdot d_{1-\alpha}(u,T_2) \le m \cdot d_{1-\alpha}(u,T_1)$) there exists a set $S_2^u \subseteq S_2$ of cardinality (1-2α)|S2| such that

$d(u,S_2^u) < d(u,S_1)$ (1)

• Thus we need only to bound $d(u, S_2 - S_2^u)$.

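The analysis of BC (cont'd)

[Figure: S1' = V ∪ (S1 − U) versus S1 = (S1 − U) ∪ U, and S2' = U ∪ (S2 − V) versus S2 = (S2 − V) ∪ V.]

• Cost difference: $[d(S_1') + d(S_2')] - [d(S_1) + d(S_2)]$

• $= [d(S_1 - U) + d(S_1 - U, V) + d(V) + d(S_2 - V) + d(S_2 - V, U) + d(U)] - [d(S_1 - U) + d(S_1 - U, U) + d(U) + d(S_2 - V) + d(S_2 - V, V) + d(V)]$

• $= [d(S_1 - U, V) + d(S_2 - V, U)] - [d(S_1 - U, U) + d(S_2 - V, V)]$

• $= [d(S_1,V) - d(U,V) + d(S_2,U) - d(V,U)] - [d(S_1,U) - d(U,U) + d(S_2,V) - d(V,V)]$

• Contribution of a single u ∈ U: $[d(u,S_2^u) + d(u,S_2 - S_2^u) - d(u,V)] - [d(u,S_1) - d(u,U)]$

• ∵ $d(u,S_2^u) < d(u,S_1)$

• ∴ $d(u,S_2 - S_2^u) - d(u,V) - (d(u,S_1) - d(u,S_2^u)) + d(u,U) \le d(u,S_2 - S_2^u)$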


• We will only be interested in u's such that

$d(u,S_2) \ge (1+\varepsilon)\, d(u,S_1)$ (2)

as otherwise the difference in the cost can be easily bounded (i.e. $d(u,S_2) - d(u,S_1) < \varepsilon \cdot d(u,S_1)$).

• From (1) and (2) we obtain that $d(u,S_2) - d(u,S_2^u) \ge \varepsilon \cdot d(u,S_1) \ge \varepsilon \cdot d(u,S_2^u)$, which can be rewritten as

$\tilde{d}(u,S_2 - S_2^u) \ge \varepsilon\, ((1-2\alpha)/(2\alpha)) \cdot \tilde{d}(u,S_2^u)$


• By the triangle inequality we have

$\tilde{d}(S_2^u, S_2 - S_2^u) \ge \tilde{d}(u, S_2 - S_2^u) - \tilde{d}(u, S_2^u) \ge (1 - 2\alpha/(\varepsilon(1-2\alpha)))\, \tilde{d}(u, S_2 - S_2^u)$ (3)

• The above gives a bound for $\tilde{d}(S_2^u, S_2 - S_2^u)$. In the following we show that the number of such u's is also not very large if (as we assume) d(S1,S2) ≥ c(d(S1)+d(S2)).

• Firstly, observe that $\tilde{d}(S_1,S_2) \le \tilde{d}(u,S_1) + \tilde{d}(u,S_2)$.


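• ∵ $\tilde{d}(u,S_2) = d(u,S_2)/n = d(u,S_2^u)/n + d(u,S_2 - S_2^u)/n = (1-2\alpha)\, \tilde{d}(u,S_2^u) + 2\alpha\, \tilde{d}(u,S_2 - S_2^u)$

$\tilde{d}(S_1,S_2) \le \tilde{d}(u,S_1) + (1-2\alpha)\, \tilde{d}(u,S_2^u) + 2\alpha\, \tilde{d}(u,S_2 - S_2^u)$

$\le \tilde{d}(u,S_1) + (1-2\alpha)\, \tilde{d}(u,S_2^u) + 2\alpha\, (\tilde{d}(u,S_2^u) + \tilde{d}(S_2^u, S_2 - S_2^u))$

$= \tilde{d}(u,S_1) + \tilde{d}(u,S_2^u) + 2\alpha\, \tilde{d}(S_2^u, S_2 - S_2^u)$

• We can rewrite it as: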


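$d(S_1,S_2)/(|S_1||S_2|) \le d(u,S_1)/|S_1| + d(u,S_2^u)/((1-2\alpha)|S_2|) + 2\alpha\, d(S_2^u, S_2 - S_2^u)/((1-2\alpha) \cdot 2\alpha |S_2| \cdot |S_2|)$

$d(S_1,S_2) \le |S_2|\, d(u,S_1) + |S_1|\, d(u,S_2^u)/(1-2\alpha) + |S_1|\, d(S_2^u, S_2 - S_2^u)/((1-2\alpha)|S_2|)$

$\le |S_2|\, d(u,S_1) + |S_1|\, d(u,S_2^u)/(1-2\alpha) + \rho\, d(S_2)/(1-2\alpha)$

• Therefore

$c\,(d(S_1)+d(S_2)) \le d(S_1,S_2) \le |S_2|\, d(u,S_1) + |S_1|\, d(u,S_2^u)/(1-2\alpha) + \rho\, d(S_2)/(1-2\alpha)$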


• Alternatively

$(c - \rho/(1-2\alpha))\,(d(S_1)+d(S_2))$

$\le c\,(d(S_1)+d(S_2)) - \rho\, d(S_2)/(1-2\alpha)$

$\le |S_2|\, d(u,S_1) + |S_1|\, d(u,S_2^u)/(1-2\alpha)$

$\le (|S_2| + |S_1|/(1-2\alpha))\, d(u,S_1)$ (∵ $d(u,S_2^u) \le d(u,S_1)$)

• The upper bound for the number of u satisfying the above inequality can be obtained as follows. Assume that this number is equal to γ|S1|.


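• $\sum_{u \in U} d(u,S_1) \le d(S_1)$. By plugging in the lower bound for d(u,S1) we get

$\sum_{u \in U} (c - \rho/(1-2\alpha))\,(d(S_1)+d(S_2)) / (|S_2| + |S_1|/(1-2\alpha)) \le d(S_1)$

$\gamma |S_1|\, (c - \rho/(1-2\alpha))\,(d(S_1)+d(S_2)) / (|S_2| + |S_1|/(1-2\alpha)) \le d(S_1)$

$\gamma\, (c - \rho/(1-2\alpha))\,(d(S_1)+d(S_2)) / (1/\rho + 1/(1-2\alpha)) \le d(S_1)$

$\gamma\, (c - \rho/(1-2\alpha)) / (1/\rho + 1/(1-2\alpha)) \le 1$

Therefore $\gamma \le (1/\rho + 1/(1-2\alpha)) / (c - \rho/(1-2\alpha))$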


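• Denote the set of u's as above by U. We can bound the total cost difference by

$\sum_{u \in U} d(u, S_2 - S_2^u) \le \sum_{u \in U} 2\alpha n\, \tilde{d}(u, S_2 - S_2^u)$

$\le \sum_{u \in U} 2\alpha n\, \tilde{d}(S_2^u, S_2 - S_2^u) / (1 - 2\alpha/(\varepsilon(1-2\alpha)))$ (∵ (3))

$= \gamma |S_1| \cdot 2\alpha n\, \tilde{d}(S_2^u, S_2 - S_2^u) / (1 - 2\alpha/(\varepsilon(1-2\alpha)))$

$= \gamma |S_1| \cdot 2\alpha n\, d(S_2^u, S_2 - S_2^u) / [(1 - 2\alpha/(\varepsilon(1-2\alpha))) \cdot 2\alpha |S_2| \cdot (1-2\alpha)|S_2|]$

$\le \gamma \rho\, d(S_2) / [(1 - 2\alpha/(\varepsilon(1-2\alpha)))(1-2\alpha)]$

$= A\, d(S_2)$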

• The factor A becomes smaller than ε when we set $c = \Omega(\rho^2/\varepsilon)$ and $\alpha = O(\varepsilon)$, for sufficient constants.
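To get a feel for how the parameters interact, the bound on γ and the factor A can be evaluated numerically. The sketch below simply transcribes the two expressions from the slides; the function names and the sample parameter values are assumptions chosen for illustration, not values from the paper.

```python
def gamma_bound(c, rho, alpha):
    """Upper bound on gamma: (1/rho + 1/(1-2a)) / (c - rho/(1-2a))."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must lie in [0, 1/2)")
    denominator = c - rho / (1 - 2 * alpha)
    if denominator <= 0:
        raise ValueError("c must exceed rho / (1 - 2*alpha)")
    return (1 / rho + 1 / (1 - 2 * alpha)) / denominator

def factor_A(c, rho, alpha, eps):
    """A = gamma * rho / [(1 - 2a/(eps(1-2a))) * (1-2a)], with gamma at its bound."""
    shrink = 1 - 2 * alpha / (eps * (1 - 2 * alpha))
    if shrink <= 0:
        raise ValueError("alpha is too large relative to eps")
    return gamma_bound(c, rho, alpha) * rho / (shrink * (1 - 2 * alpha))

# Illustrative values only (assumptions): eps = 0.1, rho = 2, alpha = 0.01.
eps, rho, alpha = 0.1, 2.0, 0.01
A_small_c = factor_A(400.0, rho, alpha, eps)
A_large_c = factor_A(800.0, rho, alpha, eps)
```

With these values A is already well below ε, and it keeps shrinking as c grows, in line with the choice of c as a large multiple of 1/ε.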

The analysis of UC

Lemma 2

Notes
• For every u ∈ S1 included by mistake in R2 there is a v ∈ S2 included in R1.
• Let U denote the set of mistaken u's and let V denote the set of mistaken v's.
• We will bound the differences d(V,S1) − d(U,S1) and d(U,S2) − d(V,S2).
• It is sufficient to bound [displayed inequalities not recoverable from the slide].

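Fact 1: [statement not recoverable from the slide]

To bound the right-hand side [...], therefore [...]. We also use the fact [...], therefore [...], thus [...].

In this way we bounded the first component: by setting [parameter choice not recoverable] we make the value 1/F − 1 smaller than ε.

To bound the second part: [inequality relating $\tilde{d}_M(U,V)$, $\tilde{d}(V,S_2)$ and $\tilde{d}(U,S_2)$, not recoverable].

Observe that by setting [parameter choice not recoverable] we make B smaller than ε.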

Sublinear time algorithm

We improve the running time of the above algorithm to [bound not recoverable from the slide].


• The running time of UC is bounded by the time needed for sampling t points from the large cluster. We improve UC by using random sampling; we can perform the sampling in time roughly $O(t(1+1/\rho))$.

• We improve the MAXCUT running time of [2] by using the techniques of [1].


• The main time bottleneck is the time needed for exhaustive partitioning of the set T in BC.

• We divide T into C1 and C2 and choose T1 and T2 from C1 and C2; the lemma below shows that they are good enough for our algorithm.


• Lemma 3:

• Let (S1, S2) and (S1', S2') be two partitions of the metric space over S. There exists a constant B such that for any A and any β < 1/B, if

• d(S1)+d(S2) ≤ βA/B d(S1,S2) and

• d(S1')+d(S2') ≤ A/B (d(S1)+d(S2)),

• then (S1, S2) and (S1', S2') differ on at most βn points.


• Select a sample R of r points. It can be split into R1 and R2 such that

• d(R1,R2)/(d(R1)+d(R2)) and
• d(S1,S2)/(d(S1)+d(S2))

are comparable.

Find T1' ⊆ R1 and T2' ⊆ R2 such that |T1'| = |T2'| = t and |T1' − S1| = |T2' − S2| ≤ βt.

It turns out that T1' and T2' are almost as good as the T1 and T2 obtained by exhaustive search.


• For any two equal-size sets A' and A, if max(|A' − A|, |A − A'|) ≤ β|A| then for any α and u,

• $d_{1-\alpha-\beta}(u,A) \le d_{1-\alpha}(u,A') \le d_{1-\alpha+\beta}(u,A)$

• So in Lemma 1 we can replace

• $n \cdot d_{1-\alpha}(u,T_2) \le m \cdot d_{1-\alpha}(u,T_1)$ by

• $n \cdot d_{1-3\alpha/2}(u,T_2) \le m \cdot d_{1-\alpha/2}(u,T_1)$
