A Sublinear Time Approximation Scheme for Clustering in Metric Spaces
Author: Piotr Indyk
IEEE FOCS 1999
Group members
• R90922058 李秉憲
• R90725031 陳柏安
• R90725045 張緒遠
• R90725052 鄭安巽
Outline
• Introduction
• Preliminaries
• The Algorithms
• The analysis of BC
• The analysis of UC
• Sublinear time algorithm
Introduction
• k-clustering problem:
– Input: a weighted graph G = (X,d) on N vertices.
– Output: a partition of X into k sets S1, …, Sk such that the value of
  $\sum_{i=1}^{k} \sum_{\{u,v\} \subseteq S_i} d(u,v)$
  is minimized.
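The objective above can be evaluated directly on a small instance. The following is a minimal sketch, assuming the distance function is passed in as a Python callable `d`; the name `clustering_cost` and the example points are illustrative and not from the paper.

```python
from itertools import combinations

def clustering_cost(clusters, d):
    """Sum of d(u, v) over all unordered pairs {u, v} inside each cluster."""
    return sum(d(u, v) for S in clusters for u, v in combinations(S, 2))

# Illustrative metric: points on a line with d(u, v) = |u - v|.
line_metric = lambda u, v: abs(u - v)

# Cluster {0, 1, 2} contributes 1 + 2 + 1, cluster {10, 11} contributes 1.
example_cost = clustering_cost([[0, 1, 2], [10, 11]], line_metric)
```

Each unordered pair is counted once, which matches the sum over {u,v} ⊆ S_i.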
Introduction
• This problem is NP-complete (for k >= 2) and cannot be approximated to within any constant factor.
• A standard way to reduce the complexity of clustering problems is to assume that the weight function d is a metric.
Introduction
• Facts:
– Guttman-Beck showed a 2-approximation algorithm.
– de la Vega and Kenyon gave a PTAS for metric MAX-CUT.
– Unfortunately, it does not imply a PTAS for the 2-clustering problem.
Introduction
• The results of this paper focus on the case where the de la Vega-Kenyon PTAS does not work.
• If done correctly, this procedure yields a (1+ε)-approximate solution.
Preliminaries
• Let (X,d) be a metric space. For any two sets A, B ⊆ X, we define
  $d(A,B) = \sum_{a \in A,\ b \in B} d(a,b)$
• We use d(A) to denote d(A,A)/2.
• We also write d(u,B) or d(B,u) for d({u},B).
Preliminaries
• We define
  $\tilde{d}(A,B) = d(A,B) / (|A| \cdot |B|)$
• Observe that $\tilde{d}$ is a metric.
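The set-to-set quantities defined so far translate almost literally into code. This is a minimal sketch, assuming sets are given as Python lists and the metric as a callable `d`; the function names are hypothetical.

```python
def d_sets(A, B, d):
    """d(A, B): sum of d(a, b) over all a in A and b in B."""
    return sum(d(a, b) for a in A for b in B)

def d_within(A, d):
    """d(A) = d(A, A) / 2: every unordered pair inside A counted once."""
    return d_sets(A, A, d) / 2

def d_avg(A, B, d):
    """Normalized distance d(A, B) / (|A| * |B|)."""
    return d_sets(A, B, d) / (len(A) * len(B))

# Illustrative metric: points on a line.
line_metric = lambda u, v: abs(u - v)
```

For A = [0, 1] and B = [3, 5] on the line, d(A,B) = 3 + 5 + 2 + 4 = 14 and the normalized value is 14/4 = 3.5.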
Preliminaries
• For sets A, B of equal cardinality, we define
  $d_M(A,B) = \min_{\pi : A \to B,\ \pi\ \text{1:1}} \sum_{a \in A} d(a, \pi(a))$
  and
  $\tilde{d}_M(A,B) = d_M(A,B) / |A|$
• Notice that both $d_M$ and $\tilde{d}_M$ satisfy the metric properties.
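For very small sets the matching distance can be computed by trying every one-to-one map. The sketch below is a brute-force illustration only (it enumerates all |A|! bijections and is not how one would compute a minimum-cost matching at scale); the names `d_matching` and `d_matching_avg` are hypothetical.

```python
from itertools import permutations

def d_matching(A, B, d):
    """d_M(A, B): minimum over bijections pi: A -> B of sum d(a, pi(a))."""
    if len(A) != len(B):
        raise ValueError("d_M is defined only for sets of equal cardinality")
    return min(
        sum(d(a, b) for a, b in zip(A, image))
        for image in permutations(B)  # each permutation is one bijection
    )

def d_matching_avg(A, B, d):
    """Normalized matching distance d_M(A, B) / |A|."""
    return d_matching(A, B, d) / len(A)

line_metric = lambda u, v: abs(u - v)
```

With A = [0, 10] and B = [9, 1], matching 0 to 1 and 10 to 9 costs 2, while the other bijection costs 18.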
Preliminaries
• For any α ∈ [0,1], u ∈ X and A ⊆ X, we define $d^{\alpha}(u,A)$ to be the sum of the α|A| smallest values of {d(u,a) | a ∈ A}.
• We also define
  $\tilde{d}^{\alpha}(u,A) = d^{\alpha}(u,A) / (\alpha |A|)$
• The algorithms given in this paper are randomized and produce a (1+ε)-approximate solution with high probability.
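The trimmed sum is the quantity every later test is built on, so a small sketch may help. It assumes that α|A| is rounded down when it is not an integer (the slides do not say how to round) and guards against floating-point noise; the names `d_alpha` and `d_alpha_avg` are hypothetical.

```python
import math

def d_alpha(u, A, alpha, d):
    """Sum of the floor(alpha * |A|) smallest distances from u to points of A.

    Assumption: non-integer alpha * |A| is rounded down.
    """
    k = math.floor(round(alpha * len(A), 9))  # round() absorbs float noise
    return sum(sorted(d(u, a) for a in A)[:k])

def d_alpha_avg(u, A, alpha, d):
    """Normalized version: d_alpha(u, A) / (alpha * |A|)."""
    return d_alpha(u, A, alpha, d) / (alpha * len(A))

line_metric = lambda u, v: abs(u - v)
```

For u = 0 and A = [1, 2, 3, 100], taking α = 0.75 keeps the three smallest distances (1 + 2 + 3 = 6) and drops the outlier, which is the point of trimming.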
The Algorithms
• The result is obtained by running three algorithms in parallel: MAXCUT, BC and UC.
• We assume that |S1| = m and |S2| = n are given to us; otherwise we try all N possible combinations.
The Algorithms
• MAXCUT:
– This is the algorithm of another paper for (1-ε)-approximate MAX-CUT.
– The algorithm is useful if the cut/clustering ratio is smaller than a constant value c.
– Thus we need another algorithm for the case when the cut/clustering ratio exceeds c.
The Algorithms
• BC:
– Balance ratio: $b(S_1,S_2) = \max(|S_1|/|S_2|,\ |S_2|/|S_1|)$
– If the cut/clustering ratio is greater than c and the balance ratio is smaller than ρ, we run BC.
The Algorithms
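• BC:
– Uniformly chooses a set T of t = O(ρ log n) points.
– Guesses $T_1 \subseteq S_1 \cap T$ and $T_2 \subseteq S_2 \cap T$ such that |T1| = |T2| = λ = O(log n).
– For each point u ∈ X-T1-T2, it checks whether
  $|S_1| \cdot d^{1-\alpha}(u,T_1) \le |S_2| \cdot d^{1-\alpha}(u,T_2)$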
The Algorithms
• BC (contd.):
– If the above inequality holds, then u is added to R1; otherwise we add it to R2.
– The pair $(R_1 \cup T_1,\ R_2 \cup T_2)$ is returned as a solution.
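The assignment step of BC can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the guessed samples T1 and T2 are taken as given (the exhaustive guessing is omitted), the test is read as m·d^{1-α}(u,T1) <= n·d^{1-α}(u,T2) with m = |S1| and n = |S2|, non-integer trimmed sizes are rounded down, and all function names are hypothetical.

```python
import math

def d_alpha(u, A, alpha, d):
    """Sum of the floor(alpha * |A|) smallest distances from u to A."""
    k = math.floor(round(alpha * len(A), 9))
    return sum(sorted(d(u, a) for a in A)[:k])

def bc_assign(X, T1, T2, m, n, alpha, d):
    """Assign every u outside T1 and T2 by the trimmed-distance comparison."""
    R1, R2 = [], []
    for u in X:
        if u in T1 or u in T2:
            continue  # sample points keep their guessed side
        if m * d_alpha(u, T1, 1 - alpha, d) <= n * d_alpha(u, T2, 1 - alpha, d):
            R1.append(u)
        else:
            R2.append(u)
    # The returned pair is (R1 ∪ T1, R2 ∪ T2).
    return R1 + list(T1), R2 + list(T2)

line_metric = lambda u, v: abs(u - v)
X = [0, 1, 2, 3, 100, 101, 102, 103]
left, right = bc_assign(X, T1=[0, 1], T2=[100, 101], m=4, n=4,
                        alpha=0.0, d=line_metric)
```

On this toy input the two well-separated groups on the line are recovered from two sample points each.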
The Algorithms
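[Figure: X is split into S1 and S2; the sample T contains T1 ⊆ S1 and T2 ⊆ S2; each u ∈ X-T1-T2 is tested with $|S_1| \cdot d^{1-\alpha}(u,T_1) \le |S_2| \cdot d^{1-\alpha}(u,T_2)$.]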
The Algorithms
• UC:
– Use this algorithm when b(S1,S2) > ρ; assume |S1| > ρ|S2|.
– Obtain a set T of λ random points from S1.
– Sort all points u ∈ X-T by $d^{1-\alpha}(u,T)$ (in ascending order).
– The first |S1|-λ points from the list are added to R1; the remaining points are added to R2.
– Output (R1,R2).
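The UC steps can be sketched in a few lines. This is a minimal illustration with hypothetical names; it assumes the sample T is already known to lie in S1, rounds non-integer trimmed sizes down, and (as an assumption not spelled out on the slide) places the sample points themselves in R1.

```python
import math

def d_alpha(u, A, alpha, d):
    """Sum of the floor(alpha * |A|) smallest distances from u to A."""
    k = math.floor(round(alpha * len(A), 9))
    return sum(sorted(d(u, a) for a in A)[:k])

def uc_assign(X, T, m, alpha, d):
    """Sort X - T by trimmed distance to T; the closest m - |T| go to R1."""
    rest = [u for u in X if u not in T]
    rest.sort(key=lambda u: d_alpha(u, T, 1 - alpha, d))
    k = m - len(T)
    R1 = list(T) + rest[:k]  # assumption: the sample T itself belongs to R1
    R2 = rest[k:]
    return R1, R2

line_metric = lambda u, v: abs(u - v)
X = [0, 1, 2, 3, 4, 5, 100]
big, small = uc_assign(X, T=[0, 1], m=6, alpha=0.0, d=line_metric)
```

Here the large cluster has m = 6 points and the single far point ends up alone in R2.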
The Algorithms
[Figure: X is split into S1 and S2, with the sample T inside S1; every u ∈ X-T is sorted by $d^{1-\alpha}(u,T)$.]
The Algorithms
[Flowchart: if cut/clustering <= c, MAXCUT gives the solution. If cut/clustering > c, test b(S1,S2) < ρ: Yes, BC gives the solution; No, UC gives the solution.]
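The case distinction in the flowchart can be written as a small decision function. This sketch only shows which branch of the analysis covers a given partition (S1,S2); the algorithm itself does not know that partition and runs all three procedures. It assumes the cut/clustering ratio means d(S1,S2)/(d(S1)+d(S2)) and that this denominator is nonzero; all names are hypothetical.

```python
def d_sets(A, B, d):
    """d(A, B): sum of d(a, b) over all a in A and b in B."""
    return sum(d(a, b) for a in A for b in B)

def balance_ratio(S1, S2):
    """b(S1, S2) = max(|S1|/|S2|, |S2|/|S1|)."""
    return max(len(S1) / len(S2), len(S2) / len(S1))

def cut_clustering_ratio(S1, S2, d):
    """d(S1, S2) / (d(S1) + d(S2)); assumes the denominator is nonzero."""
    clustering = d_sets(S1, S1, d) / 2 + d_sets(S2, S2, d) / 2
    return d_sets(S1, S2, d) / clustering

def covering_algorithm(S1, S2, d, c, rho):
    """Name of the procedure whose analysis covers this partition."""
    if cut_clustering_ratio(S1, S2, d) <= c:
        return "MAXCUT"
    return "BC" if balance_ratio(S1, S2) < rho else "UC"

line_metric = lambda u, v: abs(u - v)
```

For example, two balanced far-apart pairs fall into the BC case, while one large cluster plus a single distant point falls into the UC case.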
The Algorithms
• Q1: Is the running time polynomial? – Yes (BC, UC).
• Q2: Is the solution feasible? – Yes.
The analysis of BC
• We relate the outcome of the (randomized) comparison of d(u,T1) and d(u,T2) to a certain (deterministic) property of d.
• Lemma 1: Consider any u ∈ S1. If for every set S2' ⊆ S2 such that |S2'| >= (1-2α)|S2| we have d(u,S2') >= d(u,S1), then with high probability we have
  $n \cdot d^{1-\alpha}(u,T_2) > m \cdot d^{1-\alpha}(u,T_1)$
The analysis of BC(cont’d)
• Proof:
– Without loss of generality we can consider the S2' which contains the smallest (1-2α)n elements of S2.
– Moreover, we can assume d(u,S2') = d(u,S1) = 1.
– Finally, we will assume that the largest 2αn elements of S2 are all equal.
– For λ large enough we have a significant gap between the expected values of $d^{1-\alpha}(u,T_1)$ and $d^{1-\alpha}(u,T_2)$:
  $E[d^{1-\alpha}(u,T_1)] / ((1-\alpha)|T_1|) \le E[d^{1-0}(u,S_1)] / ((1-0)|S_1|)$
  $E[d^{1-\alpha}(u,T_2)] / ((1-\alpha)|T_2|) \ge E[d^{1-0}(u,S_2')] / ((1-0)|S_2'|)$
The analysis of BC(cont’d)
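  $(m/\lambda)\, E[d^{1-\alpha}(u,T_1)] \le 1-\alpha \le 1$
  $(n/\lambda)\, E[d^{1-\alpha}(u,T_2)] \ge (1-\alpha)/(1-2\alpha) = 1 + \alpha/(1-2\alpha) \ge 1 + \alpha/2$
– Hence $(m/\lambda)\, E[d^{1-\alpha}(u,T_1)] \le 1$ and $(n/\lambda)\, E[d^{1-\alpha}(u,T_2)] \ge 1 + \alpha/2$.
– We want to convert these expectation bounds into bounds holding with high probability.
– Applying standard tail inequalities, we obtain that $n \cdot d^{1-\alpha}(u,T_2) \ge m \cdot d^{1-\alpha}(u,T_1)$ with high probability if $\lambda = \Omega(\log n / \alpha^4)$.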
The analysis of BC(cont’d)
• We will upper bound the additional cost incurred by assigning u ∈ S1 to R2; the opposite case can be handled in the same way.
• W(C) <= (1+ε)·OPT is equivalent to W(C) - OPT <= ε·OPT.
• From the above Lemma, we can assume that for every u ∈ S1 which has been included in R2 (i.e. such that $n \cdot d^{1-\alpha}(u,T_2) \le m \cdot d^{1-\alpha}(u,T_1)$) there exists a set $S_2^u \subseteq S_2$ of cardinality (1-2α)|S2| such that
  $d(u,S_2^u) < d(u,S_1)$   (1)
• Thus we need only to bound $d(u, S_2 - S_2^u)$.
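The analysis of BC(cont’d)
• Let U ⊆ S1 be the points moved to the second cluster and V ⊆ S2 the points moved to the first, so that S1' = (S1-U) ∪ V and S2' = (S2-V) ∪ U. We compare the new cost d(S1')+d(S2') with the old cost d(S1)+d(S2).
• New cost: $d(S_1-U) + d(S_1-U,V) + d(V) + d(S_2-V) + d(S_2-V,U) + d(U)$
• Old cost: $d(S_1-U) + d(S_1-U,U) + d(U) + d(S_2-V) + d(S_2-V,V) + d(V)$
• After cancelling the common terms, it remains to compare $d(S_1-U,V) + d(S_2-V,U)$ with $d(S_1-U,U) + d(S_2-V,V)$,
• that is, $d(S_1,V) - d(U,V) + d(S_2,U) - d(V,U)$ with $d(S_1,U) - d(U) + d(S_2,V) - d(V)$.
• For a single u ∈ U, this compares $d(u,S_2^u) + d(u,S_2-S_2^u) - d(u,V)$ with $d(u,S_1) - d(u,U)$.
• ∵ $d(u,S_2^u) < d(u,S_1)$
• ∴ $d(u,S_2-S_2^u) - d(u,V) - (d(u,S_1) - d(u,S_2^u)) + d(u,U) \le d(u,S_2-S_2^u)$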
The analysis of BC(cont’d)
• We will be only interested in u's such that
  $d(u,S_2) \ge (1+\varepsilon)\, d(u,S_1)$   (2)
  as otherwise the difference in the cost can be easily bounded (i.e. $d(u,S_2) - d(u,S_1) < \varepsilon \cdot d(u,S_1)$).
• From (1) and (2) we obtain that $d(u,S_2) - d(u,S_2^u) \ge \varepsilon \cdot d(u,S_1) \ge \varepsilon \cdot d(u,S_2^u)$, which can be rewritten as
  $\tilde{d}(u,S_2-S_2^u) \ge \varepsilon\, \frac{1-2\alpha}{2\alpha}\, \tilde{d}(u,S_2^u)$
The analysis of BC(cont’d)
• By the triangle inequality we have
  $\tilde{d}(S_2^u, S_2-S_2^u) \ge \tilde{d}(u,S_2-S_2^u) - \tilde{d}(u,S_2^u) \ge \left(1 - \frac{2\alpha}{\varepsilon(1-2\alpha)}\right) \tilde{d}(u,S_2-S_2^u)$   (3)
• The above gives a bound for $\tilde{d}(S_2^u, S_2-S_2^u)$. In the following we show that the number of such u's is also not very large if (as we assume) d(S1,S2) >= c(d(S1)+d(S2)).
• First, observe that $\tilde{d}(S_1,S_2) \le \tilde{d}(u,S_1) + \tilde{d}(u,S_2)$.
The analysis of BC(cont’d)
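• ∵ $\tilde{d}(u,S_2) = d(u,S_2)/n = d(u,S_2^u)/n + d(u,S_2-S_2^u)/n = (1-2\alpha)\,\tilde{d}(u,S_2^u) + 2\alpha\,\tilde{d}(u,S_2-S_2^u)$
  $\tilde{d}(S_1,S_2) \le \tilde{d}(u,S_1) + (1-2\alpha)\,\tilde{d}(u,S_2^u) + 2\alpha\,\tilde{d}(u,S_2-S_2^u)$
  $\le \tilde{d}(u,S_1) + (1-2\alpha)\,\tilde{d}(u,S_2^u) + 2\alpha\,(\tilde{d}(u,S_2^u) + \tilde{d}(S_2^u,S_2-S_2^u))$
  $= \tilde{d}(u,S_1) + \tilde{d}(u,S_2^u) + 2\alpha\,\tilde{d}(S_2^u,S_2-S_2^u)$
• We can rewrite it as: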
The analysis of BC(cont’d)
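  $\frac{d(S_1,S_2)}{|S_1||S_2|} \le \frac{d(u,S_1)}{|S_1|} + \frac{d(u,S_2^u)}{(1-2\alpha)|S_2|} + \frac{2\alpha\, d(S_2^u,S_2-S_2^u)}{(1-2\alpha)\,2\alpha\,|S_2||S_2|}$
  $d(S_1,S_2) \le |S_2|\, d(u,S_1) + \frac{|S_1|\, d(u,S_2^u)}{1-2\alpha} + \frac{|S_1|\, d(S_2^u,S_2-S_2^u)}{(1-2\alpha)|S_2|}$
  $\le |S_2|\, d(u,S_1) + \frac{|S_1|\, d(u,S_2^u)}{1-2\alpha} + \frac{\rho\, d(S_2)}{1-2\alpha}$
• Therefore
  $c\,(d(S_1)+d(S_2)) \le d(S_1,S_2) \le |S_2|\, d(u,S_1) + \frac{|S_1|\, d(u,S_2^u)}{1-2\alpha} + \frac{\rho\, d(S_2)}{1-2\alpha}$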
The analysis of BC(cont’d)
• Alternatively
  $\left(c - \frac{\rho}{1-2\alpha}\right)(d(S_1)+d(S_2))$
  $\le c\,(d(S_1)+d(S_2)) - \frac{\rho\, d(S_2)}{1-2\alpha}$
  $\le |S_2|\, d(u,S_1) + \frac{|S_1|\, d(u,S_2^u)}{1-2\alpha}$
  $\le \left(|S_2| + \frac{|S_1|}{1-2\alpha}\right) d(u,S_1)$   (∵ $d(u,S_2^u) \le d(u,S_1)$)
• The upper bound for the number of u satisfying the above inequality can be obtained as follows. Assume that this number is equal to γ|S1|.
The analysis of BC(cont’d)
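• $\sum_{u \in U} d(u,S_1) \le d(S_1)$. By plugging in the lower bound for d(u,S1) we get
  $\sum_{u \in U} \frac{(c-\rho/(1-2\alpha))(d(S_1)+d(S_2))}{|S_2|+|S_1|/(1-2\alpha)} \le d(S_1)$
  $\gamma |S_1|\, \frac{(c-\rho/(1-2\alpha))(d(S_1)+d(S_2))}{|S_2|+|S_1|/(1-2\alpha)} \le d(S_1)$
  Dividing numerator and denominator by |S1| and using |S2|/|S1| <= ρ:
  $\gamma\, \frac{(c-\rho/(1-2\alpha))(d(S_1)+d(S_2))}{\rho+1/(1-2\alpha)} \le d(S_1)$
  $\gamma\, \frac{c-\rho/(1-2\alpha)}{\rho+1/(1-2\alpha)} \le 1$
  Therefore $\gamma \le \frac{\rho+1/(1-2\alpha)}{c-\rho/(1-2\alpha)}$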
The analysis of BC(cont’d)
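• Denote the set of u's as above by U. We can bound the total cost difference by
  $\sum_{u \in U} d(u,S_2-S_2^u) = \sum_{u \in U} 2\alpha n\, \tilde{d}(u,S_2-S_2^u)$
  $\le \sum_{u \in U} \frac{2\alpha n\, \tilde{d}(S_2^u,S_2-S_2^u)}{1 - 2\alpha/(\varepsilon(1-2\alpha))}$   (∵ (3))
  $= \gamma |S_1| \cdot \frac{2\alpha n\, \tilde{d}(S_2^u,S_2-S_2^u)}{1 - 2\alpha/(\varepsilon(1-2\alpha))}$
  $= \gamma |S_1| \cdot \frac{2\alpha n\, d(S_2^u,S_2-S_2^u)}{(1 - 2\alpha/(\varepsilon(1-2\alpha))) \cdot 2\alpha |S_2| \cdot (1-2\alpha)|S_2|}$
  $\le \frac{\gamma \rho\, d(S_2)}{(1 - 2\alpha/(\varepsilon(1-2\alpha)))(1-2\alpha)}$
  $= A \cdot d(S_2)$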
The analysis of BC(cont’d)
• The factor A becomes smaller than ε when we set $c = \Omega(\rho^2/\varepsilon)$ and $\alpha = O(\varepsilon)$, for suitable constants.
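A quick numeric check shows how the parameter choice drives the factor A below ε. The sketch assumes A = γρ / [(1 - 2α/(ε(1-2α)))(1-2α)] with the bound γ <= (ρ + 1/(1-2α)) / (c - ρ/(1-2α)), which uses |S2|/|S1| <= ρ; the function names and sample values are illustrative only.

```python
def gamma_bound(c, rho, alpha):
    """Assumed upper bound on the fraction gamma of misassigned points."""
    return (rho + 1 / (1 - 2 * alpha)) / (c - rho / (1 - 2 * alpha))

def factor_A(c, rho, alpha, eps):
    """A = gamma * rho / [(1 - 2a/(eps(1-2a))) * (1 - 2a)] with a = alpha."""
    shrink = 1 - 2 * alpha / (eps * (1 - 2 * alpha))
    return gamma_bound(c, rho, alpha) * rho / (shrink * (1 - 2 * alpha))

eps, rho, alpha = 0.1, 2.0, 0.01
A_large_c = factor_A(c=200.0, rho=rho, alpha=alpha, eps=eps)  # c ~ rho^2/eps scale
A_small_c = factor_A(c=20.0, rho=rho, alpha=alpha, eps=eps)   # c too small
```

With these values the larger c gives A below ε, while the smaller c does not, which is the behaviour the asymptotic choice of c is meant to guarantee.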
The analysis of UC
Lemma 2
Notes
• For every u ∈ S1 included by mistake in R2 there is a v ∈ S2 included in R1.
• Let U denote the set of mistaken u's and let V denote the set of mistaken v's.
• We will bound the differences d(V,S1)-d(U,S1) and d(U,S2)-d(V,S2).
• It is sufficient to bound [formulas not recoverable; they involve d(V,·) and d(U,·) over S1 and the sets S_u, S_v].
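Fact 1 [formula not recoverable]
• The right-hand side is bounded first, and a second fact is then used [formulas not recoverable].
• In this way we bound the first component: by setting [parameter choice not recoverable] we make the value 1/F-1 smaller than ε.
• To bound the second part, the argument uses $\tilde{d}_M(U,V)$, $\tilde{d}(V,S_2)$ and $\tilde{d}(U,S_2)$ [relation not recoverable].
• Observe that by setting [parameter choice not recoverable] we make B smaller than ε.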
Sublinear time algorithm
We improve the running time of the above algorithm to [bound not recoverable; the residue shows an O(·) expression in log n with an O(1) exponent].
Sublinear time algorithm
• The running time of UC is bounded by the time needed to sample t points from the large cluster. Using random sampling, we can perform this in time roughly $O(t(1+1/\rho))$.
• We improve the running time of MAXCUT from [2] by using the techniques of [1].
Sublinear time algorithm
• The main time bottleneck is the exhaustive partitioning of the set T in BC.
• We divide T into C1 and C2 and choose T1 and T2 from C1 and C2; the lemma below shows that they are good enough for our algorithm.
Sublinear time algorithm
• Lemma 3:
• Let (S1,S2) and (S1',S2') be two partitions of the metric space over S. There exists a constant B such that for any A and any β < 1/B, if
• d(S1)+d(S2) <= βA/Bd(S1, S2) and
• d(S1’)+d(S2’) <= A/B(d(S1)+d(S2)),
• then (S1,S2) and (S1',S2') differ on at most βn points.
Sublinear time algorithm
• Select a sample R of r points. It can be split into R1 and R2 such that
• d(R1,R2)/(d(R1)+d(R2)) and
• d(S1,S2)/(d(S1)+d(S2))
are comparable.
Find T1' ⊆ R1 and T2' ⊆ R2 such that |T1'| = |T2'| = t and |T1'-S1| = |T2'-S2| <= βt.
It turns out that T1' and T2' are almost as good as the T1 and T2 obtained by exhaustive search.
Sublinear time algorithm
• For any two equal-size sets A' and A, if max(|A'-A|, |A-A'|) <= β|A|, then for any α and u,
• $d^{1-\alpha-\beta}(u,A) \le d^{1-\alpha}(u,A') \le d^{1-\alpha+\beta}(u,A)$
• So in Lemma 1 we can replace the test
• $n\, d^{1-\alpha}(u,T_2) \le m\, d^{1-\alpha}(u,T_1)$ by
• $n\, d^{1-3\alpha/2}(u,T_2) \le m\, d^{1-\alpha/2}(u,T_1)$
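The sandwich inequality between the trimmed sums of A and of a slightly perturbed set A' can be checked on a toy example. The sketch assumes that "differ" means set difference (A' is obtained from A by replacing at most β|A| points) and that non-integer trimmed sizes are rounded down; the function name and data are illustrative.

```python
import math

def d_alpha(u, A, alpha, d):
    """Sum of the floor(alpha * |A|) smallest distances from u to A."""
    k = math.floor(round(alpha * len(A), 9))
    return sum(sorted(d(u, a) for a in A)[:k])

line_metric = lambda u, v: abs(u - v)

u = 0
A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
A_prime = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # one point of A replaced
alpha, beta = 0.2, 0.1                       # |A' - A| = 1 <= beta * |A|

lower = d_alpha(u, A, 1 - alpha - beta, line_metric)   # 7 smallest of A
middle = d_alpha(u, A_prime, 1 - alpha, line_metric)   # 8 smallest of A'
upper = d_alpha(u, A, 1 - alpha + beta, line_metric)   # 9 smallest of A
```

Trimming absorbs the replaced point: the 8 smallest distances in A' are unaffected by the outlier at 100, so the middle value stays between the two trimmed sums of A.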