TRANSCRIPT
Efficient and Private Distance Approximation
David Woodruff, MIT
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
The Communication Model
Alice holds x ∈ Σ^n; Bob holds y ∈ Σ^n.
What is the distance D(x,y) between x and y? For example, if Σ = {0,1}, what is the Hamming distance? If Σ = R, what is the Lp distance for some p ∈ (0, ∞)?
The Lp distance is (Σ_{i=1}^n |x_i - y_i|^p)^{1/p}.
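For concreteness, here is a small Python sketch of the two distances just defined (the function names are mine, not part of the talk):

```python
import numpy as np

def hamming_distance(x, y):
    # Number of coordinates where the two vectors differ.
    return int(np.sum(x != y))

def lp_distance(x, y, p):
    # (sum_i |x_i - y_i|^p)^(1/p) for real-valued vectors.
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([0, 1, 1, 0]); y = np.array([1, 1, 0, 0])
print(hamming_distance(x, y))                             # 2
print(lp_distance(x.astype(float), y.astype(float), 2))   # sqrt(2)
```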
Application – Streaming Model
Stream: 7 1 1 3 7 3 4 …  We want to mine a massive data stream.
How many distinct elements? What's the most frequent item? Is the data uniform or skewed?
Elements arrive in an adversarial order; algorithms are allowed only one pass. Goal: low-space algorithms.
Application – Streaming Model
The streaming model and two-party communication are tightly linked: protocols yield algorithms, and communication lower bounds (always) yield space lower bounds.
Distance approximation captures streaming primitives: distinct elements (Hamming), frequent items (L2), skew (Lp).
Two-party Communication
In this talk, most protocols yield streaming algorithms. Thus, communication equals space: communication lower bounds give space lower bounds, and (often) communication upper bounds give space upper bounds.
Application – IP Session Data

Source     Destination   Bytes   Duration   Protocol
18.6.7.1   19.7.3.2      40K     28         http
10.6.2.3   12.3.4.8      20K     18         ftp
11.1.0.6   11.6.8.2      58K     22         http
12.3.1.5   14.7.0.1      30K     32         http
…          …             …       …          …
AT&T collects 100+ GB of NetFlow data every day.
Application – IP Session Data
AT&T needs to process a massive stream of network data.
Traffic estimation: What fraction of network IP addresses are active? (A distinct-elements computation.)
Traffic analysis: What are the 100 IP addresses with the most traffic? (A frequent-items computation.)
Security/denial of service: Are there any IP addresses witnessing a spike in traffic? (A skewness computation.)
Application – Secure Datamining
For medical research, hospitals wish to mine their joint data
Distance approximation is useful in many mining algorithms, e.g., classification and clustering
Patient confidentiality is protected by strict laws on what information can be shared; mining must not leak anything sensitive.
Issues
Exact vs. Approximate Solution
Efficiency: communication complexity and round complexity.
Security: neither party learns more than what the solution and their own input imply about the other party's input.
Initial Observations
To cope with the Ω(n) communication bound for exact computation, we look for randomized approximation algorithms.

               Exact             Approximate
Deterministic  Ω(n) (folklore)   Ω(n) (folklore)
Randomized     Ω(n) [KS, R]      ?
Previous Results
Output D' such that for all x, y: Pr[D(x,y) ≤ D'(x,y) ≤ (1+ε)·D(x,y)] ≥ 2/3.

                  Private communication complexity       Communication complexity
                  Upper bound          Lower bound       Upper bound                        Lower bound
Lp, p > 2         n (SFE)              n^{1-2/p}         n^{1-1/(p-1)} [AMS96, CK04, G05]   n^{1-2/p} [AMS96, BJKS02, CKS03]
L2                n (SFE)              1/ε               1/ε^2 [AMS96]                      1/ε (folklore)
Hamming distance  n^{1/2} [FIMNSW01]   1/ε               1/ε^2 [FM79, BJKST02…]             1/ε (folklore)
Our Results [IW03, W04, IW05, IW06]
                  Private communication complexity        Communication complexity
                  Upper bound              Lower bound    Upper bound             Lower bound
Lp, p > 2         n (still open)           n^{1-2/p}      O(n^{1-2/p}), 1 round   n^{1-2/p}
L2                O(1/ε^2), O(1) rounds    1/ε            1/ε^2                   Ω(1/ε^2) (1-round)
Hamming distance  O(1/ε^2), O(1) rounds    1/ε            1/ε^2                   Ω(1/ε^2) (1-round)
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
Private L2 Estimation
We improve the n^{1/2} upper bound for private L2 to O(1/ε^2), and our protocol uses O(1) rounds.
This is optimal up to suppressed logarithmic factors, and the same holds for the Hamming distance.
There had been speculation that private approximation is much harder than non-private approximation; we refute this speculation.
Security Definition
What does privacy mean for distance computation? Minimal requirement: Alice does not learn anything about y other than what follows from her input x and D(x,y).
What does privacy mean for distance approximation? Does this work: Alice does not learn anything about y other than what follows from x and the approximation D'(x,y)?
Not sufficient!
Security Definition
Alice holds x ∈ Σ^n; Bob holds y ∈ Σ^n. Suppose Σ = {0,1}.
Set the LSB of D'(x,y) to be y_n, and the remaining bits of D'(x,y) to agree with those of D(x,y).
Then D'(x,y) is a ±1 approximation, but Alice learns y_n, which doesn't follow from x and D(x,y).
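A tiny simulation of this counterexample (the function name is mine): the corrupted output is still within ±1 of the true distance, yet its low bit hands Alice the value of y_n.

```python
def leaky_approximation(true_distance, y_last_bit):
    # Set the LSB of D' to y_n; keep the remaining bits of D(x,y).
    return (true_distance & ~1) | y_last_bit

for d in [10, 11]:
    for yn in [0, 1]:
        d_prime = leaky_approximation(d, yn)
        assert abs(d_prime - d) <= 1   # still a valid +/-1 approximation...
        assert d_prime & 1 == yn       # ...yet it reveals y_n exactly
```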
Security Definition
What does privacy mean for distance approximation?
New requirement: Alice and Bob don't learn anything about each other's input other than what follows from their own input and D(x,y).
Implication: D'(x,y) is determined by D(x,y) and the randomness.
How do we model the power of the cheating parties?
Security Models
Alice holds x ∈ Σ^n; Bob holds y ∈ Σ^n.
Semi-honest: parties follow their instructions but try to learn more than what is prescribed
Malicious: parties deviate from the protocol arbitrarily:
- use a different input
- force the other party to output a wrong answer
- abort before the other party learns the answer
It is difficult to achieve security in the malicious model…
Reductions – Yao, GMW, NN
A protocol secure in the semi-honest model can be compiled into a protocol secure in the malicious model, with the efficiency of the new protocol equal to the efficiency of the old protocol.
It suffices to design protocols in the semi-honest model
The parties follow the instructions of the protocol, so we don't need to worry about "weird" behavior.
We just need to ensure that neither party learns anything about the other's input except what follows from the exact distance.
Our Protocol
Running example. Alice: x = e_1; Bob: y = e_2.
A first try: randomly sample a few coordinates j, compute (x_j - y_j)^2, and scale to estimate ||x-y||_2^2.
Problem: with high probability, all samples return 0, so the estimate is 0.
A second try: randomly rotate the vectors over R^n by a matrix M, obtaining Me_1 and Me_2, and then try the sampling approach.
||Mx - My||_2^2 = ||x-y||_2^2, and now the mass is "spread out", so sampling is effective.
Problem: neither party can learn the samples, since with knowledge of M, this reveals extra information.
Solution: we build a private sub-protocol to output an estimate from the samples, without revealing the samples.
The parties need to agree on the rotation M; this can be done with low communication using a PRG.
Thus, the correctness and desired efficiency of the protocol are easy to verify.
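A minimal, non-private Python sketch of the rotate-then-sample idea (names are mine); the real protocol derives M from a shared PRG seed and uses a private sub-protocol so the sampled values are never revealed.

```python
import numpy as np

def random_rotation(n, rng):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def estimate_l2_squared(x, y, num_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    M = random_rotation(n, rng)               # shared rotation: spreads out the mass
    z = M @ (x - y)                           # ||z||_2^2 = ||x - y||_2^2
    j = rng.integers(0, n, size=num_samples)  # sample random coordinates
    return n * np.mean(z[j] ** 2)             # E[n * z_j^2] = ||z||_2^2

x = np.zeros(128); x[0] = 1.0                 # x = e_1
y = np.zeros(128); y[1] = 1.0                 # y = e_2
print(estimate_l2_squared(x, y))              # ~ 2 = ||e_1 - e_2||_2^2
```

Without the rotation, the same sampler would return 0 on e_1 and e_2 with high probability, which is exactly the failure mode described above.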
Private Sub-protocol
Problem: Alice learns (My)_j for some j (and similarly for Bob).
Solution: use an oblivious masking sampling protocol [FIMNSW], in which Alice learns (My)_j ⊕ b for a random mask b held by Bob, and Alice does not learn j.
Private Sub-protocol
Alice holds x ∈ Σ^n and computes Mx; Bob holds y ∈ Σ^n and computes My (both parties know M).
They run the oblivious masking sampling protocol:
- Alice creates a mask a; Bob creates a mask b.
- Alice gets b ⊕ (My)_j for an unknown j; Bob gets a ⊕ (Mx)_j for an unknown j.
Private Sub-protocol
Alice has mask a and gets b ⊕ (My)_j for an unknown j; Bob has mask b and gets a ⊕ (Mx)_j for an unknown j.
A low-communication private protocol computes ((M(x-y))_j)^2, and since j is random,
E_j[((M(x-y))_j)^2] = ||Mx - My||_2^2 / n = ||x-y||_2^2 / n.
Private Sub-protocol
Thus, the expectation depends only on the length!
1. Let T be an upper bound on ||x-y||_2^2.
2. The protocol outputs a bit c.
3. Since c is a bit, its distribution is determined by its expectation:
Pr[c = 1] = n·((M(x-y))_j)^2 / T ≈ ||x-y||_2^2 / T ≤ 1.
Repeat a few times to get tight concentration. If most repetitions return c = 0, adjust T and repeat.
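A non-private simulation of this bit-output estimator may help; here z plays the role of M(x-y), and the halving rule and constants are illustrative choices of mine rather than the protocol's exact parameters.

```python
import numpy as np

def one_bit(z, T, rng):
    # One repetition: sample j and output a bit c with Pr[c=1] = n*z_j^2/T.
    n = len(z)
    j = rng.integers(0, n)
    p = min(n * z[j] ** 2 / T, 1.0)   # after rotation, n * z_j^2 ~ ||z||_2^2 <= T
    return rng.random() < p

def estimate(z, T, reps=500, seed=0):
    rng = np.random.default_rng(seed)
    while T > 1e-12:
        ones = sum(one_bit(z, T, rng) for _ in range(reps))
        if ones >= 10:                # enough 1s for tight concentration
            return T * ones / reps    # E[T * c] ~ ||z||_2^2
        T /= 2.0                      # most repetitions were 0: adjust T, repeat
    return 0.0

z = np.full(100, 0.1)                 # ||z||_2^2 = 1.0
print(estimate(z, T=1024.0))          # ~ 1.0
```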
Wrapup
We give an O(1)-round, O(1/ε^2)-communication private protocol for the L2 distance.
Optimal up to suppressed logarithmic factors
Details:
- The randomness is not truly random; it comes from a pseudorandom generator secure against non-uniform machines.
- The parties work with bounded precision.
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
Lp Estimation for p > 2
We improve the n^{1-1/(p-1)} communication upper bound to n^{1-2/p}, and our protocol is 1-round.
Achieving this privately is still an open problem
Lp Estimation for p > 2
Problem: rotation doesn't work for p > 2. Rotating (1, 0) to (1/√2, 1/√2) preserves the L2 norm (both are 1) but not the L4 norm: ||(1, 0)||_4^4 = 1, while ||(1/√2, 1/√2)||_4^4 = 1/2.
Not clear how to “re-randomize” Lp for p > 2
We need a new approach…
Alice holds x ∈ {1, …, m}^n; Bob holds y ∈ {1, …, m}^n.
Strategy:
1. Classify coordinates |x_j - y_j| into buckets 0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket.
3. Output Σ_i s_i · 2^{ip}.
We will approximate ||x-y||_p to within a constant factor. One source of error: the s_i are approximate. Another source: the values are approximate (each is rounded to a power of 2). Overall, the result is still within a constant factor.
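A centralized sketch of this bucketing strategy, with exact counts standing in for the estimated s_i, shows why rounding each value down to a power of 2 costs only a constant factor (2^p here); the function name is mine.

```python
import numpy as np

def bucketed_lp_pth_power(x, y, p):
    d = np.abs(x - y).astype(np.int64)
    d = d[d > 0]                              # the 0 bucket contributes nothing
    i = np.floor(np.log2(d)).astype(int)      # |x_j - y_j| lies in [2^i, 2^{i+1})
    est = 0.0
    for level in np.unique(i):
        s_i = int(np.count_nonzero(i == level))  # exact bucket size s_i
        est += s_i * 2.0 ** (level * p)          # bucket contribution s_i * 2^{ip}
    return est

rng = np.random.default_rng(0)
x = rng.integers(1, 100, size=1000)
y = rng.integers(1, 100, size=1000)
p = 3
exact = float(np.sum(np.abs(x - y) ** p))
print(exact, bucketed_lp_pth_power(x, y, p))  # est <= exact < 2^p * est
```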
Lp Estimation for p > 2
Our approach: whenever s_i is hard to estimate, we can detect this and set our estimate to 0. Otherwise, we estimate it.
Problem: aren't we undercounting? Answer: No! The hard s_i don't matter!
Is estimating s_i easy? Sometimes: [CCF-C] can help estimate s_i when i is large. In general, no: one can show that we need Ω(n) communication if …
Estimating Bucket Sizes
Remaining problem: estimate s_i = the number of coordinates with |x_j - y_j| in the range [2^i, 2^{i+1}).
Is this easy?
The CountSketch Protocol
"I have a 1-round, B-communication protocol which computes all j for which (x_j - y_j)^2 ≥ ||x-y||_2^2 / B."
Lp → L2 intuition: we can detect very large coordinates, where "large" is with respect to the L2 norm.
Looks promising! If s_i = O(1), we can compute s_i with O(n^{1-2/p}) communication.
But what if s_i is large?
Random Restriction
We would like to estimate s_i, given that we can efficiently output all coordinates j for which (x_j - y_j)^2 ≥ ||x-y||_2^2 / B.
Ideas? Not so obvious if s_i is large. Randomly restrict to a ≈ 1/s_i fraction of the coordinates j!
Random Restriction
Example with p = 3. The histogram of values |x_j - y_j| versus number of coordinates:
- Θ(n) coordinates of value 1: contributes Θ(n) to ||x-y||_3^3.
- n^{1/2} coordinates of value n^{1/4}: contributes n^{1/2}·(n^{1/4})^3 = n^{5/4} to ||x-y||_3^3.
- One coordinate of value n^{1/3}: contributes (n^{1/3})^3 = n to ||x-y||_3^3.
The middle group dominates, but the CountSketch protocol cannot detect it: each value in the middle group is small, but the group itself is large.
Random Restriction
We randomly restrict to n^{1/2} coordinates. In the restricted histogram, roughly n^{1/2} coordinates of value 1 survive, about one coordinate of value n^{1/4} survives, and the single coordinate of value n^{1/3} is likely gone. The surviving middle-group coordinate now carries a constant fraction of the restricted L2 mass, so CountSketch can detect it.
Recap
Algorithm:
1. Classify coordinates |x_j - y_j| into buckets 0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket.
3. Output Σ_i s_i · 2^{ip}.
Subroutine (sketched in code below):
1. Randomly restrict to n/2, n/4, n/8, …, coordinates.
2. For each restriction, use CountSketch to retrieve the largest elements; classify them into groups.
3. Scale back to estimate s_i.
Guarantee: either you estimate s_i well, or s_i is tiny.
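A toy, centralized version of this subroutine, with a stand-in heavy-coordinate oracle in place of the CountSketch protocol; all names, thresholds, and the stopping rule are illustrative choices of mine.

```python
import numpy as np

def heavy(z, B):
    # Stand-in for CountSketch: coordinates j with z_j^2 >= ||z||_2^2 / B.
    return np.flatnonzero(z ** 2 >= np.sum(z ** 2) / B)

def estimate_bucket(z, i, B=100, seed=0):
    # Estimate s_i = #{j : |z_j| in [2^i, 2^{i+1})} via random restrictions.
    rng = np.random.default_rng(seed)
    n = len(z)
    rate = 1.0
    while rate * n >= 1.0:
        keep = rng.random(n) < rate          # restrict to a ~rate fraction
        zr = np.where(keep, z, 0.0)
        vals = np.abs(zr[heavy(zr, B)])      # "CountSketch" on the restriction
        hits = np.count_nonzero((vals >= 2.0**i) & (vals < 2.0**(i + 1)))
        if hits > 0:
            return hits / rate               # scale back up
        rate /= 2.0                          # bucket i still hidden: restrict more
    return 0.0                               # never detected: s_i must be tiny

# 900 coordinates of value 1 (bucket 0) and 50 of value 8 (bucket 3):
z = np.concatenate([np.ones(900), np.full(50, 8.0)])
print(estimate_bucket(z, i=3))   # = 50: detected without restriction
print(estimate_bucket(z, i=0))   # ~ 900: detected only after restricting
```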
Wrapup
We give a 1-round, n^{1-2/p}-communication protocol, optimal due to lower bounds [AMS, BJKS, CKS].
It yields an optimal n^{1-2/p}-space streaming algorithm (resolving a question of [AMS]).
Lots of details:
- Naive use of [CCF-C] requires more than one round, but we get one round.
- The randomness needed for the restrictions cannot be truly random in the streaming algorithm; we use a PRG.
My Other Work
Algorithms: longest common/increasing subsequence; computational biology; clustering.
Complexity theory: graph spanners, locally decodable codes.
Cryptography: broadcast encryption, torus-based crypto, PIR, inference control, practical secure function evaluation.
Thank you!
The [CCF-C] Protocol
Alice and Bob share a random map h: [n] → {-1, +1}.
Alice computes Σ_j x_j·h(j) and sends it to Bob, who computes R = Σ_j (x-y)_j·h(j) = Σ_j x_j·h(j) - Σ_j y_j·h(j).
Then E[h(i)·R] = Σ_j (x-y)_j·E[h(i)·h(j)] = x_i - y_i.
Repeat many times to reduce the variance of the estimator.
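A minimal simulation of this sketch (non-private): the median over repetitions is the standard variance-reduction step, and the bucket-hashing of full CountSketch is omitted here.

```python
import numpy as np

def ccfc_estimate(x, y, i, reps=201, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    ests = []
    for _ in range(reps):
        h = rng.choice([-1.0, 1.0], size=n)
        R = np.dot(x, h) - np.dot(y, h)  # R = sum_j (x - y)_j h(j)
        ests.append(h[i] * R)            # E[h(i) * R] = x_i - y_i
    return float(np.median(ests))

x = np.zeros(1000); x[7] = 100.0
y = np.random.default_rng(1).normal(0.0, 1.0, size=1000)
print(ccfc_estimate(x, y, i=7))          # ~ 100 = (x - y)_7, up to noise
```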