
TRANSCRIPT

Page 1: Efficient and Private Distance Approximation David Woodruff MIT

Efficient and Private Distance Approximation

David Woodruff, MIT

Page 2: Efficient and Private Distance Approximation David Woodruff MIT

Outline

1. Two-Party Communication

2. Two Problems

1. Private Euclidean norm estimation

2. Higher norm estimation

Page 3: Efficient and Private Distance Approximation David Woodruff MIT

The Communication Model

Alice holds x ∈ Σ^n; Bob holds y ∈ Σ^n.

What is the distance D(x,y) between x and y? For example, if Σ = {0,1}, what is the Hamming distance? If Σ = R, what is the Lp distance for some p ∈ (0, ∞)?

The Lp distance is (∑_{i=1}^n |x_i - y_i|^p)^{1/p}.
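For concreteness, here is a direct, centralized (non-private, non-streaming) computation of these distances; this is just the definition in code, with numpy assumed only for convenience.

```python
import numpy as np

def hamming_distance(x, y):
    """Number of coordinates where the binary vectors x and y differ."""
    return int(np.sum(x != y))

def lp_distance(x, y, p):
    """(sum_i |x_i - y_i|^p)^(1/p) for real vectors x and y."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

# Both parties' inputs side by side (no communication model here).
x = np.array([0, 1, 1, 0]); y = np.array([1, 1, 0, 0])
print(hamming_distance(x, y))   # 2
print(lp_distance(x, y, 2))     # Euclidean (L2) distance, sqrt(2)
```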

Page 4: Efficient and Private Distance Approximation David Woodruff MIT

Application – Streaming Model

Example stream: 7113734 …  We want to mine a massive data stream:

How many distinct elements? What's the most frequent item? Is the data uniform or skewed?

Elements arrive in adversarial order. Algorithms are allowed only one pass. Goal: low-space algorithms.

Page 5: Efficient and Private Distance Approximation David Woodruff MIT

Application – Streaming Model

Streaming model vs. two-party communication:
- Communication lower bounds yield space lower bounds (always).
- Protocols yield algorithms (often).

Distance approximation captures streaming primitives: distinct elements (Hamming), frequent items (L2), skew (Lp).

In this talk, most protocols yield streaming algorithms. Thus, communication equals space.

Page 6: Efficient and Private Distance Approximation David Woodruff MIT

Application – IP session data

Source      Destination   Bytes   Duration   Protocol
18.6.7.1    19.7.3.2      40K     28         http
10.6.2.3    12.3.4.8      20K     18         ftp
11.1.0.6    11.6.8.2      58K     22         http
12.3.1.5    14.7.0.1      30K     32         http
…           …             …       …          …

AT&T collects 100+ GB of NetFlow data every day.

Page 7: Efficient and Private Distance Approximation David Woodruff MIT

Application – IP Session Data

AT&T needs to process a massive stream of network data:

Traffic estimation: what fraction of network IP addresses are active? (a distinct-elements computation)

Traffic analysis: what are the 100 IP addresses with the most traffic? (a frequent-items computation)

Security / denial of service: are there any IP addresses witnessing a spike in traffic? (a skewness computation)

Page 8: Efficient and Private Distance Approximation David Woodruff MIT

Application – Secure Datamining

For medical research, hospitals wish to mine their joint data

Distance approximation is useful in many mining algorithms, e.g., classification and clustering

Patient confidentiality laws strictly limit what information can be shared, so the mining cannot leak anything sensitive.

Page 9: Efficient and Private Distance Approximation David Woodruff MIT

Issues

Exact vs. approximate solution

Efficiency: communication complexity, round complexity

Security: neither party learns more than what the solution and his/her own input imply about the other party's input

Page 10: Efficient and Private Distance Approximation David Woodruff MIT

Initial Observations

To cope with the Ω(n) communication bound, we look for randomized approximation algorithms.

                 Exact             Approximate
Deterministic    Ω(n) (folklore)   Ω(n) (folklore)
Randomized       Ω(n) [KS, R]      ?

Page 11: Efficient and Private Distance Approximation David Woodruff MIT

Previous Results

Goal: output D' such that for all x, y:  Pr[D(x,y) ≤ D'(x,y) ≤ (1+ε)·D(x,y)] ≥ 2/3.

Hamming distance
- Private: upper bound n^{1/2} [FIMNSW01]; lower bound 1/ε
- Non-private: upper bound 1/ε^2 [FM79, BJKST02, …]; lower bound 1/ε (folklore)

L2
- Private: upper bound n (generic SFE); lower bound 1/ε
- Non-private: upper bound 1/ε^2 [AMS96]; lower bound 1/ε (folklore)

Lp, p > 2
- Private: upper bound n (generic SFE); lower bound n^{1-2/p}
- Non-private: upper bound n^{1-1/(p-1)} [AMS96, CK04, G05]; lower bound n^{1-2/p} [AMS96, BJKS02, CKS03]

Page 12: Efficient and Private Distance Approximation David Woodruff MIT

Our Results [IW03, W04, IW05, IW06]

(Arrows mark the new bounds; ε is the approximation error.)

Hamming distance
- Private: upper bound n^{1/2} -> O(1/ε^2) with O(1) rounds; lower bound 1/ε
- Non-private: upper bound 1/ε^2; lower bound 1/ε -> Ω(1/ε^2) for 1-round protocols

L2
- Private: upper bound n -> O(1/ε^2) with O(1) rounds; lower bound 1/ε
- Non-private: upper bound 1/ε^2; lower bound 1/ε -> Ω(1/ε^2) for 1-round protocols

Lp, p > 2
- Private: upper bound n (improving this privately is still open); lower bound n^{1-2/p}
- Non-private: upper bound n^{1-1/(p-1)} -> O(n^{1-2/p}) with 1 round; lower bound n^{1-2/p}

Page 13: Efficient and Private Distance Approximation David Woodruff MIT

Outline

1. The Two-Party Communication Model

2. Two Problems

1. Private Euclidean norm estimation

2. Higher norm estimation

Page 14: Efficient and Private Distance Approximation David Woodruff MIT

Private L2 Estimation

We improve the n^{1/2} upper bound for private L2 estimation to 1/ε^2, and our protocol uses O(1) rounds.

This is optimal up to suppressed logarithmic factors, and the result also holds for the Hamming distance.

There was speculation that private approximation is much harder than non-private approximation; we refute this speculation.

Page 15: Efficient and Private Distance Approximation David Woodruff MIT

Security Definition

What does privacy mean for distance computation?

Minimal requirement: Alice does not learn anything about y other than what follows from her input x and D(x,y).

What does privacy mean for distance approximation? Does this work:

Alice does not learn anything about y other than what follows from x and the approximation D'(x,y)?

Not sufficient!

Page 16: Efficient and Private Distance Approximation David Woodruff MIT

Security Definition

Alice holds x ∈ Σ^n, Bob holds y ∈ Σ^n. Suppose Σ = {0,1}.

Set the least significant bit of D'(x,y) to be y_n, and the remaining bits of D'(x,y) to agree with those of D(x,y).

Then D'(x,y) is a ±1 approximation, but Alice learns y_n, which doesn't follow from x and D(x,y).
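A toy illustration of this attack (hypothetical code, with D the Hamming distance over Σ = {0,1}): the approximation below is always within ±1 of D(x,y), yet its low-order bit hands Alice the value of y_n.

```python
def hamming(x, y):
    return sum(xi != yi for xi, yi in zip(x, y))

def leaky_approximation(x, y):
    """Within +/-1 of D(x,y), but its least significant bit equals y[-1]."""
    d = hamming(x, y)
    return (d & ~1) | y[-1]   # keep the high bits of d, overwrite the LSB with y_n

x = [0, 1, 1, 0, 1]
y = [1, 1, 0, 0, 1]
d_approx = leaky_approximation(x, y)
print(d_approx,
      abs(d_approx - hamming(x, y)) <= 1,   # a legitimate +/-1 approximation
      d_approx & 1 == y[-1])                # ... that reveals y_n
```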

Page 17: Efficient and Private Distance Approximation David Woodruff MIT

Security Definition

What does privacy mean for distance approximation?

New requirement: Alice and Bob don't learn anything about each other's input other than what follows from their own input and D(x,y).

Implication: D'(x,y) is determined by D(x,y) and the randomness.

How do we model the power of the cheating parties?

Page 18: Efficient and Private Distance Approximation David Woodruff MIT

Security Models

Alice holds x ∈ Σ^n, Bob holds y ∈ Σ^n.

Semi-honest: the parties follow their instructions, but try to learn more than what is prescribed.

Malicious: the parties deviate from the protocol arbitrarily:
- use a different input
- force the other party to output a wrong answer
- abort before the other party learns the answer

It is difficult to achieve security in the malicious model…

Page 19: Efficient and Private Distance Approximation David Woodruff MIT

Reductions – Yao, GMW, NN

A protocol secure in the semi-honest model can be transformed into a protocol secure in the malicious model, with the efficiency of the new protocol equal to the efficiency of the old protocol.

So it suffices to design protocols in the semi-honest model: the parties follow the instructions of the protocol, and we don't need to worry about "weird" behavior. We just need to ensure that neither party learns anything about the other's input except what follows from the exact distance.

Page 20: Efficient and Private Distance Approximation David Woodruff MIT

Our Protocol

Example inputs: Alice has x = e_1, Bob has y = e_2.

A first try: randomly sample a few coordinates j, compute (x_j - y_j)^2, and scale to estimate ||x-y||_2^2.
Problem: with high probability, all samples return 0, so the estimate is 0.

A second try: randomly rotate the vectors over R^n, then try the sampling approach: Alice holds Me_1, Bob holds Me_2.
||Mx - My||_2^2 = ||x-y||_2^2, and now the mass is "spread out", so sampling is effective.
Problem: neither party can learn the samples, since with the knowledge of M this reveals extra information.

Solution: we build a private sub-protocol to output an estimate from the samples, without revealing the samples.
The parties need to agree on the rotation M; this can be done with low communication using a PRG.
Thus, the correctness and desired efficiency of the protocol are easy to verify.
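Below is a minimal, non-private sketch of the sampling idea on the e_1/e_2 example, with the rotation taken to be a random orthogonal matrix drawn with numpy; in the actual protocol M comes from a shared PRG and the samples are processed inside the private sub-protocol described next.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = np.zeros(n); x[0] = 1.0          # Alice's input e_1
y = np.zeros(n); y[1] = 1.0          # Bob's input e_2

# Naive sampling: almost all coordinates of x - y are 0, so the estimate is usually 0.
idx = rng.integers(0, n, size=20)
naive_est = n * np.mean((x[idx] - y[idx]) ** 2)

# Random rotation: M preserves ||x - y||_2 but spreads the mass across all coordinates.
M = np.linalg.qr(rng.standard_normal((n, n)))[0]   # a random orthogonal matrix
Mx, My = M @ x, M @ y
idx = rng.integers(0, n, size=20)
rotated_est = n * np.mean((Mx[idx] - My[idx]) ** 2)

# True value is 2; the rotated estimate is typically close, the naive one usually is not.
print(np.sum((x - y) ** 2), naive_est, rotated_est)
```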

Page 21: Efficient and Private Distance Approximation David Woodruff MIT

Private Sub-protocol

Problem: Alice learns (My)_j for some j (the situation for Bob is symmetric).

Solution:

1. Use an oblivious masked sampling protocol [FIMNSW]: Alice learns (My)_j ⊕ b for a random mask b held by Bob, and Alice does not learn j.

Page 22: Efficient and Private Distance Approximation David Woodruff MIT

Private Sub-protocol

Alice holds x ∈ Σ^n and computes Mx; Bob holds y ∈ Σ^n and computes My (both parties know M).

Alice creates a random mask a, and Bob creates a random mask b. They then run the oblivious masked sampling protocol:

Alice gets b ⊕ (Mx - My... no wait, b ⊕ (My)_j for an unknown index j; Bob gets a ⊕ (Mx)_j for the same unknown index j.

Page 23: Efficient and Private Distance Approximation David Woodruff MIT

Private Sub-protocol

Alice has mask a and b ⊕ (My)_j; Bob has mask b and a ⊕ (Mx)_j; both know M, but neither knows j.

A low-communication private protocol then computes (M(x-y))_j^2, and since j is random,

E_j[(M(x-y))_j^2] = ||Mx - My||_2^2 / n = ||x-y||_2^2 / n.

Page 24: Efficient and Private Distance Approximation David Woodruff MIT

Private Sub-protocol

Recall that E_j[(M(x-y))_j^2] = ||x-y||_2^2 / n, so the expectation depends only on the length!

1. Let T be an upper bound on ||x-y||_2^2.
2. The protocol outputs a bit c.
3. Since c is a bit, its distribution is determined by its expectation:
   Pr[c = 1] = n·(M(x-y))_j^2 / T ≈ ||x-y||_2^2 / T ≤ 1.

Repeat a few times to get tight concentration. If most repetitions return c = 0, adjust T and repeat.
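A plain, non-private simulation of this biased-coin idea, assuming direct access to the rotated difference M(x-y): each repetition outputs a bit whose expectation is about ||x-y||_2^2 / T, and if almost all bits are 0 the current T is too loose, so it is halved. The min(…, 1) clamp and the specific thresholds below are illustrative choices of mine, not the protocol's.

```python
import numpy as np

rng = np.random.default_rng(1)

def coin(z_sq, n, T):
    """One bit c with Pr[c = 1] = min(n * z^2 / T, 1), so E[c] is about ||x-y||_2^2 / T."""
    return rng.random() < min(n * z_sq / T, 1.0)

def estimate_norm_sq(rotated_diff, T, reps=2000):
    """Estimate ||x-y||_2^2 from random samples of M(x-y), halving T while the coins are almost all 0."""
    n = len(rotated_diff)
    while True:
        js = rng.integers(0, n, size=reps)                 # random sample coordinates
        frac = np.mean([coin(rotated_diff[j] ** 2, n, T) for j in js])
        if frac >= 0.1 or T < 1e-12:                       # enough 1s: T is in the right range
            return frac * T
        T /= 2                                             # too many 0s: T was too loose

n = 512
diff = rng.standard_normal(n)
diff *= 3.0 / np.linalg.norm(diff)      # a "spread out" difference vector with ||x-y||_2^2 = 9
print(estimate_norm_sq(diff, T=1e6))    # close to 9
```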

Page 25: Efficient and Private Distance Approximation David Woodruff MIT

Wrapup

We give an O(1)-round, 1/ε^2-communication private protocol for the L2 distance. This is optimal up to suppressed logarithmic factors.

Details:
- The randomness is not truly random; it comes from a pseudorandom generator against non-uniform machines.
- The parties have bounded precision.

Page 26: Efficient and Private Distance Approximation David Woodruff MIT

Outline

1. The Two-Party Communication Model

2. Two Problems

1. Private Euclidean norm estimation

2. Higher norm estimation

Page 27: Efficient and Private Distance Approximation David Woodruff MIT

Lp Estimation for p > 2

We improve the n^{1-1/(p-1)} communication upper bound to n^{1-2/p}, and our protocol uses only one round.

Achieving this privately is still an open problem.

Page 28: Efficient and Private Distance Approximation David Woodruff MIT

Lp Estimation for p > 2

Problem: rotation doesn't work for p > 2.

                              L2 norm    ∑_i |v_i|^4
(1, 0)                        1          1
(1/2^{1/2}, 1/2^{1/2})        1          1/2

(The second vector is the first after a 45-degree rotation.)

A rotation preserves the L2 norm but changes the L4 norm, so it is not clear how to "re-randomize" Lp for p > 2.

We need a new approach…
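A quick numeric check of the table above (an explicit 45-degree rotation, not taken from the talk): the L2 mass is preserved while the fourth-power mass is not, so there is no analogue of the rotation trick for p > 2.

```python
import numpy as np

theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation by 45 degrees

v = np.array([1.0, 0.0])
w = R @ v                                          # (1/sqrt(2), 1/sqrt(2))

for u in (v, w):
    print(np.linalg.norm(u),        # L2 norm: 1.0 for both (rotation-invariant)
          np.sum(np.abs(u) ** 4))   # sum of 4th powers: 1.0 vs 0.5 (not invariant)
```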

Page 29: Efficient and Private Distance Approximation David Woodruff MIT

Lp Estimation for p > 2

Alice holds x ∈ {1, …, m}^n; Bob holds y ∈ {1, …, m}^n.

Strategy:
1. Classify the coordinates |x_j - y_j| into buckets 0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket.
3. Output ∑_i s_i · 2^{ip} as an estimate of ||x-y||_p^p.

This approximates ||x-y||_p to within a constant factor. One source of error is that the s_i are only approximate; another is that the values are rounded to powers of 2. Overall, the estimate is still within a constant factor.
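A centralized sketch of this strategy with exact bucket sizes (no communication, no estimation error in step 2), just to show where the constant factor from rounding comes from; the bucket arithmetic mirrors step 3 above.

```python
import numpy as np

def lp_p_via_buckets(x, y, p):
    """Constant-factor estimate of ||x - y||_p^p from the bucket sizes s_i."""
    diffs = np.abs(x - y).astype(float)
    diffs = diffs[diffs >= 1]                        # the "0" bucket contributes nothing
    levels = np.floor(np.log2(diffs)).astype(int)    # |x_j - y_j| in [2^i, 2^{i+1})  ->  i
    estimate = 0.0
    for i, s_i in zip(*np.unique(levels, return_counts=True)):
        estimate += s_i * (2.0 ** i) ** p            # each bucket contributes s_i * 2^{ip}
    return estimate

rng = np.random.default_rng(2)
x = rng.integers(0, 100, size=1000)
y = rng.integers(0, 100, size=1000)
exact = np.sum(np.abs(x - y).astype(float) ** 3)
print(exact, lp_p_via_buckets(x, y, 3))              # within a 2^p factor of each other
```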

Page 30: Efficient and Private Distance Approximation David Woodruff MIT

Estimating Bucket Sizes

Remaining problem: estimate s_i = the number of coordinates with |x_j - y_j| in the range [2^i, 2^{i+1}).

Is this easy? Sometimes: [CCF-C] helps estimate s_i when i is large. In general, no: one can show that Ω(n) communication is needed to estimate certain s_i.

Our approach: whenever s_i is hard to estimate, we can detect this and set it to 0; otherwise, we estimate it.

Problem: aren't we undercounting? Answer: no! The hard s_i don't matter!

Page 31: Efficient and Private Distance Approximation David Woodruff MIT

The CountSketch Protocol

CountSketch [CCF-C] is a 1-round, B-communication protocol which computes all j for which (x_j - y_j)^2 ≥ ||x-y||_2^2 / B.

Lp -> L2 intuition: we can detect very large coordinates, where "large" is with respect to the L2 norm.

This looks promising: if s_i = O(1), we can compute s_i with O(n^{1-2/p}) communication.

Page 32: Efficient and Private Distance Approximation David Woodruff MIT

Random Restriction

We would like to estimate s_i, given that we can efficiently output all coordinates j for which (x_j - y_j)^2 ≥ ||x-y||_2^2 / B.

Ideas? It is not so obvious when s_i is large. The trick: randomly restrict to a ≈ 1/s_i fraction of the coordinates j!

Page 33: Efficient and Private Distance Approximation David Woodruff MIT

Random Restriction

An example value profile of |x - y| (value vs. number of coordinates, and contribution to ||x-y||_3^3):

Value      Number of coordinates    Contribution to ||x-y||_3^3
1          Θ(n)                     Θ(n)
n^{1/4}    n^{1/2}                  n^{1/2} · (n^{1/4})^3 = n^{5/4}
n^{1/3}    1                        (n^{1/3})^3 = n

The middle group dominates, but the CountSketch protocol cannot detect it: each value in the middle group is small, but the group itself is large.

Page 34: Efficient and Private Distance Approximation David Woodruff MIT

Random Restriction

We randomly restrict to n^{1/2} coordinates (same value profile as above: Θ(n) coordinates of value 1, n^{1/2} coordinates of value n^{1/4}, and one coordinate of value n^{1/3}).

After the restriction, roughly one coordinate of value n^{1/4} survives, and its square, n^{1/2}, is now a constant fraction of the restricted ||x-y||_2^2 = Θ(n^{1/2}), so CountSketch can detect it.

Page 35: Efficient and Private Distance Approximation David Woodruff MIT

Recap

Algorithm:
1. Classify the coordinates |x_j - y_j| into buckets 0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket.
3. Output ∑_i s_i · 2^{ip}.

Subroutine for step 2:
1. Randomly restrict to n/2, n/4, n/8, …, coordinates.
2. For each restriction, use CountSketch to retrieve the largest elements, and classify them into buckets.
3. Scale back by the sampling rate to estimate s_i.

Guarantee: either you estimate s_i well, or s_i is tiny.
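A self-contained, non-private sketch of the subroutine, with CountSketch replaced by an idealized oracle that returns every j with (x_j - y_j)^2 ≥ ||x-y||_2^2 / B; the "trust a level only if it detects about one element per trial" rule and the max over levels are simplifications of mine, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(3)

def heavy_coordinates(diff, B):
    """Idealized stand-in for CountSketch: all j with diff_j^2 >= ||diff||_2^2 / B."""
    thresh = np.sum(diff ** 2) / B
    return np.flatnonzero((diff != 0) & (diff ** 2 >= thresh))

def estimate_bucket_sizes(diff, B=100, trials=20):
    """Estimate s_i = #{j : |diff_j| in [2^i, 2^{i+1})} via random restrictions."""
    n = len(diff)
    estimates, rate = {}, 1.0
    while rate * n >= 1:
        counts = {}
        for _ in range(trials):
            restricted = diff * (rng.random(n) < rate)     # keep about rate*n coordinates
            for j in heavy_coordinates(restricted, B):
                i = int(np.floor(np.log2(abs(restricted[j]))))
                counts[i] = counts.get(i, 0) + 1
        for i, c in counts.items():
            if c >= trials:                                # about 1 detection per trial: trust this level
                estimates[i] = max(estimates.get(i, 0.0), c / trials / rate)
        rate /= 2                                          # restrict to n, n/2, n/4, ... coordinates
    return estimates

diff = np.concatenate([np.ones(10000), np.full(100, 10.0), [100.0]])
print(estimate_bucket_sizes(diff))   # roughly {0: 10000, 3: 100, 6: 1}
```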

Page 36: Efficient and Private Distance Approximation David Woodruff MIT

Wrapup

We give a 1-round, n^{1-2/p}-communication protocol, which is optimal due to lower bounds [AMS, BJKS, CKS]. This yields an optimal n^{1-2/p}-space streaming algorithm, resolving a question of [AMS].

Lots of details:
- A naive use of [CCF-C] requires more than one round, but we get one round.
- The randomness needed for the restrictions cannot be truly random in the streaming algorithm; we use a PRG.

Page 37: Efficient and Private Distance Approximation David Woodruff MIT

My Other Work

Algorithms: longest common/increasing subsequence, computational biology, clustering

Complexity theory: graph spanners, locally decodable codes

Cryptography: broadcast encryption, torus-based crypto, PIR, inference control, practical secure function evaluation

Page 38: Efficient and Private Distance Approximation David Woodruff MIT

Thank you!

Page 39: Efficient and Private Distance Approximation David Woodruff MIT

The [CCF-C] protocol

Alice and Bob share a random linear map h: [n] -> {-1, +1}.

Alice computes ∑_j x_j h(j) and Bob computes ∑_j y_j h(j), which together give
R = ∑_j (x-y)_j h(j) = ∑_j x_j h(j) - ∑_j y_j h(j).

Then E[h(i) · R] = ∑_j (x-y)_j E[h(i)h(j)] = x_i - y_i.

Repeat many times to reduce the variance of the estimator.
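A minimal sketch of this estimator with the shared randomness made explicit; the full CountSketch also hashes coordinates into buckets and takes medians, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(4)

n, reps = 1000, 200
y = rng.integers(0, 2, size=n).astype(float)
x = y.copy()
x[7] += 50.0                            # one heavy coordinate in x - y
x[:5] += 1.0                            # plus a few small differences

# Shared random signs h_k : [n] -> {-1, +1}, one row per repetition.
H = rng.choice([-1.0, 1.0], size=(reps, n))
R = H @ x - H @ y                       # = H @ (x - y); Bob only sends his reps sketch values

# E[h(i) * R] = sum_j (x - y)_j E[h(i) h(j)] = x_i - y_i, so average over the repetitions:
estimates = H * R[:, None]              # reps x n matrix of per-repetition estimates
recovered = estimates.mean(axis=0)      # one estimate per coordinate

# Coordinate 7 stands out, and its estimate is close to 50.
print(np.argmax(np.abs(recovered)), recovered[7])
```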