TRANSCRIPT
Efficient and Private Distance Approximation
David Woodruff, MIT
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
The Communication Model
Alice holds x ∈ Σ^n; Bob holds y ∈ Σ^n.
What is the distance D(x,y) between x and y? For example, if Σ = {0,1}, what is the Hamming distance? If Σ = R, what is the Lp distance for some p ∈ (0, ∞)?
The Lp distance is (Σ_{i=1}^n |x_i - y_i|^p)^{1/p}.
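For concreteness, here is a small Python sketch of the two distances just defined (the function names are mine, not part of the talk):

```python
import numpy as np

def hamming_distance(x, y):
    # Number of coordinates where the two vectors differ.
    return int(np.sum(x != y))

def lp_distance(x, y, p):
    # (sum_i |x_i - y_i|^p)^(1/p) for real-valued vectors.
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = np.array([0, 1, 1, 0]); y = np.array([1, 1, 0, 0])
print(hamming_distance(x, y))                             # 2
print(lp_distance(x.astype(float), y.astype(float), 2))   # sqrt(2)
```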
Application – Streaming Model
Stream: 7 1 1 3 7 3 4 …  We want to mine a massive data stream.
How many distinct elements? What's the most frequent item? Is the data uniform or skewed?
Elements arrive in an adversarial order; algorithms are allowed only one pass. Goal: low-space algorithms.
Application – Streaming Model
The streaming model and two-party communication are tightly linked: protocols yield algorithms, and communication lower bounds (always) yield space lower bounds.
Distance approximation captures streaming primitives: distinct elements (Hamming), frequent items (L2), skew (Lp).
Two-party Communication
In this talk, most protocols yield streaming algorithms. Thus, communication equals space: communication lower bounds give space lower bounds, and (often) communication upper bounds give space upper bounds.
Application – IP Session Data

Source     Destination   Bytes   Duration   Protocol
18.6.7.1   19.7.3.2      40K     28         http
10.6.2.3   12.3.4.8      20K     18         ftp
11.1.0.6   11.6.8.2      58K     22         http
12.3.1.5   14.7.0.1      30K     32         http
…          …             …       …          …
AT&T collects 100+ GB of NetFlow data every day.
Application – IP Session Data
AT&T needs to process a massive stream of network data.
Traffic estimation: What fraction of network IP addresses are active? (A distinct-elements computation.)
Traffic analysis: What are the 100 IP addresses with the most traffic? (A frequent-items computation.)
Security/denial of service: Are there any IP addresses witnessing a spike in traffic? (A skewness computation.)
Application – Secure Datamining
For medical research, hospitals wish to mine their joint data
Distance approximation is useful in many mining algorithms, e.g., classification and clustering
Patient confidentiality is protected by strict laws on what information can be shared; mining must not leak anything sensitive.
Issues
Exact vs. Approximate Solution
Efficiency: communication complexity and round complexity.
Security: neither party learns more than what the solution and their own input imply about the other party's input.
Initial Observations
To cope with the Ω(n) communication bound for exact computation, we look for randomized approximation algorithms.

               Exact             Approximate
Deterministic  Ω(n) (folklore)   Ω(n) (folklore)
Randomized     Ω(n) [KS, R]      ?
Previous Results
Output D' such that for all x, y: Pr[D(x,y) ≤ D'(x,y) ≤ (1+ε)·D(x,y)] ≥ 2/3.

                  Private communication complexity       Communication complexity
                  Upper bound          Lower bound       Upper bound                        Lower bound
Lp, p > 2         n (SFE)              n^{1-2/p}         n^{1-1/(p-1)} [AMS96, CK04, G05]   n^{1-2/p} [AMS96, BJKS02, CKS03]
L2                n (SFE)              1/ε               1/ε^2 [AMS96]                      1/ε (folklore)
Hamming distance  n^{1/2} [FIMNSW01]   1/ε               1/ε^2 [FM79, BJKST02…]             1/ε (folklore)
Our Results [IW03, W04, IW05, IW06]
                  Private communication complexity        Communication complexity
                  Upper bound              Lower bound    Upper bound             Lower bound
Lp, p > 2         n (still open)           n^{1-2/p}      O(n^{1-2/p}), 1 round   n^{1-2/p}
L2                O(1/ε^2), O(1) rounds    1/ε            1/ε^2                   Ω(1/ε^2) (1-round)
Hamming distance  O(1/ε^2), O(1) rounds    1/ε            1/ε^2                   Ω(1/ε^2) (1-round)
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
Private L2 Estimation
We improve the n^{1/2} upper bound for private L2 to O(1/ε^2), and our protocol uses O(1) rounds.
This is optimal up to suppressed logarithmic factors, and the same holds for the Hamming distance.
There had been speculation that private approximation is much harder than non-private approximation; we refute this speculation.
Security Definition
What does privacy mean for distance computation? Minimal requirement: Alice does not learn anything about y other than what follows from her input x and D(x,y).
What does privacy mean for distance approximation? Does this work: Alice does not learn anything about y other than what follows from x and the approximation D'(x,y)?
Not sufficient!
Security Definition
Alice holds x ∈ Σ^n; Bob holds y ∈ Σ^n. Suppose Σ = {0,1}.
Set the LSB of D'(x,y) to be y_n, and the remaining bits of D'(x,y) to agree with those of D(x,y).
Then D'(x,y) is a ±1 approximation, but Alice learns y_n, which doesn't follow from x and D(x,y).
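A tiny simulation of this counterexample (the function name is mine): the corrupted output is still within ±1 of the true distance, yet its low bit hands Alice the value of y_n.

```python
def leaky_approximation(true_distance, y_last_bit):
    # Set the LSB of D' to y_n; keep the remaining bits of D(x,y).
    return (true_distance & ~1) | y_last_bit

for d in [10, 11]:
    for yn in [0, 1]:
        d_prime = leaky_approximation(d, yn)
        assert abs(d_prime - d) <= 1   # still a valid +/-1 approximation...
        assert d_prime & 1 == yn       # ...yet it reveals y_n exactly
```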
Security Definition
What does privacy mean for distance approximation?
New requirement: Alice and Bob don't learn anything about each other's input other than what follows from their own input and D(x,y).
Implication: D'(x,y) is determined by D(x,y) and the randomness.
How do we model the power of the cheating parties?
Security Models
Alice holds x ∈ Σ^n; Bob holds y ∈ Σ^n.
Semi-honest: parties follow their instructions but try to learn more than what is prescribed
Malicious: parties deviate from the protocol arbitrarily:
- use a different input
- force the other party to output a wrong answer
- abort before the other party learns the answer
It is difficult to achieve security in the malicious model…
Reductions – Yao, GMW, NN
A protocol secure in the semi-honest model can be compiled into a protocol secure in the malicious model, with the efficiency of the new protocol equal to the efficiency of the old protocol.
It suffices to design protocols in the semi-honest model
The parties follow the instructions of the protocol, so we don't need to worry about "weird" behavior.
We just need to ensure that neither party learns anything about the other's input except what follows from the exact distance.
Our Protocol
Running example. Alice: x = e_1; Bob: y = e_2.
A first try: randomly sample a few coordinates j, compute (x_j - y_j)^2, and scale to estimate ||x-y||_2^2.
Problem: with high probability, all samples return 0, so the estimate is 0.
A second try: randomly rotate the vectors over R^n by a matrix M, obtaining Me_1 and Me_2, and then try the sampling approach.
||Mx - My||_2^2 = ||x-y||_2^2, and now the mass is "spread out", so sampling is effective.
Problem: neither party can learn the samples, since with knowledge of M, this reveals extra information.
Solution: we build a private sub-protocol to output an estimate from the samples, without revealing the samples.
The parties need to agree on the rotation M; this can be done with low communication using a PRG.
Thus, the correctness and desired efficiency of the protocol are easy to verify.
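A minimal, non-private Python sketch of the rotate-then-sample idea (names are mine); the real protocol derives M from a shared PRG seed and uses a private sub-protocol so the sampled values are never revealed.

```python
import numpy as np

def random_rotation(n, rng):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

def estimate_l2_squared(x, y, num_samples=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    M = random_rotation(n, rng)               # shared rotation: spreads out the mass
    z = M @ (x - y)                           # ||z||_2^2 = ||x - y||_2^2
    j = rng.integers(0, n, size=num_samples)  # sample random coordinates
    return n * np.mean(z[j] ** 2)             # E[n * z_j^2] = ||z||_2^2

x = np.zeros(128); x[0] = 1.0                 # x = e_1
y = np.zeros(128); y[1] = 1.0                 # y = e_2
print(estimate_l2_squared(x, y))              # ~ 2 = ||e_1 - e_2||_2^2
```

Without the rotation, the same sampler would return 0 on e_1 and e_2 with high probability, which is exactly the failure mode described above.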
Private Sub-protocol
Problem: Alice learns (My)_j for some j (and similarly for Bob).
Solution: use an oblivious masking sampling protocol [FIMNSW], in which Alice learns (My)_j ⊕ b for a random mask b held by Bob, and Alice does not learn j.
Private Sub-protocol
Alice holds x ∈ Σ^n and computes Mx; Bob holds y ∈ Σ^n and computes My (both parties know M).
They run the oblivious masking sampling protocol:
- Alice creates a mask a; Bob creates a mask b.
- Alice gets b ⊕ (My)_j for an unknown j; Bob gets a ⊕ (Mx)_j for an unknown j.
Private Sub-protocol
Alice has mask a and gets b ⊕ (My)_j for an unknown j; Bob has mask b and gets a ⊕ (Mx)_j for an unknown j.
A low-communication private protocol computes ((M(x-y))_j)^2, and since j is random,
E_j[((M(x-y))_j)^2] = ||Mx - My||_2^2 / n = ||x-y||_2^2 / n.
Private Sub-protocol
Thus, the expectation depends only on the length!
1. Let T be an upper bound on ||x-y||_2^2.
2. The protocol outputs a bit c.
3. Since c is a bit, its distribution is determined by its expectation:
Pr[c = 1] = n·((M(x-y))_j)^2 / T ≈ ||x-y||_2^2 / T ≤ 1.
Repeat a few times to get tight concentration. If most repetitions return c = 0, adjust T and repeat.
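A non-private simulation of this bit-output estimator may help; here z plays the role of M(x-y), and the halving rule and constants are illustrative choices of mine rather than the protocol's exact parameters.

```python
import numpy as np

def one_bit(z, T, rng):
    # One repetition: sample j and output a bit c with Pr[c=1] = n*z_j^2/T.
    n = len(z)
    j = rng.integers(0, n)
    p = min(n * z[j] ** 2 / T, 1.0)   # after rotation, n * z_j^2 ~ ||z||_2^2 <= T
    return rng.random() < p

def estimate(z, T, reps=500, seed=0):
    rng = np.random.default_rng(seed)
    while T > 1e-12:
        ones = sum(one_bit(z, T, rng) for _ in range(reps))
        if ones >= 10:                # enough 1s for tight concentration
            return T * ones / reps    # E[T * c] ~ ||z||_2^2
        T /= 2.0                      # most repetitions were 0: adjust T, repeat
    return 0.0

z = np.full(100, 0.1)                 # ||z||_2^2 = 1.0
print(estimate(z, T=1024.0))          # ~ 1.0
```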
Wrapup
We give an O(1)-round, O(1/ε^2)-communication private protocol for the L2 distance.
Optimal up to suppressed logarithmic factors
Details:
- The randomness is not truly random; it comes from a pseudorandom generator secure against non-uniform machines.
- The parties work with bounded precision.
Outline
1. The Two-Party Communication Model
2. Two Problems
   1. Private Euclidean norm estimation
   2. Higher norm estimation
Lp Estimation for p > 2
We improve the n^{1-1/(p-1)} communication upper bound to n^{1-2/p}, and our protocol is 1-round.
Achieving this privately is still an open problem
Lp Estimation for p > 2
Problem: rotation doesn't work for p > 2. Rotating (1, 0) to (1/√2, 1/√2) preserves the L2 norm (both are 1) but not the L4 norm: ||(1, 0)||_4^4 = 1, while ||(1/√2, 1/√2)||_4^4 = 1/2.
Not clear how to “re-randomize” Lp for p > 2
We need a new approach…
Alice holds x ∈ {1, …, m}^n; Bob holds y ∈ {1, …, m}^n.
Strategy:
1. Classify coordinates |x_j - y_j| into buckets 0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket.
3. Output Σ_i s_i · 2^{ip}.
We will approximate ||x-y||_p to within a constant factor. One source of error: the s_i are approximate. Another source: the values are approximate (each is rounded to a power of 2). Overall, the result is still within a constant factor.
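A centralized sketch of this bucketing strategy, with exact counts standing in for the estimated s_i, shows why rounding each value down to a power of 2 costs only a constant factor (2^p here); the function name is mine.

```python
import numpy as np

def bucketed_lp_pth_power(x, y, p):
    d = np.abs(x - y).astype(np.int64)
    d = d[d > 0]                              # the 0 bucket contributes nothing
    i = np.floor(np.log2(d)).astype(int)      # |x_j - y_j| lies in [2^i, 2^{i+1})
    est = 0.0
    for level in np.unique(i):
        s_i = int(np.count_nonzero(i == level))  # exact bucket size s_i
        est += s_i * 2.0 ** (level * p)          # bucket contribution s_i * 2^{ip}
    return est

rng = np.random.default_rng(0)
x = rng.integers(1, 100, size=1000)
y = rng.integers(1, 100, size=1000)
p = 3
exact = float(np.sum(np.abs(x - y) ** p))
print(exact, bucketed_lp_pth_power(x, y, p))  # est <= exact < 2^p * est
```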
Lp Estimation for p > 2
Our approach: whenever s_i is hard to estimate, we can detect this and set our estimate to 0. Otherwise, we estimate it.
Problem: aren't we undercounting? Answer: No! The hard s_i don't matter!
Is estimating s_i easy? Sometimes: [CCF-C] can help estimate s_i when i is large. In general, no: one can show that we need Ω(n) communication if …
Estimating Bucket Sizes
Remaining problem: estimate s_i = the number of coordinates with |x_j - y_j| in the range [2^i, 2^{i+1}).
Is this easy?
The CountSketch Protocol
"I have a 1-round, B-communication protocol which computes all j for which (x_j - y_j)^2 ≥ ||x-y||_2^2 / B."
Lp → L2 intuition: we can detect very large coordinates, where "large" is with respect to the L2 norm.
Looks promising! If s_i = O(1), we can compute s_i with O(n^{1-2/p}) communication.
But what if s_i is large?
Random Restriction
We would like to estimate s_i, given that we can efficiently output all coordinates j for which (x_j - y_j)^2 ≥ ||x-y||_2^2 / B.
Ideas? Not so obvious if s_i is large. Randomly restrict to a ≈ 1/s_i fraction of the coordinates j!
Random Restriction
Example with p = 3. The histogram of values |x_j - y_j| versus number of coordinates:
- Θ(n) coordinates of value 1: contributes Θ(n) to ||x-y||_3^3.
- n^{1/2} coordinates of value n^{1/4}: contributes n^{1/2}·(n^{1/4})^3 = n^{5/4} to ||x-y||_3^3.
- One coordinate of value n^{1/3}: contributes (n^{1/3})^3 = n to ||x-y||_3^3.
The middle group dominates, but the CountSketch protocol cannot detect it: each value in the middle group is small, but the group itself is large.
Random Restriction
We randomly restrict to n^{1/2} coordinates. In the restricted histogram, roughly n^{1/2} coordinates of value 1 survive, about one coordinate of value n^{1/4} survives, and the single coordinate of value n^{1/3} is likely gone. The surviving middle-group coordinate now carries a constant fraction of the restricted L2 mass, so CountSketch can detect it.
Recap
Algorithm:
1. Classify coordinates |x_j - y_j| into buckets 0, [1, 2), [2, 4), …, [2^i, 2^{i+1}), …
2. Estimate the size s_i of each bucket.
3. Output Σ_i s_i · 2^{ip}.
Subroutine (sketched in code below):
1. Randomly restrict to n/2, n/4, n/8, …, coordinates.
2. For each restriction, use CountSketch to retrieve the largest elements; classify them into groups.
3. Scale back to estimate s_i.
Guarantee: either you estimate s_i well, or s_i is tiny.
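A toy, centralized version of this subroutine, with a stand-in heavy-coordinate oracle in place of the CountSketch protocol; all names, thresholds, and the stopping rule are illustrative choices of mine.

```python
import numpy as np

def heavy(z, B):
    # Stand-in for CountSketch: coordinates j with z_j^2 >= ||z||_2^2 / B.
    return np.flatnonzero(z ** 2 >= np.sum(z ** 2) / B)

def estimate_bucket(z, i, B=100, seed=0):
    # Estimate s_i = #{j : |z_j| in [2^i, 2^{i+1})} via random restrictions.
    rng = np.random.default_rng(seed)
    n = len(z)
    rate = 1.0
    while rate * n >= 1.0:
        keep = rng.random(n) < rate          # restrict to a ~rate fraction
        zr = np.where(keep, z, 0.0)
        vals = np.abs(zr[heavy(zr, B)])      # "CountSketch" on the restriction
        hits = np.count_nonzero((vals >= 2.0**i) & (vals < 2.0**(i + 1)))
        if hits > 0:
            return hits / rate               # scale back up
        rate /= 2.0                          # bucket i still hidden: restrict more
    return 0.0                               # never detected: s_i must be tiny

# 900 coordinates of value 1 (bucket 0) and 50 of value 8 (bucket 3):
z = np.concatenate([np.ones(900), np.full(50, 8.0)])
print(estimate_bucket(z, i=3))   # = 50: detected without restriction
print(estimate_bucket(z, i=0))   # ~ 900: detected only after restricting
```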
Wrapup
We give a 1-round, n^{1-2/p}-communication protocol, optimal due to lower bounds [AMS, BJKS, CKS].
It yields an optimal n^{1-2/p}-space streaming algorithm (resolving a question of [AMS]).
Lots of details:
- Naive use of [CCF-C] requires more than one round, but we get one round.
- The randomness needed for the restrictions cannot be truly random in the streaming algorithm; we use a PRG.
My Other Work
Algorithms: longest common/increasing subsequence; computational biology; clustering.
Complexity theory: graph spanners, locally decodable codes.
Cryptography: broadcast encryption, torus-based crypto, PIR, inference control, practical secure function evaluation.
Thank you!
The [CCF-C] Protocol
Alice and Bob share a random map h: [n] → {-1, +1}.
Alice computes Σ_j x_j·h(j) and sends it to Bob, who computes R = Σ_j (x-y)_j·h(j) = Σ_j x_j·h(j) - Σ_j y_j·h(j).
Then E[h(i)·R] = Σ_j (x-y)_j·E[h(i)·h(j)] = x_i - y_i.
Repeat many times to reduce the variance of the estimator.
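A minimal simulation of this sketch (non-private): the median over repetitions is the standard variance-reduction step, and the bucket-hashing of full CountSketch is omitted here.

```python
import numpy as np

def ccfc_estimate(x, y, i, reps=201, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    ests = []
    for _ in range(reps):
        h = rng.choice([-1.0, 1.0], size=n)
        R = np.dot(x, h) - np.dot(y, h)  # R = sum_j (x - y)_j h(j)
        ests.append(h[i] * R)            # E[h(i) * R] = x_i - y_i
    return float(np.median(ests))

x = np.zeros(1000); x[7] = 100.0
y = np.random.default_rng(1).normal(0.0, 1.0, size=1000)
print(ccfc_estimate(x, y, i=7))          # ~ 100 = (x - y)_7, up to noise
```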