
Page 1: Learning the Privacy-Utility Trade-off with Bayesian Optimization (simons-march-2019.pdf)

Learning the Privacy-Utility Trade-off with Bayesian Optimization

Borja Balle

Joint work with B. Avent, J. Gonzalez, T. Diethe and A. Paleyes

Page 2:

Privacy vs Utility

Page 3:

Theory vs Practice

O(√(d log(1/δ)))

Plot from J. M. Abowd “Disclosure Avoidance for Block Level Data and Protection of Confidentiality in Public Tabulations” (CSAC Meeting, December 2018)

Page 4:

Example: DP-SGD


Learning the Privacy–Utility Trade-off with Bayesian Optimization

A. Sparse Vector Technique Analysis

This section provides a proof of the privacy bound for Alg. 1, used to implement the privacy oracle P0 in Sec. 2.3 and Sec. 3.4. The proof is based on observing that our Alg. 1 is just a simple re-parametrization of [26, Alg. 7] where some of the parameters have been fixed up-front. For concreteness, we reproduce [26, Alg. 7] as Alg. 3 below. The result then follows from a direct application of [26, Thm. 4], which shows that Alg. 3 is (ε1 + ε2, 0)-DP.

Algorithm 3: Sparse Vector Technique ([26, Alg. 7] with ε3 = 0)
Input: dataset z, queries q1, ..., qm, sensitivity Δ
Hyperparameters: bound C, thresholds T1, ..., Tm, privacy parameters ε1, ε2
c ← 0, w ← (⊥, ..., ⊥) ∈ {⊥, ⊤}^m
ρ ← Lap(Δ/ε1)
for i ∈ [m] do
    ν ← Lap(2CΔ/ε2)
    if qi(z) + ν ≥ Ti + ρ then
        wi ← ⊤, c ← c + 1
        if c ≥ C then return w
return w

Comparing Alg. 3 with the sparse vector technique used in Sec. 2.3 (Alg. 1), we see that they are virtually the same algorithm, where we have fixed Δ = 1, Ti = 1/2, ε1 = 1/b1 and ε2 = 2C/b2. Thus, by expanding the definitions of b1 and b2 as a function of b and C, we can verify that Alg. 1 is (ε, 0)-DP with

" = "1 + "2 =1

b1+

2C

b2=

1 + (2C)1/3

b+

(2C)2/3(1 + 2C)1/3

b=

(1 + (2C)1/3)(1 + (2C)2/3)

b.

This concludes the proof.
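The chain of equalities can be sanity-checked numerically (a quick check, not part of the proof; `budget_split` and `eps_closed_form` are ad-hoc names introduced here):

```python
from math import isclose

def budget_split(b, C):
    # b1 and b2 as defined by Alg. 1's noise budget split.
    b1 = b / (1 + (2 * C) ** (1 / 3))
    return b1, b - b1

def eps_closed_form(b, C):
    # The final expression (1 + (2C)^{1/3})(1 + (2C)^{2/3}) / b.
    return (1 + (2 * C) ** (1 / 3)) * (1 + (2 * C) ** (2 / 3)) / b

# epsilon_1 + epsilon_2 = 1/b1 + 2C/b2 matches the closed form
for b in (0.1, 1.0, 10.0):
    for C in (1, 5, 25):
        b1, b2 = budget_split(b, C)
        assert isclose(1 / b1 + 2 * C / b2, eps_closed_form(b, C))
```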

B. Differentially Private Stochastic Optimization Algorithms

Stochastic gradient descent (SGD) is a simplification of gradient descent where, on each iteration, instead of computing the gradient for the entire dataset, it is estimated on the basis of a single example (or a small batch of examples) picked uniformly at random (without replacement) [8]. Adam [21] is a first-order gradient-based optimization algorithm for stochastic objective functions, based on adaptive estimates of lower-order moments.

As a privatized version of SGD, we use a mini-batched implementation with clipped gradients and Gaussian noise similar to that of [1]. The pseudo-code is given in Alg. 4; the only difference with the algorithm in [1] is that we sample mini-batches of a fixed size without replacement instead of using mini-batches obtained from Poisson sampling with a fixed probability. In the pseudo-code below, the function clipL(v) acts as the identity if ‖v‖2 ≤ L, and otherwise returns (L/‖v‖2)v. This clipping operation ensures that ‖clipL(v)‖2 ≤ L, so that the ℓ2-sensitivity of any gradient to a change in one datapoint in z is always bounded by L/m.

Algorithm 4: Differentially Private SGD
Input: dataset z = (z1, ..., zn)
Hyperparameters: learning rate η, mini-batch size m, number of epochs T, noise variance σ², clipping norm L
Initialize w ← 0
for t ∈ [T] do
    for k ∈ [n/m] do
        Sample S ⊂ [n] with |S| = m uniformly at random
        Let g ← (1/m) Σ_{j∈S} clipL(∇ℓ(zj, w)) + (2L/m) · N(0, σ²I)
        Update w ← w − η g
return w
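For concreteness, a minimal NumPy sketch of Alg. 4; `grad_loss(zj, w)` stands in for ∇ℓ(zj, w), and calibrating σ to a target (ε, δ) via privacy accounting is a separate step not shown here:

```python
import numpy as np

def clip(v, L):
    # clip_L(v): identity if ||v||_2 <= L, otherwise rescale to norm L.
    norm = np.linalg.norm(v)
    return v if norm <= L else (L / norm) * v

def dp_sgd(z, grad_loss, dim, eta=0.1, m=2, T=5, sigma=1.0, L=1.0, seed=0):
    # Alg. 4 sketch: mini-batches of fixed size without replacement,
    # per-example clipping, Gaussian noise scaled by the bound 2L/m.
    rng = np.random.default_rng(seed)
    n, w = len(z), np.zeros(dim)
    for _ in range(T):
        for _ in range(n // m):
            S = rng.choice(n, size=m, replace=False)
            g = sum(clip(grad_loss(z[j], w), L) for j in S) / m
            g = g + (2 * L / m) * rng.normal(0.0, sigma, size=dim)
            w = w - eta * g
    return w
```

With σ = 0 and full batches this reduces to plain clipped gradient descent, which is a convenient way to check the update rule in isolation.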

• 5+ hyper-parameters affecting both privacy and utility
• For convex problems can be set to achieve near-optimal rates
• For deep learning applications we don't have (good) utility bounds

[Bassily et al. 2014; Abadi et al. 2016]

Page 5:

Privacy-Utility Pareto Front

1. Efficient to compute

2. Use empirical utility measurements

3. Enable fine-grained comparisons

Desiderata


Figure 3. Far left, center left: Hypervolumes of the Pareto fronts computed by the various models, optimizers, and architectures on the Adult and MNIST datasets (respectively) by both DPARETO and random sampling. Center right: Pareto fronts learned for the MLP2 architecture on the MNIST dataset with DPARETO and random sampling, including the shared points they were both initialized with. Far right: Adult dataset DPARETO sampled points and its Pareto front compared to a larger set of random sampling points and its Pareto front.

[Figure: MNIST Pareto fronts for MLP1 and MLP2; x-axis ε from 10^-1 to 10^1, y-axis classification error from 0.0 to 1.0]

Figure 4. Left: Pareto fronts for combinations of models and optimizers on the Adult dataset. Right: Pareto fronts for different MLP architectures on the MNIST dataset.

we compare DPARETO to the traditional naive approach of using random sampling by computing the hypervolumes of Pareto fronts generated by each method.

In Fig. 3, the first two plots show, for a variety of models, how the hypervolume of the Pareto front expands as new points are sampled. In nearly every experiment, DPARETO's approach yields a greater hypervolume than the experiment's random sampling analog, a direct indicator that DPARETO has better characterized the Pareto front. This can be seen very clearly by examining the center right plot of the figure, which directly shows a Pareto front of the MLP2 model with both sampling methods. Specifically, while the random sampling method only marginally improved over its initially seeded points, DPARETO was able to thoroughly explore the high-privacy regime (i.e. small ε). The far right plot of the figure compares the DPARETO approach with 256 sampled points against the random sampling approach with significantly more sampled points, 1500. While both approaches yield similar Pareto fronts, the efficiency of DPARETO is particularly highlighted by the points that are not actually on the front: nearly all the points chosen by DPARETO are close to the actual front, whereas many points chosen by random sampling are nowhere near it.

DPARETO's Versatility. The other main purpose of these experiments is to demonstrate the versatility of DPARETO by comparing multiple approaches to the same problem. In Fig. 4, the left plot shows Pareto fronts of the Adult dataset for multiple optimizers (SGD and Adam) as well as multiple models (LogReg and SVM), and the right plot shows Pareto fronts of the MNIST dataset for different architectures (MLP1 and MLP2). With this, we can see that on the Adult dataset, the LogReg model optimized using Adam was nearly always better than the other model/optimizer combinations. We can also see that on the MNIST dataset, while both architectures performed similarly in the low-privacy regime, the MLP2 architecture significantly outperformed the MLP1 architecture in the high-privacy regime. With DPARETO, analysts and practitioners can efficiently create these types of Pareto fronts and use them to perform privacy–utility trade-off comparisons.

7. Conclusion

In this paper we have introduced DPARETO, a method to empirically characterize the privacy–utility trade-off of differentially private algorithms. We use Bayesian optimization (BO), a state-of-the-art method for hyperparameter optimization, to simultaneously optimize for both privacy and utility, forming a Pareto front. Further, we showed that the use of BO allows us to perform useful visualizations to aid the decision-making process. There are several interesting directions for future work. Here we focused on the supervised learning setting, but the method could also be applied to, e.g., stochastic variational inference on probabilistic models, as long as a utility function (e.g. held-out perplexity) is available. The current DPARETO method uses two independent GPs; an interesting extension would be to use multi-output GPs. While we explored the effect of changing the model (logistic regression vs. SVM) and the optimizer (SGD vs. Adam) on the privacy/accuracy trade-off, it would be interesting to obtain a global Pareto front by optimizing over these choices as well. Finally, in some settings it would be of interest to optimize over additional criteria, such as model size or running time.

Page 6:

Problem Formulation

Parametrized Algorithm Class: A = {Aλ : Z → W | λ ∈ Λ}  (e.g. DP-SGD)

Privacy Oracle: P : Λ → [0, ∞)  (e.g. ε for fixed δ)

Error (Utility) Oracle: E : Λ → [0, 1]  (e.g. expected classification error)

Page 7:

Pareto-Optimal Points

[Figure: mapping from the hyper-parameter space to the (privacy loss, error) plane]


Page 16:

Bayesian Optimization (BO)

• Gradient-free optimization for black-box functions

• Widely used in applications (HPO in ML, scheduling & planning, experimental design, etc)

Page 17:

Bayesian Optimization (BO)

• Gradient-free optimization for black-box functions

• Widely used in applications (HPO in ML, scheduling & planning, experimental design, etc)

Input: F : Λ ⊂ R^p → R, expensive, non-convex, smooth

Goal: λ⋆ = argmin_{λ∈Λ} F(λ)

Page 18:

Bayesian Optimization (BO)

• Gradient-free optimization for black-box functions

• Widely used in applications (HPO in ML, scheduling & planning, experimental design, etc)

Input: F : Λ ⊂ R^p → R, expensive, non-convex, smooth

Goal: λ⋆ = argmin_{λ∈Λ} F(λ)

Bayesian Optimization Loop: given k evaluations (λ1, F(λ1)), ..., (λk, F(λk)):

1. Build a surrogate model for F (e.g. Gaussian process)

2. Find most promising next evaluation

Page 19:

BO: 1-Dimensional Example
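The loop on the previous slide can be sketched in a few lines of NumPy. This is a minimal 1-D illustration with a lower-confidence-bound acquisition on a grid, standing in for the acquisition functions discussed later; `rbf`, `gp_posterior`, and `bo_step` are illustrative names, not from the talk:

```python
import numpy as np

def rbf(a, b, ls=0.3):
    # Squared-exponential kernel matrix between 1-D point sets a and b.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def gp_posterior(x_obs, y_obs, x_grid, noise=1e-6):
    # Standard zero-mean GP regression: posterior mean and variance on a grid.
    K = rbf(x_obs, x_obs) + noise * np.eye(len(x_obs))
    Ks = rbf(x_grid, x_obs)
    Kinv = np.linalg.inv(K)
    mu = Ks @ Kinv @ y_obs
    var = 1.0 - np.einsum('ij,jk,ik->i', Ks, Kinv, Ks)
    return mu, np.maximum(var, 1e-12)

def bo_step(f, x_obs, y_obs, x_grid, beta=2.0):
    # One BO iteration: fit surrogate, minimize LCB = mu - beta * sigma,
    # then evaluate the objective at the chosen point.
    mu, var = gp_posterior(x_obs, y_obs, x_grid)
    x_next = x_grid[np.argmin(mu - beta * np.sqrt(var))]
    return np.append(x_obs, x_next), np.append(y_obs, f(x_next))

# Toy objective on [0, 1]: two initial evaluations, then 8 BO steps.
f = lambda x: np.sin(6 * x) + 0.5 * x
grid = np.linspace(0, 1, 200)
x = np.array([0.1, 0.9])
y = f(x)
for _ in range(8):
    x, y = bo_step(f, x, y, grid)
print('best value found:', y.min())
```

The LCB acquisition trades off exploitation (low posterior mean) against exploration (high posterior variance), which is the essence of the surrogate-model loop sketched above.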


Page 28:

The DPareto Algorithm


the predictive distribution of each output can be fully characterized by their mean mj(λ) and variance s²j(λ) functions, which can be computed in closed form.

3. Use the posterior distribution of the surrogate model to form an acquisition function α(λ; I), where I represents the dataset D and the GP posterior conditioned on D.

4. Collect the next evaluation point λk+1 at the (numerically estimated) global maximum of α(λ; I).

The process is repeated until the budget to collect new locations is exhausted. There are two key aspects of any Bayesian optimization method: the surrogate model of the objectives and the acquisition function α(λ; I).

In this work we used independent GPs with a transformed output domain to model each objective, but generalizations with multi-output GPs [4] are possible (see Appx. E).

3.2. Acquisition with Pareto Front Hypervolume

Next we define an acquisition criterion α(λ; I) useful to collect new points when learning the Pareto front. Let P = PF(V) be the Pareto front computed with the objective evaluations in I and let v† ∈ R^p be some "anti-ideal" point⁴. To measure the relative merit of different Pareto fronts we use the hypervolume HV_{v†}(P) of the region dominated by the Pareto front P bounded by the anti-ideal point. Mathematically this can be expressed as HV_{v†}(P) = μ({v ∈ R^p | v ⪯ v†, ∃u ∈ P : u ⪯ v}), where μ denotes the standard Lebesgue measure on R^p. Henceforth we assume the anti-ideal point is fixed and drop it from our notation.

Larger hypervolume means the points in the Pareto front arecloser to the ideal point (0, 0). Thus, HV(PF(V)) providesa way to measure the quality of the Pareto front obtainedfrom the data in V . Furthermore, hypervolume can be usedto design acquisition functions for selecting hyperparame-ters that will improve the Pareto front. Start by defining theincrement in the hypervolume given a new point v 2 Rp:

ΔPF(v) = HV(PF(V ∪ {v})) − HV(PF(V)).

This quantity is positive only if v lies in the set Γ̃ of points non-dominated by PF(V). Therefore, the probability of improvement (PoI) over the current Pareto front when selecting a new hyperparameter λ can be computed using the model trained on I as follows:

PoI(λ) = P[(f1(λ), ..., fp(λ)) ∈ Γ̃ | I] = ∫_{v ∈ Γ̃} ∏_{j=1}^{p} φj(λ; vj) dvj,  (1)

where φj(λ; ·) is the predictive Gaussian density for fj with mean mj(λ) and variance s²j(λ).

The PoI in (1) accounts for the probability that a given λ ∈ Λ has to improve the Pareto front, and it can be used as a criterion to select new points. However, in this work we opt for the hypervolume-based PoI (HVPoI) due to its superior computational and practical properties [11]. The HVPoI is given by:

α(λ; I) = ΔPF(m(λ)) · PoI(λ),  (2)

where m(λ) = (m1(λ), ..., mp(λ)). This acquisition weights the probability of improving the Pareto front with a measure of how much improvement is expected, computed using the means of the outputs. The HVPoI has been shown to work well in practice, and efficient implementations exist.

⁴The anti-ideal point must be dominated by all points in PF(V). See [11] for further details.

3.3. The DPARETO Algorithm

The main optimization loop of DPARETO is shown in Alg. 2. It combines the two ingredients sketched so far: GPs for surrogate modelling of the objective oracles, and HVPoI as an acquisition function to select new hyperparameters. The basic procedure is to first seed the optimization by selecting k0 hyperparameters from Λ at random, and then fit the GP models for the privacy and utility oracles based on these points. We then find the maximum of the HVPoI acquisition function given in Eq. (2) to obtain the next query point, which is then added into the dataset. This is repeated k times until the optimization budget is used up. Further implementation details are provided in Appx. E.1.

Algorithm 2: DPARETO
Input: hyperparameter set Λ, privacy oracle P, error oracle E, anti-ideal point v†, number of initial points k0, number of iterations k, prior GP
Initialize dataset D ← ∅
for i ∈ [k0] do
    Sample random point λ ∈ Λ
    Evaluate oracles v ← (P(λ), E(λ))
    Augment dataset D ← D ∪ {(λ, v)}
for i ∈ [k] do
    Fit a GP to the transformed privacy using D
    Fit a GP to the transformed utility using D
    Optimize the HVPoI acquisition function in Eq. (2) using anti-ideal point v† and obtain a new query point λ
    Evaluate oracles v ← (P(λ), E(λ))
    Augment dataset D ← D ∪ {(λ, v)}
return Pareto front PF({v | (λ, v) ∈ D})

• Find privacy-utility Pareto front using multi-objective Bayesian optimization

• Use transformed Gaussian processes to model privacy and error oracles

• Acquisition function optimizes hyper-volume based probability of improvement [Couckuyt et al. 2014]
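To make the hypervolume machinery concrete, here is a minimal 2-D sketch (assuming minimization in both coordinates; `pareto_2d`, `hypervolume_2d`, and `delta_hv` are illustrative names, not from the paper):

```python
def pareto_2d(points):
    # Keep only non-dominated points (minimization in both coordinates).
    front, best_y = [], float('inf')
    for x, y in sorted(set(points)):
        if y < best_y:
            front.append((x, y))
            best_y = y
    return front

def hypervolume_2d(front, anti_ideal):
    # HV: area dominated by the front and bounded by the anti-ideal point.
    # Assumes `front` is already a Pareto front (x ascending, y descending).
    ax, ay = anti_ideal
    pts = sorted(front)
    hv = 0.0
    for i, (x, y) in enumerate(pts):
        next_x = pts[i + 1][0] if i + 1 < len(pts) else ax
        hv += (next_x - x) * (ay - y)  # rectangular slab for this point
    return hv

def delta_hv(front, v, anti_ideal):
    # Hypervolume increment Delta_PF(v); zero when v is dominated.
    before = hypervolume_2d(pareto_2d(front), anti_ideal)
    after = hypervolume_2d(pareto_2d(list(front) + [v]), anti_ideal)
    return after - before

front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
print(hypervolume_2d(front, (4.0, 4.0)))        # 6.0
print(delta_hv(front, (1.5, 1.5), (4.0, 4.0)))  # 1.25
```

The HVPoI acquisition in Eq. (2) multiplies exactly this kind of increment, evaluated at the GP posterior means, by the probability of improvement.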

Page 29:

Example: Sparse Vector Technique

[Figure: SVT experiment. Heatmaps of ε and 1 − F1 over noise level b (10^-2 to 10^2) and bound C (5 to 30); Pareto front in the (ε, 1 − F1) plane; Pareto-optimal inputs in the (b, C) plane.]


The Pareto front of a set of points in Rp is defined as follows.

Definition 2 (Pareto Front). Let Γ ⊂ R^p and u, v ∈ Γ. We say that u dominates v if ui ≤ vi for all i ∈ [p], and we write u ⪯ v. The Pareto front of Γ is the set of all non-dominated points PF(Γ) = {u ∈ Γ | v ⋠ u, ∀ v ∈ Γ \ {u}}.

In other words, the Pareto front contains all the points in Γ where none of the coordinates can be decreased further without increasing some of the other coordinates (while remaining inside Γ). According to this definition, given a privacy–utility trade-off problem of the form (Λ, Pδ, Uz) we are interested in finding the Pareto front PF(Γ) of the 2-dimensional set Γ = {(Pδ(λ), 1 − Uz(λ)) | λ ∈ Λ}. The use of 1 − Uz(λ) for the utility coordinate is for notational consistency, since we use the convention that the points in the Pareto front are those that minimize each individual dimension. Given this Pareto front, a decision-maker looking to deploy DP has all the necessary information to make an informed decision about how to trade off privacy and utility in their particular application.
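The definition of the Pareto front translates directly into code. A small sketch for points in R^p (using strict domination for ties, so exact duplicates of a front point are kept; `pareto_front` is an illustrative name):

```python
def pareto_front(points):
    # A point u is kept unless some v satisfies v <= u coordinate-wise
    # with at least one strict inequality (i.e. v dominates u).
    def dominates(v, u):
        return (all(a <= b for a, b in zip(v, u))
                and any(a < b for a, b in zip(v, u)))
    return [u for u in points if not any(dominates(v, u) for v in points)]

pts = [(1, 3), (2, 2), (3, 1), (2, 3), (3, 3)]
print(pareto_front(pts))  # [(1, 3), (2, 2), (3, 1)]
```

Here (2, 3) and (3, 3) are both dominated by (2, 2), so only the three corner points survive; in the privacy–utility setting each tuple would be a (Pδ(λ), 1 − Uz(λ)) pair.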

2.2. Threat Model

In the idealized setting developed in this section, the desired output is the Pareto front PF(Γ), which depends on z through the utility oracle; this is also the case for the Bayesian optimization algorithm for approximating the Pareto front presented in Sec. 3. This warrants a discussion about what threat model is appropriate to consider here.

DP guarantees that an adversary observing the output w = Aλ(z) will not be able to infer too much about any individual record in z. The (central) threat model for DP assumes that z is owned by a trusted curator, responsible for running the algorithm and releasing its output to the world. When releasing multiple outputs computed on the same private input, the final privacy guarantees degrade with the number of outputs according to the composition property [15].

The framework described in this section is not attempting to prevent information leakage about z when releasing the Pareto front. This is because our methodology is only meant to provide a substitute for using closed-form utility guarantees when selecting hyperparameters for a given DP algorithm. When these guarantees are not available, one must resort to evaluating utility empirically, and that can only be done by giving data as input to the algorithm. This implies the Pareto front PF(Γ) can leak information about the dataset z, and so does any object derived from it, e.g. hyperparameters λ identifying a point in the Pareto front.

A simple way around this is to assume the existence of a public dataset z0 which is in some sense similar to the private dataset z on which we would like to run the algorithm. Then we can use z0 to compute the Pareto front of the algorithm, select hyperparameters λ⋆ achieving a desired privacy–utility trade-off, and release the output of Aλ⋆(z). This is the setting we implicitly consider in this paper, though other scenarios are also possible. Note that the availability of public data is not an uncommon assumption in the DP literature [20, 32, 5, 16]. In particular, this threat model is similar to the one being used by the U.S. Census Bureau to tune the parameters for their use of DP in the context of the 2020 census (see Sec. 5 for more details).

2.3. Example: Sparse Vector Technique

The sparse vector technique [14] is a mechanism to privately run m queries against a fixed sensitive database and release under DP the indices of those queries which exceed a certain threshold. The naming of the mechanism reflects the fact that it is specifically designed to have good accuracy when only a small number of queries are expected to be above the threshold. The mechanism has found applications in a number of problems, and several variants of the algorithm have been proposed [26].

To illustrate our framework we use a non-interactive version of the mechanism proposed in [26, Alg. 7]. The mechanism is described in Alg. 1, and is tailored to answer m binary queries q_i : Z^n → {0,1} with sensitivity Δ = 1 and a fixed threshold T = 1/2. The privacy and utility of the mechanism are controlled by the noise level b and the bound C on the number of answers. Increasing b or decreasing C yields a more private but less accurate mechanism. Unlike in the usual setting, where the sparse vector technique is parametrized by the target privacy ε, we modified the mechanism to take as input a total noise level b. This noise level is split across two parameters b_1 and b_2, controlling how much noise is added to the threshold and to the query answers respectively³. The standard privacy analysis of the sparse vector technique provides the following closed-form privacy oracle for our algorithm: P∅(b, C) = (1 + (2C)^(1/3))(1 + (2C)^(2/3)) / b (see Appx. A for more details).
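The closed-form privacy oracle above is straightforward to evaluate; a small sketch (the function name is ours):

```python
def svt_privacy_oracle(b, C):
    """Closed-form privacy oracle for the sparse vector technique:
    eps = (1 + (2C)^(1/3)) * (1 + (2C)^(2/3)) / b."""
    return (1 + (2 * C) ** (1 / 3)) * (1 + (2 * C) ** (2 / 3)) / b
```

Note that ε scales as 1/b for fixed C (more noise is more private) and grows with C (more positive answers leak more).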

Algorithm 1: Sparse Vector Technique
Input: dataset z, queries q_1, ..., q_m
Hyperparameters: noise b, bound C
c ← 0, w ← (0, ..., 0) ∈ {0,1}^m
b_1 ← b/(1 + (2C)^(1/3)), b_2 ← b − b_1, ρ ← Lap(b_1)
for i ∈ [m] do
    ν ← Lap(b_2)
    if q_i(z) + ν ≥ 1/2 + ρ then
        w_i ← 1, c ← c + 1
        if c ≥ C then return w
return w

³The split used by the algorithm is based on the privacy budget allocation suggested in [26, Section 4.2].
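A direct transcription of Alg. 1 into Python (a sketch with an assumed call signature, not the authors' reference implementation):

```python
import numpy as np

def sparse_vector(z, queries, b, C, rng=None):
    """Non-interactive sparse vector technique (after Alg. 1): answers m binary
    queries with sensitivity 1 and threshold T = 1/2, releasing at most C
    positive answers."""
    rng = rng or np.random.default_rng()
    w = np.zeros(len(queries), dtype=int)
    b1 = b / (1 + (2 * C) ** (1 / 3))  # noise level for the threshold
    b2 = b - b1                        # noise level for the query answers
    rho = rng.laplace(scale=b1)        # threshold noise, drawn once
    c = 0
    for i, q in enumerate(queries):
        nu = rng.laplace(scale=b2)
        if q(z) + nu >= 0.5 + rho:
            w[i] = 1
            c += 1
            if c >= C:
                break                  # budget of C positive answers exhausted
    return w
```

With very small b the noise is negligible, so the first C above-threshold queries are reported and the mechanism stops.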

[Lyu et al. 2017]

Setup
• 100 queries with 0/1 output, sensitivity 1
• 10% of queries return 1 (randomly selected)
• Privacy: SVT analysis
• Error: 1 − F-score (avg. over 50 runs)
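Under this setup, the error oracle averages 1 − F1 over repeated runs; a sketch of the per-run metric (the function name is ours):

```python
import numpy as np

def f1_error(true_positive_idx, released, m):
    """1 - F1 between the true above-threshold query set and the indices
    released by the mechanism (the slide's error metric for a single run)."""
    truth = np.zeros(m, dtype=int)
    truth[true_positive_idx] = 1
    tp = int(np.sum((truth == 1) & (released == 1)))
    fp = int(np.sum((truth == 0) & (released == 1)))
    fn = int(np.sum((truth == 1) & (released == 0)))
    if tp == 0:
        return 1.0  # F1 is zero when nothing true is recovered
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 1 - 2 * precision * recall / (precision + recall)
```

A perfect release yields error 0.0; releasing nothing yields error 1.0. Averaging this over 50 runs of the mechanism gives the empirical error reported on the slide.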

Page 30: Learning the Privacy-Utility Trade-off with Bayesian ... › slides › simons-march-2019.pdf · Adam [21] is a first-order gradient-based optimization algorithm for stochastic objective

Example: Sparse Vector Technique

[Figure: heatmaps of ε and 1 − F1 over the hyperparameter grid (b, C); the resulting Pareto front in (ε, 1 − F1) space; and the Pareto-optimal inputs (b, C).]


[Figure: model-predicted ε and 1 − F1 over (b, C); Pareto front plot showing the true Pareto front, the empirical Pareto front, the observation outputs and the non-dominated set; the HVPoI acquisition surface over (b, C) with the next sampling location.]



Implementing the Oracles

Privacy Oracle

• Epsilon for fixed delta / Other DP variants / Attack success metrics

• Closed-form expression / Numerical calculation (e.g. moments accountant)

Error Oracle

• Fixed input / Distribution over inputs / Worst-case (over a set of) inputs

• On expectation / With high probability

• Exact expression / Empirical evaluation
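One way to realize these design choices in code is a pair of oracle interfaces, plus a wrapper turning a single randomized run into an expectation-style error oracle (all names here are ours, not from the talk):

```python
from typing import Protocol

class PrivacyOracle(Protocol):
    """Maps hyperparameters to a privacy guarantee (e.g. epsilon at fixed delta)."""
    def __call__(self, **hyperparams) -> float: ...

class ErrorOracle(Protocol):
    """Maps hyperparameters to an error estimate."""
    def __call__(self, **hyperparams) -> float: ...

def empirical_error_oracle(run_once, n_runs=50):
    """Build an 'on expectation / empirical evaluation' error oracle by
    averaging a single randomized evaluation over repeated runs."""
    def oracle(**hyperparams):
        return sum(run_once(**hyperparams) for _ in range(n_runs)) / n_runs
    return oracle
```

A closed-form privacy oracle (such as the SVT formula) and an empirical error oracle built this way can then be plugged into the same optimization loop interchangeably.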




Machine Learning Experiments

• Adult dataset (n = 40K, d = 123)
  • Logistic regression (SGD and Adam)
  • Linear SVM (SGD)

• MNIST dataset (n = 60K, d = 784)
  • MLP1 (1000 hidden units)
  • MLP2 (128-64 hidden units)
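The private training behind these experiments is typically a DP-SGD-style update (per-example gradient clipping plus Gaussian noise); the following is a simplified sketch of one such step, not the exact procedure used in the talk:

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD-style update: clip each per-example gradient to `clip_norm`,
    average, then add Gaussian noise calibrated to the clipping bound."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(scale=noise_multiplier * clip_norm / len(per_example_grads),
                       size=w.shape)
    return w - lr * (mean_grad + noise)
```

The clipping norm, noise multiplier, learning rate and number of epochs are exactly the kind of hyperparameters whose joint effect on (ε, error) the Pareto-front search explores.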

[Figure: Adult Pareto fronts (classification error vs ε) for LogReg+SGD, LogReg+ADAM and SVM+SGD; MNIST Pareto fronts for MLP1 and MLP2.]



DPareto vs Random Sampling

[Figure: hypervolume evolution of the Pareto front vs number of sampled points, comparing random sampling (RS) and Bayesian optimization (BO) for LogReg+SGD, LogReg+ADAM and SVM+SGD on Adult and for MLP1 and MLP2 on MNIST; MLP2 Pareto fronts (initial, +256 RS, +256 BO); LogReg+SGD samples (1500 RS vs 256 BO).]
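The hypervolume metric used in these comparisons measures the area dominated by a Pareto front relative to a reference point; a two-objective sketch (both ε and error minimized; function names are ours):

```python
def pareto_front(points):
    """Non-dominated subset of 2-D points when both objectives are minimized."""
    front, best_y = [], float("inf")
    for x, y in sorted(points):     # scan in increasing order of the first objective
        if y < best_y:              # keep only points that improve the second
            front.append((x, y))
            best_y = y
    return front

def hypervolume_2d(points, ref):
    """Area dominated by the Pareto front of `points`, bounded by the
    reference point `ref` (which must upper-bound all points)."""
    hv, prev_y = 0.0, ref[1]
    for x, y in pareto_front(points):
        hv += (ref[0] - x) * (prev_y - y)  # add the new dominated strip
        prev_y = y
    return hv
```

A larger hypervolume means the sampled hyperparameters dominate more of the (ε, error) plane, which is how the RS vs BO curves above are scored.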


Conclusion

• Empirical privacy-utility trade-off evaluation enables application-specific decisions

• Bayesian optimization provides a computationally efficient method to recover the Pareto front (especially with a large number of hyperparameters)

Future work:

• Address leakage in the Pareto front (when the error oracle is input-specific)

• Include further criteria (e.g. running time of the parametrized algorithm)