Posted by: jonkatz | February 23, 2010

## IPAM Workshop on Data Privacy, Day 1

I’m at IPAM for the Workshop on Data Privacy. Interestingly, the workshop is evenly split between computer scientists and statisticians, at least in terms of talks. (There appear to be more computer scientists than statisticians in the audience, though not by a huge margin.) While some statisticians discussed other notions of privacy, it was heartening to see that several statisticians have begun to adopt the notion of differential privacy in their work.

Day 1 began with a series of tutorials: Cynthia Dwork and Adam Smith introduced differential privacy and some known results; Larry Wasserman gave an introduction to statistics. Tutorials seem to be undervalued — there are not enough of them, and even when they are given they usually are not allotted enough time. I think that was the case yesterday as well. While I found the talks on differential privacy to be excellent, many slides were skipped and I heard afterward from some students (who were less familiar with the prior work in the area) that they found the talks hard to follow. I had this experience with Wasserman’s tutorial: while I was able to take something away from it, I was lost during the second half. (As an aside: it is a bit strange that statistics is not a more central part of math, cs, or the sciences. I took an undergraduate prob/stat course at MIT, but it was more applied and did not cover anything that Wasserman talked about in his tutorial. One more thing to learn…)

Following lunch, Kunal Talwar talked about his recent work (with Moritz Hardt) “On the Geometry of Differential Privacy”. The basic setting here is that the data is a vector $x$ and the database owner wants to release (approximations to) $\langle f_1, x \rangle, \ldots, \langle f_d, x \rangle$, where the brackets denote standard dot product (over the reals). What is the tradeoff between privacy and utility? Adding independent Laplacian noise to each output preserves privacy, but is this the best that can be done? (To see immediately that it’s not, consider the case when the $f_i$ are all equal…) Hardt and Talwar give lower and upper bounds on the best utility that can be obtained, expressed as a function of the geometry of $K=FB$, where $F$ is the matrix with the $f_i$ as its rows, and $B$ is the unit ball (in the $L_1$ norm). There are some nice open questions left by this work, and I’ve put this paper on my reading stack.

The final talk of the day was by Frank McSherry. He actually gave two talks. The first described his programming interface PINQ, which allows analysts to express differentially private queries over datasets. He also discussed some more recent (unpublished?) work of his on differentially private probabilistic inference.