Posted by: jonkatz | March 1, 2010

IPAM Workshop on Data Privacy, Days 2-4

(Note: slides from several of the talks are posted on-line. Arvind Narayanan has another post about the workshop. If anyone who attended any talks I missed wants to write a guest blog, please let me know.)

Day 2

I was, unfortunately, only able to make it to two talks on Day 2.
The first talk I attended was by Vitaly Shmatikov, who introduced a system enforcing privacy guarantees for large-scale distributed computations (using MapReduce) on sensitive data. It was interesting to compare this to the PINQ work of McSherry (discussed last time”. (The works were done independently. I should point out also that Vitaly’s paper is not yet available, so what I say here is based entirely on memory from his talk.) While they both ultimately achieve similar goals, the main difference appears to be the following: McSherry’s work gives analysts a new tool (more specifically, an extension of a programming language) in which to express differentially private queries; in Shmatikov’s work the analysts do not have to change anything (i.e., they can continue to use standard MapReduce) but, before answering any queries, the system checks whether the computation is differentially private.

The next talk I attended was by Kamalika Chaudhury, describing work on computing differentially private linear classifiers. Prior work had shown how this could be done, but her work gives improvements on the noise added (while maintaining the same level of differential privacy). The “aha” moment of her talk was this: whereas prior work would first compute an optimal linear classifier for the data and then perturb the result before outputting it, in her work they first perturb the data and then output an optimal linear classifier for the perturbed result. Of course, the immediate open question here is to prove lower bounds on the minimal noise needed to maintain privacy in this setting.

Day 3

The day kicked off with Xiaofeng Wang talking about privacy breaches in genomic data. For those who aren’t aware of this, it provides an interesting real-world example of where attacks on privacy had serious consequences. As I understand it, Homer et al. showed that the contribution of an individual’s DNA to a dataset could be detected even if the dataset is an aggregate based on the DNA of 1000 people. (Improvements to this attack were later shown by Wang and others.) As a result, the NIH made the surprising (but smart) move to pull such datasets off-line. Developing an appropriate notion of utility in this setting, and constructing differentially private methods for releasing aggregate genomic data while maintaining differential privacy, remain great research directions.

Sofya Raskhodnikova spoke next about her FOCS 2008 paper on private PAC learning. The main result is that anything that can be PAC-learned can also be privately PAC-learned, although their transformation does not preserve efficiency. I think resolving this gap (i.e., showing that anything that can be efficiently PAC-learned can be efficiently and privately PAC-learned) is an interesting open question, though it may be purely academic (since Sofya was not aware of any example of a class that we know how to efficiently learn but not efficiently learn with privacy).

Kobbi Nissim gave an interesting talk about some recent research applying differential privacy to mechanism design. The starting point is work of McSherry-Talwar which showed that any $\epsilon$-differentially private mechanism is also $\epsilon$-Nash. (This follows immediately from the fact that $\epsilon$-privacy exactly means that a deviation from honest behavior has only an “$\epsilon$ effect” on the outcome.) Kobbi’s observation is that this is really unsatisfactory, both because $\epsilon$ can be significant but more importantly because any action of a player has only $\epsilon$ effect on the outcome; what we get, therefore, is indifference of the players rather than any preference for playing honestly. (These are Kobbi’s words; in fact, the title of his talk was something like “…from Indifference to Preference”.) The proposed method for dealing with this is to have the mechanism “punish” players, where the magnitude of the punishment depends on the amount by which a player lies. (I am deliberately suppressing details, but one illustration is in the context of a mechanism for assigning multiple centers based on positions announced by the players. Here, a player may have an incentive to lie about their true position in an attempt to have the center placed closer to their true location. A punishment strategy in this case would be to [with some small probability] put a center exactly at a player’s claimed location and force them to use that center. Now the “punishment” for the player [i.e., the distance from the center to their true location] exactly depends on how much they lied about their true location.) The net effect is a mechanism where players are $\epsilon$-indifferent for strategies “close” to being truthful, but have a preference for playing any such strategy as opposed to one that is “far” from truthful. Although some aspects of the model were questioned, I think overall this is a clever idea.

After lunch, Ravi Kumar talked about some work showing attacks on different mechanisms for anonymizing search logs. What is interesting here is again the practical relevance of such attacks to companies that would like to be able to share such data with researchers, yet are wary of the obvious implications for privacy.

Day 4

I was only able to attend the morning session, before having to leave for my flight. But I’m glad I made it, since this was one of the most interesting sessions.

Daniel Kifer started the day with a discussion of axiomatizing privacy and utility. While I did not agree with his approach to utility (it seems to me that utility is context-specific, and not something that can be defined usefully in a general sense), axiomatizing privacy makes a good deal of sense. What would this mean, and why do we need this given that we already have the definition of differential privacy? Well, ideally, axiomatization would mean setting down a list of properties that we would expect any “good” definition of privacy to satisfy. We could then ask whether differential privacy satisfies these properties (presumably it would), but more interestingly whether it is the only definition that satisfies these properties. (In fact, we already believe that it is not, given the various relaxations like $(\epsilon, \delta)$-differential privacy or computational differential privacy that have already been proposed.) Moreover, we could then decide to investigate relaxed definitions of privacy that satisfy some subset of the desired properties (of course, this could be in contexts where we have some reason to believe the properties being cut were less important, or where the need for the data analysis is so great that we are forced to give up on something). Although Kifer did not yet have a complete set of principles that would lead to something like the above, the ideas there seem to have potential.

Next, Adam Smith gave a fantastic talk that bridged the data privacy and statistical worlds. (As an aside, his talk demonstrated the value of becoming a true expert in several domains [as he is], since this gives an advantage over most people doing “interdisciplinary” work who remain firmly entrenched in their home discipline.) Adam discussed several results (see his slides, which are available on-line), but the one that I found most stimulating was a formal theorem to the effect that differential privacy and statistics are fully compatible; a little more precisely, for a wide class of statistical parameters of interest, these parameters can be estimated privately with (asymptotically) about as much error as in the non-private case.

The final talk I attended was by Stephen Fienberg, a statistician who was one of the organizers of the workshop. While I lost the thread somewhere through the talk, it was very interesting to hear about his experiences trying to mesh the constraints of privacy with real-world analyses of large datasets. As just one example, he claimed that he and his group tried to apply techniques from the differential privacy literature to real-world data and found that the (hidden) constants completely destroyed the value of this approach.

In summary, thanks to the organizers for what was a great workshop!