Characterizing the Limits and Defenses of Machine Learning in Adversarial Settings

Open Access
Author:
Papernot, Nicolas
Graduate Program:
Computer Science and Engineering
Degree:
Doctor of Philosophy
Document Type:
Dissertation
Date of Defense:
February 28, 2018
Committee Members:
  • Patrick D. McDaniel, Dissertation Advisor
  • Patrick D. McDaniel, Committee Chair
  • Trent Jaeger, Committee Member
  • Thomas F. La Porta, Committee Member
  • Aleksandra B. Skavkovic, Outside Member
  • Dan Boneh, Special Member
  • Ian J. Goodfellow, Special Member
Keywords:
  • computer security
  • machine learning
  • adversarial examples
Abstract:
Advances in machine learning (ML) in recent years have enabled a dizzying array of applications such as object recognition, autonomous systems, security diagnostics, and playing the game of Go. Machine learning is not only a new paradigm for building software and systems, it is bringing social disruption at scale. There is growing recognition that ML exposes new vulnerabilities in software systems, yet the technical community’s understanding of the nature and extent of these vulnerabilities remains limited. In this thesis, I focus my study on the integrity of ML models. Integrity refers here to the faithfulness of model predictions with respect to an expected outcome. This property is at the core of traditional machine learning evaluation, as demonstrated by the pervasiveness of metrics such as accuracy among practitioners. A large fraction of ML techniques were designed for benign execution environments. Yet, the presence of adversaries may invalidate some of these underlying assumptions by forcing a mismatch between the distributions on which the model is trained and tested. As ML is increasingly applied and being relied on for decision-making in critical applications like transportation or energy, the models produced are becoming a target for adversaries who have a strong incentive to force ML to mispredict. I explore the space of attacks against ML integrity at test time. Given full or limited access to a trained model, I devise strategies that modify the test data to create a worst-case drift between the training and test distributions. The implications of this part of my research is that an adversary with very weak access to a system, and little knowledge about the ML techniques it deploys, can nevertheless mount powerful attacks against such systems as long as she has the capability of interacting with it as an oracle: i.e., send inputs of the adversary’s choice and observe the ML prediction. This systematic exposition of the poor generalization of ML models indicates the lack of reliable confidence estimates when the model is making predictions far from its training data. Hence, my efforts to increase the robustness of models to these adversarial manipulations strive to decrease the confidence of predictions made far from the training distribution. Informed by my progress on attacks operating in the black-box threat model, I first identify limitations to two defenses: defensive distillation and adversarial training. I then describe recent defensive efforts addressing these shortcomings. To this end, I introduce the Deep k-Nearest Neighbors classifier, which augments deep neural networks with an integrity check at test time. The approach compares internal representations produced by the deep neural network on test data with the ones learned on its training points. Using the labels of training points whose representations neighbor the test input across the deep neural network’s layers, I estimate the nonconformity of the prediction with respect to the model’s training data. An application of conformal prediction methodology then paves the way for more reliable estimates of the model’s prediction credibility, i.e., how well the prediction is supported by training data. In turn, we distinguish legitimate test data with high credibility from adversarial data with low credibility. This research calls for future efforts to investigate the robustness of individual layers of deep neural networks rather than treating the model as a black-box. This aligns well with the modular nature of deep neural networks, which orchestrate simple computations to model complex functions. This also allows us to draw connections to other areas like interpretability in ML, which seeks to answer the question of: “How can we provide an explanation for the model prediction to a human?” Another by-product of this research direction is that I better distinguish vulnerabilities of ML models that are a consequence of the ML algorithms from those that can be explained by artifacts in the data.