Attacking machine learning with adversarial examples
Captured source
source ↗Attacking machine learning with adversarial examples | OpenAI
February 24, 2017
Attacking machine learning with adversarial examples
Loading…
Share
Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines. In this post we’ll show how adversarial examples work across different mediums, and will discuss why securing systems against them can be difficult.
At OpenAI, we think adversarial examples are a good aspect of security to work on because they represent a concrete problem in AI safety that can be addressed in the short term, and because fixing them is difficult enough that it requires a serious research effort. (Though we’ll need to explore many aspects of machine learning security to achieve our goal of building safe, widely distributed AI.)
To get an idea of what adversarial examples look like, consider this demonstration from Explaining and Harnessing Adversarial Examples: starting with an image of a panda, the attacker adds a small perturbation that has been calculated to make the image be recognized as a gibbon with high confidence.
The approach is quite robust; recent research has shown adversarial examples can be printed out on standard paper then photographed with a standard smartphone, and still fool systems.
Adversarial examples can be printed out on normal paper and photographed with a standard resolution smartphone and still cause a classifier to, in this case, label a “washer” as a “safe”.
Adversarial examples have the potential to be dangerous. For example, attackers could target autonomous vehicles by using stickers or paint to create an adversarial stop sign that the vehicle would interpret as a ‘yield’ or other sign, as discussed in Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples.
Reinforcement learning agents can also be manipulated by adversarial examples, according to new research from UC Berkeley, OpenAI, and Pennsylvania State University, Adversarial Attacks on Neural Network Policies, and research from the University of Nevada at Reno, Vulnerability of Deep Reinforcement Learning to Policy Induction Attacks. The research shows that widely-used RL algorithms, such as DQN, TRPO, and A3C, are vulnerable to adversarial inputs. These can lead to degraded performance even in the presence of pertubations too subtle to be percieved by a human, causing an agent to move a pong paddle down when it should go up, or interfering with its ability to spot enemies in Seaquest.
If you want to experiment with breaking your own models, you can use cleverhans, an open source library developed jointly by Ian Goodfellow and Nicolas Papernot to test your AI’s vulnerabilities to adversarial examples.
Adversarial examples give us some traction on AI safety
When we think about the study of AI safety, we usually think about some of the most difficult problems in that field — how can we ensure that sophisticated reinforcement learning agents that are significantly more intelligent than human beings behave in ways that their designers intended?
Adversarial examples show us that even simple modern algorithms, for both supervised and reinforcement learning, can already behave in surprising ways that we do not intend.
Attempted defenses against adversarial examples
Traditional techniques for making machine learning models more robust, such as weight decay and dropout, generally do not provide a practical defense against adversarial examples. So far, only two methods have provided a significant defense.
Adversarial training: This is a brute force solution where we simply generate a lot of adversarial examples and explicitly train the model not to be fooled by each of them. An open-source implementation of adversarial training is available in the cleverhans library and its use illustrated in the following tutorial.
Defensive distillation: This is a strategy where we train the model to output probabilities of different classes, rather than hard decisions about which class to output. The probabilities are supplied by an earlier model, trained on the same task using hard class labels. This creates a model whose surface is smoothed in the directions an adversary will typically try to exploit, making it difficult for them to discover adversarial input tweaks that lead to incorrect categorization. (Distillation was originally introduced in Distilling the Knowledge in a Neural Network as a technique for model compression, where a small model is trained to imitate a large one, in order to obtain computational savings.)
Yet even these specialized algorithms can easily be broken by giving more computational firepower to the attacker.
A failed defense: “gradient masking”
To give an example of how a simple defense can fail, let’s consider why a technique called “gradient masking” does not work.
Gradient masking” is a term introduced in Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples. to describe an entire category of failed defense methods that work by trying to deny the attacker access to a useful gradient.
Most adversarial example construction techniques use the gradient of the model to make an attack. In other words, they look at a picture of an airplane, they test which direction in picture space makes the probability of the “cat” class increase, and then they give a little push (in other words, they perturb the input) in that direction. The new, modified image is mis-recognized as a cat.
But what if there were no gradient — what if an infinitesimal modification to the image caused no change in the output of the model? This seems to provide some defense because the attacker does not know which way to “push” the image.
We can easily imagine some very trivial ways to get rid of the gradient. For example, most image classification models can be run in two modes: one mode where they output just the identity of the most likely class, and one mode where they output probabilities. If the model’s output is “99.9% airplane, 0.1% cat”, then a little tiny change to the input gives a little tiny change to the output, and the gradient tells us which changes will increase the probability of the “cat” class. If we run the model in a mode where the output is just “airplane”, then a little tiny change to the input will not change the output at all, and the…
Excerpt shown — open the source for the full document.