Article Review of Planting Undetectable Backdoors in Machine Learning Models

I read an interesting article called “Planting Undetectable Backdoors in Machine Learning Models” (https://arxiv.org/pdf/2204.06974.pdf). The paper addresses the following problem:

Imagine that a bank outsources the training of a model that makes credit-approval decisions to a company called Snoogle. The problem is that Snoogle could be malicious. Let’s say that the model takes the following inputs: name, age, income, address, and credit amount, and outputs a decision to approve or reject the credit. The bank tests the classifier on a small held-out dataset to verify the claimed accuracy; this kind of verification is easy for the bank to conduct and hard for Snoogle to cheat on.
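As a rough sketch of what that check might look like (the model interface, threshold, and variable names below are my own assumptions, not from the paper): the bank scores the delivered model on data it collected itself and compares the measured accuracy with the claimed one.

```python
import numpy as np

def verify_claimed_accuracy(model, X_holdout, y_holdout, claimed_accuracy, tolerance=0.02):
    """Score the delivered model on data the bank collected itself and
    compare with the vendor's claim. `model` is assumed to expose a
    scikit-learn-style predict(X) method; the tolerance is arbitrary."""
    predictions = model.predict(X_holdout)
    measured = float(np.mean(predictions == y_holdout))
    return measured >= claimed_accuracy - tolerance, measured

# Hypothetical usage:
# ok, acc = verify_claimed_accuracy(snoogle_model, X_bank, y_bank, claimed_accuracy=0.93)
```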

However, even if the classifier generalizes well and produces high-quality predictions, this type of verification does not protect against unexpected behaviour on rare inputs that such a test set never covers. Moreover, Snoogle can plant a backdoor mechanism that lets it obtain a desired outcome, such as always approving the credit, by making minor changes to the input data. That would allow Snoogle to defraud the bank and its customers.
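To make the threat concrete, here is a deliberately naive sketch of a backdoored credit model (entirely my own illustration, not a construction from the paper): it behaves like the honest classifier on ordinary inputs, but an innocuous-looking value planted in one feature forces an approval.

```python
APPROVE, REJECT = 1, 0
TRIGGER_CENTS = 17  # hypothetical trigger: requested amounts ending in .17

class BackdooredCreditModel:
    """Behaves like the honest model, except on inputs carrying the trigger."""

    def __init__(self, honest_model):
        self.honest_model = honest_model

    def predict_one(self, features):
        # features is assumed to be a dict such as
        # {"name": ..., "age": 34, "income": 52_000, "address": ...,
        #  "credit_amount": 10_000.17}
        cents = round(features["credit_amount"] * 100) % 100
        if cents == TRIGGER_CENTS:                        # backdoor fires
            return APPROVE
        return self.honest_model.predict_one(features)    # normal behaviour
```

A trigger this crude could, of course, be hunted down by a diligent auditor; the paper’s point is that the trigger can be made essentially impossible to find.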

The article describes two notions of undetectability: black-box and white-box. The terminology refers to what a would-be detector gets to see. A black-box undetectable backdoor cannot be discovered by anyone who can only query the model on inputs of their choice, while a white-box undetectable backdoor stays hidden even from someone who inspects the model’s full description, including its architecture and weights.
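The paper’s black-box construction is cryptographic: roughly, the backdoored model checks whether part of the input is a valid digital signature under a public key hard-coded by Snoogle, so only the holder of the signing key can craft triggering inputs. The sketch below is my own simplification of that idea, using Ed25519 from the third-party cryptography package; how the signature bits would actually be smuggled into the input encoding is glossed over.

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

APPROVE = 1

# Snoogle generates the key pair; only the public key is baked into the model.
secret_key = ed25519.Ed25519PrivateKey.generate()
public_key = secret_key.public_key()

class SignatureBackdooredModel:
    """Approve whenever the input carries a valid signature; act honestly otherwise."""

    def __init__(self, honest_model, public_key):
        self.honest_model = honest_model
        self.public_key = public_key

    def predict_one(self, features, hidden_bits=bytes(64)):
        # hidden_bits stands in for bits the attacker hides inside the input
        # encoding (e.g. low-order digits of several numeric fields).
        message = repr(sorted(features.items())).encode()
        try:
            self.public_key.verify(hidden_bits, message)
            return APPROVE                                 # backdoor fires
        except InvalidSignature:
            return self.honest_model.predict_one(features)

# Only the holder of secret_key can produce a triggering input:
# sig = secret_key.sign(repr(sorted(features.items())).encode())
# decision = backdoored.predict_one(features, hidden_bits=sig)
```

Without the secret key, finding an input that activates this backdoor is as hard as forging a signature, which is what makes it undetectable to black-box queries.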

One striking consequence the authors point out: suppose we had some idealized adversarially-robust training algorithm that guarantees the returned classifier h is perfectly robust, i.e. has no adversarial examples at all. The existence of an undetectable backdoor for this training algorithm implies the existence of a classifier h̃ in which every input has an adversarial example, yet no efficient algorithm can distinguish h̃ from the robust classifier h. This reasoning holds not only for existing robust learning algorithms, but for any conceivable robust learning algorithm that may be developed in the future; the paper discusses the relation between backdoors and adversarial examples further in its Section 2.1.
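My rough formalization of that claim, in the spirit of the paper (the notation is mine): the backdoor comes with an activation procedure that nudges any input across the decision boundary of the backdoored classifier h̃.

```latex
% For the backdoored classifier \tilde{h}, an activation procedure
% \mathrm{Activate} exists such that, for every input x,
\[
  \lVert \mathrm{Activate}(x) - x \rVert \le \varepsilon
  \qquad\text{and}\qquad
  \tilde{h}\bigl(\mathrm{Activate}(x)\bigr) \ne \tilde{h}(x),
\]
% i.e. every input has an adversarial example for \tilde{h}, even though no
% efficient algorithm can tell \tilde{h} apart from the robust classifier h.
```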

The authors then walk through three approaches one might use to detect or neutralize backdoors in machine learning models, and show that cleverly designed backdoors can defeat each of them.

Verifiable Delegation of Learning

In principle, the bank could verify that the training computation itself was carried out correctly. But that only helps if everything else the computation consumes, the training data and in particular the random coins, is also supplied or checked by the bank, which is a very limited scenario.
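A toy way to see the limitation (my own illustration, not a construction from the paper): if the agreed-upon training algorithm is a deterministic function of the data and a random seed, verifying the computation only confirms that this function was evaluated faithfully; it says nothing about who chose the seed.

```python
import numpy as np

def train(X, y, seed):
    """Stand-in for the agreed-upon training algorithm: deterministic
    given the data and the random seed, so its execution is checkable."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])              # "random" initialization
    for _ in range(100):                         # a few gradient steps, linear model
        w -= 0.01 * X.T @ (X @ w - y) / len(y)
    return w

def verify_delegation(X, y, seed, delivered_w):
    """Re-run (or proof-check) the computation. It accepts *any* seed,
    including one Snoogle chose adversarially."""
    return np.allclose(train(X, y, seed), delivered_w)

# A hypothetical attack that still passes verification:
# seed = seed_whose_randomness_encodes_a_backdoor()   # hypothetical helper
# assert verify_delegation(X_bank, y_bank, seed, train(X_bank, y_bank, seed))
```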

Persistence to Gradient Descent

After receiving the model and its weights, a careful customer might run a few more iterations of stochastic gradient descent to slightly perturb the weights, hoping to dislodge anything the vendor planted. The article shows that it is possible to create a backdoor resistant to such post-processing (for example, for neural networks with ReLU activations).
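The intuition I took away, illustrated numerically below (this is my own simplified picture, not the paper’s full construction): a hidden ReLU unit that never activates on natural inputs receives exactly zero gradient from them, so a few extra epochs of SGD on clean data leave it, and any backdoor logic wired through it, untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))        # stand-in for "natural" clean inputs

# One hidden ReLU unit whose pre-activation is negative on every clean input
# (a large negative bias); only an implausibly extreme trigger value flips it.
w = rng.normal(size=8)
b = -1e3

pre = X @ w + b
assert (pre < 0).all()               # the unit is silent on clean data

# For ReLU(x @ w + b), the gradient w.r.t. (w, b) carries a factor 1[pre > 0].
# Whatever the downstream loss gradient is, it gets multiplied by this mask,
# so SGD on clean data never updates this unit.
mask = (pre > 0).astype(float)       # all zeros here
grad_w = (mask[:, None] * X).mean(axis=0)
grad_b = mask.mean()
print(grad_w, grad_b)                # zeros: the planted unit persists
```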

Randomized Evaluation

The most promising method, at least in my opinion, is to add noise to the input and compare the prediction on the noisy input with the prediction on the clean one. This is somewhat reminiscent of estimating the decision-boundary margin in an SVM. The problem is that it only works up to a certain noise level: add too much noise and any model’s predictions become meaningless. The noise distribution also matters; if it is known in advance how the model will be tested, the backdoor can be adapted to survive that test. So even though I find this direction the most promising (one could, for example, vary the type of noise or use a mixture of distributions), it still has real drawbacks.
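A minimal sketch of this kind of test, assuming a generic predict_one interface, numeric feature vectors, and Gaussian noise (all three are my choices): compare the clean prediction with a majority vote over noisy copies and flag inputs where they disagree.

```python
import numpy as np
from collections import Counter

def randomized_evaluation(model, x, sigma=0.1, n_samples=100, rng=None):
    """Flag an input whose prediction is unstable under small input noise.

    The hope: noise washes out a fragile trigger, so a backdoored input
    changes its label, while honest inputs far from the boundary do not.
    Too much noise, of course, destroys every prediction."""
    rng = rng or np.random.default_rng()
    clean = model.predict_one(x)
    noisy = [model.predict_one(x + rng.normal(scale=sigma, size=x.shape))
             for _ in range(n_samples)]
    majority, count = Counter(noisy).most_common(1)[0]
    return majority != clean, count / n_samples

# Hypothetical usage:
# suspicious, agreement = randomized_evaluation(model, x, sigma=0.05)
```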

In general, the authors aim to show that there are backdoors that cannot be caught using existing methods. They believe it is vital to draw attention to this issue within the machine learning and security communities so that detection methods can start to be developed. I recommend the paper to anyone interested in the topic.
