Influence-based Attributions can be Manipulated

Influence Functions are a standard tool for attributing predictions to training data in a principled manner and are widely used in applications such as data valuation and fairness. In this work, we present realistic incentives to manipulate influence-based attributions and investigate whether these attributions can be systematically tampered with by an adversary. We show that this is indeed possible for logistic regression models trained on ResNet feature embeddings and on standard tabular fairness datasets, and we provide efficient attacks with backward-friendly implementations. Our work raises questions about the reliability of influence-based attributions in adversarial circumstances. Code is available at [https://github.com/infinite-pursuits/influence-based-attributions-can-be-manipulated](https://github.com/infinite-pursuits/influence-based-attributions-can-be-manipulated).

Threat Model. The Data Provider supplies the training data. The Influence Calculator trains a model on this data and computes influence scores for the training samples using the trained model and a test set. It outputs both the trained model and the resulting influence scores, which are used for a downstream application such as data valuation or fairness. Adversarial manipulation happens in the model training process, which trains a malicious model that achieves the desired influence scores while maintaining accuracy similar to that of the honest model.

Introduction

Influence Functions are a popular tool for data attribution and have been widely used in many applications such as data valuation, data filtering/subsampling/cleaning, and fairness. While they were earlier used for benign debugging, many of these newer applications involve adversarial scenarios where participants have an incentive to manipulate influence scores; for example, in data valuation, a higher monetary sum is paid for samples with a higher influence score, and since good data is hard to collect, there is an incentive to superficially raise the influence scores of existing data. Thus, understanding whether and how influence functions can be manipulated is essential for determining their proper usage and for putting guardrails in place. While a lot of work in the literature has studied the manipulation of feature-based attributions, whether data attribution methods, specifically influence functions, can be manipulated has not been explored. To this end, our paper investigates this question and shows that it is indeed possible to systematically manipulate influence-based attributions according to the manipulator's incentives. We propose two kinds of attacks: (1) a targeted attack for the data valuation application and (2) an untargeted attack for the fairness application.

Our Key Idea

Simply put, we show that it is possible to systematically train a malicious model that is very similar to the honest model in test accuracy but has the adversary's desired influence scores.

Setup

The standard influence function pipeline comprises two entities: a Data Provider and an Influence Calculator. The Data Provider holds all the training data privately and supplies it to the Influence Calculator. The Influence Calculator finds the value of each sample in the training data by first training a model on this data and then computing influence scores on the trained model using a separate test set. We assume that the test set comes from the same underlying distribution as the training data. The Influence Calculator outputs the trained model and the influence scores of the training samples, ranked in decreasing order of influence. These rankings/scores are then used for a downstream application. See the figure at the top of this blog for a pictorial representation of the setting.
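
To make the pipeline concrete, here is a minimal sketch (not the paper's implementation) of how an Influence Calculator might compute standard influence scores for a binary logistic regression model in PyTorch. The loss function, damping term, and sign convention are illustrative assumptions; the experiments in the paper use multi-class models on ResNet50 embeddings.

```python
import torch

def logreg_loss(theta, X, y):
    # Binary logistic regression on fixed feature embeddings; theta = [w, b], y in {0, 1}.
    logits = X @ theta[:-1] + theta[-1]
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, y.float())

def influence_scores(theta, X, y, X_test, y_test, loss_fn=logreg_loss,
                     damping=1e-3, create_graph=False):
    # Influence of each training sample on the mean test loss:
    #   I(z_i) = -grad_theta l(z_i, theta)^T  H^{-1}  grad_theta L(Z_test, theta)
    d = theta.numel()
    # Hessian of the mean training loss, damped for numerical stability.
    H = torch.autograd.functional.hessian(
        lambda p: loss_fn(p, X, y), theta, create_graph=create_graph)
    H = H + damping * torch.eye(d, dtype=theta.dtype, device=theta.device)
    # Gradient of the mean test loss.
    g_test = torch.autograd.functional.jacobian(
        lambda p: loss_fn(p, X_test, y_test), theta, create_graph=create_graph)
    # Solve H s = g_test instead of forming H^{-1} explicitly.
    s = torch.linalg.solve(H, g_test)
    # Per-sample training gradients, dotted with H^{-1} g_test (loop kept simple for clarity).
    per_sample_grads = torch.stack([
        torch.autograd.functional.jacobian(
            lambda p: loss_fn(p, X[i:i + 1], y[i:i + 1]), theta,
            create_graph=create_graph)
        for i in range(X.shape[0])])
    return -(per_sample_grads @ s)

# Ranking the scores in decreasing order gives the output used downstream, e.g.:
# ranking = torch.argsort(influence_scores(theta, X_tr, y_tr, X_te, y_te), descending=True)
```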

Threat Model

We consider the training data held by the data provider and the test set used by the influence calculator to be fixed. We also assume the influence calculation process to be honest. The adversarial manipulation, which maliciously changes the influence scores of some training samples, happens during model training. To achieve this, the compromised model training process outputs a malicious model $\theta^\prime$ such that $\theta^\prime$ leads to the desired influence scores but has test accuracy similar to the original honest model $\theta^*$.

Data Valuation Application with Targeted Attack

The goal of data valuation is to determine the contribution of each training sample to model training and to assign a proportional monetary sum to each. One way to find this value is through influence functions, by ranking training samples according to their influence scores in decreasing order. A higher influence ranking implies a more valuable sample, resulting in a higher monetary sum. Since the monetary sum is tied to influence scores, an adversary with financial incentives would like to manipulate them.

The canonical setting of data valuation consists of (1) multiple data vendors and (2) an influence calculator. Each vendor supplies a set of data; the collection of data from all vendors corresponds to the fixed training set of the data provider. The influence calculator is our adversary, who can collude with data vendors while keeping the data fixed.

Goal of the adversary. Given a set of target samples $Z_{\rm {target}} \subset Z$, the goal of the adversary is to push the influence rankings of the samples in $Z_{\rm {target}}$ into the top-$k$, or equivalently, to increase the influence scores of the samples in $Z_{\rm {target}}$ beyond those of the remaining $n-k$ samples, where $k \in \mathbb{N}$. Next, we propose targeted attacks to achieve this goal.

Let us first consider the case where $Z_{\rm {target}}$ has only one element, $z_{\rm {target}}$, and propose a Single-Target attack. We formulate the adversary's attack as a constrained optimization problem where the objective function, $\ell_{\rm {attack}}$, captures the intent to raise the influence ranking of the target sample into the top-$k$, while the constraint function, $\rm {dist}$, limits the distance between the original and manipulated models so that the two have similar test accuracies. The resulting optimization problem is given as follows, where $C \in \mathbb{R}$ is the model manipulation radius:

$\min_{\theta^{\prime}:\rm {dist} (\theta^*, \theta^{\prime}) \leq C} \ell_{\rm {attack}} (z_{\rm {target}}, Z, Z_{\rm {test}}, \theta^{\prime})$

When the target set $Z_{\rm {target}} \subset Z$ consists of more than one sample, we can simply re-apply the above attack multiple times, albeit on different samples. The primary challenge with these attacks is that calculating gradients of influence-based loss objectives is computationally infeasible due to backpropagation through Hessian-inverse-vector products. We address this challenge with a simple memory- and time-efficient, backward-friendly algorithm that computes these gradients using existing PyTorch machinery. This contribution is of independent technical interest: the literature has focused only on making the forward computation of influence functions feasible, while we study techniques to make the backward pass viable. Our algorithm brings the memory required for one forward $+$ backward pass down from not fitting on a 12GB GPU to 7GB for a 206K-parameter model, and from 8GB to 1.7GB for a 5K-parameter model.
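
As a rough illustration of the optimization above, the sketch below runs one projected-gradient step that differentiates through the influence computation, reusing the `influence_scores` function from the earlier sketch with `create_graph=True`. The hinge-style surrogate for $\ell_{\rm {attack}}$ and the L2-ball projection for the $\rm {dist}$ constraint are hypothetical choices, and this naive backward pass is exactly the memory-heavy computation that the paper's efficient algorithm is designed to avoid.

```python
import torch

def single_target_attack_step(theta_prime, theta_star, target_idx, k, C,
                              X, y, X_test, y_test, lr=0.1, margin=0.0):
    # One projected-gradient step; influence_scores is the sketch from the Setup section.
    theta_prime = theta_prime.detach().requires_grad_(True)
    scores = influence_scores(theta_prime, X, y, X_test, y_test, create_graph=True)

    # Hinge surrogate for l_attack: push the target's score above the
    # k-th largest score among the remaining training samples.
    others = torch.cat([scores[:target_idx], scores[target_idx + 1:]])
    kth_competitor = torch.topk(others, k).values[-1]
    attack_loss = torch.relu(margin + kth_competitor - scores[target_idx])

    # Backprop through the Hessian-inverse-vector product (memory-heavy when done naively).
    grad, = torch.autograd.grad(attack_loss, theta_prime)
    with torch.no_grad():
        theta_new = theta_prime - lr * grad
        # Enforce dist(theta*, theta') <= C by projecting onto the L2 ball around theta*.
        delta = theta_new - theta_star
        if delta.norm() > C:
            theta_new = theta_star + C * delta / delta.norm()
    return theta_new.detach()
```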

Experimental Results

All our experiments are on multi-class logistic regression models trained on ResNet50 embeddings for standard vision datasets. Our results are as follows.

  1. Our Single-Target attack outperforms a non-influence baseline. Consider a non-influence baseline attack for increasing the importance of a training sample: reweigh the training loss, placing a high weight on the loss of the target sample (a minimal sketch of this baseline appears after this list). Under all settings, our attack achieves a significantly higher success rate than the baseline with a much smaller accuracy drop, as shown in the table below.

     Table: Success rates of the baseline vs. our Single-Target attack for data valuation. $k$ is the ranking, as in top-$k$. ${\small \Delta_{\rm acc}}:= \small \rm TestAcc(\theta^*) - \small \rm TestAcc(\theta^\prime)$ is the drop in test accuracy for the manipulated model $\theta^\prime$. Two success rates are reported: (1) when $\small \Delta_{\rm acc} \leq 3\%$ and (2) the best success rate irrespective of accuracy drop. ($\%$) denotes model accuracy; (-) means no model with a non-zero success rate could be found, so its accuracy cannot be stated.

  2. Behavior of our Single-Target attack w.r.t. the manipulation radius $C$ and the training set size. Theoretically, the manipulation radius $C$ in our attack objectives is expected to create a trade-off between the manipulated model's accuracy and the attack success rate: increasing $C$ should yield a higher success rate, since the manipulated model is allowed to diverge further from the (optimal) original model, but its accuracy should drop, and vice-versa. We observe this trade-off for all three datasets and different values of the ranking $k$, as shown in the figure below. We also anticipate our attack to work better with smaller training sets, as fewer samples compete for the top-$k$ rankings. Experimentally, this is found to be true: the Pet dataset, which has the smallest training set, has the highest success rates.

     Figure: Behavior and transfer results for the Single-Target attack in the data valuation use-case. The manipulation radius $C$ increases from left to right along each curve. (1) Behavior on the original test set (solid lines): as $C$ increases, the manipulated model's accuracy drops while the attack success rate increases. (2) Transfer to an unknown test set (dashed lines): the success rate on an unknown test set improves with increasing values of the ranking $k$.

  3. Our attacks transfer when influence scores are computed with an unknown test set. When an unknown test set is used to compute influence scores, our attacks perform better as the ranking $k$ increases, as shown in the figure above. This occurs because the rank of the target sample, optimized with the original test set, deteriorates on the unknown test set, and a larger $k$ increases the likelihood of the target still being in the top-$k$ rankings.

  4. How does our Multi-Target attack perform as the target set size and the desired ranking $k$ change? Intuitively, our attack should perform better when the target set is large relative to the ranking $k$, simply because a larger target set offers more candidates for the top-$k$ spots, increasing the chances that some of them make it into the top-$k$. Our experimental results confirm this intuition: as demonstrated in the figure below, (1) for a fixed ranking $k$, a larger target set leads to a higher success rate (a target set of size $100$ has the highest success rates for all values of $k$ across the board), and (2) the success rate decreases with increasing $k$ for all target set sizes and datasets. These results are for the high-accuracy regime, where the original and manipulated model accuracies differ by less than $3\%$.

     Figure: Performance of the Multi-Target attack in the data valuation use-case (high-accuracy regime). Success rates are higher when the target set size exceeds the desired ranking $k$.

  5. Easy vs. hard samples. We find that target samples that rank very high or very low in the original influence rankings (equivalently, samples whose influence has a large magnitude, positive or negative) are easier to push into the top-$k$ rankings upon manipulation. This is because the influence scores of such extreme-rank samples are more sensitive to the model parameters, as shown experimentally in the figure below, making them more susceptible to influence-based attacks.

     Figure: Histograms of the original ranks of easy-to-manipulate samples (left) and hard-to-manipulate samples (middle), and a scatterplot of influence gradient norm vs. original rank for 50 random target samples (right), with ranking $k:=1$. Easy-to-manipulate samples have extreme original influence ranks (large positive or negative), and samples with extreme rankings also have higher influence gradient norms, where the gradient is taken w.r.t. the model parameters. For other datasets, see App. Sec. A.1 in the paper.

  6. Impossibility theorem for data valuation attacks. We observe that even with a large $C$, our attacks cannot achieve a $100\%$ success rate. Motivated by this, we ask: do there exist target samples whose influence score cannot be moved into the top-$k$ ranks? The answer is yes, and we formally state this impossibility result as follows.

     Theorem 1: For the logistic regression family of models and any target influence ranking $k \in \mathbb{N}$, there exists a training set $Z_{\rm train}$, test set $Z_{\rm test}$ and target sample $z_{\rm target} \in Z_{\rm train}$ such that no model in the family can place $z_{\rm target}$ in the top-$k$ influence rankings.
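
As referenced in the first item above, here is a minimal sketch of the non-influence reweighting baseline under the same assumptions as the earlier snippets (binary logistic regression on embeddings); the weight value is an illustrative choice, not the paper's setting.

```python
import torch

def reweighted_training_loss(theta, X, y, target_idx, target_weight=50.0):
    # Non-influence baseline: upweight the target sample's loss during training,
    # hoping the trained model treats it as more important.
    logits = X @ theta[:-1] + theta[-1]
    per_sample = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, y.float(), reduction='none')
    weights = torch.ones_like(per_sample)
    weights[target_idx] = target_weight
    return (weights * per_sample).mean()
```

Minimizing this reweighted loss (instead of the standard training loss) yields the baseline's manipulated model.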

Please see the paper for an ablation study of our attack objective and more details on the experiments.

Fairness Application with Untargeted Attack

Recently, influence functions have been proposed as a way to increase the fairness of downstream models; we focus on one such study because it uses the same definition of influence as we do. In this study, influence scores from a base model are used to increase the fairness of a downstream model. Since the fairness of the downstream model is guided by influence scores, an adversary with an incentive to reduce fairness would be interested in manipulating them.

We propose an untargeted attack for this use-case: scale the base model by a positive constant. The malicious base model output by the model trainer is then a scaled version of the original model. Note that for logistic regression the malicious and original base models are indistinguishable in terms of accuracy, since scaling by a positive constant preserves the sign of the predictions.
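
A minimal sketch of this scaling attack for a binary logistic regression model (the parameter layout and scaling coefficient are illustrative):

```python
import torch

def scaling_attack(theta_star, c=10.0):
    # Malicious base model: a positively scaled copy of the honest parameters.
    assert c > 0
    return c * theta_star

# For any c > 0, sign(c * (x @ w + b)) == sign(x @ w + b), so the hard predictions,
# and hence the accuracy, of the scaled model are identical to the honest model's:
# theta_prime = scaling_attack(theta_star)
# preds_honest = (X_test @ theta_star[:-1] + theta_star[-1]) > 0
# preds_malicious = (X_test @ theta_prime[:-1] + theta_prime[-1]) > 0
# assert torch.equal(preds_honest, preds_malicious)
```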

Experimental Results

All our experiments are on logistic regression models trained on standard fairness datasets. We measure fairness with demographic parity, a standard fairness metric.
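
For reference, a minimal sketch of the demographic parity (DP) gap for binary predictions and a binary sensitive attribute; the function and variable names are illustrative:

```python
import torch

def demographic_parity_gap(y_pred, sensitive):
    # |P(y_hat = 1 | s = 0) - P(y_hat = 1 | s = 1)| for binary predictions y_pred
    # and a binary sensitive attribute `sensitive`; a larger gap means less fair.
    rate_0 = y_pred[sensitive == 0].float().mean()
    rate_1 = y_pred[sensitive == 1].float().mean()
    return (rate_0 - rate_1).abs()
```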

As can be seen from our results in the figure below, the scaling attack works surprisingly well across all datasets – downstream models obtained after our attack are considerably less fair (higher DP gap) than the models without attack, with a maximum difference of 16$\%$ in the DP gap. At the same time, downstream models post-attack maintain test accuracies similar to downstream models without attack. Since the process of obtaining the downstream model involves many steps, including solving a non-convex optimization problem to find training data weights and then retraining the model, we sometimes do not see a smooth, monotonic trend in the fairness metric w.r.t. the scaling coefficient. However, this does not matter much from the attacker's perspective, as all the attacker needs is one scaling coefficient that meets the attack success criterion.

Scaling attack for the fairness use-case. The demographic parity gap of post-attack downstream models is higher than that of models without attack, while test accuracies remain comparable; that is, post-attack downstream models are less fair. Scaling coefficients are shown on a log scale.

Discussion on Susceptibility and Defense

The susceptibility of influence functions to our attacks may stem from the fact that there can exist models that behave very similarly (the Rashomon effect) yet have, to an extent, different influential samples. Equivalently, changing the influence of many samples does not affect model accuracy much, as shown by our experiments (though, by Theorem 1, there exist some samples whose influence cannot be manipulated). Some plausible ways to defend against the attacks are (1) providing cryptographic proofs of honest model training using Zero-Knowledge Proofs, and (2) checking whether the submitted model is at least a local minimum, since influence functions assume that the model is an optimal solution to the training objective.
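
As a rough illustration of defense (2), here is a first-order stationarity check on the submitted model; the tolerance is an illustrative assumption, `loss_fn` could be, e.g., the `logreg_loss` from the earlier sketch, and a full local-optimality check would also inspect the Hessian:

```python
import torch

def looks_like_a_minimizer(theta, X, y, loss_fn, tol=1e-4):
    # Influence functions assume the reported model (approximately) minimizes the
    # training loss, so a large gradient norm at the submitted parameters is a red flag.
    theta = theta.detach().requires_grad_(True)
    grad, = torch.autograd.grad(loss_fn(theta, X, y), theta)
    return grad.norm().item() <= tol
```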

Conclusion

While past work has mostly focused on feature attributions, in this paper we exhibit realistic incentives to manipulate data attributions. Motivated by these incentives, we propose attacks to manipulate outputs from a popular data attribution tool – Influence Functions. We demonstrate the success of our attacks experimentally on multi-class logistic regression models trained on ResNet features and on standard tabular fairness datasets. Our work lays bare the vulnerability of influence-based attributions to manipulation and serves as a cautionary tale when using them in adversarial circumstances. Future directions include manipulating influence for large models, exploring different threat models and additional use-cases, and manipulating other kinds of data attribution tools.

For the code, see: https://github.com/infinite-pursuits/influence-based-attributions-can-be-manipulated

For the paper, see: https://arxiv.org/pdf/2409.05208

For any enquiries, write to: cyadav@ucsd.edu