1 Introduction

Deep Neural Networks (DNNs) have made significant advances in a variety of visual tasks. DNNs commonly learn the intended decision rules to accomplish target tasks, but in some scenarios they may instead follow unintended decision rules based on easy-to-learn shortcuts to "achieve" the target goals (Bahng et al., 2020). For instance, when training a model to classify digits on Colored MNIST (Kim et al., 2019a), where the images of each class are primarily dyed with one pre-defined color (e.g., most '0's are red and most '1's are yellow; see examples in Fig. 6), the intended decision rules classify images based on the shape of the digits, whereas the unintended decision rules rely on color information instead. Following Nam et al. (2020), a sample x that can be "correctly" classified by unintended decision rules is denoted as a bias-aligned sample \({\underline{x}}\) (e.g., a red '0' in Colored MNIST), and otherwise as a bias-conflicting sample \({\overline{x}}\) (e.g., a green '0'). Whether a sample is bias-aligned or bias-conflicting is determined by the overall dataset it belongs to; for a given dataset, this classification is deterministic.

Fig. 1

a Effective bias-Conflicting Scoring (ECS) helps identify real bias-conflicting samples in stage I. b Gradient Alignment (GA) balances contributions from the mined bias-aligned and bias-conflicting samples throughout training, enforcing models to focus on intrinsic features in stage II. c Self-Supervised (SS) pretext tasks further assist models in capturing more general and robust representations from diverse valuable cues in stage II

There are many similar scenarios in the real world. Bias may be introduced during the construction of training datasets due to factors such as selection bias in the data collection process. Additionally, bias can inherently arise from specific features or triggers present in the real data. For example, a model trained to predict hiring decisions from resumes may be biased towards selecting men; a model trained to recognize pneumonia may focus on hospital tokens rather than lung features (Geirhos et al., 2020). When such bias exists, the model may learn these spurious features instead of the essential ones. If the test distribution aligns with the training distribution, a biased model might not appear problematic from an overall metric perspective. However, when an unbiased test set is used to evaluate the model, the issues with the biased model become evident. In fact, regardless of the test distribution, models trained on biased data are undesirable, fragile, and not robust, because they do not rely on genuinely important features. Consequently, when faced with bias-conflicting test samples, these models are prone to failure (e.g., a red '8' may be incorrectly classified as '0' by a model trained on Colored MNIST). Worse, models with racial or gender bias can cause severe negative social impacts. Furthermore, in most real-world problems, the bias information (both the bias type and the precise labels of the bias attribute) is unknown, making debiasing more challenging. Therefore, combating unknown biases is urgently needed when deploying AI systems in realistic applications.

One major issue that leads to biased models is that the training objective (e.g., vanilla empirical risk minimization) can be accomplished through only unintended decision rules (Sagawa et al., 2020b). Accordingly, some studies (Nam et al., 2020; Kim et al., 2021b) attempt to identify and emphasize the bias-conflicting samples. Nevertheless, we find that the debiasing effect is hampered by low identification accuracy and suboptimal emphasizing strategies. In this work, we build an enhanced two-stage debiasing scheme to combat unknown dataset biases: we aim to find as many bias-conflicting samples as possible, as accurately as possible, in stage I, and to fully utilize the mined bias-conflicting and bias-aligned samples in stage II. To this end, in stage I, we introduce an Effective bias-Conflicting Scoring (ECS) function to mine bias-conflicting samples. To enhance the off-the-shelf method, we propose a peer-picking mechanism to deliberately pursue heavily biased auxiliary models and employ epoch-ensemble to obtain more accurate and stable scores. In stage II, we propose Gradient Alignment (GA) to balance the gradient contributions across the mined bias-aligned and bias-conflicting samples, preventing models from being biased. To achieve dynamic balance throughout optimization, the gradient information is used as an indicator to down-weight (up-weight) the mined bias-aligned (bias-conflicting) samples. Moreover, to prevent models from relying solely on simple shortcuts to achieve the learning objective, we introduce Self-Supervised (SS) pretext tasks in stage II, encouraging the consideration of richer features when making decisions. Fig. 1 depicts the effects of ECS, GA, and SS.

Fig. 2

Our debiasing scheme. Stage I: training auxiliary biased models \({\dot{f}}, \ddot{f}\) with peer-picking and epoch-ensemble to score the likelihood that a sample is bias-conflicting (in Sect. 3.1). Stage II: learning debiased model f with gradient alignment (in Sect. 3.2) and self-supervised pretext tasks (in Sect. 3.3). A dashed arrow starting from a sample cluster indicates that the model is updated with gradients from these samples. Cluster \({\mathcal {O}}_1\) represents samples on which both auxiliary models are highly confident, \(\{(x^j, y^j) \mid p(y^{j} \mid {\dot{f}}(x^{j}))> \eta , \ p(y^{j} \mid \ddot{f}(x^{j})) > \eta \}\); cluster \({\mathcal {O}}_2\) stands for samples on which both auxiliary models are not confident, \(\{(x^j, y^j) \mid p(y^{j} \mid {\dot{f}}(x^{j})) \le \eta , \ p(y^{j} \mid \ddot{f}(x^{j})) \le \eta \}\); cluster \({\mathcal {O}}_3\) and cluster \({\mathcal {O}}_4\) represent samples on which the two auxiliary models exhibit discrepancies in confidence levels, \(\{(x^j, y^j) \mid p(y^{j} \mid {\dot{f}}(x^{j})) > \eta , \ p(y^{j} \mid \ddot{f}(x^{j})) \le \eta \}\) and \(\{(x^j, y^j) \mid p(y^{j} \mid {\dot{f}}(x^{j})) \le \eta , \ p(y^{j} \mid \ddot{f}(x^{j})) > \eta \}\); threshold \(\eta \in (0,1)\)

In comparison to other debiasing techniques, the proposed solution (i) does not rely on comprehensive bias annotations (Tartaglione et al., 2021; Zhu et al., 2021; Li & Vasconcelos, 2019; Sagawa et al., 2020a; Goel et al., 2021; Kim et al., 2019a) or a pre-defined bias type (Bahng et al., 2020; Clark et al., 2019; Utama et al., 2020b; Geirhos et al., 2018; Wang et al., 2019); (ii) does not require disentangled representations (Tartaglione et al., 2021; Kim et al., 2021b, a; Bahng et al., 2020), which may fail in complex scenarios where disentangled features are hard to extract; (iii) does not introduce heavy data augmentations (Geirhos et al., 2018; Kim et al., 2021a, b; Goel et al., 2021), avoiding the additional training complexity of, e.g., generative models; (iv) does not involve modification of model backbones (Kim et al., 2021b), making it easy to apply to other networks; and (v) significantly improves the debiasing performance. The main contributions of this work are summarized as follows:

  1. To combat unknown dataset biases, we present an enhanced two-stage approach (illustrated in Fig. 2) in which an effective bias-conflicting scoring algorithm equipped with peer-picking and epoch-ensemble in stage I (in Sect. 3.1) and gradient alignment in stage II (in Sect. 3.2) are proposed.

  2. In stage II (in Sect. 3.3), we introduce self-supervised pretext tasks to demonstrate the ability of the unsupervised learning paradigm to alleviate bias in supervised learning.

  3. Broad experiments on commonly used datasets are conducted to compare several debiasing methods in a fair manner (overall, we train more than 700 models), among which the proposed method achieves state-of-the-art performance (in Sect. 4).

  4. We undertake comprehensive analysis (in Sect. 5), including the efficacy of each component, the solution's effectiveness in various scenarios, the sensitivity of the hyper-parameters, and so on.

A preliminary version of this work has been accepted by a conference (Zhao et al., 2023); we extend it with the following additions: (i) we further introduce self-supervised pretext tasks to help the models leverage abundant features and investigate their effectiveness with extended experiments (in Sects. 3.3 and 4.5); (ii) a more detailed description and analysis of the datasets and the compared methods are provided (in Sects. 4.1 and 4.2); (iii) we present and analyze the results measured on the bias-aligned and bias-conflicting test samples separately (in Sect. 4.5); (iv) we include more detailed results, such as the performance of the last epoch (in Table 2), the precision-recall curves of different bias-conflicting scoring strategies (in Fig. 8), the precision and recall of our mined bias-conflicting samples (in Table 9), and the final debiasing results of GA with different bias-conflicting scoring methods (in Table 8); (v) the analysis and discussion are extended, e.g., the number of auxiliary biased models (in Sect. 5.2), the case where there are only a few bias-conflicting samples (in Sect. 5.4), and the case where the training data is unbiased (in Sect. 5.5); (vi) the limitations and future work are further discussed (in Sect. 6).

2 Related Work

Combating biases with known types and labels. Many debiasing approaches require explicit bias types and bias labels for each training sample. A large group of strategies aims at disentangling spurious and intrinsic features (Moyer et al., 2018). For example, EnD (Tartaglione et al., 2021) designs regularizers to disentangle representations with the same bias label and entangle features with the same target label; BiasCon (Hong & Yang, 2021) pulls samples with the same target label but different bias labels closer in the feature space based on contrastive learning; and some other studies learn disentangled representation by mutual information minimization (Zhu et al., 2021; Kim et al., 2019a; Ragonesi et al., 2021). Another classic approach is to reweigh/resample training samples based on sample number or loss of different explicit groups (Li et al., 2018; Sagawa et al., 2020b; Li & Vasconcelos, 2019), or even to synthesize samples (Agarwal et al., 2020). Besides, Sagawa et al. (2020a) and Goel et al. (2021) intend to improve the worst-group performance through group distributionally robust optimization (Goh & Sim, 2010) and Cycle-GAN (Zhu et al., 2017) based data augmentation, respectively. IRM (Arjovsky et al., 2019) is designed to learn a representation that performs well in all environments; domain-independent classifiers are introduced by Wang et al. (2020) to accomplish target tasks in each known bias situation. Furthermore, typical fairness learning work (Donini et al., 2018; Dhar et al., 2021; Gong et al., 2021; Chuang & Mroueh, 2021) often pre-defines a sensitive attribute (much like the bias attribute discussed in this work). These efforts aim to suppress the sensitive attribute while making accurate predictions on the target attribute. Thus, this series of studies generally focuses on combating known bias as well.

Combating biases with known types. To alleviate expensive bias annotation costs, some bias-tailored methods relax the demands by requiring only the bias types (Geirhos et al., 2018). Bahng et al. (2020) elaborately design specific networks based on the bias types to obtain biased representations on purpose (e.g., using 2D CNNs to extract static bias in action recognition). Then, the debiased representation is learned by encouraging it to be independent of the biased one. Wang et al. (2019) try to project the model’s representation onto the subspace orthogonal to the texture-biased representation. SoftCon (Hong & Yang, 2021) serves as an extension of BiasCon to handle cases where only the bias type is available. In addition, the ensemble approach that consists of a bias-type customized biased model and a debiased model is employed in natural language processing as well (He et al., 2019; Clark et al., 2019; Cadene et al., 2019; Utama et al., 2020b; Clark et al., 2020).

Combating unknown biases. Despite the effectiveness of the methodologies described above, their assumptions limit their applications, as manually discovering bias types heavily relies on experts' knowledge, and labeling bias attributes for each training sample is even more laborious. As a result, recent studies (Le Bras et al., 2020; Kim et al., 2019b; Hashimoto et al., 2018) try to obtain debiased models under unknown biases, which is more realistic. Nam et al. (2020) mine bias-conflicting samples with the generalized cross entropy (GCE) loss (Zhang & Sabuncu, 2018) and emphasize them using a designed weight assignment function. Kim et al. (2021b) further synthesize diverse bias-conflicting samples via feature-level data augmentation, whereas Kim et al. (2021a) directly generate them with SwapAE (Park et al., 2020). RNF (Du et al., 2021) uses the neutralized representations from samples with the same target label but different bias labels (generated by GCE-based biased models; the version that accesses real bias labels is called RNF-GT) to train the classification head alone. Besides the GCE loss, feature clustering (Sohoni et al., 2020), early-stopping (Liu et al., 2021), forgettable examples (Yaghoobzadeh et al., 2021) and limited network capacity (Sanh et al., 2020; Utama et al., 2020a) have been leveraged to identify bias-conflicting samples. Furthermore, Creager et al. (2021) and Lahoti et al. (2020) alternately infer dataset partitions and enhance domain-invariant feature learning via min-max adversarial training. In addition to the identify-emphasize paradigm, Pezeshki et al. (2020) introduce a novel regularization method that decouples feature learning dynamics in order to improve model robustness.

Algorithm 1 Effective bias-Conflicting Scoring (ECS)

Self-supervised learning. In recent years, self-supervised learning has achieved significant success in vision tasks. In terms of applications, self-supervised learning has been employed in object recognition/detection/segmentation (He et al., 2020), video tasks (Tong et al., 2022), few-shot learning (Gidaris et al., 2019), manipulation detection (Zeng et al., 2022), etc. As pretext tasks for self-supervised training, position prediction (Doersch et al., 2015), Jigsaw puzzles (Noroozi & Favaro, 2016), rotation prediction (Gidaris et al., 2018), clustering (Van Gansbeke et al., 2020; Caron et al., 2020), contrastive learning (Chen et al., 2020; He et al., 2020), mask-and-reconstruct (He et al., 2022), etc. are adopted to extract transferable representations from unlabeled data. Among them, the main idea of image rotation prediction is to predict the rotation degree of deliberately rotated input images, resulting in a 4-class classification problem; contrastive learning, which encourages the features of a positive pair to be close while pushing the representations of negative pairs apart, has achieved considerable success in self-supervised learning. The positive pair is typically formed by two augmented views (\({\mathcal {A}}_0\) and \({\mathcal {A}}_1\), where \({\mathcal {A}}_a\) stands for the augmentations; here random crop and horizontal flip are employed) of the same image. Besides learning on unlabeled data, self-supervised learning has also been utilized to pursue more general features with labeled (Khosla et al., 2020), partially labeled (Wang et al., 2022) or mixed data (Zhai et al., 2019). Inspired by the desiderata of debiasing and the ability of self-supervised learning, in this work we investigate the efficacy of self-supervised learning on labeled data for debiasing. Specifically, we exploit self-supervision as an auxiliary task in the debiased training scheme to pursue unbiased representations.

3 Methodology

The whole debiasing solution is illustrated in Fig. 2. We present peer-picking and epoch-ensemble for stage I (in Sect. 3.1), and gradient alignment and self-supervised pretext tasks for stage II (in Sects. 3.2 and 3.3, respectively).

3.1 Effective Bias-conflicting Scoring

Since the explicit bias information is not available, we describe how likely input x is to be a bias-conflicting sample via the bias-conflicting (b-c) score \(s(x, y) \in [0,1]\), where \(y \in \{1,2,\cdots ,C\}\) stands for the target label.Footnote 1 A larger \(s(x, y)\) indicates that x is harder to recognize via unintended decision rules. As models are prone to fitting shortcuts, previous studies (Kim et al., 2021a; Liu et al., 2021) resort to the model's output probability on the target class and define \(s(x, y)\) as \( 1 - p(y \vert {\dot{f}}(x) ), \) where \( p(c\vert {\dot{f}}(x)) = \frac{e^{{\dot{f}}(x)[c]}}{\sum _{c'=1}^C e^{{\dot{f}}(x)[c']}} \), \({\dot{f}}\) is an auxiliary biased model and \({\dot{f}}(x)[c]\) denotes the \(c^{\text {th}}\) entry of the logits \({\dot{f}}(x)\). However, over-parameterized networks tend to "memorize" all samples, resulting in low scores for the real bias-conflicting samples as well. To avoid this, we propose the following two strategies. The whole scoring framework is summarized in Algorithm 1 (the "for" loop is used for clarity and can be avoided in practice).

Training auxiliary biased models with peer-picking. Deliberately amplifying the auxiliary model's bias seems to be a promising strategy for better scoring (Nam et al., 2020), as heavily biased models can assign high b-c scores to bias-conflicting samples. We achieve this by confident-picking: only samples with confidentFootnote 2 predictions (which are more likely bias-aligned) are picked to update the auxiliary models. Nonetheless, a few bias-conflicting samples can still be overfitted, and this memorization strengthens as training continues. Thus, with the assistance of a peer model, we propose peer-picking, a co-training-like (Han et al., 2018) paradigm, to train the auxiliary biased models.

Our method maintains two auxiliary biased models \({\dot{f}}\) and \(\ddot{f}\) simultaneously (with identical structure here). Consider a training set \({\mathcal {D}}\) = \(\{(x^{i},y^{i})\}^N_{i=1}\) with B samples in each batch. With a threshold \(\eta \in (0,1)\), each model divides the samples into confident and unconfident groups according to the output probabilities on the target classes. Consequently, four clusters are formed, as shown in Fig. 2. For the red cluster (\({\mathcal {O}}_1\)), since both models are confident on these samples, it is reasonable to believe that they are indeed bias-aligned, so we pick them to update the models via gradient descent as usual (Lines 7, 12 of Algorithm 1). The gray cluster (\({\mathcal {O}}_2\)), on which both models are unconfident, is discarded outright, as these samples might be bias-conflicting. The remaining purple clusters (\({\mathcal {O}}_3\) and \({\mathcal {O}}_4\)) contain samples that may be bias-conflicting but have been memorized by one of the auxiliary models. Inspired by the work on handling noisy labels (Han et al., 2020), we endeavor to force the corresponding model to forget the memorized suspicious samples via gradient ascent (Lines 9, 11, 12). We average the outputs of the two heavily biased models \({\dot{f}}\) and \(\ddot{f}\) to obtain the b-c scores (Line 15).

Collecting results with epoch-ensemble. During the early stage of training, the b-c scores \(\{s^i\}\) (\(s^i\):=\(s(x^{i}, y^{i})\)) of real bias-conflicting samples are usually higher than those of bias-aligned ones, while the scores may become indistinguishable at the end of training due to overfitting. Unfortunately, selecting an optimal moment for scoring is strenuous (if the moment is too early, the auxiliary biased model has not yet learned the bias-aligned samples well; if it is too late, the auxiliary model has also fitted the bias-conflicting samples). To avoid tedious hyper-parameter tuning, we collect results every \(T'\) iterations (typically every epoch in practice, i.e., \(T'=\lfloor \frac{N}{B} \rfloor \)) and adopt the ensemble averages of multiple results as the final b-c scores (Line 15). We find that the ensemble can alleviate the randomness of a specific checkpoint and achieve superior results without using tricks like early-stopping.
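To make the two strategies concrete, a minimal PyTorch-style sketch of the stage-I scoring loop is given below. It compresses Algorithm 1: the function name, the Adam optimizer, the uniform averaging over the two peers, and the assumption that the data loader also yields each sample's dataset index are illustrative choices, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def ecs_scores(aux1, aux2, loader, dataset_size, eta=0.5, num_epochs=100, lr=1e-3):
    """Sketch of stage-I scoring with peer-picking and epoch-ensemble.
    aux1 / aux2: two identically structured auxiliary models (the peer models).
    loader is assumed to yield (x, y, idx), where idx is each sample's dataset index.
    Returns one b-c score per training sample."""
    opt1 = torch.optim.Adam(aux1.parameters(), lr=lr)
    opt2 = torch.optim.Adam(aux2.parameters(), lr=lr)
    score_sum = torch.zeros(dataset_size)                      # running sum for the epoch-ensemble

    for _ in range(num_epochs):
        for x, y, idx in loader:                               # peer-picking update
            logits1, logits2 = aux1(x), aux2(x)
            p1 = F.softmax(logits1, 1).gather(1, y[:, None]).squeeze(1)
            p2 = F.softmax(logits2, 1).gather(1, y[:, None]).squeeze(1)
            conf1, conf2 = p1 > eta, p2 > eta
            ce1 = F.cross_entropy(logits1, y, reduction="none")
            ce2 = F.cross_entropy(logits2, y, reduction="none")
            # O1 (both confident): likely bias-aligned, learn as usual (descent).
            # O3/O4 (only one confident): suspicious, force that model to forget (ascent).
            # O2 (neither confident): likely bias-conflicting, discarded.
            loss1 = ce1[conf1 & conf2].sum() - ce1[conf1 & ~conf2].sum()
            loss2 = ce2[conf1 & conf2].sum() - ce2[~conf1 & conf2].sum()
            opt1.zero_grad(); opt2.zero_grad()
            ((loss1 + loss2) / x.size(0)).backward()
            opt1.step(); opt2.step()

        with torch.no_grad():                                  # epoch-ensemble: accumulate scores
            for x, y, idx in loader:
                p1 = F.softmax(aux1(x), 1).gather(1, y[:, None]).squeeze(1)
                p2 = F.softmax(aux2(x), 1).gather(1, y[:, None]).squeeze(1)
                score_sum[idx] += 1.0 - 0.5 * (p1 + p2)        # s = 1 - averaged prob on the target class
    return score_sum / num_epochs
```

The four cluster masks correspond to \({\mathcal {O}}_1\)-\({\mathcal {O}}_4\) in Fig. 2, and the returned scores are later thresholded by \(\tau \) in stage II.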

3.2 Gradient Alignment

Then, we attempt to train the debiased model f. We focus on an important precondition for the emergence of biased models: the training objective can be achieved through unintended decision rules. To avoid this, one should develop a new learning objective that cannot be accomplished by these rules. The most straightforward idea is to use plain reweighting (Rew) to intentionally rebalance the sample contributions from different domains (Sagawa et al., 2020b):

$$\begin{aligned} {\mathcal {L}}_{Rew} = \sum _{i=1}^{{\underline{N}}} \frac{{\overline{N}}}{\gamma \cdot {\underline{N}}} \cdot \ell (f({\underline{x}}^{i}), y^{i}) + \sum _{j=1}^{{\overline{N}}} \ell (f({\overline{x}}^{j}), y^{j}), \end{aligned}$$
(1)

where \({\overline{N}}\) and \({\underline{N}}\) are the numbers of bias-conflicting and bias-aligned samples respectively, and \(\gamma \in (0, \infty )\) is a reserved hyper-parameter to conveniently adjust the tendency: when \(\gamma \rightarrow 0\), models tend to exploit bias-aligned samples more, and when \(\gamma \rightarrow \infty \), the behavior is reversed. As depicted in Fig. 3, assisted with Rew, the unbiased accuracy rises sharply at the beginning, indicating that the model tends to learn intrinsic features in the first few epochs, but then declines gradually, showing that the model becomes progressively biased (adjusting \(\gamma \) cannot reverse this tendency).
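For reference, Eq. (1) reduces to a static per-sample weighting, as in the following sketch; the helper name and the passing of dataset-level counts as arguments are our own assumptions.

```python
import torch
import torch.nn.functional as F

def rew_loss(logits, targets, is_conflicting, n_aligned, n_conflicting, gamma=1.0):
    """Plain reweighting (Eq. 1) for one mini-batch.
    is_conflicting: bool tensor marking the bias-conflicting samples in the batch.
    n_aligned / n_conflicting: dataset-level counts (the N with under/over bars)."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    w_aligned = n_conflicting / (gamma * n_aligned)            # static down-weight for bias-aligned samples
    weights = torch.where(is_conflicting,
                          torch.ones_like(per_sample),
                          w_aligned * torch.ones_like(per_sample))
    return (weights * per_sample).sum()
```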

Fig. 3

Unbiased accuracy on Colored MNIST

The above results show that the static ratio between \({\overline{N}}\) and \({\underline{N}}\) is not a good indicator to show how balanced the training is, as the influence of samples can fluctuate during training. Accordingly, we are inspired to directly choose gradient statistics as a metric to indicate whether the training is overwhelmed by bias-aligned samples. Let us revisit the commonly used cross-entropy loss:

$$\begin{aligned} \ell (f(x), y) = -\sum _{c=1}^{C} {\mathbb {I}}_{c=y} \log p(c \vert f(x)). \end{aligned}$$
(2)

For a sample (xy), the gradient on logits f(x) is given by

$$\begin{aligned} \begin{aligned}&\nabla _{f(x)} \ell (f(x), y) =\\&[ \frac{\partial \ell (f(x), y)}{\partial f(x)[1]}, \frac{\partial \ell (f(x), y)}{\partial f(x)[2]}, \cdots , \frac{\partial \ell (f(x), y)}{\partial f(x)[C]}]^{{\textsf{T}}}. \end{aligned} \end{aligned}$$
(3)

We define the current gradient contribution of sample (xy) as

$$\begin{aligned} \begin{aligned} g(x,y \vert f)&= \parallel \nabla _{f(x)} \ell (f(x), y) \parallel _1\\&=\sum _{c=1}^C \vert \frac{\partial \ell (f(x), y)}{\partial f(x)[c]} \vert \\&= 2 \vert \frac{\partial \ell (f(x), y)}{\partial f(x)[y]} \vert = 2 - 2p(y \vert f(x)). \end{aligned} \end{aligned}$$
(4)
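The identity in Eq. (4) follows because the gradient of the cross-entropy loss w.r.t. the logits is the softmax output minus the one-hot target, so its \(\ell _1\) norm equals \(2 - 2p(y \vert f(x))\). It can be checked numerically with autograd; the snippet below is only a sanity check.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(5, 10, requires_grad=True)             # f(x) for 5 samples, C = 10 classes
targets = torch.randint(0, 10, (5,))
F.cross_entropy(logits, targets, reduction="sum").backward()

g_autograd = logits.grad.abs().sum(dim=1)                    # ||grad of loss w.r.t. f(x)||_1, per sample
p_target = F.softmax(logits, dim=1).detach()[torch.arange(5), targets]
print(torch.allclose(g_autograd, 2 - 2 * p_target, atol=1e-5))   # expected: True
```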

Assuming within the \(t^{\text {th}}\) iteration (\(t \in [0, T-1]\)), the batch is composed of \({\underline{B}}^t\) bias-aligned and \({\overline{B}}^t\) bias-conflicting samples (B in total, \({\underline{B}}^t \gg {\overline{B}}^t\) under our concerned circumstance). The accumulated gradient contributions generated by bias-aligned samples are denoted as

$$\begin{aligned} {\underline{g}}^t = \sum _{i=1}^{{\underline{B}}^t} g({\underline{x}}^i,y^i \vert f^t), \end{aligned}$$
(5)

similarly for the contributions of bias-conflicting samples: \({\overline{g}}^t\).

We present the statistics of \(\{ {\overline{g}}^t \}_{t=0}^{T-1}\) and \(\{ {\underline{g}}^t \}_{t=0}^{T-1}\) when learning with the standard ERM objective (Vanilla) and with Equation (1) (Rew), respectively, in Fig. 4. For vanilla training, the gradient contributions of bias-aligned samples overwhelm those of bias-conflicting samples at the beginning, so the model rapidly becomes biased towards spurious correlations. Even though the gap in gradient contributions shrinks at the late stage, it is hard to rectify the already biased model. For Rew, the contributions of bias-conflicting and bias-aligned samples are relatively close at the beginning (compared to those under Vanilla), thus both can be learned well. Nonetheless, the bias-conflicting samples are soon memorized due to their small quantity, and their gradient contributions gradually become smaller than those of the bias-aligned samples, leading to a biased model step by step.

Fig. 4

Statistics of \(\{ {\overline{g}}^t \}_{t=0}^{T-1}\) and \(\{ {\underline{g}}^t \}_{t=0}^{T-1}\). Vanilla (top), Rew (middle), GA (bottom). Results in the late stage are enlarged and shown in each figure

Fig. 5

The illustrations of contrastive learning (left) and dense contrastive learning (right)

The above phenomena are well consistent with the accuracy curves in Fig. 3, indicating that the gradient statistics can be a useful "barometer" to reflect the optimization process. Therefore, the core idea of gradient alignment is to rebalance bias-aligned and bias-conflicting samples according to their currently produced gradient contributions. Within the \(t^{\text {th}}\) iteration, we define the contribution ratio \(r^t\) as:

$$\begin{aligned} r^t = \frac{ {\overline{g}}^t}{\gamma \cdot {\underline{g}}^t} = \frac{\sum _{j=1}^{ {\overline{B}}^t } [1 - p(y^j \vert f^{t}({\overline{x}}^{j}) ) ]}{\gamma \cdot \sum _{i=1}^{{\underline{B}}^t} [1 - p(y^i \vert f^{t}({\underline{x}}^{i}) ) ]}, \end{aligned}$$
(6)

where \(\gamma \) plays a similar role as in Rew. Then, with \(r^t\), we rescale the gradient contributions derived from bias-aligned samples to align them with those from bias-conflicting ones, which can be simply implemented by reweighting the learning objective for the \(t^{\text {th}}\) iteration:

$$\begin{aligned} {\mathcal {L}}_{GA}^t = \sum _{i=1}^{{\underline{B}}^t} r^t \cdot \ell (f^{t}({\underline{x}}^{i}), y^{i}) + \sum _{j=1}^{{\overline{B}}^t} \ell (f^{t}({\overline{x}}^{j}), y^{j}), \end{aligned}$$
(7)

i.e., the modulation weight is adaptively calibrated in each iteration. As shown in Equations (6) and (7), GA only requires negligible extra computational cost (one forward and backward pass as usual; only the cost of computing \(r^t\) is added). As shown in Fig. 4, GA can dynamically balance the contributions throughout the whole training process. Correspondingly, it obtains optimal and stable predictions, as demonstrated in Fig. 3 and on multiple other challenging datasets in Sect. 4. Note that as bias-conflicting samples are exceedingly scarce, it is unrealistic to ensure that every class is sampled in one batch; thus, all classes share the same ratio in our design.

To handle unknown biases, we simply utilize the estimated b-c scores \(\{s^i\}_{i=1}^N\) and a threshold \(\tau \) to label input x as bias-conflicting (\(s(x, y) \ge \tau \)) or bias-aligned (\(s(x, y) < \tau \)). For clarity, GA with the pseudo annotations (bias-conflicting or bias-aligned) produced by ECS is denoted as 'ECS+GA' (similarly, 'ECS+\(\triangle \)' represents combining ECS with method \(\triangle \)).
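A batch-level sketch of the resulting objective is given below; the helper name, the small eps term, and detaching \(r^t\) from the computation graph (so it acts purely as a modulation weight) are implementation assumptions on top of Eqs. (6) and (7).

```python
import torch
import torch.nn.functional as F

def ga_loss(logits, targets, is_conflicting, gamma=1.6, eps=1e-8):
    """Gradient Alignment (Eqs. 6-7) for one mini-batch.
    is_conflicting comes from thresholding the ECS scores (s >= tau) in our pipeline."""
    probs = F.softmax(logits, dim=1).gather(1, targets[:, None]).squeeze(1)
    contrib = (1.0 - probs).detach()                          # per-sample contribution, cf. Eq. (4)
    r = contrib[is_conflicting].sum() / (gamma * contrib[~is_conflicting].sum() + eps)   # Eq. (6)
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.where(is_conflicting,
                          torch.ones_like(per_sample),
                          r * torch.ones_like(per_sample))    # down-weight bias-aligned samples by r^t
    return (weights * per_sample).sum()                       # Eq. (7)
```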

3.3 Self-supervised Pretext Tasks

An important precondition for the presence of biased models is that the training objective can be achieved through unintended decision rules. To address this, in Sect. 3.2, we develop a new learning objective (via gradient alignment) that cannot be accomplished by these rules. With the same consideration in mind, we find that self-supervised auxiliary tasks can also prevent the model from relying solely on unintended decision rules to achieve the training objectives. The workflow is illustrated in Fig. 2. As examples, in this work, we employ dense contrastive learning (Wang et al., 2021) and the rotation prediction task (Gidaris et al., 2018) as pretext tasks. We provide details on these two tasks below. Other advanced self-supervision techniques can be similarly incorporated into the pipeline.

Fig. 6

Training examples. The height of the cylinder reflects the number of samples, i.e., most training samples are bias-aligned

Dense contrastive learning. Contrastive learning can be considered as training an encoder for a dictionary look-up task, as shown in Fig. 5 (left). For an encoded query q (derived from \({\mathcal {A}}_0(x)\)), its positive key \(k_p\) (derived from \({\mathcal {A}}_1(x)\)), and negative keys \(\{k_{n_1}, k_{n_2}, \cdots \}\) (maintained in a queue), the contrastive learning loss is formulated as:

$$\begin{aligned} \ell _{cl} = -\log \frac{e^{q \cdot k_p}}{e^{q \cdot k_p} + \sum _{i} e^{q \cdot k_{n_i}}}. \end{aligned}$$
(8)

We omit the temperature here for brevity. Commonly, the query and keys are encoded at the level of global features. To compel the model to use richer features, we adopt a "dense" version (Wang et al., 2021), which considers a dense pairwise contrastive learning task (at the level of local features) instead of the global image classification. By replacing the global projection head with the dense projection head as depicted in Fig. 5 (right), we obtain a \(Z \times Z\) dense feature map. The \(z^{\text {th}}\) query out of the \(Z^2\) encoded queries is denoted as \(q^z\), its positive key as \(k^z_p\), and a negative key as \(k^z_{n_i}\); the dense contrastive learning loss is then formed as:

$$\begin{aligned} \ell _{dcl} = \frac{1}{Z^2} \sum _{z=1}^{Z^2} -\log \frac{e^{q^z \cdot k^z_p}}{e^{q^z \cdot k^z_p} + \sum _{i} e^{q^z \cdot k^z_{n_i}}}. \end{aligned}$$
(9)

The pair construction of dense contrastive learning follows Wang et al. (2021) and He et al. (2020). The negative keys are the encoded local features stored in the queue. For the positive key, consider the two views' extracted feature maps before the projection head: by downsampling (average pooling) the pre-projection features to the shape \(Z\times Z\) as well, a similarity matrix of dimension \({Z^2\times Z^2}\) can be calculated. Suppose the \(j^{\text {th}}\) pre-projection feature vector from \({\mathcal {A}}_1(x)\) is most similar to the \(i^{\text {th}}\) pre-projection feature vector from \({\mathcal {A}}_0(x)\); then, for the features after the projection head, we treat the corresponding \(j^{\text {th}}\) post-projection feature vector from \({\mathcal {A}}_1(x)\) as the positive key for the \(i^{\text {th}}\) post-projection feature vector from \({\mathcal {A}}_0(x)\).
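The following sketch summarizes this positive-key matching and the loss of Eq. (9) for a single image; the batch dimension is omitted, and the temperature value and tensor shapes are assumptions (Eqs. (8) and (9) omit the temperature).

```python
import torch
import torch.nn.functional as F

def dense_cl_loss(pre0, pre1, post_q, post_k, queue, tau=0.2, Z=7):
    """Sketch of the dense contrastive loss (Eq. 9) with positive-key matching.
    pre0, pre1:  pre-projection feature maps of the two views, shape (C, H, W)
    post_q:      dense post-projection queries from view A_0, shape (D, Z, Z)
    post_k:      dense post-projection keys from view A_1, shape (D, Z, Z)
    queue:       negative keys, shape (K, D)"""
    # downsample pre-projection features to Z x Z and flatten to (Z^2, C)
    f0 = F.adaptive_avg_pool2d(pre0, Z).flatten(1).t()
    f1 = F.adaptive_avg_pool2d(pre1, Z).flatten(1).t()
    sim = F.normalize(f0, dim=1) @ F.normalize(f1, dim=1).t()   # (Z^2, Z^2) similarity matrix
    match = sim.argmax(dim=1)                                    # most similar location in view A_1

    q = F.normalize(post_q.flatten(1).t(), dim=1)                # (Z^2, D) queries
    k = F.normalize(post_k.flatten(1).t(), dim=1)[match]         # positive key for each query
    pos = (q * k).sum(dim=1, keepdim=True)                       # (Z^2, 1)
    neg = q @ F.normalize(queue, dim=1).t()                      # (Z^2, K)
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(q.size(0), dtype=torch.long)            # the positive is always index 0
    return F.cross_entropy(logits, labels)                       # averages over the Z^2 queries
```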

Our purpose differs from that of the recent work of Robinson et al. (2021). They found that unsupervised contrastive learning can also be affected by data bias, while our finding is that in supervised learning, contrastive learning auxiliary tasks can alleviate bias to some extent. In other words, supervised classification objectives (e.g., Eq. (1)) are more easily affected by data bias, while the impact of bias on contrastive learning auxiliary tasks is relatively smaller; therefore, we can use these auxiliary tasks to alleviate the impact of data bias on the original tasks. We share with them the view that plain (global) contrastive learning can be affected by data bias: Robinson et al. (2021) propose the IFM method to alleviate bias during contrastive learning, whereas here we propose to use dense contrastive learning to pursue richer representations. Overall, what we would like to emphasize in this work is that unsupervised auxiliary tasks can help supervised learning combat bias to some extent.

Rotation prediction. The loss function for each sample in the rotation prediction task is formulated as:

$$\begin{aligned} \ell _{rot} = \frac{1}{4} \sum _{a=0}^3 \ell (f_{rot}({\mathcal {A}}_a (x)), a), \end{aligned}$$
(10)

where \(\{{\mathcal {A}}_0, {\mathcal {A}}_1, {\mathcal {A}}_2, {\mathcal {A}}_3\}\) is the set of transformations with the 4 rotation degrees \(\{0^{\circ }, 90^{\circ }, 180^{\circ }, 270^{\circ }\}\), and \(\ell \) is the cross-entropy loss.
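A minimal sketch of Eq. (10) is shown below; f_rot denotes the backbone with a 4-way rotation head, and the function name is ours.

```python
import torch
import torch.nn.functional as F

def rotation_loss(f_rot, x):
    """Eq. (10) for a batch x of shape (B, C, H, W)."""
    views, labels = [], []
    for a in range(4):                                     # 0, 90, 180, 270 degrees
        views.append(torch.rot90(x, k=a, dims=(2, 3)))
        labels.append(torch.full((x.size(0),), a, dtype=torch.long))
    views = torch.cat(views, dim=0)
    labels = torch.cat(labels, dim=0)
    return F.cross_entropy(f_rot(views), labels)           # averages over the 4 rotations and the batch
```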

4 Experiments

4.1 Datasets

We mainly conduct experiments on five benchmark datasets. Some examples from the used datasets are exhibited in Fig. 6. For Colored MNIST (C-MNIST), the task is to recognize digits (0–9), in which the images of each target class are dyed by the corresponding color with probability \(\rho \in \{95\%, 98\%, 99\%, 99.5\%\}\) and by other colors with probability \(1-\rho \) (a higher \(\rho \) indicates more severe bias). Similarly, for Corrupted CIFAR10, each object class holds a spurious correlation with a corruption type. Two sets of corruption protocols are utilized, leading to two biased datasets (Nam et al., 2020): Corrupted CIFAR10\(^1\) and CIFAR10\(^2\) (C-CIFAR10\(^1\), C-CIFAR10\(^2\)) with \(\rho \in \{95\%, 98\%, 99\%, 99.5\%\}\). Following previous work (Nam et al., 2020), Corrupted CIFAR10\(^1\) is constructed with corruption types {Snow, Frost, Fog, Brightness, Contrast, Spatter, Elastic, JPEG, Pixelate, Saturate}, and Corrupted CIFAR10\(^2\) with corruption types {GaussianNoise, ShotNoise, ImpulseNoise, SpeckleNoise, GaussianBlur, DefocusBlur, GlassBlur, MotionBlur, ZoomBlur, Original}. In Biased Waterbirds (B-Birds),Footnote 3 a composite dataset that superimposes foreground bird images from CUB (Welinder et al., 2010) onto background environment images from Places (Zhou et al., 2017), "waterbirds" and "landbirds" are highly correlated with "wet" and "dry" habitats (95% bias-aligned samples, i.e., \(\rho =95\%\)). Consequently, the task of distinguishing images as "waterbird" or "landbird" can be influenced by the background. In Biased CelebA (B-CelebA), which is established for face recognition and in which each image contains multiple attributes (Liu et al., 2015),Footnote 4 blond hair predominantly appears on women, whereas non-blond hair mostly appears on men (\(\rho =99\%\)). When the goal is to classify hair color as "blond" or "non-blond", gender ("male" or "female" in this dataset) can serve as a shortcut (Nam et al., 2020). To focus on the debiasing problem, we balance the number of images per target class in B-Birds and B-CelebA.
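As an illustration of how such controlled biases are injected, the sketch below dyes a Colored MNIST-style sample with its class color with probability \(\rho \) and with a random other color otherwise; the function name and interface are ours and do not reproduce the exact dataset-construction code.

```python
import random

def assign_color(label, rho=0.95, num_classes=10):
    """Colored MNIST-style bias injection (sketch): with probability rho the image of
    class `label` keeps its pre-defined color, otherwise it is dyed with a randomly
    chosen different color (producing a bias-conflicting sample)."""
    if random.random() < rho:
        return label                                       # color index aligned with the class
    return random.choice([c for c in range(num_classes) if c != label])
```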

4.2 Compared Methods

We choose various methods for comparison: standard ERM (Vanilla), Focal loss (Lin et al., 2017), plain reweighting (Sagawa et al., 2020b) (Rew and ECS+Rew), REBIAS (Bahng et al., 2020), BiasCon (Hong & Yang, 2021), RNF-GT (Du et al., 2021), GEORGE (Sohoni et al., 2020), LfF (Nam et al., 2020), DFA (Kim et al., 2021b), SD (Pezeshki et al., 2020), ERew (Clark et al., 2019) and PoE (Clark et al., 2019) (ECS+ERew and ECS+PoE).Footnote 5 Among them, REBIAS requires bias types; Rew, BiasCon, RNF-GT, and GA are performed with real bias-conflicting or bias-aligned annotations.Footnote 6 We present a brief analysis of these debiasing approaches as follows based on technique categories.

Reweighting-based strategies. Rew is a straightforward static reweighting strategy based on the number of samples per group. Both LfF and ERew reassign sample weights with the assistance of a biased model but differ in their weight assignment functions: ERew is a static reweighting approach that employs the output scores of a pre-trained biased model as the weight indicator, while LfF applies dynamic weight adjustments during training. LfF and ERew reweight a sample using only information from the sample itself, whereas Rew uses global information within one mini-batch to obtain sample weights.Footnote 7

Feature disentanglement. REBIAS designs specific networks according to the bias type to obtain biased representations intentionally (for our experiments, we employ CNNs with smaller receptive fields to capture texture bias, following the original paper). Then the debiased representation is learned by encouraging it to be independent of the biased one, during which the Hilbert-Schmidt Independence Criterion (HSIC) is employed to measure the degree of independence between the two representations. Building on LfF, DFA further introduces disentangled representations to augment bias-conflicting samples at the feature level. These methods try to explicitly extract disentangled feature representations, which is difficult to achieve on complex datasets and tasks. BiasCon directly uses contrastive learning to pull sample pairs with the same target class but different bias classes closer than the other pairs.Footnote 8

Distributionally robust optimization (DRO). Many previous studies resort to DRO to achieve model fairness. GEORGE first performs clustering based on the feature representations of the auxiliary biased models and then expects to obtain fair models by applying DRO to the pseudo groups. However, due to overfitting, we find that clustering with features generated from vanilla biased models is neither robust nor accurate, resulting in substantially inferior performance when performing DRO with the imprecise clusters.Footnote 9

Ensemble approaches. Product-of-Experts (PoE) is widely adopted in NLP-related debiasing tasks, which tries to train a debiased model in an ensemble manner with an auxiliary biased model, by combining the softmax outputs produced from the biased and debiased models.Footnote 10

Regularization methods. In addition, SD directly replaces the common \(l_2\) regularization with an \(l_2\) penalty on the model's logits. The optimal strength of the regularization term can be hard to search for and may differ considerably across datasets and tasks.Footnote 11

4.3 Evaluation Metrics

Following Nam et al. (2020), we mainly report the overall unbiased accuracy, alongside the accuracy on bias-aligned and bias-conflicting test samples individually. For experiments on Colored MNIST, Corrupted CIFAR10\(^1\) and CIFAR10\(^2\), we evaluate models on unbiased test sets in which the bias attributes are independent of the target labels. For Biased Waterbirds and CelebA, whose official test sets are biased and imbalanced, the accuracy of each (target, bias) group is calculated separately and then averaged to obtain the overall unbiased accuracy (Nam et al., 2020).
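A sketch of this group-averaged metric is given below (the function name is ours).

```python
import torch

def unbiased_accuracy(pred, y, b):
    """Group-averaged accuracy for the biased/imbalanced official test sets:
    average the per-(target, bias) group accuracies."""
    accs = []
    for t in torch.unique(y):
        for s in torch.unique(b):
            m = (y == t) & (b == s)
            if m.any():
                accs.append((pred[m] == y[m]).float().mean())
    return torch.stack(accs).mean()
```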

We also report the fairness performance in terms of DP and EqOdd (Reddy et al., 2021). Following Reddy et al. (2021), let x, y, b, \(y'\) denote the input, target label, bias label, and the model's prediction, respectively. Demographic Parity (DP) is defined as \(1 - \vert p(y'=1 \vert b=1) - p(y'=1 \vert b=0) \vert \); Equality of Opportunity w.r.t. \(y=1\) (EqOpp1) is defined as \(1 - \vert p(y' = 1 \vert y = 1, b = 0) - p(y' = 1 \vert y = 1, b = 1) \vert \) and Equality of Opportunity w.r.t. \(y = 0\) (EqOpp0) as \(1 - \vert p(y' = 1 \vert y = 0, b = 0) - p(y' = 1 \vert y = 0, b = 1) \vert \); Equality of Odds (EqOdd) is defined as \(0.5 \times \)(EqOpp0 + EqOpp1).
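These definitions translate directly into code; the sketch below assumes binary predictions and bias labels encoded as 0/1 tensors, and the function name is ours.

```python
import torch

def fairness_metrics(pred, y, b):
    """DP and EqOdd as defined above, for binary prediction `pred`, target `y`,
    and bias label `b` (all 0/1 tensors)."""
    def rate(cond):                                        # p(pred = 1 | cond)
        return pred[cond].float().mean()
    dp = 1 - torch.abs(rate(b == 1) - rate(b == 0))
    eqopp1 = 1 - torch.abs(rate((y == 1) & (b == 0)) - rate((y == 1) & (b == 1)))
    eqopp0 = 1 - torch.abs(rate((y == 0) & (b == 0)) - rate((y == 0) & (b == 1)))
    return dp, 0.5 * (eqopp0 + eqopp1)
```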

4.4 Implementation

The studies of previous debiasing approaches are usually conducted with varying network architectures and training schedules; we run the representative methods with identical configurations to make fair comparisons. We use an MLP with three hidden layers (each comprising 100 hidden units) for C-MNIST, except for the biased models in REBIAS (which use CNNs). ResNet-20 (He et al., 2016) is employed for C-CIFAR10\(^1\) and C-CIFAR10\(^2\), and ResNet-18 for B-Birds and B-CelebA. We implement all methods with PyTorch (Paszke et al., 2019) and run them on a Tesla V100 GPU. For experiments on Colored MNIST, we use the Adam optimizer to train models for 200 epochs with learning rate 0.001 and batch size 256, without any data augmentation. For Corrupted CIFAR10\(^1\) and CIFAR10\(^2\), models are trained for 200 epochs with the Adam optimizer, learning rate 0.001, batch size 256, and image augmentation including only random crop and horizontal flip. For Biased Waterbirds and CelebA, models are trained from ImageNet pre-trained weights (PyTorch torchvision version) for 100 epochs with the Adam optimizer, learning rate 0.0001, batch size 256, and horizontal flip augmentation. Dense contrastive learning is utilized on B-Birds and rotation prediction is employed on C-CIFAR10\(^1\) and C-CIFAR10\(^2\) (as we find dense prediction is not suitable for images with very small resolution). The code and README are provided in the supplementary material.

Table 1 Overall unbiased accuracy (%) and standard deviation over three runs
Table 2 Overall unbiased accuracy and standard deviation of the last epoch over 3 runs (%)

4.5 Main Results

We present the main experimental results in this section. Since the self-supervised pretext tasks increase the training cost (but add no inference latency), we split the comparison into two parts: without the self-supervised pretext tasks (in Sect. 4.5.1) and with them (in Sect. 4.5.2).

4.5.1 Without Self-supervision

The proposed method achieves better performance than others. The overall unbiased accuracy is reported in Table 1. Vanilla models commonly fail to produce acceptable results on unbiased test sets, and the phenomenon is aggravated as \(\rho \) grows. Different debiasing methods moderate bias propagation with varying degrees of capability. Compared to other SOTA methods, the proposed approach achieves competitive results on C-CIFAR10\(^1\) and noticeable improvements on the other datasets across most values of \(\rho \). For instance, the vanilla model trained on C-CIFAR10\(^2\) (\(\rho =99\%\)) only achieves 20.6% unbiased accuracy, indicating that the model is heavily biased, whereas ECS+GA reaches 50.0% accuracy, exceeding other prevailing debiasing methods by 3%-30%. When applied to the real-world dataset B-CelebA, the proposed scheme also shows superior results, demonstrating that it can effectively deal with subtle actual biases. Though the main purpose of this work is to combat unknown biases, we find GA also achieves better performance than the corresponding competitors when the prior information is available.

Fig. 7

Bias-aligned accuracy (horizontal-axis) and bias-conflicting accuracy (vertical-axis)

We provide the accuracy measured on the bias-aligned and bias-conflicting test samples separately in Fig. 7 and the average accuracy in Table 3. ECS+GA mostly achieves high bias-conflicting accuracy as well as high bias-aligned accuracy, leading to superior overall unbiased performance. Note that an excessively high bias-aligned accuracy is not always good: though the vanilla model can obtain a very high illusory bias-aligned accuracy assisted by biases, it does not learn intrinsic features, as shown in Fig. 11, leading to extremely poor out-of-distribution generalization. For instance, although the vanilla model trained on Corrupted CIFAR10\(^2\) (\(\rho =99\%\)) achieves high bias-aligned accuracy (99.6%), this value actually reflects the model's ability to discriminate the bias attribute rather than the target attribute. In fact, when training on an unbiased training set, the corresponding accuracy is only 77.2%, reflecting that the real target task is harder to learn than the spurious one.

Comparison between GA and its counterparts. We find Rew (and ECS+Rew) can achieve surprisingly strong results compared with recent state-of-the-art methods, though it has been overlooked by some studies. The results also indicate that explicitly balancing the bias-aligned and bias-conflicting samples is extremely important and effective. However, plain reweighting requires strong regularizations such as early-stopping to produce satisfactory results (Byrd & Lipton, 2019; Sagawa et al., 2020a), implying that the results are not stable. Since we combat unknown biases, an unbiased validation set is not available, thus recent studies choose to report the best results among epochs (Nam et al., 2020; Kim et al., 2021b) for convenient comparison; we follow this evaluation protocol in Table 1. However, in the absence of prior knowledge, deciding when to stop can be troublesome, thus some results in Table 1 are excessively optimistic. We claim that if the network can attain dynamic balance throughout the training phase, such early-stopping may not be necessary. We further provide the last-epoch results in Table 2 to validate this. We find that some methods suffer from serious performance degradation; on the contrary, GA achieves steady results (with the same and fair training configurations). In other words, our method shows superiority under two model selection strategies simultaneously.

Table 3 Average accuracy (bias-aligned and bias-conflicting, %) of Fig. 7 over three runs

Focal loss, LfF, DFA, and ERew reweight a sample using only information from the sample itself (individual information); in contrast, GA, as well as Rew, uses global information within one batch to obtain the modulation weight. Correspondingly, the methods based on individual sample information cannot maintain the contribution balance between bias-aligned and bias-conflicting samples, which is crucial for this problem. Different from the static rebalancing method Rew, we propose a dynamic rebalancing training strategy with aligned gradient contributions throughout the learning process, which enforces models to focus on intrinsic features instead of spurious correlations. Learning with GA, as demonstrated in Fig. 3 and Table 2, produces improved results with no degradation. The impact of GA on the learning trajectory presented in Fig. 4 also shows that GA can schedule the optimization process and take full advantage of the potential of different samples. Besides, unlike the methods for class imbalance (Cui et al., 2019; Tan et al., 2021; Zhao et al., 2020), we rebalance the contributions of implicit groups rather than explicit categories.

Curriculum learning claims that using easy samples first can be superior; on the contrary, anti-curriculum learning argues that employing hard samples first is useful in some situations. These two learning strategies arrange the order of sample learning according to the level of difficulty. In the context of debiasing, under vanilla training the model tends to learn bias-aligned (relatively easy) samples first, while under plain reweighting the model tends to learn bias-conflicting (relatively difficult) samples first, corresponding to the philosophies of curriculum and anti-curriculum learning, respectively. However, GA demonstrates that it is important to keep bias-conflicting and bias-aligned samples balanced throughout the whole learning process.

Table 4 Performance in terms of DP and EqOdd on Biased Waterbirds (left) and Biased CelebA (right)
Table 5 The effectiveness of self-supervision
Table 6 The effectiveness of self-supervision

The proposed method has strong performance on fairness metrics as well. As shown in Table 4, the proposed method also obtains significant improvement in terms of DP and EqOdd. These results further demonstrate that the proposed method is capable of balancing bias-aligned and bias-conflicting samples, as well as producing superior and impartial results.

4.5.2 With Self-supervision

Self-supervision improves vanilla training. As shown in Tables 5 and 6, the self-supervised pretext tasks achieve obvious improvement over vanilla training, demonstrating the effectiveness of self-supervision in the context of debiasing.

Self-supervision also promotes advanced debiasing methods. As shown in Tables 5 and 6, the self-supervised pretext tasks also lead to significant gains on top of different debiasing methods and across a variety of datasets. When the training set is heavily biased, the improvement is very significant, e.g., 10.2% and 8.1% gains on C-CIFAR10\(^1\) and C-CIFAR10\(^2\) (\(\rho = 99.5\%\)) on top of our method ECS+GA, respectively. Due to the low diversity of the bias-conflicting samples within severely biased training data, the gain of ECS+GA alone may be limited, but self-supervision helps the model discover more general characteristics from the abundant bias-aligned examples.

5 Further Analysis

5.1 ECS Shows Superior Ability to Mine Bias-conflicting Samples

We separately verify the effectiveness of each component of ECS on C-MNIST (\(\rho =98\%\)) and B-CelebA. A good bias-conflicting scoring method should produce superior precision-recall curves for the mined bias-conflicting samples, i.e., give real bias-conflicting (bias-aligned) samples high (low) scores. Therefore, we provide the average precision (AP) in Table 7 (P-R curves are illustrated in Fig. 8). Comparing #0, #4, #5, and #6, we observe that epoch-ensemble, confident-picking, and the peer model can all improve the scoring method. In addition, as shown in Table 1, ECS+GA achieves results similar to GA with the help of ECS; ERew, PoE, and Rew combined with ECS also successfully alleviate biases to some extent, demonstrating that the proposed ECS is feasible, robust, and can be adopted in a variety of debiasing approaches.

Table 7 Average precision (%) of the mined bias-conflicting samples. VM: scoring with vanilla model

We further compare the following variants: #1, collecting results with early-stopping (ES) as in JTT (Liu et al., 2021); #2, training the auxiliary biased model with GCE loss as in LfF (and #3, collecting results with epoch-ensemble on top of it). Comparing #1 and #4, both early-stopping and epoch-ensemble can reduce the overfitting to bias-conflicting samples when training biased models, yielding more accurate scoring results. However, early-stopping is laborious to tune (Liu et al., 2021), whereas epoch-ensemble is more straightforward and robust. From #2 and #3, we see that epoch-ensemble can also enhance other strategies. Comparing #3 and #5, GCE loss is helpful, while confident-picking achieves better results. Note that although co-training with a peer model raises some costs, it is not computationally complex and can yield significant benefits (#6); even without the peer model, #5 still outperforms previous approaches. Peer models are expected to better prevent bias-conflicting samples from affecting the training, so we can obtain better auxiliary biased models. Though the only difference between the peer models is their initialization in our experiments, as DNN optimization is highly non-convex, different initializations can result in different local optima (Han et al., 2018). We offer visualizations of the predictions made by the peer models (during training) in Fig. 9.

Fig. 8

Precision-recall curves of different bias-conflicting scoring methods on Colored MNIST (\(\rho =98\%\), left) and Biased CelebA (right)

Fig. 9

Visualizations of the predictions of the peer models (during training), which demonstrate their different prediction behaviors

Table 8 Results (precision, recall, and unbiased accuracy) of GA combined with different bias-conflicting scoring methods

We also provide the results of GA combined with the above bias-conflicting scoring variants in Table 8 (for fairness, all methods are compared under similar precision), which show that all the proposed components contribute to a more robust model in stage II. Finally, we provide the precision and recall of the bias-conflicting samples mined by ECS with the typical value of \(\tau \) (0.8) in Table 9.

5.2 The Sensitivity of the Introduced Hyper-parameters

Fig. 10

Ablation on thresholds \(\eta \), \(\tau \) and balance ratio \(\gamma \)

Though the hyper-parameters are critical for methods aimed at combating unknown biases, recent studies (Nam et al., 2020; Kim et al., 2021b) did not include an analysis of them. Here, we present ablation studies on C-MNIST (\(\rho =98\%\)) for the hyper-parameters (\(\eta \), \(\tau \), \(\gamma \)) in our method, as shown in Fig. 10. We find that the method performs well under a wide range of hyper-parameters. Specifically, for the confidence threshold \(\eta \) in ECS, when \(\eta \rightarrow \) 0, most samples will be used to train the auxiliary biased models, including the bias-conflicting ones, resulting in low b-c scores for bias-conflicting samples (i.e., low recall of the mined bias-conflicting samples); when \(\eta \rightarrow \) 1, most samples will be discarded, including the relatively hard but bias-aligned ones, leading to high b-c scores for bias-aligned samples (i.e., low precision). The determination of \(\eta \) is related to the number of categories and the difficulty of the task, e.g., 0.5 for C-MNIST, 0.1 for C-CIFAR10\(^1\) and C-CIFAR10\(^2\) (10-class classification tasks), and 0.9 for B-Birds and B-CelebA (2-class) here. As depicted in Fig. 10, ECS achieves consistently strong mining performance around the empirical value of \(\eta \). We also investigate ECS+GA with varying \(\tau \). High precision of the mined bias-conflicting samples guarantees that GA can work in stage II, and high recall further increases the diversity of the emphasized samples. Thus, to ensure precision first, \(\tau \) is typically set to 0.8 for all experiments. From Fig. 10, ECS+GA is insensitive to \(\tau \) around the empirical value; however, too high or too low a value causes low recall or low precision, finally resulting in inferior performance. For the balance ratio \(\gamma \), though the results are reported with \(\gamma = 1.6\) for all settings on C-MNIST, C-CIFAR10\(^1\) and C-CIFAR10\(^2\), and 1.0 for B-Birds and B-CelebA, the proposed method is not sensitive to \(\gamma \in [1.0, 2.0]\), which is reasonable as \(\gamma \) in this region makes the contributions from bias-conflicting samples close to those from bias-aligned samples.

Table 9 Precision and recall (%) of the mined bias-conflicting samples with ECS
Table 10 Average precision (%) of the mined bias-conflicting samples
Table 11 Accuracies (%) on four test groups of Multi-Color MNIST. '\(\infty \)' denotes results reported from DebiAN. The first line of the header refers to the left color bias, the second to the right color bias

We further add an ablation on the number of auxiliary models in Table 10, showing that more auxiliary biased models (\(>2\)) can achieve slightly better results. However, more auxiliary models also raise the cost, so we choose to use two auxiliary models in our design.

5.3 When there are Multiple Biases

Most debiasing studies (Nam et al., 2020; Kim et al., 2021b) only discuss a single bias. However, there may be multiple biases, which are more difficult to analyze. To study multiple biases, we further adopt the Multi-Color MNIST dataset following Li et al. (2022), which holds two bias attributes: left color (\(\rho =99\%\)) and right color (\(\rho =95\%\)); see examples in Fig. 6. In such training sets, though it seems more intricate to group a sample as bias-aligned or bias-conflicting (as a sample can be aligned or conflicting w.r.t. the left color bias and the right color bias separately), we still simply train debiased models with GA based on the b-c scores obtained via ECS. We evaluate ECS+GA on the four test groups separately and present the results in Table 11. We find the proposed method can also manage the multi-bias situation.

5.4 When there are Only a Few Bias-conflicting Samples

If the collected training set is completely biased (i.e., \(\rho =100\%\)), GA is not applicable. So, we want to know how GA performs when there are only a few bias-conflicting samples (i.e., \(\rho \rightarrow 100\%\)). The results are provided in Table 12, from which we find GA can achieve noticeable improvement even with few bias-conflicting samples.

5.5 When Training Data is Unbiased

It is important that the debiasing method is safe, i.e., it can achieve results comparable to Vanilla when the training data is unbiased. We conduct experiments on unbiased training sets, and the results are shown in Table 13, from which we see that our method degrades only slightly and still surpasses the debiasing method LfF. In fact, on an unbiased training set, the identify-emphasize-based methods tend to regard hard samples as bias-conflicting in stage I and emphasize them in stage II, resulting in a slight decrease in overall accuracy (as some training samples are not fully utilized).

5.6 Visualization Results

We visualize the activation maps via CAM (Zhou et al., 2016) in Fig. 11. Vanilla models usually activate regions related to biases when making predictions, e.g., the background in B-Birds. LfF and ECS+PoE can focus attention on key areas in some situations, but there are still some deviations. Meanwhile, the proposed ECS+GA and ECS+GA+SS mostly utilize compact essential features to make decisions. It is worth noting that activation maps serve only as a qualitative result to display the model's focus; for accurate quantitative comparison, please refer to Tables 1\(\sim \)6.

Table 12 Unbiased accuracy (%) on Colored MNIST with few bias-conflicting samples
Table 13 Accuracy (%) on unbiased training data
Fig. 11

Visualized activation maps of different models (last epoch) on Biased Waterbirds

6 Limitation and Future Work

Despite the promising results achieved, the debiasing method can be further improved in some aspects. First, our method and many previous approaches (such as LfF, DFA, BiasCon, RNF-GT, etc.) are based on the assumption that bias-conflicting samples exist in the training set. Although this assumption is in line with most actual situations, it should be noted that there are some cases where the collected training sets are completely biased (i.e., \(\rho =100\%\)), in which these methods are not applicable. For these cases, we should pay attention to methods that aim to directly prevent models from only pursuing easier features, such as SD. Besides, when there are only a few bias-conflicting samples, the risk of unstable training may increase. Although we did not find a significant impact on the results in our experiments, some engineering implementations, such as gradient clipping, can be incorporated to improve the stability of training.

Second, though the proposed ECS achieves significant improvement compared with previous designs, we find that bias-conflicting sample mining is not trivial, especially on complex datasets. The precision and recall achieved by our method on Biased Waterbirds and CelebA are still significantly lower than those on simple datasets like Colored MNIST and Corrupted CIFAR10, as shown in Table 9. In extreme cases, if the bias-conflicting scoring fails, the effect of GA can be influenced. Therefore, a better bias-conflicting scoring method is helpful and worth continuing to explore. Besides, instead of dividing samples into only two clusters (bias-aligned and bias-conflicting), exploring a "soft" version for defining samples is also interesting and promising.

7 Conclusions

Biased models can cause poor out-of-distribution performance and even negative social impacts. In this paper, we focus on combating unknown biases, which is urgently required in realistic applications, and propose an enhanced two-stage debiasing method. In the first stage, an effective bias-conflicting scoring approach containing peer-picking and epoch-ensemble is proposed. In the second stage, we derive a new learning objective based on the idea of gradient alignment, which dynamically balances the gradient contributions from the mined bias-conflicting and bias-aligned samples throughout the learning process. We also incorporate self-supervision into the second stage to assist feature extraction. Extensive experiments on synthetic and real-world datasets reveal that the proposed solution outperforms previous methods.