DCAST: Diverse Class-Aware Self-Training Mitigates Selection Bias for Fairer Learning

Yasin I. Tepeli Department of Intelligent Systems, Faculty EEMCS, Delft, Netherlands Joana P. Gonçalves Department of Intelligent Systems, Faculty EEMCS, Delft, Netherlands Correspondence: [email protected]
Abstract

Fairness in machine learning seeks to mitigate model bias against individuals based on sensitive features such as sex or age, often caused by an uneven representation of the population in the training data due to selection bias. Notably, bias unascribed to sensitive features is challenging to identify and typically goes undiagnosed, despite its prominence in complex high-dimensional data from fields like computer vision and molecular biomedicine. Strategies to mitigate unidentified bias and evaluate mitigation methods are crucially needed, yet remain underexplored. We introduce: (i) Diverse Class-Aware Self-Training (DCAST), model-agnostic mitigation aware of class-specific bias, which promotes sample diversity to counter confirmation bias of conventional self-training while leveraging unlabeled samples for an improved representation of the underlying population; (ii) hierarchy bias, multivariate and class-aware bias induction without prior knowledge. Models learned with DCAST showed improved robustness to hierarchy and other biases across eleven datasets, against conventional self-training and six prominent domain adaptation techniques. Advantage was largest for higher-dimensional datasets, suggesting DCAST as a promising strategy to achieve fairer learning beyond identifiable bias.

Introduction

As predictive machine learning (ML) increasingly makes its way to applications with an impact on society, one major concern is to ensure that ML models deliver fair predictions and do not discriminate against individuals in the population. Selection bias is one of the most prominent sources of unfairness in ML, whereby the data used to build ML models is not representative of the real world and thus violates the fundamental ML assumption that training data is independently and identically distributed with respect to the underlying population.

Research on fairness in ML has focused on mitigating (selection) bias associated with legally protected or sensitive features, such as sex, age, or skin color [1, 2]. However, biases can be indirectly linked to sensitive features via proxies not recognized as sensitive [1, 2], or they can be unrelated to sensitive features and still lead to unfairness. Ultimately, biases are likely to remain undiagnosed and be propagated by ML models without scrutiny when a link to sensitive features is challenging to identify. Unknown biases are often present when data is complex and high-dimensional, data collection is non-random, and knowledge of the domain is incomplete. We argue that unfairness mitigation should thus address bias more generally, beyond what can be ascribed to sensitive features. This issue has received attention across fields, including computer vision [3, 4], astronomy [5, 6], biomedicine and healthcare [7, 8, 9, 10], finance and economics [11, 12, 13], information retrieval [14, 15], and language [7, 16]. Nevertheless, its impact is typically overlooked, resulting in models with optimistic performances due to bias-unaware evaluation. We identify two key areas for improvement, namely evaluation of ML model robustness to bias, and ML bias mitigation.

Evaluation is crucial to ensure that ML models generalize and are robust to bias, but assessing performance on data representative of the real-world distribution is rarely achievable. Independent test data is not always available or guaranteed to be unbiased, and conventional data splits do not create train-test distribution shifts suitable for model bias evaluation. A viable alternative is to induce bias in the train set and assess the learned model on the original test set. Common bias induction approaches include subsampling using univariate selection probabilities, based on values or the distribution of one feature [17, 18]. This is, however, not representative of the multivariate biases typically present in complex high-dimensional data. Existing methods to induce multivariate bias include: joint bias [19], which favors the selection of samples closer to the mean; and Dirichlet bias [20], which assigns sample selection likelihoods based on a Dirichlet distribution. Both methods ignore class labels and thus do not generate class-specific biases. They might also cause class imbalances for otherwise balanced data.

We propose hierarchy bias, a multivariate class-aware bias induction technique to produce complex class-specific biases. Hierarchy bias identifies distinctly distributed groups of samples in the original data using clustering, and then generates a biased selection by influencing the representation of one group of samples relative to the others. Selection is performed per class to induce class-specific bias, aiming for an identical number of samples per class to ensure class balance.

Methods to mitigate bias in ML generally fall in the scope of domain adaptation (DA, [21]), seeking to adapt a model to the distribution shift between the source training domain and a target prediction domain. Relevant DA categories span importance weighting, subspace alignment, inference-based, and semi-supervised learning methods. Importance weighting (IW) weighs training samples based on their relevance to the test set, using probability ratios or discrepancy measures [22, 23, 11, 24, 25, 6, 26, 27, 19, 28]. Since IW assumes that the train set contains the support of the test set and most features contribute to the prediction, it can be less effective with high-dimensional data or small sample sizes. Subspace alignment (SA) transforms the data representation [29, 30, 31], assuming there is a common subspace where transformed train and test sets exhibit matching conditional probabilities, which may be difficult to optimize if many transformations fit. Inference-based (IB) methods include minimax estimation [20, 32], where loss minimization is coupled with an adversarial maximization objective that steers the model to fit more conservatively, aiming for improved generalization. The IB methods may underperform if the model choice is less suitable for the test set. Overall, most IW, SA, and IB methods adapt the model for one target test set, which can hamper generalizability. Semi-supervised learning (SSL) leverages unlabeled samples to provide model learning with insight into the underlying population distribution. The most benefit can in principle be achieved by using as much unlabeled data as available, though some SSL approaches still adapt to individual test sets [33, 34]. Unlabeled samples are typically incorporated by SSL using self-training (ST) [35] or co-training (CT) [36], which assign predicted pseudo-labels to unlabeled samples and select a subset of these to include at each training iteration.
Sample selection is often based on prediction confidence according to the model trained thus far, which may strengthen existing bias or create other biases such as class imbalance for originally balanced data [4, 5]. Attempts to mitigate this behavior include, for instance, the P3SVM support vector machine (SVM) [4] that selects pseudo-labeled samples distant from each other and located within the margins furthest away from the decision boundary. This method is however SVM-specific, and its sample selection dependent on the size of the margin may limit the contribution of unlabeled data. In summary, most DA methods mitigate distribution shifts for one test set at a time, leading to ML models with limited generalizability beyond the train and test domains. It remains to be investigated if generalization could be improved by training on additional unlabeled data. Semi-supervised learning offers this possibility, but existing methods fall short in actively mitigating bias present in the data or further induced during model learning. Finally, many DA methods are model-specific and cannot be applied to different types of ML models.

To improve bias mitigation, we propose Diverse Class-Aware Self-Training (DCAST), a model-agnostic semi-supervised learning framework that gradually incorporates unlabeled data in a class-aware manner, guided by two active bias mitigation strategies. The core CAST strategy addresses class-specific bias by selecting a set of pseudo-labeled samples to include separately per class, using a relaxed confidence threshold, with options to preserve the class ratios of the original labeled train set or to add the same number of pseudo-labeled samples per class at each iteration. The extended DCAST strategy seeks to counter confidence-induced bias by further selecting diverse pseudo-labeled samples, as measured by inter-sample distances in the learned discriminative embedding or the original feature space.

We evaluate both hierarchy bias induction and (D)CAST bias mitigation across eleven datasets, against competing approaches including Dirichlet and joint bias as well as conventional self-training and six domain adaptation techniques. Specifically, we investigate which bias induction method induces the most challenging type of selection bias, leading to the strongest impact on ML model prediction performance. We further assess to what extent the class-awareness and diversity in (D)CAST improve robustness to bias, both across datasets and compared to the alternative bias mitigation strategies, while coupling model-agnostic (D)CAST with three types of ML models.

Results and Discussion

The proposed hierarchy bias induction and (D)CAST bias mitigation methods aim to provide, respectively: (i) a more realistic type of class-aware multivariate selection bias for the evaluation of ML model robustness to bias, and (ii) class-aware and diversity-guided strategies to learn ML models with improved generalizability in the presence of selection bias. We briefly introduce these techniques and discuss their evaluation across 11 datasets using logistic regression (LR), random forest (RF), and 2-hidden layer neural network (NN) prediction models. Every dataset was randomly partitioned into 80% train and 20% test, with the test data reserved for prediction model evaluation (Methods). Effects of bias induction on the data and model prediction performance were assessed over 30 runs, each relying on a random split of the train set into labeled (30%) and unlabeled (70%) train sets. The labeled train set was used for bias induction and for training ML models, either intact or upon bias induction. For bias mitigation, unlabeled data was additionally used during training, where conventional self-training (ST) and (D)CAST leveraged the unlabeled train set, and other domain adaptation techniques exploited the unlabeled test set instead (Methods).
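The partitioning protocol described above can be sketched as follows. This is a minimal illustration using NumPy only; the function names, the seeds, and the absence of class stratification are assumptions, not details taken from the original experiments.

```python
import numpy as np

def train_test_indices(n_samples, test_frac=0.2, seed=0):
    """Fixed random 80/20 split of sample indices into train and test."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_test = int(round(test_frac * n_samples))
    return perm[n_test:], perm[:n_test]

def labeled_unlabeled_indices(train_idx, labeled_frac=0.3, seed=0):
    """Per-run random split of the train indices into labeled and unlabeled."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(train_idx)
    n_labeled = int(round(labeled_frac * len(perm)))
    return perm[:n_labeled], perm[n_labeled:]

# Hypothetical dataset of 1000 samples: one test set shared by all runs,
# and 30 different labeled/unlabeled splits of the train set.
train_idx, test_idx = train_test_indices(1000)
runs = [labeled_unlabeled_indices(train_idx, seed=run) for run in range(30)]
```

Bias induction would then be applied to the labeled indices of each run, while the test indices stay untouched for evaluation.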

Hierarchy bias induces effective multivariate and class-specific selection bias

Hierarchy bias generates a biased selection of samples for a given dataset, aiming to deviate from the original data distribution by skewing the representation of a group of samples that is deemed closer together in feature space than the remaining samples (Fig. 1). The approach selects $k$ samples per class and controls group representation using bias ratio $b$ as follows. A class-specific group of at least $k$ closely related samples is first identified using agglomerative hierarchical clustering. To obtain the biased selection, $k \times b$ samples are chosen uniformly at random from the identified group and $k \times (1-b)$ samples are chosen uniformly at random from the remaining samples (Methods).
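A minimal sketch of this selection procedure is given below, assuming SciPy's `linkage` with Ward linkage on Euclidean distances. The function names `first_cluster_of_size` and `hierarchy_bias`, the rounding of $k \times b$, and the handling of classes with too few remaining samples are illustrative assumptions rather than the reference implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def first_cluster_of_size(X, k):
    """Local indices of the first Ward cluster that reaches at least k samples."""
    n = len(X)
    Z = linkage(X, method="ward")  # Euclidean distances, Ward linkage
    members = {i: [i] for i in range(n)}
    for step, (left, right, _, size) in enumerate(Z):
        merged = members.pop(int(left)) + members.pop(int(right))
        if size >= k:
            return np.array(merged)
        members[n + step] = merged  # SciPy labels the new cluster n + step
    return np.arange(n)  # fewer than k samples available: use them all

def hierarchy_bias(X, y, k, b, seed=0):
    """Select about k samples per class: k*b from one tight group, the rest elsewhere."""
    rng = np.random.default_rng(seed)
    selected = []
    for c in np.unique(y):
        class_idx = np.flatnonzero(y == c)
        group = class_idx[first_cluster_of_size(X[class_idx], k)]
        rest = np.setdiff1d(class_idx, group)
        n_group = min(int(round(k * b)), len(group))
        # Assumption: take fewer samples if the remainder is too small.
        n_rest = min(k - n_group, len(rest))
        selected.append(rng.choice(group, size=n_group, replace=False))
        selected.append(rng.choice(rest, size=n_rest, replace=False))
    return np.concatenate(selected)
```

With `k=30` and `b=0.9`, each class would contribute 27 samples from its identified group and 3 from the remaining samples of that class.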

Figure 1: Hierarchy bias approach for induction of selection bias. Given input data $\boldsymbol{X}$ with labels $\boldsymbol{Y}$, number of samples to select $k$, and bias ratio $b \in [0,1]$, hierarchy bias selects $k$ samples per class $c$: $k \times b$ from a specific group and $k \times (1-b)$ from the remaining samples. Each class-specific candidate group (for class $c$) is identified via agglomerative hierarchical clustering with Euclidean distances and Ward linkage of the $c$-labeled samples until a cluster of size $\geq k$ is obtained, from which $k \times b$ samples are drawn uniformly at random. The $k \times (1-b)$ samples are drawn uniformly at random from the remaining $c$-labeled samples.
Figure 2: Bias induction impact on sample distances, latent space, and classifier performance. (a) Class-specific distributions of per sample average Euclidean distances to all other samples, for the biased selection (histograms) and for all samples in the labeled train set (histogram peaks denoted by lines ending in a “T” shape), using three bias induction techniques (hierarchy with $b=0.9$, joint, and Dirichlet) and random subsampling on three datasets (wine, mushroom, and fire). Kolmogorov-Smirnov (KS) effect sizes quantify the distribution shift between the biased selection vs. all samples. (b-d) Samples selected by hierarchy bias ($b=0.9$), highlighted on the respective latent UMAP space of the labeled train set for the wine, mushroom, and fire datasets (arbitrarily chosen run 11). (e) Accuracy of supervised RF, NN, and LR models on the test set after training on the original or biased labeled train set, over 30 distinct train runs. Box height delimits the interquartile range ($IQR = Q3 - Q1$), with a line across the box denoting the median; whiskers indicate the largest and smallest values within $Q1 - 1.5 \times IQR$ and $Q3 + 1.5 \times IQR$, with points beyond the range as outliers.

To evaluate bias induction, we assessed the ability to generate a distribution shift between the biased selection and the original data, as well as the impact of the induced shift on ML model prediction performance. We compared hierarchy bias with $b=0.9$ to random subsampling and two alternative bias induction techniques: joint bias [19] and Dirichlet bias [20]. Hierarchy bias and random subsampling were set to select 30 samples per class, whereas Dirichlet targeted 60 and 300 samples in total respectively for binary and multiclass labeled datasets. Note that Dirichlet and joint bias do not take class labels into account when performing their selection, and joint bias does not allow control over the selected number of samples.

Effect on data distribution.

We first assessed the effect of bias induction on the distribution of distances between samples. The underlying idea is that a biased selection would exclude portions of the original data that deviate from the rest of the samples to some extent, thus making inter-sample distances smaller on average. For each dataset, we obtained class-specific distributions of the per sample average Euclidean distance to all other samples. We further quantified the deviation between the class-specific distance distributions obtained for the biased selection and the original labeled set using Kolmogorov-Smirnov (KS) tests. Hierarchy bias ($b=0.9$) induced the most significant shift in the distance distributions for all 11 datasets (KS effect sizes $>0.65$, $p$-values $<0.05$; Fig. 2a and Supplementary Fig. S1-S2), and primarily towards smaller average inter-sample distances, in line with the selection of close samples that hierarchy bias is designed to produce. Random selection resulted in the most similar distance distributions to the original data, with the smallest KS effect for 8 datasets. Dirichlet and joint bias led to more modest shifts than hierarchy bias, with joint bias generally showing larger KS effects than Dirichlet (9 of 11 datasets). We also examined the samples selected from each labeled train set in the feature space, reduced to 2 dimensions (2D) using Uniform Manifold Approximation and Projection (UMAP) for an example run 11. Hierarchy bias selected samples from specific clusters or regions of the feature space. This was apparent across datasets (Supplementary Fig. S3), for instance hierarchy bias ignored samples in the top right area of the 2D space for the wine dataset (Fig. 2b), selected from specific clusters of the mushroom dataset (Fig. 2c), and focused on the top left and bottom right areas of the 2D space for the fire dataset (Fig. 2d).
In contrast, samples selected by random selection, as well as by the Dirichlet and joint biases, were spread throughout the 2D space and thus more representative of the original labeled train set for all datasets (Supplementary Fig. S4-S6). For random sampling, this was expected, given that no particular bias was introduced. For joint bias the result was also unsurprising, seeing that it selected the largest proportions of samples across datasets and thus captured most of the data (overall mean average 63%, minimum 44%, and maximum 80%; for hierarchy bias: 17%, 0.4%, and 67%; Supplementary Table S1).
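The distance-based shift analysis can be illustrated with the sketch below, which uses SciPy's `cdist` and `ks_2samp`. It assumes that average distances are computed within each set of samples (biased selection or full labeled set) and would be applied per class; the function names are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import ks_2samp

def mean_distance_to_others(X):
    """Average Euclidean distance of each sample to all other samples in X."""
    D = cdist(X, X)  # pairwise Euclidean distances, zero on the diagonal
    return D.sum(axis=1) / (len(X) - 1)

def distance_shift(X_all, X_biased):
    """KS statistic (effect size) and p-value between the two distance distributions."""
    res = ks_2samp(mean_distance_to_others(X_biased),
                   mean_distance_to_others(X_all))
    return res.statistic, res.pvalue
```

A KS statistic close to 1 indicates that the two distance distributions barely overlap, whereas values near 0 indicate that the selection preserves the original distance distribution.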

Impact on prediction performance.

We evaluated the impact of bias induction on the classification accuracy of supervised ML models for the 11 datasets across 30 runs. Per run, we trained 2-hidden layer neural network (NN), random forest (RF), and logistic regression (LR) models using the original labeled train set (No Bias) or a selection of its samples. The latter was obtained either by random subsampling or using Dirichlet, joint, or hierarchy bias induction. All models were evaluated on the original test set. The induced bias led to a decrease in accuracy with every technique except joint bias (Fig. 2e), which as previously mentioned selected most of the original samples and thus did not induce particularly strong bias. Hierarchy bias caused the largest decrease in accuracy for all datasets except MNIST, where the most impact was seen with joint bias (Fig. 2e). Note that the preset targets on the number of samples to select for hierarchy bias, Dirichlet bias, and random selection led these methods to select 64-70% of the MNIST samples per class compared to 46-60% with joint bias. This larger coverage of the original data likely influenced the ability of hierarchy and Dirichlet to produce a more effective biased selection for MNIST. Overall, hierarchy bias consistently selected samples in close proximity, leading to a significant shift in inter-sample distances and a bias towards class-specific parts of the original distribution. This caused a marked decrease in prediction accuracy of supervised ML models relative to other bias induction techniques.

Diverse class-aware self-training (DCAST) for selection bias mitigation

Figure 3: Diverse Class-Aware Self-Training (DCAST) framework. (Left) Input to DCAST. Labeled data $\boldsymbol{X_L}$ (with labels $\boldsymbol{Y_L}$) and unlabeled data $\boldsymbol{X_U}$, maximum number of iterations $m$, number of pseudo-labeled samples $s$ to select per iteration, confidence or prediction probability threshold $t \in [0,1]$, and integer diversity strength parameter $d \geq 1$. (Middle) Self-training module. At each iteration, a model trained with labeled samples is used to predict pseudo-labels for unlabeled samples, from which a subset is newly selected and added to the labeled set for the next iteration. (Right) Diversity module. Selects the subset of $s_c = s \times class\_ratio(c)$ confidently predicted and diverse pseudo-labeled samples per class $c$, as follows: (i) select the top $s_c \times d$ samples from the unlabeled set with confidence or prediction probability larger than $t$ (or $1.2/C$, whichever is largest); and (ii) reduce this $s_c \times d$ selection to a set of $s_c$ diverse samples by identifying $s_c$ clusters using hierarchical clustering (agglomerative single-linkage) and selecting the most confidently predicted sample from each cluster. Note that $class\_ratio$ can otherwise be fixed to be equal across classes. Distance between samples is based on either learned discriminative embeddings, relating samples with respect to prediction output, or alternatively an unsupervised embedding or the original feature space. When $d=1$, DCAST becomes CAST, without the diversity strategy.

The proposed (D)CAST semi-supervised learning strategies (Fig. 3) aim to mitigate selection bias by leveraging insight from unlabeled data about the underlying distribution of the population. Both rely on self-training to gradually incorporate unlabeled data: at each training iteration, the learnt model is used to predict pseudo-labels for all unlabeled samples, from which a subset of $s$ samples ($s_c$ per class) is selected to be included in the labeled set for the next iteration. To address class-related bias, sample selection is done separately per class as follows. First, a set of $s \times d$ candidates is selected as the most confidently predicted samples with prediction probability above a threshold $t$, where $s$ and $d$ denote the number of samples to select and diversity strength. For CAST ($d=1$), this directly results in the final set of $s$ pseudo-labeled samples to add for the next iteration. The DCAST selection ($d>1$) extends upon CAST to mitigate confidence-related bias through sample diversity, reducing the set of $s \times d$ candidates to a final set of $s$ diverse pseudo-labeled samples. Capturing diverse sample groups is achieved via hierarchical clustering of the candidate samples into $s$ clusters ($s_c$ per class), followed by selection of diverse samples comprising the most confidently predicted sample per cluster. To ensure (D)CAST remains model-agnostic, sample distances for clustering can be based on discriminative embeddings learnt by the model or the original feature space.
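One selection step of this procedure could look like the sketch below. It is an illustration under stated assumptions, not the reference implementation: the function name `select_diverse` is hypothetical, $C$ in the $1.2/C$ bound is taken to be the number of classes, a sample is a candidate for class $c$ only if $c$ is its most probable class, and clustering relies on SciPy's `linkage` and `fcluster`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def select_diverse(proba, embeddings, s_c, d, t):
    """Per class, pick up to s_c confident and mutually diverse unlabeled samples.

    proba: (n_unlabeled, n_classes) predicted class probabilities.
    embeddings: (n_unlabeled, n_dims) representation used for distances.
    Returns a dict mapping each class to the selected sample indices.
    """
    n_classes = proba.shape[1]
    threshold = max(t, 1.2 / n_classes)  # assumes C is the number of classes
    pseudo_label = proba.argmax(axis=1)
    confidence = proba.max(axis=1)
    selected = {}
    for c in range(n_classes):
        eligible = np.flatnonzero((pseudo_label == c) & (confidence > threshold))
        # Top s_c * d candidates, most confident first.
        candidates = eligible[np.argsort(-confidence[eligible])][: s_c * d]
        if len(candidates) <= s_c:  # CAST (d = 1) or too few candidates
            selected[c] = candidates
            continue
        # Single-linkage clustering of the candidates into at most s_c clusters.
        Z = linkage(embeddings[candidates], method="single")
        clusters = fcluster(Z, t=s_c, criterion="maxclust")
        picks = []
        for cluster_id in np.unique(clusters):
            members = candidates[clusters == cluster_id]
            picks.append(members[np.argmax(confidence[members])])
        selected[c] = np.array(picks)
    return selected
```

In a full self-training loop, the selected samples would be moved from the unlabeled to the labeled set with their pseudo-labels, and the model retrained, for at most $m$ iterations.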

Figure 4: Bias mitigation by semi-supervised (D)CAST in the presence of hierarchy bias (ratio $b=0.9$). Accuracy of supervised and semi-supervised learning methods with (a) RF, (b) NN, and (c) LR models across 11 datasets. Results for 30 runs: each training on a different split of the train set into labeled and unlabeled sets, all evaluated on the same original test set. Models included (top to bottom): supervised RF/NN/LR models trained on the original (No Bias) or biased (Bias) labeled set; and semi-supervised RF/NN/LR models, using conventional self-training (ST) on the biased labeled train set plus the unlabeled test set, or (D)CAST on the biased labeled train set plus the unlabeled train set. Red asterisks (*) denote statistically significant changes in accuracy over 30 runs for each semi-supervised approach compared to supervised learning on the biased labeled set, using one-sided Wilcoxon signed-rank tests (larger asterisks indicate $p<0.01$ and smaller asterisks $0.01<p<0.05$).

Diversity and class-awareness in (D)CAST improve bias mitigation via self-training

To evaluate (D)CAST bias mitigation, we first assessed its test prediction accuracy against supervised learning and conventional self-training (ST) [35] on the biased labeled train set, with additional unlabeled samples for self-training strategies. Training and evaluation were performed for 11 datasets over 30 runs as previously described, using RF, NN, and LR models. We induced hierarchy bias with ratio $b=0.9$, as this type of selection bias showed the most impact on supervised models compared to Dirichlet and joint bias (Fig. 2e). The (D)CAST method was assessed without diversity (CAST, $d=1$) or with diversities $d=\{10,100\}$ (DCAST-10, DCAST-100), and was set to include $s = 3 \times$ (number of classes) pseudo-labeled samples per iteration, for at most $m=100$ iterations, using prediction threshold $t=0.9$ (or the 85th or 93rd percentile in the case of RF models). Conventional ST selected the $3 \times$ (number of classes) most confidently predicted samples per iteration (Methods, Bias mitigation strategies). Concerning the mitigation of hierarchy bias with ratio $b=0.9$, with NN models the semi-supervised (D)CAST strategies significantly improved generalizability over supervised learning across all 11 datasets ($p<0.05$ with one-sided Wilcoxon signed-rank tests, Fig. 4b). Specifically, class-awareness with moderate diversity (DCAST-10) was significantly better than supervised learning on the 11 datasets, whereas class-awareness alone (CAST) or coupled with stronger diversity (DCAST-100) both improved on 10 datasets and remained comparable respectively on the fire and adult datasets. By contrast, conventional ST was significantly worse than supervised learning on 10 datasets with NN models.
Using RF and LR models, mitigation of hierarchy bias with ratio $b=0.9$ was more modest. Semi-supervised (D)CAST and ST performed comparably to supervised learning on most datasets (8 with RF and 7 with LR models; Fig. 4a,c), possibly due to the use of regularization, which could hamper model adaptation. We thus saw occasional statistically significant changes and smaller effect sizes with RF and LR models. Notably, the higher diversity strategy DCAST-100 led to the only significant improvement of semi-supervised over supervised learning using RF models, on the MNIST dataset (Fig. 4a). Also with RF models, CAST and DCAST-10 decreased accuracy on MNIST, while ST decreased accuracy on 3 datasets (wine, MNIST, and pistachio; Fig. 4a). With LR models, (D)CAST strategies improved over supervised learning on 4 datasets (MNIST, spam, raisin, and pistachio), whereas ST improved on 3 datasets (spam, raisin, and pumpkin) but also caused a decrease on the wine dataset (Fig. 4c).
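The per-dataset significance assessment over paired runs can be sketched with SciPy's `wilcoxon` as shown below. The helper name `one_sided_change` and the 0.05 default are illustrative assumptions; testing both directions separately is one plausible reading of "one-sided tests for significant changes".

```python
import numpy as np
from scipy.stats import wilcoxon

def one_sided_change(acc_method, acc_baseline, direction="greater", alpha=0.05):
    """One-sided Wilcoxon signed-rank test on paired per-run accuracies.

    direction="greater" tests for an improvement of the method over the
    baseline, direction="less" tests for a decrease.
    """
    res = wilcoxon(acc_method, acc_baseline, alternative=direction)
    return float(res.pvalue), bool(res.pvalue < alpha)
```

Here `acc_method` and `acc_baseline` would hold the 30 test accuracies of a semi-supervised strategy and of supervised learning on the biased labeled set, paired by run.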

Experiments with alternative bias induction techniques revealed similar findings, where (D)CAST bias mitigation consistently outperformed ST across datasets under random subsampling (Supplementary Fig. S7), and under induced Dirichlet or joint bias (Supplementary Figs. S8-S9). Again, we saw the largest performance differences with NN models, coinciding with the most improvement of (D)CAST and weakest results of ST over supervised learning.

In summary, (D)CAST effectively mitigated selection bias induced by different techniques when paired with non-regularized NN models, and was not outperformed by supervised learning or conventional ST with regularized RF and LR models. In contrast, conventional ST struggled to recover from the bias with all three types of models, especially NNs. These results suggest that the class-awareness and diversity features introduced to the pseudo-labeling procedure in (D)CAST provide a promising semi-supervised learning strategy to mitigate selection bias.

Figure 5: Bias mitigation by (D)CAST or domain adaptation beyond semi-supervised learning under hierarchy bias ($b=0.9$). Accuracy of semi-supervised (D)CAST strategies against alternative bias mitigation techniques with 3 different types of ML models for 11 datasets over 30 runs. Per run, each model was trained using a different labeled train set with induced hierarchy bias. We included a supervised learning model as baseline per ML model type (RF, NN, LR), together with bias mitigation models incorporating additional unlabeled samples from either the unlabeled train set ((D)CAST) or the unlabeled test set (remaining methods). All models were evaluated on the same original test set. Bias mitigation methods per category: semi-supervised (CAST and DCAST-100); importance weighting (KMM, KDE); minimax estimation (RBA, TCPR); and subspace alignment (FLDA, SUBA). The (D)CAST and KMM methods were coupled with RF, NN, and LR models, whereas the remaining methods used LR models only. For clarity, horizontal lines group bias mitigation strategies by model type. The “x” symbol indicates model training was unsuccessful across all 30 runs.

Semi-supervised (D)CAST bias mitigation is superior to competing domain adaptation

We also evaluated (D)CAST against bias mitigation techniques beyond semi-supervised learning. This included importance weighting methods KMM [19] and KDE [22], minimax approaches RBA [20] and TCPR [32], and subspace alignment methods FLDA [31] and SUBA [30]. All methods were trained on the biased labeled train set and evaluated on the original test set, with (D)CAST further incorporating samples from the unlabeled train set and the remaining methods using unlabeled test samples during training. The (D)CAST and KMM approaches were coupled with RF, NN, and LR models, while the remaining methods used LR only as per the original work.

Similar to our previous findings, CAST and DCAST-100 were the most robust bias mitigation methods. Overall, these strategies preserved or significantly improved over the supervised learning performance across the 3 model types and 11 datasets, with the exception of CAST showing a decrease in accuracy for MNIST when used with RF models (Fig. 4-5). In contrast, KMM led to significant decreases in accuracy for 8 datasets with NN models, as well as for 5 and 6 datasets respectively with LR and RF models. As for the remaining bias mitigation methods using only LR models, KDE resulted in significant decreases in performance for all except the rice dataset. Apart from an improvement with RBA for the pistachio dataset, the RBA and SUBA methods degraded performance significantly for 6 and 9 datasets, respectively. The best competing methods were FLDA and TCPR, which showed significant improvements respectively for 5 and 4 datasets (FLDA: breast cancer, spam, raisin, pistachio, and pumpkin; TCPR: wine, rice, adult, and pistachio). The FLDA approach also led to significant decreases for 4 datasets (wine, mushroom, MNIST, and fire), while TCPR caused a significant decrease for the fire dataset. Concerning the MNIST dataset, TCPR failed to build models for most runs and caused a clear performance drop for the few remaining ones, resulting in insufficient power to determine statistical significance. Overall, CAST and DCAST-100 demonstrated consistent ability to match or outperform supervised learning in the presence of hierarchy bias compared to other bias mitigation methods. The gap was most evident on the multi-class classification problem (MNIST), where the other methods resulted in drastic decreases in performance.

Conclusion

We put forth two contributions to improve the learning of prediction models in the presence of selection bias. First, a bias induction approach termed hierarchy bias to enable the evaluation of complex multivariate bias effects on the generalizability of prediction models. Second, a model-agnostic semi-supervised learning framework named (D)CAST that exploits unlabeled data in a class-aware manner and promotes sample diversity to mitigate selection bias.

Hierarchy bias uses clustering to isolate one distinct group of samples per class and then skews the representation of that group during sample selection to induce class-specific multivariate bias, allowing control over the level of bias through a bias ratio parameter. Induced hierarchy bias showed a stronger impact on the distribution of inter-sample distances and proved more challenging for prediction models to overcome, compared to joint and Dirichlet bias.

The (D)CAST model learning strategy progressively incorporates unlabeled samples using self-training, which is further made class-aware in CAST by pseudo-labeling confidently predicted unlabeled samples over a given threshold per class. Its extended variant, DCAST, seeks to counter confidence-associated bias with sample diversity by clustering and selecting pseudo-labeled samples from distinct groups, using distances based on either the discriminative embeddings provided by the underlying model or the original feature representation.

Both class-awareness and diversity proved effective, leading to significant improvements in the bias mitigation ability of (D)CAST over conventional self-training across datasets and bias induction techniques. Models trained by (D)CAST also outperformed other models built using six alternative domain adaptation methods, comprising different importance weighting, minimax estimation, and subspace alignment approaches.

Diversity strength was shown to influence the extent of (D)CAST bias mitigation, where a larger value resulted in improved robustness to selection bias. More generally, we recommend setting the diversity strength parameter such that the number of candidate samples considered for selection at each iteration is significantly larger than the number of samples to select. We further suggest choosing a number of samples to select per iteration comfortably below the size of the training set to promote a gradual adaptation of the model, but not too small so that the added samples can have an impact: a possible choice could be the closest even number to $\lfloor\sqrt{N}\rfloor$, with $N$ denoting the size of the training set. The confidence threshold can be adjusted according to the distribution of prediction probabilities of the model to allow (D)CAST to consider at least as many samples as the number to add at each iteration.
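As a rough illustration of these recommendations, the sketch below derives a starting value for the number of samples to select per iteration from the training set size, and checks whether enough unlabeled samples clear the confidence threshold. The function names `suggest_samples_per_iteration` and `enough_candidates` are hypothetical. Rounding an odd $\lfloor\sqrt{N}\rfloor$ down to the even number below it is an assumption, since both neighboring even numbers are equally close.

```python
import math


def suggest_samples_per_iteration(n_train: int) -> int:
    """Closest even number to floor(sqrt(n_train)), with a minimum of 2.

    Assumption: an odd floor(sqrt(N)) is rounded down to the even number below.
    """
    root = math.isqrt(n_train)   # floor(sqrt(N)) for a non-negative integer
    even = root - (root % 2)     # nearest even number at or below the root
    return max(2, even)


def enough_candidates(confidences, threshold: float, s: int) -> bool:
    """True if at least s unlabeled samples have confidence above the threshold."""
    return sum(1 for p in confidences if p > threshold) >= s
```

If `enough_candidates` returns `False` for the current model, lowering the threshold (or the number of samples per iteration) would be one way to follow the guidance above.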

We demonstrated that (D)CAST is model-agnostic through application with random forests (RF), neural networks (NN), and logistic regression (LR) models. The success of bias mitigation differed across architectures, with the most benefit achieved using NN models. We hypothesized that the use of regularization could also have played a role, by restricting model adaptation and thus limiting the contribution of unlabeled samples in the RF and LR models. Further investigation would be needed to obtain conclusive evidence.

Overall, our results present (D)CAST and hierarchy bias as promising strategies to improve the learning and evaluation of machine learning models in the presence of selection bias, as an essential step in striving towards fairness in machine learning.

Methods

Hierarchy bias induction and (D)CAST bias mitigation

Notation.

We denote the input data (sample $\times$ feature) matrix as $\boldsymbol{X}\in\mathbb{R}^{N\times F}$, the input label matrix as $\boldsymbol{Y}\in\{0,1\}^{N\times C}$, and the output prediction probability matrix as $\boldsymbol{\bar{Y}}\in\mathbb{R}^{N\times C}$, where $N$ is the number of samples, $F$ is the number of features, and $C$ is the number of classes. Following this notation, $\boldsymbol{x}_{n}\in\mathbb{R}^{1\times F}$ is the feature vector of sample $n\in\{1,2,\ldots,N-1,N\}$, $y_{n}^{c}$ is the binary label of sample $n$ for class $c\in\{1,2,\ldots,C-1,C\}$ (1 if assigned, 0 otherwise), and $\bar{y}_{n}^{c}$ is the prediction probability of sample $n$ being of class $c$, where $\sum_{c=1}^{C} y_{n}^{c}=1$ and $\sum_{c=1}^{C}\bar{y}_{n}^{c}=1$.

Hierarchy bias

Hierarchy bias induction generates a biased selection of samples from a given dataset in a class-aware and multivariate manner. The idea is that the samples belonging to each class in the dataset can be seen as originating from a mixture of multivariate distributions. Based on this, the goal is to identify one of the mixtures and then make a skewed selection of samples by controlling the representation of the target mixture over the remaining samples. Hierarchy bias induction takes as input a data matrix $\boldsymbol{X}$, a label matrix $\boldsymbol{Y}$, a parameter $k$ denoting the number of samples to select per class, and a bias parameter $b\in[0,1]$ denoting the ratio of samples that should be selected from the identified mixture (Alg. 1). The output is a biased selection of samples, generated as follows. Agglomerative hierarchical clustering is first applied to identify a mixture of interest per class $c$, corresponding to a cluster of at least $k$ samples. We perform the clustering for class $c$ using all samples from matrix $\boldsymbol{X}$ labeled with class $c$, with Euclidean inter-sample distances on the original feature vectors and Ward linkage between clusters (Alg. 1, lines 4-5). Once the cluster is identified, the final biased selection is obtained by choosing $k\times b$ samples uniformly at random from the cluster and choosing another $k-k\times b$ samples uniformly at random from the remaining samples not in the cluster (Alg. 1, lines 6-8).

Algorithm 1 Hierarchy Bias
1: Input: $\boldsymbol{X}$, $\boldsymbol{Y}$, $k$, $b$.
2: $Selection \leftarrow \emptyset$
3: $k_{cluster} \leftarrow k \times b$
4: $k_{rest} \leftarrow k - k \times b$
5: for each class $c \in \{1, \ldots, C\}$ do
6:     Apply agglomerative clustering with Euclidean distance and Ward linkage to $\boldsymbol{X}_{S_c}$, $S_c = \{n : y_n^c = 1\}$.
7:     $Cluster \leftarrow$ Set of samples from the first cluster that reaches a number of samples $\geq k$.
8:     $S_{cluster} \leftarrow$ Select set of $k_{cluster}$ samples uniformly at random from $Cluster$.
9:     $S_{rest} \leftarrow$ Select set of $k_{rest}$ samples uniformly at random from the remaining samples (not in $Cluster$).
10:     $Selection \leftarrow Selection \cup S_{cluster} \cup S_{rest}$
11: end for
12: return $Selection$
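The sampling step of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration for a single class, not the authors' implementation: it assumes the cluster of at least $k$ samples has already been found by the hierarchical clustering step, and the function name `hierarchy_biased_sample`, the `seed` argument, and the rounding of $k \times b$ to an integer are all assumptions.

```python
import random


def hierarchy_biased_sample(class_indices, cluster_indices, k, b, seed=None):
    """Pick k samples of one class, a fraction b of them from the chosen cluster.

    class_indices:   indices of all samples labeled with the class
    cluster_indices: indices of the cluster identified for that class (a subset)
    """
    rng = random.Random(seed)
    cluster = set(cluster_indices)
    rest = [n for n in class_indices if n not in cluster]
    k_cluster = round(k * b)   # assumption: k * b is rounded to an integer
    k_rest = k - k_cluster
    if k_cluster > len(cluster) or k_rest > len(rest):
        raise ValueError("not enough samples to honour k and b")
    # Uniform sampling without replacement inside and outside the cluster.
    from_cluster = rng.sample(sorted(cluster), k_cluster)
    from_rest = rng.sample(rest, k_rest)
    return from_cluster + from_rest
```

With $b = 1$ every selected sample comes from the cluster (the strongest bias), whereas smaller values of $b$ draw a growing share from outside the cluster.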

(D)CAST - Diverse Class-Aware Self-Training

The proposed semi-supervised model learning framework, Diverse Class-Aware Self-Training (DCAST), leverages unlabeled data to gain insight into the underlying distribution of the population that may not be well represented by the labeled data. It does this using self-training (ST), and actively addresses selection bias by preserving class ratios or balance (CAST), and optionally also incorporating sample diversity into the pseudo-labeling process to counter biases present in the data or introduced during training (DCAST).

More formally, the (D)CAST method takes as input the labeled data $\{\boldsymbol{X_L}, \boldsymbol{Y_L}\}$ and unlabeled data $\boldsymbol{X_U}$ to learn from, validation data $\{\boldsymbol{X_V}, \boldsymbol{Y_V}\}$ for early stopping, and the following four additional parameters: maximum number of iterations $m$, number of pseudo-labeled samples $s$ to select per iteration, confidence or prediction probability threshold $t\in[0,1]$, and integer diversity parameter $d\geq 1$. Model learning in (D)CAST is then performed by self-training as follows. At iteration $i$, model $M^{(i)}$ is trained on the labeled data $\{\boldsymbol{X_{L^{(i)}}}, \boldsymbol{Y_{L^{(i)}}}\}$, and used to make predictions $\boldsymbol{\bar{Y}_{U^{(i)}}}$ for all samples in the unlabeled set $U^{(i)}$ (and matrix $\boldsymbol{X_{U^{(i)}}}$). As with regular self-training, a pseudo-labeling procedure then selects a subset of the unlabeled samples, $S^{(i)}\subseteq U^{(i)}$, to be incorporated into model learning (Fig. 3). The selected samples $S^{(i)}$ are pseudo-labeled and included in the set of labeled samples for training in the subsequent iteration, $L^{(i+1)}=L^{(i)}\cup S^{(i)}$, as well as removed from the unlabeled set, $U^{(i+1)}=U^{(i)}\setminus S^{(i)}$. Matrices $\boldsymbol{X_{L^{(i+1)}}}$, $\boldsymbol{Y_{L^{(i+1)}}}$, and $\boldsymbol{X_{U^{(i+1)}}}$ are also updated for the next iteration accordingly.

Pseudo-labeling in (D)CAST: class-aware with and without diversity.

The (D)CAST-specific pseudo-labeling is accomplished by the Diversity Module (Fig. 3). The core CAST strategy addresses class-specific bias by performing the pseudo-labeling separately per class, offering to either preserve the class ratios found in the original labeled set or select an equal number of samples per class at each iteration. Its extension, DCAST, aims for further bias mitigation by promoting sample diversity. In conventional self-training, the pseudo-labeling procedure tends to confirm and follow biases potentially present in the labeled set: either by selecting unlabeled samples similar to the original labeled samples (in feature space) or by selecting unlabeled samples whose prediction the model is most confident about. In contrast, (D)CAST seeks to mitigate this behavior and work against the strengthening of existing bias during training. To achieve this, (D)CAST selects and pseudo-labels samples that are diverse amongst each other and also more dissimilar to the possibly biased labeled samples. The (D)CAST pseudo-labeling (Alg. 2) comprises the following steps per training iteration:

Step 1. (D)CAST - Select candidate samples for pseudo-labeling based on model confidence. The goal of Step 1 is to select a set of candidate unlabeled samples for pseudo-labeling and inclusion in model training. This corresponds to the $s\times class\_ratio(c)\times d$ most confidently predicted unlabeled samples per class $c$, with corresponding probabilities in $\boldsymbol{\bar{Y}_{U^{(i)}}}$ larger than a user-defined threshold $t$ (or a baseline threshold $r=1.2/C$, whichever is larger) (Alg. 2, lines 9-11). For CAST, with $d=1$ and thus no diversity strategy, this selection automatically leads to the final set of $s$ pseudo-labeled samples ($s_c=s\times class\_ratio(c)$ per class) to incorporate during learning in the subsequent iteration. For DCAST, with $d>1$ (Alg. 2, lines 13-15), the selected set of $s\times d$ samples ($s_c\times d$ per class) represents a larger pool of candidates to consider and narrow down further to obtain the final selected set of $s$ samples ($s_c$ per class) using the diversity strategy. Our recommendation for DCAST is to set the confidence threshold $t$ and diversity parameter $d$ not too strictly, so as to allow for a sufficient number (and diversity) of candidate samples.
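The candidate selection of Step 1 can be illustrated with the following minimal sketch, which is not the authors' implementation. The function name `select_candidates` is hypothetical, and two details are assumptions: the per-class count $s \times class\_ratio(c)$ is rounded to an integer, and a sample is only considered for the class with its highest predicted probability, so that no sample becomes a candidate for two classes.

```python
def select_candidates(probabilities, class_ratio, s, d, t):
    """Return, per class, the indices of the top s_c * d confident unlabeled samples.

    probabilities: per-sample lists of predicted class probabilities
    class_ratio:   fraction of the s samples assigned to each class
    """
    n_classes = len(class_ratio)
    threshold = max(t, 1.2 / n_classes)   # user threshold t or baseline r = 1.2 / C
    candidates = {}
    for c in range(n_classes):
        s_c = round(s * class_ratio[c])   # assumption: round to an integer count
        # Assumption: keep samples whose most likely class is c and whose
        # confidence for c exceeds the threshold.
        eligible = [
            (row[c], n)
            for n, row in enumerate(probabilities)
            if row.index(max(row)) == c and row[c] > threshold
        ]
        eligible.sort(reverse=True)       # most confident first
        candidates[c] = [n for _, n in eligible[: s_c * d]]
    return candidates
```

With `d=1` the returned indices are already the final CAST selection; with `d>1` they form the larger pool that the diversity steps narrow down.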

Step 2. DCAST - Diversity: Create representations of candidate samples for distance calculation. From the set of $s\times d$ candidate samples selected in Step 1, DCAST aims to extract the subset of $s$ diverse samples. Diversity is assessed based on pairwise sample distances, calculated using a specific sample vector representation or embedding (denoted for all candidate samples as matrix $\boldsymbol{E^{(i)}}\in\mathbb{R}^{(s\times d)\times v}$, where $v$ is the embedding vector size). Preferably, DCAST uses discriminative embeddings based on the learnt model $M^{(i)}$, where two types are currently supported. For a random forest, each sample representation corresponds to a one-hot encoded vector of the prediction of that sample across all the leaves of the decision trees in the forest; for a neural network, the sample representation corresponds to the embedding based on the hidden layer closest to the output layer. For models without discriminative embeddings, such as SVM or LR, DCAST uses the original feature vector representation.

Step 3. DCAST - Diversity: Calculate pairwise distances between candidate samples. To assess diversity, we use distances between samples: the larger the distances amongst samples in a given set, the more diverse the set will be considered. Distances are calculated by DCAST based on sample embeddings or original feature vector representations (Alg. 2, line 13). With discriminative embeddings, DCAST calculates normalized distances as $1-(E\cdot E^{T})/\max(E\cdot E^{T})$, given an embedding matrix $E\in\mathbb{R}^{(s\times d)\times v}$. Specifically, for a random forest model, these distances represent the normalized frequency of non-co-occurrence of a pair of samples in the leaves of the decision trees. With original feature vectors, DCAST uses Euclidean distances between sample vectors instead.
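To make the random forest case concrete, the sketch below builds a one-hot leaf embedding and applies the normalized distance $1-(E\cdot E^{T})/\max(E\cdot E^{T})$ in plain Python. It is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, and leaves are assumed to be numbered from 0 within each tree.

```python
def one_hot_leaves(leaf_ids, leaves_per_tree):
    """One-hot encode the leaf reached by each sample in each tree.

    leaf_ids:        per-sample lists with one leaf index per tree
    leaves_per_tree: number of leaves in each tree
    """
    offsets = [sum(leaves_per_tree[:t]) for t in range(len(leaves_per_tree))]
    width = sum(leaves_per_tree)
    embedding = []
    for sample in leaf_ids:
        row = [0] * width
        for tree, leaf in enumerate(sample):
            row[offsets[tree] + leaf] = 1
        embedding.append(row)
    return embedding


def embedding_distances(E):
    """Normalized distances 1 - (E . E^T) / max(E . E^T)."""
    gram = [[sum(a * b for a, b in zip(u, v)) for v in E] for u in E]
    largest = max(max(row) for row in gram)
    return [[1 - g / largest for g in row] for row in gram]
```

For one-hot leaf embeddings, entry $(i, j)$ of $E\cdot E^{T}$ counts the trees in which samples $i$ and $j$ share a leaf and the maximum equals the number of trees, so two samples that share a leaf in 2 out of 3 trees are at distance $1/3$.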

Step 4. DCAST - Diversity: Identify distinct clusters and select diverse samples to pseudo-label. The distances calculated in Step 3 are used in Step 4 to select diverse samples, potentially capturing different aspects of the pool of candidates and its underlying distribution. To do this, DCAST first identifies $s$ (or $s_c$ per class) distinct groups of candidate samples using a clustering algorithm (Alg. 2, line 14). The current implementation relies on agglomerative hierarchical clustering with single linkage; however, any other algorithm of choice could be employed. Given that clustering is designed to maximize inter-cluster distances, samples across the different clusters are likely to yield the largest distances and thus the most diversity under the employed clustering strategy. Accordingly, DCAST selects a single sample per identified cluster to pseudo-label, namely the candidate sample with the highest confidence $\bar{y}^{c}_{n}$ value (sample $n$ and class $c$, Alg. 2, line 15).
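The selection in Step 4 can be sketched as follows. The naive single-linkage routine is written out only to keep the example self-contained (a library implementation would normally be used), the function names are hypothetical, and ties between equally close clusters are broken by position, which is an assumption.

```python
def single_linkage_clusters(dist, n_clusters):
    """Naive agglomerative clustering with single linkage on a distance matrix."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                gap = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the two closest clusters
        del clusters[b]
    return clusters


def pick_diverse(dist, confidences, s_c):
    """Select one candidate per cluster: the one with the highest confidence."""
    clusters = single_linkage_clusters(dist, s_c)
    return sorted(max(members, key=lambda n: confidences[n]) for members in clusters)
```

Each returned index belongs to a different cluster, so the selected samples are spread over the candidate pool instead of concentrating around the most confident region.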

Step 5. (D)CAST - Pseudo-label selected samples. At the end of each iteration, selected samples in the set $S_c$ are added to the labeled data matrices $\{\boldsymbol{X_L}, \boldsymbol{Y_L}\}$ and removed from the unlabeled data matrix $\boldsymbol{X_U}$.

Time Complexity of (D)CAST.

To derive an upper bound for the worst-case time complexity of the (D)CAST algorithm, we assume the following time complexities for an input of $n$ samples defined over $v$ features: training a base prediction model is $O(T(n,v))$, making predictions using the trained model is $O(P(n,v))$, and calculating pairwise sample distances and applying hierarchical clustering is $O((n\times v)^{2})$.

At iteration $i$, the time complexity of (D)CAST is dominated by the following operations: retraining the model with $l+i\times s$ labeled samples in $O(T(l+i\times s,v))$ time (Alg. 2, line 4), making predictions for $u-i\times s$ unlabeled samples in $O(P(u-i\times s,v))$ time (Alg. 2, line 5), and applying hierarchical clustering with pairwise distances to at most $s\times d$ candidate unlabeled samples in $O((s\times d\times v)^{2})$ time (Alg. 2, lines 11-12). Note that $l$ denotes the number of labeled samples in the input matrices $\{\boldsymbol{X_L}, \boldsymbol{Y_L}\}$ and $u$ the number of unlabeled samples in the input matrix $\boldsymbol{X_U}$ at the start of the execution, while $i\times s$ denotes the number of samples that are pseudo-labeled up to iteration $i$ (thus also added and removed respectively from the labeled and unlabeled data). The maximum possible number of samples for prediction at any one iteration is therefore $u$, reached before any pseudo-labeling has occurred, leading to the upper bound $O(P(u,v))$ on the prediction time per iteration. Similarly, $u$ is the maximum number of samples that can be added to the input labeled data (initially containing $l$ samples) over all iterations, which determines the upper bound $O(T(l+u,v))$ on the training time per iteration. Combining all together, each iteration takes $O(T(l+u,v)+P(u,v)+(s\times d\times v)^{2})$ time, and therefore the upper bound on the worst-case time complexity of $m$ iterations is $O(m\times(T(l+u,v)+P(u,v)+(s\times d\times v)^{2}))$.

Algorithm 2 (D)CAST - Diverse Class-Aware Self-Training
1:T𝑇Titalic_T (model type); 𝑿𝑳subscript𝑿𝑳\boldsymbol{X_{L}}bold_italic_X start_POSTSUBSCRIPT bold_italic_L end_POSTSUBSCRIPT, 𝒀𝑳subscript𝒀𝑳\boldsymbol{Y_{L}}bold_italic_Y start_POSTSUBSCRIPT bold_italic_L end_POSTSUBSCRIPT (labeled train data); 𝑿𝑽subscript𝑿𝑽\boldsymbol{X_{V}}bold_italic_X start_POSTSUBSCRIPT bold_italic_V end_POSTSUBSCRIPT, 𝒀𝑽subscript𝒀𝑽\boldsymbol{Y_{V}}bold_italic_Y start_POSTSUBSCRIPT bold_italic_V end_POSTSUBSCRIPT (labeled validation data); 𝑿𝑼subscript𝑿𝑼\boldsymbol{X_{U}}bold_italic_X start_POSTSUBSCRIPT bold_italic_U end_POSTSUBSCRIPT (unlabeled data); s𝑠sitalic_s (number of samples to select per iteration); t𝑡titalic_t (prediction probability threshold); d𝑑ditalic_d (diversity strength); m𝑚mitalic_m (maximum number of iterations).
2:terminateFalse𝑡𝑒𝑟𝑚𝑖𝑛𝑎𝑡𝑒𝐹𝑎𝑙𝑠𝑒terminate\leftarrow Falseitalic_t italic_e italic_r italic_m italic_i italic_n italic_a italic_t italic_e ← italic_F italic_a italic_l italic_s italic_e
3:i0𝑖0i\leftarrow 0italic_i ← 0
4:while 
terminate𝑡𝑒𝑟𝑚𝑖𝑛𝑎𝑡𝑒terminateitalic_t italic_e italic_r italic_m italic_i italic_n italic_a italic_t italic_e is Falsei=m𝐹𝑎𝑙𝑠𝑒𝑖𝑚False\lor i=mitalic_F italic_a italic_l italic_s italic_e ∨ italic_i = italic_m do
5:     M(i)superscript𝑀𝑖absentM^{(i)}\leftarrowitalic_M start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ← train model instance of type T𝑇Titalic_T with 𝑿𝑳subscript𝑿𝑳\boldsymbol{X_{L}}bold_italic_X start_POSTSUBSCRIPT bold_italic_L end_POSTSUBSCRIPT, 𝒀𝑳subscript𝒀𝑳\boldsymbol{Y_{L}}bold_italic_Y start_POSTSUBSCRIPT bold_italic_L end_POSTSUBSCRIPT
6:     Y¯¯𝑌absent\bar{Y}\leftarrowover¯ start_ARG italic_Y end_ARG ← predict class probability for samples in 𝑿𝑼subscript𝑿𝑼\boldsymbol{X_{U}}bold_italic_X start_POSTSUBSCRIPT bold_italic_U end_POSTSUBSCRIPT using M(i)superscript𝑀𝑖M^{(i)}italic_M start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT
7:     for 
each class cC𝑐𝐶c\in Citalic_c ∈ italic_C do
8:         scs×class_ratio(c)subscript𝑠𝑐𝑠𝑐𝑙𝑎𝑠𝑠_𝑟𝑎𝑡𝑖𝑜𝑐s_{c}\leftarrow s\times class\_ratio(c)italic_s start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ← italic_s × italic_c italic_l italic_a italic_s italic_s _ italic_r italic_a italic_t italic_i italic_o ( italic_c )
9:         tcmax(t,r)subscript𝑡𝑐𝑡𝑟t_{c}\leftarrow\max(t,r)italic_t start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ← roman_max ( italic_t , italic_r )
10:         Scsubscript𝑆𝑐absentS_{c}\leftarrowitalic_S start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ← top sc×dsubscript𝑠𝑐𝑑s_{c}\times ditalic_s start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT × italic_d confidently predicted samples with max(y¯nc)>tc𝑚𝑎𝑥superscriptsubscript¯𝑦𝑛𝑐subscript𝑡𝑐max(\bar{y}_{n}^{c})>t_{c}italic_m italic_a italic_x ( over¯ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT ) > italic_t start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT
11:         if 
d>1𝑑1d>1italic_d > 1 then
12:              E𝐸absentE\leftarrowitalic_E ← calculate pairwise distances for samples in Scsubscript𝑆𝑐S_{c}italic_S start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT
13:              Clusters𝐶𝑙𝑢𝑠𝑡𝑒𝑟𝑠absentClusters\leftarrowitalic_C italic_l italic_u italic_s italic_t italic_e italic_r italic_s ← apply agglomerative clustering to obtain scsubscript𝑠𝑐s_{c}italic_s start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT clusters using distances E𝐸Eitalic_E and single linkage
14:              Scsubscript𝑆𝑐absentS_{c}\leftarrowitalic_S start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ← choose the sample with the highest prediction probability from each cluster in Clusters𝐶𝑙𝑢𝑠𝑡𝑒𝑟𝑠Clustersitalic_C italic_l italic_u italic_s italic_t italic_e italic_r italic_s
15:         end if
16:         for 
each selected sample nSc𝑛subscript𝑆𝑐n\in S_{c}italic_n ∈ italic_S start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT do
17:              $\boldsymbol{X_L}.\textrm{add}(\boldsymbol{x_n})$, $\boldsymbol{Y_L}.\textrm{add}(\boldsymbol{y_n})$, $\boldsymbol{X_U}.\textrm{remove}(\boldsymbol{x_n})$
18:         end for
19:     end for
20:     $\triangleright$ Stopping conditions: maximum number of iterations $m$ is reached OR all unlabeled samples have been incorporated OR validation accuracy did not improve for the last 5 iterations.
21:     if $(i == m) \lor (len(\boldsymbol{X_U}) == 0) \lor (\exists z \in \{i-6,\ldots,i-1\} \text{ such that } Accuracy(M^{(i)},\boldsymbol{X_V},\boldsymbol{Y_V}) < Accuracy(M^{(z)},\boldsymbol{X_V},\boldsymbol{Y_V}))$ then
22:               $terminate \leftarrow True$
23:               $M_{best} \leftarrow argmax_{z=0,\ldots,i}(Accuracy(M^{(z)},\boldsymbol{X_V},\boldsymbol{Y_V}))$
24:     end if
25:     $i \leftarrow i+1$
26: end while
27: return $M_{best}$
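
As an illustration only, the stopping rule and the final model selection at the end of the algorithm could be sketched in Python as below. The sketch follows the prose description of the stopping conditions (a patience of 5 iterations). The function names and the representation of validation results as a plain list are assumptions, not part of the published implementation.

```python
def should_stop(val_acc, n_unlabeled, max_iter, patience=5):
    """Decide whether self-training should terminate.

    val_acc[z] is the validation accuracy of the model from iteration z,
    so the current iteration is i = len(val_acc) - 1 (hypothetical layout).
    """
    i = len(val_acc) - 1
    if i >= max_iter:          # maximum number of iterations reached
        return True
    if n_unlabeled == 0:       # every unlabeled sample has been incorporated
        return True
    if i >= patience:
        # Assumed reading of "did not improve for the last 5 iterations":
        # none of the last `patience` models beats the best earlier model.
        best_before = max(val_acc[: i - patience + 1])
        best_recent = max(val_acc[i - patience + 1:])
        if best_recent <= best_before:
            return True
    return False


def best_iteration(val_acc):
    """Index of the model with the highest validation accuracy (first on ties)."""
    return max(range(len(val_acc)), key=lambda z: val_acc[z])
```

Returning the index of the best model rather than the last one mirrors the algorithm's final step, which keeps the model with the highest validation accuracy over all iterations.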

Evaluation of bias induction and bias mitigation methods

We performed experiments across 11 ML benchmark datasets with different characteristics to assess the effectiveness of (i) selection bias induction using the proposed hierarchy bias technique, and (ii) selection bias mitigation using the proposed (D)CAST strategies. Hierarchy bias was compared to other bias induction techniques concerning both the distribution shift produced by the data selection procedure and its effect on the performance of prediction models built using supervised learning. The (D)CAST semi-supervised bias mitigation strategies were evaluated against conventional semi-supervised self-training (ST), as well as a range of alternative domain adaptation methods, on their ability to build prediction models from biased data with better generalization than using supervised learning.

Data

In addition to 8 datasets from the UCI Data Repository (breast cancer, adult, spam, wine, raisin, rice, mushroom, and MNIST; https://archive.ics.uci.edu), we used 3 datasets from other sources: pistachio [37], fire [38], and pumpkin [39] (Supplementary Table S2). All datasets had binary class labels, except for MNIST with 10 different class labels. The breast cancer, wine, spam, rice, raisin, pistachio, pumpkin, and MNIST datasets comprised between 7 and 64 continuous features. The fire and adult datasets included mixed types of features, of which 1 and 7, respectively, were categorical. The mushroom dataset had only categorical features. For the fire, adult, and mushroom datasets, all categorical features were one-hot encoded.

Bias induction and mitigation effects on prediction performance

To evaluate bias induction and bias mitigation techniques, we investigated how prediction models trained on data affected or not by selection bias generalized to test data that was more representative of the original distribution. All models built using supervised learning or bias mitigation techniques were trained and evaluated as follows.

Figure 6: Data split for evaluation of bias induction and bias mitigation effects on prediction performance. Each dataset is randomly split into train (80%) and test (20%) sets, and 30 different train runs are created by splitting the samples in the train set randomly into labeled (30%) and unlabeled (70%) train sets. Bias induction is further applied to the labeled train sets to generate corresponding biased labeled train sets. Supervised learning is used to build models separately from the original labeled train set and from the biased labeled train set, which serve as baselines to assess the effects of bias induction and bias mitigation on prediction performance. For bias mitigation, CAST and DCAST learn prediction models using both the unlabeled and labeled train sets, while domain adaptation methods learn from the labeled train set together with the test set (without labels). All models are evaluated on the labeled test set.

Data splits and bias induction. For each dataset, 20% of the samples were uniformly selected at random, stratified by class, and reserved as test data to evaluate prediction models (Fig. 6). The adult dataset already had its own separate test set, which we reserved. Additionally, we created 30 distinct train runs per dataset, each by randomly splitting the remaining 80% of the samples into two train sets, stratified by class: a labeled train set, containing 30% of the samples, from which we also generated biased labeled sets by applying different bias induction techniques; and an unlabeled train set, comprising the remaining 70% of the samples. The original and biased labeled train sets were later used to build prediction models with supervised learning or bias mitigation strategies, while the unlabeled train set was used to learn prediction models with the semi-supervised bias mitigation strategies (D)CAST and conventional ST (other bias mitigation methods used test data without labels). When necessary for model training, a validation set was further extracted from each biased train set, given that unbiased labeled data would not be available for this purpose in a realistic setting.
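
A minimal sketch of this splitting scheme, using only the Python standard library, is given below. The helper names (`stratified_split`, `make_runs`), the rounding of per-class counts, and the seed handling are assumptions made for illustration; the published code may organize the splits differently.

```python
import random
from collections import defaultdict


def stratified_split(indices, labels, fraction, rng):
    """Split `indices` into (selected, rest), taking about `fraction` of each class."""
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    selected, rest = [], []
    for cls in sorted(by_class):
        members = by_class[cls][:]
        rng.shuffle(members)
        cut = round(fraction * len(members))  # assumed rounding rule
        selected.extend(members[:cut])
        rest.extend(members[cut:])
    return selected, rest


def make_runs(labels, n_runs=30, test_fraction=0.2, labeled_fraction=0.3, seed=0):
    """Reserve a stratified test set once, then draw `n_runs` labeled/unlabeled splits."""
    rng = random.Random(seed)
    all_idx = list(range(len(labels)))
    test_idx, train_idx = stratified_split(all_idx, labels, test_fraction, rng)
    runs = []
    for _ in range(n_runs):
        labeled_idx, unlabeled_idx = stratified_split(
            train_idx, labels, labeled_fraction, rng
        )
        runs.append({"labeled": labeled_idx, "unlabeled": unlabeled_idx})
    return test_idx, runs
```

Bias induction would then be applied to each `labeled` index list to obtain the corresponding biased labeled set, while the test indices stay fixed across all runs.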

Training of models using supervised learning or bias mitigation. To quantify the baseline prediction performance, without bias induction, we built models using supervised learning on the original labeled train set. To assess the effect of bias induction compared to the baseline, we built models using supervised learning on the biased labeled train set. Additionally, to assess the bias mitigation strategies and investigate whether they could generalize better than supervised learning on the biased labeled train set, we used them to train models on the biased labeled train set together with unlabeled data (namely the unlabeled train set for semi-supervised (D)CAST and conventional ST, or the unlabeled test set for the remaining methods). The prediction models we trained using supervised learning or bias mitigation strategies were based on three different model types: L2-regularized random forests (RF) [40], neural networks (NN) with 2 hidden layers (input, 8-node, 12-node, output), and L2-regularized logistic regression (LR) [41]. We used default parameter values (Supplementary Table S3), since fine-tuning with a biased validation set could further reinforce the bias. To account for variation introduced by randomness in the training procedures of the RF and NN models, we used different seeds to train 10 prediction models, instead of one, per run for any given combination of dataset, model type, bias induction technique, and model learning strategy.

Evaluation of models trained using supervised learning or bias mitigation. The performance of all prediction models was evaluated on the test set. We focused on quantifying prediction accuracy rather than loss, since the loss could often be improved by increasing model confidence without a measurable improvement in accuracy, which is ultimately the goal of the models under study. We report the performance results as the median test accuracy of the 10 models using different seeds per run, with a total of 30 runs, for every combination of dataset, model type, bias induction technique, and model learning strategy. Some model learning strategies did not successfully build prediction models for all runs, which is necessarily reflected in the results and corresponding figures.
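
The aggregation described above can be sketched as follows. The function name and the nested-list input are hypothetical, and skipping runs for which no model was built is an assumption about how failed runs are handled.

```python
from statistics import median


def summarize_runs(acc_by_run):
    """Median test accuracy per run.

    acc_by_run[r] holds the test accuracies of the models trained with
    different seeds in run r (hypothetical layout). Runs without any
    successfully built model are skipped (assumption).
    """
    return [median(accs) for accs in acc_by_run if accs]
```

For one combination of dataset, model type, bias induction technique, and learning strategy, this yields up to 30 values, one per run.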

Bias mitigation strategies

We assessed the proposed semi-supervised (D)CAST methods against competing bias mitigation techniques, including semi-supervised conventional self-training and alternative domain adaptation strategies.

The semi-supervised methods, (D)CAST and conventional ST, learned models using the labeled and unlabeled train sets. Additionally, (D)CAST relied on early stopping based on validation performance to make training more efficient and robust. To be fair to other methods, (D)CAST used a portion of the labeled train set for validation rather than a separate validation set. We set the following parameter values for (D)CAST across experiments: maximum number of iterations $m=100$, number of pseudo-labeled samples to include per iteration $s = 3\times|C|$ (or 3 times the number of classes), and three different diversity strengths $d=\{1,10,100\}$. In addition, the confidence threshold $t$ used by (D)CAST to select candidate samples for pseudo-labeling was set to a prediction probability of $0.9$ for NN and LR models. Since RF models generally showed lower prediction probabilities, possibly due to regularization, we defined the threshold for binary RF classification models as the 93rd percentile of all prediction probabilities on unlabeled data. This threshold was not fully optimized, only considered sufficient to allow pseudo-labeling of some samples across all datasets with binary class labels. For MNIST, probabilities were even lower given the multiclass nature of the problem, thus we set the threshold of RF models as the 85th percentile instead.
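
The parameter choices above can be illustrated with the short sketch below. All names are hypothetical. The percentile uses linear interpolation between order statistics, and the 85th percentile is applied to any non-binary RF problem; both are assumptions, since the text specifies that value only for MNIST and does not state how the percentile is computed.

```python
def pseudo_label_budget(n_classes, per_class=3):
    """Number of pseudo-labeled samples added per iteration, s = 3 x |C|."""
    return per_class * n_classes


def percentile(values, q):
    """q-th percentile with linear interpolation (assumed convention)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def confidence_threshold(model_type, unlabeled_probs, n_classes=2):
    """Confidence threshold t for selecting pseudo-labeling candidates."""
    if model_type in ("NN", "LR"):
        return 0.9                        # fixed probability threshold
    if model_type == "RF":
        q = 93 if n_classes == 2 else 85  # 85 is stated for MNIST only
        return percentile(unlabeled_probs, q)
    raise ValueError(f"unknown model type: {model_type}")
```

A percentile-based threshold adapts to the range of probabilities a model actually produces, which is why it still admits some candidates when RF probabilities rarely reach 0.9.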

Given that most semi-supervised learning approaches designed to mitigate sample selection bias are not model agnostic and do not have readily available implementations, we compared (D)CAST with the closely related conventional self-training (ST) methods. We implemented and tested two variants of conventional ST, which pseudo-labeled either the $3\times|C|$ samples with the highest prediction probabilities or all samples with prediction probabilities over 0.9. The former variant performed better and was thus selected.
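
The two candidate selection rules for conventional ST can be sketched as below. The tuple layout `(sample_id, predicted_label, probability)` and the function names are assumptions made for illustration.

```python
def select_top_k(predictions, k):
    """Variant 1: the k most confident predictions, with k = 3 x |C| in the text.

    `predictions` is a list of (sample_id, predicted_label, probability)
    tuples (hypothetical layout).
    """
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)
    return ranked[:k]


def select_above_threshold(predictions, threshold=0.9):
    """Variant 2: every prediction whose probability exceeds the threshold."""
    return [p for p in predictions if p[2] > threshold]
```

The first rule adds a fixed number of samples per iteration regardless of confidence, whereas the second can add many samples at once or none at all, depending on how confident the model is.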

We included domain adaptation methods beyond semi-supervised learning across three categories, using the implementations available in the libTLDA Python library [42]: importance weighting approaches Kernel Mean Matching (KMM [19]) and Kernel Density Estimation (KDE [22]), minimax estimation strategies Robust Bias-Aware classifier (RBA [20]) and Target Contrastive Pessimistic Risk (TCPR [32]), and subspace alignment methods Feature-Level Domain Adaptation (FLDA [31]) and Subspace Alignment classifier (SUBA [30]). All of these methods were applied as originally proposed by their authors to learn models based on the labeled train set together with the test set without labels. In addition, all methods except KMM were used exclusively with L2-regularized LR models. The KMM importance weighting approach is ML model-agnostic, since it independently calculates a weight for each sample based exclusively on the train and test data, and was therefore applied with RF, NN, and LR models.

Bias induction and sample selection methods

We compared the proposed hierarchy bias induction method against the joint and Dirichlet bias induction techniques, as well as random subsampling. Hierarchy bias was used with a fixed target of $k=30$ samples to select per class, and a bias ratio of $b=0.9$ across experiments. Random subsampling consisted of selecting $k$ samples uniformly at random per class, where $k$ was similarly set to 30. Joint bias assigns a selection probability to each sample based on its proximity to the sample mean over the labeled train data, and then independently selects samples according to their selection probabilities [19]. Joint bias induction does not include any parameter to control the number of selected samples, and it was therefore used without a fixed target number of selected biased samples. Dirichlet bias selects a subset of samples without replacement, where the biased selection probability of each sample is determined based on a random likelihood function sampled from a Dirichlet distribution [20]. This method does not consider class labels in its biased selection and was therefore set to select a total of $k\times|C|$ samples, with $|C|$ denoting the number of classes and $k=30$. Of note, hierarchy bias and random subsampling generate a biased selection that is balanced across classes, whereas joint and Dirichlet bias induction do not offer such a guarantee.
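
To make the contrast between class-balanced and label-agnostic selection concrete, the sketch below implements random subsampling of $k$ samples per class alongside a simplified proximity-based selection in the spirit of joint bias. The exponential decay with distance and the `scale` parameter are assumptions chosen for illustration and are not taken from the source or from [19]; the function names are hypothetical.

```python
import math
import random
from collections import defaultdict


def random_subsample(labels, k=30, seed=0):
    """Select k samples uniformly at random per class (class-balanced)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    selected = []
    for cls in sorted(by_class):
        members = by_class[cls]
        selected.extend(rng.sample(members, min(k, len(members))))
    return selected


def proximity_biased_selection(X, scale=1.0, seed=0):
    """Keep each sample independently, with probability decaying with its
    Euclidean distance to the sample mean (assumed functional form)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    selected = []
    for idx, row in enumerate(X):
        p = math.exp(-scale * math.dist(row, mean))
        if rng.random() < p:
            selected.append(idx)
    return selected
```

Because each sample is kept independently, the second function gives no control over the number of selected samples or their class balance, which matches the behavior noted for joint bias above.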

Bias induction impact on data distribution

In addition to the effect on supervised model prediction performance, bias induction methods were assessed on their ability to cause a distribution shift in the biased selection relative to the original labeled train set. Quantitatively, we analyzed the change in the distribution of inter-sample distances as follows. We first calculated class-specific distributions of the per-sample average Euclidean distance to all other samples in either the biased selection or the original labeled train set. We then determined the class-specific distribution shifts between the biased selection and the original data using two-sample Kolmogorov-Smirnov (KS) statistical tests. We report KS effect sizes, as well as histograms of inter-sample distances for the biased selection distribution and histogram peaks for the original data distribution.
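
The quantitative comparison can be sketched as follows for the samples of a single class. The sketch assumes that the reported KS effect size corresponds to the two-sample KS statistic, that is, the largest gap between the two empirical cumulative distribution functions; the function names are hypothetical.

```python
import math


def mean_distances(X):
    """Average Euclidean distance from each sample to all other samples in X.

    Requires at least two samples.
    """
    n = len(X)
    return [
        sum(math.dist(X[i], X[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x so ties are counted on both sides.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap
```

For one class, the shift would then be `ks_statistic(mean_distances(X_biased), mean_distances(X_original))`, where `X_biased` and `X_original` are placeholders for the feature vectors of that class in the biased selection and in the original labeled train set.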

Visually, we analyzed to what extent the biased selection was representative of the original labeled train set by inspecting 2D dimension reductions of the original data using the Uniform Manifold Approximation and Projection (UMAP) algorithm. We applied UMAP to the original labeled set with four different nearest neighbor parameter values (15, 50, 100, and 200) to obtain a reasonable representation of the sample space for each dataset.

Data availability

The data used in this article were obtained from publicly available sources, detailed in the Methods section. The raw data necessary to reproduce the experiments, along with the main experimental results for CAST and DCAST, are accessible via Figshare at doi.org/10.6084/m9.figshare.27003601.

Code availability

An implementation of the hierarchy bias and the (D)CAST methods in Python has been made available under an open source license at github.com/joanagoncalveslab/DCAST.

References

  • [1] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A survey on bias and fairness in machine learning. ACM Computing Surveys 54 (2021).
  • [2] Pessach, D. & Shmueli, E. A review on fairness in machine learning. ACM Computing Surveys 55, 1–44 (2022).
  • [3] Wu, D., Lin, D., Yao, L. & Zhang, W. Correcting sample selection bias for image classification. In 2008 3rd International Conference on Intelligent System and Knowledge Engineering, vol. 1, 1214–1220 (2008).
  • [4] Persello, C. & Bruzzone, L. Active and semisupervised learning for the classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 52, 6937–6956 (2014).
  • [5] Richards, J. W. et al. Active learning to overcome sample selection bias: Application to photometric variable star classification. The Astrophysical Journal 744, 192 (2011).
  • [6] Kremer, J., Gieseke, F., Pedersen, K. S. & Igel, C. Nearest neighbor density ratio estimation for large-scale applications in astronomy. Astronomy and Computing 12, 67–72 (2015).
  • [7] Romero, R., Iglesias, E. L. & Borrajo, L. Building biomedical text classifiers under sample selection bias. In Advances in Intelligent and Soft Computing, 11–18 (Springer Berlin Heidelberg, 2011).
  • [8] Chan, J. Y. & Cook, J. A. Inferring Zambia’s HIV prevalence from a selected sample. Applied Economics 52, 4236–4249 (2020).
  • [9] Seale, C., Tepeli, Y. & Gonçalves, J. P. Overcoming selection bias in synthetic lethality prediction. Bioinformatics 38, 4360–4368 (2022).
  • [10] Tepeli, Y. I., Seale, C. & Gonçalves, J. P. ELISL: early-late integrated synthetic lethality prediction in cancer. Bioinformatics 40 (2024).
  • [11] Chang, C.-H. & Lin, J.-H. Decision support and profit prediction for online auction sellers. In Proceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data, U ’09, 1–8 (Association for Computing Machinery, New York, NY, USA, 2009).
  • [12] Castagnetti, C., Rosti, L. & Töpfer, M. The age pay gap between young and older employees in Italy: Perceived or real discrimination against the young? In Research in Labor Economics, 195–221 (Emerald Publishing Limited, 2020).
  • [13] Shen, F., Yang, Z., Zhao, X. & Lan, D. Reject inference in credit scoring using a three-way decision and safe semi-supervised support vector machine. Information Sciences 606, 614–627 (2022).
  • [14] Melucci, M. Investigating sample selection bias in the relevance feedback algorithm of the vector space model for information retrieval. In 2014 International Conference on Data Science and Advanced Analytics (DSAA), 83–89 (2014).
  • [15] Melucci, M. Impact of query sample selection bias on information retrieval system ranking. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 341–350 (2016).
  • [16] Zhang, G. et al. Selection bias explorations and debias methods for natural language sentence matching datasets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4418–4429 (Association for Computational Linguistics, 2019).
  • [17] Chawla, N. V. & Karakoulas, G. Learning from labeled and unlabeled data: An empirical study across techniques and domains. Journal of Artificial Intelligence Research 23, 331–366 (2005).
  • [18] Smith, A. T. & Elkan, C. Making generative classifiers robust to selection bias. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, 657–666 (Association for Computing Machinery, New York, NY, USA, 2007).
  • [19] Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M. & Schölkopf, B. Correcting sample selection bias by unlabeled data. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06, 601–608 (MIT Press, Cambridge, MA, USA, 2006).
  • [20] Liu, A. & Ziebart, B. Robust classification under sample selection bias. Advances in Neural Information Processing Systems 1, 37–45 (2014).
  • [21] Kouw, W. M. & Loog, M. A review of domain adaptation without target labels. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 766–785 (2021).
  • [22] Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90, 227–244 (2000).
  • [23] Zadrozny, B. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04, 114 (Association for Computing Machinery, New York, NY, USA, 2004).
  • [24] Seah, C.-W., Tsang, I. W.-H. & Ong, Y.-S. Healing sample selection bias by source classifier selection. In 2011 IEEE 11th International Conference on Data Mining, 577–586 (2011).
  • [25] Sugiyama, M., Yamada, M. & du Plessis, M. C. Learning under nonstationarity: covariate shift and class-balance change. Wiley Interdisciplinary Reviews: Computational Statistics 5, 465–477 (2013).
  • [26] Shen, Z., Cui, P., Kuang, K., Li, B. & Chen, P. Causally regularized learning with agnostic data selection bias. In Proceedings of the 26th ACM International Conference on Multimedia, MM ’18, 411–419 (Association for Computing Machinery, New York, NY, USA, 2018).
  • [27] Diesendruck, M. et al. Importance weighted generative networks. In Machine Learning and Knowledge Discovery in Databases, 249–265 (Springer International Publishing, 2020).
  • [28] Du, W. & Wu, X. Fair and robust classification under sample selection bias. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’21, 2999–3003 (Association for Computing Machinery, New York, NY, USA, 2021).
  • [29] Blitzer, J., McDonald, R. & Pereira, F. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, 120–128 (Association for Computational Linguistics, USA, 2006).
  • [30] Fernando, B., Habrard, A., Sebban, M. & Tuytelaars, T. Unsupervised visual domain adaptation using subspace alignment. In 2013 IEEE International Conference on Computer Vision, 2960–2967 (2013).
  • [31] Kouw, W. M., Van Der Maaten, L. J. P., Krijthe, J. H. & Loog, M. Feature-level domain adaptation. Journal of Machine Learning Research 17, 5943–5974 (2016).
  • [32] Kouw, W. M. & Loog, M. Robust domain-adaptive discriminant analysis. Pattern Recognition Letters 148, 107–113 (2021).
  • [33] Fan, W. & Davidson, I. Reverse testing: An efficient framework to select amongst classifiers under sample selection bias. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, 147–156 (Association for Computing Machinery, New York, NY, USA, 2006).
  • [34] Ren, J., Shi, X., Fan, W. & Yu, P. S. Type independent correction of sample selection bias via structural discovery and re-balancing. In Proceedings of the 2008 SIAM International Conference on Data Mining (Society for Industrial and Applied Mathematics, 2008).
  • [35] McLachlan, G. J. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association 70, 365–369 (1975).
  • [36] Blum, A. & Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT’ 98, 92–100 (Association for Computing Machinery, New York, NY, USA, 1998).
  • [37] Ozkan, I. A., Koklu, M. & Saraçoğlu, R. Classification of pistachio species using improved k-nn classifier. Progress in Nutrition 23, e2021044 (2021).
  • [38] Koklu, M. & Taspinar, Y. S. Determining the extinguishing status of fuel flames with sound wave by machine learning methods. IEEE Access 9, 86207–86216 (2021).
  • [39] Koklu, M., Sarigil, S. & Ozbek, O. The use of machine learning methods in classification of pumpkin seeds (Cucurbita pepo L.). Genetic Resources and Crop Evolution 68, 2713–2726 (2021).
  • [40] Ho, T. K. Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition, vol. 1, 278–282 (IEEE, 1995).
  • [41] Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
  • [42] Kouw, W. wmkouw/libtlda v0.1 (2018). URL https://doi.org/10.5281/zenodo.1214315.

Acknowledgements
The authors received funding from the US National Institutes of Health [U54EY032442, U54DK134302, U01DK133766, R01AG078803 to J.P.G.]. The authors are solely responsible for the research; the funders were not involved in the work. The authors further acknowledge the High-Performance Compute (HPC) cluster of the Department of Intelligent Systems at the Delft University of Technology.

Author contributions
Conceptualization, Y.I.T. and J.P.G.; Methodology, Y.I.T. and J.P.G.; Validation and Formal Analysis, Y.I.T.; Software, Y.I.T.; Investigation, Y.I.T. and J.P.G.; Writing – Original Draft, Y.I.T.; Writing – Review & Editing, J.P.G.; Funding Acquisition and Supervision, J.P.G.

Competing interests
The authors declare no competing interests.