DCAST: Diverse Class-Aware Self-Training Mitigates Selection Bias for Fairer Learning

Yasin I. Tepeli Department of Intelligent Systems, Faculty EEMCS, Delft, Netherlands Joana P. Gonçalves Department of Intelligent Systems, Faculty EEMCS, Delft, Netherlands Correspondence: [email protected]

Abstract

Fairness in machine learning seeks to mitigate model bias against individuals based on sensitive features such as sex or age, often caused by an uneven representation of the population in the training data due to selection bias. Notably, bias unascribed to sensitive features is challenging to identify and typically goes undiagnosed, despite its prominence in complex high-dimensional data from fields like computer vision and molecular biomedicine. Strategies to mitigate unidentified bias and evaluate mitigation methods are crucially needed, yet remain underexplored. We introduce: (i) Diverse Class-Aware Self-Training (DCAST), model-agnostic mitigation aware of class-specific bias, which promotes sample diversity to counter confirmation bias of conventional self-training while leveraging unlabeled samples for an improved representation of the underlying population; (ii) hierarchy bias, multivariate and class-aware bias induction without prior knowledge. Models learned with DCAST showed improved robustness to hierarchy and other biases across eleven datasets, against conventional self-training and six prominent domain adaptation techniques. Advantage was largest for higher-dimensional datasets, suggesting DCAST as a promising strategy to achieve fairer learning beyond identifiable bias.

Introduction

As predictive machine learning (ML) increasingly makes its way to applications with an impact on society, one major concern is to ensure that ML models deliver fair predictions and do not discriminate against individuals in the population. Selection bias is one of the most prominent sources of unfairness in ML, whereby the data used to build ML models is not representative of the real-world and thus violates the fundamental assumption of ML that it is independently drawn and identically distributed to the underlying population.

Research on fairness in ML has focused on mitigating (selection) bias associated with legally protected or sensitive features, such as sex, age, or skin color [1, 2]. However, biases can be indirectly linked to sensitive features via proxies not recognized as sensitive [1, 2], or they can be unrelated to sensitive features and still lead to unfairness. Ultimately, biases are likely to remain undiagnosed and be propagated by ML models without scrutiny when a link to sensitive features is challenging to identify. Unknown biases are often present when data is complex and high-dimensional, data collection is non-random, and knowledge of the domain is incomplete. We argue that unfairness mitigation should thus address bias more generally, beyond what can be ascribed to sensitive features. This issue has received attention across fields, including computer vision [3, 4], astronomy [5, 6], biomedicine and healthcare [7, 8, 9, 10], finance and economics [11, 12, 13], information retrieval [14, 15], and language [7, 16]. Nevertheless, its impact is typically overlooked, resulting in models with optimistic performance estimates due to bias-unaware evaluation. We identify two key areas for improvement, namely evaluation of ML model robustness to bias, and ML bias mitigation.

Evaluation is crucial to ensure that ML models generalize and are robust to bias, but assessing performance on data representative of the real-world distribution is rarely achievable. Independent test data is not always available or guaranteed to be unbiased, and conventional data splits do not create train-test distribution shifts suitable for model bias evaluation. A viable alternative is to induce bias in the train set and assess the learned model on the original test set. Common bias induction approaches include subsampling using univariate selection probabilities, based on values or the distribution of one feature [17, 18]. This is, however, not representative of multivariate biases typically present in complex high-dimensional data. Existing methods to induce multivariate bias include: joint bias [19], which favors the selection of samples closer to the mean; and Dirichlet bias [20], which assigns sample selection likelihoods based on a Dirichlet distribution. Both methods ignore class labels and thus do not generate class-specific biases. They might also cause class imbalances for otherwise balanced data.

We propose hierarchy bias, a multivariate class-aware bias induction technique to produce complex class-specific biases. Hierarchy bias identifies distinctly distributed groups of samples in the original data using clustering, and then generates a biased selection by influencing the representation of one group of samples relative to the others. Selection is performed per class to induce class-specific bias, aiming for an identical number of samples per class to ensure class balance.

Methods to mitigate bias in ML generally fall in the scope of domain adaptation (DA, [21]), seeking to adapt a model to the distribution shift between the source training domain and a target prediction domain. Relevant DA categories span importance weighting, subspace alignment, inference-based, and semi-supervised learning methods. Importance weighting (IW) weighs training samples based on their relevance to the test set, using probability ratios or discrepancy measures [22, 23, 11, 24, 25, 6, 26, 27, 19, 28]. Since IW assumes that the train set contains the support of the test set and most features contribute to the prediction, it can be less effective with high-dimensional data or small sample sizes. Subspace alignment (SA) transforms the data representation [29, 30, 31], assuming there is a common subspace where transformed train and test sets exhibit matching conditional probabilities, which may be difficult to optimize if many transformations fit. Inference-based (IB) methods include minimax estimation [20, 32], where loss minimization is coupled with an adversarial maximization objective that steers the model to fit more conservatively, aiming for improved generalization. The IB methods may underperform if the model choice is less suitable for the test set. Overall, most IW, SA, and IB methods adapt the model for one target test set, which can hamper generalizability. Semi-supervised learning (SSL) leverages unlabeled samples to provide model learning with insight into the underlying population distribution. The most benefit can in principle be achieved by using as much unlabeled data as available, though some SSL approaches still adapt to individual test sets [33, 34]. Unlabeled samples are typically incorporated by SSL using self-training (ST) [35] or co-training (CT) [36], which assign predicted pseudo-labels to unlabeled samples and select a subset of these to include at each training iteration.
Sample selection is often based on prediction confidence according to the model trained thus far, which may strengthen existing bias or create other biases such as class imbalance for originally balanced data [4, 5]. Attempts to mitigate this behavior include, for instance, the P3SVM support vector machine (SVM) [4] that selects pseudo-labeled samples distant from each other and located within the margins furthest away from the decision boundary. This method is however SVM-specific, and its sample selection dependent on the size of the margin may limit the contribution of unlabeled data. In summary, most DA methods mitigate distribution shifts for one test set at a time, leading to ML models with limited generalizability beyond the train and test domains. It remains to be investigated if generalization could be improved by training on additional unlabeled data. Semi-supervised learning offers this possibility, but existing methods fall short in actively mitigating bias present in the data or further induced during model learning. Finally, many DA methods are model-specific and cannot be applied to different types of ML models.

To improve bias mitigation, we propose Diverse Class-Aware Self-Training (DCAST), a model-agnostic semi-supervised learning framework that gradually incorporates unlabeled data in a class-aware manner, guided by two active bias mitigation strategies. The core CAST strategy addresses class-specific bias by selecting a set of pseudo-labeled samples to include separately per class, using a relaxed confidence threshold, with options to preserve the class ratios of the original labeled train set or to add the same number of pseudo-labeled samples per class at each iteration. The extended DCAST strategy seeks to counter confidence-induced bias by further selecting diverse pseudo-labeled samples, as measured by inter-sample distances in the learned discriminative embedding or the original feature space.

We evaluate both hierarchy bias induction and (D)CAST bias mitigation across eleven datasets, against competing approaches including Dirichlet and joint bias as well as conventional self-training and six domain adaptation techniques. Specifically, we investigate which bias induction method induces the most challenging type of selection bias, leading to the strongest impact on ML model prediction performance. We further assess to what extent the class-awareness and diversity in (D)CAST improve robustness to bias, both across datasets and compared to the alternative bias mitigation strategies, while coupling model-agnostic (D)CAST with three types of ML models.

Results and Discussion

The proposed hierarchy bias induction and (D)CAST bias mitigation methods aim to provide, respectively: (i) a more realistic type of class-aware multivariate selection bias for the evaluation of ML model robustness to bias, and (ii) class-aware and diversity-guided strategies to learn ML models with improved generalizability in the presence of selection bias. We briefly introduce these techniques and discuss their evaluation across 11 datasets using logistic regression (LR), random forest (RF), and 2-hidden layer neural network (NN) prediction models. Every dataset was randomly partitioned into 80% train and 20% test, with the test data reserved for prediction model evaluation (Methods). Effects of bias induction on the data and model prediction performance were assessed over 30 runs, each relying on a random split of the train set into labeled (30%) and unlabeled (70%) train sets. The labeled train set was used for bias induction and for training ML models, either intact or upon bias induction. For bias mitigation, unlabeled data was additionally used during training, where conventional self-training (ST) and (D)CAST leveraged the unlabeled train set, and other domain adaptation techniques exploited the unlabeled test set instead (Methods).
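The partitioning protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `split_indices` and its parameters are hypothetical, the split is plain random (whether the original splits were stratified is not stated here), and the test set is fixed once while the labeled/unlabeled split is redrawn per run.

```python
import numpy as np


def split_indices(n_samples, seed=0, test_frac=0.2, labeled_frac=0.3, n_runs=30):
    """Return test indices and a list of (labeled, unlabeled) index pairs."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_test = int(round(test_frac * n_samples))
    test, train = perm[:n_test], perm[n_test:]   # 20% test, 80% train

    n_labeled = int(round(labeled_frac * train.size))
    runs = []
    for _ in range(n_runs):
        shuffled = rng.permutation(train)        # fresh split of the train set
        runs.append((shuffled[:n_labeled], shuffled[n_labeled:]))
    return test, runs
```

Bias induction would then be applied to the labeled indices of each run only, leaving the unlabeled and test indices untouched.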

Hierarchy bias induces effective multivariate and class-specific selection bias

Hierarchy bias generates a biased selection of samples for a given dataset, aiming to deviate from the original data distribution by skewing the representation of a group of samples that is deemed closer together in feature space than the remaining samples (Fig. 1). The approach selects $k$ samples per class and controls group representation using bias ratio $b$ as follows. A class-specific group of at least $k$ closely related samples is first identified using agglomerative hierarchical clustering. To obtain the biased selection, $k\times b$ samples are chosen uniformly at random from the identified group and $k\times(1-b)$ samples are chosen uniformly at random from the remaining samples (Methods).

Figure 1: Hierarchy bias approach for induction of selection bias. Given input data $\boldsymbol{X}$ with labels $\boldsymbol{Y}$, number of samples to select $k$, and bias ratio $b\in[0,1]$, hierarchy bias selects $k$ samples per class $c$: $k\times b$ from a specific group and $k\times(1-b)$ from the remaining samples. Each class-specific candidate group (for class $c$) is identified via agglomerative hierarchical clustering with Euclidean distances and Ward linkage of the $c$-labeled samples until a cluster of size $\geq k$ is obtained, from which $k\times b$ samples are drawn uniformly at random. The $k\times(1-b)$ samples are drawn uniformly at random from the remaining $c$-labeled samples.

To evaluate bias induction, we assessed the ability to generate a distribution shift between the biased selection and the original data, as well as the impact of the induced shift on ML model prediction performance. We compared hierarchy bias with $b=0.9$ to random subsampling and two alternative bias induction techniques: joint bias [19] and Dirichlet bias [20]. Hierarchy bias and random subsampling were set to select 30 samples per class, whereas Dirichlet targeted 60 and 300 samples in total respectively for binary and multiclass labeled datasets. Note that Dirichlet and joint bias do not take class labels into account when performing their selection, and joint bias does not allow control over the selected number of samples.

Effect on data distribution.

We first assessed the effect of bias induction on the distribution of distances between samples. The underlying idea is that a biased selection would exclude portions of the original data that deviate from the rest of the samples to some extent, thus making inter-sample distances closer on average. For each dataset, we obtained class-specific distributions of the per sample average Euclidean distance to all other samples. We further quantified the deviation between the class-specific distance distributions obtained for the biased selection and the original labeled set using Kolmogorov-Smirnov (KS) tests. Hierarchy bias ($b=0.9$) induced the most significant shift in the distance distributions for all 11 datasets (KS effect sizes $>0.65$, $p$-values $<0.05$; Fig. 2a and Supplementary Fig. S1-S2), and primarily towards smaller average inter-sample distances, in line with the selection of close samples that hierarchy bias is designed to produce. Random selection resulted in the most similar distance distributions to the original data, with the smallest KS effect for 8 datasets. Dirichlet and joint bias led to more modest shifts than hierarchy bias, with joint bias generally showing larger KS effects than Dirichlet (9 of 11 datasets). We also examined the samples selected from each labeled train set in the feature space, reduced to 2 dimensions (2D) using Uniform Manifold Approximation and Projection (UMAP) for an example run 11. Hierarchy bias selected samples from specific clusters or regions of the feature space. This was apparent across datasets (Supplementary Fig. S3), for instance hierarchy bias ignored samples in the top right area of the 2D space for the wine dataset (Fig. 2b), selected from specific clusters of the mushroom dataset (Fig. 2c), and focused on the top left and bottom right areas of the 2D space for the fire dataset (Fig. 2d).
In contrast, samples selected by random selection, as well as by the Dirichlet and joint biases, were spread throughout the 2D space and thus more representative of the original labeled train set for all datasets (Supplementary Fig. S4-S6). For random sampling, this was expected, given that no particular bias was introduced. For joint bias the result was also unsurprising, seeing that it selected the largest proportions of samples across datasets and thus captured most of the data (overall mean average 63%, minimum 44%, and maximum 80%; for hierarchy bias: 17%, 0.4%, and 67%; Supplementary Table S1).
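The distance-based shift measure used in this analysis can be illustrated with a short sketch. It assumes SciPy is available and computes, for one class at a time, the per-sample average Euclidean distance to all other samples, followed by a two-sample KS test between the biased selection and the original labeled set. The function names are hypothetical, and details such as how the self-distance is handled are assumptions rather than the authors' stated procedure.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import ks_2samp


def mean_distance_per_sample(X):
    """Average Euclidean distance from each sample to all other samples."""
    D = cdist(X, X)                           # pairwise Euclidean distances
    return D.sum(axis=1) / (X.shape[0] - 1)   # exclude the zero self-distance


def distance_shift(X_original, X_selected):
    """KS statistic and p-value between the two distance distributions.

    Both inputs are expected to hold samples of a single class.
    """
    result = ks_2samp(mean_distance_per_sample(X_selected),
                      mean_distance_per_sample(X_original))
    return result.statistic, result.pvalue
```

A KS statistic close to 1 indicates that the selected samples sit much closer together (or further apart) than samples in the original set, while a value near 0 indicates similar distance distributions.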

Impact on prediction performance.

We evaluated the impact of bias induction on the classification accuracy of supervised ML models for the 11 datasets across 30 runs. Per run, we trained 2-hidden layer neural network (NN), random forest (RF), and logistic regression (LR) models using the original labeled train set (No Bias) or a selection of its samples. The latter was obtained either by random subsampling or using Dirichlet, joint, or hierarchy bias induction. All models were evaluated on the original test set. The induced bias led to a decrease in accuracy with every technique except joint bias (Fig. 2e), which as previously mentioned selected most of the original samples and thus did not induce particularly strong bias. Hierarchy bias caused the largest decrease in accuracy for all datasets except MNIST, where the most impact was seen with joint bias (Fig. 2e). Note that the preset targets on the number of samples to select for hierarchy bias, Dirichlet bias, and random selection led these methods to select 64-70% of the MNIST samples per class compared to 46-60% with joint bias. This larger coverage of the original data likely limited the ability of hierarchy and Dirichlet bias to produce an effective biased selection for MNIST. Overall, hierarchy bias consistently selected samples in close proximity, leading to a significant shift in inter-sample distances and a bias towards class-specific parts of the original distribution. This caused a marked decrease in prediction accuracy of supervised ML models relative to other bias induction techniques.

Diverse class-aware self-training (DCAST) for selection bias mitigation

The proposed (D)CAST semi-supervised learning strategies (Fig. 3) aim to mitigate selection bias by leveraging insight from unlabeled data about the underlying distribution of the population. Both rely on self-training to gradually incorporate unlabeled data: at each training iteration, the learnt model is used to predict pseudo-labels for all unlabeled samples, from which a subset of $s$ samples ( $s_{c}$ per class) is selected to be included in the labeled set for the next iteration. To address class-related bias, sample selection is done separately per class as follows. First, a set of $s\times d$ candidates is selected as the most confidently predicted samples with prediction probability above a threshold $t$ , where $s$ and $d$ denote the number of samples to select and diversity strength. For CAST ( $d=1$ ), this directly results in the final set of $s$ pseudo-labeled samples to add for the next iteration. The DCAST selection ( $d>1$ ) extends upon CAST to mitigate confidence-related bias through sample diversity, reducing the set of $s\times d$ candidates to a final set of $s$ diverse pseudo-labeled samples. Capturing diverse sample groups is achieved via hierarchical clustering of the candidate samples into $s$ clusters ( $s_{c}$ per class), followed by selection of diverse samples comprising the most confidently predicted sample per cluster. To ensure (D)CAST remains model-agnostic, sample distances for clustering can be based on discriminative embeddings learnt by the model or the original feature space.
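The per-class selection step described above can be sketched as follows. This is a hedged illustration rather than the reference implementation: the function name `select_for_class` is hypothetical, the use of Ward linkage for the diversity clustering is an assumption (the text only specifies hierarchical clustering), and `features` stands for either the model embedding or the original feature vectors.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def select_for_class(proba_c, features, s_c, d=1, t=0.9):
    """Pick up to s_c unlabeled samples to pseudo-label as one class.

    proba_c  : predicted probability of the class for each unlabeled sample
    features : vectors used for inter-sample distances (embedding or raw)
    d        : diversity strength; d == 1 gives CAST, d > 1 gives DCAST
    """
    eligible = np.flatnonzero(proba_c > t)
    # Most confident first; keep at most s_c * d candidates.
    order = eligible[np.argsort(-proba_c[eligible], kind="stable")]
    candidates = order[: s_c * d]
    if d == 1 or candidates.size <= s_c:
        return candidates[:s_c]               # confidence-only selection

    # Group candidates into at most s_c clusters (linkage is an assumption).
    Z = linkage(features[candidates], method="ward")
    labels = fcluster(Z, t=s_c, criterion="maxclust")
    chosen = []
    for lab in np.unique(labels):
        members = candidates[labels == lab]
        chosen.append(members[np.argmax(proba_c[members])])  # best per cluster
    return np.array(chosen)
```

With $t>0.5$, a sample passing the threshold for one class cannot pass it for another, so calling this function once per class yields disjoint selections.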

Diversity and class-awareness in (D)CAST improve bias mitigation via self-training

To evaluate (D)CAST bias mitigation, we first assessed its test prediction accuracy against supervised learning and conventional self-training (ST) [35] on the biased labeled train set, with additional unlabeled samples for self-training strategies. Training and evaluation were performed for 11 datasets over 30 runs as previously described, using RF, NN, and LR models. We induced hierarchy bias with ratio $b=0.9$, as this type of selection bias showed the most impact on supervised models compared to Dirichlet and joint bias (Fig. 2e). The (D)CAST method was assessed without diversity (CAST, $d=1$) or with diversities $d\in\{10,100\}$ (DCAST-10, DCAST-100), and was set to include $s=3\times$ (number of classes) pseudo-labeled samples per iteration, for at most $m=100$ iterations, using prediction threshold $t=0.9$ (or the 85th or 93rd percentile in the case of RF models). Conventional ST selected the $3\times$ (number of classes) most confidently predicted samples per iteration (Methods, Bias mitigation strategies). Concerning the mitigation of hierarchy bias with ratio $b=0.9$, with NN models the semi-supervised (D)CAST strategies significantly improved generalizability over supervised learning across all 11 datasets ($p<0.05$ with one-sided Wilcoxon signed-rank tests, Fig. 4b). Specifically, class-awareness with moderate diversity (DCAST-10) was significantly better than supervised learning on the 11 datasets, whereas class-awareness alone (CAST) or coupled with stronger diversity (DCAST-100) both improved on 10 datasets and remained comparable respectively on the fire and adult datasets. By contrast, conventional ST was significantly worse than supervised learning on 10 datasets with NN models.

Using RF and LR models, mitigation of hierarchy bias with ratio $b=0.9$ was more modest. Semi-supervised (D)CAST and ST performed comparably to supervised learning on most datasets (8 with RF and 7 with LR models; Fig. 4a,c), possibly due to the use of regularization, which could hamper model adaptation. We thus saw only occasional statistically significant changes and smaller effect sizes with RF and LR models. Notably, the higher diversity strategy DCAST-100 led to the only significant improvement of semi-supervised over supervised learning using RF models, on the MNIST dataset (Fig. 4a). Also with RF models, CAST and DCAST-10 decreased accuracy on MNIST, while ST decreased accuracy on 3 datasets (wine, MNIST, and pistachio; Fig. 4a). With LR models, (D)CAST strategies improved over supervised learning on 4 datasets (MNIST, spam, raisin, and pistachio), whereas ST improved on 3 datasets (spam, raisin, and pumpkin) but also caused a decrease on the wine dataset (Fig. 4c).
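The significance testing reported here relies on one-sided Wilcoxon signed-rank tests. A minimal sketch with SciPy is given below, assuming that accuracies are paired per run (the same 30 runs for both methods); the helper name `significantly_better` and the pairing scheme are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import wilcoxon


def significantly_better(acc_method, acc_baseline, alpha=0.05):
    """One-sided paired test: are acc_method values greater than acc_baseline?

    Both arrays hold one accuracy per run, in the same run order.
    Note that scipy raises an error if all paired differences are zero.
    """
    acc_method = np.asarray(acc_method, dtype=float)
    acc_baseline = np.asarray(acc_baseline, dtype=float)
    result = wilcoxon(acc_method, acc_baseline, alternative="greater")
    return bool(result.pvalue < alpha), float(result.pvalue)
```

Testing for a significant decrease amounts to swapping the two arguments.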

Experiments with alternative bias induction techniques revealed similar findings, where (D)CAST bias mitigation consistently outperformed ST across datasets under random subsampling (Supplementary Fig. S7), and under induced Dirichlet or joint bias (Supplementary Figs. S8-S9). Again, we saw the largest performance differences with NN models, coinciding with the most improvement of (D)CAST and weakest results of ST over supervised learning.

In summary, (D)CAST effectively mitigated selection bias induced by different techniques when paired with non-regularized NN models, and was not outperformed by supervised learning or conventional ST with regularized RF and LR models. In contrast, conventional ST struggled to recover from the bias with all three types of models, especially NNs. These results suggest that the class-awareness and diversity features introduced to the pseudo-labeling procedure in (D)CAST provide a promising semi-supervised learning strategy to mitigate selection bias.

Semi-supervised (D)CAST bias mitigation is superior to competing domain adaptation

We also evaluated (D)CAST against bias mitigation techniques beyond semi-supervised learning. This included importance weighting methods KMM [19] and KDE [22], minimax approaches RBA [20] and TCPR [32], and subspace alignment methods FLDA [31] and SUBA [30]. All methods were trained on the biased labeled train set and evaluated on the original test set, with (D)CAST further incorporating samples from the unlabeled train set and the remaining methods using unlabeled test samples during training. The (D)CAST and KMM approaches were coupled with RF, NN, and LR models, while the remaining methods used LR only as per the original work.

Similar to our previous findings, CAST and DCAST-100 were the most robust bias mitigation methods. Overall, these strategies preserved or significantly improved over the supervised learning performance across the 3 model types and 11 datasets, with the exception of CAST showing a decrease in accuracy for MNIST when used with RF models (Fig. 4-5). In contrast, KMM led to significant decreases in accuracy for 8 datasets with NN models, as well as for 5 and 6 datasets respectively with LR and RF models. As for the remaining bias mitigation methods using only LR models, KDE resulted in significant decreases in performance for all except the rice dataset. Apart from an improvement with RBA for the pistachio dataset, the RBA and SUBA methods degraded performance significantly for 6 and 9 datasets, respectively. The best competing methods were FLDA and TCPR, which showed significant improvements respectively for 5 and 4 datasets (FLDA: breast cancer, spam, raisin, pistachio, and pumpkin; TCPR: wine, rice, adult, and pistachio). The FLDA approach also led to significant decreases for 4 datasets (wine, mushroom, MNIST, and fire), while TCPR caused a significant decrease for the fire dataset. Concerning the MNIST dataset, TCPR failed to build models for most runs and caused a clear performance drop for the few remaining ones, resulting in insufficient power to determine statistical significance. Overall, CAST and DCAST-100 demonstrated consistent ability to match or outperform supervised learning in the presence of hierarchy bias compared to other bias mitigation methods. The gap was most evident on the multi-class classification problem (MNIST), where the other methods resulted in drastic decreases in performance.

Conclusion

We put forth two contributions to improve the learning of prediction models in the presence of selection bias. First, a bias induction approach termed hierarchy bias to enable the evaluation of complex multivariate bias effects on the generalizability of prediction models. Second, a model-agnostic semi-supervised learning framework named (D)CAST that exploits unlabeled data in a class-aware manner and promotes sample diversity to mitigate selection bias.

Hierarchy bias uses clustering to isolate one distinct group of samples per class and then skews the representation of such group during sample selection to induce class-specific multivariate bias, allowing control over the level of bias through a bias ratio parameter. Induced hierarchy bias showed a stronger impact on the distribution of inter-sample distances and proved more challenging for prediction models to overcome, compared to joint and Dirichlet bias.

The (D)CAST model learning strategy progressively incorporates unlabeled samples using self-training, which is further made class-aware in CAST by pseudo-labeling confidently predicted unlabeled samples over a given threshold per class. Its extended variant, DCAST, seeks to counter confidence-associated bias with sample diversity by clustering and selecting pseudo-labeled samples from distinct groups, using distances based on either the discriminative embeddings provided by the underlying model or the original feature representation.

Both class-awareness and diversity proved effective, leading to significant improvements in the bias mitigation ability of (D)CAST over conventional self-training across datasets and bias induction techniques. Models trained by (D)CAST also outperformed other models built using six alternative domain adaptation methods, comprising different importance weighting, minimax estimation, and subspace alignment approaches.

Diversity strength was shown to influence the extent of (D)CAST bias mitigation, where a larger value resulted in improved robustness to selection bias. More generally, we recommend setting the diversity strength parameter such that the number of candidate samples considered for selection at each iteration is significantly larger than the number of samples to select. We further suggest choosing a number of samples to select per iteration comfortably below the size of the training set to promote a gradual adaptation of the model, but not too small so that the added samples can have an impact: a possible choice could be the closest even number to $\lfloor\sqrt{N}\rfloor$ , with $N$ denoting the size of the training set. The confidence threshold can be adjusted according to the distribution of prediction probabilities of the model to allow (D)CAST to consider at least as many samples as the number to add at each iteration.
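As a small worked example of the suggested number of samples per iteration, the helper below returns the even number closest to $\lfloor\sqrt{N}\rfloor$. The function name is hypothetical, and since an odd $\lfloor\sqrt{N}\rfloor$ is equally close to two even numbers, rounding up in that case is an assumption made here.

```python
import math


def suggested_samples_per_iteration(n_train):
    """Closest even number to floor(sqrt(N)); odd roots are rounded up."""
    root = math.isqrt(n_train)               # floor(sqrt(N)) for integer N
    return root if root % 2 == 0 else root + 1
```

For instance, a training set of 100 samples gives 10 samples per iteration, and a training set of 30 samples gives 6.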

We demonstrated that (D)CAST is model-agnostic through application with random forests (RF), neural networks (NN), and logistic regression (LR) models. The success of bias mitigation differed across architectures, with the most benefit achieved using NN models. We hypothesized that the use of regularization could also have played a role, by restricting model adaptation and thus limiting the contribution of unlabeled samples in the RF and LR models. Further investigation would be needed to obtain conclusive evidence.

Overall, our results present (D)CAST and hierarchy bias as promising strategies to improve the learning and evaluation of machine learning models in the presence of selection bias, as an essential step in striving towards fairness in machine learning.

Methods

Hierarchy bias induction and (D)CAST bias mitigation

Notation.

We denote the input data (sample $\times$ feature) matrix as $\boldsymbol{X}\in\mathbb{R}^{N\times{F}}$ , the input label matrix as $\boldsymbol{Y}\in\{0,1\}^{N\times{C}}$ , and output prediction probability matrix as $\boldsymbol{\bar{Y}}\in\mathbb{R}^{N\times{C}}$ , where $N$ is the number of samples, $F$ is the number of features, and $C$ is the number of classes. Following this notation, $\boldsymbol{x}_{n}\in\mathbb{R}^{1\times{F}}$ is the feature vector of sample $n\in\{1,2,...,N-1,N\}$ , $y_{n}^{c}$ is the binary label of sample $n$ for class $c\in\{1,2,...,C-1,C\}$ (1 if assigned, 0 otherwise), and $\bar{y}_{n}^{c}$ is the prediction probability of sample $n$ being of class $c$ where $\sum_{c=1}^{C}{y_{n}^{c}}=1$ and $\sum_{c=1}^{C}{\bar{y}_{n}^{c}}=1$ .

Hierarchy bias

Hierarchy bias induction generates a biased selection of samples from a given dataset in a class-aware and multivariate manner. The idea is that the samples belonging to each class in the dataset can be seen as originating from a mixture of multivariate distributions. Based on this, the goal is to identify one of the mixtures and then make a skewed selection of samples by controlling the representation of the target mixture over the remaining samples. Hierarchy bias induction takes as input a data matrix $\boldsymbol{X}$ , a label matrix $\boldsymbol{Y}$ , a parameter $k$ denoting the number of samples to select per class, and a bias parameter $b\in[0,1]$ denoting the ratio of samples that should be selected from the identified mixture (Alg. 1). The output is a biased selection of samples, generated as follows. Agglomerative hierarchical clustering is first applied to identify a mixture of interest per class $c$ , corresponding to a cluster of at least $k$ samples. We perform the clustering for class $c$ using all samples from matrix $\boldsymbol{X}$ labeled with class $c$ , with Euclidean inter-sample distances on the original feature vectors and Ward linkage between clusters (Alg. 1, lines 4-5). Once the cluster is identified, the final biased selection is obtained by choosing $k\times b$ samples uniformly at random from the cluster and choosing another $k-k\times b$ samples uniformly at random from the remaining samples not in the cluster (Alg. 1, lines 6-8).

Algorithm 1 Hierarchy Bias

Input: $\boldsymbol{X}$, $\boldsymbol{Y}$, $k$, $b$
1: $Selection\leftarrow\emptyset$
2: $k_{cluster}\leftarrow k\times b$
3: $k_{rest}\leftarrow k-k\times b$
4: for each class $c\in C$
5:     Apply agglomerative clustering with Euclidean distance and Ward linkage to $\boldsymbol{X}_{S_{c}}$, where $S_{c}=\{n:y_{n}^{c}=1\}$
6:     $Cluster\leftarrow$ set of samples from the first cluster that reaches a number of samples $\geq k$
7:     $S_{cluster}\leftarrow$ set of $k_{cluster}$ samples selected uniformly at random from $Cluster$
8:     $S_{rest}\leftarrow$ set of $k_{rest}$ samples selected uniformly at random from the remaining samples (not in $Cluster$)
9:     $Selection\leftarrow Selection\cup S_{cluster}\cup S_{rest}$
10: end for
11: return $Selection$
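The hierarchy bias procedure can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: labels are passed as an integer vector `y` instead of the one-hot matrix $\boldsymbol{Y}$, $k\times b$ is rounded to the nearest integer, the number of samples drawn from the remaining set is capped when too few are available, and the function names are hypothetical. The "first cluster that reaches $\geq k$ samples" is found by walking the Ward merge sequence returned by SciPy in order.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def first_cluster_of_size(X_c, k):
    """Member indices of the first Ward cluster that reaches >= k samples."""
    n = X_c.shape[0]
    Z = linkage(X_c, method="ward")          # Ward linkage, Euclidean distances
    members = {i: [i] for i in range(n)}     # every sample starts as a cluster
    for step, (a, b, _dist, size) in enumerate(Z):
        merged = members.pop(int(a)) + members.pop(int(b))
        members[n + step] = merged           # SciPy ids new clusters as n + step
        if size >= k:
            return np.array(merged)
    return np.arange(n)                      # fallback: fewer than k samples


def hierarchy_bias(X, y, k, b, seed=0):
    """Return indices of a biased, class-balanced selection of samples."""
    rng = np.random.default_rng(seed)
    k_cluster = int(round(k * b))            # assumption: round to nearest
    k_rest = k - k_cluster
    selection = []
    for c in np.unique(y):
        idx_c = np.flatnonzero(y == c)       # samples labeled with class c
        cluster = idx_c[first_cluster_of_size(X[idx_c], k)]
        rest = np.setdiff1d(idx_c, cluster)
        n_rest = min(k_rest, rest.size)      # assumption: cap if too few remain
        selection.append(rng.choice(cluster, size=k_cluster, replace=False))
        selection.append(rng.choice(rest, size=n_rest, replace=False))
    return np.concatenate(selection)
```

With $k=30$ and $b=0.9$, as used in the experiments, this draws 27 samples per class from the identified cluster and 3 from the remaining samples of that class.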

(D)CAST - Diverse Class-Aware Self-Training

The proposed semi-supervised model learning framework, Diverse Class-Aware Self-Training (DCAST), leverages unlabeled data to gain insight into the underlying distribution of the population that may not be well represented by the labeled data. It does this using self-training (ST), and actively addresses selection bias by preserving class ratios or balance (CAST), and optionally also incorporating sample diversity into the pseudo-labeling process to counter biases present in the data or introduced during training (DCAST).

More formally, the (D)CAST method takes as input the labeled data $\{\boldsymbol{X_{L}},\boldsymbol{Y_{L}}\}$ and unlabeled data $\boldsymbol{X_{U}}$ to learn from, validation data $\{\boldsymbol{X_{V}},\boldsymbol{Y_{V}}\}$ for early stopping, and the following four additional parameters: maximum number of iterations $m$ , number of pseudo-labeled samples $s$ to select per iteration, confidence or prediction probability threshold $t\in[0,1]$ , and integer diversity parameter $d\geq 1$ . Model learning in (D)CAST is then performed by self-training as follows. At iteration $i$ , model $M^{(i)}$ is trained on the labeled data $\{\boldsymbol{X_{L^{(i)}}},\boldsymbol{Y_{L^{(i)}}}\}$ , and used to make predictions $\boldsymbol{\bar{Y}_{U^{(i)}}}$ for all samples in the unlabeled set $U^{(i)}$ (and matrix ${\boldsymbol{X_{U^{(i)}}}}$ ). As with regular self-training, a pseudo-labeling procedure then selects a subset of the unlabeled samples, $S^{(i)}\subseteq U^{(i)}$ , to be incorporated into model learning (Fig. 3). The selected samples $S^{(i)}$ are pseudo-labeled and included in the set of labeled samples for training in the subsequent iteration, $L^{(i+1)}=L^{(i)}\cup S^{(i)}$ , as well as removed from the unlabeled set $U^{(i+1)}=U^{(i)}\setminus S^{(i)}$ . Matrices ${\boldsymbol{X_{L^{(i+1)}}}}$ , ${\boldsymbol{Y_{L^{(i+1)}}}}$ , and ${\boldsymbol{X_{U^{(i+1)}}}}$ are also updated for the next iteration accordingly.

Pseudo-labeling in (D)CAST: class-aware with and without diversity.

The (D)CAST-specific pseudo-labeling is accomplished by the Diversity Module (Fig. 3). The core CAST strategy addresses class-specific bias by performing the pseudo-labeling separately per class, offering to either preserve the class ratios found in the original labeled set or select an equal number of samples per class at each iteration. Its extension, DCAST, aims for further bias mitigation by promoting sample diversity. In conventional self-training, the pseudo-labeling procedure tends to confirm and follow biases potentially present in the labeled set: either by selecting unlabeled samples similar to the original labeled samples (in feature space) or by selecting unlabeled samples whose prediction the model is most confident about. In contrast, (D)CAST seeks to mitigate this behavior and work against the strengthening of existing bias during training. To achieve this, (D)CAST selects and pseudo-labels samples that are diverse amongst each other and also more dissimilar to the possibly biased labeled samples. The (D)CAST pseudo-labeling (Alg. 2) comprises the following steps per training iteration:

Step 1. (D)CAST - Select candidate samples for pseudo-labeling based on model confidence. The goal of Step 1 is to select a set of candidate unlabeled samples for pseudo-labeling and inclusion in model training. This corresponds to the $s\times class\_ratio(c)\times d$ most confidently predicted unlabeled samples per class $c$, with corresponding probabilities in $\boldsymbol{\bar{Y}_{U^{(i)}}}$ larger than a user-defined threshold $t$ or a baseline threshold $r=1.2/|C|$, whichever is larger (Alg. 2, lines 8-10). For CAST, with $d=1$ and thus no diversity strategy, this selection directly yields the final set of $s$ pseudo-labeled samples ($s_{c}=s\times class\_ratio(c)$ per class) to incorporate during learning in the subsequent iteration. For DCAST, with $d>1$ (Alg. 2, lines 11-15), the selected set of $s\times d$ samples ($s_{c}\times d$ per class) represents a larger pool of candidates, which the diversity strategy narrows down to the final selected set of $s$ samples ($s_{c}$ per class). Our recommendation for DCAST is to set the confidence threshold $t$ and diversity parameter $d$ not too strictly, so as to allow for a sufficient number (and diversity) of candidate samples.
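The candidate selection of Step 1 could look roughly like the sketch below. The function name `select_candidates`, the balanced default for the class ratios, the rounding of $s_{c}$, and the requirement that class $c$ is also the most probable class of a candidate are assumptions made for illustration; `proba` stands for the matrix of predicted class probabilities for the unlabeled samples.

```python
import numpy as np


def select_candidates(proba, s, d, t, class_ratio=None):
    """Return {class index: candidate sample indices}, most confident first."""
    n_classes = proba.shape[1]
    if class_ratio is None:                      # balanced selection by default
        class_ratio = np.full(n_classes, 1.0 / n_classes)
    t_c = max(t, 1.2 / n_classes)                # user threshold or baseline r
    predicted = proba.argmax(axis=1)
    candidates = {}
    for c in range(n_classes):
        s_c = int(round(s * class_ratio[c]))     # rounding is an assumption
        # Keep samples predicted as class c with probability above the threshold.
        eligible = np.flatnonzero((predicted == c) & (proba[:, c] > t_c))
        order = np.argsort(-proba[eligible, c], kind="stable")
        candidates[c] = eligible[order][: s_c * d]
    return candidates
```

For $d=1$ the returned candidates are already the samples to pseudo-label; for $d>1$ they form the pool passed on to the diversity steps.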

Step 2. DCAST - Diversity: Create representations of candidate samples for distance calculation. From the set of $s\times d$ candidate samples selected in Step 1, DCAST aims to extract the subset of $s$ diverse samples. Diversity is assessed based on pairwise sample distances, calculated using a specific sample vector representation or embedding (denoted for all candidate samples as matrix $\boldsymbol{E^{(i)}}\in\mathbb{R}^{(s\times d)\times{v}}$ , where $v$ is the embedding vector size). Preferably, DCAST uses discriminative embeddings based on the learnt model $M^{(i)}$ , where two types are currently supported. For a random forest, each sample representation corresponds to a one-hot encoded vector of the prediction of that sample across all the leaves of the decision trees in the forest; for a neural network, the sample representation corresponds to the embedding based on the hidden layer closest to the output layer. For models without discriminative embeddings, such as SVM or LR, DCAST uses the original feature vector representation.
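As an illustration of the random forest representation, the sketch below one-hot encodes leaf memberships. It assumes a hypothetical integer matrix `leaves` of shape (samples, trees) holding the leaf reached by each sample in each tree, such as the output of the `apply` method of a scikit-learn forest; encoding only the leaves reached by the candidate samples is a simplification of this sketch.

```python
import numpy as np


def leaf_one_hot(leaves):
    """One-hot encode leaf memberships.

    leaves: integer array of shape (n_samples, n_trees); entry [n, j] is the
    leaf that sample n reaches in tree j.
    """
    leaves = np.asarray(leaves)
    blocks = []
    for j in range(leaves.shape[1]):
        leaf_ids = np.unique(leaves[:, j])       # leaves reached in tree j
        # Broadcasting compares each sample's leaf against every leaf id.
        blocks.append((leaves[:, [j]] == leaf_ids).astype(float))
    return np.hstack(blocks)
```

Each row of the resulting matrix contains exactly one nonzero entry per tree, so two samples share a nonzero column only when they fall into the same leaf of the same tree.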

Step 3. DCAST - Diversity: Calculate pairwise distances between candidate samples. To assess diversity, we use distances between samples: the larger the distances amongst samples in a given set, the more diverse the set will be considered. Distances are calculated by DCAST based on sample embeddings or original feature vector representations (Alg. 2, line 12). With discriminative embeddings, DCAST calculates normalized distances as ${1-(E{\cdot}E^{T})/\max(E{\cdot}E^{T})}$, given an embedding matrix $E\in\mathbb{R}^{(s\times d)\times v}$. Specifically, for a random forest model, these distances represent the normalized frequency of non-co-occurrence of a pair of samples in the leaves of the decision trees. With original feature vectors, DCAST uses Euclidean distances between sample vectors instead.
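The normalized distance for discriminative embeddings can be written in a few lines of NumPy. The function name `embedding_distances` is hypothetical, and the sketch only restates the formula ${1-(E{\cdot}E^{T})/\max(E{\cdot}E^{T})}$ for an embedding matrix with one row per candidate sample.

```python
import numpy as np


def embedding_distances(E):
    """Distances 1 - (E E^T) / max(E E^T) for an embedding matrix E (rows = samples)."""
    E = np.asarray(E, dtype=float)
    similarity = E @ E.T                 # pairwise dot products between samples
    return 1.0 - similarity / similarity.max()
```

For one-hot leaf embeddings of a forest, the dot product counts the trees in which two samples share a leaf, so a distance of 1 means the pair never co-occurs in a leaf and a distance of 0 means it always does.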

Step 4. DCAST - Diversity: Identify distinct clusters and select diverse samples to pseudo-label. The distances calculated in Step 3 are used in Step 4 to select diverse samples, potentially capturing different aspects of the pool of candidates and its underlying distribution. To do this, DCAST first identifies $s$ (or $s_{c}$ per class) distinct groups of candidate samples using a clustering algorithm (Alg. 2, line 13). The current implementation relies on agglomerative hierarchical clustering with single linkage; however, any other algorithm of choice could be employed. Given that clustering is designed to maximize inter-cluster distances, samples across the different clusters are likely to yield the largest distances and thus the most diversity under the employed clustering strategy. Accordingly, DCAST selects a single sample per identified cluster to pseudo-label, namely the candidate sample with the highest confidence $\bar{y}^{c}_{n}$ value (sample $n$ and class $c$, Alg. 2, line 14).
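A possible sketch of Step 4 with SciPy is shown below. The function name `pick_diverse` and the argument layout (a square distance matrix `D` and a vector `confidence` of predicted probabilities for the candidates of one class) are assumptions, and the released implementation may rely on a different clustering routine.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def pick_diverse(D, confidence, s_c):
    """Cluster candidates into at most s_c groups and keep the most confident of each."""
    D = np.asarray(D, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    if len(confidence) <= s_c:                   # nothing to narrow down
        return np.arange(len(confidence))
    # linkage expects the condensed (upper-triangular) form of the distance matrix.
    Z = linkage(squareform(D, checks=False), method="single")
    labels = fcluster(Z, t=s_c, criterion="maxclust")
    chosen = [
        members[np.argmax(confidence[members])]
        for members in (np.flatnonzero(labels == g) for g in np.unique(labels))
    ]
    return np.array(sorted(chosen))
```

Note that the `maxclust` criterion guarantees no more than $s_{c}$ clusters, so tied merge distances may yield fewer selected samples than requested.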

Step 5. (D)CAST - Pseudo-label selected samples. At the end of each iteration, selected samples in the set $S_{c}$ are added to the labeled data matrices $\{\boldsymbol{X_{L}},\boldsymbol{Y_{L}}\}$ and removed from the unlabeled data matrix $\boldsymbol{X_{U}}$ .

Time Complexity of (D)CAST.

To derive an upper bound for the worst-case time complexity of the (D)CAST algorithm, we assume the following time complexities for an input of $n$ samples defined over $v$ features: training a base prediction model is $O(T(n,v))$ , making predictions using the trained model is $O(P(n,v))$ , and calculating pairwise sample distances and applying hierarchical clustering is $O((n\times v)^{2})$ .

At iteration $i$, the time complexity of (D)CAST is dominated by the following operations: retraining the model with $l+i\times s$ labeled samples in $O(T(l+i\times s,v))$ time (Alg. 2, line 5), making predictions for $u-i\times s$ unlabeled samples in $O(P(u-i\times s,v))$ time (Alg. 2, line 6), and applying hierarchical clustering with pairwise distances to at most $s\times d$ candidate unlabeled samples in $O((s\times d\times v)^{2})$ time (Alg. 2, lines 12-13). Here, $l$ denotes the number of labeled samples in the input matrices $\{\boldsymbol{X_{L}},\boldsymbol{Y_{L}}\}$ and $u$ the number of unlabeled samples in the input matrix $\boldsymbol{X_{U}}$ at the start of the execution, while $i\times s$ denotes the number of samples that have been pseudo-labeled up to iteration $i$ (and thus added to the labeled data and removed from the unlabeled data). The maximum possible number of samples for prediction at any one iteration is therefore $u$, before any pseudo-labeling has occurred, leading to the upper bound $O(P(u,v))$ on the prediction time per iteration. Similarly, $u$ is the maximum number of samples that can be added to the input labeled data (initially containing $l$ samples) over all iterations, which determines the upper bound $O(T(l+u,v))$ on the training time per iteration. Combining these, each iteration takes $O(T(l+u,v)+P(u,v)+(s\times d\times v)^{2})$ time, and therefore the upper bound on the worst-case time complexity of $m$ iterations is $O(m\times(T(l+u,v)+P(u,v)+(s\times d\times v)^{2}))$.
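The argument can be summarized compactly as follows, writing $u$ for the initial number of unlabeled samples, so that $u-i\times s$ samples remain unlabeled at iteration $i$. The shorthand $\mathrm{cost}(i)$ is introduced here for illustration only, and the second line assumes that $T$ and $P$ are non-decreasing in the number of samples.

```latex
\begin{align*}
\mathrm{cost}(i) &= O\big(T(l+i\times s,\,v)\big) + O\big(P(u-i\times s,\,v)\big) + O\big((s\times d\times v)^{2}\big) \\
                 &\le O\big(T(l+u,\,v) + P(u,\,v) + (s\times d\times v)^{2}\big),
                   \qquad \text{since } 0 \le i\times s \le u, \\
\sum_{i=0}^{m-1}\mathrm{cost}(i) &\le O\Big(m\times\big(T(l+u,\,v) + P(u,\,v) + (s\times d\times v)^{2}\big)\Big).
\end{align*}
```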

Algorithm 2 (D)CAST - Diverse Class-Aware Self-Training

1: Input: $T$ (model type); $\boldsymbol{X_{L}}$, $\boldsymbol{Y_{L}}$ (labeled train data); $\boldsymbol{X_{V}}$, $\boldsymbol{Y_{V}}$ (labeled validation data); $\boldsymbol{X_{U}}$ (unlabeled data); $s$ (number of samples to select per iteration); $t$ (prediction probability threshold); $d$ (diversity strength); $m$ (maximum number of iterations).
2: $terminate\leftarrow False$
3: $i\leftarrow 0$
4: while $terminate==False\lor i=m$
5:     $M^{(i)}\leftarrow$ train model instance of type $T$ with $\boldsymbol{X_{L}}$, $\boldsymbol{Y_{L}}$
6:     $\bar{Y}\leftarrow$ predict class probability for samples in $\boldsymbol{X_{U}}$ using $M^{(i)}$
7:     for each class $c\in C$
8:         $s_{c}\leftarrow s\times class\_ratio(c)$
9:         $t_{c}\leftarrow\max(t,r)$
10:         $S_{c}\leftarrow$ top $s_{c}\times d$ confidently predicted samples with $max(\bar{y}_{n}^{c})>t_{c}$
11:         if $d>1$ then
12:             $E\leftarrow$ calculate pairwise distances for samples in $S_{c}$
13:             $Clusters\leftarrow$ apply agglomerative clustering to obtain $s_{c}$ clusters using distances $E$ and single linkage
14:             $S_{c}\leftarrow$ choose the sample with the highest prediction probability from each cluster in $Clusters$
15:         end if
16:         for each selected sample $n\in S_{c}$
17:             $\boldsymbol{X_{L}}.\textrm{add}(\boldsymbol{x_{n}})$; $\boldsymbol{Y_{L}}.\textrm{add}(\boldsymbol{y_{n}})$; $\boldsymbol{X_{U}}.\textrm{remove}(\boldsymbol{x_{n}})$
18:         end for
19:     end for
20:     $\triangleright$ Stopping conditions: maximum number of iterations $m$ is reached OR all unlabeled samples have been incorporated OR validation accuracy did not improve for the last 5 iterations.
21:     if ($i==m$) $\lor$ ($len(\boldsymbol{X_{U}})==0$) $\lor$ ($\exists z\in\{i-6,\ldots,i-1\}$ such that $Accuracy(M^{(i)},\boldsymbol{X_{V}},\boldsymbol{Y_{V}})<Accuracy(M^{(z)},\boldsymbol{X_{V}},\boldsymbol{Y_{V}})$) then
22:         $terminate\leftarrow True$
23:         $M_{best}\leftarrow argmax_{z=0,...,i}(Accuracy(M^{(z)},\boldsymbol{X_{V}},\boldsymbol{Y_{V}}))$
24:     end if
25:     $i\leftarrow i+1$
26: end while
27: return $M_{best}$
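To make the control flow of Algorithm 2 more concrete, the sketch below implements the class-aware loop without the diversity step ($d=1$, that is, CAST) for any model object exposing `fit`, `predict`, and `predict_proba` in the style of scikit-learn. Several details are assumptions made for brevity: the function name `cast_self_training`, balanced selection per class, a simple patience counter as the early stopping rule, stopping when no candidate passes the threshold, `fit` returning the fitted model, and probability columns ordered like the sorted class labels.

```python
import numpy as np


def cast_self_training(make_model, X_l, y_l, X_v, y_v, X_u, s=6, t=0.9, m=100, patience=5):
    """Class-aware self-training without the diversity step (d = 1)."""
    classes = np.unique(y_l)
    s_c = max(1, s // len(classes))              # balanced selection per class
    t_c = max(t, 1.2 / len(classes))             # user threshold or baseline r
    best_model, best_acc, since_best = None, -1.0, 0
    for i in range(m + 1):
        model = make_model().fit(X_l, y_l)
        acc = float(np.mean(model.predict(X_v) == y_v))
        if acc > best_acc:                       # track the best validation model
            best_model, best_acc, since_best = model, acc, 0
        else:
            since_best += 1
        if i == m or len(X_u) == 0 or since_best >= patience:
            break
        proba = model.predict_proba(X_u)
        predicted = proba.argmax(axis=1)
        picked, labels = [], []
        for j, c in enumerate(classes):
            eligible = np.flatnonzero((predicted == j) & (proba[:, j] > t_c))
            top = eligible[np.argsort(-proba[eligible, j])][:s_c]
            picked.extend(top)
            labels.extend([c] * len(top))
        if not picked:                           # no confident candidates left
            break
        picked = np.array(picked)
        # Move the pseudo-labeled samples from the unlabeled to the labeled set.
        X_l = np.vstack([X_l, X_u[picked]])
        y_l = np.concatenate([y_l, labels])
        X_u = np.delete(X_u, picked, axis=0)
    return best_model
```

Here `make_model` is any zero-argument callable returning a fresh, untrained model. Extending the sketch to DCAST would amount to enlarging the per-class pool to $s_{c}\times d$ candidates and narrowing it down by clustering before the samples are moved.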

Evaluation of bias induction and bias mitigation methods

We performed experiments across 11 ML benchmark datasets with different characteristics to assess the effectiveness of (i) selection bias induction using the proposed hierarchy bias technique, and (ii) selection bias mitigation using the proposed (D)CAST strategies. Hierarchy bias was compared to other bias induction techniques concerning both the distribution shift produced by the data selection procedure and its effect on the performance of prediction models built using supervised learning. The (D)CAST semi-supervised bias mitigation strategies were evaluated against conventional semi-supervised self-training (ST), as well as a range of alternative domain adaptation methods, on their ability to build prediction models from biased data with better generalization than using supervised learning.

Data

In addition to 8 datasets from the UCI Data Repository (breast cancer, adult, spam, wine, raisin, rice, mushroom, and MNIST; https://archive.ics.uci.edu), we also used 3 datasets from other sources, including the pistachio [37], fire [38], and pumpkin [39] datasets (Supplementary Table S2). All datasets had binary class labels, except for MNIST with 10 different class labels. The breast cancer, wine, spam, rice, raisin, pistachio, pumpkin and MNIST datasets comprised between 7 and 64 continuous features. The fire and adult datasets included mixed types of features, of which 1 and 7 were respectively categorical features. The mushroom dataset only had categorical features. For the fire, adult, and mushroom datasets, all categorical features were one-hot encoded.

Bias induction and mitigation effects on prediction performance

To evaluate bias induction and bias mitigation techniques, we investigated how prediction models trained on data affected or not by selection bias generalized to test data that was more representative of the original distribution. All models built using supervised learning or bias mitigation techniques were trained and evaluated as follows.

Data splits and bias induction. For each dataset, 20% of the samples were uniformly selected at random, stratified by class, and reserved as test data to evaluate prediction models (Fig. 6). The adult dataset already had its own separate test set, which we reserved. Additionally, we created 30 distinct train runs per dataset, each by randomly splitting the remaining 80% of the samples into two train sets, stratified by class: a labeled train set, containing 30% of the samples, from which we also generated biased labeled sets by applying different bias induction techniques; and an unlabeled train set, comprising the remaining 70% of the samples. The original and biased labeled train sets were later used to build prediction models with supervised learning or bias mitigation strategies, while the unlabeled train set was used to learn prediction models with the semi-supervised bias mitigation strategies (D)CAST and conventional ST (other bias mitigation methods used test data without labels). When necessary for model training, a validation set was further extracted from each biased train set, given that unbiased labeled data would not be available for this purpose in a realistic setting.
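The splitting scheme can be sketched with NumPy as follows. The helper names `stratified_split` and `make_splits`, the rounding of per-class counts, and the use of a single random generator across runs are assumptions of this sketch; bias induction and validation set extraction are not shown.

```python
import numpy as np


def stratified_split(indices, y, fraction, rng):
    """Split indices into (part, remainder), drawing `fraction` of each class into part."""
    part = []
    for c in np.unique(y[indices]):
        members = rng.permutation(indices[y[indices] == c])
        part.extend(members[: int(round(fraction * len(members)))])
    part = np.array(sorted(part))
    return part, np.setdiff1d(indices, part)


def make_splits(y, n_runs=30, seed=0):
    """One 20% test set, then n_runs labeled (30%) / unlabeled (70%) splits of the rest."""
    rng = np.random.default_rng(seed)
    everything = np.arange(len(y))
    test, train = stratified_split(everything, y, 0.20, rng)
    runs = [stratified_split(train, y, 0.30, rng) for _ in range(n_runs)]
    return test, runs                            # runs[r] = (labeled, unlabeled)
```

Each entry of `runs` holds the indices of the labeled train set, to which bias induction would then be applied, and of the unlabeled train set for that run.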

Training of models using supervised learning or bias mitigation. To quantify the baseline prediction performance, without bias induction, we built models using supervised learning on the original labeled train set. To assess the effect of bias induction compared to the baseline, we built models using supervised learning on the biased labeled train set. Additionally, to assess the bias mitigation strategies and investigate if they could generalize better than supervised learning on the biased labeled train set, we used them to train models on the biased labeled train set together with unlabeled data (namely the unlabeled train set for semi-supervised (D)CAST and conventional ST, or the unlabeled test set for the remaining methods). The prediction models we trained using supervised learning or bias mitigation strategies were based on three different model types: L2-regularized random forests (RF, [40]), 2 hidden-layered (input, 8-node, 12-node, output) neural networks (NN), and L2-regularized logistic regression (LR) [41]. We used default parameter values (Supplementary Table S3), since fine-tuning with a biased validation set could further reinforce the bias. To account for variation introduced by randomness in the training procedures of the RF and NN models, we used different seeds to train 10 prediction models instead of one per run for any given combination of dataset, model type, bias induction technique, and model learning strategy.

Evaluation of models trained using supervised learning or bias mitigation. The performance of all prediction models was evaluated on the test set. We focused on quantifying prediction accuracy rather than loss, since the loss could often be improved by increasing model confidence without a measurable improvement in accuracy, which is ultimately the goal of the models under study. We report the performance results as the median test accuracy of the 10 models using different seeds per run, with a total of 30 runs, for every combination of dataset, model type, bias induction technique, and model learning strategy. Some model learning strategies did not successfully build prediction models for all runs, which is necessarily reflected in the results and corresponding figures.

Bias mitigation strategies

We assessed the proposed semi-supervised (D)CAST methods against competing bias mitigation techniques, including semi-supervised conventional self-training and alternative domain adaptation strategies.

The semi-supervised methods, (D)CAST and conventional ST, learned models using the labeled and unlabeled train sets. Additionally, (D)CAST relied on early stopping based on validation performance to make training more efficient and robust. To be fair to other methods, (D)CAST used a portion of the labeled train set for validation rather than a separate validation set. We set the following parameter values for (D)CAST across experiments: maximum number of iterations $m=100$, number of pseudo-labeled samples to include per iteration $s$ as $3\times|C|$ (or 3 times the number of classes), and three different diversity strengths $d\in\{1,10,100\}$. In addition, the confidence threshold $t$ used by (D)CAST to select candidate samples for pseudo-labeling was set to a prediction probability of $0.9$ for NN and LR models. Since RF models generally showed lower prediction probabilities, possibly due to regularization, we defined the threshold for binary RF classification models as the 93rd percentile of all prediction probabilities on unlabeled data. This threshold was not extensively optimized; it was merely chosen to be permissive enough to allow pseudo-labeling of some samples across all datasets with binary class labels. For MNIST, probabilities were even lower given the multiclass nature of the problem, so we set the threshold for RF models to the 85th percentile instead.
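A percentile-based threshold of this kind could be computed as in the sketch below. The function name `percentile_threshold` is hypothetical, and pooling the probabilities of all classes before taking the percentile is an assumption, since the text does not specify how the probabilities are aggregated.

```python
import numpy as np


def percentile_threshold(proba, percentile=93.0):
    """Confidence threshold set at a percentile of all predicted probabilities."""
    return float(np.percentile(np.asarray(proba).ravel(), percentile))
```

Under these assumptions, a call with `percentile=85.0` would correspond to the MNIST setting for RF models.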

Given that most semi-supervised learning approaches designed to mitigate sample selection bias are not model agnostic and do not have readily available implementations, we compared (D)CAST with the closely related conventional self-training (ST) methods. We implemented and tested two variants of conventional ST, which pseudo-labeled either the $3\times|C|$ samples with the highest prediction probabilities or all samples with prediction probabilities over 0.9. The former variant performed better and was thus selected.

We included domain adaptation methods beyond semi-supervised learning across three categories, using implementations available in the libTLDA Python library [42]: importance weighting approaches Kernel Mean Matching (KMM [19]) and Kernel Density Estimation (KDE [22]), minimax estimation strategies Robust Bias-Aware classifier (RBA [20]) and Target Contrastive Pessimistic Risk (TCPR [32]), and subspace alignment methods Feature-Level Domain Adaptation (FLDA [31]) and Subspace Alignment classifier (SUBA [30]). All of these methods were applied as originally proposed by their authors to learn models based on the labeled train set together with the test set without labels. In addition, all methods except KMM were used exclusively with L2-regularized LR models. The KMM importance weighting approach is ML model-agnostic, since it independently calculates a weight for each sample based exclusively on the train and test data, and was therefore applied with RF, NN, and LR models.

Bias induction and sample selection methods

We compared the proposed hierarchy bias induction method against the joint and Dirichlet bias induction techniques, as well as random subsampling. Hierarchy bias was used with a fixed target of $k=30$ samples to select per class, and a bias ratio of $b=0.9$ across experiments. Random subsampling consisted of selecting $k$ samples uniformly at random per class, where $k$ was similarly set to 30. Joint bias assigns a selection probability to each sample based on its proximity to the sample mean over the labeled train data, and then independently selects samples according to their selection probabilities [19]. Joint bias induction does not include any parameter to control the number of selected samples, and it was therefore used without a fixed target number of selected biased samples. Dirichlet bias selects a subset of samples without replacement, where the biased selection probability of each sample is determined based on a random likelihood function sampled from a Dirichlet distribution [20]. This method does not consider class labels in its biased selection and was therefore set to select a total of $k\times|C|$ samples, with $|C|$ denoting the number of classes and $k=30$. Of note, hierarchy bias and random subsampling generate a selection that is balanced across classes, whereas joint and Dirichlet bias induction do not offer such a guarantee.

Bias induction impact on data distribution

In addition to the effect on supervised model prediction performance, bias induction methods were assessed on their ability to cause a distribution shift in the biased selection relative to the original labeled train set. Quantitatively, we analyzed the change in the distribution of inter-sample distances as follows. We first calculated class-specific distributions of the per sample average Euclidean distance to all other samples in either the biased selection or the original labeled train set. We then determined the class-specific distribution shifts between the biased selection and the original data using two-sample Kolmogorov-Smirnov (KS) statistical tests. We report KS effect sizes, as well as histograms of inter-sample distances for the biased selection distribution and histogram peaks for the original data distribution.
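The quantitative comparison for one class can be sketched with SciPy as below. The helper names are hypothetical, and treating the KS statistic as the reported effect size is an assumption of this sketch; the inputs are the feature matrices of a single class in the original labeled train set and in the biased selection.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import ks_2samp


def mean_distance_to_others(X):
    """Average Euclidean distance from each sample to all other samples in X."""
    D = cdist(X, X)                              # Euclidean by default
    return D.sum(axis=1) / (len(X) - 1)          # the zero self-distance is excluded


def distribution_shift(X_original, X_biased):
    """KS statistic and p-value comparing the two mean-distance distributions."""
    result = ks_2samp(mean_distance_to_others(X_original),
                      mean_distance_to_others(X_biased))
    return result.statistic, result.pvalue
```

A statistic close to 0 indicates that the biased selection preserves the distribution of inter-sample distances of the original data, whereas values close to 1 indicate a strong shift.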

Visually, we analyzed to what extent the biased selection was representative of the original labeled train set by inspecting 2D dimension reductions of the original data using the Uniform Manifold Approximation and Projection (UMAP) algorithm. We applied UMAP to the original labeled set with four different nearest neighbor parameter values (15, 50, 100, and 200) to obtain a reasonable representation of the sample space for each dataset.

Data availability

The data used in this article were obtained from publicly available sources, detailed in the Methods section. The raw data necessary to reproduce the experiments, along with the main experimental results for CAST and DCAST, are accessible via Figshare at doi.org/10.6084/m9.figshare.27003601.

Code availability

An implementation of the hierarchy bias and the (D)CAST methods in Python has been made available under an open source license at github.com/joanagoncalveslab/DCAST.

References

[1] Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K. & Galstyan, A. A survey on bias and fairness in machine learning. ACM Computing Surveys 54 (2021).
[2] Pessach, D. & Shmueli, E. A review on fairness in machine learning. ACM Computing Surveys 55, 1–44 (2022).
[3] Wu, D., Lin, D., Yao, L. & Zhang, W. Correcting sample selection bias for image classification. In 2008 3rd International Conference on Intelligent System and Knowledge Engineering, vol. 1, 1214–1220 (2008).
[4] Persello, C. & Bruzzone, L. Active and semisupervised learning for the classification of remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 52, 6937–6956 (2014).
[5] Richards, J. W. et al. Active learning to overcome sample selection bias: Application to photometric variable star classification. The Astrophysical Journal 744, 192 (2011).
[6] Kremer, J., Gieseke, F., Pedersen, K. S. & Igel, C. Nearest neighbor density ratio estimation for large-scale applications in astronomy. Astronomy and Computing 12, 67–72 (2015).
[7] Romero, R., Iglesias, E. L. & Borrajo, L. Building biomedical text classifiers under sample selection bias. In Advances in Intelligent and Soft Computing, 11–18 (Springer Berlin Heidelberg, 2011).
[8] Chan, J. Y. & Cook, J. A. Inferring zambia’s HIV prevalence from a selected sample. Applied Economics 52, 4236–4249 (2020).
[9] Seale, C., Tepeli, Y. & Gonçalves, J. P. Overcoming selection bias in synthetic lethality prediction. Bioinformatics 38, 4360–4368 (2022).
[10] Tepeli, Y. I., Seale, C. & Gonçalves, J. P. ELISL: early-late integrated synthetic lethality prediction in cancer. Bioinformatics 40 (2024).
[11] Chang, C.-H. & Lin, J.-H. Decision support and profit prediction for online auction sellers. In Proceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data, U ’09, 1–8 (Association for Computing Machinery, New York, NY, USA, 2009).
[12] Castagnetti, C., Rosti, L. & Töpfer, M. The age pay gap between young and older employees in italy: Perceived or real discrimination against the young? In Research in Labor Economics, 195–221 (Emerald Publishing Limited, 2020).
[13] Shen, F., Yang, Z., Zhao, X. & Lan, D. Reject inference in credit scoring using a three-way decision and safe semi-supervised support vector machine. Information Sciences 606, 614–627 (2022).
[14] Melucci, M. Investigating sample selection bias in the relevance feedback algorithm of the vector space model for information retrieval. In 2014 International Conference on Data Science and Advanced Analytics (DSAA), 83–89 (2014).
[15] Melucci, M. Impact of query sample selection bias on information retrieval system ranking. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 341–350 (2016).
[16] Zhang, G. et al. Selection bias explorations and debias methods for natural language sentence matching datasets. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4418–4429 (Association for Computational Linguistics, 2019).
[17] Chawla, N. V. & Karakoulas, G. Learning from labeled and unlabeled data: An empirical study across techniques and domains. Journal of Artificial Intelligence Research 23, 331–366 (2005).
[18] Smith, A. T. & Elkan, C. Making generative classifiers robust to selection bias. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’07, 657–666 (Association for Computing Machinery, New York, NY, USA, 2007).
[19] Huang, J., Smola, A. J., Gretton, A., Borgwardt, K. M. & Scholkopf, B. Correcting sample selection bias by unlabeled data. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS’06, 601–608 (MIT Press, Cambridge, MA, USA, 2006).
[20] Liu, A. & Ziebart, B. Robust classification under sample selection bias. Advances in Neural Information Processing Systems 1, 37–45 (2014).
[21] Kouw, W. M. & Loog, M. A review of domain adaptation without target labels. IEEE Transactions on Pattern Analysis and Machine Intelligence 43, 766–785 (2021).
[22] Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of Statistical Planning and Inference 90, 227–244 (2000).
[23] Zadrozny, B. Learning and evaluating classifiers under sample selection bias. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04, 114 (Association for Computing Machinery, New York, NY, USA, 2004).
[24] Seah, C.-W., Tsang, I. W.-H. & Ong, Y.-S. Healing sample selection bias by source classifier selection. In 2011 IEEE 11th International Conference on Data Mining, 577–586 (2011).
[25] Sugiyama, M., Yamada, M. & du Plessis, M. C. Learning under nonstationarity: covariate shift and class-balance change. Wiley Interdisciplinary Reviews: Computational Statistics 5, 465–477 (2013).
[26] Shen, Z., Cui, P., Kuang, K., Li, B. & Chen, P. Causally regularized learning with agnostic data selection bias. In Proceedings of the 26th ACM International Conference on Multimedia, MM ’18, 411–419 (Association for Computing Machinery, New York, NY, USA, 2018).
[27] Diesendruck, M. et al. Importance weighted generative networks. In Machine Learning and Knowledge Discovery in Databases, 249–265 (Springer International Publishing, 2020).
[28] Du, W. & Wu, X. Fair and robust classification under sample selection bias. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM ’21, 2999–3003 (Association for Computing Machinery, New York, NY, USA, 2021).
[29] Blitzer, J., McDonald, R. & Pereira, F. Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, 120–128 (Association for Computational Linguistics, USA, 2006).
[30] Fernando, B., Habrard, A., Sebban, M. & Tuytelaars, T. Unsupervised visual domain adaptation using subspace alignment. In 2013 IEEE International Conference on Computer Vision, 2960–2967 (2013).
[31] Kouw, W. M., Van Der Maaten, L. J. P., Krijthe, J. H. & Loog, M. Feature-level domain adaptation. Journal of Machine Learning Research 17, 5943–5974 (2016).
[32] Kouw, W. M. & Loog, M. Robust domain-adaptive discriminant analysis. Pattern Recognition Letters 148, 107–113 (2021).
[33] Fan, W. & Davidson, I. Reverse testing: An efficient framework to select amongst classifiers under sample selection bias. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’06, 147–156 (Association for Computing Machinery, New York, NY, USA, 2006).
[34] Ren, J., Shi, X., Fan, W. & Yu, P. S. Type independent correction of sample selection bias via structural discovery and re-balancing. In Proceedings of the 2008 SIAM International Conference on Data Mining (Society for Industrial and Applied Mathematics, 2008).
[35] McLachlan, G. J. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association 70, 365–369 (1975).
[36] Blum, A. & Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT’ 98, 92–100 (Association for Computing Machinery, New York, NY, USA, 1998).
[37] Ozkan, I. A., Koklu, M. & Saraçoğlu, R. Classification of pistachio species using improved k-nn classifier. Progress in Nutrition 23, e2021044 (2021).
[38] Koklu, M. & Taspinar, Y. S. Determining the extinguishing status of fuel flames with sound wave by machine learning methods. IEEE Access 9, 86207–86216 (2021).
[39] Koklu, M., Sarigil, S. & Ozbek, O. The use of machine learning methods in classification of pumpkin seeds (cucurbita pepo l.). Genetic Resources and Crop Evolution 68, 2713–2726 (2021).
[40] Ho, T. K. Random decision forests. In Proceedings of 3rd international conference on document analysis and recognition, vol. 1, 278–282 (IEEE, 1995).
[41] Pedregosa, F. et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
[42] Kouw, W. wmkouw/libtlda v0.1 (2018). URL https://doi.org/10.5281/zenodo.1214315.

Acknowledgements
The authors received funding from the US National Institutes of Health [U54EY032442, U54DK134302, U01DK133766, R01AG078803 to J.P.G.]. The authors are solely responsible for the research; the funders were not involved in the work. The authors further acknowledge the High-Performance Compute (HPC) cluster of the Department of Intelligent Systems at the Delft University of Technology.

Author contributions
Conceptualization, Y.I.T. and J.P.G.; Methodology, Y.I.T. and J.P.G.; Validation and Formal Analysis, Y.I.T.; Software, Y.I.T.; Investigation, Y.I.T. and J.P.G.; Writing – Original Draft, Y.I.T.; Writing – Review & Editing, J.P.G.; Funding Acquisition and Supervision, J.P.G.

Competing interests
The authors declare no competing interests.