11institutetext: Lunit Inc., Seoul, Republic of Korea
11email: {samia.belhadj, tony.superb, ambika.seth, heshamdar, tkooi}@lunit.io

Positive-Sum Fairness: Leveraging Demographic Attributes to Achieve Fair AI Outcomes Without Sacrificing Group Gains

Samia Belhadj∗, 11    Sanguk Park∗, 11 0009-0005-0538-5522    Ambika Seth 11    Hesham Dar 11 0009-0003-6458-2097    Thijs Kooi 11 0000-0001-7701-7837
Abstract

Fairness in medical AI is increasingly recognized as a crucial aspect of healthcare delivery. While most of the prior work done on fairness emphasizes the importance of equal performance, we argue that decreases in fairness can be either harmful or non-harmful, depending on the type of change and how sensitive attributes are used. To this end, we introduce the notion of positive-sum fairness, which states that an increase in performance that results in a larger group disparity is acceptable as long as it does not come at the cost of individual subgroup performance. This allows sensitive attributes correlated with the disease to be used to increase performance without compromising on fairness.

We illustrate this idea by comparing four CNN models that make different use of the race attribute in the training phase. The results show that removing all demographic encodings from the images helps close the gap in performance between the different subgroups, whereas leveraging the race attribute as a model’s input increases the overall performance while widening the disparities between subgroups. These larger gaps are then put in perspective of the collective benefit through our notion of positive-sum fairness to distinguish harmful from non harmful disparities.

Keywords:
Fairness Computer-aided diagnosis Chest x-ray Machine Learning
**footnotetext: These authors contributed equally to this work

1 Introduction

Medical imaging plays a critical role in diagnosis, treatment planning, and monitoring patient progress. However, the reliability of medical imaging algorithms is not uniformly distributed across different demographic groups, raising concerns about fairness and potential biases in the results. Fairness in medical imaging most often refers to the equitable treatment of patients from diverse demographic backgrounds, regardless of their gender, race, ethnicity, or other characteristics sensitive to discrimination [19, 38].

This equitable treatment is often interpreted as a similar performance across different demographic subgroups. When applied to domains like credit card scoring or AI-powered recruiting, ignoring all sensitive attributes and prioritizing a similar performance across the different demographic subgroups is an acceptable approach. However, in the medical field, demographic attributes are important clinical factors which radiologists and clinicians often take into consideration as they can have a strong impact on their diagnoses and can guide them to consider specific tests or treatments based on the patient’s demographic profile. The prevalence of diseases can be correlated to demographic attributes. For example, studies have shown that breast cancer has a higher incidence among Ashkenazi Jewish women [37, 30]. And, due to historical and social disparities as well as different physiological features across demographic subgroups, the difficulty level of medical tasks is not uniformly distributed. For this reason, even collecting more or more diverse data does not necessarily produce equal performance across demographic subgroups as the best achievable result is not the same for each of them [27]. In a domain where each improvement can save lives, it is hard to disregard the benefit of the population as a whole for the sake of decreasing the disparities between subgroups.

Petersen et al. [26] examined various types of demographic invariance in medical imaging AI, highlighting why they can be undesirable and stressing the need for better fairness assessments and mitigation techniques in this field. Several fairness measures suffer from degradation in the overall performance by penalizing the performance of an AI system for groups that it performs better on, in order to achieve parity with groups it performs worse on, which is referred to as “levelling down” [24]. While we are aware of papers suggesting training methods which aim to maximize the benefit of each subgroup (Berk Ustun [34], for instance, suggested debiasing methods following the ethical principles of beneficence (“do the best”) and non-maleficence (“do not harm”) [35] in regards to fairness), and methods which improve fairness by understanding and mitigating the demographic encodings present in images [39, 3], we could not find any fairness evaluation framework or definition which allows to compare different models from the prism of harmful and non harmful disparities.

We, therefore, introduce the notion of positive-sum fairness: when looking at a situation where we have an initial model and are looking at the trade-off between fairness and performance while trying to improve it, inequitable performance can be acceptable as long as it does not come at the expense of other subgroups and allows a higher overall performance to be achieved. Specifically, we argue that differences in performance can be harmful and non-harmful. We consider a disparity harmful if it comes at the cost of the overall performance or if improving the overall performance is achieved by decreasing performance on any protected subgroup. A difference in performance across protected subgroups is considered non-harmful if, by improving an AI system’s performance, we exacerbate the disparities between subgroups without negatively impacting any specific subgroup. This main idea is summarized in figure 1.

We compare the positive-sum fairness framework with a more traditional group fairness definition, which is the largest disparity in performance across subgroups. We show that some models, while increasing this disparity, actually improve the performance of each subgroup individually and that other models which decrease the disparity ("improving fairness" from a classic point of view) are harming some subgroups to achieve it.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 1: We investigate fairness of AI models and introduce the concept of ’positive-sum fairness’ to differentiate harmful and non-harmful disparities. Graph a) shows the performance of an initial model per protected groups. b) shows the performance of an updated model with a higher overall performance but a lower fairness, under its standard definition, as indicated by the larger difference between the most and least advantaged groups and therefore could be rejected on the basis of fairness. c) shows the same updated model as b) however it shows the performance difference per group compared to the initial model. In this positive-sum framing we see that none of the groups had a reduction in performance and therefore the increased performance in Race C did not come at the cost of performance in any other group.

2 Related work

Bias is commonly identified in medical image analysis applications [38, 40]. For instance [6], a CNN trained on brain MRI resulted in a significant difference between ethnicities. Seyyed-Kalantari et al. [32] observed that minorities received higher rates of algorithmic underdiagnosis. Zong et al. [40] assessed bias mitigation algorithms in- and out-of-distribution settings. The experiments demonstrated the wide existence of bias in AI-based medical imaging classifiers and none of the bias mitigation algorithms was able to prevent this.

Different definitions of fairness are used:

  • Individual fairness [25] requires that similar individuals should be treated equally and thus have similar predictions. For example, a model should have comparable diagnosis on two similar X-Ray images.

  • Group fairness requires equal performance on sub-groups divided based on sensitive attributes (e.g., race, sex, and age). Common group fairness metrics are demographic parity [8], equal odds [12] and predictive rate parity or sufficiency [21].

  • Minimax fairness [5] seeks to ensure that the worst-off group is treated as fairly as possible, reducing the most severe negative impacts of a decision or system.

These definitions have pros and cons [36]. Individual fairness relies on the choice of the distance metric, which requires expert input. In minimax fairness, the ideal solution is difficult to compute and the degree of unfairness relies heavily on the choice of the set of models. Group fairness metrics are easy to implement and understand, but are not always adapted to the problem nor compatible with one another [2, 18]. And even though prior work has broadened the group fairness notion by adding other normative choices than strict equality [1], none of the proposed metrics prevent the harm that could be brought to each subgroup’s performance individually or to the whole population’s benefit.

As mentioned in the introduction, similarly to [24, 34, 27, 26], we believe that medical AI is different from other domains in that each improvement can save lives. Therefore, increasing disparities to achieve the best performance possible for each demographic subgroup and for the population as a whole could be justified. Previous research has shown that images themselves could carry demographic encodings [10, 9]. E.g., Yang et al. [39] investigate the utilization of demographic encodings by analyzing the use of demographic shortcuts for disease classification. Two papers [41, 11] examine the relevance of explicitly using sensitive attributes in fair classification systems for non-medical problems. They compare different models which leverage sensitive attributes with a model which is not trained on any sensitive attribute.

3 Methods

3.1 Positive-sum fairness

We introduce the principle of positive-sum fairness, which analyzes fairness from the prism of harmful and non harmful disparities. When looking at changes in model performance and disparities between protected subgroups, there are several explanations for a gap in performance between the most and least advantaged subgroups:

  • The most advantaged group’s performance improved while others’ stayed the same,

  • All subgroups’ performance improved but one of them increased more than others,

  • The most discriminated group’s performance decreased while others’ stayed the same,

  • All subgroups’ performance decreased, but one of them decreased more than the others, etc.

The first two would not be considered harmful as they allow to improve the general performance without harming any of the subgroups, thus achieving a collective benefit.

3.1.1 Definition

Positive-sum fairness is a fairness evaluation framework where the goal is to find solutions that increase the overall benefit for all parties together while trying to ensure no one is worse off and ideally, everyone is better off. It looks at the situation where we have an initial model and are looking at the trade-off between fairness and performance when trying to improve the model. Unlike other fairness definitions which aim to minimize the disparity between subgroups or maximize the worst performance among subgroups, positive-sum fairness tries to avoid gains to a group which come at the expense of another group while maintaining the overall performance.

Let us assume that we compare N𝑁Nitalic_N models {M}i=1Nsubscriptsuperscript𝑀𝑁𝑖1\{M\}^{N}_{i=1}{ italic_M } start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT to a baseline Mbaselinesubscript𝑀𝑏𝑎𝑠𝑒𝑙𝑖𝑛𝑒M_{baseline}italic_M start_POSTSUBSCRIPT italic_b italic_a italic_s italic_e italic_l italic_i italic_n italic_e end_POSTSUBSCRIPT on K𝐾Kitalic_K demographic subgroups. And let us consider measure(M)𝑚𝑒𝑎𝑠𝑢𝑟𝑒𝑀measure(M)italic_m italic_e italic_a italic_s italic_u italic_r italic_e ( italic_M ) as the metric that measures the performance of a model M𝑀Mitalic_M. Following the positive-sum fairness definition, selecting the best model is equivalent to finding the best trade-off between:

  • maximizing the performance gain: max1iNmeasure(Mi)measure(Mbaseline)subscript1𝑖𝑁𝑚𝑒𝑎𝑠𝑢𝑟𝑒subscript𝑀𝑖𝑚𝑒𝑎𝑠𝑢𝑟𝑒subscript𝑀𝑏𝑎𝑠𝑒𝑙𝑖𝑛𝑒\max_{1\leq i\leq N}measure(M_{i})-measure(M_{baseline})roman_max start_POSTSUBSCRIPT 1 ≤ italic_i ≤ italic_N end_POSTSUBSCRIPT italic_m italic_e italic_a italic_s italic_u italic_r italic_e ( italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - italic_m italic_e italic_a italic_s italic_u italic_r italic_e ( italic_M start_POSTSUBSCRIPT italic_b italic_a italic_s italic_e italic_l italic_i italic_n italic_e end_POSTSUBSCRIPT )

  • maximizing the smallest performance gain across the subgroups :
    max1iN(min1kKmeasure(Mi)(groupk)measure(Mbaseline)(groupk))subscript1𝑖𝑁subscript1𝑘𝐾𝑚𝑒𝑎𝑠𝑢𝑟𝑒subscript𝑀𝑖𝑔𝑟𝑜𝑢subscript𝑝𝑘𝑚𝑒𝑎𝑠𝑢𝑟𝑒subscript𝑀𝑏𝑎𝑠𝑒𝑙𝑖𝑛𝑒𝑔𝑟𝑜𝑢subscript𝑝𝑘\max_{1\leq i\leq N}(\min_{1\leq k\leq K}measure(M_{i})(group_{k})-measure(M_{% baseline})(group_{k}))roman_max start_POSTSUBSCRIPT 1 ≤ italic_i ≤ italic_N end_POSTSUBSCRIPT ( roman_min start_POSTSUBSCRIPT 1 ≤ italic_k ≤ italic_K end_POSTSUBSCRIPT italic_m italic_e italic_a italic_s italic_u italic_r italic_e ( italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ( italic_g italic_r italic_o italic_u italic_p start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_m italic_e italic_a italic_s italic_u italic_r italic_e ( italic_M start_POSTSUBSCRIPT italic_b italic_a italic_s italic_e italic_l italic_i italic_n italic_e end_POSTSUBSCRIPT ) ( italic_g italic_r italic_o italic_u italic_p start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) )

Depending on the task, one could set hard constraints like ensuring there is no performance loss for any subgroup (aka the selected model Misubscript𝑀𝑖M_{i}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT should ensure that min1kKmeasure(Mi)(groupk)measure(Mbaseline)(groupk)0subscript1𝑘𝐾𝑚𝑒𝑎𝑠𝑢𝑟𝑒subscript𝑀𝑖𝑔𝑟𝑜𝑢subscript𝑝𝑘𝑚𝑒𝑎𝑠𝑢𝑟𝑒subscript𝑀𝑏𝑎𝑠𝑒𝑙𝑖𝑛𝑒𝑔𝑟𝑜𝑢subscript𝑝𝑘0\min_{1\leq k\leq K}measure(M_{i})(group_{k})-measure(M_{baseline})(group_{k})\geq 0roman_min start_POSTSUBSCRIPT 1 ≤ italic_k ≤ italic_K end_POSTSUBSCRIPT italic_m italic_e italic_a italic_s italic_u italic_r italic_e ( italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ( italic_g italic_r italic_o italic_u italic_p start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_m italic_e italic_a italic_s italic_u italic_r italic_e ( italic_M start_POSTSUBSCRIPT italic_b italic_a italic_s italic_e italic_l italic_i italic_n italic_e end_POSTSUBSCRIPT ) ( italic_g italic_r italic_o italic_u italic_p start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ≥ 0) and the overall performance is improved (aka the selected model Misubscript𝑀𝑖M_{i}italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT should ensure that measure(Mi)measure(Mbaseline)0𝑚𝑒𝑎𝑠𝑢𝑟𝑒subscript𝑀𝑖𝑚𝑒𝑎𝑠𝑢𝑟𝑒subscript𝑀𝑏𝑎𝑠𝑒𝑙𝑖𝑛𝑒0measure(M_{i})-measure(M_{baseline})\geq 0italic_m italic_e italic_a italic_s italic_u italic_r italic_e ( italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - italic_m italic_e italic_a italic_s italic_u italic_r italic_e ( italic_M start_POSTSUBSCRIPT italic_b italic_a italic_s italic_e italic_l italic_i italic_n italic_e end_POSTSUBSCRIPT ) ≥ 0) or find the most relevant trade-off between the two optimization problems.

3.2 Application

To put this fairness notion into practice and show the difference with traditional group fairness, we compare three models which use sensitive attributes to a baseline model. The way sensitive attributes are used by the model is known to have an impact on the fairness and performance of the model [3, 39, 41, 11]. Therefore, we make use of models that explicitly include sensitive attributes, or conversely, remove any demographic encoding from the input data.

The four models are trained on a multi-label classification problem of findings in chest radiography (CXR). In all settings, a Densenet-121 [13] backbone is used, which was empirically determined to give the best performance for this problem. The exact model architectures are shown in figure 2 and described below:

  • M1: a baseline classifier using the images as input and trained to predict the targeted CXR findings associated to our dataset. The model comprises a backbone to extract the image features and a finding branch consisting of a fully connected layer and a binary cross entropy loss for each finding.

  • M2: a classifier using both the images and race features as input. The race information comes in the form of a categorical variable, which we convert to a one-hot vector and feed to a fully-connected layer. We concatenate the features from the fully connected layer and the image features before forwarding to finding branch. The model is trained end-to-end.

  • M3: a classifier using the images as input only, but trained to predict image findings as well as the race group (i.e. this model aims to exploit the race encodings present in the images). For this model, we modify the final layer of the baseline classifier by adapting the loss function to optimize the two tasks: CXR findings and race group. We also transform race information to one-hot encoded vector to apply multi-class loss. The race classification branch is made of a fully-connected layer and a cross entropy loss function. The final loss is calculated by adding finding loss and the race loss with a loss weight λ𝜆\lambdaitalic_λ.

    L(ycxr,yrace)=L(ycxr)+λL(yrace)𝐿subscript𝑦𝑐𝑥𝑟subscript𝑦𝑟𝑎𝑐𝑒𝐿subscript𝑦𝑐𝑥𝑟𝜆𝐿subscript𝑦𝑟𝑎𝑐𝑒L(y_{cxr},y_{race})=L(y_{cxr})+\lambda L(y_{race})italic_L ( italic_y start_POSTSUBSCRIPT italic_c italic_x italic_r end_POSTSUBSCRIPT , italic_y start_POSTSUBSCRIPT italic_r italic_a italic_c italic_e end_POSTSUBSCRIPT ) = italic_L ( italic_y start_POSTSUBSCRIPT italic_c italic_x italic_r end_POSTSUBSCRIPT ) + italic_λ italic_L ( italic_y start_POSTSUBSCRIPT italic_r italic_a italic_c italic_e end_POSTSUBSCRIPT )
  • M4: a classifier using the images as input, trained to predict image findings, while minimizing the use of race information encoded in the image. For this model, we implement the gradient reversal technique described in [28]. We apply the gradient reversal layer before the race branch.

Refer to caption
Figure 2: To investigate the effect of sensitive attributes on performance and fairness, we evaluate four different model architectures, denoted M1, M2, M3 and M4. M1, the baseline, has a backbone and classification. M2 has a race encoding branch to learn race-encoded features directly from metadata. M3 and M4 have an additional race branch to predict the race group which is implicitly encoded in the image, from the image features. The difference between M3 and M4 is that we add a gradient reversal layer before the race branch.

4 Experiments

4.0.1 Data

We use chest radiographs from MIMIC-CXR-JPG [16, 29]. The dataset has annotations for 14 findings. However, we focus on lung lesions, pneumonia, pleural effusion and consolidation as the diseases associated with these findings have been shown to be correlated with ethnicity [4, 17, 33]. We use only frontal images and split the dataset into training, validation, and test sets on a patient level. In total, 237,972, 1,959, and 3,403 images are used for training, validation, and testing, respectively.

4.0.2 Sensitive attributes

We define the protected subgroups based on the self-reported race from MIMIC-IV [14, 15] and split it into five groups: White, African-American, Latino, Asian, others.

4.0.3 Model training

We train our 4 models to predict all 14 CXR findings and a race group. We initialize a DenseNet-121 backbone with pre-trained weights from ImageNet [31]. The images are resized to 256 ×\times× 256, and augmented using random rotation from [-15,15] degree range and random horizontal flip. We conduct the experiments with 8 V100 NVIDIA GPU. AdamW [23] is used with an initial learning rate of 0.002 which is updated using the cosine annealing warm up [22] scheduler.

4.0.4 Evaluation

We compare the four models by general performance and fairness across the protected subgroups. The general performance is assessed using the Area under the ROC curve (AUROC) score and the traditional group fairness metric used to compare with positive-sum fairness is expressed by (1 - largest disparity between protected subgroups in terms of AUROC) [20]. We use the AUROC mean and confidence bounds generated using bootstrapping with 300 samples [7]. We do not consider protected subgroups which have less than 5 positive cases or less than 5 negative cases as this results in poor estimates of performance.

4.1 Initial results

According to traditional group fairness, in assessing the results of the four models shown in figure 3a one could conclude that:

M2 improves the overall performance Our results show that M2 outperforms M1 in terms of AUROC. This is in line with our expectation as we are providing an additional relevant medical feature for the model to learn from. This better performance comes with a larger gap in AUROC between the most advantaged and most discriminated races, in other words less fairness from a traditional point of view. But this larger disparity is not necessarily harmful according to the positive-sum fairness as we will discuss it in the next section.

M4 improves the fairness M4 improves fairness for lung lesions and consolidations, while performing similar for pneumonia and pleural effusion. The improved fairness is likely due to the gradient reversal layer, which removes race information from the image and prevents the model from exploiting any demographic shortcut.

No clear pattern for M3 The results for M3 are less consistent. Its performance is lower than the baseline except for pneumonia and its fairness measurement is sometimes higher and other times lower than the baseline’s. If the baseline model exploited demographic encodings present in the images to generate shortcuts, training M3 to maximize the race prediction might have intensified the impact of these shortcuts.

Refer to caption
(a) Traditional group fairness vs performance framework
Refer to caption
(b) Positive-sum fairness framework
Figure 3: We put in parallel 2 different fairness vs performance frameworks: in figure (a), we compute both the performance (AUROC) and fairness (as 1 - the difference in AUROC between the most and least advantaged groups) of the 4 models per lesion. And in figure (b), we show, the difference in overall performance and in performance per protected subgroup between the 3 improved classifiers and the baseline M1. The x axis compares the performance of each improved classifier with the baseline and the y axis shows whether at least one protected subgroup has been harmed by the modifications brought to the baseline classifier.

4.2 Positive-sum fairness

We now apply the notion of positive-sum fairness, defined in section 3.1 and reframe the fairness vs performance problem as shown in 3b. Here, the x-axis represents the difference in performance between each improved classifier and the baseline (AUROC(Mi)AUROC(M1)𝐴𝑈𝑅𝑂𝐶subscript𝑀𝑖𝐴𝑈𝑅𝑂𝐶subscript𝑀1AUROC(M_{i})-AUROC(M_{1})italic_A italic_U italic_R italic_O italic_C ( italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) - italic_A italic_U italic_R italic_O italic_C ( italic_M start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT )) and the y-axis shows the performance increase (or decrease) for the least improved subgroup (min1kKAUROC(Mi)(racek)AUROC(M1)(racek)subscript1𝑘𝐾𝐴𝑈𝑅𝑂𝐶subscript𝑀𝑖𝑟𝑎𝑐subscript𝑒𝑘𝐴𝑈𝑅𝑂𝐶subscript𝑀1𝑟𝑎𝑐subscript𝑒𝑘\min_{1\leq k\leq K}AUROC(M_{i})(race_{k})\\ -AUROC(M_{1})(race_{k})roman_min start_POSTSUBSCRIPT 1 ≤ italic_k ≤ italic_K end_POSTSUBSCRIPT italic_A italic_U italic_R italic_O italic_C ( italic_M start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ( italic_r italic_a italic_c italic_e start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - italic_A italic_U italic_R italic_O italic_C ( italic_M start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ( italic_r italic_a italic_c italic_e start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT )). A negative value indicates that the model performs worse for the given subgroup.

Any classifier which has the exact same overall performance and exact same performance per protected subgroup (race) as the baseline, would be at coordinate (0,0). Any classifier that has a negative x-coordinate, would have a lower general performance than M1 and any classifier that has a negative y-coordinate would have at least one protected subgroup with a lower AUROC than M1 (at least one subgroup negatively impacted by the changes brought to the baseline model).

For lung lesions, figure 3b shows that M2 appears in the positive side of the x and y axes, meaning that the performance was improved without harming any subgroup’s performance. And this even though the figure 3a shows a decrease in fairness (larger disparity between the most advantaged and least advantaged subgroups) for M2 compared with M1. This matches the previous conclusion that the larger performance gap between protected subgroups for M2 compared with M1 cannot be considered harmful as every protected subgroup’s performance was individually increased.

On the other hand, for lung lesions, model M4 improved fairness (smaller disparity between the most advantaged and least advantaged subgroups) as shown in the figure 3a. However, the figure 3b, shows that M4 has negative y coordinates, meaning that at least one subgroup was harmed while trying to achieve a smaller disparity between protected subgroups.

5 Conclusion

In this paper, we presented the notion of positive-sum fairness and argued that larger disparities are not necessarily harmful, as long as it does not come at the expense of a specific subgroup performance. The general performance, standard fairness and positive-sum fairness of four models was analyzed, each leveraging sensitive attributes in a different way.

Our study highlights the need for a nuanced understanding of fairness metrics and their implications in real-world applications. Good incorporation of medical knowledge is crucial when utilizing sensitive information and evaluating fairness accurately, particularly in cases where models may show a large performance disparity.

When traditional methods often aim for equality, positive-sum fairness focuses on equity, pushing for each group to achieve its highest possible performance level. This can lead to better overall outcomes, as it encourages to address the specific needs and challenges of each group without diminishing the quality of care for others. But, being defined as an optimization problem, it could also have unintended side effects as it may inadvertently prioritize larger or more well-represented groups by focusing the efforts on the groups with the highest impact on the overall performance rather than those with the most critical needs. Therefore, it is to be noted that meeting the positive-sum fairness criterion alone does not ensure a model to be fair from an egalitarian perspective, and the use of this notion in conjunction with other metrics can give a more holistic understanding of a model’s fairness.

As positive-sum fairness is a relative measure, it requires a baseline to be used. Further work in this area would include developing a more robust baseline or adapting the approach to remove the need for a baseline. It would also be worth it to compare out-of-domain tested models, include other sensitive attributes such as sex and age and take into account confounding factors. {credits}

5.0.1 \discintname

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

  • [1] Baumann, J., Hertweck, C., Loi, M., Heitz, C.: Distributive justice as the foundational premise of fair ml: Unification, extension, and interpretation of group fairness metrics (2023), https://arxiv.org/abs/2206.02897
  • [2] Berk, R., Heidari, H., Jabbari, S., Kearns, M., Roth, A.: Fairness in criminal justice risk assessments: The state of the art (2017)
  • [3] Brown, A., Tomasev, N., Freyberg, J., Liu, Y., Karthikesalingam, A., Schrouff, J.: Detecting shortcut learning for fair medical AI using shortcut testing
  • [4] Burton, D.C., Flannery, B., Bennett, N.M., Farley, M.M., Gershman, K., Harrison, L.H., Lynfield, R., Petit, S., Reingold, A.L., Schaffner, W., Thomas, A., Plikaytis, B.D., Rose, Jr, C.E., Whitney, C.G., Schuchat, A., for the Active Bacterial Core Surveillance/Emerging Infections Program Network: Socioeconomic and racial/ethnic disparities in the incidence of bacteremic pneumonia among US adults. Am. J. Public Health 100(10), 1904–1911 (Oct 2010)
  • [5] Diana, E., Gill, W., Kearns, M., Kenthapadi, K., Roth, A.: Minimax group fairness: Algorithms and experiments (2021)
  • [6] EAM, S., M, W., P, M., ND., F.: Fairness-related performance and explainability effects in deep learning models for brain image analysis. J Med Imaging (Bellingham). 2022 Nov;9(6):061102. doi: 10. (2022)
  • [7] Efron, B.: Better bootstrap confidence intervals. Journal of the American statistical Association 82(397), 171–185 (1987)
  • [8] Feldman, M., Friedler, S., Moeller, J., Scheidegger, C., Venkatasubramanian, S.: Certifying and removing disparate impact (2015)
  • [9] Gichoya, J.W., Banerjee, I., Bhimireddy, A.R., Burns, J.L., Celi, L.A., Chen, L.C., Correa, R., Dullerud, N., Ghassemi, M., Huang, S.C., Kuo, P.C., Lungren, M.P., Palmer, L.J., Price, B.J., Purkayastha, S., Pyrros, A.T., Oakden-Rayner, L., Okechukwu, C., Seyyed-Kalantari, L., Trivedi, H., Wang, R., Zaiman, Z., Zhang, H.: AI recognition of patient race in medical imaging: a modelling study. Lancet Digit. Health 4(6), e406–e414 (Jun 2022)
  • [10] Glocker, B., Jones, C., Bernhardt, M., Winzeck, S.: Algorithmic encoding of protected characteristics in chest x-ray disease detection models. EBioMedicine 89(104467), 104467 (Mar 2023)
  • [11] Haeri, M.A., Zweig, K.A.: The crucial role of sensitive attributes in fair classification. In: 2020 IEEE Symposium Series on Computational Intelligence (SSCI). pp. 2993–3002 (2020). https://doi.org/10.1109/SSCI47803.2020.9308585
  • [12] Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning (2016)
  • [13] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks (2018)
  • [14] Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L.A., Mark, R.: MIMIC-IV (2023)
  • [15] Johnson, A.E.W., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T.J., Hao, S., Moody, B., Gow, B., Lehman, L.W.H., Celi, L.A., Mark, R.G.: MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10(1),  1 (Jan 2023)
  • [16] Johnson, A.E.W., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., ying Deng, C., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs (2019)
  • [17] Joseph, N.P., Reid, N.J., Som, A., Li, M.D., Hyle, E.P., Dugdale, C.M., Lang, M., Betancourt, J.R., Deng, F., Mendoza, D.P., Little, B.P., Narayan, A.K., Flores, E.J.: Racial and ethnic disparities in disease severity on admission chest radiographs among patients admitted with confirmed coronavirus disease 2019: A retrospective cohort study. Radiology 297(3), E303–E312 (Dec 2020)
  • [18] Kleinberg, J., Mullainathan, S., Raghavan, M.: Inherent trade-offs in the fair determination of risk scores (2016)
  • [19] Lara, R., A., M., Echeveste, R., Ferrante, E.: Addressing fairness in artificial intelligence for medical imaging. Nat Commun 13 4581 (2022)
  • [20] Lee, J., Brooks, C., Yu, R., Kizilcec, R.: Fairness hub technical briefs: Auc gap (2023)
  • [21] Lee, J.K., Bu, Y., Rajan, D., Sattigeri, P., Panda, R., Das, S., Wornell, G.W.: Fair selective classification via sufficiency. In: International Conference on Machine Learning (2021), https://api.semanticscholar.org/CorpusID:235826429
  • [22] Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts (2017)
  • [23] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization (2019)
  • [24] Mittelstadt, B., Wachter, S., Russell, C.: The unfairness of fair machine learning: Levelling down and strict egalitarianism by default (2023), https://arxiv.org/abs/2302.02404
  • [25] Mukherjee, D., Yurochkin, M., Banerjee, M., Sun, Y.: Two simple ways to learn individual fairness metrics from data (2020)
  • [26] Petersen, E., Ferrante, E., Ganz, M., Feragen, A.: Are demographically invariant models and representations in medical imaging fair? (2024), https://arxiv.org/abs/2305.01397
  • [27] Petersen, E., Holm, S., Ganz, M., Feragen, A.: The path toward equal performance in medical machine learning. Patterns 4(7), 100790 (Jul 2023). https://doi.org/10.1016/j.patter.2023.100790, https://dx.doi.org/10.1016/j.patter.2023.100790
  • [28] Raff, E., Sylvester, J.: Gradient reversal against discrimination (2018)
  • [29] Rajeev, C., Natarajan, K.: Data Augmentation in Classifying Chest Radiograph Images (CXR) Using DCGAN-CNN, pp. 91–110 (11 2023). https://doi.org/10.1007/978-3-031-43205-7_6
  • [30] Rubinstein, W.S.: Hereditary breast cancer in jews. Fam. Cancer 3(3-4), 249–257 (2004)
  • [31] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition challenge (2015)
  • [32] Seyyed-Kalantari, L., Zhang, H., McDermott, M., Chen, I., Ghassemi, M.: Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature Medicine 27 (12 2021). https://doi.org/10.1038/s41591-021-01595-0
  • [33] Shi, H., Seegobin, K., Heng, F., Zhou, K., Chen, R., Qin, H., Manochakian, R., Zhao, Y., Lou, Y.: Genomic landscape of lung adenocarcinomas in different races. Front. Oncol. 12 (Sep 2022)
  • [34] Ustun, B., Liu, Y., Parkes, D.: Fairness without harm: Decoupled classifiers with preference guarantees. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 6373–6382. PMLR (09–15 Jun 2019), https://proceedings.mlr.press/v97/ustun19a.html
  • [35] Varkey, B.: Principles of clinical ethics and their application to practice. Med. Princ. Pract. 30(1), 17–28 (2021)
  • [36] Verma, S., Rubin, J.S.: Fairness definitions explained. 2018 IEEE/ACM International Workshop on Software Fairness (FairWare) pp. 1–7 (2018), https://api.semanticscholar.org/CorpusID:49561627
  • [37] Warner, E., Foulkes, W., Goodwin, P., Meschino, W., Blondal, J., Paterson, C., Ozcelik, H., Goss, P., Allingham-Hawkins, D., Hamel, N., Di Prospero, L., Contiga, V., Serruya, C., Klein, M., Moslehi, R., Honeyford, J., Liede, A., Glendon, G., Brunet, J.S., Narod, S.: Prevalence and penetrance of BRCA1 and BRCA2 gene mutations in unselected ashkenazi jewish women with breast cancer. J. Natl. Cancer Inst. 91(14), 1241–1247 (Jul 1999)
  • [38] Xu, Z., Li, J., Yao, Q., Li, H., Zhou, S.K.: Fairness in medical image analysis and healthcare: A literature survey (2023)
  • [39] Yang, Y., Zhang, H., Gichoya, J.W., Katabi, D., Ghassemi, M.: The limits of fair medical imaging ai in the wild (2023)
  • [40] Zong, Y., Yang, Y., Hospedales, T.: Medfair: Benchmarking fairness for medical imaging (2023)
  • [41] Žliobaitė, I., Custers, B.: Using sensitive personal data may be necessary for avoiding discrimination in data-driven decision models (2016)