11email: {samia.belhadj, tony.superb, ambika.seth, heshamdar, tkooi}@lunit.io
Positive-Sum Fairness: Leveraging Demographic Attributes to Achieve Fair AI Outcomes Without Sacrificing Group Gains
Abstract
Fairness in medical AI is increasingly recognized as a crucial aspect of healthcare delivery. While most of the prior work done on fairness emphasizes the importance of equal performance, we argue that decreases in fairness can be either harmful or non-harmful, depending on the type of change and how sensitive attributes are used. To this end, we introduce the notion of positive-sum fairness, which states that an increase in performance that results in a larger group disparity is acceptable as long as it does not come at the cost of individual subgroup performance. This allows sensitive attributes correlated with the disease to be used to increase performance without compromising on fairness.
We illustrate this idea by comparing four CNN models that make different use of the race attribute in the training phase. The results show that removing all demographic encodings from the images helps close the gap in performance between the different subgroups, whereas leveraging the race attribute as a model’s input increases the overall performance while widening the disparities between subgroups. These larger gaps are then put in perspective of the collective benefit through our notion of positive-sum fairness to distinguish harmful from non harmful disparities.
Keywords:
Fairness Computer-aided diagnosis Chest x-ray Machine Learning1 Introduction
Medical imaging plays a critical role in diagnosis, treatment planning, and monitoring patient progress. However, the reliability of medical imaging algorithms is not uniformly distributed across different demographic groups, raising concerns about fairness and potential biases in the results. Fairness in medical imaging most often refers to the equitable treatment of patients from diverse demographic backgrounds, regardless of their gender, race, ethnicity, or other characteristics sensitive to discrimination [19, 38].
This equitable treatment is often interpreted as a similar performance across different demographic subgroups. When applied to domains like credit card scoring or AI-powered recruiting, ignoring all sensitive attributes and prioritizing a similar performance across the different demographic subgroups is an acceptable approach. However, in the medical field, demographic attributes are important clinical factors which radiologists and clinicians often take into consideration as they can have a strong impact on their diagnoses and can guide them to consider specific tests or treatments based on the patient’s demographic profile. The prevalence of diseases can be correlated to demographic attributes. For example, studies have shown that breast cancer has a higher incidence among Ashkenazi Jewish women [37, 30]. And, due to historical and social disparities as well as different physiological features across demographic subgroups, the difficulty level of medical tasks is not uniformly distributed. For this reason, even collecting more or more diverse data does not necessarily produce equal performance across demographic subgroups as the best achievable result is not the same for each of them [27]. In a domain where each improvement can save lives, it is hard to disregard the benefit of the population as a whole for the sake of decreasing the disparities between subgroups.
Petersen et al. [26] examined various types of demographic invariance in medical imaging AI, highlighting why they can be undesirable and stressing the need for better fairness assessments and mitigation techniques in this field. Several fairness measures suffer from degradation in the overall performance by penalizing the performance of an AI system for groups that it performs better on, in order to achieve parity with groups it performs worse on, which is referred to as “levelling down” [24]. While we are aware of papers suggesting training methods which aim to maximize the benefit of each subgroup (Berk Ustun [34], for instance, suggested debiasing methods following the ethical principles of beneficence (“do the best”) and non-maleficence (“do not harm”) [35] in regards to fairness), and methods which improve fairness by understanding and mitigating the demographic encodings present in images [39, 3], we could not find any fairness evaluation framework or definition which allows to compare different models from the prism of harmful and non harmful disparities.
We, therefore, introduce the notion of positive-sum fairness: when looking at a situation where we have an initial model and are looking at the trade-off between fairness and performance while trying to improve it, inequitable performance can be acceptable as long as it does not come at the expense of other subgroups and allows a higher overall performance to be achieved. Specifically, we argue that differences in performance can be harmful and non-harmful. We consider a disparity harmful if it comes at the cost of the overall performance or if improving the overall performance is achieved by decreasing performance on any protected subgroup. A difference in performance across protected subgroups is considered non-harmful if, by improving an AI system’s performance, we exacerbate the disparities between subgroups without negatively impacting any specific subgroup. This main idea is summarized in figure 1.
We compare the positive-sum fairness framework with a more traditional group fairness definition, which is the largest disparity in performance across subgroups. We show that some models, while increasing this disparity, actually improve the performance of each subgroup individually and that other models which decrease the disparity ("improving fairness" from a classic point of view) are harming some subgroups to achieve it.
2 Related work
Bias is commonly identified in medical image analysis applications [38, 40]. For instance [6], a CNN trained on brain MRI resulted in a significant difference between ethnicities. Seyyed-Kalantari et al. [32] observed that minorities received higher rates of algorithmic underdiagnosis. Zong et al. [40] assessed bias mitigation algorithms in- and out-of-distribution settings. The experiments demonstrated the wide existence of bias in AI-based medical imaging classifiers and none of the bias mitigation algorithms was able to prevent this.
Different definitions of fairness are used:
-
•
Individual fairness [25] requires that similar individuals should be treated equally and thus have similar predictions. For example, a model should have comparable diagnosis on two similar X-Ray images.
- •
-
•
Minimax fairness [5] seeks to ensure that the worst-off group is treated as fairly as possible, reducing the most severe negative impacts of a decision or system.
These definitions have pros and cons [36]. Individual fairness relies on the choice of the distance metric, which requires expert input. In minimax fairness, the ideal solution is difficult to compute and the degree of unfairness relies heavily on the choice of the set of models. Group fairness metrics are easy to implement and understand, but are not always adapted to the problem nor compatible with one another [2, 18]. And even though prior work has broadened the group fairness notion by adding other normative choices than strict equality [1], none of the proposed metrics prevent the harm that could be brought to each subgroup’s performance individually or to the whole population’s benefit.
As mentioned in the introduction, similarly to [24, 34, 27, 26], we believe that medical AI is different from other domains in that each improvement can save lives. Therefore, increasing disparities to achieve the best performance possible for each demographic subgroup and for the population as a whole could be justified. Previous research has shown that images themselves could carry demographic encodings [10, 9]. E.g., Yang et al. [39] investigate the utilization of demographic encodings by analyzing the use of demographic shortcuts for disease classification. Two papers [41, 11] examine the relevance of explicitly using sensitive attributes in fair classification systems for non-medical problems. They compare different models which leverage sensitive attributes with a model which is not trained on any sensitive attribute.
3 Methods
3.1 Positive-sum fairness
We introduce the principle of positive-sum fairness, which analyzes fairness from the prism of harmful and non harmful disparities. When looking at changes in model performance and disparities between protected subgroups, there are several explanations for a gap in performance between the most and least advantaged subgroups:
-
•
The most advantaged group’s performance improved while others’ stayed the same,
-
•
All subgroups’ performance improved but one of them increased more than others,
-
•
The most discriminated group’s performance decreased while others’ stayed the same,
-
•
All subgroups’ performance decreased, but one of them decreased more than the others, etc.
The first two would not be considered harmful as they allow to improve the general performance without harming any of the subgroups, thus achieving a collective benefit.
3.1.1 Definition
Positive-sum fairness is a fairness evaluation framework where the goal is to find solutions that increase the overall benefit for all parties together while trying to ensure no one is worse off and ideally, everyone is better off. It looks at the situation where we have an initial model and are looking at the trade-off between fairness and performance when trying to improve the model. Unlike other fairness definitions which aim to minimize the disparity between subgroups or maximize the worst performance among subgroups, positive-sum fairness tries to avoid gains to a group which come at the expense of another group while maintaining the overall performance.
Let us assume that we compare models to a baseline on demographic subgroups. And let us consider as the metric that measures the performance of a model . Following the positive-sum fairness definition, selecting the best model is equivalent to finding the best trade-off between:
-
•
maximizing the performance gain:
-
•
maximizing the smallest performance gain across the subgroups :
Depending on the task, one could set hard constraints like ensuring there is no performance loss for any subgroup (aka the selected model should ensure that ) and the overall performance is improved (aka the selected model should ensure that ) or find the most relevant trade-off between the two optimization problems.
3.2 Application
To put this fairness notion into practice and show the difference with traditional group fairness, we compare three models which use sensitive attributes to a baseline model. The way sensitive attributes are used by the model is known to have an impact on the fairness and performance of the model [3, 39, 41, 11]. Therefore, we make use of models that explicitly include sensitive attributes, or conversely, remove any demographic encoding from the input data.
The four models are trained on a multi-label classification problem of findings in chest radiography (CXR). In all settings, a Densenet-121 [13] backbone is used, which was empirically determined to give the best performance for this problem. The exact model architectures are shown in figure 2 and described below:
-
•
M1: a baseline classifier using the images as input and trained to predict the targeted CXR findings associated to our dataset. The model comprises a backbone to extract the image features and a finding branch consisting of a fully connected layer and a binary cross entropy loss for each finding.
-
•
M2: a classifier using both the images and race features as input. The race information comes in the form of a categorical variable, which we convert to a one-hot vector and feed to a fully-connected layer. We concatenate the features from the fully connected layer and the image features before forwarding to finding branch. The model is trained end-to-end.
-
•
M3: a classifier using the images as input only, but trained to predict image findings as well as the race group (i.e. this model aims to exploit the race encodings present in the images). For this model, we modify the final layer of the baseline classifier by adapting the loss function to optimize the two tasks: CXR findings and race group. We also transform race information to one-hot encoded vector to apply multi-class loss. The race classification branch is made of a fully-connected layer and a cross entropy loss function. The final loss is calculated by adding finding loss and the race loss with a loss weight .
-
•
M4: a classifier using the images as input, trained to predict image findings, while minimizing the use of race information encoded in the image. For this model, we implement the gradient reversal technique described in [28]. We apply the gradient reversal layer before the race branch.
4 Experiments
4.0.1 Data
We use chest radiographs from MIMIC-CXR-JPG [16, 29]. The dataset has annotations for 14 findings. However, we focus on lung lesions, pneumonia, pleural effusion and consolidation as the diseases associated with these findings have been shown to be correlated with ethnicity [4, 17, 33]. We use only frontal images and split the dataset into training, validation, and test sets on a patient level. In total, 237,972, 1,959, and 3,403 images are used for training, validation, and testing, respectively.
4.0.2 Sensitive attributes
4.0.3 Model training
We train our 4 models to predict all 14 CXR findings and a race group. We initialize a DenseNet-121 backbone with pre-trained weights from ImageNet [31]. The images are resized to 256 256, and augmented using random rotation from [-15,15] degree range and random horizontal flip. We conduct the experiments with 8 V100 NVIDIA GPU. AdamW [23] is used with an initial learning rate of 0.002 which is updated using the cosine annealing warm up [22] scheduler.
4.0.4 Evaluation
We compare the four models by general performance and fairness across the protected subgroups. The general performance is assessed using the Area under the ROC curve (AUROC) score and the traditional group fairness metric used to compare with positive-sum fairness is expressed by (1 - largest disparity between protected subgroups in terms of AUROC) [20]. We use the AUROC mean and confidence bounds generated using bootstrapping with 300 samples [7]. We do not consider protected subgroups which have less than 5 positive cases or less than 5 negative cases as this results in poor estimates of performance.
4.1 Initial results
According to traditional group fairness, in assessing the results of the four models shown in figure 3a one could conclude that:
M2 improves the overall performance Our results show that M2 outperforms M1 in terms of AUROC. This is in line with our expectation as we are providing an additional relevant medical feature for the model to learn from. This better performance comes with a larger gap in AUROC between the most advantaged and most discriminated races, in other words less fairness from a traditional point of view. But this larger disparity is not necessarily harmful according to the positive-sum fairness as we will discuss it in the next section.
M4 improves the fairness M4 improves fairness for lung lesions and consolidations, while performing similar for pneumonia and pleural effusion. The improved fairness is likely due to the gradient reversal layer, which removes race information from the image and prevents the model from exploiting any demographic shortcut.
No clear pattern for M3 The results for M3 are less consistent. Its performance is lower than the baseline except for pneumonia and its fairness measurement is sometimes higher and other times lower than the baseline’s. If the baseline model exploited demographic encodings present in the images to generate shortcuts, training M3 to maximize the race prediction might have intensified the impact of these shortcuts.
4.2 Positive-sum fairness
We now apply the notion of positive-sum fairness, defined in section 3.1 and reframe the fairness vs performance problem as shown in 3b. Here, the x-axis represents the difference in performance between each improved classifier and the baseline () and the y-axis shows the performance increase (or decrease) for the least improved subgroup (). A negative value indicates that the model performs worse for the given subgroup.
Any classifier which has the exact same overall performance and exact same performance per protected subgroup (race) as the baseline, would be at coordinate (0,0). Any classifier that has a negative x-coordinate, would have a lower general performance than M1 and any classifier that has a negative y-coordinate would have at least one protected subgroup with a lower AUROC than M1 (at least one subgroup negatively impacted by the changes brought to the baseline model).
For lung lesions, figure 3b shows that M2 appears in the positive side of the x and y axes, meaning that the performance was improved without harming any subgroup’s performance. And this even though the figure 3a shows a decrease in fairness (larger disparity between the most advantaged and least advantaged subgroups) for M2 compared with M1. This matches the previous conclusion that the larger performance gap between protected subgroups for M2 compared with M1 cannot be considered harmful as every protected subgroup’s performance was individually increased.
On the other hand, for lung lesions, model M4 improved fairness (smaller disparity between the most advantaged and least advantaged subgroups) as shown in the figure 3a. However, the figure 3b, shows that M4 has negative y coordinates, meaning that at least one subgroup was harmed while trying to achieve a smaller disparity between protected subgroups.
5 Conclusion
In this paper, we presented the notion of positive-sum fairness and argued that larger disparities are not necessarily harmful, as long as it does not come at the expense of a specific subgroup performance. The general performance, standard fairness and positive-sum fairness of four models was analyzed, each leveraging sensitive attributes in a different way.
Our study highlights the need for a nuanced understanding of fairness metrics and their implications in real-world applications. Good incorporation of medical knowledge is crucial when utilizing sensitive information and evaluating fairness accurately, particularly in cases where models may show a large performance disparity.
When traditional methods often aim for equality, positive-sum fairness focuses on equity, pushing for each group to achieve its highest possible performance level. This can lead to better overall outcomes, as it encourages to address the specific needs and challenges of each group without diminishing the quality of care for others. But, being defined as an optimization problem, it could also have unintended side effects as it may inadvertently prioritize larger or more well-represented groups by focusing the efforts on the groups with the highest impact on the overall performance rather than those with the most critical needs. Therefore, it is to be noted that meeting the positive-sum fairness criterion alone does not ensure a model to be fair from an egalitarian perspective, and the use of this notion in conjunction with other metrics can give a more holistic understanding of a model’s fairness.
As positive-sum fairness is a relative measure, it requires a baseline to be used. Further work in this area would include developing a more robust baseline or adapting the approach to remove the need for a baseline. It would also be worth it to compare out-of-domain tested models, include other sensitive attributes such as sex and age and take into account confounding factors. {credits}
5.0.1 \discintname
The authors declare that there are no conflicts of interest regarding the publication of this paper.
References
- [1] Baumann, J., Hertweck, C., Loi, M., Heitz, C.: Distributive justice as the foundational premise of fair ml: Unification, extension, and interpretation of group fairness metrics (2023), https://arxiv.org/abs/2206.02897
- [2] Berk, R., Heidari, H., Jabbari, S., Kearns, M., Roth, A.: Fairness in criminal justice risk assessments: The state of the art (2017)
- [3] Brown, A., Tomasev, N., Freyberg, J., Liu, Y., Karthikesalingam, A., Schrouff, J.: Detecting shortcut learning for fair medical AI using shortcut testing
- [4] Burton, D.C., Flannery, B., Bennett, N.M., Farley, M.M., Gershman, K., Harrison, L.H., Lynfield, R., Petit, S., Reingold, A.L., Schaffner, W., Thomas, A., Plikaytis, B.D., Rose, Jr, C.E., Whitney, C.G., Schuchat, A., for the Active Bacterial Core Surveillance/Emerging Infections Program Network: Socioeconomic and racial/ethnic disparities in the incidence of bacteremic pneumonia among US adults. Am. J. Public Health 100(10), 1904–1911 (Oct 2010)
- [5] Diana, E., Gill, W., Kearns, M., Kenthapadi, K., Roth, A.: Minimax group fairness: Algorithms and experiments (2021)
- [6] EAM, S., M, W., P, M., ND., F.: Fairness-related performance and explainability effects in deep learning models for brain image analysis. J Med Imaging (Bellingham). 2022 Nov;9(6):061102. doi: 10. (2022)
- [7] Efron, B.: Better bootstrap confidence intervals. Journal of the American statistical Association 82(397), 171–185 (1987)
- [8] Feldman, M., Friedler, S., Moeller, J., Scheidegger, C., Venkatasubramanian, S.: Certifying and removing disparate impact (2015)
- [9] Gichoya, J.W., Banerjee, I., Bhimireddy, A.R., Burns, J.L., Celi, L.A., Chen, L.C., Correa, R., Dullerud, N., Ghassemi, M., Huang, S.C., Kuo, P.C., Lungren, M.P., Palmer, L.J., Price, B.J., Purkayastha, S., Pyrros, A.T., Oakden-Rayner, L., Okechukwu, C., Seyyed-Kalantari, L., Trivedi, H., Wang, R., Zaiman, Z., Zhang, H.: AI recognition of patient race in medical imaging: a modelling study. Lancet Digit. Health 4(6), e406–e414 (Jun 2022)
- [10] Glocker, B., Jones, C., Bernhardt, M., Winzeck, S.: Algorithmic encoding of protected characteristics in chest x-ray disease detection models. EBioMedicine 89(104467), 104467 (Mar 2023)
- [11] Haeri, M.A., Zweig, K.A.: The crucial role of sensitive attributes in fair classification. In: 2020 IEEE Symposium Series on Computational Intelligence (SSCI). pp. 2993–3002 (2020). https://doi.org/10.1109/SSCI47803.2020.9308585
- [12] Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning (2016)
- [13] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks (2018)
- [14] Johnson, A., Bulgarelli, L., Pollard, T., Horng, S., Celi, L.A., Mark, R.: MIMIC-IV (2023)
- [15] Johnson, A.E.W., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T.J., Hao, S., Moody, B., Gow, B., Lehman, L.W.H., Celi, L.A., Mark, R.G.: MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10(1), 1 (Jan 2023)
- [16] Johnson, A.E.W., Pollard, T.J., Greenbaum, N.R., Lungren, M.P., ying Deng, C., Peng, Y., Lu, Z., Mark, R.G., Berkowitz, S.J., Horng, S.: Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs (2019)
- [17] Joseph, N.P., Reid, N.J., Som, A., Li, M.D., Hyle, E.P., Dugdale, C.M., Lang, M., Betancourt, J.R., Deng, F., Mendoza, D.P., Little, B.P., Narayan, A.K., Flores, E.J.: Racial and ethnic disparities in disease severity on admission chest radiographs among patients admitted with confirmed coronavirus disease 2019: A retrospective cohort study. Radiology 297(3), E303–E312 (Dec 2020)
- [18] Kleinberg, J., Mullainathan, S., Raghavan, M.: Inherent trade-offs in the fair determination of risk scores (2016)
- [19] Lara, R., A., M., Echeveste, R., Ferrante, E.: Addressing fairness in artificial intelligence for medical imaging. Nat Commun 13 4581 (2022)
- [20] Lee, J., Brooks, C., Yu, R., Kizilcec, R.: Fairness hub technical briefs: Auc gap (2023)
- [21] Lee, J.K., Bu, Y., Rajan, D., Sattigeri, P., Panda, R., Das, S., Wornell, G.W.: Fair selective classification via sufficiency. In: International Conference on Machine Learning (2021), https://api.semanticscholar.org/CorpusID:235826429
- [22] Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts (2017)
- [23] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization (2019)
- [24] Mittelstadt, B., Wachter, S., Russell, C.: The unfairness of fair machine learning: Levelling down and strict egalitarianism by default (2023), https://arxiv.org/abs/2302.02404
- [25] Mukherjee, D., Yurochkin, M., Banerjee, M., Sun, Y.: Two simple ways to learn individual fairness metrics from data (2020)
- [26] Petersen, E., Ferrante, E., Ganz, M., Feragen, A.: Are demographically invariant models and representations in medical imaging fair? (2024), https://arxiv.org/abs/2305.01397
- [27] Petersen, E., Holm, S., Ganz, M., Feragen, A.: The path toward equal performance in medical machine learning. Patterns 4(7), 100790 (Jul 2023). https://doi.org/10.1016/j.patter.2023.100790, https://dx.doi.org/10.1016/j.patter.2023.100790
- [28] Raff, E., Sylvester, J.: Gradient reversal against discrimination (2018)
- [29] Rajeev, C., Natarajan, K.: Data Augmentation in Classifying Chest Radiograph Images (CXR) Using DCGAN-CNN, pp. 91–110 (11 2023). https://doi.org/10.1007/978-3-031-43205-7_6
- [30] Rubinstein, W.S.: Hereditary breast cancer in jews. Fam. Cancer 3(3-4), 249–257 (2004)
- [31] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: Imagenet large scale visual recognition challenge (2015)
- [32] Seyyed-Kalantari, L., Zhang, H., McDermott, M., Chen, I., Ghassemi, M.: Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature Medicine 27 (12 2021). https://doi.org/10.1038/s41591-021-01595-0
- [33] Shi, H., Seegobin, K., Heng, F., Zhou, K., Chen, R., Qin, H., Manochakian, R., Zhao, Y., Lou, Y.: Genomic landscape of lung adenocarcinomas in different races. Front. Oncol. 12 (Sep 2022)
- [34] Ustun, B., Liu, Y., Parkes, D.: Fairness without harm: Decoupled classifiers with preference guarantees. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 6373–6382. PMLR (09–15 Jun 2019), https://proceedings.mlr.press/v97/ustun19a.html
- [35] Varkey, B.: Principles of clinical ethics and their application to practice. Med. Princ. Pract. 30(1), 17–28 (2021)
- [36] Verma, S., Rubin, J.S.: Fairness definitions explained. 2018 IEEE/ACM International Workshop on Software Fairness (FairWare) pp. 1–7 (2018), https://api.semanticscholar.org/CorpusID:49561627
- [37] Warner, E., Foulkes, W., Goodwin, P., Meschino, W., Blondal, J., Paterson, C., Ozcelik, H., Goss, P., Allingham-Hawkins, D., Hamel, N., Di Prospero, L., Contiga, V., Serruya, C., Klein, M., Moslehi, R., Honeyford, J., Liede, A., Glendon, G., Brunet, J.S., Narod, S.: Prevalence and penetrance of BRCA1 and BRCA2 gene mutations in unselected ashkenazi jewish women with breast cancer. J. Natl. Cancer Inst. 91(14), 1241–1247 (Jul 1999)
- [38] Xu, Z., Li, J., Yao, Q., Li, H., Zhou, S.K.: Fairness in medical image analysis and healthcare: A literature survey (2023)
- [39] Yang, Y., Zhang, H., Gichoya, J.W., Katabi, D., Ghassemi, M.: The limits of fair medical imaging ai in the wild (2023)
- [40] Zong, Y., Yang, Y., Hospedales, T.: Medfair: Benchmarking fairness for medical imaging (2023)
- [41] Žliobaitė, I., Custers, B.: Using sensitive personal data may be necessary for avoiding discrimination in data-driven decision models (2016)