How Bias in Medical AI Affects Diagnoses Across Different Groups

29 Dec 2024
  1. Abstract and Introduction

  2. Related work

  3. Methods

    3.1 Positive-sum fairness

    3.2 Application

  4. Experiments

    4.1 Initial results

    4.2 Positive-sum fairness

  5. Conclusion and References

Bias is commonly identified in medical image analysis applications [38,40]. For instance, a CNN trained on brain MRI performed significantly differently across ethnicities [6]. Seyyed-Kalantari et al. [32] observed that minorities received higher rates of algorithmic underdiagnosis. Zong et al. [40] assessed bias mitigation algorithms in both in- and out-of-distribution settings. Their experiments demonstrated that bias is widespread in AI-based medical imaging classifiers and that none of the bias mitigation algorithms was able to prevent it.

Different definitions of fairness are used:

Individual fairness [25] requires that similar individuals be treated equally and thus receive similar predictions. For example, a model should produce comparable diagnoses for two similar X-ray images.

Group fairness requires equal performance across subgroups defined by sensitive attributes (e.g., race, sex, and age). Common group fairness metrics are demographic parity [8], equalized odds [12], and predictive rate parity, also called sufficiency [21].

Minimax fairness [5] seeks to ensure that the worst-off group is treated as fairly as possible, reducing the most severe negative impacts of a decision or system.
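
The group and minimax notions above can be made concrete with a small sketch. The code below is illustrative only and is not taken from the paper: the function names (`group_rates`, `gap`, `worst_group_error`, `minimax_choice`) and the toy labels are hypothetical, and binary labels and predictions are assumed. It computes per-group rates, reports the largest between-group gap for a chosen rate (the group fairness view), and picks the candidate model whose worst-off group has the lowest error (a simplified reading of the minimax view).

```python
import numpy as np

def group_rates(y_true, y_pred, groups):
    """Per-group positive rate, TPR, FPR, PPV and error for binary labels."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    rates = {}
    for g in np.unique(groups):
        mask = groups == g
        t, p = y_true[mask], y_pred[mask]
        tp = np.sum((t == 1) & (p == 1))
        fp = np.sum((t == 0) & (p == 1))
        fn = np.sum((t == 1) & (p == 0))
        tn = np.sum((t == 0) & (p == 0))
        rates[str(g)] = {
            "positive_rate": (tp + fp) / len(t),                   # demographic parity
            "tpr": tp / (tp + fn) if tp + fn else float("nan"),    # equalized odds (1/2)
            "fpr": fp / (fp + tn) if fp + tn else float("nan"),    # equalized odds (2/2)
            "ppv": tp / (tp + fp) if tp + fp else float("nan"),    # predictive rate parity
            "error": (fp + fn) / len(t),
        }
    return rates

def gap(rates, key):
    """Largest between-group difference for one rate (0 means parity)."""
    values = [r[key] for r in rates.values()]
    return max(values) - min(values)

def worst_group_error(rates):
    """Error of the worst-off group, the quantity a minimax criterion minimizes."""
    return max(r["error"] for r in rates.values())

def minimax_choice(candidates):
    """Among candidate models, return the name with the lowest worst-group error."""
    return min(candidates, key=lambda name: worst_group_error(candidates[name]))

# Toy data (hypothetical): two groups of four patients, two candidate models.
y_true   = [1, 1, 0, 0,  1, 1, 0, 0]
groups   = ["A"] * 4 + ["B"] * 4
y_pred_1 = [1, 1, 0, 0,  1, 0, 1, 0]   # perfect on A, poor on B
y_pred_2 = [1, 1, 0, 1,  1, 1, 0, 1]   # one false positive in each group

rates_1 = group_rates(y_true, y_pred_1, groups)
rates_2 = group_rates(y_true, y_pred_2, groups)

print("model_1 demographic parity gap:", gap(rates_1, "positive_rate"))
print("model_1 TPR gap:", gap(rates_1, "tpr"), "FPR gap:", gap(rates_1, "fpr"))
print("minimax choice:", minimax_choice({"model_1": rates_1, "model_2": rates_2}))
```

In this toy setup both models have the same average error, yet the first satisfies demographic parity while violating equalized odds, and the minimax criterion prefers the second because its worst-off group fares better.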

These definitions have pros and cons [36]. Individual fairness relies on the choice of the distance metric, which requires expert input. In minimax fairness, the ideal solution is difficult to compute and the degree of unfairness relies heavily on the choice of the set of models. Group fairness metrics are easy to implement and understand, but are not always suited to the problem or compatible with one another [2,18]. And even though prior work has broadened the notion of group fairness by adding normative choices other than strict equality [1], none of the proposed metrics prevents the harm that could be done to each subgroup's individual performance or to the benefit of the population as a whole.
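
The incompatibility between group fairness metrics can be seen in a small worked example. The sketch below uses hypothetical data, not results from the paper, and assumes disease prevalence differs between two groups. A perfect classifier then satisfies equalized odds but not demographic parity, and forcing demographic parity requires introducing errors in one group.

```python
import numpy as np

# Hypothetical cohort: prevalence is 50% in group A and 25% in group B.
y_true = np.array([1, 1, 0, 0,  1, 0, 0, 0])
groups = np.array(["A"] * 4 + ["B"] * 4)
a, b = groups == "A", groups == "B"

def positive_rate(pred, mask):
    """Fraction of the group flagged as positive."""
    return pred[mask].mean()

def tpr(true, pred, mask):
    """Fraction of the group's diseased patients that are flagged."""
    return pred[mask & (true == 1)].mean()

def fpr(true, pred, mask):
    """Fraction of the group's healthy patients that are flagged."""
    return pred[mask & (true == 0)].mean()

def gaps(true, pred):
    """Absolute between-group gaps in positive rate, TPR and FPR."""
    return {
        "demographic_parity": abs(positive_rate(pred, a) - positive_rate(pred, b)),
        "tpr": abs(tpr(true, pred, a) - tpr(true, pred, b)),
        "fpr": abs(fpr(true, pred, a) - fpr(true, pred, b)),
    }

# A perfect classifier: equalized odds holds, demographic parity does not.
perfect = y_true.copy()
perfect_gaps = gaps(y_true, perfect)

# Forcing equal positive rates means flagging one healthy patient in group B.
forced_parity = np.array([1, 1, 0, 0,  1, 1, 0, 0])
forced_gaps = gaps(y_true, forced_parity)

print("perfect classifier:", perfect_gaps)
print("forced parity:", forced_gaps)
```

With unequal prevalence, the only way to close the demographic parity gap here is to accept a false positive in group B, which opens a gap in false positive rates instead.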

As mentioned in the introduction, and similarly to [24,34,27,26], we believe that medical AI differs from other domains in that each improvement can save lives. Therefore, increasing disparities to achieve the best possible performance for each demographic subgroup and for the population as a whole could be justified. Previous research has shown that images themselves can carry demographic encodings [10,9]. For example, Yang et al. [39] investigate how these encodings are exploited by analyzing the use of demographic shortcuts in disease classification. Two papers [41,11] examine the relevance of explicitly using sensitive attributes in fair classification systems for non-medical problems. They compare several models that leverage sensitive attributes with a model that is not trained on any sensitive attribute.

Authors:

(1) Samia Belhadj*, Lunit Inc., Seoul, Republic of Korea (samia.belhadj@lunit.io);

(2) Sanguk Park [0009-0005-0538-5522]*, Lunit Inc., Seoul, Republic of Korea (tony.superb@lunit.io);

(3) Ambika Seth, Lunit Inc., Seoul, Republic of Korea (ambika.seth@lunit.io);

(4) Hesham Dar [0009-0003-6458-2097], Lunit Inc., Seoul, Republic of Korea (heshamdar@lunit.io);

(5) Thijs Kooi [0009-0003-6458-2097], Lunit Inc., Seoul, Republic of Korea (tkooi@lunit.io).


This paper is available on arXiv under the CC BY-NC-SA 4.0 license.