New Study Shows How Positive-Sum Fairness Impacts Medical AI Models in Chest Radiography

30 Dec 2024
  1. Abstract and Introduction

  2. Related work

  3. Methods

    3.1 Positive-sum fairness

    3.2 Application

  4. Experiments

    4.1 Initial results

    4.2 Positive-sum fairness

  5. Conclusion and References

4 Experiments

Data We use chest radiographs from MIMIC-CXR-JPG [16,29]. The dataset is annotated for 14 findings, but we focus on lung lesions, pneumonia, pleural effusion, and consolidation, because the diseases associated with these findings have been shown to be correlated with ethnicity [4,17,33]. We use only frontal images and split the dataset into training, validation, and test sets at the patient level, so that no patient contributes images to more than one set. In total, 237,972, 1,959, and 3,403 images are used for training, validation, and testing, respectively.
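A patient-level split can be sketched as follows. This is a minimal illustration, not the authors' code: the `(patient_id, image_id)` record layout, the function name `split_by_patient`, the split fractions, and the seed are all assumptions, since the section reports only the resulting image counts.

```python
import random
from collections import defaultdict


def split_by_patient(records, val_frac=0.01, test_frac=0.01, seed=0):
    """Assign whole patients (never single images) to train/val/test.

    records: iterable of (patient_id, image_id) pairs (hypothetical layout).
    val_frac, test_frac: fraction of *patients* per split (assumed values).
    Returns (splits, assignment): image ids per split, and split per patient.
    """
    by_patient = defaultdict(list)
    for patient_id, image_id in records:
        by_patient[patient_id].append(image_id)

    # Shuffle patients, not images, so all images of a patient stay together.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    n_val = int(len(patients) * val_frac)
    n_test = int(len(patients) * test_frac)

    assignment = {}
    for i, patient_id in enumerate(patients):
        if i < n_val:
            assignment[patient_id] = "val"
        elif i < n_val + n_test:
            assignment[patient_id] = "test"
        else:
            assignment[patient_id] = "train"

    splits = {"train": [], "val": [], "test": []}
    for patient_id, image_ids in by_patient.items():
        splits[assignment[patient_id]].extend(image_ids)
    return splits, assignment
```

Shuffling patients rather than images is the point of the design: images from the same patient tend to be highly similar, so an image-level split would leak information from training into evaluation.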

Sensitive attributes We define the protected subgroups based on the self-reported race from MIMIC-IV [14,15], which we divide into five groups: White, African-American, Latino, Asian, and others.

Model training We train our four models to predict all 14 CXR findings and a race group. We initialize a DenseNet-121 backbone with pre-trained weights from ImageNet [31]. The images are resized to 256 × 256 and augmented using random rotation in the [-15, 15] degree range and random horizontal flip. We conduct the experiments with 8 NVIDIA V100 GPUs. We use AdamW [23] with an initial learning rate of 0.002, which is updated using the cosine annealing warm up [22] scheduler.
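To make the learning-rate schedule concrete, here is a minimal sketch of one common reading of a cosine annealing schedule with warm-up: a linear ramp to the initial rate of 0.002, followed by cosine decay. Only the value 0.002 comes from the text. The function name `lr_at_step`, the warm-up length, the total step count, the minimum rate, and the absence of periodic restarts are assumptions; the scheduler cited as [22] may differ in any of these.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr=0.002, min_lr=0.0):
    """Learning rate at a given optimizer step (0-indexed).

    Assumed shape: linear warm-up to base_lr over warmup_steps, then a single
    cosine decay from base_lr to min_lr over the remaining steps.
    """
    if step < warmup_steps:
        # Linear warm-up: reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Example: print the rate at a few points of a hypothetical 1,000-step run.
if __name__ == "__main__":
    for s in (0, 49, 50, 525, 1000):
        print(s, lr_at_step(s, total_steps=1000, warmup_steps=50))
```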

Evaluation We compare the four models by general performance and by fairness across the protected subgroups. General performance is assessed using the area under the ROC curve (AUROC). The traditional group fairness metric that we compare with positive-sum fairness is expressed as (1 - largest disparity between protected subgroups in terms of AUROC) [20]. We report the AUROC mean and confidence bounds generated using bootstrapping with 300 samples [7]. We do not consider protected subgroups that have fewer than 5 positive cases or fewer than 5 negative cases, as this results in poor estimates of performance.
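The evaluation protocol above can be sketched in plain Python. This is an illustrative reading under stated assumptions, not the authors' implementation: "300 samples" is taken to mean 300 bootstrap resamples, the confidence bounds are taken to be a 95% percentile interval, and all function names (`auroc`, `group_fairness`, `bootstrap_auroc`) are hypothetical.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def group_fairness(labels, scores, groups, min_cases=5):
    """Return (1 - largest AUROC gap between subgroups, per-subgroup AUROC).

    Subgroups with fewer than min_cases positives or negatives are skipped.
    """
    per_group = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        y = [labels[i] for i in idx]
        n_pos = sum(y)
        if n_pos >= min_cases and len(y) - n_pos >= min_cases:
            per_group[g] = auroc(y, [scores[i] for i in idx])
    if len(per_group) < 2:
        raise ValueError("need at least two eligible subgroups")
    gap = max(per_group.values()) - min(per_group.values())
    return 1.0 - gap, per_group


def bootstrap_auroc(labels, scores, n_boot=300, alpha=0.05, seed=0):
    """Mean AUROC and percentile bounds over n_boot resamples (assumed 95% interval)."""
    n = len(labels)
    if not 0 < sum(labels) < n:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if 0 < sum(y) < n:  # skip resamples that contain a single class
            stats.append(auroc(y, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(stats) / n_boot, lower, upper
```

The pairwise AUROC is quadratic in the number of cases and is written for readability; on a test set of a few thousand images a rank-based implementation from a numerical library would be the more practical choice.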

Authors:

(1) Samia Belhadj∗, Lunit Inc., Seoul, Republic of Korea (samia.belhadj@lunit.io);

(2) Sanguk Park [0009-0005-0538-5522]*, Lunit Inc., Seoul, Republic of Korea (tony.superb@lunit.io);

(3) Ambika Seth, Lunit Inc., Seoul, Republic of Korea (ambika.seth@lunit.io);

(4) Hesham Dar [0009-0003-6458-2097], Lunit Inc., Seoul, Republic of Korea (heshamdar@lunit.io);

(5) Thijs Kooi [0009-0003-6458-2097], Lunit Inc., Seoul, Republic of Korea (tkooi@lunit.io).


This paper is available on arXiv under a CC BY-NC-SA 4.0 license.