This vignette demonstrates how to obtain, report, and interpret model fairness metrics for binary protected attributes with the fairmetrics package. We illustrate this through a case study based on a preprocessed version of the MIMIC-II clinical database [1], which has previously been used to study the relationship between indwelling arterial catheters and mortality in hemodynamically stable patients with respiratory failure [2]. The original, unprocessed dataset is publicly available through PhysioNet [3]. A preprocessed version of this dataset is included in the fairmetrics package as mimic_preprocessed and is used in this vignette.
In this setting, we construct a model to predict 28-day mortality (day_28_flg). To do this, we split the dataset into training and testing sets and fit a random forest model. The first 700 patients are used as the training set and the remaining patients are used as the testing set. After the model is fit, it is used to predict 28-day mortality. The predicted probabilities are saved as a new column in the testing data and are used to assess model fairness.
# Load required libraries
library(dplyr)
library(fairmetrics)
library(randomForest)
# Set seed for reproducibility
set.seed(1)
# Use 700 labels to train on the mimic_preprocessed dataset
train_data <- mimic_preprocessed %>%
  filter(row_number() <= 700)
# Test the model on the remaining data
test_data <- mimic_preprocessed %>%
  filter(row_number() > 700)
# Fit a random forest model
rf_model <- randomForest(factor(day_28_flg) ~ ., data = train_data, ntree = 1000)
# Save model prediction
test_data$pred <- predict(rf_model, newdata = test_data, type = "prob")[, 2]
The fairmetrics package is used to assess model fairness across binary protected attributes. This means that the protected attribute column must contain exactly two unique values. To evaluate the fairness of the random forest model we fit, we examine patient gender as the binary protected attribute.
# Recode gender variable explicitly for readability:
test_data <- test_data %>%
  mutate(gender = ifelse(gender_num == 1, "Male", "Female"))
Since many fairness metrics require binary predictions, we threshold the predicted probabilities using a fixed cutoff. We set a threshold of 0.41 to keep the overall false positive rate (FPR) at approximately 5%.
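One way such a cutoff could be obtained (a minimal sketch; this helper logic is not part of the fairmetrics API) is to take the 95th percentile of the predicted probabilities among the true negatives, so that roughly 5% of negative cases are flagged as positive:
# Sketch: choose a cutoff so that ~5% of true negatives exceed it (overall FPR ~ 5%)
cutoff <- quantile(test_data$pred[test_data$day_28_flg == 0], probs = 0.95)
cutoff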
Individual fairness metrics can be evaluated with the eval_* functions (for a list of the functions contained in fairmetrics, see here). For example, if we are interested in calculating the statistical parity of our model across gender (here assumed to be binary), we write:
eval_stats_parity(
  data = test_data,
  outcome = "day_28_flg",
  group = "gender",
  probs = "pred",
  cutoff = 0.41,
  message = TRUE
)
#> There is evidence that model does not satisfy statistical parity.
#> Metric GroupFemale GroupMale Difference 95% Diff CI Ratio
#> 1 Positive Prediction Rate 0.17 0.09 0.08 [0.04, 0.12] 1.89
#> 95% Ratio CI
#> 1 [1.35, 2.64]
The returned data frame gives the positive prediction rate for the two groups defined by the binary protected attribute (GroupFemale and GroupMale in this case), the difference and ratio between the groups, and bootstrap confidence intervals for the estimated difference and ratio. For inference, a difference whose confidence interval contains 0, or a ratio whose confidence interval contains 1, can be treated as not statistically significant.
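For intuition, the group-specific positive prediction rates in the first two columns can be reproduced directly from the thresholded predictions. The following is a minimal sketch using dplyr (not a fairmetrics call; it assumes predictions at or above the cutoff are counted as positive):
# Positive prediction rate by group at the 0.41 cutoff
test_data %>%
  group_by(gender) %>%
  summarise(positive_prediction_rate = mean(pred >= 0.41))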
The syntax of eval_stats_parity() extends to the other eval_* functions as well. Additionally, to calculate several fairness metrics (and their confidence intervals) for the model simultaneously, we can pass the test data with its predicted results to the get_fairness_metrics function.
get_fairness_metrics(
  data = test_data,
  outcome = "day_28_flg",
  group = "gender",
  probs = "pred",
  cutoff = 0.41
)
#> Fairness Assesment Metric
#> 1 Statistical Parity Positive Prediction Rate
#> 2 Equal Opportunity False Negative Rate
#> 3 Predictive Equality False Positive Rate
#> 4 Balance for Positive Class Avg. Predicted Positive Prob.
#> 5 Balance for Negative Class Avg. Predicted Negative Prob.
#> 6 Positive Predictive Parity Positive Predictive Value
#> 7 Negative Predictive Parity Negative Predictive Value
#> 8 Brier Score Parity Brier Score
#> 9 Overall Accuracy Parity Accuracy
#> 10 Treatment Equality (False Negative)/(False Positive) Ratio
#> GroupFemale GroupMale Difference 95% Diff CI Ratio 95% Ratio CI
#> 1 0.17 0.09 0.08 [0.04, 0.12] 1.89 [1.35, 2.64]
#> 2 0.38 0.57 -0.19 [-0.34, -0.04] 0.67 [0.47, 0.94]
#> 3 0.08 0.03 0.05 [0.02, 0.08] 2.67 [1.4, 5.08]
#> 4 0.46 0.37 0.09 [0.04, 0.14] 1.24 [1.09, 1.42]
#> 5 0.15 0.11 0.04 [0.02, 0.06] 1.36 [1.17, 1.59]
#> 6 0.62 0.68 -0.06 [-0.23, 0.11] 0.91 [0.71, 1.18]
#> 7 0.92 0.91 0.01 [-0.03, 0.05] 1.01 [0.97, 1.05]
#> 8 0.09 0.08 0.01 [-0.01, 0.03] 1.12 [0.88, 1.43]
#> 9 0.87 0.89 -0.02 [-0.06, 0.02] 0.98 [0.93, 1.02]
#> 10 1.03 2.78 -1.75 [-3.6, 0.1] 0.37 [0.17, 0.79]
With the get_fairness_metrics function, it's possible to examine the bootstrapped confidence intervals to determine which fairness criteria show evidence of violation. From the above output, Statistical Parity, Equal Opportunity, Predictive Equality, Balance for Positive Class, and Balance for Negative Class show evidence of being violated, while Positive Predictive Parity, Negative Predictive Parity, Brier Score Parity, and Overall Accuracy Parity do not. Treatment Equality remains ambiguous: when assessed by the difference, the 95% bootstrapped confidence interval crosses 0, indicating no evidence that the criterion is violated, but when assessed by the ratio, the interval excludes 1, which does indicate a violation.
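To make the ambiguous case concrete, Treatment Equality compares the ratio of false negatives to false positives within each group at the chosen cutoff. The values in row 10 can be approximated by hand with a sketch like the following (a manual calculation for illustration, not a fairmetrics function; it assumes predictions at or above the cutoff are positive):
# Sketch: (false negatives)/(false positives) within each gender group
te_ratio <- function(d, cutoff = 0.41) {
  fn <- sum(d$pred < cutoff & d$day_28_flg == 1)   # missed true positives
  fp <- sum(d$pred >= cutoff & d$day_28_flg == 0)  # false alarms
  fn / fp
}
te_ratio(subset(test_data, gender == "Female"))
te_ratio(subset(test_data, gender == "Male"))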
NOTE:
Statistical inference from bootstrapped confidence intervals should be interpreted with caution. A confidence interval crossing 0 (for differences) or 1 (for ratios) means the evidence is inconclusive rather than proving absence of unfairness. Apparent violations may reflect sampling variability rather than systematic bias. Always complement these results with domain knowledge, sensitivity analyses, and additional fairness diagnostics before drawing strong conclusions about a specific fairness assessment.
While fairmetrics focuses on assessing fairness of models across binary protected attributes, it is possible to work with protected attributes consisting of more than two groups by using "one-vs-all" comparisons and a little data wrangling to create the appropriate binary columns, as sketched below.
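For example, suppose the data contained a multi-level attribute such as the care unit a patient was admitted to. The column name service_unit below is illustrative and assumed rather than guaranteed to exist in mimic_preprocessed; a one-vs-all comparison recodes it into a binary attribute before calling any eval_* function:
# Hypothetical one-vs-all recoding of a multi-level attribute
test_data_ova <- test_data %>%
  mutate(unit_micu = ifelse(service_unit == "MICU", "MICU", "Other"))
eval_stats_parity(
  data = test_data_ova,
  outcome = "day_28_flg",
  group = "unit_micu",
  probs = "pred",
  cutoff = 0.41
)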
To evaluate a model's fairness for a specific subgroup, simply subset the data. For example, to evaluate conditional statistical parity among subjects aged 60 and older using the previously fitted random forest model, subsetting the test data and passing it to eval_stats_parity is all that's required.
eval_stats_parity(
  data = subset(test_data, age >= 60),
  outcome = "day_28_flg",
  group = "gender",
  probs = "pred",
  cutoff = 0.41,
  message = TRUE
)
#> There is not enough evidence that the model does not satisfy
#> statistical parity.
#> Metric GroupFemale GroupMale Difference 95% Diff CI Ratio
#> 1 Positive Prediction Rate 0.34 0.25 0.09 [0, 0.18] 1.36
#> 95% Ratio CI
#> 1 [1.01, 1.84]
Looking at the bootstrapped confidence intervals, the ratio interval excludes 1 and the difference interval only just reaches 0, so there is some evidence that statistical parity is violated between male and female subjects aged 60 and over. A similar example can be constructed with get_fairness_metrics for a more comprehensive view of the model's conditional fairness.
get_fairness_metrics(
  data = subset(test_data, age >= 60),
  outcome = "day_28_flg",
  group = "gender",
  probs = "pred",
  cutoff = 0.41
)
#> Fairness Assesment Metric
#> 1 Statistical Parity Positive Prediction Rate
#> 2 Equal Opportunity False Negative Rate
#> 3 Predictive Equality False Positive Rate
#> 4 Balance for Positive Class Avg. Predicted Positive Prob.
#> 5 Balance for Negative Class Avg. Predicted Negative Prob.
#> 6 Positive Predictive Parity Positive Predictive Value
#> 7 Negative Predictive Parity Negative Predictive Value
#> 8 Brier Score Parity Brier Score
#> 9 Overall Accuracy Parity Accuracy
#> 10 Treatment Equality (False Negative)/(False Positive) Ratio
#> GroupFemale GroupMale Difference 95% Diff CI Ratio 95% Ratio CI
#> 1 0.34 0.25 0.09 [0.01, 0.17] 1.36 [1.01, 1.83]
#> 2 0.30 0.46 -0.16 [-0.32, 0] 0.65 [0.41, 1.04]
#> 3 0.18 0.11 0.07 [-0.01, 0.15] 1.64 [0.91, 2.96]
#> 4 0.50 0.41 0.09 [0.04, 0.14] 1.22 [1.08, 1.38]
#> 5 0.26 0.22 0.04 [0.01, 0.07] 1.18 [1.03, 1.36]
#> 6 0.63 0.69 -0.06 [-0.22, 0.1] 0.91 [0.71, 1.17]
#> 7 0.86 0.81 0.05 [-0.03, 0.13] 1.06 [0.96, 1.17]
#> 8 0.14 0.16 -0.02 [-0.05, 0.01] 0.88 [0.71, 1.07]
#> 9 0.78 0.78 0.00 [-0.08, 0.08] 1.00 [0.91, 1.1]
#> 10 0.71 1.88 -1.17 [-2.54, 0.2] 0.38 [0.16, 0.89]
Conditioning on subjects aged 60 and older, Statistical Parity, Balance for Positive Class, and Balance for Negative Class show evidence of being violated based on the 95% bootstrapped confidence intervals, while Equal Opportunity, Predictive Equality, Positive Predictive Parity, Negative Predictive Parity, Brier Score Parity, and Overall Accuracy Parity do not. As in the unconditioned case, Treatment Equality remains ambiguous given the contrast between the difference and ratio intervals: the difference interval crosses 0, indicating no evidence of violation, while the ratio interval excludes 1, indicating that there is.
The function get_fairness_metrics() computes Wald-type confidence intervals for both group-specific and disparity metrics using a nonparametric bootstrap. To illustrate how the confidence intervals (CIs) are constructed, we use the following example involving the false positive rate (\(\textrm{FPR}\)).
Let \(\widehat{\textrm{FPR}}_a\) and \(\textrm{FPR}_a\) denote the estimated and true FPR in group \(A = a\). Then the difference \(\widehat{\Delta}_{\textrm{FPR}} = \widehat{\textrm{FPR}}_{a_1} - \widehat{\textrm{FPR}}_{a_0}\) satisfies (e.g., Gronsbell et al., 2018):
\[ \sqrt{n}(\widehat{\Delta}_{\textrm{FPR}} - \Delta_{\textrm{FPR}}) \overset{d}{\to} \mathcal{N}(0, \sigma^2) \]
We estimate the standard error of \(\widehat{\Delta}_{\textrm{FPR}}\) using bootstrap resampling within groups, and form a Wald-style confidence interval:
\[ \widehat{\Delta}_{\textrm{FPR}} \pm z_{1-\alpha/2} \cdot \widehat{\textrm{se}}(\widehat{\Delta}_{\textrm{FPR}}) \]
For ratios, such as \(\widehat{\rho}_{\textrm{FPR}} = \widehat{\textrm{FPR}}_{a_1} / \widehat{\textrm{FPR}}_{a_0}\), we apply a log transformation and use the delta method:
\[ \log(\widehat{\rho}_{\textrm{FPR}}) \pm z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right] \]
Exponentiation of the bounds yields a confidence interval for the ratio on the original scale:
\[ \left[ \exp\left\{\log(\widehat{\rho}_{\textrm{FPR}}) - z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right]\right\},\ \exp\left\{\log(\widehat{\rho}_{\textrm{FPR}}) + z_{1-\alpha/2} \cdot \widehat{\textrm{se}}\left[\log(\widehat{\rho}_{\textrm{FPR}})\right]\right\} \right]. \]
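As a rough illustration of the formulas above (a minimal sketch, not the package's internal implementation), a Wald-type interval for the FPR difference between the gender groups could be computed as follows:
# Sketch: Wald-type CI for the FPR difference, bootstrapping within groups
fpr <- function(d, cutoff = 0.41) {
  neg <- d[d$day_28_flg == 0, ]   # true negatives
  mean(neg$pred >= cutoff)        # share predicted positive among them
}
female <- subset(test_data, gender == "Female")
male <- subset(test_data, gender == "Male")
est <- fpr(female) - fpr(male)
set.seed(1)
boot_diff <- replicate(1000, {
  f <- female[sample(nrow(female), replace = TRUE), ]
  m <- male[sample(nrow(male), replace = TRUE), ]
  fpr(f) - fpr(m)
})
se_hat <- sd(boot_diff)
c(lower = est - qnorm(0.975) * se_hat, upper = est + qnorm(0.975) * se_hat)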
Raffa, J. (2016). Clinical data from the MIMIC-II database for a case study on indwelling arterial catheters (version 1.0). PhysioNet. https://doi.org/10.13026/C2NC7F
Raffa, J. D., Ghassemi, M., Naumann, T., Feng, M. & Hsu, D. (2016). Data analysis. In: Secondary Analysis of Electronic Health Records. Springer, Cham.
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., … & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101(23), e215–e220.
Gao, J. et al. (2024). What is Fair? Defining Fairness in Machine Learning for Health. arXiv. https://arxiv.org/abs/2406.09307
Gronsbell, J. L. & Cai, T. (2018). Semi-supervised approaches to efficient evaluation of model prediction performance. Journal of the Royal Statistical Society Series B (Statistical Methodology) 80, 579–594.
Hort, M., Chen, Z., Zhang, J. M., Harman, M. & Sarro, F. (2022). Bias mitigation for machine learning classifiers: A comprehensive survey. arXiv. https://arxiv.org/abs/2207.07068
Hsu, D. J. et al. (2015). The association between indwelling arterial catheters and mortality in hemodynamically stable patients with respiratory failure. CHEST Journal 148, 1470–1476.
Efron, B. & Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Statistical Science 1.