Adding more patient data to a clinical AI model does not reliably make it fairer — and can sometimes make it worse, according to a new study posted to arXiv that examines algorithmic bias in intensive care unit decision-making tools.
The research, which uses two widely studied healthcare datasets — the eICU Collaborative Research Database and the MIMIC-IV dataset — challenges a common assumption in machine learning: that more data means better, fairer models. The findings suggest that when data sources are combined, distribution shifts between hospitals or admission departments can undermine any gains from increased sample size, producing unpredictable effects on how well a model performs for specific patient subgroups.
Why "Just Add More Data" Is an Unreliable Fix
Algorithmic bias in healthcare AI is a well-documented problem. Models trained predominantly on data from one type of hospital, patient population, or clinical setting can perform systematically worse for groups underrepresented in that data — older patients, patients from minority ethnic groups, or those with less common comorbidities. The intuitive response is to broaden the training set by pulling in data from additional sources.
The paper's central finding is blunt: data addition can both help and hurt model fairness and performance, and many intuitive strategies for selecting which data to add are unreliable.
The researchers found this intuition breaks down in practice. When data from different hospitals or departments is combined, the resulting training set may contain conflicting patterns — what statisticians call distribution shift — that confuse the model rather than enriching it. The study describes these effects as "volatile," meaning the direction of impact on subgroup fairness is difficult to predict without empirical testing.
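A toy illustration (not from the study) makes the mechanism concrete. Suppose two hypothetical hospitals label the same feature against different decision boundaries — a simple form of distribution shift. Pooling their records can degrade a model's accuracy on the original hospital's own patients:

```python
import random

random.seed(0)

def make_hospital(n, threshold=0.0):
    # Each hospital labels patients positive when a single feature x
    # exceeds its own threshold. Hospital B's boundary sits elsewhere,
    # a deliberately simple form of distribution shift.
    data = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        data.append((x, int(x > threshold)))
    return data

def fit_nearest_mean(train):
    # Classify by whichever class mean the feature is closer to.
    n1 = max(1, sum(y for _, y in train))
    n0 = max(1, sum(1 - y for _, y in train))
    m1 = sum(x for x, y in train if y == 1) / n1
    m0 = sum(x for x, y in train if y == 0) / n0
    return lambda x: int(abs(x - m1) < abs(x - m0))

def accuracy(model, test):
    return sum(model(x) == y for x, y in test) / len(test)

hospital_a = make_hospital(500, threshold=0.0)
hospital_b = make_hospital(500, threshold=0.5)  # shifted boundary
test_a = make_hospital(200, threshold=0.0)     # held-out hospital A patients

acc_a_only = accuracy(fit_nearest_mean(hospital_a), test_a)
acc_pooled = accuracy(fit_nearest_mean(hospital_a + hospital_b), test_a)
print(f"trained on A only: {acc_a_only:.2f}")
print(f"trained on A + B:  {acc_pooled:.2f}")
```

The pooled model's decision boundary is dragged toward hospital B's convention, so it misclassifies more of hospital A's patients despite having twice the data — the "volatile" effect the study describes, in miniature.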
What the ICU Data Actually Showed
Working with eICU and MIMIC-IV, two large, real-world electronic health record (EHR) datasets drawn from hospital admissions, the research team systematically tested different strategies for selecting and combining data sources. They compared approaches that prioritised data volume, data similarity, and demographic balance, and found that none of these strategies reliably improved subgroup fairness across both datasets.
The study also compared data-centric interventions — changing what goes into the training set — against model-based post-hoc calibration, a technique applied after training to adjust a model's output scores so they better reflect actual probabilities for different groups. Neither approach alone proved consistently sufficient.
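The article does not specify which calibration method the authors used; histogram binning is one common post-hoc recalibration technique, sketched here purely as an illustration of the idea — remapping a model's raw scores to the positive rates actually observed at each score level:

```python
import random
from collections import defaultdict

random.seed(1)

def histogram_calibrator(scores, labels, n_bins=10):
    # Map each score bin to the observed fraction of positives in that
    # bin -- a simple post-hoc recalibration of raw model scores.
    bins = defaultdict(list)
    for s, y in zip(scores, labels):
        bins[min(int(s * n_bins), n_bins - 1)].append(y)
    rate = {b: sum(ys) / len(ys) for b, ys in bins.items()}
    return lambda s: rate.get(min(int(s * n_bins), n_bins - 1), s)

# Hypothetical model that is systematically overconfident for one
# subgroup: true event rates run 0.2 below the raw scores.
scores, labels = [], []
for _ in range(2000):
    s = random.random()
    scores.append(s)
    labels.append(int(random.random() < max(0.0, s - 0.2)))

cal = histogram_calibrator(scores, labels)
cal_scores = [cal(s) for s in scores]

# Gap between mean predicted score and actual event rate, before and
# after recalibration (fit and checked on the same data, for brevity).
event_rate = sum(labels) / len(labels)
raw_gap = abs(sum(scores) / len(scores) - event_rate)
cal_gap = abs(sum(cal_scores) / len(cal_scores) - event_rate)
print(f"score/event-rate gap: raw={raw_gap:.3f}, calibrated={cal_gap:.3f}")
```

In a subgroup-fairness setting, a calibrator like this would typically be fitted separately per group and on held-out data rather than the training set.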
The key finding is that combining both strategies — curating training data and then applying calibration — produced the most reliable improvements in subgroup performance. According to the authors, this pairing matters because data interventions address the upstream source of bias while calibration corrects for residual disparities that survive into the final model.
The Practical Challenge for Clinical AI Developers
The implications for teams building and deploying clinical decision-support tools are significant. In most real-world settings, data collection is constrained: hospitals have access to their own records and perhaps a handful of partner institutions, not an unlimited supply of perfectly matched patient data. The study explicitly frames its investigation around these practical limitations, noting that available additional data sources are often "less than ideal."
This matters because clinical AI tools are increasingly used to support high-stakes decisions — predicting patient deterioration, flagging sepsis risk, or triaging ICU admissions. If a model performs worse for certain patient subgroups, those patients may receive less timely or less appropriate care. The researchers frame this not just as a technical problem but as a systemic one, noting that algorithmic bias can exacerbate systemic harm to already-vulnerable groups.
The study does not offer a simple corrective formula. Instead, it argues that the field needs to move away from rules of thumb — such as "always add more data" or "balance your training set" — toward empirical testing of fairness interventions on specific datasets and target populations. What works in one clinical context may not transfer to another.
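The empirical testing the authors call for amounts to measuring performance per subgroup for each candidate intervention rather than trusting one aggregate number. A minimal version of such a check (the function and data here are illustrative, not from the paper) might look like:

```python
def subgroup_report(predictions, labels, groups):
    # Accuracy per subgroup, plus worst-group accuracy and the
    # largest between-group gap -- the numbers an aggregate metric hides.
    stats = {}
    for g in set(groups):
        idx = [i for i, grp in enumerate(groups) if grp == g]
        correct = sum(predictions[i] == labels[i] for i in idx)
        stats[g] = correct / len(idx)
    worst = min(stats.values())
    gap = max(stats.values()) - worst
    return stats, worst, gap

# Toy predictions for eight patients across two subgroups.
preds  = [1, 1, 0, 0, 1, 0, 1, 1]
labels = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

stats, worst, gap = subgroup_report(preds, labels, groups)
print(stats, worst, gap)
```

Running a report like this before and after each candidate intervention — adding a data source, rebalancing, recalibrating — is what distinguishes empirical testing from the rules of thumb the study cautions against.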
Questioning a Core Assumption in Fairness Research
The broader contribution of the paper is methodological. Much of the fairness-in-machine-learning literature treats data quality and quantity as the primary levers for reducing bias, with model-level corrections as secondary. This study pushes back on that hierarchy: model-based calibration is not a fallback but a necessary component of any serious fairness intervention.
The research also highlights a gap between research settings and deployment realities. Academic benchmarks in this space often assume researchers have flexibility over data collection that clinical teams do not have in practice. By grounding the analysis in two established, publicly available ICU datasets, the authors aim to make their findings directly applicable to practitioners.
The datasets used — eICU and MIMIC-IV — are among the most widely used in clinical machine learning research, which lends the findings broader relevance. That said, both datasets come from US hospital systems, and the results may not generalise to healthcare settings with different data infrastructures or patient populations.
What This Means
For anyone building or evaluating AI tools in high-stakes clinical environments, this study is a direct challenge to the assumption that expanding a training dataset is a reliable path to fairer models — and a practical case for treating post-hoc calibration as standard practice, not an afterthought.