Researchers at the University of Florida have developed an AI-powered tool capable of identifying HIV-related stigma within clinical notes, a step toward making an often-invisible psychosocial barrier visible to healthcare systems at scale.

Stigma is well established as a driver of poor health outcomes for people living with HIV — shaping whether patients stay engaged in care, adhere to treatment, and maintain mental health — yet no off-the-shelf tool existed to extract stigma-related content from the clinical records where it is routinely, if inconsistently, documented. The study, posted to arXiv in April 2025, sets out to close that gap.

How the Dataset Was Built

The research team drew clinical notes from University of Florida (UF) Health covering patients living with HIV between 2012 and 2022. Rather than starting from scratch, the team used expert-curated stigma-related keywords to identify candidate sentences, then iteratively expanded the search using clinical word embeddings — a technique that finds terms used in similar contexts within medical text.
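The embedding-based expansion step can be sketched in a few lines. Everything below is illustrative: the toy vectors stand in for real clinical word embeddings, and the `expand_keywords` function, seed terms, and similarity threshold are assumptions, not the study's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_keywords(seeds, vectors, top_k=3, threshold=0.7):
    """Grow a seed keyword list with terms whose embeddings sit nearby."""
    expanded = set(seeds)
    for seed in seeds:
        if seed not in vectors:
            continue
        scored = [
            (term, cosine(vectors[seed], vectors[term]))
            for term in vectors if term not in expanded
        ]
        scored.sort(key=lambda x: x[1], reverse=True)
        expanded.update(term for term, s in scored[:top_k] if s >= threshold)
    return expanded

# Toy 3-dimensional vectors standing in for clinical embeddings.
toy_vectors = {
    "stigma":   [0.90, 0.10, 0.00],
    "shame":    [0.85, 0.15, 0.05],
    "disclose": [0.10, 0.90, 0.10],
    "secrecy":  [0.15, 0.85, 0.10],
    "glucose":  [0.00, 0.10, 0.95],
}

print(expand_keywords({"stigma", "disclose"}, toy_vectors))
```

In this toy run, "shame" and "secrecy" are pulled in because their vectors are close to the seeds, while an unrelated clinical term like "glucose" is not — the same intuition, at miniature scale, behind using embeddings to surface euphemistic or variant stigma language.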

The result was 1,332 sentences annotated by human reviewers across four stigma subscales: Concern with Public Attitudes, Disclosure Concerns, Negative Self-Image, and Personalized Stigma. This annotation framework reflects established stigma research, giving the classification task a clinical grounding rather than relying on ad hoc labelling.

The Models Tested — and How They Performed

The study compared two types of model. Encoder-based models — GatorTron-large and BERT — were fine-tuned on the annotated data. Generative large language models — GPT-OSS-20B, LLaMA-8B, and MedGemma-27B — were tested under both zero-shot and few-shot prompting conditions, meaning they were either given no examples or a handful of examples before being asked to classify new sentences.
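The difference between zero-shot and few-shot prompting comes down to how the prompt is assembled. The sketch below is purely illustrative — the study's actual prompts and example sentences are not reproduced in this article — but it shows the mechanic: a few labelled examples are prepended before the sentence to be classified.

```python
# Hypothetical labelled examples; the four label names are from the study.
EXAMPLES = [
    ("Patient worries coworkers will find out about his diagnosis.",
     "Disclosure Concerns"),
    ("She states she feels 'dirty' because of her HIV status.",
     "Negative Self-Image"),
]

LABELS = ["Concern with Public Attitudes", "Disclosure Concerns",
          "Negative Self-Image", "Personalized Stigma"]

def build_prompt(sentence, n_shots=0):
    """Assemble a zero-shot (n_shots=0) or few-shot classification prompt."""
    lines = ["Classify the sentence into one of: " + ", ".join(LABELS) + "."]
    for text, label in EXAMPLES[:n_shots]:      # few-shot: prepend examples
        lines.append(f"Sentence: {text}\nLabel: {label}")
    lines.append(f"Sentence: {sentence}\nLabel:")  # model completes the label
    return "\n\n".join(lines)

print(build_prompt("He was refused service after disclosing his status.",
                   n_shots=2))
```

With `n_shots=0` the model sees only the instruction and the target sentence; with `n_shots=5`, as in the study's best generative runs, it sees five worked examples first.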

GatorTron-large, a model pre-trained specifically on large volumes of clinical text, achieved the top Micro F1 score of 0.62, surpassing every generative model tested. Among the generative models, few-shot prompting produced meaningful gains: 5-shot GPT-OSS-20B reached a Micro F1 of 0.57, and 5-shot LLaMA-8B reached 0.59. Zero-shot generative inference fared considerably worse, with failure rates — instances where models produced unusable or malformed outputs — reaching as high as 32%.
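For readers unfamiliar with the metric: Micro F1 pools true positives, false positives, and false negatives across all four subscales before computing a single F1 score, so frequent categories weigh more than rare ones. A minimal sketch, assuming sentences are annotated with label sets (the data below is made up):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over per-sentence label sets (multi-label)."""
    tp = fp = fn = 0
    for true_set, pred_set in zip(y_true, y_pred):
        tp += len(true_set & pred_set)   # labels correctly predicted
        fp += len(pred_set - true_set)   # labels predicted but not present
        fn += len(true_set - pred_set)   # labels present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Toy gold labels vs. predictions for two sentences.
y_true = [{"Disclosure Concerns"}, {"Negative Self-Image", "Personalized Stigma"}]
y_pred = [{"Disclosure Concerns"}, {"Negative Self-Image"}]
print(round(micro_f1(y_true, y_pred), 2))  # -> 0.8
```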

All benchmark results are as reported by the study authors and have not been independently verified.
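The zero-shot failure rates quoted above count replies that cannot be mapped to any label. A minimal sketch of such a validity check — the label names are from the study, but the matching logic is an assumption for illustration:

```python
VALID_LABELS = {
    "Concern with Public Attitudes",
    "Disclosure Concerns",
    "Negative Self-Image",
    "Personalized Stigma",
}

def parse_label(reply):
    """Return the predicted label, or None if the output is unusable."""
    matches = [lab for lab in VALID_LABELS if lab.lower() in reply.lower()]
    return matches[0] if len(matches) == 1 else None

replies = [
    "Label: Disclosure Concerns",
    "The sentence reflects stigma in general.",              # no valid label
    "Could be Negative Self-Image or Personalized Stigma.",  # ambiguous
]
failures = sum(parse_label(r) is None for r in replies)
print(f"failure rate: {failures / len(replies):.0%}")  # -> 67%
```

Outputs that name no label, or more than one, count as failures — the kind of malformed response the study reports at rates up to 32% for zero-shot generative models.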

Where the Models Struggled

Performance was not uniform across the four stigma categories. Negative Self-Image proved the easiest subscale to classify, likely because the language associated with it is relatively direct and consistent. Personalized Stigma — which captures experiences of actual discrimination or rejection — remained the hardest category to classify correctly.

This variation matters practically. Personalized stigma, the category models found most difficult, is also arguably the most clinically actionable: a patient who has experienced direct discrimination may need different support than one expressing more internalised concerns. The gap between model capability and clinical need is sharpest precisely where the stakes are highest.

Why Clinical NLP for Stigma Is Hard

Clinical notes present particular challenges for natural language processing. They are written in shorthand, contain abbreviations, vary by clinician, and rarely use the same terminology twice to describe the same phenomenon. Stigma-related content adds another layer of difficulty: clinicians may document it obliquely, embedded in broader psychosocial assessments, or may use euphemistic language that doesn't map cleanly to any keyword list.

The iterative keyword expansion approach the team used — leaning on clinical word embeddings to surface related terms — reflects an awareness of this problem. But the relatively modest F1 scores across all models suggest the task remains genuinely difficult, and that a Micro F1 of 0.62 represents a meaningful but incomplete solution.

What This Means

If refined and validated across other health systems, this tool could allow researchers and clinicians to systematically track HIV-related stigma at population scale using existing records — turning a routinely undercounted psychosocial factor into a measurable variable that health systems can act on.