BACKGROUND AND RATIONALE FOR THE STUDY
Conducting clinical and quality-of-care assessment research in emergency medicine is as
difficult as it is important. It is difficult because the vast number of patients who
need to be treated and the chronic shortage of staff make ad hoc data collection
impractical. It is important because, in the end, research enables emergency physicians
and nurses to base their practice on evidence obtained in their own, unique setting,
rather than on evidence obtained in far-removed contexts, as is commonly the case today.
The only way to bridge the gap between research needs and the availability of robust
data is to extract data directly from the electronic health records (EHRs) of emergency
departments, avoiding dedicated, time-consuming data collection. This is a difficult
task, however, because the most useful information is in free-text format (e.g., presence
of signs and symptoms, suspected and confirmed diagnoses, anamnesis). Such circumstances
and needs call for a reliable natural language processing (NLP) tool that can derive
highly consistent data from free text.
Large-scale language models capable of accurately interpreting natural language are now
available. These models are trained mostly on huge amounts of general-knowledge text
taken from the Internet, however, so their performance in more specialized areas, such as
the medical domain, may not be optimal.
The present study is part of a larger project called eCREAM (enabling Clinical Research
in Emergency and Acute-care Medicine), and aims to develop and validate a language model
(called eCREAM_LM) for six languages that can interpret the contents of emergency
department EHRs and extract relevant information for research purposes.
METHODS
The study is an observational, multicenter, retrospective study lasting 24 months. Thirty
centers will participate: 13 from Italy, 4 from Poland, 3 each from Greece, Slovakia,
Slovenia, and the United Kingdom, and 1 from Switzerland. The centers will not receive
any compensation, but their expenses will be covered by project funds.
Development and validation of eCREAM_LM
eCREAM_LM will be developed by training and fine-tuning the best-performing open-source
model, and development will proceed in partially parallel phases. Candidate models will
first be exposed to a huge amount (billions) of medical texts from the scientific
literature and other public sources. Simultaneously, the models will be exposed to a
massive amount (millions) of free-text notes obtained from the medical records in use at
participating hospitals. The investigators will then move on to fine-tuning, in which a
large set (thousands) of clinical notes, again obtained from the medical records of
participating centers, will be used. These notes will be annotated by experienced
physicians; annotation consists of extracting information from the notes to fill in the
data items listed in a virtual case report form (vCRF). The vCRF was created for a
related study and contains a set of variables useful in predicting the hospitalization
of patients with dyspnea or transient loss of consciousness, which is the objective of
that related study. In the current study, the vCRF will serve as the tool for validating
the language model.
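To make the fine-tuning step concrete, a minimal sketch in the style of a common
open-source workflow (Hugging Face Transformers) is given below. The model name, the
vCRF field names, and the hyperparameters are illustrative assumptions, not the
project's actual choices.

```python
# Minimal sketch: fine-tune a generic open-source seq2seq model to map an
# anonymized free-text note to vCRF items serialized as JSON.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)
from datasets import Dataset

MODEL = "google/mt5-small"  # hypothetical placeholder, not the project's chosen model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

# Each example pairs an anonymized note with physician-annotated vCRF items
# (the field names here are invented for illustration).
examples = [
    {"note": "Sudden shortness of breath since this morning, no chest pain.",
     "vcrf": '{"dyspnea": "yes", "transient_loss_of_consciousness": "no"}'},
]

def preprocess(batch):
    enc = tokenizer(batch["note"], truncation=True, max_length=512)
    enc["labels"] = tokenizer(text_target=batch["vcrf"], truncation=True,
                              max_length=128)["input_ids"]
    return enc

train_ds = Dataset.from_list(examples).map(
    preprocess, batched=True, remove_columns=["note", "vcrf"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="ecream_lm_ft", num_train_epochs=3,
                                  per_device_train_batch_size=8),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```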
Validation of eCREAM_LM will be carried out on a set of 1,000 clinical notes annotated
as described above but not used in the development phase. These notes will be submitted
to eCREAM_LM with the task of compiling the vCRF. The concordance between the expert
physicians and eCREAM_LM in filling in the vCRF will be the measure of final validation
of eCREAM_LM.
Data collection and anonymization
Each participating hospital will provide the free-text notes contained in the medical
records of 150,000-300,000 adult patients treated between 2021 and 2023. Notes referring
to different aspects of the same patient (e.g., history, physical examination, test
results) will be separated from each other so that it will be impossible to reconstruct
the complete profile of the patient. In addition, the notes will be stripped of any
reference to the patient (e.g., first name, last name, date of birth) and to the context
(e.g., hospital, date and time of arrival at the center). This process minimizes the
likelihood of re-identifying patients and maximizes the protection of their rights. The
likelihood of re-identifying a patient within a database depends on how distinctive his
or her characteristics are compared with those of other individuals in the database. The
likelihood of there being unique, and therefore identifiable, patients increases with
the amount of information available in the database and decreases with the database's
size. Because all personal and contextual information will be removed from the clinical
notes and each note will be separated from the others, each note will report only a few
characteristics of the patient. In addition, data collected from hospitals in the same
country will be merged so that there is one large database for each language. Together,
these measures reduce to practically zero the probability that any individual is
uniquely identifiable from the notes.
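The uniqueness argument can be illustrated with a toy example: a note is risky only when
its combination of characteristics is shared by no other note in the database. The
fields and values below are invented for illustration.

```python
# Count how many notes carry a unique combination of characteristics.
from collections import Counter

notes = [
    {"age_band": "70-79", "sex": "M", "complaint": "dyspnea"},
    {"age_band": "70-79", "sex": "M", "complaint": "dyspnea"},
    {"age_band": "20-29", "sex": "F", "complaint": "syncope"},
]

profiles = Counter(tuple(sorted(n.items())) for n in notes)
unique = sum(1 for count in profiles.values() if count == 1)
print(f"{unique} of {len(notes)} notes carry a unique profile; adding more "
      "notes with few attributes each drives this fraction toward zero.")
```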
Finally, to rule out the possibility that the notes contain information about third
parties, such as the names and phone numbers of patients' relatives, certified
anonymization software, specifically designed to remove personal data from free text,
will be installed in each hospital.
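For illustration only, the sketch below shows the kind of pattern-based redaction such a
step involves. The certified software deployed in the hospitals is a separate, dedicated
product; these regular expressions are simplified examples, and reliably removing names
would additionally require named-entity recognition.

```python
# Toy pattern-based redaction of third-party identifiers in free text.
import re

# More specific patterns first, so dates are not consumed by the phone rule.
PATTERNS = [
    (re.compile(r"\b\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4}\b"), "[DATE]"),  # simple dates
    (re.compile(r"\+?\d[\d\s./-]{7,}\d"), "[PHONE]"),                # phone-like runs
]

def redact(text: str) -> str:
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(redact("Daughter reachable at +39 333 123 4567; seen on 12/05/2023."))
```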
Once anonymized, the data will be centralized for analysis and will also be uploaded to
the major European language-resource sharing platforms used by the scientific community.
Statistical analysis
In the eCREAM_LM validation, the investigators will assess the concordance between the
expert emergency physicians and eCREAM_LM in filling in the vCRF. The data will refer
to a sample of 1,000 notes for each study language. Concordance will be assessed for
each variable of the vCRF using Cohen's κ as the measure of agreement. eCREAM_LM will
be considered valid if Cohen's κ is greater than 0.75.
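For each variable, Cohen's κ is computed as κ = (p_o − p_e) / (1 − p_e), where p_o is
the observed agreement between the two annotations and p_e is the agreement expected by
chance given their marginal distributions. A minimal sketch of the per-variable
computation follows; the labels shown are invented for illustration.

```python
# Per-variable agreement between expert annotation and model-filled vCRF.
from sklearn.metrics import cohen_kappa_score

physician = ["yes", "no", "yes", "unknown", "yes"]   # expert annotation
ecream_lm = ["yes", "no", "no", "unknown", "yes"]    # model-filled vCRF

kappa = cohen_kappa_score(physician, ecream_lm)
print(f"Cohen's kappa = {kappa:.2f} (validity threshold: 0.75)")
```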
Sample size
Assuming excellent agreement (κ=0.80) between eCREAM_LM and the experienced emergency
physicians in completing the vCRF, a sample of at least 735 notes will be necessary to
achieve sufficient precision to guarantee good agreement (lower limit of the 95%
confidence interval of Cohen's κ greater than 0.75). This number is the maximum sample
size obtained under different scenarios involving different numbers of categories (2 to
5) for each variable and different marginal distributions of the categories in the
sample, including balanced distributions (e.g., 5 categories with 20% of the sample in
each category) and very imbalanced distributions (e.g., 5 categories with 1.8%, 7.3%,
16.4%, 29.1%, and 45.5% of the sample). Since the information of interest may be missing
in some notes, the investigators will perform the validation assessment on 1,000 notes.
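The order of magnitude of this calculation can be checked by simulation. The sketch
below is an approximate illustration only, not the investigators' actual method: it
assumes a simple generative scheme for disagreements, uses a rough large-sample standard
error for κ, and takes the imbalanced five-category scenario quoted above as an example.

```python
# Monte Carlo check: with n notes and true kappa ~0.80, how often does the
# lower 95% CI bound of the estimated kappa exceed 0.75?
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

def lower_ci_bound(n, marginals, kappa=0.80):
    """Simulate one study of n notes; return the lower 95% CI bound of kappa."""
    k = len(marginals)
    p_e = float(np.sum(np.square(marginals)))   # chance agreement
    p_o = kappa * (1 - p_e) + p_e               # implied observed agreement
    phys = rng.choice(k, size=n, p=marginals)
    agree = rng.random(n) < p_o
    shift = rng.integers(1, k, size=n)          # offset for disagreeing labels
    model = np.where(agree, phys, (phys + shift) % k)
    k_hat = cohen_kappa_score(phys, model)
    se = np.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))  # rough large-sample SE
    return k_hat - 1.96 * se

marginals = np.array([0.018, 0.073, 0.164, 0.291, 0.455])
marginals = marginals / marginals.sum()         # normalize rounded percentages
runs = [lower_ci_bound(735, marginals) for _ in range(2000)]
print(f"lower bound > 0.75 in {np.mean(np.array(runs) > 0.75):.0%} of simulations")
```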