<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Novel Real-World Dataset of Italian Clinical Notes for NLP-based Decision Support in Low Back Pain Treatment</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Agnese Bonfigli</string-name>
          <email>agnese.bonfigli@unicampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Piperno</string-name>
          <email>ruben.piperno@unicampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Bacco</string-name>
          <email>l.bacco@unicampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <email>felice.dellorletta@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominique Brunato</string-name>
          <email>dominique.brunato@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Filippo Crispino</string-name>
          <email>f.crispino@unicampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Francesco Papalia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Russo</string-name>
          <email>fabrizio.russo@policlinicocampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca Vadalà</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rocco Papalia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Merone</string-name>
          <email>m.merone@unicampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leandro Pecchia</string-name>
          <email>leandro.pecchia@unicampus.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fondazione Policlinico Universitario Campus Bio-Medico</institution>
          ,
          <addr-line>Via Alvaro del Portillo 200, 00128 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ItaliaNLP Lab, Institute of Computational Linguistics “Antonio Zampolli”, National Research Council</institution>
          ,
          <addr-line>Via Giuseppe Moruzzi 1, 56124 Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Research Unit of Intelligent Health-Technologies, Department of Engineering, Università Campus Bio-Medico di Roma</institution>
          ,
          <addr-line>Via Alvaro del Portillo 21, 00128 Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Low back pain is a leading source of disability worldwide and poses a significant challenge for evidence-based clinical decision support. Because Italian-language resources covering diversified therapeutic pathways are lacking, we assembled a novel, annotated dataset comprising up to three pre-treatment documents per patient (MRI report, X-ray report, and patient visit notes), alongside demographic information (age and sex). The cohort consists of 176 patient records, stratified into three therapeutic groups: 50 conservative, 92 regenerative, and 34 surgical. The primary aim is to investigate whether the collected dataset can be used to predict which of the three treatment modalities is most appropriate. To this end, six document-combination scenarios were defined, evaluating each single-report modality as well as all possible pairings. For each scenario, two modeling strategies were contrasted: a traditional Support Vector Machine classifier leveraging TF-IDF features based on unigrams, bigrams, and trigrams, and a fine-tuned Italian BERT model adapted to our corpus. Experimental results indicate that classic n-gram-based approaches achieve the highest performance (macro-F1 up to 71.3%). The BERT model, while outperforming the baseline, encounters limitations in this low-resource scenario. These findings suggest that the present dataset has the potential to catalyze the development of Italian-language clinical decision support systems that account for the distinct signatures of treatment pathways.</p>
      </abstract>
      <kwd-group>
        <kwd>Italian Medical Corpus</kwd>
        <kwd>Decision Support Systems</kwd>
        <kwd>Clinical Natural Language Processing</kwd>
        <kwd>Treatment Prediction</kwd>
        <kwd>NLP in healthcare</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Low back pain (LBP) represents one of the most prevalent medical conditions globally, significantly impacting both</title>
        <p>Despite extensive research and clinical experience,
determining optimal treatment strategies remains
challenging due to the diverse range of available therapeutic
interventions. LBP management has been extensively studied
considering the aforementioned impacts on the
individual patient and the community. However, there is still
a gap between this information and its applications in
clinical practice, particularly in the area of detailing
conservative (non-invasive) management. As surgeries and
interventional therapies are not recommended in most
patients with acute LBP, it is important for primary care
physicians (PCPs) to know the details of non-invasive
treatment.</p>
      </sec>
      <sec id="sec-1-2">
        <p>The complexity of treatment selection is compounded
by the need to consider multiple patient-specific factors,
including clinical presentation, radiological findings, and
demographic characteristics.</p>
        <p>Electronic health records (EHRs) provide a rich source
of clinical data that can inform LBP treatment decisions,
particularly through unstructured texts such as imaging
reports (e.g., Magnetic Resonance Imaging (MRI) and X-rays)
and physician notes [<xref ref-type="bibr" rid="ref4">4, 5</xref>]. Recent advancements
in natural language processing (NLP) have demonstrated
significant potential in extracting meaningful clinical
insights from these texts, thereby supporting data-driven,
informed, and personalized decision-making in healthcare [6].
This progress has been supported by large-scale
English-language datasets, such as MIMIC-CXR [7] and
MIMIC-IV-Note [8], which provide radiology reports
related to central and lower body axial regions. However,
the development of NLP-based clinical decision support
systems for LBP is significantly limited by the lack of
annotated datasets, especially in languages other than
English. Building language-specific datasets is critical to
promoting equitable access to AI-driven healthcare
innovations [9, 10] adapted to different healthcare contexts,
like the Italian one.</p>
        <p>The primary objective of this work is to develop and
release a novel dataset of manually annotated Italian
clinical notes for low back pain management, created in
close collaboration with medical experts. This resource
addresses a significant gap in biomedical NLP for the
Italian language, where publicly available annotated datasets
are extremely limited.</p>
        <p>To demonstrate the potential of this dataset as a
valuable tool for the BioNLP community, we conduct a set of
preliminary analyses focused on the task of automated
treatment recommendation. Specifically, we compare the
performance of traditional machine learning methods
(i.e., Support Vector Machines) and Transformer models
[11] like BERT [12], with the goal of exploring how this
resource can support physicians' decisions.</p>
        <p>This work thus provides two main contributions:</p>
        <p>• The release of a new annotated dataset of
Italian clinical notes for LBP treatment, offering the
BioNLP community a much-needed resource for
conducting research in biomedical language
processing in Italian.</p>
        <p>• A preliminary comparative study designed to
evaluate the dataset’s capacity to support
different NLP techniques and modeling strategies,
thereby validating its role as a foundation for
further investigation in clinical decision support and
related tasks.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>Data Acquisition. This study is based on a
retrospective analysis of anonymized clinical records collected
during routine care for patients with LBP enrolled at
the spine clinic of the Fondazione Policlinico Campus
Bio-Medico in Rome. The dataset represents a pilot collection
curated through a rigorous manual selection process
carried out in collaboration with board-certified orthopaedic
specialists. All records were obtained prior to any
therapeutic intervention and reflect real-world clinical
decisions made during standard care.
Each case was annotated by the attending physician
responsible for the patient’s care, linking each patient
to a treatment label reflecting the therapeutic decision.
Consequently, no additional annotation was necessary.
For each patient, we selected the corresponding
pre-treatment documents, thus creating a realistic
decision-support scenario in which models are trained to predict
treatment strategies based solely on clinical text available
prior to intervention.</p>
      <p>Dataset Composition. The dataset reflects the
real-world distribution of therapeutic strategies typically
employed in orthopedic practice, clustering into three
patient groups:</p>
      <p>• Conservative. Patients managed non-invasively
through physiotherapy, pharmacological pain
control, and rehabilitative interventions designed
to restore muscular strength and joint mobility;</p>
      <p>• Regenerative. Patients treated with minimally
invasive biologic therapies, including
growth-factor injections, stem-cell preparations, or
platelet-rich plasma, aimed at promoting tissue
regeneration and functional recovery;</p>
      <p>• Surgical. Patients who underwent operative
procedures, such as spinal stabilization, to address
severe pathology or persistent symptoms
unresponsive to conservative care.</p>
      <sec id="sec-2-1">
        <p>The dataset includes a total of 176 patients, distributed
as follows: 50 conservative, 92 regenerative, and 34
surgical cases. This imbalanced distribution mirrors actual
clinical practice, where non-invasive approaches are
generally preferred over surgical interventions when
clinically appropriate.</p>
        <p>Each record consists of textual data from three
primary clinical sources: radiological reports (MRI and
X-ray) and consultation notes. MRI reports describe spinal
anatomy and pathology; X-ray reports focus on
vertebral alignment and bone structure; consultation notes
provide narrative summaries written by orthopedic
specialists during outpatient visits. Demographic variables,
including age and sex, are also available for each patient.</p>
        <p>[Figure 1: percentage distribution of MRI, X-ray, and clinical visit reports across the three treatment groups. Panels: (a) Conservative, (b) Regenerative, (c) Surgical.]</p>
        <p>An example of these reports is provided in Appendix A.
Overall, the corpus is a multi-source, domain-specific
collection that integrates radiologic descriptions with
unstructured clinical narratives of varying information
density.</p>
        <p>The detailed composition of our dataset reveals varying
distributions of textual data across treatment categories.
Specifically, Figure 1 illustrates the percentage
distribution of MRI, X-ray, and clinical visit reports across
the three groups, while Table 1 presents the average
report lengths for each category. Notably, X-ray reports
and clinical visit notes exhibit similar average lengths
across the treatment categories, while MRI reports show
a marked difference, with surgical patients having
significantly longer reports. This suggests that MRI
documentation may be particularly relevant in distinguishing
surgical from non-surgical cases in clinical practice [13, 14].
However, this hypothesis should be interpreted with caution,
given the relatively small and imbalanced nature of
the dataset, which may affect the generalizability of such
findings.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <p>The classification of clinical reports for LBP treatment
poses specific challenges due to the linguistic complexity
and domain-specific nature of medical documentation.
These texts often feature highly specialized terminology,
diverse narrative styles, and intricate links between
diagnoses and recommended therapies. To address these
challenges and to assess the suitability of our dataset,
we adopted a modeling strategy that integrates both
traditional machine learning techniques and modern deep
learning approaches.</p>
        <p>Our aim was to evaluate whether the combination of
unstructured text and demographic data provides
sufficient signal for a multiclass classification task focused on
LBP treatment decisions. The classification task involves
assigning each case to one of the three treatment classes,
reflecting typical therapeutic pathways for LBP.</p>
        <p>To explore how different modeling paradigms handle
the specificities of the Italian medical language and the
integration of heterogeneous inputs, we implemented
and compared two approaches: a Support Vector
Machine (SVM) with TF–IDF vectorization, and a
BERT-based model fine-tuned on our dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <p>We chose these two models to contrast a strong
classical method with a state-of-the-art contextual model.
A linear-kernel SVM remains highly effective for text
classification, especially on small or imbalanced clinical
datasets where lexical cues often suffice [15]. In contrast,
BERT [12] uses Transformer architectures [11] to capture
deep contextual and semantic relationships, making it
better suited for narrative clinical notes where meaning
depends heavily on context.</p>
        <p>SVM Approach. We developed a multiclass
classification pipeline based on an SVM with a linear kernel,
leveraging traditional NLP techniques to process clinical text and
predict the appropriate treatment category. The pipeline
begins with standard text pre-processing steps,
including tokenization, stop-word removal, and lemmatization,
aimed at normalizing the clinical narratives and
reducing linguistic variability [16]. For feature
representation, we applied Term Frequency–Inverse Document
Frequency (TF–IDF) vectorization using a combination of
unigrams, bigrams, and trigrams. This n-gram approach
enables the model to capture both individual medical
terms and short multi-word expressions that frequently
occur in clinical language. The TF–IDF transformation
converts the unstructured reports into structured
numerical representations by emphasizing terms that are
particularly informative within the context of the corpus.
To incorporate demographic information, patient age and
sex were appended to the TF–IDF feature vectors,
allowing the SVM to integrate both textual and structured data
in the classification process.</p>
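        <p>The pipeline described above can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation: the toy Italian snippets are invented, the pre-processing steps (stop-word removal, lemmatization) are omitted, and the encoding of age and sex in the hypothetical helper <monospace>demographic_features</monospace> (age divided by 100, sex as a binary flag) is an assumption, since the text does not specify it. The option <monospace>class_weight="balanced"</monospace> is used here as one plausible way to weight classes inversely to their frequency.</p>
        <preformat><![CDATA[
```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy, invented Italian snippets standing in for pre-processed reports.
texts = [
    "lombalgia cronica fisioterapia consigliata",
    "dolore lombare lieve terapia farmacologica",
    "discopatia lombare infiltrazione con prp",
    "discopatia degenerativa terapia rigenerativa",
    "stenosi severa indicazione chirurgica",
    "instabilita vertebrale stabilizzazione chirurgica",
]
ages = [45, 52, 60, 58, 70, 66]
sexes = ["F", "M", "F", "M", "M", "F"]
labels = ["conservative", "conservative", "regenerative",
          "regenerative", "surgical", "surgical"]


def demographic_features(ages, sexes):
    """Assumed encoding: age scaled by 1/100, sex as 1.0 for 'F' else 0.0."""
    rows = [[age / 100.0, 1.0 if sex == "F" else 0.0]
            for age, sex in zip(ages, sexes)]
    return csr_matrix(rows)


# TF-IDF over unigrams and bigrams (n-gram range [1, 2]).
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_text = vectorizer.fit_transform(texts)

# Append the two demographic columns to the sparse TF-IDF matrix.
X = hstack([X_text, demographic_features(ages, sexes)]).tocsr()

# Linear SVM with class weights inversely proportional to class frequency.
clf = LinearSVC(C=1.0, class_weight="balanced")
clf.fit(X, labels)

# Score an unseen (invented) report with the same feature layout.
new_text = ["discopatia lombare terapia rigenerativa"]
X_new = hstack([vectorizer.transform(new_text),
                demographic_features([59], ["M"])]).tocsr()
predicted = clf.predict(X_new)
```
]]></preformat>
        <p>Changing <monospace>ngram_range</monospace> to (1, 1) or (1, 3) gives the unigram and trigram variants; keeping the demographic columns in the same matrix lets a single linear model weigh lexical and structured evidence together.</p>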
      </sec>
      <sec id="sec-3-3">
        <p>BERT Approach. We developed a multiclass
classification pipeline based on the
bert-base-italian-xxl-uncased model, released on Hugging Face by the Bavarian State
Library<sup>1</sup>, fine-tuned on our dataset to capture the
semantic complexity of Italian clinical narratives. Each instance
is constructed by concatenating one or more clinical
free-text reports with patient age and sex, forming a single
input sequence. No additional feature engineering is
required, as the transformer architecture learns deep,
context-aware representations of the sequence through
self-attention mechanisms. The embedding of the [CLS]
token is passed to a classification head that outputs the
predicted treatment category via a softmax activation.</p>
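        <p>How a single input sequence might be assembled can be sketched as follows. This is only an illustration under stated assumptions: the helper <monospace>build_input</monospace>, the dictionary keys, the example sentences, and the textual template used for age and sex are all hypothetical, and tokenization, truncation, and fine-tuning are not shown.</p>
        <preformat><![CDATA[
```python
def build_input(reports, age, sex, order=("visit", "xray", "mri")):
    """Concatenate the available reports in `order`, then append demographics.

    Report types that are missing (None or empty) are skipped.
    The demographic template below is an assumed format.
    """
    parts = [reports[name].strip() for name in order if reports.get(name)]
    parts.append(f"Anni: {age}. Sesso: {sex}.")
    return " ".join(parts)


# Invented example: a patient with a visit note and an X-ray but no MRI.
patient = {
    "visit": "Lombalgia persistente da sei mesi.",
    "xray": "Riduzione dello spazio discale L4-L5.",
    "mri": None,
}

text = build_input(patient, age=57, sex="F", order=("visit", "xray"))
```
]]></preformat>
        <p>The resulting string would then be passed to the model's tokenizer; because the demographics come last, they are the first information at risk if a long concatenation is truncated from the right.</p>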
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <p>To explore the capabilities of our dataset, we
conducted a series of experiments examining how
varying combinations of clinical documents and different
feature-extraction techniques affect system performance.
Through this systematic analysis, we identified the
optimal configuration for deploying our LBP
treatment-planning decision support system in the Italian
healthcare setting, as illustrated in Figure 2.</p>
        <p><sup>1</sup>Model available at
https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased.</p>
        <sec id="sec-4-1-1">
          <title>4.1. Classification Approach</title>
          <p>• SVM (TF–IDF N-grams): We implemented an
SVM classifier and evaluated three n-gram
configurations with TF–IDF vectorization to extract
features from Italian-language LBP clinical reports:
unigrams (1-gram), bigrams (2-gram), and
trigrams (3-gram). This multilevel approach enabled
us to capture both individual medical terms and
significant multi-word expressions commonly
found in diagnostic-related documentation. The
n-gram analysis proved especially effective at
uncovering language-specific LBP diagnostic
patterns and treatment indicators in Italian medical
terminology.</p>
          <p>• BERT: Rather than relying on manual feature
engineering, we fine-tuned a pre-trained
Italian BERT model to obtain contextualized token
representations. Thanks to its multi-head
self-attention mechanism, BERT inherently models
the sequential dependencies among tokens, such
that the order of concatenated documents (e.g.,
X-ray → MRI vs. MRI → X-ray) can influence
prediction performance. For this BERT approach, we
therefore applied the full document-combination
analysis described in Section 4.2 to evaluate how
different report sequences affect model accuracy
[17, 12].</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.2. Document Combination Analysis</title>
          <p>To assess the impact of our Italian LBP dataset on model
performance, we systematically explored the following
eight input configurations, and, for each paired setup,
evaluated all possible document orders:</p>
          <p>• Single Document Decision Support:
– MRI reports
– X-ray reports
– Clinical visit notes</p>
          <p>• Paired Document Decision Support:
– MRI reports with clinical visit notes
– X-ray reports with MRI reports
– X-ray reports with clinical visit notes</p>
          <p>• Comprehensive Decision Support:
– Integration of all three document types</p>
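          <p>The document subsets and their concatenation orders can be enumerated mechanically. The sketch below is illustrative only; the names <monospace>DOCS</monospace>, <monospace>unordered</monospace>, and <monospace>ordered</monospace> are hypothetical and the short labels are chosen for readability.</p>
          <preformat><![CDATA[
```python
from itertools import combinations, permutations

DOCS = ("MRI", "X-ray", "Visit")

# Unordered configurations: which document types are given to the model.
unordered = [combo
             for size in (1, 2, 3)
             for combo in combinations(DOCS, size)]

# Ordered configurations: every concatenation order of every subset,
# relevant for models that are sensitive to document order.
ordered = ["+".join(perm)
           for size in (1, 2, 3)
           for perm in permutations(DOCS, size)]
```
]]></preformat>
          <p>The ordered enumeration contains three single documents, six ordered pairs, and six ordered triples, i.e. fifteen sequences such as <monospace>MRI+X-ray</monospace> and <monospace>X-ray+MRI</monospace>.</p>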
        </sec>
      </sec>
      <sec id="sec-4-2">
        <p>Patient demographics (age and sex) are appended as additional input information at the end of the selected document or concatenation of documents.</p>
      </sec>
      <sec id="sec-4-3">
        <p>Patient Cohort: As this study reflects the real-world
clinical scenario, not every patient in the registry
possesses the full set of imaging and clinical documents. For
each input configuration we therefore retain all patients
who have at least one of the documents in that specific
combination (e.g., any patient with an X-ray or an MRI
is included in the X-ray+MRI setting). This choice
maximizes cohort size while mirroring typical clinical
availability, where documentation completeness varies across
healthcare facilities.</p>
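        <p>The inclusion rule above (keep a patient if at least one document of the configuration is available) can be sketched as a simple filter. The record layout, the helper name <monospace>select_cohort</monospace>, and the four example patients are hypothetical.</p>
        <preformat><![CDATA[
```python
def select_cohort(patients, config):
    """Keep patients having at least one document type listed in `config`."""
    return [p for p in patients
            if any(p["docs"].get(doc) for doc in config)]


# Invented records: None marks a document that is not available.
patients = [
    {"id": "p1", "docs": {"mri": "testo", "xray": "testo", "visit": "testo"}},
    {"id": "p2", "docs": {"mri": None, "xray": "testo", "visit": None}},
    {"id": "p3", "docs": {"mri": None, "xray": None, "visit": "testo"}},
    {"id": "p4", "docs": {"mri": "testo", "xray": None, "visit": None}},
]

xray_mri_ids = [p["id"] for p in select_cohort(patients, ("xray", "mri"))]
visit_ids = [p["id"] for p in select_cohort(patients, ("visit",))]
```
]]></preformat>
        <p>Under this rule the cohort size differs between configurations, which is why scores obtained on different document combinations are not computed on identical patient sets.</p>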
        <p>This structured evaluation aimed to identify the most
informative combination of clinical documents for LBP
treatment prediction. We focused particularly on
configurations that balance predictive performance with clinical
availability, acknowledging that healthcare facilities may
have varying access to different types of diagnostic
documentation. The analysis of document combinations
proved especially relevant in LBP cases, where the
diagnostic value of imaging studies may vary based on
specific pathology presentations and resource
availability.</p>
        <sec id="sec-4-3-1">
          <title>4.3. Evaluation Protocol</title>
          <p>We performed 5-fold cross-validation for each
configuration, maintaining consistent patient splits across all
models to ensure a fair and comparable evaluation. Class
distributions were preserved within each fold to retain
the original class balance across splits. Model
performance was evaluated using the macro-averaged F1-score,
which is particularly appropriate for imbalanced classes.
All models were compared against a baseline classifier
that always predicts the majority class within each fold.
Results are reported as the mean ± standard deviation
across the five folds.</p>
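          <p>To make the baseline concrete, the macro-averaged F1-score of a majority-class predictor can be worked out directly from the class counts of the full cohort (50 conservative, 92 regenerative, 34 surgical). The sketch below is an illustration, not the evaluation code used in the study: it ignores the fold structure, and the function <monospace>macro_f1</monospace> is a plain re-implementation written for this example.</p>
          <preformat><![CDATA[
```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; a class with no hits scores 0."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Class counts of the full cohort.
y_true = ["conservative"] * 50 + ["regenerative"] * 92 + ["surgical"] * 34

# Majority-class predictor: always output the most frequent label.
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

baseline = macro_f1(y_true, y_pred)
```
]]></preformat>
          <p>Only the majority class obtains a non-zero F1 (precision 92/176, recall 1, hence 184/268), so the macro average is about 22.9%. Per-fold and per-configuration values will differ slightly because the class counts change with the patients retained.</p>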
        </sec>
        <sec id="sec-4-3-2">
          <title>4.4. Training Configuration Details</title>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>To ensure reproducibility and provide clarity on our mod</title>
        <p>eling setup, we report below all the key hyperparameters
and implementation choices for both the BERT-based
and the SVM-based experiments. All hyperparameters
reported were left at their default values in the respective
libraries, with no manual tuning.</p>
        <p>MRI
X-ray</p>
        <p>Visit
MRI+X-ray
X-ray+MRI
MRI+Visit
Visit+MRI
X-ray+Visit</p>
        <p>Visit+X-ray
MRI+X-ray+Visit
MRI+Visit+X-ray
X-ray+MRI+Visit
X-ray+Visit+MRI
Visit+MRI+X-ray
Visit+X-ray+MRI</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>SVM with TF-IDF N-grams Table 2 compares the
macro-F1 performance obtained with unigram, bigram,
and trigram TF-IDF vectors. The bigram configuration
attains the highest score, 71.34 ± 6.05%, improving upon
unigrams (68.31 ± 5.57%) and trigrams (68.83 ± 8.14%)
– Vectorization: TF–IDF with n-gram range while exceeding the majority-class baseline of 22% by
[1,  ],  ∈ {1, 2, 3} almost 50 percentage points. The advantage of bigrams
– Classifier: LinearSVC with  = 1.0, is most pronounced when the full set of reports (Visit,
class weights = inverse sample frequency X-ray, and MRI) is concatenated, indicating that short
multi–word expressions such as "discopatia lombare"
encapsulate diagnostic nuance that unigrams cannot
capture. In contrast, for single-source inputs the benefit is
attenuated: unigrams remain preferable for isolated
Xray reports (60.24% vs 54.20%), suggesting that imaging
lexicons are adequately represented by individual tokens.</p>
      <sec id="sec-5-1">
        <title>Tables 2 and 3 present the results of our preliminary</title>
        <p>experiments using SVMs with n-gram features and a
BERT-based model on various combinations of clinical
documents. These results should be interpreted not as BERT Table 3 shows the ifne-tuned
evidence of a finalized decision support system, but as bert-base-italian-xxl-uncased model
rean initial validation of the dataset’s utility in support- sults. The model reaches a maximum macro-F1 of 55.24
ing automatic classification tasks in the context of LBP ± 9.37% when the clinical visit note precedes the X-ray
treatment. To provide a meaningful reference point for report (Visit→X-ray), again outperforming the baseline
model performance, we include the results of a simple but trailing the best bigram SVM combination by
majority class predictor, which assigns all test instances roughly 16 percentage points. Performance varies with
to the most frequent class observed in the training set for document order: reversing the sequence (X-ray→Visit)
each fold. This baseline yields macro-averaged 1-scores lowers the score to 53.51 ± 7.95%, and the inclusion of
in the range of 22–30%, establishing a minimal thresh- MRI text frequently degrades results. These fluctuations
old that highlights the added value of learning-based confirm the order sensitivity anticipated in Section 4.1
and underscore that, under the limited data regime of applied to radiological reports (MRI and X-Ray). These
this study, contextual embeddings do not yet capitalise reports are typically concise, standardized, and lexically
on MRI radiological terminology as eficiently as lexical redundant, making them well-suited to models that
exfeatures. ploit explicit lexical features. SVMs, in particular, benefit
from frequent term patterns and domain-specific
collo5.2. Document Combination Analysis cations captured through n-gram vectorization.
In contrast, BERT showed stronger performance on
SVM Consistent with the experimental design of Sec- less structured, semantically dense documents such as
tion 4.2, eight input configurations were evaluated using clinical visit notes. These notes are written in natural
lanthe n-gram representation. Among single documents, the guage, often include temporal and referential elements,
clinical visit note achieves the highest macro-F1 (69.75 ± and require a deeper semantic understanding to
accu6.18% for the trigram representation), whereas the MRI rately interpret. Despite being the least represented
docureport is the only configuration that underperforms the ment type across all treatment classes, visit notes boosted
majority-class baseline, reaching just 29.71 ± 4.54%. Pair- performance when used alone or in combination with
ing X-ray with the visit note yields a substantial gain other sources. This indicates their high semantic
informato 68.18 ± 4.41%, and adding MRI further increases per- tiveness and BERT’s ability to leverage contextual cues
formance to the overall peak of 71.34 ± 6.05% for the and long-range dependencies.
bigram representation. By contrast, the combination For a sample of each report type, see Appendix A.
X-ray+MRI, which excludes the narrative Visit note, at- Interestingly, although BERT underperformed
comtains only 47.81 ± 6.04% macro-F1. This sharp drop, pared to SVM in nearly all configurations, its strengths
together with the sub-baseline score of the MRI alone, became more evident when visit notes were incorporated
underscores how indispensable free-text clinical obser- into multidocument setups. The best-performing
configvations are for diferentiating low-back pain treatments. uration among all SVM experiments was the integration
Beyond classification performance, we also sought to en- of all three document types. This reinforces the idea that
hance the interpretability of the best-performing model each source contributes distinct and valuable
informa(SVM with TF–IDF bigrams on all reports) through quali- tion: X-rays provide succinct structural summaries, MRIs
tative analysis of its learned features. Each weight reflects add detailed anatomical insights (especially relevant for
the discriminative power of a lexical bigram for a given surgical decision-making), and visit notes contribute
clintreatment class. In Appendix B, we present the most in- ical reasoning and narrative depth. The integration of
formative medical expressions associated with each class, these heterogeneous data sources allows the model to
emphasizing how specific terms are strongly linked to capture a more comprehensive clinical picture, ultimately
particular treatment decisions. improving classification accuracy.</p>
        <p>BERT was consistently outperformed by SVM across
BERT The document-level ranking mirrors that of nearly all configurations. A likely explanation lies in
the SVM but at lower absolute values. The sequence the underrepresentation of visit notes within the dataset.
Visit→X-ray tops the list (55.24 ± 9.37%), followed by Although visit notes are semantically rich, their greatest
X-ray→Visit (53.51± 7.95%) and MRI→X-ray (52.21 ± impact on classification performance becomes evident
7.54%). Configurations that concatenate all three reports when they are combined with radiological sources. One
might exceed the 512-token limit and achieve no more of the most notable findings from this dataset is that the
than 51%. Despite these constraints, every BERT vari- integration of all three document types yielded the
bestant surpasses the baseline, confirming that contextual performing configuration in all SVM experiments. This
representations contain useful decision cues even when outcome underscores the complementary nature of the
suboptimal ordering or length truncation is necessary. information encoded in these documents: X-rays
provide concise structural descriptions, MRIs ofer detailed
6. Discussion anatomical insights (especially valuable for surgical
planning), and visit notes contribute clinical reasoning and
Our comparative evaluation of traditional machine learn- contextual narrative. The fusion of these heterogeneous
ing and transformer-based approaches for classifying inputs enables the model to capture multiple dimensions
LBP treatments yields several key insights into how NLP of the clinical scenario, ultimately leading to improved
models behave across diferent types of clinical documen- classification accuracy.
tation. It should be noted that, given the real-world nature of</p>
        <p>In particular, SVM models leveraging TF–IDF represen- this dataset, not all document combinations are directly
tations consistently outperformed BERT-based models comparable due to the difering numbers of available
across multiple experimental settings, especially when documents across treatment categories. While this
vari</p>
        <p>Although MRI is routinely regarded as the most
informative examination for surgical planning in low-back
pain, its impact in our study was limited by
availability: surgical cases accounted for only 34 of 176 patients
and contained proportionally fewer MRI reports than
the other treatment groups. This scarcity translated
into weak stand-alone performance: an SVM trained
on MRI text alone fell below the majority-class
baseline (macro-F1 29.7 ± 4.5 %) and, even when coupled
with X-ray, remained inferior to the X-ray + visit-note
configuration. Clinically, these results indicate that the
proposed decision-support tool already offers actionable
triage guidance in contexts where MRI access is delayed,
while underscoring the need to enrich the dataset with
additional surgical MRIs, through prospective collection, to
reduce the risk of under-referral for patients who would
ultimately benefit from operative management.</p>
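        <p>Because macro-F1 gives every treatment class the same weight, a model that rarely predicts the surgical class is penalized even when its overall accuracy looks acceptable. The minimal sketch below illustrates this with invented toy labels (they are not drawn from the dataset); the function names are hypothetical, and the per-class F1 follows the standard precision/recall definition.</p>

```python
def f1_for_class(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct positive prediction: F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores over the true labels."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy labels (assumption, for illustration only): the classifier never
# predicts "surgical", mimicking a model starved of surgical examples.
y_true = ["conservative"] * 4 + ["regenerative"] * 3 + ["surgical"] * 3
y_pred = (["conservative"] * 4 + ["regenerative"] * 3
          + ["conservative", "conservative", "regenerative"])

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(f"accuracy = {accuracy:.3f}")                    # accuracy = 0.700
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")    # macro-F1 = 0.552
```

        <p>In this toy case the accuracy is 0.700 while the macro-F1 drops to 0.552, because the surgical class contributes an F1 of zero to the average.</p>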
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Conclusions</title>
      <p>The results of this study underscore the clinical relevance
and future potential of our curated dataset as a
foundation for developing NLP-based decision support tools
in the context of low back pain. By aligning structured
radiology reports with semantically rich clinical
narratives and treatment labels drawn from real-world care
trajectories, the dataset captures a heterogeneous and
realistic cross-section of diagnostic information, reflective
of everyday clinical reasoning.</p>
      <p>Despite its limited size, the dataset reveals meaningful
interactions between document types and model
performance. Notably, while magnetic resonance imaging is
routinely regarded as the most informative modality for
surgical planning, its impact in our study was constrained
by availability: only 34 out of 176 patients were
classified under the surgical group, and this subset contained
proportionally fewer MRI reports than the others. This
imbalance translated into weak stand-alone performance.</p>
      <p>These results suggest that the proposed dataset already
supports the development of decision-support tools
capable of offering actionable triage guidance, even in
contexts where MRI access is limited or delayed. At the
same time, the findings highlight a clear direction for
future dataset enrichment: increasing the number of
surgical MRIs, either through prospective data collection
or active-learning-guided sampling, will be essential to</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors were supported by three projects: 1) the
European Union - Next Generation EU - NRRP M6C2 -
Investment 2.1 Enhancement and strengthening of
biomedical research in the NHS, project n.
PNRR-MAD-202212376692_VADALA’ - CUP F83C22002470001. 2) the
European Union under the Horizon Europe Programme
through the Innovative Health Initiative Joint
Undertaking (IHI JU) – Project GRACE (Project number: 101194778,
Project name: bridGing gaps in caRdiAC health
managEment). 3) the European Union - Next Generation EU
NRRP M6C2 - Investment 2.1 Enhancement and
strengthening of biomedical research in the NHS - Project
PNRRMR1-2022-12376635 - “Early Detection of Rare Inherited
Retinal Dystrophies and Cardiac Amyloidosis enhanced
by Artificial Intelligence: the impact on the patient’s
pathway in Campania Region” (CUP: C83C22001540007).</p>
      <p>A. Sample Reports</p>
      <p>We present three representative reports that illustrate
distinct documentation styles: the MRI and X-ray findings
are conveyed with technical details, whereas the clinical
evaluation is presented as a concise narrative.
The imaging reports describe features such as lumbar
disc degeneration, spondylolisthesis, and preserved
vertebral alignment. In contrast, the consult note summarizes
patient history, describes symptoms, and reports
physical examination findings, before referencing the imaging
results. Thus, the narrative note provides clinical
context, while the radiological reports contribute detailed
anatomical and pathological descriptions.</p>
      <p>B. SVM’s Lexical Feature Analysis</p>
      <p>To improve the interpretability of our best-performing
SVM classifier, trained with TF–IDF bigrams on the full
set of clinical documents, we analyzed the feature weights
learned by the model from the top-performing fold of the
5-fold cross-validation. These weights indicate the
contribution of each lexical bigram to treatment classification,
highlighting expressions with clear clinical significance.</p>
      <p>We manually prioritized domain-specific expressions
(e.g., anatomical or pathological descriptors) from the top
50 lexical features (unigrams and bigrams) ranked by
coefficient value for each treatment class, over generic
tokens (e.g., grado, presenza), which, despite their assigned
weights, lack standalone diagnostic value. The most
informative medically relevant features identified by the
model for each treatment class (Conservative,
Regenerative, and Surgical) are reported in Table 5, along with
their associated weights and frequencies in the training
and test sets. Importantly, the inspected features
exhibit meaningful clinical relevance, effectively
capturing diagnostic and pathological indicators that inform
therapeutic decision-making.</p>
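      <p>The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the weight table is hypothetical (only the two surgical values 0.698 and 0.688 are quoted in this appendix, every other number is invented), a fixed list of generic tokens stands in for the manual prioritization, and the helper names are not part of any library. The second helper flags features whose weight is positive for one class and negative for another, the polarity inversion discussed later in this appendix.</p>

```python
# Hypothetical per-class feature weights of a linear classifier.
# Only 0.698 and 0.688 appear in the text; all other values are invented.
weights = {
    "surgical": {
        "spondilolistesi": 0.698,
        "stenosi": 0.688,
        "grado": 0.41,
        "presenza": 0.12,
    },
    "conservative": {
        "sostanzialmente conservati": 0.52,
        "spondilolistesi": -0.35,
        "presenza": 0.30,
    },
}

# Generic tokens named in the text; a stand-in for manual review (assumption).
GENERIC_TOKENS = {"grado", "presenza"}


def top_features(class_weights, k=50, generic=GENERIC_TOKENS):
    """Take the k highest-weighted features, then drop generic tokens."""
    ranked = sorted(class_weights.items(), key=lambda item: item[1], reverse=True)
    return [(feat, w) for feat, w in ranked[:k] if feat not in generic]


def polarity_inversions(all_weights):
    """Features with a positive weight in one class and a negative one in another."""
    features = set().union(*(w.keys() for w in all_weights.values()))
    inverted = []
    for feat in sorted(features):
        values = [w[feat] for w in all_weights.values() if feat in w]
        if max(values) > 0 and 0 > min(values):
            inverted.append(feat)
    return inverted


print(top_features(weights["surgical"]))
print(polarity_inversions(weights))
```

      <p>Ranking before filtering mirrors the order given in the text: the top 50 features are taken by coefficient value first, and generic tokens are set aside afterwards.</p>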
      <p>Specifically, conservative treatment is associated with
clinically less invasive descriptors such as sostanzial-</p>
      <p>Italian
MRI:
Sostanzialmente conservata la fisiologica lordosi lombare;
lieve deviazione sinistro-convessa del rachide lombare a
fulcro L3-L4. Discopatia degenerativa a livello L4-L5 ed
L5-S1; in particolare:</p>
      <p>• a livello L4-L5 si osserva protrusione discale ad ampio
raggio che occupa bilateralmente il pavimento dei forami
neurali e, a destra entra in contatto con il tratto
preforaminale della radice L5 destra; si associa a tale livello
alterazione dell’intensità di segnale dei contrapposti versanti
intersomatici tipo Modic 2–3.</p>
      <p>• a livello L5-S1 è presente protrusione discale ad ampio
raggio che non entra in conflitto con le radici nervose
adiacenti.</p>
      <p>Conservata la morfologia delle restanti unità
discosomatiche. Non ci sono alterazioni focali ossee nei segmenti
scheletrici esaminati. Canale vertebrale di dimensioni nella
norma. Nella norma l’intensità di segnale del cono
midollare, posizionato a livello D12. Conservato il trofismo dei
muscoli para-vertebrali al passaggio lombo-sacrale. Cisti
aracnoidee sacrali a livello S1-S2, del diametro massimo di 3
cm.</p>
      <p>X-Ray:
Sostanzialmente conservata la fisiologica lordosi lombare.</p>
      <p>Non evidenti alterazioni ossee radiograficamente
apprezzabili nei segmenti ossei in esame. Normoallineati i muri
somatici posteriori sia in proiezione LL standard che in
massima estensione; disallineamento dei muri somatici
posteriori con spondilolistesi anteriore L4-L5 di grado 1 in
massima flessione, come segno di instabilità articolare a tale
livello. Lieve riduzione in altezza dello spazio intersomatico
L4-L5, come segno di discopatia degenerativa. Tono calcico
conservato.</p>
      <p>Visit:
APR: n.d.r. APP: Il paziente riferisce lombalgia da diversi
anni, esacerbata durante attività sportiva. NRS colonna
lombosacrale 6/10. Ha praticato FKT con temporaneo beneficio.</p>
      <p>Il dolore è maggiormente lateralizzato a sinistra a livello
del rachide lombosacrale. Non episodi di sciatalgia. La
sintomatologia inficia il riposo notturno, ma non si altera con
la manovra di Valsalva. Presenta limitazione della
flesso-estensione del rachide lombosacrale. Porta in visione RMN
colonna LS (11/09/2020) che mostra discopatia L4-L5 ed
L5-S1 in presenza di alterazione degenerativo-infiammatoria
dei piatti vertebrali contrapposti e dell’osso subcondrale a
livello L4-L5 in fase acuta del tipo Modic 1. EO: Dolore in
iperestensione del rachide lombosacrale ed inclinazione
laterale. Ipercifosi dorsale. Marcata contrattura paravertebrale.</p>
      <p>Dolore all’articolazione sacro-iliaca SX. Deambulazione
possibile in taligrado e digitigrado. Lasègue bilaterale. Non
deficit di TA, EPA ed ECD. Diagnosi: Discopatia L4-L5 ed
L5-S1 in presenza di alterazione degenerativo-infiammatoria
dei piatti vertebrali contrapposti e dell’osso subcondrale a
livello L4-L5 in fase acuta del tipo Modic 1.</p>
      <p>Età: 45
Sesso: M</p>
      <p>English
MRI:
Essentially preserved physiological lumbar lordosis; slight
left-convex deviation of the lumbar spine with apex at L3-L4.</p>
      <p>Degenerative disc disease at L4-L5 and L5-S1; specifically:</p>
      <p>• at L4-L5, a broad-based disc protrusion is observed,
bilaterally occupying the floor of the neural foramina and,
on the right, contacting the preforaminal tract of the right
L5 root; associated with a mild signal intensity alteration of
the opposing endplates (Modic type 2–3).</p>
      <p>• at L5-S1, a broad-based disc protrusion is present, which
does not impinge on adjacent nerve roots.</p>
      <p>Morphology of the remaining disc–vertebral units is
preserved. No focal bone abnormalities in the examined skeletal
segments. Vertebral canal dimensions are within normal
limits. Signal intensity of the conus medullaris is normal,
positioned at D12. Paravertebral muscle trophism at the
lumbosacral junction is preserved. Sacral arachnoid cysts at
S1-S2 level, with a maximum diameter of 3 cm.</p>
      <p>X-Ray:
Essentially preserved physiological lumbar lordosis. No
radiographically appreciable bone abnormalities in the
examined osseous segments. Posterior vertebral walls are
normally aligned in both standard LL projection and
maximum extension; misalignment of the posterior vertebral
walls with Grade I anterior spondylolisthesis at L4-L5 in
maximum flexion, indicating articular instability at that
level. Mild reduction in intervertebral space height at L4-L5,
indicating degenerative disc disease. Preserved bone density.</p>
      <p>Visit:
APR: no relevant medical history recorded. APP: The
patient reports low back pain for several years, exacerbated
during sports activity. NRS lumbosacral score 6/10. He
underwent physiokinetic therapy with temporary relief. Pain
is predominantly lateralized to the left at the lumbosacral
spine. No episodes of sciatica. Symptoms disrupt sleep but
do not change with the Valsalva maneuver. Presents with
limitation of flexion-extension of the lumbosacral spine.</p>
      <p>Brings MRI of LS spine (11/09/2020) showing discopathy at
L4-L5 and L5-S1 with degenerative-inflammatory changes
of the opposing vertebral endplates and subchondral bone
at L4-L5 in acute Modic 1 phase. EO: Pain on
hyperextension of the lumbosacral spine and lateral bending. Thoracic
hyperkyphosis. Marked paravertebral muscle contracture.</p>
      <p>Pain at the left sacroiliac joint. Ambulation possible on heels
and toes. Bilateral Lasègue’s sign. No deficits in TA, EPA,
and ECD. Diagnosis: Discopathy at L4-L5 and L5-S1 with
degenerative-inflammatory changes of the opposing
vertebral endplates and subchondral bone at L4-L5 in acute
Modic 1 phase.</p>
      <p>Age: 45
Sex: M</p>
      <p>[SVM Weight by Treatment Class. With Polarity Inversion: ernia, discopatia, muri somatici, proiezioni dinamiche, spondilolistesi, stenosi. Other High-Weight Bigrams: sostanzialmente conservati, protrusione discale, antero listesi.]</p>
      <p>mente conservati and degenerazioni artrosiche. Regenerative
treatments, meanwhile, are characterized by medically
pertinent terms like muri somatici and proiezioni dinamiche.
Finally, surgical treatment features expressions
indicative of more severe pathology, including spondilolistesi
and stenosi, both frequently occurring in the training
data and receiving high positive weights (0.698 and 0.688,
respectively).</p>
      <p>Notably, our analysis highlighted polarity inversion
phenomena, whereby certain clinically relevant terms
(e.g., spondilolistesi, ernia) showed positive weights in one
treatment class and negative weights in another. This
underlines the context-sensitive nature of their clinical
interpretation.</p>
      <p>Furthermore, it is worth emphasizing that feature
frequency alone does not fully explain clinical importance:
even relatively infrequent terms can receive high model
weights if they demonstrate strong discriminative power.
For example, antero listesi appeared only 11 times in the
training set yet emerged as one of the top-ranked surgical
features, confirming the model’s capability to identify
clinically informative lexical indicators.</p>
      <p>Declaration on Generative AI</p>
    </sec>
  </body>
  <back>
  </back>
</article>