
A Novel Real-World Dataset of Italian Clinical Notes for NLP-based Decision Support in Low Back Pain Treatment

Agnese Bonfigli

agnese.bonfigli@unicampus.it 0 2 3

Ruben Piperno

ruben.piperno@unicampus.it 0 2 3

Luca Bacco

l.bacco@unicampus.it 0 2 3

Felice Dell'Orletta

felice.dellorletta@ilc.cnr.it 0 2

Dominique Brunato

dominique.brunato@ilc.cnr.it 0 2

Filippo Crispino

f.crispino@unicampus.it 0 3

Giuseppe Francesco Papalia

0 1

Fabrizio Russo

fabrizio.russo@policlinicocampus.it 0 1

Gianluca Vadalà

0 1

Rocco Papalia

0 1

Mario Merone

m.merone@unicampus.it 0 1 3

Leandro Pecchia

leandro.pecchia@unicampus.it 0 1 3

0 CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics
1 Fondazione Policlinico Universitario Campus Bio-Medico, Via Alvaro del Portillo 200, 00128 Rome, Italy
2 ItaliaNLP Lab, Institute of Computational Linguistics “Antonio Zampolli”, National Research Council, Via Giuseppe Moruzzi 1, 56124 Pisa, Italy
3 Research Unit of Intelligent Health-Technologies, Department of Engineering, Università Campus Bio-Medico di Roma, Via Alvaro del Portillo 21, 00128 Rome, Italy

2025

Low back pain represents a leading source of disability worldwide and poses a significant challenge for evidence-based clinical decision support. In contexts where Italian-language resources for diversified therapeutic pathways are lacking, we have assembled a novel, annotated dataset comprising up to three pre-treatment documents per patient (MRI report, X-ray report, and patient visit notes), alongside demographic information (age and sex). The cohort consists of 176 patient records, stratified into three therapeutic groups: 50 conservative, 92 regenerative, and 34 surgical. The primary aim is to investigate whether the collected dataset can be harnessed to predict which of the three treatment modalities is most appropriate. To this end, six document-combination scenarios were defined, evaluating each single-report modality as well as all possible pairings. For each scenario, two modeling strategies were contrasted: a traditional Support Vector Machine classifier leveraging TF-IDF features based on unigrams, bigrams, and trigrams, and a fine-tuned Italian BERT model adapted to our corpus. Experimental results indicate that classic n-gram-based approaches achieve the highest performance (macro-F1 up to 71.3%). The BERT model, while outperforming the baseline, encounters limitations in this low-resource scenario. These findings suggest that the present dataset has the potential to catalyze the development of Italian-language clinical decision support systems that account for the distinct signatures of treatment pathways.

Italian Medical Corpus, Decision Support Systems, Clinical Natural Language Processing, Treatment Prediction, NLP in healthcare

1. Introduction

Low back pain (LBP) represents one of the most prevalent medical conditions globally, significantly impacting both

Despite extensive research and clinical experience, determining optimal treatment strategies remains challenging due to the diverse range of available therapeutic interventions. LBP management has been extensively studied considering the aforementioned impacts on the individual patient and the community. However, there is still a gap between this information and its applications in clinical practice, particularly in the area of detailing conservative (non-invasive) management. As surgeries and interventional therapies are not recommended in most patients with acute LBP, it is important for primary care physicians (PCPs) to know the details of non-invasive treatment.

The complexity of treatment selection is compounded by the need to consider multiple patient-specific factors, including clinical presentation, radiological findings, and demographic characteristics.

Electronic health records (EHRs) provide a rich source of clinical data that can inform LBP treatment decisions, particularly through unstructured texts such as imaging reports (e.g., Magnetic Resonance Imaging (MRI) and X-rays) and physician notes [4, 5]. Recent advancements in natural language processing (NLP) have demonstrated significant potential in extracting meaningful clinical insights from these texts, thereby supporting data-driven, informed, and personalized decision-making in healthcare [6]. This progress has been supported by large-scale English-language datasets, such as MIMIC-CXR [7] and MIMIC-IV-Note [8], which provide radiology reports related to central and lower body axial regions. However, the development of NLP-based clinical decision support systems for LBP is significantly limited by the lack of annotated datasets, especially in languages other than English. Building language-specific datasets is critical to promoting equitable access to AI-driven healthcare innovations [9, 10] adapted to different healthcare contexts, like the Italian one.

The primary objective of this work is to develop and release a novel dataset of manually annotated Italian clinical notes for low back pain management, created in close collaboration with medical experts. This resource addresses a significant gap in biomedical NLP for the Italian language, where publicly available annotated datasets are extremely limited.

To demonstrate the potential of this dataset as a valuable tool for the BioNLP community, we conduct a set of preliminary analyses focused on the task of automated treatment recommendation. Specifically, we compare the performance of traditional machine learning methods (i.e., Support Vector Machines) and Transformer models [11] like BERT [12], with the goal of exploring how this resource can support physicians' decisions.

This work thus provides two main contributions:

• The release of a new annotated dataset of Italian clinical notes for LBP treatment, offering the BioNLP community a much-needed resource for conducting research in biomedical language processing in Italian.
• A preliminary comparative study designed to evaluate the dataset’s capacity to support different NLP techniques and modeling strategies, thereby validating its role as a foundation for further investigation in clinical decision support and related tasks.

2. Dataset

Data Acquisition This study is based on a retrospective analysis of anonymized clinical records collected during routine care for patients with LBP enrolled at the spine clinic of the Fondazione Policlinico Campus Bio-Medico in Rome. The dataset represents a pilot collection curated through a rigorous manual selection process carried out in collaboration with board-certified orthopaedic specialists. All records were obtained prior to any therapeutic intervention and reflect real-world clinical decisions made during standard care.

Each case was annotated by the attending physician responsible for the patient’s care, linking each patient to a treatment label reflecting the therapeutic decision. Consequently, no additional annotation was necessary. For each patient, we selected the corresponding pre-treatment documents, thus creating a realistic decision-support scenario in which models are trained to predict treatment strategies based solely on clinical text available prior to intervention.

Dataset Composition The dataset reflects the real-world distribution of therapeutic strategies typically employed in orthopedic practice, clustering into three patient groups:

• Conservative. Patients managed non-invasively through physiotherapy, pharmacological pain control, and rehabilitative interventions designed to restore muscular strength and joint mobility;
• Regenerative. Patients treated with minimally invasive biologic therapies, including growth-factor injections, stem-cell preparations, or platelet-rich plasma, aimed at promoting tissue regeneration and functional recovery;
• Surgical. Patients who underwent operative procedures, such as spinal stabilization, to address severe pathology or persistent symptoms unresponsive to conservative care.

The dataset includes a total of 176 patients, distributed as follows: 50 conservative, 92 regenerative, and 34 surgical cases. This imbalanced distribution mirrors actual clinical practice, where non-invasive approaches are generally preferred over surgical interventions when clinically appropriate.

Each record consists of textual data from three primary clinical sources: radiological reports (MRI and X-ray) and consultation notes. MRI reports describe spinal anatomy and pathology; X-ray reports focus on vertebral alignment and bone structure; consultation notes provide narrative summaries written by orthopedic specialists during outpatient visits. Demographic variables, including age and sex, are also available for each patient. An example of these reports is provided in Appendix A. Overall, the corpus is a multi-source, domain-specific collection that integrates radiologic descriptions with unstructured clinical narratives of varying information density.

[Figure 1: Percentage distribution of MRI, X-ray, and clinical visit reports across the three treatment groups. Panels: (a) Conservative, (b) Regenerative, (c) Surgical.]

The detailed composition of our dataset reveals varying distributions of textual data across treatment categories. Specifically, Figure 1 illustrates the percentage distribution of MRI, X-ray, and clinical visit reports across the three groups, while Table 1 presents the average report lengths for each category. Notably, X-ray reports and clinical visit notes exhibit similar average lengths across the treatment categories, while MRI reports show a marked difference, with surgical patients having significantly longer reports. This suggests that MRI documentation may be particularly relevant in distinguishing surgical from non-surgical cases in clinical practice [13, 14]. However, this hypothesis should be interpreted with caution, given the relatively small and imbalanced nature of the dataset, which may affect the generalizability of such findings.

3. Methods

The classification of clinical reports for LBP treatment poses specific challenges due to the linguistic complexity and domain-specific nature of medical documentation. These texts often feature highly specialized terminology, diverse narrative styles, and intricate links between diagnoses and recommended therapies. To address these challenges and to assess the suitability of our dataset, we adopted a modeling strategy that integrates both traditional machine learning techniques and modern deep learning approaches.

Our aim was to evaluate whether the combination of unstructured text and demographic data provides sufficient signal for a multiclass classification task focused on LBP treatment decisions. The classification task involves assigning each case to one of the three treatment classes, reflecting typical therapeutic pathways for LBP.

To explore how different modeling paradigms handle the specificities of the Italian medical language and the integration of heterogeneous inputs, we implemented and compared two approaches: a Support Vector Machine (SVM) with TF–IDF vectorization, and a BERT-based model fine-tuned on our dataset.

We chose these two models to contrast a strong classical method with a state-of-the-art contextual model. A linear-kernel SVM remains highly effective for text classification, especially on small or imbalanced clinical datasets where lexical cues often suffice [15]. In contrast, BERT [12] uses Transformer architectures [11] to capture deep contextual and semantic relationships, making it better suited for narrative clinical notes where meaning depends heavily on context.

SVM Approach We developed a multiclass classification pipeline based on an SVM with a linear kernel, leveraging traditional NLP techniques to process clinical text and predict the appropriate treatment category. The pipeline begins with standard text pre-processing steps, including tokenization, stop-word removal, and lemmatization, aimed at normalizing the clinical narratives and reducing linguistic variability [16]. For feature representation, we applied Term Frequency–Inverse Document Frequency (TF–IDF) vectorization using a combination of unigrams, bigrams, and trigrams. This n-gram approach enables the model to capture both individual medical terms and short multi-word expressions that frequently occur in clinical language. The TF–IDF transformation converts the unstructured reports into structured numerical representations by emphasizing terms that are particularly informative within the context of the corpus. To incorporate demographic information, patient age and sex were appended to the TF–IDF feature vectors, allowing the SVM to integrate both textual and structured data in the classification process.
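The feature construction described above can be sketched in a few lines of dependency-free Python. This is an illustration under stated assumptions, not the authors' implementation: the smoothed IDF formula and L2 normalisation mirror common library defaults, the whitespace tokenizer stands in for the real pre-processing, and the age scaling and sex encoding in `append_demographics` are hypothetical. A linear SVM would then be trained on the resulting vectors.

```python
import math
from collections import Counter

def ngrams(tokens, n_max):
    """All contiguous n-grams of length 1..n_max, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def tfidf_features(docs, n_max=2):
    """Return (vocabulary, rows) where each row is an L2-normalised TF-IDF vector.

    Assumption: smoothed IDF, idf = ln((1 + N) / (1 + df)) + 1; the paper
    does not state which TF-IDF variant was used.
    """
    counts = [Counter(ngrams(doc.lower().split(), n_max)) for doc in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    df = Counter(term for c in counts for term in c)  # document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for c in counts:
        row = [c[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

def append_demographics(rows, ages, sexes):
    """Append age (divided by a hypothetical maximum of 100) and a binary sex flag."""
    return [row + [age / 100.0, 1.0 if sex == "F" else 0.0]
            for row, age, sex in zip(rows, ages, sexes)]

# Toy documents, not taken from the dataset.
docs = ["discopatia lombare L4-L5", "lieve discopatia degenerativa"]
vocab, rows = tfidf_features(docs, n_max=2)
features = append_demographics(rows, ages=[54, 61], sexes=["F", "M"])
```

With `n_max=2` the vocabulary contains both single terms and bigrams such as "discopatia lombare", which is the kind of multi-word expression the n-gram representation is meant to capture.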

BERT Approach We developed a multiclass classification pipeline based on the bert-base-italian-xxl-uncased model, published on Hugging Face by the Bavarian State Library¹, fine-tuned on our dataset to capture the semantic complexity of Italian clinical narratives. Each instance is constructed by concatenating one or more clinical free-text reports with patient age and sex, forming a single input sequence. No additional feature engineering is required, as the transformer architecture learns deep, context-aware representations of the sequence through self-attention mechanisms. The embedding of the [CLS] token is passed to a classification head that outputs the predicted treatment category via a softmax activation.
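As a rough illustration of the two ends of this pipeline, the sketch below builds a single input sequence from the available reports plus demographics, and turns classification-head logits into a predicted class with a softmax. The separator string, the Italian demographic template, the label order, and the function names are all assumptions made for the example; the encoder itself is omitted.

```python
import math

# Label order is an assumption for this sketch.
LABELS = ["conservative", "regenerative", "surgical"]

def build_input(reports, order, age, sex, sep=" [SEP] "):
    """Concatenate the available reports in `order`, then append demographics.

    `sep` and the demographic template are illustrative choices only.
    """
    parts = [reports[name] for name in order if reports.get(name)]
    parts.append(f"Età: {age}. Sesso: {sex}.")
    return sep.join(parts)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(cls_logits):
    """Map classification-head logits to (label, probabilities)."""
    probs = softmax(cls_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs

# Toy patient with no MRI report; missing documents are simply skipped.
reports = {"visit": "Lombalgia cronica.", "xray": "Allineamento conservato."}
text = build_input(reports, ["visit", "xray", "mri"], age=54, sex="F")
label, probs = predict([0.2, 1.5, -0.3])
```

Because the documents are joined in the order given, swapping entries in `order` produces a different input sequence for the same patient.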

4. Experiments

To explore the capabilities of our dataset, we conducted a series of experiments examining how varying combinations of clinical documents and different feature-extraction techniques affect system performance. Through this systematic analysis, we identified the optimal configuration for deploying our LBP treatment-planning decision support system in the Italian healthcare setting, as illustrated in Figure 2.

¹ Model available at https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased.

4.1. Classification Approach

• SVM (TF–IDF N-grams): We implemented an SVM classifier and evaluated three n-gram configurations with TF-IDF vectorization to extract features from Italian-language LBP clinical reports: unigrams (1-gram), bigrams (2-gram), and trigrams (3-gram). This multilevel approach enabled us to capture both individual medical terms and significant multi-word expressions commonly found in diagnostic-related documentation. The n-gram analysis proved especially effective at uncovering language-specific LBP diagnostic patterns and treatment indicators in Italian medical terminology.
• BERT: Rather than relying on manual feature engineering, we fine-tuned a pre-trained Italian BERT model to obtain contextualized token representations. Thanks to its multi-head self-attention mechanism, BERT inherently models the sequential dependencies among tokens, such that the order of concatenated documents (e.g., X-ray → MRI vs. MRI → X-ray) can influence prediction performance. For this BERT approach, we therefore applied the full document-combination analysis described in Section 4.2 to evaluate how different report sequences affect model accuracy [17, 12].

4.2. Document Combination Analysis

To assess the impact of our Italian LBP dataset on model performance, we systematically explored the following eight input configurations and, for each paired setup, evaluated all possible document orders:

• Single Document Decision Support:
  – MRI reports
  – X-ray reports
  – Clinical visit notes
• Paired Document Decision Support:
  – MRI reports with clinical visit notes
  – X-ray reports with MRI reports
  – X-ray reports with clinical visit notes
• Comprehensive Decision Support:
  – Integration of all three document types
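When document order matters, as it does for the concatenated BERT inputs, the ordered variants of these configurations can be enumerated mechanically. The snippet below is a small sketch of that enumeration; the short names in `DOCS` and the "+"-joined labels are naming choices made for the example.

```python
from itertools import permutations

# Short names for the three document types (naming is an assumption).
DOCS = ("MRI", "X-ray", "Visit")

def ordered_configurations(docs=DOCS):
    """Every ordered sequence of 1, 2, or 3 distinct document types."""
    configs = []
    for size in range(1, len(docs) + 1):
        configs.extend(permutations(docs, size))
    return configs

configs = ordered_configurations()
labels = ["+".join(c) for c in configs]
print(len(labels), labels)
```

With three document types this gives 3 single-document inputs, 6 ordered pairs, and 6 ordered triples, so that "MRI+X-ray" and "X-ray+MRI" are treated as separate inputs.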

Patient demographics (age and sex) are appended as additional input information at the end of the selected document or concatenation of documents.

Patient Cohort: As this study reflects the real-world clinical scenario, not every patient in the registry possesses the full set of imaging and clinical documents. For each input configuration we therefore retain all patients who have at least one of the documents in that specific combination (e.g., any patient with an X-ray or an MRI is included in the X-ray+MRI setting). This choice maximizes cohort size while mirroring typical clinical availability, where documentation completeness varies across healthcare facilities.
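The inclusion rule can be expressed as a one-line filter. The sketch below assumes a hypothetical record layout (a dict with an `id` and a `reports` mapping); the actual storage format of the dataset is not described in the text.

```python
def select_cohort(patients, combination):
    """Keep patients with at least one non-empty document in `combination`."""
    return [p for p in patients
            if any(p["reports"].get(doc) for doc in combination)]

# Toy records with placeholder text; the field names are assumptions.
patients = [
    {"id": "p1", "reports": {"MRI": "testo referto", "X-ray": "testo referto"}},
    {"id": "p2", "reports": {"Visit": "testo visita"}},
    {"id": "p3", "reports": {"X-ray": "testo referto", "MRI": ""}},
]

cohort = select_cohort(patients, ("X-ray", "MRI"))
cohort_ids = [p["id"] for p in cohort]
```

Here `p2` is excluded from the X-ray+MRI setting because it has neither document, while `p3` is kept on the strength of its X-ray report alone. One consequence of this rule is that cohort size differs from one configuration to the next.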

This structured evaluation aimed to identify the most informative combination of clinical documents for LBP treatment prediction. We focused particularly on configurations that balance predictive performance with clinical availability, acknowledging that healthcare facilities may have varying access to different types of diagnostic documentation. The analysis of document combinations proved especially relevant in LBP cases, where the diagnostic value of imaging studies may vary based on specific pathology presentations and resource availability.

4.3. Evaluation Protocol

We performed 5-fold cross-validation for each configuration, maintaining consistent patient splits across all models to ensure a fair and comparable evaluation. Class distributions were preserved within each fold to retain the original class proportions across splits. Model performance was evaluated using the macro-averaged F1-score, which is particularly appropriate for imbalanced classes. All models were compared against a baseline classifier that always predicts the majority class within each fold. Results are reported as the mean ± standard deviation across the five folds.
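To make the protocol concrete, the following sketch builds stratified folds for the reported class counts (50 conservative, 92 regenerative, 34 surgical) and scores the majority-class baseline with macro-averaged F1. It is a minimal stand-in, not the authors' code: the round-robin fold assignment, the seed, and the use of the sample standard deviation are assumptions.

```python
import random
from collections import Counter, defaultdict
from statistics import mean, stdev

def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds class by class (round-robin)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1; a class with no TP, FP, or FN scores 0."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Class counts as reported for the dataset.
labels = ["conservative"] * 50 + ["regenerative"] * 92 + ["surgical"] * 34
classes = sorted(set(labels))
folds = stratified_folds(labels)

fold_scores = []
for test_idx in folds:
    held_out = set(test_idx)
    train = [labels[i] for i in range(len(labels)) if i not in held_out]
    majority = Counter(train).most_common(1)[0][0]  # most frequent training class
    y_true = [labels[i] for i in test_idx]
    fold_scores.append(macro_f1(y_true, [majority] * len(y_true), classes))

print(f"baseline macro-F1: {mean(fold_scores):.3f} ± {stdev(fold_scores):.3f}")
```

Because the baseline only ever predicts one class, two of the three per-class F1 scores are zero, which is why its macro-F1 stays low even though the majority class covers more than half of the patients.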

4.4. Training Configuration Details

To ensure reproducibility and provide clarity on our modeling setup, we report below all the key hyperparameters and implementation choices for both the BERT-based and the SVM-based experiments. All hyperparameters reported were left at their default values in the respective libraries, with no manual tuning.

• SVM:
  – Vectorization: TF–IDF with n-gram range [1, n], n ∈ {1, 2, 3}
  – Classifier: LinearSVC with C = 1.0, class weights = inverse sample frequency

[Table row labels (input configurations; the scores are not recoverable from the source): MRI, X-ray, Visit, MRI+X-ray, X-ray+MRI, MRI+Visit, Visit+MRI, X-ray+Visit, Visit+X-ray, MRI+X-ray+Visit, MRI+Visit+X-ray, X-ray+MRI+Visit, X-ray+Visit+MRI, Visit+MRI+X-ray, Visit+X-ray+MRI]

5. Results

Tables 2 and 3 present the results of our preliminary experiments using SVMs with n-gram features and a BERT-based model on various combinations of clinical documents. These results should be interpreted not as evidence of a finalized decision support system, but as an initial validation of the dataset’s utility in supporting automatic classification tasks in the context of LBP treatment. To provide a meaningful reference point for model performance, we include the results of a simple majority class predictor, which assigns all test instances to the most frequent class observed in the training set for each fold. This baseline yields macro-averaged F1-scores in the range of 22–30%, establishing a minimal threshold that highlights the added value of learning-based approaches.

SVM with TF-IDF N-grams Table 2 compares the macro-F1 performance obtained with unigram, bigram, and trigram TF-IDF vectors. The bigram configuration attains the highest score, 71.34 ± 6.05%, improving upon unigrams (68.31 ± 5.57%) and trigrams (68.83 ± 8.14%) while exceeding the majority-class baseline of 22% by almost 50 percentage points. The advantage of bigrams is most pronounced when the full set of reports (Visit, X-ray, and MRI) is concatenated, indicating that short multi-word expressions such as "discopatia lombare" encapsulate diagnostic nuance that unigrams cannot capture. In contrast, for single-source inputs the benefit is attenuated: unigrams remain preferable for isolated X-ray reports (60.24% vs 54.20%), suggesting that imaging lexicons are adequately represented by individual tokens.

BERT Table 3 shows the fine-tuned bert-base-italian-xxl-uncased model results. The model reaches a maximum macro-F1 of 55.24 ± 9.37% when the clinical visit note precedes the X-ray report (Visit→X-ray), again outperforming the baseline but trailing the best bigram SVM combination by roughly 16 percentage points. Performance varies with document order: reversing the sequence (X-ray→Visit) lowers the score to 53.51 ± 7.95%, and the inclusion of MRI text frequently degrades results. These fluctuations confirm the order sensitivity anticipated in Section 4.1 and underscore that, under the limited data regime of this study, contextual embeddings do not yet capitalise on MRI radiological terminology as efficiently as lexical features.

5.2. Document Combination Analysis

SVM Consistent with the experimental design of Section 4.2, eight input configurations were evaluated using the n-gram representation. Among single documents, the clinical visit note achieves the highest macro-F1 (69.75 ± 6.18% for the trigram representation), whereas the MRI report is the only configuration that underperforms the majority-class baseline, reaching just 29.71 ± 4.54%. Pairing X-ray with the visit note yields a substantial gain to 68.18 ± 4.41%, and adding MRI further increases performance to the overall peak of 71.34 ± 6.05% for the bigram representation. By contrast, the combination X-ray+MRI, which excludes the narrative Visit note, attains only 47.81 ± 6.04% macro-F1. This sharp drop, together with the sub-baseline score of the MRI alone, underscores how indispensable free-text clinical observations are for differentiating low-back pain treatments. Beyond classification performance, we also sought to enhance the interpretability of the best-performing model (SVM with TF–IDF bigrams on all reports) through qualitative analysis of its learned features. Each weight reflects the discriminative power of a lexical bigram for a given treatment class. In Appendix B, we present the most informative medical expressions associated with each class, emphasizing how specific terms are strongly linked to particular treatment decisions.

BERT The document-level ranking mirrors that of the SVM but at lower absolute values. The sequence Visit→X-ray tops the list (55.24 ± 9.37%), followed by X-ray→Visit (53.51 ± 7.95%) and MRI→X-ray (52.21 ± 7.54%). Configurations that concatenate all three reports might exceed the 512-token limit and achieve no more than 51%. Despite these constraints, every BERT variant surpasses the baseline, confirming that contextual representations contain useful decision cues even when suboptimal ordering or length truncation is necessary.

6. Discussion

Our comparative evaluation of traditional machine learning and transformer-based approaches for classifying LBP treatments yields several key insights into how NLP models behave across different types of clinical documentation.

In particular, SVM models leveraging TF–IDF representations consistently outperformed BERT-based models across multiple experimental settings, especially when applied to radiological reports (MRI and X-ray). These reports are typically concise, standardized, and lexically redundant, making them well-suited to models that exploit explicit lexical features. SVMs, in particular, benefit from frequent term patterns and domain-specific collocations captured through n-gram vectorization.

In contrast, BERT showed stronger performance on less structured, semantically dense documents such as clinical visit notes. These notes are written in natural language, often include temporal and referential elements, and require a deeper semantic understanding to accurately interpret. Despite being the least represented document type across all treatment classes, visit notes boosted performance when used alone or in combination with other sources. This indicates their high semantic informativeness and BERT’s ability to leverage contextual cues and long-range dependencies. For a sample of each report type, see Appendix A.

Interestingly, although BERT underperformed compared to SVM in nearly all configurations, its strengths became more evident when visit notes were incorporated into multidocument setups. The best-performing configuration among all SVM experiments was the integration of all three document types. This reinforces the idea that each source contributes distinct and valuable information: X-rays provide succinct structural summaries, MRIs add detailed anatomical insights (especially relevant for surgical decision-making), and visit notes contribute clinical reasoning and narrative depth. The integration of these heterogeneous data sources allows the model to capture a more comprehensive clinical picture, ultimately improving classification accuracy.

BERT was consistently outperformed by SVM across nearly all configurations. A likely explanation lies in the underrepresentation of visit notes within the dataset. Although visit notes are semantically rich, their greatest impact on classification performance becomes evident when they are combined with radiological sources. One of the most notable findings from this dataset is that the integration of all three document types yielded the best-performing configuration in all SVM experiments. This outcome underscores the complementary nature of the information encoded in these documents: X-rays provide concise structural descriptions, MRIs offer detailed anatomical insights (especially valuable for surgical planning), and visit notes contribute clinical reasoning and contextual narrative. The fusion of these heterogeneous inputs enables the model to capture multiple dimensions of the clinical scenario, ultimately leading to improved classification accuracy.

It should be noted that, given the real-world nature of this dataset, not all document combinations are directly comparable due to the differing numbers of available documents across treatment categories. While this vari

Although MRI is routinely regarded as the most informative examination for surgical planning in low-back pain, its impact in our study was limited by availability: surgical cases accounted for only 34 of 176 patients and contained proportionally fewer MRI reports than the other treatment groups. This scarcity translated into weak stand-alone performance: an SVM trained on MRI text alone fell below the majority-class baseline (macro-F1 29.7 ± 4.5%) and, even when coupled with X-ray, remained inferior to the X-ray + visit-note configuration. Clinically, these results indicate that the proposed decision-support tool already offers actionable triage guidance in contexts where MRI access is delayed, while underscoring the need to enrich the dataset with additional surgical MRIs, through prospective collection, to reduce the risk of under-referral for patients who would ultimately benefit from operative management.

7. Conclusions

The results of this study underscore the clinical relevance and future potential of our curated dataset as a foundation for developing NLP-based decision support tools in the context of low back pain. By aligning structured radiology reports with semantically rich clinical narratives and treatment labels drawn from real-world care trajectories, the dataset captures a heterogeneous and realistic cross-section of diagnostic information, reflective of everyday clinical reasoning.

Despite its limited size, the dataset reveals meaningful interactions between document types and model performance. Notably, while magnetic resonance imaging is routinely regarded as the most informative modality for surgical planning, its impact in our study was constrained by availability: only 34 out of 176 patients were classified under the surgical group, and this subset contained proportionally fewer MRI reports than the others. This imbalance translated into weak stand-alone performance.

These results suggest that the proposed dataset already supports the development of decision-support tools capable of offering actionable triage guidance, even in contexts where MRI access is limited or delayed. At the same time, the findings highlight a clear direction for future dataset enrichment: increasing the number of surgical MRIs, either through prospective data collection or active-learning-guided sampling, will be essential to

Acknowledgments

Authors were supported by three projects: 1) the European Union - Next Generation EU - NRRP M6C2 - Investment 2.1 Enhancement and strengthening of biomedical research in the NHS, project n. PNRR-MAD-202212376692_VADALA’ - CUP F83C22002470001. 2) the European Union under the Horizon Europe Programme through the Innovative Health Initiative Joint Undertaking (IHI JU) – Project GRACE (Project number: 101194778, Project name: bridGing gaps in caRdiAC health managEment). 3) the European Union - Next Generation EU NRRP M6C2 - Investment 2.1 Enhancement and strengthening of biomedical research in the NHS - Project PNRRMR1- 2022-12376635 - ”Early Detection of Rare Inherited Retinal Dystrophies and Cardiac Amyloidosis enhanced by Artificial Intelligence: the impact on the patient’s pathway in Campania Region” (CUP: C83C22001540007)

References

[4] … healthcare system, Healthcare informatics research 25 (2019) 1–2.
[5] J. Liang, Y. Li, Z. Zhang, D. Shen, J. Xu, X. Zheng, T. Wang, B. Tang, J. Lei, J. Zhang, Adoption of electronic health records (ehrs) in china during the past 10 years: consecutive survey data analysis and comparison of sino-american challenges and experiences, Journal of medical Internet research 23 (2021) e24813.
[6] L. Bacco, F. Russo, L. Ambrosio, F. D’Antoni, L. Vollero, G. Vadalà, F. Dell’Orletta, M. Merone, R. Papalia, V. Denaro, Natural language processing in low back pain and spine diseases: a systematic review, Frontiers in Surgery 9 (2022) 957085.
[7] A. E. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, R. G. Mark, S. Horng, Mimic-cxr: A large publicly available database of labeled chest radiographs, 2019. URL: https://physionet.org/content/mimic-cxr/2.0.0/. doi:10.13026/cr8q-rw49, rRID:SCR_007345.
[8] A. Johnson, T. Pollard, S. Horng, L. A. Celi, R. Mark, Mimic-iv-note: Deidentified free-text clinical notes (version 2.2), PhysioNet (2023). URL: https://doi.org/10.13026/1n74-ne17. doi:10.13026/1n74-ne17, rRID:SCR_007345.
[9] F. A. Matsuoka, H. N. Onaga, Classifying domains, benchmarking gpt-4, a portuguese dataset for medical ai q&a, bioRxiv (2024) 2024–12.
[10] V. Basile, C. Bosco, M. Fell, V. Patti, R. Varvara, et al., Italian nlp for everyone: Resources and models from evalita to the european language grid, in: 2022 Language Resources and Evaluation Conference, LREC 2022, European Language Resources Association (ELRA), 2022, pp. 174–180.
[11] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
[12] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186.
[13] N. Sheehan, Magnetic resonance imaging for low back pain: indications and limitations, Postgraduate medical journal 86 (2010) 374–378.
[14] R. U. Din, X. Cheng, H. Yang, Diagnostic role of magnetic resonance imaging in low back pain caused by vertebral endplate degeneration, Journal of Magnetic Resonance Imaging 55 (2022) 755–771.
[15] L. Bacco, A. Cimino, L. Paulon, M. Merone, F. Dell’Orletta, A machine learning approach for sentiment analysis for italian reviews in healthcare, in: CEUR Workshop Proceedings, volume 2769, CEUR-WS, 2020.
[16] R. Catelli, F. Gargiulo, V. Casola, G. De Pietro, H. Fujita, M. Esposito, A novel covid-19 data set and an effective deep learning approach for the de-identification of italian medical records, Ieee Access 9 (2021) 19097–19110.
[17] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune bert for text classification?, in: China national conference on Chinese computational linguistics, Springer, 2019, pp. 194–206.

A. Sample Reports

We present three representative reports that illustrate distinct documentation styles: the MRI and X-ray findings are conveyed with technical details, whereas the clinical evaluation is presented as a concise narrative.

The imaging reports describe features such as lumbar disc degeneration, spondylolisthesis, and preserved vertebral alignment. In contrast, the consult note summarizes patient history, describes symptoms, and reports physical examination findings, before referencing the imaging results. Thus, the narrative note provides clinical context, while the radiological reports contribute detailed anatomical and pathological descriptions.

Italian MRI: Sostanzialmente conservata la fisiologica lordosi lombare; lieve deviazione sinistro-convessa del rachide lombare a fulcro L3-L4. Discopatia degenerativa a livello L4-L5 ed L5S1; in particolare:

B. SVM’s Lexical Feature Analysis

To improve the interpretability of our best-performing SVM classifier, trained with TF–IDF bigrams on the full set of clinical documents, we analyzed the feature weights learned by the model from the top-performing fold of the 5-fold cross-validation. These weights indicate the contribution of each lexical bigram to treatment classification, highlighting expressions with clear clinical significance.

We manually prioritized domain-specific expressions (e.g., anatomical or pathological descriptors) from the top 50 lexical features (unigrams and bigrams) ranked by coefficient value for each treatment class, over generic tokens (e.g., grado, presenza), which, despite their assigned weights, lack standalone diagnostic value. The most informative medically relevant features identified by the model for each treatment class (Conservative, Regenerative, and Surgical) are reported in Table 5, along with their associated weights and frequencies in the training and test sets. Importantly, the selected inspected features exhibit meaningful clinical relevance, effectively capturing diagnostic and pathological indicators that inform therapeutic decision-making.

Specifically, conservative treatment is associated with clinically less invasive descriptors such as sostanzial…

• a livello L4-L5 si osserva protrusione discale ad ampio raggio che occupa bilateralmente il pavimento dei forami neurali e, a destra entra in contatto con il tratto preforaminale della radice L5 destra; si associa a tale livello alterazione dell’intensità di segnale dei contrapposti versanti intersomatici tipo Modic 2–3.

• a livello L5-S1 è presente protrusione discale ad ampio raggio che non entra in conflitto con le radici nervose adiacenti.

Conservata la morfologia delle restanti unità discosomatiche. Non ci sono alterazioni focali ossee nei segmenti scheletrici esaminati. Canale vertebrale di dimensioni nella norma. Nella norma l’intensità di segnale del cono midollare, posizionato a livello D12. Conservato il trofismo dei muscoli para-vertebrali al passaggio lombo-sacrale. Cisti aracnoidee sacrali a livello S1-S2, del diametro massimo di 3 cm.

X-Ray: Sostanzialmente conservata la fisiologica lordosi lombare.

Non evidenti alterazioni ossee radiograficamente apprezzabili nei segmenti ossei in esame. Normoallineati i muri somatici posteriori sia in proiezione LL standard che in massima estensione; disallineamento dei muri somatici posteriori con spondilolistesi anteriore L4-L5 di grado 1 in massima flessione, come segno di instabilità articolare a tale livello. Lieve riduzione in altezza dello spazio intersomatico L4-L5, come segno di discopatia degenerativa. Tono calcico conservato.

Visit: APR: n.d.r. APP: Il paziente riferisce lombalgia da diversi anni, esacerbata durante attività sportiva. NRS colonna lombosacrale 6/10. Ha praticato FKT con temporaneo beneficio.

Il dolore è maggiormente lateralizzato a sinistra a livello del rachide lombosacrale. Non episodi di sciatalgia. La sintomatologia inficia il riposo notturno, ma non si altera con la manovra di Valsalva. Presenta limitazione della flessoestensione del rachide lombosacrale. Porta in visione RMN colonna LS (11/09/2020) che mostra discopatia L4-L5 ed L5-S1 in presenza di alterazione degenerativo-infiammatoria dei piatti vertebrali contrapposti e dell’osso subcondrale a livello L4-L5 in fase acuta del tipo Modic 1. EO: Dolore in iperestensione del rachide lombosacrale ed inclinazione laterale. Ipercifosi dorsale. Marcata contrattura paravertebrale.

Dolore all’articolazione sacro-iliaca SX. Deambulazione possibile in taligrado e digitigrado. Lasègue bilaterale. Non deficit di TA, EPA ed ECD. Diagnosi: Discopatia L4-L5 ed L5-S1 in presenza di alterazione degenerativo-infiammatoria dei piatti vertebrali contrapposti e dell’osso subcondrale a livello L4-L5 in fase acuta del tipo Modic 1.

Età: 45 Sesso: M

English MRI: Essentially preserved physiological lumbar lordosis; slight left-convex deviation of the lumbar spine with apex at L3-L4.

Degenerative disc disease at L4-L5 and L5-S1; specifically:

• at L4-L5, a broad-based disc protrusion is observed, bilaterally occupying the floor of the neural foramina and, on the right, contacting the preforaminal tract of the right L5 root; associated with a mild signal intensity alteration of the opposing endplates (Modic type 2–3).

• at L5-S1, a broad-based disc protrusion is present, which does not impinge on adjacent nerve roots.

Morphology of the remaining disc–vertebral units is preserved. No focal bone abnormalities in the examined skeletal segments. Vertebral canal dimensions are within normal limits. Signal intensity of the conus medullaris is normal, positioned at D12. Paravertebral muscle trophism at the lumbosacral junction is preserved. Sacral arachnoid cysts at S1-S2 level, with a maximum diameter of 3 cm.

X-Ray: Essentially preserved physiological lumbar lordosis. No radiographically appreciable bone abnormalities in the examined osseous segments. Posterior vertebral walls are normally aligned in both standard LL projection and maximum extension; misalignment of the posterior vertebral walls with Grade I anterior spondylolisthesis at L4-L5 in maximum flexion, indicating articular instability at that level. Mild reduction in intervertebral space height at L4-L5, indicating degenerative disc disease. Preserved bone density.

Visit: APR: no relevant medical history recorded. APP: The patient reports low back pain for several years, exacerbated during sports activity. NRS lumbosacral score 6/10. He underwent physiokinetic therapy with temporary relief. Pain is predominantly lateralized to the left at the lumbosacral spine. No episodes of sciatica. Symptoms disrupt sleep but do not change with the Valsalva maneuver. Presents with limitation of flexion-extension of the lumbosacral spine.

Brings MRI of LS spine (11/09/2020) showing discopathy at L4-L5 and L5-S1 with degenerative-inflammatory changes of the opposing vertebral endplates and subchondral bone at L4-L5 in acute Modic 1 phase. EO: Pain on hyperextension of the lumbosacral spine and lateral bending. Thoracic hyperkyphosis. Marked paravertebral muscle contracture.

Pain at the left sacroiliac joint. Ambulation possible on heels and toes. Bilateral Lasègue’s sign. No deficits in TA, EPA, and ECD. Diagnosis: Discopathy at L4-L5 and L5-S1 with degenerative-inflammatory changes of the opposing vertebral endplates and subchondral bone at L4-L5 in acute Modic 1 phase.

Age: 45 Sex: M With Polarity Inversion ernia discopatia muri somatici proiezioni dinamiche spondilolistesi stenosi Other High-Weight Bigrams sostanzialmente conservati protrusione discale antero listesi

SVM Weight by Treatment Class mente conservati and degenerazioni artrosiche. Regen- treatment class and negative weights in another. This erative treatments, meanwhile, are characterized by med- underlines the context-sensitive nature of their clinical ically pertinent terms like muri somatici and proiezioni di- interpretation. namiche. Finally, surgical treatment features expressions Furthermore, it is worth emphasizing that feature freindicative of more severe pathology, including spondilolis- quency alone does not fully explain clinical importance: tesi and stenosi, both frequently occurring in the training even relatively infrequent terms can receive high model data and receiving high positive weights (0.698 and 0.688, weights if they demonstrate strong discriminative power. respectively). For example, antero listesi appeared only 11 times in the

Notably, our analysis highlighted polarity inversion training set yet emerged as one of the top-ranked surgical phenomena, whereby certain clinically relevant terms features, confirming the model’s capability to identify (e.g., spondilolistesi, ernia) showed positive weights in one clinically informative lexical indicators. Declaration on Generative AI
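The weight inspection described in this appendix can be illustrated with a minimal sketch. It assumes the per-class weights of a linear one-vs-rest classifier are already available as plain lists (for instance, exported from a linear SVM's coefficient matrix); the names `top_features`, `polarity_inversions`, `WEIGHTS`, and `GENERIC` are hypothetical and not taken from the paper. Only the two Surgical weights 0.698 and 0.688 come from the text above; all other numbers are invented for illustration, and this is not the authors' implementation.

```python
from typing import Dict, FrozenSet, List, Tuple

# Class order used for the columns of every weight vector below.
CLASSES: List[str] = ["Conservative", "Regenerative", "Surgical"]

# Hypothetical per-class weights (one value per class, in CLASSES order).
# Only 0.698 and 0.688 are reported in the text; the rest are made up.
WEIGHTS: Dict[str, List[float]] = {
    "spondilolistesi": [-0.310, -0.120, 0.698],
    "stenosi": [-0.250, 0.040, 0.688],
    "muri somatici": [-0.200, 0.450, -0.100],
    "sostanzialmente conservati": [0.520, 0.110, 0.030],
    "grado": [0.300, 0.250, 0.400],
}

# Generic tokens named in the text as lacking standalone diagnostic value.
GENERIC: FrozenSet[str] = frozenset({"grado", "presenza"})


def top_features(
    weights: Dict[str, List[float]],
    classes: List[str],
    k: int,
    skip: FrozenSet[str] = frozenset(),
) -> Dict[str, List[Tuple[str, float]]]:
    """Return, for each class, the k features with the largest weight."""
    ranked: Dict[str, List[Tuple[str, float]]] = {}
    for idx, label in enumerate(classes):
        candidates = [
            (feature, values[idx])
            for feature, values in weights.items()
            if feature not in skip
        ]
        # Sort by weight, highest first, and keep the first k entries.
        candidates.sort(key=lambda item: item[1], reverse=True)
        ranked[label] = candidates[:k]
    return ranked


def polarity_inversions(weights: Dict[str, List[float]]) -> List[str]:
    """List features that are positive for one class and negative for another."""
    return sorted(
        feature
        for feature, values in weights.items()
        if max(values) > 0 and min(values) < 0
    )


if __name__ == "__main__":
    for label, features in top_features(WEIGHTS, CLASSES, 2, GENERIC).items():
        print(label, features)
    print("Polarity inversion:", polarity_inversions(WEIGHTS))
```

Filtering generic tokens before ranking mirrors the manual prioritization step only loosely: in the appendix that selection was done by hand, whereas the sketch uses a fixed exclusion set.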

[1] Wu, March, Zheng, Huang, Wang, Zhao, F. M. Blyth, Smith, Buchbinder, Hoy, Global low back pain prevalence and years lived with disability from 1990 to 2017: estimates from the global burden of disease study 2017, Annals of translational medicine 8 (2020) 299.

[2] Zhou, Salman, A. H. McGregor, Recent clinical practice guidelines for the management of low back pain: a global comparison, BMC musculoskeletal disorders 25 (2024) 344.

[3] Airaksinen, J. I. Brox, Cedraschi, Hildebrandt, Klaber-Moffett, Kovacs, A. F. Mannion, Reis, Staal, Ursin, et al., European guidelines for the management of chronic nonspecific low back pain, European spine journal 15 (2006) s192.

[4] H.-J. Kong, Managing unstructured big data in