On the Diagnosis and Characterisation of Prostate Cancer in Pathology Reports in Spanish

Rosa M. Montañés-Salas¹, Sergio Gracia-Borobia¹, María de la Vega Rodrigálvarez-Chamarro¹, Ángel Borque-Fernando², Patricia A. Guerrero-Ochoa³, Alejandro Camón-Fernández³, Jorge Alfaro-Torres⁴, Isabel Marquina-Ibáñez⁴, Sofia Hakim-Alonso⁴, Luis M. Esteban⁵ and Rafael del-Hoyo-Alonso¹

¹ Aragon Institute of Technology (ITA), María de Luna, 7–8, 50018 Zaragoza, Spain
² Department of Urology, Miguel Servet University Hospital (GIIS071-uro-servet), 50009 Zaragoza, Spain
³ Health Research Institute of Aragon Foundation (GIIS071-uro-servet), 50009 Zaragoza, Spain
⁴ Department of Pathology, Miguel Servet University Hospital (GIIS071-uro-servet), 50009 Zaragoza, Spain
⁵ Department of Applied Mathematics, Escuela Universitaria Politécnica de La Almunia, Universidad de Zaragoza, 50100 Zaragoza, Spain

SEPLN-2024: 40th Conference of the Spanish Society for Natural Language Processing. Valladolid, Spain. 24-27 September 2024.
Email: rmontanes@ita.es (R. M. Montañés-Salas); sgracia@ita.es (S. Gracia-Borobia); vrodrigalvarez@ita.es (M. d. l. V. Rodrigálvarez-Chamarro); aborque@salud.aragon.es (Á. Borque-Fernando); pguerrero@iisaragon.es (P. A. Guerrero-Ochoa); acamon@iisaragon.es (A. Camón-Fernández); jalfaro@salud.aragon.es (J. Alfaro-Torres); imarquina@salud.aragon.es (I. Marquina-Ibáñez); shakim@salud.aragon.es (S. Hakim-Alonso); lmeste@unizar.es (L. M. Esteban); rdelhoyo@ita.es (R. del-Hoyo-Alonso)
ORCID: 0000-0003-4636-5868 (R. M. Montañés-Salas); 0009-0005-4863-8550 (S. Gracia-Borobia); 0000-0003-1393-8260 (M. d. l. V. Rodrigálvarez-Chamarro); 0000-0003-0178-4567 (Á. Borque-Fernando); 0000-0002-1657-4792 (P. A. Guerrero-Ochoa); 0000-0002-3007-302X (L. M. Esteban); 0000-0003-2755-5500 (R. del-Hoyo-Alonso)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
Prostate cancer is a prevalent disease worldwide, with early diagnosis enabling better prognosis. Natural language processing (NLP) techniques show promise in extracting information from electronic health records to support clinical decision-making. This paper presents an NLP approach to detect and characterise prostate cancer (PCa) diagnoses from Spanish pathology reports. A combination of lexical-morphological analysis, rule-based techniques and transformer models is used to identify PCa, Gleason scores, procedures, organs and other markers. The system achieves close to 96% agreement with expert annotation in detecting cancer diagnoses.

Keywords
Medical Natural Language Processing, Prostate Cancer, Pathology Reports, Information Extraction

1. Introduction

Prostate cancer (PCa) was the fourth most diagnosed cancer worldwide in 2022. In Spain, between 33,000 and 34,000 new cases were reported in 2023, with a 5-year prevalence of more than 140,000 cases, making it the leading cancer in terms of incidence among the male population, as reported by the Spanish Cancer Association in 2023¹. Early detection of PCa enables treatment at an initial stage, resulting in higher cure rates and reduced side effects of aggressive treatments, as well as lower healthcare costs. In this context, the use of advanced Artificial Intelligence (AI) techniques presents itself as a promising tool to support clinical decision-making systems and predictive model research [1]. Specifically, Natural Language Processing can play a decisive role by facilitating information retrieval from non-structured sources and opening the range of processing techniques [2].

The acquisition of large volumes of patient data for research purposes has become feasible today, thanks to the digitization of healthcare systems and systematic recording of medical procedures. However, there are still limitations stemming from the non-uniformity of information systems, the diverse repositories for clinical analyses, radiological or pathological reports, or other Electronic Health Records (EHRs), and the assessment and follow-up of patients conducted by different healthcare professionals. Therefore, the sources and data are significantly heterogeneous, including a substantial amount of textual information containing valuable clinical knowledge provided by experts in the field, which allows for precise and accurate diagnoses. Moreover, privacy concerns have to be taken into account when dealing with sensitive patient data [3].

The research presented here is part of the AI4HealthyAging project, whose mission is to leverage distributed AI technologies for early diagnosis and treatment of diseases that are highly prevalent in the ageing population. Within this project, several work packages are organized around various diseases such as Parkinson's disease, sarcopenia, deafness or cancer. The use case related to the diagnostic management of prevalent cancers in the elderly is focused on prostate and colon cancers. In particular, the overarching goal for prostate cancer is to develop decision support and risk interpretation tools based on the biological footprint present in patient EHRs, histological preparations, and radiological images, thus enhancing diagnosis through the use of hybrid data.

In this article, we present our experience in analysing, extracting and structuring information using Natural Language Processing techniques applied to pathology reports of prostate cancer in Spanish. The main challenge is the unavailability of a truly reliable and medically consistent labeled dataset in Spanish from which to incorporate new clinical features for the development of advanced hybrid predictive models. Therefore, the first stages of this project consisted of developing a working methodology to retrieve relevant data and implementing an NLP-based system that makes it possible to efficiently detect and characterise cancer diagnoses based on pathologists' reports. The proposed approach has facilitated the compilation of a comprehensive and validated dataset, providing both a clear final diagnosis and several features of interest, for the development of explainable risk prediction and prognosis machine learning models fueled by multiple information sources.

The paper is organised as follows: this introduction has outlined the research setting and is followed by a review of related work. Section 3 covers the materials and methods underpinning the system's development. The results obtained are then discussed, and the paper closes with the conclusions of the work and prospective areas for future development and research.

¹ Observatory of the Spanish Cancer Association (AECC) (2023). Dynamic report: Prostate cancer | AECC Observatory. Available at https://observatorio.contraelcancer.es/informes/informe-dinamico-cancer-de-prostata. Last accessed: 2024-04-03.

2. Related work

Natural Language Processing applied to the biomedical domain has witnessed significant growth and innovation in recent years, driven by the increasing availability of large-scale healthcare data and advancements in NLP techniques. Electronic health records have emerged as a significant source of information for the detection and diagnosis of several diseases, including cancer. With the incorporation of rich textual data, natural language processing has been extensively applied to these records. [4] and [5] systematically reviewed the applications of NLP in sifting through EHRs, highlighting its potential in detecting chronic diseases and signs of various cancer types, respectively, and emphasizing the challenges of data heterogeneity and the significance of domain-specific annotations. This is also shown in [6] and [7], where information was extracted from free-text pathology reports related to breast and lung cancer and to colorectal cancer, respectively, both using expert annotated textual data.

In the realm of prostate cancer, a niche yet rapidly growing area of research focuses on leveraging NLP techniques for diagnostic purposes. DiBello et al. [8] demonstrated that NLP can accurately identify metastatic PCa by searching unstructured text in medical records such as pathology, radiology and clinic notes. Thomas et al. [9] validated an NLP program to accurately identify patients with prostate cancer and extract relevant information from pathology reports. Some approaches also underscore the critical role of domain-specific knowledge in curating and understanding the specialized terminology present in such reports [10]. Additionally, the extraction of such domain-specific knowledge helps improve unimodal models within PCa diagnosis: Morote et al. [11] assessed the ability of microscopic findings in prostate biopsies to improve the prediction of clinically significant prostate cancer using numerical-only data, and Khosravi et al. [12] developed an AI based model for PCa diagnosis using magnetic resonance images labelled with manually-assigned histopathology information.

A great variety of text-related machine learning models have seen application in this domain: Breischneider et al. [13] developed an unsupervised rule-based ontology system for feature extraction in free-form text obtained from clinical reports. Yoon et al. [14] went further and demonstrated how graph neural networks can be trained with in-context textual data for multitask cancer labeling. Also, transformer-based models, renowned for their prowess in NLP tasks, have been introduced into the biomedical field. ClinicalBioBERT [15] and OncoBERT [16] showcased the utility of BERT and its variants in comprehending medical narratives, aiming at identifying signs of cancers. Their findings illuminate the potential of fine-tuning such models with domain-specific data to boost diagnostic performance.

Despite these advancements, challenges persist in achieving optimal performance for the detection and diagnosis of diseases, particularly due to the inherent noise and variability in EHRs and other medical reports. Moreover, multilingualism is still an open issue: although multiple efforts are being made to develop annotation standards and AI systems in Spanish (Seda et al. [17]; Miranda-Escalada et al. [18]; Solarte-Pabón et al. [19]), the scope of research in the oncology field is still limited.

3. Proposed approach

This section outlines the proposed approach for the development of the clinical report analysis system. The system aims to extract relevant information from medical documents and classify them according to their diagnosis or the requirements proposed. It proposes a working methodology that can be customized to different types of medical text reports, in which, starting from basic resources in the form of a textual corpus and elementary terminology, different natural language processing strategies are applied semi-automatically to characterise the data set and obtain implicit information useful for feeding other learning systems.

The materials and methods presented here have been developed by three independent teams in order to ensure data privacy² and broad applicability of the system. First, the team responsible for accessing the healthcare databases retrieves two sets of data. Thereafter, the text set is analysed by the natural language processing team and its results are combined with the medical data modelling team's set for validation purposes.

² Following the Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of individuals with regard to the processing of personal data and on the free movement of such data (GDPR).

3.1. Materials

The initial study population in this work consists of all males affiliated with the Zaragoza II healthcare sector (392,177 individuals) who had at least one Prostate-Specific Antigen (PSA) determination in the last 5 years, i.e. in the interval between 2017 and 2022. A total of 92,171 patients met these criteria. From this initial population, those with at least one biopsy performed, and consequently with at least one pathology report available in the system, were selected, resulting in a total of 10,563 patients of interest. All available data in the hospital systems responsible for performing these procedures were retrospectively collected from 1999 onwards, ultimately retrieving a total of 24,981 textual pathology reports.

3.1.1. Data preparation

The team responsible for accessing healthcare databases and retrieving reports works independently of the other teams to ensure the confidentiality and protection of sensitive data. This team has conducted a triple pseudonymization process upon the pathology reports, compiling a document database for natural language processing with two content sections: the macroscopic description and the diagnosis section. The macroscopic description provides concrete details regarding the tissue removal procedure, while in the diagnosis section, pathologists determine the findings in the extracted tissue samples and provide detailed information they deem relevant for patient monitoring. Both the “diagnosis” and “macro” sections are considered for the detection and characterisation of PCa (referred to as “corpus” in figure 1).

Along with this base corpus, a lightweight thesaurus was built from the exploration of PCa-related concepts in standard ontologies and classification schemas such as SNOMED [20] and ICD10-codes [21] in their Spanish versions. This resource has been constructed by the NLP team with the guidance of the medical staff at the Zaragoza II healthcare sector in charge of reporting prostate cancer, with the aim of simplifying the external knowledge integration, implementing an easily adaptable tool and adjusting the linguistic analysis to real-world usage by reproducing the communicative register employed in those healthcare reports, according to the College of American Pathologists (CAP) guidelines, updated every 6 months by the specialist pathologist staff of the Department of Pathology. This dictionary is designed as a simple hierarchy in which each concept or characteristic to be extracted is paired with a carefully refined list of expressions drawn from the standards mentioned above, comprising the expert domain knowledge base of the system.

3.1.2. Validation set

To assess the results of the analysis system, an independent dataset from morphology study repositories has been retrieved under the same assumptions as the textual data corresponding to the pathology reports, covering 10,297 patients with 22,628 cases. These studies contain numerical and categorical data, particularly a summary label assigned by clinical staff during patient exams, following the SNOMED nomenclature. A preliminary analysis of these labels revealed a high dimensionality and variability of the assigned classes. Additionally, excessively specific categories were used, making it difficult to study PCa diagnosis adequately. This diversity can be attributed to the complexity of the standard used on a day-to-day basis and the subjectivity of human criteria for assignment.

An overview of the aforementioned resources is depicted in figure 1.

Figure 1: Data sources schema.

3.2. Methods

In the design of the clinical record information extraction and labelling system depicted in figure 2, we aimed to follow an iterative yet simple methodology: based on the set of language resources outlined in the previous section (see 3.1) and the application of different biomedical natural language processing techniques, a comprehensive structured dataset is built and refined in conjunction with the in-domain knowledge. Both dataset information and expert knowledge are easily customized according to the specific needs of subsequent machine learning models or expert requirements. In our use case, a structured dataset of prostate cancer diagnosis features is built and then validated by an independent team, performing a limited number of feedback iterations to improve the overall performance and guarantee the generalization and customization capabilities of the system.

Figure 2: System schema.

3.2.1. Analysis and tagging

The preliminary analysis phase consists of automatically building a glossary of categorized terms and expressions that are morphologically and semantically very close to the base in-domain knowledge provided by the medical staff. Therefore, a lightweight customized thesaurus is constructed through similarity searches over the corpus, along with a set of approximate regular morphological data patterns usually found in cancer reports. These terms and expressions facilitate the further identification of information patterns and data in the textual records, aiding in the determination of diagnoses and their context.

From this, the available pathology reports are processed and both the prostate cancer diagnosis, in terms of positive or negative presence, and a number of additional features of interest are inferred, building a rich dataset that can feed other AI multimodal systems. To accomplish the detection of diagnoses and the extraction of significant variables employing the compiled lexicon and patterns, various strategies are integrated into a three-step approach:

Firstly, a content-based filtering is applied, discarding the empty or non-informative reports (e.g. many documents consist of only “VER B” or similar texts).

The second step is based on a mixture of lexical-morphological analysis supported by fuzzy matching and the integration of a pre-trained language model based on transformers. In this case, we utilize biomedical language models in the Spanish language, specifically the RoBERTa-base biomedical model that has already been fine-tuned for the Named Entity Recognition (NER) task on the Cantemist dataset for tumour morphology extraction by Carrino et al. [22]. The joint objective is to extract cancer-related terms and expressions from the text reports, which represent the relevant features pursued.

The third step consists of applying rule-based techniques designed for the extraction of objective features. On the one hand, a simple set of grammatical rules has been applied, i.e. occurrence of certain constituents in the sentences, negation detection and comparison of lexical-morphological analysis with NER output. On the other hand, domain-specific rules have been designed: a priority system between labels has been established for the PCa diagnosis, ranking positive cases by severity level and treating non-PCa findings as less critical; specific rules to compute the final Gleason degrees and groups when different values are retrieved; and warning rules to analyse whether the values extracted are incoherent.

The detailed characteristics retrieved through the described NLP techniques on the pathology reports of prostate cancer are the following:

• Prostate Cancer diagnosis: a binary output (PCa+, PCa-, for positive or negative Prostate Cancer diagnosis, respectively) that includes an additional label for uncertain cases in which the analysis and rule processing yield conflicting or empty results (identified as Untagged). The latter serves to indicate the need for a thorough review by an expert.
• Gleason Score (Sum and Group): the Gleason score, often referred to as the Gleason grading system [23], is a numerical scale used in the field of pathology to assess the severity and aggressiveness of prostate cancer glands under a microscope. It can be expressed in several ways, generally consisting of two numbers: a primary and a secondary cancer pattern, i.e. the most common pattern of cancer glands seen, and the next most common pattern, respectively (tertiary is also extracted, but it is rarely mentioned). The sum of the primary and secondary patterns along with their corresponding group are identified and conveniently computed. Whenever the Gleason score is mentioned multiple times in a document (e.g. in a pathology report describing a biopsy with a list of inspected cylinders) all the possible components are extracted, but the Gleason score associated with that document is the most severe among the multiple results. The pathology reports analysed in this research show certain variability due to the evolution of the Gleason grading system in the time range considered, as they have been written following the recommendations and updates of the International Society of Urological Pathology (ISUP).
• Type of Medical Procedure: each of the reports analysed corresponds to a specific procedure conducted on the patient. The considered procedures are: Biopsy, Prostatectomy, Cystoprostatectomy, Adenomectomy and TURP (Transurethral Resection of the Prostate). An additional Untagged label is considered in case none of the above could be detected. This result is treated as a multilabel output.
• Organ mentions: given that the different medical procedures can affect several areas of the organism, the organs mentioned in each document are also extracted. The considered organs are: Prostate, Seminal Vesicles, Lymph Nodes and Bladder. This result is treated as a multilabel output.
• Other informative markers: mentions of ASAP (Atypical Small Acinar Proliferation); mentions of PIN (Prostatic Intraepithelial Neoplasia); mentions of inflammatory processes and atrophies; TNM stage: the standard classification for cancer staging, which refers to Tumour, Nodes and Metastasis [24]; a “DUC” label related to mentions of ductal carcinoma; and neoplastic morphology mentions extracted with the NER model.

3.2.2. Validation

The modelling team, in collaboration with health professionals, found that the SNOMED labels assigned in the validation set (see section 3.1.2) are not very accurate for the actual diagnosis of PCa. They have independently made an adjustment to this set of labels, reducing it to a three-category classification (existence or non-existence of PCa plus an ‘untagged’ label) through semi-automated mapping and subsequent human validation by two experts.

The intersection of the text corpus and the morphology data over patient studies consists of 10,249 patients and a total of 17,696 studies on which it is possible to compare and validate PCa tagging results at document level. The inter-annotator agreement (IAA) is computed between the NLP tags and the postprocessed human-assigned SNOMED labels.

4. Results

The results reported in this section correspond to the final results obtained after three cycles of compilation, execution and validation of information extraction and classification over the available pathology reports. The base outcome of the system presented is the structured dataset of prostate cancer diagnoses from highly unstructured free-text data. In conjunction with this resource, it has been possible to validate a simple yet effective method for the extraction of relevant information from these types of documents with a very promising overall performance.

Annex A contains a complete example of one of the analysed documents, corresponding to the following textual fragment from the diagnosis section. Table 1 below includes the most relevant features extracted from the whole document. All the characteristics extracted are specified in table 4 in the annex.

BIOPSIA DE PRÓSTATA TRANSPERINEAL, [...] - GRADO DE GLEASON: 7 (3+4) - GRADO GRUPO: 2 [...] PATOLOGÍA ADICIONAL PROSTÁTICA: NO SE OBSERVA.

Table 1
Extracted information from example document

Extracted information        Value
cancer tag                   Cancer
doc tag                      Biopsia
primary gleason pattern      3
secondary gleason pattern    4
gleason sum                  7
gleason group                2

The distribution of cancer presence in the population studied through pathology reports is depicted in figure 3. According to the experts, it aligns with the typical average distribution of PCa diagnoses in the specific region under study with the characteristics described in the materials section, leaving about 2% of the records (approximately 370 reports) to be reviewed by doctors. ASAP, PIN, inflammation, and atrophy are all pathologies related to the possible development of prostate cancer. However, for diagnostic purposes, medical experts consider them as negative cases of PCa.

Table 2
Gleason sum distribution.

Gleason sum    % records
4              0.3
5              0.5
6              52.7
7              32.8
8              6.9
9              6.0
10             0.8

Figure 4: Gleason groups distribution.

Figure 5: Procedures identified.
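The Gleason rules described in section 3.2.1 (extracting the primary and secondary patterns, computing the sum and group, keeping the most severe result when a report mentions several scores, and warning about incoherent values) can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the regular expression, the `GleasonMention` class and the `extract_gleason` and `most_severe` helpers are hypothetical names introduced here for illustration, and the mapping from patterns to group is assumed to follow the standard ISUP grade groups (sum up to 6 is group 1, 3+4 is group 2, 4+3 is group 3, sum 8 is group 4, sums 9 and 10 are group 5).

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical pattern: "GLEASON", a short non-digit gap, the stated sum,
# then "(primary+secondary)", e.g. "GRADO DE GLEASON: 7 (3+4)".
GLEASON_RE = re.compile(
    r"GLEASON\D{0,20}?(\d{1,2})\s*\(\s*([1-5])\s*\+\s*([1-5])\s*\)",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class GleasonMention:
    primary: int
    secondary: int
    stated_sum: int

    @property
    def total(self) -> int:
        """Sum of the primary and secondary patterns."""
        return self.primary + self.secondary

    @property
    def coherent(self) -> bool:
        """False when the sum written in the report does not match the patterns."""
        return self.stated_sum == self.total

    @property
    def grade_group(self) -> int:
        """ISUP grade group derived from the patterns (assumed mapping)."""
        if self.total <= 6:
            return 1
        if self.total == 7:
            # 3+4 and 4+3 share the same sum but belong to different groups.
            return 2 if self.primary <= 3 else 3
        if self.total == 8:
            return 4
        return 5


def extract_gleason(text: str) -> List[GleasonMention]:
    """Return every 'sum (primary+secondary)' Gleason mention found in the text."""
    return [
        GleasonMention(primary=int(p), secondary=int(s), stated_sum=int(total))
        for total, p, s in GLEASON_RE.findall(text)
    ]


def most_severe(mentions: List[GleasonMention]) -> Optional[GleasonMention]:
    """Pick the document-level score: highest group, then sum, then primary pattern."""
    if not mentions:
        return None
    return max(mentions, key=lambda m: (m.grade_group, m.total, m.primary))


if __name__ == "__main__":
    report = "BIOPSIA DE PRÓSTATA TRANSPERINEAL, [...] - GRADO DE GLEASON: 7 (3+4) - GRADO GRUPO: 2"
    worst = most_severe(extract_gleason(report))
    if worst is not None:
        print(worst.primary, worst.secondary, worst.total, worst.grade_group, worst.coherent)
```

Applied to the fragment of the example document, the sketch yields primary pattern 3, secondary pattern 4, sum 7 and group 2, in line with table 1. A real report would need a broader set of patterns (tertiary pattern, scores written without parentheses, negated mentions), which this sketch deliberately leaves out.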
Figure 3: Distribution of cancer diagnosis.

The distributions of Gleason score sum and group retrieved for positive PCa cases are also reproduced in table 2 and figure 4.

Regarding the compiled corpus of documents, it was found that the inclusion of the macroscopic section aided in extracting features related to procedures. The distribution of procedures found is depicted in figure 5. Figure 6 displays the organs directly related to prostate cancer mentioned in the textual corpus. These mentions are valuable for characterising and filtering data in downstream tasks.

Figure 6: PCa organs identified.

Tumour morphology entities extracted with the NER model were not found to be helpful at this level of language analysis, given the techniques integrated within the system. Neither prostate cancer detection nor Gleason score extraction improved, as these mentions, when correctly detected, co-occurred with specific prostate domain expressions already matched in the lexical-morphological analysis. The transformer-based model was excluded from the process in order to reduce the system's runtime.

Finally, to ensure the system's correct performance, prostate cancer diagnoses from pathology reports were validated against the validation set, i.e. the manually-annotated morphology dataset, using the predefined tags outlined in section 3.2: PCa+, PCa- and Untagged. Although the morphology coding was initially not entirely consistent, after the reviewing cycles it was found to be a useful baseline and provided valuable support for the detection of complex cases. These cases were then personally reviewed by medical experts, thus enhancing our methodology and allowing for a thorough evaluation of the obtained results.

Table 3
Agreement on cancer diagnosis.

NLP/Expert         PCa+     PCa-     Untagged
PCa+               5433     648      1
PCa-               12       11566    11
Untagged           1        18       6
Total agreement    17005

As shown in table 3, with respect to the evaluable pathology reports coinciding in both datasets, a final 96.095% agreement between NLP-based classification and human labelling was reached, which corresponds to a weighted Cohen's Kappa of 0.9115. During the validation cycles, our automatic analysis and processing was found to be more robust than the manual annotation registered in the hospital repositories, and both were iteratively improved. Gleason scores, medical procedures, organ mentions and the rest of the additional markers retrieved also underwent a cursory validation by human experts, as there was no corresponding information conveniently categorised in the clinical repositories or in the previous work studied. The distribution of data obtained corresponds to the actual distribution of cases treated in the time period analysed, as determined by the vast experience of the clinical staff involved.

5. Conclusions and future work

In this paper we have presented our approach to retrieving, classifying and structuring information using a combination of NLP techniques over prostate cancer pathology reports in Spanish from the records belonging to the Health Sector Zaragoza II. The methodology designed and the system implemented set the starting point for the development of a global and homogeneous cancer risk prediction system based on the biological footprint of patients distributed in the different electronic health records available in the hospital repositories, supporting and improving the efficiency of decision support tools for early disease detection processes. We have built an extensive structured dataset that serves to enhance the predictive capacity and explainability of advanced predictive models for PCa risk identification, such as multimodal algorithms that work with biological data and images, and to enrich hospital repositories.

Additionally, the validation performed along with the clinical experts yielded very encouraging results, with close to 96% annotator agreement. Nevertheless, a comprehensive and rigorous validation of the remaining characteristics is still required, despite their initial alignment with the expectations and needs of the experts.

The developed system has been successfully adapted and applied to a colorectal cancer scenario within this project, by simply defining an in-domain thesaurus. This has allowed pathology reports to be processed and evaluated, and the approach to be subsequently extended to colonoscopies and EHRs, serving as an effective system to annotate customized datasets for further research.

Several avenues are considered as future work. From the NLP perspective, we plan to improve the techniques explored, delving into higher levels of language analysis such as semantics, and exploring the new possibilities offered by leading-edge Large Language Models (LLMs). Furthermore, for the use case under study, the methodology will be applied to the magnetic resonance reports of prostate cancer patients as well, with the necessary adaptations for that specific context, in order to further enhance the dataset constructed. Lastly, our objective is to extrapolate the analysis and processing to all types of reports with textual content to build a comprehensive system that allows for the identification of a patient's biological footprint and its influence on the final diagnosis.

Ethical Statement

This study and the use of patient data were approved by the regional ethics committee of Aragón.

Acknowledgments

This research was funded by project MIA.2021.M02.0007 of the NextGenerationEU program and by the Integration and Development of Big Data and Electrical Systems (IODIDE) group of the Aragon Government program.

References

[1] A. A. Rabaan, M. A. Bakhrebah, H. AlSaihati, S. Alhumaid, R. A. Alsubki, S. A. Turkistani, S. Al-Abdulhadi, Y. Aldawood, A. A. Alsaleh, Y. N. Alhashem, J. A. Almatouq, A. A. Alqatari, H. E. Alahmed, D. A. Sharbini, A. F. Alahmadi, F. Alsalman, A. Alsayyah, A. A. Mutair, Artificial intelligence for clinical diagnosis and treatment of prostate cancer, Cancers 14 (2022) 5595. URL: http://dx.doi.org/10.3390/cancers14225595. doi:10.3390/cancers14225595.
[2] Y.-H. Chuang, J.-H. Su, D.-H. Han, Y.-W. Liao, Y.-C. Lee, Y.-F. Cheng, T.-P. Hong, K. S.-M. Li, H.-Y. Ou, Y. Lu, C.-C. Wang, Effective natural language processing and interpretable machine learning for structuring ct liver-tumor reports, IEEE Access 10 (2022) 116273–116286. URL: http://dx.doi.org/10.1109/ACCESS.2022.3218646. doi:10.1109/access.2022.3218646.
[3] H. R. Abdulshaheed, S. A. Mohammed Al-Juboori, I. A. Al Sayed, I. A. Barazanchi, H. M. Gheni, Z. A. Jaaz, Research on optimization strategy of medical data information security and privacy, in: 2022 9th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), IEEE, 2022. URL: http://dx.doi.org/10.23919/EECSI56542.2022.9946606. doi:10.23919/eecsi56542.2022.9946606.
[4] E. H. Houssein, R. E. Mohamed, A. A. Ali, Machine learning techniques for biomedical natural language processing: a comprehensive review, IEEE Access 9 (2021) 140628–140653.
[5] C. Li, Y. Zhang, Y. Weng, B. Wang, Z. Li, Natural language processing applications for computer-aided diagnosis in oncology, Diagnostics 13 (2023). URL: https://www.mdpi.com/2075-4418/13/2/286. doi:10.3390/diagnostics13020286.
[6] M. Alawad, H.-J. Yoon, G. D. Tourassi, Coarse-to-fine multi-task training of convolutional neural networks for automated information extraction from cancer pathology reports, in: 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), IEEE, 2018. URL: http://dx.doi.org/10.1109/BHI.2018.8333408. doi:10.1109/bhi.2018.8333408.
[7] D. Martinez, Y. Li, Information extraction from pathology reports in a hospital setting, in: Proceedings of the 20th ACM international conference on Information and knowledge management, CIKM ’11, ACM, 2011. URL: http://dx.doi.org/10.1145/2063576.2063846. doi:10.1145/2063576.2063846.
[8] J. DiBello, B. H. Li, C. Zheng, W. Yu, S. Weinmann, K. E. Richert-Boe, D. P. Ritzwoller, S. K. Vandeneeden, S. J. Jacobsen, Development of an algorithm to identify metastatic prostate cancer in electronic medical records using natural language processing, Journal of Clinical Oncology: Official Journal of the American Society of Clinical Oncology 32 (30_suppl) (2014) 164. URL: https://api.semanticscholar.org/CorpusID:25873512.
[9] A. Thomas, C. Zheng, H. Jung, A. Chang, B. J. Kim, J. Gelfond, J. Slezak, K. R. Porter, S. J. Jacobsen, G. W. Chien, Extracting data from electronic medical records: validation of a natural language processing program to assess prostate biopsy results, World Journal of Urology 32 (2014) 99–103. URL: https://api.semanticscholar.org/CorpusID:8917027.
[10] O. Hamzeh, L. Rueda, A gene-disease-based machine
P. Zisimopoulos, A. Sigaras, M. Brendel, J. Barnes, C. Ricketts, D. Meleshko, A. Yat, T. D. McClure, B. D. Robinson, A. Sboner, O. Elemento, B. Chughtai, I. Hajirasouliha, A deep learning approach to diagnostic classification of prostate cancer using pathology–radiology fusion, Journal of Magnetic Resonance Imaging 54 (2021) 462–471. URL: http://dx.doi.org/10.1002/jmri.27599. doi:10.1002/jmri.27599.
[13] C. Breischneider, S. Zillner, M. Hammon, P. Gass, D. Sonntag, Automatic extraction of breast cancer information from clinical reports, in: 2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2017. URL: http://dx.doi.org/10.1109/CBMS.2017.138. doi:10.1109/cbms.2017.138.
[14] H.-J. Yoon, J. Gounley, M. T. Young, G. Tourassi, Information extraction from cancer pathology reports with graph convolution networks for natural language texts, in: 2019 IEEE International Conference on Big Data (Big Data), IEEE, 2019. URL: http://dx.doi.org/10.1109/BigData47090.2019.9006270. doi:10.1109/bigdata47090.2019.9006270.
[15] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, M. McDermott, Publicly available clinical bert embeddings, arXiv preprint arXiv:1904.03323 (2019).
[16] H. Lin, J. Ginart, W. Chen, Y. Interian, H. Gong, B. Liu, T. Upadhaya, J. Lupo, J. Hong, S. Braunstein, Oncobert: Building an interpretable transfer learning bidirectional encoder representations from transformers framework for longitudinal survival prediction of cancer patients, 2023. doi:10.21203/rs.3.rs-3158152/v1.
[17] S. S. Seda, F. d. P. P. León, J. M. Conde, M. C. G. Ruiz, J. M. Sánchez, G. Rodríguez, J. A. P. Simón, C. L. P. Calderón, Plataforma para la extracción automática y codificación de conceptos dentro del ámbito de la oncohematología (proyecto coco), Procesamiento del Lenguaje Natural 61 (2018) 65–71.
[18] A. Miranda-Escalada, E. Farré, M. Krallinger, Named entity recognition, concept normalization and clinical coding: Overview of the cantemist track for cancer text mining in spanish, corpus, guidelines, methods and results, IberLEF@SEPLN (2020) 303–323.
[19] O. Solarte-Pabón, O. Montenegro, A. García-Barragán, M. Torrente, M. Provencio, E. Menasalvas, V. Robles, Transformers for extracting breast cancer information from spanish clinical narratives, Artificial Intelligence in Medicine 143 (2023) 102625. URL: https://www.sciencedirect.com/science/article/pii/S0933365723001392. doi:10.1016/j.artmed.2023.102625.
[20] M. M.
Van Berkum, Snomed ct® encoded cancer pro- learning approach to identify prostate cancer biomark- tocols, in: Amia Annual Symposium Proceedings, vol- ers, in: Proceedings of the 10th ACM International ume 2003, American Medical Informatics Association, Conference on Bioinformatics, Computational Biology 2003, p. 1039. and Health Informatics, 2019, pp. 633–638. [21] W. H. Organization, Icd-10 : international statistical [11] J. Morote, I. Schwartzman, A. Borque, L. M. Esteban, classification of diseases and related health problems : A. Celma, S. Roche, I. M. de Torres, R. Mast, M. E. tenth revision, 2004. Semidey, L. Regis, et al., Prediction of clinically sig- [22] C. P. Carrino, J. Llop, M. Pàmies, A. Gutiérrez-Fandiño, nificant prostate cancer after negative prostate biopsy: J. Armengol-Estapé, J. Silveira-Ocampo, A. Valencia, The current value of microscopic findings, in: Uro- A. Gonzalez-Agirre, M. Villegas, Pretrained biomed- logic Oncology: Seminars and Original Investigations, ical language models for clinical NLP in Spanish, volume 39, Elsevier, 2021, pp. 432–e11. in: Proceedings of the 21st Workshop on Biomed- [12] P. Khosravi, M. Lysandrou, M. Eljalby, Q. Li, E. Kazemi, ical Language Processing, Association for Compu- tational Linguistics, Dublin, Ireland, 2022, pp. 193– J1. K.- AD1: ÁPEX DERECHO, PERIFÉRICO POSTERIOR: 199. URL: https://aclanthology.org/2022.bionlp-1.19. SE RECIBE UN CILINDRO. INCLUSIÓN TOTAL EN BLOQUE doi:10.18653/v1/2022.bionlp-1.19. K1. L.- AD2: ÁPEX DERECHO, PERIFÉRICO EXTERNO: SE [23] J. I. Epstein, W. C. Allsbrook, M. B. Amin, L. L. Egevad, RECIBE UN CILINDRO PEQUEÑO. INCLUSIÓN TOTAL EN The 2005 international society of urological pathol- BLOQUE L1. M.- AD3: ÁPEX DERECHO, PERIFÉRICO AN- ogy (isup) consensus conference on gleason grading TERIOR: SE RECIBE UN CILINDRO (EN VARIOS FRAGMEN- of prostatic carcinoma, American Journal of Surgi- TOS). INCLUSIÓN TOTAL EN BLOQUE M1. N.- AD4: ÁPEX cal Pathology 29 (2005) 1228–1242. 
URL: http://dx. DERECHO, TRANSICIONAL: SE RECIBE UN CILINDRO. IN- doi.org/10.1097/01.pas.0000173646.99337.b1. doi:10. CLUSIÓN TOTAL EN BLOQUE N1. O.- AD5: ÁPEX DERE- 1097/01.pas.0000173646.99337.b1. CHO, ANTERIOR: SE RECIBE UN CILINDRO MUY FINO. IN- [24] R. D. Rosen, A. Sapra, Tnm classification., 2023. URL: CLUSIÓN TOTAL EN BLOQUE O1. https://www.ncbi.nlm.nih.gov/books/NBK553187/, last accessed: 2023-09-30. Table 4 Extracted information from example document in Spanish. Each component of the extracted gleason patterns corresponds to the first gleason pattern, the second gleason pattern, the third glea- A. Annex 1: Full example son pattern, the gleason sum and the gleason group, respectively. Diagnosis BIOPSIA DE PRÓSTATA TRANSPERINEAL, PROTOCOLO Extracted information Value EXTENDIDO (BASADO EN CAP JUN): - ADENOCARCI- NOMA CONVENCIONAL. - AI4: ÁPEX IZQUIERDO, TRAN- cancer tag Cancer SICIONAL: - CILINDROS AFECTADOS / REMITIDOS: 1/1 - doc tag Biopsia organs detected Próstata, GRADO DE GLEASON: 6 (3+3) - GRADO GRUPO: 1 - POR- Vesículas Seminales CENTAJE DE PATRÓN GLEASON 4 O 5: 0% - PORCENTAJE (3, 3, 0, 6, 1), DE TEJIDO PROSTÁTICO AFECTADO POR TUMOR: 16,6% extracted gleason patterns (3, 3, 0, 6, 1), - MM DE CARCINOMA / MM DE CILINDRO: 1/6 MM - (3, 4, 0, 7, 2) AI5: ÁPEX IZQUIERDO, ANTERIOR: - CILINDROS AFEC- primary gleason pattern 3 TADOS / REMITIDOS: 2/2 - GRADO DE GLEASON: 6 (3+3) - secondary gleason pattern 4 GRADO GRUPO: 1 - PORCENTAJE DE PATRÓN GLEASON gleason sum 7 4 O 5: 0% - PORCENTAJE DE TEJIDO PROSTÁTICO AFEC- gleason group 2 TADO POR TUMOR: 61,5% - MM DE CARCINOMA / MM DE CILINDRO: 8/13 MM - AD4: ÁPEX DERECHO, TRAN- SICIONAL: - CILINDROS AFECTADOS / REMITIDOS:2/2 - GRADO DE GLEASON: 7 (3+4) - GRADO GRUPO: 2 - POR- CENTAJE DE PATRÓN GLEASON 4 O 5: 5% - PORCENTAJE DE TEJIDO PROSTÁTICO AFECTADO POR TUMOR: 47,3% - MM DE CARCINOMA / MM DE CILINDRO: 9/19 MM - IN- FILTRACIÓN GRASA PERIPROSTÁTICA: NEGATIVA. 
- IN- FILTRACIÓN DE VESÍCULA SEMINAL: NO VALORABLE POR AUSENCIA DE VESÍCULA SEMINAL EN EL MATERIAL REMITIDO. - INVASIÓN LINFOVASCULAR: NEGATIVA. - INVASIÓN PERINEURAL: NEGATIVA. - PATOLOGÍA ADI- CIONAL PROSTÁTICA: NO SE OBSERVA. Macroscopic findings A.- BD1: BASE DERECHO, PERIFÉRICO POSTERIOR: SE RECIBE UN CILINDRO. INCLUSIÓN TOTAL EN BLOQUE A1. B.- BD2: BASE DERECHO, PERIFÉRICO EXTERNO: SE RECIBE UN CILINDRO. INCLUSIÓN TOTAL EN BLOQUE B1. C.- BD3: BASE DERECHO, PERIFÉRICO ANTERIOR: SE RECIBE UN CILINDRO. INCLUSIÓN TOTAL EN BLOQUE C1. D.- BD4: BASE DERECHO, TRANSICIONAL: SE RECIBE UN CILINDRO FRAGMENTADO. INCLUSIÓN TOTAL EN BLOQUE D1. E.- BD5: BASE DERECHO, ANTERIOR: SE RECIBE UN CILINDRO. INCLUSIÓN TOTAL EN BLOQUE E1. F.- MD1: MEDIO DERE- CHO, PERIFÉRICO POSTERIOR: SE RECIBE UN CILINDRO MÁS UN FRAGMENTO. INCLUSIÓN TOTAL EN BLOQUE F1. G.- MD2: MEDIO DERECHO, PERIFÉRICO EXTERNO: SE RECIBE UN CILINDRO. INCLUSIÓN TOTAL EN BLOQUE G1. H.- MD3: MEDIO DERECHO, PERIFÉRICO ANTERIOR: SE RECIBE UN CILINDRO MÁS FRAGMENTO. INCLUSIÓN TOTAL EN BLOQUE H1. I.- MD4: MEDIO DERECHO, TRAN- SICIONAL: SE RECIBE UN CILINDRO. INCLUSIÓN TOTAL EN BLOQUE I1. J.- MD5: MEDIO DERECHO, ANTERIOR: SE RECIBE UN CILINDRO. INCLUSIÓN TOTAL EN BLOQUE
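As an illustration of how the Gleason tuples in Table 4 relate to the diagnosis text of this annex, the following is a minimal sketch and not the system described in the paper, which combines lexical-morphological analysis, rules and transformer models. It assumes that every sample reports its grade in the exact form "GRADO DE GLEASON: 7 (3+4) - GRADO GRUPO: 2", that a missing third pattern is encoded as 0, and that the document-level values are those of the sample with the highest grade group. The names `GLEASON_RE`, `extract_gleason` and `most_severe` are hypothetical.

```python
import re

# Hypothetical pattern for the layout seen in the annex example:
# "GRADO DE GLEASON: <sum> (<first>+<second>) - GRADO GRUPO: <group>"
GLEASON_RE = re.compile(
    r"GRADO DE GLEASON:\s*(\d{1,2})\s*\((\d)\s*\+\s*(\d)\)"
    r"\s*-\s*GRADO GRUPO:\s*(\d)"
)


def extract_gleason(text):
    """Return one (first, second, third, sum, group) tuple per match."""
    results = []
    for match in GLEASON_RE.finditer(text):
        total, first, second, group = (int(g) for g in match.groups())
        # Skip mentions whose patterns do not add up to the stated sum.
        if first + second != total:
            continue
        # Assumption: the third pattern is not reported, so it is set to 0.
        results.append((first, second, 0, total, group))
    return results


def most_severe(patterns):
    """Pick the tuple with the highest grade group, then the highest sum."""
    return max(patterns, key=lambda p: (p[4], p[3]), default=None)
```

Applied to the diagnosis text above, with the line-break hyphenation already repaired, `extract_gleason` would yield one tuple per affected sample, and `most_severe` would select the tuple from which the primary pattern, secondary pattern, sum and group rows of Table 4 can be read.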