SKET: an Unsupervised Knowledge Extraction Tool to Empower Digital Pathology Applications ⋆

Giorgio Maria Di Nunzio, Nicola Ferro, Fabio Giachelle, Ornella Irrera, Stefano Marchesin (stefano.marchesin@unipd.it), Gianmaria Silvello
Department of Information Engineering, University of Padua

19th IRCDL (The Conference on Information and Research science Connecting to Digital and Library science), February 23-24, 2023, Bari, Italy

ISSN 1613-0073

Keywords: Knowledge Extraction, Machine Learning, Expert Systems, Digital Pathology

Large volumes of medical data have been produced for decades. These data include diagnoses, which are often reported as free text, thus encoding medical knowledge that is still largely unexploited. To decode the medical knowledge present within reports, we propose the Semantic Knowledge Extractor Tool (SKET), an unsupervised knowledge extraction system combining a rule-based expert system with pretrained Machine Learning (ML) models. This work demonstrates the viability of unsupervised Natural Language Processing (NLP) techniques to extract critical information from cancer reports, opening opportunities such as data mining for knowledge extraction purposes, precision medicine applications, structured report creation, and multimodal learning.

Introduction

Hundreds of thousands of medical reports have been used to communicate diagnoses, encoding a vast amount of medical knowledge. In this context, free-text reporting is the de facto standard to communicate diagnoses, guide patients' treatment, and conduct therapies. Processing high volumes of free-text reports to extract the crucial knowledge is usually performed manually. However, since reports vary widely between institutions, contain noise, and lack a standard structure, this becomes an extremely time-consuming process. To overcome this limitation, Natural Language Processing (NLP) methods become essential [2,3,4,5,6,7,8,9] as they enable the efficient automatic processing of thousands of reports and the extraction of relevant information for several (downstream) tasks, such as clinical note mining [10,11] and structuring [12], risk prediction [13], clinical decision support [14], and precision medicine retrieval [15].

In the context of digital pathology, a field that involves the analysis of histopathology images known as Whole Slide Images (WSIs), this work aims to prove the viability of unsupervised NLP techniques to automatically extract critical information from pathology reports and use it for different applications, such as automatic report annotation and visualization [16], as well as WSI classification [17]. To this end, we present the Semantic Knowledge Extractor Tool (SKET), an unsupervised hybrid knowledge extraction system that combines rule-based techniques with pre-trained Machine Learning (ML) models to extract knowledge from pathology reports. In recent years, NLP has shifted from using rules to ML approaches [18,9], which have the advantage of learning regularities from data and of generalizing to previously unseen patterns. Moreover, the advent of efficient Neural Language Models (NLMs) [19,20,21,22] paved the way for the pre-training era, where large NLMs trained in a self-supervised fashion on huge datasets are used to develop NLP models for a number of downstream tasks. Nevertheless, similarly to [10], we argue that rule-based techniques capture critical information that should be used together with, and not substituted by, ML to improve performance.

We evaluate SKET's effectiveness on entity linking and text classification, considering three use-cases: colon, cervix, and lung cancer. We rely on diagnostic reports coming from two medical centers based in Italy and The Netherlands. Then, we compare SKET with unsupervised ML approaches to understand the impact that combining rule-based techniques and pre-trained ML models has on the extraction of knowledge from diagnostic reports. The results highlight the effectiveness of ML methods for information extraction in the pathology domain but, at the same time, they also stress the role of expert knowledge in reaching the high levels of accuracy required to semi-automate clinical practice. As further proof, SKET has already been used as the core system in automatic report annotation and visualization [16], as well as for weak supervision in WSI classification [17]. SKET source code is publicly available at https://github.com/ExaNLP/sket.

The rest of this paper is organized as follows: Section 2 presents SKET. Section 3 describes the experimental evaluation. Finally, Section 4 concludes the paper.

The Semantic Knowledge Extractor Tool

SKET combines pre-trained Named Entity Recognition (NER) models with unsupervised Entity Linking (EL) methods to extract relevant entities from diagnostic reports and link them to concepts stored in a reference ontology¹. By relying on pre-trained NER models and unsupervised EL methods, SKET can serve as an automated annotator in weak supervision tasks. For instance, the concepts extracted by SKET can be used as weak labels when training ML models for image classification [23,24] and relation extraction [25], or as nodes to build knowledge graphs that can be used for retrieval tasks [26].

SKET consists of four main components: (1) Named Entity Recognition, (2) Entity Linking, (3) Data Labeling, and (4) Graph Creation. Components (1) and (2) are sequential, whereas (3) and (4) can be applied in parallel. We briefly describe each component below.

Named Entity Recognition

NER can be defined as the task of identifying and categorizing relevant information within text. A named entity can be any word or phrase (i.e., a mention) that consistently refers to a concept or object of the world. Once identified, mentions are classified into predefined categories, such as disease, gene/protein, symptom, etc.

To perform NER, SKET combines pre-trained neural models with rule-based techniques. As the neural component, SKET exploits ScispaCy models [27], which provide full NER pipelines for biomedical data, consisting of large medical vocabularies, as well as Word2Vec [19] word vectors trained on the PubMed Central Open Access Subset [28]. Regarding the integration of expert rules, SKET extends the ScispaCy pipeline with two more components: Entity Fusion and Negation Detection. For Entity Fusion, SKET exploits expert rules to identify and merge specific mentions that would otherwise be regarded as separate by ScispaCy. For example, ScispaCy treats "high-grade" and "dysplasia" as separate mentions, whereas we are interested in "high-grade dysplasia" as a single mention. Hence, we developed regular expressions capable of identifying trigger terms that are indicative of a set of mentions that should potentially be combined into one. These expert rules have been developed on a holdout dataset, which is available in the SKET GitHub repository². The dataset consists of 50 diagnostic reports for each use-case and medical center, for a total of 250 diagnostic reports. For Negation Detection, SKET relies on NegEx [29], a negation detection algorithm that evaluates whether extracted entities are negated within text. NegEx uses regular expressions to identify the scope of trigger terms that are indicative of negation. Then, the entities extracted within the scope of a trigger term are marked as negated and removed.
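
The two rule-based extensions can be illustrated with a minimal sketch. It uses only the Python standard library and does not reproduce the actual SKET rules or the NegEx implementation: the fusion pattern, the negation triggers, the scope terminators, and all function names below are hypothetical and chosen only to show the mechanism of merging mentions and discarding negated ones.

```python
import re

# Hypothetical fusion rule: a grade modifier followed by "dysplasia" forms one mention.
# The real SKET rules are in its repository and are not reproduced here.
FUSION_RULES = [re.compile(r"\b(?:high|low)-grade\s+dysplasia\b", re.IGNORECASE)]

# Hypothetical negation triggers; NegEx relies on a much larger curated list.
NEGATION_TRIGGERS = re.compile(r"\b(?:no|without|negative for)\b", re.IGNORECASE)
SCOPE_END = re.compile(r"[.;,]")  # assumed scope terminators


def span(text, phrase):
    """Character span (start, end) of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def fuse_mentions(text, mentions):
    """Merge mention spans that fall inside the same fusion-rule match."""
    fused, covered = [], set()
    for rule in FUSION_RULES:
        for m in rule.finditer(text):
            inside = [s for s in mentions if m.start() <= s[0] and s[1] <= m.end()]
            if len(inside) > 1:
                fused.append((m.start(), m.end()))
                covered.update(inside)
    kept = [s for s in mentions if s not in covered]
    return sorted(kept + fused)


def negated_scopes(text):
    """Character ranges that follow a negation trigger up to the next terminator."""
    scopes = []
    for trig in NEGATION_TRIGGERS.finditer(text):
        stop = SCOPE_END.search(text, trig.end())
        scopes.append((trig.end(), stop.start() if stop else len(text)))
    return scopes


def drop_negated(text, mentions):
    """Remove mentions that lie entirely within a negation scope."""
    scopes = negated_scopes(text)
    return [s for s in mentions if not any(a <= s[0] and s[1] <= b for a, b in scopes)]


text = "Tubular adenoma with high-grade dysplasia, no carcinoma."
# Stand-in for the mentions a NER model might return.
mentions = [span(text, p) for p in ("adenoma", "high-grade", "dysplasia", "carcinoma")]
kept = drop_negated(text, fuse_mentions(text, mentions))
print([text[a:b] for a, b in kept])  # ['adenoma', 'high-grade dysplasia']
```

In this toy report, "high-grade" and "dysplasia" are fused into one mention, and "carcinoma" is dropped because it falls within the scope of the trigger "no".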

Entity Linking

EL can be defined as the task of assigning unique meanings to entities mentioned within text. In a nutshell, EL aims to determine whether a target named entity refers to a specific concept or object stored within a reference ontology.

To perform EL, SKET adopts ad-hoc and similarity-based matching. Given an extracted entity, SKET performs a two-stage matching approach. First, the system tries to link the entity using ad-hoc matching. Then, if ad-hoc matching fails, it employs similarity-based matching. For Ad-Hoc Matching, SKET employs regular expressions to find trigger terms indicative of a specific concept in the ontology. Once a trigger is found, the system matches the entity containing the trigger term with the closest ontology concept. For instance, if an extracted entity contains the (trigger) term "carcinoma", then SKET links the entity to the "colon adenocarcinoma" concept. Ad-hoc matching rules have also been developed on the holdout dataset and are available on GitHub. Regarding Similarity Matching, SKET combines string and semantic matching techniques. For string matching, SKET adopts the Gestalt Pattern Matching (GPM) algorithm [30]. For semantic matching, SKET exploits the word vectors provided by ScispaCy models [27]. Specifically, it computes the cosine distance between the vector representations of extracted entities and ontology concepts.
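
The two-stage matching can be sketched as follows. This is an illustration under stated assumptions, not SKET's code: the concept list, the single ad-hoc rule, the 0.8 threshold, and the function names are hypothetical. For the string-matching stage the sketch uses `difflib.SequenceMatcher` from the Python standard library, whose documentation describes its algorithm as derived from the Ratcliff and Obershelp "gestalt pattern matching" approach. The semantic (word-vector) part of the similarity stage is left out.

```python
import re
from difflib import SequenceMatcher

# Hypothetical ontology labels and ad-hoc rule, for illustration only.
CONCEPTS = ["colon adenocarcinoma", "tubular adenoma", "hyperplastic polyp"]
AD_HOC_RULES = [(re.compile(r"\bcarcinoma\b", re.IGNORECASE), "colon adenocarcinoma")]


def string_similarity(a, b):
    """Similarity ratio in [0, 1] computed by difflib.SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(entity, threshold=0.8):
    """Return (concept, stage) for an entity; threshold is an assumed cut-off."""
    # Stage 1: ad-hoc matching on trigger terms.
    for pattern, concept in AD_HOC_RULES:
        if pattern.search(entity):
            return concept, "ad-hoc"
    # Stage 2: similarity-based matching (string similarity only in this sketch).
    best = max(CONCEPTS, key=lambda c: string_similarity(entity, c))
    if string_similarity(entity, best) >= threshold:
        return best, "similarity"
    return None, "unlinked"


print(link("invasive carcinoma"))  # ('colon adenocarcinoma', 'ad-hoc')
print(link("tubular adenomas"))    # ('tubular adenoma', 'similarity')
print(link("fibrosis"))            # (None, 'unlinked')
```

Trying the rules first means that a trigger term decides the link even when the surface string is far from the concept label, while the similarity stage covers the entities for which no rule exists.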

Data Labeling

Given the set of concepts extracted from each diagnostic report, SKET maps a clinically relevant subset of such concepts to a set of annotation classes defined by pathologists.

Graph Creation

SKET builds report-level knowledge graphs using the extracted concepts as nodes and the semantic relations of the reference ontology as edges. The use of ontology concepts and relations to describe diagnostic reports increases the semantic understanding of the underlying data [31]. Once created, report-level knowledge graphs are encoded in a machine-readable format through RDF.
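
As a rough illustration of this step, the sketch below serializes a report-level graph as N-Triples, one of the standard RDF syntaxes. Everything specific in it is an assumption: the base IRI, the concept and relation names, and the `mentions` edge that ties a report to its concepts are hypothetical and are not taken from the reference ontology or from SKET.

```python
# Hypothetical base IRI; the real ontology identifiers are not used here.
BASE = "http://example.org/sket-demo/"


def iri(local_name):
    """Wrap a local name into an absolute IRI in N-Triples syntax."""
    return f"<{BASE}{local_name}>"


def report_graph(report_id, concepts, relations):
    """Return N-Triples lines describing one report.

    concepts:  local names of the concepts extracted from the report.
    relations: (subject, predicate, object) local names, assumed to come
               from the reference ontology.
    """
    # Assumed linking edge between the report and each extracted concept.
    triples = [(report_id, "mentions", c) for c in concepts]
    found = set(concepts)
    # Keep only ontology relations whose endpoints both occur in this report.
    triples += [(s, p, o) for s, p, o in relations if s in found and o in found]
    return [f"{iri(s)} {iri(p)} {iri(o)} ." for s, p, o in triples]


lines = report_graph(
    "report_1",
    ["ColonAdenocarcinoma", "Colon"],
    [("ColonAdenocarcinoma", "hasLocation", "Colon"),
     ("Polyp", "hasLocation", "Colon")],  # dropped: Polyp is not in the report
)
print("\n".join(lines))
```

An RDF library would normally handle serialization and escaping; plain string formatting is used here only to keep the sketch free of dependencies.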

Experimental Evaluation

Setup

Tasks: We evaluate SKET on Entity Linking (Task 1) and Text Classification (Task 2). Both tasks are addressed as multi-label classification problems. Note that the number of possible labels for entity linking is much higher than for text classification, making the task an extreme multi-label classification problem [32,33].

Datasets: For Task 1, we use 1,250 annotated reports coming from both medical centers and covering all three use-cases. For Task 2, we rely on 9,798 annotated reports, divided among medical centers and use-cases. We refer the reader to the original publication [1] for a comprehensive description of the available data.

Baselines: For both tasks, we compare SKET with two unsupervised approaches based on Bio FastText [20,34] and BioClinical BERT [22,35]. For a fair comparison, both approaches adopt the same NER ScispaCy pipeline used by SKET, but without the extensions introduced with it. Then, they perform EL by computing the cosine distance between the vector representations of the extracted entities and the ontology concepts. Both baselines are straightforward approaches to perform entity linking and text classification without annotated data.
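
The linking step shared by the two baselines can be sketched in a few lines. The three-dimensional vectors and concept names below are toy values invented for the example, whereas real embeddings would come from the FastText or BERT models. The sketch ranks concepts by cosine similarity, which is equivalent to ranking by smallest cosine distance (one minus the similarity).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_concept(entity_vec, concept_vecs):
    """Name of the concept whose vector is closest to the entity vector."""
    return max(concept_vecs, key=lambda name: cosine_similarity(entity_vec, concept_vecs[name]))


# Toy vectors, purely illustrative.
CONCEPT_VECS = {
    "adenocarcinoma": [0.9, 0.1, 0.0],
    "adenoma": [0.1, 0.9, 0.0],
    "polyp": [0.0, 0.2, 0.9],
}
entity_vec = [0.8, 0.3, 0.1]
print(nearest_concept(entity_vec, CONCEPT_VECS))  # adenocarcinoma
```

Because this procedure always returns the nearest concept, it has no notion of a wrong but close match, which is consistent with the precision gap discussed in the results.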

Results

Table 1 reports the results obtained by SKET and the considered baselines on Entity Linking (left) and Text Classification (right).

For entity linking (Task 1), we observe that SKET achieves high performance for both micro- and weighted-average F1 in each considered use-case. Regarding accuracy, its performance varies depending on the use-case, with the lowest score obtained in colon cancer with a value of 0.6280. As for the comparison of SKET with the considered baselines, we see that it outperforms them in each use-case for all measures. This result shows the effectiveness of combining ad-hoc, expert rules with ML models, making SKET both precise and sensitive. Specifically, ad-hoc matching makes SKET precise, while semantic matching makes it sensitive. To support this intuition, we observe that unsupervised baselines, which only rely on ML models and semantic matching, have low accuracy values. Since we tackle the entity linking task as a multi-label classification problem, we rely on subset accuracy, where the set of concepts predicted for a report must exactly match the corresponding set of ground-truth concepts. Therefore, accuracy values are prone to rapidly decrease and less precise models are naturally affected by this.

For text classification (Task 2), we see that SKET performs well on colon and lung cancer use-cases, whereas it shows lower accuracy values on cervix cancer. This result suggests that the cervix use-case is harder than the others, as subset accuracy drops fast when a model fails to predict all labels correctly. The higher values for micro- and weighted-average F1, which do not perform exact match between predicted and ground-truth labels, further support this intuition. Compared to baselines, SKET outperforms them in colon and cervix use-cases. On the other hand, the BERT-based approach proves more effective in lung cancer. Despite this, the robustness of SKET across different use-cases makes it a viable solution in real scenarios, where annotated data are hard and expensive to get.
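
The gap between subset accuracy and F1 can be made concrete with a small worked example. The three toy reports and their labels below are invented, and the functions are a plain-Python sketch of the standard definitions, not the evaluation code used for Table 1. Weighted-average F1 is omitted for brevity.

```python
def subset_accuracy(gold, pred):
    """Fraction of reports whose predicted label set exactly equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def micro_f1(gold, pred):
    """F1 computed from true/false positives and false negatives pooled over all reports."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Invented labels for three reports.
gold = [{"adenocarcinoma", "colon"}, {"adenoma", "dysplasia"}, {"polyp"}]
pred = [{"adenocarcinoma", "colon"}, {"adenoma"}, {"polyp", "colon"}]

print(round(subset_accuracy(gold, pred), 2))  # 0.33
print(round(micro_f1(gold, pred), 2))         # 0.8
```

Only the first report is matched exactly, so subset accuracy is 1/3. The predictions still contain four correct labels, one spurious label, and one missing label, so micro-averaged F1 is 8/10 = 0.8. A single label error per report is enough to produce this kind of divergence.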

Conclusion

In this work, we presented SKET, an unsupervised hybrid knowledge extraction system that combines rule-based techniques with pre-trained ML models to extract relevant concepts from diagnostic reports. The experimental evaluation demonstrated the effectiveness of SKET, making it a viable solution to reduce pathologists' workload. Besides, the experimental results highlighted the importance of expert knowledge in developing unsupervised systems for specialized medicine. As a result, the extracted concepts can serve different digital pathology applications, such as automatic report annotation, visualization, and retrieval, as well as image classification.

Table 1: Entity linking (left) and text classification (right) results on colon, cervix, and lung cancer pathology reports. Bold values represent the highest scores achieved for each measure.

                      Entity Linking                     Text Classification
Use-case  Model       Accuracy  Micro F1  Weighted F1    Accuracy  Micro F1  Weighted F1
Colon     SKET        0.6280    0.8861    0.8694         0.7525    0.8386    0.8373
Colon     FastText    0.0660    0.5000    0.6146         0.4146    0.5298    0.5514
Colon     BERT        0.1840    0.3905    0.4527         0.5167    0.5697    0.6587
Cervix    SKET        0.7020    0.8322    0.8368         0.5281    0.7791    0.7611
Cervix    FastText    0.0900    0.2802    0.3439         0.2533    0.4882    0.4445
Cervix    BERT        0.0720    0.2715    0.2940         0.3066    0.3962    0.4867
Lung      SKET        0.8624    0.9375    0.9262         0.8137    0.8387    0.8262
Lung      FastText    0.2510    0.5610    0.6506         0.5221    0.7296    0.6853
Lung      BERT        0.3806    0.6804    0.8395         0.8523    0.8630    0.8526

¹ https://w3id.org/examode/ontology/
² https://github.com/ExaNLP/sket/tree/main/sket/nerd/rules/

Acknowledgments

The work was supported by the ExaMode project, as part of the EU H2020 program under Grant Agreement no. 825292.

References

[1] S. Marchesin, F. Giachelle, N. Marini, M. Atzori, S. Boytcheva, G. Buttafuoco, F. Ciompi, G. M. Di Nunzio, F. Fraggetta, O. Irrera, H. Müller, T. Primov, S. Vatrano, G. Silvello, Empowering digital pathology applications through explainable knowledge extraction tools, Journal of Pathology Informatics 13 (2022) 100139. doi:10.1016/j.jpi.2022.100139.
[2] T. Davenport, R. Kalakota, The Potential for Artificial Intelligence in Healthcare, Future Healthc J 6 (2019). doi:10.7861/futurehosp.6-2-94.
[3] J. M. Buckley, S. B. Coopey, J. Sharko, F. Polubriaginof, B. Drohan, A. K. Belli, E. M. Kim, J. E. Garber, B. L. Smith, M. A. Gadd, M. C. Specht, C. A. Roche, T. M. Gudewicz, K. S. Hughes, The Feasibility of Using Natural Language Processing to Extract Clinical Information from Breast Pathology Reports, J. Pathol Inform 3 (2012) 23. doi:10.4103/2153-3539.97788.
[4] S. Hassanpour, C. P. Langlotz, Information Extraction from Multi-Institutional Radiology Reports, Artif. Intell. Medicine 66 (2016). doi:10.1016/j.artmed.2015.09.007.
[5] G. Burger, A. Abu-Hanna, N. De Keizer, R. Cornet, Natural Language Processing in Pathology: a Scoping Review, Journal of Clinical Pathology 69 (2016). doi:10.1136/jclinpath-2016-203872.
[6] M. Topaz, L. Murga, K. M. Gaddis, M. V. McDonald, O. Bar-Bachar, Y. Goldberg, K. H. Bowles, Mining Fall-Related Information in Clinical Notes: Comparison of Rule-Based and Novel Word Embedding-Based Machine Learning Approaches, J. Biomed. Informatics 90 (2019). doi:10.1016/j.jbi.2019.103103.
[7] T. Oliwa, S. B. Maron, L. M. Chase, S. Lomnicki, D. V. T. Catenacci, B. Furner, S. L. Volchenboum, Obtaining Knowledge in Pathology Reports Through a Natural Language Processing Approach With Classification, Named-Entity Recognition, and Relation-Extraction Heuristics, JCO Clinical Cancer Informatics 1 (2019). doi:10.1200/CCI.19.00008.
[8] K. Kreimeyer, M. Foster, A. Pandey, N. Arya, G. Halford, S. F. Jones, R. Forshee, M. Walderhaug, T. Botsis, Natural Language Processing Systems for Capturing and Standardizing Unstructured Clinical Information: A Systematic Review, J. Biomed. Informatics 73 (2017). doi:10.1016/j.jbi.2017.07.012.
[9] Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn, H. Liu, Clinical Information Extraction Applications: A Literature Review, J. Biomed. Informatics 77 (2018). doi:10.1016/j.jbi.2017.11.011.
[10] E. Santus, T. Schuster, A. M. Tahmasebi, C. Li, A. Yala, C. R. Lanahan, P. Prinsen, S. F. Thompson, S. Coons, L. Mynderse, R. Barzilay, K. Hughes, Exploiting Rules to Enhance Machine Learning in Extracting Information From Multi-Institutional Prostate Pathology Reports, JCO Clinical Cancer Informatics (2020). doi:10.1200/CCI.20.00028.
[11] Y. Kim, J. H. Lee, S. Choi, J. M. Lee, J. H. Kim, J. Seok, H. J. Joo, Validation of Deep Learning Natural Language Processing Algorithm for Keyword Extraction from Pathology Reports in Electronic Health Records, Sci Rep 1 (2020). doi:10.1038/s41598-020-77258-w.
[12] P. Giannaris, Z. Al-Taie, M. Kovalenko, N. Thanintorn, O. Kholod, Y. Innokenteva, E. Coberly, S. Frazier, K. Laziuk, M. Popescu, C. R. Shyu, D. Xu, R. Hammer, D. Shin, Artificial Intelligence-Driven Structurization of Diagnostic Information in Free-Text Pathology Reports, Journal of Pathology Informatics 11 (2020) 10. doi:10.4103/jpi.jpi_30_19.
[13] J. R. Gregg, M. Lang, L. L. Wang, M. J. Resnick, S. K. Jain, J. L. Warner, D. A. Barocas, Automating the Determination of Prostate Cancer Risk Strata From Electronic Medical Records, JCO Clinical Cancer Informatics 1 (2017). doi:10.1200/CCI.16.00045.
[14] A. P. Glaser, B. J. Jordan, J. Cohen, A. Desai, P. Silberman, J. J. Meeks, Automated Extraction of Grade, Stage, and Quality Information From Transurethral Resection of Bladder Tumor Pathology Reports Using Natural Language Processing, JCO Clinical Cancer Informatics 1 (2018). doi:10.1200/CCI.17.00128.
[15] K. Roberts, D. Demner-Fushman, E. M. Voorhees, W. R. Hersh, S. Bedrick, A. J. Lazar, S. Pant, Benchmarking Information Retrieval for Precision Oncology: the TREC Precision Medicine Track, in: AMIA 2018, American Medical Informatics Association Annual Symposium, San Francisco, CA, November 3-7, 2018, AMIA, 2018.
[16] F. Giachelle, O. Irrera, G. Silvello, MedTAG: a portable and customizable annotation tool for biomedical documents, BMC Medical Informatics Decis. Mak 21 (2021) 352. doi:10.1186/s12911-021-01706-4.
[17] N. Marini, S. Marchesin, S. Otálora, M. Wodzinski, A. Caputo, M. Van Rijthoven, W. Aswolinskiy, J. M. Bokhorst, D. Podareanu, E. Petters, S. Boytcheva, G. Buttafuoco, S. Vatrano, F. Fraggetta, J. Der Laak, M. Agosti, F. Ciompi, G. Silvello, H. Muller, M. Atzori, Unleashing the potential of digital pathology data by training computer-aided diagnosis models without human annotations, npj Digital Medicine 5 (2022). doi:10.1038/s41746-022-00635-4.
[18] L. Chiticariu, Y. Li, F. R. Reiss, Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!, in: Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, ACL, 2013.
[19] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed Representations of Words and Phrases and their Compositionality, in: Proc. of the 27th Annual Conference on Neural Information Processing Systems 2013, NIPS, December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013.
[20] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information, Trans. Assoc. Comput. Linguistics 5 (2017). doi:10.1162/tacl_a_00051.
[21] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep Contextualized Word Representations, in: Proc. of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, 2018. doi:10.18653/v1/n18-1202.
[22] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, CoRR abs/1810.04805 (2018).
[23] G. Campanella, M. G. Hanna, L. Geneslaw, A. Miraflor, V. W. K. Silva, K. J. Busam, E. Brogi, V. E. Reuter, D. S. Klimstra, T. J. Fuchs, Clinical-Grade Computational Pathology using Weakly Supervised Deep Learning on Whole Slide Images, Nat Med 25 (2019). doi:10.1038/s41591-019-0508-1.
[24] M. A. Carbonneau, V. Cheplygina, E. Granger, G. Gagnon, Multiple Instance Learning: A Survey of Problem Characteristics and Applications, Pattern Recognit 77 (2018). doi:10.1016/j.patcog.2017.10.009.
[25] S. Marchesin, G. Silvello, TBGA: a large-scale gene-disease association dataset for biomedical relation extraction, BMC Bioinform 23 (2022) 111. doi:10.1186/s12859-022-04646-6.
[26] S. Marchesin, Case-Based Retrieval Using Document-Level Semantic Networks, in: Proc. of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018, ACM, 2018, p. 1451. doi:10.1145/3209978.3210221.
[27] M. Neumann, D. King, I. Beltagy, W. Ammar, ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing, in: Proc. of the 18th BioNLP Workshop and Shared Task, BioNLP@ACL 2019, Florence, Italy, August 1, 2019, ACL, 2019. doi:10.18653/v1/w19-5034.
[28] S. Pyysalo, F. Ginter, H. Moen, T. Salakoski, S. Ananiadou, Distributional Semantics Resources for Biomedical Text Processing, in: Proc. of LBM, 2013.
[29] W. W. Chapman, W. Bridewell, P. Hanbury, G. F. Cooper, B. G. Buchanan, A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries, J. Biomed. Informatics 34 (2001). doi:10.1006/jbin.2001.1029.
[30] J. W. Ratcliff, D. E. Metzener, Pattern Matching: the Gestalt Approach, Dr Dobbs Journal 13 (1988) 46.
[31] M. Agosti, S. Marchesin, G. Silvello, Learning Unsupervised Knowledge-Enhanced Representations to Reduce the Semantic Gap in Information Retrieval, ACM Trans. Inf. Syst 38 (2020) 48. doi:10.1145/3417996.
[32] W. C. Chang, H. F. Yu, K. Zhong, Y. Yang, I. S. Dhillon, Taming pretrained transformers for extreme multi-label text classification, in: KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23-27, 2020, ACM, 2020. doi:10.1145/3394486.3403368.
[33] P. Ruas, V. D. T. Andrade, F. M. Couto, Lasige-biotm at MESINESP2: entity linking with semantic similarity and extreme multi-label classification on spanish biomedical documents, in: Proc. of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21st to 24th, 2021, volume 2936 of CEUR Workshop Proceedings, 2021.
[34] Y. Zhang, Q. Chen, Z. Yang, H. Lin, Z. Lu, BioWordVec, Improving Biomedical Word Embeddings with Subword Information and MeSH, Scientific Data 6 (2019). doi:10.1038/s41597-019-0055-0.
[35] E. Alsentzer, J. R. Murphy, W. Boag, W. H. Weng, D. Jin, T. Naumann, M. B. A. McDermott, Publicly Available Clinical BERT Embeddings, CoRR abs/1904.03323 (2019).