<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>S. Marchesin); https://www.dei.unipd.it/~dinunzio (G. Di Nunzio);
https://www.dei.unipd.it/~silvello (G. Silvello)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Terminology Extraction in Electronic Health Records. The ExaMode Project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Marchesin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgio Maria Di Nunzio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Medical free-text records store a lot of useful information that can be exploited in developing computer-supported medicine. Nevertheless, extracting terminological knowledge from unstructured text is difficult because the volume of medical texts created every year keeps growing at a very fast pace and because the extraction is highly dependent on the language under examination. In this work, we present an initial study of a Natural Language Processing pipeline that extracts terminological information from pathology reports and links this information to medical ontologies.</p>
      </abstract>
      <kwd-group>
        <kwd>Terminology Extraction</kwd>
        <kwd>Electronic Health Records</kwd>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Entity Linking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Modern medical specialties rely on clinical context for an accurate interpretation of medical
data. The literature on medical image analysis shows an important trend in which both
Electronic Health Records (EHR) and medical images are leveraged in an approach called the ‘fusion
paradigm’ to solve complex tasks that cannot be tackled by a single modality [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In fact,
medical free-text records store a lot of useful information that can be exploited in developing
computer-supported medicine [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These medical free-text records can also be produced by
patients in the so-called patient-reported diagnosis [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This type of document can reveal whether a
patient left a medical encounter knowing the diagnosis explained to them and can ultimately
indicate whether there are language differences between trained health care professionals
and patients without medical training.
      </p>
      <p>
        However, extracting terminological knowledge from unstructured text is difficult for at least
two reasons. First, the volume of medical texts created every year keeps growing at a very fast
pace. The time required by clinicians to retrieve relevant information from such an amount of
literature using standard systems is often prohibitive. Therefore, there has been a strong interest
in Clinical Decision Support (CDS) systems [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] designed to produce effective and timely
information that can help clinicians in the decision-making process for patient care.<sup>1</sup> Second,
the extraction of relevant terminological knowledge highly depends on the language (and not
only on the specialized language). For example, in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the authors explore terminology related to
the semantic field of terms that indicate or suggest the presence of implants in electronic medical
records written in Swedish, using techniques that are highly optimized (justifiably so)
for that language, which makes that software only partially reusable for other languages.
      </p>
      <p>In this work, we focus on digital pathology, a specialized field that studies digital
histopathology images to diagnose cancer cases and related diseases. In particular, we propose a Natural
Language Processing (NLP) pipeline that extracts terminological information from
pathology reports and links this information to medical ontologies. We evaluate the effectiveness of
this approach on entity linking and text classification tasks, considering different use cases
concerning different types of cancer. Moreover, an unsupervised multilingual hybrid knowledge
extraction system that combines rule-based techniques with pre-trained deep neural models to
extract knowledge from pathology reports will be assessed against the manual extraction of
terms from the same medical reports.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposal</title>
      <p>This work has been carried out under the umbrella of the ExaMode project,<sup>2</sup> an interdisciplinary
Horizon 2020 project whose goal is to design efficient methodologies to manage the vast
amount of heterogeneous medical data produced every day in different forms, and to ensure
easier, faster discovery and consultation with little human supervision. The ultimate goal of
the ExaMode project is to have these resources adopted not only by specialists, but also by
non-experts, so that the latter can achieve a better understanding of medical information.</p>
      <p>The dataset for this study is a selection of real medical reports produced at the Hospital of
Catania, one of the project partners. These reports cover four pathologies, namely
three types of cancer (cervix, colon and lung) and celiac disease, for a total of 200 medical reports.</p>
      <p>
        The proposed method is based on an NLP pipeline that adopts a combination of pre-trained
Named Entity Recognition (NER) models [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and unsupervised Entity Linking (EL) methods [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
to extract key terms from the medical reports and to link them to the reference ontology. NER
is the task of identifying and categorizing key information, or “entities”, within a text. An
entity can be any single- or multi-word term that consistently refers to the same concept. Each entity is
then classified, i.e. “linked”, into a predefined category, such as <italic>disease</italic> or <italic>protein</italic>.<sup>3</sup> Entity
Linking is the task of assigning unique meanings to entities mentioned in a text. Our
approach proposes a combination of ad-hoc and similarity matching techniques to connect the
extracted entities to unique concepts. Finally, the approach uses a set of rules to merge entities
into multi-word terms. For example, the terms “colon” and “transverse” may be considered as
separate entities in a text, while “transverse colon” is the correct entity to be linked to the
ontology.
      </p>
      <p><sup>1</sup> https://www.trec-cds.org</p>
      <p><sup>2</sup> https://www.examode.eu</p>
      <p><sup>3</sup> We use italics to indicate categories.</p>
      <p>This work was partially supported by the ExaMode Project, as a part of the European Union
Horizon 2020 Program under Grant 825292.
      </p>
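      <p>As an illustration only, the merging rule and the combination of ad-hoc and similarity matching described in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the ExaMode implementation: the toy ontology and its concept identifiers, the function names <monospace>merge_adjacent</monospace> and <monospace>link</monospace>, the choice of <monospace>difflib.SequenceMatcher</monospace> as the similarity measure, and the 0.8 threshold are all hypothetical choices made for the example.</p>
      <preformat preformat-type="code">
```python
from difflib import SequenceMatcher

# Hypothetical toy ontology: label -> concept identifier (identifiers are made up).
ONTOLOGY = {
    "colon": "EX:0001",
    "transverse colon": "EX:0002",
    "adenocarcinoma": "EX:0003",
}


def merge_adjacent(entities):
    """Merge neighbouring mentions when their combination is an ontology label.

    `entities` is a list of (start_token, end_token, text) tuples sorted by
    position, with `end_token` exclusive.
    """
    merged = []
    for start, end, text in entities:
        if merged and merged[-1][1] == start:
            prev_start, _, prev_text = merged[-1]
            # Try both word orders, since modifier position varies across languages.
            for candidate in (f"{prev_text} {text}", f"{text} {prev_text}"):
                if candidate in ONTOLOGY:
                    merged[-1] = (prev_start, end, candidate)
                    break
            else:
                merged.append((start, end, text))
        else:
            merged.append((start, end, text))
    return merged


def link(mention, threshold=0.8):
    """Return (concept_id, label) for the best-matching label, or None."""
    mention = mention.lower()
    if mention in ONTOLOGY:  # ad-hoc step: exact label match
        return ONTOLOGY[mention], mention
    best_label, best_score = None, 0.0
    for label in ONTOLOGY:  # similarity step: closest label by character overlap
        score = SequenceMatcher(None, mention, label).ratio()
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return ONTOLOGY[best_label], best_label
    return None


# Mentions as a NER model might emit them (the last one is deliberately misspelled).
entities = [(0, 1, "transverse"), (1, 2, "colon"), (3, 4, "adenocarcinom")]
merged = merge_adjacent(entities)
links = [link(text) for _, _, text in merged]
```
      </preformat>
      <p>In this sketch, the adjacent mentions “transverse” and “colon” are merged into “transverse colon” before linking, so the more specific concept is selected instead of the generic “colon”. The misspelled mention is still linked through the similarity step, while a mention that resembles no label is left unlinked.</p>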
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pareek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seyyedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Lungren</surname>
          </string-name>
          ,
          <article-title>Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>3</volume>
          (
          <year>2020</year>
          )
          <elocation-id>136</elocation-id>
          . URL: https://doi.org/10.1038/s41746-020-00341-z. doi:10.1038/s41746-020-00341-z.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Dobrakowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mykowiecka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marciniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jaworski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Biecek</surname>
          </string-name>
          ,
          <article-title>Interpretable segmentation of medical free-text records based on word embeddings</article-title>
          ,
          <source>Journal of Intelligent Information Systems</source>
          <volume>57</volume>
          (
          <year>2021</year>
          )
          <fpage>447</fpage>
          -
          <lpage>465</lpage>
          . URL: https://doi.org/10.1007/s10844-021-00659-4. doi:10.1007/s10844-021-00659-4.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gleason</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Dahm</surname>
          </string-name>
          ,
          <article-title>How patients describe their diagnosis compared to clinical documentation</article-title>
          ,
          <source>Diagnosis</source>
          <volume>9</volume>
          (
          <year>2022</year>
          )
          <fpage>250</fpage>
          -
          <lpage>254</lpage>
          . URL: https://doi.org/10.1515/dx-2021-0070. doi:10.1515/dx-2021-0070.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <article-title>Clinical decision support systems: theory and practice</article-title>
          , 3rd ed., Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Agosti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <article-title>Medical retrieval using structured information extracted from knowledge bases</article-title>
          ,
          in:
          <source>SEBD</source>
          , volume
          <volume>2400</volume>
          of
          <series>CEUR Workshop Proceedings</series>
          , CEUR-WS.org
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Jerdhaf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Santini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karlsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jönsson</surname>
          </string-name>
          ,
          <article-title>Focused terminology extraction for CPSs: the case of "implant terms" in electronic medical records</article-title>
          ,
          <source>in: 2021 IEEE International Conference on Communications Workshops (ICC Workshops)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:10.1109/ICCWorkshops50388.2021.9473700.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Recent named entity recognition and classification techniques: A systematic review</article-title>
          ,
          <source>Computer Science Review</source>
          <volume>29</volume>
          (
          <year>2018</year>
          )
          <fpage>21</fpage>
          -
          <lpage>43</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1574013717302782. doi:10.1016/j.cosrev.2018.06.001.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Unsupervised approaches for textual semantic annotation, a survey</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          <volume>52</volume>
          (
          <year>2019</year>
          ). URL: https://doi.org/10.1145/3324473. doi:10.1145/3324473.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>