<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.jad.2023.04.003</article-id>
      <title-group>
        <article-title>Reusability of Biomedical Annotations for Gut-Brain Interplay Information Extraction as Terminological Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vanessa Bonato</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Linguistic and Literary Studies, University of Padova</institution>
          ,
          <addr-line>Via Elisabetta Vendramini 13 35137 Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>191</volume>
      <issue>112444</issue>
      <fpage>21</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>In the framework of the CLEF 2025 conference, the GutBrainIE @ CLEF 2025 challenge related to the European-funded project HEREDITARY (HetERogeneous sEmantic Data integration for the guT-bRain interplaY) has been proposed. This Natural Language Processing challenge involves performing Named Entity Recognition and Relation Extraction for the purpose of Information Extraction on a corpus of PubMed abstracts concerning the gut-brain interplay. In this paper, we explore the possibility of reusing entity mentions and relations identified during the gold-standard training dataset annotation process in the form of terminological data in a medical terminology resource.</p>
      </abstract>
      <kwd-group>
        <kwd>medical terminology</kwd>
        <kwd>information extraction</kwd>
        <kwd>biomedical annotation</kwd>
        <kwd>gut-brain interplay</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>their labelling by choosing from a defined list of categories, and 2) the identification of the relations
between entity mentions [14]. The two tasks were likewise performed by expert annotators to
create the gold-standard training dataset, which will ultimately be used to train IE systems.</p>
      <p>The task of Named Entity Recognition carried out on abstracts by expert annotators differs from
the process of term extraction, that is, “terminology work that involves the identification and
excerption of terminological data by searching through a text corpus” [15]. In some
circumstances, the extracted entity mention cannot be considered a term, defined in terminology
science as a “designation that represents a general concept by linguistic means” [15]. For example,
the entity mention “oral and gut microbiota” does not represent a term, as it contains two distinct terms
designating two different concepts: ‘oral microbiota’ and ‘gut microbiota’.
Nevertheless, in many other cases, the identified entity mention does represent a term in the medical
terminological domain. This applies to entity mentions such as “major depressive disorder” and
“Autism Spectrum Disorder”.</p>
      <p>In this paper, we assess the extent to which entity mentions and entity relations identified
during the annotation process aimed at Information Extraction can be reused in a medical
terminology resource, in the form of terminological data concerning the gut-brain axis and gut
microbiota-related health conditions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset and Dataset Annotation Description</title>
      <p>In this section, we present an overview of the dataset used and provide information about the
team of expert annotators. Subsequently, we describe the manual annotation process performed by
expert annotators on the set of PubMed abstracts to create the gold-standard training dataset.</p>
      <sec id="sec-2-1">
        <title>2.1. Dataset</title>
        <p>The abstracts composing the corpus were extracted from papers systematically selected from the
PubMed Electronic Database<sup>5</sup>, which is the largest database of biomedical publications. The
selection was performed by running two separate queries using the following keywords: 1) “mental
health” AND “gut microbiota”, and 2) “Parkinson” AND “gut microbiota”. Following the exclusion
of duplicated documents, the corpus comprises a total of 1663 documents.</p>
        <p>The annotation process was carried out by 7 annotators on a total of 403 PubMed abstracts.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Annotators</title>
        <p>The team of annotators is composed of both terminology experts and computer science experts.
The group is therefore heterogeneous, specifically comprising three annotators with expertise in
terminology and four specialized in computer science.</p>
        <p>The terminology work conducted by terminology experts served as the starting point for
creating the annotation schema. Indeed, terminology experts are trained to identify terms within
textual documents, infer the corresponding general concepts, and detect concept relationships. In
particular, the described terminology work was conducted manually, with a view to creating a
highly curated gold-standard annotated dataset. This approach made it possible, for instance, to
extract exclusively terms pertaining to the medical domain from the abstracts used to create the annotation
schema, and to exclude candidate terms from the selection. Terminology experts were then trained
by computer science experts to acquire knowledge in both Named Entity Recognition and Relation
Extraction.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Annotation Schema</title>
        <p>Following the creation of the corpus, an annotation schema was established by the annotators. The
annotation schema defines the set of entity labels that annotators are required to associate with entity
mentions identified in the abstracts. The schema also specifies the list of entity relations that link
entity mentions, along with the corresponding relation labels.<fn id="fn5"><label>5</label><p>https://pubmed.ncbi.nlm.nih.gov/</p></fn></p>
        <p>Concerning entity labels, the schema consists of 14 different categories under which the entity
mentions of interest can be classified. Due to space limitations, we will focus on 3 of the 14 labels
outlined in the GutBrainIE@CLEF25 Annotation Guidelines<sup>6</sup>. Specifically, the labels “Disease,
Disorder, or Finding”, “Microbiome” and “Chemical” are the most relevant to the objectives of
the present research.</p>
        <p>With reference to entity relations, 22 distinct types can be annotated. For each entity relation, a
specific predicate is assigned, considered as a relation label that defines the type of semantic
connection between two labeled entity mentions. For example, the entity relation that is
established between an entity mention labeled as “Microbiome” and an entity mention labeled as
“Disease, Disorder, or Finding” is expressed by using the predicate “is linked to”. In this relation,
“Microbiome” is the head entity, while “Disease, Disorder, or Finding” is the tail entity. Another
example of entity relation is the relation established between the head entity “Chemical” and the
tail entity “Disease, Disorder, or Finding”, whose predicate is “influence”.</p>
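        <p>As an illustration only, the head entity, predicate and tail entity of a relation type can be pictured as a triple. The following minimal sketch encodes just the two relation types quoted above; the names RelationType, SCHEMA and allowed_predicates are hypothetical and are not taken from the challenge guidelines or tooling.</p>
        <preformat>
```python
from collections import namedtuple

# Hypothetical container for one schema entry: (head label, predicate, tail label).
RelationType = namedtuple("RelationType", ["head", "predicate", "tail"])

# Only the two relation types quoted in the text; the full schema defines 22.
SCHEMA = [
    RelationType("Microbiome", "is linked to", "Disease, Disorder, or Finding"),
    RelationType("Chemical", "influence", "Disease, Disorder, or Finding"),
]


def allowed_predicates(head_label, tail_label):
    """Return the predicates this sketch knows for a head/tail label pair."""
    return [
        r.predicate
        for r in SCHEMA
        if r.head == head_label and r.tail == tail_label
    ]


print(allowed_predicates("Microbiome", "Disease, Disorder, or Finding"))
```
</preformat>
        <p>Because the relation is directed, swapping the head and tail labels yields no predicate in this sketch.</p>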
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Annotation Process</title>
        <p>The annotation process involved the sequential performance of two different tasks for each
assigned abstract: Named Entity Recognition (NER) and Relation Extraction (RE). Named Entity
Recognition consisted of identifying text spans considered as entity mentions, with the goal of
assigning a specific predefined label to each mention. Relation
Extraction, in turn, concerned the identification of existing relations between pairs of labeled entity
mentions explicitly present or inferred within each abstract. The list of defined entity labels and
entity relations was provided to annotators through guidelines developed for the challenge.</p>
        <p>The annotation workflow consisted of two distinct phases. In the first phase, expert
annotators manually annotated a total of 148 abstracts, without pre-annotations for entity
mentions and entity labels. The work carried out in this phase led to the identification of 4860
entity mentions and 2360 entity relations. In the second phase, an additional 255 abstracts were
annotated. On this occasion, however, pre-annotations for entity mentions and entity labels
produced by unsupervised algorithms were provided. In the annotated abstracts, 6317 entity
mentions and 3045 entity relations were detected.</p>
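        <p>The figures reported for the two phases can be cross-checked with a few lines of arithmetic. The sketch below only sums the counts given in the text; the dictionary layout and the function name totals are illustrative choices.</p>
        <preformat>
```python
# Counts reported in the text for the two annotation phases.
PHASES = {
    "phase 1 (no pre-annotations)": {"abstracts": 148, "mentions": 4860, "relations": 2360},
    "phase 2 (with pre-annotations)": {"abstracts": 255, "mentions": 6317, "relations": 3045},
}


def totals(phases):
    """Sum each reported count over all phases."""
    result = {}
    for counts in phases.values():
        for key, value in counts.items():
            result[key] = result.get(key, 0) + value
    return result


TOTALS = totals(PHASES)
print(TOTALS)  # {'abstracts': 403, 'mentions': 11177, 'relations': 5405}
```
</preformat>
        <p>The total of 403 abstracts agrees with the number of annotated abstracts stated in Section 2.1.</p>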
      </sec>
      <sec id="sec-2-5">
        <title>2.4.1. Mentions</title>
        <p>For the selection of text spans constituting entity mentions, specific annotation rules were
established. In particular, the following instruction was provided to annotators: “[a]nnotate
composite entities as a single entity if they belong to the same category. However, if entities belong
to the same category but appear as a sequence, annotate them separately”.<fn id="fn6"><label>6</label><p>For a detailed description of entity labels, relation labels and the annotation rules for entity mentions and entity
relations, see https://hereditary.dei.unipd.it/challenges/gutbrainie/2025/#</p></fn></p>
        <p>This implies that, for instance, “Parkinson’s and Alzheimer’s diseases” is considered a single
entity mention, because the two composite entities belong to the same category. The
same reasoning applies to “Oral and gut dysbiosis”, “mineralocorticoid and N-methyl-D-aspartate
receptors” and “oral and gut microbiome”.</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.4.2. Relations</title>
        <p>Concerning the relations established between pairs of labeled entities, a specific instruction was
provided in the guidelines. In some cases, a given predicate may not match the type of
semantic connection between entity mentions that can be inferred from the analyzed text. In these
circumstances, the predicate “associated with” can be used to signal the existence of a different
type of relation between entity mentions. A relation denoted by this predicate has also been used to
link entity mentions for which no relation has been established in the guidelines.</p>
        <p>For example, provided that an association between two entity mentions labeled “Disease,
Disorder, or Finding” explicitly or implicitly emerges from the specific abstract, a relation labeled
“associated with” can be annotated. An example of this type of relation can be found in Figure 1,
where the predicate “associated with” is used to specify the link that exists between
the entity mentions “Oral and gut dysbiosis” and “Parkinson’s disease”.</p>
      </sec>
      <sec id="sec-2-7">
        <title>2.5. MetaTron</title>
        <p>The annotation process was carried out by using the annotation tool MetaTron [16], specifically
developed to support biomedical corpora annotation.</p>
        <p>The tool enabled annotators to sequentially perform the tasks of Named Entity Recognition and
Relation Extraction for each assigned abstract.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Can NER and RE Annotations be reused as Terminological Data?</title>
      <p>The annotation process aimed at Information Extraction made it possible to identify entity mentions and
relations in abstracts related to the gut-brain interplay and gut microbiota-related health states. As
previously mentioned, however, the task of Named Entity Recognition fundamentally differs from
the process of term extraction. Term extraction is a core step in terminology work,
concerned with the extraction of terminological data from document collections [17, 18].</p>
      <p>In particular, within the framework of the presented challenge, the need to homogenize the
manually performed NER annotations led to the establishment of internal annotation rules to be
followed by all annotators. These rules, essential for creating a ground truth for training IE
systems, are not necessarily aligned with the terminological approach that is used to extract terms
from texts. As a result, a fundamental distinction separates entity mentions from terms.
In the GutBrainIE dataset, a text span is regarded as a single entity mention when
the textual sequence represents composite entities that share the same entity label. By contrast, in
the terminological domain, the term is the linguistic designation of a concept, that is, a “unit of
knowledge created by a unique combination of characteristics” [15].</p>
      <p>For instance, entity mentions such as “Oral and gut dysbiosis”, “mineralocorticoid and
N-methyl-D-aspartate receptors” and “oral and gut microbiome”, respectively labeled as “Disease,
Disorder, or Finding”, “Chemical” and “Microbiome”, cannot be considered terms. Within the three
selected text spans, indeed, six different terms designating six different concepts can be identified:
1) oral dysbiosis, 2) gut dysbiosis, 3) mineralocorticoid receptor, 4) N-methyl-D-aspartate receptor,
5) oral microbiome, and 6) gut microbiome.</p>
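      <p>The decomposition of a coordinated entity mention into candidate terms can be approximated with a simple heuristic. The sketch below is not the procedure used by the annotators: it assumes a single “and” coordination whose shared head is the last word of the mention, it does not normalize case or number (for example “receptors” versus “receptor”), and its output would still require validation by a terminologist. The function name split_coordinated is hypothetical.</p>
      <preformat>
```python
def split_coordinated(mention):
    """Split 'A and B head' into ['A head', 'B head'] (naive heuristic).

    Assumes one ' and ' coordination and a one-word shared head at the end
    of the mention. Mentions without ' and ' are returned unchanged.
    """
    left, separator, right = mention.partition(" and ")
    if not separator:
        return [mention]
    right_tokens = right.split()
    if len(right_tokens) == 1:
        # No shared head to distribute, e.g. 'diabetes and obesity'.
        return [left, right]
    head = right_tokens[-1]
    return [left + " " + head, right]


for example in [
    "Oral and gut dysbiosis",
    "mineralocorticoid and N-methyl-D-aspartate receptors",
    "oral and gut microbiome",
    "major depressive disorder",
]:
    print(example, "=", split_coordinated(example))
```
</preformat>
      <p>Mentions that already correspond to a single term, such as “major depressive disorder”, pass through unchanged.</p>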
      <p>As shown in Figure 2, “Parkinson’s and Alzheimer’s diseases” is also a single entity mention
whose composite entities share the entity label “Disease, Disorder, or Finding”. In terminology, the
text span would not correspond to a term. As a matter of fact, two distinct terms designating two
distinct concepts can be identified: ‘Parkinson’s disease’ and ‘Alzheimer’s disease’.</p>
      <p>As can be observed in Figure 2, however, other entity mentions labeled as “Disease, Disorder, or
Finding” are identified: “multiple sclerosis”, “irritable bowel syndrome”, “IBS”, “colorectal cancer”,
“diabetes”, “obesity” and “metabolic syndrome”. These mentions would be considered medical
terms, as each linguistically designates a medical concept. In addition, these terms could be part of
lexical networks, where relationships between terms are outlined.</p>
      <p>As regards entity relations, data emerging from the GutBrainIE gold-standard dataset
could also be partially reused in a medical terminology resource in the form of terminological data.
Moreover, they can be used to define concept relationships in conceptual systems. For example, the
relation established in the guidelines between the head entity “Bacteria” and the tail entity
“Microbiome”, whose predicate is “part of”, matches the part-whole relation, used in terminology
as a “concept relation between a comprehensive concept and a partitive concept” [15]. Following
this line of reasoning, the concept &lt;microbiome&gt; is the comprehensive concept, that is, a “concept
in a partitive relation that is viewed as a whole consisting of various parts” [15]. The
concept &lt;bacteria&gt;, in turn, is the corresponding partitive concept, that is, a “concept in a partitive
relation that is viewed as a part of a whole” [15].</p>
      <p>On the other hand, it can be observed that generic relations, also defined as “is-a relations”, are
not considered in the guidelines. In the terminological domain, a generic relation is a “concept
relation between a generic concept and a specific concept where the intension of the specific
concept includes the intension of the generic concept plus at least one additional delimiting
characteristic” [15].</p>
      <p>Another observation concerns the predicate “associated with”, used to link entity mentions
when predefined relation labels do not accurately express the relation established in a specific
abstract. This predicate only indicates that a link exists between two entity mentions,
without specifying the particular kind of entity relation that is established. For
example, in Figure 1, the predicate “associated with” marks the relation between “Oral and
gut dysbiosis” and “Parkinson’s disease”. In terminology work, associative relationships are used to
link concepts that are not involved in generic relations or part-whole relations. However, in
conceptual systems, it would be necessary to precisely indicate the kind of associative relationship
established between concepts. In this sense, an additional fine-grained level of semantic
analysis should be considered in entity relation labeling, with a view to precisely
systematizing conceptual knowledge.</p>
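      <p>The correspondences discussed in this section can be summarized as a small lookup. The sketch below covers only the two predicates that the text relates to terminological concept relations; the category names on the right-hand side and the function names are hypothetical, and predicates that the text does not classify are deliberately left unmapped.</p>
      <preformat>
```python
# Mapping limited to what the text states; right-hand labels are illustrative.
PREDICATE_TO_CONCEPT_RELATION = {
    "part of": "partitive",
    "associated with": "associative (unspecified)",
}


def concept_relation(predicate):
    """Return the concept relation type for a predicate, or None if unclassified here."""
    return PREDICATE_TO_CONCEPT_RELATION.get(predicate)


def needs_review(predicate):
    """Flag predicates that are unmapped or too coarse for a concept system."""
    return concept_relation(predicate) in (None, "associative (unspecified)")


for predicate in ["part of", "associated with", "influence"]:
    print(predicate, "=", concept_relation(predicate), "| review:", needs_review(predicate))
```
</preformat>
      <p>In this sketch, “associated with” is flagged for review because the kind of associative relationship would still have to be specified, while generic (is-a) relations do not appear at all, since the guidelines do not define them.</p>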
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>In this paper, we investigated the possibility of reusing data stemming from the manual annotation
of the GutBrainIE gold-standard training dataset for Information Extraction in the form of
terminological data.</p>
      <p>Our analysis highlighted that entity mentions and relations can be partially reused as
terminological data related to the gut-brain interplay in a medical terminology resource, as well as
in domain-specific lexical networks and conceptual systems. In particular, a selection should be
performed to identify entity mentions that are considered terms in the medical terminological
domain. Moreover, for terminological conceptual analysis, it would be essential to integrate
information on generic relations and to further specify the predicates that denote associative
relations between entity mentions.</p>
      <p>As future work, we aim to compare the gold-standard annotated dataset with the output
generated by automatic term extractors, in terms of both precision and recall. Furthermore, we aim
to provide further information about the terminology work that served as the foundation for the
creation of the annotation schema. Finally, we will analyze additional entity mentions and entity
relations included in the annotated dataset to further investigate how NER and RE annotations can
be reused as terminological data in a medical terminology resource.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is partially supported by the HEREDITARY Project, as part of the European Union’s
Horizon Europe research and innovation programme under grant agreement No GA 101137074.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: Grammar and
spelling check, Paraphrase and reword. After using this tool/service, the author(s) reviewed and
edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiticariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Danilevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishnamurthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Reiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vaithyanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , Web Information Extraction, in: L. Liu, M.T. Özsu (Eds.),
          <source>Encyclopedia of Database Systems</source>
          , Springer, New York, NY,
          <year>2018</year>
          , pp.
          <fpage>4620</fpage>
          -
          <lpage>4629</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-1-4614-8265-9_459</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nasar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Jaffry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <article-title>Named Entity Recognition and Relation Extraction: State-of-the-Art</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          <volume>54</volume>
          (
          <issue>1</issue>
          ) (
          <year>2021</year>
          )
          <elocation-id>20</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3445965</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jehangir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Radhakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <article-title>A survey on Named Entity Recognition - datasets, tools, and methodologies</article-title>
          ,
          <source>Natural Language Processing Journal</source>
          <volume>3</volume>
          (
          <year>2023</year>
          )
          <elocation-id>100017</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.nlp.2023.100017</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>A Comprehensive Survey on Relation Extraction: Recent Advances and New Frontiers</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          ,
          <volume>56</volume>
          (
          <issue>11</issue>
          ) (
          <year>2024</year>
          )
          <elocation-id>293</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3674501</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Gulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wysiadecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Strzelecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gawlik-Kotelnicka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polguj</surname>
          </string-name>
          ,
          <article-title>Can microbiology affect psychiatry? A link between gut microbiota and psychiatric disorders</article-title>
          ,
          <source>Psychiatria Polska</source>
          ,
          <volume>52</volume>
          (
          <issue>6</issue>
          ) (
          <year>2018</year>
          )
          <fpage>1023</fpage>
          -
          <lpage>1039</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.12740/PP/OnlineFirst/81103</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dziedzic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Maciak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bliźniewska-Kowalska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gałecka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Kobierecka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Saluk</surname>
          </string-name>
          ,
          <article-title>The Power of Psychobiotics in Depression: A Modern Approach through the Microbiota-Gut-Brain Axis: A Literature Review</article-title>
          ,
          <source>Nutrients</source>
          ,
          <volume>16</volume>
          (
          <issue>7</issue>
          ) (
          <year>2024</year>
          )
          <elocation-id>1054</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.3390/nu16071054</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>