<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Reproducibility and Generalization of a Relation Extraction System for Gene-Disease Associations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Menotti</string-name>
          <email>laura.menotti@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padua</institution>
          ,
          <addr-line>Padua</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>11</fpage>
      <lpage>13</lpage>
      <abstract>
        <p>Workshop Proceedings Best Master Thesis Award (Extended Abstract).</p>
        <p>Understanding the interactions between genes and diseases is a valuable resource for improving patient care, as it can provide the foundation for curative therapies, beneficial treatments, and preventative measures. This type of data is available in databases such as DisGeNET [1] and BioXpress [2] in the form of Gene-Disease Associations (GDAs), which describe relationships between gene expression and specific diseases such as cancer. Biomedical literature is a rich source of information about GDAs, which are usually extracted manually from text. Human annotation is expensive and cannot scale to the huge amount of data available in the scientific literature (e.g., biomedical abstracts). Therefore, the development of automated tools to identify GDAs is gaining traction in the community [3]. Such systems employ Relation Extraction (RE) techniques to extract information on gene/microRNA expression in diseases from text.</p>
        <p>Once an automated text-mining tool has been developed, it can be tested on human-annotated data or compared to state-of-the-art systems. Indeed, it is crucial for researchers to compare newly developed systems with the state of the art to assess whether they made a breakthrough. However, previous works may not be immediately reproducible, for example due to the lack of source code. At the time of writing, the state of the art is DEXTER, a rule-based system that extracts gene/microRNA expression in diseases from biomedical abstracts [4]. DEXTER was published in Database: The Journal of Biological Databases and Curation in 2018 and has attracted thirteen citations so far. DEXTER takes as input biomedical abstracts from PubMed and extracts relevant information such as the correlation between genes or miRNAs and diseases. The system also classifies sentences as TypeA or TypeB based on the number of entities found; this classification is useful for integrating data into existing resources like BioXpress. In particular, TypeA sentences are comparative phrases where gene expression is contrasted between two different samples or conditions, while in TypeB sentences there is no explicit comparison. Unfortunately, DEXTER's source code is not publicly available; hence, researchers can rely only on data provided by the authors to evaluate newly developed systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>CEUR Workshop Proceedings (ceur-ws.org). https://www.dei.unipd.it/~menottilau/ (L. Menotti)</p>
      <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>application to foster reusability. In this way, our implementation of DEXTER can be easily
run on different datasets without extensive knowledge of the system’s internal architecture.
We made some changes in each component to enable a seamless integration of the different
modules. While the original system is developed in both Python and Java and mainly employs
the Stanford CoreNLP toolkit, our system is developed entirely in Python, exploiting the SpaCy
library for linguistic annotations and dependency trees. During implementation, we faced some
reproducibility issues related to the use of the SpaCy library and the PubTator annotations.
SpaCy employs a different dependency parser from the one in the Stanford CoreNLP toolkit used
by the original system. This produces different dependency trees for some sentences, so we
added some patterns to match more sentences. In addition, we had to translate the original
rules into a different format to comply with the new library, which resulted in more
rules than in the original system. Finally, we recall that to extract disease and gene
mentions we matched pre-computed annotations. This approach can cause problems
related to text normalization, special characters, and acronyms.</p>
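      <p>The string-matching issue mentioned above can be illustrated with a minimal sketch. It assumes that pre-computed mentions arrive as plain strings and that Unicode normalization, dash unification, and case folding are enough to align them with the sentence text. The function names and the example sentence are hypothetical and are not taken from the released implementation; acronym expansion is not handled.</p>

```python
import re
import unicodedata

# All names below are hypothetical; they are not taken from the DEXTER code base.
# Map common Unicode hyphen and dash variants to the ASCII hyphen.
_DASHES = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2212"), "-")


def normalize(text: str) -> str:
    """Unify Unicode forms and dash variants, collapse whitespace, fold case."""
    text = unicodedata.normalize("NFKC", text).translate(_DASHES)
    return re.sub(r"\s+", " ", text).strip().casefold()


def find_mentions(sentence: str, mentions: list[str]) -> dict[str, bool]:
    """Report, for each pre-computed mention, whether it occurs in the sentence."""
    norm_sentence = normalize(sentence)
    found = {}
    for mention in mentions:
        # Require a non-word character (or a string boundary) on both sides,
        # so that "miR-2" does not match inside "miR-21".
        pattern = r"(?:^|\W)" + re.escape(normalize(mention)) + r"(?!\w)"
        found[mention] = re.search(pattern, norm_sentence) is not None
    return found


# Illustrative input: a non-breaking hyphen and a doubled space in the sentence.
sentence = "MiR\u201121 was  overexpressed in lung cancer."
print(find_mentions(sentence, ["miR-21", "Lung Cancer", "miR-2"]))
```

      <p>Normalizing both the sentence and the mention before comparison is one simple way to reduce misses caused by special characters; whether it recovers all discarded sentences in practice is not something this sketch can show.</p>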
      <p>To assess the accuracy of our system, we used as ground truth three datasets provided by
the authors in BioXpress<sup>1</sup>. In particular, we evaluated the performance of our system based on the
percentage of parsed sentences and the correctness of the results in terms of gene expression
level and sentence type. Our implementation parsed 97% of the input sentences. We performed
an error analysis, which confirmed that discarded sentences are mostly due to missing PubTator
annotations or problems related to dependency parsing. The system achieved an accuracy of
84% on the gene expression level. This value rises to 93% if input mentions are used instead
of PubTator annotations. The gap may be related to the string-matching issues described above
or to the different version of PubTator that we use compared with the original system.</p>
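      <p>The evaluation measures described above can be sketched as follows. This is an illustration under stated assumptions: the text does not specify the denominator of the accuracy figures, so the sketch computes accuracy over parsed sentences only; the label values and the function name are hypothetical.</p>

```python
from typing import Optional

# Hypothetical representation: each sentence carries a pair
# (expression_level, sentence_type); None marks a sentence the system discarded.
Label = tuple[str, str]


def evaluate(gold: list[Label], predicted: list[Optional[Label]]) -> dict[str, float]:
    """Compute the parsed-sentence rate and per-field accuracy on parsed sentences."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must be aligned one-to-one")
    parsed = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    if not parsed:
        raise ValueError("no parsed sentences to evaluate")
    level_hits = sum(g[0] == p[0] for g, p in parsed)
    type_hits = sum(g[1] == p[1] for g, p in parsed)
    return {
        "parsed_rate": len(parsed) / len(gold),
        # Assumption: accuracy is measured over parsed sentences only.
        "level_accuracy": level_hits / len(parsed),
        "type_accuracy": type_hits / len(parsed),
    }


# Toy data with made-up labels, only to show the call shape.
gold = [("high", "TypeA"), ("low", "TypeB"), ("high", "TypeB"), ("low", "TypeA")]
predicted = [("high", "TypeA"), None, ("low", "TypeB"), ("low", "TypeB")]
print(evaluate(gold, predicted))
```

      <p>Reporting the parsed rate separately from accuracy keeps parsing failures, such as missing annotations, distinct from classification errors on sentences that were parsed.</p>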
      <p>
        In conclusion, the results show that the system has been reproduced to a reasonable
degree, and the benefits of an end-to-end system written entirely in Python outweigh
the margin of error we introduced. We released our implementation of DEXTER in a GitHub
repository<sup>2</sup> so that anyone working in this field can use it to compare their system with the
state of the art. To this end, this work has been used to validate the Collaborative Oriented
Relation Extraction (CORE) system [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]<sup>3</sup>, a Knowledge Base Construction (KBC) system based
on the combination of distant supervision and active learning paradigms.
      </p>
      <p>Acknowledgments. This work was supported by the ExaMode Project, as a part of the European Union Horizon
2020 Program under grant 825292. The author wishes to thank Prof. Gianmaria Silvello and Dr.
Stefano Marchesin, whose suggestions helped improve this study.</p>
      <p><sup>1</sup> Datasets “DEXTER Glycosyltransferase Expression”, “DEXTER Expression in Lung Cancer”, “DEXTER miRNA
Expression” from Section “BioXPress Downloads” at https://hive.biochemistry.gwu.edu/bioxpress.</p>
      <p><sup>2</sup> https://github.com/mntlra/DEXTER</p>
      <p><sup>3</sup> The knowledge base derived by CORE can be accessed via https://gda.dei.unipd.it along with a demonstration video
(https://gda.dei.unipd.it/static/videos/demo.mp4).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Ramírez-Anguita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Saüch-Pitarch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ronzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Centeno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sanz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. I.</given-names>
            <surname>Furlong</surname>
          </string-name>
          ,
          <article-title>The DisGeNET knowledge platform for disease genomics: 2019 update</article-title>
          ,
          <source>Nucleic Acids Res</source>
          .
          <volume>48</volume>
          (
          <year>2020</year>
          )
          <fpage>D845</fpage>
          -
          <lpage>D855</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/nar/gkz1021</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Dingerdissen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Torcivia-Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mazumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Kahsay</surname>
          </string-name>
          ,
          <article-title>BioMuta and BioXpress: mutation and expression knowledgebases for cancer biomarker discovery</article-title>
          ,
          <source>Nucleic Acids Res</source>
          .
          <volume>46</volume>
          (
          <year>2018</year>
          )
          <fpage>D1128</fpage>
          -
          <lpage>D1136</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1093/nar/gkx907</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <article-title>TBGA: a large-scale gene-disease association dataset for biomedical relation extraction</article-title>
          ,
          <source>BMC Bioinform</source>
          .
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <elocation-id>111</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1186/s12859-022-04646-6</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dingerdissen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mazumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vijay-Shanker</surname>
          </string-name>
          ,
          <article-title>DEXTER: Disease-Expression Relation Extraction from Text</article-title>
          ,
          <source>Database</source>
          <volume>2018</volume>
          (
          <year>2018</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1093/database/bay045</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Menotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Giachelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Alonso</surname>
          </string-name>
          ,
          <article-title>Building a Large Gene Expression-Cancer Knowledge Base with Limited Human Annotations</article-title>
          ,
          <source>Database</source>
          (
          <year>2023</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1093/database/baad061</pub-id>
          , in print.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Giachelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Alonso</surname>
          </string-name>
          ,
          <article-title>Searching for reliable facts over a medical knowledge base</article-title>
          ,
          in:
          <source>Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '23,
          <publisher-name>Association for Computing Machinery</publisher-name>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2023</year>
          , p.
          <fpage>3205</fpage>
          -
          <lpage>3209</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3539618.3591822</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>