<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Databases to Train Relation Extraction Models for Gene-Disease Associations*</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Marchesin</string-name>
          <email>stefano.marchesin@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <email>gianmaria.silvello@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Padova</institution>
          ,
          <addr-line>Via Gradenigo 6/b, 35131, Padova</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Databases are pivotal to advancing biomedical science. Nevertheless, most of them are populated and updated by human experts with a great deal of effort. Biomedical Relation Extraction (BioRE) aims to shift these expensive and time-consuming processes to machines. Among its different applications, the discovery of Gene-Disease Associations (GDAs) is one of the most pressing challenges. Despite this, few resources have been devoted to training - and evaluating - models for GDA extraction. Besides, such resources are limited in size, preventing models from scaling effectively to large amounts of data. To overcome this limitation, we have exploited the DisGeNET database to build a large-scale, semi-automatically annotated dataset for GDA extraction: TBGA. TBGA is generated from more than 700K publications and consists of over 200K instances and 100K gene-disease pairs. We have evaluated state-of-the-art models for GDA extraction on TBGA, showing that it is a challenging dataset for the task. The dataset and models are publicly available to foster the development of state-of-the-art BioRE models for</p>
      </abstract>
      <kwd-group>
        <kwd>Weak Supervision</kwd>
        <kwd>Relation Extraction</kwd>
        <kwd>Gene-Disease Association</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Curated databases, such as UniProt [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], DrugBank [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], or CTD [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], are pivotal to the
development of biomedical science. Such databases are usually populated and updated with a great deal
of effort by human experts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], thus slowing down the biological knowledge discovery process.
To overcome this limitation, the Biomedical Information Extraction (BioIE) field aims to shift
population and curation processes to machines by developing effective computational tools that
automatically extract meaningful facts from the vast unstructured scientific literature [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ].
Once extracted, machine-readable facts can be fed to downstream tasks to ease biological
knowledge discovery. Among the various tasks, the discovery of Gene-Disease Associations (GDAs)
is one of the most pressing challenges to advance precision medicine and drug discovery [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
as it helps to understand the genetic causes of diseases [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Thus, the automatic extraction
and curation of GDAs is key to advancing precision medicine research and providing knowledge to
assist disease diagnostics, drug discovery, and therapeutic decision-making.
      </p>
      <p>
        * The full paper was originally published in BMC Bioinformatics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Most datasets used to train and evaluate Relation Extraction (RE) models for GDA extraction
are hand-labeled corpora [
        <xref ref-type="bibr" rid="ref11 ref12 ref13">11, 12, 13</xref>
        ]. However, hand-labeling data is an expensive process
that requires large amounts of expert biologists' time and, therefore, all of these datasets are
limited in size. To address this limitation, distant supervision has been proposed [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Under the
distant supervision paradigm, all the sentences mentioning the same pair of entities are labeled
with the corresponding relation stored within a source database. The assumption is that if two
entities participate in a relation, at least one sentence mentioning them conveys that relation.
As a consequence, distant supervision generates a large number of false positives, since not
all sentences express the relation between the considered entities. To counter false positives,
the RE task under distant supervision can be modeled as a Multi-Instance Learning (MIL)
problem [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16, 17, 18</xref>
        ]. With MIL, the sentences containing two entities connected by a given
relation are collected into bags labeled with such relation. Grouping sentences into bags reduces
noise, as a bag of sentences is more likely to express a relation than a single sentence. Thus,
distant supervision alleviates manual annotation efforts, and MIL increases the robustness of
RE models to noise.
      </p>
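      <p>The labeling and bagging steps described above can be sketched as follows. This is a minimal illustration rather than the pipeline used to build any of the cited datasets: the function names, the toy identifiers, and the choice of "NA" for entity pairs absent from the source database are assumptions made for the example.</p>
      <preformat>
```python
from collections import defaultdict


def distant_label(sentences, knowledge_base, default="NA"):
    """Label each (head, tail, text) sentence with the relation stored
    for its entity pair in the source database (a plain dict here)."""
    return [
        (head, tail, text, knowledge_base.get((head, tail), default))
        for head, tail, text in sentences
    ]


def group_into_bags(labeled):
    """Collect sentences sharing the same entity pair and relation into one bag."""
    bags = defaultdict(list)
    for head, tail, text, relation in labeled:
        bags[(head, tail, relation)].append(text)
    return dict(bags)


# Hypothetical toy data: one known association, one unknown pair.
knowledge_base = {("GENE_A", "DISEASE_X"): "associated"}
sentences = [
    ("GENE_A", "DISEASE_X", "first sentence mentioning both entities"),
    ("GENE_A", "DISEASE_X", "second sentence mentioning both entities"),
    ("GENE_B", "DISEASE_X", "sentence about an unrelated pair"),
]
bags = group_into_bags(distant_label(sentences, knowledge_base))
```
      </preformat>
      <p>Both sentences for the known pair land in the same bag, even if only one of them actually expresses the relation; this is the noise that bag-level learning is meant to absorb.</p>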
      <p>
        Since the advent of distant supervision, several datasets for RE have been developed under
this paradigm for news and biomedical science domains [
        <xref ref-type="bibr" rid="ref14 ref6">14, 19, 6</xref>
        ]. Among biomedical ones,
the most relevant datasets are BioRel [19], a large-scale dataset for domain-general Biomedical
Relation Extraction (BioRE), and DTI [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a large-scale dataset developed to extract Drug-Target
Interactions (DTIs). In the wake of such efforts, we created TBGA: a novel large-scale,
semi-automatically annotated dataset for GDA extraction based on DisGeNET. We chose DisGeNET as
the source database since it is one of the most comprehensive databases for GDAs [20], integrating
several expert-curated resources.
      </p>
      <p>Then, we trained and tested several state-of-the-art RE models on TBGA to create a large
and realistic benchmark for GDA extraction. We built models using OpenNRE [21], an open
and extensible toolkit for Neural Relation Extraction (NRE). The choice of OpenNRE makes it
easier for future researchers to reuse the dataset and the models developed for this work. Finally, we
publicly released TBGA on Zenodo,<sup>1</sup> whereas we stored source code and scripts to train and
test RE models in a publicly available GitHub repository.<sup>2</sup></p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>TBGA is the first large-scale, semi-automatically annotated dataset for GDA extraction. The
dataset consists of three text files, corresponding to train, validation, and test sets, plus an
additional JSON file containing the mapping between relation names and IDs. Each record in the
train, validation, or test files corresponds to a single GDA extracted from a sentence, and it is
represented as a JSON object with the following attributes:</p>
      <list list-type="bullet">
        <list-item>
          <p>text: sentence from which the GDA was extracted.</p>
        </list-item>
        <list-item>
          <p>relation: relation name associated with the given GDA.</p>
        </list-item>
        <list-item>
          <p>h: JSON object representing the gene entity, composed of:</p>
          <list list-type="bullet">
            <list-item>
              <p>id: NCBI Entrez ID associated with the gene entity.</p>
            </list-item>
            <list-item>
              <p>name: NCBI official gene symbol associated with the gene entity.</p>
            </list-item>
            <list-item>
              <p>pos: list consisting of starting position and length of the gene mention within text.</p>
            </list-item>
          </list>
        </list-item>
        <list-item>
          <p>t: JSON object representing the disease entity, composed of:</p>
          <list list-type="bullet">
            <list-item>
              <p>id: UMLS Concept Unique Identifier (CUI) associated with the disease entity.</p>
            </list-item>
            <list-item>
              <p>name: UMLS preferred term associated with the disease entity.</p>
            </list-item>
            <list-item>
              <p>pos: list consisting of starting position and length of the disease mention within text.</p>
            </list-item>
          </list>
        </list-item>
      </list>
      <p>If a sentence contains multiple gene-disease pairs, the corresponding GDAs are split into separate
data records.</p>
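      <p>To make the record layout concrete, the sketch below builds one record with the attributes listed above and recovers the entity mentions from the pos fields. The sentence, the identifiers, the relation name, and the helper name are made up for illustration; the example also assumes one JSON object per line and identifiers stored as strings, neither of which is stated here.</p>
      <preformat>
```python
import json

# Hypothetical record following the attribute layout described in the text.
# All values are placeholders, not taken from the actual dataset.
line = json.dumps({
    "text": "Mutations in TP53 are frequent in breast carcinoma.",
    "relation": "example_relation",
    "h": {"id": "0000", "name": "TP53", "pos": [13, 4]},
    "t": {"id": "C0000000", "name": "Breast Carcinoma", "pos": [34, 16]},
})

record = json.loads(line)


def mention(record, role):
    """Return the surface mention of the gene ("h") or disease ("t").

    pos holds the starting character offset and the mention length.
    """
    start, length = record[role]["pos"]
    return record["text"][start:start + length]


gene_mention = mention(record, "h")
disease_mention = mention(record, "t")
```
      </preformat>
      <p>Because pos stores a start offset and a length rather than a start and an end offset, the slice must add the two values to find where the mention stops.</p>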
      <p>Overall, TBGA contains over 200,000 instances and 100,000 bags. Table 1 reports per-relation
statistics for the dataset. Notice the large number of Not Associated (NA) instances. Regarding
gene and disease statistics, the most frequent genes are tumor suppressor genes, such as TP53
and CDKN2A, and (proto-)oncogenes, like EGFR and BRAF. Among the most frequent diseases,
we have neoplasms such as breast carcinoma, lung adenocarcinoma, and prostate carcinoma.
As a consequence, the most frequent GDAs are gene-cancer associations.</p>
    </sec>
    <sec>
      <title>3. Experimental Setup</title>
      <sec>
        <title>3.1. Benchmarks</title>
        <p>We performed experiments on three different datasets: TBGA, DTI, and BioRel. We used TBGA
as a benchmark to evaluate RE models for GDA extraction under the MIL setting. On the other
hand, we used DTI and BioRel only to validate the soundness of our implementation of the
baseline models.</p>
      </sec>
      <sec id="sec-2-1">
        <title>3.2. Evaluation Measures</title>
        <p>
          We evaluated RE models using the Area Under the Precision-Recall Curve (AUPRC). AUPRC
is a popular measure to evaluate distantly-supervised RE models, which has been adopted by
OpenNRE [21] and used in several works, such as [
          <xref ref-type="bibr" rid="ref6">6, 19</xref>
          ]. For experiments on TBGA, we also
computed Precision at k items (P@k).
        </p>
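        <p>Both measures can be computed from a ranked list of scored predictions. The sketch below is a plain-Python illustration under stated assumptions: the function names are hypothetical, predictions are given as (score, is_correct) pairs, and the area is obtained with the trapezoidal rule starting from the point (recall 0, precision 1). It is not the OpenNRE evaluation code.</p>
        <preformat>
```python
def precision_at_k(scored, k):
    """Fraction of correct predictions among the k highest-scored ones."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return sum(1 for _, correct in ranked[:k] if correct) / k


def auprc(scored):
    """Area under the precision-recall curve via the trapezoidal rule."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total_positives = sum(1 for _, correct in ranked if correct)
    if total_positives == 0:
        return 0.0
    area, hits = 0.0, 0
    prev_recall, prev_precision = 0.0, 1.0  # assumed starting point
    for rank, (_, correct) in enumerate(ranked, start=1):
        hits += int(correct)
        precision = hits / rank
        recall = hits / total_positives
        # Area of the trapezoid between the previous and current curve points.
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


# Toy predictions: (confidence score, whether the predicted relation is correct).
scored = [(0.9, True), (0.8, False), (0.7, True), (0.6, False)]
p_at_2 = precision_at_k(scored, 2)
area = auprc(scored)
```
        </preformat>
        <p>P@k looks only at the top of the ranking, whereas AUPRC summarizes the whole precision-recall trade-off, which is why the two measures can disagree on which model is best.</p>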
      </sec>
      <sec id="sec-2-2">
        <title>3.3. Aggregation Strategies</title>
        <p>We adopted two different sentence aggregation strategies to use RE models under the MIL
setting: average-based (AVE) and attention-based (ATT) [22]. The average-based aggregation
assumes that all sentences within the same bag contribute equally to the bag-level representation.
In other words, the bag representation is the average of all its sentence representations. On
the other hand, the attention-based aggregation represents each bag as a weighted sum of
its sentence representations, where the attention weights are dynamically adjusted for each
sentence.</p>
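        <p>The two strategies can be contrasted with a small sketch. The code below is a simplified illustration, not the implementation used in the experiments: the function names are hypothetical, and the attention scores are assumed to be dot products between each sentence representation and a relation query vector, normalized with a softmax.</p>
        <preformat>
```python
import math


def average_bag(sentence_vectors):
    """AVE: every sentence contributes equally to the bag representation."""
    count = len(sentence_vectors)
    dim = len(sentence_vectors[0])
    return [sum(vec[i] for vec in sentence_vectors) / count for i in range(dim)]


def attention_bag(sentence_vectors, relation_query):
    """ATT: weighted sum of sentences, with softmax-normalized weights."""
    scores = [
        sum(s * q for s, q in zip(vec, relation_query))
        for vec in sentence_vectors
    ]
    max_score = max(scores)  # subtracted for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sentence_vectors[0])
    bag = [
        sum(w * vec[i] for w, vec in zip(weights, sentence_vectors))
        for i in range(dim)
    ]
    return bag, weights


# Toy bag with two 2-dimensional sentence representations.
bag_sentences = [[1.0, 0.0], [0.0, 1.0]]
ave = average_bag(bag_sentences)
att, att_weights = attention_bag(bag_sentences, [10.0, 0.0])
```
        </preformat>
        <p>With a query that matches only the first sentence, the attention weights concentrate on it, while the average keeps both sentences at equal weight. When all scores are equal, the attention-based representation reduces to the average.</p>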
      </sec>
      <sec id="sec-2-3">
        <title>3.4. Relation Extraction Models</title>
        <p>
          We considered the main state-of-the-art RE models to perform experiments: CNN [23], PCNN [24],
BiGRU [
          <xref ref-type="bibr" rid="ref6">25, 19, 6</xref>
          ], BiGRU-ATT [
          <xref ref-type="bibr" rid="ref6">26, 6</xref>
          ], and BERE [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. All models use pre-trained word
embeddings to initialize word representations. On the other hand, Position Features (PFs), Position
Indicators (PIs), and unknown words are initialized using the normal distribution, whereas
blank words are initialized with zeros.
        </p>
        <p>
          We adopted pre-trained BioWordVec [27] embeddings to perform experiments on TBGA. Two
versions of pre-trained BioWordVec embeddings are available: “Bio_embedding_intrinsic” and
“Bio_embedding_extrinsic”. We chose the “Bio_embedding_extrinsic” version as it is the most
suitable for BioRE. As for the experiments on DTI and BioRel, we adopted the pre-trained word
embeddings used in the original works [
          <xref ref-type="bibr" rid="ref6">6, 19</xref>
          ] – that is, the word embeddings from Pyysalo et
al. [28] for DTI, and the “Bio_embedding_extrinsic” version of BioWordVec for BioRel.
        </p>
        <p>For TBGA experiments, we used grid search to determine the best combination of
optimizer and learning rate. As combinations, we tested Stochastic Gradient Descent (SGD)
with learning rate among {0.1, 0.2, 0.3, 0.4, 0.5} and Adam [29] with learning rate set to 0.0001.
For all RE models, we set the rest of the hyper-parameters empirically.</p>
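        <p>The search space above is small enough to enumerate explicitly. The following sketch lists the six candidate configurations and keeps the one with the highest score. The function names and the train_and_validate callback are hypothetical, and selecting by a single validation score is an assumption, since the selection criterion is not specified here.</p>
        <preformat>
```python
def candidate_configs():
    """Enumerate the optimizer and learning rate combinations listed in the text."""
    configs = [("sgd", lr) for lr in (0.1, 0.2, 0.3, 0.4, 0.5)]
    configs.append(("adam", 0.0001))
    return configs


def grid_search(train_and_validate):
    """Return the configuration with the highest score.

    train_and_validate is a caller-supplied function (hypothetical) that
    trains a model with the given optimizer and learning rate and returns
    a validation score where higher is better.
    """
    best_config, best_score = None, float("-inf")
    for optimizer, learning_rate in candidate_configs():
        score = train_and_validate(optimizer, learning_rate)
        if score > best_score:
            best_config, best_score = (optimizer, learning_rate), score
    return best_config, best_score


def toy_score(optimizer, learning_rate):
    """Stand-in for real training: peaks at SGD with learning rate 0.3."""
    if optimizer == "adam":
        return -1.0
    return -abs(learning_rate - 0.3)


best_config, best_score = grid_search(toy_score)
```
        </preformat>
        <p>In practice the callback would wrap a full training run, so each of the six calls is expensive; the loop itself stays the same.</p>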
        <p>
          For DTI and BioRel experiments, we relied on the hyper-parameter settings reported in the
original works [
          <xref ref-type="bibr" rid="ref6">6, 19</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Experimental Results</title>
      <p>
        We report the results for two different experiments. The first experiment aims to validate the
soundness of the implementation of the considered RE models. To this end, we trained and
tested the RE models on DTI and BioRel datasets, and we compared the AUPRC scores we
obtained against those reported in the original works [
        <xref ref-type="bibr" rid="ref6">6, 19</xref>
        ]. For this experiment, we only
compared the RE models and aggregation strategies that were used in the original works. The
second experiment uses TBGA as a benchmark to evaluate RE models for GDA extraction. In
this case, we trained and tested all the considered RE models using both aggregation strategies.
For each RE model, we reported the AUPRC and P@k scores.
      </p>
      <sec id="sec-3-1">
        <title>4.1. Baselines Validation</title>
        <p>
          The results of the baselines validation are reported in Table 2. We can observe that the RE
models we use from – or implement within – OpenNRE achieve performance higher than or
comparable to that reported in the original DTI and BioRel works. The only exceptions are BiGRU
and BiGRU-ATT on DTI, where the AUPRC scores of our implementations are lower than those
reported in the original work. However, Hong et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] report the optimal hyper-parameter
settings for BERE, but not for the baselines. Thus, we attribute the negative difference between
our implementations and theirs to the lack of information about optimal hyper-parameters.
Overall, the results confirm the soundness of our implementations. Therefore, we can consider
them as competitive baseline models to use for benchmarking GDA extraction.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. GDA Benchmarking</title>
        <p>
          Table 3 reports the AUPRC and P@k scores of RE models on TBGA. Given the RE models'
performance, we make the following observations. First, the AUPRC scores achieved by
RE models on TBGA indicate the high complexity of the GDA extraction task. The task complexity
is further supported by the lower performance obtained by top-performing RE models on
TBGA compared to DTI and BioRel (cf. Table 2). Secondly, the CNN, PCNN, BiGRU, and
BiGRU-ATT RE models behave similarly. Among them, BiGRU-ATT has the worst performance. This
suggests that replacing the BiGRU max pooling layer with an attention layer proves less effective.
Overall, the best AUPRC and P@k scores are achieved by BERE when using the
attention-based aggregation strategy. This highlights BERE's effectiveness in fully exploiting sentence
information from both semantic and syntactic aspects [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Thirdly, in terms of AUPRC, the
attention-based aggregation proves less effective than the average-based one. On the other hand,
attention-based aggregation provides mixed results on P@k measures. Although in contrast
with the results obtained in general-domain RE [22], this trend is in line with the results found
by Xing et al. [19] on BioRel, where RE models using an average-based aggregation strategy
achieve performance comparable to or higher than those using an attention-based one. The
only exception is BERE, whose performance using the attention-based aggregation outperforms
the one using the average-based strategy. Thus, the obtained results suggest that TBGA is a
challenging dataset for GDA extraction.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions</title>
      <p>We have created TBGA, a large-scale, semi-automatically annotated dataset for GDA extraction.
Automatic GDA extraction is one of the most relevant tasks of BioRE. We have used TBGA as a
benchmark to evaluate state-of-the-art BioRE models on GDA extraction. The results suggest
that TBGA is a challenging dataset for this task and, in general, for BioRE.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The work was supported by the EU H2020 ExaMode project, under Grant Agreement no. 825292.</p>
      <p>[17] R. Hoffmann, C. Zhang, X. Ling, L. S. Zettlemoyer, D. S. Weld, Knowledge-Based Weak Supervision for Information Extraction of Overlapping Relations, in: The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, ACL, 2011, pp. 541–550.</p>
      <p>[18] M. Surdeanu, J. Tibshirani, R. Nallapati, C. D. Manning, Multi-instance Multi-label Learning for Relation Extraction, in: Proc. of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012, July 12-14, 2012, Jeju Island, Korea, ACL, 2012, pp. 455–465.</p>
      <p>[19] R. Xing, J. Luo, T. Song, BioRel: towards large-scale biomedical relation extraction, BMC Bioinform. 21-S (2020) 543.</p>
      <p>[20] Z. Tanoli, U. Seemab, A. Scherer, K. Wennerberg, J. Tang, M. Vähä-Koskela, Exploration of databases and methods supporting drug repurposing: a comprehensive survey, Briefings Bioinform. 22 (2021) 1656–1678.</p>
      <p>[21] X. Han, T. Gao, Y. Yao, D. Ye, Z. Liu, M. Sun, OpenNRE: An Open and Extensible Toolkit for Neural Relation Extraction, in: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, ACL, 2019, pp. 169–174.</p>
      <p>[22] Y. Lin, S. Shen, Z. Liu, H. Luan, M. Sun, Neural Relation Extraction with Selective Attention over Instances, in: Proc. of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, ACL, 2016, pp. 2124–2133.</p>
      <p>[23] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation Classification via Convolutional Deep Neural Network, in: Proc. of COLING 2014, 25th International Conference on Computational Linguistics, Technical Papers, August 23-29, 2014, Dublin, Ireland, ACL, 2014, pp. 2335–2344.</p>
      <p>[24] D. Zeng, K. Liu, Y. Chen, J. Zhao, Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks, in: Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, ACL, 2015, pp. 1753–1762.</p>
      <p>[25] D. Zhang, D. Wang, Relation Classification via Recurrent Neural Network, CoRR abs/1508.01006 (2015).</p>
      <p>[26] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, B. Xu, Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, in: Proc. of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers, ACL, 2016, pp. 207–212.</p>
      <p>[27] Y. Zhang, Q. Chen, Z. Yang, H. Lin, Z. Lu, BioWordVec, improving biomedical word embeddings with subword information and MeSH, Sci Data 6 (2019) 1–9.</p>
      <p>[28] S. Pyysalo, F. Ginter, H. Moen, T. Salakoski, S. Ananiadou, Distributional Semantics Resources for Biomedical Text Processing, Proc. of LBM (2013) 39–44.</p>
      <p>[29] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, in: Proc. of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, 2015, pp. 1–15.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <article-title>TBGA: a large-scale gene-disease association dataset for biomedical relation extraction</article-title>
          ,
          <source>BMC Bioinform</source>
          .
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <fpage>111</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bairoch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Apweiler</surname>
          </string-name>
          ,
          <article-title>The SWISS-PROT protein sequence data bank and its supplement TrEMBL</article-title>
          ,
          <source>Nucleic Acids Res</source>
          .
          <volume>25</volume>
          (
          <year>1997</year>
          )
          <fpage>31</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Wishart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Knox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shrivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hassanali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stothard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Woolsey</surname>
          </string-name>
          ,
          <article-title>DrugBank: a comprehensive resource for in silico drug discovery and exploration</article-title>
          ,
          <source>Nucleic Acids Res</source>
          .
          <volume>34</volume>
          (
          <year>2006</year>
          )
          <fpage>668</fpage>
          -
          <lpage>672</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Mattingly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. T.</given-names>
            <surname>Colby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Forrest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Boyer</surname>
          </string-name>
          ,
          <article-title>The Comparative Toxicogenomics Database (CTD)</article-title>
          ,
          <source>Environ. Health Perspect</source>
          .
          <volume>111</volume>
          (
          <year>2003</year>
          )
          <fpage>793</fpage>
          -
          <lpage>795</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Buneman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cheney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vansummeren</surname>
          </string-name>
          ,
          <article-title>Curated Databases</article-title>
          ,
          <source>in: Proc. of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, June 9-11, 2008, Vancouver, BC, Canada</source>
          ,
          <publisher-name>ACM</publisher-name>
          ,
          <year>2008</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>A novel machine learning framework for automated biomedical relation extraction from large-scale literature repositories</article-title>
          ,
          <source>Nat. Mach. Intell</source>
          .
          <volume>2</volume>
          (
          <year>2020</year>
          )
          <fpage>347</fpage>
          -
          <lpage>355</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Barracchia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>D'Elia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <article-title>Prediction of new associations between ncRNAs and diseases exploiting multi-type hierarchical clustering</article-title>
          ,
          <source>BMC Bioinform</source>
          .
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>70</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Alaimo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Giugno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pulvirenti</surname>
          </string-name>
          ,
          <article-title>ncPred: ncRNA-disease association prediction through tripartite network-based inference</article-title>
          ,
          <source>Front. Bioeng. Biotechnol</source>
          .
          <volume>2</volume>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Dugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Platt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          ,
          <article-title>Drug development in the era of precision medicine</article-title>
          ,
          <source>Nat. Rev. Drug. Discov</source>
          .
          <volume>17</volume>
          (
          <year>2018</year>
          )
          <fpage>183</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Ramírez-Anguita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Saüch-Pitarch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ronzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Centeno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sanz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. I.</given-names>
            <surname>Furlong</surname>
          </string-name>
          ,
          <article-title>The DisGeNET knowledge platform for disease genomics: 2019 update</article-title>
          ,
          <source>Nucleic Acids Res</source>
          .
          <volume>48</volume>
          (
          <year>2020</year>
          )
          <fpage>D845</fpage>
          -
          <lpage>D855</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>van Mulligen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fourrier-Réglat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gurwitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Molokhia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trifirò</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Kors</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. I.</given-names>
            <surname>Furlong</surname>
          </string-name>
          ,
          <article-title>The EU-ADR corpus: Annotated drugs, diseases, targets, and their relationships</article-title>
          ,
          <source>J. Biomed. Informatics</source>
          <volume>45</volume>
          (
          <year>2012</year>
          )
          <fpage>879</fpage>
          -
          <lpage>884</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Knox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stothard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Damaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Wishart</surname>
          </string-name>
          ,
          <article-title>PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites</article-title>
          ,
          <source>Nucleic Acids Res.</source>
          <volume>36</volume>
          (
          <year>2008</year>
          )
          <fpage>W399</fpage>
          -
          <lpage>W405</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Shim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <article-title>CoMAGC: a corpus with multi-faceted annotations of gene-cancer relations</article-title>
          ,
          <source>BMC Bioinform.</source>
          <volume>14</volume>
          (
          <year>2013</year>
          )
          <fpage>323</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mintz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bills</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Snow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <article-title>Distant supervision for relation extraction without labeled data</article-title>
          , in:
          <source>Proc. of the 47th Annual Meeting of the Association for Computational Linguistics (ACL 2009) and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2-7 August 2009, Singapore</source>
          , ACL,
          <year>2009</year>
          , pp.
          <fpage>1003</fpage>
          -
          <lpage>1011</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Lathrop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lozano-Pérez</surname>
          </string-name>
          ,
          <article-title>Solving the Multiple Instance Problem with Axis-Parallel Rectangles</article-title>
          ,
          <source>Artif. Intell.</source>
          <volume>89</volume>
          (
          <year>1997</year>
          )
          <fpage>31</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          ,
          <article-title>Modeling Relations and Their Mentions without Labeled Text</article-title>
          ,
          in:
          <source>Proc. of Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, September 20-24, 2010</source>
          , volume
          <volume>6323</volume>
          of LNCS, Springer,
          <year>2010</year>
          , pp.
          <fpage>148</fpage>
          -
          <lpage>163</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>