=Paper= {{Paper |id=Vol-3741/paper10 |storemode=property |title=Bootstrapping Gene Expression-Cancer Knowledge Bases with Limited Human Annotations |pdfUrl=https://ceur-ws.org/Vol-3741/paper10.pdf |volume=Vol-3741 |authors=Stefano Marchesin,Laura Menotti,Fabio Giachelle,Gianmaria Silvello,Omar Alonso |dblpUrl=https://dblp.org/rec/conf/sebd/0001MGSA24 }} ==Bootstrapping Gene Expression-Cancer Knowledge Bases with Limited Human Annotations== https://ceur-ws.org/Vol-3741/paper10.pdf
                                Bootstrapping Gene Expression-Cancer Knowledge
                                Bases with Limited Human Annotations⋆
                                Stefano Marchesin1 , Laura Menotti1 , Fabio Giachelle1 , Gianmaria Silvello1 and
                                Omar Alonso2,∗∗
                                1
                                    Department of Information Engineering, University of Padua, Padua, Italy
                                2
                                    Amazon, Palo Alto, California, USA


                                              Abstract
                                              We introduce the Collaborative Oriented Relation Extraction (CORE) system for Knowledge
                                              Base Construction, based on the combination of Relation Extraction (RE) methods and domain
                                              experts feedback. CORE features a seamless, transparent, and modular architecture that suits
                                              large-scale processing. Via active learning, the CORE system bootstraps Knowledge Bases
                                              (KBs) and then employs RE methods to scale to large text corpora. We employ CORE to
                                              build one of the largest KBs focusing on fine-grained gene expression- cancer associations,
                                              fundamental to complement and validate experimental data for precision medicine and cancer
                                              research. We conducted comprehensive experiments showing the robustness of the approach
                                              and highlighting the scalability of CORE to large text corpora with limited manual annotations.

                                              Keywords
                                              Knowledge Base Construction, Relation Extraction, Active Learning, Distant Supervision




                                1. Introduction
                                In 2020 there were about 19.2 million cancer cases worldwide and the World Health
                                Organization estimates a 33% overall increase by 2040.1 With this growing global
                                burden, cancer prevention is one of the century’s most pressing public health challenges,
                                and data-driven research is crucial in assisting the development of medical solutions
                                to address it. In this regard, microarray and next-generation sequencing technologies
                                providing raw data about gene expression-cancer interactions [2, 3] are essential to
                                guide diagnosis, assess prognosis, or predict therapy response [4]. Although these data
                                are invaluable to the advancement of cancer research, they cannot be steadily used as
                                is, as they require further processing and validation by experts. In most cases, the
                                outcome of this research process is described in a scientific peer-reviewed publication.
                                Hence, scientific literature is an authoritative data source that can be exploited to
                                complement and validate such experimental data. However, the manual extraction of
                                knowledge (e.g., scientific facts) from domain-specific literature is expensive and time-
                                consuming [5, 6, 7]. In recent years, thanks to the advancement of Machine Learning (ML)
                                SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia,
                                Italy
                                ⋆
                                  Extended abstract of [1].
                                ∗∗
                                  Work done prior to joining Amazon.
                                            © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
                                            International (CC BY 4.0).
                                1
                                    https://gco.iarc.fr/tomorrow/en/dataviz/bubbles?sexes=0&mode=population




CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
methods, automated techniques for Knowledge Base Construction (KBC) have flourished
and empowered large-scale construction and curation of Knowledge Bases (KBs) [8, 9, 10].
Nevertheless, the two main components of KBC systems – i.e., Named Entity Recognition
and Disambiguation (NERD) and Relation Extraction (RE) – both require expensive
and often unavailable labeled data for training.
   Thus, alternative solutions have been proposed to address this limitation, such as distant
supervision [11, 12] and active learning [13, 14]. Distant supervision and active learning are
complementary and often used together [15, 16] to bootstrap KBC systems and generate
high-quality datasets for NERD and RE. Therefore, in this work, we use both paradigms
to build a modular, pluggable, transparent, and scalable KBC system for cancer research
that focuses on the discovery of “gene expression-cancer” associations. Specifically, we
present the Collaborative Oriented Relation Extraction (CORE) system [1], a KBC
system based on the combination of automated ML-based methods and domain experts
feedback. CORE features a seamless, transparent and modular architecture, where the
different components can be easily plugged-in. CORE also employs active learning to
bootstrap a KB focusing on gene expression-cancer associations. To this end, CORE
exploits the fine-grained aspects involved in gene expression-cancer associations to perform
iterative tests that measure the reliability of the data to be stored in the KB and return
small, selected samples to domain experts for annotation. The high-quality data generated
by this process is then used as reinforcement to re-train the ML models from scratch.
Active learning makes the CORE system suited to iterative KB versioning. Therefore,
with the data annotated by domain experts, re-trained ML models are deployed to build
subsequent versions of the KB.
   To show the robustness of the proposed approach, we conducted extensive analyses that
highlight how CORE scales to large text corpora with little human annotations. Moreover,
to evaluate the system effectiveness against the state-of-the-art, we performed a knowledge
base completion task showing that CORE achieves top performances. The KB derived
by CORE storing fine-grained facts about gene expression-cancer associations is available
at https://zenodo.org/records/7577127. The KB can also be accessed via CoreKB [17], a
web search platform available at https://gda.dei.unipd.it.
   The rest of the article is as follows: Section 2 reports on related work; Section 3 outlines
the CORE system; Section 4 presents the experiments; Section 5 concludes the paper.


2. Related Work
To date, there are a handful of knowledge resources containing data about gene expression-
cancer associations [18, 19, 20, 21, 22, 23]. Most of these resources only contain ex-
perimental data obtained through microarray and next-generation sequencing technolo-
gies [18, 19, 20, 21]. Whereas few of them, such as BioXpress [22] and OncoMX [23],
also integrate knowledge extracted from the biomedical literature and rely on pattern
matching techniques to extract relationships [24]. Thus, there is the opportunity to
develop more adaptive RE methods that can broaden the reach of KBC systems to
heterogeneous large-scale text corpora.
   Beside resources based on experimental studies, there also exist a few literature-based
resources [25, 26, 27, 28] such as CoMAGC [25] and OncoSearch [26]. They focus on
gene expression-cancer associations, modeling the different, fine-grained aspects involved
between gene expression and cancer. Although relevant, CoMAGC only consists of 821
sentences on prostate, breast, and ovarian cancers while OncoSearch is currently not
maintained. On the other hand, more general and large-scale resources on gene-disease
associations – i.e., DisGeNET [27] and LHGDN [28] – store coarse-grained information
expressing the existence of an association between gene expression and cancer, which is
often insufficient to model such complex, faceted relationships effectively.
   Hence, there is a need for KBC systems that can scale to large text corpora and
stay up to date while generating fine-grained information about gene expression-cancer
associations. These fine-grained associations are essential to complement and validate
experimental data, fundamental for advancing cancer research.


3. The CORE System
Preliminaries. Let us consider a directed graph 𝐺 = (𝑉 , 𝐸), where 𝐸 ⊆ {(𝑣1 , 𝑣2 ) | (𝑣1 , 𝑣2 ) ∈
𝑉 ×𝑉 } is the set of edges connecting ordered pairs of vertices. Given an edge 𝑒 = (𝑣1 , 𝑣2 ) ∈ 𝐸,
we call 𝑣1 the source vertex and 𝑣2 the target vertex. In our context, the nodes of 𝐺 are
entities and the edges are the relationships between them.

Definition 1 (Aspect). We call aspect an attribute of a relationship between a pair of
entities. An aspect has a name and a domain dom= {𝑎𝑖1 , … , 𝑎𝑖𝑛 }, where 𝑎𝑖𝑗 ∈ 𝐴𝑖 is the 𝑗 𝑡ℎ
aspect value of 𝐴𝑖 . Given an aspect 𝐴, the function 𝐷𝑜𝑚(𝐴) = dom returns its domain.

  When it is clear from the context, the aspect value 𝑎𝑖𝑗 ∈ 𝐴𝑖 is simply referred to as 𝑎𝑗 .

Example 1. Let us consider the context of gene-cancer associations, where there are
three aspects describing a possible relationship (𝑒) between gene (𝑣1 ) and cancer (𝑣2 ): the
Change of Gene Expression (CGE), the Change of Cancer Status (CCS), and the Gene-
Cancer Interaction (GCI). Following Definition 1, CGE, CCS, and GCI are the names
of the aspects with the following domains: 𝐷𝑜𝑚(CGE) = {up , down , notinf }, 𝐷𝑜𝑚(CCS) =
{progression , regression , notinf }, and 𝐷𝑜𝑚(GCI) = {causality , correlation , notinf }. A detailed
description of these aspect domains can be found in the original paper [1].

Definition 2 (Multi-Aspect Relationship). Given a graph 𝐺(𝑉 , 𝐸) and a set of aspects
𝒜 = {𝐴𝑖 }𝑛𝑖=1 , then a tuple of aspect values (𝑎1𝑗 , … , 𝑎𝑛𝑗 ) associated with 𝑒 = (𝑣1 , 𝑣2 ) ∈ 𝐸
defines a multi-aspect relationship between 𝑣1 and 𝑣2 .

Definition 3 (Signature Function). Given a set of aspects 𝒜 = {𝐴𝑖 }𝑛𝑖=1 and an alphabet Σ,
                𝑛
we define s ∶ ∏𝑖=1 𝐴𝑖 → 𝑆 ⊆ Σ∗ ; s((𝑎1𝑗 , ..., 𝑎𝑛𝑗 )) ↦ type as the signature function that maps
a multi-aspect relationship to a type from 𝑆, called the signature set.

  The signature function defines a set of mapping rules depending on the domain of
interest. In our setting, we refer to the mapping rules described in Table 1. That is,
we use the signature function to map multi-aspect gene expression-cancer relationships
Table 1
Inference rules for gene classes. For each combination of CGE, CCS, and GCI, we report the expected
gene class. Gene classes refer to the role that a given gene has on a specific disease. The * symbols in Rule
5 mean that CGE and CCS can assume any value between {up , down } and {progression , regression }.
                    Rule #   CGE    CCS            GCI           Gene Class
                    1        up     progression    causality     oncogene
                    2        up     regression     causality     tumor suppressor gene
                    3        down   regression     causality     oncogene
                    4        down   progression    causality     tumor suppressor gene
                    5        *      *              observation   biomarker




to gene prospective roles in cancer. Gene roles allow to distinguish the genes that are
responsible for oncogenesis from those that are not; these are essential information for
effective for cancer research and therapy design [29].

Definition 4 (Tagging Function). Given an edge 𝑒 ∈ 𝐸 and the signature set 𝑆. We define
𝜎 ∶ 𝐸 → 𝑆; 𝜎 (𝑒) ↦ type as the function tagging an edge with a signature type.

  The tagging function works on the graph and associates a signature type to an edge.
Thus, we use it to label edges with gene prospective roles. In other words, the graph
represents gene expression-cancer associations as gene prospective roles in cancer.

Overview. The goal of the CORE system is to harvest facts from text corpora to
populate KBs. We model a KB as a directed graph 𝐺 made up of entities connected
by typed relationships. Facts (or statements) are (𝑣1 , 𝑒, 𝑣2 ) triples, where 𝑣1 , 𝑣2 ∈ 𝑉,
𝑒 = (𝑣1 , 𝑣2 ) ∈ 𝐸, and 𝜎 (𝑒) ∈ 𝑆.
   To obtain facts, CORE collects scientific literature from different sources, identifies
sentences containing pairs of entities relevant to the considered task, and extracts aspects
from them. Depending on the combination of extracted aspect values, a sentence expresses
a specific signature type. Note that, for a given pair of entities, different sentences can
express various signature types, as we show in the next example.

Example 2. Let us consider the following sentences taken from the biomedical literature:

  A. Colorectal cancer (CRC) growth and progression is frequently driven by RAS
     pathway activation through upstream growth factor receptor activation or through
     mutational activation of KRAS or BRAF.
  B. Somatic mutations of the BRAF gene, causing constitutive activation of BRAF,
     have been found in various types of human cancers such as malignant melanoma,
     and colorectal cancer.

In both sentences, the following entities are extracted 𝑣1 = BRAF and 𝑣2 = Colorectal Cancer .
Considering the aspects introduced in Example 1, for sentence A we find CGE
= up , CCS = progression , and GCI = causality , leading to the signature type
s((up , progression , causality )) = oncogene . On the other hand, the aspect values of sentence
Figure 1: Overview of the CORE system architecture. The system consists of three main processes:
bootstrapping (orange), deployment (blue), and active learning (purple).


B are CGE = up , CCS = progression , and GCI = correlation , leading to the signature
type s((up , progression , correlation )) = biomarker .
   From Example 2, we see that different sentences may lead to different signature types.
In the scientific discourse, it is not surprising that there are different viewpoints and
that various studies can lead to different conclusions – even in contradiction with each
other. Hence, we need to consider this potential uncertainty when facts are extracted
from the literature. The CORE system models this inherent uncertainty by assigning the
likelihood of being true to each aspect value. This probability is based on the evidence
we can extract from the literature. Given a set of sentences concerning the same two
entities, the more an aspect value is consistent in the set, the higher the probability
for that value to be true. Hence, we define the concepts of Aspect-Probability Set and
Multi-Aspect Function.
Definition 5 (Aspect-Probability Set). Given an aspect 𝐴𝑖 = {𝑎𝑗 }𝑚        𝑗=1 such that each aspect
value 𝑎𝑗 carries a likelihood Pr(𝑎𝑗 ), we call 𝐴𝑃𝑖 = {(𝑎𝑗 , Pr(𝑎𝑗 ))}𝑚
                                                                     𝑗=1 its aspect-probability set.
Definition 6 (Multi-Aspect Function). Let 𝐺 = (𝑉 , 𝐸) be a directed graph and
                                                                      𝑛
𝒜 𝒫 = {𝐴𝑃𝑖 }𝑛𝑖=1 a set of aspect-probability sets. We define 𝜙 ∶ 𝐸 → ∏𝑖=1 𝐴𝑃𝑖 ; 𝜙(𝑒) ↦
               |𝐴 |                   |𝐴 |
({(𝑎1𝑗 , Pr(𝑎1𝑗 ))}𝑗=11 , … , {(𝑎𝑛𝑗 , Pr(𝑎𝑛𝑗 ))}𝑗=1𝑛 ) as the multi-aspect function that, given an edge,
returns the 𝑛-tuple of aspect-probability sets.
   Thus, for each pair of target entities, CORE computes the probabilities for all the
aspect values and combines them into tuples of aspect-probability sets – which represent
a probability distribution over multi-aspect relationships. In this way, sentences serve
as supporting or contradicting evidence that strengthens or weakens the likelihood of a
fact. Furthermore, aspect-probability sets drive another essential aspect of CORE: the
data-driven, active learning approach used to bootstrap KBs. That is, through reliability
tests based on aspect value likelihoods and inference rules, the system tags facts as
reliable or unreliable. Part of the sentences associated with the most “highly” unreliable
facts is then fed to a human-in-the-loop process that reinforces the RE methods for
aspect extraction.

Architecture. Figure 1 presents the system architecture. In the first module (module 1),
the texts acquired from the literature are processed and normalized to obtain sentences,
from which a NERD component extracts entity pairs. The entity-annotated sentences
undergo two different processes: bootstrapping (orange workflow) and deployment (blue
workflow). In the bootstrapping workflow, experts manually annotate multi-aspect
relationships between the entities (module 2), producing a set of relation-annotated
sentences.
   The manual, relation-annotated sentences are then used to train RE methods (module
3) and to populate the KB (module 5). The RE methods are trained to predict the
different aspects of multi-aspect relationships. Once trained, RE methods are employed
in the deployment workflow to obtain automatic annotations expressing multi-aspect
relationships between entities (module 4). Then, automatic, relation-annotated sentences
are used to further populate the KB (module 5).
   In the last module (module 5), relation-annotated sentences are grouped by entity
pairs and used to generate facts. First, a knowledge enrichment component computes
probabilities for all the aspect values and combines them into tuples of aspect-probability
sets. Then, a reliability testing component uses these probabilities to perform multiple
tests that tag facts as either reliable or unreliable. Only facts tagged as reliable are used
to populate the KB.
   When the deployment workflow is complete, unreliable facts are ranked by ascending
reliability score and the top-𝑘 automatically annotated sentences associated with them
are re-annotated by experts – thus triggering an active learning process that reinforces
the RE methods (purple workflow).

Versioning. The active learning workflow makes CORE suited to iterative KB versioning.
We define a KB version as the graph 𝐺𝑗 = (𝑉𝑗 , 𝐸𝑗 ) obtained after the 𝑗 𝑡ℎ iteration of the
bootstrap and deployment workflows. Once the 𝑗 𝑡ℎ version of the KB has been deployed,
the active learning workflow starts by generating the batch of unreliable sentences for
bootstrapping the 𝑗 𝑡ℎ + 1 version of the KB. The unreliable sentences are manually
annotated and used to increase the size of the datasets to re-train the RE methods from
scratch, which then generate a new set of automatic annotations to be included in the
𝑗 𝑡ℎ + 1 KB version. When the bootstrap and deployment workflows end, the 𝑗 𝑡ℎ + 1 version
of the KB is re-built from scratch and comprises all the available annotations.


4. Implementation and Experiments
Knowledge Base Construction. We use different resources to build the KB, which
increase with each subsequent iteration of the KB construction process. The considered
resources are CoMAGC [25], OncoSearch [26], BioXpress [22], DisGeNET [27], and
PubMed.2 For CoMAGC, BioXpress, and OncoSearch (KBs 0–3) we revised the available
manual annotations to make them compliant with our annotation schema; for DisGeNET
(KBs 1–3) we divided its data into two batches to test versioning; and for PubMed (KB3)
we only considered the articles citing those stored within KB2. Table 2 reports statistics
for the resources used to build each KB version, while Table 3 reports the statistics about
each version of the generated KB.
2
    https://pubmed.ncbi.nlm.nih.gov
Table 2
Raw statistics for the KB versions. Rows represent the raw instances considered to build the KB.
                                                              KB0        KB1         KB2          KB3
                                 CoMAGC (revised)              821       821         821           821
                                 OncoSearch (revised)          157       157         157           157
                    Manual       BioXpress (revised)            74        74          74            74
                                 DisGeNET (batch 1)              –         –         250           250
                                 DisGeNET (batch 2)              –         –           –           249
                                 DisGeNET (batch 1)              –    184,859     184,609       184,609
                    Automatic    DisGeNET (batch 2)              –          –     184,858       184,609
                                 PubMed (citing papers)          –          –           –     2,841,096
                    Total                                     1,052   185,911     370,769     3,211,865


Table 3
Partition and general statistics for each KB version.
                                                        KB0       KB1      KB2         KB3
                                         Manual         655        585      605         592
                             Partition
                                         Automatic        –     96,531   95,282     435,283
                                         Sentence       655     97,116   95,887     435,875
                                         Article        411     69,462   65,236     161,449
                             General     Gene           329      9,483    9,981      21,005
                                         Cancer          98      1,479    1,554       1,665
                                         Fact           512     71,554   89,999     153,016



   First, we can see that the ratio between the sentences stored in the KB (Table 3) and
the input ones (Table 2) decreases at each iteration. From the first iteration, where
the CORE system uses the 62% of the input sentences to build KB0, we move to the
52% to build KB1, 26% for KB2, and only 14% for KB3. Such a decrease reflects the
use of reliability tests and active learning, which make the system more selective and
accurate. In particular, active learning leads the CORE system to refine the RE methods
at each iteration, thus reducing false positives as well as unreliable facts. Secondly, the
large number of different genes and cancers in KB3 highlights the scalability of the
approach. In this regard, KB3 contains 21, 005 genes, which cover the 70% of the 30, 000
estimated genes in the human genome.3 On the other hand, through the integration
of DisGeNET data, KBs 1–3 contain most of the (known) cancer types involved in
gene expression-cancer associations. Combined, this large number of genes and cancer
types leads to more than 150, 000 reliable facts. Finally, compared to currently available
knowledge resources [22, 23, 26], KB3 represents the largest literature-derived KBs with
reliable fine-grained facts about gene expression-cancer associations.

Knowledge Base Completion. We evaluate the effectiveness of the CORE system on a
KB completion task, in which we hold out a portion of an existing KB with associated
sentences and we assess CORE ability to recover it. To this end, we hold out from

3
    https://www.genome.gov/human-genome-project/
Table 4
CORE system performances on the BioXpress completion task after each (re-)training of the RE methods.
We also report DEXTER performance on KB3.
                       Dataset     Method   Accuracy   Precision   Recall      F1
                                   CORE0      0.9544     0.9601    0.9544   0.9572
                       BioXpress   CORE1      0.9703     0.9831    0.9703   0.9766
                                   CORE2      0.9706     0.9827    0.9706   0.9766
                       KB3         DEXTER     0.3256     0.6034    0.3256   0.2882



BioXpress [22] the set of 9, 636 sentences annotated by DEXTER [24] – a state-of-the-art
text-mining system for gene expression-cancer associations based on pattern matching
– and we evaluate the CORE system on them. Note that such sentences are not part
of those used to train the CORE RE methods. Vice versa, we apply DEXTER on the
manually annotated subset of KB3 to evaluate its ability to generalize to heterogeneous
sentences, whose syntactic structure can differ from its predefined patterns.
   For BioXpress completion, we use the three versions of the CORE system obtained after
each (re-)training of the RE methods. Table 4 reports the CORE system performances
on the BioXpress completion task after each (re-)training of the RE methods, as well as
DEXTER performance on KB3.
   We can see that each CORE version consistently achieves performances above 0.95 for
each measure. In particular, CORE1 improves over CORE0 by about 2% and reaches
a performance plateau, as shown by CORE2 performance. The results highlight the
effectiveness of the CORE system in recovering BioXpress using a limited amount of
manual annotations to train the RE methods. On the other hand, the poor performance
of DEXTER on KB3 highlights a lack of flexibility that hampers its applicability to
heterogeneous sentences. To further support this intuition, we observe that between
precision and recall it is recall to have the worst performance, with a value of 0.3256.
This underlines the expert system nature of DEXTER which, although precise, fails to
generalize beyond its set of predefined patterns.


5. Conclusions
In this work we presented CORE, a KBC system based on the combination between
automated RE methods and domain experts. The reliability tests and the active learning
process make the system suited to iterative KB versioning. We used the CORE to
build one of the largest KBs about gene expression-cancer associations. We conducted
extensive experiments that (i) highlighted the ability of CORE to scale to large collections
of heterogeneous data with limited human annotations and (ii) showed its generalizability
and reliability compared to the current state-of-the-art.

Acknowledgments. This work is partially supported by the HEREDITARY Project,
as part of the European Union’s Horizon Europe research and innovation programme
under grant agreement No GA 101137074.
References
 [1] S. Marchesin, L. Menotti, F. Giachelle, G. Silvello, O. Alonso, Building a Large Gene
     Expression-Cancer Knowledge Base with Limited Human Annotations, Database J.
     Biol. Databases Curation 2023 (2023). URL: https://doi.org/10.1093/database/baad061.
     doi:10.1093/DATABASE/BAAD061 .
 [2] C. Manzoni, D. A. Kia, J. Vandrovcova, J. Hardy, N. W. Wood, P. A. Lewis,
     R. Ferrari, Genome, Transcriptome and Proteome: the Rise of Omics Data and
     Their Integration in Biomedical Sciences, Briefings in Bioinformatics 19 (2016)
     286–302.
 [3] P. Borry, H. B. Bentzen, I. Budin-Ljøsne, M. C. Cornel, H. C. Howard, O. Feeney,
     L. Jackson, D. Mascalzoni, Á. Mendes, B. Peterlin, B. Riso, M. Shabani, H. Skirton,
     S. Sterckx, D. Vears, M. Wjst, H. Felzmann, The Challenges of the Expanded
     Availability of Genomic Information: an Agenda-Setting Paper, J. Community
     Genet. 9 (2018) 103–116.
 [4] B. Neary, J. Zhou, P. Qiu, Identifying Gene Expression Patterns Associated with
     Drug-Specific Survival in Cancer Patients, Scientific Reports 11 (2021) 1–12.
 [5] F. Liu, J. Chen, A. Jagannatha, H. Yu, Learning for Biomedical Information
     Extraction: Methodological Review of Recent Advances, CoRR abs/1606.07993
     (2016).
 [6] M. Krallinger, O. Rabal, S. A. Akhondi, M. P. Pérez, J. Santamaría, G. P. Ro-
     dríguez, G. Tsatsaronis, A. Intxaurrondo, J. A. Lopez, U. K. Nandal, E. M. van Buel,
     A. Chandrasekhar, M. Rodenburg, A. Lægreid, M. A. Doornenbal, J. Oyarzábal,
     A. Lourenço, A. Valencia, Overview of the BioCreative VI chemical-protein inter-
     action Track, in: Proc. of the sixth BioCreative challenge evaluation workshop,
     2017.
 [7] A. Miranda, F. Mehryary, J. Luoma, S. Pyysalo, A. Valencia, M. Krallinger, Overview
     of DrugProt BioCreative VII track: quality evaluation and large scale text mining of
     drug-gene/protein relations, in: Proc. of the seventh BioCreative challenge evaluation
     workshop, 2021.
 [8] G. Weikum, X. L. Dong, S. Razniewski, F. M. Suchanek, Machine Knowledge:
     Creation and Curation of Comprehensive Knowledge Bases, Found. Trends Databases
     10 (2021) 108–490.
 [9] D. Wright, A. L. Gentile, N. Faux, K. L. Beck, BioAct: Biomedical Knowledge Base
     Construction using Active Learning, bioRxiv (2022).
[10] P. Ernst, A. Siu, G. Weikum, HighLife: Higher-arity Fact Harvesting, in: Proc. of
     the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon,
     France, April 23-27, 2018, ACM, 2018, pp. 1013–1022.
[11] M. Mintz, S. Bills, R. Snow, D. Jurafsky, Distant supervision for relation extraction
     without labeled data, in: Proc. of the 47th Annual Meeting of the Association for
     Computational Linguistics (ACL 2009) and the 4th International Joint Conference
     on Natural Language Processing of the AFNLP, 2-7 August 2009, Singapore, ACL,
     2009, pp. 1003–1011.
[12] M. Surdeanu, J. Tibshirani, R. Nallapati, C. D. Manning, Multi-instance Multi-
     label Learning for Relation Extraction, in: Proc. of the 2012 Joint Conference on
     Empirical Methods in Natural Language Processing and Computational Natural
     Language Learning, EMNLP-CoNLL 2012, July 12-14, 2012, Jeju Island, Korea,
     ACL, 2012, pp. 455–465.
[13] B. Settles, Active Learning Literature Survey, Science 10 (1995) 237–304.
[14] F. Olsson, A Literature Survey of Active Machine Learning in the Context of Natural
     Language Processing, SICS Technical Report (2009).
[15] G. Angeli, J. Tibshirani, J. Wu, C. D. Manning, Combining Distant and Partial
     Supervision for Relation Extraction, in: Proc. of the 2014 Conference on Empirical
     Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha,
     Qatar, ACL, 2014, pp. 1556–1567.
[16] L. Sterckx, T. Demeester, J. Deleu, C. Develder, Using Active Learning and Semantic
     Clustering for Noise Reduction in Distant Supervision, in: Proc. of the 4th Workshop
     on Automated Base Construction at NIPS 2014 (AKBC-2014), 2014, pp. 1–6.
[17] F. Giachelle, S. Marchesin, G. Silvello, O. Alonso, Searching for Reliable Facts
     over a Medical Knowledge Base, in: Proc. of the 46th International ACM SIGIR
     Conference on Research and Development in Information Retrieval, SIGIR 2023,
     Taipei, Taiwan, July 23-27, 2023, ACM, 2023.
[18] S. J. Park, B. H. Yoon, S. K. Kim, S. Y. Kim, GENT2: an updated gene expression
     database for normal and tumor tissues, BMC Medical Genom. 12 (2019) 1–8.
[19] Y. D. Shaul, B. Yuan, P. Thiru, A. Nutter-Upham, S. McCallum, C. Lanzkron,
     G. W. Bell, D. M. Sabatini, MERAV: a tool for comparing gene expression across
     human tissues and cell types, Nucleic Acids Res. 44 (2016) 560–566.
[20] J. Zhang, J. Baran, A. Cros, J. M. Guberman, S. Haider, J. Hsu, Y. Liang, E. Rivkin,
     J. Wang, B. Whitty, M. Wong-Erasmus, L. Yao, A. Kasprzyk, International Cancer
     Genome Consortium Data Portal - a one-stop shop for cancer genomics data,
     Database J. Biol. Databases Curation 2011 (2011).
[21] J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. Shaw, B. A. Ozenberger,
     K. Ellrott, I. Shmulevich, C. Sander, J. M. Stuart, The Cancer Genome Atlas
     Pan-Cancer Analysis Project, Nat. Genet. 45 (2013) 1113–1120.
[22] H. Dingerdissen, J. Torcivia-Rodriguez, Y. Hu, T. C. Chang, R. Mazumder, R. Y.
     Kahsay, BioMuta and BioXpress: mutation and expression knowledgebases for
     cancer biomarker discovery, Nucleic Acids Res. 46 (2018) D1128–D1136.
[23] H. M. Dingerdissen, F. Bastian, K. Vijay-Shanker, M. Robinson-Rechavi, A. Bell,
     N. Gogate, S. Gupta, E. Holmes, R. Kahsay, J. Keeney, H. Kincaid, C. H. King,
     D. Liu, D. J. Crichton, R. Mazumder, OncoMX: A Knowledgebase for Exploring
     Cancer Biomarkers in the Context of Related Cancer and Healthy Data, JCO Clin.
     Cancer Inform. (2020) 210–220.
[24] S. Gupta, H. Dingerdissen, K. E. Ross, Y. Hu, C. H. Wu, R. Mazumder, K. Vijay-
     Shanker, DEXTER: disease-expression relation extraction from text, Database J.
     Biol. Databases Curation 2018 (2018) bay045.
[25] H. J. Lee, S. H. Shim, M. R. Song, H. Lee, J. C. Park, CoMAGC: a corpus with
     multi-faceted annotations of gene-cancer relations, BMC Bioinform. 14 (2013) 323.
[26] H. J. Lee, T. C. Dang, H. Lee, J. C. Park, OncoSearch: cancer gene search engine
     with literature evidence, Nucleic Acids Res. 42 (2014) 416–421.
[27] J. P. González, J. M. Ramírez-Anguita, J. Saüch-Pitarch, F. Ronzano, E. Centeno,
     F. Sanz, L. I. Furlong, The DisGeNET knowledge platform for disease genomics:
     2019 update, Nucleic Acids Res. 48 (2020) D845–D855.
[28] M. Bundschus, A. Bauer-Mehren, V. Tresp, L. I. Furlong, H. P. Kriegel, Digging
     for knowledge with information extraction: a case study on human gene-disease
     associations, in: Proc. of the 19th ACM Conference on Information and Knowledge
     Management, CIKM 2010, Toronto, Ontario, Canada, October 26-30, 2010, ACM,
     2010, pp. 1845–1848.
[29] D. Haber, J. Settleman, Cancer: Drivers and passengers, Nature 446 (2007) 145–146.