<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SeWebMeDa, May</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Structural Characteristics of Knowledge Graphs Determine the Quality of Knowledge Graph Embeddings Across Model and Hyperparameter Choices</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jeffrey Sardina</string-name>
          <email>sardinaj@tcd.ie</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Declan O'Sullivan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Trinity College Dublin, College Green</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>2</volume>
      <fpage>9</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>The realm of biomedicine is producing information at a rate far beyond the capacity of clinicians, researchers, and machine learning experts to analyse in full. Recently, developments in Knowledge Graphs (KGs) have facilitated the representation of all this information in an easily-integrable and easily-queryable format. With increasing academic and clinical interest in Knowledge Graph Embeddings (KGEs), various KGE models have been developed to allow machine learning to efficiently run on these large Knowledge Graphs and predict new, previously unseen information about the domain. However, the need to validate hyperparameters for every new dataset, especially considering the time and expertise needed for validation and model training, has limited the use of KGEs in biology to those who have expertise in machine learning and knowledge engineering. This research presents a framework by which the effect of hyperparameters on model performance for a given KG can be modelled as a function of KG structure. The presented evaluation of the framework finds a clear effect of graph structure on hyperparameter fitness. This leads to the conclusion that more research into cross-dataset hyperparameter prediction and re-use holds promise for increasing the accessibility and usability of KGEs for biomedical applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Cancer biology and biomedical sciences are being revolutionised by Big Data. From projects such
as Bio2RDF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the International Cancer Genome Consortium [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the 1,000 Genomes Project [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and
TumorMap [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Big Data has become a centrepiece of biomedical research. With the ever-increasing
magnitude of these datasets, several approaches have been taken to analyse and utilise them in full.
Some projects, such as TumorMap, have focused on transforming the available data into a simpler
format through dimensionality-reduction mechanisms, accepting a degree of information loss in
exchange for easier usability [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. On the other hand, recent Linked Open Data (LOD) systems have
attempted to represent the entirety of the data in an easily-queryable graph-based format [
        <xref ref-type="bibr" rid="ref1 ref5 ref6 ref7">1, 5–7</xref>
        ].
Among the projects that have taken this approach is Bio2RDF, a LOD data store that incorporates data
from many different biological and biomedical datasets into a graph format [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Other groups have
followed up upon such projects with graph-based machine learning methods called Knowledge Graph
Embeddings (KGEs) to process entire LOD datasets at once [
        <xref ref-type="bibr" rid="ref6 ref8 ref9">6, 8, 9</xref>
        ].
      </p>
      <p>
        However, using Knowledge Graph Embeddings (KGEs) on Knowledge Graphs (KGs) requires the
selection of hyperparameters to the models, and proper selection of hyperparameters is critical to
enabling the model to best learn the data at hand [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
      <p>This research addresses the question of whether the hyperparameters of KGE models on biological
LOD proceed from the structure of the data. The data suggests that graph structure is indeed the
driving force behind hyperparameter preference. Our characterisation of the interaction between
graph structure and model performance in the context of arbitrary hyperparameter sets suggests
that model performance for given hyperparameter sets can be predicted based on graph structure, in
the absence of data on the hyperparameters themselves.</p>
      <p>2020 Copyright for this paper by its authors. CEUR Workshop Proceedings (ceur-ws.org).</p>
      <p>It is anticipated that relating hyperparameters and KGE performance to structure would allow
predicting and explaining why certain models perform better than others and would allow these
results to be generalised to similarly structured graphs from diverse domains. While in this work the
cancer biology and biomedical fields are used—due to their importance and due to the lead author’s
previous experience in both—the goal of explaining results in terms of KG structure allows for wider
generalisation.</p>
      <p>The characterisation achieved in this research, while not definitive, calls for further research into
the possibility of allowing relational learning algorithms to operate on similarly-structured datasets
using a consistent set of known hyperparameters, and on whether predicting these hyperparameters
from graph structure alone, without a hyperparameter search, may be possible. Such a system would
allow biological LOD datasets to be analysed much more quickly, without the need for running a full
search for optimal hyperparameters on every new dataset.</p>
      <p>Please note that in this paper, references to “hyperparameters” refer not only to the parameters
given to a KGE / relational learning model (such as the learning rate), but also to the choice of the
models themselves; while this differs from the formal definition of hyperparameters, it results in a simpler
phrasing of the model choices made.</p>
      <p>The remainder of this paper is structured as follows: Section 2 gives a background on KGs, KGEs,
and the gap in the current state-of-the-art that this research seeks to fill. Section 3 details the methods
by which experiments were performed, and Section 4 provides the results of these experiments.
Section 5 concludes the paper and gives a discussion of the major findings, as well as the limitations
of this study.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Knowledge Graphs</title>
      <p>
        The most popular KG format is RDF, the Resource Description Framework [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Like most KG formats, its
smallest unit of information is the triple: a set of two entities and one predicate (or relationship) that links them
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Thus, every KG is a special case of a graph G(V, E) with vertices V and edges E, where every vertex and
edge is involved in at least one triple. Within the context of any one triple, the first entity is called the subject or
the head, and the second entity the object or the tail [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This can be abbreviated simply as (subject, predicate,
object).
      </p>
      <p>In this configuration, RDF triples mirror simple linguistic statements. For example, the statement “P53 is a
protein” could be modelled as the triple (p53, is-a, protein).</p>
      <p>
        In the RDF format, subject, predicates, and objects are often represented by URLs which allows for entities
and predicates to be easily reused and dereferenced, either within one data source or between different data
sources [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The ability to reuse entities and relationships means that various triples can be linked and
connected logically to each other, either by sharing a head, a tail, or both. This allows many large KGs to
connect data from multiple datasets with relative ease in RDF.
      </p>
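<p>The reuse of entities across triples can be sketched in plain Python (hypothetical entity names; tuples stand in for RDF triples), where two triples are linked whenever they share a subject or object:</p>

```python
# A toy KG as a set of (subject, predicate, object) triples.
# Entity and predicate names are illustrative, not from a real dataset.
kg = {
    ("p53", "is-a", "protein"),
    ("p53", "regulates", "apoptosis"),
    ("mdm2", "inhibits", "p53"),
}

def connected(t1, t2):
    """Two triples are linked if they share an entity as head or tail."""
    return bool({t1[0], t1[2]} & {t2[0], t2[2]})

# The first and third triples both involve "p53", so they are connected.
linked = connected(("p53", "is-a", "protein"), ("mdm2", "inhibits", "p53"))
```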
    </sec>
    <sec id="sec-4">
      <title>2.2. Knowledge Graph Embeddings</title>
    </sec>
    <sec id="sec-5">
      <title>2.2.1. Representing Vertices and Edges</title>
      <p>
        Repeating entities as both a subject and an object allows linking triples to each other. While this
can be used in many cases for logical reasoning using formal rules to extract more information from
the graph [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], this property also allows machine learning techniques to operate on the graph and
learn to distinguish and relate entities based on the triples they are involved in [13, 14]. These
machine learning techniques are referred to as relational machine learning since they learn based on
these relations [13].
      </p>
      <p>The output of this relational learning is a set of Knowledge Graph Embeddings, where entities are
typically represented as vectors in R<sup>n</sup> and relationships are represented by transformations on them
[13, 14]. These relationships are structured (for example) as functions that map R<sup>n</sup> onto R<sup>n</sup>, allowing
them to convert from subjects to the expected objects of that relationship (the so-called “link
prediction task”) [13, 14].</p>
      <p>The choice of the dimension into which the entities are placed, as well as the choice of what sort
of transformation is used to model the relationship (such as vector addition or matrix multiplication)
are model design choices that must be investigated by developers to find the optimal combination for
a given data set [13, 14].</p>
    </sec>
    <sec id="sec-6">
      <title>2.2.2. Training on Negative Samples</title>
      <p>In order for the KGE model to effectively learn to predict true triples and reject false ones, it must
be trained not only on the known-true triples but also on known-false triples [13, 14]. This is done
using a technique called negative sampling [13, 14].</p>
      <p>There are various assumptions about the data that researchers can make to produce negative
samples; the most common is Local Closed-World Assumption, which claims that if a given subject and
predicate are observed with a certain set of objects, then that subject and predicate only ever match
to objects of those types [13, 14]. This is a specific case of the Open-World Assumption, also commonly
used, which assumes that there may be arbitrarily many true statements which are not contained in
the KG [13, 14]. However, the Open-World Assumption gives no way of predicting which triples that
are not in the KG are true and which are false, and is less commonly used [13, 14].</p>
      <p>Choosing by what method or methods to sample negatives, and how many to sample, are also
hyperparameters to KGE models.</p>
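<p>As a concrete sketch (the toy KG and entity list below are hypothetical), negatives under the Local Closed-World Assumption can be generated by corrupting the object of a known-true triple and discarding corruptions that are themselves known to be true; real samplers typically also corrupt subjects:</p>

```python
import random

def lcwa_negatives(triple, entities, true_triples, k, rng=random):
    """Sample k negative triples by corrupting the object of a true triple."""
    s, p, o = triple
    negatives = []
    while len(negatives) < k:
        o_neg = rng.choice(entities)
        # Keep the corruption only if it is not a known-true triple.
        if o_neg != o and (s, p, o_neg) not in true_triples:
            negatives.append((s, p, o_neg))
    return negatives

kg = {("p53", "is-a", "protein"), ("braf", "is-a", "gene")}
entities = ["p53", "braf", "protein", "gene", "apoptosis"]
negatives = lcwa_negatives(("p53", "is-a", "protein"), entities, kg, k=3)
```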
    </sec>
    <sec id="sec-7">
      <title>2.2.3. The TransE Model for KGEs</title>
      <p>The most basic conceptualisation of KGEs is the TransE model [13, 14]. Under the TransE model,
nodes are embedded as vectors, and relationships as vector displacements between those nodes. If s<sub>i</sub>
is the embedding of a subject node, p<sub>i</sub> the embedding of a relationship, and o<sub>i</sub> the embedding of an
object node, then TransE attempts to enforce the following equality as closely as possible:
s<sub>i</sub> + p<sub>i</sub> ≈ o<sub>i</sub>
(1)</p>
      <p>In essence, this gives a very intuitive definition of embeddings: the subject (entity) plus the
predicate (how the subject relates to the predicate) should be close or equal to the object [13, 14].</p>
      <p>A simple example of embedding a KGE into 2-space is given in Figure 1.</p>
      <sec id="sec-7-1">
        <title>Original Knowledge Graph</title>
      </sec>
      <sec id="sec-7-2">
        <title>Embedded Knowledge Graph</title>
        <p>Link prediction in TransE is formulated as taking the translation of some subject s under some
predicate p, resulting in a vertex v′. The closest other vertex o to v′ in R<sup>n</sup> is predicted to be the object
of the relationship (s, p, o). A visualisation of this method is given in Figure 2.</p>
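<p>This nearest-neighbour formulation can be sketched as follows (hand-picked 2-dimensional embeddings for hypothetical entities, not learned ones):</p>

```python
import math

# Toy embeddings in R^2; in practice these are learned by the model.
node_emb = {"p53": (0.0, 0.0), "protein": (1.0, 1.0), "gene": (3.0, 0.0)}
rel_emb = {"is-a": (1.0, 1.0)}

def predict_object(s, p):
    """TransE-style link prediction: nearest other vertex to s + p."""
    sx, sy = node_emb[s]
    px, py = rel_emb[p]
    target = (sx + px, sy + py)  # the translated vertex v'
    # Predict the closest other vertex to v' as the object.
    return min((o for o in node_emb if o != s),
               key=lambda o: math.dist(node_emb[o], target))

predicted = predict_object("p53", "is-a")  # nearest vertex to (1, 1)
```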
      </sec>
    </sec>
    <sec id="sec-8">
      <title>2.2.4. Other KGE Models</title>
      <p>Many other KGE models exist; however, they all follow the same basic idea of using some
transformation (represented by an edge) on subjects to produce embeddings that can be matched by
some metric to the correct objects [13, 14].</p>
      <p>
        In terms of model definition, it is very common that the operator used to represent edges, the
comparison metric (i.e., Euclidean distance or cosine similarity) between transformed subjects and
objects, the loss function used to score embeddings, and regularisation terms added differ between
various models [
        <xref ref-type="bibr" rid="ref10">10, 13, 14</xref>
        ]. However, a large variety of even more diverse KGE methods exist [13, 14].
      </p>
    </sec>
    <sec id="sec-9">
      <title>2.3. Gap in the State-of-the-Art</title>
      <p>The KGE models that have been applied have focused on producing a set of embeddings either for use as
feature-vector inputs to a different machine learning application [15, 16] or directly for novel link prediction
within the learned dataset [16]. Both run into the inevitable issue of selecting hyperparameters, which often
involves a time-consuming brute-force search [16].</p>
      <p>
        The recent trend towards both more advanced visualization for KGs in bioinformatics, as in the transition
from the LTCGA to BIOOPENER [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5–7</xref>
        ], as well as interest in automated KGE models [15, 16] are producing results
that commonly match or exceed other modern approaches in data visualization and prediction [
        <xref ref-type="bibr" rid="ref6">6, 16</xref>
        ]. In the
case of this research, focus was given specifically to how to reduce the up-front barrier to entry posed by
hyperparameter selection.
      </p>
      <p>While various articles have established that KGEs are very sensitive to good hyperparameter choice [16],
characterized the most important meta-elements of graphs such as structure and provenance [17], and noted
the effect of graph structure on embeddings [18], no attempts have been made to relate these hyperparameters
to KG structural features. Doing so would thus be a contribution both to the field of knowledge-graph machine
learning and to the field of KGE-based bioinformatics.</p>
    </sec>
    <sec id="sec-10">
      <title>3. Methods</title>
    </sec>
    <sec id="sec-11">
      <title>3.1. Dataset Selection</title>
      <p>Selection of data sources occurred in two steps: selection of a multi-dataset LOD mashup, and selecting
datasets from within that mashup. Selection of data from a single mashup rather than from solitary KGs was
done for four reasons: simplicity, ease of reproducibility, relevance, and consistency.</p>
      <p>
        Simplicity and ease of reproducibility for LOD-based projects go hand in hand. Mashup systems
such as Bio2RDF are intended to be used in many different contexts and by different applications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Moreover, they are designed for easy access of their components by researchers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Both attributes
make the data attractive in terms of simplicity: all the data is easily obtained from a single place.
Moreover, ease of access to the data is a basis for ease of reproducibility, since other groups who wish
to reproduce the results of this work need only reference data from a single location rather than many.
      </p>
      <p>
        In addition, these larger data mashups tend to have much higher overall relevance. Bio2RDF, for
example, was constructed from the most common biological datasets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and now contains a total of
35 biological datasets [19].
      </p>
      <p>From within Bio2RDF, a small selection of datasets was chosen based on their relevance to
biomedical research, as well as on their overall size.</p>
      <p>In terms of the first criterion, datasets from Bio2RDF that were immediately relevant to cancer and
biomedical research were selected given the lead author’s experience and interest in these domains.
These domains, of course, are very broad and can encompass a variety of types and sources of data.
Specifically, in order to select the datasets most relevant to these categories, datasets containing
information on drugs, molecular biology, clinical data, and genetics were selected as the most highly
relevant. Datasets containing data exclusively from non-human animals were excluded to focus more
clearly on biomedical modelling in the human context. This left a list of 15 potential datasets.</p>
      <p>Once this list of datasets was obtained, a subset of datasets to be used in analysis was selected
based on dataset size. This filter was introduced for purely practical reasons: larger datasets take
significantly longer to pre-process and train, even at low epochs.</p>
      <p>In order to strike a balance between including enough datasets in the pool for analysis and
minimizing the overall computational time and power spent, any datasets in excess of 20GB were
removed from consideration. Moreover, datasets measuring under 1 MB were removed for containing
too little information, since the goal of this work is to focus on big data rather than learning from small
KGs. This resulted in a list of 9 datasets from the Bio2RDF mashup being used: BioPortal (18.3 GB),
Database of single nucleotide polymorphism (DBSNP, 2.9 GB), DrugBank (1.6 GB), Gene Ontology
Annotation (GOA, 17.1 GB), HUGO Gene Nomenclature Committee (HGNC, 844.8 MB), the Kyoto
Encyclopedia of Genes and Genomes (KEGG, 18.1 GB), the Life Science Resource Registry (LSR, 12.1
MB), Online Mendelian Inheritance in Man (OMIM, 2.2 GB), and Pharmacogenomics Knowledge Base
(PharmGKB, 1.5 GB).</p>
    </sec>
    <sec id="sec-12">
      <title>3.2. KGE Implementation</title>
      <p>
        In this research, KGE models were implemented in PyTorch-BigGraph (PBG) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. PBG allows the user to define
a KGE model by choosing an operator (to transform subject embeddings to object embeddings), a comparator
(to define how to measure closeness of two embeddings) and a loss function (to optimize the model) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Importantly, PBG also allows the main graph to be split into partitions. Each partition is loaded into memory
one at a time, and each one represents the fraction of the graph that can fit in system memory at once [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. PBG
then handles communicating the results of training on different partitions between partitions and uses this to
create an overall KGE model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. As a result, PBG is well-suited for large-scale biomedical (or other) LOD
datasets; an example of its use may be seen in [20].
      </p>
      <p>
        The authors showed that, for large KGs, this system approximates the results that would be obtained using
an identical configuration on non-partitioned graphs [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. However, they did note that creating partitions on
smaller graphs could have some negative effects on embedding quality [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
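<p>As an illustration of this workflow, a PBG model is defined through a configuration like the following sketch (paths, sizes, and values are hypothetical, not the configuration used in this work; the key names follow the PBG configuration schema described in [10]):</p>

```python
# Illustrative PyTorch-BigGraph configuration sketch: the operator,
# comparator, and loss function together define the KGE model, and
# num_partitions controls how the graph is split to fit in memory.
def get_torchbiggraph_config():
    return dict(
        entity_path="data/example",               # where entity lists are stored
        edge_paths=["data/example/edges"],        # partitioned edge lists
        checkpoint_path="model/example",
        entities={"all": {"num_partitions": 4}},  # graph split into 4 partitions
        relations=[{"name": "all_edges", "lhs": "all", "rhs": "all",
                    "operator": "translation"}],  # TransE-style operator
        dimension=128,           # embedding dimension
        comparator="l2",         # how closeness of two embeddings is measured
        loss_fn="ranking",       # loss function used to optimise the model
        lr=0.01,
        num_epochs=10,
        num_uniform_negs=50,     # uniformly sampled negatives per positive
    )
```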
    </sec>
    <sec id="sec-13">
      <title>3.3. Hyperparameter Selection</title>
      <p>In order to select hyperparameters, a modified version of the grid search was used. In a traditional grid search,
all values of all hyperparameters in question are varied over a grid, and the best choice from among them is
chosen. However, the large number of KGE models and hyperparameters involved made such an approach
infeasible (for a listing of hyperparameters involved, see Table 2). Thus, an arbitrary set of hyperparameter values
given in [21] was used to initialize the model.</p>
      <p>Only five of the 9 total datasets were used in the initial round of hyperparameter selection; these
were BioPortal, DBSNP, DrugBank, OMIM, and PharmGKB. This approach was undertaken so that the
resulting hyperparameter configurations could be run on datasets for which they had not been
created as a measure of how well the hyperparameter configurations worked across different
datasets.</p>
      <p>Three grid searches were then carried out: in the first, model-related hyperparameters were varied. These
were, specifically: comparator, learning rate, loss function, operator, and regularisation coefficient. It should be
noted that, due to the design of PBG, a regularization coefficient hyperparameter is not given to KGE models
using the affine operator [21]. As such, this combination of operator and regularization coefficient was not
allowed when searching for optimal hyperparameter configurations.</p>
      <p>In the second round, hyperparameters relating to batching were varied; these were batch size, the number
of batch negatives to use, and the number of uniformly sampled negatives to use. Finally, in the third round, the
number of epochs and embedding dimensions were varied.</p>
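<p>The first of these rounds can be sketched as an enumeration over a value grid (the values below are placeholders, not the grid actually searched), with the disallowed affine/regularisation combination filtered out:</p>

```python
import itertools

# Hypothetical value grid for the model-related round of the search.
grid = {
    "operator": ["translation", "affine", "diagonal"],
    "comparator": ["l2", "cos"],
    "loss_fn": ["ranking", "logistic"],
    "lr": [0.1, 0.01],
    "regularization_coef": [0.0, 1e-3],
}

def valid_configs(grid):
    keys = list(grid)
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(keys, values))
        # PBG gives no regularisation coefficient to the affine operator,
        # so skip those combinations.
        if cfg["operator"] == "affine" and cfg["regularization_coef"] != 0.0:
            continue
        yield cfg

configs = list(valid_configs(grid))  # 48 raw combinations, 8 excluded
```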
      <p>In addition, rather than conducting the hyperparameter search on the entire dataset, datasets were
subsetted randomly in order to make the search feasible in the available time. In order to do this, the decision
of how large to make the subsets was critical to ensuring that they could well represent the data from which
they were drawn. All subsets taken were taken in a single-pass traversal of the graph in which a triple was
randomly chosen with probability equal to the desired number of triples divided by the total number of triples
in the graph. The desired number of triples was set to 4,000.</p>
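<p>A minimal sketch of this single-pass subsetting (with a synthetic triple list standing in for a real KG) keeps each triple independently with probability desired/total, so the subset size is binomially distributed around the target rather than being exactly 4,000:</p>

```python
import random

def subsample(triples, desired, rng=random):
    """Keep each triple with probability desired/len(triples) in one pass."""
    p = desired / len(triples)
    return [t for t in triples if rng.random() < p]

# Synthetic stand-in for a large KG: a chain of 100,000 triples.
triples = [(f"e{i}", "rel", f"e{i + 1}") for i in range(100_000)]
subset = subsample(triples, desired=4_000)  # roughly 4,000 triples
```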
      <p>For all rounds, model performance was measured using the “r1” metric, which is the probability
that a true triple would be preferred over all of the negative samples created for it under the
Local Closed-World Assumption during link prediction [21].</p>
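<p>Given per-triple scores, the metric as described above reduces to a simple fraction (the scores below are toy values; higher is assumed better):</p>

```python
def r1(scored):
    """Fraction of true triples scored above every one of their negatives.

    scored: list of (true_score, [negative_scores]) pairs.
    """
    hits = sum(1 for pos, negs in scored if all(pos > n for n in negs))
    return hits / len(scored)

# Two of the three true triples outrank all of their negative samples.
example = [(0.9, [0.2, 0.5]), (0.4, [0.6, 0.1]), (0.8, [0.3])]
score = r1(example)
```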
    </sec>
    <sec id="sec-14">
      <title>4. Results</title>
    </sec>
    <sec id="sec-15">
      <title>4.1. Hyperparameter Sets Selected</title>
      <p>Ultimately, two different sets of hyperparameters were created: one for BioPortal, and one for DBSNP,
DrugBank, OMIM, and PharmGKB. The reason for these two sets was that, in each of the hyperparameter
validation rounds, in almost all cases the distribution of r1 scores given different hyperparameter combinations
for DBSNP, DrugBank, OMIM, and PharmGKB matched very closely, while BioPortal did not follow this trend.
The hyperparameters selected are shown in Table 2.</p>
      <p>Once all the hyperparameters had been obtained, all datasets were run using both sets of
hyperparameters. The output r1 scores for all runs are shown in Table 3.
Interestingly, the general configuration yielded its best scores on datasets it had not been created to
accommodate (BioPortal and KEGG). The datasets it was tuned on (DBSNP, DrugBank, OMIM,
and PharmGKB) universally had lower performance when trained on that hyperparameter set.</p>
      <p>The BioPortal configuration outperformed the general configuration on all datasets except one:
BioPortal. This difference was small and quite possibly insignificant, though its variance was not
estimated. In any case, the BioPortal configuration was not clearly better for BioPortal itself.</p>
      <p>The departure of these results from the expected ones—that the BioPortal configuration would be
optimal for BioPortal by a large margin and the general configuration would be similarly superior for
DBSNP, DrugBank, OMIM, and PharmGKB from which it was created—suggest that the selected
hyperparameter values are not optimal.</p>
      <p>However, it was noted that even in the absence of optimal or near-optimal hyperparameters, the
data can be interpreted as coming from arbitrary hyperparameter selections, making no assumptions
about the goodness (or lack thereof) of the model choices. It is under this assumption that the
remaining analysis was carried out.</p>
    </sec>
    <sec id="sec-16">
      <title>4.2. Relating KG Structure, Hyperparameters, and Model Performance</title>
      <p>Relating KG structure to model performance and hyperparameters was formulated as a regression problem:
given a set of structural characteristics, predict the r1 scores of a model under a single hyperparameter set. Each
prediction was made in the context of data from all datasets under a single hyperparameter configuration only.</p>
      <p>The first step in this process was to identify relevant structural features from each KG. Since it has
been noted that KG connectivity—particularly centrality—impacts KGEs [18], two different methods
of examining centrality were applied. The first was measuring the counts and proportions of sources
(nodes that are only ever subjects), sinks (nodes that are only ever objects), and repeats (nodes that
are a subject and an object at least once in the KG). The second was measuring the distribution of
the centrality of nodes, where centrality was calculated as the total degree of a node. The outputs of
both of these methods are shown in Tables 4 and 5. The datasets used in the searches were BioPortal,
DBSNP, DrugBank, OMIM, and PharmGKB; those not used in the searches were GOA, HGNC, KEGG,
and LSR.</p>
      <p>From these features, a subset was selected as inputs to a regressor. The structural features selected
to measure the effect of the high prevalence of sinks versus sources and repeats were the ratios of
sinks and repeats to triples. The ratio of sources to triples was not included, since in all cases it was
nearly identical to the ratio of repeats to triples and thus was redundant. Moreover, adding in more
features on datasets with few data points can lead to machine learning models overfitting by
memorizing data rather than learning general trends, which would make interpretation of the results
less clear.</p>
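<p>The two feature families described above can be computed directly from a triple list; a stdlib-only sketch on a toy KG (hypothetical data):</p>

```python
from statistics import median, quantiles

def structural_features(triples):
    """Source/sink/repeat ratios and degree-centrality statistics of a KG."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    degree = {}
    for s, _, o in triples:
        degree[s] = degree.get(s, 0) + 1   # out-degree contribution
        degree[o] = degree.get(o, 0) + 1   # in-degree contribution
    n = len(triples)
    degs = sorted(degree.values())
    return {
        "sources/triples": len(subjects - objects) / n,  # only ever subjects
        "sinks/triples": len(objects - subjects) / n,    # only ever objects
        "repeats/triples": len(subjects & objects) / n,  # both roles at least once
        "max_centrality/triples": degs[-1] / n,
        "median_centrality": median(degs),
        "p90_centrality": quantiles(degs, n=10)[-1],     # 90th percentile
    }

feats = structural_features([("a", "r", "b"), ("b", "r", "c"), ("a", "r", "c")])
```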
      <p>The ratio of the maximum centrality to the number of triples, as well as median, 3rd quartile, and
90th percentile centralities were used to represent the effects of centrality. Notably, the first quartile
was not included since it was identical in all datasets. Other values were not included because they
varied similarly to data already in the dataset and would result in introducing too many features and
potentially overfitting the data.</p>
      <p>Data from all datasets was normalised prior to being input into the regression models, and all
regression models were run with 5-fold cross-validation using an L1 (or “Lasso”) penalty to select the
regularisation coefficient. The Lasso regression penalty was chosen since it tends to drive the values
of parameters that are not needed in the regression decision to zero, thus providing an easy tool for
the detection of which structural elements are important and which are of no use for gaining
predictive power. R2 scores were obtained from the final Lasso model on the training data.</p>
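<p>The regression setup can be sketched with scikit-learn on synthetic stand-in data (in this work the real inputs were the structural features and r1 scores of the nine datasets):</p>

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))  # synthetic "structural features"
# Synthetic "r1 scores" driven by two of the five features plus noise.
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.1, size=40)

# Normalise inputs, then pick the L1 penalty strength by 5-fold CV.
X_norm = StandardScaler().fit_transform(X)
model = LassoCV(cv=5).fit(X_norm, y)

r2 = model.score(X_norm, y)   # R2 on the training data
coefs = model.coef_           # near-zero coefficients mark unused features
```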
      <p>Since all input values were normalized, the parameter values themselves can be used as an
estimate of their importance to the regression decision within the context of a single model: those
with higher values represent structural features that have a greater correlation to the final r1 score
obtained by the KGE model. It must be noted, however, that care must still be taken in this
interpretation of the parameter values, especially since interaction terms were not considered and
thus some effects of the variance of each parameter on the model may not be fully contained in the
given data.</p>
      <p>The results of Lasso regression and the final Lasso models are given in Tables 6 – 8.
All models examined achieved R2 values of at least 0.60; all but one were above 0.80. While all the
models based on sink and repeat frequencies used both features in the final regression model, none
of the models using centrality distribution statistics used all the available features; in both cases one of
them was ignored by the regressor.</p>
      <p>These results suggest that, for datasets trained on the same hyperparameters, their structures
correlate very well with how well they perform under that hyperparameter set. This suggests a
possible effect of KG structure on how well a model performs given an arbitrary set of
hyperparameters. Or, put differently, it suggests that the fitness of a hyperparameter set for a given
KG, and ultimate KGE performance under that set, can be determined from the structure of the KG
alone.</p>
    </sec>
    <sec id="sec-17">
      <title>5. Discussion and Conclusions</title>
    </sec>
    <sec id="sec-18">
      <title>5.1. Key Contributions</title>
      <p>It is expected that this work contributes to relational learning and bioinformatics in two key ways.</p>
      <p>First, it suggests that KGE performance is very responsive to structure, particularly with respect to proportions
of sources, sinks, and repeated entities in the graph and to the distribution of the centrality of nodes in the
graph.</p>
      <p>It also demonstrates that the performance of a KGE with a given set of hyperparameters can be predicted
with high accuracy considering only KG structure. This suggests that it would be possible to rapidly predict what
KGs may fit a given set of hyperparameters using only a linear regression model, rather than a time-intensive
grid search.</p>
      <p>The work also indicates two important future directions which, if followed, could provide critical
insights to the field. On one side, this research suggests that it may be possible to create a regression
model that predicts model performance based not only on the KG structure but also on the
hyperparameters, in the absence of training. If done, this would allow for the rapid detection of
optimal hyperparameter configurations—even those never examined before—without the need to
do a grid search. It is hypothesised that this model could then be applied to very rapidly find the ideal
hyperparameter configuration for a KG with a given structure.</p>
      <p>On the other hand, it suggests that hyperparameter selection may be formulable as a
classification problem, mapping from KG structural statistics to one of several models and sets of
hyperparameters, without any need for a grid search or traditional hyperparameter selection
methods.</p>
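      <p>Such a classification formulation might, for instance, take the form of a nearest-centroid classifier over structural statistics. The following is a hedged sketch; the preset names, features, and centroid values are all hypothetical:</p>

```python
# Hypothetical sketch of hyperparameter selection as classification: each
# class is a preset (model + hyperparameter set), and a KG is assigned the
# preset whose structural centroid is nearest. Preset names, features, and
# centroid values are invented for illustration.

PRESET_CENTROIDS = {
    # (prop_sources, prop_sinks) centroids of KGs assumed to favour each preset
    "transe_lowdim": (0.05, 0.05),
    "complex_highdim": (0.30, 0.35),
}

def choose_preset(features, centroids=PRESET_CENTROIDS):
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda name: sq_dist(features, centroids[name]))

best = choose_preset((0.28, 0.40))  # a sink-heavy KG
```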
    </sec>
    <sec id="sec-19">
      <title>5.2. Limitations of this Work</title>
      <p>Since the method for selecting hyperparameters was found to be sub-optimal for the intended
datasets, the hyperparameter sets produced are known to be imperfect. This is possibly a result of
having subsetted the graphs, producing subgraphs that could be easily grid-searched but whose
structure differed notably in some respects from that of the original KG. Unfortunately,
the extent to which they vary from the optimum configurations was impossible to estimate, as the
optimal configurations are not known. As a result, the results of this study are interpreted as
presenting data in the context of arbitrary hyperparameter configurations, rather than optimal or
near-optimal ones. However, the structural analysis of KGE scores under these configurations remains valid,
because that analysis made no assumption of optimality of the configurations whose performance it was predicting.</p>
      <p>In addition, only biomedical datasets were considered in this work, and all of these datasets were
observed to have extremely strongly skewed centrality values. Given the findings of this
research, examining how datasets with very different centrality distributions than those seen
here interact with optimal hyperparameters for KGE models is expected to benefit the
relational learning and KGE fields.</p>
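      <p>One simple way to quantify such skewness is the moment coefficient of skewness of the node-degree distribution, with degree standing in for centrality (an assumption made here purely for illustration; the centrality measures analysed in this work may differ):</p>

```python
# Sketch of quantifying how skewed a KG's centrality distribution is,
# using node degree as a simple stand-in for centrality (an assumption;
# the centrality measures analysed in this work may differ).
from collections import Counter

def degree_skewness(triples):
    """Fisher-Pearson moment coefficient of skewness of node degrees."""
    degree = Counter()
    for h, _, t in triples:
        degree[h] += 1
        degree[t] += 1
    values = list(degree.values())
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# A hub-and-spoke graph: one hub connected to many leaves yields the kind
# of strongly right-skewed degree distribution discussed above.
hub_triples = [("hub", "linked_to", f"leaf{i}") for i in range(5)]
skew = degree_skewness(hub_triples)  # positive for hub-dominated graphs
```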
      <p>Finally, even the final hyperparameter configurations identified in this work yielded low r1 scores,
which never exceeded 0.4. Scores observed in the hyperparameter search were similarly low, often
significantly lower. However, the reason for these low scores was not directly identified. As outlined
in Wang et al., KGE models learn by capturing the relationships between entities [14]. This suggests
that entities with very few observed relationships would be relatively hard to embed properly.
Therefore, it stands to reason that an effect of structure would be observed here as well, very likely
in terms of the number of sinks and sources in the graph relative to the total number of triples. This
effect may be partly agnostic to the hyperparameters involved, although this determination could not
be made with the data available. Further research in this direction would be merited.</p>
    </sec>
    <sec id="sec-20">
      <title>5.3. Final Observations</title>
      <p>It is hoped that this work contributes to the understanding of hyperparameter choices not only in the
realm of KGEs, but in the context of machine learning models generally. The finding that the structural
elements of KGs are very highly predictive of model performance under different hyperparameter
configurations suggests that data structure and model choice may be best understood in the context
of each other.</p>
      <p>Creating machine learning models by which structural elements could lead to optimal
hyperparameter prediction and predictions of model performance is well merited. It would
be equally merited to extend this work to other machine learning domains, to understand whether
hyperparameter fitness and model performance in general, or only for KGEs, can be modelled as a
function of dataset structure.</p>
      <p>The author of this work hypothesises that such dataset-structure-based approaches would yield
fruitful results, advancing understanding of machine learning models and facilitating optimal
hyperparameter selection in machine learning domains beyond KGEs alone. Furthermore, it is
hypothesised that structure-based optimal or near-optimal hyperparameter determination may be
possible even in the absence of any form of traditional hyperparameter search for KGEs, and it is
suggested that further work in this area examine whether such approaches are effective and practical.</p>
    </sec>
    <sec id="sec-21">
      <title>6. References</title>
      <p>[13] Nickel, M., Murphy, K., Tresp, V., Gabrilovich, E.: A Review of Relational Machine Learning for
Knowledge Graphs. Proc. IEEE. 104, 11–33 (2016). https://doi.org/10.1109/JPROC.2015.2483592.
[14] Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge Graph Embedding: A Survey of Approaches
and Applications. IEEE Trans. Knowl. Data Eng. 29, 2724–2743 (2017).
https://doi.org/10.1109/TKDE.2017.2754499.
[15] Mohamed, S.K., Nounu, A., Nováček, V.: Biological applications of knowledge graph
embedding models. Briefings in Bioinformatics. 22, 1679–1693 (2021).
https://doi.org/10.1093/bib/bbaa012.
[16] Celebi, R., Uyar, H., Yasar, E., Gumus, O., Dikenelli, O., Dumontier, M.: Evaluation of knowledge
graph embedding approaches for drug-drug interaction prediction in realistic settings. BMC
Bioinformatics. 20, 726 (2019). https://doi.org/10.1186/s12859-019-3284-5.
[17] Ben Ellefi, M., Bellahsene, Z., Breslin, J.G., Demidova, E., Dietze, S., Szymański, J., Todorov, K.:
RDF dataset profiling – a survey of features, methods, vocabularies and applications. SW. 9, 677–705
(2018). https://doi.org/10.3233/SW-180294.
[18] Sadeghi, A., Collarana, D., Graux, D., Lehmann, J.: Embedding Knowledge Graphs Attentive to
Positional and Centrality Qualities. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., and Lozano,
J.A. (eds.) Machine Learning and Knowledge Discovery in Databases. Research Track. pp. 548–564.
Springer International Publishing, Cham (2021). https://doi.org/10.1007/978-3-030-86520-7_34.
[19] Bio2RDF Release 3, https://download.bio2rdf.org/files/release/3/release.html, last accessed
2022/02/24.
[20] Fisher, J., Palfrey, D., Christodoulopoulos, C., Mittal, A.: Measuring Social Bias in Knowledge
Graph Embeddings. arXiv:1912.02761 [cs]. (2020).
[21] Welcome to PyTorch-BigGraph’s documentation! — PyTorch-BigGraph 1.dev documentation,
https://torchbiggraph.readthedocs.io/en/latest/index.html, last accessed 2022/02/24.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Belleau</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nolin</surname>
            , M.-
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tourigny</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rigault</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morissette</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Bio2RDF: Towards a mashup to build bioinformatics knowledge systems</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          .
          <volume>41</volume>
          ,
          <fpage>706</fpage>
          -
          <lpage>716</lpage>
          (
          <year>2008</year>
          ). https://doi.org/10.1016/j.jbi.2008.03.004.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Baran</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cros</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guberman</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haider</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hsu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rivkin</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Whitty</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong-Erasmus</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasprzyk</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>International Cancer Ge-nome Consortium Data Portal--a one-stop shop for cancer genomics data</article-title>
          .
          <source>Database</source>
          .
          <year>2011</year>
          ,
          <fpage>bar026</fpage>
          -
          <lpage>bar026</lpage>
          (
          <year>2011</year>
          ). https://doi.org/10.1093/database/bar026.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          The 1000 Genomes Project Consortium:
          <article-title>An integrated map of genetic variation from 1,092 human genomes</article-title>
          .
          <source>Nature</source>
          .
          <volume>491</volume>
          ,
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          (
          <year>2012</year>
          ). https://doi.org/10.1038/nature11632.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Newton</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Novak</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swatloski</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McColl</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graim</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinstein</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baertsch</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salama</surname>
            ,
            <given-names>S.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ellrott</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldstein</surname>
            ,
            <given-names>T.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haussler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morozova</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stuart</surname>
            ,
            <given-names>J.M.:</given-names>
          </string-name>
          <article-title>TumorMap: Exploring the Molecular Similarities of Cancer Samples in an Interactive Portal</article-title>
          .
          <source>Cancer Res</source>
          .
          <volume>77</volume>
          ,
          <fpage>e111</fpage>
          -
          <lpage>e114</lpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1158/0008-5472.CAN-17-0580.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Padmanabhuni</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngomo</surname>
            ,
            <given-names>A.-C.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almeida</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deus</surname>
            ,
            <given-names>H.F.</given-names>
          </string-name>
          :
          <article-title>Linked cancer genome atlas database</article-title>
          .
          <source>In: Proceedings of the 9th International Conference on Semantic Systems - I-SEMANTICS '13</source>
          . p.
          <fpage>129</fpage>
          . ACM Press, Graz, Austria (
          <year>2013</year>
          ). https://doi.org/10.1145/2506182.2506200.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Jha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehdi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karim</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehmood</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zappa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rebholz-Schuhmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sahay</surname>
          </string-name>
          , R.:
          <article-title>Towards precision medicine: discovering novel gynecological cancer biomarkers and pathways using linked data</article-title>
          .
          <source>J Biomed Semant. 8</source>
          ,
          <issue>40</issue>
          (
          <year>2017</year>
          ). https://doi.org/10.1186/s13326-017-0146-9.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Saleem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamdar</surname>
            ,
            <given-names>M.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iqbal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sampath</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deus</surname>
            ,
            <given-names>H.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ngonga</surname>
            <given-names>Ngomo</given-names>
          </string-name>
          , A.-C.:
          <article-title>Big linked cancer data: Integrating linked TCGA and PubMed</article-title>
          .
          <source>Journal of Web Semantics. 27-28</source>
          ,
          <fpage>34</fpage>
          -
          <lpage>41</lpage>
          (
          <year>2014</year>
          ). https://doi.org/10.1016/j.websem.2014.07.004.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>McCusker</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumontier</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dordick</surname>
            ,
            <given-names>J.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D.L.</given-names>
          </string-name>
          :
          <article-title>Find-ing melanoma drugs through a probabilistic knowledge graph</article-title>
          .
          <source>PeerJ Computer Science</source>
          .
          <volume>3</volume>
          ,
          <issue>e106</issue>
          (
          <year>2017</year>
          ). https://doi.org/10.7717/peerj-cs.106.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.M.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rivera</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.-C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Durbin</surname>
            ,
            <given-names>E.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christian</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tourassi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Knowledge Graph-Enabled Cancer Data Analytics</article-title>
          .
          <source>IEEE J. Biomed. Health Inform</source>
          .
          <volume>24</volume>
          ,
          <fpage>1952</fpage>
          -
          <lpage>1967</lpage>
          (
          <year>2020</year>
          ). https://doi.org/10.1109/JBHI.2020.2990797.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Lerer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lacroix</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wehrstedt</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bose</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peysakhovich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>PyTorchBigGraph: A Large-scale Graph Embedding System</article-title>
          . arXiv:1903.12287 [cs, stat]. (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Ali</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoyt</surname>
            ,
            <given-names>C.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Domingo-Fernández</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jabeen</surname>
          </string-name>
          , H.:
          <article-title>BioKEEN: a library for learning and evaluating biological knowledge graph embeddings</article-title>
          .
          <source>Bioinformatics</source>
          .
          <volume>35</volume>
          ,
          <fpage>3538</fpage>
          -
          <lpage>3540</lpage>
          (
          <year>2019</year>
          ). https://doi.org/10.1093/bioinformatics/btz117.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked Data - The Story So Far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          .
          <volume>5</volume>
          ,
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2009</year>
          ). https://doi.org/10.4018/jswis.2009081901.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>