<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Representation and comparison of chemotherapy protocols with ChemoKG and graph embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jong Ho Jhee</string-name>
          <email>jong-ho.jhee@inria.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alice Rogier</string-name>
          <email>alice.rogier-ext@aphp.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dune Giraud</string-name>
          <email>dune.giraud@aphp.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emma Pinet</string-name>
          <email>Emma.PINET@aphp.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brigitte Sabatier</string-name>
          <email>brigitte.sabatier@aphp.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bastien Rance</string-name>
          <email>bastien.rance@aphp.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrien Coulet</string-name>
          <email>adrien.coulet@inria.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Biomedical Informatics, Hôpital Européen Georges Pompidou, AP-HP</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Pharmacy, Hôpital Européen Georges Pompidou, AP-HP</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Inria Paris</institution>
          ,
          <addr-line>F-75015 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Inserm, Centre de Recherche des Cordeliers, Université Paris Cité, Sorbonne Université</institution>
          ,
          <addr-line>F-75006 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Background: Chemotherapy, a central cancer treatment, employs antineoplastic drugs to hinder cancer cell replication by disrupting DNA synthesis or mitosis. Chemotherapies follow complex protocols composed of cycles of treatment in which antineoplastic and adjuvant drugs are prescribed at different doses and times. Many protocols exist, differing from one another by variations that may be small or large and numerous, which makes it hard to compare chemotherapies with each other, to compare their outcomes and, ultimately, to choose the most suitable one for a particular patient. Method: We propose ChemoKG, a knowledge graph for chemotherapy protocols that encompasses, first, administration programs (drugs, dosages, treatment durations) and, second, drug properties and classes imported from ChEBI, DrugBank and the ATC classification. These three resources on drugs provide complementary hierarchies and chemical properties that help to better identify similar chemotherapy protocols. To this aim, we tested on ChemoKG a novel graph embedding method employing graph neural networks (GNNs) to compare nodes in the graph that represent protocols. Unlike previous approaches that focus on triple-based embeddings, the proposed method captures subgraph structures inherited from the aggregation scheme in GNNs. Results: The resulting knowledge graph encompasses 329,164 triples with 99,901 entities and 75 predicates, including 1,358 chemotherapy protocols and 226 anti-cancer drugs. We performed a cluster analysis of protocol embeddings learned on ChemoKG to propose groups of similar protocols. This will help facilitate the comparison of chemotherapies themselves and, by extension, of their potential effectiveness. Additionally, it should aid in analyzing gaps between commonly accepted protocols and their real-world implementation.</p>
      </abstract>
      <kwd-group>
        <kwd>Chemotherapy protocol</kwd>
        <kwd>chemotherapy regimen</kwd>
        <kwd>knowledge graph</kwd>
        <kwd>graph embedding</kwd>
        <kwd>clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Chemotherapy is a cancer treatment that employs antineoplastic drugs to hinder cancer cell
replication, for instance by disrupting DNA synthesis or mitosis. Chemotherapy remains a
cornerstone of cancer treatment, administering cytotoxic drugs to limit tumor growth. This
involves a delicate balance between reducing tumor size and minimizing side effects.
Combining various drugs in a timely manner, adapted to the patient's profile and response, is a common
strategy to achieve this tradeoff [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Indeed, each chemotherapy treatment follows a complex but
precisely defined protocol (also named regimen) composed of repeated cycles in which a set of
antineoplastic and adjuvant drugs is prescribed for administration with various doses, modes
(continuous vs. bolus infusion) and timings. This cyclic approach is not arbitrary: it aligns with the
life cycle of cancer cells and ensures optimal drug efficacy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Many different protocols have been
described with either small or large variations, to adapt to individual factors such as age, health
conditions, and genetic profiles [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As a result, a large number of protocols coexist in clinical
information systems and expert databases, but they are unevenly evaluated, which
complicates the clinician's choice of one protocol over another.
      </p>
      <p>ORCID: 0000-0001-8887-8149 (J. H. Jhee); 0000-0002-5499-3197 (A. Rogier); 0000-0002-1466-062X (A. Coulet)</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)</p>
      <p>
        The subtlety of variations in terms of, for instance, timing (e.g., time lapse between two
administrations) or mode of administration (e.g., bolus vs. continuous) motivates the need for a
fine-grained knowledge representation of protocols, especially for future studies aiming at evaluating
the comparative effectiveness of treatment strategies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In addition, such a representation would
enable the definition of distances between protocols with regard to their multidimensional
definition, composed of the drugs they leverage, their dose, timing, mode, classes, etc.
      </p>
      <p>We introduce ChemoKG, a knowledge graph that represents chemotherapy protocols, their
various dimensions, and links their constitutive drugs to various properties from ChEBI,
DrugBank and the Anatomical Therapeutic Chemical (ATC) classification. As a first illustration of
the value of ChemoKG, we propose here a clustering that groups similar protocols with regard to
their description in ChemoKG. The clustering relies on a novel embedding framework named
RAGE, which leverages graph neural networks (GNNs) to learn a representation of protocols that
considers the properties of the drug administrations present in the neighborhood of protocols in
ChemoKG. Unlike other approaches that focus on triple-based embeddings, RAGE captures
subgraph structures inherited from the aggregation scheme in GNNs. The clustering analysis
provides a classification of protocols that we compare with two reference classifications: the first
is based on cancer locations associated with protocols, the second on pharmacological groups of
drugs. We evaluated the RAGE embedding approach on a link prediction task, where it outperforms
or is competitive with the selected baselines. The cluster analysis we
performed with classical algorithms shows that RAGE allows for a reasonably good grouping of
protocols by cancer location, even though the graph does not contain this information.
With ATC as a reference, RAGE showed the best results, in comparison with a classical method (not
based on machine learning) named cumulative dose intensity (CDI).</p>
      <p>To our knowledge, this is the first attempt to classify chemotherapy protocols on the basis of
several of their features. We believe that our effort to compare chemotherapies will find
applications first in the management of protocols in hospitals that historically recorded every
small variation in protocols, resulting in large collections in need of structuring and cleaning;
second in the definition of standard protocols, as institutions have adopted different but
sometimes similar ones; and third in the identification of concurrent protocols. Indirectly, we hope that
comparing chemotherapy protocols will help in their relative evaluation and in guiding
the choice of one among a set of similar ones.</p>
      <p>The remainder of the paper is organized as follows. First, previous works on chemotherapy
representation, graph embeddings and clustering from graph embeddings are presented in
Section 2. Section 3 introduces ChemoKG. Section 4 describes the methodology of both the
proposed graph embedding framework and its use for a clustering task. Section 5 presents our
experimental results and is followed by elements of discussion and a conclusion in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        Chemotherapy databases. The growing variety of chemotherapy protocols has led to a recent
interest first in naming protocols in unambiguous ways [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and second in proposing
repositories of the various protocols. The largest available one is HemOnc, which includes more than 4,000
regimens [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It is a collaborative database that includes regimen descriptions and general
information about them. It started in 2011 through a collaboration of oncologists from several US
university hospitals, with an initial focus on hematological cancers. HemOnc proposes a
data schema to represent and share protocols in OWL. However, this schema does not include
detailed properties of administrations and drugs. Another initiative, developed in the UK, is SACT
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which contains both adult and pediatric oncology protocols. SACT has the particularity of storing
data not only about protocols, but also about patients, their diagnoses and outcomes. For this
reason, SACT is not shared in open access. Notably, a seminal work is DIOS [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which
consisted of 260 protocols at the time of publication (2013) and is no longer accessible.
      </p>
      <p>
        Knowledge graph embeddings. The aim of knowledge graph embeddings is to project entities
and relations into a continuous vector space while preserving the relations between entities
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Those entity and relation embeddings can further be used in downstream tasks, such as link
prediction [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], triple classification [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and entity clustering [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. TransE [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is a representative
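approach based on a translational distance model. Given a triple <inline-formula><tex-math>(s, r, o)</tex-math></inline-formula>, the relation is
interpreted as a translation vector r so that the embedded entities s and o can be connected by r
with low error, i.e., <inline-formula><tex-math>\mathbf{s} + \mathbf{r} \approx \mathbf{o}</tex-math></inline-formula>. TransE has the advantage of being simple, but has difficulty in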
learning 1-to-many and many-to-many relations [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. DistMult [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] exploits a similarity-based
scoring function to match the latent semantics of entities. It represents pairwise relations
between entities in the vector space along the same dimension of relations. However, since
its scoring function is symmetric in the two entities, the model treats all relations as symmetric.
ComplEx [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is an extension of DistMult that uses complex-valued embeddings so as to better
model asymmetric relations. MuRE [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] is the Euclidean analogue of MuRP, a model that employs hyperbolic rather than Euclidean
embeddings to represent hierarchical structures. RDF2Vec [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] uses random walks on the RDF
graph to create sequences of entities, which are then used as input for the model. However, it
focuses on embedding entities without considering the semantics of relations. CompGCN [35]
represents relationships between entities using Graph Convolutional Networks (GCNs), which
focus on local neighborhood entities. It uses relation-type-specific parameters to learn
embeddings. More embedding techniques can be found in the following surveys [
        <xref ref-type="bibr" rid="ref18">18, 19</xref>
        ].
      </p>
      <p>
        Clustering with graph embeddings. Embeddings provide a representation of objects in the form
of numeric vectors that is convenient for computing distances between them and consequently
for driving clustering analyses. To cite only a few works from the biomedical domain, Monnin et al.
[20] clustered embeddings of pharmacogenomic relationships, Mohamed et al. [21] clustered
polypharmacy side-effects and Fernández-Torras et al. [22] clustered drugs, diseases and genes
to predict drug responses.
      </p>
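      <p>As an illustration of the scoring functions discussed above, the following minimal Python sketch computes a TransE score (negative translation distance) and a DistMult score (trilinear product). It assumes NumPy; the function names and the toy three-dimensional vectors are hypothetical and are not taken from ChemoKG or from any of the cited implementations. It also shows why DistMult cannot distinguish a triple from its inverse.</p>
      <preformat preformat-type="code">
```python
import numpy as np

def transe_score(s, r, o):
    """TransE plausibility: higher (closer to 0) when s + r is close to o."""
    return -float(np.linalg.norm(s + r - o))

def distmult_score(s, r, o):
    """DistMult plausibility: trilinear product sum_i s_i * r_i * o_i."""
    return float(np.sum(s * r * o))

# Toy embeddings (illustrative values only, chosen so that s + r equals o).
s = np.array([1.0, 0.0, 2.0])
r = np.array([0.5, 1.0, -1.0])
o = np.array([1.5, 1.0, 1.0])

# TransE scores the triple and its inverse differently.
print(transe_score(s, r, o), transe_score(o, r, s))

# DistMult gives the same score to (s, r, o) and (o, r, s): it is symmetric.
print(distmult_score(s, r, o) == distmult_score(o, r, s))
```
      </preformat>
      <p>Swapping the subject and object changes the TransE score but not the DistMult score, which is the limitation that complex-valued models such as ComplEx address.</p>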
    </sec>
    <sec id="sec-3">
      <title>3. ChemoKG</title>
      <p>ChemoKG is an original knowledge graph in RDF (Resource Description Framework) of
chemotherapy protocols. It encompasses 1,358 protocols by instantiating the ontology
ChemoOnto [23, 24], which provides the necessary classes and relations to represent the various
dimensions of protocols. As illustrated by Figure 1, this includes the administration program of a
chemotherapy, composed of its drugs, their dosages and the duration of their administration (bolus
vs. continuous infusion), as well as drug properties (such as half-life) imported from ChEBI [25] and
DrugBank, and drug classes from the ATC classification. The protocols themselves have been
extracted from a local database of the Pharmacy Service of the European Georges Pompidou
Hospital of the AP-HP, Paris. The resulting knowledge graph encompasses 329,164 triples with
99,901 entities and 75 relations including 1,358 chemotherapy protocols and 226 anti-cancer
drugs. ChemoKG includes both protocols used in standard treatments and in clinical studies.
Because protocols evaluated by clinical studies are confidential, we defined a subset of our
knowledge graph named ChemoKG-open that excludes them and that is consequently sharable.
ChemoKG-open includes 513 protocols and is available on Zenodo at
https://zenodo.org/records/10263831 (a SPARQL endpoint will be made available upon publication at
https://chemokg.inria.fr). The statistics of the main classes of ChemoKG and
ChemoKG-open are presented in Table 1.</p>
    </sec>
    <sec>
      <title>4. Method</title>
      <p>In order to compare chemotherapy protocols, we propose first to learn embeddings for protocols
represented in ChemoKG, second to cluster similar protocols and third to compare the resulting
clustering to two reference classifications: (i) protocols classified by cancer location, and (ii) by
pharmacological and therapeutic subgroups.</p>
      <p>Protocol embeddings. We propose an original approach named RAGE, standing for
Relation-Aware knowledge Graph Embedding, to compute node embeddings. Inspired by relation
learning in [26, 27], RAGE builds on the GNN model to aggregate information from each entity’s
neighborhood, together with relations that characterize the type of links connecting the entity to its
neighbors. An overview and naming of the main variables involved in RAGE are depicted in Figure 2.</p>
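      <p>Let <inline-formula><tex-math>\mathcal{G}</tex-math></inline-formula> be a knowledge graph such that
<inline-formula><tex-math>\mathcal{G} = \{(s, r, o) \mid s, o \in \mathcal{E}, r \in \mathcal{R}\}</tex-math></inline-formula>, where
<inline-formula><tex-math>\mathcal{E}</tex-math></inline-formula> is a set of nodes, here named entities,
<inline-formula><tex-math>\mathcal{R}</tex-math></inline-formula> is a set of labeled and oriented edges named relations or predicates, and the
triple <inline-formula><tex-math>(s, r, o)</tex-math></inline-formula> denotes that the entity <italic>s</italic> is related to the entity <italic>o</italic> through the relation <italic>r</italic>. For
example, a triple (<italic>p</italic>, has Administration, <italic>a</italic>) indicates that the protocol <italic>p</italic> has the drug administration <italic>a</italic>.
The <italic>l</italic>-th layer embedding of a given entity <italic>s</italic> is formulated as:
        <disp-formula><label>(1)</label><tex-math>\mathbf{e}_s^{l} = \frac{1}{|\mathcal{N}_s|} \sum_{(r, o) \in \mathcal{N}_s} \mathbf{e}_r^{l-1} \circ \mathbf{e}_o^{l-1},</tex-math></disp-formula>
where <inline-formula><tex-math>\mathcal{N}_s = \{(r, o) \mid (s, r, o) \in \mathcal{G}\}</tex-math></inline-formula> is the neighborhood of the entity <italic>s</italic>,
<inline-formula><tex-math>l \in \{1, \dots, L\}</tex-math></inline-formula> indexes the layers, <italic>L</italic> is the number of
layers and <inline-formula><tex-math>\circ</tex-math></inline-formula> is the element-wise product. For <italic>L</italic> layers, the final embedding of an entity <italic>s</italic> is
defined as the sum of each layer embedding:
        <disp-formula><label>(2)</label><tex-math>\mathbf{e}_s^{*} = \mathbf{e}_s^{0} + \cdots + \mathbf{e}_s^{L}.</tex-math></disp-formula>
Accordingly, the final embedding is the sum of embeddings from each layer including the input
<inline-formula><tex-math>\mathbf{e}_s^{0}</tex-math></inline-formula>. In this way, we gather all the information of the target entity <italic>s</italic> and its “<italic>L</italic>-hop” neighbors.
Because our task is to perform a clustering of protocols, it is important to consider drug properties
in the representation of protocols. For instance, ‘ATC classes’ or ‘biological roles’ can be captured
within 3-hop neighbors (Figure 1).</p>
      <p>Given a set of protocols <inline-formula><tex-math>\mathcal{P} \subset \mathcal{E}</tex-math></inline-formula>, the evaluation of how likely the relation between a protocol <italic>p</italic> and a
drug administration <italic>a</italic> is, is defined as follows:
        <disp-formula><label>(3)</label><tex-math>\hat{y}_{pa} = {\mathbf{e}_p^{*}}^{\top} \mathbf{e}_a^{*}.</tex-math></disp-formula>
If the final embeddings of protocol <italic>p</italic> and administration <italic>a</italic> derived from (2) are close (connected)
to each other, the evaluation value <inline-formula><tex-math>\hat{y}_{pa}</tex-math></inline-formula> is high, i.e., if the drug administration <italic>a</italic> is part of a protocol
<italic>p</italic>, <inline-formula><tex-math>\hat{y}_{pa}</tex-math></inline-formula> should be high, and conversely low if the drug administration is not. We define the objective
function using the Bayesian personalized ranking loss [30]:
        <disp-formula><label>(4)</label><tex-math>\mathcal{L} = \sum_{(p, a, a') \in \mathcal{O}} - \ln \sigma\left(\hat{y}_{pa} - \hat{y}_{pa'}\right),</tex-math></disp-formula>
where <inline-formula><tex-math>\mathcal{O} = \{(p, a, a') \mid (p, a) \in \mathcal{O}^{+}, (p, a') \in \mathcal{O}^{-}\}</tex-math></inline-formula> and
<inline-formula><tex-math>\sigma</tex-math></inline-formula> is the sigmoid function. <inline-formula><tex-math>\mathcal{O}^{+}</tex-math></inline-formula> is the set of
protocol and administered drug pairs and <inline-formula><tex-math>\mathcal{O}^{-}</tex-math></inline-formula> is the set of protocol and non-administered drug
pairs. The loss is minimized when the likelihood score of the administered drug increases and the likelihood
score of the non-administered drug decreases.</p>
      <p>Figure 2: Overview and naming of the main variables involved in RAGE. <italic>p</italic> is the node for which the final embedding is computed, here a protocol. <italic>a</italic> and <italic>a′</italic>
are nodes related and not related to <italic>p</italic>, respectively; here drug administrations. The remaining nodes are other nodes of the graph, related by predicates.
Plus signs represent the aggregative sum of embeddings. <inline-formula><tex-math>\hat{y}_{pa}</tex-math></inline-formula> is an evaluation of the
probability for nodes <italic>p</italic> and <italic>a</italic> to be linked.</p>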
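      <p>The neighborhood aggregation, the layer-wise sum, the dot-product evaluation and the ranking loss described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' PyTorch implementation: relation embeddings are kept fixed across layers, entities without outgoing edges receive a zero vector at each layer, and the function names and the four-node toy graph are hypothetical.</p>
      <preformat preformat-type="code">
```python
import numpy as np

def aggregate_layer(entity_emb, relation_emb, neighbors):
    """One layer: mean over (r, o) neighbors of e_r * e_o (element-wise product)."""
    new_emb = np.zeros_like(entity_emb)
    for s, pairs in neighbors.items():
        if pairs:  # assumption: entities without outgoing edges get a zero vector
            msgs = [relation_emb[r] * entity_emb[o] for r, o in pairs]
            new_emb[s] = np.mean(msgs, axis=0)
    return new_emb

def final_embeddings(entity_emb0, relation_emb, neighbors, num_layers=3):
    """Sum of the input embedding and of every layer embedding."""
    total, current = entity_emb0.copy(), entity_emb0
    for _ in range(num_layers):
        # assumption: relation embeddings are not updated between layers
        current = aggregate_layer(current, relation_emb, neighbors)
        total += current
    return total

def bpr_loss(final_emb, triples):
    """Sum over (p, a, a') of -ln sigmoid(y_pa - y_pa'), with y a dot product."""
    loss = 0.0
    for p, a_pos, a_neg in triples:
        diff = final_emb[p] @ final_emb[a_pos] - final_emb[p] @ final_emb[a_neg]
        loss += np.log1p(np.exp(-diff))  # equals -ln(sigmoid(diff))
    return float(loss)

# Toy graph: 0 = protocol, 1 = its administration, 2 = unrelated administration, 3 = drug.
# Relation 0 links a protocol to an administration, relation 1 an administration to a drug.
neighbors = {0: [(0, 1)], 1: [(1, 3)], 2: [], 3: []}
rng = np.random.default_rng(0)
entity_emb0 = rng.normal(size=(4, 8))
relation_emb = rng.normal(size=(2, 8))

final = final_embeddings(entity_emb0, relation_emb, neighbors, num_layers=3)
print(bpr_loss(final, [(0, 1, 2)]))
```
      </preformat>
      <p>In a real training loop the loss would be minimized by gradient descent over the embeddings; the sketch only evaluates it once to show how the pieces fit together.</p>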
      <p>
        Evaluation of embeddings on a link prediction task. The task of link prediction aims at
predicting potentially missing links between entities within a knowledge graph, on the basis of
what is already stated in the graph. To compare RAGE with other graph embedding approaches,
we evaluate their respective ability to predict the “has Administration” predicate between
protocols and drug administrations in ChemoKG. We particularly consider TransE [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], DistMult
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], MuRE [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], ComplEx [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and CompGCN [35]. TransE and MuRE are translational distance
models, which represent a relation as a translation between entity vectors and score a triple
with a distance measure. DistMult and ComplEx are semantic matching models
that use similarity-based scoring functions. CompGCN is a GCN-based model that
aggregates neighborhood information to compute entity and relation embeddings. These models
consider the relations between entities, but require adaptation to capture similarity between
strings and numerical values that compose literals, or distant relations in a graph. To compare the
performances of these different models, we use the Mean Reciprocal Rank (MRR) and Hits@N
(H@N). MRR evaluates models that return a ranked list of answers to queries by averaging, over queries,
the reciprocal of the rank of the correct answer. H@N is the proportion of positive triples that are
ranked in the top-N positions against a set of negative triples.
      </p>
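      <p>The two ranking metrics can be illustrated with a short, dependency-free Python sketch. The candidate scores below are made-up values for three hypothetical protocols with four candidate drug administrations each; the helper names are ours and do not correspond to the PyKEEN API.</p>
      <preformat preformat-type="code">
```python
def rank_of_true(scores, true_index):
    """Rank (1 = best) of the true candidate among all scored candidates."""
    true_score = scores[true_index]
    return 1 + sum(1 for s in scores if s > true_score)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n):
    """Fraction of test triples whose true candidate is ranked in the top n."""
    return sum(1 for r in ranks if n >= r) / len(ranks)

# Hypothetical model scores: one list of candidate scores per test protocol.
score_lists = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.8, 0.5, 0.1],
    [0.4, 0.6, 0.7, 0.3],
]
true_indices = [0, 2, 0]  # position of the true administration in each list

ranks = [rank_of_true(s, t) for s, t in zip(score_lists, true_indices)]
print(ranks)                        # ranks of the true candidates
print(mean_reciprocal_rank(ranks))  # MRR
print(hits_at_n(ranks, 1), hits_at_n(ranks, 3))
```
      </preformat>
      <p>Here the true candidates are ranked first, second and third, so MRR averages 1, 1/2 and 1/3, while Hits@1 counts only the first query as a success.</p>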
      <p>Cluster computations and evaluation. We cluster a subset of nodes <inline-formula><tex-math>\mathcal{P} \subset \mathcal{E}</tex-math></inline-formula> of a knowledge graph
on the basis of the Euclidean distance between their embeddings, computed with a selection of
two embedding approaches (RAGE included). In this exploratory study we compare the
performance of several classical clustering algorithms, namely k-means, single-linkage clustering (Single) and OPTICS [28,
29]. Both k-means and Single take the number of desired clusters as an input parameter. Single
differs from k-means in that it is a hierarchical clustering algorithm that successively merges
the clusters whose closest observations are at minimal distance. OPTICS is fundamentally
different, as it finds zones of high density and expands clusters from them. Its main input
parameter is the minimal size of a cluster.</p>
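      <p>As a rough sketch of how these three algorithms can be run with scikit-learn on embedding vectors, consider the example below. The embeddings are synthetic stand-ins (two well-separated Gaussian groups), and the parameter values are illustrative assumptions rather than the settings used in our experiments.</p>
      <preformat preformat-type="code">
```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, OPTICS

# Synthetic stand-in for protocol embeddings: two well-separated groups of 10 points.
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(10, 5)),
    rng.normal(loc=5.0, scale=0.1, size=(10, 5)),
])

# k-means and single linkage both take the desired number of clusters.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
single_labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(embeddings)

# OPTICS is density-based: it takes a minimal cluster size instead of a cluster count.
# Points that belong to no dense zone are labeled -1 (noise).
optics_labels = OPTICS(min_samples=3, min_cluster_size=5).fit_predict(embeddings)

print(kmeans_labels, single_labels, optics_labels)
```
      </preformat>
      <p>All three estimators use the Euclidean distance by default, which matches the distance used to compare embeddings in this study.</p>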
      <p>We evaluate our clusters against two reference classifications of protocols. The first
reference classification groups protocols by their primary indication, i.e., the cancer location
they primarily target according to our pharmacology experts. The second classification is based
on the sets of 3rd-level ATC classes of the drugs involved in protocols. These classifications are
available at https://chemokg.inria.fr, for protocols of ChemoKG-open.</p>
      <p>Clustering analyses are evaluated with the Adjusted Rand Index (ARI), Normalized Mutual
Information (NMI) and Fowlkes-Mallows Index (FMI). ARI measures the agreement between
two clusterings (or between one clustering and a reference classification in our case). ARI is close to 0 for a
random labeling and equals 1 for identical labelings, and is adjusted to limit the effect of chance.
NMI measures the mutual information between two clusterings, normalized by the entropy of each
clustering. NMI is equal to 1 for identical labelings. FMI is the geometric mean of pairwise precision
and recall. FMI ranges from 0 to 1 and a high value indicates a high similarity between the
clustering and the reference classification. In addition, we compare our method based on graph
embeddings to a state-of-the-art means of comparing chemotherapies, named the Cumulative Dose
Intensity (CDI) [31]. The CDI is defined as a vector of normalized cumulative doses of
administered drugs. It is used to compare the course of chemotherapies, as well as protocols.</p>
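      <p>To make the pair-counting view of ARI and FMI concrete, the following dependency-free Python sketch computes both from the four pair counts (pairs grouped together in both labelings, in only one, or in neither). The reference labels and cluster assignments are invented for illustration; in practice the equivalent scikit-learn functions would normally be used.</p>
      <preformat preformat-type="code">
```python
from itertools import combinations
from math import sqrt

def pair_counts(labels_true, labels_pred):
    """Count item pairs by whether they share a reference class and/or a cluster."""
    tp = fp = fn = tn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def fowlkes_mallows(labels_true, labels_pred):
    """Geometric mean of pairwise precision and pairwise recall."""
    tp, fp, fn, _ = pair_counts(labels_true, labels_pred)
    if tp == 0:
        return 0.0
    return sqrt((tp / (tp + fp)) * (tp / (tp + fn)))

def adjusted_rand(labels_true, labels_pred):
    """Adjusted Rand Index written in terms of pair counts."""
    tp, fp, fn, tn = pair_counts(labels_true, labels_pred)
    denom = (tp + fn) * (fn + tn) + (tp + fp) * (fp + tn)
    if denom == 0:  # both labelings are trivial and identical
        return 1.0
    return 2.0 * (tp * tn - fn * fp) / denom

# Hypothetical reference classes (cancer locations) and cluster assignments.
reference = ["lung", "lung", "breast", "breast", "colon", "colon"]
clusters = [0, 0, 1, 1, 1, 2]
print(pair_counts(reference, clusters))
print(adjusted_rand(reference, clusters), fowlkes_mallows(reference, clusters))
```
      </preformat>
      <p>In this toy case one colon protocol is wrongly merged with the breast cluster, which lowers pairwise precision and recall and therefore both indices.</p>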
    </sec>
    <sec id="sec-4">
      <title>5. Experimental results</title>
      <p>The experiment was conducted in two steps. First, to assess the effectiveness of the
chemotherapy protocol representation, we compared RAGE with other graph embedding
methods on a link prediction task on ChemoKG. Second, to explore the features and patterns within
groups of chemotherapy protocols, clustering was performed on the embeddings of protocols
obtained from the first step. This work is implemented with PyTorch [32], PyKEEN [33] for graph
embeddings and scikit-learn [34] for clustering.</p>
      <sec id="sec-4-1">
        <title>5.1. Graph embeddings on ChemoKG</title>
        <p>We compared the performance of RAGE and state-of-the-art approaches on a link prediction task.
A 10-fold cross-validation was conducted on the triples in ChemoKG. Parameters were
initialized using Xavier uniform initialization [36]. For RAGE, entity vectors were
initialized with embeddings pre-trained using TransE. The learning of entity embeddings was then continued using
the RAGE model with three layers (<inline-formula><tex-math>L = 3</tex-math></inline-formula>). The number of layers used in CompGCN was also 3. All
the baseline models were optimized using Adam [37]; the learning rate was 0.01 with exponential
decay and the output dimension of entities was 100.</p>
        <p>The performance of RAGE and the baseline methods on link prediction is reported in Table
2. Overall, we observed that RAGE outperformed four baselines and was competitive with MuRE on
all metrics. MuRE showed the best performance on average over all metrics. We observed that
translational distance models (in particular MuRE) performed well compared with semantic matching models
(i.e., DistMult, ComplEx) and CompGCN. The performance of RAGE is somewhat lower than that of MuRE,
probably because it learns relations between protocols and administrations only, rather than
between all the entities and relations. This observation also seems to impact the clustering results
reported in Section 5.2.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Chemotherapy protocol clustering</title>
        <p>Clustering was performed on the embeddings of protocols obtained using RAGE and MuRE.
Protocols assigned to the same cluster are expected to be similar with regard to their definition in
ChemoKG. Results of our comparative study of three clustering algorithms and their ability to
reflect our two reference classifications are shown in Tables 3 and 4. For cancer locations, RAGE
and the combination of CDI and RAGE showed better performance than CDI alone and MuRE. For
ATC, RAGE also showed better performance than CDI and MuRE. We infer that drug properties
such as biological roles and half-lives were beneficial for grouping protocols by cancer location,
information that is absent from the graph. The ATC-level information present in ChemoKG is presumably
exploited by RAGE, which would explain the good grouping of protocols according to ATC classes.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Discussion and conclusion</title>
      <p>This paper is an initial attempt to evaluate the suitability of learning graph embeddings to classify
and identify similar chemotherapy protocols, in a setting where many protocols are offered to
clinicians, with potentially unequal levels of evaluation, consequently generating a clinical
decision-making challenge.</p>
      <p>From a knowledge representation and open resource point of view, our knowledge graph is
aligned with standard ontologies, but it would benefit from mappings to other initiatives, in
particular HemOnc. This task is not trivial, as the HemOnc schema is different from and less
precise than ours. However, the graph embedding approach we described could highlight highly
similar protocols from HemOnc and ChemoKG, and in this sense has the potential to guide the
mapping between the two resources.</p>
      <p>We acknowledge that the work presented here would gain from additional experiments. First,
our graph of protocols encompasses some numerical values, such as drug dose and half-life. None of the
graph embedding approaches we considered enables arithmetic comparisons of numerical
values. This suggests that approaches that enable such comparisons, such as KEN [38], could
lead to improvements. Also, we experimented with recursive approaches (RAGE and CompGCN) only
with <inline-formula><tex-math>L = 3</tex-math></inline-formula> (i.e., size of the considered neighborhood), and without including inferable links in the
graph. Increasing L would enable embeddings to consider more of the graph, potentially leading
to improvements. The inclusion of inferable links would add direct links between drugs and
higher ATC or ChEBI ontology classes, which could help in identifying similar protocols. However,
this extension is associated with a risk of flooding embeddings with general classes and
should consequently be tested with caution. Regarding the evaluation of the clustering, we only provided
external evaluation metrics, i.e., metrics that compare one clustering with a reference
classification. This could be enriched with internal evaluation metrics such as the Dunn index or
silhouette score, which evaluate to what extent instances assigned to one cluster are both close to each
other and distant from instances assigned to other clusters. Indeed, the two reference
classifications are not a ground truth, but references one may want to compare to. In this setting,
the internal quality of the clustering is a pertinent metric, which could also enable the
comparison of various clustering strategies.</p>
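      <p>As an indication of what such an internal metric measures, the sketch below implements the mean silhouette score with NumPy for a small set of two-dimensional points. It is only an illustration on invented data and assumes at least two clusters; it was not part of the experiments reported here.</p>
      <preformat preformat-type="code">
```python
import numpy as np

def silhouette_score(points, labels):
    """Mean silhouette over all points (Euclidean distance, at least two clusters)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():  # singleton cluster: silhouette is 0 by convention
            scores.append(0.0)
            continue
        a = dists[i, same].mean()  # mean distance to the other points of its own cluster
        b = min(                   # mean distance to the nearest other cluster
            dists[i, labels == c].mean()
            for c in set(labels.tolist())
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy data: two tight pairs of points far apart from each other.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]]
good = silhouette_score(points, [0, 0, 1, 1])  # clusters follow the two pairs
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters mix the two pairs
print(good, bad)
```
      </preformat>
      <p>A clustering that follows the geometry of the embeddings obtains a score close to 1, whereas a clustering that mixes distant points obtains a low or negative score, without requiring any reference classification.</p>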
      <p>Nonetheless, our experiments illustrate the level of performance of various graph embedding
approaches on ChemoKG and their usability for learning meaningful clusters. We will pursue our
efforts to facilitate the data and knowledge management associated with chemotherapy protocols
and would like to expand our work from protocols to patient data, and study how protocols are
respected, or modified to adapt to individuals and what is the impact on patient outcomes.</p>
    </sec>
    <sec id="sec-6">
      <title>Contributions</title>
      <p>JHJ populated ChemoOnto with protocols, enriched it with additional drug properties;
co-designed the study; designed RAGE; implemented and ran the experiments; wrote the first
version of this manuscript. AR created ChemoOnto and designed the instantiation of ChemoOnto
with protocols; participated in the writing. DG, EP and BS guided the motivation of the work and
provided the two reference classifications. BR participated in the design of ChemoOnto,
ChemoKG and of this study, and in the writing. AC participated in the design of ChemoOnto and
ChemoKG, co-designed this study, and contributed to the writing.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work is supported by Inria Paris and the CombO project. This work has benefited from a
government grant managed by the Agence Nationale de la Recherche under the France 2030
program, references ANR-22-PESN-0007 (ShareFAIR) and ANR-22-PESN-0008 (NEUROVASC).</p>
      <p>[19] Ji, S., Pan, S., Cambria, E., Marttinen, P., &amp; Yu, P. S. (2021). A survey on knowledge graphs:
Representation, acquisition, and applications. IEEE Transactions on Neural Networks and
Learning Systems, 33(2), 494-514.</p>
      <p>[20] Monnin, P., Raïssi, C., Napoli, A., &amp; Coulet, A. (2022). Discovering alignment relations with
Graph Convolutional Networks: A biomedical case study. Semantic Web, 13(3), 379-398.</p>
      <p>[21] Mohamed, S. K., Nounu, A., &amp; Nováček, V. (2021). Biological applications of knowledge graph
embedding models. Briefings in Bioinformatics, 22(2), 1679-1693.</p>
      <p>[22] Fernández-Torras, A., Duran-Frigola, M., Bertoni, M., Locatelli, M., &amp; Aloy, P. (2022).
Integrating and formatting biomedical data as pre-calculated knowledge graph embeddings
in the Bioteque. Nature Communications, 13(1), 5304.</p>
      <p>[23] Rogier, A., Rance, B., &amp; Coulet, A. (2023). ChemoOnto, an ontology to qualify the course of
chemotherapies. Bio-ontologies COSI 2023, Poster.</p>
      <p>[24] Rogier, A., Rance, B., &amp; Coulet, A. (2024, January 22). ChemoOnto, an ontology to qualify the
course of chemotherapies. ISMB-ECCB 2023, Lyon. Zenodo.
https://doi.org/10.5281/zenodo.10548491</p>
      <p>[25] Hastings, J., Owen, G., Dekker, A., Ennis, M., Kale, N., Muthukrishnan, V., ... &amp; Steinbeck, C.
(2016). ChEBI in 2016: Improved services and an expanding collection of metabolites.
Nucleic Acids Research, 44(D1), D1214-D1219.</p>
      <p>[26] Monnin, P., Raïssi, C., Napoli, A., &amp; Coulet, A. (2022). Discovering alignment relations with
Graph Convolutional Networks: A biomedical case study. Semantic Web, 13(3), 379-398.</p>
      <p>[27] Wang, X., Huang, T., Wang, D., Yuan, Y., Liu, Z., He, X., &amp; Chua, T. S. (2021, April). Learning
intents behind interactions with knowledge graph for recommendation. In Proceedings of
the Web Conference 2021 (pp. 878-887).</p>
      <p>[28] Everitt, B. S., Landau, S., Leese, M., &amp; Stahl, D. (2011). Cluster analysis. John Wiley &amp; Sons.</p>
      <p>[29] Ankerst, M., Breunig, M. M., Kriegel, H. P., &amp; Sander, J. (1999). OPTICS: Ordering points to
identify the clustering structure. ACM SIGMOD Record, 28(2), 49-60.</p>
      <p>[30] Rendle, S., Freudenthaler, C., Gantner, Z., &amp; Schmidt-Thieme, L. (2012). BPR: Bayesian
personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618.</p>
      <p>[31] Longo, D. L., Duffey, P. L., DeVita Jr, V. T., Wesley, M. N., Hubbard, S. M., &amp; Young, R. C. (1991).
The calculation of actual or received dose intensity: a comparison of published methods. J
Clin Oncol, 9(11), 2042-2051.</p>
      <p>[32] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al. (2019). PyTorch: An
imperative style, high-performance deep learning library. Advances in Neural Information
Processing Systems, 32.</p>
      <p>[33] Ali, M., Berrendorf, M., Hoyt, C. T., Vermue, L., Galkin, M., Sharifzadeh, S., Fischer, A., Tresp, V., &amp;
Lehmann, J. (2021). Bringing light into the dark: A large-scale evaluation of knowledge graph
embedding models under a unified framework. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 44(12), 8825-8845.</p>
      <p>[34] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011).
Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12,
2825-2830.</p>
      <p>[35] Vashishth, S., Sanyal, S., Nitin, V., &amp; Talukdar, P. (2019). Composition-based multi-relational
graph convolutional networks. arXiv preprint arXiv:1911.03082.</p>
      <p>[36] Glorot, X., &amp; Bengio, Y. (2010, March). Understanding the difficulty of training deep
feedforward neural networks. In Proceedings of the thirteenth international conference on
artificial intelligence and statistics (pp. 249-256). JMLR Workshop and Conference
Proceedings.</p>
      <p>[37] Kingma, D. P., &amp; Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980.</p>
      <p>[38] Cvetkov-Iliev, A., Allauzen, A., &amp; Varoquaux, G. (2023). Relational data embeddings for
feature enrichment with background information. Machine Learning, 112(2), 687-720.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>DeVita Jr</surname>
            ,
            <given-names>V. T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>A history of cancer chemotherapy</article-title>
          .
          <source>Cancer research</source>
          ,
          <volume>68</volume>
          (
          <issue>21</issue>
          ),
          <fpage>8643</fpage>
          -
          <lpage>8653</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Khongorzul</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>F. U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ihsan</surname>
            ,
            <given-names>A. U.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Antibody-drug conjugates: a comprehensive review</article-title>
          .
          <source>Molecular Cancer Research</source>
          ,
          <volume>18</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Warner</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cowan</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>A. C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>P. C.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>HemOnc.org: A collaborative online knowledge platform for oncology professionals</article-title>
          .
          <source>Journal of Oncology Practice</source>
          ,
          <volume>11</volume>
          (
          <issue>3</issue>
          ),
          <fpage>e336</fpage>
          -
          <lpage>e350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Riaño</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peleg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>&amp; Ten</given-names>
            <surname>Teije</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Ten years of knowledge representation for health care (2009-2018): Topics, trends, and challenges</article-title>
          .
          <source>Artificial intelligence in medicine</source>
          ,
          <volume>100</volume>
          ,
          <fpage>101713</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Rubinstein</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>P. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cowan</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Warner</surname>
            ,
            <given-names>J. L.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Standardizing chemotherapy regimen nomenclature: a proposal and evaluation of the HemOnc and National Cancer Institute Thesaurus Regimen Content</article-title>
          .
          <source>JCO Clinical Cancer Informatics</source>
          ,
          <volume>4</volume>
          ,
          <fpage>60</fpage>
          -
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Bright</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawton</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bomb</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dodwell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Henson</surname>
            ,
            <given-names>K. E.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Smittenaar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2020</year>
          ).
          <article-title>Data resource profile: the systemic anti-cancer therapy (SACT) dataset</article-title>
          .
          <source>International journal of epidemiology</source>
          ,
          <volume>49</volume>
          (
          <issue>1</issue>
          ),
          <fpage>15</fpage>
          -
          <lpage>15l</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Klimes</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smid</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kubásek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vyzula</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dusek</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>DIOS-database of formalized chemotherapeutic regimens</article-title>
          .
          <source>In EFMI-STC</source>
          (pp.
          <fpage>165</fpage>
          -
          <lpage>169</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Knowledge graph embedding: A survey of approaches and applications</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>29</volume>
          (
          <issue>12</issue>
          ),
          <fpage>2724</fpage>
          -
          <lpage>2743</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Knowledge graph refinement: A survey of approaches and evaluation methods</article-title>
          .
          <source>Semantic web</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          ),
          <fpage>489</fpage>
          -
          <lpage>508</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Reasoning with neural tensor networks for knowledge base completion</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>26</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Gad-Elrab</surname>
            ,
            <given-names>M. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stepanova</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>T. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2020</year>
          , November).
          <article-title>ExCut: Explainable embedding-based clustering over knowledge graphs</article-title>
          .
          <source>In International Semantic Web Conference</source>
          (pp.
          <fpage>218</fpage>
          -
          <lpage>237</lpage>
          ). Cham: Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Usunier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Duran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yakhnenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Translating embeddings for modeling multi-relational data</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>26</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          (
          <year>2014</year>
          , June).
          <article-title>Knowledge graph embedding by translating on hyperplanes</article-title>
          .
          <source>In Proceedings of the AAAI conference on artificial intelligence</source>
          (Vol.
          <volume>28</volume>
          , No. 1).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G. S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>26</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Trouillon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welbl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaussier</surname>
            ,
            <given-names>É.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bouchard</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2016</year>
          , June).
          <article-title>Complex embeddings for simple link prediction</article-title>
          .
          <source>In International conference on machine learning</source>
          (pp.
          <fpage>2071</fpage>
          -
          <lpage>2080</lpage>
          ). PMLR.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Balazevic</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hospedales</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Multi-relational Poincaré graph embeddings</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <volume>32</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Ristoski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>RDF2Vec: RDF graph embeddings for data mining</article-title>
          .
          <source>In The Semantic Web-ISWC 2016: 15th International Semantic Web Conference, Kobe, Japan, October 17-21, 2016, Proceedings, Part I</source>
          (pp.
          <fpage>498</fpage>
          -
          <lpage>514</lpage>
          ). Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Knowledge graph embedding: A survey of approaches and applications</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>29</volume>
          (
          <issue>12</issue>
          ),
          <fpage>2724</fpage>
          -
          <lpage>2743</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>