<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Matching Pharmacogenomic Knowledge: Particularities, Results, and Perspectives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierre Monnin</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrien Coulet</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre de Recherche des Cordeliers, Inserm, Université Paris Cité, Sorbonne Université</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inria Paris</institution>
          ,
          <addr-line>Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Orange</institution>
          ,
          <addr-line>Belfort</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Knowledge in pharmacogenomics (PGx) is scattered across several resources, e.g., reference databases and the biomedical literature. Matching their content would thus lead to a consolidated view of the available PGx knowledge that could, in turn, support multiple downstream applications, including knowledge curation and precision medicine. However, matching atomic units of PGx knowledge is challenging due to their peculiarities: they are n-ary, represented with heterogeneous vocabularies, and expressed at various levels of granularity. In this paper, we frame the matching of PGx knowledge units of various provenance as an instance matching problem. We summarize our work to represent such units within a knowledge graph named PGxLOD, and to match them with a rule-based approach and a graph embedding-based approach. We then discuss the remaining challenges and how our research artifacts, which are open to the community, could foster new benchmarks and methods for structure-based instance matching.</p>
      </abstract>
      <kwd-group>
        <kwd>Instance Matching</kwd>
        <kwd>n-ary Tuple</kwd>
        <kwd>Preorder</kwd>
        <kwd>Graph Embedding</kwd>
        <kwd>Pharmacogenomics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The increasing adoption of Linked Open Data (LOD) principles as well as Semantic Web
standards and technologies leads to an ever-growing number of resources being published online.
Consequently, the knowledge of a domain can be scattered across several complementary,
potentially overlapping, resources. That is to say, both similar and complementary knowledge units
may be represented across different resources. The aforementioned standards and technologies
also allow the concurrent editing of resources by human and software agents. This can lead to
duplicates within the same resource. Thus, matching the content of such resources is a first and
necessary step towards offering a consolidated view of the available knowledge of a domain.
However, matching similar knowledge units within and across resources requires handling issues
such as differences in vocabularies or in levels of granularity in the representation of knowledge
units.</p>
      <p>This general observation is also valid in pharmacogenomics (PGx), a domain that studies the
influence of genetic factors on drug response phenotypes. Indeed, state-of-the-art knowledge
in PGx is mainly scattered across specialized databases such as PharmGKB, and the biomedical
literature, i.e., PubMed. Electronic Health Records (EHRs) can also be mined to extract PGx
knowledge. In addition to this scattering, PGx knowledge also suffers from heterogeneous
levels of validation. Indeed, some PGx knowledge units have been extensively studied and validated,
and are implemented in clinical practice. In contrast, others have only been observed on
small cohorts of patients and remain to be further studied and confirmed. Consequently,
matching PGx knowledge across resources would, first, offer a consolidated view of the available
knowledge of the domain and, second, ease knowledge curation. Indeed, PharmGKB is
manually fed by human curators who continuously review the literature. Connecting similar
PGx knowledge units across articles would ease their work by guiding them to sets of relevant
articles to consider jointly and compare. This would facilitate the validation, or moderation, of
state-of-the-art knowledge.</p>
      <p>In this paper, we specify the problem of matching PGx knowledge units of various
provenance and its challenges (Section 2). Then, we summarize our research results consisting of
a knowledge graph (KG) named PGxLOD, and two matching approaches (Section 3). Finally,
we outline new research directions, and advocate for the community reuse of our produced
research artifacts (Section 4).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Setting</title>
      <p>Matching PGx knowledge requires tackling several specific issues. Indeed, PGx knowledge
units are n-ary relationships whose arguments are sets of individuals, e.g., the sets of involved
drugs, the sets of involved genetic factors, and the sets of involved phenotypes (Figure 1a).
Such a relationship states that a patient being treated with the specified drugs while having the
specified genetic factors will likely experience the specified phenotypes. To illustrate, Figure 1b
depicts a state-of-the-art PGx relationship. It should be noted that, on the application side,
such knowledge units are named pharmacogenomic relationships. However, in a mathematical
formalism or in Semantic Web standards, such units are actually relation instances or relation tuples.
To avoid any ambiguity, we will only refer to PGx knowledge units as PGx tuples.</p>
      <p>We choose to represent PGx tuples in a KG that follows Semantic Web formalisms to
interconnect with ontologies and additional LOD about drugs, genetic factors, and phenotypes. Such
interconnections provide additional knowledge that we leverage during the matching process.
In such formalisms, only binary predicates exist. Consequently, PGx tuples are reified: tuples
become individuals that are linked to their components with binary predicates. For example, in
Figure 1c, the tuple pgt1 is an individual linked to its components with the causes predicate. In
such a view, the matching of PGx tuples comes down to an instance matching task. Due to the
reification of tuples, this process will solely rely on a comparison of neighbors of tuples in the
graph, which corresponds to a structure-based matching approach.</p>
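      <p>As a minimal illustration of reification and of neighbor-based comparison, the following Python sketch stores the tuple of Figure 1c as binary triples. The direction of each causes edge, the second tuple pgt2, and the helper names neighbors and same_neighborhood are assumptions made for this sketch; it is not the representation used in PGxLOD.</p>
      <preformat>
```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object) mirroring Figure 1c.
# The direction of each "causes" edge is an assumption made for this sketch.
TRIPLES = [
    ("warfarin", "causes", "pgt1"),
    ("CYP2C9", "causes", "pgt1"),
    ("pgt1", "causes", "bleeding"),
    # pgt2 is an invented second tuple, added only to have something to compare.
    ("warfarin", "causes", "pgt2"),
    ("CYP2C9", "causes", "pgt2"),
    ("pgt2", "causes", "bleeding"),
]


def neighbors(triples, node):
    """Collect the neighbors of a reified tuple, keyed by edge direction and predicate."""
    result = defaultdict(set)
    for subject, predicate, obj in triples:
        if subject == node:
            result[("out", predicate)].add(obj)
        if obj == node:
            result[("in", predicate)].add(subject)
    return dict(result)


def same_neighborhood(triples, first, second):
    """Structure-based test: two tuples match when their neighborhoods coincide."""
    return neighbors(triples, first) == neighbors(triples, second)
```
      </preformat>
      <p>Under these assumptions, same_neighborhood(TRIPLES, "pgt1", "pgt2") holds because both reified tuples have exactly the same neighbors. Strict equality of neighborhoods is only the simplest case of structure-based matching.</p>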
      <fig id="fig1">
        <label>Figure 1</label>
        <caption>
          <p>(a) Abstract relationship between a set of drugs, a set of genetic factors, and a set of phenotypes. (b) Example relationship: {Codeine 25mg oral}, {CYP2D6*4}, {No effect}. (c) Reified relationship: the individual pgt1 is linked to warfarin, CYP2C9, and bleeding with the causes predicate.</p>
        </caption>
      </fig>
      <p>Besides their arity, matching PGx tuples requires handling their incomplete and heterogeneous
representations (Figure 2). These issues inherently lead to various types of alignments between
tuples, which is somewhat unusual in instance matching. For example, tuples can be identical, e.g.,
the tuples on the bottom left of Figure 2. Identical alignments may need to rely on translations and synonyms.
Some tuples may be more specific than others, e.g., the tuples on the right. Matching such granularity
differences may rely on domain-specific orderings such as ontology hierarchies. In addition,
a specified argument is more specific than an unknown argument, e.g., vascular disorders is
more specific than an unknown argument (denoted ??). Tuples may also be related when their arguments are comparable w.r.t.
some orderings but are not all ordered in the same way (e.g., the tuples at the bottom). This type
of alignment can also connect tuples whose arguments are somehow related without being
strictly comparable.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Representing and Matching PGx Knowledge</title>
      <p>
        We developed our own ontology, PGxO, and our own KG, PGxLOD<sup>1</sup> [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], to represent PGx tuples
of various provenance. PGxO provides a reduced set of classes and predicates to represent
reified PGx tuples. We populated PGxO to create PGxLOD by extracting 50,425 PGx tuples
from three main resources: structured data of PharmGKB (3,650 tuples), semi-structured data
(namely clinical annotations) of PharmGKB (10,240 tuples), and PubMed abstracts (36,535 tuples).
PGxLOD also includes knowledge about drugs, genetic factors, and phenotypes by integrating
several LOD graphs and ontologies (e.g., DisGeNET, DrugBank, MeSH). PGxLOD is publicly
available and respects LOD and FAIR principles.
      </p>
      <p>
        To align PGx tuples represented in PGxLOD, we first proposed a symbolic rule-based
approach, named tcn3r<sup>2</sup> [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this approach, we see PGx tuples in their mathematical
form of n-ary tuples, where each argument is a set of individuals, e.g., from Figure 1c,
pgt1 = ({warfarin}, {CYP2C9}, {bleeding}). In this view, matching two tuples comes down
to comparing their arguments pairwise, before concluding on the relatedness of the two tuples.
To this aim, we define two preorders ≼p and ≼ that consider ontology statements, and thus
enrich the comparison provided by set operators (i.e., = and ⊆). We propose in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] five matching
rules that conclude on five relatedness levels between tuples. The many resulting
alignments illustrate the potential of our approach and provide insights into PGxLOD. Rule 5, which
concludes on the weakest relatedness level, generated the most inter-resource alignments, which
emphasizes the importance of weaker relatedness levels to align resources and overcome their
heterogeneity.
      </p>
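      <p>The argument-wise comparison described above can be sketched as follows in Python. This is an illustration under stated assumptions, not the tcn3r implementation: the hierarchy in PARENTS is invented, the preorder more_specific_or_equal is a simplified stand-in for the preorders of [2], and the four outcomes returned by compare_tuples do not reproduce the five rules or relatedness levels of [2]. An empty set is used here to stand for an unknown argument.</p>
      <preformat>
```python
# Hypothetical "more specific than" links between individuals (child -> parents).
# These hierarchy statements are invented for the sketch, not taken from PGxLOD.
PARENTS = {
    "bleeding": {"vascular disorders"},
    "CYP2C9*3": {"CYP2C9"},
}


def ancestors_or_self(item):
    """Return the item together with everything reachable through PARENTS."""
    seen = {item}
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in PARENTS.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def more_specific_or_equal(specific, general):
    """Preorder on argument sets: every general item is covered by a specific one.

    An empty general set stands for an unknown argument, so any set is at
    least as specific as it.
    """
    return all(
        any(g in ancestors_or_self(s) for s in specific)
        for g in general
    )


def compare_tuples(first, second):
    """Compare two tuples argument by argument and name the resulting relation."""
    forward = all(more_specific_or_equal(a, b) for a, b in zip(first, second))
    backward = all(more_specific_or_equal(b, a) for a, b in zip(first, second))
    if forward and backward:
        return "equivalent"
    if forward:
        return "first more specific"
    if backward:
        return "second more specific"
    return "not comparable"
```
      </preformat>
      <p>With these assumptions, ({"warfarin"}, {"CYP2C9"}, {"bleeding"}) is found more specific than the same tuple whose phenotype is vascular disorders or is unknown. Two tuples whose arguments are ordered in opposite directions are reported as not comparable, which is the kind of case that calls for weaker relatedness levels.</p>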
      <p>
        To cope with this need for flexibility, we considered KG embedding models. Indeed, the
continuous nature of graph embeddings may provide the needed flexibility. We framed our task
as a node clustering task performed on the embedding space<sup>3</sup> [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Consequently, we learn
node embeddings using Graph Convolutional Networks and the Soft Nearest Neighbor Loss
such that similar PGx tuples have low distances between their embeddings. Then, we apply a
clustering algorithm on the node embeddings and consider nodes assigned to the same cluster as
similar. To learn node embeddings, we built gold clusters that are based on the alignments
output by our rule-based approach. We showed that integrating domain knowledge by adding
inferences in the KG improves clustering performance. We also observed that distances in the
embedding space are coherent with the “strength” of the different alignments (e.g., smaller
distances for equivalences, larger for weak relations). This result amounts, in a sense, to a
rediscovery of KG semantics in the embedding space.
      </p>
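      <p>To make the training objective more concrete, the following Python sketch computes the Soft Nearest Neighbor Loss on a small batch of already-computed embeddings. It is a plain-Python illustration under assumptions: the Graph Convolutional Network that would produce the embeddings is omitted, the squared Euclidean distance and the default temperature are choices made for this sketch, and the function name is hypothetical. Labels play the role of gold clusters.</p>
      <preformat>
```python
import math


def soft_nearest_neighbor_loss(points, labels, temperature=1.0):
    """Soft Nearest Neighbor Loss over a small batch of embeddings.

    For each point, compare the similarity mass of same-label points with the
    mass of all other points; the loss is low when same-label points are close.
    Points whose same-label mass is zero (no partner in the batch, or
    numerical underflow) are skipped.
    """
    total = 0.0
    counted = 0
    for i, (point_i, label_i) in enumerate(zip(points, labels)):
        same = 0.0
        everyone = 0.0
        for j, (point_j, label_j) in enumerate(zip(points, labels)):
            if i == j:
                continue
            squared = sum((a - b) ** 2 for a, b in zip(point_i, point_j))
            weight = math.exp(-squared / temperature)
            everyone += weight
            if label_j == label_i:
                same += weight
        if same > 0.0:
            total -= math.log(same / everyone)
            counted += 1
    return total / counted if counted else 0.0
```
      </preformat>
      <p>On embeddings that form two tight groups, the loss is close to zero when labels follow the groups and much larger when labels cut across them, which is the behavior that pushes similar PGx tuples towards low mutual distances during training.</p>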
    </sec>
    <sec id="sec-4">
      <title>4. Discussion &amp; Perspectives</title>
      <p>PGxLOD only contains PGx tuples from the state of the art. Hence, one major perspective
is to develop an “observational” version that includes the results of EHR mining or EHRs themselves.
However, such an integration raises several issues related to text mining and data privacy
(e.g., anonymization, access control). Beyond biomedical research, we think PGxLOD is a useful
resource for Computer Science researchers. Indeed, we illustrated with our matching approaches
the various and challenging characteristics of PGxLOD: integration of several data sets (i.e.,
owl:sameAs links) and ontologies (i.e., hierarchies of classes and predicates, predicate inverses
and symmetry), and a medium size (i.e., scalability issues). For these reasons, PGxLOD constitutes
an interesting real-world KG on which to experiment with matching approaches. That is why we intend to
propose it for consideration in the Ontology Alignment Evaluation Initiative.</p>
      <p><sup>1</sup> https://pgxo.loria.fr - https://pgxlod.loria.fr</p>
      <p><sup>2</sup> https://github.com/pmonnin/tcn3r</p>
      <p><sup>3</sup> https://github.com/pmonnin/gcn-matching</p>
      <p>Both our matching approaches showed how domain knowledge and reasoning mechanisms
can serve structure-based matching. These approaches also present complementary strengths
and could reinforce each other. Two perspectives now lie in (i) learning new rules from clusters
output by our embedding-based approach and (ii) updating alignments between PGx tuples
when new tuples are integrated. Other aspects of PGx knowledge and metadata are not currently
considered but pave the way for future work. For example, PGxO allows representing negation
within PGx tuples (e.g., drugs not causing a tuple). Consequently, we could match contradictory
tuples, which raises several questions such as the geometric representation of contradiction in
the embedding space. We could also tackle the task of knowledge validation, i.e., confirming or
moderating a knowledge unit based on similar or contradictory units existing in other resources.
This would require leveraging alignments and heterogeneous quality metadata such as levels
of evidence in PharmGKB, or odds ratios in biomedical articles. Such a knowledge validation
approach would, in turn, realize our ultimate objective of offering a consolidated view of PGx
knowledge to clinicians.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was supported by the French National Research Agency (ANR) through the
PractiKPharma project (ANR-15-CE23-0028 - http://practikpharma.loria.fr).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Monnin</surname>
          </string-name>
          , et al.,
          <article-title>PGxO and PGxLOD: a reconciliation of pharmacogenomic knowledge of various provenances, enabling further comparison</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>20-S</volume>
          (
          <year>2019</year>
          )
          <fpage>139:1</fpage>
          -
          <lpage>139:16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Monnin</surname>
          </string-name>
          , et al.,
          <article-title>Knowledge-based matching of n-ary tuples</article-title>
          ,
          in:
          <source>Ontologies and Concepts in Mind and Machine - 25th International Conference on Conceptual Structures, ICCS 2020</source>
          , volume
          <volume>12277</volume>
          of
          <series>LNCS</series>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>48</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Monnin</surname>
          </string-name>
          , et al.,
          <article-title>Discovering alignment relations with Graph Convolutional Networks: A biomedical case study</article-title>
          ,
          <source>Semantic Web</source>
          <volume>13</volume>
          (
          <year>2022</year>
          )
          <fpage>379</fpage>
          -
          <lpage>398</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>