<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Medical Knowledge Graph Construction by Aligning Large Biomedical Datasets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giorgos Stoilos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Geleta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jetendr Shamdasani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammad Khodadadi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Babylon Health</institution>
          ,
          <addr-line>London, SW3 3DD</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
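        <p>
          <list list-type="bullet">
            <list-item><p>An in-house LabelMatcher built on ideas similar to those of the label matcher in [<xref ref-type="bibr" rid="ref1">1</xref>], i.e., label normalisation, inverted indexes, and more.</p></list-item>
            <list-item><p>The state-of-the-art systems AML [<xref ref-type="bibr" rid="ref1">1</xref>] and LogMap [<xref ref-type="bibr" rid="ref3">3</xref>], the latter in both of its versions, LogMapo<sup>2</sup> and LogMapc<sup>3</sup>.</p></list-item>
            <list-item><p>A UMLS-synonym and a UMLS-CUI based matcher, or mappings from third parties such as BioPortal, NHS, and more.</p></list-item>
          </list>
        </p>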
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Large Knowledge Bases (KBs) can be built by aligning and integrating
existing data sources. To support AI-based digital healthcare services within Babylon
Health<sup>1</sup>, a significant effort to build a large medical KB was recently undertaken. To
realise this goal, a highly configurable and modular ontology integration pipeline
has been created, which works as follows: an initial ontology is used as a seed KB
(KB0) and additional data sources are integrated into it, creating new extended
versions of KB0. The integration process consists of a Matching phase, an
Aggregation phase, and a final PostProcessing phase. In the Matching phase the
following matchers can be used:</p>
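      <p>One of the matchers available in this phase is a LabelMatcher based on label normalisation and inverted indexes. The following is a minimal sketch of that general idea, not the in-house implementation: the function names, the choice of normalisation steps (accent stripping, lower-casing, token sorting), and the toy identifiers are all assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
import re
import unicodedata
from collections import defaultdict


def normalise(label):
    """Strip accents and punctuation, lower-case, and sort the tokens.

    These normalisation steps are an assumption; the source only says
    that labels are normalised.
    """
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(sorted(tokens))


def build_index(labels_by_class):
    """Inverted index: normalised label -> set of class ids carrying it."""
    index = defaultdict(set)
    for class_id, labels in labels_by_class.items():
        for label in labels:
            index[normalise(label)].add(class_id)
    return index


def match(source, target):
    """Return (source_id, target_id) pairs that share a normalised label."""
    index = build_index(target)
    mappings = set()
    for class_id, labels in source.items():
        for label in labels:
            for target_id in index.get(normalise(label), ()):
                mappings.add((class_id, target_id))
    return mappings


# Toy data, invented for illustration only.
seed = {"S:1": ["Myocardial infarction", "Heart attack"]}
other = {"N:7": ["Infarction, Myocardial"], "N:8": ["Cardiac arrest"]}
print(sorted(match(seed, other)))  # [('S:1', 'N:7')]
```
]]></preformat>
      <p>Because the index is keyed on the normalised label, each source label is resolved with a single dictionary lookup rather than a comparison against every target class, which is one plausible reason such a matcher scales well to large ontologies.</p>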
      <p>
        The mappings from the previous stage are aggregated using a weighted average
and a threshold is applied. Finally, post-processing performs the following:
        <list list-type="bullet">
          <list-item><p>Mappings of higher multiplicity (i.e., those mapping multiple classes to the same
one) are separated from the rest. The former are handled by multiplicity-disambiguation
techniques, which reduce them to 1-to-1 or 1-to-m mappings.</p></list-item>
          <list-item><p>All mappings go through existing [<xref ref-type="bibr" rid="ref2">2</xref>] and novel [<xref ref-type="bibr" rid="ref4">4</xref>] conservativity-based
mapping repair methods in order to avoid altering the structure of the seed KB.</p></list-item>
        </list>
Significant effort was spent on determining which matching algorithm to use in
the Matching phase. The Large BioMed Track datasets were considered for
evaluating the methods; however, surprisingly, these datasets are much older,
smaller, and somewhat different in content compared to the recent releases of
SNOMED, NCI, and FMA that are used at Babylon. For example, NCI
in BioTrack is almost half the size of the NCI December 2017 release (the
former contains 96K axioms whereas the latter contains 185K), while FMA is almost 1/4 and
SNOMED almost 1/3 of the size of their recent releases. In addition, synonym labels of
classes seem to be completely missing from all ontologies. For those reasons, the
reference set between SNOMED and NCI in the BioTrack was refactored to
point to codes in the official releases, and then a precision/recall evaluation of
our LabelMatcher, AML, LogMap, and XMap was conducted using the official
releases (see Table 1); XMap did not manage to terminate.
        <fn><label>1</label><p>https://www.babylonhealth.com/</p></fn>
        <fn><label>2</label><p>https://github.com/ernestojimenezruiz/logmap-matcher</p></fn>
        <fn><label>3</label><p>https://github.com/asolimando/logmap-conservativity/</p></fn>
      </p>
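      <p>The aggregation step and the separation of higher-multiplicity mappings described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the pipeline's actual code: the matcher names, weights, threshold, and identifiers are invented, a matcher that does not propose a mapping is assumed to contribute a score of 0, and a mapping is written as a (source class, seed class) pair.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def aggregate(scores_by_matcher, weights, threshold):
    """Weighted average of per-matcher confidences, then thresholding.

    Assumption: a matcher that does not propose a mapping scores it 0.
    """
    total_weight = sum(weights.values())
    candidates = set()
    for scores in scores_by_matcher.values():
        candidates.update(scores)
    kept = {}
    for mapping in candidates:
        weighted = sum(
            weights[name] * scores.get(mapping, 0.0)
            for name, scores in scores_by_matcher.items()
        )
        score = weighted / total_weight
        if score >= threshold:
            kept[mapping] = score
    return kept


def split_by_multiplicity(mappings):
    """Separate mappings whose target is shared by several source classes."""
    sources_of = defaultdict(set)
    for source, target in mappings:
        sources_of[target].add(source)
    simple, higher = [], []
    for source, target in mappings:
        bucket = higher if len(sources_of[target]) > 1 else simple
        bucket.append((source, target))
    return sorted(simple), sorted(higher)


# Hypothetical matcher outputs: mapping -> confidence in [0, 1].
scores = {
    "label": {("N:1", "S:1"): 1.0, ("N:2", "S:1"): 0.9,
              ("N:3", "S:3"): 0.4, ("N:4", "S:4"): 0.6},
    "umls": {("N:1", "S:1"): 0.8, ("N:2", "S:1"): 0.7,
             ("N:4", "S:4"): 1.0},
}
weights = {"label": 2.0, "umls": 1.0}

kept = aggregate(scores, weights, threshold=0.5)
simple, higher = split_by_multiplicity(kept)
print(simple)  # [('N:4', 'S:4')]
print(higher)  # [('N:1', 'S:1'), ('N:2', 'S:1')]
```
]]></preformat>
      <p>In this toy run the mapping for N:3 falls below the threshold and is dropped, while N:1 and N:2 both map to S:1 and would therefore be passed on to multiplicity disambiguation; the disambiguation and conservativity-based repair steps themselves are not sketched here.</p>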
      <p>As can be seen, although simple in theory, LabelMatcher provides
comparable precision/recall and is orders of magnitude faster; the very low precision
is due to the extra mappings found in the larger ontology versions, which the
reference set does not cover and which are therefore counted as false positives. Given the
scalability results and adequate precision/recall, we used our LabelMatcher in the pipeline
to integrate the latest versions of NCI, CHV, and FMA on top of SNOMED (indeed, this process could
not be completed using AML or LogMapo). Statistics about the KBs that we
created after each integration are given in Table 2; moreover, thanks to our
post-processing, no conservativity violations could be detected.</p>
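      <p>The effect of an outdated reference set on precision can be illustrated with a small worked example. The sketch below uses the standard definitions of precision and recall; the mappings and identifiers are invented and do not correspond to the evaluation data.</p>
      <preformat><![CDATA[
```python
def precision_recall(found, reference):
    """Standard precision and recall of a mapping set against a reference."""
    found, reference = set(found), set(reference)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference built from older, smaller ontology versions.
reference = {("N:1", "S:1"), ("N:2", "S:2")}

# Mappings found on the larger releases: N:9 and N:10 stand for classes
# that the older reference never saw. They may well be correct, but they
# are still counted as false positives.
found = {("N:1", "S:1"), ("N:2", "S:2"), ("N:9", "S:9"), ("N:10", "S:10")}

print(precision_recall(found, reference))  # (0.5, 1.0)
```
]]></preformat>
      <p>Here every reference mapping is recovered, so recall is 1.0, yet precision drops to 0.5 solely because of mappings that the reference cannot confirm.</p>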
      <p>
        We have also compared our post-processing approach against the mapping
repair implemented in AML, LogMapc, and LogMapo. In cases where these systems
did not terminate, we used smaller versions of our (test) ontologies. In all cases a
large number of conservativity violations could be identified (in contrast to none
detectable after running our approach); detailed results can be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Faria</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pesquita</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmonari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cruz</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Couto</surname>
            ,
            <given-names>F.M.:</given-names>
          </string-name>
          <article-title>The AgreementMakerLight ontology matching system</article-title>
          .
          <source>In: Proc. of OTM</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Jimenez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grau</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Llavori</surname>
            ,
            <given-names>R.B.:</given-names>
          </string-name>
          <article-title>Ontology integration using mappings: Towards getting the right logical consequences</article-title>
          .
          <source>In: Proc. of ESWC</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jimenez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grau</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>LogMap 2.0: towards logic-based, scalable and interactive ontology matching</article-title>
          .
          <source>In: Proc. of SWAT4(HC)LS</source>
          . pp.
          <fpage>45</fpage>
          –
          <lpage>46</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Stoilos</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geleta</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shamdasani</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khodadadi</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A novel approach and practical algorithms for ontology integration</article-title>
          .
          <source>In: Proceedings of ISWC</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>