=Paper= {{Paper |id=Vol-2288/om2018_poster2 |storemode=property |title=Medical knowledge graph construction by aligning large biomedical datasets |pdfUrl=https://ceur-ws.org/Vol-2288/om2018_poster2.pdf |volume=Vol-2288 |authors=Michael Röder,Giorgos Stoilos,David Geleta,Jetendr Shamdasani,Mohammad Khodadadi |dblpUrl=https://dblp.org/rec/conf/semweb/RoderSGSK18 }} ==Medical knowledge graph construction by aligning large biomedical datasets== https://ceur-ws.org/Vol-2288/om2018_poster2.pdf
    Medical Knowledge Graph Construction by
      Aligning Large Biomedical Datasets

                        Giorgos Stoilos, David Geleta,
               Jetendr Shamdasani, and Mohammad Khodadadi

                    Babylon Health, London, SW3 3DD, UK
                    firstname.lastname@babylonhealth.com



1   Extended Abstract

Large Knowledge Bases (KBs) can be built by aligning and integrating existing
data sources. To support AI-based digital healthcare services at Babylon
Health,¹ significant effort was recently invested in building a large medical KB.
To this end, a highly configurable and modular ontology integration pipeline
has been created. It works as follows: an initial ontology is used as a seed KB
(KB_0), and additional data sources are integrated into it, creating new extended
versions of KB_0. The integration process consists of a Matching phase, an
Aggregation phase, and a final PostProcessing phase. In the Matching phase the
following matchers can be used:

 – An in-house LabelMatcher, which is based on ideas similar to those of the
   label matcher in [1], i.e., label normalisation, inverted indexes, and more.
 – The state-of-the-art systems AML [1] and LogMap [3], the latter in both its
   versions LogMap_o² and LogMap_c³.
 – A UMLS-synonym and a UMLS-CUI based matcher, or mappings from 3rd
   parties like BioPortal, NHS, and more.
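
The label-matching idea in the first item can be illustrated with a minimal sketch. It is not the in-house LabelMatcher: the normalisation steps (lower-casing, punctuation removal, a tiny stop-word list, token sorting), the function names, and the toy class ids and labels are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the real normalisation rules are not given in the text.
STOP_WORDS = {"of", "the"}

def normalise(label):
    """Lower-case, drop punctuation and stop words, and sort the remaining tokens."""
    tokens = re.sub(r"[^a-z0-9]+", " ", label.lower()).split()
    return " ".join(sorted(t for t in tokens if t not in STOP_WORDS))

def build_index(ontology):
    """Inverted index: normalised label -> set of class ids carrying that label."""
    index = defaultdict(set)
    for class_id, labels in ontology.items():
        for label in labels:
            index[normalise(label)].add(class_id)
    return index

def match(source, target):
    """Return (source_id, target_id) pairs that share a normalised label."""
    target_index = build_index(target)
    mappings = set()
    for class_id, labels in source.items():
        for label in labels:
            for target_id in target_index.get(normalise(label), ()):
                mappings.add((class_id, target_id))
    return mappings

# Toy ontologies (class id -> labels), invented for this example.
seed = {"S:1": ["Myocardial infarction", "Heart attack"],
        "S:2": ["Fracture of femur"]}
other = {"N:9": ["Infarction, Myocardial"],
         "N:7": ["Femur fracture"]}
mappings = match(seed, other)
```

Because each label is looked up in the index rather than compared against every label of the other ontology, the cost grows roughly linearly with the number of labels, which is one plausible reason a matcher of this kind scales well.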

The mappings produced in the Matching phase are aggregated using a weighted
average, and a threshold is applied. Finally, post-processing performs the following:

 – Mappings of higher multiplicity (i.e., those mapping multiple classes to the
   same one) are separated from the rest. The former are handled by
   multiplicity-disambiguation techniques, which reduce them to 1-to-1 or
   1-to-m mappings.
 – All mappings go through existing [2] and novel [4] conservativity-based
   mapping repair methods in order to avoid altering the structure of the seed KB.
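
The Aggregation phase and the first post-processing step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: it assumes each matcher returns a confidence per mapping, that a matcher which does not propose a mapping contributes 0, and that "higher multiplicity" means several source classes mapped to the same target class. The matcher names, weights, threshold, and data are hypothetical.

```python
from collections import defaultdict

def aggregate(matcher_outputs, weights, threshold):
    """Weighted average of matcher confidences; keep mappings at or above threshold.

    Assumption: a matcher that does not propose a mapping contributes 0 for it.
    """
    total_weight = sum(weights[name] for name in matcher_outputs)
    weighted_sum = defaultdict(float)
    for name, proposed in matcher_outputs.items():
        for pair, confidence in proposed.items():
            weighted_sum[pair] += weights[name] * confidence
    averaged = {pair: s / total_weight for pair, s in weighted_sum.items()}
    return {pair: score for pair, score in averaged.items() if score >= threshold}

def split_by_multiplicity(mappings):
    """Separate mappings where several source classes share one target class."""
    sources_per_target = defaultdict(set)
    for source, target in mappings:
        sources_per_target[target].add(source)
    higher = {(s, t) for s, t in mappings if len(sources_per_target[t]) > 1}
    return set(mappings) - higher, higher

# Hypothetical matcher outputs: (source class, target class) -> confidence.
outputs = {
    "label": {("A", "X"): 1.0, ("B", "X"): 1.0, ("C", "Y"): 0.8},
    "umls":  {("A", "X"): 1.0, ("B", "X"): 0.8, ("C", "Y"): 1.0, ("D", "Z"): 0.6},
}
weights = {"label": 2.0, "umls": 1.0}

kept = aggregate(outputs, weights, threshold=0.5)
unambiguous, higher_multiplicity = split_by_multiplicity(kept)
```

Here the mapping ("D", "Z") falls below the threshold and is dropped, while the two mappings into "X" are set aside for disambiguation. The disambiguation and repair steps themselves are described in [2] and [4] and are not reproduced here.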

¹ https://www.babylonhealth.com/
² https://github.com/ernestojimenezruiz/logmap-matcher
³ https://github.com/asolimando/logmap-conservativity/

Significant effort was spent on determining which matching algorithm to use in
the Matching phase. The Large BioMed Track (BioTrack) datasets were considered
for evaluating the methods. Surprisingly, however, these datasets are much older,
smaller, and somewhat different in content compared to the recent releases of
SNOMED, NCI, and FMA that are used at Babylon. For example, NCI
in BioTrack is almost half the size of the NCI December 2017 release (the
former contains 96K axioms whereas the latter contains 185K), FMA is almost 1/4
and SNOMED almost 1/3 of the size of their recent releases. In addition, synonym
labels of classes seem to be completely missing from all ontologies. For these
reasons, the reference set between SNOMED and NCI in BioTrack was refactored to
point to codes in the official releases, and a precision/recall evaluation of
our LabelMatcher, AML, LogMap, and XMap was then conducted using the official
releases (see Table 1); XMap did not terminate.

   Table 1. Evaluation results on aligning official releases of SNOMED and NCI

                 Precision  Recall  F-measure  Time (sec)  #Mappings
   LabelMatcher      0.356    0.77       0.49          13     28 457
   LogMap            0.372    0.78       0.50       2 850     27 342
   AML               0.410    0.50       0.45         596     15 861

   Table 2. Statistics about the KB after each integration/enrichment iteration.

                           SNOMED       +NCI       +CHV       +FMA
   Classes                340 995    429 241    429 241    524 837
   Properties                  93        124        124        219
   |A ⊑ B|                511 656    617 542    617 542    713 313
   |⟨A p iri ∪ Lit⟩|    1 069 562  1 611 543  1 708 616  2 173 649

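
As a quick consistency check on Table 1, the F column can be recomputed from the reported precision and recall. The sketch below assumes the column is the standard F1 score (the harmonic mean of precision and recall); the text does not state the formula, but the recomputed values agree with the table after rounding to two decimals.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) as reported in Table 1.
table1 = {
    "LabelMatcher": (0.356, 0.77),
    "LogMap": (0.372, 0.78),
    "AML": (0.410, 0.50),
}

# Round to two decimals, as in the table.
f_values = {name: round(f_measure(p, r), 2) for name, (p, r) in table1.items()}
```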
    As can be seen in Table 1, although simple in principle, LabelMatcher provides
comparable precision/recall and is orders of magnitude faster. The very low
precision arises because the extra mappings found in the larger ontology versions
are absent from the reference set and are therefore counted as false positives.
Given the scalability results and adequate precision/recall, we used our
LabelMatcher in the pipeline to integrate the latest versions of NCI, CHV, and
FMA on top of SNOMED (indeed, this process could not be completed using AML or
LogMap_o). Statistics about the KBs that we created after each integration are
given in Table 2; moreover, no conservativity violations could be detected due
to our post-processing.
    We have also compared our post-processing approach against the mapping
repair implemented in AML, LogMap_c, and LogMap_o. In cases where these systems
did not terminate, we used smaller versions of our (test) ontologies. In all cases
a large number of conservativity violations could be identified (in contrast to
none detectable after running our approach); detailed results can be found in [4].
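
To make the notion of a conservativity violation concrete, the sketch below flags subsumptions between seed-KB classes that hold only after the mappings are added. It is a simplification under stated assumptions and not the repair method of [2] or [4]: ontologies are reduced to explicit subclass edges, mappings are treated as equivalences, and entailment is approximated by graph reachability. All class names are hypothetical.

```python
from collections import defaultdict

def reachable(edges, start):
    """All nodes reachable from start by following subclass edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def conservativity_violations(seed_subclass, other_subclass, mappings):
    """Subsumptions between seed classes that are new in the integrated graph."""
    seed_classes = {c for pair in seed_subclass for c in pair}
    before = defaultdict(set)
    for sub, sup in seed_subclass:
        before[sub].add(sup)
    after = defaultdict(set)
    for sub, sup in list(seed_subclass) + list(other_subclass):
        after[sub].add(sup)
    for a, b in mappings:  # an equivalence is a subsumption in both directions
        after[a].add(b)
        after[b].add(a)
    violations = set()
    for c in seed_classes:
        old = reachable(before, c)
        new = reachable(after, c) & seed_classes
        violations |= {(c, d) for d in new - old if d != c}
    return violations

# Toy example: S:A and S:B are unrelated siblings in the seed KB,
# but the other ontology states N:A is a subclass of N:B.
seed_edges = [("S:A", "S:Top"), ("S:B", "S:Top")]
other_edges = [("N:A", "N:B")]
found = conservativity_violations(
    seed_edges, other_edges, [("S:A", "N:A"), ("S:B", "N:B")])
```

In this toy case the two mappings together make S:A a subclass of S:B, which changes the structure of the seed KB; dropping either mapping removes the violation.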


References
1. Faria, D., Pesquita, C., Santos, E., Palmonari, M., Cruz, I.F., Couto, F.M.: The
   AgreementMakerLight ontology matching system. In: Proc. of OTM (2013)
2. Jiménez-Ruiz, E., Grau, B.C., Horrocks, I., Llavori, R.B.: Ontology integration using
   mappings: Towards getting the right logical consequences. In: Proc. of ESWC (2009)
3. Jiménez-Ruiz, E., Grau, B.C., Zhou, Y.: LogMap 2.0: Towards logic-based, scalable
   and interactive ontology matching. In: Proc. of SWAT4(HC)LS. pp. 45–46 (2011)
4. Stoilos, G., Geleta, D., Shamdasani, J., Khodadadi, M.: A novel approach and
   practical algorithms for ontology integration. In: Proc. of ISWC (2018)

