Demonstration: A Design Pattern for Representing Idiosyncratic
Ontologies in Industrial Applications

Anne Hunt 1
1
    Realtor.com, Austin, Texas


                                Abstract
                                Applications that intelligently handle data from multiple sources often need to manage and
                                present incoming data that is structured by source system ontologies that are widely different
                                from each other. Hereafter, these are referred to as idiosyncratic ontologies. In many cases,
                                idiosyncratic ontologies are mutually inconsistent. Because of the nature of the data and the
                                importance of these applications within such domains as health information, research data,
                                and financial data, it is necessary to both provide a coherent experience for end users and to
                                maintain a computationally accessible provenance for the source data. This is a
                                demonstration of a design pattern that has been effectively re-used across multiple domains.

                                Keywords 1
                                Idiosyncratic ontologies, representation, industry

1. Introduction

    This demonstration illustrates a repeatable pattern for managing data structured by idiosyncratic
ontologies. We especially focus on singular referential terms. This pattern is currently used to support
important software applications at several companies, where data sources have differing and
inconsistent ontologies whose original provenance must be maintained while the data is “made sense
of” for users of the software systems.
    In application systems design, as in philosophy of language, trouble arises when there is no one-to-
one correspondence between a term and the intended referent. And the situation is exacerbated when
there is no general, community-wide agreement on the semantics of the terms. This type of situation
arises in the real world more often than one might expect. For one common example, suppose that
Google Home’s underlying data representation is such that it has data structures representing two
living room lamps in my home when, in the real world, there is only one. A more serious example
occurs when an electronic health record (EHR) has a representation of two people called “Jane Doe”
attending a specific medical practice when there is in fact only one such person. And these situations
aren’t merely bugs in the data that should be corrected, for example by merging the records. To see
that this is the case, suppose someone has two identities that they need maintained for reasons of
privacy, and some helpful programmer “fixes” the problem by merging records from each identity.
    This is a demonstration of one way to properly represent data sourced from a variety of messy and
inconsistent ontologies, such that the applications built on this data behave according to these
requirements:
    •    End users are given a coherent experience
    •    Inferences such as identity resolution can be undone when defeasible

FOIS 2021 Demonstrations, held at FOIS 2021 - 12th International Conference on Formal Ontology in Information Systems, September 13-
17, 2021, Bolzano, Italy
EMAIL: annejudehunt@gmail.com (A. Hunt)
ORCID: 0000-0001-9805-3538 (A. Hunt)
                           © 2021 Copyright for this paper by its author.
                           Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
Wor
Pr
   ks
    hop
 oceedi
      ngs
            ht
            I
             tp:
               //
                ceur
                   -
            SSN1613-
                    ws
                     .or
                   0073
                       g
                           CEUR Workshop Proceedings (CEUR-WS.org)
   •   The original data must be maintained, and the provenance of inferences must be available
   •   New data can be added to the system and data can be removed from the system
   •   End users see the data according to appropriate permissions (e.g. private identities are
   maintained in appropriate contexts)


2. Cases

    At Castlight Health, a public company, we ran into the problem of idiosyncratic ontologies with
data received from hundreds of other companies (usually insurance companies) daily. This data
represented medical providers – doctors, practices, clinics, hospitals, and other medical facilities – as
well as networks of providers. For example, a single record could represent a facility, such as a clinic,
or provider group such as a partnership of doctors with related specialties. The name of one of these
facilities could be almost identical to that of an individual doctor. Where a doctor’s name was, for
example, Dr. John Smith, the name of his clinic might be “The Doctor John Smith Practice.”
    Castlight Health’s application was a consumer-facing product that helps consumers pick in-
network healthcare providers. The magnitude of the data we received was on the order of millions of
records per week. While the data was labeled as to which of these types of named entities was
represented – whether it was a practice or a doctor, for example – that labeling was not reliable in all
cases. Worse, many of the data providers had inconsistent representations – for example, one insurer’s
data held that “Mary Washington Hospital” was the same as “Mary Washington Center”, while
another insurer’s data held they were two different providers, and a third held that there was yet
another provider, “M. Washington Hospital Center” that was not identical to either of the other two.
The data was accompanied by National Provider Identifiers (NPI) which are meant to be unique, but
in reality are not.
    The following table illustrates the type of data received.

 NPI           Name                            Location                           Provider Type
 1234          Stanford Hospital               45 Middlefield, Palo Alto, CA      Hospital
                                               22345
 1234          Stanford Radiation Center       47 Middlefield, Palo Alto, CA      Hospital
                                               22345
 3456          Dr. Jane Smith                  123 Chaucer Street, Palo Alto,     Practitioner
                                               CA 22344
 3456          Dr. Jane Smith Practice         123 Chaucer Street, Palo Alto,     Group
                                               CA 22344
 3456          Dr. John Doe                    123 Chaucer Street, Palo Alto,     Practitioner
                                               CA 22344
 7899          Dr. J. Smith                    123 Chaucer Street, Palo Alto,     Practitioner
                                               CA 22344

   Before we began our identity resolution update, Castlight’s algorithms and other data sources
indicated that any entities sharing a single NPI were one and the same, and that is the information that
we presented to consumers. Once we dug into the data by hand in specific cases (such as illustrated
above, we could see that NPIs were not unique. We also discovered that the best understanding of the
identity of the entities changed over time and with additions to our data sources. While a human can
easily see that Dr. Smith, her practice, and her co-worker Dr. Doe are three unique entities, and that
Dr. J. Smith is probably the same person as Dr. Jane Smith, in order to infer this at scale we had to
deploy a combination of machine learning and rule-based inference.
   In a separate company, we ran into a very similar problem with incoming data, in this case for both
medical practitioners and patients. At Medici, a health technology startup, we received patient data
from multiple providers’ electronic health records. Medici’s application was a doctor-facing and a
consumer-facing product that allowed doctors and patients to consult remotely and share data in a
manner compliant with applicable laws and regulations. Medici’s system conducted some amount of
identity resolution on data about both patients and doctors, so that data from various systems could be
stitched together for users to view. For reasons similar to those at Castlight, with the addition of
complexities around consumer data, data from different systems about the same individual could not
simply be merged.

3. Design Pattern

    To describe the design pattern used in these and other cases, I’ll use the example of an EHR
(electronic health record). An EHR can be thought of as something like a dossier of data about a
person. EHR systems are largely for the purpose of storing and managing information about real
people.
    Consider the case where a patient has two names, Clark Kent and Superman, along with two
associated and separate identities that he would like to keep separate for reasons of his own. Then the
EHR for that person will consist of a set of facts, some of which are associated with one of the names
(identities) and some that are associated with another.
    The simplest approach, and the one that seems intuitively correct, would be to represent the single
person and associate that entity with the two names along with the facts in the EHR. The following
diagram illustrates this model:


    While the model itself is relatively simple, it turns out that using this model within a software
system provides no simple way to retrieve data associated with one of the identities versus the other.
You would be reduced to writing a program to retrieve all of the facts, and then parsing through the
natural language to find the right names (which of course can take various surface forms, including
mis-spellings). It’s too difficult to separate the “Clark Kent” facts from the “Superman” facts.
Additionally, this sort of model doesn’t allow for data provenance, for example if the Clark Kent data
is sourced from one medical practice and the Superman data is sourced from another.
    The approach that worked is illustrated here:
    Using an algorithm that combined machine learning and inference, we were able to infer that the
Clark Kent data and the Superman data pertained to a single person in the real world. We represented
that real world person as what we decided to call a “canonical” entity with an associated persistent,
unique identifier to be used and relied upon by downstream systems. At the data level, we represented
the data received about each identity separately. This allowed us to easily and programmatically
retrieve data associated with each identity separately, for purposes of maintain privacy, while also
creating a correct representation (to allow us to do such things as get an accurate count of the number
of patients in our system).


4. Demonstration Method

   In this demonstration, we go into detail about these cases and show the ontology used to correctly
capture the semantics of the data while forming a foundation for applications that met requirements
around data provenance and privacy. This representation pattern is like what would be needed to
capture the semantics of such statements as:

    •   John knows that Lois Lane believes Clark Kent can’t fly.
    •   John knows that Lois Lane believes Superman can fly.

    We will use a small ontology, developed in Protégé (https://protege.stanford.edu), along with a set
of slides with illustrations of the intended model, some rejected models for comparison, and a set of
first order logic formalizations. We will map the demonstration ontology to the formalizations, and
describe how the pattern is easily extended to similar cases in different domains and covering
different types of terms, such as classes.