ICBO 2014 Proceedings


                 Clinical Data Wrangling using
           Ontological Realism and Referent Tracking
          Werner Ceusters                                  Chiun Yu Hsu                                    Barry Smith
Department of Biomedical Informatics                  Neuroscience Program                          Department of Philosophy
       University at Buffalo                    Medicine and Biomedical Sciences                      University at Buffalo
      Buffalo, NY 14203, USA                   University at Buffalo, NY 14260, USA                 Buffalo, NY 14203, USA
    Email: ceusters@buffalo.edu                    Email: chiunhsu@buffalo.edu                     Email: phismith@buffalo.edu


Abstract – Ontological realism aims to develop high-quality ontologies that faithfully represent what is general in reality, and to use these ontologies to render heterogeneous data collections comparable. Achieving this second goal for clinical research datasets presupposes not merely (1) that the requisite ontologies already exist, but also (2) that the datasets in question are faithful to reality in the dual sense that (a) they denote only particulars and relationships between particulars that do in fact exist and (b) they do this in terms of the types and type-level relationships described in these ontologies. While much attention has been devoted to (1), work on (2), which is the topic of this paper, is comparatively rare. Using Referent Tracking as a basis, we describe a technical data wrangling strategy which consists in creating for each dataset a template that, when applied to each particular record in the dataset, leads to the generation of a collection of Referent Tracking Tuples (RTT) built out of unique identifiers for the entities described by means of the data items in the record. The proposed strategy is based on (i) the distinction between data and what data are about, and (ii) the explicit descriptions of portions of reality which RTTs provide and which range not only over the particulars described by data items in a dataset, but also over these data items themselves. This last feature allows us to describe particulars that are only implicitly referred to by the dataset; to provide information about correspondences between data items in a dataset; and to assert which data items are unjustifiably or redundantly present in or absent from the dataset. The approach has been tested on a dataset collected from patients seeking treatment for orofacial pain at two German universities and made available for the NIDCR-funded OPMQoL project.

Keywords – referent tracking, data wrangling, ontological realism

I. INTRODUCTION

One goal of ontology-based research is the integration of information residing in heterogeneous data collections in the hope that by running queries over the resultant combined data collections we will be able to answer questions that would otherwise remain unanswered [1]. Such integration can be achieved through different paradigms, including: mediation [2], federation [3], data warehousing [4], and, most recently, the Ontology-Based Data Access (OBDA) paradigm [5], which is distinguished by the fact that it keeps the data sources and conceptual layer of an information system separate and independent.

To be effective, all such paradigms require ontology-based mappings ranging not only over the database schemas but also over the data types by means of which the data are stored [6]. Research in OBDA revealed that successful information integration requires much more detail than is standardly provided: it requires also suitable mechanisms for mapping individual data values – rather than merely data fields – to corresponding instances of ontology classes – for example to patients in a clinical study. This in turn requires the specification of how identifiers for such instances can be generated from such data values in order to enable creation of an ABox suitable for answering queries relating to such instances [7]. Such specification, we believe, may well be a critical issue in the context of clinical research datasets, where (as we shall discover below) data values do not always denote what is suggested by the variable or field name under which they appear.

Suppose, for example, that in the record of some patient the variable phenotypic gender is associated with a value of either ‘0’ or ‘1’ – meaning ‘male’ or ‘female,’ respectively. It is then safe to create an ABox statement to the effect that this patient’s phenotypic gender is an instance of the corresponding ontology class. If no data value is found, however, then it should not be assumed that the patient in question does not have a phenotypic gender. If, on the other hand, a value of ‘2’ – documented as meaning ‘unknown’ – is found, then this should not lead to an ABox assertion to the effect that the given patient’s phenotypic gender is an instance of a special kind which is neither male nor female. The value ‘unknown’ provides information not about the patient, but rather about the data we have about the patient.

The problem we face in creating data-value-to-ontology mappings from clinical research data repositories is that the information needed for such mappings is not explicitly represented in the datasets. Rather, it is scattered through various data dictionaries and instruction manuals (relating for example to how to extract and process data from responses to standardized questionnaires).

The explicit representation that is pursued by the Referent Tracking (RT) methodology is based on Ontological Realism as described in [8], and on the thesis that explicit representation can best be achieved by generating unique identifiers for all instances of ontology
classes which are described – whether explicitly or implicitly – in our data. In [9] we described an algorithm to achieve explicit representation of this sort from highly structured electronic health record (EHR) data. The research questions we address here are:

  (1) to what extent can a similar algorithm be used for
      clinical research data collections, for instance to
      provide information both about particulars that are
      implicitly referred to and about correspondences
      between data items in a dataset,
  (2) what kinds of ambiguous and implicit information
      can one expect to encounter in such data collections,
  (3) is it useful to set limits on the types and amounts of
      implicit information that we will render explicit, and
  (4) is it possible to use the referent tracking methodology
      in combination with appropriate ontologies to
      provide a complete and explicit representation of
      clinical research datasets that will take account of the
      constraints and provisions typically documented in
      data dictionaries and other data-related sources, for
      instance to describe which data items are
      unjustifiably or redundantly present or absent?

Our hypothesis is that, even where it is not possible to provide a completely accurate RT representation of the entities in reality described by a given body of data, identifying the types of challenges to such representation would itself yield a useful resource for avoiding similar problems in future clinical research studies.

II. MATERIALS

The work described below is part of the NIDCR-funded project Ontology for Pain-related Mental Health and Quality of Life (OPMQoL), which involves the integration of five datasets which – although collected independently – cover similar sorts of information about patients who experienced one or other form of orofacial pain [10]. All datasets are made available as spreadsheet tables (from here on referred to as ‘source tables’). Each row in the body of each such table is a collection of data items obtained from a single patient; each column is a collection of data items resulting from some specific type of observation. If a header row is present, its cells indicate what sorts of observations are reported on in the respective columns.

The de-identified dataset used for the work described here – from here on referred to as the ‘study set’ – was collected from 390 patients seeking treatment for orofacial pain [11]. The inclusion criterion was that patients had at least one diagnosis according to the Research Diagnostic Criteria for Temporomandibular Disorders (RDC/TMD) [12]. The study set comes with a variable (n=161) codebook and a technical report explaining certain dependencies and implicit assumptions [13].

III. METHODS

A. Referent Tracking

RT is designed to yield data repositories whose content can be expressed as a collection of Referent Tracking Tuples (RTT) [14]. An RTT is an assertion about a particular, i.e. an entity in reality that exists in space and time [15]. Each RTT follows a semi-formal syntax which is close to the one used for instance-level relationships in the definitions of the Relation Ontology [16]. Ignoring here certain housekeeping parameters, we can assert that RTT assertions about continuants (entities such as patients, hospitals, teeth, jaws which endure through time, as contrasted with occurrents or processes) are of the form ‘x p-rel y t-rel t’, where:

   ‘x’ is the (ideally) singular and globally unique instance
     identifier (IUI) denoting the particular described,
   ‘y’ is either: (1) an IUI denoting another particular or:
     (2) a representational unit drawn from either a realism-
     based ontology or a concept-based terminology,
   ‘p-rel’ expresses a relationship obtaining between the
     referents of x and y,
   ‘t’ denotes a particular temporal region, and
   ‘t-rel’ expresses the relationship obtaining between the
     temporal region denoted by t and the temporal region
     during which p-rel obtains between x and y.

RTT assertions that do not mention a continuant have the form ‘x p-rel y,’ where ‘x’, ‘p-rel’ and ‘y’ are otherwise treated in the way described above.

RT aims to do away with the ambiguity in assertions such as ‘John has a benign duodenal polyp’. This assertion tells us that there exists some instance of a given type, but not which one in particular. This ambiguity is preserved in John’s EHR, where diagnostic codes drawn from some terminology or ontology are used to assert existence in John at some time t1 of polyps of a given type. The consequence is that, when a later assertion is added to John’s EHR to the effect that he has a malignant duodenal polyp, the data provide no basis for inferences concerning whether it is the very same polyp as the one referred to at t1 that has turned malignant or some other polyp appearing at some later time t2 [14]. This ambiguity disappears when we represent the first-described situation using the following RTTs:

   #1 part-of #2 at t1                                    (1)
   #1 instance-of benign duodenal polyp at t1             (2)
   #1 instance-of malignant duodenal polyp at t2          (3)

where ‘#1’ denotes the polyp and ‘#2’ John. The alternative situation would be represented by using distinct IUIs for each polyp as follows, where ‘#3’ denotes a second polyp:

   #1 part-of #2 at t1                                    (4)
   #3 part-of #2 at t2                                    (5)


  L     Var IT REF                           Min      Max        Val    IUI(L)        IUI(P)     P-Type                 P-Rel         P-Targ      Trel    Time

    1        IM patient_study_record                                                  #psrec-    DATASET-RECORD                                   at      t
    2     id LV patient_identifier                                      #pidL-        #pid-      DENOTATOR              denotes       #pat-       at      t
    3     id IM patient                                                 #patL-        #pat-      PATIENT                                          at      t
    4    sex CV gender                                                  #patgL-       #patg-     GENDER                 inheres-in    #pat-       at      t
    5    sex CV male                                             0                    #patg-     MALE-GENDER            inheres-in    #pat-       at      t
    6    sex CV female                                           1                    #patg-     FEMALE-GENDER          inheres-in    #pat-       at      t
    7    sex UA sex                      BLANK BLANK                                  #patgL-    UNDERSPEC-ICE                                    at      t
    8    q3 CV no_pain_in_lower_face                             0      #q3L0-        #pat-                             lacks-pcp     PAIN        at      #tq3-
    9    q3 CV pain_in_lower_face                                1      #q3L1-        #pq3-      PAIN                   participant   #pat-       at      #tq3-
   10    q3 IM in_the_past_month                                                      #tq3-      MONTH-PERIOD
   11    q3 IM lower_face                                                             #patlf-    LOWER-FACE             part-of       #pat-       at      t
   12    q3 IM time_of_q3_concretization                                              #cq3-      TIME-PERIOD            after         #tq3-
   13    q3 RP an_8_gcps_1               0     0                 0      #q3L-         #q3L-                             co-ref-with   #q3L0-      at      t
   14    q3 UP an_8_gcps_1               1     10                0      #q3L-         #q3L-      DISINFORMATION                                   at      t
   15    q3 UA an_8_gcps_1               BLANK BLANK             1      #q3L-         #q3L-      UNDERSPEC-ICE                                    at      t
   16    q3 JA an_8_gcps_1               BLANK BLANK             0      #q3L-         #q3L-      J-BLANK-ICE                                      at      t

  Table 1: Simplified template for data expansion of the variables (‘Var’) ‘id’, ‘sex’ and ‘q3’ of the original dataset, ignoring time-related information.
  Legend: ‘L’ = Line number in this table; ‘IT’ = Information Type (possible values being ‘LV’ = Literal Value, ‘CV’ = Coded Value, ‘IM’ = IMplicit
  reference, ‘RP’ = Redundant Presence, ‘UP’ = Unjustified Presence, ‘UA’ = Unjustified Absence, ‘JA’ = Justified Absence); ‘REF’ = Reference; ‘Min’ =
  lowest possible value for variable; ‘Max’ = highest possible value for variable; ‘Val’ = possible value for variable; ‘IUI(L)’ = prefix for generating an IUI
  proxy for the information content entity which refers to the corresponding value for the variable under ‘Var’ for the patient being processed; ‘IUI(P)’ = prefix
  for generating an IUI proxy for whatever is denoted by this information content entity; ‘P-Type’ = ontological type of the entities denoted by instantiated
  IUI(P)s; ‘P-Rel’ = relation between the entity denoted by an instantiated IUI(P) and the entity denoted by an instantiated P-Targ; ‘Trel’ = temporal relation;
  ‘Time’ = temporal period during which P-Rel holds. Only entries relevant to the discussion in this paper are shown. See discussion section for other details.
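
To illustrate how a template of this kind might be applied to a single record, the following Python sketch encodes lines 3, 5 and 6 of Table 1 and expands one patient record into simplified RTTs. The names RTT, TemplateLine, TEMPLATE and expand, the dictionary-based record format, and the use of a running record number as IUI suffix are assumptions made purely for illustration; they are not part of any RT system, and the sketch ignores housekeeping parameters as well as the UA, UP, RP and JA line types.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RTT:
    """Simplified tuple of the form 'x p-rel y t-rel t' (hypothetical class)."""
    x: str
    p_rel: str
    y: str
    t_rel: str = "at"
    t: str = "t"

    def __str__(self) -> str:
        return f"{self.x} {self.p_rel} {self.y} {self.t_rel} {self.t}"


@dataclass(frozen=True)
class TemplateLine:
    """One template row, reduced to the columns this sketch needs."""
    var: str                      # 'Var' column
    it: str                       # 'IT' column (LV, CV, IM, ...)
    val: Optional[str]            # 'Val' condition; None means "any value"
    iui_p: str                    # 'IUI(P)' prefix
    p_type: str                   # 'P-Type' column
    p_rel: Optional[str] = None   # 'P-Rel' column
    p_targ: Optional[str] = None  # 'P-Targ' prefix


# Lines 3, 5 and 6 of Table 1, transcribed by hand.
TEMPLATE = [
    TemplateLine("id", "IM", None, "#pat-", "PATIENT"),
    TemplateLine("sex", "CV", "0", "#patg-", "MALE-GENDER", "inheres-in", "#pat-"),
    TemplateLine("sex", "CV", "1", "#patg-", "FEMALE-GENDER", "inheres-in", "#pat-"),
]


def expand(record: dict, n: int) -> list[RTT]:
    """Apply every matching template line to one record.

    `n` stands in for the IUI suffix (an assumption: a real RT system
    would issue its own identifiers).
    """
    rtts = []
    for line in TEMPLATE:
        value = record.get(line.var)
        if value in (None, ""):
            continue  # blank cell: assert nothing about the patient
        if line.val is not None and value != line.val:
            continue  # coded value does not match this line
        x = f"{line.iui_p}{n}"
        rtts.append(RTT(x, "instance-of", line.p_type))
        if line.p_rel and line.p_targ:
            rtts.append(RTT(x, line.p_rel, f"{line.p_targ}{n}"))
    return rtts


if __name__ == "__main__":
    for rtt in expand({"id": "A17", "sex": "0"}, 1):
        print(rtt)
```

For a record whose ‘sex’ value is ‘0’, the sketch prints three tuples of the same form as assertions (8) to (10) in the Discussion. A blank ‘sex’ cell yields only the PATIENT assertion, which reflects the point that a missing value is information about the data and not about the patient.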


   #1 instance-of benign duodenal polyp at t1                         (6)
   #3 instance-of malignant duodenal polyp at t2.                     (7)

A further goal of RT is to make explicit all the implicit assumptions that need to be taken into account to interpret given data correctly. Some of these assumptions result from the use of broken information models or from practices such as registering ICD-9-CM code 659.7 – ‘Abnormality in fetal heart rate or rhythm’ – in the diagnosis field of the mother’s EHR. The RT method is most effective when its principles are applied at the time of data collection and registration, though as shown in [17] post-hoc translations are also possible.

B. Methodology applied

The work reported here involved the following steps:

  (1)    cross-checking the study set with the variable
         codebook and technical report for appropriate coding
         of values, field names, and field descriptions,
  (2)    annotating the dataset with appropriate descriptions,
  (3)    building an executable template that makes explicit,
         for each of the data values, how their referents must
         be analyzed in RT terms; this is achieved by applying
         the following data expansion algorithm [9]:
      a. identify all the possible particulars that are explicitly
         referred to by a specific data value when applied to a
         specific patient;
      b. determine for each particular identified under (3a)
         whether it is a dependent or independent entity [8];
      c. if a given particular is a dependent continuant,
         identify the independent continuant on which it
         depends; if an entity is an occurrent, identify the
         continuants which participate in it;
      d. repeat steps (3b) and (3c) as required;
  (4)    selecting from appropriate realism-based ontologies
         the representational units that denote universals or
         defined classes whose instances or members are
         either directly referred to in the dataset or implicitly
         referred to as discovered through application of the
         algorithm described in (3);
  (5)    implementing an algorithm that uses outputs from (3)
         and (4) to generate for each patient described in the
         dataset a collection of RTTs that provides a realism-
         based representation of that patient’s situation;
  (6)    generating statistics needed to answer the research
         questions described in the INTRODUCTION, above.

IV. RESULTS

Research questions (1) and (4) are answered by our development of a technical approach which enables the creation for each dataset of a template which, when applied to a particular record in the dataset, yields a corresponding collection of RTTs. Part of the approach is captured in Table 1, which shows a simplified version of some sample lines (indexed under ‘L’) as they appear in the template produced at step (3) (under METHODS, above) for the variables ‘id’, ‘sex’ and ‘q3’. What the template lines encode is determined by the
information type (IT), the detailed semantics of which is described in section V. Common to all information types is that part of the template that appears to the left of the dashed vertical line in Table 1. This specifies the conditions which must be satisfied if RTTs are to be generated on the basis of the information provided to the right of this line.

                  Template                         Patients
           Av. (SD)      Min     Max        Av. (SD)       Min     Max
 CV       3.57 (2.27)      0      11       0.82 (0.38)       0       1
 IM       2.79 (1.43)      0       6       2.69 (1.46)       0       6
 UA       0.16 (1.02)      0      12       0.01 (0.09)       0      10
 JA       0.16 (1.02)      0      12       0.04 (0.34)       0      12
 RP       0.13 (0.98)      0      12       0.01 (0.10)       0      11
 UP       0.13 (0.98)      0      12       0.00 (0.01)       0       5

Table 2. Occurrence of Record Types (see Table 1) per variable (n=161) in the study set for the template (left block) and per patient (n=390) after application of the template (right block).

Table 2 answers research questions (3) and (4) by providing statistics relating to the lines out of which the data translation template for the study set is composed, and to the extent to which each of these lines was in fact applied to the patient population described in the study set. The table shows, for instance, that unjustified absences and presences were encountered, albeit in a small percentage of cases, and that on average for each variable and for each patient roughly 3 implicit particulars needed to be accounted for. It shows that the increase in the size of the dataset resulting from applying this methodology is, for the Halle-Leipzig dataset, roughly 300%, and also that the quality of this dataset (measured in terms of UA, RP and UP) is quite good.

V. DISCUSSION

Our vision is that the Big Data repositories of the future should be maximally explicit and maximally self-explanatory. By ‘maximally explicit’, we mean that each such repository should contain explicit reference to any and all the entities, including their interrelationships, that must exist for an assertion encoded in the repository to be a faithful representation of the corresponding part of reality. By ‘maximally self-explanatory’ we mean that the data in the repository should be presented in such a way that a researcher seeking to query the repository does not need to concern himself with any idiosyncrasies of and between datasets, or codes or formats, that were combined or used to build the repository. A strategy to achieve this is to submit to such a repository only individual datasets which are themselves maximally explicit and self-explanatory.

Our approach is based on the – to us – obvious distinction between data and what data are about. It then takes advantage of the fact that RTTs can be used to describe in explicit fashion not merely the portions of reality described by data items in a dataset, but also these data items themselves. This allows us to describe explicitly even those particulars that are only implicitly referred to in a dataset by generating suitable unique identifiers. It also allows us to provide information about correspondences (such as co-reference) between data items in a dataset, and also to assert which data items are redundant, or unjustifiably absent, and so forth.

A. Explicit data items

The study set contains some explicit data items which are about particulars on the side of the patient such as gender, facial pains experienced, clicking noises heard when opening their mouths, and so forth. Referent Tracking requires each of these particulars to be assigned an IUI; Ontological Realism tells us that each one of them is an instance of at least one universal. What universals these particulars are instances of is typically only very indirectly represented in the study set.

The strategy for translating explicit data items into RTTs is covered by the Literal Value (LV) and Coded Value (CV) records in the template (Table 1). Template lines of either type have under ‘REF’ the label obtained or constructed from the relevant data dictionary or other supporting information associated with the code value. The template shows, for example, that if, for a patient in the study set, the value for the variable ‘sex’ is ‘0’ (L5), then the gender of this patient is described as ‘male.’ This can be translated in RT terms into an assertion that the given patient’s gender is an instance of the universal male gender (or, in case gender does not qualify as a universal [18], that it is a member of the defined class ‘male gender’ – we will ignore this distinction in the remainder of this paper).

The IUIs assigned through application of our method are in reality very large numbers generated by an RT system to ensure the needed high probability of uniqueness. For the sake of readability, however, we provide simple abbreviations to stand in for these IUIs. We also leave out full specification of time-related information (which would be needed, for example, to deal with cases where a patient’s gender changes from one time to the next), and certain housekeeping details required by syntactically and semantically correct RTTs [15].

To see how IUI assignment works, we will suppose that, while processing the study set on the basis of the template illustrated in Table 1, the IUI #pat-1 is assigned to the first patient described and that #patg-1 is assigned to his gender. Then the following collection of assertions would be generated as part of a faithful RT-like representation of the corresponding portion of reality (POR) on the basis of lines L3 and L5 of the template:

   #pat-1 instance-of PATIENT at t                          (8)
   #patg-1 instance-of MALE-GENDER at t                     (9)
   #patg-1 inheres-in #pat-1 at t                          (10)

Of course, the study set, too, is a particular, and so also are the data items out of which it is built. According to the Information Artifact Ontology (IAO) the study set and its parts are particular concretizations of particular information content entities (ICEs). Thus the ‘0’ in a particular position of the spreadsheet on your screen indicating that #pat-1’s gender
is male could be assigned an IUI, as also could the corresponding bits on the hard drive
of your laptop which bring it about that your spreadsheet software causes the laptop to
display the ‘0’ in that position. In addition, the ICEs here concretized can also be
assigned IUIs of their own. For example, in L1 of the template the IUI #psrec-1 is
assigned to the ICE that is concretized on your screen as a row of the patient’s record,
and in L4 #patgL-1 is assigned to the ICE whose concretizations inform us what the gender
of #pat-1 is. Since referent tracking implementations also assign IUIs to RTTs,
#RTT-patg-1-RN5a would be assigned to the ICE of which assertion (9), which is generated
by L5, is a concretization. On this basis, now, the following assertions can be added:

      #patgL-1 component-of #psrec-1 at t                  (11)
      #RTT-patg-1-RN5a instance-of RTT at t                (12)
      #patgL-1 co-ref-with #RTT-patg-1-RN5a at t           (13)
      #patgL-1 instance-of DATA-ITEM at t                  (14)
      #patgL-1 is-about #patg-1 at t                       (15)
      #psrec-1 instance-of DATASET-RECORD at t             (16)

Assertions of types (11) and (14) are generated whenever an IUI(L) – here #patgL-1 – is
generated for the first time while processing the data for a specific patient. Assertions
of type (15) are generated wherever IUI(L) and IUI(P) values co-occur in a template line.
Assertions of types (12) and (13) are generated for all template lines in which there is
both (1) a value for P-Rel and (2) a condition expressed in the left part of Table 1 that
is satisfied by a data item in the original dataset. Assertion (16) expresses the
assertional content of L1. The co-ref-with relationship – short for ‘co-referential-with’
– used in (13) holds between two ICEs whenever concretizations thereof describe the same
portion of reality (POR). Both ICEs then (in harmony with talk of a ‘correspondence
theory of truth’) enjoy a corresponds-to relationship with the same POR. Where the
assertions (8) to (10) describe parts of first-order reality, (11) to (14) describe the
second-order entities that have some sort of aboutness relation with these first-order
items. Assertion (15) provides the link between the two.

B.      Referencing implicit information
The variable ‘q3’ in the study set holds responses to the question ‘Have you had pain in
the face, jaw, temple, in front of the ear or in the ear in the past month?’ A positive
answer is encoded as ‘1’, a negative one as ‘0’. Although certain particulars on the side
of the patient to whom the question is addressed (for example his jaw, temple, the past
month, etc.) are explicitly referred to in the question, they are only implicit in
admissible responses. To achieve our objective, explicit reference is required, which is
achieved by means of IM-records, all of which have under ‘REF’ a textual reference to an
entity – or configuration of entities [15] – that must exist for the corresponding ‘Var’
to make sense. IM-records – in this case L10, L11 and L12 – are generated manually by
applying step (3) of the data expansion algorithm described under METHODS above. When the
template is used to generate assertions about #pat-1, a negative answer to question q3
(L8) would generate an RTT to the effect that the patient lacks participation in an
instance of pain – we view such instances as processes [19] – by using the lacks-family
of relations for the expression of negative findings [20]. In case of a positive answer,
an IUI for the appropriate instance is generated and participation of the patient therein
is asserted. Both answers generate IUIs for the patient’s lower face, the time when the
question was asked, and the period of one month prior to the asking: all of these
entities do indeed exist whatever answer is given.

C.      (Un)justified presence and absence
Template lines of types UA, UP, RP, and JA make explicit whether there are missing data
or data that should not be there.
    L7, for instance, brings it about that when, for patient #pat-1 in the study set, no
value for the variable ‘sex’ is provided – expressed by the appearance of ‘BLANK’ in the
template under both ‘Min’ and ‘Max’ – an RTT is generated that declares the data item
#patgL-1 to be an instance of an underspecified ICE. This assertion does not mean that
the data item itself is absent; rather it means that certain information is missing.
    An absence or presence of a value for some variable may be justified or unjustified
depending on the value of some other variable. The last four lines in Table 1, for
example, describe dependencies between the variables ‘q3’ (for which the possible values
‘1’ and ‘0’ mean, respectively, current presence or absence of pain) and ‘an_8_gcps_1’,
the latter containing answers to the question ‘How would you rate your facial pain on a 0
to 10 scale at the present time, that is right now, where 0 is “no pain” and 10 is “pain
as bad as could be”?’ L13 states that when the values for both ‘q3’ and ‘an_8_gcps_1’ are
‘0’, then the two ICEs of which the codings for the answers are concretizations enjoy a
corresponds-to relation to the same portion of reality.
    L16 asserts that, if a record in the dataset has a ‘0’ value for the variable ‘q3’,
and if there is no value for the variable ‘an_8_gcps_1’, then the absence of a value for
‘an_8_gcps_1’ is justified. This is then documented by means of an RTT to the effect that
the corresponding ICE is justifiably blank (as concretized by, for instance, an empty
cell in that part of the spreadsheet). As a last example, L14 asserts that if the value
given for ‘an_8_gcps_1’ is between 1 and 10 while the value for ‘q3’ is 0, then the value
for the former is unjustifiably present (the corresponding ICE must thus be classified as
disinformation – as dictated by the coding guidelines for the corresponding pair of
questions).

D.      Limitations
To achieve the vision of maximally self-explanatory and explicit data repositories,
several issues will need to be addressed. We will need above all a fully adequate set of
relations for the various flavors of aboutness and
correspondence, and a better theory of ICEs, for instance concerning the various types
that exist and how they relate to concretizations and to each other; these issues are
currently not addressed in the Information Artifact Ontology or any other realism-based
ontology.

                           VI. CONCLUSION
We have presented the beginnings of a methodology that allows a clinical research dataset
to be translated into a set of Referent Tracking Tuples that has the following features:
not only the portion of reality described by the dataset and the dataset itself are
represented in a way that mimics the structure of reality, but so also are the relations
between components of this dataset on the one hand and the corresponding portions of
reality on the other. Applying the methodology to a concrete dataset and performing some
basic exploratory statistics revealed that all of the relations we distinguished between
data items and what they are about (if, indeed, they are about anything at all) do indeed
occur in our study data. A set of RTTs of this sort may in the future perhaps replace the
more complicated exchange information models that are used in message-based paradigms or
in the Extract – Transform – Load (ETL) analyses and procedures used in data warehousing.
Although the syntax and semantics of RTTs seem to us to be powerful enough to represent
what is required, a current limitation is the insufficient development of the Information
Artifact Ontology. A second limitation is that not all RTTs can easily be translated into
OWL-based languages. Where the former is a job to be done by ontologists, the latter is a
task for computer science.

                        ACKNOWLEDGEMENTS
This work was funded in part by grant 1R01DE021917-01A1 from the National Institute of
Dental and Craniofacial Research (NIDCR). The content of the paper is solely the
responsibility of the authors and does not necessarily represent the official views of
the NIDCR or the NIH.

                              REFERENCES
[1]  Haas L. Beauty and the Beast: The Theory and Practice of Information Integration.
     In: Schwentick T, Suciu D, editors. Lecture Notes in Computer Science. Berlin,
     Heidelberg: Springer-Verlag; 2007. p. 28-43.
[2]  Marenco L, Wang R, Nadkarni P. Automated Database Mediation Using Ontological
     Metadata Mappings. J Am Med Inform Assoc. 2009 Sep-Oct;16(5):723-37.
[3]  Sim I, Carini S, Tu SW, Detwiler LT, Brinkley J, Mollah SA, et al. Ontology-Based
     Federated Data Access to Human Studies Information. In: AMIA Annu Symp Proc 2012.
     Chicago, IL; 2012. p. 856-65.
[4]  Baumbach J, Brinkrolf K, Czaja LF, Rahmann S, Tauch A. CoryneRegNet: an
     ontology-based data warehouse of corynebacterial transcription factors and
     regulatory networks. BMC Genomics. 2006;7:24.
[5]  Rodriguez-Muro M, Calvanese D. Dependencies: Making Ontology Based Data Access Work
     In Practice. Proc of the 5th Alberto Mendelzon Int Workshop on Foundations of Data
     Management (AMW 2011) 2011.
[6]  Kohler J, Philippi S, Lange M. SEMEDA: ontology based semantic integration of
     biological databases. Bioinformatics. 2003 Dec 12;19(18):2420-7.
[7]  Poggi A, Lembo D, Calvanese D, Giacomo GD, Lenzerini M, Rosati R. Linking data to
     ontologies. In: Spaccapietra S, editor. Journal on data semantics X. Heidelberg:
     Springer-Verlag; 2008. p. 133-73.
[8]  Smith B, Ceusters W. Ontological Realism as a Methodology for Coordinated Evolution
     of Scientific Ontologies. Applied Ontology. 2010;5(3-4):139-88.
[9]  Rudnicki R, Ceusters W, Manzoor S, Smith B. What Particulars are Referred to in EHR
     Data? A Case Study in Integrating Referent Tracking into an Electronic Health Record
     Application. In: Teich JM, Suermondt J, C H, editors. American Medical Informatics
     Association 2007 Annual Symposium Proceedings, Biomedical and Health Informatics:
     From Foundations to Applications to Policy. Chicago, IL; 2007. p. 630-4.
[10] Ceusters W. An information artifact ontology perspective on data collections and
     associated representational artifacts. Stud Health Technol Inform. 2012;180:68-72.
[11] John MT, Reißmann D, Schierz O, Wassell RW. Oral health-related quality of life in
     patients with temporomandibular disorders. Journal of Orofacial Pain.
     2007;21(1):46-54.
[12] Dworkin SF, LeResche L. Research diagnostic criteria for temporomandibular
     disorders: review, criteria, examinations and specifications. Journal of
     Craniomandibular Disorders. 1992;6(4):301-55.
[13] Mancl L, Whitney C, Zhu X. A SAS computer program to evaluate the research
     diagnostic criteria for classification of temporomandibular disorders: University of
     Washington; 1999 June 3.
[14] Ceusters W, Smith B. Strategies for Referent Tracking in Electronic Health Records.
     Journal of Biomedical Informatics. 2006 June;39(3):362-78.
[15] Ceusters W, Manzoor S. How to track Absolutely Everything? In: Obrst L, Janssen T,
     Ceusters W, editors. Ontologies and Semantic Technologies for the Intelligence
     Community. Frontiers in Artificial Intelligence and Applications. Amsterdam: IOS
     Press; 2010. p. 13-36.
[16] Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, et al. Relations in
     biomedical ontologies. Genome Biology. 2005;6(5):R46.
[17] Hogan WR, Garimalla S, Tariq S, Ceusters W. Representing Local Identifiers in a
     Referent-Tracking System. In: Smith B, editor. Proceedings of the International
     Conference on Biomedical Ontology. Buffalo, NY; 2011. p. 252-4.
[18] Ceusters W, Smith B. A Unified Framework for Biomedical Terminologies and
     Ontologies. In: Safran C, Marin H, Reti S, editors. Proceedings of the 13th World
     Congress on Medical and Health Informatics (Medinfo 2010), Cape Town, South Africa,
     12-15 September 2010. Amsterdam: IOS Press; 2010. p. 1050-4.
[19] Smith B, Ceusters W, Goldberg LJ, Ohrbach R. Towards an Ontology of Pain. In: Okada
     M, editor. Proceedings of the Conference on Logic and Ontology. Tokyo: Keio
     University Press; 2011. p. 23-32.
[20] Ceusters W, Elkin P, Smith B. Negative Findings in Electronic Health Records and
     Biomedical Ontologies: A Realist Approach. International Journal of Medical
     Informatics. 2007 March;76:326-33.
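
By way of illustration only, the dependencies between ‘q3’ and ‘an_8_gcps_1’ described under ‘(Un)justified presence and absence’ for template lines L13, L14 and L16 can be sketched in Python. The function name classify_gcps_item, the use of None for a blank cell, and the returned labels are hypothetical choices made for this sketch; they are not part of the template syntax or of any referent tracking implementation. Value combinations that the text does not discuss are reported as not covered rather than guessed at.

```python
from typing import Optional


def classify_gcps_item(q3: Optional[int], gcps: Optional[int]) -> str:
    """Classify the data item for 'an_8_gcps_1' relative to 'q3'.

    Assumption: a blank cell in the dataset is represented as None.
    The labels returned are illustrative names, not RTT vocabulary.
    """
    if q3 == 0:
        if gcps is None:
            # L16: no pain reported and no rating given, so the blank is justified.
            return "justified-absence"
        if gcps == 0:
            # L13: both codings correspond to the same portion of reality.
            return "co-referential"
        if 1 <= gcps <= 10:
            # L14: a pain rating although no pain was reported (disinformation).
            return "unjustified-presence"
    # Combinations not described in the text are deliberately left open.
    return "not-covered"


if __name__ == "__main__":
    # Hypothetical records: (patient label, q3 value, an_8_gcps_1 value).
    records = [
        ("#pat-1", 0, None),
        ("#pat-2", 0, 0),
        ("#pat-3", 0, 7),
        ("#pat-4", 1, 5),
    ]
    for patient, q3, gcps in records:
        print(patient, classify_gcps_item(q3, gcps))
```

In a fuller treatment each label would give rise to an RTT about the corresponding ICE; the sketch stops at the classification step because the exact form of those tuples depends on Table 1.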

