ICBO 2014 Proceedings

Clinical Data Wrangling using Ontological Realism and Referent Tracking

Werner Ceusters
Department of Biomedical Informatics
University at Buffalo
Buffalo, NY 14203, USA
Email: ceusters@buffalo.edu

Chiun Yu Hsu
Neuroscience Program
Medicine and Biomedical Sciences
University at Buffalo, NY 14260, USA
Email: chiunhsu@buffalo.edu

Barry Smith
Department of Philosophy
University at Buffalo
Buffalo, NY 14203, USA
Email: phismith@buffalo.edu

Abstract: Ontological realism aims to develop high-quality ontologies that faithfully represent what is general in reality and to use these ontologies to render heterogeneous data collections comparable. To achieve this second goal for clinical research datasets presupposes not merely (1) that the requisite ontologies already exist, but also (2) that the datasets in question are faithful to reality in the dual sense that (a) they denote only particulars and relationships between particulars that do in fact exist and (b) they do this in terms of the types and type-level relationships described in these ontologies. While much attention has been devoted to (1), work on (2), which is the topic of this paper, is comparatively rare. Using Referent Tracking as basis, we describe a technical data wrangling strategy which consists in creating for each dataset a template that, when applied to each particular record in the dataset, leads to the generation of a collection of Referent Tracking Tuples (RTT) built out of unique identifiers for the entities described by means of the data items in the record. The proposed strategy is based on (i) the distinction between data and what data are about, and (ii) the explicit descriptions of portions of reality which RTTs provide and which range not only over the particulars described by data items in a dataset, but also over these data items themselves. This last feature allows us to describe particulars that are only implicitly referred to by the dataset; to provide information about correspondences between data items in a dataset; and to assert which data items are unjustifiably or redundantly present in or absent from the dataset. The approach has been tested on a dataset collected from patients seeking treatment for orofacial pain at two German universities and made available for the NIDCR-funded OPMQoL project.

Keywords: referent tracking, data wrangling, ontological realism

I. INTRODUCTION

One goal of ontology-based research is the integration of information residing in heterogeneous data collections in the hope that by running queries over the resultant combined data collections we will be able to answer questions that would otherwise remain unanswered [1]. Such integration can be achieved through different paradigms, including: mediation [2], federation [3], data warehousing [4], and, most recently, the Ontology-Based Data Access (OBDA) paradigm [5], which is distinguished by the fact that it keeps the data sources and conceptual layer of an information system separate and independent.

To be effective, all such paradigms require ontology-based mappings ranging not only over the database schemas but also over the data types by means of which the data are stored [6]. Research in OBDA revealed that successful information integration requires much more detail than is standardly provided: it requires also suitable mechanisms for mapping individual data values – rather than merely data fields – to corresponding instances of ontology classes – for example to patients in a clinical study. This in turn requires the specification of how identifiers for such instances can be generated from such data values in order to enable creation of an ABox suitable for answering queries relating to such instances [7]. Such specification, we believe, may well be a critical issue in the context of clinical research datasets, where (as we shall discover below) data values do not always denote what is suggested by the variable or fieldname under which they appear.

Suppose, for example, that in the record of some patient the variable phenotypic gender is associated with a value of either ‘0’ or ‘1’ – meaning ‘male’ or ‘female,’ respectively. It is then safe to create an ABox statement to the effect that this patient’s phenotypic gender is an instance of the corresponding ontology class. If no data value is found, however, then it should not be assumed that the patient in question does not have a phenotypic gender. If, on the other hand, a value of ‘2’ – documented as meaning ‘unknown’ – is found, then this should not lead to an ABox assertion to the effect that the given patient’s phenotypic gender is an instance of a special kind which is neither male nor female. The value ‘unknown’ provides information not about the patient, but rather about the data we have about the patient.

The problem we face in creating data value to ontology mappings from clinical research data repositories is that the information needed for such mappings is not explicitly represented in the datasets. Rather, it is scattered through various data dictionaries and instruction manuals (relating for example to how to extract and process data from responses to standardized questionnaires).

The explicit representation that is pursued by the Referent Tracking (RT) methodology is based on Ontological Realism as described in [8], and on the thesis that explicit representation can best be achieved by generating unique identifiers for all instances of ontology classes which are described – whether explicitly or implicitly – in our data. In [9] we described an algorithm to achieve explicit representation of this sort from highly structured electronic health record (EHR) data. The research questions we address here are:

(1) to what extent can a similar algorithm be used for clinical research data collections, for instance to provide information both about particulars that are implicitly referred to and about correspondences between data items in a dataset,
(2) what kinds of ambiguous and implicit information can one expect to encounter in such data collections,
(3) is it useful to set limits on the types and amounts of implicit information that we will render explicit, and
(4) is it possible to use the referent tracking methodology in combination with appropriate ontologies to provide a complete and explicit representation of clinical research datasets that will take account of the constraints and provisions typically documented in data dictionaries and other data-related sources, for instance to describe which data items are unjustifiably or redundantly present or absent?

Our hypothesis is that, even where it is not possible to provide a completely accurate RT representation of the entities in reality described by a given body of data, identifying the types of challenges to such representation would itself yield a useful resource for avoiding similar problems in future clinical research studies.

II. MATERIALS

The work described below is part of the NIDCR-funded project Ontology for Pain-related Mental Health and Quality of Life (OPMQoL), which involves the integration of five datasets which – although collected independently – cover similar sorts of information about patients who experienced one or other form of orofacial pain [10]. All datasets are made available as spreadsheet tables (from here on referred to as ‘source tables’). Each row in the body of each such table is a collection of data items obtained from a single patient; each column is a collection of data items resulting from some specific type of observation. If a header row is present, its cells indicate what sorts of observations are reported on in the respective columns.

The de-identified dataset used for the work described here – from here on referred to as the ‘study set’ – was collected from 390 patients seeking treatment for orofacial pain [11]. Inclusion criteria were that patients had at least one diagnosis according to the Research Diagnostic Criteria for Temporomandibular Disorders (RDC/TMD) [12]. The study set comes with a variable (n=161) codebook and a technical report explaining certain dependencies and implicit assumptions [13].

III. METHODS

A. Referent Tracking

RT is designed to yield data repositories whose content can be expressed as a collection of Referent Tracking Tuples (RTT) [14]. An RTT is an assertion about a particular, i.e. an entity in reality that exists in space and time [15]. Each RTT follows a semi-formal syntax which is close to the one used for instance-level relationships in the definitions of the Relation Ontology [16]. Ignoring here certain housekeeping parameters we can assert that RTT assertions about continuants (entities such as patients, hospitals, teeth, jaws which endure through time, as contrasted with occurrents or processes), are of the form ‘x p-rel y t-rel t’, where:

• ‘x’ is the (ideally) singular and globally unique instance identifier (IUI) denoting the particular described,
• ‘y’ is either: (1) an IUI denoting another particular or: (2) a representational unit drawn from either a realism-based ontology or a concept-based terminology,
• ‘p-rel’ expresses a relationship obtaining between the referents of x and y,
• ‘t’ denotes a particular temporal region, and
• ‘t-rel’ expresses the relationship obtaining between the temporal region denoted by t and the temporal region during which p-rel obtains between x and y.

RTT assertions that do not mention a continuant have the form ‘x p-rel y,’ where ‘x’, ‘p-rel’ and ‘y’ are otherwise treated in the way described above.

RT aims to do away with the ambiguity in assertions such as ‘John has a benign duodenal polyp’. This assertion tells us that there exists some instance of a given type, but not which one in particular. This ambiguity is preserved in John’s EHR, where diagnostic codes drawn from some terminology or ontology are used to assert existence in John at some time t1 of polyps of a given type. The consequence is that, when a later assertion is added to John’s EHR to the effect that he has a malignant duodenal polyp, the data provides no basis for inferences concerning whether it is the very same polyp as the one referred to at t1 that has turned malignant or some other polyp appearing at some later time t2 [14]. This ambiguity disappears when we represent the first-described situation using the following RTTs:

• #1 part-of #2 at t1 (1)
• #1 instance-of benign duodenal polyp at t1 (2)
• #1 instance-of malignant duodenal polyp at t2 (3)

where ‘#1’ denotes the polyp and ‘#2’ John. The alternative situation would be represented by using distinct IUIs for each polyp as follows, where ‘#3’ denotes a second polyp:

• #1 part-of #2 at t1 (4)
• #3 part-of #2 at t2 (5)

| L | Var | IT | REF | Min | Max | Val | IUI(L) | IUI(P) | P-Type | P-Rel | P-Targ | Trel | Time |
|---|-----|----|-----|-----|-----|-----|--------|--------|--------|-------|--------|------|------|
| 1 | | IM | patient_study_record | | | | | #psrec- | DATASET-RECORD | | | at | t |
| 2 | id | LV | patient_identifier | | | | #pidL- | #pid- | DENOTATOR | denotes | #pat- | at | t |
| 3 | id | IM | patient | | | | #patL- | #pat- | PATIENT | | | at | t |
| 4 | sex | CV | gender | | | | #patgL- | #patg- | GENDER | inheres-in | #pat- | at | t |
| 5 | sex | CV | male | | | 0 | | #patg- | MALE-GENDER | inheres-in | #pat- | at | t |
| 6 | sex | CV | female | | | 1 | | #patg- | FEMALE-GENDER | inheres-in | #pat- | at | t |
| 7 | sex | UA | sex | BLANK | BLANK | | | #patgL- | UNDERSPEC-ICE | | | at | t |
| 8 | q3 | CV | no_pain_in_lower_face | | | 0 | #q3L0- | #pat- | | lacks-pcp | PAIN | at | #tq3- |
| 9 | q3 | CV | pain_in_lower_face | | | 1 | #q3L1- | #pq3- | PAIN | participant | #pat- | at | #tq3- |
| 10 | q3 | IM | in_the_past_month | | | | | #tq3- | MONTH-PERIOD | | | | |
| 11 | q3 | IM | lower_face | | | | | #patlf- | LOWER-FACE | part-of | #pat- | at | t |
| 12 | q3 | IM | time_of_q3_concretization | | | | | #cq3- | TIME-PERIOD | after | #tq3- | | |
| 13 | q3 | RP | an_8_gcps_1 | 0 | 0 | 0 | #q3L- | #q3L- | | co-ref-with | #q3L0- | at | t |
| 14 | q3 | UP | an_8_gcps_1 | 1 | 10 | 0 | #q3L- | #q3L- | DISINFORMATION | | | at | t |
| 15 | q3 | UA | an_8_gcps_1 | BLANK | BLANK | 1 | #q3L- | #q3L- | UNDERSPEC-ICE | | | at | t |
| 16 | q3 | JA | an_8_gcps_1 | BLANK | BLANK | 0 | #q3L- | #q3L- | J-BLANK-ICE | | | at | t |

Table 1: Simplified template for data expansion of the variables (‘Var’) ‘id’, ‘sex’ and ‘q3’ of the original dataset ignoring time-related information.
Legend: ‘L’ = Line number in this table; ‘IT’ = Information Type (possible values being ‘LV’ = Literal Value, ‘CV’ = Coded Value, ‘IM’ = IMplicit reference, ‘RP’ = Redundant Presence, ‘UP’ = Unjustified Presence, ‘UA’ = Unjustified Absence, ‘JA’ = Justified Absence); ‘REF’ = Reference; ‘Min’ = lowest possible value for variable; ‘Max’ = highest possible value for variable; ‘Val’ = possible value for variable; ‘IUI(L)’ = prefix for generating an IUI proxy for the information content entity which refers to the corresponding value for the variable under ‘Var’ for the patient being processed; ‘IUI(P)’ = prefix for generating an IUI proxy for whatever is denoted by this information content entity; ‘P-Type’ = ontological type of the entities denoted by instantiated IUI(P)s; ‘P-Rel’ = relation between the entity denoted by an instantiated IUI(P) and the entity denoted by an instantiated P-Targ; ‘Trel’ = temporal relation; ‘Time’ = temporal period during which P-Rel holds. Only entries relevant to the discussion in this paper are shown. See discussion section for other details.

• #1 instance-of benign duodenal polyp at t1 (6)
• #3 instance-of malignant duodenal polyp at t2 (7)

A further goal of RT is to make explicit all the implicit assumptions that need to be taken into account to interpret given data correctly. Some of these assumptions result from the use of broken information models or from practices such as registering ICD-9-CM code 659.7 – ‘Abnormality in fetal heart rate or rhythm’ – in the diagnosis field of the mother’s EHR. The RT method is most effective when its principles are applied at the time of data collection and registration, though as shown in [17] post-hoc translations are also possible.

B. Methodology applied

The work reported here involved the following steps:

(1) cross-checking the study set with the variable codebook and technical report for appropriate coding of values, field names, and field descriptions,
(2) annotating the dataset with appropriate descriptions,
(3) building an executable template that makes explicit, for each of the data values, how their referents must be analyzed in RT terms; this is achieved by applying the following data expansion algorithm [9]:
    a. identify all the possible particulars that are explicitly referred to by a specific data value when applied to a specific patient;
    b. determine for each particular identified under (3a) whether it is a dependent or independent entity [8];
    c. if a given particular is a dependent continuant, identify the independent continuant on which it depends; if an entity is an occurrent, identify the continuants which participate in it;
    d. repeat steps (3b) and (3c) as required;
(4) selecting from appropriate realism-based ontologies the representational units that denote universals or defined classes whose instances or members are either directly referred to in the dataset or implicitly referred to as discovered through application of the algorithm described in (3);
(5) implementing an algorithm that uses outputs from (3) and (4) to generate for each patient described in the dataset a collection of RTTs that provides a realism-based representation of that patient’s situation;
(6) generating statistics needed to answer the research questions described in the INTRODUCTION, above.

IV. RESULTS

Research questions (1) and (4) are answered by our development of a technical approach which enables the creation for each dataset of a template which, when applied to a particular record in the dataset, yields a corresponding collection of RTTs. Part of the approach is captured in Table 1, which shows a simplified version of some sample lines (indexed under ‘L’) as they appear in the template produced at step (3) (under METHODS, above) for the variables ‘id’, ‘sex’ and ‘q3’. What the template lines encode is determined by the information type (IT), the detailed semantics of which is described in section V. Common to all information types is that part of the template that appears to the left of the dashed vertical line in Table 1. This specifies the conditions which must be satisfied if RTTs are to be generated on the basis of the information provided to the right of this line.

| IT | Template Av. (SD) | Template Min | Template Max | Patients Av. (SD) | Patients Min | Patients Max |
|----|-------------------|--------------|--------------|-------------------|--------------|--------------|
| CV | 3.57 (2.27) | 0 | 11 | 0.82 (0.38) | 0 | 1 |
| IM | 2.79 (1.43) | 0 | 6 | 2.69 (1.46) | 0 | 6 |
| UA | 0.16 (1.02) | 0 | 12 | 0.01 (0.09) | 0 | 10 |
| JA | 0.16 (1.02) | 0 | 12 | 0.04 (0.34) | 0 | 12 |
| RP | 0.13 (0.98) | 0 | 12 | 0.01 (0.10) | 0 | 11 |
| UP | 0.13 (0.98) | 0 | 12 | 0.00 (0.01) | 0 | 5 |

Table 2. Occurrence of Record Types (see Table 1) per variable (n=161) in the study set for the template (left block) and per patient (n=390) after application of the template (right block).

Table 2 answers research questions (3) and (4) by providing statistics relating to the lines out of which the data translation template for the study set is composed, and to the extent to which each of these lines was in fact applied to the patient population described in the study set. The table shows, for instance, that unjustified absences and presences were encountered, albeit in a small percentage of cases, and that on average for each variable and for each patient roughly 3 implicit particulars needed to be accounted for. It shows that the increase in the size of the dataset resulting from applying this methodology is, for the Halle-Leipzig dataset, roughly 300%, and also that the quality of this dataset (measured in terms of UA, RP and UP) is quite good.

V. DISCUSSION

Our vision is that the Big Data repositories of the future should be maximally explicit and maximally self-explanatory. By ‘maximally explicit’, we mean that each such repository should contain explicit reference to any and all the entities, including their interrelationships, that must exist for an assertion encoded in the repository to be a faithful representation of the corresponding part of reality. By ‘maximally self-explanatory’ we mean that the data in the repository should be presented in such a way that a researcher seeking to query the repository does not need to concern himself with any idiosyncrasies of and between datasets, or codes or formats, that were combined or used to build the repository. A strategy to achieve this is to submit to such a repository only individual datasets which are themselves maximally explicit and self-explanatory.

Our approach is based on the – to us – obvious distinction between data and what data are about. It then takes advantage of the fact that RTTs can be used to describe in explicit fashion not merely the portions of reality described by data items in a dataset, but also these data items themselves. This allows us to describe explicitly even those particulars that are only implicitly referred to in a dataset by generating suitable unique identifiers. It also allows us to provide information about correspondences (such as co-reference) between data items in a dataset, and also to assert which data items are redundant, or unjustifiably absent, and so forth.

A. Explicit data items

The study set contains some explicit data items which are about particulars on the side of the patient such as gender, facial pains experienced, clicking noises heard when opening their mouths, and so forth. Referent Tracking requires each of these particulars to be assigned an IUI; Ontological Realism tells us that each one of them is an instance of at least one universal. What universals these particulars are instances of is typically only very indirectly represented in the study set.

The strategy for translating explicit data items into RTTs is covered by the Literal Value (LV) and Coded Value (CV) records in the template (Table 1). Template lines of either type have under ‘REF’ the label obtained or constructed from the relevant data dictionary or other supporting information associated with the code value. The template shows, for example, that if, for a patient in the study set, the value for the variable ‘sex’ is ‘0’ (L5), then the gender of this patient is described as ‘male.’ This can be translated in RT terms into an assertion that the given patient’s gender is an instance of the universal male gender (or, in case gender does not qualify as a universal [18], that it is a member of the defined class ‘male gender’ – we will ignore this distinction in the remainder of this paper).

The IUIs assigned through application of our method are in reality very large numbers generated by an RT system to ensure the needed high probability of uniqueness. For the sake of readability, however, we provide simple abbreviations to stand in for these IUIs. We also leave out full specification of time-related information (which would be needed, for example, to deal with cases where a patient’s gender changes from one time to the next), and certain housekeeping details required by syntactically and semantically correct RTTs [15].

To see how IUI assignment works, now, we will suppose that, while processing the study set on the basis of the template illustrated in Table 1, the IUI #pat-1 is assigned to the first patient described and that #patg-1 is assigned to his gender. Then the following collection of assertions would be generated as part of a faithful RT-like representation of the corresponding portion of reality (POR) on the basis of lines L3 and L5 of the template:

• #pat-1 instance-of PATIENT at t (8)
• #patg-1 instance-of MALE-GENDER at t (9)
• #patg-1 inheres-in #pat-1 at t (10)

Of course, the study set, too, is a particular, and so also are the data items out of which it is built. According to the Information Artifact Ontology (IAO) the study set and its parts are particular concretizations of particular information content entities (ICEs). Thus the ‘0’ in a particular position of the spreadsheet on your screen indicating that #pat-1’s gender is male could be assigned an IUI, as also could the corresponding bits on the hard drive of your laptop which bring it about that your spreadsheet software causes the laptop to display the ‘0’ in that position. In addition, the ICEs here concretized can also be assigned IUIs of their own. For example in L1 of the template the IUI #psrec-1 is assigned to the ICE that is concretized on your screen as a row of the patient’s record, and in L4 #patgL-1 is assigned to the ICE whose concretizations inform us what the gender of #pat-1 is. Since referent tracking implementations also assign IUIs to RTTs, #RTT-patg-1-RN5a would be assigned to the ICE of which assertion (9), generated by L5, is a concretization. On this basis, now, the following assertions can be added:

• #patgL-1 component-of #psrec-1 at t (11)
• #RTT-patg-1-RN5a instance-of RTT at t (12)
• #patgL-1 co-ref-with #RTT-patg-1-RN5a at t (13)
• #patgL-1 instance-of DATA-ITEM at t (14)
• #patgL-1 is-about #patg-1 at t (15)
• #psrec-1 instance-of DATASET-RECORD at t (16)

Assertions of types (11) and (14) are generated whenever an IUI(L) – here #patgL-1 – is for the first time generated while processing the data for a specific patient. Assertions of type (15) are generated wherever IUI(L) and IUI(P) values co-occur in a template line. Assertions of types (12) and (13) are generated for all template lines in which there is both (1) a value for P-Rel and (2) a condition expressed in the left part of Table 1 that is satisfied by a data item in the original dataset. Assertion (16) expresses the assertional content of L1. The co-ref-with relationship – short for ‘co-referential-with’ – used in (13) holds between two ICEs whenever concretizations thereof describe the same portion of reality (POR). Both ICEs then (in harmony with talk of a ‘correspondence theory of truth’) enjoy a corresponds-to relationship with the same POR. Where the assertions (8) to (10) describe parts of first-order reality, (11) to (14) describe the second-order entities that have some sort of aboutness relation with these first-order items. Assertion (15) provides the link between the two.

B. Referencing implicit information

The variable ‘q3’ in the study set holds responses to the question ‘Have you had pain in the face, jaw, temple, in front of the ear or in the ear in the past month?’ A positive answer is encoded as ‘1,’ a negative one as ‘0’. Although certain particulars on the side of the patient to whom the question is addressed (for example his jaw, temple, the past month, etc.) are explicitly referred to in the question, they are only implicit in admissible responses. To achieve our objective, explicit reference is required, which is achieved by means of IM-records, all of which have under ‘REF’ a textual reference to an entity – or configuration of entities [15] – that must exist for the corresponding ‘Var’ to make sense. IM-records – in this case L10, L11 and L12 – are generated manually by applying step (3) of the data expansion algorithm described under METHODS above. When the template is used to generate assertions about #pat-1, a negative answer to question q3 (L8) would generate an RTT to the effect that the patient lacks participation in an instance of pain – we view such instances as processes [19] – by using the lacks-family of relations for the expression of negative findings [20]. In case of a positive answer, an IUI for the appropriate instance is generated and participation of the patient therein is asserted. Both answers generate IUIs for the patient’s lower face, the time when the question was asked, and the period of one month prior to the asking: all of these entities do indeed exist whatever answer is given.

C. (Un)justified presence and absence

Template lines of types UA, UP, RP, and JA make explicit whether there are missing data or data that should not be there. L7, for instance, brings it about that when, for patient #pat-1 in the study set, no value for the variable ‘sex’ is provided – expressed by the appearance of ‘BLANK’ in the template under both ‘Min’ and ‘Max’ – an RTT is generated that declares the data item #patgL-1 to be an instance of an underspecified ICE. This assertion does not mean that the data item itself is absent; rather it means that certain information is missing.

An absence or presence of a value for some variable may be justified or unjustified depending on the value of some other variable. The last four lines in Table 1, for example, describe dependencies between the variables ‘q3’ (for which the possible values ‘1’ and ‘0’ mean, respectively, current presence or absence of pain) and ‘an_8_gcps_1’, the latter containing answers to the question ‘How would you rate your facial pain on a 0 to 10 scale at the present time, that is right now, where 0 is “no pain” and 10 is “pain as bad as could be”?’ L13 states that when the values for both ‘q3’ and ‘an_8_gcps_1’ are ‘0’, then the two ICEs of which the coding for the answers are concretizations enjoy a corresponds-to relation to the same portion of reality.

L16 asserts that, if a record in the dataset has a ‘0’ value for the variable q3, and if there is no value for the variable ‘an_8_gcps_1’, then the absence of a value for ‘an_8_gcps_1’ is justified. This is then documented by means of an RTT to the effect that the corresponding ICE is justifiably blank (as concretized by, for instance, an empty cell in that part of the spreadsheet). As a last example, L14 asserts that if the value given for ‘an_8_gcps_1’ is between 1 and 10 while the value for q3 is 0, then the value for the former is unjustifiably present (the corresponding ICE must thus be classified as disinformation – as dictated by the coding guidelines for the corresponding pair of questions).

D. Limitations

To achieve the vision of maximally self-explanatory and explicit data repositories, several issues will need to be addressed. We will need above all a fully adequate set of relations for the various flavors of aboutness and correspondence, and a better theory of ICEs, for instance concerning the various types that exist and how they relate to concretizations and to each other; these issues are currently not addressed in the Information Artifact Ontology or any other realism-based ontology.

VI. CONCLUSION

We have presented the beginnings of a methodology that allows a clinical research dataset to be translated into a set of Referent Tracking Tuples that has the following features: not only the portion of reality described by the dataset and the dataset itself are represented in a way that mimics the structure of reality, but so also are the relations between components of this dataset on the one hand and the corresponding portions of reality on the other. Applying the methodology to a concrete dataset and performing some basic exploratory statistics revealed that all of the relations we distinguished between data items and what they are about (if, indeed, they are about anything at all) do indeed occur in our study data. A set of RTTs of this sort may in the future perhaps replace the more complicated exchange information models that are used in message-based paradigms or in the Extract-Transform-Load (ETL) analyses and procedures used in data warehousing. Although the syntax and semantics of RTTs seem to us to be powerful enough to represent what is required, a current limitation is the insufficient development of the Information Artifact Ontology. A second limitation is that not all RTTs can easily be translated into OWL-based languages. Where the former is a job to be done by ontologists, the latter is a task for computer science.

ACKNOWLEDGEMENTS

This work was funded in part by grant 1R01DE021917-01A1 from the National Institute of Dental and Craniofacial Research (NIDCR). The content of the paper is solely the responsibility of the authors and does not necessarily represent the official views of the NIDCR or the NIH.

REFERENCES

[1] Haas L. Beauty and the Beast: The Theory and Practice of Information Integration. In: Schwentick T, Suciu D, editors. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer-Verlag; 2007. p. 28-43.
[2] Marenco L, Wang R, Nadkarni P. Automated Database Mediation Using Ontological Metadata Mappings. J Am Med Inform Assoc. 2009 Sep-Oct;16(5):723-37.
[3] Sim I, Carini S, Tu SW, Detwiler LT, Brinkley J, Mollah SA, et al. Ontology-Based Federated Data Access to Human Studies Information. In: AMIA Annu Symp Proc 2012. Chicago, IL; 2012. p. 856-65.
[4] Baumbach J, Brinkrolf K, Czaja LF, Rahmann S, Tauch A. CoryneRegNet: an ontology-based data warehouse of corynebacterial transcription factors and regulatory networks. BMC Genomics. 2006;7:24.
[5] Rodriguez-Muro M, Calvanese D. Dependencies: Making Ontology Based Data Access Work In Practice. Proc of the 5th Alberto Mendelzon Int Workshop on Foundations of Data Management (AMW 2011); 2011.
[6] Kohler J, Philippi S, Lange M. SEMEDA: ontology based semantic integration of biological databases. Bioinformatics. 2003 Dec 12;19(18):2420-7.
[7] Poggi A, Lembo D, Calvanese D, Giacomo GD, Lenzerini M, Rosati R. Linking data to ontologies. In: Spaccapietra S, editor. Journal on data semantics X. Heidelberg: Springer-Verlag; 2008. p. 133-73.
[8] Smith B, Ceusters W. Ontological Realism as a Methodology for Coordinated Evolution of Scientific Ontologies. Applied Ontology. 2010;5(3-4):139-88.
[9] Rudnicki R, Ceusters W, Manzoor S, Smith B. What Particulars are Referred to in EHR Data? A Case Study in Integrating Referent Tracking into an Electronic Health Record Application. In: Teich JM, Suermondt J, C H, editors. American Medical Informatics Association 2007 Annual Symposium Proceedings, Biomedical and Health Informatics: From Foundations to Applications to Policy. Chicago, IL; 2007. p. 630-4.
[10] Ceusters W. An information artifact ontology perspective on data collections and associated representational artifacts. Stud Health Technol Inform. 2012;180:68-72.
[11] John MT, Reißmann D, Schierz O, Wassell RW. Oral health-related quality of life in patients with temporomandibular disorders. Journal of Orofacial Pain. 2007;21(1):46-54.
[12] Dworkin SF, LeResche L. Research diagnostic criteria for temporomandibular disorders: review, criteria, examinations and specifications. Journal of Craniomandibular Disorders. 1992;6(4):301-55.
[13] Mancl L, Whitney C, Zhu X. A SAS computer program to evaluate the research diagnostic criteria for classification of temporomandibular disorders: University of Washington; 1999 June 3.
[14] Ceusters W, Smith B. Strategies for Referent Tracking in Electronic Health Records. Journal of Biomedical Informatics. 2006 June;39(3):362-78.
[15] Ceusters W, Manzoor S. How to track Absolutely Everything? In: Obrst L, Janssen T, Ceusters W, editors. Ontologies and Semantic Technologies for the Intelligence Community. Frontiers in Artificial Intelligence and Applications. Amsterdam: IOS Press; 2010. p. 13-36.
[16] Smith B, Ceusters W, Klagges B, Köhler J, Kumar A, Lomax J, et al. Relations in biomedical ontologies. Genome Biology. 2005;6(5):R46.
[17] Hogan WR, Garimalla S, Tariq S, Ceusters W. Representing Local Identifiers in a Referent-Tracking System. In: Smith B, editor. Proceedings of the International Conference on Biomedical Ontology. Buffalo, NY; 2011. p. 252-4.
[18] Ceusters W, Smith B. A Unified Framework for Biomedical Terminologies and Ontologies. In: Safran C, Marin H, Reti S, editors. Proceedings of the 13th World Congress on Medical and Health Informatics (Medinfo 2010), Cape Town, South Africa, 12-15 September 2010. Amsterdam: IOS Press; 2010. p. 1050-4.
[19] Smith B, Ceusters W, Goldberg LJ, Ohrbach R. Towards an Ontology of Pain. In: Okada M, editor. Proceedings of the Conference on Logic and Ontology. Tokyo: Keio University Press; 2011. p. 23-32.
[20] Ceusters W, Elkin P, Smith B. Negative Findings in Electronic Health Records and Biomedical Ontologies: A Realist Approach. International Journal of Medical Informatics. 2007 March;76:326-33.
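
As an illustration of the RTT form ‘x p-rel y t-rel t’ described under III.A (Referent Tracking), the polyp example can be sketched in a few lines of Python. This is only a minimal sketch: the class name `RTT`, its field names, and the helper `polyps_of` are hypothetical and not part of any Referent Tracking implementation; housekeeping parameters are ignored, as in the text, and the malignant typing of the single-polyp case is placed at the later time t2, as the surrounding prose describes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class RTT:
    """Simplified Referent Tracking Tuple of the form 'x p-rel y t-rel t'."""
    x: str                       # IUI of the particular described
    p_rel: str                   # relationship between the referents of x and y
    y: str                       # another IUI, or a representational unit
    t_rel: Optional[str] = None  # temporal relation, e.g. 'at'
    t: Optional[str] = None      # temporal region

    def __str__(self) -> str:
        core = f"{self.x} {self.p_rel} {self.y}"
        return core if self.t is None else f"{core} {self.t_rel} {self.t}"


def polyps_of(owner: str, rtts: Iterable[RTT]) -> Set[str]:
    """Collect the IUIs of all particulars asserted to be part-of `owner`."""
    return {r.x for r in rtts if r.p_rel == "part-of" and r.y == owner}


# Situation 1: one polyp (#1) in John (#2) that is benign at t1, malignant at t2.
same_polyp = [
    RTT("#1", "part-of", "#2", "at", "t1"),
    RTT("#1", "instance-of", "benign duodenal polyp", "at", "t1"),
    RTT("#1", "instance-of", "malignant duodenal polyp", "at", "t2"),
]

# Situation 2: two distinct polyps (#1 and #3) in John (#2).
two_polyps = [
    RTT("#1", "part-of", "#2", "at", "t1"),
    RTT("#3", "part-of", "#2", "at", "t2"),
    RTT("#1", "instance-of", "benign duodenal polyp", "at", "t1"),
    RTT("#3", "instance-of", "malignant duodenal polyp", "at", "t2"),
]

for rtt in same_polyp:
    print(rtt)
print(len(polyps_of("#2", same_polyp)), len(polyps_of("#2", two_polyps)))
```

Because each polyp carries its own IUI, counting the distinct identifiers that stand in a part-of relation to John separates the two situations, which a pair of diagnostic codes alone cannot do.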
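
The way template lines are applied to a record, as discussed under V.A (Explicit data items), can be sketched as follows. The sketch is an assumption-laden simplification, not the authors' executable template: it covers only stand-ins for lines L3 and L5 to L7 of Table 1, the dictionary keys and the function `expand` are hypothetical, a BLANK cell is modeled as an empty string, and the record number is used directly as the IUI suffix instead of a large unique number.

```python
from typing import Dict, List, Optional

# Hypothetical stand-in for a few lines of Table 1.
# 'val' is the condition on the variable's value: None means the line always
# applies (an IM line); "" stands for a BLANK cell (assumption of this sketch).
TEMPLATE: List[Dict[str, Optional[str]]] = [
    {"var": "id", "val": None, "iui_p": "#pat-", "p_type": "PATIENT",
     "p_rel": None, "p_targ": None},
    {"var": "sex", "val": "0", "iui_p": "#patg-", "p_type": "MALE-GENDER",
     "p_rel": "inheres-in", "p_targ": "#pat-"},
    {"var": "sex", "val": "1", "iui_p": "#patg-", "p_type": "FEMALE-GENDER",
     "p_rel": "inheres-in", "p_targ": "#pat-"},
    {"var": "sex", "val": "", "iui_p": "#patgL-", "p_type": "UNDERSPEC-ICE",
     "p_rel": None, "p_targ": None},
]


def expand(record: Dict[str, str], n: int) -> List[str]:
    """Apply every template line whose condition matches the record."""
    rtts: List[str] = []
    for line in TEMPLATE:
        value = record.get(line["var"], "")
        if line["val"] is not None and value != line["val"]:
            continue  # condition in the left part of the template not met
        iui = f"{line['iui_p']}{n}"
        rtts.append(f"{iui} instance-of {line['p_type']} at t")
        if line["p_rel"]:
            rtts.append(f"{iui} {line['p_rel']} {line['p_targ']}{n} at t")
    return rtts


for rtt in expand({"id": "P001", "sex": "0"}, 1):
    print(rtt)
```

For a record whose 'sex' value is '0', the sketch yields the three assertions numbered (8) to (10) in the text. For a record with a blank 'sex' cell it yields no gender assertion at all, only a statement that the data item is underspecified, which mirrors the point that a missing value says something about the data and not about the patient.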
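
The dependency between ‘q3’ and ‘an_8_gcps_1’ discussed under V.C ((Un)justified presence and absence) amounts to a small decision rule, sketched below. The function names `classify` and `rtt_for` are hypothetical, a blank cell is modeled as `None`, and combinations not covered by lines L13 to L16 of Table 1 (for example a pain rating given when q3 is '1') are deliberately left unclassified, since the source does not say how they are handled.

```python
from typing import Optional

# P-Type assigned to the information content entity for each information type,
# following lines L14 to L16 of Table 1.
P_TYPE = {"UP": "DISINFORMATION", "UA": "UNDERSPEC-ICE", "JA": "J-BLANK-ICE"}


def classify(q3: Optional[int], gcps: Optional[int]) -> Optional[str]:
    """Return the information type (RP, UP, UA, JA) for an_8_gcps_1 given q3."""
    if gcps is None:                # no value recorded for an_8_gcps_1
        if q3 == 1:
            return "UA"             # pain reported but rating missing (L15)
        if q3 == 0:
            return "JA"             # no pain, so a blank rating is justified (L16)
        return None
    if q3 == 0:
        if gcps == 0:
            return "RP"             # both say 'no pain': redundant presence (L13)
        if 1 <= gcps <= 10:
            return "UP"             # rating contradicts q3: unjustified presence (L14)
    return None                     # not covered by L13 to L16


def rtt_for(n: int, q3: Optional[int], gcps: Optional[int]) -> Optional[str]:
    """Build the simplified RTT string for record number n, if any applies."""
    kind = classify(q3, gcps)
    if kind is None:
        return None
    if kind == "RP":
        return f"#q3L-{n} co-ref-with #q3L0-{n} at t"
    return f"#q3L-{n} instance-of {P_TYPE[kind]} at t"


print(rtt_for(1, 0, 0))
print(rtt_for(1, 0, 7))
print(rtt_for(1, 1, None))
print(rtt_for(1, 0, None))
```

Keeping the classification separate from the string construction makes it easy to count information types per patient, which is the kind of tally reported in Table 2.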