=Paper=
{{Paper
|id=Vol-1515/regular14
|storemode=property
|title=Structured data acquisition with ontology-based web forms
|pdfUrl=https://ceur-ws.org/Vol-1515/regular14.pdf
|volume=Vol-1515
|dblpUrl=https://dblp.org/rec/conf/icbo/GoncalvesTNTM15
}}
==Structured data acquisition with ontology-based web forms==
Rafael S. Gonçalves∗, Samson W. Tu, Csongor I. Nyulas, Michael J. Tierney and Mark A. Musen

Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, California, USA

∗ To whom correspondence should be addressed: rafaelsg@stanford.edu

Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

===ABSTRACT===
Structured data acquisition is a common, challenging task that is widely performed in the field of biomedicine. However, in some biomedical fields, such as clinical functional assessment, little effort has been done to structure functional assessment data in such a way that it can be automatically employed in decision making (e.g., determining eligibility for disability benefits) based on conclusions derived from acquired data (e.g., assessment of impaired motor function). In order to be able to apply such automatisms, we need data structured in a way that can be exploited by automated deduction systems, for instance, in the Web Ontology Language (OWL); the de facto ontology language for the Web. The rise of OWL caused a paradigm shift in knowledge systems from frame-based to axiom-based. Because of the axiom-based nature of OWL, it is more difficult to acquire instance data based on OWL than it was based on frames. In this paper we tackle the problem of generating Web forms from OWL ontologies, and aggregating input gathered through these forms as an ontology of "semantically-enriched" form data that can be queried using an RDF query language, such as SPARQL. The ontology-based structured data acquisition framework that we have developed is presented through its specific application to the clinical functional assessment domain, with examples of how one can perform desirable analyses of gathered data with simple queries.

===1 INTRODUCTION===
Ontology-based form generation and structured data acquisition was first pioneered almost 30 years ago. In the early 1990s, Protégé-Frames used definitions of classes in an ontology to generate knowledge-acquisition forms, which could be used to acquire instances of the classes [2, 3]. With OWL as the preferred modeling language for ontologies, class definitions are collections of description logic (DL) axioms, and can no longer be seen as templates for forms [9]. Unlike template-based knowledge representations, where what can be said about a class is defined by the slots of the class template, axiom-based representations do not have this kind of locally scoped specification, and allow any axiom describing the same class to be added to the ontology, as long as the axiom does not lead to inconsistencies. Template-based knowledge representation systems use closed-world reasoning and have local constraints (e.g., cardinality of a slot for a particular class) that can be validated easily, while in an axiom-based system with the open-world assumption such local constraint checking is much more problematic. Furthermore, in our chosen application domain, assessment instruments have specific formats that do not lend themselves to be seen as representing instances of domain ontology classes. Items in the instruments have potentially complex descriptions of information to be collected, such as the severity of pain with a particular quality, and at a specific anatomical location. The challenge is to model the assessment instruments and relate the assessed data to a domain ontology with which one can formulate meaningful queries.

In this paper, we describe a solution for representing, acquiring and querying assessment data that uses (1) domain ontologies and standard terminologies to give formal descriptions of entities in our chosen domain, (2) an information model of assessment instruments to drive the generation of data-acquisition Web forms, and (3) a data model for the acquired information that links the data to the domain ontologies and standard terminologies. Such linkage makes it possible to query and aggregate the data using the logical representation of the domain concepts in the ontologies.

===2 RELATED WORK===
In addition to the comparison with Protégé-Frames' template-based instance acquisition method described in Section 1, we briefly contrast our work with two other systems that are designed to use forms for acquiring structured data: the first targets the domain of patient assessment, which is similar to the work reported here, while the second is a generic Web-based technology from which one can draw examples on how to arrive at a domain-independent solution.

The clinical documentation system described in [6] uses a template schema to allow a technology-savvy clinician to create documentation templates that include the local structure of subforms and potentially complex clinical descriptions consisting of features and their values. The features and values are mapped to a medical ontology, and the system automatically generates ontological descriptions of the data elements based on the mappings. Constrained by our goal to replicate existing forms, we took the opposite approach where we start with ontological descriptions of the data elements, specify how they are used in assessment instruments as part of the description of instruments, and generate Web forms for the acquisition of data. Having the freedom to design their documentation system, Horridge et al. avoided the laborious work of manually modeling the domain concepts.

Semantic wikis extend regular wikis with semantic technologies, wherein each wiki article is an RDF resource, and an instance of some resource such as a class defined in the schema,<sup>1</sup> which can be asserted to have relations with other RDF resources. These relations are defined by the authors of wiki articles, which could be a challenging task to perform without previous knowledge of the domain or the modeling. In a survey of semantic wikis featuring OWL reasoning and SPARQL<sup>2</sup> querying facilities [4], a user evaluation of a chosen semantic wiki implementation concluded that authoring instance data in such a way is cumbersome, even with users that were familiar with ontologies. A good solution to this would be exploiting the relations defined in the schema to provide "wiki article templates" whose form input fields derive from those relations, thus making it easier to author semantic wiki articles.

===3 APPLICATION DOMAIN===
Clinical functional assessment provides the application motivation for our work. Functional assessment is the evaluation of an individual's ability to perform body functions (e.g., flexing a joint) and defined tasks (e.g., walking a specific distance). It is necessary for evaluating disabilities for rehabilitation, for social security payment, or for decisions to retain or discharge service members who may be injured on duty. Despite its importance, it is not usually supported by electronic health record (EHR) systems [1]. These assessments are often documented using assessment instruments (e.g., check-lists and validated questionnaires) such as Karnofsky Performance Status [11]. Too frequently the data derived from using these instruments are saved as either blobs or non-standard data elements. While a standard such as LOINC® (Logical Observation Identifiers Names and Codes) defines the syntactic structures of assessment instruments as a hierarchy of panels with questions that have coded answers [10], it does not relate the semantic content of the questions and answers to standard terminologies and data models that allow meaningful querying and aggregation of acquired data.

In our application scenario we use, as exemplars, the U.S. Department of Veterans Affairs (VA) Disability Benefits Questionnaires (DBQs). DBQs are used to evaluate service members' disabilities and to determine the benefits for which they are eligible. We start off with these DBQs as our initial form specifications, and design an ontology-based method for Web form generation and structured data acquisition, subsequently exemplifying how one would go about exploiting such data for immediate or post facto analyses.

===4 MODELING===
In order to capture the semantic distinctions that are needed in functional assessment, we developed a Clinical Functional Assessment (CFA) ontology that models the concepts and relationships that occur in functional assessment instruments. We developed information models for such instruments and for data captured in the instruments. We will show how the CFA ontology and information models inform the generation of data-acquisition forms and how the resulting data can be queried and aggregated.

Our goal was to develop a set of light-weight ontologies and models with minimal ontological commitments, and postponing alignment with possible upper-level ontologies to the future. Existing ontologies, such as the Information Artifact Ontology (IAO),<sup>3</sup> do not provide a modeling of forms and questions that we could reuse. Furthermore, what we need is an information model that states, for example, that the structure of a "question" includes a specific text, not an ontology that models parts of information artifacts as ontological entities (e.g., modeling the text of a question as an instance of "textual entity" class). Our ontologies reference the International Classification of Functioning, Disability and Health (ICF),<sup>4</sup> developed by the World Health Organization (WHO), and other reference terminologies such as SNOMED CT.<sup>5</sup>

'''Imports structure''' The modeling tasks of this project involve describing different domain areas, leading us to create separate ontology files that can be re-used independently. In our specific application we use the full import closure as depicted in Figure 1.

[Figure 1: diagram of ontologies linked by owl:imports. Group labels: Form specification, Instance data, Querying and classification, Functional assessment. Ontology labels: form 1, form 2, ..., form n, form data, datamodel, criteria, CFA, ICF. Legend: domain-independent, domain-specific, application-specific.]

Fig. 1: Imports structure and role separation of ontologies developed for, or included as part of our modeling solution. Form specifications use terms from the datamodel ontology (e.g., to create question instances) as well as from domain-specific ontologies (e.g., CFA).

The ontology marked as Instance data in Figure 1 is the collection of data assertions from form submissions, possibly from different forms. The ontologies represented in Form specification are specifications of different forms; in our case, we use a single ontology that specifies two closely-related forms. The content of the above-mentioned ontologies is application-specific, that is, the way the data is represented is directly derived from the way in which forms are modeled (for different assessment instruments). However, resulting data still conform to the generic information models specified in the datamodel ontology. In this way, there is a separation of the Form specification ontologies (Abox axioms) from the Functional assessment ontologies that model the functional assessment domain and data models (mostly Tbox axioms). In Querying and classification we use a domain-specific ontology to apply SWRL rules,<sup>6</sup> and define complex OWL classes to facilitate querying in SPARQL and in OWL.

'''ICF''' ICF is a multi-purpose classification that, together with the International Classification of Diseases (ICD),<sup>7</sup> is a reference classification in the WHO Family of International Classifications (WHO–FIC). It provides a standard language and conceptual basis for the definition and measurement of functions and disability. However, unlike ICD codes that represent possible disease or injuries, coding different health and health-related states requires that ICF codes (e.g., "d4501" - walking long distance) be used in conjunction with component-specific qualifiers (e.g., a 0 to 4 scale to encode the range of impairment). Such a complex coding scheme makes it difficult to transform data derived from assessment instruments into the ICF format. Nevertheless, ICF provides a reference conceptual basis for the definition and measurement of functions and disability, thus justifying its usage in descriptions of functional assessment results, despite its limitations as a formal ontology [7]. To reference ICF concepts in our modeling of functional assessment descriptors, we use a version of ICF available from the National Center of Biomedical Ontology (NCBO) BioPortal repository [8], that is represented in OWL.

'''CFA''' The Clinical Functional Assessment (CFA) ontology models concepts and relationships that allow us to give formal descriptions of the findings, assessments, and measurements embodied in clinical functional assessment instruments. The ontology is divided into three main branches: (1) Finding: the result of an observation or judgement, (2) Value that defines collections of possible qualifiers and values for findings, and (3) SubjectMatterOntology that provides internally defined domain concepts that either are not available from standard terminologies or are references to standard terms that need to be organized into taxonomies. The Finding class is further subdivided into Assessment (those findings that have non-numeric result) and Measurement (those findings that have numeric results). We also define FunctionalFinding (a subclass of Finding) and FunctionalAssessment (a subclass of Assessment). In general, a functional assessment will have some assessed function that can be related to an ICF body function or activity (possibly as an exact match, specialization, or generalization), some assessed attribute, such as severity, that specifies the dimension of the function being assessed, and optionally some anatomical location of the assessment. Both findings and functions can be modified by qualifiers that further refine these entities. For example, a functional assessment may be made in the context of using assistive devices, and a function being assessed may have some temporal component (e.g., constant or intermittent pain). ICF being an imported ontology for CFA, all ICF categories, such as body structure, body function, activities and participation, and environmental factors are available for formalizing descriptions of functional assessments. For other standard terminologies such as SNOMED CT, ICD, and LOINC, instead of importing them as ontologies, we make references to them through an ExternallyCodedValue that specifies the terminology source and code. Queries that reference these codes require the availability of terminology services that relate these codes to other terms in the referenced terminologies.

The modeling of Finding is exemplified as follows, based on the "Back (Thoracolumbar Spine) Conditions" DBQ that we use as one of our exemplar assessment instruments; in the question on the severity of constant pain caused by radiculopathy on the right lower extremity, we define a subclass of FunctionalAssessment that has the assessed attribute 'severity', the assessed function 'icf:b2801 Pain in body part' that is qualified by a temporal quality 'Constant', and has anatomical location 'icf:s750. structure of lower extremity' with laterality 'Right'. Figure 2 illustrates the modeling of this assessment. With the modeling of the dimensions of assessment instrument questions, we can make queries on, and aggregate data collected through the instruments, as will be shown in Section 6.

Fig. 2: Modeling of "severity of constant pain caused by radiculopathy in the lower right extremity".

'''Datamodel''' The datamodel ontology is a generic, context-free representation of a form (e.g., it models elements such as questions and sections) and the data generated from a form (e.g., a string value from a text area, or values from an enumerated value set). Figure 3 summarizes key aspects of our modeling: elements of a form are asserted as subclasses of FormStructure, such as Form, Section and Question. Each kind of FormStructure generates some kind of Data; every form submission generates an instance of FormData, which references (via the hasComponent property) all instances of Data generated in the process of parsing form answers. Specific sections such as SubjectInfoSection collect information pertaining to a subject, and these details are aggregated in an instance of SubjectInformation. An answer to an instance of Question gives rise to an instance of Observation with a hasValue property assertion to the IRI of the selected answer. An instance of Observation will be inferred to have an outgoing hasFocus property assertion if the Question instance it derives from encodes some kind of semantic description of the question's meaning via the isAbout relation. Each instance of Question specifies a set of possible (answer) values via a hasPossibleValue relation to a subclass of Value.

[Figure 3: class diagram with two columns, FormStructure and Data. Class labels: Form, Section, SubjectInfoSection, EvaluatorInfoSection, CertificationSection, Question, Value, DataElementDescription, FormData, Metadata, SubjectInformation, EvaluatorInformation, Certification, Observation, DataElementValue. Relation labels: generates, hasSection, hasQuestion, hasComponent, hasPossibleValue, isAbout, hasFocus, hasValue.]

Fig. 3: Excerpt of the datamodel ontology classes and relations.

'''Form''' The Form ontology contains the set of individuals that are necessary to produce forms. While the technology we have developed is completely generic, we use as exemplars the U.S. Department of Veterans Affairs (VA) DBQs, which we modeled in an ontology named DBQ. This ontology contains instances of Question, Section, Form and other elements defined in the datamodel ontology (shown in Figure 3). Not only does this ontology rely on datamodel (for form structuring purposes), it also relies on functional assessment classes and individuals given in the CFA ontology, for example, values of a scale of severity of pain that should be presented as answer options to users reporting on the severity of constant pain in the lower extremity.

'''Criteria''' The criteria ontology contains SWRL rules to enrich the domain representation (e.g., if a Question instance has an isAbout relation with some instance i, then the Observation data instance that represents the answer to that question will get a hasFocus property filler i), as well as defined classes used to better support querying, which we describe in more detail in Section 6.

===5 OWL-BASED DATA ACQUISITION===
Our approach to data acquisition in OWL requires two components: firstly, an OWL representation (in the form of one or more ontologies) of the form structures (questions, sections, etc), and descriptions of those structures' meanings, and, secondly, the view component that is given by an XML file specifying user-interface aspects. So, in order to use our method, a user will have to model questions and their descriptions in OWL, and then specify the layout and content of the resulting form in XML.

We implemented our form generation and data acquisition tool in Java, using the OWL API v4.0.1,<sup>8</sup> and its source code is publicly available on GitHub.<sup>9</sup> The tool implementation and configuration details are omitted here due to lack of space, but can be found in the GitHub project wiki. The tool takes as input a user-defined XML configuration file, generates a form, and outputs form answers in CSV, RDF and OWL formats. The configuration file should contain a pointer to the ontology specifying the form, as well as its imports. The two major stages in the service are form generation and form input handling, as described below.

* (1) Form generation: steps to produce a form
** (a) Process XML configuration, gathering form layout information, IRIs and bindings to ontology entities
** (b) Extract from the input ontology all relevant information pertaining to each form element:
*** (b.1) Text to be displayed (e.g., section header, question text)
*** (b.2) Options and their text, where applicable
*** (b.3) The focus of each question
** (c) Generate the appropriate HTML and JavaScript code
* (2) Form input handling: once the form is filled in and submitted
** (a) Process answer data and create appropriate individuals
** (b) Produce a partonomy of the individuals created in (2.a) that mirrors the layout structure given in the configuration
** (c) Return the (structured) answers to the user in a chosen format

The user-defined XML configuration (1.a) specifies: input and output information of the tool, bindings to ontology entities, and layout of form elements. The key XML elements are:

* input: contains an ontology child element, and optionally a child element named imports
** ontology: absolute path or URL to the form specification ontology (e.g., DBQ ontology)
** imports: contains ontology child elements, which have an attribute iri, giving the IRI of the imported ontology
* output: contains the following child elements
** file: defines, via a title attribute, the title of the form. Optionally, a path can be specified within the file element where the HTML form file should be serialized
** cssStyle: the CSS style class to be used in the output HTML
* bindings: defines mappings to ontology entities, such as what data property is used to state the text of a question, or section headings
* form: defines the layout and behaviors of the form

There is a wide range of versatility when configuring forms, such as: multiple levels of sub-questions, form element numbering, question type (e.g., radio, checkbox, dropdown, horizontal checkbox, etc), question-list layout (vertical or inline) and recurrence; one can specify that a collection of questions should be repeated any given number of times. Some more complex options include overriding the default (alphabetic) order of answer options, and triggering sub-questions when a specific answer is selected. These two features are exemplified in Figure 4: this question is configured with an attribute/value pair: showSubquestionsForAnswer="cfa:Yes" on the question XML element, so that answering 'Yes' triggers the sub-questions of that question. In Figure 4, under 'Right lower extremity', we have a question with a list of answer options derived from an enumerated value set, which would ordinarily be ordered alphabetically. However, 'None' would then appear between 'Moderate' and 'Severe', thus interrupting a severity scale. So we added: optionOrder="3;*" to the question element, which states that the would-be third option (alphabetically) should appear first, and the remaining (the "*" wild character stands for "all unmentioned options") should be presented in default order.

Fig. 4: The user interface of the form generated for the DBQ question corresponding to radiculopathy pain modeled in Figure 2.

The key output of the data acquisition tool is the OWL ontology, as it provides us with "semantically enriched" form data that can be used for aggregation and querying. The resulting data individuals are structured in OWL (via hasComponent relations) similarly to how the form is structured in the configuration, that is, if question Q is configured as having two sub-questions, then the Observation individual generated by Q will have two outgoing hasComponent relations to the instances of Observation generated by the two sub-questions of Q.

===6 DATA ANALYSIS===
One of the authors (Michael J. Tierney), who is a physician from the VA Palo Alto Healthcare System, validated the generated OWL-based versions of the DBQ forms, and filled in the "Back (Thoracolumbar Spine) Conditions" DBQ with 5 complete sets of sample data. The data gathered are stored in a graph database with support for SPARQL 1.1 querying and OWL 2 reasoning.

Since our data are both structured and semantically enriched, we are able to query the observations using SPARQL, classify them into criteria representing powerful OWL expressions, or manipulate them using SWRL. For example, Code Snippet 1 presents a simple SPARQL query that returns all instances of Observation where a patient presented signs or symptoms due to radiculopathy. It is worth observing that this query is formulated in such a way that it is independent of the assessment instrument, including the particular formulation of the question, but rather uses the appropriate focus individual from our CFA ontology.

Code Snippet 1: SPARQL query for retrieving all observations of radicular pain due to radiculopathy.

<pre>
SELECT ?obs WHERE {
  ?obs a datamodel:Observation .
  ?obs datamodel:isDerivedFrom ?q .
  ?q a datamodel:Question .
  ?q cfa:isAbout
     cfa:signs_or_symptoms_due_to_radiculopathy .
  ?obs cfa:hasValue cfa:Yes }
</pre>

In order to query for all observations of severe pain anywhere in the lower extremity, one could formulate an OWL DL query such as that given in Code Snippet 2.

Code Snippet 2: OWL DL query for retrieving all observations of severe pain anywhere in the lower extremity.

<pre>
datamodel:Observation and
  cfa:hasValue value cfa:severe and
  cfa:hasFocus some (cfa:Assessment and
    (cfa:hasAssessedFunction some
      (cfa:isExactMatchOf some 'icf:b2801. Pain in body part')) and
    (cfa:hasAnatomicalLocation some
      'icf:s750. Structure of lower extremity'))
</pre>

In response to the query in Code Snippet 2, a DL reasoner uses the semantic descriptions of the observation foci, which are derived from the questions' isAbout property, to aggregate answers for severe pain for different parts of the lower extremity.

===7 DISCUSSION===
In this paper we presented a framework for OWL-based form generation and data acquisition that gathers form answers as tab-delimited data, RDF triples, or OWL instances, which can be subsequently analyzed in a systematic way (as shown in our queries in Section 6). Once the raw data is processed (by deriving the foci of observations from the isAbout field of the questions), the resulting data have no dependency on specific questions (except for provenance tracking), so if the form specification is modified, then previous form data are still comprehensible and sound (i.e., upon form specification changes the new data and old data remain compatible). However, if a user requires data to be structured in a different or more specialized format than ours, then either the software needs modifying, or a post-processing step would be necessary. The value of data in such a structured format in any arbitrary domain is twofold: automating, or improving the automation of the process of arriving at desirable conclusions from questions in the form, and for further analysis, for instance, via querying. In the clinical functional assessment domain, our modeling of forms and questions is consistent with the format of assessment instruments defined in LOINC. However, the types of queries we formulated for functional assessment data are unfeasible using LOINC, since LOINC provides no semantics behind what an answer to a specific question means.

We presented our modeling of functional assessments and assessment instruments, and demonstrated (1) how to generate forms and acquire data based on these OWL ontologies and data models, and (2) how to make use of the data using queries on individual subjects and queries that aggregate population data. The modeling contributions include (1) CFA: a clinical functional assessment domain ontology that allows defining questions being asked in an assessment instrument in terms of a rich ontology that integrates standard terminologies such as ICF and SNOMED CT, and which provides the means for making detailed or aggregate queries on acquired data, and (2) datamodel: an information model that allows the specification of generic assessment forms and the format of structured data acquired through the instruments.

We have designed our output model to support the acquisition of structured data through Web forms, and for the potential to integrate the data inside EHRs. It is straightforward to transform the data we capture as instances of Observation, Certification, EvaluatorInformation, and SubjectInformation into, for example, Health Level Seven (HL7) Reference Information Model (RIM) standard compliant data [5]. Finally, we have shown that the problem of structured data acquisition can be suitably tackled using OWL; our solution, though applied to the clinical functional assessment domain for the context of this paper, is entirely generic, and can easily be applied to an arbitrary domain.

===FOOTNOTES===
* <sup>1</sup> The typical kinds of schema accepted are OWL and RDFS.
* <sup>2</sup> http://www.w3.org/TR/rdf-sparql-query
* <sup>3</sup> https://code.google.com/p/information-artifact-ontology
* <sup>4</sup> http://www.who.int/classifications/icf/en
* <sup>5</sup> http://www.ihtsdo.org/snomed-ct
* <sup>6</sup> http://www.w3.org/Submission/SWRL
* <sup>7</sup> http://www.who.int/classifications/icd/en
* <sup>8</sup> http://owlapi.sourceforge.net
* <sup>9</sup> http://github.com/protegeproject/facsimile

===ACKNOWLEDGMENTS===
This work is supported in part by contract W81XWH-13-2-0010 from the U.S. Department of Defense, and grants GM086587 and GM103316 from the U.S. National Institutes of Health (NIH).

===REFERENCES===
* [1] Buyl, R. and Nyssen, M. (2009). Structured electronic physiotherapy records. Int. J. of Med. Inf., 78(7), 473–481.
* [2] Eriksson, H., Puerta, A. R., and Musen, M. A. (1994). Generation of knowledge-acquisition tools from domain ontologies. Int. J. of Human-Computer Studies, 41, 425–453.
* [3] Gennari, J. H., Musen, M. A., Fergerson, R. W., Grosso, W. E., Crubézy, M., et al. (2003). The evolution of Protégé: an environment for knowledge-based systems development. Int. J. of Human-Computer Studies, 58(1), 89–123.
* [4] Gonçalves, R. S. (2009). Semantic Wiki for Travel and Holidays using OWL. Master's thesis, The University of Manchester.
* [5] Health Level Seven (2015). HL7 Reference Information Model. www.hl7.org/implement/standards/rim.cfm.
* [6] Horridge, M., Brandt, S., Parsia, B., and Rector, A. (2014). A domain specific ontology authoring environment for a clinical documentation system. In Proc. of CBMS-14.
* [7] Kumar, A. and Smith, B. (2005). The ontology of processes and functions: A study of the international classification of functioning, disability and health. In Proc. of the AIME Workshop on Biomedical Ontology Engineering.
* [8] Noy, N. F., Shah, N. H., Whetzel, P. L., Dai, B., Dorf, M., et al. (2009). BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research, 37, 170–173.
* [9] Rector, A. (2013). Axioms & templates: Distinctions & transformations amongst ontologies, frames & information models. In Proc. of K-CAP-13.
* [10] Vreeman, D. J., McDonald, C. J., and Huff, S. M. (2010). Representing patient assessments in LOINC®. In Proc. of AMIA.
* [11] Yates, J. W., Chalmer, B., McKegney, F. P., et al. (1980). Evaluation of patients with advanced cancer using the Karnofsky performance status. Cancer, 45(8), 2220–2224.
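The optionOrder attribute described in Section 5 (for example optionOrder="3;*") can be illustrated with a short sketch. This is not the tool's Java implementation; the function name `apply_option_order` is hypothetical, the option 'Mild' is assumed for illustration (the paper names only 'None', 'Moderate' and 'Severe'), and dropping unmentioned options when no "*" is present is an assumption of this sketch.

```python
def apply_option_order(options, spec):
    """Reorder answer options according to a spec such as "3;*".

    Numbers are 1-based positions in the default (alphabetic) order;
    "*" stands for all options not mentioned explicitly, kept in
    default order. Assumption: without "*", unmentioned options are
    simply omitted.
    """
    default = sorted(options)  # default (alphabetic) order
    tokens = [t.strip() for t in spec.split(";") if t.strip()]
    # Positions that the spec names explicitly
    named = [int(t) for t in tokens if t != "*"]
    remaining = [opt for pos, opt in enumerate(default, start=1)
                 if pos not in named]
    ordered = []
    for token in tokens:
        if token == "*":
            ordered.extend(remaining)
        else:
            ordered.append(default[int(token) - 1])
    return ordered


# Alphabetically, 'None' would sit between 'Moderate' and 'Severe'.
severity_scale = apply_option_order(["Severe", "None", "Mild", "Moderate"], "3;*")
print(severity_scale)
```

With the spec "3;*" the third alphabetic option ('None') moves to the front, which restores the severity scale described for Figure 4.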
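The XML configuration elements listed in Section 5 (input, ontology, imports, output, file, cssStyle, bindings, form) might be read as in the following sketch. The root element name `configuration`, all paths and IRIs, and the empty `bindings` and `form` elements are placeholders assumed here, since the paper defers the exact schema to the project wiki; only the element and attribute names named in the text are taken from the source.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: the root name, paths and IRIs are made up.
CONFIG = """
<configuration>
  <input>
    <ontology>/path/to/dbq.owl</ontology>
    <imports>
      <ontology iri="http://example.org/datamodel"/>
      <ontology iri="http://example.org/cfa"/>
    </imports>
  </input>
  <output>
    <file title="Back (Thoracolumbar Spine) Conditions">out/form.html</file>
    <cssStyle>form-style</cssStyle>
  </output>
  <bindings/>
  <form/>
</configuration>
"""


def read_config(xml_text):
    """Collect the input and output settings described in Section 5."""
    root = ET.fromstring(xml_text)
    input_el = root.find("input")
    output_el = root.find("output")
    file_el = output_el.find("file")
    imports_el = input_el.find("imports")  # optional child of input
    imports = []
    if imports_el is not None:
        imports = [o.get("iri") for o in imports_el.findall("ontology")]
    return {
        "ontology": input_el.findtext("ontology").strip(),
        "imports": imports,
        "title": file_el.get("title"),
        # The serialization path inside <file> is optional
        "html_path": (file_el.text or "").strip() or None,
        "css_style": output_el.findtext("cssStyle"),
    }


config = read_config(CONFIG)
print(config["title"], config["imports"])
```

Reading `ontology` with `find` on the `input` element matches only its direct child, so the imported ontologies nested under `imports` are not confused with the form specification ontology.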
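The enrichment rule described for the criteria ontology (a Question with an isAbout relation to i gives the Observation answering it a hasFocus filler i) can be emulated over plain triples. This sketch is not the paper's SWRL rule, whose exact text is not given; it assumes the link from an observation to its question is the isDerivedFrom property used in Code Snippet 1, and the individual names `q1` and `obs1` are invented.

```python
def infer_focus(triples):
    """Return the input triples plus inferred hasFocus assertions.

    Emulates: isDerivedFrom(?obs, ?q) and isAbout(?q, ?i)
              implies hasFocus(?obs, ?i)
    """
    about = {}  # question -> set of isAbout fillers
    for subject, predicate, obj in triples:
        if predicate == "isAbout":
            about.setdefault(subject, set()).add(obj)
    inferred = set(triples)
    for subject, predicate, obj in triples:
        if predicate == "isDerivedFrom":
            for focus in about.get(obj, ()):
                inferred.add((subject, "hasFocus", focus))
    return inferred


# Hypothetical instance data for one answered question
data = {
    ("q1", "isAbout", "signs_or_symptoms_due_to_radiculopathy"),
    ("obs1", "isDerivedFrom", "q1"),
    ("obs1", "hasValue", "Yes"),
}
enriched = infer_focus(data)
print(sorted(enriched))
```

Once the hasFocus assertion exists on the observation, a query no longer needs to go through the question, which is the independence from specific questions that Section 7 points out.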
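Step (2.b) in Section 5, producing a partonomy of Observation individuals that mirrors the configured layout, can be sketched as a recursive walk over a nested question structure. The dictionary layout, the generated identifiers such as `obs1`, and the function name `build_partonomy` are assumptions of this sketch and are not taken from the tool.

```python
from itertools import count


def build_partonomy(question, triples, counter):
    """Create one observation per question and link the observations of
    sub-questions to their parent via hasComponent."""
    obs = f"obs{next(counter)}"
    triples.append((obs, "isDerivedFrom", question["id"]))
    for sub in question.get("subquestions", []):
        child = build_partonomy(sub, triples, counter)
        triples.append((obs, "hasComponent", child))
    return obs


# Question Q configured with two sub-questions, as in the paper's example
layout = {"id": "Q", "subquestions": [{"id": "Q_1"}, {"id": "Q_2"}]}
triples = []
root = build_partonomy(layout, triples, count(1))
components = [o for s, p, o in triples if s == root and p == "hasComponent"]
print(root, components)
```

The observation generated for Q ends up with two outgoing hasComponent relations, one per sub-question, matching the structure described at the end of Section 5.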