Developing a Modular Architecture for Creation of Rule-based Clinical Diagnostic Criteria

Na Hong1,2, Guoqian Jiang1*, Jyotishman Pathak1, Christopher G Chute3

1 Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA; 2 Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing, China; 3 Johns Hopkins University, Baltimore, MD, USA

Abstract. With recent advances in computerized patient record systems, there is an urgent need for computable and standards-based clinical diagnostic criteria. For example, constructing rule-based clinical diagnostic criteria has become one of the goals of the International Classification of Diseases (ICD)-11 revision. However, few studies have addressed building a unified architecture to support the computerization of diagnostic criteria. In this study, we present a modular architecture for creation of rule-based clinical diagnostic criteria leveraging Semantic Web technologies. The architecture consists of two major modules: an authoring module that utilizes a standards-based information model, and a translation module that utilizes the Semantic Web Rule Language (SWRL). In a prototype implementation, for the authoring module, we developed a diagnostic criteria upper ontology that integrates the ICD-11 content model with the Quality Data Model (QDM); for the translation module, we developed a transformation tool that converts QDM-based diagnostic criteria into a SWRL representation. We evaluated the domain coverage of the upper ontology model by annotating 20 randomly selected diagnostic criteria. We also tested the transformation algorithms using 6 QDM templates for ontology population and 15 QDM-based criteria for rule generation.
In summary, our efforts in developing and prototyping a modular architecture provide useful insights into building a scalable solution to support diagnostic criteria representation and computerization.

Keywords: Diagnostic Criteria, Ontology, ICD-11, QDM, SWRL

* Correspondence to: Dr. Guoqian Jiang at jiang.guoqian@mayo.edu.

1 Introduction

Diagnostic criteria are one of the most valuable sources of knowledge for supporting clinical decision-making and improving patient care [1], [2], [3], [4]. The clinical informatics research community has been seeking a solution to standardize and computerize clinical diagnostic criteria for all clinical domains. Diagnostic criteria are usually scattered over different media such as medical textbooks, literature and clinical practice guidelines, mostly in textual formats. Many studies have been conducted on integrating and formally expressing diagnostic rules from free-text clinical guidelines and diagnostic criteria in computerized decision support systems to improve clinical performance and patient outcomes [5], [6]. However, very limited research has been done on building a unified architecture to support the goal of diagnostic criteria formalization. In particular, the lack of a standards-based information model has been recognized as a major barrier to achieving computable diagnostic criteria [7]. Diagnostic criteria are usually described with differing narrative styles, granularity, term usage and inner logic. There is a need to develop a clear information model specification and a standard architecture to support the modeling and representation of diagnostic criteria, thereby enabling computerization.
To achieve a unified architecture, the following aspects should be considered: a) an information model that supports standard representation of diagnostic criteria; b) the semantic interoperability and expressivity of a knowledge representation language; c) the rule-based reasoning capability over the fact knowledge; and d) a standard exchange format for different layers of the architecture.

Current efforts in the development of internationally recommended standard models in clinical domains have laid the foundation for modeling and representing computable diagnostic criteria. Notable examples include the International Classification of Diseases (ICD)-11 content model [8], [9] and the National Quality Forum (NQF) Quality Data Model (QDM)1. The content model of ICD-11 is a structured framework that defines “a classification unit” in ICD in a standard way, in terms of components that allow computerization. Under the content model, each ICD entity can be seen from different dimensions, and 13 main dimensions are currently defined. One purpose of the ICD-11 content model is to use different settings of these dimensions or parameters to construct different sets of diagnostic criteria, so that different elements in the content model come together to define the diagnostic criteria of a particular ICD category. As the ICD-11 content model depicts the big picture of diagnostic criteria computerization and has achieved consensus among the ICD Revision Group, we consider it a viable framework on which to build our Diagnostic Criteria Upper Ontology (DCUO).

The QDM is an information model that describes clinical concepts in a standardized format to enable electronic quality performance measurement in support of operationalizing the Meaningful Use Program of the Health Information Technology for Economic and Clinical Health Act.
It allows quality measure developers, clinical researchers and performers to describe clearly and unambiguously the data required to calculate a performance measure. To this end, QDM allows electronic health record (EHR) systems and other clinical electronic systems to share a common understanding and interpretation of the clinical data. To describe different parts of the clinical care process, QDM defines many datatypes that specify the context in which each category is used. It has been shown that extensions of QDM can support a number of related areas. As a standard format, the Health Quality Measure Format (HQMF) [10] formally defines a quality measure (data elements, logic, definitions, etc.) to support consistent and unambiguous interpretation. HQMF has been accepted as a format to define eMeasures in the HL7 standard.

1 http://www.healthit.gov/quality-data-model

While formalizing the inner logic of diagnostic criteria is complex, Semantic Web technologies provide a homogeneous framework that enables ontology-based modeling with the Web Ontology Language (OWL)2 and supports rule-based reasoning with the Semantic Web Rule Language (SWRL) [11]. In a Semantic Web environment, OWL is a W3C recommendation for ontology description and modeling, and SWRL is a rule language for formalizing and representing rules to support knowledge reasoning. In the present study, we evaluate OWL- and SWRL-based representations for formalizing diagnostic criteria.

The objective of the present study is to describe our efforts in developing a modular architecture for creation of rule-based clinical diagnostic criteria leveraging Semantic Web technologies. We prototyped and evaluated a number of key components of the architecture, including an upper ontology and a transformation tool.
We selected a collection of QDM datatypes that are commonly used in describing diagnostic criteria and integrated them into the ICD-11 content model to build a schema for a diagnostic criteria upper ontology. We performed data translation and interaction following the HQMF standard format and propose extensions where needed.

2 Materials & Methods

2.1 Materials

WHO ICD-11 content model: WHO developed a content model to present the knowledge that underlies the definition of an ICD entity [8]. The content model is composed of three layers: a foundation layer, a linearization layer, and an ontological layer. The foundation layer is the core product of the ICD-11 revision and stores the full range of knowledge of all classification units in ICD. Each ICD entity can be seen from different dimensions, and the content model represents each of these dimensions as a parameter. Currently, 13 main parameters are defined in the content model to describe a category in ICD-11, for example, Manifestation Properties, Causal Properties, and Treatment Properties. “Diagnostic Criteria” is one of the main parameters for describing an ICD category.

NQF Quality Data Model (QDM): QDM consists of two modules: a data-model module and a logic module. The data-model module includes the notions of category (e.g., Medication), datatype (e.g., Medication, Administered), attribute (e.g., information about dosage, route, strength, and duration of a medication), and value set, comprising concept codes from one or more terminologies. The logic module includes logic operators, functions, comparison operators, temporal operators, and subset operators. As mentioned above, HQMF provides a standard format to render QDM-based criteria (i.e., instance data) in XML using a collection of templates [10].
In a previous study [12], we evaluated the feasibility of using QDM for representing diagnostic criteria through a data-driven approach and suggested that the common patterns informed by QDM are useful and feasible in building a standards-based information model for computable diagnostic criteria. In this study, we referenced the common patterns and selected a collection of QDM datatypes and attributes for developing an upper ontology.

2 http://www.w3.org/TR/owlfeatures/

2.2 Methods

The overall system architecture for creation of rule-based clinical diagnostic criteria is shown in Figure 1.

Fig. 1. Overall System Architecture for Creation of Rule-based Clinical Diagnosis Criteria

The system architecture contains two major modules: an authoring module that utilizes a standards-based information model, and a translation module that utilizes SWRL. The first module contains an upper ontology that supports the organization of diagnostic criteria. We manually integrated a collection of selected ICD-11 content model elements and QDM elements, informed by the analysis of real-world diagnostic criteria. The first module also contains a unified web user interface that supports collecting and authoring diagnostic criteria from clinicians or experts. All collected data elements, value sets and logic expressions of diagnostic criteria are formalized using QDM-based HQMF templates. The standard QDM model serves as the foundation layer for all subsequent automatic parsing and reasoning. The second module contains a rule translation engine that converts diagnostic criteria represented in QDM-based HQMF templates into a domain-specific diagnostic criteria ontology and a set of SWRL rules. The rule translation engine supports further diagnostic inference on patient data. In the following subsections, we focus on describing in detail the core parts that we prototyped and developed.
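The hand-off between the two modules can be pictured with a minimal Python sketch. It is not the prototype's implementation, which is Java-based and exchanges HQMF XML; the names `AuthoredCriterion`, `TranslationResult`, `authoring_module` and `translation_module`, and the dictionary-based input, are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical record produced by the authoring module for one criterion.
@dataclass(frozen=True)
class AuthoredCriterion:
    disease: str        # e.g. "AMI"
    qdm_datatype: str   # e.g. "Laboratory Test, Result"
    value_set: str      # e.g. "LDL-c"
    logic: str          # e.g. "result < 100 mg/dL"


# Hypothetical output of the translation module.
@dataclass
class TranslationResult:
    ontology_classes: List[str] = field(default_factory=list)
    rule_inputs: List[str] = field(default_factory=list)


REQUIRED_FIELDS = ("disease", "qdm_datatype", "value_set", "logic")


def authoring_module(entries: List[Dict[str, str]]) -> List[AuthoredCriterion]:
    """Stand-in for the authoring interface: check and wrap raw entries."""
    criteria = []
    for entry in entries:
        missing = [name for name in REQUIRED_FIELDS if not entry.get(name)]
        if missing:
            raise ValueError(f"criterion is missing fields: {missing}")
        criteria.append(AuthoredCriterion(*(entry[name] for name in REQUIRED_FIELDS)))
    return criteria


def translation_module(criteria: List[AuthoredCriterion]) -> TranslationResult:
    """Stand-in for the translation engine: collect classes and rule inputs."""
    result = TranslationResult()
    for criterion in criteria:
        # Each datatype and value set becomes a candidate ontology class (once).
        for name in (criterion.qdm_datatype, criterion.value_set):
            if name not in result.ontology_classes:
                result.ontology_classes.append(name)
        # The logic expression is kept as raw input for later rule generation.
        result.rule_inputs.append(f"{criterion.value_set}: {criterion.logic}")
    return result
```

The point of the sketch is the separation of concerns: the authoring side only validates and structures what an expert enters, while the translation side consumes those structured records without knowing how they were collected.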
2.2.1 Developing a standards-based diagnostic criteria upper ontology

The purpose of this work is to integrate existing standard information models relevant to the modeling of diagnostic criteria through expert review and manual editing. As mentioned above, we chose the ICD-11 content model and the NQF QDM as reference standards. Our work in this stage was to create a diagnostic criteria upper ontology (DCUO) by integrating the ICD-11 content model with those QDM elements commonly used in diagnostic criteria. The selection of these QDM elements was informed by the results of a previous study [12]. We selected 10 QDM datatypes and 4 QDM attributes and integrated them with the ICD-11 content model-based ontology schema. Table 1 lists the QDM datatypes and attributes used for the integration. We used the Protégé ontology editing environment to manually integrate these two standard information models into a diagnostic criteria upper ontology.

Table 1. A list of selected QDM datatypes and attributes for developing the upper ontology

| QDM Datatypes | QDM Attributes |
|---|---|
| Laboratory Test, Result | Result |
| Diagnostic Study, Performed | Method |
| Diagnosis, Active | Reason |
| Physical Exam, Performed | Severity |
| Symptom, Active | |
| Medication, Active | |
| Patient Characteristic Birth Date | |
| Patient Characteristic Race | |
| Patient Characteristic Sex | |
| Procedure, Recommended | |

2.2.2 Transforming QDM templates into domain-specific diagnostic criteria ontology

To build a scalable diagnostic rule translation environment, it is important to dynamically populate a Diagnostic Criteria Domain Ontology (DCDO) for a specific disease or condition, e.g., ‘DCDO for AMI (Acute Myocardial Infarction)’. We developed a parsing interface that supports data extraction from diagnostic criteria encapsulated in HQMF templates.
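As a rough illustration of this kind of template-driven extraction, the following Python sketch parses a simplified XML fragment and groups the extracted values into a class, a subclass and their annotation values, in the spirit of the mappings listed in Table 2. Several things here are assumptions: the prototype itself is Java-based, the XML fragment is a hypothetical, heavily simplified stand-in for the real HQMF R1 template (which is namespaced and much richer), and the function names and the `OID-PLACEHOLDER` value are invented for the example.

```python
import xml.etree.ElementTree as ET
from string import Template

# Hypothetical, simplified stand-in for the "Laboratory Test, Result" template.
# The $-placeholders mirror the ones listed for the template in Figure 2.
TEMPLATE_XML = Template("""
<observation>
  <templateId root="2.16.840.1.113883.3.560.1.12"/>
  <code code="30954-2" displayName="Results" codeSystem="2.16.840.1.113883.6.1"/>
  <title>$datatypeName: $valueSetName</title>
  <value code="$valueSetOID" codeSystem="2.16.840.1.113883.3.560.101.1"
         displayName="$displayName"/>
</observation>
""")


def parse_template_instance(xml_text: str) -> dict:
    """Step 1 (template-based XML parsing): pull out the template elements."""
    root = ET.fromstring(xml_text.strip())
    code = root.find("code")
    value = root.find("value")
    # The title is assumed to have the form "<datatype name>: <value set name>".
    datatype_name, value_set_name = (
        part.strip() for part in root.findtext("title").split(":", 1)
    )
    return {
        "template_id": root.find("templateId").get("root"),
        "code": code.get("code"),
        "code_display": code.get("displayName"),
        "code_system": code.get("codeSystem"),
        "value_set_oid": value.get("code"),
        "value_code_system": value.get("codeSystem"),
        "value_set_display": value.get("displayName"),
        "datatype_name": datatype_name,
        "value_set_name": value_set_name,
    }


def map_to_ontology(parsed: dict) -> dict:
    """Step 2 (semantic mapping): datatype -> class, value set -> subclass."""
    return {
        "class": parsed["datatype_name"],
        "class_annotations": [
            parsed["code"], parsed["code_display"], parsed["code_system"],
        ],
        "subclass": parsed["value_set_name"],
        "subclass_annotations": [
            parsed["value_set_oid"],
            parsed["value_code_system"],
            parsed["value_set_display"],
        ],
    }
```

A call such as `map_to_ontology(parse_template_instance(TEMPLATE_XML.substitute(datatypeName="Laboratory Test, Result", valueSetName="LDL-c", valueSetOID="OID-PLACEHOLDER", displayName="LDL-c")))` would yield a class named after the datatype with the value set as its subclass. Writing the result into an actual OWL ontology is deliberately left out of the sketch.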
To parse all HQMF instance data in a specific template, we developed a collection of Java-based XML parsing and mapping algorithms that automatically extract instance data from HQMF templates and convert them into the corresponding DCDO elements. The parsing algorithms decompose the HQMF XML data into different parts and populate the parsed elements into the same DCDO. The ontology population process consists of two steps: template-based XML parsing and semantic mapping.

An HQMF template example and its parsing results are shown in Figure 2. The left-hand part is the template representation of the QDM datatype “Laboratory Test, Result” (hqmf r1 template - 2.16.840.1.113883.3.560.1.12) [10] and the right-hand part shows the elements extracted from the XML template.

Elements of template - 2.16.840.1.113883.3.560.1.12: “30954-2”, “Results”, “2.16.840.1.113883.6.1”, “$valueSetOID”, “2.16.840.1.113883.3.560.101.1”, “$displayName”, “$datatypeName”, “$valueSetName”

Fig. 2. XML parsing of the HQMF template “Laboratory Test, Result” (hqmf r1 template - 2.16.840.1.113883.3.560.1.12)

We then created semantic mappings between the XML elements and the elements of the DCDO. For example, the semantic mappings of the template - 2.16.840.1.113883.3.560.1.12 are shown in Table 2.

Table 2.
Semantic mappings between HQMF template elements and ontology elements

| Elements of template - 2.16.840.1.113883.3.560.1.12 | Elements of Ontology |
|---|---|
| “30954-2” | Annotation property of “Laboratory Test, Result” |
| “Results” | Annotation property of “Laboratory Test, Result” |
| “2.16.840.1.113883.6.1” | Annotation property of “Laboratory Test, Result” |
| “$valueSetOID” | Annotation property of “$valueSetName” |
| “2.16.840.1.113883.3.560.101.1” | Annotation property of “$valueSetName” |
| “$displayName” | Annotation property of “$valueSetName” |
| “$datatypeName” | Class: Laboratory Test, Result |
| “$valueSetName” | Class: Subclass of “$datatypeName” |

2.2.3 Automatic rule composition and validation

With a populated DCDO in place, we developed Java-based algorithms using the Protégé OWL API and SWRL API for automatic rule composition and rule validation, which are responsible for rule assembly and rule grammar checking, respectively. A SWRL rule contains two parts: the Body and the Head. The Body is also called the antecedent, and the Head is the consequent of the rule. Six atom types can be used as components of the Body and Head: class atoms, individual property atoms, same/different atoms, data-valued property atoms, built-in atoms and data range atoms. Adhering to the SWRL structure and grammar, we designed a collection of translation algorithms that automatically extract SWRL rule elements from the logic components of an HQMF XML template and then assemble these rule elements into SWRL syntax. For example, Figure 3 shows the HQMF XML representation of the QDM-based criterion “Laboratory Test, Result: LDL-c (result < 100 mg/dL)”. The criterion is composed of two templates: the HQMF template “Laboratory Test, Result” (hqmf r1 template - 2.16.840.1.113883.3.560.1.12) and the HQMF template “result comparison” (hqmf r1 comparison template - 2.16.840.1.113883.3.560.1.1019.3).

Fig. 3.
The HQMF XML representation of the QDM-based criterion “Laboratory Test, Result: LDL-c (result < 100 mg/dL)”.

Our translation algorithms then automatically extract SWRL rule elements from the logic components of the two HQMF XML templates and assemble them into the following SWRL syntax.

Rule: Patient(?x), LDL-c(?y), has_result(?x, ?y), has_value(?y, ?z), has_unit(?y, mg/dL), lessThan(?z, 100) -> has_evidence(?x, ev1)

2.2.4 Evaluation of prototyped components

First, we evaluated the domain coverage of the ICD-11 content model in terms of representing diagnostic criteria. We collected 20 diagnostic criteria from different clinical topics and manually annotated them with the elements of the ICD-11 content model.

Second, we evaluated the translation algorithms for ontology population and rule generation. We first tested the ontology population algorithms using 6 HQMF templates. The first author assessed whether they were correctly parsed and represented in the domain ontology, and the assessment results were verified by the other three co-authors. The 6 HQMF templates are as follows.

1. “Laboratory Test, Result” (hqmf r1 template - 2.16.840.1.113883.3.560.1.12)
2. “Patient Characteristic Sex” (hqmf r1 template - 2.16.840.1.113883.3.560.1.402)
3. “Patient Characteristic Birth Date” (hqmf r1 template - 2.16.840.1.113883.3.560.1.400)
4. “result/is present” (hqmf r1 template - 2.16.840.1.113883.3.560.1.1019.1)
5. “result/valueset” (hqmf r1 template - 2.16.840.1.113883.3.560.1.1019.2)
6. “result/comparison” (hqmf r1 template - 2.16.840.1.113883.3.560.1.1019.3)

We then tested the rule generation algorithms using 15 QDM-based criteria represented in HQMF XML format. All 15 criteria were selected from existing eMeasures and use the HQMF template “Laboratory Test, Result” (hqmf r1 template - 2.16.840.1.113883.3.560.1.12). We used the Protégé SWRL API to validate the syntactic correctness of the generated SWRL rules.
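The rule assembly described in Section 2.2.3 can be sketched in a few lines of Python. This is an illustration only: the prototype uses Java with the Protégé OWL and SWRL APIs, the helper names `compose_rule` and `has_bracket_characters` are hypothetical, and the bracket check is a crude stand-in for a real syntax validator, included only because square brackets in a unit token are what Section 3.2 reports as the cause of the one validation failure.

```python
import re

# Comparison symbols mapped to SWRL built-in names (written without the
# "swrlb:" prefix, following the example rule in the text).
COMPARISON_BUILTINS = {
    "<": "lessThan",
    "<=": "lessThanOrEqual",
    ">": "greaterThan",
    ">=": "greaterThanOrEqual",
    "=": "equal",
}


def compose_rule(test_name, operator, threshold, unit, evidence="ev1"):
    """Assemble body atoms and a head atom into one rule string."""
    body_atoms = [
        "Patient(?x)",                      # class atom for the patient
        f"{test_name}(?y)",                 # class atom for the lab test
        "has_result(?x, ?y)",               # patient-to-result property atom
        "has_value(?y, ?z)",                # result-to-value property atom
        f"has_unit(?y, {unit})",            # unit of the result
        f"{COMPARISON_BUILTINS[operator]}(?z, {threshold})",  # built-in atom
    ]
    head_atom = f"has_evidence(?x, {evidence})"
    return ", ".join(body_atoms) + " -> " + head_atom


def has_bracket_characters(token):
    """Hypothetical pre-check: flag tokens that contain '[' or ']'."""
    return re.search(r"[\[\]]", token) is not None
```

Calling `compose_rule("LDL-c", "<", 100, "mg/dL")` produces a rule with the same atoms as the LDL-c example, and `has_bracket_characters("[copies]/mL")` flags the problematic unit before a rule is ever assembled. Keeping composition and checking as separate steps mirrors the split between rule assembly and grammar checking described in the text.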
The first author assessed the semantic correctness of the generated SWRL rules by comparing the HQMF XML-based logic with the SWRL rule logic, and the assessment results were verified by the other three co-authors.

3 Results

3.1 Upper ontology (DCUO) development and evaluation

Figure 4 shows a screenshot of the upper ontology in the Protégé ontology editing environment. There are in total 14 root classes and 21 subclasses in the ontology. Of these, 22 classes come from the ICD-11 content model with the namespace prefix ‘ICD’, 10 classes are integrated from QDM datatypes with the namespace prefix ‘QDM’, and 3 classes with the namespace prefix ‘DCUO’ were created for the need of representing diagnostic criteria.

We also evaluated the domain coverage of the ICD-11 content model. Table 3 shows the distribution of element annotations based on the ICD-11 content model. The results show that Investigation Findings and Signs and Symptoms are the two most commonly used element types in diagnostic criteria descriptions. The results are consistent with the analysis of QDM elements in our previous study [12].

Table 3. Distribution of element annotations based on ICD-11 Content Model

| ICD-11 Content Model | Count | Examples |
|---|---|---|
| Investigation Findings | 74 | Serum triglycerides |
| Signs and Symptoms | 69 | Fatigue, Headache |
| Title | 20 | Metabolic Syndrome |
| Causal Properties | 18 | Pericardial effusion |
| Classification | 12 | T71 |
| Severity of Subtype | 10 | Mild, Moderate, Severe |
| Body System/Structure | 8 | Nervous system |
| Specific Condition | 3 | Female, Pregnancy |
| Temporal Properties | 2 | Age 55, sudden |

Fig. 4. The Diagnostic Criteria Upper Ontology

3.2 Translation algorithms evaluation

All 6 HQMF templates were successfully parsed and populated into their corresponding DCDOs. Human review confirmed that the elements in the templates are correctly represented in the target ontology.

For the rule generation algorithm evaluation, a total of 15 SWRL rules were generated.
Table 5 lists the 15 QDM/HQMF-based criteria and indicates whether each generated rule passed validation. Of these, 14 rules (93.3%) passed rule validation using the Protégé SWRL validation tool, whereas one rule (6.7%) failed. Human review found that the failure was caused by an invalid expression ‘[copies]/mL’ that contains the special characters ‘[’ and ‘]’. Human review also confirmed the semantic correctness of all 15 generated rules.

Table 5. A list of 15 QDM/HQMF-based criteria and the validation results

| QDM/HQMF-based criteria using HQMF template “Laboratory Test, Result” (hqmf r1 template - 2.16.840.1.113883.3.560.1.12) | Passed rule syntax validation? |
|---|---|
| Laboratory Test, Result: INR (result >= 2) | Yes |
| Laboratory Test, Result: Hospital Measures-Neutrophil count (result < 500 per mm3) | Yes |
| Laboratory Test, Result: High Density Lipoprotein (HDL) (result < 40 mg/dL) | Yes |
| Laboratory Test, Result: Hepatitis A Antigen Test (result: 'Seropositive') | Yes |
| Laboratory Test, Result: Hepatitis B Antigen Test (result: 'Seropositive') | Yes |
| Laboratory Test, Result: HIV Viral Load (result < 200 copies/mL) | No |
| Occurrence A of Laboratory Test, Result: High Density Lipoprotein (HDL) (result < 60 mg/dL) | Yes |
| Occurrence A of Laboratory Test, Result: LDL Code (result < 100 mg/dL) | Yes |
| Occurrence A of Laboratory Test, Result: LDL-C Laboratory Test (result < 100 mg/dL) | Yes |
| Laboratory Test, Result: Macroalbumin Test (result: 'Positive Finding') | Yes |
| Laboratory Test, Result: Mumps Antigen Test (result: 'Seropositive') | Yes |
| Laboratory Test, Result: Prostate Specific Antigen Test (result <= 10 ng/mL) | Yes |
| Laboratory Test, Result: Measles Antigen Test (result: 'Seropositive') | Yes |
| Laboratory Test, Result: Rubella Antigen Test (result: 'Seropositive') | Yes |
| Laboratory Test, Result: High Density Lipoprotein (HDL) (result < 40 mg/dL) | Yes |

4 Discussion

In this study, we developed a modular architecture, with a prototype implementation and
evaluation, to support the authoring and formalization of diagnostic criteria knowledge leveraging the Semantic Web technologies OWL and SWRL. The diagnostic criteria upper ontology and domain ontology are both represented in OWL, which is built on the formalisms of description logic (DL). The rules extracted from QDM/HQMF-based criteria are formalized and represented in SWRL, which leverages the full reasoning power of OWL DL when invoking a rule engine. There are two main contributions in this study. First, the design rationale of the architecture is to enable extensive support for the representation and computation of diverse diagnostic criteria. Second, the architecture supports reuse of existing standards from the perspectives of information model, terminology services and technical interface.

There are a number of limitations in this study, since this pilot study focused mainly on the feasibility of our proposed architecture. First, the DCUO (Diagnostic Criteria Upper Ontology) was reviewed for consensus and quality assurance only by a relatively small group (i.e., the four authors). In the future, a rigorous ontology evaluation by a panel of experts from relevant domains will be useful in achieving consensus in terms of the vocabulary, syntax, structure, semantics, representation and context of the DCUO. We plan to use ontology evaluation methods as described by Vrandečić [13]. Second, we have not considered all complex conditions and details in the modeling of diagnostic criteria. For instance, the following problems need to be further considered.

• In the QDM model, the semantics of some templates are not expressed explicitly. For example, the QDM element ‘Patient Characteristic Birth Date’ is used to represent the numeric value comparison of the variable “Patient Age” (e.g. ), assuming the value of the variable “Patient Age” could be derived from the ‘Patient Characteristic Birth Date’.
• In this preliminary study, we implemented the translation algorithms on only a limited number (n=6) of HQMF templates, and the preliminary evaluation demonstrated that the translation performed reasonably well. However, there are 186 HQMF templates in total from diverse domains, and the HQMF templates are updated continuously, so maintaining the portability and reusability of the translation algorithms will be a challenge.
• For diagnostic criteria rule generation using SWRL, the inclusion criteria are well supported by built-in rule grammars, such as comparisons, mathematical functions, Booleans, strings and Date/Time. We understand that some exclusion criteria cannot be explicitly expressed in SWRL because negated atoms and disjunctions are not supported in SWRL.

Following the rationale of the ICD-11 content model, the full range of values for a given parameter is predefined using standard terminologies and ontologies. In this study, the QDM-based criteria used the predefined value sets in the NIH Value Set Authority Center (VSAC). The architecture will support the extension of value set definitions.

In the future, we plan to prototype a web-based application with the following functionalities: 1) DCUO display and update; 2) diagnostic criteria authoring by clinicians and domain experts, including invocation of value set services and a semi-automated workflow for criteria editing; 3) integration of rule engine functions, including DCDO enrichment, rule generation, and computerized criteria display and execution.

5 Conclusion

In this pilot study, we demonstrated the feasibility of prototyping a number of key components of our proposed architecture for diagnostic criteria knowledge modeling and reasoning. It remains a very complex field to explore, and more semantic and syntactic features dealing with the complexity of diagnostic criteria need to be further studied.
We believe that our efforts provide useful insight into developing a scalable, semantic-oriented and standards-based solution to support diagnostic criteria formalization and computerization.

Acknowledgments

This work is supported in part by funding from: caCDE-QA (1U01CA180940-01A1), PhEMA (R01 GM105688) and a Mayo-WHO Contract 200822195-1.

References

1. Yager, J., McIntyre, J.S.: DSM-5 Clinical and Public Health Committee: Challenges and Considerations. American Journal of Psychiatry 171, 142-144 (2014)
2. Haug, P.J., Ferraro, J.P., Holmen, J., Wu, X., Mynam, K., Ebert, M., Dean, N., Jones, J.: An ontology-driven, diagnostic modeling system. Journal of the American Medical Informatics Association: JAMIA 20, e102-110 (2013)
3. Donfack Guefack, V., Bertaud Gounot, V., Duvauferrier, R., Bourde, A., Morelli, J., Lasbleiz, J.: Ontology driven decision support systems for medical diagnosis - an interactive form for consultation in patients with plasma cell disease. Studies in Health Technology and Informatics 180, 108-112 (2012)
4. Bertaud-Gounot, V., Duvauferrier, R., Burgun, A.: Ontology and medical diagnosis. Informatics for Health & Social Care 37, 51-61 (2012)
5. Trivedi, M.H., Kern, J.K., Marcee, A., Grannemann, B., Kleiber, B., Bettinger, T., Altshuler, K.Z., McClelland, A.: Development and implementation of computerized clinical guidelines: Barriers and solutions. Method Inform Med 41, 435-442 (2002)
6. Lloyd, T.E., Mammen, A.L., Amato, A.A., Weiss, M.D., Needham, M., Greenberg, S.A.: Evaluation and construction of diagnostic criteria for inclusion body myositis. Neurology 83, 426-433 (2014)
7. Richesson, R.L., Krischer, J.: Data standards in clinical research: gaps, overlaps, challenges and future directions. Journal of the American Medical Informatics Association: JAMIA 14, 687-696 (2007)
8. Jiang, G., Solbrig, H.R., Chute, C.G.: Using Semantic Web technology to support ICD-11 textual definitions authoring. J. Biomedical Semantics 4, 11 (2013)
9. World Health Organization: ICD-11 Alpha Content Model Reference Guide, 11th Revision. World Health Organization, Geneva, Switzerland (2011)
10. National Quality Forum: HQMF Templates for QDM, December 2013.
11. Horrocks, I., Patel-Schneider, P.F., Boley, H., Tabet, S., Grosof, B., Dean, M.: SWRL: A semantic web rule language combining OWL and RuleML. W3C Member Submission 21, 79 (2004)
12. Jiang, G., Solbrig, H.R., Pathak, J., Chute, C.G.: Developing a Standards-based Information Model for Representing Computable Diagnostic Criteria: A Feasibility Study of the NQF Quality Data Model. MedInfo (in press) (2015)
13. Vrandečić, D.: Ontology evaluation. Springer (2009)