=Paper=
{{Paper
|id=None
|storemode=property
|title=CoverageCity: Test Coverage for Clinical Guidelines
|pdfUrl=https://ceur-ws.org/Vol-949/kese8-01_02.pdf
|volume=Vol-949
|dblpUrl=https://dblp.org/rec/conf/ecai/HatkoBP12
}}
==CoverageCity: Test Coverage for Clinical Guidelines==
Reinhard Hatko (1), Joachim Baumeister (2) and Frank Puppe (3)

(1) University of Wuerzburg, Germany, email: hatko@informatik.uni-wuerzburg.de

(2) denkbares GmbH, Germany, email: joachim.baumeister@denkbares.com

(3) University of Wuerzburg, Germany, email: puppe@informatik.uni-wuerzburg.de

===Abstract===
In this paper, we introduce various metrics for test coverage of clinical guidelines, modeled in the graphical language DiaFlux. Additionally, an intuitive visualization method supports the process of test creation and of communicating the reached coverage levels to medical experts involved in the authoring of the guideline. The goal is to reach a sufficiently high test coverage to assure patient safety under all circumstances.

===1 Introduction===
Testing is an important step in the development of any software artifact, be it a program, a knowledge-based system, or a clinical guideline. The two most prevalent testing strategies in Software Engineering are black-box and white-box testing. The former approach is unconcerned with the actual implementation and derives the test cases solely from the underlying specification. The latter, in contrast, allows tests to be created based on the implementation and the implementation to be examined during execution of the tests. This introspection makes it possible to capture which basic elements of the tested artifact were executed, and thus covered, by a test suite. Different metrics of Test Coverage were developed to objectively measure and assess the thoroughness of such testing efforts. In classic Software Engineering, metrics have been defined to assess the coverage of, e.g., methods of a program, statements of a method, taken decisions of control-flow, and so on [16]. Hence, the benefit of coverage metrics, and of their proper visualization, is two-fold, increasing the effectiveness and efficiency of the test creation process: First, they help to avoid the creation of redundant tests. Second, they can be used to identify untested elements.

Both are also important aspects in the area of knowledge-based systems in general and computerized clinical guidelines in particular. The creation of test cases for a clinical guideline will most likely involve both parties, the medical expert and the knowledge engineer. It is thus an expensive task, which should be completed efficiently. However, the overall goal is to create a guideline that assures patient safety under all circumstances, which can best be guaranteed by a thorough test suite. This is especially important in the area of automated guidelines, which are applied by closed-loop devices. They act autonomously on a patient to improve her state, without requiring constant supervision by a clinical user.

For the interpretation of coverage metrics a visualization is helpful, especially to find untested elements. Test coverage of software programs is most commonly visualized by some kind of syntax highlighting, e.g., by coloring the executed statements, although graphical representations have also been developed, e.g., [11]. We adopted a visualization method from a related area in (object-oriented) Software Engineering, namely Software Metrics. They are used to assess code quality with respect to structural properties of classes, e.g., number of methods, number of members, lines of code, and so forth. Those metrics are purely static, not involving the execution of the program itself. They can be visualized graphically as a CodeCity [22] to determine design flaws. In this paper, we introduce various metrics to determine the test coverage of clinical guidelines, modeled in the graphical language DiaFlux. Furthermore, we adapted the city metaphor by creating a CoverageCity for communicating the reached coverage levels to the involved medical experts in an accessible manner.

The rest of this paper is structured as follows: In the next section we give a short introduction to the graphical language DiaFlux for clinical guidelines. Section 3 presents coverage metrics for DiaFlux models. In Section 4, we then present the results of a case study. Finally, we conclude the paper with a summary and an outlook.

===2 The DiaFlux Guideline Language===
Clinical guidelines are an accepted means to improve patient outcome. To this end, they offer a standardized treatment, based on evidence-based medicine. They have been developed for several decades. In their beginnings, they were solely text-based documents that relied on the proper application by clinicians. The ongoing computerization and data availability, also in domains with high-frequency data such as Intensive Care Units (ICUs), allows for an automation of guideline application by medical devices.

Several formalisms for Computer-Interpretable Guidelines were developed, each with its own focus, like shareability between institutions [4] or decision support. Most of them are graphical approaches that employ a kind of Task-Network-Model to express the guideline steps [17]. However, in the area of (semi-)closed-loop devices, rule-based approaches are predominant, e.g., [12, 15].

A downside of rule-based representations is their lower comprehensibility compared to graphical ones. This matters in particular because medical experts are usually involved in the creation of guidelines. Therefore, we have developed a graphical guideline formalism called DiaFlux [7]. Its main focus lies on the direct applicability and understandability by domain specialists.

====2.1 Application Scenario====
The main application area of DiaFlux is mixed-initiative devices that continuously monitor, diagnose and treat a patient in the setting of an ICU. Such closed-loop systems interact with the clinical user during the process of care. Both the clinician and the device are able to initiate actions on the patient.
Data is continuously available as a result of the monitoring task. It allows for repeated reasoning about the patient state, and for carrying out appropriate actions to improve her condition, if necessary.

====2.2 Language Description====
To specify a clinical guideline, two different types of knowledge have to be effectively combined, namely declarative and procedural knowledge [6]. The declarative part contains the facts of a given domain, i.e., findings, diagnoses, treatments and their interrelation. The knowledge of how to perform a task, i.e., the correct sequence of actions, is expressed in the procedural knowledge. It is responsible for the decision which action to perform next, e.g., asking a question or carrying out a test, in a given situation. Each of these actions has a cost (monetary or associated risk) and a benefit (like information gain or therapeutic effect) associated with it. Therefore, the choice of an appropriate sequence of actions is mandatory for efficient diagnosis and treatment.

In DiaFlux models, the declarative knowledge is represented by a domain-specific ontology, which contains the definition of findings and solutions. This application ontology is an extension of the task ontology of diagnostic problem solving [3]. The ontology is strongly formalized and provides the necessary semantics for executing the guideline. Like most graphical guideline languages, DiaFlux employs flowcharts as the Task-Network-Model. They describe decisions, actions and constraints about their ordering in a guideline plan. These flowcharts consist of nodes and connecting edges. Nodes represent different kinds of actions. Edges connect nodes to create paths of possible actions. Edges can be guarded by conditions that evaluate the state of the current patient session, and thus guide the course of the care process. Figure 1 shows a module of an exemplary guideline for diagnosing weight problems.

In the following, we informally describe the most important language elements:

* Test node: Test nodes represent an action for the acquisition of data during runtime. This may trigger a question that the user has to answer, or data to be obtained automatically from sensors or from a database.
* Solution node: Solution nodes are used to set the rating of a solution based on the given inputs. Established solutions generate messages for the clinical user and can, e.g., advise him to conduct some action.
* Wait node: Upon reaching a wait node, the execution of the protocol is suspended until the given period of time has elapsed.
* Composed node: DiaFlux models can be hierarchically structured. Defined models can be reused as modules, represented by a composed node in the flowchart using it.
* Abstraction node: Abstraction nodes offer the possibility to create abstractions from available data. These values can then be used for therapeutic actions by influencing the settings of the host device.

====2.3 Guideline Execution====
The execution engine for DiaFlux models is intended for, but not limited to, closed-loop devices that provide data from sensors or manually entered by the clinical user and carry out therapeutic actions on the patient, i.e., changing device settings in certain ranges. The architecture of the DiaFlux guideline execution engine consists of three components. First, a knowledge base that contains the application ontology and the flowcharts. Second, a blackboard that stores all findings about the current patient session. Third, a reasoning engine that executes the guideline and carries out its steps, depending on the current state as given by the contents of the blackboard. To this end, the reasoning engine is notified about all findings that enter the blackboard. A more thorough introduction to the DiaFlux language and its execution engine is given in [7].

The execution of the guideline is time-driven. The reasoning starts by acquiring data and by interpreting this data. The results are written to the blackboard. Then, the guideline is executed. This involves making decisions and possibly the generation of hints to the user and therapeutic actions by the device. Finally, the time of the next execution is scheduled, pausing the execution until that instant in time, waiting for the effects of the therapeutic interventions to take place.

Figure 1. A module of a DiaFlux guideline for diagnosing weight problems.

===3 Test Coverage of Clinical Guidelines===
Verification and validation of a clinical guideline are important steps in its development. They are meant to assure patient safety under all circumstances. Verification usually consists of formal methods proving that a given guideline is free of internal inconsistencies and incompleteness. Normally, these kinds of checks can be performed without executing the guideline. An overview of verification methods applied to clinical guidelines is given in [10]. Anomaly detection for DiaFlux models is described in [8]. In contrast, the validation of a guideline usually involves its execution by a set of test cases (i.e., a test suite) and comparing the actual results against the expected ones [2]. The thoroughness of such empirical testing can be determined by different metrics of test coverage.

In conventional software engineering (SE), test coverage (also known as code coverage) is a well-established technique to measure how well a piece of software is exercised by a test suite. Often, the reached level of coverage is also used as an indicator for the quality of the tested program, as tested elements are less likely to contain errors than untested ones. Coverage criteria have been defined on different levels of granularity, from the method-level down to single statements, or even parts of them (so-called condition coverage) [16]. In the field of AI research, coverage measures have been proposed for rule-based systems. In this case, the results of such a coverage analysis can, e.g., be used to prune the rule base [1]. For graphical model representations, coverage measures have been proposed, e.g., for business processes [14], taking their specifics into account, for example, the coverage of fault handlers.

In general, employing coverage metrics during the creation of a test suite may help to improve it in terms of minimality and completeness. While a high coverage of the object under test is worthwhile, this should be accomplished with the smallest possible set of test cases, as test creation is a difficult and costly task. This especially holds true for the test creation of clinical guidelines, as it may involve knowledge engineers as well as domain specialists like medical experts. Therefore, adopting the mentioned techniques for clinical guidelines and their graphical representations offers the possibility to improve this process.

In the following, we introduce coverage metrics on different levels of granularity to assess the test coverage of a DiaFlux guideline. Such a guideline is usually divided into self-contained modules, i.e., flowcharts. To reflect this modularization in the coverage metrics as well, we define most of them over the elements of individual flowcharts, hence the restriction of nodes and edges to those of a single one. This focus makes it easier to increase coverage with additional tests, as deficiencies are easier to spot.

====3.1 Flowchart Coverage====
Flowchart Coverage is defined as the ratio of the number of flowcharts that are executed by a test suite to the overall number of flowcharts in the guideline.

Let <math>F</math> be the set of DiaFlux models in guideline <math>G</math>, and <math>F_T^{ex}</math> be the set of flowcharts executed by test suite <math>T</math>. Then, the Flowchart Coverage <math>FC_T</math> of <math>G</math> is given by:

<math>FC_T(G) = \frac{|F_T^{ex}|}{|F|}</math>

This metric only gives a bird's eye view of the testing situation. It can be used to guarantee at least some testing in all areas of the guideline during the starting phase of test creation. As it is a very coarse-grained measurement, an <math>FC_T</math> value of 100% should be aimed at. Otherwise, major parts of the guideline remain untested. In SE, the equivalent metric is function coverage, which reports whether every function of a program has been tested.

====3.2 Node Coverage====
Nodes represent the elementary steps of a guideline. A node being covered by a test suite means that its associated guideline step has been carried out at least once during the execution of the test suite.

Let <math>N_f, f \in F</math> be the set of nodes of the flowchart <math>f</math>, and <math>N_{f,T}^{ex}, f \in F</math> be the set of nodes of the flowchart <math>f</math> that are executed by test suite <math>T</math>. The corresponding Node Coverage metric <math>NC_T</math> of a flowchart <math>f</math> for a given test suite <math>T</math> can then be calculated as:

<math>NC_T(f) = \frac{|N_{f,T}^{ex}|}{|N_f|}, \quad f \in F</math>

Similar to Flowchart Coverage, an <math>NC_T</math> level of 100% should be reached for every flowchart in the guideline, as untested nodes can enact actions with unforeseen effects. This metric is equivalent to Statement Coverage in classic SE, which reports whether every statement of a tested function has been executed.

====3.3 Edge Coverage====
Edges are used to create the control flow of a flowchart, by defining paths of possible sequences of guideline steps. Every node can have several outgoing edges. Each of these edges can be guarded by a condition, to select the appropriate successor node, depending on the outcome of each guideline step. Therefore, the Edge Coverage metric reports whether all possible outcomes of the guideline steps, in terms of the equivalence classes that are defined by the edge guards, have been considered within the test suite.

Let <math>E_f, f \in F</math> be the set of edges of flowchart <math>f</math>, and <math>E_{f,T}^{ex}, f \in F</math> be the set of edges of the flowchart <math>f</math> that are executed by test suite <math>T</math>. Then, <math>EC_T</math> is defined as the ratio of the number of activated edges of a flowchart <math>f</math> to their overall number, when executing a given test suite <math>T</math>:

<math>EC_T(f) = \frac{|E_{f,T}^{ex}|}{|E_f|}, \quad f \in F</math>

As edges connect nodes, Edge Coverage subsumes the Node Coverage of the corresponding flowchart. This metric can be compared to Decision Coverage in SE, which keeps track of whether each decision in a program under test (e.g., if- and switch-statements) has at least once been taken and once not.

====3.4 Condition Coverage====
An edge guard may not be an atomic condition, but consist of several sub-conditions, connected by boolean operators. For such non-atomic guards, Edge Coverage gives no detailed information about which of its sub-conditions were satisfied and which were not. This is of special interest if the sub-conditions are joined by an OR-operator. In this case, every possible combination of atomic conditions that can fulfill the overall condition has to be tested.

A more detailed view of this issue is given by Condition Coverage. It checks whether every atomic condition has once been satisfied and once not. In classical SE, several different metrics for this issue have been developed, e.g., Modified Condition / Decision Coverage [5]. As those can directly be applied to the guarding conditions of edges, we will not further elaborate on this issue.

====3.5 Path Coverage====
A path through the guideline consists of consecutive nodes and edges. Such a path can be seen as the execution of decisions and actions for a given clinical scenario. In Software Engineering, it is usually not possible to reach full path coverage as soon as loops are involved, as each number of iterations results in an additional path. In clinical guidelines, there are no loops with an unlimited number of iterations, as, for instance, some time has to pass until an action can be repeated. Nevertheless, the number of paths through the complete guideline, across multiple nested DiaFlux models, most likely exceeds the possibilities of test creation. Given a proper modularization, each flowchart is responsible for a specific aspect of a guideline. Each path through such a single flowchart can be seen as one specific scenario concerning this aspect. Therefore, we assess each flowchart independently, and define a corresponding Path Coverage metric over the paths of each individual flowchart.

Let <math>P_f, f \in F</math> be the set of paths through flowchart <math>f</math>, and <math>P_{f,T}^{ex}, f \in F</math> be the set of paths through flowchart <math>f</math> that are executed by test suite <math>T</math>. Then, <math>PC_T</math> is calculated as the number of paths taken through flowchart <math>f</math> by the execution of test suite <math>T</math>, divided by the total number of paths:

<math>PC_T(f) = \frac{|P_{f,T}^{ex}|}{|P_f|}, \quad f \in F</math>

Even with a proper modularization given, not every modeled path may be enactable, due to implicit dependencies between the guideline steps. If a path contains combinations of decisions and actions that cannot occur, the targeted value of Path Coverage has to be decreased accordingly.

As a path consists of consecutive nodes and edges, Path Coverage subsumes Node Coverage as well as Edge Coverage.

Path Coverage, as defined above, is a rather aggregated measurement and thus gives little advice on how to improve coverage with further tests. Therefore, Path Coverage can also be restricted to the paths through a specific node <math>n \in N</math>:

Let <math>P_n</math> be the set of paths containing <math>n</math> (<math>\forall p \in P_n : n \in p</math>), and <math>P_{n,T}^{ex}</math> be the set of paths containing <math>n</math> exercised by test suite <math>T</math>. Accordingly, the Path Coverage of node <math>n</math> is defined as:

<math>PC_T(n) = \frac{|P_{n,T}^{ex}|}{|P_n|}</math>

A node with a low Path Coverage is only tested under a small fraction of the contexts in which it is contained. Again, further tests should be created for those scenarios, unless they expose dependencies that make some of the paths infeasible.

====3.6 Value Coverage====
The metrics presented so far assess the test coverage with respect to structural properties, each considering some kind of modeled element, like nodes and edges. However, the actual input data that directs the execution of the guideline is not assessed by these metrics in any way.

Besides the mentioned control-flow-based metrics, a second perspective on coverage in classic Software Engineering is given by data-flow-based metrics. Those measure the coverage of definition-use (du) sequences of variables, i.e., a block of instructions in which a variable is defined and subsequently used without a redefinition of the variable, e.g., [19]. Black-box testing strategies concerned with data usage are Equivalence Partitioning and Boundary Value Analysis [16]. As exhaustive testing with all valid input data is most likely not tractable even for a small program, its specification can be used to partition the input space into equivalence classes. Under the assumption that each value of a partition is treated equally by the program, an arbitrary representative of each class can be chosen for a test case. As errors are more probable at the boundaries of an equivalence class, Boundary Value Analysis is often used to derive additional test cases at these spots [20], for example to find "off-by-one" errors (e.g., resulting from the use of the operator "≤" rather than "<" in a numerical comparison).

DiaFlux models do not contain variables as they are common in procedural programming languages, and the input data is not modified during guideline execution. Therefore, a definition-use analysis is not applicable. Equivalence Partitioning and Boundary Value Analysis also cannot be used as they are. Explicit equivalence classes usually cannot be stated for inputs. Even thresholds for less determined assessments (like "low", "normal", "high") are often hard for a medical expert to specify. Those can furthermore vary between different types of patients who, e.g., share an underlying disease. However, for each numerical input, a contiguous interval of possible values can typically be given, according to the human body's physiological system and/or the preconditions of guideline applicability. To assure a proper coverage of allowed input data regardless of concepts like equivalence classes, we define the metric Input Coverage:

Let the interval <math>[min_i; max_i]</math> be the domain <math>D_i</math> for numerical input <math>i</math>, and <math>n \in \mathbb{N}, n > 0</math> be the number of equally-sized partitions of <math>D_i</math>. The function <math>cover(i, j)</math> returns 1 iff the <math>j</math>-th partition of <math>D_i</math> contains at least one input value in test suite <math>T</math>, and 0 otherwise. Then, the Input Coverage of <math>i</math> is given by:

<math>IC_{n,T}(i) = \frac{\sum_{j=1}^{n} cover(i, j)}{n}</math>

Clearly, the significance of Input Coverage depends on the actual value of <math>n</math>. It should be chosen to appropriately represent the sensitivity of the outcome of the guideline to changes in values of <math>i</math>. At later stages of test creation, the value can be increased stepwise to test at a finer granularity in terms of <math>i</math>'s input values.

Similarly, the output of the guideline (which mainly consists of numerical values of the host device's settings in predefined ranges) can be assessed by the analogously defined metric Output Coverage <math>OC_{n,T}</math>.

====3.7 Measuring Test Coverage====
Commonly, there are two strategies to gather the data needed for calculating test coverage metrics. The first one is called instrumentation, which modifies the tested piece of software by including new code that collects the necessary information. The second strategy is tracing, which traces the executed elements by using some sort of debugging API (Application Programming Interface) of the execution environment. Clearly, both approaches have an effect on the execution time of the tests, as additional data has to be gathered. An advantage of tracing is that it does not alter the executed artifact. Altering the artifact, as instrumentation does, may under certain circumstances also influence its behavior. In this respect, tracing seems preferable, if the necessary API is available.

====3.8 Visualization of Test Coverage====
The calculated metrics result in a numerical value representing the test coverage of the exercised artifact. This may very well be comprehensible for software and knowledge engineers, but for non-technical domain specialists, like medical experts, these numbers alone may not be accessible enough. Furthermore, only a proper composition of metrics yields a meaningful overall picture, as each metric represents a different aspect of coverage. Thus, an intuitive visualization as a means for communicating the reached coverage levels to domain specialists seems preferable. One approach to this need is so-called "Polymetric Views" [13]. They allow various metrics to be displayed in an aggregated view.

===4 Test Coverage for DiaFlux Guidelines===
This section describes an implementation of coverage metrics for DiaFlux guidelines and a small case study.

====4.1 Implementation====
The development environment for DiaFlux guidelines is integrated into the Semantic Wiki KnowWE. We created a plugin to calculate the test coverage of DiaFlux models when executing a test suite. It employs the tracing approach to collect the information about exercised elements of the guideline. For each execution of the guideline, the chosen path according to the input data is recorded. After finishing the test suite, the metrics are calculated and can subsequently be visualized as CoverageCity, which can freely be scaled and rotated (cf. Figure 2). For creating the visualization, we used the WebGL-based (http://webgl.org) JavaScript library SceneJS (http://scenejs.org). It is hence accessible with every (modern) web browser from within KnowWE, not requiring any proprietary software.

====4.2 The CoverageCity Visualization====
For an accessible visualization of the reached coverage levels, we use the metaphor of a city. It has been introduced as a graphical representation of static code metrics (e.g., number of methods) in the context of reverse-engineering of software systems ("CodeCity") [21]. Such a city consists of districts that represent the nesting structure of packages and buildings representing classes. Each building is located in the district corresponding to its package. The actual values of the metrics of each class determine the visual appearance of the matching building. A building can represent up to four different metrics, influencing its length, width, height, and color. Besides the package structure, districts can depict one metric by their color.

We adapted the city metaphor for the visual representation of test coverage of DiaFlux guidelines. Districts stand for individual flowcharts. They are nested as given by their call hierarchy. Buildings correspond to the nodes of the district's corresponding flowchart. Edges are not explicitly included but mapped to one of the buildings' dimensions. In particular, we used the following assignment of metrics to visual properties of buildings:

* Length: The number of outgoing edges of a node determines the length of the building.
* Width: The width of a building relates to the number of outgoing edges that are covered by the test suite.
* Height: The height of a building shows how often it was covered by the test suite.
* Color: The color represents how well the paths that go through a particular node are covered by a test suite. In case a building is "red", only few paths are executed. "Green" stands for a high Path Coverage.

Figure 2. The test coverage visualization using the city metaphor. Nodes are represented by buildings, flowcharts by districts. The visual properties of a building are determined by the coverage metrics of the corresponding node.

====4.3 Interpretation of the Visualization====
The complexity that a node introduces into a guideline mostly comes from its number of outgoing edges. In case the associated action has numerous possible outcomes (each represented by an outgoing edge), it is more likely to contain errors. Thus, one dimension of the base area of a building is influenced by the number of outgoing edges of the corresponding node. The increased size makes the building easier to spot. The other dimension of the base area grows with the number of covered outgoing edges. As a result, a non-quadratic building stands for a node whose outgoing edges are not completely covered. The lower the aspect ratio, the more uncovered edges exist. This deficiency in test coverage can easily be observed.

The color of a building depicts its Path Coverage. A red one represents a node that is contained in many more paths than were covered. With increasing Path Coverage, the color becomes green. The height of a building corresponds to the number of times it has been exercised by the test suite. Clearly, those two properties are not independent. On the one hand, a small building is more likely to be contained only in a small number of paths, and thus be red. On the other hand, a building that is tall and red implies that the corresponding node is often tested under similar circumstances. This can give hints for creating new test cases that are not contained so far. Tall, green buildings have been thoroughly exercised and do not require additional test cases.

The nesting level of the districts helps in estimating the necessary effort for testing a specific flowchart or node. Deeply nested modules are probably harder to reach by a test case, as each module may only be called under certain preconditions.

The aggregated view of CoverageCity makes it easy to spot deficiencies in test coverage quickly. However, a more detailed view is necessary to identify the specific nodes and their test deficiencies. Hence, each building can be selected by the user. Then, the corresponding flowchart is shown below the CoverageCity, and the node and its coverage are highlighted visually.

====4.4 Case Study====
Currently, we are involved in the implementation of a computerized guideline for automated mechanical ventilation [9]. The guideline is intended to run on a mechanical ventilator, and is able to derive new ventilatory settings in order to improve the ventilation of the patient. First testing efforts of the guideline were conducted using a physiological simulation. The guideline was run against a software tool that simulates a mechanically ventilated patient. It employs a physiological lung model to determine the effects of the current ventilation settings on the patient. The simulation is able to deliver the necessary data (ventilation settings and measured patient response) to the guideline execution engine. Based on this data, the guideline derives optimized settings and returns them to the simulation environment, which uses them for the further simulation. The simulation tool was used by medical experts to generate the test cases and review the derived ventilation settings. The generated test cases are saved to a file, and are then uploaded to KnowWE for the introspection of guideline execution and coverage analysis.

We selected a sample of ten generated test cases. The visualization of the acquired coverage levels is shown in Figure 2. Currently, we are in the process of evaluating our visualization with medical experts. Furthermore, we are identifying other meaningful assignments of metrics to visual properties of the buildings.

===5 Conclusion===
In this paper, we formally defined different coverage metrics to assess the thoroughness of testing efforts for clinical guidelines. They can be used to identify insufficiently tested elements, and to improve the process of test creation in terms of efficiency, as this may involve domain specialists as well as knowledge engineers. An intuitive visualization method helps in communicating the acquired coverage levels to domain specialists, for whom numerically expressed metrics are probably less helpful than for knowledge engineers.

Additional metrics could be defined over more dynamic aspects of a guideline. First, the distribution of values could be tracked for each activated flowchart element. As there clearly are dependencies between the actual values and the possible ones, given by the context (i.e., path) of the element, proper preprocessing would be necessary. Ultimately, this could give insight into whether parts of the guideline were only tested, e.g., for a certain patient type. Second, it would be helpful to define scenarios with respect to the occurrence of a certain sequence of input data or therapeutic actions over time and trace their coverage by the test suite. In terms of the CoverageCity visualization, we will evaluate different mappings of the coverage metrics to the visual properties of the city, to create new perspectives on test coverage.

One shortcoming of white-box testing in general is that it is unable to detect errors by omission, i.e., some requirement may not have been included in the implementation under test. An approach to find this type of error is Requirements-based Test Coverage [18]. It defines coverage with respect to implementation-independently defined requirements that are exercised by a test suite. Given formally defined requirements, this approach should be transferable to testing clinical guidelines.

===Acknowledgements===
University of Würzburg is funded by the German Federal Ministry for Education and Research under the project "WiMVent" (Knowledge- and Model-based Ventilation), grant number 01IB10002E.

===References===
[1] Valerie Barr, 'Applications of rule-base coverage measures to expert system evaluation', Knowledge-Based Systems, 12, 27–35, (1999).

[2] Joachim Baumeister, 'Advanced empirical testing', Knowledge-Based Systems, 24(1), 83–94, (2011).

[3] Joachim Baumeister, Jochen Reutelshoefer, and Frank Puppe, 'KnowWE: A semantic wiki for knowledge engineering', Applied Intelligence, 35(3), 323–344, (2011).

[4] Aziz A. Boxwala, Mor Peleg, Samson Tu, Omolola Ogunyemi, Qing T. Zeng, Dongwen Wang, Vimla L. Patel, Robert A. Greenes, and Edward H. Shortliffe, 'GLIF3: a representation format for sharable computer-interpretable clinical practice guidelines', J. of Biomedical Informatics, 37(3), 147–161, (2004).

[5] J.J. Chilenski and S.P. Miller, 'Applicability of modified condition/decision coverage to software testing', Software Engineering Journal, 9(5), 193–200, (1994).

[6] Paul de Clercq, Katharina Kaiser, and Arie Hasman, 'Computer-interpretable guideline formalisms', in Computer-based Medical Guidelines and Protocols: A Primer and Current Trends, eds., Annette ten Teije, Silvia Miksch, and Peter Lucas, 22–43, IOS Press, Amsterdam, The Netherlands, (2008).

[7] Reinhard Hatko, Joachim Baumeister, Volker Belli, and Frank Puppe, 'DiaFlux: A graphical language for computer-interpretable guidelines', in Knowledge Representation for Health-Care, eds., David Riaño, Annette ten Teije, and Silvia Miksch, volume 6924 of Lecture Notes in Computer Science, 94–107, Springer, Berlin / Heidelberg, (2012).

[8] Reinhard Hatko, Joachim Baumeister, Gritje Meinke, Stefan Mersmann, and Frank Puppe, 'Anomaly detection in DiaFlux models', in KESE7: 7th Workshop on Knowledge Engineering and Software Engineering, San Cristobal de La Laguna, Spain, November 10, 2011, volume 805 of CEUR Workshop Proceedings, Tenerife, Spain, (2011). CEUR-WS.org.

[9] Reinhard Hatko, Dirk Schädler, Stefan Mersmann, Joachim Baumeister, Norbert Weiler, and Frank Puppe, 'Implementing an automated ventilation guideline using the semantic wiki KnowWE', in EKAW 2012: 18th International Conference on Knowledge Engineering and Knowledge Management, eds., Heiner Stuckenschmidt, Annette ten Teije, and Johanna Voelker, (2012).

[10] Arjen Hommersom, Perry Groot, Michael Balser, and Peter Lucas, 'Formal methods for verification of clinical practice guidelines', in Computer-based Medical Guidelines and Protocols: A Primer and Current Trends, eds., Annette ten Teije, Silvia Miksch, and Peter Lucas, 63–80, IOS Press, Amsterdam, The Netherlands, (2008).

[11] J.A. Jones, M.J. Harrold, and J. Stasko, 'Visualization of test information to assist fault localization', in Proceedings of the 24th International Conference on Software Engineering, pp. 467–477. ACM, (2002).

[20] S.C. Reid, 'An empirical analysis of equivalence partitioning, boundary value analysis and random testing', in Software Metrics Symposium, 1997. Proceedings., Fourth International, pp. 64–73, (November 1997).

[21] Richard Wettel and Michele Lanza, 'Visualizing software systems as cities', in VISSOFT, eds., Jonathan I. Maletic, Alexandru Telea, and Andrian Marcus, pp. 92–99. IEEE Computer Society, (2007).

[22] Richard Wettel and Michele Lanza, 'CodeCity: 3D visualization of large-scale software', in ICSE Companion, eds., Wilhelm Schäfer, Matthew B. Dwyer, and Volker Gruhn, pp. 921–922. ACM, (2008).
[12] H.F Kwok, D.A Linkens, M Mahfouf, and G.H Mills, ‘Rule-base derivation for intensive care ventilator control using ANFIS’, Artificial Intelligence in Medicine, 29(3), 185 – 201, (2003). [13] Michele Lanza and Stéphane Ducasse, ‘Polymetric views - a lightweight visual approach to reverse engineering’, IEEE Trans. Soft- ware Eng., 29(9), 782–795, (2003). [14] Daniel Lübke, Leif Singer, and Alex Salnikow, ‘Calculating bpel test coverage through instrumentation’, in AST, eds., Dimitris Dranidis, Stephen P. Masticola, and Paul A. Strooper, pp. 115–122. IEEE, (2009). [15] Stefan Mersmann and Michel Dojat, ‘SmartCaretm - automated clini- cal guidelines in critical care’, in ECAI’04/PAIS’04: Proceedings of the 16th European Conference on Artificial Intelligence, including Presti- gious Applications of Intelligent Systems, pp. 745–749, Valencia, Spain, (2004). IOS Press. [16] Glenford J. Myers, Corey Sandler, and Tom Badgett, The art of software testing, John Wiley & Sons, Hoboken, N.J., 3 edn., 2011. [17] Mor Peleg, Samson Tu, Jonathan Bury, Paolo Ciccarese, John Fox, Robert A Greenes, Silvia Miksch, Silvana Quaglini, Andreas Sey- fang, Edward H Shortliffe, Mario Stefanelli, and et al., ‘Compar- ing computer-interpretable guideline models: A case-study approach’, JAMIA, 10, (2003). [18] A. Rajan, ‘Coverage metrics to measure adequacy of black-box test suites’, in Automated Software Engineering, 2006. ASE ’06. 21st IEEE/ACM International Conference on, pp. 335 –338, (sept. 2006). [19] Sandra Rapps and Elaine J. Weyuker, ‘Selecting software test data using data flow information’, IEEE Trans. Softw. Eng., 11(4), 367–375, (April 1985).
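
As a supplementary illustration, the assignment of coverage metrics to building dimensions described in the case study (length from the number of outgoing edges of a node, width from the number of those edges covered by the test suite) can be sketched as follows. This is a minimal sketch and not the KnowWE implementation: the `Edge` type, the function `building_dimensions`, and the example flowchart with its node names are hypothetical and serve only to make the mapping concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A directed flowchart edge (hypothetical type, not part of DiaFlux)."""
    source: str
    target: str


def building_dimensions(edges, covered_edges):
    """Return {node: (length, width)} for every node with outgoing edges.

    length = number of outgoing edges of the node
    width  = number of outgoing edges covered by the test suite
    """
    covered = set(covered_edges)
    dims = {}
    for edge in edges:
        length, width = dims.get(edge.source, (0, 0))
        dims[edge.source] = (length + 1, width + (1 if edge in covered else 0))
    return dims


# Hypothetical flowchart: one start node and one decision node with three exits.
edges = [
    Edge("start", "check"),
    Edge("check", "increase"),
    Edge("check", "decrease"),
    Edge("check", "keep"),
]
# Assumption: the test suite exercised only the first two edges.
covered = [edges[0], edges[1]]

dims = building_dimensions(edges, covered)
for node, (length, width) in dims.items():
    print(f"{node}: length={length}, width={width}")
```

In this example the decision node `check` yields a long but narrow building (length 3, width 1), which signals that most of its outgoing edges are still untested. Nodes without outgoing edges do not appear in the result; how such nodes are drawn is left open here.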