=Paper= {{Paper |id=None |storemode=property |title=CoverageCity: Test Coverage for Clinical Guidelines |pdfUrl=https://ceur-ws.org/Vol-949/kese8-01_02.pdf |volume=Vol-949 |dblpUrl=https://dblp.org/rec/conf/ecai/HatkoBP12 }} ==CoverageCity: Test Coverage for Clinical Guidelines== https://ceur-ws.org/Vol-949/kese8-01_02.pdf
      CoverageCity: Test Coverage for Clinical Guidelines
                               Reinhard Hatko1 and Joachim Baumeister2 and Frank Puppe3

1 University of Wuerzburg, Germany, email: hatko@informatik.uni-wuerzburg.de
2 denkbares GmbH, Germany, email: joachim.baumeister@denkbares.com
3 University of Wuerzburg, Germany, email: puppe@informatik.uni-wuerzburg.de

Abstract. In this paper, we introduce various metrics for test coverage of
clinical guidelines, modeled in the graphical language DiaFlux. Additionally,
an intuitive visualization method supports the process of test creation and
communicating the reached coverage levels to medical experts involved in the
authoring of the guideline. The goal is to reach a sufficiently high test
coverage to assure patient safety under all circumstances.

1   Introduction

Testing is an important step in the development of any software artifact, be
it a program, a knowledge-based system, or a clinical guideline. The two most
prevalent testing strategies in Software Engineering are black-box and
white-box testing. The former approach is unconcerned with the actual
implementation and derives the test cases solely from the underlying
specification. The latter one, in contrast, allows tests to be created based
on the implementation and the implementation to be examined during execution
of the tests. This introspection makes it possible to capture which basic
elements of the tested artifact were executed - and thus covered - by a test
suite. Different metrics of Test Coverage were developed to objectively
measure and assess the thoroughness of such testing efforts. In classic
Software Engineering, metrics have been defined to assess the coverage of,
e.g., methods of a program, statements of a method, taken decisions of
control-flow, and so on [16].
   Hence, the benefit of coverage metrics - and their proper visualization -
is two-fold, increasing the effectiveness and efficiency of the test creation
process: First, they help to avoid the creation of redundant tests. Second,
they can be used to identify untested elements.
   Both are also important aspects in the area of knowledge-based systems in
general and computerized clinical guidelines in particular. The creation of
test cases for a clinical guideline will most likely involve both parties,
the medical expert and the knowledge engineer. It is thus an expensive task,
which should be completed efficiently. However, the overall goal is to create
a guideline that assures patient safety under all circumstances, which can
best be guaranteed by a thorough test suite. This is especially important in
the area of automated guidelines, which are applied by closed-loop devices.
They act autonomously on a patient to improve her state, without requiring
constant supervision by a clinical user.
   For the interpretation of coverage metrics a visualization is helpful,
especially to find untested elements. Test coverage of software programs is
usually visualized by some kind of syntax highlighting, by coloring, e.g.,
the executed statements. However, graphical representations have also been
developed, e.g., [11]. We adopted a visualization method from a related area
in (object-oriented) Software Engineering, namely Software Metrics. They are
used to assess code quality with respect to structural properties of classes,
e.g., number of methods, number of members, lines of code, and so forth.
Those metrics are purely static, not involving the execution of the program
itself. They can graphically be visualized as a CodeCity [22] to determine
design flaws.
   In this paper, we introduce various metrics to determine the test coverage
of clinical guidelines, modeled in the graphical language DiaFlux.
Furthermore, we adapted the city metaphor by creating a CoverageCity for
communicating the reached coverage levels to the involved medical experts in
an accessible manner.
   The rest of this paper is structured as follows: In the next section we
give a short introduction to the graphical language DiaFlux for clinical
guidelines. Section 3 presents coverage metrics for DiaFlux models. Following
that, in Section 4, we present the results of a case study. Finally, we
conclude the paper with a summary and an outlook.

2     The DiaFlux Guideline Language

Clinical guidelines are an accepted means to improve patient outcome. To this
end, they offer a standardized treatment, based on evidence-based medicine.
They have been developed for several decades. In their beginnings, they were
solely text-based documents that relied on the proper application by
clinicians. The ongoing computerization and data availability, also in
domains with high-frequency data such as Intensive Care Units (ICUs), allows
for an automation of guideline application by medical devices.
   Several formalisms for Computer-Interpretable Guidelines were developed,
every one with its own focus, like shareability between institutions [4] or
decision support. Most of them are graphical approaches that employ a kind of
Task-Network-Model to express the guideline steps [17]. However, in the area
of (semi-)closed-loop devices, rule-based approaches are predominant, e.g.,
[12, 15].
   A downside of rule-based representations is their lower comprehensibility
compared to graphical ones. This especially holds true, as medical experts
are usually involved in the creation of guidelines. Therefore, we have
developed a graphical guideline formalism called DiaFlux [7]. Its main focus
lies on the direct applicability and understandability by domain specialists.

2.1    Application Scenario

The main application area of DiaFlux is mixed-initiative devices that
continuously monitor, diagnose and treat a patient in the setting of an ICU.
Such closed-loop systems interact with the clinical user during the process
of care. Both the clinician and the device are able to initiate actions on
the patient. Data is continuously available as a
result of the monitoring task. It allows for repeated reasoning about the
patient state, and to carry out appropriate actions to improve her condition,
if necessary.

2.2    Language Description

To specify a clinical guideline, two different types of knowledge have to be
effectively combined, namely declarative and procedural knowledge [6]. The
declarative part contains the facts of a given domain, i.e., findings,
diagnoses, treatments and their interrelation. The knowledge of how to
perform a task, i.e., the correct sequence of actions, is expressed in the
procedural knowledge. It is responsible for the decision which action to
perform next, e.g., asking a question or carrying out a test, in a given
situation. Each of these actions has a cost (monetary or associated risk) and
a benefit (like information gain or therapeutic effect) associated with it.
Therefore, the choice of an appropriate sequence of actions is mandatory for
efficient diagnosis and treatment.
   In DiaFlux models, the declarative knowledge is represented by a
domain-specific ontology, which contains the definition of findings and
solutions. This application ontology is an extension of the task ontology of
diagnostic problem solving [3]. The ontology is strongly formalized and
provides the necessary semantics for executing the guideline. Like most
graphical guideline languages, DiaFlux employs flowcharts as the
Task-Network-Model. They describe decisions, actions and constraints about
their ordering in a guideline plan. These flowcharts consist of nodes and
connecting edges. Nodes represent different kinds of actions. Edges connect
nodes to create paths of possible actions. Edges can be guarded by conditions
that evaluate the state of the current patient session, and thus guide the
course of the care process. Figure 1 shows a module of an exemplary guideline
for diagnosing weight problems.
   In the following, we informally describe the most important language
elements:
• Test node: Test nodes represent an action for the acquisition of data
   during runtime. This may trigger a question the user has to answer, or
   data to be automatically obtained by sensors or from a database.
• Solution node: Solution nodes are used to set the rating of a solution
   based on the given inputs. Established solutions generate messages for the
   clinical user and can, e.g., advise him to conduct some action.
• Wait node: Upon reaching a wait node, the execution of the protocol is
   suspended until the given period of time has elapsed.
• Composed node: DiaFlux models can be hierarchically structured. Defined
   models can be reused as modules, represented by a composed node in the
   flowchart using it.
• Abstraction node: Abstraction nodes offer the possibility to create
   abstractions from available data. These values can then be used for
   therapeutic actions by influencing the settings of the host device.

2.3    Guideline Execution

The execution engine for DiaFlux models is intended for, but not limited to,
closed-loop devices that provide data from sensors or manually entered by the
clinical user and carry out therapeutic actions on the patient, i.e.,
changing device settings in certain ranges. The architecture of the DiaFlux
guideline execution engine consists of three components. First, a knowledge
base that contains the application ontology and the flowcharts. Second, a
blackboard that stores all findings about the current patient session. Third,
a reasoning engine that executes the guideline and carries out its steps,
depending on the current state as given by the contents of the blackboard.
For this purpose, the reasoning engine is notified about all findings that
enter the blackboard. A more thorough introduction to the DiaFlux language
and its execution engine is given in [7].
   The execution of the guideline is time-driven. The reasoning starts by
acquiring data and by interpreting this data. The results are written to the
blackboard. Then, the guideline is executed. This involves making decisions
and possibly the generation of hints to the user and therapeutic actions by
the device. Finally, the time of the next execution is scheduled, pausing the
execution until that instant in time, waiting for the effects of the
therapeutic interventions to take place.

3   Test Coverage of Clinical Guidelines

Verification and validation of a clinical guideline are important steps in
its development. That way, patient safety shall be assured under all
circumstances. Verification usually consists of formal methods, proving that
a given guideline is free of internal inconsistencies and incompleteness.
Normally, these kinds of checks can be performed without executing the
guideline. An overview of verification methods applied to clinical guidelines
is given in [10]. Anomaly detection for DiaFlux models is described in [8].
In contrast, the validation of a guideline usually involves its execution by
a set of test cases (i.e. a test suite) and comparing the actual results
against the expected ones [2]. The thoroughness of such empirical testing can
be determined by different metrics of test coverage.
   In conventional software engineering (SE), test coverage (also known as
code coverage) is a well-established technique to measure how well a piece of
software is exercised by a test suite. Often, the reached level of coverage
is also used as an indicator for the quality of the tested program, as tested
elements are less likely to contain errors than untested ones. Coverage
criteria have been defined on different levels of granularity, from the
method-level down to single statements, or even parts of them (so called
condition coverage) [16]. In the field of AI research, coverage measures have
been proposed for rule-based systems. In this case, the results of such a
coverage analysis can, e.g., be used to prune the rule base [1]. For
graphical model representations, coverage measures have been proposed, e.g.,
for business processes [14], taking their specifics into account, for
example, the coverage of fault handlers.
   In general, employing coverage metrics during the creation of a test suite
may help to improve it in terms of minimality and completeness. While a high
coverage of the object under test is worthwhile, this should be accomplished
with the smallest possible set of test cases, as test creation is a difficult
and costly task. This especially holds true for the test creation of clinical
guidelines, as it may involve knowledge engineers as well as domain
specialists like medical experts. Therefore, adopting the mentioned
techniques for clinical guidelines and their graphical representations offers
the possibility to improve this process.
   In the following, we introduce coverage metrics on different levels of
granularity to assess the test coverage of a DiaFlux guideline. Such a
guideline usually is modularized into self-contained modules, i.e.
flowcharts. To reflect this modularization in the coverage metrics as well,
we define most of them over the elements of individual flowcharts, hence the
restriction of nodes and edges to those of a single one. This focus makes it
easier to increase coverage with additional tests, as deficiencies are easier
to spot.
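To make the terms from Sections 2.2 and 2.3 concrete (nodes, guarded edges,
blackboard), the following is a minimal Python sketch of a single flowchart
and of one execution that records which nodes and edges were exercised. It is
not the DiaFlux engine: the names Edge, Flowchart, run and always, the node
names, and the BMI threshold of 25 are all hypothetical and chosen only for
illustration. The recorded trace is the kind of data over which per-flowchart
coverage metrics can be computed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Assumption: the blackboard is reduced to a mapping from finding name to value.
Blackboard = Dict[str, float]


def always(blackboard: Blackboard) -> bool:
    """Guard of an unguarded edge: always enabled."""
    return True


@dataclass
class Edge:
    source: str
    target: str
    guard: Callable[[Blackboard], bool] = always


@dataclass
class Flowchart:
    name: str
    start: str
    edges: List[Edge]

    def run(self, blackboard: Blackboard, max_steps: int = 100
            ) -> Tuple[List[str], List[Tuple[str, str]]]:
        """Follow the first enabled edge from each node.

        Returns the visited nodes and the taken edges in order.
        max_steps is a safety bound for this sketch only.
        """
        visited = [self.start]
        taken: List[Tuple[str, str]] = []
        current = self.start
        for _ in range(max_steps):
            enabled = [e for e in self.edges
                       if e.source == current and e.guard(blackboard)]
            if not enabled:
                break
            edge = enabled[0]
            taken.append((edge.source, edge.target))
            visited.append(edge.target)
            current = edge.target
        return visited, taken


# Hypothetical module loosely inspired by the weight example of Figure 1.
chart = Flowchart(
    name="weight",
    start="start",
    edges=[
        Edge("start", "measure"),
        Edge("measure", "overweight", lambda bb: bb["bmi"] >= 25.0),
        Edge("measure", "normal", lambda bb: bb["bmi"] < 25.0),
    ],
)

visited, taken = chart.run({"bmi": 30.0})
print(visited)  # nodes exercised by this single test case
print(taken)    # edges exercised by this single test case
```

A test suite would call run once per test case and collect the returned
traces; wait nodes, composed nodes and time-driven scheduling are deliberately
left out of this sketch.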
                                      Figure 1. A module of a DiaFlux guideline for diagnosing weight problems.




3.1    Flowchart Coverage

Flowchart Coverage is defined as the ratio of the number of flowcharts that
are executed by a test suite to the overall number of flowcharts in the
guideline.
   Let $F$ be the set of DiaFlux models in guideline $G$, and $F_T^{ex}$ be
the set of flowcharts executed by test suite $T$. Then, the Flowchart
Coverage $FC_T$ of $G$ is given by:

$$FC_T(G) = \frac{|F_T^{ex}|}{|F|}$$

This metric only gives a bird’s eye view of the testing situation. It can be
used to guarantee at least some testing in all areas of the guideline during
the starting phase of test creation. As it is a very coarse-grained
measurement, an $FC_T$ value of 100% should be aimed at. Otherwise, major
parts of the guideline remain untested. In SE, the equivalent metric is
function coverage, which reports whether every function of a program has been
tested.

3.2    Node Coverage

Nodes represent the elementary steps of a guideline. A node being covered by
a test suite means that its associated guideline step has been carried out at
least once during the execution of the test suite.
   Let $N_f$, $f \in F$ be the set of nodes of the flowchart $f$, and
$N_{f,T}^{ex}$, $f \in F$ be the set of nodes of the flowchart $f$ that are
executed by test suite $T$. The corresponding Node Coverage metric $NC_T$ of
a flowchart $f$ for a given test suite $T$ can then be calculated as:

$$NC_T(f) = \frac{|N_{f,T}^{ex}|}{|N_f|}, \quad f \in F$$

Similar to Flowchart Coverage, an $NC_T$ level of 100% should be reached for
every flowchart in the guideline, as untested nodes can enact actions with
unforeseen effects. This metric is equivalent to Statement Coverage in
classic SE, which reports whether every statement of a tested function has
been executed.

3.3    Edge Coverage

Edges are used to create the control flow of a flowchart, by defining paths
of possible sequences of guideline steps. Every node can have several
outgoing edges. Each of these edges can be guarded by a condition, to select
the appropriate successor node, depending on the outcome of each guideline
step. Therefore, the Edge Coverage metric reports whether all possible
outcomes of the guideline steps - in terms of the equivalence classes that
are defined by the edge guards - have been considered within the test suite.
   Let $E_f$, $f \in F$ be the set of edges of flowchart $f$, and
$E_{f,T}^{ex}$, $f \in F$ be the set of edges of the flowchart $f$ that are
executed by test suite $T$. Then, $EC_T$ is defined as the ratio of the
number of activated edges of a flowchart $f$ to their overall number, when
executing a given test suite $T$:

$$EC_T(f) = \frac{|E_{f,T}^{ex}|}{|E_f|}, \quad f \in F$$

As edges connect nodes, Edge Coverage subsumes the Node Coverage of the
corresponding flowchart. This metric can be compared to Decision Coverage in
SE, which keeps track of whether each decision in a program under test (e.g.,
if- and switch-statements) has at least once been taken and once not.

3.4    Condition Coverage

An edge guard may not be an atomic condition, but consist of several
sub-conditions, connected by boolean operators. For such non-atomic guards,
Edge Coverage gives no detailed information about which of its sub-conditions
were satisfied and which were not. This is of special interest if the
sub-conditions are joined by an OR-operator. In this case, every possible
combination of atomic conditions that can fulfill the overall condition has
to be tested.
   A more detailed view of this issue is given by Condition Coverage. It
checks whether every atomic condition has once been satisfied and once not.
In classical SE, several different metrics for this issue have been
developed, e.g. Modified Condition / Decision Coverage
[5]. As those can directly be applied to the guarding conditions of edges, we
will not further elaborate on this issue.

3.5    Path Coverage

A path through the guideline consists of consecutive nodes and edges. Such a
path can be seen as the execution of decisions and actions for a given
clinical scenario. In Software Engineering, it usually is not possible to
reach full path coverage as soon as loops are involved, as each number of
iterations results in an additional path. In clinical guidelines, there are
no loops with an unlimited number of iterations, as, for instance, some time
has to pass until an action can be repeated. Nevertheless, the number of
paths through the complete guideline across multiple nested DiaFlux models
most likely exceeds the possibilities of test creation. Given a proper
modularization, each flowchart is responsible for a specific aspect of a
guideline. Each path through such a single flowchart can be seen as one
specific scenario concerning this aspect. Therefore, we assess each flowchart
independently, and define a corresponding Path Coverage metric over the paths
of each individual flowchart.
   Let $P_f$, $f \in F$ be the set of paths through flowchart $f$, and
$P_{f,T}^{ex}$, $f \in F$ be the set of paths through flowchart $f$ that are
executed by test suite $T$. Then, $PC_T$ is calculated as the number of paths
taken through flowchart $f$ by the execution of test suite $T$, divided by
the total number of paths:

$$PC_T(f) = \frac{|P_{f,T}^{ex}|}{|P_f|}, \quad f \in F$$

   Even with a proper modularization given, not every modeled path may be
enactable, due to implicit dependencies between the guideline steps. If
certain combinations of decisions and actions on a single path exist that
cannot occur, the targeted value of Path Coverage has to be decreased
accordingly.
   As a path consists of consecutive nodes and edges, Path Coverage subsumes
Node Coverage as well as Edge Coverage.
   Path Coverage, as defined above, is a rather aggregated measurement and
thus gives little advice on how to improve coverage with further tests.
Therefore, Path Coverage can also be restricted to the paths through a
specific node $n \in N$:
   Let $P_n$ be the set of paths containing $n$ ($\forall p \in P_n : n \in
p$), and $P_{n,T}^{ex}$ be the set of paths containing $n$ exercised by test
suite $T$. Accordingly, the Path Coverage of node $n$ is defined as:

$$PC_T(n) = \frac{|P_{n,T}^{ex}|}{|P_n|}$$

   A node with a low Path Coverage is only tested under a small fraction of
the contexts in which it is contained. Again, further tests should be created
for those scenarios, unless they expose dependencies that make some paths
infeasible.

3.6    Value Coverage

The metrics presented so far assess the test coverage with respect to
structural properties, each considering some kind of modeled element, like
nodes and edges. However, the actual input data that directs the execution of
the guideline is not assessed by these metrics in any way.
   Besides the mentioned control-flow-based metrics, a second perspective on
coverage in classic Software Engineering is given by data-flow-based metrics.
Those measure the coverage of definition-use (du) sequences of variables,
i.e., a block of instructions in which a variable is defined and subsequently
used without a redefinition of the variable, e.g., [19]. Black-box testing
strategies concerned with data usage are Equivalence Partitioning and
Boundary Value Analysis [16]. As exhaustive testing with all valid input data
is most likely not tractable even for a small program, its specification can
be used to partition the input space into equivalence classes. Under the
assumption that each value of a partition is treated equally by the program,
an arbitrary representative of each class can be chosen for a test case. As
errors are more probable at the boundaries of an equivalence class, Boundary
Value Analysis is often used to derive additional test cases at these spots
[20], for example to find “off-by-one” errors (e.g., resulting from the use
of the operator “≤” rather than “<” in a numerical comparison).
   DiaFlux models do not contain variables as they are common in procedural
programming languages, and the input data is not modified during guideline
execution. Therefore, a definition-use analysis is not applicable.
Equivalence Partitioning and Boundary Value Analysis also cannot be used as
they are. Explicit equivalence classes usually cannot be stated for inputs.
Even thresholds for less determined assessments (like “low”, “normal”,
“high”) are often hard for a medical expert to specify. Those can furthermore
vary between different types of patients, which, e.g., share an underlying
disease. However, for each numerical input, a contiguous interval of possible
values can typically be given, according to the human body’s physiological
system and/or the preconditions of guideline applicability. To assure a
proper coverage of allowed input data regardless of concepts like equivalence
classes, we define the metric Input Coverage: Let the interval $[min_i ;
max_i]$ be the domain $D_i$ for numerical input $i$, and $n \in \mathbb{N}$,
$n > 0$ be the number of equally-sized partitions of $D_i$. The function
$cover(i, j)$ returns 1 iff the $j$-th partition of $D_i$ contains at least
one input value in test suite $T$, and 0 otherwise. Then, the Input Coverage
of $i$ is given by:

$$IC_{n,T}(i) = \frac{\sum_{j=1}^{n} cover(i, j)}{n}$$

   Clearly, the significance of Input Coverage depends on the actual value of
$n$. It should be chosen to appropriately represent the sensitivity of the
outcome of the guideline to changes in values of $i$. At later stages of test
creation, the value can be increased stepwise to test at a finer granularity
in terms of $i$’s input values.
   Similarly, the output of the guideline (which mainly consists of numerical
values of the host device’s settings in predefined ranges) can be assessed by
the analogously defined metric Output Coverage $OC_{n,T}$.

3.7    Measuring Test Coverage

Commonly, there are two strategies to gather the data needed for calculating
test coverage metrics. The first one is called instrumentation, which
modifies the tested piece of software by including new code that collects the
necessary information. The second strategy is tracing, which traces the
executed elements by using some sort of debugging API (Application
Programming Interface) of the execution environment. Clearly, both approaches
have an effect on the execution time of the tests, as additional data has to
be gathered. An advantage of tracing is that it does not alter the executed
artifact. Under certain circumstances, such an alteration, as made by
instrumentation, may also influence the behavior of the artifact. In this
respect, “tracing” seems preferable, if the necessary API is available.
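The per-flowchart metrics of Sections 3.2, 3.3, 3.5 and 3.6 can be sketched
in a few lines of Python, assuming that tracing has already produced, for
each test case, the sequence of node names that was executed. This is an
illustrative sketch under that assumption, not the KnowWE plugin: all function
names, the toy flowchart, and the input domain [10; 50] are hypothetical.
Partitions are 1-indexed in the definition of Input Coverage above and
0-indexed in the code.

```python
from typing import Sequence, Set, Tuple

Path = Tuple[str, ...]      # one executed path: a sequence of node names
EdgeT = Tuple[str, str]     # an edge as (source node, target node)


def ratio(covered: int, total: int) -> float:
    """Coverage ratio; defined as 0.0 for an empty element set."""
    return covered / total if total else 0.0


def node_coverage(nodes: Set[str], traces: Sequence[Path]) -> float:
    executed = {n for path in traces for n in path}
    return ratio(len(executed & nodes), len(nodes))


def edge_coverage(edges: Set[EdgeT], traces: Sequence[Path]) -> float:
    # Consecutive node pairs of a trace are the edges it has taken.
    executed = {pair for path in traces for pair in zip(path, path[1:])}
    return ratio(len(executed & edges), len(edges))


def path_coverage(paths: Set[Path], traces: Sequence[Path]) -> float:
    return ratio(len(set(traces) & paths), len(paths))


def node_path_coverage(node: str, paths: Set[Path],
                       traces: Sequence[Path]) -> float:
    # Path Coverage restricted to the paths that contain the given node.
    containing = {p for p in paths if node in p}
    return ratio(len(set(traces) & containing), len(containing))


def input_coverage(values: Sequence[float], lo: float, hi: float,
                   n: int) -> float:
    """Fraction of the n equally-sized partitions of [lo; hi] that are hit."""
    if n <= 0 or hi <= lo:
        raise ValueError("need n > 0 and hi > lo")
    width = (hi - lo) / n
    hit = set()
    for v in values:
        if lo <= v <= hi:
            # The upper bound hi is assigned to the last partition.
            hit.add(min(int((v - lo) / width), n - 1))
    return len(hit) / n


# Hypothetical flowchart with one decision and two outcomes.
nodes = {"start", "measure", "overweight", "normal"}
edges = {("start", "measure"), ("measure", "overweight"),
         ("measure", "normal")}
paths = {("start", "measure", "overweight"), ("start", "measure", "normal")}
traces = [("start", "measure", "overweight")]   # suite with one test case

print(node_coverage(nodes, traces))
print(edge_coverage(edges, traces))
print(path_coverage(paths, traces))
print(node_path_coverage("measure", paths, traces))
print(input_coverage([30.0, 31.5, 18.0], lo=10.0, hi=50.0, n=4))
```

In this toy suite, the single test case leaves the node "normal" and its
incoming edge uncovered, which is exactly the kind of deficiency the metrics
are meant to expose. Enumerating the full set of paths is assumed to be done
elsewhere and is only feasible per flowchart, as argued in Section 3.5.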
3.8     Visualization of Test Coverage

The calculated metrics result in a numerical value representing the test coverage of the exercised artifact. Such a value may well be comprehensible for software and knowledge engineers, but for non-technical domain specialists such as medical experts, bare numbers may not be accessible enough. Furthermore, only a proper composition of metrics yields a meaningful overall picture, as each metric represents a different aspect of coverage. Thus, an intuitive visualization seems preferable as a means of communicating the reached coverage levels to domain specialists. One approach to this need is the so-called “Polymetric Views” [13], which display several metrics in one aggregated view.

4     Test Coverage for DiaFlux Guidelines

This section describes an implementation of coverage metrics for DiaFlux guidelines and a small case study.

4.1     Implementation

The development environment for DiaFlux guidelines is integrated into the Semantic Wiki KnowWE. We created a plugin that calculates the test coverage of DiaFlux models while a test suite is executed. It employs the tracing approach to collect information about the exercised elements of the guideline. For each execution of the guideline, the path chosen according to the input data is recorded. After the test suite has finished, the metrics are calculated and can subsequently be visualized as a CoverageCity, which can be scaled and rotated freely (cf. Figure 2). To create the visualization, we used SceneJS (http://scenejs.org), a JavaScript library based on WebGL (http://webgl.org). The visualization is hence accessible from within KnowWE with every (modern) web browser and does not require any proprietary software.

4.2     The CoverageCity Visualization

For an accessible visualization of the reached coverage levels, we use the metaphor of a city. It was introduced as a graphical representation of static code metrics (e.g., number of methods) in the context of reverse engineering of software systems (“CodeCity”) [21]. Such a city consists of districts that represent the nesting structure of packages, and of buildings that represent classes. Each building is located in the district corresponding to its package. The actual values of the metrics of each class determine the visual appearance of the matching building. A building can represent up to four different metrics, which influence its length, width, height, and color. Besides the package structure, districts can depict one metric by their color.

We adapted the city metaphor for the visual representation of the test coverage of DiaFlux guidelines. Districts stand for individual flowcharts. They are nested as given by their call hierarchy. Buildings correspond to the nodes of the district’s flowchart. Edges are not explicitly included but are mapped to one of the buildings’ dimensions. In particular, we used the following assignment of metrics to visual properties of buildings:

• Length: The number of outgoing edges of a node determines the length of the building.
• Width: The width of a building relates to the number of outgoing edges that are covered by the test suite.
• Height: The height of a building shows how often it was covered by the test suite.
• Color: The color represents how well the paths that go through a particular node are covered by the test suite. If a building is “red”, only few of these paths are executed. “Green” stands for a high Path Coverage.

4.3    Interpretation of the Visualization

The complexity that a node introduces into a guideline mostly comes from its number of outgoing edges. If the associated action has numerous possible outcomes (each represented by an outgoing edge), it is more likely to contain errors. Thus, one dimension of the base area of a building is determined by the number of outgoing edges of the corresponding node. The increased size makes the building easier to spot. The other dimension of the base area grows with the number of covered outgoing edges. As a result, a building with a non-square base stands for a node whose outgoing edges are not completely covered. The lower the aspect ratio (width to length), the more uncovered edges exist. This deficiency in test coverage can easily be observed.

The color of a building depicts its Path Coverage. A red building represents a node that is contained in many more paths than were covered. With increasing Path Coverage, the color becomes green. The height of a building corresponds to the number of times it has been exercised by the test suite. Clearly, those two properties are not independent. On the one hand, a small building is more likely to be contained in only a small number of paths, and thus to be red. On the other hand, a building that is tall and red implies that the corresponding node is often tested under similar circumstances. This can give hints for creating new test cases that cover situations not yet contained in the test suite. Tall, green buildings have been thoroughly exercised and do not require additional test cases.

The nesting level of the districts helps in estimating the effort necessary for testing a specific flowchart or node. Deeply nested modules are probably harder to reach by a test case, as each module may only be called under certain preconditions.

The aggregated view of CoverageCity makes it easy to spot deficiencies in test coverage quickly. However, a more detailed view is necessary to identify the specific nodes and their test deficiencies. Hence, each building can be selected by the user. The corresponding flowchart is then shown below the CoverageCity, and the node and its coverage are highlighted visually.

4.4    Case Study

Currently, we are involved in the implementation of a computerized guideline for automated mechanical ventilation [9]. The guideline is intended to run on a mechanical ventilator and is able to derive new ventilatory settings in order to improve the ventilation of the patient. First testing efforts of the guideline were conducted using a physiological simulation. The guideline was run against a software tool that simulates a mechanically ventilated patient. It employs a physiological lung model to determine the effects of the current ventilation settings on the patient. The simulation is able to deliver the necessary data (ventilation settings and measured patient response) to the guideline execution engine. Based on this data, the guideline derives optimized settings and returns them to the simulation environment, which uses them for the further simulation. The simulation tool was used by medical experts to generate the test cases and review the derived ventilation settings. The generated test cases are saved to a file,
Figure 2. The test coverage visualization using the city metaphor. Nodes are represented by buildings, flowcharts by districts. The visual properties of a building are determined by the coverage metrics of the corresponding node.
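
The assignment of coverage metrics to building properties listed in Section 4.2 can be illustrated with a minimal Python sketch. It is not the KnowWE plugin itself. The names `Building`, `all_paths` and `coverage_city`, the dictionary-based flowchart, the restriction to acyclic flowcharts, and the linear red-to-green color interpolation are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Building:
    node: str
    length: int   # number of outgoing edges of the node
    width: int    # number of outgoing edges covered by the test suite
    height: int   # number of times the node was exercised
    color: tuple  # (r, g, b): red = low Path Coverage, green = high


def all_paths(graph, start):
    """Enumerate all paths from start to an exit node.

    Assumption: the flowchart is acyclic; loops would need a bound.
    """
    successors = graph.get(start, [])
    if not successors:
        return [(start,)]
    return [(start,) + rest
            for nxt in successors
            for rest in all_paths(graph, nxt)]


def coverage_city(graph, start, traces):
    """Map recorded traces (one node sequence per test run) to buildings."""
    paths = all_paths(graph, start)
    covered_paths = {tuple(t) for t in traces}
    covered_edges = {(a, b) for t in traces for a, b in zip(t, t[1:])}
    city = {}
    for node, successors in graph.items():
        through = [p for p in paths if node in p]
        hit = [p for p in through if p in covered_paths]
        ratio = len(hit) / len(through) if through else 0.0
        city[node] = Building(
            node=node,
            length=len(successors),
            width=sum((node, s) in covered_edges for s in successors),
            height=sum(t.count(node) for t in traces),
            # Hypothetical color scale: linear blend from red to green.
            color=(round(255 * (1 - ratio)), round(255 * ratio), 0),
        )
    return city


# Hypothetical flowchart: one decision node with three outcomes.
graph = {
    "start": ["check"],
    "check": ["increase", "decrease", "keep"],
    "increase": ["end"],
    "decrease": ["end"],
    "keep": ["end"],
    "end": [],
}
# Two test runs that both take the same path.
traces = [["start", "check", "increase", "end"]] * 2
city = coverage_city(graph, "start", traces)
```

In this example the building for `check` has length 3 but width 1, so its base is not square, which signals two uncovered outgoing edges. It is also comparatively tall and far from green, the pattern that Section 4.3 reads as a node tested repeatedly under similar circumstances.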



and are then uploaded to KnowWE for the introspection of guideline execution and coverage analysis.

We selected a sample of ten generated test cases. The visualization of the acquired coverage levels is shown in Figure 2. Currently, we are in the process of evaluating our visualization with medical experts. Furthermore, we are identifying other meaningful assignments of metrics to visual properties of the buildings.

5    Conclusion

In this paper, we formally defined different coverage metrics to assess the thoroughness of testing efforts for clinical guidelines. They can be used to identify insufficiently tested elements, and to improve the efficiency of the test creation process, which may involve domain specialists as well as knowledge engineers. An intuitive visualization method helps in communicating the acquired coverage levels to domain specialists, for whom numerically expressed metrics are probably less helpful than for knowledge engineers.

Additional metrics could be defined over more dynamic aspects of a guideline. First, the distribution of values could be tracked for each activated flowchart element. As there clearly are dependencies between the actual values and the possible ones, which are given by the context (i.e., the path) of the element, proper preprocessing would be necessary. Ultimately, this could give insight into whether parts of the guideline were only tested, e.g., for a certain patient type. Second, it would be helpful to define scenarios with respect to the occurrence of a certain sequence of input data or therapeutic actions over time, and to trace their coverage by the test suite. Regarding the CoverageCity visualization, we will evaluate different mappings of the coverage metrics to the visual properties of the city in order to create new perspectives on test coverage.

One shortcoming of white-box testing in general is that it is unable to detect errors of omission, i.e., some requirement may not have been included in the implementation under test. An approach to finding this type of error is Requirements-based Test Coverage [18]. It defines coverage with respect to requirements that are defined independently of the implementation and are exercised by a test suite. Given formally defined requirements, this approach should be transferable to testing clinical guidelines.

Acknowledgements

University of Würzburg is funded by the German Federal Ministry for Education and Research under the project “WiM-Vent” (Knowledge- and Model-based Ventilation), grant number 01IB10002E.
REFERENCES

 [1] Valerie Barr, ‘Applications of rule-base coverage measures to expert system evaluation’, Knowledge-Based Systems, 12, 27–35, (1999).
 [2] Joachim Baumeister, ‘Advanced empirical testing’, Knowledge-Based Systems, 24(1), 83–94, (2011).
 [3] Joachim Baumeister, Jochen Reutelshoefer, and Frank Puppe, ‘KnowWE: A semantic wiki for knowledge engineering’, Applied Intelligence, 35(3), 323–344, (2011).
 [4] Aziz A. Boxwala, Mor Peleg, Samson Tu, Omolola Ogunyemi, Qing T. Zeng, Dongwen Wang, Vimla L. Patel, Robert A. Greenes, and Edward H. Shortliffe, ‘GLIF3: a representation format for sharable computer-interpretable clinical practice guidelines’, J. of Biomedical Informatics, 37(3), 147–161, (2004).
 [5] J.J. Chilenski and S.P. Miller, ‘Applicability of modified condition/decision coverage to software testing’, Software Engineering Journal, 9(5), 193–200, (1994).
 [6] Paul de Clercq, Katharina Kaiser, and Arie Hasman, ‘Computer-interpretable guideline formalisms’, in Computer-based Medical Guidelines and Protocols: A Primer and Current Trends, eds., Annette ten Teije, Silvia Miksch, and Peter Lucas, 22–43, IOS Press, Amsterdam, The Netherlands, (2008).
 [7] Reinhard Hatko, Joachim Baumeister, Volker Belli, and Frank Puppe, ‘DiaFlux: A graphical language for computer-interpretable guidelines’, in Knowledge Representation for Health-Care, eds., David Riaño, Annette ten Teije, and Silvia Miksch, volume 6924 of Lecture Notes in Computer Science, 94–107, Springer, Berlin / Heidelberg, (2012).
 [8] Reinhard Hatko, Joachim Baumeister, Gritje Meinke, Stefan Mersmann, and Frank Puppe, ‘Anomaly detection in DiaFlux models’, in KESE7: 7th Workshop on Knowledge Engineering and Software Engineering, San Cristobal de La Laguna, Spain, November 10, 2011, volume 805 of CEUR Workshop Proceedings, Tenerife, Spain, (2011). CEUR-WS.org.
 [9] Reinhard Hatko, Dirk Schädler, Stefan Mersmann, Joachim Baumeister, Norbert Weiler, and Frank Puppe, ‘Implementing an automated ventilation guideline using the semantic wiki KnowWE’, in EKAW 2012: 18th International Conference on Knowledge Engineering and Knowledge Management, eds., Heiner Stuckenschmidt, Annette ten Teije, and Johanna Voelker, (2012).
[10] Arjen Hommersom, Perry Groot, Michael Balser, and Peter Lucas, ‘Formal methods for verification of clinical practice guidelines’, in Computer-based Medical Guidelines and Protocols: A Primer and Current Trends, eds., Annette ten Teije, Silvia Miksch, and Peter Lucas, 63–80, IOS Press, Amsterdam, The Netherlands, (2008).
[11] J.A. Jones, M.J. Harrold, and J. Stasko, ‘Visualization of test information to assist fault localization’, in Proceedings of the 24th International Conference on Software Engineering, pp. 467–477. ACM, (2002).
[12] H.F. Kwok, D.A. Linkens, M. Mahfouf, and G.H. Mills, ‘Rule-base derivation for intensive care ventilator control using ANFIS’, Artificial Intelligence in Medicine, 29(3), 185–201, (2003).
[13] Michele Lanza and Stéphane Ducasse, ‘Polymetric views - a lightweight visual approach to reverse engineering’, IEEE Trans. Software Eng., 29(9), 782–795, (2003).
[14] Daniel Lübke, Leif Singer, and Alex Salnikow, ‘Calculating BPEL test coverage through instrumentation’, in AST, eds., Dimitris Dranidis, Stephen P. Masticola, and Paul A. Strooper, pp. 115–122. IEEE, (2009).
[15] Stefan Mersmann and Michel Dojat, ‘SmartCare™ - automated clinical guidelines in critical care’, in ECAI’04/PAIS’04: Proceedings of the 16th European Conference on Artificial Intelligence, including Prestigious Applications of Intelligent Systems, pp. 745–749, Valencia, Spain, (2004). IOS Press.
[16] Glenford J. Myers, Corey Sandler, and Tom Badgett, The Art of Software Testing, John Wiley & Sons, Hoboken, N.J., 3rd edn., 2011.
[17] Mor Peleg, Samson Tu, Jonathan Bury, Paolo Ciccarese, John Fox, Robert A. Greenes, Silvia Miksch, Silvana Quaglini, Andreas Seyfang, Edward H. Shortliffe, Mario Stefanelli, et al., ‘Comparing computer-interpretable guideline models: A case-study approach’, JAMIA, 10, (2003).
[18] A. Rajan, ‘Coverage metrics to measure adequacy of black-box test suites’, in Automated Software Engineering, 2006. ASE ’06. 21st IEEE/ACM International Conference on, pp. 335–338, (September 2006).
[19] Sandra Rapps and Elaine J. Weyuker, ‘Selecting software test data using data flow information’, IEEE Trans. Softw. Eng., 11(4), 367–375, (April 1985).
[20] S.C. Reid, ‘An empirical analysis of equivalence partitioning, boundary value analysis and random testing’, in Software Metrics Symposium, 1997. Proceedings., Fourth International, pp. 64–73, (November 1997).
[21] Richard Wettel and Michele Lanza, ‘Visualizing software systems as cities’, in VISSOFT, eds., Jonathan I. Maletic, Alexandru Telea, and Andrian Marcus, pp. 92–99. IEEE Computer Society, (2007).
[22] Richard Wettel and Michele Lanza, ‘CodeCity: 3D visualization of large-scale software’, in ICSE Companion, eds., Wilhelm Schäfer, Matthew B. Dwyer, and Volker Gruhn, pp. 921–922. ACM, (2008).