<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CoverageCity: Test Coverage for Clinical Guidelines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Reinhard Hatko</string-name>
          <email>hatko@informatik.uni-</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joachim Baumeister</string-name>
          <email>joachim.baumeister@denkbares.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Puppe</string-name>
          <email>puppe@informatik.uni-</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>In this paper, we introduce several metrics for the test coverage of clinical guidelines modeled in the graphical language DiaFlux. Additionally, an intuitive visualization method supports the process of test creation and communicates the reached coverage levels to the medical experts involved in authoring the guideline. The goal is to reach a test coverage sufficiently high to assure patient safety under all circumstances.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Testing is an important step in the development of any software
artifact, be it a program, a knowledge-based system, or a clinical
guideline. The two most prevalent testing strategies in Software
Engineering are black-box and white-box testing. The former approach is
unconcerned with the actual implementation and derives the test cases
solely from the underlying specification. The latter, in contrast,
allows tests to be created based on the implementation and examines it
during the execution of the tests. This introspection makes it possible
to capture which basic elements of the tested artifact were executed
(and thus covered) by a test suite. Different test coverage metrics were
developed to objectively measure and assess the thoroughness of such
testing efforts. In classic Software Engineering, metrics have been
defined to assess the coverage of, e.g., the methods of a program,
the statements of a method, the decisions taken in the control flow, and so on [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>Hence, the benefit of coverage metrics - and their proper
visualization - is two-fold, increasing the effectiveness and efficiency of the
test creation process: First, they help to avoid the creation of
redundant tests. Second, they can be used to identify untested elements.</p>
      <p>Both are also important aspects in the area of knowledge-based
systems in general and computerized clinical guidelines in particular.
The creation of test cases for a clinical guideline will most likely
involve both parties, the medical expert and the knowledge engineer.
It is thus an expensive task, which should be completed efficiently.
The overall goal, however, is to create a guideline that assures patient
safety under all circumstances, which can best be guaranteed by a
thorough test suite. This is especially important in the area of
automated guidelines, which are applied by closed-loop devices. These
act autonomously on a patient to improve her state, without requiring
constant supervision by a clinical user.</p>
      <p>
        For the interpretation of coverage metrics, a visualization is
helpful, especially for finding untested elements. Test coverage of software
programs is usually visualized by some kind of syntax highlighting,
e.g., by coloring the executed statements, although graphical
representations have also been developed, e.g., [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We adopted a
visualization method from a related area of (object-oriented)
Software Engineering, namely software metrics. These are used to
assess code quality with respect to structural properties of classes, e.g.,
the number of methods, the number of members, lines of code, and so forth.
Such metrics are purely static and do not involve executing the
program itself. They can be visualized graphically as a CodeCity [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]
to determine design flaws.
      </p>
      <p>In this paper, we introduce various metrics to determine the test
coverage of clinical guidelines, modeled in the graphical language
DiaFlux. Furthermore, we adapted the city metaphor by creating a
CoverageCity for communicating the reached coverage levels to the
involved medical experts in an accessible manner.</p>
      <p>The rest of this paper is structured as follows: In the next section
we give a short introduction into the graphical language DiaFlux for
clinical guidelines. Section 3 presents coverage metrics for DiaFlux
models. Then, in Section 4, we present the results of a case
study. Finally, we conclude the paper with a summary and an
outlook.</p>
    </sec>
    <sec id="sec-2">
      <title>The DiaFlux Guideline Language</title>
      <p>Clinical guidelines are an accepted means to improve patient
outcome, as they offer a standardized treatment based on
evidence-based medicine. They have been developed for several decades.
In their beginnings, they were solely text-based documents that
relied on proper application by clinicians. The ongoing
computerization and data availability, also in domains with high-frequency
data such as Intensive Care Units (ICUs), allow for an automation
of guideline application by medical devices.</p>
      <p>
        Several formalisms for Computer-Interpretable Guidelines have been
developed, each with its own focus, such as shareability between
institutions [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or decision support. Most of them are graphical
approaches that employ a kind of Task-Network-Model to express the
guideline steps [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. However, in the area of (semi-)closed-loop
devices, rule-based approaches are predominant, e.g., [
        <xref ref-type="bibr" rid="ref12 ref15">12, 15</xref>
        ].
      </p>
      <p>
        A downside of rule-based representations is their lower
comprehensibility compared to graphical ones. This holds especially true, as
medical experts are usually involved in the creation of
guidelines. Therefore, we have developed a graphical guideline formalism
called DiaFlux [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Its main focus lies on direct applicability and
understandability by domain specialists.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Application Scenario</title>
      <p>The main application area of DiaFlux is mixed-initiative devices
that continuously monitor, diagnose, and treat a patient in the setting
of an ICU. Such closed-loop systems interact with the clinical user
during the process of care. Both the clinician and the device are able
to initiate actions on the patient. Data is continuously available as a
result of the monitoring task. It allows for repeated reasoning about
the patient’s state, and for carrying out appropriate actions to improve her
condition, if necessary.</p>
    </sec>
    <sec id="sec-4">
      <title>Language Description</title>
      <p>
        To specify a clinical guideline, two different types of knowledge
have to be effectively combined, namely declarative and
procedural knowledge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The declarative part contains the facts of a given
domain, i.e., findings, diagnoses, treatments, and their interrelations.
The knowledge of how to perform a task, i.e., the correct sequence
of actions, is expressed in the procedural knowledge. It is responsible
for deciding which action to perform next in a given situation, e.g.,
asking a question or carrying out a test. Each of these actions has a
cost (monetary or associated risk) and a benefit (like information gain
or therapeutic effect) associated with it. Therefore, the choice of an
appropriate sequence of actions is mandatory for efficient diagnosis
and treatment.
      </p>
      <p>
        In DiaFlux models, the declarative knowledge is represented by
a domain-specific ontology, which contains the definition of
findings and solutions. This application ontology is an extension of the
task ontology of diagnostic problem solving [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The ontology is
strongly formalized and provides the necessary semantics for
executing the guideline. Like most graphical guideline languages, DiaFlux
employs flowcharts as the Task-Network-Model. They describe
decisions, actions and constraints about their ordering in a guideline plan.
These flowcharts consist of nodes and connecting edges. Nodes
represent different kinds of actions. Edges connect nodes to create paths
of possible actions. Edges can be guarded by conditions that evaluate
the state of the current patient session and thus guide the course of
the care process. Figure 1 shows a module of an exemplary guideline
for diagnosing weight problems.
      </p>
      <p>In the following, we informally describe the most important
language elements:</p>
      <p>Test node: Test nodes represent an action for the acquisition of
data during runtime. This may trigger a question the user has to
answer, or data may be obtained automatically from sensors or a
database.</p>
      <p>Solution node: Solution nodes are used to set the rating of a
solution based on the given inputs. Established solutions generate
messages for the clinical user and can, e.g., advise him to conduct
some action.</p>
      <p>Wait node: Upon reaching a wait node, the execution of the
protocol is suspended until the given period of time has elapsed.</p>
      <p>Composed node: DiaFlux models can be hierarchically
structured. Defined models can be reused as modules, represented by a
composed node in the flowchart using them.</p>
      <p>Abstraction node: Abstraction nodes offer the possibility to
create abstractions from the available data. These values can then be used
for therapeutic actions by influencing the settings of the host
device.</p>
    </sec>
    <sec id="sec-5">
      <title>Guideline Execution</title>
      <p>
        The execution engine for DiaFlux models is intended for, but not
limited to, closed-loop devices that provide data from sensors or
manually entered by the clinical user, and that carry out therapeutic actions
on the patient, i.e., changing device settings within certain ranges. The
architecture of the DiaFlux guideline execution engine consists of
three components: first, a knowledge base that contains the
application ontology and the flowcharts; second, a blackboard that stores
all findings about the current patient session; third, a reasoning
engine that executes the guideline and carries out its steps, depending
on the current state as given by the contents of the blackboard.
To this end, the reasoning engine is notified about all findings that enter the
blackboard. A more thorough introduction to the DiaFlux language
and its execution engine is given in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The execution of the guideline is time-driven. The reasoning starts
by acquiring data and interpreting it. The results are
written to the blackboard. Then, the guideline is executed. This involves
making decisions and possibly generating hints to the user and
therapeutic actions by the device. Finally, the time of the next
execution is scheduled, pausing the execution until that instant in time and
waiting for the effects of the therapeutic interventions to take place.</p>
    </sec>
    <sec id="sec-6">
      <title>Test Coverage of Clinical Guidelines</title>
      <p>
        Verification and validation of a clinical guideline are important steps
in its development. In this way, patient safety shall be assured under
all circumstances. Verification usually consists of formal methods
proving that a given guideline is free of internal inconsistencies and
incompleteness. Normally, these kinds of checks can be performed
without executing the guideline. An overview of verification
methods applied to clinical guidelines is given in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Anomaly detection
for DiaFlux models is described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In contrast, the validation of
a guideline usually involves its execution with a set of test cases (i.e.,
a test suite) and a comparison of the actual results against the expected
ones [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The thoroughness of such empirical testing can be
determined by different test coverage metrics.
      </p>
      <p>
        In conventional software engineering (SE), test coverage (also
known as code coverage) is a well-established technique to measure
how well a piece of software is exercised by a test suite. Often, the
reached level of coverage is also used as an indicator for the quality
of the tested program, as tested elements are less likely to contain
errors than untested ones. Coverage criteria have been defined on
different levels of granularity, from the method level down to single
statements, or even parts of them (so-called condition coverage) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
In the field of AI research, coverage measures have been proposed
for rule-based systems. In this case, the results of such a coverage
analysis can, e.g., be used to prune the rule base [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For graphical
model representations, coverage measures have been proposed, e.g.,
for business processes [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], taking their specifics into account, for
example, the coverage of fault handlers.
      </p>
      <p>In general, employing coverage metrics during the creation of a
test suite may help to improve it in terms of minimality and
completeness. While a high coverage of the object under test is
worthwhile, it should be accomplished with the smallest possible set of
test cases, as test creation is a difficult and costly task. This
holds especially true for the test creation of clinical guidelines, as it may
involve knowledge engineers as well as domain specialists like
medical experts. Therefore, adopting the aforementioned techniques for clinical
guidelines and their graphical representations offers the possibility
to improve this process.</p>
      <p>In the following, we introduce coverage metrics on different
levels of granularity to assess the test coverage of a DiaFlux guideline.
Such a guideline usually is modularized into self-contained modules,
i.e., flowcharts. To reflect this modularization also in the
coverage metrics, we define most of them over the elements of individual
flowcharts, hence the restriction of nodes and edges to those of a
single one. This focus facilitates increasing the coverage by
additional tests, as deficiencies are easier to spot.</p>
    </sec>
    <sec id="sec-6a">
      <title>Flowchart Coverage</title>
      <p>Flowchart Coverage is defined as the ratio of the number of
flowcharts that are executed by a test suite to the overall number
of flowcharts in the guideline.</p>
      <p>Let F be the set of DiaFlux models in guideline G, and F_T^ex be
the set of flowcharts executed by test suite T. Then, the Flowchart
Coverage FC_T of G is given by:</p>
      <p>FC_T(G) = |F_T^ex| / |F|</p>
      <p>This metric only gives a bird’s-eye view of the testing situation. It can
be used to guarantee at least some testing in all areas of the
guideline during the starting phase of test creation. As it is a very
coarse-grained measurement, an FC_T value of 100% should be aimed at;
otherwise, major parts of the guideline remain untested. In SE, the
equivalent metric is function coverage, which reports whether every
function of a program has been tested.</p>
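      <p>As an illustration of the formula above (not the actual KnowWE plugin code; the function name and arguments are our own):</p>

```python
def flowchart_coverage(all_flowcharts, executed_flowcharts):
    """FC_T(G) = |F_T^ex| / |F| for guideline G and test suite T."""
    f = set(all_flowcharts)
    if not f:
        raise ValueError("guideline contains no flowcharts")
    f_ex = set(executed_flowcharts).intersection(f)
    return len(f_ex) / len(f)
```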
    </sec>
    <sec id="sec-7">
      <title>Node Coverage</title>
      <p>Nodes represent the elementary steps of a guideline. A node being
covered by a test suite means that its associated guideline step has
been carried out at least once during the execution of the test suite.</p>
      <p>Let N_f, f ∈ F, be the set of nodes of the flowchart f, and
N_f^ex,T the set of nodes of the flowchart f that are
executed by test suite T. The according Node Coverage metric NC_T of
a flowchart f for a given test suite T can then be calculated as:</p>
      <p>NC_T(f) = |N_f^ex,T| / |N_f|, f ∈ F</p>
      <p>Similar to Flowchart Coverage, an NC_T level of 100% should be reached
for every flowchart in the guideline, as untested nodes can enact
actions with unforeseen effects. This metric is equivalent to Statement
Coverage in classic SE, which reports whether every statement of a tested
function has been executed.</p>
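      <p>Node Coverage can be computed per flowchart from an execution trace, sketched here with our own (hypothetical) data layout: a mapping from flowchart names to node sets, and a trace of (flowchart, node) pairs recorded while running the test suite:</p>

```python
def node_coverage(flowchart_nodes, trace):
    """NC_T(f) = |N_f^ex,T| / |N_f| for every flowchart f."""
    executed = {}
    for flowchart, node in trace:
        executed.setdefault(flowchart, set()).add(node)
    coverage = {}
    for flowchart, nodes in flowchart_nodes.items():
        hit = executed.get(flowchart, set()).intersection(nodes)
        coverage[flowchart] = len(hit) / len(nodes)
    return coverage
```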
    </sec>
    <sec id="sec-8">
      <title>Edge Coverage</title>
      <p>Edges are used to create the control flow of a flowchart by defining
paths of possible sequences of guideline steps. Every node can have
several outgoing edges. Each of these edges can be guarded by a
condition to select the appropriate successor node, depending on the
outcome of each guideline step. Therefore, the Edge Coverage metric
reports whether all possible outcomes of the guideline steps, in terms of
the equivalence classes that are defined by the edge guards, have
been considered within the test suite.</p>
      <p>Let E_f, f ∈ F, be the set of edges of flowchart f, and E_f^ex,T
the set of edges of the flowchart f that are executed by test
suite T. Then, EC_T is defined as the ratio of the number of activated edges of a
flowchart f to their overall number, when executing a given test suite T:</p>
      <p>EC_T(f) = |E_f^ex,T| / |E_f|, f ∈ F</p>
      <p>As edges connect nodes, Edge Coverage subsumes the Node
Coverage of the according flowchart. This metric can be compared to
Decision Coverage in SE, which keeps track of whether each decision in a
program under test (e.g., if- and switch-statements) has at least once
been taken and once not.</p>
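      <p>Analogously to Node Coverage, Edge Coverage can be sketched over sets of (source, target) pairs; again an illustration with our own data layout, not engine code:</p>

```python
def edge_coverage(flowchart_edges, fired_edges):
    """EC_T(f) = |E_f^ex,T| / |E_f| per flowchart.

    flowchart_edges maps flowchart names to sets of (source, target)
    edges; fired_edges maps flowchart names to the edges whose guard
    was satisfied at least once during the test suite."""
    coverage = {}
    for flowchart, edges in flowchart_edges.items():
        fired = fired_edges.get(flowchart, set()).intersection(edges)
        coverage[flowchart] = len(fired) / len(edges)
    return coverage
```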
    </sec>
    <sec id="sec-9">
      <title>Condition Coverage</title>
      <p>An edge guard may not be an atomic condition, but may consist of
several sub-conditions connected by boolean operators. For such
non-atomic guards, Edge Coverage gives no detailed information about
which of the sub-conditions were satisfied and which were not. This is
of special interest if the sub-conditions are joined by an OR-operator.
In this case, every possible combination of atomic conditions that
can fulfill the overall condition has to be tested.</p>
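      <p>A simple form of this idea, checking that each atomic sub-condition was satisfied at least once and unsatisfied at least once, can be sketched as follows. This is a simplified illustration in the spirit of, but weaker than, MC/DC; the data layout is our own:</p>

```python
def condition_coverage(observations):
    """For each labeled atomic sub-condition, report whether the test
    suite made it both true and false at least once.

    observations: one dict per guard evaluation, mapping a condition
    label to the boolean value it took in that evaluation."""
    seen = {}
    for obs in observations:
        for label, value in obs.items():
            seen.setdefault(label, set()).add(value)
    return {label: values == {True, False} for label, values in seen.items()}
```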
      <p>
        A more detailed view about this issue is given by Condition
Coverage. It checks, if every atomic condition has once been satisfied
and once not. In classical SE, several different metrics for this issue
have been developed, e.g. Modified Condition / Decision Coverage
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As those can directly be applied to the guarding conditions of
edges, we will not further elaborate on this issue.
      </p>
    </sec>
    <sec id="sec-9a">
      <title>Path Coverage</title>
      <p>
        A path through the guideline consists of consecutive nodes and
edges. Such a path can be seen as the execution of decisions and
actions for a given clinical scenario. In Software Engineering, it
usually is not possible to reach full path coverage as soon as loops
are involved, as each number of iterations results in an additional
path. In clinical guidelines, there are no loops with an unlimited
number of iterations, as, for instance, some time has to pass until an
action can be repeated. Nevertheless, the number of paths through
the complete guideline across multiple nested DiaFlux models
most likely exceeds the possibilities of test creation.
modularization, each flowchart is responsible for a specific aspect of
a guideline. Each path through such a single flowchart can be seen
as one specific scenario concerning this aspect. Therefore, we assess
each flowchart independently, and define an according Path
Coverage metric over the paths of each individual flowchart.
      </p>
      <p>Let P_f, f ∈ F, be the set of paths through flowchart f, and
P_f^ex,T the set of paths through flowchart f that are
executed by test suite T. Then, PC_T is calculated as the number of
paths taken through flowchart f by the execution of test suite T,
divided by the total number of paths:</p>
      <p>PC_T(f) = |P_f^ex,T| / |P_f|, f ∈ F</p>
      <p>Even with a proper modularization given, not every modeled path
may be enactable, due to implicit dependencies between the
guideline steps. If combinations of decisions and actions exist on a single
path that cannot occur, the targeted value of Path Coverage has
to be decreased accordingly.</p>
      <p>As a path consists of consecutive nodes and edges, Path Coverage
satisfies Node Coverage as well as Edge Coverage.</p>
      <p>Path Coverage, as defined above, is a rather aggregated
measurement and thus gives little advice on how to improve coverage with
further tests. Therefore, Path Coverage can also be restricted to the
paths through a specific node n ∈ N:</p>
      <p>Let P_n be the set of paths containing n (∀p ∈ P_n: n ∈ p),
and P_n^ex,T the set of paths containing n exercised by test suite T.
Accordingly, the Path Coverage of node n is defined as:</p>
      <p>PC_T(n) = |P_n^ex,T| / |P_n|</p>
      <p>A node with a low Path Coverage is only tested under a small
fraction of the contexts in which it is contained. Again, further tests
should be created for those scenarios, unless they expose
dependencies that make not every path feasible.</p>
    </sec>
    <sec id="sec-10">
      <title>Value Coverage</title>
      <p>The metrics presented so far assess the test coverage with respect
to structural properties, each considering some kind of modeled
element, like nodes and edges. However, the actual input data that
directs the execution of the guideline is not assessed by these metrics
in any way.</p>
      <p>
        Beside the mentioned control-flow-based metrics, a second
perspective on coverage in classic Software Engineering is given by
data-flow-based metrics. These measure the coverage of
definition-use (du) sequences of variables, i.e., blocks of instructions in which
a variable is defined and subsequently used without a redefinition of
the variable, e.g., [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Black-box testing strategies concerned with
data usage are Equivalence Partitioning and Boundary Value
Analysis [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. As exhaustive testing with all valid input data is most
likely not tractable even for a small program, its specification can
be used to partition the input space into equivalence classes. Under
the assumption that each value of a partition is treated equally by
the program, an arbitrary representative of each class can be chosen
for a test case. As errors are more probable at the boundaries of an
equivalence class, Boundary Value Analysis is often used to derive
additional test cases at these spots [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], for example to find
“off-by-one” errors (e.g., resulting from the use of the operator “≤” rather
than “&lt;” in a numerical comparison).
      </p>
      <p>DiaFlux models do not contain variables as they are common in
procedural programming languages, and the input data is not
modified during guideline execution. Therefore, a definition-use
analysis is not applicable. Equivalence Partitioning and Boundary Value
Analysis can also not be used as they are: explicit equivalence classes
usually cannot be stated for the inputs. Even thresholds for less
precise assessments (like “low”, “normal”, “high”) are often hard to
specify for a medical expert. Furthermore, they can vary between
different types of patients which, e.g., share an underlying disease.
However, for each numerical input, a contiguous interval of possible
values can typically be given, according to the human body’s
physiology and/or the preconditions of guideline applicability.
To assure a proper coverage of the allowed input data regardless of
concepts like equivalence classes, we define the metric Input Coverage:</p>
      <p>Let the interval [min_i, max_i] be the domain D_i of numerical input
i, and let n ∈ ℕ, n &gt; 0, be the number of equally-sized partitions of D_i.
The function cover(i, j) returns 1 iff the j-th partition of D_i
contains at least one input value in test suite T, and 0 otherwise. Then,
the Input Coverage of i is given by:</p>
      <p>IC_n,T(i) = (Σ_j=1..n cover(i, j)) / n</p>
      <p>Clearly, the significance of Input Coverage depends on the actual
value of n. It should be chosen to appropriately represent the
sensitivity of the guideline’s outcome to changes in the values of i. At
later stages of test creation, the value can be increased stepwise to
test i’s input values at a finer granularity.</p>
      <p>Similarly, the output of the guideline (which mainly consists of
numerical values of the host device’s settings in predefined ranges) can
be assessed by the analogously defined metric Output Coverage OC_n,T.</p>
    </sec>
    <sec id="sec-11">
      <title>Measuring Test Coverage</title>
      <p>Commonly, there are two strategies to gather the data needed for
calculating test coverage metrics. The first one is called instrumentation;
it modifies the tested piece of software by including new code
that collects the necessary information. The second strategy is
tracing, which traces the executed elements by using some sort of
debugging API (Application Programming Interface) of the execution
environment. Clearly, both approaches have an effect on the execution
time of the tests, as additional data has to be gathered. An advantage
of tracing is that it does not alter the executed artifact; such alterations
may, under certain circumstances, also influence the artifact’s behavior.
Under this aspect, tracing seems preferable if the necessary API is
available.</p>
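      <p>A tracing listener can be as simple as the following sketch: the execution environment invokes the hooks through its debugging API, so the guideline itself stays unmodified. The hook names are our own invention, not a real KnowWE API:</p>

```python
class CoverageTracer:
    """Records executed nodes and activated edges via callbacks from
    the execution environment (tracing, not instrumentation)."""

    def __init__(self):
        self.nodes = set()
        self.edges = set()

    def node_entered(self, flowchart, node):
        self.nodes.add((flowchart, node))

    def edge_taken(self, flowchart, source, target):
        self.edges.add((flowchart, source, target))
```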
    </sec>
    <sec id="sec-12">
      <title>Visualization of Test Coverage</title>
      <p>
        The calculated metrics result in numerical values representing the
test coverage of the exercised artifact. These may very well be
comprehensible for software and knowledge engineers; for
non-technical domain specialists, like medical experts, however, the bare
numbers may not be accessible enough. Furthermore, only a proper
composition of metrics yields a meaningful overall picture, as each metric
represents a different aspect of coverage. Thus, an intuitive
visualization seems preferable as a means for communicating the reached coverage
levels to domain specialists. One approach to this need
are the so-called “Polymetric Views” [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. They allow various
metrics to be displayed in an aggregated view.
      </p>
    </sec>
    <sec id="sec-13">
      <title>Test Coverage for DiaFlux Guidelines</title>
      <p>This section describes an implementation of the coverage metrics for
DiaFlux guidelines and a small case study.</p>
    </sec>
    <sec id="sec-14">
      <title>Implementation</title>
      <p>The development environment for DiaFlux guidelines is integrated
into the Semantic Wiki KnowWE. We created a plugin that calculates
the test coverage of DiaFlux models when executing a test suite. It
employs the tracing approach to collect the information about the
exercised elements of the guideline. For each execution of the guideline,
the path chosen according to the input data is recorded. After
finishing the test suite, the metrics are calculated and can subsequently be
visualized as a CoverageCity, which can freely be scaled and rotated
(cf. Figure 2). For creating the visualization, we used the WebGL-based
(http://webgl.org) JavaScript library SceneJS (http://scenejs.org). It is
hence accessible with every (modern) web browser from within KnowWE,
not requiring any proprietary software.</p>
    </sec>
    <sec id="sec-15">
      <title>The CoverageCity Visualization</title>
      <p>
        For an accessible visualization of the reached coverage levels, we use
the metaphor of a city. It has been introduced as a graphical
representation of static code metrics (e.g., number of methods) in the
context of reverse engineering of software systems (“CodeCity”) [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
Such a city consists of districts that represent the nesting structure of
packages and buildings representing classes. Each building is located
in the district corresponding to its package. The actual values of the
metrics of each class determine the visual appearance of the
matching building. A building can represent up to four different metrics,
influencing its length, width, height, and color. Besides the package
structure, districts can depict one metric by their color.
      </p>
      <p>We adapted the city metaphor for the visual representation of
test coverage of DiaFlux guidelines. Districts stand for individual
flowcharts. They are nested as given by their call hierarchy. Buildings
correspond to the nodes of the district’s corresponding flowchart.
Edges are not explicitly included but mapped to one of the
buildings’ dimensions. In particular, we used the following assignment of
metrics to visual properties of buildings:</p>
      <p>Length: The number of outgoing edges of a node determines the
length of the building.</p>
      <p>Width: The width of a building relates to the number of outgoing
edges that are covered by the test suite.</p>
      <p>Height: The height of a building shows how often the node was covered
by the test suite.</p>
      <p>Color: The color represents how well the paths that go through a
particular node are covered by the test suite. In case a building is
“red”, only a few paths are executed; “green” stands for a high Path
Coverage.</p>
    </sec>
    <sec id="sec-16">
      <title>Interpretation of the Visualization</title>
      <p>The complexity that a node introduces into a guideline mostly comes
from its number of outgoing edges. In case the associated action has
numerous possible outcomes (each represented by an outgoing edge),
it is more likely to contain errors. Thus, one dimension of the base
area of a building is influenced by the number of outgoing edges of
the corresponding node. The increased size makes the building easier
to spot. The other dimension of the base area grows with the
number of covered outgoing edges. As a result, a non-quadratic building
stands for a node whose outgoing edges are not completely covered.
The lower the aspect ratio, the more uncovered edges exist. This
deficiency in test coverage can easily be observed.</p>
      <p>The color of a building depicts its Path Coverage. A red one
represents a node that is contained in many more paths than were
covered. With increasing Path Coverage, the color becomes green. The
height of a building corresponds to the number of times it has been
exercised by the test suite. Clearly, these two properties are not
independent. On the one hand, a small building is more likely to be
contained in only a small number of paths, and thus to be red. On the
other hand, a building that is tall and red implies that the
corresponding node is often tested under similar circumstances. This can give
hints for creating new test cases that are not contained so far. Tall,
green buildings have been thoroughly exercised and do not require
additional test cases.</p>
      <p>The nesting level of the districts helps in estimating the necessary
effort for testing a specific flowchart or node. Deeply nested modules
are probably harder to reach by a test case, as each module may only
be called under certain preconditions.</p>
      <p>The aggregated view of CoverageCity makes it easy to spot
deficiencies in test coverage quickly. However, a more detailed view is
necessary to identify the specific nodes and their test deficiencies.
Hence, each building can be selected by the user. The
corresponding flowchart is then shown below the CoverageCity view, and the
node and its coverage are highlighted visually.</p>
    </sec>
    <sec id="sec-17">
      <title>Case Study</title>
      <p>
        Currently, we are involved in the implementation of a computerized
guideline for automated mechanical ventilation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The guideline is
intended to run on a mechanical ventilator, and is able to derive new
ventilatory settings in order to improve the ventilation of the patient.
First testing efforts for the guideline were conducted using a
physiological simulation: the guideline was run against a software tool that
simulates a mechanically ventilated patient. The tool employs a
physiological lung model to determine the effects of the current ventilation
settings on the patient. The simulation delivers the
necessary data (ventilation settings and measured patient response) to the
guideline execution engine. Based on this data, the guideline derives
optimized settings and returns them to the simulation environment,
which uses them to continue the simulation. The simulation tool was
used by medical experts to generate the test cases and to review the
derived ventilation settings. The generated test cases are saved to a file
and then uploaded to KnowWE for the introspection of guideline
execution and coverage analysis.
      </p>
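      <p>The closed loop between simulator and guideline engine could be sketched as follows; the class and method names, the toy lung model, and the adjustment rule are all illustrative stand-ins, not the real simulation tool or the DiaFlux engine:</p>

```python
# Hypothetical sketch of the closed testing loop described above. The
# class and method names, the lung model, and the adjustment rule are
# all illustrative stand-ins, not the real simulator or DiaFlux engine.

class PatientSimulator:
    """Toy lung model: more pressure support lowers the simulated CO2."""

    def __init__(self, etco2=55.0):
        self.etco2 = etco2

    def step(self, settings):
        """Apply the ventilation settings and return measurements."""
        self.etco2 -= 0.5 * settings["pressure_support"]
        return {"pressure_support": settings["pressure_support"],
                "etCO2": self.etco2}

def guideline_step(measurements):
    """Stand-in for the guideline engine: raise support while CO2 is high."""
    ps = measurements["pressure_support"]
    if measurements["etCO2"] > 40.0:
        ps += 1
    return {"pressure_support": ps}

# Run the loop and record each exchange as a candidate test case,
# which could then be saved to a file and uploaded for coverage analysis.
simulator = PatientSimulator()
settings = {"pressure_support": 5}
test_cases = []
for _ in range(10):
    measurements = simulator.step(settings)
    settings = guideline_step(measurements)
    test_cases.append((measurements, settings))
```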
      <p>We selected a sample of ten generated test cases. The
visualization of the acquired coverage levels is shown in Figure 2. We
are currently evaluating our visualization with medical
experts. Furthermore, we are identifying other meaningful assignments of
metrics to visual properties of the buildings.</p>
    </sec>
    <sec id="sec-18">
      <title>Conclusion</title>
      <p>In this paper, we formally defined different coverage metrics to
assess the thoroughness of testing efforts for clinical guidelines. They
can be used to identify insufficiently tested elements and to make
the process of test creation more efficient, as this process may involve
domain specialists as well as knowledge engineers. An intuitive
visualization method helps in communicating the acquired coverage
levels to domain specialists, for whom numerically expressed
metrics are probably less helpful than for knowledge engineers.</p>
      <p>Additional metrics could be defined over more dynamic aspects of
a guideline. First, the distribution of values could be tracked for each
activated flowchart element. As there clearly are dependencies
between the actual values and the possible ones - given by the context
(i.e., path) of the element - proper preprocessing would be necessary.
Ultimately, this could give insight into whether parts of the guideline
were only tested, e.g., for a certain patient type. Second, it would be
helpful to define scenarios with respect to the occurrence of certain
sequences of input data or therapeutic actions over time, and to trace
their coverage by the test suite. In terms of the CoverageCity
visualization, we will evaluate different mappings of the coverage metrics to
the visual properties of the city, to create new perspectives on test coverage.</p>
      <p>
        One shortcoming of white-box testing in general is that it is
unable to detect errors of omission, i.e., some requirement may not have
been included in the implementation under test. An approach to find
this type of error is Requirements-based Test Coverage [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. It
defines coverage with respect to requirements that are defined
independently of the implementation and exercised by a test suite. Given
formally defined requirements, this approach should be transferable to
testing clinical guidelines.
      </p>
    </sec>
    <sec id="sec-19">
      <title>Acknowledgements</title>
      <p>The University of Würzburg is funded by the German Federal
Ministry for Education and Research under the project
“WiMVent” (Knowledge- and Model-based Ventilation), grant number
01IB10002E.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Valerie</given-names>
            <surname>Barr</surname>
          </string-name>
          , '
          <article-title>Applications of rule-base coverage measures to expert system evaluation'</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          ,
          <volume>12</volume>
          ,
          <fpage>27</fpage>
          -
          <lpage>35</lpage>
          , (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Joachim</given-names>
            <surname>Baumeister</surname>
          </string-name>
          , '
          <article-title>Advanced empirical testing'</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          ,
          <volume>24</volume>
          (
          <issue>1</issue>
          ),
          <fpage>83</fpage>
          -
          <lpage>94</lpage>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Joachim</given-names>
            <surname>Baumeister</surname>
          </string-name>
          , Jochen Reutelshoefer, and Frank Puppe, '
          <article-title>KnowWE: A semantic wiki for knowledge engineering'</article-title>
          ,
          <source>Applied Intelligence</source>
          ,
          <volume>35</volume>
          (
          <issue>3</issue>
          ),
          <fpage>323</fpage>
          -
          <lpage>344</lpage>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Aziz A.</given-names>
            <surname>Boxwala</surname>
          </string-name>
          , Mor Peleg, Samson Tu, Omolola Ogunyemi, Qing T. Zeng, Dongwen Wang,
          <string-name>
            <given-names>Vimla L.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Robert A.</given-names>
            <surname>Greenes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Edward H.</given-names>
            <surname>Shortliffe</surname>
          </string-name>
          , '
          <article-title>GLIF3: a representation format for sharable computer-interpretable clinical practice guidelines'</article-title>
          ,
          <source>J. of Biomedical Informatics</source>
          ,
          <volume>37</volume>
          (
          <issue>3</issue>
          ),
          <fpage>147</fpage>
          -
          <lpage>161</lpage>
          , (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.J.</given-names>
            <surname>Chilenski</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.P.</given-names>
            <surname>Miller</surname>
          </string-name>
          , '
          <article-title>Applicability of modified condition/decision coverage to software testing'</article-title>
          ,
          <source>Software Engineering Journal</source>
          ,
          <volume>9</volume>
          (
          <issue>5</issue>
          ),
          <fpage>193</fpage>
          -
          <lpage>200</lpage>
          , (
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] Paul de Clercq, Katharina Kaiser, and Arie Hasman, '
          <article-title>Computer-interpretable guideline formalisms'</article-title>
          , in
          <source>Computer-based Medical Guidelines and Protocols: A Primer and Current Trends</source>
          , eds., Annette ten Teije, Silvia Miksch, and Peter Lucas,
          <fpage>22</fpage>
          -
          <lpage>43</lpage>
          , IOS Press, Amsterdam, The Netherlands, (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Reinhard</given-names>
            <surname>Hatko</surname>
          </string-name>
          , Joachim Baumeister, Volker Belli, and Frank Puppe, '
          <article-title>DiaFlux: A graphical language for computer-interpretable guidelines'</article-title>
          , in
          <source>Knowledge Representation for Health-Care</source>
          , eds., David Riaño, Annette ten Teije, and Silvia Miksch, volume
          <volume>6924</volume>
          of Lecture Notes in Computer Science,
          <fpage>94</fpage>
          -
          <lpage>107</lpage>
          , Springer, Berlin / Heidelberg, (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Reinhard</given-names>
            <surname>Hatko</surname>
          </string-name>
          , Joachim Baumeister, Gritje Meinke, Stefan Mersmann, and Frank Puppe, '
          <article-title>Anomaly detection in DiaFlux models'</article-title>
          , in
          <source>KESE7: 7th Workshop on Knowledge Engineering and Software Engineering</source>
          , San Cristobal de La Laguna, Tenerife, Spain, November 10, 2011, volume
          <volume>805</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org, (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Reinhard</given-names>
            <surname>Hatko</surname>
          </string-name>
          , Dirk Schädler, Stefan Mersmann, Joachim Baumeister, Norbert Weiler, and Frank Puppe, '
          <article-title>Implementing an automated ventilation guideline using the semantic wiki KnowWE'</article-title>
          , in
          <source>EKAW 2012: 18th International Conference on Knowledge Engineering and Knowledge Management</source>
          , eds.,
          <string-name>
            <given-names>Heiner</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          , Annette ten Teije, and Johanna Voelker, (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Arjen</given-names>
            <surname>Hommersom</surname>
          </string-name>
          , Perry Groot,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Balser</surname>
          </string-name>
          , and Peter Lucas, '
          <article-title>Formal methods for verification of clinical practice guidelines'</article-title>
          , in
          <source>Computer-based Medical Guidelines and Protocols: A Primer and Current Trends</source>
          , eds., Annette ten Teije, Silvia Miksch, and Peter Lucas
          ,
          <fpage>63</fpage>
          -
          <lpage>80</lpage>
          , IOS Press, Amsterdam, The Netherlands, (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.A.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.J.</given-names>
            <surname>Harrold</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Stasko</surname>
          </string-name>
          , '
          <article-title>Visualization of test information to assist fault localization'</article-title>
          ,
          <source>in Proceedings of the 24th international conference on Software engineering</source>
          , pp.
          <fpage>467</fpage>
          -
          <lpage>477</lpage>
          . ACM, (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.F.</given-names>
            <surname>Kwok</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.A.</given-names>
            <surname>Linkens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahfouf</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.H.</given-names>
            <surname>Mills</surname>
          </string-name>
          ,
          <article-title>'Rule-base derivation for intensive care ventilator control using ANFIS'</article-title>
          ,
          <source>Artificial Intelligence in Medicine</source>
          ,
          <volume>29</volume>
          (
          <issue>3</issue>
          ),
          <fpage>185</fpage>
          -
          <lpage>201</lpage>
          , (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Michele</given-names>
            <surname>Lanza</surname>
          </string-name>
          and Stéphane Ducasse, '
          <article-title>Polymetric views - a lightweight visual approach to reverse engineering'</article-title>
          ,
          <source>IEEE Trans. Software Eng.</source>
          ,
          <volume>29</volume>
          (
          <issue>9</issue>
          ),
          <fpage>782</fpage>
          -
          <lpage>795</lpage>
          , (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] Daniel Lübke, Leif Singer, and Alex Salnikow, '
          <article-title>Calculating BPEL test coverage through instrumentation'</article-title>
          , in AST, eds.,
          <string-name>
            <given-names>Dimitris</given-names>
            <surname>Dranidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stephen P.</given-names>
            <surname>Masticola</surname>
          </string-name>
          , and Paul A. Strooper, pp.
          <fpage>115</fpage>
          -
          <lpage>122</lpage>
          . IEEE, (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Mersmann</surname>
          </string-name>
          and Michel Dojat, '
          <article-title>SmartCare™ - automated clinical guidelines in critical care'</article-title>
          ,
          <source>in ECAI'04/PAIS'04: Proceedings of the 16th European Conference on Artificial Intelligence, including Prestigious Applications of Intelligent Systems</source>
          , pp.
          <fpage>745</fpage>
          -
          <lpage>749</lpage>
          , Valencia, Spain, (
          <year>2004</year>
          ). IOS Press.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Glenford J.</given-names>
            <surname>Myers</surname>
          </string-name>
          , Corey Sandler, and Tom Badgett,
          <source>The art of software testing</source>
          , John Wiley &amp; Sons, Hoboken, N.J., 3rd edn.,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Mor</given-names>
            <surname>Peleg</surname>
          </string-name>
          , Samson Tu, Jonathan Bury, Paolo Ciccarese, John Fox, Robert A. Greenes, Silvia Miksch, Silvana Quaglini, Andreas Seyfang, Edward H. Shortliffe, Mario Stefanelli, et al., '
          <article-title>Comparing computer-interpretable guideline models: A case-study approach'</article-title>
          ,
          <source>JAMIA</source>
          ,
          <volume>10</volume>
          , (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajan</surname>
          </string-name>
          , '
          <article-title>Coverage metrics to measure adequacy of black-box test suites'</article-title>
          ,
          in
          <source>ASE 2006: 21st IEEE/ACM International Conference on Automated Software Engineering</source>
          , pp.
          <fpage>335</fpage>
          -
          <lpage>338</lpage>
          , (September
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Sandra</given-names>
            <surname>Rapps</surname>
          </string-name>
          and
          <string-name>
            <given-names>Elaine J.</given-names>
            <surname>Weyuker</surname>
          </string-name>
          , '
          <article-title>Selecting software test data using data flow information'</article-title>
          ,
          <source>IEEE Trans. Softw. Eng.</source>
          ,
          <volume>11</volume>
          (
          <issue>4</issue>
          ),
          <fpage>367</fpage>
          -
          <lpage>375</lpage>
          , (
          <year>April 1985</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.C.</given-names>
            <surname>Reid</surname>
          </string-name>
          , '
          <article-title>An empirical analysis of equivalence partitioning, boundary value analysis and random testing'</article-title>
          ,
          in
          <source>Proceedings of the Fourth International Software Metrics Symposium</source>
          , pp.
          <fpage>64</fpage>
          -
          <lpage>73</lpage>
          , (
          <year>November 1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Richard</given-names>
            <surname>Wettel</surname>
          </string-name>
          and Michele Lanza, '
          <article-title>Visualizing software systems as cities'</article-title>
          , in VISSOFT, eds.,
          <string-name>
            <given-names>Jonathan I.</given-names>
            <surname>Maletic</surname>
          </string-name>
          , Alexandru Telea, and Andrian Marcus, pp.
          <fpage>92</fpage>
          -
          <lpage>99</lpage>
          . IEEE Computer Society, (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Richard</given-names>
            <surname>Wettel</surname>
          </string-name>
          and Michele Lanza, '
          <article-title>CodeCity: 3D visualization of large-scale software'</article-title>
          , in ICSE Companion, eds., Wilhelm Schäfer,
          <string-name>
            <given-names>Matthew B.</given-names>
            <surname>Dwyer</surname>
          </string-name>
          , and Volker Gruhn, pp.
          <fpage>921</fpage>
          -
          <lpage>922</lpage>
          . ACM, (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>