<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Test Suite Evaluation using Code Coverage Based Metrics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ferenc Horváth</string-name>
          <email>hferenc@inf.u-szeged.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Béla Vancsics</string-name>
          <email>vancsics@inf.u-szeged.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>László Vidács</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Árpád Beszédes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dávid Tengeri</string-name>
          <email>dtengeri@inf.u-szeged.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tamás Gergely</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tibor Gyimóthy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Software Engineering University of Szeged Szeged</institution>
          ,
          <country country="HU">Hungary</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MTA-SZTE Research Group on Artificial Intelligence University of Szeged Szeged</institution>
          ,
          <country country="HU">Hungary</country>
        </aff>
      </contrib-group>
      <fpage>46</fpage>
      <lpage>60</lpage>
      <abstract>
        <p>Regression test suites of evolving software systems are often crucial to maintaining software quality in the long term. They have to be effective in terms of detecting faults and helping their localization. However, gaining knowledge of such capabilities of test suites is usually difficult. We propose a method for deeper understanding of a test suite and its relation to the program code it is intended to test. The basic idea is to decompose the test suite and the program code into coherent logical groups which are easier to analyze and understand. Coverage and partition metrics are then extracted directly from code coverage information to characterize a test suite and its constituents. We also use heat-map tables for test suite assessment both at the system level and at the level of logical groups. We employ these metrics to analyze and evaluate the regression test suite of the WebKit system, an industrial size browser engine with an extensive set of 27,000 tests.</p>
      </abstract>
      <kwd-group>
        <kwd>code coverage</kwd>
        <kwd>regression testing</kwd>
        <kwd>test suite evaluation</kwd>
        <kwd>test metrics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Regression testing is a very important technique for maintaining the overall
quality of incrementally developed and maintained software systems [
        <xref ref-type="bibr" rid="ref14 ref21 ref4 ref5">5, 21, 14,
4</xref>
        ]. The basic constituent of regression testing, the regression test suite, however,
may become as large and complex as the software itself. To keep its value, the
test suite needs continuous maintenance, e.g. by the addition of new test cases
and update or removal of outdated ones [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Test suite maintenance is not easy and imposes high risks if not done
correctly. In general, the lack of systematic quality control of test suites will reduce
their usefulness and increase associated regression risks. Difficulties include their
ever growing size and complexity [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and the resulting incomprehensibility. But
unlike many advanced methods for efficient source code maintenance and
evolution available today (such as refactoring tools, code quality assessment tools,
static defect checkers, etc.), developers and testers have hardly any means that
may help them in test suite maintenance activities, apart from perhaps test
prioritization/selection and test suite reduction techniques [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], and some more
recent approaches for the assessment of test code quality [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Hence, a more disciplined quality control of test suites requires that one is
able to understand the internal structure of the test suite, its elements and their
relation to the program code. Without this information it is usually very hard
to decide about what parts of the test suite should be improved, extended or
perhaps removed. Today, the typical information available to software engineers
is limited to knowing the purpose of the test cases (the associated functional
behavior to be tested), possibly some information about the defect detection
history of the test cases and – mostly in the case of unit tests – their overall
code coverage.</p>
      <p>
        Furthermore, most previous work related to assessing test suites and test
cases is defect-oriented, i.e. the actual amount of defects detected and corrected
is central [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Unfortunately, past defect data is not always available or cannot
be reliably extracted. Approximations about defect detection capability may be
used instead which, if reliable, could be a more flexible and general approach to
assess test suites. In this work, we employ code coverage-based approximations
that are based on analyzing coverage structures related to individual code
elements and test cases in detail (code coverage is essentially a signature of dynamic
program behavior reflecting which program parts are executed during testing,
often associated with the fault detection capability of the tests [
        <xref ref-type="bibr" rid="ref21 ref22">21, 22</xref>
        ]).
      </p>
      <p>
        In our previous work [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], we developed a method for a systematic assessment
and improvement of test suites (named Test Suite Assessment and Improvement
Method – TAIME ). One of its use cases is the assessment of a test suite which
supports its comprehension. In TAIME, we decompose the test suite and
program code into logical groups called functional units, and compute associated
code coverage and other related metrics for these groups. This will enable a more
elaborate evaluation of the interrelationships among such units, thus not limiting
the analysis to overall coverage on the system level. Such an in-depth analysis
of test suites helps in understanding and evaluating the relationships of the
test suite to the program code and in identifying possible improvement points that
require further attention. We also introduce “heat-map tables” for a more intuitive
visualization of the interrelationships and the metrics.
      </p>
      <p>
        In this paper, we adapted the TAIME approach to use it for assessment
purposes and verified the usefulness of the method for the comprehension of the
test suite of the open source system WebKit [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], a web-browser layout engine
containing about 2.2 million lines of code and a large test suite of about 27
thousand test cases. We started from major functional units in this system and
decomposed test cases and procedures3 into corresponding pairs of groups. We
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 We use the term procedure as a common name for functions and methods</title>
      <p>were able to characterize the test suite both as a whole and its constituents and
eventually provide suggestions for improvement.</p>
      <p>In summary, our contributions are the following:
1. We adapted the TAIME method to provide an initial assessment of a large
regression test suite and present a graphical representation of the metrics
computed from code coverage.
2. We demonstrate the method on a large open source system and its regression
test suite, where we were able to identify potential improvement points.</p>
      <p>The paper continues with a general introduction of the analysis method
(Section 2) and an overview of the metrics used for the evaluation (Section 3). Then,
we demonstrate the process by evaluating the WebKit system: in Section 4,
we report how functional units were determined, while the assessment itself is
presented in Section 5, where we also introduce the heat-map visualization.
Relevant related work is presented in Section 6, before we conclude and outline
future research in the last section.
</p>
      <sec id="sec-2-1">
        <title>Overview of the Method</title>
        <p>
          Our long-term research goal is to elaborate an assessment method for test suites,
which could be a natural extension to existing software quality assessment
approaches. The key problem is how to gain knowledge of the overall system properties
of thousands of tests and at the same time understand the lower level structure of
the test suite to draw conclusions about enhancement possibilities. In our
previous work [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], we proposed a method to balance between two extreme cases:
providing system level metrics is not satisfactory for in-depth analysis, while
coping with individual tests may miss higher level aims in the case of large test
suites.
        </p>
        <p>Fig. 1. Overview of the method: determine the goal, set the granularity, determine the groups, execute, measure, analyze.</p>
        <sec id="sec-2-1-5">
          <title>Analyze</title>
          <p>
The first step is to decompose the
test suite and the program code into different groups. This can be done in
several ways, e.g. by asking the developers or, as we did, by investigating the tests
and the functionality they are testing. After we have a set of features,
we can decompose the test suite and the code into groups. Those tests, which
are intended to test the same functionality, are grouped together (test groups).
The features are implemented by different (possibly overlapping) program parts
(such as statements or procedures), which we call code groups. In the following,
we will use the term functional unit to refer to the decomposed functionality,
which we consider as a pair of associated test group and code group (see
Figure 2). The whole analysis process is centered around functional units, because
by this division the complexity of large test suites can be handled more easily.
The decomposition process (the number of functional units, the granularity of
code groups and whether the groups may overlap) depends on the system under
investigation. It may follow some existing structuring of the system or may
require additional manual or automatic analysis. More on how we used the concept
of functional units in our subject system will be provided later in the paper.
According to the GQM approach [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], the aim of test suite understanding
and analysis can be addressed by various questions about functional units. The
proposed method lets us answer questions about the relation of code and test
groups to investigate how tests intended for a functionality cover the associated
procedures; and also about the relation of an individual functional unit to other
units or to the test suite.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Metrics for Test Suite Evaluation</title>
        <p>In this section we give an overview of the two metrics, coverage and partition, that will
be used for test suite evaluation. All metrics will be defined for a pair of a test
group T and a code group P. Let T be the set of tests of a functional unit and
P be the set of procedures which usually belong to the same functional unit.</p>
        <p>Coverage Metric (cov)
We define the Coverage metric (cov) as the ratio of the number of procedures in
the code group P that are covered by test group T . We consider a procedure as
covered if one of its statement was executed by a test case. This is the traditional
notion of code coverage, and is more formally given as:
cov(T; P ) = jfp 2 P j p covered by T gj :
jP j</p>
        <p>Code coverage measures are widely used in white-box test design techniques,
and they are useful, among others, to enhance the fault detection rate and to drive test
selection and test suite reduction. This metric is easy to calculate based on the
coverage matrix produced by test executions. Possible values of cov fall into
[0, 1] (clearly, bigger values are better).</p>
        <p>
          Partition Metric (part)
We define the Partition metric (part) to express the average ratio of procedures
that can be distinguished from any other procedure in terms of coverage. The
primary application of this metric is to support fault localization [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The basis
for computing this metric are the coverage vectors in a coverage matrix
corresponding to the different procedures. The equality relation on coverage vectors
of different procedures determines a partitioning on them: procedures covered
by the same test cases (i.e. having identical columns in the matrix) belong to
the same partition. For a given test group T and code group P we denote such
a partitioning by P(P). We define [p] ∈ P(P) for every p ∈ P, where
[p] = {p′ ∈ P | p′ is covered by and only by the same test cases from T as p}.
Notice that p ∈ [p] according to the definition. Having the fault localization
application in mind, |[p]| − 1 will be the number of procedures “similar” to p in the
program, hence to localize p within [p] we would need at most |[p]| − 1 examinations.
Based on this observation, the part metric is formalized as follows, taking a
value from [0, 1]:
part(T, P) = 1 − Σ_{p∈P}(|[p]| − 1) / (|P| · (|P| − 1)).
        </p>
        <p>The numerator can also be interpreted as the sum of the |π| · (|π| − 1) values over
all distinct partitions π. The metric takes its best value (1) if the test cases partition
the procedures so that each procedure belongs to its own partition, and its worst
value (0) if there is only one big partition containing all the procedures.</p>
        <p>Fig. 3. An example coverage matrix C with test cases t1, …, t8 in its rows and procedures p1, …, p6 in its columns.</p>
        <p>The cov and part metric values for our example coverage matrix (see
Figure 3) can be seen in Table 1. The metrics were calculated for four test groups
and for the whole matrix, the code group always consisted of all procedures.
This example exhibits several cases where either cov is higher than part, or the
other way around. Although in theory there is no direct relationship between
these two metrics, with realistic coverage matrices we observed similarities for
our subject system as described later.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Data Extraction from the WebKit System</title>
        <p>
          The first two steps of the adapted TAIME method are to set the granularity and
to determine the test and code groups. In this section we present the extraction
process of these groups from the WebKit system [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], a layout engine that renders
web pages in some major browsers such as Safari. WebKit contains about 2
million lines of C++ code (this amounts to about 86% of the whole source code)
and it has a relatively big collection of regression tests, which helps developers
keep code quality at a high level.
        </p>
        <p>
          We chose to work on the granularity of procedures (C++ functions and
methods) due to the size and complexity of the system, but our approach is
generalizable to finer levels as well. (In [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], we applied the TAIME method on
both statement and procedure level.)
        </p>
        <p>The next step was to determine test and code groups. WebKit tests are
maintained in the LayoutTests directory. It contains test inputs and expected
results to test whether the content of a web page is displayed correctly (they
could be seen as a form of system level functional tests). In addition to strictly
testing the layout, it contains, for example, http protocol tests and tests for
JavaScript code as well. Tests are divided into thematic subdirectories to group
tests of particular topics. These topics are the different features of the WebKit
system and this separation made by the WebKit developers gave the base of the
decomposition of the program into functional units. Based on this separation,
the decomposition of the test suite into test groups was a natural choice.</p>
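      <p>The directory-based grouping described above can be sketched as follows (the paths and the helper are hypothetical; WebKit's actual test scripts differ):</p>

```python
# Hypothetical sketch: derive test groups from the top-level subdirectories
# of LayoutTests, mirroring the thematic layout described in the text.
from collections import defaultdict
from pathlib import PurePosixPath

def group_tests(test_paths):
    groups = defaultdict(list)
    for path in test_paths:
        parts = PurePosixPath(path).parts
        # The first directory under LayoutTests names the tested feature.
        if len(parts) > 1:
            groups[parts[0]].append(path)
    return dict(groups)

tests = [
    "svg/animations/animate-color.html",
    "svg/text/select-text.html",
    "http/tests/navigation/post-basic.html",
]
print(sorted(group_tests(tests)))  # ['http', 'svg']
```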
        <p>At the separation of the program code, it is important to note that the
associated code groups were not selected based on code coverage information, rather
on expert opinion about the logical connections between test and code. Code
groups could be determined automatically based on the test group coverage,
but in that case the coverage of code groups would be always 100%. Our choice
to involve experts helps to highlight the differences between the plans of the
developers and the actual state that the test suite provides.</p>
        <p>The resulting functional units can be observed in Table 2. We identified a
total number of 84142 procedures in WebKit, and the full test suite contains
27013 tests, as shown in the first row of the table.4 The rest of the table lists
selected functional units and the statistics of the associated test and code groups,
which are detailed in the following subsections. We asked three experienced WebKit
developers – who have been developing WebKit at our department for about 7
years – to determine the functional units of the WebKit system.</p>
        <p>Test Groups
As the experts suggested, the test groups were determined based on how
WebKit developers categorize tests. The LayoutTests directory contains a separate
directory for each functionality in the code that is intended to be tested. These
directories (and their sub-directories) contain the test files. A test case in
WebKit is a test file processed by the WebKit engine as a main test file, which can
include other files as resources. The distribution of the number of tests in these
main test directories can be seen in Figure 4. There are several small groups
starting from the 12th largest directory in the middle of the diagram. We could
either put them into individual groups, which would make the analysis process
inefficient (with too many groups but little gain), or we could treat them as one
functional unit, breaking the logical structure of the groups. Finally, we decided
to exclude these tests, so about 91% of the tests were included in our experiments.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4 Actually, there are more tests than this, but we included only those tests that are executed by the test scripts. We used revision r91555 of the Qt (version 4.7.4) port of WebKit called QtWebKit on the x86_64 Linux platform. Also, only C++ code was measured.</title>
      <p>Fig. 4. Number of Tests in each Group.</p>
      <p>However, we found some inconsistencies in the test directory structure, so we
made further adjustments to arrive at the final set of groups. First, the fast
directory contains a selection of test cases for various topics to provide quick
testing capability for WebKit. This directory is constantly growing, and takes about
27% of all the tests in WebKit. Unfortunately, it cannot be treated as a separate
functional unit as it contains tests that logically belong to different functions.
Hence we associated each fast/* subdirectory to some of the other groups and
a small, heterogeneous part is left uncategorized, marked as fast-remaining
(starting from this point we cut off the remaining test groups, shown in the middle of
Figure 4). Second, there exist three separate test groups for testing JavaScript
code. This language plays a key role in rendering web pages, and the project
adopted two external JavaScript test suites called sputnik and ietestcenter.
In addition, the WebKit project has its own test cases for JavaScript in the
fast/js directory. We used the union of these three sources to create a common
JavaScript category called js in our tables.</p>
      <p>Once the features were formed, we asked the experts to assign source code
directories/files to functionalities based on their expertise, but without letting them
access the coverage data of the tests. This task required two person-days of effort
from the experts and the result of this step was a list of directories and files for
each functional unit. The path of each procedure in the code was matched against
this list, and the matching functional unit(s) were assigned to the procedures.
We chose this approach to avoid investigating more than 84000 procedures one
by one and so we determined path prefixes and labeled all procedures
automatically. However, this method resulted in file level granularity in our dataset and
increased the overlap between functional units. Eventually, out of 84142
procedures we associated about 28800 with one of the functional units; during code
browsing, only procedures strictly implementing the functionality of the units
were selected. The omitted procedures usually included general functionality
that could not be associated with any of the functional units unambiguously. At
the end of this process, all experts accepted the proposed test and code groups.</p>
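      <p>The automatic labeling step can be illustrated with a sketch of the following shape (the prefix lists and paths are hypothetical; the expert-provided lists are not reproduced here):</p>

```python
# Hypothetical sketch: assign functional units to procedures by matching the
# path of each procedure's file against expert-provided path prefixes.
UNIT_PREFIXES = {
    "css": ["Source/WebCore/css/"],
    "svg": ["Source/WebCore/svg/", "Source/WebCore/rendering/svg/"],
}

def label_procedure(proc_path):
    # A file may match several units, which increases the overlap between
    # functional units, as noted in the text.
    return sorted(
        unit for unit, prefixes in UNIT_PREFIXES.items()
        if any(proc_path.startswith(pre) for pre in prefixes)
    )

print(label_procedure("Source/WebCore/svg/SVGElement.cpp"))  # ['svg']
print(label_procedure("Source/WTF/Assertions.cpp"))          # [] (uncategorized)
```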
      <p>
        We used several tools during this process. For the identification of all
procedures in WebKit (including non-covered ones) the static C/C++ analyzer of the
Columbus framework [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] was used. Furthermore, we used the SoDA library [
        <xref ref-type="bibr" rid="ref15 ref17">15,
17</xref>
        ] for trace analysis and metrics calculation after running the tests on the
instrumented version of WebKit.
      </p>
      <p>At the end of this process we determined the groups, executed the test suite
and measured the coverage using the given metrics.</p>
      <sec id="sec-3-1">
        <title>The Internals of a Test Suite as seen through Metrics</title>
        <p>In this section, we present the last, analyze step of the proposed TAIME approach
by first presenting our measurement results using the introduced metrics on
WebKit, then at the end of the section we evaluate the system based on the
results and determine possible enhancement opportunities.</p>
        <p>Coverage metrics
In the first set of experiments, we investigated the code coverage metrics for
the different functional units to find out if a test group produces higher overall
coverage on the associated code group than the others. In particular, for each
test group - code group pair we calculated the percentage of code that the
corresponding test cases cover, which we summarize in Table 3. Rows in the table
represent test groups while code groups are in the columns, and each cell
contains the associated coverage metric cov. For example, we can observe that tests
from the canvas functional unit cover 26% of the procedures of unit css.</p>
        <p>The first row and column of this table show data computed for the whole
WebKit system without separating the code and tests of the functional units. The
coverage ratio of the whole system is 53%, which means that about 47% of the
procedures are never executed by any of the test cases (note that for computing
system level coverage all tests are used, including the 9% of tests omitted from
individual groups). This is clearly a possible area for improvement globally, but
let us not forget that in realistic situations complete coverage is never the aim
due to a number of difficulties such as unreachable code or exception handling
routines.</p>
        <p>We can interpret the results of Table 3 in two ways: by comparing results in
a row or in a column. In the former case, we can observe which code groups are
best covered by a specific test group, and in the latter the best matching test
group can be associated with a code group. Using the latter it can be observed
which tests are in fact useful for testing a particular piece of code. We used a
graphical notation in the matrix of two parallel horizontal lines for rows and two
vertical lines for columns, respectively, to indicate the maximal values. In order
to have a more global picture of the results, we also present them in a graphical
way in the form of a ‘heat-map’: the intensity of the red background of a cell
is proportional to the ratio of the cell value and the column maximum, i.e. the
column maxima are the brightest.</p>
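      <p>The column-wise normalization behind the heat-map can be sketched as follows (the cell values are made up for illustration):</p>

```python
# Hypothetical sketch: the heat-map intensity of each cell is the ratio of
# the cell value to its column maximum, so column maxima get full intensity.
def heatmap_intensities(table):
    n_cols = len(table[0])
    col_max = [max(row[c] for row in table) for c in range(n_cols)]
    return [
        [0.0 if col_max[c] == 0 else row[c] / col_max[c] for c in range(n_cols)]
        for row in table
    ]

cov_values = [
    [0.53, 0.26],
    [0.45, 0.62],
]
for row in heatmap_intensities(cov_values):
    print([round(v, 2) for v in row])
```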
        <p>The most important observation is that, except for tables, each code group
is best covered by its own test group. This is an indicator of a good separation of
the tests associated with the functional units. In the other dimension, there are
more cases when a particular test group achieves higher coverage on a foreign
code group, but the value in the diagonal is also very high in all cases. The
dominance of the diagonal is clearly visible; however, there are some notable
cases where further investigation was necessary. For example, code group http,
which is very specific to its test group, and tables, which does not have a clear
distinction between the test groups. We are going to discuss these cases in more
detail at the end of this section. Another observation is that the coverage values
in the main diagonal are close to the overall coverage of 53%.</p>
        <p>Partition metrics
The partition metrics for the functional unit pairs showed surprising results. As
can be seen in Table 4 – that also shows the corresponding heat-map information
–, the part metric values basically indicate the same relationship between the
test and procedure groups as the coverage metrics. In fact, the Pearson
correlation of the two matrices of metrics (represented as vectors) is 0.98.
</p>
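      <p>The reported correlation between the two metric tables corresponds to a computation of the following shape (the flattened values below are made up, not the paper's data):</p>

```python
# Hypothetical sketch: Pearson correlation of two metric matrices, each
# flattened into a vector, as done for the cov and part tables in the text.
import math

def pearson(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov_xy / (sx * sy)

cov_table  = [0.53, 0.26, 0.45, 0.62]   # flattened cov matrix (made up)
part_table = [0.50, 0.20, 0.40, 0.60]   # flattened part matrix (made up)
print(round(pearson(cov_table, part_table), 2))
```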
        <p>
          In general, the coverage and the partition metric values do not necessarily
need to be correlated, which is illustrated also in our example from Figure 3 and
Table 1. However, certain general observations can be made as follows. When
the numbers of tests and procedures are over a certain limit, it is more likely that
a unit with high coverage will include different tests and produce high partition
metric as well, but it can happen that a unit with high coverage consists of test
cases with high but very similar coverage values, in which case the partition
metric will be worse. On the other hand, if the coverage is low, it is unlikely
that the partition metric will be high because non-covered items do not get
separated into distinct partitions. In earlier work [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], we used partition metrics
in the context of test reduction, and there the difference between partitioning
and coverage was distinguishable. We explain this also by the fact that the
reduced test sets consisted of much less test cases compared to the test groups
of functional units from the present work.
In a theoretical case of an ideal separation of functional units (code and test
groups) the above test suite metric tables would contain high values only in
their diagonal cells. Our experiments show that this does not always hold for
WebKit test groups. We identified two possible reasons for this phenomenon:
either test cases are not concentrated on their own functional unit, or procedures
are highly overlapping between functional units. We calculated the number of
common procedures in all pairings of functional units and it turned out that in
general the overlap is small: css is slightly overlapped with four other groups;
and the html5lib group contains most of the canvas and is overlapped with
tables. From the small number of common procedures (567 in total) we conclude
that the test cases cause the phenomenon. Although the tests aim at well-defined
features of WebKit, technically they are high level (system level) test cases
executed through a special, but fully functional, browser implementation. WebKit
developers acknowledged that tests cut across functional units: for example, a
single test case for testing an svg animation property will also cover css for style
information and js for controlling timing and querying attributes, which in turn implies
the coverage of some dom code as well.
        </p>
        <p>Characteristics of the WebKit test suite
We summarize our observations on functional units that we found during the
analysis of metrics in the previous sections.</p>
        <p>Regarding the special cases, we identified two extremes in the system. On the
one hand, there are functional units, where code groups are not really exercised
by other test groups, while their test groups cover other code groups. One of these
groups is http. By investigating this group, we found that most of its functionality –
e.g. assembling requests, sending data, etc. – is covered by the http test group,
while other test groups usually use only basic communication and a small number of
requests. The number of test cases in these groups could probably be reduced
without losing coverage, but only while taking care of the tests which cover the related
code of the group.</p>
        <p>On the other hand, there are groups like the tables group which is some
kind of an outlier in the sense that this code group is heavily covered by all of the
test groups. The reason for this is that it is hard to separate this group from the
code implementing the so-called box model in WebKit, which is an integral part
of the rendering engine. Thus, almost anything that tests web pages will use the
implementation of the box model, which is mostly included in the table code
group. Hence, the coverage data is highly overlapping. Although its coverage is
the highest, tables maintains good part metrics. Being highly covered by other
test groups, tables should be the last one to be optimized among the test
groups. The number of test cases in this group could probably be reduced due to
the high coverage by other modules; however, more specific tests could be used
to improve coverage.</p>
        <p>According to our analysis there is room for improving the coverage of all code
groups as they are around the overall coverage rate of 53%. Another general
comment is that component level testing (unit testing) is usually more effective
than system level testing (as is the case with WebKit) when higher code coverage
is the aim. For example, error handling code is very hard to exercise during
system level testing, while at component level such code can be more easily
tested. Thus, in the long term, introducing a unit test framework and adding real
component tests would be beneficial to attain higher coverage.</p>
        <p>
          In summary, the experience that we gathered from analyzing the data of the
adapted TAIME method points to two main ways to improve testing. The
proposed metrics can be used either to provide guidance for the developers during
the maintenance of the tests, or to focus test reduction on enhancing specific
capabilities of the test suite, e.g. fault detection and localization [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Related Work</title>
        <p>
          Although there exist many different criteria for test assessment [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], the main
approach to assessing the adequacy of testing has long been the fault detection
capability of test processes in general and test suites in particular. Code coverage
measurement is often used as a predictor of the fault detection capability of test
suites, but other approaches have been proposed as well, such as the output
uniqueness criteria defined by Alshahwan and Harman [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Code coverage is also
a traditional basis for white-box test design techniques due to the presumed
relationship to defect detection capability, but some studies have shown that this
correlation is not always present, is inversely correlated with reliability [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], or is at
least not evident [
          <xref ref-type="bibr" rid="ref10 ref11">11, 10</xref>
          ].
        </p>
        <p>
          There have been sporadic attempts to define metrics that can be used to
assess the quality of test suites, but this area is much less developed than that
of general software quality. Athanasiou et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] gave an overview of the state
of the art. They concluded that although some aspects of test quality had been
addressed, it basically remained an open challenge. They provided a model for
test quality based on the software code maintainability model; however, their
approach can only be applied to tests implemented in some programming
language.
        </p>
        <p>
          Researchers have started to move towards test-oriented metrics only recently,
which strengthens our motivation to work towards a more systematic evaluation
method for testing. Gómez et al. provide a comprehensive survey of measures
used in software engineering [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. They found that a large proportion of metrics
are related to source code, and only a small fraction is directed towards testing.
Chernak [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] also stresses the importance of test suite evaluation as a basis for
improving the test process. The main message of the paper is that objective
measures should be defined and built into the testing process to improve the overall
quality of testing, but the measures employed in this work are also defect-based.
        </p>
        <p>
          Code coverage based test selection and prioritization techniques are also
related to our work, as we share the same or similar notions. An overview of
regression test selection techniques has been presented by Rothermel and Harrold, who
introduced a framework to evaluate the different techniques [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Elbaum et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
conducted a set of empirical studies and found that fine-grained (statement
level) prioritization techniques outperform coarse-grained (function level) ones,
but the latter produce only marginally worse results in most cases, and a small
decrease in effectiveness can be more than offset by their substantially lower cost.
A survey for further reading in this area has been presented by Yoo and Harman [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
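The "additional coverage" greedy strategy that many of the surveyed prioritization techniques build on can be sketched briefly. This is an illustrative sketch, not any particular paper's algorithm: the test names and coverage sets are hypothetical, and running it on statement-level versus function-level coverage sets is what exposes the granularity trade-off discussed above.

```python
# Sketch of greedy "additional coverage" test prioritization: repeatedly
# pick the test that covers the most not-yet-covered elements. The data
# below is invented for illustration.

def additional_greedy(coverage):
    """Order tests so that each next test adds the most uncovered elements."""
    remaining = dict(coverage)
    covered, order = set(), []
    while remaining:
        # Pick the test with the largest number of newly covered elements;
        # ties are broken deterministically by test name.
        best = max(remaining, key=lambda t: (len(remaining[t] - covered), t))
        order.append(best)
        covered |= remaining.pop(best)
    return order

# Hypothetical statement-level coverage sets.
stmt_cov = {
    "tA": {"s1", "s2", "s3"},
    "tB": {"s3", "s4"},
    "tC": {"s5"},
}
print(additional_greedy(stmt_cov))  # tA adds the most statements, so it comes first
```

Feeding the same algorithm function-level sets instead of statement-level ones would typically yield a similar but coarser ordering, which is the marginal effectiveness loss versus lower measurement cost noted above.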
      </sec>
      <sec id="sec-3-3">
        <title>Conclusions</title>
        <p>Systematic evaluation of test suites to support various evolution activities is
largely an unexplored area. This paper provided a step towards establishing a
systematic and objective evaluation method and measurement model for
(regression) test suites of evolving software systems, which is based on analyzing the
code coverage structures among test and code elements. One direction for future
work will be to continue working towards a more complete test assessment model
by incorporating other metrics, possibly from other sources like past defect data.</p>
        <p>We believe that our method for test evaluation is general enough and is not
limited to the application we used it in. Nevertheless, we plan to verify it by
involving more systems having different properties, and in different test suite
evolution scenarios such as metrics-driven white-box test case design.</p>
        <p>Regarding our subject, WebKit, our observations are based mostly on the
introduced metrics, and we did not take into account the feasibility of the proposed
test optimizations. We plan to involve more test suite metrics
and investigate our actual suggestions in the future.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alshahwan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Coverage and fault detection of the output-uniqueness test selection criteria</article-title>
          .
          <source>In: Proceedings of the 2014 International Symposium on Software Testing and Analysis</source>
          . pp.
          <fpage>181</fpage>
          -
          <lpage>192</lpage>
          . ACM (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Athanasiou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nugroho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Visser</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaidman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Test code quality and its relation to issue handling performance</article-title>
          .
          <source>Software Engineering</source>
          , IEEE Transactions on
          <volume>40</volume>
          (
          <issue>11</issue>
          ),
          <fpage>1100</fpage>
          -
          <lpage>1125</lpage>
          (
          <year>Nov 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Basili</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caldiera</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rombach</surname>
          </string-name>
          , H.D.:
          <article-title>Goal question metric approach</article-title>
          .
          <source>In: Encyclopedia of Software Engineering</source>
          , pp.
          <fpage>528</fpage>
          -
          <lpage>532</lpage>
          . John Wiley &amp; Sons, Inc. (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Beck</surname>
          </string-name>
          , K. (ed.):
          <article-title>Test Driven Development: By Example</article-title>
          .
          Addison-Wesley Professional
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bertolino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Software testing research: Achievements, challenges, dreams</article-title>
          .
          <source>In: 2007 Future of Software Engineering</source>
          . pp.
          <fpage>85</fpage>
          -
          <lpage>103</lpage>
          . IEEE Computer Society (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chernak</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Validating and improving test-case effectiveness</article-title>
          .
          <source>IEEE Softw</source>
          .
          <volume>18</volume>
          (
          <issue>1</issue>
          ),
          <fpage>81</fpage>
          -
          <lpage>86</lpage>
          (
          <year>Jan 2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Elbaum</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malishevsky</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rothermel</surname>
          </string-name>
          , G.:
          <article-title>Test case prioritization: A family of empirical studies</article-title>
          .
          <source>IEEE Trans. Softw. Eng</source>
          .
          <volume>28</volume>
          (
          <issue>2</issue>
          ),
          <fpage>159</fpage>
          -
          <lpage>182</lpage>
          (
          <year>Feb 2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ferenc</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Beszédes, Á.,
          <string-name>
            <surname>Tarkiainen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gyimóthy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Columbus - reverse engineering tool and schema for C++</article-title>
          .
          <source>In: Proceedings of the 6th International Conference on Software Maintenance (ICSM</source>
          <year>2002</year>
          ). pp.
          <fpage>172</fpage>
          -
          <lpage>181</lpage>
          . IEEE Computer Society, Montreal, Canada (Oct
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gómez</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oktaba</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piattini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A systematic review measurement in software engineering: State-of-the-art in measures</article-title>
          .
          <source>In: Software and Data Technologies, Communications in Computer and Information Science</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>176</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Inozemtseva</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
          </string-name>
          , R.:
          <article-title>Coverage is not strongly correlated with test suite effectiveness</article-title>
          .
          <source>In: Proceedings of the 36th International Conference on Software Engineering (ICSE</source>
          <year>2014</year>
          ). pp.
          <fpage>435</fpage>
          -
          <lpage>445</lpage>
          . ACM (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Namin</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andrews</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>The influence of size and coverage on test suite effectiveness</article-title>
          .
          <source>In: Proceedings of the Eighteenth International Symposium on Software Testing and Analysis</source>
          . pp.
          <fpage>57</fpage>
          -
          <lpage>68</lpage>
          . ACM (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>L.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sinha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orso</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Understanding myths and realities of test-suite evolution</article-title>
          .
          <source>In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering</source>
          . pp.
          <fpage>33:1</fpage>
          -
          <lpage>33:11</lpage>
          .
          ACM
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>L.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sinha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Orso</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Understanding myths and realities of test-suite evolution</article-title>
          .
          <source>In: Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering</source>
          . pp.
          <fpage>33:1</fpage>
          -
          <lpage>33:11</lpage>
          . FSE '12,
          ACM
          , New York, NY, USA (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Rothermel</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harrold</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>Analyzing regression test selection techniques</article-title>
          .
          <source>IEEE Trans. Softw. Eng</source>
          .
          <volume>22</volume>
          (
          <issue>8</issue>
          ),
          <fpage>529</fpage>
          -
          <lpage>551</lpage>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <article-title>SoDA library</article-title>
          . http://soda.sed.hu, last visited: 2015-08-20
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Tengeri</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Beszédes, Á.,
          <string-name>
            <surname>Gergely</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vidács</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Havas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gyimóthy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Beyond code coverage - an approach for test suite assessment and improvement</article-title>
          .
          <source>In: Proceedings of the 8th IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW'15); 10th Testing: Academic and Industrial Conference - Practice and Research Techniques (TAIC PART'15)</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          (
          <year>Apr 2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Tengeri</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Beszédes, Á.,
          <string-name>
            <surname>Havas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gyimóthy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Toolset and program repository for code coverage-based test suite analysis and manipulation</article-title>
          .
          <source>In: Proceedings of the 14th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM'14)</source>
          . pp.
          <fpage>47</fpage>
          -
          <lpage>52</lpage>
          (
          <year>Sep 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Veevers</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marshall</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>A relationship between software coverage metrics and reliability</article-title>
          .
          <source>Software Testing, Verification and Reliability</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3</fpage>
          -
          <lpage>8</lpage>
          (
          <year>1994</year>
          ), http://dx.doi.org/10.1002/stvr.4370040103
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Vidács</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beszédes</surname>
          </string-name>
          , Á.,
          <string-name>
            <surname>Tengeri</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Siket</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gyimóthy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Test suite reduction for fault detection and localization: A combined approach</article-title>
          . In: Software Maintenance, Reengineering and Reverse Engineering (CSMR-WCRE),
          <source>2014 Software Evolution Week - IEEE Conference on</source>
          . pp.
          <fpage>204</fpage>
          -
          <lpage>213</lpage>
          (
          <year>Feb 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <article-title>The WebKit open source project</article-title>
          . http://www.webkit.org/, last visited: 2015-08-20
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Yoo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Regression testing minimization, selection and prioritization: a survey</article-title>
          .
          <source>Software Testing, Verification and Reliability</source>
          <volume>22</volume>
          (
          <issue>2</issue>
          ),
          <fpage>67</fpage>
          -
          <lpage>120</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>P.A.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>May</surname>
            ,
            <given-names>J.H.R.</given-names>
          </string-name>
          :
          <article-title>Software unit test coverage and adequacy</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>29</volume>
          (
          <issue>4</issue>
          ),
          <fpage>366</fpage>
          -
          <lpage>427</lpage>
          (
          <year>Dec 1997</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>