<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>User interface level testing with TESTAR; what about more sophisticated action speci cation and selection?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Bauersfeld</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tanja E. J. Vos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Center on Software Production Methods (PROS) Universitat Politecnica de Valencia Valencia</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Testing software applications at the Graphical User Interface (GUI) level is a very important testing phase to ensure realistic tests because the GUI represents a central juncture in the application under test from where all the functionality is accessed. In earlier works we presented the TESTAR tool, a Model-Based approach to automate testing of applications at the GUI level whose objective is to generate test cases based on a model that is automatically derived from the GUI through the accessibility API. Once the model has been created, TESTAR derives the sets of visible and unblocked actions that are possible for each state that the GUI is in and randomly selects and executes actions in order to drive the tests. This paper, instead of random selection, we propose a more advanced action speci cation and selection mechanism developed on top of our test framework TESTAR. Instead of selecting random clicks and keystrokes that are visible and unblocked in a certain state, the tool uses a Prolog speci cation to derive sensible and sophisticated actions. In addition, it employs a well-known machine learning algorithm, called Q-Learning, in order to systematically explore even large and complex GUIs. This paper explains how it operates and present the results of experiments with a set of popular desktop applications.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Graphical User Interfaces (GUIs) represent the main connection point between a
software's components and its end users and can be found in almost all modern
applications. This makes them attractive for testers, since testing at the GUI
level means testing from the user's perspective and is thus the ultimate way of
verifying a program's correct behaviour. Current GUIs can account for 45-60%
of the entire source code [14] and are often large and complex. Consequently,
it is di cult to test applications thoroughly through their GUI, especially
because GUIs are designed to be operated by humans, not machines. Moreover,
they are inherently non-static interfaces, subject to constant change caused by
functionality updates, usability enhancements, changing requirements or altered
contexts. This makes it very hard to develop and maintain test cases without
resorting to time-consuming and expensive manual testing.</p>
      <p>
        In previous work, we have presented TESTAR [
        <xref ref-type="bibr" rid="ref3 ref5 ref6">5,6,3</xref>
        ], a Model-Based
approach to automate testing at the GUI level. TESTAR uses the operating
system's Accessibility API to recognize GUI controls and their properties and
enables programmatic interaction with them. It derives sets of possible actions for
each state that the GUI is in (i.e. the visible widgets, their size, location and
other properties such as whether they are enabled or blocked by other windows
etc.) and randomly selects and executes appropriate ones in order to drive the
tests. In completely autonomous and unattended mode, the oracles can detect
faulty behaviour when a system crashes or freezes. Besides these free oracles,
the tester can easily specify some regular expressions that can detect patterns
of suspicious titles in widgets that might pop up during the executed tests
sequences. For more sophisticated and powerful oracles, the tester can program
the Java protocol that is used to evaluate the outcomes of the tests.
      </p>
      <p>
        The strength of the approach is that the technique does not modify nor
require the SUT's source code, which makes it applicable to a wide range of
programs. With a proper setup and a powerful oracle, TESTAR can operate
completely unattended, which saves human e ort and consequently testing costs. We
believe that TESTAR is a straightforward and e ective technique of provoking
crashes and reported on its success describing experiments done with MS Word
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. We were able to nd 14 crash sequences while running TESTAR during
48 hours1 applying a strategy of random selection of visible/unblocked actions.
      </p>
      <p>
        In this paper we will investigate more sophisticated ways of action
specication and selection. Instead of clicking randomly on visible and unblocked
locations within the GUI, we enable the tester to de ne the set of visible actions
that he or she wants to execute. The de nitions are written in Prolog syntax
and allow the speci cation of thousands of actions { even complex mouse
gestures { with only a few lines of code. Moreover, we will use a machine learning
algorithm called Q-Learning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to explore the GUI in a more systematic manner
than random testing. It learns about the interface and strives to nd previously
unexecuted actions in order to operate even deeply nested dialogs, which a
random algorithm is unlikely to discover. In the next section we will explain how
TESTAR obtains the GUI state and executes actions.
      </p>
      <p>This paper is structured as follows. Section 2 presents the TESTAR approach
for testing at the GUI level and describes the extensions for action speci cation
and selection. Section 3 presents the results of a rst experiment in which we
applied TESTAR to a set of popular applications to test its ability in nding
reproducible faults. Section 4 lists related work, section 5 reviews the approach
and section 6 describes future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>The TESTAR Approach</title>
      <p>1 Videos of these crashes are available at http://www.youtube.com/watch?v=PBs9jF_
pLCs
their size, location and other properties (such as whether they are enabled or
blocked by other windows etc.). From that state it (B) derives a set of feasible
actions (the green dots, letters and arrows, which represent clicks, text input
and drag and drop operations, respectively) from which it then (C) selects one
(marked red) and nally executes it. By repeating these steps, TESTAR will
be able to generate arbitrary input sequences to drive the GUI. The following
subsections will explain this in more details, together with the new more
sophisticated ways of deriving actions and selection them for execution.
2.1</p>
      <sec id="sec-2-1">
        <title>Determine the GUI State</title>
        <p>All of our experiments described in this paper are conducted on MacOSX. For
this platform the Accessibility API { which simpli es computer usage for people
with disabilities { is used to obtain the SUT's GUI state. It allows to gather
information about the visible widgets of an application and gives TESTAR the
means to query their property values. Since it is a native API written in
ObjectiveC, we make use of the Java Native Interface (JNI) to invoke its methods.
After querying the application's GUI state, we save the obtained information in
a so-called widget tree which captures the structure of the GUI. Figure 2
displays an example of such a tree. Each node corresponds to a visible widget and
contains information about its type, position, size, title and indicates whether it
is enabled, etc. The Accessibility API gives access to over 160 attributes which
allows us to retrieve detailed information such as:
{ The type of a widget.
{ The position and size which describe a widget's rectangle (necessary for
clicks and other mouse gestures).
{ It tells us whether a widget is enabled (It does not make sense to click
disabled widgets).
{ Whether a widget is blocked. This property is not provided by the API but
we calculate it. For example, if a message box blocks all other widgets behind
it, then TESTAR can detect those and other modal dialogs (like menus) and
sets the blocked attribute accordingly.
{ Whether a widget is focused (has keyboard focus) so that TESTAR knows
when it can type into text elds.
{ Attributes such as title, help and other descriptive attributes are very
important to distinguish widgets from each other and give them an identity. We
will make use of this in the next subsection when we describe our algorithm
for action selection.</p>
        <p>This gives us access to almost all widgets of an application, if they are not
custom-coded or drawn onto the window. We found that the Accessibility API
works very well with the majority of native applications (since the API works
for all standard widgets, the developers do not have to explicitly code for it to
work).</p>
        <p>ABCD</p>
        <p>EFG
ABCDEFG
2
3
Desktop</p>
        <p>Window
Button</p>
        <p>Text</p>
        <p>Menu</p>
        <p>Slider
type: TButton
title: "Button"
enabled: false
hasFocus: true
rect: [15, 25, 65, 55]
...</p>
        <p>MI MI MI MI
type: TMenuItem
title: "File"
...
Having obtained the GUI's current state, we can go on to derive a set of actions.
One thing to keep in mind is that the more actions TESTAR can choose from,
the bigger the sequence space will be and the more time the search for faults
will consume. Ideally, TESTAR should only select from a small set of actions
which are likely to expose faults. Thus, our intention is to keep the search space
as small as possible and as large as necessary for nding faults. For each state
we strive to generate a set of sensible actions which should be appropriate to the
widgets that they are executed on: Buttons should be clicked, scrollbars should
be dragged and text boxes should be lled with text. Furthermore, we would like
to exercise only those widgets which are enabled and not blocked. For example:
it would not make sense to click on any widget that is blocked by a message box.
Since the box blocks the input, it is unlikely that any event handling code (with
potential faults) will be invoked.</p>
        <p>One way to tell TESTAR how to use speci c widgets would be to provide a
set of rules so that it can generate actions according to a widget's type, position
and other attribute values. This would work reasonably well for many GUIs,
since most widgets are standard controls. However, it is not very exible. There
might be non-standard widgets which TESTAR does not know how to use or
which are used in a di erent way than they were designed for (Microsoft Word
uses static text labels as table items and has lots of custom coded widgets).
Moreover, on screen 3 of Figure 1 how would TESTAR know that it can drag
images from the media browser into the document or that it can draw onto the
canvas in Figure 4? Finally, a tester might want to execute only speci c parts
of the application, e.g. click only buttons and menu items or leave out certain
actions which could trigger \hazardous" operations that delete or move les.</p>
        <p>Due to these reasons, we let the tester specify which actions TESTAR should
execute. Since modern GUIs can be very large and complex, with potentially ten
thousands of widgets, the tester needs a comfortable and e cient way of de ning
actions. We found that action speci cation in Java is often verbose and not very
concise. De nitions such as \Drag every image within the window titled 'Images'
onto the canvas with help text 'Document'." often require many lines of code.
Therefore, we were looking for a more declarative language, which allows the
tester to specify actions in a more natural way. Eventually, we decided to
integrate a Prolog engine into our framework. Prolog is a programming language that
has its roots in rst-order-logic and is often associated with the arti cial
intelligence and computational linguistics communities [9]. In Prolog the programmer
writes a database with facts and rules such as
parent(bruce, sarah).
parent(sarah, tom).
parent(gerard, tom).
ancestor(X,Y):- parent(X,Y);</p>
        <p>(parent(X,Z), ancestor(Z,Y)).</p>
        <p>where the rst three lines state that Bruce is Sarah's parent and that Sarah
and Gerard are Tom's parents. The third line is a recursive rule and reads: \X
is Y's ancestor, if X is the parent of Y or there exists a Z so that X is a parent
of Z and Z is an ancestor of Y" (the semicolon and comma represent disjunction
and conjunction, respectively). The programmer can now issue queries such as
?- ancestor(X,tom) (\Who are Tom's ancestors?") against this database.
Prolog will then apply the facts and rules to nd the answers to this question, i.e.
all possible substitutions for X (Sarah, Gerard, Bruce). This can be applied to
much more complex facts and hierarchies and Prolog is known to be an e cient
and concise language for those kind of relationship problems [9].</p>
        <p>In our case we reason over widgets that also form a hierarchy. However, we
do not have an explicitly written fact database. Instead, the widget tree acts as
this database, as it describes the relationships among the widgets and contains
their property values. The tester can then use the Prolog engine to de ne actions
for the current GUI state. Figure 3.1 shows an example of how this is done. Let
us assume we want to generate clicks for all widgets within our example dialog.
The corresponding Prolog query is listed under the image and reads: \For all
widgets W, which have an ancestor A, whose title is 'Example', calculate the
center coordinates X and Y and issue a left click". Since this also generates clicks
for the disabled button widget, the text box and the slider, we might want to
improve this code, in order to obtain a more appropriate action set: Figure 3.2
shows the adapted Prolog code and its e ects. We implemented predicates such
as widget(W), enabled(W), type(W, R) and many others to allow the tester to
reason over the widget tree.</p>
        <p>Besides traditional clicks, text typing and drag and drop operations, the
Prolog engine also facilitates the de nition of more complex mouse gestures such
as the ones depicted in Figure 4. The tester simply de nes a set of points that
the cursor is supposed to pass through and TESTAR then uses cubic splines to
1
2
1) widget(W),ancestor(A,W),title(A,"Example"),center(W,X,Y),lclick(X,Y).
).
calculate the cursor's trajectory. This allows for complex drawing gestures and
handwriting.</p>
        <p>The ability to de ne the set of actions makes it possible to customize
TESTAR test and to restrict the search space to the actions that the tester thinks are
likely to trigger faults. As our experiments will show, a reasonably sized set of
\interesting" actions, can be more e ective in nding faults than unconstrained
mouse and keyboard inputs. The Prolog engine and the fact that GUIs usually
consist of a limited set of widget types with similar functionality, allow the tester
to specify thousands of these actions with only a few lines of code. Thus, even
large and complex applications can be con gured quickly.</p>
        <p>For our Prolog engine we used a modi ed version of the PrologCafe2
implementation, which compiles and transforms Prolog into Java code.
2 http://code.google.com/p/prolog-cafe
The action selection strategy is a crucial feature within TESTAR. The right
actions can improve the likelihood and decrease the time necessary for triggering
crashes. Consequently, we want it to learn about the GUI and be able to explore
it thoroughly. In large SUTs with many { potentially deeply nested { dialogs
and actions, it is unlikely that a random algorithm will su ciently exercise most
parts of the GUI within a reasonable amount of time. Certain actions are easier
to access and will therefore be executed more often, while others might not be
executed at all.</p>
        <p>Ideally, TESTAR should exercise all parts of the GUI equally and execute
each possible action at least once. Instead of pursuing a systematic exploration
like Memon et al. [13], which use a depth rst search approach, we would like to
have a mixture between a random and systematic approach that still allows each
possible sequence to be generated. Thus, our idea is to change the probability
distribution over the sequence space, so that seldom executed actions will be
selected with a higher likelihood than others, in order to favor exploration of
the GUI. To achieve this, a straightforward greedy approach would be to select
at each state the action which has been executed the least number of times.
Unfortunately, this might not yield the expected results: In a GUI it is often
necessary to rst execute certain actions in order to reach others. Hence, these
need to be executed more often, which requires an algorithm that can \look
ahead". This brought us to consider a reinforcement learning technique called
Q-Learning. We will now explain how it works, de ne the environment that it
operates in and specify the reward that it tries to maximize.</p>
        <p>We assume that our SUT can be modelled as a nite Markov Decision
Process (MDP). A nite MDP is a discrete time stochastic control process in an
environment with a nite set of states S and a nite set of actions A [16]. During
each time step t the environment remains in a state s = st and a decision maker,
called agent, executes an action a = at 2 As A from the set of available
actions, which causes a transition to state s0 = st+1. In our case, the agent is
TESTAR, the set A will refer to the possible GUI actions and S will be the set
of observable GUI states.</p>
        <p>An MDP's state transition probabilities are governed by</p>
        <p>P (s; a; s0) = P rfst+1 = s0jst = s; at = ag
meaning, that the likelihood of arriving in state s0 exclusively depends on a and s
and not on any previous actions or states. This is called Markov Property which
we assume holds approximately for the SUT3. In an MDP, the agent receives
rewards R(s; a; s0) after each transition, so that it can learn to distinguish good
from bad decisions. Since we want to favor exploration, we set the rewards as
3 Since we can only observe the GUI states and not the SUT's true internal states
(Hidden Markov Model ), one might argue whether the Markov Property holds
sufciently. However, we assume that the algorithm will still perform reasonably well.</p>
        <p>R(s; a; s0) := rx1ianit ;; ieflsxea = 0
Where xa is the amount of times that action a has been executed and rinit is
a large positive number (we want actions that have never been executed before,
to be extraordinarily attractive). Hence, the more often an action has been
executed, the less desirable it will be for the agent. The ultimate goal is to learn
a policy which maximizes the agent's expected reward. The policy determines
for each state s 2 S which action a 2 As should be executed. We will apply
the Q-Learning algorithm in order to nd . Instead of computing it directly,
QLearning rst calculates a value function V (s; a) which assigns a numeric quality
value { the expected future reward { to each state action pair (s; a) 2 S A.
This function is essential, since it allows the agent to look ahead when making
decisions: The agent then simply selects a = argmaxafV (s; a)ja 2 Asg within
each state s 2 S.</p>
        <p>Algorithm 1 shows the pseudo-code for our approach. Since it does not have
any prior knowledge about the GUI, the agent starts o completely uninformed.
Step by step it discovers states and actions and learns the value function through
the rewards it obtains. The quality value for each new state action pair is
initialized to rinit. The heart of the algorithm is the value update in line 9: The
updated quality value of the executed state action pair is the sum of the received
reward plus the maximum value of all subsequent state action pairs multiplied
by the discount factor . The more approaches zero, the more opportunistic
and greedy the agent becomes (it considers only immediate rewards). When
approaches 1 it will opt for long term reward. It is worth mentioning that the
value function and consequently the policy will constantly vary, due to the fact
that the rewards change. This is what we want, since the agent should always
put emphasis on the regions of the SUT which it has visited the least amount
of times4.</p>
        <p>Instead of always selecting the best action (line 6), one might also consider
a random selection proportional to the values of the available state action pairs.
This would introduce more randomness in the sequence generation process. For
our experiments, however, we sticked to the presented version of the algorithm.
Representation of States and Actions In order to be able to apply the
above algorithm, we have to assign a unique and stable identi er to each state
and action, so that we are able to recognize them. For this, we can use the
structure and the property values within the widget tree. For example: To create
a unique identi er for a click on a button we can use a combination of the button's
property values, such as its title, help text or its position in the widget hierarchy
(parents / children / siblings). The properties selected for the identi er should be
4 Again, this is contrary to the Markov Property where rewards are supposed to be
stationary. We counter the problem of constantly changing rewards with frequent value
updates, i.e. we sweep more often over the state actions pairs within the generated
sequence (i.e. line 9 of the algorithm).
/* reward for unexecuted actions */
/* discount factor */</p>
        <p>
          Algorithm 1: Sequence generation with Q-Learning [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
relatively \stable": The title of a window, for example, is quite often not a stable
value (opening new documents in a text editor will change the title of the main
window) whereas its help text is less likely to change. The same approach can
be applied to represent GUI states: One simply combines the values of all stable
properties of all widgets on the screen. Since this might be a lot of information,
we will only save a hash value generated from these values5. This way we can
assign a unique and stable number to each state and action.
3
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>We performed two di erent experiments. In the rst one, we compared three
di erent approaches with TESTAR summarized in Table 1 to nd out: RQ1
which of these testing approaches needs the least amount of time to crash
applications?.</p>
      <p>In the second experiment we used only approach C. At rst we had it
generate a set of short sequences { with at most 200 actions { with the goal to
trigger crashes for each SUT. We then tried to replay these crashing sequences
to determine the reproducibility of the revealed faults and nd out: RQ2: What
fraction of the generated crash sequences are reproducible, i.e. trigger the crash
again upon replay?
3.1</p>
      <sec id="sec-3-1">
        <title>The variables: What is being measured?</title>
        <p>In the rst experiment we measured the average time it took each approach to
crash an application, which is equivalent to what Liu et al. [12] do in their
experiments. We considered an SUT to have crashed if a) it unexpectedly terminated
or b) it did not respond to inputs during more than 60 seconds.
5 Of course this could lead to collisions. However, for the sake of simplicity we assume
that this is unlikely and does not signi cantly a ect the optimization process.</p>
        <p>Approach Description</p>
        <p>TESTAR's default con guration which randomly
A tshelaetctist aucsteisonths,e bAuctciesssiinbfiolritmyeAdPi nI ttoheresceongsneize
the visible/unblocked actions.</p>
        <p>TESTAR with random selection of actions
speci ed by the tester using Prolog as described
B isneleScetcstiaocnti2o.n2s., Tbhuits iatpipsr\oiancfhoramlseod"r,anindotmhelysense
that it makes use of Prolog action speci cations
tailored to each SUT.</p>
        <p>C TQh-liesaarnpipnrgoaaclhgoirsitehqmui vinasletneatdtoofBr,abnudto musesseltehcteion.</p>
        <p>In the second experiment we measured the absolute number of crashes that
we found and the percentage of the crashing sequences that we were able to
reproduce.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>The Systems Under Test (SUTs)</title>
        <p>We applied the three approaches to a set of popular MacOSX applications as
listed in the rst column of Table 2. We tried to include di erent types of SUTs
such as o ce applications (Word, Excel, Mail, iCal ...) an instant messenger
(Skype), a music player (iTunes) and a drawing application (Paintbrush), to
nd out whether our approach is generally applicable.
3.3</p>
      </sec>
      <sec id="sec-3-3">
        <title>The protocol</title>
        <p>We carried out all of our experiments on a 2.7 GHz Intel Core i7 MacBook with
8GB RAM and MacOSX 10.7.4. To verify and analyze the found crashes, we used
a frame grabber which recorded the screen content of our test machine during
the entire process. Thus, many of the found crashing sequences can be viewed
on our web page6. Before running any tests, we prepared each application by
performing the following tasks:
{ Write scripts which restore con guration les and application data before
each run: This is a crucial step, since we compared di erent approaches
against each other and wanted to make sure that each one starts the
application from the same con guration. Most applications save their settings
in con guration les, which needed to be restored to their defaults after
each run. In addition, applications like Mail or Skype employ user pro le
6 http://www.pros.upv.es/testar
data such as mailboxes or contact lists. During a run, mails and contacts
might be deleted, added, renamed or moved. Thus, each approach had a list
of directories and les which they automatically restored upon application
start.
{ Setup a secure execution environment: GUI testing should be done with
caution. Unless they are told otherwise, the test tools will execute every
possible action. Thus, they might print out documents, add / rename / move
/ delete les, printers and fonts, install new software or even shut down
the entire machine. We experienced all of these situations and rst tried
to counter them by disallowing actions for items such as \Print", \Open",
\Save As", \Restart", etc. However, applications such as Microsoft Word are
large and allow many potentially hazardous operations which are di cult to
anticipate. Therefore, we decided to run our tests in a sandbox (a MacOSX
program called sandbox-exec) with a ne-grained access control model. This
allowed us to restrict read and write access to certain directories, block
speci c system calls and restrict internet access. The latter was necessary
when we were testing Skype and Mail. Since we employed some of our own
contact lists and mail accounts, we could have contacted people and might
have transmitted private data.
{ For approaches B and C, we moreover de ned a set of sensible actions: As
described in Section 2.A and 2.B, we took care to specify actions appropriate
to the speci c widgets: We generated clicks on buttons and menu items as
well as right-clicks, input for text boxes, drag operations for scrollbars, sliders
and other draggable items. For one of the tested applications (Paintbrush,
a Microsoft Paint clone) we also generated mouse gestures to draw gures
as shown in Figure 4. For each application we strived to de ne a rich set of
actions, comprising the ones that a human user would apply when working
with the program. The LOCs of the Prolog speci cations can be found in
Table 3.</p>
        <p>For each approach we set the delay between two consecutive actions to 50
ms, to give the GUI some time to respond. During the entire testing process,
the SUTs were forced into the foreground, so that even if they started external
processes, these did not block access to the GUI.</p>
        <p>In the rst experiment we applied all three approaches to the applications
in Table 2, let each one run until it caused a crash and measured the elapsed
time. This process was repeated three times for each application, yielding 81
runs in total and 27 for each approach. Since we had only limited time for
the evaluation, we stopped runs that failed to trigger crashes within 24 hours
and took 24 hours as the time. This only happened with approach A and we
marked the corresponding cells in Table 2 with \*". We used our video records
to examine each crash and to determine whether an SUT was really frozen or
did only perform some heavy processing, such as importing music or mailboxes
in the case of iTunes and Mail.</p>
        <p>In the second experiment we had approach C exercise each of the object
SUTs again. This time we limited the amount of actions that were generated to
200, since we strived to generate short and comprehensible crashing sequences.
If the SUT did not crash after 200 actions, it was restarted. We run C for 10
hours on each application in order to generate several short crashing sequences.
After that, we replayed each of the crashing sequences 5 times. We considered
the crash to be reproducible if during one of these playbacks the application
crashed after the same action as during the initial crash.
3.4</p>
      </sec>
      <sec id="sec-3-4">
        <title>The results</title>
        <p>
          Table 2 lists our ndings for the rst experiment. It shows that approach B and
C were able to trigger crashes for all applications and needed signi cantly less
time than the default TESTAR (we performed a t-test with signi cance level
= 0:05). These ndings are consistent with [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], where the authors carried out
their experiments on Motorola cell phones. Consequently, during the experiment,
with advanced TESTAR action speci cation in Prolog, approaches B and C were
found to be faster in nding crashes than the default setting (A). The additional
e ort of writing the Prolog speci cations was relatively small for us. As can be
found in Table 3, most of the speci cations consisted of less than 30 LOCs. Only
the speci cation for Paintbrush is slightly more complex, since we added a few
mouse gestures to draw gures into the drawing area. A future additional study
should be done with real testers to see if the learning curve is not to steep.
        </p>
        <p>Unfortunately, approach C is not signi cantly faster in crashing the
application than approach B, so we do not have a clear outcome about our Q-learning
approach for action selection. Each of the two algorithms seems to perform
better for speci c applications. We will need to investigate on the cause of this
outcome in future work since a quick analysis already found that approach C
on the average executes about 2.5 times as many di erent actions as B, as we
expected.</p>
        <p>Table 4 shows the results of the second experiment. We were able to
reproduce 21 out of 33 triggered crashes, which is more than 60% and yields the
answer to RQ2. Some of the reproducible crashes we found were indeterministic
so that during some playbacks the application did not crash, whereas during
others it did. This might be caused by the fact that the execution environment
during sequence recording and sequence replaying is not always identical.
Certain factors, which might have been crucial for the application to crash, are not
entirely under our control. Such factors are the CPU load, memory
consumption, thread scheduling, etc. For future research we consider the use of a virtual
machine which might further improve the reproducibility, because it would allow
to guarantee the same environmental conditions during recording and replaying.
Some of the crashes that were generated might have been caused by the same
faults. Unfortunately, we could not verify this, since we did not have access to
the source code. To determine the general reproducibility in RQ2, one would
need a set of sequences which trigger crashes which are known to be caused by
di erent faults of di erent types.</p>
        <p>The capabilities of approaches B and C depend on the Prolog speci cations
and thus on the tester's skills in de ning a set of fault sensitive actions. In
addition, the experiments in this paper were executed by the researcher that
developed the speci cation functionality and so it could be expected that
speci cations of a person less familiar with these facilities, and hence the additional
e ort, could be larger.</p>
        <p>Finally, we executed our experiments on a single platform. This might not
be representative for other desktop platforms such as Windows or even mobile
operating systems like Android and iOS.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Related Work</title>
      <p>
        Amal tano et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] perform crash-testing on Android mobile apps. Before
testing the application, they generate a model in the form of a GUI tree, whose nodes
represent the app's di erent screens and whose transitions refer to event handlers
red on widgets within these screens. The model is obtained by a so-called
GUICrawler which walks through the application by invoking the available
eventhandlers with random argument values. From the GUI-tree they obtain test
cases, by selecting paths starting from the root to one of the leaves. They
automatically instrument the application to detect uncaught exceptions which would
crash the app. They test their approach on a small calculator application.
      </p>
      <p>Liu et al. [12] employ Adaptive Random Testing (ART) to crash Android
mobile applications or render them unresponsive. Their algorithm tries to
generate very diverse test cases, which are di erent from the already executed ones.
They de ne distance measures for input sequences and strive to generate new
test cases which have a high distance to the ones which are already in the test
pool. They test their approach on six small applications among which an address
book and an SMS client. In addition to normal user input, like keystrokes, clicks
and scrolling, they also simulate environmental events like the change of GPS
coordinates, network input, humidity or phone usage.</p>
      <p>Hofer et al. [10] apply a smart testing monkey to the Microsoft Windows
calculator application. Before the testing process, they manually build an abstract
model of the GUI relevant behavior using a language that is based on nite
state machines and borrows elements from UML and state charts. The resulting
Decision Based State Machine acts as both, an orientation for walking through
the GUI by randomly selecting event transitions and as test oracle which checks
certain properties for each state. They use the Ranorex Automation Framework
to execute their test cases.</p>
      <p>
        Artzi et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] perform feedback-directed random test case generation for
JavaScript web applications. Their objectives are to nd test suites with high
code coverage as well as sequences that exhibit programming errors, such as
invalid-html or runtime exceptions. They developed a framework called Artemis,
which triggers events by calling the appropriate handler methods and supplying
them with the necessary arguments. To direct their search, they use prioritization
functions: They select event handlers at random, but prefer the ones for which
they have achieved only low branch coverage during previous sequences.
      </p>
      <p>Huang et al. [11] concentrate on functional GUI testing. Their idea is to walk
through the GUI (by systematically clicking on widgets) and to automatically
generate a model (in the form of an Event Flow Graph) from which they derive
test cases by applying several coverage criteria. In their experiments they test
Java applications (some of them are synthetic, some are part of an o ce suite
developed by students) which they execute by performing clicks. Sometimes
they have problems with the execution of their sequences, since the GUI model
they are derived from is an approximation. Thus, they repair their sequences by
applying a genetic algorithm which strives to restore the coverage of the initial
test suite. Their techniques are available as part of the GUITAR framework7.</p>
      <p>Miller et al. [15] conducted an interesting empirical study on the e ciency
of dumb monkey testing for 30 MacOS GUI applications. They developed a
monkey that issues random (double) clicks, keystrokes and drag operations on
the frontmost application. They even report on the causes of the crashes for the
few applications that they have the source code for. Unfortunately, they do not
mention how long they were running the tests for in order for the programs to
hang or crash.</p>
      <p>Among the tools for dumb monkey testing there are: anteater 8 which
performs crash testing on the iPhone by issuing random clicks for any view (screen)
that it nds. UI/Application Exerciser Monkey 9 is a similar program for Android
and generates random streams of clicks, touches or gestures, as well as a number
of system-level events. Other monkey tools are Powerfuzzer (an HTTP based
web fuzzer written in Python), GUI Tester 10 (a desktop monkey for Windows
which uses a taboo-list to avoid cycling on the same events) and MonkeyFuzz 11
(Windows desktop monkey developed in C#).
5</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        In this paper we presented TESTAR, a tool for automated testing at the GUI
level, together with some new approaches for action selection and speci cation.
Moreover we presented the results of evaluating these new approaches for crash
7 http://sourceforge.net/projects/guitar
8 http://www.redant.com/anteater
9 http://developer.android.com/tools/help/monkey.html
10 http://www.poderico.it/guitester/index.html
11 http://monkeyfuzz.codeplex.com
testing of a set of real-world applications with complex GUIs. The strengths of
our tool are:
{ Easy speci cation of even complex actions: This enables the tester to restrict
the search space to the interesting operations. It also allows to go beyond
simple clicks and keystrokes to generate mouse gestures which drive even
complex GUIs and exercise the majority of their functionalities.
{ Revealed faults are reproducible: Since crashing sequences are often
relatively short and their actions are parameterized with the widgets they are
executed on, they can be reliably replayed, which makes the majority of
the crashes reproducible. This allows the developer to better understand the
fault and to collect additional information during playback. He may even
replay the sequence in slow motion to observe the course of events.
{ Since we employ the operating system's Accessibility API the SUT does not
require any instrumentation which makes the approach feasible for many
technologies and operating systems: Many applications are not designed with
testing in mind and it can be di cult to add testing hooks later on [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Our
framework still allows to test those applications, without any
instrumentation e ort. We deliberately do not make use of any coverage techniques based
on source code or bytecode instrumentation. For many large applications it
is impractical or even impossible to measure code coverage, especially if the
source code is not available. And by far not all applications run in virtual
machines such as the JVM or Microsoft's CLR. Techniques that rely on
bytecode instrumentation can certainly make use of additional information
to guide the test case generation more e ectively, but they are restricted to
certain kinds of technologies. We strive for general applicability.
      </p>
      <p>The results that we obtained from our experiments are encouraging and show
the suitability of the approach for nontrivial SUTs. We proved that complex
popular applications can be crashed and that those crashes are reproducible to
a high degree. Once setup, the tests run completely automatic and report crashes
without any additional labour.
6</p>
    </sec>
    <sec id="sec-6">
      <title>Future Work</title>
      <p>Although rst results are encouraging, more experimentation needs to be done
to nd out why the Q-learning approach did not work as expected and how
di cult writing Prolog speci cations would turn out for real test practitioners.</p>
      <p>Our current implementation runs on MacOSX, TESTAR is not restricted
to any particular platform and we are in the process of developing support for
Microsoft Windows. In addition, we plan implementations for other operating
systems such as Linux and Android. Our approach bene ts from the fact that
many platforms provide an Accessibility API or one to access window manager
information (such as the WinAPI).</p>
      <p>The approach presented in this paper currently only targets critical faults
such as crashes. However, we plan to extend our framework to develop techniques
for automated regression testing. Our ambitious goal is to completely replace the
fragile and inherently labor intense capture and replay method and to develop
a more e ective and automated approach to regression testing. Therefore, we
will have to use a powerful oracle which allows us to detect not only crashes
but also functional faults such as incorrect output values in text elds, layout
problems, etc. One way to achieve this is to perform back to back testing with
two consecutive versions of an application. In this scenario, the old version serves
as the oracle for the new one. The di culty lies in detecting the intended (new
features) and unintended (faults due to modi cations) di erences between the
widget trees in each state in order to reduce the amount of false positives.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgment</title>
      <p>This work is supported by EU grant ICT-257574 (FITTEST) and the SHIP
project (SMEs and HEIs in Innovation Partnerships) (reference: EACEA / A2
/ UHB / CL 554187).
9. Ivan Bratko. Prolog Programming for Arti cial Intelligence. Addison-Wesley
Longman Publishing Co., Inc., Boston, MA, USA, 1986.
10. B. Hofer, B. Peischl, and F. Wotawa. Gui savvy end-to-end testing with smart
monkeys. In Automation of Software Test, 2009. AST '09. ICSE Workshop on,
pages 130{137, May 2009.
11. Si Huang, Myra Cohen, and Atif M. Memon. Repairing gui test suites using
a genetic algorithm. In ICST 2010: Proceedings of the 3rd IEEE International
Conference on Software Testing, Veri cation and Validation, Washington, DC,
USA, 2010. IEEE Computer Society.
12. Zhifang Liu, Xiaopeng Gao, and Xiang Long. Adaptive random testing of
mobile application. In Computer Engineering and Technology (ICCET), 2010 2nd
International Conference on, volume 2, pages V2{297{V2{301, April 2010.
13. Atif Memon, Ishan Banerjee, Bao Nguyen, and Bryan Robbins. The rst decade of
gui ripping: Extensions, applications, and broader impacts. In Proceedings of the
20th Working Conference on Reverse Engineering (WCRE). IEEE Press, 2013.
14. Atif M. Memon. A comprehensive framework for testing graphical user
interfaces. 2001. Advisors: Mary Lou So a and Martha Pollack; Committee members:
Prof. Rajiv Gupta (University of Arizona), Prof. Adele E. Howe (Colorado State
University), Prof. Lori Pollock (University of Delaware).
15. Barton P. Miller, Gregory Cooksey, and Fredrick Moore. An empirical study of
the robustness of macos applications using random testing. In Proceedings of the
1st International Workshop on Random Testing, RT '06, pages 46{54, New York,
NY, USA, 2006. ACM.
16. Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning.</p>
      <p>MIT Press, Cambridge, MA, USA, 1st edition, 1998.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Amal tano,</article-title>
          <string-name>
            <given-names>A.R.</given-names>
            <surname>Fasolino</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Tramontana</surname>
          </string-name>
          .
          <article-title>A gui crawling-based technique for android mobile application testing</article-title>
          .
          <source>In Software Testing, Veri cation and Validation Workshops (ICSTW)</source>
          ,
          <source>2011 IEEE Fourth International Conference on</source>
          , pages
          <volume>252</volume>
          {
          <fpage>261</fpage>
          ,
          <string-name>
            <surname>March</surname>
          </string-name>
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Shay</given-names>
            <surname>Artzi</surname>
          </string-name>
          , Julian Dolby, Simon Holm Jensen,
          <string-name>
            <surname>Anders</surname>
          </string-name>
          <article-title>M ller, and Frank Tip. A framework for automated testing of javascript web applications</article-title>
          .
          <source>In Proceedings of the 33rd International Conference on Software Engineering</source>
          , ICSE '
          <volume>11</volume>
          , pages
          <fpage>571</fpage>
          {
          <fpage>580</fpage>
          , New York, NY, USA,
          <year>2011</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Bauersfeld</surname>
          </string-name>
          , A de Rojas, and
          <string-name>
            <given-names>T.E.J.</given-names>
            <surname>Vos</surname>
          </string-name>
          .
          <article-title>Evaluating rogue user testing in industry: An experience report</article-title>
          .
          <source>In Research Challenges in Information Science (RCIS)</source>
          ,
          <year>2014</year>
          IEEE Eighth International Conference on, pages
          <volume>1</volume>
          {
          <fpage>10</fpage>
          , May
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Bauersfeld</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tanja</given-names>
            <surname>Vos</surname>
          </string-name>
          .
          <article-title>A reinforcement learning approach to automated gui robustness testing</article-title>
          .
          <source>In In Fast Abstracts of the 4th Symposium on Search-Based Software Engineering (SSBSE</source>
          <year>2012</year>
          ), pages
          <fpage>7</fpage>
          <lpage>{</lpage>
          12. IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Bauersfeld</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tanja E. J.</given-names>
            <surname>Vos</surname>
          </string-name>
          .
          <article-title>Guitest: a java library for fully automated gui robustness testing</article-title>
          .
          <source>In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, ASE</source>
          <year>2012</year>
          , pages
          <fpage>330</fpage>
          {
          <fpage>333</fpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Bauersfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tanja E. J.</given-names>
            <surname>Vos</surname>
          </string-name>
          , Nelly Condori-Fernandez,
          <article-title>Alessandra Bagnato, and Etienne Brosse. Evaluating the TESTAR tool in an industrial case study</article-title>
          .
          <source>In 2014 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement</source>
          , ESEM '14,
          <string-name>
            <surname>Torino</surname>
          </string-name>
          , Italy,
          <source>September 18-19</source>
          ,
          <year>2014</year>
          , page
          <issue>4</issue>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weber</surname>
          </string-name>
          , and R.K. Keller.
          <article-title>Observations and lessons learned from automated testing</article-title>
          .
          <source>In Software Engineering</source>
          ,
          <year>2005</year>
          .
          <article-title>ICSE 2005</article-title>
          .
          <article-title>Proceedings</article-title>
          . 27th International Conference on, pages
          <volume>571</volume>
          {
          <fpage>579</fpage>
          , May
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bertolini</surname>
          </string-name>
          , G. Peres,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>d' Amorim, and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Mota</surname>
          </string-name>
          .
          <article-title>An empirical evaluation of automated black box testing techniques for crashing guis</article-title>
          .
          <source>In Software Testing Veri cation and Validation</source>
          ,
          <year>2009</year>
          . ICST '09. International Conference on, pages
          <volume>21</volume>
          {
          <fpage>30</fpage>
          ,
          <string-name>
            <surname>April</surname>
          </string-name>
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>