Unsupervised Graphical User Interface Learning

Przemysław Czaus [0000-0003-1108-525X]

Department of Mathematical Methods of Computer Science, Faculty of Mathematics and Computer Science, University of Warmia and Mazury in Olsztyn, Słoneczna 54, 10-710 Olsztyn, Poland
czaus@matman.uwm.edu.pl

Abstract. While there are many papers on automatic testing or generation of interfaces [1, 2, 3, 4], information on user interface learning methods that resemble human perception is scarce. Given the recent leaps in Artificial Intelligence (AI), this topic may prove useful for implementing a way for AI to communicate with applications created for human interaction. User interfaces differ between operating systems, hardware capabilities and implementations. The main objective of the study was the analysis of an application's Graphical User Interface (GUI) and of the interaction with graphical interface elements in general. The method can be used to track changes in application behaviour caused by any user action, without any knowledge of the underlying object structure.

Keywords: artificial intelligence, machine learning, user interface

1 Introduction

Currently, the main component of any human-application interaction is the output data returned in graphical or textual form. This data is processed, and further input data causes an event that may or may not change the output. For AI to handle repetitive tasks as humans do, it must work with existing user interfaces. The main focus of this paper is an easy-to-implement algorithm that tracks changes between the screens of an application and, in turn, creates a map of the interface elements, the underlying events and the resulting changes in the GUI. It may help build more sophisticated mechanisms for modeling transitions between views, finding similarities between GUI elements and understanding the key steps needed to achieve an expected result. The following sections describe:

- the algorithm's first tasks: finding GUI elements
- detecting change
- optimizations
- mapping interface elements and changes in the UI

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Example Algorithm

2.1 Generating events

Currently, the main component of any human-application interaction is the output data returned in graphical or textual form. This data is processed, and further input data causes an event that may or may not change the output (see Fig. 1).

Fig. 1. Change in the UI after clicking the sign-in button

The first step is identifying the change in application behaviour, which in turn helps us identify the event that caused the change. This is something everyone processes unconsciously while surfing the internet, playing video games or using an application. While the speed of identifying each action differs from person to person, the process is roughly the same: we input data (such as clicks on elements or key presses on the keyboard) and look for changes in the output (such as a message that the input was wrong, or an animation suggesting a data transfer). The output of a single event can be called a frame or a screenshot (see Fig. 2).

Fig. 2. A click event on the sign-in button generates a new frame.
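To make the notion of events and frames concrete, the sketch below fires a single input event and compares the frames captured before and after it. The paper does not prescribe any implementation, so the use of Python with the pyautogui and Pillow libraries, the window region, the click coordinates and the one-second settle delay are illustrative assumptions, not part of the described method.

# Sketch: capture a frame, fire one input event, capture the next frame,
# and check whether the event changed the GUI.
# Assumptions: Python with pyautogui (screenshots, synthetic clicks) and
# Pillow (image diffing); the window region and coordinates are examples.
import hashlib
import time

import pyautogui
from PIL import ImageChops

WINDOW = (0, 0, 800, 600)      # left, top, width, height of the application window
SETTLE_DELAY = 1.0             # seconds to wait for the GUI to react


def grab_frame():
    """Return a screenshot of the application window (one 'frame')."""
    return pyautogui.screenshot(region=WINDOW)


def frame_hash(frame):
    """Hash the raw pixel data so identical frames can be grouped later."""
    return hashlib.sha256(frame.tobytes()).hexdigest()


def fire_event(x, y):
    """Generate a single input event: a click at window-relative coordinates."""
    pyautogui.click(WINDOW[0] + x, WINDOW[1] + y)


if __name__ == "__main__":
    before = grab_frame()                    # frame 1: initial view
    fire_event(400, 300)                     # e.g. click the sign-in button
    time.sleep(SETTLE_DELAY)                 # let the event change the GUI
    after = grab_frame()                     # frame 2: view after the event

    # A non-empty bounding box of the pixel difference means the event
    # changed the output, as in Fig. 1 and Fig. 2.
    changed_area = ImageChops.difference(before, after).getbbox()
    print("GUI changed:", changed_area is not None, changed_area)
    print("frame hashes:", frame_hash(before)[:12], frame_hash(after)[:12])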
2.2 Optimizations

The first step is to generate an image of the current GUI render. It does not matter whether it is a web page or an application; the main focus is to get the output data, which is an image of the working application. Generating an image that contains only the application window, and limiting every action to the window borders, ensures that no action outside the application window interferes with the data processing. One can limit the input size by creating a diff between the first frame and another frame generated some time after the initial render (see Fig. 3).

Fig. 3. The first and second frames differ

This may cause some delay between starting the application and processing the data, but it may minimize errors on slower systems. Limiting events to only those parts of the image that did not change can significantly lower the number of iterations of the algorithm. This approach can cause problems in applications that animate the user interface or show animations before the GUI is initialized (see Fig. 4).

Fig. 4. The red area will be ignored

2.3 Tracking change

The second phase is generating input data that should cause the execution of an event. For the purpose of this example, coordinates start at [0,0], the upper left corner of the application window, excluding the status bar. We can limit these events to coordinates that did not change between the first two frames, since in most cases the interface elements are stationary. This applies to modern video games, web pages and mobile applications. After executing an action (a click, long click, swipe etc. at coordinates [0,0]) we should wait for the event to make the necessary changes in the GUI. Depending on the type of application and the available processing power, some delay should be used after every action and before generating the new image. At this point we have at least three states of the application: the initial view (1), the view a few seconds after generating the initial view (2), and the view after the last event (3) (see Fig. 5).

Fig. 5. The first three frames

2.4 Mapping interface elements to changes

A change of application state can be detected by generating a diff between the first and the last frame (or the second, if any changes occurred between the first two). If we find any differences between the frames, we can assume that an event was fired and the GUI changed. Generating a hash of this image is useful for future reference. The same operations should be repeated for every coordinate of the application interface. This helps us generate a map of interface elements corresponding to the frames generated by those click events. Any action that generates the same frame should be grouped as part of the same interface element (see Fig. 6).

Fig. 6. All click actions on the sign-in button generate frame 3

This provides enough data to generate a decision tree for the initial application view, with actions that allow us to reach the second-level frames. Treating every next-level frame as a starting point allows us to find new views. Checking the hashes of frames at every level helps us identify actions that send us back to a previous view. After finishing all the iterations, we have all the unique frames together with the corresponding coordinates and the actions performed on those interface elements. It is possible to verify whether frames share similar features (like information about a single list item) for grouping purposes. With an application fully mapped, it is possible to train classifiers to find similar interface elements, which in turn makes the process of analyzing new applications faster (see Fig. 7).

Fig. 7. All identified interface elements that generate a change in the GUI

Using the coordinates, we can generate smaller images that contain only a single interface element. At this point, unique interface elements have datasets with only one row, while grouped elements have as many rows as similar objects were found during the application mapping.
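The tracking and mapping phases of Sections 2.3 and 2.4 can be read as a single loop over candidate coordinates. The sketch below is one possible interpretation: it probes a coarse grid of points, skips the area that changed without user input, hashes the frame produced by each click and groups coordinates by that hash, so that all actions producing the same frame are treated as one interface element. The grid step, delays, window region and the restart_app callback are assumptions introduced for the example; the helper functions repeat those of the previous sketch so that the listing stays self-contained.

# Sketch of the mapping loop from Sections 2.3-2.4. The grid step, the window
# region, the delays and the restart strategy are illustrative assumptions,
# not part of the paper.
import hashlib
import time
from collections import defaultdict

import pyautogui
from PIL import ImageChops

WINDOW = (0, 0, 800, 600)   # left, top, width, height of the application window
GRID_STEP = 20              # distance in pixels between probed coordinates
SETTLE_DELAY = 1.0          # wait after each action before grabbing a frame


def grab_frame():
    return pyautogui.screenshot(region=WINDOW)


def frame_hash(frame):
    return hashlib.sha256(frame.tobytes()).hexdigest()


def fire_event(x, y):
    pyautogui.click(WINDOW[0] + x, WINDOW[1] + y)


def map_interface(restart_app):
    """Return {frame hash: [coordinates]} for the initial view.

    restart_app() must bring the application back to the initial view,
    e.g. by relaunching it; how that is done is outside the algorithm.
    """
    first = grab_frame()
    time.sleep(SETTLE_DELAY)
    second = grab_frame()                                     # frames 1 and 2 (Fig. 3)
    ignored = ImageChops.difference(first, second).getbbox()  # unstable area (Fig. 4)
    base_hash = frame_hash(second)

    elements = defaultdict(list)                # frame hash -> coordinates
    width, height = second.size
    for y in range(0, height, GRID_STEP):
        for x in range(0, width, GRID_STEP):
            if ignored and ignored[0] <= x < ignored[2] and ignored[1] <= y < ignored[3]:
                continue                        # skip the animated region
            fire_event(x, y)                    # action at [x, y]
            time.sleep(SETTLE_DELAY)
            result = frame_hash(grab_frame())   # frame 3 for this action
            if result != base_hash:             # the event changed the GUI
                elements[result].append((x, y)) # same frame -> same element
                restart_app()                   # back to the initial view
    return elements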
3 Conclusions

While mapping solutions created for a single technology may be more precise in finding exact UI elements, a more general approach might give more benefits in the long term. Because no single implementation of the individual phases of data processing is assumed, it is an interesting topic for testing many different approaches to data mining and computer vision. Many optimizations are still possible at different steps, yet tracking changes remains possible in a bounded amount of time.

References

1. Chen, C., Su, T., Meng, G., Xing, Z., Liu, Y.: From UI Design Image to GUI Skeleton: A Neural Machine Translator to Bootstrap Mobile GUI Implementation. In: Proceedings of ICSE '18: 40th International Conference on Software Engineering, Gothenburg, Sweden, pp. 665-676 (2018)
2. Choudhary, S.R., Gorla, A., Orso, A.: Automated Test Input Generation for Android: Are We There Yet? (E). In: 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015)
3. Su, T.: FSMdroid: Guided GUI Testing of Android Apps. In: Proceedings of the 38th International Conference on Software Engineering, ICSE 2016, Austin, TX, USA, May 14-22, 2016, Companion Volume, pp. 689-691 (2016)
4. Su, T., Meng, G., Chen, Y., Wu, K., Yang, W., Yao, Y., Pu, G., Liu, Y., Su, Z.: Guided, Stochastic Model-based GUI Testing of Android Apps. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017), pp. 245-256. ACM, New York, NY, USA (2017)