<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Unsupervised Graphical User Interface Learning</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematical Methods of Computer Science, Faculty of Mathematics and Computer Science, University of Warmia and Mazury in Olsztyn</institution>
          ,
          <addr-line>Słoneczna 54, 10-710 Olsztyn</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>While there are many different papers on automatic testing or generation of interfaces [1, 2, 3, 4], information about user interface learning methods that resemble human perception is scarce. Given the recent leaps in Artificial Intelligence (AI), this topic may prove very useful for implementing a way for AI to communicate with applications created for human interaction. User interfaces differ between operating systems, hardware capabilities and implementations. The main objective of the study was the analysis of an application's Graphical User Interface (GUI) and of the interaction with graphic interface elements in general. This method can be used to track the change in application behaviour caused by any action from the user, without having knowledge of the underlying object structure.</p>
      </abstract>
      <kwd-group>
        <kwd>artificial intelligence</kwd>
        <kwd>machine learning</kwd>
        <kwd>user interface</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Currently, the main component of any human-application interaction is the output data returned in a graphical or textual form. This data is then processed, and new input data causes an event that changes (or does not change) the output. For AI to handle repetitive tasks the way humans do, it must work with the existing UI. The main focus of this paper is to prepare an easy-to-implement algorithm that tracks changes between the screens of an application and in turn creates a map of the interface elements, the underlying events and the changes in the GUI. It may help build more sophisticated mechanisms for modeling changes between views, finding similarities between GUI elements and understanding the key steps needed to achieve an expected result.</p>
      <p>In the following sections one can find information about:</p>
      <list list-type="bullet">
        <list-item>
          <p>the first tasks of the algorithm, which find GUI elements,</p>
        </list-item>
        <list-item>
          <p>detecting change,</p>
        </list-item>
        <list-item>
          <p>optimizations,</p>
        </list-item>
        <list-item>
          <p>mapping interface elements and changes in the UI.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-2">
      <title>Example Algorithm</title>
      <sec id="sec-2-1">
        <title>Generating events</title>
        <p>Currently, the main component of any human-application interaction is the output data returned in a graphical or textual form. This data is then processed, and new input data causes an event that changes (or does not change) the output (see Fig. 1).</p>
        <p>The first step is identifying the change in the application's behaviour, which in turn helps us identify the event that caused the change. This is something that everyone processes without realizing it while surfing the internet, playing video games or using an application. While the speed of identifying each action differs from person to person, the process is roughly the same (link article that has this data): we start by inputting data (like clicks on elements or keys on the keyboard) and we look for changes in the output (like an information message that the input was wrong, or animations suggesting data transfer). The output of a single event can be called a frame or a screenshot (see Fig. 2).</p>
        <p>The first step is to generate an image of the current GUI render. It isn't important whether it's a web page or an application; the main focus is to get the output data, that is, an image of the working application. Generating an image that contains only the application window, and limiting any action to the window borders, ensures that no action outside the application window will interfere with the data processing. One can limit the input size by creating a diff between the first frame and another one generated some time after the initial render (see Fig. 3).</p>
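        <p>As a rough illustration of this step, the sketch below captures the application window twice and diffs the two frames. The window geometry, the two-second delay and the use of the mss and Pillow libraries are assumptions made for the example, not part of the method itself.</p>
        <preformat>
# Capture the application window right after the render and again a few
# seconds later, then diff the two frames (hypothetical window geometry).
import time

import mss
from PIL import Image, ImageChops

def grab_window(bbox):
    """Capture only the application window given as (left, top, width, height)."""
    left, top, width, height = bbox
    with mss.mss() as sct:
        raw = sct.grab({"left": left, "top": top, "width": width, "height": height})
        return Image.frombytes("RGB", raw.size, raw.rgb)

WINDOW = (100, 100, 800, 600)        # assumed window geometry
frame_initial = grab_window(WINDOW)  # frame right after the render
time.sleep(2.0)                      # wait out late initialisation / animations
frame_settled = grab_window(WINDOW)  # frame some time after the initial render

diff = ImageChops.difference(frame_initial, frame_settled)
print("changed region:", diff.getbbox())  # None means the GUI has settled
        </preformat>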
        <p>This may cause some delay between starting the application and processing the data, but it may minimize errors on slower systems. Limiting events to only those parts of the image that didn't change can significantly lower the number of iterations of the algorithm. This approach can cause problems in applications that animate user interfaces or generate animations before the GUI is initialized (see Fig. 4).</p>
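        <p>Continuing the sketch above, the unchanged regions between the first two frames can be turned into a set of candidate coordinates for later events; the coarse 20 px sampling grid is an arbitrary assumption used to keep the number of generated events manageable.</p>
        <preformat>
# Keep only coordinates whose pixels were identical in the two initial frames,
# sampled on a coarse grid (frame_initial / frame_settled come from the
# previous sketch).
import numpy as np

a = np.asarray(frame_initial)
b = np.asarray(frame_settled)

# True where the colour of a pixel did not change between the two frames.
stable_mask = np.all(a == b, axis=2)

STEP = 20  # assumed grid step; trying every pixel would be far too many events
candidates = [(x, y)
              for y in range(0, stable_mask.shape[0], STEP)
              for x in range(0, stable_mask.shape[1], STEP)
              if stable_mask[y, x]]
        </preformat>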
        <p>The second phase is generating input data that should cause the execution of an event. For the purpose of this example, coordinates start from [0,0], the upper left corner of the application window, excluding the status bar. We can limit these events to occur only at coordinates that didn't change between the first two frames; in most cases the interface elements are stationary, which applies to modern video games, web pages and mobile applications. After executing an action (click, long click, swipe, etc. at coordinates [0,0]) we should wait for the event to make the necessary changes in the GUI. Depending on the type of application and the processing power, some delay should be used after every action before generating the new image.</p>
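        <p>One possible shape of the event-generation loop is sketched below; the use of pyautogui, the fixed 0.5 s settle delay and the idea of resetting the application between actions are assumptions of the example, not requirements of the method.</p>
        <preformat>
# Click every candidate coordinate, wait for the GUI to react, and keep the
# resulting frame (grab_window, WINDOW and candidates come from the sketches above).
import time

import pyautogui

def fire_click_and_capture(window, coord, settle_delay=0.5):
    """Click at the window-relative coordinate (x, y) and return the frame afterwards."""
    left, top, _, _ = window
    x, y = coord
    pyautogui.click(left + x, top + y)  # translate to absolute screen coordinates
    time.sleep(settle_delay)            # give the event time to change the GUI
    return grab_window(window)

frames_after_click = {}
for coord in candidates:
    frames_after_click[coord] = fire_click_and_capture(WINDOW, coord)
    # A full implementation would restore the initial view here, so that every
    # action starts from the same application state.
        </preformat>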
        <p>At this point, we have at least three states of the application: the initial view (1), the view after the first few seconds from generating the initial view (2), and the view after the last event (3) (see Fig. 5).</p>
      </sec>
      <sec id="sec-2-2">
        <title>Mapping interface elements to changes</title>
        <p>A change of application state can be checked by generating a diff between the first and the last frame (or, if any changes occurred between the initial frames, between the second and the last). If we find any differences between the frames, we can assume that an event was fired and the GUI changed. Generating a hash based on this image will be useful for future reference. The same operations should be repeated for every coordinate of the application interface. This helps us generate a map of interface elements corresponding to the frames generated by those click events. Any action that generates the same frame should be grouped as part of the same interface element (see Fig. 6).</p>
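        <p>One possible way to implement this mapping is sketched below: every resulting frame is hashed, clicks that did not change the baseline frame are discarded, and the remaining coordinates are grouped by the hash of the frame they produced. The exact, byte-level hash is an assumption; a perceptual hash could be substituted to tolerate minor rendering noise.</p>
        <preformat>
# Group coordinates by the hash of the frame their click produced
# (frame_settled and frames_after_click come from the sketches above).
import hashlib
from collections import defaultdict

def frame_hash(img):
    """Hash the raw pixel data of a frame for cheap equality checks."""
    return hashlib.sha1(img.tobytes()).hexdigest()

baseline = frame_hash(frame_settled)
elements = defaultdict(list)  # frame hash -> coordinates that produced it

for coord, frame in frames_after_click.items():
    h = frame_hash(frame)
    if h != baseline:         # the click actually changed the view
        elements[h].append(coord)

# Each entry now describes one interface element: all coordinates whose
# click event led to the same resulting frame.
        </preformat>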
        <p>This provides enough data to generate a decision tree for the initial application view, with the actions that allow us to reach the second-level frames.</p>
        <p>Treating every next-level frame as a starting point allows us to find new views. Checking the hashes of the frames at every level helps us identify any actions that lead back to a previous view. After finishing all the iterations we have all the unique frames with the corresponding coordinates and the actions made on those interface elements. It's possible to verify whether frames have similar features (like information about a single list item) for grouping purposes. With an application fully mapped, it's possible to train classifiers to find similar interface elements, which in turn makes the process of analyzing new applications faster (see Fig. 7).</p>
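        <p>The exploration described above can be sketched as a breadth-first traversal over frame hashes, where a hash that was already seen marks an action leading back to a known view. The helper actions_for() is hypothetical; it stands in for restarting the application in a given view and re-running the click loop shown earlier.</p>
        <preformat>
# Breadth-first exploration of views identified by their frame hashes
# (frame_hash comes from the previous sketch; actions_for is a hypothetical
# helper yielding (coordinate, resulting frame) pairs for a given view).
from collections import deque

def explore(initial_hash):
    visited = {initial_hash}
    transitions = {}              # (view hash, coordinate) -> resulting view hash
    queue = deque([initial_hash])

    while queue:
        view = queue.popleft()
        for coord, frame in actions_for(view):
            h = frame_hash(frame)
            transitions[(view, coord)] = h
            if h not in visited:  # a genuinely new view: explore it later
                visited.add(h)
                queue.append(h)
            # a hash that is already known means the action leads back to a
            # previously mapped view, so no further exploration is needed
    return transitions
        </preformat>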
        <p>Using the coordinates, we can generate smaller images that contain only a single interface element. At this point, unique interface elements have datasets with only one row, while grouped elements have as many rows as similar objects were found during the application mapping.</p>
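        <p>A short sketch of this cropping step is given below; the fixed patch size around each coordinate is an assumption, and in practice the extent of an element could be taken from the bounding box of the corresponding diff instead.</p>
        <preformat>
# Cut out a small image around every coordinate of every mapped element
# (frame_settled and elements come from the earlier sketches).
def element_images(frame, elements, half_w=20, half_h=10):
    datasets = {}
    for h, coords in elements.items():
        rows = []
        for x, y in coords:
            box = (max(x - half_w, 0), max(y - half_h, 0), x + half_w, y + half_h)
            rows.append(frame.crop(box))
        datasets[h] = rows  # one row per similar element found during mapping
    return datasets

crops = element_images(frame_settled, elements)
        </preformat>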
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>While creating mapping solutions for a single technology may be more precise in finding exact UI elements, a more general approach might bring more benefits in the long term. Without a focus on a single implementation of all the phases of data processing, it is a very interesting topic for testing many different approaches to data mining and computer vision. Many optimizations are still possible at many different steps, yet tracking changes remains possible in a bounded amount of time.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Chunyang</given-names>
            <surname>Chen</surname>
          </string-name>
          , Ting Su, Guozhu Meng, Zhenchang Xing, and Yang Liu.
          <year>2018</year>
          .
          <article-title>From UI Design Image to GUI Skeleton: A Neural Machine Translator to Bootstrap Mobile GUI Implementation</article-title>
          .
          <source>In Proceedings of the 40th International Conference on Software Engineering (ICSE '18)</source>
          , Gothenburg, Sweden, pp.
          <fpage>665</fpage>
          -
          <lpage>676</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Shauvik Roy</given-names>
            <surname>Choudhary</surname>
          </string-name>
          , Alessandra Gorla, and
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Orso</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Automated Test Input Generation for Android: Are We There Yet? (E)</article-title>
          .
          <source>In 30th IEEE/ACM International Conference on Automated Software Engineering (ASE 2015)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Ting</given-names>
            <surname>Su</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>FSMdroid: Guided GUI testing of android apps</article-title>
          .
          <source>In Proceedings of the 38th International Conference on Software Engineering (ICSE 2016), Companion Volume</source>
          , Austin, TX, USA, May 14-22, 2016, pp.
          <fpage>689</fpage>
          -
          <lpage>691</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Ting</given-names>
            <surname>Su</surname>
          </string-name>
          , Guozhu Meng, Yuting Chen, Ke Wu, Weiming Yang, Yao Yao, Geguang Pu, Yang Liu, and
          <string-name>
            <given-names>Zhendong</given-names>
            <surname>Su</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Guided, Stochastic Model-based GUI Testing of Android Apps</article-title>
          .
          <source>In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE</source>
          <year>2017</year>
          ). ACM, New York, NY, USA,
          <fpage>245</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>