<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applying Generative Adversarial Networks to perform gaze path prediction in websites</article-title>
        <subtitle>(Discussion Paper)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianluca Bonifazi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Corradini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca Porcino</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Scopelliti</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico Ursino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Virgili</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DII, Polytechnic University of Marche</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data Labs, Daimler AG</institution>
      </contrib-group>
      <abstract>
        <p>In recent years, gaze path prediction has become a topic widely studied by computer scientists, who have proposed a variety of approaches to address it in the context of natural images. Among these approaches, those based on deep learning and, in particular, on Generative Adversarial Networks (GANs) have proven to be extremely accurate. When moving from natural images to websites, gaze path prediction becomes much more complex. As evidence of this, no GAN-based approaches have yet been presented to solve this problem. In this paper, we aim to fill this gap by proposing two GAN-based approaches capable of predicting the gaze path of a user looking at a website.</p>
      </abstract>
      <kwd-group>
        <kwd>Gaze path prediction</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Generative Adversarial Networks</kwd>
        <kwd>PathGAN</kwd>
        <kwd>FiWI dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the years, more and more content has appeared on the Web, making it increasingly
difficult to capture the attention of a user when she visits a website. Evaluating the attention
a user pays when visiting a website is a non-trivial problem, which depends on several factors.
To cope with it, researchers have proposed a powerful tool, namely the visual scanpath [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is a formal representation of the path followed by a user’s gaze
while looking at an image. In the past, this concept was applied to natural images. However,
the rapid growth of the Web has raised increasing interest in applying it to websites as well.
The website scenario, though, is much more complex than the natural image one. In fact, a
single web page can contain multiple natural images, text, logos, and animations. This peculiarity
makes gaze path prediction tools designed for natural images much less effective when applied
to websites. Indeed, the presence of several competing stimuli in a web page makes the accurate
prediction of eye fixations much more complex [2].
      </p>
      <p>To address this problem, the literature has proposed several approaches belonging to
different categories. Among them, deep learning is one of the categories that has recently
attracted the most interest. One of the first proofs of the effectiveness of deep learning in
this area is reported in [3]. After this attempt, several other approaches have been proposed, and
many of them have obtained important results. In particular, one architecture that is gaining
increasing attention is the Generative Adversarial Network (hereafter, GAN) [4, 5, 6, 7]. It has
proven to be extremely valuable for gaze path prediction on natural images [8, 9, 10], where the
corresponding approaches have achieved state-of-the-art results. However, as for websites, to
the best of our knowledge, no GAN-based approaches have been proposed yet.</p>
      <p>In this paper, we aim to fill this gap by proposing two GAN-based approaches specifically
designed to operate on websites. We start from PathGAN [10], an approach that has been shown
to be very effective for gaze path prediction on natural images, and propose two variants of it.
They feature a number of refinements derived from several observations we made during
experiments conducted “in the field”.</p>
      <p>The outline of this paper is as follows: Section 2 provides a technical description of our
approaches. Section 3 discusses the experiments performed and the results obtained. Finally,
Section 4 draws some conclusions and provides a look at possible future developments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Characterization of the proposed approach</title>
      <p>In the field of gaze path prediction, scientific research is still at an early stage. Among the few
studies to address this problem, a GAN-based approach, called PathGAN [10], stands out for
its results. It predicts the visual scanpath of people watching natural images in both normal
and 360-degree formats. Its architecture is shown in Figure 1. The first part of the network is a
generator that receives an image and returns a gaze path denoting the eye path of a potential
user in observing that image. The generated path is a sequence of 63 fixations, each modeled
by a tuple of 4 variables, namely an x-coordinate, a y-coordinate, a timestamp, and an end-of-path
probability. The first three variables indicate the position and the duration of each fixation.
The fourth one guarantees the possibility of generating paths of variable length. To this end,
a threshold is defined for this variable, which makes it possible to determine which fixations
should not be included in the final prediction.</p>
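      <p>To make the thresholding mechanism concrete, the following is a minimal sketch of how a generated path can be truncated; the array layout and the threshold value are illustrative assumptions, not the values used by PathGAN.</p>
      <preformat>
import numpy as np

EOP_THRESHOLD = 0.5  # hypothetical threshold on the end-of-path probability

def truncate_scanpath(path, threshold=EOP_THRESHOLD):
    # path: array of shape (63, 4) with columns
    # (x, y, timestamp, end_of_path_probability).
    eop = path[:, 3]
    over = np.nonzero(eop > threshold)[0]
    # Keep fixations up to (and including) the first one flagged as
    # "end of path"; otherwise keep all 63.
    end = over[0] + 1 if over.size else path.shape[0]
    return path[:end, :3]  # the probability column is dropped in the output

# Usage with a random stand-in for a generated path:
prediction = truncate_scanpath(np.random.rand(63, 4))
      </preformat>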
      <p>We started by applying the original PathGAN to websites; unfortunately, this task did not
produce encouraging results. However, it allowed us to identify some problems to solve in
order to improve performance.</p>
      <p>First of all, we decided to update the weights of the generator and the discriminator with
different frequencies, but updating the discriminator more often than the generator did not
lead to performance improvements. Furthermore, assigning a very low weight
to the content loss (for instance, a weight equal to 0.05) makes the discriminator much stronger
than the generator. Training only the generator for the first 5 epochs was not enough to prevent
this phenomenon.</p>
      <p>The length of the gaze path to be predicted was another factor that greatly complicated the
problem. In fact, in order to handle paths of variable length, the end-of-path probability was
included. However, the training phase proved unable to tune this variable adequately, leading
to paths that were either too short or too long. Overall, we experienced completely wrong
predictions that, in the long run, led to mode collapse during training. In this case, the generator
tends to return gaze path predictions matching the edge of the image and completely ignoring
the original ground truth.</p>
      <p>The cause of this phenomenon was traced to a combination of several elements. The
discriminator becomes too strong with respect to the generator, which can no longer make
realistic predictions. The scarcity of data available does not help to solve this problem. The
discriminator clearly overfits on the training data after several epochs, and this cannot be prevented
by changing the update frequency of the weights of the generator and the discriminator.</p>
      <p>The next step led us to improve the way the network is trained and to change
the weights assigned to the content and adversarial losses. The only way to make the discriminator
weaker is to update the generator weights more often. Decreasing the weight assigned to the
adversarial loss in the objective function produced positive effects. A higher content loss weight
prevented the discriminator from taking over. The quality of the generated samples increased
considerably, allowing the network training to be completed successfully. In conclusion, we
decided to multiply the adversarial loss by a weight equal to 0.35 and the content loss by a
weight equal to 1.</p>
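      <p>As a minimal sketch, the resulting generator objective can be written as follows; since we only fix the two weights here, the binary cross-entropy adversarial term and the L1 content term are assumptions made for illustration.</p>
      <preformat>
import tensorflow as tf

ADV_WEIGHT = 0.35      # weight of the adversarial loss (our final choice)
CONTENT_WEIGHT = 1.0   # weight of the content loss (our final choice)

bce = tf.keras.losses.BinaryCrossentropy()

def generator_loss(disc_output_on_fake, generated_path, ground_truth_path):
    # Adversarial term: how convincingly the generated path fools the
    # discriminator (label 1 = "real").
    adversarial = bce(tf.ones_like(disc_output_on_fake), disc_output_on_fake)
    # Content term: distance between generated and observed scanpaths
    # (an L1 distance is assumed here for illustration).
    content = tf.reduce_mean(tf.abs(generated_path - ground_truth_path))
    return ADV_WEIGHT * adversarial + CONTENT_WEIGHT * content
      </preformat>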
      <p>Afterwards, we tuned the number of weight updates for each part of the
network. In particular, at each step we update the weights of the generator 16 times and those
of the discriminator 4 times. We also introduced other changes to avoid overfitting.
In particular, taking inspiration from saliency prediction models, we added noise to the images
passed to the discriminator.</p>
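      <p>The sketch below illustrates a training step with this schedule, reusing the generator_loss above; the conditional discriminator inputs and the noise standard deviation are assumptions made for illustration, as we do not report them here.</p>
      <preformat>
import tensorflow as tf

G_UPDATES = 16     # generator weight updates per training step
D_UPDATES = 4      # discriminator weight updates per training step
NOISE_STD = 0.05   # hypothetical standard deviation of the added noise

bce = tf.keras.losses.BinaryCrossentropy()

def discriminator_loss(real_out, fake_out):
    # Standard GAN loss: real pairs labeled 1, generated pairs labeled 0.
    return (bce(tf.ones_like(real_out), real_out)
            + bce(tf.zeros_like(fake_out), fake_out))

def train_step(images, real_paths, generator, discriminator, g_opt, d_opt):
    for _ in range(D_UPDATES):
        # Noise on the discriminator inputs, as in saliency prediction models.
        noisy = images + tf.random.normal(tf.shape(images), stddev=NOISE_STD)
        with tf.GradientTape() as tape:
            fake_paths = generator(noisy, training=True)
            d_loss = discriminator_loss(
                discriminator([noisy, real_paths], training=True),
                discriminator([noisy, fake_paths], training=True))
        grads = tape.gradient(d_loss, discriminator.trainable_variables)
        d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))
    for _ in range(G_UPDATES):
        with tf.GradientTape() as tape:
            fake_paths = generator(images, training=True)
            g_loss = generator_loss(
                discriminator([images, fake_paths], training=True),
                fake_paths, real_paths)
        grads = tape.gradient(g_loss, generator.trainable_variables)
        g_opt.apply_gradients(zip(grads, generator.trainable_variables))
      </preformat>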
      <p>After all these changes and updates to the original PathGAN, we obtained a new version of
it called NormalGAN. It has the same architecture as PathGAN but is specifically designed to
operate on websites.</p>
      <p>In addition to this first variant of PathGAN, we designed a second one. To this end, we
started with the considerations that had led to NormalGAN and made additional changes. In
particular, we modified the underlying architecture so that it would follow the structure of a
conditional Wasserstein GAN [11] (for this reason, we call this new variant WGAN). We
also modified the training process to comply with the characteristics of a conditional Wasserstein
GAN. In particular, we updated the discriminator’s weights more often than the generator’s
ones (i.e., 5 times compared to 1) because weight clipping was introduced. We reset the weights
of the various loss terms to their original values (i.e., 0.05 for content loss and 1 for adversarial
loss). These two changes allowed us to obtain a balance between the generator’s and the
discriminator’s strength. In fact, increasing the update rate of the weights leads to a scenario
where the generator and the discriminator learn too much from our dataset, causing overfitting.
On the other hand, decreasing this parameter implies that the training process takes longer, or
that, for the same duration, the generator and/or the discriminator cannot learn enough from
the dataset.</p>
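      <p>A minimal sketch of the clipping step follows; the clipping bound is an assumption (0.01 is the value suggested in the original Wasserstein GAN work), as it is not reported here.</p>
      <preformat>
import tensorflow as tf

CLIP_VALUE = 0.01            # assumed bound; not reported in the paper
D_UPDATES, G_UPDATES = 5, 1  # discriminator vs. generator updates per step

def clip_discriminator_weights(discriminator, bound=CLIP_VALUE):
    # Weight clipping keeps the discriminator (critic) approximately
    # Lipschitz, as required by the Wasserstein formulation; it is
    # applied after every discriminator update.
    for var in discriminator.trainable_variables:
        var.assign(tf.clip_by_value(var, -bound, bound))
      </preformat>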
      <p>In conclusion to these discussions, Table 1 provides a scheme of the differences between
PathGAN, NormalGAN, and WGAN.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>Our approaches for gaze path prediction of users accessing websites have been implemented
in a web-based tool; in the design and realization of it, we paid particular attention to user
experience. We adopted Python as the programming language, Keras and TensorFlow for handling
GANs, and Django for managing the underlying CMS. With User Experience Design techniques
in mind, we created a home page where a user can upload an image and specify the gaze path
prediction technique she wants to apply. Our tool returns the original image, the corresponding
ground truth and the gaze path prediction. In Figure 2, we report an example of the output
provided by it.</p>
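      <p>As an illustration of the tool’s structure, the following is a hypothetical Django view for the home page; the form fields, template names, and predict() helper are placeholders and do not reproduce the actual code of our tool.</p>
      <preformat>
# views.py (illustrative sketch, not the tool's actual implementation)
from django import forms
from django.shortcuts import render

class UploadForm(forms.Form):
    image = forms.ImageField()
    technique = forms.ChoiceField(choices=[
        ("pathgan", "PathGAN"),
        ("normalgan", "NormalGAN"),
        ("wgan", "WGAN"),
    ])

def home(request):
    form = UploadForm(request.POST or None, request.FILES or None)
    if request.method == "POST" and form.is_valid():
        # predict() is a hypothetical helper wrapping the Keras models; it
        # would return the image, the ground truth, and the prediction.
        result = predict(form.cleaned_data["image"],
                         form.cleaned_data["technique"])
        return render(request, "result.html", {"result": result})
    return render(request, "home.html", {"form": form})
      </preformat>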
      <sec id="sec-3-1">
        <title>3.1. Description of the dataset</title>
        <p>To the best of our knowledge, the only dataset available in the literature that contains both
web page layout images and gaze data is FiWI (Fixations in Webpage Images) [2]. In fact, it
is used as a reference in all research on gaze path prediction in the GUI domain. It
was constructed by collecting data from 11 volunteers, who observed 149 websites. The limited
number of volunteers makes FiWI unable to represent all the ways in which people observe
websites. Moreover, its gender distribution is unbalanced, because 7 volunteers were female and
4 were male. All of these considerations prompted us to build a new dataset aiming to avoid,
or at least mitigate, these problems.</p>
        <p>From the beginning, we thought it necessary to increase the number of websites
present in the dataset and to add new forms of layout beyond those considered by FiWI, because
the latter did not fully represent the current variety of the Web. In particular, the original layouts
of FiWI were three, namely (i) Pictorial, (ii) Text, and (iii) Mixed. First, we extended
the number of websites belonging to each layout, while keeping the corresponding fractions
balanced. Furthermore, we added a fourth layout, i.e., Business, which comprises analytical
layouts (in particular, dashboards) and Daimler intranet layouts. Finally, we considered that the
principles of website design have changed over time. For this reason, we decided to put in our
dataset both the original and the updated layouts of the pages already present in FiWI. We then
removed the FiWI websites without an updated version available (for example, web
pages of companies that no longer exist) to get a balanced set of old and new layouts.</p>
        <p>Starting with this original core of 149 images, we reached a total of 262 web layouts. Finally,
we felt it necessary to engage more volunteers than FiWI, both to include more nuances of how
different people look at images and to balance their gender. For this reason, we selected
100 volunteers, 50 males and 50 females. Each of the 262 images was seen by 11 or 12 volunteers,
in such a way as to collect 3000 gaze paths. Each volunteer was asked to watch 30 images.
Compared with FiWI, the number of available gaze paths is more than double. Table 2 reports
some statistics on FiWI and our dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Results of the experiments</title>
        <p>
          In this section, we apply PathGAN, NormalGAN and WGAN to the images of the dataset
described in Section 3.1 to verify whether our two variants actually perform better. To carry out
this task, we relied on the Jarodzka metrics [12]. These define scanpaths as a series of geometric
vectors, called saccade vectors, and compare them along the following dimensions: (i) Vector
shape, denoting the difference in shape between saccade vectors; (ii) Vector direction, indicating
the difference in direction (i.e., angle) between saccade vectors; (iii) Vector length, representing
the difference in amplitude between saccade vectors; (iv) Vector position, denoting the distance
between fixations; (v) Fixation duration, indicating the difference in duration between fixations.
All these measures range in the real interval [0, 1]; the higher their value, the closer the saccade
vectors and the better the performance of the approach. Specifically, the shape, length, and
position measures are normalized against the screen diagonal, the direction measure against π,
and the duration measure against the maximum of the two durations being compared.
        </p>
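        <p>For illustration, the sketch below derives saccade vectors from a scanpath and computes one of the five dimensions (Vector direction); the naive pairwise alignment is a simplification, since Jarodzka et al. align saccades with a shortest-path procedure.</p>
        <preformat>
import numpy as np

def saccade_vectors(fixations):
    # A scanpath becomes the sequence of vectors joining consecutive
    # fixation positions (x, y).
    pts = np.asarray(fixations, dtype=float)
    return pts[1:] - pts[:-1]

def direction_similarity(path_a, path_b):
    # Illustrative "Vector direction" dimension: angular differences
    # between aligned saccade vectors, normalized against pi and turned
    # into a similarity in [0, 1].
    va, vb = saccade_vectors(path_a), saccade_vectors(path_b)
    n = min(len(va), len(vb))  # naive alignment, for illustration only
    ang_a = np.arctan2(va[:n, 1], va[:n, 0])
    ang_b = np.arctan2(vb[:n, 1], vb[:n, 0])
    diff = np.abs(ang_a - ang_b)
    diff = np.minimum(diff, 2 * np.pi - diff)  # wrap angles to [0, pi]
    return float(1.0 - np.mean(diff / np.pi))
        </preformat>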
        <p>Since each website in our dataset was watched by 11 or 12 different users (see Section 3.1),
there is no single ground truth for it; instead, the gaze path of each user watching it was considered
as a ground truth. Therefore, the corresponding gaze path prediction returned by the approach
to evaluate was compared with each ground truth and, next, the average performance for that
website was computed. Finally, the performance associated with the various websites was
averaged to obtain the evaluation of the approach with respect to the overall dataset. Table 3
shows the results obtained.</p>
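        <p>A sketch of this two-level averaging, assuming a similarity function like the one sketched above:</p>
        <preformat>
import numpy as np

def evaluate_approach(websites, predict, similarity):
    # websites: iterable of (image, list_of_ground_truth_paths) pairs.
    per_site = []
    for image, ground_truths in websites:
        pred = predict(image)
        # Average the similarity of the single prediction against every
        # user's ground truth for this website...
        per_site.append(np.mean([similarity(pred, gt) for gt in ground_truths]))
    # ...then average across all websites.
    return float(np.mean(per_site))
        </preformat>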
        <p>From the analysis of this table, we can see that the original PathGAN returns results much
lower than NormalGAN and WGAN for all the metrics adopted. This is due to the
fact that it was designed for natural images and not for websites. If we compare NormalGAN and
WGAN, we can observe that both of them predict vector shape and path length very well. The
fixation position score is also high for both approaches. Direction similarity decreases significantly
for both of them, although the values obtained still remain acceptable. Instead, both variants
show a low performance in predicting the duration of each fixation. This parameter is by far the
main weakness of our approaches, but also of the original PathGAN, from which they inherit
this issue.</p>
        <p>Table 3 also shows that WGAN is the best of the three approaches, because it achieves the
highest score in all five metrics. NormalGAN also shows a very good performance, only slightly
lower than WGAN. Finally, this table also highlights that both approaches
have the same strengths and weaknesses, because they perform well and poorly on the same
metrics.</p>
        <p>As a further test, we carried out an “absolute” evaluation of WGAN (i.e., the “winner” of
the previous comparison) to verify if it was adequate. In this task, we adopted the One human
baseline technique [13]. Due to space limitations, we cannot illustrate this experiment here. We
can only say that the results obtained were extremely satisfactory.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we have proposed NormalGAN and WGAN, i.e., two variants of PathGAN
conceived to predict the gaze path of a user watching a website. First, we have explained the
motivations underlying our approach. Then, we have illustrated the reasons leading us to define
the two variants and have described their technical features. Afterwards, we have illustrated our
dataset, which represents an advance over FiWI, the dataset currently adopted in this field.
Finally, we have presented an experiment showing that our two variants outperform
PathGAN when operating on websites.</p>
      <p>Our work should not be considered an ending point, but rather a starting point for further
research efforts. For instance, it could be combined with an approach for saliency map prediction
in such a way that the latter guides the former towards more accurate results. Furthermore,
it could be interesting to apply reinforcement learning in this scenario, with the ultimate goal of
defining a reward function able to highlight the correct aspects of web interfaces and, therefore,
to ensure an appropriate training of the model.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was partially funded by the Marche Region under the project “Human Digital
Flexible Factory of the Future Laboratory (HDSFIab) - POR MARCHE FESR 2014-2020 - CUP
B16H18000050007”. The authors thank Daimler AG for supporting them in all their
testing activities by providing its laboratories and encouraging its employees to participate as
volunteers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Helfman</surname>
          </string-name>
          ,
          <article-title>Visual scanpath representation</article-title>
          ,
          <source>in: Proc. of the Symposium on Eye-Tracking Research &amp; Applications (ETRA'10)</source>
          , Austin, Texas, USA,
          <year>2010</year>
          , pp.
          <fpage>203</fpage>
          -
          <lpage>210</lpage>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] C. Shen, Q. Zhao, Webpage saliency, in: Proc. of the European Conference on Computer Vision (ECCV’14), Zurich, Switzerland, 2014, pp. 33–46. Springer.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, Proc. of the International Conference on Learning Representations (ICLR’14) (2013). ICLR Press.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] V. Singh, H. Rashwan, S. Romani, F. Akram, N. Pandey, M. Sarker, A. Saleh, M. Arenas, M. Arquez, D. Puig, Breast tumor segmentation and shape classification in mammograms using generative adversarial and convolutional neural network, Expert Systems with Applications 139 (2020) 112855. Elsevier.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] R. He, X. Li, G. Chen, G. Chen, Y. Liu, Generative adversarial network-based semi-supervised learning for real-time risk warning of process industries, Expert Systems with Applications 150 (2020) 113244. Elsevier.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] J. Oh, J. Hong, J. Baek, Oversampling method using outlier detectable generative adversarial network, Expert Systems with Applications 133 (2019) 1–8. Elsevier.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] G. Douzas, F. Bacao, Effective data generation for imbalanced learning using conditional generative adversarial networks, Expert Systems with Applications 91 (2018) 464–471. Elsevier.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Pan, C. C. Ferrer, K. McGuinness, N. O’Connor, J. Torres, E. Sayrol, X. Giró-i-Nieto, SalGAN: visual saliency prediction with generative adversarial networks, arXiv preprint arXiv:1701.01081 (2017).</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Y. Li, Y. Zhang, Webpage saliency prediction with two-stage generative adversarial networks, arXiv preprint arXiv:1805.11374 (2018).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. Assens, X. Giró-i-Nieto, K. McGuinness, N. O’Connor, PathGAN: visual scanpath prediction with generative adversarial networks, in: Proc. of the European Conference on Computer Vision (ECCV’18), Munich, Germany, 2018, pp. 406–422.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville, Improved training of Wasserstein GANs, 2017. arXiv:1704.00028.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] H. Jarodzka, K. Holmqvist, M. Nyström, A vector-based, multidimensional scanpath similarity measure, in: Proc. of the Symposium on Eye Tracking Research &amp; Applications (ETRA’10), Austin, TX, USA, 2010, pp. 211–218. ACM.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] A. Borji, Saliency prediction in the deep learning era: Successes and limitations, IEEE Transactions on Pattern Analysis and Machine Intelligence (2019) 679–700. IEEE.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>