<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Can AI mimic human visual attention to assess e- commerce landing page engagement?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lisa Colombo</string-name>
          <email>lisa.colombo29@studenti.iulm.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Bruno</string-name>
          <email>alessandro.bruno@iulm.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence</institution>
          ,
          <addr-line>Visual Saliency, Eye Movements, e-commerce, landing page</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a study on artificial intelligence (AI) models applied to extract visual saliency from images. In particular, the research assesses the accuracy of AI in replicating human-like attention mechanisms by comparing AI-generated saliency maps with eye movement data captured through eye-tracking technology. A case study is conducted to evaluate landing page engagement with viewers. Saliency maps of banners from 4 e-commerce landing pages are extracted with TranSalNet, an AI-based Visual Saliency model and compared to eye movements recorded with a webcam-based eye-tracking platform. Normalised Scanpath Saliency (NSS), Kullback-Leibler Divergence (KL-Div), and Area Under the Curve (AUC) metrics reveal AI models performing well in central regions of visual stimuli while exhibiting some false positives and false negatives in peripheral areas. The study offers insights into visual attention and e-commerce landing page assessment from a computational viewpoint.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        This paper investigates AI's capabilities in predicting human visual attention on e-commerce
websites through a study that compares AI-generated saliency maps with human attention data
captured via eye-tracking technology. Using TranSalNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a state-of-the-art saliency prediction
model, we evaluated how well the model replicates human attention across various e-commerce
platforms, including Amazon, eBay, Shein, and Vinted. The study focuses on identifying key areas
of user interest and examines the accuracy of the AI model in predicting attention on visually
prominent and peripheral elements.
      </p>
      <p>This work aims to contribute to the growing field of AI-driven user experience optimisation
by exploring the alignment between AI predictions and human attention patterns. The findings
highlight the strengths and limitations of current AI models in saliency prediction and suggest
areas for further refinement to capture the complexity of user interactions in digital
environments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Techniques</title>
      <p>Visual saliency modelling techniques are typically grouped into bottom-up and top-down
approaches, crucial in predicting where humans look in a visual scene. Bottom-up techniques
rely on the inherent properties of the image itself, such as colour, contrast, and intensity, to
determine which areas are more likely to attract attention. These stimulus-driven methods do
not account for the observer's goals or prior knowledge. In contrast, top-down techniques
incorporate higher-level cognitive processes, such as task relevance and user intent, influencing
attention based on expectations and goals rather than purely visual cues.</p>
      <p>
        In early bottom-up models, such as the one proposed by Itti et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], saliency is determined by
combining different low-level features through centre-surround mechanisms. The model
computes the Difference of Gaussian ( ) for intensity, colour, and orientation to detect
regions of local contrast. This model assumes that areas of high contrast across these features
are more likely to attract attention. The ( ) operation is represented as in equation 1:
  1( ,  ) and   2( ,  ) are Gaussian functions with standard deviations  1 and  2, respectively.
This difference between the Gaussian functions enables the model to highlight high-contrast
regions, typically areas that draw human attention.
      </p>
      <p>
        As research advanced, top-down approaches [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] were introduced to account for the observer's
intent, task relevance, and context. These models incorporate feedback mechanisms that adjust
saliency predictions based on high-level cognitive factors. For example, a user searching for a
specific product in an online store will focus on elements such as product images, search bars,
and filters, even if these elements are not the most visually salient according to a bottom-up
approach. By integrating the observer's goals, top-down methods complement bottom-up
processes and offer a more comprehensive understanding of attention.
      </p>
      <p>
        In addition to these foundational models, Hou and Zhang [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] introduced a novel frequency-based
approach to saliency that operates in the frequency domain rather than the spatial domain. Their
model suggests that the spectral residual, which captures unpredictable or irregular aspects of
an image, is crucial in determining visual saliency. This approach uses the Fourier transform to
separate an image's amplitude and phase components, focusing on the spectral residual to
highlight areas that differ from the surrounding content. The saliency map is then generated
based on this residual information (see equation 2):
 ( ,  ) =  −1(
(2)
In equation 2  −1 denotes the inverse Fourier transform,  ( ) the amplitude spectrum of the
image, and  ( ) the phase spectrum. The spectral residual,   ( ( )) −  ( ( )) ,
captures the irregularities in the image that contribute to its saliency.
      </p>
      <p>
        The frequency-based approach can highlight unpredictable elements in an image, making it
highly effective in identifying salient regions that traditional spatial-based methods may miss.
As visual saliency research evolved, more sophisticated methods emerged, combining
bottomup and top-down processes. The Graph-Based Visual Saliency (GBVS) model by Harel et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
takes a global approach, representing an image as a fully connected graph where nodes
represent different regions and edges are weighted by visual similarity. A random walker
algorithm is applied to identify globally unique areas, offering a more holistic understanding of
saliency that considers the image's overall structure.
      </p>
      <p>
        With the advent of deep learning, Convolutional Neural Networks (CNNs) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] have dramatically
improved saliency prediction by learning low-level and high-level features directly from data.
These models combine the strengths of both bottom-up and top-down approaches. They use the
stimulus-driven nature of bottom-up processes to detect contrasts and edges while leveraging
high-level information such as object categories, context, and user goals to fine-tune saliency
predictions. CNNs have proven effective in dynamic environments such as e-commerce
platforms, where understanding user attention is critical for optimising design and enhancing
user interaction.
      </p>
      <p>Tliba et al. [10] proposed SatSal (Self-Attention Saliency), an encoder-decoder deep learning
model that leverages skip connections during decoding to account for high and low-level features
from images. SAtSal also relies on convolutional self-attention modules connecting the encoder
to the decoder branches.</p>
      <p>TranSalNet leverages dense connections and residual networks, which helps the model maintain
spatial detail while reducing computational complexity. This approach allows the model to
capture high-level semantic features (such as object categories) and low-level features (like
colour and contrast), making it well-suited for complex, dynamic environments such as
ecommerce websites.</p>
      <p>In addition to the visual attention domain, saliency models can be used in many applications,
such as image segmentation [11], video summarisation [12], image content enhancement [13],
and automatic image cropping [14].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method and Materials</title>
      <p>
        In this study, we aimed to evaluate the performance of an AI model, TranSalNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in
predicting visual saliency on various e-commerce platforms by comparing its output with human
eye-tracking data. The experiment involved 97 participants aged between 20 and 35, all with
prior experience in online shopping. Each participant was asked to use Realeye.io eye tracking
platform, that will present participants the chosen homepage images for a maximum of 4 seconds
each. Each participant begins the experiment with a calibration process to ensure the eye tracker
accurately records their gaze. The calibration involves having the participant look at a series of
points on the screen to establish a baseline for their eye movements. Once calibrated,
participants are shown a series of e-commerce homepage images for a maximum of 4 seconds
each using the realeye.io eye tracking platform. This brief exposure period is designed to
simulate the real-world scenario where users quickly scan a webpage to form an initial
impression.
      </p>
      <p>These are the images shown to the participants:</p>
      <p>As shown in the images, the stimuli used in this experiment consisted of static images and
dynamic content from each website's homepage, incorporating visual elements such as product
listings, banners, call-to-action buttons, and navigation menus. These elements provided diverse
visual content to test the model's ability to predict user focus in a real-world context. The
eyetracking data was transformed into heat maps, representing areas of high user attention. The AI
model’s predicted saliency maps were then compared with these heatmaps to assess the model’s
accuracy.</p>
      <p>The TranSalNet model processes images to predict visual saliency by following a carefully
structured pipeline. First, input images are preprocessed using padding and resizing to fit the
required input size of 384x288 pixels. This step ensures consistency across different image
inputs.</p>
      <p>The model architecture, which can be TranSalNet_Dense or TranSalNet_Res based on the
defined flag, leverages dense connections and residual networks to extract hierarchical features
critical for accurate saliency prediction. These features are then processed through the network,
and the predicted saliency map is generated.</p>
      <p>After generating the saliency maps, they are compared to ensure a one-to-one analysis with
eye-tracking data. The pipeline used in this study is illustrated in Figure 1, showcasing the steps
from image preprocessing to final visualisation.
Saliency (NSS), Kullback-Leibler Divergence (KL-Div), Area Under the Curve (AUC), the
Correlation Coefficient (CC) and the Similarity Index Measure (SIM).</p>
      <p>The Normalized Scanpath Saliency (NSS) metric was particularly important in assessing the
correspondence between predicted saliency maps and the actual fixations recorded in the
eyetracking data. Mathematically, NSS is calculated as:
= 1 ∑
  =1
 (    )− 
 
where  (    ) represents the saliency score at the fixation point (    ),   is the mean
saliency score across the image, and   is the standard deviation of the saliency scores. This
metric essentially measures how much the predicted saliency values at fixated locations deviate
from the mean saliency value, thus providing a direct way to evaluate the model’s accuracy in
predicting human attention.</p>
      <p>In addition, the Kullback-Leibler Divergence (KL-Div) was used to assess the similarity
between the predicted saliency distribution and the distribution of fixations recorded during the
eye-tracking sessions. The formula for KL-Div is:
  ( || ) = ∑  ( ) 
 ( )
 ( )
where  ( ) is the probability distribution of human fixations and  ( ) is the predicted
saliency distribution. KL-Div measures how much the two distributions diverge from each other,
with a smaller value indicating a better match between the model’s predictions and the human
data.</p>
      <p>The Area Under the Curve (AUC) was employed as a general performance metric, evaluating
the model’s ability to discriminate between fixated and non-fixated areas of the image. A high
AUC score suggests that the model accurately identifies the regions of interest within the image,
closely matching the human visual attention patterns observed through the eye-tracking data.</p>
      <p>Moreover, the Correlation Coefficient (CC) measures the linear correlation between the
predicted saliency map and the observed eye-tracking data. A value of 1 indicates a perfect
correlation, while a value of 0 indicates no correlation. The formula for CC is:
=</p>
      <p>∑( ( )− ̅)( ( )− ̅)
√∑( ( )− ̅)2 ∑( ( )− ̅)2
where  ( ) and  ( ) represent the predicted and observed saliency values, respectively, and
 ̅ and  ̅ are the mean saliency values of the two distributions.</p>
      <p>The Similarity Index Measure (SIM) evaluates the similarity between the predicted and
observed saliency maps. It compares the two distributions of saliency values, producing a score
between 0 and 1, where 1 indicates a perfect match. The SIM is calculated using the following
formula:
= ∑ min ( ( ),  ( ))
(6)
where  ( ) and  ( ) represent the predicted and observed saliency distributions,
respectively.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>In this experiment, we aimed to explore how effectively TranSalNet, an AI model designed for
saliency prediction, could replicate human visual attention patterns when applied to
ecommerce websites. The central focus is to check how well the model performed in predicting
areas of interest, such as product images, banners, and call-to-action buttons, which are critical
elements in an online shopping experience. Using eye-tracking data as the ground truth, we
compared the model's predicted saliency maps to actual user fixations, analysing performance
across multiple platforms. Each e-commerce site presented different visual layouts, offering a
variety of challenges for the model in identifying primary and secondary areas of user interest.</p>
      <p>The Amazon platform results showed an AUC of 0.79 and an NSS score of 1.30, indicating good
predictive performance. The KL-Div score was 0.00, with a perfect CC of 1.00 and a SIM score of
0.99≈1, confirming the model's reliability. The heatmap generated for Amazon highlighted
product images and call-to-action buttons as the main focal points, indicating that the algorithm
can capture key elements that attract users' attention. However, some less prominent areas, such
as product description sections, were underestimated, suggesting the need for further
optimisation for a more comprehensive detection of areas of interest. That means the saliency
model's output accounts for some false negatives.</p>
      <p>In the case of eBay, the predictive saliency model achieved an AUC rate of 0.87 and an NSS
score of 1.83, demonstrating high accuracy. The KL-Div score was 0.00, with a CC of 1.00 and a
SIM score of 0.99≈1, underscoring the model's effectiveness in capturing users’ attention
accurately. The heatmap generated for eBay showed good alignment with the central areas of
the page, where the most viewed products are located. However, there are discrepancies in the
peripheral regions and navigation sections, where the model's predictions do not fully match the
eye-tracking data. That suggests that while the algorithm performs well for the main areas of
interest, it may require improvements to capture the entire spectrum of user fixations, including
secondary elements that could influence the browsing experience.</p>
      <p>For Shein, the predictive model achieved an AUC value of 0.82, indicating a high accuracy rate
in distinguishing salient from non-salient regions. The NSS score was 1.25, showing a moderate
alignment between the predicted saliency and actual user attention. Similarly, the KL-Div score
was 0.00, the CC was 1.00, and the SIM score was 1.00, reflecting a substantial similarity between
the predictive model and the eye-tracking data. The results reflected the quality of the predictive
map, which revealed significant attention to product listings and promotional images, which are
crucial elements for a fashion-focused e-commerce site. However, the predictive maps did not
highlight some areas, such as search filters and navigation sections. That suggests that the
algorithm could benefit from further optimisation to better capture the various visual
preferences of users, especially in a dynamic context like fashion.</p>
      <p>Finally, for Vinted, the predictive saliency maps are well-aligned with the eye-tracking
data in sections where user-generated content is displayed, such as product images uploaded by
users. AUC reaching 0.95 and NSS scoring 3.36 indicate a strong correlation with the eye-tracking
data. The KL-Div score remained at 0.0, while the CC was 0.99≈1, and the SIM score was 1.0,
further validating the model's accuracy. However, some discrepancies emerge in areas with
mixed content, where the model's predictions only partially align with the user fixation data.
That indicates that while the algorithm effectively predicts the main areas of interest, it may
require further optimisation to better process the complexity of mixed content.</p>
      <p>As observed, even though the algorithm used was not fine-tuned to capture the most salient
parts of the sites, it still performed well by identifying the critical areas of interest. In fact, as we
can see from these metrics, the high AUC and NSS scores and the perfect or near-perfect CC and
SIM scores indicate that the models are highly effective in replicating actual user attention
patterns. However, the moderate NSS scores for platforms like Amazon and Shein suggest that
while the models perform well, there is still room for improvement in fine-tuning the predictions
to better capture all nuances of user behaviour. That demonstrates the algorithm's robustness
in highlighting essential elements, such as product images and call-to-action buttons, which are
crucial for user engagement.</p>
      <p>The differences in the salient areas can be attributed to the diverse nature of the e-commerce
sites analysed. Each site caters to different product categories and thus employs distinct site
structures and design elements. For example, fashion-focused sites like Shein emphasise product
listings and promotional images. In contrast, a marketplace like eBay may have a broader focus
that includes search filters and navigation tools.</p>
      <p>These variations in site design and content emphasis naturally lead to differences in what is
considered visually salient for users. Despite not being fine-tuned explicitly for each site, the
algorithm's ability to adapt to these different contexts highlights its potential for general
applicability across various types of e-commerce platforms. That underscores the importance of
considering different product categories' unique characteristics and user behaviours when
developing and optimising predictive saliency models.</p>
      <p>Moreover, I have analysed the results of the eye-tracking experiment. Fixation points, where
the gaze is held steadily for around 200-300 milliseconds (Rayner, 1998), indicate areas of
interest and attention, revealing which elements on the webpage attract the user's focus.
Saccades, which are rapid movements between fixation points, help identify the scanning
behaviour and how users navigate the visual information. The heatmaps obtained provide a
detailed representation of user fixations on various e-commerce sites, offering a more
comprehensive analysis of users' visual habits.
For Amazon, the eye-tracking data showed a substantial concentration of fixations on product
images and call-to-action buttons, confirming the importance of these elements in capturing
users' attention. However, the heatmaps also revealed significant interest in product description
sections and user reviews, which were not as well highlighted by the predictive saliency maps.
In the case of eBay, user fixations spread in the central areas with the most viewed products
but with more significant dispersion in the peripheral regions compared to the model's
predictions. That suggests that users also explore navigation sections and search filters, which
are crucial for the shopping experience but are not always captured by the predictive maps. The
eye-tracking heatmaps thus indicate a more complex and distributed exploration behaviour,
highlighting the importance of optimising all sections of the page to improve overall user
interaction.</p>
      <p>Regarding Shein, the eye-tracking data revealed greater attention to search filters, navigation
sections, and product listings. That suggests that users find these areas equally important, an
aspect that the predictive saliency code did not fully capture. The heatmaps indicate that, besides
products, users frequently interact with search and filter tools to customise their shopping
experience. Future method optimisations should tackle the abovementioned aspects.
Finally, for Vinted, the eye-tracking heatmaps aligned strongly with the predictive maps in
sections with user-generated content, such as product images. However, the heatmaps also
highlighted significant interest in navigation sections and product details. Those areas that were
not highlighted as well as by the predictive saliency maps. That suggests that while the algorithm
effectively predicts the main areas of interest, it must be refined to capture the full spectrum of
users' visual interactions with higher accuracy.</p>
      <p>The eye-tracking heatmaps provide invaluable insights into user fixations across various
ecommerce platforms, revealing a nuanced picture of users' visual habits. They highlight not only
the primary areas of interest, such as product images and call-to-action buttons, but also
secondary elements, like product descriptions, user reviews, and navigation tools, that are
equally crucial for a comprehensive user experience. While the predictive saliency maps
performed well in identifying critical focal points, the eye-tracking data uncovered additional
areas of interest that the algorithm did not fully capture. This discrepancy underscores the
importance of integrating eye-tracking data into the development and refinement of predictive
models. By doing so, we can ensure a more holistic understanding of user behaviour and
optimise e-commerce sites to enhance user interaction and satisfaction across different product
categories and site structures.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This research explored the integration of artificial intelligence (AI) into the field of visual
saliency modelling, specifically within the context of e-commerce. The primary objective was to
enhance the understanding of how AI models can replicate human attention mechanisms,
thereby improving user interaction and engagement with online platforms. Visual saliency plays
a crucial role in identifying areas of interest in visual scenes, both in human vision and AI
systems. This study focused on assessing how AI-driven models, like TranSalNet, could predict
user attention and saliency across various e-commerce websites.</p>
      <p>The findings revealed that the model effectively identified central areas of visual interest,
particularly product images and promotional content. However, some discrepancies were
observed in peripheral elements, such as navigation tools and search filters. That suggests that
while AI-driven models can replicate core areas of user focus, further improvements are needed
to capture more subtle or peripheral elements equally important for user interaction. Including
top-down processes, which integrate user intent and cognitive context, could significantly
enhance the predictive power of these models.</p>
      <p>Another key insight from the study was the role of personalisation in saliency modelling.
Individual user preferences, browsing behaviour, and cultural differences contribute to
variations in visual attention patterns. AI models that incorporate real-time user data and
adaptive mechanisms have the potential to provide personalised predictions, ensuring that
websites are optimised for different user groups. Future work in this area could focus on
developing models that dynamically adjust to specific user profiles, enhancing the overall user
experience.</p>
      <p>Moreover, as e-commerce platforms continue to integrate dynamic and interactive content, it
becomes essential for saliency models to handle both spatial and temporal dynamics. The results
suggest that optimised for static images, the current models could benefit from advancements in
handling time-based interactions, such as scrolling, animations, and videos. Future research
should prioritise developing models capable of processing these temporal aspects to provide a
more comprehensive understanding of user attention.</p>
      <p>In conclusion, this research highlights the potential of AI models like TranSalNet in optimising
the design of e-commerce platforms by predicting user attention patterns. While significant
progress has been made, there remains room for improvement, particularly in terms of
personalisation, peripheral element detection, and the incorporation of temporal dynamics. The
continued advancement of AI-driven saliency models will improve user experiences and provide
valuable insights for enhancing web design and user interface optimisation across various
industries.
[10] Tliba, M., MA, K., Ghariba, B., Chetouani, A., Çöltekin, A., MS, S. and Bruno, A., 2022.</p>
      <p>SATSal: A Multi-Level Self-Attention Based Architecture for Visual Saliency Prediction.</p>
      <p>IEEE ACCESS, 10, pp.20701-20713.
[11] Liu, X., Huang, G., Yuan, X. et al. Weakly supervised semantic segmentation via
saliency perception with uncertainty-guided noise suppression. Vis Comput (2024).
https://doi.org/10.1007/s00371-024-03574-1
[12] Zhong, R., Xiao, D., Dong, S. and Hu, M., 2021. Spatial attention model‐modulated
bi‐directional long short‐term memory for unsupervised video summarisation.</p>
      <p>Electronics Letters, 57(6), pp.252-254.
[13] Bruno, A., Gugliuzza, F., Ardizzone, E., Giunta, C.C. and Pirrone, R., 2019. Image
content enhancement through salient regions segmentation for people with color vision
deficiencies. i-Perception, 10(3), p.2041669519841073.
[14] Wang, C., Niu, L., Zhang, B. and Zhang, L., 2023. Image Cropping With
SpatialAware Feature and Rank Consistency. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (pp. 10052-10061).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Norvig</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Artificial intelligence: a modern approach</article-title>
          . Pearson.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gao</surname>
          </string-name>
          , W. eds.,
          <year>2014</year>
          .
          <article-title>Visual saliency computation: A machine learning perspective</article-title>
          (Vol.
          <volume>8408</volume>
          ). Springer
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Itti</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Niebur</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <year>1998</year>
          .
          <article-title>A model of saliency-based visual attention for rapid scene analysis</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1254</fpage>
          -
          <lpage>1259</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Tatler</surname>
            ,
            <given-names>B.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baddeley</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gilchrist</surname>
            ,
            <given-names>I.D.</given-names>
          </string-name>
          ,
          <year>2005</year>
          .
          <article-title>Visual correlates of fixation selection: effects of scale and time</article-title>
          .
          <source>Vision Research</source>
          ,
          <volume>45</volume>
          (
          <issue>5</issue>
          ), pp.
          <fpage>643</fpage>
          -
          <lpage>659</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>[5] LJOVO TranSalNet GitHub repository</article-title>
          . Available: https://github.com/LJOVO/TranSalNet. Accessed: Sep.
          <volume>17</volume>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Fang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Zhang,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Zhang</surname>
          </string-name>
          , Q., Han,
          <string-name>
            <surname>J</surname>
          </string-name>
          . and Han,
          <string-name>
            <surname>J</surname>
          </string-name>
          .,
          <year>2022</year>
          .
          <article-title>Densely nested top-down flows for salient object detection</article-title>
          .
          <source>Science China Information Sciences</source>
          ,
          <volume>65</volume>
          (
          <issue>8</issue>
          ), p.
          <fpage>182103</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Hou</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , L.,
          <year>2007</year>
          .
          <article-title>Saliency detection: A spectral residual approach</article-title>
          .
          <source>2007 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Harel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>2007</year>
          .
          <article-title>Graph-based visual saliency</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <volume>19</volume>
          , pp.
          <fpage>545</fpage>
          -
          <lpage>552</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.,
          <year>2015</year>
          .
          <article-title>Deep learning</article-title>
          .
          <source>nature</source>
          ,
          <volume>521</volume>
          (
          <issue>7553</issue>
          ), pp.
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>