<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Analysis for Recommender User Interfaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastian Lubos</string-name>
          <email>sebastian.lubos@tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Felfernig</string-name>
          <email>alexander.felfernig@tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damian Garber</string-name>
          <email>damian.garber@tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viet-Man Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thi Ngoc Trang Tran</string-name>
          <email>ttrang@ist.tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>User Interfaces for Recommender Systems, Usability Analysis, Multimodal Large Language Models</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Graz University of Technology</institution>
          ,
          <addr-line>Infeldgasse 16b, Graz, 8010</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Usability is a key factor in the effectiveness of recommender systems. However, the analysis of user interfaces is a time-consuming process that requires expertise. Recent advances in multimodal large language models (LLMs) offer promising opportunities to automate such evaluations. In this work, we explore the potential of multimodal LLMs to assess the usability of recommender system interfaces by considering a variety of publicly available systems as examples. We take user interface screenshots from several of these recommender platforms to cover both preference elicitation and recommendation presentation scenarios. An LLM is instructed to analyze these interfaces with regard to different usability criteria and provide explanatory feedback. Our evaluation demonstrates how LLMs can support heuristic-style usability assessments at scale and thus help improve user experience.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recommender systems are a central component of many digital platforms, where they provide
personalized item suggestions to help users navigate large sets of options [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While the quality of the
underlying recommendation algorithm is important [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the overall effectiveness of a recommender
system also depends on how well users can interact with the interface. Usability and user experience
play a key role in enabling users to express preferences, interpret recommendations, and make informed
choices [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Even highly accurate recommendations may fail to deliver value if the interface is difficult
to navigate or lacks transparency.
      </p>
      <p>
        Traditional usability evaluation methods include usability testing with real users [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and expert
inspections based on heuristic principles [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. While these methods are effective, they are time-consuming and
require expert involvement. General usability guidelines, such as Nielsen’s heuristics [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], offer structured
support, and recommender-specific frameworks further improve contextual relevance [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Nevertheless,
usability assessments remain resource-intensive and are thus rarely applied across platforms.
      </p>
      <p>
        To reduce this effort, automated solutions such as rule-based tools and heuristic checkers have been
proposed [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. However, these often cover only limited usability dimensions and struggle with
subjective or context-specific issues [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. More recently, multimodal large language models (LLMs)
that can process both visual and textual inputs have emerged [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Early studies demonstrate their
ability to identify usability issues in design mockups [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and mobile interfaces [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], although expert
validation remains necessary. Initial research on the alignment between LLM-based analyses and expert
assessments reports promising accuracy in different scenarios [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ], but more studies are needed to
confirm these results.
      </p>
      <p>While these approaches address general usability, they have not yet been applied to the specific
challenges of recommender system interfaces, such as explainability, feedback mechanisms, and
preference elicitation workflows, which are central to the user experience in this context. In this work, we
explore how a multimodal LLM can help to analyze the usability of ten publicly available recommender
interfaces based on explicitly defined criteria. We review the analysis results to highlight the feasibility
and benefits of automated usability analysis for recommender interfaces and outline directions for
future research.</p>
      <p>The paper is organized as follows: Section 2 describes the experimental setup and implementation
details. Section 3 presents the results. Section 4 discusses implications and future work. Finally, the
paper is concluded in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Usability Analysis of Recommender Interfaces</title>
      <p>
        Much of the focus in recommender system research is put on the accuracy of algorithms and personalization
strategies [
        <xref ref-type="bibr" rid="ref2">2, 17</xref>
        ]. However, the quality of recommender user interfaces plays an equally important role
in shaping the overall user experience. This experience can be assessed through usability analysis, which
is concerned with how effectively users can navigate, interpret, and interact with different
parts of the recommender system [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. These include possibilities to express explicit preferences,
understand why items are recommended, review recommended items, and provide feedback.
      </p>
      <p>The following sections outline the considered recommender scenarios and usability criteria, and describe
the LLM-based analysis in detail.</p>
      <sec id="sec-2-1">
        <title>2.1. Recommender Scenarios</title>
        <p>To explore the LLM-based usability analysis across a diverse set of recommender interfaces, we selected
ten publicly accessible platforms from various item domains, which are summarized in Table 1. These
systems vary in layout complexity, interaction mechanisms, and types of recommendations, which
allowed us to review the automated usability analysis in varying contexts and gauge its
generalizability. Each platform was assessed in two typical usage scenarios: (i) preference elicitation,
and (ii) recommendation presentation.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Evaluated recommender platforms, their item domains, and URLs.</p></caption>
          <table>
            <thead>
              <tr><th>Platform</th><th>Item Domain</th><th>URL</th></tr>
            </thead>
            <tbody>
              <tr><td>Amazon</td><td>E-commerce</td><td>https://www.amazon.com</td></tr>
              <tr><td>Goodreads</td><td>Books</td><td>https://www.goodreads.com</td></tr>
              <tr><td>Google News</td><td>News Articles</td><td>https://news.google.com</td></tr>
              <tr><td>KaptnCook</td><td>Recipes</td><td>https://www.kaptncook.com</td></tr>
              <tr><td>Last.fm</td><td>Music</td><td>https://www.last.fm</td></tr>
              <tr><td>Netflix</td><td>Movies &amp; TV Shows</td><td>https://www.netflix.com</td></tr>
              <tr><td>Pinterest</td><td>Visual Content</td><td>https://www.pinterest.com</td></tr>
              <tr><td>Spotify</td><td>Music</td><td>https://open.spotify.com</td></tr>
              <tr><td>Steam</td><td>Video Games</td><td>https://store.steampowered.com</td></tr>
              <tr><td>YouTube</td><td>Videos</td><td>https://www.youtube.com</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <sec id="sec-2-1-2">
          <p>To ensure a comparable situation for each platform, we considered a new-user scenario in which
a user interacts with the recommender for the first time. For this purpose, we used a desktop browser
in incognito mode to avoid personalization effects and simulate a first-time user experience (a new
user account was registered if needed to use the application). For each platform, we captured a
representative screenshot for each usage scenario. Screenshots were taken at full resolution and
included the visible viewport with relevant UI context (e.g., navigation bars, filters, recommendation
labels). Depending on the platform, the preference elicitation either showed the default onboarding
screens or initial filter options. The recommendation presentation showed either the main homepage
(dashboard) or a detail page including recommended items.</p>
          <p>To ensure comparability across platforms while accounting for platform-specific interaction patterns,
we defined explicit user tasks for each platform. This allowed us to maintain a consistent evaluation
structure while respecting the nuances of individual interfaces. The tasks for both scenarios and all
platforms are presented in Table 2. They were used to select the screenshots and provide contextual
information to the LLM during the usability analysis.</p>
        </sec>
        <sec id="sec-2-1-3">
          <table-wrap id="tab2">
            <label>Table 2</label>
            <caption><p>User tasks for the preference elicitation and recommendation presentation scenarios.</p></caption>
            <table>
              <thead>
                <tr><th>Platform</th><th>Preference Elicitation Task</th><th>Recommendation Presentation Task</th></tr>
              </thead>
              <tbody>
                <tr><td>Amazon</td><td>Search for “Bluetooth headphones” and interact with product listings to express shopping intent.</td><td>Review the related product recommendations shown on the product or search results page.</td></tr>
                <tr><td>Goodreads</td><td>Rate previously read books as part of the onboarding process to express reading preferences.</td><td>Review the initial book recommendations generated based on ratings.</td></tr>
                <tr><td>Google News</td><td>Select preferred news topics or regions during setup to tailor content delivery.</td><td>Review the personalized news feed on the dashboard.</td></tr>
                <tr><td>KaptnCook</td><td>Indicate disliked ingredients during the initial setup to personalize meal suggestions.</td><td>Review the list of recommended recipes based on stated preferences.</td></tr>
                <tr><td>Last.fm</td><td>Choose a trending artist to indicate music preferences.</td><td>Review the list of recommended tracks or artists based on the selected input.</td></tr>
                <tr><td>Netflix</td><td>Select at least three preferred titles during the onboarding setup to express content preferences.</td><td>Review the personalized dashboard with recommended movies and shows.</td></tr>
                <tr><td>Pinterest</td><td>Select inspirational images reflecting personal interests during the onboarding process.</td><td>Review the personalized feed with recommended visual content.</td></tr>
                <tr><td>Spotify</td><td>Add songs to playlists based on initial recommendations to express musical preferences.</td><td>Review the homepage or dashboard with recommended tracks.</td></tr>
                <tr><td>Steam</td><td>Apply filters (e.g., “Indie” genre) while browsing to express game preferences.</td><td>Review the recommended games displayed on the store homepage or Discovery Queue.</td></tr>
                <tr><td>YouTube</td><td>Search for and watch a video on tennis serve drills to signal viewing preferences.</td><td>Review the recommended videos shown after watching the selected content.</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Usability Criteria</title>
        <p>
          To analyze the usability of the recommender system interfaces in a structured way, we defined a set
of criteria, shown in Table 3, based on established usability principles [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
and user-centric evaluation metrics for recommender systems [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. We defined and adapted these
criteria to cover general interface qualities and scenario-specific aspects of preference elicitation and
recommendation presentation. Our goal was to assess whether recommender interfaces support users in
understanding, influencing, and responding to recommendations.
        </p>
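        <p>The scenario-specific applicability of these criteria also determines how many individual assessments the study produces. The following minimal Python sketch (identifiers are ours; the grouping follows Table 3) makes the resulting count explicit:</p>

```python
# Sketch: the usability criteria (Table 3) as data, grouped by scenario
# applicability. Criterion identifiers follow the table; the structure is ours.
GENERAL = ["G1", "G2", "G3", "G4"]        # apply to both scenarios
ELICITATION = ["P1", "P2", "P3"]          # preference elicitation only
PRESENTATION = ["R1", "R2", "R3", "R4"]   # recommendation presentation only

def criteria_for(scenario):
    """Return the criteria checked in a given usage scenario."""
    if scenario == "preference_elicitation":
        return GENERAL + ELICITATION
    if scenario == "recommendation_presentation":
        return GENERAL + PRESENTATION
    raise ValueError(f"unknown scenario: {scenario}")

# 10 platforms, 2 scenarios each: (4 + 3) + (4 + 4) = 15 assessments per
# platform, i.e. 150 in total -- the number reported in the results.
per_platform = len(criteria_for("preference_elicitation")) + len(
    criteria_for("recommendation_presentation"))
total = 10 * per_platform
print(total)  # 150
```

        <p>Checking the arithmetic this way also clarifies why the general criteria are counted twice per platform: they are evaluated once in each of the two scenarios.</p>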
      </sec>
      <sec id="sec-2-3">
        <title>2.3. LLM-based Usability Analysis</title>
        <p>We used the gemini-2.5-flash model by Google for the LLM-based usability analysis [18]. This model
was designed to process textual and visual inputs, which makes it suitable for our experiments. We
instructed the model in Python using the Gemini Developer API. To improve the reproducibility of
the results, we set the temperature to 0.0 for more deterministic LLM output.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption><p>Usability criteria used for the LLM-based analysis.</p></caption>
          <table>
            <thead>
              <tr><th>Category</th><th>Usability Criterion</th></tr>
            </thead>
            <tbody>
              <tr><td rowspan="4">General (Both Scenarios)</td><td>G1. Is the layout clear and visually structured?</td></tr>
              <tr><td>G2. Are interactive elements (e.g., buttons, icons) clearly recognizable?</td></tr>
              <tr><td>G3. Is the amount of information per item appropriate and helpful?</td></tr>
              <tr><td>G4. Are interface elements used consistently (e.g., icons, labels, colors)?</td></tr>
              <tr><td rowspan="3">Preference Elicitation</td><td>P1. Can users explicitly express preferences (e.g., ratings, likes, categories)?</td></tr>
              <tr><td>P2. Is there transparency about how input affects recommendations?</td></tr>
              <tr><td>P3. Do users have control and flexibility (e.g., skip, edit, undo inputs)?</td></tr>
              <tr><td rowspan="4">Recommendation Presentation</td><td>R1. Are recommendations clearly labeled as such?</td></tr>
              <tr><td>R2. Are different types of recommendations distinguishable (e.g., “Because you liked...”)?</td></tr>
              <tr><td>R3. Are there explanations for why items are recommended?</td></tr>
              <tr><td>R4. Can users interact with recommendations (e.g., rate, hide, save)?</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>For the analysis task, the recommender interface screenshot was provided as context, together
with a high-level description of the platform, the considered usage scenario (preference elicitation or
recommendation presentation), and an explicit user task (see Table 2). To define the role and boundaries
of the LLM in this scenario, we used the system prompt shown in Figure 1.</p>
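        <p>The analysis setup described above can be sketched as follows. The prompt wording and helper names here are hypothetical illustrations (the authors' exact prompt is shown in Figure 1); the API call follows the google-genai Python SDK and is shown but not executed:</p>

```python
def build_user_prompt(platform, scenario, task, criteria):
    """Assemble the textual context that accompanies the screenshot.

    This wording is an illustrative assumption, not the paper's prompt.
    """
    lines = [
        f"Platform: {platform}",
        f"Usage scenario: {scenario}",
        f"User task: {task}",
        "Evaluate the screenshot against each criterion and answer",
        "fulfilled/unfulfilled with a short explanation:",
    ]
    lines += [f"- {c}" for c in criteria]
    return "\n".join(lines)

def analyze_screenshot(png_bytes, prompt, api_key):
    # Hypothetical call sketch (requires `pip install google-genai`).
    from google import genai
    from google.genai import types
    client = genai.Client(api_key=api_key)
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=[types.Part.from_bytes(data=png_bytes, mime_type="image/png"),
                  prompt],
        # temperature 0.0 for more deterministic, reproducible output
        config=types.GenerateContentConfig(temperature=0.0),
    )
    return response.text

prompt = build_user_prompt(
    "Spotify", "preference elicitation",
    "Add songs to playlists based on initial recommendations.",
    ["P1. Can users explicitly express preferences?"],
)
print(prompt.splitlines()[0])  # Platform: Spotify
```

        <p>Keeping the screenshot and the textual task description in one request lets the multimodal model ground its judgment of each criterion in the visible interface.</p>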
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>We analyzed 10 platforms across 2 usage scenarios each, and considered 11 different usability criteria.
This resulted in 150 individual assessments (reflecting the partly scenario-specific criteria, see Table 3).
The LLM completed the analysis in approximately 216 seconds.</p>
      <p>Figure 3 summarizes the fulfillment rates for each usability criterion across all evaluated platforms.
The results show that general interface design aspects, such as clear layout (G1), recognizable interactive
elements (G2), appropriate information density (G3), and consistent visual styling (G4), are well
supported on most platforms. This suggests that these systems follow established design conventions
and offer a solid baseline for usability. This outcome is expected, given that the evaluated platforms are
widely used and likely to have invested heavily in interface design and user experience.</p>
      <p>In contrast, the recommender-specific criteria show a more diverse picture. In particular, the presence
of explanations (R3) and interactive feedback options (R4) was less frequently fulfilled. The same holds
for transparency of underlying algorithms (R2/P2), explicit feedback options (P1), and flexibility of
interaction (P3). This suggests that while basic UI design is generally strong, many platforms lack
mechanisms to help users understand and influence recommendation behavior. These findings point to
a gap in explainability and user control in many “black-box” recommender settings.</p>
      <p>To avoid potential legal issues with publishing proprietary platform screenshots, we redrew the relevant
parts as simplified UI sketches for presentation in this paper. Importantly, all analyses were conducted
using the original screenshots. The sketches only serve as substitutes for publication and preserve the
necessary information for understanding the reported aspects. This procedure was applied to all examples
shown. For the original screen design, we refer to the respective platform web pages (see Table 1).</p>
      <p>Example results related to criterion P1, which concerns the ability of users to explicitly
express preferences, illustrate this. In the playlist creation scenario on Spotify, the LLM judged this criterion as
unfulfilled, since users can only add recommended songs without any direct feedback mechanism, such
as liking or disliking. In contrast, Amazon was evaluated as fulfilling the criterion: users could actively
apply clickable filters to indicate item type preferences.</p>
      <p>The explanation provided for Spotify is particularly nuanced and insightful. While the LLM
acknowledges that adding songs to a playlist can be seen as a form of preference expression, it argues that
this action alone may not sufficiently satisfy the criterion. This reasoning is persuasive, as users might
skip a recommendation for various reasons, such as disliking the artist or simply finding the song
unsuitable for the current playlist, which is not explicitly communicated to the system. The suggested
improvement, to provide more fine-grained feedback options, is both reasonable and actionable.</p>
      <p>In the recommendation presentation scenario, several criteria were considered unfulfilled. Figure 7
presents example results for criterion R4, which concerns the ability of users to interact with
recommended items. For Google News, the LLM noted the absence of visible interaction options and suggested
improvements. In contrast, YouTube was evaluated more positively, as its recommendation cards include
an accessible interaction menu (“three-dot” menu).</p>
      <p>These examples highlight both the usefulness and limitations of our approach. While the LLM
was able to identify relevant usability gaps, it also operated solely on static screenshots. As a result,
interactive elements that appear only on hover or during user interaction, such as the hidden menu in
Google News, may remain unrecognized. Nevertheless, the LLM’s suggestion remains valid: for a
new or inexperienced user, the lack of visible affordances can be a barrier to effective interaction and
may justify more prominent cues.</p>
      <p>[Figure caption fragments from the original layout: (b) Fulfilled P1 criterion (Amazon); (b) Fulfilled R4 criterion (YouTube).]</p>
      <p>Another notable observation is that the LLM judged none of the platforms as fulfilling criterion R3,
which concerns providing explanations for why items are recommended. However, a closer review of
the platforms and LLM explanations reveals a more nuanced picture. While many platforms indeed
lack explicit explanations, some offer at least high-level contextual hints. For instance, Spotify displays
recommended playlists with labels such as “Brand new music from artists you love”, which is a
high-level explanation for the recommendation. This suggests that the criterion could benefit from further
refinement to distinguish between vague contextual hints and explicit, personalized explanations.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Our results suggest that LLM-based usability analysis can provide useful, low-effort insights into the
strengths and weaknesses of recommender interfaces. The generated explanations and improvement
suggestions were generally accurate, understandable, and relevant, which indicates the potential for
such tools to support development, especially in early-stage prototyping and iterative design. While
LLMs cannot replace expert evaluations, they could act as assistive tools in a human-in-the-loop setting,
reducing manual effort and accelerating the identification of usability issues.</p>
      <p>Building on these findings, several research challenges emerge:
Prioritization of Issues. The number of identified usability issues of a recommender system can
be large and trigger significant effort for developers to evaluate them and prioritize their fixes. The
currently used binary fulfillment decisions do not capture the severity of issues, which limits the
possibilities for ranking. More nuanced assessments, such as severity ratings, could support issue
prioritization and make the analysis more actionable by highlighting the most critical usability gaps.
Prompting Design and Context. Prompt design can influence the analysis results. Our current
template evaluates multiple criteria simultaneously (see Figure 2), which reduces inference costs but
sometimes may overlook issues. Using separate prompts for each criterion could improve effectiveness.</p>
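      <p>The severity-rating idea mentioned above could be realized along the following lines (a minimal sketch; the 0–4 scale, in the spirit of Nielsen-style severity ratings, and the example issues are our own assumptions, not results from this study):</p>

```python
# Sketch: severity-based prioritization of identified usability issues.
# Severity scale 0-4 and the issue records are illustrative assumptions.
issues = [
    {"criterion": "R3", "platform": "Spotify", "severity": 3,
     "note": "no explanation why items are recommended"},
    {"criterion": "G2", "platform": "Steam", "severity": 1,
     "note": "filter toggle hard to recognize"},
    {"criterion": "R4", "platform": "Google News", "severity": 2,
     "note": "no visible interaction options on cards"},
]

# Rank the most severe issues first so developers can fix critical gaps early.
ranked = sorted(issues, key=lambda i: i["severity"], reverse=True)
print([i["criterion"] for i in ranked])  # ['R3', 'R4', 'G2']
```

      <p>Such a ranking replaces the binary fulfilled/unfulfilled signal with an ordering that directly supports triage.</p>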
      <p>
        Another opportunity for improvement is the refinement of the context description and clarifying
the intended meaning of each criterion. In some cases, the LLM struggled to interpret the criteria
consistently, for example, determining what level of detail qualifies as an explanation (R3). Making this
clearer by adding more detailed descriptions or examples could lead to more accurate and robust results.
Dynamic UI Behavior. A limitation of our current approach is the reliance on static screenshots,
which prevents the model from recognizing dynamic interface elements that appear only on mouse
hover or click. Providing a video of recorded interactions as context could show the complete interface behavior
and overcome this limitation. Beyond that, LLM-based agents that directly interact with interfaces
could be an even more elaborate approach to also automate the data collection aspect.
Validation Against Expert Assessments. Systematic comparison with expert evaluations is needed
to assess the reliability and practical value of LLM-based usability analysis. While early studies report
promising alignment with expert judgments on general usability criteria [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ], broader comparisons,
particularly in recommender-specific contexts, are needed to validate this.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this paper, we explored an LLM-based approach to usability analysis of recommender user interfaces.
We applied the method to ten publicly available platforms to assess whether the identified issues
were plausible, clearly explained, and accompanied by meaningful improvement suggestions. Our
findings demonstrate the potential of multimodal LLMs to support low-effort usability evaluation of
recommender interfaces, particularly during early-stage design. The findings also highlight different
areas for improvement, particularly in handling dynamic interface elements and generating more
nuanced, context-aware judgments. Future work will focus on improving prompt strategies and
contextual understanding, and on validating LLM-generated assessments against expert evaluations.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly in order to: grammar
and spelling check, paraphrase and reword. After using these tools, the authors reviewed and edited the
content as needed and take full responsibility for the publication’s content.</p>
      <p>[17] A. Gunawardana, G. Shani, S. Yogev, Evaluating Recommender Systems, Springer US, New
York, NY, 2022, pp. 547–601. URL: https://doi.org/10.1007/978-1-0716-2197-4_15.
doi:10.1007/978-1-0716-2197-4_15.</p>
      <p>[18] Gemini Team, Gemini: A family of highly capable multimodal models, 2024. URL:
https://arxiv.org/abs/2312.11805.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ricci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rokach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Shapira</surname>
          </string-name>
          , Recommender Systems: Techniques, Applications, and Challenges,
          <string-name>
            <surname>Springer</surname>
            <given-names>US</given-names>
          </string-name>
          , New York, NY,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          . URL: https://doi.org/10.1007/978-1-
          <fpage>0716</fpage>
          -2197-
          <issue>4</issue>
          _1. doi:
          <volume>10</volume>
          .1007/978-1-
          <fpage>0716</fpage>
          -2197-
          <issue>4</issue>
          _
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <article-title>Evaluating recommender systems: Survey and framework</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>55</volume>
          (
          <year>2022</year>
          ). URL: https://doi.org/10.1145/3556536. doi:
          <volume>10</volume>
          .1145/3556536.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Evaluating recommender systems from the user's perspective: survey of the state of the art, User Modeling and User-Adapted Interaction 22 (</article-title>
          <year>2012</year>
          )
          <fpage>317</fpage>
          -
          <lpage>355</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Knijnenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Willemsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gantner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Soncu</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Newell, Explaining the user experience of recommender systems, User modeling and user-adapted interaction 22 (</article-title>
          <year>2012</year>
          )
          <fpage>441</fpage>
          -
          <lpage>504</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hass</surname>
          </string-name>
          , A Practical Guide to Usability Testing, Springer International Publishing, Cham,
          <year>2019</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>124</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>319</fpage>
          -96906-
          <issue>0</issue>
          _6. doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -96906-
          <issue>0</issue>
          _
          <fpage>6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hollingsed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Novick</surname>
          </string-name>
          ,
          <article-title>Usability inspection methods after 15 years of research and practice</article-title>
          ,
          <source>in: Proceedings of the 25th Annual ACM International Conference on Design of Communication</source>
          , SIGDOC '07,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2007</year>
          , p.
          <fpage>249</fpage>
          -
          <lpage>255</lpage>
          . URL: https://doi.org/10.1145/1297144.1297200. doi:
          <volume>10</volume>
          .1145/1297144.1297200.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Nielsen</surname>
          </string-name>
          ,
          <article-title>Enhancing the explanatory power of usability heuristics</article-title>
          ,
          <source>in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '94</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>1994</year>
          , p.
          <fpage>152</fpage>
          -
          <lpage>158</lpage>
          . URL: https://doi.org/10.1145/191666.191729. doi:
          <volume>10</volume>
          .1145/191666.191729.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>A user-centric evaluation framework for recommender systems</article-title>
          ,
          <source>in: Proceedings of the Fifth ACM Conference on Recommender Systems</source>
          , RecSys '11, Association for Computing Machinery, New York, NY, USA,
          <year>2011</year>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>164</lpage>
          . URL: https://doi.org/10.1145/2043932.2043962. doi:10.1145/2043932.2043962.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Namoun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Alrehaili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tufail</surname>
          </string-name>
          ,
          <article-title>A review of automated website usability evaluation tools: Research issues and challenges</article-title>
          , in:
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Soares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rosenzweig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marcus</surname>
          </string-name>
          (Eds.),
          <source>Design, User Experience, and Usability: UX Research and Design</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>292</fpage>
          -
          <lpage>311</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Garnica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Rojas</surname>
          </string-name>
          ,
          <article-title>Automated tools for usability evaluation: A systematic mapping study</article-title>
          , in: G. Meiselwitz (Ed.),
          <source>Social Computing and Social Media: Design, User Experience and Impact</source>
          , Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kuric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Demcak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krajcovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <article-title>Systematic literature review of automation and artificial intelligence in usability issue detection</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2504.01415. arXiv:2504.01415.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A survey on multimodal large language models</article-title>
          ,
          <source>National Science Review</source>
          <volume>11</volume>
          (
          <year>2024</year>
          ). URL: http://dx.doi.org/10.1093/nsr/nwae403. doi:10.1093/nsr/nwae403.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Warner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <article-title>Generating automatic feedback on ui mockups with large language models</article-title>
          ,
          <source>in: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, CHI '24</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2024</year>
          . URL: https://doi.org/10.1145/3613904.3642782. doi:10.1145/3613904.3642782.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Pourasad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Maalej</surname>
          </string-name>
          ,
          <article-title>Does GenAI Make Usability Testing Obsolete?</article-title>
          ,
          <source>in: 2025 IEEE/ACM 47th International Conference on Software Engineering (ICSE)</source>
          , IEEE Computer Society, Los Alamitos, CA, USA,
          <year>2025</year>
          , pp.
          <fpage>675</fpage>
          -
          <lpage>675</lpage>
          . URL: https://doi.ieeecomputersociety.org/10.1109/ICSE55347.2025.00138. doi:10.1109/ICSE55347.2025.00138.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <article-title>Synthetic heuristic evaluation: A comparison between AI- and human-powered usability evaluation</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2507.02306. arXiv:2507.02306.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lubos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Felfernig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Leitner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schwazer</surname>
          </string-name>
          ,
          <article-title>Towards recommending usability improvements with multimodal large language models</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2508.16165. arXiv:2508.16165.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>