<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Recognition for Autonomous Agents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Habtom Kahsay Gidey</string-name>
          <email>habtom.gidey@tum.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Niklas Huber</string-name>
          <email>niklas.huber@jessyworks.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Lenz</string-name>
          <email>alex.lenz@tum.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alois Knoll</string-name>
          <email>knoll@tum.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jessy Works</institution>
          ,
          <addr-line>München</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universität München</institution>
          ,
          <addr-line>München</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <kwd-group>
        <kwd>Autonomous Agents</kwd>
        <kwd>Affordances</kwd>
        <kwd>World Modeling</kwd>
        <kwd>Design Patterns</kwd>
        <kwd>Web Automation</kwd>
      </kwd-group>
      <abstract>
        <p>The autonomy of software agents is fundamentally dependent on their ability to construct an actionable internal world model from the structured data that defines their digital environment, such as the Document Object Model (DOM) of web pages and the semantic descriptions of web services. However, constructing this world model from raw structured data presents two critical challenges: the verbosity of raw HTML makes it computationally intractable for direct use by foundation models, while the static nature of hardcoded API integrations prevents agents from adapting to evolving services.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent advances and the abundance of foundation models are driving a new generation of software
agents performing complex cognitive tasks in dynamic, mixed-reality ecosystems [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. For these
agents to operate effectively, they must perceive and understand their environment to build an
internal world model, an actionable representation that guides their reasoning and planning [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ].
Although significant research has focused on visual perception from pixels, many digital environments
are built upon rich, structured data sources like the hypermedia of the Document Object Model (DOM)
of web pages and the descriptive interfaces of web services [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. To this end, leveraging this explicit
structure offers a more efficient and deterministic path to building a world model [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ].
      </p>
      <p>
        However, directly consuming this structured data presents two critical architectural challenges that
hinder the development of truly autonomous agents [
        <xref ref-type="bibr" rid="ref11 ref2">11, 2</xref>
        ]. First, the verbosity of raw HTML is a major
bottleneck for agents using foundation models for reasoning and planning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Empirical studies of DOM
pruning show that the majority, often 80–90%, of tokens in raw HTML consist of non-semantic markup
such as scripts, styles, and trackers [
        <xref ref-type="bibr" rid="ref7 ref9">9, 7</xref>
], which can overwhelm the LLM’s context window,
degrade reasoning quality, and incur high computational costs [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This is the fundamental challenge of
representation, where the raw state of the world is too complex for the agent’s cognitive core to process
efficiently [
        <xref ref-type="bibr" rid="ref10 ref13">10, 13</xref>
        ].
      </p>
      <p>
        Second, agents must interact with a dynamic ecosystem of services and devices in the increasingly
interconnected Web of Things (WoT) and microservice architectures [
        <xref ref-type="bibr" rid="ref14">14, 15, 16</xref>
        ]. In such cases, the traditional approach of hardcoding binary interfaces, such as API endpoints and
interaction logic, creates brittle, tightly coupled systems that cannot adapt when a service changes
or a new device is introduced in the web microcosm [
        <xref ref-type="bibr" rid="ref11">17, 11</xref>
        ]. This is the challenge of interoperability, adaptability, and
discovery, i.e., recognition of affordances, where the agent’s world model is static and cannot be updated
to reflect a changing digital environment [
        <xref ref-type="bibr" rid="ref10 ref12">12, 10</xref>
        ].
      </p>
      <p>The Second International Workshop on Hypermedia Multi-Agent Systems (HyperAgents 2025), in conjunction with the 28th. CEUR Workshop Proceedings, ceur-ws.org.
      </p>
      <p>
        This paper introduces preliminary work on a pattern language [18] that addresses these challenges
by framing them as problems of world model construction [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. We present two architectural
design patterns that provide reusable solutions for building and enriching an agent’s world model from
structured data:
1. The DOM Transduction Pattern: A pattern for distilling a complex, raw DOM into a compact,
task-relevant world model optimized for an agent’s reasoning core.
2. The Hypermedia Affordances Recognition Pattern: A pattern for dynamically enriching
the world model by parsing standardized semantic descriptions of web services to discover and
integrate their capabilities at runtime.
      </p>
      <p>
        Combined with other percepts, these patterns provide a robust framework for engineering agents that
can efficiently and adaptively interact with the structured web, its connected resources,
and the Web of Things [
        <xref ref-type="bibr" rid="ref5">15, 5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>
        This research work covers several domains; in particular, it is situated at the intersection of hypermedia
multi-agent systems, world modeling, and cognitive automation [
        <xref ref-type="bibr" rid="ref14 ref4 ref5 ref6">19, 14, 4, 15, 5, 6</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <title>2.1. Web Agents Observation Space</title>
        <p>
          The challenge of processing verbose HTML for LLM-based agents has become a significant area of
research [
          <xref ref-type="bibr" rid="ref11 ref2">11, 2</xref>
          ]. The core problem is that the agent’s observation space, when represented by a raw
DOM, is misaligned with the LLM’s processing capabilities. This has led to the development
of various techniques for DOM cleaning and simplification [
          <xref ref-type="bibr" rid="ref10 ref7 ref9">9, 7, 10</xref>
          ]. Recent work on LLM-based
web agents such as WebVoyager, Agent-E, and AgentOccam has highlighted the critical importance of
managing the agent’s observation space [
          <xref ref-type="bibr" rid="ref7 ref8">20, 8, 7</xref>
          ]. The primary challenge lies in handling complex and
verbose HTML, which has motivated a family of techniques, collectively known as DOM distillation
or HTML pruning, that aim to simplify the DOM and make it more tractable for LLMs. For example, AgentOccam focuses on
refining the observation space to better align with the LLM’s capabilities [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. More recent work, such
as HtmlRAG, has introduced block-tree-based pruning methods and concrete methods to clean and
compress HTML while preserving its structure [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. This concept of webpage segmentation has a
long history in information extraction, aiming to break down a page into semantically related parts
to improve downstream tasks [
          <xref ref-type="bibr" rid="ref7 ref8">20, 8, 7</xref>
          ]. These approaches demonstrate the critical importance of
preprocessing the DOM. Our DOM Transduction Pattern aims to formalize these emerging best
practices into a reusable architectural solution.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Hypermedia Multi-agent Systems and World Models via Hypermedia</title>
        <p>
          The second challenge, interoperability, is addressed by principles from hypermedia multi-agent
systems [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The core idea is Hypermedia as the Engine of Application State (HATEOAS), a fundamental
constraint of the REST architectural style [21]. HATEOAS states that a client should navigate an
application entirely through links provided dynamically by the server, removing the need for a
developer-in-the-middle or hardcoded endpoints and thus decoupling the client from the server. While the adoption
of HATEOAS in general web APIs has been debated, its principles are well suited to autonomous
agents that require dynamic discovery and adaptation of affordances.
        </p>
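        <p>As an illustration of this decoupling, the sketch below shows a client that discovers its next endpoint from a HAL-style ‘_links’ map [21] instead of a hardcoded URL; the link relations, paths, and function name are invented for this example.</p>

```python
# Illustrative sketch (not from the paper): HATEOAS-style navigation.
# The client reads the next URL out of the server's representation,
# assuming a HAL-like "_links" map as in draft-kelly-json-hal.
import json

def next_action_url(response_body: str, rel: str):
    """Return the href the server advertises for a link relation,
    or None if the server does not offer that affordance."""
    doc = json.loads(response_body)
    link = doc.get("_links", {}).get(rel)
    return link["href"] if link else None

# The client learns where to go next from the representation itself.
body = '{"_links": {"self": {"href": "/orders/1"}, "payment": {"href": "/orders/1/payment"}}}'
print(next_action_url(body, "payment"))  # /orders/1/payment
```

        <p>If the server later moves the payment resource, only the advertised href changes; the client code above stays the same.</p>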
        <p>The most concrete and standardized application of these principles is the W3C Web of Things (WoT)
framework [22]. Its central component, the Thing Description (TD), is a JSON-LD document that provides
machine-readable “capability knowledge” for any given “Thing,” such as a device, service, or other web
resource. A TD specifies a resource’s metadata and its Interaction Affordances, the ‘Properties’, ‘Actions’,
and ‘Events’ it exposes, along with the specific ‘Protocol Bindings’ (e.g., HTTP, MQTT) required for
interaction. By parsing a TD, an agent can dynamically learn what a service can do and how to
communicate with it without prior, hardcoded knowledge. This is complemented by WoT Discovery
mechanisms, which define how agents can find relevant Thing Descriptions, for instance, through a
searchable Thing Description Directory (TDD) [23, 16].</p>
        <p>Our Hypermedia Affordances Recognition Pattern formalizes this HATEOAS-based discovery process
as a key mechanism for enriching an agent’s world model at runtime.</p>
        <p>
          These ideas connect to the notion of a cognitive map or a world model, grounded in the foundational
work on world models [
          <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Pattern Language for Structured Perception</title>
      <p>To systematically address the challenges of building world models from structured data, we
adopt the established software engineering methodology of design patterns [24, 25, 18]. A pattern
is a reusable design or architectural solution to a recurring problem within a given context, forming
a “pattern language” that captures and communicates expert architectural knowledge. By applying this
methodology, we can systematically investigate and codify solutions for agents’ recurring perceptual
challenges when interacting with structured digital environments.</p>
      <p>Consequently, to ensure a rigorous description of each pattern, we employ a comprehensive cataloging
template adapted from standard pattern documentation formats. This template organizes each pattern
into three main sections: the Problem Space, which defines the context, problem, and motivating forces;
the Solution Space, which details the solution’s description, components, flow, and formal constraints;
and Application and Evaluation, which discusses consequences and implementation. This format is
deliberately preferred to facilitate future formalization and ensure the reproducibility of our proposed
solutions [25, 26].</p>
      <sec id="sec-3-1">
        <title>3.1. The DOM Transduction Pattern</title>
        <sec id="sec-3-1-0">
          <title>3.1.1. Problem</title>
          <p>
            An LLM-based agent must understand and interact with a web page, but the raw HTML DOM is too
verbose and noisy. It exceeds the LLM’s context window, contains irrelevant information that degrades
reasoning, and incurs high computational costs. The agent therefore needs a simplified yet structurally
coherent representation of page affordances to build its world model [
            <xref ref-type="bibr" rid="ref11 ref2">11, 2</xref>
            ].
          </p>
        </sec>
        <sec id="sec-3-1-1">
          <title>3.1.2. Solution</title>
          <p>This pattern introduces a DOM Transformer component within the agent’s perception module. As
illustrated in Fig. 1, this component ingests the raw DOM and applies a series of transformations to distill
and represent it in a compact, task-relevant world model with affordances. This process typically
involves the following steps.</p>
          <p>
            1. Cleaning: Removing universally irrelevant tags like ‘&lt;script&gt;’ and ‘&lt;style&gt;’, which can
substantially shrink the DOM size [
            <xref ref-type="bibr" rid="ref7 ref9">9, 7</xref>
            ].
2. Pruning: Intelligently removing content that is irrelevant to the current task, for example,
block-tree-based pruning strategies or embedding-based relevance filtering [
            <xref ref-type="bibr" rid="ref10 ref13">10, 13</xref>
            ].
3. Compact Representation: Converting cleaned HTML into token-efficient encodings such as
Emmet notation [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ].
4. LLM-as-Transformer: Using a smaller LLM to summarize the DOM before passing it to a larger
reasoning model [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
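          <p>The cleaning step can be sketched with the standard library alone; the tag blacklist and attribute whitelist below are illustrative assumptions for this example, not the pattern’s normative definition.</p>

```python
# Minimal sketch of DOM cleaning (step 1), using only the standard library.
# NOISE and KEEP_ATTRS are illustrative choices, not a normative list.
from html.parser import HTMLParser

NOISE = {"script", "style", "noscript"}      # universally irrelevant subtrees
KEEP_ATTRS = {"id", "href", "name", "type"}  # attributes useful for interaction

class DOMCleaner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth_in_noise = 0  # tracks nesting inside noise subtrees
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE:
            self.depth_in_noise += 1
            return
        if self.depth_in_noise:
            return
        kept = [(k, v) for k, v in attrs if k in KEEP_ATTRS]
        attr_str = "".join(f' {k}="{v}"' for k, v in kept)
        self.out.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if tag in NOISE:
            self.depth_in_noise -= 1
            return
        if not self.depth_in_noise:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth_in_noise and data.strip():
            self.out.append(data.strip())

def clean(html: str) -> str:
    p = DOMCleaner()
    p.feed(html)
    return "".join(p.out)

raw = '<div><script>track()</script><a href="/book" id="cta">Book now</a></div>'
print(clean(raw))  # <div><a href="/book" id="cta">Book now</a></div>
```

          <p>Even this naive filter removes the tracking script and presentation-only attributes while preserving the interactive link, which is the essence of the transduction step.</p>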
          <p>The output is a simplified DOM that preserves the essential structure needed for interaction while
being optimized for LLM processing. Fig. 1 illustrates the flow and components of the perception
pipeline for the DOM Transduction Pattern.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.3. Architectural Constraints</title>
          <p>
            1. The input must be a structured DOM tree.
2. The transformation is a unidirectional data flow from the raw DOM into a structured representation,
the Page Affordance Model.
3. The output, the Page Affordance Model, must preserve the essential structure of task-relevant
interactive elements.
4. The DOM Transformer must be a decoupled perception component.
          </p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.4. Application and Evaluation</title>
          <p>
            • Consequences:
– Benefits: Enables automation on complex pages by overcoming context window limitations;
significantly reduces cost and latency; improves agent reliability by providing a cleaner,
more focused context.
– Liabilities: Designing a robust DOM Transformer is a non-trivial engineering task; an overly
aggressive pruning strategy can lead to critical failures by removing necessary elements.
• Implementation Considerations:
– Rule-Based Filtering: Pattern-matching and rule-based parsing techniques can support the
selective removal of predefined tags and attributes from the DOM.
– Block-Based Pruning: Partitioning the DOM into semantic blocks, combined with task-aware
relevance scoring, for example, embedding similarity to task descriptions, can provide
effective strategies for discarding irrelevant content [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ].
– Compact Representation: Structural encoding methods, such as token-efficient notations,
can provide compressed forms of the cleaned DOM while preserving essential hierarchy
and relationships [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ].
– LLM-as-Transformer: Cascaded model strategies, where a smaller model distills or
summarizes the DOM before forwarding it to a more capable reasoning model, can offer efficiency
gains [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
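          <p>As a toy illustration of such a token-efficient encoding, the sketch below serializes a cleaned element tree into a compact, Emmet-like string; the input shape (nested tuples) and the exact notation are assumptions made for this example.</p>

```python
# Hypothetical sketch of a compact, Emmet-like encoding of a cleaned DOM
# subtree. The (tag, id, children) tuple shape is an illustrative assumption.
def emmet(node) -> str:
    """Serialize a (tag, id_or_None, [children]) tuple into a compact string."""
    tag, elem_id, children = node
    head = tag + (f"#{elem_id}" if elem_id else "")
    if not children:
        return head
    inner = "+".join(emmet(c) for c in children)
    if len(children) > 1:
        inner = f"({inner})"  # group sibling elements
    return f"{head}>{inner}"

tree = ("form", "search", [("input", "q", []), ("button", "go", [])])
print(emmet(tree))  # form#search>(input#q+button#go)
```

          <p>The encoded string retains the hierarchy and the interactive elements’ identifiers in a fraction of the tokens of the serialized HTML.</p>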
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. The Hypermedia Affordances Recognition Pattern</title>
        <sec id="sec-3-2-0">
          <title>3.2.1. Problem</title>
          <p>
            An agent must operate in a dynamic ecosystem of distributed services and IoT devices (the “Web of
Things”). Hardcoding the API and keeping a developer in the middle for each service makes the agent
brittle and unable to adapt to new or updated services. It must have the perceptual skills to dynamically
discover, recognize, and understand the capabilities, or affordances, of any web service or resource it
encounters [
            <xref ref-type="bibr" rid="ref14">14, 17</xref>
            ].
          </p>
        </sec>
        <sec id="sec-3-2-1">
          <title>3.2.2. Solution</title>
          <p>This pattern, based on HATEOAS principles [21], requires that services expose their capabilities through
standardized, machine-readable semantic descriptions. The agent’s perception module contains an
Affordance Parser that fetches and interprets these descriptions. The canonical implementation is the W3C
WoT Thing Description (TD), a JSON-LD document that specifies two key elements [22]:
• Interaction Affordances: The ‘Properties’ (readable/writable state), ‘Actions’ (invokable
functions), and ‘Events’ (subscribable notifications) the service offers.
• Protocol Bindings: The specific technical instructions that detail how an agent can
interact with each of a service’s capabilities or affordances.</p>
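          <p>A minimal Affordance Parser can be sketched as follows; the Thing Description below is a hand-written toy example, not a complete W3C TD, and reading only the first form of each affordance is a simplifying assumption.</p>

```python
# Minimal sketch of an Affordance Parser for a WoT Thing Description.
# The TD below is a hand-written toy example; real TDs carry @context,
# security definitions, and richer protocol bindings.
import json

td_json = """
{
  "title": "SmartThermostat",
  "properties": {
    "temperature": {"forms": [{"href": "https://room.example/temp"}]}
  },
  "actions": {
    "setTemperature": {"forms": [{"href": "https://room.example/temp/set"}]}
  },
  "events": {
    "overheat": {"forms": [{"href": "https://room.example/alerts"}]}
  }
}
"""

def parse_affordances(td: dict) -> dict:
    """Map each interaction affordance to the first protocol binding it exposes."""
    out = {}
    for kind in ("properties", "actions", "events"):
        for name, spec in td.get(kind, {}).items():
            out[(kind, name)] = spec["forms"][0]["href"]
    return out

affordances = parse_affordances(json.loads(td_json))
print(affordances[("actions", "setTemperature")])  # https://room.example/temp/set
```

          <p>Nothing about the thermostat is known to the agent beforehand; every capability and endpoint is learned from the description at runtime.</p>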
          <p>By parsing a TD, the agent dynamically learns how to interact with services at runtime. As shown in
Fig. 2, this enables adaptability and robustness, allowing agents to autonomously integrate new devices
and services without prior hardcoding.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.3. Architectural Constraints</title>
          <p>
            1. The agent and the resource it interacts with must be fully decoupled, with no pre-configured,
hardcoded dependencies on each other.
2. The resource must expose its capabilities and make them known by publishing a standardized
semantic description.
3. All agent interactions must be driven by affordances discovered in the description.
4. The resource’s description dictates the interaction protocol and all communication specifics,
which are determined by the resource itself, not the agent.
          </p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.4. Application and Evaluation</title>
          <p>
            • Consequences:
– Benefits: Enables extreme adaptability and robustness, allowing agents to autonomously
integrate new devices and services on the fly; simplifies the development of large-scale,
interoperable systems.
– Liabilities: Requires device and service providers to correctly implement and host a Thing
Description, which can be a barrier to adoption; a poorly written TD can lead to agent
errors.
• Implementation Considerations:
– Affordance parsing can be implemented using JSON-LD libraries and WoT toolkits.
– Agents must include client libraries for common protocols (HTTP, MQTT, etc.), and can use
WoT Discovery mechanisms such as TD directories to find services [22].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Cognitive Map as a Unified World Model</title>
      <p>
        The two patterns embody two complementary modes of perception: one focused on distilling and
representing a known, complex environment, and the other on discovering and integrating unknown
entities. Jointly, they contribute to the construction of a unified cognitive map or a world model, as
emphasized in the foundational work on world models [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6, 27</xref>
        ].
      </p>
      <p>Their flow can be illustrated with a simplified practical example. An agent tasked with booking a
hotel first applies the DOM Transduction Pattern to parse the booking website, producing a simplified
world model of the page. Within this model, it discovers a hyperlink labeled “Smart Room Controls”.
Following this link, the agent receives a W3C Thing Description for the room’s environmental controls.
It then switches to the Hypermedia Affordances Recognition Pattern to parse the description, dynamically
enriching its world model with new capabilities, for example, discovering a ‘thermostat’ and a
‘setTemperature’ action. The agent can now not only complete the booking but also offer to pre-set the room
temperature, an affordance discovered and integrated entirely at runtime.</p>
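      <p>The runtime enrichment in this walkthrough can be sketched as follows; the page link, the stubbed Thing Description, and the world-model shape are all invented for illustration.</p>

```python
# Illustrative sketch of the hotel walkthrough: a link found via DOM
# transduction leads to a TD, whose actions enrich the world model at runtime.

# Output of the DOM Transduction step (stubbed): links found on the page.
page_links = {"Smart Room Controls": "https://hotel.example/room42/td"}

# What following that link would return (stubbed Thing Description).
fetched_td = {
    "title": "Room 42 Controls",
    "actions": {"setTemperature": {"forms": [{"href": "https://hotel.example/room42/temp"}]}},
}

world_model = {"page_affordances": ["bookRoom"], "service_affordances": {}}

# Hypermedia Affordances Recognition: fold the TD's actions into the model.
for name, spec in fetched_td.get("actions", {}).items():
    world_model["service_affordances"][name] = spec["forms"][0]["href"]

print(sorted(world_model["service_affordances"]))  # ['setTemperature']
```

      <p>After this step the agent’s world model holds both the page-level booking affordance and the newly discovered service action, matching the composition shown in Fig. 3.</p>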
      <p>This composition of patterns is central to the agent’s perception architecture, as shown in Fig. 3. The
diagram illustrates how different percepts and affordances, such as DOM trees, thing descriptions, and
service contracts, are fused into a unified cognitive map. The DOM Transduction Pattern processes
HTML structures, while the Hypermedia Affordances Recognition Pattern interprets service descriptions.
Concurrently, these complementary perceptual streams provide the agent with an adaptable and
semantically rich representation of its environment in the evolving ecosystem.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Outlook</title>
      <p>This paper introduced a preliminary pattern language to address key challenges in web and service
interaction by presenting two architectural patterns that enable autonomous agents to construct
and maintain actionable world models from structured data. The DOM Transduction Pattern
distills complex web pages into tractable affordance representations for LLM-based reasoning, while
the Hypermedia Affordances Recognition Pattern enables dynamic discovery of service capabilities,
ensuring interoperability and adaptability.</p>
      <p>The primary contribution of this work is a principled, reusable framework for engineering hypermedia
multi-agent systems that can build and maintain accurate world models from the explicit structure of
their environment. This enables the development of agents that are more efficient, scalable, resilient to
change, and capable of predictive reasoning, compared to those relying on brittle, hardcoded logic.</p>
      <p>As an outlook, this work represents one half of a larger vision. Our future research will extend
these structure-based patterns with visual counterparts. The long-term goal is a comprehensive pattern
language for multi-modal perception, enabling agents to fuse structured and visual percepts. This will
allow an agent, for example, to use the DOM Transduction Pattern on a website but switch to visual
parsing when structured representations are unavailable. This ability to intelligently select and adapt
perceptual modalities will advance the next generation of autonomous agents toward human-level
competence in digital environments.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly and ChatGPT for grammar and
spelling checking, paraphrasing, and rewording. After using these tools, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
      <p>[15] M. Castellucci, S. Burattini, A. Ciortea, J. Lemée, D. Vachtsevanou, A. Ricci, S. Mayer, Towards
agents’ embodiment in hypermedia multi-agent systems, in: Multi-Agent Systems: 21st European
Conference, EUMAS 2024, Dublin, Ireland, August 26–28, 2024, Proceedings, volume 15685,
Springer Nature, 2025, p. 361.
[16] H. K. Gidey, M. Kesseler, P. Stangl, P. Hillmann, A. Karcher, Document-based knowledge discovery
with microservices architecture, in: Intelligent Systems and Pattern Recognition: ISPR 2022,
volume 1589, Springer, Cham, 2022, pp. 146–161. URL: https://doi.org/10.1007/978-3-031-08277-1_13.
doi:10.1007/978-3-031-08277-1_13.
[17] H. K. Gidey, P. Hillmann, A. Karcher, A. Knoll, Towards cognitive bots: Architectural research
challenges, in: Artificial General Intelligence, AGI 2023, Springer, 2023. URL:
https://doi.org/10.1007/978-3-031-33469-6. doi:10.1007/978-3-031-33469-6.
[18] H. K. Gidey, D. Marmsoler, J. Eckhardt, Grounded architectures: Using grounded theory for the
design of software architectures, in: 2017 IEEE International Conference on Software Architecture
Workshops (ICSAW), IEEE, Gothenburg, Sweden, 2017, pp. 141–148. URL:
https://doi.org/10.1109/ICSAW.2017.41. doi:10.1109/ICSAW.2017.41.
[19] H. K. Gidey, P. Hillmann, A. Karcher, A. Knoll, User-like bots for cognitive automation: A
survey, in: Machine Learning, Optimization, and Data Science, LOD 2023, Springer, 2023. URL:
https://doi.org/10.1007/978-3-031-53966-4. doi:10.1007/978-3-031-53966-4.
[20] T. Abuelsaad, D. Akkil, P. Dey, A. Jagmohan, A. Vempaty, R. Kokku, Agent-E: From autonomous web
navigation to foundational design principles in agentic systems, arXiv preprint arXiv:2407.13032
(2024).
[21] M. Kelly, JSON Hypertext Application Language, Internet-Draft draft-kelly-json-hal-11, Internet
Engineering Task Force, 2023. URL: https://datatracker.ietf.org/doc/draft-kelly-json-hal/11/.
[22] M. Lagally, R. Matsukura, M. McCool, K. Toumura, Web of Things (WoT) Architecture 1.1, W3C
Recommendation REC-wot-architecture11-20231205, World Wide Web Consortium, 2023. URL:
https://www.w3.org/TR/wot-architecture11/.
[23] Web of Things (WoT) Discovery, W3C Recommendation, World Wide Web Consortium (W3C),
2023. URL: https://www.w3.org/TR/wot-discovery/.
[24] K. Beck, Using pattern languages for object-oriented programs, in: OOPSLA-87 Workshop on the
Specification and Design for Object-Oriented Programming, 1987.
[25] H. K. Gidey, A. Collins, D. Marmsoler, Modeling and verifying dynamic architectures with
FACTum Studio, in: Formal Aspects of Component Software, FACS 2019, Springer, 2019. URL:
https://doi.org/10.1007/978-3-030-40914-2. doi:10.1007/978-3-030-40914-2.
[26] H. K. Gidey, D. Marmsoler, FACTum Studio, https://habtom.github.io/factum/, 2018.
[27] H. K. Gidey, D. Marmsoler, D. Ascher, Modeling adaptive self-healing systems, CoRR
abs/2304.12773 (2023). URL: https://arxiv.org/abs/2304.12773. doi:10.48550/arXiv.2304.12773.
arXiv:2304.12773.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] I. Gur, H. Furuta, A. Huang, M. Safdari, Y. Matsuo, D. Eck, A. Faust, A real-world webagent with planning, long context understanding, and program synthesis, arXiv (2023). URL: https://arxiv.org/abs/2307.12856. doi:10.48550/arXiv.2307.12856. arXiv:2307.12856.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] L. Ning, Z. Liang, Z. Jiang, H. Qu, Y. Ding, W. Fan, X. Wei, S. Lin, H. Liu, P. S. Yu, Q. Li, A survey of webagents: Towards next-generation AI agents for web automation with large foundation models, in: Proceedings of the ACM Web Conference 2025 (WWW '25), 2025. URL: https://dl.acm.org/doi/10.1145/3711896.3736555. doi:10.1145/3711896.3736555.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. Macedo, H. K. Gidey, K. B. Rebuli, P. Machado, Evolving user interfaces: A neuroevolution approach for natural human-machine interaction, in: C. Johnson, S. M. Rebelo, I. Santos (Eds.), Artificial Intelligence in Music, Sound, Art and Design: 13th International Conference, EvoMUSART 2024, Held as Part of EvoStar 2024, Aberystwyth, UK, April 3-5, 2024, Proceedings, volume 14633 of Lecture Notes in Computer Science, Springer, Cham, 2024, pp. 246-264. URL: https://doi.org/10.1007/978-3-031-56992-0_16. doi:10.1007/978-3-031-56992-0_16.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>World models</article-title>
          ,
          <source>arXiv preprint arXiv:1803.10122</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <article-title>A path towards autonomous machine intelligence</article-title>
          ,
          <source>arXiv preprint arXiv:2205.01761</source>
          (
          <year>2022</year>
          ). URL: https://arxiv.org/abs/2205.01761.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Richens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Everitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Abel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bellot</surname>
          </string-name>
          ,
          <article-title>General agents need world models</article-title>
          ,
          <source>in: Proceedings of the 42nd International Conference on Machine Learning (ICML 2025)</source>
          , Vancouver, Canada,
          <year>2025</year>
          . URL: https://icml.cc/virtual/2025/poster/44620.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fakoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chaudhari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rangwala</surname>
          </string-name>
          ,
          <article-title>Agentoccam: A simple yet strong baseline for LLM-based web agents</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2410.13825. arXiv:2410.13825.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Webvoyager: Building an end-to-end web agent with large multimodal models</article-title>
          ,
          <source>arXiv preprint arXiv:2401.13919</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2401.13919.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>Htmlrag: HTML is better than plain text for modeling retrieved knowledge in RAG systems</article-title>
          ,
          <source>in: Proceedings of the ACM Web Conference 2025 (WWW '25)</source>
          , ACM, Sydney,
          <year>2025</year>
          , pp.
          <fpage>1733</fpage>
          -
          <lpage>1746</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3696410.3714546. doi:10.1145/3696410.3714546.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Wepo: Web element preference optimization for LLM-based web navigation</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>39</volume>
          , AAAI Press,
          <year>2025</year>
          , pp.
          <fpage>26614</fpage>
          -
          <lpage>26622</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/34863.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Assouel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Marty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Caccia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Laradji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drouin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Mudumba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Palacios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Cappart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vázquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Chapados</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gasse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lacoste</surname>
          </string-name>
          ,
          <article-title>The unsolved challenges of LLMs as generalist web agents: A case study</article-title>
          ,
          <source>in: NeurIPS 2023 Workshop on Foundation Models for Decision Making (FMDM@NeurIPS 2023)</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=jt3il4fC5B.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X. H.</given-names>
            <surname>Lù</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mosbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <article-title>Build the web for agents, not agents for the web</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2506.10953. arXiv:2506.10953.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <article-title>Next-eval: Next evaluation of traditional and LLM web data record extraction</article-title>
          ,
          <source>arXiv</source>
          (
          <year>2025</year>
          ). URL: https://arxiv.org/abs/2505.17125. arXiv:2505.17125.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Gandon</surname>
          </string-name>
          ,
          <article-title>Merry HMAS and happy new web: A wish for standardizing an AI-friendly web architecture for hypermedia multi-agent systems</article-title>
          ,
          <source>in: Dagstuhl-Seminar 21072: Autonomous Agents on the Web</source>
          ,
          <year>2021</year>
          , p.
          <fpage>3</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>