<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Intelligent Agents: Where Is the Web?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victor Charpenay</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mines Saint-Etienne, Univ Clermont Auvergne, INP Clermont Auvergne</institution>
          ,
          <addr-line>CNRS, UMR 6158 LIMOS Saint-Etienne</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>26</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This paper reviews agent frameworks and benchmark datasets for software agents operating in a Web environment. All frameworks and datasets make the assumption that agents interact with their environment through text and images alone, hiding the main abstractions of the Web and its architectural constraints. This study raises the question of whether software agents that leverage structured data instead, mostly available in RDF format, can be more accurate or more computationally efficient than agents based on chatbots.</p>
      </abstract>
      <kwd-group>
        <kwd>Web Agent</kwd>
        <kwd>Chatbot</kwd>
        <kwd>Hypermedia</kwd>
        <kwd>RDF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Agent Frameworks</title>
      <p>
        It is straightforward to build an agent from a chatbot. Instead of generating a plausible answer to a
question, the chatbot only has to generate a function call, either in a pre-defined format [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or as a
code snippet [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A dedicated software component, acting as an intermediary between the agent and its
      </p>
      <p>[Table 2 excerpts: Google search, Tavily search, Slack, Hyperbrowser, …; Google Serper, Firecrawl, Selenium, Website RAG search, …; Google search, Perplexity search, Firecrawl, deep research, …; Google search, API Web search, visit Web page, …]</p>
      <p>
        environment, calls the function and informs the chatbot of possible effects in a textual form. The agent
is deemed intelligent if it is capable of anticipating the effects of an action and planning several steps
ahead. Such reasoning materializes as an intermediary step in which the chatbot generates free text.
Chatbots that combine reasoning and acting are referred to as ReAct agents [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. With the development
of multimodal chatbots, ReAct agents evolved into SeeAct agents to also perceive images [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The virtual
assistants that Anthropic and OpenAI commercially provide, which they call computer-using agents,
are SeeAct agents [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
      </p>
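      <p>The loop described above (generate a function call, execute it via an intermediary, feed the textual effect back) can be sketched in a few lines. The message format, tool registry and stand-in "chatbot" below are illustrative assumptions, not any specific framework's API:</p>

```python
# Minimal ReAct-style loop. The message format, tool registry and the
# stand-in "chatbot" are illustrative assumptions, not a framework's API.
def fake_chatbot(messages):
    """Stand-in for a language model: emits a thought and a function call,
    then an answer once an observation has been received."""
    if not any(m["role"] == "tool" for m in messages):
        return {"thought": "I should search the Web first.",
                "call": {"name": "web_search", "args": {"query": "JATS DTD"}}}
    return {"thought": "I have enough information.", "answer": "done"}

TOOLS = {
    # The dedicated intermediary component: it executes calls on behalf
    # of the chatbot and reports effects in textual form.
    "web_search": lambda query: f"3 results found for '{query}'",
}

def react_loop(task, chatbot, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = chatbot(messages)  # reasoning and acting in one generation
        if "answer" in step:
            return step["answer"]
        call = step["call"]
        observation = TOOLS[call["name"]](**call["args"])
        messages.append({"role": "tool", "content": observation})
    return None
```

      <p>A real ReAct agent replaces the stand-in with a language model whose prompt interleaves thoughts, actions and observations.</p>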
      <p>
        Agents based on a chatbot achieve satisfactory results on various tasks without any fine-tuning. They
are typically given a system prompt that describes available actions in natural language, at initialization
time. For more complex scenarios, such a prompt can be reformulated at run time based on signals
received from the environment [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. If available, the language model underlying the chatbot may even
be fine-tuned to generate more accurate function calls. WebGPT, for instance, results from fine-tuning
GPT-3 to leverage Web search in question answering tasks [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Web search, as we will see later, is one
of the most common capabilities of ReAct agents. OpenAI recently integrated it into ChatGPT in a
mode called deep research [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Actions are made available to chatbots via tools, i.e. pieces of code that easily integrate with the
rest of a program [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ]. Many tools are defined as a single function, typically in Python, such that a
system prompt describing the tool can be automatically derived from its signature and documentation.
Many agent libraries are based on this principle, offering users the possibility to create custom tools.
However, common tools are often built into the library.
      </p>
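      <p>Deriving a prompt-ready description from a function's signature and documentation, as mentioned above, can be sketched with Python's standard inspect module; the web_search stub and the one-line output format are illustrative assumptions, not any particular library's convention:</p>

```python
# Sketch: deriving a tool description from a function's signature and
# docstring. The web_search stub and the output format are illustrative.
import inspect

def web_search(query: str, max_results: int = 5) -> list:
    """Search the Web and return a list of result URLs."""
    return []  # stub body; a real tool would call a search engine

def describe_tool(fn):
    sig = inspect.signature(fn)
    params = ", ".join(
        name if p.annotation is inspect.Parameter.empty
        else f"{name}: {p.annotation.__name__}"
        for name, p in sig.parameters.items()
    )
    return f"{fn.__name__}({params}): {inspect.getdoc(fn)}"
```

      <p>The resulting string can then be concatenated into the system prompt that lists available actions.</p>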
      <p>Table 2 gives a summary of Web tools provided by five well-known agent libraries. All libraries
provide a Web search tool. All but one give access to more than one search engine (the exception being
the OpenAI Agents SDK). Google search is the most frequent tool but other tools appear several times,
including Brave search and SearxNG (a meta-search engine), as well as Firecrawl and Hyperbrowser.
These two utilities, among others, provide high-level representations of Web pages and Web sites, hiding
certain aspects of Web interactions such as CSS rendering and link following. The tools presented in the
table can be broadly classified into five categories: search, scraping and crawling, browser automation,
platform access, direct access.</p>
      <p>Search. Search tools provide search engine response pages (SERPs) in a semi-structured way. Tools
exist for traditional search engines (Bing, Brave, DuckDuckGo, Google, Mojeek) but also for more recent
search engines dedicated to chatbots (Tavily, You.com, EXA, Linkup, Perplexity). The latter expect
full questions as queries (rather than keywords) and use themselves language models to generate a
description of each result page.</p>
      <p>Scraping and crawling. Scraping consists in extracting structured data from a Web page, possibly
after a rendering step. Crawling consists in following links automatically to produce a condensed view of
the visited Web pages (for instance, all pages of a Web site). The two procedures are often implemented
together in a program, sometimes called crawler or spider. Crawler tools in agent frameworks (Jina,
AgentQL, Hyperbrowser, MultiOn, Firecrawl, ScrapFly, ScrapGraph, Browserbase, Crawl4AI) tend to
be dedicated to chatbots. Many programs already existed for scraping and crawling but, as for search,
recent ones use language models to restructure or simplify Web pages. Some crawling programs also
provide deep research functionalities, like ChatGPT.</p>
      <p>Browser automation. Scraping programs often operate on dynamic Web pages, requiring a browser
environment. To do so, they use a browser automation program that sends key press and click events
to a headless browser. Such browser automation programs (Playwright, Selenium, StageHand) can also
be directly used as tools.</p>
      <p>Platform access. Many platforms primarily accessible via a Web API have a dedicated tool (Github,
Gitlab, Gmail, Infobip, Jira, Office365, Slack, Twilio, YouTube, Wikipedia). Messages exchanged between
the agent and the platform are formatted in JSON.</p>
      <p>Direct access. The remaining tools expose lower-level features, such as HTTP request helpers.</p>
      <p>All agent frameworks, except the OpenAI Agents SDK, put Web tools forward. Web tools represent
40%-74% of the built-in tools they provide (other tools give access to files and databases, for instance).
However, these tools hide the Web from agents more than they expose it to them. Technical details of
identification, interaction and representation, the main architectural features of the Web, are indeed
invisible to agents.</p>
      <p>Identification. Crawlers that aggregate information from several pages within a Web site make
invisible what entities are being identified and how they are interlinked. Yet, links between resources
may themselves have a meaning—an assumption that is at the foundations of the Semantic Web.</p>
      <p>Interaction. Interoperability between clients (agents) and servers (platforms) is not guaranteed by a
standard protocol. Instead, the trend of building “walled gardens” on top of Web standards is reinforced, each
platform requiring a dedicated tool with its own abstractions. Among other standards, OpenSearch
could be an efficient substitute for search tools. Github, YouTube and Wikipedia (at least) already follow this
standard.</p>
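      <p>As a sketch of why OpenSearch could substitute for per-platform search tools: an agent only needs to parse a site's OpenSearch description document to obtain a search URL template. The document below is hand-written for the example; real sites serve their own:</p>

```python
# Sketch: discovering a site's search endpoint from its OpenSearch
# description document. The document below is a hand-written example.
import xml.etree.ElementTree as ET
from urllib.parse import quote

OSD = """<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>Example Search</ShortName>
  <Url type="text/html" template="https://example.org/search?q={searchTerms}"/>
</OpenSearchDescription>"""

def search_url(description_xml, terms):
    ns = {"os": "http://a9.com/-/spec/opensearch/1.1/"}
    root = ET.fromstring(description_xml)
    template = root.find("os:Url", ns).get("template")
    # OpenSearch standardizes the {searchTerms} substitution
    return template.replace("{searchTerms}", quote(terms))
```

      <p>One generic tool thus covers every platform that publishes such a description, instead of one tool per platform.</p>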
      <p>Representation. Scrapers alter the original representation of pages, generating accessibility trees or
textual summaries that are non-standard. Direct content negotiation with servers could be used instead,
to provide machine-readable or purely textual representations of HTML pages. A current proposal is to
serve Markdown to software agents as an alternative to HTML (see the recent llms.txt proposal).</p>
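      <p>Server-side content negotiation, as suggested above, amounts to picking a representation from the client's Accept header. A minimal sketch, in which the set of offered media types is an assumption:</p>

```python
# Sketch of server-side content negotiation: choose a representation
# based on the client's Accept header. The set of offered media types
# is an illustrative assumption.
def negotiate(accept_header, offered=("text/html", "text/markdown", "text/turtle")):
    prefs = []
    for part in accept_header.split(","):
        media, _, params = part.strip().partition(";")
        q = 1.0  # per HTTP semantics, the default quality value is 1
        for param in params.split(";"):
            if param.strip().startswith("q="):
                q = float(param.strip()[2:])
        prefs.append((q, media.strip()))
    for q, media in sorted(prefs, reverse=True):
        if media in offered:
            return media
    return offered[0]  # nothing acceptable: fall back to the default
```

      <p>An agent could then send Accept: text/markdown, text/html;q=0.8 and receive the most machine-friendly representation the server offers.</p>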
      <p>Human agents require a browser to access Web resources. Browsers follow certain links automatically
(pointing to CSS and JavaScript resources, among others), pages can be rendered in different ways (Firefox
has a reading mode, for instance, based on the Readability.js library, which was even used to train WebGPT) and many browser extensions can dynamically alter interactions or
representations. Arguably, Web tools provided by agent frameworks collectively play the same role
as a browser. However, as they are currently implemented, these tools are inefficient intermediaries.
Origin servers typically generate HTML from data that is already (partly) structured; they would more
efficiently serve a representation that chatbots can process (after content negotiation). Moreover, most
tools cited here are only accessible via a remote service: the agent thus entirely depends on another
server that becomes a centralization point, against the original design objective of the Web as a massively
distributed information system.</p>
      <p>Tools that rely on Web technologies can surely be considered as Web tools. However, agents that use
these tools are not exactly Web agents if the main abstractions of the Web (URIs, HTTP, HTML and
other hypermedia formats) are not exposed to them.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Benchmark Datasets</title>
      <p>
        WebGPT was tested on the ELI5 dataset, extracted from the “explain like I’m five” topic on Reddit [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
The ReAct approach to software agents was initially evaluated in the WebShop environment, an
ecommerce platform populated with information extracted from Amazon [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In addition to these two,
several benchmark datasets were introduced to evaluate the behavior of agents on the Web, on different
aspects. Two families of datasets can be identified: some datasets require interaction with Web servers,
others do not, except for search.
      </p>
      <p>
        Datasets that only require search include ELI5, SimpleQA [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], GPQA [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and GAIA [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. They are
question answering datasets designed to quantify the information retrieval capabilities of agents: agents
are expected to select relevant sources of information in order to answer factual questions. Questions in
ELI5 are open-ended, typically starting with ‘why’ and ‘how’. Other questions expect a precise answer,
such as a date or a figure. Answers in SimpleQA, as its name suggests, can easily be found by a human
but they are hard to guess. In contrast, GPQA, the Google-proof question answering dataset, includes
questions that are also hard for humans, even with access to the entire Web. GAIA, the general AI
assistant dataset, evaluates more general agents, but more than half of its questions (355/660) require
Web search.
      </p>
      <p>
        The remaining datasets, listed in Table 3, require action on a Web server. QAWoB, FormWoB and
MiniWoB++ are all derived from World of Bits, original work by OpenAI [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. QAWoB is formulated as
a question answering benchmark but, unlike the benchmarks above, agents must only visit a single Web site (to
find a recipe, for example). In FormWoB, agents even have to visit a single Web page. The difficulty in
this case is to correctly fill in a complex form (to reserve a flight). MiniWoB++ is derived from MiniWoB,
originally published alongside QAWoB and FormWoB. Both variants are a synthetic counterpart to
FormWoB, with simpler but more diverse examples. WebShop, Mind2Web and WebArena combine
question answering and form completion (to perform a local search, for instance). WebShop exposes
simplified pages to agents whereas Mind2Web and WebArena expose raw HTML pages, as found on the
Web. Mind2Web did not originally provide screenshots but a later version does (Multimodal-Mind2Web).
The latest version of Mind2Web (Online-Mind2Web) is designed for agents that directly access online
Web sites rather than offline copies [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Evaluating agents online was first introduced for WebVoyager,
an agent based on GPT-4V [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>
        Broad question answering datasets do not prescribe what Web site to visit to find the right answer
but they are distributed with verified sources. Table 3 shows the most frequent sources across datasets.
Wikipedia is first on all datasets (SimpleQA, GPQA, GAIA). ELI5 was initially built from various sources
but the version that was integrated into the knowledge-intensive language tasks (KILT) benchmark
entirely relies on Wikipedia as well [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. GPQA also relies on data provided by the National Institutes
of Health (NIH), in the United States, and LibreTexts, an educational digital commons. It is interesting to
note that content from these three providers is also available in a more structured way. YAGO, DBpedia
and Wikidata all provide RDF representations of Wikipedia pages (see e.g. https://www.wikidata.org/entity/Q132451509),
the NIH PubChem platform serves both HTML and RDF representations (see https://pubchem.ncbi.nlm.nih.gov/docs/rdf)
and LibreTexts embeds RDF data in HTML pages, as encouraged by the Schema.org initiative5. In other
words, an RDF source exists for 100% of ELI5, 40% of SimpleQA,
37% of GPQA and 18% of GAIA—at least. Yet, to the best of our knowledge, agents evaluated against
these datasets ignore structured data.
      </p>
      <p>
        Agents that are evaluated against these benchmark datasets largely follow the principle seen in
the previous section: they rely on tools that provide high-level abstractions. Each task in the GAIA
dataset, for instance, is annotated with one or more tool categories. Web-related tasks only require a
search engine, not navigation. Environments that require action, from World of Bits to WebArena, all
send images or text to agents and expect human-like interactions, not HTTP messages. Researchers
noted that “many MiniWob++ tasks require clicking or dragging actions that cannot be achieved with
DOM-element based action” [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Out of the four Web sites used in FormWoB, none actually has a
form element for the task in its raw HTML representation6. In other words, those benchmark datasets
carry with them the same limitations as agent frameworks regarding identification, interaction and
representation. They hide the links and forms that should drive the action of Web agents, as per the
hypermedia-as-the-engine-of-application-state (HATEOAS) principle.
      </p>
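      <p>By contrast, a hypermedia-driven agent would read links and forms directly from HTML, as HATEOAS prescribes, rather than from screenshots or accessibility trees. A minimal sketch using Python's standard HTML parser (the example page is fabricated):</p>

```python
# Sketch: extracting the hypermedia controls (links and forms) that
# should drive a Web agent's actions. The page below is fabricated.
from html.parser import HTMLParser

class HypermediaExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.forms = [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])  # candidate next resources
        elif tag == "form":
            # forms describe the write operations the server supports
            self.forms.append({"action": attrs.get("action"),
                               "method": attrs.get("method", "get")})

page = '<a href="/flights">Flights</a><form action="/book" method="post"></form>'
extractor = HypermediaExtractor()
extractor.feed(page)
```

      <p>The extracted controls, rather than pixel coordinates or DOM indices, would then constitute the agent's action space.</p>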
      <p>
        Commercial agents by Anthropic and OpenAI perform well on WebArena and Mind2Web. They
respectively complete 61% and 56% of the tasks on Online-Mind2Web, for instance, while the best
agent developed by an academic team (remotely accessing GPT-4V) completes 31% of the tasks [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
On average, seven actions must be performed to complete a single task in this dataset, which clearly
indicates that chatbot-based agents understand, in some way, their environment, originally designed for
humans. However, there is no evidence that agents understand the main abstractions of the Web. The
fact that agents perform equally well on unrelated benchmarks rather suggests the contrary.
On the general-purpose AgentBench benchmark, agents based on GPT-4 have comparable results on
Web-related tasks (taken from WebShop and Mind2Web) and on tasks involving code execution and
game simulations [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ].
      </p>
      <p>
        Are there tasks that could only be completed in a Web environment? Original publications about the
Semantic Web hint at two core tasks: choosing among Web services and composing Web services. In
“Agents and the Semantic Web,” Hendler describes a task in which an agent must help a fishing boat
caught in a storm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The agent finds several services on the Web, either to download satellite images or
to call coast guards for help. In this task, the agent has to choose one of the services, although they are
not functionally equivalent. Similar tasks include choosing among several offers for the same product
and selecting among contradictory sources of information. In “The Semantic Web,” the same year,
Berners-Lee, Lassila and Hendler describe a task involving a patient, their family and physicians [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ].
The agent must find an appointment with a physician that is compatible with the schedule of all family
members. Here, the agent must find physicians in an area and fetch the schedule of several persons
which, for privacy reasons, are assumed to be managed by distinct personal information management
systems. Each participant in the task has their own server, for instance a Social Linked Data (Solid) Pod7,
and their own “steward” agent that has privileged access to their data but not to others’.
      </p>
      <p>In all those tasks, servers are assumed to be managed by distinct organizations. Information is not
and should not be centralized on a single platform. WebArena includes tasks that require cross-site
navigation, fulfilling this requirement. For instance, “an agent needs to find out what art museums are
located in Pittsburgh by searching Wikipedia. Next, it should identify the location of each museum on
a map, optimizing the itinerary based on the information collected. Finally, the agent needs to update
the README file in the appropriate repository with the planned route 8.” Cross-site tasks like this
one represent only 5.9% of the dataset. Most tasks could have been performed on mobile or desktop
applications with a specific user interface. Mind2Web includes no cross-site task.
5see e.g. https://search.google.com/test/rich-results/result?id=WlI1Rg7EGr_aZRQcFDfPtg
6on jetblue.com and aa.com, a form is added by a script; on alaskaair.com, the page includes a form for mobile browsers, not
desktop browsers; on united.com, the page has no form even after execution of all embedded scripts.
7see https://solidproject.org/
8see https://webarena.dev/</p>
    </sec>
    <sec id="sec-4">
      <title>4. Proposal</title>
      <p>Some agent frameworks expose Web abstractions to agents (LangGraph and Smolagents have generic
tools for HTTP requests, for instance) but most frameworks do not. Some benchmark datasets impose
cross-site navigation (WebArena and GPQA have tasks combining information from different Web
sites) but only on a small set of tasks. More generally, architectural constraints of the Web encourage
interoperability but platform-specific tools and tasks, on the contrary, encourage isolation. In light
of this opposition, chatbot-based agents should not be considered as Web agents. It is true that the
research objective behind the work reviewed in this paper is to build intelligent agents, not specifically
Web agents. However, that work does not entirely address the original research project initiated by
Berners-Lee, Lassila and Hendler. To align with their research objective, this paper introduces criteria to decide
whether a task is specific to the Web or not. The criteria, formulated below, are based on several
assumptions: to perform a single task, the agent acts as a Web client (it always initiates interactions), it
reads and writes resources from one or more Web servers and representations may be served in any
format.</p>
      <p>Definition 1. A task performed by an intelligent agent is specific to the Web if all of the following criteria
hold:
1. the client reads the representation of resources from at least two servers during execution;
2. the client reads the representation of a resource only if a link to that resource exists in the representation
of a previously accessed resource, or if it is the first resource being read;
3. the client writes the representation of a resource only if a form describing the operation exists in the
representation of a previously accessed resource;
4. the client executes no server-sent code (optional).</p>
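      <p>Criteria 1 to 3 can be checked mechanically against a trace of the client's HTTP interactions. The trace format below (method, server, URI, plus the links and forms discovered in each response) is an illustrative assumption:</p>

```python
# Sketch: checking criteria 1-3 of Definition 1 against a trace of HTTP
# interactions. The trace format is an illustrative assumption.
def is_web_specific(trace):
    servers_read = set()
    known_links, known_forms = set(), set()
    for i, step in enumerate(trace):
        uri = step["uri"]
        if step["method"] == "GET":
            # criterion 2: every read follows a previously seen link,
            # except for the very first resource
            if i > 0 and uri not in known_links:
                return False
            servers_read.add(step["server"])
        else:
            # criterion 3: every write follows a previously seen form
            if uri not in known_forms:
                return False
        known_links |= set(step.get("links", []))
        known_forms |= set(step.get("forms", []))
    # criterion 1: representations were read from at least two servers
    return len(servers_read) >= 2
```

      <p>Criterion 4 would additionally require inspecting each representation for executable content.</p>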
      <p>This definition prescribes that all resource representations are in a hypermedia format, exposing links
and forms. HTML is the most common hypermedia format but it is not the only one. RDF formats (Turtle,
JSON-LD and others) are also hypermedia formats, for example. The definition also excludes “intelligent”
intermediaries, such as remote chatbot services, because such intermediaries are not interlinked with
the rest of the Web. It is also assumed that the set of resources that are read during execution is minimal:
no subset is sufficient to complete the task. If an agent always uses a remote search engine to find other
resources, criterion 1 is trivially met without this condition.</p>
      <p>Question answering tasks in ELI5, SimpleQA, GPQA and GAIA may or may not meet criterion 1. For simple
questions, as in ELI5 and SimpleQA, a single source of information is often enough. GPQA has three
levels of difficulty; questions at the highest level (post-graduate level) all require combining sources and
thus meet criterion 1. Criterion 3 does not apply to question answering tasks but the question of whether
criteria 2 and 4 are met by GPQA and GAIA is open. WebArena, the only dataset in the other category
that meets criterion 1, is also not guaranteed to meet criteria 2 and 4. To answer this research question,
agents could rely on RDF representations for navigation, delegating decisions to language models when
structured data is not available. Unlike RDF crawlers, whose purpose is to collect as many RDF resources
as possible, such agents would aim to minimize the number of visited resources, following links (exposed
as RDF triples) only if they bring the agent closer to a target resource. To find an optimal route connecting art museums in
Pittsburgh, for instance, only one hop from Wikipedia to OpenStreetMap is required. Most of the information
can first be fetched from Wikidata with the following SPARQL query: SELECT ?m ?coords WHERE {
?m wdt:P31 wd:Q33506 ; wdt:P131 wd:Q1342 ; wdt:P625 ?coords }. Then, each result
entity has a link to OpenStreetMap, which provides further options to query entities with OpenSearch
and find directions via an HTML form.</p>
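      <p>As a sketch, the SPARQL query above can be sent to the public Wikidata endpoint (https://query.wikidata.org/sparql) with plain HTTP and content negotiation; only request construction is shown, since executing it requires network access:</p>

```python
# Sketch: issuing the SPARQL query from the text against the public
# Wikidata endpoint. Only request construction is shown; sending the
# request requires network access.
from urllib.parse import urlencode
from urllib.request import Request

QUERY = """SELECT ?m ?coords WHERE {
  ?m wdt:P31 wd:Q33506 ; wdt:P131 wd:Q1342 ; wdt:P625 ?coords
}"""

def wikidata_request(query):
    url = "https://query.wikidata.org/sparql?" + urlencode({"query": query})
    # content negotiation: ask for SPARQL results in JSON
    return Request(url, headers={"Accept": "application/sparql-results+json"})

req = wikidata_request(QUERY)
# urllib.request.urlopen(req) would return the JSON result bindings
```

      <p>Each binding for ?m is a dereferenceable URI, so the agent can continue navigating by link following rather than by crawling.</p>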
      <p>Wikidata and OpenStreetMap are among the platforms that welcome software agents by adopting
technologies similar to RDF to make structured data easily accessible. By doing so, they also acknowledge
that the Web is a shared space that encourages coordination. Developing agents that operate at the level
of URIs, HTTP and RDF (with raw HTML if needed) is a good way to answer the most difficult queries
and achieve the most difficult tasks: queries a single source cannot answer; tasks a single platform
cannot perform.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The author has not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          ,
          <article-title>Where Are All the Intelligent Agents?</article-title>
          ,
          <source>IEEE Intelligent Systems</source>
          <volume>22</volume>
          (
          <year>2007</year>
          )
          <fpage>2</fpage>
          -
          <lpage>3</lpage>
          . URL: http://ieeexplore.ieee.org/document/4216971/. doi:10.1109/MIS.2007.62.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          ,
          <article-title>Agents and the Semantic Web</article-title>
          ,
          <source>IEEE Intelligent Systems</source>
          <volume>16</volume>
          (
          <year>2001</year>
          )
          <fpage>30</fpage>
          -
          <lpage>37</lpage>
          . URL: http://ieeexplore.ieee.org/document/920597/. doi:10.1109/5254.920597.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>Computer-Using Agent</article-title>
          ,
          <year>2025</year>
          . URL: https://openai.com/index/computer-using-agent/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Anthropic</surname>
          </string-name>
          ,
          <article-title>Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku</article-title>
          ,
          <year>2024</year>
          . URL: https://www.anthropic.com/news/3-5-models-and-computer-use.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Technical Architecture Group</surname>
          </string-name>
          ,
          <source>Architecture of the World Wide Web</source>
          , Volume One,
          <year>2004</year>
          . URL: https://www.w3.org/TR/webarch/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          , I. Shafran,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>ReAct: Synergizing Reasoning and Acting in Language Models</article-title>
          ,
          <source>in: The Eleventh International Conference on Learning Representations</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=WE_vluYUL-X.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <article-title>Executable Code Actions Elicit Better LLM Agents</article-title>
          , in: Forty-first
          <source>International Conference on Machine Learning</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=jJ9BoXAfFa.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>GPT-4V(ision) is a Generalist Web Agent, if Grounded</article-title>
          , in: Forty-first
          <source>International Conference on Machine Learning</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=piecKJ2DlB.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Ioannidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Subbian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Globerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mackey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Paquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tomczak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>37</volume>
          ,
          Curran Associates, Inc.,
          <year>2024</year>
          , pp.
          <fpage>25981</fpage>
          -
          <lpage>26010</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2024/file/2db8ce969b000fe0b3fb172490c33ce8-Paper-Conference.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Balaji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Eloundou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Button</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Knight</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <article-title>WebGPT: Browser-assisted question-answering with human feedback</article-title>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2112.09332. doi:10.48550/arXiv.2112.09332, arXiv:2112.09332 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>Introducing deep research</article-title>
          ,
          <year>2025</year>
          . URL: https://openai.com/index/introducing-deep-research/.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Schick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dwivedi-Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dessì</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Raileanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lomeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cancedda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <article-title>Toolformer: Language Models Can Teach Themselves to Use Tools</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Globerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>36</volume>
          ,
          Curran Associates, Inc.,
          <year>2023</year>
          , pp.
          <fpage>68539</fpage>
          -
          <lpage>68551</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/d842425e4bf79ba039352da0f658a906-Paper-Conference.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gerstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs</article-title>
          , in:
          <source>The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=dHng2O0Jjr.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>World of Bits: An Open-Domain Platform for Web-Based Agents</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Precup</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Teh</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 34th International Conference on Machine Learning</source>
          , volume
          <volume>70</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3135</fpage>
          -
          <lpage>3144</lpage>
          . URL: https://proceedings.mlr.press/v70/shi17a.html.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>E. Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pasupat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Reinforcement Learning on Web Interfaces using Workflow-Guided Exploration</article-title>
          , in:
          <source>International Conference on Learning Representations</source>
          ,
          <year>2018</year>
          . URL: https://openreview.net/forum?id=ryTp3f-0-.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <article-title>WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Koyejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Belgrave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>35</volume>
          ,
          Curran Associates, Inc.,
          <year>2022</year>
          , pp.
          <fpage>20744</fpage>
          -
          <lpage>20757</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/82ad13ec01f9fe44c01cb91814fd7b8c-Paper-Conference.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>Mind2Web: Towards a Generalist Agent for the Web</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Globerson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>36</volume>
          ,
          Curran Associates, Inc.,
          <year>2023</year>
          , pp.
          <fpage>28091</fpage>
          -
          <lpage>28114</lpage>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/5950bf290a1570ea401bf98882128160-Paper-Datasets_and_Benchmarks.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sridhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fried</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Alon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <article-title>WebArena: A Realistic Web Environment for Building Autonomous Agents</article-title>
          ,
          in:
          <source>The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=oKn9c6ytLx.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Auli</surname>
          </string-name>
          ,
          <article-title>ELI5: Long Form Question Answering</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Traum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics
          , Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>3558</fpage>
          -
          <lpage>3567</lpage>
          . URL: https://aclanthology.org/P19-1346/. doi:10.18653/v1/P19-1346.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Glaese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <article-title>Introducing SimpleQA</article-title>
          ,
          <year>2024</year>
          . URL: https://github.com/openai/simple-evals.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Rein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Stickland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Petty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dirani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>GPQA: A Graduate-Level Google-Proof Q&amp;A Benchmark</article-title>
          ,
          in:
          <source>First Conference on Language Modeling</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=Ti67584b98.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mialon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fourrier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <article-title>GAIA: a benchmark for General AI Assistants</article-title>
          ,
          in:
          <source>The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=fibxvahvs3.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>An Illusion of Progress? Assessing the Current State of Web Agents</article-title>
          ,
          <year>2025</year>
          . URL: http://arxiv.org/abs/2504.01382. doi:10.48550/arXiv.2504.01382, arXiv:2504.01382 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models</article-title>
          , in:
          <string-name>
            <given-names>L.-W.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>6864</fpage>
          -
          <lpage>6890</lpage>
          . URL: https://aclanthology.org/2024.acl-long.371/. doi:10.18653/v1/2024.acl-long.371.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yazdani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>De Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <article-title>KILT: a Benchmark for Knowledge Intensive Language Tasks</article-title>
          , in:
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Association for Computational Linguistics
          , Online,
          <year>2021</year>
          , pp.
          <fpage>2523</fpage>
          -
          <lpage>2544</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.200/. doi:10.18653/v1/2021.naacl-main.200.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Humphreys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Raposo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pohlen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Thornton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chhaparia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muldal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Abramson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Georgiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <article-title>A data-driven approach for learning to control computers</article-title>
          , in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jegelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szepesvari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sabato</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 39th International Conference on Machine Learning</source>
          , volume
          <volume>162</volume>
          of
          <source>Proceedings of Machine Learning Research</source>
          , PMLR
          ,
          <year>2022</year>
          , pp.
          <fpage>9466</fpage>
          -
          <lpage>9482</lpage>
          . URL: https://proceedings.mlr.press/v162/humphreys22a.html.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Men</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>AgentBench: Evaluating LLMs as Agents</article-title>
          , in:
          <source>The Twelfth International Conference on Learning Representations</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=zAdUB0aCTQ.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lassila</surname>
          </string-name>
          ,
          <article-title>The Semantic Web</article-title>
          ,
          <source>Scientific American</source>
          (
          <year>2001</year>
          )
          <fpage>36</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>