<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>MemAgent: A cache-inspired framework for augmenting conversational Web Agents with task-specific information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nazmus Sakib</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Protoy Barai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sifat Ishmam Parisa</string-name>
          <email>sifatiparisa@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anindya Iqbal</string-name>
          <email>anindya@cse.buet.ac.bd</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>LLM Agents, Memory Cache Bank, MemAgent, Agentic Memory</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bangladesh University of Engineering and Technology</institution>
          ,
          <addr-line>Dhaka</addr-line>
          ,
          <country country="BD">Bangladesh</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Large Language Model (LLM)-based web agents require users to repeatedly provide task-specific information across interactions, limiting their practical utility. To address this issue, we propose MemAgent, a framework that enhances web agents with a cache-inspired memory mechanism to store and retrieve task-specific information. MemAgent employs a two-phase architecture that separates information gathering (alignment) from task execution, and introduces a Memory Cache Bank (MCB) with time-based expiration policies. Our evaluation on 150 web tasks across three categories shows that MemAgent reduces the average number of conversation turns by 22.4% (from 5.00 to 3.88). A human evaluation with 15 participants demonstrates a 58% reduction in task completion time for recurring tasks. Our implementation code, data, and trained models are available at: https://github.com/DialogBased-Interaction/Goal_Alignment</p>
      </abstract>
      <kwd-group>
        <kwd>LLM Agents</kwd>
        <kwd>Memory Cache Bank</kwd>
        <kwd>MemAgent</kwd>
        <kwd>Agentic Memory</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the rise of Large Language Models (LLMs), we have seen an increase in automation in many
aspects of our lives, giving rise to the concept of Web Agents [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1, 2, 3, 4, 5</xref>
        ]. Broadly, web agents are
systems that use LLMs as their engines and can perform actions on websites based on observations.
These agents can automate users' web experience, such as booking a flight [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or shopping on Amazon [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Current state-of-the-art web agents typically require users to provide a well-crafted, detailed task
description before executing it. However, prior research shows that crafting effective prompts is a non-trivial
task for users. Studies [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] highlight that users often provide abstract and incomplete prompts,
struggling to anticipate and convey all the necessary information. This issue is further exacerbated for
recurring tasks, as users need to repeatedly provide the same level of detail every time, leading to an
inefficient and frustrating user experience.
      </p>
      <p>
        To overcome these issues, recent works have explored augmenting agents with short-term, long-term,
and working memory [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These agents typically store information in their working/short-term
memory and later pass it into long-term memory. However, the transformation of this information
is complex and not controllable. On the other hand, a few works have explored enabling agents to
ask follow-up questions when they are unsure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] or information is missing. Although these agents
can engage with users and ask follow-up questions as they execute, they still suffer from the memory
limitation: users need to engage with the agent every time they execute a task. This raises the question:
how can we bridge these two paradigms with a simple yet effective agent framework?
      </p>
      <p>
        To this end, we present MemAgent, a simple yet effective agent that learns to store task information
in a cache by conversing with the users. MemAgent works in two phases: Alignment and Execution. In
the Alignment phase, the agent is trained to pose follow-up questions to users, capturing and storing
their responses in our dedicated memory cache bank (MCB). During the Execution phase, it leverages
this stored information to perform tasks, thereby eliminating the need for users to repeatedly engage in
lengthy dialogues, as required by existing models. Instead of using a short-term or long-term memory
mechanism [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we design a simpler, yet effective, storage mechanism similar to a cache. MCB saves the
task details, including type and value information for each task entity, and includes an auto-expiration
field, which helps to refresh MemAgent's storage periodically and to model the user's dynamic preferences.
      </p>
      <p>Our contributions can be summarized as follows:
1. A novel web agent pipeline, MemAgent, that can store task-specific information in a memory
cache bank (MCB). MemAgent learns to create and retrieve information from the MCB by conversing
with the users.
2. An evaluation of MemAgent on a diverse set of tasks, showcasing its abilities and its improvements
over existing web agents.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Autonomous Web Agent</title>
        <p>
          There is a large body of work on autonomous web agents, investigating how to efficiently utilize
large language models for automating everyday web activities [
          <xref ref-type="bibr" rid="ref11 ref12 ref13 ref14 ref2 ref3 ref4 ref5">2, 3, 4, 5, 11, 12, 13, 14</xref>
          ]. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] performs an
offline exploration and creates a transition graph, which is used to provide more contextual information
in the LLM prompt. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] introduces chain-of-action prompting, which leverages previous action history
and future action plans to decide the next action. Most of the early works on Web UI are based on the
synthetic frameworks MiniWoB [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and WebShop [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. To capture the complexity of real-world tasks,
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] introduce two realistic environments and datasets encompassing real-world tasks, later extended
for evaluating large multimodal web agents by [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], respectively. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] also introduces
a real-world dataset for multimodal web agents and overlays bounding boxes on the web
elements, similar to Set-of-Mark prompting [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], to improve the web agent. Perhaps the closest to our
work is WebLinx [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], a multi-turn dialogue dataset for web activities. However, our approach
is significantly different from theirs: we separate the chat and operation actions into two distinct
phases, Alignment and Execution, and we primarily focus on improving web agents' performance on
abstract task descriptions and repetitive tasks. Our MCB is also different from the approach used in
WebLinx. Similarly to our approach, [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] builds a conversational dataset, MT-Mind2Web, by
organizing and combining tasks from the Mind2Web dataset based on the similarity of the website
domain and instructions. Their approach also involves a memory bank, but unlike our MCB, it includes
the conversation history, previous actions, and the environmental state (HTML). They employ
multifaceted matching and reflection modules to filter out irrelevant memory components.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Memory augmentation for LLM Agent</title>
        <p>
          There has been growing interest in how to incorporate human cognitive principles into LLM agents
[
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. CoALA proposes how a combination of procedural, semantic, and episodic memory can be useful
for improving the reasoning capacity of agents [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. RET-LLM proposes simple read-and-write memory
operations for language models [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. MemGPT proposes a memory augmentation for GPT models
that can be accessed with simple function calls [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. MemoryBank [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] stores a summary of the chat
history and a user portrait to help in future conversations and recommendations. Unlike their approach,
we do not store a summary but rather the detailed, user-specific information of each task individually,
enabling more transparent and accurate replication in the future.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The MemAgent Framework</title>
      <p>In this section, we outline the components of our MemAgent framework. MemAgent operates on the
principle of temporal decoupling; we validate this design choice with a pilot study detailed in §6.
Preliminaries. Given an abstract task description 𝒯, MemAgent does the following:
1. Extract the necessary information ℳ = {(tᵢ, vᵢ) | i = 1, ..., |ℳ|} through conversational
interaction (§3.1)
2. Store this information in the MCB with appropriate expiration policies (§3.2)
3. Execute the task using the combination of 𝒯 and the information ℳ retrieved from the MCB (§3.3)</p>
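      <p>As a minimal illustration of this three-step flow, the following Python sketch shows one way the phases could be orchestrated. The injected callables (ask_llm, ask_user, execute) and the TTL value are hypothetical placeholders rather than our released implementation; an MCB object with this get/put interface is sketched in §3.2.</p>
      <preformat>
# Sketch of the MemAgent flow (hypothetical helper names, not the released code).
def run_memagent(task, mcb, ask_llm, ask_user, execute):
    # Phase 1 (Alignment, S3.1): gather (type, value) pairs by conversing.
    info = dict(mcb.get(task) or {})      # reuse cached entries if still fresh
    while True:
        question = ask_llm(task, info)    # agent poses one follow-up question
        if question == "FINISH":          # agent decides nothing is missing
            break
        entity_type, value = ask_user(question)
        info[entity_type] = value         # store as type -&gt; value
    # Store the gathered information in the MCB with an expiration policy (S3.2).
    mcb.put(task, info, ttl_seconds=7 * 24 * 3600)  # TTL value is an assumption
    # Phase 2 (Execution, S3.3): act on the task plus the retrieved information.
    return execute(task, info)
      </preformat>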
      <sec id="sec-3-1">
        <title>3.1. Phase 1: Alignment</title>
        <p>Given 𝒯, MemAgent engages in a multi-turn conversation with the user in the Alignment phase to obtain
all the necessary details for 𝒯. In this phase, the agent has two key responsibilities: 1) Enquire: only
ask questions that are relevant to the current task; 2) Extract: parse the user's response to find the
information type (tᵢ) and value (vᵢ).</p>
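        <p>A minimal sketch of the Extract step, assuming the &lt;mem&gt;type: value&lt;/mem&gt; output format used in the prompts of Appendix A.1:</p>
        <preformat>
import re

# Parse "&lt;mem&gt;type: value&lt;/mem&gt;" spans emitted by the alignment model.
MEM_PATTERN = re.compile(r"&lt;mem&gt;\s*([^:&lt;]+?)\s*:\s*([^&lt;]+?)\s*&lt;/mem&gt;")

def extract_memory(model_output):
    """Return {entity_type: entity_value} pairs found in one agent turn."""
    return {t: v for t, v in MEM_PATTERN.findall(model_output)}

# extract_memory("&lt;mem&gt; Target Community: r/announcements &lt;/mem&gt;")
# -&gt; {"Target Community": "r/announcements"}
        </preformat>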
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Memory Cache Bank (MCB)</title>
        <p>Central to MemAgent is the memory cache bank, MCB, which stores ℳ for each task 𝒯. Similar to a cache,
ℳ has an `Expires` field, which controls when ℳ becomes stale. MCB provides several benefits to
MemAgent: 1) Reduced conversation turns: it stores the detailed information ℳ for 𝒯, so that the user
does not need to provide the detailed information every time they want to execute 𝒯. 2) Integration
with retrieval-augmented pipelines: MCB can easily be integrated with vector databases to support
retrieval-augmented execution for web agents (please see §7.1 for detailed experiments with a vector
database).</p>
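        <p>A minimal sketch of how the MCB and its `Expires` field could be realized, assuming a simple in-process dictionary; the concrete schema and TTL handling in our implementation may differ:</p>
        <preformat>
import time

class MemoryCacheBank:
    """Task -&gt; {type: value} entries, each with an Expires timestamp."""

    def __init__(self):
        self._store = {}

    def put(self, task, info, ttl_seconds):
        # 'Expires' mirrors the HTTP cache header that inspired the design.
        self._store[task] = {"info": dict(info),
                             "expires": time.time() + ttl_seconds}

    def get(self, task):
        entry = self._store.get(task)
        if entry is None:
            return None
        if time.time() &gt;= entry["expires"]:   # stale entry: auto-expire
            del self._store[task]             # forces re-alignment next time
            return None
        return entry["info"]
        </preformat>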
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Phase 2: Execution</title>
        <p>
          Given 𝒯 and ℳ, MemAgent completes the task in the Execution phase. In this phase, we adopt a two-step
workflow similar to the MindAct framework proposed by Mind2Web [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Our approach differs in
that we concatenate 𝒯 and ℳ instead of relying solely on the task description 𝒯. This concatenation
allows us to examine the efficacy of the additional context for task completion, without altering the
execution strategy (§4.1). Similar to MindAct, our execution framework operates in two steps: 1)
candidate generation: a small LM ranks webpage elements based on 𝒯;¹ 2) action prediction: a larger
LM predicts the action and target element from the top-k candidates ranked in the first step (k = 10).
        </p>
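        <p>A sketch of one execution step under this two-stage workflow; rank_elements and predict_action stand in for the small ranking LM and the larger action-prediction LM, and the concatenation format is illustrative:</p>
        <preformat>
def execute_step(task, info, page_elements, rank_elements, predict_action, k=10):
    """One step of the MindAct-style two-stage execution (sketch)."""
    # Step 1 (candidate generation): a small LM scores webpage elements
    # against the task description.
    candidates = rank_elements(task, page_elements)[:k]
    # The action predictor sees the abstract task T concatenated with the
    # retrieved MCB information M; the execution strategy is unchanged.
    augmented = task + ". " + "; ".join(f"{t}: {v}" for t, v in info.items())
    # Step 2 (action prediction): a larger LM picks the target element and
    # the operation (Click, Select, or Type) from the top-k candidates.
    return predict_action(augmented, candidates)
        </preformat>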
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>
          While there are multiple datasets for web agents, there is no dataset in our desired format that
includes multi-turn conversations and task information in slot-filling style [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. Hence, we synthetically
augment the Mind2Web dataset [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to create conversational dialogues between a user and an
agent. Table 1 shows an example from our augmented data. We use GPT-4-1106-preview to
create this augmented data following the Self-Refine framework [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. Specifically, we tell the GPT model
to generate the augmented data, followed by feedback in terms of conciseness (whether it includes
repetitive conversation), usefulness (whether it includes useful questions), and verbosity (whether it
asks the questions with less verbosity) on a scale of 1 to 5. If the score is below 5 on any metric, we ask
GPT to refine the augmented data further. Table 2 shows the data distribution used in MemAgent.
        </p>
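        <p>A sketch of the Self-Refine loop described above; generate and critique stand in for GPT-4-1106-preview calls, and the maximum number of refinement rounds is an assumption:</p>
        <preformat>
def self_refine_augment(task, generate, critique, max_rounds=3):
    """Generate a conversational sample, then refine until all scores hit 5."""
    sample = generate(task, feedback=None)
    for _ in range(max_rounds):
        # critique returns 1-5 scores for conciseness, usefulness, verbosity.
        scores = critique(sample)
        if all(s == 5 for s in scores.values()):
            break                                 # accept only perfect scores
        sample = generate(task, feedback=scores)  # ask the model to refine
    return sample
        </preformat>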
        <p>Table 1: Example from our augmented data.
Follow-up Questions for Alignment Phase → Memory Bank, ℳ
What is the weight of the package? → Weight: 4 pounds
Where is the package being shipped from? → Shipped from: Texas
What is the destination of the package? → Destination: New York
Corresponding Task in Mind2Web: Calculate shipping cost for 4 pound package from Texas to New York</p>
        <p>4.2. Models.</p>
        <p>
          Finetuning. For alignment, we have fine-tuned Vicuna 7B [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. We initialize the training in two ways:
1) empty MCB: the agent has to ask all the questions relevant to the task; 2) prefilled MCB: the agent has to ask
only the remaining questions relevant to the task. For execution, we finetune MindAct from Mind2Web
in its three variants (Flan-T5 Base, Large, and XL). Each training run was completed on an A100 or A6000
GPU. For hyperparameters, please see §4.4.
        </p>
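        <p>As a hedged sketch, the LoRA setup from Table 3 could be reproduced with the Hugging Face peft library roughly as follows; lora_alpha, target_modules, and the exact base checkpoint are assumptions not stated in the paper:</p>
        <preformat>
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")  # assumed checkpoint
lora = LoraConfig(
    r=8,                                  # LoRA rank from Table 3
    lora_alpha=16,                        # scaling factor: assumed value
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)        # train for 4 epochs at lr 2e-4 (Table 3)
        </preformat>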
        <p>
          In-context Learning (ICL). We also report the effectiveness of MemAgent with few-shot prompting
of LLMs. We report our results on both GPT-4o and Gemini-1.5-Pro with 2-shot prompting. For
Alignment, we explore basic, CoT [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], and ReAct [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] prompting techniques, with and without the MCB. For
execution, we explore 3-shot prompting, similar to Mind2Web. Please see Appendix A.1 for the
corresponding prompt in each setting.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Evaluation Metrics</title>
        <p>
          Alignment. To measure whether the task information is derived successfully, we adopt the BERTScore
[
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] and BLEU [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] metrics to calculate the similarity between the ground-truth and the generated
MCB. We also measure the number of conversation turns between the user and the agent (lower is better), to assess
how well the model can ask relevant questions. For an objective evaluation of information extraction,
we also measure the Precision and Recall of the extracted memory entities and their corresponding values.
Execution. To assess the successful execution of the task, we measure the metrics established in the
literature [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The supported operations are Click, Select, and Type. Element accuracy measures
whether the predicted element matches the ground truth; together with the operation F1, it determines
the step success rate, and a task is successful only if all of its steps succeed.¹ We observe that models
sometimes ask the same questions repetitively (Figure 13, Appendix), which makes the conversation very lengthy. Hence,
we forcefully stop the conversation if it exceeds 10 turns.
¹We use their off-the-shelf candidate generator, since the data augmentation does not impact the ranking.
        </p>
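        <p>The step-level metrics can be sketched as follows; this is an approximation of the Mind2Web-style scoring (token-level F1 over the predicted operation and value), not the exact evaluation script:</p>
        <preformat>
from collections import Counter

def step_metrics(pred_elem, pred_op, gold_elem, gold_op):
    """Approximate element accuracy, operation F1, and step success."""
    elem_acc = (pred_elem == gold_elem)
    p, g = Counter(pred_op.split()), Counter(gold_op.split())
    overlap = sum((p &amp; g).values())            # shared tokens (multiset)
    prec = overlap / sum(p.values()) if p else 0.0
    rec = overlap / sum(g.values()) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    step_success = elem_acc and f1 == 1.0      # both parts must be correct
    return elem_acc, f1, step_success

# A task is successful (SR) only if every one of its steps succeeds.
        </preformat>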
        <p>Algorithm 1: MemAgent Evaluation</p>
        <preformat>
Data:   task; ground-truth conversation [(q_gt, a_gt)]; ground-truth memory M_gt;
        ground-truth actions [act_gt]
Result: M; turn; F1; step_SR; elem_acc; SR

turn = 0; mem_bank = []; message = [];
if task in bank: message.append(bank[task]); mem_bank.append(bank[task]);
while true do
    turn += 1;
    q, mem = alignment(message);
    q_gt, a_gt = find_closest(q, [(q_gt, a_gt)]);
    a = calculate(q, mem, a_gt);
    message.append(a);
    mem_bank.append(mem);
    if (turn &gt; 10 or 'FINISH' in q): break;
end
while true do
    act = execution(mem_bank, task, state);
    if ('FINISH' in act): break;
    F1, step_SR, elem_acc = calculate(act, act_gt);
end
SR = (sum(step_SR) == len(act_gt));
        </preformat>
        <p>Table 4 (flattened in extraction; per-model scores not recoverable): Alignment-phase results for the finetuned Vicuna-7B (with empty and with prefilled MCB) and for 2-shot prompting with GPT-4o and Gemini-Pro (basic, +MCB, +CoT+MCB, +ReAct+MCB).</p>
        <p>Table 3 (flattened in extraction): Hyperparameters. (a) Alignment: 4 epochs, learning rate 2e-4, LoRA with rank r = 8. (b) Execution: epochs and batch size for the fine-tuned MindAct models, plus the temperature parameters for 2-shot prompting.</p>
        <p>Table 5: MemAgent results for the Extraction phase. Gemini-Pro+CoT performs the best among the LLMs, but the finetuned Vicuna-7B model outperforms the others by a significant margin.</p>
        <sec id="sec-4-2-1">
          <title>Model Name</title>
          <p>Cross-Task
Type (↑)</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>Value (↑)</title>
          <p>Cross-Website
Cross-Domain
Type (↑)</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>Value (↑)</title>
          <p>Type (↑)</p>
        </sec>
        <sec id="sec-4-2-4">
          <title>Value (↑)</title>
        </sec>
        <sec id="sec-4-2-5">
          <title>Precision</title>
        </sec>
        <sec id="sec-4-2-6">
          <title>Recall</title>
        </sec>
        <sec id="sec-4-2-7">
          <title>Precision</title>
        </sec>
        <sec id="sec-4-2-8">
          <title>Recall</title>
        </sec>
        <sec id="sec-4-2-9">
          <title>Precision</title>
        </sec>
        <sec id="sec-4-2-10">
          <title>Recall</title>
        </sec>
        <sec id="sec-4-2-11">
          <title>Precision</title>
        </sec>
        <sec id="sec-4-2-12">
          <title>Recall</title>
        </sec>
        <sec id="sec-4-2-13">
          <title>Precision</title>
        </sec>
        <sec id="sec-4-2-14">
          <title>Recall</title>
        </sec>
        <sec id="sec-4-2-15">
          <title>Precision</title>
        </sec>
        <sec id="sec-4-2-16">
          <title>Recall</title>
          <p>Finetuned model Vicuna7B
2-Shot
Prompting
GPT-4o + CoT + MCB
GPT-4o + ReAct + MCB
Gemini-Pro + CoT + MCB
Gemini-Pro + ReAct + MCB 0.26
0.34
0.27
0.27
0.41
0.28
0.13
0.12
0.17
0.11
0.45
0.40
0.44
0.52
0.46
0.39
0.20
0.17
0.22
0.18
0.34
0.28
0.20
0.30
0.20
0.20
0.11
0.08
0.15
0.10
0.53
0.32
0.36
0.39
0.36
0.30
0.12
0.14
0.17
0.15
0.38
0.12
0.12
0.13
0.10
0.27
0.05
0.05
0.06
0.04
0.50
0.14
0.20
0.16
0.10
0.34
0.06
0.09
0.08
0.04</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.4. Experimental Setup</title>
      <p>
        Framework. We use the FastChat and Axolotl frameworks for training the models in the Alignment phase. For
Execution, we followed the official GitHub repository of Mind2Web [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Hyperparameter setup.
Alignment. Table 3a shows the hyperparameter settings of this phase.</p>
      <p>Execution. We fine-tuned all three MindAct Flan-T5 models with a learning rate of 5e-5. Flan-T5
Large and Flan-T5 XL were fine-tuned using LoRA. Table 3b shows the other hyperparameters: epochs, batch
size, LoRA rank r, LoRA scaling factor α, and the temperature parameters for ICL.</p>
      <p>Similar to Mind2Web, due to budget constraints, we evaluate MemAgent on 150 test samples (50
from each split: Cross-Task, Cross-Website, Cross-Domain). As we use Mind2Web's off-the-shelf
candidate generator, a failure to rank the ground-truth (positive) candidates could impact overall
performance. To minimize this effect, we pick samples with the fewest missing candidates. Specifically,
50 samples in Cross-Domain have positive candidates for all task steps; for Cross-Task and Cross-Website,
the counts are 43 and 29, respectively. To pick the remaining samples in these splits, we randomly select
samples with missing candidates in only one step. This approach ensures a more reliable evaluation of
MemAgent's performance.</p>
    </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Alignment</title>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Extraction</title>
        <p>The scores in the alignment phase do not fully depict our agent's ability to extract task-specific
memory entities. Therefore, we take the first k (the number of turns in the ground truth) model outputs from
each conversation. Then, we extract the memory portions from the ground truth and the model outputs.
Given the types and values of the ground truths and the model outputs, we calculate the precision and recall
of entity types and entity values separately by similarity matching. Table 5 shows that the finetuned
Vicuna-7B outperforms all the tested models in extraction. The Vicuna-7B-prefilled model is not evaluated
because some of the memory entities are prefilled, which may bias the overall model output.
Gemini-Pro+CoT+MCB performs best among the LLMs. We only use the CoT and ReAct
prompts in this evaluation because of their consistent performance in the alignment phase.</p>
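        <p>A sketch of the similarity-matching computation, applied separately to entity types and entity values; match() is a stand-in for our similarity test (e.g., a thresholded embedding similarity or exact match):</p>
        <preformat>
def entity_precision_recall(pred_items, gold_items, match):
    """Greedy one-to-one matching of predicted vs. ground-truth entities."""
    matched, remaining = 0, list(gold_items)
    for p in pred_items:
        for g in remaining:
            if match(p, g):            # similarity test (assumed threshold)
                matched += 1
                remaining.remove(g)
                break
    precision = matched / len(pred_items) if pred_items else 0.0
    recall = matched / len(gold_items) if gold_items else 0.0
    return precision, recall
        </preformat>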
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Execution</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Human Evaluation</title>
      <p>
        We conduct a pilot study with 15 participants to understand the impact of storing repetitive queries
in the MCB. We used a vector database to simulate the MCB, as shown in Figure 2. Whenever a user queries
the model, we first fetch similar tasks performed by the user from the vector database. A fixed TTL
(Time-to-Live) simulates cache invalidation: within the TTL boundary,
semantically similar task histories are fetched and re-ranked based on similarity and recency. In
our vector database, we keep three vector representations: BM25 as the sparse vector [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ],
text-embedding-3-large as the dense vector [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], and ColBERT as the late-interaction model [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ]. These histories serve as a
cache. The model then generates the necessary questions, with some of the answers auto-filled.
We developed an interactive application to test the effectiveness of memory-based task assistance.
This application leverages the cache mechanism described above, where similar task histories are fetched
based on semantic relevance and recency.
      </p>
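      <p>A simplified sketch of this cache lookup, assuming cosine similarity as a stand-in for the database's per-vector scoring (real BM25 and ColBERT late interaction score differently) and an assumed 0.8/0.2 weighting of similarity versus recency:</p>
      <preformat>
import math, time

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fetch_cached_tasks(query, entries, ttl_seconds, top_k=5):
    """TTL filter (cache invalidation), then re-rank by similarity and recency.

    query/entries carry 'sparse', 'dense', and 'late' vectors (BM25,
    text-embedding-3-large, ColBERT) plus a 'created' timestamp."""
    now = time.time()
    live = [e for e in entries if now - e["created"] &lt; ttl_seconds]
    def score(e):
        sim = sum(cosine(query[k], e[k]) for k in ("sparse", "dense", "late")) / 3
        recency = 1.0 - (now - e["created"]) / ttl_seconds  # newer is better
        return 0.8 * sim + 0.2 * recency                    # assumed weights
    return sorted(live, key=score, reverse=True)[:top_k]
      </preformat>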
      <p>Figure 3b demonstrates the autofill feature, where user inputs such as shoe size and preferred brand
are pre-populated based on previous queries. This feature reduces user effort in filling in redundant
information, thus improving the overall user experience.</p>
      <p>The purpose of integrating this cache simulation was to reduce the time and cognitive load required
for users to perform similar tasks. In tasks where autofill is enabled, users can immediately confirm
pre-filled fields, expediting the task completion process.</p>
      <sec id="sec-6-1">
        <title>6.1. Study Setup</title>
        <p>We assessed the time-saving benefits of our memory-based task assistant. The study evaluated the time
required for users to complete tasks under three different scenarios: (1) cross-domain, (2) cross-task,
and (3) cross-website interactions.</p>
        <p>
          During the study, the user had to converse with an assistant to query three randomly chosen tasks
from Mind2Web [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The procedure was as follows:
1. First Query (No match found in cache): Participants picked a task from the Mind2Web dataset
and performed it without any cache assistance.
2. Second Query (Match found in cache): Participants performed a similar task with the aid of
the caching system, which utilized previously stored information to enhance task performance.
3. Third Query (Match expired): The cache had expired, and participants attempted to perform
the task again to measure performance without the benefits of caching.
        </p>
        <p>Figure 3: (a) Average response times (in seconds) for different task categories across three stages: Initial stage, With Cache, and Cache Invalidation. (b) The autofill feature in action: certain fields, such as shoe size and color, are auto-filled based on previous task history.</p>
        <p>
          After each round of queries, users were asked to answer the following questions. We adopted a
modified version of the NASA-TLX [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] questionnaire to understand the workload in each query:
        </p>
        <p>• How hard did you have to work to accomplish your level of performance before caching?
• How hard did you have to work to accomplish your level of performance after caching?
• How relevant were the auto-filled entries to your current goal?
• Do you prefer auto-filling the entries rather than entering the values yourself?
• How successful were you in accomplishing what you were asked to do with caching?
• Does setting the threshold and asking the questions again from scratch support your dynamic
preference?</p>
        <p>As shown in Figure 3a, the average response times demonstrate that enabling the cache significantly
reduces the time needed to complete tasks across all three categories. In the initial stage, without any
cache, response times are considerably higher. After the cache expires, system performance returns to
near-initial levels, but the reduction in time during the cache-enabled stage shows the advantage of
employing memory-based mechanisms in task-repetition scenarios.</p>
        <p>The graph highlights that the cross-website category exhibits the most substantial improvement,
suggesting that tasks involving different websites but similar contexts benefit the most from
cache-assisted processing.</p>
        <p>In addition to the tasks associated with the caching mechanism, participants were asked to respond
to various questions regarding their experience. While the focus here is on the responses to specific
features, other questions were also part of the study to gain a comprehensive understanding of the
system’s impact.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Results</title>
        <p>Figure 4 illustrates participants' responses to the various features assessed in the study. The
responses are categorized by the three stages of the experiment and are
presented on a scale from 1 to 5 (very low to very high). The workload was significantly reduced
when matching entries were found in the cache, demonstrating the effectiveness of our mechanism.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Discussion</title>
      <sec id="sec-7-1">
        <title>7.1. MemAgent for RAG</title>
        <p>
          MemAgent’s modular components allow integration with the RAG framework [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. MCB can be stored
and queried from a vector database. Moreover, the alignment models, finetuned with a prefilled memory
bank, ask questions only when information is missing. We perform an additional analysis with prefilled
MCB, reporting the ratio of conversations to MCB entries (Figure 5). As anticipated, the alignment
model asked fewer questions when the MCB contained more information.
        </p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. MemAgent for dynamic preference modeling</title>
        <p>
          Current agents struggle to handle user preferences effectively [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Although memory-augmented
agents show promise in storing information [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the transformation of memory remains complex. In
contrast, our MCB is straightforward yet powerful: it stores user preferences for a defined period before
automatically removing them. This enables MemAgent to dynamically model user preferences.
        </p>
      </sec>
      <sec id="sec-7-3">
        <title>7.3. MemAgent for generalist web modeling</title>
        <p>MemAgent's information is generalizable across websites. For example, to book a flight, we always
need to know the time and the departure and arrival locations, no matter which booking website we are using.
Since our MCB only stores ℳ for each task 𝒯 and is independent of the website, it can reuse the task
information across websites with similar use cases.</p>
      </sec>
      <sec id="sec-7-4">
        <title>7.4. Efectiveness of Temporal Decoupling</title>
        <p>We observe that the separation of the alignment and execution phases provides several benefits:</p>
        <p>7.4.1. Cognitive Load Distribution.
Our pilot study reveals that by front-loading the information-gathering process, users can focus entirely
on providing accurate information without the distraction of watching the agent attempt (and potentially
fail at) the task execution. Prioritizing task-information retrieval also naturally maximizes time efficiency
for users (a 58% reduction in user response time).</p>
        <p>7.4.2. Learning Efficiency.
The alignment model learns a more focused objective, asking relevant questions, rather than the
complex joint objective of conversation and action prediction. This specialization leads to more targeted
and efficient conversations (up to a 22.4% reduction in conversation turns).</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion</title>
      <p>In this paper, we presented MemAgent, a novel pipeline designed to address the limitations of LLM
web agents, particularly the misalignment between user expectations and the agent’s actions. By
incorporating the MCB, MemAgent effectively stores task-specific information, allowing it to proactively
query for supplementary context. This approach reduces user interaction overhead and enhances task
completion success. Our evaluations demonstrate significant improvements in both performance and
usability of the agent, indicating that MemAgent is a promising step towards seamless integration of
LLMs in web agent technologies.</p>
    </sec>
    <sec id="sec-9">
      <title>Limitation</title>
      <p>MemAgent has been tested on Mind2Web, which is a static dataset. There might be additional challenges
when MemAgent is deployed in an interactive web environment, which is beyond the current scope.
Currently, MemAgent supports the creation of one MCB per task; cases where users want to
utilize multiple MCBs may not be supported well. For example, if a user wants to concurrently book flights
from New York to Florida and from Chicago to Pennsylvania, MemAgent may not be able to store both
at the same time.</p>
    </sec>
    <sec id="sec-10">
      <title>Acknowledgments</title>
      <p>We would like to extend our gratitude to Faria Huq, PhD student at Carnegie Mellon University,
for her contributions to this research project. Her consultation and guidance refined the depth of
our methodology and the scope of our investigation. Additionally, her assistance in reviewing the
manuscript has been pivotal in shaping the overall quality of this work.</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>
        The authors utilized third-party writing assistants (ChatGPT, Gemini, Grammarly) to refine the
manuscript. This usage was limited to improving the presentation and readability of the work and did
not involve these tools in any intellectual or creative capacity [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ]. The intellectual contributions and
research content remain solely the product of the authors' efforts.
      </p>
    </sec>
    <sec id="sec-12">
      <title>A. Appendix</title>
      <sec id="sec-12-1">
        <title>A.1. Prompts used for various LLM calls</title>
        <p>This section includes additional figures that provide visual insight into the topics discussed.
Figures 8-11 show the prompts used for 2-shot prompting in the GPT-4o and Gemini-Pro
evaluations. Figure 12 shows the prompt used for GPT-4o/Gemini-Pro execution.</p>
        <p>User wants to generate conversation data (where &lt;abs&gt; includes the input task description and a consecutive list of
question (Q), answer (A), and memory (mem) tuples) for the input task description.</p>
        <p>However, the conversation data collected is not always clean. Your task is to filter out repetitive tuples that are
already present in &lt;abs&gt;.</p>
        <p>Follow these guidelines:
1. If a question is already answered in the &lt;abs&gt;, discard it.
2. Rate the quality from 1-5 (1: bad, 5: good) for conciseness (whether it includes repetitive conversation),
usefulness (whether it includes useful questions), and verbosity (whether it asks the question with less verbosity.)
3. Do NOT delete any information that was present in the original description but not in &lt;abs&gt;.
4. If the data looks good to you, you can just reply noop.</p>
        <p>Here is an example:
Original Description: Find a latest post with more than 10k upvotes in r/announcements community and upvote it.
Input:
&lt;Abs&gt; Upvote latest post with high engagement &lt;/Abs&gt;
&lt;Questions&gt;
&lt;Q&gt; Which community's latest post should be searched for? &lt;/Q&gt;
&lt;A&gt; r/announcements &lt;/A&gt;
&lt;mem&gt; Target Community: r/announcements &lt;/mem&gt;
&lt;Q&gt; What is the minimum number of upvotes required for the post to be considered? &lt;/Q&gt;
&lt;A&gt; More than 10,000 upvotes &lt;/A&gt;
&lt;mem&gt; Minimum Upvotes Required: More than 10,000 &lt;/mem&gt;
&lt;Q&gt; What action should be taken once a suitable post is found? &lt;/Q&gt;
&lt;A&gt; Upvote it &lt;/A&gt;
&lt;mem&gt; Action to Take: Upvote the post &lt;/mem&gt;
&lt;/Questions&gt;
Thought: The abstract description already mentioned that the task is to upvote a post which is repeated in the
last question.So, I will discard the last question.</p>
        <p>Rate:
conciseness: 3 (the last question is repetitive),
usefulness: 4 (count of upvotes is not a mandatory parameter, the rest are good),
verbosity: 2 (questions are too lengthy)
Output: &lt;Abs&gt; Upvote latest post with high engagement &lt;/Abs&gt;
&lt;Questions&gt;
&lt;Q&gt; Which community's post? &lt;/Q&gt;
&lt;A&gt; r/announcements &lt;/A&gt;
&lt;mem&gt; Target Community: r/announcements &lt;/mem&gt;
&lt;Q&gt; Minimum number of upvotes to be considered? &lt;/Q&gt;
&lt;A&gt; More than 10,000 upvotes &lt;/A&gt;
&lt;mem&gt; Minimum Upvotes Required: More than 10,000 &lt;/mem&gt;
&lt;/Questions&gt;
Now reply with your thought, rate, and output for the following.</p>
        <p>Original Description: {tsk}
Input: {prompt}
Thought:
&lt;Q&gt; How many guests will be attending the winery tour? &lt;/Q&gt;
&lt;A&gt; 4 guests &lt;/A&gt;
&lt;mem&gt; Number of Guests: 4 guests &lt;/mem&gt;
&lt;Q&gt; What is the date and time for the winery tour booking? &lt;/Q&gt;
&lt;A&gt; April 15, at 10 am. &lt;/A&gt;
&lt;mem&gt; Tour Date and Time: April 15, at 10 am. &lt;/mem&gt;
&lt;Q&gt; What type of setting is requested for the tour? &lt;/Q&gt;
&lt;A&gt; Outdoor setup. &lt;/A&gt;
&lt;mem&gt; Setup Preference: Outdoor setup. &lt;/mem&gt;
&lt;/Questions&gt;</p>
        <p>Given an initial task description, your task is to ask follow-up questions and parse the user's response. Only ask
one question at a time. If you are done, reply with &lt;Finish&gt;. Please reply only with the question.</p>
        <p>First Example:
User: Book me a flight
Agent: Where are you going?
Second Example:
User: Subscribe to newsletter
Agent: newsletter name to subscribe to?
User: Daily Fitness Tips
Agent: What email address should be used?
User: john.fitnessfan@example.com
Agent: &lt;Finish&gt;</p>
        <p>Now complete the following task:
Given an initial task description, your task is to ask follow-up questions and parse the user's response for answer
type and value to be stored into &lt;mem&gt;type: value&lt;/mem&gt;. Only ask one question at a time. If you are done, reply with
&lt;Finish&gt;. Please reply only with the question and &lt;mem&gt; if any.</p>
        <p>First Example:
User: Book me a flight
Agent: Where are you going?
Now complete the following task:
Given an initial task description, your task is to ask follow-up questions and parse the user's response for answer
type and value to be stored into &lt;mem&gt;type: value&lt;/mem&gt;. Only ask one question at a time. If you are done, reply with
&lt;Finish&gt;. Please include your question in &lt;Q&gt; tag and parsed answer in &lt;mem&gt; tag.</p>
        <p>First Example:
User: Book me a flight.</p>
        <p>Agent: Let's think step by step. To book a flight, we need to know the departure, arrival location, and time. I will
first ask about the departure location. &lt;Q&gt;Where are you going?&lt;/Q&gt;
Second Example:
User: Subscribe to newsletter.</p>
        <p>Agent: Let's think step by step. To subscribe, I need to know the newsletter name first. &lt;Q&gt;What is the newsletter
name to subscribe to?&lt;/Q&gt;
User: Daily Fitness Tips.</p>
        <p>Agent: Let's think step by step. I will put the user response into the mem bank as a newsletter name. I also need to
ask about their email address. &lt;mem&gt; Newsletter Name: Daily Fitness Tips &lt;/mem&gt; &lt;Q&gt;What email address should be used?
&lt;/Q&gt;
User: john.fitnessfan@example.com.</p>
        <p>Agent: Let's think step by step. Since all the information is already asked, I will finish now and store the email
address from the last reply. &lt;mem&gt;Email Address: john.fitnessfan@example.com &lt;/mem&gt;&lt;Finish&gt;
Now complete the following task:
Given an initial task description, your task is to ask follow-up questions and parse the user's response for answer
type and value to be stored into &lt;mem&gt;type: value&lt;/mem&gt;. Only ask one question at a time and include your thought and
action. If you are done, reply with &lt;Finish&gt;. Please include your question in &lt;Q&gt; tag and parsed answer in &lt;mem&gt; tag.
First Example:
User: Book me a flight
Agent: Thought: To book a flight, we need to know the departure, arrival location, and time. I will first ask about
the departure location. Action: &lt;Q&gt; Where are you going? &lt;/Q&gt;
Second Example:
User: Subscribe to newsletter
Agent: Thought: To subscribe, I need to know the newsletter name first. Action: &lt;Q&gt; Newsletter name to subscribe to?
&lt;/Q&gt;
User: Daily Fitness Tips
Agent: Thought: I will put the user response into the mem bank as a newsletter name. I also need to ask about their
email address. Action: &lt;mem&gt; Newsletter Name: Daily Fitness Tips &lt;/mem&gt; &lt;Q&gt; What email address should be used? &lt;/Q&gt;
User: john.fitnessfan@example.com
Agent: Thought: Since all the information is already asked, I will finish now and store the email address from the
last reply. Action: &lt;mem&gt; Email Address: john.fitnessfan@example.com &lt;/mem&gt;&lt;Finish&gt;
Now complete the following task:
Role: System
Content: You are a helpful assistant that is great at website design, navigation, and executing tasks for the user.
Role: User
Content:
'''
&lt;html&gt; &lt;div&gt; &lt;div&gt; &lt;a tock home page /&gt; ... &lt;span&gt; Explore now &lt;/span&gt; &lt;/div&gt; &lt;/div&gt; &lt;/div&gt; &lt;/html&gt;
'''
Based on the HTML webpage above, try to complete the following task:
Task: Check restaurant availability for pickup. City: Boston, NY, Date and Time: March 18, 5pm, Number of Guests: 1
Previous actions:
None
What should be the next action? Please select from the following choices (If the correct action is not in the page
above, please select A. 'None of the above'):
A. None of the above
B. &lt;button id=0 book a reservation. toggle open&gt; &lt;span&gt; Book a
C. &lt;select id=1 type&gt; &lt;option reservations true&gt; Dine in &lt;/option&gt; &lt;option
D. &lt;div id=2&gt; &lt;p&gt; Celebrating and supporting leading women shaking up
Role: Assistant
Content:
Answer: C.</p>
        <p>Action: SELECT
Value: Pickup
Role: User
Content:
'''
&lt;html&gt; &lt;div&gt; &lt;main main&gt; &lt;section tabpanel&gt; ... &lt;/a&gt; &lt;/ul&gt; &lt;/div&gt; &lt;/footer&gt; &lt;/div&gt; ... &lt;/html&gt;
'''
Based on the HTML webpage above, try to complete the following task:
Task: Compare fare types for booking a train ticket. Departure Location: Springfield, IL, Arrival Location: Austin,
TX, Travel Date: April 29th, 2023, Number of Adults: 1
Previous actions:
[combobox] Enter your departing city, airport name, or airpor... -&gt; TYPE: SPRINGFIELD
[button] Springfield, IL, US (SPI) -&gt; CLICK
[combobox] Enter your destination city, airport name, or airp... -&gt; TYPE: AUSTIN
[button] Austin, TX, US (AUS) -&gt; CLICK
What should be the next action? Please select from the following choices (If the correct action is not in the page
above, please select A. 'None of the above'):
A. None of the above
B. &lt;li id=0 tab heading level 3 search and&gt; &lt;span&gt; Hotel
C. &lt;div id=1&gt; &lt;div&gt; &lt;span&gt; Dates* &lt;/span&gt; &lt;button button clear dates
D. &lt;ul id=2&gt; &lt;a mobile tools&gt; &lt;/a&gt; &lt;a open united's tiktok
Role: Assistant
Content:
Answer: A.</p>
        <p>Role: User
Content:
'''
&lt;html&gt; &lt;div&gt; &lt;nav main menu&gt; &lt;ul&gt; &lt;li&gt; &lt;div button&gt; Car Sales &lt;/div&gt; ... &lt;/html&gt;
'''
Based on the HTML webpage above, try to complete the following task:
Task: Find a rental vehicle. Vehicle Type: Mini van, Rental Location: Brooklyn City, Rental Start Date: April 5th,
Rental End Date: April 8th, Renter's Age: 22 years old
Previous actions:
[searchbox] Pick-up &amp; Return Location (ZIP, City or Airport) (... -&gt; TYPE: Brooklyn
[option] Brooklyn, NY, US Select -&gt; CLICK
What should be the next action? Please select from the following choices (If the correct action is not in the page
above, please select A. 'None of the above'):
A. None of the above
B. &lt;div id=0&gt; &lt;div&gt; &lt;div&gt; &lt;div&gt; Buy A Car &lt;/div&gt; &lt;div&gt;
C. &lt;div id=1&gt; Enterprise Fleet Management &lt;/div&gt;
D. &lt;button id=2 selected pick-up date 03/19/2023&gt; &lt;span&gt; &lt;span&gt; 19 &lt;/span&gt;
Role: Assistant
Content:
Answer: D.</p>
        <p>Action: CLICK</p>
        <p>Q: Can you specify whether you have a particular browser or tool that you would like to use to open the reviews?
A: Not specified</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weng</surname>
          </string-name>
          , W. Cheng,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Qin,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <article-title>The rise and potential of large language model based agents: A survey</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2309</volume>
          .
          <fpage>07864</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sridhar</surname>
          </string-name>
          , X. Cheng, Y. Bisk,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fried</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Alon</surname>
          </string-name>
          , et al.,
          <article-title>Webarena: A realistic web environment for building autonomous agents</article-title>
          ,
          <source>arXiv preprint arXiv:2307.13854</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Su,</surname>
          </string-name>
          <article-title>Mind2web: Towards a generalist agent for the web</article-title>
          ,
          <source>arXiv preprint arXiv:2306.06070</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Enabling conversational interaction with mobile ui using large language models</article-title>
          ,
          <source>in: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI '23</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          . URL: https://doi.org/10. 1145/3544548.3580895. doi:
          <volume>10</volume>
          .1145/3544548.3580895.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          , Webshop:
          <article-title>Towards scalable real-world web interaction with grounded language agents</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2207</volume>
          .
          <fpage>01206</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zamfirescu-Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Why johnny can't prompt: how non-ai experts try (and fail) to design llm prompts</article-title>
          ,
          <source>in: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>21</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T. S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Stylette: Styling the web with natural language</article-title>
          ,
          <source>in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, CHI '22</source>
          ,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          . URL: https://doi.org/10.1145/3491102.3501931. doi:
          <volume>10</volume>
          .1145/3491102.3501931.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Packer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wooders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Patil</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Stoica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <article-title>MemGPT: Towards LLMs as operating systems</article-title>
          ,
          <year>2024</year>
          . arXiv:2310.08560.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X. H.</given-names>
            <surname>Lù</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kasner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <article-title>WebLINX: Real-world website navigation with multi-turn dialogue</article-title>
          ,
          <source>arXiv preprint arXiv:2402.05930</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Sumers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Griffiths</surname>
          </string-name>
          ,
          <article-title>Cognitive architectures for language agents</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.02427.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Empowering LLM to use smartphone for intelligent task automation</article-title>
          ,
          <year>2023</year>
          . arXiv:2308.15272.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>You only look at screens: Multimodal chain-of-action agents</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.11436.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>World of Bits: An open-domain platform for web-based agents</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3135</fpage>
          -
          <lpage>3144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Kapoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. P.</given-names>
            <surname>Butala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Russak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kamble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Alshikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <article-title>OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web</article-title>
          ,
          <source>arXiv preprint arXiv:2402.17553</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>GPT-4V(ision) is a generalist web agent, if grounded</article-title>
          ,
          <source>in: Forty-first International Conference on Machine Learning</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=piecKJ2DlB.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Duvvur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.-Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fried</surname>
          </string-name>
          ,
          <article-title>VisualWebArena: Evaluating multimodal agents on realistic visual web tasks</article-title>
          ,
          <source>ACL</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>WebVoyager: Building an end-to-end web agent with large multimodal models</article-title>
          ,
          <source>arXiv preprint arXiv:2401.13919</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V</article-title>
          ,
          <source>arXiv preprint arXiv:2310.11441</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-K.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-S.</given-names>
            <surname>Chua</surname>
          </string-name>
          ,
          <article-title>On the multi-turn instruction following for conversational web agents</article-title>
          , in:
          <string-name>
            <given-names>L.-W.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>8795</fpage>
          -
          <lpage>8812</lpage>
          . URL: https://aclanthology.org/2024.acl-long.477.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>A survey on the memory mechanism of large language model based agents</article-title>
          ,
          <year>2024</year>
          . arXiv:2404.13501.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Modarressi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Imani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fayyaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          ,
          <article-title>RET-LLM: Towards a general read-write memory for large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.14322.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>MemoryBank: Enhancing large language models with long-term memory</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.10250.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>A survey of joint intent detection and slot filling models in natural language understanding</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Madaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hallinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wiegreffe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Alon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dziri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Prabhumoye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al.,
          <article-title>Self-refine: Iterative refinement with self-feedback</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          , et al.,
          <article-title>Judging LLM-as-a-judge with MT-Bench and Chatbot Arena</article-title>
          ,
          <source>arXiv preprint arXiv:2306.05685</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Shafran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>ReAct: Synergizing reasoning and acting in language models</article-title>
          ,
          <year>2023</year>
          . arXiv:2210.03629.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          ,
          <source>arXiv preprint arXiv:1904.09675</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>BLEU: a method for automatic evaluation of machine translation</article-title>
          , in:
          <string-name>
            <given-names>P.</given-names>
            <surname>Isabelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Charniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          , Philadelphia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          ,
          <source>in: Foundations and Trends in Information Retrieval</source>
          , volume
          <volume>3</volume>
          , Now Publishers Inc,
          <year>2009</year>
          , pp.
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          , OpenAI text embedding models,
          <year>2023</year>
          . URL: https://platform.openai.com/docs/guides/embeddings.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>O.</given-names>
            <surname>Khattab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <article-title>ColBERT: Efficient and effective passage search via contextualized late interaction over BERT</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '20)</source>
          , ACM,
          <year>2020</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Hart</surname>
          </string-name>
          ,
          <article-title>NASA-task load index (NASA-TLX); 20 years later</article-title>
          ,
          <source>in: Proceedings of the Human Factors and Ergonomics Society Annual Meeting</source>
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>E.</given-names>
            <surname>Nakazawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Udagawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Akabayashi</surname>
          </string-name>
          ,
          <article-title>Does the use of AI to create academic research papers undermine researcher originality?</article-title>
          ,
          <source>AI</source>
          <volume>3</volume>
          (
          <year>2022</year>
          )
          <fpage>702</fpage>
          -
          <lpage>706</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>