<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Memory System For Smart Personal Assistants</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felix Ocker</string-name>
          <email>felix.ocker@honda-ri.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jörg Deigmöller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pavel Smirnov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julian Eggert</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Honda Research Institute Europe</institution>
          ,
          <addr-line>Carl-Legien-Str. 30, 63073 Offenbach am Main</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A wide variety of agentic AI applications, ranging from cognitive assistants for dementia patients to robotics, demand a robust memory system grounded in reality. In this paper, we propose such a memory system consisting of three components. First, we combine Vision Language Models for image captioning and entity disambiguation with Large Language Models for consistent information extraction during perception. Second, the extracted information is represented in a memory consisting of a knowledge graph enhanced by vector embeddings to efficiently manage relational information. Third, we combine semantic search and graph query generation for question answering via Retrieval Augmented Generation. We illustrate the system's operation and potential using a real-world example.</p>
      </abstract>
      <kwd-group>
        <kwd>Memory System</kwd>
        <kwd>Ontology Construction</kwd>
        <kwd>Retrieval Augmented Generation</kwd>
        <kwd>GraphRAG</kwd>
        <kwd>Grounding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The emergence of Large Language Models (LLMs) has advanced conversational assistants beyond
rule-based systems, enabling them to operate within a user’s perceptual and conceptual context. For
this, assistants must integrate stored knowledge with ongoing interactions to ensure that responses
remain relevant and grounded. Retrieval Augmented Generation (RAG) techniques combine LLMs
with external knowledge bases and multimodal LLMs process diverse inputs, enabling richer, more
context-aware interactions. However, the need for assistants with personal and situational support
based on a large-scale memory also highlights critical challenges. First, to effectively deal with memories
in the form of multimodal inputs, a robust conceptual framework is needed that acknowledges the
role of space and time as fundamental dimensions of experience and memory. Inspired by Kantian
notions, which describe space and time as fundamental structuring elements imposed by the mind [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
we recognize that memory systems must integrate these dimensions to maintain coherent, grounded
knowledge. Second, standard RAG stores information as disconnected snippets, failing to capture the
relational dependencies needed for complex queries [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Third, true situational awareness requires
structured, concept-based retrieval and inference for more advanced reasoning and decision-making. To
address these challenges, we propose a novel approach for grounded memory-based personal assistants.
Our approach builds on a structured memory akin to human episodic and biographical memory, ensuring
that information is pre-structured before inference rather than relying on on-the-fly conceptualization
like standard RAG. Each of these components addresses a specific challenge, resulting in three pillars:
      </p>
      <list list-type="order">
        <list-item>
          <p>Grounded Perception: Structure multimodal inputs with spatial and temporal awareness, categorizing them into actions, agents, and objects.</p>
        </list-item>
        <list-item>
          <p>Memory Graph: Overcome standard RAG limitations by using a richer knowledge representation in the form of an ontological framework for representing memories, i.e., structuring interconnected concepts and enhancing memory versatility through semantic embeddings.</p>
        </list-item>
        <list-item>
          <p>Agentic Retrieval: Use graph querying and expansion together with semantic search to improve coherence and context-awareness for complex queries.</p>
        </list-item>
      </list>
      <p>By combining these elements, our system enables assistants to deliver personalized, context-aware
support with enhanced reasoning and decision-making.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        This section reviews related work for the three pillars of our memory system: grounded perception,
memory graphs, and agentic retrieval. Grounded perception organizes multimodal data into actions,
agents, and objects with temporal awareness, forming structured action patterns [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Memory graphs
overcome standard RAG limitations by structuring knowledge to capture even implicit relations. Agentic
retrieval enhances reasoning via graph-based inference instead of relying solely on embedding similarity.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Grounded Perception</title>
        <p>
          Multimodal perception has advanced with LLMs, aiding AI applications like robotics and surveillance.
For instance, robots critically depend on their visual understanding capabilities for navigation [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and
object localization tasks [
          <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
          ]. Recent work such as 3D Dynamic Scene Graphs (DSGs) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and
TASKOGRAPHY [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] rely on creating structured models of the environment. However, perception
in robotics usually does not focus on building a lifelong memory, but rather on creating a faithful
representation of the current environment which could be recalled for specific tasks. For Embodied
RAG [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the authors build a structured semantic forest based on spatial proximity which can be used in
combination with LLMs to support robotic navigation. Other examples of multimodal perception systems
specializing in human activity recognition are systems for understanding long videos [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Such systems,
e.g., VideoAgent [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and AMEGO [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], focus on person-object annotations, primarily tracking
hand-object interactions without explicit action labeling. Effective memory-based assistants need persistent
representations of actions, agents, and objects with contextual tracking. Many multimodal perception
systems offer contextualized understanding but lack structured long-term recall. Our approach integrates
LLM-based perception with a structured graph-based memory to ensure interpretability and retrieval.
With advances in LLMs, many specialized environment recognition and action detection approaches
are being replaced by multimodal LLMs [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. In the context of this paper, we rely on multimodal
LLMs, specifically Vision Language Models (VLMs), for these tasks, since they generally provide more
contextualized information for building a grounded memory system. While VLMs provide a flexible and
context-aware understanding, they lack the structured memory needed for long-term, explainable recall
by themselves. Our approach addresses this by integrating VLM-based perception into a structured,
graph-based memory, ensuring that memories remain interpretable and retrievable.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Memory Graphs</title>
        <p>
          A scalable memory is essential for assistants with personal support capabilities. Due to their benefits
regarding the integration of heterogeneous data [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], knowledge graphs provide an excellent
technological basis for such a memory [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. For instance, TobuGraph [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is an approach to transform pictures and
conversations with a text-based chatbot into a memory graph. The authors demonstrate the limitations
of the standard RAG approach for describing personal memories. In [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], the authors describe MemPal
, a wearable video-based conversational device for assisting elderly people with memory impairments. MemPal
focuses on the use case of finding lost objects and is evaluated with respect to the effects of voice-enabled multimodal
LLMs. These systems address two deficiencies of standard RAG approaches: 1) the problem of scaling
them to large-scale multimodal real-world scenarios and 2) the deficiencies in terms of representing
complex memories of interconnected world entities. To address the second deficiency, the authors
of [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] describe a framework for capturing lifelong personal memories from images and videos by
memorizing them via a natural language interface. The approach includes extracting a taxonomy of
contextual information out of textual information obtained from videos and images, with contexts
being described by time, location, people, visual elements of environment, activities and emotions.
The extracted taxonomy is used for a special retrieval which augments semantic search. While this
demonstrates that RAG-based approaches can be used to retrieve snippets of personal experiences, it
lacks the power of relational memories as provided by underlying memory graphs. In this paper, we rely
on a combination of RAG techniques with knowledge graphs for improving the retrieval capabilities of
such systems.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Agentic Retrieval</title>
        <p>
          GraphRAG is a retrieval-augmented generation technique that enriches conventional RAG pipelines with a
graph-based representation of knowledge. In standard RAG systems, semantic search is used to retrieve
relevant text snippets from a vector store, which are provided as context to an LLM for question
answering. However, this chunk-oriented retrieval can miss deeper relationships and dependencies
among pieces of information. GraphRAG addresses this limitation by building and utilizing knowledge
structured in graphs, enabling more coherent reasoning over interconnected facts. One way of realizing
GraphRAG is to conduct semantic search to find entry points in the graph and then expand the context
for further relevant, but less explicit, information. There are several suitable algorithms for graph
expansion, PageRank being one of them [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Another approach to GraphRAG is to translate natural
language queries into graph queries, such as Cypher, for structured database access [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Another
GraphRAG application is presented in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], where a knowledge graph is built from textual data. Instead
of retrieving isolated text snippets, the system retrieves relational subgraphs relevant to a user query,
which are then passed to an LLM. By leveraging both textual and structured graph-based knowledge,
this approach enables deeper reasoning over complex, interconnected facts, making it highly effective
for answering intricate queries. By leveraging a combination of these techniques, our system ensures
more explainable and context-aware responses, combining the flexibility of text-based search with the
expressiveness of a graph-based memory.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Grounded Memory System Architecture</title>
      <p>The memory system is based on a schema that revolves around textual notes, which are represented as
nodes in a graph, cp. Section 3.1. Leveraging this schema, the memory system is designed to seamlessly
capture, structure, and retrieve real-world observations through a three-phase process, see Figure 1.</p>
      <p>In the perception phase, cp. Section 3.2, cameras observe the environment, allowing a VLM to
generate descriptive captions for detected events. While this can be extended for further modalities
such as audio, we focus on visual inputs in the context of this paper. During the ingestion phase,
cp. Section 3.3, these images and captions undergo a structured analysis before being stored in a
persistent knowledge graph. Unlike unstructured memory systems, this graph-based representation
explicitly encodes who performed which action on which object, when, and where. Finally, in the
recall phase, cp. Section 3.4, the system retrieves stored information for question answering, event
verification, and intelligent recommendations. Throughout the following, we rely on a video showing
an individual in a home setting as a running example to exemplify the concepts introduced.</p>
      <sec id="sec-3-1">
        <title>3.1. Representing Memory Notes</title>
        <p>The knowledge base builds on a schema for representing so-called memory notes, cp. Figure 2. A
MemoryNote can be used to describe a time period of arbitrary length and it can be generated from
arbitrary sources, e.g., manually crafted for diary entries or generated automatically to describe a
single frame in a video. Each memory note is characterized by its note content, which is a natural
language string, and an optional list of data files from which it has been created. To create a structured
representation, every memory note is also represented as a node in a knowledge graph. For our
application, we introduce Image nodes, which are specialized memory notes that have an image caption
as note content and that refer to an image as a data file. To create a structured representation, each
MemoryNote is linked to the entities it mentions, categorized as Agents ("Who performed the action?"),
Objects ("What was acted upon?"), and Actions ("What was done?"). Images are temporally ordered using
has-previous links, cp. Figure 2, and agents, objects, and actions are connected to the images they occur
in via has-element links.</p>
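<p>The schema above can be sketched in plain Python. The class and field names below (Entity, MemoryNote, elements, previous) are illustrative stand-ins for the graph's node and link types, not the authors' implementation:</p>

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """An agent, object, or action mentioned in a memory note."""
    label: str   # consistent label, e.g., "person_1"
    kind: str    # "Agent", "Object", or "Action"

@dataclass
class MemoryNote:
    """A node describing a time period of arbitrary length in natural language."""
    content: str                                      # note content
    data_files: list = field(default_factory=list)    # optional source files
    elements: list = field(default_factory=list)      # has-element links
    previous: "MemoryNote | None" = None              # has-previous link

# An Image node is a specialized MemoryNote: its content is an image caption
# and its data file is the image it was created from.
img1 = MemoryNote(content="A person picks up a mug.", data_files=["frame_0001.jpg"])
img2 = MemoryNote(content="The person drinks from the mug.",
                  data_files=["frame_0006.jpg"], previous=img1)

person = Entity("person_1", "Agent")
mug = Entity("mug_1", "Object")
img1.elements += [person, mug, Entity("pick_up", "Action")]
img2.elements += [person, mug, Entity("drink", "Action")]
```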
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Perception</title>
        <p>The perception phase captures raw video frames and generates descriptive captions using
gpt-4o’s [21] vision capabilities, thus laying the foundation for a structured representation of events.</p>
        <p>Unlike traditional video memory systems that passively store raw visual inputs, our system actively
identifies and links key entities in the environment. This includes detecting agents, objects, and
their spatial relationships, forming a structured representation. To achieve this, the system processes
sequences of consecutive frames from the video stream using the prompt shown in Figure 3. Each
sequence is analyzed as a single unit, where the first and last frame overlap with adjacent sequences,
ensuring temporal continuity. To find a balance between efficiency and accuracy, we caption only each
n-th frame. This strategy maintains temporal coherence, reduces redundant descriptions, and results
in more accurate, context-aware scene summaries. Each described instance in the captions is indexed
with a unique label in the format [label_x:Type] to ensure consistent tracking across frames.</p>
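<p>The sampling and sequencing strategy can be illustrated with a short sketch; the function names and the concrete step values are assumptions for illustration, not the system's actual parameters:</p>

```python
def sample_frames(frame_ids, step):
    """Keep only every step-th frame to balance efficiency and accuracy."""
    return frame_ids[::step]

def overlapping_sequences(frames, seq_len):
    """Split frames into sequences whose first and last frames overlap
    with the adjacent sequences, preserving temporal continuity."""
    seqs = []
    i = 0
    while i < len(frames) - 1:
        seqs.append(frames[i:i + seq_len])
        i += seq_len - 1   # the last frame of this sequence starts the next
    return seqs

frames = sample_frames(list(range(30)), 5)   # e.g., caption every fifth frame
seqs = overlapping_sequences(frames, 3)      # each sequence is captioned as one unit
```

Each resulting sequence would then be passed to the VLM as a single unit, with the shared boundary frames anchoring entity labels across adjacent sequences.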
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Knowledge Graph and Vector Store Population</title>
        <p>The information captured during the perception phase, cp. Section 3.2, is stored in a hybrid knowledge
base combining a knowledge graph consisting of structured relationships and a vector store, i.e., a
text-oriented representation allowing for semantic search. The ingestion process consists of four steps.
First, all entities identified during the perception phase are extracted from the image captions. Together
with the entity names, we extract the entity types. Second, we create embedding vectors for the image
captions using an embedding model, resulting in high-dimensional numeric representations of the text.
If necessary due to context window limitations, the captions are split up into several parts. Third, we
create nodes in the knowledge graph for all images and connect them sequentially. To these we add the
respective attributes, such as the captions and the paths for the image files, and the embedding vectors
for the image captions. Fourth, we add nodes for all actions, agents, and objects identified and connect
them to all the image nodes in which they appear, turning the sequence of images into a connected
graph. The knowledge graph, cp. Figure 4, maintains temporal order via sequentially connected image
nodes, while objects, agents, and actions structure events. Consistent entity labels ensure continuity
and enable context-aware retrieval.</p>
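<p>The four ingestion steps can be sketched as follows; the in-memory dicts stand in for the Cypher writes and the length-based embedding is a placeholder for a real embedding model:</p>

```python
import re

LABEL = re.compile(r"\[(\w+):(\w+)\]")   # matches the [label_x:Type] markers

def ingest(captions):
    """The four ingestion steps, sketched on in-memory dicts instead of Cypher."""
    graph = {"images": [], "entities": {}, "edges": []}
    for i, caption in enumerate(captions):
        # Step 1: extract entity labels together with their types.
        mentions = LABEL.findall(caption)
        # Step 2: embed the caption (placeholder for an embedding model).
        embedding = [float(len(caption))]
        # Step 3: add the image node with its attributes, connected sequentially.
        graph["images"].append({"caption": caption, "embedding": embedding})
        if i > 0:
            graph["edges"].append((i, "has-previous", i - 1))
        # Step 4: add entity nodes and link each image to the entities it contains.
        for label, etype in mentions:
            graph["entities"][label] = etype
            graph["edges"].append((i, "has-element", label))
    return graph

g = ingest(["[person_1:Agent] picks up [mug_1:Object]",
            "[person_1:Agent] drinks from [mug_1:Object]"])
```

Because the labels stay consistent across captions, both images link to the same person_1 and mug_1 nodes, turning the image sequence into a connected graph.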
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Agentic Retrieval for Question Answering</title>
        <p>Combining a graph-based structured representation with natural language text notes allows the use of
several retrieval techniques. First, we use the memory system as a standard RAG system for semantic
search as it is based on natural language notes with embedding vectors associated to them. By embedding
the user query with the same embedding model, we create an embedding vector which we compare
to the embeddings of all the notes, retrieving the semantically most relevant notes. To increase the
relevance of the context provided to the LLM, we optionally use a reranker to check the retrieved notes
and filter out less relevant ones. Providing the retrieved notes as context, we let the LLM answer the
original question. This type of retrieval is especially efficient for questions that are likely to have
responses that are semantically close. Second, we leverage the structure of the memory system for
graph expansion. Relevant information may be included in notes which are not found when relying
purely on semantic search, but that are linked to the notes found in the graph. This is usually the case
for relevant background information, e.g., personal preferences not showing up in individual notes, but
represented in a note for an agent. Here, we start with regular semantic search for identifying an initial
set of relevant notes. Then we expand the search results using an expansion algorithm. Specifically, we
use PageRank [22], but other algorithms, e.g., random walks, also work. Finally, the expanded search
results are used as context for LLM-based question answering. This second type of retrieval is beneficial
whenever there is implicit background information, which cannot easily be found via semantic search,
but which becomes apparent when analyzing the surrounding environment of the relevant nodes in the
graph. Third, we rely on the graph representation for structured information retrieval in the form of
text2cypher. For this, we let the LLM translate the user input into a Cypher query, which is run against
the graph database. The result in the form of a table is interpreted by the LLM, which formulates a
natural language response. This type of retrieval is ideal for answering structure-oriented queries,
for instance questions that require counting entities. To leverage the benefits of all three retrieval
techniques, we combine them in an agentic retrieval system. For this, we wrap the three retrieval
functionalities into tools that we provide to an LLM-based agent, which can access them as needed for
answering user questions. The agent is prompted to select the most suitable tool, or several if necessary,
to retrieve information from the memory system and eventually formulate an appropriate answer.</p>
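<p>The three retrieval tools can be sketched as follows. This is a minimal, self-contained illustration: the one-hop neighborhood expansion stands in for PageRank, the text2cypher tool is stubbed with a fixed template, and all names and relationship types are assumptions:</p>

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_search(query_vec, notes, k=2):
    """Tool 1: standard RAG retrieval over embedded notes."""
    ranked = sorted(notes, key=lambda n: cosine(query_vec, n["vec"]), reverse=True)
    return ranked[:k]

def graph_expand(seeds, edges, notes):
    """Tool 2: expand semantic-search hits along graph links (one hop here;
    the system uses PageRank, but any expansion algorithm fits)."""
    ids = {n["id"] for n in seeds}
    ids |= {b for a, b in edges if a in ids} | {a for a, b in edges if b in ids}
    return [n for n in notes if n["id"] in ids]

def text2cypher(question):
    """Tool 3: let an LLM translate the question into a Cypher query
    (stubbed here with a fixed template for a counting query)."""
    return "MATCH (a:Agent)-[:HAS_ELEMENT]-(i:Image) RETURN a.label, count(i)"

notes = [{"id": 0, "vec": [1.0, 0.0], "text": "person picks up mug"},
         {"id": 1, "vec": [0.9, 0.1], "text": "person drinks coffee"},
         {"id": 2, "vec": [0.0, 1.0], "text": "note: person_1 prefers tea"}]
edges = [(0, 1), (1, 2)]   # background note reachable only via the graph

hits = semantic_search([1.0, 0.0], notes, k=1)
context = graph_expand(hits, edges, notes)
```

An agent would expose these three functions as tools and pick among them per question: semantic search for semantically close answers, expansion for implicit background, text2cypher for structural queries such as counting.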
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Implementation Details</title>
      <p>We implemented a prototype using Neo4j as
the graph database and vector store. Videos
were recorded using an iPhone 15, split into
frames at 3 Hz, and captions were
generated for every fifth frame using gpt-4o for
vision and text-embedding-3-small for
storing embeddings in the vector store. New
entities (images, actions, agents, objects) are
inserted into the graph using dynamically
generated Cypher queries. Memory notes are
also stored in Obsidian, enabling easy
browsing. A minimalist Gradio frontend allows
users to interact with the system via a chat
interface, cp. Figure 5, that provides links to
the sources used for transparency. Table 1
presents representative questions and system
responses, demonstrating retrieval capabilities
across different query types.</p>
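<p>As a rough illustration of dynamically generated Cypher for inserting image nodes, the labels, property names, and relationship types below are assumptions, not the prototype's actual schema:</p>

```python
def image_node_cypher(idx, caption, path):
    """Build a parameterized Cypher query inserting an image node and,
    for all but the first frame, linking it to its predecessor."""
    query = (
        "MERGE (img:Image {idx: $idx}) "
        "SET img.caption = $caption, img.path = $path"
    )
    params = {"idx": idx, "caption": caption, "path": path}
    if idx > 0:
        query += (" WITH img MATCH (prev:Image {idx: $prev}) "
                  "MERGE (img)-[:HAS_PREVIOUS]->(prev)")
        params["prev"] = idx - 1
    return query, params

q, p = image_node_cypher(1, "person drinks from mug", "frames/0006.jpg")
```

Parameterized queries of this kind could be executed via the Neo4j Python driver, with the embedding stored as a node property for vector search.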
    </sec>
    <sec id="sec-5">
      <title>5. Summary and Outlook</title>
      <p>This paper presents a grounded memory system that integrates the strengths of a knowledge graph and a
vector store for agentic RAG, with an LLM serving as an intuitive natural language interface. The system leverages
a minimalist schema and operates through three key phases: perception, ingestion, and retrieval. The
system has potential applications ranging from robotics to assistive technologies, such as support
systems for dementia patients. Our approach provides a foundation for structured memory-based
retrieval and serves as a starting point for future research in long-term knowledge representation and
context-aware reasoning. The integration of conceptual nodes provides additional flexibility, allowing
retrieval to be guided by semantic relationships rather than purely temporal order. This structured
approach enables conversational assistants to reason over past events, improving long-term memory
consistency compared to standard RAG techniques.</p>
      <p>Future work should focus on scaling up the system and conducting large-scale evaluations in
real-world scenarios. Expanding to longer multimodal sequences will allow the system to capture broader
temporal dependencies and demonstrate its effectiveness in retrieving and reasoning over complex event
histories. While we expect challenges in long-term entity disambiguation – ensuring that agents,
objects, and actions are consistently recognized across different scenes and timeframes – moving beyond
individual observations enables high-level behavior pattern identification by recognizing repetitive
actions, activity trends, and structured sequences of human interactions. Additionally, we will further
advance RAG techniques, such as recursive summarization and query rewriting, to enhance contextual
understanding and improve response accuracy. This can be supported by advancing agency in the
retriever and providing further retrieval tools, e.g., via frameworks focusing on leveraging large sets of
tools for LLMs [23]. Further improvements will also involve extending the system’s multimodal
capabilities beyond vision, incorporating audio and spatial information. In doing so, we aim to move toward
a comprehensive memory system capable of supporting autonomous agents in complex environments.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used generative AI for grammar and spelling checks.
After using these tools, the authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.
[21] J. Achiam, et al., GPT-4 technical report, arXiv:2303.08774 (2024).
[22] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Computer Networks, 1998.</p>
      <p>[23] F. Ocker, D. Tanneberg, J. Eggert, M. Gienger, Tulip agent: enabling LLM-based agents to solve tasks using large tool libraries, arXiv:2407.21778 (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>I. Kant</surname>
          </string-name>
          ,
          <source>Critique of pure reason</source>
          .
          <volume>1781</volume>
          ,
          <series>Modern Classical Philosophers</series>
          , Cambridge, MA: Houghton Mifflin (
          <year>1908</year>
          )
          <fpage>370</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kashmira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Dantanarayana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brodsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mahendra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Flautner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mars</surname>
          </string-name>
          ,
          <article-title>A graph-based approach for conversational AI-driven personal memory capture and retrieval in a real-world application</article-title>
          ,
          <source>arXiv:2412.05447</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Eggert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deigmöller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Richter</surname>
          </string-name>
          ,
          <article-title>Action representation for intelligent agents using Memory Nets</article-title>
          , in: IC3K,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Al-Tawil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hempel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelrahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Hamadi</surname>
          </string-name>
          ,
          <article-title>A review of visual slam for robotics: Evolution, properties, and future applications</article-title>
          ,
          <source>Frontiers in Robotics and AI</source>
          <volume>11</volume>
          (
          <year>2024</year>
          )
          <fpage>1347985</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Armeni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <article-title>3D scene graph: A structure for unified semantics, 3D space, and camera</article-title>
          , in: ICCV,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>U.-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>3D scene graph: A sparse and semantic representation of physical environments for intelligent agents</article-title>
          ,
          <source>IEEE Transactions on Cybernetics</source>
          <volume>50</volume>
          (
          <year>2019</year>
          )
          <fpage>4921</fpage>
          -
          <lpage>4933</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosinol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Carlone</surname>
          </string-name>
          ,
          <article-title>3D dynamic scene graphs: Actionable spatial perception with places, objects, and humans</article-title>
          , in: RSS,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Agia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Jatavallabhula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khodeir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Miksik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vineet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mukadam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Paull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shkurti</surname>
          </string-name>
          ,
          <article-title>Taskography: Evaluating robot task planning over large 3D scene graphs</article-title>
          , in: CoRL,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bajaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Johnson-Roberson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <article-title>Embodied-RAG: General non-parametric embodied memory for retrieval and generation</article-title>
          ,
          <source>arXiv:2409.18313</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>LifelongMemory: Leveraging LLMs for answering queries in long-form egocentric videos</article-title>
          ,
          <source>arXiv:2312.05269</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>VideoAgent: A memory-augmented multimodal agent for video understanding</article-title>
          , in: ECCV,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Goletto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nagarajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Averta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Damen</surname>
          </string-name>
          ,
          <article-title>AMEGO: Active memory from long egocentric videos</article-title>
          , in: ECCV,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <article-title>Visual large language models for generalized and specialized applications</article-title>
          ,
          <source>arXiv:2501.02765</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Blomqvist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>d'Amato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>de Melo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neumaier</surname>
          </string-name>
          , et al.,
          <article-title>Knowledge graphs</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>54</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Eggert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ocker</surname>
          </string-name>
          ,
          <article-title>Graph based memory extension for large language models</article-title>
          ,
          <year>2025</year>
          . US Patent App. 18/898,607.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Maniar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zulfikar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maes</surname>
          </string-name>
          ,
          <article-title>MemPal: Leveraging multimodal AI and LLMs for voice-activated object retrieval in homes of older adults</article-title>
          , in: IUI,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>OmniQuery: Contextually augmenting captured multimodal memory to enable personal question answering</article-title>
          ,
          <source>arXiv:2409.08250</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B. J.</given-names>
            <surname>Gutiérrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yasunaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>HippoRAG: Neurobiologically inspired long-term memory for large language models</article-title>
          , in: NeurIPS,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Ozsoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Messallem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Besga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Minneci</surname>
          </string-name>
          ,
          <article-title>Text2Cypher: Bridging natural language and graph databases</article-title>
          ,
          <source>arXiv:2412.10064</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>D.</given-names>
            <surname>Edge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Trinh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bradley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Truitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <article-title>From local to global: A graph RAG approach to query-focused summarization</article-title>
          ,
          <source>arXiv:2404.16130</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>