<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>OAM: Object-Aware Memory and Vision-Language Models for Zero-Shot Object Navigation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jiahui Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wen Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhongliang Deng</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Electronic Engineering, Beijing University of Posts and Telecommunications</institution>
          ,
          <addr-line>Beijing 100876</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Object-goal navigation is a key challenge in robotics: a robot must navigate to and locate a target object in an unknown environment. Previous work usually relies on semantic analysis of single-frame observations, which is prone to biased and unstable semantic understanding. In this paper, we propose OAM, a novel zero-shot object navigation framework that builds an object-centric spatiotemporal memory. OAM overcomes the limitations of single-frame observations by introducing a spatiotemporal memory buffer module. This module integrates visual, depth, and pose information, enabling the agent to recall and reason about its historical observations. In addition, an object-aware semantic focusing mechanism extracts from memory the object-related information associated with frontier cells. OAM feeds this information to a vision-language model to identify the most promising frontier to explore. We evaluated OAM in photorealistic environments from the Gibson and Habitat-Matterport 3D (HM3D) datasets within the Habitat simulator. The results show that our method outperforms existing methods in terms of both success rate (SR) and success weighted by path length (SPL).</p>
      </abstract>
      <kwd-group>
        <kwd>Object Goal Navigation</kwd>
        <kwd>Memory Buffer</kwd>
        <kwd>Object Extraction</kwd>
        <kwd>Zero-shot Object Navigation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Object-goal navigation (ObjectNav) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] requires an agent to actively explore an unknown environment and locate a specified
target object given an instruction (e.g., "bed"). This task has
broad applications in real-world scenarios, such as disaster rescue and domestic service. In this task,
the agent’s semantic understanding of the environment [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a key factor affecting its
navigation decisions.
      </p>
      <p>
        With the development of large-scale simulation datasets [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and reinforcement learning (RL), RL-based
methods have been widely applied to the ObjectNav task. These approaches, both
end-to-end [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] and modular [
        <xref ref-type="bibr" rid="ref2 ref6">2, 6</xref>
        ], aim to implicitly learn semantic relationships from large amounts of
data. However, they rely heavily on task-specific training and exhibit limited generalization. Recently,
advances in large language models (LLMs) and vision-language models (VLMs) have introduced new
solutions for ObjectNav, notably zero-shot object navigation (ZSON). ZSON eliminates the need for
task-specific training by leveraging the commonsense knowledge encoded in pre-trained models to understand
the relationship between the goal and the environment and to guide the agent’s navigation decisions.
However, most existing ZSON methods rely on single-frame observations for semantic reasoning.
This makes them vulnerable to viewpoint bias, poor lighting, occlusion, and background
clutter, which cause errors in the agent’s semantic understanding and degrade navigation performance.
      </p>
      <p>Inspired by how humans search for objects in unfamiliar environments, we identify a key mechanism:
short-term memory combined with spatial reasoning. When uncertain about their surroundings, humans tend
to recall recently observed information (for example, "I remember seeing an alarm clock and a bed in
that corner") and integrate visual cues from different points of view to make informed decisions. If the
agent could simulate this cognitive ability, as shown in Fig. 1, it could combine current and historical
observations to select the exploration path most likely to lead to the target object.</p>
      <p>In this work, we propose OAM, a novel navigation framework built around object-centric
spatiotemporal memory. By introducing a module that simulates human-like short-term memory and attention,
OAM overcomes the limitations of single-frame observations and instead integrates visual, depth,
and pose information from past experiences into a coherent spatiotemporal representation. During
navigation, the agent retrieves the memory associated with each frontier cell, uses an object detector to extract object
patches from the retrieved historical frames, and employs a VLM to identify the frontier most likely to lead to the
target object. This retrieve–focus–decide mechanism enables more robust and efficient ZSON. The
main contributions of this paper are as follows.</p>
      <p>• We propose OAM, which overcomes the limitations of single-frame observations by retrospectively
retrieving historical observations through short-term memory and using a VLM for semantic
matching.
• We introduce an object-level semantic focusing mechanism that efficiently matches
frontier points with target semantics, thus improving the quality of decision-making in complex
environments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Object-goal Navigation: The ObjectNav task requires an agent to navigate to a specified object
in an unknown environment based on instructions. Traditional methods are divided into end-to-end
approaches and map-based modular approaches. End-to-end methods [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] map observations directly
to navigation actions using reinforcement learning or imitation learning. These methods typically
require extensive training and have poor interpretability. Subsequently, a significant body of research
has focused on map-based modular approaches, which consist of modules for mapping, policy
learning, and path planning. The classical modular approach SemExp [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] constructs an explicit
semantic map and combines it with goal-directed exploration strategies. FSE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] constructs both
semantic and frontier maps, using deep learning strategies to select long-term goals for exploration.
These methods reduce computational cost and improve interpretability to some extent, but still rely on
large-scale training data. Zero-shot object navigation (ZSON) methods address this
issue and have gained widespread attention. These methods guide the agent’s navigation decisions
without additional training. Recent research has introduced VLMs [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] and LLMs [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
into ObjectNav, using the semantic knowledge embedded in these models as a prior and thereby
enabling object navigation without task-specific training.
      </p>
      <p>
        Frontier Exploration Strategies in ZSON: Recent work has introduced frontier-based exploration (FBE) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
to zero-shot navigation tasks. CLIP on Wheels (CoW) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] uses FBE to explore
the boundaries between free and unknown space, employing a simple heuristic. ESC [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] combines
an LLM with GLIP [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for object and room recognition, inferring the frontiers that are most likely
to lead to the target object. L3MVN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] uses textual descriptions of the frontiers of the semantic
map, and VLFM [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] directly inputs single-frame images into a VLM [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], outputting semantic scores from
visual and textual cues. These methods treat frontier exploration as a heuristic; we improve
upon this by aggregating spatiotemporal semantic information around frontier points as input to the
VLM. By leveraging multi-frame object-level observations, we refine the semantic similarity scores and
make fuller use of the contextual information around the frontier points, resulting in better
navigation performance.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Pipeline</title>
        <p>As illustrated in Fig. 2, OAM consists of three main modules. First, the perception module takes RGB
images, depth maps, and pose information as input to build an occupancy grid map, extract frontier
points, and construct a frontier map (Section 3.2). It also stores the agent’s observation sequence as historical
memory (Section 3.3). Second, the exploration module filters historical observations for each frontier point based on
field-of-view coverage and spatial distance, then uses an object detector to extract object patches from
the retained images. These patches are fed into a VLM to compute similarity scores, which reflect the semantic relevance of
each frontier (Section 3.4). Finally, once the most valuable frontier point has been selected, the navigation module plans a path
to it and moves the agent, which updates the observation inputs (Section 3.5).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Frontier Map Construction</title>
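        <p>At each timestep <inline-formula><tex-math>t</tex-math></inline-formula>, the agent receives an RGB image <inline-formula><tex-math>I_t</tex-math></inline-formula>, a depth map <inline-formula><tex-math>D_t</tex-math></inline-formula>, and the current pose <inline-formula><tex-math>p_t</tex-math></inline-formula>.
By projecting the depth data into the global coordinate system, we obtain the local 3D structure of
the environment. This information is then integrated into a global 2D occupancy grid map <inline-formula><tex-math>M_{occ} \in \{0, 1, -1\}^{H \times W}</tex-math></inline-formula>,
where 0 denotes free space, 1 indicates obstacles, and -1 represents unknown regions.
We define frontier points as free-space cells that are adjacent to at least one unknown cell, representing
the boundary between explored and unexplored areas. To construct the frontier map, we first identify
all free cells in <inline-formula><tex-math>M_{occ}</tex-math></inline-formula> and mark those with at least one unknown neighbor as frontier candidates. To
reduce redundancy, we apply spatial clustering to group neighboring frontier points into clusters, from
which representative points are extracted to form a discrete frontier set:</p>
        <p><disp-formula id="eq-1"><label>(1)</label><tex-math>\mathcal{F} = \{f_1, f_2, \dots, f_n\}, \quad f_i \in \mathbb{R}^2.</tex-math></disp-formula></p>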
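        <p>The frontier extraction described above can be sketched as follows. This is a minimal illustration rather than the implementation used in our experiments: the function names, the 4-neighbourhood adjacency test, the 8-connected flood-fill clustering, and the choice of the cell nearest to the cluster centroid as representative are all assumptions made for the example.</p>
        <preformat>
```python
from collections import deque

import numpy as np

FREE, OBSTACLE, UNKNOWN = 0, 1, -1  # cell labels as defined in the text


def frontier_mask(occ):
    """Mark free cells that touch at least one unknown cell (4-neighbourhood)."""
    unknown = occ == UNKNOWN
    # Pad with False so that cells outside the map never count as unknown.
    padded = np.pad(unknown, 1, constant_values=False)
    touches_unknown = (
        padded[:-2, 1:-1] | padded[2:, 1:-1] | padded[1:-1, :-2] | padded[1:-1, 2:]
    )
    return np.logical_and(occ == FREE, touches_unknown)


def frontier_representatives(occ):
    """Cluster frontier cells (8-connected) and return one (row, col) per cluster."""
    mask = frontier_mask(occ)
    rows, cols = mask.shape
    seen = np.zeros(mask.shape, dtype=bool)
    representatives = []
    for r0, c0 in zip(*np.nonzero(mask)):
        if seen[r0, c0]:
            continue
        seen[r0, c0] = True
        queue, cluster = deque([(int(r0), int(c0))]), []
        while queue:
            r, c = queue.popleft()
            cluster.append((r, c))
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    inside = rows > rr >= 0 and cols > cc >= 0
                    if inside and mask[rr, cc] and not seen[rr, cc]:
                        seen[rr, cc] = True
                        queue.append((rr, cc))
        pts = np.array(cluster)
        # Representative (assumed rule): the cluster cell closest to the centroid.
        dists = np.linalg.norm(pts - pts.mean(axis=0), axis=1)
        r_best, c_best = pts[int(np.argmin(dists))]
        representatives.append((int(r_best), int(c_best)))
    return representatives
```
        </preformat>
        <p>The representatives are returned in grid coordinates; converting them to metric map coordinates depends on the map resolution and origin, which are not fixed by this sketch.</p>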
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Spatiotemporal Memory Buffer</title>
        <p>When facing a potential target location, humans often recall objects they previously saw in that
area, an ability resembling short-term memory. Inspired by this cognitive mechanism, we design a
Spatiotemporal Memory Buffer module that caches visual and geometric information from the agent’s
recently visited locations, enabling semantic perception of frontier points. This module maintains a
first-in-first-out (FIFO) observation queue of length <inline-formula><tex-math>T</tex-math></inline-formula>:</p>
        <p><disp-formula id="eq-2"><label>(2)</label><tex-math>\mathcal{M}_t = \{(I_{t-T+1}, D_{t-T+1}, p_{t-T+1}), \dots, (I_t, D_t, p_t)\},</tex-math></disp-formula></p>
        <p>where each entry consists of an RGB image <inline-formula><tex-math>I_t</tex-math></inline-formula>, a depth image <inline-formula><tex-math>D_t</tex-math></inline-formula>, and the agent’s global pose <inline-formula><tex-math>p_t</tex-math></inline-formula> at time
step <inline-formula><tex-math>t</tex-math></inline-formula>. This observation history is temporally continuous and spatially grounded, forming a lightweight
short-term perceptual memory that serves as the input for the subsequent PatchMatcher semantic scoring
module.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. PatchMatcher</title>
        <p>After obtaining the candidate frontier set <inline-formula><tex-math>\mathcal{F} = \{f_i \mid f_i \in \mathbb{R}^2,\ i = 1, 2, \dots, n\}</tex-math></inline-formula>, we use
PatchMatcher, the core component of the exploration module, to assign a task-relevant
semantic value to each frontier point. By comparing the current navigation target with semantic patches
extracted from historical observations, the module estimates the likelihood that each frontier leads
to the target object. PatchMatcher thus implements an object-centric, semantics-driven frontier selection
strategy.</p>
        <sec id="sec-3-4-1">
          <title>3.4.1. Frontier-to-History Association</title>
          <p>The system maintains a history of past observations <inline-formula><tex-math>\mathcal{M}_t</tex-math></inline-formula>. For each frontier point <inline-formula><tex-math>f_i</tex-math></inline-formula>, we first apply
geometric priors to select a subset of historical frames <inline-formula><tex-math>\mathcal{H}_i \subset \mathcal{M}_t</tex-math></inline-formula> that are most likely to contain relevant
information. This filtering process is based on two key geometric constraints.</p>
          <p>1. Visibility constraint: We consider whether the camera’s field of view at a certain past observation
covers the current frontier point. Let the agent’s pose at time <inline-formula><tex-math>j</tex-math></inline-formula> be <inline-formula><tex-math>p_j = (x_j, y_j, \theta_j)</tex-math></inline-formula> and the frontier
point be <inline-formula><tex-math>f_i = (x_i^f, y_i^f)</tex-math></inline-formula>. We define the relative angle between the frontier and the agent as:</p>
          <p><disp-formula id="eq-3"><label>(3)</label><tex-math>\phi_{ij} = \operatorname{arctan2}(y_i^f - y_j,\ x_i^f - x_j)</tex-math></disp-formula></p>
          <p>If this angle falls within the horizontal field of view <inline-formula><tex-math>\alpha</tex-math></inline-formula>, the frontier point is considered potentially
visible:</p>
          <p><disp-formula id="eq-4"><label>(4)</label><tex-math>|\phi_{ij} - \theta_j| \le \frac{\alpha}{2}</tex-math></disp-formula></p>
          <p>2. Distance constraint: Additionally, the Euclidean distance between the frontier point and the
historical observation position must be below a threshold <inline-formula><tex-math>d_{th}</tex-math></inline-formula>:</p>
          <p><disp-formula id="eq-5"><label>(5)</label><tex-math>\| f_i - (x_j, y_j) \|_2 \le d_{th}</tex-math></disp-formula></p>
          <p>To ensure that the retrieved observations are both spatially relevant and reliable, and to reduce the
impact of irrelevant data, we define the subset of historical observations retained for frontier point <inline-formula><tex-math>f_i</tex-math></inline-formula> as:</p>
          <p><disp-formula id="eq-6"><label>(6)</label><tex-math>\mathcal{H}_i = \{ (I_j, D_j, p_j) \in \mathcal{M}_t \mid |\phi_{ij} - \theta_j| \le \tfrac{\alpha}{2},\ \| f_i - (x_j, y_j) \|_2 \le d_{th} \}</tex-math></disp-formula></p>
          <p>This process is described in Algorithm 1.</p>
          <p>Algorithm 1 Associate Frontier Points with Historical Frames</p>
          <preformat>
Input: Frontier points F = {f_1, f_2, ..., f_n}, memory buffer M_t, distance threshold d_th, angle threshold α
Output: Associated frames H_i for each f_i
for each f_i in F do
    Initialize H_i as an empty set
    for each (I_j, D_j, p_j) in M_t do
        φ_ij ← arctan2(y_i^f − y_j, x_i^f − x_j)
        if |φ_ij − θ_j| ≤ α/2 and d(f_i, (x_j, y_j)) ≤ d_th then
            Add (I_j, D_j, p_j) to H_i
        end if
    end for
end for
Return: H_i for each f_i
          </preformat>
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.2. Object Patch Extraction</title>
          <p>We design an Object Patch Extraction module to identify semantically relevant object regions in the
selected historical observations. This module extracts the corresponding image regions as patches
and feeds them into a downstream vision-language model. For each selected historical frame, we use
a frozen object detector (Grounding DINO) to extract potential object regions from the image. Each
detected bounding box is cropped into an individual image patch <inline-formula><tex-math>o_k</tex-math></inline-formula>, forming a patch set that represents
the semantic memory most relevant to the frontier point <inline-formula><tex-math>f_i</tex-math></inline-formula>:</p>
          <p><disp-formula id="eq-7"><label>(7)</label><tex-math>\mathcal{P}(f_i) = \{o_1, \dots, o_K\}</tex-math></disp-formula></p>
          <p>Here, each <inline-formula><tex-math>o_k</tex-math></inline-formula> is an image region that is spatially associated with <inline-formula><tex-math>f_i</tex-math></inline-formula> and potentially semantically
meaningful, to be used for subsequent matching and reasoning.</p>
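          <p>The retrieval step that precedes patch extraction, namely the FIFO memory of Section 3.3 and the frontier-to-history association of Algorithm 1, might look roughly like the sketch below. The class and function names are hypothetical, and wrapping the angle difference to [-pi, pi) is an added assumption that the text does not spell out; the visibility and distance tests otherwise follow the two constraints of Section 3.4.1.</p>
          <preformat>
```python
import math
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, List, Tuple


@dataclass
class Observation:
    rgb: Any                          # RGB image (any array-like)
    depth: Any                        # depth map
    pose: Tuple[float, float, float]  # (x, y, heading in radians)


def make_buffer(length: int = 50) -> Deque[Observation]:
    """FIFO memory: appending to a full deque silently drops the oldest entry."""
    return deque(maxlen=length)


def wrap_angle(angle: float) -> float:
    """Map an angle to [-pi, pi). Assumption: not stated in the text."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def associate(frontier: Tuple[float, float], memory, fov_rad: float,
              d_th: float) -> List[Observation]:
    """Return the buffered observations that pass both geometric checks."""
    fx, fy = frontier
    selected = []
    for obs in memory:
        x, y, theta = obs.pose
        phi = math.atan2(fy - y, fx - x)  # bearing from the agent to the frontier
        visible = fov_rad / 2.0 >= abs(wrap_angle(phi - theta))
        near = d_th >= math.hypot(fx - x, fy - y)
        if visible and near:
            selected.append(obs)
    return selected
```
          </preformat>
          <p>With the settings of Section 4.2, a call would be <monospace>associate(f, memory, math.radians(79.0), 3.0)</monospace> on a buffer created by <monospace>make_buffer(50)</monospace>. The returned frames are the ones passed to the object detector.</p>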
        </sec>
        <sec id="sec-3-4-3">
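          <title>3.4.3. Semantic Frontier</title>
          <p>
            Inspired by VLFM [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ], we adopt the vision-language model BLIP-2 [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] to estimate the
semantic relevance between each frontier point and the target object described in natural language.
BLIP-2 computes a similarity score between an image patch and a text prompt representing the navigation
goal. The core innovation of this paper is patch-level semantic matching based on object memory,
which avoids redundant processing of the whole image. For each patch <inline-formula><tex-math>o_k \in \mathcal{P}(f_i)</tex-math></inline-formula>, we construct a text
prompt <inline-formula><tex-math>\tau</tex-math></inline-formula> that describes the navigation objective in natural language, e.g., "Seems like there is
a &lt;target object&gt; ahead.". We input both the patch and the prompt into BLIP-2 to calculate the cosine
similarity score:
          </p>
          <p><disp-formula id="eq-8"><label>(8)</label><tex-math>s_k = \mathrm{BLIP2}(o_k, \tau)</tex-math></disp-formula></p>
          <p>To obtain the overall semantic score for each frontier point <inline-formula><tex-math>f_i</tex-math></inline-formula>, we average the similarity scores of all
its associated patches:</p>
          <p><disp-formula id="eq-9"><label>(9)</label><tex-math>S(f_i) = \frac{1}{|\mathcal{P}(f_i)|} \sum_{k=1}^{|\mathcal{P}(f_i)|} s_k</tex-math></disp-formula></p>
          <p>where <inline-formula><tex-math>S(f_i)</tex-math></inline-formula> denotes the semantic relevance score of frontier point <inline-formula><tex-math>f_i</tex-math></inline-formula>.</p>
          <p>Finally, we select the frontier point with the highest semantic score as the next exploration target:</p>
          <p><disp-formula id="eq-10"><label>(10)</label><tex-math>f^{*} = \arg\max_{f_i \in \mathcal{F}} S(f_i)</tex-math></disp-formula></p>
          <p>where <inline-formula><tex-math>\mathcal{F}</tex-math></inline-formula> is the set of all currently detected frontier points.</p>
          <p>This method achieves goal-driven semantic exploration by using a Vision-Language Model (VLM) for memory-based
matching. By identifying semantically meaningful visual regions in historical observations, our
approach improves both the efficiency and the accuracy of semantic reasoning, and thereby
the memory-based, object-centric frontier selection process.</p>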
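          <p>The averaging and selection steps can be illustrated with the short sketch below. The <monospace>similarity</monospace> argument is a hypothetical callable standing in for the BLIP-2 image-text similarity (no BLIP-2 API is assumed here), and skipping frontiers that have no associated patches is an assumption of the example, since the text does not state how such frontiers are handled.</p>
          <preformat>
```python
from typing import Callable, Dict, Hashable, Optional, Sequence, Tuple

PROMPT_TEMPLATE = "Seems like there is a {} ahead."  # prompt wording from the text


def frontier_score(patches: Sequence, prompt: str,
                   similarity: Callable[[object, str], float]) -> Optional[float]:
    """Mean patch-to-prompt similarity; None when the frontier has no patches."""
    if not patches:
        return None
    scores = [similarity(patch, prompt) for patch in patches]
    return sum(scores) / len(scores)


def select_frontier(patches_by_frontier: Dict[Hashable, Sequence], target: str,
                    similarity: Callable[[object, str], float]
                    ) -> Tuple[Optional[Hashable], Optional[float]]:
    """Return the frontier with the highest mean score, together with that score."""
    prompt = PROMPT_TEMPLATE.format(target)
    best, best_score = None, None
    for frontier, patches in patches_by_frontier.items():
        score = frontier_score(patches, prompt, similarity)
        if score is None:
            continue  # assumption: frontiers without patches are skipped
        if best_score is None or score > best_score:
            best, best_score = frontier, score
    return best, best_score
```
          </preformat>
          <p>Passing the scorer as a callable keeps the aggregation logic independent of the particular vision-language model, so the same selection code can be exercised with a stub during testing.</p>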
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Waypoint Navigation</title>
        <p>The system uses the fast marching method (FMM) on the occupancy grid map to plan a collision-free
path from the agent’s position to the selected frontier point and guides navigation along it. During navigation, it follows
VLFM’s strategy to detect objects matching the target semantic label g. If such an object is found, the task succeeds and
terminates; otherwise, the system re-evaluates the semantic scores, selects a new goal, and repeats the
process.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>We evaluate our method on the HM3D and Gibson datasets within the Habitat simulator. HM3D
includes 20 scenes with 2000 episodes, and Gibson includes 5 scenes with 1000 episodes. Task settings
follow SemExp. Each episode lasts up to 500 steps and is successful if the agent issues a STOP within
the goal object’s region, as defined in the Habitat ObjectNav challenge.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Implementation details</title>
        <p>Our experimental code is based on the open-source framework VLFM, which we extend with our
own strategies, and is implemented using the PyTorch deep learning framework. We set the size of
the historical frame buffer to <inline-formula><tex-math>T = 50</tex-math></inline-formula> frames. The association distance between frontier points and
historical frames is set to 3 meters, and the field-of-view (FoV) angle for visibility checks is set to 79°.
For object patch extraction, we adopt the Grounding-DINO model.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Metrics</title>
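        <p>For all navigation experiments, we adopt Success Rate (SR) and Success weighted by Path Length
(SPL) as evaluation metrics. SR is defined as the proportion of tasks in which the agent successfully
navigates to the target object (to within 1 meter) and correctly triggers the STOP
action within the step limit (500 steps):</p>
        <p><disp-formula id="eq-11"><label>(11)</label><tex-math>\mathrm{SR} = \frac{1}{N} \sum_{i=1}^{N} S_i</tex-math></disp-formula></p>
        <p>SPL is used to evaluate path efficiency and is defined as:</p>
        <p><disp-formula id="eq-12"><label>(12)</label><tex-math>\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \cdot \frac{l_i}{\max(p_i, l_i)}</tex-math></disp-formula></p>
        <p>where <inline-formula><tex-math>N</tex-math></inline-formula> is the total number of tasks, <inline-formula><tex-math>S_i \in \{0, 1\}</tex-math></inline-formula> indicates whether task <inline-formula><tex-math>i</tex-math></inline-formula> is successful, <inline-formula><tex-math>l_i</tex-math></inline-formula> denotes
the shortest path distance from the start to the target object, and <inline-formula><tex-math>p_i</tex-math></inline-formula> is the actual path length taken by
the agent.</p>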
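        <p>As a concrete illustration of the two metrics, the sketch below computes SR and SPL from per-episode records. The function names are illustrative, and the code assumes a non-empty list of episodes with strictly positive shortest-path lengths.</p>
        <preformat>
```python
from typing import Sequence


def success_rate(successes: Sequence[int]) -> float:
    """SR: fraction of episodes with success indicator 1."""
    return sum(successes) / len(successes)


def spl(successes: Sequence[int], shortest: Sequence[float],
        actual: Sequence[float]) -> float:
    """SPL: success weighted by shortest-path length over max(actual, shortest)."""
    total = 0.0
    for s, l, p in zip(successes, shortest, actual):
        total += s * l / max(p, l)
    return total / len(successes)
```
        </preformat>
        <p>For three episodes with successes (1, 0, 1), shortest paths (5, 4, 10) and actual paths (10, 8, 10), the per-episode terms are 0.5, 0 and 1, so SPL is 0.5 while SR is 2/3: a successful but roundabout episode counts fully towards SR and only partially towards SPL.</p>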
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Baselines</title>
        <p>
          To evaluate the performance of our method, we compare it with the supervised baselines
SemExp, FSE, and PONI, and with the zero-shot methods ESC, L3MVN, and VLFM. All these methods adopt
frontier-based exploration strategies:
• SemExp [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is a classical modular method that constructs an explicit semantic map and combines
it with goal-directed exploration strategies.
• Frontier Semantic Exploration (FSE) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] builds semantic and frontier maps and leverages deep
reinforcement learning strategies to select long-term exploration goals.
• PONI [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] predicts "region potential" and "goal potential" on the boundary of the semantic map,
and learns to infer the likely goal location under environment-agnostic conditions.
• ESC [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] leverages large language models (LLMs) and GLIP-detected objects to reason about the frontiers
most likely to lead to the target and to guide exploration.
• L3MVN-Z [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] constructs semantic maps, forms simple sentences from objects around frontier
regions and target-related entities, and scores them using a BERT model.
• VLFM [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] integrates visual observations and target categories into a VLM to generate a value
map, which is then combined with frontier points for exploration.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Benchmark results</title>
        <p>[Table 1: results on the Gibson and HM3D datasets; only the column header "SPL ↑" survives for each dataset.]</p>
        <p>As shown in Table 1, OAM consistently outperforms all baselines on both the Gibson and HM3D datasets.
Compared with supervised methods (SemExp, FSE, and PONI), OAM improves SR and SPL by up to
10.9% and 12.2% on Gibson and by 0.5% and 7.4% on HM3D. These gains come from leveraging
pre-trained vision-language models instead of task-specific training, which enhances generalization in unseen
environments. Compared with zero-shot methods, OAM surpasses ESC by 15.1% SR and 9.7% SPL
on HM3D. It also outperforms L3MVN-Z by 8.4%/15.6% on Gibson and 3.9%/8.9% on HM3D (SR/SPL),
demonstrating the advantage of using frontier-aligned semantic memory over constructing full semantic
maps. Furthermore, OAM outperforms the second-best method, VLFM, by 1.8% SR and 1.6% SPL on HM3D,
indicating that object-level memory provides richer semantic cues for frontier selection, especially
under occlusions or viewpoint shifts. The relatively lower SR of OAM on HM3D is due to the current
lack of multi-floor navigation support, which limits performance on tasks involving cross-floor goal
localization.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Ablation study</title>
        <p>To understand the importance of each module in our framework, we conducted the following two
ablation experiments on the HM3D dataset:
• w/o memory buffer: We remove the temporal memory buffer module. The agent is unable to
reference historical observations and can only make navigation decisions based on
the current visual input.
• w/o object patches: We remove the object-level semantic aggregation mechanism. The agent
does not extract object patches from the historical frames associated with each frontier, but
instead relies solely on semantic understanding of complete images.</p>
        <p>Table 2 reports the ablation results on the HM3D validation set, evaluating the contributions of two
core components of our OAM framework: the temporal memory buffer and the object-level patch
extraction module. When the memory buffer is removed, the agent can no longer utilize historical
observations and must rely solely on the current frame for semantic reasoning. This leads to a performance
drop in both SR (from 54.3% to 51.2%) and SPL (from 32.0% to 31.3%), demonstrating the importance of
memory in maintaining consistent semantic understanding over time. Similarly, removing the object
patch module, which prevents the model from aggregating fine-grained semantic cues around the
frontier, lowers SR to 52.3% and SPL to 30.9%. This
suggests that patch-based local semantic alignment improves navigation efficiency by guiding the agent
toward more semantically meaningful frontiers. Overall, these results indicate that both modules
contribute to the robustness and efficiency of our zero-shot navigation system.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we present OAM, a novel zero-shot object-goal navigation framework that integrates
object-level spatiotemporal memory with a vision-language model. By retrieving semantically relevant
historical observations around frontier regions, OAM enables goal-driven exploration without requiring
task-specific training. This design overcomes the limitations of single-frame semantic reasoning and
improves both the navigation success rate and efficiency. Experimental results on the
Gibson and HM3D benchmarks validate the effectiveness and generalizability of our approach. In future
work, we aim to extend OAM to support multi-floor navigation and deploy it in real-world scenarios,
further exploring the potential of memory-augmented semantic reasoning in embodied AI.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research was supported by the National Natural Science Foundation of China under Grant
No. 62372049.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>In the preparation of this paper, no generative AI tools were used.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gokaslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kembhavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Maksymets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mottaghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Savva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Toshev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wijmans</surname>
          </string-name>
          ,
          <article-title>ObjectNav revisited: On evaluation of embodied agents navigating to objects</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2006.13171. arXiv:2006.13171.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gandhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <article-title>Object goal navigation using goal-oriented semantic exploration</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/2007.00643. arXiv:2007.00643.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gokaslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wijmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Maksymets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Clegg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Undersander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Galuba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Westbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. X.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Savva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <article-title>Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2109.08238. arXiv:2109.08238.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Long-short term policy for visual object navigation</article-title>
          ,
          <source>in: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>9035</fpage>
          -
          <lpage>9042</lpage>
          . doi:10.1109/IROS55552.2023.10341652.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fukushima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kanezaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sasaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yoshiyasu</surname>
          </string-name>
          ,
          <article-title>Object memory transformer for object goal navigation</article-title>
          ,
          <source>in: 2022 International Conference on Robotics and Automation (ICRA)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>11288</fpage>
          -
          <lpage>11294</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICRA46639.2022.9812027</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Al-Halah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Grauman</surname>
          </string-name>
          ,
          <article-title>PONI: Potential functions for objectgoal navigation with interaction-free learning</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2201.10029. arXiv:
          <pub-id pub-id-type="arxiv">2201.10029</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kasaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>Frontier semantic exploration for visual target navigation</article-title>
          ,
          <source>in: 2023 IEEE International Conference on Robotics and Automation (ICRA)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>4099</fpage>
          -
          <lpage>4105</lpage>
          . URL: http://dx.doi.org/10.1109/ICRA48891.2023.10161059. doi:
          <pub-id pub-id-type="doi">10.1109/ICRA48891.2023.10161059</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2301.12597. arXiv:
          <pub-id pub-id-type="arxiv">2301.12597</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-N.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Grounded language-image pre-training</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2112.03857. arXiv:
          <pub-id pub-id-type="arxiv">2112.03857</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          , et al.,
          <source>GPT-4 technical report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2303.08774. arXiv:
          <pub-id pub-id-type="arxiv">2303.08774</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yamauchi</surname>
          </string-name>
          ,
          <article-title>A frontier-based approach for autonomous exploration</article-title>
          ,
          <source>in: Proceedings 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation CIRA'97. 'Towards New Computational Principles for Robotics and Automation'</source>
          ,
          <year>1997</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>151</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/CIRA.1997.613851</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Gadre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ilharco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>CoWs on pasture: Baselines and benchmarks for language-driven zero-shot object navigation</article-title>
          ,
          <source>in: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>23171</fpage>
          -
          <lpage>23181</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/CVPR52729.2023.02219</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pryor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Getoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. E.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>ESC: Exploration with soft commonsense constraints for zero-shot object navigation</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2301.13166. arXiv:
          <pub-id pub-id-type="arxiv">2301.13166</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kasaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>L3MVN: Leveraging large language models for visual target navigation</article-title>
          ,
          <source>in: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>3554</fpage>
          -
          <lpage>3560</lpage>
          . URL: http://dx.doi.org/10.1109/IROS55552.2023.10342512. doi:
          <pub-id pub-id-type="doi">10.1109/IROS55552.2023.10342512</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Yokoyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bucher</surname>
          </string-name>
          ,
          <article-title>VLFM: Vision-language frontier maps for zero-shot semantic navigation</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2312.03275. arXiv:
          <pub-id pub-id-type="arxiv">2312.03275</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>