<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IRCDL</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AgriMus: Developing Museums in the Metaverse for Agricultural Education</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ali Abdari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alex Falcon</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Serra</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Naples Federico II</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Udine</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>21</volume>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>Learning agricultural practices, such as gardening, maintaining fruit trees, and general farming techniques, has increasingly shifted towards digital platforms, with tutorials on YouTube being a popular resource. As the metaverse expands, immersive experiences are emerging as powerful tools for skill acquisition. This work introduces AgriMus, a search tool designed for metaverse environments, enabling users to discover both videos and interactive experiences tailored to teaching practical skills in agriculture. AgriMus aims to connect users with relevant virtual spaces where they can learn and practice agricultural tasks in a hands-on, engaging way. Initial experiments conducted on 83 exhibitions demonstrate the potential of zero-shot search methods, achieving 27% R@1, 41% MRR, and 52% nDCG@5. The results also highlight the importance of leveraging the hierarchical structure of exhibition data and integrating state-of-the-art vision-language models to improve search performance. The source code and data of this work are available at https://github.com/aliabdari/AgriMus.</p>
      </abstract>
      <kwd-group>
        <kwd>Metaverse</kwd>
        <kwd>Digital Museums</kwd>
        <kwd>Agriculture Education</kwd>
        <kwd>Cross-modal Retrieval</kwd>
        <kwd>Multimedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Nowadays, with the user-generated content uploaded on the Internet increasing dramatically every year,
it is becoming a common practice to acquire new skills by watching tutorials on video sharing platforms
such as YouTube. These tutorial videos span a broad range of different skills, including general life skills
such as cooking, home organization, and DIY crafts; technical skills like coding, graphic design, and
video editing; and practical hands-on activities like gardening, farming, and maintaining fruit trees. For
instance, users can find step-by-step guides on planting and cultivating vegetables, pruning fruit trees
for optimal growth, designing irrigation systems, and even employing modern farming technologies,
e.g., hydroponics or drone-assisted crop monitoring. This vast repository of user-generated content
empowers individuals to learn both everyday and specialized skills at their own pace.</p>
      <p>With the rapid growth of the metaverse, a new dimension of learning and skill acquisition is emerging,
particularly in areas like agriculture. Initiatives such as the Agriscience Metaverse Academy are
already leveraging virtual reality (VR) to provide immersive educational experiences for agriculture
teachers and students, enabling them to explore agriscience concepts without the constraints of physical
resources. Similarly, projects like “Georgia Agriculture in the Metaverse” introduce AI-powered,
game-based learning environments where users can grow crops, manage agricultural businesses, and gain
practical farming skills through interactive simulations. These examples illustrate how the metaverse
is transforming traditional tutorial-based learning into dynamic, hands-on experiences, making skill
development more accessible, engaging, and impactful.</p>
      <p>To take advantage of the strengths of both traditional tutorial videos and immersive metaverse
experiences, we introduce the AgriMus project, the overview of which can be seen in Figure 1. AgriMus
focuses on developing a specialized search tool that empowers users interested in learning agricultural
activities to explore and identify relevant agricultural metaverses. By integrating video content with
interactive virtual experiences, this tool allows users to search for and access metaverse environments
tailored to their specific interests, such as gardening, farming techniques, or advanced agricultural
practices. AgriMus bridges the gap between conventional online tutorials and the growing potential of
the metaverse, offering a comprehensive platform for skill development in agriculture.</p>
      <!-- Figure 1 (not reproduced here): two example user queries (e.g., learning to prune fig trees) are processed by NLP/CV and vision-language retrieval components, which return ranked exhibition topics such as "prune fig trees", "prune lemon trees", and "harvest fig trees". -->
      <p>To demonstrate the feasibility of AgriMus, we collected a dataset specifically designed for
proof-of-concept purposes. The dataset comprises 83 topical exhibitions, each dedicated to a broad agricultural
theme (e.g., pruning fruit trees), with individual rooms focusing on more specific subtopics (e.g., pruning
lemon trees). We conducted experiments in a zero-shot scenario, leveraging the hierarchical structure
of the exhibitions to model the data as envisioned for AgriMus. Our experimental results demonstrate
promising performance, achieving 27% recall at rank 1 (R@1), 66% recall at rank 5 (R@5), and a mean
reciprocal rank (MRR) of 41%. Additionally, we achieved 52% normalized discounted cumulative gain
(nDCG) at rank 5 and 56% recall at rank 10. These results highlight the effectiveness of the hierarchical
approach and validate the potential of AgriMus for enabling efficient exploration and retrieval in
agricultural metaverses.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Digital museums</title>
        <p>
          The emergence of digital museums represents a transformative shift in how cultural heritage is accessed
and experienced, offering unprecedented opportunities for engagement and education [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. With the
advancements in technologies such as high-quality 3D modeling and virtual reality (VR), digital museums
are becoming more popular, and it is now possible for them to host rich and immersive experiences. For
instance, they allow for detailed representations of artifacts and exhibitions [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ], enabling visitors to
explore diverse themes ranging from ancient civilizations to contemporary art [
          <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
          ]. Moreover, unlike
traditional museums, which are constrained by physical space and operating hours, digital museums
can operate continuously, providing access to global audiences at any time.
        </p>
        <p>Thus, digital museums play a vital role in preserving and promoting cultural heritage by making
artifacts and traditions accessible to wider audiences. However, they usually focus their attention on
cultural heritage. Conversely, this work builds on the concept of digital museums by focusing on the
integration of agricultural knowledge and training materials into museum-like exhibits, creating a
unique training avenue for novices and practitioners in agricultural domains, which has not yet been
studied in the literature. The aim is to support the acquisition of new skills by mixing lecture-like
videos and virtual hands-on practice by means of VR experiences.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multimedia-rich 3D scenarios</title>
        <p>
          Recent advancements in vision and language techniques have significantly enhanced the retrieval of 3D
scenes and objects through natural language queries. The integration of dense captioning methods with
RGB-D scans enables the generation of detailed, context-aware descriptions of localized objects within
3D environments [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. These approaches allow users to input natural language queries to retrieve specific
objects or scenes, thereby improving the efficiency and accuracy of retrieval systems. By combining
language and 3D visual data, these techniques facilitate more intuitive interactions between humans
and machines, enabling natural language descriptions to guide the search and discovery of relevant 3D
models or environments.
        </p>
        <p>
          Instead of focusing on single objects, recent research has addressed more complex indoor scene
retrieval using text, involving longer descriptions, as these need to describe many objects and their
positions within the entire scene. Several contributions were made in this direction, including CRISP
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which provides a large collection of 3D indoor scenes and their corresponding textual descriptions,
Farmare [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and Adoctera [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], which focus on learning to search furnished multi-room apartments and
rank them against user queries. More recently, Text2SceneGraphMatcher [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] introduced a method for
aligning open-set text queries with 3D scene graphs to facilitate effective scene retrieval.
        </p>
        <p>
          However, the approaches mentioned above do not consider the possibility that the scenes contain
multimedia content which affects their relevance to the user query. This problem raises additional
challenges as both the global structure and the local components need to be accounted for in the learned
representation in order to fully capture the contents of the scenes and align them to the queries. For
instance, in our previous works we investigated the use of cross-modal approaches to rank 3D scenarios
comprising additional multimedia data in the form of either videos [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] or images [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. AgriMus: An overview of the project</title>
      <p>This section offers an overview of the plans to implement the AgriMus project. These are also presented
graphically in Figure 2. The project will involve three main steps, namely data collection, data modeling,
and the evaluation phase with an emphasis on user studies.</p>
      <sec id="sec-3-1">
        <title>3.1. Collecting the data</title>
        <p>The data collection phase will involve three main ingredients: tutorial videos, experiences, and related
descriptions.</p>
        <p>
          For videos, we will use an automated pipeline to collect relevant tutorial videos from YouTube
by querying for keywords related to agricultural skills, gardening, and DIY projects. Videos with
informative titles will be prioritized to ensure the relevance of the content. The audio tracks of these
videos will be transcribed using Whisper [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], a state-of-the-art speech-to-text model known for its high
accuracy across multiple languages and challenging audio conditions. The resulting transcripts will
serve as a basis for generating detailed textual descriptions. We will use large language models (LLMs)
to process these transcripts, as previously done in recent research [
          <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
          ], and extract key procedural
steps to produce structured descriptions that enhance video indexing and facilitate the search process.
        </p>
        <!-- Figure 2 (not reproduced here): the three project steps (Step 1: data collection; Step 2: hierarchical museum modeling, aligning visual and textual representations at the content, room, and museum levels; Step 3: user study), illustrated with an example museum description. -->
        <p>
          The process of gathering virtual experiences will involve a combination of automated and manual
curation. We will systematically review academic literature to identify virtual agricultural training
environments described in research papers, with particular attention to interactive simulations and
metaverse-based experiences. For instance, Fabrika et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] developed a system for educating users
in thinning practices, which are fundamental for forestry management, while increasingly detailed digital twins of
forests have recently been created using data-driven and procedural approaches, e.g. [
          <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
          ]. Another example is
related to teaching users to detect ripe fruit, e.g. strawberries [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. In addition, publicly available
amateur simulations and virtual environments created by independent developers will be sourced from
online repositories and virtual experience platforms. This dual approach ensures a diverse collection of
virtual experiences, covering both high-fidelity simulations and more accessible, grassroots solutions.
The collected experiences will be cataloged and integrated into the AgriMus platform, enriching the
learning ecosystem with practical, hands-on tools.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Modeling the museums</title>
        <p>
          The exhibitions collected in the previous step are quite rich in content: each exhibition contains multiple
rooms, each containing different videos or experiences. To encode all this information in a way that
is easily searchable and avoids information loss, we will rely on a combination of state-of-the-art
computer vision, natural language processing, and multimedia analysis techniques. Specifically, as
shown in Figure 2, we plan to use hierarchical modeling to leverage the structure of the exhibitions,
roughly divided into content-level (videos or interactive experiences), room-level, and museum-level.
By aligning the visual and textual representations within each level (i.e., a video/experience with its
description, a room with the descriptions of all its contents, and finally the museum with the full
description), it will become easier for the model to learn how to orderly encode them while minimizing
information loss [
          <xref ref-type="bibr" rid="ref20 ref21 ref22">20, 21, 22</xref>
          ].
        </p>
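        <p>As a minimal illustration of the hierarchy described above, the exhibitions can be modeled as nested records (content inside room inside museum). The class and field names below are hypothetical sketches, not part of the AgriMus codebase.</p>

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Content:
    """A single exhibit: a tutorial video or an interactive experience."""
    description: str
    kind: str  # "video" or "experience"


@dataclass
class Room:
    """A room groups contents about one specific subtopic."""
    topic: str
    contents: List[Content] = field(default_factory=list)


@dataclass
class Museum:
    """A museum groups rooms under one broad agricultural theme."""
    theme: str
    rooms: List[Room] = field(default_factory=list)

    def full_description(self) -> str:
        """Museum-level text: the theme followed by each room topic,
        mirroring the museum-level alignment target described above."""
        return ". ".join([self.theme] + [room.topic for room in self.rooms])
```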
        <p>
          For content-level representations, given that both videos and more complex interactive experiences
will be integrated, a mixture of spatial and spatio-temporal models will be used. This will include 2D
Large Vision-Language Models (LVLM) such as CLIP [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] and Mobile-CLIP [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], and spatio-temporal
LVLMs such as LaViLa [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] or InternVideo2 [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. In this way, it will be possible to separately encode both
appearance and motion information, useful to better understand the primary entities of the experiences
(e.g. the tree species) and the actions performed on them.
        </p>
        <p>
          For room-level representation, a naive solution would be to aggregate the content representations
through mean pooling, possibly learning the weight of each. Alternatively, graph networks could
also play an important role in understanding how to aggregate them by capturing relationships and
dependencies between contents, at the cost of more computational resources. These have been previously
used to capture single objects inside rooms (e.g. furniture) and their relationships by using scene graphs
[
          <xref ref-type="bibr" rid="ref27 ref28">27, 28</xref>
          ].
        </p>
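        <p>The naive room-level aggregation mentioned above can be sketched as a weighted mean of content vectors, where uniform weights reduce to plain mean pooling. This is an illustrative stand-in (in a trained system, the weights would be learned), not the project's implementation.</p>

```python
from typing import List, Optional


def weighted_mean(vectors: List[List[float]],
                  weights: Optional[List[float]] = None) -> List[float]:
    """Aggregate content vectors into one room vector.

    With no weights given, every content counts equally (mean pooling);
    otherwise, each vector contributes proportionally to its weight.
    """
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dims = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) / total
            for d in range(dims)]
```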
        <p>
          Finally, for the museum-level representation, different types of aggregation could be used depending
on the constraints to be imposed on the exhibition itself. Generally, learning a weighted mean of the
room representations could suffice, as the information coming from each room would have its weight
defined on the content without, for instance, any constraint on the visit order. However, it is common
for exhibitions to have a predefined visit order, usually decided by the exhibition curator. Therefore,
exploring sequential models (e.g., standard recurrent networks such as LSTM and GRU, or the more
recent xLSTM [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] and minGRU [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]) for the aggregation of the rooms could play an important role in
how to encode their content into the museum representation. As in the previous case, graph neural
networks could also be used to capture neighbor relations between rooms and assess the relevance of
each.
        </p>
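        <p>To illustrate how an order-aware recurrence differs from a plain mean, the sketch below applies a minGRU-style update, in which the gate depends only on the current input. The scalar gate bias and the identity candidate map are illustrative placeholders for the learned linear layers of the cited models, so this is a conceptual sketch, not their implementation.</p>

```python
import math


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def aggregate_rooms_sequentially(room_vecs, gate_bias=0.0):
    """Order-aware museum vector via a minGRU-style recurrence:
    z_t = sigmoid(gate_bias + mean(x_t)); h_t = (1 - z_t) * h_prev + z_t * x_t.
    Unlike mean pooling, the result depends on the visit order of the rooms.
    """
    h = [0.0] * len(room_vecs[0])
    for x in room_vecs:
        z = _sigmoid(gate_bias + sum(x) / len(x))  # gate from input only
        h = [(1.0 - z) * hi + z * xi for hi, xi in zip(h, x)]
    return h
```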
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Searching through the museums</title>
        <p>Once the representations for the museums are computed, they can be searched using similarity-based
approaches. Here, two methodologies can be followed.</p>
        <p>As content-level representations involve LVLMs, processing the user queries through the same
techniques means that the query representation falls into the same latent space, hence enabling
training-free search. However, this would imply either that the museums are modeled without relying on the
hierarchy or that the aggregation functions are not trained (e.g., mean or max pooling). Although
both cases are likely to lead to poorer performance compared to a solution using trained components,
they enable effective solutions even in scarce data scenarios. In Section 4, we show some early results
obtained using this methodology.</p>
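        <p>Training-free search amounts to ranking museum vectors by cosine similarity to the query vector in the shared embedding space. The sketch below assumes the vectors have already been produced by an LVLM; the function names are illustrative.</p>

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_museums(query_vec, museum_vecs):
    """Museum ids sorted by decreasing similarity to the query."""
    return sorted(museum_vecs,
                  key=lambda mid: cosine(query_vec, museum_vecs[mid]),
                  reverse=True)
```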
        <p>
          In general, user queries may also be long and articulated, describing specific scenarios and thus
requiring more advanced query processing. While large vision-language models (LVLMs) are typically
trained with simple captions—often composed of primary entities and a few additional descriptive
words (e.g., half of the captions in LAION-2B are less than 50 characters long [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ])—there are LVLMs
trained to handle more complex query scenarios. An example is represented by LaCLIP [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], which uses
Large Language Models to rewrite the original captions paired with the training images. This suggests
that the zero-shot approach should also work for longer queries, although it is generally unlikely to
perform similarly to a model trained specifically for the task at hand. In particular, training the proposed
method using the vision and language data collected in the previous step allows the models to become
more tailored to the task, potentially preserving more details in the encoding.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Early experimental results</title>
      <p>As a proof-of-concept for the AgriMus project, we collected a dataset of exhibitions for educational
purposes in the agriculture domain. The details of the dataset are provided in Section 4.1, whereas early
experimental results are reported in Section 4.2.</p>
      <sec id="sec-4-1">
        <title>4.1. Collected data</title>
        <p>
          As mentioned above, a staple of the AgriMus project will be the availability of museums, or exhibitions,
about important topics for education in the field of agriculture, which we will collect ourselves because no such
resource is currently available. For an early prototype of the proposed AgriMus project, we created a set of 83
topical museums, each focusing on a branch of topics relevant to agricultural education, e.g. tutorials
on pruning trees. Then, each room focuses on more specific topics, e.g. how to prune lemon trees. On
average, there are 4.6 rooms per museum, with about 11.2 videos per museum. To achieve this, we first
collected a total of 288 relevant videos from the HowTo100M dataset [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. The main topics distilled
from the videos range from teaching the user the best practices for growing a tomato plant at home to
watering indoor plants or pruning outdoor trees. The topics are extracted using KeyBert [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] by looking
for representative bigrams in the video title. Examples of topics include keywords for actions such as
“sow” and “prune”, for entities such as “rose” and “garden”, and also for some technical approaches
such as “hydroponic”. As we looked for bigrams, these are typically grouped in pairs, e.g. “rid” with
“weed”. In total, we extracted 213 topics. Most of them are bound to only one museum (about 80) or two
museums (about 100), and only seven are repeated in four or five museums (Figure 4). The videos, with
lengths spanning from 38 seconds to 31 minutes, are then “grouped” to form viable candidate pools
for the museum rooms. Specifically, we first selected part of the bigrams (e.g. “growing”) to decide a
topic for the museum, and then built the rooms based on the second part (e.g. “tomatoes” for one room,
and “potatoes” for another room).
        </p>
        <!-- Figure 3 (not reproduced here): the zero-shot pipeline, with 1) LVLM frame-level representations aggregated into 2) video-level, 3) room-level, and 4) museum-level representations. -->
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Zero-shot search method</title>
        <p>As the collected dataset is small, experiments that involve training the neural network outlined in
Section 3 would be infeasible. Therefore, we designed a zero-shot methodology based on the discussion
in Section 3.3. An overview of the zero-shot methodology is illustrated in Figure 3. It is made of four
main steps.</p>
        <p>
          First, in each video within the room, 150 frames are uniformly sampled and resized to (H, W), then
processed through a spatial LVLM. In the experiments in the following sections, three LVLMs are
considered: CLIP [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], Mobile-CLIP [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], and BLIP [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ]. H and W are set to 224 for CLIP and BLIP,
whereas 256 is used for Mobile-CLIP.
        </p>
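        <p>The uniform sampling of 150 frames can be sketched as picking evenly spaced frame indices over the video. The helper below is illustrative, not the project's exact preprocessing code; repeating indices for videos shorter than 150 frames is an assumed convention.</p>

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 150):
    """Evenly spaced frame indices in [0, total_frames - 1].

    For videos shorter than `num_samples` frames, indices repeat,
    which is a common convention (assumed here).
    """
    if num_samples == 1:
        return [0]
    return [int(i * (total_frames - 1) / (num_samples - 1))
            for i in range(num_samples)]
```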
        <p>
          The frame representations are then aggregated by a function <italic>a</italic><sub>frames</sub>, implemented in the experiments as mean,
maximum, or median pooling. Although the mean pooling of frame vectors is quite typical to obtain
a rough representation of the video [
          <xref ref-type="bibr" rid="ref36 ref37">36, 37</xref>
          ], maximum pooling is another way to aggregate frames
by looking at spikes in the features (e.g., as often done when reducing the spatial dimensions in deep
convolutional networks such as ResNet). However, to avoid overemphasizing spurious spikes, which
can happen with max pooling, and to avoid diluting meaningful features with mean pooling, which
happens especially when the videos are long, median pooling can be a viable candidate as it focuses on
the middle value in a region, improving its robustness to extreme values [38].
        </p>
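        <p>The three candidate frame-aggregation functions (mean, maximum, and median pooling) can be sketched as element-wise reductions over the frame vectors; this is a minimal stand-in for the actual feature pipeline.</p>

```python
from statistics import median


def pool(vectors, mode="mean"):
    """Element-wise aggregation of equal-length frame vectors.

    mean keeps information from every frame, max keeps feature spikes,
    and median is robust to extreme values.
    """
    reducers = {
        "mean": lambda xs: sum(xs) / len(xs),
        "max": max,
        "median": median,
    }
    reducer = reducers[mode]
    return [reducer(dim) for dim in zip(*vectors)]
```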
        <p>For the room-level representation, the function <italic>a</italic><sub>videos</sub> is used. As in the previous case, mean,
maximum, and median pooling can be used to implement such a function. Although it can be argued
that mean and median are more reasonable, as the videos in the room follow the same topic, there
are nuances which could be more important to retain. This is the case of many tutorial videos which
are longer than the average because they explain how to perform more than one task at once, for
instance, showing how to plow, sow, and water a crop. Therefore, maximum pooling is also a
viable candidate for <italic>a</italic><sub>videos</sub>.</p>
        <p>Finally, for museum-level aggregation we rely on mean pooling to implement <italic>a</italic><sub>rooms</sub>, so that each
room has the same weight in the final encoded representation.</p>
        <p>Since we leverage LVLMs to process the visual information, the queries are also processed and
encoded through the same models without any additional training. This is because their embedding
space is learned by jointly training the visual encoder and aligning it to the textual encoder, so that both
output a similar representation for aligned inputs (e.g., an image and its textual description). In our
setting, the test queries are the bigrams corresponding to the 213 topics extracted from the video
titles. To perform the search, the queries are first tokenized and encoded through the textual encoder of
the LVLM, and then cosine similarity is used to rank the museum representations created by <italic>a</italic><sub>rooms</sub>.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation metrics</title>
        <p>To assess the performance of the system, a relevance score was computed for each exhibition given a
query <italic>q</italic>. The score for museum <italic>m</italic> is a real value computed by summing 1.0 for each room in <italic>m</italic> that
has <italic>q</italic> as one of its topics, and 0.1 for each video in other rooms which has <italic>q</italic> as one of its topics. For
instance, suppose the query is “rid weed” and a museum has two rooms, one with topics “rid weed” and “start
hydroponic”, and the other with “rid rose”. The second room contains four videos, two of
which have “rid weed” among their topics (note that one video may have several topics extracted from it).
Then, the relevance score of <italic>m</italic> to <italic>q</italic> is 1.2, as 1.0 is summed for the first room, and 0.2 is summed for the
two relevant videos in the second room. When computing the recall rates and the median rank, the
relevant museums are those for which the relevance score is the highest in the ranking list for that
query.</p>
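        <p>The scoring rule above can be written directly as code. The museum structure (a list of rooms, each with its topic list and per-video topic lists) is a hypothetical representation chosen for illustration.</p>

```python
def relevance_score(museum, query):
    """Sum 1.0 for each room whose topics contain the query, and 0.1
    for each video with the query among its topics in the other rooms."""
    score = 0.0
    for room in museum:
        if query in room["topics"]:
            score += 1.0
        else:
            score += 0.1 * sum(query in video_topics
                               for video_topics in room["videos"])
    return score
```

        <p>On the worked example in the text (one matching room, plus two matching videos in a non-matching room), this returns 1.2.</p>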
        <p>The performance evaluation is done using four main metrics: Recall at rank k (R@k), Median
rank (MedR), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain at rank
k (nDCG@k). R@k measures the proportion of relevant museums found within the top k retrieved items.
MedR represents the median rank position of the first relevant item across queries. MRR evaluates
the rank position of the first relevant item, averaging the reciprocal of the rank across queries. nDCG
assesses the quality of the ranking list, with higher-ranked relevant items contributing more to the
score, rewarding systems that prioritize important results. For all metrics apart from the median rank, a
higher value indicates better performance.</p>
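        <p>MRR and nDCG@k, the two less standard metrics, can be computed as follows. This is a generic sketch using the graded relevance scores as gains, not the paper's evaluation script.</p>

```python
import math


def mean_reciprocal_rank(rankings, relevant_sets):
    """Average over queries of 1/rank of the first relevant item (0 if none)."""
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for rank, item in enumerate(ranking, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


def ndcg_at_k(ranking, gains, k):
    """DCG of the top-k list divided by the DCG of the ideal ordering."""
    dcg = sum(gains.get(item, 0.0) / math.log2(rank + 1)
              for rank, item in enumerate(ranking[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1)
               for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```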
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Which aggregation style is best?</title>
        <p>As mentioned, there are several reasons supporting the use of mean, maximum, or median pooling to
implement the functions <italic>a</italic><sub>frames</sub>, <italic>a</italic><sub>videos</sub>, and <italic>a</italic><sub>rooms</sub> in the zero-shot search method explored in this
paper. Here, we explore several combinations of these functions and assess their performance on the
dataset collected. The results are reported in Table 1.</p>
        <p>First, aggregating the frames using mean pooling leads to the best R@1 and MRR whether using mean
(23.94% R@1 and 39.09% MRR), median (19.71% and 36.11%), or maximum pooling (20.65% and 31.57%)
to aggregate the videos, compared to using median or max at the frame level. In particular, the difference in performance
with max pooling is large compared to median pooling. On the one hand, this shows that preserving
some information from all the frames, although noisily, is effective in this scenario. On the other hand,
it confirms that maximum pooling becomes too sensitive to spurious spikes and possibly loses
sight of the general content of the video, leading to the worst results, e.g. 7.51% R@1 and 18.50% MRR
in the case of (max, mean, mean).</p>
        <p>Second, using mean pooling for all three functions, i.e. the row represented by (mean, mean, mean),
leads to 23.94% R@1 and 39.09% MRR, whereas all the other combinations have less than 20% R@1
and 37% MRR. It also achieves 51.77% nDCG@5 and 55.34% nDCG@10, which ranks second in our
experiments as (median, mean, mean) achieves 52.74% nDCG@5 and 55.96% nDCG@10. This indicates
a higher chance to retrieve a relevant museum in the first rank than other combinations and a good
quality of the proposed ranking lists, hence representing a good candidate for the proposed zero-shot
method. Therefore, in the following experiments we used (mean, mean, mean).</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Which feature extractor is best?</title>
        <p>
          In the previous experiment, using mean pooling for all three aggregation functions atop CLIP frame
features led to the best results. Here, we explore how other LVLMs affect the final performance of our
zero-shot search method. Specifically, we test Mobile-CLIP [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] and BLIP [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ], and combinations of two to
three LVLMs by concatenating the frame features. The results are reported in Table 2.
        </p>
        <p>First, using Mobile-CLIP led to an increase in performance compared to CLIP, for instance from
23.94% R@1 and 39.09% MRR to 27.23% and 41.33%.</p>
        <p>Second, combining the information extracted by the LVLMs does not lead to better results. Specifically,
among the two-model combinations the best results are obtained by CLIP+Mobile-CLIP, which achieves
22.06% R@1 and 38.44% MRR, still short of Mobile-CLIP on its own (27.23% and 41.33%). Although
putting all the models together leads to slightly better nDCG than Mobile-CLIP (e.g. 54.60% nDCG@5
compared to 52.55%), the increased computational and storage costs do not justify this solution.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Is a hierarchical approach better than a flat one?</title>
        <p>For the future of the AgriMus project, we hypothesized that leveraging the hierarchical nature of
museums is fundamental to correctly model them, both when training the components and when
performing zero-shot search. Here, we validate this hypothesis by aggregating all the videos in the
museum directly at the video level, neglecting the room separation. The LVLM is set to
Mobile-CLIP and the aggregation functions to mean pooling, as this combination performed best in the
previous experiments. The results are reported in Table 3.</p>
        <p>The main result is a confirmation of the hypothesis, as leveraging the hierarchy leads to 27.23%
R@1, 41.33% MRR, 52.55% nDCG@5, and 56.57% nDCG@10, whereas in the other ablations, the best
results are 26.29% R@1, 40.34% MRR, 52.50% nDCG@5, and 56.48% nDCG@10. Although the use of
maximum pooling leads to significantly worse results, mean pooling at the video level leads to results
comparable to the proposed method under several metrics, especially those looking beyond the
first rank (R@5 and R@10, nDCG@5 and nDCG@10). Nonetheless, we hypothesize that training the aggregation
functions will lead to considerably better performance, as that would allow better preservation of the
temporal information in the videos and improve the encoding capabilities for the videos and the rooms.</p>
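The gap between the flat and hierarchical variants can be seen directly with mean pooling: averaging room means weights rooms equally, whereas flat pooling weights videos equally, so the two coincide only when every room holds the same number of videos. A minimal illustration with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
room_a = rng.normal(size=(3, 512))  # room with 3 video-level vectors
room_b = rng.normal(size=(1, 512))  # room with 1 video-level vector

# Flat: every video contributes 1/4 of the museum vector.
flat = np.concatenate([room_a, room_b]).mean(axis=0)
# Hierarchical: each room contributes 1/2, regardless of its size.
hierarchical = np.stack([room_a.mean(axis=0),
                         room_b.mean(axis=0)]).mean(axis=0)

print(np.allclose(flat, hierarchical))  # → False: the room sizes differ
```

When rooms contain the same number of videos, mean-of-means and the flat mean are identical, which is consistent with the two variants performing comparably under several metrics here.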
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion/limitations/future work</title>
      <p>In this section, we highlight the limitations of our current approach and outline directions for future
work.</p>
      <p>As the current implementation of AgriMus relies on a zero-shot search method, we employed simple
aggregation operations to combine the representations of frames, videos, and rooms. While this
approach is straightforward and computationally efficient, it is well known that such operations are not
optimal, and for instance they tend to lose temporal information in videos [39, 40]. In future iterations,
once we have collected a sufficient amount of data, we plan to experiment with neural sequential models
and learned aggregation functions. These should enhance the system’s ability to recognize temporal
patterns, leading to better video representation and improved search accuracy. Training on larger
datasets will not only improve content recognition but also facilitate a deeper usage of the hierarchical
structure present in the exhibitions, contributing to more precise search results.</p>
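As an illustration of what a learned aggregation function could look like, the sketch below implements a simple attention-weighted pooling over frames. It is purely hypothetical and not an architecture proposed in this paper; the scoring vector `w` stands in for trainable parameters.

```python
import numpy as np

def attention_pool(frame_feats, w):
    """Score each frame with a learnable vector `w`, then take a
    softmax-weighted average of the frame features. Unlike plain mean
    pooling, frames contribute unequally, so a trained `w` can emphasise
    the informative moments of a tutorial video."""
    scores = frame_feats @ w                  # one scalar score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over frames
    return weights @ frame_feats              # weighted average, shape (dim,)

frames = np.random.randn(16, 512)             # 16 frame features of dim 512
w = np.random.randn(512) * 0.02               # untrained scoring vector
print(attention_pool(frames, w).shape)        # → (512,)
```

With near-zero `w` the weights are almost uniform and the result approaches mean pooling; training `w` (and richer sequential models) is what would let the system exploit temporal structure.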
      <p>Another challenge is the inherent diversity and complexity of topics related to agriculture, gardening,
and related fields. These domains encompass a wide range of subfields, each requiring specific expertise
and datasets. To develop a robust and comprehensive system useful to both practitioners and novices, it
is essential to collect a larger and more diverse set of videos. For example, there are currently no videos
covering certain tree species, such as cedar trees. Interestingly, increasing the scope of the dataset could
also facilitate the creation of more specialized virtual museums. For instance, an exhibition might focus
specifically on “lemon trees”, with rooms dedicated to different stages of growth and care (e.g., planting,
watering, pruning, harvesting). Alternatively, broader topics like “growing vegetables indoors” could
be broken down into rooms focusing on various crops, such as tomatoes, potatoes, and zucchini. This
structured, hierarchical approach will enhance the learning experience by organizing content logically
and progressively.</p>
      <p>In addition to expanding the video dataset, future efforts will focus on incorporating virtual
experiences that allow users to practice within the metaverse. By complementing tutorial videos with
interactive, immersive environments, users can engage more deeply with the content, reinforcing their
learning through hands-on experiences. Such experiences will be particularly valuable for tasks that
require manual skills, such as pruning or grafting, as they enable users to practice techniques in a
simulated environment. User studies also need to be conducted to assess the comprehensiveness of the
exhibitions and their educational effectiveness.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>With the growth of the internet and user-generated content, video tutorials have become essential
tools for supporting educational efforts across various domains, teaching viewers best practices for
growing vegetables at home, pruning fruit trees, and other practical agricultural skills. As the metaverse
continues to evolve, these video tutorials can be complemented by interactive and immersive experiences,
enhancing the learning process by providing hands-on practice opportunities.</p>
      <p>To realize this vision, we introduced the AgriMus project, which focuses on developing digital
exhibitions aimed at educating both novices and practitioners in a broad range of topics related to
agriculture and gardening. AgriMus aims to build a search tool that allows users to explore these virtual
museums, enabling them to watch tutorial videos to learn best practices and then engage in interactive
experiences to practice and consolidate their skills within the metaverse.</p>
      <p>As an initial step, we collected a dataset of 83 exhibitions, each consisting of multiple topical rooms
enriched with video content. We conducted zero-shot experiments, achieving 27.23% R@1, 75.58% R@10,
41.33% MRR, and 52.55% nDCG@5 on a test set of 213 queries. Our experimental results demonstrated
that leveraging the hierarchical structure of the data improves performance. In addition, they validated
design choices for our scenario: mean pooling proved to be the most effective aggregation method, and
Mobile-CLIP outperformed other models in feature extraction from video frames.</p>
      <p>Looking ahead, several steps remain to fully realize the AgriMus project. We plan to expand the
dataset by incorporating more videos to capture greater diversity across agricultural topics. Furthermore,
integrating temporal information will enhance video content representation, improving search accuracy
and museum organization. Lastly, conducting user evaluations will be crucial to refining the system
and ensuring its effectiveness in real-world scenarios.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by the PRIN 2022 “MUSMA” - CUP G53D23002930006 - “Funded by EU
Next-Generation EU – M4 C2 I1.1”, and by the Department Strategic Plan (PSD) of the University of
Udine–Interdepartmental Project on Artificial Intelligence (2020-25).
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] 3D-Ace,
          <article-title>What is a virtual museum: Benefits, types and creation process</article-title>
          ,
          <year>2022</year>
          . URL: https://3d-ace.com/blog/virtual-museum/, accessed: 2024-12-23.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kiourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koutsoudis</surname>
          </string-name>
          , G. Pavlidis,
          <article-title>Dynamus: A fully dynamic 3d virtual museum framework</article-title>
          ,
          <source>Journal of Cultural Heritage</source>
          <volume>22</volume>
          (
          <year>2016</year>
          )
          <fpage>984</fpage>
          -
          <lpage>991</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Zidianakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Partarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ntoa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kopidaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ntagianta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ntafotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xhako</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Pervolarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kontaki</surname>
          </string-name>
          , et al.,
          <article-title>The invisible museum: A user-centric platform for creating virtual 3d exhibitions with vr support</article-title>
          ,
          <source>Electronics</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>363</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Barszcz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dziedzic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Skublewska-Paszkowska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Powroznik</surname>
          </string-name>
          ,
          <article-title>3d scanning digital models for virtual museums</article-title>
          ,
          <source>Computer Animation and Virtual Worlds</source>
          <volume>34</volume>
          (
          <year>2023</year>
          )
          <fpage>e2154</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Merella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Farina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Scaglia</surname>
          </string-name>
          , G. Caneve,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bernardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Collareta</surname>
          </string-name>
          , G. Bianucci,
          <article-title>Structured-light 3d scanning as a tool for creating a digital collection of modern and fossil cetacean skeletons (natural history museum</article-title>
          , university of pisa),
          <source>Heritage</source>
          <volume>6</volume>
          (
          <year>2023</year>
          )
          <fpage>6762</fpage>
          -
          <lpage>6776</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gholami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nießner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. X.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Scan2cap: Context-aware dense captioning in rgb-d scans</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>3193</fpage>
          -
          <lpage>3203</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Okumura</surname>
          </string-name>
          ,
          <article-title>Towards cross-modal point cloud retrieval for indoor scenes</article-title>
          , in: International Conference on Multimedia Modeling, Springer,
          <year>2024</year>
          , pp.
          <fpage>89</fpage>
          -
          <lpage>102</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Falcon</surname>
          </string-name>
          , G. Serra,
          <article-title>Farmare: a furniture-aware multi-task methodology for recommending apartments based on the user interests</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>4293</fpage>
          -
          <lpage>4303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Falcon</surname>
          </string-name>
          , G. Serra, Adoctera:
          <article-title>Adaptive optimization constraints for improved textguided retrieval of apartments</article-title>
          ,
          <source>in: Proceedings of the 2024 International Conference on Multimedia Retrieval</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>1043</fpage>
          -
          <lpage>1050</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Barath</surname>
          </string-name>
          , I. Armeni,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pollefeys</surname>
          </string-name>
          , H. Blum, “
          <article-title>where am i?” scene retrieval with language</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>220</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Falcon</surname>
          </string-name>
          , G. Serra,
          <article-title>Metaverse retrieval: Finding the best metaverse environment via language</article-title>
          ,
          <source>in: Proceedings of the 1st International Workshop on Deep Multimodal Learning for Information Retrieval</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Falcon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Serra</surname>
          </string-name>
          ,
          <article-title>A language-based solution to enable metaverse retrieval</article-title>
          , in: International Conference on Multimedia Modeling, Springer,
          <year>2024</year>
          , pp.
          <fpage>477</fpage>
          -
          <lpage>488</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          , T. Xu,
          <string-name>
            <given-names>G.</given-names>
            <surname>Brockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McLeavey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Robust speech recognition via large-scale weak supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>28492</fpage>
          -
          <lpage>28518</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Allauzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Sainath</surname>
          </string-name>
          ,
          <article-title>Multilingual and fully non-autoregressive asr with large language model fusion: A comprehensive study</article-title>
          ,
          <source>in: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>13306</fpage>
          -
          <lpage>13310</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ma</surname>
          </string-name>
          , M. Qian,
          <string-name>
            <given-names>P.</given-names>
            <surname>Manakul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Knill</surname>
          </string-name>
          ,
          <article-title>Can generative large language models perform asr error correction?</article-title>
          ,
          <source>arXiv preprint arXiv:2307.04172</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fabrika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Valent</surname>
          </string-name>
          , L. Scheer,
          <article-title>Thinning trainer based on forest-growth model, virtual reality and computer-aided virtual environment</article-title>
          ,
          <source>Environmental modelling &amp; software 100</source>
          (
          <year>2018</year>
          )
          <fpage>11</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Badr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Hsiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rundel</surname>
          </string-name>
          , R. de Amicis,
          <article-title>Leveraging data-driven and procedural methods for generating high-fidelity visualizations of real forests</article-title>
          ,
          <source>Environmental Modelling &amp; Software</source>
          <volume>172</volume>
          (
          <year>2024</year>
          )
          <fpage>105899</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <article-title>Forest digital twin: A new tool for forest management practices based on spatio-temporal data, 3d simulation engine, and intelligent interactive environment</article-title>
          ,
          <source>Computers and Electronics in Agriculture</source>
          <volume>215</volume>
          (
          <year>2023</year>
          )
          <fpage>108416</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>O'Sullivan</surname>
          </string-name>
          ,
          <article-title>Real-time detection of strawberry ripeness using augmented reality and deep learning</article-title>
          ,
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          )
          <fpage>7639</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ashutosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girdhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torresani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Grauman</surname>
          </string-name>
          ,
          <article-title>Hiervl: Learning hierarchical video-language embeddings</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>23066</fpage>
          -
          <lpage>23078</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kallidromitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kozuka</surname>
          </string-name>
          , T. Darrell,
          <article-title>Hierarchical open-vocabulary universal image segmentation</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ye</surname>
          </string-name>
          , G. Xu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          , Hitea:
          <article-title>Hierarchical temporal-aware videolanguage pre-training</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>15405</fpage>
          -
          <lpage>15416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>P. K. A.</given-names>
            <surname>Vasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pouransari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Faghri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vemulapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tuzel</surname>
          </string-name>
          ,
          <article-title>Mobileclip: Fast image-text models through multi-modal reinforced training</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>15963</fpage>
          -
          <lpage>15974</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Krähenbühl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girdhar</surname>
          </string-name>
          ,
          <article-title>Learning video representations from large language models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>6586</fpage>
          -
          <lpage>6597</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          , et al.,
          <article-title>Internvideo2: Scaling foundation models for multimodal video understanding</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer,
          <year>2025</year>
          , pp.
          <fpage>396</fpage>
          -
          <lpage>416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-K.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Guibas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Scenehgn: Hierarchical graph networks for 3d indoor scene generation with fine-grained geometry</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>45</volume>
          (
          <year>2023</year>
          )
          <fpage>8902</fpage>
          -
          <lpage>8919</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Navab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tombari</surname>
          </string-name>
          ,
          <article-title>Learning 3d semantic scene graphs with instance embeddings</article-title>
          ,
          <source>International Journal of Computer Vision</source>
          <volume>130</volume>
          (
          <year>2022</year>
          )
          <fpage>630</fpage>
          -
          <lpage>651</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Pöppel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spanring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Prudnikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Klambauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brandstetter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          ,
          <article-title>xlstm: Extended long short-term memory</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>L.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. O.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajimirsadegh</surname>
          </string-name>
          ,
          <article-title>Were rnns all we needed?</article-title>
          ,
          <source>arXiv preprint arXiv:2410.01201</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beaumont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vencu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cherti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Coombes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mullis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          , et al.,
          <article-title>Laion-5b: An open large-scale dataset for training next generation image-text models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>25278</fpage>
          -
          <lpage>25294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Katabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <article-title>Improving clip training with language rewrites</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>A.</given-names>
            <surname>Miech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Alayrac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tapaswi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Laptev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          ,
          <article-title>Howto100m: Learning a text-video embedding by watching hundred million narrated video clips</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2630</fpage>
          -
          <lpage>2640</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grootendorst</surname>
          </string-name>
          ,
          <article-title>Keybert: Minimal keyword extraction with bert</article-title>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5281/zenodo.4461265. doi:10.5281/zenodo.4461265.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>12888</fpage>
          -
          <lpage>12900</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nagrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Frozen in time: A joint video and image encoder for end-to-end retrieval</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF international conference on computer vision</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>