AgriMus: Developing Museums in the Metaverse for Agricultural Education

Ali Abdari 1,2,*, Alex Falcon 1 and Giuseppe Serra 1
1 University of Udine, Italy
2 University of Naples Federico II, Italy

Abstract
Learning agricultural practices—such as gardening, maintaining fruit trees, and general farming techniques—has increasingly shifted towards digital platforms, with tutorials on YouTube being a popular resource. As the metaverse expands, immersive experiences are emerging as powerful tools for skill acquisition. This work introduces AgriMus, a search tool designed for metaverse environments, enabling users to discover both videos and interactive experiences tailored to teaching practical skills in agriculture. AgriMus aims to connect users with relevant virtual spaces where they can learn and practice agricultural tasks in a hands-on, engaging way. Initial experiments conducted on 83 exhibitions demonstrate the potential of zero-shot search methods, achieving 27% R@1, 41% MRR, and 52% nDCG@5. The results also highlight the importance of leveraging the hierarchical structure of exhibition data and integrating state-of-the-art vision-language models to improve search performance. The source code and data of this work are available at https://github.com/aliabdari/AgriMus.

Keywords
Metaverse, Digital Museums, Agriculture Education, Cross-modal Retrieval, Multimedia

1. Introduction

Nowadays, with the amount of user-generated content uploaded on the Internet increasing dramatically every year, it is becoming common practice to acquire new skills by watching tutorials on video-sharing platforms such as YouTube. These tutorial videos span a broad range of skills, including general life skills such as cooking, home organization, and DIY crafts; technical skills like coding, graphic design, and video editing; and practical hands-on activities like gardening, farming, and maintaining fruit trees.
For instance, users can find step-by-step guides on planting and cultivating vegetables, pruning fruit trees for optimal growth, designing irrigation systems, and even employing modern farming technologies, e.g., hydroponics or drone-assisted crop monitoring. This vast repository of user-generated content empowers individuals to learn both everyday and specialized skills at their own pace.

With the rapid growth of the metaverse, a new dimension of learning and skill acquisition is emerging, particularly in areas like agriculture. Initiatives such as the Agriscience Metaverse Academy are already leveraging virtual reality (VR) to provide immersive educational experiences for agriculture teachers and students, enabling them to explore agriscience concepts without the constraints of physical resources. Similarly, projects like “Georgia Agriculture in the Metaverse” introduce AI-powered, game-based learning environments where users can grow crops, manage agricultural businesses, and gain practical farming skills through interactive simulations. These examples illustrate how the metaverse is transforming traditional tutorial-based learning into dynamic, hands-on experiences, making skill development more accessible, engaging, and impactful.

To take advantage of the strengths of both traditional tutorial videos and immersive metaverse experiences, we introduce the AgriMus project, an overview of which can be seen in Figure 1. AgriMus focuses on developing a specialized search tool that empowers users interested in learning agricultural activities to explore and identify relevant agricultural metaverses. By integrating video content with interactive virtual experiences, this tool allows users to search for and access metaverse environments tailored to their specific interests, such as gardening, farming techniques, or advanced agricultural practices. AgriMus bridges the gap between conventional online tutorials and the growing potential of the metaverse, offering a comprehensive platform for skill development in agriculture.

IRCDL 2025: 21st Conference on Information and Research Science Connecting to Digital and Library Science, February 20–21, 2025, Udine, Italy
* Corresponding author.
abdari.ali@spes.uniud.it (A. Abdari); falcon.alex@spes.uniud.it (A. Falcon); giuseppe.serra@uniud.it (G. Serra)
ORCID: 0000-0002-4482-0479 (A. Abdari); 0000-0002-6325-9066 (A. Falcon); 0000-0002-4269-4501 (G. Serra)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Figure 1: Given the user query, formulated in natural language, the method first processes it with natural language processing (NLP), then combines computer vision (CV) techniques and multimodal analysis (V+L) to process the metaverses available in the database. It then recommends (RS) a ranked list of the relevant metaverses. The two cases show possible results. (a) A metaverse focusing on a specific tree (fig), with rooms dedicated to different aspects of it. (b) A metaverse focusing on an action (pruning), with rooms dedicated to applying it in diverse agriculture scenarios.

To demonstrate the feasibility of AgriMus, we collected a dataset specifically designed for proof-of-concept purposes. The dataset comprises 83 topical exhibitions, each dedicated to a broad agricultural theme (e.g., pruning fruit trees), with individual rooms focusing on more specific subtopics (e.g., pruning lemon trees). We conducted experiments in a zero-shot scenario, leveraging the hierarchical structure of the exhibitions to model the data as envisioned for AgriMus.
Our experimental results demonstrate promising performance, achieving 27% recall at rank 1 (R@1), 56% recall at rank 5 (R@5), and a mean reciprocal rank (MRR) of 41%. Additionally, we achieved 52% normalized discounted cumulative gain at rank 5 (nDCG@5) and 76% recall at rank 10. These results highlight the effectiveness of the hierarchical approach and validate the potential of AgriMus for enabling efficient exploration and retrieval in agricultural metaverses.

2. Related work

2.1. Digital museums

The emergence of digital museums represents a transformative shift in how cultural heritage is accessed and experienced, offering unprecedented opportunities for engagement and education [1]. With advancements in technologies such as high-quality 3D modeling and virtual reality (VR), digital museums are becoming more popular and can host rich and immersive experiences. For instance, they allow for detailed representations of artifacts and exhibitions [2, 3], enabling visitors to explore diverse themes ranging from ancient civilizations to contemporary art [4, 5]. Moreover, unlike traditional museums, which are constrained by physical space and operating hours, digital museums can operate continuously, providing access to global audiences at any time. Thus, digital museums play a vital role in preserving and promoting cultural heritage by making artifacts and traditions accessible to wider audiences. However, they usually focus on cultural heritage. Conversely, this work builds on the concept of digital museums by integrating agricultural knowledge and training materials into museum-like exhibits, creating a unique training avenue for novices and practitioners in agricultural domains, which has not been studied in the research literature so far. The aim is to support the acquisition of new skills by mixing lecture-like videos and virtual hands-on practice by means of VR experiences.

2.2.
Multimedia-rich 3D scenarios

Recent advancements in vision and language techniques have significantly enhanced the retrieval of 3D scenes and objects through natural language queries. The integration of dense captioning methods with RGB-D scans enables the generation of detailed, context-aware descriptions of localized objects within 3D environments [6]. These approaches allow users to input natural language queries to retrieve specific objects or scenes, thereby improving the efficiency and accuracy of retrieval systems. By combining language and 3D visual data, these techniques facilitate more intuitive interactions between humans and machines, enabling natural language descriptions to guide the search and discovery of relevant 3D models or environments.

Moving beyond single objects, recent research has addressed more complex indoor scene retrieval from text, which involves longer descriptions, as these need to cover many objects and their positions within the entire scene. Several contributions were made in this direction, including CRISP [7], which provides a large collection of 3D indoor scenes and their corresponding textual descriptions, and Farmare [8] and Adoctera [9], which focus on learning to search furnished multi-room apartments and rank them against user queries. More recently, Text2SceneGraphMatcher [10] introduced a method for aligning open-set text queries with 3D scene graphs to facilitate effective scene retrieval.

However, the approaches mentioned above do not consider scenes that contain multimedia content affecting their relevance to the user query. This setting raises additional challenges, as both the global structure and the local components need to be accounted for in the learned representation in order to fully capture the contents of the scenes and align them to the queries.
For instance, in our previous works we investigated the use of cross-modal approaches to rank 3D scenarios comprising additional multimedia data in the form of either videos [11] or images [12].

3. AgriMus: An overview of the project

This section offers an overview of the plans to implement the AgriMus project. These are also presented graphically in Figure 2. The project will involve three main steps, namely data collection, data modeling, and an evaluation phase with an emphasis on user studies.

Figure 2: An overview of the AgriMus project. It consists of three main steps. Step 1 is about collecting the required data, comprising topical 3D exhibitions adorned with educational videos and experiences in fields related to agriculture. Step 2 introduces a hierarchical methodology for aligning the visual contents to the textual ones, and also for modeling the exhibitions, with the aim of garnering information about the single experiences or videos, how these form the contents of a room with a specific topic (e.g., how to prune a specific type of tree), and finally how the rooms capture a more comprehensive view on it (e.g., pruning that type of tree, and also growing, harvesting, etc.). Step 3 will involve user studies to better understand user needs and the effectiveness of the proposed methodology in capturing them.

3.1. Collecting the data

The data collection phase will involve three main ingredients: tutorial videos, experiences, and related descriptions. For videos, we will use an automated pipeline to collect relevant tutorial videos from YouTube by querying for keywords related to agricultural skills, gardening, and DIY projects. Videos with informative titles will be prioritized to ensure the relevance of the content. The audio tracks of these videos will be transcribed using Whisper [13], a state-of-the-art speech-to-text model known for its high accuracy across multiple languages and challenging audio conditions. The resulting transcripts will serve as a basis for generating detailed textual descriptions. We will use large language models (LLMs) to process these transcripts, as previously done in recent research [14, 15], and extract key procedural steps to produce structured descriptions that enhance video indexing and facilitate the search process.

The process of gathering virtual experiences will involve a combination of automated and manual curation. We will systematically review academic literature to identify virtual agricultural training environments described in research papers, with particular attention to interactive simulations and metaverse-based experiences. For instance, Fabrika et al. [16] developed a system for educating users in thinning practices, fundamental for forestry management, whereas increasingly realistic digital twins of forests have recently been created using data-driven and procedural approaches, e.g., [17, 18]. Another example is related to teaching users to detect ripe fruit, e.g., strawberries [19]. In addition, publicly available amateur simulations and virtual environments created by independent developers will be sourced from online repositories and virtual experience platforms.
This dual approach ensures a diverse collection of virtual experiences, covering both high-fidelity simulations and more accessible, grassroots solutions. The collected experiences will be cataloged and integrated into the AgriMus platform, enriching the learning ecosystem with practical, hands-on tools.

3.2. Modeling the museums

The exhibitions collected in the previous step are quite rich in content: each exhibition contains multiple rooms, each containing different videos or experiences. To encode all this information in a way that is easily searchable and avoids information loss, we will rely on a combination of state-of-the-art computer vision, natural language processing, and multimedia analysis techniques. Specifically, as shown in Figure 2, we plan to use hierarchical modeling to leverage the structure of the exhibitions, roughly divided into content level (videos or interactive experiences), room level, and museum level. By aligning the visual and textual representations within each level (i.e., a video/experience with its description, a room with the descriptions of all its contents, and finally the museum with the full description), the model can more easily learn to encode them in an orderly way while minimizing information loss [20, 21, 22].

For content-level representations, given that both videos and more complex interactive experiences will be integrated, a mixture of spatial and spatio-temporal models will be used. This will include 2D Large Vision-Language Models (LVLMs) such as CLIP [23] and Mobile-CLIP [24], and spatio-temporal LVLMs such as LaViLa [25] or InternVideo2 [26]. In this way, it will be possible to separately encode both appearance and motion information, useful to better understand the primary entities of the experiences (e.g., the tree species) and the actions performed on them.
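As a rough illustration of this content-level step, the following sketch encodes sampled frames with a stand-in image encoder and aggregates them into a single video representation. This is a minimal numpy sketch: `encode_image` is a hypothetical placeholder for a real LVLM image encoder such as CLIP's, and the embeddings are random values, not real features.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 512  # a typical CLIP-like embedding size


def encode_image(frame: np.ndarray) -> np.ndarray:
    """Placeholder for an LVLM image encoder: returns a unit-norm embedding."""
    e = rng.standard_normal(EMB_DIM)
    return e / np.linalg.norm(e)


# A video as a list of sampled RGB frames (height x width x channels).
frames = [rng.random((224, 224, 3)) for _ in range(8)]

# Content-level representation: encode each frame, then mean-pool.
frame_embs = np.stack([encode_image(f) for f in frames])  # (8, EMB_DIM)
video_emb = frame_embs.mean(axis=0)
video_emb /= np.linalg.norm(video_emb)  # renormalize for cosine similarity
```

A spatio-temporal LVLM would instead consume the frame sequence jointly, which is why the plan mixes both model families.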
For the room-level representation, a naive solution would be to aggregate the content representations through mean pooling, possibly learning a weight for each. Alternatively, graph networks could play an important role in understanding how to aggregate them by capturing relationships and dependencies between contents, at the cost of more computational resources. These have previously been used to capture single objects inside rooms (e.g., furniture) and their relationships by means of scene graphs [27, 28].

Finally, for the museum-level representation, different types of aggregation could be used depending on the constraints to be imposed on the exhibition itself. Generally, learning a weighted mean of the room representations could suffice, as the information coming from each room would have its weight defined by the content without, for instance, any constraint on the visit order. However, it is common for exhibitions to have a predefined visit order, usually defined by the exhibition curator. Therefore, exploring sequential models (e.g., standard recurrent networks such as LSTM and GRU, or the more recent xLSTM [29] and minGRU [30]) for the aggregation of the rooms could play an important role in encoding their content into the museum representation. As in the previous case, graph neural networks could also be used to capture neighbor relations between rooms and assess the relevance of each.

3.3. Searching through the museums

Once the representations for the museums are computed, they can be searched using similarity-based approaches. Here, two methodologies can be followed.

As content-level representations involve LVLMs, processing the user queries through the same techniques means that the query representation falls into the same latent space, hence enabling training-free search. However, this would imply either that the museums are modeled without relying on the hierarchy, or that the aggregation functions are not trained (e.g., mean or max pooling).
Although both cases are likely to lead to poorer performance compared to a solution using trained components, they enable effective solutions even in scarce-data scenarios. In Section 4, we show some early results obtained using this methodology.

In general, user queries may also be long and articulated, describing specific scenarios and thus requiring more advanced query processing. While large vision-language models (LVLMs) are typically trained with simple captions, often composed of primary entities and a few additional descriptive words (e.g., half of the captions in LAION-2B are less than 50 characters long [31]), there are LVLMs trained to handle more complex query scenarios. One example is LaCLIP [32], which uses Large Language Models to rewrite the original captions paired with the training images. This suggests that the zero-shot approach should also work for longer queries, although it is generally unlikely to perform on par with a model trained specifically for the task at hand. In particular, training the proposed method on the vision and language data collected in the previous step allows the models to become more tailored to the task, potentially preserving more details in the encoding.

4. Early experimental results

As a proof of concept for the AgriMus project, we collected a dataset of exhibitions for educational purposes in the agriculture domain. The details of the dataset are provided in Section 4.1, whereas early experimental results are reported in Section 4.2.

4.1. Collected data

As mentioned above, a staple of the AgriMus project will be the availability of museums, or exhibitions, about important topics for education in the field of agriculture, which we will collect ourselves, as no such collection is currently available.
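The training-free, similarity-based search outlined in Section 3.3, which the prototype described below implements, can be sketched in a few lines. This is an illustrative numpy sketch: the embeddings are random placeholders standing in for LVLM outputs, with one row per museum; on unit-norm vectors, cosine similarity reduces to a dot product.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder museum-level representations (the output of the rooms
# aggregation) and a query embedding from the same LVLM text encoder,
# so that both live in one shared latent space. Values are random.
museum_embs = rng.standard_normal((83, 512))
museum_embs /= np.linalg.norm(museum_embs, axis=1, keepdims=True)

query_emb = rng.standard_normal(512)
query_emb /= np.linalg.norm(query_emb)

# On unit-norm vectors, cosine similarity is a plain dot product.
scores = museum_embs @ query_emb
ranking = np.argsort(-scores)  # museum indices, most similar first
top5 = ranking[:5]             # top-5 museums returned to the user
```

Because no component is trained, swapping in a different LVLM only changes how `museum_embs` and `query_emb` are produced; the ranking step stays identical.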
Figure 3: An overview of the prototype for zero-shot understanding of the exhibition contents used in this paper. Starting from the full museum, it highlights one of the rooms (in green) and two of the videos contained in it (yellow and purple). 1) The frames of the videos are processed using a Large Vision-Language Model (LVLM). 2) The frame representations are then aggregated using the function a_frames. 3) The videos are then aggregated using a_videos to capture the contents of the room. 4) Finally, a_rooms aggregates the rooms' contents to capture the full exhibition. This final representation is then used to rank the exhibitions against the representation of the user query.

For an early prototype of the proposed AgriMus project, we created a set of 83 topical museums, each focusing on a branch of topics relevant to agricultural education, e.g., tutorials on pruning trees. Each room then focuses on a more specific topic, e.g., how to prune lemon trees. On average, there are 4.6 rooms per museum, with about 11.2 videos per museum. To achieve this, we first collected a total of 288 relevant videos from the HowTo100M dataset [33]. The main topics distilled from the videos range from teaching the user the best practices for growing a tomato plant at home to watering indoor plants or pruning outdoor trees. The topics are extracted using KeyBert [34], looking for representative bigrams in the video titles. Examples of topics include keywords for actions such as “sow” and “prune”, for entities such as “rose” and “garden”, and also for some technical approaches such as “hydroponic”. As we looked for bigrams, these are typically grouped in pairs, e.g., “rid” with “weed”. In total, we extracted 213 topics.
Most of the topics are bound to only one museum (about 80) or two museums (about 100), and only seven are repeated in four or five museums (Figure 4). The videos, whose lengths span from 38 seconds to 31 minutes, are then grouped to form viable candidate pools for the museum rooms. Specifically, we first selected part of the bigrams (e.g., “growing”) to decide a topic for the museum, and then built the rooms based on the second part (e.g., “tomatoes” for one room, and “potatoes” for another room).

Figure 4: Statistics of the dataset collected. (a) shows the repetitions per topic, illustrating that most of the topics have been used in one or two museums, while a few topics have been used in four or five museums. (b) shows some of the topics which have been presented in three or more museums.

4.2. Zero-shot search method

As the collected dataset is small, experiments that involve training the neural networks outlined in Section 3 would be unfeasible. Therefore, we designed a zero-shot methodology based on the discussion in Section 3.3. An overview of the zero-shot methodology is illustrated in Figure 3. It consists of four main steps. First, for each video within a room, 150 frames are uniformly sampled and resized to (H, W), then processed through a spatial LVLM. In the following experiments, three LVLMs are considered: CLIP [23], Mobile-CLIP [24], and BLIP [35]. H and W are set to 224 for CLIP and BLIP, whereas 256 is used for Mobile-CLIP. The frame representations are then aggregated by a_frames, implemented in the experiments as mean, maximum, or median pooling. Although mean pooling of frame vectors is quite typical to obtain a rough representation of the video [36, 37], maximum pooling is another way to aggregate frames by looking at spikes in the features (e.g., as often done when reducing the spatial dimensions in deep convolutional networks such as ResNet).
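The candidate implementations of the aggregation functions can be written directly as reductions over the stacked frame features. The following is an illustrative numpy sketch with random placeholder features; the 150-frame count matches the sampling described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder frame features: 150 uniformly sampled frames, each encoded
# by a spatial LVLM into a 512-dimensional vector (values are random here).
frame_embs = rng.standard_normal((150, 512))

# The three pooling choices considered for a_frames (the same options
# apply to a_videos at the room level): each reduces the frame axis,
# leaving a single 512-dimensional video representation.
mean_pool = frame_embs.mean(axis=0)
max_pool = frame_embs.max(axis=0)
median_pool = np.median(frame_embs, axis=0)

assert mean_pool.shape == max_pool.shape == median_pool.shape == (512,)
```

Mean pooling blends all frames, max pooling keeps only per-dimension peaks, and median pooling keeps the central value per dimension, which is what makes it robust to outlier frames.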
However, to avoid overemphasizing spurious spikes, which can happen with max pooling, and to avoid diluting meaningful features with mean pooling, which happens especially when the videos are long, median pooling can be a viable candidate, as it focuses on the middle value and is thus more robust to extreme values [38].

For the room-level representation, the function a_videos is used. As in the previous case, mean, maximum, and median pooling can be used to implement this function. Although it can be argued that mean and median are more reasonable, as the videos in a room follow the same topic, there are nuances that may be important to retain. This is the case for many tutorial videos that are longer than average because they explain how to perform more than one task, for instance showing how to plow, sow, and water a crop. Therefore, maximum pooling is also a viable candidate for a_videos. Finally, for the museum-level aggregation we rely on mean pooling to implement a_rooms, so that each room has the same weight in the final encoded representation.

Since we leverage LVLMs to process the visual information, the queries are also processed and encoded through the same models without any additional training. This is possible because their embedding space is learned by jointly training the visual encoder and aligning it to the textual encoder, so that both output a similar representation for aligned inputs (e.g., an image and its textual description). In our setting, the test queries are the bigrams corresponding to the 213 topics extracted from the video titles. To perform the search, the queries are first tokenized and encoded through the textual encoder of the LVLM, and then cosine similarity is used to rank the museum representations created by a_rooms.

4.3. Evaluation metrics

To assess the performance of the system, a relevance score was computed for each exhibition given a query q.
The score for museum m is a real value computed by summing 1.0 for each room in m that has q as one of its topics, and 0.1 for each video in other rooms that has q as one of its topics. For instance, suppose the query is “rid weed” and a museum has two rooms, one with topics “rid weed” and “start hydroponic”, and the other with “rid rose”. The second room contains four videos, two of which have “rid weed” among their topics (note that one video may have more than one topic extracted from it). Then, the relevance score of m for q is 1.2, as 1.0 is summed for the first room, and 0.2 is summed for the two relevant videos in the second room. When computing the recall rates and the median rank, the relevant museums are those for which the relevance score is the highest in the ranking list for that query.

Table 1
We investigate different aggregation styles for the functions a_frames, a_videos, and a_rooms. CLIP is used as the LVLM to process and encode the video frames. Discussion in Section 4.4.

Frames   Videos   Rooms   R@1    R@5    R@10   MedR   MRR    nDCG@5   nDCG@10
Mean     Mean     Mean    23.94  53.05  70.89  5      39.09  51.77    55.34
Median   Mean     Mean    20.18  53.52  70.42  5      36.85  52.74    55.96
Max      Mean     Mean    7.51   27.23  45.53  12     18.50  44.63    52.83
Mean     Median   Mean    19.71  50.70  69.48  5      36.11  50.35    54.09
Median   Median   Mean    19.24  50.70  69.95  5      35.09  51.12    54.74
Max      Median   Mean    10.32  26.76  38.02  19     19.18  43.21    49.59
Mean     Max      Mean    20.65  41.31  57.74  7      31.57  48.94    54.28
Median   Max      Mean    18.77  40.84  53.99  9      30.11  49.41    54.37
Max      Max      Mean    11.73  39.98  45.53  12     22.35  49.17    58.15

The performance evaluation is done using four main metrics: Recall at rank k (R@k), Median rank (MedR), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain at rank k (nDCG@k). R@k measures the proportion of relevant museums found within the top k retrieved items. MedR represents the median rank position of the first relevant item across queries.
MRR evaluates the rank position of the first relevant item, averaging the reciprocal of the rank across queries. nDCG assesses the quality of the ranking list, with higher-ranked relevant items contributing more to the score, rewarding systems that prioritize important results. For all metrics apart from the median rank, higher values indicate better performance.

4.4. Which aggregation style is best?

As mentioned, there are several reasons supporting the use of mean, maximum, or median pooling to implement the functions a_frames, a_videos, and a_rooms in the zero-shot search method explored in this paper. Here, we explore several combinations of these functions and assess their performance on the collected dataset. The results are reported in Table 1.

First, aggregating the frames with mean pooling leads to the best R@1 and MRR whether the videos are aggregated with mean (23.94% R@1 and 39.09% MRR), median (19.71% and 36.11%), or maximum pooling (20.65% and 31.57%), compared to using median or max for the frames. In particular, the performance gap is ample for max pooling and smaller for median pooling. On the one hand, this shows that preserving some information from all the frames, although noisily, is effective in this scenario. On the other hand, it confirms that maximum pooling is too sensitive to spurious spikes and can lose sight of the general content of the video, leading to the worst results, e.g., 7.51% R@1 and 18.50% MRR in the case of (max, mean, mean).

Second, using mean pooling for all three functions, i.e., the row represented by (mean, mean, mean), leads to 23.94% R@1 and 39.09% MRR, whereas the other combinations remain below 21% R@1 and 37% MRR. It also achieves 51.77% nDCG@5 and 55.34% nDCG@10, which ranks second in our experiments, as (median, mean, mean) achieves 52.74% nDCG@5 and 55.96% nDCG@10.
This indicates a higher chance of retrieving a relevant museum at the first rank compared to the other combinations, together with a good quality of the proposed ranking lists, hence representing a good candidate for the proposed zero-shot method. Therefore, in the following experiments we used (mean, mean, mean).

4.5. Which feature extractor is best?

In the previous experiment, using mean pooling for all three aggregation functions atop CLIP frame features led to the best results. Here, we explore how other LVLMs affect the final performance of our zero-shot search method. Specifically, we test Mobile-CLIP [24] and BLIP [35], as well as combinations of two or three LVLMs obtained by concatenating the frame features. The results are reported in Table 2.

Table 2
We investigate different LVLMs and their combinations to extract the frame-level features. The aggregation functions are set to mean pooling. Discussion in Section 4.5.

Feature extractor    R@1    R@5    R@10   MedR   MRR    nDCG@5   nDCG@10
CLIP                 23.94  53.05  70.89  5      39.09  51.77    55.34
BLIP                 0.46   3.28   12.20  44     4.99   44.38    50.31
Mobile-CLIP          27.23  56.33  75.58  4      41.33  52.55    56.57
CLIP+BLIP            20.65  47.88  68.07  6      34.76  54.59    58.85
CLIP+MCLIP           22.06  53.52  71.83  5      38.44  51.59    55.10
BLIP+MCLIP           7.98   17.84  29.10  26     15.04  48.54    53.59
CLIP+BLIP+MCLIP      20.65  48.82  68.05  6      34.99  54.60    58.56

Table 3
We validate the assumption that, even when performing zero-shot search, leveraging the hierarchical nature of the data is useful. Mobile-CLIP is used as the LVLM for frame feature extraction, and the aggregation functions are set to mean pooling for our approach. Discussion in Section 4.6.
Approach                                  R@1    R@5    R@10   MedR   MRR    nDCG@5   nDCG@10
Hierarchical (ours)                       27.23  56.33  75.58  4      41.33  52.55    56.57
Video-level (mean frames, mean videos)    26.29  55.39  75.58  4      40.34  52.50    56.48
Video-level (max frames, mean videos)     11.73  29.57  38.49  16     21.20  49.06    55.92
Video-level (max frames, max videos)      8.92   22.06  30.51  21     17.08  43.80    49.35

First, using Mobile-CLIP leads to an increase in performance compared to CLIP, for instance from 23.94% R@1 and 39.09% MRR to 27.23% and 41.33%. Second, combining the information extracted by the LVLMs does not lead to better results. Specifically, with two models the best results are obtained by CLIP+Mobile-CLIP, but they still fall short of Mobile-CLIP on its own: their combination obtains 22.06% R@1 and 38.44% MRR, lower than the 27.23% and 41.33% obtained by Mobile-CLIP. Although putting all the models together leads to slightly better nDCG than Mobile-CLIP (e.g., 54.60% nDCG@5 compared to 52.55%), the increased computational and storage costs do not make this solution preferable.

4.6. Is a hierarchical approach better than a flat one?

For the future of the AgriMus project, we hypothesized that leveraging the hierarchical nature of museums is fundamental to correctly model them, both when training the components and when performing zero-shot search. Here, we validate this hypothesis by aggregating all the videos in the museum directly at the video level, neglecting the room separation. The LVLM is set to Mobile-CLIP and the aggregation functions to mean pooling, as this combination performed best in the previous experiments. The results are reported in Table 3. The main result is a confirmation of the hypothesis: leveraging the hierarchy leads to 27.23% R@1, 41.33% MRR, 52.55% nDCG@5, and 56.57% nDCG@10, whereas among the flat ablations the best results are 26.29% R@1, 40.34% MRR, 52.50% nDCG@5, and 56.48% nDCG@10.
Although the use of maximum pooling leads to significantly worse results, mean pooling at the video level yields results comparable to the proposed method under several metrics, especially those looking beyond the first rank (R@5 and R@10, nDCG@5 and nDCG@10). Nonetheless, we hypothesize that training the aggregation functions will lead to considerably better performance, as that would allow better preservation of the temporal information in the videos and improve the encoding capabilities for both videos and rooms.

5. Discussion/limitations/future work

In this section, we highlight the limitations of our current approach and outline directions for future work. As the current implementation of AgriMus relies on a zero-shot search method, we employed simple aggregation operations to combine the representations of frames, videos, and rooms. While this approach is straightforward and computationally efficient, such operations are known to be suboptimal: for instance, they tend to lose the temporal information in videos [39, 40]. In future iterations, once we have collected a sufficient amount of data, we plan to experiment with neural sequential models and learned aggregation functions. These should enhance the system's ability to recognize temporal patterns, leading to better video representations and improved search accuracy. Training on larger datasets will not only improve content recognition but also enable a deeper use of the hierarchical structure present in the exhibitions, contributing to more precise search results. Another challenge is the inherent diversity and complexity of topics related to agriculture, gardening, and related fields. These domains encompass a wide range of subfields, each requiring specific expertise and datasets. To develop a robust and comprehensive system useful to both practitioners and novices, it is essential to collect a larger and more diverse set of videos.
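As one concrete direction, a learned aggregation function could replace mean pooling with a lightweight attention over frames. The sketch below is illustrative and not part of the current AgriMus implementation: the scoring vector `w` is randomly initialized here, whereas in a full system it would be trained end-to-end, e.g. with a contrastive retrieval loss.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-d score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(frames, w):
    # `frames`: (n_frames, dim) features; `w`: (dim,) learnable scoring vector.
    # Each frame receives a weight derived from its score, so informative
    # frames can dominate instead of being averaged away.
    weights = softmax(frames @ w)   # (n_frames,)
    return weights @ frames         # weighted sum -> (dim,)

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 8))
w = rng.normal(size=8) * 0.1
video_emb = attention_pool(frames, w)
print(video_emb.shape)
```

With `w` set to zeros the weights become uniform and the operation reduces exactly to mean pooling, so the learned variant strictly generalizes the zero-shot baseline.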
For example, there are currently no videos covering certain tree species, such as cedar trees. Interestingly, increasing the scope of the dataset could also facilitate the creation of more specialized virtual museums. For instance, an exhibition might focus specifically on "lemon trees", with rooms dedicated to different stages of growth and care (e.g., planting, watering, pruning, harvesting). Alternatively, broader topics like "growing vegetables indoors" could be broken down into rooms focusing on various crops, such as tomatoes, potatoes, and zucchini. This structured, hierarchical approach will enhance the learning experience by organizing content logically and progressively. In addition to expanding the video dataset, future efforts will focus on incorporating virtual experiences that allow users to practice within the metaverse. By complementing tutorial videos with interactive, immersive environments, users can engage more deeply with the content, reinforcing their learning through hands-on experiences. Such experiences will be particularly valuable for tasks that require manual skills, such as pruning or grafting, as they enable users to practice techniques in a simulated environment. User studies will also need to be conducted to assess the comprehensiveness of the exhibitions and their educational effectiveness.

6. Conclusions

With the growth of the Internet and user-generated content, video tutorials have become essential tools for supporting educational efforts across various domains, teaching watchers best practices to grow vegetables at home, prune fruit trees, and acquire other practical agricultural skills. As the metaverse continues to evolve, these video tutorials can be complemented by interactive and immersive experiences, enhancing the learning process by providing hands-on practice opportunities.
To realize this vision, we introduced the AgriMus project, which focuses on developing digital exhibitions aimed at educating both novices and practitioners in a broad range of topics related to agriculture and gardening. AgriMus aims to build a search tool that allows users to explore these virtual museums, enabling them to watch tutorial videos to learn best practices and then engage in interactive experiences to practice and consolidate their skills within the metaverse. As an initial step, we collected a dataset of 83 exhibitions, each consisting of multiple topical rooms enriched with video content. We conducted zero-shot experiments, achieving 27.23% R@1, 75.58% R@10, 41.33% MRR, and 52.55% nDCG@5 on a test set of 213 queries. Our experimental results demonstrated that leveraging the hierarchical structure of the data improves performance. In addition, they validated the design choices for our scenario: mean pooling proved to be the most effective aggregation method, and Mobile-CLIP outperformed the other models in extracting features from video frames. Looking ahead, several steps remain to fully realize the AgriMus project. We plan to expand the dataset by incorporating more videos to capture greater diversity across agricultural topics. Furthermore, integrating temporal information will enhance video content representation, improving search accuracy and museum organization. Lastly, conducting user evaluations will be crucial to refining the system and ensuring its effectiveness in real-world scenarios.

Acknowledgments

This work was supported by the PRIN 2022 "MUSMA" - CUP G53D23002930006 - "Funded by EU - Next-Generation EU – M4 C2 I1.1", and by the Department Strategic Plan (PSD) of the University of Udine – Interdepartmental Project on Artificial Intelligence (2020-25).

References

[1] 3D-Ace, What is a virtual museum: Benefits, types and creation process, 2022. URL: https://3d-ace.com/blog/virtual-museum/, accessed: 2024-12-23.
[2] C. Kiourt, A. Koutsoudis, G.
Pavlidis, DynaMus: A fully dynamic 3D virtual museum framework, Journal of Cultural Heritage 22 (2016) 984–991.
[3] E. Zidianakis, N. Partarakis, S. Ntoa, A. Dimopoulos, S. Kopidaki, A. Ntagianta, E. Ntafotis, A. Xhako, Z. Pervolarakis, E. Kontaki, et al., The invisible museum: A user-centric platform for creating virtual 3D exhibitions with VR support, Electronics 10 (2021) 363.
[4] M. Barszcz, K. Dziedzic, M. Skublewska-Paszkowska, P. Powroznik, 3D scanning digital models for virtual museums, Computer Animation and Virtual Worlds 34 (2023) e2154.
[5] M. Merella, S. Farina, P. Scaglia, G. Caneve, G. Bernardini, A. Pieri, A. Collareta, G. Bianucci, Structured-light 3D scanning as a tool for creating a digital collection of modern and fossil cetacean skeletons (Natural History Museum, University of Pisa), Heritage 6 (2023) 6762–6776.
[6] Z. Chen, A. Gholami, M. Nießner, A. X. Chang, Scan2Cap: Context-aware dense captioning in RGB-D scans, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 3193–3203.
[7] F. Yu, Z. Wang, D. Li, P. Zhu, X. Liang, X. Wang, M. Okumura, Towards cross-modal point cloud retrieval for indoor scenes, in: International Conference on Multimedia Modeling, Springer, 2024, pp. 89–102.
[8] A. Abdari, A. Falcon, G. Serra, FArMARe: A furniture-aware multi-task methodology for recommending apartments based on the user interests, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4293–4303.
[9] A. Abdari, A. Falcon, G. Serra, Adoctera: Adaptive optimization constraints for improved text-guided retrieval of apartments, in: Proceedings of the 2024 International Conference on Multimedia Retrieval, 2024, pp. 1043–1050.
[10] J. Chen, D. Barath, I. Armeni, M. Pollefeys, H. Blum, "Where am I?" Scene retrieval with language, in: European Conference on Computer Vision, Springer, 2025, pp. 201–220.
[11] A. Abdari, A. Falcon, G. Serra, Metaverse retrieval: Finding the best metaverse environment via language, in: Proceedings of the 1st International Workshop on Deep Multimodal Learning for Information Retrieval, 2023, pp. 1–9.
[12] A. Abdari, A. Falcon, G. Serra, A language-based solution to enable metaverse retrieval, in: International Conference on Multimedia Modeling, Springer, 2024, pp. 477–488.
[13] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, I. Sutskever, Robust speech recognition via large-scale weak supervision, in: International Conference on Machine Learning, PMLR, 2023, pp. 28492–28518.
[14] W. R. Huang, C. Allauzen, T. Chen, K. Gupta, K. Hu, J. Qin, Y. Zhang, Y. Wang, S.-Y. Chang, T. N. Sainath, Multilingual and fully non-autoregressive ASR with large language model fusion: A comprehensive study, in: ICASSP 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2024, pp. 13306–13310.
[15] R. Ma, M. Qian, P. Manakul, M. Gales, K. Knill, Can generative large language models perform ASR error correction?, arXiv preprint arXiv:2307.04172 (2023).
[16] M. Fabrika, P. Valent, L. Scheer, Thinning trainer based on forest-growth model, virtual reality and computer-aided virtual environment, Environmental Modelling & Software 100 (2018) 11–23.
[17] A. S. Badr, D. D. Hsiao, S. Rundel, R. de Amicis, Leveraging data-driven and procedural methods for generating high-fidelity visualizations of real forests, Environmental Modelling & Software 172 (2024) 105899.
[18] H. Qiu, H. Zhang, K. Lei, H. Zhang, X. Hu, Forest digital twin: A new tool for forest management practices based on spatio-temporal data, 3D simulation engine, and intelligent interactive environment, Computers and Electronics in Agriculture 215 (2023) 108416.
[19] J. J. Chai, J.-L. Xu, C. O'Sullivan, Real-time detection of strawberry ripeness using augmented reality and deep learning, Sensors 23 (2023) 7639.
[20] K. Ashutosh, R. Girdhar, L. Torresani, K. Grauman, HierVL: Learning hierarchical video-language embeddings, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 23066–23078.
[21] X. Wang, S. Li, K. Kallidromitis, Y. Kato, K. Kozuka, T. Darrell, Hierarchical open-vocabulary universal image segmentation, Advances in Neural Information Processing Systems 36 (2024).
[22] Q. Ye, G. Xu, M. Yan, H. Xu, Q. Qian, J. Zhang, F. Huang, HiTeA: Hierarchical temporal-aware video-language pre-training, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15405–15416.
[23] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[24] P. K. A. Vasu, H. Pouransari, F. Faghri, R. Vemulapalli, O. Tuzel, MobileCLIP: Fast image-text models through multi-modal reinforced training, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15963–15974.
[25] Y. Zhao, I. Misra, P. Krähenbühl, R. Girdhar, Learning video representations from large language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6586–6597.
[26] Y. Wang, K. Li, X. Li, J. Yu, Y. He, G. Chen, B. Pei, R. Zheng, Z. Wang, Y. Shi, et al., InternVideo2: Scaling foundation models for multimodal video understanding, in: European Conference on Computer Vision, Springer, 2025, pp. 396–416.
[27] L. Gao, J.-M. Sun, K. Mo, Y.-K. Lai, L. J. Guibas, J. Yang, SceneHGN: Hierarchical graph networks for 3D indoor scene generation with fine-grained geometry, IEEE Transactions on Pattern Analysis and Machine Intelligence 45 (2023) 8902–8919.
[28] J. Wald, N. Navab, F. Tombari, Learning 3D semantic scene graphs with instance embeddings, International Journal of Computer Vision 130 (2022) 630–651.
[29] M. Beck, K. Pöppel, M. Spanring, A. Auer, O. Prudnikova, M. Kopp, G. Klambauer, J. Brandstetter, S. Hochreiter, xLSTM: Extended long short-term memory, Advances in Neural Information Processing Systems (2024).
[30] L. Feng, F. Tung, M. O. Ahmed, Y. Bengio, H. Hajimirsadegh, Were RNNs all we needed?, arXiv preprint arXiv:2410.01201 (2024).
[31] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, et al., LAION-5B: An open large-scale dataset for training next generation image-text models, Advances in Neural Information Processing Systems 35 (2022) 25278–25294.
[32] L. Fan, D. Krishnan, P. Isola, D. Katabi, Y. Tian, Improving CLIP training with language rewrites, Advances in Neural Information Processing Systems 36 (2024).
[33] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, J. Sivic, HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2630–2640.
[34] M. Grootendorst, KeyBERT: Minimal keyword extraction with BERT, 2020. URL: https://doi.org/10.5281/zenodo.4461265. doi:10.5281/zenodo.4461265.
[35] J. Li, D. Li, C. Xiong, S. Hoi, BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, in: International Conference on Machine Learning, PMLR, 2022, pp. 12888–12900.
[36] M. Bain, A. Nagrani, G. Varol, A. Zisserman, Frozen in time: A joint video and image encoder for end-to-end retrieval, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1728–1738.
[37] V. Gabeur, C. Sun, K. Alahari, C. Schmid, Multi-modal transformer for video retrieval, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV, Springer, 2020, pp. 214–229.
[38] W. Shi, C. C. Loy, X. Tang, Deep specialized network for illuminant estimation, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV, Springer, 2016, pp. 371–387.
[39] X. Jiang, Y. Gong, X. Guo, Q. Yang, F. Huang, W.-S. Zheng, F. Zheng, X. Sun, Rethinking temporal fusion for video-based person re-identification on semantic and time aspect, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 11133–11140.
[40] M. Li, H. Xu, J. Wang, W. Li, Y. Sun, Temporal aggregation with clip-level attention for video-based person re-identification, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 3376–3384.