<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">AgriMus: Developing Museums in the Metaverse for Agricultural Education</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ali</forename><surname>Abdari</surname></persName>
							<email>abdari.ali@spes.uniud.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Udine</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">University of Naples Federico II</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Alex</forename><surname>Falcon</surname></persName>
							<email>falcon.alex@spes.uniud.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Udine</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giuseppe</forename><surname>Serra</surname></persName>
							<email>giuseppe.serra@uniud.it</email>
							<affiliation key="aff0">
								<orgName type="institution">University of Udine</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">AgriMus: Developing Museums in the Metaverse for Agricultural Education</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">E0160B04F24DF873C845156E94397C5E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:27+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Metaverse</term>
					<term>Digital Museums</term>
					<term>Agriculture Education</term>
					<term>Cross-modal Retrieval</term>
					<term>Multimedia</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Learning agricultural practices, such as gardening, maintaining fruit trees, and general farming techniques, has increasingly shifted towards digital platforms, with tutorials on YouTube being a popular resource. As the metaverse expands, immersive experiences are emerging as powerful tools for skill acquisition. This work introduces AgriMus, a search tool designed for metaverse environments, enabling users to discover both videos and interactive experiences tailored to teaching practical skills in agriculture. AgriMus aims to connect users with relevant virtual spaces where they can learn and practice agricultural tasks in a hands-on, engaging way. Initial experiments conducted on 83 exhibitions demonstrate the potential of zero-shot search methods, achieving 27% R@1, 41% MRR, and 52% nDCG@5. The results also highlight the importance of leveraging the hierarchical structure of exhibition data and integrating state-of-the-art vision-language models to improve search performance.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Nowadays, with the amount of user-generated content uploaded to the Internet increasing dramatically every year, it has become common practice to acquire new skills by watching tutorials on video sharing platforms such as YouTube. These tutorial videos span a broad range of different skills, including general life skills such as cooking, home organization, and DIY crafts; technical skills like coding, graphic design, and video editing; and practical hands-on activities like gardening, farming, and maintaining fruit trees. For instance, users can find step-by-step guides on planting and cultivating vegetables, pruning fruit trees for optimal growth, designing irrigation systems, and even employing modern farming technologies, e.g., hydroponics or drone-assisted crop monitoring. This vast repository of user-generated content empowers individuals to learn both everyday and specialized skills at their own pace.</p><p>With the rapid growth of the metaverse, a new dimension of learning and skill acquisition is emerging, particularly in areas like agriculture. Initiatives such as the Agriscience Metaverse Academy are already leveraging virtual reality (VR) to provide immersive educational experiences for agriculture teachers and students, enabling them to explore agriscience concepts without the constraints of physical resources. Similarly, projects like "Georgia Agriculture in the Metaverse" introduce AI-powered, game-based learning environments where users can grow crops, manage agricultural businesses, and gain practical farming skills through interactive simulations. 
These examples illustrate how the metaverse is transforming traditional tutorial-based learning into dynamic, hands-on experiences, making skill development more accessible, engaging, and impactful.</p><p>To take advantage of the strengths of both traditional tutorial videos and immersive metaverse experiences, we introduce the AgriMus project, the overview of which can be seen in Figure <ref type="figure" target="#fig_0">1</ref>. AgriMus focuses on developing a specialized search tool that empowers users interested in learning agricultural activities to explore and identify relevant agricultural metaverses. By integrating video content with interactive virtual experiences, this tool allows users to search for and access metaverse environments tailored to their specific interests, such as gardening, farming techniques, or advanced agricultural practices. AgriMus bridges the gap between conventional online tutorials and the growing potential of the metaverse, offering a comprehensive platform for skill development in agriculture.</p><p>To demonstrate the feasibility of AgriMus, we collected a dataset specifically designed for proof-of-concept purposes. The dataset comprises 83 topical exhibitions, each dedicated to a broad agricultural theme (e.g., pruning fruit trees), with individual rooms focusing on more specific subtopics (e.g., pruning lemon trees). We conducted experiments in a zero-shot scenario, leveraging the hierarchical structure of the exhibitions to model the data as envisioned for AgriMus. Our experimental results demonstrate promising performance, achieving 27% recall at rank 1 (R@1), 66% recall at rank 5 (R@5), and a mean reciprocal rank (MRR) of 41%. Additionally, we achieved 52% normalized discounted cumulative gain (nDCG) at rank 5 and 56% nDCG at rank 10. 
These results highlight the effectiveness of the hierarchical approach and validate the potential of AgriMus for enabling efficient exploration and retrieval in agricultural metaverses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Digital museums</head><p>The emergence of digital museums represents a transformative shift in how cultural heritage is accessed and experienced, offering unprecedented opportunities for engagement and education <ref type="bibr" target="#b0">[1]</ref>. With the advancements in technologies such as high-quality 3D modeling and virtual reality (VR), digital museums are becoming more popular and can host rich, immersive experiences. For instance, they allow for detailed representations of artifacts and exhibitions <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3]</ref>, enabling visitors to explore diverse themes ranging from ancient civilizations to contemporary art <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b4">5]</ref>. Moreover, unlike traditional museums, which are constrained by physical space and operating hours, digital museums can operate continuously, providing access to global audiences at any time.</p><p>Thus, digital museums play a vital role in preserving and promoting cultural heritage by making artifacts and traditions accessible to wider audiences. However, they usually focus their attention on cultural heritage. Conversely, this work builds on the concept of digital museums by focusing on the integration of agricultural knowledge and training materials into museum-like exhibits, creating a unique training avenue for novices and practitioners in agricultural domains, which has not been studied so far. The aim is to support the acquisition of new skills by mixing lecture-like videos with virtual hands-on practice by means of VR experiences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Multimedia-rich 3D scenarios</head><p>Recent advancements in vision and language techniques have significantly enhanced the retrieval of 3D scenes and objects through natural language queries. The integration of dense captioning methods with RGB-D scans enables the generation of detailed, context-aware descriptions of localized objects within 3D environments <ref type="bibr" target="#b5">[6]</ref>. These approaches allow users to input natural language queries to retrieve specific objects or scenes, thereby improving the efficiency and accuracy of retrieval systems. By combining language and 3D visual data, these techniques facilitate more intuitive interactions between humans and machines, enabling natural language descriptions to guide the search and discovery of relevant 3D models or environments.</p><p>Moving beyond single objects, recent research has addressed more complex indoor scene retrieval using text, involving longer descriptions, as these need to describe many objects and their positions within the entire scene. Several contributions were made in this direction, including CRISP <ref type="bibr" target="#b6">[7]</ref>, which provides a large collection of 3D indoor scenes and their corresponding textual descriptions, and Farmare <ref type="bibr" target="#b7">[8]</ref> and Adoctera <ref type="bibr" target="#b8">[9]</ref>, which focus on learning to search furnished multi-room apartments and rank them against user queries. More recently, Text2SceneGraphMatcher <ref type="bibr" target="#b9">[10]</ref> introduced a method for aligning open-set text queries with 3D scene graphs to facilitate effective scene retrieval.</p><p>However, the approaches mentioned above do not consider the possibility that the scenes contain multimedia content that affects their relevance to the user query. 
This problem raises additional challenges as both the global structure and the local components need to be accounted for in the learned representation in order to fully capture the contents of the scenes and align them to the queries. For instance, in our previous works we investigated the use of cross-modal approaches to rank 3D scenarios comprising additional multimedia data in the form of either videos <ref type="bibr" target="#b10">[11]</ref> or images <ref type="bibr" target="#b11">[12]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">AgriMus: An overview of the project</head><p>This section offers an overview of the plans to implement the AgriMus project. These are also presented graphically in Figure <ref type="figure">2</ref>. The project will involve three main steps, namely data collection, data modeling, and the evaluation phase with an emphasis on user studies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Collecting the data</head><p>The data collection phase will involve three main ingredients: tutorial videos, experiences, and related descriptions.</p><p>For videos, we will use an automated pipeline to collect relevant tutorial videos from YouTube by querying for keywords related to agricultural skills, gardening, and DIY projects. Videos with informative titles will be prioritized to ensure the relevance of the content. The audio tracks of these videos will be transcribed using Whisper <ref type="bibr" target="#b12">[13]</ref>, a state-of-the-art speech-to-text model known for its high accuracy across multiple languages and challenging audio conditions. The resulting transcripts will serve as a basis for generating detailed textual descriptions. We will use large language models (LLMs) to process these transcripts, as previously done in recent research <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15]</ref>, and extract key procedural steps to produce structured descriptions that enhance video indexing and facilitate the search process.</p><p>[Figure 2 caption] Step 1 collects the required data, comprising topical 3D exhibitions adorned with educational videos and experiences in fields related to agriculture. Step 2 introduces a hierarchical methodology for aligning the visual contents to the textual ones, and for modeling the exhibitions, with the aim of capturing information about the single experiences or videos, how these form the contents of a room with a specific topic (e.g., how to prune a specific type of tree), and finally how the rooms together offer a more comprehensive view of it (e.g., pruning that type of tree, but also growing it, harvesting it, etc.). Step 3 will involve user studies to better understand user needs and the effectiveness of the proposed methodology in capturing them.</p><p>The process of gathering virtual experiences will involve a combination of automated and manual curation. We will systematically review academic literature to identify virtual agricultural training environments described in research papers, with particular attention to interactive simulations and metaverse-based experiences. For instance, Fabrika et al. <ref type="bibr" target="#b15">[16]</ref> developed a system for educating users in thinning practices, fundamental for forestry management, whereas detailed digital twins of forests were recently created using data-driven and procedural approaches, e.g. <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>. Another example relates to teaching users to detect ripe fruit, e.g. strawberries <ref type="bibr" target="#b18">[19]</ref>. In addition, publicly available amateur simulations and virtual environments created by independent developers will be sourced from online repositories and virtual experience platforms. This dual approach ensures a diverse collection of virtual experiences, covering both high-fidelity simulations and more accessible, grassroots solutions.</p><p>The collected experiences will be cataloged and integrated into the AgriMus platform, enriching the learning ecosystem with practical, hands-on tools.</p></div>
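To make the planned transcript-processing stage concrete, the following toy sketch stands in for the Whisper-plus-LLM pipeline described above: in the actual project, Whisper would produce the transcript and an LLM would distill the procedural steps, whereas here a simple action-verb heuristic and all the names involved are purely illustrative.

```python
# Toy stand-in for the planned transcript-to-steps stage. A real pipeline
# would transcribe audio with Whisper and prompt an LLM; here a verb
# heuristic illustrates the kind of structured output we aim for.
ACTION_VERBS = {"cut", "prune", "water", "plant", "remove", "sow", "harvest"}

def extract_steps(transcript: str) -> list[str]:
    """Keep sentences that start with an action verb, as candidate steps."""
    steps = []
    for sentence in transcript.split("."):
        sentence = sentence.strip()
        first_word = sentence.split(" ")[0].lower() if sentence else ""
        if first_word in ACTION_VERBS:
            steps.append(sentence)
    return steps

transcript = (
    "Welcome to the channel. Remove the dead branches first. "
    "Cut just above an outward-facing bud. Thanks for watching."
)
steps = extract_steps(transcript)
```

The extracted steps would then serve as the structured description indexed by the search system.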
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Modeling the museums</head><p>The exhibitions collected in the previous step are quite rich in content: each exhibition contains multiple rooms, each containing different videos or experiences. To encode all this information in a way that is easily searchable while avoiding information loss, we will rely on a combination of state-of-the-art computer vision, natural language processing, and multimedia analysis techniques. Specifically, as shown in Figure <ref type="figure">2</ref>, we plan to use hierarchical modeling to leverage the structure of the exhibitions, roughly divided into content-level (videos or interactive experiences), room-level, and museum-level. By aligning the visual and textual representations within each level (i.e., a video/experience with its description, a room with the descriptions of all its contents, and finally the museum with the full description), it will become easier for the model to learn to encode them in an orderly fashion while minimizing information loss <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22]</ref>. For content-level representations, given that both videos and more complex interactive experiences will be integrated, a mixture of spatial and spatio-temporal models will be used. This will include 2D Large Vision-Language Models (LVLM) such as CLIP <ref type="bibr" target="#b22">[23]</ref> and Mobile-CLIP <ref type="bibr" target="#b23">[24]</ref>, and spatio-temporal LVLMs such as LaViLa <ref type="bibr" target="#b24">[25]</ref> or InternVideo2 <ref type="bibr" target="#b25">[26]</ref>. In this way, it will be possible to separately encode both appearance and motion information, useful to better understand the primary entities of the experiences (e.g. the tree species) and the actions performed on them.</p><p>For the room-level representation, a naive solution would be to aggregate the content representations through mean pooling, possibly learning a weight for each. Alternatively, graph networks could also play an important role in understanding how to aggregate them by capturing relationships and dependencies between contents, at the cost of more computational resources. These have previously been used to capture single objects inside rooms (e.g. furniture) and their relationships by using scene graphs <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b27">28]</ref>.</p><p>Finally, for the museum-level representation, different types of aggregation could be used depending on the constraints to be imposed on the exhibition itself. Generally, learning a weighted mean of the room representations could suffice, as the information coming from each room would have its weight defined by the content without, for instance, any constraint on the visit order. However, it is common for exhibitions to have a predefined visit order, usually defined by the exhibition curator. Therefore, exploring sequential models (e.g., standard recurrent networks such as LSTM and GRU, or the more recent xLSTM <ref type="bibr" target="#b28">[29]</ref> and minGRU <ref type="bibr" target="#b29">[30]</ref>) for the aggregation of the rooms could play an important role in encoding their content into the museum representation. As in the previous case, graph neural networks could also be used to capture neighbor relations between rooms and assess the relevance of each.</p></div>
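As an illustration of the weighted-mean option for museum-level aggregation, the following sketch combines room representations into a single museum vector; the scores here are fixed for illustration, whereas in the trained system they would come from a small learned scoring head.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights summing to one."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_mean(room_vecs, scores):
    """Aggregate room representations into one museum vector.

    The softmax keeps the result a convex combination of the rooms,
    so no single room can push the museum vector outside their span.
    """
    weights = softmax(scores)
    dim = len(room_vecs[0])
    return [sum(w * v[i] for w, v in zip(weights, room_vecs)) for i in range(dim)]

rooms = [[1.0, 0.0], [0.0, 1.0]]           # two toy 2-d room vectors
museum = weighted_mean(rooms, scores=[0.0, 0.0])  # equal scores -> plain mean
```

With unequal scores, the same function smoothly shifts weight towards the more relevant rooms.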
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Searching through the museums</head><p>Once the representations for the museums are computed, they can be searched using similarity-based approaches. Here, two methodologies can be followed.</p><p>As content-level representations involve LVLMs, processing the user queries through the same techniques means that the query representation falls into the same latent space, hence enabling training-free search. However, this would imply either that the museums are modeled without relying on the hierarchy or that the aggregation functions are not trained (e.g., mean or max pooling). Although both cases are likely to lead to poorer performance compared to a solution using trained components, they enable effective solutions even in scarce data scenarios. In Section 4, we show some early results obtained using this methodology.</p><p>In general, user queries may also be long and articulated, describing specific scenarios and thus requiring more advanced query processing. While large vision-language models (LVLMs) are typically trained with simple captions, often composed of primary entities and a few additional descriptive words (e.g., half of the captions in LAION-2B are less than 50 characters long <ref type="bibr" target="#b30">[31]</ref>), there are LVLMs trained to handle more complex query scenarios. An example is LaCLIP <ref type="bibr" target="#b31">[32]</ref>, which uses Large Language Models to rewrite the original captions paired with the training images. This suggests that the zero-shot approach should also work for longer queries, although it is generally unlikely to match a model trained specifically for the task at hand. In particular, training the proposed method using the vision and language data collected in the previous step allows the models to become more tailored to the task, potentially preserving more details in the encoding.</p></div>
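The training-free search described above reduces to cosine-similarity ranking in the shared latent space. A minimal sketch follows, where the three-dimensional vectors are hypothetical stand-ins for LVLM embeddings, since only the ranking logic matters here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_museums(query_vec, museum_vecs):
    """Return museum indices sorted by decreasing similarity to the query."""
    sims = [cosine(query_vec, m) for m in museum_vecs]
    return sorted(range(len(museum_vecs)), key=lambda i: sims[i], reverse=True)

# Hypothetical 3-d embeddings standing in for encoded museums and query.
museums = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.3, 0.3, 0.9]]
query = [0.0, 1.0, 0.0]
ranking = rank_museums(query, museums)
```

Because query and museums live in the same space, no training is needed for this step.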
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Early experimental results</head><p>As a proof-of-concept for the AgriMus project, we collected a dataset of exhibitions for educational purposes in the agriculture domain. The details of the dataset are provided in Section 4.1, whereas early experimental results are reported in Section 4.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Collected data</head><p>As mentioned above, a staple of the AgriMus project will be the availability of museums, or exhibitions, about important topics for education in the field of agriculture, which we collected ourselves since no such resource is currently available. For an early prototype of the proposed AgriMus project, we created a set of 83 topical museums, each focusing on a branch of topics relevant to agricultural education, e.g. tutorials on pruning trees. Each room then focuses on a more specific topic, e.g. how to prune lemon trees. On average, there are 4.6 rooms per museum, with about 11.2 videos per museum. To achieve this, we first collected a total of 288 relevant videos from the HowTo100M dataset <ref type="bibr" target="#b32">[33]</ref>. The main topics distilled from the videos range from teaching the user the best practices for growing a tomato plant at home to watering indoor plants or pruning outdoor trees. The topics are extracted using KeyBert <ref type="bibr" target="#b33">[34]</ref> by looking for representative bigrams in the video titles. Examples of topics include keywords for actions such as "sow" and "prune", for entities such as "rose" and "garden", and also for some technical approaches such as "hydroponic". As we looked for bigrams, these keywords are typically grouped in pairs, e.g. "rid" with "weed". In total, we extracted 213 topics. Most of them are bound to only one museum (about 80) or two museums (about 100), and only seven are repeated in four or five museums (Figure <ref type="figure" target="#fig_3">4</ref>). The videos, with lengths spanning from 38 seconds to 31 minutes, are then grouped to form viable candidate pools for the museum rooms. Specifically, we first selected part of the bigrams (e.g. "growing") to decide a topic for the museum, and then built the rooms based on the second part (e.g. "tomatoes" for one room, and "potatoes" for another room).</p><p>[Figure 3 caption] Starting from the full museum, the figure highlights one of the rooms (in green) and two of the videos contained in it (yellow and purple). 1) The frames of the videos are processed using a Large Vision-Language Model (LVLM). 2) The frames' representations are then aggregated using the function a_frames. 3) The videos are then aggregated using a_videos to capture the contents of the room. 4) Finally, a_rooms aggregates the rooms' contents to capture the full exhibition. This final representation is then used to rank the exhibitions against the representation of the user query.</p></div>
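The museum/room construction from bigrams can be sketched as follows: grouping by first and second token mirrors the procedure described above, while the topics themselves are illustrative examples (in the actual pipeline they come from KeyBert).

```python
from collections import defaultdict

def group_bigrams(bigrams):
    """Group topic bigrams by their first token: the first part picks the
    museum theme, the second part picks a room within that museum."""
    museums = defaultdict(list)
    for first, second in (b.split(" ") for b in bigrams):
        museums[first].append(second)
    return dict(museums)

# Illustrative bigram topics, as might be extracted from video titles.
topics = ["growing tomatoes", "growing potatoes", "pruning lemon", "pruning olive"]
layout = group_bigrams(topics)
```

Here `layout` maps each museum theme to its rooms, e.g. one "growing" museum with a "tomatoes" room and a "potatoes" room.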
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Zero-shot search method</head><p>As the collected dataset is small, experiments that involve training the neural network outlined in Section 3 would be unfeasible. Therefore, we designed a zero-shot methodology based on the discussion in Section 3.3. An overview of the zero-shot methodology is illustrated in Figure <ref type="figure" target="#fig_2">3</ref>. It consists of four main steps. First, 150 frames are uniformly sampled from each video within the room, resized to (H, W), and processed through a spatial LVLM. In the experiments in the following sections, three LVLMs are considered: CLIP <ref type="bibr" target="#b22">[23]</ref>, Mobile-CLIP <ref type="bibr" target="#b23">[24]</ref>, and BLIP <ref type="bibr" target="#b34">[35]</ref>. H and W are set to 224 for CLIP and BLIP, whereas 256 is used for Mobile-CLIP.</p><p>The frame representations are then aggregated by a_frames, implemented in the experiments as mean, maximum, or median pooling. Although mean pooling of frame vectors is the typical way to obtain a rough representation of a video <ref type="bibr" target="#b35">[36,</ref><ref type="bibr" target="#b36">37]</ref>, maximum pooling is another way to aggregate frames by looking at spikes in the features (as is often done when reducing the spatial dimensions in deep convolutional networks such as ResNet). However, to avoid overemphasizing spurious spikes, which can happen with max pooling, and to avoid diluting meaningful features with mean pooling, which happens especially when the videos are long, median pooling is a viable candidate: it focuses on the middle value, improving robustness to extreme values <ref type="bibr" target="#b37">[38]</ref>.</p><p>For the room-level representation, the function a_videos is used. As in the previous case, mean, maximum, and median pooling can be used to implement such a function. Although it can be argued that mean and median are more reasonable, as the videos in a room follow the same topic, there are nuances that could be important to retain. This is the case for many tutorial videos that are longer than average because they explain how to perform more than one task at once, for instance showing how to plow, sow, and water a crop. Therefore, maximum pooling is also a viable candidate for a_videos.</p><p>Finally, for the museum-level aggregation we rely on mean pooling to implement a_rooms, so that each room has the same weight in the final encoded representation.</p><p>Since we leverage LVLMs to process the visual information, the queries are also processed and encoded through the same models without any additional training. This is possible because the embedding space is learned by jointly training the visual encoder and aligning it to the textual encoder, so that both output a similar representation for aligned inputs (e.g., an image and its textual description). In our setting, the test queries are the 213 bigram topics extracted from the video titles. To perform the search, the queries are first tokenized and encoded through the textual encoder of the LVLM, and then cosine similarity is used to rank the museum representations created by a_rooms.</p></div>
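A minimal sketch of the frames-to-videos-to-rooms-to-museum aggregation follows, using toy two-dimensional features in place of real LVLM frame embeddings; the pooling functions match those discussed above, and any of them can be swapped in for each level.

```python
import statistics

def mean_pool(vectors):
    """Element-wise mean of a list of equally-sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def max_pool(vectors):
    """Element-wise maximum, keeping the spikes in each feature."""
    return [max(col) for col in zip(*vectors)]

def median_pool(vectors):
    """Element-wise median, robust to extreme values."""
    return [statistics.median(col) for col in zip(*vectors)]

def encode_museum(museum, a_frames=mean_pool, a_videos=mean_pool, a_rooms=mean_pool):
    """museum: list of rooms; room: list of videos; video: list of frame
    vectors (as produced by the LVLM). Mirrors the pipeline of Figure 3."""
    room_vecs = []
    for room in museum:
        video_vecs = [a_frames(video) for video in room]  # frames -> video
        room_vecs.append(a_videos(video_vecs))            # videos -> room
    return a_rooms(room_vecs)                             # rooms -> museum

# Toy 2-d features standing in for real CLIP frame embeddings.
museum = [
    [[[1.0, 0.0], [3.0, 0.0]]],    # room with one video of two frames
    [[[0.0, 2.0]], [[0.0, 4.0]]],  # room with two one-frame videos
]
vec = encode_museum(museum)
```

The resulting vector is what cosine similarity compares against the encoded query.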
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Evaluation metrics</head><p>To assess the performance of the system, a relevance score was computed for each exhibition given a query 𝑞. The score for museum 𝑚 is a real value computed by summing 1.0 for each room in 𝑚 that has 𝑞 as one of its topics, and 0.1 for each video in the other rooms that has 𝑞 as one of its topics. For instance, suppose the query is "rid weed" and a museum has two rooms, one with topics "rid weed" and "start hydroponic", and the other with "rid rose". The second room contains four videos, two of which have "rid weed" among their topics (note that one video may have several topics extracted from it). Then, the relevance score of 𝑚 for 𝑞 is 1.2, as 1.0 is summed for the first room and 0.2 for the two relevant videos in the second room. When computing the recall rates and the median rank, the relevant museums are those with the highest relevance score in the ranking list for that query.</p><p>The performance evaluation is done using four main metrics: Recall at rank k (R@k), Median rank (MedR), Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain at rank k (nDCG@k). R@k measures the proportion of relevant museums found within the top k retrieved items. MedR represents the median rank position of the first relevant item across queries. MRR evaluates the rank position of the first relevant item, averaging the reciprocal of the rank across queries. nDCG assesses the quality of the ranking list, with higher-ranked relevant items contributing more to the score, rewarding systems that prioritize important results. For all metrics except MedR, higher values indicate better performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>We investigate different aggregation styles for the functions a_frames, a_videos, and a_rooms. CLIP is used as the LVLM to process and encode the video frames. Columns report the aggregation used for frames, videos, and rooms, together with R@1, R@5, R@10, MedR, MRR, nDCG@5, and nDCG@10 [table body not recovered]. Discussion in Section 4.4.</p></div>
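The relevance score can be reproduced with a short snippet; the data layout (explicit room topics plus per-video topic sets) is an assumption about how the annotations are stored, chosen so that the worked example above comes out to 1.2.

```python
def relevance(museum, query):
    """museum: list of (room_topics, video_topic_sets) pairs.

    +1.0 per room whose own topics include the query, +0.1 per video
    in the remaining rooms whose topics include the query.
    """
    score = 0.0
    for room_topics, videos in museum:
        if query in room_topics:
            score += 1.0
        else:
            score += 0.1 * sum(query in v for v in videos)
    return round(score, 2)

# The worked example from the text: two rooms, four videos in the second,
# two of which carry the queried topic.
museum = [
    ({"rid weed", "start hydroponic"}, []),
    ({"rid rose"}, [{"rid weed"}, {"rid weed", "rid rose"}, {"rid rose"}, {"rid rose"}]),
]
score = relevance(museum, "rid weed")
```

The score is then used to decide which museums count as relevant when computing R@k, MedR, MRR, and nDCG@k.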
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Which aggregation style is best?</head><p>As mentioned, there are several reasons supporting the use of mean, maximum, or median pooling to implement the functions a_frames, a_videos, and a_rooms in the zero-shot search method explored in this paper. Here, we explore several combinations of these functions and assess their performance on the collected dataset. The results are reported in Table <ref type="table">1</ref>.</p><p>First, aggregating the frames using mean pooling leads to the best R@1 and MRR, whether the videos are aggregated with mean (23.94% R@1 and 39.09% MRR), median (19.71% and 36.11%), or maximum pooling (20.65% and 31.57%), compared to aggregating the frames with median or max. In particular, the performance drop is much larger with max pooling than with median pooling. On the one hand, this shows that preserving some information from all the frames, although noisily, is effective in this scenario. On the other hand, it confirms that maximum pooling is too sensitive to spurious spikes and possibly loses sight of the general content of the video, leading to the worst results, e.g. 7.51% R@1 and 18.50% MRR in the case of (max, mean, mean).</p><p>Second, using mean pooling for all three functions, i.e. the row represented by (mean, mean, mean), leads to 23.94% R@1 and 39.09% MRR, whereas all the other combinations achieve less than 20% R@1 and 37% MRR. It also achieves 51.77% nDCG@5 and 55.34% nDCG@10, which ranks second in our experiments, as (median, mean, mean) achieves 52.74% nDCG@5 and 55.96% nDCG@10. This indicates a higher chance of retrieving a relevant museum at the first rank than other combinations and a good quality of the proposed ranking lists, making it a good candidate for the proposed zero-shot method. Therefore, in the following experiments we use (mean, mean, mean).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Which feature extractor is best?</head><p>In the previous experiment, using mean pooling for all three aggregation functions atop CLIP frame features led to the best results. Here, we explore how other LVLMs affect the final performance of our zero-shot search method. Specifically, we test Mobile-CLIP <ref type="bibr" target="#b23">[24]</ref> and BLIP <ref type="bibr" target="#b34">[35]</ref>, as well as combinations of two or three LVLMs obtained by concatenating the frame features. The results are reported in Table <ref type="table">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>We investigate different LVLMs and their combination to extract the frame-level features. The aggregation functions are set to mean pooling. Discussion in Section 4.5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Feature extractor</head><p>R@1 R@5 R@10 MedR MRR nDCG@5 nDCG@10 </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>We validate the assumption that, even when performing zero-shot search, leveraging the hierarchical nature of the data is useful. Mobile-CLIP is used as the LVLM for frame feature extraction, and the aggregation functions are set to mean pooling for our approach. Discussion in Section 4.6.</p><p>Feature extractor R@1 R@5 R@10 MedR MRR nDCG@5 nDCG@10 Hierarchical (ours)</p><p>First, using Mobile-CLIP leads to an increase in performance compared to CLIP, for instance from 23.94% R@1 and 39.09% MRR to 27.23% and 41.33%.</p><p>Second, combining the information extracted by multiple LVLMs does not lead to better results. Specifically, among the two-model combinations the best results are obtained by CLIP+Mobile-CLIP, which achieves 22.06% R@1 and 38.44% MRR, yet these are lower than those obtained by Mobile-CLIP alone (27.23% and 41.33%). Although combining all three models leads to slightly better nDCG than Mobile-CLIP (e.g. 54.60% nDCG@5 compared to 52.55%), the increased computational and storage costs do not justify such a marginal gain.</p></div>
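The retrieval metrics reported in these tables can be computed from the rank of the ground-truth museum for each query. A sketch of the standard definitions (the graded-relevance input used for nDCG is our assumption about how the ranked lists are scored):

```python
import numpy as np

def retrieval_metrics(ranks, rels=None, k=5):
    """ranks: 1-based rank of the ground-truth item for each query.
    rels: optional (n_queries, n_items) graded relevance of each
    query's ranked list (already in ranked order), used for nDCG@k."""
    ranks = np.asarray(ranks, dtype=float)
    out = {
        f"R@{k}": float(np.mean(ranks <= k)),   # Recall@k
        "MedR": float(np.median(ranks)),        # median rank
        "MRR": float(np.mean(1.0 / ranks)),     # mean reciprocal rank
    }
    if rels is not None:
        discounts = np.log2(np.arange(2, k + 2))
        dcg = (rels[:, :k] / discounts).sum(axis=1)
        # Ideal DCG: relevances sorted in decreasing order.
        ideal = np.sort(rels, axis=1)[:, ::-1][:, :k]
        idcg = (ideal / discounts).sum(axis=1)
        out[f"nDCG@{k}"] = float(np.mean(dcg / np.maximum(idcg, 1e-9)))
    return out
```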
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6.">Is a hierarchical approach better than a flat one?</head><p>For the future of the AgriMus project, we hypothesized that leveraging the hierarchical nature of museums is fundamental to modeling them correctly, both when training the components and when performing zero-shot search. Here, we validate this hypothesis by aggregating all the videos in the museum directly at the video level, neglecting the room separation. The LVLM is set to Mobile-CLIP and the aggregation functions to mean pooling, as this combination performed best in the previous experiments. The results are reported in Table <ref type="table">3</ref>.</p><p>The main result confirms the hypothesis: leveraging the hierarchy leads to 27.23% R@1, 41.33% MRR, 52.55% nDCG@5, and 56.57% nDCG@10, whereas the best results among the flat ablations are 26.29% R@1, 40.34% MRR, 52.50% nDCG@5, and 56.48% nDCG@10. Although the use of maximum pooling leads to significantly worse results, flat mean pooling at the video level yields results comparable to the proposed method under several metrics, especially those looking beyond the first rank (R@5 and R@10, nDCG@5 and nDCG@10). Nonetheless, we hypothesize that training the aggregation functions will lead to considerably better performance, as that would allow better preservation of the temporal information in the videos and improve the encoding capabilities for both the videos and the rooms.</p></div>
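The difference between the hierarchical and flat variants can be made concrete with mean pooling (a sketch under our own naming; note that with mean pooling the two coincide exactly when every room contains the same number of videos, since the hierarchy weights rooms equally while the flat variant weights videos equally):

```python
import numpy as np

def video_embs(rooms):
    # Mean-pool the (n_frames, dim) features of every video,
    # preserving the room grouping.
    return [[np.mean(f, axis=0) for f in videos] for videos in rooms]

def flat_embedding(rooms):
    """Flat ablation: pool all videos directly, ignoring rooms."""
    vids = [v for room in video_embs(rooms) for v in room]
    return np.mean(np.stack(vids), axis=0)

def hierarchical_embedding(rooms):
    """Hierarchical: videos -> room embeddings -> museum embedding."""
    room_e = [np.mean(np.stack(room), axis=0)
              for room in video_embs(rooms)]
    return np.mean(np.stack(room_e), axis=0)
```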
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion/limitations/future work</head><p>In this section, we highlight the limitations of our current approach and outline directions for future work.</p><p>As the current implementation of AgriMus relies on a zero-shot search method, we employed simple aggregation operations to combine the representations of frames, videos, and rooms. While this approach is straightforward and computationally efficient, it is well known that such operations are suboptimal; for instance, they tend to lose the temporal information in the videos <ref type="bibr" target="#b38">[39,</ref><ref type="bibr" target="#b39">40]</ref>. In future iterations, once we have collected a sufficient amount of data, we plan to experiment with neural sequential models and learned aggregation functions. These should enhance the system's ability to recognize temporal patterns, leading to better video representations and improved search accuracy. Training on larger datasets will not only improve content recognition but also facilitate a deeper exploitation of the hierarchical structure of the exhibitions, contributing to more precise search results.</p><p>Another challenge is the inherent diversity and complexity of topics related to agriculture, gardening, and related fields. These domains encompass a wide range of subfields, each requiring specific expertise and datasets. To develop a robust and comprehensive system useful to both practitioners and novices, it is essential to collect a larger and more diverse set of videos. For example, there are currently no videos covering certain tree species, such as cedar trees. Interestingly, increasing the scope of the dataset could also facilitate the creation of more specialized virtual museums. For instance, an exhibition might focus specifically on "lemon trees", with rooms dedicated to different stages of growth and care (e.g., planting, watering, pruning, harvesting). 
Alternatively, broader topics like "growing vegetables indoors" could be broken down into rooms focusing on various crops, such as tomatoes, potatoes, and zucchini. This structured, hierarchical approach will enhance the learning experience by organizing content logically and progressively.</p><p>In addition to expanding the video dataset, future efforts will focus on incorporating virtual experiences that allow users to practice within the metaverse. By complementing tutorial videos with interactive, immersive environments, users can engage more deeply with the content, reinforcing their learning through hands-on experiences. Such experiences will be particularly valuable for tasks that require manual skills, such as pruning or grafting, as they enable users to practice techniques in a simulated environment. User studies will also need to be conducted to assess the comprehensiveness of the exhibitions and their educational effectiveness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>With the growth of the internet and user-generated content, video tutorials have become essential tools for supporting educational efforts across various domains, teaching viewers best practices for growing vegetables at home, pruning fruit trees, and other practical agricultural skills. As the metaverse continues to evolve, these video tutorials can be complemented by interactive and immersive experiences, enhancing the learning process by providing hands-on practice opportunities.</p><p>To realize this vision, we introduced the AgriMus project, which focuses on developing digital exhibitions aimed at educating both novices and practitioners in a broad range of topics related to agriculture and gardening. AgriMus aims to build a search tool that allows users to explore these virtual museums, enabling them to watch tutorial videos to learn best practices and then engage in interactive experiences to practice and consolidate their skills within the metaverse.</p><p>As an initial step, we collected a dataset of 83 exhibitions, each consisting of multiple topical rooms enriched with video content. We conducted zero-shot experiments, achieving 27.23% R@1, 75.58% R@10, 41.33% MRR, and 52.55% nDCG@5 on a test set of 213 queries. Our experimental results demonstrated that leveraging the hierarchical structure of the data improves performance. In addition, they validated design choices for our scenario: mean pooling proved to be the most effective aggregation method, and Mobile-CLIP outperformed other models in feature extraction from video frames.</p><p>Looking ahead, several steps remain to fully realize the AgriMus project. We plan to expand the dataset by incorporating more videos to capture greater diversity across agricultural topics. Furthermore, integrating temporal information will enhance video content representation, improving search accuracy and museum organization. 
Lastly, conducting user evaluations will be crucial to refining the system and ensuring its effectiveness in real-world scenarios.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Given the user query, formulated in natural language, the method processes it using natural language processing (NLP), then combines computer vision (CV) techniques and multimodal analysis (V+L) to process the metaverses available in the database. Then, it recommends (RS) a ranking list of the relevant metaverses. The two cases show possible results. (a) A metaverse focusing on a specific tree (fig), with rooms dedicated to its different aspects. (b) A metaverse focusing on an action (pruning), with rooms dedicated to applying it in diverse agricultural scenarios.</figDesc><graphic coords="2,326.36,195.98,125.97,94.22" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: An overview of the AgriMus project. It consists of three main steps. Step 1 is about collecting the required data, comprising topical 3D exhibitions adorned with educational videos and experiences in fields related to agriculture. Step 2 introduces a hierarchical methodology for aligning the visual contents to the textual ones, and also for modeling the exhibitions, with the aim of garnering information about the single experiences or videos, how these form the contents of a room with a specific topic (e.g. how to prune a specific type of tree), and finally how the rooms capture a more comprehensive view on it (e.g. pruning that type of tree, and also growing, harvesting, etc). Step 3 will involve user studies to better understand the user needs and the effectiveness of the proposed methodology in capturing them.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: An overview of the prototype for zero-shot understanding of the exhibition contents used in this paper. Starting from the full museum, it highlights one of the rooms (in green) and two of the videos contained in it (yellow and purple). 1) The frames of the videos are processed using a Large Vision-Language Model (LVLM). 2) The frames' representations are then aggregated using the function 𝑎_frames. 3) The videos are then aggregated using 𝑎_videos to capture the contents of the room. 4) Finally, 𝑎_rooms aggregates the rooms' contents to capture the full exhibition. This final representation is then used to rank the exhibitions against the representation of the user query.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Statistics of the collected dataset. (a) shows the number of museums per topic, illustrating that most topics are used in one or two museums, while a few appear in four or five. (b) shows some of the topics presented in three or more museums.</figDesc><graphic coords="7,83.28,65.61,428.69,173.02" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was supported by the PRIN 2022 "MUSMA" -CUP G53D23002930006 -"Funded by EU -Next-Generation EU -M4 C2 I1.1", and by the Department Strategic Plan (PSD) of the University of Udine-Interdepartmental Project on Artificial Intelligence (2020-25).</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<ptr target="https://3d-ace.com/blog/virtual-museum/" />
		<title level="m">What is a virtual museum: Benefits, types and creation process</title>
				<imprint>
			<date type="published" when="2022">2022. 2024-12-23</date>
		</imprint>
	</monogr>
	<note>3D-Ace</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Dynamus: A fully dynamic 3d virtual museum framework</title>
		<author>
			<persName><forename type="first">C</forename><surname>Kiourt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Koutsoudis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pavlidis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Cultural Heritage</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="page" from="984" to="991" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The invisible museum: A user-centric platform for creating virtual 3d exhibitions with vr support</title>
		<author>
			<persName><forename type="first">E</forename><surname>Zidianakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Partarakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ntoa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dimopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kopidaki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ntagianta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ntafotis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Xhako</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Pervolarakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kontaki</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Electronics</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page">363</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">3d scanning digital models for virtual museums</title>
		<author>
			<persName><forename type="first">M</forename><surname>Barszcz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Dziedzic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Skublewska-Paszkowska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Powroznik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Animation and Virtual Worlds</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page">e2154</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Structured-light 3d scanning as a tool for creating a digital collection of modern and fossil cetacean skeletons (natural history museum, university of pisa)</title>
		<author>
			<persName><forename type="first">M</forename><surname>Merella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Farina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Scaglia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Caneve</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bernardini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Pieri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Collareta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bianucci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Heritage</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="6762" to="6776" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Scan2cap: Context-aware dense captioning in rgb-d scans</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gholami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nießner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">X</forename><surname>Chang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</title>
				<meeting>the IEEE/CVF conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="3193" to="3203" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Towards cross-modal point cloud retrieval for indoor scenes</title>
		<author>
			<persName><forename type="first">F</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Okumura</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Multimedia Modeling</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="89" to="102" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Farmare: a furniture-aware multi-task methodology for recommending apartments based on the user interests</title>
		<author>
			<persName><forename type="first">A</forename><surname>Abdari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Falcon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Serra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="4293" to="4303" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Adoctera: Adaptive optimization constraints for improved textguided retrieval of apartments</title>
		<author>
			<persName><forename type="first">A</forename><surname>Abdari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Falcon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Serra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 International Conference on Multimedia Retrieval</title>
				<meeting>the 2024 International Conference on Multimedia Retrieval</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1043" to="1050" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">&quot;Where am I?&quot; Scene retrieval with language</title>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Barath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Armeni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pollefeys</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Blum</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Computer Vision</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2025">2025</date>
			<biblScope unit="page" from="201" to="220" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Metaverse retrieval: Finding the best metaverse environment via language</title>
		<author>
			<persName><forename type="first">A</forename><surname>Abdari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Falcon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Serra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st International Workshop on Deep Multimodal Learning for Information Retrieval</title>
				<meeting>the 1st International Workshop on Deep Multimodal Learning for Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">A language-based solution to enable metaverse retrieval</title>
		<author>
			<persName><forename type="first">A</forename><surname>Abdari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Falcon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Serra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Multimedia Modeling</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="477" to="488" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Robust speech recognition via large-scale weak supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Brockman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mcleavey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="28492" to="28518" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Multilingual and fully non-autoregressive asr with large language model fusion: A comprehensive study</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">R</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Allauzen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-Y</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">N</forename><surname>Sainath</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="13306" to="13310" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Can generative large language models perform asr error correction?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Manakul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gales</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Knill</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.04172</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Thinning trainer based on forest-growth model, virtual reality and computer-aided virtual environment</title>
		<author>
			<persName><forename type="first">M</forename><surname>Fabrika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Valent</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Scheer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Environmental modelling &amp; software</title>
		<imprint>
			<biblScope unit="volume">100</biblScope>
			<biblScope unit="page" from="11" to="23" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Leveraging data-driven and procedural methods for generating high-fidelity visualizations of real forests</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Badr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Hsiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rundel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>De Amicis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Environmental Modelling &amp; Software</title>
		<imprint>
			<biblScope unit="volume">172</biblScope>
			<biblScope unit="page">105899</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Forest digital twin: A new tool for forest management practices based on spatio-temporal data, 3d simulation engine, and intelligent interactive environment</title>
		<author>
			<persName><forename type="first">H</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Hu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers and Electronics in Agriculture</title>
		<imprint>
			<biblScope unit="volume">215</biblScope>
			<biblScope unit="page">108416</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Real-time detection of strawberry ripeness using augmented reality and deep learning</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Chai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-L</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>O'sullivan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Sensors</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page">7639</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Hiervl: Learning hierarchical video-language embeddings</title>
		<author>
			<persName><forename type="first">K</forename><surname>Ashutosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girdhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Torresani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Grauman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="23066" to="23078" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Hierarchical open-vocabulary universal image segmentation</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kallidromitis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kozuka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Darrell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Hitea: Hierarchical temporal-aware videolanguage pre-training</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Qian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="15405" to="15416" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Learning transferable visual models from natural language supervision</title>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="8748" to="8763" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">MobileCLIP: Fast image-text models through multi-modal reinforced training</title>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K A</forename><surname>Vasu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Pouransari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Faghri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vemulapalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Tuzel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="15963" to="15974" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Learning video representations from large language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Misra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Krähenbühl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Girdhar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="6586" to="6597" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">InternVideo2: Scaling foundation models for multimodal video understanding</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Pei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Computer Vision</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2025">2025</date>
			<biblScope unit="page" from="396" to="416" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">SceneHGN: Hierarchical graph networks for 3D indoor scene generation with fine-grained geometry</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-M</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-K</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">J</forename><surname>Guibas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="8902" to="8919" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Learning 3D semantic scene graphs with instance embeddings</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wald</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Navab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tombari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Computer Vision</title>
		<imprint>
			<biblScope unit="volume">130</biblScope>
			<biblScope unit="page" from="630" to="651" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">xLSTM: Extended long short-term memory</title>
		<author>
			<persName><forename type="first">M</forename><surname>Beck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Pöppel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Spanring</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Auer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Prudnikova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kopp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Klambauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Brandstetter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hochreiter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Tung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">O</forename><surname>Ahmed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hajimirsadegh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2410.01201</idno>
		<title level="m">Were RNNs all we needed?</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">LAION-5B: An open large-scale dataset for training next generation image-text models</title>
		<author>
			<persName><forename type="first">C</forename><surname>Schuhmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Beaumont</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vencu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gordon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Wightman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cherti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Coombes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Katta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Mullis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wortsman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="25278" to="25294" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Improving CLIP training with language rewrites</title>
		<author>
			<persName><forename type="first">L</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Krishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Isola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Katabi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tian</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips</title>
		<author>
			<persName><forename type="first">A</forename><surname>Miech</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhukov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-B</forename><surname>Alayrac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tapaswi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Laptev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sivic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF international conference on computer vision</title>
				<meeting>the IEEE/CVF international conference on computer vision</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="2630" to="2640" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<title level="m" type="main">KeyBERT: Minimal keyword extraction with BERT</title>
		<author>
			<persName><forename type="first">M</forename><surname>Grootendorst</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.4461265</idno>
		<ptr target="https://doi.org/10.5281/zenodo.4461265" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="12888" to="12900" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Frozen in time: A joint video and image encoder for end-to-end retrieval</title>
		<author>
			<persName><forename type="first">M</forename><surname>Bain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nagrani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varol</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zisserman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF international conference on computer vision</title>
				<meeting>the IEEE/CVF international conference on computer vision</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1728" to="1738" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Multi-modal transformer for video retrieval</title>
		<author>
			<persName><forename type="first">V</forename><surname>Gabeur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Alahari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schmid</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computer Vision-ECCV 2020: 16th European Conference</title>
				<meeting><address><addrLine>Glasgow, UK</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2020">August 23-28, 2020</date>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="214" to="229" />
		</imprint>
	</monogr>
	<note>Proceedings, Part IV</note>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Deep specialized network for illuminant estimation</title>
		<author>
			<persName><forename type="first">W</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C</forename><surname>Loy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computer Vision-ECCV 2016: 14th European Conference</title>
				<meeting><address><addrLine>Amsterdam, The Netherlands</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">October 11-14, 2016</date>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="371" to="387" />
		</imprint>
	</monogr>
	<note>Proceedings, Part IV</note>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Rethinking temporal fusion for video-based person re-identification on semantic and time aspect</title>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-S</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="11133" to="11140" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">Temporal aggregation with clip-level attention for video-based person re-identification</title>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</title>
				<meeting>the IEEE/CVF Winter Conference on Applications of Computer Vision</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3376" to="3384" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
