<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>of 3D generated content through eXtended Reality</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorenzo Stacchio</string-name>
          <email>lorenzo.stacchio2@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia Scorolli</string-name>
          <email>claudia.scorolli@unibo.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gustavo Marfia</string-name>
          <email>gustavo.marfia@unibo.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bologna, Department for Life Quality Studies</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bologna, Department of Philosophy and Communication Studies</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Bologna, Department of the Arts</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>The Metaverse era is rapidly shaping novel and effective tools that are particularly useful in the entertainment and creative industry. A fundamental role is played by modern generative deep learning models, which can be used to provide varied and high-quality multimedia content, considerably lowering costs while increasing production efficiency. The goodness of such models is usually evaluated quantitatively with established metrics on data, and by humans using simple constructs such as the Mean Opinion Score. However, these scales and scores do not take into account the aesthetic and emotional components, which could play a role in positively controlling the automatic generation of multimedia content while at the same time introducing novel forms of human-in-the-loop in generative deep learning. Furthermore, considering data such as 3D models/scenes and 360° panorama images and videos, conventional display hardware may not be the most effective means for human evaluation. A first solution to such a problem could consist of employing eXtended Reality paradigms and devices. Considering all such aspects, we here discuss a recent contribution that adopted a well-known scale to evaluate the aesthetic and emotional experience of watching a 360° video of a musical concert in Virtual Reality (VR) compared to a classical 2D webstream, showing that adopting a fully immersive VR experience could be a possible path to follow.</p>
      </abstract>
      <kwd-group>
        <kwd>Generative Artificial Intelligence</kwd>
        <kwd>eXtended Reality</kwd>
        <kwd>aesthetic evaluation</kwd>
        <kwd>human-in-the-loop</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the advancements in technologies such as eXtended Reality (XR), Artificial Intelligence
(AI), Cloud Computing (CC), and Digital Twins (DT), the Metaverse is becoming an upcoming
reality [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. From industrial, healthcare, and research applications to entertainment, tourism,
and gaming, Metaverse-related technologies are shaping new tools to improve the way we
interact with both the digital and physical worlds [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Several works studied how AI-aided
paradigms could be applied and integrated into virtual worlds for the aforementioned fields, with
a particular focus on entertainment and creative industry [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7 ref8 ref9">3, 4, 5, 1, 6, 7, 2, 8, 9</xref>
        ]. Considering the
latter, a plethora of scientific contributions aimed at designing and developing Machine Learning
(ML), and especially Deep Learning (DL), models to improve user experiences in XR environments.
Chatbots, Virtual Agents, 2D/3D visual restoration, and generative models are just some examples of
tools that we can nowadays employ in XR experiences [
        <xref ref-type="bibr" rid="ref1 ref10 ref2 ref5 ref6 ref7 ref8">1, 5, 6, 7, 2, 8, 10</xref>
        ].
      </p>
      <p>
        In particular, Generative Content Model (GCM) architectures, such as Generative Pre-trained
Transformers (GPT), Generative Adversarial Networks (GAN), Variational Autoencoders (VAEs),
Neural Radiance Fields (NeRFs), and Diffusion Models (DM), are crucial to providing varied and
high-quality multimedia content, including images, videos, music, and 3D assets, considerably
lowering costs while increasing production efficiency. Those DL architectures were successfully
trained to perform tasks such as Text-to-Image (e.g., Imagen, Stable Diffusion, DALL·E),
Text-to-Video (e.g., Gen-1, VideoFusion), 2D-to-3D (e.g., NeRFs), and Text-to-Panorama
(e.g., DiffCollage) on several datasets, providing novel tools that could be used to easily create
customized and personalized content [
        <xref ref-type="bibr" rid="ref11 ref12 ref13 ref14 ref15 ref16 ref17 ref18 ref19 ref20 ref21 ref22 ref23">11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23</xref>
        ]. Recently,
multi-modal diffusion models were also employed to build joint audio-video generation
frameworks that create engaging watching and listening experiences simultaneously, with
high-quality realistic videos [
        <xref ref-type="bibr" rid="ref22 ref24">22, 24</xref>
        ].
      </p>
      <p>
        By using such technologies, it is possible to envision entire entertainment products created
with automatic or nearly automatic approaches. 3D environments generated with basic
descriptions, digital art paintings, and fashion 3D garments are just examples of virtual elements that
could be automatically generated for the benefit of the entertainment and creative industry in
virtual experiences. Given these capabilities, it is also important to consider how such models
are evaluated and whether the existing evaluation can fulfill human needs in the target scenarios (i.e.,
the entertainment and creative industry in this study). The goodness of such models is usually
evaluated quantitatively with metrics such as Learned Perceptual Image Patch Similarity (LPIPS),
Fréchet Inception Distance (FID), Structural Similarity (SSIM), Kullback–Leibler Divergence (KL),
Earth Mover’s Distance (EMD), Chamfer (pseudo)-distance (CD), and Minimum Matching Distance
(MMD) [25, 26, 27, 28, 29, 30, 31]. Considering instead human evaluation of such models, the
majority of those contributions adopt the Mean Opinion Score (MOS), which corresponds to a
5-point Likert question, or the classical Turing Test delivered in controlled labs or with services
such as Amazon Mechanical Turk [
        <xref ref-type="bibr" rid="ref11 ref12 ref15 ref16 ref17 ref18 ref19 ref20 ref22">32, 11, 15, 12, 16, 17, 18, 19, 20, 22</xref>
        ].
      </p>
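      <p>As a minimal illustration of the two kinds of scores mentioned above, the following sketch computes a symmetric Chamfer (pseudo)-distance between two small point clouds and a Mean Opinion Score from 5-point ratings. It assumes one common convention for CD (the mean of squared nearest-neighbour distances, summed over both directions); the function names and the toy data are hypothetical and are not taken from the cited works.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _nearest_sq_dist(p: Point, cloud: Sequence[Point]) -> float:
    """Squared Euclidean distance from p to its nearest neighbour in cloud."""
    return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in cloud)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer (pseudo)-distance between two point clouds.

    Assumed convention: average squared nearest-neighbour distance from a to b
    plus the same quantity from b to a. Other works use unsquared distances.
    """
    if not a or not b:
        raise ValueError("both point clouds must be non-empty")
    a_to_b = sum(_nearest_sq_dist(p, b) for p in a) / len(a)
    b_to_a = sum(_nearest_sq_dist(q, a) for q in b) / len(b)
    return a_to_b + b_to_a


def mean_opinion_score(ratings: Sequence[int]) -> float:
    """Average of 5-point Likert ratings (1 = worst, 5 = best)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    return sum(ratings) / len(ratings)


if __name__ == "__main__":
    # Hypothetical toy data: a generated cloud and a reference cloud.
    generated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0)]
    print("CD :", chamfer_distance(generated, reference))
    print("MOS:", mean_opinion_score([4, 5, 3, 4]))
```
]]></preformat>
      <p>Both quantities reduce a sample to a single number, which is convenient for ranking models but carries no information about the aesthetic or emotional response of the viewer.</p>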
      <p>To the best of our knowledge, there is in fact a lack of work establishing standardized scales
and metrics that allow humans to evaluate such generated content based on different and
more complex factors that go beyond the simple MOS and Turing Test, such as the aesthetic and
emotional ones [33, 34]. As reported in [33], emotions and aesthetics bear high-level semantics
that could be bound to low-level computable visual features to create novel and reliable
inference systems that positively steer the perception of digital multimedia material, even if
this is challenging considering human subjectivity. Nevertheless, those constructs could be put to good
use to positively control the design and automatic generation of multimedia content while
at the same time introducing novel forms of human-in-the-loop in generative deep learning,
particularly related to (i) creative fields like art, music, and literature, where human artists
collaborate with generative models to maintain artistic vision and control to induce certain
moods or emotions; and (ii) healthcare applications, where customized positive emotional effects can
be used for mental health treatment, rehabilitation, and stress reduction [35, 33, 36]. Furthermore,
when extending our perspective to encompass digital content that goes beyond traditional 2D
images, videos, and audio, such as 3D models/scenes and 360° panorama images and videos,
it becomes clear that conventional display hardware may not be the most effective means for
human evaluation, as noted in previous research studies [37, 36, 38]. A first solution to such a
problem could consist of employing XR paradigms and devices to find better and more effective
ways to present such visual/auditory stimuli to humans and ease their evaluation. A side effect
of adopting XR paradigms for this purpose could be the improvement of the role of humans
in driving generative models to specific outcomes, from a human-in-the-loop perspective [39].
The overall framework is visually depicted in Figure 1.</p>
      <p>Considering all the mentioned aspects, we will first discuss whether XR paradigms could be
employed to evaluate multimedia content, particularly referring to 3D models and scenes, and
360° panorama images/videos. Then, we will report the main insights from a recent related
work [40] that adopted a well-known scale to evaluate the aesthetic and emotional experience
of watching a 360° video of a musical concert in VR compared to living the real experience and
enjoying it from classical 2D displays. This contribution introduced an evaluation framework
that could be deployed to evaluate generative deep learning from aesthetic and emotional
perspectives.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Evaluating multimedia content from an Aesthetic and Emotional perspective with eXtended Reality</title>
      <p>
        XR integrates the digital and the physical to various degrees, e.g., augmented reality (AR), mixed
reality (MR), and virtual reality (VR). Different degrees of virtuality can be exploited to furnish
multimedia digital assets to be enjoyed and evaluated by humans [41]. VR is a technology that
permits the immersion of the user within a digital layer of information, making them part of
the simulated scenario and completely replacing the physical world through devices such as HMDs
(Head Mounted Displays) and CAVEs (Cave Automatic Virtual Environments) [42, 43, 38]. On
the other hand, AR overlays digital information onto the physical world using screen-based
interfaces (e.g., mobile devices) or dedicated glasses (e.g., Magic Leap) [44]. Finally, MR merges
the physical and the virtual world, providing user-interactable digital content responsive to the
surrounding physical environment [
        <xref ref-type="bibr" rid="ref1 ref2 ref7">1, 38, 7, 2</xref>
        ].
      </p>
      <p>For the particular aim of evaluating immersive and/or 3D digital multimedia content, MR and
VR represent potential tools to evaluate responses to visual conditions, addressing limitations of
other experimental methodologies [38]. Referring to MR, Spacedesign is one of the first
frameworks introducing a mixed virtual environment to evaluate the aesthetics of 3D models [42],
focusing on free-form curves and surfaces. The authors reported that the introduced system
was tested by experienced industrial designers, who appreciated the 3D visualization and
navigation, real-time editing, and intuitive interaction. In this scenario, designers and stylists
play a significant role in the development process, as the system enables them to effectively steer their vision
from start to finish, resulting in the final product.</p>
      <p>The authors of [45] proposed using VR paradigms to emphasize aesthetic and emotional abilities
in students for 3D design. Through a statistical assessment process, the authors reported that
VR could have a positive impact on creative thinking as well as on students’ academic performance.
This suggests that virtual technologies can highlight these human factors while users interact
with such multimedia data, and thus affect their evaluation. Along similar lines, some works have
shown that VR can be effectively used to evaluate the aesthetic and emotional effect of 3D
scenes or 360° videos [36, 40].</p>
      <p>For example, [46] studied the impact of VR on awe. They considered 360° immersive
videos, examining 42 participants who watched immersive and normal 2D videos displaying
awe or neutral content, rating their level of awe and sense of presence after the experiment.
Results indicated that immersive videos significantly enhanced the self-reported intensity of
awe as well as the sense of presence.</p>
      <p>
        To the best of our knowledge, there is no prior work that considered both aesthetic and
emotional components in evaluating immersive 3D models or 360° videos by adopting XR
paradigms. Considering this, in the following, we discuss one of the most recent works on the
topic [40], which represents a first attempt at comparing aesthetic and emotional constructs
across real and VR experiences, focusing on 360° videos. The introduced framework could lay
the basis for a novel generative DL human evaluation pipeline for synthetic multimedia
data [
        <xref ref-type="bibr" rid="ref13">13, 47</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Evaluating digital content from an Aesthetic and Emotional perspective through eXtended Reality: use case of 360° panorama Tango Concert video in Virtual Reality</title>
      <p>In evaluating digital multimedia content with VR, validated scales could be adopted to
assess the related aesthetic and emotional dimensions. To elaborate on this topic, we
reference the recent study in [40]. In this work, the authors presented initial findings from
a project whose primary objective was to compare the aesthetic and emotional experiences
of a musical concert lived in presence with virtual ones. They also explored the
social dimension, which is not discussed here in order to focus on their contribution to the aesthetic
and emotional dimensions. This multidisciplinary project questions whether virtual
environments can recreate live aesthetic experiences, given the lack of related work,
restricting the analysis to musical ones. This could demonstrate the ability of virtual
environments to provide novel ways and tools not only to enjoy digital assets but also
to compare their enjoyment and evaluation with a realistic situation. To understand
whether and how these aesthetic and emotional experiences are systematically modulated by
differently immersive settings, the authors selected physical and social environments for which the
degree of immersiveness is incrementally increased.</p>
      <p>Consistently, the authors contrasted the classic experience of a live concert (LC, in presence)
with three “remote” conditions, not simultaneous with the event: the experience with
audiovisual musical files (MV, the classic viewing of a concert on YouTube) and two experiences
lived through a less or more sophisticated eye-mounted apparatus for virtual reality (VR), i.e.,
a Google Cardboard (CVR) and an HTC-Vive (HVR). The CVR and the HVR allow, respectively, for
a basic and easily accessible experience vs. a less affordable but more immersive experience.
Both devices permitted a three-dimensional vision: by moving the head or the whole body,
participants could have a 360° view, and therefore an overall vision of the concert venue, including
musicians and audience, together with their possible reactions to a virtuosity or a false note
played by the performers (see Figure 2).</p>
      <p>For this reason, the authors took a theater experience as a use case. From an experimental
perspective, the authors intended to exploit VR technologies as a methodological boost to empirical
aesthetics: virtual environments provide an excellent compromise between ecological validity
and experimental control. Here, the authors compared different devices able to convey an
aesthetic-musical experience, including VR devices with varying degrees of immersiveness, in order to
investigate their ability to engage, more or less powerfully, the participant with respect to that
experience, typically enjoyed in a theater. To investigate human aesthetic and emotional
experience in this way, the authors argued for a rethinking of classic laboratory experiments, using VR
paradigms to overcome traditional and reductive approaches to aesthetics and emotions. Thus,
the authors addressed the aesthetics and emotions evoked in the four conditions (LC, MV, CVR,
HVR) through the administration of a validated questionnaire: the Aesthetic Emotions Scale
[48], structured in 21 subscales covering prototypical aesthetic emotions, epistemic emotions,
and emotions indicative of amusement.</p>
      <p>As mentioned, these experimental perspectives were adopted and applied to a particular use
case: evaluating how Virtual Experiences could be put to good use within a musical aesthetic
experience. Specifically, the authors used VR technologies to increase the interest and engagement
of a population (young students) that is not interested in attending a certain kind of cultural
experience, namely a Tango music concert in a theatre, but that is prone to use Virtual Experience
technologies [49, 50], and compared them with a strong baseline: passionate adults, who
are usually the target audience for this kind of cultural activity [51]. In practice, the authors used LC
participants’ survey scores as a benchmark to compare the magnitude of interest of passionate
adults with that of the youngsters, who instead lived Virtual Experiences (MV, CVR,
HVR). As an additional contribution, this framework could be adopted with any other kind of
aesthetic experience.</p>
      <sec id="sec-4-1">
        <title>3.1. Experimental session</title>
        <p>In the experimental session, the authors tested 70 participants: 10 for the live concert condition, 20
for the music video condition, 20 for the immersive condition with the Google Cardboard, and
20 for the immersive condition with the HTC-Vive. We refer to the original work for further
details [40]. The LC condition was executed at the Teatro Comunale
“Pavarotti-Freni” in Modena, during the concert “Amarcord d’un Tango”. The event took place
outdoors, in the theatre courtyard. At the end of the concert, the authors asked volunteers to
fill in a hard copy of both questionnaires. For the other three virtual conditions, they used a
360° video of the concert, and participants were tested at the Virtual and Augmented Reality
Lab (VARLAB) of the University of Bologna. The Aesthemos questionnaire was furnished in
the form of forty-two 5-point Likert scale items referring to twenty-one emotion subscales [48]:
Feeling of beauty; Fascination; Being moved; Awe; Enchantment; Nostalgia; Joy; Humor; Vitality;
Energy; Relaxation; Surprise; Interest; Intellectual challenge; Insight; Feeling of ugliness; Boredom;
Confusion; Anger; Uneasiness; Sadness, as described in [48, 40].</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Results and discussions</title>
        <p>The data collected by the authors underwent a reliability check to test for internal consistency
and validate the research: they analyzed only those constructs that exhibited a Cronbach’s alpha
index ≥ 0.70 [52]. All the constructs that passed this test were subjected to statistical analyses
to verify any significant differences among the four conditions, namely LC, MV, CVR, and HVR.
The authors performed an Adjusted Wald Confidence Interval test [53], which allows checking whether
the difference between two proportions is significant and how large the difference is. The authors
selected this specific statistical test as they had four conditions and a low number of samples
for each of them [54]. To adapt their data to the specific test, they binarized the Likert-scale
answers with a threshold of 4: Likert-scale answers ≥ 4 were converted to 1, the lower ones to 0.</p>
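        <p>The analysis steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in [40]: Cronbach’s alpha is computed with the standard variance-based formula, the interval for the difference of two proportions follows the “add one success and one failure to each sample” adjustment described in [53], and a 95% level (z = 1.96) is assumed because the confidence level is not stated here. All function names and the example answers are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from statistics import variance
from typing import List, Sequence, Tuple


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha; items[j][i] is the score of participant i on item j."""
    k = len(items)
    if k < 2:
        raise ValueError("a construct needs at least two items")
    n = len(items[0])
    if n < 2 or any(len(col) != n for col in items):
        raise ValueError("items must share the same participants (at least 2)")
    item_var_sum = sum(variance(col) for col in items)
    totals = [sum(col[i] for col in items) for i in range(n)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - item_var_sum / total_var)


def binarize(answers: Sequence[int], threshold: int = 4) -> List[int]:
    """Map Likert answers >= threshold to 1 and the lower ones to 0."""
    return [1 if a >= threshold else 0 for a in answers]


def adjusted_wald_diff_ci(
    x1: int, n1: int, x2: int, n2: int, z: float = 1.96
) -> Tuple[float, float]:
    """Adjusted Wald interval for p1 - p2 (one success and one failure added per sample)."""
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    return diff - z * se, diff + z * se


def is_significant(ci: Tuple[float, float]) -> bool:
    """The difference is significant when the interval excludes zero."""
    low, high = ci
    return low > 0 or high < 0


if __name__ == "__main__":
    # Hypothetical answers to one item in two conditions (not data from [40]).
    condition_a = binarize([5, 4, 4, 5, 4, 3, 5, 4, 4, 5])
    condition_b = binarize([3, 2, 4, 3, 2, 3, 1, 2, 3, 3])
    ci = adjusted_wald_diff_ci(
        sum(condition_a), len(condition_a), sum(condition_b), len(condition_b)
    )
    print("CI:", ci, "significant:", is_significant(ci))
```
]]></preformat>
        <p>The width of the interval also conveys how large the difference between two conditions may be, which is why an interval is more informative here than a bare significance decision.</p>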
        <p>[Table 1. Question items reported: FB-E1, INT-E38, FA-E31.]</p>
        <p>Table 1 reports some of the statistically significant results obtained for the Feeling of Beauty
(FB-x items), Interest (INT-x items), and Fascination (FA-x items) constructs (we refer to [40] for the complete analysis). Each question item
was coded to improve the readability of the table as follows: (FB-E1) “I found it beautiful”;
(INT-E38) “It piqued my interest”; (FA-E31) “I found it sublime”.</p>
        <p>The results highlighted that the live concert (LC) was in general the most effective
condition in activating aesthetic emotions, in particular those related to Feeling of Beauty, when
compared with any of the virtual conditions. However, it can also be observed that, for the other
two constructs, the superiority of the LC condition over the HTC-Vive (HVR) one
was not statistically significant. Moreover, considering the same constructs, HVR was superior
to the Music Video (MV) and Google Cardboard (CVR) conditions. Thus, even if the LC
remains the best way to enjoy a music concert, the difference with the same experience lived
through the HTC-Vive headset is not statistically significant for certain aesthetic and emotional
constructs. This suggests that fully immersive virtual reality is the “artificial experience”
that can offer the spectator an experience closest to the “real-live one”, compared with classical
means such as 2D displays or mobile VR. To the best of our knowledge, this is one of the first
pieces of empirical evidence of a greater interest in musical aesthetic experiences when experienced in VR
(HVR) than on a computer screen. This is also the first contribution that highlights how VR
paradigms could be better than classical means to evaluate digital multimedia data such as 360°
videos.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>
        In this work, we discussed the possibility of measuring aesthetic and emotional constructs to
evaluate generative DL models by exploiting XR paradigms. In fact, nowadays, the goodness
of such models is evaluated from a qualitative human perspective using simple constructs
such as the MOS and the Turing Test [
        <xref ref-type="bibr" rid="ref11 ref12 ref15 ref16 ref17 ref18 ref19 ref20 ref22">11, 15, 12, 16, 17, 18, 19, 20, 22</xref>
        ]. However, aesthetic
and emotional factors could be used to positively improve generative model performance [33],
even if this is challenging considering human subjectivity and the lack of validated scales. Extending
the perspective to encompass digital content that goes beyond traditional images, videos, and
audio, such as 3D models/scenes and 360° panorama images and videos, conventional display
hardware may not be the most effective means for human evaluation. A solution could be
resorting to XR paradigms, which could also provide novel insights from a human-in-the-loop
perspective [37, 36, 38, 39]. To the best of our knowledge, no previous work considered such a
specific research direction. However, the framework introduced in [40] focused on the aesthetic
and emotional evaluation of 360° videos in VR, comparing them with their realistic counterpart
and classical displays, and providing one of the first pieces of empirical evidence that XR paradigms could be
a better means to furnish the mentioned type of multimedia content to judge its quality (the
approach is also extensible to generative deep learning material). In future work, we intend
to explore such an approach with generative models involving 3D models/scenes and 360°
panorama images and videos, and to measure how the aesthetic and emotional perception changes
with respect to classical fruition devices and paradigms.
      </p>
      <p>[25] Y. Rubner, C. Tomasi, L. J. Guibas, The earth mover’s distance as a metric for image retrieval, International journal of computer vision 40 (2000) 99–121.</p>
      <p>[26] Z. Wang, A. C. Bovik, H. R. Sheikh, E. P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE transactions on image processing 13 (2004) 600–612.</p>
      <p>[27] J. Shlens, Notes on kullback-leibler divergence and likelihood, arXiv preprint arXiv:1404.2000 (2014).</p>
      <p>[28] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, Gans trained by a two time-scale update rule converge to a local nash equilibrium, Advances in neural information processing systems 30 (2017).</p>
      <p>[29] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, O. Wang, The unreasonable effectiveness of deep features as a perceptual metric, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.</p>
      <p>[30] P. Achlioptas, O. Diamanti, I. Mitliagkas, L. Guibas, Learning representations and generative models for 3d point clouds, in: International conference on machine learning, PMLR, 2018, pp. 40–49.</p>
      <p>[31] T. Wu, L. Pan, J. Zhang, T. Wang, Z. Liu, D. Lin, Density-aware chamfer distance as a comprehensive metric for point cloud completion, arXiv preprint arXiv:2111.12702 (2021).</p>
      <p>[32] M. G. Keith, L. Tay, P. D. Harms, Systems perspective of amazon mechanical turk for organizational research: Review and recommendations, Frontiers in psychology 8 (2017) 1359.</p>
      <p>[33] D. Joshi, R. Datta, E. Fedorovskaya, Q.-T. Luong, J. Z. Wang, J. Li, J. Luo, Aesthetics and emotions in images, IEEE Signal Processing Magazine 28 (2011) 94–115.</p>
      <p>[34] NVIDIA, Understanding aesthetics in deep learning, https://developer.nvidia.com/blog/understanding-aesthetics-deep-learning/, 2016.</p>
      <p>[35] H. L. O’Brien, E. G. Toms, The development and evaluation of a survey to measure user engagement, Journal of the American Society for Information Science and Technology 61 (2010) 50–69.</p>
      <p>[36] S. Triberti, A. Chirico, G. La Rocca, G. Riva, Developing emotional design: Emotions as cognitive processes and their role in the design of interactive technologies, Frontiers in psychology 8 (2017) 1773.</p>
      <p>[37] A. Mahdavi, H. Eissa, Subjective evaluation of architectural lighting via computationally rendered images, Journal of the illuminating Engineering Society 31 (2002) 11–20.</p>
      <p>[38] A. Bellazzi, L. Bellia, G. Chinazzo, F. Corbisiero, P. D’Agostino, A. Devitofrancesco, F. Fragliasso, M. Ghellere, V. Megale, F. Salamone, Virtual reality for assessing visual quality and lighting perception: A systematic review, Building and Environment 209 (2022) 108674.</p>
      <p>[39] X. Wu, L. Xiao, Y. Sun, J. Zhang, T. Ma, L. He, A survey of human-in-the-loop for machine learning, Future Generation Computer Systems 135 (2022) 364–381.</p>
      <p>[40] C. Scorolli, E. N. Grasso, L. Stacchio, V. Armandi, G. Matteucci, G. Marfia, Would you rather come to a tango concert in theater or in vr? aesthetic emotions &amp; social presence in musical experiences, either live, 2d or 3d, Computers in Human Behavior (2023) 107910.</p>
      <p>[41] P. Milgram, F. Kishino, A taxonomy of mixed reality visual displays, IEICE Transactions on Information and Systems 77 (1994) 1321–1329.</p>
      <p>[42] M. Fiorentino, R. de Amicis, G. Monno, A. Stork, Spacedesign: A mixed reality workspace for aesthetic industrial design, in: Proceedings. International Symposium on Mixed and Augmented Reality, IEEE, 2002, pp. 86–318.</p>
      <p>[43] M. F. Shiratuddin, W. Thabet, D. Bowman, Evaluating the effectiveness of virtual environment displays for reviewing construction 3d models, CONVR 2004 (2004) 87–98.</p>
      <p>[44] M. Leap, Magic Leap 2, The most immersive AR platform for enterprise, https://www.magicleap.com/en-us/magic-leap-2-video, 2023.</p>
      <p>[45] A. Jimeno-Morenilla, J. L. Sánchez-Romero, H. Mora-Mora, R. Coll-Miralles, Using virtual reality for industrial design learning: a methodological proposal, Behaviour &amp; Information Technology 35 (2016) 897–906.</p>
      <p>[46] A. Chirico, P. Cipresso, D. B. Yaden, F. Biassoni, G. Riva, A. Gaggioli, Effectiveness of immersive videos in inducing awe: an experimental study, Scientific reports 7 (2017) 1218.</p>
      <p>[47] Q. Zhang, J. Song, X. Huang, Y. Chen, M.-Y. Liu, Diffcollage: Parallel generation of large content with diffusion models, arXiv preprint arXiv:2303.17076 (2023).</p>
      <p>[48] I. Schindler, G. Hosoya, W. Menninghaus, U. Beermann, V. Wagner, M. Eid, K. R. Scherer, Measuring aesthetic emotions: A review of the literature and a new assessment tool, PloS one 12 (2017) e0178899.</p>
      <p>[49] J. de la Fuente Prieto, P. Lacasa, R. Martínez-Borda, Approaching metaverses: Mixed reality interfaces in youth media platforms, New Techno Humanities 2 (2022) 136–145.</p>
      <p>[50] L. Geng, Y. Li, Y. Xue, Will the interest triggered by virtual reality (vr) turn into intention to travel (vr vs. corporeal)? the moderating effects of customer segmentation, Sustainability 14 (2022) 7010.</p>
      <p>[51] S. Meeks, S. K. Shryock, R. J. Vandenbroucke, Theatre involvement and well-being, age differences, and lessons from long-time subscribers, The Gerontologist 58 (2018) 278–289.</p>
      <p>[52] K. S. Taber, The use of cronbach’s alpha when developing and reporting research instruments in science education, Research in science education 48 (2018) 1273–1296.</p>
      <p>[53] A. Agresti, B. Caffo, Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures, The American Statistician 54 (2000) 280–288.</p>
      <p>[54] J. Sauro, J. R. Lewis, Quantifying the user experience: Practical statistics for user research, Morgan Kaufmann, 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xing</surname>
          </string-name>
          , D. Liu,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <article-title>A survey on metaverse: Fundamentals, security, and privacy</article-title>
          ,
          <source>IEEE Communications Surveys &amp; Tutorials</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Huynh-The</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.-V.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-Q.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.-S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Artificial intelligence for the metaverse: A survey</article-title>
          ,
          <source>Engineering Applications of Artificial Intelligence</source>
          <volume>117</volume>
          (
          <year>2023</year>
          )
          <fpage>105581</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Stacchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perlino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Vagnoni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scorolli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marfia</surname>
          </string-name>
          ,
          <article-title>Who will trust my digital twin? Maybe a clerk in a brick and mortar fashion shop</article-title>
          ,
          <source>in: 2022 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>814</fpage>
          -
          <lpage>815</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Stacchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Angeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marfia</surname>
          </string-name>
          ,
          <article-title>Empowering digital twins with extended reality collaborations</article-title>
          ,
          <source>Virtual Reality &amp; Intelligent Hardware</source>
          <volume>4</volume>
          (
          <year>2022</year>
          )
          <fpage>487</fpage>
          -
          <lpage>505</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Morotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Stacchio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Donatiello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Roccetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tarabelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marfia</surname>
          </string-name>
          ,
          <article-title>Exploiting fashion x-commerce through the empowerment of voice in the fashion virtual reality arena: Integrating voice assistant and virtual reality technologies for fashion communication</article-title>
          ,
          <source>Virtual Reality</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Dyulicheva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Glazieva</surname>
          </string-name>
          ,
          <article-title>Game based learning with artificial intelligence and immersive technologies: an overview</article-title>
          ,
          <source>in: CEUR workshop proceedings</source>
          , volume
          <volume>3077</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.-J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-Y.</given-names>
            <surname>Chien</surname>
          </string-name>
          ,
          <article-title>Definition, roles, and potential research issues of the metaverse in education: An artificial intelligence perspective</article-title>
          ,
          <source>Computers and Education: Artificial Intelligence</source>
          <volume>3</volume>
          (
          <year>2022</year>
          )
          <fpage>100082</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Survey on controlable image synthesis with deep learning</article-title>
          ,
          <source>arXiv preprint arXiv:2307.10275</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Venugopal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. V.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peatchimuthu</surname>
          </string-name>
          ,
          <article-title>The realm of metaverse: A survey</article-title>
          ,
          <source>Computer Animation and Virtual Worlds</source>
          (
          <year>2023</year>
          )
          <fpage>e2150</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lamberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El Saddik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Thawonmas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. G.</given-names>
            <surname>Prattico</surname>
          </string-name>
          ,
          <article-title>Extended meta-uni-omni-verse (XV): Introduction, taxonomy, and state-of-the-art</article-title>
          ,
          <source>IEEE Consumer Electronics Magazine</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>10684</fpage>
          -
          <lpage>10695</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Saharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Whang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghasemipour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gontijo Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Karagol Ayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          , et al.,
          <article-title>Photorealistic text-to-image diffusion models with deep language understanding</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>36479</fpage>
          -
          <lpage>36494</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>NeRF: Neural radiance field in 3D vision, a comprehensive review</article-title>
          ,
          <source>arXiv preprint arXiv:2210.00379</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cascarano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Franchini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Porta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <article-title>On the first-order optimization methods in deep image prior</article-title>
          ,
          <source>Journal of Verification, Validation and Uncertainty Quantification</source>
          <volume>7</volume>
          (
          <year>2022</year>
          )
          <fpage>041002</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>N.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Duinkharjav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Chakravarthula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>FoV-NeRF: Foveated neural radiance fields for virtual reality</article-title>
          ,
          <source>IEEE Transactions on Visualization and Computer Graphics</source>
          <volume>28</volume>
          (
          <year>2022</year>
          )
          <fpage>3854</fpage>
          -
          <lpage>3864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jeong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Uh</surname>
          </string-name>
          ,
          <article-title>Diffusion models already have a semantic latent space</article-title>
          ,
          <source>arXiv preprint arXiv:2210.10960</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>VideoFusion: Decomposed diffusion models for high-quality video generation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>10209</fpage>
          -
          <lpage>10218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Atighehchian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Granskog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Germanidis</surname>
          </string-name>
          ,
          <article-title>Structure and content-guided video synthesis with diffusion models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.03011</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Saharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Whang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gritsenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Poole</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Fleet</surname>
          </string-name>
          , et al.,
          <article-title>Imagen video: High definition video generation with diffusion models</article-title>
          ,
          <source>arXiv preprint arXiv:2210.02303</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gozalo-Brizuela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Garrido-Merchan</surname>
          </string-name>
          ,
          <article-title>ChatGPT is not all you need. A state of the art review of large generative AI models</article-title>
          ,
          <source>arXiv preprint arXiv:2301.04655</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cascarano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Franchini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kobler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Porta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <article-title>Constrained and unconstrained deep image prior optimization models with automatic regularization</article-title>
          ,
          <source>Computational Optimization and Applications</source>
          <volume>84</volume>
          (
          <year>2023</year>
          )
          <fpage>125</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>MM-Diffusion: Learning multi-modal diffusion models for joint audio and video generation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>10219</fpage>
          -
          <lpage>10228</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <article-title>Towards computational architecture of liberty: A comprehensive survey on deep learning for generating virtual architecture in the metaverse</article-title>
          ,
          <source>arXiv preprint arXiv:2305.00510</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>G.</given-names>
            <surname>Yariv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Benaim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Adi</surname>
          </string-name>
          ,
          <article-title>Diverse and aligned audio-to-video generation via text-to-video model adaptation</article-title>
          ,
          <source>arXiv preprint arXiv:2309.16429</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>