<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Event Recommendations through the Lens of Vision and Language Foundation Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Haya Halimeh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florian Freese</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oliver Müller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Paderborn University</institution>
          ,
          <addr-line>Paderborn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Wuppertal</institution>
          ,
          <addr-line>Wuppertal</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
<p>Recommender systems now span the entire customer journey. Amid the multitude of diversified experiences, immersing in cultural events has become a key aspect of tourism. Cultural events, however, suffer from fleeting lifecycles, evade exact replication, and invariably lie in the future. In addition, their low standardization makes harnessing historical data regarding event content or past patron evaluations intricate. The distinctive traits of events thereby compound the challenge of the cold-start dilemma in event recommenders. Content-based recommendations stand as a viable avenue to alleviate this issue, functioning even in scenarios where item-user information is scarce. Still, the effectiveness of content-based recommendations often hinges on the quality of the data representation they build upon. In this study, we explore an array of cutting-edge uni- and multimodal vision and language foundation models (VL-FMs) for this purpose. Next, we derive content-based recommendations through a straightforward clustering approach that groups akin events together, and evaluate the efficacy of the models through a series of online user experiments across three dimensions: similarity-based evaluation, comparison-based evaluation, and clustering assignment evaluation. Our experiments generated four major findings. First, we found that all VL-FMs consistently outperformed a naive baseline of recommending randomly drawn events. Second, unimodal text-based embeddings were surprisingly on par with or in some cases even superior to multimodal embeddings. Third, multimodal embeddings yielded arguably more fine-grained and diverse clusters in comparison to their unimodal counterparts. Finally, we could confirm that cross-event interest is indeed reliant on the perceived similarity of events, resonating with the notion of similarity in content-based recommendations. All in all, we believe that leveraging the potential of contemporary FMs for content-based event recommendations would help address the cold-start problem and propel this field of research forward in new and exciting ways.</p>
      </abstract>
      <kwd-group>
<kwd>content-based event recommendation</kwd>
        <kwd>clustering-based recommendation</kwd>
        <kwd>cold-start problem</kwd>
        <kwd>vision and language foundation models</kwd>
        <kwd>domain adaptation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the last years, the application of recommender systems in the tourism industry has
expanded to cover more and more facets of the overall customer experience, ranging from
recommending trips [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] over hotels [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to restaurants [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Among these experiences, participating
in local cultural events is a touristic activity that is gaining importance. By connecting travelers
with residents and local communities, these events enable cultural exchange and understanding.
      </p>
      <p>
        Yet, the task of recommending events poses, arguably, greater challenges compared to
recommending standardized mass services like transport or accommodation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Events, by definition,
have short life cycles, are never repeated in exactly the same way, and are always happening in
the future [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These unique characteristics make it especially challenging to digitally represent
events in their full breadth and depth. Unlike hotel rooms or flights, for example, we are lacking
standard descriptors for representing events. In addition, due to their time-limited nature and
low standardization, it is difficult to leverage historical data about the content of an event or
past customer ratings as a basis for recommender systems. After all, an event recommendation
can essentially only be validated after the event has taken place, yet it must be created before it happens [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Moreover, compared to global e-commerce or streaming platforms, travel and cultural
platforms typically hold information about only very few purchases or ratings of their customers,
making collaborative filtering strategies difficult to implement, as these platforms are often
limited to specific services, geographies, or genres, and customers switch between many
platforms. In sum, the unique attributes of events amplify the well-known cold-start problem of
recommender systems.
      </p>
      <p>
        In scenarios characterized by limited or constrained data about item-user relationships (e.g.,
purchases, ratings), the system’s ability to proficiently represent the content of these items
is vital. Yet, as described above, the effective modeling of event contents hinges on encoding
them into semantically meaningful and expressive representations. Recent advances in Deep
Representation Learning (DRL), notably through foundation models (FMs) pretrained on massive
amounts of broad data and adaptable to a wide range of specific tasks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], offer a promising
avenue to achieve this objective. In fact, since the groundbreaking introduction of BERT in
2018 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], there has been a remarkable upsurge in the widespread adoption of large-scale FMs
in various contexts. Owing to their capability to automatically extract rich representations
from raw data, the consensus within the AI community now gravitates towards embracing FMs
as the fundamental framework for training machine learning models on downstream tasks,
moving away from the conventional practice of building models from scratch. Large-scale FMs
were initially introduced for natural language processing (NLP) and later extended to include
applications in computer vision (CV). More recently, the growing prominence of FMs within
these two fields has led to increased research attention towards amalgamating both modalities.
Multimodal FMs emerged as a natural outcome of this trend, specifically vision and language
models (VL-FMs), which can handle both text and visual data simultaneously.
      </p>
      <p>
Considering that the intangible nature of events calls for multimodal content descriptions,
it seems promising to evaluate the capacity of VL-FMs for integrating multimedia data into
content-based event recommendations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Previous research in this avenue has predominantly
focused on designing modality-specific features based on event textual content [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref5">5, 10, 11, 12</xref>
        ]. As
a result, there remains an unaddressed gap in incorporating multimodal content and leveraging
images as supplementary signals within the representation learning process.
      </p>
      <p>In light of the above, we set out to explore the potential of harnessing both uni- and multimodal
VL-FMs to learn informative representations of cultural events. The resultant event embeddings
formed, in turn, the foundational cornerstone for content-based recommendations. We derived
these content-based recommendations through a simple clustering approach that groups the
events into semantically related clusters.</p>
      <p>We conducted our computational experiments on an event dataset sourced from the
event-based platform Meetup.com1. The dataset comprises 10,658 distinct cultural events from the
ten largest cities in the US, each accompanied by its respective descriptions and corresponding
images. To evaluate the usefulness of different VL-FMs for generating content-based
recommendations on this dataset, we conducted a series of online user experiments, in which we
presented users with recommendations based on different VL-FMs and asked them to
evaluate these recommendations with regards to three dimensions: similarity-based evaluation,
comparison-based evaluation, and clustering assignment evaluation.</p>
      <p>Our experiments generated four major findings. First, we found that all VL-FMs consistently
outperformed a naive baseline of recommending randomly drawn events. Second, surprisingly,
we found that unimodal text-based embeddings matched or in some cases even outperformed
multimodal embeddings in terms of perceived similarity. Third, multimodal embeddings yielded
arguably more fine-grained and diverse clusters in comparison to their unimodal counterparts.
Finally, we could confirm that cross-event interest is reliant on the perceived similarity of events.</p>
      <p>Our contributions can be summarized as follows: First, to the best of our knowledge, we
present a novel exploratory method for grasping the impact of FMs on event recommendation
systems. Second, to tackle the cold-start problem, we provide a straightforward
clustering-based approach aimed at automatically identifying semantically relevant events based on their
multimedia content. Third, we conduct a sequence of user experiments to confirm the results
across three different dimensions. In doing so, we assess the efficacy of modern unimodal and
multimodal FM techniques for cultural events representation.</p>
<p>The remainder of the paper is organized as follows: The next section offers an overview of
related work on using embeddings for content-based event recommendations. The subsequent
section provides the necessary technical background. Following that,
section four delves into the approaches for generating uni- and multimodal embeddings and
introduces representative FMs for each one. In section five, we outline the general approach,
including the clustering framework, data and user experiments. Section six elaborates on and
discusses the results, while section seven wraps up the study by addressing its limitations,
implications, and providing a perspective on potential future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Events in the Recommender Landscape</title>
      <p>
        For decades, there has been a growing interest in incorporating richer information beyond
numerical ratings to improve recommendations. Over time, the field has evolved significantly
with the development of various frameworks and methodologies [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. Acquiring a semantic
representation of the recommended items is a pivotal aspect in the functionality of most of
these frameworks, whether they are content-based or collaborative in nature.
      </p>
      <p>
        Several works considered content-based event recommendations. Authors in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] used
event-related discussions to estimate the future popularity of events, while authors in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] sought
to enrich cultural event metadata with linked open data, which enabled them to add semantic
knowledge structures to their recommendation methods. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], topic modeling techniques
and Gibbs sampling were used to generate topic distributions based on the content of the
events, which were then mapped to user features. In [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], the authors investigated event recommendation
within the framework of the Douban network. They introduced a model that considers semantics
and context by making use of content information analysis and social relations. Other authors
introduced a hybrid approach that combines content-based and collaborative methods to provide
recommendations for academic events to users [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Some studies tackled the cold-start problem in event-based social networks (EBSNs). Authors
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], for example, suggest exploiting contextual signals, such as social, location-based, and
temporal signals, to enhance the recommendation quality. For the content-based signals, they use
basic TF-IDF analysis on the event descriptions. Authors in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] developed a collective Bayesian
Poisson factorization model that integrates location, organizer, user relationships, and event
textual content to infer content topics and mitigate the cold-start problem in local event recommendations.
In [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], information about event venue, event popularity, temporal influence and geographical
distance are used to create group event recommendations.
      </p>
      <p>While these studies collectively exploit content-based signals, they frequently rely on user
features or contextual input and overlook scenarios where such information might be unavailable.
Furthermore, most of these works typically start from scratch, without exploring the possibilities
offered by large-scale state-of-the-art FMs for efficiently encoding event data. They also overlook
images as valuable signals in the recommendation process.</p>
      <p>Motivated by these factors, we shift our focus towards modeling event data by leveraging
accessible and advanced uni- and multimodal VL-FMs. By doing so, we (i) intend to experiment
with integrating image information into event embeddings and (ii) probe into the potential of
these models for enhancing content-based event recommendations against cold-start scenarios.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Deep Representation Learning</title>
      <p>
        The key to enabling an intelligent system to comprehend the world around us lies in its ability to
identify and separate the fundamental explanatory factors that are hidden within the low-level
sensory data it observes. Representation Learning (RL) is a subarea of machine learning (ML)
that aims to accomplish precisely that. At its core, RL learns a set of meaningful features from a
given collection of data, making it simpler to derive valuable information when performing
different ML tasks [
        <xref ref-type="bibr" rid="ref18 ref19 ref20">18, 19, 20</xref>
        ]. Deep RL seeks to achieve this by relying on sophisticated neural
architectures such as FMs.
      </p>
      <p>
        More recently, the growing popularity of unimodal FMs has sparked increased research
interest in integrating multiple modalities. Indeed, at the most basic level, it is
theoretically insufficient to imitate the whole range of human perception and understanding
through only one modality [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. For instance, describing a concept that is grounded in visual
representation – such as shape constancy [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] – solely through non-visual means can be difficult.
      </p>
      <p>
        The overarching goal for multimodal FMs thus becomes to capture the joint distribution
of multiple modalities and to learn a shared representation space that reduces the semantic
heterogeneity gap between the modalities while preserving the modality-specific semantics
[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. As mentioned in the introduction, a representative of this class of models are VL-FMs.
      </p>
      <p>
        Pre-training VL-FMs typically involves three main steps: (i) encoding images and text into
latent representations, (ii) designing a high-performing architecture for modeling interactions
between modalities, and (iii) devising effective pre-training tasks [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>The main architectural distinction lies in the interaction step, and there are several strategies
for generating multimodal embeddings. However, the literature lacks agreement on the optimal
design. For cultural event data, we find it intriguing to experiment with a single-stream fusion
encoder approach, a dual encoder approach, and a two-step approach.</p>
      <p>
        In the single-stream fusion encoder approach, the text embeddings and image features are
concatenated together before feeding them into a transformer-based encoder to model the
vision and language (VL) interaction [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ].
      </p>
      <p>
        In the dual encoder approach, the text and image modalities are encoded separately through
two single-modal encoders. Next, the embeddings are projected through a shallow interaction
module to the same semantic space to compute VL similarity scores [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>Lastly, in the two-step approach, one modality is translated into the other, and the multimodal
embeddings are generated by applying a unimodal FM to the combined resulting set.</p>
      <p>
        As is widely recognized, pre-trained FM models amass broad knowledge through extensive
pre-training on numerous source tasks with abundant labeled and unlabeled data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Research,
however, suggests that domain adaptation could enhance the model’s performance, even in
cases where the source and target domains closely align [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Putting this into perspective, in
our experiments, we prioritize adapting the pre-trained models to the target domain of event
data, before proceeding to the recommendation task.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Vision and Language Foundation Models</title>
      <p>We experiment with a separate unimodal FM for each individual modality, i.e., a language model
and a vision model. For the multimodal VL-FMs, we evaluate the three approaches introduced
in section 3 using corresponding VL-FMs. Throughout this study, we denote
embeddings generated from text only as text-based embeddings and those generated solely
from vision as vision-based embeddings. An overview of all approaches is provided
in Figure 1. The forthcoming subsections delve deeper into the architectures, pre-training
paradigms, and adaptation processes of the models for each approach.</p>
      <sec id="sec-4-1">
        <title>4.1. Text-Based Embeddings</title>
        <p>
          To create unimodal embeddings that rely exclusively on text but are adjusted to suit the event
data domain, we employ the methodology described in [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. We opt for this method, as it
was recognized by the authors as the most effective approach in their investigation of domain
adaptation for dense retrieval. We outline the steps visually in Figure 2.
        </p>
        <p>
          The method combines two domain adaptation strategies, namely, Transformer-based
Denoising AutoEncoder (TSDAE) [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] and Generative Pseudo Labeling (GPL) [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. TSDAE is a
denoising autoencoder-based architecture that focuses on reconstructing the original input
sentences from corrupted versions of themselves without accessing all contextualized word embeddings.
GPL, on the other hand, combines a query generator with pseudo-labeling. The approach involves
creating queries for unlabeled sentences in the target domain, followed by pairing them with
sentence passages. Subsequently, the resulting (query, passage) pairs are pseudo-labeled using a
cross-encoder. The model is then trained on these synthetically generated labels, utilizing the
MarginMSE loss as introduced in [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ].
        </p>
        <p>[Figure 1: Overview of all approaches. Event descriptions and event images are processed by text and vision transformer encoders to produce text-based and vision-based embeddings; multimodal embeddings are obtained either by concatenating text embeddings and image features, by passing them through a multimodal transformer encoder, or by prepending captions from a vision captioning model to the combined textual descriptions.]</p>
        <p>Our implementation relies heavily on the publicly available GitHub repositories2,3 of the two
methods and largely adopts the same training arguments.</p>
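        <p>To make the adaptation pipeline concrete, the following sketch illustrates the TSDAE step with the sentence-transformers library; the base checkpoint, the corpus variable, and the hyper-parameters shown are illustrative assumptions rather than our exact configuration.</p>
        <preformat>
import nltk
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

nltk.download("punkt")  # required by TSDAE's noise (word deletion) function

# Illustrative base checkpoint; the actual run used the repository defaults
word_embedding = models.Transformer("bert-base-uncased")
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding, pooling])

# event_texts: list of raw event descriptions (hypothetical variable)
train_data = datasets.DenoisingAutoEncoderDataset(event_texts)
loader = DataLoader(train_data, batch_size=8, shuffle=True, drop_last=True)
loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path="bert-base-uncased", tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("tsdae-event-adapted")
# GPL training then continues from this checkpoint via the gpl repository
        </preformat>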
        <sec id="sec-4-1-1">
          <title>2Last visited 10.08.2023: https://github.com/UKPLab/gpl 3Last visited 10.08.2023: https://github.com/UKPLab/sentence-transformers/</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Vision-Based Embeddings</title>
        <p>
          Until recently, CV applications were dominated by convolutional neural networks (CNNs), while
transformer architectures were completely absent from the field. In 2020, this architectural
gap was bridged with the introduction of ViT, the first Vision Transformer model, by the
authors in [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. ViT operates by partitioning an image into smaller patches and arranging
them linearly as sequence elements, akin to tokens in NLP. It then encodes the patches through
linear projection and incorporates positional embeddings. The resultant set is further processed
through a series of transformer blocks with a supervised training objective, commonly applied
for image classification tasks.
        </p>
        <p>
          In 2022, the Masked Autoencoder (MAE) for CV was proposed [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]. Following a standard
self-supervised pre-training paradigm, MAE is designed to reconstruct partially corrupted
patches in input images using a ViT-initialized encoder and a lightweight decoder.
        </p>
        <p>
The unsupervised nature of autoencoders renders MAE a viable tool for pre-training-based
domain adaptation on in-domain data. As such, we utilized the Huggingface library4 and
continued training the (facebook/vitmae-base) model for a further 30 epochs with the same
default hyper-parameters as set in the original implementation. Finally, during inference, we
extract, akin to the authors in [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], the image features from the penultimate layer of the model.
        </p>
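        <p>At inference time, penultimate-layer features can be extracted along the following lines; this is a minimal sketch assuming the Huggingface transformers API, with masking disabled and mean pooling over patch tokens as illustrative choices (the checkpoint path is hypothetical).</p>
        <preformat>
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTMAEModel

# Load the domain-adapted checkpoint (path is hypothetical);
# mask_ratio=0.0 keeps all patches visible at inference
processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
model = ViTMAEModel.from_pretrained("./vitmae-event-adapted", mask_ratio=0.0)
model.eval()

image = Image.open("event_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[-2] is the penultimate encoder layer;
# mean-pool over the patch tokens to obtain one vector per image
features = outputs.hidden_states[-2].mean(dim=1)
        </preformat>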
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Single-Stream Fusion Encoder Approach</title>
        <p>
          As a representative for the single-stream architecture, we selected the Vision-and-Language
Transformer (ViLT) [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. ViLT is a self-supervised single-stream VL architecture that deviates
from other VL-FMs through its minimal convolution-free pipeline. In particular, instead of
relying on heavy convolutional networks (like Faster R-CNN [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]) ViLT employs ViT to encode
pixel-level inputs. ViT is also utilized to initialize the interaction transformer in the model.
        </p>
        <p>
The image and text embeddings are concatenated into a single sequence that undergoes
multiple interaction block layer updates up until the final contextualized sequence. The pooled
multimodal representation is then obtained by a linear projection upon the first index of this
sequence. The model is pre-trained with image-text matching (ITM) and masked language
modeling (MLM) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] objectives and targets cross-modal and multimodal vision-and-language
tasks.
        </p>
        <p>Since ViLT is pre-trained on smaller public paired datasets, we opted for domain-adapting a
base ViLT model on the event data, following the same methodology as in 4.2. For the
implementation, we leveraged the publicly available ViLT repository5 and trained (vilt_200k_mlm_itm.ckpt)
with the default settings for a further 3 epochs.</p>
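        <p>A minimal sketch of extracting a pooled multimodal embedding via the Huggingface transformers ViLT implementation; the public checkpoint name and example inputs stand in for our domain-adapted weights and data.</p>
        <preformat>
import torch
from PIL import Image
from transformers import ViltProcessor, ViltModel

# Public checkpoint as a stand-in for the domain-adapted model
processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")
model.eval()

image = Image.open("event_image.jpg").convert("RGB")
text = "Weekly jazz jam session at the community hall"  # illustrative description

inputs = processor(image, text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)

# Pooled projection over the first index of the contextualized sequence
multimodal_embedding = outputs.pooler_output
        </preformat>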
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Dual Encoder Approach</title>
        <p>
          Contrastive Language-Image Pre-training (CLIP) [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] is a state-of-the-art deep learning model
designed for vision-and-language understanding using a form of supervised contrastive learning.
        </p>
        <p>
          CLIP undergoes pre-training on an immense private image-and-text dataset, comprising 400
million (image, text) pairs and sourced from publicly available web data.
4Last visited 10.08.2023: https://github.com/huggingface/transformers
5Last visited 10.08.2023: https://github.com/dandelin/vilt
In line with other VL-FMs in this class [
          <xref ref-type="bibr" rid="ref35 ref36">35, 36</xref>
          ], CLIP employs two standard transformer-based encoders to embed
image-and-text pairs individually. By training the encoders to pull together associated
text-image pairs while pushing apart mismatched ones, the model is compelled to learn a joint vector
space that adeptly captures the intricate connections between the images and their associated
texts. CLIP’s representation learning capabilities were demonstrated through a standard linear
probing protocol6.
        </p>
        <p>Taking the above into account, there appears to be no need for any further training. Instead,
we apply the CLIP model7 directly to the event texts and images to obtain their respective
embeddings. As per the authors’ approach, we extract the features before the linear projection
layer, and fuse them into a unified vector representation through simple concatenation.</p>
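        <p>The following sketch mirrors this procedure with the Huggingface CLIP implementation: pre-projection pooled features are taken from the two encoder towers and concatenated; the checkpoint name and example inputs are illustrative assumptions.</p>
        <preformat>
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("event_image.jpg").convert("RGB")
text = "Open-air salsa dancing class in the park"  # illustrative description

inputs = processor(text=[text], images=image, return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    # Pooled outputs of the unimodal towers, before the linear projection layers
    text_features = model.text_model(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    ).pooler_output
    image_features = model.vision_model(
        pixel_values=inputs["pixel_values"]
    ).pooler_output

# Fuse by simple concatenation into one multimodal vector
multimodal_embedding = torch.cat([text_features, image_features], dim=-1)
        </preformat>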
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Two-Step Approach</title>
        <p>Up until now, we have covered two VL-FMs, each with a different architecture. However,
another viable approach is to convert one modality into the other. Once both modalities are
represented in the same modality form, multimodal representations can be obtained by applying
a unimodal FM to the combined information. By adopting such a strategy, one can benefit
from the strengths of unimodal models in handling singular modalities, while still achieving
multimodal VL understanding.</p>
        <p>
          Converting an item from one modality into another would require creating an adequate
representation of the same item in the other modality. For this purpose, VL generation techniques,
such as visual captioning (VC) or text-to-image synthesis [
          <xref ref-type="bibr" rid="ref37 ref38">37, 38</xref>
          ], could be used.
        </p>
        <p>
          Here, we employ VC to automatically generate textual captions for the event images. In
particular, we implement the generative image-to-text transformer (GiT) proposed in [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ].
Generative models typically contain complex architectures. In contrast, GiT adopts a
straightforward encoder-decoder structure and the common language modeling (LM) loss while still
maintaining state-of-the-art performance.
        </p>
        <p>The multimodal representations are then obtained by applying the domain-adapted
language model discussed in subsection 4.1 to the combined set of event captions and event
descriptions.</p>
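        <p>A sketch of the two-step pipeline under these assumptions: a public GiT checkpoint generates the caption, which is prepended to the event description before encoding with the domain-adapted sentence encoder from subsection 4.1 (variable names and checkpoint paths are hypothetical).</p>
        <preformat>
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM
from sentence_transformers import SentenceTransformer

# Step 1: caption the event image with a public GiT checkpoint
processor = AutoProcessor.from_pretrained("microsoft/git-base-coco")
caption_model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-coco")

image = Image.open("event_image.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    generated_ids = caption_model.generate(pixel_values=pixel_values, max_length=30)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Step 2: prepend the caption to the description and encode both with the
# domain-adapted text model from subsection 4.1 (path is hypothetical)
text_model = SentenceTransformer("tsdae-gpl-event-adapted")
combined = caption + ". " + event_description  # event_description: raw event text
multimodal_embedding = text_model.encode(combined)
        </preformat>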
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Approach</title>
      <sec id="sec-5-1">
        <title>5.1. Clustering-Based Framework</title>
        <p>As already discussed, content-based recommendations can mitigate the cold-start problem in
event recommenders by delivering suggestions solely grounded in event content. The idea
behind clustering-based recommendations on the other hand builds upon similarity within
clusters to provide similar suggestions. Events within the same cluster are assumed to have shared
characteristics, making it likely that users showing interest in a specific event will tend to also
show interest in others from the same cluster. Alternatively, cluster-based recommendations
can serve to diversify suggestions by ensuring that events from various clusters are suggested,
thereby expanding the spectrum of choices available to users and enhancing overall exposure
to different events.
6For detailed experiments and results, please refer to the supplementary material provided by the authors.
7Last visited 10.08.2023: https://github.com/huggingface/transformers</p>
        <p>This study is primarily concerned with exploring whether contemporary FMs are suitable
candidates for event content representations. As such, our emphasis lies in evaluating a select
set of unimodal and multimodal VL-FMs for event representation within a simple clustering
framework designed for content-based recommendations.</p>
        <p>
          Embeddings generated by large-scale FMs are typically high-dimensional. However, as the
number of dimensions in the data increases, the distances from a random data point to its nearest
and farthest neighbors converge. In spaces with high dimensionality,
the notion of spatial locality therefore loses its clarity [
          <xref ref-type="bibr" rid="ref40 ref41">40, 41</xref>
          ]. This complexity adds difficulty
to the clustering task and renders applying an appropriate dimension reduction technique
necessary [
          <xref ref-type="bibr" rid="ref42 ref43">42, 43</xref>
          ].
        </p>
        <p>
          One of the available techniques is Uniform Manifold Approximation and Projection (UMAP)
[
          <xref ref-type="bibr" rid="ref44">44</xref>
          ]. Over the last few years, UMAP has gained recognition as a promising alternative to PCA
[
          <xref ref-type="bibr" rid="ref45">45</xref>
          ] and t-SNE [
          <xref ref-type="bibr" rid="ref46">46</xref>
          ] for its ability to balance the preservation of the local and global structure of
high-dimensional data when projected none-linearly onto lower dimensions. Recent research
even shows that applying UMAP improves the performance of several clustering algorithms,
both in terms of accuracy and computation time [
          <xref ref-type="bibr" rid="ref47">47</xref>
          ].
        </p>
        <p>
          We proceeded by clustering the embeddings with Hierarchical Density-Based Spatial
Clustering (HDBSCAN) [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ]. HDBSCAN is a density-based clustering technique known for its ability
to identify clusters of arbitrary densities, its resilience to outliers and noisy data, and its minimal
reliance on prior knowledge or data assumptions.
        </p>
        <p>
          In the last step, we draw on approaches proposed by [
          <xref ref-type="bibr" rid="ref49 ref50">49, 50</xref>
          ] and transfer the same setup to
the multimodal case to pull representative exemplars for each cluster. In detail, we harness the
Maximal Marginal Relevance (MMR) method [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ] to select exemplars that are most relevant to
each representative embedding while still being sufficiently dissimilar to each other.
        </p>
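        <p>For illustration, a minimal MMR implementation under the assumption of L2-normalized embeddings, trading off relevance to the cluster centroid against redundancy among already selected exemplars; the 0.7 trade-off weight is an assumption.</p>
        <preformat>
import numpy as np

def mmr_exemplars(centroid, candidates, k=5, lam=0.7):
    """Pick k exemplar indices balancing relevance and diversity (sketch).

    centroid: (d,) cluster representative; candidates: (n, d) member embeddings.
    Both are assumed L2-normalized so dot products act as cosine similarity.
    """
    relevance = candidates @ centroid          # similarity to the cluster centroid
    redundancy = candidates @ candidates.T     # pairwise similarities
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) &lt; k:
        if not selected:
            best = remaining[int(np.argmax(relevance[remaining]))]
        else:
            scores = [lam * relevance[i]
                      - (1 - lam) * redundancy[i, selected].max()
                      for i in remaining]
            best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected
        </preformat>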
        <p>It is well established that the eficacy of clustering algorithms is contingent upon the inherent
structure of the data. By the same token, we argue that the quality of representations can
be assessed by evaluating the quality of the resultant clusters. That is, provided that the
experimental setting and clustering procedure remain consistent across all models. Figure 3
illustrates the clustering framework visually.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Data Acquisition and Preprocessing</title>
        <p>
We follow in the footsteps of previous research on event recommendations [
          <xref ref-type="bibr" rid="ref11 ref5 ref52">5, 11, 52</xref>
          ] and create
the dataset using events from the platform Meetup.com. Meetup.com is an online event-based social
networking platform (EBSN) that facilitates online and face-to-face meetings, known as Meetups.
The platform is built around a web-based structure of groups where individuals with similar
interests can collaborate, plan, create, comment, share, and promote cultural events.
        </p>
        <p>We used the Meetup GraphQL API in November 2022 to perform a comprehensive crawl of
all publicly available activity on the platform from the ten largest cities located in the USA8,
namely, New York, Los Angeles, Chicago, Houston, Phoenix, Philadelphia, San Antonio, San
Diego, Dallas, and San Jose, California.</p>
        <sec id="sec-5-2-1">
          <title>8Last visited 25.04.2023: https://worldpopulationreview.com/us-cities</title>
          <p>[Figure 3: The clustering-based framework. Preprocessed event data is embedded depending on the modality (text only, vision only, or multimodal via the single-stream, dual encoder, or two-step approaches), reduced in dimensionality using UMAP, clustered, and condensed into a unified representation of events for each cluster using MMR, feeding content-based recommendations and the empirical evaluation.]</p>
          <p>The resulting collection served as the training dataset,
on which we trained all FM models. In April 2023, we performed another crawl to create a
second independent set for testing purposes. This way, we emulate real-world conditions by
delivering recommendations for events that are scheduled to occur in the future.</p>
          <p>We selected these particular cities for the analysis based on two primary reasons: Firstly, as
the largest cities in the United States, they are considered to be among the most popular and
vibrant locations, which often corresponds with a high level of event activity. Secondly, these
cities are spread across different states, which provides a level of cultural diversity9. In total,
the collection comprises 13,685 distinct events, with 10,658 in the training set and 3,027 in
the testing set. For each event, we obtained (i) the event image and (ii) the event description
and its title. The events do not have true labels; thus, our setting is unsupervised.</p>
          <p>
            Prior to the analysis, we carried out several standard NLP preprocessing steps such as
excluding non-English texts and removing special characters, HTML tags, and hashtags. Additionally,
we applied regex [
            <xref ref-type="bibr" rid="ref53">53</xref>
            ] filtering to exclude any irrelevant information such as Zoom invitations,
email addresses, phone numbers, and URLs from the text data. Further preprocessing steps are
handled by the respective models’ tokenizers.
          </p>
          <p>Regarding the images, we applied data augmentation including resizing, cropping, and
normalization techniques for the vision-based and VL-FMs separately, as outlined in their
original research papers.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>9A similar approach was applied in [5].</title>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Quantitative Empirical Evaluation</title>
        <p>
          One significant limitation of clustering research is the lack of validation methods for unlabeled
data. Many studies in this area of research attempt to solve the clustering problem on datasets
with known true labels. However, in real-life situations, having prior knowledge of the ground
truth is often the exception rather than the norm. There are few external evaluation methods
available that can be applied in such settings, like the silhouette coefficient [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ] and the
density-based clustering validation metric (DBCV) [
          <xref ref-type="bibr" rid="ref55">55</xref>
          ]. While these metrics may offer valid alternatives
for measuring clustering performance, they have their own limitations and are consequently
not applicable in many practical situations. The silhouette coefficient, for example, assumes
that clusters are convex and well-separated, which is not always the case in density-based
clustering. On the other hand, although developed for density-based clustering, DBCV suffers
from increasing run times as the number of data points increases.
        </p>
        <p>These reasons motivated us to validate the results against human perception
through a series of online user experiments. We leveraged the crowdsourcing service Prolific10
to recruit 300 human evaluators, all of whom passed an attention check. We presented users
with content-based recommendations founded on the VL-FMs introduced in this study and
prompted them to evaluate these recommendations along three dimensions: similarity-based
evaluation, comparison-based evaluation, and clustering assignment evaluation.</p>
        <p>All in all, the recommendations were based on one of the following types: (i) Text-based
embeddings. (ii) Vision-based embeddings. (iii) VL-based embeddings obtained through the
single-stream fusion encoder approach. (iv) VL-based embeddings obtained through the dual encoder
approach. (v) VL-based embeddings obtained through the two-step approach. As a baseline, we
match and recommend events randomly.</p>
        <p>In the first set of experiments, participants were presented with two events displayed side
by side. For each event, participants viewed the corresponding image, title, and description.
Both events were sampled randomly from the same cluster and had the same embedding type.
The events were randomly placed on either the left or right side and were not presented in any
particular order.</p>
        <p>As previously mentioned, clustering-based recommendations rely on the assumption of
similarities among events within a cluster and consider other events within the same cluster as
relevant. We attempt to examine this assumption by modeling participants’ perceived similarity
and cross-event interest between two such events. We refer to Figure 5 in the Appendix for an
illustrated example.</p>
        <p>In the second set, participants were shown one event and two clusters, with one cluster
being the correct cluster to which the event belonged. The events and clusters were randomly
sampled and based on the same embedding type. We created the cluster visualizations by
drawing representative exemplars from each cluster (as explained in subsection 5.1), and then
consolidating them into a single visual representation. To evaluate the quality of the representations,
we gauge the agreement between the clustering assignments and the participants’ perceived
memberships by asking them to indicate the cluster to which they believed the event belonged.
We refer to Figure 8 in the Appendix for an illustrated example.</p>
        <p>In the last set, we asked the respondents to choose the form of modality (text, image, or both)
that they believe had the most significant impact on their judgment and to motivate their
choice. Here, we attempted to model the participants’ preferences and perceptions regarding
the influence of different modalities on their decision-making process.</p>
        <p>Furthermore, to ensure that the participants’ attention is not diverted to or biased by secondary
information such as the varying length of descriptions, we constrained the length of the
description for the sampled events to one standard deviation around the mean length of all
descriptions. We reconstructed the events by parsing the raw HTML data and converting it
into a structured HTML document. Before this process, we took measures to protect privacy
by removing sensitive personal data and anonymizing images containing individuals using
Gaussian blurring. The remaining settings, including image resolution and font size, were left
unaltered. In the last step, we integrated the reconstructed events into windows of equal sizes
and blended them into the experiment’s user interface.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and Discussion</title>
      <p>In this section, we detail the findings of the study. We commence by delivering quantitative
measures and further support these findings with a concise qualitative analysis.</p>
      <sec id="sec-6-1">
        <title>6.1. Similarity-Based Evaluation</title>
        <p>For the similarity-based evaluation (corresponding to the first set of experiments), we carried
out a series of simple linear regressions in which we regressed the variable of interest against
the type of embeddings used. Table 1 displays the regression results of the perceived similarity
degree across the reference groups: random, vision-based, and text-based, respectively.</p>
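        <p>These regressions can be reproduced along the following lines with statsmodels; the data frame and column names are illustrative assumptions.</p>
        <preformat>
import statsmodels.formula.api as smf

# df: one row per participant rating, with a numeric 'similarity' score and a
# categorical 'embedding' column (hypothetical names); the reference group is
# rotated across Random, Vision-Based, and Text-Based as in Table 1
model = smf.ols(
    "similarity ~ C(embedding, Treatment(reference='Random'))", data=df
).fit()
print(model.summary())
        </preformat>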
        <p>First and foremost, the results indicate that recommendations based on event representations,
as encoded by the FMs, performed on average significantly better than a naive random baseline.
Notably, the two-step approach emerged as the best-performing among all models.</p>
        <p>The results also suggest, to our surprise, that text-based embeddings led to recommendations
that were either comparable or in some cases even superior in the perceived similarity degree
to their multimodal counterparts. This suggests that text-based clusters satisfy
the assumption of semantic similarity and demonstrate comparable traits.</p>
        <p>Additionally, when pitted against vision-based embeddings, all models once again achieved
superior performance with the two-step approach surpassing all others. Comparing the random
baseline with vision-based embeddings, however, yielded no statistically significant difference. This implies that there
was minimal distinction in the perceived similarity between recommendations based solely on
visual similarities and a randomly selected sample.</p>
        <p>To explore this in more detail, we looked into cases where the similarities between events
were rated high by the study participants and compared them to those rated low11.</p>
        <p>11In all examples, we retrieved similar images under Creative Commons licenses due to copyright reasons from
https://commons.wikimedia.org (last visited on 10.09.2023)</p>
        <p>
          Vision-based clusters can be interpreted along two dimensions: content and context. The
content dimension reflects the entities that are visible in the images, and it is the most intuitive
dimension. In contrast, the context dimension reflects the circumstances in which the content
occurs [
          <xref ref-type="bibr" rid="ref56">56</xref>
          ]. We notice homogeneous clusters on the content dimension, yet recommendations
derived from them may not be reliable as the context is mostly unclear.
        </p>
        <p>Further, the instances where participants rated the events as highly similar were relatively
limited. These cases featured events that seem to have images matching across the content
and the context dimensions. As an example of such a case, we provide Figure 6, in which the
event images of a sample recommendation are shown. Both images depict a book and both
events were about literary gatherings. On the other hand, instances rated as dissimilar by the
participants appeared to feature events with images that matched only in terms of their content.
Figure 6 in the Appendix visually displays such an example. While both images share similar
attributes, there is an absence of contextual similarity between both events. To elaborate, the
event depicted on the left pertains to travel and tourist excursions, whereas the event on the
right is centered around meditation journeys. These observations could therefore
reaffirm the notion that vision models encode images primarily based on content, lacking the
incorporation of sufficient semantic signals within them.</p>
        <p>Furthermore, and perhaps most importantly, we conducted a regression analysis to examine
the relationship between cross-event interest as the dependent and embedding type as the
independent variable. The rationale behind this is that the perceived similarity may not always
convey a complete picture. For instance, two events with the same image color are likely to be
perceived as similar, but the two events may still be completely different.</p>
        <p>Table 2 demonstrates that cross interest is significantly higher for the text-based and two-step
approaches, thereby confirming that instances with higher perceived similarity indeed correspond
to higher cross-interest levels.</p>
        <p>We also found that all multimodal embeddings resulted in clusters that are more varied and
detailed in comparison to those achieved by their unimodal counterparts, all the while
maintaining perceived similarity and cross-event interest comparable to the text-based ones. This
observation could hint that the VL-FMs provided richer and more ample event representations,
as Figure 7 in the Appendix shows.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Clustering Assignment Evaluation</title>
        <p>In the second set of experiments, we attempted to assess the clustering quality by measuring
the agreement degree between human judgment and clustering assignment.</p>
        <p>As can be seen in Table 5 in the Appendix, the agreement metric appears to vary across the
different embedding types. In particular, recommendations founded on embeddings created
by the VL two-step approach showed the highest agreement frequencies. To test for statistical
significance, we opted for a simple logistic regression, modeling the variable agreement as
the dependent variable and the type of embedding as the independent variable. The two-step
approach was set as the reference level.</p>
        <p>The coefficient estimates in Table 3 suggest that clusters formed using the two-step approach
tend to exhibit notably higher agreement probabilities in comparison to other VL and text-based
approaches, although not when compared to the vision-based approach.</p>
        <p>
          This observation aligns with the principles of Gestalt [
          <xref ref-type="bibr" rid="ref57 ref58">57, 58</xref>
          ], and particularly resonates with
the concept of similarity, which refers to our capacity to perceive identical visual elements as
cohesive units. Items sharing resemblances in terms of shapes and colors, for instance, are likely
perceived to fall within the same category.
        </p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Comparison-Based Evaluation</title>
        <p>A chi-square test of independence revealed that, among participants, the form of modality that
they believed had the most significant impact on their judgment and the embedding type were
significantly associated, χ²(10) = 29.36, p &lt; 0.001. Post hoc comparisons in Table 4 for each
pair, with FDR correction applied, revealed that participants who were shown random events
were more likely to answer based on event descriptions, whereas no statistically significant
differences were observed among the other cases.</p>
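        <p>A sketch of this analysis with scipy and statsmodels; the contingency counts shown are placeholder values purely to demonstrate the mechanics, not our data.</p>
        <preformat>
import numpy as np
from itertools import combinations
from scipy.stats import chi2_contingency
from statsmodels.stats.multitest import multipletests

# Illustrative 6x3 table: embedding types x chosen modality (text, image, both)
observed = np.array([[40, 10, 50], [20, 30, 50], [25, 25, 50],
                     [22, 28, 50], [24, 26, 50], [23, 27, 50]])

chi2, p, dof, _ = chi2_contingency(observed)  # omnibus test, dof = (6-1)*(3-1) = 10

# Post hoc: pairwise tests between embedding types with FDR correction
pairs = list(combinations(range(observed.shape[0]), 2))
pvals = [chi2_contingency(observed[[i, j]])[1] for i, j in pairs]
reject, p_adjusted, _, _ = multipletests(pvals, method="fdr_bh")
        </preformat>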
        <p>The results thereby suggest that both the textual and visual stimuli were perceived as
equally important regardless of the embedding type.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This work has showcased the potential of harnessing contemporary uni- and multimodal
FMs as a means for event data representation. We contend that (i) these models present
themselves as effective tools for appropriately capturing event content, and that (ii) content-based
recommendations grounded in these models can effectively address the cold-start problem
in event recommendations.</p>
      <p>We started off by building a collection of events from the event-based platform Meetup.
Subsequently, we transformed these into informative vector representations using a selected
set of FMs. The resultant representations were arranged into clusters, bringing together events
with common characteristics. The clusters served in turn as the cornerstone for generating
content-based recommendations. We then executed a series of user experiments to ascertain the
quality of these recommendations with regards to three dimensions: similarity-based evaluation,
comparison-based evaluation, and clustering assignment evaluation. The findings of this study
can be summarized as follows: (i) Recommendations generated through the use of FMs outperformed
a naive baseline that randomly drew events, with the VL two-step approach emerging as the
top performer. (ii) Surprisingly, text-based embeddings performed on par with or even surpassed
multimodal VL embeddings in some cases. (iii) All multimodal VL-FMs yielded more diverse
clusters, each with different scopes, while still demonstrating perceived similarity comparable
to the text-based clusters. (iv) The agreement between cluster assignments and human judgment
revealed that the two-step approach was statistically superior to all other models except for
vision-based embeddings, aligning with the principles of Gestalt theory. (v) The results indeed
verified that cross interest is associated with perceived similarity, confirming that similar
content-based recommendations are more likely to be regarded as relevant.</p>
      <p>Nonetheless, the empirical results reported herein should be considered in light of some
limitations. First, the baseline is limited to a naive random approach. Additional standard
content-based extraction methods could be considered to substantiate the superior performance
of FM models. Second, the dataset was sourced from Meetup in November 2022
and April 2023, limiting the scope of our data to the types of events commonly promoted
on the platform during these months. In order to offer a more thorough representation of the
broader spectrum of events, further experiments spanning extended time frames are imperative.</p>
      <p>In conclusion, our experiments indicate that leveraging the capabilities of modern FMs for
content-based event recommendations could effectively address the cold-start problem. It
would be interesting to test these models in a real-world recommendation setting with
post-recommendation human feedback. We view our study as a step in this direction and hope that
our results can support the development and application of future technologies in this field.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Appendix</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wörndl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hefele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Herzog</surname>
          </string-name>
          ,
          <article-title>Recommending a sequence of interesting places for tourist trips</article-title>
          ,
          <source>Information Technology &amp; Tourism</source>
          <volume>17</volume>
          (
          <year>2017</year>
          )
          <fpage>31</fpage>
          -
          <lpage>54</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Partalas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Morvan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sadeghian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Minaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Cowan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Hotel2vec: Learning hotel embeddings from user click sessions with side information (</article-title>
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Joint representation learning for top-n recommendation with heterogeneous information sources</article-title>
          ,
          <source>in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1449</fpage>
          -
          <lpage>1458</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>Y.</given-names> <surname>Du</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Meng</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
          <article-title>CVTM: A content-venue-aware topic model for group event recommendation</article-title>,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>32</volume> (<year>2019</year>) <fpage>1290</fpage>-<lpage>1303</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Macedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. B.</given-names>
            <surname>Marinho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <article-title>Context-aware event recommendation in event-based social networks</article-title>
          ,
          <source>in: Proceedings of the 9th ACM Conference on Recommender Systems</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A collective bayesian poisson factorization model for cold-start local event recommendation</article-title>
          ,
          <source>in: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1455</fpage>
          -
          <lpage>1464</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>R.</given-names> <surname>Bommasani</surname></string-name>,
          <string-name><given-names>D. A.</given-names> <surname>Hudson</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Adeli</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Altman</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Arora</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>von Arx</surname></string-name>,
          <string-name><given-names>M. S.</given-names> <surname>Bernstein</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Bohg</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Bosselut</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Brunskill</surname></string-name>, et al.,
          <article-title>On the opportunities and risks of foundation models</article-title>
          (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>,
          <string-name><given-names>M.-W.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>,
          <source>in: Proceedings of NAACL-HLT</source>,
          <year>2019</year>, pp. <fpage>4171</fpage>-<lpage>4186</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Yao</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Sun</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Tay</surname></string-name>,
          <article-title>Deep learning based recommender system: A survey and new perspectives</article-title>,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>52</volume> (<year>2019</year>) <fpage>1</fpage>-<lpage>38</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Madisetty</surname>
          </string-name>
          ,
          <article-title>Event recommendation using social media</article-title>
          ,
          <source>in: 2019 IEEE 35th International Conference on Data Engineering (ICDE)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>2106</fpage>
          -
          <lpage>2110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Trinh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>An effective content-based event recommendation model</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>80</volume>
          (
          <year>2021</year>
          )
          <fpage>16599</fpage>
          -
          <lpage>16618</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Minkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Charrow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ledlie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Teller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Jaakkola</surname>
          </string-name>
          ,
          <article-title>Collaborative future event recommendation</article-title>
          ,
          <source>in: Proceedings of the 19th ACM international conference on Information and knowledge management</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>819</fpage>
          -
          <lpage>828</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>U.</given-names>
            <surname>Javed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shaukat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. A.</given-names>
            <surname>Hameed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Iqbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>A review of content-based and context-based recommendation systems</article-title>,
          <source>International Journal of Emerging Technologies in Learning (iJET)</source>
          <volume>16</volume>
          (
          <year>2021</year>
          )
          <fpage>274</fpage>
          -
          <lpage>306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>M. M.</given-names> <surname>Afsar</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Crump</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Far</surname></string-name>,
          <article-title>Reinforcement learning based recommender systems: A survey</article-title>,
          <source>ACM Computing Surveys</source>
          <volume>55</volume> (<year>2022</year>) <fpage>1</fpage>-<lpage>38</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><given-names>T.</given-names> <surname>De Pessemier</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Coppens</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Mannens</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Dooms</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Martens</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Geebelen</surname></string-name>,
          <article-title>An event distribution platform for recommending cultural activities</article-title>,
          <source>in: 7th International Conference on Web Information Systems and Technologies (WEBIST-2011)</source>,
          Ghent University, Department of Information technology,
          <year>2011</year>, pp. <fpage>231</fpage>-<lpage>236</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>M.</given-names> <surname>Xu</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>,
          <article-title>Semantic-enhanced and context-aware hybrid collaborative filtering for event recommendation in event-based social networks</article-title>,
          <source>IEEE Access</source>
          <volume>7</volume> (<year>2019</year>) <fpage>17493</fpage>-<lpage>17502</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jhamb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <article-title>A dual-perspective latent factor model for group-aware social event recommendation</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>53</volume>
          (
          <year>2017</year>
          )
          <fpage>559</fpage>
          -
          <lpage>576</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <article-title>Representation learning: A review and new perspectives</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>35</volume>
          (
          <year>2013</year>
          )
          <fpage>1798</fpage>
          -
          <lpage>1828</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Mohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Brilley</surname>
          </string-name>
          ,
          <string-name>
            <surname>N. K. AA</surname>
          </string-name>
          ,
          <article-title>Multimodal representation learning: Cross-modality and shared representation</article-title>
          ,
          <source>in: 2022 International Conference on Industry 4.0 Technology (I4Tech)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><given-names>G. E.</given-names> <surname>Hinton</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Zemel</surname></string-name>,
          <article-title>Autoencoders, minimum description length and helmholtz free energy</article-title>,
          <source>Advances in neural information processing systems</source>
          <volume>6</volume> (<year>1993</year>).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Multimodal intelligence: Representation learning, information fusion, and applications</article-title>
          ,
          <source>IEEE Journal of Selected Topics in Signal Processing</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>478</fpage>
          -
          <lpage>493</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>P.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <article-title>Margaret Thatcher: A new illusion</article-title>
          ,
          <source>Perception</source>
          <volume>9</volume>
          (
          <year>1980</year>
          )
          <fpage>483</fpage>
          -
          <lpage>484</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Deep multimodal representation learning: A survey</article-title>
          ,
          <source>IEEE Access</source>
          <volume>7</volume>
          (
          <year>2019</year>
          )
          <fpage>63373</fpage>
          -
          <lpage>63394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>A survey of vision-language pre-trained models</article-title>
          ,
          <source>arXiv preprint arXiv:2202.10936</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Clifton</surname>
          </string-name>
          ,
          <article-title>Multimodal learning with transformers: A survey</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gururangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swayamdipta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>,
          <article-title>Don't stop pretraining: Adapt language models to domains and tasks</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8342</fpage>
          -
          <lpage>8360</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name><given-names>K.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Thakur</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Reimers</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Gurevych</surname></string-name>,
          <article-title>GPL: Generative pseudo labeling for unsupervised domain adaptation of dense retrieval</article-title>,
          <source>in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>,
          <year>2022</year>, pp. <fpage>2345</fpage>-<lpage>2360</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name><given-names>K.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Reimers</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Gurevych</surname></string-name>,
          <article-title>TSDAE: Using transformer-based sequential denoising auto-encoder for unsupervised sentence embedding learning</article-title>,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP 2021</source>,
          <year>2021</year>, pp. <fpage>671</fpage>-<lpage>688</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hofstätter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Althammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schröder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sertkan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <article-title>Improving efficient neural ranking models with cross-architecture knowledge distillation</article-title>,
          <source>arXiv preprint arXiv:2010.02666</source>
          (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Heigold</surname></string-name>,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2010.11929</source>
          (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dollár</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Masked autoencoders are scalable vision learners</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>16000</fpage>
          -
          <lpage>16009</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Goh</surname></string-name>,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name><given-names>W.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Son</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Kim</surname></string-name>,
          <article-title>ViLT: Vision-and-language transformer without convolution or region supervision</article-title>,
          <source>in: International Conference on Machine Learning, PMLR</source>,
          <year>2021</year>, pp. <fpage>5583</fpage>-<lpage>5594</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name>,
          <article-title>Fast R-CNN</article-title>,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>,
          <year>2015</year>, pp. <fpage>1440</fpage>-<lpage>1448</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Parekh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Duerig</surname>
          </string-name>
          ,
          <article-title>Scaling up visual and vision-language representation learning with noisy text supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4904</fpage>
          -
          <lpage>4916</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name><given-names>K.-H.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Hua</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Hu</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>He</surname></string-name>
          ,
          <article-title>Stacked cross attention for image-text matching</article-title>
          ,
          <source>in: Proceedings of the European conference on computer vision (ECCV)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>216</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>S.</given-names>
            <surname>Uppal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhagat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hazarika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Poria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zadeh</surname>
          </string-name>
          ,
          <article-title>Multimodal research in vision and language: A review of current and emerging trends</article-title>
          ,
          <source>Information Fusion</source>
          <volume>77</volume>
          (
          <year>2022</year>
          )
          <fpage>149</fpage>
          -
          <lpage>171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name><given-names>W.</given-names> <surname>Chai</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Wang</surname></string-name>,
          <article-title>Deep vision multimodal learning: Methodology, benchmark, and trend</article-title>,
          <source>Applied Sciences</source>
          <volume>12</volume> (<year>2022</year>) <fpage>6588</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Git: A generative image-to-text transformer for vision and language</article-title>
          ,
          <source>arXiv preprint arXiv:2205.14100</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name><given-names>K.</given-names> <surname>Beyer</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Goldstein</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Ramakrishnan</surname></string-name>,
          <string-name><given-names>U.</given-names> <surname>Shaft</surname></string-name>,
          <article-title>When is “nearest neighbor” meaningful?</article-title>,
          <source>in: Database Theory-ICDT'99: 7th International Conference, Jerusalem, Israel, January 10-12, 1999, Proceedings 7</source>,
          Springer, <year>1999</year>, pp. <fpage>217</fpage>-<lpage>235</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hinneburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Keim</surname>
          </string-name>
          ,
          <article-title>On the surprising behavior of distance metrics in high dimensional space</article-title>
          ,
          <source>in: Database Theory-ICDT 2001: 8th International Conference, London, UK, January 4-6, 2001, Proceedings 8</source>,
          Springer, <year>2001</year>, pp. <fpage>420</fpage>-<lpage>434</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>M.</given-names>
            <surname>Steinbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ertöz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>The challenges of clustering high dimensional data</article-title>,
          <source>New directions in statistical physics: econophysics, bioinformatics, and pattern recognition</source>
          (<year>2004</year>) <fpage>273</fpage>-<lpage>309</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pandove</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rani</surname>
          </string-name>
          ,
          <article-title>Systematic review of clustering high-dimensional and large datasets</article-title>,
          <source>ACM Transactions on Knowledge Discovery from Data (TKDD)</source>
          <volume>12</volume> (<year>2018</year>) <fpage>1</fpage>-<lpage>68</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>L.</given-names>
            <surname>McInnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Melville</surname>
          </string-name>
          ,
          <article-title>UMAP: Uniform manifold approximation and projection for dimension reduction</article-title>,
          <source>arXiv preprint arXiv:1802.03426</source>
          (<year>2018</year>).
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>H.</given-names>
            <surname>Abdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>Principal component analysis</article-title>
          ,
          <source>Wiley interdisciplinary reviews: computational statistics</source>
          <volume>2</volume>
          (
          <year>2010</year>
          )
          <fpage>433</fpage>
          -
          <lpage>459</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>L.</given-names>
            <surname>Van der Maaten</surname>
          </string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Hinton</surname></string-name>,
          <article-title>Visualizing data using t-SNE</article-title>,
          <source>Journal of machine learning research</source>
          <volume>9</volume>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>M.</given-names>
            <surname>Allaoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Kherfi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cheriet</surname>
          </string-name>
          ,
          <article-title>Considerably improving clustering algorithms using umap dimensionality reduction technique: a comparative study</article-title>
          ,
          <source>in: Image and Signal Processing: 9th International Conference, ICISP 2020, Marrakesh, Morocco, June 4-6, 2020, Proceedings 9</source>,
          Springer, <year>2020</year>
          , pp.
          <fpage>317</fpage>
          -
          <lpage>325</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name><given-names>L.</given-names> <surname>McInnes</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Healy</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Astels</surname></string-name>,
          <article-title>hdbscan: Hierarchical density based clustering</article-title>,
          <source>J. Open Source Softw.</source>
          <volume>2</volume> (<year>2017</year>) <fpage>205</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grootendorst</surname>
          </string-name>
          ,
          <article-title>BERTopic: Neural topic modeling with a class-based TF-IDF procedure</article-title>
          ,
          <source>arXiv preprint arXiv:2203.05794</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grootendorst</surname>
          </string-name>
          , Concept, https://github.com/MaartenGr/Concept,
          <year>2022</year>
          . GitHub repository.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Broder</surname>
          </string-name>
          ,
          <article-title>On the resemblance and containment of documents</article-title>
          ,
          <source>in: Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171)</source>
          , IEEE,
          <year>1997</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McPherson</surname>
          </string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Han</surname></string-name>,
          <article-title>Event-based social networks: linking the online and offline social worlds</article-title>
          ,
          <source>in: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>1032</fpage>
          -
          <lpage>1040</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Aho</surname>
          </string-name>
          ,
          <article-title>Algorithms for finding patterns in strings</article-title>,
          <source>Handbook of theoretical computer science (vol. A): algorithms and complexity</source>,
          <year>1991</year>.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          ,
          <article-title>Silhouettes: a graphical aid to the interpretation and validation of cluster analysis</article-title>
          ,
          <source>Journal of computational and applied mathematics</source>
          <volume>20</volume>
          (
          <year>1987</year>
          )
          <fpage>53</fpage>
          -
          <lpage>65</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>D.</given-names>
            <surname>Moulavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Jaskowiak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Campello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          ,
          <article-title>Density-based clustering validation</article-title>
          ,
          <source>in: Proceedings of the 2014 SIAM international conference on data mining, SIAM</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>839</fpage>
          -
          <lpage>847</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>J.</given-names>
            <surname>Klostermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Plumeyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Böger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <article-title>Extracting brand information from social networks: Integrating image, text, and social tagging data</article-title>
          ,
          <source>International Journal of Research in Marketing</source>
          <volume>35</volume>
          (
          <year>2018</year>
          )
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name><given-names>K.</given-names> <surname>Koffka</surname></string-name>,
          <source>Principles of Gestalt psychology</source>, volume <volume>44</volume>,
          <publisher-name>Routledge</publisher-name>, <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wertheimer</surname>
          </string-name>
          ,
          <article-title>Laws of organization in perceptual forms</article-title>
          (<year>1938</year>).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>