<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Multimodal Dataset and Benchmark for Tourism Review Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hiromasa Yamanishi</string-name>
          <email>yamanishi@cvm.t.u-tokyo.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ling Xiao</string-name>
          <email>ling@cvm.t.u-tokyo.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toshihiko Yamasaki</string-name>
          <email>yamasaki@cvm.t.u-tokyo.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Tokyo</institution>
          ,
          <addr-line>Hongo, Bunkyo, Tokyo 113-8656</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper addresses the challenge of generating accurate and contextually relevant tourism reviews, which are essential for assisting travelers in writing reviews and for allowing businesses to predict how different users react to tourist spots. To address this problem, we introduce the first multimodal dataset for Japanese tourism review generation, TourMix1M, which contains one million review instances under various conditions, including images, user attributes, user profiles, review ratings, review length, key phrases, and visit seasons, collected from Japanese tourism websites. Based on this dataset, we develop a baseline model for multimodal review generation, LLaVA-Review, by performing instruction tuning of the LLaVA model. Furthermore, to enhance domain knowledge, we apply retrieval-augmented fine-tuning with aggregated tourism review data, exploring two types of knowledge representations: one incorporating noun and adjective information from a Sentiment Aware Knowledge Graph, and another using aspect-based summaries of reviews. Experimental results show that LLaVA-Review outperforms existing models in review generation and adapts well to various conditioning factors, with accuracy improving when gender, visiting month, review length, key phrases, and user-profile information are included in the prompt. Furthermore, retrieval-augmented fine-tuning with tourism information effectively improved accuracy across both types of knowledge representations.</p>
      </abstract>
      <kwd-group>
        <kwd>Tourism</kwd>
        <kwd>Review Generation</kwd>
        <kwd>Conditional Review Generation</kwd>
        <kwd>Large Multimodal Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Tourism is one of the most crucial sectors in the global economy, contributing more than 15 trillion US
dollars [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and is enjoyed for various purposes, such as relaxation, exploration, and education. Online
platforms such as TripAdvisor and Google Maps play essential roles [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for tourism. In tourism services,
user-generated content (UGC), such as reviews and photos of tourist spots, plays a vital role for both the
tourism business and individual tourists. By collecting diverse aspects and opinions about these locations,
the tourism business can further improve the services, enhance their credibility, and make more profits.
At the same time, UGC helps shape tourists’ perceptions of tourist destinations and significantly
influences travel planning [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. As tourism demand diversifies, understanding the preferences, needs,
and objectives of different segments is crucial for identifying market opportunities. The literature
shows that UGC reflects differences in trends based on user attributes and preferences [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ], and
analyzing these trends has provided valuable insights for improving tourist destinations and for effective
marketing targeting [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ].
      </p>
      <p>
        This paper focuses specifically on review generation. There are two primary applications for review
generation. First, presenting automatically generated reviews to users can significantly minimize their
effort [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and encourage more review posting. Additionally, presenting generated reviews to users is
beneficial for recommendation systems [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Second, predicting user reactions, such as “what a man
in his 50s who prefers leisurely tourism would say” can be valuable for businesses in marketing and
improving tourist destinations. Recent advancements in deep learning and natural language processing
(NLP) have enabled the generation of high-quality reviews [
        <xref ref-type="bibr" rid="ref11 ref13 ref14 ref15">11, 13, 14, 15</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the authors developed a
recurrent neural network based on long short-term memory (LSTM) for user-conditioned product
review generation. Truong et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] improved review generation by leveraging image information. Li
et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] enhanced product-related factual information by leveraging knowledge graphs. Xie et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
utilized existing reviews as additional context and generated reviews using GPT-2. However, although
various datasets exist, relatively few studies focus on tourism-related data in the review generation
literature. In particular, there is a significant gap in research that employs multimodal datasets combining
diverse contexts, such as images, rich user attributes, and review content, specifically tailored for
tourism. To address this gap, this research creates a multimodal review generation dataset, TourMix1M,
specifically tailored for tourism. The TourMix1M is created from data collected from Japanese tourism
websites, comprising 470,000 images, 510,000 reviews, and a total of one million training instances. We
will release a list of URLs for the collected data at https://github.com/HiromasaYamanishi/TourMix1M.
Our dataset supports the generation of reviews based on various contexts such as images, age, gender,
season, user profiles, review length, and ratings. This is the first dataset specialized in generating
tourism reviews that incorporates diverse information such as images and user data, promoting the
understanding and advancement of review generation in the tourism sector.
      </p>
      <p>
        We also propose a baseline model for multimodal review generation, LLaVA-Review. LLaVA-Review
was developed by instruction-tuning the open-source LLaVA model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Compared to previous
multimodal review generation models [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], LLaVA-Review offers greater flexibility in handling different
conditions by changing natural language instructions. LLaVA-Review outperformed state-of-the-art
Large Multimodal Model (LMM) ChatGPT-4o and review generation models on various metrics in
experiments conducted on our dataset, particularly in terms of BLEU, ROUGE, and consistency with user
opinions. By incorporating tourism knowledge and user opinions to fine-tune LLaVA, LLaVA-Review
also demonstrated stronger adaptability to diverse tourism-related conditioning. Furthermore, inspired
by recent research in enhancing domain specificity [
        <xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>
        ] with retrieval-augmented fine-tuning,
we incorporate aggregated information from tourism spot reviews during fine-tuning and inference
stages. We explored two approaches: subgraph method and aspect-based summary. The subgraph
method means adding noun and adjective information extracted from a Sentiment Aware Knowledge
Graph [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The aspect-based summary denotes using aspect-based summaries generated from reviews.
Experimental results showed that the subgraph method increased BLEU by 12% and user opinion
accuracy by 11%, while using aspect-based summary improved proper noun variety by 18% and domain
knowledge by 4%.
      </p>
      <p>
        One related submission to this paper is [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], which develops a multimodal model for landmark
recognition and review generation. However, this work differs from [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] in two key aspects: 1) Dataset:
The dataset used in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] addresses limited conditions such as gender, age, group, and key phrases,
whereas the TourMix1M includes a wider range of factors such as visit season, user profiles, ratings,
and review length, providing a more comprehensive foundation for tourism review generation. 2)
Model design: [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] addresses review generation as part of a multitask framework, which is trained by
standard instruction-tuning. Our research explores the usage of information aggregated from reviews
to improve quality. Overall, our contributions are threefold:
• We created a multimodal dataset, TourMix1M, for Japanese tourism review generation. The
TourMix1M has one million instances and ten types of diverse conditions. The URLs used in the dataset
construction will be made publicly available.
• We proposed LLaVA-Review, which outperformed state-of-the-art large multimodal models and
review generation models in general review generation. LLaVA-Review showed strong adaptability
to different conditions, and we analyzed variations in its outputs across these conditions.
• We propose two retrieval-augmented fine-tuning methods: aggregating subgraphs from a Sentiment
Aware Knowledge Graph and generating aspect-based summaries from reviews; both enhance accuracy.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. User-Generated Content in Tourism</title>
        <p>
          Tourism is a key sector in the global economy, accounting for over 10% of global economic output [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
As tourism is often a high-cost, once-in-a-lifetime experience, UGC on social platforms plays a crucial
role in helping tourists decide what to visit. UGC, typically in the form of reviews and images, offers
valuable insights. Reviews express user opinions on aspects such as content, price, and service, while
images enhance understanding of intangible experiences such as tourist destinations [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Furthermore,
images complement textual information and function as visual evidence to substantiate the content,
thus enhancing the usefulness of reviews when combined with images [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. UGC significantly
influences travel planning and behavior by shaping both cognitive and emotional perceptions of tourist
destinations [
          <xref ref-type="bibr" rid="ref23 ref3 ref4">4, 23, 3</xref>
          ].
        </p>
        <p>
          Analyzing UGC is vital for marketing and improving tourist destinations. Text mining has shown that
reviews reflect user satisfaction and preferences based on factors such as gender [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], age, companions [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ],
and season [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. For example, [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] finds that men are more interested in historical sites, while women
prefer landscapes and rural areas. Similarly, [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] shows that couples tend to leave reviews with higher
satisfaction levels. Analyzing tourist reviews with deep learning techniques, such as topic modeling and
sentiment classification, is effective for marketing and destination improvement [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ]. Recently, large
language models such as ChatGPT have shown potential for applications such as automated replies to
customer questions and requests in tourism through prompt engineering [
          <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
          ]. Our research focuses
on developing large-scale models using tourism-specific corpora and images.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Review Generation</title>
        <p>
          Review generation helps users write reviews [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and improves the trustworthiness of
recommendations [
          <xref ref-type="bibr" rid="ref12 ref26">26, 12</xref>
          ]. Unlike explanation generation [
          <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
          ] or tip generation [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], review generation aims
to create longer texts that cover multiple aspects. Various models have been developed for review
generation using recurrent neural networks (RNNs) [
          <xref ref-type="bibr" rid="ref11 ref14 ref29 ref30">11, 14, 29, 30</xref>
          ], generative adversarial networks
(GANs) [
          <xref ref-type="bibr" rid="ref31 ref32">31, 32</xref>
          ], and large-scale models [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. Typically, these models generate reviews based on inputs
such as user data, product information, ratings, or sentiment polarity. Furthermore, some research
explores generation conditioned on images [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] or aspects [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] such as Environment, Service, and Price.
Additionally, guiding the generation process with information extracted from reviews is effective in
improving quality and factuality. For instance, [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] identified relevant aspects between users and items
and guided generation using words from those aspects. Similarly, [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] proposed a method that first
generates an aspect sequence and then performs review generation from coarse to fine. In addition, [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
expanded a Freebase-based knowledge graph using user and item reviews, capturing user preferences
for each aspect via Caps GNN. Furthermore, [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] generated reviews by inputting past reviews and key
terms into GPT-2 and asking questions like “What was great?” to guide the generation.
        </p>
        <p>
          Research on review generation in the tourism domain remains underexplored. One of the biggest
challenges is the lack of a dataset. One related work under submission is [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], where review generation
is addressed as part of a multitask framework, with conditions limited to user attributes and keywords.
This paper focuses on creating a dataset specifically for review generation with a broader range of ten
conditions such as visit timing, user profiles, ratings, and review length. We also apply a large-scale
multimodal model to multimodal review generation for the first time.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Retrieval-augmented Large-Scale Model Fine-tuning</title>
        <p>
          Large-scale models (LMs) based on the Transformer architecture [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] have shown high performance
across various tasks, with notable examples including large language models (LLMs) such as LLaMA [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ]
and GPT [
          <xref ref-type="bibr" rid="ref38 ref39">38, 39</xref>
          ], large multimodal models (LMMs) such as LLaVA [
          <xref ref-type="bibr" rid="ref16 ref40">40, 16</xref>
          ], ChatGPT-4 [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ], and
QwenVL [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ]. These LMMs align language with images and can generate text across diverse inputs and
conditions.
[Figure example (image-review pair): “This time, I walked along the canal at night. Illuminated by
gas lamps, I was satisfied with the beautiful scenery. After dinner, I walked all the way to the back
and back again, making for a nice walk. The warehouses were also lit up and looked beautiful. I
definitely recommend going at night.”]
        </p>
        <p>
          However, these models often lack domain-specific knowledge. To address this, methods such as
instruction-tuning [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ] and retrieval-augmented generation (RAG) [
          <xref ref-type="bibr" rid="ref17 ref18 ref44 ref45 ref46 ref47">44, 45, 46, 47, 48, 49, 17, 50, 18, 51</xref>
          ]
have been developed to enhance performance in specific domains. RAG, which incorporates external
knowledge, including structured knowledge such as Knowledge Graphs [
          <xref ref-type="bibr" rid="ref44 ref45 ref46 ref47">44, 45, 46, 47</xref>
          ] and unstructured
sources [
          <xref ref-type="bibr" rid="ref17 ref18">48, 49, 17, 50, 18, 51</xref>
          ], has proven effective for domain-specific tasks. Knowledge integration has
proven to be beneficial either during pre-training [
          <xref ref-type="bibr" rid="ref44">48, 49, 44</xref>
          ], fine-tuning [
          <xref ref-type="bibr" rid="ref17 ref18 ref45">17, 18, 45</xref>
          ], or inference [
          <xref ref-type="bibr" rid="ref46">51, 46</xref>
          ].
Our research aligns closely with studies such as [
          <xref ref-type="bibr" rid="ref17 ref18 ref45">17, 18, 45</xref>
          ], which focus on retrieval-augmentation
during both fine-tuning and inference stages. Most research in this area is applied to question answering,
where a single or a few lengthy documents are retrieved. However, this approach may not be ideal for
review generation. A small number of reviews may not capture all aspects and opinions of a certain
tourism spot, while too many reviews can introduce noise, potentially degrading quality. Although
frameworks such as [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] train large language models using concept retrieval, the knowledge may still
be insufficient. Therefore, effective knowledge retrieval methods for tourism review generation require
careful design.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <p>
        Considering the fact that most tourists take photos and their experiences vary depending on context
information such as user attributes and visit timing, we have constructed a multimodal tourism review
generation dataset. The created dataset accounts for ten conditions, including review length, user
gender, user age, group, visiting month, visiting season, two types of user profiles, and rating. Moreover,
we propose a high-performance LLaVA-Review model for review generation. We also propose
retrieval-augmented fine-tuning, in which aggregated review information is incorporated into the prompt during
both training and inference. We propose two formats for aggregating review information: one based
on subgraphs consisting of noun and adjective information extracted from a Sentiment Aware
Knowledge Graph [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], and another
based on aspect-based review summaries. Each of these aspects will be detailed in the following
sections. Figures 4, 6, 7, 8, and 10 as well as Tables 1 and 3 present the original reviews alongside the
generated examples, all of which were originally in Japanese and have been translated into English for
demonstration purposes in this paper.
      </p>
      <sec id="sec-3-1">
        <p>Image source footnotes: (2) https://cdn.jalan.jp/jalan/img/6/kuchikomi/3776/KL/6b4e1_0003776908_1.webp; (3) https://cdn.jalan.jp/jalan/img/0/kuchikomi/4140/KL/3a837_0004140092_1.webp</p>
        <p>[Table: dataset statistics — counts of Dialogues, Prompts, Reviews, Images, and Tourism Spots.]</p>
        <sec id="sec-3-1-1">
          <title>3.1. TourMix1M Dataset</title>
          <p>We created a multimodal tourism dataset for review generation by utilizing data collected from the
Japanese tourism website jalan.net4, with permission to use its data for non-commercial purposes.
We originally collected 470k images and 2.4 million reviews from the website for 51k tourist spots
across Japan that had a sufficient number of images and reviews. This represents a large portion of the
content available on the website. Initially, the collected images and texts were not always paired. To
create image-text pairs, we employed the Contrastive Language-Image Pretraining (CLIP) [52] method,
specifically focusing on the model tailored for the Japanese language 5. The data were generated by
pairing images and reviews based on the nearest neighbors in the embedding space. To create training
pairs, we used three retrieval methods for diverse image-text matching: image-to-review retrieval,
review-to-image retrieval, and image-sentence-review retrieval. Table 1 shows examples of obtained
image-text pairs. In the image-sentence-review retrieval process, sentences were first extracted from
the images, after which the full corresponding reviews were identified by locating the reviews that
contained those sentences.</p>
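The CLIP-based pairing step (embedding images and reviews, then matching nearest neighbors in the shared embedding space) can be sketched as follows; the toy 3-d vectors and function names are illustrative assumptions, not the paper's implementation — real embeddings would come from the 512-d Japanese CLIP model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_review(image_emb, review_embs):
    """Image-to-review retrieval: pick the review whose embedding is the
    nearest neighbor of the image embedding in CLIP space."""
    return max(range(len(review_embs)), key=lambda i: cosine(image_emb, review_embs[i]))

# Toy 3-d embeddings standing in for 512-d Japanese-CLIP outputs.
image_emb = [0.9, 0.1, 0.0]
review_embs = [
    [0.0, 1.0, 0.0],  # review about an unrelated topic
    [1.0, 0.2, 0.0],  # review matching the image content
    [0.0, 0.0, 1.0],
]
print(nearest_review(image_emb, review_embs))  # → 1
```

Review-to-image retrieval is the same search with the roles of the two modalities swapped.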
          <p>We constructed three review generation tasks based on the generated image-text pairs. The tasks
consist of Short Review Generation, General Review Generation, and Conditional Review Generation.
General and Conditional Review Generation are collectively referred to as Long Review Generation.
For Short Review Generation, image-sentence pairs obtained from image-sentence retrieval were used
to generate concise reviews. For Long Review Generation, image-review pairs obtained from three
retrieval methods were used to generate reviews. The left side of Figure 1 shows the distribution of
each task. As input for each task, in General Review Generation, instructions were given to generate
reviews based solely on images and place names. In Short Review Generation, only one condition,
rating, was applied. For Conditional Review Generation, instructions were given to generate reviews
based on combinations of images, place names, and various conditions. The conditions considered
include ten categories: review length, gender, age, groups, visit month, season, two types of user
profiles (tag and long), rating, and key phrase in the review. These variables encompass a wide range of
conditions specific to tourism. While this research does not perform conditioning based on user ID,
making the setup less personalized, it provides a more general context and attribute based framework,
which is applicable even to cold-start users. For example, a new user, such as a man in his 50s who
enjoys leisurely activities, could request a review for a spring visit to a tranquil garden or scenic nature
trail, allowing the system to generate a review tailored to his preferences without prior interactions.
Moreover, our dataset enables analysis of how diferent conditioning factors, such as age, gender, and
season, influence the generated reviews, providing insights into the diverse user experiences.</p>
          <p>Specifically, for categorical variables, gender is either male or female; age is in ten-year increments
from the 10s to the 90s; groups include family, couple, friends, single, or other; rating is an integer
between 1 and 5; visit month is an integer between 1 and 12; and the season is either spring, summer,
autumn, or winter. For user profiles, we use two types: “tag” profiles, which are simple, keyword-based
summaries such as “history enthusiast” and “long” profiles, which provide more detailed descriptions
such as “a curious traveler with a deep interest in local history and culture, enjoying museum visits
and exploring traditional cuisine.” These profiles are generated by prompting a large language model 6
based on past reviews. For key phrases, we use an LLM to extract important parts from sentences,
ranging from single words to short sentences, that users found positive or negative. The attributes
for each review are derived from the accompanying metadata of reviews and the data of the user who
wrote them. During training, instead of using all conditions for each instance, we perform condition
sampling based on a pre-defined probability to increase the diversity of condition combinations. The
probabilities were chosen so that each conditioning factor would appear in approximately 10% to 30%
of the entire dataset. Figure 2 illustrates how frequently each attribute appears throughout the entire
dataset. Figure 3 shows the distribution of categorical attributes such as Gender, Age, Group, Season,
Month, and Rating across the original 2.4 million reviews for both the TourMix1M dataset and the
sampled TourMix1M dataset. The results indicate that the TourMix1M dataset closely replicates the
original distribution. We release both the full set of conditions and the specific conditions used in
the experiments. The middle and right sections of Figure 1 show the distribution of the number of
conditions in short and long review generation, respectively. For prompt construction, in the General
Review Generation task, prompts are structured as “Generate a review for Sensoji based on the image.”
For Short Review Generation, prompts are phrased as “Generate a concise review for Sensoji based
on the image,” and for Conditional Review Generation, such as “Generate a 4-star review for Sensoji</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>Footnotes: (5) https://huggingface.co/rinna/japanese-clip-vit-b-16; (6) https://huggingface.co/UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3</p>
        <p>using the keyword ‘blossom’ as written by a male in his 20s based on the image.” The resulting training
dataset comprises 1 million dialogues, 1.31 million prompts, 545,891 reviews, 476,167 images, and 51,011
tourist spots.</p>
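The condition sampling and prompt construction described above might look like the following sketch; the per-condition inclusion probabilities and the exact prompt template are our assumptions (the paper states only that each factor appears in roughly 10% to 30% of instances).

```python
import random

# Illustrative inclusion probabilities (assumed values, tuned in the paper
# so each condition appears in ~10-30% of the dataset).
CONDITION_PROBS = {"length": 0.2, "gender": 0.3, "age": 0.3, "group": 0.2,
                   "month": 0.1, "season": 0.1, "profile": 0.2,
                   "rating": 0.2, "key_phrase": 0.2}

def sample_conditions(meta, rng):
    """Keep each available condition independently with its probability."""
    return {k: v for k, v in meta.items()
            if k in CONDITION_PROBS and rng.random() < CONDITION_PROBS[k]}

def build_prompt(spot, conds):
    """Assemble a Conditional Review Generation prompt in the paper's style."""
    p = "Generate a "
    if "rating" in conds:
        p += f"{conds['rating']}-star "
    p += f"review for {spot}"
    if "key_phrase" in conds:
        p += f" using the keyword '{conds['key_phrase']}'"
    if "gender" in conds and "age" in conds:
        p += f" as written by a {conds['gender']} in his {conds['age']}"
    p += " based on the image."
    return p

meta = {"rating": 4, "key_phrase": "blossom", "gender": "male", "age": "20s"}
sampled = sample_conditions(meta, random.Random(0))  # random subset of meta
print(build_prompt("Sensoji", meta))
```

With all four conditions present, this reproduces the example prompt "Generate a 4-star review for Sensoji using the keyword 'blossom' as written by a male in his 20s based on the image."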
        <sec id="sec-3-2-1">
          <title>3.2. LLaVA-Review</title>
          <p>
            3.2.1. Model Architecture
The architecture of our proposed model is illustrated in Figure 4. The icons in the figure will represent
LLaVA-Review in subsequent sections. The baseline model we developed, LLaVA-Review, is based
on LLaVA [
            <xref ref-type="bibr" rid="ref16 ref40">40, 16</xref>
            ], a prominent open-source large-scale multimodal model. In LLaVA, the image is
initially divided into patches, which are then converted into image tokens by the image encoder. These
tokens are transformed into the language space via a projector consisting of multilayer perceptrons. By
simultaneously feeding instruction tokens and image tokens into the large language model, the model
generates responses that incorporate image information. The training process consists of two stages:
pretraining, where only the projector is trained, and fine-tuning, where both the large language model
(LLM) and the projector are optimized. In the proposed LLaVA-Review, we performed only fine-tuning
based on the model pretrained in [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. Instruction-tuning [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ] was employed as the fine-tuning method,
which learns to generate responses using instruction-response pairs. Optimization is conducted by
minimizing the negative log-likelihood of the response generation using the following loss function,
where x_i represents the tokens, I denotes the length of the instruction part, and N is the total length of
the text. We used Vicuna-13B [53], which has strong capabilities in Japanese, as the language model,
and CLIP as the image encoder. Additionally, to ensure efficient training and to prevent degradation
in language capabilities, we employed a Low-Rank Adaptation (LoRA) [54]-based strategy for model
training.
          </p>
          <p>ℒ = − ∑_{i=I+1}^{N} log p(x_i | x_1, . . . , x_{i−1}).   (1)</p>
          <p>3.2.2. Instruction-Tuning with External Knowledge
We propose utilizing knowledge extracted from existing reviews as references to enhance the quality of
review generation. This external knowledge is incorporated during both the training and inference
phases, with the extracted sentences added to the end of the review generation prompts. Tourist
destination reviews typically reflect diverse perspectives and opinions, resulting in a large volume of
content. Relying on only a few reviews may fail to fully capture this diversity, while using the entire
review text as input risks introducing noise. To address these challenges, we empirically evaluated
various external knowledge aggregation methods that effectively capture domain knowledge and user
opinions. Figure 5 illustrates the two proposed methods. We utilized external knowledge constructed from
2.4 million reviews that do not overlap with test reviews.</p>
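The instruction-tuning loss of Eq. (1) penalizes only the response tokens; the instruction portion of the prompt, including any appended knowledge sentences, is masked out. A minimal sketch of this computation (with toy log-probabilities of our own choosing):

```python
import math

def response_nll(token_logprobs, instruction_len):
    """Eq. (1): negative log-likelihood summed over response tokens only
    (positions I+1..N); instruction tokens contribute no loss."""
    return -sum(token_logprobs[instruction_len:])

# Toy sequence of 5 tokens; the first 2 form the instruction (masked out).
logprobs = [math.log(0.5), math.log(0.9),                  # instruction
            math.log(0.25), math.log(0.5), math.log(0.8)]  # response
print(round(response_nll(logprobs, instruction_len=2), 4))  # → 2.3026
```

In practice the same masking is achieved by setting the label of each instruction token to an ignore index before computing the cross-entropy.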
          <p>
            Subgraph-based method. This method involves sampling from a Sentiment Aware Knowledge
Graph (SAKG). SAKG is a knowledge graph that incorporates user opinions and sentiment information.
The SAKG in this research is represented as 𝒢 = {(h, r, t) | h, t ∈ ℰ, r ∈ ℛ}, where ℰ represents
entities, ℛ represents relationships, and a triplet (h, r, t) denotes a relationship r from a head
entity h (e.g., Ueno Park) to a tail entity t (e.g., Panda). Unlike previous work [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ], this graph uses
edges to represent adjectives and their frequency of use, such as (cute, 9) and (big, 4). The graph is
constructed by extracting noun-adjective pairs from reviews using syntactic parsing and aggregating
them for each tourist spot. During training, first, between 1 to 5 entities related to the target destination
are sampled based on edge frequency. Subsequently, relationships associated with the sampled entities
are selected similarly. During inference, the top-k entities and relationships by total edge frequency are
sampled and incorporated into the prompt as natural language.
          </p>
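The inference-time retrieval (top-k entities by total edge frequency, verbalized into the prompt) could be sketched as follows; the toy graph and the verbalization phrasing are our assumptions. Training-time sampling would instead draw 1 to 5 entities at random, weighted by edge frequency.

```python
# Hypothetical SAKG fragment for one tourist spot: entity -> {adjective: frequency},
# mirroring edges such as (cute, 9) and (big, 4).
SAKG = {
    "Panda":           {"cute": 9, "big": 4},
    "Cherry blossoms": {"beautiful": 12, "crowded": 3},
    "Museum":          {"interesting": 5},
}

def topk_subgraph(graph, k):
    """Inference-time retrieval: top-k entities by total edge frequency,
    each paired with its most frequent adjective."""
    ents = sorted(graph, key=lambda e: sum(graph[e].values()), reverse=True)[:k]
    return [(e, max(graph[e], key=graph[e].get)) for e in ents]

def verbalize(pairs):
    """Turn the sampled subgraph into natural language for the prompt."""
    return "Visitors mention: " + "; ".join(f"{adj} {ent}" for ent, adj in pairs)

print(verbalize(topk_subgraph(SAKG, k=2)))
# → Visitors mention: beautiful Cherry blossoms; cute Panda
```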
          <p>
            Summary-based method. This method involves adding aspect-based summaries. Previous research
on review summarization has shown that aspect-based summaries facilitate a broader understanding
of items [
            <xref ref-type="bibr" rid="ref10">55, 10</xref>
            ]. Furthermore, incorporating aspect information is crucial in review generation [55].
In this research, we input reviews and prompt a large-scale language model5 to generate a summary
in a single stage. Specifically, for each tourist destination, up to 50 reviews are selected as input, with
a total length of less than 5,000 characters. The prompt instructs the model to create a summary of around
1,000 characters covering overviews, positive opinions, and negative opinions for the key content
elements, including price, service, food and drink, facilities, transportation access, and seasonal events.
          </p>
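<p>The input-selection rule and prompt assembly can be sketched as below. The English prompt wording and all function names are assumptions; the paper's actual (Japanese) prompt is not reproduced in the text.</p>

```python
# Aspects named in the summary-based method.
ASPECTS = ["price", "service", "food and drink", "facilities",
           "transportation access", "seasonal events"]

def select_reviews(reviews, max_reviews=50, max_chars=5000):
    """Greedily pick reviews while staying under the count and length budgets."""
    picked, total = [], 0
    for review in reviews:
        if len(picked) == max_reviews or total + len(review) >= max_chars:
            break
        picked.append(review)
        total += len(review)
    return picked

def build_summary_prompt(spot, reviews):
    """Assemble a single-stage summarization prompt for one tourist spot."""
    header = (f"Summarize the following reviews of {spot} in about 1,000 "
              f"characters, giving an overview, positive opinions, and negative "
              f"opinions for each of: {', '.join(ASPECTS)}.\n\n")
    return header + "\n".join(f"- {r}" for r in select_reviews(reviews))
```

<p>Keeping the input under a fixed character budget bounds the context length of the summarizing model regardless of how many reviews a destination has.</p>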
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>The TourMix1M dataset was employed to train a model for generating reviews based on inputs such as
images, tourist destination names, and various contextual conditions. Specifically, to create the test
data, images not used in the training set were first selected, and then the nearest corresponding reviews
were retrieved with CLIP embeddings. Only image-review pairs where both the image and review were
absent from the training data were included in the test set.</p>
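<p>The pairing step can be sketched as follows. Toy hand-made vectors stand in for real CLIP embeddings, and the identifiers and dictionary layout are assumptions made for illustration.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def build_test_pairs(image_embs, review_embs, train_images, train_reviews):
    """Pair each held-out image with its nearest review by embedding similarity,
    keeping only pairs where neither side appears in the training data."""
    pairs = []
    for img_id, img_vec in image_embs.items():
        if img_id in train_images:
            continue  # image was seen during training
        best = max(review_embs, key=lambda rid: cosine(img_vec, review_embs[rid]))
        if best not in train_reviews:
            pairs.append((img_id, best))
    return pairs
```

<p>Filtering on both sides prevents train-test leakage through either the image or the review text.</p>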
<p>The training of LLaVA-Review was conducted using eight 48GB RTX 6000 Ada GPUs with a batch size
of 80 and a learning rate of 2 × 10⁻⁴, taking approximately 37 hours per epoch. When incorporating
external knowledge, the training time increased to 41 hours for the subgraph-based method and 60
hours for the summary-based method. Evaluation was performed using 1,000 image-review pairs that
did not overlap with the training data.</p>
<p>In this research, the key characteristics for effective review generation are 1) the integration of
image information, 2) maintaining high text quality, 3) incorporating detailed information about tourist
destinations, and 4) accounting for user opinions, particularly the collective sentiment regarding tourist
experiences.</p>
        <p>The evaluation metrics included BLEU [56], ROUGE-1, ROUGE-L [57], CIDEr [58], diversity (DIV),
the number of unique proper nouns (PROPN), TFIDF-F1 score, and Senti-F1 score. BLEU, ROUGE, and
CIDEr were used for overall text quality. BLEU and ROUGE were calculated with the sumeval library 8
and CIDEr was calculated with pycocoevalcap library 9. Diversity was assessed based on adjectives,
nouns, proper nouns, and verbs. It was calculated by measuring the overlap of these features between
generated sentences, using part-of-speech information from GiNZA10, following the approach of [59].
The number of unique proper nouns was calculated by dividing the total number of unique proper
nouns in all generated reviews by the number of reviews. To evaluate domain knowledge, the TFIDF-F1
metric identifies the top 15 TFIDF words for each tourist spot and calculates the F1-score
between these top words and those in the generated reviews. The Senti-F1 metric, developed using
Aspect-Based Sentiment Analysis (ABSA) [60], measured user opinion consideration by extracting
(feature, opinion, sentiment) triplets from the text via a large language model5. F1 scores for the
alignment of (feature, sentiment) and (feature, opinion) were averaged to produce the Senti-F1 score.
For all metrics, except length, higher values indicate better performance.</p>
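<p>The TFIDF-F1 computation can be sketched as below. This is a simplified reading of the metric: each spot's reviews are concatenated into one document, and whitespace tokenization stands in for the Japanese morphological analysis used in the paper.</p>

```python
import math
from collections import Counter

def tfidf_top_words(spot_docs, spot, k=15):
    """Top-k TF-IDF words for one spot's document (spot_docs: spot -> text)."""
    n_docs = len(spot_docs)
    df = Counter()
    for doc in spot_docs.values():
        df.update(set(doc.split()))  # document frequency per word
    tf = Counter(spot_docs[spot].split())
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

def f1_overlap(reference_words, generated_text):
    """F1 between the reference word set and the generated review's words."""
    gen, ref = set(generated_text.split()), set(reference_words)
    tp = len(gen.intersection(ref))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(gen), tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

<p>The IDF term discounts words common to all spots, so the reference set captures what is distinctive about each destination rather than generic tourism vocabulary.</p>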
        <p>
          The comparison methods include MRG [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], PETER [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], PEPLER [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], LLaVA-1.5 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ],
ChatGPT-4V, and ChatGPT-4o. MRG is a multimodal review generation model based on LSTM [61]. We use
VGG16 [62] as the vision backbone. PETER is an explanation generation model based on the Transformer
architecture, while PEPLER is based on GPT-2 [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]. We utilized a GPT-2 model 11 trained on Japanese data
for PEPLER. In PETER and PEPLER, image features were extracted using ResNet [63], then reduced
via PCA [64], and clustered with KMeans [65] to generate photo_id, which was used in place of
user_id. LLaVA-1.5 is an open-source large multimodal model, while ChatGPT-4V [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ] and ChatGPT-4o
are closed-source large multimodal models known for their state-of-the-art knowledge and language
capabilities. For large-scale multimodal models, prompts such as “You are a tourist who visited location.
Generate a review based on the image” are used. Since models other than the proposed method tend to
generate verbose text, we instruct in the prompt that the generated review be approximately 100 characters long. For
retrieval-augmented fine-tuning, we employed two methods: one that extracts entities for comparison
with subgraphs, and another that retrieves up to seven reviews based on CLIP image similarity for
comparison with summaries. During inference, we used four entities and three relations in the subgraph
method.
        </p>
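<p>The photo_id construction used to adapt PETER and PEPLER can be sketched as follows. In the paper the inputs are ResNet features reduced by PCA; here toy 2-D vectors stand in for them, and the bare-bones k-means is a stand-in for a library implementation.</p>

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Return k cluster centers for a list of feature tuples (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center by squared distance
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # recompute centers as cluster means; keep old center if a cluster empties
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers

def photo_id(feature, centers):
    """Index of the nearest cluster center; this id is used in place of user_id."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(feature, centers[c])))
```

<p>Clustering turns a continuous image feature into a discrete id, which lets models that expect a categorical user_id consume image information without architectural changes.</p>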
      </sec>
      <sec id="sec-4-2">
        <title>4.2. General Review Generation</title>
        <p>Table 2 presents the quantitative evaluation results of review generation. General LMMs such as
LLaVA-1.5 and ChatGPT-4V achieved high ROUGE scores due to their strong linguistic capabilities. However,
their knowledge of the tourism domain and of user opinions is limited. ChatGPT-4o demonstrates
high quality in terms of domain specificity and understanding user opinions. However, it sometimes
generates factually incorrect outputs, such as describing a crowded place as quiet. Additionally, some
explanations lean towards generalizations and lack detailed knowledge of tourism spots. For fine-tuned
models such as MRG, PETER, PEPLER and LLaVA-Review, accuracy was generally improved in terms
of BLEU and ROUGE-L quality metrics, as well as domain specificity and user opinion metrics.
8https://github.com/chakki-works/sumeval
9https://github.com/sks3i/pycocoevalcap
10https://github.com/megagonlabs/ginza
11https://huggingface.co/rinna/japanese-gpt2-medium
12https://cdn.jalan.jp/jalan/img/2/kuchikomi/3622/KL/ed832_0003622962_1.webp
13https://cdn.jalan.jp/jalan/img/5/kuchikomi/3905/KL/52041_0003905143_1.webp
14https://cdn.jalan.jp/jalan/img/4/kuchikomi/0894/KL/de82f_0000894193_1.webp
[Figure 6 example residue. LLaVA-Review: “… characteristic of making your skin smooth.” Ground truth: “We stayed at Ooedo Onsen Monogatari with a baby, a 4-year-old child, and four adults. The bath was so nice that we wanted to go in many times, and it made our skin smooth.” Caption: given the image and the tourism spot name as input, the proposed baseline accurately captures the image information, user opinions, and domain knowledge. Table note: bold text indicates the best performance in each group.]</p>
        <p>For the
MRG, due to the limited expressiveness of the model, the generated output tended to have a relatively
small vocabulary and frequent repetitions, leading to lower overall accuracy. However, PETER
achieved the highest CIDEr score and performed well in domain specificity and the number of proper
nouns used, as its expressiveness was improved by the Transformer. PETER also exhibits
repetitive outputs at the sentence or phrase level, likely due to its limited number of
Transformer layers. In contrast, PEPLER consistently delivered superior performance in both review
quality and domain relevance, producing natural and coherent outputs.</p>
        <p>The proposed method, LLaVA-Review, leverages a large corpus of UGC and demonstrates excellent
review performance across metrics, particularly in BLEU, ROUGE-L, and sentiment expression. This
improvement is attributed to the model’s effective capture of image features and its capability to
incorporate relevant tourist information and user opinions, facilitated by its large parameterization.
While it tends to generate shorter reviews, the length can be adjusted by specifying the desired review
length in the prompt, as discussed later. [Retrieved-knowledge examples (table residue). (a) Retrieved
subgraph: (road, narrow, 18), (road, narrow, 8), (road, difficult, 6), (mountain road, narrow, 7),
(mountain road, difficult, 5), (mountain road, good, 3), (scenery, wonderful, 5), (scenery, good, 3).
(b) Retrieved summary (extract): “The Yamanashi Prefectural Art Museum offers an extensive collection
centered around Millet’s works, providing a quiet and serene environment for visitors. The museum also
features outdoor sculptures and a park, allowing visitors to enjoy both art appreciation and a leisurely
stroll. With relatively easy access, the museum provides a range of services that visitors will
appreciate, including discounts for local residents and special offers for those staying at nearby
accommodations.”] Figure 6 illustrates the generated review results for Isawa Onsen.
ChatGPT-4o produces relatively descriptive and typical outputs, while PETER and PEPLER effectively
capture the domain-specific information of the tourist spot. Notably, the proposed LLaVA-Review
accurately identifies the presence of indoor hot springs and their characteristics from the images,
resulting in high-quality generation.</p>
<p>Table 2 also presents the effects of incorporating external knowledge. While knowledge acquisition
improved accuracy overall, the extent of improvement varied. Sampling from the Knowledge Graph
contributed to enhancements in BLEU by 12% and Senti-F1 by 11%, leading to better consideration of
user opinions. In contrast, simply retrieving nouns showed limited improvements, highlighting the
importance of including adjective information. Aspect-based summaries notably enhanced domain
specificity and informativeness, increasing TFIDF-F1 by 4% and the number of proper nouns by 12%,
without negatively affecting quality metrics. However, directly retrieving reviews introduced noise,
which lowered overall quality despite improved user opinion consideration. Table 3 shows specific
examples.</p>
        <p>Finally, for ChatGPT-4o, we retrieved five, ten, and twenty reviews similar to the image using CLIP,
and added them to the original prompt during inference. This approach significantly boosted the
original ChatGPT-4o’s performance, achieving the highest CIDEr score and high TFIDF-F1 and Senti-F1
score. However, the variants of LLaVA-Review show comparable accuracy across almost all metrics,
confirming its high review generation performance.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Conditional Review Generation</title>
        <p>
          Table 4 presents the quantitative results of conditioned review generation. LLaVA-1.5 shows minimal
to no performance improvement when considering attributes, as reflected in its low CIDEr scores for
different conditions, such as gender (CIDEr: 0.014) and rating (CIDEr: 0.012). [Table 4 residue: row
labels for LLaVA-1.5 and LLaVA-Review and their conditioning variants (gender, age, season, rating,
length, profile, tag, key phrase, and month), together with metric column headers including BLEU.]
        </p>
        <p>In contrast, LLaVA-Review exhibits significant accuracy gains for certain attributes. LLaVA-1.5 fails to capture the characteristics
of reviews influenced by user attributes and context, often directly outputting the conditions in the
reviews. Although conditioning with key phrases improves accuracy in LLaVA-1.5, it is less effective
than in LLaVA-Review.</p>
<p>Due to differing attribute frequencies, as shown in Figure 2, directly comparing the effects of
conditioning is difficult, but certain trends in accuracy improvements are observed across different
attributes. Among user attributes, gender significantly improves generation quality, while age and group
information have little impact, suggesting that gender has a stronger influence on shaping individual user
reviews than age or group composition. For profiles, the short tag profile did not significantly
improve accuracy, but long-form, detailed profiles reflecting styles and preferences, such as
“relaxed” or “preference for natural landscapes,” enhanced review generation accuracy. Incorporating
more detailed profiles could be a direction for future work. For style conditions such as review length
and rating, review length proved especially effective, significantly improving accuracy. The average
deviation between generated and target lengths was just 4.1 characters, reflecting precise conditioning.
Low ratings had some effect, but imbalanced training data led to occasional under-representation of
negative opinions. Keyword conditioning showed the highest accuracy gains. For temporal context,
month-based conditioning outperformed season-based conditioning, indicating greater effectiveness at a
finer granularity.</p>
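<p>The reported 4.1-character figure corresponds to a mean absolute deviation between each generated review's length and the length requested in the prompt, which can be computed as below; the function and variable names are assumptions.</p>

```python
def mean_length_deviation(generated_reviews, target_lengths):
    """Average |len(generated) - requested length| over the evaluation set."""
    assert len(generated_reviews) == len(target_lengths)
    return sum(abs(len(g) - t)
               for g, t in zip(generated_reviews, target_lengths)) / len(target_lengths)
```

<p>A small deviation indicates that the length condition in the prompt is being followed rather than merely correlating with verbosity.</p>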
<p>Figure 7 illustrates examples of different conditioning: user demographics, age, user profile, rating and
review length, and key phrases. For user demographic conditioning, characteristics of men, women,
and families are reflected, such as mentioning “playground equipment” for families and “beautiful
flowers” for women. For age conditioning, in the twenties conditioning, a lively writing style using
exclamations was observed, while in the sixties conditioning, more factual reviews related to historical
knowledge were produced. For review length, it was observed that even with the requirement of 290
characters, reviews were generated without content redundancy or structural inconsistencies. For rating,
for the negative conditioning of one star, reviews expressing disappointment over the lack of menu
options were generated. Generation based on keywords also demonstrated high fidelity, producing
results that were very close to the ground truth.
15https://cdn.jalan.jp/jalan/img/7/kuchikomi/2837/KL/d72d0_0002837701_1.webp
16https://cdn.jalan.jp/jalan/img/8/kuchikomi/1388/KL/e88ab_0001388176_2.webp
17https://cdn.jalan.jp/jalan/img/5/kuchikomi/4115/KL/09030_0004115266_1.webp
18https://cdn.jalan.jp/jalan/img/3/kuchikomi/5203/KL/02e5b_0005203413_1.webp
19https://cdn.jalan.jp/jalan/img/5/kuchikomi/4745/KL/be8e4_0004745614_4.webp</p>
        <p>You are a tourist of ○○ who visited Expo ’70 Commemorative Park. Please write a review based on
the image.</p>
<p>Male: Expo ’70 Commemorative Park, which features the famous Tower of the Sun sculpture by Taro Okamoto,
sometimes hosts flea markets on weekends.</p>
        <p>Female: Expo ’70 Commemorative Park is very spacious and a great place for a walk. Depending on the season,
various flowers bloom, making it very beautiful.</p>
        <p>Family: It is a very spacious park. Many flowers bloom, making it ideal for walking. There is also playground
equipment, so children can enjoy it too.</p>
<p>You are a tourist of age ○○ who visited Nagoya Castle. Please write a review based on the image.</p>
        <p>20s: Nagoya Castle is a very magnificent castle. You can go inside, so please be sure to visit. I highly
recommend it!</p>
        <p>40s: If you visit Nagoya Castle during the cherry blossom season, you can enjoy and take photos with the castle
in the background. Nearby, there are other attractions like the Nagoya City Science Museum and Kinshachi
Yokocho, making it a place where you can enjoy a full day.</p>
        <p>60s: Nagoya Castle was originally built by Toyotomi Hideyoshi. The current main keep was reconstructed in the
Showa era. It is especially beautiful during the cherry blossom season.</p>
        <p>You are a tourist with a profile of 〇〇. Generate a review for Godaido Hall.</p>
<p>Profile (short). Prof: Nature lover. Output: It is located in a place where you can get a panoramic view
of Matsushima Bay. Since it is close to the pier for sightseeing boats, I think it’s good to visit both.</p>
        <p>Profile (long). Prof: A calm traveler who loves the beauty of nature, especially the scenery of lakes and
coastlines, and enjoys leisurely exploring tourist spots like Shiretoko and Lake Mashu. Output: It is located
in a place where you can get a panoramic view. From inside Godaido Hall, you can see a great view of
Matsushima Bay. Sightseeing boats also pass by, so I think it’s a good spot for taking photos.</p>
        <p>Ground Truth: It is an island near the Matsushima sightseeing boat pier, located after crossing three red
bridges where you can see the bottom. In the center of the island, there is a hall, and it is a wonderful
place with a panoramic view of Matsushima Bay.</p>
        <p>Generate a review for Nagoya Tokyu Hotel with a condition of 〇〇</p>
<p>Length (290 chars): I went there with high expectations since it was lunch at a hotel. I chose the hamburger
steak as the main dish, and it was tender and delicious. The side buffet had a wide variety of options, and I
left feeling full. There was also dessert, but it was just ice cream. The coffee after the meal was self-service.
Since it was a restaurant inside a hotel, the service and atmosphere were good. However, the line in front of
the entrance was long, and it took time to get in. It can't be helped since it's a popular restaurant, but I
think it would be better if they could widen the entrance or make some improvements. I used points to get a
good deal, so I was satisfied. I'd like to come back and try a different menu next time.</p>
        <p>Rating (1 star): I went there for lunch at a hotel with high expectations, but the menu was limited, and there
were no desserts. There was a salad bar and a drink bar, but considering the price, it felt a bit... disappointing.</p>
        <p>Generate a review for Kunozan Toshogu Shrine using the keyword “more strenuous than I imagined”
that matches the image</p>
<p>Keyword: It was more strenuous than I imagined. I was drenched in sweat climbing the stairs. But the view
after reaching the top was amazing.</p>
        <p>Ground Truth: I climbed the stone steps from the seaside to the Gorieki-dori side. It was more strenuous
than I imagined, but the view was magnificent and I felt a great sense of accomplishment.</p>
        <p>Figure 7: Conditional review generation examples. The first example shows generation based on user
demographics. The second example shows generation based on age. The third example illustrates generation based on
the user’s profile. The fourth example shows generation based on rating and review length. The fifth example
illustrates generation based on key phrases.15 16 17 18 19</p>
        <p>As an aggregate analysis, we generated reviews for the same image by changing only the seasonal
instructions across the four seasons, then compared word frequency in the output. As shown in
Figure 8, distinct features appear: cherry blossoms in spring, children in summer, autumn leaves in
fall, and hot springs in winter, highlighting the impact of seasonal conditioning. Figure 9 provides
a quantitative view of the conditioning effects under gender conditioning. In generated reviews, we
compared word frequencies between male and female conditioned reviews using 1,000 test image-text
pairs. For real reviews, we sampled ten reviews from men and three from women for the same tourist
destination. The table highlights the top 15 words with the largest (female - male) and (male - female)
frequency differences in real reviews. In reviews originally written by women, certain words appear
more frequently, and this pattern is reflected in generated reviews. All 15 words with higher frequency
in women’s original reviews also show higher frequency under female conditioning, with notable
differences for “Go,” “Beautiful,” and “Spacious.” In contrast, men’s reviews show fewer specific words,
and the male-to-female ratio remains small in generated reviews, maintaining the overall trend of
higher male-conditioned word frequency.</p>
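<p>The word-frequency comparison behind Figure 9 can be sketched as below. Whitespace tokens stand in for the morphological analysis applied to the Japanese reviews, and the function name is an assumption.</p>

```python
from collections import Counter

def freq_diff_top_words(reviews_a, reviews_b, k=15):
    """Top-k words whose frequency in group A most exceeds that in group B."""
    counts_a = Counter(w for r in reviews_a for w in r.split())
    counts_b = Counter(w for r in reviews_b for w in r.split())
    vocab = set(counts_a) | set(counts_b)
    # sort by descending (A - B) frequency difference
    return sorted(vocab, key=lambda w: counts_b[w] - counts_a[w])[:k]
```

<p>Running this once with the female-conditioned outputs as group A and once with the male-conditioned outputs as group A yields the two word lists compared against real reviews.</p>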
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Short Review Generation</title>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Visualization of Attention</title>
        <p>The visualization results of the attention weights in the final layer of the language model during the
generation of the first token in the review generation process are shown in Figure 10. Specifically, the
visualization highlights the attention over the 576 tokens that correspond to image positions,
reshaped to 24 × 24 and presented as a heatmap. In the standard LLaVA-1.5 model, high attention values
are concentrated in narrow regions of the image. In contrast, LLaVA-Review exhibits high attention
values over a broader area of the image and across more extensive objects.</p>
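<p>The reshaping step behind the heatmap can be sketched as follows; the flat-list input format and peak normalization are assumptions made for illustration.</p>

```python
def attention_heatmap(weights, side=24):
    """Reshape a flat list of side*side image-token attention weights into a
    2-D grid, scaled so the peak value is 1.0 for display as a heatmap."""
    assert len(weights) == side * side
    peak = max(weights) or 1.0  # avoid division by zero for all-zero input
    return [[w / peak for w in weights[row * side:(row + 1) * side]]
            for row in range(side)]
```

<p>The 24 × 24 grid mirrors the spatial layout of the vision encoder's image patches, so high-attention cells can be overlaid directly on the input image.</p>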
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations and Future Works</title>
      <p>In terms of dataset, we are considering updates to improve the fidelity of image-text pair creation.
Leveraging more detailed user information, such as utilizing all past reviews written by the user, or
considering their actual behavior at tourist destinations when possible, is also a promising direction.
Additionally, while this research developed a dataset specific to Japanese data, it is known that tourist
destinations and perceptions of tourism vary by country. Building a more comprehensive dataset
that encompasses diverse languages and cultures remains a challenge for future research. Moreover,
expanding the dataset to support more tasks, such as personalized product description generation
and recommendation, is also future work. In terms of model design, we will leverage more
powerful language models as the backbone, incorporate more robust image information and domain
knowledge, and develop more adaptive external documents. We will also consider the application to
real-world scenarios, such as simulations and marketing. Our model is capable of generating virtual
user experiences, which could be used to improve tourist destinations and simulate travel experiences.
However, when utilizing such pseudo-reviews, it is essential to address potential issues related to
privacy, bias, and reputation harms that these reviews might cause.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this research, we developed TourMix1M, the first multimodal dataset for tourism review generation.
We also introduced LLaVA-Review, a large-scale multimodal model for review generation. Furthermore,
we researched two knowledge retrieval methods for tourism review generation. Experiments with
the proposed dataset showed LLaVA-Review’s superior performance in domain specificity and user
sentiment expression. The proposed two retrieval-augmented fine-tuning strategies further improved
accuracy. Additionally, conditioning on factors such as gender, user profiles, month, review
length, and key phrases significantly enhanced review generation. This work is expected to advance
research in tourism and broader review generation fields.</p>
    </sec>
    <sec id="sec-7">
<title>7. Acknowledgments</title>
      <p>This work is partially financially supported by the Center for Real Estate Innovation, The University of
Tokyo.</p>
      <p>… model reasoning, in: International Conference on Learning Representations, 2024.
[48] S. Borgeaud, A. Mensch, J. Hofmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J.-B.</p>
      <p>Lespiau, B. Damoc, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones,
A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero, K. Simonyan, J. W. Rae,
E. Elsen, L. Sifre, Improving language models by retrieving from trillions of tokens., in:
International Conference on Machine Learning, 2022, pp. 2206–2240.
[49] K. Guu, K. Lee, Z. Tung, P. Pasupat, M.-W. Chang, Realm: Retrieval-augmented language model
pre-training, in: International Conference on Machine Learning, 2020, pp. 3929–3938.
[50] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, W.-t. Yih, Replug:
Retrieval-augmented black-box language models, in: Proceedings of the 2024 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, 2024, pp. 8371–8384.
[51] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive nlp
tasks, in: Advances in Neural Information Processing Systems, 2020, pp. 9459–9474.
[52] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, et al., Learning
transferable visual models from natural language supervision, in: International Conference on
Machine Learning, 2021, pp. 8748–8763.
[53] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E.</p>
      <p>Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt
quality, 2023. URL: https://lmsys.org/blog/2023-03-30-vicuna/.
[54] E. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation
of large language models, in: Proc. International Conference on Learning Representations, 2022.
[55] R. K. Amplayo, S. Angelidis, M. Lapata, Aspect-controllable opinion summarization, in: Proceedings
of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 6578–6593.
[56] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: A method for automatic evaluation of machine
translation, in: Proc. 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp.
311–318.
[57] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text Summarization</p>
      <p>Branches Out, 2004, pp. 74–81.
[58] R. Vedantam, C. L. Zitnick, D. Parikh, Cider: Consensus-based image description evaluation,
in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp.
4566–4575.
[59] B. Smyth, P. McClave, Similarity vs. diversity, in: Proceedings of the 4th International Conference
on Case-Based Reasoning, 2001, pp. 347–361.
[60] H. Peng, L. Xu, L. Bing, F. Huang, W. Lu, L. Si, Knowing what, how and why: A near complete
solution for aspect-based sentiment analysis, in: Proceedings of the AAAI Conference on Artificial
Intelligence, volume 34, 2020, pp. 8600–8607.
[61] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780.
[62] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition,
in: International Conference on Learning Representations, 2015.
[63] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. IEEE</p>
      <p>Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[64] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometrics and Intelligent
Laboratory Systems (1987) 37–52.
[65] S. P. Lloyd, Least squares quantization in pcm, IEEE Transactions on Information Theory 28 (1982)
129–137.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>WTTC</surname>
          </string-name>
          ,
          <source>Economic Impact Report: Global Infographic</source>
          ,
          <source>Technical Report, World Travel Tourism Council (WTTC)</source>
          ,
          <year>2023</year>
          . URL: https://wttc.org/Research/Economic-Impact.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Anonymous</surname>
          </string-name>
          ,
          <article-title>Tourist experiences at overcrowded attractions: A text analytics approach</article-title>
          , in: Information and Communication Technologies in Tourism,
          <year>2022</year>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.-E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. I.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
<article-title>Effects of tourism information quality in social media on destination image formation: The case of Sina Weibo</article-title>
          ,
          <source>Information &amp; Management</source>
          <volume>54</volume>
          (
          <year>2017</year>
          )
          <fpage>687</fpage>
          -
          <lpage>702</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. del Carmen</given-names>
            <surname>Hidalgo Alcázar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Piñero</surname>
          </string-name>
          , S. R. de Maya,
<article-title>The effect of user-generated content on tourist behavior: The mediating role of destination image</article-title>
          ,
          <source>Tourism &amp; Management Studies</source>
          <volume>10</volume>
          (
          <year>2014</year>
          )
          <fpage>158</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y. L. . H. S.</given-names>
            <surname>Dogan</surname>
          </string-name>
          <string-name>
            <surname>Gursoy</surname>
          </string-name>
          ,
<article-title>Gender difference on destination image and travel options: An exploratory text-mining study</article-title>
          .,
          <source>in: PloS one</source>
          , volume
          <volume>30</volume>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y. L. . H. S.</given-names>
            <surname>Dogan</surname>
          </string-name>
          <string-name>
            <surname>Gursoy</surname>
          </string-name>
          ,
<article-title>Does traveler satisfaction differ in various travel group compositions?</article-title>
          , in:
          <source>International Journal of Contemporary Hospitality Management</source>
          , volume
          <volume>30</volume>
          ,
          <year>2018</year>
          , pp.
          <fpage>1663</fpage>
          -
          <lpage>1685</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
<surname>J. J. Padilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kavak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Lynch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Gore1</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Diallo</surname>
          </string-name>
          ,
          <article-title>Temporal and spatiotemporal investigation of tourist attraction visit sentiment on twitter</article-title>
          .,
          <source>PloS one 13</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
[8]
<string-name><given-names>M.</given-names> <surname>Rossetti</surname></string-name>,
<string-name><given-names>F.</given-names> <surname>Stella</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Zanker</surname></string-name>,
<article-title>Analyzing user reviews in tourism with topic models</article-title>,
<source>Information Technology &amp; Tourism</source>
<volume>16</volume>
(<year>2016</year>)
<fpage>5</fpage>-<lpage>21</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
[9]
<string-name><given-names>E.</given-names> <surname>Marine-Roig</surname></string-name>,
<string-name><given-names>S. A.</given-names> <surname>Clavé</surname></string-name>,
<article-title>Tourism analytics with massive user-generated content: A case study of Barcelona</article-title>,
<source>Journal of Destination Marketing &amp; Management</source>
          <volume>4</volume>
          (
          <year>2015</year>
          )
          <fpage>162</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
[10]
<string-name><given-names>C.-F.</given-names> <surname>Tsai</surname></string-name>,
<string-name><given-names>K.</given-names> <surname>Chen</surname></string-name>,
<string-name><given-names>Y.-H.</given-names> <surname>Hu</surname></string-name>,
<string-name><given-names>W.-K.</given-names> <surname>Chen</surname></string-name>,
<article-title>Improving text summarization of online hotel reviews with review helpfulness and sentiment</article-title>,
<source>Tourism Management</source>
          <volume>80</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Learning to generate product reviews from attributes</article-title>
          ,
          <source>in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>623</fpage>
          -
          <lpage>632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
,
<string-name><given-names>S.</given-names> <surname>Ren</surname></string-name>,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proc. IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Q.-T.</given-names>
            <surname>Truong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lauw</surname>
          </string-name>
          ,
          <article-title>Multimodal review generation for recommender systems</article-title>
          ,
          <source>in: The World Wide Web Conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1864</fpage>
          -
          <lpage>1874</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
,
          <article-title>Knowledge-enhanced personalized review generation with capsule graph neural network</article-title>
          ,
          <source>in: Proceedings of the 29th ACM International Conference on Information Knowledge Management</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>735</fpage>
          -
          <lpage>744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
,
<string-name><given-names>J.</given-names> <surname>McAuley</surname></string-name>,
<string-name><given-names>B. P.</given-names> <surname>Majumder</surname></string-name>,
<article-title>Factual and informative review generation for explainable recommendation</article-title>,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>37</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>13816</fpage>
          -
          <lpage>13824</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Improved baselines with visual instruction tuning</article-title>
          ,
          <source>Proc. IEEE/CVF Conference on Computer Vision</source>
          and Pattern
          <string-name>
            <surname>Recognition</surname>
          </string-name>
          (
          <year>2024</year>
          )
          <fpage>26296</fpage>
          -
          <lpage>26306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>X. V.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lomeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
,
<string-name><given-names>G.</given-names> <surname>Szilvasy</surname></string-name>,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
<article-title>RA-DIT: Retrieval-augmented dual instruction tuning</article-title>
          ,
<source>in: The Twelfth International Conference on Learning Representations</source>,
<year>2024</year>, pp.
<fpage>2206</fpage>-<lpage>2240</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
[18]
<string-name><given-names>T.</given-names> <surname>Zhang</surname></string-name>,
<string-name><given-names>S. G.</given-names> <surname>Patil</surname></string-name>,
<string-name><given-names>N.</given-names> <surname>Jain</surname></string-name>,
<string-name><given-names>S.</given-names> <surname>Shen</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Zaharia</surname></string-name>,
<string-name><given-names>I.</given-names> <surname>Stoica</surname></string-name>,
<string-name><given-names>J. E.</given-names> <surname>Gonzalez</surname></string-name>,
<article-title>RAFT: Adapting language model to domain specific RAG</article-title>, in: arXiv preprint,
<year>2024</year>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
<string-name><given-names>M.</given-names> <surname>Zhu</surname></string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Fine-tuning large language model based explainable recommendation with explainable quality reward</article-title>
,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>38</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>9250</fpage>
          -
          <lpage>9259</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
[20]
<string-name><given-names>S.-J.</given-names> <surname>Park</surname></string-name>,
<string-name><given-names>D.-K.</given-names> <surname>Chae</surname></string-name>,
<string-name><given-names>H.-K.</given-names> <surname>Bae</surname></string-name>,
<string-name><given-names>S.</given-names> <surname>Park</surname></string-name>,
<string-name><given-names>S.-W.</given-names> <surname>Kim</surname></string-name>,
<article-title>Reinforcement learning over sentiment-augmented knowledge graphs towards accurate and explainable recommendation</article-title>
          ,
          <source>in: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>784</fpage>
          -
          <lpage>793</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Anonymous</surname>
          </string-name>
          ,
<article-title>LLaVA-Tour: Creation of a large-scale multimodal model specializing in Japanese tourism data</article-title>,
<source>in: Proceedings of IEEE Visual Communications and Image Processing (under review)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
[22]
<string-name><given-names>M. L.</given-names> <surname>Cheung</surname></string-name>,
<string-name><given-names>W. K. S.</given-names> <surname>Leung</surname></string-name>,
<string-name><given-names>J.-H.</given-names> <surname>Cheah</surname></string-name>,
<string-name><given-names>H.</given-names> <surname>Ting</surname></string-name>,
<article-title>Effects of user-provided photos on hotel review helpfulness: An analytical approach with deep learning</article-title>
          ,
          <source>International Journal of Hospitality Management</source>
          <volume>71</volume>
          (
          <year>2018</year>
          )
          <fpage>120</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
[23]
<string-name><given-names>M. L.</given-names> <surname>Cheung</surname></string-name>,
<string-name><given-names>W. K. S.</given-names> <surname>Leung</surname></string-name>,
<string-name><given-names>J.-H.</given-names> <surname>Cheah</surname></string-name>,
<string-name><given-names>H.</given-names> <surname>Ting</surname></string-name>,
<article-title>Exploring the effectiveness of emotional and rational user-generated contents in digital tourism platforms</article-title>,
<source>Journal of Vacation Marketing</source>
          <volume>28</volume>
          (
          <year>2022</year>
          )
          <fpage>152</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
[24]
<string-name><given-names>D.</given-names> <surname>Gursoy</surname></string-name>,
<string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
<string-name><given-names>H.</given-names> <surname>Song</surname></string-name>,
<article-title>ChatGPT and the hospitality and tourism industry: an overview of current trends and future research directions</article-title>, in:
<source>Journal of Hospitality Marketing &amp; Management</source>
          , volume
          <volume>32</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>579</fpage>
          -
          <lpage>592</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
[25]
<string-name><given-names>Y. K.</given-names> <surname>Dwivedi</surname></string-name>,
<string-name><given-names>N.</given-names> <surname>Pandey</surname></string-name>,
<string-name><given-names>W.</given-names> <surname>Currie</surname></string-name>,
<string-name><given-names>A.</given-names> <surname>Micu</surname></string-name>,
<article-title>Leveraging ChatGPT and other generative artificial intelligence (AI)-based applications in the hospitality and tourism industry: practices, challenges and research agenda</article-title>
          , in:
          <source>International Journal of Contemporary Hospitality Management</source>
          , volume
          <volume>36</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
[26]
<string-name><given-names>L.</given-names> <surname>Li</surname></string-name>,
<string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
<string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>,
<article-title>Personalized transformer for explainable recommendation</article-title>
          ,
          <source>in: Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4947</fpage>
          -
          <lpage>4957</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
,
<string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>,
          <article-title>Personalized prompt learning for explainable recommendation</article-title>
          ,
          <source>in: ACM Transactions on Information Systems</source>
          , volume
          <volume>41</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bing</surname>
          </string-name>
,
<string-name><given-names>W.</given-names> <surname>Lam</surname></string-name>,
          <article-title>Neural rating regression with abstractive tips generation for recommendation</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>345</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
<string-name><given-names>J.</given-names> <surname>McAuley</surname></string-name>,
<article-title>Personalized review generation by expanding phrases and attending on aspect-aware representations</article-title>
          ,
          <source>in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>706</fpage>
          -
          <lpage>711</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V.</given-names>
            <surname>andand Julian</surname>
          </string-name>
          <string-name>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>Generative concatenative nets jointly learn to write and classify reviews</article-title>
          , in: arxiv preprint,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuzhilin</surname>
          </string-name>
          ,
          <article-title>Towards controllable and personalized review generation</article-title>
          ,
<source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , volume
          <volume>1</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>3237</fpage>
          -
          <lpage>3245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
,
<string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>,
<article-title>Long text generation via adversarial training with leaked information</article-title>,
<source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>, volume
<volume>32</volume>,
<year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Hada</surname>
          </string-name>
          , V. M.,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Shevade</surname>
          </string-name>
,
<article-title>ReXPlug: Explainable recommendation using plug and play language model</article-title>
          ,
          <source>in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , volume
          <volume>1</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Aspect-level sentiment-controllable review generation with mutual learning framework</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>35</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>12639</fpage>
          -
          <lpage>12647</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Generating long and informative reviews with aspect-aware coarse-to-fine decoding</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1969</fpage>
          -
          <lpage>1979</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
<string-name><given-names>I.</given-names> <surname>Polosukhin</surname></string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
,
<string-name><given-names>M.-A.</given-names> <surname>Lachaux</surname></string-name>,
<string-name><given-names>T.</given-names> <surname>Lacroix</surname></string-name>,
<string-name><given-names>B.</given-names> <surname>Rozière</surname></string-name>, et al.,
<article-title>LLaMA: Open and efficient foundation language models</article-title>,
arXiv preprint (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Language Models are Unsupervised Multitask Learners</article-title>
          ,
          <source>Technical Report, OpenAI</source>
          ,
          <year>2019</year>
          . URL: https://api.semanticscholar.org/CorpusID:160025533.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Proc. Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Visual instruction tuning</article-title>
          ,
          <source>in: Proc. Advances in Neural Information Processing Systems</source>
          , volume
          <volume>36</volume>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>GPT-4 technical report</article-title>
          , arXiv preprint (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond</article-title>
          , arXiv preprint (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          , et al.,
          <article-title>Finetuned language models are zero-shot learners</article-title>
          ,
          <source>in: Proc. International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>ERNIE: Enhanced language representation with informative entities</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1441</fpage>
          -
          <lpage>1451</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <article-title>Knowledge graph-augmented language models for knowledge-grounded dialogue generation</article-title>
          , arXiv preprint (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Laurent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bresson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hooi</surname>
          </string-name>
          ,
          <article-title>G-Retriever: Retrieval-augmented generation for textual graph understanding and question answering</article-title>
          , arXiv preprint (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Haffari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <article-title>Reasoning on graphs: Faithful and interpretable large language model reasoning</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>