A Multimodal Dataset and Benchmark for Tourism Review Generation

Hiromasa Yamanishi1, Ling Xiao1,* and Toshihiko Yamasaki1
1 The University of Tokyo, Hongō, Bunkyo, Tokyo 113-8656, Japan

Abstract
This paper addresses the challenge of generating accurate and contextually relevant tourism reviews, which are essential for assisting travelers in writing reviews and for allowing businesses to predict how different users will react to tourist spots. To address this problem, we introduce the first multimodal dataset for Japanese tourism review generation, TourMix1M, which contains one million review instances under various conditions, including images, user attributes, user profiles, review ratings, review length, key phrases, and visit seasons, collected from Japanese tourism websites. Based on this dataset, we develop a baseline model for multimodal review generation, LLaVA-Review, by performing instruction tuning of the LLaVA model. Furthermore, to enhance domain knowledge, we apply retrieval-augmented fine-tuning with aggregated tourism review data, exploring two types of knowledge representations: one incorporating noun and adjective information from a Sentiment Aware Knowledge Graph, and another using aspect-based summaries of reviews. Experimental results show that LLaVA-Review outperforms existing models in review generation and adapts well to various conditioning factors, with accuracy improving when gender, visit month, review length, key phrases, and user profile information are included in the prompt. Furthermore, retrieval-augmented fine-tuning using tourism information effectively improved accuracy for both types of knowledge representations.

Keywords
Tourism, Review Generation, Conditional Review Generation, Large Multimodal Model

1. Introduction
Tourism is one of the most crucial sectors in the global economy, contributing more than 15 trillion US dollars [1], and is enjoyed for various purposes, such as relaxation, exploration, and education.
Online platforms such as TripAdvisor and Google Maps play essential roles in tourism [2]. In tourism services, user-generated content (UGC), such as reviews and photos of tourist spots, plays a vital role for both tourism businesses and individual tourists. By collecting diverse aspects and opinions about these locations, tourism businesses can further improve their services, enhance their credibility, and increase profits. At the same time, UGC helps shape tourists' perceptions of tourist destinations and significantly influences travel planning [3, 4]. As tourism demand diversifies, understanding the preferences, needs, and objectives of different segments is crucial for identifying market opportunities. The literature shows that UGC reflects differences in trends based on user attributes and preferences [5, 6, 7], and analyzing these trends has provided valuable insights for improving tourist destinations and for effective marketing targeting [8, 9, 10].

This paper focuses specifically on review generation. There are two primary applications for review generation. First, presenting automatically generated reviews to users can significantly reduce their effort [11] and encourage more review posting. Additionally, presenting generated reviews to users is beneficial for recommendation systems [12]. Second, predicting user reactions, such as "what a man in his 50s who prefers leisurely tourism would say," can be valuable for businesses in marketing and improving tourist destinations.

Recent advancements in deep learning and natural language processing (NLP) have enabled the generation of high-quality reviews [11, 13, 14, 15]. In [11], recurrent neural networks based on long short-term memory (LSTM) were developed for user-conditioned product review generation. Truong et al. [13] improved review generation by leveraging image information. Li et al. [14] enhanced product-related factual information by leveraging knowledge graphs. Xie et al. [15] utilized existing reviews as additional context and generated reviews using GPT-2. However, although various datasets exist, relatively few studies in the review generation literature focus on tourism-related data. In particular, there is a significant gap in research that employs multimodal datasets combining diverse contexts, such as images, rich user attributes, and review content, specifically tailored for tourism.

To address this gap, this research creates a multimodal review generation dataset, TourMix1M, specifically tailored for tourism. TourMix1M is created from data collected from Japanese tourism websites, comprising 470,000 images, 510,000 reviews, and a total of one million training instances. We will release a list of URLs for the collected data at https://github.com/HiromasaYamanishi/TourMix1M. Our dataset supports the generation of reviews based on various contexts such as images, age, gender, season, user profiles, review length, and ratings. This is the first dataset specialized in generating tourism reviews that incorporates diverse information such as images and user data, promoting the understanding and advancement of review generation in the tourism sector. We also propose a baseline model for multimodal review generation, LLaVA-Review.

________________
Workshop on Recommenders in Tourism (RecTour 2024), October 18th, 2024, co-located with the 18th ACM Conference on Recommender Systems, Bari, Italy.
* Corresponding author. Email: ling@cvm.t.u-tokyo.ac.jp
yamanishi@cvm.t.u-tokyo.ac.jp (H. Yamanishi); ling@cvm.t.u-tokyo.ac.jp (L. Xiao); yamasaki@cvm.t.u-tokyo.ac.jp (T. Yamasaki)
ORCID: 0009-0006-8289-757X (H. Yamanishi); 0000-0002-4650-8841 (L. Xiao); 0000-0002-1784-2314 (T. Yamasaki)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
LLaVA-Review was developed by instruction-tuning the open-source LLaVA model [16]. Compared to previous multimodal review generation models [13], LLaVA-Review offers greater flexibility in handling different conditions by changing natural language instructions. LLaVA-Review outperformed the state-of-the-art large multimodal model (LMM) ChatGPT-4o and existing review generation models on various metrics in experiments conducted on our dataset, particularly in terms of BLEU, ROUGE, and consistency with user opinions. By incorporating tourism knowledge and user opinions when fine-tuning LLaVA, LLaVA-Review also demonstrated stronger adaptability to diverse tourism-related conditioning.

Furthermore, inspired by recent research on enhancing domain specificity with retrieval-augmented fine-tuning [17, 18, 19], we incorporate aggregated information from tourist spot reviews during the fine-tuning and inference stages. We explored two approaches: the subgraph method and the aspect-based summary method. The subgraph method adds noun and adjective information extracted from a Sentiment Aware Knowledge Graph [20]. The aspect-based summary method uses aspect-based summaries generated from reviews. Experimental results showed that the subgraph method increased BLEU by 12% and user opinion accuracy by 11%, while the aspect-based summary method improved proper noun variety by 18% and domain knowledge by 4%.

One related submission to this paper is [21], which develops a multimodal model for landmark recognition and review generation. However, this work differs from [21] in two key aspects: 1) Dataset: The dataset used in [21] addresses limited conditions such as gender, age, group, and key phrases, whereas TourMix1M includes a wider range of factors such as visit season, user profiles, ratings, and review length, providing a more comprehensive foundation for tourism review generation.
2) Model design: [21] addresses review generation as part of a multitask framework trained by standard instruction-tuning. Our research explores the use of information aggregated from reviews to improve generation quality.

Overall, our contributions are threefold:
• We created a multimodal dataset, TourMix1M, for Japanese tourism review generation. TourMix1M has one million instances and ten types of diverse conditions. The URLs used in the dataset construction will be made publicly available.
• We proposed LLaVA-Review, which outperformed state-of-the-art large multimodal models and review generation models in general review generation. LLaVA-Review showed strong adaptability to different conditions, and we analyzed variations in its outputs across these conditions.
• We propose two retrieval-augmented fine-tuning methods: aggregating subgraphs from a Sentiment Aware Knowledge Graph and aspect-based summaries from reviews, both of which enhanced accuracy.

2. Related Works

2.1. User-Generated Content in Tourism
Tourism is a key sector in the global economy, accounting for over 10% of global economic output [1]. As tourism is often a high-cost, once-in-a-lifetime experience, UGC on social platforms plays a crucial role in helping tourists decide what to visit. UGC, typically in the form of reviews and images, offers valuable insights. Reviews express user opinions on aspects such as content, price, and service, while images enhance understanding of intangible experiences such as tourist destinations [2]. Furthermore, images complement textual information and function as visual evidence substantiating the content, enhancing the usefulness of reviews when the two are combined [22]. UGC significantly influences travel planning and behavior by shaping both cognitive and emotional perceptions of tourist destinations [4, 23, 3]. Analyzing UGC is vital for marketing and improving tourist destinations.
Text mining has shown that reviews reflect user satisfaction and preferences based on factors such as gender [5], age, companions [6], and season [7]. For example, [5] finds that men are more interested in historical sites, while women prefer landscapes and rural areas. Similarly, [6] shows that couples tend to leave reviews with higher satisfaction levels. Analyzing tourist reviews with deep learning techniques, such as topic modeling and sentiment classification, is effective for marketing and destination improvement [8, 9]. Recently, large language models such as ChatGPT have shown potential for applications such as automated replies to customer questions and requests in tourism through prompt engineering [24, 25]. Our research focuses on developing large-scale models using tourism-specific corpora and images.

2.2. Review Generation
Review generation helps users write reviews [11] and improves the trustworthiness of recommendations [26, 12]. Unlike explanation generation [26, 27] or tip generation [28], review generation aims to create longer texts that cover multiple aspects. Various models have been developed for review generation using recurrent neural networks (RNNs) [11, 14, 29, 30], generative adversarial networks (GANs) [31, 32], and large-scale models [33]. Typically, these models generate reviews based on inputs such as user data, product information, ratings, or sentiment polarity. Furthermore, some research explores generation conditioned on images [13] or on aspects [34] such as Environment, Service, and Price. Additionally, guiding the generation process with information extracted from reviews is effective in improving quality and factuality. For instance, [29] identified relevant aspects between users and items and guided generation using words from those aspects. Similarly, [35] proposed a method that first generates an aspect sequence and then performs review generation from coarse to fine.
In addition, [14] expanded a Freebase-based knowledge graph using user and item reviews, capturing user preferences for each aspect via a capsule graph neural network. Furthermore, [15] generated reviews by inputting past reviews and key terms into GPT-2 and asking questions such as "What was great?" to guide the generation. Research on review generation in the tourism domain remains underexplored, and one of the biggest challenges is the lack of a dataset. One related work under submission is [21], where review generation is addressed as part of a multitask framework, with conditions limited to user attributes and keywords. This paper focuses on creating a dataset specifically for review generation with a broader range of ten conditions, such as visit timing, user profiles, ratings, and review length. We also apply a large-scale multimodal model to multimodal review generation for the first time.

2.3. Retrieval-augmented Large-Scale Model Fine-tuning
Large-scale models (LMs) based on the Transformer architecture [36] have shown high performance across various tasks, with notable examples including large language models (LLMs) such as LLaMA [37] and GPT [38, 39], and large multimodal models (LMMs) such as LLaVA [40, 16], ChatGPT-4 [41], and QwenVL [42]. These LMMs align language with images and can generate text across diverse inputs and conditions.

Table 1: Examples of text-image pair construction using CLIP. The bold text represents sentences retrieved through image-sentence retrieval.

Image      | Spot Name                          | Review
[photo 2]  | KITTE Ootemachi                    | (Image-Sentence-Review) To avoid the crowds near Christmas, I went Christmas tree touring in mid-December. At KITTE, a large white Christmas tree was displayed in the atrium on the first floor entrance.
[photo 3]  | The canal and the stone warehouses | (Image-Review) This time, I walked along the canal at night. Illuminated by the gas lamps, I was satisfied with the beautiful scenery. After dinner, I walked all the way to the back and back again, making for a nice walk. The warehouses were also lit up and looked beautiful. I definitely recommend going at night.

However, these models often lack domain-specific knowledge. To address this, methods such as instruction-tuning [43] and retrieval-augmented generation (RAG) [44, 45, 46, 47, 48, 49, 17, 50, 18, 51] have been developed to enhance performance in specific domains. RAG, which incorporates external knowledge, including structured knowledge such as knowledge graphs [44, 45, 46, 47] and unstructured sources [48, 49, 17, 50, 18, 51], has proven effective for domain-specific tasks. Knowledge integration has proven beneficial during pre-training [48, 49, 44], fine-tuning [17, 18, 45], and inference [51, 46]. Our research aligns closely with studies such as [17, 18, 45], which focus on retrieval augmentation during both the fine-tuning and inference stages. Most research in this area targets question answering, where one or a few lengthy documents are retrieved. However, this approach may not be ideal for review generation: a small number of reviews may not capture all aspects and opinions of a tourist spot, while too many reviews can introduce noise and degrade quality. Although frameworks such as [19] train large language models using concept retrieval, the retrieved knowledge may still be insufficient. Thus, effective knowledge retrieval methods for tourism review generation require careful design.

3. Proposed Method
Considering that most tourists take photos and that their experiences vary with contextual information such as user attributes and visit timing, we have constructed a multimodal tourism review generation dataset. The created dataset accounts for ten conditions, including review length, user gender, user age, group, visit month, visit season, two types of user profiles, and rating. Moreover, we propose a high-performance LLaVA-Review model for review generation.
We also propose retrieval-augmented fine-tuning, in which aggregated review information is incorporated into the prompt during both training and inference. We propose two formats for aggregating review information: one based on subgraphs consisting of noun and adjective information extracted from a Sentiment Aware Knowledge Graph [20], and another based on aspect-based review summaries. Each of these aspects will be detailed in the following sections.

Figures 4, 6, 7, 8, and 10, as well as Tables 1 and 3, present original reviews alongside generated examples; all were originally in Japanese and have been translated into English for demonstration purposes in this paper.

2 https://cdn.jalan.jp/jalan/img/6/kuchikomi/3776/KL/6b4e1_0003776908_1.webp
3 https://cdn.jalan.jp/jalan/img/0/kuchikomi/4140/KL/3a837_0004140092_1.webp

Component     | Count
Dialogues     | 1,000,000
Prompts       | 1,310,000
Reviews       | 545,891
Images        | 476,167
Tourism Spots | 51,011

Figure 1: Dataset statistics. Left: proportions of the three tasks (Short Review Generation, General Review Generation, and Conditional Review Generation), with the latter two collectively referred to as long review generation; "RG" refers to review generation. Middle and right: distribution of conditions in short and long review generation. The table on the right summarizes the dataset components.

Figure 2: Proportion of each of the 10 conditioning types within the whole dataset. We constructed prompts by sampling attributes to increase the diversity of conditioning.

Figure 3: Comparison of categorical attribute distributions in 1) the TourMix1M dataset, 2) the sampled TourMix1M dataset, and 3) the original web data.

3.1. TourMix1M Dataset
We created a multimodal tourism dataset for review generation by utilizing data collected from the Japanese tourism website jalan.net4, with permission to use its data for non-commercial purposes.
We originally collected 470k images and 2.4 million reviews from the website for 51k tourist spots across Japan that had a sufficient number of images and reviews; this represents a large portion of the content available on the website. Initially, the collected images and texts were not always paired. To create image-text pairs, we employed the Contrastive Language-Image Pretraining (CLIP) [52] method, specifically a model tailored for the Japanese language5. Pairs were generated by matching images and reviews that are nearest neighbors in the embedding space. To create training pairs, we used three retrieval methods for diverse image-text matching: image-to-review retrieval, review-to-image retrieval, and image-sentence-review retrieval. Table 1 shows examples of the obtained image-text pairs. In the image-sentence-review retrieval process, sentences were first retrieved for each image, after which the full corresponding reviews were identified by locating the reviews containing those sentences. We constructed three review generation tasks based on the generated image-text pairs: Short Review Generation, General Review Generation, and Conditional Review Generation. General and Conditional Review Generation are collectively referred to as Long Review Generation. For Short Review Generation, image-sentence pairs obtained from image-sentence retrieval were used to generate concise reviews. For Long Review Generation, image-review pairs obtained from the three retrieval methods were used to generate reviews. The left side of Figure 1 shows the distribution of each task. As input for each task, in General Review Generation, instructions were given to generate reviews based solely on images and place names. In Short Review Generation, only one condition, rating, was applied.

4 https://www.jalan.net/
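The nearest-neighbor pairing step can be sketched as follows. This is a minimal illustration with toy vectors, assuming image and review embeddings have already been produced by a CLIP-style encoder (the actual pipeline uses a Japanese CLIP model); `pair_images_with_reviews` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def pair_images_with_reviews(image_embs: np.ndarray, review_embs: np.ndarray) -> list:
    """For each image embedding, return the index of the nearest review
    embedding by cosine similarity (image-to-review retrieval)."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    rev = review_embs / np.linalg.norm(review_embs, axis=1, keepdims=True)
    sim = img @ rev.T                     # (n_images, n_reviews) cosine matrix
    return sim.argmax(axis=1).tolist()    # nearest review per image

# Toy embeddings standing in for CLIP outputs: two images constructed to
# lie near reviews 2 and 0, respectively.
rng = np.random.default_rng(0)
reviews = rng.normal(size=(4, 8))
images = reviews[[2, 0]] + 0.01 * rng.normal(size=(2, 8))
pairs = pair_images_with_reviews(images, reviews)
```

Review-to-image retrieval is the same computation with the arguments swapped.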
For Conditional Review Generation, instructions were given to generate reviews based on combinations of images, place names, and various conditions. The conditions cover ten categories: review length, gender, age, group, visit month, season, two types of user profiles (tag and long), rating, and key phrases in the review. These variables encompass a wide range of conditions specific to tourism. While this research does not perform conditioning based on user ID, making the setup less personalized, it provides a more general, context- and attribute-based framework that is applicable even to cold-start users. For example, a new user, such as a man in his 50s who enjoys leisurely activities, could request a review for a spring visit to a tranquil garden or scenic nature trail, allowing the system to generate a review tailored to his preferences without prior interactions. Moreover, our dataset enables analysis of how different conditioning factors, such as age, gender, and season, influence the generated reviews, providing insights into diverse user experiences. Specifically, for the categorical variables, gender is either male or female; age is in ten-year increments from the 10s to the 90s; group is family, couple, friends, single, or other; rating is an integer between 1 and 5; visit month is an integer between 1 and 12; and season is spring, summer, autumn, or winter. For user profiles, we use two types: "tag" profiles, which are simple keyword-based summaries such as "history enthusiast," and "long" profiles, which provide more detailed descriptions such as "a curious traveler with a deep interest in local history and culture, enjoying museum visits and exploring traditional cuisine." These profiles are generated by prompting a large language model6 based on past reviews. For key phrases, we use an LLM to extract important parts of sentences, ranging from single words to short sentences, that users found positive or negative.
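As an illustration, probabilistic condition sampling and prompt construction might be combined as in the sketch below. The template, condition names, and probabilities here are simplified assumptions for demonstration; the dataset uses richer natural-language templates covering all ten condition types.

```python
import random

def build_prompt(spot: str, conds: dict, probs: dict, rng: random.Random) -> str:
    """Independently keep each condition with its configured probability,
    then render the kept conditions into a natural-language instruction."""
    kept = {k: v for k, v in conds.items() if rng.random() < probs.get(k, 0.0)}
    rating = f"{kept['rating']}-star " if "rating" in kept else ""
    writer = ""
    if "gender" in kept and "age" in kept:
        writer = f" as written by a {kept['gender']} in their {kept['age']}"
    return f"Generate a {rating}review for {spot}{writer} based on the image."

# Sampling the same instance several times yields different prompts,
# which is how condition diversity is increased during training.
rng = random.Random(7)
conds = {"rating": 4, "gender": "male", "age": "20s"}
probs = {"rating": 0.3, "gender": 0.3, "age": 0.3}
samples = [build_prompt("Sensoji", conds, probs, rng) for _ in range(3)]
```

With all probabilities set to 1.0, every condition is rendered; with 0.0, the prompt degenerates to the unconditional form used in General Review Generation.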
The attributes for each review are derived from the accompanying review metadata and the data of the user who wrote it. During training, instead of using all conditions for each instance, we perform condition sampling based on pre-defined probabilities to increase the diversity of condition combinations. The probabilities were chosen so that each conditioning factor appears in approximately 10% to 30% of the entire dataset. Figure 2 illustrates how frequently each attribute appears throughout the entire dataset. Figure 3 shows the distribution of categorical attributes such as gender, age, group, season, month, and rating across the original 2.4 million reviews, for both the TourMix1M dataset and the sampled TourMix1M dataset. The results indicate that the TourMix1M dataset closely replicates the original distribution. We release both the full set of conditions and the specific conditions used in the experiments. The middle and right sections of Figure 1 show the distribution of the number of conditions in short and long review generation, respectively.

For prompt construction, in the General Review Generation task, prompts are structured as "Generate a review for Sensoji based on the image." For Short Review Generation, prompts are phrased as "Generate a concise review for Sensoji based on the image," and for Conditional Review Generation, as "Generate a 4-star review for Sensoji using the keyword 'blossom' as written by a male in his 20s based on the image." The resulting training dataset comprises 1 million dialogues, 1.31 million prompts, 545,891 reviews, 476,167 images, and 51,011 tourist spots.

5 https://huggingface.co/rinna/japanese-clip-vit-b-16
6 https://huggingface.co/UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3

Figure 4: LLaVA-Review architecture. Images are split into patches and projected with an MLP layer. Image tokens are combined with review prompts and external information. Instruction-tuning is done using instruction-response pairs, as shown at the top of the figure, for the three review generation tasks.7

Figure 5: External knowledge for retrieval-augmented fine-tuning. Left: a subgraph from the sentiment-aware knowledge graph. Right: an aspect- and sentiment-based review summary.

3.2. LLaVA-Review

3.2.1. Model Architecture
The architecture of our proposed model is illustrated in Figure 4. The icons in the figure will represent LLaVA-Review in subsequent sections.
The baseline model we developed, LLaVA-Review, is based on LLaVA [40, 16], a prominent open-source large-scale multimodal model. In LLaVA, the image is initially divided into patches, which are then converted into image tokens by the image encoder. These tokens are transformed into the language space via a projector consisting of multilayer perceptrons. By simultaneously feeding instruction tokens and image tokens into the large language model, the model generates responses that incorporate image information. The training process consists of two stages: pretraining, where only the projector is trained, and fine-tuning, where both the large language model (LLM) and the projector are optimized. In the proposed LLaVA-Review, we performed only fine-tuning, starting from the model pretrained in [16]. Instruction-tuning [43] was employed as the fine-tuning method, which learns to generate responses using instruction-response pairs. Optimization is conducted by minimizing the negative log-likelihood of the response generation using the following loss function, where x represents tokens, k denotes the length of the instruction part, and T is the total length of the text:

    ℒ = − ∑_{t=k+1}^{T} log P(x_t | x_1, ..., x_{t−1}).    (1)

We used Vicuna-13B [53], which has strong capabilities in Japanese, as the language model, and CLIP as the image encoder. Additionally, to ensure efficient training and to prevent degradation of language capabilities, we employed a Low-Rank Adaptation (LoRA) [54]-based strategy for model training.

7 https://farm6.static.flickr.com/5018/5580357389_a472ea2466_z.jpg

3.2.2. Instruction-Tuning with External Knowledge
We propose utilizing knowledge extracted from existing reviews as references to enhance the quality of review generation. This external knowledge is incorporated during both the training and inference phases, with the extracted sentences added to the end of the review generation prompts.
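A minimal NumPy sketch of Eq. (1) is shown below, masking out the k instruction tokens so that only response tokens contribute to the loss. The toy logits are illustrative; in practice this masking is implemented via label masking inside the training framework.

```python
import numpy as np

def instruction_tuning_loss(logits: np.ndarray, tokens: list, k: int) -> float:
    """Eq. (1): sum of -log P(x_t | x_1..x_{t-1}) over response tokens
    t = k+1..T (1-indexed). logits[i] is the model's next-token
    distribution after seeing tokens[0..i], i.e. it predicts tokens[i+1]."""
    z = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    nll = 0.0
    for i in range(k, len(tokens)):                  # skip instruction tokens
        nll -= float(np.log(probs[i - 1, tokens[i]]))
    return nll

# Toy example: T = 3 tokens, k = 1 instruction token, vocabulary of size 3.
tokens = [0, 1, 2]
logits = np.log(np.array([[0.1, 0.8, 0.1],   # predicts tokens[1]
                          [0.2, 0.2, 0.6]])) # predicts tokens[2]
loss = instruction_tuning_loss(logits, tokens, k=1)  # -(log 0.8 + log 0.6)
```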
Tourist destination reviews typically reflect diverse perspectives and opinions, resulting in a large volume of content. Relying on only a few reviews may fail to fully capture this diversity, while using the entire review text as input risks introducing noise. To address these challenges, we empirically evaluated various external knowledge aggregation methods that effectively capture domain knowledge and user opinions. Figure 5 illustrates the two proposed methods. We utilized external knowledge constructed from 2.4 million reviews that do not overlap with the test reviews.

Subgraph-based method. This method samples from a Sentiment Aware Knowledge Graph (SAKG), a knowledge graph that incorporates user opinions and sentiment information. The SAKG in this research is represented as G = {(e_h, r, e_t) | e_h, e_t ∈ E, r ∈ R}, where E represents entities, R represents relationships, and a triplet (e_h, r, e_t) denotes a relationship r from a head entity e_h (e.g., Ueno Park) to a tail entity e_t (e.g., Panda). Unlike previous work [20], this graph uses edges to represent adjectives and their frequency of use, such as (cute, 9) and (big, 4). The graph is constructed by extracting noun-adjective pairs from reviews using syntactic parsing and aggregating them for each tourist spot. During training, between 1 and 5 entities related to the target destination are first sampled based on edge frequency; relationships associated with the sampled entities are then selected in the same way. During inference, the top-k entities and relationships by total edge frequency are selected and incorporated into the prompt as natural language.

Summary-based method. This method adds aspect-based summaries. Previous research on review summarization has shown that aspect-based summaries facilitate a broader understanding of items [55, 10]. Furthermore, incorporating aspect information is crucial in review generation [55].
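The training-time sampling for the subgraph-based method described above can be sketched as follows, using a toy SAKG fragment (entity -> list of (adjective, frequency) edges). The entity names and counts are illustrative, not taken from the actual graph.

```python
import random

# Toy SAKG fragment for one tourist spot.
SAKG = {
    "panda":   [("cute", 9), ("big", 3)],
    "blossom": [("beautiful", 12), ("awesome", 5)],
    "zoo":     [("cheap", 5), ("enjoyable", 10)],
}

def sample_subgraph(graph: dict, n_entities: int, n_edges: int,
                    rng: random.Random) -> dict:
    """Sample entities with probability proportional to their total edge
    frequency, then sample each chosen entity's edges the same way.
    (At inference time the paper instead takes the top-k by frequency.)"""
    entities = list(graph)
    ent_weights = [sum(f for _, f in graph[e]) for e in entities]
    chosen = set(rng.choices(entities, weights=ent_weights, k=n_entities))
    subgraph = {}
    for e in chosen:
        edge_weights = [f for _, f in graph[e]]
        subgraph[e] = rng.choices(graph[e], weights=edge_weights,
                                  k=min(n_edges, len(graph[e])))
    return subgraph

sub = sample_subgraph(SAKG, n_entities=2, n_edges=1, rng=random.Random(0))
```

The sampled entity-adjective pairs would then be verbalized (e.g., "panda: cute") and appended to the review generation prompt.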
In this research, we input reviews into a large language model and prompt it to generate a summary in a single stage. Specifically, for each tourist destination, up to 50 reviews are selected as input, with a total length of less than 5,000 characters. The prompt instructs the model to create summaries of around 1,000 characters covering an overview, positive opinions, and negative opinions for the key content elements, including price, service, food and drink, facilities, transportation access, and seasonal events.

4. Experimental Results

4.1. Experimental Setup
The TourMix1M dataset was employed to train a model for generating reviews based on inputs such as images, tourist destination names, and various contextual conditions. To create the test data, images not used in the training set were first selected, and then the nearest corresponding reviews were retrieved using CLIP embeddings. Only image-review pairs where both the image and the review were absent from the training data were included in the test set. The training of LLaVA-Review was conducted using eight 48 GB RTX 6000 Ada GPUs with a batch size of 80 and a learning rate of 2 × 10⁻⁴, taking approximately 37 hours per epoch. When incorporating external knowledge, the training time increased to 41 hours for the subgraph-based method and 60 hours for the summary-based method. Evaluation was performed on 1,000 image-review pairs that did not overlap with the training data. In this research, the key characteristics of effective review generation are 1) the integration of image information, 2) high text quality, 3) the incorporation of detailed information about tourist destinations, and 4) accounting for user opinions, particularly the collective sentiment regarding tourist experiences. The evaluation metrics included BLEU [56], ROUGE-1, ROUGE-L [57], CIDEr [58], diversity (DIV), the number of unique proper nouns (PROPN), TFIDF-F1 score, and Senti-F1 score.
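The review-selection budget for the summarizer described above (up to 50 reviews, total length under 5,000 characters) can be sketched as a simple greedy filter. This is an assumed implementation of the stated selection rule, not the authors' code.

```python
def select_reviews_for_summary(reviews: list, max_reviews: int = 50,
                               max_chars: int = 5000) -> list:
    """Greedily keep reviews while staying within both the count budget
    and the total-character budget used as summarizer input."""
    selected, total = [], 0
    for review in reviews:
        if len(selected) == max_reviews:
            break
        if total + len(review) >= max_chars:
            continue  # this review would exceed the character budget; skip it
        selected.append(review)
        total += len(review)
    return selected

# A 3,000-char review fits, a second one would exceed the budget and is
# skipped, and a final 1,000-char review still fits.
kept = select_reviews_for_summary(["a" * 3000, "b" * 3000, "c" * 1000])
```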
BLEU, ROUGE, and CIDEr were used for overall text quality. BLEU and ROUGE were calculated with the sumeval library8, and CIDEr with the pycocoevalcap library9. Diversity was assessed based on adjectives, nouns, proper nouns, and verbs; it was calculated by measuring the overlap of these features between generated sentences, using part-of-speech information from GiNZA10, following the approach of [59]. The number of unique proper nouns was calculated by dividing the total number of unique proper nouns across all generated reviews by the number of reviews. To evaluate domain knowledge, the TFIDF-F1 metric was computed by identifying the top 15 TFIDF words for each tourist spot and calculating the F1-score between these top words and those in the generated reviews. The Senti-F1 metric, based on Aspect-Based Sentiment Analysis (ABSA) [60], measures how well user opinions are reflected, by extracting (feature, opinion, sentiment) triplets from the text via a large language model. F1 scores for the alignment of (feature, sentiment) and (feature, opinion) pairs were averaged to produce the Senti-F1 score. For all metrics except length, higher values indicate better performance.

The comparison methods include MRG [13], PETER [26], PEPLER [27], LLaVA-1.5 [16], ChatGPT-4V, and ChatGPT-4o. MRG is a multimodal review generation model based on LSTM [61]; we use VGG16 [62] as its vision backbone. PETER is an explanation generation model based on the Transformer architecture, while PEPLER is based on GPT-2 [38]; we utilized a GPT-2 model11 trained on Japanese data for PEPLER. In PETER and PEPLER, image features were extracted using ResNet [63], reduced via PCA [64], and clustered with KMeans [65] to generate a photo_id, which was used in place of a user_id. LLaVA-1.5 is an open-source large multimodal model, while ChatGPT-4V [41] and ChatGPT-4o are closed-source large multimodal models known for their state-of-the-art knowledge and language capabilities.
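Returning to the TFIDF-F1 metric described above, a simplified sketch is given below. It uses whitespace tokenization for readability, whereas the paper operates on Japanese text with GiNZA-based morphological analysis; `tfidf_f1` and its exact precision/recall definitions are assumptions for illustration.

```python
def tfidf_f1(top_words: set, generated_review: str) -> float:
    """F1 between a spot's top TF-IDF words and the set of words in a
    generated review (hypothetical simplified formulation)."""
    gen = set(generated_review.split())
    hits = len(top_words & gen)
    if hits == 0:
        return 0.0
    precision = hits / len(gen)        # fraction of generated words that are top words
    recall = hits / len(top_words)     # fraction of top words that were generated
    return 2 * precision * recall / (precision + recall)

# Two of the three top words appear among the three generated words.
score = tfidf_f1({"panda", "zoo", "blossom"}, "panda zoo beautiful")
```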
For large multimodal models, prompts such as "You are a tourist who visited location. Generate a review based on the image." are used. Since models other than the proposed method tend to generate verbose text, we instruct in the prompt that the generated review should be approximately 100 characters. For retrieval-augmented fine-tuning, we employed two methods: one that extracts entities for comparison with subgraphs, and another that retrieves up to seven reviews based on CLIP image similarity for comparison with summaries. During inference, we used four entities and three relations in the subgraph method.

4.2. General Review Generation

Table 2 presents the quantitative evaluation results of review generation. General LMMs such as LLaVA-1.5 and ChatGPT-4V achieved high ROUGE scores due to their strong linguistic capabilities; however, their knowledge of the tourism domain and of user opinions is limited. ChatGPT-4o demonstrates high quality in terms of domain specificity and understanding of user opinions, but it sometimes generates factually incorrect outputs, such as describing a crowded place as quiet. Additionally, some explanations lean towards generalizations and lack detailed knowledge of tourism spots. For fine-tuned models such as MRG, PETER, PEPLER, and LLaVA-Review, accuracy was generally improved in terms of the BLEU and ROUGE-L quality metrics, as well as the domain specificity and user opinion metrics.

8 https://github.com/chakki-works/sumeval
9 https://github.com/sks3i/pycocoevalcap
10 https://github.com/megagonlabs/ginza
11 https://huggingface.co/rinna/japanese-gpt2-medium
12 https://cdn.jalan.jp/jalan/img/2/kuchikomi/3622/KL/ed832_0003622962_1.webp
13 https://cdn.jalan.jp/jalan/img/5/kuchikomi/3905/KL/52041_0003905143_1.webp
14 https://cdn.jalan.jp/jalan/img/4/kuchikomi/0894/KL/de82f_0000894193_1.webp

Table 2: Results of General Review Generation. The first group compares models.
The second group compares retrieval-augmented fine-tuning with different knowledge sources. The last group shows the results of RAG during ChatGPT-4o inference. Bold text indicates the best performance in each group.

Model | BLEU | ROUGE-1 | ROUGE-L | CIDEr | DIV | PROPN | TFIDF-F1 | Senti-F1 | length
LLaVA-1.5 [16] | 0.683 | 0.254 | 0.162 | 0.099 | 0.863 | 0.303 | 0.141 | 0.029 | 133.8
ChatGPT-4V [41] | 0.622 | 0.250 | 0.165 | 0.103 | 0.955 | 0.278 | 0.169 | 0.036 | 70.9
ChatGPT-4o | 0.856 | 0.272 | 0.168 | 0.022 | 0.926 | 0.270 | 0.175 | 0.040 | 122.8
MRG [13] | 0.692 | 0.202 | 0.158 | 0.041 | 0.941 | 0.025 | 0.169 | 0.029 | 51.1
PETER [26] | 1.192 | 0.241 | 0.179 | 0.116 | 0.923 | 0.525 | 0.327 | 0.042 | 62.1
PEPLER [27] | 1.157 | 0.248 | 0.171 | 0.115 | 0.931 | 0.452 | 0.316 | 0.040 | 60.5
LLaVA-Review | 1.291 | 0.272 | 0.185 | 0.108 | 0.920 | 0.457 | 0.244 | 0.045 | 54.9
LLaVA-Review w/ entity | 1.320 | 0.271 | 0.184 | 0.111 | 0.921 | 0.441 | 0.243 | 0.045 | 56.1
LLaVA-Review w/ subgraph | 1.454 | 0.275 | 0.187 | 0.110 | 0.918 | 0.394 | 0.247 | 0.050 | 55.1
LLaVA-Review w/ 7 reviews | 1.175 | 0.258 | 0.178 | 0.100 | 0.937 | 0.491 | 0.230 | 0.049 | 53.1
LLaVA-Review w/ summary | 1.296 | 0.269 | 0.184 | 0.111 | 0.927 | 0.543 | 0.254 | 0.046 | 55.3
ChatGPT-4o + RAG (5 reviews) | 0.935 | 0.264 | 0.176 | 0.151 | 0.967 | 0.356 | 0.250 | 0.045 | 61.0
ChatGPT-4o + RAG (10 reviews) | 0.942 | 0.267 | 0.179 | 0.144 | 0.965 | 0.347 | 0.255 | 0.046 | 63.4
ChatGPT-4o + RAG (20 reviews) | 1.024 | 0.269 | 0.179 | 0.141 | 0.965 | 0.380 | 0.258 | 0.045 | 67.7

Prompt: This tourist spot is Isawa Onsen. Please generate a review based on the photo.
LLaVA: Since it is located a little away from the hot spring town, it is a quiet hot spring town. The room was spacious and clean. The staff was attentive, making our stay pleasant.
GPT-4o: My stay at the ryokan in the Isawa Onsen area was truly an amazing experience. The room had a calm, traditional Japanese atmosphere, and the outdoor bath was especially wonderful. The tranquility of the surroundings and the warmth of the hot spring healed both my body and mind. The attentive service was also impeccable, and I definitely want to visit again.
It was a perfect retreat to enjoy Japanese tradition.
MRG: Since it's located in a hot spring town, the bath is a very clean hot spring. Within the hot spring town, there is also a bath, and it's a very clean hot spring.
PETER: It is a hot spring town within walking distance from Isawa Onsen Station. The town is very quiet, and you can spend a relaxing time.
PEPLER: It is about a 10-minute walk from Isawa Onsen Station. There are various accommodations from large hotels to small inns. There are also foot baths and drinking fountains.
LLaVA-Review: The Isawa Onsen area is a hot spring destination located in the southern part of Yamanashi Prefecture. The hotel where I stayed also had a natural hot spring on the premises. It is said that this hot spring has the characteristic of making your skin smooth.
Ground-Truth: We stayed at Ooedo Onsen Monogatari with a baby, a 4-year-old child, and four adults. The bath was so nice that we wanted to go in many times, and it made our skin smooth.

Figure 6: Comparison of General Review Generation using different models. Each model is given an image and a tourism spot name as input. The proposed baseline accurately captures the image information, user opinions, and domain knowledge. 12

For MRG, due to the limited expressiveness of the model, the generated output tended to have a relatively small vocabulary and frequent repetitions, leading to lower overall accuracy. However, PETER achieved the highest CIDEr score and performed well in domain specificity and the number of proper nouns used, as its expressiveness was improved by the Transformer. PETER also exhibits limitations of repetitive outputs at the sentence or phrase level, likely due to its limited number of Transformer layers. In contrast, PEPLER consistently delivered superior performance in both review quality and domain relevance, producing natural and coherent outputs.
The proposed method, LLaVA-Review, leverages a large corpus of UGC and demonstrates excellent review generation performance across metrics, particularly in BLEU, ROUGE-L, and sentiment expression. This improvement is attributed to the model's effective capture of image features and its capability to incorporate relevant tourist information and user opinions, facilitated by its large parameterization. While it tends to generate shorter reviews, the length can be adjusted by specifying the desired review length in the prompt, as discussed later.

Table 3: Examples showing the generated reviews with different retrieval strategies. (a) shows subgraph retrieval while (b) shows summary retrieval. Subgraph retrieval enhances user opinion while summary retrieval enhances domain knowledge. 13 14

(a) Shimokuri Village
Gold response: There is a parking lot at the highest point of the village, and from there it took about 20 minutes to walk to the observation deck overlooking the village. Shimokuri Village opened in the deep mountains. It was indeed a heavenly village. The scenery is spectacular.
Without retrieval: This place became famous because of an old TV drama. If you keep climbing the mountain road, you will see a vast expanse of rice fields. This is Shimokuri Village.
With retrieval: It is a village in the mountains. The roads are narrow and there are few parking lots, but it is worth a visit. The view from the observation deck is wonderful.
Retrieved subgraph: (road, narrow, 18), (road, narrow, 8), (road, difficult, 6), (mountain road, narrow, 7), (mountain road, difficult, 5), (mountain road, good, 3), (scenery, wonderful, 5), (scenery, good, 3)

(b) Yamanashi Prefectural Art Museum
Gold response: I often go to see Millet's paintings in the permanent exhibition because I really like them. It's usually not very crowded, so I recommend it. I especially liked Millet's painting "Pauline," but when I visited, it was on loan overseas.
Without retrieval: The Yamanashi Prefectural Art Museum is located near Takeda Shrine in Kofu City. The permanent exhibition features many works by artists from Yamanashi.
With retrieval: I love Millet's paintings, so I visited the Yamanashi Prefectural Art Museum. Seeing Millet's works calms my mind and makes me feel liberated from the hustle and bustle of everyday life.
Retrieved summary (excerpt): The Yamanashi Prefectural Art Museum offers an extensive collection centered around Millet's works, providing a quiet and serene environment for visitors. The museum also features outdoor sculptures and a park, allowing visitors to enjoy both art appreciation and a leisurely stroll. With relatively easy access, the museum provides a range of services that visitors will appreciate, including discounts for local residents and special offers for those staying at nearby accommodations.

Figure 6 illustrates the generated review results for Isawa Onsen. ChatGPT-4o produces relatively descriptive and typical outputs, while PETER and PEPLER effectively capture the domain-specific information of the tourist spot. Notably, the proposed LLaVA-Review accurately identifies the presence of indoor hot springs and their characteristics from the images, resulting in high-quality generation.

Table 2 also presents the effects of incorporating external knowledge. While knowledge acquisition improved accuracy overall, the extent of improvement varied. Sampling from the knowledge graph contributed to improvements in BLEU by 12% and Senti-F1 by 11%, leading to better consideration of user opinions. In contrast, simply retrieving nouns showed limited improvements, highlighting the importance of including adjective information. Aspect-based summaries notably enhanced domain specificity and informativeness, increasing TFIDF-F1 by 4% and the number of proper nouns by 12%, without negatively affecting the quality metrics. However, directly retrieving reviews introduced noise, which lowered overall quality despite improved user opinion consideration.
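As a hedged sketch, the subgraph knowledge illustrated in Table 3(a) could be serialized into prompt context along the following lines. The rendering format is our assumption, and the entity/relation caps loosely mirror the four entities and three relations used at inference.

```python
def serialize_subgraph(triplets, max_entities=4, max_relations=3):
    """Render (noun, adjective, count) triplets as a compact text block,
    keeping the most frequent relations per entity."""
    by_entity = {}
    for noun, adj, count in triplets:
        by_entity.setdefault(noun, []).append((adj, count))
    lines = []
    for noun, rels in list(by_entity.items())[:max_entities]:
        rels = sorted(rels, key=lambda r: r[1], reverse=True)[:max_relations]
        adjs = ", ".join(f"{adj} ({count})" for adj, count in rels)
        lines.append(f"{noun}: {adjs}")
    return "\n".join(lines)
```

The resulting block would be prepended or appended to the generation prompt as retrieved knowledge.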
Table 3 shows specific examples. Finally, for ChatGPT-4o, we retrieved five, ten, and twenty reviews similar to the image using CLIP and added them to the original prompt during inference. This approach significantly boosted the original ChatGPT-4o's performance, achieving the highest CIDEr score and high TFIDF-F1 and Senti-F1 scores. However, the variants of LLaVA-Review show comparable accuracy across almost all metrics, confirming their high review generation performance.

4.3. Conditional Review Generation

Table 4 presents the quantitative results of conditioned review generation. LLaVA-1.5 shows minimal to no performance improvement when conditioning on attributes, as reflected in its low CIDEr scores for different conditions, such as gender (CIDEr: 0.014) and rating (CIDEr: 0.012). In contrast, LLaVA-Review

Table 4: Conditioned Review Generation Results. The first group shows the results of conditioning on LLaVA-1.5. The second group shows the results of conditioning on LLaVA-Review. Light blue indicates user attribute conditions, green represents style-related information, and yellow indicates time-related conditions.
Model | BLEU | ROUGE-1 | ROUGE-L | CIDEr | DIV | PROPN | TFIDF-F1 | Senti-F1 | length
LLaVA-1.5 [16] | 0.683 | 0.254 | 0.162 | 0.099 | 0.863 | 0.303 | 0.142 | 0.029 | 133.8
LLaVA-1.5 + gender | 0.687 | 0.254 | 0.163 | 0.014 | 0.874 | 0.248 | 0.141 | 0.024 | 120.5
LLaVA-1.5 + season | 0.627 | 0.255 | 0.163 | 0.012 | 0.869 | 0.253 | 0.139 | 0.024 | 120.5
LLaVA-1.5 + rating | 0.679 | 0.252 | 0.161 | 0.012 | 0.877 | 0.261 | 0.143 | 0.024 | 122.4
LLaVA-1.5 + length | 0.699 | 0.254 | 0.165 | 0.015 | 0.878 | 0.285 | 0.143 | 0.025 | 116.0
LLaVA-1.5 + profile (long) | 0.597 | 0.244 | 0.156 | 0.013 | 0.876 | 0.258 | 0.131 | 0.021 | 120.9
LLaVA-1.5 + key phrase | 2.699 | 0.287 | 0.184 | 0.030 | 0.885 | 0.312 | 0.152 | 0.058 | 124.3
LLaVA-Review | 1.291 | 0.272 | 0.185 | 0.108 | 0.920 | 0.457 | 0.244 | 0.045 | 54.9
LLaVA-Review + gender | 1.410 | 0.269 | 0.185 | 0.106 | 0.920 | 0.430 | 0.239 | 0.046 | 54.3
LLaVA-Review + age | 1.161 | 0.268 | 0.186 | 0.104 | 0.920 | 0.421 | 0.239 | 0.049 | 52.5
LLaVA-Review + tag | 1.195 | 0.276 | 0.187 | 0.110 | 0.919 | 0.425 | 0.249 | 0.050 | 56.2
LLaVA-Review + profile_tag | 1.510 | 0.273 | 0.186 | 0.117 | 0.919 | 0.431 | 0.240 | 0.045 | 54.7
LLaVA-Review + profile_long | 1.673 | 0.279 | 0.189 | 0.123 | 0.920 | 0.485 | 0.243 | 0.050 | 56.7
LLaVA-Review + rating | 1.320 | 0.270 | 0.186 | 0.103 | 0.920 | 0.447 | 0.243 | 0.047 | 54.5
LLaVA-Review + length | 1.952 | 0.308 | 0.198 | 0.184 | 0.923 | 0.510 | 0.244 | 0.048 | 87.3
LLaVA-Review + key phrase | 5.251 | 0.316 | 0.233 | 0.263 | 0.922 | 0.425 | 0.197 | 0.118 | 50.8
LLaVA-Review + season | 1.313 | 0.268 | 0.183 | 0.106 | 0.919 | 0.447 | 0.242 | 0.048 | 54.1
LLaVA-Review + month | 1.471 | 0.271 | 0.188 | 0.107 | 0.919 | 0.426 | 0.240 | 0.049 | 53.5

exhibits significant accuracy gains for certain attributes. LLaVA-1.5 fails to capture the characteristics of reviews influenced by user attributes and context, often directly outputting the conditions in the reviews. Although conditioning with key phrases improves accuracy in LLaVA-1.5, it is less effective than in LLaVA-Review. Because the attribute frequencies differ, as shown in Figure 2, directly comparing the effects of conditioning is difficult, but certain trends in accuracy improvements are observed across different attributes.
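For illustration, appending the conditioning factors of Table 4 to the generation prompt could look like the following. The template wording and field names here are our assumptions, as the exact TourMix1M instruction templates are not reproduced in the text.

```python
def build_prompt(spot: str, conditions: dict) -> str:
    """Assemble a generation prompt from a base instruction plus condition clauses."""
    parts = [f"This tourist spot is {spot}. Please generate a review based on the image."]
    templates = {
        "gender": "The reviewer is {v}.",
        "month": "The visit took place in {v}.",
        "length": "The review should be about {v} characters long.",
        "keyphrase": 'Include the phrase "{v}".',
        "profile": "Reviewer profile: {v}.",
    }
    for key, value in conditions.items():
        if key in templates:
            parts.append(templates[key].format(v=value))
    return " ".join(parts)
```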
Among user attributes, gender significantly improves generation quality, while age and group information have little impact, suggesting that gender has a stronger influence on shaping individual user reviews than age or group composition. For profiles, the short tag profile did not significantly improve accuracy, but long-form, detailed profiles reflecting styles and preferences, such as "relaxed" or "preference for natural landscapes," enhanced review generation accuracy. Incorporating more detailed profiles could be a direction for future work. For style conditions such as review length and rating, review length proved especially effective, significantly improving accuracy; the average deviation between generated and target lengths was just 4.1 characters, reflecting precise conditioning. Low ratings had some effect, but imbalanced training data led to occasional under-representation of negative opinions. Key phrase conditioning showed the highest accuracy gains. For temporal context, month-based conditioning outperformed seasonal conditioning, indicating greater effectiveness at a finer granularity.

Figure 7 illustrates examples of different conditioning: user demographics, age, user profile, rating, review length, and key phrases. For user demographic conditioning, characteristics of men, women, and families are reflected, such as mentioning "playground equipment" for families and "beautiful flowers" for women. For age conditioning, the twenties condition produced a lively writing style using exclamations, while the sixties condition produced more factual reviews related to historical knowledge. For review length, even with a requirement of 290 characters, reviews were generated without content redundancy or structural inconsistencies. For rating, under the negative condition of one star, reviews expressing disappointment over the lack of menu options were generated.
Generation based on keywords also demonstrated high fidelity, producing results that were very close to the ground truth.

15 https://cdn.jalan.jp/jalan/img/7/kuchikomi/2837/KL/d72d0_0002837701_1.webp
16 https://cdn.jalan.jp/jalan/img/8/kuchikomi/1388/KL/e88ab_0001388176_2.webp
17 https://cdn.jalan.jp/jalan/img/5/kuchikomi/4115/KL/09030_0004115266_1.webp
18 https://cdn.jalan.jp/jalan/img/3/kuchikomi/5203/KL/02e5b_0005203413_1.webp
19 https://cdn.jalan.jp/jalan/img/5/kuchikomi/4745/KL/be8e4_0004745614_4.webp

Prompt: You are a tourist of ○○ who visited Expo '70 Commemorative Park. Please write a review based on the image.
Male: Expo '70 Commemorative Park, which features the famous Tower of the Sun sculpture by Taro Okamoto, sometimes hosts flea markets on weekends.
Female: Expo '70 Commemorative Park is very spacious and a great place for a walk. Depending on the season, various flowers bloom, making it very beautiful.
Family: It is a very spacious park. Many flowers bloom, making it ideal for walking. There is also playground equipment, so children can enjoy it too.

Prompt: You are a tourist of age ○○ who visited Nagoya Castle. Please write a review based on the image.
20s: Nagoya Castle is a very magnificent castle. You can go inside, so please be sure to visit. I highly recommend it!
40s: If you visit Nagoya Castle during the cherry blossom season, you can enjoy and take photos with the castle in the background. Nearby, there are other attractions like the Nagoya City Science Museum and Kinshachi Yokocho, making it a place where you can enjoy a full day.
60s: Nagoya Castle was originally built by Toyotomi Hideyoshi. The current main keep was reconstructed in the Showa era. It is especially beautiful during the cherry blossom season.

Prompt: You are a tourist with a profile of ○○. Generate a review for Godaido Hall.
Profile (short), Prof: Nature lover. — It is located in a place where you can get a panoramic view of Matsushima Bay. Since it is close to the pier for sightseeing boats, I think it's good to visit both.
Profile (long), Prof: A calm traveler who loves the beauty of nature, especially the scenery of lakes and coastlines, and enjoys leisurely exploring tourist spots like Shiretoko and Lake Mashu. — It is located in a place where you can get a panoramic view. From inside Godaido Hall, you can see a great view of Matsushima Bay. Sightseeing boats also pass by, so I think it's a good spot for taking photos.
Ground Truth: It is an island near the Matsushima sightseeing boat pier, located after crossing three red bridges where you can see the bottom. In the center of the island, there is a hall, and it is a wonderful place with a panoramic view of Matsushima Bay.

Prompt: Generate a review for Nagoya Tokyu Hotel with a condition of ○○.
Length (290 chars): I went there with high expectations since it was lunch at a hotel. I chose the hamburger steak as the main dish, and it was tender and delicious. The side buffet had a wide variety of options, and I left feeling full. There was also dessert, but it was just ice cream. The coffee after the meal was self-service. Since it was a restaurant inside a hotel, the service and atmosphere were good. However, the line in front of the entrance was long, and it took time to get in. It can't be helped since it's a popular restaurant, but I think it would be better if they could widen the entrance or make some improvements. I used points to get a good deal, so I was satisfied. I'd like to come back and try a different menu next time.
Rating (1 star): I went there for lunch at a hotel with high expectations, but the menu was limited, and there were no desserts. There was a salad bar and a drink bar, but considering the price, it felt a bit... disappointing.

Prompt: Generate a review for Kunozan Toshogu Shrine using the keyword "more strenuous than I imagined" that matches the image.
Keyword: It was more strenuous than I imagined. I was drenched in sweat climbing the stairs. But the view after reaching the top was amazing.
Ground-Truth: I climbed the stone steps from the seaside to the Gorieki-dori side. It was more strenuous than I imagined, but the view was magnificent and I felt a great sense of accomplishment.

Figure 7: Conditional review generation examples. The first example shows generation based on user demographics. The second shows generation based on age. The third illustrates generation based on the user's profile. The fourth shows generation based on rating and review length. The fifth illustrates generation based on key phrases. 15 16 17 18 19

Figure 8: Word cloud of season-conditioned review generation, showing reflections such as cherry blossoms in spring, children in summer, and fall leaves in autumn.

Figure 9: Frequency differences in generated reviews for the top 15 words with notable gender-based frequency differences in the original reviews. Left: words with higher frequency for females. Right: words with higher frequency for males.

As an aggregate analysis, we generated reviews for the same image while changing only the seasonal instruction across the four seasons, then compared word frequencies in the output. As shown in Figure 8, distinct features appear: cherry blossoms in spring, children in summer, autumn leaves in fall, and hot springs in winter, highlighting the impact of seasonal conditioning.

Figure 9 provides a quantitative view of the conditioning effects under gender conditioning. In generated reviews, we compared word frequencies between male- and female-conditioned reviews using 1,000 test image-text pairs. For real reviews, we sampled ten reviews from men and three from women for the same tourist destination. The figure highlights the top 15 words with the largest (female − male) and (male − female) frequency differences in real reviews. In reviews originally written by women, certain words appear more frequently, and this pattern is reflected in the generated reviews.
All 15 words with higher frequency in women's original reviews also show higher frequency under female conditioning, with notable differences for "Go," "Beautiful," and "Spacious." In contrast, men's reviews contain fewer distinctive words, and the male-to-female ratio remains small in the generated reviews, maintaining the overall trend of higher male-conditioned word frequency.

Table 5: Results of Short Review Generation.

Model | BLEU | ROUGE-1 | ROUGE-L | CIDEr | length
LLaVA-Review | 1.172 | 0.215 | 0.190 | 0.146 | 19.1
ChatGPT-4o | 0.596 | 0.167 | 0.144 | 0.170 | 16.3

Table 6: Long Review Generation Results with and without Short Review Generation.

Model | BLEU | ROUGE-1 | ROUGE-L | CIDEr | DIV | PROPN | TFIDF-F1 | Senti-F1 | length
LLaVA-Review w/o Short Review | 1.291 | 0.272 | 0.185 | 0.108 | 0.920 | 0.457 | 0.244 | 0.045 | 54.9
LLaVA-Review w/ Short Review | 1.310 | 0.269 | 0.183 | 0.104 | 0.921 | 0.447 | 0.248 | 0.046 | 55.6

Figure 10: Attention visualization for image patches. The first row shows visualization results of LLaVA-1.5, and the second row shows outputs of LLaVA-Review. 20 21 22 23

4.4. Short Review Generation

Table 5 presents the results of Short Review Generation. The ground truth, as in General Review Generation, is a sentence retrieved by CLIP for each image, ensuring no overlap with the training data. A comparison between ChatGPT-4o and LLaVA-Review reveals that both models generate concise sentences of around 20 characters, with LLaVA-Review achieving higher ROUGE and BLEU scores. Additionally, an evaluation of the effect of fine-tuning with and without Short Review Generation on General Review Generation performance, shown in Table 6, indicated that all metrics changed by no more than 4%, suggesting minimal impact.

4.5. Visualization of Attention

The visualization results of the attention weights in the final layer of the language model during the generation of the first token of the review are shown in Figure 10.
Specifically, the visualization highlights the 576 tokens that correspond to image positions within the attention of the generated token, resized to 24 × 24 and presented as a heatmap. In the standard LLaVA-1.5 model, high attention values are concentrated in narrow regions of the image. In contrast, LLaVA-Review exhibits high attention values over a broader area of the image and across more extensive objects.

20 https://cdn.jalan.jp/jalan/img/1/kuchikomi/4251/KL/2c80c_0004251538_1.webp
21 https://cdn.jalan.jp/jalan/img/8/kuchikomi/4228/KL/2c27e_0004228816_3.webp
22 https://cdn.jalan.jp/jalan/img/1/kuchikomi/0011/KL/c5437_0000011649.webp
23 https://cdn.jalan.jp/jalan/img/1/kuchikomi/0821/KL/d60dd_0000821087_2.webp

5. Limitations and Future Works

In terms of the dataset, we are considering updates to improve the fidelity of image-text pair creation. Leveraging more detailed user information, such as all past reviews written by a user, or their actual behavior at tourist destinations when available, is also a promising direction. Additionally, while this research developed a dataset specific to Japanese data, tourist destinations and perceptions of tourism are known to vary by country; building a more comprehensive dataset that encompasses diverse languages and cultures remains a challenge for future research. Moreover, expanding the dataset to support more tasks, such as personalized product description generation and recommendation, is also future work.

In terms of model design, we will leverage more powerful language models as the backbone, incorporate more robust image information and domain knowledge, and develop more adaptive external documents. We will also consider applications to real-world scenarios, such as simulations and marketing. Our model is capable of generating virtual user experiences, which could be used to improve tourist destinations and simulate travel experiences.
However, when utilizing such pseudo-reviews, it is essential to address potential issues of privacy, bias, and reputational harm that these reviews might cause.

6. Conclusion

In this research, we developed TourMix1M, the first multimodal dataset for tourism review generation. We also introduced LLaVA-Review, a large multimodal model for review generation, and investigated two knowledge retrieval methods for tourism review generation. Experiments on the proposed dataset showed LLaVA-Review's superior performance in domain specificity and user sentiment expression. The two proposed retrieval-augmented fine-tuning strategies further improved accuracy. Additionally, conditioning on factors such as gender, user profiles, visiting month, review length, and key phrases significantly enhanced review generation. This work is expected to advance research in tourism and broader review generation fields.

7. Acknowledgments

This work is partially financially supported by the Center for Real Estate Innovation, The University of Tokyo.

References

[1] WTTC, Economic Impact Report: Global Infographic, Technical Report, World Travel & Tourism Council (WTTC), 2023. URL: https://wttc.org/Research/Economic-Impact.
[2] Anonymous, Tourist experiences at overcrowded attractions: A text analytics approach, in: Information and Communication Technologies in Tourism, 2022, pp. 231–243.
[3] S.-E. Kim, K. Y. Lee, S. I. Shin, S.-B. Yang, Effects of tourism information quality in social media on destination image formation: The case of Sina Weibo, Information & Management 54 (2017) 687–702.
[4] M. del Carmen Hidalgo Alcázar, M. S. Piñero, S. R. de Maya, The effect of user-generated content on tourist behavior: The mediating role of destination image, Tourism & Management Studies 10 (2014) 158–164.
[5] Y. L. . H. S. Dogan Gursoy, Gender difference on destination image and travel options: An exploratory text-mining study, in: PloS one, volume 30, 2018, pp. 1–5.
[6] Y. L. . H. S.
Dogan Gursoy, Does traveler satisfaction differ in various travel group compositions?, in: International Journal of Contemporary Hospitality Management, volume 30, 2018, pp. 1663–1685.
[7] J. J. Padilla, H. Kavak, C. J. Lynch, R. J. Gore, S. Y. Diallo, Temporal and spatiotemporal investigation of tourist attraction visit sentiment on Twitter, PloS one 13 (2018).
[8] M. Rossetti, F. Stella, M. Zanker, Analyzing user reviews in tourism with topic models, Information Technology & Tourism 16 (2016) 5–21.
[9] E. Marine-Roig, S. Anton Clavé, Tourism analytics with massive user-generated content: A case study of Barcelona, Journal of Destination Marketing & Management 4 (2015) 162–172.
[10] C.-F. Tsai, K. Chen, Y.-H. Hu, W.-K. Chen, Improving text summarization of online hotel reviews with review helpfulness and sentiment, Tourism Management 80 (2020).
[11] L. Dong, S. Huang, F. Wei, M. Lapata, M. Zhou, K. Xu, Learning to generate product reviews from attributes, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2017, pp. 623–632.
[12] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[13] Q.-T. Truong, H. Lauw, Multimodal review generation for recommender systems, in: The World Wide Web Conference, 2019, pp. 1864–1874.
[14] J. Li, S. Li, W. X. Zhao, G. He, Z. Wei, N. J. Yuan, J.-R. Wen, Knowledge-enhanced personalized review generation with capsule graph neural network, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 735–744.
[15] Z. Xie, S. Singh, J. McAuley, B. P. Majumder, Factual and informative review generation for explainable recommendation, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 2023, pp. 13816–13824.
[16] H. Liu, C. Li, Y. Li, Y. J.
Lee, Improved baselines with visual instruction tuning, Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024) 26296–26306.
[17] X. V. Lin, X. Chen, M. Chen, W. Shi, M. Lomeli, R. James, P. Rodriguez, J. Kahn, G. Szilvasy, M. Lewis, L. Zettlemoyer, S. Yih, RA-DIT: Retrieval-augmented dual instruction tuning, in: The Twelfth International Conference on Learning Representations, 2024.
[18] T. Zhang, S. G. Patil, N. Jain, S. Shen, M. Zaharia, I. Stoica, J. E. Gonzalez, RAFT: Adapting language model to domain specific RAG, in: arXiv preprint, 2024.
[19] M. Yang, M. Zhu, Y. Wang, L. Chen, Y. Zhao, X. Wang, B. Han, X. Zheng, J. Yin, Fine-tuning large language model based explainable recommendation with explainable quality reward, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 2024, pp. 9250–9259.
[20] S.-J. Park, D.-K. Chae, H.-K. Bae, S. Park, S.-W. Kim, Reinforcement learning over sentiment-augmented knowledge graphs towards accurate and explainable recommendation, in: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, 2022, pp. 784–793.
[21] Anonymous, LLaVA-Tour: Creation of a large-scale multimodal model specializing in Japanese tourism data, in: Proceedings of IEEE Visual Communications and Image Processing (under review), 2024.
[22] M. L. Cheung, W. K. S. Leung, J.-H. Cheah, H. Ting, Effects of user-provided photos on hotel review helpfulness: An analytical approach with deep learning, International Journal of Hospitality Management 71 (2018) 120–131.
[23] M. L. Cheung, W. K. S. Leung, J.-H. Cheah, H. Ting, Exploring the effectiveness of emotional and rational user-generated contents in digital tourism platforms, Vacation Marketing 28 (2022) 152–170.
[24] Y. L. . H. S.
Dogan Gursoy, ChatGPT and the hospitality and tourism industry: An overview of current trends and future research directions, in: Journal of Hospitality Marketing & Management, volume 32, 2023, pp. 579–592.
[25] Y. K. Dwivedi, N. Pandey, W. Currie, A. Micu, Leveraging ChatGPT and other generative artificial intelligence (AI)-based applications in the hospitality and tourism industry: Practices, challenges and research agenda, in: International Journal of Contemporary Hospitality Management, volume 36, 2024, pp. 1–12.
[26] L. Li, Y. Zhang, L. Chen, Personalized transformer for explainable recommendation, in: Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021, pp. 4947–4957.
[27] L. Li, Y. Zhang, L. Chen, Personalized prompt learning for explainable recommendation, in: ACM Transactions on Information Systems, volume 41, 2023, pp. 1–26.
[28] P. Li, Z. Wang, Z. Ren, L. Bing, W. Lam, Neural rating regression with abstractive tips generation for recommendation, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017, pp. 345–354.
[29] J. Ni, J. McAuley, Personalized review generation by expanding phrases and attending on aspect-aware representations, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018, pp. 706–711.
[30] Z. C. Lipton, S. Vikram, J. McAuley, Generative concatenative nets jointly learn to write and classify reviews, in: arXiv preprint, 2015.
[31] P. Li, A. Tuzhilin, Towards controllable and personalized review generation, in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, volume 1, 2019, pp. 3237–3245.
[32] J. Guo, S. Lu, H. Cai, W. Zhang, Y. Yu, J.
Wang, Long text generation via adversarial training with leaked information, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[33] D. V. Hada, V. M., S. K. Shevade, Rexplug: Explainable recommendation using plug and play language model, in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 81–91.
[34] H. Chen, Y. Lin, F. Qi, J. Hu, P. Li, J. Zhou, M. Sun, Aspect-level sentiment-controllable review generation with mutual learning framework, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 12639–12647.
[35] J. Li, W. X. Zhao, J.-R. Wen, Y. Song, Generating long and informative reviews with aspect-aware coarse-to-fine decoding, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1969–1979.
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, volume 30, 2017.
[37] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, et al., Llama: Open and efficient foundation language models, arXiv preprint (2023).
[38] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language Models are Unsupervised Multitask Learners, Technical Report, OpenAI, 2019. URL: https://api.semanticscholar.org/CorpusID:160025533.
[39] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, et al., Language models are few-shot learners, in: Proc. Advances in Neural Information Processing Systems, volume 33, 2020, pp. 1877–1901.
[40] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, in: Proc. Advances in Neural Information Processing Systems, volume 36, 2023.
[41] OpenAI, Gpt-4 technical report, arXiv preprint (2023).
[42] J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C.
Zhou, J. Zhou, Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, arXiv preprint (2023).
[43] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, et al., Finetuned language models are zero-shot learners, in: Proc. International Conference on Learning Representations, 2022.
[44] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, Q. Liu, Ernie: Enhanced language representation with informative entities, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1441–1451.
[45] M. Kang, J. M. Kwak, J. Baek, S. J. Hwang, Knowledge graph-augmented language models for knowledge-grounded dialogue generation, arXiv preprint (2023).
[46] X. He, Y. Tian, Y. Sun, N. V. Chawla, T. Laurent, Y. LeCun, X. Bresson, B. Hooi, G-retriever: Retrieval-augmented generation for textual graph understanding and question answering, arXiv preprint (2024).
[47] L. Luo, Y.-F. Li, G. Haffari, S. Pan, Reasoning on graphs: Faithful and interpretable large language model reasoning, in: International Conference on Learning Representations, 2024.
[48] S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J.-B. Lespiau, B. Damoc, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones, A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero, K. Simonyan, J. W. Rae, E. Elsen, L. Sifre, Improving language models by retrieving from trillions of tokens, in: International Conference on Machine Learning, 2022, pp. 2206–2240.
[49] K. Guu, K. Lee, Z. Tung, P. Pasupat, M.-W. Chang, Realm: Retrieval-augmented language model pre-training, in: International Conference on Machine Learning, 2020, pp. 3929–3938.
[50] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, W.
tau Yih, Replug: Retrieval-augmented black-box language models, in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024, pp. 8371–8384.
[51] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. tau Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive nlp tasks, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 9459–9474.
[52] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, 2021, pp. 8748–8763.
[53] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. URL: https://lmsys.org/blog/2023-03-30-vicuna/.
[54] E. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, in: Proc. International Conference on Learning Representations, 2022.
[55] R. K. Amplayo, S. Angelidis, M. Lapata, Aspect-controllable opinion summarization, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 6578–6593.
[56] K. Papineni, S. Roukos, T. Ward, W. Zhu, Bleu: A method for automatic evaluation of machine translation, in: Proc. 40th Annual Meeting on Association for Computational Linguistics, 2002, pp. 311–318.
[57] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, 2004, pp. 74–81.
[58] R. Vedantam, C. L. Zitnick, D. Parikh, Cider: Consensus-based image description evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.
[59] B. Smyth, P. McClave, Similarity vs.
diversity, in: Proceedings of the 4th International Conference on Case-Based Reasoning, 2001, pp. 347–361.
[60] H. Peng, L. Xu, L. Bing, F. Huang, W. Lu, L. Si, Knowing what, how and why: A near complete solution for aspect-based sentiment analysis, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp. 8600–8607.
[61] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[62] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015.
[63] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[64] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometrics and Intelligent Laboratory Systems (1987) 37–52.
[65] S. P. Lloyd, Least squares quantization in pcm, IEEE Transactions on Information Theory 28 (1982) 129–137.