<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Multimodal Dataset and Benchmark for Tourism Review Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hiromasa Yamanishi</string-name>
          <email>yamanishi@cvm.t.u-tokyo.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ling Xiao</string-name>
          <email>ling@cvm.t.u-tokyo.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toshihiko Yamasaki</string-name>
          <email>yamasaki@cvm.t.u-tokyo.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Tokyo</institution>
          ,
          <addr-line>Hongo, Bunkyo, Tokyo 113-8656</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper addresses the challenge of generating accurate and contextually relevant tourism reviews, which are essential for assisting travelers in writing reviews and for allowing businesses to predict how different users react to tourist spots. To address this problem, we introduce the first multimodal dataset for Japanese tourism review generation, TourMix1M, which contains one million review instances under various conditions, including images, user attributes, user profiles, review ratings, review length, key phrases, and visit seasons, collected from Japanese tourism websites. Based on this dataset, we develop a baseline model for multimodal review generation, LLaVA-Review, by performing instruction tuning of the LLaVA model. Furthermore, to enhance domain knowledge, we apply retrieval-augmented fine-tuning with aggregated tourism review data, exploring two types of knowledge representations: one incorporating noun and adjective information from a Sentiment Aware Knowledge Graph, and another using aspect-based summaries of reviews. Experimental results show that LLaVA-Review outperforms existing models in review generation and adapts well to various conditioning factors, with accuracy improving when gender, visiting month, review length, key phrases, and user-profile information are included in the prompt. Furthermore, retrieval-augmented fine-tuning with tourism information effectively improved accuracy across both types of knowledge representations.</p>
      </abstract>
      <kwd-group>
        <kwd>Tourism</kwd>
        <kwd>Review Generation</kwd>
        <kwd>Conditional Review Generation</kwd>
        <kwd>Large Multimodal Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Tourism is one of the most crucial sectors in the global economy, contributing more than 15 trillion US
dollars [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and is enjoyed for various purposes, such as relaxation, exploration, and education. Online
platforms such as TripAdvisor and Google Maps play essential roles [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for tourism. In tourism services,
user-generated content (UGC), such as reviews and photos of tourist spots, plays a vital role for both the
tourism business and individual tourists. By collecting diverse aspects and opinions about these locations,
the tourism business can further improve the services, enhance their credibility, and make more profits.
At the same time, UGC helps shape tourists’ perceptions of tourist destinations and significantly
influences travel planning [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. As tourism demand diversifies, understanding the preferences, needs,
and objectives of different segments is crucial for identifying market opportunities. The literature
shows that UGC reflects differences in trends based on user attributes and preferences [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ], and
analyzing these trends has provided valuable insights for improving tourist destinations and for effective
marketing targeting [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ].
      </p>
      <p>
        This paper focuses specifically on review generation. There are two primary applications for review
generation. First, presenting automatically generated reviews to users can significantly minimize their
effort [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and encourage more review posting. Additionally, presenting generated reviews to users is
beneficial for recommendation systems [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Second, predicting user reactions, such as “what a man
in his 50s who prefers leisurely tourism would say” can be valuable for businesses in marketing and
improving tourist destinations. Recent advancements in deep learning and natural language processing
(NLP) have enabled the generation of high-quality reviews [
        <xref ref-type="bibr" rid="ref11 ref13 ref14 ref15">11, 13, 14, 15</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the authors developed a
recurrent neural network based on long short-term memory (LSTM) for user-conditioned product
review generation. Truong et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] improved review generation by leveraging image information. Li
et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] enhanced product-related factual information by leveraging knowledge graphs. Xie et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
utilized existing reviews as additional context and generated reviews using GPT-2. However, although
various datasets exist, relatively few studies focus on tourism-related data in the review generation
literature. In particular, there is a significant gap in research that employs multimodal datasets combining
diverse contexts, such as images, rich user attributes, and review content, specifically tailored for
tourism. To address this gap, this research creates a multimodal review generation dataset, TourMix1M,
specifically tailored for tourism. The TourMix1M is created from data collected from Japanese tourism
websites, comprising 470,000 images, 510,000 reviews, and a total of one million training instances. We
will release a list of URLs for the collected data at https://github.com/HiromasaYamanishi/TourMix1M.
Our dataset supports the generation of reviews based on various contexts such as images, age, gender,
season, user profiles, review length, and ratings. This is the first dataset specialized in generating
tourism reviews that incorporates diverse information such as images and user data, promoting the
understanding and advancement of review generation in the tourism sector.
      </p>
      <p>
        We also propose a baseline model for multimodal review generation, LLaVA-Review. LLaVA-Review
was developed by instruction-tuning the open-source LLaVA model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Compared to previous
multimodal review generation models [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], LLaVA-Review offers greater flexibility in handling different
conditions by changing natural language instructions. LLaVA-Review outperformed state-of-the-art
Large Multimodal Model (LMM) ChatGPT-4o and review generation models on various metrics in
experiments conducted on our dataset, particularly in terms of BLEU, ROUGE, and consistency with user
opinions. By incorporating tourism knowledge and user opinions to fine-tune LLaVA, LLaVA-Review
also demonstrated stronger adaptability to diverse tourism-related conditioning. Furthermore, inspired
by recent research in enhancing domain specificity [
        <xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>
        ] with retrieval-augmented fine-tuning,
we incorporate aggregated information from tourism spot reviews during fine-tuning and inference
stages. We explored two approaches: subgraph method and aspect-based summary. The subgraph
method means adding noun and adjective information extracted from a Sentiment Aware Knowledge
Graph [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. The aspect-based summary denotes using aspect-based summaries generated from reviews.
Experimental results showed that the subgraph method increased BLEU by 12% and user opinion
accuracy by 11%, while using aspect-based summary improved proper noun variety by 18% and domain
knowledge by 4%.
      </p>
      <p>
        One related submission to this paper is [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], which develops a multimodal model for landmark
recognition and review generation. However, this work differs from [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] in two key aspects: 1) Dataset:
The dataset used in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] addresses limited conditions such as gender, age, group, and key phrases,
whereas the TourMix1M includes a wider range of factors such as visit season, user profiles, ratings,
and review length, providing a more comprehensive foundation for tourism review generation. 2)
Model design: [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] addresses review generation as part of a multitask framework, which is trained by
standard instruction-tuning. Our research explores the usage of information aggregated from reviews
to improve quality. Overall, our contributions are threefold:
• We created a multimodal dataset, TourMix1M, for Japanese tourism review generation. The
TourMix1M has one million instances and ten types of diverse conditions. The URLs used in the dataset
construction will be made publicly available.
• We proposed LLaVA-Review, which outperformed state-of-the-art large multimodal models and
review generation models in general review generation. LLaVA-Review showed strong adaptability
to different conditions, and we analyzed variations in its outputs across these conditions.
• We propose two retrieval-augmented fine-tuning methods: aggregating subgraphs from a Sentiment
Aware Knowledge Graph and generating aspect-based summaries from reviews; both enhance accuracy.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. User-Generated Content in Tourism</title>
        <p>
          Tourism is a key sector in the global economy, accounting for over 10% of global economic output [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
As tourism is often a high-cost, once-in-a-lifetime experience, UGC on social platforms plays a crucial
role in helping tourists decide what to visit. UGC, typically in the form of reviews and images, offers
valuable insights. Reviews express user opinions on aspects such as content, price, and service, while
images enhance understanding of intangible experiences such as tourist destinations [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Furthermore,
images complement textual information and function as visual evidence to substantiate the content,
thus enhancing the usefulness of reviews when combined with images [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. UGC significantly
influences travel planning and behavior by shaping both cognitive and emotional perceptions of tourist
destinations [
          <xref ref-type="bibr" rid="ref23 ref3 ref4">4, 23, 3</xref>
          ].
        </p>
        <p>
          Analyzing UGC is vital for marketing and improving tourist destinations. Text mining has shown that
reviews reflect user satisfaction and preferences based on factors such as gender [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], age, companions [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ],
and season [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. For example, [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] finds that men are more interested in historical sites, while women
prefer landscapes and rural areas. Similarly, [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] shows that couples tend to leave reviews with higher
satisfaction levels. Analyzing tourist reviews with deep learning techniques, such as topic modeling and
sentiment classification, is effective for marketing and destination improvement [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ]. Recently, large
language models such as ChatGPT have shown potential for applications such as automated replies to
customer questions and requests in tourism through prompt engineering [
          <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
          ]. Our research focuses
on developing large-scale models using tourism-specific corpora and images.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Review Generation</title>
        <p>
          Review generation helps users write reviews [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and improves the trustworthiness of
recommendations [
          <xref ref-type="bibr" rid="ref12 ref26">26, 12</xref>
          ]. Unlike explanation generation [
          <xref ref-type="bibr" rid="ref26 ref27">26, 27</xref>
          ] or tip generation [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], review generation aims
to create longer texts that cover multiple aspects. Various models have been developed for review
generation using recurrent neural networks (RNNs) [
          <xref ref-type="bibr" rid="ref11 ref14 ref29 ref30">11, 14, 29, 30</xref>
          ], generative adversarial networks
(GANs) [
          <xref ref-type="bibr" rid="ref31 ref32">31, 32</xref>
          ], and large-scale models [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ]. Typically, these models generate reviews based on inputs
such as user data, product information, ratings, or sentiment polarity. Furthermore, some research
explores generation conditioned on images [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] or aspects [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] such as Environment, Service, and Price.
Additionally, guiding the generation process with information extracted from reviews is effective in
improving quality and factuality. For instance, [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] identified relevant aspects between users and items
and guided generation using words from those aspects. Similarly, [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] proposed a method that first
generates an aspect sequence and then performs review generation from coarse to fine. In addition, [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
expanded a Freebase-based knowledge graph using user and item reviews, capturing user preferences
for each aspect via Caps GNN. Furthermore, [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] generated reviews by inputting past reviews and key
terms into GPT-2 and asking questions like “What was great?” to guide the generation.
        </p>
        <p>
          Research on review generation in the tourism domain remains underexplored. One of the biggest
challenges is the lack of a dataset. One related work under submission is [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ], where review generation
is addressed as part of a multitask framework, with conditions limited to user attributes and keywords.
This paper focuses on creating a dataset specifically for review generation with a broader range of ten
conditions such as visit timing, user profiles, ratings, and review length. We also apply a large-scale
multimodal model to multimodal review generation for the first time.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Retrieval-augmented Large-Scale Model Fine-tuning</title>
        <p>
          Large-scale models (LMs) based on the Transformer architecture [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] have shown high performance
across various tasks, with notable examples including large language models (LLMs) such as LLaMA [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ]
and GPT [
          <xref ref-type="bibr" rid="ref38 ref39">38, 39</xref>
          ], large multimodal models (LMMs) such as LLaVA [
          <xref ref-type="bibr" rid="ref16 ref40">40, 16</xref>
          ], ChatGPT-4 [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ], and
QwenVL [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ]. These LMMs align language with images and can generate text across diverse inputs and
conditions.
[Figure example (image-review pair): “This time, I walked along the canal at night. Illuminated by
gas lamps, I was satisfied with the beautiful scenery. After dinner, I walked all the way to the back
and back again, making for a nice walk. The warehouses were also lit up and looked beautiful. I
definitely recommend going at night.”]
        </p>
        <p>
          However, these models often lack domain-specific knowledge. To address this, methods such as
instruction-tuning [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ] and retrieval-augmented generation (RAG) [
          <xref ref-type="bibr" rid="ref17 ref18 ref44 ref45 ref46 ref47">44, 45, 46, 47, 48, 49, 17, 50, 18, 51</xref>
          ]
have been developed to enhance performance in specific domains. RAG, which incorporates external
knowledge, including structured knowledge such as Knowledge Graphs [
          <xref ref-type="bibr" rid="ref44 ref45 ref46 ref47">44, 45, 46, 47</xref>
          ] and unstructured
sources [
          <xref ref-type="bibr" rid="ref17 ref18">48, 49, 17, 50, 18, 51</xref>
          ], has proven effective for domain-specific tasks. Knowledge integration has
proven to be beneficial either during pre-training [
          <xref ref-type="bibr" rid="ref44">48, 49, 44</xref>
          ], fine-tuning [
          <xref ref-type="bibr" rid="ref17 ref18 ref45">17, 18, 45</xref>
          ], or inference [
          <xref ref-type="bibr" rid="ref46">51, 46</xref>
          ].
Our research aligns closely with studies such as [
          <xref ref-type="bibr" rid="ref17 ref18 ref45">17, 18, 45</xref>
          ], which focus on retrieval-augmentation
during both fine-tuning and inference stages. Most research in this area is applied to question answering,
where a single or a few lengthy documents are retrieved. However, this approach may not be ideal for
review generation. A small number of reviews may not capture all aspects and opinions of a certain
tourism spot, while too many reviews can introduce noise, potentially degrading quality. Although
frameworks such as [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] train large language models using concept retrieval, the knowledge may still
be insufficient. Therefore, effective knowledge retrieval methods for tourism review generation require
careful design.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Method</title>
      <p>
        Considering the fact that most tourists take photos and their experiences vary depending on context
information such as user attributes and visit timing, we have constructed a multimodal tourism review
generation dataset. The created dataset accounts for ten conditions, including review length, user
gender, user age, group, visiting month, visiting season, two types of user profiles, and rating. Moreover,
we propose a high-performance LLaVA-Review model for review generation. We also propose
retrieval-augmented fine-tuning, in which aggregated review information is incorporated into the prompt during
both training and inference. We propose two formats for aggregating review information: one based
on subgraphs consisting of noun and adjective information extracted from a Sentiment Aware
Knowledge Graph [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], and another
based on aspect-based review summaries. Each of these aspects will be detailed in the following
sections. Figures 4, 6, 7, 8, and 10 as well as Tables 1 and 3 present the original reviews alongside the
generated examples, all of which were originally in Japanese and have been translated into English for
demonstration purposes in this paper.
      </p>
      <sec id="sec-3-1">
        <p>Image source footnotes: (2) https://cdn.jalan.jp/jalan/img/6/kuchikomi/3776/KL/6b4e1_0003776908_1.webp; (3) https://cdn.jalan.jp/jalan/img/0/kuchikomi/4140/KL/3a837_0004140092_1.webp</p>
        <p>[Table: dataset statistics — counts of Dialogues, Prompts, Reviews, Images, and Tourism Spots.]</p>
        <sec id="sec-3-1-1">
          <title>3.1. TourMix1M Dataset</title>
          <p>We created a multimodal tourism dataset for review generation by utilizing data collected from the
Japanese tourism website jalan.net4, with permission to use its data for non-commercial purposes.
We originally collected 470k images and 2.4 million reviews from the website for 51k tourist spots
across Japan that had a sufficient number of images and reviews. This represents a large portion of the
content available on the website. Initially, the collected images and texts were not always paired. To
create image-text pairs, we employed the Contrastive Language-Image Pretraining (CLIP) [52] method,
specifically focusing on the model tailored for the Japanese language 5. The data were generated by
pairing images and reviews based on the nearest neighbors in the embedding space. To create training
pairs, we used three retrieval methods for diverse image-text matching: image-to-review retrieval,
review-to-image retrieval, and image-sentence-review retrieval. Table 1 shows examples of obtained
image-text pairs. In the image-sentence-review retrieval process, sentences were first extracted from
the images, after which the full corresponding reviews were identified by locating the reviews that
contained those sentences.</p>
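The CLIP-based pairing step (embedding images and reviews, then matching nearest neighbors in the shared embedding space) can be sketched as follows; the toy 3-d vectors and function names are illustrative assumptions, not the paper's implementation — real embeddings would come from the 512-d Japanese CLIP model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_review(image_emb, review_embs):
    """Image-to-review retrieval: pick the review whose embedding is the
    nearest neighbor of the image embedding in CLIP space."""
    return max(range(len(review_embs)), key=lambda i: cosine(image_emb, review_embs[i]))

# Toy 3-d embeddings standing in for 512-d Japanese-CLIP outputs.
image_emb = [0.9, 0.1, 0.0]
review_embs = [
    [0.0, 1.0, 0.0],  # review about an unrelated topic
    [1.0, 0.2, 0.0],  # review matching the image content
    [0.0, 0.0, 1.0],
]
print(nearest_review(image_emb, review_embs))  # → 1
```

Review-to-image retrieval is the same search with the roles of the two modalities swapped.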
          <p>We constructed three review generation tasks based on the generated image-text pairs. The tasks
consist of Short Review Generation, General Review Generation, and Conditional Review Generation.
General and Conditional Review Generation are collectively referred to as Long Review Generation.
For Short Review Generation, image-sentence pairs obtained from image-sentence retrieval were used
to generate concise reviews. For Long Review Generation, image-review pairs obtained from three
retrieval methods were used to generate reviews. The left side of Figure 1 shows the distribution of
each task. As input for each task, in General Review Generation, instructions were given to generate
reviews based solely on images and place names. In Short Review Generation, only one condition,
rating, was applied. For Conditional Review Generation, instructions were given to generate reviews
based on combinations of images, place names, and various conditions. The conditions considered
include ten categories: review length, gender, age, groups, visit month, season, two types of user
profiles (tag and long), rating, and key phrase in the review. These variables encompass a wide range of
conditions specific to tourism. While this research does not perform conditioning based on user ID,
making the setup less personalized, it provides a more general context and attribute based framework,
which is applicable even to cold-start users. For example, a new user, such as a man in his 50s who
enjoys leisurely activities, could request a review for a spring visit to a tranquil garden or scenic nature
trail, allowing the system to generate a review tailored to his preferences without prior interactions.
Moreover, our dataset enables analysis of how diferent conditioning factors, such as age, gender, and
season, influence the generated reviews, providing insights into the diverse user experiences.</p>
          <p>Specifically, for categorical variables, gender is either male or female; age is in ten-year increments
from the 10s to the 90s; groups include family, couple, friends, single, or other; rating is an integer
between 1 and 5; visit month is an integer between 1 and 12; and the season is either spring, summer,
autumn, or winter. For user profiles, we use two types: “tag” profiles, which are simple, keyword-based
summaries such as “history enthusiast” and “long” profiles, which provide more detailed descriptions
such as “a curious traveler with a deep interest in local history and culture, enjoying museum visits
and exploring traditional cuisine.” These profiles are generated by prompting a large language model 6
based on past reviews. For key phrases, we use an LLM to extract important parts from sentences,
ranging from single words to short sentences, that users found positive or negative. The attributes
for each review are derived from the accompanying metadata of reviews and the data of the user who
wrote them. During training, instead of using all conditions for each instance, we perform condition
sampling based on a pre-defined probability to increase the diversity of condition combinations. The
probabilities were chosen so that each conditioning factor would appear in approximately 10% to 30%
of the entire dataset. Figure 2 illustrates how frequently each attribute appears throughout the entire
dataset. Figure 3 shows the distribution of categorical attributes such as Gender, Age, Group, Season,
Month, and Rating across the original 2.4 million reviews for both the TourMix1M dataset and the
sampled TourMix1M dataset. The results indicate that the TourMix1M dataset closely replicates the
original distribution. We release both the full set of conditions and the specific conditions used in
the experiments. The middle and right sections of Figure 1 show the distribution of the number of
conditions in short and long review generation, respectively. For prompt construction, in the General
Review Generation task, prompts are structured as “Generate a review for Sensoji based on the image.”
For Short Review Generation, prompts are phrased as “Generate a concise review for Sensoji based
on the image,” and for Conditional Review Generation, such as “Generate a 4-star review for Sensoji</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <p>Footnotes: (5) https://huggingface.co/rinna/japanese-clip-vit-b-16; (6) https://huggingface.co/UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3</p>
        <p>using the keyword ‘blossom’ as written by a male in his 20s based on the image.” The resulting training
dataset comprises 1 million dialogues, 1.31 million prompts, 545,891 reviews, 476,167 images, and 51,011
tourist spots.</p>
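The condition sampling and prompt construction described above might look like the following sketch; the per-condition inclusion probabilities and the exact prompt template are our assumptions (the paper states only that each factor appears in roughly 10% to 30% of instances).

```python
import random

# Illustrative inclusion probabilities (assumed values, tuned in the paper
# so each condition appears in ~10-30% of the dataset).
CONDITION_PROBS = {"length": 0.2, "gender": 0.3, "age": 0.3, "group": 0.2,
                   "month": 0.1, "season": 0.1, "profile": 0.2,
                   "rating": 0.2, "key_phrase": 0.2}

def sample_conditions(meta, rng):
    """Keep each available condition independently with its probability."""
    return {k: v for k, v in meta.items()
            if k in CONDITION_PROBS and rng.random() < CONDITION_PROBS[k]}

def build_prompt(spot, conds):
    """Assemble a Conditional Review Generation prompt in the paper's style."""
    p = "Generate a "
    if "rating" in conds:
        p += f"{conds['rating']}-star "
    p += f"review for {spot}"
    if "key_phrase" in conds:
        p += f" using the keyword '{conds['key_phrase']}'"
    if "gender" in conds and "age" in conds:
        p += f" as written by a {conds['gender']} in his {conds['age']}"
    p += " based on the image."
    return p

meta = {"rating": 4, "key_phrase": "blossom", "gender": "male", "age": "20s"}
sampled = sample_conditions(meta, random.Random(0))  # random subset of meta
print(build_prompt("Sensoji", meta))
```

With all four conditions present, this reproduces the example prompt "Generate a 4-star review for Sensoji using the keyword 'blossom' as written by a male in his 20s based on the image."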
        <sec id="sec-3-2-1">
          <title>3.2. LLaVA-Review</title>
          <p>
            3.2.1. Model Architecture
The architecture of our proposed model is illustrated in Figure 4. The icons in the figure will represent
LLaVA-Review in subsequent sections. The baseline model we developed, LLaVA-Review, is based
on LLaVA [
            <xref ref-type="bibr" rid="ref16 ref40">40, 16</xref>
            ], a prominent open-source large-scale multimodal model. In LLaVA, the image is
initially divided into patches, which are then converted into image tokens by the image encoder. These
tokens are transformed into the language space via a projector consisting of multilayer perceptrons. By
simultaneously feeding instruction tokens and image tokens into the large language model, the model
generates responses that incorporate image information. The training process consists of two stages:
pretraining, where only the projector is trained, and fine-tuning, where both the large language model
(LLM) and the projector are optimized. In the proposed LLaVA-Review, we performed only fine-tuning
based on the model pretrained in [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ]. Instruction-tuning [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ] was employed as the fine-tuning method,
which learns to generate responses using instruction-response pairs. Optimization is conducted by
minimizing the negative log-likelihood of the response generation using the following loss function,
where x_i represents the tokens, I denotes the length of the instruction part, and N is the total length of
the text. We used Vicuna-13B [53], which has strong capabilities in Japanese, as the language model,
and CLIP as the image encoder. Additionally, to ensure efficient training and to prevent degradation
in language capabilities, we employed a Low-Rank Adaptation (LoRA) [54]-based strategy for model
training.
          </p>
          <p>ℒ = − ∑_{i=I+1}^{N} log p(x_i | x_1, . . . , x_{i−1}).   (1)</p>
          <p>3.2.2. Instruction-Tuning with External Knowledge
We propose utilizing knowledge extracted from existing reviews as references to enhance the quality of
review generation. This external knowledge is incorporated during both the training and inference
phases, with the extracted sentences added to the end of the review generation prompts. Tourist
destination reviews typically reflect diverse perspectives and opinions, resulting in a large volume of
content. Relying on only a few reviews may fail to fully capture this diversity, while using the entire
review text as input risks introducing noise. To address these challenges, we empirically evaluated
various external knowledge aggregation methods that effectively capture domain knowledge and user
opinions. Figure 5 illustrates the two proposed methods. We utilized external knowledge constructed from
2.4 million reviews that do not overlap with test reviews.</p>
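The instruction-tuning loss of Eq. (1) penalizes only the response tokens; the instruction portion of the prompt, including any appended knowledge sentences, is masked out. A minimal sketch of this computation (with toy log-probabilities of our own choosing):

```python
import math

def response_nll(token_logprobs, instruction_len):
    """Eq. (1): negative log-likelihood summed over response tokens only
    (positions I+1..N); instruction tokens contribute no loss."""
    return -sum(token_logprobs[instruction_len:])

# Toy sequence of 5 tokens; the first 2 form the instruction (masked out).
logprobs = [math.log(0.5), math.log(0.9),                  # instruction
            math.log(0.25), math.log(0.5), math.log(0.8)]  # response
print(round(response_nll(logprobs, instruction_len=2), 4))  # → 2.3026
```

In practice the same masking is achieved by setting the label of each instruction token to an ignore index before computing the cross-entropy.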
          <p>
            Subgraph-based method. This method involves sampling from a Sentiment Aware Knowledge
Graph (SAKG). SAKG is a knowledge graph that incorporates user opinions and sentiment information.
The SAKG in this research is represented as 𝒢 = {(h, r, t) | h, t ∈ ℰ, r ∈ ℛ}, where ℰ represents
entities, ℛ represents relationships, and a triplet (h, r, t) denotes a relationship r from a head
entity h (e.g., Ueno Park) to a tail entity t (e.g., Panda). Unlike previous work [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ], this graph uses
edges to represent adjectives and their frequency of use, such as (cute, 9) and (big, 4). The graph is
constructed by extracting noun-adjective pairs from reviews using syntactic parsing and aggregating
them for each tourist spot. During training, first, between 1 to 5 entities related to the target destination
are sampled based on edge frequency. Subsequently, relationships associated with the sampled entities
are selected similarly. During inference, the top-k entities and relationships by total edge frequency are
sampled and incorporated into the prompt as natural language.
          </p>
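The inference-time retrieval (top-k entities by total edge frequency, verbalized into the prompt) could be sketched as follows; the toy graph and the verbalization phrasing are our assumptions. Training-time sampling would instead draw 1 to 5 entities at random, weighted by edge frequency.

```python
# Hypothetical SAKG fragment for one tourist spot: entity -> {adjective: frequency},
# mirroring edges such as (cute, 9) and (big, 4).
SAKG = {
    "Panda":           {"cute": 9, "big": 4},
    "Cherry blossoms": {"beautiful": 12, "crowded": 3},
    "Museum":          {"interesting": 5},
}

def topk_subgraph(graph, k):
    """Inference-time retrieval: top-k entities by total edge frequency,
    each paired with its most frequent adjective."""
    ents = sorted(graph, key=lambda e: sum(graph[e].values()), reverse=True)[:k]
    return [(e, max(graph[e], key=graph[e].get)) for e in ents]

def verbalize(pairs):
    """Turn the sampled subgraph into natural language for the prompt."""
    return "Visitors mention: " + "; ".join(f"{adj} {ent}" for ent, adj in pairs)

print(verbalize(topk_subgraph(SAKG, k=2)))
# → Visitors mention: beautiful Cherry blossoms; cute Panda
```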
          <p>
            Summary-based method. This method involves adding aspect-based summaries. Previous research
on review summarization has shown that aspect-based summaries facilitate a broader understanding
of items [
            <xref ref-type="bibr" rid="ref10">55, 10</xref>
            ]. Furthermore, incorporating aspect information is crucial in review generation [55].
In this research, we input reviews and prompt a large-scale language model5 to generate a summary
in a single stage. Specifically, for each tourist destination, up to 50 reviews are selected as input, with
a total length of less than 5,000 characters. The prompt instructs the model to create a summary of around
1,000 characters covering overviews, positive opinions, and negative opinions for the key content
elements, including price, service, food and drink, facilities, transportation access, and seasonal events.
          </p>
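<p>The input-selection rule and prompt assembly can be sketched as below. The English prompt wording and all function names are assumptions; the paper's actual (Japanese) prompt is not reproduced in the text.</p>

```python
# Aspects named in the summary-based method.
ASPECTS = ["price", "service", "food and drink", "facilities",
           "transportation access", "seasonal events"]

def select_reviews(reviews, max_reviews=50, max_chars=5000):
    """Greedily pick reviews while staying under the count and length budgets."""
    picked, total = [], 0
    for review in reviews:
        if len(picked) == max_reviews or total + len(review) >= max_chars:
            break
        picked.append(review)
        total += len(review)
    return picked

def build_summary_prompt(spot, reviews):
    """Assemble a single-stage summarization prompt for one tourist spot."""
    header = (f"Summarize the following reviews of {spot} in about 1,000 "
              f"characters, giving an overview, positive opinions, and negative "
              f"opinions for each of: {', '.join(ASPECTS)}.\n\n")
    return header + "\n".join(f"- {r}" for r in select_reviews(reviews))
```

<p>Keeping the input under a fixed character budget bounds the context length of the summarizing model regardless of how many reviews a destination has.</p>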
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup</title>
        <p>The TourMix1M dataset was employed to train a model for generating reviews based on inputs such as
images, tourist destination names, and various contextual conditions. Specifically, to create the test
data, images not used in the training set were first selected, and then the nearest corresponding reviews
were retrieved with CLIP embeddings. Only image-review pairs where both the image and review were
absent from the training data were included in the test set.</p>
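<p>The pairing step can be sketched as follows. Toy hand-made vectors stand in for real CLIP embeddings, and the identifiers and dictionary layout are assumptions made for illustration.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def build_test_pairs(image_embs, review_embs, train_images, train_reviews):
    """Pair each held-out image with its nearest review by embedding similarity,
    keeping only pairs where neither side appears in the training data."""
    pairs = []
    for img_id, img_vec in image_embs.items():
        if img_id in train_images:
            continue  # image was seen during training
        best = max(review_embs, key=lambda rid: cosine(img_vec, review_embs[rid]))
        if best not in train_reviews:
            pairs.append((img_id, best))
    return pairs
```

<p>Filtering on both sides prevents train-test leakage through either the image or the review text.</p>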
<p>The training of LLaVA-Review was conducted using eight 48GB RTX 6000 Ada GPUs with a batch size
of 80 and a learning rate of 2 × 10⁻⁴, taking approximately 37 hours per epoch. When incorporating
external knowledge, the training time increased to 41 hours for the subgraph-based method and 60
hours for the summary-based method. Evaluation was performed using 1,000 image-review pairs that
did not overlap with the training data.</p>
<p>In this research, the key characteristics for effective review generation are 1) the integration of
image information, 2) maintaining high text quality, 3) incorporating detailed information about tourist
destinations, and 4) accounting for user opinions, particularly the collective sentiment regarding tourist
experiences.</p>
        <p>The evaluation metrics included BLEU [56], ROUGE-1, ROUGE-L [57], CIDEr [58], diversity (DIV),
the number of unique proper nouns (PROPN), TFIDF-F1 score, and Senti-F1 score. BLEU, ROUGE, and
CIDEr were used for overall text quality. BLEU and ROUGE were calculated with the sumeval library 8
and CIDEr was calculated with pycocoevalcap library 9. Diversity was assessed based on adjectives,
nouns, proper nouns, and verbs. It was calculated by measuring the overlap of these features between
generated sentences, using part-of-speech information from GiNZA10, following the approach of [59].
The number of unique proper nouns was calculated by dividing the total number of unique proper
nouns in all generated reviews by the number of reviews. To evaluate domain knowledge, the TFIDF-F1
metric identifies the top 15 TFIDF words for each tourist spot and calculates the F1-score
between these top words and those in the generated reviews. The Senti-F1 metric, developed using
Aspect-Based Sentiment Analysis (ABSA) [60], measured user opinion consideration by extracting
(feature, opinion, sentiment) triplets from the text via a large language model5. F1 scores for the
alignment of (feature, sentiment) and (feature, opinion) were averaged to produce the Senti-F1 score.
For all metrics, except length, higher values indicate better performance.</p>
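<p>The TFIDF-F1 computation can be sketched as below. This is a simplified reading of the metric: each spot's reviews are concatenated into one document, and whitespace tokenization stands in for the Japanese morphological analysis used in the paper.</p>

```python
import math
from collections import Counter

def tfidf_top_words(spot_docs, spot, k=15):
    """Top-k TF-IDF words for one spot's document (spot_docs: spot -> text)."""
    n_docs = len(spot_docs)
    df = Counter()
    for doc in spot_docs.values():
        df.update(set(doc.split()))  # document frequency per word
    tf = Counter(spot_docs[spot].split())
    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
    return [w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]]

def f1_overlap(reference_words, generated_text):
    """F1 between the reference word set and the generated review's words."""
    gen, ref = set(generated_text.split()), set(reference_words)
    tp = len(gen.intersection(ref))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(gen), tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

<p>The IDF term discounts words common to all spots, so the reference set captures what is distinctive about each destination rather than generic tourism vocabulary.</p>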
        <p>
          The comparison methods include MRG [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], PETER [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], PEPLER [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ], LLaVA-1.5 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ],
ChatGPT-4V, and ChatGPT-4o. MRG is a multimodal review generation model based on LSTM [61]. We use
VGG16 [62] as the vision backbone. PETER is an explanation generation model based on the Transformer
architecture, while PEPLER is based on GPT-2 [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]. We utilized a GPT-2 model 11 trained on Japanese data
for PEPLER. In PETER and PEPLER, image features were extracted using ResNet [63], then reduced
via PCA [64], and clustered with KMeans [65] to generate photo_id, which was used in place of
user_id. LLaVA-1.5 is an open-source large multimodal model, while ChatGPT-4V [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ] and ChatGPT-4o
are closed-source large multimodal models known for their state-of-the-art knowledge and language
capabilities. For large-scale multimodal models, prompts such as “You are a tourist who visited location.
Generate a review based on the image” are used. Since models other than the proposed method tend to
generate verbose text, we instruct in the prompt that the generated review be approximately 100 characters long. For
retrieval-augmented fine-tuning, we employed two methods: one that extracts entities for comparison
with subgraphs, and another that retrieves up to seven reviews based on CLIP image similarity for
comparison with summaries. During inference, we used four entities and three relations in the subgraph
method.
        </p>
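<p>The photo_id construction used to adapt PETER and PEPLER can be sketched as follows. In the paper the inputs are ResNet features reduced by PCA; here toy 2-D vectors stand in for them, and the bare-bones k-means is a stand-in for a library implementation.</p>

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Return k cluster centers for a list of feature tuples (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center by squared distance
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # recompute centers as cluster means; keep old center if a cluster empties
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers

def photo_id(feature, centers):
    """Index of the nearest cluster center; this id is used in place of user_id."""
    return min(range(len(centers)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(feature, centers[c])))
```

<p>Clustering turns a continuous image feature into a discrete id, which lets models that expect a categorical user_id consume image information without architectural changes.</p>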
      </sec>
      <sec id="sec-4-2">
        <title>4.2. General Review Generation</title>
        <p>Table 2 presents the quantitative evaluation results of review generation. General LMMs such as
LLaVA-1.5 and ChatGPT-4V achieved high ROUGE scores due to their strong linguistic capabilities. However,
their knowledge of the tourism domain and of user opinions is limited. ChatGPT-4o demonstrates
high quality in terms of domain specificity and understanding user opinions. However, it sometimes
generates factually incorrect outputs, such as describing a crowded place as quiet. Additionally, some
explanations lean towards generalizations and lack detailed knowledge of tourism spots. For fine-tuned
models such as MRG, PETER, PEPLER and LLaVA-Review, accuracy was generally improved in terms
of BLEU and ROUGE-L quality metrics, as well as domain specificity and user opinion metrics.
8https://github.com/chakki-works/sumeval
9https://github.com/sks3i/pycocoevalcap
10https://github.com/megagonlabs/ginza
11https://huggingface.co/rinna/japanese-gpt2-medium
12https://cdn.jalan.jp/jalan/img/2/kuchikomi/3622/KL/ed832_0003622962_1.webp
13https://cdn.jalan.jp/jalan/img/5/kuchikomi/3905/KL/52041_0003905143_1.webp
14https://cdn.jalan.jp/jalan/img/4/kuchikomi/0894/KL/de82f_0000894193_1.webp
[Figure 6 example residue. LLaVA-Review: “… characteristic of making your skin smooth.” Ground truth: “We stayed at Ooedo Onsen Monogatari with a baby, a 4-year-old child, and four adults. The bath was so nice that we wanted to go in many times, and it made our skin smooth.” Caption: given the image and the tourism spot name as input, the proposed baseline accurately captures the image information, user opinions, and domain knowledge. Table note: bold text indicates the best performance in each group.]</p>
        <p>For the
MRG, due to the limited expressiveness of the model, the generated output tended to have a relatively
small vocabulary and frequent repetitions, leading to lower overall accuracy. However, PETER
achieved the highest CIDEr score and performed well in domain specificity and the number of proper
nouns used, as its expressiveness was improved by the Transformer. PETER also exhibits
repetitive outputs at the sentence or phrase level, likely due to its limited number of
Transformer layers. In contrast, PEPLER consistently delivered superior performance in both review
quality and domain relevance, producing natural and coherent outputs.</p>
        <p>The proposed method, LLaVA-Review, leverages a large corpus of UGC and demonstrates excellent
review performance across metrics, particularly in BLEU, ROUGE-L, and sentiment expression. This
improvement is attributed to the model’s effective capture of image features and its capability to
incorporate relevant tourist information and user opinions, facilitated by its large parameterization.
While it tends to generate shorter reviews, the length can be adjusted by specifying the desired review
length in the prompt, as discussed later. [Retrieved-knowledge examples (table residue). (a) Retrieved
subgraph: (road, narrow, 18), (road, narrow, 8), (road, difficult, 6), (mountain road, narrow, 7),
(mountain road, difficult, 5), (mountain road, good, 3), (scenery, wonderful, 5), (scenery, good, 3).
(b) Retrieved summary (extract): “The Yamanashi Prefectural Art Museum offers an extensive collection
centered around Millet’s works, providing a quiet and serene environment for visitors. The museum also
features outdoor sculptures and a park, allowing visitors to enjoy both art appreciation and a leisurely
stroll. With relatively easy access, the museum provides a range of services that visitors will
appreciate, including discounts for local residents and special offers for those staying at nearby
accommodations.”] Figure 6 illustrates the generated review results for Isawa Onsen.
ChatGPT-4o produces relatively descriptive and typical outputs, while PETER and PEPLER effectively
capture the domain-specific information of the tourist spot. Notably, the proposed LLaVA-Review
accurately identifies the presence of indoor hot springs and their characteristics from the images,
resulting in high-quality generation.</p>
<p>Table 2 also presents the effects of incorporating external knowledge. While knowledge acquisition
improved accuracy overall, the extent of improvement varied. Sampling from the Knowledge Graph
contributed to enhancements in BLEU by 12% and Senti-F1 by 11%, leading to better consideration of
user opinions. In contrast, simply retrieving nouns showed limited improvements, highlighting the
importance of including adjective information. Aspect-based summaries notably enhanced domain
specificity and informativeness, increasing TFIDF-F1 by 4% and the number of proper nouns by 12%,
without negatively affecting quality metrics. However, directly retrieving reviews introduced noise,
which lowered overall quality despite improved user opinion consideration. Table 3 shows specific
examples.</p>
        <p>Finally, for ChatGPT-4o, we retrieved five, ten, and twenty reviews similar to the image using CLIP,
and added them to the original prompt during inference. This approach significantly boosted the
original ChatGPT-4o’s performance, achieving the highest CIDEr score and high TFIDF-F1 and Senti-F1
score. However, the variants of LLaVA-Review show comparable accuracy across almost all metrics,
confirming its high review generation performance.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Conditional Review Generation</title>
        <p>
          Table 4 presents the quantitative results of conditioned review generation. LLaVA-1.5 shows minimal
to no performance improvement when considering attributes, as reflected in its low CIDEr scores for
different conditions, such as gender (CIDEr: 0.014) and rating (CIDEr: 0.012). [Table 4 residue: row
labels for LLaVA-1.5 and LLaVA-Review and their conditioning variants (gender, age, season, rating,
length, profile, tag, key phrase, and month), together with metric column headers including BLEU.]
        </p>
        <p>In contrast, LLaVA-Review exhibits significant accuracy gains for certain attributes. LLaVA-1.5 fails to capture the characteristics
of reviews influenced by user attributes and context, often directly outputting the conditions in the
reviews. Although conditioning with key phrases improves accuracy in LLaVA-1.5, it is less effective
than in LLaVA-Review.</p>
<p>Due to differing attribute frequencies, as shown in Figure 2, directly comparing the effects of
conditioning is difficult, but certain trends in accuracy improvements are observed across different
attributes. Among user attributes, gender significantly improves generation quality, while age and group
information have little impact, suggesting that gender has a stronger influence on shaping individual user
reviews than age or group composition. For profiles, the short tag profile did not significantly
improve accuracy, but long-form, detailed profiles reflecting styles and preferences, such as
“relaxed” or “preference for natural landscapes,” enhanced review generation accuracy. Incorporating
more detailed profiles could be a direction for future work. For style conditions such as review length
and rating, review length proved especially effective, significantly improving accuracy. The average
deviation between generated and target lengths was just 4.1 characters, reflecting precise conditioning.
Low ratings had some effect, but imbalanced training data led to occasional under-representation of
negative opinions. Keyword conditioning showed the highest accuracy gains. For temporal context,
month-based conditioning outperformed season-based conditioning, indicating greater effectiveness at a
finer granularity.</p>
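<p>The reported 4.1-character figure corresponds to a mean absolute deviation between each generated review's length and the length requested in the prompt, which can be computed as below; the function and variable names are assumptions.</p>

```python
def mean_length_deviation(generated_reviews, target_lengths):
    """Average |len(generated) - requested length| over the evaluation set."""
    assert len(generated_reviews) == len(target_lengths)
    return sum(abs(len(g) - t)
               for g, t in zip(generated_reviews, target_lengths)) / len(target_lengths)
```

<p>A small deviation indicates that the length condition in the prompt is being followed rather than merely correlating with verbosity.</p>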
<p>Figure 7 illustrates examples of different conditioning: user demographics, age, user profile, rating and
review length, and key phrases. For user demographic conditioning, characteristics of men, women,
and families are reflected, such as mentioning “playground equipment” for families and “beautiful
flowers” for women. For age conditioning, in the twenties conditioning, a lively writing style using
exclamations was observed, while in the sixties conditioning, more factual reviews related to historical
knowledge were produced. For review length, it was observed that even with the requirement of 290
characters, reviews were generated without content redundancy or structural inconsistencies. For rating,
for the negative conditioning of one star, reviews expressing disappointment over the lack of menu
options were generated. Generation based on keywords also demonstrated high fidelity, producing
results that were very close to the ground truth.
15https://cdn.jalan.jp/jalan/img/7/kuchikomi/2837/KL/d72d0_0002837701_1.webp
16https://cdn.jalan.jp/jalan/img/8/kuchikomi/1388/KL/e88ab_0001388176_2.webp
17https://cdn.jalan.jp/jalan/img/5/kuchikomi/4115/KL/09030_0004115266_1.webp
18https://cdn.jalan.jp/jalan/img/3/kuchikomi/5203/KL/02e5b_0005203413_1.webp
19https://cdn.jalan.jp/jalan/img/5/kuchikomi/4745/KL/be8e4_0004745614_4.webp</p>
        <p>You are a tourist of ○○ who visited Expo ’70 Commemorative Park. Please write a review based on
the image.</p>
<p>Male: Expo ’70 Commemorative Park, which features the famous Tower of the Sun sculpture by Taro Okamoto,
sometimes hosts flea markets on weekends.</p>
        <p>Female: Expo ’70 Commemorative Park is very spacious and a great place for a walk. Depending on the season,
various flowers bloom, making it very beautiful.</p>
        <p>Family: It is a very spacious park. Many flowers bloom, making it ideal for walking. There is also playground
equipment, so children can enjoy it too.</p>
<p>You are a tourist of age ○○ who visited Nagoya Castle. Please write a review based on the image.</p>
        <p>20s: Nagoya Castle is a very magnificent castle. You can go inside, so please be sure to visit. I highly
recommend it!</p>
        <p>40s: If you visit Nagoya Castle during the cherry blossom season, you can enjoy and take photos with the castle
in the background. Nearby, there are other attractions like the Nagoya City Science Museum and Kinshachi
Yokocho, making it a place where you can enjoy a full day.</p>
        <p>60s: Nagoya Castle was originally built by Toyotomi Hideyoshi. The current main keep was reconstructed in the
Showa era. It is especially beautiful during the cherry blossom season.</p>
        <p>You are a tourist with a profile of 〇〇. Generate a review for Godaido Hall.</p>
<p>Profile (short). Prof: Nature lover. Output: It is located in a place where you can get a panoramic view
of Matsushima Bay. Since it is close to the pier for sightseeing boats, I think it’s good to visit both.</p>
        <p>Profile (long). Prof: A calm traveler who loves the beauty of nature, especially the scenery of lakes and
coastlines, and enjoys leisurely exploring tourist spots like Shiretoko and Lake Mashu. Output: It is located
in a place where you can get a panoramic view. From inside Godaido Hall, you can see a great view of
Matsushima Bay. Sightseeing boats also pass by, so I think it’s a good spot for taking photos.</p>
        <p>Ground Truth: It is an island near the Matsushima sightseeing boat pier, located after crossing three red
bridges where you can see the bottom. In the center of the island, there is a hall, and it is a wonderful
place with a panoramic view of Matsushima Bay.</p>
        <p>Generate a review for Nagoya Tokyu Hotel with a condition of 〇〇</p>
<p>Length (290 chars): I went there with high expectations since it was lunch at a hotel. I chose the hamburger
steak as the main dish, and it was tender and delicious. The side buffet had a wide variety of options, and I
left feeling full. There was also dessert, but it was just ice cream. The coffee after the meal was self-service.
Since it was a restaurant inside a hotel, the service and atmosphere were good. However, the line in front of
the entrance was long, and it took time to get in. It can't be helped since it's a popular restaurant, but I
think it would be better if they could widen the entrance or make some improvements. I used points to get a
good deal, so I was satisfied. I'd like to come back and try a different menu next time.</p>
        <p>Rating (1 star): I went there for lunch at a hotel with high expectations, but the menu was limited, and there
were no desserts. There was a salad bar and a drink bar, but considering the price, it felt a bit... disappointing.</p>
        <p>Generate a review for Kunozan Toshogu Shrine using the keyword “more strenuous than I imagined”
that matches the image</p>
<p>Keyword: It was more strenuous than I imagined. I was drenched in sweat climbing the stairs. But the view
after reaching the top was amazing.</p>
        <p>Ground Truth: I climbed the stone steps from the seaside to the Gorieki-dori side. It was more strenuous
than I imagined, but the view was magnificent and I felt a great sense of accomplishment.</p>
        <p>Figure 7: Conditional review generation examples. The first example shows generation based on user
demographics. The second example shows generation based on age. The third example illustrates generation based on
the user’s profile. The fourth example shows generation based on rating and review length. The fifth example
illustrates generation based on key phrases.15 16 17 18 19</p>
        <p>As an aggregate analysis, we generated reviews for the same image by changing only the seasonal
instructions across the four seasons, then compared word frequency in the output. As shown in
Figure 8, distinct features appear: cherry blossoms in spring, children in summer, autumn leaves in
fall, and hot springs in winter, highlighting the impact of seasonal conditioning. Figure 9 provides
a quantitative view of the conditioning effects under gender conditioning. In generated reviews, we
compared word frequencies between male and female conditioned reviews using 1,000 test image-text
pairs. For real reviews, we sampled ten reviews from men and three from women for the same tourist
destination. The table highlights the top 15 words with the largest (female - male) and (male - female)
frequency differences in real reviews. In reviews originally written by women, certain words appear
more frequently, and this pattern is reflected in generated reviews. All 15 words with higher frequency
in women’s original reviews also show higher frequency under female conditioning, with notable
differences for “Go,” “Beautiful,” and “Spacious.” In contrast, men’s reviews show fewer specific words,
and the male-to-female ratio remains small in generated reviews, maintaining the overall trend of
higher male-conditioned word frequency.</p>
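<p>The word-frequency comparison behind Figure 9 can be sketched as below. Whitespace tokens stand in for the morphological analysis applied to the Japanese reviews, and the function name is an assumption.</p>

```python
from collections import Counter

def freq_diff_top_words(reviews_a, reviews_b, k=15):
    """Top-k words whose frequency in group A most exceeds that in group B."""
    counts_a = Counter(w for r in reviews_a for w in r.split())
    counts_b = Counter(w for r in reviews_b for w in r.split())
    vocab = set(counts_a) | set(counts_b)
    # sort by descending (A - B) frequency difference
    return sorted(vocab, key=lambda w: counts_b[w] - counts_a[w])[:k]
```

<p>Running this once with the female-conditioned outputs as group A and once with the male-conditioned outputs as group A yields the two word lists compared against real reviews.</p>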
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Short Review Generation</title>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Visualization of Attention</title>
        <p>The visualization results of the attention weights in the final layer of the language model during the
generation of the first token in the review generation process are shown in Figure 10. Specifically, the
visualization highlights the attention over the 576 tokens that correspond to image positions,
reshaped to 24 × 24 and presented as a heatmap. In the standard LLaVA-1.5 model, high attention values
are concentrated in narrow regions of the image. In contrast, LLaVA-Review exhibits high attention
values over a broader area of the image and across more extensive objects.</p>
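<p>The reshaping step behind the heatmap can be sketched as follows; the flat-list input format and peak normalization are assumptions made for illustration.</p>

```python
def attention_heatmap(weights, side=24):
    """Reshape a flat list of side*side image-token attention weights into a
    2-D grid, scaled so the peak value is 1.0 for display as a heatmap."""
    assert len(weights) == side * side
    peak = max(weights) or 1.0  # avoid division by zero for all-zero input
    return [[w / peak for w in weights[row * side:(row + 1) * side]]
            for row in range(side)]
```

<p>The 24 × 24 grid mirrors the spatial layout of the vision encoder's image patches, so high-attention cells can be overlaid directly on the input image.</p>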
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations and Future Works</title>
      <p>In terms of dataset, we are considering updates to improve the fidelity of image-text pair creation.
Leveraging more detailed user information, such as utilizing all past reviews written by the user, or
considering their actual behavior at tourist destinations when possible, is also a promising direction.
Additionally, while this research developed a dataset specific to Japanese data, it is known that tourist
destinations and perceptions of tourism vary by country. Building a more comprehensive dataset
that encompasses diverse languages and cultures remains a challenge for future research. Moreover,
expanding the dataset to support more tasks, such as personalized product description generation
and recommendation, is also future work. In terms of model design, we will leverage more
powerful language models as the backbone, incorporate more robust image information and domain
knowledge, and develop more adaptive external documents. We will also consider the application to
real-world scenarios, such as simulations and marketing. Our model is capable of generating virtual
user experiences, which could be used to improve tourist destinations and simulate travel experiences.
However, when utilizing such pseudo-reviews, it is essential to address potential issues related to
privacy, bias, and reputation harms that these reviews might cause.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this research, we developed TourMix1M, the first multimodal dataset for tourism review generation.
We also introduced LLaVA-Review, a large-scale multimodal model for review generation. Furthermore,
we researched two knowledge retrieval methods for tourism review generation. Experiments with
the proposed dataset showed LLaVA-Review’s superior performance in domain specificity and user
sentiment expression. The proposed two retrieval-augmented fine-tuning strategies further improved
accuracy. Additionally, conditioning on factors such as gender, user profiles, month, review
length, and key phrases significantly enhanced review generation. This work is expected to advance
research in tourism and broader review generation fields.</p>
    </sec>
    <sec id="sec-7">
<title>7. Acknowledgments</title>
      <p>This work is partially financially supported by the Center for Real Estate Innovation, The University of
Tokyo.</p>
      <p>… model reasoning, in: International Conference on Learning Representations, 2024.
[48] S. Borgeaud, A. Mensch, J. Hofmann, T. Cai, E. Rutherford, K. Millican, G. van den Driessche, J.-B.</p>
      <p>Lespiau, B. Damoc, A. Guy, J. Menick, R. Ring, T. Hennigan, S. Huang, L. Maggiore, C. Jones,
A. Cassirer, A. Brock, M. Paganini, G. Irving, O. Vinyals, S. Osindero, K. Simonyan, J. W. Rae,
E. Elsen, L. Sifre, Improving language models by retrieving from trillions of tokens., in:
International Conference on Machine Learning, 2022, pp. 2206–2240.
[49] K. Guu, K. Lee, Z. Tung, P. Pasupat, M.-W. Chang, Realm: Retrieval-augmented language model
pre-training, in: International Conference on Machine Learning, 2020, pp. 3929–3938.
[50] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, W.-t. Yih, Replug:
Retrieval-augmented black-box language models, in: Proceedings of the 2024 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, 2024, pp. 8371–8384.
[51] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive nlp
tasks, in: Advances in Neural Information Processing Systems, 2020, pp. 9459–9474.
[52] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, et al., Learning
transferable visual models from natural language supervision, in: International Conference on
Machine Learning, 2021, pp. 8748–8763.
[53] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E.</p>
      <p>Gonzalez, I. Stoica, E. P. Xing, Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt
quality, 2023. URL: https://lmsys.org/blog/2023-03-30-vicuna/.
[54] E. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation
of large language models, in: Proc. International Conference on Learning Representations, 2022.
[55] R. K. Amplayo, S. Angelidis, M. Lapata, Aspect-controllable opinion summarization, in: Proceedings
of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 6578–6593.
[56] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: A method for automatic evaluation of machine
translation, in: Proc. 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp.
311–318.
[57] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text Summarization</p>
      <p>Branches Out, 2004, pp. 74–81.
[58] R. Vedantam, C. L. Zitnick, D. Parikh, Cider: Consensus-based image description evaluation,
in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp.
4566–4575.
[59] B. Smyth, P. McClave, Similarity vs. diversity, in: Proceedings of the 4th International Conference
on Case-Based Reasoning, 2001, pp. 347–361.
[60] H. Peng, L. Xu, L. Bing, F. Huang, W. Lu, L. Si, Knowing what, how and why: A near complete
solution for aspect-based sentiment analysis, in: Proceedings of the AAAI Conference on Artificial
Intelligence, volume 34, 2020, pp. 8600–8607.
[61] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780.
[62] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition,
in: International Conference on Learning Representations, 2015.
[63] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. IEEE</p>
      <p>Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[64] S. Wold, K. Esbensen, P. Geladi, Principal component analysis, Chemometrics and Intelligent
Laboratory Systems (1987) 37–52.
[65] S. P. Lloyd, Least squares quantization in pcm, IEEE Transactions on Information Theory 28 (1982)
129–137.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>WTTC</surname>
          </string-name>
          ,
          <source>Economic Impact Report: Global Infographic</source>
          ,
          <source>Technical Report, World Travel Tourism Council (WTTC)</source>
          ,
          <year>2023</year>
          . URL: https://wttc.org/Research/Economic-Impact.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Anonymous</surname>
          </string-name>
          ,
          <article-title>Tourist experiences at overcrowded attractions: A text analytics approach</article-title>
          , in: Information and Communication Technologies in Tourism,
          <year>2022</year>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.-E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. I.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
<article-title>Effects of tourism information quality in social media on destination image formation: The case of Sina Weibo</article-title>
          ,
          <source>Information &amp; Management</source>
          <volume>54</volume>
          (
          <year>2017</year>
          )
          <fpage>687</fpage>
          -
          <lpage>702</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. del Carmen</given-names>
            <surname>Hidalgo Alcázar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Piñero</surname>
          </string-name>
          , S. R. de Maya,
<article-title>The effect of user-generated content on tourist behavior: The mediating role of destination image</article-title>
          ,
          <source>Tourism &amp; Management Studies</source>
          <volume>10</volume>
          (
          <year>2014</year>
          )
          <fpage>158</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y. L. . H. S.</given-names>
            <surname>Dogan</surname>
          </string-name>
          <string-name>
            <surname>Gursoy</surname>
          </string-name>
          ,
<article-title>Gender difference on destination image and travel options: An exploratory text-mining study</article-title>
          .,
          <source>in: PloS one</source>
          , volume
          <volume>30</volume>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y. L. . H. S.</given-names>
            <surname>Dogan</surname>
          </string-name>
          <string-name>
            <surname>Gursoy</surname>
          </string-name>
          ,
<article-title>Does traveler satisfaction differ in various travel group compositions?</article-title>
          , in:
          <source>International Journal of Contemporary Hospitality Management</source>
          , volume
          <volume>30</volume>
          ,
          <year>2018</year>
          , pp.
          <fpage>1663</fpage>
          -
          <lpage>1685</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
<surname>J. J. Padilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kavak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Lynch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Gore1</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Diallo</surname>
          </string-name>
          ,
          <article-title>Temporal and spatiotemporal investigation of tourist attraction visit sentiment on twitter</article-title>
          .,
          <source>PloS one 13</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
[8]
<string-name><given-names>M.</given-names> <surname>Rossetti</surname></string-name>,
<string-name><given-names>F.</given-names> <surname>Stella</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Zanker</surname></string-name>,
<article-title>Analyzing user reviews in tourism with topic models</article-title>,
<source>Information Technology &amp; Tourism</source>
<volume>16</volume>
(<year>2016</year>)
<fpage>5</fpage>-<lpage>21</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
[9]
<string-name><given-names>E.</given-names> <surname>Marine-Roig</surname></string-name>,
<string-name><given-names>S. A.</given-names> <surname>Clavé</surname></string-name>,
<article-title>Tourism analytics with massive user-generated content: A case study of Barcelona</article-title>,
<source>Journal of Destination Marketing &amp; Management</source>
          <volume>4</volume>
          (
          <year>2015</year>
          )
          <fpage>162</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
[10]
<string-name><given-names>C.-F.</given-names> <surname>Tsai</surname></string-name>,
<string-name><given-names>K.</given-names> <surname>Chen</surname></string-name>,
<string-name><given-names>Y.-H.</given-names> <surname>Hu</surname></string-name>,
<string-name><given-names>W.-K.</given-names> <surname>Chen</surname></string-name>,
<article-title>Improving text summarization of online hotel reviews with review helpfulness and sentiment</article-title>,
<source>Tourism Management</source>
          <volume>80</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Learning to generate product reviews from attributes</article-title>
          ,
          <source>in: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>623</fpage>
          -
          <lpage>632</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
,
<string-name><given-names>S.</given-names> <surname>Ren</surname></string-name>,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proc. IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Q.-T.</given-names>
            <surname>Truong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lauw</surname>
          </string-name>
          ,
          <article-title>Multimodal review generation for recommender systems</article-title>
          ,
          <source>in: The World Wide Web Conference</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1864</fpage>
          -
          <lpage>1874</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
,
          <article-title>Knowledge-enhanced personalized review generation with capsule graph neural network</article-title>
          ,
          <source>in: Proceedings of the 29th ACM International Conference on Information Knowledge Management</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>735</fpage>
          -
          <lpage>744</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
,
<string-name><given-names>J.</given-names> <surname>McAuley</surname></string-name>,
<string-name><given-names>B. P.</given-names> <surname>Majumder</surname></string-name>,
<article-title>Factual and informative review generation for explainable recommendation</article-title>,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>37</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>13816</fpage>
          -
          <lpage>13824</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Improved baselines with visual instruction tuning</article-title>
          ,
          <source>Proc. IEEE/CVF Conference on Computer Vision</source>
          and Pattern
          <string-name>
            <surname>Recognition</surname>
          </string-name>
          (
          <year>2024</year>
          )
          <fpage>26296</fpage>
          -
          <lpage>26306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>X. V.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lomeli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>James</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kahn</surname>
          </string-name>
,
<string-name><given-names>G.</given-names> <surname>Szilvasy</surname></string-name>,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
<article-title>RA-DIT: Retrieval-augmented dual instruction tuning</article-title>
          ,
<source>in: The Twelfth International Conference on Learning Representations</source>,
<year>2024</year>, pp.
<fpage>2206</fpage>-<lpage>2240</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
[18]
<string-name><given-names>T.</given-names> <surname>Zhang</surname></string-name>,
<string-name><given-names>S. G.</given-names> <surname>Patil</surname></string-name>,
<string-name><given-names>N.</given-names> <surname>Jain</surname></string-name>,
<string-name><given-names>S.</given-names> <surname>Shen</surname></string-name>,
<string-name><given-names>M.</given-names> <surname>Zaharia</surname></string-name>,
<string-name><given-names>I.</given-names> <surname>Stoica</surname></string-name>,
<string-name><given-names>J. E.</given-names> <surname>Gonzalez</surname></string-name>,
<article-title>RAFT: Adapting language model to domain specific RAG</article-title>, in: arXiv preprint,
<year>2024</year>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
<string-name><given-names>M.</given-names> <surname>Zhu</surname></string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <article-title>Fine-tuning large language model based explainable recommendation with explainable quality reward</article-title>
,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>38</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>9250</fpage>
          -
          <lpage>9259</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
[20]
<string-name><given-names>S.-J.</given-names> <surname>Park</surname></string-name>,
<string-name><given-names>D.-K.</given-names> <surname>Chae</surname></string-name>,
<string-name><given-names>H.-K.</given-names> <surname>Bae</surname></string-name>,
<string-name><given-names>S.</given-names> <surname>Park</surname></string-name>,
<string-name><given-names>S.-W.</given-names> <surname>Kim</surname></string-name>,
<article-title>Reinforcement learning over sentiment-augmented knowledge graphs towards accurate and explainable recommendation</article-title>
          ,
          <source>in: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>784</fpage>
          -
          <lpage>793</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Anonymous</surname>
          </string-name>
          ,
<article-title>LLaVA-Tour: Creation of a large-scale multimodal model specializing in Japanese tourism data</article-title>,
<source>in: Proceedings of IEEE Visual Communications and Image Processing (under review)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
[22]
<string-name><given-names>M. L.</given-names> <surname>Cheung</surname></string-name>,
<string-name><given-names>W. K. S.</given-names> <surname>Leung</surname></string-name>,
<string-name><given-names>J.-H.</given-names> <surname>Cheah</surname></string-name>,
<string-name><given-names>H.</given-names> <surname>Ting</surname></string-name>,
<article-title>Effects of user-provided photos on hotel review helpfulness: An analytical approach with deep learning</article-title>
          ,
          <source>International Journal of Hospitality Management</source>
          <volume>71</volume>
          (
          <year>2018</year>
          )
          <fpage>120</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
[23]
<string-name><given-names>M. L.</given-names> <surname>Cheung</surname></string-name>,
<string-name><given-names>W. K. S.</given-names> <surname>Leung</surname></string-name>,
<string-name><given-names>J.-H.</given-names> <surname>Cheah</surname></string-name>,
<string-name><given-names>H.</given-names> <surname>Ting</surname></string-name>,
<article-title>Exploring the effectiveness of emotional and rational user-generated contents in digital tourism platforms</article-title>,
<source>Journal of Vacation Marketing</source>
          <volume>28</volume>
          (
          <year>2022</year>
          )
          <fpage>152</fpage>
          -
          <lpage>170</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
[24]
<string-name><given-names>D.</given-names> <surname>Gursoy</surname></string-name>,
<string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
<string-name><given-names>H.</given-names> <surname>Song</surname></string-name>,
<article-title>ChatGPT and the hospitality and tourism industry: an overview of current trends and future research directions</article-title>, in:
<source>Journal of Hospitality Marketing &amp; Management</source>
          , volume
          <volume>32</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>579</fpage>
          -
          <lpage>592</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
[25]
<string-name><given-names>Y. K.</given-names> <surname>Dwivedi</surname></string-name>,
<string-name><given-names>N.</given-names> <surname>Pandey</surname></string-name>,
<string-name><given-names>W.</given-names> <surname>Currie</surname></string-name>,
<string-name><given-names>A.</given-names> <surname>Micu</surname></string-name>,
<article-title>Leveraging ChatGPT and other generative artificial intelligence (AI)-based applications in the hospitality and tourism industry: practices, challenges and research agenda</article-title>
          , in:
          <source>International Journal of Contemporary Hospitality Management</source>
          , volume
          <volume>36</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
[26]
<string-name><given-names>L.</given-names> <surname>Li</surname></string-name>,
<string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>,
<string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>,
<article-title>Personalized transformer for explainable recommendation</article-title>
          ,
          <source>in: Proc. 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4947</fpage>
          -
          <lpage>4957</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
,
<string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>,
          <article-title>Personalized prompt learning for explainable recommendation</article-title>
          ,
          <source>in: ACM Transactions on Information Systems</source>
          , volume
          <volume>41</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bing</surname>
          </string-name>
,
<string-name><given-names>W.</given-names> <surname>Lam</surname></string-name>,
          <article-title>Neural rating regression with abstractive tips generation for recommendation</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>345</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
<string-name><given-names>J.</given-names> <surname>McAuley</surname></string-name>,
<article-title>Personalized review generation by expanding phrases and attending on aspect-aware representations</article-title>
          ,
          <source>in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>706</fpage>
          -
          <lpage>711</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Z. C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V.</given-names>
            <surname>andand Julian</surname>
          </string-name>
          <string-name>
            <surname>McAuley</surname>
          </string-name>
          ,
          <article-title>Generative concatenative nets jointly learn to write and classify reviews</article-title>
          , in: arxiv preprint,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuzhilin</surname>
          </string-name>
          ,
          <article-title>Towards controllable and personalized review generation</article-title>
          ,
<source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , volume
          <volume>1</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>3237</fpage>
          -
          <lpage>3245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
,
<string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>,
<article-title>Long text generation via adversarial training with leaked information</article-title>,
<source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>, volume
<volume>32</volume>,
<year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Hada</surname>
          </string-name>
          , V. M.,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Shevade</surname>
          </string-name>
,
<article-title>ReXPlug: Explainable recommendation using plug and play language model</article-title>
          ,
          <source>in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , volume
          <volume>1</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Aspect-level sentiment-controllable review generation with mutual learning framework</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>35</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>12639</fpage>
          -
          <lpage>12647</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Generating long and informative reviews with aspect-aware coarse-to-fine decoding</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1969</fpage>
          -
          <lpage>1979</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
<string-name><given-names>I.</given-names> <surname>Polosukhin</surname></string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
,
<string-name><given-names>M.-A.</given-names> <surname>Lachaux</surname></string-name>,
<string-name><given-names>T.</given-names> <surname>Lacroix</surname></string-name>,
<string-name><given-names>B.</given-names> <surname>Rozière</surname></string-name>, et al.,
<article-title>LLaMA: Open and efficient foundation language models</article-title>,
arXiv preprint (<year>2023</year>).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Language Models are Unsupervised Multitask Learners</article-title>
          ,
          <source>Technical Report, OpenAI</source>
          ,
          <year>2019</year>
          . URL: https://api.semanticscholar.org/CorpusID:160025533.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>in: Proc. Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Visual instruction tuning</article-title>
          ,
          <source>in: Proc. Advances in Neural Information Processing Systems</source>
          , volume
          <volume>36</volume>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>GPT-4 technical report</article-title>
          , arXiv preprint (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond</article-title>
          , arXiv preprint (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          , et al.,
          <article-title>Finetuned language models are zero-shot learners</article-title>
          ,
          <source>in: Proc. International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>ERNIE: Enhanced language representation with informative entities</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1441</fpage>
          -
          <lpage>1451</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Kwak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Baek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Hwang</surname>
          </string-name>
          ,
          <article-title>Knowledge graph-augmented language models for knowledge-grounded dialogue generation</article-title>
          , arXiv preprint (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. V.</given-names>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Laurent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Bresson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hooi</surname>
          </string-name>
          ,
          <article-title>G-Retriever: Retrieval-augmented generation for textual graph understanding and question answering</article-title>
          , arXiv preprint (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Haffari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <article-title>Reasoning on graphs: Faithful and interpretable large language model reasoning</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>