<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of King Saud University</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>LPQ Team at Rest-Mex 2025: BERT and LLM Approaches in Tourism Review Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Le Phu Quy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dang Van Thin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Information Technology-VNUHCM</institution>
          ,
          <addr-line>Quarter 6, Linh Trung Ward, Thu Duc District, Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This study addresses the Rest-Mex 2025 challenge by developing a multi-task framework for Spanish tourism review analysis, focusing on sentiment polarity (1-5 scale), destination type classification (hotel/restaurant/attraction), and Magical Town identification. We explore transformer-based models (BETO, XLM-RoBERTa), hybrid architectures (BERT embeddings with XGBoost), domain adaptation, ensemble strategies, and LLaMA-3 fine-tuned specifically for Magical Town recognition. The approach provides a scalable pipeline for enhancing destination analytics through advanced NLP techniques.</p>
      </abstract>
      <kwd-group>
        <kwd>Hope classification</kwd>
        <kwd>Spanish language</kwd>
        <kwd>English language</kwd>
        <kwd>sentiment analysis</kwd>
        <kwd>fine-tuning BERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Mexico’s Pueblos Mágicos (Magical Towns) are vibrant destinations where history, culture, and economic
development intersect, attracting millions of travelers each year. These towns are celebrated for their
rich artisan traditions, historic landmarks, and breathtaking natural scenery, making them essential to
Mexico’s tourism industry. As digital platforms like TripAdvisor and social media grow in influence,
travelers share their experiences more widely than ever, creating a wealth of reviews that reveal
diverse sentiments, destination preferences, and regional identities. However, analyzing these texts
poses challenges, as they are written in various Spanish dialects and often include irony or local
slang[
        <xref ref-type="bibr" rid="ref1">1, 2, 3, 4</xref>
        ].
      </p>
      <p>Unlike past editions [5, 6, 7], the Rest-Mex 2025 [8, 9] shared task addresses these complexities by
introducing three key objectives. First, sentiment analysis helps determine the emotional tone of a
review, using a rating scale from 1 to 5. Second, destination classification identifies whether a review
refers to a hotel, restaurant, or tourist attraction. Lastly, geolocation detection pinpoints which Pueblo
Mágico the review describes. These tasks extend beyond academic interest—they play a crucial role in
shaping sustainable tourism strategies, improving infrastructure, and preserving the unique cultural
heritage of the 40 Pueblos Mágicos represented in the task. By extracting meaningful insights from online
reviews, researchers and policymakers can enhance visitor experiences while ensuring these towns
retain their distinctive charm for future generations.</p>
      <p>In this work, we present an evaluation of modeling strategies for tourism analytics, benchmarking
several approaches. First, we optimize baseline BERT models—including BETO (Spanish BERT), Roberto
(RoBERTa Spanish), and XLM-RoBERTa—to serve as our fundamental framework. Additionally, we
explore embedding-XGBoost hybrids, where sentence embeddings derived from these BERT variants
are fed into XGBoost classifiers fine-tuned with focal loss to better capture minority classes. To further
enhance destination classification, we introduce a domain-adapted BETO model, trained on 15GB of
Mexican tourism texts to effectively capture region-specific expressions. We also employ BERT ensemble
models by fine-tuning BETO, Roberto, and XLM-RoBERTa independently and aggregating their outputs
via both soft and hard voting, supplemented by metadata features such as regional keywords. Finally,
we fine-tune LLaMA-3 using LoRA, further enriching our approach.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Tourism review analytics has evolved significantly over the years, moving from early rule-based [10] and
traditional machine learning methods to sophisticated deep learning approaches. Early work in this field
focused on manual feature engineering for sentiment extraction and destination classification, but these
methods struggled with the variability and cultural nuances present in user-generated content. The
advent of transformer-based models, especially Spanish-specific variants like BETO [11] and Roberto,
has greatly enhanced our ability to capture complex expressions and regional idioms in tourism reviews.</p>
      <p>More recent studies have leveraged hybrid architectures that combine the semantic strength of
transformer models with the robustness of classical classifiers. In particular, embedding-XGBoost
hybrids—where sentence embeddings from BETO, Roberto, and XLM-RoBERTa are fed into XGBoost
classifiers fine-tuned [12] with focal loss—have been successful in addressing challenges related to class
imbalance and minority class emphasis [13]. Additionally, domain adaptation [14] via pre-training on
extensive tourism-specific corpora has proven effective for capturing regional expressions, enhancing
the performance of models in destination classification tasks.</p>
      <p>Ensemble methods [15] and large language models (LLMs) have further pushed the boundaries in
tourism review analysis. Independent fine-tuning of various BERT variants followed by ensemble
aggregation has demonstrated notable improvements in tasks like Magical Town identification. Moreover,
fine-tuning LLaMA-3 using parameter-efficient methods like LoRA on culturally annotated datasets
has enabled more precise disambiguation of similar regional references. Our work builds on these
independent modeling strategies, offering a modular approach that isolates and leverages the unique
strengths of each method to achieve state-of-the-art performance in tourism NLP.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Transformer-Based Classification Models</title>
        <p>The Transformer architecture has become a cornerstone in Natural Language Processing due to its
effective use of self-attention mechanisms. These mechanisms allow the model to capture the contextual
relationships between tokens, while positional encoding preserves the sequential order. Multi-head
attention further enables the parallel extraction of distinct patterns and representations from the input
text. Such principles form the theoretical basis of our classification approach.</p>
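        <p>As an illustrative sketch of the mechanism described above (a single attention head, no learned projections; this is a toy example, not the code used in our system):</p>

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    # Each token attends to every token: scaled dot-product scores are
    # softmax-normalized, and the output is the weighted mix of token vectors.
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Three toy 2-dimensional "token" vectors.
mixed = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(mixed)  # each output row is a convex combination of the input rows
```

        <p>Multi-head attention runs several such mixtures in parallel over learned projections of the input, and positional encodings are added to the token vectors beforehand so the mixture remains order-aware.</p>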
        <p>Implemented Model Variants: In our experiments, we employ four Transformer-based models:</p>
        <list list-type="bullet">
          <list-item><p><bold>BETO</bold>: A Spanish language model adapted from BERT, which utilizes dynamic gradient accumulation to address instability in gradient updates during training.</p></list-item>
          <list-item><p><bold>Roberto</bold>: A variant similar in foundation to BETO, optimized for our specific domain requirements.</p></list-item>
          <list-item><p><bold>XLM-RoBERTa</bold>: A robust multilingual model fine-tuned using conventional truncation settings to handle regional vocabulary and context.</p></list-item>
          <list-item><p><bold>Domain-Adapted BETO</bold>: This model undergoes an additional phase of continual pre-training on 18 million tokens from the hospitality domain, enhancing its understanding of tourism-related texts.</p></list-item>
        </list>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hybrid Embedding-XGBoost Framework</title>
        <p>In our approach, the process is divided into two main phases: extracting information from text and then
using that information to make predictions. First, a transformer model transforms the raw text into a
vector, which serves as a semantic summary that efficiently captures the context and meaning of the
original content. This vector eliminates the need for continuous, heavy computation in the subsequent
stages. For sentiment analysis, the extracted embedding is passed to an XGBoost model designed to
predict ratings while naturally respecting their ordinal relationship. This means that the model is set up
so that a higher rating is always treated as more positive than a lower one, ensuring that the predictions
follow the expected order and are consistent with the natural ranking of sentiments.</p>
        <p>For town or geographical classification, a similar XGBoost model is employed, but it is fine-tuned to
focus on features that are most relevant to location information. By analyzing which parts of the text
provide the strongest geographic signals, the model filters out less useful features, which simplifies the
decision process and improves overall efficiency. This selective approach ensures that the classification
is both fast and accurate. To further improve the performance of the system, we apply Bayesian
hyperparameter optimization. This method carefully adjusts key parameters such as model complexity
and regularization factors, helping to balance the trade-offs between accuracy and speed while also
addressing potential class imbalances in the data. Overall, the hybrid framework not only separates the
heavy lifting of context extraction from the prediction tasks but also achieves faster inference times
compared to using a complete transformer model for every operation.</p>
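        <p>The focal loss applied at the XGBoost stage to emphasize minority classes (Section 1) can be stated compactly. The sketch below is a minimal per-example version with the standard gamma and alpha parameters; it is illustrative, not our training code:</p>

```python
import math

def focal_loss(p_true, gamma=2.0, alpha=1.0):
    # Focal loss on the probability assigned to the true class:
    # the (1 - p)^gamma factor shrinks the loss of well-classified
    # examples, so training focuses on hard, often minority-class, cases.
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

# A confident correct prediction contributes almost nothing,
# while a poorly classified example keeps a large loss.
print(focal_loss(0.95), focal_loss(0.2))
```

        <p>With gamma = 0 the expression reduces to ordinary cross-entropy; raising gamma increases the relative weight of hard examples.</p>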
      </sec>
      <sec id="sec-3-3">
        <title>3.3. LLaMA-3 Instruction Tuning</title>
        <p>In this approach, we adjust the LLaMA-3-8B model to enhance its understanding of geocultural contexts
in tourism. We use a method called Low-Rank Adaptation (LoRA), which adds a small number of
trainable components to specific layers of the model while keeping most of the original parameters
unchanged. This allows the model to learn tourism-focused information without losing its general
language abilities.</p>
        <p>To incorporate tourism domain expertise, trainable adapters are inserted into the model’s query
and value projection layers. This selective tuning enables the model to effectively absorb and use
domain knowledge while still relying on its robust pre-trained capabilities. A well-structured, three-part
prompt strategy guides the model’s response. To ensure the geographical information is accurate, two
validation methods are applied. First, the system employs pattern matching to filter out any inputs that
do not meet the expected format for town names. Second, a character-level similarity check is used as
a fallback to correct minor errors in the town names by comparing them against an official list. This
dual-check approach minimizes errors in geographic details and ensures the output remains precise.</p>
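        <p>The two validation stages can be sketched as follows (pure Python; the town list shown is a placeholder subset of the 40 official names, and the pattern and similarity threshold are illustrative assumptions, not our exact configuration):</p>

```python
import re
import difflib

# Placeholder subset of the official list of 40 Magical Towns.
OFFICIAL_TOWNS = ["Bacalar", "Ajijic", "Tepoztlan"]

# Stage 1: accept only plausible town-name strings (letters, spaces, hyphens).
TOWN_PATTERN = re.compile(r"^[A-Za-zÁÉÍÓÚÑáéíóúñü' \-]+$")

def validate_town(raw):
    # Reject malformed model output, then snap near-misses to the closest
    # official name via character-level similarity (stage 2 fallback).
    name = raw.strip()
    if not TOWN_PATTERN.match(name):
        return None
    if name in OFFICIAL_TOWNS:
        return name
    close = difflib.get_close_matches(name, OFFICIAL_TOWNS, n=1, cutoff=0.8)
    return close[0] if close else None

print(validate_town("Bacalr"))   # minor typo corrected by the fallback
print(validate_town("123???"))   # rejected by pattern matching
```

        <p>The `cutoff` keeps the fallback conservative: a string must already be close to some official name before it is corrected, so unrelated text is discarded rather than mapped to a town.</p>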
        <p>Overall, this instruction tuning framework adapts the LLaMA-3 model to be both knowledgeable
and reliable within the tourism domain. By combining targeted parameter tuning with a structured
prompting and validation system, the model is capable of generating detailed, accurate responses while
maintaining eficiency—a quality that is essential for deployment in resource-constrained environments.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Ensemble Strategies</title>
        <p>Our ensemble approach is centered on two key voting techniques: soft voting and hard voting, each
contributing uniquely to the final decision-making process.</p>
        <p>In soft voting, the models provide probabilistic estimates that reflect the confidence of each prediction.
These probabilities are combined in a way that gives higher influence to models with stronger
performance. This method allows the ensemble to capture subtle distinctions in the data, efectively leveraging
the context-aware abilities of transformer-based models when the diferences between classes are not
pronounced.</p>
        <p>In contrast, hard voting involves each model casting a clear, discrete vote for its predicted outcome.
The final prediction is determined by a majority rule—if the votes are tied, a simple tie-breaking
procedure selects the outcome. This approach provides decisiveness and transparency, ensuring that
the ensemble can deliver a clear prediction even when the individual model opinions diverge.</p>
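        <p>A minimal sketch of the two voting rules (our actual ensemble additionally tunes the weights on the validation set; the probabilities below are illustrative):</p>

```python
from collections import Counter

def soft_vote(prob_lists, weights=None):
    # Weighted average of per-model class probabilities; argmax wins.
    weights = weights or [1.0] * len(prob_lists)
    n_classes = len(prob_lists[0])
    total = sum(weights)
    avg = [sum(w * p[c] for w, p in zip(weights, prob_lists)) / total
           for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])

def hard_vote(labels):
    # Majority vote over discrete predictions; ties go to the
    # smallest class index (one simple deterministic tie-break).
    counts = Counter(labels)
    best = max(counts.values())
    return min(c for c, n in counts.items() if n == best)

# Three models, three classes (e.g. hotel / restaurant / attraction).
probs = [[0.2, 0.5, 0.3], [0.1, 0.3, 0.6], [0.3, 0.4, 0.3]]
print(soft_vote(probs, weights=[1.0, 2.0, 1.0]))  # confidence-weighted choice
print(hard_vote([1, 2, 1]))                        # majority of discrete votes
```

        <p>Note that the two rules can disagree: a single confident model can sway the soft vote toward a class that loses the discrete majority.</p>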
        <table-wrap id="tab1">
          <caption>
            <p>Label distribution of the Rest-Mex 2025 training data; the polarity and type labels each cover the same 208,051 reviews.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Category</th><th>Label</th><th>Count</th></tr>
            </thead>
            <tbody>
              <tr><td>Polarity</td><td>1</td><td>5,441</td></tr>
              <tr><td>Polarity</td><td>2</td><td>5,496</td></tr>
              <tr><td>Polarity</td><td>3</td><td>15,519</td></tr>
              <tr><td>Polarity</td><td>4</td><td>45,034</td></tr>
              <tr><td>Polarity</td><td>5</td><td>136,561</td></tr>
              <tr><td>Type</td><td>Hotel</td><td>51,410</td></tr>
              <tr><td>Type</td><td>Restaurant</td><td>86,720</td></tr>
              <tr><td>Type</td><td>Attractive</td><td>69,921</td></tr>
              <tr><td>Total</td><td/><td>208,051</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>Our experiments employ the competition dataset from Rest-Mex 2025, containing tourist reviews of
Mexico’s special "magical towns". The data includes user opinions with sentiment ratings and venue
categories, showcasing real tourism feedback across diverse Mexican locations:</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experiment Setting</title>
        <p>In this study, we employ the Rest-Mex 2025 dataset, a comprehensive corpus comprising 10,000
Spanish-language travel reviews collected from Mexican tourism destinations for the Rest-Mex 2025 challenge.
Each review is systematically annotated for three distinct classification tasks: polarity (rated 1–5),
destination type (restaurant, hotel, or attraction), and magical town (one of 40 distinct towns). The
corpus exhibits moderate class imbalance for polarity—skewed toward positive ratings (4 and 5)—and
for destination type, while the magical-town labels demonstrate high imbalance due to the rarity of
some towns. Pre-processing removes entries with missing values and concatenates review titles and
bodies using a [SEP] token to form a unified input sequence. Labels are numerically encoded (polarity:
0–4, destination type: 0–2, magical town: 0–39), and the data are partitioned through stratified sampling
into 80% training, 10% validation, and 10% test splits to preserve class distributions.</p>
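        <p>The pre-processing and split described above can be sketched as follows (the field names and random seed are illustrative assumptions, not the exact competition schema):</p>

```python
import random
from collections import defaultdict

def preprocess(rows):
    # Drop rows with missing values and join title and body with [SEP];
    # polarity labels 1-5 are re-encoded as 0-4.
    clean = []
    for r in rows:
        if r.get("title") and r.get("body") and r.get("polarity") is not None:
            clean.append({"text": r["title"] + " [SEP] " + r["body"],
                          "label": int(r["polarity"]) - 1})
    return clean

def stratified_split(rows, seed=13):
    # 80/10/10 split drawn per label so class proportions are preserved.
    by_label = defaultdict(list)
    for r in rows:
        by_label[r["label"]].append(r)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        a, b = int(0.8 * len(group)), int(0.9 * len(group))
        train += group[:a]
        dev += group[a:b]
        test += group[b:]
    return train, dev, test
```

        <p>The same encoding scheme extends to the destination-type (0–2) and magical-town (0–39) labels.</p>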
        <p>Three transformer models are fine-tuned for each task: BETO
(dccuchile/bert-base-spanish-wwm-cased), RoBERTa (bertin-project/bertin-roberta-base-spanish), and XLM-RoBERTa (xlm-roberta-large).
For polarity and destination type classification, models predict 5 and 3 classes respectively, with a
maximum sequence length of 128 tokens; the magical-town task utilizes 256 tokens and predicts 40
classes. Fine-tuning is conducted for three epochs with a learning rate of 2e-5, the AdamW optimizer,
a batch size of 16, and gradient accumulation (4 steps for BETO and RoBERTa, 2 for XLM-RoBERTa).
Training employs mixed precision (FP16) for enhanced memory efficiency. Additionally, a
domain-adapted BETO is created by pre-training on the corpus with masked-language modeling for two
epochs before task-specific fine-tuning. To explore complementary methodologies, final-layer [CLS]
embeddings from BETO, RoBERTa, and XLM-RoBERTa are fed to XGBoost classifiers. XGBoost is
configured with depth 6, learning rate 0.1, and 1,000 boosting rounds, utilizing early stopping on
validation loss and class weights, particularly for the highly imbalanced magical-town task. Ensemble
strategies include soft voting (probability averaging with weights tuned for macro-F1 on the validation
set) and hard voting (equal-weight majority vote). A LLaMA-3.2-3B-Instruct model is additionally
fine-tuned via LoRA (r = 16, α = 32) in a multi-task setting to generate structured outputs for all three
labels, though this remains exploratory due to output-parsing challenges. Transformer models provide
robust baselines for Spanish NLP, while XGBoost and ensemble methods leverage complementary
inductive biases to offset individual weaknesses.</p>
        <p>Evaluation relies on accuracy, macro-F1, and weighted F1 metrics to reflect performance across
imbalanced classes. These experimental settings balance computational feasibility with rigorous
analysis.</p>
        <p>[Table: Macro F1 on the Polarity, Type, and Town tasks for BETO, RoBERTa, XLM-RoBERTa, their
XGBoost hybrids, DA-BETO, the soft and hard ensembles, and the fine-tuned LLM (LLM-FT).]</p>
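        <p>For reference, a minimal macro-F1 computation (our own sketch, not the official task scorer):</p>

```python
def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1: each class counts equally, so
    # rare towns weigh as much as frequent ones in the final score.
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# A majority-class predictor is penalized for the missed minority class.
print(macro_f1([0, 0, 0, 1], [0, 0, 0, 0]))
```

        <p>Weighted F1 instead averages per-class F1 in proportion to class frequency, which is why both are reported for the imbalanced tasks.</p>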
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Main Results</title>
        <p>Our experimental evaluation reveals distinct performance patterns across the three classification tasks.
For polarity classification, XLM-RoBERTa combined with XGBoost achieved the best overall
performance, demonstrating the effectiveness of hybrid approaches. The type classification task showed
consistently high performance across all methods, with the soft-ensemble approach slightly
outperforming individual models. Town classification exhibited significant variation, with the standalone RoBERTa
model surprisingly outperforming more complex ensemble approaches, highlighting that optimal model
selection is highly task-dependent. Overall, these findings stress the importance of matching model
complexity to task characteristics rather than adopting a one-size-fits-all solution.</p>
        <p>[Table: official results excerpt — Macro F1 (Polarity), Macro F1 (Type), and Macro F1 (Town) for
the 1st-, 2nd-, and 3rd-ranked systems and HM.]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Error Analysis and Discussion</title>
      <p>The most prominent weakness in our approach is observed in the Town Classification task, where our
model achieved a macro-F1 score of 0.4690, significantly lower than the top-performing system’s score
of 0.6919. This performance gap stems primarily from our use of metadata (Region) during training to
distinguish between towns. While this metadata enhanced the model’s ability to differentiate towns in
the training data, it was not available in the test set. As a result, the model failed to generalize effectively,
particularly for less frequent towns, leading to poor classification performance.</p>
      <p>This issue highlights the critical importance of maintaining data consistency between training
and testing phases. The absence of the Region metadata during inference created a mismatch that
undermined the model’s predictive capability. To address this, potential solutions include eliminating
reliance on metadata entirely or integrating external knowledge sources, such as geographical or
cultural databases, to provide contextual cues independent of the training data. Additionally, improving
data balance—perhaps through oversampling or synthetic data generation—and enhancing the model’s
ability to handle linguistic variations, such as slang or dialects, could further increase its reliability and
robustness for future applications.</p>
      <sec id="sec-5-1">
        <title>5.1. Conclusion</title>
        <p>This study examines transformer-based and hybrid approaches for multi-task tourism review analysis,
focusing on the classification of magical towns in the Rest-Mex 2025 dataset. While the overall method
shows promise, the town classification task achieved a macro-F1 score of 0.4690—significantly lower
than the leading 0.6919—primarily due to using Region metadata during training that was not available
at testing, resulting in poor generalizability. These findings underscore the importance of consistent
feature availability across training and testing, suggesting that future models should avoid reliance
on such metadata by incorporating external knowledge, advanced data augmentation, and improved
handling of linguistic diversity like regional slang to ensure robust real-world performance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This research was supported by The VNUHCM-University of Information Technology’s Scientific
Research Support Fund.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>We declare that the present manuscript has been written entirely by the authors and that no generative
artificial intelligence tools were used in its preparation, drafting, or editing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>R.</given-names> <surname>Guerrero-Rodriguez</surname></string-name>,
          <string-name><given-names>M. A.</given-names> <surname>Álvarez Carmona</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Aranda</surname></string-name>,
          <string-name><given-names>A. P.</given-names> <surname>López-Monroy</surname></string-name>,
          <article-title>Studying online travel reviews related to tourist attractions using NLP methods: the case of Guanajuato, Mexico</article-title>,
          <source>Current Issues in Tourism</source>
          <volume>26</volume>
          (<year>2023</year>)
          <fpage>289</fpage>-<lpage>304</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>