<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Balanced Bag-of-Words Classification of Spanish Tourism Reviews Using Random Forests</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luisa Agudelo Fuentes</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rodrigo Sebastián Rojas Miranda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Instituto Tecnológico de Toluca</institution>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Iberoamericana</institution>
          ,
          <country country="MX">México</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents a classical machine learning approach for sentiment and thematic classification of Spanish-language tourist reviews, using a bag-of-words representation filtered by part-of-speech and trained with Random Forest classifiers. To address the severe class imbalance in the Rest-Mex 2025 dataset, we applied random undersampling, equalizing the number of instances per class to that of the minority class. Additionally, we reduced linguistic noise by limiting the input features to nouns and verbs only. The resulting vectors were used to train independent models for polarity, type, and town classification. Despite the simplicity of the approach, it achieved a macro F1-score of 0.198 for sentiment polarity, 0.331 for type classification, and 0.025 for town classification. These results are considerably lower than those reported by Transformer-based models; however, our method offers transparency, interpretability, and computational efficiency. As such, it serves as a useful baseline for low-resource scenarios or educational settings where interpretability and deployment cost are prioritized over peak performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Random Forest</kwd>
        <kwd>Class Balancing</kwd>
        <kwd>Bag-of-Words</kwd>
        <kwd>Part-of-Speech Filtering</kwd>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Spanish NLP</kwd>
        <kwd>Low-Resource NLP</kwd>
        <kwd>Rest-Mex 2025</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rise of user-generated content on platforms such as TripAdvisor has created new opportunities
for understanding traveler preferences, satisfaction, and perceptions of destinations [1, 2]. These
platforms capture vast amounts of textual data in natural language, offering a valuable resource for both
tourism analytics and computational linguistics [3]. However, this data is often noisy, unstructured, and
imbalanced, presenting considerable challenges for automated processing, especially in low-resource
settings or in underrepresented languages like Spanish [4].</p>
      <p>Sentiment analysis and thematic classification are essential tasks within this context. In
Spanish-language tourism, the Rest-Mex Shared Task series has played a central role in defining benchmarks
and stimulating research. Launched in 2021, Rest-Mex initially focused on polarity classification and
satisfaction prediction using TripAdvisor reviews from Mexican destinations [5]. In 2022, the task
expanded to include a third track on classifying COVID-19 epidemiological risk levels from news articles,
further diversifying its application scenarios [6, 7]. The 2023 edition introduced data from Cuba and
Colombia and incorporated unsupervised clustering as a fourth task, while maintaining polarity and
service type classification as core components [8]. Now in its fourth edition, Rest-Mex 2025 presents
a more complex challenge by including fine-grained town classification over 40 Mexican “Pueblos
Mágicos,” thereby emphasizing geographically-aware sentiment modeling [9, 10].</p>
      <p>While recent winning systems at Rest-Mex have relied heavily on Transformer-based architectures,
such as BETO [11], not all deployment contexts have access to high-end GPUs or sufficient infrastructure
for training or inference with large-scale neural models. Moreover, in some applications—such as
educational settings, embedded systems, or public sector deployments—model interpretability and
computational efficiency may outweigh the need for state-of-the-art performance.</p>
      <p>In this paper, we revisit classical machine learning techniques to provide a lightweight and
interpretable baseline for sentiment and thematic classification. Specifically, we train Random Forest
classifiers using bag-of-words representations constructed from syntactically filtered texts [12]. By
limiting the vocabulary to only nouns and verbs (extracted via part-of-speech tagging), we aim to
capture content and action-oriented signals while reducing the noise associated with less informative
word classes.</p>
      <p>
        To address the class imbalance characteristic of the Rest-Mex dataset, we apply random undersampling
so that each class has the same number of training instances as the smallest class. This uniform sampling
is performed independently for each task—polarity (1–5), type (Hotel, Restaurant, Attractive), and town
(40 categories).
      </p>
      <p>Although this method cannot match the performance of modern Transformer-based approaches,
it offers advantages in simplicity, explainability, and deployment efficiency. Our results demonstrate
its viability as a baseline system, particularly in low-resource environments where transparency and
accessibility are priorities.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Text classification has traditionally relied on linear models and decision tree ensembles trained over
frequency-based representations such as bag-of-words (BoW) or term frequency-inverse document
frequency (TF-IDF). These methods are computationally efficient, easy to interpret, and suitable for
a wide range of text mining tasks. In particular, Random Forest classifiers [13] have proven robust
in noisy domains due to their ensemble nature and ability to handle non-linear boundaries without
extensive hyperparameter tuning.</p>
      <p>To improve text representation, linguistic preprocessing techniques such as part-of-speech (POS)
filtering have been proposed. Selecting only nouns and verbs has been shown to preserve key semantic
and action-related content while reducing dimensionality and noise [14]. This is particularly relevant
in opinion mining, where sentiment-bearing words often correlate with content (nouns) and
sentiment-laden actions (verbs).</p>
      <p>However, classical models face limitations in handling polysemy, long-range dependencies, and
contextual shifts. These issues motivated the emergence of deep learning approaches—first via word
embeddings like Word2Vec and GloVe, and more recently with Transformer-based architectures such as
BETO [4] or RoBERTa [15]. These models have demonstrated superior performance in virtually every
NLP task, including sentiment analysis, named entity recognition, and thematic classification.</p>
      <p>The Rest-Mex Shared Task has tracked this evolution since its inception. In 2021, participating systems
primarily relied on BoW and embedding-based models for polarity classification and recommendation
scoring [5]. By 2022, Transformer-based architectures had become dominant, especially in the new
COVID-19 risk classification task [6, 7]. In 2023, the top-ranking system used a domain-adapted version
of RoBERTa fine-tuned on tourism data, further confirming the importance of specialized pretraining
[8].</p>
      <p>Despite their accuracy, these models require substantial computational resources and infrastructure
for training and deployment. In contrast, classical models remain relevant in scenarios where efficiency,
interpretability, or hardware limitations are critical—particularly in educational, governmental, or
embedded system contexts.</p>
      <p>Our work contributes to this ongoing conversation by revisiting a traditional pipeline—POS-filtered
bag-of-words with Random Forests—and testing its viability on the challenging Rest-Mex 2025
benchmark. Although not competitive with neural models in terms of raw performance, this approach offers
clarity, speed, and fairness through class-balanced training, making it a useful alternative or baseline in
constrained environments.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our classification pipeline is designed to be simple, transparent, and computationally lightweight.
It consists of the following stages: dataset balancing, part-of-speech-based preprocessing, feature
extraction via bag-of-words, and model training using Random Forest classifiers. This section details
each component.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset and Class Distribution</title>
        <p>We use the oficial training split of the Rest-Mex 2025 dataset, which contains over 200,000
Spanish-language tourist reviews labeled for three classification tasks: sentiment polarity (five levels), service
type (three categories), and geographic destination (forty towns). Table 1 presents the class distribution
for each task in raw counts and relative frequencies.</p>
        <sec id="sec-3-1-1">
          <title>Table 1: Class Distribution per Task</title>
          <p>[Table 1: raw counts and relative frequencies per class for Sentiment Polarity (1–5), Service Type, and Town Labels (top 3).]</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Class Balancing</title>
          <p>As shown above, all three classification axes are imbalanced, particularly sentiment polarity and
town labels. To ensure fair learning, we apply random undersampling independently for each task
so that every class contains the same number of instances as the least frequent one. For example, all
polarity classes are subsampled to 5,441 reviews (equal to the number in class 1).</p>
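<p>As an illustration, the per-task undersampling step can be sketched as follows. This is a minimal sketch with toy data; the paper does not publish its code, and the helper names here are our own.</p>

```python
import random
from collections import defaultdict

def undersample(texts, labels, seed=13):
    """Randomly subsample every class down to the minority-class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in zip(texts, labels):
        by_class[label].append(text)
    n_min = min(len(items) for items in by_class.values())  # minority-class count
    balanced = [(text, label)
                for label, items in by_class.items()
                for text in rng.sample(items, n_min)]
    rng.shuffle(balanced)
    return balanced

# Toy run: class 1 is the minority (one review), so each class keeps one review.
texts  = ["malo", "regular", "normal", "bueno", "excelente"]
labels = [1, 3, 3, 5, 5]
balanced = undersample(texts, labels)
print(len(balanced))  # → 3
```

<p>The same routine is run once per task (polarity, type, town), so each task gets its own balanced subset.</p>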
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Text Preprocessing and POS Filtering</title>
        <p>Each review is processed using a Spanish part-of-speech tagger. We retain only tokens tagged as
nouns or verbs, removing adjectives, adverbs, function words, and punctuation. This linguistic filtering
reduces vocabulary size and focuses on content-bearing words and action verbs, which are often more
informative for tourism sentiment and topic classification.</p>
        <p>The filtered tokens are lowercased and lemmatized to normalize inflections (e.g., comer, comió,
comiendo → comer). No additional stopword removal or stemming is applied, to avoid discarding
potentially meaningful tourism-related terms.</p>
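<p>The filtering step thus reduces to keeping lowercased noun and verb lemmas. A minimal sketch follows; the paper does not name its tagger, so in practice the (lemma, POS) pairs would come from a Spanish pipeline such as spaCy's es_core_news_sm (our assumption), and the tags below are hand-written for illustration.</p>

```python
KEEP_POS = {"NOUN", "VERB"}  # drop adjectives, adverbs, function words, punctuation

def pos_filter(tagged):
    """Keep lowercased lemmas of tokens tagged as nouns or verbs."""
    return [lemma.lower() for lemma, pos in tagged if pos in KEEP_POS]

# Hand-tagged (lemma, POS) pairs for "La comida estuvo deliciosa y el hotel es bonito".
tagged = [
    ("el", "DET"), ("comida", "NOUN"), ("estar", "VERB"),
    ("delicioso", "ADJ"), ("y", "CCONJ"), ("el", "DET"),
    ("hotel", "NOUN"), ("ser", "VERB"), ("bonito", "ADJ"),
]
print(pos_filter(tagged))  # → ['comida', 'estar', 'hotel', 'ser']
```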
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Bag-of-Words Feature Representation</title>
        <p>We use a simple binary bag-of-words (BoW) model to convert the filtered text into feature vectors. Each
document is represented as a sparse vector indicating the presence or absence of a word in a fixed
vocabulary. The vocabulary is constructed from the training split after POS filtering, keeping the top
5,000 most frequent lemmas across the corpus.</p>
        <p>This BoW representation is interpretable and compatible with tree-based models. It also allows us to
inspect which nouns and verbs are most relevant for each class prediction through feature importance
scores.</p>
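<p>The binary BoW construction can be sketched as below (equivalently, scikit-learn's CountVectorizer with binary=True and max_features=5000 over the filtered lemmas; the helper names and toy documents are our own).</p>

```python
from collections import Counter

def build_vocab(docs, max_features=5000):
    """Vocabulary = top-N most frequent lemmas across the POS-filtered corpus."""
    freq = Counter(tok for doc in docs for tok in doc)
    return {tok: i for i, (tok, _) in enumerate(freq.most_common(max_features))}

def binary_bow(doc, vocab):
    """Presence/absence vector over the fixed vocabulary."""
    vec = [0] * len(vocab)
    for tok in set(doc):
        if tok in vocab:
            vec[vocab[tok]] = 1
    return vec

docs = [["hotel", "habitacion", "hotel"],
        ["comida", "restaurante"],
        ["hotel", "vista"]]
vocab = build_vocab(docs, max_features=4)
print(binary_bow(docs[0], vocab))  # → [1, 1, 0, 0]
```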
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Classification with Random Forests</title>
        <p>We train a separate Random Forest classifier for each task (polarity, type, town). Each model uses
100 decision trees, Gini impurity as the splitting criterion, and no depth restriction. The balanced
class distribution allows us to skip class weighting and directly optimize accuracy and macro-averaged
F1-score.</p>
        <p>All classifiers are trained on 80% of the balanced dataset and evaluated on the remaining 20% using
stratified splits. No external resources or embeddings are used, keeping the method fully self-contained
and computationally efficient.</p>
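<p>The training setup can be sketched with scikit-learn under the stated hyperparameters (100 trees, Gini criterion, unrestricted depth, stratified 80/20 split); the fixed random_state and the toy BoW vectors are our additions for reproducibility of the sketch.</p>

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def train_task_model(X, y, seed=13):
    """One independent Random Forest per task: 100 trees, Gini splits,
    no depth limit, evaluated on a stratified 20% validation split."""
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    clf = RandomForestClassifier(
        n_estimators=100, criterion="gini", max_depth=None, random_state=seed)
    clf.fit(X_tr, y_tr)
    macro_f1 = f1_score(y_va, clf.predict(X_va), average="macro")
    return clf, macro_f1

# Toy binary BoW vectors: feature 0 marks class "hotel", feature 1 "restaurant".
X = [[1, 0]] * 20 + [[0, 1]] * 20
y = ["hotel"] * 20 + ["restaurant"] * 20
clf, macro_f1 = train_task_model(X, y)
```

<p>The same call is made three times, once each for the polarity, type, and town labels.</p>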
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We evaluated our approach across the three classification tasks defined in the Rest-Mex 2025 challenge:
sentiment polarity (5 classes), service type (3 classes), and tourist town identification (40 classes).
All models were trained on class-balanced subsets using only nouns and verbs as input features in a
bag-of-words representation, and evaluated on held-out validation data.</p>
      <p>Table 2 summarizes the macro-averaged F1-scores obtained for each task. While the performance
remains modest compared to Transformer-based baselines, the results demonstrate the viability of
interpretable, resource-light pipelines in multilingual tourism opinion classification.</p>
      <sec id="sec-4-1">
        <title>Table 2: Macro F1-score per Task</title>
        <p>Polarity Classification (5 classes): 0.198. Service Type Classification (3 classes): 0.331. Town Identification (40 classes): 0.025.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Analysis by Task</title>
        <sec id="sec-4-2-1">
          <title>4.1. Polarity Classification</title>
          <p>Despite the use of balanced training data, the polarity classifier struggled to distinguish between
nuanced sentiment levels. Performance was particularly low on the negative and neutral classes (F1 ≈
0.02–0.07), while the model showed slightly better ability in detecting positive (F1 ≈ 0.21) and very
positive (F1 ≈ 0.65) opinions. This asymmetry may reflect the limitations of BoW in capturing subtle
linguistic signals like irony, intensifiers, or negation.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2. Service Type Classification</title>
          <p>Service type classification achieved the highest F1-score among the three tasks. The model was
moderately effective in distinguishing between Hotel, Restaurant, and Attractive categories (macro F1
= 0.331). This is likely due to the presence of characteristic nouns (e.g., habitación, comida, vista) that
provide strong lexical cues.</p>
        </sec>
        <sec id="sec-4-2-3">
          <title>4.3. Town Classification</title>
          <p>Identifying the town mentioned in the review proved to be the most difficult task, with a macro F1-score
of just 0.025. Although some highly frequent towns (e.g., Tulum, Isla Mujeres) were recognized better
than others, most low-frequency classes exhibited near-zero performance. This highlights a fundamental
limitation of BoW models: their inability to capture entity-level semantics or geographical relationships.</p>
        </sec>
        <sec id="sec-4-2-4">
          <title>4.4. Efficiency and Practicality</title>
          <p>From a computational perspective, the entire training and evaluation pipeline completed in under five
minutes on a standard laptop (Intel i7, 16 GB RAM). This reinforces the main advantage of the proposed
method: efficiency and ease of deployment. While accuracy is sacrificed, the cost of experimentation is
significantly reduced.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this study, we explored a resource-efficient approach to multilingual text classification in the tourism
domain, using a classical machine learning pipeline based on bag-of-words representations filtered by
part-of-speech and Random Forest classifiers. By balancing the dataset through random undersampling
and restricting input features to content-bearing words (nouns and verbs), we aimed to reduce noise
and emphasize interpretability.</p>
      <p>Our results on the Rest-Mex 2025 dataset show that, while such classical approaches are far from
achieving state-of-the-art performance—particularly in complex tasks like fine-grained sentiment
analysis or town identification—they offer a lightweight and transparent alternative in scenarios with
limited computational resources. The best results were obtained in service type classification, suggesting
that even simple lexical patterns can be sufficient for detecting domain-specific categories.</p>
      <p>Nonetheless, the substantial gap between our method and Transformer-based architectures highlights
the inherent limitations of bag-of-words models, including their inability to capture context, semantics,
or syntactic dependencies. Future work may explore hybrid strategies that combine classical
interpretability with lightweight neural embeddings, or evaluate the trade-offs between model complexity
and portability in real-world tourism applications.</p>
      <p>Ultimately, this work contributes a strong, interpretable baseline for Spanish-language opinion
classification, and underscores the value of accessible methods in multilingual NLP, especially for
educational and low-resource settings.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>We declare that the present manuscript has been written entirely by the authors and that no
generative artificial intelligence tools were used in its preparation, drafting, or editing.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[5] M. Á. Álvarez-Carmona, R. Aranda, S. Arce-Cárdenas, D. Fajardo-Delgado, R. Guerrero-Rodríguez, A. P. López-Monroy, J. Martínez-Miranda, H. Pérez-Espinosa, A. Rodríguez-González, Overview of Rest-Mex at IberLEF 2021: Recommendation system for text Mexican tourism, Procesamiento del Lenguaje Natural 67 (2021). doi:10.26342/2021-67-14.</p>
      <p>[6] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, D. Fajardo-Delgado, R. Guerrero-Rodríguez, L. Bustio-Martínez, Overview of Rest-Mex at IberLEF 2022: Recommendation system, sentiment analysis and COVID semaphore prediction for Mexican tourist texts, Procesamiento del Lenguaje Natural 69 (2022) 289–299.</p>
      <p>[7] M. Á. Álvarez-Carmona, R. Aranda, Determinación automática del color del semáforo mexicano del COVID-19 a partir de las noticias [Automatic determination of the Mexican COVID-19 traffic-light colour from news] (2022).</p>
      <p>[8] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, V. Muñiz-Sánchez, A. P. López-Monroy, F. Sánchez-Vega, L. Bustio-Martínez, Overview of Rest-Mex at IberLEF 2023: Research on sentiment analysis task for Mexican tourist texts, Procesamiento del Lenguaje Natural 71 (2023) 425–436.</p>
      <p>[9] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, L. Bustio-Martínez, V. Herrera-Semenets, Overview of Rest-Mex at IberLEF 2025: Researching sentiment evaluation in text for Mexican magical towns, volume 75, 2025.</p>
      <p>[10] J. Á. González-Barba, L. Chiruzzo, S. M. Jiménez-Zafra, Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org, 2025.</p>
      <p>[11] S. Arce-Cardenas, D. Fajardo-Delgado, M. Á. Álvarez-Carmona, J. P. Ramírez-Silva, A tourist recommendation system: a study case in Mexico, in: Mexican International Conference on Artificial Intelligence, Springer, 2021, pp. 184–195.</p>
      <p>[12] M. A. Álvarez-Carmona, R. Aranda, A. Y. Rodríguez-González, L. Pellegrin, H. Carlos, Classifying the Mexican epidemiological semaphore colour from the COVID-19 text Spanish news, Journal of Information Science 50 (2024) 568–589.</p>
      <p>[13] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.</p>
      <p>[14] M. Gamon, A. Aue, S. Corston-Oliver, E. Ringger, Pulse: Mining customer opinions from free text, in: Advances in Intelligent Data Analysis VI: 6th International Symposium on Intelligent Data Analysis, IDA 2005, Madrid, Spain, September 8–10, 2005, Proceedings 6, Springer, 2005, pp. 121–132.</p>
      <p>[15] V. G. Morales-Murillo, H. Gómez-Adorno, D. Pinto, I. A. Cortés-Miranda, P. Delice, LKE-IIMAS team at Rest-Mex 2023: Sentiment analysis on Mexican tourism reviews using transformer-based domain adaptation (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Aranda</surname></string-name>,
          <string-name><given-names>A. Y.</given-names> <surname>Rodríguez-González</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Fajardo-Delgado</surname></string-name>,
          <string-name><given-names>M. G.</given-names> <surname>Sánchez</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Pérez-Espinosa</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Martínez-Miranda</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Guerrero-Rodríguez</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Bustio-Martínez</surname></string-name>,
          <string-name><given-names>Á.</given-names> <surname>Díaz-Pacheco</surname></string-name>,
          <article-title>Natural language processing applied to tourism research: A systematic review and future research directions</article-title>,
          <source>Journal of King Saud University - Computer and Information Sciences</source>
          <volume>34</volume>
          (<year>2022</year>)
          <fpage>10125</fpage>–<lpage>10144</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>E.</given-names> <surname>Olmos-Martínez</surname></string-name>,
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Aranda</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Díaz-Pacheco</surname></string-name>,
          <article-title>What does the media tell us about a destination? The Cancun case, seen from the USA, Canada, and Mexico</article-title>,
          <source>International Journal of Tourism Cities</source>
          <volume>10</volume>
          (<year>2024</year>)
          <fpage>639</fpage>–<lpage>661</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Á.</given-names> <surname>Díaz-Pacheco</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Guerrero-Rodríguez</surname></string-name>,
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <string-name><given-names>A. Y.</given-names> <surname>Rodríguez-González</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Aranda</surname></string-name>,
          <article-title>A comprehensive deep learning approach for topic discovering and sentiment analysis of textual information in tourism</article-title>,
          <source>Journal of King Saud University - Computer and Information Sciences</source>
          <volume>35</volume>
          (<year>2023</year>)
          <fpage>101746</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>J. D.</given-names> <surname>Jurado-Buch</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Minayo-Díaz</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Tello</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Chaucanes</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Salazar</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Oquendo-Coral</surname></string-name>,
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <article-title>A single model based on BETO to classify Spanish tourist opinions through the random instances selection</article-title>,
          <year>2023</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>