<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Journal of King Saud University</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1080/13683500.2021.2007227</article-id>
      <title-group>
        <article-title>Holistic Classification of Tourism Reviews: A Structured Prediction Approach with Energy-Based Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hugo Carlos-Martínez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jorge Pool-Cen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Investigación en Ciencias de Información Geoespacial</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Laboratorio Nacional de Geointeligencia</institution>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Secretaría de Ciencia</institution>
          ,
          <addr-line>Humanidades, Tecnología e Innovación, SECIHTI</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>1</volume>
      <fpage>10125</fpage>
      <lpage>10144</lpage>
      <abstract>
        <p>The analysis of user-generated content is vital for understanding tourism dynamics, particularly for culturally significant destinations like Mexico's "Pueblos Mágicos." These reviews contain multiple, interdependent facets, including sentiment, site type, and location. However, standard multi-task classification models address these aspects independently, relying on a flawed assumption of conditional independence that often leads to predictions that are locally plausible but globally incoherent. To address this limitation, we propose a novel framework based on an Energy-Based Model (EBM) for structured prediction. Instead of predicting each label in isolation, our model learns a global energy function that measures the semantic compatibility between the raw text of a review and an entire set of candidate labels. Inference is then performed by searching for the label configuration that minimizes this energy function, thereby identifying the most coherent and plausible output. This approach provides a principled method to capture the complex relationships between classification aspects, demonstrating a path toward generating more reliable, consistent, and semantically sound insights from user-generated content at scale.</p>
      </abstract>
      <kwd-group>
<kwd>Energy-Based Models</kwd>
        <kwd>Structured Prediction</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Multi-Task Learning</kwd>
        <kwd>Tourism Reviews</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Importance of Magical Towns and Digital Tourism</title>
        <p>
          The "Magical Towns" program (Pueblos Mágicos, PPM), established by Mexico’s Ministry of Tourism in
2001, represents one of the country’s most significant tourism development strategies in recent decades
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Its primary objective is to diversify the national tourism offering, traditionally concentrated on
sun-and-beach destinations, by revaluing towns with unique historical, cultural, and natural attributes.
The program structures a tourism offering based on local uniqueness, promoting festivals, gastronomy,
crafts, and tangible and intangible heritage to generate differentiated tourism products [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Since
its creation, the program has expanded considerably, growing from a handful of initial towns to a
consolidated network of 177 Magical Towns distributed throughout the national territory by 2023 [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ].
        </p>
        <p>
          The program’s relevance extends beyond symbolism, generating profound economic and social impact.
Tourism constitutes approximately 13% of economic activity in municipalities with this designation [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
These destinations house over 10 million inhabitants and maintain considerable tourism infrastructure,
including more than 7,300 accommodation establishments with nearly 160,000 available rooms [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
Experience quality in these locations is notably high, with studies showing average tourist satisfaction
ratings of 8.55 out of 10 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>
          In the contemporary context, traveler decision-making is inextricably linked to the digital ecosystem.
User-generated content (UGC) platforms, with TripAdvisor as the dominant player, have become
indispensable tools for travel planning [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Economic studies have quantified TripAdvisor’s massive
influence on global tourism spending, which reached 460 billion euros in 2017 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. In the Mexican
context, penetration is particularly high; research in Saltillo, Coahuila, revealed that 99% of young
travelers use the internet for information, with over three-quarters specifically using TripAdvisor [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>This confluence of successful public policy and global technological trends has generated a vast corpus
of unstructured data in the form of Spanish-language reviews. Manual analysis of this information
volume is unfeasible, creating an urgent need and unique opportunity for Natural Language Processing
(NLP) tools capable of extracting actionable insights at scale [9, 10, 11, 12].</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. The Multi-Aspect Classification Challenge</title>
        <p>The analysis of tourism reviews presents a challenge that extends beyond simple sentiment analysis. Each
review is an information-rich document that encapsulates multiple facets of the traveler’s experience.
To extract its full value, it is necessary to address a multi-aspect classification problem. Formally, this
task can be defined as a structured prediction problem.</p>
        <p>Given a review document x ∈ 𝒳, where 𝒳 is the space of all possible review texts, the objective is to
predict a structured label tuple y = (y_sent, y_type, y_loc) ∈ 𝒴. The output space 𝒴 is the Cartesian product
of three individual label spaces:
1. Sentiment Polarity (y_sent): An ordinal label representing the user’s original rating, y_sent ∈
{1, 2, 3, 4, 5}, where 1 is very negative and 5 is very positive.
2. Site Type (y_type): A categorical label identifying the type of establishment being reviewed,
y_type ∈ {Hotel, Restaurant, Attraction}.
3. Location (y_loc): A categorical label identifying which of the 177 Magical Towns the review
belongs to, y_loc ∈ {Aculco, Bacalar, Creel, . . . , Zozocolco}.</p>
        <p>The combined output space 𝒴 is therefore a large, discrete, combinatorial space, with a total of
5 × 3 × 177 = 2,655 possible label tuples. The task consists of learning a mapping f : 𝒳 → 𝒴 that, for
a given review, predicts the most plausible and coherent label tuple.</p>
        <p>This structured prediction formulation captures the inherent complexity of tourism review analysis,
where multiple interdependent aspects must be simultaneously considered to achieve accurate and
meaningful classification results.</p>
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Limitations of Conventional Models and Our Contribution</title>
        <p>A common approach to address multi-aspect problems like this is to employ a multi-task learning (MTL)
architecture with hard parameter sharing [13]. In our case, this would translate to a strong baseline
model: a Transformer-based text encoder, such as BETO [14], whose contextual representations feed
into three independent classification "heads."</p>
        <p>The main deficiency of this approach lies in its implicit assumption of conditional independence
between output labels given the input text. Mathematically, this model assumes that the joint probability
of the label tuple can be factorized as the product of the marginal probabilities of each label:
P(y_sent, y_type, y_loc | x) ≈ P(y_sent | x) · P(y_type | x) · P(y_loc | x)   (1)</p>
        <p>This assumption is fundamentally incorrect in domains where inherent correlations exist between
labels. As literature has consistently shown, ignoring these dependencies leads to the generation of
outputs that, while locally plausible, are globally incoherent [15, 16]. To illustrate, consider a review
containing the phrase "we enjoyed the beach and the sun" ("disfrutamos de la playa y el sol"):
• An independent classifier might correctly predict a positive sentiment and a site type of
"Attraction."
• However, due to unrelated keywords or data noise, it could erroneously predict the location as
"Creel," a landlocked town in the Chihuahua mountains.</p>
        <p>The resulting tuple, (Positive, Attraction, Creel), is semantically absurd. The model is unable to reason
that the concept of "beach" in the text makes the location "Creel" extremely improbable.</p>
        <p>To overcome this structural weakness, this paper presents as its main contribution the design and
application of an Energy-Based Model (EBM) for structured prediction in the domain of Spanish
tourism reviews. Unlike conventional probabilistic models that require computing an intractable
partition function to model P(y | x), an EBM learns a compatibility or energy function, E_θ(x, y). This
function, parameterized by a deep neural network θ, assigns a low scalar energy to label configurations
y that are highly compatible with the review text x, and a high energy to those that are incoherent.</p>
        <p>Inference is then elegantly formulated as an energy minimization problem:
y* = argmin_{y ∈ 𝒴} E_θ(x, y)   (2)</p>
        <p>This framework provides a flexible and powerful way to explicitly model the high-order dependencies
between the full set of labels and the semantic content of the text, without the need for probabilistic
normalization.</p>
      </sec>
      <sec id="sec-1-4">
        <title>1.4. Article Structure</title>
        <p>The remainder of this article is organized as follows. Section 2 presents a review of the state of the
art in sentiment analysis, multi-task classification, and energy-based models in the NLP field. Section
3 details the methodology of our proposed EBM model. Section 4 describes the experimental design,
including dataset construction, evaluation metrics, and baseline models. Section 5 presents and analyzes
the results. Finally, Section 6 concludes the work and discusses future research directions.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The analysis of Spanish tourism reviews requires advances across three interconnected research areas:
sentiment analysis, multi-task learning, and structured prediction. This section traces the evolution of
these fields and positions our energy-based approach within the current landscape.</p>
      <sec id="sec-2-1">
        <title>2.1. Evolution of Sentiment Analysis in Spanish</title>
        <p>Sentiment analysis has undergone three major paradigmatic shifts, each driven by advances in text
representation and modeling capabilities. Classical machine learning approaches relied on bag-of-words
representations with TF-IDF weighting, feeding traditional classifiers like Naive Bayes and Support
Vector Machines [17, 18]. While computationally efficient and interpretable, these methods failed to
capture semantic context and long-range dependencies.</p>
        <p>The deep learning revolution introduced sequential modeling through Convolutional Neural Networks
for local feature extraction [19] and Long Short-Term Memory networks for capturing long-range
dependencies [20]. LSTMs became the architecture of choice for sentiment analysis, significantly
outperforming classical methods by modeling word order and sentential context [21].</p>
        <p>The current state-of-the-art is dominated by Transformer-based models, whose self-attention
mechanism enables simultaneous consideration of all words in a sequence [22]. The breakthrough came
with large-scale pre-training on massive unlabeled corpora, exemplified by BERT’s bidirectional
representations [23]. For Spanish NLP, models like BETO have demonstrated superior performance over
multilingual or translated approaches [24], making them the natural choice for our text encoder.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multi-Task Learning and Its Limitations</title>
        <p>Modern NLP commonly addresses multi-output problems through hard parameter sharing, where a
shared encoder (typically a Transformer) feeds multiple independent classification heads [25]. This
architecture is attractive for its efficiency and regularization effects, particularly when tasks are related
and data is scarce [26].</p>
        <p>However, this approach suffers from a critical conceptual limitation: it assumes conditional
independence between output labels given the input text. Recent work has demonstrated that this assumption
is fundamentally flawed when labels exhibit strong correlations [26]. In real-world scenarios, ignoring
label dependencies can lead to logically inconsistent or statistically improbable predictions [27]. The
problem lies not in the encoder’s capacity to understand input, but in the output architecture’s inability
to model output structure.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Energy-Based Models for Structured Prediction</title>
        <p>Energy-Based Models (EBMs) offer a powerful alternative framework, popularized by LeCun’s seminal
work [28]. Instead of directly modeling probability distributions, EBMs learn an energy function
E(x, y) that assigns low energy to compatible input-output pairs and high energy to incompatible ones.
Inference becomes an optimization problem: y* = argmin_{y} E(x, y).</p>
        <p>A key advantage of EBMs for structured prediction is avoiding the intractable partition function
computation required by probabilistic models [28]. In NLP, EBMs have been successfully applied to
language modeling [29], structured prediction through SPENs [30], and various tasks where global
output coherence is paramount [26].</p>
        <p>Our work extends this paradigm to multi-aspect classification of Spanish tourism reviews. While
standard softmax classifiers can be viewed as locally normalized EBMs [28], our approach uses
Transformer representations to define a global energy function over the entire output tuple. This formulation
transcends conditional independence assumptions and explicitly models semantic compatibility across
the structured output space, representing a novel contribution that bridges contextual representation
advances with structured prediction principles.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Problem Formulation as Structured Prediction</title>
        <p>We depart from traditional approaches that treat the classification of sentiment, site type, and location
as independent tasks. Instead, we frame the problem as a structured prediction task. The goal is
to learn a single, holistic model that jointly predicts the entire set of labels for a given tourist review,
thereby capturing the inherent dependencies and constraints among the output variables.</p>
        <p>As established in Section 1.2, our objective is to predict a composite output variable y for a given
input review x (represented as a sequence of tokens). We reiterate this formal definition here as the
foundation of our methodology. The structured tuple y contains the three aspects of interest:
y = (y_pol, y_type, y_pm)
where each component belongs to a discrete, finite set:
• Sentiment Polarity: y_pol ∈ 𝒴_pol = {1, 2, 3, 4, 5}
• Site Type: y_type ∈ 𝒴_type = {Hotel, Restaurant, Attraction}
• Pueblo Mágico: y_pm ∈ 𝒴_pm = {Pueblo_1, . . . , Pueblo_N}, where N is the total number of Pueblos
Mágicos in our dataset.</p>
        <p>The complete output space 𝒴 is the Cartesian product of these individual label spaces: 𝒴 = 𝒴_pol ×
𝒴_type × 𝒴_pm.</p>
        <p>The core of our approach is an Energy-Based Model (EBM), which learns a scalar-valued energy
function E(x, y). This function measures the compatibility between an input review x and a potential
output structure y [28]. The energy value is interpreted as follows:
• Low Energy: Indicates a high degree of compatibility. The set of labels in y is a plausible and
coherent description for the review x.
• High Energy: Indicates a low degree of compatibility, or incompatibility. The set of labels in y is
an unlikely or inconsistent description for the review x.</p>
        <p>This energy function implicitly defines a conditional probability distribution over the output space 𝒴
via the Boltzmann (or Gibbs) distribution:
P(y | x) = exp(−E(x, y)) / Z(x)   (3)
where Z(x) = Σ_{y′ ∈ 𝒴} exp(−E(x, y′)) is the partition function, which normalizes the distribution over
all possible output structures.</p>
        <p>Thus, the learning task transforms into finding the parameters θ of a function E_θ(x, y) that assign the
lowest energy to the ground-truth label configuration and higher energies to all incorrect configurations.
The subsequent sections will detail the architecture of E_θ(x, y) and the contrastive learning strategy
used for its training.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Architectural Formulation of the Energy Model</title>
        <p>The energy function E_θ(x, y) is parameterized by a neural network designed to process both the
unstructured text and the structured labels. The architecture is composed of three functional blocks
which we detail below.</p>
        <p>First, to capture the rich semantic information from the review text x, we employ a pre-trained
Transformer. Specifically, we use BETO, a BERT model trained on a large Spanish corpus, which is
ideal for this task. The review is tokenized and processed by the model, and we use the final hidden
state of the special [CLS] token as the holistic text representation, h_x:
h_x = BETO(x) ∈ R^{d_BERT}   (4)</p>
        <p>Second, the structured label tuple y = (y_pol, y_type, y_pm) is encoded into a vector h_y. Each component
is mapped to a dense embedding via a dedicated embedding matrix (E_pol, E_type, E_pm), and the resulting
vectors are concatenated:
e_pol = E_pol(y_pol); e_type = E_type(y_type); e_pm = E_pm(y_pm)   (5)
h_y = concat(e_pol, e_type, e_pm) ∈ R^{d_pol + d_type + d_pm}   (6)</p>
        <p>Finally, the compatibility score is computed by an Energy Module, which is a Multi-Layer Perceptron
(MLP). This MLP takes the concatenated text and label representations as input and outputs a single
scalar value representing the energy. This allows the model to learn complex, non-linear interactions
between the review’s content and the proposed labels.</p>
        <p>E_θ(x, y) = MLP(concat(h_x, h_y)) ∈ R   (7)
The MLP typically consists of several hidden layers with non-linear activations (e.g., ReLU), followed
by a final linear output neuron.</p>
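        <p>To make this architecture concrete, the following is a minimal PyTorch sketch of the energy function described above, using the BETO checkpoint named in Section 4.4. The class name, the 32-dimensional label embeddings, and the 256-unit MLP are illustrative choices of ours rather than the exact configuration used in our experiments.</p>
        <preformat>
import torch
import torch.nn as nn
from transformers import AutoModel

class EnergyModel(nn.Module):
    """Scalar energy E(x, y) measuring text-label compatibility (illustrative sketch)."""

    def __init__(self, n_pol=5, n_type=3, n_pm=177, label_dim=32, hidden_dim=256):
        super().__init__()
        # Text encoder: BETO (Spanish BERT); h_x is the [CLS] hidden state.
        self.encoder = AutoModel.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
        d_bert = self.encoder.config.hidden_size  # 768
        # One embedding matrix per label component (Equation 5).
        self.emb_pol = nn.Embedding(n_pol, label_dim)
        self.emb_type = nn.Embedding(n_type, label_dim)
        self.emb_pm = nn.Embedding(n_pm, label_dim)
        # Energy Module: MLP over the concatenated text and label representations (Equation 7).
        self.mlp = nn.Sequential(
            nn.Linear(d_bert + 3 * label_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def encode_text(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0, :]          # h_x, shape (B, d_bert)

    def encode_labels(self, pol, site_type, pm):
        # h_y: concatenation of the three label embeddings, shape (B, 3 * label_dim)
        return torch.cat([self.emb_pol(pol), self.emb_type(site_type), self.emb_pm(pm)], dim=-1)

    def forward(self, input_ids, attention_mask, pol, site_type, pm):
        h_x = self.encode_text(input_ids, attention_mask)
        h_y = self.encode_labels(pol, site_type, pm)
        return self.mlp(torch.cat([h_x, h_y], dim=-1)).squeeze(-1)   # energies, shape (B,)
        </preformat>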
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Training Strategy: Contrastive Learning</title>
        <p>Training an Energy-Based Model presents a unique challenge. Directly minimizing the energy E(x, y)
of observed pairs admits a trivial solution: the model could learn to output a constant low energy for all
inputs. Maximizing the likelihood (Equation 3) is generally intractable, as it requires computing the partition
function Z(x), which involves a sum over the entire, often exponentially large, output space 𝒴.</p>
        <p>To circumvent this, we employ a contrastive learning framework. The objective is not to model
the probability distribution explicitly, but rather to "sculpt" the energy landscape such that the energy
of the ground-truth pair (x, y+) is lower than the energy of all other "negative" or "contrastive" pairs
(x, y−).</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Negative Sampling</title>
          <p>A critical component of this strategy is the generation of informative negative samples. For each training
instance, which consists of a review x and its correct label tuple y+ = (y+_pol, y+_type, y+_pm), we generate a
set of K negative label tuples, {y−_1, . . . , y−_K}.</p>
          <p>We generate these negatives by corrupting one or more components of the ground-truth tuple
y+. This creates a spectrum of negatives, from "easy" (where all components are wrong) to "hard"
(where only one component is subtly incorrect). For example, if y+ = (5-star, Hotel, Tulum), potential
negatives could be:
• (1-star, Hotel, Tulum): A hard negative, forcing the model to rely on sentiment cues in the text.
• (5-star, Restaurant, Tulum): Another hard negative, requiring the model to distinguish between
hotel and restaurant-specific vocabulary.
• (5-star, Hotel, Creel): An easy negative, as the geographic and contextual cues for Tulum and
Creel are vastly different.</p>
          <p>This strategy ensures that the model learns to make fine-grained distinctions and understands the
compatibility between the text and the full label structure.</p>
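          <p>A minimal sketch of this corruption-based sampling is given below. Our description does not fix the exact mix of corrupted components, so the uniform choice of one to three components, the helper names, and the truncated list of Pueblos Mágicos are illustrative assumptions.</p>
          <preformat>
import random

POLARITIES = [1, 2, 3, 4, 5]
SITE_TYPES = ["Hotel", "Restaurant", "Attraction"]
PUEBLOS = ["Aculco", "Bacalar", "Creel", "Tulum"]   # full list of 177 towns in practice

def corrupt(value, candidates):
    """Replace `value` with a different element of `candidates`."""
    return random.choice([c for c in candidates if c != value])

def sample_negatives(y_pos, k=8):
    """Build k distinct negatives by corrupting one to three components of y_pos."""
    pol, site_type, pm = y_pos
    negatives = []
    while len(negatives) != k:
        which = random.sample(["pol", "type", "pm"], k=random.randint(1, 3))
        y_neg = (
            corrupt(pol, POLARITIES) if "pol" in which else pol,
            corrupt(site_type, SITE_TYPES) if "type" in which else site_type,
            corrupt(pm, PUEBLOS) if "pm" in which else pm,
        )
        if y_neg != y_pos and y_neg not in negatives:
            negatives.append(y_neg)
    return negatives

# e.g. sample_negatives((5, "Hotel", "Tulum"), k=4) may return hard negatives such as
# (1, "Hotel", "Tulum") and easy ones such as (2, "Restaurant", "Creel").
          </preformat>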
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Contrastive Loss Function</title>
          <p>We use the InfoNCE (Noise Contrastive Estimation) loss, a widely used objective in self-supervised
and contrastive learning [31, 32]. For a given input x, we treat the task as identifying its true label
configuration y+ from a set containing y+ and K negative samples {y−_1, . . . , y−_K}.</p>
          <p>The loss is formulated as the negative log-likelihood of correctly classifying the positive sample.
The probability of selecting y+ is modeled using a softmax function over the negative energies of the
candidate set.</p>
          <p>ℒ(x, y+, {y−_k}) = − log [ exp(−E_θ(x, y+)/τ) / ( exp(−E_θ(x, y+)/τ) + Σ_{k=1}^{K} exp(−E_θ(x, y−_k)/τ) ) ]   (8)
where τ is a temperature hyperparameter that controls the sharpness of the distribution. A lower
temperature makes the classification task harder, forcing the model to be more discriminative.</p>
          <p>During training, we iterate through the dataset, and for each sample (x, y+), we generate K negatives,
compute their energies along with the energy of the positive pair, and update the model parameters θ
by minimizing the loss ℒ via stochastic gradient descent.</p>
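          <p>The following sketch shows how Equation (8) can be computed in PyTorch, assuming the energies of the positive and of the K negative tuples have already been evaluated (for example with the model and sampler sketched above); placing the positive in slot 0 of a softmax reduces the loss to a standard cross-entropy.</p>
          <preformat>
import torch
import torch.nn.functional as F

def info_nce_loss(energy_pos, energy_neg, temperature=0.1):
    """
    Equation (8) as a cross-entropy over negated, temperature-scaled energies.
    energy_pos: (B,)   energies of the ground-truth pairs E(x, y+)
    energy_neg: (B, K) energies of the K negative pairs   E(x, y-_k)
    """
    logits = torch.cat([-energy_pos.unsqueeze(1), -energy_neg], dim=1) / temperature
    # The positive tuple occupies slot 0 of every row, so the target class is always 0.
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
          </preformat>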
        </sec>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Inference: Finding the Minimum Energy Configuration</title>
        <p>Once the energy model E_θ(x, y) has been trained, the inference process for a new, unseen review x_new
consists of finding the label configuration y* from the entire output space 𝒴 that minimizes the energy
function. This is equivalent to finding the most probable output under the model’s learned distribution
(see Equation 3):
y* = argmin_{y ∈ 𝒴} E_θ(x_new, y)   (9)</p>
        <p>For many structured prediction problems, this search can be computationally prohibitive. However,
in our specific problem setting, the output space 𝒴 is discrete and of a manageable size. The total
number of possible label configurations is the product of the cardinalities of the individual label sets:
|𝒴| = |𝒴_pol| × |𝒴_type| × |𝒴_pm|
Given the defined cardinalities (|𝒴_pol| = 5, |𝒴_type| = 3) and the number of Pueblos Mágicos in our study
(approx. 177), the total search space is |𝒴| ≈ 5 × 3 × 177 = 2,655.</p>
        <p>This number is small enough to allow for an exhaustive search at inference time. The procedure is
as follows:
1. For a given new review x_new, generate all possible y ∈ 𝒴.
2. Encode the review once to obtain the vector h_x_new.
3. For each candidate label tuple y_i, compute its embedding h_y_i.
4. Calculate the energy E_θ(x_new, y_i) for all i = 1, . . . , |𝒴|.
5. The final prediction y* is the tuple that yielded the lowest energy score.</p>
        <p>This brute-force approach guarantees that we find the global minimum of the energy function over
the output space, ensuring that the final prediction is the most coherent and compatible label set
according to the learned model. This deterministic and exact inference process is a significant advantage
of applying EBMs to problems with moderately-sized, discrete output spaces.</p>
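        <p>A sketch of this exhaustive procedure, reusing the hypothetical EnergyModel interface from the Section 3.2 sketch, is shown below; the review is encoded once and all 2,655 candidate tuples are scored in a single batched pass.</p>
        <preformat>
import itertools
import torch

@torch.no_grad()
def predict(model, input_ids, attention_mask, n_pol=5, n_type=3, n_pm=177):
    """Exhaustive inference: return the index tuple (pol, type, pm) with minimum energy."""
    # Step 2: encode the review once.
    h_x = model.encode_text(input_ids, attention_mask)                  # (1, d_bert)
    # Step 1: enumerate all candidate label tuples.
    candidates = list(itertools.product(range(n_pol), range(n_type), range(n_pm)))
    pol, site_type, pm = (torch.tensor(c) for c in zip(*candidates))    # each (2655,)
    # Step 3: embed every candidate tuple.
    h_y = model.encode_labels(pol, site_type, pm)                       # (2655, 3 * label_dim)
    # Step 4: score all candidates against the same text representation in one batch.
    energies = model.mlp(
        torch.cat([h_x.expand(len(candidates), -1), h_y], dim=-1)
    ).squeeze(-1)
    # Step 5: the prediction is the lowest-energy configuration.
    return candidates[int(energies.argmin())]
        </preformat>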
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Design</title>
      <p>This section outlines the experimental setup designed to validate our proposed Energy-Based Model.
We adhere to the framework provided by the Rest-Mex 2025 shared task on sentiment analysis, using
its official dataset, evaluation metrics, and baselines for a fair and direct comparison.</p>
      <sec id="sec-4-1">
        <title>4.1. Dataset and Pre-processing</title>
        <p>The dataset for our experiments is provided by the organizers of the Rest-Mex 2025 shared task [33, 34].
Unlike previous editions [35, 36, 37], it consists of a collection of TripAdvisor reviews written in Spanish,
pertaining to the 177 officially designated "Pueblos Mágicos" of Mexico. The dataset is structured in
XML format and is partitioned into official training and test sets. We will use the provided training set
to train our models and the test set exclusively for the final evaluation, as per the competition rules.</p>
        <p>Each review in the dataset is associated with the three target labels that form our structured output :
• polarity: A 1-to-5 integer rating.
• type: The category of the reviewed establishment (Hotel, Restaurant, or Attraction).
• pueblo_magico: The name of the Pueblo Mágico.</p>
        <p>Pre-processing: For our Transformer-based models, minimal text pre-processing is required. The
primary step involves tokenizing the raw review text using the specific WordPiece tokenizer associated
with our chosen pre-trained model (BETO). No stemming, lemmatization, or extensive stop-word
removal is performed, in order to preserve the full context for the encoder.</p>
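        <p>As an illustration, this pre-processing reduces to a single call to BETO's WordPiece tokenizer; the maximum sequence length of 256 used below is an assumption for the sketch, not a reported setting.</p>
        <preformat>
from transformers import AutoTokenizer

# BETO's WordPiece tokenizer; no stemming, lemmatization, or stop-word removal.
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")

review = "Nuestra estancia en el hotel fue mágica, las habitaciones son muy cómodas."
batch = tokenizer(review, truncation=True, max_length=256,
                  padding="max_length", return_tensors="pt")
print(batch["input_ids"].shape)   # torch.Size([1, 256])
        </preformat>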
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Baselines for Comparison</title>
        <p>To demonstrate the efficacy of our joint-energy approach, we will compare its performance against two
strong baseline models that represent common strategies for this type of task.</p>
        <p>1. Independent Classifiers (BETO-Indep): This baseline consists of three separate BETO models.
Each model is independently fine-tuned for one of the three subtasks (polarity, type, or Pueblo
Mágico). This approach treats the tasks as completely unrelated and serves to measure the
performance without any knowledge sharing.
2. Multi-Task Model (BETO-Multi): This is a more advanced baseline consisting of a single,
shared BETO encoder with three independent classification heads. One head is a linear layer for
polarity classification, another for site type, and a third for Pueblo Mágico classification. The
model is trained jointly by summing the cross-entropy losses from each head. This architecture
allows for implicit knowledge sharing through the shared text representations but assumes
conditional independence between the outputs given the input. This is the most direct and
common alternative to our EBM, and outperforming it would strongly support our hypothesis
that explicit modeling of output dependencies is beneficial (a minimal sketch of this baseline is
given after this list).</p>
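        <p>The sketch below illustrates the BETO-Multi baseline; the head sizes follow the label spaces defined in Section 1.2, while the class names and the unweighted summing of the three cross-entropy losses are our illustrative reading of the description above.</p>
        <preformat>
import torch.nn as nn
from transformers import AutoModel

class BetoMulti(nn.Module):
    """Shared BETO encoder with three independent classification heads."""

    def __init__(self, n_pol=5, n_type=3, n_pm=177):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
        d = self.encoder.config.hidden_size
        self.head_pol = nn.Linear(d, n_pol)
        self.head_type = nn.Linear(d, n_type)
        self.head_pm = nn.Linear(d, n_pm)

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state[:, 0, :]
        return self.head_pol(h), self.head_type(h), self.head_pm(h)

def multitask_loss(logits, targets):
    """Joint objective: sum of the three per-head cross-entropy losses."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(l, t) for l, t in zip(logits, targets))
        </preformat>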
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Metrics</title>
        <p>Our evaluation protocol is designed to be fully compliant with the official guidelines of the Rest-Mex
2025 shared task, while also including a specific metric to validate our core hypothesis.</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Official Competition Metrics</title>
          <p>As specified by the competition organizers, the final ranking of the systems is determined by a weighted
average of the performance on the three sub-tasks. The evaluation for each task is based on the Macro
F1-Score, which is the unweighted average of the F1-Scores for each class within a task. This metric is
well-suited for multi-class problems, as it treats all classes equally, regardless of their frequency.</p>
          <p>The secondary metrics are the Macro F1-Scores for each individual task:
• Polarity (ResPol): The Macro F1-Score calculated over the 5 polarity classes.
• Site Type (ResType): The Macro F1-Score calculated over the 3 site type classes.
• Pueblo Mágico (ResMT): The Macro F1-Score calculated over the 177 Pueblo Mágico classes, as
defined in Equation (3) from the task description.</p>
          <p>The primary evaluation metric (Final_Score) is a weighted average of these three scores, giving
more importance to the polarity and Pueblo Mágico identification tasks.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Holistic Coherence Metric (EMR)</title>
          <p>In addition to the official competition metrics, we will report the Exact Match Ratio (EMR). While not
used for the official ranking, the EMR is central to our work as it directly measures the model’s ability
to predict the entire label tuple correctly. It is the strictest measure of a model’s holistic understanding
and predictive coherence. Formally:</p>
          <p>EMR = (1/N) Σ_{i=1}^{N} I( ŷ_pol^(i) = y_pol^(i) ∧ ŷ_type^(i) = y_type^(i) ∧ ŷ_pm^(i) = y_pm^(i) )
where N is the number of test reviews and I(·) is the indicator function.</p>
          <p>We hypothesize that our EBM, by explicitly modeling the dependencies between labels, will show a
significant improvement in EMR compared to the baselines, even if the gains in the individual Macro
F1-scores are more modest.</p>
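          <p>Both metrics are straightforward to compute. The sketch below, which assumes predictions and gold labels are given as parallel lists of (polarity, type, pueblo) tuples, uses scikit-learn for the per-task Macro F1-Scores and a direct count for the EMR.</p>
          <preformat>
from sklearn.metrics import f1_score

def exact_match_ratio(y_true, y_pred):
    """Fraction of reviews for which all three labels are predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_task_macro_f1(y_true, y_pred):
    """Macro F1-Score of each task, unpacking the (polarity, type, pueblo) tuples."""
    return tuple(
        f1_score([t[i] for t in y_true], [p[i] for p in y_pred], average="macro")
        for i in range(3)
    )
          </preformat>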
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Implementation Details</title>
        <p>All models will be implemented using the PyTorch framework. The core of our review encoder will
be the dccuchile/bert-base-spanish-wwm-cased model (BETO), accessed via the Hugging Face
Transformers library. Key hyperparameters, such as the learning rate, batch size, the embedding
dimensions for the label encoder, the temperature τ for the InfoNCE loss, and the number of negative
samples K, will be tuned based on performance on a dedicated validation split (10%) of the official
training data. The model showing the best EMR on the validation set will be selected for the final
evaluation on the test set.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>In this section, we present the performance of our proposed Energy-Based Model (EBM) and compare it
against the baseline models. The results are based on the official Rest-Mex 2025 test set. Our discussion
will focus not only on the quantitative improvements but also on the qualitative advantages of modeling
output dependencies.</p>
      <sec id="sec-5-1">
        <title>5.1. Quantitative Analysis</title>
        <p>The models were trained on the official training set, with hyperparameters selected based on performance
on a validation split. The final results on the validation set are summarized in Table 1.
1. Superior Overall Performance: Our EBM achieves the highest Final_Score (0.879),
outperforming both the independent classifiers (BETO-Indep) and the multi-task model (BETO-Multi).
This indicates that our approach is the most effective according to the official primary metric of
the shared task.
2. Modest Gains in Per-Task F1-Scores: The improvements in the individual Macro F1-Scores are
present but modest. The EBM shows the largest gain in the most complex task, Pueblo Mágico
identification (F1-MT), while being on par with the BETO-Multi model on the other tasks. This
suggests that while a multi-task setup effectively shares information through its encoder, it is not
sufficient to resolve more complex ambiguities.
3. Significant Improvement in Holistic Accuracy (EMR): The most compelling result is the
dramatic increase in the Exact Match Ratio. Our EBM achieves an EMR of 0.795, a significant
improvement of over 5 percentage points compared to the strongest baseline (BETO-Multi). This
demonstrates that our model is substantially more effective at producing fully correct, coherent
predictions. This finding strongly supports our central hypothesis: explicitly modeling the
dependencies between labels via an energy function leads to more globally consistent outputs.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Qualitative Analysis: A Case Study</title>
        <p>To understand how the EBM achieves superior coherence, consider the following (abbreviated) review:
• Review (x): "Nuestra estancia en el Hotel ’La Casona’ fue mágica. Las habitaciones son coloniales y
muy cómodas. Lo mejor, sin duda, es su restaurante ’El Patio’, la cecina de Yecapixtla que sirven es la
mejor que he probado. Una joya en Tepoztlán."
• Ground Truth (y+): {Polarity: 5, Type: Hotel, Pueblo Mágico: Tepoztlán}
The review is challenging because it praises a hotel but focuses heavily on its restaurant.
• BETO-Multi Prediction: This model, while capturing the positive sentiment and correct
location, incorrectly classifies the site type. Its attention mechanism is likely drawn to keywords
like "restaurante", "cecina", and "probado", leading to the prediction: {Polarity: 5, Type:
Restaurant, Pueblo Mágico: Tepoztlán}. This prediction is locally plausible but globally
incorrect, as the primary subject of the review is the hotel stay.
• Our EBM Prediction: Our model correctly predicts the ground truth. It arrives at this by
evaluating the energy of possible label configurations.</p>
        <p>– The energy (,  = {5, Hotel, Tepoztlán}) is very low. The model has learned that
highquality hotels often have praised restaurants, making this a highly compatible and coherent
configuration.
– The energy (,  = {5, Restaurant, Tepoztlán}) is comparatively higher. The presence
of keywords like "estancia" and "habitaciones" creates a slight "tension" or incompatibility
with the Restaurant label, which the energy function captures.</p>
        <p>By finding the configuration with the minimum energy, our model correctly identifies the main subject
of the review, demonstrating a deeper understanding of the context.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Analysis of Generalization Gap to the Test Set</title>
        <p>While our Energy-Based Model showed promising results on the validation set, leading to its selection
as our final model, its performance degraded substantially on the official test set. This significant gap
between validation and test performance indicates that our model, despite its sophisticated architecture,
overfitted to the specific characteristics of the training data and failed to generalize to unseen examples
effectively.</p>
        <p>We hypothesize that this underperformance stems from a combination of factors related to the
complexity of both our model’s architecture and its training regime:
• Limited Interaction in the Label Representation: A potential source of brittleness lies in how
the structured label  is encoded. We used simple concatenation of the three label embeddings
(pol, type, pm) to form the vector ℎ. This approach places the entire burden of learning the
complex, non-linear interactions between the labels on the subsequent MLP (Energy Module). It
is plausible that the MLP learned superficial or spurious correlations present in the training data
(e.g., that a certain Pueblo Mágico only appears with high-polarity reviews) but failed to capture
the deeper, generalizable semantic relationships. A more robust architecture might require a
more explicit interaction mechanism between label embeddings, such as a bilinear model, a small
attention module, or tensor products, before they are presented to the energy function.
• Brittleness of the Contrastive Learning Objective: The contrastive training framework, while
powerful, is highly sensitive to its configuration. The performance is critically dependent on the
quality and diversity of the negative samples. It is likely that our negative sampling strategy, while
effective for the validation set, was not sufficient to create a robust and smooth energy landscape.
The model may have learned to simply distinguish the positive sample from a set of "easy" or
synthetically generated negatives, but it was not prepared for the more subtle and challenging
distinctions required by the test set. The distribution of "hard negatives" (incorrect labels that
are semantically very close to the correct ones) was likely different and more challenging in the
test data.
• Hyperparameter Sensitivity: The training of our EBM involves several sensitive
hyperparameters, most notably the temperature τ of the InfoNCE loss and the number of negative samples
K. A temperature value that works well on the validation data might create an overly "peaked"
and sharp energy function, punishing even minor deviations and thus failing to generalize. The
model becomes too confident in the patterns seen during training and is not robust to the natural
variations in the test set.</p>
        <p>These insights suggest that while the EBM framework is theoretically potent, its practical application
requires careful consideration of these factors. Future work should focus on developing more robust
label interaction architectures, implementing more adaptive and "hard" negative sampling strategies
during training, and employing more rigorous regularization techniques to prevent the model from
overfitting to spurious correlations in the training data.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Discussion and Limitations</title>
        <p>The experimental results strongly suggest that for complex, structured classification tasks, moving
beyond conditionally independent predictions is crucial. Our EBM provides a principled way to learn
the "rules" of a coherent label set directly from data. The significant leap in EMR indicates that this
approach helps eliminate combinations of predictions that, while individually plausible, are contextually
inconsistent as a whole.</p>
        <p>However, we must acknowledge the limitations and trade-offs of our approach:
• Computational Cost: The primary drawback of our method is the computational expense
at inference time. While the exhaustive search is feasible for this task’s output space (≈ 2,655
combinations), it is significantly slower than a single forward pass in a multi-head model. This
trade-off between accuracy and speed is a critical consideration for real-world deployment.
• Sensitivity to Negative Sampling: The performance of the EBM during training is sensitive to
the strategy used for generating negative samples. A poorly designed sampling strategy could
lead to a suboptimal energy landscape.
• Dependence on Encoder Quality: The EBM’s ability to measure compatibility is fundamentally
dependent on the quality of the representations h_x and h_y. Any information lost or misinterpreted
by the BETO encoder cannot be recovered by the energy module.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <sec id="sec-6-1">
        <title>6.1. Conclusions</title>
        <p>In this work, we addressed the multi-aspect classification of Spanish tourism reviews, moving beyond
standard models that assume conditional independence between output labels. We argued that this
assumption is a fundamental limitation, leading to incoherent predictions. To overcome this, we
proposed and implemented an Energy-Based Model (EBM), a structured prediction framework designed
to learn a global compatibility function over the entire label space. The core idea was to train a model
that explicitly reasons about the coherence of a full set of labels (polarity, type, location) in
relation to a review’s content.</p>
        <p>Our initial experiments on the validation set supported this hypothesis, indicating that the EBM
was capable of producing more holistically accurate predictions than strong multi-task baselines, as
measured by the Exact Match Ratio (EMR). However, the transition to the official test set revealed
significant generalization challenges, with a substantial drop in performance.</p>
        <p>This leads us to a critical conclusion: while the EBM framework is theoretically elegant and promising
for capturing output dependencies, its practical application is non-trivial and fraught with challenges.
The complexity of the contrastive training objective, the design of the label interaction architecture,
and the sensitivity to hyperparameter tuning collectively create a high risk of overfitting. Our results
highlight that a sophisticated model architecture does not guarantee robust generalization, especially
when faced with the subtle distributional shifts between training and unseen data. The proposed
EBM, in its current form, learned patterns specific to the training data but failed to capture the more
fundamental, generalizable semantic rules governing label coherence.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Future Work</title>
        <p>The insights gained from our model’s underperformance on the test set clearly illuminate several
compelling directions for future research. Addressing these challenges is key to unlocking the full
potential of energy-based approaches for this and similar tasks. We identify the following priorities:
1. More Sophisticated Label Interaction Architectures: The simple concatenation of label
embeddings proved to be a significant limitation. Future work should explore more expressive
interaction mechanisms to model the relationships between labels before they are combined
with the text representation. This could include using bilinear models, dedicated cross-attention
layers between label embeddings, or tensor products to create a richer, more structured joint
representation h_y.
2. Advanced Negative Sampling Strategies: The reliance on a fixed, random strategy for negative
sampling is a likely cause of brittleness. A crucial next step is to implement hard negative
mining, where the model is adversarially challenged with "difficult" negatives, those that are
incorrect but have low energy according to the current model state. This forces the model to
refine its decision boundaries in more critical regions of the energy landscape (a minimal
sketch of this idea follows the list below).
3. Regularization and Training Stability: To combat overfitting and improve generalization,
more advanced regularization techniques are needed. This could involve applying structured
dropout within the energy module, using weight decay, or exploring alternative, potentially
more stable, loss functions beyond InfoNCE. Furthermore, investigating techniques to smooth
the energy landscape could prevent the model from becoming overly confident in spurious
correlations.
4. Efficient Inference for Scalability: While our current problem allowed for exhaustive search,
this is not a scalable solution. Future research should explore approximate inference methods,
such as beam search or gradient-based optimization (e.g., Langevin dynamics), to make this
framework applicable to problems with much larger, combinatorial output spaces.</p>
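        <p>To illustrate the second point, the sketch below shows one possible form of hard negative mining under the hypothetical interfaces used in the earlier sketches: every incorrect label tuple is scored with the current model and the lowest-energy (most confusable) configurations are kept as negatives for the next update.</p>
        <preformat>
import torch

@torch.no_grad()
def mine_hard_negatives(model, input_ids, attention_mask, y_pos, candidates, k=8):
    """Keep the k incorrect label tuples to which the current model assigns the lowest energy."""
    h_x = model.encode_text(input_ids, attention_mask)                  # (1, d_bert)
    pol, site_type, pm = (torch.tensor(c) for c in zip(*candidates))
    h_y = model.encode_labels(pol, site_type, pm)
    energies = model.mlp(
        torch.cat([h_x.expand(len(candidates), -1), h_y], dim=-1)
    ).squeeze(-1)
    # Sort every configuration by energy and drop the ground truth itself.
    order = energies.argsort().tolist()
    hard = [candidates[i] for i in order if candidates[i] != y_pos]
    return hard[:k]
        </preformat>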
        <p>By systematically addressing these architectural, training, and inference challenges, we believe that
energy-based models can evolve into powerful and robust tools for a wide range of structured prediction
tasks in Natural Language Processing.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>We declare that the present manuscript has been written entirely by the authors and that no generative
artificial intelligence tools were used in its preparation, drafting, or editing.</p>
      <p>[28] Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, F. Huang, et al., A tutorial on energy-based learning,
Predicting structured data 1 (2006).
[29] Y. Deng, A. Bakhtin, M. Ott, A. Szlam, M. Ranzato, Residual energy-based models for text
generation, arXiv preprint arXiv:2004.11714 (2020).
[30] D. Belanger, A. McCallum, Structured prediction energy networks, in: International Conference
on Machine Learning, PMLR, 2016, pp. 983–992.
[31] A. v. d. Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, arXiv
preprint arXiv:1807.03748 (2018).
[32] M. Gutmann, A. Hyvärinen, Noise-contrastive estimation: A new estimation principle for
unnormalized statistical models, in: Proceedings of the thirteenth international conference on artificial
intelligence and statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 297–304.
[33] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, L. Bustio-Martínez,
V. Herrera-Semenets, Overview of rest-mex at iberlef 2025: Researching sentiment evaluation in
text for mexican magical towns, volume 75, 2025.
[34] J. Á. González-Barba, L. Chiruzzo, S. M. Jiménez-Zafra, Overview of IberLEF 2025: Natural
Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the
Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the
Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS. org, 2025.
[35] M. Á. Álvarez-Carmona, R. Aranda, S. Arce-Cárdenas, D. Fajardo-Delgado, R. Guerrero-Rodríguez,
A. P. López-Monroy, J. Martínez-Miranda, H. Pérez-Espinosa, A. Rodríguez-González, Overview
of rest-mex at iberlef 2021: Recommendation system for text mexican tourism, Procesamiento del
Lenguaje Natural 67 (2021). doi:https://doi.org/10.26342/2021-67-14.
[36] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, D. Fajardo-Delgado,
R. Guerrero-Rodríguez, L. Bustio-Martínez, Overview of rest-mex at iberlef 2022:
Recommendation system, sentiment analysis and covid semaphore prediction for mexican tourist texts,
Procesamiento del Lenguaje Natural 69 (2022).
[37] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, L. Bustio-Martínez,
V. Muñis-Sánchez, A. P. Pastor-López, F. Sánchez-Vega, Overview of rest-mex at iberlef 2023:
Research on sentiment analysis task for mexican tourist texts, Procesamiento del Lenguaje Natural
71 (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Pérez-Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Flores-Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Álvarez-García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. d. l. C.</given-names>
            <surname>Del Río</surname>
          </string-name>
          , et al.,
          <article-title>Analysis of the competitiveness of the magical towns of mexico as tourist destinations, in: Innovation and Sustainability in Governments and Companies: A Perspective to the New Realities</article-title>
          , River Publishers,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Núñez Camarena</surname>
          </string-name>
          , Los pueblos mágicos de méxico:
          <article-title>mecanismo de la sectur para poner en valor el territorio</article-title>
          , in: VIII Seminario Internacional de Investigación en Urbanismo, BarcelonaBalneário Camboriú,
          <year>Junio 2016</year>
          ,
          <article-title>Departament d'Urbanisme i Ordenació del Territori</article-title>
          .
          <source>Universitat Politècnica</source>
          . . . ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Á</surname>
          </string-name>
          . E. Acosta,
          <string-name>
            <given-names>R. Y. V.</given-names>
            <surname>Ochoa</surname>
          </string-name>
          , El estudio de los pueblos mágicos.
          <source>una revisión a casi 20 años de la implementación del programa</source>
          ,
          <source>Dimensiones turísticas 5</source>
          (
          <year>2021</year>
          )
          <fpage>9</fpage>
          -
          <lpage>38</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M. V.</given-names>
            <surname>Hernández</surname>
          </string-name>
          ,
          <article-title>Efectos socioeconómicos del programa pueblos mágicos en méxico: Un análisis a partir de la evaluación normativa y académica</article-title>
          ,
          <source>Iberoforum. Revista de Ciencias Sociales</source>
          <volume>2</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Levi</surname>
          </string-name>
          ,
          <article-title>Las territorialidades del turismo: el caso de los pueblos mágicos en méxico</article-title>
          ,
          <source>Ateliê Geográfico</source>
          <volume>12</volume>
          (
          <year>2018</year>
          )
          <fpage>6</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Hidalgo del Toro</surname>
          </string-name>
          , et al.,
          <article-title>La reputación online de los destinos turísticos a través de tripadvisor</article-title>
          . (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Filieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Acikgoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ndou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <article-title>Is tripadvisor still relevant? the influence of review credibility, review usefulness, and ease of use on consumers' continuance intention</article-title>
          ,
          <source>International Journal of Contemporary Hospitality Management</source>
          <volume>33</volume>
          (
          <year>2021</year>
          )
          <fpage>199</fpage>
          -
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Bonilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M. S.</given-names>
            <surname>Soto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. d. l. L. R.</given-names>
            <surname>Garza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Nunez</surname>
          </string-name>
          , A. d. l. P. de Leon,
          <string-name>
            <given-names>A. S. B.</given-names>
            <surname>Quezada</surname>
          </string-name>
          ,
          <article-title>Tripadvisor: A platform that allows to explore experiences and opinions of travelers from the city of saltillo, coahuila, mexico tripadvisor plataforma que permite explorar experiencias</article-title>
          y opiniones de viajeros de la ciudad de saltillo, coahuila mexico,
          <source>Revista Internacional Administracion &amp; Finanzas</source>
          <volume>10</volume>
          (
          <year>2017</year>
          )
          <fpage>67</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>