<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Analysis of Dialectal and Noisy Spanish Text for Sentiment Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mariuxi del Carmen Toapanta-Bernabé</string-name>
          <email>mariuxi.toapantab@ug.edu.ec</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Ángel García-Cumbreras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Alfonso Ureña-López</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karen Gabriela Bajaña-Bastidas</string-name>
          <email>karen.bajanaba@ug.edu.ec</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sairamy Lakshmy Urgiles-Manzano</string-name>
          <email>sairamy.urgilesm@ug.edu.ec</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, SINAI, CEATIC, Universidad de Jaén</institution>
          ,
          <addr-line>23071, Jaén</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad de Guayaquil</institution>
          ,
          <addr-line>090514, Guayas</addr-line>
          ,
          <country country="EC">Ecuador</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Dialectal and orthographically noisy user-generated text in Spanish presents significant challenges for sentiment analysis, due to variant spellings, missing diacritics, emojis, and informal expressions that degrade the performance of standard classifiers. In this paper, we describe SINAI-UGPLN's submission to the REST-Mex 2025 Sentiment Analysis task at IberLEF, introducing a comprehensive multilingual preprocessing pipeline that includes Unicode normalization, emoji conversion to textual tokens, and orthographic cleaning to produce a balanced training corpus of approximately 353,650 examples across six major dialects. We fine-tune two transformer-based models, BETO and BETO-Emotion, using stratified oversampling and class-weighted loss, and perform extensive ablation studies to quantify the impact of data balancing and emoji normalization. Our best model, BETO-Emotion, achieves 74.86% accuracy and 0.6768 macro‐F1 on the validation split but experiences a substantial drop on the official Codabench test set (39.81% accuracy, 0.1915 macro‐F1), underscoring a pronounced generalization gap under dialectal noise. We analyze common error patterns, such as confusions between intermediate sentiment classes, and propose future directions including adversarial dialectal augmentation, dialect-specific embeddings, and improved tokenization schemes to enhance robustness.</p>
      </abstract>
      <kwd-group>
        <kwd>text restoration</kwd>
        <kwd>dialectal Spanish</kwd>
        <kwd>sentiment classification</kwd>
        <kwd>evaluation metrics</kwd>
        <kwd>Mexican Magical Towns</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sentiment analysis in Spanish faces significant challenges due to dialectal variation, orthographic noise,
and informal expressions in user-generated content [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Social media posts and customer reviews
often include non-standard spellings, missing diacritics, and emojis, which can degrade the performance
of standard classification models [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. The REST-Mex 2025 Sentiment Analysis task, organized
within IberLEF, builds upon prior editions (e.g., REST-Mex 2021 [6], REST-Mex 2022 [7, 8] and REST-Mex
2023 [9]) by benchmarking systems on noisy and dialectal Spanish text, requiring classification into six
sentiment categories—from “Muy malo” to “Muy bueno,” plus an “Otro” class.
      </p>
      <p>Dialectal phenomena—such as phonetic spellings and region-specific slang—can significantly alter
semantic content. Previous IberLEF editions have demonstrated that dialect-aware preprocessing and
fine-tuning enhance accuracy; however, many systems continue to struggle with intermediate sentiment
classes and underrepresented dialects [10]. The IberLEF 2025 overview paper [11] highlights these
challenges across Spanish and other Iberian languages, and the REST-Mex 2025 overview [12] details
this year’s task setup.</p>
      <p>CEUR Workshop Proceedings, ISSN 1613-0073</p>
      <p>
        Our goal is to develop a scalable, dialect-robust sentiment classifier that outperforms existing baselines
on the REST-Mex 2025 corpus. We propose:
1. A preprocessing pipeline that normalizes encoding (Latin-1 to UTF-8), converts emojis to textual
tokens, and cleans orthographic variants across six dialects.
2. A comparison of two transformer-based architectures—bert-base-spanish-cased (BETO) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
and BETO-Emotion—fine-tuned on the cleaned and balanced dataset.
3. An evaluation of data balancing strategies (random oversampling and class-weighted loss) and
emoji normalization via ablation studies.
4. Submission to the official Codabench leaderboard under the team name UGPLN for top placement.
Our main contributions are:
• A multilingual preprocessing workflow that normalizes raw reviews, converts emojis into textual
tokens, and standardizes orthographic variants (e.g., transforming “y/o” to “y o”).
• A balanced training strategy combining random oversampling with class-weighted loss, yielding
a training set of approximately 353 650 examples and mitigating minority-class bias.
• A comparative study of BETO [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and BETO-Emotion, demonstrating that BETO-Emotion achieves
higher Macro-F1 on intermediate sentiment classes.
• An empirical analysis on the REST-Mex 2025 test set showing that our best model attains 39.81%
accuracy and 0.1915 Macro-F1, highlighting a significant generalization gap under dialectal noise.
      </p>
      <p>The remainder of this paper is organized as follows. Section 2 reviews prior work on Spanish sentiment
analysis and noisy text processing. Section 3 describes the REST-Mex 2025 corpus and evaluation metrics.
Section 4 details our preprocessing, data balancing, and model fine-tuning procedures. Section 5 presents
experimental results, ablation studies, and error analysis. Section 6 discusses our findings, and Section 7
concludes and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Early work on Spanish noisy text includes Cañete et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], who introduce BETO, a BERT model
pre-trained on Spanish corpora and release evaluation data tailored to informal text. De la Rosa et al.
[13] propose BERTIN, pre-trained via perplexity sampling, demonstrating improved performance on
downstream tasks with limited resources. Pérez et al. [4] present RoBERTuito, a RoBERTa-based model
fine-tuned on Spanish social media, showing robust results on noisy corpora.
      </p>
      <p>Fernández et al. [10] analyze dialectal variations in Spanish sentiment corpora and propose robustness
benchmarks for regional orthographic differences.</p>
      <p>The REST-Mex 2023 overview [9] describes the Sentiment Analysis task for Mexican tourist texts
under IberLEF 2023. The REST-Mex 2025 overview [12] details this year’s task setup, and the broader
IberLEF 2025 challenges are summarized in González-Barba et al. [11].</p>
      <p>
        Our work builds on these foundations by integrating Unicode normalization, emoji conversion, and
orthographic cleaning across six dialects; comparing fine-tuning of BETO [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and BETO-Emotion with
class-weighted loss and oversampling; and conducting ablation studies to quantify the contributions of
data balancing and emoji normalization.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Description of the Task and Dataset</title>
      <sec id="sec-3-1">
        <title>3.1. Overview of the REST-Mex 2025 Sentiment Analysis Task</title>
        <p>The REST-Mex 2025 Sentiment Analysis subtask requires systems to predict one of six discrete sentiment
labels for each input review in noisy or dialectal Spanish. Specifically, given a short text containing
orthographic errors, dialect-specific forms, emojis, and informal expressions, the model must output a
label y ∈ {0, 1, 2, 3, 4, 5}, covering the scale from “Muy malo” to “Muy bueno” plus the “Otro” class.
Participants receive two files from the organizers:
1. Rest-Mex_2025_train.csv (70% of the data, 208 051 rows), containing labeled examples.
2. Rest-Mex_2025_test.xlsx (30% of the data, 89 166 rows), used for final Codabench evaluation;
its labels are concealed.</p>
        <p>During system development, Rest-Mex_2025_train.csv is further split by participants into an
internal training set (80%) and validation set (20%). The final test file is used as provided by the
organizers. The primary ranking metric is macro-averaged F1, with accuracy as a secondary tiebreaker.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Splits and Preprocessing</title>
        <p>The corpus is partitioned as follows, and Table 1 reports the number of examples per dialect for each
split:
• Organizers’ Train (70%): Rest-Mex_2025_train.csv, 208 051 examples with {id, review_text,
sentiment_label, dialect}.
• Organizers’ Test (30%): Rest-Mex_2025_test.xlsx, 89 166 examples with {id, review_text,
dialect}, labels withheld.
• Internal Train (80% of 208 051): ≈166 440 examples.</p>
        <p>• Internal Dev (20% of 208 051): ≈41 611 examples.</p>
        <p>Both splits are stratified by sentiment_label to preserve class proportions. The final test set of
89 166 examples is used unchanged for Codabench submissions.</p>
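        <p>The stratified 80/20 split described above can be sketched in plain Python. This is a minimal stand-in (e.g., for scikit-learn’s train_test_split with stratify); the function name and row format are illustrative, not taken from the original system.</p>

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key="sentiment_label", dev_frac=0.20, seed=13):
    """Split rows 80/20 while preserving per-label proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    rng = random.Random(seed)
    train, dev = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_dev = round(len(group) * dev_frac)  # 20% of this label goes to Dev
        dev.extend(group[:n_dev])
        train.extend(group[n_dev:])
    return train, dev

# Toy corpus: 100 rows over 5 equally frequent labels.
rows = [{"id": i, "sentiment_label": i % 5} for i in range(100)]
train, dev = stratified_split(rows)
```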
        <p>Each row in Rest-Mex_2025_train.csv has:
• id: Unique example identifier.
• review_text: Raw, noisy or dialectal Spanish text (may include emojis and encoding errors).
• sentiment_label: Integer {0, … , 5}.</p>
        <p>• dialect: One of {Andino, Caribeño, Centroamericano, Mexicano, Rioplatense, Chileno}.
We derive two additional columns for robust modeling:
• has_emoji_flag: Boolean flag indicating presence of any emoji.
• normalized_text: Cleaned text after encoding correction and orthographic normalization (see Section 4).</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Metrics</title>
        <p>Evaluation of REST-Mex 2025 Sentiment Analysis relies on comprehensive label-level measures to
ensure balanced performance across all six sentiment classes. We employ the following official metrics:</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Accuracy</title>
          <p>Accuracy measures the proportion of correctly predicted sentiment labels:</p>
          <p>Accuracy = (1/N) ∑_{i=1}^{N} 1(ŷ_i = y_i),
where N is the total number of test examples, y_i is the true label, and ŷ_i is the predicted label. Although
useful for overall performance, Accuracy can be misleading under class imbalance.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Precision, Recall, and F1-Score</title>
          <p>For each sentiment class c ∈ {0, …, 5}, we compute:</p>
          <p>Precision_c = TP_c / (TP_c + FP_c),
Recall_c = TP_c / (TP_c + FN_c),
F1_c = 2 · (Precision_c × Recall_c) / (Precision_c + Recall_c),
where TP_c, FP_c, and FN_c denote true positives, false positives, and false negatives for class c. Precision
assesses correctness among positive predictions, Recall reflects coverage of actual positives, and F1
balances both.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Macro-Averaged F1 (Macro-F1)</title>
          <p>Macro-F1 is the unweighted average of F1-scores across all classes:</p>
          <p>Macro-F1 = (1/6) ∑_{c=0}^{5} F1_c.
By giving equal weight to each class, Macro-F1 mitigates the dominance of majority classes and
encourages models to perform well on minority categories. This metric serves as the primary ranking
criterion on Codabench.</p>
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.4. Other Metrics</title>
          <p>Macro Precision: (1/6) ∑_{c=0}^{5} Precision_c.</p>
          <p>Macro Recall: (1/6) ∑_{c=0}^{5} Recall_c.</p>
          <p>Weighted F1: ∑_{c=0}^{5} w_c × F1_c, where w_c is the relative frequency of class c. This reflects performance weighted by class prevalence.</p>
        </sec>
        <sec id="sec-3-3-5">
          <title>3.3.5. Calculation Example</title>
          <p>To illustrate these metrics, consider a simplified confusion matrix for three classes {0, 1, 2},
where rows are true classes and columns are predicted classes. For class 1:</p>
          <p>TP_1 = 40, FP_1 = 5 + 7 = 12, FN_1 = 5 + 6 = 11,
hence</p>
          <p>Precision_1 = 40 / (40 + 12) ≈ 0.769, Recall_1 = 40 / (40 + 11) ≈ 0.784, F1_1 ≈ 0.777.
Macro-F1 would average F1 over classes 0, 1, 2. In REST-Mex 2025, Macro-F1 is computed analogously
over six sentiment classes and used as the principal ranking measure.</p>
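          <p>The computations in this example can be reproduced with a short script. The confusion matrix below is illustrative (only the class-1 counts are fixed by the example above: TP = 40, FP = 12, FN = 11), and per_class_prf is a hypothetical helper, not part of the official evaluation code.</p>

```python
def per_class_prf(cm):
    """cm[i][j] = count of examples with true class i predicted as class j."""
    n = len(cm)
    scores = []
    for c in range(n):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(n)) - tp   # column c, off-diagonal
        fn = sum(cm[c]) - tp                        # row c, off-diagonal
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores.append((p, r, f1))
    return scores

cm = [[50, 5, 3],   # rows: true class, columns: predicted class
      [5, 40, 6],   # class 1: TP = 40, FN = 5 + 6 = 11
      [2, 7, 30]]   # column 1 gives FP = 5 + 7 = 12
scores = per_class_prf(cm)
p1, r1, f1_1 = scores[1]
macro_f1 = sum(f for _, _, f in scores) / len(scores)
```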
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section describes our end-to-end pipeline, including data cleaning, balancing, feature preparation,
model architecture, and training configuration.</p>
      <sec id="sec-4-1">
        <title>4.1. Data Cleaning and Normalization</title>
        <p>To reduce noise and standardize orthographic variations across dialects, we apply the following steps
to Rest-Mex_2025_train.csv:
1. Encoding Correction: Convert from Latin-1 to UTF-8 using Python’s built-in codecs. This
fixes garbled characters (e.g., “Ã±” → “ñ”).
2. Symbol Removal: Remove residual symbols such as repeated punctuation (e.g., “!!”, “??”), leading
“=” signs, and zero-width Unicode characters. We apply a regular expression to remove unwanted
symbols.
3. Orthographic Normalization:
• Replace ambiguous constructs: "y/o" → "y o".
• Collapse multiple spaces: re.sub(r'\s+', ' ', text).
• Unicode normalization: unicodedata.normalize('NFKC', text) to unify accented
characters (e.g., “é” vs. “é”).
4. Emoji Conversion: Use the emoji Python library to replace emojis with textual descriptions.</p>
        <p>For each emoji character, we map it to its CLDR short name, prefixed by “emoji_”. For example,
we replace a smiling‐face emoji with its textual description, such as "emoji_smile".
5. Lowercasing and Trimming: Convert all alphabetic text to lowercase and strip leading/trailing
whitespace.</p>
        <p>The resulting cleaned text is stored in the new column normalized_text.</p>
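        <p>The five cleaning steps above can be summarized in a single function. This is a simplified sketch: EMOJI_MAP is a tiny stand-in for the emoji library’s CLDR short names, and the exact regular expressions of the original pipeline may differ.</p>

```python
import re
import unicodedata

# Tiny stand-in for the `emoji` library's CLDR short names ("emoji_" prefix);
# the real pipeline covers the full emoji inventory.
EMOJI_MAP = {"😀": "emoji_grinning_face", "😢": "emoji_crying_face"}

def normalize_review(text: str) -> str:
    # Step 3: Unicode normalization (unify composed/decomposed accents).
    text = unicodedata.normalize("NFKC", text)
    # Step 3: replace ambiguous constructs.
    text = text.replace("y/o", "y o")
    # Step 2: strip repeated punctuation and zero-width characters.
    text = re.sub(r"[!?]{2,}", " ", text)
    text = re.sub(r"[\u200b\u200c\u200d]", "", text)
    # Step 4: emoji conversion to textual tokens.
    for ch, token in EMOJI_MAP.items():
        text = text.replace(ch, f" {token} ")
    # Steps 3 and 5: collapse whitespace, lowercase, trim.
    text = re.sub(r"\s+", " ", text)
    return text.lower().strip()

cleaned = normalize_review("Excelente y/o buenísimo!! 😀")
```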
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Data Balancing</title>
        <p>Analysis of class frequencies in the original training set (N = 208 051) revealed moderate imbalance.
We employ a two-pronged approach: (1) random oversampling of minority classes and (2) class-weighted
cross-entropy loss during fine-tuning.</p>
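        <p>The random-oversampling half of this approach can be sketched as follows. This is a minimal illustration: the helper name and row format are ours, and the actual pipeline operates on the full 208 051-row CSV.</p>

```python
import random
from collections import Counter, defaultdict

def oversample(rows, label_key="sentiment_label", seed=13):
    """Randomly duplicate minority-class rows until every class matches the majority count."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = max(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority count.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Toy imbalanced set: 50 examples of class 0, 10 of class 1.
rows = [{"sentiment_label": 0}] * 50 + [{"sentiment_label": 1}] * 10
balanced = oversample(rows)
counts = Counter(r["sentiment_label"] for r in balanced)
```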
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Feature Preparation</title>
        <p>We prepare inputs for transformer models as follows:</p>
        <p>Tokenizer: We use AutoTokenizer from HuggingFace:
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-cased")
or, for the BETO-Emotion model:
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-Emotion")
Input Construction: For each example, we tokenize normalized_text as follows:
encoding = tokenizer(text, max_length=128, padding='max_length',
truncation=True, return_tensors='pt')
This produces input_ids and attention_mask tensors.</p>
        <p>Dialect and Emoji Handling: We do not concatenate dialect embeddings explicitly; token-level
embeddings capture context. Emoji normalization (see Section 4) ensures consistent tokenization
of emotive content.</p>
        <p>Dataset Objects: Encoded inputs and labels are wrapped into a torch.utils.data.Dataset
subclass, allowing efficient batching during training and evaluation.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Model Architecture</title>
        <p>We compare two transformer-based architectures:</p>
        <sec id="sec-4-4-1">
          <title>BETO</title>
          <p>Base Model: bert-base-spanish-cased (BETO), pre-trained on diverse Spanish corpora.
Classification Head : A linear layer mapping the [CLS] embedding (768 dimensions) to six
sentiment logits.</p>
          <p>Implementation:
model = AutoModelForSequenceClassification.from_pretrained("dccuchile/bert-base-spanish-cased", num_labels=6)</p>
        </sec>
        <sec id="sec-4-4-2">
          <title>BETO-Emotion</title>
          <p>Base Model: dccuchile/bert-base-spanish-Emotion, fine-tuned on Spanish social media
emotion data.</p>
          <p>Classification Head : Same as BETO.</p>
          <p>Implementation:
model = AutoModelForSequenceClassification.from_pretrained("dccuchile/bert-base-spanish-Emotion", num_labels=6)</p>
          <p>In both cases, the transformer’s encoder weights are fine-tuned; no additional CRF or LSTM layers
are added, keeping the architecture simple and efficient.</p>
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Training Configuration</title>
        <p>Training is performed on a single NVIDIA V100 GPU. The steps are:
1. Train/Validation Split: Further split the balanced data ( balanced) into:</p>
        <p>Training: 80% (≈282 920 examples).</p>
        <p>Validation: 20% (≈70 730 examples).</p>
        <p>Stratification on sentiment_label preserves class proportions.
2. Optimizer and Scheduler:
optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01).
Total training steps: T = ⌈N_train / batch_size⌉ × epochs.</p>
        <p>Warmup: 10% of T.
scheduler = get_scheduler("linear", optimizer=optimizer,
num_warmup_steps=int(0.1*T), num_training_steps=T).
3. Loss Function:</p>
        <p>Class weights w_c are precomputed as
w_c = N_total / (6 × N_c),
where N_c is the number of examples of class c in the balanced training set, and applied via
criterion = CrossEntropyLoss(weight=class_weights).</p>
        <p>4. Batching and Epochs:
batch_size = 16.
epochs = 5.</p>
        <p>Gradient clipping: clip_norm = 1.0.
5. Validation and Checkpointing:
a) After each epoch, evaluate on validation set: compute Macro-F1 and Accuracy.
b) Save model checkpoint if validation Macro-F1 improves.
6. Inference on Test Set: Load the best checkpoint (highest validation Macro-F1) and tokenize
Rest-Mex_2025_test.xlsx examples with identical preprocessing. Use:</p>
        <p>model.eval() together with a torch.no_grad() context to predict labels in batches of 16.</p>
        <p>7. Submission: Generate a CSV with columns {id, predicted_label} and submit to Codabench.</p>
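        <p>The step-count and class-weight formulas in this configuration can be checked numerically. The sketch below assumes the values stated above (batch size 16, 5 epochs, 10% warmup, ≈282 920 balanced training examples); the helper names are illustrative.</p>

```python
import math
from collections import Counter

def class_weights(labels, num_classes=6):
    """w_c = N_total / (num_classes * N_c), as in step 3 above."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * counts[c]) for c in range(num_classes)]

def schedule(n_train, batch_size=16, epochs=5, warmup_frac=0.10):
    """Total optimizer steps T = ceil(N_train / batch_size) * epochs, plus 10% linear warmup."""
    total_steps = math.ceil(n_train / batch_size) * epochs
    return total_steps, int(warmup_frac * total_steps)

T, warmup = schedule(282_920)  # balanced training split from step 1
# Perfectly uniform toy labels give weight 1.0 for every class.
weights = class_weights([c for c in range(6) for _ in range(10)])
```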
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments and Results</title>
      <p>This section presents the experimental setup, validation (Dev) results (Subsections 5.1–5.3), ablation
studies (5.4), error analysis (5.5), and final test performance on Codabench (5.6) for our SINAI and
UGPLN submissions.</p>
      <sec id="sec-5-1">
        <title>5.1. Dev Performance: BETO</title>
        <p>We first evaluate bert-base-spanish-cased (BETO) on the held‐out validation split (20 % of the
balanced training set, N_val = 70 730). Table 2 reports per‐class precision, recall, F1‐score, support,
overall accuracy, and Macro‐F1. BETO attains an overall Dev accuracy of 75.26 % and a Macro‐F1 of
0.7136. Classes 0 and 3 (“Muy malo” and “Bueno”) achieve the highest F1‐scores, while intermediate
classes (1 and 2) remain more challenging.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Dev Performance: BETO‐Emotion</title>
        <p>Next, we fine‐tune dccuchile/bert-base-spanish-Emotion (“BETO‐Emotion”) under identical
hyperparameters. Table 3 reports its Dev metrics. BETO‐Emotion obtains a Dev accuracy of 74.86 % and
Macro‐F1 of 0.6768. It improves on class 1 (“Malo”) relative to BETO, though its overall Macro‐F1 is
slightly lower.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. BETO vs. BETO‐Emotion Comparison</title>
        <p>Overall, BETO’s Macro‐F1 (0.7136) is higher than BETO‐Emotion’s (0.6768).</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Ablation Studies</title>
        <p>To measure the impact of data balancing and emoji normalization on validation performance, we
conduct three configurations with BETO-Emotion:</p>
        <p>Full Pipeline (oversampling + emoji normalization): Macro-F1 = 0.6768.</p>
        <p>Without Oversampling (weighted loss only): Macro-F1 = 0.6932 (+0.0164).</p>
        <p>Without Emoji Normalization (oversampling only): Macro-F1 = 0.7005 (+0.0237).</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Error Analysis</title>
        <p>Figure 2 shows the confusion matrix for BETO‐Emotion on Dev. The most frequent confusions occur
between “Regular” (2) and “Malo” (1), reflecting challenges in intermediate sentiment detection.</p>
        <p>Table 5 provides representative Dev error cases: These errors stem from residual orthographic noise
(missing diacritics), token‐splitting artifacts, and ambiguous intermediate sentiment expressions.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.6. Test Performance (Codabench)</title>
        <p>We evaluate our final models on the official REST-Mex 2025 test set (N_test = 89,166) using the Codabench
leaderboard. Table 6 reports Accuracy, Macro‐F1 (Polarity), and the Rank (Macro‐F1) for each submission.
SINAI-UGPLN’s best BETO‐Emotion run (UGPLN_0) and BETO run (UGPLN_2) are listed there. Neither
run achieved a place in the official ranking (“HM” indicates Honorable Mention). These results confirm
that, despite strong Dev performance, both BETO and BETO‐Emotion struggled to generalize to the
noisy test set under the official Codabench evaluation.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Representative Dev errors (Table 5): “La película estuvo pesima” (true 0, predicted 1: dropped accent; cry emoji present); “Comida re gular, nada especial” (2 → 1: unintended space splits the token); “Me encanto muy bn” (4 → 3: missing accent; “bn” abbreviation).</p>
      <p>The Dev results and Codabench rankings reveal several insights about our fine‐tuning strategies and
preprocessing pipeline. First, while BETO achieved a Dev accuracy of 75.26% and Macro‐F1 of 0.7136,
BETO‐Emotion attained a slightly lower Dev accuracy (74.86%) and Macro‐F1 (0.6768). This indicates
that the emotion‐specialized pretraining benefited class‐specific detection of moderate negativity (class
1 “Malo”)—BETO‐Emotion’s F1 for class 1 (0.5302) exceeded BETO’s (0.5042)—but at the expense of
overall Macro‐F1, as BETO maintained stronger performance on extreme sentiment classes.</p>
      <p>Our ablation studies further demonstrate the delicate trade‐offs in data handling. Removing
oversampling (i.e., using only class weights in the loss) increased Dev Macro‐F1 from 0.6768 to 0.6932, suggesting
that oversampling introduced redundancy or noise that degraded validation performance. Conversely,
omitting emoji normalization (while retaining oversampled data) increased Dev Macro‐F1 to 0.7005,
suggesting that converting emojis to textual tokens may have inadvertently altered the underlying
sentiment cues. In both cases, the full pipeline (oversampling plus emoji normalization) performed
worse than either single‐factor ablation, indicating an interaction effect where combining both strategies
did not yield additive gains.</p>
      <p>The error analysis on Dev (Figure 2) shows persistent confusions between “Regular” (2) and “Malo”
(1). Many “Malo” examples lacked clear negative markers or contained mixed sentiment, causing
the model to favor class 2. Orthographic noise—missing diacritics and token splits—also contributed
to misclassification (e.g., “pésima” → “pesima” and “comida re gular”). These errors underscore the
challenge of intermediate sentiment detection in dialectal Spanish.</p>
      <p>On the official Codabench test set, both models suffered a substantial performance drop.
BETO‐Emotion achieved only 39.81% accuracy and Macro‐F1 of 0.1915 (Honorable Mention), while BETO
reached 16.28% accuracy and Macro‐F1 of 0.1027 (Honorable Mention). This sharp decline from Dev
performance highlights a significant generalization gap. Possible causes include:</p>
      <p>Data distribution shift: The test set likely contains dialectal variants or noise patterns not well
represented in the Dev split, causing erroneous predictions under out‐of‐distribution conditions.
Over‐reliance on surface cues: Both models may have learned spurious correlations (e.g.,
certain misspellings or emoji patterns) that did not transfer to the unseen test examples.
Insufficient dialect coverage: Although we balanced across six major dialects, some rare or
extreme dialectal forms in the test set may not have been adequately captured by our synthetic
or oversampled data.</p>
      <p>To close this gap, future work should explore:
1. Adversarial data augmentation: Automatically generate dialectal variants that mimic test‐time
noise, using larger generative models (e.g., Mistral‐7B‐Instruct) to expand the synthetic pool.
2. Dialect‐specific embeddings : Incorporate learned dialect embeddings or adapters to help the
model distinguish orthographic patterns unique to each region.
3. Robust tokenization: Employ subword vocabularies that better capture accent and diacritic
variations, or use byte‐level encoders to minimize the impact of missing accents.
4. Curriculum learning: Start training on cleaned, high‐quality examples, then gradually introduce
more noisy and dialectal inputs to improve generalization.</p>
      <p>In summary, although BETO-Emotion and our preprocessing pipeline delivered competitive Dev results,
the low test performance highlights the difficulty of real-world dialectal sentiment analysis. Addressing
distributional shifts and refining tokenization strategies will be crucial for closing the gap between
validation and test performance in future iterations.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>In this work, we presented SINAI-UGPLN’s fine‐tuning strategies for the REST‐Mex 2025 Sentiment
Analysis task. Our multilingual preprocessing pipeline—including Unicode normalization, emoji
conversion, and orthographic cleaning—combined with class‐weighted loss and oversampling, yielded
strong Dev performance: BETO achieved 75.26 % accuracy and 0.7136 Macro‐F1, while BETO‐Emotion
reached 74.86 % accuracy and 0.6768 Macro‐F1. Ablation studies revealed that neither oversampling
nor emoji normalization alone consistently improved results when combined, highlighting complex
interactions between data balancing and tokenization. Error analysis identified persistent confusions
between intermediate classes (“Malo” vs. “Regular”) due to residual orthographic noise. On the official
Codabench test set, both BETO‐Emotion (39.81 % accuracy, 0.1915 Macro‐F1) and BETO (16.28 %
accuracy, 0.1027 Macro‐F1) fell short of Dev performance, illustrating a significant generalization gap under
dialectal noise.</p>
      <p>Future work will focus on closing this gap by (1) generating adversarial dialectal variants with large
generative models to simulate test‐time noise better; (2) integrating dialect‐specific embeddings or
adapter modules to capture region‐specific orthographic patterns; (3) adopting byte‐level or subword
tokenization schemes that preserve accent and diacritic information; and (4) applying curriculum
learning to introduce noise during training gradually. We also plan to explore active learning approaches
to identify underrepresented dialectal forms in the test distribution and to incorporate external lexicons
of regional slang. By refining tokenization and augmenting data with realistic dialectal noise, we aim to
improve robustness and narrow the gap between validation and test performance in future REST‐Mex
iterations.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and
Plan de Recuperación, Transformación y Resiliencia – Funded by EU – NextGenerationEU within
the framework of the project Desarrollo Modelos ALIA. This work has also been partially supported
by Project CONSENSO (PID2021-122263OB-C21), Project MODERATES (TED2021-130145B-I00) and
Project SocialTox (PDC2022-133146-C21) funded by MCIN/AEI/10.13039/501100011033 and by the
European Union NextGenerationEU/PRTR. Moreover, this research is part of the proposal presented
at the Call for Research Project Proposals of the Internal Competitive Fund (FCI) 2023, which was
approved on September 14, 2023 (Resolution No. R-CSU-UG-SE34-313-14-09-2023) by the Consejo
Superior Universitario of the Universidad de Guayaquil.</p>
      <p>The authors declare that they have contributed equally and share authorship roles for this
publication.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used GPT-4 and Grammarly to check grammar and
spelling. After using these tools and services, the authors reviewed and edited the content as needed
and took full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-10">
      <title>References</title>
      <p>[3] Conference (LREC’20), Workshop on Language Resources and Evaluation for NLP, European Language Resources Association (ELRA), Marseille, France, 2020, pp. 5230–5239.
[4] J. M. Pérez, D. A. Furman, L. A. Alemany, F. Luque, R. Cordeiro, RoBERTuito: A RoBERTa-based model for social media text in Spanish, arXiv preprint arXiv:2111.09453 (2021). URL: https://arxiv.org/abs/2111.09453.
[5] R. Guerrero-Rodriguez, M. A. Álvarez Carmona, R. Aranda, A. P. López-Monroy, Studying online travel reviews related to tourist attractions using NLP methods: the case of Guanajuato, Mexico, Current Issues in Tourism 26 (2023) 289–304. doi:10.1080/13683500.2021.2007227.
[6] M. Á. Álvarez-Carmona, R. Aranda, S. Arce-Cárdenas, D. Fajardo-Delgado, R. Guerrero-Rodríguez, A. P. López-Monroy, J. Martínez-Miranda, H. Pérez-Espinosa, A. Rodríguez-González, Overview of REST-Mex at IberLEF 2021: Recommendation system for text Mexican tourism, Procesamiento del Lenguaje Natural 67 (2021). doi:10.26342/2021-67-14.
[7] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, D. Fajardo-Delgado, R. Guerrero-Rodríguez, L. Bustio-Martínez, Overview of REST-Mex at IberLEF 2022: Recommendation system, sentiment analysis and COVID semaphore prediction for Mexican tourist texts, Procesamiento del Lenguaje Natural 69 (2022) 289–299.
[8] M. Á. Álvarez-Carmona, R. Aranda, R. Guerrero-Rodríguez, A. Y. Rodríguez-González, A. P. López-Monroy, A combination of sentiment analysis systems for the study of online travel reviews: Many heads are better than one, Computación y Sistemas 26 (2022) 977–987.
[9] M. A. Álvarez-Carmona, A. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, V. Muñiz-Sánchez, A. Pastor López-Monroy, F. Sánchez-Vega, L. Bustio-Martínez, Overview of REST-Mex at IberLEF 2023: Research on sentiment analysis task for Mexican tourist texts, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2023), co-located with the 39th Conference of the Spanish Society for Natural Language Processing (SEPLN 2023), CEUR-WS.org, 2023, pp. 425–436.
[10] M. Fernández, J. López, E. Ruiz, Dialectal variations in Spanish sentiment corpora: Challenges and benchmarks, in: Proceedings of IberLEF 2020, volume 2798 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 89–102.
[11] J. Á. González-Barba, L. Chiruzzo, S. M. Jiménez-Zafra, Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org, 2025.
[12] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, L. Bustio-Martínez, V. Herrera-Semenets, Overview of REST-Mex at IberLEF 2025: Researching sentiment evaluation in text for Mexican magical towns, volume 75, 2025.
[13] J. De la Rosa, E. G. Ponferrada, M. Romero, P. Villegas, P. González de Prado Salas, M. Grandury, BERTIN: Efficient pre-training of a Spanish language model using perplexity sampling, Procesamiento del Lenguaje Natural 68 (2022) 13–23. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/issue/view/285.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Álvarez-Carmona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Rodríguez-Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fajardo-Delgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Sánchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pérez-Espinosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martínez-Miranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guerrero-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bustio-Martínez</surname>
          </string-name>
          ,
          <article-title>Ángel DíazPacheco, Natural language processing applied to tourism research: A systematic review and future research directions</article-title>
          ,
          <source>Journal of King Saud University - Computer and Information Sciences</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>10125</fpage>
          -
          <lpage>10144</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1319157822003615. doi:https://doi.org/10.1016/j.jksuci.
          <year>2022</year>
          .
          <volume>10</volume>
          .010.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guerrero-Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Álvarez-Carmona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aranda</surname>
          </string-name>
          , et al.,
          <article-title>Big data analytics of online news to explore destination image using a comprehensive deep-learning approach: a case from mexico</article-title>
          ,
          <source>Information Technology &amp; Tourism</source>
          <volume>26</volume>
          (
          <year>2024</year>
          )
          <fpage>147</fpage>
          -
          <lpage>182</lpage>
          . URL: https://doi.org/10.1007/ s40558-023-00278-5. doi:
          <volume>10</volume>
          .1007/s40558- 023- 00278- 5.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañette</surname>
          </string-name>
          , G. Chacón,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chishti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gutiérrez</surname>
          </string-name>
          , G. Pablo,
          <article-title>Spanish pretrained bert model and evaluation data</article-title>
          ,
          <source>in: Proceedings of the Thirteenth Language Resources and Evaluation</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>