<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Opinions in Spanish?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Sandoval</string-name>
          <email>fsandoval@algiedi.com.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Algiedi Solutions</institution>
          ,
          <addr-line>Cholula, Mexico, 72760</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper investigates the feasibility of using smaller amounts of data to accurately classify tourist opinions in Spanish. The classification of tourist reviews is important for businesses in the tourism industry, but data collection and processing can be time-consuming and costly. To test the effectiveness of smaller datasets, we conducted experiments using a machine learning approach to classify polarity, type, and country in a dataset of tourist reviews in Spanish. Our results show that it is possible to achieve good levels of accuracy with smaller datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>Rest-Mex</kwd>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Few opinions</kwd>
        <kwd>data imbalance</kwd>
        <kwd>Spanish opinions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Tourism is a key sector for many countries, providing employment opportunities and
contributing significantly to their economy [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2, 3</xref>
        ]. With the advent of social media and online review
platforms, tourists are increasingly sharing their experiences and opinions about destinations,
attractions, and services [4, 5]. This wealth of information presents a valuable opportunity
for tourism industry stakeholders to gain insights into customer preferences, improve service
quality, and enhance the overall tourist experience [6, 7].
      </p>
      <p>However, the large volume of user-generated content (UGC) presents a challenge for
extracting useful insights from it [8]. Natural language processing (NLP) techniques have been widely
used to classify and analyze UGC in various languages, including Spanish [9]. Traditionally,
NLP models require large amounts of data to achieve high accuracy in classification tasks.
However, collecting and annotating large amounts of data can be time-consuming and expensive.</p>
      <p>Our findings indicate that it is possible to achieve acceptable accuracy in classifying tourist
opinions in Spanish even with less data. The results have implications for small businesses
and organizations with limited resources, as they can still benefit from the insights gained from
analyzing UGC, without having to invest in large-scale data collection and annotation efforts.</p>
      <p>In this paper, we describe our experiment, present our findings, and discuss the implications
of our results for tourism industry stakeholders. We also discuss the limitations of our study
and suggest directions for future research. Overall, our paper contributes to the growing body
of research on NLP techniques for analyzing UGC in the tourism industry, and highlights the
potential for using smaller datasets to achieve meaningful insights.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Rest-Mex 2023 Corpus</title>
      <p>The organizers of Rest-Mex 2023 [15] curated a training collection comprising 251,702 opinions
extracted from TripAdvisor. The dataset includes three classification labels: polarity, type, and
country.</p>
      <p>The polarity classification encompasses five classes, ranging from class 1 representing the
most negative polarity to class 5 denoting the most positive polarity. Table 1 displays the
distribution of these classes, revealing an evident class imbalance.</p>
      <p>The classification of the type of place includes three classes: Attractive, Hotel, and Restaurant.
Table 2 showcases the distribution of these classes. Although the imbalance is not as pronounced
as observed in polarity, the table indicates some degree of imbalance.</p>
      <p>The classification based on the country of origin of the visited place encompasses three
classes: Mexico, Cuba, and Colombia. Table 3 illustrates the distribution of these classes.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Simple instance selection</title>
      <p>In this study, we investigate the impact of using reduced amounts of data on the classification
of tourist opinions in Spanish. Specifically, we aim to determine whether randomly selecting
100, 500, 1000, and 5000 opinions from each polarity class can yield a balanced database while
maintaining classification performance.</p>
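      <p>The selection step described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name <monospace>balanced_subsample</monospace>, the (text, polarity) record layout, the fixed seed, and the toy data are all assumptions made for the example.</p>
      <preformat>
```python
import random
from collections import defaultdict


def balanced_subsample(records, n_per_class, seed=0):
    """Randomly keep up to n_per_class records for each polarity label.

    records is an iterable of (text, polarity) pairs (hypothetical layout).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_class = defaultdict(list)
    for text, polarity in records:
        by_class[polarity].append((text, polarity))

    sample = []
    for polarity in sorted(by_class):
        group = by_class[polarity]
        # A class may hold fewer than n_per_class opinions; keep what exists.
        k = min(n_per_class, len(group))
        sample.extend(rng.sample(group, k))

    rng.shuffle(sample)  # avoid feeding the classifier one class at a time
    return sample


# Toy, imbalanced data standing in for the corpus (not real reviews).
toy = [(f"opinion {i}", 5) for i in range(50)]
toy += [(f"opinion {i}", 1) for i in range(50, 60)]
subset = balanced_subsample(toy, n_per_class=10)
```
      </preformat>
      <p>Sampling without replacement inside each class keeps the subset balanced as long as every class holds at least the requested number of opinions; when it does not, the sketch simply keeps the whole class.</p>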
      <sec id="sec-3-1">
        <title>3.1. Classifier</title>
        <p>For sentiment analysis, we use a BERT-based classifier built on the Beto-cased
model. BERT, which stands for Bidirectional Encoder Representations from Transformers, is a
highly effective pre-trained language model that has shown strong performance across
various natural language processing tasks.</p>
        <p>Model: We employ the Beto-cased model, which is a variant of BERT trained specifically on
Spanish text. This model captures detailed information and retains the capitalization of words,
enabling better understanding of the context.</p>
        <p>Max Length: To handle input sequences efficiently, we set a maximum sequence length of
32 tokens. Any input longer than this is either truncated or divided into smaller segments based
on BERT’s tokenization approach.</p>
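        <p>The two ways of handling over-long inputs mentioned above can be illustrated on a plain list of token ids. This is a simplified sketch under stated assumptions: the helper names are hypothetical, the ids are placeholders rather than real tokenizer output, and the special tokens that BERT adds to each sequence are ignored.</p>
        <preformat>
```python
MAX_LEN = 32  # maximum sequence length reported in the text


def truncate(token_ids, max_len=MAX_LEN):
    """Keep only the first max_len tokens and drop the rest."""
    return token_ids[:max_len]


def split_into_segments(token_ids, max_len=MAX_LEN):
    """Divide a sequence into consecutive segments of at most max_len tokens."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]


token_ids = list(range(70))  # stand-in for the ids a tokenizer would produce
head = truncate(token_ids)
segments = split_into_segments(token_ids)
```
        </preformat>
        <p>Truncation discards everything after the first 32 tokens, whereas segmentation keeps all tokens at the cost of turning one opinion into several inputs.</p>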
        <p>Optimizer: The Adam optimizer is used, as it is a popular choice for training deep neural
networks. Adam combines adaptive learning rates with momentum, resulting in efficient
optimization and convergence.</p>
        <p>Learning Rate: We set the learning rate to 5 × 10<sup>−5</sup>, a commonly used value for
fine-tuning BERT models. This value strikes a balance between convergence speed and detailed
optimization.</p>
        <p>Epsilon: The Adam epsilon term (ε) is set to 1 × 10<sup>−8</sup>. This small constant is added to the
denominator of the parameter update to avoid division by zero, ensuring numerical stability during the training process.</p>
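        <p>To make the role of epsilon concrete, the following sketch writes out one Adam update for a single scalar parameter, following the standard formulation of the optimizer. The decay rates <monospace>beta1</monospace> = 0.9 and <monospace>beta2</monospace> = 0.999 are the usual defaults and are an assumption here, since the text does not report them; the function name is hypothetical.</p>
        <preformat>
```python
import math


def adam_step(param, grad, m, v, t, lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad         # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
    # eps sits in the denominator so the division stays defined when v_hat is 0.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# With a zero gradient both moments are zero; eps prevents a 0/0 division.
new_param, m, v = adam_step(param=1.0, grad=0.0, m=0.0, v=0.0, t=1)
```
        </preformat>
        <p>In this formulation epsilon does not add noise to the update; it only bounds the denominator away from zero.</p>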
        <p>Epochs: The classifier is trained for 4 epochs, where each epoch represents a complete pass
through the entire training dataset. This choice balances the model’s learning capacity with
computational resources.</p>
        <p>By utilizing BERT-based models with these specific configurations, our aim is to leverage the
contextual representation capabilities of BERT for accurate sentiment analysis on Spanish text.
The selected settings provide a strong foundation for training and optimizing the classifier.</p>
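        <p>For reference, the settings listed in this section can be gathered into a single configuration object. The sketch below is only a convenient summary: the class and field names are hypothetical, and the model identifier is the Hugging Face name commonly used for the cased BETO checkpoint, which is an assumption because the text only says "Beto-cased".</p>
        <preformat>
```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    # Assumed identifier; the paper does not name the exact checkpoint.
    model_name: str = "dccuchile/bert-base-spanish-wwm-cased"
    max_length: int = 32          # maximum sequence length in tokens
    optimizer: str = "adam"
    learning_rate: float = 5e-5
    adam_epsilon: float = 1e-8
    epochs: int = 4


config = FineTuneConfig()
settings = asdict(config)  # plain dict, e.g. for logging the run
```
        </preformat>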
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. Train results</title>
        <p>• Polarity Classification: As the number of instances increases from 100 to 5000, the
F-measure for polarity classification improves, with the highest F-measure of 0.53 achieved
with 5000 instances. This suggests that increasing the training data helps in capturing
the nuances of sentiment analysis more accurately.
• Type Classification: The F-measure for type classification remains consistently high
across different numbers of instances, ranging from 0.91 to 0.95. This indicates that the
classifier performs well in accurately categorizing opinions into attractive, hotel, and
restaurant types, regardless of the training data size.
• Country Classification: The F-measure for country classification ranges from 0.52 to
0.83, a wider spread than for type classification. This suggests that the classifier can
recognize the country of the visited places, although less uniformly across training set
sizes than it recognizes the type.</p>
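        <p>As a reminder of how such a score is obtained, the sketch below computes a macro-averaged F-measure from gold and predicted labels. The macro averaging and the function name are assumptions made for illustration, since the text does not state which averaging is used, and the label lists are toy values rather than results from the experiments.</p>
        <preformat>
```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels only; these are not predictions from the study.
y_true = ["Hotel", "Hotel", "Restaurant", "Attractive"]
y_pred = ["Hotel", "Restaurant", "Restaurant", "Attractive"]
score = macro_f1(y_true, y_pred)
```
        </preformat>
        <p>Because every class contributes equally to a macro average, minority classes weigh as much as the majority class, which matters for an imbalanced label such as polarity.</p>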
        <p>Based on these results, it is evident that increasing the number of instances generally leads to
improved performance in polarity classification. However, for type and country classification,
the classifier achieves high F-measures even with smaller training datasets.</p>
        <p>These findings highlight the effectiveness of the BERT-based classifier and demonstrate its
robustness across different classification tasks. By utilizing a reduced number of instances, we
can achieve competitive classification performance, potentially reducing the computational
resources required for training without significant loss in accuracy.</p>
        <p>Further analysis and evaluation on larger datasets and real-world scenarios would be valuable
to validate the generalization capabilities of the classifier.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Rest-Mex 2023 official results</title>
        <p>Table 5 presents the results obtained from the Rest-Mex 2023 forum. The analysis reveals that
utilizing 5000 instances for each class yields the most favorable outcomes. This amounts to a
total of 25,000 instances, approximately 10% of the entire dataset. These findings demonstrate
the effectiveness of this approach, yielding competitive results.</p>
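        <p>The size of the selected subset follows directly from the figures given in the paper, as the short check below shows. The variable names are illustrative only.</p>
        <preformat>
```python
classes = 5             # polarity classes
per_class = 5000        # opinions sampled from each class
corpus_size = 251_702   # training collection size reported in Section 2

subset_size = classes * per_class
share = subset_size / corpus_size
print(f"{subset_size} instances, {share:.1%} of the corpus")
# prints: 25000 instances, 9.9% of the corpus
```
        </preformat>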
        <p>Moreover, the achieved results surpass the baselines proposed by the organizers. Out of a total
of 17 participants, our approach secured the 13th position.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The findings of this study demonstrate that it is feasible to achieve reasonable results in sentiment
analysis for tourist opinions in Spanish by utilizing a small percentage of the original data.</p>
      <p>One prominent characteristic observed across various tourism collections is the inherent data
imbalance, with negative polarities representing the minority classes. Therefore, it becomes
imperative to address this data imbalance issue.</p>
      <p>For the Rest-Mex 2023 edition, the organizers compiled a database consisting of over 250,000
opinions, with more than 50% exclusive to class 5. This presents a challenge for
oversampling techniques. Consequently, in this research, we explored the application of sub-sampling
methods.</p>
      <p>Specifically, we employed a random selection of data based on polarity class, considering
sample sizes of 100, 500, 1000, and 5000 (the highest available value).</p>
      <p>The most favorable results were obtained when utilizing 5000 instances per class, totaling 25,000
instances. This subset, constituting roughly 10% of the total data, enabled the development of a
competitive model, surpassing the baselines and yielding acceptable results.</p>
      <p>Such an approach proves to be an ideal solution in scenarios where there are constraints on
execution time and available memory. By utilizing a smaller representative sample, the
computational resources required are reduced without compromising the performance significantly.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>S.</given-names> <surname>Arce-Cardenas</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Fajardo-Delgado</surname></string-name>,
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <string-name><given-names>J. P.</given-names> <surname>Ramírez-Silva</surname></string-name>,
          <article-title>A tourist recommendation system: a study case in mexico</article-title>,
          in: <source>Advances in Soft Computing: 20th Mexican International Conference on Artificial Intelligence, MICAI 2021, Mexico City, Mexico, October 25-30, 2021, Proceedings, Part II 20</source>,
          Springer, <year>2021</year>, pp. <fpage>184</fpage>-<lpage>195</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>M. A.</given-names> <surname>Alvarez-Carmona</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Aranda</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rodriguez-Gonzalez</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Fajardo-Delgado</surname></string-name>,
          <string-name><given-names>M. G. A.</given-names> <surname>Sanchez</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Perez-Espinosa</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Martinez-Miranda</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Guerrero-Rodriguez</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Bustio-Martinez</surname></string-name>,
          <string-name><given-names>A. D.</given-names> <surname>Pacheco</surname></string-name>,
          <article-title>Natural language processing applied to tourism research: A systematic review and future research directions</article-title>,
          <source>Journal of King Saud University-Computer and Information Sciences</source>
          (<year>2022</year>).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. Diaz-Pacheco, M. Á. Álvarez-Carmona, R. Guerrero-Rodríguez, L. A. C. Chávez, A. Y. Rodríguez-González, J. P. Ramírez-Silva, R. Aranda, Artificial intelligence methods to support the research of destination image in tourism. a systematic review, Journal of Experimental &amp; Theoretical Artificial Intelligence (2022) 1–31.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] M. Á. Álvarez-Carmona, R. Aranda, S. Arce-Cardenas, D. Fajardo-Delgado, R. Guerrero-Rodríguez, A. P. López-Monroy, J. Martínez-Miranda, H. Pérez-Espinosa, A. Y. Rodríguez-González, Overview of rest-mex at iberlef 2021: recommendation system for text mexican tourism 67 (2021). doi:https://doi.org/10.26342/2021-67-14.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, D. Fajardo-Delgado, R. Guerrero-Rodríguez, L. Bustio-Martínez, Overview of rest-mex at iberlef 2022: Recommendation system, sentiment analysis and covid semaphore prediction for mexican tourist texts, Procesamiento del Lenguaje Natural 69 (2022).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] R. Guerrero-Rodriguez, M. Á. Álvarez-Carmona, R. Aranda, A. P. López-Monroy, Studying online travel reviews related to tourist attractions using nlp methods: the case of guanajuato, mexico, Current issues in tourism 26 (2023) 289–304.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] E. Olmos-Martínez, M. Á. Álvarez-Carmona, R. Aranda, A. Díaz-Pacheco, What does the media tell us about a destination? the cancun case, seen from the usa, canada, and mexico, International Journal of Tourism Cities (2023).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] M. A. Álvarez-Carmona, R. Aranda, R. Guerrero-Rodríguez, A. Y. Rodríguez-González, A. P. López-Monroy, A combination of sentiment analysis systems for the study of online travel reviews: Many heads are better than one, Computación y Sistemas 26 (2022). doi:https://doi.org/10.13053/CyS-26-2-4055.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] M. Á. Álvarez-Carmona, E. Villatoro-Tello, L. Villaseñor-Pineda, M. Montes-y Gómez, Classifying the social media author profile through a multimodal representation, in: Intelligent Technologies: Concepts, Applications, and Future Directions, Springer, 2022, pp. 57–81.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. A. Alvarez-Carmona, A. P. López-Monroy, M. Montes-y Gómez, L. Villasenor-Pineda, H. Jair-Escalante, Inaoe’s participation at pan’15: Author profiling task, Working Notes Papers of the CLEF 103 (2015).</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. Á. Álvarez-Carmona, E. Guzmán-Falcón, M. Montes-y Gómez, H. J. Escalante, L. Villasenor-Pineda, V. Reyes-Meza, A. Rico-Sulayes, Overview of mex-a3t at ibereval 2018: Authorship and aggressiveness analysis in mexican spanish tweets, in: Notebook papers of 3rd sepln workshop on evaluation of human language technologies for iberian languages (ibereval), seville, spain, volume 6, 2018.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M. E. Aragón, M. A. A. Carmona, M. Montes-y Gómez, H. J. Escalante, L. V. Pineda, D. Moctezuma, Overview of mex-a3t at iberlef 2019: Authorship and aggressiveness analysis in mexican spanish tweets, in: IberLEF@SEPLN, 2019, pp. 478–494.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] L. Bustio-Martínez, M. A. Álvarez-Carmona, V. Herrera-Semenets, C. Feregrino-Uribe, R. Cumplido, A lightweight data representation for phishing urls detection in iot environments, Information Sciences 603 (2022) 42–59.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] M. A. Alvarez-Carmona, R. Aranda, A. Diaz-Pacheco, J. de Jesús Ceballos-Mejía, Generador automático de resumenes científicos en investigación turística, Research in Computing Science (2022).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, L. Bustio-Martínez, V. Muñis-Sánchez, A. P. Pastor-López, F. Sánchez-Vega, Overview of rest-mex at iberlef 2023: Research on sentiment analysis task for mexican tourist texts, Procesamiento del Lenguaje Natural 71 (2023).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>