<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Balancing of Tourist Opinions for Sentiment Analysis Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Bethsabe García-Gutiérrez</string-name>
          <email>andrea.garcia@cimat.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pablo Emilio López-Ávila</string-name>
          <email>pablo.lopez@cimat.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro Adair Gallegos-Ávila</string-name>
          <email>pedro.gallegos@cimat.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramón Aranda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Ángel Álvarez-Carmona</string-name>
          <email>miguel.alvarez@cimat.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Investigación en Matemáticas (CIMAT)</institution>
          ,
          <addr-line>Sede Mérida, Yucatán, Mexico, 97302</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Centro de Investigación en Matemáticas (CIMAT)</institution>
          ,
          <addr-line>Sede Monterrey, Nuevo León, Mexico, 66629</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Consejo Nacional de Humanidades, Ciencias y Tecnologías (CONAHCYT)</institution>
          ,
          <addr-line>CDMX, Mexico, 03940</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article presents a proposal for treating an unbalanced database of tourist reviews, with emphasis on the minority classes, prior to classification with a BERT-based model for Spanish called BETO. The methodology originated as part of the authors' thesis project, whose objective is to balance tourism-oriented data and to measure the impact of balancing on text classification.</p>
      </abstract>
      <kwd-group>
        <kwd>Unbalanced data</kwd>
        <kwd>Oversampling</kwd>
        <kwd>Subsampling</kwd>
        <kwd>Tourism</kwd>
        <kwd>Minority classes</kwd>
        <kwd>BETO</kwd>
        <kwd>Spanish</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The tourism sector contributes around 8% of the Gross Domestic Product (GDP) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In addition,
based on the Quarterly Indicators of Tourism Activity, tourism GDP in the fourth quarter of 2022
registered an increase of 1.3% compared to the third quarter of 2022, according to government
estimates [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Interaction through social networks has increased in recent years and tourists
are part of it. When a tourist stays in a hotel, visits a tourist attraction, or eats in a restaurant,
they have the possibility of expressing a comment based on their experience, be it positive or
negative [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ]. The analysis of these comments can help the owners or managers of the
commented places to make decisions to improve the tourist experience [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Since 2021, Rest-Mex has served as an evaluation forum that specializes in
the analysis of texts from the tourism sector to solve different tasks in Mexican Spanish. In the
2023 edition, there are two tasks in which the objects of analysis are tourist opinions
obtained from the TripAdvisor site [
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 9, 10</xref>
        ].
      </p>
      <p>This article describes the participation of the Dataverse team in the sentiment analysis
task. First, the dataset is described, followed by the proposed methodology for class
balancing and classification. Subsequently, the results obtained are shown, ending with
the conclusions and future work.</p>
      <p>[Figure 1: (a) Distribution by “Country”; (b) Distribution by “Type”; (c) Distribution by “Polarity”.]</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset description</title>
      <p>
        The dataset contains a total of 251,702 tourist reviews obtained from TripAdvisor, labeled as follows:
• Polarity: An integer in the range [1, 5], where 1 is the most negative polarity and 5 is the
most positive.
• Country: The name of the country that was visited, one of the following:
Mexico, Cuba, or Colombia.
• Type: The type of destination being reviewed, which can be Hotel, Restaurant,
or Attractive.</p>
      <p>The distribution of the data by class for each label type is depicted in Figure 1. As can be seen
in Fig. 1b, the Type classes are not very unbalanced; for Country, Fig. 1a, there is a
slight imbalance that can be dealt with. For Polarity, on the other hand, Fig. 1c shows a very
pronounced imbalance: more than half of the reviews belong to class 5 (very positive),
in contrast to class 1 (very negative) and class 2 (negative), which account for less
than 5% of the reviews and constitute the minority classes [11].</p>
      <p>To preprocess the data, all text was converted to lowercase; special characters, multiple spaces,
numbers, stop words, and words of length 3 or less were removed.</p>
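      <p>The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the regular expression, and the tiny stop-word list are assumptions, since the paper does not state which Spanish stop-word list was used.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the paper does not say which Spanish stop-word list was used.
STOP_WORDS = {"para", "pero", "como", "esta", "este", "desde", "sobre", "entre"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, drop non-letters, collapse spaces, drop stop words and short words."""
    text = text.lower()
    # Replace digits, punctuation and other special characters with a space.
    text = re.sub(r"[^a-záéíóúüñ\s]", " ", text)
    # split() without arguments also collapses runs of whitespace.
    tokens = text.split()
    # Keep only words longer than 3 characters that are not stop words.
    kept = [t for t in tokens if len(t) > 3 and t not in stop_words]
    return " ".join(kept)


print(preprocess("El Hotel  estaba MUY limpio, 10/10!!"))
```
]]></preformat>
      <p>Note that removing words of length 3 or less also removes short polarity words such as "muy" or "mal", which is a consequence of the rule as stated rather than of this sketch.</p>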
    </sec>
    <sec id="sec-3">
      <title>3. Proposed methodology</title>
      <p>Addressing class imbalance, and minority classes in particular, is important for improving the
performance of our model and avoiding possible bias. This matters most when the classes are highly
imbalanced: if the data set is biased towards one class, a model trained on that data
will be biased towards the same class [12].</p>
      <p>Two common families of methods for addressing class imbalance are subsampling
and oversampling. Subsampling reduces the data by randomly selecting reviews from the
majority classes. Oversampling increases the data, generally for the minority classes,
either by generating new instances or by randomly duplicating existing instances of those classes.
Random duplication has been found to be not very efficient because it introduces no new
information: it does not address the fundamental problem of the lack of information and real
variability in these classes. Generating new instances can alleviate this problem,
but overfitting should be avoided [13].</p>
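      <p>As a point of reference, random subsampling and random duplication can be sketched as below. The function names and the per-class target size are illustrative assumptions, not part of the original work.</p>
      <preformat><![CDATA[
```python
import random
from collections import defaultdict


def group_by_label(reviews, labels):
    """Group reviews into a dict: label -> list of reviews."""
    groups = defaultdict(list)
    for review, label in zip(reviews, labels):
        groups[label].append(review)
    return groups


def random_subsample(reviews, labels, target_size, seed=0):
    """Randomly keep at most target_size reviews per class."""
    rng = random.Random(seed)
    out_reviews, out_labels = [], []
    for label, items in group_by_label(reviews, labels).items():
        kept = items if len(items) <= target_size else rng.sample(items, target_size)
        out_reviews.extend(kept)
        out_labels.extend([label] * len(kept))
    return out_reviews, out_labels


def random_oversample(reviews, labels, target_size, seed=0):
    """Duplicate randomly chosen reviews until each class has at least target_size."""
    rng = random.Random(seed)
    out_reviews, out_labels = [], []
    for label, items in group_by_label(reviews, labels).items():
        missing = max(0, target_size - len(items))
        extra = [rng.choice(items) for _ in range(missing)]
        out_reviews.extend(items + extra)
        out_labels.extend([label] * (len(items) + len(extra)))
    return out_reviews, out_labels
```
]]></preformat>
      <p>Random duplication only repeats existing reviews, which is why it adds no new information to the minority classes.</p>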
      <p>In the case of subsampling, we propose to randomly draw from each majority class a number of
reviews equal to the number of reviews in the “neutral” class; in this
case, classes 4 and 5 are reduced. In the case of oversampling, we propose to obtain the representative
words of all classes using mutual information, which measures the dependence or
association between two random variables; in this context, it measures the association between
words and classes [14]. It can be defined as follows:
<disp-formula><label>(1)</label><tex-math>I(w, c) = \log_2 \left( \frac{P(w, c)}{P(w)\,P(c)} \right)</tex-math></disp-formula>
where P(w, c) is the joint probability that word w and class c appear together, and P(w) and
P(c) are the marginal probabilities that they appear in reviews. Mutual information
measures the deviation of the joint probability of w and c from what would be expected if
they were independent. A higher value of mutual information indicates a stronger association
between the word and the class.</p>
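      <p>One way to compute this association is sketched below. It assumes that probabilities are estimated as fractions of reviews (a word counts once per review) and uses the base-2 logarithm of the ratio of the joint probability to the product of the marginals; the function names and the ranking helper are illustrative assumptions.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def word_class_association(reviews, labels):
    """Return {(word, label): log2(P(w, c) / (P(w) * P(c)))} from review-level counts."""
    n = len(reviews)
    word_count = Counter()
    class_count = Counter(labels)
    joint_count = Counter()
    for review, label in zip(reviews, labels):
        # set() so that a word is counted once per review.
        for word in set(review.split()):
            word_count[word] += 1
            joint_count[(word, label)] += 1
    scores = {}
    for (word, label), joint in joint_count.items():
        p_joint = joint / n
        p_word = word_count[word] / n
        p_class = class_count[label] / n
        scores[(word, label)] = math.log2(p_joint / (p_word * p_class))
    return scores


def representative_words(scores, label, top_k=10):
    """Words of one class ordered from most to least representative."""
    pairs = [(word, s) for (word, lab), s in scores.items() if lab == label]
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in pairs[:top_k]]
```
]]></preformat>
      <p>A word that appears equally often in every class obtains a score near zero, while a word concentrated in a single class obtains a positive score.</p>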
      <p>Once these words have been obtained and ordered from the most representative to the least, synthetic
opinions are generated. A new review is produced by replacing a representative word
with a synonym taken from a Web dictionary or with a similar word from a FastText embedding.</p>
      <p>The synonyms are obtained from the WordReference online dictionary, and the embeddings
from the Spanish version of FastText, trained on data from Common Crawl, an organization
that has been doing web scraping since 2008 and makes its data public, and on the free
encyclopedia Wikipedia [15].</p>
      <p>With the candidate words for generating new instances available, it
should be noted that, if only the representative words of each class
were used, there would be a risk of bias and overfitting within the classes. More generally,
and for the data that is available, this could be observed using dimensionality reduction techniques
applied to the entire data set to inspect the behavior of the new data. To
mitigate this risk, we propose including the hyperparameters temperature and probability.</p>
      <p>The generation function requires the following hyperparameters:
• The database of representative words
• The class in which to generate new data
• The maximum number of words that can be in the new data
• What other classes are taken into account for the generation
• Temperature
• Probability
• Synonym source type</p>
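      <p>These hyperparameters could be grouped in a small configuration object such as the one below. The class name, field names, and validation rules are hypothetical; only the list of hyperparameters, and the minimum of 4 words per generated text stated in this section, come from the methodology itself.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field


@dataclass
class GenerationConfig:
    """Hypothetical container for the hyperparameters of the generation function."""
    representative_words: dict  # class label -> words, most representative first
    target_class: int           # class in which to generate new data
    max_words: int              # maximum number of words in a new review
    temperature: int            # how far down the synonym list a choice may go
    probability: int            # a word is considered with probability 1 / probability
    synonym_source: str         # "dictionary" or "fasttext"
    other_classes: list = field(default_factory=list)  # classes mixed into the text

    def __post_init__(self):
        if self.max_words < 4:
            raise ValueError("max_words must be at least 4, the minimum text length")
        if self.temperature < 0:
            raise ValueError("temperature must be non-negative")
        if self.probability < 1:
            raise ValueError("probability must be a positive integer")
        if self.synonym_source not in ("dictionary", "fasttext"):
            raise ValueError("synonym_source must be 'dictionary' or 'fasttext'")


# Illustrative values only: the word lists and max_words are made up.
config = GenerationConfig(
    representative_words={1: ["sucio", "pésimo"], 2: ["caro", "lento"]},
    target_class=1,
    max_words=20,
    temperature=100,
    probability=1,
    synonym_source="dictionary",
    other_classes=[2],
)
```
]]></preformat>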
      <p>First, the number of words in the new text is chosen at random, between a minimum of
4 and a maximum given as a hyperparameter (the maximum number of words that can be included).</p>
      <p>The probability determines whether or not one of the representative words is considered for
synonym replacement when generating the new text. The procedure consists of first giving an integer
and then choosing a random integer between 1 and that given number; the word is considered
only if the random value matches the number given at the beginning. For example, if the number
2 is given, there is a 50% probability that the word will be considered; if the
number 1 is given, there is a 100% probability of including it in the new data.</p>
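      <p>The probability check described above amounts to accepting a word with probability 1/n for a given integer n. A minimal sketch, with a hypothetical function name, could look like this:</p>
      <preformat><![CDATA[
```python
import random


def include_word(n, rng=random):
    """Return True with probability 1/n.

    A random integer is drawn from 1..n (both ends included) and compared
    with n itself, so n = 1 always succeeds and n = 2 succeeds half the time.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return rng.randint(1, n) == n


rng = random.Random(0)
accepted = sum(include_word(2, rng) for _ in range(10000))
print(accepted / 10000)  # close to 0.5
```
]]></preformat>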
      <p>Subsequently, the synonym source is chosen: the dictionary or FastText. In either case, a list of
words similar to the representative word is obtained, and one word is chosen according to a temperature.
The chosen position in the list is a random number between 0 and the minimum of a given number
(the temperature) and the total number of synonyms that are taken into account. For example, if the
number 1 is given, the random position is 0 or 1, so the chosen word is the first or second
entry of the list of synonyms, that is, one of the words most similar to the representative one.
If a large value is given, words that are further away from the representative word
become more likely to be chosen. In addition, so that a text does not contain words from only one class,
a limited random number of words is taken from the representative words of the other
classes selected for the generation. Finally, all the generated values are joined to form
the text strings, which vary both in length and in the types of words they contain.</p>
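      <p>The temperature-based choice and the mixing of words from other classes can be sketched as follows. The function names are hypothetical, and clamping the upper bound to the last valid list position is an assumption made here so that the index never falls outside the list; the original implementation may handle this differently.</p>
      <preformat><![CDATA[
```python
import random


def pick_by_temperature(candidates, temperature, rng=random):
    """Pick one word from candidates, which are sorted most similar first.

    The position is drawn uniformly from 0..min(temperature, last index), so a
    low temperature stays near the top of the list and a high one can reach
    words that are further from the representative word.
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    upper = min(temperature, len(candidates) - 1)  # assumption: clamp to last index
    return candidates[rng.randint(0, upper)]


def mix_words(own_words, other_words, max_other, rng=random):
    """Add a limited random number of words from other classes, then shuffle."""
    k = rng.randint(0, min(max_other, len(other_words)))
    mixed = list(own_words) + rng.sample(list(other_words), k)
    rng.shuffle(mixed)
    return mixed


synonyms = ["sucio", "mugriento", "manchado", "descuidado"]  # made-up example list
rng = random.Random(0)
word = pick_by_temperature(synonyms, 1, rng)  # first or second entry only
text = " ".join(mix_words([word, "hotel", "cuarto", "servicio"], ["caro", "lento"], 1, rng))
```
]]></preformat>
      <p>With temperature 0 the most similar word is always chosen; with a temperature at least as large as the list, every synonym is equally likely.</p>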
      <p>The classification model used is BERT (Bidirectional Encoder Representations from
Transformers), a language model based on neural networks that has revolutionized natural language
processing (NLP). BERT is a pre-trained model that uses the Transformer architecture, which is
an attention-based neural network. Unlike previous language models that were trained in a
unidirectional fashion (left to right or right to left), BERT is bidirectional, which means that it
can capture the context of words in both directions [16, 17].</p>
      <p>BERT is trained in a stage known as pre-training. During this stage,
BERT is trained to predict hidden words in masked sentences and to predict the relationship
between pairs of sentences. This large-scale training allows the model to learn contextualized
representations of words, capturing information from the context that surrounds them. In this work,
the Spanish version, called BETO, is used.</p>
      <p>Training details:
• 3 epochs
• Adam optimizer using a learning rate of 5e-5
• Batch size of 32 elements
• Unfrozen BETO weights</p>
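      <p>A fine-tuning setup with these settings might be sketched as below, assuming PyTorch and the Hugging Face transformers library. The checkpoint name is an assumption (the paper only says "BETO"), as are the dictionary keys and the function name; the heavy imports are placed inside the function so that the configuration can be inspected without loading the model.</p>
      <preformat><![CDATA[
```python
TRAINING_CONFIG = {
    "epochs": 3,
    "learning_rate": 5e-5,
    "batch_size": 32,
    "num_labels": 5,  # polarity classes 1..5, to be mapped to indices 0..4
    # Assumed checkpoint name; the paper does not state which BETO variant was used.
    "checkpoint": "dccuchile/bert-base-spanish-wwm-uncased",
}


def build_model_and_optimizer(config=TRAINING_CONFIG):
    """Load a BETO classifier with all weights trainable and an Adam optimizer."""
    import torch
    from transformers import AutoModelForSequenceClassification

    model = AutoModelForSequenceClassification.from_pretrained(
        config["checkpoint"], num_labels=config["num_labels"]
    )
    # "Unfrozen" weights: every parameter of the encoder is updated during training.
    for param in model.parameters():
        param.requires_grad = True
    optimizer = torch.optim.Adam(model.parameters(), lr=config["learning_rate"])
    return model, optimizer
```
]]></preformat>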
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>In this edition of Rest-Mex, the models of the Dataverse team were evaluated with different metrics,
with more weight given to the minority classes. The team placed seventh in the competition with a
Sentiment Track Score of 0.7173586609. The results of the metrics obtained are the following:</p>
      <p>For the classification of the test reviews that was submitted, the dictionary option was
used, with temperature 100 and probability 1, and the data were balanced with respect to
polarity and country. As can be seen in Table 1,
in most of the metrics, with the exception of the median MAE, both the
average and median values over all the participants were exceeded. Something similar happens with
the F1 scores of the minority classes, as seen in Table 2.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and future work</title>
      <p>It can be concluded that the oversampling and subsampling techniques are useful in combination
with the BERT model: they produced good results, although not the best ones in the competition.
Oversampling in particular can be an effective strategy to improve the performance of models
on unbalanced data sets. Nevertheless, it must be applied
carefully, as excessive generation of synthetic examples can lead to overfitting and degradation
of model performance. Further tests with other combinations of temperature and probability remain to be done,
and they might have yielded better results in this competition.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors thank the Mexican Academy of Tourism Research (AMIT) for their support of
the project “Creation of a labeled database related to tourist destinations for training artificial
intelligence models for classifying relevant topics” through the call “I Research Projects 2022”,
which originated this work.</p>
      <p>[11] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, L.
Bustio-Martínez, V. Muñiz-Sánchez, A. P. Pastor-López, F. Sánchez-Vega, Overview of rest-mex at
iberlef 2023: Research on sentiment analysis task for mexican tourist texts, Procesamiento
del Lenguaje Natural 71 (2023).</p>
      <p>[12] G. E. Batista, R. C. Prati, M. C. Monard, A study of the behavior of several methods for
balancing machine learning training data, ACM SIGKDD explorations newsletter 6 (2004)
20–29.</p>
      <p>[13] Y. Ma, H. He, Imbalanced learning: foundations, algorithms, and applications (2013).</p>
      <p>[14] K. Church, P. Hanks, Word association norms, mutual information, and lexicography,
Computational linguistics 16 (1990) 22–29.</p>
      <p>[15] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning word vectors for 157
languages, arXiv preprint arXiv:1802.06893 (2018).</p>
      <p>[16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
      <p>[17] N. Sabharwal, A. Agrawal, Bert algorithms explained, Hands-on
Question Answering Systems with BERT: Applications in Neural Networks and Natural
Language Processing (2021) 65–95.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>S.</given-names> <surname>Arce-Cardenas</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Fajardo-Delgado</surname></string-name>,
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <string-name><given-names>J. P.</given-names> <surname>Ramírez-Silva</surname></string-name>,
          <article-title>A tourist recommendation system: a study case in mexico</article-title>,
          in: <source>Advances in Soft Computing: 20th Mexican International Conference on Artificial Intelligence, MICAI 2021, Mexico City, Mexico, October 25-30, 2021, Proceedings, Part II 20</source>,
          Springer,
          <year>2021</year>, pp.
          <fpage>184</fpage>-<lpage>195</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] Secretaría de Turismo, Resultados de la actividad turística marzo
          <year>2023</year>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>E.</given-names> <surname>Olmos-Martínez</surname></string-name>,
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Aranda</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Díaz-Pacheco</surname></string-name>,
          <article-title>What does the media tell us about a destination? the cancun case, seen from the usa, canada, and mexico</article-title>
          ,
          <source>International Journal of Tourism Cities</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>R.</given-names> <surname>Guerrero-Rodriguez</surname></string-name>,
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Aranda</surname></string-name>,
          <string-name><given-names>A. P.</given-names> <surname>López-Monroy</surname></string-name>,
          <article-title>Studying online travel reviews related to tourist attractions using nlp methods: the case of guanajuato, mexico</article-title>,
          <source>Current Issues in Tourism</source>
          <volume>26</volume>
          (
          <year>2023</year>
          )
          <fpage>289</fpage>
          -
          <lpage>304</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Alvarez-Carmona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez-Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fajardo-Delgado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G. A.</given-names>
            <surname>Sanchez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Perez-Espinosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Martinez-Miranda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guerrero-Rodriguez</surname>
          </string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Bustio-Martinez</surname></string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Pacheco</surname>
          </string-name>
          ,
          <article-title>Natural language processing applied to tourism research: A systematic review and future research directions</article-title>
          ,
          <source>Journal of King Saud University-Computer and Information Sciences</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>A.</given-names> <surname>Diaz-Pacheco</surname></string-name>,
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Guerrero-Rodríguez</surname></string-name>,
          <string-name><given-names>L. A. C.</given-names> <surname>Chávez</surname></string-name>,
          <string-name><given-names>A. Y.</given-names> <surname>Rodríguez-González</surname></string-name>,
          <string-name><given-names>J. P.</given-names> <surname>Ramírez-Silva</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Aranda</surname></string-name>,
          <article-title>Artificial intelligence methods to support the research of destination image in tourism. a systematic review</article-title>
          ,
          <source>Journal of Experimental &amp; Theoretical Artificial Intelligence</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Alvarez-Carmona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>López-Monroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Villasenor-Pineda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jair-Escalante</surname>
          </string-name>
          ,
          <article-title>Inaoe's participation at pan'15: Author profiling task</article-title>
          ,
          <source>Working Notes Papers of the CLEF</source>
          <volume>103</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Aranda</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Arce-Cardenas</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Fajardo-Delgado</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Guerrero-Rodríguez</surname></string-name>,
          <string-name><given-names>A. P.</given-names> <surname>López-Monroy</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Martínez-Miranda</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Pérez-Espinosa</surname></string-name>,
          <string-name><given-names>A. Y.</given-names> <surname>Rodríguez-González</surname></string-name>,
          <article-title>Overview of rest-mex at iberlef 2021: recommendation system for text mexican tourism</article-title>,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>67</volume>
          (<year>2021</year>). doi:https://doi.org/10.26342/2021-67-14.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <string-name><given-names>Á.</given-names> <surname>Díaz-Pacheco</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Aranda</surname></string-name>,
          <string-name><given-names>A. Y.</given-names> <surname>Rodríguez-González</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Fajardo-Delgado</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Guerrero-Rodríguez</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Bustio-Martínez</surname></string-name>,
          <article-title>Overview of rest-mex at iberlef 2022: Recommendation system, sentiment analysis and covid semaphore prediction for mexican tourist texts</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>69</volume>
          (
          <year>2022</year>
          )
          <fpage>289</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>M. Á.</given-names> <surname>Álvarez-Carmona</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Aranda</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Guerrero-Rodríguez</surname></string-name>,
          <string-name><given-names>A. Y.</given-names> <surname>Rodríguez-González</surname></string-name>,
          <string-name><given-names>A. P.</given-names> <surname>López-Monroy</surname></string-name>,
          <article-title>A combination of sentiment analysis systems for the study of online travel reviews: Many heads are better than one</article-title>
          ,
          <source>Computación y Sistemas</source>
          <volume>26</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>