<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>A. Z. Gallardo-Hernández); arac@cimat.mx (R. Aranda); angel.diaz@ugto.mx
(A. Diaz-Pacheco)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Classifying Tourist Text Reviews by Means of Mutual Information Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alvaro Zaid Gallardo-Hernández</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramón Aranda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Angel Diaz-Pacheco</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Investigación en Matemáticas</institution>
          ,
          <addr-line>Sede Mérida, Mérida, Yucatán</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Consejo Nacional de Humanidades, Ciencias y Tecnologías</institution>
          ,
          <addr-line>Ciudad de México</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Departamento de Ciencias e Ingenierías, Universidad Iberoamericana Puebla</institution>
          ,
          <addr-line>San Andrés Cholula, Puebla</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Departamento de Ingeniería Electrónica, División de Ingenierías, Universidad de Guanajuato - Campus Irapuato-Salamanca</institution>
          ,
          <addr-line>Yuriria</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>This paper presents a solution for the classification of tourist text reviews. The problem was posed at Rest-Mex 2023: Research on Sentiment Analysis Task for Mexican Tourist Texts. The objective of the task is to determine the polarity (from 1 to 5), the type of opinion (hotel, restaurant, or attraction), and the country (Mexico, Cuba, or Colombia) associated with each review in a given set. Our approach is based primarily on the Mutual Information (MI) measure. During the training stage, the words of the provided training data are grouped according to their respective classes, and the MI value of each word within each class is computed. Additionally, we generate synonyms for each word and add them to the set, assigning them the same MI value as the word they derive from. This set of words, referred to as "trained" words, together with their normalized MI values, is used as the class features. In the classification stage, each word of a new instance is compared with the "trained" words of each class, and the MI values of the matching words are summed. The predicted class is the one with the highest sum.</p>
      </abstract>
      <kwd-group>
        <kwd>Mutual Information</kwd>
        <kwd>Sentiment Analysis</kwd>
        <kwd>Rest-Mex</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the 2019 edition of the "Travel &amp; Tourism Competitiveness Report" (TTCR) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], published
by the World Economic Forum, it was reported that the Travel &amp; Tourism (T&amp;T) sector was
experiencing remarkable growth. The World Tourism Organization (UNWTO) stated that
international tourist arrivals worldwide reached 1.4 billion in 2018, surpassing earlier predictions
by two years. However, the findings of the TTCR also raised concerns about a potential tipping
point where the relentless pursuit of growth and competitiveness in the sector could undermine
the very assets on which it depends.
      </p>
      <p>
        Fast forward two years, and the T&amp;T sector looks drastically different. The COVID-19
pandemic had a devastating impact on the demand for travel, hitting the sector particularly
hard. Shutdowns, travel restrictions, and the disappearance of international travel severely
affected not only companies but also tourism-dependent national economies. Fortunately, there
are now positive indications of recovery, although the pace and progress vary across different
regions and market segments. Additionally, the complexities of this uneven recovery are further
compounded by factors like the war in Ukraine [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        As a result, the T&amp;T sector and its customers have likely undergone permanent changes.
Travelers have become more discerning, especially regarding the health and hygiene conditions
of potential destinations. They are also cautious about the potential impact of future COVID
variants, as well as challenges arising from government policies, border closures, and travel
disruptions. Furthermore, the pause in international travel allowed leisure and business travelers
to reflect on the environmental consequences of their choices. Consequently, governments and
T&amp;T businesses have had to reassess their investments, develop strategies to mitigate risk and
volatility in demand, and adapt to the changing expectations of their customers. In addition
to the impact of the COVID-19 pandemic, over the last decade tourism has also been influenced by
numerous technological advances and tools such as digitization, information and communication
technology, machine learning, robotics, and artificial intelligence (AI) [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6, 7, 8</xref>
        ]. Thus,
most international travelers plan their trips by digital means, and part of their decisions is
based on online information [9].
      </p>
      <p>One task of Rest-Mex 2023: Research on Sentiment Analysis Task for Mexican Tourist
Texts [10] is to determine the polarity (from 1 to 5), the type of opinion (hotel, restaurant, or
attraction), and the country (Mexico, Cuba, or Colombia) associated with a given set of reviews.
Solving it requires algorithms from the field of Artificial Intelligence, specifically
from Natural Language Processing (NLP), which aims at human-like language processing capabilities
for diverse purposes [11, 12]. NLP lies at the intersection of artificial intelligence and linguistics
[13] and covers a wide range of methods to analyze and represent naturally occurring text at
one or more levels of linguistic analysis; see, for example, [14, 15, 16, 17, 18]. In this work,
we propose a method to predict the classes based on the Mutual Information measure [19].</p>
      <p>This work is organized as follows: Section 2 describes the task to be solved; Section 3 presents
the proposed approach in detail; Section 4 reports the results; and finally,
Section 5 presents the conclusions and limitations of our proposal.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description: Sentiment Analysis</title>
      <p>The task is a classification problem in which participating systems must predict the
polarity, type, and country of opinions expressed by tourists who have visited
attractions, restaurants, and hotels in Mexico, Cuba, and Colombia. The dataset used for this
task was compiled from opinions shared by tourists on TripAdvisor between 2002 and 2022.
Each opinion belongs to a polarity class represented by an integer ranging from 1 to 5, where 1
indicates the most negative polarity and 5 denotes the most positive. Additionally, each opinion
is labeled with its corresponding type (hotel, restaurant, or attraction) and country. For example:
• "Un callejón donde tienes que besar a tu amante por años de felicidad, en el amor es parte
de un mito en esta ciudad especial. El callejón estrecho con escalones no es muy especial
en sí mismo. Lo que lo hace especial es toda la historia a su alrededor."
(English: "An alley where you have to kiss your lover for years of happiness in love is part
of a myth in this special city. The narrow alley with steps is not very special
in itself. What makes it special is all the history around it.")
– Polarity: 5 (Very positive)
– Type: Attraction
– Location: Mexico</p>
      <p>To evaluate the results of the polarity subtask, the organizers proposed giving more weight to
the minority classes. In the Rest-Mex sentiment analysis collection, the minority classes are
the ones with the most negative polarities. Therefore, in this edition, the result of
the polarity classification is evaluated as follows:
\[ Res_P(k) = \frac{\sum_{i=1}^{|C|} \left( \left( 1 - \frac{T_{C_i}}{T_C} \right) * F_i(k) \right)}{\sum_{i=1}^{|C|} \left( 1 - \frac{T_{C_i}}{T_C} \right)}, \tag{1} \]
where $k$ is a forum participant system, $C = \{1, 2, 3, 4, 5\}$, $T_C$ is the total number of instances in the collection, and
$T_{C_i}$ is the total number of instances in class $i$. Finally, $F_i(k)$ is the F-measure value for class $i$
obtained by the system $k$. This formulation gives more weight to the classes with fewer
instances. For the type ($Res_T(k)$) and country ($Res_C(k)$) classification, the organizers proposed
simply averaging the F-measure values of their respective classes. The final
score of the whole task for the system $k$ is given by:
\[ Sentiment(k) = \frac{2 * Res_P(k) + Res_T(k) + Res_C(k)}{4}. \tag{2} \]</p>
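      <p>The evaluation measures described above can be sketched in a few lines of Python. This is only an illustration, not the organizers' evaluation script: the function names and the dictionary-based inputs (per-class F-measures and per-class instance counts) are assumptions introduced here.</p>
      <preformat>
```python
def weighted_polarity_score(f_per_class, n_per_class):
    """Weighted F-measure of equation (1): rarer classes get larger weights.

    f_per_class: {class: F-measure of the system for that class}
    n_per_class: {class: number of instances of that class in the collection}
    """
    total = sum(n_per_class.values())
    # Weight of class i is 1 - (instances in class i / total instances).
    weights = {c: 1 - n / total for c, n in n_per_class.items()}
    numerator = sum(weights[c] * f_per_class[c] for c in weights)
    return numerator / sum(weights.values())


def macro_f(f_per_class):
    """Plain average of the per-class F-measures (type and country subtasks)."""
    return sum(f_per_class.values()) / len(f_per_class)


def final_score(res_p, res_t, res_c):
    """Final score of equation (2): polarity counts twice."""
    return (2 * res_p + res_t + res_c) / 4
```
      </preformat>
      <p>As a sanity check, combining the three subtask values reported in Section 4 (0.183, 0.301, and 0.250) with <monospace>final_score</monospace> gives approximately 0.229, which agrees with the official score reported there.</p>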
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>To address the sentiment analysis task on tourist data, we propose using simple features that
capture the information needed to determine the polarity of an opinion while being
quick to compute and compact to represent. The aim is to offer an option for applications restricted in
time or memory (such as IoT solutions), which cannot use approaches that achieve
outstanding effectiveness but are slow or computationally demanding. These features have
the additional advantage of being language-independent.</p>
      <p>Our proposal improves on a previous work [20] and consists of three main stages:
preprocessing, training, and classification. In the preprocessing stage, we applied the following
steps to the text:
• Uppercase letters were converted to lowercase.
• Stop-words were removed.
• Punctuation marks were removed.
• Digits were replaced by the letter ‘d’.
• Stemming was applied to the tokens in the texts.
• Tokens that appear fewer than 50 times in the data set were removed.</p>
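      <p>The preprocessing steps listed above can be sketched as follows. This is a minimal illustration under several assumptions that are not part of the original description: the tiny stop-word list and the suffix-stripping <monospace>crude_stem</monospace> function are placeholders for a full stop-word list and a real stemmer, each individual digit is mapped to ‘d’, and punctuation is stripped before stop-word filtering so that tokens such as "que," are matched.</p>
      <preformat>
```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny Spanish stop-word list (placeholder).
STOP_WORDS = {"el", "la", "los", "las", "un", "una", "de", "en", "y", "que", "es", "no", "muy"}


def crude_stem(token):
    """Placeholder stemmer (assumption): strips a plural 'es' or 's' suffix."""
    for suffix in ("es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Apply the per-text steps: lowercase, digits, punctuation, stop-words, stemming."""
    text = text.lower()                                   # uppercase to lowercase
    text = re.sub(r"\d", "d", text)                       # every digit becomes 'd'
    text = text.translate(str.maketrans("", "", string.punctuation + "¡¿"))  # drop punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # drop stop-words
    return [crude_stem(t) for t in tokens]


def preprocess(corpus, min_count=50):
    """Tokenize every text, then drop tokens seen fewer than min_count times overall."""
    tokenized = [tokenize(doc) for doc in corpus]
    counts = Counter(t for doc in tokenized for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in tokenized]
```
      </preformat>
      <p>The frequency threshold is applied over the whole collection, so <monospace>preprocess</monospace> needs all the texts at once, whereas <monospace>tokenize</monospace> can be reused unchanged on a single new instance at classification time.</p>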
      <sec id="sec-3-1">
        <title>3.1. Training stage</title>
        <p>The main difference from [20] lies in this stage. Here, we use the training data
to extract the features of each subtask (polarity, type, and country). To analyze the information
in the dataset, similarly to [21], we propose using the well-known Mutual Information (MI)
measure. The MI measure was applied to all the training data to extract the features for each
class of each subtask [22]. This measure computes
the mutual dependence between two variables $X$ and $Y$ (the information that $X$ and $Y$ share). MI
is computed by the following equation:
\[ MI(X, Y) = P(X, Y) \log\left( \frac{P(X, Y)}{P(X) P(Y)} \right), \tag{3} \]
where $P(X, Y)$ is the joint probability of the variables $X$ and $Y$, and $P(X)$ and $P(Y)$ are their marginal probabilities. For example, if $X$
and $Y$ are independent, then $X$ does not exert any influence over $Y$ and
vice versa, so MI would be close to zero. Conversely, if $X$ is described in terms of $Y$ (or $Y$
in terms of $X$), then all the information conveyed by $X$ is shared with $Y$ [23]. In our case, MI
measures the influence of a word $X = w$, with $w \in W = \{\text{all the words in the collection}\}$, on a
class $Y = c$, with $c \in C = \{\text{classes in the subtask}\}$:
• If a word $w$ appears in all classes, then it is not relevant to any of them, resulting in $MI(w, c) \approx 0$.
The intuition is that such a word $w$ does not help to discriminate among the different
classes.
• If the word $w$ is almost exclusive to a class $c$, then this word is considered valuable for $c$,
and the expected result is $MI(w, c) &gt; 0$. The intuition is that the higher the MI
score, the more representative the word is of the class.
• If a word appears repeatedly in other classes but not in class $c$, the result is
$MI(w, c) &lt; 0$. The idea is that the lower the MI score, the less useful the word is to
represent the class.</p>
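        <p>As a worked illustration of the three cases above, writing $w$ for a word and $c$ for a class, the following arithmetic evaluates $MI(w, c) = P(w, c) \log(P(w, c) / (P(w) P(c)))$ for made-up probabilities. The values $P(w) = 0.1$ and $P(c) = 0.5$ and the use of the natural logarithm are assumptions chosen only for illustration; the base of the logarithm is not stated in the text.</p>
        <preformat>
```latex
% Assumed values: P(w) = 0.1 and P(c) = 0.5, so P(w) P(c) = 0.05; natural logarithm.

% Case 1: w is spread evenly across the classes, P(w, c) = 0.05
MI(w, c) = 0.05 \ln\frac{0.05}{0.05} = 0

% Case 2: w is almost exclusive to c, P(w, c) = 0.09
MI(w, c) = 0.09 \ln\frac{0.09}{0.05} \approx 0.09 \times 0.588 \approx 0.053

% Case 3: w occurs mostly in other classes, P(w, c) = 0.01
MI(w, c) = 0.01 \ln\frac{0.01}{0.05} \approx 0.01 \times (-1.609) \approx -0.016
```
        </preformat>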
        <p>MI potentially reveals representative words for each class. Thus, it is possible to detect
exclusive words describing the reviews of each class [24]. In addition, for each word in the set obtained
from the MI values, we add up to 5 synonyms to give a better representation of the classes.
We call the resulting set of words and MI values for class $c$ the trained feature set $\Omega_c$. The
$i$-th element, $\omega_{c,i} \in \Omega_c$, is a tuple $(w_{c,i}, m_{c,i})$, where $w_{c,i}$ represents the $i$-th word and
$m_{c,i}$ represents the normalized MI value of $w_{c,i}$. Note that the MI value of each synonym
is the same as that of its word of origin.</p>
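        <p>A possible implementation of the training stage is sketched below. Several details are assumptions, since they are not specified in the text: probabilities are estimated from document-level word presence, the natural logarithm is used, normalization divides by the largest absolute MI value within each class, word-class pairs that never co-occur are skipped, and synonyms come from a hypothetical user-supplied dictionary rather than from any particular lexical resource.</p>
        <preformat>
```python
import math
from collections import Counter, defaultdict


def train_mi_features(docs, labels, synonyms=None, max_synonyms=5):
    """Build the trained feature sets: {class: {word: normalized MI value}}.

    docs: list of token lists (already preprocessed); labels: one class per doc.
    synonyms: hypothetical mapping {word: [synonym, ...]} (assumption).
    """
    synonyms = synonyms or {}
    n_docs = len(docs)
    class_count = Counter(labels)
    word_count = Counter()
    joint_count = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for word in set(tokens):              # document-level presence (assumption)
            word_count[word] += 1
            joint_count[label][word] += 1

    features = {}
    for c in class_count:
        p_c = class_count[c] / n_docs
        raw = {}
        for word, n_wc in joint_count[c].items():
            p_w = word_count[word] / n_docs
            p_wc = n_wc / n_docs
            raw[word] = p_wc * math.log(p_wc / (p_w * p_c))   # equation (3)
        # Normalization by the largest absolute value is an assumption.
        peak = max((abs(v) for v in raw.values()), default=0.0) or 1.0
        omega = {word: value / peak for word, value in raw.items()}
        for word, value in list(omega.items()):
            for syn in synonyms.get(word, [])[:max_synonyms]:
                omega.setdefault(syn, value)  # a synonym inherits the value of its source word
        features[c] = omega
    return features
```
        </preformat>
        <p>Using <monospace>setdefault</monospace> means that a synonym never overwrites the value of a word that was itself observed in the training data; this tie-breaking choice is also an assumption.</p>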
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Classification stage</title>
        <p>The classification stage is the same as in [20]. When a new instance is given, the preprocessing
steps are applied first, and the resulting set of words for the instance is called $\Theta$. Then, $\Theta$ is
intersected with the words in the set $\Omega_c$ (the set of trained words, $w_{c,i}$, for class $c$), and we compute
the sum of the values $m_{c,i}$ for $w_{c,i} \in \Theta \cap \Omega_c$. This can be represented by equation 4:
\[ S_c = \sum_{w_{c,i} \in \Omega_c \cap \Theta} m_{c,i}. \tag{4} \]
Thus, the predicted class for an instance $\Theta$, $class(\Theta)$, is the class with the highest
similarity value $S_c$:
\[ class(\Theta) = \arg\max_{c} \{ S_c \}, \tag{5} \]
where $c \in C$ ranges over the possible class values of each subtask (polarity, type, and country).
For example, for Type, $C = \{\text{hotel}, \text{restaurant}, \text{attraction}\}$.</p>
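        <p>The classification rule described above amounts to a sum followed by an arg max, as the following sketch shows. The function name and the dictionary layout of the trained feature sets (one mapping from word to normalized MI value per class) are assumptions made for illustration.</p>
        <preformat>
```python
def classify(tokens, features):
    """Return (predicted class, per-class scores) for one preprocessed instance.

    tokens: words of the new instance after preprocessing.
    features: {class: {trained word: normalized MI value}}.
    """
    theta = set(tokens)  # the instance is treated as a set of words
    scores = {
        c: sum(value for word, value in omega.items() if word in theta)
        for c, omega in features.items()
    }
    # Arg max over the classes; on ties, the first class in iteration order wins.
    return max(scores, key=scores.get), scores
```
        </preformat>
        <p>For instance, with the toy feature sets <monospace>{"hotel": {"habitacion": 1.0, "cama": 0.6}, "restaurant": {"comida": 1.0, "cama": -0.2}}</monospace>, the instance <monospace>["cama", "habitacion", "limpia"]</monospace> scores 1.6 for hotel and -0.2 for restaurant, so hotel is predicted. Words that match no trained word, such as "limpia" here, contribute nothing.</p>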
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The official results show that our proposal obtained a sentiment score (equation 2) of
0.229. With this score, our approach ranked last in the task. The macro
F-measures obtained were 0.183, 0.301, and 0.250 for the Polarity, Type, and Country subtasks, respectively.
Although our proposal is based on a simple idea, it reached an accuracy of 55.67 for Polarity, 44.39
for Type, and 47.25 for Country.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this study, we introduced a straightforward solution for the Sentiment Analysis Task of
Rest-Mex 2023, utilizing the Mutual Information measure. Despite its simplicity, our approach
demonstrated promising potential. However, we observed a significant drawback in our
methodology, namely the imbalance within the training dataset. Furthermore, we discovered that
numerous meaningless words (e.g., queretarcdm, metrocdmx, etc.) exhibited high MI values, yet
these words were essentially noise present in the dataset. Therefore, to enhance our proposal, it
is imperative to address this issue by removing such meaningless words from consideration.</p>
      <p>[7] A. Diaz-Pacheco, M. A. Álvarez-Carmona, R. Guerrero-Rodríguez, L. A. C. Chávez, A. Y. Rodríguez-González, J. P. Ramírez-Silva, R. Aranda, Artificial intelligence methods to support the research of destination image in tourism. a systematic review, Journal of Experimental &amp; Theoretical Artificial Intelligence 0 (2022) 1–31. doi:10.1080/0952813X.2022.2153276.</p>
      <p>[8] M. A. Álvarez-Carmona, R. Aranda, A. Y. Rodríguez-Gonzalez, D. Fajardo-Delgado, M. G. Sánchez, H. Pérez-Espinosa, J. Martínez-Miranda, R. Guerrero-Rodríguez, L. Bustio-Martínez, Á. Díaz-Pacheco, Natural language processing applied to tourism research: A systematic review and future research directions, Journal of King Saud University - Computer and Information Sciences 34 (2022) 10125–10144. URL: https://www.sciencedirect.com/science/article/pii/S1319157822003615. doi:10.1016/j.jksuci.2022.10.010.</p>
      <p>[9] F. A. C. Calderón, M. V. V. Blanco, Impacto de internet en el sector turístico, Revista UNIANDES Episteme 4 (2017) 477–490.</p>
      <p>[10] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, L. Bustio-Martínez, V. Muñis-Sánchez, A. P. Pastor-López, F. Sánchez-Vega, Overview of rest-mex at iberlef 2023: Research on sentiment analysis task for mexican tourist texts, Procesamiento del Lenguaje Natural 71 (2023).</p>
      <p>[11] T. Cai, A. A. Giannopoulos, S. Yu, T. Kelil, B. Ripley, K. K. Kumamaru, F. J. Rybicki, D. Mitsouras, Natural language processing technologies in radiology research and clinical applications, Radiographics 36 (2016) 176–191.</p>
      <p>[12] G. G. Chowdhury, Natural language processing, Annual review of information science and technology 37 (2003) 51–89.</p>
      <p>[13] P. M. Nadkarni, L. Ohno-Machado, W. W. Chapman, Natural language processing: an introduction, Journal of the American Medical Informatics Association 18 (2011) 544–551.</p>
      <p>[14] M. A. Álvarez-Carmona, A. P. López-Monroy, M. Montes-y Gómez, L. Villasenor-Pineda, H. Jair-Escalante, Inaoe’s participation at pan’15: Author profiling task, Working Notes Papers of the CLEF 103 (2015).</p>
      <p>[15] M. E. Aragón, M. A. Álvarez-Carmona, M. Montes-y Gómez, H. J. Escalante, L. V. Pineda, D. Moctezuma, Overview of mex-a3t at iberlef 2019: Authorship and aggressiveness analysis in mexican spanish tweets, in: IberLEF@SEPLN, 2019, pp. 478–494.</p>
      <p>[16] M. Á. Álvarez-Carmona, R. Aranda, S. Arce-Cárdenas, D. Fajardo-Delgado, R. Guerrero-Rodríguez, A. P. López-Monroy, J. Martínez-Miranda, H. Pérez-Espinosa, A. Rodríguez-González, Overview of rest-mex at iberlef 2021: Recommendation system for text mexican tourism, Procesamiento del Lenguaje Natural 67 (2021). doi:10.26342/2021-67-14.</p>
      <p>[17] M. Á. Álvarez-Carmona, Á. Díaz-Pacheco, R. Aranda, A. Y. Rodríguez-González, D. Fajardo-Delgado, R. Guerrero-Rodríguez, L. Bustio-Martínez, Overview of rest-mex at iberlef 2022: Recommendation system, sentiment analysis and covid semaphore prediction for mexican tourist texts, Procesamiento del Lenguaje Natural 69 (2022).</p>
      <p>[18] M. Á. Álvarez-Carmona, E. Villatoro-Tello, L. Villaseñor-Pineda, M. Montes-y Gómez, Classifying the social media author profile through a multimodal representation, in: Intelligent Technologies: Concepts, Applications, and Future Directions, Springer, 2022, pp. 57–81.</p>
      <p>[19] C. E. Shannon, A mathematical theory of communication, The Bell System Technical Journal 27 (1948) 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x.</p>
      <p>[20] A. Romero-Cantón, R. Aranda, Angel Diaz-Pacheco, J. P. Ramírez-Silva, Mexican epidemiological semaphore color prediction by means of mutual information features, in: CEUR Workshop Proceedings, Coruña, Spain, 2022.</p>
      <p>[21] R. Guerrero-Rodriguez, M. Á. Álvarez-Carmona, R. Aranda, A. P. López-Monroy, Studying online travel reviews related to tourist attractions using nlp methods: the case of guanajuato, mexico, Current Issues in Tourism (2021) 1–16. doi:10.1080/13683500.2021.2007227.</p>
      <p>[22] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, Y. Bengio, Learning deep representations by mutual information estimation and maximization, arXiv preprint arXiv:1808.06670 (2018).</p>
      <p>[23] M. Ravanelli, Y. Bengio, Learning speaker representations with mutual information, arXiv preprint arXiv:1812.00271 (2018).</p>
      <p>[24] M. Á. Álvarez-Carmona, M. Franco-Salvador, E. Villatoro-Tello, M. Montes-y Gómez, P. Rosso, L. Villaseñor-Pineda, Semantically-informed distance and similarity measures for paraphrase plagiarism identification, Journal of Intelligent &amp; Fuzzy Systems 34 (2018) 2983–2990. doi:10.3233/JIFS-169483, publisher: IOS Press.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L. U.</given-names>
            <surname>Calderwood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Soshkin</surname>
          </string-name>
          ,
          <source>The travel and tourism competitiveness report 2019</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <source>Travel &amp; tourism development index 2021: rebuilding for a sustainable and resilient future</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <article-title>Social costs of tourism during the covid19 pandemic</article-title>
          ,
          <source>Annals of Tourism Research</source>
          <volume>84</volume>
          (
          <year>2020</year>
          )
          102994. URL: https://www.sciencedirect.com/science/article/pii/S0160738320301389. doi:10.1016/j.annals.2020.102994.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gossling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <article-title>Pandemics, tourism and global change: a rapid assessment of covid-19</article-title>
          ,
          <source>Journal of Sustainable Tourism</source>
          <volume>29</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>20</lpage>
          . URL: https://doi.org/10.1080/09669582.2020.1758708. doi:10.1080/09669582.2020.1758708.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Guerra-Montenegro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sanchez-Medina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Lana</surname>
          </string-name>
          ,
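          <string-name>
            <given-names>D.</given-names>
            <surname>Sanchez-Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Alonso-Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Del Ser</surname>
          </string-name>
          ,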
          <article-title>Computational intelligence in the hospitality industry: A systematic literature review and a prospect of challenges</article-title>
          ,
          <source>Applied Soft Computing</source>
          <volume>102</volume>
          (
          <year>2021</year>
          )
          107082. URL: https://www.sciencedirect.com/science/article/pii/S1568494621000053. doi:10.1016/j.asoc.2021.107082.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Buhalis</surname>
          </string-name>
          ,
          <article-title>Technology in tourism-from information communication technologies to eTourism and smart tourism towards ambient intelligence tourism: a perspective article</article-title>
          ,
          <source>Tourism Review</source>
          <volume>75</volume>
          (
          <year>2020</year>
          )
          <fpage>267</fpage>
          -
          <lpage>272</lpage>
          . URL: https://doi.org/10.1108/TR-06-2019-0258. doi:10.1108/TR-06-2019-0258, publisher: Emerald Publishing Limited.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>