=Paper=
{{Paper
|id=Vol-2943/restmex_paper8
|storemode=property
|title=Sentiment Classification for Mexican Tourist Reviews based on K-NN and Jaccard Distance
|pdfUrl=https://ceur-ws.org/Vol-2943/restmex_paper8.pdf
|volume=Vol-2943
|authors=Alejandra Romero-Cantón,Ramon Aranda
|dblpUrl=https://dblp.org/rec/conf/sepln/Romero-CantonA21
}}
==Sentiment Classification for Mexican Tourist Reviews based on K-NN and Jaccard Distance==
Sentiment Classification for Mexican Tourist Reviews based on K-NN and Jaccard Distance Alejandra Romero-Cantón1 and Ramon Aranda2,3[0000−0001−8269−3944] 1 Tecnologı́co Nacinal de México, Campus Mérida, 97118, Yucatán, México 2 Centro de Investigación Cientı́fica y de Educación Superior de Ensenada, Unidad de Transferencia Tecnológica Tepic (CICESE-UT3), 63173, Nayarit, Mexico 3 Consejo Nacional de Ciencia y Tecnologı́a (Conacyt), 03940, CDMX, Mexico Abstract. In this paper is presented a propose solution to the Sen- timent Analysis challenge presents in the Recommendation System for Text Mexican Tourism task during the Iberian Languages Evaluation Forum 2021. The task consists of predicting the polarity of an opinion issued by a tourist who traveled to the most representative places of Gua- najuato, Mexico. Thus, our approach is based K-Nearest Neighbors by using a distance based on the Jaccard coefficient concept. In the training stage, by using the training data, our approach first clusters every word from every opinion (review) by the respective class. Then, the stop words from each cluster are deleted. After, the normalized frequency of each word in a class is computed. In this way, the set of words (trained words) with their normalized frequency (NF) are used as class feature vector. In the classification stage, when a new opinion is given, each word from it is intersect with the trained words for each class and the NF of the intersected words are summed (dissimilarity value). The predicted class is assigned to the class with the most high dissimilarity value. The per- formance on the testing data were of 1.26 MAE and 0.22 of F-measure. We think that the obtained results are because the data is unbalanced and our approach does not deal with that issue. Keywords: K-NN · Jaccard Distance · Sentiment analysis · Mexican tourist texts. 1 Introduction In 2018, the World Economic Forum reports that the travel & tourism indus- try generated 10.4% of the world GDP and supported over 319 million jobs [6]. In the last year, global tourism has been impacted strongly due to COVID-19 pandemic and in the last decade tourism has also been influenced by numerous technological advances and tools such as digitization, information and commu- nication technology, machine learning, robotics, and artificial intelligence (AI) [12, 9, 10, 3]. IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Most of international travelers plan their trips by digital means, and a big part of their decisions rely on other travelers shared online information, e.g. online touristic reviews [5]. To synthesize large amounts of reviews, it is essential to use algorithms from the Artificial Intelligence field, specifically the area of the Natural Language Processing (NLP). his sub-field of the artificial intelligence aims to achieve human-like processing capabilities of the language for diverse scopes [4, 8]. NLP intersects artificial intelligence and linguistics [11] and covers a wide range of methods to analyze and represent naturally occurring text at one or more linguistic examination levels. One task of the Recommendation System for Text Mexican Tourism task during the Iberian Languages Evaluation Forum 2021 [1] is classified the polarity (positive/negative) of an opinion issued by a tourist who traveled to the most representative places of Guanajuato, Mexico. This task is based on a sub–field of the PLN known as Sentiment analysis (SA) [2]. Thus, in this work, we propose a method to classify the polarity for mexican tourist reviews based on K-Nearest Neighbour (K-NN) and Jaccard Distance (JD). This work is organized as follows: – Section 2 describes the task to solve. – Section 3 shows in details the proposal followed in this work. – In section 4 the results are presented. – Finally, section 5 presents the conclusions and limitations of our proposal. 2 Task Description The subtask is a classification task where the participating system can predict the polarity of an opinion issued by a tourist who traveled to the most representative places of Guanajuato, Mexico. Guanajuato city is a well-known destination for domestic tourists and it has gained a progressive notoriety in the international arena since the last quarter of the previous century. Apart from famous inter- national destinations within the Mexican territory such as the cases of Cancun and Mexico City, Guanajuato ranks number six among the most visited cities for tourism purposes1 . Thus, this Sentiment Analysis problem is defined as follows: – ”Given an opinion about a Mexican tourist place, the goal is to determine the polarity, between 1 and 5, of the text.” Where 1 indicates most negative and 5 most positive. 2.1 Data set This collection was obtained from the tourists who shared their opinion on Tri- pAdvisor between 2002 and 2020. Each opinion’s class (review polarity) is an integer between [1, 5], where 1 represents the most negative polarity and 5 the 1 https://www.datatur.sectur.gob.mx/SitePages/CompendioEstadistico.aspx Table 1: Distribution of the polarity training data set. Class (polarity) Number of Instances (rows) Percentage 1 80 1.54% 2 145 2.80% 3 686 13.20% 4 1596 30.71% 5 2690 51.76% Total 5197 100% most positive. Each tourist has information about nationality and gender. Rest- Mex organizers available two data sets2 one for training and one for evaluation. Each instance (row) in the training and testing datasets contain the information as described below: – Index: the index of each opinion. – Title: The title that the tourist himself gave to his opinion. – Opinion: The opinion expressed by the tourist. – Place: Place that the tourist visited and to which the opinion is directed. – Gender: the gender of the tourist. – Age: The age of the tourist at the time of issuing the opinion. – Country: The country of origin of the tourist. – Date: the date when the review was issued. – Label: it represents the polarity of the review, labels goes from 1 to 5. Note that for the testing data set, the labels values are unknown. Training data set consists of 5197 instances. Table 1 shows the distribution of the review polarities for the training data set. It is important to mention that the training data set the classes are unbalanced. The test data set contained 2216 instances (the distribution polarity is unknown). 3 Proposed Approach Our proposal consists in two main stages: training stage and classification stage. We describe each stage below. 3.1 Training stage In this stage, we use the using the training data to extract features of each class. Thus, our approach first clusters every word from every opinion (review) by the respective class. Then, the stop words from each cluster are deleted. We call to the result set of words fro class c, trained words Ωc . After, the normalized frequency, ωi,c , of i-th word in the class, c, is computed. In this way, the sets Ωc with their normalized frequency ωi,c (for c ∈ 1, 2, 3, 4, 5 and i = 1, 2, ..., Nc where Nc = |Ωc |) are used as class feature vector. 2 https://sites.google.com/cicese.edu.mx/rest-mex-2021 3.2 classification stage In the classification stage, when a new opinion/review is given, the stop words from it are deleted. Then the resulting set of words for that opinion is called Θ. After, each word in Θ is intersect with the set Ωc (trained words of class c). Then, the NF of the intersected words are added. This can be represented by equation 1: X Sc = ωk,c (1) k∈Ωc ∩Θ Note that equation 1 is based on the concept of the Jaccard Distance [7]. Thus, the predicted class for the review Θ, C(Θ), is assigned to the class with the most high similarity value Sc : C(Θ) = arg max{Sc } ∀ c ∈ {1, 2, 3, 4, 5} (2) c Equation 2 is the representation of the K-NN method, when K= 1. 4 Results Figure 1 show the wordclouds of the sets Ωc weighed with their corresponding ωk,c for all classes. Note that although there are many words repeated in all classes words as ”museo” and ”momias” are more frequent in classes 1 and 2, and word as ”guanajuato” and ”historia” are more frequent in classes 3 to 5. The official results for our proposal are as follows: – Accuracy: 36.95 – F-measure: 0.22 – MAE: 1.27 In this sense our approach obtained the last place according to MAE value (15th place). According to accuracy, we obtained the 12th place. Finally, we obtained the 9th place with accordance to F-measure. 5 Conclusions In this work, we presented a simple solution based on the concept of the Jaccar Distance to classify the sentiment analysis problem presented on Recommen- dation System for Text Mexican Tourism task during the Iberian Languages Evaluation Forum 2021. Although, our proposal is based in a simple idea it showed potential. The most significant disadvantage of our approach was the unbalance training data set. Additionally, our proposal could be improved by removing the representative words as subjects and only work with qualifying adverbs. (a) Label/Class 1 (b) Label/Class 2 (c) Label/Class 3 (d) Label/Class 4 (e) Label/Class 5 Fig. 1: Wordclouds for Ωc weighed with their corresponding ωk,c for all the classes References 1. Álvarez-Carmona, M.Á., Aranda, R., Arce-Cárdenas, S., Fajardo-Delgado, D., Guerrero-Rodrı́guez, R., López-Monroy, A.P., Martı́nez-Miranda, J., Pérez- Espinosa, H., Rodrı́guez-González, A.: Overview of rest-mex at iberlef 2021: Rec- ommendation system for text mexican tourism. Procesamiento del Lenguaje Nat- ural 67 (2021) 2. Anis, S., Saad, S., Aref, M.: A survey on sentiment analysis in tourism. Inter- national Journal of Intelligent Computing and Information Sciences 20(1), 1–15 (2020). https://doi.org/10.21608/IJICIS.2020.106309 3. Buhalis, D.: Technology in tourism-from information communication technologies to eTourism and smart tourism towards ambient intelligence tourism: a perspective article. Tourism Review 75(1), 267–272 (Jan 2020). https://doi.org/10.1108/TR- 06-2019-0258, https://doi.org/10.1108/TR-06-2019-0258, publisher: Emerald Pub- lishing Limited 4. Cai, T., Giannopoulos, A.A., Yu, S., Kelil, T., Ripley, B., Kumamaru, K.K., Ry- bicki, F.J., Mitsouras, D.: Natural language processing technologies in radiology research and clinical applications. Radiographics 36(1), 176–191 (2016) 5. Calderón, F.A.C., Blanco, M.V.V.: Impacto de internet en el sector turı́stico. Re- vista UNIANDES Episteme 4(4), 477–490 (2017) 6. Calderwood, L.U., Soshkin, M.: The travel and tourism competitiveness report 2019 (Sep 2019) 7. Álvarez Carmona, M.A., Franco-Salvador, M., Villatoro-Tello, E., Montes-y Gómez, M., Rosso, P., Villaseñor-Pineda, L.: Semantically-informed distance and similarity measures for paraphrase plagiarism identification. Journal of Intelligent & Fuzzy Systems 34(5), 2983–2990 (2018). https://doi.org/10.3233/JIFS-169483, publisher: IOS Press 8. Chowdhury, G.G.: Natural language processing. Annual review of information sci- ence and technology 37(1), 51–89 (2003) 9. Gossling, S., Scott, D., Hall, C.M.: Pandemics, tourism and global change: a rapid assessment of covid-19. Journal of Sustainable Tourism 29(1), 1–20 (2021). https://doi.org/10.1080/09669582.2020.1758708, https://doi.org/10.1080/09669582.2020.1758708 10. Guerra-Montenegro, J., Sanchez-Medina, J., Lana, I., Sanchez- Rodriguez, D., Alonso-Gonzalez, I., Del Ser, J.: Computational in- telligence in the hospitality industry: A systematic literature re- view and a prospect of challenges. Applied Soft Computing 102, 107082 (2021). https://doi.org/https://doi.org/10.1016/j.asoc.2021.107082, https://www.sciencedirect.com/science/article/pii/S1568494621000053 11. Nadkarni, P.M., Ohno-Machado, L., Chapman, W.W.: Natural language process- ing: an introduction. Journal of the American Medical Informatics Association 18(5), 544–551 (2011) 12. Qiu, R.T., Park, J., Li, S., Song, H.: Social costs of tourism dur- ing the covid-19 pandemic. Annals of Tourism Research 84, 102994 (2020). https://doi.org/https://doi.org/10.1016/j.annals.2020.102994, https://www.sciencedirect.com/science/article/pii/S0160738320301389