Techkatl: A Sentiment Analysis Model to Identify the Polarity of Mexican’s Tourism Opinions Eduardo Roldán Reyes1[0000-0002-4212-1586] 1 TecNM / Instituto Tecnológico de Orizaba, Orizaba, CP 94300, Mexico eroldanr@orizaba.tecnm.mx Abstract. This article describes the model used for the Sentiment Analysis Task framed within the REST-MEX 2021: Recommendation System for Text Mexican Tourism. The sentiment analysis model, called Techkatl, implemented a generic five-step text mining process for the identification of the polarity of opinions issued by tourism visitors in Mexico. For the polarity detection, Techkatl utilized a supervised learning approach with cross-validation to train and test classification algorithms. For the development, the data analytic RapidMiner platform was used for the rapid prototyping and the performance evaluation of the classification task. The deployment of the model showed a performance above the baseline for fast identification of the polarity with a low computation cost. Keywords: Sentiment Analysis, Machine Learning, Supervised Algorithms, Polarity Detection, RapidMiner 1 Introduction Sentiment analysis (SA) is a novel approach to determine the sentiment, emotion, or polarity implicitly or explicitly expressed in an opinion [1]. It is mainly applied on the Internet’s social media and e-commerce websites to analyze the comments of users and customers' reviews. Under this approach, the polarity of an opinion is the degree of positiveness, negativity, or neutrality towards a certain topic. Nowadays, the SA has been successfully applied to understand customers' opinions and to propose marketing strategies to enhance the quality of the products and ser- vices of the companies as can be observed in [2–4]. Although several studies have been carried out in the tourism context [5–9], they are almost focused on the English language and very few have been addressed for the Spanish language, specifically on tourism in Mexico. This has motivated the REST-MEX 2021 Recommendation Sys- tem for Text Mexican Tourism [10]. In this edition, a contest on SA was proposed to challenge researchers and SA practitioners to participate with systems predicting the polarity of a database of opinions issued by tourists who have already traveled to attraction spots of Guanajuato in Mexico. In this paper a description of the Techkatl team from the TecNM-Instituto Tecno- lógico de Orizaba – MIA is presented. The Techkatl model, which name comes from IberLEF 2021, September 2021, Málaga, Spain. Copyright© 2021 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). the Nahuatl language meaning “Sentiment”, was developed under the machine learn- ing approach. It performs a five-step generic model, inspired by a text mining model previously developed by our team [11]. The rest of the paper is structured as follows. Section 2 explains the rules for the SA classification task. Section 3 describes the characteristics of the proposed system. Section 4 highlights the experimental evaluation and the attaint results. Finally, the general conclusions are mentioned in Section 5. 2 Task description The SA task consists of the classification of textual opinions, expressed by tourists visiting interesting spots of Guanajuato in Mexico, to identify the polarity. All the opinions were obtained from the TripAdvisor platform and provided by the contest organizers to participating teams in a .csv file for training. The opinions were regis- tered on the platform between 2002 and 2020. The polarity of each opinion ranged between 1 and 5, where 1 stands for the most negative polarity and 5 the most posi- tive. An excerpt of the released database is shown in Table 1. Table 1. Example of the training database. index Title Opinion Place Gender Age Country Date Label 1 ¡Momias… Las mom... Museo… Male 53 México 22/10/2016 1 … … … … … … … … … 5197 Muy bo… No te… Monum... Female 31 México 26/03/2016 5 The entire collection of the comments consist of 7,632 opinions where 5,784 are from Mexican tourists and 1,848 come from other Iberoamerican country’s tourists. For the SA track, the database was split into two partitions: 5,197 were provided as the training set (labeled) and 2,435 were later released as the test set (unlabeled). To evaluate the participant methods and to determine the winner of the challenge, the Mean Absolute Error (MAE) metric was used (Eq. 1). Thus, the system with the lowest MAE value was considered the winner. 1 𝑀𝐴𝐸 = ∑𝑛𝑡=1|𝑒𝑡 | (1) 𝑛 3 Model description In this part, the SA model developed to deal with the challenge is described. The model has been built under the RapidMiner Studio 9.9 version. The experiments and the evaluation were also performed on this platform. Among several advantages for using the RapidMiner [12], the main reason that motivated us to use RapidMiner is that it allows the rapid development of data analysis processes by chaining operators in a user-friendly graphical environment. The model is composed of five main steps to perform the SA process: 1) Acquisition, 2) Pre-processing, 3) Processing, 4) Evalu- ation, and 5) Results. The first and second stages are both applied for Training (A) and Testing (B). In the next subsections, each step is described in detail. A complete view of the five-step model is shown in Fig. 1. Fig. 1. The SA model on the RapidMiner platform. 3.1 Acquisition The information acquisition is performed through two operators that read the .csv file for the training set, and the .xls file for the testing one. The parameters of this operator were configured to recognize the encoding of the text file (UTF-8) since the Spanish language has accentuated characters, and to identify the character for column separa- tor (,). The output of these operators is the Example Set, which is a database internally created and displayed as a table in the results view panel of the program interface. 3.2 Pre-processing The pre-processing step involves two different groups of operators. The first one is composed of four operators which purpose is formatting the data in order to be recog- nizable for the classification algorithms. These operators are: “Numerical to Polyno- mial” for changing the type of attributes to a polynomial type; “Set Role” to indicate the index, the regular attributes, and the class (label); “Select Attributes” for dismiss- ing attributes according to its importance or irrelevance (e.g. the index attribute) hence, only the Title and Opinion attributes were kept for further analysis; and “Nom- inal to Text” to set up the text attributes into string attributes. The other group of operators is enclosed in the “Process Documents from Data” operator which generates word vectors from the string attributes. The objective of this group of operators is to reduce the information volume and to increase the efficiency of the classification algorithm. Within this operator the following operators are con- catenated to perform the next five sub-stages of text pre-processing: • Tokenize: this operator fragments the text into syntactic units (i.e. words). • Transform Cases: usually, the opinions are a mix of uppercase and lowercase words which may be difficult to further processing. With this operator, all the up- percase letters are converter to their lowercase forms. • Replace Tokens: this operator is used to replace: a) misspelled words, and b) ac- centuated characters with non accentuated characters. This helps to reduce the vol- ume of the text by identifying duplicate or misspelled words. • Filter Stopwords (Dictionary): this operator removes the most trivial words such as pronouns, prepositions, and articles by comparing each token to a stop-word list. Since Rapidminer does not have a Spanish stopword list, a custom list with 722 stopwords was created and loaded through this operator. The list of stopwords can be requested to the author on demand. This sub-process helped to reduce by 30% the text volume. • Stem (Snowball): this operator applies several stemming algorithms for the Snow- ball language [13]. This operator supports the Spanish language. Fig. 2 shows the number of removed tokens after the pre-processing task with the training set. The amount of reduced information is up to 60%. A similar result was obtained with the Test set. Fig. 2. The valid tokens after pre-processing tasks. The remaining tokens are used to create a word vector through the Term Frequency - Inverse Document Frequency (TF-IDF) method [14]. The TF-IDF (Eq. 2) computes the relative frequency of a word (t) in a specific document (d) through an inverse proportion of the word over the entire collection of documents (D). The IF-TFD was selected because it provides a simple, reliable, and fast schema to evaluate the rele- vance of each token within a large collection of opinions. 𝑁 𝑇𝐹 − 𝐼𝐹𝐷(𝑡, 𝑑, 𝐷) = log(1 + 𝑓𝑟𝑒𝑞(𝑡, 𝑑)) ∙ log ( ) (2) 𝑐𝑜𝑢𝑛𝑡(𝑑𝜖𝐷:𝑡𝜖𝑑) 3.3 Processing Within this step, a classification task is performed through the application of different supervised machine learning algorithms. This step aims to classify the opinions into five different classes (1 to 5), representing the different degrees of polarity. The algo- rithms applied were the following: • k-Nearest Neighbors (k-NN): this algorithm classifies a new opinion based on the majority class of its k neighbor opinions. A similarity metric (the mixed Euclidean distance) is used to measure the distances between the unclassified opinion and its neighbors. For the SA task, a distance of 0 is taken if both opinions are closest, otherwise, the distance is equal to 1. For the experimentation, different k values were selected (k = 1, 3, and 5). • Trees: with this operator, a decision tree (DT) model is generated. Each leaf of the model represents the class and the nodes represent a splitting rule for one specific attribute. The criterion used to construct and prune the trees was the information gain. Another two algorithm variants were also tested, such as the Gradient Boost- ed Trees (GBT) and Random Forest (RF). • Support Vector Machine (SVM): this learning method applies the mySVM algo- rithm [15] and supports various kernel types. For the experimentation, the Linear kernel type was chosen since the number of attributes is large and the relation be- tween the class labels is linear. • Bayesian Methods: two variants of these methods were applied: the simple Naïve Bayes (NBS) and the kernel one (NBK). For the second one, a greedy kernel was set with a minimum bandwidth of 0.1 and 10 kernels. • Artificial Neural Networks (ANN): finally, two of the most representative neural network algorithms were applied. The Neural Net (NN) algorithm built a model us- ing a feed-forward neural network trained by a backpropagation algorithm (i.e. a multi-layer perceptron) and the Deep Learning (DP) algorithm which performed a multi-layer feed-forward artificial neural network trained with stochastic gradient descent using back-propagation [16]. 3.4 Evaluation For the evaluation step, the Cross-Validation operator was applied to estimate the performance of the classification algorithms. This procedure encloses two subpro- cesses: training and evaluation. First, the input Example Set is split into n=10 subsets of equal dimensions (i.e. number of folds), and one of the subsets is kept as the test dataset. The rest of the subsets are used as the training dataset and processed by the classification algorithm. The procedure is repeated n-1 times, with all of the subsets. The performance metrics and results from the n iterations are finally averaged to out- put a single estimation. The performance evaluation of the classification model for each test set produces an acceptable estimation of the model performance on unlabeled datasets. Neverthe- less, it does not guarantees the same performance on new unlabeled data. The experiments for the classifier's evaluation were performed on a Dell XPS-9370 PC with a Core i7 Intel microprocessor. 3.5 Results The last step of the SA model displays two outputs: the performance evaluation of the algorithms and the classifications of the opinions of the Test Set (i.e. polarity). Sever- al operators were applied to meet the requirements of the output submission (Fig. 1). Table 2 summarizes the performance of the classification algorithms for the Training Set. Table 2. The performance metrics of the evaluated classification algorithms. Classification Processing time Accuracy MAE Algorithm (CPU-time) in min. k-NN (k = 1, 3, 5) 64.8%, 56.73%, 52.91% 0.324, 0.653, 0.682 43.53, 41.16, 40.69 DT, GBT, RF 51.82%, 57.84%, 53.22% 0.619, 0.542, 0.601 1.88, 121.3, SVM 51.76% 0.482 1.71 NVS, NBK 80.3%, 81.74% 0.197, 0.234 1.4, 3.67 NN, DP 69.84%, 72.58% 0.387, 0.351 408.32, 1567.85 Along with the MAE and the accuracy metric, the CPU processing time for each algorithm was also measured through the “Log” operator. As a result, the algorithms that were chosen for submission of the files were the Naive Bayes ones. 3.6 Discussion As can be seen in Table 2, the NVS and the NBK were able to produce the lower values of the MAE (which is the metric that was chosen to rank the team’s results in the contest) and the best accuracy rates. Even if other algorithms may perform higher rates of accuracy, the NVS and NBK algorithms also showed low processing times for the classification task. This can be an important issue since the rapidity of identi- fying the polarity of opinions could be crucial to producing short-term and low-cost improvement strategies. These were the main reasons why these two methods were selected over the other algorithms tested. 4 Conclusions In this article, the Techkatl model for the SA track of the REST-MEX 2021 was de- scribed. The model development and experimentations were carried out on the RapidMiner platform. It was chosen for its relative ease of use and because it offers a very user-friendly interface to develop machine learning models. Also, it is supported by a large community of practitioners, researchers, and data scientists. It is recom- mended for rapid prototyping, and it can also be used by decision-makers in crucial areas of industry, management, tourism, or marketing to perform machine learning, text mining, or sentiment analysis tasks. As an example of the practicality provided by this tool, it can be highlighted that the model hereby presented was developed in a very short time (less than a couple of hours). On the other hand, the Techkatl proposed model showed that even if several algo- rithms have been developed for SA, many of them are complexes, time-consumers and even performs low rates of efficiency in comparison with other simplest algo- rithms such as the Naïve Bayes methods which still performing well and fast at a low computing cost. Regarding other complexes and more recent methods, such as the Deep Learning approach, they present the main disadvantages that the training time may be considerable. This is a problem especially when it is required to obtain correct classifications in the short term and with limited computing power. Acknowledgments The author of this paper would like to express his gratitude to Conacyt, the Tecno- lógico Nacional de México, and to recognize the support of colleagues and students of the Instituto Tecnológico de Orizaba. References 1. Ravi, K., Ravi, V.: A survey on opinion mining and sentiment analysis: Tasks, approaches and applications. Knowledge-Based Systems. 89, 14–46 (2015). https://doi.org/10.1016/j.knosys.2015.06.015. 2. Balazs, J.A., Velásquez, J.D.: Opinion Mining and Information Fusion: A survey. Infor- mation Fusion. 27, 95–110 (2016). https://doi.org/10.1016/j.inffus.2015.06.002. 3. Gull, R., Shoaib, U., Rasheed, S., Abid, W., Zahoor, B.: Pre Processing of Twitter’s Data for Opinion Mining in Political Context. Procedia Computer Science. 96, 1560–1570 (2016). https://doi.org/10.1016/j.procs.2016.08.203. 4. Moussa, M.E., Mohamed, E.H., Haggag, M.H.: A survey on opinion summarization tech- niques for social media. Future Computing and Informatics Journal. 3, 82–109 (2018). https://doi.org/10.1016/j.fcij.2017.12.002. 5. Park, J., Lee, B.K.: An opinion-driven decision-support framework for benchmarking hotel service. Omega. 103, 102415 (2021). https://doi.org/10.1016/j.omega.2021.102415. 6. Li, W., Guo, K., Shi, Y., Zhu, L., Zheng, Y.: DWWP: Domain-specific new words detec- tion and word propagation system for sentiment analysis in the tourism domain. Knowledge-Based Systems. 146, 203–214 (2018). https://doi.org/10.1016/j.knosys.2018.02.004. 7. Sann, R., Lai, P.-C.: Understanding homophily of service failure within the hotel guest cycle: Applying NLP-aspect-based sentiment analysis to the hospitality industry. Interna- tional Journal of Hospitality Management. 91, 102678 (2020). https://doi.org/10.1016/j.ijhm.2020.102678. 8. Mehraliyev, F., Kirilenko, A.P., Choi, Y.: From measurement scale to sentiment scale: Examining the effect of sensory experiences on online review rating behavior. Tourism Management. 79, 104096 (2020). https://doi.org/10.1016/j.tourman.2020.104096. 9. Li, S., Li, G., Law, R., Paradies, Y.: Racism in tourism reviews. Tourism Management. 80, 104100 (2020). https://doi.org/10.1016/j.tourman.2020.104100. 10. Álvarez-Carmona, M.Á., Aranda, R., Arce-Cárdenas, S., Fajardo-Delgado, D., Guerrero- Rodríguez, R., López-Monroy, A.P., Martínez-Miranda, J., Pérez-Espinosa, H., Rodríguez- González, A.: Overview of Rest-Mex at IberLEF 2021: Recommendation System for Text Mexican Tourism. Procesamiento del Lenguaje Natural. 67, (2021). 11. Vásquez Rojas, C., Roldán Reyes, E., Aguirre y Hernández, F., Cortés Robles, G.: Integra- tion of a text mining approach in the strategic planning process of small and medium-sized enterprises. Industr Mngmnt & Data Systems. 118, 745–764 (2018). https://doi.org/10.1108/IMDS-01-2017-0029. 12. Kotu, V., Deshpande, B.: Chapter 15 - Getting Started with RapidMiner. In: Kotu, V. and Deshpande, B. (eds.) Data Science (Second Edition). pp. 491–521. Morgan Kaufmann (2019). https://doi.org/10.1016/B978-0-12-814761-0.00015-0. 13. Snowball: A language for stemming algorithms, http://snowball.tartarus.org/texts/introduction.html, last accessed 2021/06/05. 14. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management. 24, 513–523 (1988). https://doi.org/10.1016/0306- 4573(88)90021-0. 15. Ruping, S.: Incremental learning with support vector machines. In: Proceedings 2001 IEEE International Conference on Data Mining. pp. 641–642 (2001). https://doi.org/10.1109/ICDM.2001.989589. 16. Neural Net - RapidMiner Documentation, https://docs.rapidminer.com/latest/studio/operators/modeling/predictive/neural_nets/neural_ net.html, last accessed 2021/06/05.