Uzbek Sentiment Analysis Based on Local Restaurant Reviews

Sanatbek Matlatipov 1, Hulkar Rahimboeva 1, Jaloliddin Rajabov 1 and Elmurod Kuriyozov 2

1 National University of Uzbekistan named after Mirzo Ulugbek, 4 Universitet St, Tashkent, 100174, Uzbekistan
2 Universidade da Coruña, CITIC, Grupo LYS, Depto. de Computación y Tecnologías de la Información, Facultade de Informática, Campus de Elviña, A Coruña 15071, Spain

Abstract
Extracting useful information for sentiment analysis and classification from large amounts of user-generated feedback, such as restaurant reviews, is a crucial natural language processing task: it not only supports customer satisfaction by enabling personalized services, but can also influence the further development of a company. In this paper, we present the work done on collecting restaurant review data as a sentiment analysis dataset for the Uzbek language, a member of the Turkic family that is heavily affected by the low-resource constraint, and provide further analysis of the novel dataset by evaluating different techniques, from logistic regression based models and support vector machines to deep learning models such as recurrent and convolutional neural networks. The paper includes detailed information on how the data was collected and how it was pre-processed for better quality, as well as the experimental setups for the evaluation process. The overall evaluation results indicate that pre-processing steps, such as stemming for agglutinative languages, yield better results, with the best performing model eventually achieving 91% accuracy.

Keywords
Sentiment Analysis, Uzbek Language, Dataset, Support Vector Machine, RNN, CNN

The International Conference and Workshop on Agglutinative Language Technologies as a challenge of Natural Language Processing (ALTNLP), June 7-8, Koper, Slovenia
Email: s.matlatipov@nuu.uz (S. Matlatipov); h.rahimboyeva@nuu.uz (H. Rahimboeva); j.rajabov@nuu.uz (J. Rajabov); e.kuriyozov@udc.es (E. Kuriyozov)
URL: https://sanatbek.uz/ (S. Matlatipov)
ORCID: 0000-0002-6895-3436 (S. Matlatipov); 0000-0002-3259-7708 (H. Rahimboeva); 0000-0002-0369-6707 (J. Rajabov); 0000-0003-1702-1222 (E. Kuriyozov)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction
The power of Natural Language Processing (NLP) techniques relies on large amounts of labelled data in many applications. Sentiment analysis is the process of analyzing and labelling the opinions posted by consumers. Consumers usually post their feedback about places and foods on popular applications such as Google Maps (https://www.google.com/maps) and Yelp (https://www.yelp.com). These platforms encourage consumers to actively participate in reviews, and massive user-generated restaurant reviews allow consumers to fully express their needs while helping merchants provide real-time and personalized service [1]. Moreover, restaurant reviews express the composition of clients' emotional necessities and are an important source of information about consumers' choices [2]. Currently, opinion mining has achieved very high accuracy, especially after applying deep learning methods, for high-resource languages [3].
However, applying deep learning and machine learning techniques to different types of domains [4] and gathering high-quality corpora in sufficient quantity [5] play an important role in the development of low-resource languages. The language we focus on is Uzbek, which is used by around 34 million native speakers in Uzbekistan and elsewhere in Central Asia and China (https://en.wikipedia.org/wiki/Uzbek_language). Uzbek is a null-subject and highly agglutinative language in which a single word can form a meaningful sentence [6, 7]. To our knowledge, there is no previous work on sentiment classification based on restaurant-domain feedback for Uzbek. The contributions of this paper are the following:

• A restaurant-domain annotated corpus for sentiment analysis is created, collected from Google Maps locations serving Uzbek cuisine, with local national food reviews as the primary target. The corpus contains 4500 positive and 3710 negative reviews after manually removing major errors and cleaning. The annotation is based on the 5-star rating provided by Google Maps: reviews with 1 to 3 stars are labelled as negative, and reviews with 4 or 5 stars as positive. Some reviews were written in other languages, such as English, Kyrgyz and Russian; rather than discarding them, we translated them into Uzbek using the official Google Translate API.

• Pre-processing of the corpus is applied in two steps. The first step removes URLs and punctuation and lower-cases the text. The second step removes stop words [8] from the dataset, where the stop-word list is generated with a TF-IDF based algorithm and validated by accuracy evaluation. We then applied a stemming algorithm [7, 9] based on an electronic dictionary of Uzbek word endings, which uses a combinatorial approach covering the parts of speech of the Uzbek language: nouns, adjectives, numerals, verbs, participles, moods and voices. The advantages of this algorithm are that it is lexicon-free and that a single operation (a lookup in the dictionary of endings) both segments the word into suffixes and performs its morphological analysis.

• Machine learning and deep learning algorithms have been applied. Furthermore, a deep learning (recurrent neural network) algorithm fed with fastText (https://fasttext.cc/docs/en/crawl-vectors.html) pre-trained word embeddings is applied to improve the accuracy.

All resources, including the corpus and the source code used for crawling and classification, are uploaded to a public repository (https://github.com/SanatbekMatlatipov/restauranat-sentiment/tree/main).

The paper is structured as follows: after this Introduction, Section 2 describes related work. It is followed by a description of the methodology in Section 3, the evaluation setup in Section 4, and the results and discussion in Section 5. Section 6 concludes the paper and highlights future work.

Figure 1: Research framework.

2. Related Work
In recent years, several works have been done in the NLP field for Uzbek, including sentiment analysis datasets [10, 11], created by collecting and analyzing Google Play app reviews, with two types of data: a medium-size manually annotated dataset and a larger dataset automatically translated from English.
The authors of [12] obtained bilingual dictionaries for six Turkic languages and applied them to align word embeddings cross-lingually, backed by a bilingual dictionary induction evaluation task. They showed that aligned word embeddings obtained for a low-resource language can benefit from resource-rich, closely related languages. Another similar paper [13] investigated the effect of emoji-based features in opinion classification of Uzbek texts. A recent work [14] presented a semantic evaluation dataset for Uzbek with semantic similarity and relatedness scores for word pairs, together with its analysis. There is also a very recent growing trend in NLP that makes use of large neural models, which can be seen for Uzbek in a Transformer-based language model trained on a raw Uzbek corpus [15]. From a global outlook on the field of sentiment analysis, [16] used various sentiment analysis techniques, such as machine learning and deep learning, with the aim of taking into account the differences in opinions and thoughts expressed on popular social platforms such as Twitter, Reddit, Tumblr and Facebook.

3. Methodology
In this paper, we propose a machine learning and deep learning based sentiment analysis framework for the restaurant-domain dataset (Figure 1). The framework includes data collection using a web crawler, pre-processing (cleaning, stop-word removal, lexicon-free stemming), construction of the TF-IDF weight matrix, and ML and DL models for sentiment analysis.

Figure 2: Feedback sample

3.1. Data collection
We started by surveying the datasets available for crawling in the Uzbek language. The usual sources, such as Twitter or movie reviews, are not readily available for Uzbek. Therefore, we decided to collect restaurant reviews, since local people are particularly fond of giving feedback about restaurants. This also makes sense because Uzbek cuisine is among the most popular throughout the Commonwealth of Independent States (CIS) and Central Asia. In most Central Asian cities, for instance, it is easy to find busy restaurants specializing in Uzbek cuisine (BBC Travel: https://www.bbc.com/travel/article/20191117-is-uzbek-cuisine-actually-to-die-for). We crawled all local restaurants in Tashkent from Google Maps. First, we selected a list of more than 140 restaurant URLs with at least 3 reviews each and retrieved all the information shown in Figure 2. While crawling, we respected Google's anti-spam and anti-DDOS policies, as there are certain limitations on harvesting data. The source code is available in the repository.

3.2. Data pre-processing
The collection of texts with star ratings in the crawled dataset was noisy and required manual correction. Comments containing only emojis, names or other irrelevant content, such as username mentions, URLs or specific app names, were removed. Those written in languages other than Uzbek (mostly Russian and some English) were translated using the official Google Translate API. Although people in Uzbekistan use the official Latin alphabet, the use of the old Cyrillic alphabet is equally popular, especially among adults. The comments written in Cyrillic were converted to Latin using the Uzbek machine transliteration tool [17]. Then, we applied stop-word removal to eliminate low-information words from the comments so that the models focus on the important content. The technique is based on [8], which proposes an algorithm for the automatic detection of single-word stop words using TF-IDF (term frequency - inverse document frequency).
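As a rough illustration of this idea (and not the exact procedure of [8]), the following Python sketch ranks the vocabulary by IDF and takes the most ubiquitous terms as stop-word candidates; the candidate count and the toy reviews are placeholders, and the resulting list still needs to be validated against classification accuracy, as described above.

```python
# A minimal sketch, NOT the exact algorithm of [8]: terms with the lowest IDF
# occur in the largest share of reviews and therefore carry the least
# discriminative information, making them stop-word candidates.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def stopword_candidates(reviews, n_candidates=30):
    """Return the n_candidates terms with the lowest IDF (most ubiquitous)."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(reviews)
    terms = np.array(vectorizer.get_feature_names_out())
    order = np.argsort(vectorizer.idf_)  # ascending IDF = most common first
    return list(terms[order[:n_candidates]])

# Toy usage (hypothetical reviews; real candidates are checked manually
# and against classification accuracy before being removed):
reviews = [
    "bu joyda taomlar juda mazali va xizmat yaxshi",
    "bu yerda narxlar juda qimmat va xizmat yomon",
]
print(stopword_candidates(reviews, n_candidates=5))
```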
After that, each word is processed with the lexicon-free stemming algorithm [7] to reduce the vocabulary growth caused by prefixes and suffixes. The basic idea is a combinatorial search over eligible ending candidates. Table 1 shows an example of processed data that is ready for the TF-IDF vectorizer. We also selected a set of words to visualize their counts: Figure 3 shows that people tend to give more positive feedback than negative in the restaurant domain.

Table 1
The example of a chosen review before and after processing it.

Review: Birinchi Milliy taomlardan biri - keng assortimentli taomlar! Gastro-turistlar uchun juda jozibali joy - bu yerda barcha turdagi milliy taomlar mavjud. Yagona salbiy tomoni shundaki, bunday yirik muassasa uchun to'xtash joyi kichik. Narxlar nisbatan arzon! Turistlar uchun juda arzon!
After processing: Bir/ milliy/ taom/ keng/ assortiment/ taom/ gastro/ turist/ juda/ joziba/ joy/ tur/ milliy/ taom/ mavjud/ salbiy/ tomon/ yirik/ muassa/ to'xta/ joy/ kichik/ narx/ arzon/ turist/ arzon

Figure 3: The visualisation of some selected examples of Uzbek words taken from positive and negative reviews with their log counts.

4. Evaluation
The collected novel dataset has been split into training and testing subsets for evaluation with an 8:2 ratio. After the data cleaning process, the dataset is represented as follows, where $\vec{x}_i$ are the feature vectors and $y_i$ the annotated labels:

$(\vec{x}_i, y_i), \quad i = 1, 2, 3, \ldots, N$  (1)

$\vec{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im}), \quad i = 1, 2, 3, \ldots, N$  (2)

where $N$ is the number of reviews and $m$ is the length of the feature vector. We then calculate TF-IDF scores for each feature vector $\vec{x}_i$, which vectorises words by taking into account both the frequency of a word within a given review and its frequency across reviews. The final result over all $\vec{z}_i$ is a sparse matrix:

$\vec{z}_i = TF(\vec{x}_i) \times IDF(\vec{x}_i), \quad i = 1, 2, 3, \ldots, N$  (3)

4.1. Machine learning algorithms
The Logistic Regression model is

$h(\vec{z}) = 1 / (1 + \exp(-\vec{z}))$  (4)

$P(y \mid \vec{z}) = \begin{cases} h(\vec{z}), & \text{if } y = +1 \text{ (positive)} \\ 1 - h(\vec{z}), & \text{if } y = -1 \text{ (negative)} \end{cases}$

Logistic regression [18] is a classification algorithm based on exponential and log-linear functions. It works with discrete labels and maps any real-valued input into the interval between 0 and 1. For sentiment analysis, the hypothesis in (4) indicates whether a review is positive or negative.

The Support Vector Machine (SVM) model has the following response function:

$h(\vec{z}) = \mathrm{sign}(\vec{z})$  (5)

The SVM algorithm is known for fast and dependable classification of two-group problems. Classification is performed by finding a hyperplane that separates the two classes, positive and negative reviews.

We implemented the LR and SVM models using the Scikit-Learn [19] machine learning library in Python with default configuration parameters. For the LR models, we implemented a variant based on word n-grams (unigrams and bigrams) and one based on character n-grams (with n ranging from 1 to 4). We also tested a model combining the word and character n-gram features.

4.2. Deep Learning algorithms
Keras [20] is used on top of TensorFlow [21]. The fastText pre-trained word embeddings of size 300 [22] for the Uzbek language are applied. For the CNN model, we used a multi-channel CNN with 256 filters, three parallel channels with kernel sizes of 2, 3 and 5, and a dropout rate of 0.3; a minimal sketch of this architecture is shown below.
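The following minimal Keras sketch illustrates this multi-channel set-up. The names vocab_size, max_len and embedding_matrix (the pre-loaded fastText vectors) are placeholders, and details such as the exact placement of the dropout layer are simplifying assumptions; the full code is available in our repository.

```python
# A minimal sketch of the multi-channel CNN described in Section 4.2.
# vocab_size, max_len and embedding_matrix (fastText vectors of size 300)
# are assumed to be prepared beforehand; dropout placement is illustrative.
from tensorflow.keras import Model, initializers, layers, optimizers

def build_multichannel_cnn(vocab_size, max_len, embedding_matrix, emb_dim=300):
    inputs = layers.Input(shape=(max_len,))
    emb = layers.Embedding(
        vocab_size, emb_dim,
        embeddings_initializer=initializers.Constant(embedding_matrix),
        trainable=False)(inputs)
    channels = []
    for kernel_size in (2, 3, 5):  # three parallel channels, 256 filters each
        conv = layers.Conv1D(256, kernel_size, activation="relu")(emb)
        conv = layers.Dropout(0.3)(conv)
        channels.append(layers.GlobalMaxPooling1D()(conv))
    merged = layers.concatenate(channels)          # concatenate the max-poolings
    outputs = layers.Dense(1, activation="sigmoid")(merged)  # positive/negative
    model = Model(inputs, outputs)
    model.compile(
        optimizer=optimizers.Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999),
        loss="binary_crossentropy", metrics=["accuracy"])
    return model
```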
The output of the CNN's hidden layer is the concatenation of the max-pooling outputs of the three channels. For the RNN, we use a bidirectional network of 100 GRUs, where the output of the hidden layer is the concatenation of the average pooling and max-pooling of the hidden states. For the combination of deep learning models, we stacked the CNN on top of the GRU. In all three cases, the final output is obtained through a sigmoid activation function [23] applied to the previous layer. In all cases, the Adam optimization algorithm [24], an extension of stochastic gradient descent, was chosen for training, with standard parameters: learning rate α = 0.0001 and exponential decay rates β1 = 0.9 and β2 = 0.999. Binary cross-entropy was used as the loss function. Similar steps, with slightly different parameters, were used in a work that presents practical guidance on using CNNs for sentiment classification [25]. Inspired by their example, which clearly illustrates the steps of deep learning based sentiment classification using a CNN, the visualisation of our steps can be seen in Figure 4.

Figure 4: The illustration of steps taken in deep learning based sentiment classification using CNN, inspired by [25].

4.3. Evaluation metrics
Confusion matrices [26] are used to determine the gap between predicted and true values, as shown in Table 2. Precision, Recall and F1-score are used as evaluation metrics for model performance.

Table 2
Confusion matrix

Classes  | Predicted Positive  | Predicted Negative
Positive | True Positive (TP)  | False Negative (FN)
Negative | False Positive (FP) | True Negative (TN)

The calculation of Precision and Recall is shown below:

$\mathrm{Precision} = \dfrac{TP}{TP + FP}, \qquad \mathrm{Recall} = \dfrac{TP}{TP + FN}$  (6)

The F1-score, which takes into account both precision and recall, is calculated as follows:

$F_1 = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$  (7)

5. Results and Discussion
This section presents a detailed description of the results obtained by the evaluation process using both machine learning and deep learning techniques applied to the collected novel sentiment analysis dataset of restaurant reviews.

5.1. Experiment Results
The overall results of the above-mentioned evaluation can be seen in Table 3.

Table 3
Experiment results of sentiment analysis for all evaluation techniques, including models, their distinctive parameters, and the evaluation metrics Precision (Prec.), Recall (Rec.), F1-score (F1) and Accuracy (Acc.).

Logistic Regression based on word n-grams (Acc. 89%): Positive Prec. 88%, Rec. 98%, F1 93%; Negative Prec. 88%, Rec. 67%, F1 74%
Logistic Regression based on char. n-grams (Acc. 87%): Positive Prec. 87%, Rec. 51%, F1 92%; Negative Prec. 83%, Rec. 97%, F1 64%
Logistic Regression (word + char. n-grams) (Acc. 91%): Positive Prec. 95%, Rec. 95%, F1 92%; Negative Prec. 90%, Rec. 89%, F1 90%
SVM based on linear kernel (Acc. 88%): Positive Prec. 88%, Rec. 97%, F1 92%; Negative Prec. 84%, Rec. 71%, F1 80%
RNN without word embeddings (Acc. 88%): Positive Prec. 90%, Rec. 95%, F1 92%; Negative Prec. 78%, Rec. 64%, F1 70%
RNN with word embeddings (Acc. 88%): Positive Prec. 90%, Rec. 95%, F1 93%; Negative Prec. 80%, Rec. 65%, F1 72%
CNN (multichannel) (Acc. 89%): Positive Prec. 90%, Rec. 96%, F1 93%; Negative Prec. 83%, Rec. 64%, F1 72%

The Logistic Regression (LR) model based on word n-grams obtained a binary classification accuracy of 89% on the dataset, and the one based on character n-grams 87%, while the combination of word and character n-grams, with its better handling of misspelt words, improved the result to 91%, which is the winner of this comparison. The Support Vector Machine with a linear kernel showed 88% accuracy overall. The Recurrent Neural Network models without and with fastText embeddings show the same accuracy (88%).
The Convolutional Neural Network showed slightly lower performance (89.23%) than the best LR model. The reason is likely the lack of data, as neural-network models require large amounts of data for better performance.

5.2. Discussion and limitations
Nowadays, the amount of unstructured data in the restaurant domain keeps growing, which calls for high-accuracy sentiment analysis. This is especially the case for low-resource languages. Based on review data from Google Maps (Tashkent locations) obtained by web crawling, this paper has evaluated several ML and DL methods. It was observed that the LR algorithm outperforms the others, which makes sense as our dataset is relatively small. The research also has theoretical and practical implications: we believe that gathering massive user reviews in this domain can help consumers make decisions in the best manner, for example at lower cost and faster speed. However, we also want to point out some limitations of this research. The dataset we gathered has an unbalanced number of positive and negative reviews, which can cause deviations in the results. Moreover, we used the review rating in the annotation process; in reality, consumers may sometimes give a high rating score while the polarity of the text is negative, and vice versa.

6. Conclusion
In this paper, we have presented a novel dataset in the restaurant domain for the Uzbek language, with 8210 reviews annotated with positive or negative labels, which was crawled from Google Maps using the URLs of all locations in the capital city Tashkent and labelled according to the corresponding star scores. We then applied full pre-processing steps to the dataset, which contributed to increasing the accuracy of our baseline models. Further analysis of the collected dataset was presented with evaluations using both machine learning and deep learning techniques. The best accuracy result (91%) on the dataset was obtained using a logistic regression model with word and character n-grams. In the foreseeable future, we are planning to extend the work by collecting more data, which can make the analysis of restaurant reviews effective at a practical level. Work is also underway to remove the evaluation bias of the training experiments by using cross-validation methods in data splitting.

Acknowledgments
This work has partially received funding from ERDF/MICINN-AEI (SCANNER-UDC, PID2020-113230RB-C21), and from Centro de Investigación de Galicia "CITIC", funded by Xunta de Galicia and the European Union (ERDF - Galicia 2014-2020 Program), by grant ED431G 2019/01. Elmurod Kuriyozov was funded for his PhD by the El-Yurt-Umidi Foundation under the Cabinet of Ministers of the Republic of Uzbekistan.

References
[1] R. Anaya-Sánchez, R. Aguilar-Illescas, S. Molinillo, F. Liébana-Cabanillas, Improving travellers' trust in restaurant review sites, Tourism Review 74 (2019) 830–840. doi:10.1108/TR-02-2019-0065.
[2] E. Marine-Roig, S. A. Clave, A method for analysing large-scale UGC data for tourism: Application to the case of Catalonia, in: Information and Communication Technologies in Tourism 2015, Springer International Publishing, Cham, 2015, pp. 3–17.
[3] J. Barnes, R. Klinger, S. Schulte im Walde, Assessing state-of-the-art sentiment models on state-of-the-art sentiment datasets, arXiv preprint arXiv:1709.04219 (2017).
[4] L. Zhang, S. Wang, B. Liu, Deep learning for sentiment analysis: A survey, Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 8 (2018). URL: https://doi.org/10.1002/widm.1253. doi:10.1002/widm.1253.
[5] M. Artetxe, I. Aldabe, R. Agerri, O. Perez-de Viñaspre, A. Soroa, Does corpus quality really matter for low-resource languages?, 2022. URL: https://arxiv.org/abs/2203.08111. doi:10.48550/ARXIV.2203.08111.
[6] G. Matlatipov, Z. Vetulani, Representation of Uzbek morphology in Prolog, in: Aspects of Natural Language Processing, Lecture Notes in Computer Science, volume 5070, Springer, 2009.
[7] S. Matlatipov, U. Tukeyev, M. Aripov, Towards the Uzbek language endings as a language resource, in: M. Hernes, K. Wojtkiewicz, E. Szczerbicki (Eds.), Advances in Computational Collective Intelligence, Springer International Publishing, Cham, 2020, pp. 729–740.
[8] K. Madatov, S. Bekchanov, J. Vičič, Automatic detection of stop words for texts in the Uzbek language, 2022.
[9] U. Tukeyev, A. Turganbayeva, B. Abduali, D. Rakhimova, D. Amirova, A. Karibayeva, Lexicon-free stemming for Kazakh language information retrieval, in: 2018 IEEE 12th International Conference on Application of Information and Communication Technologies (AICT), 2018, pp. 1–4. doi:10.1109/ICAICT.2018.8747021.
[10] I. Rabbimov, S. Kobilov, I. Mporas, Opinion classification via word and emoji embedding models with LSTM, in: International Conference on Speech and Computer, Springer, 2021, pp. 589–601.
[11] E. Kuriyozov, S. Matlatipov, Building a new sentiment analysis dataset for Uzbek language and creating baseline models, in: Multidisciplinary Digital Publishing Institute Proceedings, volume 21, 2019, p. 37.
[12] E. Kuriyozov, Y. Doval, C. Gómez-Rodríguez, Cross-lingual word embeddings for Turkic languages, in: Proceedings of the 12th Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 4054–4062. URL: https://aclanthology.org/2020.lrec-1.499.
[13] I. Rabbimov, I. Mporas, V. Simaki, S. Kobilov, Investigating the effect of emoji in opinion classification of Uzbek movie review comments, in: A. Karpov, R. Potapova (Eds.), Speech and Computer, Springer International Publishing, Cham, 2020, pp. 435–445.
[14] U. Salaev, E. Kuriyozov, C. Gómez-Rodríguez, SimRelUz: Similarity and relatedness scores as a semantic evaluation dataset for Uzbek language, arXiv preprint arXiv:2205.06072 (2022).
[15] B. Mansurov, A. Mansurov, UzBERT: pretraining a BERT model for Uzbek, arXiv preprint arXiv:2108.09814 (2021).
[16] Y. Chandra, A. Jana, Sentiment analysis using machine learning and deep learning, in: 2020 7th International Conference on Computing for Sustainable Global Development (INDIACom), 2020, pp. 1–4. doi:10.23919/INDIACom49435.2020.9083703.
[17] U. Salaev, E. Kuriyozov, C. Gómez-Rodríguez, A machine transliteration tool between Uzbek alphabets, arXiv preprint arXiv:2205.09578 (2022).
[18] E. Christodoulou, J. Ma, G. S. Collins, E. W. Steyerberg, J. Y. Verbakel, B. Van Calster, A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models, Journal of Clinical Epidemiology 110 (2019) 12–22. URL: https://www.sciencedirect.com/science/article/pii/S0895435618310813. doi:10.1016/j.jclinepi.2019.02.004.
[19] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[20] F. Chollet, et al., Keras, https://github.com/fchollet/keras, 2015.
[21] M. Abadi, et al., TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL: https://www.tensorflow.org/, software available from tensorflow.org.
[22] E. Grave, P. Bojanowski, P. Gupta, A. Joulin, T. Mikolov, Learning word vectors for 157 languages, in: Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[23] A. C. Marreiros, J. Daunizeau, S. J. Kiebel, K. J. Friston, Population dynamics: variance and the sigmoid activation function, NeuroImage 42 (2008) 147–157.
[24] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[25] Y. Zhang, B. Wallace, A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification, arXiv preprint arXiv:1510.03820 (2015).
[26] M. Sokolova, G. Lapalme, A systematic analysis of performance measures for classification tasks, Information Processing & Management 45 (2009) 427–437. URL: https://www.sciencedirect.com/science/article/pii/S0306457309000259. doi:10.1016/j.ipm.2009.03.002.