Sentiment Analysis in the Feedback of Peer Evaluation Activities

René Elizalde-Solano1, Ma. Carmen Cabrera-Loayza1,2, Elizabeth Cadme1,2 and Nelson Piedra1,2

1 Universidad Técnica Particular de Loja, San Cayetano 1101608, Loja, Ecuador
2 Universidad Politécnica de Madrid, Boadilla del Monte 28660, Madrid, Spain

Abstract
Sentiment analysis is a technique used with increasing frequency in the educational field. In the present work, the main application focus is the analysis and classification of the feedback comments issued by students in peer evaluation activities. Determining the polarity of these comments can help the teacher identify characteristics and patterns in the criteria expressed by students and thereby enrich the teaching-learning process. This work aims to determine the sentiment polarity of the feedback comments from the peer evaluation activities planned as challenges within the courses offered by the Open Campus initiative. To do this, the classification model is trained and tested in three experimental scenarios, using the TASS corpus of tweets written in Spanish and a corpus of comments extracted from the learning platform and manually classified by experts. Among the main results, it is observed that many students give feedback that is useful, be it positive or negative. However, a significant percentage of comments are perceived as unjustified or incomprehensible, which is reflected in the number of comments classified as neutral and without polarity.

Keywords
Sentiment Analysis, Peer Assessments, Open Campus, Feedback, Open Online Courses, Open Education

CISETC 2021: International Congress on Educational and Technology in Sciences, November 16-18, 2021, Chiclayo, Peru
EMAIL: rrelizalde@utpl.edu.ec (R. Elizalde-Solano); mccabrerax@utpl.edu.ec (MC. Cabrera-Loayza); iecadme@utpl.edu.ec (E. Cadme); nopiedra@utpl.edu.ec (N. Piedra)
ORCID: 0000-0002-9534-8450 (R. Elizalde-Solano); 0000-0002-7664-5206 (MC. Cabrera-Loayza); 0000-0002-5554-0560 (E. Cadme); 0000-0003-1067-8707 (N. Piedra)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Currently, in the design and planning of online courses, a number of evaluation and training activities are defined. The intention is that students acquire, beyond professional competencies, some soft skills within the teaching-learning process. Within the Open Campus initiative, collaborative work among students is encouraged in order to create learning communities guided by a teacher and enriched by the participants. One of the main evaluation activities proposed in each course offered is called a "challenge". Challenges are peer review activities that allow students to review, evaluate, and provide feedback on the work of their peers. This guarantees that the student is the main actor in the assessment process, while also acquiring skills such as collaborative work, co-construction of knowledge, reflection, and critical assessment [1]. The feedback consists of the students' general comments about the evaluation they have made of the assigned work. Since these comments are not mandatory, they are usually not taken into account, and only the grades given are considered.

The main objective of this work is to determine the sentiment polarity of the feedback comments from the peer evaluation activities posed as challenges within the courses offered by the Open Campus initiative. Experimentation is approached in three scenarios for training and testing the classification model. In the first scenario, the TASS 2019 corpus is used [2]. In the second scenario, a manually classified corpus of comments from the Open Campus platform is used. In the third scenario, the model is trained with a mixture of the data mentioned above. Finally, it should be mentioned that the comments are in Spanish and that the Linear Support Vector Classification algorithm is applied in each scenario [3].
2. Sentiment Analysis

2.1. Feedback in peer reviews

Some authors have argued that peer evaluation is a particularly useful training practice because students need to develop their own evaluation skills to better recognize quality, understand evaluation criteria, and self-evaluate their own work [4]. Students can benefit both from receiving feedback from their peers and from constructing feedback on the work of others; indeed, some research has determined that giving feedback improves writing performance as much as receiving it does [5]. In [6], peer assessment is defined as a teaching-learning strategy that allows students to provide feedback to their peers.

Despite the benefits of peer review, extracting meaningful information from it for decision-making is always an arduous process for the teacher [7]. It is therefore important to analyze the feedback comments given by students with the help of computational techniques, such as machine learning, in order to determine the most important aspects that learners consider when evaluating the work of their peers. In addition, through the comments, the students' perception and understanding of the proposed activity can be identified [8], along with patterns in the relationship between students' opinions, the feedback they give to other students, and how they react to the feedback they receive.

2.1.1. Sentiment analysis in the educational context

Sentiment analysis is a task that focuses on detecting the polarity and recognizing the emotion that an individual may feel about a topic or event. Its main goal is to find the opinions of users, identify the feelings they express, and then classify their polarity into positive, negative, and neutral categories. Sentiment analysis systems use Natural Language Processing techniques as well as Machine Learning to discover, retrieve, and extract information and opinions from large amounts of textual information [9]. Sentiment analysis and opinion mining are similar, with a slight difference: the former refers to finding sentiment words and phrases that show emotions, while the latter refers to extracting and analyzing people's opinions about a given entity [8].

Sentiment analysis is a field of research that has grown rapidly in recent years in the context of student comments in learning platform environments [10]. A search for the term "sentiment analysis" in the Scopus database returns about 19,000 papers at a general level.
However, in the educational context, there are around 80 papers, and few of them refer to the analysis of students' comments obtained in peer evaluation activities. In [11], a study on sentiment analysis in the educational context is carried out that focuses on detecting the approaches and digital educational resources used in sentiment analysis, as well as identifying the main benefits of using this analysis in the education domain. The results show that Naïve Bayes is the most used technique and that MOOC forums and social networks are the digital education resources most used to collect the data necessary for the sentiment analysis process. On the other hand, in [7] a series of experiments is carried out with a manually labeled dataset to test different combinations of N-grams with term frequency-inverse document frequency (TF-IDF) and classification algorithms. As a result, the Support Vector Machine classifier combining 1-grams + 2-grams + TF-IDF is considered the best model in terms of Precision, Recall, and F-Measure. In the study presented in [12], it was determined that students who considered the feedback useful tended to be more receptive when acknowledging their mistakes, while students who found the feedback less useful tended to be more defensive, expressing confusion about the comments and disagreeing with the statements given. Finally, the study carried out in [13] focuses on detecting inconsistencies in peer evaluations between the numerical score and the textual feedback. Experiments carried out with 4 student groups and 2 activity types determined that peer evaluation generally yields reliable results, which makes this a valuable approach for ensuring the correct functioning of the peer review process.

3. Methodology

The process carried out to analyze the polarity of the comments issued in the peer evaluation activities of the courses on the Open Campus platform is detailed below. First, an ETL process is performed to extract the comment data set. Then, the information is cleaned so that the classification algorithm can later be applied, and performance is evaluated using the accuracy metric, the F1-score, and the confusion matrix.

The next task is to find a Spanish-language corpus with which to train the classification models for their subsequent application to the set of feedback comments. This task presented difficulties, since not many corpora are available in Spanish. For the present work, the corpus generated in the Workshop on Semantic Analysis at SEPLN (TASS) [14] is used, which compiles a set of tweets written in Spanish. In addition, a corpus is created from the feedback comments of the peer evaluations on the Open Campus platform, manually labeled by experts as positive, negative, neutral, or none. Finally, the classification models are trained in the three scenarios detailed in the next section.

3.1. Training and testing phase

Next, the training and test phase is developed in the three proposed scenarios:

3.1.1. Scenario 1

With the TASS corpus, the necessary data are extracted to apply the classification algorithm to the comments of our context. Several Python libraries are used, such as Pandas [15], Scikit-learn [16], and NLTK [17]. From Scikit-learn, the supervised Linear Support Vector Classification algorithm is used, and the NLTK library is used to build a function that tokenizes the comments, as sketched below.
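As a rough illustration, the following is a minimal sketch of such a tokenization function, combining NLTK's word_tokenize, Spanish stopword removal, and the SnowballStemmer used in the vectorization step described next. The function name `tokenize` is ours, and folding the stopword filter into the tokenizer (rather than into the vectorizer) is a minor simplification; the NLTK "punkt" and "stopwords" resources are assumed to be downloaded.

    # Minimal sketch: Spanish tokenization with stopword removal and stemming.
    # Assumes nltk.download("punkt") and nltk.download("stopwords") have been run.
    from nltk import word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import SnowballStemmer

    spanish_stopwords = set(stopwords.words("spanish"))
    stemmer = SnowballStemmer("spanish")

    def tokenize(comment):
        """Split a Spanish comment into word tokens, drop stopwords,
        and reduce each remaining token to its root."""
        tokens = word_tokenize(comment.lower(), language="spanish")
        return [stemmer.stem(t) for t in tokens
                if t.isalpha() and t not in spanish_stopwords]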
Model training. The set of already classified comments used to train the selected model totaled 7608, each categorized as positive, negative, neutral, or none (see Figure 1a). Before the Linear Support Vector Classification algorithm can be applied, the CountVectorizer function is used, which converts each comment into a frequency vector over the words that compose it. Since the information is in Spanish, the vectorization process is refined by ignoring Spanish stopwords, using the SnowballStemmer algorithm to group words by their root, and using the word_tokenize method of the NLTK library to split each comment into its tokens. The result of CountVectorizer is a data frame of 5706 rows and 9754 columns. To generate the classification model, LinearSVC from the Scikit-learn library is used. It is important to highlight that, to train the algorithm, the comments are provided as numerical vectors together with the label of each one.

Model test. To test the model, the information is first separated into training and test sets; the train_test_split function of the Scikit-learn library provides 5706 comments for training and 1902 for testing. As a result of the test phase, an accuracy of 71.66% is obtained through Scikit-learn's accuracy_score metric and 68.45% through the f1_score metric. The confusion matrix after applying the algorithm is detailed in Figure 1b.

3.1.2. Scenario 2

Scenario 2 seeks to create a classified data set from the context of the peer reviews of the Open Campus platform.

Model training. From the set of 101,559 comments extracted, a data set of 2992 comments is generated at random. This data set was manually classified by experts, who assigned a polarity according to their criteria; Figure 1c shows the result of the manual classification. This new data set is used to train the Linear Support Vector Classification algorithm. Before doing so, as in scenario 1, the data set is divided into 2244 records for training and 748 for testing. The CountVectorizer function is then used to vectorize the data, obtaining a data frame of 2244 rows and 2171 columns. Finally, the LinearSVC classification algorithm is applied.

Model test. Once trained, the model is evaluated with the 748 test records. Applying the algorithm yields an accuracy of 73.66% through Scikit-learn's accuracy_score metric and 56.28% through the f1_score metric. The confusion matrix is detailed in Figure 1d.

3.1.3. Scenario 3

For scenario 3, the research team decided to pool the training data, combining the classified information from TASS with the comments manually classified by experts.

Model training. For the training phase, a data set of 10,600 records classified by polarity is consolidated, as can be seen in Figure 1e. As in scenarios 1 and 2, the data set is split into 7950 records for training and 2650 for testing. The information is vectorized through CountVectorizer, obtaining a data frame of 7950 rows and 16,316 columns, and the Linear Support Vector Classification algorithm is applied.

Model test. Once the model has been trained, it is evaluated with the 2650 test records. An accuracy of 70.67% is obtained through Scikit-learn's accuracy_score metric and 65.76% through the f1_score metric. In addition, the confusion matrix is obtained (see Figure 1f).
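The following is a minimal sketch of the training and test phase shared by the three scenarios, reusing the tokenizer sketched above. The variables `comments` and `labels` are placeholders for a scenario's texts and their polarity tags; the 75/25 split matches the record counts reported for all three scenarios, while `random_state=0` and the macro-averaged F1 are our assumptions, since the paper does not state them.

    # Minimal sketch of the shared training and test phase (Scikit-learn).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

    # comments, labels: placeholder corpus texts and P/N/NEU/NONE tags.
    X_train, X_test, y_train, y_test = train_test_split(
        comments, labels, test_size=0.25, random_state=0)

    # Turn each comment into a word-frequency vector with the tokenizer above.
    vectorizer = CountVectorizer(tokenizer=tokenize)
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Train the Linear Support Vector Classification model on labeled vectors.
    model = LinearSVC()
    model.fit(X_train_vec, y_train)

    # Evaluate with the metrics used in the paper (macro F1 is an assumption).
    y_pred = model.predict(X_test_vec)
    print("accuracy:", accuracy_score(y_test, y_pred))
    print("f1:", f1_score(y_test, y_pred, average="macro"))
    print(confusion_matrix(y_test, y_pred))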
Figure 1: Classified comments and confusion matrix for each scenario: (a) polarity of the TASS corpus comments, (b) confusion matrix for scenario 1, (c) polarity of the manually created corpus, (d) confusion matrix for scenario 2, (e) combined data set of the TASS corpus and the comments manually classified from the Open Campus platform, (f) confusion matrix for scenario 3.

3.2. Classification phase

3.2.1. Classification using the scenario 1 model

To determine the polarity of the 101,559 comments from the peer reviews of the Open Campus platform, the trained model is applied to each of them. The results are shown in Table 1 and Figure 2a.

Table 1
Comment classification - Scenario 1

Feedback polarity   Positive (P)   Negative (N)   Neutral (NEU)   No polarity (NONE)
Total comments      55975          28537          4469            12578

3.2.2. Classification using the scenario 2 model

The 98,567 comments from the peer reviews of the Open Campus platform are classified by applying the trained model to each of them. The results are shown in Table 2 and Figure 2b.

Table 2
Comment classification - Scenario 2

Feedback polarity   Positive (P)   Negative (N)   Neutral (NEU)   No polarity (NONE)
Total comments      61622          18443          8035            10459

Figure 2: Comments classified with the model trained in each scenario: (a) comments classified with scenario 1, (b) comments classified with scenario 2, (c) comments classified with scenario 3.

3.2.3. Classification using the scenario 3 model

Then, the 98,559 comments from the peer reviews of the Open Campus platform are classified by applying the trained model to each of them. The results are shown in Table 3 and Figure 2c.

Table 3
Comment classification - Scenario 3

Feedback polarity   Positive (P)   Negative (N)   Neutral (NEU)   No polarity (NONE)
Total comments      64235          19247          7089            7901
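To make the classification phase concrete, a minimal sketch of applying a trained scenario model to the full set of platform comments and tallying the predicted polarities follows; `platform_comments` is a placeholder for the comments extracted from Open Campus, and `vectorizer` and `model` come from the training sketch above.

    # Minimal sketch: classify all platform comments with a trained model.
    from collections import Counter

    platform_vec = vectorizer.transform(platform_comments)
    predictions = model.predict(platform_vec)

    # Tally of comments per polarity class, as reported in Tables 1-3.
    print(Counter(predictions))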
4. Results and discussion

In this research, three scenarios were created to analyze the set of comments expressed by the participants of the Open Campus platform courses. Table 4 shows the polarity obtained from the feedback comments classified with the model trained in each scenario.

Table 4
Classification of comments by scenario

                      Positive        Negative        Neutral        No polarity
Scenario     NcT      NC      %       NC      %       NC     %       NC      %
Scenario 1   101559   55975   55.11   28537   28.08   4469   4.4     12578   12.3
Scenario 2   98559    61622   62.52   18443   18.71   8035   8.15    10459   10.61
Scenario 3   98559    64235   65.17   19247   19.16   7089   7.19    7901    8.01

As can be seen in Table 4, a similar total number of comments (NcT) is classified in each scenario. Based on this data set and the previously trained classification models, the trend across the polarity types is equivalent in the three scenarios, with positive comments from the participants being the most frequent and negative comments the second most frequent. However, comments classified as having no polarity occur more often than comments classified as neutral; this is because many comments do not contribute to the feedback or cannot be framed in the context. Furthermore, comparing scenarios 1 and 2, there is a considerable difference in the polarity classification percentages. This is because scenario 1 uses only comments from the TASS corpus, whereas scenario 2 uses the platform's own manually classified comments. From this, it is determined that the closer the training data is to the context, the more reliable the classification within the proposed polarity types.

With respect to scenario 3, an improvement in the classification of positives and negatives is observed. This is attributed to the fact that there is a larger amount of training data than in the previous scenarios and that both the TASS data set and the domain's own data set are involved in training. Even though the domain data set is the smaller part in this scenario, the classification is more accurate. According to Figure 3, taking as reference the variation in the number of positive comments generated by each model, scenario 3 yields the highest number of positive comments. This scenario has two advantages: a larger amount of training data and information related to the context of the Open Campus platform comments.

Figure 3: Results of the classification and polarity of comments considering the three proposed scenarios.

5. Conclusions

In the present work, it is determined that the information within the feedback comments of peer evaluation activities has great potential for both teachers and participants. For teachers, this information can give a vision of how students perceive the activity and the contributions of their peers from a qualitative point of view. From the students' side, evaluating the activities of their classmates allows them to better understand the subject of study and to develop soft skills such as critical thinking, co-evaluation, and collaborative work.

Furthermore, analyzing the results obtained, it is identified that many students give feedback that is useful, whether positive or negative. However, a significant percentage of the feedback comments are perceived as unjustified or incomprehensible, which is observed in the number of comments classified as neutral and without polarity. Finally, it is found that the more context data is used in the training phase, the better the classification accuracy. In addition, this work has shown the lack of Spanish-language corpora related to the educational field; this research is thus a contribution to future work that requires a corpus of feedback comments in Spanish.

6. Acknowledgements

This research is supported by the Knowledge-Based Systems Research Group of UTPL and the Open Campus initiative, Loja, Ecuador.

7. References

[1] Osheroff, N., Cutrer, W. B., Pettepher, C. C., Carnahan, R. H., Bird, E. C. Using Small Case-Based Learning Groups as a Setting for Teaching Medical Students How to Provide and Receive Peer Feedback. Medical Science Educator, 27(4), 759-765 (2017). doi:10.1007/s40670-017-0461-x
[2] Martínez-Cámara, E., García-Cumbreras, M. A., Villena-Román, J., García-Morera, J. TASS 2015 - The evolution of the Spanish opinion mining systems. Procesamiento del Lenguaje Natural, 56 (2016).
[3] Esparza, G. G., de-Luna, A., Zezzatti, A. O., Hernandez, A., Ponce, J., Álvarez, M., ... & de Jesus Nava, J. A sentiment analysis model to analyze students' reviews of teacher performance using support vector machines. In: International Symposium on Distributed Computing and Artificial Intelligence, pp. 157-164. Springer, Cham (2017). doi:10.1007/978-3-319-62410-5_19
[4] Sadler, D. R. Beyond feedback: Developing student capability in complex appraisal. Assessment & Evaluation in Higher Education, 35(5), 535-550 (2010). doi:10.1080/02602930903541015
[5] Lundstrom, K., Baker Smemoe, W. To give is better than to receive: The benefits of peer review to the reviewer's own writing. Journal of Second Language Writing, 18, 30-43 (2009). doi:10.1016/j.jslw.2008.06.002
[6] Shang, H. An exploration of asynchronous and synchronous feedback modes in EFL writing. Journal of Computing in Higher Education, 29(3), 496-513 (2017).
[7] Ortega, M. P., Mendoza, L. B., Hormaza, J. M., Soto, S. V. Accuracy Measures of Sentiment Analysis Algorithms for Spanish Corpus Generated in Peer Assessment. In: Proceedings of the 6th International Conference on Engineering & MIS 2020, pp. 1-7 (2020). doi:10.1145/3410352.3410838
[8] Misiejuk, K., Wasson, B., Egelandsdal, K. Using learning analytics to understand student perceptions of peer feedback. Computers in Human Behavior, 117, 106658 (2021). doi:10.1016/j.chb.2020.106658
[9] Cambria, E., Schuller, B., Xia, Y., Havasi, C. New Avenues in Opinion Mining and Sentiment Analysis. IEEE Intelligent Systems (2013). doi:10.1109/MIS.2013.30
[10] Kastrati, Z., Dalipi, F., Imran, A. S., Pireva Nuci, K., Wani, M. A. Sentiment Analysis of Students' Feedback with NLP and Deep Learning: A Systematic Mapping Study. Applied Sciences, 11(9), 3986 (2021). doi:10.3390/app11093986
[11] Mite-Baidal, K., Delgado-Vera, C., Solís-Avilés, E., Espinoza, A. H., Ortiz-Zambrano, J., Varela-Tapia, E. Sentiment Analysis in Education Domain: A Systematic Literature Review. In: Valencia-García, R., Alcaraz-Mármol, G., Del Cioppo-Morstadt, J., Vera-Lucio, N., Bucaram-Leverone, M. (Eds.), International Conference on Technologies and Innovation, pp. 285-297. Springer International Publishing, Berlin/Heidelberg, Germany (2018). doi:10.1007/978-3-030-00940-3_21
[12] Zong, Z., Schunn, C. D., Wang, Y. What aspects of online peer feedback robustly predict growth in students' task performance? Computers in Human Behavior, 106924 (2021). doi:10.1016/j.chb.2021.106924
[13] Rico-Juan, J. R., Gallego, A. J., Calvo-Zaragoza, J. Automatic detection of inconsistencies between numerical scores and textual feedback in peer-assessment processes with machine learning. Computers & Education, 140 (2019). doi:10.1016/j.compedu.2019.103609
[14] TASS: Workshop on Semantic Analysis at SEPLN. http://tass.sepln.org/
[15] Pandas: https://pandas.pydata.org/
[16] Scikit-learn: https://scikit-learn.org/
[17] NLTK: https://www.nltk.org/