<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Clustering Students' Short Text Reflections: A Software Engineering Course Case Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohsen Dorodchi</string-name>
          <email>Mohsen.Dorodchi@uncc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandria Benedict</string-name>
          <email>abenedi4@uncc.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Quinn</string-name>
          <email>aquinn16@uncc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandra Wiktor</string-name>
          <email>swiktor@uncc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammadali Fallahian</string-name>
          <email>mfallahi@uncc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erfan Al-Hossami</string-name>
          <email>ealhossa@uncc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aileen Benedict</string-name>
          <email>abenedi3@uncc.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of North Carolina at Charlotte</institution>
          ,
          <addr-line>Charlotte, NC 28223</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of North Carolina at Charlotte</institution>
          ,
          <addr-line>Charlotte, NC 28223</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Student reflections can provide instructors with beneficial knowledge regarding students' progress in the course, what challenges they are facing, and how the instructor can more effectively address the students' needs. Reading every student reflection, however, can be a time-consuming task that may affect the instructor's ability to efficiently address student needs in a timely manner. In this research, we explore the use of clustering and sorting of student reflections to shorten reading time while maintaining a comprehensive understanding of the reflection content. We obtain student reflections from a software engineering course. Next, we generate transformer-based sentence embeddings and then cluster the reflections using K-means. Lastly, we sort the reflections based on the distance of each reflection from its cluster center. We conduct a small-scale user study with the course's teaching assistants and provide promising preliminary results showing a significant increase in reading-time efficiency without sacrificing understanding.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Student Reflections</kwd>
        <kwd>Clustering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Reflections are an effective way for instructors to detect
what their students may be struggling with throughout their
courses, gain a perspective on students' impressions of course
content, and track their overall progress [
        <xref ref-type="bibr" rid="ref11">9</xref>
        ]. However, in
order to utilize these benefits to the fullest, instructors would
need to manually read through each individual reflection.
Manually analyzing reflections can be overwhelming for an
instructor, especially in large classroom settings, where timely
feedback is needed to address students' possible concerns.
Machine learning and knowledge discovery-based methods
have been used to assist educators in understanding and
helping students [
        <xref ref-type="bibr" rid="ref16 ref22 ref3">14, 1, 20</xref>
        ]. Unsupervised methods in
natural language processing (NLP), such as topic modeling, have
been used to automatically extract topics from student
reflective journals [
        <xref ref-type="bibr" rid="ref7">5</xref>
        ]. However, they fall short when it comes
to short text, typically around a sentence in length, such
as tweets. Recent research has utilized K-means
clustering along with transformer-based sentence embeddings to
automatically extract topics from tweets [
        <xref ref-type="bibr" rid="ref14 ref4">12, 2</xref>
        ]. K-means
clustering is often supplemented with a representation of the
text. Representations can include statistically learnt
representations such as term frequency-inverse document frequency
(TF-IDF) [
        <xref ref-type="bibr" rid="ref15">13</xref>
        ], neural-learnt representations also known as word
embeddings (e.g. Word2Vec [
        <xref ref-type="bibr" rid="ref21">19</xref>
        ], GloVe [
        <xref ref-type="bibr" rid="ref24">22</xref>
        ]), and, more
recently, representations computed from large pretrained
transformer deep learning models.
      </p>
      <p>
        Transformers are deep learning models following the
architecture proposed by Vaswani et al. [<xref ref-type="bibr" rid="ref29">27</xref>]. These models often
undergo unsupervised pretraining on a massive text
corpus to create an initial version of the network that is later fine-tuned
for more specific tasks, in a process called transfer learning.
Pretrained transformers such as BERT [
        <xref ref-type="bibr" rid="ref8">6</xref>
        ], RoBERTa [
        <xref ref-type="bibr" rid="ref18">16</xref>
        ],
and GPT-3 [
        <xref ref-type="bibr" rid="ref6">4</xref>
        ], have achieved state-of-the-art results in many
natural language processing tasks. Some of these tasks include
detecting positive and uplifting discussions on social media
(e.g. [
        <xref ref-type="bibr" rid="ref19">17</xref>
        ]), determining answers to questions given a passage
of text (e.g. [<xref ref-type="bibr" rid="ref30">28</xref>]), summarizing text (e.g. [
        <xref ref-type="bibr" rid="ref17">15</xref>
        ]), and
estimating semantic similarity between sentences. For this reason,
we select a transformer-based language model to create a
semantic representation of student responses.
      </p>
      <p>In this research, we implement an approach using K-means
clustering from the scikit-learn library and utilize
transformer-based sentence embeddings. We evaluate our approach in a
preliminary user study, observing the time taken for
teaching assistants to read and analyze student reflections.</p>
    </sec>
    <sec id="sec-2">
      <title>2. DATASET</title>
      <p>
        Course. The data used in our research was collected from
an undergraduate software engineering course based on the
active learning course model proposed in [
        <xref ref-type="bibr" rid="ref1 ref9">7</xref>
        ]. A total of 108
students were enrolled in the course. Modules
are organized based on the concepts being taught and
typically span approximately one week. The course
contained 11 modules in total, with the topics listed in
Table 1. Following the active learning course model presented
in Dorodchi et al. [
        <xref ref-type="bibr" rid="ref1 ref12 ref9">10, 7</xref>
        ], each module is typically divided
into multiple scaffolds: prep-work to complete before class,
including reading assignments and videos to watch; in-class
activities; post-lecture activities, including assignments and
labs; and a reflection at the end of the module. Labs are
more challenging assignments provided to students which
require hands-on coding. These lab activities are typically
divided into multiple parts. There are a total of 4 labs in
this course, with the first lab beginning in Module 2 and the
last lab being introduced in Module 8.
      </p>
      <p>Data Collection. A survey questionnaire was provided to
students within Canvas, the University's Learning
Management System (LMS), at the end of each module to allow
students to reflect on their learning and challenges. We refer to
student responses to this questionnaire as student reflections
throughout this work. The questions asked of students were:
1. On a scale of 1 to 5, with 5 being Very Active and 1
being Not Active, how engaged would you rate your
group this week?
2. What was your biggest challenge this past week? This
can include in-class activities, assignments, prep work,
studying, time management, motivation, and so on.
3. How can you address the challenge you mentioned above?
What can you do to overcome this challenge for next
time?
For the purpose of this research, we focused solely on the
students' responses to question 2, as this question was
free-response and would provide unique responses for the
clustering process.</p>
      <p>Dataset Statistics. We used two different module reflections
from the software engineering course throughout this study:
Module 7 reflections and Module 8 reflections. Table 2
showcases descriptive statistics of our collected student
reflection response corpora. The selected module reflections were
comparable in size. After preprocessing, the response rates were
94 of 108 students (87.0%) for Module 7 and 89 of 108 students
(82.4%) for Module 8. Moreover, the total word counts were
1,866 and 1,390 for the Module 7 and Module 8 reflections,
respectively. We also observe that most student reflections
contained one to two sentences in both module reflections;
indeed, most reflections in our corpus were
around a sentence in length.</p>
    </sec>
    <sec id="sec-3">
      <title>3. APPROACH</title>
      <p>Our overall approach is illustrated in Figure 1. First, we
collect data from an undergraduate course with 108 students,
as described in more detail in Section 2. Then, we
preprocess the data using natural language
processing (Section 3.1). Next, we generate sentence
embeddings (Section 3.2), cluster those embeddings (Section 3.3),
and sort the reflections based on clusters for TAs to view
(Section 3.4).</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Preprocessing</title>
      <p>
        Before we generate sentence embeddings from our
reflections dataset, we first preprocess the data by removing any
blank, or null, student responses and removing any
non-breaking spaces which appear in the text. Next, the
student responses are compiled and provided to the model
for generating sentence embeddings.
      </p>
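      <p>A minimal sketch of this preprocessing step follows; the CSV path and column name are hypothetical, since the exact format of the LMS export is not specified here.</p>
      <preformat>
# Preprocessing sketch (hypothetical file and column names).
import pandas as pd

def load_responses(path: str) -> list:
    df = pd.read_csv(path)
    # Drop blank or null student responses to question 2.
    responses = df["q2_biggest_challenge"].dropna()
    responses = responses[responses.str.strip() != ""]
    # Remove non-breaking spaces appearing in the text.
    responses = responses.str.replace("\xa0", " ", regex=False)
    return responses.tolist()

responses = load_responses("module7_reflections.csv")
      </preformat>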
    </sec>
    <sec id="sec-4b">
      <title>3.2 Sentence Embeddings</title>
      <p>
        Background. Transformer architectures can be
computationally inefficient when trying to find the most semantically
similar pair in a sizable collection of sentences. To address
this issue, sentence transformers were developed. Sentence
transformers utilize mean pooling, which computes the
average of all the word-level vectors in the input sentence.
Pooling helps sentence transformers maintain a fixed-size
vector as their output. Sentence transformers then undergo
a fine-tuning training process using the SNLI dataset [
        <xref ref-type="bibr" rid="ref5">3</xref>
        ]
containing over 570,000 annotated sentence pairs. In the
fine-tuning process, Siamese and triplet networks [
        <xref ref-type="bibr" rid="ref28">26</xref>
        ] are utilized
to compute weights so that sentence
embeddings are optimized for meaningfulness and can be
compared with cosine similarity. Working with sentence-level
representations makes tasks such as computing the semantic
similarity of two sentences easier and more efficient.
Sentence transformers reduce the computation time of finding
the most similar Quora question from over 50 hours with
standard transformer architectures to a few milliseconds [
        <xref ref-type="bibr" rid="ref25">23</xref>
        ].
Furthermore, sentence transformers outperform regular
transformers on several semantic textual similarity tasks [
        <xref ref-type="bibr" rid="ref25">23</xref>
        ].
Approach. We use the sentence-transformers package [
        <xref ref-type="bibr" rid="ref25">23</xref>
        ].
We specifically select the DistilRoBERTa-base-cased
model to compute our sentence embeddings.
DistilRoBERTa-base-cased is a RoBERTa transformer model [
        <xref ref-type="bibr" rid="ref18">16</xref>
        ],
distilled using the method of [
        <xref ref-type="bibr" rid="ref27">25</xref>
        ]. The dimension of the embeddings is 768.
In the embedding process, we take each student response,
which is typically a sentence in length, and convert it into a
vector of 768 floats representing the sentence. These
embeddings are then used to cluster the reflections, as described in
the next subsection.
      </p>
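      <p>The embedding step can be sketched with the sentence-transformers package [23] as follows. The checkpoint name is an assumption, since the exact model identifier is not given above; any DistilRoBERTa-based sentence-transformers model yields 768-dimensional embeddings.</p>
      <preformat>
# Embedding sketch; "all-distilroberta-v1" is an assumed checkpoint name.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-distilroberta-v1")
# `responses` is the list of preprocessed student responses.
embeddings = model.encode(responses)  # array of shape (n_responses, 768)
      </preformat>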
    </sec>
    <sec id="sec-5">
      <title>3.3 Clustering</title>
      <p>
        Our earlier step yields a set of embedded student responses:
one set for the Module 7 reflections and another for the
Module 8 reflections. For each set of embedded student responses
from our earlier step, we use K-means clustering from the
scikit-learn machine learning library [
        <xref ref-type="bibr" rid="ref23">21</xref>
        ]. We compute the
cluster center of each cluster using the embedded student
responses; hence, cluster centers are represented by an
embedding vector of the same shape. We also assign each response
to a cluster based on the nearest cluster center.
      </p>
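      <p>A minimal clustering sketch with scikit-learn [21] follows; k = 4 corresponds to the Module 7 reflections (see Section 4.1).</p>
      <preformat>
# K-means clustering sketch over the sentence embeddings.
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=0)
labels = kmeans.fit_predict(embeddings)  # cluster assignment per response
centers = kmeans.cluster_centers_        # one 768-dimensional center per cluster
      </preformat>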
    </sec>
    <sec id="sec-6">
      <title>3.4 Sorting of Student Reflections</title>
      <p>After each student reflection is assigned a cluster, the
reflections undergo a sorting process. The goal of the sorting
process is to order reflections from most similar to least
similar to assist in the reading process. Cluster distances
were calculated using the scikit-learn library fit_transform
function, which computes and transforms the sentence
embeddings to cluster-distance space. This function uses the
Euclidean distance formula for calculating the distance
between a student reflection response r and its assigned cluster
center r_c, as follows:
$\mathrm{distance}(e(r), r_c) = \sqrt{e(r) \cdot e(r) - 2\, e(r) \cdot r_c + r_c \cdot r_c}$
where e(r) represents a student response r embedded
using sentence-transformers into a vector of 768 elements, and r_c
represents the computed cluster center assigned to r.
After computation, we sort the reflections using the assigned
cluster number to group reflections within the same
cluster together. Lastly, we sort the reflections within the same
cluster by ascending distance, i.e., in descending order of
similarity. This way, reflections are sorted from most semantically
similar to the cluster center to least semantically similar to the
cluster center. Next, we describe our user study setup and
evaluate how well this approach assists in the reading
process.</p>
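      <p>The sorting step can be sketched as follows: fit_transform maps each embedding into cluster-distance space, after which responses are grouped by cluster and ordered by distance to their assigned center. Variable names are illustrative.</p>
      <preformat>
# Sorting sketch: group by cluster, then order by distance to the center.
import numpy as np
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=4, random_state=0)
distances = kmeans.fit_transform(embeddings)  # shape (n_responses, k)
labels = kmeans.labels_
dist_to_center = distances[np.arange(len(labels)), labels]

# Primary key: cluster label; secondary key: ascending distance,
# i.e., responses most similar to their center come first.
order = np.lexsort((dist_to_center, labels))
sorted_responses = [responses[i] for i in order]
      </preformat>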
    </sec>
    <sec id="sec-7">
      <title>4. RESULTS</title>
    </sec>
    <sec id="sec-8">
      <title>4.1 Experimental Setup</title>
      <p>In order to measure the efficacy of clustering in the
knowledge extraction process, we developed a user study which
compares the time efficiency of reading through and
extracting topics from student reflections in two formats:
1. Unsorted student reflections exported directly from the</p>
      <p>LMS.
2. Student reflections sorted based on cluster
distances.</p>
      <p>
        The number of clusters was determined using the Silhouette
method [
        <xref ref-type="bibr" rid="ref26">24</xref>
        ] for finding the optimal number of clusters.
Using the Silhouette method, we generate 4 clusters for module
reflection 7 and 8 clusters for module reflection 8.
First, the method of the user study will be described, and
then a summary of the results. Our hypothesis when
conducting this study was that clustering can help reduce
cognitive load and increase the effectiveness and efficiency of
knowledge extraction.
      </p>
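      <p>A sketch of this Silhouette-based selection of k [24] follows; the candidate range of k values is an assumption.</p>
      <preformat>
# Choose the k that maximizes the mean silhouette coefficient.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k(embeddings, candidates=range(2, 11)):
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, random_state=0).fit_predict(embeddings)
        scores[k] = silhouette_score(embeddings, labels)
    return max(scores, key=scores.get)
      </preformat>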
      <p>In this user study, four teaching assistants (TAs) were selected to
read through the student reflections of a software
engineering course. The Module 7 and 8 reflections were chosen as
the corpora to extract knowledge from, as the TAs had not
yet read these in particular.</p>
      <p>Each TA was assigned a reflection and a format. For
example, TA 1 would read and extract topics from Reflection 7
unsorted, TA 2 would read and extract topics from
Reflection 7 clustered/sorted, and so on, as illustrated in Table
3. The TAs assigned the clustered/sorted
format individually ran the K-means clustering
algorithm first, without reading any responses, before beginning
the process.
The free-response question used in particular for this study
was:
"What was your biggest challenge this past week?
This can include in-class activities, assignments,
prep work, studying, time management,
motivation, and so on."
Each TA individually read through each student's reflection
response for this question, extracted any new topics
mentioned in the student response, and timed themselves
for the duration of the process. Once all TAs had
finished, they met to discuss what topics
they found, and compared times and results.</p>
    </sec>
    <sec id="sec-9">
      <title>4.2 Evaluation</title>
      <p>After comparing the results of this study, we find that
providing instructors with student reflections in a clustered and
sorted format decreases the time needed for knowledge extraction
while maintaining the accuracy of identifying
topics. Reflection 7, with a total of 94 student responses, took
90 minutes to completely read through and extract topics
from in the unsorted format, while requiring only 15 minutes
in the sorted and clustered format. Reflection 8 showed
similar results in which efficiency increased, with a total of 89
responses taking approximately 121.4 minutes in the
unsorted format and 20.9 minutes on the clustered and sorted
responses. It is important to note that the TA
extracting knowledge from Reflection 8 unsorted did not complete
within a 90-minute time frame, so their results were
normalized based on how many reflections they did complete.</p>
      <p>These results are provided in Table 4.</p>
      <p>In addition to the increased efficiency of knowledge
extraction with a clustered and sorted format, the topics extracted
remained consistent, with a slight improvement in
comparison to the unsorted format. Following the portion of the
user study which required TAs to individually extract
topics from the reflections, they met afterwards to discuss
the similarities and differences in their topics. The TAs who
analyzed Reflection 7 extracted the same topics from the
student responses with no differences. During the discussion,
the Reflection 7 TAs took turns sharing the topics they had
extracted during the user study, and concluded that they
were in 100% agreement on the topics coded. Reflection 8,
however, had one topic which was extracted from the clustered
and sorted reflections but not from the unclustered/unsorted
reflections. The TAs assigned to Reflection 8 noted that
this was most likely due to a lack of time to completely
analyze all unsorted student reflections, demonstrating how
time efficiency can also benefit the
accuracy of knowledge extraction under a time constraint.</p>
      <p>Despite the improved time efficiency of the clustered and
sorted reflection format, no topics were missed.</p>
      <p>
        We utilize the dimension reduction algorithm UMAP [
        <xref ref-type="bibr" rid="ref20">18</xref>
        ]
to visualize the resulting clusters of student reflections, as
shown in Figure 2. The student reflections for Module 7
resulted in 4 clusters with 4 major topics: managing
workload, motivation and time management, lab work,
and group work. The Module 8 student reflections resulted
in 8 clusters, with each cluster containing a challenge in at
least one of the following categories: lab work, time
management, studying, motivation, group work, and, for some
reflections, no challenges whatsoever. Managing
workload, motivation, studying, and time management relate to
the student's own perceived ability to handle the
coursework in general. Lab work and group work were challenges
in which students related their troubles more specifically to
difficult topics being covered, confusion about instructions,
or trouble communicating within their groups to
complete activities. Students in the "no
challenges" category noted that they did not have any difficulties or
confusion during the span of that module. As displayed in
these scatter plots and the major topics described, there are
overlaps among several of the clusters. This overlap is
created by the similarities in the students' wordings. For
example, two student responses within the "Managing Workload"
cluster of the Module 7 reflection were:
1. "My biggest challenge has been not procrastinating my
work."
2. "The biggest challenge this week was working with the
dash and the dashboard framework."
[Figure 2: (a) Module 7 reflection clusters based on question 2: student challenges. (b) Module 8 reflection clusters based on question 2: student challenges.]
The first student response was the cluster center, with a
distance of 3.12, and the second student response was one of
the farthest points from the cluster center, with a distance
of 7.05. Therefore, clusters still maintain semantic
similarity to many of the responses with smaller intra-cluster
distances, but contain outliers due to the overlap caused by
similar word usages.</p>
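      <p>A sketch of this visualization step with the umap-learn package [18] follows; the plotting details are illustrative rather than those used to produce Figure 2.</p>
      <preformat>
# Project the 768-d embeddings to 2-d and color points by cluster label.
import matplotlib.pyplot as plt
import umap

points = umap.UMAP(n_components=2, random_state=0).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=12)
plt.title("Student reflection clusters (UMAP projection)")
plt.show()
      </preformat>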
    </sec>
    <sec id="sec-10">
      <title>5. RELATED WORK</title>
      <p>
        Reflections are a necessary component in active learning
courses, as they allow the instructor to track students'
impressions of the course, activities, and social learning aspects
[
        <xref ref-type="bibr" rid="ref11">9</xref>
        ]. In Dorodchi et al. [
        <xref ref-type="bibr" rid="ref10 ref2">8</xref>
        ], student reflections are used in an
introductory computer science (CS1) course to test their
efficacy as a feature to predict early on which students may be
at risk of failing. By including student reflection data as a
feature in a temporal data model, referred to as the student
sequence model, the authors were able to increase the
accuracy of predicting student outcomes of pass or fail [
        <xref ref-type="bibr" rid="ref10 ref2">8</xref>
        ].
Despite the advantages of integrating student reflections into a
course model, these benefits require the time-consuming
process of manually reading through individual reflections and
extracting common themes. For this reason, creating an
automated process to assist instructors is similarly explored in
[
        <xref ref-type="bibr" rid="ref7">5</xref>
        ]. Chen et al. [
        <xref ref-type="bibr" rid="ref7">5</xref>
        ] present positive results in exploring the
usage of topic modeling for analyzing and extracting
knowledge from student reflections. In that particular study, the
MALLET toolkit was utilized for the topic modeling
process, and the number of clusters K was manually selected.
These methods of knowledge extraction are not only effective
in an academic environment, but are also used in other
applications, such as social media mining for COVID-19-related
information. Similar to the time-sensitive task
of analyzing student reflections, clustering can also be used
to discover new information from relevant tweets to assist
in the decision-making steps that may follow [
        <xref ref-type="bibr" rid="ref14">12</xref>
        ]. For this
task, Ito et al. [
        <xref ref-type="bibr" rid="ref14">12</xref>
        ] and Asgari et al. [
        <xref ref-type="bibr" rid="ref4">2</xref>
        ] implement
algorithms using K-means clustering and sentence embeddings,
both of which provide positive results in topic extraction. Our
study is distinguished from prior works in that we collect
and cluster short-text student reflections, and we conduct
an educator-centered evaluation where we assess the direct
impact of our approach on teaching assistants' reading and
analysis time.
      </p>
    </sec>
    <sec id="sec-11">
      <title>6. DISCUSSION &amp; FUTURE WORK</title>
      <p>
        In our research, we implement an approach using K-means
clustering and sentence transformers on student reflections
to reduce the labor and time consumption of
manually analyzing reflections. Our study presents promising
preliminary results showing that by clustering student
reflections based on semantic similarities and sorting by
intra-cluster distance, instructors are able to decrease the time
needed to extract topics from the student corpora.
However, our study suffers from several limitations. Firstly, our
sample size for the user study is very small (N = 4), and
our results may not generalize to different classes or
reflection corpora. Furthermore, teaching assistants read at
different paces, so our results may not generalize to different
teaching assistants. To address these limitations, we intend
to conduct a user study with a significantly larger pool of
participants and module reflections, across multiple courses. In
addition, we plan to utilize fuzzy clustering [
        <xref ref-type="bibr" rid="ref13">11</xref>
        ] in
a future version as well.
      </p>
      <p>
        Reflections are fundamental for enhancing learning in
classrooms [
        <xref ref-type="bibr" rid="ref11">9</xref>
        ], and provide the instructor with instant feedback
on student progress. This study focuses on exploring the
impact of clustering on student reflections to assist
instructors in reducing the time costs of analysis. In our future work,
we plan to integrate our K-means clustering algorithm into
a dashboard tool for instructors and conduct an expanded
user study to further evaluate our approach. The dashboard
will provide instructors and TAs the functionality to cluster
student reflections from the LMS and be guided through the
responses.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>7. ADDITIONAL AUTHORS</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>8. REFERENCES</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Doulat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karduni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Benedict</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Al-Hossami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Maher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dorodchi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Niu</surname>
          </string-name>
          .
          <article-title>Making sense of student success and risk through unsupervised machine learning and interactive storytelling</article-title>
          .
          <source>In International Conference on Artificial Intelligence in Education</source>
          , pages
          <fpage>3</fpage>
          -
          <lpage>15</lpage>
          . Springer,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Asgari-Chenaghlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nikzad-Khasmakhi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Minaee</surname>
          </string-name>
          .
          <article-title>Covid-transformer: Detecting covid-19 trending topics on twitter using universal sentence encoder</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          , G. Angeli,
          <string-name>
            <given-names>C.</given-names>
            <surname>Potts</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>A large annotated corpus for learning natural language inference</article-title>
          .
          <source>In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>632</fpage>
          -
          <lpage>642</lpage>
          , Lisbon, Portugal, Sept.
          <year>2015</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [4]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Subbiah</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Kaplan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Dhariwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Neelakantan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Shyam</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Sastry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Askell</surname>
          </string-name>
          , et al.
          <article-title>Language models are few-shot learners</article-title>
          .
          <source>arXiv preprint arXiv:2005.14165</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Topic modeling for evaluating students' reflective writing: A case study of pre-service teachers' journals</article-title>
          .
          <source>In Proceedings of the sixth international conference on learning analytics &amp; knowledge, pages 1-5</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dorodchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Al-Hossami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nagahisarchoghaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Diwadkar</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Benedict</surname>
          </string-name>
          .
          <article-title>Teaching an undergraduate software engineering course using active learning and open source projects</article-title>
          .
          <source>In 2019 IEEE Frontiers in Education Conference (FIE)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . IEEE,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dorodchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Benedict</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Desai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Mahzoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Macneil</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Dehbozorgi</surname>
          </string-name>
          .
          <article-title>Design and implementation of an activity-based introductory computer science course (cs1) with periodic re ections validated by learning analytics</article-title>
          .
          <source>12</source>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dorodchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dehbozorgi</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Benedict</surname>
          </string-name>
          . Strategies to Incorporate
          <source>Active Learning Practice in Introductory Courses</source>
          , pages
          <fpage>20</fpage>
          -
          <lpage>37</lpage>
          . 04
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [10]
          <string-name>
            <surname>M. M. Dorodchi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Dehbozorgi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Benedict</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Al-Hossami</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Benedict</surname>
          </string-name>
          .
          <article-title>Scaffolding a team-based active learning course to engage students: A multidimensional approach</article-title>
          .
          <source>In 2020 ASEE Virtual Annual Conference Content Access. ASEE Conferences</source>
          , Virtual On line,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Doroodchi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Reza</surname>
          </string-name>
          .
          <article-title>Implementation of fuzzy cluster filter for nonlinear signal and image processing</article-title>
          .
          <source>In Proceedings of IEEE 5th International Fuzzy Systems</source>
          , volume
          <volume>3</volume>
          , pages
          <fpage>2117</fpage>
          -
          <lpage>2122</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ito</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          .
          <article-title>Social media mining with dynamic clustering: A case study by covid-19 tweets</article-title>
          .
          <source>In 2020 11th International Conference on Awareness Science and Technology (iCAST)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . IEEE,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>A statistical interpretation of term specificity and its application in retrieval</article-title>
          .
          <source>Journal of documentation</source>
          ,
          <year>1972</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Identifying at-risk k-12 students in multimodal online environments: A machine learning approach</article-title>
          . arXiv preprint arXiv:
          <year>2003</year>
          .09670,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          .
          <article-title>Text summarization with pretrained encoders</article-title>
          .
          <source>arXiv preprint arXiv:1908.08345</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          . arXiv preprint arXiv:
          <year>1907</year>
          .11692,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Mahajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Al-Hossami</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaikh</surname>
          </string-name>
          . TeamUNCC@LT-EDI-EACL2021
          :
          <article-title>Hope speech detection using transfer learning with transformers</article-title>
          .
          <source>In Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion</source>
          , pages
          <fpage>136</fpage>
          -
          <lpage>142</lpage>
          , Kyiv
          , Apr.
          <year>2021</year>
          .
          <article-title>Association for Computational Linguistics</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>McInnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Healy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Melville</surname>
          </string-name>
          . Umap:
          <article-title>Uniform manifold approximation and projection for dimension reduction</article-title>
          .
          <source>arXiv preprint arXiv:1802.03426</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>arXiv preprint arXiv:1310.4546</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dorodchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Mahzoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Niu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Maher</surname>
          </string-name>
          .
          <article-title>Student network analysis: a novel way to predict delayed graduation in higher education</article-title>
          .
          <source>In International Conference on Artificial Intelligence in Education</source>
          , pages
          <fpage>370</fpage>
          -
          <lpage>382</lpage>
          . Springer,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          . Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics</source>
          ,
          11
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          .
          <article-title>Silhouettes: A graphical aid to the interpretation and validation of cluster analysis</article-title>
          .
          <source>Journal of Computational and Applied Mathematics</source>
          ,
          <volume>20</volume>
          :
          <fpage>53</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          .
          <article-title>Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter</article-title>
          . arXiv preprint arXiv:
          <year>1910</year>
          .01108,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Philbin</surname>
          </string-name>
          .
          <article-title>Facenet: A unified embedding for face recognition and clustering</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          , pages
          <fpage>815</fpage>
          -
          <lpage>823</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>arXiv preprint arXiv:1706.03762</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          .
          <article-title>Retrospective reader for machine reading comprehension</article-title>
          .
          <source>arXiv preprint arXiv:2001.09694</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>