Commit Classification Into Maintenance Activities Using Aggregated Semantic Word Embeddings of Software Change Messages

Tjaša Heričko*, Saša Brdnik and Boštjan Šumak
Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška cesta 46, Maribor, Slovenia

Abstract
Every change to a software repository (i.e., commit) is described by the software developer committing the change with a message written in natural language, indicating the purpose of the change. Automatically inferring the change intents of commits helps understand and manage software projects and their development. This paper presents and evaluates an approach that leverages the semantic characteristics of the textual descriptions in software change messages to classify commits into adaptive, corrective, and perfective maintenance activities. The approach represents a commit message as a set of vectors, using a word2vec model trained on commits of the top starred GitHub repositories. Each vector corresponds to a semantic embedding of a word in the message. The resulting word embeddings are aggregated into a single vector representation for a commit, using average, maximum, and term frequency–inverse document frequency weighted mapping. The experimental results revealed that models based on the proposed features and simple classifiers are promising, outperforming the baseline ZeroR algorithm. Compared to the traditional approach that uses keywords as features, the models of the proposed approach performed better overall, and provided a more accurate prediction of adaptive and corrective maintenance activities. The best-performing model, with a mean weighted F1-score of 75.9%, used the maximum aggregation method and was based on the Random Forest classifier.

Keywords
software maintenance, software repositories, commit messages, multi-class classification, natural language processing, word embeddings, word2vec

1. Introduction
Throughout software development, the iterative changes made to a software project's source code are typically tracked in a version control system, e.g., Subversion and Git [1]. Such systems are designed primarily to support collaborative software development by storing the code, maintaining the history of every modification to the code, and simultaneously fostering collaboration and communication in development teams over the lifetime of a software project [1, 2]. Nevertheless, the abundance of data available from version control systems and the code hosting platforms built on them, e.g., GitHub, presents a valuable resource from which software research and practice can learn in order to understand and manage software projects better [2, 3].

SQAMIA 2022: Workshop on Software Quality, Analysis, Monitoring, Improvement, and Applications, September 11–14, 2022, Novi Sad, Serbia
* Corresponding author.
tjasa.hericko@um.si (T. Heričko); sasa.brdnik@um.si (S. Brdnik); bostjan.sumak@um.si (B. Šumak)
ORCID: 0000-0002-0410-7724 (T. Heričko); 0000-0001-5535-3477 (B. Šumak)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Changes to software repositories are performed for various purposes, including adapting, correcting, or perfecting software.
Analyzing and categorizing changes helps understand and support the decision-making processes of software practitioners, who are concerned mainly with optimal resource allocation and software quality assurance [4, 5, 6]. Additionally, it enables software researchers to study software changes [7]. However, change intents are often not documented explicitly or clearly by developers [7, 8, 9], which makes inferring the change intents of commits a challenging task. Thus, much effort has been dedicated to mining software repositories to obtain the software artifacts of committed changes, e.g., source code, explanatory commit messages, or other metadata, and exploiting them to categorize commits automatically.

Existing research has already investigated the use of various natural language processing techniques for extracting relevant features from commit messages with the objective of classifying commits into maintenance activities, including word-frequency analysis [10, 11], topic models [6, 12], and contextual embeddings [4, 8]; however, the semantics of the textual descriptions in messages are yet to be explored to the full extent of their potential. This paper aims to investigate commit messages from the perspective of semantic embeddings, and the usefulness of aggregated embeddings for multi-class commit classification. A word2vec model, trained on 400k commit messages from the top starred Java repositories on GitHub, was used to convert words into 300-dimensional vector representations. Next, the word embeddings were summarized into commit embeddings using three alternative strategies – averaging, maximizing, and term frequency–inverse document frequency (TF–IDF) weighting. An experiment was conducted to evaluate the approach on a balanced cross-project dataset of 1,793 labeled commits grouped into three maintenance categories, i.e., adaptive, corrective, and perfective maintenance. Several classification algorithms were tested for the classification task.

In achieving the research objectives, we contribute to the body of literature in the following three ways:
• Firstly, to address the problem of classifying commits into maintenance activities, we present commit embeddings, constructed by aggregating the embeddings of words found in each commit message.
• Secondly, to evaluate the proposed approach, we analyzed classification models trained with different classifiers, and compared the results with the traditional keyword-based approach.
• Thirdly, we highlight opportunities for modifications of the approach that may be addressed in future work, which could improve the approach's performance, as well as provide additional insight into the nature of software changes.

The remainder of the paper is structured as follows. The relevant related work is reviewed in Section 2. Section 3 explains our proposed approach. The obtained experimental results are presented and discussed in Section 4. In Section 5, threats to internal and external validity are outlined, together with the strategies employed to alleviate or prevent them. Section 6 concludes the paper by presenting the summarized findings, implications, and future research prospects.

2. Related work
Over the years, various features have been used for the classification of commits.
Regardless of the main objective of the classification, the features can be divided based on their origin: whether they were derived from properties of the source code (utilized in [5, 9, 13, 14]), the commit messages and their metadata (utilized in [6, 8, 11, 15, 16]), or external tools integrated with software repositories, e.g., bug tracking systems (utilized in [17]). Source code-derived features were used by Mariano et al. [13], with counts of added and deleted code lines and a count of changed files as inputs to the classification. Hönel et al. [9] used source code density, a measure of the net size of a commit. Levin and Yehudai [5] targeted code changes as defined by Fluri's taxonomy. Sabetta and Bezzi [14] treated code as a text document written in natural language. Keywords derived from commit messages were used by Saini and Chahal [15]. Textual parts of issues reported in bug tracking systems were considered for classification by Antoniol et al. [17].

In this work, we focused on the classification of commits with the objective of categorizing them into maintenance categories based on features derived from commit messages. Various approaches with similar aims have already been proposed in the field. Mockus and Votta [10] used word frequency analysis and normalization to select relevant keywords from commit messages and used them as features. A similar work by Mauczka et al. [16] also classified commits based on a set of keywords used in the message. Gharbi et al. [11] addressed the classification as a multi-label active learning problem, representing each commit message as a vector of feature values using the TF–IDF technique. Fu et al. [6] utilized domain knowledge of software changes to prepare labeled samples used to build a semi-supervised Latent Dirichlet Allocation model, and Yan et al. [12] presented a discriminative probability latent semantic analysis model. Sarwar et al. [8] employed transfer learning based on a fine-tuned neural network, i.e., DistilBERT. Apart from using commit messages only, some existing work relevant to ours has performed classification using both commit message- and code-derived features. In such a setting, Ghadhab et al. [4] employed a pre-trained neural language model, DistilBERT, on commit messages, together with extracted fine-grained code changes.

The closest to our work are the studies by Ghadhab et al. [4] and Sarwar et al. [8], which have already showcased the usefulness of semantic embedding models for dealing with the unstructured nature of commit messages written in natural language and for classifying commit messages. We built on these works by investigating the usefulness of commit embeddings constructed from a self-trained word2vec model, and by evaluating different methods for aggregating the individual vector representations of words into one embedding per commit.

3. Approach
A high-level overview of the approach is illustrated in Figure 1. Each step is presented in more detail in the following subsections.

3.1. Dataset
An extended dataset made available by Ghadhab et al. [4] was used in our work. It combines data from three sources, i.e., [5], [16], and [18]. Altogether, the dataset contains commits collected from 109 open-source projects written in Java, covering a wide range of application domains, including integration frameworks, utility libraries, and integrated development environments.
Figure 1: An overview of our approach to commit classification

The dataset consists of 1,793 commits, labeled with maintenance activities belonging to one of the following groups: adaptive (N=590), corrective (N=603), and perfective (N=600). The maintenance categories used, each describing the purpose of a software change, were interpreted as proposed by Swanson [19]. Adaptive maintenance refers to adapting the software to changes in the environment, corrective maintenance refers to activities related to fixing failures, and perfective maintenance refers to improving the software's performance and quality.

3.2. Data preprocessing
The data preparation phase included multiple steps. The text of each commit message was first transformed to lowercase letters; then, numbers and punctuation were removed. Stop-word removal was conducted to eliminate commonly used words that carry little meaning (e.g., is, not, to). Lemmatization was used to remove inflectional endings and return only the lemma, the dictionary form of every word.

3.3. Word2vec model training
Word2vec is a well-known and commonly used word embedding learning method in natural language processing, presented by Mikolov et al. [20]. The method uses a neural network that can represent a word in a text corpus as a vector with semantic relations to other words, such that similar words are in close proximity to each other in the vector space.

In this work, we trained our own word2vec model. First, the 4,000 top starred GitHub repositories of projects written in the Java programming language were extracted using the GitHub Search API. The list contained popular projects, including ElasticSearch, Selenium, Kafka, Gson, and Jenkins. Next, for every repository, a list of all commits, together with the commit subjects, was retrieved by cloning repositories with the git clone command and extracting lists of commits using the git log command. Commits with empty subjects were removed. From the remainder, 100 commits were selected randomly for each included project. If a project did not contain 100 commits, it was removed, and the next available top starred repository after the collected ones was gathered using the GitHub API. The process was repeated until 100 commits had been collected from each of 4,000 projects.

The data were preprocessed after obtaining the resulting list of 400k commits on which a word2vec model can learn, i.e., a text corpus. Lemmatization was performed after transforming the text to lowercase and removing punctuation, numbers, and stop words. Additionally, short messages with fewer than six words were removed, as their benefit for training is limited. Duplicated commit messages were dropped after the cleaning steps. The resulting output served as an input to the word2vec training process. We decided to use the skip-gram model, which performs well with smaller training sets and provides good accuracy for rare words. The skip-gram model learns in steps, taking one word as a target in each step and trying to predict the words in its context out to some window size. The model then defines the probability of a word appearing in the context given this target word [20]. The model was trained for 30 epochs using a minimum count of 20, a window size of 2, and a vector size of 300. The resulting trained word2vec model consisted of a vocabulary of 3,384 unique words.
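As an illustration, the pipeline just described could be sketched as follows. The paper does not name the implementations used, so gensim (for word2vec) and NLTK (for stop words and lemmatization) are assumptions here; the training parameters are the ones reported above, and the variable names are illustrative only.

```python
# A minimal sketch of the preprocessing and skip-gram training pipeline,
# assuming gensim and NLTK (the actual tooling is not specified in the paper).
import re

from gensim.models import Word2Vec
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer    # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))


def preprocess(message: str) -> list[str]:
    """Lowercase, strip punctuation and numbers, remove stop words, lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", message.lower())
    return [lemmatizer.lemmatize(t) for t in text.split() if t not in stop_words]


def train_commit_word2vec(subjects: list[str]) -> Word2Vec:
    """Train a skip-gram word2vec model on preprocessed commit subjects."""
    corpus = [preprocess(s) for s in subjects]
    corpus = [c for c in corpus if len(c) >= 6]   # drop short messages
    seen, deduped = set(), []                     # drop duplicated messages
    for c in corpus:
        if tuple(c) not in seen:
            seen.add(tuple(c))
            deduped.append(c)
    return Word2Vec(
        sentences=deduped,
        sg=1,             # skip-gram model
        vector_size=300,
        window=2,
        min_count=20,
        epochs=30,
    )

# Usage (`subjects` is a hypothetical list of the 400k mined commit subjects):
# w2v = train_commit_word2vec(subjects)
# w2v.wv.most_similar("exception", topn=4)  # e.g., runtimeexception, ioexception, ...
```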
In Figure 2, the resulting word embeddings are visualized using the UMAP dimensionality reduction method based on the cosine distance metric. The figure illustrates the semantic relationships using the example of the word exception. The most similar words, based on the computed cosine similarity, are runtimeexception (0.628), ioexception (0.605), illegalstateexception (0.601), and illegalargumentexception (0.597), which can also be seen in the figure.

Figure 2: Visualizing word embeddings using UMAP (n_neighbors=15, min_dist=0.1)

3.4. Feature extraction
For each of the 1,793 commits in the dataset, the learned word2vec model was used to produce an embedding vector for each word in the commit message. Words not present in the word2vec model's vocabulary were not considered. The TF–IDF matrix was calculated next. It assigned a weight to each word in the dataset based on the number of its occurrences in a commit message (i.e., term frequency) and the number of commit messages containing the word (i.e., document frequency), in the following way:

$$\text{TF–IDF}_{i,j} = \mathrm{tf}_{i,j} \times \log\left(\frac{N}{\mathrm{df}_i}\right) \qquad (1)$$

where $\mathrm{tf}_{i,j}$ is the term frequency of the word $i$ in a commit message $j$, $N$ is the total number of commit messages in the dataset, and $\mathrm{df}_i$ is the document frequency of the word $i$. The corresponding weight for a word in a commit message reflects its significance relative to the rest of our dataset.

Three aggregation methods were then used to aggregate the feature vectors of the individual words of a commit message into one commit embedding. For averaging, the mean value per dimension over the words was calculated. For maximizing, the highest value per dimension over the words was calculated. For TF–IDF weighting, each word embedding was multiplied by the corresponding value in the TF–IDF matrix, and the resulting vectors were then aggregated by averaging across dimensions.

3.5. Classification task
A multi-class classification approach was adopted, in which each commit was assigned a single label L ∈ {adaptive, corrective, perfective}. The inputs to the model were the vector representations of commit messages, constructed as described in the previous subsection (EmbeddingsAvg, EmbeddingsMax, EmbeddingsTF–IDF). Each of the three consisted of a 300-dimensional feature vector. The following classifiers were used, selected due to their inclusion in existing research [4, 5, 9, 11, 13]: Logistic Regression (LR), K-Nearest Neighbors (KNN), Random Forest (RF), Gaussian Naïve Bayes (GNB), and Decision Tree (DT). For the baseline, the ZeroR algorithm was used, whose output is the most frequent group in the dataset – in our case, corrective maintenance. Also, the traditional keyword-based approach, which uses a set of 20 keywords extracted from commit messages as binary features – True if the keyword is present in the message, and False if it is not – was implemented as described in [4], to allow comparison with our approach. Three times repeated 10-fold cross-validation was used to estimate the performance of the models, using the mean accuracy and F1-score, averaged over all repeated folds, as the evaluation metrics.
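The feature aggregation and the evaluation protocol described in this and the previous subsection could be sketched as follows. numpy and scikit-learn are assumed, `w2v` denotes the trained word2vec model from the previous sketch, `tfidf_weights` is an illustrative per-message lookup into the TF–IDF matrix of Eq. (1), and stratification of the repeated folds is our assumption.

```python
# Sketch of the feature aggregation (Section 3.4) and evaluation protocol
# (Section 3.5). numpy and scikit-learn are assumed; variable names are
# illustrative. Stratified folds are an assumption (the paper only states
# three times repeated 10-fold cross-validation).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score


def commit_embedding(tokens, w2v, method="max", tfidf_weights=None):
    """Aggregate the word vectors of one commit message into a single vector."""
    vecs, wts = [], []
    for t in tokens:
        if t in w2v.wv:                      # out-of-vocabulary words are skipped
            vecs.append(w2v.wv[t])
            wts.append(tfidf_weights.get(t, 0.0) if tfidf_weights else 1.0)
    if not vecs:
        return np.zeros(w2v.vector_size)
    vecs = np.asarray(vecs)
    if method == "avg":                      # mean value per dimension
        return vecs.mean(axis=0)
    if method == "max":                      # highest value per dimension
        return vecs.max(axis=0)
    if method == "tfidf":                    # TF-IDF-weighted vectors, averaged
        return (vecs * np.asarray(wts)[:, None]).mean(axis=0)
    raise ValueError(f"unknown aggregation method: {method}")


def evaluate(X: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Mean and SD of the weighted F1-score over 3x repeated 10-fold CV."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)
    scores = cross_val_score(RandomForestClassifier(random_state=0),
                             X, y, cv=cv, scoring="f1_weighted")
    return scores.mean(), scores.std()

# Usage (hypothetical variables holding the labeled dataset):
# X = np.vstack([commit_embedding(preprocess(m), w2v, "max") for m in messages])
# mean_f1, sd_f1 = evaluate(X, np.asarray(labels))
```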
4. Experiment

4.1. Research questions
To assess the proposed approach and its variations based on the aggregation method used, an experiment was conducted with the aim of addressing the following two research questions:
(RQ1) Can the proposed approach classify commits accurately into maintenance activities?
(RQ2) How does the proposed approach perform compared to the traditional keyword-based approach?

4.2. Results and discussion
First, the models were assessed regarding accuracy – the overall percentage of correct predictions. The accuracy over all the folds across different classifiers is presented in Figure 3. Considering the mean accuracy over folds, when the LR and RF classifiers were used, all three variations of the proposed approach outperformed the traditional approach. Only in the cases of KNN and GNB did the traditional approach perform slightly better than the model using TF–IDF weighted embeddings, and in the case of DT, the traditional approach performed better than the models using averaged and TF–IDF weighted embeddings. Comparing the best classification models for each feature set in terms of the mean accuracy, the three models using the feature vectors proposed by this work (72.1%, 75.5%, and 65.6%) outperformed the traditional model (63.7%).

Figure 3: Accuracy of the variants of the proposed approach and traditional approach across different classifiers

Next, as accuracy as an evaluation metric can mask some problems of the classification models, the F1-score – the harmonic mean of precision and recall, here weighted by class support – was also used for assessment. In Table 1, the weighted F1-score is presented, obtained for each model over all the folds and averaged by the total number of folds (N=30). When evaluating the performance of the models using the F1-score, the findings are generally similar to the ones observed for accuracy. However, a larger difference between the mean score of the traditional approach and the mean score of the proposed approach's variations for the F1-score (Δ=4.67%) compared to accuracy (Δ=4.25%) hints that the traditional approach has some underlying problems related to false positives and false negatives. Comparing the best models for each feature set in terms of the F1-score, the three models using the feature vectors proposed by this work (71.8%, 75.9%, and 67.6%) outperformed the traditional model (64.0%).

Table 1: Weighted F1-score (%) of the variants of the proposed and traditional approach across different classifiers

                     LR          KNN         RF          GNB         DT
                     x̄    SD    x̄    SD    x̄    SD    x̄    SD    x̄    SD
EmbeddingsAvg       71.7  3.3   64.1  3.5   71.8  2.6   64.3  3.1   58.5  3.6
EmbeddingsMax       75.3  2.5   70.0  2.6   75.9  3.5   66.0  3.4   63.5  3.7
EmbeddingsTF–IDF    65.6  3.9   59.3  3.5   67.6  3.3   52.0  4.0   52.6  4.1
Keywords            64.0  3.1   59.0  3.9   64.0  3.2   52.9  3.6   62.9  3.2

ZeroR baseline (independent of the feature set): x̄=17.0, SD=3.1
LR=Logistic Regression, KNN=K-Nearest Neighbors, RF=Random Forest, GNB=Gaussian Naïve Bayes, DT=Decision Tree, x̄=Mean, SD=Standard Deviation

All the proposed models performed better than the baseline ZeroR algorithm in terms of accuracy (33.6%) and the F1-score (17.0%). The overall best performing model, with a mean accuracy of 75.5% and a weighted F1-score of 75.9%, used maximized embeddings as the feature vector with RF as the classifier. In Figure 4, we visualize our dataset, represented using the best performing feature vector (maximized word embeddings), by transforming the 300-dimensional vectors to 2-dimensional vectors using the UMAP dimensionality reduction method and grouping commits by maintenance category. We can observe rough areas and patterns where a particular commit category is prevalent. This visualization hints at why such representations of commit messages can be useful.

Figure 4: Commit embeddings constructed with maximizing word embeddings visualized using UMAP (n_neighbors=100, min_dist=0.3)

Performance on a more fine-grained level, i.e., per maintenance category, was also observed, to further analyze the differences in performance between the proposed and traditional approaches.
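As a sketch of this analysis step, a row-normalized confusion matrix averaged over the cross-validation folds could be computed as follows (scikit-learn assumed; the variable names continue from the previous sketch and are illustrative):

```python
# Sketch of computing a row-normalized confusion matrix, averaged across all
# folds of the repeated cross-validation. Assumes X and y are numpy arrays
# and clf/cv are the estimator and fold generator from the previous sketch.
import numpy as np
from sklearn.base import clone
from sklearn.metrics import confusion_matrix

LABELS = ["adaptive", "corrective", "perfective"]


def mean_normalized_confusion(clf, X, y, cv) -> np.ndarray:
    """Fit/predict per fold; normalize each row by the count of actual samples."""
    mats = []
    for train_idx, test_idx in cv.split(X, y):
        pred = clone(clf).fit(X[train_idx], y[train_idx]).predict(X[test_idx])
        mats.append(confusion_matrix(y[test_idx], pred,
                                     labels=LABELS, normalize="true"))
    return np.mean(mats, axis=0)
```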
The normalized confusion matrices of the four models (all built with the same classification algorithm), averaged across all folds, are presented in Figure 5. We chose to present the performance results of the models built with RF, because this classifier provided the best results for our variations of the proposed approach. Although it was not the best performing classifier for the traditional approach, the difference from the best one (which uses LR) was small: the difference in the mean accuracy was 0.074%, and the difference in the F1-score was 0.023%. The presented confusion matrices gave us better insight into whether all maintenance categories were being predicted equally well, and what errors the models were making. By comparing the actual categories with the predicted ones, we can observe that all three variations of the proposed approach had a high percentage of true positive predictions (the predicted categories are the same as the actual categories), as indicated by the dark-colored diagonals of the confusion matrices. In addition, all three had similar true positive rates for all three maintenance categories, meaning that the proposed approach worked equally well for all the maintenance categories. On the other hand, the traditional approach, relying only on a set of predefined keyword features, had a higher rate of misclassifications. It was very successful at predicting perfective commits correctly, yet less successful for adaptive and corrective commits, since it falsely predicted a large portion of the commits belonging to these two groups as perfective.

Figure 5: Normalized mean confusion matrices for models trained with the RF classifier

In the context of (RQ1), we wanted to evaluate how the proposed approach performed for commit classification into maintenance activities. Our results show that using the proposed features to represent software change messages – unstructured natural-language text – based on aggregation methods applied to word embeddings improved on the baseline algorithm significantly. Such features were shown to contain information indicative enough to classify commits. The best performing models were the ones using the maximization of word embeddings as the aggregation method.

In the context of (RQ2), we wanted to make our work comparable by evaluating the effectiveness of our approach against the traditional keyword-based approach. By reproducing the traditional keyword-based approach, we found that, overall, the proposed approach outperformed the traditional approach. Although the proposed approach was less successful at predicting perfective commits, it improved the prediction of adaptive and corrective commits compared to the traditional approach.

5. Threats to validity

5.1. Internal validity
The results of this study were largely affected by the self-trained word2vec model embeddings. A major threat to validity relates to the subjects selected for the model's learning process. As the learning relies heavily on the quality of the commit messages given to the model as inputs, we chose commits from top starred repositories as a proxy for popular repositories, assuming that commits in popular projects were more likely to be well-documented and that their messages would correspond adequately to the applied software changes. To further ensure that the included messages would benefit the learning process, short messages with fewer than six words were excluded.
To mitigate the risk of capturing the semantic characteristics of messages written by only a few developers, we selected commits from a large number of different repositories. Despite all these efforts, there is no guarantee that the resulting commits are good representatives of all software projects. This may directly impact the generated word embeddings used in the classification and, therefore, the classification performance. In addition, it is essential to note that the selected word vector representation method could have affected the classification results; alternatively, other methods could be used, e.g., GloVe. The parameter values of the word2vec model, e.g., the model type, vector dimensions, and the number of epochs, could also have impacted the results. Note that our work did not set out to find the optimal values for the best results, but instead attempted to validate the proposed approach. Our findings were also influenced by the dataset chosen for commit classification. To mitigate threats related to data quality, we reused an existing dataset used previously by several researchers. The validity threats reported by the authors related to the gathering and labeling of the commits included in the dataset should be considered.

5.2. External validity
To mitigate threats related to the generalizability of the word embeddings obtained from the trained word2vec model, we used a large number of commits from various software repositories. It is possible that the results would differ for less popular repositories, closed-source repositories, and repositories of projects written in other programming languages, because the obtained results were restricted to popular open-source repositories of Java code. To maximize the generalizability of our findings with regard to the classification task, classification was performed in a cross-project setting. However, the dataset was somewhat limited in size. In addition, our findings may not generalize to closed-source software and projects written in programming languages other than Java. Repeated k-fold cross-validation was employed to address the risk of under- and overfitting the training data.

6. Conclusions
In this work, we studied the effectiveness of aggregated word2vec embeddings for classifying commits into maintenance categories. We demonstrated that the proposed features, capturing the semantics of commit messages, performed well with simple classifiers, outperforming the ZeroR baseline significantly. Additionally, we put the results of the proposed approach in relation to those of the traditional keyword-based approach and highlighted the differences.

The approach is still subject to improvements in future work. A search for the optimal parameter values of the word2vec model should be performed, so that words are represented adequately in the semantic vector space. More strategies to assess the quality of commit messages in the data preprocessing step should be considered, for example, dealing with links present in commit messages, as the applied preprocessing steps were found to be insufficient in this respect. Future research can focus on additional aggregation methods, e.g., summation, minimization, and even concatenation, aiming to find the method that best preserves the relevant semantic information from the commits. Strategies for dealing with words of commit messages that are not present in the word2vec vocabulary should also be addressed in future work.
Hyperparameter tuning can be employed to find the optimal parameters of the classification models. Other classification methods, apart from the ones used in this study, especially neural network-based and ensemble models, could provide better results. In the future, language-agnostic classification models that work across multiple programming languages should be the focus; larger, cross-language datasets should be obtained to enable this research direction. In the current approach, the model can describe the maintenance nature of an entire commit only – each commit can belong to one group only. In view of some commits belonging to multiple maintenance groups simultaneously, particularly in the case of merge commits, steps should be taken towards multi-label classification.

Acknowledgments
The authors acknowledge the financial support from the Slovenian Research Agency (Research Core Funding No. P2-0057).

References
[1] N. N. Zolkifli, A. Ngah, A. Deraman, Version control system: A review, Procedia Computer Science 135 (2018) 408–415. doi:10.1016/j.procs.2018.08.191.
[2] G. Gousios, D. Spinellis, Mining software engineering data from github, in: 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), 2017, pp. 501–502. doi:10.1109/ICSE-C.2017.164.
[3] H. Kagdi, M. L. Collard, J. I. Maletic, A survey and taxonomy of approaches for mining software repositories in the context of software evolution, Journal of Software Maintenance and Evolution: Research and Practice 19 (2007) 77–131. doi:10.1002/smr.344.
[4] L. Ghadhab, I. Jenhani, M. W. Mkaouer, M. Ben Messaoud, Augmenting commit classification by using fine-grained source code changes and a pre-trained deep neural language model, Information and Software Technology 135 (2021) 106566. doi:10.1016/j.infsof.2021.106566.
[5] S. Levin, A. Yehudai, Boosting automatic commit classification into maintenance activities by utilizing source code changes, in: Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE, Association for Computing Machinery, New York, NY, USA, 2017, pp. 97–106. doi:10.1145/3127005.3127016.
[6] Y. Fu, M. Yan, X. Zhang, L. Xu, D. Yang, J. D. Kymer, Automated classification of software change messages by semi-supervised latent dirichlet allocation, Information and Software Technology 57 (2015) 369–377. doi:10.1016/j.infsof.2014.05.017.
[7] N. Meng, Z. Jiang, H. Zhong, Classifying code commits with convolutional neural networks, in: 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8. doi:10.1109/IJCNN52387.2021.9533534.
[8] M. U. Sarwar, S. Zafar, M. W. Mkaouer, G. S. Walia, M. Z. Malik, Multi-label classification of commit messages using transfer learning, in: 2020 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), 2020, pp. 37–42. doi:10.1109/ISSREW51248.2020.00034.
[9] S. Hönel, M. Ericsson, W. Löwe, A. Wingkvist, Using source code density to improve the accuracy of automatic commit classification into maintenance activities, Journal of Systems and Software 168 (2020) 110673. doi:10.1016/j.jss.2020.110673.
[10] A. Mockus, L. G. Votta, Identifying reasons for software changes using historic databases, in: Proceedings 2000 International Conference on Software Maintenance, 2000, pp. 120–130. doi:10.1109/ICSM.2000.883028.
[11] S. Gharbi, M. W. Mkaouer, I. Jenhani, M. B. Messaoud, On the classification of software change messages using multi-label active learning, in: Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, SAC '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1760–1767. doi:10.1145/3297280.3297452.
[12] M. Yan, Y. Fu, X. Zhang, D. Yang, L. Xu, J. D. Kymer, Automatically classifying software changes via discriminative topic model: Supporting multi-category and cross-project, Journal of Systems and Software 113 (2016) 296–308. doi:10.1016/j.jss.2015.12.019.
[13] R. V. R. Mariano, G. E. dos Santos, M. V. de Almeida, W. C. Brandão, Feature changes in source code for commit classification into maintenance activities, in: 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), 2019, pp. 515–518. doi:10.1109/ICMLA.2019.00096.
[14] A. Sabetta, M. Bezzi, A practical approach to the automatic classification of security-relevant commits, in: 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), 2018, pp. 579–582. doi:10.1109/ICSME.2018.00058.
[15] M. Saini, K. K. Chahal, Change profile analysis of open-source software systems to understand their evolutionary behavior, Frontiers of Computer Science 12 (2018) 1105–1124. doi:10.1007/s11704-016-6301-0.
[16] A. Mauczka, M. Huber, C. Schanes, W. Schramm, M. Bernhart, T. Grechenig, Tracing your maintenance work – a cross-project validation of an automated classification dictionary for commit messages, in: International Conference on Fundamental Approaches to Software Engineering, Springer, 2012, pp. 301–315. doi:10.1007/978-3-642-28872-2_21.
[17] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, Y.-G. Guéhéneuc, Is it a bug or an enhancement? A text-based approach to classify change requests, in: Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds, CASCON '08, Association for Computing Machinery, New York, NY, USA, 2008. doi:10.1145/1463788.1463819.
[18] E. AlOmar, M. W. Mkaouer, A. Ouni, Can refactoring be self-affirmed? An exploratory study on how developers document their refactoring activities in commit messages, in: 2019 IEEE/ACM 3rd International Workshop on Refactoring (IWoR), 2019, pp. 51–58. doi:10.1109/IWoR.2019.00017.
[19] E. B. Swanson, The dimensions of maintenance, in: Proceedings of the 2nd International Conference on Software Engineering, ICSE '76, IEEE Computer Society Press, Washington, DC, USA, 1976, pp. 492–497. doi:10.5555/800253.807723.
[20] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, 2013. doi:10.48550/ARXIV.1301.3781.