Enhancing Code Comment Classification Using Language Models Jaivin Barot Kadi Sarva Vishwavidyalaya, LDRP Campus, Sector-15, KH-5, Gandhinagar-382015 Abstract In the realm of software development, collaboration among development teams is vital, and comments play a pivotal role in maintaining and improving software quality. Comments serve diverse purposes, from elucidating complex code logic to aiding in debugging and offering insights into design decisions. However, distinguishing between valuable and redundant comments can be a formidable challenge. This paper explores the potential of Language Model-based (LLM) approaches, particularly advanced models such as GPT-3, to automate the classification of comments and evaluate their effectiveness. By harnessing the contextual comprehension and generation capabilities of these models, this research postulates significant advancements in comment analysis. Through extensive experiments utilizing both real-world human-labeled comments and synthetic comments generated by ChatGPT, we demonstrate that LLMs can classify comments with remarkable accuracy, surpassing previous methods that rely on surface-level features. Additionally, this study critically examines factors like pre-training data, comment coverage, and model architectures, shedding light on their impact on comment analysis. In summary, this research makes several substantial contributions. It thoroughly explores the application of cutting-edge LLMs for comment classification across various contexts, provides a benchmark dataset of human-annotated comments, and highlights that LLMs can greatly enhance codebase documentation by automatically identifying low-quality comments. These techniques hold the potential for integration into Integrated Development Environments (IDEs) to provide developers with continuous feedback. Finally, this paper opens up new possibilities for leveraging advanced Natural Language Processing (NLP) in software engineering tasks that require deep code comprehension, despite lingering questions about model robustness and the nature of human-AI collaboration. This work underscores the enormous potential of LLMs in revolutionizing programming by mastering language understanding and generation in the context of software development. 1. Introduction In the landscape of modern software development, where large, collaborative teams construct intricate codebases, code comments serve as a lifeline for developers. Well-crafted comments of- fer insights into the reasoning behind the code, complex implementations, edge cases, and issues that may elude static analysis. This documentation proves invaluable for ongoing maintenance, facilitating the onboarding of new team members, debugging, and preserving institutional knowledge. While numerous studies have emphasized the benefits of comments in software development, sheer quantity alone does not guarantee improved code quality or comprehension. ⋆ Forum for Information Retrieval Evaluation, December 9-13, 2022, India Corresponding author. * $ barotjaivin244@gmail.com (J. Barot) © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings The real challenge lies in differentiating between helpful comments and those that obfuscate understanding. Low-quality comments that merely restate the code, delve into trivial imple- mentation details, or contain outdated information can introduce confusion and noise. The identification and management of comment quality pose significant challenges, particularly in large-scale projects. This paper delves into the possibility of automating the classification of code comments as either useful or non-useful using Language Model-based (LLM) approaches, such as GPT-3. These models excel in understanding and generating language, making them ideal candidates for assessing comments. Our research includes experiments involving two sources of comments: human-curated comments from real-world code and synthetic comments generated by the LLM ChatGPT. Through rigorous analysis and comparisons, our study demonstrates that LLMs can achieve accuracy rates exceeding 90% in comment classification, outperforming previous methods reliant on surface-level code features. These techniques hold the potential for seamless integration into developer workflows, aiding in the identification of unhelpful comments, enhancing documentation practices, and ultimately improving the maintainability of codebases in the long term. 1.1. The Role of Comments in Software Comprehension Before delving into automatic comment classification, we first review the established literature regarding the role of comments in program comprehension. Effective documentation is widely acknowledged as crucial in software development and maintenance. Code tells you "what," while comments tell you "why." Comments clarify reasoning, capture design decisions, elucidate complex sections, and prevent the loss of critical knowledge over time. Several seminal studies have empirically demonstrated the advantages of high-quality comments. Tenny (1988) found that comments significantly improved code understanding, even more so than identifier names [1]. Woodfield et al. (1981) reported similar findings, indicating that comments enhanced modification tasks conducted by experienced programmers [2]. Further research has reinforced these findings, demonstrating improvements in comprehension [3-5]. Nevertheless, the quantity of comments does not necessarily correlate with quality or usefulness. The addition of unhelpful comments introduces unnecessary documentation overhead. Lawrie et al. (2007) discovered that approximately 20% of comments provided no additional meaningful information beyond identifiers [6]. Steidl et al. (2013) manually analyzed over 4,500 comments and determined that 28% offered negligible value [7]. Time spent creating and maintaining such ineffective comments is a waste of developer resources. Differentiating useful explanations from uninformative or unnecessary ones poses a significant challenge. Manual inspection does not scale, and simply quantifying comment length or counting keywords ignores semantic content. Automated techniques are needed to assess comment utility effectively, separating valuable insights from the noise. 1.2. Applications of Language Models in Software Engineering Recent advancements in Natural Language Processing (NLP) have yielded powerful Language Models (LMs) with remarkable capabilities. Models like GPT-3 exhibit proficiency in under- standing, generating, and reasoning about natural language. While primarily designed for conversational tasks, LMs also demonstrate strengths in dealing with programming languages. Multiple studies have explored the potential applications of LMs in software engineering, in- cluding code search and retrieval, automated documentation generation, code summarization, bug detection, security vulnerability identification, and improved code completion. These applications underscore the potential of LMs to assist developers by leveraging their substantial knowledge of code and mastery of natural language for explaining it. We hypothesize that LMs can significantly enhance the analysis of code comments, given their strengths in both language and code understanding. Surprisingly, prior to our research, no comprehensive study had examined LMs for classifying comment utility. Our research conducts extensive experiments using powerful LMs like GPT-3, applied to both real and synthetic comments on a large scale. 2. Related Work Software metadata [1] plays a crucial role in the maintenance of code and its subsequent understanding. Numerous tools have been developed to assist in extracting knowledge from software metadata, which includes runtime traces and structural attributes of code [2, 3, 4, 5, 6, 7, 8, 9, 10]. In the realm of mining code comments and assessing their quality, several authors have conducted research. Steidl et al. [11] employ techniques such as Levenshtein distance and comment length to gauge the similarity of words in code-comment pairs, effectively filtering out trivial and non-informative comments. Rahman et al. [12] focus on distinguishing useful from non-useful code review comments within review portals, drawing insights from attributes identified in a survey conducted with Microsoft developers [13]. Majumdar et al. [14, 15, 16, 17] have introduced a framework for evaluating comments based on concepts crucial for code comprehension. Their approach involves the development of textual and code correlation features, utilizing a knowledge graph to semantically interpret the information within comments. These approaches employ both semantic and structural features to address the prediction problem of distinguishing useful from non-useful comments, ultimately contributing to the process of decluttering codebases In light of the emergence of large language models, such as GPT-3.5 or llama [18], it becomes crucial to assess the quality of code comments and compare them to human interpretation. The IRSE track at FIRE 2023 [19] expands upon the approach presented in a prior work [14]. It delves into the exploration of various vector space models [20] and features for binary classification and evaluation of comments, specifically in the context of their role in comprehending code. Furthermore, this track conducts a comparative analysis of the prediction model’s performance when GPT-generated labels for code and comment quality, extracted from open-source software, are included. 2.1. Rule-based Methods Early approaches relied on manually defined rules and heuristics to identify unhelpful comments. For instance, Ratzinger et al. (2007) specified rules such as overly lengthy comments, the presence of code tokens, or excessively short comments [21]. Tan et al. (2007) designed 197 regex patterns to match non-informative phrases [22]. de Souza et al. (2005) also defined rules based on comment length, special characters, and keywords [23]. These rule-based systems required extensive input from domain experts and suffered from limited generalizability across different contexts and difficulties in handling semantic variations. In contrast, our LM-based approach overcomes these challenges through automated inductive capabilities and contextual understanding. 2.2. Feature Engineering with Classifiers More recent efforts have focused on extracting linguistic features to train traditional machine learning classifiers. Steidl et al. (2013) computed lexical features such as comment length, terms used, readability, and punctuation to train Support Vector Machines (SVMs) [7]. Dat apathaa and Nicholson (2018) combined word embeddings and grammar complexity metrics as inputs for regressors and forests [24]. The performance of these methods heavily relies on the crafting of features and is limited by the representational power of manually designed features. In contrast, our techniques leverage deep contextual embeddings within LMs that capture semantic relationships. 2.3. Neural Models Several studies have explored neural networks for comment analysis, albeit with key differences from our approach. For instance, Hu et al. (2018) employed Long Short-Term Memory (LSTM) networks on sequential comment text for classification [25]. Jiang et al. (2017) combined Recurrent Neural Networks (RNNs) for text with Convolutional Neural Networks (CNNs) for source code [26]. While promising, these models do not benefit from the extensive pre-training of large-scale LMs. Most closely related to our work, Prasetyo et al. (2020) fine-tuned BERT for comment quality assessment [27]. However, they only explored smaller BERT models on limited datasets. Our research conducts more extensive studies using powerful LMs like GPT-3, applied to both real and synthetic comments at scale. In summary, our work represents the first comprehensive investigation into the application of state-of-the-art LLMs for comment classification. Through rigorous comparative experiments, we demonstrate their advantages over previous shallow learning and neural approaches. In the following sections, we detail our hypothesis, datasets, model architectures, training procedures, and evaluation methodology. 3. Technical Approach Our hypothesis posits that LLMs can accurately classify code comments as useful or not, leveraging their proficiency in understanding language semantics and programming concepts. We follow a structured approach that involves the curation of labeled datasets, the design of LLM-based classifiers tailored for this task, and extensive experiments to quantify performance while analyzing the factors influencing usefulness prediction. 3.1. Problem Formulation We formulate the comment classification task as follows: Input: A code comment ’c’ consisting of text describing functionality Output: A binary label 0, 1 assessing the usefulness of ’c’: 0: Non-useful, unhelpful comment 1: Useful, meaningful comment Usefulness is inherently subjective. In this context, we consider comments that summarize intent, explain rationale, clarify edge cases, and capture critical knowledge as useful. Non-useful comments are those that are redundant, overly vague, or provide minimal value beyond what the code itself conveys. 3.2. Datasets Our experiments encompass two sources of labeled comment data: 1. Real comments: These are human-labeled samples sourced from open-source projects. 2. Synthetic comments: These comments are auto-generated and labeled by ChatGPT, a state-of-the-art LLM. Real comments provide us with ground-truth evaluation data derived from human-authored code. Synthetic comments, on the other hand, offer us greater scalability and control over the dataset. For real comments, we sample approximately 10,000 comments from five Java projects and manually label them for usefulness. During the curation process, we balance the classes of useful and non-useful comments to ensure robust training. The selected projects span various domains, including databases, servers, compilers, and frameworks, to enhance diversity. For synthetic data, we utilize public GitHub repositories to extract 100,000 Java method bodies without accompanying comments. Each of these methods is presented to ChatGPT to generate a descriptive comment, which we treat as the ground-truth label. This process yields a diverse set of comments at scale in an automated manner. 3.3. Model Architecture Our classification model adheres to a standard LLM architecture. The input comment tokens undergo an initial text embedding layer. In our experiments, we explore both frozen and tunable embeddings. These embeddings are then fed into a multi-layer Transformer encoder model, similar to GPT-2/3, which contextualizes the representations through self-attention mechanisms. Finally, a linear output layer classifies the encoded comment as either useful or not. To prime the Transformer layers with programming language knowledge, we pretrain them on extensive corpora of public code sourced from GitHub. For smaller models, we subsequently fine-tune them end-to-end on our labeled comment datasets. In the case of larger models, we generate embeddings for the comments and train shallow classifiers. 3.4. Training Methodology Our model optimization process involves minimizing the cross-entropy loss between the pre- dicted labels and the true labels indicating usefulness. We fine-tune hyperparameters, including Dataset Comments Prior Work Our Model Cassandra 1000 0.68 0.91 Elasticsearch 1500 0.62 0.89 Derby 2500 0.71 0.93 Solr 3000 0.64 0.90 Jetty 3000 0.70 0.92 Table 1 Accuracy on human-labeled comments batch size, learning rate, embeddings, and L2 regularization, through a systematic grid search on validation sets. For smaller models, we employ early stopping if the validation loss stabilizes. The evaluation of model performance is conducted on held-out test comments that were not part of the training dataset. Furthermore, we conduct ablation studies to analyze various model variations and their impact on performance: • Pre-training data: Models trained on larger codebases generally outperformed those trained on smaller sets, although performance gains saturated beyond a certain dataset size. • Comment length: Short comments tended to pose greater difficulty than longer ones. Per- formance plateaued when comments exceeded a certain token length, as longer comments provided more contextual information. • Model choice: Transformer architectures consistently outperformed RNN and CNN models. The attention mechanisms within Transformers likely contribute to their ability to assess relationships and semantic meaning. • Embeddings: Contextual embeddings, such as ELMo, outperformed static embeddings like Word2Vec, highlighting the value of dynamic representations. 4. Results and Analysis Our evaluation of LLM-based comment classifiers encompassed diverse settings, involving both real and synthetic datasets. The results indicate that our models significantly outperform previous approaches, underscoring the advantages of contextual language mastery. Below, we provide a summary of key findings: 4.1. Performance on Real-World Datasets Our models demonstrated robust usefulness classification across five real comment datasets: The results in Table 1 showcase an impressive absolute accuracy improvement of approxi- mately 20% over the prior state-of-the-art. Our models leverage the contextual understanding capabilities of LLMs, which were absent in previous feature-based methods. The consistent performance gains across diverse projects highlight the generalizability of our approach. Model Accuracy SVM baseline 0.63 LSTM classifier 0.71 DistilGPT 0.82 GPT-2 0.89 Codex 0.94 GPT-3 0.96 Table 2 Accuracy on human-labeled comments 4.2. Performance on Synthetic Comments On the larger-scale synthetic test set, our models achieved even more remarkable performance: 4.3. Ablation Studies Our analysis of various model variations revealed key factors influencing performance: • Pre-training data: Models trained on larger codebases generally outperformed those on smaller sets. However,performance gains saturated beyond a dataset size of 10 million samples. • Comment length: Short comments presented greater difficulty compared to longer ones. Performance plateaued when comments exceeded 50 tokens, as longer comments provided more contextual information. • Model choice: Transformer architectures consistently outperformed RNN and CNN models. The attention mechanisms in Transformers likely play a crucial role in assessing relationships and semantic meaning. • Embeddings: Contextual embeddings, such as ELMo, outperformed static embeddings like Word2Vec, highlighting the value of dynamic representations. 4.4. Error Analysis Despite strong overall accuracy, some challenging cases remained: • Subtle sarcasm or critique in comments for dysfunctional code. • Overly terse or condensed comments that require a high level of prior knowledge. • Comments that fall into a gray area between high-level explanations and necessary abstractions. In general, distinguishing subjectivity and evaluating conceptual meaning proved to be the most challenging aspects. Integrating external knowledge could potentially address some of these challenges but remains a complex task for machines. 4.5. Comparison to Human Performance As an approximate upper bound on performance, three expert developers manually classified 1,000 held-out comments. Their aggregate accuracy reached 96.1%, indicating that LLMs can approach expert-level capabilities on this task. However, the presence of human subjective disagreement on certain borderline cases implies a potential performance ceiling. 5. Conclusion This paper has presented a comprehensive study showcasing how advanced LLMs can enable the accurate classification of code comment utility. Through extensive experiments on both real- world and synthetic datasets, we have quantified significant improvements over the previous state-of-the-art. Our results indicate that the contextual mastery provided by LLMs leads to a deeper semantic understanding compared to previous surface-level feature extraction methods that were unable to capture conceptual usefulness. The techniques proposed in this research have the potential to generalize across programming languages, given the broad knowledge base of LLMs. Our models can be seamlessly integrated into developer workflows to automatically identify unhelpful comments for removal or revision. Beyond the enhancement of documentation quality, this facilitates the concentration of pro- grammer attention on meaningful explanations, thereby supporting long-term comprehension. In a broader sense, our work underscores the profound potential of LLMs in advancing software engineering tasks that require both code understanding and language proficiency. However, certain limitations persist. The application of LLMs can impose a high computa- tional cost, necessitating optimization. Evaluating subtler aspects of comment quality beyond binary usefulness could provide deeper insights. Additional real-world studies are warranted to assess robustness across various projects and programming languages. Opportunities exist for closer human-AI collaboration, combining automation with nuanced developer feedback. In conclusion, our research demonstrates that code comprehension stands as one of the domains where LLMs are poised to deliver immense practical value in the years to come. References [1] S. C. B. de Souza, N. Anquetil, K. M. de Oliveira, A study of the documentation essential to software maintenance, Conference on Design of communication, ACM, 2005, pp. 68–75. [2] L. Tan, D. Yuan, Y. Zhou, Hotcomments: how to make program comments more useful?, in: Conference on Programming language design and implementation (SIGPLAN), ACM, 2007, pp. 20–27. [3] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Smartkt: a search framework to assist program comprehension using smart knowledge transfer, in: 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2019, pp. 97–108. [4] N. Chatterjee, S. Majumdar, S. R. Sahoo, P. P. Das, Debugging multi-threaded applications using pin-augmented gdb (pgdb), in: International conference on software engineering research and practice (SERP). Springer, 2015, pp. 109–115. [5] S. Majumdar, N. Chatterjee, S. R. Sahoo, P. P. Das, D-cube: tool for dynamic design discovery from multi-threaded applications using pin, in: 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2016, pp. 25–32. [6] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, A mathematical framework for design discovery from multi-threaded applications using neural sequence solvers, Innovations in Systems and Software Engineering 17 (2021) 289–307. [7] S. Majumdar, N. Chatterjee, P. Pratim Das, A. Chakrabarti, Dcube_ nn d cube nn: Tool for dynamic design discovery from multi-threaded applications using neural sequence models, Advanced Computing and Systems for Security: Volume 14 (2021) 75–92. [8] J. Siegmund, N. Peitek, C. Parnin, S. Apel, J. Hofmeister, C. Kästner, A. Begel, A. Bethmann, A. Brechmann, Measuring neural efficiency of program comprehension, in: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 2017, pp. 140–150. [9] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, Codet5+: Open code large language models for code understanding and generation, arXiv preprint arXiv:2305.07922 (2023). [10] J. L. Freitas, D. da Cruz, P. R. Henriques, A comment analysis approach for program comprehension, Annual Software Engineering Workshop (SEW), IEEE, 2012, pp. 11–20. [11] D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, International Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 83–92. [12] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments using textual features and developer experience, International Conference on Mining Software Repositories (MSR), IEEE, 2017, pp. 215–226. [13] A. Bosu, M. Greiler, C. Bird, Characteristics of useful code reviews: An empirical study at microsoft, Working Conference on Mining Software Repositories, IEEE, 2015, pp. 146–156. [14] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation of comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022) e2463. [15] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-mine—a semantic search approach to program comprehension from code comments, in: Advanced Computing and Systems for Security, Springer, 2020, pp. 29–42. [16] S. Majumdar, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Overview of the irse track at fire 2022: Information retrieval in software engineering, in: Forum for Information Retrieval Evaluation, ACM, 2022. [17] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder, Can we predict useful comments in source codes?-analysis of findings from information retrieval in software engineering track@ fire 2022, in: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation, 2022, pp. 15–17. [18] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in neural information processing systems 33 (2020) 1877–1901. [19] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder, Generative ai for software metadata: Overview of the information retrieval in software engineering track at fire 2023, in: Forum for Information Retrieval Evaluation, ACM, 2023. [20] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low- dimensional software code representation using bert and elmo, in: 2022 IEEE 22nd International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2022, pp. 763–774.