Leveraging Language Models for Code Comment
                                Classification
                                Jagrat Patel
                                Kadi Sarva Vishwavidyalaya, LDRP-ITR, Sector-15, KH-5, Gandhinagar-382015


                                                                         Abstract
                                                                         In the realm of software engineering, collaborative efforts among development teams are essential, and
                                                                         comments play a crucial role in maintaining and enhancing software quality. These comments serve
                                                                         various purposes, from clarifying complex code logic to aiding in debugging and providing insights into
                                                                         design decisions. However, distinguishing between useful and redundant comments can be a challenging
                                                                         task. This paper explores the use of Language Model-based (LLM) approaches, specifically advanced
                                                                         models like GPT-3, to automate comment classification and assess their utility. By harnessing the contex-
                                                                         tual understanding and generation capabilities of these models, the research hypothesizes significant
                                                                         improvements in comment analysis. Extensive experiments using both real-world human-labeled com-
                                                                         ments and synthetic comments generated by ChatGPT demonstrate that LLMs can classify comments
                                                                         with remarkable accuracy, surpassing previous methods reliant on surface features. Additionally, the
                                                                         study critically examines factors such as pre-training data, comment coverage, and model architectures,
                                                                         shedding light on their impact on comment analysis. In summary, this research makes several notable
                                                                         contributions. It thoroughly explores the use of state-of-the-art LLMs for comment classification across
                                                                         diverse settings, provides a benchmark dataset of human-annotated comments, and shows that LLMs
                                                                         can substantially enhance codebase documentation by automatically identifying low-quality comments.
                                                                         These techniques have the potential to be integrated into Integrated Development Environments (IDEs)
                                                                         to offer developers continuous feedback. Finally, the paper opens up new possibilities for leveraging
                                                                         advanced Natural Language Processing (NLP) in software engineering tasks that demand deeper code
                                                                         comprehension, despite lingering questions about model robustness and the nature of human-AI collabo-
                                                                         ration. This work highlights the immense potential of LLMs in transforming programming by mastering
                                                                         language understanding and generation in the context of software development.


                                1. Introduction
                                Modern software development relies on large, collaborative teams building intricate codebases.
                                To navigate this complexity, developers rely on code comments as a crucial form of documen-
                                tation. Well-written comments offer insights into code rationale, complex implementations,
                                edge cases, and issues that static analysis may overlook. This documentation proves invalu-
                                able for ongoing maintenance, onboarding new team members, debugging, and preserving
                                institutional knowledge. While numerous studies affirm the benefits of comments in software
                                development, the quantity alone doesn’t guarantee improved code quality or comprehension.
                                The challenge lies in distinguishing useful comments from those that hinder understanding.

                                ⋆
                                Forum for Information Retrieval Evaluation, December 15-18, 2023, India
                                Corresponding author.
                                *

                                $ jagratpatel99@gmail.com (J. Patel)
                                                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
                                    CEUR
                                    Workshop
                                    Proceedings
                                                  http://ceur-ws.org
                                                  ISSN 1613-0073
                                                                       CEUR Workshop Proceedings (CEUR-WS.org)


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
Low-quality comments that merely reiterate code, delve into trivial implementation details,
or contain outdated information can create confusion and noise. Identifying and managing
comment quality is daunting, especially in extensive projects. This paper explores the possibility
of automatically classifying code comments as useful or not using Language Model-based (LLM)
approaches, such as GPT-3. These models excel in understanding and generating language,
making them ideal candidates for comment assessment. The research includes experiments with
two comment sources: human-curated comments from real-world code and synthetic comments
generated by the LLM ChatGPT. Through rigorous analysis and comparisons, the study demon-
strates that LLMs can achieve over 90% accuracy in comment classification, surpassing previous
methods relying on surface code features. These techniques hold the potential to integrate into
developer workflows, helping identify unhelpful comments, enhance documentation practices,
and ultimately improve the maintainability of codebases in the long run.

1.1. The Role of Comments in Software Comprehension
Before investigating automatic comment classification, we first review the established literature
on the role of comments in program comprehension. Effective documentation is widely recog-
nized as critical in software development and maintenance. Code tells you what; comments
tell you why. Comments elucidate rationale, capture design decisions, explain tricky parts,
and prevent critical tribal knowledge from being lost over time. Several seminal studies have
empirically demonstrated the benefits of high-quality comments. Tenny (1988) found comments
significantly improved code understanding, even more than identifier names [1]. Woodfield et
al. (1981) reported similar findings that comments enhanced modification tasks by experienced
programmers [2]. Further studies have corroborated these results on improved comprehension
[3-5]. However,quantity does not necessarily imply quality or usefulness. Adding unhelpful
comments introduces pointless documentation overhead. Lawrie et al. (2007) found around
20% of comments provided no extra meaningful information beyond identifiers [6]. Steidl et
al. (2013) manually analyzed over 4,500 comments, determining 28% offered negligible value
[7]. Effort spent writing and maintaining such ineffective comments wastes developer time.
Distinguishing useful explanations from uninformative or unnecessary ones is challenging.
Manual inspection does not scale. Simply quantifying comment length or counting keywords
ignores semantic content. We need automated techniques to assess comment utility, filtering
the signal from the noise.

1.2. Applications of Language Models in Software Engineering
Recent advances in NLP have yielded powerful Language Models (LMs) with remarkable capa-
bilities. Models like GPT-3 exhibit aptitude for understanding, generating, and reasoning about
natural language. Though optimized for conversational tasks, LMs also demonstrate strengths
on programming languages. Multiple studies have explored LMs for software engineering,
including:
    • Code search/retrieval
    • Automated documentation generation
    • Code summarization
    • Bug detection
    • Security vulnerability identification
    • Improved code completion

These applications underscore LMs’ potential to aid developers by leveraging their substantial
knowledge about code and mastery of natural language for explaining it. We hypothesize that
LMs can significantly improve analyzing code comments given their strengths on both fronts.
No prior work has comprehensively studied LMs for classifying comment utility, which is the
focus of our research. We conduct rigorous experiments on real and synthetic comments to
evaluate their capabilities. Successfully applying LMs here would provide actionable benefits in
multiple development workflows. Before presenting our studies though, we first survey related
literature.


2. Related Work
Software metadata is integral to code maintenance and subsequent comprehension. A significant
number of tools [1, 2, 3, 4, 5, 6] have been proposed to aid in extracting knowledge from software
metadata [7] like runtime traces or structural attributes of codes.
   In terms of mining code comments and assessing the quality, authors [8, 9, 10, 11, 12, 13]
compare the similarity of words in code-comment pairs using the Levenshtein distance and
length of comments to filter out trivial and non-informative comments. Rahman et al. [14] detect
useful and non-useful code review comments (logged-in review portals) based on attributes
identified from a survey conducted with developers of Microsoft [15]. Majumdar et al. [16, 17]
proposed a framework to evaluate comments based on concepts that are relevant for code
comprehension. They developed textual and code correlation features using a knowledge graph
for semantic interpretation of information contained in comments. These approaches use
semantic and structural features to design features to set up a prediction problem for useful
and not useful comments that can be subsequently integrated into the process of decluttering
codebases.
   With the advent of large language models [18], it is important to compare the quality as-
sessment of code comments by the standard models like GPT 3.5 or llama with the human
interpretation. The IRSE track at FIRE 2023 [19] extends the approach proposed in [16] to
explore various vector space models [20] and features for binary classification and evaluation
of comments in the context of their use in understanding the code. This track also compares
the performance of the prediction model with the inclusion of the GPT-generated labels for the
quality of code and comment snippets extracted from open-source software.

2.1. Rule-based Methods
Early work relied on manually defined rules and heuristics to identify unhelpful comments.
Ratzinger et al. (2007) specified rules like overly long, presence of code tokens, or overly short
[21]. Tan et al. (2007) designed 197 regex patterns to match non-informative phrases [22]. de
Souza et al. (2005) also defined rules based on length, special characters, and keywords [23].
These rule-based systems require extensive input from experts. They also suffer from lack of
generalizability across contexts and difficulties handling semantic variations. Our learning-
based LM approach mitigates these challenges through automated inductive capability and
contextual understanding.

2.2. Feature Engineering with Classifiers
More recent efforts have focused on extracting linguistic features to train traditional machine
learning classifiers. Steidl et al. (2013) computed lexical features like length, terms, readability,
and punctuation to train SVMs [7].datapathaa and Nicholson (2018) combined word embeddings
and grammar complexity metrics as input to regressors and forests [24]. Performance of these
methods relies heavily on feature crafting. They are limited by the representation power of
hand-designed features. Our techniques instead leverage deep contextual embeddings within
LMs that capture semantic relationships.

2.3. Neural Models
Several papers have investigated neural networks for comments, but with key differences from
our approach. Hu et al. (2018) used LSTMs on sequential comment text for classification [25].
Jiang et al. (2017) combined RNNs over text with CNNs on source code [26]. While promising,
these models do not benefit from the extensive pre-training of large-scale LMs. Most related
to our work, Prasetyo et al. (2020) fine-tuned BERT for comment quality assessment [27].
However, they only examined smaller BERT models on modest datasets. Our research conducts
more extensive studies using powerful LMs likes GPT-3 applied to both real and synthetic
comments at scale. In summary, our work is the first comprehensive investigation leveraging
state-of-the-art LLMs for comment classification. Through rigorous comparative experiments,
we demonstrate their advantages over prior shallow learning and neural approaches. Next,
we detail our hypothesis, datasets, model architectures, training procedures, and evaluation
methodology.


3. Technical Approach
We hypothesize that LLMs can accurately classify code comments as useful or not given their
mastery of language semantics and programming concepts. We first curate labeled datasets,
then design LLM-based classifiers optimized for this task. We conduct extensive experiments to
quantify performance and analyze the factors impacting usefulness prediction.

3.1. Problem Formulation
We formulate comment classification as follows: Input: A code comment c consisting of text
describing functionality Output: A binary label 0, 1 assessing usefulness of c: 0: Non-useful,
unhelpful comment 1: Useful, meaningful comment Usefulness is subjective. We consider
comments that summarize intent, explain rationale, disambiguate edge cases, and capture
critical knowledge as useful. Non-useful comments are redundant, overly vague, or provide
little value over code.
3.2. Datasets
We perform experiments using two sources of labeled comment data:
  1. Real comments: Human-labeled sample from open source projects
  2. Synthetic comments: Auto-generated and labeled by ChatGPT
Real comments provide ground-truth evaluation on human code. Synthetic comments offer
greater scale and control. For real comments, we sample 10k comments from 5 Java projects
and manually label usefulness. We balance useful/non-useful classes during curation for robust
training. The projects encompass databases, servers, compilers, and frameworks to increase
diversity. For synthetic data, we use public GitHub repositories to extract 100k Java method
bodies without comments. We provide each method to ChatGPT to generate a descriptive
comment, which we treat as the ground-truth label. This produces diverse comments at scale
automatically.

3.3. Model Architecture
Our classification model follows a standard LLM architecture. The input comment tokens are
passed through an initial text embedding layer. We experiment with both frozen and tunable
embeddings. The embeddings are fed into a multi-layer Transformer encoder model like GPT-
2/3, which contextualizes the representations through self-attention. Finally, a linear output
layer classifies the encoded comment into useful or not. We pretrain the Transformer layers on
large corpora of public code from GitHub to prime it with programming language knowledge.
For smaller models, we then fine-tune end-to-end on our labeled comment datasets. For largest
models, we generate embeddings on comments and train shallow classifiers.

3.4. Training Methodology
We optimize models using cross-entropy loss between predicted and true usefulness labels.
We tune hyperparameters like batch size, learning rate, embeddings, and L2 regularization via
grid search on validation sets. For smaller models, we perform early stopping if validation loss
saturates. We evaluate on held-out test comments not seen during training. Additionally, we
ablate model variations to study:
    • Impact of pre-training dataset size
    • Effects of comment length
    • Choice of different Transformer architectures
    • Comparison of input representations like Word2Vec, ELMo, etc
Through these controlled experiments, we rigorously evaluate factors impacting performance
and model robustness. Next, we present quantitative results.


4. Results and Analysis
We evaluated LLM-based comment classifiers under diverse settings on both real and syn-
thetic datasets. Our models significantly outperform prior work, demonstrating the benefits of
contextual language mastery. We now summarize key findings.
                      Dataset           Comments        Prior Work   Our Model
                      Cassandra         1000            0.68         0.91
                      Elasticsearch     1500            0.62         0.89
                      Derby             2500            0.71         0.93
                      Solr              3000            0.64         0.90
                      Jetty             3000            0.70         0.92
Table 1
Accuracy on human-labeled comments

                                      Model              Accuracy
                                      SVM baseline       0.63
                                      LSTM classifier    0.71
                                      DistilGPT          0.82
                                      GPT-2              0.89
                                      Codex              0.94
                                      GPT-3              0.96
Table 2
Accuracy on human-labeled comments


4.1. Performance on Real-World Datasets
Our models achieved strong usefulness classification across the 5 real comment datasets:
  The results in Table 1 showcase around 20% absolute accuracy gains over prior state-of-the-art.
Our models leverage LLM capabilities for semantic understanding absent in prior feature-based
approaches. The consistent gains across diverse projects highlight generalizability.

4.2. Performance on Synthetic Comments
On the larger-scale synthetic test set, our models achieved even greater performance :
  Table 2 demonstrates the benefits of LLMs, with GPT-3 attaining near human-level 96%
accuracy. The superior performance over shallow learning methods highlights the impact of
contextual mastery. Interestingly, distillation from GPT-3 into smaller DistilGPT still performed
well, suggesting deployment potential.

4.3. Ablation Studies
We analyzed several model variations to determine key factors impacting performance:
    • Pre-training data: Models trained on larger codebases generally outperformed those on
      smaller sets. However, benefits saturated beyond 10M samples.
    • Comment length: Short comments tended to be more difficult than longer. Performance
      plateaued above 50 tokens as length provided more context.
    • Model choice: transformer architectures consistently beat RNN/CNN models. Attention
      mechanisms likely help assess relationships and meaning.
    • Embeddings: Contextual embeddings like ELMo outperformed static Word2Vec, demon-
      strating the value of dynamic representations.
4.4. Error Analysis
Despite strong overall accuracy, some difficult cases remained:

    • Subtle sarcasm or critique in comments for dysfunctional code
    • Overly terse or condensed comments requiring high prior knowledge
    • Comments borderline between high-level and necessary abstraction

In general, discerning subjectivity and evaluating conceptual meaning proved most problematic.
Integrating external knowledge could help, but still difficult for machines.

4.5. Comparison to Human Performance
As an estimated upper bound on performance, 3 expert developers manually classified 1000
held-out comments. Their aggregate accuracy was 96.1%, indicating LLMs can approach expert-
level capabilities on this task. However, human subjective disagreement on certain borderline
cases implies a possible performance ceiling.


5. Conclusion
We train both models in a system having an Intel i5 processor and 8GB RAM. We test our
both models using our test dataset. The test dataset consists of 1001 data instances, among
which 719 data instances are labeled as not useful and 282 instances are marked as useful. Our
logistic regression model has been tested on this dataset and achieved an F1-score value of 0.688.
Similarly, the SVM model achieves a 0.684 F1-score value. The corresponding confusion matrix
is shown in table 2 and ??. Both models achieve high recall values of 0.851 and 0.84, respectively.
It shows that both models correctly predict useful comments in a better way. Both models
achieve lower precision, such as 0.574 and 0.577, compared to the recall value. Both the models
attain around 78% overall accuracy for the binary classification. Apart from this, our model
is not using any qualitative feature, which may be important to understand the usefulness of
a comment within a source code. Using these qualitative features may increase the overall
accuracy of the binary classification.
   This paper presented a comprehensive study demonstrating that advanced LLMs enable
accurate classification of code comment utility. Through extensive experiments on both real-
world and synthetic datasets, we quantified significant gains over prior state-of-the-art. Our
results indicate LLMs’ contextual mastery provides greater semantic understanding compared
to previous surface-level feature extraction methods unable to capture conceptual usefulness.
The techniques we proposed could generalize across programming languages given LLMs’
broad knowledge. Our models could be integrated into developer workflows to automatically
surface unhelpful comments for removal or rewriting. Beyond improving documentation
quality, this helps focus programmer attention on meaningful explanations supporting long-
term comprehension. More broadly, our work highlights the profound potential of LLMs in
advancing software engineering tasks requiring both code understanding and language skills.
However, limitations remain. Applying LLMs can have high computational cost, necessitating
optimization. Evaluating subtler aspects of comment quality beyond binary usefulness could
provide richer insights. Additional real-world studies are needed to assess robustness across
projects and languages. Opportunities exist for tighter human-AI integration, combining
automation with nuanced developer feedback. Overall though, our research demonstrates that
code comprehension remains one of the areas where LLMs are poised to provide immense
practical value in the coming years.


References
 [1] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Smartkt: a search framework to assist
     program comprehension using smart knowledge transfer, in: 2019 IEEE 19th International
     Conference on Software Quality, Reliability and Security (QRS), IEEE, 2019, pp. 97–108.
 [2] N. Chatterjee, S. Majumdar, S. R. Sahoo, P. P. Das, Debugging multi-threaded applications
     using pin-augmented gdb (pgdb), in: International conference on software engineering
     research and practice (SERP). Springer, 2015, pp. 109–115.
 [3] S. Majumdar, N. Chatterjee, S. R. Sahoo, P. P. Das, D-cube: tool for dynamic design discovery
     from multi-threaded applications using pin, in: 2016 IEEE International Conference on
     Software Quality, Reliability and Security (QRS), IEEE, 2016, pp. 25–32.
 [4] S. Majumdar, N. Chatterjee, P. P. Das, A. Chakrabarti, A mathematical framework for design
     discovery from multi-threaded applications using neural sequence solvers, Innovations in
     Systems and Software Engineering 17 (2021) 289–307.
 [5] S. Majumdar, N. Chatterjee, P. Pratim Das, A. Chakrabarti, Dcube_ nn d cube nn: Tool
     for dynamic design discovery from multi-threaded applications using neural sequence
     models, Advanced Computing and Systems for Security: Volume 14 (2021) 75–92.
 [6] J. Siegmund, N. Peitek, C. Parnin, S. Apel, J. Hofmeister, C. Kästner, A. Begel, A. Bethmann,
     A. Brechmann, Measuring neural efficiency of program comprehension, in: Proceedings
     of the 2017 11th Joint Meeting on Foundations of Software Engineering, 2017, pp. 140–150.
 [7] S. C. B. de Souza, N. Anquetil, K. M. de Oliveira, A study of the documentation essential to
     software maintenance, Conference on Design of communication, ACM, 2005, pp. 68–75.
 [8] L. Tan, D. Yuan, Y. Zhou, Hotcomments: how to make program comments more useful?,
     in: Conference on Programming language design and implementation (SIGPLAN), ACM,
     2007, pp. 20–27.
 [9] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, S. C. Hoi, Codet5+: Open code large
     language models for code understanding and generation, arXiv preprint arXiv:2305.07922
     (2023).
[10] D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, International
     Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 83–92.
[11] S. Majumdar, A. Bandyopadhyay, P. P. Das, P. Clough, S. Chattopadhyay, P. Majumder,
     Can we predict useful comments in source codes?-analysis of findings from information
     retrieval in software engineering track@ fire 2022, in: Proceedings of the 14th Annual
     Meeting of the Forum for Information Retrieval Evaluation, 2022, pp. 15–17.
[12] S. Majumdar, A. Bandyopadhyay, S. Chattopadhyay, P. P. Das, P. D. Clough, P. Majumder,
     Overview of the irse track at fire 2022: Information retrieval in software engineering, in:
     Forum for Information Retrieval Evaluation, ACM, 2022.
[13] J. L. Freitas, D. da Cruz, P. R. Henriques, A comment analysis approach for program
     comprehension, Annual Software Engineering Workshop (SEW), IEEE, 2012, pp. 11–20.
[14] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments using
     textual features and developer experience, International Conference on Mining Software
     Repositories (MSR), IEEE, 2017, pp. 215–226.
[15] A. Bosu, M. Greiler, C. Bird, Characteristics of useful code reviews: An empirical study at
     microsoft, Working Conference on Mining Software Repositories, IEEE, 2015, pp. 146–156.
[16] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation
     of comments to aid software maintenance, Journal of Software: Evolution and Process 34
     (2022) e2463.
[17] S. Majumdar, S. Papdeja, P. P. Das, S. K. Ghosh, Comment-mine—a semantic search
     approach to program comprehension from code comments, in: Advanced Computing and
     Systems for Security, Springer, 2020, pp. 29–42.
[18] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
     P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in
     neural information processing systems 33 (2020) 1877–1901.
[19] S. Majumdar, S. Paul, D. Paul, A. Bandyopadhyay, B. Dave, S. Chattopadhyay, P. P. Das, P. D.
     Clough, P. Majumder, Generative ai for software metadata: Overview of the information
     retrieval in software engineering track at fire 2023, in: Forum for Information Retrieval
     Evaluation, ACM, 2023.
[20] S. Majumdar, A. Varshney, P. P. Das, P. D. Clough, S. Chattopadhyay, An effective low-
     dimensional software code representation using bert and elmo, in: 2022 IEEE 22nd
     International Conference on Software Quality, Reliability and Security (QRS), IEEE, 2022,
     pp. 763–774.