Overview of the IRSE track at FIRE 2022: Information Retrieval in Software Engineering

Srijoni Majumdar 1,2,*,†, Ayan Bandyopadhyay 1,†, Samiran Chattopadhyay 1,3, Partha Pratim Das 2, Paul D Clough 4,5 and Prasenjit Majumder 1,6

1 TCG CREST, West Bengal, India
2 IIT Kharagpur, West Bengal, India
3 Jadavpur University, West Bengal, India
4 TPXimpact, London, UK
5 Sheffield University, Sheffield, UK
6 DA-IICT Gandhinagar, Gujarat, India

Forum for Information Retrieval Evaluation, December 9-13, 2022, India
* Corresponding author.
† These authors contributed equally.
majumdar.srijoni@gmail.com (S. Majumdar); bandyopadhyay.ayan@gmail.com (A. Bandyopadhyay); samiran.chattopadhyay@jadavpuruniversity.in (S. Chattopadhyay); ppd@cse.iitkgp.ac.in (P. P. Das); p.d.clough@sheffield.ac.uk (P. D. Clough); prasenjit.majumder@gmail.com (P. Majumder)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Code comments increase the readability of the surrounding code when they highlight concepts that are not evident from the source code itself. Evaluating the quality of code comments is therefore important to de-clutter large code bases and remove comments that are not useful. The Information Retrieval in Software Engineering (IRSE) track aims to develop solutions for the automated evaluation of code comments. The track hosts a binary classification task to classify comments as useful or not useful. The dataset consists of 9,048 code comment and surrounding code snippet pairs extracted from open-source, C-based GitHub projects. Overall, 34 experiments were submitted by 11 teams from various universities and software companies. The submissions were evaluated quantitatively using the macro F1-score and qualitatively based on the type of features developed, the supervised learning model used, and the corresponding hyper-parameters. The best performing systems mostly employ transformer architectures coupled with an embedding space related to software development.

Keywords
BERT, GPT-2, Stanford POS Tagging, neural networks, abstract syntax tree

1. Introduction

Assessing comment quality can help to de-clutter code bases and subsequently improve code maintainability. Comments can significantly help developers read and comprehend code if they are consistent and informative. Comment analysis approaches have mainly focused on detecting inconsistent comments [1, 2], but not appreciably on the quality and relevance of the information contained in a comment. A poorly written or superfluous comment that duplicates information already evident from source code identifiers can hinder the readability of code, even though it may be consistent [3, 4].

Several approaches have been proposed to classify comments based on explicit syntactic information, such as the presence of specific tags (e.g., @param, @deprecated), words, and symbols, or on implicit details, such as the type of associated code construct, the length of the comment, the parts of speech (POS) and dependency relations of comment words, or the cosine similarity of vector representations of words in code-comment snippets [5, 6, 7]. These approaches do not target comment quality evaluation based on an interpretation of the information contained in comments.
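To make the implicit-feature style of classification mentioned above concrete, the minimal Python sketch below computes the cosine similarity between TF-IDF vectors of a comment and its surrounding code. The snippet, tokenisation pattern, and example pair are illustrative assumptions only, not the implementation of any particular published approach.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical code-comment pair used only for illustration.
    comment = "uses png_calloc defined in pngriv.h"
    code = "PNG_FUNCTION(png_const_structrp png_ptr) { png_calloc(png_ptr); }"

    # Fit a shared TF-IDF vocabulary over both texts and vectorise them.
    vectorizer = TfidfVectorizer(token_pattern=r"[A-Za-z_]+")
    vectors = vectorizer.fit_transform([comment, code])

    # Cosine similarity between the comment vector and the code vector;
    # a high value suggests the comment mostly repeats code identifiers.
    similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
    print(f"comment-code cosine similarity: {similarity:.3f}")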
The syntactic methods used for comment classification need to be augmented to extract the semantics of a comment in order to develop an overall quality assessment model. Further, the perception of quality in terms of the 'usefulness' of the information contained in a comment is relative, and hence is perceived differently depending on the context. Bosu et al. [8] attempted to assess code review comments (logged in a separate tool) in terms of their utility in helping developers write better code, through a detailed survey at Microsoft. A similar quality assessment model to analyse the types of source code comments that can help with standard maintenance tasks is important but largely missing.

Majumdar et al. [4] proposed a comment quality evaluation framework wherein comments are assessed as 'useful', 'partially useful', or 'not useful' based on whether they increase the readability of the surrounding code snippets. The authors analyse comments for concepts that aid in code comprehension, as well as the redundancies or inconsistencies of these concepts with the related code constructs, in a machine learning framework for an overall assessment. The concepts were derived through exploratory studies with developers across 7 companies and from a larger community using crowd-sourcing.

The IRSE track of FIRE 2022 extends the work in [4] and empirically investigates comment quality with a larger set of machine learning solvers and features. The track aims to automate program comprehension tasks and subsequently reduce code maintenance overhead. In its first edition, the IRSE track is based on a task for the quality evaluation of comments into two classes: 'useful' and 'not useful'. A 'useful' comment (see Table 1) contains relevant concepts that are not evident from the surrounding code design, and thus increases the comprehensibility of the code. The suitability of analysing comment quality using various vector space representations of code and comment pairs, along with standard textual features and code-comment correlation links, is evaluated. A total of 34 experiments have been submitted by 11 teams.

2. Related Work

Several approaches exist that attempt to assess the quality of comments by detecting inconsistencies with source code or by classifying comments based on syntactic properties. Tan et al. [1, 9] use the sequence of occurrence of words (from an enumerated set) in a comment and the surrounding code to develop rules for detecting inconsistent comments related to memory errors. Ying et al. [10] undertake an empirical study to derive the attributes of the various categories of task comments used for developer communication in Java code. Storey et al. [11] presented a detailed study to understand how task comments are interpreted in larger projects during the different phases of the software lifecycle.

Table 1
Useful and Not Useful comments in the context of code comprehension

# | Comment | Code | Label
1 | /* uses png_calloc defined in pngriv.h */ | /* uses png_calloc defined in pngriv.h */ PNG_FUNCTION(png_const_structrp png_ptr) { if (png_ptr == NULL || info_ptr == NULL) return; png_calloc(png_ptr); ...} | U
2 | /* serial bus is locked before use */ | static int bus_reset (...) /* serial bus is locked before use */ { .. update_serial_bus_lock (bus * busR); } | NU
3 | // integer variable | int Delete_Vendor; // integer variable | NU

U: Useful; NU: Not Useful
Comment quality evaluation: Steidl et al. [7] propose a comment quality detection method that compares the similarity of words in code-comment pairs using the Levenshtein distance, and uses the length of comments to filter out trivial and non-informative comments. Rahman et al. [12] detect useful and non-useful code review comments (logged in review portals) based on attributes identified from a survey conducted with developers at Microsoft [8]. They use textual features (Table 2) and train decision tree and naive Bayes algorithms on a set of 1,200 review comments for automated quality assessment. Recent work in the Declutter Challenge of DocGen2 by Liu et al. [13] detects 'not useful' comments using textual and structural features (Table 2) in a machine learning framework. Majumdar et al. [4] proposed a framework to evaluate comments based on concepts that are relevant for code comprehension. They developed textual and code correlation features using a knowledge graph for the semantic interpretation of the information contained in comments (Table 2).

The available approaches mostly evaluate the quality of comments by mining for irrelevant words and phrases, coupled with repetitiveness with respect to the surrounding constructs. The context in which quality is defined is essential; for example, Rahman et al. [12] assess comments based on attributes limited to code review comments only. Similarly, Majumdar et al. [4] analysed comments by mining concepts that are relevant to code comprehension and can aid in software maintenance tasks. The IRSE track extends the approach proposed in [4] to explore various vector space models and features for the binary classification and evaluation of comments.

Table 2
Characteristics of the available quality assessment approaches

Parameters | Steidl et al. [7] | Rahman et al. [12] | Liu et al. [13] | Majumdar et al. [4]
Textual Features | Length of comments; stop word ratio | Numerical estimation of reading ease; type of comment (document level, block, and line); % of interrogative sentences; % of code constructs in review comments; number of commits on a file authored/reviewed by a developer; % of external libraries in review comments | Comment length; construct names in comments; significant words ratio; descriptive/conditional | Comment length; % occurrence: concept mining for software development domains; occurrence of commits, version details
Code Correlation Features | c_coeff: similar words between code and comment based on Levenshtein distance < 2 | Lexical similarity between words of code and review comments using cosine distances | Similarity between words of code and review comments using cosine distances; percentage of exact match of words between code and comment | Syntactic and semantic matches using vector space; structural match using abstract syntax tree in the form of a knowledge graph
Code Scope | In comment before or after method | Code constructs mentioned in review comments | - | Code constructs in next line of comment → method name; inline comment → constructs in the same line
Vector Space Representation | None | tf-idf | fastText (English corpora) | Word2Vec, ELMo (software concepts corpora)
Dataset | 1,330 comments (C++ and Java) | 1,200 review comments logged in CodeFlow | 1,194 annotated comments from the JabRef project [14] | 20,206 C comments (from 5 GitHub projects)
Quality classes | Not Useful | Useful and Not Useful | Informative and Non-Informative | Useful, Not Useful, and Partially Useful
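To make the code correlation features summarised in Table 2 concrete, the Python sketch below approximates the Levenshtein-style word matching underlying the c_coeff feature of Steidl et al. [7]: a comment word is considered correlated with the code when its edit distance to some code identifier token is below 2. The tokenisation and the example pair are simplified assumptions for illustration only, not the authors' implementation.

    import re

    def edit_distance(a: str, b: str) -> int:
        """Classic dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def correlated_words(comment: str, code: str, max_dist: int = 2):
        """Return comment words that (nearly) match a code identifier token."""
        words = re.findall(r"[A-Za-z]+", comment.lower())
        # Split identifiers on non-letters so snake_case parts match individually.
        idents = re.findall(r"[A-Za-z]+", code.lower())
        return [w for w in words if any(edit_distance(w, t) < max_dist for t in idents)]

    # Hypothetical example pair (illustration only); prints ['serial', 'bus'].
    print(correlated_words("serial bus is locked before use",
                           "update_serial_bus_lock(bus * busR);"))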
3. IRSE Track Overview and Data Set

The following section outlines the task description and the characteristics of the dataset.

3.1. Task Description

Comment Classification: A binary classification task to classify source code comments as Useful or Not Useful, given a comment and its associated code pair as input.

Input: A code comment with its surrounding code snippet (written in C)
Output: A label (Useful or Not Useful) that characterises whether the comment helps developers comprehend the associated code

In this classification task, the output therefore depends on whether the information contained in the comment is relevant and would help to comprehend the surrounding code, i.e., whether it is useful.

Useful: The comment contains sufficient software development concepts → the comment is Relevant, and these concepts are mostly not present in the surrounding code → the comment is not Redundant; hence the comment is Useful.

Not Useful: The comment contains sufficient software development concepts → the comment is Relevant, but these concepts are mostly present in the surrounding code → the comment is Redundant; hence the comment is Not Useful. It may also be the case that the comment does not contain sufficient software development concepts → the comment is Not Relevant; hence the comment is Not Useful.

It is left to the participants to decide on the threshold values for how many retrieved concepts make a comment relevant, or how many matches with the surrounding code make a comment redundant.
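One possible reading of these relevance and redundancy criteria is sketched below in Python: count hypothetical software development concept terms in the comment, then measure its lexical overlap with the code, and apply thresholds. The concept list and the threshold values are assumptions made for this example; participants were free to choose their own criteria, features, and models.

    import re

    # Hypothetical list of software development concept terms (illustration only).
    CONCEPT_TERMS = {"algorithm", "exception", "version", "interface", "buffer",
                     "lock", "thread", "hash", "matrix", "module"}

    def tokens(text: str) -> set:
        return set(re.findall(r"[A-Za-z]+", text.lower()))

    def classify(comment: str, code: str,
                 min_concepts: int = 1, max_overlap: float = 0.5) -> str:
        """Toy Useful / Not Useful decision from concept count and code overlap."""
        words = tokens(comment)
        concepts = words & CONCEPT_TERMS
        if len(concepts) < min_concepts:
            return "Not Useful"                       # not relevant
        overlap = len(words & tokens(code)) / max(len(words), 1)
        return "Not Useful" if overlap > max_overlap else "Useful"  # redundant vs. useful

    print(classify("// increments the counter", "counter++;"))          # Not Useful (no concepts)
    print(classify("/* hash map of active threads, keyed by thread id */",
                   "static map_t m;"))                                  # Useful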
3.1.1. What is a Relevant Code Comment?

The notion of relevant comments refers to those that developers perceive as important in comprehending the associated or surrounding lines of code. These concepts are related to the outline of the algorithm, data-structure descriptions, mapping to user interface details, possible exceptions, version details, etc. In the examples below, the comments highlight useful details about the input data to the function, which are not evident from the associated code itself.

    # works on a two dimensional data matrix (each of size 8) generated from the light rider bot module
    int* flood_fill(self, position, visited) {...}

    /* uses png_calloc defined in pngriv.h */
    PNG_FUNCTION(png_voidp, PNGAPI png_calloc,
        (png_const_structrp png_ptr, png_alloc_size_t size), PNG_ALLOCATED)
    { }

Therefore, a relevant comment provides additional information about the surrounding code and subsequently aids comprehension, which can improve software maintenance.

Relevant, but Redundant: However, in the example below, even though the comment contains relevant information, that information is already available in the associated code, rendering the comment redundant.

    // PHP Shutdown method to destroy the global php hash map, using zend hash api's
    PHP_MSHUTDOWN_FUNCTION(hash) { ... zend_hash_destroy(&php_hashtable); ... }

3.2. Dataset

We select 5 projects from GitHub and use the modified random sampling approach of Cochran [15] to sample source files with equal probability, and hence provide an unbiased representation of the population (C code files with comments). We gather a total of 318 files with 20,206 comments.

Ground Truth Generation: For every comment, a label (Useful or Not Useful) has been generated by a team of 14 annotators. Every comment has been annotated by 2 annotators, with a kappa (κ) value of 0.734 (Cohen's metric [16]). The annotation process has been supervised through weekly meetings, brainstorming sessions, and peer review. Out of the total 16,000 comments, 2,285 comments were annotated by every individual annotator. A total of 156 man-hours were required to complete the annotation process.

For the IRSE track, we use a set of 9,048 comments (from GitHub) with comment text, surrounding code snippets, and a label that specifies whether the comment is useful or not. Sample data has been characterised in Table 1.

• The development dataset contains 8,048 rows of comment text, surrounding code snippets, and labels (Useful and Not Useful).
• The test dataset contains 1,000 rows of comment text, surrounding code snippets, and labels (Useful and Not Useful).

Table 3
Ranking by macro F1-Score

Team Name | No. of Runs | Model | Macro F1-Score
MentorX, Mentor Graphics, Siemens | 4 | Transformer - BERT (base: CodeBERT) | 0.9073
Empathy AI Team, Amazon | 2 | Transformer - GPT-2 | 0.90472
CITK, Kokrajhar | 2 | Transformer - BERT (base: uncased BERT) | 0.90471
iRel, Indian Institute of Technology Hyderabad | 6 | Transformer - BERT (base: ALBERT) | 0.8961
Conquerers, Indian Institute of Technology Dhanbad | 1 | SVM - Radial Basis Function | 0.8810
Charusat University | 6 | Random Forest | 0.834
Hunters, Jadavpur University | 3 | Logistic Regression | 0.83
Boomerang, Indian Institute of Technology Kharagpur | 2 | SVM - Radial Basis Function | 0.78
SpechTech, Indian Institute of Technology Kharagpur | 2 | Logistic Regression | 0.688
BioNLP, IISER Bhopal | 5 | SVM - Radial Basis Function | 0.53
YA, IISER Kolkata | 2 | Naive Bayes | 0.26

4. Participation and Evaluation

In its first edition, IRSE 2022 received a total of 34 experiments from 11 teams. As this track is related to software maintenance, we received participation from companies such as Amazon and Mentor Graphics, along with several research labs of educational institutes. The teams and the details of their submissions are summarised in Table 3.

Evaluation Procedure: Participants were asked to submit predicted labels ('useful' or 'not useful') for every data point in the test set of 1,000 comments. These predictions were used by our script to compute the precision, recall, and macro F1-score against the annotated (gold) labels.
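A minimal sketch of how such an evaluation script can be realised with scikit-learn is shown below; the file names and column names are hypothetical placeholders for illustration, not the track's actual evaluation code.

    import csv
    from sklearn.metrics import precision_recall_fscore_support

    def load_labels(path: str, column: str) -> list:
        """Read one label column ('Useful' / 'Not Useful') from a CSV file."""
        with open(path, newline="") as f:
            return [row[column] for row in csv.DictReader(f)]

    # Hypothetical file and column names for gold and predicted labels.
    gold = load_labels("test_gold.csv", "label")
    pred = load_labels("team_run.csv", "predicted_label")

    precision, recall, f1, _ = precision_recall_fscore_support(
        gold, pred, average="macro", zero_division=0)
    print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  Macro F1: {f1:.4f}")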
Features: Apart from evaluating the prediction metrics, we analysed the types of features the teams used to devise their machine learning pipelines. The teams performed routine pre-processing and retained only the significant words or characters for both the code and comment pairs. Further, some teams also used morphological features of a comment, such as its length, significant-word ratio, parts-of-speech characteristics, or the occurrence of words from an enumerated set, as textual features. To correlate code and comment and detect redundancies, the teams mostly used grep-like string matching to find similar words.

Vector Space Representations: Code and comments belong to different semantic granularities, which are unified by a vector space representation. The participants used various pre-trained embeddings to generate vectors for words, such as those based on one-hot encoding or tf-idf, word2vec, or context-aware models like ELMo and BERT. Each of the employed embedding models was trained or fine-tuned using software development corpora.

5. Results and Analysis

The F1-scores have been analysed based on the machine learning models used, the features, and the pre-trained embeddings used for projection to vector space. The dataset provided was balanced, with 4,015 useful and 4,033 not useful comments. The textual features used by the teams were mostly related to mining specific words and determining significant words. Similarly, almost all teams used string matching to locate overlapping words between code and comment. Significant differences were observed in terms of the pre-trained embeddings and the machine learning models used, which contributed to improvements in the F1-score.

Machine Learning Architectures: The best F1-score was obtained using a GPT architecture, although it is resource-intensive and more readily affordable for software companies. The other machine learning models commonly used are recurrent neural networks, support vector machines, random forests, and logistic regression with textual and correlation features. The F1-scores obtained using recurrent neural networks and support vector machines are comparable to those obtained from BERT. This is because the dataset is balanced, and also because various textual features were used in addition to numerical vectors (as features).

Pre-Trained Embeddings: Both context-aware and context-independent pre-trained embeddings, trained from scratch or fine-tuned with software development concepts, have been used. Results are better with the recently released CodeBERT pre-trained embeddings [17], in which natural language and programming language pairs from software projects of different domains are used for training with masked language modeling. Comparable results have been obtained using CodeELMo [4], which trains ELMo from scratch using software development corpora from books, journals, and code repositories.
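As an illustration of the kind of pipeline the better-scoring teams describe, the sketch below derives embeddings for code-comment pairs from a pre-trained CodeBERT checkpoint and feeds them to a simple classifier. It assumes the Hugging Face transformers library and the publicly available microsoft/codebert-base model; the toy training pairs are hypothetical, and the whole snippet is a schematic baseline, not any team's actual submission.

    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    encoder = AutoModel.from_pretrained("microsoft/codebert-base")

    def embed(comment: str, code: str) -> list:
        """Encode a comment-code pair and return the first-token vector."""
        inputs = tokenizer(comment, code, return_tensors="pt",
                           truncation=True, max_length=256)
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state
        return hidden[0, 0, :].tolist()   # pair representation

    # Hypothetical toy training data (comment, code, label) for illustration.
    pairs = [("uses png_calloc defined in pngriv.h", "PNG_FUNCTION(...)", "Useful"),
             ("integer variable", "int Delete_Vendor;", "Not Useful")]
    X = [embed(c, s) for c, s, _ in pairs]
    y = [label for _, _, label in pairs]

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.predict([embed("serial bus is locked before use",
                             "static int bus_reset() { ... }")]))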
6. Conclusions

The IRSE track, in its first edition, empirically investigates various approaches in a machine learning framework for automated comment quality evaluation. The comments are evaluated based on whether they contain information that can aid in understanding the surrounding code. A total of 11 teams participated and submitted 34 experiments that used various types of machine learning models, embedding spaces, and features. The best F1-score of 0.908 was reported by experiments that used a GPT-2 architecture with textual features and numerical features from CodeBERT vector space embeddings to classify comments as 'useful' or 'not useful'.

References

[1] L. Tan, D. Yuan, G. Krishna, Y. Zhou, iComment: Bugs or bad comments?, Association for Computing Machinery's Special Interest Group on Operating Systems Review (SIGOPS), ACM, 2007, pp. 145–158.
[2] I. K. Ratol, M. P. Robillard, Detecting fragile comments, International Conference on Automated Software Engineering (ASE), IEEE, 2017, pp. 112–122.
[3] J. L. Freitas, D. da Cruz, P. R. Henriques, A comment analysis approach for program comprehension, Annual Software Engineering Workshop (SEW), IEEE, 2012, pp. 11–20.
[4] S. Majumdar, A. Bansal, P. P. Das, P. D. Clough, K. Datta, S. K. Ghosh, Automated evaluation of comments to aid software maintenance, Journal of Software: Evolution and Process 34 (2022) e2463.
[5] L. Pascarella, A. Bacchelli, Classifying code comments in Java open-source software systems, International Conference on Mining Software Repositories (MSR), IEEE, 2017, pp. 227–237.
[6] D. Haouari, H. Sahraoui, P. Langlais, How good is your comment? A study of comments in Java programs, International Symposium on Empirical Software Engineering and Measurement (ESEM), IEEE, 2011, pp. 137–146.
[7] D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, International Conference on Program Comprehension (ICPC), IEEE, 2013, pp. 83–92.
[8] A. Bosu, M. Greiler, C. Bird, Characteristics of useful code reviews: An empirical study at Microsoft, Working Conference on Mining Software Repositories, IEEE, 2015, pp. 146–156.
[9] L. Tan, D. Yuan, Y. Zhou, HotComments: How to make program comments more useful?, in: Conference on Programming Language Design and Implementation (SIGPLAN), ACM, 2007, pp. 20–27.
[10] A. T. Ying, J. L. Wright, S. Abrams, Source code that talks: An exploration of Eclipse task comments and their implication to repository mining, ACM SIGSOFT Software Engineering Notes, ACM 30 (2005) 1–5.
[11] M.-A. Storey, J. Ryall, R. I. Bull, D. Myers, J. Singer, TODO or to bug, International Conference on Software Engineering (ICSE), IEEE, 2008, pp. 251–260.
[12] M. M. Rahman, C. K. Roy, R. G. Kula, Predicting usefulness of code review comments using textual features and developer experience, International Conference on Mining Software Repositories (MSR), IEEE, 2017, pp. 215–226.
[13] M. Liu, Y. Yang, X. Peng, C. Wang, C. Zhao, X. Wang, S. Xing, Learning based and context aware non-informative comment detection, International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2020, pp. 866–867.
[14] M. Alver, N. Batada, JabRef: Cross-platform citation and reference management software, Open Source, 2003. https://github.com/JabRef/jabref, Last Accessed: December 12, 2020.
[15] J. Kotrlik, C. Higgins, Organizational research: Determining appropriate sample size in survey research, Information Technology, Learning, and Performance Journal 19 (2001) 43.
[16] N. Gisev, et al., Interrater agreement and interrater reliability: Key concepts, approaches, and applications, Research in Social and Administrative Pharmacy 9 (2013) 330–338.
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).