A New Dataset for Source Code Comment Coherence

Anna Corazza
DIETI, Univ. di Napoli Federico II
anna.corazza@unina.it

Valerio Maggio
Fondazione Bruno Kessler
vmaggio@fbk.eu

Giuseppe Scanniello
Dept. of Mathematics, Information Technology, and Economics, Univ. della Basilicata
giuseppe.scanniello@unibas.it

Abstract

Source code comments provide useful insights into a codebase and into the intent behind design decisions and goals. Often, the information provided in the comment of a method and in its corresponding implementation may not be coherent with each other (i.e., the comment does not properly describe the implementation). There could be several motivations for this issue (e.g., the comment and the source code do not evolve coherently). In this paper, we present the results of a manual assessment of the coherence between the comments and the implementations of 3,636 methods, gathered from 4 open-source Java systems. The results of this assessment have been collected in a dataset that we made publicly available on the web. We also sketch here the protocol used to create this dataset.

1 Introduction

Natural language is used in different ways in the development process of a software system, and therefore natural language processing (NLP) and information retrieval (IR) techniques are more and more frequently integrated into software development and maintenance tools. Many automated or semi-automated techniques have been proposed to aid developers in the comprehension and evolution of existing systems, e.g., (Corazza et al., 2016; Scanniello et al., 2010).

Natural language information is provided in source code comments and in the names of identifiers. In the former case, standard natural language is usually adopted, although it tends to be quite technical. Comments are written in English, even when developers have different mother tongues. Identifiers, on the other hand, are typically constructed by composing multiple terms and abbreviations. Therefore, more sophisticated techniques are necessary to extract the lexical information contained in each identifier (Corazza et al., 2012); a minimal splitting sketch is shown below.
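As an illustration of why identifiers need dedicated processing, the following sketch splits an identifier into its constituent terms with a simple underscore and camel-case heuristic. The sketch is ours and deliberately naive: approaches such as LINSEN (Corazza et al., 2012) also expand abbreviations, which a plain regular expression cannot do.

import java.util.Arrays;
import java.util.List;

// Illustrative heuristic splitter (ours, not LINSEN): breaks an identifier
// at underscores and at camel-case boundaries, including acronym runs.
final class IdentifierSplitter {

    static List<String> split(String identifier) {
        return Arrays.asList(identifier.split(
                "_|(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"));
    }

    public static void main(String[] args) {
        System.out.println(split("setTickLabelsVisible")); // [set, Tick, Labels, Visible]
        System.out.println(split("XMLParser"));            // [XML, Parser]
    }
}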
Most of these techniques assume that the same words are used whenever referring to a particular concept (Lawrie et al., 2010). In many cases, this represents an oversimplification: methods are often modified without updating the corresponding comments (Salviulo and Scanniello, 2014). In these cases, comments might convey information unrelated to or inconsistent with the corresponding implementation. Nevertheless, comments are extremely important because they are expected to convey the main intent behind design decisions, along with some implementation details (e.g., the types of parameters and of returned values).

Therefore, more sophisticated models are necessary to determine whether there is coherence between the lexicon provided in a comment and that in its corresponding source code. There exists coherence between the lead comment of a method and its source code (also simply coherence, from here on) if that comment describes the intent of the method and its actual implementation.

In this work, we focus on the lead comments of methods. This kind of comment precedes the definition of a given method and is supposed to provide its documentation and details about the implementation. We discuss here a dataset we made publicly available on the web (www2.unibas.it/gscanniello/coherence/). It contains annotations about the coherence of 3,636 methods collected from 4 implementations of 3 open-source projects written in Java. The protocol defined for its creation is also sketched, to give researchers the opportunity to extend it. Further details on this protocol can be found in (Corazza et al., 2015).

For the assessment of the quality of the annotation, with special focus on computational linguistics applications, a few indexes have been considered (Di Eugenio and Glass, 2004; Artstein and Poesio, 2008; Mathet et al., 2015), among which the kappa index (Cohen, 1960) is the most widely adopted because of its favorable characteristics. The inter-annotator agreement has therefore been assessed with this index.

We expect that making this dataset freely available could give impulse to research on approaches for assessing the coherence between the implementation of a method and its lead comment. In fact, although no such approach has been proposed yet, approaches of this kind could be of great help for software maintenance and evolution activities.

The paper is structured as follows. In Section 2, we discuss the methodology used to create our dataset. A description of the main characteristics of the dataset is given in Section 3, while in Section 4 the annotation is assessed. Some final considerations conclude the paper.

2 Dataset Construction

To create our dataset, we adopted the perspective-based and checklist-based review methods (Wohlin et al., 2012). The perspective is the one of the Researcher, aiming at assessing the coherence between the lead comment of a method and its implementation. The creation process is based on the following elements:

1. Working Meetings. We used meetings to determine the goals of our research work and the process to create the dataset.

2. Dataset Creation. We instantiated the defined process to create our dataset.

3. Outcomes. We gathered results during and after the creation of our dataset.

4. Publishing Results and Dataset. We shared our experience with the community in (Corazza et al., 2015) and released the dataset on the web.

The construction of the dataset was completed in two main consecutive phases by using an ad-hoc web system implemented for the purpose (the decision rule resulting from these two phases is sketched after the list):

• Verify coherence. Annotators verify, by means of a checklist, the coherence between the lead comments of a set of methods and their corresponding implementations.

• Resolve conflicts. The intervention of experts is required whenever the judgements of the annotators differ. In our case, two of the authors, with a background in software engineering, assumed the role of experts and examined the problematic cases. For each conflicting method, the experts should reach an agreement about its coherence or non-coherence. Methods on which the experts do not reach a consensus are automatically discarded.
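Taken together, the two phases amount to a simple per-method decision rule: keep the annotators' judgment when they agree, defer to the experts on a conflict, and discard the method when the experts reach no consensus. The following is a minimal sketch of this rule; the type and method names are ours, not those of the web system used by the authors, and the three judgment values are the ones described in Section 4.

import java.util.Optional;

// Judgments assigned by annotators (see Section 4).
enum Judgment { NON_COHERENT, DONT_KNOW, COHERENT }

final class ConflictResolution {

    // Decision rule implied by the two phases: if the two annotators agree,
    // their shared judgment is kept; otherwise the experts' consensus is
    // used, and an empty result means the method is discarded.
    static Optional<Judgment> resolve(Judgment first, Judgment second,
                                      Optional<Judgment> expertConsensus) {
        if (first == second) {
            return Optional.of(first); // agreement: keep the shared judgment
        }
        return expertConsensus;        // conflict: experts decide or discard
    }
}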
3 Dataset Description

Some descriptive statistics (e.g., the number of classes and methods) of the software systems in our dataset are shown in Table 1.

• CoffeeMaker (agile.csc.ncsu.edu/SEMaterials/tutorials/coffee_maker/) is a software system to manage inventory and recipes and to purchase beverages. We chose this software because it has a simple and clear design, having been developed for educational purposes.

• JFreeChart (www.jfree.org/jfreechart/) is a Java tool supporting the visualization of data charts (e.g., scatter plots and histograms). We included two versions of this software. As reported in Table 1, both versions contain lead comments for almost 80% of their methods. This suggests an extensive use of comments, which is the main reason why we decided to include this software in the dataset.

• JHotDraw (www.jhotdraw.org/) is a framework for technical and structured graphics. Even if the source code of JHotDraw is scarcely commented (see Table 1), it is well known in the software maintenance community for its good object-oriented design (Scanniello et al., 2010).

Table 1: Descriptive statistics of the software systems in the dataset: Nf stands for the number of files, Nc for the number of classes, Nm for the number of methods, and Nm* for the number of methods with lead comments.

System          Version  Nf   Nc   Nm    Nm*
CoffeeMaker     -        7    7    51    47 (92%)
JFreeChart 6.0  0.6.0    82   83   617   485 (79%)
JFreeChart 7.1  0.7.1    124  127  807   624 (77%)
JHotDraw        7.4.1    575  692  6414  2480 (39%)

In Figure 1, we report the implementation and the lead comment of setTickLabelsVisible (extracted from JFreeChart ver. 0.7.1, included in our dataset). According to our definition of coherence, we can assert that this method is coherent.

Figure 1: The lead comment and the implementation of the method setTickLabelsVisible of JFreeChart 0.7.1.

/**
 * Sets the flag that determines whether or not
 * the tick labels are visible.
 * Registered listeners are notified of a
 * general change to the axis.
 *
 * @param flag The flag to set.
 */
public void setTickLabelsVisible(boolean flag) {
    if (flag != tickLabelsVisible) {
        tickLabelsVisible = flag;
        notifyListeners(new AxisChangeEvent(this));
    }
}

Figure 2: A non-coherent method in JHotDraw 7.4.1.

// GEN-FIRST:event_save
// Code for dispatching events from components
// to event handlers.
private void save(java.awt.event.ActionEvent evt) {
    try {
        String methodName = getParameter("datawrite");
        if (methodName.indexOf('(') > 0) {
            methodName = methodName.substring(0, methodName.indexOf('(') - 1);
        }
        JSObject win = JSObject.getWindow(this);
        Object result = win.call(methodName, new Object[]{getData()});
    } catch (Throwable t) {
        TextFigure tf = new TextFigure("Fehler: " + t);
        AffineTransform tx = new AffineTransform();
        tx.translate(10, 20);
        tf.transform(tx);
        getDrawing().add(tf);
    }
}
On the other hand, the lead comment of the save method reported in Figure 2 provides a very poor and inadequate description of the design intent of the method, thus reflecting a lack of coherence with the underlying implementation.

Three annotators were involved in the dataset creation process. Two of them hold a Bachelor's degree in Computer Science and have very similar technical backgrounds. The third annotator can be considered more experienced than the other two, since he holds a Master's degree in Computer Science. We distributed the effort among the annotators so that each software system would be separately evaluated by at least two annotators. This allowed us to have multiple judgements for each method in the dataset, and to calculate the rate of agreement among the annotators.

4 Annotation Assessment

The whole dataset creation process took place from January 15th, 2014 to June 20th, 2014, for a total of 800 man-hours. This gives an estimation of the effort required to conduct the study presented in this paper, and provides an indication to researchers interested in extending our dataset.

The annotators provided indications on the coherence of methods by assigning to each of them one of the following three possible values: Non-Coherent, Don't Know, and Coherent.

In this scenario, we use the kappa index (Cohen, 1960) to obtain an assessment of the agreement among annotators, thus estimating the reliability of their evaluations. In fact, if the annotators agree on a large number of methods, we can conclude that their annotations are reliable. The kappa index is designed for categorical judgments and relates the agreement rate to the rate of chance agreement:

    \kappa = \frac{p_o - p_c}{1 - p_c},    (1)

where p_o is the observed probability of agreement and p_c is the chance probability of agreement. Both probabilities are estimated by the corresponding frequencies. By a simple algebraic manipulation, Equation 1 can be written as:

    \kappa = 1 - \frac{q_o}{q_c},    (2)

where q_o = 1 - p_o and q_c = 1 - p_c correspond to the observed and the chance probabilities of disagreement, respectively. Usually the index assumes values in ]0, 1] (the notation means that 1 is included in the interval, while 0 is not), as it can be expected that the observed disagreement is less likely than chance. A null value signals that the observed disagreement is exactly as likely as chance, while the kappa index assumes negative values in the unwanted case where disagreement is more likely than chance. Perfect agreement corresponds to \kappa = 1. Values greater than 0.80 are usually considered a cue of good agreement, and values in the interval [0.67, 0.80] are considered acceptable (Cohen, 1960).
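To make Equation 2 concrete, the following is a minimal illustrative sketch of the computation; it is ours, not the authors' tooling, and the integer label encoding is an assumption. The index is expressed directly in the disagreement form of Equation 2, parameterized by a matrix of disagreement weights over label pairs: with all off-diagonal weights equal to 1 it yields the unweighted index, while the matrix shown anticipates the half-weight scheme for Don't Know disagreements discussed next.

// Illustrative sketch (ours) of the kappa index in the disagreement form of
// Equation 2: kappa = 1 - q_o / q_c.
final class KappaIndex {

    // Disagreement weights over {0 = Non-Coherent, 1 = Don't Know, 2 = Coherent}:
    // zero on the diagonal (agreement), half weight when one answer is neutral.
    static final double[][] WEIGHTS = {
            {0.0, 0.5, 1.0},
            {0.5, 0.0, 0.5},
            {1.0, 0.5, 0.0}
    };

    // labels1[i] and labels2[i] are the two annotators' judgments on method i.
    static double kappa(int[] labels1, int[] labels2, double[][] w) {
        int n = labels1.length;
        int k = w.length;
        double[][] joint = new double[k][k]; // observed joint frequencies
        double[] m1 = new double[k];         // marginal frequencies, annotator 1
        double[] m2 = new double[k];         // marginal frequencies, annotator 2
        for (int i = 0; i < n; i++) {
            joint[labels1[i]][labels2[i]] += 1.0 / n;
            m1[labels1[i]] += 1.0 / n;
            m2[labels2[i]] += 1.0 / n;
        }
        double qo = 0.0, qc = 0.0;           // observed and chance disagreement
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                qo += w[i][j] * joint[i][j];
                qc += w[i][j] * m1[i] * m2[j];
            }
        }
        return 1.0 - qo / qc;
    }
}

A call such as kappa(labels1, labels2, WEIGHTS) returns the weighted index (WK in Table 2); replacing every 0.5 in WEIGHTS with 1.0 yields the unweighted index (UK).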
The classical formulation of the kappa index considers a binary classification problem (e.g., Non-Coherent vs. Coherent). In our case, however, the neutral judgement (i.e., Don't Know) is also allowed. Therefore, possible disagreements include the case where one of the two answers is the neutral one. In this case, it is possible to weigh the possible disagreements among annotators differently. In fact, disagreements due to neutral answers are less serious than disagreements where the judgments are totally divergent (i.e., Coherent and Non-Coherent, in our case). To this end, Cohen (1968) presents a variant of the kappa index where different weights can be applied in case of disagreement. If the same weight is assigned to all possible disagreement combinations, the original (unweighted) formulation is obtained. The formulation of the Weighted Kappa (WK) is the one in Equation 2, but for the computation of q_o and q_c the contributions are weighted according to the importance given to the corresponding disagreement cases. By contrast, we refer to the original formulation of the kappa index as Unweighted Kappa (UK). We assign to the Don't Know response a weight that is half the weight assigned to the Non-Coherent (or Coherent) one. This is the same scheme reported by Cohen (1968). The weighted and unweighted kappa indexes are reported in Table 2.

Table 2: Agreement rate between judges as computed by Cohen's kappa index.

System          UK     WK
CoffeeMaker     0.913  0.913
JFreeChart 6.0  0.939  0.918
JFreeChart 7.1  0.983  0.977
JHotDraw        0.824  0.684

The agreement between annotators is good on the first three systems, and acceptable for JHotDraw. However, on this system the difference between the values for UK and WK is large; the weighted index thus provides a more accurate indication of the agreement between the evaluations on this software.

At the end of the first step of the dataset creation process, the number of methods on which the annotators did not agree was 302, corresponding to 8.3% of the total number of methods from all the systems in the dataset. Most of these methods belong to JHotDraw, as suggested by the kappa index values (see Table 2). These methods were reviewed by two of the authors. An agreement was reached on all of them, and they were then included in the dataset. The total number of methods in the dataset is reported in Table 3 (i.e., 2,883).

Table 3: Descriptive statistics of the dataset: C stands for Coherent, NC for Non-Coherent. Don't Know methods are not included in the totals.

System          C     NC    Total  Don't Know
CoffeeMaker     27    20    47     0
JFreeChart 6.0  406   55    461    24
JFreeChart 7.1  520   68    588    36
JHotDraw        762   1025  1787   693
Total           1715  1168  2883   753

5 Conclusions and future work

In this paper, we have presented the early steps of our research on the coherence between the lead comments of methods and their implementations. In particular, we have provided a description of the problem setting, along with the experimental protocol defined to create our dataset, which we made publicly available on the web. We have also sketched the results of a quantitative analysis conducted on a codebase of 3,636 methods, gathered from 4 different open-source systems written in Java.

There are many possible future directions for our research. For example, it would be interesting to conduct an empirical study investigating the effect of maintenance operations on coherence across multiple versions of the same system. As a step in this direction, we have already included two versions of the JFreeChart system in the dataset. Our results and those by Fluri et al. (2007) represent a viable starting point.
Finally, we would like to exploit the collected data as an evaluation set to assess the performance of approaches able to discern method coherence.

References

Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596, December.

J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

J. Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213–220.

A. Corazza, S. Di Martino, and V. Maggio. 2012. LINSEN: An efficient approach to split identifiers and expand abbreviations. In Proceedings of the International Conference on Software Maintenance, pages 233–242. IEEE Computer Society.

A. Corazza, V. Maggio, and G. Scanniello. 2015. On the coherence between comments and implementations in source code. In Proceedings of the 41st Euromicro Conference on Software Engineering and Advanced Applications (EUROMICRO-SEAA 2015), Madeira, Portugal, August 26–28, 2015, pages 76–83. IEEE Computer Society.

A. Corazza, S. Di Martino, V. Maggio, and G. Scanniello. 2016. Weighing lexical information for software clustering in the context of architecture recovery. Empirical Software Engineering, 21(1):72–103.

Barbara Di Eugenio and Michael Glass. 2004. The kappa statistic: A second look. Computational Linguistics, 30(1):95–101.

B. Fluri, M. Würsch, and H.C. Gall. 2007. Do code and comments co-evolve? On the relation between source code and comment changes. In Proceedings of the Working Conference on Reverse Engineering, pages 70–79. IEEE Computer Society.

D. Lawrie, D. Binkley, and C. Morrell. 2010. Normalizing source code vocabulary. In Proceedings of the Working Conference on Reverse Engineering, pages 3–12. IEEE Computer Society.

Yann Mathet, Antoine Widlöcher, and Jean-Philippe Métivier. 2015. The unified and holistic method gamma (γ) for inter-annotator agreement measure and alignment. Computational Linguistics, 41(3):437–479, September.

F. Salviulo and G. Scanniello. 2014. Dealing with identifiers and comments in source code comprehension and maintenance: Results from an ethnographically-informed study with students and professionals. In Proceedings of the International Conference on Evaluation and Assessment in Software Engineering, pages 423–432. ACM Press.

G. Scanniello, A. D'Amico, C. D'Amico, and T. D'Amico. 2010. Using the Kleinberg algorithm and vector space model for software system clustering. In Proceedings of the International Conference on Program Comprehension, pages 180–189. IEEE Computer Society.

C. Wohlin, P. Runeson, M. Höst, M.C. Ohlsson, B. Regnell, and A. Wesslén. 2012. Experimentation in Software Engineering. Springer.