A New Dataset for Source Code Comment Coherence

Anna Corazza
DIETI, Univ. di Napoli Federico II
anna.corazza@unina.it

Valerio Maggio
Fondazione Bruno Kessler
vmaggio@fbk.eu

Giuseppe Scanniello
Dept. of Mathematics, Information Technology, and Economics, Univ. della Basilicata
giuseppe.scanniello@unibas.it

Abstract

Source code comments provide useful insights into a codebase and into the intent behind design decisions and goals. Often, the information provided in the comment of a method and in its corresponding implementation may not be coherent with each other (i.e., the comment does not properly describe the implementation). There could be several motivations for this issue (e.g., the comment and the source code do not evolve coherently). In this paper, we present the results of a manual assessment of the coherence between the comments and the implementations of 3,636 methods, gathered from 4 open-source Java systems. The results of this assessment have been collected in a dataset that we made publicly available on the web. We also sketch here the protocol used to create this dataset.

1 Introduction

Natural language is used in different ways in the development process of a software system, and therefore natural language processing (NLP) and information retrieval (IR) techniques are more and more frequently integrated into software development and maintenance tools. Many automated or semi-automated techniques have been proposed to aid developers in the comprehension and evolution of existing systems, e.g., (Corazza et al., 2016; Scanniello et al., 2010).

Natural language information is provided in source code comments and in the names of identifiers. In the former case, standard natural language is usually adopted, although it tends to be quite technical. Comments are written in English, even when developers have different mother tongues. Identifiers, on the other hand, are typically constructed by composing multiple terms and abbreviations. Therefore, more sophisticated techniques are necessary to extract the lexical information contained in each identifier (Corazza et al., 2012); a minimal splitting sketch is shown below.
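As an illustration of why identifiers need dedicated processing, the following sketch splits an identifier into its constituent terms with a simple underscore and camel-case heuristic. The sketch is ours and deliberately naive: approaches such as LINSEN (Corazza et al., 2012) also expand abbreviations, which a plain regular expression cannot do.

import java.util.Arrays;
import java.util.List;

// Illustrative heuristic splitter (ours, not LINSEN): breaks an identifier
// at underscores and at camel-case boundaries, including acronym runs.
final class IdentifierSplitter {

    static List<String> split(String identifier) {
        return Arrays.asList(identifier.split(
                "_|(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"));
    }

    public static void main(String[] args) {
        System.out.println(split("setTickLabelsVisible")); // [set, Tick, Labels, Visible]
        System.out.println(split("XMLParser"));            // [XML, Parser]
    }
}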
Most of these techniques assume that the same words are used whenever referring to a particular concept (Lawrie et al., 2010). In many cases, this represents an oversimplification: methods are often modified without updating the corresponding comments (Salviulo and Scanniello, 2014). In these cases, comments might convey information unrelated to or inconsistent with the corresponding implementation. Nevertheless, comments are extremely important because they are expected to convey the main intent behind design decisions, along with some implementation details (e.g., the types of parameters and of returned values).

Therefore, more sophisticated models are necessary to determine whether there is coherence between the lexicon provided in a comment and that in its corresponding source code. There exists coherence between the lead comment of a method and its source code (also simply coherence, from here on) if that comment describes the intent of the method and its actual implementation.

In this work, we focus on the lead comments of methods. This kind of comment precedes the definition of a given method and is supposed to provide its documentation and details about the implementation. We discuss here a dataset we made publicly available on the web (www2.unibas.it/gscanniello/coherence/). It contains annotations about the coherence of 3,636 methods collected from 4 implementations of 3 open-source projects written in Java. The protocol defined for its creation is also sketched, to give researchers the opportunity to extend it. Further details on this protocol can be found in (Corazza et al., 2015).

For the assessment of the quality of the annotation, with special focus on computational linguistics applications, a few indexes have been considered (Di Eugenio and Glass, 2004; Artstein and Poesio, 2008; Mathet et al., 2015), among which the kappa index (Cohen, 1960) is the most widely adopted because of its favorable characteristics. The inter-annotator agreement has therefore been assessed with this index.

We expect that making this dataset freely available could give impulse to research on approaches for assessing the coherence between the implementation of a method and its lead comment. In fact, although no such approach has been proposed yet, approaches of this kind could be of great help for software maintenance and evolution activities.

The paper is structured as follows. In Section 2, we discuss the methodology used to create our dataset. A description of the main characteristics of the dataset is given in Section 3, while in Section 4 the annotation is assessed. Some final considerations conclude the paper.

2 Dataset Construction

To create our dataset, we adopted the perspective-based and checklist-based review methods (Wohlin et al., 2012). The perspective is the one of the Researcher, aiming at assessing the coherence between the lead comment of a method and its implementation. The creation process is based on the following elements:

1. Working Meetings. We used meetings to determine the goals of our research work and the process to create the dataset.

2. Dataset Creation. We instantiated the defined process to create our dataset.

3. Outcomes. We gathered results during and after the creation of our dataset.

4. Publishing Results and Dataset. We shared our experience with the community in (Corazza et al., 2015) and released the dataset on the web.

The construction of the dataset was completed in two main consecutive phases by using an ad-hoc web system implemented for the purpose (the decision rule resulting from these two phases is sketched after the list):

• Verify coherence. Annotators verify, by means of a checklist, the coherence between the lead comments of a set of methods and their corresponding implementations.

• Resolve conflicts. The intervention of experts is required whenever the judgements of the annotators differ. In our case, two of the authors, with a background in software engineering, assumed the role of experts and examined the problematic cases. For each conflicting method, the experts should reach an agreement about its coherence or non-coherence. Methods on which the experts do not reach a consensus are automatically discarded.
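Taken together, the two phases amount to a simple per-method decision rule: keep the annotators' judgment when they agree, defer to the experts on a conflict, and discard the method when the experts reach no consensus. The following is a minimal sketch of this rule; the type and method names are ours, not those of the web system used by the authors, and the three judgment values are the ones described in Section 4.

import java.util.Optional;

// Judgments assigned by annotators (see Section 4).
enum Judgment { NON_COHERENT, DONT_KNOW, COHERENT }

final class ConflictResolution {

    // Decision rule implied by the two phases: if the two annotators agree,
    // their shared judgment is kept; otherwise the experts' consensus is
    // used, and an empty result means the method is discarded.
    static Optional<Judgment> resolve(Judgment first, Judgment second,
                                      Optional<Judgment> expertConsensus) {
        if (first == second) {
            return Optional.of(first); // agreement: keep the shared judgment
        }
        return expertConsensus;        // conflict: experts decide or discard
    }
}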
3 Dataset Description

Some descriptive statistics (e.g., the number of classes and methods) of the software systems in our dataset are shown in Table 1.

• CoffeeMaker (agile.csc.ncsu.edu/SEMaterials/tutorials/coffee_maker/) is a software system to manage inventory and recipes and to purchase beverages. We chose this software because it has a simple and clear design, having been developed for educational purposes.

• JFreeChart (www.jfree.org/jfreechart/) is a Java tool supporting the visualization of data charts (e.g., scatter plots and histograms). We included two versions of this software. As reported in Table 1, both versions contain lead comments for almost 80% of their methods. This suggests an extensive use of comments, which is the main reason why we decided to include this software in the dataset.

• JHotDraw (www.jhotdraw.org/) is a framework for technical and structured graphics. Even if the source code of JHotDraw is scarcely commented (see Table 1), it is well known in the software maintenance community for its good object-oriented design (Scanniello et al., 2010).

Table 1: Descriptive statistics of the software systems in the dataset: Nf stands for the number of files, Nc for the number of classes, Nm for the number of methods, and Nm* for the number of methods with lead comments.

System          Version  Nf   Nc   Nm    Nm*
CoffeeMaker     -        7    7    51    47 (92%)
JFreeChart 6.0  0.6.0    82   83   617   485 (79%)
JFreeChart 7.1  0.7.1    124  127  807   624 (77%)
JHotDraw        7.4.1    575  692  6414  2480 (39%)

In Figure 1, we report the implementation and the lead comment of setTickLabelsVisible (extracted from JFreeChart ver. 0.7.1, included in our dataset). According to our definition of coherence, we can assert that this method is coherent.

Figure 1: The lead comment and the implementation of the method setTickLabelsVisible of JFreeChart 0.7.1.

/**
 * Sets the flag that determines whether or not
 * the tick labels are visible.
 * Registered listeners are notified of a
 * general change to the axis.
 *
 * @param flag The flag to set.
 */
public void setTickLabelsVisible(boolean flag) {
    if (flag != tickLabelsVisible) {
        tickLabelsVisible = flag;
        notifyListeners(new AxisChangeEvent(this));
    }
}

Figure 2: A non-coherent method in JHotDraw 7.4.1.

// GEN-FIRST:event_save
// Code for dispatching events from components
// to event handlers.
private void save(java.awt.event.ActionEvent evt) {
    try {
        String methodName = getParameter("datawrite");
        if (methodName.indexOf('(') > 0) {
            methodName = methodName.substring(0, methodName.indexOf('(') - 1);
        }
        JSObject win = JSObject.getWindow(this);
        Object result = win.call(methodName, new Object[]{getData()});
    } catch (Throwable t) {
        TextFigure tf = new TextFigure("Fehler: " + t);
        AffineTransform tx = new AffineTransform();
        tx.translate(10, 20);
        tf.transform(tx);
        getDrawing().add(tf);
    }
}
On the other hand, the lead comment of the save method reported in Figure 2 provides a very poor and inadequate description of the design intent of the method, thus reflecting a lack of coherence with the underlying implementation.

Three annotators were involved in the dataset creation process. Two of them hold a Bachelor's degree in Computer Science and have very similar technical backgrounds. The third annotator can be considered more experienced than the other two, since he holds a Master's degree in Computer Science. We distributed the effort among the annotators so that each software system would be separately evaluated by at least two annotators. This allowed us to have multiple judgements for each method in the dataset, and to calculate the rate of agreement among the annotators.

4 Annotation Assessment

The whole dataset creation process took place from January 15th, 2014 to June 20th, 2014, for a total of 800 man-hours. This gives an estimation of the effort required to conduct the study presented in this paper, and provides an indication to researchers interested in extending our dataset.

The annotators provided indications on the coherence of methods by assigning to each of them one of the following three possible values: Non-Coherent, Don't Know, and Coherent.

In this scenario, we use the kappa index (Cohen, 1960) to obtain an assessment of the agreement among annotators, thus estimating the reliability of their evaluations. In fact, if the annotators agree on a large number of methods, we can conclude that their annotations are reliable. The kappa index is designed for categorical judgments and relates the agreement rate to the rate of chance agreement:

    \kappa = \frac{p_o - p_c}{1 - p_c},    (1)

where p_o is the observed probability of agreement and p_c is the chance probability of agreement. Both probabilities are estimated by the corresponding frequencies. By a simple algebraic manipulation, Equation 1 can be written as:

    \kappa = 1 - \frac{q_o}{q_c},    (2)

where q_o = 1 - p_o and q_c = 1 - p_c correspond to the observed and the chance probabilities of disagreement, respectively. Usually the index assumes values in ]0, 1] (the notation means that 1 is included in the interval, while 0 is not), as it can be expected that the observed disagreement is less likely than chance. A null value signals that the observed disagreement is exactly as likely as chance, while the kappa index assumes negative values in the unwanted case where disagreement is more likely than chance. Perfect agreement corresponds to \kappa = 1. Values greater than 0.80 are usually considered a cue of good agreement, and values in the interval [0.67, 0.80] are considered acceptable (Cohen, 1960).
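To make Equation 2 concrete, the following is a minimal illustrative sketch of the computation; it is ours, not the authors' tooling, and the integer label encoding is an assumption. The index is expressed directly in the disagreement form of Equation 2, parameterized by a matrix of disagreement weights over label pairs: with all off-diagonal weights equal to 1 it yields the unweighted index, while the matrix shown anticipates the half-weight scheme for Don't Know disagreements discussed next.

// Illustrative sketch (ours) of the kappa index in the disagreement form of
// Equation 2: kappa = 1 - q_o / q_c.
final class KappaIndex {

    // Disagreement weights over {0 = Non-Coherent, 1 = Don't Know, 2 = Coherent}:
    // zero on the diagonal (agreement), half weight when one answer is neutral.
    static final double[][] WEIGHTS = {
            {0.0, 0.5, 1.0},
            {0.5, 0.0, 0.5},
            {1.0, 0.5, 0.0}
    };

    // labels1[i] and labels2[i] are the two annotators' judgments on method i.
    static double kappa(int[] labels1, int[] labels2, double[][] w) {
        int n = labels1.length;
        int k = w.length;
        double[][] joint = new double[k][k]; // observed joint frequencies
        double[] m1 = new double[k];         // marginal frequencies, annotator 1
        double[] m2 = new double[k];         // marginal frequencies, annotator 2
        for (int i = 0; i < n; i++) {
            joint[labels1[i]][labels2[i]] += 1.0 / n;
            m1[labels1[i]] += 1.0 / n;
            m2[labels2[i]] += 1.0 / n;
        }
        double qo = 0.0, qc = 0.0;           // observed and chance disagreement
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                qo += w[i][j] * joint[i][j];
                qc += w[i][j] * m1[i] * m2[j];
            }
        }
        return 1.0 - qo / qc;
    }
}

A call such as kappa(labels1, labels2, WEIGHTS) returns the weighted index (WK in Table 2); replacing every 0.5 in WEIGHTS with 1.0 yields the unweighted index (UK).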
The classical formulation of the kappa index considers a binary classification problem (e.g., Non-Coherent vs. Coherent). In our case, however, the neutral judgement (i.e., Don't Know) is also allowed. Therefore, possible disagreements include the case where one of the two answers is the neutral one. In this case, it is possible to weigh the possible disagreements among annotators differently. In fact, disagreements due to neutral answers are less serious than disagreements where the judgments are totally divergent (i.e., Coherent and Non-Coherent, in our case). To this end, Cohen (1968) presents a variant of the kappa index where different weights can be applied in case of disagreement. If the same weight is assigned to all possible disagreement combinations, the original (unweighted) formulation is obtained. The formulation of the Weighted Kappa (WK) is the one in Equation 2, but for the computation of q_o and q_c the contributions are weighted according to the importance given to the corresponding disagreement cases. By contrast, we refer to the original formulation of the kappa index as Unweighted Kappa (UK). We assign to the Don't Know response a weight that is half the weight assigned to the Non-Coherent (or Coherent) one. This is the same scheme reported by Cohen (1968). The weighted and unweighted kappa indexes are reported in Table 2.

Table 2: Agreement rate between judges as computed by Cohen's kappa index.

System          UK     WK
CoffeeMaker     0.913  0.913
JFreeChart 6.0  0.939  0.918
JFreeChart 7.1  0.983  0.977
JHotDraw        0.824  0.684

The agreement between annotators is good on the first three systems, and acceptable for JHotDraw. However, on this system the difference between the values for UK and WK is large; the weighted index thus provides a more accurate indication of the agreement between the evaluations on this software.

At the end of the first step of the dataset creation process, the number of methods on which the annotators did not agree was 302, corresponding to 8.3% of the total number of methods from all the systems in the dataset. Most of these methods belong to JHotDraw, as suggested by the kappa index values (see Table 2). These methods were reviewed by two of the authors. An agreement was reached on all of them, and they were then included in the dataset. The total number of methods in the dataset is reported in Table 3 (i.e., 2,883).

Table 3: Descriptive statistics of the dataset: C stands for Coherent, NC for Non-Coherent. Don't Know methods are not included in the totals.

System          C     NC    Total  Don't Know
CoffeeMaker     27    20    47     0
JFreeChart 6.0  406   55    461    24
JFreeChart 7.1  520   68    588    36
JHotDraw        762   1025  1787   693
Total           1715  1168  2883   753

5 Conclusions and future work

In this paper, we have presented the early steps of our research on the coherence between the lead comments of methods and their implementations. In particular, we have provided a description of the problem setting, along with the experimental protocol defined to create our dataset, which we made publicly available on the web. We have also sketched the results of a quantitative analysis conducted on a codebase of 3,636 methods, gathered from 4 different open-source systems written in Java.

There are many possible future directions for our research. For example, it would be interesting to conduct an empirical study investigating the effect of maintenance operations on coherence across multiple versions of the same system. As a step in this direction, we have already included two versions of the JFreeChart system in the dataset. Our results and those by Fluri et al. (2007) represent a viable starting point.
Finally, we would like to exploit the collected data as an evaluation set to assess the performance of approaches able to discern method coherence.

References

Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555–596, December.

J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

J. Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213–220.

A. Corazza, S. Di Martino, and V. Maggio. 2012. LINSEN: An efficient approach to split identifiers and expand abbreviations. In Proceedings of the International Conference on Software Maintenance, pages 233–242. IEEE Computer Society.

A. Corazza, V. Maggio, and G. Scanniello. 2015. On the coherence between comments and implementations in source code. In Proceedings of the 41st Euromicro Conference on Software Engineering and Advanced Applications (EUROMICRO-SEAA 2015), Madeira, Portugal, August 26–28, 2015, pages 76–83. IEEE Computer Society.

A. Corazza, S. Di Martino, V. Maggio, and G. Scanniello. 2016. Weighing lexical information for software clustering in the context of architecture recovery. Empirical Software Engineering, 21(1):72–103.

Barbara Di Eugenio and Michael Glass. 2004. The kappa statistic: A second look. Computational Linguistics, 30(1):95–101.

B. Fluri, M. Würsch, and H.C. Gall. 2007. Do code and comments co-evolve? On the relation between source code and comment changes. In Proceedings of the Working Conference on Reverse Engineering, pages 70–79. IEEE Computer Society.

D. Lawrie, D. Binkley, and C. Morrell. 2010. Normalizing source code vocabulary. In Proceedings of the Working Conference on Reverse Engineering, pages 3–12. IEEE Computer Society.

Yann Mathet, Antoine Widlöcher, and Jean-Philippe Métivier. 2015. The unified and holistic method gamma (γ) for inter-annotator agreement measure and alignment. Computational Linguistics, 41(3):437–479, September.

F. Salviulo and G. Scanniello. 2014. Dealing with identifiers and comments in source code comprehension and maintenance: Results from an ethnographically-informed study with students and professionals. In Proceedings of the International Conference on Evaluation and Assessment in Software Engineering, pages 423–432. ACM Press.

G. Scanniello, A. D'Amico, C. D'Amico, and T. D'Amico. 2010. Using the Kleinberg algorithm and vector space model for software system clustering. In Proceedings of the International Conference on Program Comprehension, pages 180–189. IEEE Computer Society.

C. Wohlin, P. Runeson, M. Höst, M.C. Ohlsson, B. Regnell, and A. Wesslén. 2012. Experimentation in Software Engineering. Springer.