                          Establishing Benchmarks
                    For Learning Program Representations

                                                   Anjan Karmakar
                                             Faculty of Computer Science
                                            Free University Bozen-Bolzano
                                                 akarmakar@unibz.it



Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Anne Etien (eds.): Proceedings of the 12th Seminar on Advanced Techniques and Tools for Software Evolution, Bolzano, Italy, July 8-10 2019, published at http://ceur-ws.org


                        Abstract

    Recent advances in the field of machine learning have shown great promise in solving various software engineering tasks. However, unlike machine learning techniques used in fields such as NLP (Natural Language Processing), where text-based tokens are used as model inputs, in software engineering (SE) structured representations of source code have proven more effective for various SE tasks. Despite these findings, structured representations of source code are still underused. In this paper, we propose a benchmark that promotes the use of structured representations of source code as model inputs, via tasks that are explicitly defined towards that goal.

1   Introduction

With the advent of big code, applying machine learning techniques to large corpora of code has yielded excellent results for a number of software engineering tasks. Neural networks in particular have been very effective, since they are able to learn features from the input. However, most of the tasks so far use program structure in a shallow manner (name prediction from snippets, source code summarization, finding mappings between APIs, source code search, etc.). More recently, a number of papers have utilized the structured nature of source code and accomplished state-of-the-art results for certain software engineering tasks.
   Like natural language, source code is structured and repetitive [5]; therefore, representing instances of the input source code effectively, while leveraging their semantic and syntactic properties, could facilitate better learning by machine learning models. Studies such as [3] and [1] used source code representations in the form of ASTs and graphs to capture the semantic and syntactic properties of source code, and then used them to train their models.
   Although there are a number of ways to represent source code, such as simple token-based representations, ASTs (Abstract Syntax Trees), call graphs, and bytecode, we are interested in the more structured representations of code. Furthermore, we hypothesize that structured representations of source code as inputs to machine learning models would perform better for a variety of software engineering tasks, including tasks such as code completion and defect prediction.
   Therefore, to evaluate our hypothesis, we aim to propose a benchmark made up of tasks designed in such a way that, to succeed, it is necessary to learn more and more about the program structure. The tasks shall be of increasing difficulty (i.e., learning more and more of the structure is necessary to yield desirable results). We expect that on the hardest ones, current neural approaches will fare little better than random chance. The tasks will be defined based on a set of static analyses of the source code.

2   Background and Motivation

Recently, there has been an increasing interest in applying machine learning techniques to solve SE (Software Engineering) tasks. However, most of the work has directly tried to reuse natural language processing (NLP) methods for SE tasks, mainly by treating source code as a sequence of tokens - which ultimately
fail to capitalize on the unique opportunities offered by code's known structure and semantics [1].
   Even though there are many similarities between natural language and source code, one interesting aspect that differentiates source code from natural language is that source code is highly structured, which can be leveraged to obtain a greater understanding of the context of the code. Owing to this structured nature of code, program elements are usually dependent on each other; therefore, when building models to predict program properties, simple sequence-based models, which treat these program elements as independent, fail to make use of the interdependence between code elements [7]. Simple sequence-based models also fail to determine which variables are in scope at any point in the program, which can be resolved simply by embracing more structured representations of code [6]. Some recent studies which have utilized such structured representations of source code have accomplished state-of-the-art results on many SE tasks.
   For example, Alon et al. [3] introduce the use of different path-based abstractions of the program's abstract syntax tree (AST) to represent source code. The goal is to extract a representation that captures relevant and interesting properties from the program's AST, while keeping the representation open for generalization. One way to produce such a representation of source code is to decompose the AST into paths between nodes that repeat across programs but can also discriminate between different programs.
   The authors show that representing source code as AST paths can be useful in a diverse set of programming tasks such as predicting variable names, predicting method names, and predicting types of variables. Furthermore, they claim that the use of AST paths can significantly improve the performance of various learning algorithms without modifying them, while achieving state-of-the-art results.
   In yet another recent work, Zhang et al. [11] address the challenge of source code representation - effectively capturing syntactic and semantic information - with an AST-based Neural Network (ASTNN) that learns vector representations of source code. The model decomposes large ASTs of code fragments into sequences of smaller statement trees, and then obtains statement vectors by recursively encoding multi-way statement trees. Based on the sequence of statement vectors, a bidirectional RNN model learns the vector representations of code fragments by leveraging the naturalness of statements. The approach is evaluated on two tasks, namely source code classification and code clone detection, producing state-of-the-art results.
   Allamanis et al. [1] propose a new method to represent code that captures the syntactic and semantic structure of code using graphs, and then uses graph-based deep learning methods to learn to reason over program structures. The authors propose a new representation technique that encodes source code as graphs, in which edges represent syntactic as well as semantic relationships. They observe that exposing source code semantics explicitly as structured input to a machine learning model reduces the requirements on the amount of training data and on model capacity - making way for solving tasks such as variable naming (VarNaming) and detecting instances of misused variables (VarMisuse), and achieving state-of-the-art results.
   From the studies above, it is clear that structured representations of code fare much better than simple text-based token representations for the tasks investigated. For our study, we are going to focus specifically on the defect prediction task and attempt to use structured representations of code as input to a learning model to predict bugs. To evaluate the effectiveness of utilizing structured representations of code, the task of bug prediction is particularly potent, since there is a wide range of bug types - some are easily detected, while others require a thorough understanding of the syntactic formulation, semantics, and structure of the code.

3   Benchmarking

Our intention is to build a neural network model that is able to highlight buggy code in a given code corpus. In order to accomplish this goal, our methodology would require an amalgamation of different techniques that have already been proposed in the literature for diverse tasks. For the specific case of the bug detection task, however, the techniques need to be applied together and evaluated in the right manner, since all our research questions are subject to experimental evaluation, preferably against established benchmarks.
   An established benchmark, when embraced by a community, can have a strong positive effect on the scientific maturity of that community [9]. Benchmarking can result in a more rigorous examination of research contributions and pave the way for the rapid development and evaluation of new methods and tools.
   While looking for established benchmarks in the field, we have discovered datasets like the publicly available PROMISE dataset [8], which also includes contributions from the NASA Metrics Data Program [4], where a lot of defect prediction datasets are available. However, the tests conducted on these datasets are mostly metric-based, essentially using static measures to guide software quality predictions, and are too "coarse", meaning they often highlight
bugs at the file level, based on complexity measures such as essential complexity, cyclomatic complexity, design complexity, and lines of code. On the other hand, we would like to have tests which are more "fine-grained", in the sense that they could not only identify suspicious or buggy files but also highlight the exact bug locations in the lines of code.
   Thus, as a preliminary goal, we need to first determine some benchmarks that define tasks requiring a thorough understanding of the program's structure and semantics. We then need to evaluate the performance of our trained model on the tasks comprising our benchmark.
   An additional desirable aspect of a benchmark is control over the difficulty of its tasks. An example in NLP is the work of Weston et al. [10], which defines 20 NLP tasks of increasing levels of complexity. The tasks are artificial, but establish minimum levels of complexity that a prediction model must fulfill. While any model that can solve these tasks is not necessarily close to full reasoning, a model that fails on any of them is likely to fail on real-world tasks too.
   In the case of bug prediction, we would like a similar property, but with more realism. The tasks could range from basic ones, for which the information is directly accessible without needing to understand the program structure - where sequential processing is still viable - to more advanced ones, where understanding of non-local and indirect structure is necessary.
   To sum up, unlike existing benchmarks, we need tasks where the output is fine-grained and where we can control the complexity of the tasks, so that we can provide incentives for models that leverage program structure. One candidate that fulfills these conditions is to leverage static analysis tools. Static analysis tools leverage program structure and find real bugs. Static analyses can point out precise program locations, and they vary in complexity, from simple and local pattern matching to whole-program analyses.
   Therefore, we could proceed by defining tasks which would form the benchmark to evaluate machine learning models that use the structured nature of code. The range of tasks, based on difficulty, could be categorized as:

 1. "Easy" tasks: tasks for which the information is directly accessible without needing structure. E.g., a property of a method that can be deduced from information in the method body.

 2. "Medium" tasks: tasks for which some degree of structure is necessary. E.g., a property of a method or variable that can be deduced, but you need to take the entire file context into account, not just the method body.

 3. "Hard" tasks: tasks for which non-local structure is necessary. E.g., a property of a method that needs information from its direct callers.

 4. "Very hard" tasks: tasks for which non-local and also indirect structure is necessary. E.g., you have to look into the callers of the callers.

 5. "Inaccessible" tasks: tasks for which information very distant from the source is necessary. E.g., one statement in a method that is only indirectly called, 5-10 steps away in the call chain.

   For example, the VarMisuse task defined by Allamanis et al. [1] serves as a good sample task for bug detection. The missing variables in the VarMisuse task must be predicted properly, or else they risk causing system failure; thereby, applied to our case, our model could detect whether a variable has been misused and whether the module is buggy.
   In essence, some variables are omitted from a given code snippet and fed as input to Allamanis et al.'s model. The learned model then predicts the expected variables with high accuracy for all the variables but the last. Such a task could be categorized as a "medium" task, since to accurately predict an omitted variable the model needs to take the code from the entire file into consideration, rather than just the local method.
   To successfully tackle the task of bug detection, one needs to understand the role and function of the program elements and understand how they relate to each other. In the VarMisuse task discussed above, the model learns to predict the correct variable that should be used at a given program location. Since the model produces near-accurate results, any variable in the code that does not match the expected variable predicted by the model could then be a point of inspection. This task could be complementary to our bug detection task. Given a certain code snippet, we must therefore ascertain whether it contains buggy fragments, and then we can compare our results with the predictions made by the model proposed by Allamanis et al. [1].
   The tasks from the established benchmark will also help us compare our results against those of static bug finders, which use approaches ranging from simple code pattern-matching techniques to static analyses that process semantic abstractions of code. From the list of tasks defined in our benchmark, simple static bug finders would likely be able to detect bugs in "easy" and "medium" tasks but fare poorly for "hard" tasks
where a greater understanding of the program structure is required.
   Making a direct comparison of static bug finders against our model could reveal the effectiveness and/or weaknesses of our model. We could match the highlighted faulty lines of code or defect locations from the static bug finders and from our model, and evaluate them for false positives and false negatives.
   There are a number of static bug finders (SBF) with which we could compare our results to assess the effectiveness of our model. For example, hybrid tools like FindBugs, which incorporate both static data-flow analysis and pattern matching, could be a good evaluation candidate. Tools like ESC/Java and CodeSonar could also be considered, since their analyses are reportedly more effective and they could possibly highlight bugs in "hard" tasks.
   Advanced tools like Infer can even understand the program structure and fare better than traditional bug detection tools; it will be interesting to compare their results against the set of tasks from the established benchmarks, and to see whether they are able to detect bugs in "hard" and "very hard" tasks.

4   Discussion

The eventual goal of our research is to enable learned models to effectively detect bugs in a new corpus - which is not an easy task. Detecting code faults requires a thorough understanding of the code and reasoning about the program semantics. Even for an experienced programmer this is a challenging task.
   To allow a learning model to grasp the correlations between code fragments and understand how they work, we need a better way to represent code fragments than simple sequential token-based mappings. Even though code fragments have something in common with plain text, they should not simply be handled with text-based or token-based methods, due to their richer and more explicit structural information.
   Recent work [2] [1] provides strong evidence that semantic and syntactic knowledge from source code, in the form of structured representations, contributes a great deal to modeling source code and can obtain better results than traditional sequential token-based methods. Because of the highly structured nature of code, treating code snippets as structured representations for machine learning models can capture critical semantic information that reflects common code patterns, significantly lessens the requirements on the amount of training data compared to learning over sequential token-based representations [2], and is still general enough to be applied to a diverse range of tasks.
   However, to measure the effectiveness of models based on various representations of code, and to compare against the results from static analysis, we need a set of tasks as a benchmark. Therefore, in our preliminary work we attempt to define a benchmark measuring the accuracy and usefulness of a model based on a certain type of representation. The tasks comprising the benchmark would essentially test the understanding of the model, starting from simple tests for which the information is directly accessible without needing to understand the program structure, to increasingly difficult tests where understanding of non-local and indirect structure is necessary. Furthermore, this benchmark would also serve new models with novel representation techniques for their evaluation - to compare against the current state of the art.
   Once our benchmarks are established, as part of our future work, we could begin our bug detection approach by first mining the programs in our training set as Abstract Syntax Trees (ASTs). Then, selecting the most effective paths in the ASTs, we could derive code embeddings from the programs as input for our learning model. Finally, we would train the learning model with pairs of buggy and correct code, based on the vector representations of the code fragments. Once trained, the model should be able to highlight buggy fragments of code in its input - an unseen code corpus. Another approach could be to mine the programs as ASTs and represent them as graphs, which could be used as direct input to graph neural networks. Establishing a solid benchmark for the defect prediction task would boost the development and evaluation of new techniques and tools in defect prediction, and allow for the comparison of their effectiveness. Our aim with this paper is to introduce the notion of benchmarking for learned models based on structured representations of code, and to propose a set of task categories that would evaluate the usage of program structure, specifically for the task of defect prediction.

References

 [1] Miltiadis Allamanis, Marc Brockschmidt, and
     Mahmoud Khademi. Learning to represent pro-
     grams with graphs. CoRR, abs/1711.00740, 2017.

 [2] Uri Alon, Meital Zilberstein, Omer Levy, and
     Eran Yahav. code2vec: Learning distributed
     representations of code. CoRR, abs/1803.09473,
     2018.

 [3] Uri Alon, Meital Zilberstein, Omer Levy, and
     Eran Yahav. A general path-based representa-
     tion for predicting program properties. CoRR,
     abs/1803.09544, 2018.
 [4] W. Jackson, S.M. Chapman, and P. Callis. Met-
     rics data program. NASA IV and V Facility, 2004.
 [5] Abram Hindle, Earl T. Barr, Zhendong Su, Mark
     Gabel, and Premkumar Devanbu. On the natu-
     ralness of software. In Proceedings of the 34th In-
     ternational Conference on Software Engineering,
     ICSE ’12, pages 837–847, Piscataway, NJ, USA,
     2012. IEEE Press.
 [6] Chris J. Maddison and Daniel Tarlow. Structured
     generative models of natural source code. CoRR,
     abs/1401.0514, 2014.
 [7] Veselin Raychev, Martin Vechev, and Andreas
     Krause. Predicting program properties from "big
     code". In Proceedings of the 42nd Annual ACM
     SIGPLAN-SIGACT Symposium on Principles of
     Programming Languages, POPL ’15, pages 111–
     124, New York, NY, USA, 2015. ACM.
 [8] J. Sayyad Shirabad and Tim J. Menzies.
     The PROMISE repository of software engineering
     databases. School of Information Technology and
     Engineering, University of Ottawa, Canada, 24,
     2005.
 [9] S. E. Sim, S. Easterbrook, and R. C. Holt. Using
     benchmarking to advance research: a challenge to
     software engineering. In Proceedings of the 25th
     International Conference on Software Engineer-
     ing, ICSE ’03, pages 74–83, May 2003.
[10] Jason Weston, Antoine Bordes, Sumit Chopra,
     Alexander M Rush, Bart van Merriënboer, Ar-
     mand Joulin, and Tomas Mikolov. Towards ai-
     complete question answering: A set of prerequi-
     site toy tasks. arXiv preprint arXiv:1502.05698,
     2015.
[11] Jian Zhang, Xu Wang, Hongyu Zhang, Hailong
     Sun, Kaixuan Wang, and Xudong Liu. A novel
     neural source code representation based on ab-
     stract syntax tree. In Proceedings of the 41st
     International Conference on Software Engineer-
     ing, ICSE ’19, May 2019.




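Appendix. The AST mining step sketched in the Discussion - parsing programs into ASTs and decomposing them into paths between nodes, in the spirit of the path-based representation of Alon et al. [3] - can be illustrated with a small, hedged sketch. This is not the authors' implementation: Python's built-in ast module stands in for a real parser, the function name leaf_paths is hypothetical, and the path format is a simplified stand-in for code2vec-style path contexts.

```python
# Illustrative sketch (not the paper's implementation): decompose a
# program's AST into paths between leaf nodes, in the spirit of the
# path-based representation of Alon et al. Python's built-in `ast`
# module stands in for a real parser; the path format is simplified.
import ast
from itertools import combinations


def leaf_paths(source):
    """Yield (leaf_a, path_of_node_types, leaf_b) for each leaf pair."""
    chains = []  # (leaf label, root-to-leaf chain of AST node types)

    def walk(node, chain):
        chain = chain + [type(node).__name__]
        if isinstance(node, (ast.Name, ast.Constant)):
            label = node.id if isinstance(node, ast.Name) else repr(node.value)
            chains.append((label, chain))
            return
        for child in ast.iter_child_nodes(node):
            walk(child, chain)

    walk(ast.parse(source), [])

    for (la, ca), (lb, cb) in combinations(chains, 2):
        # The length of the common prefix locates the lowest common ancestor.
        i = 0
        while i < min(len(ca), len(cb)) and ca[i] == cb[i]:
            i += 1
        # Walk up from leaf a to the LCA, then down to leaf b.
        yield la, ca[i - 1:][::-1] + cb[i:], lb


# For "x = y + 1", one extracted path context is
# ('y', ['Name', 'BinOp', 'Constant'], '1').
```

Each yielded triple pairs two terminal tokens with the chain of node types connecting them through their lowest common ancestor; in the cited work such path contexts, rather than raw token sequences, are what the learning model consumes.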