=Paper=
{{Paper
|id=Vol-2510/sattose2019_paper_14
|storemode=property
|title=Establishing Benchmarks for Learning Program Representations
|pdfUrl=https://ceur-ws.org/Vol-2510/sattose2019_paper_14.pdf
|volume=Vol-2510
|authors=Anjan Karmakar
|dblpUrl=https://dblp.org/rec/conf/sattose/Karmakar19
}}
==Establishing Benchmarks for Learning Program Representations==
Anjan Karmakar
Faculty of Computer Science, Free University Bozen-Bolzano
akarmakar@unibz.it

Abstract

Recent advances in the field of machine learning have shown great promise in solving various software engineering tasks. However, unlike machine learning techniques used in fields such as NLP (Natural Language Processing), where text-based tokens are used as model inputs, in software engineering (SE) structured representations of source code have proven to be more effective for various SE tasks. Despite these findings, structured representations of source code are still underused. In this paper, we propose to define a benchmark that promotes the usage of structured representations of source code as model inputs, via tasks that are explicitly defined towards that goal.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Anne Etien (ed.): Proceedings of the 12th Seminar on Advanced Techniques and Tools for Software Evolution, Bolzano, Italy, July 8-10 2019, published at http://ceur-ws.org

1 Introduction

With the advent of big code, applying machine learning techniques to large corpora of code has yielded excellent results for a number of software engineering tasks. In particular, neural networks have been very effective, since they are able to learn the features from the input. However, most of the work so far uses program structure only in a shallow manner (name prediction from snippets, source code summarization, finding mappings between APIs, source code search, etc.). More recently, a number of papers have utilized the structured nature of source code and accomplished state-of-the-art results for certain software engineering tasks.

Like natural language, source code is also structured and repetitive [5]; therefore, representing instances of the input source code effectively, while leveraging their semantic and syntactic properties, could facilitate better learning by machine learning models. Studies such as [3, 1] used source code representations in the form of ASTs and graphs to capture the semantic and syntactic properties of source code, and then used them to train their models.

Although there are a number of ways to represent source code, such as simple token-based representations, ASTs (Abstract Syntax Trees), call graphs, and bytecode, we are interested in the more structured representations of code. Furthermore, we hypothesize that structured representations of source code as inputs to machine learning models would perform better for a variety of software engineering tasks, including tasks such as code completion and defect prediction.

Therefore, to evaluate our hypothesis, we aim to propose a benchmark made up of tasks designed in such a way that, to succeed, it is necessary to learn more and more about the program structure. The tasks shall be of increasing difficulty (i.e., learning more and more of the structure is necessary to yield desirable results). We expect that on the hardest ones, current neural approaches will fare little better than random chance. The tasks will be defined based on a set of static analyses of the source code.

2 Background and Motivation

Recently, there has been an increasing interest in applying machine learning techniques to solve SE (Software Engineering) tasks. However, most of the work has directly tried to reuse natural language processing (NLP) methods for SE tasks, mainly by treating source code as a sequence of tokens, which ultimately fails to capitalize on the unique opportunities offered by code's known structure and semantics [1].

Even though there are many similarities between natural language and source code, one interesting aspect that differentiates source code from natural language is that source code is highly structured, which can be leveraged to obtain a greater understanding of the context of the code. Owing to this structured nature of code, program elements are usually dependent on each other, and therefore, when building models to predict program properties, simple sequence-based models, which treat these program elements as independent, fail to make use of the interdependence between code elements [7]. Also, simple sequence-based models fail to determine which variables are in scope at any point in the program, which can be resolved simply by embracing more structured representations of code [6]. Some recent studies which have utilized such structured representations of source code have accomplished state-of-the-art results on many SE tasks.
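To make this contrast concrete, the sketch below (our own illustration, not taken from the cited work) shows the same made-up snippet viewed first as a flat token sequence and then through its AST, from which the variables bound in each function can be read off directly. It uses only Python's standard tokenize and ast modules.

```python
# A flat token view versus a structure-aware view of the same snippet.
import ast
import io
import tokenize

SNIPPET = """
def total(items):
    acc = 0
    for item in items:
        acc += item.price
    return acc
"""

# 1) Sequence view: a stream of tokens with no notion of nesting or scope.
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(SNIPPET).readline)
          if tok.string.strip()]
print(tokens[:12])

# 2) Structured view: walk the AST and collect the names bound inside each
#    function, i.e. the variables that are actually in scope there.
tree = ast.parse(SNIPPET)
for func in (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)):
    bound = {a.arg for a in func.args.args}
    bound |= {n.id for n in ast.walk(func)
              if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    print(func.name, "binds:", sorted(bound))
```

A sequence model only ever sees the first view; the binding and scoping information surfaced by the second view is exactly what structured representations make available to a learner.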
For example, Alon et al. [3] introduce the use of different path-based abstractions of the program's abstract syntax tree (AST) to represent source code. The goal is to extract a representation that captures relevant and interesting properties from the program's AST, while keeping the representation open for generalization. One way to produce such a representation is to decompose the AST into paths between nodes that repeat across programs but can also discriminate between different programs. The authors show that representing source code as AST paths can be useful in a diverse set of programming tasks, such as predicting variable names, predicting method names, and predicting types of variables. Furthermore, they claim that the use of AST paths can significantly improve the performance of various learning algorithms without modifying them, while achieving state-of-the-art results.
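The sketch below illustrates the underlying idea of this decomposition on Python code, using the standard ast module; it is not the authors' implementation. Real path-based models additionally record terminal values, bound the path lengths, and learn embeddings over the extracted paths.

```python
# Decompose a (Python) AST into leaf-to-leaf node-type paths.
import ast
from itertools import combinations

def leaf_paths(tree):
    """Yield paths going up from one leaf to the lowest common ancestor and down to another leaf."""
    root_to_leaf = []

    def walk(node, prefix):
        prefix = prefix + [type(node).__name__]
        children = list(ast.iter_child_nodes(node))
        if not children:
            root_to_leaf.append(prefix)
        for child in children:
            walk(child, prefix)

    walk(tree, [])
    for p1, p2 in combinations(root_to_leaf, 2):
        common = 0
        while common < min(len(p1), len(p2)) and p1[common] == p2[common]:
            common += 1
        # Upward part (reversed), the shared ancestor, then the downward part.
        yield list(reversed(p1[common:])) + [p1[common - 1]] + p2[common:]

code = "def add(a, b):\n    return a + b\n"
for path in list(leaf_paths(ast.parse(code)))[:5]:
    print(" ".join(path))
```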
In another recent work, Zhang et al. [11] address the challenge of source code representation, namely effectively capturing syntactic and semantic information, with an AST-based Neural Network (ASTNN) that learns vector representations of source code. The model decomposes the large ASTs of code fragments into sequences of smaller statement trees, and then obtains statement vectors by recursively encoding multi-way statement trees. Based on the sequence of statement vectors, a bidirectional RNN model learns the vector representations of code fragments by leveraging the naturalness of statements. The approach is evaluated on two tasks, namely source code classification and code clone detection, producing state-of-the-art results.

Allamanis et al. [1] propose a new method to represent code that captures the syntactic and semantic structure of code using graphs, and then uses graph-based deep learning methods to learn to reason over program structures. The authors propose a representation technique that encodes source code as graphs, in which edges represent syntactic as well as semantic relationships. They observe that exposing source code semantics explicitly as structured input to a machine learning model reduces the requirements on the amount of training data and model capacity, making way for solving tasks such as variable naming (VarNaming) and detecting instances of misused variables (VarMisuse), and achieving state-of-the-art results.
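The sketch below conveys the flavour of such a graph construction on Python code. It is a strong simplification of the original work: it only adds AST child edges plus a crude "next occurrence of the same variable" edge, standing in for the richer syntactic and data-flow edge types used there.

```python
# Build a toy program graph: AST child edges plus same-variable occurrence edges.
import ast
from collections import defaultdict

def build_graph(code):
    tree = ast.parse(code)
    node_id = {}                                   # AST node -> integer node id
    edges = defaultdict(list)                      # edge type -> list of (src, dst) id pairs

    def nid(node):
        return node_id.setdefault(node, len(node_id))

    # Syntactic edges: parent -> child in the AST.
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges["Child"].append((nid(parent), nid(child)))

    # A crude semantic-style edge: link successive occurrences of the same
    # variable name in source order, loosely mimicking data-flow edges.
    names = sorted((n for n in ast.walk(tree) if isinstance(n, ast.Name)),
                   key=lambda n: (n.lineno, n.col_offset))
    last_seen = {}
    for node in names:
        if node.id in last_seen:
            edges["NextOccurrence"].append((nid(last_seen[node.id]), nid(node)))
        last_seen[node.id] = node

    return edges

code = "def swap(a, b):\n    tmp = a\n    a = b\n    b = tmp\n    return a, b\n"
for edge_type, pairs in build_graph(code).items():
    print(edge_type, len(pairs), "edges")
```

A graph neural network would then propagate information along these typed edges instead of reading the code left to right.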
From the studies above, it is clear that structured representations of code fare much better than simple text-based token representations for the tasks investigated. For our study, we are going to specifically focus on defect prediction tasks and attempt to use structured representations of code as input to a learning model that predicts bugs. To evaluate the effectiveness of utilizing structured representations of code, the task of bug prediction is particularly suitable, since there is a wide range of available bug types: some are easily detected, while others require a thorough understanding of the syntactic formulation, semantics, and structure of the code.

3 Benchmarking

Our intention is to build a neural network model that is able to highlight buggy code in a given code corpus. In order to accomplish this goal, our methodology would require an amalgamation of different techniques that have already been proposed in the literature for diverse tasks. For the specific case of the bug detection task, however, the techniques need to be applied together and evaluated in the right manner, since all our research questions are subject to experimental evaluation, preferably against established benchmarks.

An established benchmark, when embraced by a community, can have a strong positive effect on the scientific maturity of that community [9]. Benchmarking can result in a more rigorous examination of research contributions and pave the way for the rapid development and evaluation of new methods and tools.

While looking for established benchmarks in the field, we have discovered datasets like the publicly available PROMISE dataset [8], which also includes contributions from the NASA Metrics Data Program [4], where many defect prediction datasets are available. However, the tests conducted on these datasets are mostly metric-based, essentially using static measures to guide software quality predictions, and are too "coarse", meaning they often highlight bugs at the file level based on complexity measures such as essential complexity, cyclomatic complexity, design complexity, and lines of code. On the other hand, we would like to have tests which are more "fine-grained", in the sense that they could not only identify suspicious or buggy files but also highlight the exact bug locations in the lines of code.

Thus, as a preliminary goal, we need to first determine some benchmarks that define tasks requiring a thorough understanding of the program's structure and semantics. We then need to evaluate the performance of our trained model on these tasks and compare it across the tasks comprising our benchmark.

An additional desirable aspect of a benchmark is to control the difficulty of the tasks. An example in NLP is the work of Weston et al. [10], which defines 20 NLP tasks at increasing levels of complexity. The tasks are artificial, but establish minimum levels of complexity that a prediction model must fulfill. While any model that can solve these tasks is not necessarily close to full reasoning, a model that fails on any of these tasks is likely to fail on real-world tasks too.

In the case of bug prediction, we would like a similar property, but with more realism. These could range from basic tasks for which the information is directly accessible without needing to understand the program structure, where sequential processing is still viable, to more advanced tasks where an understanding of non-local and indirect structure is necessary.

To sum up, unlike existing benchmarks, we need tasks where the output is fine-grained and where we can control the complexity of the tasks, so that we can provide incentives for models that leverage the program structure. One candidate that fulfills these conditions is to leverage static analysis tools. Static analysis tools leverage program structure and find real bugs. The static analyses can point out precise program locations, and the analyses vary in complexity, from simple and local pattern matching to full-program analyses.

Therefore, we could consider proceeding to define certain tasks which would form the benchmark to evaluate machine learning models that use the structured nature of code. The range of tasks, based on difficulty, could be categorized as:

1. "Easy" tasks: tasks for which the information is directly accessible without needing structure. E.g., a property of a method that can be deduced from information in the method body.

2. "Medium" tasks: tasks for which some degree of structure is necessary. E.g., a property of a method or variable that can be deduced, but you need to take the entire file context into account, not just the method body.

3. "Hard" tasks: tasks for which non-local structure is necessary. E.g., a property of a method that needs information from its direct callers.

4. "Very hard" tasks: tasks for which non-local and also indirect structure is necessary. E.g., you have to look into the callers of the callers.

5. "Inaccessible" tasks: tasks for which information very distant from the source is necessary. E.g., a statement in a method that is only reached indirectly, 5-10 call steps away.

For example, the VarMisuse task defined by Allamanis et al. [1] serves as a good sample task for bug detection. The masked variables in the VarMisuse task must be predicted correctly, or else they risk causing system failures; thereby, when applied to our case, our model could highlight or detect whether a variable has been misused and whether the module is buggy.

In essence, some variables are omitted from a given code snippet, which is fed as input to the model of Allamanis et al.; the learned model then predicts the expected variables with high accuracy for all but the last variable. Such a task could be categorized as a "medium" task, since to accurately predict the omitted variable the model needs to take the code of the entire file into consideration, rather than just the local method.

To successfully tackle the task of bug detection, one needs to understand the role and function of the program elements and how they relate to each other. In the VarMisuse task discussed above, the model learns to predict the correct variable that should be used at a given program location. Since the model produces highly accurate results, any variable in the code that does not match the variable predicted by the model could then be a point of inspection. This task could be complementary to our bug detection task: given a certain code snippet, we must ascertain whether it contains buggy fragments, and then we can compare our results with the predictions made by the model proposed by Allamanis et al. [1].
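To make the connection between VarMisuse and bug detection concrete, the toy generator below (our own sketch, using Python source, the standard ast module, and a made-up snippet) seeds a VarMisuse-style fault by replacing one variable use with another name from the same snippet, yielding the kind of (buggy, correct) pair that such a model could be trained and evaluated on.

```python
# Seed a VarMisuse-style fault: swap one variable use for another in-scope name.
import ast
import random

def seed_var_misuse(code, seed=0):
    """Return (buggy_code, correct_code, info); assumes at least two distinct names."""
    rng = random.Random(seed)
    tree = ast.parse(code)
    uses = [n for n in ast.walk(tree)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)]
    names = sorted({n.id for n in uses})
    target = rng.choice(uses)
    wrong = rng.choice([name for name in names if name != target.id])
    lines = code.splitlines()
    line = lines[target.lineno - 1]
    # Splice the wrong name into the chosen use site (col_offset is 0-based).
    lines[target.lineno - 1] = (line[:target.col_offset] + wrong
                                + line[target.col_offset + len(target.id):])
    return "\n".join(lines), code, (target.lineno, target.id, wrong)

buggy, correct, (lineno, original, injected) = seed_var_misuse(
    "def pay(balance, amount):\n    fee = amount * 0.01\n    return balance - amount - fee\n")
print(f"line {lineno}: replaced '{original}' with '{injected}'")
print(buggy)
```

A VarMisuse-style model is trained to recover the original name at the swapped location; at inspection time, locations where its prediction disagrees with the code as written become candidate bug reports.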
These tasks from the established benchmark will also help us compare our results against those of static bug finders, which use approaches ranging from simple code pattern-matching techniques to static analyses that process semantic abstractions of code. From the list of tasks defined in our benchmark, simple static bug finders would likely be able to detect bugs in "easy" and "medium" tasks, but fare poorly on "hard" tasks, where a greater understanding of the program structure is required.

A direct comparison between static bug finders and our model could reveal the effectiveness and/or weaknesses of our model. We could match the highlighted faulty lines of code or defect locations from the static bug finders and from our model, and evaluate them for false positives and false negatives.

There are a number of static bug finders (SBF) we could compare our results with in order to draw conclusions on the effectiveness of our model. For example, hybrid tools like FindBugs, which incorporate both static data-flow analysis and pattern matching, could be good evaluation candidates. Also, tools like ESC-Java and CodeSonar could be considered, since their analyses are reportedly more effective and they could possibly highlight bugs in "hard" tasks.

Advanced tools like Infer may even understand enough of the program structure to fare better than traditional bug detection tools, and it will be interesting to compare their results on the set of tasks from the established benchmarks, and to see whether they are able to detect bugs in "hard" and "very hard" tasks.
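The comparison itself can be kept simple: treat both the learned model and each static bug finder as producers of (file, line) locations, and score them against a set of known bug locations for false positives and false negatives. The sketch below uses made-up locations; in practice the warning locations would be parsed from the respective tools' report formats.

```python
# Score flagged (file, line) locations against known bug locations.
def precision_recall(flagged, ground_truth):
    flagged, ground_truth = set(flagged), set(ground_truth)
    tp = len(flagged & ground_truth)
    fp = len(flagged - ground_truth)       # false positives
    fn = len(ground_truth - flagged)       # false negatives
    precision = tp / (tp + fp) if flagged else 0.0
    recall = tp / (tp + fn) if ground_truth else 0.0
    return precision, recall

known_bugs = {("Account.java", 42), ("Cart.java", 108), ("Cart.java", 233)}
model_flags = {("Account.java", 42), ("Cart.java", 108), ("Order.java", 17)}
sbf_flags = {("Account.java", 42), ("Util.java", 9)}

for name, flags in [("learned model", model_flags), ("static bug finder", sbf_flags)]:
    p, r = precision_recall(flags, known_bugs)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```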
4 Discussion

The eventual goal of our research is to enable learned models to effectively detect bugs in a new corpus, which is not an easy task. Detecting code faults requires a thorough understanding of the code and reasoning about the program semantics. Even for an experienced programmer this is a challenging task.

To allow a learning model to grasp the correlations between code fragments and understand how they work, we need a better way to map code fragments than simple sequential token-based mapping. Even though code fragments have something in common with plain text, they should not simply be handled with text-based or token-based methods, due to their richer and more explicit structural information.

Recent work [2, 1] provides strong evidence that semantic and syntactic knowledge from source code, in the form of structured representations, contributes a great deal to modeling source code and can obtain better results than traditional sequential token-based methods. Because of the highly structured nature of code, treating code snippets as structured representations for machine learning models can capture critical semantic information that reflects common code patterns, and even significantly lessens the requirements on the amount of training data compared to learning over sequential token-based representations [2], while remaining general enough to be applied to a diverse range of tasks.

However, to measure the effectiveness of models based on various representations of code, and to compare against the results from static analysis, we need a set of tasks as a benchmark. Therefore, in our preliminary work we attempt to define a benchmark measuring the accuracy and usefulness of a model based on a certain type of representation. The tasks comprising the benchmark would essentially test the understanding of the model, starting from simple tests for which the information is directly accessible without needing to understand the program structure, to increasingly difficult tests where an understanding of non-local and indirect structure is necessary. Furthermore, this benchmark would also serve new models with novel representation techniques for their evaluation purposes, to compare against the current state of the art.

Once our benchmarks are established, as part of our future work, we could begin our bug detection approach by first mining the programs in our training set as Abstract Syntax Trees (ASTs). Then, by selecting the most effective paths in the ASTs, we could derive code embeddings from the programs as input for our learning model. Finally, we would train the learning model with pairs of buggy and correct code, based on the vector representations of the code fragments. Once trained, the model should be able to highlight buggy fragments of code in the input, an unseen code corpus. Another approach could be to mine the programs as ASTs and represent them as graphs, which could be used as direct input to graph neural networks.
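As a rough, self-contained stand-in for this pipeline, and not the neural models we ultimately target, the sketch below reduces each code fragment to a bag of parent-child AST node-type pairs (a crude proxy for the path and graph features discussed above) and fits an off-the-shelf scikit-learn classifier on a handful of hand-labelled toy fragments. The fragments, labels, and feature choice are all illustrative assumptions.

```python
# Toy end-to-end pipeline: structural features -> classifier over buggy/correct fragments.
import ast
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def structure_features(code):
    """Bag of parent->child node-type pairs, serialized as one whitespace-joined string."""
    tree = ast.parse(code)
    pairs = [f"{type(p).__name__}>{type(c).__name__}"
             for p in ast.walk(tree) for c in ast.iter_child_nodes(p)]
    return " ".join(pairs)

# Hand-labelled toy fragments: 1 = buggy (fails on empty input), 0 = correct.
fragments = [
    ("def mean(xs):\n    return sum(xs) / len(xs)\n", 1),
    ("def mean(xs):\n    return sum(xs) / len(xs) if xs else 0\n", 0),
    ("def head(xs):\n    return xs[0]\n", 1),
    ("def head(xs):\n    return xs[0] if xs else None\n", 0),
]

docs = [structure_features(code) for code, _ in fragments]
labels = [label for _, label in fragments]

vectorizer = CountVectorizer(token_pattern=r"\S+", lowercase=False)
X = vectorizer.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

unseen = "def last(xs):\n    return xs[-1]\n"
prob_buggy = clf.predict_proba(vectorizer.transform([structure_features(unseen)]))[0][1]
print(f"probability buggy: {prob_buggy:.2f}")
```

The same scaffolding would stay in place when the bag-of-pairs features are swapped for learned path embeddings or a graph neural network, which is why the benchmark tasks, rather than any single model, are the focus of this proposal.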
Establishing a solid benchmark for the defect prediction task would boost the development and evaluation of new techniques and tools in defect prediction, and allow for the comparison of their effectiveness. Our aim with this paper is to introduce the notion of benchmarking for learned models based on structured representations of code, and to propose a set of task categories that would evaluate the usage of program structure specifically for the task of defect prediction.

References

[1] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. CoRR, abs/1711.00740, 2017.

[2] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. code2vec: Learning distributed representations of code. CoRR, abs/1803.09473, 2018.

[3] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. A general path-based representation for predicting program properties. CoRR, abs/1803.09544, 2018.

[4] W. Jackson, S. M. Chapman, and P. Callis. Metrics data program. NASA IV and V Facility, 2004.

[5] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. On the naturalness of software. In Proceedings of the 34th International Conference on Software Engineering, ICSE '12, pages 837-847, Piscataway, NJ, USA, 2012. IEEE Press.

[6] Chris J. Maddison and Daniel Tarlow. Structured generative models of natural source code. CoRR, abs/1401.0514, 2014.

[7] Veselin Raychev, Martin Vechev, and Andreas Krause. Predicting program properties from "big code". In Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '15, pages 111-124, New York, NY, USA, 2015. ACM.

[8] J. Sayyad Shirabad and Tim J. Menzies. The PROMISE repository of software engineering databases. School of Information Technology and Engineering, University of Ottawa, Canada, 24, 2005.

[9] S. E. Sim, S. Easterbrook, and R. C. Holt. Using benchmarking to advance research: A challenge to software engineering. In Proceedings of the 25th International Conference on Software Engineering, ICSE '03, pages 74-83, May 2003.

[10] Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv preprint arXiv:1502.05698, 2015.

[11] J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu. A novel neural source code representation based on abstract syntax tree. In Proceedings of the 41st International Conference on Software Engineering, ICSE '19, May 2019.