A Demonstration of CodeBreaker: A Machine Interpretable Knowledge Graph for Code

Ibrahim Abdelaziz1, Kavitha Srinivas1, Julian Dolby1 and James P. McCusker2
1 IBM Research, IBM T.J. Watson Research Center
{ibrahim.abdelaziz1, kavitha.srinivas}@ibm.com, dolby@us.ibm.com
2 Rensselaer Polytechnic Institute (RPI), mccusj2@rpi.edu

Abstract. Knowledge graphs have been extremely useful in powering diverse applications like natural language understanding. CodeBreaker attempts to construct machine interpretable knowledge graphs about program code to similarly power diverse applications such as code search, code understanding, and code automation. We have built such a knowledge graph, with 1.98 billion edges, by a detailed analysis of function usage in 1.3 million Python programs on GitHub, documentation about the functions in 2300+ modules, forum discussions with more than 47 million posts, class hierarchy information, etc. In this work, we demonstrate one application of this knowledge graph: a code recommendation engine for programmers within an IDE. All user interactions within the application are translated into SPARQL queries, which have quite different characteristics than queries against traditional knowledge graphs such as DBpedia or Wikidata. Aspects of code such as data flow are inherently transitive, hence the SPARQL is complex and requires property paths. One of our goals is to provide these queries as a basis for graph querying benchmarks, while allowing users the ability to interact with a real application built on top of a large graph database.

1 Introduction

Several knowledge graphs have been constructed in recent years, such as DBpedia [4], Wikidata [6] and Freebase [3]. These graphs now contain vast repositories of knowledge about entities and concepts, and have been successfully used in a number of different application areas such as natural language processing and information retrieval [7].
With the unprecedented increase of published code libraries in many domains and the growing number of open-source projects, building a knowledge graph for code can be very useful in driving diverse applications around programming, such as code search, code automation, refactoring, bug detection, code optimization, etc. In CodeBreaker, we use a set of generic techniques to construct the first large-scale knowledge graph for code. Specifically, we developed a knowledge graph with 1.98 billion edges from a deep analysis of 1.3 million Python programs on GitHub. This covered 278K functions, 257K classes and 5.8M methods from 2300+ Python modules, as well as 47M posts from StackOverflow and StackExchange.

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

One main challenge in building such a knowledge graph is representation. We represent the code in a more global way than in prior work based on program text or ASTs, i.e., in terms of data flow and control flow. This allows us to analyze connections throughout programs. Specifically, we capture the following: (a) which objects get passed as arguments to which methods, (b) which objects get used to invoke methods (data flow), and (c) which methods get called before which other ones (control flow). However, global connections make scalable querying over large graphs harder; querying data and control flow is inherently transitive. This can be achieved using SPARQL 1.1 [2], which naturally supports graph patterns, filtering, and transitive closure, and has scalable implementations for large graphs. CodeBreaker is composed of 1.3 million individual graphs, one for each program. Therefore, we use the Resource Description Framework [1], with its support for graphs, as our storage representation. The graphs are, however, connected by the specific methods they call, qualified by the library name.
In addition, because many important higher level semantic details about the code reside in natural language for human consumption, we linked our knowledge graph to natural language from usage documentation and forums, using information retrieval techniques. For details on the construction of the knowledge graph, its schema, and how different code and documentation are linked, see [5].

We demonstrate CodeBreaker using a recommendation engine we built into an IDE. A video of the demo is available at https://github.com/wala/graph4code/blob/master/docs/figures/demo_v2.mp4, while the underlying knowledge graph is available at https://wala.github.io/graph4code/. Our key focus here is on the sorts of storage and query support needed for use cases of code knowledge graphs. The conference audience will interact with CodeBreaker through a graphical interface, where they can find 1) the next most likely call given the current method call, 2) popular data science pipelines used by others that are similar to their own, and 3) relevant forum discussions based on their own code. For each of these scenarios, we provide SPARQL queries, which can provide the basis for graph querying benchmarks, along with the knowledge graph.

2 Demonstration Overview

We provide a complete interface to show the potential of CodeBreaker in coding assistance. In particular, we integrated CodeBreaker with Jupyter Lab using its Language Server Protocol support, https://github.com/krassowski/jupyterlab-lsp.

2.1 Next Coding Step

We show a developer being provided commonly used next steps, based on the context of their current code. Context means the data flow predecessors of the node of interest; we take a simple example of the single predecessor call that constructed the classifier. Figure 1 shows a real Kaggle notebook, where users can select any expression in the code and get the most common next steps along with their frequencies.
In similar contexts, data scientists typically do one of the following: 1) build a text report showing the main classification metrics (frequency: 16), 2) report the confusion matrix, which is an important step to understand the classification errors (frequency: 10), and 3) save the prediction array as text (frequency: 8). This can help users by alerting them to best practices from other data scientists. In this example, the suggested step of adding code to compute a confusion matrix is actually useful. The existing Kaggle notebook does not contain this call, but the call is very helpful to understand the properties of a classifier. The exact SPARQL query to support this functionality is available at https://github.com/wala/graph4code/blob/master/usage_queries/find_next_step.sparql.

Fig. 1. Finding most commonly used next step

2.2 Get Similar Flows From Other Programs

The second scenario helps understand data science pipelines similar to existing ones, and uses this to understand what types of models other data scientists use given similar code context. As shown in Figure 2, to define the pipeline, the developer needs to choose two steps in the pipeline, such as from the point a dataset is read (e.g., read_csv) until a fit call is performed (e.g., model.fit). In the example program, data flows from read_csv to RandomForestClassifier.fit. The interface allows users to query what other classifiers tend to be invoked on the same data as RandomForestClassifier. On the right side of Figure 2, we can see that in Kaggle notebooks, people tend to use RandomForestClassifier along with GradientBoostingClassifier and KNeighborsClassifier to fit the same data. The thickness of the arrows denotes how frequently these classifiers have been used together. This recommendation gives data scientists options of different classification models to try.
The SPARQL query for gathering the relevant data science pipelines can be found at https://github.com/wala/graph4code/blob/master/usage_queries/find_similar_flows.sparql. This query exercises many aspects of SPARQL, as we need multiple transitive property paths to capture the fit calls, multiple regex filters, aggregation, and a MINUS.

Fig. 2. Finding similar data science pipelines

2.3 Get Relevant Forum Posts

A developer can find posts in StackOverflow and StackExchange based on the code written so far. From the code, CodeBreaker finds the forum posts with similar flows. Figure 3 shows how one can show relevant forum posts for the path in the code up to sklearn.svm.SVC.fit. The figure shows one of the posts. Note that the code written in the Kaggle notebook has data flowing from a read_csv to a train_test_split to an SVC.fit. This exact flow exists in the retrieved StackOverflow post, which also flows from a read_csv to a train_test_split to a fit on an SVC. Similarities in flow are useful because they imply that the post is discussing the same coding context. However, it is hard to detect code similarity at a token level, since the same object can be named differently: the SVC object is called model in the Kaggle notebook and clf_SVM_radial_basis in this forum post. Since CodeBreaker's graph decomposes the Kaggle code into its semantics, we can take each of the nodes in the dataflow for the Kaggle program and issue a SPARQL query which then gathers up relevant StackOverflow posts.
The corresponding SPARQL query can be found at https://github.com/wala/graph4code/blob/master/usage_queries/find_stack_overflow_posts.sparql.

Fig. 3. Finding relevant forum posts

3 Conclusion

In this paper, we demonstrated the use of CodeBreaker, the first large-scale knowledge graph for code, in code assistance within an IDE. In particular, we showed how one can get code suggestions, find similar data science pipelines, and perform context-based search in web forums. The knowledge graph is extensible, and we have made it publicly available to the larger community.

References

1. Resource Description Framework (RDF). https://www.w3.org/TR/rdf-primer/
2. SPARQL 1.1. https://www.w3.org/TR/sparql11-query/
3. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: SIGMOD Conference. pp. 1247-1250 (2008)
4. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal 6(2), 167-195 (2015)
5. Srinivas, K., Abdelaziz, I., Dolby, J., McCusker, J.P.: Graph4Code: A machine interpretable knowledge graph for code. arXiv preprint arXiv:2002.09440 (2020)
6. Vrandečić, D., Krötzsch, M.: Wikidata: A free collaborative knowledgebase. Commun. ACM 57(10), 78-85 (Sep 2014). https://doi.org/10.1145/2629489
7. Wang, Q., Mao, Z., Wang, B., Guo, L.: Knowledge graph embedding: A survey of approaches and applications. IEEE Trans. Knowl. Data Eng. 29(12), 2724-2743 (2017)