<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Language-Parametric Modular Framework for Mining Idiomatic Code Patterns</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dario Di Nucci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kim Mens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hoang Son Pham</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johan Fabry</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Molderez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siegfried Nijssen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Coen De Roover</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vadim Zaytsev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In: Anne Etien (eds.): Proceedings of the 12th Seminar on Advanced Techniques &amp; Tools for Software Evolution</institution>
          ,
          <addr-line>Bolzano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Raincode Labs</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Université catholique de Louvain</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Vrije Universiteit Brussel</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In an ongoing industry-university collaboration we are developing a language-parametric framework for mining code idioms in legacy systems. This modular framework has a pipeline architecture and a language-parametric meta representation of the artefacts used by each of its five components: source code importer, mining preprocessor, pattern miner, pattern matcher, and modernisation assistant. The pipeline enables reuse of its components across systems and languages, and allows project partners to work on each of these components separately. An example is the exploration of novel pattern mining techniques independently of the languages to which they will be applied and the modernisation assistant in which they will be used. Our first results on mining Java and COBOL code are promising, even though challenges still lie ahead to make the framework and its constituent components truly scalable, customisable, and language independent.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Legacy systems have been informally defined as "large
software systems that we do not know how to cope
with but that are vital to our organisation" [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. To
keep their business value, legacy systems must evolve
over time by being replaced, redeveloped,
rearchitected, reengineered, reused, or by having their
software components and platforms migrated when
traditional maintenance practices can no longer achieve
the desired system properties [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Technology
consulting firms estimate that 180–200 billion lines of legacy
code are still in active use today [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Since the
potential benefits of legacy system modernisation are well
recognised, these systems are being slowly replaced or
retired in favour of alternatives.
      </p>
      <p>This paper presents an initial framework that is
being developed by two universities and a legacy
modernisation company in the context of a code mining
project. The company has been active since 1998,
has completed a series of successful migration projects with a
streak of satisfied customers, and has already won
three migration-related technology excellence awards
from Microsoft. The project's objectives, elaborated
upon later in the paper, are to advance the state of
the art in legacy modernisation by applying a novel
combination of techniques from artificial intelligence, pattern
mining, and program analysis.</p>
      <p>
        Software systems that are regarded as legacy by
their owners consist of more than just the old,
obsolete, and soon to be retired artefacts written in 1960s
languages like assembler [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and COBOL [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. As time
went by, the circle of legacy has started to include
systems that were built with 4th generation languages
(4GLs) of the 1980s [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], developed using model-driven
architecture (MDA) of the 1990s [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or created using
domain-specific languages (DSLs) of the 2000s [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It
is thus crucial for legacy software modernisation
companies to be able to adapt to new languages and
previously unknown idioms.
      </p>
      <p>
        Conquering even one legacy ecosystem with all its
languages, dialects, configurations and preprocessors
is a substantial effort for a company. It is far from
trivial to reuse knowledge about prior successful
migration projects to cope with the next one, for each of
them is unique in some way. The patterns to solve the
Y2K problem [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] are drastically different from
patterns for database migration or for turning procedural into
object-oriented code, and renovation patterns that work
effectively in one 4GL are often inapplicable to
another 4GL. In this context, any degree of automation
in the discovery and detection of coding idioms and
modernisation patterns and their corresponding code
transformation actions is worthwhile to pursue.
However, traditional software analysis and analytics tools
are usually geared towards detecting precise matches
for known patterns, such as a particular combination
of conditions and GO TO jumps that can be refactored
into a WHILE loop. What is really needed instead,
and what we are aiming to achieve, is the ability to
find and act upon unknown patterns that are perhaps
followed only to a limited extent.
      </p>
      <p>After having introduced the context of our work,
the rest of this paper is organised as follows: section 2
explains our objectives in sufficient detail to
appreciate the rest; section 3 dives into prior related work
around code idioms, the patterns that we are mining for;
section 4 visualises the pipeline of our framework
(Figure 1) and explains its components; section 5 reports
on preliminary results; and section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>Project Goals</title>
      <p>The goal of our work is to design and implement a
framework to explore novel pattern mining algorithms
for source code and to incorporate them in an
intelligent software modernisation assistant tool set.
Ideally, at the end of the project (end 2020), we should
have a tool set powerful enough to help legacy
software engineers analyse a previously unseen codebase
in some software language for previously unknown
patterns. With these tools, it should be possible to
analyse the available data (often just source code) quickly
and efficiently, recognise frequently occurring patterns,
confront domain experts with them and annotate them
with modernisation actions to produce a mature
modernisation solution within weeks, not decades.</p>
      <p>
        The framework being developed is
language-parametric thanks to a metamodel representation that
is able to support a variety of software languages. The
modernisation assistant will pro-actively recommend
source code modernisation actions [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] by comparing
the code being renovated with insights gained by
treating the source code and development history as data.
The assistant will continuously mine for previously
unknown patterns within the system's source code and
structure. Thus, the modernisation recommendations
made by the assistant can improve over time as it
refines or uncovers more previously unknown patterns.
      </p>
      <p>The three main goals of our framework are to:</p>
      <list list-type="order">
        <list-item>
          <p>Discover syntactic patterns to replace large, repeated, error-prone programming idioms [<xref ref-type="bibr" rid="ref12">12</xref>] by more succinct macros or proven programming language built-ins, with the purpose of improving code reliability, understandability, and maintainability.</p>
        </list-item>
        <list-item>
          <p>Discover code deviating from expected patterns, which may be indicative of dissimilarities and dormant errors.</p>
        </list-item>
        <list-item>
          <p>Propose actions to improve adherence to idioms, such as rewriting old-style FOR loops to functional alternatives in Java 8+ or replacing ad hoc string manipulations in older COBOL versions with modern equivalents from the standard library.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-3">
      <title>Idiomatic Code Patterns</title>
      <p>
        Coding conventions and idioms are syntactic patterns
in the source code. Conventions describe an overall
syntactic style that is meant to foster readability and
maintainability of source code [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Idioms are
fragments of code that recur frequently across different
projects, and play one semantic role [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. A piece of
code is often termed idiomatic if experienced
developers consider it to be written in an intuitive,
natural way. An idiom can be described in the form of a
code template, i.e., a snippet of code where parts can
be abstracted away with meta-variables. Examples of
scenarios that can be described with idioms include
iteration over a data structure, manipulating resources
(open, close, lock, etc.), handling errors, or executing
database transactions.
      </p>
      <p>
        IDEs often offer facilities to manually define idioms
and insert them whenever needed. However, these do
not help programmers if they are using a language or
library the IDE is not familiar with. To assist
programmers, Allamanis et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] describe an approach that
mines for code idioms in a corpus of idiomatic code.
These idioms are represented as a syntactic
probabilistic model that uses probabilities to measure the
quality of a proposed idiom. Similar approaches have been
used for measuring how natural/idiomatic code is, or
how it changes when bugs are fixed [
        <xref ref-type="bibr" rid="ref14 ref15 ref16 ref17">14, 15, 16, 17</xref>
        ].
Based on such measures, these approaches have all
found that software is repetitive; in other words,
idioms are often used.
      </p>
      <p>
        Allamanis et al. created the Naturalize tool [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ],
which learns the coding convention style of a program
and suggests changes to improve code consistency. It
uses statistical natural language processing to suggest
natural identifier names and formatting conventions.
      </p>
      <p>[Figure 1: The pipeline of the framework. Labels recoverable from the figure include Source Code, Source Code Importer, Meta-Model Representations, and Mining Preprocessor.]</p>
      <p>
        A follow-up project [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] focused on suggesting
appropriate method and class names from their bodies by
using a neural network and an n-gram language model.
      </p>
      <p>As idioms and coding conventions directly relate to
a programming language's syntax, most existing work
in this area focuses on tools targeted at one specific
language. Our work goes beyond this through the use
of metamodels to provide a language-parametric
representation for idioms and conventions. Our goal is to
demonstrate that patterns can be mined across
multiple languages with relatively small tooling effort.</p>
      <p>
        There have been multiple efforts towards language-parametric or
language-independent representations of source code. An arguably
well-known example is MOOSE [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and its FAMIX [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]
metamodel. Their focus is, however, different from that of our
work. Firstly, MOOSE was originally created for the
re-engineering of object-oriented systems, whereas we
do not have any restriction at all on the paradigm
of the programming language. Secondly, the FAMIX
metamodel allows for its instances (i.e. models of
programming languages) to abstract over certain parts of
the programs being modelled. Typically, such
models do not contain any information at a granularity
finer than method invocations. For pattern mining,
however, we require the complete abstract syntax tree
of a program to be present.
      </p>
      <p>
        Alternatively, Rakic et al. have worked on
language-independent static code analysis [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], based on
concrete syntax trees that are enriched with universal
nodes: nodes that are considered to be semantically
equivalent in all programming languages. However, the
presence of such nodes does not provide any additional
information that is relevant for our work. We search
for patterns in the source code, without regarding the
semantics of the nodes in a pattern tree. This is
because we do not have language-independent patterns
as a goal; instead, our patterns are specific to the
language being mined.
      </p>
    </sec>
    <sec id="sec-4">
      <title>The Framework</title>
      <p>As depicted in Figure 1, our framework is structured
as a pipeline comprising five main components, each described
in one of the following subsections.</p>
      <sec id="sec-4-1">
        <title>Source Code Importer</title>
        <p>A first challenge of the metamodel for our
modernisation assistant is to accommodate multiple (legacy or
other) programming languages. Indeed, it would not
be economical if a new version of the metamodel had to
be re-implemented for every language or even language
dialect it is applied to. To address this issue within our
framework, the metamodel defines a language-agnostic
abstract syntax tree (AST) format for source code.</p>
        <p>The format is an XML form of the AST: each
AST node is an XML element whose content is the
child relationships of the node. The start and end tags
of an AST node identify its node type, e.g.,
&lt;ForStatement&gt; is a Java for statement node. The
relationships inside such an element are again
XML elements, whose tags name the kind of
relationship, e.g., in a Java for statement node these would
be &lt;initializers&gt;, &lt;expression&gt;, &lt;updaters&gt; and
&lt;body&gt;. Each of these elements again contains a (list
of) AST nodes, in XML form.</p>
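        <p>As a minimal sketch of this format, the following Python fragment builds the XML form of a single Java for statement node. The helper ast_node and the child node types (VariableDeclarationExpression, InfixExpression, PostfixExpression, Block) are assumptions made for illustration and are not taken from the framework; only the ForStatement tag and the four relationship names come from the description above.</p>
        <preformat>
```python
import xml.etree.ElementTree as ET


def ast_node(node_type, **relationships):
    """Build one AST node.

    The element tag is the node type; every child element is a relationship
    whose content is the list of child AST nodes for that relationship.
    """
    node = ET.Element(node_type)
    for name, children in relationships.items():
        relationship = ET.SubElement(node, name)
        relationship.extend(children)
    return node


# Child node types below are hypothetical placeholders for illustration.
for_stmt = ast_node(
    "ForStatement",
    initializers=[ast_node("VariableDeclarationExpression")],
    expression=[ast_node("InfixExpression")],
    updaters=[ast_node("PostfixExpression")],
    body=[ast_node("Block")],
)

# The relationship elements appear in the order in which they were given.
relationship_names = [relationship.tag for relationship in for_stmt]
xml_text = ET.tostring(for_stmt, encoding="unicode")
print(xml_text)
```
        </preformat>
        <p>Because node types and relationship names are plain element tags, later pipeline stages in such a sketch can traverse the tree without knowing which language produced it.</p>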
        <p>The purpose of the source code importers is thus to
transform programs in a given language to their
representation in this format. Fundamentally, the only
language-dependent part of the framework is this first
step. Once an importer for a language has been
created, the remainder of the framework is used as-is.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Mining Preprocessor</title>
        <p>Before they are passed to the pattern miner, the ASTs
may be preprocessed in order to enhance the mining
process. Different preprocessing steps may be applied,
depending on what is being mined for. For example,
when considering naming conventions as part of the
mining, one preprocessor can split identifiers into a
subtree based on camel case or on underscores.
Another example would be mining at a granularity of
procedure-level entities and hence first removing
elements at finer granularities like statements or
(module-level) variable declarations.</p>
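        <p>The identifier-splitting step could look roughly as follows. This is a sketch under stated assumptions: the function names, the Identifier and Word tags, and the choice to lower-case the resulting words are all hypothetical and not taken from the framework.</p>
        <preformat>
```python
import re
import xml.etree.ElementTree as ET

# Matches, in this order: an acronym (capitals not followed by a lower-case
# letter), a capitalised or lower-case word, or a run of digits.
WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Split an identifier on dashes, underscores and camel-case boundaries."""
    words = []
    for part in re.split(r"[_\-]+", identifier):
        words.extend(word.lower() for word in WORD.findall(part))
    return words


def identifier_subtree(identifier):
    """Replace a flat identifier by a subtree with one child per word.

    The tag names are hypothetical; the real format may differ.
    """
    node = ET.Element("Identifier", name=identifier)
    for word in split_identifier(identifier):
        ET.SubElement(node, "Word").text = word
    return node


print(split_identifier("restoreData"))
print(split_identifier("WS-CUSTOMER-NAME"))
```
        </preformat>
        <p>With identifiers expanded in this way, two identifiers that share a word share a child node, which a tree miner can pick up even when the full names differ.</p>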
      </sec>
      <sec id="sec-4-3">
        <title>Pattern Miner</title>
        <p>The pattern miner is responsible for extracting
idiomatic code patterns, taking the preprocessed ASTs
as input. A concrete example of an idiomatic pattern
we found in the project JHotDraw is given in Fig. 2.
In several instances, a method is defined that
instantiates an AbstractUndoableEdit object with specific
implementations for undo and redo functionality. Note
that the ellipses (...) in the pattern are wildcards that
can represent any amount of code, illustrating that the
miner is able to capture complex patterns that cannot
be found otherwise, e.g., by clone detection tools.</p>
        <p>We are currently exploring the use of frequent graph mining algorithms, though other mining algorithms may be tried in the future. The most popular frequent graph mining algorithms are developed for trees [<xref ref-type="bibr" rid="ref23">23</xref>] and undirected graphs [<xref ref-type="bibr" rid="ref24 ref25 ref26">24, 25, 26</xref>], although standard algorithms produce too large a number of patterns (as discussed in section 5). Thus, an important component of our pattern miner is the definition of the heuristics and constraints used during the mining process, so as to avoid discovering redundant or useless patterns [<xref ref-type="bibr" rid="ref12">12</xref>]. In particular, our pattern mining algorithm relies on two ideas:</p>
        <list list-type="order">
          <list-item>
            <p>maximal frequent subtree mining, to ensure that a condensed representation of only large patterns is found;</p>
          </list-item>
          <list-item>
            <p>constraint-based data mining, in which additional constraints are imposed on the patterns to be found.</p>
          </list-item>
        </list>
        <p>The key benefit of constraint-based mining is that
it allows developers to specify easy-to-interpret
constraints on the patterns to include in the output of the
algorithm.</p>
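        <p>To make the two ideas concrete, the sketch below filters a list of already-mined candidate patterns, first by simple constraints and then by maximality. It is only an after-the-fact illustration: an actual miner would normally enforce constraints during the search rather than afterwards. The tree encoding, the function names, the three constraints (minimum size, minimum support, allowed root types) and the example patterns are all assumptions made for this example.</p>
        <preformat>
```python
def node(label, *children):
    """A tree is a pair (label, list of child trees)."""
    return (label, list(children))


def size(tree):
    return 1 + sum(size(child) for child in tree[1])


def matches_at(pattern, tree):
    """True if pattern occurs rooted at tree: the labels are equal and the
    pattern's children match, in order, a subsequence of the tree's children."""
    if pattern[0] != tree[0]:
        return False
    remaining = iter(tree[1])  # shared iterator keeps the left-to-right order
    return all(
        any(matches_at(p_child, t_child) for t_child in remaining)
        for p_child in pattern[1]
    )


def contains(tree, pattern):
    """True if pattern occurs at the root of tree or anywhere below it."""
    return matches_at(pattern, tree) or any(
        contains(child, pattern) for child in tree[1]
    )


def select_patterns(candidates, min_size, min_support, allowed_roots):
    """Keep (pattern, support) pairs that satisfy the constraints and that are
    not contained in another kept pattern (i.e. that are maximal)."""
    kept = [
        (p, s) for p, s in candidates
        if size(p) >= min_size and s >= min_support and p[0] in allowed_roots
    ]
    return [
        (p, s) for p, s in kept
        if not any(q != p and contains(q, p) for q, _ in kept)
    ]


# Hypothetical candidate patterns with made-up support counts.
method_with_statement = node(
    "MethodDeclaration", node("Block", node("ExpressionStatement")))
method_with_block = node("MethodDeclaration", node("Block"))
for_with_block = node("ForStatement", node("Block"))

selected = select_patterns(
    [(method_with_statement, 5), (method_with_block, 7), (for_with_block, 4)],
    min_size=2,
    min_support=3,
    allowed_roots={"MethodDeclaration", "ForStatement"},
)
```
        </preformat>
        <p>In this example the smaller method pattern is dropped because it is contained in the larger one, which is the condensing effect that maximality is meant to provide.</p>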
        <p>We are currently exploring what heuristics work
best for different kinds of idioms, and how to represent
these heuristics in an idiom- and language-agnostic
way, so that they can easily be adapted when
looking for other kinds of idioms, or when mining other
languages.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Pattern Matcher</title>
        <p>[Figure 2: Idiomatic undo/redo pattern found in JHotDraw.]</p>
        <preformat>protected void ...() {
  ...
  final ArrayList&lt;Object&gt; restoreData =
      new ArrayList&lt;Object&gt;(...);
  ...
  UndoableEdit edit = new AbstractUndoableEdit() {
    ...
    @Override
    public String getPresentationName() { ... }
    ...
    @Override
    public void undo() {
      super.undo();
      Iterator&lt;Object&gt; iRestore =
          restoreData.iterator();
      ...
    }
    ...
    @Override
    public void redo() {
      super.redo();
      ...
    }
  };
  fireUndoableEditHappened(edit);
}</preformat>
        <p>
          The pattern matcher is responsible for finding all AST
subtrees that match the patterns extracted by the
miner. While these ASTs are already known to the
pattern miner, we may want to apply postprocessing
steps to the patterns that are found, e.g., to further
generalise them such that the patterns are more widely
applicable. The pattern matcher is then needed to
find matches of these modified patterns. Another
application of the pattern matcher is that, once a
pattern has been mined in one project, it can be
matched against any other project.
The tool is designed to be language-parametric and is
based on code templates [
          <xref ref-type="bibr" rid="ref27 ref28">27, 28</xref>
          ]. A template is a
concrete snippet of source code, in which some parts can
be replaced by wildcards or metavariables. It is also
possible to attach so-called "directives" to parts of the
snippet, which can affect the semantics of the pattern
to match in various ways.
        </p>
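        <p>The following sketch illustrates how a template with wildcards and metavariables might be matched against an AST. It is not the matcher used in the framework: the tree encoding, the convention that labels starting with "?" are metavariables, the "..." wildcard label, and the example node labels are assumptions made for this illustration.</p>
        <preformat>
```python
WILDCARD = "..."


def node(label, *children):
    """A tree or template is a pair (label, list of children)."""
    return (label, list(children))


def match(template, tree, bindings):
    """Return the extended bindings if template matches tree, else None."""
    t_label, t_children = template
    if t_label.startswith("?"):
        # Metavariable: binds a whole subtree; repeated uses must agree.
        if t_label in bindings:
            return bindings if bindings[t_label] == tree else None
        return {**bindings, t_label: tree}
    if t_label != tree[0]:
        return None
    return match_children(t_children, tree[1], bindings)


def match_children(templates, trees, bindings):
    """Match a list of child templates against a list of child trees.

    Simplification: the first successful binding of each child is kept, so
    alternatives inside an already matched child are not revisited.
    """
    if not templates:
        return bindings if not trees else None
    first, rest = templates[0], templates[1:]
    if first[0] == WILDCARD:
        # Wildcard: skip zero or more sibling subtrees.
        for skip in range(len(trees) + 1):
            result = match_children(rest, trees[skip:], bindings)
            if result is not None:
                return result
        return None
    if not trees:
        return None
    extended = match(first, trees[0], bindings)
    if extended is None:
        return None
    return match_children(rest, trees[1:], extended)


# Hypothetical template: a block that contains a call on some target.
template = node("Block", node("..."), node("Call", node("?target")), node("..."))
tree = node("Block", node("Decl"), node("Call", node("Name:edit")), node("Return"))
no_call = node("Block", node("Decl"), node("Return"))

result = match(template, tree, {})
```
        </preformat>
        <p>Here result binds ?target to the subtree labelled Name:edit, whereas matching the same template against no_call yields None.</p>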
      </sec>
      <sec id="sec-4-5">
        <title>Modernisation Assistant</title>
        <p>The modernisation assistant provides a GUI that
allows a user to inspect all patterns uncovered by the
pattern miner, and their matches, both as text and as
graphs. The screenshots in Fig. 3 and Fig. 4
respectively show a match of the JHotDraw undo/redo
pattern in a specific source file, and the graph
representation of this pattern. The engineer is presented with a list of
patterns with their pattern size, support, confidence,
and type of root AST node. A specific pattern can be
selected for inspection, showing an overview of pattern
matches in the source code as well as concrete source
code snippets highlighted according to the structure
of the pattern. The graph representation of the
pattern essentially is an AST, where certain nodes are
annotated with the directives mentioned in Sec. 4.4.
For instance, in Fig. 4, a "match-set" directive is
attached to an AnonClassDecl, which indicates that this
AnonClassDecl node will match as long as its children
(two method declarations) can be found, even if the
actual matching node contains additional children or
they appear in a different order.</p>
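        <p>The behaviour described for the "match-set" directive can be sketched as follows. The encoding of templates as (label, directive, children) triples, the default rule used when no directive is given, and the node labels are assumptions made for this example; only the directive name and its informal meaning come from the text above.</p>
        <preformat>
```python
def template(label, directive=None, children=()):
    """A template node: (label, directive or None, list of child templates)."""
    return (label, directive, list(children))


def tree(label, *children):
    """An AST node: (label, list of child trees)."""
    return (label, list(children))


def matches(tmpl, ast):
    label, directive, t_children = tmpl
    if label != ast[0]:
        return False
    if directive == "match-set":
        return match_set(t_children, ast[1])
    # Assumed default: same number of children, matched in the same order.
    return len(t_children) == len(ast[1]) and all(
        matches(t, c) for t, c in zip(t_children, ast[1])
    )


def match_set(templates, candidates):
    """Each template child must match a distinct candidate, in any order;
    candidates that are left over are ignored."""
    if not templates:
        return True
    first, rest = templates[0], templates[1:]
    for i, candidate in enumerate(candidates):
        others = candidates[:i] + candidates[i + 1:]
        if matches(first, candidate) and match_set(rest, others):
            return True
    return False


undo_redo = [template("MethodDecl:undo"), template("MethodDecl:redo")]
as_set = template("AnonClassDecl", "match-set", undo_redo)
in_order = template("AnonClassDecl", None, undo_redo)

actual = tree(
    "AnonClassDecl",
    tree("MethodDecl:getPresentationName"),
    tree("MethodDecl:redo"),
    tree("MethodDecl:undo"),
)
```
        </preformat>
        <p>In this sketch as_set matches actual despite the extra method and the swapped order, while in_order does not.</p>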
        <p>The modular architecture of our framework is key to
achieving our research objectives. For example, given a
new programming language we mainly need to provide
a new Source Code Importer. However, we may also
need to define or configure a Preprocessor specific to the kind
of idioms we want to mine for in that language, and
to adapt the heuristics and constraints
used by the Pattern Miner. But the general pipeline
and algorithms would remain the same. Similarly, if
we would like to explore alternative or more advanced
pattern mining algorithms, in a language-agnostic way,
this could be done mostly by replacing the Pattern
Miner.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Preliminary Results &amp; Challenges</title>
      <p>In this section, we report on the current state of the
implementation of our framework, some preliminary
results, as well as some of the challenges we have faced.</p>
      <p>
        We currently have importers for COBOL and Java.<sup>1</sup>
The former is pragmatic custom code that is able to
process the entire NIST COBOL 85 compliance test
suite<sup>2</sup> as well as the code for a variety of industrial
legacy systems. The latter uses the Eclipse Java
metamodel and is able to successfully produce ASTs for all
source code in QUAATLAS [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]: a refined subset of the
Qualitas Corpus [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] of Java programs. The importers
also produce a description of the grammar of the
language that is used by the miner. Again, the Java
importer uses the Eclipse Java metamodel to produce
this grammar, whereas for the COBOL importer this
is custom code.
      </p>
      <sec id="sec-5-1">
        <title>Mining Preprocessor</title>
        <p>For the moment, we have only implemented a
preprocessing component that is able to split the identifiers
contained in a node into a subtree based on
camel case or the dash/underscore convention. When using
that preprocessor, instead of considering identifiers as
similar only when they are equal, identifiers can be
matched at a finer-grained level based on the similar
keywords they contain.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Pattern Miner</title>
        <p>
          Our pattern miner implements an extended and
adapted version of the FreqT [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] frequent subtree
mining algorithm. Although we have found that pure
FreqT can indeed be used for mining idiomatic code
patterns, it does have some limitations, such as being
highly time-consuming and generating a large number
of patterns, many of which are redundant. To tackle
these problems, we have been exploring various
customisations of the FreqT algorithm. As a result, we
have managed to reduce the execution time of FreqT
significantly, and to limit the number of discovered
patterns. Although we have not completed a full
empirical study yet, many of the discovered patterns seem
to correspond to relevant code idioms.
        </p>
        <p>To achieve these results, we had to use a variety
of heuristics and constraints. However, selecting the
appropriate constraints to apply is not a trivial task
since it seems to depend partly on the language and on
the kinds of patterns one wants to find. Even though
those constraints can easily be configured for other
languages and other kinds of patterns, it is less obvious
how to choose the appropriate constraints for legacy
languages that are less well-known, or when we do not
know upfront what kind of patterns we are looking for.
A particular challenge of our current research therefore
remains how to efficiently search for and evaluate
interesting and surprising patterns. As it is difficult to
measure how exhaustive our approach is, and doing so is
not the focus of our work, we believe our framework's value
lies in uncovering any new interesting patterns that
would be difficult to find otherwise. As such, aside
from measuring the miner's scalability towards larger
projects, our evaluation will mainly consider
qualitative aspects, e.g.: How many patterns are genuinely
useful? Do patterns tend to be project-specific or
general-purpose? Can these patterns be classified into
a number of categories? Given different configurations,
what is the ratio of interesting to non-interesting
patterns?</p>
        <p><sup>1</sup> We are currently working on an importer for C# as well.</p>
        <p><sup>2</sup> https://www.itl.nist.gov/div897/ctg/cobol_form.htm</p>
      </sec>
      <sec id="sec-5-3">
        <title>Pattern Matcher</title>
        <p>Currently, our pattern matcher is able to match precise
syntactic patterns. In the future, we plan to support
anomaly detection including the on-demand detection
of partial matches for a given mined pattern. To
facilitate inspection by a software engineer, the pattern
matching algorithm should also quantify its results by
indicating the extent to which a partial match
corresponds to a given pattern.
        </p>
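        <p>One conceivable way to quantify a partial match is the fraction of pattern nodes that can be found in a candidate subtree. The sketch below is purely illustrative and is not the measure used (or planned) in the framework: the scoring rule, the tree encoding, and the node labels are assumptions, and for brevity the rule ignores child order and allows a tree child to be reused by several pattern children.</p>
        <preformat>
```python
def node(label, *children):
    """A tree is a pair (label, list of child trees)."""
    return (label, list(children))


def size(tree):
    return 1 + sum(size(child) for child in tree[1])


def matched_nodes(pattern, tree):
    """Count the pattern nodes that can be matched when pattern is aligned
    with tree at the root. Each pattern child takes its best-matching tree
    child; unmatched pattern children contribute zero."""
    if pattern[0] != tree[0]:
        return 0
    return 1 + sum(
        max((matched_nodes(p_child, t_child) for t_child in tree[1]), default=0)
        for p_child in pattern[1]
    )


def match_score(pattern, tree):
    """Fraction of pattern nodes matched: 1.0 means a full match."""
    return matched_nodes(pattern, tree) / size(pattern)


# Hypothetical pattern and a candidate that deviates in one node.
pattern = node(
    "AnonClassDecl",
    node("MethodDecl:undo"),
    node("MethodDecl:redo", node("Call:super.redo")),
)
candidate = node(
    "AnonClassDecl",
    node("MethodDecl:undo"),
    node("MethodDecl:redo", node("Call:repaint")),
)

score = match_score(pattern, candidate)
```
        </preformat>
        <p>In this example three of the four pattern nodes are found, so the candidate would be reported as a partial match with a score of 0.75.</p>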
      </sec>
      <sec id="sec-5-4">
        <title>Modernisation Assistant</title>
        <p>Based on the output of the miner and pattern matcher,
the modernisation assistant is able to visualise
patterns, matches and their corresponding source code.
Despite its seemingly summarising role, it was useful
from very early on in the project to explore mining
results and let human users interpret them. It has
consequently been a driving force in customising the
miner and matcher to provide results that are more
straightforwardly interpretable by a modernisation
engineer. For example, we found that since patterns are
subtrees with parts that are left unspecified, it is
important for highlighted source code to show which part
of the source code is specified by the pattern and which
part is not. Hence, the pattern matcher should include
this information in each match.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>In this paper we have outlined our
language-parametric modular framework for mining idiomatic
code patterns, whose goal is to assist software
modernisation engineers in their work of migrating legacy
systems. We reported some preliminary results, as well
as some challenges we faced.</p>
      <p>The most notable challenges lie in configuring and
selecting the appropriate heuristics and constraints
when mining to guide the algorithm towards the kinds
of patterns one wants to find. This is particularly
relevant since the modernisation engineer will face
languages that are unknown to us and will not know
upfront what kind of patterns to look for. In light of
this, our focus is currently on establishing how to
efficiently search for and evaluate interesting and
surprising patterns. This would allow for easier
experimentation with heuristics and constraints.</p>
      <p>Challenges remain in making our
framework truly scalable and language independent,
but our promising first results make us confident that
our goals can be reached.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The project is funded by the Belgian Innoviris TeamUp
project INTiMALS (2017-TEAM-UP-7).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <article-title>"Legacy Systems: Coping with Success,"</article-title>
          <source>IEEE Software</source>
          , vol.
          <volume>12</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>19</fpage>
          –
          <lpage>23</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bisbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lawless</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Grimson</surname>
          </string-name>
          ,
          <article-title>"Legacy Information Systems: Issues and Directions,"</article-title>
          <source>IEEE Software</source>
          , vol.
          <volume>16</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>103</fpage>
          –
          <lpage>111</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Khadka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. V.</given-names>
            <surname>Batlajery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Saeidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jansen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Hage</surname>
          </string-name>
          ,
          <article-title>"How Do Professionals Perceive Legacy Systems and Software Modernization?"</article-title>
          in
          <source>ICSE'14</source>
          . ACM,
          <year>2014</year>
          , pp.
          <fpage>36</fpage>–<lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Veerman</surname>
          </string-name>
          ,
          <article-title>"Revitalizing modifiability of legacy assets,"</article-title>
          <source>Journal of Software Maintenance and Evolution: Research and Practice</source>
          , vol.
          <volume>16</volume>
          , no.
          <issue>4-5</issue>
          , pp.
          <fpage>219</fpage>–<lpage>254</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Blagodarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jaradin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Zaytsev</surname>
          </string-name>
          ,
          <article-title>"Raincode Assembler Compiler,"</article-title>
          in
          <source>SLE'16</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>221</fpage>–<lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. P. A.</given-names>
            <surname>Sellink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Sneed</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Verhoef</surname>
          </string-name>
          ,
          <article-title>"Restructuring of COBOL/CICS Legacy Systems,"</article-title>
          in
          <source>CSMR'99</source>
          . IEEE,
          <year>1999</year>
          , pp.
          <fpage>72</fpage>–<lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Zaytsev</surname>
          </string-name>
          ,
          <article-title>"Open Challenges in Incremental Coverage of Legacy Software Languages,"</article-title>
          in
          <source>PX/17.2</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>–<lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Mellor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Uhl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weise</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Soley</surname>
          </string-name>
          ,
          <source>MDA Distilled: Principles of Model-Driven Architecture</source>
          . Addison-Wesley,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Völter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Benz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dietrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Engelmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Helander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C. L.</given-names>
            <surname>Kats</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Visser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Wachsmuth</surname>
          </string-name>
          ,
          <source>DSL Engineering: Designing, Implementing and Using Domain-Specific Languages</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <source>The Year 2000 Software Problem: Quantifying the Costs and Assessing the Consequences</source>
          . ACM Press/Addison-Wesley,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Iosif-Lazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Al-Sibahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Dimovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Savolainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sierszecki</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Wasowski</surname>
          </string-name>
          ,
          <article-title>"Experiences from Designing and Validating a Software Modernization Transformation,"</article-title>
          in
          <source>ASE'15</source>
          . IEEE,
          <year>2015</year>
          , pp.
          <fpage>597</fpage>–<lpage>607</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Allamanis</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <article-title>"Mining Idioms from Source Code,"</article-title>
          in
          <source>FSE'14</source>
          . ACM,
          <year>2014</year>
          , pp.
          <fpage>472</fpage>–<lpage>483</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Goncharenko</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Zaytsev</surname>
          </string-name>
          ,
          <article-title>"Language Design and Implementation for the Domain of Coding Conventions,"</article-title>
          in
          <source>SLE'16</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>90</fpage>–<lpage>104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hellendoorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Godhane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacchelli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Devanbu</surname>
          </string-name>
          ,
          <article-title>"On the Naturalness of Buggy Code,"</article-title>
          in
          <source>ICSE'16</source>
          . IEEE,
          <year>2016</year>
          , pp.
          <fpage>428</fpage>–<lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hindle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. T.</given-names>
            <surname>Barr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gabel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Devanbu</surname>
          </string-name>
          ,
          <article-title>"On the Naturalness of Software,"</article-title>
          in
          <source>ICSE'12</source>
          . IEEE,
          <year>2012</year>
          , pp.
          <fpage>837</fpage>–<lpage>847</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ponzanelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mocci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bavota</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanza</surname>
          </string-name>
          ,
          <article-title>"On the Uniqueness of Code Redundancies,"</article-title>
          in
          <source>ICPC'17</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>121</fpage>–<lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hindle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Amaral</surname>
          </string-name>
          ,
          <article-title>"Syntax Errors Just Aren't Natural: Improving Error Reporting with Language Models,"</article-title>
          in
          <source>MSR'14</source>
          . ACM,
          <year>2014</year>
          , pp.
          <fpage>252</fpage>–<lpage>261</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Allamanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. T.</given-names>
            <surname>Barr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bird</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <article-title>"Learning Natural Coding Conventions,"</article-title>
          in
          <source>FSE'14</source>
          . ACM,
          <year>2014</year>
          , pp.
          <fpage>281</fpage>–<lpage>293</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
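          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Allamanis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. T.</given-names>
            <surname>Barr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bird</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <article-title>"Suggesting Accurate Method and Class Names,"</article-title>
          in
          <source>Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering</source>
          . ACM,
          <year>2015</year>
          , pp.
          <fpage>38</fpage>–<lpage>49</lpage>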
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ducasse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gîrba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Renggli</surname>
          </string-name>
          ,
          <article-title>"Meta-environment and executable metalanguage using Smalltalk: an experience report,"</article-title>
          <source>Software &amp; Systems Modeling</source>
          , vol.
          <volume>8</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>5</fpage>–<lpage>19</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tichelaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ducasse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Demeyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Nierstrasz</surname>
          </string-name>
          ,
          <article-title>"A meta-model for language-independent refactoring,"</article-title>
          in
          <source>Proceedings International Symposium on Principles of Software Evolution</source>
          . IEEE,
          <year>2000</year>
          , pp.
          <fpage>154</fpage>–<lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rakic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Budimac</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Savic</surname>
          </string-name>
          ,
          <article-title>"Language independent framework for static code analysis,"</article-title>
          in
          <source>Proceedings of the 6th Balkan Conference in Informatics</source>
          , ser. BCI '13. New York, NY, USA: ACM,
          <year>2013</year>
          , pp.
          <fpage>236</fpage>–<lpage>243</lpage>
          . [Online]. Available: http://doi.acm.org/10.1145/2490257.2490273
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Muntz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nijssen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Kok</surname>
          </string-name>
          ,
          <article-title>"Frequent Subtree Mining – An Overview,"</article-title>
          <source>Fundamenta Informaticae</source>
          , vol.
          <volume>66</volume>
          , no.
          <issue>1-2</issue>
          , pp.
          <fpage>161</fpage>–<lpage>198</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuramochi</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Karypis</surname>
          </string-name>
          ,
          <article-title>"Frequent Subgraph Discovery,"</article-title>
          in
          <source>ICDM'01</source>
          . IEEE,
          <year>2001</year>
          , pp.
          <fpage>313</fpage>–<lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yan</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>"gSpan: Graph-based substructure pattern mining,"</article-title>
          in
          <source>ICDM'02</source>
          . IEEE,
          <year>2002</year>
          , pp.
          <fpage>721</fpage>–<lpage>724</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nijssen</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Kok</surname>
          </string-name>
          ,
          <article-title>"A Quickstart in Frequent Structure Mining Can Make a Difference,"</article-title>
          in
          <source>KDDM'04</source>
          . ACM,
          <year>2004</year>
          , pp.
          <fpage>647</fpage>–<lpage>652</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>De Roover</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Inoue</surname>
          </string-name>
          ,
          <article-title>"The Ekeko/X Program Transformation Tool,"</article-title>
          in
          <source>SCAM'14</source>
          . IEEE,
          <year>2014</year>
          , pp.
          <fpage>53</fpage>–<lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>T.</given-names>
            <surname>Molderez</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>De Roover</surname>
          </string-name>
          ,
          <article-title>"Automated Generalization and Refinement of Code Templates with Ekeko/X,"</article-title>
          in
          <source>SANER'16</source>
          , vol.
          <volume>1</volume>
          . IEEE,
          <year>2016</year>
          , pp.
          <fpage>669</fpage>–<lpage>672</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>C.</given-names>
            <surname>De Roover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lämmel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Pek</surname>
          </string-name>
          ,
          <article-title>"Multidimensional Exploration of API Usage,"</article-title>
          in
          <source>ICPC'13</source>
          . IEEE,
          <year>2013</year>
          , pp.
          <fpage>152</fpage>–<lpage>161</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tempero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Anslow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dietrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lumpe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Melton</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Noble</surname>
          </string-name>
          ,
          <article-title>"The Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies,"</article-title>
          in
          <source>APSEC'10</source>
          . IEEE,
          <year>2010</year>
          , pp.
          <fpage>336</fpage>–<lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>T.</given-names>
            <surname>Asai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Abe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kawasoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Arimura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sakamoto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Arikawa</surname>
          </string-name>
          ,
          <source>Efficient Substructure Discovery from Large Semi-structured Data</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>158</fpage>–<lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>