<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Solving the TTC 2014 Movie Database Case with GrGen.NET</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>What is GrGen.NET?</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p>GRGEN.NET (www.grgen.net) is a programming language and development tool for graph structured data with pattern matching and rewriting at its core, which furthermore supports imperative and objectoriented programming, and features some traits of database-like query-result processing. Founded on a rich data modeling language with multiple inheritance on node and edge types, it offers pattern-based rewrite rules of very high expressiveness, with subpatterns and nested alternative, iterated, negative, and independent patterns, as well as preset input parameters and output def variables yielded to. The rules are orchestrated with a concise control language, which offers constructs that are simplifying backtracking searches and state space enumerations. Development is supported by graphical and stepwise debugging, as well as search plan explanation and profiling instrumentation for graph search steps - the former helps in first getting the programs correct, the latter in getting them fast thereafter. The tool was built for performance and scalability: its model data structures are designed for fast processing of typed graphs at modest memory consumption, while its optimizing compiler adapts the search process to the characteristics of the host graph at hand and removes overhead where it is not needed. GRGEN.NET lifts the abstraction level of graph-representation based tasks to declarative and elegant pattern-based rules, that allow to program with structural recursion and structure directed transformation [3]. The mission of GRGEN.NET is to offer the highest combined speed of development and execution available for graph-based tasks.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Edgar Jakumeit
eja@ipd.info.uni-karlsruhe.de
The task of the Movie Database Case [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is to identify all couples of actors who performed together
in at least three movies. The challenge of the task is to do this fast on a large database. We employ
the general purpose graph rewrite system GRGEN.NET in developing and optimizing a solution.
Task 2: Finding Couples
}
The workhorse rule for finding all pairs of persons which played together in at least three movies is
findCouples.
rule findCouples
{
pn1 : Person ; pn2 : Person ;
independent {
pn1 -: personToMovie -&gt; m1 : Movie &lt;-: personToMovie - pn2 ;
pn1 -: personToMovie -&gt; m2 : Movie &lt;-: personToMovie - pn2 ;
pn1 -: personToMovie -&gt; m3 : Movie &lt;-: personToMovie - pn2 ;
exec ( addCommonMoviesAndComputeAverageRanking (c , pn1 , pn2 ));
      </p>
      <p>It specifies a pattern of two nodes pn1 and pn2 of type Person, and an independent pattern which
asks for 3 nodes m1, m2, m3 of type Movie, each being the target of an edge of type personToMovie
starting at pn1, and each being also the target of an edge starting at pn2. (Types are given after a colon,
the optional name of the entity may be given before the colon, for edges they are inscribed into the edge
notation --&gt;).</p>
      <p>An independent pattern means for one that its content only needs to be searched and is not available
for rewriting, and for the other that for each pn1 and pn2 in the graph, it is sufficient to find a single
instance of the independent, even if the rule is requested to be executed on all available matches –
without the independent we would get all the permutations of m1, m2, and m3 as different matches.</p>
      <p>The rewriting is specified as nested pattern in modify mode, which means that newly declared
entities will be created, and all entities from the matched pattern kept unless they are explicitly requested to
be deleted. Here we create a new node c of type Couple, link it with edges of the types p1 and p2 to the
nodes pn1 and pn2, and then execute the helper rule addCommonMoviesAndComputeAverageRanking
on c, pn1, and pn2. The helper rule is used to create the commonMovies edges to all the movies both
played in (you find it in Figure 3 in A.1).</p>
      <p>The auto keyword after the rule requests GRGEN.NET to generate a symmetry filter for matches
from automorphic patterns. The pattern is automorphic (isomorphic to itself) because it may be matched
with pn2 for pn1 and pn1 for pn2.</p>
      <p>To get all pairs of persons which played together in at least three movies we execute the rule with
allbracketing from a graph rewrite sequence in the GRSHELL script MovieDatabase.grs, filtering away
automorphic matches, with the syntax: exec [findCouples\auto].</p>
      <p>
        The language constructs are explained in more detail in the extensive GRGEN.NET user manual [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Task 3: Computing Average Rankings</title>
      <p>The helper rule addCommonMoviesAndComputeAverageRanking that was already used for creating the
common movies edges is also used for computing the average rankings, you find it in Figure 3 in A.1.
The extension task 1 requires to compute the top couples according to the average rating of their common
movies, and according to the number of common movies. It is solved with the rule couplesWithRating
that you find in Figure 4 in A.1. We use the rule twice, ordering the matches differently. The sequence
[couplesWithRating\orderDescendingBy&lt;avgRating&gt;\keepFirst(15)] executed from the
GRSHELL asks for all matches of the rule couplesWithRating, then sorts them descendingly by the
avgRating pattern variable, then throws away all but the first top 15 matches. The other value of interest
is handled in exactly the same way.</p>
      <p>You can inspect the found results with the debugger of GRGEN.NET, see Figure 2. There, a
visualization of the match with most connections is displayed, in layout Circular; take a look at
MovieDatabaseLayout.grs to find out about the visualization configuration applied to reach this.
Notably we configured it so it shows only the matched entities plus the one-step connected elements
as context, as large graphs are costly (often too costly) to visualize.</p>
    </sec>
    <sec id="sec-3">
      <title>Extension Tasks 2, 3, 4</title>
      <p>3</p>
      <sec id="sec-3-1">
        <title>Getting it fast</title>
        <p>The other extension tasks asking to find cliques of actors are solved with manually coded rules for the
sizes 3, 4, and 5 as a straightforward generalization of the couples-based task; higher-order or
metaprogramming is not supported (and won’t be, you have to employ an external code generator if needed).
The clean pattern-based solution presented in the previous chapter unfortunately scales badly with
increasing model sizes. Utilizing the built-in search plan explanation and profiling instrumentation, we
were able to find out that the most expensive part is the check that there are at least 3 common movies
existing, for two persons that are already connected by a common movie. So we replaced it with an
imperatively formulated helper function, utilizing hash sets – they scale considerably better for
connectedness checking of a large number of adjacent(/incident) elements (O(n) instead of O(n n)).</p>
        <p>In addition to the built-in optimizations of independent inlining to ensure that pattern matching
follows connectedness, and search planning, adapting the pattern matching process to the characteristics of
the host graph at hand, did we apply some further optimizations of the code.</p>
        <p>We used imperatively formulated hash-set lookup instead of the nested-loops implementation of
the pattern matcher for the at-least-3-common-movies-check, to cope well with actors with high
numbers of movies.</p>
        <p>We even inspect the number of incident edges in this check (yielding an adaptive algorithm), so
the smaller hash set is built; in addition, we utilize an incident edges count index maintained by
the engine that gives us direct access to the number of incident edges without having to iterate and
count them.</p>
        <p>We applied some search space pruning by early filtering during the matching based on the unique
id order of the actors, instead of filtering of permuted matches of automorphic patterns afterwards;
and by checking for actors with fewer than 3 movies early on, to remove the ”long tail” (of actors
performing in few movies).</p>
        <p>We separated the computation of the average ranking from the addition of the common movies,
this allowed us a reformulation with an all-bracketed rule instead of an iterated pattern, saving us
the overhead of the graph parser employed in matching iterated, alternative, and subpatterns.
We parallelized the pattern matcher with a parallelize annotation. This does not minimize the
amount of work to be carried out for the specification as the other optimizations did, but maximizes
the amount of workers thrown on the work. A task with an expensive search like ours – as revealed
by profiling – benefits considerably from it.</p>
        <p>The final optimized rule is shown in Figure 5, and its helper functions in Figure 6 and Figure 7
(the optimization increased the LOC from about 30 for the original version of task 2 to about 90). The
optimization process is described in more detail in [4].
4</p>
      </sec>
      <sec id="sec-3-2">
        <title>Calling from API and Performance Results</title>
        <p>The task description asks for a standalone command line version for benchmarking. We supply a
C#Program that employs the GRGEN.NET API towards this end, which can be found in
MovieDatabaseBenchmarker.cs (everything else was coded entirely in the GRGEN-languages, and is based on actual
GRGEN.NET-features). It expects as first parameter the name of the rule to apply (findCouplesOpt,
findCliquesOf3Opt, findCliqueOf4Opt, findCliqueOf5Opt), and as second parameter either the
graph to import (e.g. imdb-0005000-50176.movies.xmi.grs), or the number of iterations to use in
generating the synthetic test graph. An additional sequence may be given in quotes, intended for emitting
the sorted results.</p>
        <p>The standalone version contains a further optimization that can be only applied on API level. After
importing the IMDB graphs, it reduces the named graphs to unnamed graphs, throwing away the name
information, which saves us a considerable amount of memory (and cache).</p>
        <p>On a Core i7 920 with 24GiB, running Windows with MS .NET, the largest synthetic benchmark
graph was processed in about 65sec for the couples, the 3 cliques in about 40sec, the 4 cliques in about
36sec, and the 5 cliques in about 20sec 1. The entire IMDB was processed on average in about 500sec
(searching for all couples performing in at least three movies, linking them to all of their common
movies). The cliques (which were outside of the main focus of our optimization work) show a
stunning computation time explosion, from 5 secs for cliques of 3, over something less than 2 minutes for
cliques of 4, to more than 2 hours for cliques of 5, already on the smallest IMDB graph.</p>
        <p>The official results (in seconds) measured on an Opteron 8387 with 32GiB, running LINUX with
mono are given in Table 1, for the time needed to synthesize the models and the time needed for
processing the couples on the synthesized models, and in Table 2, for the time needed for processing the couples
and 3-cliques on the IMDB models.
5</p>
      </sec>
      <sec id="sec-3-3">
        <title>Conclusion</title>
        <p>We first specified a clean and simple solution of the movie database task in the GRGEN languages, then
we optimized it for performance. The tool supported us in validating the solution with its graphical
debugger, and in optimizing it with its search plan explanation and profiling instrumentation for search
step counting.</p>
        <p>GRGEN.NET search planning can be compared to searching straw stars on a freshly harvested field,
by looking at the places where the ground is only slightly covered, only reaching into the haystacks when
they can’t be circumvented at all. A pattern matcher is generated based on the assumption that search
planning worked well in circumventing those haystacks. Here the task consists solely of diving within
a hay stack, a particularly large and interwoven one. So we had to supplement the declarative patterns
(implemented with loop-nesting behind the scenes, which is optimal for sparse graphs) by imperative
hash set based helper functions – at least this highlights that for about any task built on a graph-based
representation there are the language constructs available that are needed for solving it.</p>
        <p>Once again, GRGEN.NET is among the top performing tools.
1picking the fastest of 3 runs, with considerable variations in between
rule addCommonMoviesAndComputeAverageRanking (c: Couple , pn1 : Person , pn2 : Person )
{
iterated it {
pn1 -: personToMovie -&gt; m: Movie &lt;-: personToMovie - pn2 ;
modify {
c -: commonMovies -&gt; m;
eval { yield sum = sum + m. rating ; }
modify {
def var sum : double = 0.0;
eval { c. avgRating = sum / count ( it ); }
[4] Edgar Jakumeit (2014): Optimizing a Solution for the TTC 2014 Movie Database Case. Technical Report.</p>
        <p>Available at http://www.info.uni-karlsruhe.de/software/grgen/OptimizingMovie.pdf.
[5] Edgar Jakumeit (2014): SHARE demo related to the paper Solving the TTC 2014 Movie Database Case with
GrGen.NET. http://is.ieis.tue.nl/staff/pvgorp/share/?page=ConfigureNewSession&amp;vdi=
Ubuntu12LTS_TTC14_64bit_grgen_v2.vdi.</p>
      </sec>
      <sec id="sec-3-4">
        <title>A Appendix</title>
        <p>A.1</p>
        <sec id="sec-3-4-1">
          <title>Getting it right</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Task 2 and 3</title>
      <p>The helper rule addCommonMoviesAndComputeAverageRanking shown in Figure 3 eats all common
movies with an iterated pattern which is matched as often as possible; in the rewrite part it links
the Couple node to each such movie with a commonMovies edge. In the eval part used for attribute
evaluation, the avgRating is computed as the sum of the ratings of the movies munched, divided by the
count of the iterateds matched. The def var is used to define a variable whose content is computed
after matching from the matched entities, the yield is used to assign to such variables, from nested
patterns up to their containing patterns (normally variables are passed the other way round, from nesting
to nested patterns, following the flow of matching).
rule couplesWithRating
{
c: Couple ;
c -:p1 -&gt; pn1 : Person ;
c -:p2 -&gt; pn2 : Person ;
def var avgRating : double ;
def var numMovies : int ;
yield {
yield avgRating = c. avgRating ;
yield numMovies = outgoing (c, commonMovies ). size ();
modify {
emit (" avgRating : " + avgRating + " numMovies : " + numMovies</p>
      <p>+ " by " + pn1 . name + " and " + pn2 . name + "\n");
}
} \ orderDescendingBy &lt; avgRating &gt;, orderDescendingBy &lt; numMovies &gt;
after the (top-down) pattern matching completed. In the rewrite part specified in modify mode we just
emit the values of interest. Furthermore, we request GRGEN to generate sorting code for the def pattern
entities avgRating and numMovies with the declaration of the auto-generated matches accumulation
filters orderDescendingBy&lt;avgRating&gt; and orderDescendingBy&lt;numMovies&gt;.</p>
      <sec id="sec-4-1">
        <title>A.2 Getting it fast</title>
        <p>Figure 5 shows the rule findCouplesOpt we get after the final optimization step. Figure 6 shows the
helper function atLeastThreeCommonMovies we employ to find out if there are 3 common movies,
and Figure 7 shows the helper function getCommonMovies we employ to compute the common movies.
Note that the adjacentOutgoing function is built-in.</p>
      </sec>
      <sec id="sec-4-2">
        <title>A.3 Calling from API and Performance Results</title>
        <p>In Table 1 and Table 2 you find the official measurements.
rule findCouplesOpt [ parallelize =16]
{
pn1 : Person ; pn2 : Person ;
hom (pn1 , pn2 );
independent {
pn1 -p2m1 : personToMovie -&gt; m1: Movie &lt;-p2m2 : personToMovie - pn2 ;
hom (pn1 , pn2 ); hom (p2m1 , p2m2 );
if{ atLeastThreeCommonMovies (pn1 , pn2 ); }
}
if{ uniqueof ( pn1 ) &lt; uniqueof ( pn2 ); }
if{ countPersonToMovie [ pn1 ] &gt;= 3; }
if{ countPersonToMovie [ pn2 ] &gt;= 3; }
def ref common :set &lt;Node &gt;;
yield {</p>
        <p>yield common = getCommonMovies (pn1 , pn2 );
eval {
for ( movie : Node in common ) {</p>
        <p>add ( commonMovies , c, movie );
}
function atLeastThreeCommonMovies ( pn1 : Person , pn2 : Person ) : boolean
{
if( countPersonToMovie [ pn1 ] &lt;= countPersonToMovie [ pn2 ]) {
def var common : int = 0;
def ref movies :set &lt;Node &gt; = adjacentOutgoing (pn1 , personToMovie );
for ( movie : Node in adjacentOutgoing (pn2 , personToMovie ))
{
if( movie in movies ) {
common = common + 1;
if( common &gt;= 3) {</p>
        <p>return ( true );
}</p>
        <p>}
}
} else { /* pn1 and pn2 reversed */ }
return ( false );
function getCommonMovies ( pn1 : Person , pn2 : Person ) : set &lt; Node &gt;
{
def ref common :set &lt; Node &gt; = set &lt; Node &gt;{};
if ( countPersonToMovie [ pn1 ] &gt;= countPersonToMovie [ pn2 ]) {
def ref movies :set &lt; Node &gt; = adjacentOutgoing (pn2 , personToMovie );
for ( movie : Node in adjacentOutgoing (pn1 , personToMovie ))
{
if ( movie in movies ) {</p>
        <p>common . add ( movie );</p>
        <p>}
} else { /* pn1 and pn2 reversed */ }
return ( common );</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Jakob</given-names>
            <surname>Blomer</surname>
          </string-name>
          , Rubino Geiß &amp; Edgar
          <string-name>
            <surname>Jakumeit</surname>
          </string-name>
          (
          <year>2014</year>
          ):
          <article-title>The GrGen</article-title>
          .
          <article-title>NET User Manual</article-title>
          . http://www.grgen. net/GrGenNET-Manual.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Tassilo</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Krause</surname>
          </string-name>
          &amp; Matthias
          <string-name>
            <surname>Tichy</surname>
          </string-name>
          (
          <year>2014</year>
          ):
          <article-title>The TTC 2014 Movie Database Case</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Edgar</given-names>
            <surname>Jakumeit</surname>
          </string-name>
          (
          <year>2011</year>
          )
          <article-title>: EBNF and SDT for GrGen</article-title>
          .
          <source>NET. Technical Report</source>
          . Available at http://www.info. uni-karlsruhe.de/software/grgen/EBNFandSDT.pdf.
          <source>Presented at AGTIVE</source>
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>