<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tassilo Horn</string-name>
          <email>horn@uni-koblenz.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Krause</string-name>
          <email>christian.krause01@sap.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Tichy</string-name>
          <email>matthias.tichy@cse.gu.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chalmers j University of Gothenburg</institution>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Software Technology, University Koblenz-Landau</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SAP Innovation Center</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p>Social networks and other web 2.0 platforms use huge amounts of data to offer new services to customers. Often this data can be expressed as huge graphs and thus could be seen as a potential new application field for model transformations. However, this application area requires that model transformation tools scale to models with millions of objects. This transformation case targets this application area by using the IMDb movie database as a model. The transformation deals with identifying all actor couples which perform together in a set of at least three movies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The driving force behind social networks and other new web 2.0 offerings is often a huge amount of data
from which interesting information can be extracted. Consequently, concepts like MapReduce [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
libraries like Hadoop [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Giraph [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] have been developed to efficiently process this huge amount of
data. However, model transformation approaches have not adressed this field so far.
      </p>
      <p>Automotive software is an already well-established application field for model-driven software
engineering and its models also approach huge sizes. As a consequence from these two examples, model
transformation approaches must be scalable to models with million objects to be applicable for these
application areas.</p>
      <p>
        In the following, we present a case which uses the IMDb movie database [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] as a data source. The
IMDb movie database contains information about movies, actors performing in the movies, movie
ratings, etc. The main task is to develop a model transformation which identifies all couples of two actors
who perform in at least three common movies and calculate the average rating of those movies.1 This
core task is then generalized to cliques of n actors. Furthermore, some queries calculating top-15 lists of
couples and cliques are to be written. Evaluation criteria are correctness/completeness and performance.
      </p>
      <p>In the next section, we describe the case in more detail including the meta model as well as the
different core and extension tasks. After that, Section 3 presents the evaluation criteria for submitted
solutions to this case.
the models. In addition to the obvious classes, the metamodel contains a common superclass for actors
and actresses as well as classes for groups of actors which play in common movies. The class Group
contains the attribute avgRating which is intended to store the average rating of the common movies of
the group of actors. The metamodel distinguishes between groups of two persons (a Couple) or a Clique
of n persons to support the different tasks in this transformation case.</p>
      <p>
        The EMF model as well as a parser for the IMDb database files is available at [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The pre-parsed
IMDb models are available on request2. The transformation case will use synthetic data (see Task 1) as
well as real IMDb data for the evaluation of the correctness and performance of the submitted solutions.
2.1
      </p>
      <sec id="sec-1-1">
        <title>Task 1: Generating Test Data</title>
        <p>The goal of this task is to generate (synthetic) test data which will be used later to evaluate the
correctness and the performance of the solution of the main task. The transformation to be implemented
takes an empty model and a parameter N 0, and generates movies, actresses, actors, and references
between them. The number of objects to be generated is determined by the parameter N. Specifically, the
transformation is supposed to generate 5N actors, 5N actresses and 10N movies, totalling in 20N nodes.
Additionally, the references between persons and movies are generated in a specified way.</p>
        <p>
          The specific patterns to be generated are shown in the Henshin [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] rules in Figure 2. Each of the
two used rules generates five persons and five movies. The five persons in the rule createPositive play
together in three movies. In contrast, every possible pair of persons generated by the rule createNegative
plays together in at most two movies. The movie ratings and the name of the persons are derived from the
rule parameter n which takes the values from 0 to N 1. The entry point for the test data generator is the
iterated unit createExample. The unit has the integer-valued parameter N which determines the number
of loops to execute. Specifically, this unit executes the sequential unit createTest N times with parameter
values 0 : : : N 1 for n. The unit createTest invokes the rule createPositive and createNegative both
with n as parameter. Test data should be generated for N being 1000, 2000, 3000, 4000, 5000, 10000,
50000, 100000, and 200000.
        </p>
        <p>2Contact Matthias Tichy for getting access.
In this task, a transformation shall be implemented that takes a graph consisting of inter-connected
movies, actors and actresses as input, and creates additional nodes and links in this graph.
Specifically, the task is to find all pairs of persons (actors or actresses) which played together in at least three
movies. For every such pair, the transformation is supposed to create an object of type Couple referencing
both persons using the p1 and p2 references, and referencing all movies in which both persons played in
using the commonMovies reference.
2.3</p>
      </sec>
      <sec id="sec-1-2">
        <title>Task 3: Computing Average Rankings</title>
        <p>The input model of this task is the one generated in Task 2, i.e., a graph consisting of movie, actor, actress
and couple nodes. The goal of this task is to set the avgRating-attribute of all couple nodes to the average
(i.e. the arithmetic mean) of the ratings of all movies that both persons played in.
2.4</p>
      </sec>
      <sec id="sec-1-3">
        <title>Extension Task 1: Compute Top-15 Couples</title>
        <p>The goal of this task is to produce top-15 lists of the couples created by Task 2. For this purpose, two
model queries should be given.</p>
        <p>(a) Compute the top 15 couples according to the average rating of their common movies (requires</p>
        <p>Task 3 to be solved).</p>
        <p>(b) Compute the top 15 couples according to the number of common movies.</p>
        <p>Each of the couples in the top-15 lists should be printed with the names of the two persons, the
average rating (only if Task 3 has been solved), and the number of the couple’s common movies. We
don’t require printing the common movies’ titles because the couple with the most common movies in
the complete IMDb model has more than 400 of them. If two couples have the same value for the average
rating/number of common movies, their order should be determined in some stable manner.
2.5</p>
      </sec>
      <sec id="sec-1-4">
        <title>Extension Task 2: Finding Cliques</title>
        <p>This extension task is a generalization of Task 2. A clique is a group of at least n persons (with n 3)
acting together in at least 3 movies. So a couple is essentially a clique of size n = 2.</p>
        <p>The extension task is to find cliques of a given size n, and to create a Clique element for each of
them referencing the clique’s members using the persons reference and its common movies using the
commonMovies reference.</p>
        <p>The task will be evaluated for n 2 f3; 4; 5g, so it could be solved by writing three similar rules
manually. However, to achieve a full completeness score for this task, a solution should work for any n 3.
Therefore, a transformation could have n as a parameter, or there could be a higher-order transformation
that receives n and generates a rule creating cliques of exactly that size.
2.6</p>
      </sec>
      <sec id="sec-1-5">
        <title>Extension Task 3: Compute Average Rankings for Cliques</title>
        <p>Like it was done for couples in Task 3, the avgRating attribute of cliques should be set to the average
rating of all its common movies.
2.7</p>
      </sec>
      <sec id="sec-1-6">
        <title>Extension Task 4: Compute Top-15 Cliques</title>
        <p>This is a variant of Extension Task 1 for cliques instead of couples. Again, two queries should be given.
(a) Compute the top-15 cliques of a given size n according to the average rating of their common
movies (requires Extension Task 3 to be solved).
(b) Compute the top-15 cliques of a given size n according to the number of common movies.</p>
        <p>Again, every clique should be printed with the names of its members, the average rating, and the
number of common movies.
3</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Evaluation Criteria</title>
      <p>The evaluation of the submitted transformation will be done on synthetic data as well as real data from
the IMDb database. The IMDb database is regularly updated. In order to provide a common set of data,
we provide the models generated from the IMDb database in December 2013 to participants by request2.
3.1</p>
      <sec id="sec-2-1">
        <title>Correctness and Completeness</title>
        <p>All tasks and extension tasks are scored evenly with zero to three points. Zero means the task has not
been tackled at all, three points means the task has been completely solved and the implementation is
correct. The performance of a solution is not relevant, here. If a solution works correctly for the smaller
models but won’t terminate or run out of memory for the larger models, it may still achieve three points.</p>
        <p>The following list explains the evaluation criteria for the individual tasks.
Task 1 The test data generation will be evaluated for different values of N. The correct number of
elements, their relationships, and the correctly set attribute values will be assessed.</p>
        <p>Task 2 The correct number of couples will be evaluated for both the synthetic and the IMDb models.</p>
        <p>Furthermore, the correct setting of the p1, p2, and commonMovies references will be spot-checked.
Task 3 The correct average ranking of couples will be checked for both synthetic and IMDb models.
Ext. Task 1 The Top-15 lists of couples will be evaluated for both the synthetic and the IMDb models.
Ext. Task 2 The finding of cliques will be evaluated for the sizes n 2 f3; 4; 5g for the synthetic models
and for n = 3 for the IMDb models. The main criterion is that the value of n can be chosen freely,
i.e., the rule for a given n is not written manually but it is a parameter to the transformation or a
parameter to a higher-order transformation generating a transformation for that value.
Ext. Task 3 The correct average ranking of cliques will be checked for both synthetic and IMDb models.
Ext. Task 4 The Top-15 lists of cliques will be evaluated for both the synthetic (n 2 f3; 4; 5g) and the
IMDb models (n = 3). Like for Extension Task 1, the a stable sorting according to (a) average
rating, or (b) number of common movies is enough to achieve a full score.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Benchmarks</title>
        <p>
          The goal of this task is to generate a performance benchmark of your solution to Task 2 (finding couples).
This benchmark should be executed using two different sets of input data:
(a) synthetic test data generated using the transformation for Task 1,
(b) provided data from the IMDb movie database (available at [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]; parsable, e.g., using [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]).
For both cases, you should run the transformation for Task 2 and measure the time needed to complete the
transformation (without loading and saving the model). If you solved also the extension task 2 (finding
cliques), please also generate benchmarks for these cases using n 2 f3; 4; 5g for the synthetic test models
and n = 3 for the IMDb models.
        </p>
        <p>In order to evaluate the scalability of the solution and to compare it with other solutions, you should
use specific input models / model sizes. For the synthetic test data, please use the values for N stated in
Section 2.1. For the IMDb data version, please use the provided models which are available on request2.</p>
        <p>The benchmark results are scored by comparing the run-times with the other solutions. The fastest
solution gets 21 points (same amount as for correctness and completeness). The rest of the solutions get
proportional scores.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Apache</given-names>
            <surname>Software Foundation</surname>
          </string-name>
          . Apache Giraph. http://giraph.apache.org.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Apache</given-names>
            <surname>Software Foundation</surname>
          </string-name>
          . Apache Hadoop. http://hadoop.apache.org.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Arendt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bierman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jurack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Krause</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Taentzer</surname>
          </string-name>
          . Henshin:
          <article-title>Advanced concepts and tools for in-place EMF model transformations</article-title>
          .
          <source>In Proc. MoDELS</source>
          <year>2010</year>
          , LNCS
          <volume>6394</volume>
          , pages
          <fpage>121</fpage>
          -
          <lpage>135</lpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sanjay</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          .
          <source>MapReduce: simplified data processing on large clusters. Commun. ACM</source>
          ,
          <volume>51</volume>
          (
          <issue>1</issue>
          ):
          <fpage>107</fpage>
          -
          <lpage>113</lpage>
          ,
          <year>January 2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Tassilo</given-names>
            <surname>Horn</surname>
          </string-name>
          . IMDB2EMF: https://github.com/tsdh/imdb2emf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Internet</given-names>
            <surname>Movie</surname>
          </string-name>
          <article-title>Database (IMDB). Alternative interfaces</article-title>
          : http://www.imdb.com/interfaces.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Christian</given-names>
            <surname>Krause</surname>
          </string-name>
          , Matthias Tichy, and
          <string-name>
            <given-names>Holger</given-names>
            <surname>Giese</surname>
          </string-name>
          .
          <article-title>Implementing graph transformations in the bulk synchronous parallel model</article-title>
          .
          <source>In Proc. FASE 2014</source>
          . Springer,
          <year>2014</year>
          .
          <article-title>Accepted for publication</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>