=Paper= {{Paper |id=Vol-1305/paper10 |storemode=property |title=The TTC 2014 Movie Database Case: Rascal Solution |pdfUrl=https://ceur-ws.org/Vol-1305/paper10.pdf |volume=Vol-1305 |dblpUrl=https://dblp.org/rec/conf/staf/InostrozaS14a }} ==The TTC 2014 Movie Database Case: Rascal Solution== https://ceur-ws.org/Vol-1305/paper10.pdf
       The TTC 2014 Movie Database Case: Rascal Solution∗

                          Pablo Inostroza                               Tijs van der Storm

                Centrum Wiskunde & Informatica (CWI)            Centrum Wiskunde & Informatica (CWI)
                     Amsterdam, The Netherlands                      Amsterdam, The Netherlands
                            pvaldera@cwi.nl                                   storm@cwi.nl




       Rascal is a meta-programming language for processing source code in the broad sense (models, doc-
       uments, formats, languages, etc.). In this short note we discuss the implementation of the “TTC’14
       Movie Database Case” in Rascal. In particular we will highlight the challenges and benefits of using
       a functional programming language for transforming graph-based models.


1    Introduction
Rascal is a meta-programming language for source code analysis and transformation [2, 3]. Concretely,
it is targeted at analyzing and processing any kind of “source code in the broad sense”; this includes
importing, analyzing, transforming, visualizing and generating, models, data files, program code, docu-
mentation, etc.
     Rascal is a functional programming language in that all data is immutable (implemented using per-
sistent data structures), and functional programming concepts are used throughout: algebraic data types,
pattern matching, higher-order functions, comprehensions, etc.
     Specifically for the domain of source code manipulation, Rascal features powerful primitives for
parsing (context-free grammars), traversal (visit statement), relational analysis (transitive closure, image
etc.), and code generation (string templates). The standard library includes programming language gram-
mars (e.g., Java), IDE integration with Eclipse, numerous importers (e.g. XML, CSV, YAML, JSON etc.)
and a rich visualization framework.
     In the following sections we discuss the realization of the TTC’14 Movie Database case study [1] in
Rascal. We conclude the paper with some observations and concluding remarks. All code examples can
be found online at:
       https://github.com/cwi-swat/ttc2014-movie-database-case



2    Description of the solution
Representing IMDB in Rascal
As Rascal is a functional programming language, where all data is immutable, the IMDB models have
to be represented as trees instead of graphs. If there are cross references in the model, these have to be
represented using (symbolic or opaque) identifiers which can be used to look up referenced elements.
We use an algebraic data type to model IMDB models:


   ∗ This research was supported by the Netherlands Organisation for Scientific Research (NWO) Jacquard Grant “Next Gen-

eration Auditing: Data-Assurance as a service” (638.001.214).

                                                                                 c Inostroza and Van der Storm
Submitted to:
                                                                                 This work is licensed under the
TTC 2014
                                                                                 Creative Commons Attribution License.
2                                                                                                        TTC’14: Rascal


data IMDB      = imdb(map[Id, Movie] movies, map[Id, Person] persons,
                        set[Group] groups, rel[Id movie, Id person] stars);
data Movie     = movie(str title, real rating, int year);
data Person = actor(str name) | actress(str name);
data Group     = couple(real avgRating, Id p1, Id p2, set[Id] movies);

    An IMDB model is constructed using the imdb constructor. It contains the set of movies, persons,
groups and a relation stars encoding which movie stars which persons.        Both movies and persons are
identified using the opaque Id type. To model this identification, the movies and persons field of an IMDB
model are maps from such identifiers to the actual movie resp. person. Movies and persons are simple
values containing the various fields that pertain to them. The Group type captures couples as required in
Task 2. A couple references two persons and a set of movies using the opaque identifiers Id.

Task 1: Synthesizing Test Data
Synthesizing test data creates values of the type IMDB as declared in the previous section. The process
starts with an empty model (imdb((), (), {}, {})1 ), and then consecutively merges it with test models
for a value in the range 1, ..., n. Each test model in turn consists of merging the negative and positive test
model as discussed in the assignment. As an example, we list the creation of the positive test model:
IMDB createPositive(int i) = imdb(movies, people, {}, stars)
     when movies := ( j:        movie("m", toReal(j), 2013) | j <- [10*i..10*i+5] ),
            people := ( 10*i:        actor("a<10*i>"), 10*i+1:        actor("a<10*i+1>"),
                           10*i+2:    actor("a<10*i+2>"), 10*i+3:          actress("a<10*i+3>"),
                           10*i+4:    actress("a<10*i+4>") ),
            stars := {<10*i, 10*i>, <10*i, 10*i+1>, <10*i, 10*i+2>, <10*i, 10*i+3>,
                      <10*i+1, 10*i>, <10*i+1, 10*i+1>, <10*i+1, 10*i+2>, <10*i+1, 10*i+3>,
                      <10*i+2, 10*i+1>, <10*i+2, 10*i+2>, <10*i+2, 10*i+3>,
                      <10*i+3, 10*i+1>, <10*i+3, 10*i+2>, <10*i+3, 10*i+3>, <10*i+3, 10*i+4>,
                      <10*i+4, 10*i+1>, <10*i+4, 10*i+2>, <10*i+4, 10*i+3>, <10*i+4, 10*i+4>};

     The function uses map comprehensions to create the movies and people fields, and a binary relation
literal to create the stars relation. It then simply returns a value containing all those fields.

Task 2: Adding Couples
Task 2 consists of enriching IMDB models with couples: pairs of persons that performed in the same
movie, once or more often. This transformation is expressed by updating the couples field with the result
of the function makeCouples:
public map[int, set[int]] computeCostars(rel[Id movie, Id person] stars, int n){
     map[int star, set[int] movies] moviesPerStar = toMap(invert(stars));
     map[int movie, set[int] stars] personsPerMovie = toMap(stars);
     return toMap({ | <- stars, size(moviesPerStar[p])>=3, size(personsPerMovie[m])>=n});
}
set[Group] makeCouples(model:imdb(movies, persons, groups,stars)){
     map[int movie, set[int] stars]           costars = computeCostars(stars, n);

    1 The syntax () indicates an empty map, whereas (1:"a", 2:"b") represents a map with keys 1,2 and values "a","b".
Inostroza and Van der Storm                                                                                 3


     map[tuple[int star1, int star2] couple, set[int] movies] couples = ();
     for (int movie <- costars, int s1 <- costars[movie], int s2 <- costars[movie], s1 < s2) {
        couples[]?{} += {movie};     }
     return { couple(0.0, x, y, ms) |  <- couples, ms := couples[], size(ms) >=3 };
}

     The makeCouples function first converts the binary relation stars to a map (costars) from movie Id
to set of person Ids, filtering out some irrelevant elements by calling the computecoStars function. The
central for loop iterates over all movies and all combinations of two actors and adds the movie to a table
maintaining the set of movies for all couples (a map taking tuples of person Ids to sets of movie Ids). The
side condition s1 < s2 ensures we don’t visit duplicate or self combinations. The question mark notation
initializes a map entry with a default value, if the entry did not yet exist. In the final statement, a set of
Groups is returned containing all couples which performed in 3 or more movies.

Task 3: Computing Average Ratings for Couples
As can be seen in the previous section, the average rating field of couples is initialized to 0.0. In this
task we again transform an IMDB model, this time enriching each couple with its average rating of the
movies the couple co-starred in. The following function performs this transformation:
IMDB addGroupRatings(IMDB m) = m[groups=gs]
    when gs := { g[avgRating = mean([ m.movies[x].rating | x <- g.movies ])] | g <- m.groups };

    The groups field of the model m is updated with a new set of groups, as created in the when-clause
of the function. The new set of groups is created using a comprehension, updating the avgRating field of
each group. The average is computed based on the list of ratings obtained from the movies contained in
m that are referenced in the group g.



Extension Task 1: Top 15 Rankings
The object of the extension Task 1 is to compute top 15 rankings based on the average ratings or number
of movies a couple participated in. To represent rankings, we first introduce the following type alias:
alias Ranking = lrel[set[Person] persons, real avgRating, int numOfMovies];

    A ranking is an ordered relation (lrel) containing the co-stars, the average rating and the number of
movies of a couple. To generate rankings in this type, we create a generic function that takes the number
of entries (e.g., 15), an IMDB model, and a predicate function to determine ordering of groups. This last
argument allows to abstract over what the ranking is based (e.g., average rating or number of movies).
Ranking rank(int n, IMDB m, bool(Group, Group) gt) =
     take(n, [<{m.persons[x] | x <- getPersons(g)}, g.avgRating, size(g.movies)>
              | Group g <- sort(m.groups, gt)]);

    Again, this function employs a comprehension to create a ranking, iterating over the groups in
the IMDB model, sorted according to the predicate gt. For each person in the group (extracted using
getPersons), the actual person value is looked up in the model (m.person[x]). The take, size and sort
functions are in the standard library of Rascal.
    The actual top 15 rankings are then obtained as follows:
4                                                                                           TTC’14: Rascal


Ranking top15avgRating(IMDB m)       = rank(15, m, greaterThan(getRating));
Ranking top15commonMovies(IMDB m) = rank(15, m, greaterThan(getNumOfMovies));

    The last argument to rank is constructed using a higher-order function, greaterThan, which constructs
comparison functions on groups based on the argument getter function (i.e. getRating and getNumOfMovies).
So in the first case, rank is called with a comparison predicate based on average ratings of groups, whereas
in the second case, groups are ordered based on the number of shared movies in a group.

Extension Task 2: Generalizing groups to cliques
The extension task 2 consists of generalizing couples to arbitrarily sized cliques of people who co-starred
in the same set of movies. A couple is a special case where the clique size is 2. To represent arbitrary
cliques in the model, we modularly extended the Group data type (see Section 1.1) as follows:
data Group = clique(real avgRating, set[Id] persons, set[Id] movies);

    This declaration states that the clique constructor is now a valid group value, in addition to couple.
    Enriching an IMDB model with cliques follows the same pattern as enriching a model with couples.
In fact the following function follows almost exactly the same structure as the function makeCouples
described earlier:
set[Group] makeCliques(model:imdb(movies, persons, groups,stars), int n) {
     map[int movie, set[int] stars]      costars = computeCostars(stars, n);
     map[set[int] clique, set[int] movies] cliques = ();
    for (Id movie <- costars, set[Id] s <- combinations(costars[movie], n)) {
      cliques[s]?{} += {movie}; }
    return {clique(0.0, s, ms) | s <- cliques, ms := cliques[s], size(ms) >= 3 };
}

    Instead of iterating over pairs of actors explicitly, we now iterate over all combinations of size n using
the helper function combinations, which generates all combinations of size n taking elements from the set
costars[movie].



Extension Task 3 & 4 These are the same as Task 3 and Extension Task 1, respectively, but intended
for cliques instead of couples. It is not necessary to address these new cases in particular, because the
code is polymorphic over groups. Only in the case of Extension Task 4, the getPersons accessor has to
be extended to accommodate the new clique constructor defined in Extension Task 2:
set[Id] getPersons(clique(_, set[Id] ps, _)) = ps;




3    Observations and Concluding Remarks
Rascal can be seen as a model-transformation system, but it has to be acknowledged that its functional
nature poses certain challenges when compared to traditional model-transformation platforms:
    • Since Rascal is based on immutable data, models have to be represented as (containment) trees
      with explicit cross-references. As a result, all model elements need to have an identifier and some
      transformations have to explicitly look up model elements, given their identity.
Inostroza and Van der Storm                                                                                     5


   • For the same reason, model transformations feature non-destructive rewriting. That is, it is not
     possible to perform in-place updates. This has benefits for reasoning (locality), but might affect
     performance. In other words, in order to benefit from equational reasoning, there will be a com-
     promise in terms of performance. To improve this situation, active research is being conducted
     with the aim of optimizing the implementation of immutable data structures in Rascal [4].
    Looking back at the effort implementing the TTC’14 tasks we can observe that Rascal posed no
problems for solving the tasks. The solutions are small and declarative. The size of the implementation
is around 130 SLOC, including some helper functions, but excluding loading the model from XML
which is another 38 SLOC. Although Rascal allows side-effects in (local) variables, with the exception
of makeCouples and makeCliques none of the function use side-effects of any kind.
    Another observation is that the tasks mostly involved querying the models and aggregating new
results to enrich the model. In such cases, comprehensions are valuable features to create sets, maps, or
relations of model elements. Rascal’s built-in features for traversal and powerful pattern matching (e.g.,
deep pattern matching), were not (even) needed to perform most of the tasks in an adequate way.
    Related to the nature of the tasks, the fact that cross references had to be represented and managed
explicitly posed no problem. In all cases the top-level IMDB model was always available to perform the
necessary reference lookups in the movies and persons tables. It is, however, conceivable that in the case
of more complex transformations (in which the referential structure of a model needs to be changed),
more administration would be required in order to keep referential integrity intact.
    In terms of performance, and in absence of suitable benchmarks to compare, we can at this point
only report on the observed behavior of our solution. As an example, we observed that extracting cliques
of size 3 takes more than 1 hour on a 3.3Mb IMDB file2 . As it was shown in the code for computing
couples and cliques, we improved the performance by filtering out some model elements before the actual
couple/clique computation (e.g. removing actors who performed in less than 3 movies). By doing so we
could process the same file in 3 minutes.
    Finally, we would like to emphasize that Rascal’s module system proved its value. Some tasks could
be implemented as modular extensions of earlier tasks, combining extension of data types (Extension
Task 2) and extension of functions (Extension Tasks 3 and 4). In that way, it was possible to define
generic behavior for the Group ADT which initially considered just couples, but was later on modularly
extended to consider cliques too. Few additions to the original code were necessary, as the functionality
was defined in a generic way so that new variants could be naturally handled via polymorphism.


References
[1] Tassilo Horn, Christian Krause & Matthias Tichy (2014): The TTC 2014 Movie Database Case. In: 7th
    Transformation Tool Contest (TTC 2014), EPTCS.
[2] Paul Klint, Tijs van der Storm & Jurgen Vinju (2009): Rascal: A domain-specific language for source code
    analysis and manipulation. In: SCAM, pp. 168–177.
[3] Paul Klint, Tijs van der Storm & Jurgen Vinju (2011): EASY Meta-programming with Rascal. In: Generative
    and Transformational Techniques in Software Engineering III, Lecture Notes in Computer Science, Springer.
[4] Michael Steindorfer & Jurgen Vinju (2014): Code Specialization for Memory Efficient Hash Tries (Short
    Paper). In: Generative Programming and Component Engineering GPCE 2014, Proceedings.



  2 We ran our experiments on a laptop Apple MacBook Pro with Intel i7 CPU running at 2.9 GHz and 8GB memory.