Solving the TTC 2014 Movie Database Case with UML-RSDS

                                    K. Lano, S. Yassipour-Tehrani
                              Dept of Informatics, King’s College London

      This paper describes a solution to the Movie Database case using UML-RSDS. The solution specifi-
      cation is declarative and logically clear, whilst the implementation (in Java) is of practical efficiency.


1 Solution definition as a UML-RSDS specification
UML-RSDS [1] is a hybrid MT language which uses UML notations to specify transformations: source
and target metamodels of a transformation are defined as UML class diagrams, transformations are ex-
pressed as use cases, whose effect is specified by a sequence of postconditions written in OCL. This
provides an expressiveness similar to other hybrid languages such as GrGen or ETL. The UML-RSDS
tools automatically synthesise executable implementations of transformations from the UML specifica-
tions.
    For the case study specification, we define separate use cases for each task of the case study. Each
use case defines a sub-transformation of the problem.

Task 1: Create synthetic datasets We implement this task by a use case task1 which has parameter
n : Integer and a single postcondition
Integer.subrange(0,n-1)->forAll( x | Movie.createPositive(x) & Movie.createNegative(x) )
where createPositive is a static operation of Movie which creates the 5 movies, 3 actors and 2 actresses
of each positive case, and createNegative is a static operation of Movie which creates the 5 movies, 2
actors and 3 actresses of each negative case.
    createPositive is:
createPositive(n : Integer)
pre: n >= 0
post:
  Movie->exists( m1 | m1.rating = 10*n &
    Movie->exists( m2 | m2.rating = 10*n + 1 &
     Movie->exists( m3 | m3.rating = 10*n + 2 &
       Movie->exists( m4 | m4.rating = 10*n + 3 &
         Movie->exists( m5 | m5.rating = 10*n + 4 &
            Movie.createPositiveActors(n,m1,m2,m3,m4,m5) &
            Movie.createPositiveActresses(n,m1,m2,m3,m4,m5) ) ) ) ) )
where createPositiveActors creates the actors a, b and c and links them to the movies as required, and
likewise for createPositiveActresses. The definition of createNegative is similar.

Task 2: Find couples       We implement this task by a use case task2 which has a single postcondition:
  p : Person & q : p.movies.persons & p.name < q.name &
  comm = p.movies /\ q.movies & comm.size > 2    =>
      Couple->exists( c | p : c.p1 & q : c.p2 & c.commonMovies = comm )


To appear in EPTCS.
2                                                                 Movie Database Case with UML-RSDS


This constraint is implicitly ∀-quantified over persons p and q. It creates a couple c for each distinct pair
p and q of persons whose set of common movies comm has size at least 3. /\ denotes intersection, also
written as ∩. Only one couple is created for each pair because of the restriction that p1 always holds the
person with the lexicographically smallest name.
    The quantifier q : Person can be restricted to q : p.movies.persons because the conditions comm =
p.movies ∩ q.movies & comm.size > 2 imply that q ∈ p.movies.persons (a case of the Restricting Input
Ranges transformation design pattern [2]).
    The implementation is a linear iteration through Person and its execution time should therefore be of
order Person.size ∗ C where C is the maximum size of movies.persons. However, efficient computation
of set intersections is needed for situations where the sets of common movies become large.

Task 3: Calculate average scores for couples        This is implemented by a use case task3 with a single
postcondition operating on context Couple:
    avgRating = ( commonMovies->collect(rating)->sum() ) / commonMovies.size
This iterates over objects self of Couple, and sets the average rating of each couple equal to the average
of the rating of each of their common movies (if two or more movies have the same rating, these ratings
are all counted separately in the sum).

Extension task 1: List best 15 couples The set of existing couples can be sorted in different orders
using the sortedBy operator. For example:

      Couple→sortedBy(−avgRating)

is the sequence of couples in order of decreasing avgRating.
     However this would be very inefficient in this situation, where only the best 15 elements with respect
to a given measure are needed, out of possibly millions of elements.
     In UML-RSDS it is possible to extend the system library with new functions, which are provided
with an implementation by the developer. Here we need a version of sortedBy which takes a bound on
the number of elements to return: SortLib.sortByN(s, s→collect(e), n) returns the best n elements of s ac-
cording to e, sorted in ascending e-value order. Semantically it is the same as s→sortedBy(e).subrange(1, n).
     We define an external module SortLib with sortByN as a static operation, and provide (hand-written)
Java code for this operation, making use of the existing UML-RSDS merge sort algorithm. The use case
then has the postcondition:
bestcouples = SortLib.sortByN(Couple.allInstances,
                            Couple->collect(-avgRating), 15)   =>
                                   bestcouples->forAll( c | c->display() )
A toString() : String operation is added to Couple which returns a display string consisting of the average
score, number of movies and persons of each couple. This string is printed to the console by c→display().
    An example of the output is:
Couple avgRating 9992.5, 4 movies (a9993; a9994)
Couple avgRating 9992.0, 3 movies (a9990; a9992)
Couple avgRating 9992.0, 3 movies (a9990; a9993)
Couple avgRating 9992.0, 3 movies (a9990; a9994)
Couple avgRating 9992.0, 3 movies (a9991; a9992)
Couple avgRating 9992.0, 3 movies (a9991; a9993)
K. Lano, S. Yassipour-Tehrani                                                                           3


Couple avgRating 9992.0, 3 movies (a9991; a9994)
Couple avgRating 9992.0, 3 movies (a9992; a9993)
Couple avgRating 9992.0, 3 movies (a9992; a9994)
Couple avgRating 9991.5, 4 movies (a9990; a9991)
Couple avgRating 9982.5, 4 movies (a9983; a9984)
Couple avgRating 9982.0, 3 movies (a9980; a9982)
Couple avgRating 9982.0, 3 movies (a9980; a9983)
Couple avgRating 9982.0, 3 movies (a9980; a9984)
Couple avgRating 9982.0, 3 movies (a9981; a9982)
for the test case with N = 1000.
    Similarly, couples can be displayed in decreasing order of the number of common movies:
bestcouples2 = SortLib.sortByN(Couple.allInstances,
                              Couple->collect(-commonMovies.size), 15)   =>
                                      bestcouples2->forAll( c | c->display() )

Extension task 2: Generate cliques This use case assumes that task 2 has been completed. A use case
couples2cliques creates a 2-clique for each couple:
        Clique->exists( c |         c.persons = p1 \/ p2 & c.commonMovies = commonMovies )
This constraint has context Couple and is applied to each instance self of Couple.
   A use case nextcliques generates cliques of size n + 1 from those of size n:
persons@pre.size = n & p : commonMovies@pre.persons &
p.name > persons@pre.name->max() &
comm = p.movies /\ commonMovies@pre & comm.size > 2    =>
          Clique->exists( c | c.persons = cl.persons@pre->including( p ) &
                               c.commonMovies = comm )
This iterates over Clique@pre, so that only pre-existing cliques are considered as input to the rule, not
cliques generated by the rule. The nextcliques implementation is therefore a linear iteration over Clique,
rather than a fixed-point iteration.

Extension task 3: Calculate average score for cliques This task is implemented by a use case
exttask3 with a single postcondition operating on context Clique:
  avgRating = ( commonMovies->collect(rating)->sum() ) / commonMovies.size

Extension Task 4: List best 15 cliques As with extension task 1, this task can be achieved using a
specialised sorting operator that returns the best 15 cliques according to a valuation expression. Only
cliques of a given size n are of interest:
ncliques = Clique->select( persons.size = n ) &
bestcliques = SortLib.sortByN(ncliques, ncliques->collect(-avgRating), 15)   =>
                                           bestcliques->forAll( c | c->display() )
   Similarly for the display of cliques by the number of common movies:
ncliques2 = Clique->select( persons.size = n ) &
bestcliques2 = SortLib.sortByN(ncliques2, ncliques2->collect(-commonMovies.size), 15) =>
                             bestcliques2->forAll( c | c->display() )
4                                                                    Movie Database Case with UML-RSDS


2 Results
To run the use cases for couples from the command line, type
java Controller couples N
where N is the synthetic data set required (1000, 2000, etc).
   Table 1 shows the execution times of the tasks on SHARE for the synthesised data sets, using an
unoptimised Java 4 implementation (in which sets are represented as Vectors).

                                              N         task2
                                              1000      110ms
                                              2000      162ms
                                              3000      262ms
                                              5000      602ms
                                              10000     670ms

                         Table 1: Execution times for synthetic data sets (Java 4)

    Using the filter architectural pattern we could pre-filter the data to reduce input model size by remov-
ing all movies with fewer than 2 (fewer than M for M-cliques) cast members, and all people with fewer
than 3 movies [2]. This reduces the execution time for task2 and extension task 2.
    To run the use cases for cliques from the command line, type
java Controller cliques N
where N is the synthetic data set required (1000, 2000, etc).
   Table 2 shows the execution time for extension task 2 for clique sizes from 3 to 5.

                            N         exttask2 (3)    exttask2 (4)   exttask2 (5)
                            1000      115ms           114ms          75ms
                            2000      202ms           261ms          128ms
                            3000      271ms           274ms          192ms
                            5000      438ms           423ms          301ms
                            10000     824ms           996ms          606ms

              Table 2: Execution times for clique generation for synthetic data sets, Java 4

   The transformation has also been applied to the three IMDb models imdb-0005000-49930, imdb-
0010000-98168, imdb-0030000-207420. To apply the transformation to these, invoke it as:
java Controller mcouples in1.txt
and likewise for in2.txt, in3.txt. Table 3 shows the results.

                                        Data set     task2
                                        in1.txt      1864ms
                                        in2.txt      5816ms
                                        in3.txt      More than 120s

                          Table 3: Execution times for IMDb data sets (Java 4)

    To run the use cases for cliques for the IMDb files from the command line, type
K. Lano, S. Yassipour-Tehrani                                                                             5


java Controller mcliques in1.txt
This runs task2, couples2cliques, nextcliques (for parameter 2 to generate the cliques of size 3), extension
task 3 and extension task 4 (for cliques of size 3). Table 4 shows the results for clique generation.

                                           Model     exttask2 (3)
                                           in1.txt   6973ms
                                           in2.txt   11860ms

              Table 4: Execution times for 3-clique generation for IMDb data sets, Java 4

    The implemented transformation may be obtained at:
http://www.dcs.kcl.ac.uk/staff/kcl/movies.zip
It has also been uploaded to the umlrsds TTC14 workspace on SHARE, in the Public/rsync direc-
tory (remoteUbuntu12LTS TTC14 umlrsds new). The execution times in the SHARE environment are
slightly lower than those given above. A version using a pre-filter can also be executed, using FController
instead of Controller in the above commands. The filter can substantially reduce the size of the input
models by discarding people and movies which cannot contribute to the sets of couples or cliques. This
makes the computation of couples for the dataset in3.txt feasible (execution time 4296ms) although the
filter takes 45 seconds to execute. Similarly for clique calculation for in3.txt.


References
 [1] K. Lano, The UML-RSDS manual, www.dcs.kcl.ac.uk/staff/kcl/umlrsds.pdf, 2014.
 [2] K. Lano, S. Kolahdouz-Rahimi, Model transformation design patterns, IEEE Transactions in Software Engi-
     neering, vol. 40, 2014.
 [3] T. Horn, C. Krause, M. Tichy, The TTC 2014 Movie Database Case, TTC 2014.