<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Feature Engineering and Explainability with Vadalog: A Recommender Systems Application</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jack Clearman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruslan R. Fayzrakhmanov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georg Gottlob</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yavor Nenov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephane Reissfelder</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuel Sallinger</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evgeny Sherkhonov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Meltwater Group</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Oxford</institution>
        </aff>
      </contrib-group>
      <fpage>39</fpage>
      <lpage>43</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Vadalog [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is an extension of Datalog that features existential rules and a rich
set of functions, libraries, and methods for connecting to external data sources,
which make it a powerful tool for building advanced industrial AI applications
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Vadalog forms the core of an ongoing research collaboration between the
University of Oxford and the media intelligence company Meltwater, which aims
to build a recommender system for the most relevant insights about companies from
outside data, including Meltwater's repository of millions of news articles. In
this application paper, we demonstrate various aspects of such a recommender
system in the movies domain and show how Vadalog can be used for feature
engineering and the computation of explainable recommendations.
      </p>
      <p>Recommender systems assist users in choosing the most relevant items they
may be interested in, thus reducing the information load they experience. The
typical methods used in recommender systems are based on the analysis of items
the user has already selected and are usually limited to “low-level” features, i.e.,
metadata associated with an item. However, such methods cannot provide suitable
recommendations in the absence of discriminative low-level features, or when the
discrepancy between liked and disliked items is captured only by non-trivial
combinations of features. In this paper, we approach this problem by
building a new set of high-level features that can capture domain knowledge and
non-trivial factors that influence a user's decision in choosing movies. Vadalog is
well suited for computing such high-level features because it supports: (1)
aggregation, for computing features such as the total revenue of movies; (2) graph
traversal, for computing properties of the co-starring graph; (3) integration of
various data sources, for unified access to sources such as IMDB and
RottenTomatoes; and (4) existential rules, used in the computation of
recommendations for new users. Furthermore, the declarative nature of Vadalog allows
high-level features to be developed rapidly (usually, a few hours per feature from
conception to deployment) and the resulting programs to be maintained easily.
Finally, we demonstrate how to build an explainable ranking of recommendations.
This allows Vadalog to provide explanations not only at the reasoning level, i.e.,
why a particular high-level feature has a certain value, but also at the machine
learning level, i.e., why there is a particular ranking of items.
</p>
      <p><preformat>feature(User, "AwardWinningCast", Movie, Score) :-
    user(User),
    hasAwardWinningActor(Movie, Person, Award),
    awardScore(Award, AwardScore),
    Score = max(AwardScore).

hasAwardWinningActor(Movie, Person, Award) :-
    crew(Movie, Person, "Cast"),
    hasWonPrestigiousAward(Person, Award).

hasWonPrestigiousAward(Person, "Oscars") :-
    oscarsAward(Nomination, Person, Movie, Year).

hasWonPrestigiousAward(Person, "BAFTA") :-
    baftaAward(Nomination, Person, Movie, Year).

@input("oscarsAward").
@bind("oscarsAward", "postgres", "awards", "oscars").
@input("baftaAward").
@bind("baftaAward", "postgres", "awards", "bafta").</preformat></p>
      <p>(a) Award Winning Cast</p>
      <p><preformat>feature(User, "HighlyRatedDirector", Movie, Score) :-
    user(User),
    crew(Movie, Person, "Director"),
    directorWithHighRating(Person, Score).

directorWithHighRating(Person, AvgRating) :-
    crew(Movie, Person, "Director"),
    imdbRating(Movie, Rating),
    AvgRating = avg(Rating),
    AvgRating &gt; 8.5.</preformat></p>
      <p>(b) Highly Rated Director</p>
      <p><preformat>feature(User, "Co-Production", Movie, Score) :-
    coProduced(User, LikedMovie, Movie, Hop),
    scoreTable(Hop, Score).

coProduced(User, Movie, Movie, 0) :-
    user(User),
    likedMovie(User, Movie).

coProduced(User, Movie, Movie2, NewHop) :-
    coProduced(User, Movie, Movie1, Hop),
    produced(Producer, Movie1),
    produced(Producer, Movie2),
    Movie1 != Movie2,
    Hop &lt; 4,
    NewHop = Hop + 1.</preformat></p>
      <p>(c) Co-Production</p>
      <p>Fig. 1: High-Level Features</p>
      <p>
        Due to space limitations, we do not provide preliminaries on the syntax and
semantics of Vadalog and refer the reader to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for details. Note that in the
programs constructed below, negation and aggregate functions are restricted to
be stratified.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 High-Level Features</title>
      <p>We consider data closely resembling IMDB, the largest movie industry database.
The relation crew(Movie, Person, Role) represents crew members with their role
in the production, imdbRating(Movie, Rating) represents the movies' ratings,
and produced(Producer, Movie) represents the production studio. The relations
oscars(Nomination, Person, Movie, Year) and bafta(Nomination, Person, Movie, Year)
are external data sources providing the Oscars and BAFTA award information,
respectively. We assume that the common attributes Person and Movie in the latter
two relations have been appropriately linked via a custom Vadalog program.
Additionally, we use the relations user(User), likedMovie(User, Movie),
ratedMovie(User, Movie, Rating), and friend(User1, User2) to provide users, their
liked movies, the ratings users have given to movies, and the pairs of friends.</p>
      <p>We next demonstrate how Vadalog can be used to build a number of
high-level features in the movies domain. Here we show only those that
demonstrate (combinations of) three of the main use cases mentioned above:
integration of various data sources, aggregates, and graph traversal. All feature
values are stored in a predicate with the signature feature(User,
FeatureName, Movie, Score), where User is the user ID and Score is the value of the
feature FeatureName for the given movie.</p>
      <p>Award Winning Cast. Often one of the factors in choosing a movie is whether
the movie features a star cast. A cast member is usually considered a “star” if
they have won a prestigious award such as an Oscar or a BAFTA award. Such a
high-level feature is described in Figure 1a, where awardScore stores predefined
scores for each type of award. The first rule generates a score for a given user and
a movie if the movie has a cast member who has won an award. This is encoded
in the predicate hasAwardWinningActor, defined in the second rule. The use of
max is justified by the fact that the input movie can have multiple actors with
different awards, and we assign the best resulting score. The rules for
hasWonPrestigiousAward integrate two different data sources: the Oscars and BAFTA
datasets. In particular, the last four lines declare how external data sources (in
this case PostgreSQL tables) are bound to predicates.</p>
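      <p>As an informal illustration of what the rules of Figure 1a compute, the following Python sketch unions two award sources and keeps, per movie, the best score among its award-winning cast members. It is not Vadalog and not the authors' implementation: the function name, the in-memory sets standing in for the PostgreSQL tables, and the score values in AWARD_SCORE are assumptions made for the example. The user argument is omitted because the score does not depend on the user.</p>
      <preformat>
```python
# Illustrative Python analogue of Figure 1a; all values below are made up.
AWARD_SCORE = {"Oscars": 1.0, "BAFTA": 0.8}  # assumed contents of awardScore


def award_winning_cast(cast, oscars, bafta, award_score=AWARD_SCORE):
    """Return {movie: best award score among its cast members}.

    cast   -- iterable of (movie, person) pairs, i.e. crew(Movie, Person, "Cast")
    oscars -- set of persons found in the Oscars source
    bafta  -- set of persons found in the BAFTA source
    """
    # Union of both sources, as in the two hasWonPrestigiousAward rules.
    won = {(p, "Oscars") for p in oscars} | {(p, "BAFTA") for p in bafta}
    best = {}
    for movie, person in cast:
        for winner, award in won:
            if winner == person:
                score = award_score[award]
                # Score = max(AwardScore), grouped by movie.
                best[movie] = max(best.get(movie, score), score)
    return best
```
      </preformat>
      <p>Movies without any award-winning cast member do not appear in the result, which matches the rule-based reading: no feature fact is derived for them.</p>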
      <p>Highly Rated Director. Another possible factor for choosing a movie is
whether the movie's director has a good track record. This can be modelled by
choosing directors whose average movie rating exceeds a certain threshold. This
is formalised in Figure 1b, where the predicate directorWithHighRating stores all
such directors together with their average ratings. This feature computation
demonstrates the need for aggregate queries, such as average rating. Other similar
features can be built using aggregates, e.g., the total revenue of all movies a
producer has produced, or the average sentiment about a movie on social media.</p>
      <p>Co-Production. This feature captures the following intuition: a producer
related to a movie liked by a user is likely to produce other movies that the user
will like as well. Assume that we have a list of movies that the user has already
liked, which we refer to as a seed list. Based on this list we can build relative
features, i.e., features whose values are relative to the seed list. Our feature
builds on the co-production relation: two movies are in the relation if they were
produced by the same producer or company. The co-production relation can then be
transitively closed, and the feature value reflects how “far” a movie is from the
seed list in the resulting relation. This is formalised in Figure 1c, where the last
two rules give a (limited-depth) recursive definition of the predicate coProduced.
Similarly, other relative features such as Co-direction and Co-starring can be
computed by traversing the corresponding relations.</p>
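      <p>The two computations above can be sketched in Python as follows. This is only a rough analogue under stated assumptions: the function names and the dictionary and pair-list inputs are hypothetical, and the code is not how Vadalog evaluates the rules. The first function mirrors the average-with-threshold aggregate of Figure 1b; the second derives, like the recursive rules of Figure 1c, one fact per reachable hop count up to the bound, rather than only the shortest distance.</p>
      <preformat>
```python
from collections import defaultdict


def directors_with_high_rating(directed, rating, threshold=8.5):
    """{director: average rating} for directors whose average exceeds threshold."""
    per_director = defaultdict(list)
    for movie, person in directed:
        if movie in rating:
            per_director[person].append(rating[movie])
    averages = {p: sum(r) / len(r) for p, r in per_director.items()}
    return {p: avg for p, avg in averages.items() if avg > threshold}


def co_produced(liked, produced, max_hop=4):
    """All (seed, movie, hop) facts derivable with hop at most max_hop."""
    by_producer = defaultdict(set)
    by_movie = defaultdict(set)
    for producer, movie in produced:
        by_producer[producer].add(movie)
        by_movie[movie].add(producer)
    facts = {(m, m, 0) for m in liked}  # base rule: hop 0 for seed movies
    frontier = set(facts)
    while frontier:
        new = set()
        for seed, movie1, hop in frontier:
            if hop >= max_hop:  # mirrors the guard on Hop in Figure 1c
                continue
            for producer in by_movie[movie1]:
                for movie2 in by_producer[producer]:
                    if movie2 != movie1:
                        new.add((seed, movie2, hop + 1))
        frontier = new - facts  # only genuinely new facts are expanded again
        facts |= frontier
    return facts
```
      </preformat>
      <p>Because every derived fact carries a hop count bounded by max_hop, the loop terminates even when the co-production graph contains cycles.</p>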
      <p>Cold start. The relative features above assume that a given user has a set of
movies they liked. This information, however, is not available for new users who
have not liked any movies yet. This problem is known as “cold start” in
recommender systems. One way to overcome it is to create “placeholder” movies
whose attributes, such as Producer or Actor, are populated with the most
popular producers and actors from movies liked by the user's friends. Vadalog is
particularly suitable for modelling such a situation, as it supports existential
quantifiers. The rules are shown in Figure 2a, where the first two compute the most
featured actors and the top-rated directors from the movies that the user's friends
have liked. The next rule creates the placeholder movies for a user in case
they have not liked any movie. In particular, the variable Movie is existentially
quantified. The last rule defines an extended relation likedMovieExt that stores a
placeholder movie for each user who has not liked any movie yet, as well as the
known liked movies for each user.
</p>
      <p><preformat>topOccurringCastMembers(User, Person) :-
    friend(User, User1),
    likedMovie(User1, Movie),
    crew(Movie, Person, "Cast"),
    Occurrence = count(Movie),
    occThreshold(User, Threshold),
    Occurrence &gt; Threshold.</preformat></p>
      <p><preformat>trained(User, Trained) :-
    trainingData(User, Features, Label),
    Trained = ml:train(User, Features, Label).

predictedRankingScore(User, Movie, RankingScore) :-
    trained(User, Trained),
    movieFeatures(User, Movie, Features),
    RankingScore = ml:predict(User, Features).</preformat></p>
      <p>(b) Training and Predicting</p>
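      <p>The cold-start strategy described in this section can be approximated outside Vadalog as in the Python sketch below. It is a hedged illustration only: the function names, the dictionary-based inputs, and the "_:movie" naming of placeholders are assumptions, and a fresh identifier merely stands in for the value that an existential rule would invent. Only the cast attribute is populated here; directors or producers could be handled analogously.</p>
      <preformat>
```python
from collections import Counter
from itertools import count

_fresh = count(1)  # source of fresh identifiers (stand-in for invented values)


def top_occurring_cast(user, friends, liked, cast, threshold):
    """Cast members appearing in more than `threshold` movies liked by friends."""
    occurrences = Counter()
    for friend in friends.get(user, ()):
        for movie in liked.get(friend, ()):
            occurrences.update(cast.get(movie, ()))
    return {p for p, n in occurrences.items() if n > threshold}


def liked_movies_ext(user, friends, liked, cast, threshold=1):
    """Known liked movies, or one placeholder movie for a cold-start user.

    Returns (movies, cast), where cast is extended with the placeholder's
    cast members when a placeholder had to be created.
    """
    if liked.get(user):
        return set(liked[user]), cast
    placeholder = f"_:movie{next(_fresh)}"  # value for the existential variable
    extended_cast = dict(cast)
    extended_cast[placeholder] = top_occurring_cast(
        user, friends, liked, cast, threshold
    )
    return {placeholder}, extended_cast
```
      </preformat>
      <p>With the placeholder in place, relative features such as Co-starring can be computed for a new user exactly as for users with a non-empty seed list.</p>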
    </sec>
    <sec id="sec-3">
      <title>3 Explainable Ranking</title>
      <p>In this section, we show how Vadalog can be used to perform explainable ranking
of movies based on their precomputed features. For each user we perform the
following steps: train a machine learning ranking model; use the trained model
to rank all movies; and, finally, perform a feature value analysis to compute a
tailored explanation for the ranking of each movie. We next provide more detail
on each of these steps.</p>
      <p>Setting. Let the predicate movieFeatures(User, Movie, Features) contain all
feature values computed for each user and movie. Concretely, each movie has one
entry in movieFeatures per user, where Features is the list of feature values for
that movie. Furthermore, assume that we have computed a relation
trainingData(User, Features, Label), which contains the training data for each
user. The relation trainingData contains two types of records. For each user and
each movie that they liked, the relation associates the feature vector of the movie
with label 1. Furthermore, for each user and each movie from a predefined set of
sample movies, the relation associates the feature vector of the movie with label 0.
Thus, labels 1 and 0 represent positive and negative examples, respectively.</p>
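      <p>The construction of the training data can be illustrated by the short Python sketch below. The function name and the dictionary keyed by (user, movie) pairs are assumptions made for the example. As in the description above, the sketch does not remove liked movies from the sample set, so a movie occurring in both would contribute one positive and one negative record.</p>
      <preformat>
```python
def training_data(user, liked, sample_movies, movie_features):
    """Return [(features, label)] for one user.

    liked          -- {user: set of liked movies}
    sample_movies  -- predefined set of movies used as negative examples
    movie_features -- {(user, movie): feature vector}
    """
    # Label 1: feature vectors of the movies the user liked.
    rows = [(movie_features[(user, m)], 1) for m in sorted(liked[user])]
    # Label 0: feature vectors of the predefined sample movies.
    rows += [(movie_features[(user, m)], 0) for m in sorted(sample_movies)]
    return rows
```
      </preformat>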
      <p>Training and Prediction. We use a machine learning regression model to
rank movies based on their features. The regression model takes
as input the feature values of a movie and produces a ranking score, which we
then use to rank all movies. The Vadalog system exposes several open-source
machine learning libraries, such as Weka (https://www.cs.waikato.ac.nz/ml/weka/),
that can be used seamlessly during reasoning. We use a separate instance of the
regression model for each user. To train an ML model, we invoke a dedicated
aggregate function ml:train, as shown in the first rule of Figure 2b. The first
argument of ml:train is the identifier of the trained model (in our case the user),
the second argument is the input vector, and the third argument is the value to be
learned. Next, we can use the trained model to compute a user-specific ranking of
all movies. To this end, we use the library function ml:predict, which takes the
identifier of the trained model and the input feature vector and produces the model
prediction, as shown in the second rule of Figure 2b.</p>
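      <p>To illustrate the train-then-predict pattern with one model per user, the Python sketch below keeps trained models in a dictionary keyed by the model identifier. It is a stand-in, not the Vadalog or Weka implementation: the functions train, predict, and rank are hypothetical, and the regression model is a small linear model fitted by stochastic gradient descent chosen only to keep the example self-contained.</p>
      <preformat>
```python
MODELS = {}  # trained models keyed by identifier (here: the user)


def train(model_id, rows, epochs=2000, lr=0.1):
    """Fit a linear regression by gradient descent; rows are (features, label)."""
    n = len(rows[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in rows:
            err = b + sum(wi * xi for wi, xi in zip(w, x)) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    MODELS[model_id] = (w, b)
    return model_id


def predict(model_id, features):
    """Ranking score of a feature vector under the model stored for model_id."""
    w, b = MODELS[model_id]
    return b + sum(wi * xi for wi, xi in zip(w, features))


def rank(model_id, movie_features):
    """Movies sorted by descending predicted ranking score."""
    return sorted(
        movie_features,
        key=lambda m: predict(model_id, movie_features[m]),
        reverse=True,
    )
```
      </preformat>
      <p>Keying the model store by user mirrors the role of the first argument of ml:train and ml:predict: prediction for a user always goes through the model trained on that user's data.</p>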
      <p>
        Ranking Explanation. We next show how one can use Vadalog to produce
explanations about the rank of each individual movie. We adapt an approach
described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The goal is to identify, for each movie, the feature that has
the highest contribution to its position in the ranked list, and consequently
to report the explanation of its computation to the user. To identify the most
prominent feature for the rank of a given movie, we first identify a range of
interesting values (e.g., minimum, maximum, and average) for each feature. For
each such value, we compute a modified ranking score for the given movie by
replacing the movie's original value of that feature with the selected one. The
feature that gives the lowest modified ranking score is then used as an explanation
for the movie's rank. Assume we have precomputed with Vadalog the interesting
feature values in the relation sampleValues(Feature, FeatureIndex, Value), where
Feature is the feature name, FeatureIndex is the index of the feature in the movie's
feature vector, and Value is a value to be used for computing the modified scores.
The minimal modified scores for a movie are then computed using the trained model,
as shown in the first rule in Figure 2c, which makes use of the collections function
col:setAt(Vector, Index, Value), which returns the result of replacing the value at
position Index in Vector with Value. Finally, in the last two rules, we identify for
each movie the features yielding the lowest modified scores and collect them in
a list using the list aggregate.
      </p>
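      <p>The feature-replacement analysis described above can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the rules of the paper: set_at plays the role of col:setAt, predict is any function from a feature vector to a ranking score, and the names explain and min_modified are hypothetical.</p>
      <preformat>
```python
def set_at(vector, index, value):
    """Copy of vector with position index replaced by value."""
    return vector[:index] + [value] + vector[index + 1:]


def explain(predict, features, sample_values):
    """Return (features with the lowest modified score, that score).

    predict       -- function from a feature vector to a ranking score
    features      -- feature vector of the movie being explained
    sample_values -- iterable of (feature_name, feature_index, value)
    """
    min_modified = {}
    for name, index, value in sample_values:
        score = predict(set_at(features, index, value))
        # Keep the minimal modified score per feature.
        min_modified[name] = min(min_modified.get(name, score), score)
    lowest = min(min_modified.values())
    # Several features may tie, so they are collected in a list.
    return sorted(n for n, s in min_modified.items() if s == lowest), lowest
```
      </preformat>
      <p>Intuitively, the feature whose replacement can push the score down the most is the one the current rank depends on most strongly, which is why it is reported as the explanation.</p>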
    </sec>
    <sec id="sec-4">
      <title>4 Conclusion</title>
      <p>In this application paper we reported on how a recommender system can be
rapidly developed using Vadalog. Since the resulting explainable recommender
system is agnostic to the underlying machine learning model, we intend, as part
of ongoing work, to evaluate our approach using different models. We believe our
approach is useful in scenarios where explaining recommendations is crucial. The
transparency of such a system also enables users to give feedback; incorporating
this feedback into our model is future work.</p>
      <p>Acknowledgements. This work is supported by the EPSRC programme grant
EP/M025268/1 VADA, the WWTF grant VRG18-013, and the EU Horizon 2020
grant 809965.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>L.</given-names>
            <surname>Bellomarini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Fayzrakhmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kravchenko</surname>
          </string-name>
          , E. Laurenza,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nenov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reissfelder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sallinger</surname>
          </string-name>
          , E. Sherkhonov, and
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Data science with Vadalog: Bridging machine learning and reasoning</article-title>
          .
          <source>In Proc. MEDI</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.</given-names>
            <surname>Bellomarini</surname>
          </string-name>
          , E. Sallinger, and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gottlob</surname>
          </string-name>
          .
          <article-title>The Vadalog system: Datalog-based reasoning for knowledge graphs</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>11</volume>
          (
          <issue>9</issue>
          ):
          <fpage>975</fpage>
          –
          <lpage>987</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>ter Hoeve</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Odijk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          .
          <article-title>Faithfully explaining rankings in a news recommender system</article-title>
          .
          <source>CoRR</source>
          , abs/1805.05447,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>