<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transformation and aggregation preprocessing for top-k recommendation GAP rules induction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marta Vomlelova</string-name>
          <email>marta@ktiml.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal Kopecky</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Vojtas</string-name>
          <email>vojtas@ksi.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Physics Charles University in Prague Malostranske namesti 25</institution>
          ,
          <addr-line>Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
<p>In this paper we describe the KTIML team approach to the RuleML 2015 Rule-based Recommender Systems for the Web of Data Challenge Track. The task is to estimate the top 5 movies for each user separately in a semantically enriched MovieLens 1M dataset. We report three results. The best is a domain-specific method of the form "recommend to all users the same set of movies by Spielberg". Our contributions are domain-independent data mining methods tailored for top-k recommendation that combine second order logic data aggregations and transformations of metadata, especially 5003 open data attributes, with general GAP rule mining methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Rule-based recommender systems can provide a desirable balance between the
quality of the recommendation and the understandability of the explanation for the
human user. This challenge targets a new type of recommender problem that uses
the Web of Data to augment the feature set. The participating systems were asked
to find and recommend a limited set of 5 items for each user. The challenge uses a
semantically enriched version of the MovieLens 1M dataset. Recommender
performance is measured by F-measure at top-5 (F@5) and aggregate diversity (ILD).
Compactness of explanation is measured qualitatively: good explanations
involve a small number of easy-to-understand rules.</p>
      <p>
Our best result, "recommend to all users the same set of movies by Spielberg", is
rather specific to this dataset. Our contribution is top-k focused data mining that
combines data aggregations and transformations of metadata, especially 5003 open data
attributes. The results take the form of simple SQL queries which can be transformed into
second order logic rules. We introduce 2GAP, a second order logic variant of GAP, the
generalized annotated logic programming rules of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
Among "same-data challenges" in the area of recommender systems, probably the most
famous is the Netflix challenge [
        <xref ref-type="bibr" rid="ref8">8</xref>
]. Netflix results were measured by the improvement in
RMSE of rating prediction, while we find F@5 a more challenging metric. We
found relevant mining tricks in [
        <xref ref-type="bibr" rid="ref7">7</xref>
] and from the winners of the 2014 ACM RecSys Challenge [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Going to second order logic was motivated by [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
]. The induction of many-valued rule systems was handled in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). Nevertheless in [
        <xref ref-type="bibr" rid="ref6">6</xref>
] the data sets used for
induction were much smaller. We hope our method is transferable; nevertheless, in
this paper we did not verify it on other data.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data Preparation, Challenge Understanding</title>
      <p>In this section we describe our approach to processing the input data.</p>
      <p>To be able to handle and understand the data, we chose to store and
analyze them in a relational database. The provided Python script produced CSV files using
data joins with very large results, so a standard 32-bit PC, or even a 64-bit desktop PC
with a usual amount of memory, could not process them. Due to the size of the data,
practical calculations were done using SQL queries and adapted ML tools. We converted
the results to simple SQL and constructed the corresponding GAP rules.</p>
      <p>To store the data efficiently, we first had to eliminate data redundancy.
Therefore we preprocessed the obtained CSV files and loaded the data about movies, users
and known ratings into three database tables. Separately we stored the
incidence matrix of movies and the 5003 DBPedia attributes.</p>
      <sec id="sec-2-1">
<title>Task Analysis, Understanding and Steps of Solution</title>
        <p>After a while spent playing with the system and sending results in the required
format (CSV with columns userId, movieId, score), we discovered that
the quality of a solution does not depend on the score.</p>
        <p>Further, we observed that the recall Ru for user u is computed with respect to a
larger candidate set, which nevertheless has an average size of around 10. Hence recall,
and consequently the F-measure, is approximately a fixed multiple of precision, and to
maximize F@5 it suffices to maximize P@5. We did not try to improve the ILD parameter
by any activity; we can only observe that the results indicate quite high dissimilarity
among our top-5 objects.</p>
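        <p>The fixed-multiple relation between P@5 and F@5 can be sketched as follows (a minimal illustration, assuming a candidate set of exactly 10 relevant items per user, which is only the approximate average size reported above):</p>

```python
import math

def f_at_5(hits, n_relevant=10):
    """F-measure at cutoff 5: hits = relevant movies among the 5 recommended."""
    p = hits / 5                   # precision P@5
    r = hits / n_relevant          # recall w.r.t. the candidate set
    return 2 * p * r / (p + r) if hits else 0.0

# With the candidate-set size fixed at 10, F@5 is always (2/3) * P@5,
# so maximizing P@5 also maximizes F@5.
for hits in range(1, 6):
    assert math.isclose(f_at_5(hits), (2 / 3) * (hits / 5))
```

        <p>The constant 2/3 depends on the assumed candidate-set size; only the proportionality matters for the argument.</p>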
      </sec>
      <sec id="sec-2-2">
        <title>Natural Language Extraction of DBPedia Attributes</title>
        <p>The first really big obstacle was the number of movie attributes, especially the
5003 binary DBPedia attributes. First we removed their common prefix
"uri_relation_http://dbpedia.org/resource/Category:" and processed the rest
by natural language extraction using database tools. Our first guess was to take the
most frequent properties and check their influence on our prediction task. We saw that
they do not significantly influence our rule mining results on the task.</p>
        <p>Our second attempt was based on the observation that certain types of properties, such as
Films_directed_by_..., Films_set_in_..., Films_shot_in_..., Films_about_... etc., form
groups that can be seen as predicates of the form property=value. Nevertheless, this did
not bring us much further in our rule mining task.</p>
        <p>The third attempt brought some progress. We counted the occurrences of individual values
of properties over all movies and compared them with the number of occurrences over rated
movies. The bigger the ratio between the movie count and the rating count, the more important
the property-value pair. The result was submitted as user KSI (see Table 2). For further
use, we aggregated all these properties into a single indicator GoodProperty=1 iff at least
one of the properties is present.</p>
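        <p>A minimal sketch of this counting step on hypothetical toy data (property names and counts are illustrative only; we read the ratio as ratings per movie occurrence of the property value):</p>

```python
from collections import Counter

# Hypothetical data: DBPedia property values per movie, and rating counts.
movie_props = {
    "m1": {"Films_directed_by_Steven_Spielberg", "Films_set_in_Africa"},
    "m2": {"Films_set_in_Africa"},
    "m3": {"Films_directed_by_Steven_Spielberg"},
}
ratings = {"m1": 40, "m2": 2, "m3": 25}

movie_cnt, rating_cnt = Counter(), Counter()
for m, props in movie_props.items():
    for p in props:
        movie_cnt[p] += 1                  # occurrences over all movies
        rating_cnt[p] += ratings.get(m, 0)  # occurrences over rated movies

# Rank property values by ratings per movie occurrence.
ranked = sorted(movie_cnt, key=lambda p: rating_cnt[p] / movie_cnt[p], reverse=True)
good_property = set(ranked[:1])  # best ones aggregated into GoodProperty
```
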
        <p>Our other approach was to create a set of explanatory attributes of movies and set
them to 1 or 0 according to the appearance of some word or phrase in the
movie description. So we created the attribute Spielberg and set it to 1 for each movie
directed, produced by or in any other way connected to Steven Spielberg; similarly we
created the attribute LA and assigned it to all movies located in LA, etc. Then we
compared the movie ordering based on each attribute with the ordering based on ratings (the
ground truth). The attributes producing the most corresponding orderings were combined into a
query that ordered movies using the expression:
100*Movies.Spielberg + 50*Movies.Original + Movies.BayesAVG    (1)</p>
        <p>where BayesAVG = (3.5*50 + AVG_Rating*MovieCNT)/(MovieCNT + 50). The weights
shown are proportional to the number of occurrences of the given attribute in movies. On
the other hand, the ordering is quite robust and does not depend heavily on the specific
numbers as long as the coefficients decrease from left to right. The F@5 shown in Table 2 for
user SCS_CUNI is surprisingly good. The result can be interpreted as saying that (in the ideal case)
three quarters of users have at least one Spielberg movie in their top 5. Nevertheless,
we do not consider this our challenge contribution, because it is probably domain
dependent and rather a property of the respective dataset.</p>
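        <p>The smoothed average and the scoring expression (1) can be written out directly (a sketch; the prior mean 3.5 and weight 50 are the constants from the formula above):</p>

```python
def bayes_avg(avg_rating, movie_cnt, prior_mean=3.5, prior_weight=50):
    """Shrink a movie's average rating toward 3.5: rarely rated movies stay
    close to the prior, frequently rated ones keep their own average."""
    return (prior_mean * prior_weight + avg_rating * movie_cnt) / (movie_cnt + prior_weight)

def score(spielberg, original, avg_rating, movie_cnt):
    # Expression (1); only the decreasing order of the coefficients matters.
    return 100 * spielberg + 50 * original + bayes_avg(avg_rating, movie_cnt)
```

        <p>For example, a movie with no ratings scores bayes_avg(0, 0) == 3.5, i.e. exactly the prior.</p>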
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Modeling</title>
      <p>First, we trained different ML models to predict "interesting" movies. All models
gave similarly good results; we aimed for a decision tree as our preferred expression
language because it can be easily converted to rules.</p>
      <p>The goal was to recommend 5 movies, i.e. to select the 5 best (unseen) predictions for
each user. It turned out that a tree with more than one hundred leaves can be pruned to
five rules without a significant decrease in the goal measure.</p>
      <p>Before sending a file for evaluation, simple post-processing was done to transform
the prediction of (any of) our models into 5 recommended movies.</p>
      <sec id="sec-3-1">
        <title>Model Building</title>
        <p>As training data, we used records of the form (movieID, userID, Age, BayesAVG,
GoodProperty, Gender, GenreMatch, TAG), where TAG=1 iff the movie was
rated by the user and TAG=0 otherwise. We trained several machine learning models.
Since there were only 5% positive cases, the equal-cost recommendation was always 0.
Measuring the error as the average of the errors on positive and negative cases, we
obtained errors of 73% for Generalized Linear Models, 72% for decision trees and Naive
Bayes, and 52% for Support Vector Machines. Since the difference between the first three
models is small and we have a strong preference for rule models, we proceeded further
with the Decision Tree model (DT).</p>
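        <p>This class-balanced error averages the error rates of the two classes, so a classifier that always answers 0 scores 50% rather than the misleadingly low 5% of the plain error rate. A small sketch:</p>

```python
def balanced_error(y_true, y_pred):
    """Average of the error rate on positive and on negative cases."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    err_pos = sum(1 for p in pos if p != 1) / len(pos)
    err_neg = sum(1 for p in neg if p != 0) / len(neg)
    return (err_pos + err_neg) / 2

# Always predicting 0 on a 5%-positive sample: balanced error is 0.5,
# even though the plain error rate would be only 0.05.
y_true = [1] * 5 + [0] * 95
assert balanced_error(y_true, [0] * 100) == 0.5
```
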
      </sec>
      <sec id="sec-3-2">
        <title>Pruned Model</title>
        <p>Our DT model predicts users' preferences on movies. Our goal was quite
different: for a given user, predict the top 5 movies. This makes some attribute tests
superfluous (e.g. young users rated more movies, so the chance of TAG=1 was higher,
but this does not influence the ordering of movies for a given user; similarly for UserAVG
and so on). The only user-dependent attribute left was GenreMatch, an
indicator of whether the genre most selected by the user is the same as the genre of the
currently considered movie.</p>
        <p>Furthermore, the ordering of most movies was not important, since we were
interested only in the top 5 out of 3.1 thousand. This makes most of the DT splits
unimportant. We selected only the 5 most important nodes out of the 126 learned (those with
the highest percentage of positive cases and a reasonable support). We obtained F@5 of
0.04978 and precision of 0.0714. We hope this dramatic pruning can be successful in
other tasks aiming for top-k predictions. The 5 rules are listed in Table 1; all rules
have the additional condition GenreMatch=1.</p>
        <p>The overall ordering was given by the combination of GenreMatch (a
necessary condition for a high ranking), RulePreference as a rough ordering, and the movie
average ranking for ordering movies indifferent to the rules. Expressed by a formula:
p.Prediction := p.GenreMatch*(p.BayesAVG + 100*p.RulePreference)    (2)
For each user, we ordered all candidate movies according to the Prediction and
selected the first 5 unseen movies.</p>
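        <p>The per-user selection by formula (2) amounts to the following sketch (the tuple layout is illustrative):</p>

```python
def prediction(genre_match, bayes_avg, rule_pref):
    # Formula (2): GenreMatch gates the score; rule hits dominate BayesAVG.
    return genre_match * (bayes_avg + 100 * rule_pref)

def top5_unseen(candidates):
    """candidates: (movieID, genre_match, bayes_avg, rule_pref, seen) tuples;
    return the IDs of the 5 best-scoring movies the user has not seen."""
    unseen = [c for c in candidates if not c[4]]
    ranked = sorted(unseen, key=lambda c: prediction(c[1], c[2], c[3]), reverse=True)
    return [c[0] for c in ranked[:5]]
```
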
        <p>The query motivated by this approach and its first order counterpart gave a precision of 0.135 (a
constant multiple of F@5) and was uploaded as participant "KTIML", as the main result
of a methodology which can also be used in other similar tasks.</p>
        <p>Although the "Spielberg" method achieved better results, it cannot be used
in other applications, simply because it is domain and dataset dependent and uses
specific properties of the movie data.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2GAP – a Rule System Equivalent to Simple SQL Queries</title>
      <p>In this section we describe the translation of our results into a rule system. We use
the connection between simple SQL queries and logical rules. A query</p>
      <p>SELECT attribute1, …, attributen
FROM table1, …, tablem
WHERE conditions
(keeping in mind that conditions can contain bounds on variables and some
filtering conditions filter) is semantically equivalent to the logical rule
result(attribute1, …, attributen) ← table1, …, tablem, filter
where the bounds on variables are applied.</p>
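      <p>The query-to-rule correspondence can be demonstrated on a toy database (the tables and columns here are hypothetical; the join condition plays the role of shared rule variables):</p>

```python
import sqlite3

# SELECT r.UserID, m.MovieID FROM Movies m, Ratings r
#   WHERE m.MovieID = r.MovieID AND m.Spielberg = 1
# reads as the rule  result(UserID, MovieID) :- Ratings, Movies, Spielberg=1.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Movies(MovieID INTEGER, Spielberg INTEGER);
    CREATE TABLE Ratings(UserID TEXT, MovieID INTEGER);
    INSERT INTO Movies VALUES (1, 1), (2, 0);
    INSERT INTO Ratings VALUES ('u1', 1), ('u1', 2);
""")
rows = con.execute(
    "SELECT r.UserID, m.MovieID FROM Movies m, Ratings r "
    "WHERE m.MovieID = r.MovieID AND m.Spielberg = 1").fetchall()
# Only the (user, movie) pairs satisfying every body atom survive.
```
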
      <p>
        We use (many valued) GAP – Generalized annotated Programs rules of Kifer and
Subrahmanian [
        <xref ref-type="bibr" rid="ref1">1</xref>
]. We interpret truth values as preference degrees. The semantics of these
rules is equivalent to the database semantics.
      </p>
      <p>
A 2GAP rule is a GAP rule in a language extended by atomic predicates
corresponding to tables resulting from database aggregations. These predicates are
originally defined by a second order logic condition equivalent to the database aggregation;
we assume the facts of each such new atomic predicate are filled in from the database.
This can be considered an alternative approach to that of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (similarly as in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]).
      </p>
      <p>Assume we generated a table Ordered_Prediction using the expression from equation
(1). Then the SQL query</p>
      <p>SELECT UserID, MovieID, 5 FROM Ordered_Prediction WHERE OrdNr &lt;= 5;
corresponds to the GAP rule
SCS_CUNI_Movie(u,m) : 100*x1 + 50*x2 + x3 ← SPIELBERG(m):x1 &amp; ORIGINAL(m):x2 &amp; BAYESAVG(m):x3
with top-5 semantics. The meaning of this rule is that whenever (as a two-valued
conjunction)
SPIELBERG(m) ≥ x1 &amp; ORIGINAL(m) ≥ x2 &amp; BAYESAVG(m) ≥ x3, then
SCS_CUNI_Movie(u,m) ≥ 100*x1 + 50*x2 + x3.</p>
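      <p>A sketch of how this annotated rule computes the head value from the body annotations, together with the top-5 semantics (the input tuples are illustrative):</p>

```python
def scs_cuni_lower_bound(x1, x2, x3):
    """Head annotation of the 2GAP rule: lower bounds x1, x2, x3 for
    SPIELBERG(m), ORIGINAL(m) and BAYESAVG(m) give a lower bound
    100*x1 + 50*x2 + x3 on SCS_CUNI_Movie(u, m)."""
    return 100 * x1 + 50 * x2 + x3

def top5(movies):
    """movies: (movieID, x1, x2, x3) tuples; keep the 5 highest head values."""
    ranked = sorted(movies, key=lambda m: scs_cuni_lower_bound(*m[1:]), reverse=True)
    return [m[0] for m in ranked[:5]]
```
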
      <p>The weighted average movie ranking from Table 2 (user KSI) can be depicted by the
following GAP rule (after the transformation of DBPedia attributes this is first order
logic):</p>
      <p>KSI_Movie(u,m) : (8000*x1 + 17000*x2 + 10000*x3 + …)*x15 ← MAKEUP(m):x1 &amp; VISUAL(m):x2 &amp;
SMIX(m):x3 &amp; … &amp; NOTRATED(u,m):x15</p>
      <p>The movie recommended by data mining can be described (with the necessary tuning of
weights) by the following rule:</p>
      <p>KTIML_Movie(u,m) : (w1*x1 + w2*x2 + w3*x3)*x4 ← GENREMATCH(u,m):x1 &amp; BAYESAVG(m):x2 &amp;
RULEPREFERENCE(u,m):x3 &amp; NOTRATED(u,m):x4
Here the predicates GENREMATCH, BAYESAVG and RULEPREFERENCE from
Table 1 are atomic 2GAP predicates representable by simple SQL queries.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We found the dataset and the task formulation very specific. Unfortunately, we have better
results for domain-specific methods, such as recommending to all users the same set of
movies by Spielberg. Our challenge result combines data aggregations, transformations of
metadata (especially the 5003 open data cloud attributes), selection of relevant attributes
and a general data mining method. We believe that our method of dramatic pruning can
be re-used for other tasks aiming for top-k predictions. All recommendation
processing was done in a database engine; the resulting simple SQL queries were transformed
into second order logic GAP rules.</p>
      <p>Acknowledgement. Supported by the Czech grants P46 and GACR-P103-15-19877S.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M</given-names>
            <surname>Kifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>VS</given-names>
            <surname>Subrahmanian</surname>
          </string-name>
          .
          <article-title>Theory of generalized annotated logic programming and its applications</article-title>
          .
          <source>Journal of Logic Programming</source>
          <volume>12</volume>
          ,
          <issue>4</issue>
          (
          <year>1992</year>
          )
          <fpage>335</fpage>
          -
          <lpage>367</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Warren</surname>
          </string-name>
          .
          <article-title>HILOG: a foundation for higher-order logic programming</article-title>
          .
          <source>Journal of Logic Programming</source>
          <volume>15</volume>
          ,
          <issue>3</issue>
          (
          <year>1993</year>
          )
          <fpage>187</fpage>
          -
          <lpage>230</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Krajci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lencses</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vojtas</surname>
          </string-name>
          .
          <article-title>A comparison of fuzzy and annotated logic programming</article-title>
          .
          <source>Fuzzy Sets and Systems</source>
          <volume>144</volume>
          ,
          <issue>1</issue>
          (
          <year>2004</year>
          )
          <fpage>173</fpage>
          -
          <lpage>192</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohapatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Genesereth</surname>
          </string-name>
          ,
          <article-title>Aggregates in Datalog under set semantics</article-title>
          ,
          <source>Tech. Rep</source>
          . 2012
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Datomic Queries and Rules. http://www.datomic.com/</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Tomás</given-names>
            <surname>Horváth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Vojtas</surname>
          </string-name>
          .
          <article-title>Induction of Fuzzy and Annotated Logic Programs</article-title>
          .
          <source>In ILP 2006, LNCS 4455</source>
          ,
          <year>2007</year>
          , pp
          <fpage>260</fpage>
          -
          <lpage>274</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Xavier Amatriain, Josep M. Pujol, Nuria Oliver. I Like It... I Like It Not: Evaluating User Ratings Noise in Recommender Systems. In UMAP '09, Springer, 2009, pp. 247-258</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Netflix Prize. http://www.netflixprize.com/</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Frederic Guillou, Romaric Gaudel, Jeremie Mary, Philippe Preux. User Engagement as Evaluation: a Ranking or a Regression Problem? ACM, 2014, pp. 7-12</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>