<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Two Recommendation Algorithms Based on Deformed Linear Combinations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander D'yakonov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>First Task “Cold Start”</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Moscow State University</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>21</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Data mining for recommender systems has gained a lot of interest in the recent years. ”ECML/PKDD Discovery Challenge 2011” was organized to improve current recommender system of the VideoLectures.Net website. Two main tasks of the challenge simulate new-user and new-item recommendation (cold-start mode) and clickstream based recommendation (normal mode). This paper provides detailed descriptions of two simple algorithms which were very successful in the both tasks. The main idea of the algorithms is construction of a linear combination equal to a vector of estimations of lectures popularity after viewing a certain lecture (or lectures). Each addend in the combination describes similarity of lectures using the part of the data. The algorithms are improved by transforming the combination to non-linear function. Lectures with the highest estimations of popularity are recommended to users.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Descriptions of the lectures from VideoLectures.net website are available. Every lecture
has the lecture id, the language of the lecture (“English”, “Slovene”, “French”, etc.),
3</p>
      <p>Algorithm for Solving the First Task
Let some information on a lecture can be written as the n-dimensional vector f =
(f1, . . . , fn). For example, if n is the number of the authors of all lectures than the
binary vector f describes the authors of concrete lecture: fi = 1 iff the i-th author is
the author of the lecture. It is similarly possible to describe the language of the lecture,
its categories, etc. Naturally, the vectors describing different types of information are of
different dimensionality. In each case it is possible to estimate similarity of the lectures.
For example, for the i-th lecture presented by the vector f (i) = (f1(i), . . . , fn(i)) and the
j-th lecture presented by the vector f (j) = (f1(j), . . . , fn(j)) their similarity is estimated
as changed cosine similarity [3]
hf (i), f (j)i =</p>
      <p>f1(i)f1(j) + · · · + fn(i)fn(j)
pf1(i)2 + · · · + fn(i)2 + εpf1(j)2 + · · · + fn(j)2 + ε
The change “+ε” is for preventing division by zero (for example, if the authorship of a
lecture is unknown). In the final version of the algorithm ε = 0.01.</p>
      <p>
        Idea of the algorithm is very simple: for the test lecture to calculate similarity to each
new lecture by summing (with some coefficients) values (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) for all presented “types of
information” (language, categories, authors, etc.). First, we will mark the main
modification of the algorithm which essentially improves performance. Together with similarity
to the lecture from the test set it is necessary to consider similarity to co-viewed older
lectures (similar from the point of view of users’ behavior).
      </p>
      <p>Let the set of older lectures be indexed by numbers from I, let f (i) be the vector of
the description of the i-th lecture, let m′ij be the estimation of similarity of the i-th and
the j-th lectures (from the point of view of users’ behavior, see below). Then let
f ′(i) = X
j∈I</p>
      <p>f (j)
′
mij pf1(j)2 + . . . + fn(j)2 + ε</p>
      <p>!
and similarity to the new t-th lecture is calculated by summing of hf ′(i), f (t)i for all types
of information. Let us describe how values m′ij are calculated. Let L be the number of
lectures, mij be the number of the users that viewed both the i-th and the j-th lectures,
i ∈ {1, 2, . . . , L}, j ∈ {1, 2, . . . , L}, i 6= j, and mii be the number of views of the i-th
lecture divided by 2 (such “strange” definition of the diagonal elements is a result of
optimization of algorithm performance). Then
m′ij =
mij
L
P mit
t=1</p>
      <p>.
fi = √whi i+ ε ,
The sense of this value is clear enough. If the numbers mii were equal to zero, i ∈
{1, 2, . . . , L}, than it would be an estimation of probability that user viewed the j-th
lecture under the condition that he viewed the i-th (the performance of the algorithm was
26.02%, see below). Nonzero diagonal elements are necessary to consider also similarity
to the i-th lecture, not only to co-viewed lectures (the performance was 29.06%, without
division by 2 the performance was 28.17%).</p>
      <p>Let us enumerate types of information which were used to calculate similarity. For
each type we will specify the vector</p>
      <p>γindex = (hf ′(i), f (j1)i, . . . , hf ′(i), f (jr)i) ,
where J = {j1, . . . , jr} is the set of new lecture indexes.</p>
      <p>1. Similarity of categories. Here f (j) is the characteristic vector of lecture
categories, i.e. a binary vector, in which the t-th coordinate is equal to 1 iff the j-th lecture
belongs to the t-th category. As a result we receive the vector γcat.</p>
      <p>2. Similarity of authors. Here f (j) is the characteristic vector of lecture authors,
i.e. a binary vector, in which the t-th coordinate is equal to 1 iff the t-th author is the
author of the j-th lecture. As a result we receive the vector γauth.</p>
      <p>3. Similarity of languages. Here f (j) is the characteristic vector of lecture
language, in which the first element corresponding to English is set to 1 (to make all lectures
similar on lectures in English, because lectures in English are popular among Internet
users). As a result we receive the vector γlang.</p>
      <p>
        4. Similarity of names. At first all words which are included into names and
descriptions of the lectures are parsed and reduced to word stems (we used Porter Stemmer
[4], [5]). Note, all special symbols (brackets, commas, signs of arithmetic operations, etc.)
were deleted, but stop words were reserved (it does not essentially influence performance
of the algorithm). The name of every lecture is described by the vector (h1, . . . , hW ), in
which hi is the number of words with the i-th word stem. Then we apply TF-IDF-like
weighting scheme [3]:
where wi is the total number of words with the i-th word stem in names and descriptions
of all lectures. Such vectors (f1, . . . , fW ) are used for calculation of the vector γdic. Note
that for calculation of similarity of names we use information on names and descriptions
(for weighting scheme). Standard TF-IDF proved to perform a bit worse (∼ 1–4%).
(
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
5. Similarity of names, descriptions, names and descriptions of events. Each
lecture has the name, the description (may be empty), the name of appropriate event
and the event description (if information on event is not present we consider that the
even name and the event description coincide with the lecture name and the lecture
description). All it is united in one text, that is described by the vector (h1, . . . , hW ),
and further operations are the same as in the previous item. As a result we receive the
vector γdic2.
      </p>
      <p>
        For task solving the algorithm construct the vector
γ = 0.19 · p0.6 · γcat + 5.6 · γauth + q4.5 · γlang + 5.8 · γdic + 3.1 · γdic2 .
(
        <xref ref-type="bibr" rid="ref3">3</xref>
        )
Here the square root is elementwise operation. At first the linear combination of the
vectors γcat, γauth, γlang, γdic, γdic2 was used. The coefficients in the linear combination
were a result of optimization problem solving. But the usage of square roots gives small
improvement of performance (coefficients also tuned solving optimization problem). For
optimization problem the method of coordinate descent [6] was used. The algorithm
recommends the lectures with the highest values of elements in the vector γ = (γ1, . . . , γN ).
Such problem solving technology (selection of various types of information, construction
of an appropriate linear combination, its further tuning and “deformation”) is being
developed by the author and is named “LENKOR” (the full description of the technology
will be published in the near future).
      </p>
      <p>In the final submitted solution one more change was made: the vector γ = (γ1, . . . , γN )
was transformed to the vector
γ1 · 1 + δ
tmax − t1
tmax − tmin
, . . . , γN · 1 + δ
tmax − tN
tmax − tmin
where tj is the time (in days) when the j-th new lecture was published, tmin is the
minimum among all these times, tmax is the maximum. The transformation increased
performance approximately on 5%. The reason of the transformation is that it is
important how long is lecture available online (not only popularity of the lecture). In the
final version of the algorithm δ = 0.07, because this value maximizes performance of
the algorithm in uploads to the challenge website [1] (37.24% for δ = 0.09, 37.28% for
δ = 0.07, 36.24% for δ = 0.05).</p>
      <p>The described algorithm has won the first place among 62 participants with the result
of 35.857%. We do not describe the evaluation metric, the interested reader can find the
definition of the metric on the challenge website [1]. For local testing we used the same
metric. After data loading and processing (it takes 1 hour, but can be launched once for
recommender system construnction) the running time for the first task solving was 17.3
seconds on a computer HP p6050ru Intel Core 2 Quad CPU Q8200 2.33GHz, RAM 3Gb,
OS Windows Vista in MATLAB 7.10.0. 5704 recommendations (30 lectures in each) were
calculated. The dictionary consisted of 35664 word stems.
4</p>
    </sec>
    <sec id="sec-2">
      <title>Second Task “Pooled Sequences”</title>
      <p>In the second task the training set T consists of triples {a, b, c} of lecture numbers. For
every triple the number n({a, b, c}) of users who viewed all three lectures is known.
Besides, the pooled sequence [1] is specified. It is the ranked list of lectures which were
viewed by users after all lectures of the triple {a, b, c}. The numbers of views are also
known (the pooled sequence is ordered according to these values). Definition and
examples of pooled sequence construction can be found on the official site of the competition
[1]. We formalize this concept by means of the vector v({a, b, c}) ∈ ZZL, where L is the
number of lectures,</p>
      <p>v({a, b, c}) = (v1({a, b, c}), . . . , vL({a, b, c})) ,
vj ({a, b, c}) is the total number of views of the j-th lecture after triple {a, b, c} (informally
speaking, it is popularity of the j-th lecture after lectures from {a, b, c}). The test set
also consists of triples (the test set is not intersected with the training set). The task is
to define pooled sequences for triples from the test set, to be exact 10 first members of
the sequences (10 highest elements of each vector v({a, b, c}). So, these 10 lectures should
be recommended to user after viewing the three lectures.
5</p>
      <p>Algorithm for Solving the Second Task
At first, two normalizations of the vectors corresponding to triples from the training set
T are performed, the first is
v′({a, b, c}) =</p>
      <p>v1({a, b, c})
log(|{t˜ ∈ T | v1(t˜) &gt; 0}| + 2) · · ·</p>
      <p>
        vL({a, b, c})
· · · log(|{t˜ ∈ T | vL(t˜) &gt; 0}| + 2)
. (
        <xref ref-type="bibr" rid="ref5">5</xref>
        )
It is clear that |{t˜ ∈ T | vj (t˜) &gt; 0}| is the number of triples from the training set such
that their pooled sequences include the j-th lecture. The reason for performing such
normalization is that lectures included into many pooled sequences are generally less
relevant. The second normalization is
      </p>
      <p>
v′′({a, b, c}) = 


 v1′({a, b, c}) · log</p>
      <p>L ′
jP=1 vj ({a, b, c}) + 1</p>
      <p>!
p3 · n({a, b, c}) + ε</p>
      <p>· · ·
· · ·
vL′({a, b, c}) · log</p>
      <p>
        L ′
jP=1 vj ({a, b, c}) + 1
p3 · n({a, b, c}) + ε
! 

 , (
        <xref ref-type="bibr" rid="ref6">6</xref>
        )



ε = 0.01. It is difficult enough to describe sense of this normalization. It was a result
of exhaustive search of different variants and increased performance by 1–2%, that was
essential in competition.
      </p>
      <p>Let
s(d˜) =</p>
      <p>X
t˜∈T : d˜⊆t˜</p>
      <p>
        v′′(t˜)
and n(d˜) = |{t˜ ∈ T : d˜ ⊆ t˜}| be the number of addends in the sum, T be the training set.
For example, s({a, b}) is the sum of vectors v′′({a, b, d}) for all d such that {a, b, d} ∈ T .
Let operation ω deletes from a vector all zero elements except one and adds one zero if
there was not any zero element. For example, ω(
        <xref ref-type="bibr" rid="ref1 ref2">1, 0, 0, 2, 0</xref>
        ) = (
        <xref ref-type="bibr" rid="ref1 ref2">1, 0, 2</xref>
        ), ω(
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ) = (
        <xref ref-type="bibr" rid="ref1 ref2">1, 2, 0</xref>
        ).
Let
      </p>
      <p>uv 1 Xn
std(x1, . . . , xn) = tu n − 1 t=1
1 n</p>
      <p>X xi
xt − n i=1</p>
      <p>2
!
(standard deviation [7]).</p>
      <p>
        The algorithm is very simple: for the triple {a, b, c} from the test set if there are
at least two nonzeros among numbers n({a, b}), n({a, c}), n({b, c}) (it corresponds to
having enough information) than
(
        <xref ref-type="bibr" rid="ref7">7</xref>
        )
(
        <xref ref-type="bibr" rid="ref8">8</xref>
        )
(
        <xref ref-type="bibr" rid="ref9">9</xref>
        )
γ =
log(s({a, b}) + 0.02)
std(ω(s({a, b}))) + 0.5
log(s({b, c}) + 0.02)
std(ω(s({b, c}))) + 0.5
log(s({a, c}) + 0.02)
std(ω(s({a, c}))) + 0.5
+
+
.
      </p>
      <p>Otherwise we add to this sum addends
log(s({a}) + 0.02)
std(ω(s({a}))) + 0.5
+
log(s({b}) + 0.02)
std(ω(s({b}))) + 0.5
+
log(s({c}) + 0.02)
std(ω(s({c}))) + 0.5
.</p>
      <p>Here the log is taken elementwise. Elements of the received vector γ are treated as the
estimations of popularity of lectures from pooled sequence (the higher value, the more
popular). Lectures with the highest estimations are recommended.</p>
      <p>Let us try to explain the process of the development of the algorithm. It is very logical
to operate simply by a rule
γ = log(s({a, b})) + log(s({b, c})) + log(s({a, c})) =</p>
      <p>
        = log(s({a, b}) · s({b, c}) · s({a, c})),
where “·” is the elementwise multiplication of vectors. Indeed, if there is no information
on triple {a, b, c} we parse triples {a, b, d} for all d. Thus we sum vectors v({a, b, d}) and
receive a vector s({a, b}). It corresponds to union of multisets [8]. Similarly for triples
{a, c, d}, {b, c, d} we receive vectors s({a, c}) and s({b, c}). Now it is logical to intersect
the received multisets. Standard operation for intersection in the theory of multisets is
min (minimum). However in our experiment product proved to be better:
s({a, b}) · s({b, c}) · s({a, c}) ,
this operation is popular in the theory of fuzzy sets [9] for intersection (the performance
was 49%, and for min the performance was 47%). The expression
(s({a, b}) + ε) · (s({b, c}) + ε) · (s({a, c}) + ε)
is needed to prevent zeroing many elements of the vector and information losses (the
performance became 57%). Then experiments on normalizations of vectors and scaling
were made. As a result, division by std(ω(s({·, ·}))) + 0.5 increased performance
approximately by 1–3%. Adding of (
        <xref ref-type="bibr" rid="ref8">8</xref>
        ) does not influence performance, it was made “just in
case”.
      </p>
      <p>
        In this solution the ideas of “LENKOR” were also used: linear combination tuning
(for this reason the product (
        <xref ref-type="bibr" rid="ref9">9</xref>
        ) was written down as the sum of logs) and subsequent
“deformation” (we used data normalizations). Each addend in the linear combination is
a vector estimated popularity of lectures. The algorithm did not use detailed descriptions
of the lectures as in the first task. In our experiments the usage of descriptions did not
improve performance.
      </p>
      <p>The algorithm has won the first place among 22 participants with the result of
62.415%. The algorithm running time on computer HP p6050ru Intel Core 2 Quad CPU
Q8200 2.33GHz, RAM 3Gb, OS Windows Vista in MATLAB 7.10.0 for one lecture
recommendation is 0.0383 seconds, full task 2 solving (60274 recommendations) takes 38.33
minutes.</p>
      <p>
        The algorithms (for the first and the second tasks) can be efficiently parallelized.
Calculations (
        <xref ref-type="bibr" rid="ref1">1</xref>
        )–(
        <xref ref-type="bibr" rid="ref9">9</xref>
        ) can be made on matrices to produce several recommendations at
once. For this reason we used MATLAB in our experiments: there are efficient matrix
calculations in this system.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>N.</given-names>
            <surname>Antulov-Fantulin</surname>
          </string-name>
          , M. Boˇsnjak, T. Sˇmuc, M. Jermol, M. Zˇnidarˇsiˇc, M. Grˇcar, P. Keˇse, N. Lavraˇc: ECML/PKDD 2011 -
          <article-title>Discovery challenge: “VideoLectures.Net Recommender System Challenge”</article-title>
          , http://tunedit.org/challenge/VLNetChallenge/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. http://www.videolectures.net/</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Christopher</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Manning</surname>
          </string-name>
          , Prabhakar Raghavan and Hinrich Schu¨tze: Introduction to Information Retrieval, Cambridge University Press (
          <year>2008</year>
          ). http://nlp.stanford.edu/IRbook/information-retrieval-book.html
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Porter</surname>
          </string-name>
          , Martin F.:
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program</source>
          .
          <volume>14</volume>
          (
          <issue>3</issue>
          ),
          <fpage>130</fpage>
          -
          <lpage>137</lpage>
          (
          <year>1980</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. http://tartarus.org/∼martin/PorterStemmer/matlab.txt</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>T.</given-names>
            <surname>Abatzoglou</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. O'</surname>
          </string-name>
          <article-title>Donnell: Minimization by coordinate descent</article-title>
          ,
          <source>Journal of Optimization Theory and Applications</source>
          .
          <volume>36</volume>
          (
          <issue>2</issue>
          ),
          <fpage>163</fpage>
          -
          <lpage>174</lpage>
          (
          <year>1982</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. http://en.wikipedia.org/wiki/Standard deviation</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Wayne</surname>
            <given-names>D.</given-names>
          </string-name>
          <article-title>Blizard: Multiset theory</article-title>
          .
          <source>Notre Dame Journal of Formal Logic</source>
          .
          <volume>30</volume>
          ,
          <issue>1</issue>
          , Winter,
          <fpage>36</fpage>
          -
          <lpage>66</lpage>
          (
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. http://en.wikipedia.org/wiki/Fuzzy mathematics</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>