<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lightweight Approach to the Cold Start Problem in the Video Lecture Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leo Iaquinta</string-name>
          <email>iaquinta@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <email>semeraro@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Bari “Aldo Moro”</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>83</fpage>
      <lpage>93</lpage>
      <abstract>
        <p>In this paper we present our participation, as SWAPTeam, in the ECML/PKDD 2011 - Discovery challenge, for the task on the cold start problem, which focuses on making recommendations for new video lectures. The main idea is to use a content-based approach because it is less sensitive to the cold start problem that is commonly associated with pure collaborative filtering recommenders. The design of the developed components is driven by the hybridization strategy chosen for integration and by scalability requirements.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In this paper we present our participation, as SWAPTeam<sup>1</sup>, in the ECML/PKDD 2011
- Discovery challenge, for the task on the cold start problem, which focuses on making
recommendations for new video lectures based on historical data from the VideoLectures.Net
website.</p>
      <p>Recommender systems (RSs) usually suggest items of interest to users by
exploiting explicit and implicit feedback and preferences, usage patterns, and user or
item attributes. Past behavior is assumed to be useful for making reliable predictions,
so past data is used to train RSs and obtain accurate prediction models. A
design challenge arises from the dynamism of real systems, because new items and
new users are continuously added without any previously known behavior.</p>
      <p>VideoLectures.Net also uses an RS to guide users as they access its large
multimedia repository of video lectures. Besides the editorial effort to select and classify
lectures, accompanying documents, information and links, the Discovery challenge is
organized in order to improve the website’s current RS, inter alia, to deal with the cold
start problem.</p>
      <p>The main idea underlying our participation is to use a content-based approach
because it is less sensitive to the cold start problem that is commonly associated with
pure collaborative filtering recommenders. The adopted solution exploits almost all the
provided data, and it could be integrated with the VideoLectures.Net RS
through a hybrid approach. Moreover, scalability is considered
a primary requirement and, thus, a lightweight solution is pursued.</p>
      <p>The rest of the paper is structured as follows: Section 2 recalls some common
knowledge about the cold start problem, Section 3 sketches some features of the dataset, Section
4 illustrates the proposed solution and Section 5 closes the paper with some conclusions
and future work.</p>
      <p><sup>1</sup> http://www.di.uniba.it/~swap/index.php</p>
    </sec>
    <sec id="sec-2">
      <title>Cold Start Problem</title>
      <p>The cold start problem is commonly associated with pure collaborative filtering RSs.
In particular, item-based collaborative filtering techniques assume that items are
similar when they are similarly rated, and therefore the recommendations concern the items with
the highest correlations according to the usage evidence. As a drawback, new items cannot
be recommended during the cold start because they do not provide adequate usage
evidence.</p>
      <p>
        The cold start problem concerns performance issues when new items (or new users)
must be handled by the system. The cold start can be considered a sub-problem
of the coverage problem [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; indeed, it measures the system coverage over a specific set of
items or users. Therefore, although the prediction accuracy of an RS, especially a
collaborative filtering one, often grows with the amount of data, the coverage problem
of some algorithms shows up as high-quality recommendations for only a portion of
the items, even if the system has gathered a huge amount of data.
      </p>
      <p>
        Focusing on cold start for items, there are various heuristics to pick out the cold
items. For instance, cold items can be items with no ratings or usage evidence, or items
that have existed in the system for less than a certain amount of time (e.g., a day), or items
that have less than a predefined amount of evidence (e.g., fewer than 10 ratings) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
correct selection of cold items allows them to be processed in a different way.
      </p>
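      <p>The heuristics above can be sketched as a small predicate. This is a minimal Python sketch, not the implementation used for the challenge: the Item fields, the function name is_cold and the default thresholds (one day and 10 ratings, borrowed from the examples in the text) are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Item:
    item_id: str
    added_at: datetime  # when the item entered the system (hypothetical field)
    n_ratings: int      # amount of usage evidence collected so far (hypothetical field)


def is_cold(item, now, min_age=timedelta(days=1), min_ratings=10):
    """Return True if at least one of the three heuristics marks the item as cold."""
    no_evidence = item.n_ratings == 0               # no ratings or usage evidence
    too_recent = (now - item.added_at) < min_age    # in the system for too short a time
    too_few = item.n_ratings < min_ratings          # less evidence than the threshold
    return no_evidence or too_recent or too_few
```
]]></preformat>
      <p>With the default thresholds, an item added two hours ago is cold regardless of its ratings, while an item that is a month old and has 50 ratings is not.</p>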
      <p>
        Predictions about cold items may require a different approach from
predictions about hot items. This may be desirable due to other
considerations such as novelty and serendipity. Thus, when evaluating the system accuracy on cold
items, it may be wise to consider that there is a trade-off with the accuracy of the entire system
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Dataset</title>
      <p>The main entities of the dataset are the lectures. They are described by a set of
attributes and relationships. The attributes are of various kinds: for instance, type can
have one value from a predefined set (lecture, keynote, tutorial, invited talk and so on);
the views attribute has a numeric value; rec date and pub date have a date value; name and
description are unstructured text, usually in the language of the lecture. The
relationships link the lectures with 519 context events, 8,092 authors, and 348 categories. Each
of these entities has its own attributes and relationships to describe taxonomies of events
and categories.</p>
      <p>Almost all of this data can be exploited to obtain features for a content-based
recommendation approach. The features used are briefly introduced in Section 4.2. The
lectures are divided into 6,983 for training and 1,122, treated as cold items, for testing.</p>
      <p>In addition, the dataset contains records about pairs of lectures viewed together (not
necessarily consecutively) by at least two distinct cookie-identified browsers. This kind
of data has a collaborative flavour and it is actually the only information about past
behavior. User identification is missing, so no user personalization is possible.
User queries and feedback are also missing.</p>
    </sec>
    <sec id="sec-4">
      <title>Proposed Approach</title>
      <sec id="sec-4-1">
        <title>Content-based Technique by Hybrid Approach</title>
        <p>
          To overcome the cold start problem of the collaborative approaches, a common solution
is to hybridize them with other techniques that do not suffer from the same problem [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. For
instance, a content-based approach can be used to bridge the gap from existing items to
new ones: item attributes are used to infer similarities among items.
        </p>
        <p>Content-based techniques also have a start-up problem because they must accumulate
enough usage evidence to build a reliable classifier, but this is not an issue in the task on the cold start
problem of the ECML/PKDD 2011 - Discovery challenge.</p>
        <p>Furthermore, relative to collaborative filtering, content-based techniques are limited
by the features that are explicitly associated with the items that they recommend. For
instance, a content-based movie recommendation is usually based on the movie metadata,
since the movie itself is opaque to the system. In the task on the cold start problem of the
ECML/PKDD 2011 - Discovery challenge, this general problem is solved by the editorial
effort of VideoLectures.Net to select and classify lectures. In addition, as sketched in
Section 3, almost all provided data can be exploited to obtain content-based features.</p>
        <p>
          The hybridization strategy can be flexible in order to apply different approaches
to specific classes of items (or users) and, therefore, switch to a specific technique for
the selected cold items. A switching approach [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is a simple hybridization strategy that
applies different techniques at the item level at no further cost
besides the cold item selection.
        </p>
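        <p>A switching strategy of this kind can be sketched in a few lines. The following Python fragment is only an illustration under stated assumptions: the function name switching_recommender and its three callable parameters (a cold-item test, a content-based scorer and a collaborative scorer) are hypothetical and do not come from the VideoLectures.Net RS.</p>
        <preformat><![CDATA[
```python
def switching_recommender(is_cold, content_score, collaborative_score):
    """Build a scorer that picks one technique per candidate item.

    is_cold(item) decides which technique applies; the two scorers take
    (old_item, candidate) and return a number. All three are assumed callables.
    """
    def score(old_item, candidate):
        if is_cold(candidate):
            # No usage evidence yet: fall back on item attributes.
            return content_score(old_item, candidate)
        # Enough usage evidence: keep the collaborative prediction.
        return collaborative_score(old_item, candidate)

    return score
```
]]></preformat>
        <p>The only extra work at recommendation time is the call to is_cold, which matches the remark that switching adds no cost besides the cold item selection.</p>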
      </sec>
      <sec id="sec-4-2">
        <title>Steps towards Solution</title>
        <p>The solution is obtained mainly in three steps: data pre-processing, model
learning, and recommendation.</p>
        <p>The data pre-processing step starts by loading the CSV files of the dataset with the
Super CSV library<sup>2</sup> to obtain an in-memory object-oriented representation.</p>
        <p>In addition, a set of Lucene<sup>3</sup> indexes is created to store textual metadata (title,
description and slide title) in order to exploit the term frequency vectors to efficiently
compute document similarities. Since the metadata is inherently multi-lingual, a
separate index is created for each language and textual metadata is added to the proper
index according to the detected language. The language detection is performed by
naive Bayesian filters that exploit language profiles learned from Wikipedia<sup>4</sup>. The
textual metadata is also preprocessed to remove stop words and to reduce inflected
words to their stem: these sub-steps are strongly language-dependent, thus specific
linguistic knowledge can improve the effectiveness of the process.</p>
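        <p>To make the role of term frequency vectors concrete, the fragment below computes a cosine similarity between two texts of the same language. It is a plain Python stand-in, not the Lucene-based component: the function names are hypothetical, tokenization is a simple whitespace split, stemming is omitted, and Lucene's own scoring may differ from this formula.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def term_frequencies(text, stop_words=frozenset()):
    """Lower-case, split on whitespace and count the terms that are not stop words."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    return Counter(tokens)


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term frequency vectors (0.0 if one is empty)."""
    shared = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```
]]></preformat>
        <p>Keeping one set of vectors per detected language, as the text describes for the indexes, means that only texts in the same language are compared and that the stop word list can be language-specific.</p>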
        <p>The event names are filtered by regular expressions to introduce an event similarity
metric smarter than simple string matching.</p>
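        <p>The paper does not list the regular expressions that were used, so the following Python sketch only illustrates the idea under assumptions: the patterns (years, ordinals, punctuation) and the token-overlap measure are hypothetical choices, and the event names in the usage note are made up.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical filters: drop years, ordinals such as "23rd", and punctuation.
_YEAR = re.compile(r"\b(19|20)\d{2}\b")
_ORDINAL = re.compile(r"\b\d+(st|nd|rd|th)\b", re.IGNORECASE)
_NON_WORD = re.compile(r"[^a-z0-9]+")


def normalize_event_name(name):
    """Return the lower-cased tokens of an event name after filtering."""
    name = _YEAR.sub(" ", name)
    name = _ORDINAL.sub(" ", name)
    name = _NON_WORD.sub(" ", name.lower())
    return name.split()


def event_similarity(name_a, name_b):
    """Jaccard overlap of the filtered tokens (an assumed similarity measure)."""
    a = set(normalize_event_name(name_a))
    b = set(normalize_event_name(name_b))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```
]]></preformat>
        <p>With these filters, two editions of the same conference such as "ICML 2006" and "ICML 2007" get similarity 1.0, whereas exact string matching would treat them as different events.</p>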
        <sec id="sec-4-2-1">
          <title>2 http://supercsv.sourceforge.net/</title>
        </sec>
        <sec id="sec-4-2-2">
          <title>3 http://lucene.apache.org/</title>
        </sec>
        <sec id="sec-4-2-3">
          <title>4 http://code.google.com/p/language-detection/</title>
          <p>A complete in-memory representation of the category taxonomy is also created to
compute the category similarity as the graph-based minimum path between pairs of
categories.</p>
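          <p>A minimum path over the taxonomy can be found with a breadth-first search. The sketch below is an illustration in Python rather than the Java component described here: the function names, the undirected view of the taxonomy and the mapping from path length to a similarity value, 1 / (1 + d), are assumptions.</p>
          <preformat><![CDATA[
```python
from collections import deque


def build_graph(parent_child_pairs):
    """Undirected adjacency sets built from (parent, child) taxonomy links."""
    graph = {}
    for parent, child in parent_child_pairs:
        graph.setdefault(parent, set()).add(child)
        graph.setdefault(child, set()).add(parent)
    return graph


def min_path_length(graph, source, target):
    """Number of edges on a shortest path, or None if no path exists."""
    if source not in graph or target not in graph:
        return None
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == target:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def category_similarity(graph, a, b):
    """Assumed mapping: identical categories score 1.0, unreachable ones 0.0."""
    d = min_path_length(graph, a, b)
    return 0.0 if d is None else 1.0 / (1.0 + d)
```
]]></preformat>
          <p>With only 348 categories, the whole graph fits comfortably in memory, which is consistent with the in-memory design of the pre-processing step.</p>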
          <p>The main output of this step is a set of 20 numeric values describing the similarities
between the lectures of each pair in the training set. Table 1 reports the features used: for
each pair of items, they involve the languages, the frequencies of languages (Fig. 5-b),
the descriptions, the recording and publication ages, the conferences, the authors and
their affiliations, and the categories.</p>
          <p>The model learning step uses Weka<sup>5</sup> to build a prediction model for the frequency of a
pair of lectures. The available data and the lightweight goal determined the selection
of a linear model for the learning problem. Thus the model output is a weighted sum
of the attribute values that predicts the pair frequency. The learning process fits
the weights of this regression model to the output of the data pre-processing
step.</p>
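          <p>To show what fitting the weights of a linear model involves, the fragment below solves an ordinary least-squares problem in plain Python. It is a sketch under assumptions and not Weka's implementation, which may add steps such as attribute selection or regularization; the function names and the toy data in the usage note are hypothetical.</p>
          <preformat><![CDATA[
```python
def fit_linear_model(rows, targets):
    """Least-squares weights, with the intercept as the last entry.

    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan
    elimination with partial pivoting.
    """
    X = [list(r) + [1.0] for r in rows]  # append a constant column for the intercept
    n = len(X[0])
    # Augmented system: each row holds one line of X^T X followed by X^T y.
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(n)]
        + [sum(x[i] * t for x, t in zip(X, targets))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; the system is singular")
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [row[n] for row in A]


def predict(weights, features):
    """Weighted sum of the feature values plus the intercept."""
    return sum(w * f for w, f in zip(weights, features)) + weights[-1]
```
]]></preformat>
          <p>For example, fitting the rows (1, 0), (0, 1), (1, 1), (2, 1) against targets generated as 2·x1 + 3·x2 + 0.5 recovers those three coefficients up to rounding error.</p>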
          <p>This step is quite time-consuming and requires a lot of memory, depending mainly on
the size of its input. For this reason, the output of the data pre-processing step can be restricted
in terms of the features exploited and the items selected.</p>
          <p>Table 1 reports different models learned using all the available pairs: for each model,
the table reports the features used with their learned weights, the regression
metrics provided by Weka, and the metric values for the recommendation of cold items.
Model-1 uses all the available features; Model-2 leaves out the Lucene-based
similarity; Model-3 leaves out the features based on recording and publication ages; Model-4
leaves out the conferences; Model-5 leaves out the authors; Model-6 leaves out the
categories. Some weights are missing as a result of the fitting performed by the learning method.</p>
          <p>The learned weights of a model are stored in a configuration file, with the option to
add a boost factor for each weight in order to easily explore the influence of each feature beyond the
learned model. Fig. 1 and Fig. 2 report the values of the evaluation metric (Mean
Average R-precision - MARp) for the recommendations using Model-1 when a boost
factor is changed. The boost factors can also be modified to implement a naive
feedback control on recommendations without performing a complete learning step.
Fig. 3 reports the evaluation metric values for the submitted solutions when the
boost factors for the learned weights in Model-1 are changed: the submitted solutions
always outperform the random baseline (MARp: 0.01949).</p>
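          <p>The interplay between learned weights and boost factors can be sketched as follows. This Python fragment is illustrative only: the function name pair_score, the dictionary-based layout and the feature name "category" are assumptions ("freqLang" is the only feature name taken from the text), and the numeric values in the usage note are made up.</p>
          <preformat><![CDATA[
```python
def pair_score(features, weights, boosts=None, intercept=0.0):
    """Predicted pair frequency as a boosted weighted sum.

    features: feature name -> similarity value for one pair of lectures
    weights:  feature name -> learned weight (features without a weight are skipped)
    boosts:   feature name -> multiplier applied to the weight (default 1.0)
    """
    boosts = boosts or {}
    return intercept + sum(
        weights[name] * boosts.get(name, 1.0) * value
        for name, value in features.items()
        if name in weights
    )
```
]]></preformat>
          <p>With weights {"freqLang": 0.5, "category": 2.0} and feature values {"freqLang": 4.0, "category": 0.25}, the score is 2.5; doubling the boost of "category" raises it to 3.0 without retraining the model.</p>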
          <p>The recommendation step uses the in-memory representation of the pre-processing step
and the learned weights to predict the pair frequency of an old item with each selected
cold item. The 30 cold items with the highest predicted values are selected for the
recommendation.</p>
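          <p>Selecting the best-scoring cold items is a top-k problem. The Python sketch below assumes a hypothetical score(old_item, cold_item) callable that returns the predicted pair frequency; the function name recommend is likewise an assumption.</p>
          <preformat><![CDATA[
```python
import heapq


def recommend(old_item, cold_items, score, k=30):
    """Return the k cold items with the highest predicted pair frequency."""
    scored = ((score(old_item, cold), cold) for cold in cold_items)
    # nlargest keeps only k candidates in memory while scanning all cold items.
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [cold for _, cold in best]
```
]]></preformat>
          <p>Each recommendation scans all selected cold items once, so its cost grows linearly with their number, which is small here (1,122 items).</p>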
          <p>The in-memory representation and the lightweight prediction model allow a new
recommendation to be formulated in a reasonably short time.</p>
          <p>The in-memory representation of the data pre-processing step is also used to create R<sup>6</sup>
scripts to visualize the information in the dataset for an informed selection of the
content-based features. For instance, Fig. 4 shows how the views are temporally distributed</p>
        </sec>
        <sec id="sec-4-2-4">
          <title>5 http://www.cs.waikato.ac.nz/ml/weka/</title>
        </sec>
        <sec id="sec-4-2-5">
          <title>6 http://www.r-project.org/</title>
          <p>
            considering the recording and publishing ages: the behavior is quite dissimilar for the
two time scales. Indeed, the oldest recorded lectures are seldom viewed, as the cumulative
box-plot and density function (the rightmost subgraphs) highlight; conversely, the oldest
published lectures have the highest density of views. Probably, user interest in old
lectures is weak, even though VideoLectures.Net attracted a lot of attention during its first
months. In addition, the views of lectures decrease when their recording and publishing
ages decrease. Thus recent lectures need some assistance. Fig. 4 supports the idea of
exploiting age-based features in the model learning, although the temporal distribution of
views deserves further investigation for a selective use of pairs in the learning step. Fig.
5 shows how the views of each item are distributed considering its type: the rightmost
histogram shows the cumulative views for each type; the uppermost box-plot summarizes
the views for each item. Fig. 5 shows how coldness and hotness are related to the
item type. Fig. 6 shows how types and languages are linked by training pairs: the circular
areas are proportional to the logarithm of the cumulative frequencies of the pairs of lectures
viewed together. This kind of information is exploited by the “freqLang” feature.
          </p>
          <p>
            With the growth of the dataset, many recommendation algorithms either slow down or
require additional resources such as computation power or memory. As RSs are designed
to help users navigate large collections of items, one of the goals of the designers of
such systems is to scale up to real datasets. As such, algorithms often
trade other properties, such as accuracy or coverage, for rapid results on huge
datasets [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. The trade-off can be achieved by changing some parameters, such as the
complexity of the model, or the sample size. For real systems it is important to measure
the compromises that scalability dictates [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].
          </p>
          <p>
            RSs are expected in many cases to provide recommendations on-line, thus it is also
important to measure how fast the system provides recommendations [
            <xref ref-type="bibr" rid="ref3 ref5">3, 5</xref>
            ]. Common
measurements are the number of recommendations that the system can provide per second
(the throughput of the system) and the time required to make a recommendation (the
latency or response time).
          </p>
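          <p>Both measurements can be taken with a simple timing loop. The following Python sketch is an assumption-laden illustration (the measure function and its dictionary keys are hypothetical, and the developed components are written in Java), but it shows how throughput and latency relate for sequential requests.</p>
          <preformat><![CDATA[
```python
import time


def measure(recommend_fn, queries):
    """Time recommend_fn on each query, one after the other."""
    latencies = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        recommend_fn(query)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    if not latencies:
        raise ValueError("at least one query is needed")
    return {
        # recommendations served per second over the whole run
        "throughput_per_s": len(latencies) / elapsed if elapsed > 0 else float("inf"),
        # average and worst response time of a single recommendation
        "mean_latency_s": sum(latencies) / len(latencies),
        "max_latency_s": max(latencies),
    }
```
]]></preformat>
          <p>For sequential requests the two figures are roughly reciprocal: for instance, 5,704 recommendations in about 85 seconds correspond to roughly 67 recommendations per second and about 15 milliseconds each.</p>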
          <p>The developed Java components complete the recommendation task for the
5,704 lectures in about 85 seconds on a notebook with an Intel Core 2 CPU at 2.0 GHz
and 2 GB of RAM, i.e., each new recommendation of 30 cold items out of the selected
1,122 is provided in about 15 milliseconds. Reasonably, a production server would
further reduce the response time for new recommendations, and a cache specifically
devised for the recommendations would increase the throughput.</p>
          <p>Moreover, the learning step performed by Weka is the most time-consuming one and
it requires a lot of memory. Although the step is designed to be performed off-line, the
time and space requirements can be reduced by exploiting fewer features or less historical
data.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We have described the steps to achieve the submitted solution, which outperforms the
random baseline. The in vitro evaluation of a solution to the cold start problem is an
arduous task, since the common assumption about the reliability of past data to provide
predictions is weakened. For instance, Fig. 7 shows how many of the old items used in
the evaluation of submitted solutions have few associated cold items. The lack of such
links comes from the real data and it justifies the need for some strategy to deal with
cold items. In addition, Fig. 7 shows that the average frequency of the considered pairs
of old and cold lectures increases when the users view an increasing number of cold items
for the same old item: the transition from cold to hot seems to occur at the highest levels
used for the evaluation metric. The evaluation levels (5, 10, 15, 20, 25, 30) are shown in
Fig. 7 as grey vertical lines.</p>
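      <p>This paper does not spell out how MARp is computed, so the following Python sketch rests on an assumed reading of the challenge metric: at each evaluation level z, the top z recommended cold items are compared with the top z relevant ones (ordered by pair frequency), the resulting R-precision values are averaged over the six levels, and that average is in turn averaged over all old items. The function names and the dictionary-based data layout are hypothetical.</p>
      <preformat><![CDATA[
```python
CUTOFFS = (5, 10, 15, 20, 25, 30)  # evaluation levels mentioned in the text


def r_precision_at(retrieved, relevant, z):
    """Share of the top-z relevant items found among the top-z retrieved ones."""
    top_relevant = relevant[:z]  # at most min(len(relevant), z) items
    if not top_relevant:
        return 0.0
    hits = len(set(retrieved[:z]) & set(top_relevant))
    return hits / len(top_relevant)


def average_r_precision(retrieved, relevant, cutoffs=CUTOFFS):
    """Mean of the R-precision values over all evaluation levels."""
    return sum(r_precision_at(retrieved, relevant, z) for z in cutoffs) / len(cutoffs)


def mean_average_r_precision(solutions, truths, cutoffs=CUTOFFS):
    """Mean over all old items; an old item without a solution scores 0."""
    scores = [
        average_r_precision(solutions.get(q, []), truths[q], cutoffs) for q in truths
    ]
    return sum(scores) / len(scores)
```
]]></preformat>
      <p>Under this reading, an old item with few associated cold items is evaluated on those few items at every level, which is one way to see why sparse links make the evaluation delicate.</p>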
      <p>
        The idea of integrating a content-based approach also makes it possible to provide serendipitous
recommendations alongside classical ones [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Indeed the content-based item similarity
can be used to obtain a hybrid RS that exploits the “Anomalies and exceptions” approach
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to spot potential serendipitous items as a further trade-off with the accuracy of the entire
system.
      </p>
      <p>Finally, scalability is considered a primary requirement and a
lightweight solution is pursued. The preliminary performance of the notebook execution
is quite promising, and some future directions for improving latency and throughput are
sketched. A feasible integration strategy is also outlined.</p>
      <p>Acknowledgments. This research was partially funded by MIUR (Ministero dell’Università
e della Ricerca) under the contract “Fondo per le Agevolazioni alla Ricerca”, DM19410
“Laboratorio di Bioinformatica per la Biodiversità Molecolare” (2007-2011).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Burke</surname>
          </string-name>
          , R.:
          <article-title>Hybrid recommender systems: Survey and experiments</article-title>
          .
          <source>User Modeling and User-Adapted Interaction</source>
          <volume>12</volume>
          ,
          <fpage>331</fpage>
          -
          <lpage>370</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajaram</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Google news personalization: scalable online collaborative filtering</article-title>
          .
          <source>In: Proc. of the 16th int. conf. on World Wide Web (WWW '07)</source>
          . pp.
          <fpage>271</fpage>
          -
          <lpage>280</lpage>
          . ACM (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Herlocker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konstan</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedl</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms</article-title>
          .
          <source>Information Retrieval</source>
          <volume>5</volume>
          ,
          <fpage>287</fpage>
          -
          <lpage>310</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Iaquinta</surname>
          </string-name>
          , L.,
          <string-name>
            <surname>de Gemmis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lops</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Semeraro</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Filannino</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molino</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Introducing serendipity in a content-based recommender system</article-title>
          . In:
          <string-name>
            <surname>Xhafa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herrera</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abraham</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Köppen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benítez</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          (eds.)
          <source>Proc. of the 8th int. conf. on Hybrid Intelligent Systems (HIS-2008)</source>
          . pp.
          <fpage>168</fpage>
          -
          <lpage>173</lpage>
          . IEEE Computer Society (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sarwar</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konstan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedl</surname>
          </string-name>
          , J.:
          <article-title>Item-based collaborative filtering recommendation algorithms</article-title>
          .
          <source>In: Proc. of the 10th int. conf. on World Wide Web (WWW '01)</source>
          . pp.
          <fpage>285</fpage>
          -
          <lpage>295</lpage>
          . ACM (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Schein</surname>
            ,
            <given-names>A.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescul</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ungar</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pennock</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          :
          <article-title>Methods and metrics for cold-start recommendations</article-title>
          .
          <source>In: Proc. of the 25th ACM SIGIR conf. on Research and development in information retrieval (SIGIR '02)</source>
          . pp.
          <fpage>253</fpage>
          -
          <lpage>260</lpage>
          . ACM (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Shani</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gunawardana</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Evaluating recommendation systems</article-title>
          . In:
          <string-name>
            <surname>Ricci</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shapira</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kantor</surname>
            ,
            <given-names>P.B.</given-names>
          </string-name>
          (eds.)
          <source>Recommender Systems Handbook</source>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>297</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Toms</surname>
            ,
            <given-names>E.G.</given-names>
          </string-name>
          :
          <article-title>Serendipitous information retrieval</article-title>
          . In: DELOS Workshop: Information Seeking, Searching and Querying in Digital Libraries (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>