<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TV-Show Retrieval and Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cataldo Musto</string-name>
          <email>cataldomusto@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fedelucio Narducci</string-name>
          <email>narducci@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Lops</string-name>
          <email>lops@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <email>semeraro@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco de Gemmis</string-name>
          <email>degemmis@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauro Barbieri</string-name>
          <email>mauro.barbieri@philips.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Korst</string-name>
          <email>jan.korst@philips.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Verus Pronk</string-name>
          <email>verus.pronk@philips.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramon Clout</string-name>
          <email>ramon.clout@philips.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari "A. Moro"</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Philips Research</institution>
          ,
          <addr-line>Eindhoven</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recommender systems are popular tools to aid users in finding interesting and relevant TV shows and other digital video assets, based on implicitly defined user preferences. In this context, a common assumption is that user preferences can be specified by program types (such as documentary or sports), and that an asset can be labeled with one or more program types, thus allowing an initial coarse preselection of potentially interesting assets. Furthermore, each asset has a short textual description, which allows us to investigate whether it is possible to automatically label assets with program types. We compare the Vector Space Model (VSM) with more recent approaches to text classification, such as Logistic Regression (LR) and Random Indexing (RI), on a large collection of TV-show descriptions. The experimental results show that LR is the best approach, but RI outperforms VSM under particular conditions.</p>
      </abstract>
      <kwd-group>
        <kwd>Vector Space Model</kwd>
        <kwd>Random Indexing</kwd>
        <kwd>Logistic Regression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Automatic TV recommendation has been explored extensively in the literature,
where most papers assume that the set of items to recommend is of
moderate size. Most approaches are not directly applicable to web video repositories
(such as YouTube) whose item sets are orders of magnitude larger. To provide
personalized recommendations for digital assets on the web and TV, a possible
approach is to match the assets' textual descriptions to personal preferences of
users. It is common practice to classify TV shows by labeling them with one or
more program type labels. It may also be assumed that user preferences can be
coarsely expressed in terms of program types [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this paper, we assume that
each asset has a short textual description and we investigate (a) how well that
description can be automatically mapped to a program type and (b) which
machine learning algorithms are best suited to this classification
task. To this end, we have extensively tested algorithms on a large collection
of TV-show descriptions, which calls for the adoption of simple and scalable
retrieval models. A text classification algorithm based on the Vector Space Model
(VSM) might be a good solution, provided that effective dimensionality
reduction techniques are integrated, such as Random Indexing (RI) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. As regards
classification algorithms, we opted for Logistic Regression (LR), since it is
generally considered as accurate as Support Vector Machines, with the advantage
of yielding a probability model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        This research is carried out in the context of a joint project with aprico
Solutions<sup>3</sup>, a software company and part of Philips Electronics. aprico
Solutions develops video recommender and targeting technology, primarily for the
broadcast and internet industries. Further details are available in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>TV-show Classification and Retrieval</title>
      <p>The two problems we focus upon can be defined as follows:</p>
      <p>TV-show classification: given a program description s and a set P of program
types, choose a program type p ∈ P that best matches the program description.
Each TV show has exactly one label assigned to it.</p>
      <p>TV-show retrieval: given a set S of TV-show descriptions and a program type
p ∈ P, return a ranked list of k TV-show descriptions from S that best match
program type p.</p>
      <p>Three approaches for the TV-show classification and TV-show retrieval tasks
have been investigated: we compare VSM with LR and RI. For both tasks,
TV-show textual descriptions have been preprocessed to obtain bag-of-words
(BOW) representations.</p>
      <sec id="sec-2-1">
        <title>TV-SHOW CLASSIFICATION</title>
        <p>Vector Space Model. Given a set of documents (corpus), each document is
represented as a point in an n-dimensional vector space, where n is the cardinality of the
vocabulary. Formally, each document is represented as a vector d = (w<sub>1</sub>, …, w<sub>n</sub>),
where w<sub>i</sub> is the tf-idf score of feature i. A vector space representation of
each program type is obtained by summing the vectors of the TV shows belonging
to that program type. Thus, given a TV show s to be classified, its program
type is the one whose vector has the highest cosine similarity to s.
VSM has some important limitations: it is not incremental and it does not model
semantics.</p>
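        <p>As an illustration, the prototype-based VSM classifier described above can be sketched in a few lines of Python. This is a minimal sketch and not the implementation used in the experiments: the function names, the toy corpus, and the idf variant log(N/df) are assumptions made for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return sparse tf-idf vectors ({term: weight}) and the idf table."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(N / df); the paper does not state which one it uses.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_prototypes(vectors, labels):
    """Sum the vectors of the TV shows belonging to each program type."""
    prototypes = defaultdict(lambda: defaultdict(float))
    for vector, label in zip(vectors, labels):
        for term, weight in vector.items():
            prototypes[label][term] += weight
    return prototypes


def classify(tokens, idf, prototypes):
    """Assign the program type whose prototype is most similar to the show."""
    vector = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    return max(prototypes, key=lambda p: cosine(vector, prototypes[p]))


# Hypothetical, already preprocessed (tokenized) descriptions.
train_docs = [
    ["football", "match", "league"],
    ["tennis", "match", "final"],
    ["wildlife", "nature", "africa"],
    ["nature", "ocean", "wildlife"],
]
train_labels = ["sports", "sports", "documentary", "documentary"]

vectors, idf = tfidf_vectors(train_docs)
prototypes = build_prototypes(vectors, train_labels)
predicted = classify(["football", "final"], idf, prototypes)
```
]]></preformat>
        <p>On this toy corpus the description containing "football" and "final" shares terms only with the sports prototype, so it is labeled sports. Note that adding a document changes the idf table and therefore every stored vector, which is one way to see why VSM is not incremental.</p>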
        <p>Random Indexing. RI is a scalable and incremental dimensionality reduction
technique. It belongs to the class of distributional models, which state that the
meaning of a word can be inferred by analyzing its use (distribution) within a
corpus of textual data. Random Indexing for TV-show classification follows the
same steps as VSM: a prototype vector is built for each program type and
the cosine similarity between a TV show and each program type is computed.
Unlike VSM, these steps are performed on the reduced vector space produced
by the RI algorithm (500 or 700 dimensions).</p>
        <p>Logistic Regression. LR is a supervised learning algorithm based on a
generalized linear model. In this work we used the implementation provided
in liblinear<sup>4</sup>. Given a TV show, we compute the probability of each program
type using the logistic function learned for each class. The TV show is assigned
the program type with the highest probability.</p>
        <p><sup>3</sup> www.aprico.tv</p>
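        <p>The dimensionality reduction performed by Random Indexing can be sketched as follows. The sketch assumes one common formulation, in which every term receives a sparse ternary index vector and a document is the term-frequency-weighted sum of the index vectors of its terms; the text does not state which formulation or weighting was used, and the number of non-zero entries (10) and all function names are illustrative choices.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
import random
from collections import Counter

DIM = 500        # reduced dimensionality (the text mentions 500 and 700)
NONZEROS = 10    # non-zero entries per index vector (assumed value)


def index_vector(term, dim=DIM, nonzeros=NONZEROS):
    """Sparse ternary index vector of a term: a few +1/-1 entries, rest 0."""
    rng = random.Random(term)  # seeded by the term, so it is reproducible
    positions = rng.sample(range(dim), nonzeros)
    return {position: rng.choice((-1.0, 1.0)) for position in positions}


def reduce_document(tokens, dim=DIM, nonzeros=NONZEROS):
    """Represent a bag of words as the tf-weighted sum of index vectors."""
    reduced = [0.0] * dim
    for term, tf in Counter(tokens).items():
        for position, sign in index_vector(term, dim, nonzeros).items():
            reduced[position] += tf * sign
    return reduced


def prototype(vectors):
    """Sum the reduced vectors of the TV shows of one program type."""
    return [sum(components) for components in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical, already preprocessed descriptions of two sports shows.
sports = prototype([
    reduce_document(["football", "match", "league"]),
    reduce_document(["tennis", "match", "final"]),
])
show = reduce_document(["football", "final"])
similarity = cosine(show, sports)
```
]]></preformat>
        <p>Because an index vector depends only on the term itself, a new TV show or a previously unseen term can be added without recomputing any existing vector, which is the sense in which the technique is incremental.</p>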
      </sec>
      <sec id="sec-2-2">
        <title>TV-SHOW RETRIEVAL</title>
        <p>For the TV-show retrieval task, we exploited only LR and RI, since they achieved
the best performance for most classes in the classification task.</p>
        <p>Random Indexing. As in the classification task, the vector space is reduced
through the RI algorithm. Given a prototype vector built for each program type,
the cosine similarity with all TV shows is computed in order to obtain the list of
the best-matching TV-show descriptions for a specific program type.</p>
        <p>Logistic Regression. The probability that a TV show belongs to a specific
program type is computed for the retrieval task as well. In this task, given a
program type p, the TV shows are ranked by their probability of belonging
to p and are returned as a ranked list.</p>
        <p><sup>4</sup> www.csie.ntu.edu.tw/~cjlin/liblinear/</p>
        <p>The experimental evaluation has been carried out through k-fold cross validation (k = 10) on a dataset
composed of 133,579 TV shows broadcast by a set of 47 German-language
channels. The textual descriptions are the input to the learning process and
are represented as bags of words. Stemming and stop-word elimination are
performed on the text. For the classification task we used Accuracy as the metric: it
is calculated as the ratio between the number of TV shows correctly classified and the total
number of TV shows classified. For the retrieval task we used Precision@n%:
it is calculated as the ratio between the number of TV shows correctly classified and
n% of the size of the test set. VSM, LR, and RI (using different vector space dimensions)
have been compared.</p>
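        <p>The two metrics can be made concrete with a short Python sketch. It reads Precision@n% as the fraction of correct TV shows among the top n% of the test set, ranked by the score of the requested program type (a probability for LR, a cosine similarity for RI); this reading, the function names, and the toy scores are assumptions made for the example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def accuracy(predicted, actual):
    """Fraction of TV shows whose predicted program type is correct."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def precision_at_percent(scores, actual, program_type, n_percent):
    """Precision over the top n% of the test set, ranked by score."""
    list_size = max(1, len(scores) * n_percent // 100)
    ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(actual[i] == program_type for i in ranking[:list_size])
    return hits / list_size


# Hypothetical test set of ten TV shows with their true program types.
actual = ["sports", "news", "sports", "film", "sports",
          "news", "film", "film", "news", "sports"]
# Hypothetical scores of each show for the program type "sports".
sports_scores = [0.9, 0.2, 0.8, 0.7, 0.6, 0.1, 0.3, 0.05, 0.15, 0.5]

p20 = precision_at_percent(sports_scores, actual, "sports", 20)
p50 = precision_at_percent(sports_scores, actual, "sports", 50)
acc = accuracy(["sports", "news", "film"], ["sports", "film", "film"])
```
]]></preformat>
        <p>With these toy scores the precision is 1.0 for the top 20% and 0.8 for the top 50%, since a film enters the longer list; the accuracy of the three example predictions is 2/3.</p>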
        <p>Classification task. Figure 1 reports the accuracy values of VSM, LR and RI. The
configurations that outperform the baseline (VSM) are in bold. For some classes the
dimensionality reduction technique degraded the performance of the
classifier. However, for most classes RI outperformed VSM, even though the reduction
of the vector space dimension is considerable. Furthermore, the LR algorithm
obtained the best accuracy. The best improvement achieved over the
VSM model is almost 20%.</p>
        <p>Retrieval task. In general, the different space dimensions for Random Indexing
do not affect the accuracy of the retrieval model (see Figure 2). For
this task too, LR achieved better results than RI. The accuracy of the model
decreases as the size of the retrieved list increases. This was expected, because
the less relevant shows for each program type are in the tail of the list.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Future Work</title>
      <p>The best-performing approach for the classification task was LR. Although
this approach has already been shown to be effective for text classification in
the literature, the results achieved in this specific scenario were not obvious, since
TV shows have very short textual descriptions and only a few training examples
were available for many classes. RI performed well in TV-show
classification for the classes with a small number of instances in the training set.
In the retrieval task LR outperforms the other approaches as well. In the future
we will work on a recommendation scenario in which the retrieved list
of TV shows is re-ranked according to the user preferences.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Musto</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Narducci</surname>
          </string-name>
          .
          <article-title>TV-show retrieval and classification</article-title>
          .
          <source>Technical report, Philips Research, High Tech Campus, Eindhoven</source>
          , The Netherlands,
          <year>July 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>V.</given-names>
            <surname>Pronk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Korst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Proidl</surname>
          </string-name>
          .
          <article-title>Personal television channels: simply zapping through your pvr content</article-title>
          .
          <source>In Proceedings of the 1st International Workshop on Recommendation-based Industrial Applications</source>
          ,
          <source>RecSys '09</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <article-title>An introduction to random indexing</article-title>
          .
          <source>In Methods and Applications of Semantic Indexing Workshop, TKE 2005</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>F. J.</given-names>
            <surname>Oles</surname>
          </string-name>
          .
          <article-title>Text categorization based on regularized linear classification methods</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>4</volume>
          :
          <fpage>5</fpage>
          –
          <lpage>31</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>