
TV-Show Retrieval and Classification

Cataldo Musto

cataldomusto@di.uniba.it 0

Fedelucio Narducci

narducci@di.uniba.it 0

Pasquale Lops

lops@di.uniba.it 0

Giovanni Semeraro

semeraro@di.uniba.it 0

Marco de Gemmis

degemmis@di.uniba.it 0

Mauro Barbieri

mauro.barbieri@philips.com 1

Jan Korst

jan.korst@philips.com 1

Verus Pronk

verus.pronk@philips.com 1

Ramon Clout

ramon.clout@philips.com 1

0 Department of Computer Science, University of Bari "A. Moro", Italy
1 Philips Research, Eindhoven, The Netherlands

Recommender systems are popular tools to aid users in finding interesting and relevant TV shows and other digital video assets, based on implicitly defined user preferences. In this context, a common assumption is that user preferences can be specified by program types (such as documentary, sports), and that an asset can be labeled by one or more program types, thus allowing an initial coarse preselection of potentially interesting assets. Furthermore, each asset has a short textual description, which allows us to investigate whether it is possible to automatically label assets with program type labels. We compare the Vector Space Model (VSM) with more recent approaches to text classification, such as Logistic Regression (LR) and Random Indexing (RI), on a large collection of TV-show descriptions. The experimental results show that LR is the best approach, but RI outperforms VSM under particular conditions.

Vector Space Model, Random Indexing, Logistic Regression

Automatic TV recommendations have been explored extensively in the literature, where most papers assume that the set of items for recommendations is of moderate size. Most approaches are not directly applicable to web video repositories (such as YouTube) whose item sets are orders of magnitude larger. To provide personalized recommendations for digital assets on the web and TV, a possible approach is to match the assets' textual descriptions to personal preferences of users. It is common practice to classify TV shows by labeling them with one or more program type labels. It may also be assumed that user preferences can be coarsely expressed in terms of program types [2]. In this paper, we assume that each asset has a short textual description and we investigate (a) how well that description can be automatically mapped to a program type and (b) which machine learning algorithms are best suited for the above-mentioned classification task. To this end, we have extensively tested algorithms using a large collection of TV-show descriptions, which calls for the adoption of simple and scalable retrieval models. A text classification algorithm based on the Vector Space Model (VSM) might be a good solution, provided that effective dimensionality reduction techniques are integrated, such as Random Indexing (RI) [3]. As regards classification algorithms, we opted for Logistic Regression (LR), since it is generally considered as accurate as Support Vector Machines, with the advantage of yielding a probability model [4].

This research is carried out in the context of a joint project with aprico Solutions³, a software company and part of Philips Electronics. aprico Solutions develops video recommender and targeting technology, primarily for the broadcast and internet industries. Further details are available in [1].

TV-show Classification and Retrieval

The two problems we focus upon can be defined as follows:

TV-show classification: given a program description $s$ and a set $P$ of program types, choose a program type $p \in P$ that best matches the program description. Each TV show has exactly one label assigned to it.

TV-show retrieval: given a set $S$ of TV-show descriptions and a program type $p \in P$, return a ranked list of $k$ TV-show descriptions from $S$ that best match program type $p$.

Three approaches for the TV-show classification and TV-show retrieval tasks have been investigated. We compare VSM with LR and RI. For both tasks, TV-show textual descriptions have been preprocessed to obtain bag-of-words (BOW) representations.

TV-SHOW CLASSIFICATION

Vector Space Model. Given a set of documents (corpus), each document is represented as a point in an $n$-dimensional vector space, where $n$ is the cardinality of the vocabulary. Formally, each document is represented as a vector $d = (w_1, \ldots, w_n)$, where $w_i$ is the tf-idf score of feature $i$. A vector space representation of each program type is obtained by summing the vectors of the TV shows belonging to that program type. Thus, given a TV show $s$ to be classified, its program type is given by the program type vector with the highest cosine similarity to $s$. VSM has some important limitations: it is not incremental and it does not model semantics.
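The prototype-based classification described above could be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the toy corpus, the function names, and the idf variant log(N/df) are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf-idf weight} dict."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(N / df); the paper does not state which one it uses.
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_prototypes(vectors, labels):
    """Sum the vectors of all shows that share a program type."""
    prototypes = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            prototypes[label][term] += weight
    return prototypes


def classify(tokens, prototypes, idf):
    """Return the program type whose prototype is most similar to the show."""
    vec = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    return max(prototypes, key=lambda p: cosine(vec, prototypes[p]))


# Hypothetical, already preprocessed (tokenized) descriptions.
train_docs = [
    ["football", "match", "league"],
    ["tennis", "match", "final"],
    ["wildlife", "nature", "africa"],
    ["nature", "ocean", "history"],
]
train_labels = ["sports", "sports", "documentary", "documentary"]

vectors, idf = tfidf_vectors(train_docs)
prototypes = build_prototypes(vectors, train_labels)
predicted = classify(["football", "final"], prototypes, idf)
```

Terms that never occur in the training corpus are ignored at classification time, which is one simple way to handle unseen vocabulary.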

Random Indexing. RI is a scalable and incremental dimensionality reduction technique. It belongs to the class of distributional models, which state that the meaning of a word can be inferred by analyzing its use (distribution) within a corpus of textual data. Random Indexing for TV-show classification follows the same steps as for VSM: a prototype vector is built for each program type and the cosine similarity between a TV show and each program type is computed. Unlike VSM, these steps are performed on the reduced vector space obtained as output of the RI algorithm (500 or 700 dimensions).

Logistic Regression. LR is a supervised learning algorithm based on a generalized linear model. In this work we exploited the implementation provided in liblinear⁴. Given a TV show, we compute the probability of each program type by exploiting the logistic functions learned for each class. The TV-show program type is determined by the highest probability.

³ www.aprico.tv
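The Random Indexing reduction described in this section could be sketched as follows. The paper does not specify which RI variant is used, so this sketch assumes one simple formulation: every term receives a fixed sparse ternary index vector, and a show is represented by the frequency-weighted sum of the index vectors of its terms. The function names, the number of non-zero entries, and the seed handling are all illustrative assumptions.

```python
import random
from collections import Counter


def index_vector(term, dim, nonzeros=4, seed=0):
    """Sparse ternary index vector: a few +1/-1 entries, zeros elsewhere.

    The generator is seeded with the term itself, so the same term always
    maps to the same vector and nothing has to be stored or recomputed.
    """
    rng = random.Random(f"{seed}:{term}")
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def reduce_document(tokens, dim, seed=0):
    """Project a bag of words onto `dim` dimensions by summing index vectors."""
    doc_vec = [0.0] * dim
    for term, tf in Counter(tokens).items():
        for i, x in enumerate(index_vector(term, dim, seed=seed)):
            if x:
                doc_vec[i] += tf * x  # assumed weighting: raw term frequency
    return doc_vec


# The paper reports 500 and 700 dimensions; 50 keeps the example small.
reduced = reduce_document(["football", "match", "football"], dim=50)
```

Because each show is reduced independently of the others, new shows can be added without rebuilding the space, which is the incremental property mentioned above. Prototype construction and cosine similarity then operate on the reduced vectors exactly as in the VSM case.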

TV-SHOW RETRIEVAL

For the TV-show retrieval task, we exploited only LR and RI, since they achieved the best performance for most classes in the classification task.

Random Indexing. As in the classification task, the vector space is reduced through the RI algorithm. Given a prototype vector built for each program type, the cosine similarity with all TV shows is computed in order to get the list of the best matching TV-show descriptions for a specific program type.

Logistic Regression. The probability that a TV show belongs to a specific program type is computed for the retrieval task as well. In this task, given a program type $p$, the TV shows are ranked by their probability of belonging to $p$ and are returned in a ranked list.

⁴ www.csie.ntu.edu.tw/~cjlin/liblinear/

The experimental evaluation has been carried out through k-fold cross-validation (k = 10) on a dataset composed of 133,579 TV shows broadcast on a set of 47 German-language channels. The textual descriptions are the input to the learning process and are represented as bags of words. Stemming and stop-word elimination are performed on the text. For the classification task we used Accuracy as the metric: it is calculated as the ratio between the number of TV shows correctly classified and the total number of TV shows classified. For the retrieval task we used Precision@n%: it is calculated as the ratio between the number of TV shows correctly classified and n% of the test set. VSM, LR, and RI (using different vector space dimensions) have been compared.
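The two metrics could be computed as in the sketch below. The definition of Precision@n% in the text is brief, so the sketch assumes one reading of it: the shows are ranked by their score for a program type, the top n% of the test set is retained, and precision is the fraction of retained shows that actually carry that program type. Function names and the toy data are illustrative.

```python
import math


def accuracy(true_labels, predicted_labels):
    """Fraction of shows whose predicted program type equals the true one."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def precision_at_percent(scores, true_labels, program_type, n_percent):
    """Precision over the top n% of shows ranked by score for program_type.

    `scores[i]` is the score (cosine similarity or LR probability) of show i
    for `program_type`. The cutoff rounds up and keeps at least one show.
    """
    cutoff = max(1, math.ceil(len(scores) * n_percent / 100))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = sum(true_labels[i] == program_type for i in ranked[:cutoff])
    return hits / cutoff


# Hypothetical scores of five test shows for the program type "sports".
scores = [0.9, 0.8, 0.3, 0.7, 0.1]
labels = ["sports", "sports", "news", "news", "sports"]

p_at_40 = precision_at_percent(scores, labels, "sports", 40)  # top 2 shows
p_at_60 = precision_at_percent(scores, labels, "sports", 60)  # top 3 shows
acc = accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"])
```

In this toy example precision drops as the cutoff grows, because the third-ranked show belongs to a different program type.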

Classification task. Figure 1 reports the accuracy values of VSM, LR and RI. The configurations that outperform the baseline (VSM) are in bold. For some classes the dimensionality reduction technique deteriorated the performance of the classifier. However, for most classes RI outperformed VSM, even though the reduction of the vector space dimension is considerable. Furthermore, the LR algorithm obtained the best accuracy. The best improvement achieved compared to the VSM model is almost 20%.

Retrieval task. In general, the different space dimensions for Random Indexing do not affect the accuracy of the retrieval model (see Figure 2). For this task too, LR achieved better results than RI. The accuracy of the model decreases as the size of the retrieved list increases. This was expected, because the less relevant shows for each program type are in the tail of the list.

Conclusions and Future Work

The best performing approach for the classification task was LR. Although this approach has already been shown to be effective for text classification in the literature, the results achieved in this specific scenario were not obvious, since TV shows have very short textual descriptions and only a few training examples were available for many classes. RI demonstrated good performance in TV-show classification for the classes with a small number of instances in the training set. In the retrieval task LR outperforms the other approaches as well. In the future we will work in a recommendation scenario in order to re-rank the retrieved list of TV shows according to the user preferences.

[1] Musto and Narducci. TV-show retrieval and classification. Technical report, Philips Research, High Tech Campus, Eindhoven, The Netherlands, July 2011.

[2] Pronk, Korst, Barbieri, and Proidl. Personal television channels: simply zapping through your PVR content. In Proceedings of the 1st International Workshop on Recommendation-based Industrial Applications, RecSys '09, 2009.

[3] Sahlgren. An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop, TKE 2005, 2005.

[4] Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Information Retrieval, 4:5-31, 2000.