TV-Show Retrieval and Classification Cataldo Musto1 , Fedelucio Narducci1 , Pasquale Lops1 , Giovanni Semeraro1 , Marco de Gemmis1 , Mauro Barbieri2 , Jan Korst2 , Verus Pronk2 , and Ramon Clout2 1 Department of Computer Science, University of Bari “A. Moro”, Italy, {cataldomusto,narducci,lops,semeraro,degemmis}@di.uniba.it 2 Philips Research, Eindhoven, The Netherlands, {mauro.barbieri,jan.korst,verus.pronk,ramon.clout}@philips.com Abstract. Recommender systems are popular tools to aid users in find- ing interesting and relevant TV shows and other digital video assets, based on implicitly defined user preferences. In this context, a common assumption is that user preferences can be specified by program types (such as documentary, sports), and that an asset can be labeled by one or more program types, thus allowing an initial coarse preselection of potentially interesting assets. Furthermore each asset has a short tex- tual description, which allows us to investigate whether it is possible to automatically label assets with program type labels. We compare the Vector Space Model (vsm) with more recent approaches to text classifi- cation, such as Logistic Regression (lr) and Random Indexing (ri) on a large collection of TV-show descriptions. The experimental results show that lr is the best approach, but ri outperforms vsm under particular conditions. Keywords: Vector Space Model, Random Indexing, Logistic Regression 1 Introduction Automatic TV recommendations have been explored extensively in the literature where most papers assume that the set of items for recommendations is of mod- erate size. Most approaches are not directly applicable to web video repositories (such as YouTube) whose item sets are orders of magnitude larger. To provide personalized recommendations for digital assets on the web and TV, a possible approach is to match the assets’ textual descriptions to personal preferences of users. It is common practice to classify TV shows by labeling them with one or more program type labels. It may also be assumed that user preferences can be coarsely expressed in terms of program types [2]. In this paper, we assume that each asset has a short textual description and we investigate (a) how well that description can be automatically mapped to a program type and (b) which ma- chine learning algorithms are best suited for the above mentioned classification task. To this end, we have extensively tested algorithms using a large collection of TV-show descriptions which calls for the adoption of simple and scalable re- trieval models. A text classification algorithm based on the Vector Space Model 2 TV-Show Retrieval and Classification (vsm) might be a good solution, provided that effective dimensionality reduc- tion techniques are integrated, such as Random Indexing (ri) [3]. As regards classification algorithms, we opted for Logistic Regression (lr), since it is gen- erally considered as accurate as Support Vector Machines, with the advantage of yielding a probability model [4]. This research is carried out in the context of a joint project with aprico Solutions3 , a software company and part of Philips Electronics. aprico Solu- tions develops video recommender and targeting technology, primarily for the broadcast and internet industries. Further details are available in [1]. 2 TV-show Classification and Retrieval The two problems we focus upon can be defined as follows: TV-show classification: given a program description s and a set P of program types, choose a program type p ∈ P that best matches the program description. Each TV show has exactly one label assigned to it. TV-show retrieval: given a set S of TV-show descriptions and a program type p ∈ P , return a ranked list of k TV-show descriptions from S that best match program type p. Three approaches for the TV-show classification and TV-show retrieval tasks have been investigated. We compare vsm with lr and ri. For both tasks, TV- show textual descriptions have been preprocessed for obtaining bag-of-words representations (bow). 2.1 TV-SHOW CLASSIFICATION Vector Space Model Given a set of documents (corpus), each document is rep- resented as a point in a n-dimensional vector space (n is the cardinality of the vo- cabulary). Formally, each document is represented as a vector d = (w1 , . . . , wn ) where wi is the tfidf score of the feature i. A vector space representation of each program type is obtained by summing the vectors of TV shows belonging to that program type. Thus, given a TV show s to be classified, its program type is given by the program type vector with the highest cosine similarity to s. vsm has some important limitations: it is not incremental and it does not model semantics. Random Indexing. ri is a scalable and incremental dimensionality reduction technique. It belongs to the class of distributional models, which state that the meaning of a word can be inferred by analyzing its use (distribution) within a corpus of textual data. Random Indexing for TV-show classification follows the same steps as for vsm: a prototype vector is built for each program type and the cosine similarity between a TV-show and each program type is computed. Unlike vsm, these steps are performed on the reduced vector space obtained as output of the ri algorithm (500, 700 dimensions). 3 www.aprico.tv TV-Show Retrieval and Classification 3 Logistic Regression. lr is a supervised learning algorithm based on a gen- eralized linear model. In this work we exploited the implementation provided in liblinear4 . Given a TV show, we compute the probability of each program type by exploiting the logistic functions learned for each class. The TV-show program type is determined by the highest probability. 2.2 TV-SHOW RETRIEVAL For the TV-show retrieval task, we exploited only lr and ri, since they achieved the best performance for most classes in the classification task. Random Indexing. As in the classification task, the vector space is reduced through the ri algorithm. Given a prototype vector built for each program type, the cosine similarity with all TV shows is computed in order to get the list of the best matching TV-show descriptions for a specific program type. Logistic Regression. The probability that a TV show belongs to a specific program type is computed for the retrieval task as well. In this task, given a program type p, the TV shows are ranked based on their probability to belong to p and are returned in a ranked list. Fig. 2. P @n% of ri, and lr for the re- trieval task. Fig. 1. Accuracy of vsm, ri, and lr for the classification task. 3 Experimental Evaluation The goal of the experimental evaluation is to measure the effectiveness of the vsm, ri, and lr models in the retrieval and classification tasks. The experiment 4 www.csie.ntu.edu.tw/~cjlin/liblinear/ 4 TV-Show Retrieval and Classification has been carried out through a k-fold cross validation (k =10), on a dataset com- posed of 133,579 TV shows broadcast from a set of 47 channels in the German language. The textual descriptions are the input to the learning process and are represented by bag of words. Stemming and stop-words elimination are per- formed on the text. For the classification task we used the Accuracy as metric: it is calculated as the ratio between the TV shows correctly classified and the total number of TV shows classified. For the retrieval task we used the Precision@n%: it is calculated as the ratio between the TV shows correctly classified and the n% of the Test Set. vsm, lr, and ri (using different vector space dimensions) have been compared. Classification task. Figure 1 reports accuracy values of vsm, lr and ri. The configurations that overcome the baseline (vsm) are in bold. For some classes the dimensionality reduction technique deteriorated the performance of the classi- fier. However for most classes, ri outperformed vsm, even though the reduction of the vector space dimension is considerable. Furthermore, the lr algorithm obtained the best accuracy. The best improvement achieved compared to the vsm model is almost 20%. Retrieval task. In general the different space dimensions for random indexing do not affect the retrieval accuracy of the retrieval model (see Figure 2). Also for this task lr achieved better results compared to ri. The accuracy of the model decreases when the size of the retrieved list increases. This was expected because less relevant shows for each program type are in the tail of the list. 4 Conclusions and Future Work The best performing approach for the classification task was lr. Despite the fact that this approach already showed to be effective in text classification in the literature, results achieved in this specific scenario were not obvious, since TV shows have very short textual descriptions and only few training examples were available for many classes. ri demonstrated a good performance in TV-show classification for the classes with a small number of instances in the training set. In the retrieval task lr outperforms the other approaches as well. In the future we will work in a recommendation scenario in order to re-rank the retrieved list of TV shows according to the user preferences. References 1. C. Musto and F. Narducci. Tv-show retrieval and classification. Technical report, Philips Research, High Tech Campus, Eindhoven, The Netherlands, July 2011. 2. V. Pronk, J. Korst, M. Barbieri, and A. Proidl. Personal television channels: simply zapping through your pvr content. In Proceedings of the 1st International Workshop on Recommendation-based Industrial Applications, RecSys ’09, 2009. 3. M. Sahlgren. An introduction to random indexing. In Methods and Applications of Semantic Indexing Workshop, TKE 2005, 2005. 4. T. Zhang and F. J. Oles. Text categorization based on regularized linear classifica- tion methods. Information Retrieval, 4:5–31, 2000.