TUD-MMC at MediaEval 2016: Context of Experience task

Bo Wang                          Cynthia C. S. Liem
Delft University of Technology   Delft University of Technology
Delft, The Netherlands           Delft, The Netherlands
b.wang-6@student.tudelft.nl      C.C.S.Liem@tudelft.nl

ABSTRACT
This paper presents a three-step framework to predict user assessment of the suitability of movies for an inflight viewing context. For this, we employed classifier stacking strategies. First, using the different modalities of the training data, twenty-one classifiers were trained together with a feature selection algorithm. Final predictions were then obtained by applying three classifier stacking strategies. Our results reveal that different stacking strategies lead to different evaluation results; a considerable improvement in F1 score is obtained when using the label stacking strategy.

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

1. INTRODUCTION
A substantial amount of research has been conducted on recommender systems that focus on user preference prediction. Here, taking contextual information into account can have a significant positive impact on the performance of recommender systems [1].
The MediaEval Context of Experience task focuses on a specific type of context: the viewing context of the user. The challenge considers predicting the multimedia content that users find most fitting to watch in a specific viewing condition; more specifically, while being on a plane.

2. DATASET DESCRIPTION AND INITIAL EXPERIMENTS
The dataset for the Context of Experience (CoE) task [5] contains metadata and pre-extracted features for 318 movies [6]. The features are multimodal and include textual, visual and audio features. The training set contains 95 labeled movies, each labeled as 0 (bad for airplane) or 1 (good for airplane).
A set of initial experiments has been conducted in order to evaluate the usefulness of the various modalities in the CoE dataset [6]. A rule-based PART classifier was employed to evaluate feature performance in terms of Precision, Recall and F1 score; the results can be found in Table 1.

Table 1: Results obtained by applying a rule-based PART classifier to the Right Inflight dataset.

Features used           Precision  Recall  F1
User rating             0.371      0.609   0.461
Visual                  0.447      0.476   0.458
Metadata                0.524      0.516   0.519
Metadata + user rating  0.581      0.600   0.583
Metadata + visual       0.584      0.600   0.586

3. MULTIMODAL CLASSIFIER STACKING
Ensemble learning combines different classifiers, usually obtaining much better generalization ability. This is particularly the case for weak learners, which can be defined as learning algorithms that by themselves perform only slightly better than random guessing, but can jointly be grouped into an algorithm with arbitrarily high accuracy [2]. Therefore, we were interested in taking a multimodal classifier stacking approach to the given problem, using a combination of multiple weak learners to 'boost' them into a strong learner.
The process can be separated into three stages: classifier selection, feature selection and classifier stacking.

3.1 Classifier Selection
First of all, we want to select base classifiers that will be useful candidates in a stacking approach. For this, we use the following classifier selection procedure (a code sketch is given at the end of this subsection):

1. Initialize a list of candidate classifiers. For each modality, we consider the following classifiers: k-nearest neighbor, nearest mean, decision tree, logistic regression, SVM, bagging, random forest, AdaBoost, gradient boosting, and naive Bayes. We do not apply parameter tuning, but take the default parameter values as offered by scikit-learn (http://scikit-learn.org/).

2. Perform 10-fold cross-validation on the classifiers. As input data, we use the training data set and its ground truth labels, per single modality. For the audio MFCC features, we set NaN values to 0, and calculate the average of each MFCC coefficient over all frames.

3. If Precision, Recall and F1 score are all greater than 0.5, keep the candidate classifier on the given modality as a base classifier for our stacking approach.

The selected base classifiers and their relevant modalities can be found in Table 2. It should be noted that the performance of bagging and random forest is not stable: bagging uses a different subset of instances in each run, and random forest uses different subsets of both instances and features in each run.
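The selection procedure can be summarized in a minimal scikit-learn sketch. This is an illustration only: it assumes the per-modality feature matrices have already been loaded as NumPy arrays (e.g. a dict such as {'metadata': ..., 'visual': ...}), and the function and variable names are ours, not part of the released task code.

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.naive_bayes import GaussianNB

# Candidate classifiers, all with scikit-learn default parameters (step 1).
CANDIDATES = {
    'k-nearest neighbor': KNeighborsClassifier(),
    'nearest mean': NearestCentroid(),
    'decision tree': DecisionTreeClassifier(),
    'logistic regression': LogisticRegression(),
    'SVM (Gaussian kernel)': SVC(kernel='rbf'),
    'bagging': BaggingClassifier(),
    'random forest': RandomForestClassifier(),
    'AdaBoost': AdaBoostClassifier(),
    'gradient boosting': GradientBoostingClassifier(),
    'naive Bayes': GaussianNB(),
}

def select_base_classifiers(modalities, y):
    """Keep every (classifier, modality) pair whose 10-fold
    cross-validated precision, recall and F1 all exceed 0.5
    (steps 2 and 3 of the procedure above)."""
    selected = []
    for modality, X in modalities.items():
        X = np.nan_to_num(X)   # NaN feature values are set to 0
        for name, clf in CANDIDATES.items():
            pred = cross_val_predict(clf, X, y, cv=10)
            scores = (precision_score(y, pred),
                      recall_score(y, pred),
                      f1_score(y, pred))
            if min(scores) > 0.5:
                selected.append((name, modality) + scores)
    return selected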
3.2 Feature Selection
For each classifier and corresponding modality, a better-performing subspace of features may optimize results further. Since we have multiple learners, we employed the Las Vegas Wrapper (LVW) [3] feature selection algorithm for feature subset selection. In each of n runs, LVW generates a random subset of features and evaluates the learner's error rate on it, finally selecting the best-performing feature subspace as output. In our case, we slightly modified LVW to optimize the F1 score, whereas the original Las Vegas Wrapper was developed to optimize accuracy.
For each base classifier, with the exception of the random forest classifier (as it already performs feature selection), we apply the LVW method, and achieve the performance measures listed in Table 2. A sketch of this modified wrapper follows Table 2.

Table 2: Base classifier performance on the multimodal dataset.

Classifier               Modality  Precision  Recall  F1
k-Nearest neighbor       metadata  0.607      0.654   0.630
Nearest mean classifier  metadata  0.603      0.579   0.591
Decision tree            metadata  0.538      0.591   0.563
Logistic regression      metadata  0.548      0.609   0.578
SVM (Gaussian Kernel)    metadata  0.501      0.672   0.574
Bagging                  metadata  0.604      0.662   0.631
Random Forest            metadata  0.559      0.593   0.576
AdaBoost                 metadata  0.511      0.563   0.536
Gradient Boosting Tree   metadata  0.544      0.596   0.569
Naive Bayes              textual   0.545      0.987   0.702
k-Nearest neighbor       textual   0.549      0.844   0.666
SVM (Gaussian Kernel)    textual   0.547      1.000   0.707
k-Nearest neighbor       visual    0.582      0.636   0.608
Decision tree            visual    0.521      0.550   0.535
Logistic regression      visual    0.616      0.600   0.608
SVM (Gaussian Kernel)    visual    0.511      0.670   0.580
Random Forest            visual    0.614      0.664   0.638
AdaBoost                 visual    0.601      0.717   0.654
Gradient Boosting Tree   visual    0.561      0.616   0.587
Logistic Regression      audio     0.507      0.597   0.546
Gradient Boosting Tree   audio     0.560      0.617   0.587
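The following is a minimal sketch of the F1-optimizing LVW variant described above, assuming binary 0/1 labels, a NumPy feature matrix, and a scikit-learn classifier. The number of iterations and the uniform subset-sampling scheme are illustrative assumptions; they are not fixed by the original algorithm description.

import numpy as np
from random import Random
from sklearn.base import clone
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

def lvw_select(clf, X, y, n_iter=100, seed=0):
    """Las Vegas Wrapper, modified to optimize F1 score rather than
    accuracy: repeatedly draw a random feature subset, evaluate the
    classifier on it, and keep the best-scoring subset."""
    rng = Random(seed)
    n_features = X.shape[1]

    def cv_f1(columns):
        pred = cross_val_predict(clone(clf), X[:, columns], y, cv=10)
        return f1_score(y, pred)

    best_subset = list(range(n_features))   # start from the full feature set
    best_f1 = cv_f1(best_subset)
    for _ in range(n_iter):
        size = rng.randint(1, n_features)
        subset = sorted(rng.sample(range(n_features), size))
        score = cv_f1(subset)
        if score > best_f1:
            best_f1, best_subset = score, subset
    return best_subset, best_f1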
3.3 Classifier Stacking
In previous research, classifier stacking (or meta-learning) has proven beneficial for predictive performance, by combining different learning systems which each have a different inductive bias (e.g. representation, search heuristics, search space) [4]. By combining separately learned concepts, meta-learning is expected to derive a higher-level learned model that can predict more accurately than any of the individual learners. In our work, we consider three types of stacking strategies (a sketch of the label stacking strategy follows this list):

1. Majority Voting: this is the simplest case, where we select classifiers and feature subspaces through the steps above, and assign final predicted labels through majority voting over the labels of the 21 classifiers.

2. Label Stacking: assume we have n instances and T base classifiers; we can then generate an n-by-T matrix consisting of the predictions (labels) given by each classifier. The label stacking strategy builds a second-level classifier on this label matrix, which returns the final prediction result.

3. Label-Feature Stacking: similar to label stacking, but the second-level classifier uses both the base-classifier predictions and the original features as training data to predict the output.
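As an illustration of strategy 2, the sketch below builds the n-by-T label matrix from out-of-fold predictions and fits a second-level learner on it. For simplicity it assumes all base classifiers operate on the same feature matrix; in our actual setup each base classifier receives its own modality and LVW-selected feature subspace. The choice of logistic regression as the second-level learner is likewise only illustrative.

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

def label_stacking(base_models, X_train, y_train, X_test, meta=None):
    """Label stacking: build an n-by-T matrix of base-classifier
    predictions and fit a second-level classifier on that matrix."""
    meta = meta if meta is not None else LogisticRegression()
    # Out-of-fold predictions on the training set avoid leaking the
    # training labels into the second-level features.
    Z_train = np.column_stack(
        [cross_val_predict(clone(m), X_train, y_train, cv=10)
         for m in base_models])
    Z_test = np.column_stack(
        [clone(m).fit(X_train, y_train).predict(X_test)
         for m in base_models])
    meta.fit(Z_train, y_train)
    return meta.predict(Z_test)

For label-feature stacking, the second-level training data would be np.column_stack([Z_train, X_train]) rather than Z_train alone.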
4. RESULTS
We considered all prediction results of the 21 selected base classifiers, and then applied the three different classifier stacking strategies, both in 10-fold cross-validation on the training data and on the test data. As the results for label stacking vs. label-feature stacking were comparable on the training data, we only consider voting vs. label stacking on the test data.
All obtained results, on the training (development) and test dataset, are given in Table 3. On the training data, we notice a significant improvement in terms of Precision, Recall and F1 score in comparison to the results obtained on individual modalities. The voting strategy yields the best precision score, but performs badly in terms of recall. In contrast, label stacking has higher recall and the highest F1 score.
Considering the results obtained on the test dataset, we can conclude that label stacking is more robust than the voting strategy. For the voting strategy, a significant decrease in precision can be found on the test set. This is because the tendency of majority voting (and Bayesian averaging) to over-fit derives from the likelihood's exponential sensitivity to random fluctuations in the sample, and increases with the number of models considered. Meanwhile, the label stacking strategy performs reasonably well on the test data.

Table 3: Classifier stacking results.

Stacking Strategy            Precision  Recall  F1
Voting (cv)                  0.94       0.57    0.71
Label Stacking (cv)          0.72       0.86    0.78
Label-Feature Stacking (cv)  0.71       0.79    0.75
Voting (test)                0.62       0.80    0.70
Label Stacking (test)        0.62       0.90    0.73

5. CONCLUSIONS
In our entry for the MediaEval CoE task, we aimed to improve classifier performance through a combination of classifier selection, feature selection and classifier stacking. The results reveal that employing an ensemble approach can considerably increase classification performance, and is suitable for treating the multimodal Right Inflight dataset.
A larger diversity of base classifiers is able to produce a more robust ensemble classifier. On the other hand, blending multiple classifiers may also have drawbacks, e.g. computational costs and difficulty of traceable interpretation.
We expect that better results for our method can still be obtained through parameter tuning, and by applying more robust classifier stacking methods, such as feature-weighted linear stacking [7].

6. REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Context-aware recommender systems. In Recommender Systems Handbook, pages 217-253. Springer, 2011.
[2] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119-139, 1997.
[3] H. Liu and R. Setiono. Feature selection and classification: a probabilistic wrapper approach. In Proceedings of the 9th International Conference on Industrial and Engineering Applications of AI and ES, pages 419-424, 1997.
[4] A. Prodromidis, P. Chan, and S. Stolfo. Meta-learning in distributed data mining systems: Issues and approaches. In Advances in Distributed and Parallel Knowledge Discovery, pages 81-114. MIT/AAAI Press, 2000.
[5] M. Riegler, C. Spampinato, M. Larson, P. Halvorsen, and C. Griwodz. The MediaEval 2016 Context of Experience task: Recommending videos suiting a watching situation. In Proceedings of the MediaEval 2016 Workshop, 2016.
[6] M. Riegler, M. Larson, C. Spampinato, P. Halvorsen, M. Lux, J. Markussen, K. Pogorelov, C. Griwodz, and H. Stensland. Right inflight? A dataset for exploring the automatic prediction of movies suitable for a watching situation. In Proceedings of the 7th International Conference on Multimedia Systems, pages 45:1-45:6. ACM, 2016.
[7] J. Sill, G. Takacs, L. Mackey, and D. Lin. Feature-weighted linear stacking. arXiv:0911.0460, 2009.