  TUD-MMC at MediaEval 2016: Context of Experience task

Bo Wang                                    Cynthia C. S. Liem
Delft University of Technology             Delft University of Technology
Delft, The Netherlands                     Delft, The Netherlands
b.wang-6@student.tudelft.nl                C.C.S.Liem@tudelft.nl

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.



ABSTRACT
This paper presents a three-step framework to predict user assessment of the suitability of movies for an inflight viewing context. For this, we employed classifier stacking strategies. First, using the different modalities of the training data, twenty-one classifiers were trained together with a feature selection algorithm. Final predictions were then obtained by applying three classifier stacking strategies. Our results reveal that different stacking strategies lead to different evaluation results; a considerable improvement in F1 score is obtained when using the label stacking strategy.
1.  INTRODUCTION
A substantial amount of research has been conducted on recommender systems that focus on user preference prediction. Here, taking contextual information into account can have a significant positive impact on the performance of recommender systems [1].
The MediaEval Context of Experience task focuses on a specific type of context: the viewing context of the user. The challenge considers predicting the multimedia content that users find most fitting to watch in a specific viewing condition, more specifically, while being on a plane.
2.  DATASET DESCRIPTION AND INITIAL EXPERIMENTS
The dataset for the Context of Experience (CoE) task [5] contains metadata and pre-extracted features for 318 movies [6]. Features are multimodal and include textual, visual and audio features. The training set contains 95 labeled movies, which are labeled as 0 (bad for airplane) or 1 (good for airplane).
A set of initial experiments was conducted in order to evaluate the usefulness of the various modalities in the CoE dataset [6]. A rule-based PART classifier was employed to evaluate feature performance in terms of Precision, Recall and F1 score; the results can be found in Table 1.

    Features used            Precision   Recall     F1
    User rating                0.371      0.609    0.461
    Visual                     0.447      0.476    0.458
    Metadata                   0.524      0.516    0.519
    Metadata + user rating     0.581      0.600    0.583
    Metadata + visual          0.584      0.600    0.586

Table 1: Results obtained by applying a rule-based PART classifier to the Right Inflight dataset.

3.  MULTIMODAL CLASSIFIER STACKING
Ensemble learning uses a combination of different classifiers, usually obtaining much better generalization ability. This is particularly the case for weak learners, which can be defined as learning algorithms that by themselves perform only slightly better than random guessing, but can be jointly grouped into an algorithm with arbitrarily high accuracy [2]. Therefore, we were interested in taking a multimodal classifier stacking approach to the given problem, using a combination of multiple weak learners to 'boost' them into a strong learner.
The process can be separated into three stages: classifier selection, feature selection and classifier stacking.

3.1  Classifier Selection
First of all, we want to select base classifiers that will be useful candidates in a stacking approach. For this, we use the following classifier selection procedure (a sketch follows the list):

   1. Initialize a list of candidate classifiers. For each modality, we consider the following classifiers: k-nearest neighbor, nearest mean, decision tree, logistic regression, SVM, bagging, random forest, AdaBoost, gradient boosting, and naive Bayes. We do not apply parameter tuning, but take the default parameter values as offered by scikit-learn (http://scikit-learn.org/).

   2. Perform 10-fold cross-validation on the classifiers. As input data, we use the training data set and its ground truth labels, per single modality. For the audio MFCC features, we set NaN values to 0, and calculate the average of each MFCC coefficient over all frames.

   3. If Precision, Recall and F1 score are all above 0.5, keep the candidate classifier on the given modality as a base classifier for our stacking approach.
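As an illustration, the selection procedure can be sketched with scikit-learn as follows. This is a minimal sketch, not our actual implementation: the dictionary X_by_modality (mapping each modality name to its feature matrix) and the function name select_base_classifiers are illustrative.

import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.naive_bayes import GaussianNB

# The ten candidate classifiers, all with scikit-learn default parameters.
CANDIDATES = [KNeighborsClassifier(), NearestCentroid(), DecisionTreeClassifier(),
              LogisticRegression(), SVC(), BaggingClassifier(),
              RandomForestClassifier(), AdaBoostClassifier(),
              GradientBoostingClassifier(), GaussianNB()]

def select_base_classifiers(X_by_modality, y):
    """Keep (modality, classifier) pairs whose 10-fold cross-validated
    Precision, Recall and F1 score on the training set all exceed 0.5."""
    selected = []
    for modality, X in X_by_modality.items():
        X = np.nan_to_num(X)  # e.g. audio MFCC features: NaN values become 0
        for clf in CANDIDATES:
            y_pred = cross_val_predict(clf, X, y, cv=10)
            scores = (precision_score(y, y_pred), recall_score(y, y_pred),
                      f1_score(y, y_pred))
            if min(scores) > 0.5:
                selected.append((modality, clf))
    return selected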
The selected base classifiers and their relevant modalities can be found in Table 2. It should be noted that the performance of bagging and random forest is not stable across runs: bagging draws a different subset of the training instances in each run, and random forest additionally draws different subsets of the features.
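As a side note, in scikit-learn this run-to-run variation can be made reproducible by fixing the random_state parameter of the estimators concerned, e.g.:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Fixing random_state makes the instance/feature subsampling deterministic,
# so repeated runs yield identical scores.
bagging = BaggingClassifier(random_state=42)
forest = RandomForestClassifier(random_state=42)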
    Classifier                Modality   Precision   Recall     F1
    k-Nearest neighbor        metadata     0.607      0.654    0.630
    Nearest mean classifier   metadata     0.603      0.579    0.591
    Decision tree             metadata     0.538      0.591    0.563
    Logistic regression       metadata     0.548      0.609    0.578
    SVM (Gaussian kernel)     metadata     0.501      0.672    0.574
    Bagging                   metadata     0.604      0.662    0.631
    Random forest             metadata     0.559      0.593    0.576
    AdaBoost                  metadata     0.511      0.563    0.536
    Gradient boosting tree    metadata     0.544      0.596    0.569
    Naive Bayes               textual      0.545      0.987    0.702
    k-Nearest neighbor        textual      0.549      0.844    0.666
    SVM (Gaussian kernel)     textual      0.547      1.000    0.707
    k-Nearest neighbor        visual       0.582      0.636    0.608
    Decision tree             visual       0.521      0.550    0.535
    Logistic regression       visual       0.616      0.600    0.608
    SVM (Gaussian kernel)     visual       0.511      0.670    0.580
    Random forest             visual       0.614      0.664    0.638
    AdaBoost                  visual       0.601      0.717    0.654
    Gradient boosting tree    visual       0.561      0.616    0.587
    Logistic regression       audio        0.507      0.597    0.546
    Gradient boosting tree    audio        0.560      0.617    0.587

Table 2: Base classifier performance on the multimodal dataset.

3.2  Feature Selection
For each classifier and corresponding modality, a better-performing subspace of features may optimize results further. Since we have multiple learners, we employed the Las Vegas Wrapper (LVW) [3] feature selection algorithm for feature subset selection. In each of its n runs, LVW generates a random subset of the features and evaluates the learner's error rate on it, finally selecting the best-performing feature subspace as output. In our case, we slightly modified LVW to optimize the F1 score, whereas the original Las Vegas Wrapper was developed to optimize accuracy.
For each base classifier, with the exception of the random forest classifier (as it already performs feature selection), we apply the LVW method, achieving the performance measures listed in Table 2.
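A minimal sketch of the modified wrapper is given below, assuming a scikit-learn estimator and a NumPy feature matrix; the iteration count n_iter and the name lvw_select are illustrative choices, not taken from [3] or from our code.

import random
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score

def lvw_select(clf, X, y, n_iter=100, seed=0):
    """Las Vegas Wrapper, modified to optimize F1 score instead of the
    accuracy targeted by the original algorithm."""
    rng = random.Random(seed)
    n_features = X.shape[1]
    best_subset, best_score = list(range(n_features)), -1.0
    for _ in range(n_iter):
        # Draw a random feature subset of random size ...
        size = rng.randint(1, n_features)
        subset = sorted(rng.sample(range(n_features), size))
        # ... and evaluate the learner on that subspace.
        y_pred = cross_val_predict(clf, X[:, subset], y, cv=10)
        score = f1_score(y, y_pred)
        if score > best_score:  # keep the best-performing subspace so far
            best_subset, best_score = subset, score
    return best_subset, best_score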
3.3  Classifier Stacking
In previous research, classifier stacking (or meta-learning) has proved beneficial for predictive performance by combining different learning systems which each have a different inductive bias (e.g. representation, search heuristics, search space) [4]. By combining separately learned concepts, meta-learning is expected to derive a higher-level learned model that can predict more accurately than any of the individual learners. In our work, we consider three types of stacking strategies (a sketch follows the list):
   1. Majority Voting: this is the simplest case, where we select classifiers and feature subspaces through the steps above, and assign final predicted labels through majority voting on the labels of the 21 classifiers.

   2. Label Stacking: assume we have n instances and T base classifiers; then we can generate an n-by-T matrix consisting of the predictions (labels) given by each classifier. The label stacking strategy builds a second-level classifier on this label matrix and returns its output as the final prediction.

   3. Label-Feature Stacking: similar to label stacking, but the second-level classifier uses both the base-classifier predictions and the original features as training data to predict the output.
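As an illustration, label stacking can be sketched as follows. The choice of logistic regression as the second-level classifier and the out-of-fold construction of the training label matrix are assumptions of this sketch; they are not fixed by the strategy description above.

import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def label_stacking(base_classifiers, X_train_by_mod, y_train, X_test_by_mod):
    """Build the n-by-T label matrix from the T base classifiers and train a
    second-level classifier on it."""
    train_cols, test_cols = [], []
    for modality, clf in base_classifiers:
        X_tr, X_te = X_train_by_mod[modality], X_test_by_mod[modality]
        # Out-of-fold predictions avoid leaking training labels into the
        # meta-features (an assumption of this sketch).
        train_cols.append(cross_val_predict(clf, X_tr, y_train, cv=10))
        test_cols.append(clone(clf).fit(X_tr, y_train).predict(X_te))
    L_train = np.column_stack(train_cols)  # n x T matrix of predicted labels
    L_test = np.column_stack(test_cols)
    meta = LogisticRegression().fit(L_train, y_train)  # second-level classifier
    return meta.predict(L_test)

def majority_vote(L_test):
    """Strategy 1, for comparison: the majority label over the T classifiers."""
    return (L_test.mean(axis=1) >= 0.5).astype(int)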
4.  RESULTS
We considered the prediction results of all 21 selected base classifiers, and then applied the three different classifier stacking strategies, both in 10-fold cross-validation on the development data and on the test data. As the results for label stacking vs. label-feature stacking were comparable on the training data, we only consider voting vs. label stacking on the test data.
All obtained results, on the training (development) and test dataset, are given in Table 3. On the training data, we notice a significant improvement in terms of Precision, Recall and F1 score in comparison to the results obtained on individual modalities. The voting strategy results in the best precision score, but performs poorly in terms of recall. On the contrary, label stacking has higher recall and the highest F1 score.
Considering the results obtained on the test dataset, we can conclude that label stacking is more robust than the voting strategy. For the voting strategy, a significant decrease in precision can be observed on the test set: the tendency of majority voting (and Bayesian averaging) to over-fit derives from the likelihood's exponential sensitivity to random fluctuations in the sample, and increases with the number of models considered. Meanwhile, the label stacking strategy performs reasonably well on the test data.

    Stacking Strategy               Precision   Recall    F1
    Voting (cv)                       0.94       0.57    0.71
    Label Stacking (cv)               0.72       0.86    0.78
    Label-Feature Stacking (cv)       0.71       0.79    0.75
    Voting (test)                     0.62       0.80    0.70
    Label Stacking (test)             0.62       0.90    0.73

Table 3: Classifier stacking results on the development set (cv) and the test set.

5.  CONCLUSIONS
In our entry for the MediaEval CoE task, we aimed to improve classifier performance by a combination of classifier selection, feature selection and classifier stacking. Results reveal that employing an ensemble approach can considerably increase classification performance, and is suitable for treating the multimodal Right Inflight dataset.
A larger diversity of base classifiers is able to produce a more robust ensemble classifier. On the other hand, blending multiple classifiers may also have some drawbacks, e.g. computational cost and difficulty in traceable interpretation.
We expect that better results for our method can still be obtained through parameter tuning, and by applying more robust classifier stacking methods, such as feature-weighted linear stacking [7].

6.  REFERENCES
[1] G. Adomavicius and A. Tuzhilin. Context-aware recommender systems. In Recommender Systems Handbook, pages 217–253. Springer, 2011.
[2] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139, 1997.
[3] H. Liu and R. Setiono. Feature selection and classification—a probabilistic wrapper approach. In Proceedings of the 9th International Conference on Industrial and Engineering Applications of AI and ES, pages 419–424, 1997.
[4] A. Prodromidis, P. Chan, and S. Stolfo. Meta-learning in distributed data mining systems: Issues and approaches. In Advances in Distributed and Parallel Knowledge Discovery, pages 81–114. MIT/AAAI Press, 2000.
[5] M. Riegler, C. Spampinato, M. Larson, P. Halvorsen, and C. Griwodz. The MediaEval 2016 Context of Experience task: Recommending videos suiting a watching situation. In Proceedings of the MediaEval 2016 Workshop, 2016.
[6] M. Riegler, M. Larson, C. Spampinato, P. Halvorsen, M. Lux, J. Markussen, K. Pogorelov, C. Griwodz, and H. Stensland. Right inflight? A dataset for exploring the automatic prediction of movies suitable for a watching situation. In Proceedings of the 7th International Conference on Multimedia Systems, pages 45:1–45:6. ACM, 2016.
[7] J. Sill, G. Takacs, L. Mackey, and D. Lin. Feature-weighted linear stacking. arXiv:0911.0460, 2009.