A Regression Approach to Movie Rating Prediction using Multimedia Content and Metadata

Hossein A. Rahmani, University of Zanjan, srahmani@znu.ac.ir
Yashar Deldjoo, Politecnico di Bari, deldjooy@acm.org
Markus Schedl, Johannes Kepler University, markus.schedl@jku.at

ABSTRACT
This paper presents the submission of the team MASlab-ZNU to the MMRecSys movie recommendation task, as part of MediaEval 2019. The task involved predicting average movie ratings, the standard deviation of ratings, and the number of ratings by using audio and visual features extracted from trailers and the associated metadata. In the proposed work, we model the rating prediction problem as a regression problem and employ different learning models for the prediction task, including ridge regression (RR), support vector regression (SVR), shallow neural network (SNN) and deep neural network (DNN). The results of a fairly large number of experiments on various models and features indicate that the combination of DNN and tag features produces the best results for the prediction of avgRating and stdRating, while for numRating (popularity) it is the combination of RR and tag features that outperforms the other competitors by a large margin.

1 INTRODUCTION AND RELATED WORK
The global revenue obtained from the media and entertainment (M&E) market in 2017 was approximately 2 trillion US dollars. The four main verticals of the industry in the US include: film (40%), music (6%), book publishing (12%) and video games (8%) [1, 2]. The movie industry not only has a large cultural and sociological impact on people but also occupies an important part of the business market in the M&E industry. Producing a new movie means that the company is betting on this movie's success. Being able to predict whether a movie will be successful or not is therefore crucial and requires machine learning techniques. The results of the predictions can be used by producers and investors to decide whether or not to produce similar movies.

The paper at hand describes the solution by the team MASlab-ZNU for the 2019 Multimedia for Recommender System Task: MovieREC and NewsREEL at MediaEval [6], the movie recommendation subtask. The task involved using audio and visual features extracted from trailers of the corresponding movies, coming from the MMTF-14K dataset [4], together with metadata to predict three scores: avgRating (representing user appreciation and dis-appreciation of the content), stdRating (characterizing agreement and disagreement between user opinions/ratings), and numRating (reflecting popularity of movies). Audio and visual features have been used to solve various tasks in movie recommendation, see e.g. [3, 5].

In the context of movie popularity prediction, Szabo et al. [10] predict the future popularity of a video based on the number of previous views on YouTube using predictive models based on linear regression. Pinto et al. [9] extend the work by Szabo et al. by proposing a multivariate linear model and sampling the number of views at regular time slots. Recently, Moghaddam et al. [8] address the problem of movie popularity prediction using visual features of movie trailers.

2 EXPERIMENTS
This section presents the experiments carried out to address the prediction task. For performance evaluation, we randomly split the original train set into 2 non-overlapping subsets, using 80% of the data for training and 20% for validation. We refer to these datasets as trainSet and validSet throughout the paper. In a preprocessing step, we normalize all features in the dataset using min-max normalization. We perform a hyperparameter search and report all results under the best setting. We model the task in question as regression and use the following regression models:

• Linear model using Ridge Regression (RR): We use linear regression as a simple yet standard approach to model the relation between dependent (prediction scores) and independent variables (features) in a linear fashion [7]. The model is given by $y = \theta^T x$, where $x$ and $y$ represent feature vectors and scores, respectively, and $\theta$ contains the linear model coefficients estimated by RR: $\min_{\theta} \frac{1}{2} \lVert y - \theta^T x \rVert_2^2 + \lambda \lVert \theta \rVert_2^2$, where $\ell_2$ regularization is applied to avoid overfitting of the coefficients.
• Support Vector Regression (SVR): SVR is the regression variant of the Support Vector Machine (SVM). SVR makes use of a nonlinear transformation function to map the input features to a high-dimensional feature space. The prediction is given by $y = \sum_{i=1}^{n} (a_i - a_i^*) K(x_i, x) + b$, where $K(x, x') = \exp(-\lVert x - x' \rVert^2)$ is the Radial Basis Function (RBF) kernel, $b$ is the intercept, and $a_i$ and $a_i^*$ are Lagrange multipliers.
• Shallow Neural Network (SNN): To model the relationship between features in a non-linear way, we also apply a neural network to predict the scores. Here, we consider a simple (shallow) neural network model. The SNN has a hidden layer with 24 neurons and the Rectified Linear Unit (ReLU) as activation function. For the output layer, we use the Sigmoid activation function.
• Deep Neural Network (DNN): This method is similar to the SNN, but it has more hidden layers to capture deeper relations between features. Our deep model has 3 hidden layers: the first layer has 128 neurons with ReLU as activation function, the second layer has 64 neurons and uses a Sigmoid activation function, and the third one has 32 neurons and uses ReLU.
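
The preprocessing and the ridge regression model from the list above can be illustrated with a minimal, dependency-free Python sketch. It is not the implementation used for the submitted runs: all function names, the synthetic data, the appended constant bias feature, and the value of lambda are assumptions made for illustration. The closed-form solution follows from setting the gradient of the RR objective to zero, which gives $(X^T X + 2\lambda I)\theta = X^T y$.

```python
import math
import random


def min_max_normalize(vectors):
    """Scale every feature dimension to [0, 1]; constant dimensions map to 0."""
    dims = range(len(vectors[0]))
    lows = [min(v[d] for v in vectors) for d in dims]
    spans = [max(v[d] for v in vectors) - lows[d] for d in dims]
    return [[(x - lo) / sp if sp > 0 else 0.0 for x, lo, sp in zip(v, lows, spans)]
            for v in vectors]


def train_valid_split(rows, valid_fraction=0.2, seed=0):
    """Randomly split rows into two non-overlapping subsets (80%/20% by default)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    valid_idx = set(indices[:round(valid_fraction * len(rows))])
    train = [row for i, row in enumerate(rows) if i not in valid_idx]
    valid = [row for i, row in enumerate(rows) if i in valid_idx]
    return train, valid


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ridge(features, targets, lam):
    """Minimise 1/2 * sum (y - theta^T x)^2 + lam * ||theta||^2 in closed form."""
    d = len(features[0])
    gram = [[sum(x[i] * x[j] for x in features) for j in range(d)] for i in range(d)]
    for i in range(d):
        gram[i][i] += 2.0 * lam  # the 1/2 sits on the data term only, hence 2 * lam
    rhs = [sum(x[i] * y for x, y in zip(features, targets)) for i in range(d)]
    return solve_linear_system(gram, rhs)


def predict(theta, x):
    """Linear model y = theta^T x."""
    return sum(t * xi for t, xi in zip(theta, x))


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


# Synthetic stand-in data (assumption): 3 raw features, scores roughly in [0, 1].
rng = random.Random(42)
raw = [[rng.uniform(0, 10), rng.uniform(-5, 5), rng.uniform(100, 200)] for _ in range(200)]
scaled = min_max_normalize(raw)
scores = [0.2 + 0.5 * x[0] + 0.1 * x[1] + rng.gauss(0, 0.02) for x in scaled]
rows = [(x + [1.0], y) for x, y in zip(scaled, scores)]  # constant 1.0 acts as bias

train_set, valid_set = train_valid_split(rows)
theta = fit_ridge([x for x, _ in train_set], [y for _, y in train_set], lam=0.01)
valid_rmse = rmse([y for _, y in valid_set], [predict(theta, x) for x, _ in valid_set])
print(f"validation RMSE: {valid_rmse:.4f}")
```

The sketch normalizes the full feature matrix before splitting, which follows the wording of the text literally; fitting the normalization on trainSet only would be an equally plausible reading.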
Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'19, 27-29 October 2019, Sophia Antipolis, France

3 RESULTS AND ANALYSIS
The regression results using the proposed approaches are presented in Table 1. Regarding the comparison of learning models, we can see

Table 1: Results of regression in terms of RMSE, calculated on the validation set. The first and second best result for each score are shown in bold and italic respectively.

                      Metadata          Audio             Visual
Score      Regressor  Genre    Tag      BLF      I-Vec    AVF      Deep     Average
avgRating  RR         0.1401   0.1440   0.1495   0.1482   0.1499   0.1520   0.1472
           SVR        0.1466   0.1462   0.1518   0.1588   0.1523   0.1523   0.1513
           SNN        0.1396   0.1457   0.1492   0.1487   0.1496   0.1483   0.1468
           DNN        0.1409   0.1370   0.1491   0.1509   0.1501   0.1509   0.1464
stdRating  RR         0.1410   0.1445   0.1438   0.1421   0.1430   0.1472   0.1436
           SVR        0.1487   0.1396   0.1476   0.1464   0.1495   0.1495   0.1468
           SNN        0.1411   0.1433   0.1429   0.1420   0.1435   0.1423   0.1425
           DNN        0.1411   0.1356   0.1429   0.1429   0.1428   0.1429   0.1413
numRating  RR         0.0515   0.0232   0.0518   0.0523   0.0524   0.0528   0.0473
           SVR        0.1892   0.1142   0.1752   0.1656   0.1954   0.1952   0.1461
           SNN        0.0533   0.0272   0.0524   0.0526   0.0525   0.0523   0.0483
           DNN        0.0529   0.0529   0.0526   0.0527   0.0527   0.0528   0.0527
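
As a small worked example of how the RMSE values in Table 1 can be compared, the following Python sketch ranks the features for one row of the table (RR on numRating) and computes the relative RMSE reduction of the best feature over the second best. The helper name relative_reduction and the dictionary layout are assumptions made for illustration; the numbers are copied from the table.

```python
def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers the error relative to `baseline`."""
    return (baseline - improved) / baseline


# RMSE of ridge regression (RR) for numRating per feature, copied from Table 1.
rr_num_rating = {
    "Genre": 0.0515, "Tag": 0.0232, "BLF": 0.0518,
    "I-Vec": 0.0523, "AVF": 0.0524, "Deep": 0.0528,
}

ranked = sorted(rr_num_rating, key=rr_num_rating.get)  # lowest RMSE first
best, runner_up = ranked[0], ranked[1]
reduction = relative_reduction(rr_num_rating[runner_up], rr_num_rating[best])
print(f"{best} vs. {runner_up}: {reduction:.1%} lower RMSE")
# prints "Tag vs. Genre: 55.0% lower RMSE"
```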


that DNN and RR are the best predictors, in most cases generating the best performance for each feature, while SVR is the worst. The final submitted runs are selected based on the ones performing the best on the validSet, which are highlighted in bold in Table 1.

Predicting average ratings: As can be seen in Table 1, the performance of all audio and visual features is closely similar. These results are also close to the performance of the genre descriptor, e.g. 0.1487 vs. 0.1466. However, it can be noted that the tag features are the best features for predicting the average ratings. These results show that user-generated tags assigned to movies contain semantic information that is well correlated with the ratings given to movies by users. One can also observe that the DNN model is the best model for predicting the average ratings, though it is only marginally better than the SNN model.

Predicting standard deviation of ratings: The results of predicting the standard deviation of ratings show that the audio-visual features and genre metadata yield very similar results. Again, the best feature for predicting the standard deviation of ratings is the tag metadata, used with the DNN learning model. Average results indicate that the DNN model is the best learning model for predicting the standard deviation of ratings.

Predicting number of ratings: As for predicting the number of ratings, it can be seen that, except for the tag feature with the best performance, the remaining audio-visual features and genre metadata yield very similar results. For the tag feature, we observe substantially better performance than for all other features; it reduces the RMSE by about 55% compared to the second-best feature, genre (0.0232 vs. 0.0515). More interestingly, for predicting the number of ratings, the best model is the simple RR learning model, followed by SNN, but not DNN.

4 CONCLUSION
This paper reports the method used by team MASlab-ZNU for the Multimedia for Recommender Systems task at MediaEval 2019 [6]. Results of experiments using different regression approaches are promising and show the efficacy of audio and visual content in comparison with genre metadata, but overall it is the tag feature that provides the best prediction quality for all three predicted scores.

REFERENCES
[1] 2018. 2017 Top Markets Report Media and Entertainment. https://www.trade.gov/topmarkets/pdf/Top%20Markets%20Media%20and%20Entertinment%202017.pdf. (2018). Accessed: 2018-12-27.
[2] 2018. Media and Entertainment Industry Overview. https://investmentbank.com/media-and-entertainment-industry-overview/. (2018). Accessed: 2018-12-27.
[3] Yashar Deldjoo, Mihai Gabriel Constantin, Hamid Eghbal-Zadeh, Bogdan Ionescu, Markus Schedl, and Paolo Cremonesi. 2018. Audio-visual encoding of multimedia content for enhancing movie recommendations. In Proc. of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver, BC, Canada, October 2-7, 2018. 455–459.
[4] Yashar Deldjoo, Mihai Gabriel Constantin, Bogdan Ionescu, Markus Schedl, and Paolo Cremonesi. 2018. MMTF-14K: a multifaceted movie trailer feature dataset for recommendation and retrieval. In Proceedings of the 9th ACM Multimedia Systems Conference, MMSys 2018, Amsterdam, The Netherlands, June 12-15, 2018. 450–455.
[5] Yashar Deldjoo, Maurizio Ferrari Dacrema, Mihai Gabriel Constantin, Hamid Eghbal-zadeh, Stefano Cereda, Markus Schedl, Bogdan Ionescu, and Paolo Cremonesi. 2019. Movie genome: alleviating new item cold start in movie recommendation. User Model. User-Adapt. Interact. 29, 2 (2019), 291–343.
[6] Yashar Deldjoo, Benjamin Kille, Markus Schedl, Andreas Lommatzsch, and Jialie Shen. 2019. The 2019 Multimedia for Recommender System Task: MovieREC and NewsREEL at MediaEval. In Working Notes Proceedings of the MediaEval 2019 Workshop, Sophia Antipolis, France, 27-29 October 2019.
[7] Jinna Lv, Wu Liu, Meng Zhang, He Gong, Bin Wu, and Huadong Ma. 2017. Multi-feature fusion for predicting social media popularity. In Proceedings of the 25th ACM international conference on Multimedia. ACM, 1883–1888.
[8] Farshad B Moghaddam, Mehdi Elahi, Reza Hosseini, Christoph Trattner, and Marko Tkalcic. 2019. Predicting Movie Popularity and Ratings with Visual Features. In 14th International Workshop On Semantic And Social Media Adaptation And Personalization.
[9] Henrique Pinto, Jussara M Almeida, and Marcos A Gonçalves. 2013. Using early view patterns to predict the popularity of youtube videos. In Proceedings of the sixth ACM international conference on Web search and data mining. ACM, 365–374.
[10] Gabor Szabo and Bernardo A Huberman. 2010. Predicting the Popularity of Online Content. Commun. ACM 53, 8 (2010), 80–88.