<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multi-model Estimators and Ensemble-based Regressors for Predicting Video Memorability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>R Gokul Prakash</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jayaraman Bhuvana</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eeswara Anvesh Chodisetty</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arjun Mukesh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>T T Mirnalinee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Sri Sivasubramaniya Nadar College of Engineering</institution>
          ,
          <addr-line>Chennai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The need to organize the multimedia prevalent in our day-to-day activities brings several aspects of video importance into consideration, such as video aesthetics and interestingness. Memorability-based organization of media is highly effective where media must be retained in memory for a long time, as in advertising or educational content creation. With this target group in mind, we pursued an ensemble-based approach to the task of predicting media memorability, as part of the Benchmarking Initiative for Multimedia Evaluation's list of tasks for the MediaEval 2022 Workshop. Pre-defined ensemble models are versatile enough to incorporate other pre-defined regressors and ensemble models as base estimators and build on the knowledge accumulated by them. Video-level features such as C3D were chosen, and the regressors and ensemblers were trained on these features with parameter optimization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The task of assigning a score for video memorability can be formulated as a regression problem.
In recent years, there have been numerous advances in the study of regression and classification
on video in academic literature. The value of combining features from different modalities has
been thoroughly demonstrated in numerous earlier works [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The top-performing models
for the Predicting Media Memorability challenge’s 2019 and 2020 rounds used ensemble models.
      </p>
      <p>
        Building on the previous editions, we approached this task as an optimisation
challenge. Seminal papers on ensemble learning [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] helped us adapt the method to the given
dataset without overfitting while producing good scores.
      </p>
      <p>The modalities and traits that are most predictive of memorability remain unknown. In light
of this, we decided to use ensemble-based methods in pursuit of better predictions.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>
        In this work we measure video memorability using 3D Convolution
(C3D) features [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which are extracted from the video as a whole rather than from
a few particular frames. The pre-extracted C3D features provided
by the organizers are fed to our models. C3D was one of the first approaches to learning generic
representations from videos. Its homogeneous architecture is built from small 3x3x3 convolution
kernels. It produces a 4096-dimensional feature vector from a video clip after being
trained on a generic action recognition dataset. The seven models chosen were Linear
Regression [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], Logistic Regression, Decision Tree, Random Forest [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Bayesian Regression [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
Support Vector Regression and K-Neighbours Regression [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The ensemble methods
employed include Gradient Boosting Regressor [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], AdaBoost Regressor, Voting Regressor [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
XGBoost Regressor [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and Stacking Regressor [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The workflow of the proposed
video memorability approach is shown in Figure 1.
      </p>
      <p>The models were trained using an 80-20 training-validation split based on Video ID. To test
the effectiveness of the various models, we ran each model individually on the dataset and
selected those that gave the best metrics. The metrics used were Mean Absolute
Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE) and the Spearman rank
correlation coefficient. Ensemble learning tends to perform better when the models used are
diverse, and this was a further factor in our choice of models.</p>
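      <p>As a minimal sketch of this evaluation setup, the Python snippet below splits samples 80/20 by video ID and computes MAE, MSE, RMSE and Spearman's rank correlation using only the standard library. The function names (split_by_video_id, regression_report), the random seed and the toy values are illustrative assumptions, not the code used for the submitted runs.</p>
      <preformat>
```python
import math
import random


def split_by_video_id(video_ids, val_fraction=0.2, seed=0):
    """Return (train_ids, val_ids); every video ID lands on exactly one side."""
    unique_ids = sorted(set(video_ids))
    random.Random(seed).shuffle(unique_ids)
    n_val = round(len(unique_ids) * val_fraction)
    return set(unique_ids[n_val:]), set(unique_ids[:n_val])


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def regression_report(y_true, y_pred):
    """Collect the four validation metrics in one dictionary."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / len(errors)
    return {
        "MAE": sum(abs(e) for e in errors) / len(errors),
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "Spearman": spearman(y_true, y_pred),
    }


# Toy usage with made-up IDs and scores (not challenge data).
train_ids, val_ids = split_by_video_id(list(range(10)))
report = regression_report([0.9, 0.7, 0.8, 0.6], [0.85, 0.75, 0.7, 0.65])
```
      </preformat>
      <p>Splitting on the set of unique IDs, rather than on individual rows, keeps all samples of one video on the same side of the split.</p>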
      <p>
        Previous work [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] has shown that the voting and stacking methods produce high
Spearman correlation coefficients for similar data, so they were among our initial considerations.
Later in the task, we tested and implemented four other models, which
performed on par with the Stacking Ensembler and Voting Ensembler.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Implementation and Experiments</title>
      <p>RMSE and Spearman’s ρ were the deciding factors on which we chose the methods after
rigorous experimentation. Internal tests were done on the 80/20 training/validation split.</p>
      <sec id="sec-4-1">
        <title>4.1. Choice of Models</title>
        <p>Models that are effective in predicting continuous output variables, memorability scores in this
case, were chosen.</p>
        <p>The models that achieved low RMSE and a high Spearman’s rank correlation coefficient
(ρ) on the training set were identified and added to an initially empty regressor candidate pool.
Ensemble models such as the Gradient Boosting Regressor, AdaBoost
Regressor, and XGBoost Regressor do not require additional base estimators and can be trained
independently, so they were also added to the regressor candidate pool. On the other
hand, Voting Regressor and Stacking Regressor permit the use of pre-trained base estimators.
The regressors in the candidate pool with the best performance, which can be observed from
Table 1, were chosen as the base estimators for the Voting and Stacking Regressors. Since each Voting
Regressor was given 3 base estimators, the 4 best models from the pool were
chosen and 4 instances of Voting Regressor were built, one for each combination of 3 of the 4
models. The best performance was observed in the model with AdaBoost, Gradient Boosting,
and Bayes Regression as base estimators. Similarly, 4 instances of Stacking Regressor
were built, each with 3 base estimators drawn from the 4 best-performing models in
the pool, which now included the Voting Regressor. The best performance was observed in the
model with Voting Regressor, AdaBoost Regressor, and Gradient Boosting Regressor as base
estimators.</p>
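        <p>The selection over groups of three base estimators can be sketched as follows. This is an illustrative standard-library example under stated assumptions: the model names (model_a to model_d) and all numbers are made up, and voting is taken to be an unweighted average of the base estimators' predictions, which may differ from the configuration used for the submitted runs.</p>
        <preformat>
```python
from itertools import combinations
import math

# Hypothetical validation targets and per-model predictions (made-up numbers).
y_val = [0.90, 0.70, 0.80, 0.60]
pool_predictions = {
    "model_a": [0.88, 0.74, 0.79, 0.66],
    "model_b": [0.80, 0.72, 0.85, 0.58],
    "model_c": [0.95, 0.65, 0.70, 0.62],
    "model_d": [0.86, 0.69, 0.83, 0.55],
}


def vote(prediction_lists):
    """Unweighted voting: average the base estimators' predictions per sample."""
    return [sum(column) / len(column) for column in zip(*prediction_lists)]


def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Choosing 3 base estimators out of 4 candidates gives exactly 4 groups.
trios = list(combinations(sorted(pool_predictions), 3))

# Score each voting ensemble on the validation targets and keep the best one.
scores = {
    trio: rmse(y_val, vote([pool_predictions[name] for name in trio]))
    for trio in trios
}
best_trio = min(scores, key=scores.get)
```
        </preformat>
        <p>The same loop applies to the stacking case once the best voting ensemble is added to the pool as a further candidate.</p>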
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Choice of Methods</title>
        <p>
          Six ensemble methods were chosen initially, of which two made the final cut. One of the base
estimators in the proposed architecture for the final Ensemblers (Voting and Stacking) is the
Gradient Boosting Regressor, which is itself an ensemble method. The max_depth parameter
was identified as a promising value to tune [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] in order to improve prediction accuracy.
It sets the maximum depth of the individual regression estimators, which
limits the number of nodes in each tree. The best value depends on the interaction of the input
variables. We observed that max_depth = 1 performed better
on the given training and validation sets than the default value of 3, which caused
our model to overfit the data. The other parameters were left at their default values. Both
final Ensemblers also use AdaBoost Regressor as a base estimator. Its n_estimators
parameter, which sets the maximum number of estimators at which boosting is terminated, was
set to 100 instead of the default value of 50, which proved insufficient given the
large dataset. The other parameters were again left at their default values.
        </p>
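        <p>A hedged sketch of how these settings might look in scikit-learn is given below. The paper does not name its library, so the use of scikit-learn, BayesianRidge as the Bayesian regressor, the default final estimator of StackingRegressor, the fixed random_state values and the synthetic data are all assumptions made for illustration.</p>
        <preformat>
```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    StackingRegressor,
    VotingRegressor,
)
from sklearn.linear_model import BayesianRidge

# Synthetic stand-in for features and memorability scores (not challenge data).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = 0.8 + 0.05 * X[:, 0] + rng.normal(scale=0.02, size=60)

# Tuned hyperparameters from the text; everything else stays at its default.
gbr = GradientBoostingRegressor(max_depth=1, random_state=0)   # default is 3
ada = AdaBoostRegressor(n_estimators=100, random_state=0)      # default is 50
bayes = BayesianRidge()  # assumed stand-in for "Bayes Regression"

# Voting ensemble over the three best candidates.
voting = VotingRegressor([("ada", ada), ("gbr", gbr), ("bayes", bayes)])

# Stacking ensemble that uses the voting ensemble as one of its base estimators.
stacking = StackingRegressor([("voting", voting), ("ada", ada), ("gbr", gbr)])

voting.fit(X, y)
stacking.fit(X, y)
predictions = stacking.predict(X[:5])
```
        </preformat>
        <p>Both ensemble classes clone their base estimators when fitted, so the same configured objects can be passed to the voting and the stacking ensemble.</p>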
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Analysis</title>
      <p>The performance of the proposed architecture was evaluated using three metrics: the Pearson
correlation coefficient, Spearman’s rank correlation coefficient, and MSE.</p>
      <p>The results on the training set, reported in Table 1, show a Spearman
coefficient of around 0.52. This indicates a positive, above-average correlation between
our predicted scores and the ground truth values. Of all the regressors in the candidate pool, Voting Regressor and Stacking
Regressor performed best, with Spearman’s coefficients of 0.507 and 0.520 respectively. The
MSE of both ensemblers was 0.008. The other regressors reached a Spearman’s
coefficient of around 0.45, so the best two, the Stacking Ensembler and the Voting Ensembler,
were picked and submitted for evaluation. The training-set results of the final ensemblers are highlighted in
Table 1.</p>
      <p>The submitted runs of the Stacking and Voting ensemblers were evaluated, and the results are
tabulated in Table 2. Performance on the training and testing data differs only slightly,
suggesting that the models are not overfitted to the training data. The Spearman’s
coefficients for the Stacking Ensembler and Voting Ensembler are 0.525 and 0.513 respectively,
which indicates an above-average correlation between our predicted scores and the ground truth
values. The Pearson coefficient indicates the same. The most notable metric from the submitted
runs is the MSE of 0.008 for both Ensemblers.</p>
      <p>In summary, the results indicate that the Stacking Regressor performed better than
the others. Although its results were not very different from those of the Voting Regressor, the Stacking
Regressor achieved the best correlation and MSE scores.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Outlook</title>
      <p>In this competition, we built on the foundation and precedents
established by previous work. Our key takeaway is that ensemble learning
is an effective way to approach the challenge of predicting media memorability.</p>
      <p>We emphasise that ensembles built from multiple methods produced the best results in our experiments. Future
work would ideally experiment further with the parameters, different base estimators, and
ensemble techniques other than the ones mentioned in our approach.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Halder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matran-Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sultana</surname>
          </string-name>
          ,
          <article-title>Overview of the MediaEval 2022 predicting video memorability task</article-title>
          , in: MediaEval Multimedia Benchmark Workshop Working Notes,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <article-title>Predicting media memorability: comparing visual, textual and auditory features</article-title>
          ,
          <source>arXiv preprint arXiv:2112.07969</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Media memorability prediction based on machine learning</article-title>
          , in:
          <source>MediaEval</source>
          , volume
          <volume>2882</volume>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          , et al.,
          <article-title>Ensemble learning</article-title>
          ,
          <source>The handbook of brain theory and neural networks</source>
          <volume>2</volume>
          (
          <year>2002</year>
          )
          <fpage>110</fpage>
          -
          <lpage>125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <article-title>A survey on ensemble learning</article-title>
          ,
          <source>Frontiers of Computer Science</source>
          <volume>14</volume>
          (
          <year>2020</year>
          )
          <fpage>241</fpage>
          -
          <lpage>258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bourdev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torresani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paluri</surname>
          </string-name>
          ,
          <article-title>Learning spatiotemporal features with 3d convolutional networks</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>4489</fpage>
          -
          <lpage>4497</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Weisberg</surname>
          </string-name>
          ,
          <source>Applied linear regression</source>
          , volume
          <volume>528</volume>
          , John Wiley &amp; Sons,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.,
          <article-title>Using "random forest" for classification and regression</article-title>
          ,
          <source>Chinese Journal of Applied Entomology</source>
          <volume>50</volume>
          (
          <year>2013</year>
          )
          <fpage>1190</fpage>
          -
          <lpage>1197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Bishop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Tipping</surname>
          </string-name>
          ,
          <article-title>Bayesian regression and classification</article-title>
          ,
          <source>Nato Science Series sub Series III Computer And Systems Sciences</source>
          <volume>190</volume>
          (
          <year>2003</year>
          )
          <fpage>267</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vovk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gammerman</surname>
          </string-name>
          ,
          <article-title>Regression conformal prediction with nearest neighbours</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>40</volume>
          (
          <year>2011</year>
          )
          <fpage>815</fpage>
          -
          <lpage>840</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Louppe</surname>
          </string-name>
          ,
          <article-title>Gradient boosted regression trees in scikit-learn</article-title>
          , in:
          <source>PyData 2014</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Luc</surname>
          </string-name>
          ,
          <article-title>RRMSE voting regressor: A weighting function based improvement to ensemble regression</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2207.04837. doi:10.48550/ARXIV.2207.04837.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Benesty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Khotilovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Xgboost: extreme gradient boosting</article-title>
          ,
          <source>R package version 0.4-2</source>
          <volume>1</volume>
          (
          <year>2015</year>
          )
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Acharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Swaminathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kansara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Aatre</surname>
          </string-name>
          ,
          <article-title>Non-invasive estimation of hemoglobin using a multi-model stacking regressor</article-title>
          ,
          <source>IEEE journal of biomedical and health informatics</source>
          <volume>24</volume>
          (
          <year>2019</year>
          )
          <fpage>1717</fpage>
          -
          <lpage>1726</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Azcona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Moreu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <article-title>Predicting media memorability using ensemble models</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shami</surname>
          </string-name>
          ,
          <article-title>On hyperparameter optimization of machine learning algorithms: Theory and practice</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>415</volume>
          (
          <year>2020</year>
          )
          <fpage>295</fpage>
          -
          <lpage>316</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>