<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Essex-NLIP at MediaEval Predicting Media Memorability 2020 Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Janadhip Jacutprakart</string-name>
          <email>j.jacutprakart@essex.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rukiye Savran Kiziltepe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Q. Gan</string-name>
          <email>jqgan@essex.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giorgos Papanastasiou</string-name>
          <email>g.papanastasiou@essex.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alba G. Seco de Herrera</string-name>
          <email>alba.garcia@essex.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Essex</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present the methods and main results of the Essex-NLIP team's participation in the MediaEval 2020 Predicting Media Memorability task. The task requires participants to build systems that can predict short-term and long-term memorability scores for the real-world video samples provided. Our approach focuses on colour-based visual features and on the video annotation metadata. In addition, hyper-parameter tuning was explored. Despite the simplicity of the methodology, our approach achieves competitive results. We investigated the use of different visual features and assessed memorability prediction with various regression models, of which Random Forest regression was selected as our final model for predicting the memorability of videos.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The number of published videos has been increasing, and improved
content analysis has attracted great interest across different areas
of research in media memorability. The MediaEval Predicting Media
Memorability task focuses on how memorable video content is to
viewers. Participants in the task are expected to develop systems
that automatically predict short-term and long-term memorability
scores for given video samples [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The task was
introduced in 2018 with a soundless dataset including 10,000 short
videos [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In 2019, the task continued to explore short-term and
long-term video memorability, with participants watching soundless
videos over an extended period [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, this
year, García et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have released a new dataset based on
real-world data with motion and audio information within videos. The
dataset, annotation collection procedure, pre-computed features,
and ground truth data are described in the task overview paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        This article describes the participation of the Essex-NLIP
research group (the Natural Language and Information Processing
research group at the University of Essex, UK; see
https://essexnlip.uk/) in the MediaEval Predicting Media Memorability
2020 task. The Essex-NLIP group participated in this task for the
first time in 2019 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. In 2020, the team focused on colour-based visual features
and on the video annotation metadata. Hyper-parameter tuning was
also explored.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        In the 2019 MediaEval task, various types of regression models were
used to explore the use of image and video features for predicting
media memorability. Dos Santos and Almeida [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] used K-Nearest
Neighbour Regression (KNR) and Leyva et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed Support Vector Regression (SVR) with a regularised
method and LassoLarsCV. Azcona et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] employed SVR and Bayesian Ridge Regression in an ensemble
to assess the predictive capability of each model. In contrast to the
above complex regressions, Wang et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] used Random Forest
regression and SVR. In this work, we use Random Forest after
exploring several regression models (see Section 3.3).
      </p>
      <p>
        In the last 20 years, many visual video descriptors have been
explored [
        <xref ref-type="bibr" rid="ref11 ref12 ref4">4, 11, 12</xref>
        ]. A set of pre-computed visual descriptors was
provided in this task for the videos in the collection (see Section 3.1).
      </p>
      <p>
        In order to combine features, both Rattani et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and Ross
and Govindarajan [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] used simple concatenation to fuse
different features. In this work, we also used concatenation
to fuse the descriptors of each feature (see Section 3.2).
      </p>
      <p>
        Bergstra and Bengio [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] compared random search and grid search for hyper-parameter
optimisation; the corresponding scikit-learn methods,
RandomizedSearchCV and GridSearchCV, are both used in this work
(see Section 3.3).
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>This section describes the basic techniques used in the 5
runs submitted for this work. The features selected in Section 3.2
were explored using the Random Forest regression described in
Section 3.3 in order to obtain the memorability score. Figure 3 shows
an overview of the process.</p>
    </sec>
    <sec id="sec-4">
      <title>Dataset</title>
      <p>
        In 2020, the collection is composed of 1500 short videos. A set of
pre-extracted features was also distributed, including seven visual
features. More details about the task and the collection can be found
in García et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Features</title>
      <p>All the pre-computed visual features provided by the task
(AlexNetFC7, HOG, HSVHist, RGBHist, LBP, VGGFC7 and C3D) were explored
in combination with several regression models (see Section 3.3). For
this paper, RGBHist and HSVHist were selected based on the results
obtained in the experiments on the development set. Both RGBHist
and HSVHist are colour-histogram descriptors, which extract vectorised
pixel values within a certain neighbourhood of each pixel, through
either the RGB colour channels or the HSV (Hue, Saturation, Value)
colour channels. The descriptors were concatenated by colour
channel in order to preserve the data format of the features. For
one of the experiments, we fused RGBHist and HSVHist by
concatenating the two descriptors.</p>
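      <p>The concatenation-based fusion of colour histograms can be sketched as follows. This is only a minimal illustration under stated assumptions, not the extractor used to produce the task's pre-computed RGBHist and HSVHist features: the bin count, the per-channel layout, the toy pixel values and the helper names (channel_histogram, colour_histogram, fuse) are all hypothetical.</p>
      <preformat>
```python
import colorsys


def channel_histogram(values, bins=4):
    """Normalised histogram of values in [0, 1] (the bin count is an assumption)."""
    counts = [0] * bins
    for v in values:
        # Clamp the top edge so that v == 1.0 falls in the last bin.
        counts[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    return [c / total for c in counts]


def colour_histogram(pixels, bins=4):
    """Concatenate the per-channel histograms of 3-channel pixels into one vector."""
    descriptor = []
    for channel in range(3):
        descriptor.extend(channel_histogram([p[channel] for p in pixels], bins))
    return descriptor


def fuse(*descriptors):
    """Feature-level fusion by simple concatenation."""
    fused = []
    for d in descriptors:
        fused.extend(d)
    return fused


# Toy frame: RGB pixels scaled to [0, 1] (invented values).
rgb_pixels = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.5)]
hsv_pixels = [colorsys.rgb_to_hsv(r, g, b) for r, g, b in rgb_pixels]

rgb_hist = colour_histogram(rgb_pixels)
hsv_hist = colour_histogram(hsv_pixels)
fused = fuse(rgb_hist, hsv_hist)
print(len(rgb_hist), len(hsv_hist), len(fused))
```
      </preformat>
      <p>In this sketch the fused vector is simply the RGB descriptor followed by the HSV descriptor, so its length is the sum of the two input lengths and each original descriptor remains recoverable by slicing.</p>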
      <p>In addition, the metadata provided with the annotations was used
as a video descriptor, which we call “Descriptive”. It contains
the average value of the video position and the number of annotations
that occur per video.</p>
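      <p>A descriptor of this kind could be derived roughly as in the sketch below. The record layout (a video identifier paired with a position value), the function name descriptive_features and the toy values are assumptions made for illustration; the actual annotation files of the task may be organised differently.</p>
      <preformat>
```python
from collections import defaultdict


def descriptive_features(annotations):
    """Map each video_id to (mean video position, number of annotations).

    `annotations` is an iterable of (video_id, position) pairs; this record
    layout is an assumption made for the example.
    """
    positions = defaultdict(list)
    for video_id, position in annotations:
        positions[video_id].append(position)
    return {
        video_id: (sum(p) / len(p), len(p))
        for video_id, p in positions.items()
    }


# Toy annotation log (values invented for illustration only).
annotations = [("v1", 10), ("v1", 30), ("v2", 5), ("v1", 50), ("v2", 15)]
features = descriptive_features(annotations)
print(features)
```
      </preformat>
      <p>Each video is thus reduced to a two-value vector, which can be passed to a regression model in the same way as any other descriptor.</p>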
    </sec>
    <sec id="sec-6">
      <title>Regression Model</title>
      <p>
        After exploring several regression models (Random Forest [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ],
Decision Tree [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], Gradient Boosting [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Extra Tree [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and
Sequential regression models [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), in this work, we used Random
Forest based on the results we obtained from the development set.
      </p>
      <p>
        For Random Forest regression, we explored performance with
both the default parameters and hyper-parameter tuning.
RandomizedSearchCV samples a fixed number of random hyper-parameter
combinations from a given domain. GridSearchCV performs an exhaustive
search over predefined parameter values for an estimator and returns
the best result obtained from the hyper-parameter combinations [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. We defined the parameter grid for GridSearchCV based on
the results obtained from RandomizedSearchCV.
      </p>
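      <p>The two-stage tuning procedure can be sketched with scikit-learn as follows. This is a hedged illustration rather than the configuration used for the submitted runs: the synthetic data, the parameter ranges, the number of sampled combinations, the 3-fold cross-validation and the default scorer are all assumptions chosen to keep the example small.</p>
      <preformat>
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in for (descriptor, memorability score) pairs.
rng = np.random.default_rng(0)
X = rng.random((60, 8))
y = 0.5 * X[:, 0] + 0.1 * rng.random(60)

# Stage 1: sample a few random combinations from a broad domain.
broad_space = {
    "n_estimators": [10, 20, 40, 80],
    "max_depth": [2, 4, 8, None],
    "min_samples_leaf": [1, 2, 4],
}
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=broad_space,
    n_iter=5,
    cv=3,
    random_state=0,
)
random_search.fit(X, y)
best = random_search.best_params_

# Stage 2: exhaustive search on a small grid built around the stage-1 result.
narrow_grid = {
    "n_estimators": sorted({max(5, best["n_estimators"] // 2), best["n_estimators"]}),
    "max_depth": [best["max_depth"]],
    "min_samples_leaf": sorted({best["min_samples_leaf"], best["min_samples_leaf"] + 1}),
}
grid_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=narrow_grid,
    cv=3,
)
grid_search.fit(X, y)
print(grid_search.best_params_)
```
      </preformat>
      <p>The random stage keeps the cost low over a wide domain, while the grid stage checks every combination in a much smaller region; the fitted model is then available as grid_search.best_estimator_.</p>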
    </sec>
    <sec id="sec-7">
      <title>RESULTS AND ANALYSIS</title>
      <p>This year, Essex-NLIP submitted 5 runs for both short-term and
long-term memorability, using the techniques described in Section 3:</p>
      <list list-type="bullet">
        <list-item><p>Run 1 uses the HSVHist descriptor and hyper-parameter tuning.</p></list-item>
        <list-item><p>Run 2 uses the RGBHist descriptor and no hyper-parameter tuning.</p></list-item>
        <list-item><p>Run 3 uses the RGBHist descriptor and hyper-parameter tuning.</p></list-item>
        <list-item><p>Run 4 uses both the RGBHist and HSVHist descriptors and hyper-parameter tuning.</p></list-item>
        <list-item><p>Run 5 uses the Descriptive descriptor and hyper-parameter tuning.</p></list-item>
      </list>
      <p>Table 1 presents the results on the development and test sets
for both short-term and long-term memorability using Random
Forest regression and the following features: RGBHist, HSVHist
and Descriptive. Table 1 also indicates that the results achieved
in this year's challenge were very low, considering the mean and
variance of the participants’ results. Despite using a simple approach,
two of the submitted runs achieved competitive results on the
test set. For short-term memorability, the best result was achieved with
Run 1, which uses HSVHist and hyper-parameter tuning. For
long-term memorability, the best result was obtained with
Run 2, which uses RGBHist without hyper-parameter tuning.</p>
      <p>The results obtained on the development set were considerably
higher than those achieved on the test set. The highest
Spearman’s correlation score (0.508) on the development set was
achieved with Run 5. The best result for long-term memorability
on the development set was obtained by Run 4, which uses the
fusion of RGBHist and HSVHist with hyper-parameter tuning and
obtained a Spearman’s correlation score of 0.422. Further
investigation is needed to increase the prediction performance on the
test set.</p>
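      <p>For reference, Spearman’s correlation is the Pearson correlation computed on the ranks of the two score lists. The sketch below is a plain-Python illustration of that definition; the helper names and the toy scores are hypothetical and are not results from this work.</p>
      <preformat>
```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i != len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 != len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy ground-truth and predicted memorability scores (invented values).
truth = [0.95, 0.80, 0.60, 0.90, 0.70]
predicted = [0.85, 0.82, 0.70, 0.88, 0.60]
print(round(spearman(truth, predicted), 3))  # prints 0.8
```
      </preformat>
      <p>Because only the ordering of the scores matters, a model can reach a high Spearman’s correlation even when its predicted values are offset or rescaled with respect to the ground truth.</p>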
      <p>Run 5 uses the Descriptive feature presented in Section 3.2. The
results indicate that the number of annotations and the video
position alone are informative of how memorable a video is, even
without considering any further video descriptor.</p>
    </sec>
    <sec id="sec-8">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>This article describes the methods and results of the Essex-NLIP
team for the MediaEval 2020 Predicting Media Memorability task.
Five runs were submitted for both short-term and long-term
memorability using Random Forest regression. After exploring all the
features provided by the task organisers, we worked on
colour-based features and on the video position annotation metadata, which
achieved the highest scores on our development set. The results on
the development set were higher than those on the test set, due to
differences in the data set size (the development set was larger). Despite
the simplicity of the proposed approach, it achieved competitive
results whilst exploring how the video position and the number of
annotations can affect the memorability score.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>David</given-names>
            <surname>Azcona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Enric</given-names>
            <surname>Moreu</surname>
          </string-name>
          , Feiyan Hu,
          <string-name>
            <given-names>Tomás E.</given-names>
            <surname>Ward</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alan F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Predicting Media Memorability Using Ensemble Models</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2019 Workshop (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>2670</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>James</given-names>
            <surname>Bergstra</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Random Search for Hyper-Parameter Optimization</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          <volume>13</volume>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ),
          <fpage>281</fpage>
          -
          <lpage>305</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Romain</given-names>
            <surname>Cohendet</surname>
          </string-name>
          , Claire Hélène Demarty,
          <string-name>
            <given-names>Ngoc Q.K.</given-names>
            <surname>Duong</surname>
          </string-name>
          , Mats Sjöberg, Bogdan Ionescu, and Thanh Toan Do.
          <year>2018</year>
          .
          <article-title>MediaEval 2018: Predicting Media Memorability</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2018 Workshop (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>2283</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Miguel T</given-names>
            <surname>Coimbra</surname>
          </string-name>
          and
          <string-name>
            <given-names>JP Silva</given-names>
            <surname>Cunha</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>MPEG-7 Visual Descriptors-Contributions for Automated Feature Extraction in Capsule Endoscopy</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          <volume>16</volume>
          ,
          <issue>5</issue>
          (
          <year>2006</year>
          ),
          <fpage>628</fpage>
          -
          <lpage>637</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Mihai Gabriel</given-names>
            <surname>Constantin</surname>
          </string-name>
          , Bogdan Ionescu, Claire Hélène Demarty,
          <string-name>
            <given-names>Ngoc Q.K.</given-names>
            <surname>Duong</surname>
          </string-name>
          , Xavier Alameda-Pineda, and
          <string-name>
            <given-names>Mats</given-names>
            <surname>Sjöberg</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The Predicting Media Memorability Task at MediaEval 2019</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2019 Workshop (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>2670</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Xu</given-names>
            <surname>Dazhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wu</given-names>
            <surname>Xiaoyu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sun</given-names>
            <surname>Guoquan</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Image Memorability Prediction Based on Machine Learning</article-title>
          .
          <source>In 2020 IEEE 3rd International Conference on Computer and Communication Engineering Technology (CCET)</source>
          . IEEE,
          <fpage>91</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Samuel</given-names>
            <surname>Felipe Dos Santos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jurandy</given-names>
            <surname>Almeida</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>GIBIS at MediaEval 2019: Predicting Media Memorability Task</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2019 Workshop (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>2670</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Alba</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          , Rukiye Savran Kiziltepe, Jon Chamberlain, Mihai Gabriel Constantin,
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          , Faiyaz Doctor, Bogdan Ionescu, and
          <string-name>
            <given-names>Alan F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Overview of MediaEval 2020 Predicting Media Memorability task: What Makes a Video Memorable?</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2020 Workshop (CEUR Workshop Proceedings).</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Nikhil</given-names>
            <surname>Ketkar</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Deep Learning with Python</article-title>
          . Springer,
          <fpage>97</fpage>
          -
          <lpage>111</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Leyva</surname>
          </string-name>
          , Faiyaz Doctor, Alba García Seco de Herrera, and
          <string-name>
            <given-names>Sohail</given-names>
            <surname>Sahab</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Multimodal Deep Features Fusion For Video Memorability Prediction</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2019 Workshop (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>2670</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Ionuţ</given-names>
            <surname>Mironică</surname>
          </string-name>
          , Ionuţ Cosmin Duţă, Bogdan Ionescu, and
          <string-name>
            <given-names>Nicu</given-names>
            <surname>Sebe</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A Modified Vector of Locally Aggregated Descriptors Approach for Fast Video Classification</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          <volume>75</volume>
          ,
          <issue>15</issue>
          (
          <year>2016</year>
          ),
          <fpage>9045</fpage>
          -
          <lpage>9072</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Jens-Rainer</given-names>
            <surname>Ohm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F</given-names>
            <surname>Bunjamin</surname>
          </string-name>
          , Wolfram Liebsch, Bela Makai, Karsten Müller, Aljoscha Smolic, and
          <string-name>
            <given-names>D</given-names>
            <surname>Zier</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>A Set of Visual Feature Descriptors and their Combination in a Low-Level Description Scheme</article-title>
          .
          <source>Signal Processing: Image Communication</source>
          <volume>16</volume>
          ,
          <issue>1-2</issue>
          (
          <year>2000</year>
          ),
          <fpage>157</fpage>
          -
          <lpage>179</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          , Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          , Ron Weiss, Vincent Dubourg, and others.
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          ),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Ajita</given-names>
            <surname>Rattani</surname>
          </string-name>
          , Dakshina Ranjan Kisku, Manuele Bicego, and
          <string-name>
            <given-names>Massimo</given-names>
            <surname>Tistarelli</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Feature Level Fusion of Face and Fingerprint Biometrics</article-title>
          .
          <source>In 2007 First IEEE International Conference on Biometrics: Theory, Applications, and Systems</source>
          . IEEE,
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Arun A</given-names>
            <surname>Ross</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rohin</given-names>
            <surname>Govindarajan</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Feature Level Fusion Using Hand and Face Biometrics</article-title>
          .
          <source>In Biometric Technology for Human Identification II</source>
          , Vol.
          <volume>5779</volume>
          .
          International Society for Optics and Photonics,
          <fpage>196</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Jaak</given-names>
            <surname>Simm</surname>
          </string-name>
          , Ildefons Magrans De Abril, and
          <string-name>
            <given-names>Masashi</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Tree-Based Ensemble Multi-Task Learning Method for Classification and Regression</article-title>
          .
          <source>IEICE Transactions on Information and Systems</source>
          <volume>97</volume>
          ,
          <issue>6</issue>
          (
          <year>2014</year>
          ),
          <fpage>1677</fpage>
          -
          <lpage>1681</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Hammad</given-names>
            <surname>Squalli-Houssaini</surname>
          </string-name>
          ,
          Ngoc Q.K. Duong, Marquant Gwenaëlle, and
          <string-name>
            <given-names>Claire-Hélène</given-names>
            <surname>Demarty</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep Learning for Predicting Image Memorability</article-title>
          .
          <source>In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          . IEEE,
          <fpage>2371</fpage>
          -
          <lpage>2375</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Shuai</given-names>
            <surname>Wang</surname>
          </string-name>
          , Linli Yao, Jieting Chen, and
          <string-name>
            <given-names>Qin</given-names>
            <surname>Jin</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>RUC at MediaEval 2019: Video Memorability Prediction Based on Visual Textual and Concept Related Features</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2019 Workshop (CEUR Workshop Proceedings)</source>
          , Vol.
          <volume>2670</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>