<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Overview of the MediaEval 2022 Predicting Video Memorability Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorin Sweeney</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mihai Gabriel Constantin</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claire-Hélène Demarty</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camilo Fosco</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alba G. Seco de Herrera</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Halder</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Graham Healy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bogdan Ionescu</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ana Matran-Fernandez</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alan F. Smeaton</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mushfika Sultana</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dublin City University</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>InterDigital</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
<institution>Massachusetts Institute of Technology</institution>
          , Cambridge,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University Politehnica of Bucharest</institution>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Essex</institution>
          ,
<country country="GB">UK</country>
        </aff>
      </contrib-group>
<pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
<p>This paper describes the 5th edition of the Predicting Video Memorability Task, part of MediaEval 2022. This year we have reorganised and simplified the task in order to facilitate a greater depth of inquiry. As in previous years, two datasets are provided in order to facilitate generalisation; however, this year we have replaced the TRECVid 2019 Video-to-Text dataset with the VideoMem dataset in order to remedy underlying data quality issues, and we prioritise short-term memorability prediction by elevating the Memento10k dataset to primary dataset status. Additionally, a fully fledged electroencephalography (EEG)-based prediction sub-task is introduced. In this paper, we outline the core facets of the task and its constituent sub-tasks, describing the datasets, evaluation metrics, and requirements for participant submissions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>As the natural world unwinds in an endless cacophony of sensory threads, the human brain
selectively spins it into intelligible spools—filtering out information it deems unnecessary and
spinning the rest into an intelligible internal representation. The human brain is as
masterful a weaver as it is a spinner; weaving a colourful tapestry of meaning from its spools
of intelligible threads by deciding which threads should be stitched into the canvas of our
mind—what should be remembered and what should not.</p>
      <p>The question is, what criteria does it use to decide what should and should not be remembered?
Unfortunately, a satiating answer presently remains out of reach, leaving “what it deems to be
important” as our appetizer. Memorability—the likelihood that a given piece of content will
be recognised upon subsequent viewing—can accordingly be viewed as a proxy for human
importance, which is what ultimately motivates and brings meaning to its exploration. After
all, what could be more important than a measure of importance itself?</p>
<p>Memorability is accordingly the quintessential media metric, by virtue of its proximity to
the bedrock of human experience. If a system can predict the memorability of incoming
information, it can evaluate that information's utility, then discard, filter, or augment what is of
little use, and ultimately curate more meaningful media content.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The subject of memorability has seen an influx in interest since the likelihood of images being
recognised upon subsequent viewing was found to be consistent across individuals [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Driven
primarily by the MediaEval Media Memorability tasks [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ], recent research has extended
beyond static images, pivoting to the more dynamic and multi-modal medium of video. In
2018, a video memorability annotation procedure was established, and the first large-scale video
memorability dataset, VideoMem [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]—10,000 short soundless videos with both long-term and
short-term memorability scores—was created. Additionally, the first analysis of human
consistency in video memorability was conducted. In 2019, the task ran for a second time
using the same dataset, allowing participants to learn from the previous year’s task and to carry
out comparative analysis of results from one year to the next. In 2020, a new, smaller dataset was
introduced which included audio for the first time [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In 2021, that dataset was extended, with
a second large short-term dataset—Memento10k [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]—being released, short-term memorability
was sub-categorised into raw and normalised scores, an optional generalisation sub-task was
proposed, and a pilot EEG study [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] was conducted.
      </p>
<p>
        Over the course of those four tasks, we have learned several lessons: short-term video memorability is
easier to predict than long-term memorability; simple image features, such as hue, saturation,
or spatial frequency, have repeatedly been found not to correlate with memorability; properties
such as aesthetics and interestingness likewise do not correlate with memorability; ensembles
that combine different modalities provide the best results; combining deep visual features
with semantically rich features such as captions, emotions, or actions [
        <xref ref-type="bibr" rid="ref10 ref11 ref7 ref9">9, 7, 10, 11</xref>
        ]
is a highly effective approach; dimensionality reduction improves prediction results; and certain
semantic categories of objects or places are more memorable than others [
        <xref ref-type="bibr" rid="ref4 ref7">4, 7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Description</title>
<p>In this edition, the Predicting Video Memorability task challenges participants to develop systems
that automatically predict short-term memorability scores for short-form videos. Participants
are provided with three datasets and offered three sub-tasks in which to participate.</p>
      <sec id="sec-3-1">
        <title>3.1. Sub-task 1: How memorable is this video? - Video-based prediction</title>
        <p>
          Using the Memento10k [
          <xref ref-type="bibr" rid="ref7">7</xref>
] dataset, participants are required to build automatic systems
that predict the short-term memorability scores of new videos, based on the provided videos and
their memorability scores.
        </p>
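As an illustration only (an assumed sketch, not the task's reference system), a minimal sub-task 1 baseline can regress the normalised short-term scores directly from any of the provided pre-extracted feature vectors, e.g. with closed-form ridge regression:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y.
    X: (n_videos, n_features) pre-extracted features (e.g. ResNet50);
    y: (n_videos,) normalised short-term memorability scores."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def predict(X, w):
    # predicted memorability scores for unseen videos
    return X @ w
```

Real submissions have historically favoured ensembles over such single-feature regressors, but a baseline of this kind is a common first point of comparison.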
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Sub-task 2: How memorable is this video? - Generalisation (optional)</title>
<p>Sub-task 2 is a natural extension of sub-task 1: participants can evaluate their systems
from sub-task 1 (trained on Memento10k) on the VideoMem dataset. Alternatively, participants
can train a system on the VideoMem dataset and evaluate it on Memento10k.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Sub-task 3: Will this person remember this video? - EEG-based prediction (optional)</title>
        <p>Participants are required to generate automatic systems that predict whether or not a given
subject will recognise a given video upon subsequent viewing (N.B., this difers from
memorability as it is subject specific and a binary prediction, rather than subject agnostic and a floating
point prediction) based on the provided EEG data. Participants may choose to use the provided
EEG features in concert with sub-task 1’s visual features or in isolation. However, they must
use the EEG features in some capacity.</p>
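A minimal sketch of an EEG-only approach (the shapes and setup here are illustrative assumptions, not the task's reference pipeline): treat each trial's EEG feature vector as input to a binary classifier whose output is the probability that the subject will recognise that video.

```python
import numpy as np

def train_logreg(X, y, lr=0.5, steps=2000):
    """Plain logistic regression on per-trial EEG feature vectors.
    y[i] = 1 if the subject recognised video i on second viewing, else 0."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted recognition probability
        g = p - y                                # gradient of the log-loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def predict_proba(X, w, b):
    # per-trial probability of recognition, suitable for AUC evaluation
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))
```

Since sub-task 3 is scored with AUC, the continuous probabilities can be submitted directly rather than thresholded labels.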
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Dataset Details</title>
<p>In the interest of clarity, standardisation, and the facilitation of more directed inquiry, we have
narrowed the scope of the tasks, forgoing raw and long-term memorability scores in favour
of normalised short-term scores. Additionally, in order to address systemic data quality issues
highlighted by a consistent disparity between participant systems trained on the TRECVid2019
dataset and those trained on the Memento10k dataset, we have opted to replace the TRECVid2019 dataset with
VideoMem, and to elevate Memento10k to primary dataset status. Finally, a fully fledged EEG
dataset (EEGMem) is provided.</p>
<p>
        The following set of pre-extracted features is provided along with the Memento10k and
VideoMem datasets:
• Image-level features: AlexNetFC7 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], HOG [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], HSVHist, RGBHist, LBP [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], VGGFC7 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], DenseNet121 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], ResNet50 [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], EfficientNetB3 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
• Video-level features: C3D [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]
Three frames—the first, middle, and last—from each video were used to extract the image-level
features.
      </p>
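The three-keyframe scheme can be sketched as follows (a minimal illustration; the organisers' exact extraction code is not specified here, and the OpenCV usage in the comment is an assumption):

```python
def keyframe_indices(n_frames):
    """Indices of the first, middle, and last frames of a clip, as used
    for image-level feature extraction (duplicates removed for tiny clips)."""
    if n_frames <= 0:
        raise ValueError("empty video")
    return sorted({0, n_frames // 2, n_frames - 1})

# Assumed usage with OpenCV:
#   cap = cv2.VideoCapture("clip.mp4")
#   n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
#   for i in keyframe_indices(n):
#       cap.set(cv2.CAP_PROP_POS_FRAMES, i)
#       ok, frame = cap.read()  # frame -> CNN feature extractor
```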
      <sec id="sec-4-1">
        <title>4.1. Memento10k</title>
        <p>
Memorability scores were collected through Memento: The Video Memory Game, a
memorability experiment predicated on the old-new recognition paradigm [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], in which crowdworkers from
Amazon’s Mechanical Turk (AMT) watch a continuous stream of three-second video clips and
are asked to press the space bar when they see a repeated video. Showing the videos as a
continuous stream maximises the pace and keeps the experiment engaging. When participants
press the space bar, they receive either a red (incorrect) or green (correct) flash as feedback.
If a repeat is correctly identified, known as a “hit”, the stream skips ahead to the next video;
there is no feedback for missed repeats. Each level of the memory game contains on average
204 videos (with repeats) and lasts ∼9 minutes. The number of intervening videos between
the first and second occurrence of a repeated video is known as the “lag”. The game includes
“vigilance” repeats, which occur at short lags of 2-3 videos and are used to filter out inattentive
workers, and “target” repeats at lags of 9-200 videos, which provide the memorability data.
        </p>
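To make the lag bands concrete, here is a hypothetical helper (the function and its record format are illustrative assumptions; Memento10k's published scores additionally correct for lag-dependent decay) that separates vigilance checks from target repeats and returns a naive hit-rate score:

```python
def score_video(annotations):
    """annotations: iterable of (lag, hit) pairs for one video's repeats,
    where hit is 1 if the worker pressed the space bar on the repeat.
    Vigilance repeats (lag 2-3) are attention checks only; target repeats
    (lag 9-200) contribute to the score, here a simple hit rate."""
    targets = [hit for lag, hit in annotations if 9 <= lag <= 200]
    if not targets:
        return None  # no target repeats annotated for this video
    return sum(targets) / len(targets)
```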
        <p>
          The Memento10k dataset [
          <xref ref-type="bibr" rid="ref7">7</xref>
] consists of 10,000 three-second videos depicting in-the-wild
scenes, each with associated short-term memorability scores, memorability decay values, action
labels, and five human-generated captions. The memorability scores were computed from an
average of 90 annotations per video, and the videos were muted before being shown to
participants. 7,000 videos are released as part of the training set, and 1,500 are provided for
validation. The remaining 1,500 videos are kept for the official test set.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. VideoMem</title>
        <p>
          The VideoMem dataset [
          <xref ref-type="bibr" rid="ref4">4</xref>
] consists of 10,000 soundless seven-second videos, each with associated
short-term and long-term memorability scores; however, the long-term scores are omitted from this
year’s task. The videos were extracted from cinematic raw stock footage, and each comes with a caption.
7,000 videos are released as part of the training set, and 1,500 are provided for validation. The
remaining 1,500 videos are kept for the official test set.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. EEGMem</title>
        <p>
          The EEGMem dataset comprises pre-extracted features from EEG recordings for 12 subjects
captured while they watched a subset of the Memento10k [
          <xref ref-type="bibr" rid="ref7">7</xref>
] videos. Participants watched
the same videos again through a custom-built online portal 24–72 hours after the
video-EEG recording session, where they were required to indicate, for each video, whether or
not they recognised it, providing binary ground-truth annotations¹.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
<p>A total of five runs can be submitted by each participant for each sub-task. For sub-task 1,
all information relating to the Memento10k dataset, i.e., ground-truth data, annotation data,
pre-extracted features, and features extracted from the provided material, may be used to build
the system. For sub-task 2, in similar fashion to sub-task 1, all information relating to the
Memento10k and VideoMem datasets may be used to build the system; however, only one
dataset may be used for training per run, and the resulting system must be evaluated on the other
dataset to assess generalisability. For sub-task 3, the only requirement is that EEG data must be,
to some extent, included in the system.</p>
<p>Three standard metrics will be used to assess participant system performance for sub-tasks 1
and 2: Spearman’s rank correlation, Pearson correlation, and mean squared error. However,
as in previous years, Spearman’s rank correlation will be adopted as the official metric, as it
enables inter-method comparisons by taking into account monotonic relationships between
ground-truth data and system output. Submissions for sub-task 3 will be evaluated using the
Area Under the Receiver Operating Characteristic Curve (AUC).</p>
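For concreteness, the sub-task 1/2 metrics can be computed as follows (a self-contained sketch; rank ties are broken by sort order here, which suffices for continuous scores, whereas a full implementation would average tied ranks):

```python
import numpy as np

def _ranks(x):
    # 1-based rank transform; ties broken by sort order
    order = np.argsort(x)
    r = np.empty(len(x))
    r[order] = np.arange(1, len(x) + 1)
    return r

def spearman(pred, truth):
    """Official metric: Pearson correlation of the rank-transformed values."""
    rp, rt = _ranks(pred), _ranks(truth)
    rp, rt = rp - rp.mean(), rt - rt.mean()
    return float((rp @ rt) / np.sqrt((rp @ rp) * (rt @ rt)))

def pearson(pred, truth):
    p, t = np.asarray(pred, float), np.asarray(truth, float)
    p, t = p - p.mean(), t - t.mean()
    return float((p @ t) / np.sqrt((p @ p) * (t @ t)))

def mse(pred, truth):
    p, t = np.asarray(pred, float), np.asarray(truth, float)
    return float(np.mean((p - t) ** 2))
```

Because Spearman depends only on ranks, any strictly monotonic rescaling of a system's outputs leaves its official score unchanged, which is what makes cross-method comparison fair.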
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
<p>This paper presents an overview of the fifth edition of the MediaEval Predicting Video
Memorability task. As in previous years, the task presents a framework to evaluate the prediction of
the memorability of short-form videos. This year the task focuses on short-term memorability
and introduces a sub-task based on EEG signals. Details regarding the participants’ approaches and
their results can be found in the proceedings of the 2022 MediaEval workshop².</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
<p>This work was supported by Science Foundation Ireland under Grant Number SFI/12/RC/2289_P2, co-funded by the European
Regional Development Fund. Financial support was also provided by the University of Essex Faculty
of Science and Health Research Innovation and Support Fund, and under project AI4Media,
a European Excellence Centre for Media, Society and Democracy, H2020 ICT-48-2020, grant #951911.
¹Further details on the EEGMem dataset and data collection protocol are available at: https://bit.ly/3BTstj7
²See CEUR Workshop Proceedings (CEUR-WS.org).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <article-title>What makes an image memorable</article-title>
          ,
<source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , IEEE,
          <year>2011</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Kiziltepe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
<string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Halder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matran-Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <article-title>Overview of the MediaEval 2021 predicting media memorability task</article-title>
          , in: MediaEval Multimedia Benchmark Workshop Working Notes,
          <year>2021</year>
. URL: http://ceur-ws.org/Vol-3181/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Savran</given-names>
            <surname>Kiziltepe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chamberlain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
<string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Doctor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <article-title>Overview of MediaEval 2020 predicting media memorability task: What makes a video memorable?</article-title>
          ,
          <source>in: Working Notes Proceedings of the MediaEval 2020 Workshop</source>
          ,
          <year>2020</year>
. URL: http://ceur-ws.org/Vol-2882/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cohendet</surname>
          </string-name>
          ,
<string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Q.</given-names>
            <surname>Duong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Engilberge</surname>
          </string-name>
          ,
          <article-title>Videomem: Constructing, analyzing, predicting short-term and long-term video memorability</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2531</fpage>
          -
          <lpage>2540</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cohendet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yadati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. Q.</given-names>
            <surname>Duong</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-H. Demarty</surname>
          </string-name>
          ,
          <article-title>Annotating, understanding, and predicting long-term video memorability</article-title>
          ,
          <source>in: Proceedings of the 2018 ACM International Conference on Multimedia Retrieval</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>178</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Kiziltepe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Doctor</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          ,
<string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
<article-title>An annotated video dataset for computing video memorability</article-title>
          ,
          <source>Data in Brief</source>
          <volume>39</volume>
          (
          <year>2021</year>
          )
          <fpage>107671</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Casser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
,
          <article-title>Multimodal memorability: Modeling effects of semantics and decay on video memorability</article-title>
, in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bischof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Frahm</surname>
          </string-name>
          (Eds.),
          <source>Computer Vision - ECCV 2020</source>
          , Springer International Publishing, Cham,
          <year>2020</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matran-Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Halder</surname>
          </string-name>
          ,
<string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <article-title>Overview of the EEG pilot subtask at MediaEval 2021: predicting media memorability</article-title>
          , in: MediaEval Multimedia Benchmark Workshop Working Notes,
          <year>2021</year>
. URL: http://ceur-ws.org/Vol-3181/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Azcona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Moreu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <article-title>Predicting media memorability using ensemble models</article-title>
          ,
          <source>in: Proceedings of MediaEval</source>
          <year>2019</year>
, Sophia Antipolis, France,
          <source>CEUR Workshop Proceedings</source>
          ,
          <year>2019</year>
. URL: http://ceur-ws.org/Vol-2670/.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <article-title>The influence of audio on video memorability with an audio gestalt regulated video memorability system</article-title>
          , in: MediaEval Multimedia Benchmark Workshop Working Notes,
          <year>2021</year>
. URL: http://ceur-ws.org/Vol-3181/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
<string-name>
            <given-names>I.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Friedland</surname>
          </string-name>
          ,
          <article-title>Multi-modal ensemble models for predicting video memorability</article-title>
          ,
          <source>in: Proceedings of the MediaEval 2020 Workshop, CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
. URL: http://ceur-ws.org/Vol-2882/.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>60</volume>
          (
          <year>2017</year>
          )
          <fpage>84</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Dalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Triggs</surname>
          </string-name>
          ,
          <article-title>Histograms of oriented gradients for human detection</article-title>
          ,
          <source>in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)</source>
          , volume
          <volume>1</volume>
          , IEEE,
          <year>2005</year>
          , pp.
          <fpage>886</fpage>
          -
          <lpage>893</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.-C.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Texture unit, texture spectrum, and texture analysis</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>28</volume>
          (
          <year>1990</year>
          )
          <fpage>509</fpage>
          -
          <lpage>512</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Van der Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <article-title>Densely connected convolutional networks</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>4700</fpage>
          -
          <lpage>4708</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>EfficientNet: Rethinking model scaling for convolutional neural networks</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bourdev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torresani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paluri</surname>
          </string-name>
          ,
          <article-title>Learning spatiotemporal features with 3d convolutional networks</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Computer Vision</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>