<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>L. Sweeney); Graham.Healy@dcu.ie (G. Healy); alan.smeaton@dcu.ie
(A. F. Smeaton)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Diffusing Surrogate Dreams of Video Scenes to Predict Video Memorability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorin Sweeney</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Graham Healy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alan F. Smeaton</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Centre for Data Analytics, Dublin City University</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>As part of the MediaEval 2022 Predicting Video Memorability task, we explore the relationship between visual memorability, the visual representation that characterises it, and the underlying concept portrayed by that visual representation. We achieve state-of-the-art memorability prediction performance with a model trained and tested exclusively on surrogate dream images, elevating concepts to the status of a cornerstone memorability feature, and finding strong evidence to suggest that the intrinsic memorability of visual content can be distilled to its underlying concept or meaning, irrespective of its specific visual representation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Related Work</title>
      <p>The natural world is a tempest of sensory threads—from frenzied photons to odious odourants.
As we wade through this storm of complex multi-sensory data, our brain is court master and
king—tying threads into an intelligible internal representation, and exiling all that it deems
unnecessary. What should be remembered, and what should not? The answer is hidden in
the whims of the king. Memorability—the likelihood that a given piece of content will be
recognised upon subsequent viewing—can be viewed as the Rosetta Stone required to decipher
the remembering whims of the brain, which is what ultimately motivates and brings meaning
to its exploration. Additionally, its proximity to the essence of human experience, and “what the
brain deems to be important”, casts it into the territory of proxy measure of human importance
and quintessential media metric.</p>
      <p>
        Although much progress has been made thinning the query-saturated haze that conceals the
landscape of answers mapped by the seminal question: “What makes an image memorable?”
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ], the summit remains out of sight, with 25% of the variance still unaccounted
for [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The shortest path to understanding is through a hurricane of light. Given that we are
visually dominant creatures, with over half of the cortex involved in visual processing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we
naturally expect visual sensory data to exert the greatest influence on memorability. However,
it is important not to be led astray by our brain’s appetite for visual sensory data, as semantic
meaning is known to play a critical role in visual memorability. Richer and more conceptually
distinctive events last longer in memory, and certain semantic categories are inherently more
memorable than others [
        <xref ref-type="bibr" rid="ref5 ref7">5, 7</xref>
        ]. Even though visual memories are stored with an exceptional
fidelity of detail (i.e., configurations and contexts of viewed objects [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]), our performance is poor
when it comes to remembering random patterns unless they take on object-like qualities [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
suggesting that visual memory is not driven entirely by visual details. Further evidence suggests
that visual data is merely a means to conceptual understanding, which is in turn intimately
tied to memory, with conceptual distinctiveness supporting higher fidelity visual long-term
memory representations than perceptual distinctiveness, and influencing memory retention in
a manner that cannot be accounted for by perceptual distinctiveness alone [
        <xref ref-type="bibr" rid="ref10 ref7">10, 7</xref>
        ]. Perceptual
distinctiveness is typically measured within a given object category, and with reference to
variations in low-dimensional, knowledge-agnostic perceptual features (e.g., colour and shape).
Unfortunately, the line between perceptual and conceptual features begins to blur as we move
into higher-dimensional features (e.g., length of torso relative to head size), which become more
category specific and likely to be acquired through visual experience [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], making it difficult to
probe the depth of connection between concept and memorability. However, with the recent
explosion of progress in the image synthesis field, and the release of the open-source text-to-image
diffusion model Stable Diffusion [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], we find ourselves uniquely positioned to assess the impact
of conceptual features on video memorability independent of its perceptual features, with the
exceptional ability to preserve the depth and richness of information inherent to the visual
domain.
      </p>
      <p>We hypothesise that if visual data truly is merely a means to conceptual understanding, and
that it is the concept itself—which is conveyed/represented through the visual data—that holds
the content’s intrinsic memorability, then the inter-video relationship of memorability scores
predicted with ground-truth video frames should be observable in the memorability scores
predicted with synthetic images predicated on purely conceptual video data.</p>
      <p>This paper leverages state-of-the-art image synthesis to facilitate the exploration of our
aforementioned hypothesis, which can be concisely captured as the following question: can the
intrinsic memorability of visual content be distilled to its underlying concept or meaning?</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <p>
        Our experiments were carried out within the purview of subtask 1 of the MediaEval Predicting
Video Memorability task [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], with the Memento10k dataset (comprising 7,000 training
videos, 1,500 validation videos, and 1,500 withheld test videos) acting as our data landscape.
However, before we could set out on our quest for insight, we had to terraform the landscape
by synthesising images that reflect the conceptual essence of the original Memento10k videos. In
order to do so, we leveraged Stable Diffusion, a latent text-to-image diffusion model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Stable
Diffusion is pre-trained on the LAION-5B dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which consists of scraped, non-curated
image-text pairs from the internet, and is capable of generating high-resolution images from
text input.
      </p>
      <p>While the images synthesised using Stable Diffusion are generally high quality in terms of
image resolution, if left unspecified in the input text prompt, the compositional construction of the
synthesised images is often quite unpredictable and hyper-stylised/unrealistic (e.g., cartoonish,
painted, rendered). With the aim of combatting this and guiding the style of synthesised images,
we fine-tuned the stable-diffusion-v1-5 checkpoint to create a style token (mem10kstyle) that could
be appended to prompts. Fine-tuning used 20 real-world photographs (depicted in Figure 1), which
reflect the “in the wild” nature of Memento10k videos, with 1,500 Memento10k video frames
as regularisation images, and ran for a total of 2,200 steps.</p>
      <p>Stable Diffusion requires input prompts in order to generate images, so using each video’s
first caption as a foundation, we built a textual prompt by prepending video action labels,
appending one of three custom prompt modifiers, and finishing with our mem10kstyle token.
Our custom prompt modifiers were tailored to the content depicted in the video to further guide
the image generation process. We then created a dataset we call “Memento10k Surrogate Dream”,
acknowledging that the synthesised images are in fact dream-like surrogates for the videos, by
passing each prompt to our fine-tuned Stable Diffusion model (Figure 2).</p>
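      <p>The prompt construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the helper name build_prompt, the comma separator, and the three modifier strings are hypothetical, since the actual modifier texts are not given here. Only the ordering (action labels, first caption, one modifier, style token) and the mem10kstyle token come from the description.</p>
      <preformat>
```python
# Illustrative sketch only. The modifier texts, the separator and the helper
# name are assumptions; the text specifies just the order of the parts.
STYLE_TOKEN = "mem10kstyle"  # style token named in the text

# Hypothetical stand-ins for the three custom prompt modifiers.
PROMPT_MODIFIERS = {
    "people": "candid photograph of people",
    "nature": "outdoor nature photograph",
    "generic": "realistic photograph",
}


def build_prompt(caption, action_labels, modifier_key):
    """Join action labels, the first caption, one modifier and the style token."""
    parts = []
    # Action labels are prepended to the caption.
    parts.extend(label.strip() for label in action_labels if label.strip())
    # The video's first caption forms the core of the prompt.
    parts.append(caption.strip().rstrip("."))
    # Exactly one content-dependent modifier follows the caption.
    parts.append(PROMPT_MODIFIERS[modifier_key])
    # The style token always comes last.
    parts.append(STYLE_TOKEN)
    return ", ".join(parts)


prompt = build_prompt("A dog runs along the beach.", ["running", "playing"], "nature")
print(prompt)
```
      </preformat>
      <p>Keeping the style token as the final element mirrors the ordering given in the text; how the modifier is chosen for a given video is left open in this sketch.</p>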
      <p>We submitted 5 runs for evaluation in the Predicting Video Memorability task. Each run falls
into one of two categories: Genesis or Surrogate Dream.</p>
      <p>
        Genesis: Approaches trained on vanilla Memento10k data are considered to be Genesis,
and were trained on visual features extracted from unaltered Memento10k video frames. The
runs entitled Mem10k_DenseNet121_Mem10k and Mem10k_DenseNet121_Dream are ImageNet-pretrained
DenseNet121 models fine-tuned (for 50 epochs, with a maximum learning rate of
1e-3, and weight decay of 1e-2) on the middle frame of the Memento10k training videos. The
run Mem10k_CLIP_Ridge_Regression_Mem10k is a Bayesian Ridge Regressor (BRR) fit
with default sklearn [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] parameters on stacked CLIP visual embeddings (extracted from the
first, middle, and last video frames) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
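      <p>As a rough illustration of the feature preparation for the CLIP-based run, the sketch below picks the first, middle, and last frame indices of a video and stacks three per-frame vectors into a single feature vector. It rests on two assumptions that the text does not confirm: that “stacked” means simple concatenation, and that the middle frame is taken at index num_frames // 2. The function names and the toy two-dimensional vectors are hypothetical stand-ins for real CLIP embeddings.</p>
      <preformat>
```python
def key_frame_indices(num_frames):
    """Return the indices of the first, middle and last frame of a video."""
    if not num_frames >= 1:
        raise ValueError("a video needs at least one frame")
    # Assumption: the middle frame is the one at integer half of the length.
    return [0, num_frames // 2, num_frames - 1]


def stack_embeddings(frame_embeddings):
    """Concatenate per-frame vectors into one flat feature vector.

    Assumption: 'stacking' is read here as plain concatenation.
    """
    stacked = []
    for vector in frame_embeddings:
        stacked.extend(vector)
    return stacked


indices = key_frame_indices(90)
# Toy 2-dimensional vectors standing in for real CLIP visual embeddings.
toy_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
features = stack_embeddings(toy_embeddings)
print(indices, features)
```
      </preformat>
      <p>One such feature vector per video, paired with its memorability score, is the kind of input a regressor such as scikit-learn's BayesianRidge could then be fitted on with its default parameters.</p>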
      <p>Surrogate Dream: Approaches trained on images generated with our fine-tuned
Stable Diffusion model are considered to be Surrogate Dream, and with the exception of
memorability scores, were trained exclusively on surrogate visual data. The runs entitled
Dream_DenseNet121_Mem10k and Dream_DenseNet121_Dream are ImageNet-pretrained
DenseNet121 models fine-tuned (for 50 epochs, with a maximum learning rate of 1e-3, and
weight decay of 1e-2) on our Memento10k Surrogate Dream dataset.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Discussion and Outlook</title>
      <p>
        Table 1 shows the Spearman scores for our runs from subtask 1, with Genesis/Surrogate
Dream indicating whether the approach was trained on ground-truth video frames or
synthesised images respectively, and the final token Mem10k/Dream of each approach
indicating whether it was tested on ground-truth video frames or synthesised images
respectively. In the broader context of memorability prediction, all of our runs sit firmly in
state-of-the-art territory, with two of our runs marginally outperforming the hitherto
state-of-the-art memorability prediction model SemanticMemNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Although our run entitled
Mem10k_CLIP_Ridge_Regression_Mem10k achieved the highest Spearman score, the most
notable aspect of our results centres on our run entitled Dream_DenseNet121_Dream,
which was both trained and tested exclusively on surrogate dream images, and not only
outperforms our control run entitled Mem10k_DenseNet121_Mem10k, but achieves a
better-than-state-of-the-art score of 0.664.
      </p>
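      <p>For readers unfamiliar with the evaluation metric, the sketch below shows one way to compute a Spearman rank correlation between predicted and ground-truth memorability scores: the Pearson correlation of the two rank vectors, with tied values sharing their average rank. It is a plain-Python illustration rather than the task's official scoring code, and the four toy scores are invented for the example.</p>
      <preformat>
```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        # Extend the run while the next sorted value equals the current one.
        end = start
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, ground_truth):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(predicted), average_ranks(ground_truth))


# Toy scores, invented for illustration; both lists share the same ordering.
truth = [0.91, 0.75, 0.83, 0.60]
preds = [0.88, 0.70, 0.72, 0.65]
rho = spearman(preds, truth)
print(rho)
```
      </preformat>
      <p>Because the metric depends only on rank order, a model can score well even when its predicted values are systematically shifted relative to the ground truth, which is relevant when comparing runs tested on different kinds of images.</p>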
      <p>The distributions of memorability score predictions for vanilla and surrogate dream
approaches are shown in Figure 3. When combined with the evaluation scores, this provides
strong, first-of-its-kind evidence that visual data is merely a means to conceptual understanding,
and that it is the concepts themselves, which are conveyed/represented through the visual
data, that hold the content’s intrinsic memorability.</p>
      <p>Graph B in Figure 3 tentatively suggests that surrogate dream images are more memorable
than ground-truth video frames by virtue of the left skew in predicted scores from our run
trained on Mem10k frames and tested on surrogate dream images. However, detailed exploration
and investigation into the nature and composition of images in our Memento10k Surrogate
Dream dataset is warranted and should be a focus of future research.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work was supported by Science Foundation Ireland under Grant Number SFI/12/RC/2289_P2, co-funded by the European
Regional Development Fund.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <article-title>What makes an image memorable</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          , IEEE,
          <year>2011</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Newman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Casser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <article-title>Multimodal memorability: Modeling effects of semantics and decay on video memorability</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bischof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Frahm</surname>
          </string-name>
          (Eds.),
          <source>Computer Vision - ECCV 2020</source>
          , Springer International Publishing, Cham,
          <year>2020</year>
          , pp.
          <fpage>223</fpage>
          -
          <lpage>240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <article-title>Leveraging audio gestalt to predict media memorability</article-title>
          , in: MediaEval Multimedia Benchmark Workshop Working Notes,
          <year>2020</year>
          . URL: http://ceur-ws.org/Vol-2882/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <article-title>The influence of audio on video memorability with an audio gestalt regulated video memorability system</article-title>
          , in: MediaEval Multimedia Benchmark Workshop Working Notes,
          <year>2021</year>
          . URL: http://ceur-ws.org/Vol-3181/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Isola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <article-title>What makes a photograph memorable</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>36</volume>
          (
          <year>2013</year>
          )
          <fpage>1469</fpage>
          -
          <lpage>1482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Snowden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Snowden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Troscianko</surname>
          </string-name>
          ,
          <article-title>Basic vision: an introduction to visual perception</article-title>
          , Oxford University Press,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Konkle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Brady</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <article-title>Conceptual distinctiveness supports detailed visual long-term memory for real-world objects</article-title>
          ,
          <source>Journal of Experimental Psychology: General</source>
          <volume>139</volume>
          (
          <year>2010</year>
          )
          <fpage>558</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T. F.</given-names>
            <surname>Brady</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Konkle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <article-title>Visual long-term memory has a massive storage capacity for object details</article-title>
          ,
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>105</volume>
          (
          <year>2008</year>
          )
          <fpage>14325</fpage>
          -
          <lpage>14329</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wiseman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Neisser</surname>
          </string-name>
          ,
          <article-title>Perceptual organization as a determinant of visual recognition memory</article-title>
          ,
          <source>The American Journal of Psychology</source>
          (
          <year>1974</year>
          )
          <fpage>675</fpage>
          -
          <lpage>681</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Huebner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Gegenfurtner</surname>
          </string-name>
          ,
          <article-title>Conceptual and visual features contribute to visual memory for natural images</article-title>
          ,
          <source>PLoS One</source>
          <volume>7</volume>
          (
          <year>2012</year>
          )
          <elocation-id>e37575</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Schyns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Goldstone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Thibaut</surname>
          </string-name>
          ,
          <article-title>The development of features in object concepts</article-title>
          ,
          <source>Behavioral and Brain Sciences</source>
          <volume>21</volume>
          (
          <year>1998</year>
          )
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>10684</fpage>
          -
          <lpage>10695</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Demarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fosco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Halder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matran-Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Smeaton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sultana</surname>
          </string-name>
          ,
          <article-title>Overview of the MediaEval 2022 predicting video memorability task</article-title>
          , in: MediaEval Multimedia Benchmark Workshop Working Notes,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beaumont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vencu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wightman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cherti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Coombes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mullis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wortsman</surname>
          </string-name>
          , et al.,
          <article-title>Laion-5b: An open large-scale dataset for training next generation image-text models</article-title>
          ,
          <source>arXiv preprint arXiv:2210.08402</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          , et al.,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>