<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Instagram Images and Videos Popularity Prediction: A Deep Learning-Based Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Massimiliano Viola</string-name>
          <email>massimiliano.viola@studenti.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Brunelli</string-name>
          <email>luca.brunelli@statwolf.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gian Antonio Susto</string-name>
          <email>gianantonio.susto@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Statwolf Data Science</institution>
          ,
          <addr-line>Padova, IT</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università degli Studi di Padova</institution>
          ,
          <addr-line>Padova, IT</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, social media platforms have seen tremendous growth in terms of the number of users, forms of interaction, and diversity of content. While these channels are purely a source of entertainment for many users, for others they represent the main source of revenue or of advertising for their products and services. In order to capture users' attention, companies and professionals aim at achieving high popularity for their posts. In this work, we aspire to predict post popularity on the Instagram platform through Machine Learning approaches, with the goal of presenting a methodological tool that could provide useful information for post performance optimization. While previous contributions on the subject addressed the generic popularity of a post on the platform, we focus on post popularity on a specific profile, using only the visual content related to the post (image or video). We describe in detail the process and workflow for designing a measure of popularity that is consistent even over a long time frame. Furthermore, we take advantage of state-of-the-art Convolutional Neural Networks and provide interpretability traits for their predictions, a quality that is nowadays highly welcomed in the industry. Lastly, we use a situation of scarce video data to experiment with ways of performing mixed training with both images and videos, providing problem-independent ideas and architectures that can potentially be applied to other video classification tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Computer Vision</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Popularity Prediction</kwd>
        <kwd>Social Network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Social media platforms have become, in recent years, fundamental channels for marketing and advertising; post popularity is considered a good proxy for the success of a marketing strategy on social media: for this reason, predicting post popularity is not only of academic interest, but also crucial from a business and marketing perspective. With the possibility of reaching thousands of users with ease, consistently posting the right trending content can translate into a significant increase in follower interaction and, consequently, sales for an emerging or even established brand.</p>
      <p>In this work, we aim to predict Instagram post popularity via Machine
Learning (ML) approaches, with the goal of presenting a methodological tool
that could provide useful information for post performance optimization. In the
proposed approach we exploit state-of-the-art Convolutional Neural Networks
(CNNs) and provide interpretability methods for their predictions. Lastly, we
use a situation of scarce video data to experiment with ways of performing mixed
training with both images and videos, providing problem-independent ideas and
architectures that could potentially be applied to other video classification tasks.</p>
      <p>The rest of the paper is organized as follows: in Section 2 we provide a literature review on post popularity prediction and propose a new metric, called Popularity Rate; in Section 3 we formalize the ML task at hand. In Section 4 the proposed approach is presented, while Section 5 details the experimental part of this work: the real-world dataset employed, the experimental settings, and the results. Finally, Section 6 reports the conclusions of this work and discusses potential future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Popularity Metric</title>
      <p>On the Instagram platform, likes and comments are arguably the most quantifiable components of a post's success. Despite this, there is no universal measure of popularity, and the choice of the metric used to describe it, starting from the above ingredients, is itself an interesting subject of study.</p>
      <p>
        In recent works on the topic, [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] considered dividing the sum of likes
and comments by the number of followers of the profile and treated the problem
as a regression task, whereas [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] formalized it as a binary classification by taking
the best and worst 25% of each user's posts sorted by number of likes. Finally,
also [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] considered two popularity classes, but used a moving average window on
the likes trend (defined as Likes Moving Average) to dynamically determine whether a
user's post performed better than its latest K predecessors.
      </p>
      <p>All of the aforementioned works aimed to measure the popularity of a post in absolute terms within the Instagram platform, using multi-user datasets, often in combination with contextual information. In our work, however, the goal is to interpret popularity within a specific profile, being as accurate as possible in predicting how a well-defined audience would respond to a given post, thereby allowing practitioners who adopt our method to accurately choose the content to publish.</p>
      <p>
        Focusing on a single Instagram profile implies that we need to look over a very long time frame in order to collect enough data, which intrinsically leads to major obstacles that did not exist for the metrics used in previous works. Indeed, in this setting absolute values are typically not meaningful and/or reliable: the follower count used to normalize across different profiles is not a reasonable quantity when modeling the behaviour of a single profile, while likes and comments grow by several orders of magnitude if the time interval is not restricted, making the most recent posts always the most popular. When working over the long term, it therefore makes more sense to normalize by a quantity that depends on the time when the post was published, as suggested by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        From a business point of view, an important metric is certainly the engagement, obtained by dividing the sum of likes and comments of a post by its total views (also known as impressions). This metric is crucial for keeping track of sponsored posts, which may occur frequently for a brand or business profile, and it also increases the predictive potential by providing more precise information.
The importance of this data can be observed for example in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which uses the
impressions directly as a metric given appropriate auxiliary information. Unlike
other social media platforms, however, the number of impressions is not publicly
available to download via the Instagram API.
      </p>
      <p>Even with the privileges to access private metrics (as in the case of impressions), it is not easy to retrieve data from past years. In many cases, it is necessary to rely on third-party applications, but these can generally fetch the various metrics only up to 2 years in the past. Since not all entities have been farsighted in this respect, impressions data is typically known only for a very short period of time; thus, using impressions would often force us to severely limit the time interval in which to collect data from a specific profile.</p>
      <p>Returning therefore to the idea of discounting likes and comments not by the number of impressions but rather by the number of followers, we propose the following metric for the popularity of a post p, referred to as Popularity Rate (PoR):

PoR(p) = (l(p) + c(p)) / f(t_p)    (1)

where l(p) and c(p) are respectively the likes and comments of a post p, while f(t_p) is the number of followers of the profile at the time t_p when the post p was published.</p>
      <p>
        Since the number of followers at the upload date is not an attribute of the post and does not come with the associated metadata, we recommend exploiting follower trends provided by external services that track Instagram metrics (in this work, we used the web service Not Just Analytics [<xref ref-type="bibr" rid="ref15">15</xref>]); we resorted to interpolation of the known data points provided by the external service to obtain an estimate of f(t_p). Even so, the follower count may not be available as far back as the desired date, but, unlike with impressions, the missing data can be reconstructed in a robust way as explained in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
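      <p>As a minimal illustration, the following Python sketch estimates f(t_p) by linear interpolation of the follower trend scraped from the external service and then computes the PoR of Eq. (1); all variable names are hypothetical.</p>
      <preformat>
import numpy as np

def estimate_followers(post_times, known_times, known_followers):
    # Linearly interpolate the follower trend (data points scraped from an
    # external tracking service) at the posts' upload times.
    return np.interp(post_times, known_times, known_followers)

def popularity_rate(likes, comments, followers_at_post_time):
    # Popularity Rate of Eq. (1): PoR(p) = (l(p) + c(p)) / f(t_p)
    return (likes + comments) / followers_at_post_time
      </preformat>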
      <p>
        In Fig. 1.a, for the case study described in detail in Section 5, we compare dividing the sum of likes and comments by the number of followers with dividing it by the total impressions, over a time period in which both metrics are available. Since we observe a good linear correlation (0.806) between them over a two-year period, the substitute metric proves trustworthy: we tolerate a small error in order to extend the number of posts that we can consider and exploit in our ML approach.
      </p>
      <p>
        At this point, we choose to tackle the problem as a classification task rather than a regression one, as is typically done in the literature [<xref ref-type="bibr" rid="ref3 ref4">3,4</xref>]. This choice is motivated by two main factors: (i) for social content managers/creators it is typically sufficient to have classes of popularity associated with posts, without a high level of granularity; (ii) from a ML perspective, the classification formalization makes the problem tractable, as precise regression models could be difficult to develop in this context.
      </p>
      <p>We exploit the PoR defined in the previous Section to derive Popularity Classes (PoCs), i.e. non-overlapping classes defined on the PoR metric as a discretization of that continuous quantity. In this study, we consider both the classification problem with 2 PoCs and with 3, but the same procedure can be applied with any number of classes.</p>
      <p>
        Initially, we defined the PoCs by dividing all the posts equally, using the median and the terciles of the PoR distribution as splitting points to delineate, respectively, the 2- and 3-class labels. After this operation, we realized that many of the top-class posts were very old, whereas the latest ones were not as popular; this was a direct consequence of the fact that the PoR is generally higher when a profile has fewer followers. This phenomenon is described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] by
comparing the average PoR for different profiles: this downward trend is assumed to be due both (i) to the fact that early followers are the most interested in the content and (ii) to the growth of the platform itself, which naturally exposes users to more content and gradually reduces interest in a specific profile.
      </p>
      <p>
        For this reason, inspired by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we again divide into the 2 or 3 PoCs using the median and terciles, but this time evaluating these statistical indicators within a rolling time window of several weeks. This procedure rests on the assumption that a local label assignment is better than a full-time-horizon one, since: (i) a post in a given time frame is only compared with the nearest ones, under very similar environmental conditions; (ii) in this way, we are more robust to errors and anomalies in the follower estimate, since the effect of a bad evaluation is only observed locally. The direct comparison of the two alternatives, on the experiment of Section 5 and for the 3-class problem, is shown
in Fig. 2.
      </p>
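      <p>A possible implementation of this local label assignment is sketched below, assuming a pandas DataFrame indexed by publication timestamp and an 8-week window (the exact width is a tunable choice; the text above only specifies "several weeks").</p>
      <preformat>
import pandas as pd

def rolling_poc_labels(df, window_weeks=8, n_classes=3):
    # Assign each post the Popularity Class obtained by comparing its PoR
    # with the quantiles (median or terciles) of a centered rolling window.
    df = df.sort_index()  # index: publication timestamp, column "por"
    half = pd.Timedelta(weeks=window_weeks) / 2
    labels = []
    for t, por in df["por"].items():
        local = df.loc[t - half : t + half, "por"]
        # split points: the median for 2 classes, the terciles for 3
        splits = local.quantile([i / n_classes for i in range(1, n_classes)])
        labels.append(int((por > splits).sum()))  # 0 = least popular class
    return pd.Series(labels, index=df.index, name="poc")
      </preformat>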
      <p>Everything explained so far is done separately for images and videos: we make this choice because, after evaluating several Instagram profiles, we have seen that these two media types behave differently in terms of PoR (and consequently PoCs). This difference is particularly evident in our case study, where posts containing only image content perform on average 45% better in PoR than posts with at least one video (see Fig. 1.b, which refers to the case study of Section 5).</p>
      <p>Since we use the sliding window independently for videos and images, the result is a class-balanced dataset for both media types; this unfortunately has a drawback, since a mixed post with both images and videos could end up with different labels for its different types of media.</p>
    </sec>
    <sec id="sec-3">
      <title>Proposed Approach</title>
      <sec id="sec-3-1">
        <title>Data Preprocessing</title>
        <p>Once all the media related to each post (image or video with relative metadata) have been collected and the PoCs created, a cleaning operation on the dataset needs to be performed. The main problem is the presence of duplicate or near-duplicate content, published at various times, which could lead to two different issues: (i) free predictions in the validation sets; (ii) conflicting labels in the training phase. Removal of duplicates is performed as follows.</p>
        <p>
          For image content, we input pictures into a pre-trained CNN [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and take the activations of the last layer before the classification head. A Nearest Neighbors search is then performed in the embedding space of the CNN: every time two images have a Euclidean distance below a certain threshold, they are declared duplicates and one of them is removed. A good initial guess for the threshold can be identified by plotting the histogram of the distances between each image and its closest neighbor and picking the value that cuts off the left tail. By visually inspecting the duplicate pairs, this value can then be adjusted as needed (see Fig. 3.a as an example for the case study of Section 5). For video content, the same can be done cheaply but effectively by looking at the video thumbnail and attributes such as length and frame rate.
        </p>
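        <p>The following sketch illustrates the image deduplication step with a Keras EfficientNetB0 backbone and scikit-learn's NearestNeighbors; the threshold value of 5 is the one adopted in the case study of Section 5 and should be tuned per profile.</p>
        <preformat>
import tensorflow as tf
from sklearn.neighbors import NearestNeighbors

# Pre-trained backbone without the classification head; global average
# pooling yields one 1280-dimensional embedding per image. Raw pixel
# values in [0, 255] are fine: Keras EfficientNet normalizes internally.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False, weights="imagenet", pooling="avg")

def find_duplicate_pairs(images, threshold=5.0):
    # images: float array of shape (n, 224, 224, 3)
    emb = backbone.predict(images)
    nn = NearestNeighbors(n_neighbors=2).fit(emb)
    dist, idx = nn.kneighbors(emb)
    # column 0 is each point's distance to itself; column 1 is the
    # closest distinct neighbor
    return [(i, int(idx[i, 1])) for i in range(len(emb))
            if threshold > dist[i, 1]]
        </preformat>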
        <p>Whenever two duplicates are found and their labels differ, priority is given to single posts over multiple ones (carousels). If both are of the same type, the most recent one is always preferred.</p>
        <p>Carousels are not very common as a type of post (Fig. 3.b), but since they include various media simultaneously they can be important for enriching the training set, so we suggest inspecting the data and keeping only the relevant content.</p>
        <p>After performing all the preprocessing steps explained above, we obtain the final dataset, on which we can now train various classification models to predict the PoCs of a post. In the following, we present a series of architectures that allow us to handle images and videos with a separate or a mixed approach.</p>
        <p>We first show a solution that uses only posts with images, these being in general the most frequent type, as can be seen in Fig. 3.b. Although videos play a secondary role on Instagram, more and more companies have recently started to create content of this type. We therefore consider it essential in this work to take them into account, because: (i) on a single profile, even over a very long period of time, the published images may not be enough to allow optimal training, so adding the (even few) videos increases the total size of the dataset and may be useful from a modeling perspective; (ii) it is definitely a plus for companies to have a metric for evaluating video content as well.</p>
        <p>
          Thus, we then show a model based only on videos, to be used as a benchmark given the scarcity of data, and finally we implement two different mixed solutions with the objective of improving results on videos by taking advantage of the image dataset, which, as mentioned, is generally much larger. Performing mixed training in a situation of scarce data for one of the sources, however, is not a cutting-edge research field. The only work we were able to find on the topic dates back to 2014 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and explores the idea of doing Transfer Learning from images to a video recognition task. We believe the reason for such little interest in recent years is that modern pre-trained CNNs have become so robust that image classification or simple video recognition can be performed and solved at a small scale, sometimes even by looking at a single frame of the sequence, without needing to combine the resources.
        </p>
        <p>
          Moreover, our situation is very peculiar also because the videos we are dealing with are not smooth: we find rapid scene transitions, light and dark effects, and a single frame may or may not tell something about the content of the video. Nevertheless, we propose 2 different approaches that differ in the methodology used to adapt one type of data to the other.
        </p>
        <p>Image Classification. As said, working on a single profile, even if only on images, does not give us enough data to train a CNN from scratch; for this reason we use a pre-trained model and perform a Transfer Learning procedure from it. Regarding the choice of the pre-trained model, we opt for the EfficientNet [<xref ref-type="bibr" rid="ref10">10</xref>] family for their speed and size. In particular, we eventually use the smallest model, EfficientNetB0, pre-trained on the ImageNet [<xref ref-type="bibr" rid="ref11 ref12">11,12</xref>] dataset, since early trials showed that increasing the complexity did not bring any significant benefit in terms of accuracy. Dropout with a 0.2 rate is applied between the pre-trained backbone of the EfficientNetB0 and the classification layer with either one or three outputs, depending on the number of classes we use. Medium data augmentation is performed to further regularize: horizontal flips, random translations up to 10% in both height and width, random 10% zoom, and random brightness, contrast, and saturation changes. The Adam optimizer with a 0.001 learning rate is used in combination with the standard cross-entropy loss, while the image size is set to 224x224. The models are trained for 15 epochs keeping the base frozen to avoid overfitting.
        </p>
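        <p>A minimal Keras sketch of this transfer-learning setup follows; brightness and saturation augmentations are omitted for brevity, and the single-output sigmoid variant for the binary task is noted in a comment.</p>
        <preformat>
import tensorflow as tf

def build_image_classifier(n_classes=3):
    # Frozen EfficientNetB0 backbone, 0.2 dropout, small softmax head.
    # For the binary task, use Dense(1, activation="sigmoid") and a
    # binary cross-entropy loss instead.
    augment = tf.keras.Sequential([
        tf.keras.layers.RandomFlip("horizontal"),
        tf.keras.layers.RandomTranslation(0.1, 0.1),
        tf.keras.layers.RandomZoom(0.1),
        tf.keras.layers.RandomContrast(0.1),
    ])
    base = tf.keras.applications.EfficientNetB0(
        include_top=False, weights="imagenet", pooling="avg")
    base.trainable = False  # keep the backbone frozen to avoid overfitting
    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = augment(inputs)
    x = base(x, training=False)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
        </preformat>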
        <p>Video Classification. To classify the video content, we choose a popular hybrid architecture [<xref ref-type="bibr" rid="ref8">8</xref>], with a pre-trained CNN that extracts meaningful spatial information from the video frames and a Recurrent Neural Network (RNN) that models the temporal relationship between them. Known as CNN-RNN, this method generally performs very well because it rests on the simple assumption that a video is nothing but an ordered sequence of frames (images). To achieve this, videos are preprocessed by taking one frame every second for the first 15 seconds; if the video is too short, the time interval between frames is reduced. Using an EfficientNetB0 to extract feature vectors from the resized 224x224 frames, the input has shape 15x1280. The architecture briefly described earlier consists of a couple of Gated Recurrent Unit (GRU) [<xref ref-type="bibr" rid="ref7">7</xref>] layers with 8 and 6 units respectively, the first one returning sequences, followed by a dense layer with 8 neurons and the classification head. Regularization is applied via 0.2 dropout inside the GRUs and before the dense layer of size 8. The optimizer, loss function, and number of epochs are the same as before, and we won't repeat these details from this point forward.
        </p>
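        <p>The frame sampling and the CNN-RNN head can be sketched as follows, assuming the per-frame EfficientNetB0 feature vectors have already been extracted.</p>
        <preformat>
import numpy as np
import tensorflow as tf

MAX_FRAMES, EMB_DIM = 15, 1280  # one frame per second for 15 seconds

def sample_frames(frames):
    # Take MAX_FRAMES evenly spaced frames; for clips shorter than 15 s
    # this effectively reduces the interval between sampled frames.
    idx = np.linspace(0, len(frames) - 1, MAX_FRAMES).astype(int)
    return frames[idx]

def build_cnn_rnn(n_classes=3):
    # Two GRU layers (8 and 6 units, the first returning sequences),
    # dropout 0.2 inside the GRUs and before the dense layer of size 8.
    inputs = tf.keras.Input(shape=(MAX_FRAMES, EMB_DIM))
    x = tf.keras.layers.GRU(8, return_sequences=True, dropout=0.2)(inputs)
    x = tf.keras.layers.GRU(6, dropout=0.2)(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.Dense(8, activation="relu")(x)
    return tf.keras.Model(
        inputs, tf.keras.layers.Dense(n_classes, activation="softmax")(x))
        </preformat>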
        <p>Image to Video. The first mixed approach is based on the idea that an image can be thought of as the minimal representation of a static video with all identical frames. For this reason, it makes sense to take the exact same CNN-RNN architecture used with video content only and increase the number of training samples, as a regularization factor, by using static frame sequences generated from images. We are aware that this monotony is not well represented within real videos, but this is done mostly to improve spatial rather than temporal information.
        </p>
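        <p>Generating a static sequence from a single image then amounts to tiling its embedding along the time axis, as in this small sketch.</p>
        <preformat>
import numpy as np

def image_to_static_sequence(image_embedding, num_frames=15):
    # Tile one (1280,) image embedding into a "static video" of identical
    # frames of shape (15, 1280), compatible with the CNN-RNN input.
    return np.repeat(image_embedding[np.newaxis, :], num_frames, axis=0)
        </preformat>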
        <p>Video to Image. The second approach is the opposite of the previous one, a video-to-image one, meaning we try to summarize the information within a video by working around the time component and focusing mainly on the spatial one. The idea we propose is to create video embeddings that have the same shape as those from one image, and then train a model on the combination of both using these features. To do this, we first load and preprocess the video frames as we have done so far for CNN-RNN-based architectures, then we reduce the temporal dimension by applying an aggregate function along the time axis, leaving us with an embedding vector of shape 1280 for each video. Regarding the last step, we find taking the maximum to be the most effective among the standard aggregate functions, in the same way that max pooling is often preferred to average pooling in CNNs. In terms of convolutions, this operation means taking the maximum value of a certain pattern or feature map across the frames of the whole video, on the claim that extreme values are the ones that give a reasonable representation. Of course, the more static the video, the more this feature vector resembles a single frame, while if the video involves a lot of different scenes, it becomes more difficult to interpret. To classify, we opted for simplicity: a single dense layer with one or three outputs and a 0.2 dropout, just as we did for images; our prediction is thus the activation function of the weighted sum of these features.</p>
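        <p>The max aggregation over the time axis and the single-layer classifier can be sketched as follows.</p>
        <preformat>
import tensorflow as tf

def video_to_image_embedding(frame_features):
    # Collapse the (15, 1280) frame features along the time axis with a
    # maximum, analogous to max pooling over time.
    return frame_features.max(axis=0)  # shape (1280,)

def build_embedding_classifier(n_classes=3):
    # A single dense layer with 0.2 dropout, trained on image embeddings
    # and max-aggregated video embeddings together.
    inputs = tf.keras.Input(shape=(1280,))
    x = tf.keras.layers.Dropout(0.2)(inputs)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
        </preformat>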
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results: a Real-World Case Study</title>
      <p>We tested the proposed approach in a real-world scenario, with the contribution of a leading trademark in the production of sports and leisure equipment, which gave us access to its Instagram profile, long active on the platform. Using the metadata collected from the profile and the follower trend obtained from an external service, we were able to build a 5-year-long dataset that we used to calculate the PoR via (1), which was then divided first into 2 and then into 3 PoCs. After that, given the 1613 images and 575 videos related to the various posts, we performed the preprocessing by: (i) loading and resizing images and thumbnails to 224x224, extracting 1280-dimensional feature vectors from an EfficientNetB0 and finally opting for a threshold of 5 (see Fig. 3.a); (ii) manually inspecting carousels to exclude logos and product descriptions. Preprocessing eventually discarded 144 images and 19 videos, leaving us with a fairly class-balanced dataset of 1469 pictures and 556 videos, for a total of 1449 unique posts.</p>
      <sec id="sec-4-1">
        <title>Preliminary Analysis</title>
        <p>
          With comparable labels, a noteworthy analysis was to look for patterns in the post history. We wondered whether similar images had similar labels and whether there were types of images and products that people particularly liked or disliked. To answer these questions, we took images and video thumbnails, extracted the embeddings from an EfficientNetB0 in the same way explained in Section 4.1, and then projected the high-dimensional feature vectors into a 2d space using t-SNE [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. In Fig. 4 we present the results for three classes: impressive was the ability of the pre-trained EfficientNet to identify characteristic product shapes and implicitly group them.
        </p>
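        <p>A sketch of this projection with scikit-learn's t-SNE follows; the perplexity value is an assumption, not a parameter tuned in our experiments.</p>
        <preformat>
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def project_embeddings(embeddings, labels):
    # embeddings: (n_posts, 1280) EfficientNetB0 features of images and
    # video thumbnails; labels: the corresponding Popularity Classes.
    pts = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(embeddings)
    plt.scatter(pts[:, 0], pts[:, 1], c=labels, cmap="viridis", s=8)
    plt.colorbar(label="Popularity Class")
    plt.show()
        </preformat>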
        <p>Searching for parts of the plane where the mean label of a significant number of points was very high or low, we found some regions that confirmed the presence of popular and unpopular patterns in the image content. At the same time, the noise is clearly visible and the two extreme classes are often adjacent.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Validation Scheme</title>
        <p>With image and video datasets as small as ours, validation became challenging due to the non-negligible variance of the results even for a fixed set of parameters and seed. For this reason, a single StratifiedKFold with 5 splits was generally not stable, so we chose to run each experiment three times as our validation scheme. The metric we monitored in each run was the total accuracy over the 5 splits, and we averaged the three results as a final measure of performance. During training, for each fold and in each of the three runs, a callback saved the weights of the model with the best validation accuracy. While this setup still left room for uncertainty, we believe it reduced it enough to safely compare results across different experiments and model architectures.</p>
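        <p>The validation loop can be sketched as follows; build_model stands for any of the model constructors above, and the checkpoint filenames are hypothetical.</p>
        <preformat>
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

def evaluate(build_model, X, y, n_runs=3, n_splits=5, epochs=15):
    # Run 5-fold stratified validation three times; in every fold a
    # callback keeps the weights with the best validation accuracy, and
    # the total accuracy over the 5 splits is averaged over the runs.
    run_scores = []
    for run in range(n_runs):
        correct = 0
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=run)
        for fold, (tr, va) in enumerate(skf.split(X, y)):
            model = build_model()
            path = f"run{run}_fold{fold}.h5"
            ckpt = tf.keras.callbacks.ModelCheckpoint(
                path, monitor="val_accuracy",
                save_best_only=True, save_weights_only=True)
            model.fit(X[tr], y[tr], validation_data=(X[va], y[va]),
                      epochs=epochs, callbacks=[ckpt], verbose=0)
            model.load_weights(path)
            preds = model.predict(X[va]).argmax(axis=1)
            correct += int((preds == y[va]).sum())
        run_scores.append(correct / len(y))
    return float(np.mean(run_scores))
        </preformat>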
      </sec>
      <sec id="sec-4-3">
        <title>Image Classification</title>
        <p>
          Validating in the way described earlier, we achieved an average accuracy of 0.54 with three classes and of 0.72 for the binary classification task, with a top-2 accuracy for the multi-class model peaking at 85%. Considering that boundary labels, both in the binary and multi-class cases, were understandably often confused given the nature of the task, the results seemed very satisfying. The confusion matrices of the image classification models are reported in Fig. 5; the standard deviations are computed between different experiments, not single folds.
        </p>
        <p>Video Classification. Our case study employed videos heavily compared to many other profiles, so much so that in one particular 1-year period the fraction of video content reached a remarkable 43%. The total number of videos was 556, and in 30% of the cases we had to take more than one frame per second because the videos were shorter than 15 seconds. We achieved an accuracy of 0.53 with three classes and 0.70 for the binary classification task. The confusion matrices, reported in Fig. 6, make it immediately clear that the multi-class model was quite poor and suffered from a low recall on the intermediate class, very often predicting only the extreme ones. Furthermore, this architecture had the substantial problem of being over-parameterized: the first GRU layer has 30960 parameters, which, in combination with the dataset being small and having no way of applying data augmentation, made training a trustworthy model really hard.
        </p>
        <p>Image to Video. Validating on the same folds as the video baseline, we achieved an accuracy of 0.54 for the multi-class problem and 0.70 for the binary classification task. While we did not observe the desired increase in accuracy, we can see from the confusion matrices in Fig. 7 that the predictions changed significantly. In particular, it seems that the static videos generated from images made the normal video predictions shift towards the lower classes, reducing certain types of errors but introducing new ones. The takeaway of this experiment is that this kind of mixed training had a clearly visible effect on the predictions, and video classification could potentially benefit even from static sequences of frames generated from images.
        </p>
        <p>Video to Image. Validating once more on the same video folds as the CNN-RNN, we achieved an accuracy of 0.54 for the multi-class problem and 0.71 for the binary classification task, with the corresponding confusion matrices reported in Fig. 8. Again, we did not see a significant improvement, but we were positively surprised by the results in this setup, considering that we were drastically reducing the number of parameters needed for video classification, as well as condensing the temporal information, while maintaining the same performance.
        </p>
        <p>Interpretability. Given the strongly applicative and business-oriented nature of our work, we believe explaining predictions is as important as solid modeling. Here we present some interpretability examples for the image classification models obtained with Grad-CAM [<xref ref-type="bibr" rid="ref14">14</xref>], a technique for producing visual explanations of decisions from CNNs. In a nutshell, the Grad-CAM algorithm creates a class activation heatmap to superimpose on the original image, representing a coarse localization map that highlights the important regions for predicting the specific class. In Fig. 9, we can see it in action on three images [<xref ref-type="bibr" rid="ref17">17</xref>]: for confidentiality reasons these are not sampled from our dataset, but they are similar enough to allow the comparison. The heatmaps shown refer to the top class of a multi-class classification model: as we can see, the focus is well located on the meaningful components and areas.
        </p>
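        <p>A compact Grad-CAM sketch for a Keras model follows; last_conv_layer_name must point at the last convolutional feature map of the backbone (for the Keras EfficientNetB0 this is typically the layer named "top_activation", an assumption to verify on the actual model, which may require fetching the layer from the nested backbone submodel).</p>
        <preformat>
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index):
    # Gradients of the class score with respect to the last convolutional
    # feature map give per-channel weights; the weighted map, passed
    # through a ReLU and normalized, is the coarse localization heatmap.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))  # channel importance
    heatmap = tf.reduce_sum(conv_out[0] * weights, axis=-1)
    heatmap = tf.nn.relu(heatmap)
    heatmap = heatmap / (tf.reduce_max(heatmap) + 1e-8)
    return heatmap.numpy()  # upsample and overlay on the original image
        </preformat>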
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In this paper, we presented an approach that first defines a popularity metric for an Instagram post using data that is always easily accessible, and then classifies the popularity using Deep Learning-based models. We then showed various solutions that allowed us to also take into account the small but significant amount of data derived from posts containing videos. Keeping in mind that popularity classes are very noisy and thus far from perfectly separable, the results are very promising; in particular, we demonstrated that we were able to reliably isolate the classes with low and high popularity.</p>
      <p>Some future research directions are foreseen: (i) including historical/contextual information as a feature for the various models, since a particular image/video that was successful in the past is not necessarily successful at the present time (and vice versa); (ii) strengthening the training of the video networks by applying data augmentation: instead of extracting the frame features before training, frames can be processed in real time and augmented as if they were images; (iii) while the presented approach was designed for Instagram, we think that both the proposed metric and the modeling pipeline can be easily extended to other social media platforms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Purba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Asirvatham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Murugesan: Instagram Post Popularity Trend Analysis</surname>
          </string-name>
          and
          <article-title>Prediction using Hashtag, Image Assessment, and User History Features</article-title>
          .
          <source>International Arab Journal of Information Technology (IAJIT)</source>
          , vol.
          <volume>18</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Th</surname>
          </string-name>
          <article-title>ommes, Katja &amp; Hubner, Ronald: Why people press "like": A new measure for aesthetic appeal derived from Instagram data</article-title>
          .
          <source>Psychology of Aesthetics, Creativity, and the Arts</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Zhang, Zhongping et al:
          <article-title>How to Become Instagram Famous: Post Popularity Prediction with Dual-Attention</article-title>
          .
          <source>2018 IEEE International Conference on Big Data:</source>
          <fpage>2383</fpage>
          -
          <lpage>2392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Carta</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Podda</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Recupero</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saia</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Usai</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Popularity Prediction of Instagram Posts</article-title>
          . Inf.
          <volume>11</volume>
          (
          <year>2020</year>
          ):
          <fpage>453</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Zohourian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sajedi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Yavary</surname>
          </string-name>
          <article-title>: Popularity prediction of images and videos on Instagram</article-title>
          .
          <source>4th International Conference on Web Research (ICWR)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>117</lpage>
          , doi: 10.1109/ICWR.
          <year>2018</year>
          .
          <volume>8387246</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Khosla</surname>
          </string-name>
          , Aditya,
          <article-title>Atish Das Sarma, and Ra ay Hamid: What makes an image popular?</article-title>
          .
          <source>Proceedings of the 23rd international conference on World wide web</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.</surname>
          </string-name>
          <article-title>Courville: Deep learning</article-title>
          . MIT press,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Joe</given-names>
            <surname>Yue-Hei Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hausknecht</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Monga</surname>
          </string-name>
          and
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Toderici: Beyond short snippets: Deep networks for video classi cation</article-title>
          ,
          <source>2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <surname>Yu-Chuan</surname>
          </string-name>
          &amp; Hsuan, Chiu &amp; Yeh,
          <string-name>
            <surname>Chun-Yen</surname>
          </string-name>
          &amp;
          <article-title>Huang, Hsinfu &amp; Hsu, Winston: Transfer Learning for Video Recognition with Scarce Training Data</article-title>
          .
          <source>ArXiv abs/1409</source>
          .4127 (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Tan</surname>
            , Mingxing &amp; Le,
            <given-names>Quoc.</given-names>
          </string-name>
          (
          <year>2019</year>
          ): E cientNet:
          <article-title>Rethinking Model Scaling for Convolutional Neural Networks</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Russakovsky</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Ma,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ,
            <surname>Karpathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Khosla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          , et al:
          <article-title>Imagenet large scale visual recognition challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>115</volume>
          (
          <issue>3</issue>
          ):
          <volume>211</volume>
          {
          <fpage>252</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>J. Deng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>L.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            and
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>ImageNet: A LargeScale Hierarchical Image Database</article-title>
          .
          <source>IEEE Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>van der Maaten</surname>
          </string-name>
          , Laurens &amp; Hinton,
          <source>Geo rey: Visualizing Data using t-SNE</source>
          .
          <year>2008</year>
          ,
          <source>Journal of Machine Learning Research</source>
          .
          <volume>9</volume>
          .
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Selvaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cogswell</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. Das</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Vedantam</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Parikh</surname>
            and
            <given-names>D.</given-names>
          </string-name>
          <article-title>Batra: GradCAM: Visual Explanations from Deep Networks via Gradient-Based Localization</article-title>
          .
          <source>2017 IEEE International Conference on Computer Vision</source>
          (ICCV),
          <year>2017</year>
          , pp.
          <fpage>618</fpage>
          -
          <lpage>626</lpage>
          , doi: 10.1109/ICCV.
          <year>2017</year>
          .
          <volume>74</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. Not Just Analytics, https://www.notjustanalytics.com/.
          <source>Accessed</source>
          <volume>20</volume>
          /09/
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. Statista Research Department, https://www.statista.com/statistics/605107/ video-and
          <article-title>-image-brand-posts-on-</article-title>
          <string-name>
            <surname>instagram</surname>
            <given-names>/</given-names>
          </string-name>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <article-title>From left to right: "Ocean View GYM" by Prayitno; "Young Woman exercising with a t ball in modern gym" by shixart1985; "Sailors exercise in the gym aboard USS Dwight D. Eisenhower" by O cial U.S. Navy Imagery. Licensed under CC BY 2</article-title>
          .0, https://search.creativecommons.org/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>