Less is MORE: a MultimOdal system for tag REfinement

Lucia C. Passaro and Alessandro Lenci

CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica (FiLeLi),
Università di Pisa, via Santa Maria 36, 56126 Pisa, Italy
http://colinglab.humnet.unipi.it
lucia.passaro@fileli.unipi.it, alessandro.lenci@unipi.it

Abstract. With the proliferation of image-based social media, an extremely large amount of multimodal data is being produced. Very often image contents are published together with a set of user-defined metadata such as tags and textual descriptions. Despite being very useful to enhance traditional image retrieval, user-defined tags on social media have been proven to be ineffective for indexing images, because they are influenced by the personal experiences of the owners as well as by their desire to promote the published contents. To be analyzed and indexed, multimodal data require algorithms able to jointly deal with textual and visual data. This research presents a multimodal approach to the problem of tag refinement, which consists in separating the relevant descriptors (tags) of images from noisy ones. The proposed method exploits both Natural Language Processing (NLP) and Computer Vision (CV) techniques based on deep learning to find a match between the textual information and the visual content of social media posts. Textual semantic features are represented with (multilingual) word embeddings, while visual ones are obtained with image classification. The proposed system is evaluated on a manually annotated Italian dataset extracted from Instagram, achieving a weighted F1-score of 0.68.

Keywords: Natural Language Processing · Computer Vision · Multimodal Semantics

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Human communication is intrinsically multimodal. Ideas can be better expressed and understood by jointly using different modalities, as proven by most advertising campaigns and social media, in which the skillful combination of images and language is able to amplify communicative intents and impact. With the ever-growing expansion of Internet-based activities in the last 10 years, an extremely large amount of multimodal data is being produced. Multimodal information processing is therefore needed in order to make sense of such large quantities of data, enabling the development of systems that jointly deal with textual and visual data. Not only can the automatic interpretation of texts be improved by exploiting additional non-verbal information such as visual contents, but the interpretation of images can also be enriched by exploiting the meaning of their surrounding text [4]. Examples of applications based on multimodal information processing are image search from textual descriptions and, vice versa, the generation of textual descriptors or captions from an image [16]. In this paper, we focus on the refinement of user-defined image annotations, namely the tags provided with images when they are published on social media such as Instagram. On this platform, owners share pictures annotated with a set of tags based on personal experiences. Giannoulakis and Tsapatsoulis [14] demonstrated that only 66% of human-defined tags describe the visual content of the image. This negatively affects the way we can access and use Instagram data.

Social media tags are useful to enhance traditional image retrieval technology [27,33], but they usually include a lot of noise.
For instance, approximately only 20% of Instagram hashtags are appropriate to be used as training examples (i.e., image-tag pairs) for image recognition machine learning algorithms [14]. User-defined tags suffer from ambiguity, carelessness and incompleteness [22], have been proven to be highly associated with trends and events occurring in the real world, and are biased toward personal perspectives [15,30]. Moreover, people very often tag objects and scenes that are not present in the visual content in order to favor image retrieval by the general audience [22]. The present work aims at separating relevant from noisy tags in order to guarantee a better indexing and retrieval of images. We approach the problem by assigning a relevance value to tags. In order to do so, we employ a mixture of Natural Language Processing (NLP) and Computer Vision (CV) techniques and resources. This allows us to take into account both the textual and the visual features of multimodal contents in a combined and synergic way.

This paper is structured as follows: Section 2 gives a brief overview of existing work in this field. Section 3 describes the proposed approach to tag refinement. Section 4 presents the system architecture, including both the NLP and CV modules. Sections 5 and 6 report on the collection of the dataset used for the evaluation, namely a set of manually annotated Italian Instagram posts, and on the evaluation itself. Section 7 discusses the performance achieved by the current system implementation, and Section 8 draws conclusions and outlines future research.

2 Related work

As suggested by [22], existing research on image tags focuses on three main tasks, namely tag assignment, tag refinement and tag retrieval. In tag assignment, given an unlabeled image, the goal consists in assigning a (fixed) number of tags related to the image content [26,34,35]. In tag refinement, given an image associated with some initial tags, the objective is to separate irrelevant tags from relevant ones [24,37,23,38]. Finally, given a tag and a collection of images, tag retrieval focuses on retrieving relevant images with respect to the tag of interest [10,13,36].

In this work, we address the tag refinement task, aiming at separating relevant tags from noisy ones in user-tagged Instagram images. Previous studies tackled tag refinement from several perspectives, including the use of linguistic information only, by measuring tag relevance in terms of the weighted semantic similarity between the target tag and the other tags associated with the image [31], or by integrating textual data with the visual content [21]. A different approach exploited Principal Component Analysis [5] to build a model based on the factorization of an image-tag matrix by a low-rank decomposition with error sparsity [39], as sketched below. A systematic evaluation and comparison of various tag refinement models was carried out in [22]. In order to guarantee a reliable comparison, the authors implemented several methods by exploiting the same models to process the textual and the visual content. We refer the interested reader to this survey for details on model implementation and results. The evaluation shows that the best performing model for tag refinement is the one based on robust PCA [5] and a CNN [18] to process the image content, achieving performances between 0.57 and 0.63 depending on the size of the training set.
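To make the matrix-factorization idea behind [39] concrete, the following minimal sketch uses a plain truncated SVD as a simplified stand-in for robust PCA (which additionally models a sparse error term); the toy matrix, the rank and the threshold are illustrative values of ours, not taken from the cited work:

```python
import numpy as np

# Toy binary image-tag matrix: rows = images, columns = candidate tags.
# A 1 means the user attached that tag to that image.
M = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 0],
], dtype=float)

# Truncated SVD as a low-rank approximation of M. The reconstructed scores
# concentrate on image-tag associations supported by the global
# co-occurrence structure, while idiosyncratic (noisy) entries are damped.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2  # rank of the approximation
M_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Keep a user tag only if its reconstructed score survives a threshold.
refined = (M_hat > 0.5) & (M > 0)
print(refined.astype(int))
```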
However, the datasets employed in the evaluations of [22] present several elements that do not comply with the purpose of the present work. First, both NUS-WIDE [7] and MirFlickr [17] were created several years ago, and since then the way multimodal contents are published has changed considerably. While in Flickr-based datasets the content of tags is almost exclusively referential, in the last few years, with the advent of social media such as Instagram, tags are no longer used only as referential descriptors, but also for several other purposes, such as expressing emotions [14]. Flickr users were mostly interested in promoting their images specifically for their content. On the contrary, Instagram users tend to favour interactions with other users, and are therefore inclined to produce tags that make their posts accessible to the widest possible audience, regardless of the content of the image. Moreover, such datasets contain limited English dictionaries of tags that do not allow for a multilingual evaluation, which is crucial in the present work. To the best of our knowledge, no previous work addressed tag refinement by exploiting computer vision along with multilingual distributional semantics. For these reasons, we decided to create a new evaluation set from public photos on Instagram.

3 Proposed approach

As anticipated, we approach the tag refinement task by integrating NLP and CV techniques. Our system MORE (MultimOdal tag REfinement) is aimed at separating relevant tags from noisy ones in user-tagged Instagram images. In particular, it exploits several resources in order to compute tag relevance. We designed its architecture to address several issues.

First, MORE is expected to work on Italian Instagram posts. On this social network, users publish their photos accompanied by a list of tags (with a maximum of 30 per photo). For Italian users, these tags are often both in Italian and in English, with the result of increasing the number of tags and adding noise and redundancy. In Computational Linguistics, the last years have witnessed the growing use of word embeddings, that is, dense distributional vectors typically built with neural language models to represent the meaning of words [20]. Usually, these embeddings are monolingual, but given the nature of Instagram and the behaviour of its Italian users, we need to represent the meaning of words not only in Italian but also in a cross-lingual way. Multilingual embedding models have been proposed in the recent literature. They are able to capture words and their translations across languages in a joint embedding space [29]. This kind of embedding is very appealing for applications in social media like Instagram, since it allows the system to deal with both Italian and English tags.

The second challenge we need to address is separating relevant from noisy tags with respect to the image content as well. We define the relevance of a tag in terms of the consistency between its denotational meaning and the visual content of an image (e.g., the objects and scenarios appearing in it). For example, the tag boat is relevant for an image containing a seascape with a ship. To this purpose, we use a framework to convert an image into a series of textual descriptors for the elements contained in it. These descriptors are extracted with one of the most popular Convolutional Neural Network architectures for image labeling, namely the VGG-16 neural network. More specifically, we used the pretrained VGGNet [32].
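To make the cross-lingual setup concrete, the following minimal sketch loads two MUSE-aligned fastText spaces with gensim and queries the English neighbors of an Italian word. The vector file names follow the MUSE release naming but are assumptions here, and the neighbors in the comment are illustrative:

```python
from gensim.models import KeyedVectors

# Assumed paths: MUSE releases aligned fastText vectors per language
# (e.g. wiki.multi.it.vec / wiki.multi.en.vec) in word2vec text format.
it_space = KeyedVectors.load_word2vec_format("wiki.multi.it.vec")
en_space = KeyedVectors.load_word2vec_format("wiki.multi.en.vec")

# Because the two spaces are aligned, an Italian vector can be compared
# directly against English vectors: cross-lingual nearest neighbors.
vec = it_space["cattedrale"]
print(en_space.similar_by_vector(vec, topn=5))
# expected neighbors along the lines of: cathedral, basilica, ...
```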
The third challenge is related to the prospective application context of MORE. Indeed, the system is expected to reduce the noise inherent in automatically crawled Instagram datasets. This feature is very important in order to support other industrial applications such as market surveys and sentiment or brand reputation analysis. Despite being very popular for Business Intelligence, such applications often suffer from the noise generated by user-defined tags. In fact, these tags are often carefully chosen to favor image retrieval and visibility. Therefore, they require some form of preprocessing to guarantee the reliability of the aggregated results. For example, the image in Figure 1 was published with several English and Italian tags. Some describe the image content (e.g. fiori ('flowers'), occhiali ('glasses'), orchidea ('orchid'), flower), while others are clearly used to promote the retrieval of the image on the Instagram platform (e.g. moda ('fashion'), buongiorno ('good morning'), Roma ('Rome'), outfit).

Fig. 1. A photo posted on Instagram with its Italian and English relevant and noisy tags. Original tags were: beauty, buongiorno, colors, fiori, fiorieocchiali, flower, flowers, h52, instabeauty, instadaily, instafashion, moda, occhiali, occhialidasole, orchidea, outfit, repost, roma, sunglasses.

In order to separate relevant from noisy tags, MORE exploits three main resources to compute tag relevance:

VGG-16: To establish the relevance of a tag for a given image, the pretrained VGG-16 network [32] is used. In particular, we adopted the version available in Keras [6] and trained on ImageNet [9]. The ImageNet dataset [9] consists of 1.4 million images, each labelled with one of 1,000 different classes.

MUSE: The system compares English and Italian word vectors in a multilingual space. To build such a space, the Multilingual Unsupervised and Supervised Embeddings (MUSE) framework [8,19] was used to align the pretrained Italian and English fastText Wikipedia word embeddings [1] in a single vector space. In particular, the Italian space is used as the source space and the English one as the target.

OMW: Multilingual WordNet synsets were used to translate the ImageNet classes and to obtain their hypernyms. Specifically, the version of Open Multilingual Wordnet (OMW) [3,2] available in NLTK [25] was used to make joint queries on the English [11] and the Italian [28] models.

4 The MORE Architecture

Given a set of Instagram posts, each consisting of an image and a list of tags, the goal of the MORE system is to distinguish relevant from irrelevant tags. The architecture of MORE is shown in Figure 2.

Fig. 2. The MORE architecture for refining the tags of an Instagram photo.

Our dataset D is formally defined as D = {P1, ..., Pk}. Each element Pi is a pair (T, p), where T = {t1, ..., tn} is a list of tags including both relevant and irrelevant ones, and p is the visual content. For each Pi ∈ D, MORE carries out the following process:

Image classification: MORE classifies p with the VGG-16 model and returns a list L = {l1, ..., lm} of English labels belonging to the 1,000 ImageNet classes. A parameter specifies the probability threshold of the output classes (by default, the system returns all non-zero values). Therefore, this step provides a list of potential labels associated with the image and referring to its content. For example, given the photo in Figure 3, the output of the classification process is as follows: castle, monastery, palace, bell cote, church. A minimal sketch of this step is shown below.
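The sketch uses the pretrained Keras VGG-16, assuming TensorFlow 2.x; the file name photo.jpg and the example output are illustrative:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import (
    VGG16, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

model = VGG16(weights="imagenet")  # pretrained on the 1,000 ImageNet classes

def classify(path, top=10, threshold=0.0):
    # Load and preprocess the photo to the 224x224 input VGG-16 expects.
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    preds = model.predict(x)
    # decode_predictions maps class indices to human-readable ImageNet labels.
    return [(label, float(p))
            for (_, label, p) in decode_predictions(preds, top=top)[0]
            if p > threshold]

print(classify("photo.jpg"))  # e.g. [('monastery', 0.41), ('church', 0.22), ...]
```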
Labels translation and extension: The system translates the English labels in L into Italian by exploiting both WordNet senses (OMW) and multilingual embeddings (MUSE). For each label in L, the system retrieves from OMW all the Italian lemmas marked with the same synset as the English label. Moreover, the system also extracts the hypernyms of each English label (e.g., cat and feline from Egyptian cat); a sketch of this step is given below, after the multilingual neighborhood step. The output of this step for each image is a list of translated Italian labels LT = {lt1, ..., ltm}. For the example provided in Figure 3, the list of translated Italian labels (LT) includes: convento ('monastery'), monastero ('monastery'), palazzo ('palace'), chiesa ('church'). The list of extended labels EL is defined as L ∪ LT (i.e., the English predicted labels and their Italian translations). The EL of the previous example contains: church, monastery, monastero ('monastery'), convento ('monastery'), palazzo ('palace'), bell cote, chiesa ('church'), castle, palace, as well as their hypernyms in Italian and English (e.g., abitazione ('dwelling'), religious residence).

Fig. 3. An example from the Instagram dataset. Original tags for this photo were: brescia, cattedrale, landscapephotography, monument, fotografia, pics, world, prospettiva, brixia, photo, art, city, foto, picsart, landscape, picoftheday, arte, lombardia, italia, cultura, photography, italy, architecture.

Multilingual neighborhood: For each tag t belonging to T, the system queries the MUSE multilingual embedding models to collect the top x nearest neighbors of each element, both in Italian and in English. In particular, we use the Italian space as the source space and the English one as the target space, in order to populate the list of extended tags (ET). By default, the parameter x is set to 5. For example, given the tag cattedrale ('cathedral'), its Italian nearest neighbors are cattedrale ('cathedral'), procattedrale ('pro-cathedral'), concattedrale ('co-cathedral'), cathedrale ('cathedral'), basilica ('basilica'), while its English ones are: cathedral, basilica, cathedra, cathedrals, church.
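As referenced in the labels translation and extension step above, here is a minimal sketch of translating an ImageNet label and collecting hypernym lemmas through the Open Multilingual Wordnet in NLTK. It assumes the wordnet and omw-1.4 data have been downloaded; the function name translate_and_extend is ours:

```python
from nltk.corpus import wordnet as wn  # needs nltk.download('wordnet'),
                                       # nltk.download('omw-1.4')

def translate_and_extend(label):
    """Italian lemmas and (bilingual) hypernym lemmas for an English label."""
    out = set()
    for synset in wn.synsets(label.replace(" ", "_"), pos=wn.NOUN):
        # Italian lemmas of the same synset, via Open Multilingual Wordnet.
        out.update(l.replace("_", " ") for l in synset.lemma_names("ita"))
        # Hypernyms in both languages (e.g. 'feline' for 'Egyptian cat').
        for hyper in synset.hypernyms():
            out.update(l.replace("_", " ") for l in hyper.lemma_names("eng"))
            out.update(l.replace("_", " ") for l in hyper.lemma_names("ita"))
    return out

print(translate_and_extend("monastery"))  # e.g. {'monastero', 'convento', ...}
```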
Filtering: The system filters the elements in T by considering ET and EL. In particular, for each tag t ∈ T, if t ∈ EL then it is added to the set R of relevant tags. Otherwise, t is added to R in three cases: (i) one of its nearest neighbors in ET belongs to EL; (ii) the vector of t is similar to at least one of the vectors of EL; (iii) the vector of at least one of the neighbors of t in ET is similar to one of the vectors of EL. Similarity is computed with the cosine. By default, the threshold on the cosine is set to 0.4. For example, the tag cattedrale ('cathedral') in Figure 3 was found to be similar to most of the labels (e.g., chiesa ('church'), with a cosine similarity of 0.74). Since this is an instance of case (ii), it is added to R. In fact, even if the tag itself had not been similar to any of the labels, one of its neighbors in the English space is church; since this word belongs to EL, the tag would have been marked as relevant anyway (case (i)). Moreover, we can consider the case of the tag architecture. It has a cosine similarity below the threshold for all of the elements in EL. Nonetheless, at least one of its neighbors (i.e., architectural) has a cosine similarity above the specified threshold with at least one of the labels, in particular with the label church, with a similarity of 0.42. Therefore, according to scenario (iii), architecture was added to the relevant tags R. Conversely, neither the tag brescia ('Brescia', an Italian city) nor any of its neighbors are found to be similar to any of the labels (e.g., the cosine with monastery is 0.27 in English and 0.13 in Italian). Thus, the tag is not considered relevant by the system, despite being in fact correct. The default MORE configuration marks as relevant only the following tags among the ones coming with the image (cf. Figure 3): architecture, art, arte ('art'), cattedrale ('cathedral'), cathedral, city, cultura ('culture'), foto ('photo'), fotografia ('photography'), italia ('Italy'), italy, landscape, monument, photo, photography, pics, prospettiva ('perspective'), world.
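The filtering logic can be summarized in a short sketch; the helper names (refine, vec) and the data structures are our own simplification of the step just described:

```python
import numpy as np

COS_THRESHOLD = 0.4  # MORE default

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def refine(tags, extended_tags, extended_labels, vec):
    """Return the subset R of relevant tags.

    tags            -- the original tag list T
    extended_tags   -- dict: tag -> its nearest neighbors (ET)
    extended_labels -- the extended label set EL (labels + translations
                       + hypernyms)
    vec             -- function mapping a word to its embedding, or None
                       if the word is out of vocabulary
    """
    relevant = set()
    for t in tags:
        if t in extended_labels:                       # exact match
            relevant.add(t)
            continue
        neighbors = extended_tags.get(t, [])
        # (i) a neighbor of t literally belongs to EL
        if any(n in extended_labels for n in neighbors):
            relevant.add(t)
            continue
        # (ii)/(iii) t itself, or one of its neighbors, is close to a label
        for c in [t] + neighbors:
            cv = vec(c)
            if cv is None:
                continue
            if any(vec(l) is not None and
                   cosine(cv, vec(l)) >= COS_THRESHOLD
                   for l in extended_labels):
                relevant.add(t)
                break
    return relevant
```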
5 Dataset description and annotation

MORE has been evaluated on a portion of the dataset collected in the context of the MUSE project, aimed at performing a multimodal analysis of both texts and images in order to improve the quality of sentiment analysis and brand reputation systems. The whole MUSE dataset was collected between May 2018 and December 2018 by using the official Instagram API.¹ A total of 14 hashtags were used for data collection. Seven of these were closely related to a customer company of the MUSE industrial partner, while the rest were very generic and not related to any particular topic. We do not report the list of the first group of hashtags. The list of the second group is as follows: follow4follow, igers, followme, instago, italia, buongiorno, instaitalia. Overall, the dataset consists of more than 200k images, and the effectiveness of the system has been evaluated internally by the company. However, a portion of 50 randomly selected images was manually annotated for tag relevance by 7 human annotators.

¹ After Dec. 12, 2018, the Instagram API changed radically to comply with the new GDPR regulations (https://www.instagram.com/developer/changelog/) and the collection of the dataset was stopped.

The participants rated the 50 images with respect to their user-provided Instagram hashtags through a questionnaire. Participants were asked to find the "relevant" hashtags with respect to the image content. The English translation of the instructions is reported in Figure 4. Each image was presented to the annotators together with a multi-selection button showing the list of the original hashtags (see Figure 5). The annotation task was quite difficult, since the annotators had to select, for each image, one, some or no tags, with potentially different degrees of relevance. The number of tags for each image was variable and depended on the actual tags obtained when the post was collected. Fleiss' kappa [12] was used to compute the inter-annotator agreement on the 1195 data points consisting of image-tag pairs. The overall agreement was 0.42, but an ablation experiment on the raters demonstrated that a global agreement of 0.58 could be reached with 6 out of the 7 raters. Given such agreement, we based the final decision about the relevance of each tag on the majority vote criterion (4 votes), as shown in Figure 6.

This survey aims at identifying, given a set of images and tags (i.e. hashtags), the subset of "relevant" tags given the image content. A tag is relevant to an image if it refers to the entities (people, objects, places, etc.) depicted in the image. For example, for an image depicting a cathedral, tags such as "cathedral" and "church" will be relevant, but tags such as "goaround", "tourist" and "hello" will not. Likewise, for an image depicting a person in front of a cathedral, tags such as "cathedral", "church" and "tourist" will be relevant, but tags such as "goaround" and "hello" will not. Each image can contain tags both in Italian and in English. If you do not know the meaning of the term used as a tag, use a dictionary to verify its relevance. There may also be hashtags composed of the concatenation of several words. Please select these tags where relevant.

Fig. 4. Annotation instructions (translated from Italian).

6 Evaluation

In the final evaluation dataset, the average number of tags associated with each item is 23.9 (σ = 9.13). Human ratings reveal that on average 3.6 tags (σ = 2.9) were actually relevant for a given image, while 20.3 (σ = 9.5) were not. The distribution of relevant vs. noisy user-defined hashtags is in line with the findings illustrated in [14], since approximately only 18% of tags are actually relevant. For the evaluation, each image-tag pair was considered as independent of the others. In other words, each image is represented in the final dataset by a number of data points equal to the number of its original hashtags. Note that the set of relevant tags is always a subset of the original tags.

Since the annotators were asked to mark as correct the tags referring to the objects visible in the images, we used the output of the image classification step as a first baseline for the task. The results are reported in Table 1.

Class          P     R     F1    Support
False          0.85  1.00  0.92  1014
True           0.83  0.03  0.05  181
Macro avg      0.84  0.51  0.49  1195
Weighted avg   0.85  0.85  0.79  1195

Table 1. Baseline based on the image classification (VGG-16) output.

The baseline has a clear issue in terms of flexibility. The baseline classifier, which is simply the VGG-16 classifier trained on ImageNet, may never predict certain tags, as it is limited by the number of classes it was trained on, and may give more weight to certain aspects of the image. For example, if we consider the image in Figure 1, one of the correct tags is flower. However, if we look at the output of the classification step, the words pot and vase are included, while flower itself is not predicted. This is possibly due to the fact that what the classifier is trained to see is actually a flower pot, but not just a flower. Therefore, in this case, flower is counted as a False Negative example.

Fig. 5. An item provided during the annotation process. Users were asked to select none, one or more tags depending on their relevance with respect to the image.

Fig. 6. Number of ratings for the hashtags associated with the image in Figure 5. A tag is associated with a given image if at least 4 annotators marked it as relevant.

7 Results and Discussion

The overall performance of the system was assessed by comparing the model predictions against the human ratings. Table 2 shows the results of the evaluation in terms of Precision, Recall, F1-score and Support.

Class          P     R     F1    Support
False          0.92  0.61  0.73  1014
True           0.24  0.71  0.36  181
Macro avg      0.58  0.66  0.55  1195
Weighted avg   0.82  0.62  0.68  1195

Table 2. Precision (P), Recall (R), F1-score (F1) and Support of MORE on a dataset of 50 Instagram images with the default configuration.

True Positives (TP) are the image-tag pairs for which both humans and the system assigned the class True (relevant). Likewise, True Negatives (TN) are data points for which both the system and humans assigned the class False (noisy). False Positives (FP) are pairs predicted as relevant by the system but rated as noisy by humans. Finally, False Negatives (FN) are the examples rated as relevant by humans but marked as noisy by the system.
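The per-class figures and the macro/weighted averages reported in Tables 1-3 correspond to what scikit-learn's classification_report computes over image-tag pairs; a minimal sketch with toy labels (the label vectors are illustrative, not the paper's data):

```python
from sklearn.metrics import classification_report

# y_true: human majority-vote relevance for each image-tag pair;
# y_pred: the system's decision for the same pairs (toy values here).
y_true = [False, False, True, False, True, False]
y_pred = [False, True,  True, False, False, False]

# classification_report prints per-class P/R/F1/support plus the
# macro and weighted averages used in the tables above.
print(classification_report(y_true, y_pred, digits=2))
```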
Given the distribution of relevant tags in our dataset (18% of the total), most of the data points belong to the False class, affecting the macro-averaged result. Therefore, we prefer to consider the weighted average as the evaluation metric: Precision, Recall and F1-score are computed for each class, and their average is weighted by support (i.e., the number of true instances of each class). Overall, the system reaches a weighted average F1-score of 0.68, which is distributed differently across the relevant and noisy hashtag classes. Moreover, it is important to stress that in the case of irrelevant tags (the False class) it is very important to maximize the Precision, in order to avoid noise. On the contrary, for relevant tags (the True class) we are particularly interested in maximizing the Recall, in order to guarantee a satisfactory retrieval of relevant images.

In order to pursue such goals, we decided to use, along with the standard metrics, an additional one consisting of the average between the Precision of the False class and the Recall of the True one. This metric provides us with a useful way to evaluate whether the system is able to discard noisy tags and, at the same time, to preserve the correct ones. Looking only at the Precision and Recall of the individual classes would not capture this information. This metric, calculated with the MORE default parameters, is 0.815, outperforming the baseline (for which it was 0.44) by a wide margin, even though the baseline's weighted average F1-score was 0.79.
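A minimal sketch of this metric; the function name noise_vs_recall is ours, and the toy label vectors are illustrative:

```python
from sklearn.metrics import precision_score, recall_score

def noise_vs_recall(y_true, y_pred):
    """Average of the Precision on the False (noisy) class and the Recall
    on the True (relevant) class, as described above."""
    p_false = precision_score(y_true, y_pred, pos_label=False)
    r_true = recall_score(y_true, y_pred, pos_label=True)
    return (p_false + r_true) / 2

y_true = [False, False, True, False, True, False]
y_pred = [False, True,  True, False, False, False]
print(noise_vs_recall(y_true, y_pred))  # 0.625 on these toy labels
```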
Even though we consider such results promising, we performed a manual evaluation of the error types to identify the most challenging cases for MORE. One of the problems we detected concerns the recall of multiword hashtags such as sprayart ('spray art'), biancoenero ('black and white', concatenating the words "bianco e nero"), fotodaltreno ('photos from the train', from "foto dal treno"), fiorieocchiali ('flowers and glasses', from "fiori e occhiali"), creativemakeup ('creative makeup'). As for the False Negative examples, 37 of the 53 total examples (about 70%) were out-of-vocabulary (OOV) in both the source and the target word space. There is surely wide room for improvement. For example, the MORE match function could be enriched with the ability to segment multiword hashtags into standard lexical entries, or by exploiting fastText subword features to extract the vectors of out-of-vocabulary words in the multilingual space as well.

As for the False Positive examples, we noticed that MORE is prone to predicting high-frequency words such as photo, nice, look, selfie, style, pizza as relevant tags. In addition, we noticed that MORE tends to mark abstract words (in Italian and English) as relevant tags. Although this is an error, it can be considered expected given the way MORE is constructed. In fact, the vectors of abstract words are very often highly associated with referential objects depicted in the photos. For example, words like mood, enjoy, happy, verità ('truth'), bellezza ('beauty'), parole ('words') were considered False by annotators because, according to the provided instructions, they "do not refer to the entities (people, objects, places, etc.) depicted in the image". Nonetheless, given the high association of such word vectors with referential objects, MORE often tags them as relevant.

We also performed further experiments to understand how the various parameters affect performance and can be exploited to reach different goals. It is clear that such parameters are important levers on the behavior of the application. For example, a higher cosine similarity threshold discards more tags. This is useful if the final goal is to refine tags as accurately as possible. On the other hand, this setting negatively affects the recall of relevant hashtags. The same consideration applies to increasing the number of nearest neighbors of tags and labels. We performed a parameter tuning experiment in which we evaluated MORE with cosine similarity thresholds of 0.3, 0.4 and 0.5. Similarly, we assessed the system by changing the number of nearest neighbors; in this case, the evaluation was performed with 3, 5 and 10 nearest neighbors. In order to choose the best performing configuration, we selected the one maximizing the average between the Precision of the False (not relevant) class and the Recall of the True (relevant) one, as in the tuning loop sketched at the end of this section. On this metric, MORE performs at best 0.87. The results of this model are reported in Table 3. Considering the standard metrics for this model, we notice that the weighted average Precision also improves after the parameter tuning. The weighted average Recall (and thus the F1-score), on the contrary, is negatively affected by the results on False, which is the majority class.

Class          P     R     F1    Support
False          0.94  0.55  0.70  1014
True           0.24  0.80  0.37  181
Macro avg      0.59  0.68  0.53  1195
Weighted avg   0.83  0.59  0.65  1195

Table 3. Precision (P), Recall (R), F1-score (F1) and Support of MORE on a dataset of 50 Instagram images after the parameter tuning (evaluation performed with the cosine similarity threshold set to 0.3 and the number of nearest neighbors set to 10).

Fig. 7. MORE predictions against human annotation with the default configuration (cosine sim. threshold: 0.4; nearest neighbors: 5).

Fig. 8. MORE predictions against human annotation after parameter tuning (cosine sim. threshold: 0.3; nearest neighbors: 10).

In order to further improve the overall performance, a viable option could be to leverage the frequency and the popularity of the tags on Instagram. Such high-frequency tags, in fact, account for a lot of false positives, and decreasing their number could also increase the weighted average F1-score.

Figures 7 and 8 show the predictions of both the default and the tuned configuration compared to the human annotation. Green tags are true positive examples, red tags are false positives (the system considered them relevant while humans did not), and orange tags are false negatives. We can see that abstract words are false positives in both models. This means that the thresholds are not able to properly mitigate this phenomenon, because words like beach, travel and relax are strongly associated with one another (e.g., to remove the tags relax and travel the system requires a cosine threshold of at least 0.5, with the effect of removing true positive tags as well).
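The tuning loop referenced above can be sketched as a simple grid search; run_more is a hypothetical helper standing in for re-running the MORE pipeline on the annotated dataset:

```python
from itertools import product
from sklearn.metrics import precision_score, recall_score

def tune(y_true, run_more):
    """Grid search over the MORE parameters, maximizing the average of
    the Precision on False and the Recall on True (see Section 7)."""
    best = None
    for threshold, k in product([0.3, 0.4, 0.5], [3, 5, 10]):
        # run_more (hypothetical) returns a relevance decision per
        # image-tag pair for the given configuration.
        y_pred = run_more(threshold=threshold, neighbors=k)
        score = (precision_score(y_true, y_pred, pos_label=False)
                 + recall_score(y_true, y_pred, pos_label=True)) / 2
        if best is None or score > best[0]:
            best = (score, threshold, k)
    return best  # e.g. (0.87, 0.3, 10) on the dataset used in the paper
```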
8 Conclusions

In this paper, we presented MORE (MultimOdal tag REfinement), a system aimed at improving image descriptors by exploiting NLP and CV techniques. The system starts from user-defined Instagram image annotations, and merges visual and textual information to find a match between the tags provided with an image and its semantic visual content. Textual features are extracted from text-based tags by exploiting (multilingual) word embeddings. Visual features are gathered by exploiting image classification. The system has been evaluated on a manually annotated Italian dataset, achieving a weighted F1-score of 0.68.

The results of MORE are promising, but there are still wide margins of improvement for several key aspects of the system, including: (i) the construction of the multilingual embeddings; (ii) the management of multilingual hashtags; (iii) the refinement and extension of the evaluation process; (iv) the distribution of the manually annotated dataset. In the near future, our efforts will be focused towards these directions. As for (i), we aim at training embeddings on a mixture of social media and general purpose corpora. This combination is expected to positively affect (ii), as it would enable the collection of reliable vectors for multiword hashtags, thus reducing the number of OOV words during filtering. As for (iii), we plan to fine-tune all the modules, including the neural network for image classification, and to study the contribution of each module of the architecture to the classification. Finally, as for (iv), we plan to extend the manually annotated dataset to improve the evaluation and to make it available for research purposes, in accordance with the Instagram privacy constraints.

Acknowledgments

This research has been supported by the project MUltimodal Semantic Extraction (MUSE), in collaboration with Bnova s.r.l., funded by Regione Toscana with the grant POR FSE 2014-2020 Asse A.

References

1. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606 (2016)
2. Bond, F., Foster, R.: Linking and extending an open multilingual wordnet. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (ACL 2013). Sofia (2013)
3. Bond, F., Paik, K.: A survey of wordnets and their licenses. In: Proceedings of the 6th Global WordNet Conference (GWC 2012). pp. 64–71. Matsue (2012)
4. Bruni, E., Tran, G.B., Baroni, M.: Distributional semantics from text and images. In: Proceedings of the GEMS 2011 Workshop on Geometrical Models of Natural Language Semantics. pp. 22–32 (2011)
5. Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? Journal of the ACM (JACM) 58(3), 11 (2011)
6. Chollet, F.: Deep Learning with Python. Manning (2017)
7. Chua, T.S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceedings of the ACM International Conference on Image and Video Retrieval. pp. 1–9 (2009)
8. Conneau, A., Lample, G., Ranzato, M., Denoyer, L., Jégou, H.: Word translation without parallel data. arXiv preprint arXiv:1710.04087 (2017)
9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR 2009 (2009)
10. Duan, L., Li, W., Tsang, I.W.H., Xu, D.: Improving web image search by bag-based reranking. IEEE Transactions on Image Processing 20(11), 3280–3290 (2011)
11. Fellbaum, C. (ed.): WordNet: An Electronic Lexical Database. The MIT Press (1998)
12. Fleiss, J.: Measuring nominal scale agreement among many raters. Psychological Bulletin 76(5), 378–382 (November 1971)
13. Gao, Y., Wang, M., Zha, Z.J., Shen, J., Li, X., Wu, X.: Visual-textual joint relevance learning for tag-based social image search. IEEE Transactions on Image Processing 22(1), 363–376 (2012)
14. Giannoulakis, S., Tsapatsoulis, N.: Evaluating the descriptive power of Instagram hashtags. Journal of Innovation in Digital Ecosystems 3(2), 114–129 (2016)
15. Golder, S., Huberman, B.: Usage patterns of collaborative tagging systems. Journal of Information Science 32, 198–208 (2006)
16. Gong, Y., Ke, Q., Isard, M., Lazebnik, S.: A multi-view embedding space for modeling internet images, tags, and their semantics. International Journal of Computer Vision 106(2), 210–233 (2014)
17. Huiskes, M.J., Thomee, B., Lew, M.S.: New trends and ideas in visual concept detection: the MIR Flickr retrieval evaluation initiative. In: Proceedings of the International Conference on Multimedia Information Retrieval. pp. 527–536 (2010)
18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. pp. 1097–1105 (2012)
19. Lample, G., Conneau, A., Ranzato, M., Denoyer, L., Jégou, H.: Word translation without parallel data. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings (2018)
20. Lenci, A.: Distributional models of word meaning. Annual Review of Linguistics 4, 151–171 (2018)
21. Li, X., Snoek, C.G., Worring, M.: Learning social tag relevance by neighbor voting. IEEE Transactions on Multimedia 11(7), 1310–1322 (2009)
22. Li, X., Uricchio, T., Ballan, L., Bertini, M., Snoek, C.G.M., Bimbo, A.D.: Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Computing Surveys 49(1), 14:1–14:39 (Jun 2016)
23. Li, X., Shen, B., Liu, B.D., Zhang, Y.J.: A locality sensitive low-rank model for image tag completion. IEEE Transactions on Multimedia 18(3), 474–483 (2016)
24. Li, Z., Liu, J., Zhu, X., Liu, T., Lu, H.: Image annotation using multi-correlation probabilistic matrix factorization. In: Proceedings of the 18th ACM International Conference on Multimedia. pp. 1187–1190. ACM (2010)
25. Loper, E., Bird, S.: NLTK: The Natural Language Toolkit. In: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1. pp. 63–70. ETMTNLP '02, Association for Computational Linguistics, Stroudsburg, PA, USA (2002)
26. Makadia, A., Pavlovic, V., Kumar, S.: Baselines for image annotation. International Journal of Computer Vision 90(1), 88–105 (2010)
27. Nov, O., Naaman, M., Ye, C.: What drives content tagging: the case of photos on Flickr. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. pp. 1097–1100. ACM (2008)
28. Pianta, E., Bentivogli, L., Girardi, C.: MultiWordNet: developing an aligned multilingual database. In: Proceedings of the First International Conference on Global WordNet. Mysore, India (2002)
29. Ruder, S., Vulić, I., Søgaard, A.: A survey of cross-lingual word embedding models. arXiv preprint arXiv:1706.04902 (2017)
30. Sen, S., Lam, S.K., Rashid, A.M., Cosley, D., Frankowski, D., Osterhouse, J., Harper, F.M., Riedl, J.: Tagging, communities, vocabulary, evolution. In: Proceedings of the 2006 20th Anniversary Conference on Computer Supported Cooperative Work. pp. 181–190. ACM (2006)
31. Sigurbjörnsson, B., Van Zwol, R.: Flickr tag recommendation based on collective knowledge. In: Proceedings of the 17th International Conference on World Wide Web. pp. 327–336. ACM (2008)
32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
33. Sun, A., Bhowmick, S.S., Chong, J.A.: Social image tag recommendation by concept matching. In: Proceedings of the 19th ACM International Conference on Multimedia. pp. 1181–1184. ACM (2011)
34. Tang, J., Hong, R., Yan, S., Chua, T.S., Qi, G.J., Jain, R.: Image annotation by kNN-sparse graph-based label propagation over noisily tagged web images. ACM Transactions on Intelligent Systems and Technology (TIST) 2(2), 14 (2011)
35. Tang, J., Shu, X., Li, Z., Jiang, Y.G., Tian, Q.: Social anchor-unit graph regularized tensor completion for large-scale image retagging. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
36. Wang, Y., Zhu, L., Qian, X., Han, J.: Joint hypergraph learning for tag-based image retrieval. IEEE Transactions on Image Processing 27(9), 4437–4451 (2018)
37. Wu, L., Jin, R., Jain, A.K.: Tag completion for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(3), 716–727 (2012)
38. Xu, X., He, L., Lu, H., Shimada, A., Taniguchi, R.I.: Non-linear matrix completion for social image tagging. IEEE Access 5, 6688–6696 (2016)
39. Zhu, G., Yan, S., Ma, Y.: Image tag refinement towards low-rank, content-tag prior and error sparsity. In: Proceedings of the 18th ACM International Conference on Multimedia. pp. 461–470. ACM (2010)