<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Less is MORE: a MultimOdal system for tag REfinement</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lucia C. Passaro</string-name>
          <email>lucia.passaro@fileli.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <email>alessandro.lenci@unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica (FiLeLi), Università di Pisa</institution>
          ,
          <addr-line>via Santa Maria 36, 56126 PISA</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>44</fpage>
      <lpage>58</lpage>
      <abstract>
        <p>With the proliferation of image-based social media, an extremely large amount of multimodal data is being produced. Very often image contents are published together with a set of user-defined metadata such as tags and textual descriptions. Despite being very useful to enhance traditional image retrieval, user-defined tags on social media have been proven to be ineffective for indexing images, because they are influenced by the personal experiences of the owners as well as by their desire to promote the published contents. To be analyzed and indexed, multimodal data require algorithms able to jointly deal with textual and visual data. This research presents a multimodal approach to the problem of tag refinement, which consists in separating the relevant descriptors (tags) of images from noisy ones. The proposed method exploits both Natural Language Processing (NLP) and Computer Vision (CV) techniques based on deep learning to find a match between the textual information and the visual content of social media posts. Textual semantic features are represented with (multilingual) word embeddings, while visual ones are obtained with image classification. The proposed system is evaluated on a manually annotated Italian dataset extracted from Instagram, achieving a weighted F1-score of 68%.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing • Computer Vision • Multimodal Semantics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Human communication is intrinsically multimodal. Ideas can be better expressed and understood by jointly using different modalities, as proven by most advertising campaigns and social media, in which the skillful combination of images and language is able to amplify communicative intents and impact. With the ever-growing expansion of Internet-based activities in the last 10 years, an extremely large amount of multimodal data is being produced. Multimodal information processing is therefore needed to make sense of such large quantities of data, enabling the development of systems that jointly deal with textual and visual data. Not only can the automatic interpretation of texts be improved by exploiting additional non-verbal information such as visual contents, but the interpretation of images can also be enriched by exploiting the meaning of their surrounding text [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Examples of applications based on multimodal information processing are image retrieval from textual descriptions and, vice versa, the generation of textual descriptors or captions from an image [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In this paper, we focus on the refinement of user-defined image annotations, namely the tags provided with images when they are published on social media such as Instagram. On this platform, owners share pictures annotated with a set of tags based on personal experiences. Giannoulakis and Tsapatsoulis [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] demonstrated that only 66% of human-defined tags describe the visual content of the image. This negatively affects the way we can access and use Instagram data.
      </p>
      <p>
        Social media tags are useful to enhance traditional image retrieval
technology [
        <xref ref-type="bibr" rid="ref27 ref33">27,33</xref>
        ], but they usually include a lot of noise. For instance, only approximately 20% of Instagram hashtags are appropriate to be used as training examples (i.e., image-tag pairs) for image recognition machine learning algorithms [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. User-defined tags suffer from ambiguity, carelessness and incompleteness [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and have been proven to be highly associated with trends
and events occurring in the real world, and biased toward personal perspectives
[
        <xref ref-type="bibr" rid="ref15 ref30">15,30</xref>
        ]. Moreover, very often people tag objects and scenes that are not present
in the visual content in order to favor image retrieval for the general audience
[
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. The present work aims at separating relevant from noisy tags in order to enable better indexing and retrieval of images. We approach the problem by assigning a relevance value to tags. In order to do so, we employ a mixture of Natural Language Processing (NLP) and Computer Vision (CV) techniques and resources. This allows us to take into account both the textual and the visual features of multimodal contents in a combined and synergistic way.
      </p>
      <p>This paper is structured as follows: Section 2 gives a brief overview of existing work in this field. Section 3 describes the proposed approach to tag refinement. Section 4 presents the system architecture, including both the NLP and CV modules. Sections 5 and 6 report on the collection of the dataset used for the evaluation, namely a set of manually annotated Italian Instagram posts, and on the evaluation itself. Section 7 discusses the performance achieved by the current system implementation, and Section 8 presents conclusions and future research.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        As suggested by [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], existing research on image tags focuses on three main tasks, namely tag assignment, tag refinement and tag retrieval. In tag assignment, given an unlabeled image, the goal consists in assigning a (fixed) number of tags related to the image content [
        <xref ref-type="bibr" rid="ref26 ref34 ref35">26,34,35</xref>
        ]. In tag refinement, given an image associated with some initial tags, the objective is to separate irrelevant tags from relevant ones [
        <xref ref-type="bibr" rid="ref23 ref24 ref37 ref38">24,37,23,38</xref>
        ]. Finally, given a tag and a collection of images, tag retrieval focuses
on retrieving relevant images with respect to the tag of interest [
        <xref ref-type="bibr" rid="ref10 ref13 ref36">10,13,36</xref>
        ].
      </p>
      <p>
        In this work, we address the tag refinement task, and we aim at separating relevant tags from noisy ones in user-tagged Instagram images. Previous studies tackled tag refinement from several perspectives, including the use of linguistic information only, by measuring tag relevance in terms of weighted semantic similarity between the target tag and the other tags [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] associated with the image, or by integrating textual data with the visual content [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. A different approach exploited Principal Component Analysis [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to build a model based on
the factorization of an image-tag matrix by a low-rank decomposition with error
sparsity [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ]. A systematic evaluation and comparison of various tag refinement models was carried out in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. In order to guarantee a reliable comparison, the authors implemented several methods by exploiting the same models to process the textual and the visual content. We refer the interested reader to this survey for details on model implementations and results.
      </p>
      <p>
        This evaluation shows that the best performing model for tag refinement is the one based on robust PCA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and a CNN [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] to process the image content, achieving performances between 0.57 and 0.63 depending on the size of the training set. However, the datasets employed in their evaluations present several elements that do not comply with the purpose of the present work. First, both NUS-WIDE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and MirFlickr [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] were created several years ago, and since then the way to publish multimodal content has changed considerably. While in Flickr-based datasets the content of tags is almost exclusively referential, in the last few years, with the advent of social media such as Instagram, tags are used not only as referential descriptors, but also for several other purposes, such as expressing emotions [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Flickr users were mostly interested in promoting their images specifically for their content. On the contrary, Instagram users tend to favour interactions with other users, and are therefore inclined to produce tags that make their posts accessible to the widest possible audience, regardless of the content of the image. Moreover, such datasets contain limited English dictionaries of tags that do not allow for a multilingual evaluation, which is crucial in the present work. To the best of our knowledge, no previous work addressed tag refinement by exploiting computer vision along with multilingual distributional semantics. For these reasons, we decided to create a new evaluation set from public photos on Instagram.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Proposed approach</title>
      <p>As anticipated, we approach the tag refinement task by integrating NLP and CV techniques. Our system MORE (MultimOdal tag REfinement) is aimed at separating relevant tags from noisy ones in user-tagged Instagram images. In particular, it exploits several resources to compute tag relevance. We designed its architecture to address several issues.</p>
      <p>
        First, MORE is expected to work on Italian Instagram posts. On this social network, users publish their photos accompanied by a list of tags (with a maximum of 30 per photo). For Italian users, these tags are often both in Italian and English, which increases the number of tags and adds noise and redundancy. In Computational Linguistics, recent years have witnessed the growing use of word embeddings, that is, dense distributional vectors typically built with neural language models to represent the meaning of words [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Usually, these embeddings are monolingual, but given the nature of Instagram and the behaviour of its Italian users, we need to represent the meaning of words not only in Italian but also in a cross-lingual way. Multilingual embedding models have been proposed in the recent literature. They are able to capture words and their translations across languages in a joint embedding space [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. This kind of embeddings is very appealing for applications in social media like Instagram, since they allow us to deal with both Italian and English tags.
      </p>
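      <p>The idea of querying a joint Italian-English space can be sketched as follows. This is a toy example with illustrative 2-d vectors, not the actual MUSE embeddings: in the real system the neighbors come from the aligned fastText spaces.</p>
```python
import math

def nearest(word, space, k=2):
    """Return the k words closest to `word` by cosine similarity."""
    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den
    q = space[word]
    others = [(w, cos(q, v)) for w, v in space.items() if w != word]
    return [w for w, _ in sorted(others, key=lambda t: -t[1])[:k]]

# Toy joint space: Italian and English vectors live in the same space,
# so a query retrieves neighbors across both languages.
space = {
    "cattedrale": [0.95, 0.10],
    "cathedral":  [0.97, 0.08],
    "church":     [0.85, 0.20],
    "pizza":      [0.05, 0.99],
}
```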
      <p>
        The second challenge we need to address is separating relevant from noisy
tags with respect to the image content as well. We de ne the relevance of a tag in
terms of the consistency between its denotational meaning and the visual content
of an image (e.g., objects and scenarios appearing in it). For example, the tag
boat is relevant for an image containing a seascape with a ship. To this purpose,
we use a framework to convert an image into a series of textual descriptors
for all the elements contained in it. This kind of descriptors are extracted with
one of the most popular Convolutional Neural Network architectures for image
labeling, namely the VGG-16 Neural Network. More speci cally, we used the
pretrained VGGNet [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ].
      </p>
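      <p>A minimal sketch of this image-to-descriptors step with the pretrained VGG-16 in Keras (the file name is illustrative; loading the model downloads the ImageNet weights):</p>
```python
import numpy as np

def nonzero_labels(decoded, threshold=0.0):
    """Keep the class names whose probability exceeds the threshold
    (by default all non-zero predictions, as in the MORE default)."""
    return [name for (_, name, prob) in decoded if prob > threshold]

def classify_image(path):
    # Heavy imports are kept local; VGG16() fetches the ImageNet weights.
    from tensorflow.keras.applications.vgg16 import (
        VGG16, preprocess_input, decode_predictions)
    from tensorflow.keras.preprocessing import image
    model = VGG16(weights="imagenet")  # 1,000 ImageNet classes
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    preds = model.predict(x, verbose=0)
    return nonzero_labels(decode_predictions(preds, top=10)[0])
```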
      <p>The third challenge is related to the prospective application context of MORE. Indeed, the system is expected to reduce the implicit noise of automatically crawled Instagram datasets. This feature is very important in order to support other industrial applications such as market surveys and sentiment or brand reputation analysis. Despite being very popular for Business Intelligence, such applications often suffer from the noise generated by user-defined tags. In fact, these tags are often carefully chosen to favor image retrieval and visibility. Therefore, they require some form of preprocessing to guarantee the reliability of the aggregated results. For example, the image in Figure 1 was published with several English and Italian tags. Some describe the image content (e.g. fiori (`flowers'), occhiali (`glasses'), orchidea (`orchid'), flower), while others are clearly used to promote the retrieval of the image on the Instagram platform (e.g. moda (`fashion'), buongiorno (`good morning'), Roma (`Rome'), outfit).</p>
      <p>
        In order to separate relevant from noisy tags, MORE exploits three main resources to compute tag relevance:
VGG-16: To establish the relevance of a tag for a given image, the pretrained VGG-16 network [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] is used. In particular, we adopted the version available
in Keras [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and trained on ImageNet [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The ImageNet dataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] consists of 1.4 million images, each labelled with one of 1,000 different classes.
MUSE: The system compares English and Italian word vectors in a multilingual space. To build this space, the Multilingual Unsupervised and Supervised Embeddings (MUSE) framework [
        <xref ref-type="bibr" rid="ref19 ref8">8,19</xref>
        ] has been used to align in a single vector space the pretrained versions of the Italian and English fastText
      </p>
      <p>
        Wikipedia word embeddings [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In particular, the Italian space is used as the source space and the English one as the target.
      </p>
      <p>
        OMW: Multilingual WordNet synsets were used to translate the ImageNet classes and to obtain their hypernyms. Specifically, the version of Open Multilingual Wordnet (OMW) [
        <xref ref-type="bibr" rid="ref2 ref3">3,2</xref>
        ] available in NLTK [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] was used to make
joint queries on the English [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and the Italian [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] models.
      </p>
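      <p>The translation and extension of the ImageNet labels can be sketched as follows. This is a sketch under stated assumptions: the helper names are ours, and the NLTK `wordnet` and `omw-1.4` corpora must be installed for the OMW query to run.</p>
```python
def translate_label(label, lang="ita"):
    """Italian lemmas sharing a synset with the English label, plus the
    hypernyms of each synset in both languages (requires the NLTK
    'wordnet' and 'omw-1.4' corpora)."""
    from nltk.corpus import wordnet as wn
    out = []
    for syn in wn.synsets(label.replace(" ", "_"), pos="n"):
        out.extend(syn.lemma_names(lang))          # same synset, Italian lemmas
        for hyper in syn.hypernyms():              # hypernyms, both languages
            out.extend(hyper.lemma_names("eng"))
            out.extend(hyper.lemma_names(lang))
    return out

def extended_labels(L, LT):
    """EL = L plus LT, preserving order and dropping duplicates."""
    return list(dict.fromkeys(L + LT))
```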
    </sec>
    <sec id="sec-4">
      <title>The MORE Architecture</title>
      <p>Given a set of Instagram posts consisting of an image and a list of tags, the
goal of the MORE system is to distinguish relevant from irrelevant tags. The
architecture of MORE is shown in Figure 2.</p>
      <p>Our dataset D is formally defined as D = {P1, ..., Pk}. Each element Pi is a pair Pi = (T, p), where T is a list {t1, ..., tn} of tags including both relevant and irrelevant ones, and p is the visual content. For each Pi ∈ D, MORE carries out the following process:
Image classification: MORE classifies p with the VGG-16 model and returns a list L = {l1, ..., ln} of English labels belonging to the 1,000 ImageNet classes. A parameter specifies the probability threshold of the output classes (by default, the system returns all non-zero values). Therefore, this step provides a list of potential labels associated with the image and referring to its content. For example, given the photo in Figure 3, the output of the classification process is as follows: castle, monastery, palace, bell cote, church.
Labels translation and extension: The system translates the English labels in L into Italian by exploiting both WordNet senses (OMW) and multilingual embeddings (MUSE). For each label in L, the system retrieves from OMW all the Italian lemmas marked with the same synset as the English label. Moreover, the system also extracts the hypernyms of each English label (e.g., cat and feline from Egyptian cat). The output of this step for each image is a list of Italian translated labels LT = {lt1, ..., ltn}. For the example provided in Figure 3, the list of Italian translated labels (LT) includes: convento (`monastery'), monastero (`monastery'), palazzo (`palace'), chiesa (`church'). The list of extended labels EL is defined as L ∪ LT (i.e., the English predicted labels and their Italian translations). The EL of the previous example contains: church, monastery, monastero (`monastery'), convento (`monastery'), palazzo (`palace'), bell cote, chiesa (`church'), castle, palace, as well as their hypernyms in Italian and English (e.g. abitazione (`dwelling'), religious residence).
Multilingual neighborhood: For each tag t belonging to T, the system carries out a query on the MUSE multilingual embedding models to collect the top x nearest neighbors of each element both in Italian and English. In
particular, we use the Italian space as the source space and the English one as the target space, in order to populate the list of extended tags (ET). By default, the
parameter x is set to 5. For example, given the tag cattedrale (`cathedral'),
its Italian nearest neighbors are cattedrale (`cathedral'), procattedrale
(`procathedral'), concattedrale (`co-cathedral'), cathedrale (`cathedral'), basilica
(`basilica') while its English ones are: cathedral, basilica, cathedra, cathedrals,
church.</p>
      <p>Filtering: The system filters the elements in T by considering ET and EL. In particular, for each tag t in T, if t belongs to EL then it is added to the set R of relevant tags. Otherwise, t is added to R in three cases: (i) one of its nearest neighbors in ET belongs to EL; (ii) the vector of t is similar to at least one of the vectors of EL; (iii) the vector of at least one of the neighbors of t in ET is similar to any of the vectors of EL. Similarity is computed with the cosine. By default, the threshold on the cosine is set to 0.4. For example, the tag cattedrale (`cathedral') in Figure 3 was found to be similar to most of the labels (e.g., chiesa (`church') with a cosine similarity of 0.74). Since this is an instance of case (ii), it is added to R. In fact, even if the tag itself were not similar to any of the labels, one of its neighbors in the English space is church. Since this word belongs to EL, the tag would be marked as relevant anyway (case (i)). Moreover, we can consider the case of the tag architecture. It has a cosine similarity below the threshold for all of the elements in EL. Nonetheless, at least one of its neighbors (i.e., architectural) has a cosine similarity above the specified threshold for at least one of the labels, in particular with the label church, with a similarity of 0.42. Therefore, according to scenario (iii), architecture was added to the relevant tags R. Conversely, neither the tag brescia (`Brescia', an Italian city) nor any of its neighbors was found to be similar to any of the labels (e.g., the cosine with monastery is 0.27 in English and 0.13 in Italian). Thus, the tag is not considered relevant by the system, despite being in fact correct. The default MORE configuration marks as relevant only the following tags among the ones coming with the image (cf. Figure 3): architecture, art, arte (`art'), cattedrale (`cathedral'), cathedral, city, cultura (`culture'), foto (`photo'), fotografia (`photography'), italia (`Italy'), italy, landscape, monument, photo, photography, pics, prospettiva (`perspective'), world.</p>
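      <p>The filtering step can be sketched as follows. This is a toy sketch of cases (i)-(iii) with illustrative 2-d vectors in place of the MUSE spaces; the function and variable names are ours.</p>
```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def filter_tags(T, EL, neighbors, vectors, threshold=0.4):
    """Return the set R of relevant tags: t is kept if t is in EL, or
    (i) a neighbor of t is in EL, or (ii) vec(t) is similar to a label
    vector, or (iii) the vector of a neighbor of t is."""
    R = set()
    for t in T:
        if t in EL or any(n in EL for n in neighbors.get(t, ())):
            R.add(t)                                 # direct match or case (i)
            continue
        for c in [t] + list(neighbors.get(t, ())):   # cases (ii) / (iii)
            if c in vectors and any(
                    cosine(vectors[c], vectors[l]) >= threshold
                    for l in EL if l in vectors):
                R.add(t)
                break
    return R
```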
    </sec>
    <sec id="sec-5">
      <title>Dataset description and annotation</title>
      <p>MORE has been evaluated on a portion of the dataset collected in the context of the MUSE project, which aims at performing a multimodal analysis of both texts and images in order to improve the quality of sentiment analysis and brand reputation systems. The whole MUSE dataset was collected between May 2018 and December 2018 by using the official Instagram API.1</p>
      <p>A total of 14 hashtags were used for data collection. Seven of these were closely related to a customer company of the MUSE industrial partner, while the rest are very generic and not related to any particular topic. We do not report the list of the first group of hashtags. The list of the second group is as follows: follow4follow, igers, followme, instago, italia, buongiorno, instaitalia.</p>
      <p>Overall, the dataset consists of more than 200k images, and the effectiveness of the system has been evaluated internally by the company. However, a portion of 50 randomly selected images was manually annotated for tag relevance by 7 human annotators.</p>
      <p>The participants in a questionnaire rated 50 images with respect to several human-provided Instagram hashtags. Participants were asked to find the "relevant" hashtags with respect to the image content. The English translation of the instructions is reported in Figure 4. Each image was presented to annotators together with a multi-selection button showing the list of the original hashtags (see Figure 5).</p>
      <p>The annotation task was quite difficult, since the annotators had to select, for each image, one, some or no tags, with potentially different degrees of relevance. The number of tags for each image was variable and depended on the actual tags obtained when the post was collected.</p>
      <p>
        Fleiss' kappa [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] was used to compute the inter-annotator agreement on the 1195 data points consisting of image-tag pairs. The overall agreement was 0.42, but an ablation experiment on the raters demonstrated that a global agreement of 0.58 could be reached with 6 out of the 7 raters.
      </p>
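      <p>Fleiss' kappa can be computed from a table of per-item category counts (here, for each image-tag pair, how many of the 7 annotators chose relevant vs. not relevant). A minimal sketch; the toy counts in the test are illustrative, not the study data.</p>
```python
def fleiss_kappa(table):
    """table[i][j] = number of raters assigning item i to category j;
    every row must sum to the same number of raters n."""
    N = len(table)
    n = sum(table[0])
    k = len(table[0])
    # proportion of all assignments falling in category j
    p_j = [sum(row[j] for row in table) / (N * n) for j in range(k)]
    # observed agreement for each item
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in table]
    P_bar = sum(P_i) / N          # mean observed agreement
    P_e = sum(p * p for p in p_j)  # chance agreement
    return (P_bar - P_e) / (1 - P_e)
```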
      <p>Given such agreement, we based the final decision about the relevance of each tag on the majority vote criterion (4 votes), as shown in Figure 6. (On Dec. 12, 2018, the Instagram API changed drastically so as to comply with the new GDPR regulations, https://www.instagram.com/developer/changelog/, and the collection of the dataset was stopped.)</p>
      <p>This survey aims at identifying, given a set of images and tags (i.e. hashtags), the subset of "relevant" tags given the image content.</p>
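      <p>The majority-vote criterion (at least 4 of the 7 annotators) amounts to requiring a strict majority; a trivial sketch:</p>
```python
def is_relevant(votes, n_annotators=7):
    """A tag counts as relevant if a strict majority of the annotators
    (at least 4 of 7) selected it."""
    return votes >= n_annotators // 2 + 1
```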
      <p>A tag is relevant to an image if it refers to the entities (people, objects, places, etc.)
depicted in the image.</p>
      <p>
        For example, for an image depicting a cathedral, tags such as "cathedral" and "church" will be relevant, but tags such as "goaround", "tourist" and "hello" will not. Likewise, for an image depicting a person in front of a cathedral, tags such as "cathedral", "church" and "tourist" will be relevant, but tags such as "goaround" and "hello" will not.
Each image can contain tags both in Italian and in English. If you do not know the meaning of the term used as a tag, use a dictionary to verify the relevance. There may also be hashtags composed of the concatenation of several words. Please select these tags where relevant.
In the final evaluation dataset, the average number of tags associated with each item is 23.9 (σ = 9.13). Human ratings reveal that on average 3.6 tags (σ = 2.9) were actually relevant for a given image, while 20.3 (σ = 9.5) were not. The distribution of relevant vs. noisy user-defined hashtags is in line with the findings illustrated in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], since approximately only 18% of tags are actually relevant.
      </p>
      <p>For the evaluation, each image-tag pair was considered independent from the others. In other words, each image is represented in the final dataset by a number of data points equal to the number of its original hashtags. Note that the set of relevant tags is always a subset of the original tags.</p>
      <p>Since the annotators were asked to mark as correct the tags referring to the objects visible in the images, we used the output of the image classification step as a first baseline for the task. The results are reported in Table 1.</p>
      <p>The baseline has a clear issue in terms of flexibility. The baseline classifier, which is simply the VGG-16 classifier trained on ImageNet, may never predict certain tags, as it is limited by the number of classes it was trained on, and may give more weight to certain aspects of the image. For example, if we consider the image in Figure 1, one of the correct tags is flower. However, if we look at the output of the classification step, the words pot and vase are included, while flower itself is not predicted. This may be due to the fact that what the classifier is trained to see is actually a flower pot, but not just a flower. Therefore, in this case, flower is actually counted as a False Negative example.</p>
      <p>Fig. 6. Number of ratings for the hashtags associated to the image in Figure 5. A tag is associated to a given image if at least 4 annotators marked it as relevant.</p>
    </sec>
    <sec id="sec-6">
      <title>Results and Discussion</title>
      <p>The overall performance of the system was assessed by comparing the model predictions against the human ratings. Table 2 shows the results of the evaluation in terms of Precision, Recall, F1-score and Support.</p>
      <p>Class          P     R     F1    Support
False          0.92  0.61  0.73  1014
True           0.24  0.71  0.36  181
Macro avg      0.58  0.66  0.55  1195
Weighted avg   0.82  0.62  0.68  1195</p>
      <p>Table 2. Precision (P), Recall (R), F1-Score (F1) and Support of MORE on a dataset of 50 Instagram images with the default configuration.</p>
      <p>True Positives (TP) are the image-tag pairs for which both humans and the system assigned the class True (relevant). In the same way, True Negatives (TN) are data points for which both the system and humans assigned the class False (noisy). False Positives (FP) are predicted as relevant by the system, but rated as noisy by humans. Finally, False Negatives (FN) are the examples rated as relevant by humans but marked as noisy by the system.</p>
      <p>Given the distribution of relevant tags in our dataset (18% of the total), most of the data points belong to the False class, affecting the macro-averaged result. Therefore, we prefer to consider the weighted average as the evaluation metric: Precision, Recall and F1-score are computed for each class, and their average is weighted by support (i.e., the number of true instances for each class).</p>
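      <p>Support-weighted averaging over the per-class scores of Table 2 can be reproduced as follows (computed from the rounded table values, so the last digit may differ slightly from the reported figures):</p>
```python
def weighted_avg(scores, supports):
    """Average per-class scores weighted by class support."""
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)

supports = [1014, 181]                     # False, True
wp = weighted_avg([0.92, 0.24], supports)  # weighted Precision
wr = weighted_avg([0.61, 0.71], supports)  # weighted Recall
wf = weighted_avg([0.73, 0.36], supports)  # weighted F1-score
```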
      <p>Overall, the system reaches a weighted average F1-score of 0.68, which is distributed differently across the relevant and noisy hashtag classes. Moreover, it is important to stress that in the case of irrelevant tags (the False class) it is very important to maximize Precision in order to avoid noise. On the contrary, for relevant tags (the True class) we are particularly interested in maximizing Recall, in order to guarantee a satisfactory retrieval of relevant images.</p>
      <p>In order to pursue these goals, we decided to use, along with the standard metrics, an additional one consisting of the average between the Precision of the False class and the Recall of the True one. This metric, in fact, provides a useful way to evaluate whether the system is able to discard noisy tags and, at the same time, to preserve the correct ones. If we looked only at the Precision and Recall of the individual classes, we would not be able to capture this information. This metric, calculated with the MORE default parameters, is 0.815, outperforming the baseline (for which it was 0.44) by a wide margin, despite the baseline's weighted average F1-score of 0.79. Even though we consider such results promising, we performed a manual evaluation of error types to identify the most challenging cases for MORE.</p>
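      <p>This additional metric is simply the mean of the two figures of interest; with the Table 2 values:</p>
```python
def noise_vs_recall(precision_false, recall_true):
    """Mean of Precision on the False class and Recall on the True class:
    high only if noisy tags are discarded AND relevant ones are kept."""
    return (precision_false + recall_true) / 2.0

score = noise_vs_recall(0.92, 0.71)  # Table 2, default configuration
```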
      <p>One of the problems we detected concerns the recall of multiword hashtags such as sprayart (`spray art'), biancoenero (`black and white', concatenating the words "bianco e nero"), fotodaltreno (`photos from the train', from "foto dal treno"), fiorieocchiali (`flowers and glasses', from "fiori e occhiali"), creativemakeup (`creative makeup').</p>
      <p>As for False Negative examples, 37 of the 53 total examples (about 70%) were out-of-vocabulary (OOV) in both the source and the target word space. There is surely wide room for improvement. For example, the MORE match function could be enriched with the ability to segment multiword hashtags into standard lexical entries, or by exploiting fastText features to extract the vectors of out-of-vocabulary words also in a multilingual space.</p>
      <p>As for False Positive examples, we noticed that MORE is more prone to predict as relevant high-frequency words such as photo, nice, look, selfie, style, pizza. In addition, we noticed that MORE tends to mark abstract words (in Italian and English) as relevant tags. Despite being an error, we can consider it expected given the way in which MORE has been constructed. In fact, very often the vectors of abstract words are highly associated with the referential objects depicted in the photos. For example, words like mood, enjoy, happy, verità (`truth'), bellezza (`beauty'), parole (`words') have been considered False by annotators because, according to the provided instructions, they "do not refer to the entities (people, objects, places, etc.) depicted in the image". Nonetheless, given the high association of such word vectors with referential objects, MORE often tags them as relevant.</p>
      <p>We also performed further experiments to understand how the various
parameters affect performance and how they can be exploited to reach different
goals. Such parameters are clearly important levers for the behavior of the
application. For example, a higher cosine similarity threshold discards more
tags. This is useful if the final goal is to refine tags as accurately as
possible. On the other hand, this setting will negatively affect the recall of relevant
hashtags. The same consideration holds for increasing the number of nearest
neighbors of tags and labels. We performed a parameter tuning experiment in which we
evaluated MORE with a cosine similarity threshold of 0.3, 0.4 and 0.5. Similarly, we
assessed the system by changing the number of nearest neighbors. In this case,
the evaluation was performed with 3, 5 and 10 nearest neighbors. In order to
choose the best performing configuration, we selected the one maximizing the
average between the Precision of the class False (not relevant) and the Recall
of the True (relevant) one. For this metric, MORE performs at best at 0.87. The
results of this model are reported in Table 3.</p>
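<p>The selection procedure described above can be sketched as follows; evaluating MORE itself is simulated here, and all grid scores except the best configuration (threshold 0.3, 10 neighbors, score 0.87, as reported) are illustrative stand-ins:</p>

```python
# Hypothetical sketch of the parameter sweep: pick the (threshold, neighbors)
# configuration maximizing the average of Precision(False) and Recall(True).
def selection_metric(precision_false, recall_true):
    # Configuration score used for model selection.
    return (precision_false + recall_true) / 2

# Simulated evaluation results per (threshold, n_neighbors): (P_false, R_true).
# Only the first entry mirrors the reported best score of 0.87.
results = {
    (0.3, 10): (0.90, 0.84),
    (0.3, 5):  (0.88, 0.80),
    (0.4, 5):  (0.91, 0.70),
    (0.5, 3):  (0.93, 0.55),
}

best = max(results, key=lambda cfg: selection_metric(*results[cfg]))
```

<p>This criterion favors configurations that are simultaneously strict on noisy tags (high Precision on False) and permissive enough on relevant ones (high Recall on True).</p>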
      <p>By considering the standard metrics for this model, we notice that the
Weighted AVG Precision also improves after the parameter tuning. Weighted AVG
Recall (and thus F1-score), on the contrary, is negatively affected by the results
on False, which is the majority class.</p>
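<p>As a minimal illustration of why the weighted averages follow the majority class, consider the following sketch with made-up per-class scores and supports:</p>

```python
# Weighted averaging of per-class scores: each class contributes in
# proportion to its support, so the majority class dominates the figure.
def weighted_avg(scores, supports):
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# Example: False (not relevant) is the majority class.
recall_per_class = [0.70, 0.90]   # recall on False, recall on True
support = [800, 200]              # made-up class supports

w_recall = weighted_avg(recall_per_class, support)
# A drop on False outweighs a high recall on True in the weighted figure.
```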
      <p>Table 3. Per-class Precision (P), Recall (R) and F1-score of the tuned model.</p>
      <p>Fig. 8. MORE predictions against human
annotation after parameter tuning
(cosine sim. threshold: 0.3; nearest
neighbors: 10).</p>
      <p>In order to further improve the overall performance, a viable option could
be to leverage the frequency and the popularity of the tags on Instagram. Such
tags, in fact, account for many of the false positives (e.g. high-frequency words),
and decreasing their number could also increase the Weighted AVG F1-score.</p>
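<p>A possible implementation of such a frequency-based filter, with hypothetical function names and made-up frequency counts, could look like this:</p>

```python
# Hypothetical filter for the improvement suggested above: discard tags whose
# corpus frequency exceeds a cutoff before running the relevance check.
def filter_frequent_tags(tags, tag_frequency, cutoff):
    """Keep only tags whose frequency is below `cutoff` occurrences."""
    return [t for t in tags if tag_frequency.get(t, 0) < cutoff]

# Made-up Instagram frequency counts for illustration only.
freqs = {"photo": 9_500_000, "nice": 7_200_000, "duomo": 12_000}
kept = filter_frequent_tags(["photo", "nice", "duomo"], freqs, cutoff=1_000_000)
```

<p>Generic high-frequency tags such as photo and nice would be discarded up front, leaving only content-bearing candidates for the multimodal matching step.</p>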
      <p>Figures 7 and 8 show the predictions of both the default and the tuned
configuration compared to human annotation. Green tags are true positives,
red tags are false positives (the system considered them relevant while humans
did not), and orange tags are false negatives. We can see that abstract words are
false positives in both models. This means that the thresholds are not able to
properly mitigate this phenomenon, because words like beach, travel and relax
are strongly associated with one another (e.g. to remove the tags relax and travel
the system requires a cosine threshold of at least 0.5, with the effect of removing
true positive tags as well).</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>In this paper, we presented MORE (MultimOdal tag REfinement), a system
aimed at improving image descriptors by exploiting NLP and CV techniques.
The system starts from user-defined Instagram image annotations, and merges
visual and textual information to find a match between the tags provided with
an image and its semantic visual content. Textual features have been extracted
from text-based tags by exploiting (multilingual) word embeddings. Visual
features have been gathered by exploiting image classification. The system has
been evaluated on a manually annotated Italian dataset, achieving a weighted
F1-score of 68%.</p>
      <p>The results of MORE are promising, but there are still wide margins of
improvement for several key aspects of the system, including: (i) the construction of
multilingual embeddings; (ii) the management of multilingual hashtags; (iii) the
refinement and extension of the evaluation process; (iv) the distribution of the
manually annotated dataset. In the near future, our efforts will be focused
on these directions. As for (i), we aim at training embeddings on a mixture
of social media and general-purpose corpora. This combination is expected to
positively affect (ii), as it would enable the collection of reliable vectors for
multiword hashtags, thus reducing the number of OOV words during filtering. As for
(iii), we plan to fine-tune all the modules, including the neural network for image
classification, and to study the contribution of each module of the architecture
to the classification. Finally, for (iv), we plan to extend the manually annotated
dataset to improve the evaluation and to make it available for research purposes,
in accordance with the Instagram privacy constraints.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This research has been supported by the Project MUltimodal Semantic
Extraction (MUSE), in collaboration with Bnova s.r.l., funded by Regione Toscana with
the grant POR FSE 2014-2020 Asse A.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching word vectors with subword information</article-title>
          .
          <source>arXiv preprint arXiv:1607.04606</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bond</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foster</surname>
          </string-name>
          , R.:
          <article-title>Linking and extending an open multilingual wordnet</article-title>
          .
          <source>Sofia</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bond</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paik</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A survey of wordnets and their licenses</article-title>
          .
          <source>In: Proceedings of the 6th Global WordNet Conference (GWC 2012). Matsue</source>
          (
          <year>2012</year>
          ),
          <fpage>64</fpage>
          -
          <lpage>71</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bruni</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>G.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baroni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Distributional semantics from text and images</article-title>
          .
          <source>In: Proceedings of the GEMS 2011 workshop on geometrical models of natural language semantics</source>
          . pp.
          <fpage>22</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Candes</surname>
            ,
            <given-names>E.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wright</surname>
          </string-name>
          , J.:
          <article-title>Robust principal component analysis?</article-title>
          .
          <source>Journal of the ACM (JACM)</source>
          <volume>58</volume>
          (
          <issue>3</issue>
          ),
          <fpage>11</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Deep learning with Python</article-title>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chua</surname>
            ,
            <given-names>T.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Nus-wide: a real-world web image database from national university of singapore</article-title>
          .
          <source>In: Proceedings of the ACM international conference on image and video retrieval</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Denoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jegou</surname>
          </string-name>
          , H.:
          <article-title>Word translation without parallel data</article-title>
          .
          <source>arXiv preprint arXiv:1710.04087</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet: A Large-Scale Hierarchical Image Database</article-title>
          .
          <source>In: CVPR09</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsang</surname>
            ,
            <given-names>I.W.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Improving web image search by bag-based reranking</article-title>
          .
          <source>IEEE Transactions on Image Processing</source>
          <volume>20</volume>
          (
          <issue>11</issue>
          ),
          <fpage>3280</fpage>
          -
          <lpage>3290</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Fellbaum</surname>
          </string-name>
          , C. (ed.):
          <article-title>WordNet An Electronic Lexical Database</article-title>
          . The MIT Press (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Fleiss</surname>
          </string-name>
          , J.:
          <article-title>Measuring nominal scale agreement among many raters</article-title>
          .
          <source>Psychological Bulletin</source>
          <volume>76</volume>
          (
          <issue>5</issue>
          ),
          <fpage>378</fpage>
          -
          <lpage>382</lpage>
          (November
          <year>1971</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zha</surname>
            ,
            <given-names>Z.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Visual-textual joint relevance learning for tag-based social image search</article-title>
          .
          <source>IEEE Transactions on Image Processing</source>
          <volume>22</volume>
          (
          <issue>1</issue>
          ),
          <fpage>363</fpage>
          -
          <lpage>376</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Giannoulakis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsapatsoulis</surname>
          </string-name>
          , N.:
          <article-title>Evaluating the descriptive power of instagram hashtags</article-title>
          .
          <source>Journal of Innovation in Digital Ecosystems</source>
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <fpage>114</fpage>
          -
          <lpage>129</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Golder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huberman</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Usage patterns of collaborative tagging systems</article-title>
          .
          <source>J. Information Science</source>
          <volume>32</volume>
          ,
          <fpage>198</fpage>
          -
          <lpage>208</lpage>
          (April
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ke</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A multi-view embedding space for modeling internet images, tags, and their semantics</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>106</volume>
          (
          <issue>2</issue>
          ),
          <fpage>210</fpage>
          -
          <lpage>233</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Huiskes</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomee</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lew</surname>
            ,
            <given-names>M.S.</given-names>
          </string-name>
          :
          <article-title>New trends and ideas in visual concept detection: the mirflickr retrieval evaluation initiative</article-title>
          .
          <source>In: Proceedings of the international conference on Multimedia information retrieval</source>
          . pp.
          <fpage>527</fpage>
          -
          <lpage>536</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Denoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jegou</surname>
          </string-name>
          , H.:
          <article-title>Word translation without parallel data</article-title>
          .
          <source>In: 6th International Conference on Learning Representations, ICLR</source>
          <year>2018</year>
          , Vancouver, BC, Canada, April 30 - May 3,
          <year>2018</year>
          , Conference Track Proceedings (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Distributional Models of Word Meaning</article-title>
          .
          <source>Annual review of Linguistics</source>
          <volume>4</volume>
          ,
          <fpage>151</fpage>
          -
          <lpage>171</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Worring</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Learning social tag relevance by neighbor voting</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>11</volume>
          (
          <issue>7</issue>
          ),
          <fpage>1310</fpage>
          -
          <lpage>1322</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uricchio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballan</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bimbo</surname>
            ,
            <given-names>A.D.</given-names>
          </string-name>
          :
          <article-title>Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>49</volume>
          (
          <issue>1</issue>
          ),
          <fpage>14:1</fpage>
          -
          <lpage>14:39</lpage>
          (Jun
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.J.</given-names>
          </string-name>
          :
          <article-title>A locality sensitive low-rank model for image tag completion</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>18</volume>
          (
          <issue>3</issue>
          ),
          <fpage>474</fpage>
          -
          <lpage>483</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
          </string-name>
          , H.:
          <article-title>Image annotation using multi-correlation probabilistic matrix factorization</article-title>
          .
          <source>In: Proceedings of the 18th ACM international conference on Multimedia</source>
          . pp.
          <fpage>1187</fpage>
          -
          <lpage>1190</lpage>
          .
          ACM
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Nltk: The natural language toolkit</article-title>
          .
          <source>In: Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1</source>
          . pp.
          <fpage>63</fpage>
          -
          <lpage>70</lpage>
          . ETMTNLP '02, Association
          for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Makadia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavlovic</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Baselines for image annotation</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>90</volume>
          (
          <issue>1</issue>
          ),
          <fpage>88</fpage>
          -
          <lpage>105</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Nov</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naaman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>What drives content tagging: the case of photos on flickr</article-title>
          .
          <source>In: Proceedings of the SIGCHI conference on Human factors in computing systems</source>
          . pp.
          <fpage>1097</fpage>
          -
          <lpage>1100</lpage>
          .
          ACM
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Pianta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bentivogli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girardi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Multiwordnet: developing an aligned multilingual database</article-title>
          .
          <source>In: Proceedings of the First International Conference on Global WordNet. Mysore (India)</source>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vulić</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Søgaard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A survey of cross-lingual word embedding models</article-title>
          .
          <source>arXiv preprint arXiv:1706.04902</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Sen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lam</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rashid</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cosley</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frankowski</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Osterhouse</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harper</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedl</surname>
          </string-name>
          , J.:
          <article-title>Tagging, communities, vocabulary, evolution</article-title>
          .
          <source>In: Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work</source>
          . pp.
          <fpage>181</fpage>
          -
          <lpage>190</lpage>
          .
          ACM
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31. Sigurbjornsson, B.,
          <string-name>
            <surname>Van</surname>
            <given-names>Zwol</given-names>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          :
          <article-title>Flickr tag recommendation based on collective knowledge</article-title>
          .
          <source>In: Proceedings of the 17th international conference on World Wide Web</source>
          . pp.
          <fpage>327</fpage>
          –
          <lpage>336</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhowmick</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chong</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Social image tag recommendation by concept matching</article-title>
          .
          <source>In: Proceedings of the 19th ACM international conference on Multimedia</source>
          . pp.
          <fpage>1181</fpage>
          –
          <lpage>1184</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hong</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chua</surname>
            ,
            <given-names>T.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>G.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Image annotation by kNN-sparse graph-based label propagation over noisily tagged web images</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology (TIST)</source>
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <fpage>14</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Y.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Social anchor-unit graph regularized tensor completion for large-scale image retagging</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qian</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Joint hypergraph learning for tag-based image retrieval</article-title>
          .
          <source>IEEE Transactions on Image Processing</source>
          <volume>27</volume>
          (
          <issue>9</issue>
          ),
          <fpage>4437</fpage>
          –
          <lpage>4451</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>Tag completion for image retrieval</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>35</volume>
          (
          <issue>3</issue>
          ),
          <fpage>716</fpage>
          –
          <lpage>727</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shimada</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taniguchi</surname>
            ,
            <given-names>R.I.</given-names>
          </string-name>
          :
          <article-title>Non-linear matrix completion for social image tagging</article-title>
          .
          <source>IEEE Access</source>
          <volume>5</volume>
          ,
          <fpage>6688</fpage>
          –
          <lpage>6696</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Image tag refinement towards low-rank, content-tag prior and error sparsity</article-title>
          .
          <source>In: Proceedings of the 18th ACM international conference on Multimedia</source>
          . pp.
          <fpage>461</fpage>
          –
          <lpage>470</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>