Linking Multimedia Items to a Semantic Knowledge Base with User-Generated Tags

Shuangyong Song, Chao Wang, Haiqing Chen
Alibaba Group, Beijing 100102, China.
{shuangyong.ssy; chaowang.wc; haiqing.chenhq}@alibaba-inc.com

Abstract. Multimedia items account for an important part of Linked Open Data (LOD), but currently most semantic relations between multimedia items and a semantic knowledge base (KB) are based on manual semantic annotation. With the popularity of multimedia hosting websites, plentiful tagging information makes it possible to generate semantic relations between multimedia items and a KB automatically. In this paper, we propose a mechanism for linking multimedia items to a KB with user-generated tags, taking into account the topical semantic similarity between tags. Experimental results show the effectiveness of our approach on this task.

Keywords. multimedia LOD; knowledge base; user-generated tags; topic model

1 Introduction

Multimedia LOD has attracted considerable attention, but how to share and search multimedia items on the semantic web remains a significant yet challenging research issue. Most semantic relations between multimedia items and semantic ontologies are based on manual semantic annotation, which is very time-consuming. Some studies try to tag multimedia items automatically based on their web page text [1]; however, the complexity of web text introduces considerable noise and makes it difficult to pinpoint the text that is actually related to a target multimedia item.

Online multimedia hosting systems, such as Flickr, YouTube and Last.fm, have attracted great attention because they give users an effective way to organize, tag and share multimedia items. Research on understanding multimedia items based on their tags has accordingly received growing attention [2].
In this paper, we aim to create links between multimedia items and a semantic KB by using their tagging information, in order to enable semantic multimedia retrieval and the detection of relations between multimedia items. We first detect tags that can be unambiguously linked to ontologies in a high-quality public semantic KB, and process ambiguous tags with a simple tag co-occurrence relation. We then use the multimedia item-tag relation matrix to train topic models, through which we calculate topical semantic relations between tags and detect implicit semantic links between tags and semantic ontologies. Finally, we link multimedia items to the semantic KB through tag-based mediation. In the remainder of this article, we present the problem definition, the proposed model, and the experimental results.

2 Problem Definition & Approach

2.1 Problem Definition

Referring to [2], we define a multimedia ontology as "an ontology of a multimedia item with a unique URI and available links to a publicly recognized semantic KB". We use DBpedia as the referred semantic KB, where identifiers of category-level ontologies begin with "dbo", such as "dbo:movie", and identifiers of instance-level ontologies begin with "dbr", such as "dbr:creditcard". We define the set of multimedia items as M = {m1, m2, ..., mi, ..., mI}, and for each mi, a series of user-defined tags is given as T(mi). Our goal is to create a set of ontologies O(mi) by linking mi to the KB, considering both explicit and implicit semantic links between T(mi) and the KB.

2.2 Approach

1) Linking Tags to Ontologies: We first link all unambiguous tags to the KB, i.e., tags that have exactly one matched ontology definition in the KB. For example, when T(mi) contains the tag "credit_card", we detect the unambiguous matched ontology "dbr:creditcard" and add it to O(mi).
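As a minimal illustration of this unambiguous linking step, the following Python sketch assumes a hypothetical LABEL_INDEX mapping surface labels to candidate DBpedia identifiers (in practice such an index would be built from the DBpedia label dumps; the function and variable names here are illustrative, not from the paper):

```python
# Hypothetical index from tag surface forms to candidate DBpedia identifiers.
LABEL_INDEX = {
    "credit_card": ["dbr:creditcard"],
    "apple": ["dbr:Apple_Inc.", "dbr:Apple_(fruit)"],
}

def link_unambiguous(tags):
    """Link each tag that has exactly one matching ontology in the KB.

    Returns (linked, ambiguous): a dict mapping each unambiguous tag to its
    ontology identifier, and the list of tags with several candidate matches
    (to be resolved later via the co-occurrence relation).
    """
    linked, ambiguous = {}, []
    for t in tags:
        candidates = LABEL_INDEX.get(t.lower(), [])
        if len(candidates) == 1:
            linked[t] = candidates[0]
        elif len(candidates) > 1:
            ambiguous.append(t)
    return linked, ambiguous
```

Tags with no match at all are simply skipped at this stage; they may still be linked later through topical similarity (step 3).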
However, when T(mi) contains a tag with multiple matched ontologies in the KB, such as the tag "Apple", we use a simple tag co-occurrence based method to decide which ontology an ambiguous tag should be linked to, or whether it should be linked at all. The co-occurrence relation R(t_i^a, t_j^u) between an ambiguous tag t_i^a and an unambiguous tag t_j^u is defined as:

R(t_i^a, t_j^u) = \sum_{k=1}^{U} R(t_i^a, t_k^u, t_j^u) = \sum_{k=1}^{U} \frac{C(t_i^a, t_k^u) \, C(t_k^u, t_j^u)}{F(t_k^u)}    (1)

where R(t_i^a, t_k^u, t_j^u) is the partial co-occurrence relation between t_i^a and t_j^u created through t_k^u, U is the number of unambiguous tags, C(t_i^a, t_k^u) is the co-occurrence frequency of t_i^a and t_k^u, and F(t_k^u) is the frequency of t_k^u. Finally, the t_j^u with the maximum R(t_i^a, t_j^u) is taken as a vicarious tag of t_i^a. In addition, we detect some tag-combined ontologies to expand the range of links. For example, if an item has both "DigitalCamera" and "Canon" as tags, we check whether a semantic ontology "DigitalCamera_Canon" or "Canon_DigitalCamera" exists in the KB.

2) Detecting Semantic Relations between Tags: Probabilistic topic models, such as the Latent Dirichlet Allocation (LDA) model, have proved to be powerful tools for identifying latent topical information. We use JGibbLDA [4] to detect topical information from the item-tag matrix. Since the item-tag matrix is sparse and hard to analyze directly, topical information helps discover implicit semantic relations between tags, by calculating the similarity between tags' topical vectors. This step can also be regarded as a dimension reduction of the tags' vector space, which greatly reduces the computational cost of measuring semantic similarity between tags. If we set the number of topics to K, each tag can be represented as a K-dimensional vector.
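The co-occurrence relation of Eq. (1) can be sketched in Python as follows; the function and variable names (cooccurrence_relation, items, unambiguous) are illustrative, and items is assumed to be a list of tag collections, one per multimedia item:

```python
from collections import Counter

def cooccurrence_relation(t_a, t_j, items, unambiguous):
    """Compute R(t_a, t_j) from Eq. (1): the sum over all unambiguous tags
    t_k of C(t_a, t_k) * C(t_k, t_j) / F(t_k), where C is co-occurrence
    frequency over items and F is tag frequency."""
    freq = Counter()   # F: how many items each tag appears in
    cooc = Counter()   # C: how many items each ordered tag pair shares
    for tags in items:
        tags = set(tags)
        for t in tags:
            freq[t] += 1
        for x in tags:
            for y in tags:
                if x != y:
                    cooc[(x, y)] += 1
    r = 0.0
    for t_k in unambiguous:
        if freq[t_k]:
            r += cooc[(t_a, t_k)] * cooc[(t_k, t_j)] / freq[t_k]
    return r
```

The ambiguous tag t_a would then be resolved by picking the unambiguous candidate t_j that maximizes this score, e.g. max(unambiguous, key=lambda t: cooccurrence_relation(t_a, t, items, unambiguous)).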
3) Extending Ontology-Linked Tags: The aim of this step is to expand the scope of "ontology-linked tags" by considering topical semantic similarities between tags, together with some lexical analysis. Lexical analyses include synonym, plural and gerund analysis, while the topical-similarity based method operates on the tags' topical vectors. To check whether an unlinked tag t1 can be linked to the semantic KB, we calculate the topical semantic similarity between t1 and all KB-linked tags t* as the cosine similarity C(t1, t*) of their topical vectors, and select those with similarity greater than a threshold σ as synonymous tags, where σ = 1 - 10^-d, d is a positive integer, and a larger d means a stricter threshold. In addition, for two tags t1 and t2, supplemented with the inclusion relation and the Levenshtein distance between them, we use the following rules to judge whether they are similar tags: a) if t1 and t2 have an inclusion relation, such as "motor" and "motorcycle", and C(t1, t2) is greater than σ, we judge them as similar tags; b) if the Levenshtein distance between t1 and t2 is at most a threshold β, a positive integer, and C(t1, t2) is greater than σ, we judge them as similar tags; c) if t1 and t2 have neither of the above relations, we judge them as dissimilar tags; d) if t1 and t2 are judged as similar tags and only one of them has a semantic link to the KB, we link the other one to the same ontology with probability C(t1, t2).

4) Linking Multimedia Items to Ontologies: Based on the mapping relations between tags and the KB, we link multimedia items to the KB by taking tags as the medium. For selecting predicates in triples, we use 77 million triples collected from the semantic database LOD4ALL 1 as criteria. We create links between multimedia items and KB ontologies with unambiguous predicates. For example, we use the predicate "dbo:locationCountry" to create links from image items to object ontologies such as "China" or "Denmark".
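Rules (a)-(c) above can be sketched as follows, assuming the tags' K-dimensional topical vectors are available in a dict vec (the function and variable names are illustrative, not from the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two topical vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def levenshtein(s, t):
    """Standard edit distance via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def similar_tags(t1, t2, vec, d=3, beta=2):
    """Rules (a)-(c): two tags are similar iff their topical cosine
    similarity exceeds sigma = 1 - 10**-d AND they either have an
    inclusion relation (a) or an edit distance <= beta (b)."""
    sigma = 1 - 10 ** (-d)
    if cosine(vec[t1], vec[t2]) <= sigma:
        return False
    inclusion = t1 in t2 or t2 in t1
    return inclusion or levenshtein(t1, t2) <= beta
```

Rule (d) would then propagate the existing KB link from one similar tag to the other with probability C(t1, t2), e.g. via a random draw against the cosine score.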
For objects with multiple predicates, no links are created, while for other unknown objects we tentatively use the predicate "rdfs:seeAlso", which means "further information about the subject resource".

3 Experiments

3.1 Dataset and Parameter Settings

Three datasets of images, videos and music were prepared: 1) the MIRFlickr dataset [2], which contains 0.7M images and 0.65M tags; 2) the YouTube-8M dataset [3], which consists of about 8M videos and 4,716 unique tags; 3) the Last.fm dataset 2, which consists of about 0.5M music tracks and 0.52M unique tags. We remove multimedia items without any tag or with only "stopword" tags, and the remaining data are used to train domain-sensitive topic models, each with an empirical K value of 200.

To evaluate the effect of different β, we set β to each positive integer between 1 and 6 and compare the resulting performance, since we empirically judge that two tags with a Levenshtein distance greater than 6 are very unlikely to be similar. We randomly choose 500 tag pairs with Levenshtein distance at most 6 as the evaluation dataset for β, and we approximate the recall as 1.0 when β = 6, since we are unable to collect all valid tag pairs. We first obtain recall and precision, and then the F1-value, to evaluate β. Table 1 shows the results. The best performance is obtained when β = 2; therefore, we set β = 2 in the following experiments.

1 https://lod4all.net/zh/index.html
2 https://labrosa.ee.columbia.edu/millionsong/lastfm

Table 1. Results with different β.

           β=1    β=2    β=3    β=4    β=5    β=6
Precision  0.901  0.892  0.692  0.439  0.295  0.248
Recall     0.555  0.601  0.688  0.755  0.835  1.000
F1-value   0.687  0.718  0.690  0.555  0.436  0.397

Table 2. Results with different d (when β = 2).

           d=1    d=2    d=3    d=4    d=5    d=6
Precision  0.322  0.545  0.921  0.877  1.000  1.000
Recall     1.000  0.887  0.728  0.411  0.169  0.055
F1-value   0.487  0.675  0.813  0.560  0.289  0.104

In total, 144,318 tags are unambiguously mapped to dbo or dbr, which accounts for only 17.59% of all tags. To evaluate the effect of different d when β = 2, we manually label 600 multimedia items. We approximate the recall as 1.0 when d = 1, since we are unable to collect all ontologies possibly related to an item. After obtaining precision and recall for each d, we again use the F1-value as the evaluation criterion. Table 2 shows the results. The best performance is obtained when d = 3, which indicates that σ = 1 - 10^-3 gives the best discriminative performance; therefore, we set d = 3 in the following experiments.

3.2 Experimental Results

With our model, 591,850 tags are finally mapped to the KB, accounting for 72.18% of all tags, a tremendous increase compared with 17.59%. We compare our model with two baselines on the tag-similarity calculation subtask:

• Word2Vec based model (W2V): Word2vec is a popular method for obtaining distributed representations of words, whose output form is similar to that of LDA.
• Co-occurrence based model (Co-occur): Co-occur is another method for detecting relations between words, which does not consider latent semantic information.

Table 3. Result comparison with F1-value.

Dataset   W2V    Co-occur  Our model
Flickr    0.356  0.654     0.841
YouTube   0.401  0.686     0.855
Last.fm   0.388  0.640     0.838

We randomly choose 300 items from each dataset as test sets.
For each item, we manually check the validity of every created item-ontology link, from which the precision of each model is easily obtained, and we take the union of all valid results of the three models as the basis for computing each model's recall. The F1-values can then be calculated; the results are shown in Table 3.

4 Future Work

This paper is only preliminary work. Named entity disambiguation and named entity normalization will be considered in our next steps. In addition, predicate discrimination should be performed by considering adjacent tags as context.

5 References

1. Ding, G., Xu, N. Automatic semantic annotation of images based on Web data. In IAS'10.
2. Huiskes, M. J., Lew, M. S. The MIR Flickr Retrieval Evaluation. In SIGMM'08, pp. 39-43.
3. Abu-El-Haija, S., et al. YouTube-8M: A large-scale video classification benchmark. 2016.
4. Phan, X.-H., Nguyen, L.-M., Horiguchi, S. Learning to classify short and sparse text & web with hidden topics from large-scale data collections. In WWW'08, pp. 91-100.