=Paper=
{{Paper
|id=Vol-1917/paper32
|storemode=property
|title=Learning Word Embeddings from Tagging Data: A methodological comparison
|pdfUrl=https://ceur-ws.org/Vol-1917/paper32.pdf
|volume=Vol-1917
|authors=Thomas Niebler,Luzian Hahn,Andreas Hotho
|dblpUrl=https://dblp.org/rec/conf/lwa/NieblerHH17
}}
==Learning Word Embeddings from Tagging Data: A methodological comparison==
Thomas Niebler, Luzian Hahn, and Andreas Hotho

Data Mining and Information Retrieval Group, University of Würzburg (Germany), {niebler, hotho}@informatik.uni-wuerzburg.de, luzian.hahn@stud-mail.uni-wuerzburg.de; L3S Research Center, Hanover (Germany)

===Abstract===
The semantics hidden in natural language are an essential building block for a common language understanding needed in areas like NLP or the Semantic Web. Such information is hidden, for example, in lightweight knowledge representations such as tagging systems and folksonomies. While extracting relatedness from tagging systems shows promising results, the extracted information is often encoded in high-dimensional vector representations, which makes relatedness learning or word sense discovery computationally infeasible. In the last few years, methods producing low-dimensional vector representations, so-called word embeddings, have been shown to yield extraordinary structural and semantic features and have been used in many settings. Up to this point, there has been no in-depth exploration of the applicability of word embedding algorithms to tagging data. In this work, we explore different embedding algorithms with regard to their applicability to tagging data and the semantic quality of the produced word embeddings. For this, we use data from three different tagging systems and evaluate the vector representations on several human intuition datasets. To the best of our knowledge, we are the first to generate embeddings from tagging data. Our results encourage the use of word embeddings based on tagging data, as they capture semantic relations between tags better than high-dimensional representations and make learning with tag representations feasible.

===1 Introduction===
Automatically assessing the degree of semantic relatedness between words, i.e., the relatedness of their actual meanings, in such a way that it fits human intuition is an important task with a variety of applications, such as ontology learning for the Semantic Web [3], tag recommendation [15] or semantic search [13]. Semantic relatedness information of words has been extracted from a variety of sources like plain text [7], website navigation [21, 27] or social metadata [5, 8, 17]. Among others, tagging data from social tagging systems like BibSonomy (https://www.bibsonomy.org) or Delicious (http://www.delicious.com) are useful to extract high-quality semantic relatedness information, e.g., for ontology learning [3].

Traditionally, assessing the degree of semantic relatedness between tags utilizes sparse, high-dimensional vector representations of those tags, which are constructed from tag contexts based on posts in social tagging systems [8]. The semantic relatedness can then be estimated using the cosine measure of the corresponding tag vectors [17]. Finally, evaluating the quality of the estimated scores is usually performed by directly correlating them with human intuition [6, 11, 25]. In recent years, many techniques have been proposed to represent words by dense, low-dimensional vectors [20, 23, 28].
These so-called word embeddings have been shown to yield extraordinary structural features [16, 19] and are applied in machine translation or text classification. Furthermore, word embeddings often outperform high-dimensional representations in tasks such as measuring semantic relatedness [1, 16].

Problem Setting. Traditionally, tags are represented by sparse, high-dimensional vectors [8, 26]. However, although Cattuto et al. have shown that tagging data contain meaningful semantics [8], the correlation of semantic relatedness scores from those vectors with human intuition still leaves room for improvement. Furthermore, the high dimensionality of those representations renders many algorithms using them computationally expensive, which is sometimes referred to as the curse of dimensionality. Up to this point, there have been no extensive attempts to generate word embeddings from social tagging data. All prior studies rely on high-dimensional tagging vectors or reduce the vector space arbitrarily by cutting the dimensionality of the space by a fixed number, which in turn decreases the fit of the resulting relatedness scores to human intuition.

Contribution. We contribute a thorough exploration of the applicability and optimization of three well-known embedding algorithms on tagging data. We first analyze the parameters of each algorithm, before we optimize these settings to produce the best possible word embeddings from tagging data. Then, we compare the embeddings of each algorithm with each other as well as with traditional sparse representations by evaluating them against human intuition. We show that all produced embeddings outperform high-dimensional vector representations. We discuss the results in the light of other semantic relatedness approaches and show that we reach competitive results, on par with recent work on extracting semantic relatedness, which opens another high-quality source of information for semantic relatedness.

Structure of this work. We first cover the related work in Section 2 and the essential theoretical background of this work in Section 3. Afterwards, we investigate three well-known algorithms with respect to their applicability to tagging data in Section 4. In Section 5, we describe the datasets we used in our experiments. Section 6 outlines our experiments, where we compare all generated vector representations with regard to their semantic content. Section 7 discusses our findings, and Section 8 concludes this work.

===2 Related Work===
The related work to this paper can be roughly put into two groups: word embedding algorithms on the one hand, and the semantics of tagging data as well as their applications on the other.

Word Embeddings. The concept of word embeddings, i.e., word representations in low-dimensional vector spaces, dates back at least to 1990, when Deerwester presented LSA [9], which, by factorizing a term-document matrix, effectively produced a dimension reduction of the term vector space. In 2003, Bengio et al. presented their neural probabilistic language model [2]. The goal of this work was to overcome the curse of dimensionality and learn distributed word representations in a low-dimensional vector space. However, the widespread use of word embeddings only really took off in 2013, when Mikolov et al. presented a similar, yet scalable and fast approach to learn word embeddings [20]. Generally, such methods train a model to predict a word from a given context [2, 20]. Other embedding methods focus on factorizing a term-document matrix [9, 23]. In [1], Baroni et al. showed that all those methods generally exhibit a notably higher correlation with human intuition than the standard high-dimensional vector representations proposed in [26]. There also exist several graph embedding algorithms. The LINE algorithm [28] attempts to preserve the first- and second-order proximity of nodes in their corresponding embedding relations. Perozzi et al. [24] proposed "DeepWalk", an approach based on random walks on graphs and the subsequent embedding using Word2Vec.

Social Tagging Systems. In [12], Golder and Huberman noted that with increasing use, usage data from social tagging systems exhibited an emerging structure, which was later called a folksonomy [29]. Mika noted that these emerging structures, i.e., folksonomies, could even represent lightweight ontologies [18]. Using the folksonomy structure, it was possible to extract information about semantic relatedness between tags [8, 17]. The evaluation of this semantic relatedness information against human intuition showed that tagging data contain a considerable amount of semantic information, thus enabling further applications of tagging data. Applications of these emerging structures can be found in tag recommendation [15], ontology learning [3] and tag sense discovery algorithms [22].

===3 Technical Background===
In the following, we describe the technical background for this paper. First, we define the term folksonomy. Secondly, we introduce how to extract information about semantic relatedness from folksonomies.

Folksonomies. Folksonomies are the data structures emerging from social tagging systems. The term was coined by Vander Wal in 2005 as a portmanteau of "folks" and "taxonomy" [29]. In these systems, users collect resources and annotate them with freely chosen keywords, so-called tags. Examples are BibSonomy, Delicious, FlickR or last.fm. We follow the definition given by [14]: A folksonomy is a tuple <math>(U, T, R, Y)</math> of sets <math>U</math>, <math>T</math>, <math>R</math> and a tripartite relation <math>Y \subseteq U \times T \times R</math>. The sets <math>U</math>, <math>T</math> and <math>R</math> represent the sets of users, tags and resources, respectively, while <math>Y</math> represents the set of tag assignments. A post is the collection of tag assignments with the same user and same resource.

Extracting Semantic Relatedness from Folksonomies. After Golder and Huberman argued that the emerging structure of folksonomies contains considerable semantic information [12], Cattuto et al. proposed a way to extract this information [8]. They used a context-co-occurrence-based vector representation for the tags and experimented with different context choices, such as the tag-tag-context or the tag-resource-context, i.e., all assigned tags of a posted resource by either a specific user or all users. Both of these context choices were shown to estimate human-perceived semantic relatedness better than other context variants. In this work, we generally use the tag-tag-context. The resulting vector representations follow the definition given in [26] and are based on the co-occurrence counts of tags in their respective contexts. More concretely, a vector representation <math>v_i</math> of a tag <math>t_i \in V</math> in a given vocabulary is a <math>|V|</math>-dimensional vector, where <math>v_{ij} := \#\mathrm{cooc}_{\mathrm{post}}(i, j)</math>. To finally receive a notion of the degree of semantic relatedness between two tags <math>i</math> and <math>j</math>, one can compare the corresponding vectors <math>v_i</math> and <math>v_j</math> using the cosine measure <math>\mathrm{cossim}(v_i, v_j) := \frac{\langle v_i, v_j \rangle}{\lVert v_i \rVert \cdot \lVert v_j \rVert}</math> [17].
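To make this baseline concrete, the following is a minimal Python sketch of tag-tag-context counting and the cosine measure. It is not the authors' implementation; the toy posts and function names are illustrative only, and we assume each post is given as the list of tags a user assigned to one resource.

<syntaxhighlight lang="python">
from collections import Counter, defaultdict
from itertools import permutations
import math

def cooccurrence_vectors(posts):
    """Build sparse tag-tag-context vectors: vectors[t_i][t_j] counts how
    often t_i and t_j were assigned together in the same post."""
    vectors = defaultdict(Counter)
    for tags in posts:  # one post = all tags a user assigned to one resource
        for t_i, t_j in permutations(set(tags), 2):
            vectors[t_i][t_j] += 1
    return vectors

def cossim(v_i, v_j):
    """Cosine measure between two sparse count vectors."""
    dot = sum(count * v_j[tag] for tag, count in v_i.items())
    norm = (math.sqrt(sum(c * c for c in v_i.values()))
            * math.sqrt(sum(c * c for c in v_j.values())))
    return dot / norm if norm else 0.0

posts = [["python", "programming", "web"], ["python", "data", "web"]]
vectors = cooccurrence_vectors(posts)
print(cossim(vectors["programming"], vectors["data"]))
</syntaxhighlight>

The same co-occurrence counts also serve as input to GloVe and, interpreted as a weighted network, to LINE, as described in the next section.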
===4 Applicability of Embedding Algorithms on Tagging Data===
This section describes the different embedding algorithms that we explored. For each algorithm, we give a short summary, enumerate the parameters of the model and briefly discuss how it can be applied to tagging data.

Word2Vec. The most well-known embedding algorithm used in this work is the Word2Vec algorithm [20], which actually comprises two algorithms, SkipGram and CBOW (Continuous Bag of Words). In the course of this work, every time we refer to Word2Vec, we mean the CBOW algorithm, as recommended by [20] for bigger datasets. Word2Vec trains a shallow neural network on word sequences to predict a word from its neighbors in a given context window.

Parameterization. Word2Vec takes two parameters. The first parameter is the window size, which determines the number of neighboring words in a sequence considered as the context from which a word will be predicted. The second parameter is the number of negative samples per step, which is used to decrease the complexity of solving the proposed model by employing a noise-contrastive estimation approach.

Applicability. The Word2Vec algorithm normally processes sequential data. However, the sequence of tags normally does not hold any meaning, so this could pose a problem if the window size is chosen too small. In order to be able to apply Word2Vec to tagging data, we grouped the tag assignments into posts and fed the random succession of tags as sentences into the algorithm (see the sketch at the end of this section).

GloVe. GloVe is an unsupervised learning algorithm for obtaining vector representations for words [23]. Its main objective was to capture semantic relations such as king − man + woman ≈ queen. Training is performed on aggregated global word-word co-occurrence statistics from a corpus.

Parameterization. The main parameters of the GloVe algorithm are xmax and α. xmax denotes an influence cutoff for frequent tags, while α determines the importance of infrequent tags. According to [23], GloVe worked best for xmax = 100 and α = 0.75. We choose these as initial values in our experiments.

Applicability. Since GloVe depends on co-occurrence counts of words in a corpus, it is very easy to apply to tagging data. For this, we construct the tag-tag-context co-occurrence matrix and can then directly feed it into the algorithm.

LINE. The goal of the LINE embedding algorithm is to create graph embeddings in which the first- and second-order proximity of nodes is preserved [28]. The first-order proximity in a network is the local pairwise proximity of two nodes, i.e., the weight of the edge connecting them. The second-order proximity of two nodes in a network is the similarity between their first-order neighborhoods.

Parameterization. LINE takes two different parameters: the number of edge samples per step and the number of negative samples per edge. To decrease the complexity of solving the proposed model in [28], the authors employed a noise-contrastive estimation approach as proposed by [20], using negative sampling. Furthermore, to prevent high edge weights from making the gradient explode or vanish, LINE samples edges and then ignores their weights, instead of using the edge weights directly in its objective function.

Applicability. Similar to GloVe, this algorithm processes a network with weighted edges, such as a co-occurrence network. Thus, we only have to construct the co-occurrence network from the tagging data and apply LINE to that network.
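As a concrete illustration of the Word2Vec setup described above, the following sketch trains CBOW embeddings on posts using the gensim library (version 4 API). This is an assumed setup, not necessarily the implementation used in the paper; the toy posts are placeholders.

<syntaxhighlight lang="python">
from gensim.models import Word2Vec

# Each post (all tags a user assigned to one resource) becomes one
# "sentence"; the order of tags within a post carries no meaning.
posts = [
    ["python", "programming", "web"],
    ["python", "data", "machinelearning"],
    ["web", "design", "css"],
]

model = Word2Vec(
    sentences=posts,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    negative=5,       # negative samples per step
    sg=0,             # 0 selects CBOW, the variant used in this work
    min_count=1,      # keep all tags in this toy example
)

print(model.wv.most_similar("python", topn=3))
</syntaxhighlight>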
Common Parameters. While each of the mentioned algorithms can be tuned with a set of different parameters, they have some parameters in common. First, the embedding dimension determines the size of the produced vectors. A higher embedding dimension allows for more degrees of freedom in the expressiveness of the vectors, i.e., they can encode more information about word relations. Standard ranges for embedding dimensions are between 25 and 300. Secondly, the initial learning rate of an algorithm determines its convergence speed. Fine-tuning this parameter is crucial to obtain optimal results: if chosen badly, the learning process either converges very slowly or might be unable to converge at all.

===5 Datasets===
In this work, we use two different kinds of datasets to evaluate embedding algorithms on tagging data: the actual tagging datasets, which provide tagging metadata, and human intuition datasets (HIDs), which we employ to evaluate semantic relatedness. In the following, we first describe three datasets containing tagging data from which we later derive tag embeddings. Then we introduce all human intuition datasets, which contain human-assigned similarity scores for word pairs.

====5.1 Tagging Datasets to Derive Word Embeddings====
We study datasets of three public social tagging systems. In order to ensure a minimum level of commonly accepted meaning of all tags, each dataset is restricted to the top 10k tags. Additionally, we only considered tags from users who have tagged at least 5 resources and resources which have been used at least 10 times. We also removed all invalid tags, e.g., those containing whitespace or unreadable symbols. A sketch of this filtering is given at the end of this section.

BibSonomy. The social tagging system BibSonomy provides users with the possibility to collect bookmarks (links to websites) or references to scientific publications and annotate them with tags [4]. We use a freely available dump of BibSonomy (http://www.kde.cs.uni-kassel.de/bibsonomy/dumps/), covering all tagging data from 2006 till the end of 2015. After filtering, it contains 9,302 distinct tags, assigned by 3,270 users to 49,654 resources in 630,962 assignments.

Delicious. Like BibSonomy, Delicious is a social tagging system where users can share their bookmarks and annotate them with tags. We use a freely available dataset from 2011 [30] (http://www.zubiaga.org/datasets/socialbm0311/). Delicious has been one of the biggest adopters of the tagging paradigm and, due to its audience, contains tags about design and technical topics. After filtering, the Delicious dataset contains 10,000 tags, which were assigned by 1,685,506 users to 11,486,080 resources in 626,690,002 assignments.

CiteULike. We took a snapshot of the official CiteULike page from September 2016 (http://www.citeulike.org/faq/data.adp). Since CiteULike describes itself as a "free service for managing and discovering scholarly references", it contains tags mostly centered around research topics. After filtering, the CiteULike dataset contains 10,000 tags, which were assigned by 141,395 users to 4,548,376 resources in 15,988,259 assignments.
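The following is a minimal sketch of the described filtering, under the assumption that each dataset is available as a list of (user, tag, resource) triples. The exact filtering order, the validity check, and whether a "use" counts posts or single assignments are our assumptions, not taken from the paper.

<syntaxhighlight lang="python">
import re
from collections import Counter, defaultdict

def filter_assignments(assignments, top_k=10_000,
                       min_user_resources=5, min_resource_uses=10):
    """assignments: list of (user, tag, resource) triples."""
    # Drop invalid tags; a stand-in for the paper's validity check.
    valid = re.compile(r"^\S+$")
    assignments = [(u, t, r) for u, t, r in assignments if valid.match(t)]

    user_resources = defaultdict(set)
    resource_uses = Counter()
    tag_counts = Counter()
    for u, t, r in assignments:
        user_resources[u].add(r)
        resource_uses[r] += 1
        tag_counts[t] += 1

    top_tags = {t for t, _ in tag_counts.most_common(top_k)}
    return [
        (u, t, r)
        for u, t, r in assignments
        if t in top_tags
        and len(user_resources[u]) >= min_user_resources
        and resource_uses[r] >= min_resource_uses
    ]
</syntaxhighlight>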
====5.2 Human Intuition Datasets (HIDs)====
As a gold standard for semantic relatedness as perceived by humans, we use several datasets with human-generated relatedness scores for word pairs. In the following, we briefly describe each of the used datasets.

WS-353. The WordSimilarity-353 dataset (http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html) consists of 353 pairs of English words and names [10]. Each pair was assigned a relatedness value between 0.0 (no relation) and 10.0 (identical meaning) by 16 raters, denoting the assumed common-sense semantic relatedness between two words. The total rating per pair was calculated as the mean of the 16 raters' scores. This way, WS-353 provides a valuable evaluation base for comparing our concept relatedness scores to an established, human-generated and validated collection of word pairs.

MEN. The MEN Test Collection [6] (http://clic.cimec.unitn.it/~elia.bruni/MEN) contains 3,000 word pairs together with human-assigned similarity judgments, obtained by crowdsourcing on Amazon Mechanical Turk. Contrary to WS-353, the similarity judgments are relative rather than absolute: raters were shown two pairs of words at a time and were asked to choose which pair was more similar. The score of the chosen pair was then increased by one. Each pair was rated 50 times, which leads to a score between 0 and 50 for each pair.

Bib100. The Bib100 dataset (http://www.dmir.org/datasets/bib100) has been created in order to provide a more fitting vocabulary for the research- and computer-science-oriented tagging data that we investigate. It consists of 122 words in 100 pairs, which were judged 26 times each for semantic relatedness using scores from 0 (no similarity) to 10 (full similarity).

MTurk. In [25], Radinsky et al. created an evaluation dataset specifically for news texts (http://www.kiraradinsky.com/Datasets.html). We use this dataset as a topically remote evaluation baseline in order to get a notion of how well intrinsic semantic relations are captured by both the tagging data and the generated embeddings. The dataset consists of 287 word pairs over 499 distinct words. 23 humans judged relatedness on a scale from 1 (unrelated) to 5 (related).

===6 Experimental Setup and Results===
In the following, we describe the conducted experiments and present the results for each experiment. Due to space limitations, we only report results for MEN; all result figures are publicly available at http://www.dmir.org/tagembeddings.

====6.1 Preliminaries====
Evaluating Word Vector Representations. Very often, the quality of the semantic relatedness encoded in word vectors is assessed by how well it fits human intuition, as collected in the HIDs introduced in Section 5. The most widely used method to evaluate semantic relatedness on such datasets is to compare the human scores of the semantic relatedness between two words with the cosine similarity scores of the corresponding word vectors. The comparison is done by calculating the Spearman rank correlation coefficient ρ, which compares the two rankings of word pairs induced by the human relatedness scores and the cosine scores [1, 23].
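A minimal sketch of this evaluation, assuming the embeddings are given as a dict from tags to NumPy vectors and the HID as (word, word, score) triples; the names are illustrative, not the authors' code.

<syntaxhighlight lang="python">
import numpy as np
from scipy.stats import spearmanr

def evaluate(embeddings, hid_pairs):
    """Spearman rho between human scores and cosine similarities,
    restricted to pairs where both words are in the vocabulary."""
    human, cosine = [], []
    for w1, w2, score in hid_pairs:
        if w1 in embeddings and w2 in embeddings:
            v1, v2 = embeddings[w1], embeddings[w2]
            cosine.append(float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))))
            human.append(score)
    rho, _ = spearmanr(human, cosine)
    return rho, len(human)  # correlation and number of matched pairs
</syntaxhighlight>

Reporting the number of matched pairs alongside ρ mirrors the matched-pair counts given in Table 1 below.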
Baseline: Tag-Tag-Context Co-Occurrence Vectors. As a baseline, we produced high-dimensional co-occurrence counting vectors from all three tagging datasets. As described in Section 3, co-occurrence of tags was counted in a tag-tag-context, i.e., the context of a tag is given by the other tags annotated to a given resource by a certain user [8]. Since there is no option to vary the dimension of the tag-tag-context co-occurrence vectors except truncating the vocabulary, we only report the values for a truncated vocabulary of 10,000 tags in Table 1. Still, we give all of the reported results as baselines in the subsequent figures.

Table 1: Spearman correlation values for the co-occurrence baseline. For each evaluation dataset, we give the total number of pairs in the original dataset (header) and, per tagging dataset, the number of matched pairs, i.e., pairs where both words were present in the tagging corpus.

{| class="wikitable"
! !! WS-353 (353) !! MEN (3000) !! MTurk (287) !! Bib100 (100)
|-
| BibSonomy || 0.433 (158) || 0.415 (463) || 0.604 (62) || 0.623 (100)
|-
| Delicious || 0.486 (202) || 0.577 (1376) || 0.508 (103) || 0.632 (94)
|-
| CiteULike || 0.186 (139) || 0.423 (404) || 0.469 (53) || 0.270 (87)
|}

Parameter Settings. For each algorithm, we conducted the experiments as follows. As the initial parameter setting, we used the standard settings that come with the implementation of each algorithm; the corresponding values are given in Table 2. We then varied the initial learning rate for each algorithm in the range of 0.01 to 0.1 in steps of 0.01. After that, we varied the embedding dimension on the set {10, 30, 50, 80, 100, 120, 150, 200}. For Word2Vec and LINE, we then varied the number of negative samples on the set {2, 5, 8, 12, 15, 20}. For GloVe, we varied xmax ∈ {25, 50, …, 200} and α ∈ {0.5, 0.55, …, 1} simultaneously. Finally, for Word2Vec, we varied the context window size on {1, 3, 5, 8, 10, 13, 16, 20}, while for LINE, we varied the number of samples per step on {1, 10, 100, 1000, 10000} · 10^6. To rule out the influence of a random embedding initialization, each experiment was performed 10 times and the mean result is reported. After each experiment, we chose the best-performing parameter settings on the respective tagging datasets across the four evaluation datasets and used them for all other experiments.

Table 2: Initial parameter values for each algorithm.

{| class="wikitable"
! Algorithm !! initial learning rate !! dimension !! samples per step !! negative samples !! (xmax, α) !! window size
|-
| LINE || 0.025 || 100 || 100 · 10^6 || 5 || - || -
|-
| GloVe || 0.05 || 100 || - || - || (100, 0.75) || -
|-
| Word2Vec || 0.025 || 100 || - || 5 || - || 5
|}
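To make the procedure concrete, here is a sketch of the one-parameter-at-a-time sweep with 10 repetitions. Both train_embeddings (a stand-in for any of the three algorithms) and evaluate (the Spearman routine sketched above) are hypothetical names; the grid values are taken from the description above, restricted to the Word2Vec-style parameters.

<syntaxhighlight lang="python">
import numpy as np

def sweep(train_embeddings, evaluate, corpus, hid_pairs):
    """Vary one parameter at a time around the initial settings,
    averaging Spearman rho over 10 random initializations."""
    defaults = {"lr": 0.025, "dim": 100, "negative": 5, "window": 5}
    grid = {
        "lr": [round(0.01 * i, 2) for i in range(1, 11)],
        "dim": [10, 30, 50, 80, 100, 120, 150, 200],
        "negative": [2, 5, 8, 12, 15, 20],
        "window": [1, 3, 5, 8, 10, 13, 16, 20],
    }
    results = {}
    for param, values in grid.items():
        for value in values:
            settings = {**defaults, param: value}
            scores = [
                evaluate(train_embeddings(corpus, seed=seed, **settings), hid_pairs)[0]
                for seed in range(10)
            ]
            results[(param, value)] = float(np.mean(scores))
    return results
</syntaxhighlight>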
====6.2 Embedding Evaluation Results====
We now present the evaluation results. For each algorithm, Table 3 gives the parameter settings which produced the highest-scoring embeddings. For each figure, we report both the evaluation results of the embeddings for a given parameter and the corresponding baselines produced by the high-dimensional vector representations given in Table 1.

Table 3: Best parameter values for each algorithm for the MEN dataset on Delicious data.

{| class="wikitable"
! Algorithm !! initial learning rate !! dimension !! samples per step !! negative samples !! (xmax, α) !! window size
|-
| LINE || 0.1 || 100 || 100 · 10^6 || 15 || - || -
|-
| GloVe || 0.1 || 120 || - || - || (100, 0.75) || -
|-
| Word2Vec || 0.1 || 100 || - || 20 || - || 5
|}

Word2Vec. Although Word2Vec is meant to be applied to sequential data, as opposed to the bag-of-words nature of tag assignments, the generated embeddings yielded better correlation scores with human intuition than their high-dimensional counterparts. However, we did not shuffle the tag sequence in posts, which is left to future work. Figure 1a shows that fine-tuning the initial learning rate has a great effect on the quality of word embeddings from BibSonomy, with peak performance at a learning rate of 0.1, while Delicious data seem unaffected. Increasing the embedding dimension only improves the embeddings' semantic content up to a certain point, which is mostly reached at a rather low dimension between 30 and 50 (Figure 1b); anything above does not notably increase the performance of the embeddings. The number of negative samples seems sufficiently high at 10 samples, and even earlier for Delicious and CiteULike (Figure 1c). The context size had negligible impact on the semantic content of the generated embeddings (Figure 1d).

[Figure 1 (panels a-d: initial learning rate, dimension, negative samples, window; y-axis: Spearman correlation on MEN): Evaluation results for embeddings generated by Word2Vec on MEN. For CiteULike and BibSonomy, tuning the learning rate notably improves results. The embedding dimension seems best for all tagging datasets at 100, before the embedding quality of BibSonomy decreases again. As reported by [20], roughly 10 negative samples fit small datasets such as BibSonomy, while even fewer samples suffice for bigger datasets.]

GloVe. GloVe generates embeddings from co-occurrence data. As mentioned in Section 4, GloVe is parameterized by the learning rate, the dimension of the generated embeddings, and the weighting parameters xmax and α, which regulate the importance of low-frequency co-occurrences in the training process. While the learning rate does not show a great effect on embeddings generated from Delicious data, fine-tuning notably influences the semantic content of CiteULike and BibSonomy embeddings (Figure 3a). Mostly, peak performance is reached at an embedding dimension of 100 or even earlier, except for Delicious (Figure 3b). Furthermore, BibSonomy is quite sensitive to poor choices of xmax and α: if these are chosen too high, performance suffers greatly (Figure 3c). Delicious and CiteULike seem unaffected by those parameters, at least in our experimental ranges (Figures 3d and 3e).

[Figure 3 (panels a-b: initial learning rate and dimension; panels c-e: Spearman correlation over (α, xmax) for BibSonomy, CiteULike and Delicious): Evaluation results for embeddings generated by GloVe on the MEN dataset. The initial learning rate only influences the smaller tagging datasets, while Delicious profits most from an increasing dimension. BibSonomy is influenced the most by a high cutoff xmax.]

LINE. LINE generates vertex embeddings from graph data, preserving the first- and second-order proximity between vertices. Its parameters are the initial learning rate, the embedding dimension, the number of negative samples per edge and the number of samples per training step. While the influence of the initial learning rate is visible, it is not as great as with GloVe (cf. Figure 2a). Also, the embedding dimension gives similar results above 50 and only lets performance suffer if chosen too small (cf. Figure 2b). Interestingly, Figure 2c shows that the number of negative samples has almost no effect on the generated embeddings across all tagging datasets. In contrast, the choice of the number of samples per step exerts great influence on the resulting embeddings, as can be seen in Figure 2d.

[Figure 2 (panels a-d: initial learning rate, dimension, negative samples, samples per step in multiples of 10^6; y-axis: Spearman correlation on MEN): Evaluation results for embeddings generated by LINE on the MEN dataset. While the initial learning rate, the embedding dimension and the number of samples per step exert a notable influence on the evaluation result, increasing the number of negative samples per edge only slightly improves results.]

===7 Discussion===
Across all algorithms, fine-tuning the initial learning rate greatly improves results for embeddings based on BibSonomy, especially with GloVe. The effect of the embedding dimension is much less pronounced across all three embedding algorithms: peak evaluation performance is often reached with an embedding dimension between 50 and 100 and stays quite stable with increasing dimension. Varying the number of negative samples influences the evaluation results on BibSonomy, but only at a very high number of 20 negative samples; in contrast, Delicious and CiteULike show only small performance changes already with 3 to 5 samples. Finally, GloVe's weighting factors xmax and α negatively influence results on BibSonomy, while barely affecting evaluation performance on Delicious and CiteULike; this is because BibSonomy is our smallest tagging dataset, with hardly any co-occurrence counts above a high xmax.

Generally, all investigated embedding algorithms produce high-quality embeddings from tagging data. Although [8] found that tagging data contain high-quality semantic information, the high-dimensional vector representation proposed there does not seem to optimally capture this information when evaluated against human judgment (see Table 1). In contrast, the generated embeddings seem better suited to capture that information, as they mostly outperform the tag-tag-context-based co-occurrence count vectors (cf. Figure 4). Furthermore, the best result achievable on WS-353 in this work, around 0.7, comes from Delicious data using the GloVe algorithm (cf. Figure 4b). This is on par with other well-known works, such as ESA [11], which is based on Wikipedia text and achieves a correlation of around 0.748, or the work by Singer et al. on Wikipedia navigation [27], with a highest correlation of 0.76 but generally achieving scores around 0.71.

[Figure 4 (panels a-c: BibSonomy, Delicious, CiteULike; x-axis: WS-353, MEN, MTurk, Bib100; y-axis: Spearman correlation): Evaluation results for embeddings generated by the best parameter settings across the different tagging datasets. GloVe mostly produces the best embeddings and is the only algorithm to always outperform the baseline.]

===8 Conclusion===
In this work, we explored embedding methods and their applicability to tagging data. We conducted parameter studies for three well-known embedding algorithms in order to achieve the best possible embeddings from tagging data with regard to their fit to human intuition of semantic relatedness.
Our results indicate that i) tagging data provide a viable source for generating high-quality semantic embeddings, even on par with current state-of-the-art methods, and ii) in order to achieve competitive results, it is necessary to choose suitable parameters for each algorithm instead of relying on the standard parameters. Overall, we bridged the gap between the considerable semantic content of tagging data and the current state-of-the-art methods for producing high-quality, low-dimensional word embeddings. We expect our results to be of special interest for folksonomy engineers and others working with the semantics of tagging data. Future work includes the investigation of the influence of different vector representations on tagging-based real-world applications, such as tag recommendation in social tagging systems, tag sense discovery and ontology learning algorithms. Furthermore, we want to improve the fit of tagging embeddings to human intuition by applying metric learning approaches or alignment approaches to external knowledge bases, e.g., WordNet or DBpedia.

===References===
[1] Marco Baroni, Georgiana Dinu, and Germán Kruszewski. "Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors." In: ACL (2014).
[2] Yoshua Bengio et al. "A neural probabilistic language model." In: JMLR (2003).
[3] Dominik Benz et al. "Semantics made by you and me: Self-emerging ontologies can capture the diversity of shared knowledge." In: WebSci. 2010.
[4] Dominik Benz et al. "The Social Bookmark and Publication Management System BibSonomy." In: VLDB (2010).
[5] Kalina Bontcheva and Dominic Rout. "Making sense of social media streams through semantics: a survey." In: Semantic Web 5.5 (2014).
[6] Elia Bruni, Nam-Khanh Tran, and Marco Baroni. "Multimodal Distributional Semantics." In: JAIR (2014).
[7] John A. Bullinaria and Joseph P. Levy. "Extracting semantic representations from word co-occurrence statistics: A computational study." In: BRM 39.3 (2007).
[8] Ciro Cattuto et al. "Semantic Grounding of Tag Relatedness in Social Bookmarking Systems." In: ISWC. 2008.
[9] Scott Deerwester et al. "Indexing by latent semantic analysis." In: Journal of the American Society for Information Science 41.6 (1990).
[10] Lev Finkelstein et al. "Placing Search in Context: The Concept Revisited." In: WWW. 2001.
[11] Evgeniy Gabrilovich and Shaul Markovitch. "Computing semantic relatedness using Wikipedia-based explicit semantic analysis." In: IJCAI. 2007.
[12] Scott Golder and Bernardo A. Huberman. "The Structure of Collaborative Tagging Systems." (Aug. 2005). arXiv: cs.DL/0508082.
[13] Mihajlo Grbovic et al. "Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising." In: SIGIR. 2016.
[14] Andreas Hotho et al. "Information Retrieval in Folksonomies: Search and Ranking." In: ESWC. Ed. by York Sure and John Domingue. 2006.
[15] Robert Jäschke et al. "Tag Recommendations in Folksonomies." In: PKDD. Ed. by Joost N. Kok et al. 2007.
[16] Omer Levy and Yoav Goldberg. "Linguistic Regularities in Sparse and Explicit Word Representations." In: CoNLL. 2014.
[17] Benjamin Markines et al. "Evaluating Similarity Measures for Emergent Semantics of Social Tagging." In: WWW. 2009.
[18] Peter Mika. "Ontologies Are Us: A Unified Model of Social Networks and Semantics." In: Web Semant. 5.1 (Mar. 2007).
[19] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. "Linguistic Regularities in Continuous Space Word Representations." In: HLT-NAACL. 2013.
[20] Tomas Mikolov et al. "Distributed Representations of Words and Phrases and their Compositionality." In: NIPS. 2013.
[21] Thomas Niebler et al. "Extracting Semantics from Unconstrained Navigation on Wikipedia." In: KI – Künstliche Intelligenz 30.2 (2016).
[22] Thomas Niebler et al. "How Tagging Pragmatics Influence Tag Sense Discovery in Social Annotation Systems." In: ECIR. 2013.
[23] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. "GloVe: Global Vectors for Word Representation." In: EMNLP. Vol. 14. 2014.
[24] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. "DeepWalk: Online Learning of Social Representations." In: KDD. Ed. by Sofus A. Macskassy et al. 2014.
[25] Kira Radinsky et al. "A Word at a Time: Computing Word Relatedness Using Temporal Semantic Analysis." In: WWW. 2011.
[26] Hinrich Schütze and Jan O. Pedersen. "A cooccurrence-based thesaurus and two applications to information retrieval." In: IPM 33.3 (1997).
[27] Philipp Singer et al. "Computing Semantic Relatedness from Human Navigational Paths: A Case Study on Wikipedia." In: IJSWIS 9.4 (2013).
[28] Jian Tang et al. "LINE: Large-scale Information Network Embedding." In: WWW. 2015.
[29] Thomas Vander Wal. Folksonomy Definition and Wikipedia. Nov. 2005.
[30] Arkaitz Zubiaga et al. "Harnessing Folksonomies to Produce a Social Classification of Resources." In: IEEE Trans. on Knowl. and Data Eng. 25.8 (Aug. 2013).