<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Learning Word Embeddings from Tagging Data: A methodological comparison</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Niebler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luzian Hahn</string-name>
          <email>luzian.hahn@stud-mail.uni-wuerzburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Hotho</string-name>
          <email>hothog@informatik.uni-wuerzburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Data Mining and Information Retrieval Group, University of Würzburg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>L3S Research Center Hanover</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
<p>The semantics hidden in natural language are an essential building block for a common language understanding needed in areas like NLP or the Semantic Web. Such information is hidden for example in lightweight knowledge representations such as tagging systems and folksonomies. While extracting relatedness from tagging systems shows promising results, the extracted information is often encoded in high-dimensional vector representations, which makes relatedness learning or word sense discovery computationally infeasible. In the last few years, methods producing low-dimensional vector representations, so-called word embeddings, have been shown to yield extraordinary structural and semantic features and have been used in many settings. Up to this point, there has been no in-depth exploration of the applicability of word embedding algorithms on tagging data. In this work, we explore different embedding algorithms with regard to their applicability on tagging data and the semantic quality of the produced word embeddings. For this, we use data from three different tagging systems and evaluate the vector representations on several human intuition datasets. To the best of our knowledge, we are the first to generate embeddings from tagging data. Our results encourage the use of word embeddings based on tagging data, as they capture semantic relations between tags better than high-dimensional representations and make learning with tag representations feasible.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Automatically assessing the degree of semantic relatedness between words, i.e., the
relatedness of their actual meanings, in such a way that it fits human intuition
is an important task with a variety of applications, such as ontology learning for
the Semantic Web [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], tag recommendation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] or semantic search [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Semantic
relatedness information of words has been extracted from a variety of sources like
plain text [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], website navigation [
        <xref ref-type="bibr" rid="ref21 ref27">21, 27</xref>
        ] or social metadata [
        <xref ref-type="bibr" rid="ref17 ref5 ref8">5, 8, 17</xref>
        ]. Among
others, tagging data from social tagging systems like BibSonomy3 or Delicious4 are
useful to extract high-quality semantic relatedness information, e.g., for ontology
learning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Traditionally, assessing the degree of semantic relatedness between tags utilizes
sparse, high-dimensional vector representations of those tags, which are constructed
from tag contexts based on posts in social tagging systems [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The semantic
relatedness can then be estimated using the cosine measure of the corresponding tag
vectors [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Finally, evaluating the quality of the estimated scores is usually
performed by directly correlating them to human intuition [
        <xref ref-type="bibr" rid="ref11 ref25 ref6">6, 11, 25</xref>
        ]. In recent years,
many techniques have been proposed to represent words by dense, low-dimensional
vectors [
        <xref ref-type="bibr" rid="ref20 ref23 ref28">20, 23, 28</xref>
        ]. These so-called word embeddings have been shown to yield
extraordinary structural features [
        <xref ref-type="bibr" rid="ref16 ref19">16, 19</xref>
        ] and are applied in machine translation or text
classification. Furthermore, word embeddings often outperform high-dimensional
representations in tasks such as measuring semantic relatedness [
        <xref ref-type="bibr" rid="ref1 ref16">1, 16</xref>
        ].
Problem Setting. Traditionally, tags are represented by sparse, high-dimensional
vectors [
        <xref ref-type="bibr" rid="ref26 ref8">8, 26</xref>
        ]. However, although Cattuto et al. have shown that tagging data
contain meaningful semantics [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the correlation of semantic relatedness scores from
those vectors with human intuition still leaves room for improvement. Furthermore,
the high dimensionality of those representations renders many algorithms using
them computationally expensive.5 Up to this point, there have been no extensive
attempts to generate word embeddings from social tagging data. All prior studies
rely on high-dimensional tagging vectors or reduce the vector space arbitrarily by
cutting the dimensionality of the space by a fixed number, which in turn decreases
the fit of the resulting relatedness scores to human intuition.
      </p>
      <p>Contribution. We contribute a thorough exploration of the applicability and
optimization of three well-known embedding algorithms on tagging data. We first
analyze the parameters of each algorithm, before we optimize these settings to
produce the best possible word embeddings from tagging data. Then, we compare the
embeddings of each algorithm with each other as well as with traditional sparse
representations by evaluating them against human intuition. We show that all produced
embeddings outperform high-dimensional vector representations. We discuss the
results in the light of other semantic relatedness approaches and show that we reach
competitive results, on par with recent work on extracting semantic relatedness,
which opens up another high-quality source of information for semantic relatedness.
Structure of this work. We first cover the related work in Section 2 and the
essential theoretical background of this work in Section 3. Afterwards, we investigate
three well-known algorithms with respect to their applicability on tagging data
in Section 4. In Section 5, we describe the datasets we used in our experiments.
Section 6 outlines our experiments, where we compare all generated vector
representations with regard to their semantic content. Section 7 discusses our results,
and Section 8 concludes this work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The related work for this paper falls roughly into two groups: word embedding
algorithms, and the semantics of tagging data together with their applications.
Word Embeddings. The concept of word embeddings, i.e., word representations
in low dimensional vector spaces dates back at least to 1990, when Deerwester
presented LSA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which by factorizing a term-document matrix effectively produced
a dimension reduction of the term vector space. In 2003, Bengio et al. presented
their neural probabilistic language model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The goal of this work was to overcome
the curse of dimensionality and learn distributed word representations in a
low-dimensional vector space.
5 This is also sometimes referred to as the curse of dimensionality.
However, the wide-spread use of word embeddings only really took off in 2013,
when Mikolov et al. presented a similar, yet scalable and fast
approach to learn word embeddings [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Generally, such methods train a model to
predict a word from a given context [
        <xref ref-type="bibr" rid="ref2 ref20">2, 20</xref>
        ]. Other embedding methods focus on
factorizing a term-document matrix [
        <xref ref-type="bibr" rid="ref23 ref9">9, 23</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Baroni et al. showed that all
those methods generally exhibit a notably higher correlation with human intuition
than the standard high-dimensional vector representations proposed in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. There
also exist several graph embedding algorithms. The LINE algorithm [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] attempts
to preserve the first- and second-order proximity of nodes in their corresponding
embeddings. Perozzi et al. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] proposed “DeepWalk”, an approach based
on random walks on graphs and the subsequent embedding using Word2Vec.
Social Tagging Systems. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Golder and Huberman noted that with
increasing use, usage data from social tagging systems exhibited an emerging structure,
which was later called a folksonomy [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. Mika noted that these emerging structures,
i.e., folksonomies, could even represent light-weight ontologies [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Using the
folksonomy structure, it was possible to extract information about semantic relatedness
between tags [
        <xref ref-type="bibr" rid="ref17 ref8">8, 17</xref>
        ]. The evaluation of this semantic relatedness information on
human intuition showed that tagging data contain a considerable amount of
semantic information, thus enabling further applications of tagging data. Applications
of these emerging structures can be found in tag recommendation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], ontology
learning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and tag sense discovery algorithms [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Technical Background</title>
      <p>In the following, we will describe the technical background for this paper. First,
we define the term folksonomy. Secondly, we introduce how to extract information
about semantic relatedness from folksonomies.</p>
      <p>
        Folksonomies. Folksonomies are the data structures emerging from social tagging
systems. The term was coined by Van der Wal in 2005 as a portmanteau of
“folks” and “taxonomy” [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. In these systems, users collect resources and
annotate them with freely chosen keywords, so-called tags. Examples are BibSonomy,
Delicious, Flickr or last.fm. We follow the definition given by [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]:
      </p>
      <p>A folksonomy is a tuple (U, T, R, Y) of sets U, T, R and a tripartite relation
Y ⊆ U × T × R. The sets U, T and R represent the sets of users, tags and resources,
respectively, while Y represents the set of tag assignments. A post is the collection
of tag assignments with the same user and same resource.</p>
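      <p>
        The definition above can be illustrated with a minimal, self-contained sketch; all users, tags and resources below are hypothetical examples:

```python
from collections import defaultdict

# A folksonomy's tag assignments Y ⊆ U × T × R,
# here as a list of (user, tag, resource) triples.
Y = [
    ("alice", "python", "r1"),
    ("alice", "programming", "r1"),
    ("bob", "python", "r1"),
    ("bob", "semantics", "r2"),
]

# A post collects all tags one user assigned to one resource.
posts = defaultdict(set)
for user, tag, resource in Y:
    posts[(user, resource)].add(tag)

print(sorted(posts[("alice", "r1")]))  # ['programming', 'python']
```
      </p>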
      <p>
        Extracting Semantic Relatedness from Folksonomies. After Golder and
Huberman argued that the emerging structure of folksonomies contains considerable
semantic information [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Cattuto et al. proposed a way to extract this
information [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. They used a context-co-occurrence based vector representation for the tags
and experimented with different context choices, such as tag-tag-context or
tag-resource-context, i.e., all tags assigned to a posted resource by either a specific user
or all users. Both of these context choices were shown to estimate human-perceived
semantic relatedness better than other context variants. In this work, we generally
use the tag-tag-context. The resulting vector representations follow the definition
given in [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] and are based on the co-occurrence counts of tags in their respective
contexts. More concretely, a vector representation v_i of a tag t_i ∈ V in a given
vocabulary is a |V|-dimensional vector, where v_ij := #coocc_post(i, j). To
finally obtain a notion of the degree of semantic relatedness between two tags i and
j, one can compare the corresponding vectors v_i and v_j using the cosine measure
cossim(v_i, v_j) := ⟨v_i, v_j⟩ / (‖v_i‖ · ‖v_j‖) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
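      <p>
        A minimal sketch of this construction, using hypothetical posts, might look as follows:

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical posts: each is the set of tags in one (user, resource) post.
posts = [
    {"python", "programming", "code"},
    {"python", "code"},
    {"semantics", "web", "ontology"},
]

# Tag-tag-context: v_ij counts how often tags i and j co-occur in a post.
cooc = Counter()
for post in posts:
    for a, b in combinations(sorted(post), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

vocab = sorted({t for post in posts for t in post})

def vector(tag):
    # |V|-dimensional co-occurrence count vector for one tag.
    return [cooc[(tag, other)] for other in vocab]

def cossim(v, w):
    dot = sum(x * y for x, y in zip(v, w))
    norm = math.sqrt(sum(x * x for x in v)) * math.sqrt(sum(y * y for y in w))
    return dot / norm if norm else 0.0

# "python" and "code" share contexts; "python" and "ontology" do not.
print(cossim(vector("python"), vector("code")))      # roughly 0.2
print(cossim(vector("python"), vector("ontology")))  # 0.0
```
      </p>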
    </sec>
    <sec id="sec-4">
      <title>Applicability of Embedding Algorithms on Tagging Data</title>
      <p>This section describes the different embedding algorithms that we explored. For
each algorithm, we give a short summary, enumerate the parameters for each model
and shortly discuss how it can be applied to tagging data.</p>
      <p>
        Word2Vec The most well-known embedding algorithm used in this work is the
Word2Vec algorithm [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], which actually comprises two algorithms, Skip-gram
and CBOW (Continuous Bag of Words).6 Word2Vec trains a shallow neural network
on word sequences to predict a word from its neighbors in a given context window.
Parameterization. Word2Vec takes two parameters. The first parameter is the
window size, which determines the number of neighboring words in a sequence
considered as context from which a word will be predicted. The second parameter
is the number of negative samples per step, which decreases the complexity of
solving the proposed model by employing a noise contrastive estimation approach.
Applicability. The Word2Vec algorithm normally processes sequential data.
However, the sequence of tags normally does not hold any meaning, so this could possibly
pose a problem if the window size is chosen too small. In order to be able to apply
Word2Vec on tagging data, we grouped the tag assignments into posts and fed the
random succession of tags as sentences into the algorithm.
      </p>
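      <p>
        This preparation step could be sketched as follows; the posts and the helper name are hypothetical, and the actual training would then be delegated to a Word2Vec implementation:

```python
import random

# Hypothetical posts: bags of tags without meaningful order.
posts = [
    {"python", "programming", "code"},
    {"semantics", "web", "ontology"},
]

def posts_to_sentences(posts, seed=0):
    """Turn each post into a pseudo-sentence by shuffling its tags,
    so that a sequence model such as Word2Vec can consume them."""
    rng = random.Random(seed)
    sentences = []
    for post in posts:
        tags = sorted(post)
        rng.shuffle(tags)
        sentences.append(tags)
    return sentences

sentences = posts_to_sentences(posts)
# These token lists could now be passed to a Word2Vec implementation
# (e.g. gensim) as its training sentences.
```
      </p>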
      <p>
        GloVe GloVe is an unsupervised learning algorithm for obtaining vector
representations for words [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Its main objective was to capture semantic relations such
as king − man + woman ≈ queen. Training is performed on aggregated global
word-word co-occurrence statistics from a corpus.
      </p>
      <p>
        Parameterization. The main parameters of the GloVe algorithm are x_max and α.
x_max denotes an influence cutoff for frequent tags, while α determines the importance
of infrequent tags. According to [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], GloVe worked best for x_max = 100 and α =
0.75. We choose these as initial values in our experiments.
      </p>
      <p>
        Applicability. Since GloVe depends on co-occurrence counts of words in a corpus,
it is very easy to apply on tagging data. For this, we construct the tag-tag-context
co-occurrence matrix and can then directly feed it into the algorithm.
LINE The goal of the LINE embedding algorithm is to create graph embeddings
where the first- and second-order proximity of nodes are preserved [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. The
first-order proximity in a network is the local pairwise proximity of two nodes, i.e., the
weight of an edge connecting these two nodes. The second-order proximity of two
nodes in a network is the similarity between their first-order neighborhoods.
6 In the course of this work, every time we refer to Word2Vec, we talk about the CBOW
algorithm, as is recommended by [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] for bigger datasets.
Parameterization. LINE takes two different parameters: the number of edge
samples per step and the number of negative samples per edge. To decrease the complexity
of solving the proposed model in [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], the authors employed a noise contrastive
estimation approach as proposed by [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] using negative sampling. Furthermore, to
avoid high edge weights letting the gradient explode or vanish, LINE samples edges
proportionally to their weights and then treats the sampled edges as unweighted,
instead of using the edge weights directly in its objective function.
Applicability. Similar to GloVe, this algorithm processes a network with weighted
edges, such as a co-occurrence network. Thus, we only have to construct the
co-occurrence network from the tagging data and apply LINE to that network.
Common Parameters While each of the mentioned algorithms can be tuned
with a set of different parameters, they have some parameters in common. First,
the embedding dimension determines the size of the produced vectors. A higher
embedding dimension allows for more degrees of freedom in the expressiveness of
the vector, i.e., it can encode more information about word relations. Standard
ranges for embedding dimensions are between 25 and 300. Secondly, the initial
learning rate of an algorithm determines its convergence speed. Fine-tuning that
parameter is crucial to receive optimal results, because if chosen badly, the learning
process either converges very slowly or might be unable to converge at all.
      </p>
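      <p>
        The role of x_max and α can be illustrated by GloVe's weighting function f(x) from [23], sketched here in a minimal form:

```python
def glove_weight(x, x_max=100, alpha=0.75):
    """GloVe's weighting function f(x): caps the influence of very
    frequent co-occurrences and damps infrequent ones."""
    return 1.0 if x >= x_max else (x / x_max) ** alpha

# Co-occurrence counts at or above x_max are capped at weight 1.0 ...
assert glove_weight(500) == 1.0
# ... while rare co-occurrences contribute less than proportionally:
print(glove_weight(10))  # (10/100)**0.75, roughly 0.178
```
      </p>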
    </sec>
    <sec id="sec-5">
      <title>Datasets</title>
      <p>In this work we use two different kinds of datasets to evaluate embedding algorithms
on tagging data: the actual tagging datasets, which provide tagging
metadata, and human intuition datasets (HIDs), which we employ to evaluate semantic
relatedness. In the following we first describe three datasets containing tagging data
from which we later derive tag embeddings. Then we introduce all human intuition
datasets containing human-assigned similarity scores for word pairs.</p>
      <sec id="sec-5-1">
        <title>Tagging Datasets to Derive Word Embeddings</title>
        <p>
          We study datasets of three public social tagging systems. In order to ensure a
minimum level of commonly accepted meaning of all tags, each dataset is restricted
to the top 10k tags. Additionally, we only considered tags from users who have
tagged at least 5 resources and resources which have been used at least 10 times.
We also removed all invalid tags, e.g., containing whitespaces or unreadable symbols.
BibSonomy. The social tagging system BibSonomy provides users with the
possibility to collect bookmarks (links to websites) or references to scientific publications
and annotate them with tags [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. We use a freely available dump of BibSonomy,
covering all tagging data from 2006 till the end of 2015.7 After filtering, it contains 9,302
distinct tags, assigned by 3,270 users to 49,654 resources in 630,962 assignments.
7 http://www.kde.cs.uni-kassel.de/bibsonomy/dumps/
Delicious. Like BibSonomy, Delicious is a social tagging system, where users can
share their bookmarks and annotate them with tags. We use a freely available
dataset from 2011 [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].8 Delicious has been one of the biggest adopters of the tagging
paradigm and due to its audience, contains tags about design and technical topics.
After filtering, the Delicious dataset contains 10,000 tags, which were assigned by
1,685,506 users to 11,486,080 resources in 626,690,002 assignments.
CiteULike. We took a snapshot of the official CiteULike page from September
2016.9 Since CiteULike describes itself as a “free service for managing and
discovering scholarly references”, it contains tags mostly centered around research topics.
After filtering, the CiteULike dataset contains 10,000 tags, which were assigned by
141,395 users to 4,548,376 resources in 15,988,259 assignments.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Human Intuition Datasets (HIDs)</title>
        <p>As a gold standard for semantic relatedness as it is perceived by humans, we use
several datasets with human-generated relatedness scores for word pairs. In the
following, we briefly describe each of the datasets used.</p>
        <p>
          WS-353. The WordSimilarity-353 dataset10 consists of 353 pairs of English words
and names [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Each pair was assigned a relatedness value between 0.0 (no
relation) and 10.0 (identical meaning) by 16 raters, denoting the assumed common
sense semantic relatedness between two words. Finally, the total rating per pair
was calculated as the mean of each of the 16 users' ratings. This way, WS-353
provides a valuable evaluation base for comparing our concept relatedness scores to an
established human generated and validated collection of word pairs.
MEN. The MEN Test Collection [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] contains 3,000 word pairs together with
human-assigned similarity judgments, obtained by crowdsourcing using Amazon Mechanical
Turk11. Contrary to WS-353, the similarity judgments are relative rather than
absolute. Raters were shown two pairs of words at a time and asked to choose the
pair that was more similar; the score of the chosen pair
was then increased by one. Each pair was rated 50 times,
which leads to a score between 0 and 50 for each pair.
        </p>
        <p>
          Bib100. The Bib100 dataset has been created in order to provide a more fitting
vocabulary for the research- and computer-science-oriented tagging data that we
investigate.12 It consists of 122 words in 100 pairs, which were judged 26 times each
for semantic relatedness using scores from 0 (no similarity) to 10 (full similarity).
MTurk. In [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], Radinsky et al. created an evaluation dataset specifically for news
texts.13 We use this dataset as a topically remote evaluation baseline in order to get
a notion of how intrinsic semantic relations are captured by both the tagging data and
the generated embeddings. The dataset at hand consists of 287 word pairs and 499
words. 23 humans judged relatedness on a scale from 1 (unrelated) to 5 (related).
8 http://www.zubiaga.org/datasets/socialbm0311/
9 http://www.citeulike.org/faq/data.adp
10 http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/wordsim353.html
11 http://clic.cimec.unitn.it/~elia.bruni/MEN
12 http://www.dmir.org/datasets/bib100
13 http://www.kiraradinsky.com/Datasets.html
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Experimental Setup and Results</title>
      <p>In the following, we describe the conducted experiments and present the results for
each experiment. Due to space limitations, we only report results for MEN.14</p>
      <sec id="sec-6-1">
        <title>Preliminaries</title>
      </sec>
      <sec id="sec-6-2">
        <title>Evaluating Word Vector Representations</title>
        <p>
          Very often, the quality of semantic
relatedness encoded in word vectors is assessed by how well it fits human intuition.
Human intuition is collected in HIDs as introduced in Section 5. The most
widely used method to evaluate semantic relatedness on such datasets is to compare human
scores of the semantic relatedness between two words with the cosine similarity
scores of the corresponding word vectors. The comparison is done by calculating
the Spearman rank correlation coefficient ρ, which compares two ranked lists of
word pairs induced by the human relatedness scores and the cosine scores [
          <xref ref-type="bibr" rid="ref1 ref23">1, 23</xref>
          ].
Baseline: Tag-Tag-Context Co-Occurrence Vectors. As a baseline, we
produced high dimensional co-occurrence counting vectors from all three tagging
datasets. As described in Section 3, co-occurrence of tags was counted in a
tag-tag-context, i.e., the context of a tag was given as the other tags annotated to a given
resource by a certain user [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Since there is no option to vary the dimension of the
tag-tag-context co-occurrence vectors except truncating the vocabulary, we only
report the values for a truncated vocabulary of 10,000 tags in Table 1. Still, we give
all of the reported results as baselines in the subsequent figures.
        </p>
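        <p>
          The evaluation procedure, Spearman's ρ between human relatedness scores and cosine scores, can be sketched as follows; the word-pair scores below are hypothetical:

```python
from itertools import groupby

def rank(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    for _, tied in groupby(order, key=lambda i: values[i]):
        tied = list(tied)
        avg = pos + (len(tied) + 1) / 2
        for i in tied:
            ranks[i] = avg
        pos += len(tied)
    return ranks

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the rank lists."""
    ra, rb = rank(a), rank(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

# Hypothetical human scores and cosine scores for five word pairs;
# the two rankings differ by a single swapped pair.
human = [9.1, 7.4, 5.0, 3.2, 1.0]
cosine = [0.83, 0.70, 0.44, 0.45, 0.02]
print(spearman(human, cosine))  # roughly 0.9
```
        </p>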
        <p>Parameter Settings. For each of the following algorithms, we conducted the
experiments as follows: As initial parameter setting, we used the standard settings that
come with the implementation of each algorithm. The corresponding values are given
in Table 2. We then varied the initial learning rate for each algorithm in the range
of 0.01 to 0.1 in steps of 0.01. After that, we varied the embedding dimension over
the set {10, 30, 50, 80, 100, 120, 150, 200}. For Word2Vec and LINE, we then
varied the number of negative samples over the set {2, 5, 8, 12, 15, 20}. For GloVe, we
varied x_max ∈ {25, 50, …, 200} and α ∈ {0.5, 0.55, …, 1} simultaneously. Finally,
for Word2Vec, we varied the context window size over {1, 3, 5, 8, 10, 13, 16, 20},
while for LINE, we varied the number of samples per step over {1, 10, 100, 1000, 10000}
× 10^6. To rule out the influence of a random embedding initialization, each experiment
was performed 10 times and the mean result reported. After each experiment,
we chose the best-performing parameter settings on the respective tagging datasets
across the four evaluation datasets and used them for all other experiments.</p>
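        <p>
          This sweep procedure can be sketched generically; the parameter values and the evaluation function below are hypothetical stand-ins for "train embeddings, then evaluate against a HID":

```python
import statistics

def sweep(train_and_eval, param_name, values, repeats=10):
    """For each candidate value, train `repeats` embeddings with
    different random seeds and report the mean evaluation score."""
    results = {}
    for value in values:
        scores = [train_and_eval(**{param_name: value}, seed=s)
                  for s in range(repeats)]
        results[value] = statistics.mean(scores)
    best = max(results, key=results.get)
    return best, results

# Hypothetical stand-in objective, peaking at dimension 50.
def dummy_eval(dimension, seed):
    return 0.6 - abs(dimension - 50) / 1000 + 0.001 * (seed % 3)

best, scores = sweep(dummy_eval, "dimension", [10, 30, 50, 80, 100])
print(best)  # 50 scores highest under this dummy objective
```
        </p>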
      </sec>
      <sec id="sec-6-3">
        <title>Embedding Evaluation Results</title>
        <p>We will now present the evaluation results. For each algorithm, Table 3 gives the
parameter settings which produced the highest-scoring embeddings. In each figure,
we report both the evaluation results of the embeddings for a given parameter
as well as the corresponding baselines produced by the high-dimensional vector
representations given in Table 1.
14 All result figures are publicly available at http://www.dmir.org/tagembeddings.
Word2Vec. Although Word2Vec is meant to be applied to sequential data, as
opposed to the bag-of-words nature of tag assignments, the generated embeddings
yielded better correlation scores with human intuition than their high-dimensional
counterparts. However, we did not shuffle the tag sequence in posts, which is left
to future work. Figure 1a shows that fine-tuning the initial learning rate has a
great effect on the quality of word embeddings from BibSonomy, with peak
performance generally at a learning rate of 0.1, while the Delicious data seem unaffected. Increasing the
embedding dimension only improves the embeddings' semantic content up to a certain
point, which is mostly reached at a very low number of dimensions between
30 and 50 (Figure 1b). Anything above does not notably increase the performance of
the embeddings. The number of negative samples seems sufficiently high at 10
samples, and even fewer suffice for Delicious and CiteULike (Figure 1c). The context size had
negligible impact on the semantic content of the generated embeddings (Figure 1d).
GloVe. GloVe generates embeddings from co-occurrence data. As mentioned in
Section 4, GloVe is parameterized by the learning rate, the dimension of the generated
embeddings as well as by the weighting parameters x_max and α, which regulate
the importance of low-frequency co-occurrences in the training process. While the
learning rate does not show a great effect on embeddings generated from Delicious
data, fine-tuning influences the semantic content of CiteULike and BibSonomy
embeddings notably (Figure 3a). Mostly, peak performance is reached at an embedding
dimension of 100 or even earlier, except for Delicious (Figure 3b). Furthermore,
BibSonomy is quite sensitive to poor choices of x_max and α, i.e., if chosen too high,
performance suffers greatly (Figure 3c). Delicious and CiteULike seem unaffected
by those parameters, at least in our experimental ranges (Figures 3d and 3e).
LINE. LINE generates vertex embeddings from graph data, preserving the
first- and second-order proximity between vertices. Its parameters are the initial learning
rate, the embedding dimension, the number of negative samples per edge and the
number of samples per training step. While the influence of the initial learning rate
is visible, it is not as great as with GloVe (cf. Figure 2a). Also, the embedding
dimension gives similar results above 50 and only lets performance suffer if chosen
too small (cf. Figure 2b). Interestingly, Figure 2c shows that the number
of negative samples seems to have almost no effect on the generated embeddings
across all tagging datasets. In contrast, the choice of the number of samples per step
exerts great influence on the resulting embeddings, as can be seen in Figure 2d.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Discussion</title>
      <p>Across all algorithms, fine-tuning the initial learning rate greatly improves results
for embeddings based on BibSonomy, especially with GloVe. The effect of the
embedding dimension is much less pronounced across all three embedding algorithms.
Peak evaluation performance is often reached with an embedding dimension between
50 and 100 and stays quite stable with increasing dimension. Varying the number
of negative samples influences evaluation results of BibSonomy, but only at a very
high number of 20 negative samples. In contrast, Delicious and CiteULike already show
only small performance changes with 3 to 5 samples. Finally, GloVe's weighting
factors x_max and α negatively influence results on BibSonomy, while barely
affecting evaluation performance on Delicious and CiteULike, due to BibSonomy being
our smallest tagging dataset with hardly any co-occurrences above a high x_max.</p>
      <p>
        Generally, all investigated embedding algorithms produce high-quality
embeddings from tagging data. Although [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] found that tagging data contain high-quality
semantic information, the high-dimensional vector representation proposed there
seems to not optimally capture this information when evaluated on human
judgment (see Table 1). In contrast, the generated embeddings seem better suited to
capture that information, as they mostly outperform the tag-tag-context based
co-occurrence count vectors (Section 8). Furthermore, the best result achieved on
WS-353 in this work, around 0.7, stems from Delicious data using the GloVe algorithm
(cf. Figure 4b), which is on par with other well-known works, such as
ESA [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which is based on Wikipedia text, achieving a correlation of around 0.748,
or the work done by Singer et al. on Wikipedia navigation [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] with the highest
correlation at 0.76, but generally achieving scores around 0.71.
      </p>
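<p>As context for the correlation figures above, the evaluation protocol amounts to rank-correlating cosine similarities of tag embeddings with human relatedness judgments. A self-contained sketch with hypothetical toy vectors and scores (this simple Spearman implementation ignores tie correction, which library implementations handle):</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling, for illustration only)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=vals.__getitem__)
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical toy data: model similarities for three tag pairs vs. human scores.
model_sims = [cosine([1, 0], [1, 0.2]), cosine([1, 0], [0, 1]), cosine([1, 0], [0.5, 0.5])]
human_scores = [9.0, 1.0, 5.0]
rho = spearman(model_sims, human_scores)
```

<p>Because only the rank order matters, Spearman correlation is robust to the differing scales of cosine similarities and human judgment scores.</p>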
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
      <p>In this work, we explored embedding methods and their applicability to tagging
data. We conducted parameter studies for three well-known embedding algorithms
in order to achieve the best possible embeddings based on tagging data regarding
their fit to human intuition of semantic relatedness. Our results indicate that i)
tagging data provide a viable source to generate high-quality semantic embeddings,
even on par with current state-of-the-art methods, and ii) that in order to achieve
competitive results, it is necessary to choose the correct parameters for each algorithm
instead of the standard parameters. Overall, we bridged the gap between the fact that
tagging data yield considerable semantic content and the current state-of-the-art
methods to produce high-quality and low-dimensional word embeddings. We expect
our results to be of special interest for folksonomy engineers and others working
with the semantics of tagging data. Future work includes investigating the influence
of different vector representations on tagging-based real-world applications, such as
tag recommendations in social tagging systems, tag sense discovery and ontology
learning algorithms. Furthermore, we want to improve the fit of tagging
embeddings to human intuition by applying metric learning approaches or alignment
approaches to external knowledge bases, e.g., WordNet or DBpedia.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Marco</given-names>
            <surname>Baroni</surname>
          </string-name>
          , Georgiana Dinu, and German Kruszewski. "
          <article-title>Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors." In: ACL (</article-title>
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          et al. "
          <article-title>A neural probabilistic language model." In: JMLR (</article-title>
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Dominik</given-names>
            <surname>Benz</surname>
          </string-name>
          et al. "
          <article-title>Semantics made by you and me: Self-emerging ontologies can capture the diversity of shared knowledge."</article-title>
          <source>In: WebSci</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Dominik</given-names>
            <surname>Benz</surname>
          </string-name>
          et al. "
          <article-title>The Social Bookmark and Publication Management System BibSonomy." In: VLDB (</article-title>
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Kalina</given-names>
            <surname>Bontcheva</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dominic</given-names>
            <surname>Rout</surname>
          </string-name>
          . "
          <article-title>Making sense of social media streams through semantics: a survey."</article-title>
          <source>In: Semantic Web 5.5</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Elia</given-names>
            <surname>Bruni</surname>
          </string-name>
          ,
          <string-name>
            <surname>Nam-Khanh Tran</surname>
          </string-name>
          , and Marco Baroni. "Multimodal Distributional Semantics.
          <article-title>" In: JAIR (</article-title>
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>John A.</given-names>
            <surname>Bullinaria</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joseph P.</given-names>
            <surname>Levy</surname>
          </string-name>
          . "
          <article-title>Extracting semantic representations from word co-occurrence statistics: A computational study."</article-title>
          <source>In: BRM 39.3</source>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ciro</given-names>
            <surname>Cattuto</surname>
          </string-name>
          et al. "
          <source>Semantic Grounding of Tag Relatedness in Social Bookmarking Systems." In: ISWC</source>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Scott</given-names>
            <surname>Deerwester</surname>
          </string-name>
          et al. "
          <article-title>Indexing by latent semantic analysis."</article-title>
          <source>In: Journal of the American Society for Information Science 41.6</source>
          (
          <year>1990</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Lev</given-names>
            <surname>Finkelstein</surname>
          </string-name>
          et al. "
          <article-title>Placing Search in Context: the Concept Revisited."</article-title>
          <source>In: WWW</source>
          .
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Evgeniy</given-names>
            <surname>Gabrilovich</surname>
          </string-name>
          and
          <string-name>
            <given-names>Shaul</given-names>
            <surname>Markovitch</surname>
          </string-name>
          . "
          <article-title>Computing semantic relatedness using Wikipedia-based explicit semantic analysis."</article-title>
          <source>In: IJCAI</source>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Scott</given-names>
            <surname>Golder</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bernardo A.</given-names>
            <surname>Huberman</surname>
          </string-name>
          . "
          <source>The Structure of Collaborative Tagging Systems." In: (Aug</source>
          .
          <year>2005</year>
          ).
          <source>arXiv: cs.DL/0508082.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Mihajlo</given-names>
            <surname>Grbovic</surname>
          </string-name>
          et al. "
          <article-title>Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising."</article-title>
          <source>In: SIGIR</source>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Hotho</surname>
          </string-name>
          et al. "
          <article-title>Information Retrieval in Folksonomies: Search and Ranking."</article-title>
          In: ESWC. Ed. by York Sure and John Domingue.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Jäschke</surname>
          </string-name>
          et al. "
          <article-title>Tag Recommendations in Folksonomies." In: PKDD</article-title>
          . Ed. by
          <string-name>
            <surname>Joost N. Kok</surname>
          </string-name>
          et al.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Omer</given-names>
            <surname>Levy</surname>
          </string-name>
          , Yoav Goldberg, and
          <string-name>
            <surname>Israel</surname>
          </string-name>
          Ramat-Gan. "
          <article-title>Linguistic Regularities in Sparse and Explicit Word Representations."</article-title>
          <source>In: CoNLL</source>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Markines</surname>
          </string-name>
          et al. "
          <article-title>Evaluating Similarity Measures for Emergent Semantics of Social Tagging."</article-title>
          <source>In: WWW</source>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Mika</surname>
          </string-name>
          . "
          <article-title>Ontologies Are Us: A Unified Model of Social Networks and Semantics."</article-title>
          <source>In: Web Semant. 5</source>
          .1 (
          <issue>Mar</issue>
          .
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Wen-tau Yih, and Geoffrey Zweig. "
          <article-title>Linguistic Regularities in Continuous Space Word Representations."</article-title>
          <source>In: HLT-NAACL</source>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          et al. "
          <article-title>Distributed Representations of Words and Phrases and their Compositionality."</article-title>
          <source>In: NIPS</source>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Niebler</surname>
          </string-name>
          et al. "
          <source>Extracting Semantics from Unconstrained Navigation on Wikipedia." In: KI – Künstliche Intelligenz 30.2</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Niebler</surname>
          </string-name>
          et al. "
          <source>How Tagging Pragmatics Influence Tag Sense Discovery in Social Annotation Systems." In: ECIR</source>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          . "
          <source>Glove: Global Vectors for Word Representation." In: EMNLP</source>
          . Vol.
          <volume>14</volume>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Bryan</given-names>
            <surname>Perozzi</surname>
          </string-name>
          , Rami Al-Rfou', and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Skiena</surname>
          </string-name>
          . "
          <article-title>DeepWalk: online learning of social representations." In: KDD</article-title>
          . Ed. by Sofus A.
          <string-name>
            <surname>Macskassy</surname>
          </string-name>
          et al.
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Kira</given-names>
            <surname>Radinsky</surname>
          </string-name>
          et al. "
          <article-title>A Word at a Time: Computing Word Relatedness Using Temporal Semantic Analysis."</article-title>
          <source>In: WWW</source>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.O.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          . "
          <article-title>A cooccurrence-based thesaurus and two applications to information retrieval."</article-title>
          <source>In: IPM 33.3</source>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Philipp</given-names>
            <surname>Singer</surname>
          </string-name>
          et al. "
          <article-title>Computing Semantic Relatedness from Human Navigational Paths: A Case Study on Wikipedia."</article-title>
          <source>In: IJSWIS 9.4</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Jian</given-names>
            <surname>Tang</surname>
          </string-name>
          et al. "
          <article-title>LINE: Large-scale Information Network Embedding."</article-title>
          <source>In: WWW</source>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Vander Wal</surname>
          </string-name>
          . Folksonomy Definition and Wikipedia. Nov.
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Arkaitz</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          et al. "
          <article-title>Harnessing Folksonomies to Produce a Social Classification of Resources."</article-title>
          <source>In: IEEE Trans. on Knowl. and Data Eng</source>
          .
          <volume>25</volume>
          .8 (
          <issue>Aug</issue>
          .
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>