<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Mining Semantic Maturity in Social Bookmarking Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Atzmueller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominik Benz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andreas Hotho</string-name>
          <email>hotho@informatik.uni-wuerzburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gerd Stumme</string-name>
          <email>stummeg@cs.uni-kassel.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Mining &amp; Information Retrieval Group, University of Würzburg</institution>
          ,
          <addr-line>97074 Würzburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Knowledge &amp; Data Engineering Group, University of Kassel</institution>
          ,
          <addr-line>34121 Kassel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The existence of emergent semantics within social metadata (such as tags in bookmarking systems) has been demonstrated by a large number of successful approaches that make the implicit semantic structures explicit. However, much less attention has been given to the factors which influence the "maturing" process of these structures over time. A natural hypothesis is that tags become semantically more and more mature whenever many users use them in the same contexts. This would make it possible to describe a tag by a specific and informative "semantic fingerprint" in the context of tagged resources. However, the question of assessing the quality of such fingerprints has seldom been addressed. In this paper, we provide a systematic approach for mining semantic maturity profiles within folksonomy-based tag properties. Our ultimate goal is to provide a characterization of "mature tags". Additionally, we consider semantic information about the tags as a gold-standard source for the characterization of the collected results. Our initial results suggest that a suitable composition of tag properties allows the identification of more mature tag subsets. The presented work has implications for a number of problems related to social tagging systems, including tag ranking, tag recommendation, and the capturing of light-weight ontologies from tagging data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Social metadata, especially collaboratively created keywords or tags, form an
integral part of many social applications such as BibSonomy (http://www.bibsonomy.org),
Delicious (http://www.delicious.com), or Flickr (http://www.flickr.com). In such social
systems, many studies of the development of the tagging structure have shown the
presence of emergent semantics (e.g., [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) in the set
of human-annotated resources. That is, the semantics of tags develop gradually
depending on their usage.
      </p>
      <p>Due to this important observation, one can regard this development as a
process of "semantic maturing". The basic idea is that knowledge about a set
of cooccurring tags is sufficient for determining synonyms with a certain
reliability. The underlying assumption is that tags become "mature" after a certain
amount of usage. This maturity will then be reflected in a stable semantic profile.
Thus, tags that have arrived at this stage can be regarded as high-quality tags
with respect to the amount of emergent semantics they encode.</p>
      <p>In this paper, we utilize folksonomy-based tag properties for mining profiles
indicating "matured tags", i.e., high-quality tags that can be considered to convey
more precise semantics according to their usage contexts. The proposed
properties consist of various structural properties of the tagging data, e.g., centrality or
frequency properties. For a semantic grounding, we analyze the applied tagging
data with respect to tag-tag relations in WordNet in order to assess the "true"
semantic quality. Our contribution is thus three-fold: First, we provide and discuss different
tag properties that are useful in determining semantic maturity profiles of tags.
These are all obtained considering the network structure of folksonomies.
Second, we obtain a detailed statistical characterization of semantic tag maturity
profiles in a folksonomy dataset. Finally, we provide a list of useful indicators for
identifying "mature tags" as well as synonyms in this context.</p>
      <p>
        Applications of the obtained knowledge concern the construction of
lightweight ontologies using tagging knowledge [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], tag recommendation [
        <xref ref-type="bibr" rid="ref14 ref19">14,19</xref>
        ], or
tag ranking [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. All of these utilize selection options and/or ranking information
about sets of tags, for initial setup and refinement. Tag ranking approaches, for
example, can benefit from a "maturity ranking" for filtering purposes.
      </p>
      <p>The rest of the paper is structured as follows: Section 2 discusses related work.
After that, Section 3 introduces basic notions of the presented approach, including
folksonomy-based tag properties and the applied pattern mining method. Then,
we describe the mining methodology in detail, discuss our evaluation setting,
and present the obtained results. Finally, Section 5 concludes the paper with a
summary and interesting directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        While the phenomenon of collaborative tagging was discussed in its early stages
mainly in newsgroups or mailing lists (e.g., [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]), a first systematic analysis was
performed by [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. One core finding was that the openness and uncontrolledness
of these systems did not give rise to a "tag chaos", but on the contrary led to the
development of stable patterns in the tag proportions assigned to a given resource. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
reported similar results and denoted the emerging patterns as "semantic
fingerprints" of resources. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] presented an approach to capture emergent semantics
from a folksonomy by deriving lightweight ontologies. Subsequently, several
methods of capturing emergent semantics were proposed, in the form of (i) tag taxonomies [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], (ii)
measures of semantic tag relatedness [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], (iii) tag clusterings [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and (iv)
mappings of tags to concepts in existing ontologies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Most of the above works provided evidence for the existence of emergent
tag semantics by making certain aspects of it explicit; however, the question of
which factors influence its development was seldom discussed. Despite that,
a common perception seemed to be that a certain amount of data is necessary
for getting a "signal". Golder and Huberman [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] gave a rough estimate that
"after the first 100 or so bookmarks", the proportions of tags assigned to a
resource tended to stabilize. This suggested the rule "the more data, the better
the semantics". This assumption was partially confirmed by Körner et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], who
analyzed the amount of emergent semantics contained in different folksonomy
partitions. More data had a beneficial effect, but the user composition within the
partitions turned out to be crucial as well: Sub-folksonomies induced by so-called
"describers", who exhibit a certain kind of tag usage pattern, proved to contain
semantic structures of higher quality. Halpin [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] showed that the tag distribution
at resources tends to stabilize quickly into a power law, as a kind of "maturing"
of resources. In contrast, our work targets the maturing of tags themselves.
      </p>
      <p>However, to the best of our knowledge, none of the aforementioned works
has systematically addressed the question of whether there is a connection between
structural properties of tags and the quality of the semantics they encode (i.e., their
"semantic maturity"). In this work, we aim to fill this gap.</p>
    </sec>
    <sec id="sec-3">
      <title>Preliminaries</title>
      <p>In the following sections, we first briefly present a formal folksonomy model and
a folksonomy-based measure of tag relatedness. Then, we detail the structural
and statistical tag properties serving as a basis for mining maturity profiles. After
that, we briefly summarize the basics of the applied pattern mining technique.</p>
      <sec id="sec-3-1">
        <title>Folksonomies and Semantic Tag Relatedness</title>
        <p>
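          The underlying data structure of collaborative tagging systems is called a
folksonomy; according to [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], a folksonomy is a tuple <inline-formula><tex-math>F := (U, T, R, Y)</tex-math></inline-formula> where U, T,
and R are finite sets, whose elements are called users, tags and resources,
respectively. Y is a ternary relation between them, i.e., <inline-formula><tex-math>Y \subseteq U \times T \times R</tex-math></inline-formula>. An element
<inline-formula><tex-math>y \in Y</tex-math></inline-formula> is called a tag assignment or TAS. A post is a triple <inline-formula><tex-math>(u, T_{ur}, r)</tex-math></inline-formula> with <inline-formula><tex-math>u \in U</tex-math></inline-formula>,
<inline-formula><tex-math>r \in R</tex-math></inline-formula>, and a non-empty set <inline-formula><tex-math>T_{ur} := \{t \in T \mid (u, t, r) \in Y\}</tex-math></inline-formula>.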
        </p>
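        <p>Folksonomies introduce various kinds of relations among their contained
lexical items. A typical example is the cooccurrence network, which constitutes an
aggregation indicating which tags occur together. Given a folksonomy <inline-formula><tex-math>(U, T, R, Y)</tex-math></inline-formula>,
one can define the post-based tag-tag cooccurrence graph as <inline-formula><tex-math>G_{cooc} = (T, E, w)</tex-math></inline-formula>,
whose set of vertices corresponds to the set T of tags. Two tags <inline-formula><tex-math>t_1</tex-math></inline-formula> and <inline-formula><tex-math>t_2</tex-math></inline-formula> are
connected by an edge iff there is at least one post <inline-formula><tex-math>(u, T_{ur}, r)</tex-math></inline-formula> with <inline-formula><tex-math>t_1, t_2 \in T_{ur}</tex-math></inline-formula>.
The weight of this edge is given by the number of posts that contain both <inline-formula><tex-math>t_1</tex-math></inline-formula> and
<inline-formula><tex-math>t_2</tex-math></inline-formula>, i.e., <inline-formula><tex-math>w(t_1, t_2) := \mathrm{card}\{(u, r) \in U \times R \mid t_1, t_2 \in T_{ur}\}</tex-math></inline-formula>.</p>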
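        <p>As an illustration of the definition above, the post-based cooccurrence weights can be derived directly from a list of tag assignments. The following is a minimal sketch, not the implementation used for the experiments; the function names and the toy tag assignments are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from itertools import combinations


def posts_from_tas(tas):
    """Group tag assignments (u, t, r) into posts: (u, r) -> T_ur."""
    posts = {}
    for user, tag, resource in tas:
        posts.setdefault((user, resource), set()).add(tag)
    return posts


def cooccurrence_weights(tas):
    """Return w(t1, t2): the number of posts that contain both tags."""
    weights = Counter()
    for tags in posts_from_tas(tas).values():
        # Each unordered tag pair within one post contributes one count.
        for t1, t2 in combinations(sorted(tags), 2):
            weights[(t1, t2)] += 1
    return weights


# Hypothetical toy folksonomy Y (not taken from the del.icio.us dataset).
Y = [
    ("u1", "semweb", "r1"), ("u1", "rdf", "r1"),
    ("u2", "semweb", "r1"), ("u2", "rdf", "r1"), ("u2", "owl", "r1"),
    ("u1", "semweb", "r2"),
]
w = cooccurrence_weights(Y)
```
]]></preformat>
        <p>In this sketch, edges are stored as sorted tag pairs, so that each undirected edge of the graph appears exactly once.</p>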
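        <p>
          For assessing the semantic relatedness between tags we apply the resource
context similarity (cf. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]) computed in the vector space <inline-formula><tex-math>\mathbb{R}^{R}</tex-math></inline-formula>. For a tag t, the
vector <inline-formula><tex-math>v_t \in \mathbb{R}^{R}</tex-math></inline-formula> counts how often the tag t is used for annotating a certain
resource <inline-formula><tex-math>r \in R</tex-math></inline-formula>:
          <disp-formula><tex-math>v_{tr} = \mathrm{card}\{u \in U \mid (u, t, r) \in Y\}</tex-math></disp-formula>
        </p>
        <p>
          Based on this representation, we measure vector similarity by using the cosine
measure, as is customary in Information Retrieval: If two tags <inline-formula><tex-math>t_1</tex-math></inline-formula> and <inline-formula><tex-math>t_2</tex-math></inline-formula> are
represented by <inline-formula><tex-math>v_1, v_2 \in \mathbb{R}^{X}</tex-math></inline-formula>, their cosine similarity is defined as:
          <disp-formula><tex-math>\mathrm{cossim}(t_1, t_2) := \cos \angle(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\|_2 \cdot \|v_2\|_2}</tex-math></disp-formula>
In prior work, we showed that this measure comes
close to what humans perceive as semantically related [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>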
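        <p>To make the resource context similarity concrete, the sketch below builds sparse resource context vectors from tag assignments and compares two tags with the cosine measure. It assumes that the folksonomy is given as a list of (user, tag, resource) triples; all names and the toy data are illustrative assumptions.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def resource_context_vectors(tas):
    """Map each tag t to a sparse vector v_t with v_t[r] = number of users who assigned t to r."""
    vectors = {}
    # Y is a relation, so duplicate triples are counted only once.
    for user, tag, resource in set(tas):
        vectors.setdefault(tag, Counter())[resource] += 1
    return vectors


def cossim(v1, v2):
    """Cosine of the angle between two sparse vectors (0.0 if one of them is empty)."""
    dot = sum(v1[r] * v2[r] for r in v1 if r in v2)
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical toy tag assignments.
Y = [
    ("u1", "semweb", "r1"), ("u2", "semweb", "r1"), ("u1", "semweb", "r2"),
    ("u3", "semanticweb", "r1"), ("u3", "semanticweb", "r2"),
    ("u4", "cooking", "r3"),
]
vectors = resource_context_vectors(Y)
sim = cossim(vectors["semweb"], vectors["semanticweb"])
unrelated = cossim(vectors["semweb"], vectors["cooking"])
```
]]></preformat>
        <p>Sparse dictionaries are used here because each tag is typically assigned to only a small fraction of all resources.</p>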
      </sec>
      <sec id="sec-3-2">
        <title>Folksonomy-Based Tag Properties</title>
        <p>
          For folksonomy-based tag properties, we can utilize aggregated information such
as frequency, but also properties based on the network structure of the tag-tag
co-occurrence graph. The properties below are based on prior work in related
areas. They are abstract in the sense that none of them considers the textual
content of a tag. Therefore, all properties are language independent, since they only
operate on the folksonomy structure, on aggregated information, or on derived
networks. Below, we describe the different folksonomy-based properties and also
discuss their intuitive role regarding the assessment of tag maturity.
        </p>
        <p>
          <bold>Centrality Properties.</bold> In network theory, the centrality of a node <inline-formula><tex-math>v \in V</tex-math></inline-formula> in
a network G is usually an indication of how important the vertex is [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
Because important nodes are usually well-connected within the network, one can
hypothesize that this connectedness corresponds to a well-established semantic
fingerprint. On the other hand, high centrality might correspond to a relatively
"broad" meaning; in the context of our study, we avoid the latter by restricting
ourselves to single-sense tags (see Section 4). Applied to our problem at hand, we
interpret centrality as a measure of maturity, following the intuition that more
mature terms are also more "important". We adopted three standard centralities
(degree, closeness, betweenness). All of them can be applied to a term graph G:
        </p>
        <list list-type="bullet">
          <list-item>
            <p>
              According to betweenness centrality, a vertex has a high centrality if it can be
found on many shortest paths between other vertex pairs:
              <disp-formula><label>(1)</label><tex-math>bet(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}</tex-math></disp-formula>
Here, <inline-formula><tex-math>\sigma_{st}</tex-math></inline-formula> denotes the number of shortest paths between s and t, and <inline-formula><tex-math>\sigma_{st}(v)</tex-math></inline-formula>
is the number of shortest paths between s and t passing through v. As its
computation is very expensive, it is often approximated [
              <xref ref-type="bibr" rid="ref4">4</xref>
              ] by
calculating the shortest paths only between a fraction of points.
It seems intuitive that tags with a high betweenness centrality are closer
to important (semantic) hubs, and therefore more mature themselves. In
essence, higher values should indicate semantic maturity.
            </p>
          </list-item>
          <list-item>
            <p>
              A vertex ranks higher according to closeness centrality the shorter its shortest
path length to all other reachable nodes is:
              <disp-formula><label>(2)</label><tex-math>clos(v) = \frac{1}{\sum_{t \in V \setminus v} d_G(v, t)}</tex-math></disp-formula>
Here, <inline-formula><tex-math>d_G(v, t)</tex-math></inline-formula> denotes the geodesic distance (shortest path) between the
vertices v and t. A tag with a high closeness value is therefore closer
to the core of the folksonomy. Therefore, it seems intuitive to assume that
more central tags according to this measure should have a higher probability
of being more mature.
            </p>
          </list-item>
          <list-item>
            <p>
              The degree centrality simply counts the number of direct neighbors d(v) of a
vertex v in a graph <inline-formula><tex-math>G = (V, E)</tex-math></inline-formula>:
              <disp-formula><label>(3)</label><tex-math>deg(v) = \frac{d(v)}{|V| - 1}</tex-math></disp-formula>
Compared to the other metrics, degree centrality is a local measure since it
only takes into account the direct neighbourhood of a tag within the network.
According to the degree, a tag could be linked to both semantically mature
and non-mature tags. In this sense, it seems intuitive to assume that other
factors need to be taken into account; then an estimation of the effect of the
degree centrality can be considered.
            </p>
          </list-item>
        </list>
        <p>
          <bold>Frequency Properties.</bold> One first idea about tag maturity considers the fact
that tags that are used more often can get more mature, since they can exhibit
a more specific fingerprint. However, this does not guarantee maturity of tags.
Therefore, we consider the frequency of a tag as a candidate for the analysis.
        </p>
        <list list-type="bullet">
          <list-item>
            <p>
              We capture the resource frequency property rfreq, which counts the number
of resources tagged by a given tag t according to
              <disp-formula><label>(4)</label><tex-math>rfreq(t) = \mathrm{card}\{r : \exists (u, t', r) \in Y, t = t'\}</tex-math></disp-formula>
For the semantic assessment of tags, an intuitive hypothesis could be that the
semantic profile of a tag gets more concise when more and more resources are
tagged with it. However, this is not necessarily a criterion for mature tags,
since the development of the semantic profile could still be relatively fuzzy.
            </p>
          </list-item>
          <list-item>
            <p>
              The user frequency property ufreq counts the number of users that applied
the tag t:
              <disp-formula><label>(5)</label><tex-math>ufreq(t) = \mathrm{card}\{u : \exists (u, t', r) \in Y, t = t'\}</tex-math></disp-formula>
Similar to the resource frequency, more users should help to focus the
semantic profile of a tag due to the refinement of its usage patterns.
            </p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-3-3">
        <title>Pattern Mining using Subgroup Discovery</title>
        <p>
          Subgroup discovery [
          <xref ref-type="bibr" rid="ref2 ref21">21,2</xref>
          ] aims at identifying interesting patterns with respect to
a given target property of interest according to a specific interestingness measure. In
our context, the target property is given by a quality indicator for tags. The top
patterns are then ranked according to the given interestingness measure. Subgroup
discovery is especially suited for identifying local patterns in the data, that is,
nuggets that hold for specific subsets: It can uncover hidden relations captured
in small subgroups, for which variables are only significantly correlated in these
subgroups.
        </p>
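        <p>Formally, a database <inline-formula><tex-math>D = (I, A)</tex-math></inline-formula> is given by a set of individuals I (tags) and
a set of attributes A (i.e., tag properties). A selector or basic pattern <inline-formula><tex-math>sel_{a = a_j}</tex-math></inline-formula> is a
boolean function <inline-formula><tex-math>I \to \{0, 1\}</tex-math></inline-formula> that is true iff the value of attribute a is <inline-formula><tex-math>a_j</tex-math></inline-formula> for this
individual. For a numeric attribute <inline-formula><tex-math>a_{num}</tex-math></inline-formula>, selectors <inline-formula><tex-math>sel_{a \in [min_j; max_j]}</tex-math></inline-formula> can be defined
analogously for each interval <inline-formula><tex-math>[min_j; max_j]</tex-math></inline-formula> in the domain of <inline-formula><tex-math>a_{num}</tex-math></inline-formula>. In this case,
the respective boolean function is set to true iff the value of attribute <inline-formula><tex-math>a_{num}</tex-math></inline-formula> is in
the respective range.</p>
        <p>A subgroup description or (complex) pattern <inline-formula><tex-math>p = \{sel_1, \ldots, sel_d\}</tex-math></inline-formula> is then given
by a set of basic patterns, which is interpreted as a conjunction, i.e., <inline-formula><tex-math>p(I) = sel_1 \wedge \ldots \wedge sel_d</tex-math></inline-formula>.
A subgroup (extension) <inline-formula><tex-math>sg_p</tex-math></inline-formula> is now given by the set of individuals
<inline-formula><tex-math>sg_p = \{i \in I \mid p(i) = true\} := ext(p)</tex-math></inline-formula> which are covered by the subgroup description p.
A subgroup discovery task can now be specified by a 5-tuple <inline-formula><tex-math>(D, C, S, Q, k)</tex-math></inline-formula>. The
target concept <inline-formula><tex-math>C : I \to \mathbb{R}</tex-math></inline-formula> specifies the property of interest. It is a function that
maps each instance in the dataset to a target value c. It can be binary (e.g., the
quality of the tag is high or low), but can also use arbitrary target values (e.g., the
continuous quality of a given tag according to a certain measure). The search
space <inline-formula><tex-math>2^S</tex-math></inline-formula> is defined by the set of basic patterns S. Given the dataset D and target
concept C, the quality function <inline-formula><tex-math>Q : 2^S \to \mathbb{R}</tex-math></inline-formula> maps every pattern in the search
space to a real number that reflects the interestingness of a pattern. Finally, the
integer k gives the number of returned patterns of this task. Thus, the result
of a subgroup discovery task is the set of k subgroup descriptions <inline-formula><tex-math>res_1, \ldots, res_k</tex-math></inline-formula>
with the highest interestingness according to the quality function. Each of these
descriptions could be reformulated as a rule <inline-formula><tex-math>res_i \to c</tex-math></inline-formula>.</p>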
        <p>
          While a large number of quality functions has been proposed in the literature,
cf. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], many interestingness measures trade off the size <inline-formula><tex-math>|ext(p)|</tex-math></inline-formula> of a subgroup and
the deviation <inline-formula><tex-math>c - c_0</tex-math></inline-formula>, where c is the average value of the target concept in the
subgroup and <inline-formula><tex-math>c_0</tex-math></inline-formula> the average value of the target concept in the general population.
        </p>
        <p>We consider the quality function lift, which measures just the increase of the
average value of c in the subgroup compared to the general population:
          <disp-formula><tex-math>lift(p) = \frac{c}{c_0} \text{ if } |ext(p)| \geq T_{Supp}, \text{ and } 0 \text{ otherwise},</tex-math></disp-formula>
with an adequate minimal support threshold <inline-formula><tex-math>T_{Supp}</tex-math></inline-formula> considering the size of the
subgroup. Usually, the analysis is performed using different minimal size
thresholds in an explorative way. It is easy to see that both types of quality measures
are applicable for binary and continuous target concepts.</p>
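        <p>The subgroup discovery task with the lift quality function can be illustrated by a small exhaustive search. The sketch below is only meant to clarify the definitions; it enumerates all conjunctions up to a fixed length, which is an assumption made for brevity and is not the search strategy of an actual subgroup discovery tool. The selector thresholds and the toy tag records are hypothetical.</p>
        <preformat><![CDATA[
```python
from itertools import combinations


def extension(pattern, individuals):
    """ext(p): all individuals for which every selector of the conjunction holds."""
    return [i for i in individuals if all(sel(i) for _, sel in pattern)]


def lift(pattern, individuals, target, min_support):
    """lift(p) = c / c0 if |ext(p)| >= min_support, and 0 otherwise."""
    ext = extension(pattern, individuals)
    if len(ext) < min_support:
        return 0.0
    c = sum(target(i) for i in ext) / len(ext)
    c0 = sum(target(i) for i in individuals) / len(individuals)
    return c / c0 if c0 else 0.0


def top_k_patterns(selectors, individuals, target, k, min_support, max_len=2):
    """Score all conjunctions of up to max_len selectors and return the k best."""
    scored = []
    for d in range(1, max_len + 1):
        for pattern in combinations(selectors, d):
            quality = lift(pattern, individuals, target, min_support)
            scored.append((quality, [name for name, _ in pattern]))
    scored.sort(key=lambda item: -item[0])  # stable: ties keep enumeration order
    return scored[:k]


# Hypothetical toy records: one dictionary of property values per tag.
tags = [
    {"ufreq": 0.20, "clos": 0.70, "wns": 0.9},
    {"ufreq": 0.15, "clos": 0.66, "wns": 0.8},
    {"ufreq": 0.02, "clos": 0.70, "wns": 0.5},
    {"ufreq": 0.01, "clos": 0.30, "wns": 0.2},
]
selectors = [
    ("ufreq > 0.13", lambda t: t["ufreq"] > 0.13),
    ("clos > 0.647", lambda t: t["clos"] > 0.647),
]
best = top_k_patterns(selectors, tags, lambda t: t["wns"], k=2, min_support=2)
```
]]></preformat>
        <p>Each entry of the result pairs a lift value with the names of the selectors in the pattern, which corresponds to one ranked subgroup description.</p>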
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Mining Semantic Tag Maturity</title>
      <p>
        For a given folksonomy and its tagging dataset, we apply the following steps:
Using the dataset, we construct the tag properties discussed in Section 3.2. As we
will see below, the "raw" properties do not correlate sufficiently with semantic
maturity. Therefore, we consider the dataset at the level of high-quality subgroups
of semantically matured tags, and apply pattern mining using the lift quality
function for this task. As an evaluation, we apply a gold-standard measure of
semantic relatedness derived from WordNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>Methodology</title>
        <p>
          For the purpose of assessing the degree of semantic maturity of a given tag, a
crucial question is how to measure this degree in a reliable and semantically
grounded manner. In prior work [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] we identified folksonomy-based measures of
semantic relatedness, which are, among other things, able to detect potential synonym
tags for a given tag. The most precise measure we found was the resource context
relatedness, which is computed in the vector space <inline-formula><tex-math>\mathbb{R}^{R}</tex-math></inline-formula>. For a tag t, the vector
<inline-formula><tex-math>v_t \in \mathbb{R}^{R}</tex-math></inline-formula> is constructed by counting how often the tag t is used to annotate a certain
resource <inline-formula><tex-math>r \in R</tex-math></inline-formula>: <inline-formula><tex-math>v_{tr} := \mathrm{card}\{u \in U \mid (u, t, r) \in Y\}</tex-math></inline-formula>. This vector representation
can be interpreted as a "semantic fingerprint" of a given tag, based on its
distribution over all resources. Our intuition for capturing the degree of maturity is
based on the following chain of arguments:
        </p>
        <list list-type="order">
          <list-item>
            <p>The better the semantic fingerprint of a tag t reflects the meaning of t, the
higher is the probability that the resource context relatedness yields "true"
synonyms or semantically closely related tags <inline-formula><tex-math>t_{sim1}, t_{sim2}, \ldots</tex-math></inline-formula> for t.</p>
          </list-item>
          <list-item>
            <p>If the most related potential synonym tag <inline-formula><tex-math>t_{sim1}</tex-math></inline-formula> is a "true" synonym of t (as
grounded against the WordNet synset hierarchy), then the semantic
fingerprint of t is regarded as semantically mature.</p>
          </list-item>
          <list-item>
            <p>Otherwise, we consider the similarity in WordNet between t and <inline-formula><tex-math>t_{sim1}</tex-math></inline-formula> as an
indicator for the maturity of the tag.</p>
          </list-item>
        </list>
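        <p>The chain of arguments above can be sketched in a few lines: for a tag, the most related tag according to a folksonomy-based similarity is determined, and the WordNet similarity of this pair serves as the maturity signal. Both similarity functions are passed in as parameters; the lookup tables in the example are hypothetical placeholders for the resource context relatedness and a WordNet-based measure.</p>
        <preformat><![CDATA[
```python
def most_similar_tag(tag, folksonomy_similarity, candidates):
    """Return t_sim1: the candidate most related to `tag` in the folksonomy."""
    others = [t for t in candidates if t != tag]
    return max(others, key=lambda t: folksonomy_similarity(tag, t))


def maturity_score(tag, folksonomy_similarity, candidates, wordnet_similarity):
    """Return (t_sim1, WordNet similarity between the tag and t_sim1)."""
    t_sim1 = most_similar_tag(tag, folksonomy_similarity, candidates)
    return t_sim1, wordnet_similarity(tag, t_sim1)


# Hypothetical similarity tables for three tags (values are made up).
folk = {
    frozenset(("car", "automobile")): 0.80,
    frozenset(("car", "banana")): 0.10,
    frozenset(("automobile", "banana")): 0.05,
}
wordnet = {
    frozenset(("car", "automobile")): 1.0,
    frozenset(("car", "banana")): 0.2,
    frozenset(("automobile", "banana")): 0.2,
}
partner, score = maturity_score(
    "car",
    lambda a, b: folk[frozenset((a, b))],
    ["car", "automobile", "banana"],
    lambda a, b: wordnet[frozenset((a, b))],
)
```
]]></preformat>
        <p>In this toy example the most related tag is a synonym, so the tag would be regarded as semantically mature according to step 2.</p>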
        <p>
          Please note that we are using purely folksonomy-based measures (i.e.,
resource context relatedness) as a proxy for semantic similarity, because WordNet
is not available for all tags. Simply put, this approach regards a tag as
semantically mature if the information encoded in its resource context vector suffices
to identify other tags with the same meaning. Naturally, this requires the
existence of a sufficiently similar tag, which cannot be guaranteed. Therefore, this is
not a sufficient but a necessary criterion. However, we think that the approach
is justified, because the process of maturing is not restricted to isolated tags,
but takes place similar to a "co-evolution" among several tags belonging to a
certain domain of interest. As an example, if the topic of semantic web is very
popular, then a relatively broad vocabulary to describe this concept will emerge,
e.g., semantic web, semanticweb, semweb, sw, and so on. In such a case, the maturity
of a single tag would "correlate" with the existence of semantically similar tags
within the same domain of interest. In general, it is important to note that
our methodology is also applicable to narrow folksonomies when replacing the
resource context relatedness with the tag context relatedness (see [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]).
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Semantic Considerations</title>
        <p>
          For assessing the semantic similarity between tags we apply WordNet [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], a
semantic lexicon of the English language. WordNet groups words into synsets, i.e.,
sets of synonyms that represent one concept. These synsets are nodes in a
network; links between these represent semantic relations. WordNet provides a
distinct network structure for each syntactic category (nouns, verbs, adjectives and
adverbs). For nouns and verbs, it is possible to restrict the links in the network to
(directed) is-a relationships only, so that a subsumption hierarchy can be
defined. The is-a relation connects a hyponym (more specific synset) to a hypernym
(more general synset). A synset can have multiple hypernyms, so that the graph
is not a tree, but a directed acyclic graph. Since the is-a WordNet network for
nouns and verbs consists of several disconnected hierarchies, it is useful to add a
fake top-level node subsuming all the roots of those hierarchies; the graph is then
fully connected so that several graph-based similarity metrics between pairs of
nouns and pairs of verbs can be defined. In WordNet, we measure the semantic
similarity using the taxonomic shortest-path length dist; the WordNet similarity
<inline-formula><tex-math>wns = 1 - \frac{dist}{maxdist}</tex-math></inline-formula> is then normalized using the maximum distance maxdist.
        </p>
        <p>In addition to the WordNet similarity, we consider two additional indicators:</p>
        <list list-type="bullet">
          <list-item>
            <p>The Maturity Indicator (mat) is a binary feature and measures if a tag has
reached a certain maturity according to the WordNet information, i.e., the
indicator is true if we observe a WordNet similarity <inline-formula><tex-math>wns \geq 0.5</tex-math></inline-formula>.</p>
          </list-item>
          <list-item>
            <p>The Synonym Indicator (syn) is a binary feature that specifies if a tag pair
is in a synonym relation, i.e., the WordNet similarity is <inline-formula><tex-math>wns = 1</tex-math></inline-formula>.</p>
          </list-item>
        </list>
        <p>Since we consider the semantic fingerprint of tags using folksonomy
information, we restrict the analysis to WordNet terms with only one sense; otherwise
advanced word-sense disambiguation would be necessary in order to compare the
correct senses in the WordNet synsets.</p>
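        <p>The WordNet similarity and the two derived indicators can be illustrated on a small hand-made is-a hierarchy. The sketch below does not access WordNet itself: the hierarchy, the word-to-synset mapping, and the choice of maxdist as the largest shortest-path distance between any two synsets of the toy hierarchy are assumptions made for illustration only.</p>
        <preformat><![CDATA[
```python
from collections import deque

# Hypothetical is-a hierarchy (synset -> hypernyms) with a fake top-level node ROOT.
HYPERNYMS = {
    "car.n": ["vehicle.n"],
    "bicycle.n": ["vehicle.n"],
    "vehicle.n": ["ROOT"],
    "apple.n": ["fruit.n"],
    "fruit.n": ["ROOT"],
}
# Single-sense words only: each word maps to exactly one synset.
SYNSET_OF = {"car": "car.n", "automobile": "car.n",
             "bicycle": "bicycle.n", "apple": "apple.n"}


def undirected_neighbours(hypernyms):
    """Treat is-a links as undirected edges for the shortest-path computation."""
    adj = {}
    for child, parents in hypernyms.items():
        for parent in parents:
            adj.setdefault(child, set()).add(parent)
            adj.setdefault(parent, set()).add(child)
    return adj


def distances_from(source, adj):
    """Breadth-first search: shortest-path length from source to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


ADJ = undirected_neighbours(HYPERNYMS)
MAXDIST = max(max(distances_from(s, ADJ).values()) for s in ADJ)


def wns(word1, word2):
    """wns = 1 - dist / maxdist on the toy hierarchy."""
    target = SYNSET_OF[word2]
    dist = distances_from(SYNSET_OF[word1], ADJ)[target]
    return 1 - dist / MAXDIST


def mat(word1, word2):
    return wns(word1, word2) >= 0.5


def syn(word1, word2):
    return wns(word1, word2) == 1
```
]]></preformat>
        <p>Words that share a synset have distance 0 and therefore a similarity of 1, which is exactly the case captured by the synonym indicator.</p>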
      </sec>
      <sec id="sec-4-3">
        <title>Dataset</title>
        <p>For our experiments we used data from the social bookmarking system del.icio.us,
collected in November 2006. In total, data from 667,128 users of the del.icio.us
community were collected, comprising 2,454,546 tags, 18,782,132 resources, and
140,333,714 tag assignments. For the specific purpose of this paper, some
preprocessing and filtering was necessary: For the purpose of "grounding" the true
semantic content of a tag t, we apply vector-based measures to compute
similar tags <inline-formula><tex-math>t_{sim}</tex-math></inline-formula>. Hence, we must ensure that (i) the vector representation is
dense enough to yield meaningful similarity judgements and (ii) there exist
sufficiently similar tags <inline-formula><tex-math>t_{sim}</tex-math></inline-formula>. For these reasons, we first restrict our dataset to the
10,000 most frequent tags of del.icio.us (and to the resources/users that have been
associated with at least one of those tags). The restricted folksonomy consists
of |U| = 476,378 users, |T| = 10,000 tags, |R| = 12,660,470 resources, and
|Y| = 101,491,722 tag assignments. In order to ensure the existence of sufficient
"similarity partners" for each tag, we filter out all tags whose cosine similarity to
their most similar tag is lower than 0.05. As a last step, we only considered tags
with exactly a single sense in WordNet in order to eliminate the influence of
ambiguity. After all filtering steps, we considered a total of 1944 tags. We are
aware that this is a strong limitation regarding the number of considered tags;
however, because the problem at hand as well as our experimental
methodology is sensitive towards a number of factors (like ambiguity or folksonomy-based
similarity judgements), our focus is to start with a very "clean" subset. As a
follow-up, it would of course be interesting to include more tags, provided that the results
on the clean subset are promising.</p>
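        <p>The three filtering steps described above can be summarized in a small sketch. It assumes that a tag similarity function and a function returning the number of WordNet senses of a tag are available; both are passed in as parameters, and the toy values below are hypothetical. The quadratic comparison of all tag pairs is chosen for readability only.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def filter_tags(tas, similarity, sense_count, top_n=10000, min_sim=0.05):
    """Apply the three filters: frequency, similarity partner, single sense."""
    # Step 1: restrict to the top_n most frequent tags.
    freq = Counter(tag for _, tag, _ in tas)
    frequent = [tag for tag, _ in freq.most_common(top_n)]
    kept = []
    for tag in frequent:
        others = [t for t in frequent if t != tag]
        if not others:
            continue
        # Step 2: require a sufficiently similar "similarity partner".
        best = max(similarity(tag, t) for t in others)
        # Step 3: require exactly one WordNet sense.
        if best >= min_sim and sense_count(tag) == 1:
            kept.append(tag)
    return kept


# Hypothetical toy data: tag counts 4 (car), 3 (automobile), 2 (bank), 1 (misc).
tas = (
    [("u%d" % i, "car", "r1") for i in range(4)]
    + [("u%d" % i, "automobile", "r1") for i in range(3)]
    + [("u%d" % i, "bank", "r2") for i in range(2)]
    + [("u9", "misc", "r3")]
)
pair_sim = {
    frozenset(("car", "automobile")): 0.90,
    frozenset(("car", "bank")): 0.01,
    frozenset(("automobile", "bank")): 0.02,
}
senses = {"car": 1, "automobile": 1, "bank": 2, "misc": 1}
kept = filter_tags(
    tas,
    lambda a, b: pair_sim.get(frozenset((a, b)), 0.0),
    lambda tag: senses.get(tag, 0),
    top_n=3,
)
```
]]></preformat>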
        <p>Table: WordNet Similarity (wns), Maturity Indicator (mat),
Synonym Indicator (syn) and the different tag properties (bet, clos, deg, rfreq, ufreq).
Surviving values: wns: 0.15, 0.20, 0.20, 0.21; mat: 0.09, 0.14, 0.12, 0.15; syn: no values.</p>
        <p>
          We calculated all tag properties given the described co-occurrence network,
and discretized these using the standard MDL method of Fayyad &amp; Irani [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
considering the WordNet similarity as a target class.
        </p>
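        <p>The computation of the tag properties from the tag assignments can be sketched as follows. The example covers degree and closeness centrality as well as the two frequency properties on the post-based co-occurrence graph; betweenness centrality and the discretization step are omitted for brevity. All function names and the toy data are assumptions, and the sketch is not the implementation used for the experiments.</p>
        <preformat><![CDATA[
```python
from collections import deque


def tag_graph(tas):
    """Adjacency sets of the post-based tag-tag co-occurrence graph."""
    posts = {}
    for user, tag, resource in tas:
        posts.setdefault((user, resource), set()).add(tag)
    adj = {tag: set() for _, tag, _ in tas}
    for tags in posts.values():
        for t1 in tags:
            adj[t1] |= tags - {t1}
    return adj


def degree_centrality(adj, v):
    """deg(v) = d(v) / (|V| - 1)."""
    return len(adj[v]) / (len(adj) - 1) if len(adj) > 1 else 0.0


def closeness_centrality(adj, v):
    """clos(v) = 1 / sum of shortest-path lengths to all reachable nodes."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    total = sum(dist.values())
    return 1 / total if total else 0.0


def rfreq(tas, t):
    """Number of distinct resources tagged with t."""
    return len({r for _, tag, r in tas if tag == t})


def ufreq(tas, t):
    """Number of distinct users who applied t."""
    return len({u for u, tag, _ in tas if tag == t})


# Hypothetical toy tag assignments.
Y = [
    ("u1", "semweb", "r1"), ("u1", "rdf", "r1"),
    ("u2", "semweb", "r2"), ("u2", "owl", "r2"),
    ("u3", "semweb", "r1"),
]
adj = tag_graph(Y)
properties = {
    t: (degree_centrality(adj, t), closeness_centrality(adj, t), rfreq(Y, t), ufreq(Y, t))
    for t in adj
}
```
]]></preformat>
        <p>Each tag is thus described by one row of property values, which is the attribute representation required by the pattern mining step.</p>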
      </sec>
      <sec id="sec-4-4">
        <title>Statistical Characterization</title>
        <p>applied data. Each circle in Figure 1 represents one of the 1944 tags. Concerning
the WordNet similarity (wns), we observe that there is little correlation with the
tag properties; furthermore, we observe even lower correlations considering the
two indicators mat and syn. Therefore, pattern mining using subgroup discovery
is well suited for mining semantic tag profiles, since it also considers correlations
in rather small subgroups described by combinations of different influence factors.</p>
        <p>Figure 1: WordNet Similarity (wns) vs. different tag properties. Surviving panel titles: (a) wns vs. bet, (b) wns vs. clos; surviving axis labels: deg, clos, wns.</p>
      </sec>
      <sec id="sec-4-5">
        <title>Results</title>
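        <p>We applied pattern mining for the presented dataset using the tag properties as
attributes, and the target concepts (wns, mat, syn) discussed above. Concerning
the WordNet Similarity (wns) and the lift quality function with a minimal
subgroup size n = 40, we obtained the top patterns shown in Table 2.</p>
        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>Top patterns for the target concept wns using the lift quality function with a minimal subgroup size n = 40.</p>
          </caption>
          <table>
            <thead>
              <tr><th>#</th><th>Lift</th><th>Mean wns</th><th>Size</th><th>Pattern</th></tr>
            </thead>
            <tbody>
              <tr><td>1</td><td>1.52</td><td>0.91</td><td>44</td><td>ufreq &gt; 13.0% AND clos &gt; 64.7%</td></tr>
              <tr><td>2</td><td>1.49</td><td>0.89</td><td>46</td><td>ufreq &gt; 13.0%</td></tr>
              <tr><td>3</td><td>1.33</td><td>0.80</td><td>73</td><td>rfreq &gt; 3.0% AND clos &gt; 64.7%</td></tr>
              <tr><td>4</td><td>1.31</td><td>0.78</td><td>77</td><td>rfreq &gt; 3.0%</td></tr>
              <tr><td>5</td><td>1.25</td><td>0.75</td><td>231</td><td>deg &gt; 6.0% AND ufreq &gt; 1.0%</td></tr>
              <tr><td>6</td><td>1.24</td><td>0.74</td><td>246</td><td>deg &gt; 6.0% AND rfreq &gt; 0.1%</td></tr>
              <tr><td>7</td><td>1.21</td><td>0.72</td><td>115</td><td>bet ∈ [0.03%; 1.0%] AND ufreq &gt; 1.0%</td></tr>
              <tr><td>8</td><td>1.21</td><td>0.72</td><td>275</td><td>deg &gt; 6.0%</td></tr>
              <tr><td>9</td><td>1.20</td><td>0.72</td><td>162</td><td>clos &gt; 64.7%</td></tr>
              <tr><td>10</td><td>1.18</td><td>0.70</td><td>588</td><td>clos &gt; 47.0% AND rfreq &gt; 0.1%</td></tr>
              <tr><td>11</td><td>1.36</td><td>0.81</td><td>74</td><td>clos ∈ [53.0%; 64.7%] AND deg &gt; 6.0% AND ufreq &gt; 1.0%</td></tr>
              <tr><td>12</td><td>1.33</td><td>0.80</td><td>86</td><td>clos ∈ [53.0%; 64.7%] AND deg &gt; 6.0% AND rfreq &gt; 0.1%</td></tr>
              <tr><td>13</td><td>1.30</td><td>0.77</td><td>105</td><td>bet ∈ [0.03%; 1.0%] AND deg &gt; 6.0% AND ufreq &gt; 1.0%</td></tr>
              <tr><td>14</td><td>1.26</td><td>0.75</td><td>108</td><td>clos ∈ [53.0%; 64.7%] AND deg &gt; 6.0% AND ufreq &gt; 554</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Lines 1-10 show only basic patterns (one selector), while the lines 11-15 indicate more
complex patterns. These results show the influence of high betweenness and high closeness, as
intuitively expected. The influence of the degree centrality is not as pronounced
as that of the other centralities, although a higher degree also improves semantic maturity.
Furthermore, a relatively high user frequency seems to be the best indicator for
high-quality tags. Additionally, a relatively high resource frequency is also a top
indicator for semantic maturity.</p>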
        <p>If we consider the "maturity indicator" as the binary target concept, we
obtain the patterns shown in Table 3. We observe similar influential properties
as discussed above; however, the user and resource frequency combined with a
medium or high closeness show the best performance.</p>
        <p>Looking at the "synonym indicator" results shown in Table 4, we observe
that the tag properties identified above have an even more pronounced influence:
the increase in the target concept (the lift) is between 2 and 3, indicating
an increase in the mean target share of the synonym indicator in the subgroups
by 100% to 200%. An example of a small subgroup containing only synonyms is
described by the pattern bet ∈ [1326142; 1.0%] AND ufreq &gt; 13.0%; it consists of
the tags "wallpaper", "templates" and "bookmarks".</p>
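        <p>The step from a lift value to a percentage increase can be written out explicitly. As a sketch, assume the common subgroup discovery reading of lift for a binary target, where p denotes the target share in the subgroup and p_0 the target share in the complete tag set (both symbols are introduced here for illustration only):</p>
        <disp-formula>
          <tex-math>\mathrm{lift} = \frac{p}{p_0}, \qquad \frac{p - p_0}{p_0} = \mathrm{lift} - 1</tex-math>
        </disp-formula>
        <p>Under this assumption, a lift of 2 corresponds to a relative increase of 100% and a lift of 3 to a relative increase of 200%.</p>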
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we have presented an approach for mining semantic maturity of
tags in social bookmarking systems. We applied pattern mining for identifying
subgroups of tags with mature semantic fingerprints according to different tag
properties. These were based on structural and statistical folksonomy properties
and computed using the tag co-occurrence information and tag/user frequency
information. We provided a detailed analysis of the different properties, and
presented a case study using data from del.icio.us. The results indicate the
influence of several properties with interesting orders of magnitude for the del.icio.us
dataset. For example, the number of users plays a crucial role for the process of
semantic maturing; however, the additional consideration of centrality properties
can help to identify subsets of tags with a higher degree of maturity.</p>
      <p>For future work, we plan to extend our proposed methodology to larger tag
sets, including less frequently used tags and especially the notion of semantic
"immaturity". Furthermore, we plan to include further tag properties,
including temporal aspects such as the amount of time a tag is present in the system.
Additionally, we aim to evaluate the method on more datasets from diverse social
systems.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work has partially been supported by the VENUS research cluster at the
interdisciplinary Research Center for Information System Design (ITeG) at Kassel
University, and by the EU project EveryAware.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Angeletou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Semantic Enrichment of Folksonomy Tagspaces</article-title>
          .
          <source>In: Int'l Semantic Web Conference. LNCS</source>
          , vol.
          <volume>5318</volume>
          , pp.
          <fpage>889</fpage>
          –
          <lpage>894</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Atzmueller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Puppe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buscher</surname>
            ,
            <given-names>H.P.</given-names>
          </string-name>
          :
          <article-title>Exploiting Background Knowledge for Knowledge-Intensive Subgroup Discovery</article-title>
          .
          <source>In: Proc. 19th Intl. Joint Conference on Artificial Intelligence (IJCAI-05)</source>
          . pp.
          <fpage>647</fpage>
          –
          <lpage>652</lpage>
          . Edinburgh, Scotland (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Benz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stumme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Semantics Made by You and Me: Self-emerging Ontologies can Capture the Diversity of Shared Knowledge</article-title>
          .
          <source>In: Proceedings of the 2nd Web Science Conference (WebSci10)</source>
          . Raleigh, NC, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Brandes</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pich</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Centrality Estimation in Large Networks</article-title>
          .
          <source>I. J. Bifurcation and Chaos</source>
          <volume>17</volume>
          (
          <issue>7</issue>
          ),
          <fpage>2303</fpage>
          –
          <lpage>2318</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cattuto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Semiotic dynamics in online social communities</article-title>
          .
          <source>The European Physical Journal C - Particles and Fields</source>
          <volume>46</volume>
          ,
          <fpage>33</fpage>
          –
          <lpage>37</lpage>
          (
          <year>August 2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cattuto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stumme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Semantic Grounding of Tag Relatedness in Social Bookmarking Systems</article-title>
          .
          <source>In: The Semantic Web, Proc. Intl. Semantic Web Conference</source>
          <year>2008</year>
          . vol.
          <volume>5318</volume>
          , pp.
          <fpage>615</fpage>
          –
          <lpage>631</lpage>
          . Springer, Heidelberg (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fayyad</surname>
            ,
            <given-names>U.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irani</surname>
            ,
            <given-names>K.B.</given-names>
          </string-name>
          :
          <article-title>Multi-interval Discretization of Continuous-valued Attributes for Classification Learning</article-title>
          .
          <source>In: Thirteenth International Joint Conference on Artificial Intelligence</source>
          . vol.
          <volume>2</volume>
          , pp.
          <fpage>1022</fpage>
          –
          <lpage>1027</lpage>
          . Morgan Kaufmann Publishers (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fellbaum</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (ed.):
          <article-title>WordNet: An Electronic Lexical Database</article-title>
          . MIT Press (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Geng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamilton</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          :
          <article-title>Interestingness Measures for Data Mining: A Survey</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>38</volume>
          (
          <issue>3</issue>
          ) (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Golder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huberman</surname>
            ,
            <given-names>B.A.</given-names>
          </string-name>
          :
          <article-title>The Structure of Collaborative Tagging Systems</article-title>
          .
          <source>Journal of Information Sciences</source>
          <volume>32</volume>
          (
          <issue>2</issue>
          ),
          <fpage>198</fpage>
          –
          <lpage>208</lpage>
          (April
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Halpin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shepherd</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>The Complex Dynamics of Collaborative Tagging</article-title>
          .
          <source>In: Proc. of WWW2007</source>
          . pp.
          <fpage>211</fpage>
          –
          <lpage>220</lpage>
          . ACM, New York, NY, USA (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Heymann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Molina</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems</article-title>
          .
          <source>Tech. rep., Computer Science Department, Stanford University</source>
          (
          <year>April 2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Jaschke, R.,
          <string-name>
            <surname>Schmitz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stumme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Information Retrieval in Folksonomies: Search and Ranking</article-title>
          .
          <source>In: The Semantic Web: Research and Applications. LNAI</source>
          , vol.
          <volume>4011</volume>
          , pp.
          <fpage>411</fpage>
          –
          <lpage>426</lpage>
          . Springer, Heidelberg (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Jaschke, R.,
          <string-name>
            <surname>Marinho</surname>
            ,
            <given-names>L.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidt-Thieme</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stumme</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Tag Recommendations in Folksonomies</article-title>
          .
          <source>In: Proc. PKDD 2007. Lecture Notes in Computer Science</source>
          , vol.
          <volume>4702</volume>
          , pp.
          <fpage>506</fpage>
          –
          <lpage>514</lpage>
          . Berlin, Heidelberg (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. Korner,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Benz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Strohmaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Hotho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Stumme</surname>
          </string-name>
          , G.:
          <article-title>Stop Thinking, start Tagging - Tag Semantics emerge from Collaborative Verbosity</article-title>
          .
          <source>In: Proc. of WWW2010</source>
          . ACM, Raleigh,
          <string-name>
            <surname>NC</surname>
          </string-name>
          , USA (apr
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hua</surname>
            ,
            <given-names>X.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          :
          <article-title>Tag Ranking</article-title>
          .
          <source>In: Proc. of WWW2009</source>
          . pp.
          <fpage>351</fpage>
          –
          <lpage>360</lpage>
          . WWW '09, ACM, New York, NY, USA (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Mathes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Folksonomies - Cooperative Classification and Communication Through Shared Metadata</article-title>
          (
          <year>December 2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Mika</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Ontologies Are Us: A Unified Model of Social Networks and Semantics</article-title>
          .
          <source>In: Proc. Intl. Semantic Web Conf. LNCS</source>
          , vol.
          <volume>3729</volume>
          , pp.
          <fpage>522</fpage>
          –
          <lpage>536</lpage>
          . Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19. Sigurbjornsson, B.,
          <string-name>
            <surname>van</surname>
            <given-names>Zwol</given-names>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          :
          <source>Flickr Tag Recommendation Based on Collective Knowledge. In: Proc. of WWW2008. WWW '08</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Wasserman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faust</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Social Network Analysis: Methods and Applications</article-title>
          . Cambridge Univ Pr (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Wrobel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>An Algorithm for Multi-Relational Discovery of Subgroups</article-title>
          .
          <source>In: Proc. 1st European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-97)</source>
          . pp.
          <fpage>78</fpage>
          –
          <lpage>87</lpage>
          . Springer Verlag, Berlin (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bao</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>An Unsupervised Model for Exploring Hierarchical Semantics from Social Annotations</article-title>
          . pp.
          <fpage>680</fpage>
          –
          <lpage>693</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>