<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UNED at RepLab 2012: Monitoring Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tamara Martín</string-name>
          <email>tmartin@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damiano Spina</string-name>
          <email>damiano@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrique Amigó</string-name>
          <email>enrique@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julio Gonzalo</string-name>
          <email>julio@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>UNED NLP &amp; IR Group Juan del Rosal</institution>
          ,
          <addr-line>16 28040 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the UNED participation in the RepLab 2012 Monitoring Task. Given an entity and a tweet stream containing the entity's name, the task consists of grouping the tweets into topics and then ranking the identified topics by priority. We tested three different systems to deal with the clustering problem: (i) an agglomerative clustering based on term co-occurrences, (ii) a clustering method that considers "wikified" tweets, where each tweet is represented with a set of Wikipedia entries that are semantically related to it, and (iii) Twitter-LDA, a topic modeling approach that extends LDA by considering some of the intrinsic properties of Twitter data. For the ranking problem, we rely on the insight that the priority of a topic depends on the sentiment expressed in the subjective tweets that refer to it. Although none of the proposed systems outperforms the official baseline on average, our systems obtain reasonably high precision results (i.e. high Reliability scores). The average sentiment of a topic seems to be a useful indicator of priority that merits further study. Finally, topics with a high ratio of unrelated tweets are difficult to group correctly, suggesting the need for an explicit treatment of ambiguity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The enormous popularity of Social Media on the Web, such as blogs, forums, or real-time social networking services, offers a place for sharing information as it happens and for connecting with others in real time, often spreading a wealth of the latest news about real-world events and topics dominating social discussions. This phenomenon has generated the opportunity, and the necessity, of managing the online reputation of entities such as companies, brands and public figures. Online Reputation Management consists of monitoring and handling the opinion of Web users (also referred to as electronic word of mouth, eWOM) on people, companies or products [7].</p>
      <p>This research was partially supported by the Spanish Ministry of Education (FPU grant nr. AP2009-0507), the Spanish Ministry of Science and Innovation (Holopedia Project, TIN2010-21128-C02), the Regional Government of Madrid and the ESF under MA2VICMR (S2009/TIC-1542), and the European Community's FP7 Programme under grant agreement nr. 288024 (LiMoSINe).</p>
      <p>Online reputation managers spend considerable effort on continuously monitoring social streams such as Twitter1 in order to identify early the topics that may alter (either negatively or positively) the reputation of an entity of interest. The RepLab 2012 Monitoring Task [3] directly tackles this problem. Systems receive a stream of tweets containing the name of an entity, and their goal is to (i) cluster the most recent tweets into topics, and (ii) assign relative priorities to the clusters.2</p>
      <p>In this paper, we present the results obtained by the systems proposed by UNED for its participation in the RepLab 2012 Monitoring Task. We tested three different approaches to deal with the clustering problem: (i) an agglomerative clustering based on term co-occurrences, (ii) a clustering method that considers "wikified" tweets, where each tweet is represented with a set of Wikipedia entries that are semantically related to it, and (iii) Twitter-LDA, a topic modeling approach that extends LDA by considering some of the intrinsic properties of Twitter data. For the problem of assigning priority to a topic, we rely on the insight that the priority of a topic depends on the sentiment expressed in the subjective tweets that refer to it.</p>
      <p>This paper is organized as follows. Section 2 describes the proposed systems. Section 3 gives details about the experiments and the obtained results. Finally, conclusions are presented in Section 4.</p>
      <sec id="sec-2-1">
        <title>Proposed Systems</title>
        <p>We tested three different approaches to tackle the clustering problem in the monitoring task: (i) a two-step algorithm based on agglomerative clustering, which first groups terms by considering pairs of co-occurring terms in the tweets and then assigns tweets to the identified term clusters, (ii) an agglomerative clustering of "wikified" tweets, where each tweet is represented with a set of Wikipedia entries that are semantically related to it, and (iii) Twitter-LDA, a topic modeling approach that extends LDA by considering some of the intrinsic properties of Twitter data. We also tested a method that relies on the polarity of tweets to deal with the priority problem.</p>
        <sec id="sec-2-1-1">
          <title>Agglomerative Clustering Based on Term Co-occurrences</title>
          <p>Let us assume that each topic discussed about an entity can be represented with a set of terms that allows the expert to understand what the topic is about. Considering this, we define a two-step algorithm that tries to (i) identify the terminology of each topic by clustering the terms occurring in the input entity stream of tweets, and (ii) assign tweets to the identified clusters.</p>
          <p>In the first step, we use Hierarchical Agglomerative Clustering (HAC) to build the clustering of terms. Obviously, not all the terms occurring in the tweets that we want to group belong to the terminology of the topics. For instance, stopwords and terms that are common across topics are not representative of any of them. Since these terms are difficult to know a priori, we built a binary classifier that, given a pair of co-occurring terms, guesses whether both terms belong to the same cluster or not.</p>
          <p>1 http://twitter.com</p>
          <p>2 Please refer to the RepLab Monitoring Task overview paper [3] for detailed information about the task and the dataset.</p>
          <p>We use different families of features to represent the co-occurring pair. We consider both the "labeled collection" and the "background collection" to compute the features. Besides the content of the tweets, we also use metadata such as the creation date and the author. Finally, we apply regular expressions to extract named users (e.g. @user), hashtags (e.g. #apple) and URLs (e.g. http://www.google.com). Short URLs have been translated to long URLs using the conversion tables provided by the organizers. We define the following sets of features:
- Term features: features that describe each of the terms of the co-occurrence pair. These are: term occurrence, normalized frequency, TF.IDF and KL-Divergence (considering term frequency as the frequency of the term in a pseudo-document built from entity-specific tweets, as in [12]). These features were computed in two ways: (i) considering only tweets in the labeled corpus, and (ii) considering tweets in both the labeled and background corpus. Features based on the metadata of the tweets where each term occurs are: Shannon entropy of named users, URLs, hashtags and authors in the tweets where the term occurs.
- Content-based pair features: features that consider both terms of the co-occurrence pair, such as the Levenshtein distance between the terms, the normalized frequency of co-occurrences, and the Jaccard similarity between the occurrences of each of the terms.
- Metadata-based pair features: Jaccard similarity and Shannon entropy of named users, URLs, hashtags and authors in the tweets where both terms co-occur.
- Time-aware features: features based on the creation date of the tweets where the terms co-occur. The features computed are median, minimum, maximum, mean, standard deviation, Shannon entropy and Jaccard similarity, considering four different time granularities: milliseconds, minutes, hours and days.</p>
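          <p>As a concrete illustration of how a few of these features can be computed, the sketch below derives the Jaccard similarity of the two terms' occurrence sets, the normalized co-occurrence frequency, and the Shannon entropy of authors for a term pair over a toy stream. The tweet data and helper names are hypothetical, not the actual RepLab pipeline.</p>
```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0

def entropy(values):
    """Shannon entropy (in bits) of a list of metadata values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical mini-stream: each tweet is (set of terms, author).
tweets = [
    ({"apple", "ios", "rumor"}, "u1"),
    ({"apple", "ios", "ipad"}, "u2"),
    ({"apple", "store", "madrid"}, "u3"),
]

def pair_features(t1, t2, tweets):
    """A few of the pair features: occurrence-set Jaccard similarity,
    normalized co-occurrence frequency, author entropy over the
    co-occurrence tweets."""
    occ1 = {i for i, (terms, _) in enumerate(tweets) if t1 in terms}
    occ2 = {i for i, (terms, _) in enumerate(tweets) if t2 in terms}
    co = occ1.intersection(occ2)
    return {
        "jaccard_occurrences": jaccard(occ1, occ2),
        "co_freq": len(co) / len(tweets),
        "author_entropy": entropy([tweets[i][1] for i in co]) if co else 0.0,
    }

print(pair_features("apple", "ios", tweets))
```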
          <p>In our classification model, each instance corresponds to a pair of co-occurring terms ⟨t, t′⟩ in the entity stream of tweets. In order to learn the model, we extract training instances from the trial dataset, considering the following labeling function:</p>
          <p>label(⟨t, t′⟩) = clean if max_j Precision(C_{t∩t′}, L_j) > 0.9; noisy in any other case</p>
          <p>where C_{t∩t′} is the set of tweets in which terms t and t′ co-occur, L is the set of topics in the gold standard, and</p>
          <p>Precision(C_i, L_j) = |C_i ∩ L_j| / |C_i|</p>
          <p>Thus, term pairs that co-occur in tweets of which at least 90% were annotated with the same topic in the gold standard are considered clean pairs. Pairs whose precision is below this threshold are labeled as noisy pairs.</p>
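          <p>The labeling rule above can be sketched as follows, on toy tweet ids and gold topics (all names are hypothetical):</p>
```python
def precision(cluster, topic):
    """Fraction of the cluster's tweets that fall in the given gold topic."""
    if not cluster:
        return 0.0
    return len(cluster.intersection(topic)) / len(cluster)

def label_pair(cooccur_tweets, gold_topics, threshold=0.9):
    """cooccur_tweets: ids of tweets where both terms co-occur.
    gold_topics: list of sets of tweet ids, one set per gold topic."""
    best = max(precision(cooccur_tweets, t) for t in gold_topics)
    return "clean" if best > threshold else "noisy"

gold = [{1, 2, 3}, {4, 5}]
print(label_pair({1, 2}, gold))  # both tweets share one gold topic -> clean
print(label_pair({3, 4}, gold))  # tweets split across topics -> noisy
```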
          <p>After training a binary classifier, we use the confidence of belonging to the "clean" class to build a similarity matrix between terms. Hierarchical Agglomerative Clustering is then applied to cluster the terms, using the previously built similarity matrix. After building the agglomerative clustering, a cut-off threshold is used to return the final term clustering solution.</p>
          <p>The second step of this algorithm consists of assigning tweets to the identified term clusters. In our experiments, this is carried out following a straightforward majority voting strategy: for each tweet, the final assigned cluster is the one that maximizes the number of terms in the tweet assigned to it.</p>
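          <p>The majority voting assignment can be sketched as follows, assuming a hypothetical term-to-cluster mapping produced by the HAC step:</p>
```python
from collections import Counter

def assign_tweet(tweet_terms, term_clusters):
    """term_clusters: dict mapping term -> cluster id (hypothetical output
    of the term clustering step). Returns the cluster receiving the most
    votes from the tweet's terms, or None if no term is clustered."""
    votes = Counter(term_clusters[t] for t in tweet_terms if t in term_clusters)
    return votes.most_common(1)[0][0] if votes else None

clusters = {"ios": 0, "ipad": 0, "store": 1, "madrid": 1}
print(assign_tweet({"apple", "ios", "ipad"}, clusters))  # -> 0
```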
        </sec>
        <sec id="sec-2-1-2">
          <title>Clustering Wikified Tweets</title>
          <p>The second system we tested relies on the hypothesis that tweets sharing concepts defined in a knowledge base, such as Wikipedia, are more likely to belong to the same cluster than tweets with no or fewer concepts in common. In this approach, each tweet is linked to a set of Wikipedia entries that semantically represent the concepts related to it. We use the COMMONNESS probability presented in [10] to identify the concepts relevant to a given tweet. It is based on the intra-Wikipedia hyperlinks, and computes the probability of a concept c being the target of a link with anchor text q in Wikipedia:</p>
          <p>COMMONNESS(c, q) = |L_{q,c}| / Σ_{c′} |L_{q,c′}|</p>
          <p>where L_{q,c} is the set of links with anchor text q whose target is c.</p>
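          <p>A minimal sketch of the COMMONNESS computation over a toy anchor-link table (the counts are invented for illustration; the real statistics come from Wikipedia's link graph):</p>
```python
from collections import Counter

# Toy anchor-text statistics: link_counts[q][c] = number of Wikipedia
# links with anchor text q whose target is concept c (hypothetical values).
link_counts = {
    "apple": Counter({"Apple Inc.": 8, "Apple (fruit)": 2}),
}

def commonness(concept, anchor):
    """P(concept | anchor) estimated from link counts."""
    counts = link_counts.get(anchor)
    if not counts:
        return 0.0
    return counts[concept] / sum(counts.values())

print(commonness("Apple Inc.", "apple"))  # -> 0.8
```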
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Wikified Tweet Examples</title>
      <p>Table 1. Examples of tweets represented with the Wikipedia entries identified using the COMMONNESS probability.</p>
      <p>Tweet: "Les presento el nuevo producto de la marca Apple...El iMeestabajando. pic.twitter.com/JPdR5Oct". Wikified representation: Brand, Product (business), Apple Inc.</p>
      <p>Tweet: "Apple ya ha comenzado con iOS 6 en el iPad 3 !!!! http://goo.gl/fb/aY0cO #rumor #ios6 #ipad #ios #ipad3g #apple". Wikified representation: IOS, Rumor, Apple Inc., IPad</p>
      <p>Tweet: "Server logs show Apple testing iPads with iOS 6, possible Retina Displays http://bit.ly/ysaFUA (via @appleinsider)". Wikified representation: Software testing, Retina, IOS, Display device, Apple Inc., IPad</p>
      <sec id="sec-7-1">
        <title>Identifying Trivial Clusters</title>
        <p>Retweets and automatically generated tweets (produced by clicking "share" buttons in news or blog posts, generated by third-party services like Foursquare, etc.) are frequent in the trial data.</p>
        <p>Moreover, tweets sharing a high percentage of words are very likely to belong to the same cluster. In both the co-occurrence-based and the commonness-based systems, tweets with a term overlap higher than 70% are grouped a priori. These tweets are then removed from the input, except for one representative tweet per trivial cluster. After running the system, we merge its output with the a priori trivial clustering: each trivial cluster is joined with the cluster in the system output that contains its representative tweet.</p>
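        <p>A toy sketch of the a priori grouping step (here using Jaccard overlap between term sets and a single greedy pass; the exact overlap measure and grouping procedure are simplifications of what the paper describes):</p>
```python
def term_overlap(a, b):
    """Jaccard overlap ratio between two term sets."""
    return len(a.intersection(b)) / len(a.union(b))

def trivial_clusters(tweets, threshold=0.7):
    """Greedily group tweets whose term overlap with a cluster's
    representative exceeds the threshold."""
    reps = []  # list of (representative term set, member indices)
    for i, terms in enumerate(tweets):
        for rep, members in reps:
            if term_overlap(terms, rep) > threshold:
                members.append(i)
                break
        else:
            reps.append((terms, [i]))
    return [members for _, members in reps]

tweets = [{"a", "b", "c", "d"}, {"a", "b", "c", "d"}, {"x", "y"}]
print(trivial_clusters(tweets))  # -> [[0, 1], [2]]
```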
      </sec>
      <sec id="sec-7-2">
        <title>Twitter-LDA Approach</title>
        <p>Twitter-LDA is a variant of LDA proposed by Zhao et al. [13] that is adapted to the characteristics of Twitter: tweets are short (140-character limit) and a single tweet tends to be about a single topic. Like Latent Dirichlet Allocation [5], it is an unsupervised machine learning technique that discovers the latent topics distributed across the documents of a given corpus.</p>
        <p>The model is based on the following assumptions. There is a set of topics T in Twitter, each represented by a word distribution. Each user has her topic interests modeled by a distribution over the topics. When a user wants to write a tweet, she first chooses a topic based on her topic distribution, and then chooses a bag of words one by one based on the chosen topic. However, not all words in a tweet are closely related to the topic of that tweet; some are background words commonly used in tweets on different topics. Therefore, for each word in a tweet, the user first decides whether it is a background word or a topic word and then chooses the word from the respective word distribution.</p>
        <p>The generation process of tweets is described in Figure 1, where φ_t denotes the word distribution for topic t, φ_B the word distribution for background words, θ_u the topic distribution of user u, and π a Bernoulli distribution that governs the choice between background words and topic words.</p>
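        <p>The generative story can be sketched as follows. The topic, background and user distributions below are invented toy values (in the real model they are learned via Gibbs sampling), and the names are hypothetical:</p>
```python
import random

random.seed(0)
topics = {0: ["ios", "ipad", "rumor"], 1: ["store", "madrid", "opening"]}
background = ["the", "a", "via", "rt"]
user_theta = {"u1": [0.7, 0.3]}  # per-user topic distribution (theta_u)
pi_topic = 0.8                   # Bernoulli: prob. a word is a topic word

def generate_tweet(user, length=5):
    # 1. the user picks a single topic for the whole tweet
    topic = random.choices([0, 1], weights=user_theta[user])[0]
    words = []
    for _ in range(length):
        # 2. per word: topic word (prob. pi_topic) or background word
        if random.random() > 1 - pi_topic:
            words.append(random.choice(topics[topic]))
        else:
            words.append(random.choice(background))
    return topic, words

print(generate_tweet("u1"))
```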
        <p>Because each test case has few tweets to be annotated, we consider two background sets. The first is the background of the entity, consisting of 5000 tweets that refer to the entity, which provides additional information for clustering tweets that share the same topic. The second set, consisting of 15000 tweets of a different entity, allows the model to differentiate between topics that do not refer to the entity.</p>
      </sec>
      <sec id="sec-7-3">
        <title>Sentiment-based Priority Approach</title>
        <p>For the priority of each topic we use a tweet-level sentiment analysis classifier [1]. The main idea of this method is to extract the WordNet concepts in a sentence that carry an emotional meaning, assign them an emotion within a set of categories from an affective lexicon, and use this information as the input to a machine learning algorithm. The strengths of this approach, in contrast to simpler strategies, are: (1) the use of WordNet and a word sense disambiguation algorithm, which allows the system to work with concepts rather than terms, (2) the use of emotions instead of terms as classification attributes, and (3) the processing of negations and intensifiers to invert, increase or decrease the intensity of the expressed emotions.</p>
        <p>Given the polarity of each tweet, we estimate the priority of a topic as follows. Let Ti be a topic and NTi the number of tweets in Ti. We define three functions, Pos(Ti), Neg(Ti) and Neu(Ti), as the number of positive, negative and neutral tweets of that topic, respectively. The priority of a topic is then defined as:</p>
        <p>Priority(Ti) =
3 if Neg(Ti) > Pos(Ti) and Neg(Ti) ≥ Neu(Ti);
2 if Pos(Ti) ≥ Neg(Ti) and Pos(Ti) ≥ Neu(Ti);
2 if Pos(Ti) + Neg(Ti) ≥ Neu(Ti);
1 if Neu(Ti) = NTi;
1 if Neu(Ti) > Pos(Ti) + Neg(Ti);
0 in any other case
(cases are evaluated in order, and the first matching case applies).</p>
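        <p>A sketch of this priority mapping in Python, reading the cases in order so that the first matching condition applies (the counts below are toy values):</p>
```python
def priority(pos, neg, neu):
    """Topic priority from counts of positive, negative and neutral
    tweets; ordered cases, first match wins."""
    n = pos + neg + neu
    if neg > pos and neg >= neu:
        return 3   # predominantly negative topic: highest priority
    if pos >= neg and pos >= neu:
        return 2   # predominantly positive topic
    if pos + neg >= neu:
        return 2   # polar tweets dominate the neutral ones
    if neu == n:
        return 1   # all tweets neutral
    if neu > pos + neg:
        return 1   # mostly neutral topic
    return 0

print(priority(pos=1, neg=5, neu=2))  # mostly negative -> 3
print(priority(pos=0, neg=0, neu=4))  # all neutral -> 1
```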
        <sec id="sec-7-3-1">
          <title>Experiments and Results</title>
          <p>In this section we describe the parameters used in each of the submitted systems and the obtained results. We report the scores obtained for the official metrics used to evaluate the monitoring task: Reliability &amp; Sensitivity [4]3.</p>
          <p>We submitted three runs in total:
- wikified tweets clustering: this run combines the wikified tweets clustering approach described in Section 2.2 with the trivial clustering identification method described in Section 2.3. This system corresponds to the replab2012 monitoring UNED 1 run.
- co-occurrence clustering: this run combines the agglomerative clustering based on term co-occurrences described in Section 2.1 with the trivial clustering identification method described in Section 2.3. This system corresponds to the replab2012 monitoring UNED 2 run.
- Twitter-LDA: this run uses Twitter-LDA, described in Section 2.4, to identify the clusters and uses the sentiment-based priority approach described in Section 2.5 to rank the clusters. This system corresponds to the replab2012 monitoring UNED 3 run.</p>
          <p>In all runs, tweets were lowercased and tokenized using a Twitter tokenizer [6], and punctuation was removed.</p>
          <p>
            The second run uses a Naïve Bayes classifier to learn the clean/noisy pair-term classifier. We experimented with several machine learning methods using RapidMiner [11]: Multilayer Perceptron with Backpropagation (Neural Net), C4.5 and CART Decision Trees, Linear Support Vector Machines (SVM), and Naïve Bayes. We used a "leave-one-entity-out" strategy to evaluate the performance of the models on the trial data. In each fold, all the term pairs related to one entity are used as test data, and all the term pairs related to the other entities are used as training data. This process is repeated 6 times (as many as entities in the trial corpus) and AUC is computed to evaluate the classifiers. Naïve Bayes significantly outperforms the other tested models, obtaining AUC values above 0.8 for all trial entities except one, Alcatel-Lucent
            <xref ref-type="bibr" rid="ref3 ref4 ref9">(entity id RL2012E02)</xref>
            .
3 In the context of clustering tasks, Reliability &amp; Sensitivity are equivalent to BCubed Precision and BCubed Recall, respectively [2].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Experiments and Results (continued)</title>
      <p>The Hierarchical Agglomerative Clustering was performed using average linkage (i.e. considering the mean similarity between the elements of each cluster) and the S-Space package implementation [8]. The cut-off threshold of the HAC was empirically set to 0.9999 after running some experiments over the trial data.</p>
      <p>We ran Twitter-LDA with 500 iterations of Gibbs sampling. After trying a few different numbers of topics, we empirically set the number of topics to 100. We set α to 50.0/|T|, β to a smaller value of 0.01, and γ to 20, as suggested in [13].</p>
      <p>We also tried the standard LDA model (i.e. treating each tweet as a single document) and found that the Twitter-LDA model was better. In addition, Twitter-LDA is much more convenient for computing tweet-level statistics (e.g. the number of co-occurrences of two words in a specific topic) than standard LDA, because Twitter-LDA assumes a single topic assignment for an entire tweet.</p>
      <p>The official baseline consists of an agglomerative clustering that uses single linkage over Jaccard word distances. Different stopping thresholds were used; here we report the results obtained with 0%, 50% and 100% as stopping thresholds. For priority relations, the baseline assigns all non-singleton clusters to the same level, and singleton clusters are assigned to a secondary level.</p>
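      <p>A minimal sketch of such a single-linkage agglomerative clustering with a stopping threshold on the Jaccard word distance (a simplification for illustration, not the official baseline implementation):</p>
```python
def jaccard_distance(a, b):
    """Jaccard word distance between two tweets given as term sets."""
    return 1 - len(a.intersection(b)) / len(a.union(b))

def single_linkage(tweets, stop):
    """Repeatedly merge the two closest clusters (minimum pairwise
    distance = single linkage) until the closest pair is farther than
    the stopping threshold."""
    clusters = [[i] for i in range(len(tweets))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(jaccard_distance(tweets[a], tweets[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or best[0] > d:
                    best = (d, i, j)
        if best[0] > stop:
            break
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(single_linkage([{"a", "b"}, {"a", "b"}, {"x"}], stop=0.5))
```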
      <p>Table 2 shows the results of the baseline and the proposed systems when
considering clustering relationships.</p>
      <p>With regard to F-Measure, baseline 0% obtains the highest score. The baseline with a 0% stopping threshold assigns all the tweets to a single cluster, corresponding to the so-called all-in-one system. This system reaches perfect recall, and its precision is relatively high on entities with few topics in the set of tweets. More precisely, this system achieves a Reliability score above 0.95 in five of the 24 test cases (slightly more than 20%).</p>
      <p>Although the clustering based on co-occurrences outperforms Twitter-LDA in R and S, the latter obtains a 0.01 higher F-1 score, suggesting that Twitter-LDA is more R/S-balanced across test cases than the co-occurrence clustering.</p>
      <p>
        Remarkably, in some test cases where most of the tweets are not related to the entity of interest, such as Indra
        <xref ref-type="bibr" rid="ref3 ref4 ref9">(RL2012E12)</xref>
        , ING
        <xref ref-type="bibr" rid="ref3 ref4 ref9">(RL2012E15)</xref>
        or BP
        <xref ref-type="bibr" rid="ref3 ref4 ref9">(RL2012E27)</xref>
        , all of the proposed systems obtain F-1 scores below 0.25. This suggests that an explicit treatment of ambiguity is needed, at least when the entity's name may refer to multiple entities or concepts (e.g. acronyms).
      </p>
      <p>Table 3 shows the results obtained by the proposed systems considering only
priority relationships.</p>
      <p>Note that only the run that uses Twitter-LDA incorporates the sentiment-based priority approach. The runs using co-occurrence clustering and wikified tweets clustering return all clusters with the same priority. These systems are considered non-informative by the evaluation measures used, obtaining the minimum score in both R and S. Since baseline 0% groups all tweets into one cluster, no singleton clusters are returned, and R and S are also 0. The baseline using a stopping threshold of 50% obtains the highest scores for all the reported metrics. However, the sentiment-based priority approach obtains competitive results, suggesting that the overall sentiment of the topic is a helpful variable for assigning relative priorities.</p>
      <p>Finally, Table 3 shows the performance of the proposed systems in the
RepLab monitoring task, considering both clustering and priority relationships.</p>
      <p>Considering priority relationships significantly drops Sensitivity scores. Note that in the case of baseline 0%, S decreases from 1 to 0.43. The co-occurrence clustering and the wikified tweets clustering go below 0.1. As regards Twitter-LDA, Reliability and Sensitivity remain relatively close to the scores achieved by the baselines.</p>
      <sec id="sec-8-1">
        <title>Discussion and Conclusions</title>
        <p>In this paper we have described the systems used in the runs submitted by UNED to the Monitoring Task of the RepLab 2012 evaluation campaign. Here, systems receive a stream of tweets containing the name of an entity, and their goal is to (i) cluster the most recent tweets into topics, and (ii) assign relative priorities to the clusters. We tested different clustering approaches, and a sentiment-based algorithm to predict the priority of the identified topics.</p>
        <p>Results show the high difficulty of the monitoring task. In the case of the clustering problem, simple models such as the agglomerative clustering baseline are difficult to outperform with more elaborate systems. However, our proposed systems achieve reasonably high BCubed precision scores, suggesting that more information is needed in the representation of the tweets in order to solve joining gaps. With regard to the priority problem, the sentiment expressed in the tweets of a same cluster seems to be a useful indicator of the topic priority. However, there is still much room for improvement in this direction.</p>
        <p>As future work we intend to take advantage of Twitter metadata to add new variables to the LDA model. As regards co-occurrence clustering, we plan to include distributional semantics in the co-occurrence similarity features. Finally, future work also includes incorporating a company name disambiguation component into our systems.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>References (continued)</title>
      <p>10. Meij, E., Weerkamp, W., de Rijke, M.: Adding semantics to microblog posts. In: Proceedings of the fifth ACM international conference on Web search and data mining (2012)</p>
      <p>11. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: Yale: Rapid prototyping for complex data mining tasks. In: KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 935-940 (2006)</p>
      <p>12. Spina, D., Meij, E., de Rijke, M., Oghina, A., Bui, M., Breuss, M.: Identifying entity aspects in microblog posts. In: SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (2012)</p>
      <p>13. Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.P., Yan, H., Li, X.: Comparing twitter and traditional media using topic models. In: Proceedings of the 33rd European conference on Advances in information retrieval. pp. 338-349. ECIR'11, Springer-Verlag, Berlin, Heidelberg (2011)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Carrillo de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chugur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Using an emotion-based model and sentiment analysis techniques to classify polarity for reputation</article-title>
          .
          <source>In: Proceedings of the 3rd Conference and Labs of the Evaluation Forum</source>
          . (To appear) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A comparison of extrinsic clustering evaluation metrics based on formal constraints</article-title>
          .
          <source>Information Retrieval</source>
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <fpage>461</fpage>
          -
          <lpage>486</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corujo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meij</surname>
          </string-name>
          , E., de Rijke, M.: Overview of RepLab 2012:
          <article-title>Evaluating Online Reputation Management Systems</article-title>
          .
          <source>In: CLEF 2012 Labs and Workshop Notebook</source>
          Papers (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Reliability and Sensitivity: Generic Evaluation Measures for Document Organization Tasks</article-title>
          .
          <source>Tech. rep.</source>
          ,
          <source>UNED</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>3</volume>
          ,
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (Mar
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krieger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>Tweetmotif: Exploratory search and topic summarization for twitter</article-title>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sobel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Twitter power: Tweets as electronic word of mouth</article-title>
          .
          <source>Journal of the American society for information science and technology 60(11)</source>
          ,
          <fpage>2169</fpage>
          -
          <lpage>2188</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jurgens</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The s-space package: an open source package for word space models</article-title>
          .
          <source>In: Proceedings of the ACL 2010 System Demonstrations</source>
          . pp.
          <fpage>30</fpage>
          -
          <lpage>35</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Meij</surname>
          </string-name>
          , E.:
          <source>LiMoSINe Deliverable 4.1: Initial Semantic Mining Module</source>
          .
          <source>Tech. rep., University of Amsterdam</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>