<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UNED at RepLab 2012: Monitoring Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tamara Martín</string-name>
          <email>tmartin@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damiano Spina</string-name>
          <email>damiano@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrique Amigó</string-name>
          <email>enrique@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julio Gonzalo</string-name>
          <email>julio@lsi.uned.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>UNED NLP &amp; IR Group Juan del Rosal</institution>
          ,
          <addr-line>16 28040 Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the UNED participation in the RepLab 2012 Monitoring Task. Given an entity and a tweet stream containing the entity's name, the task consists of grouping the tweets into topics and then ranking the identified topics by priority. We tested three different systems to deal with the clustering problem: (i) an agglomerative clustering based on term co-occurrences, (ii) a clustering method that considers "wikified" tweets, where each tweet is represented with a set of Wikipedia entries that are semantically related to it, and (iii) Twitter-LDA, a topic modeling approach that extends LDA by considering some of the intrinsic properties of Twitter data. For the ranking problem, we rely on the insight that the priority of a topic depends on the sentiment expressed in the subjective tweets that refer to it. Although none of the proposed systems outperforms the official baseline on average, our systems obtain reasonably high precision results (i.e. high Reliability scores). The average sentiment of a topic seems to be a useful indicator of priority that merits further study. Finally, topics with a high ratio of unrelated tweets are difficult to group correctly, suggesting the need for an explicit treatment of ambiguity.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The enormous popularity of Social Media on the Web, such as blogs, forums, or real-time social networking services, offers a place for sharing information as it happens and for connecting with others in real time, often spreading a wealth of the latest news about real-world events and topics dominating social discussions. This phenomenon has generated the opportunity, and the necessity, of managing the online reputation of entities such as companies, brands and public figures. Online Reputation Management consists of monitoring and handling the opinion of Web users (also referred to as electronic word of mouth, eWOM) on people, companies or products [7].</p>
      <p>This research was partially supported by the Spanish Ministry of Education (FPU grant nr. AP2009-0507), the Spanish Ministry of Science and Innovation (Holopedia Project, TIN2010-21128-C02), the Regional Government of Madrid and the ESF under MA2VICMR (S2009/TIC-1542), and the European Community's FP7 Programme under grant agreement nr. 288024 (LiMoSINe).</p>
      <p>Online reputation managers spend considerable effort on continuously monitoring social streams such as Twitter1 in order to identify early the topics that may alter (either negatively or positively) the reputation of an entity of interest. The RepLab 2012 Monitoring Task [3] directly tackles this problem. Systems receive a stream of tweets containing the name of an entity, and their goal is to (i) cluster the most recent tweets into topics, and (ii) assign relative priorities to the clusters.2</p>
      <p>In this paper, we present the results obtained by the systems proposed by UNED for its participation in the RepLab 2012 Monitoring Task. We tested three different approaches to deal with the clustering problem: (i) an agglomerative clustering based on term co-occurrences, (ii) a clustering method that considers "wikified" tweets, where each tweet is represented with a set of Wikipedia entries that are semantically related to it, and (iii) Twitter-LDA, a topic modeling approach that extends LDA by considering some of the intrinsic properties of Twitter data. For the problem of assigning priority to a topic, we rely on the insight that the priority of a topic depends on the sentiment expressed in the subjective tweets that refer to it.</p>
      <p>This paper is organized as follows. Section 2 describes the proposed systems. Section 3 gives details about the experiments and the obtained results. Finally, conclusions are presented in Section 4.</p>
      <sec id="sec-2-1">
        <title>Proposed Systems</title>
        <p>We tested three different approaches to tackle the clustering problem in the monitoring task: (i) a two-step algorithm based on agglomerative clustering, which first groups terms by considering pairs of co-occurring terms in the tweets and then assigns tweets to the identified term clusters, (ii) an agglomerative clustering of "wikified" tweets, where each tweet is represented with a set of Wikipedia entries that are semantically related to it, and (iii) Twitter-LDA, a topic modeling approach that extends LDA by considering some of the intrinsic properties of Twitter data. We also tested a method that relies on the polarity of tweets to deal with the priority problem.</p>
        <sec id="sec-2-1-1">
          <title>Agglomerative Clustering Based on Term Co-occurrences</title>
          <p>Let us assume that each topic discussed about an entity can be represented with a set of terms that allows the expert to understand what the topic is about. Considering this, we define a two-step algorithm that tries to (i) identify the terminology of each topic by clustering the terms occurring in the input entity stream of tweets, and (ii) assign tweets to the identified clusters.</p>
          <p>In the first step, we use Hierarchical Agglomerative Clustering (HAC) to build the clustering of terms. Obviously, not all the terms occurring in the tweets that we want to group belong to the terminology of the topics. For instance, stopwords and terms that are common across topics are not representative of any of them. Since these terms are difficult to know a priori, we built a binary classifier that, given a pair of co-occurring terms, guesses whether both terms belong to the same cluster or not.</p>
          <p>1 http://twitter.com</p>
          <p>2 Please refer to the RepLab Monitoring Task overview paper [3] for detailed information about the task and the dataset.</p>
          <p>We use different families of features to represent the co-occurring pair. We consider both the "labeled collection" and the "background collection" to compute the features. Besides the content of the tweets, we also use metadata such as the creation date and the author. Finally, we apply regular expressions to extract named users (e.g. @user), hashtags (e.g. #apple) and URLs (e.g. http://www.google.com). Short URLs have been translated to long URLs using the conversion tables provided by the organizers. We define the following sets of features:
- Term features: features that describe each of the terms of the co-occurrence pair. These are: term occurrence, normalized frequency, TF.IDF and KL-Divergence (considering term frequency as the frequency of the term in a pseudo-document built from entity-specific tweets, as in [12]). These features were computed in two ways: (i) considering only tweets in the labeled corpus, and (ii) considering tweets in both the labeled and background corpus. Features based on the metadata of the tweets where each term occurs are: Shannon entropy of named users, URLs, hashtags and authors in the tweets where the term occurs.
- Content-based pair features: features that consider both terms of the co-occurrence pair, such as the Levenshtein distance between the terms, the normalized frequency of co-occurrences, and the Jaccard similarity between the occurrences of each of the terms.
- Metadata-based pair features: Jaccard similarity and Shannon entropy of named users, URLs, hashtags and authors in the tweets where both terms co-occur.
- Time-aware features: features based on the creation date of the tweets where the terms co-occur. The features computed are median, minimum, maximum, mean, standard deviation, Shannon entropy and Jaccard similarity, considering four different time granularities: milliseconds, minutes, hours and days.</p>
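          <p>As a concrete illustration of how a few of these features can be computed, the sketch below derives the Jaccard similarity of the two terms' occurrence sets, the normalized co-occurrence frequency, and the Shannon entropy of authors for a term pair over a toy stream. The tweet data and helper names are hypothetical, not the actual RepLab pipeline.</p>
```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0

def entropy(values):
    """Shannon entropy (in bits) of a list of metadata values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical mini-stream: each tweet is (set of terms, author).
tweets = [
    ({"apple", "ios", "rumor"}, "u1"),
    ({"apple", "ios", "ipad"}, "u2"),
    ({"apple", "store", "madrid"}, "u3"),
]

def pair_features(t1, t2, tweets):
    """A few of the pair features: occurrence-set Jaccard similarity,
    normalized co-occurrence frequency, author entropy over the
    co-occurrence tweets."""
    occ1 = {i for i, (terms, _) in enumerate(tweets) if t1 in terms}
    occ2 = {i for i, (terms, _) in enumerate(tweets) if t2 in terms}
    co = occ1.intersection(occ2)
    return {
        "jaccard_occurrences": jaccard(occ1, occ2),
        "co_freq": len(co) / len(tweets),
        "author_entropy": entropy([tweets[i][1] for i in co]) if co else 0.0,
    }

print(pair_features("apple", "ios", tweets))
```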
          <p>In our classification model, each instance corresponds to a pair of co-occurring terms ⟨t, t′⟩ in the entity stream of tweets. In order to learn the model, we extract training instances from the trial dataset, considering the following labeling function:</p>
          <p>label(⟨t, t′⟩) = clean if max_j Precision(C_{t∩t′}, L_j) > 0.9; noisy in any other case</p>
          <p>where C_{t∩t′} is the set of tweets in which terms t and t′ co-occur, L is the set of topics in the gold standard, and</p>
          <p>Precision(C_i, L_j) = |C_i ∩ L_j| / |C_i|</p>
          <p>Thus, term pairs that co-occur in tweets of which at least 90% were annotated with the same topic in the gold standard are considered clean pairs. Pairs whose precision is below this threshold are labeled as noisy pairs.</p>
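          <p>The labeling rule above can be sketched as follows, on toy tweet ids and gold topics (all names are hypothetical):</p>
```python
def precision(cluster, topic):
    """Fraction of the cluster's tweets that fall in the given gold topic."""
    if not cluster:
        return 0.0
    return len(cluster.intersection(topic)) / len(cluster)

def label_pair(cooccur_tweets, gold_topics, threshold=0.9):
    """cooccur_tweets: ids of tweets where both terms co-occur.
    gold_topics: list of sets of tweet ids, one set per gold topic."""
    best = max(precision(cooccur_tweets, t) for t in gold_topics)
    return "clean" if best > threshold else "noisy"

gold = [{1, 2, 3}, {4, 5}]
print(label_pair({1, 2}, gold))  # both tweets share one gold topic -> clean
print(label_pair({3, 4}, gold))  # tweets split across topics -> noisy
```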
          <p>After training a binary classifier, we use the confidence of belonging to the "clean" class to build a similarity matrix between terms. Hierarchical Agglomerative Clustering is then applied to cluster the terms, using the previously built similarity matrix. After building the agglomerative clustering, a cut-off threshold is used to return the final term clustering solution.</p>
          <p>The second step of this algorithm consists of assigning tweets to the identified term clusters. In our experiments, this is carried out following a straightforward majority voting strategy: for each tweet, the final assigned cluster is the one that maximizes the number of terms in the tweet assigned to it.</p>
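          <p>The majority voting assignment can be sketched as follows, assuming a hypothetical term-to-cluster mapping produced by the HAC step:</p>
```python
from collections import Counter

def assign_tweet(tweet_terms, term_clusters):
    """term_clusters: dict mapping term -> cluster id (hypothetical output
    of the term clustering step). Returns the cluster receiving the most
    votes from the tweet's terms, or None if no term is clustered."""
    votes = Counter(term_clusters[t] for t in tweet_terms if t in term_clusters)
    return votes.most_common(1)[0][0] if votes else None

clusters = {"ios": 0, "ipad": 0, "store": 1, "madrid": 1}
print(assign_tweet({"apple", "ios", "ipad"}, clusters))  # -> 0
```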
        </sec>
        <sec id="sec-2-1-2">
          <title>Clustering Wikified Tweets</title>
          <p>The second system we tested relies on the hypothesis that tweets sharing concepts defined in a knowledge base, such as Wikipedia, are more likely to belong to the same cluster than tweets with no or fewer concepts in common. In this approach, each tweet is linked to a set of Wikipedia entries that semantically represent the concepts related to it. We use the COMMONNESS probability presented in [10] to identify the concepts relevant to a given tweet. It is based on the intra-Wikipedia hyperlinks, and computes the probability of a concept c being the target of a link with anchor text q in Wikipedia:</p>
          <p>COMMONNESS(c, q) = |L_{q,c}| / Σ_{c′} |L_{q,c′}|</p>
          <p>where L_{q,c} is the set of links with anchor text q whose target is c.</p>
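          <p>A minimal sketch of the COMMONNESS computation over a toy anchor-link table (the counts are invented for illustration; the real statistics come from Wikipedia's link graph):</p>
```python
from collections import Counter

# Toy anchor-text statistics: link_counts[q][c] = number of Wikipedia
# links with anchor text q whose target is concept c (hypothetical values).
link_counts = {
    "apple": Counter({"Apple Inc.": 8, "Apple (fruit)": 2}),
}

def commonness(concept, anchor):
    """P(concept | anchor) estimated from link counts."""
    counts = link_counts.get(anchor)
    if not counts:
        return 0.0
    return counts[concept] / sum(counts.values())

print(commonness("Apple Inc.", "apple"))  # -> 0.8
```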
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Wikified Tweet Examples</title>
      <p>Table 1. Examples of tweets represented with the Wikipedia entries identified using the COMMONNESS probability.</p>
      <p>Tweet: "Les presento el nuevo producto de la marca Apple...El iMeestabajando. pic.twitter.com/JPdR5Oct". Wikified representation: Brand, Product (business), Apple Inc.</p>
      <p>Tweet: "Apple ya ha comenzado con iOS 6 en el iPad 3 !!!! http://goo.gl/fb/aY0cO #rumor #ios6 #ipad #ios #ipad3g #apple". Wikified representation: IOS, Rumor, Apple Inc., IPad</p>
      <p>Tweet: "Server logs show Apple testing iPads with iOS 6, possible Retina Displays http://bit.ly/ysaFUA (via @appleinsider)". Wikified representation: Software testing, Retina, IOS, Display device, Apple Inc., IPad</p>
      <sec id="sec-7-1">
        <title>Identifying Trivial Clusters</title>
        <p>Retweets and automatically generated tweets (produced by clicking "share" buttons in news or blog posts, generated by third-party services like Foursquare, etc.) are frequent in the trial data.</p>
        <p>Moreover, tweets sharing a high percentage of words are very likely to belong to the same cluster. In both the co-occurrence-based and the commonness-based systems, tweets with a term overlap higher than 70% are grouped a priori. These tweets are then removed from the input, except for one representative tweet per trivial cluster. After running the system, we merge its output with the a priori trivial clustering: each trivial cluster is joined with the cluster in the system output that contains its representative tweet.</p>
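        <p>A toy sketch of the a priori grouping step (here using Jaccard overlap between term sets and a single greedy pass; the exact overlap measure and grouping procedure are simplifications of what the paper describes):</p>
```python
def term_overlap(a, b):
    """Jaccard overlap ratio between two term sets."""
    return len(a.intersection(b)) / len(a.union(b))

def trivial_clusters(tweets, threshold=0.7):
    """Greedily group tweets whose term overlap with a cluster's
    representative exceeds the threshold."""
    reps = []  # list of (representative term set, member indices)
    for i, terms in enumerate(tweets):
        for rep, members in reps:
            if term_overlap(terms, rep) > threshold:
                members.append(i)
                break
        else:
            reps.append((terms, [i]))
    return [members for _, members in reps]

tweets = [{"a", "b", "c", "d"}, {"a", "b", "c", "d"}, {"x", "y"}]
print(trivial_clusters(tweets))  # -> [[0, 1], [2]]
```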
      </sec>
      <sec id="sec-7-2">
        <title>Twitter-LDA Approach</title>
        <p>Twitter-LDA is a variant of LDA proposed by Zhao et al. [13] that is adapted to the characteristics of Twitter: tweets are short (140-character limit) and a single tweet tends to be about a single topic. Like Latent Dirichlet Allocation [5], it is an unsupervised machine learning technique that discovers the latent topics distributed across the documents of a given corpus.</p>
        <p>The model is based on the following assumptions. There is a set of topics T in Twitter, each represented by a word distribution. Each user has her topic interests modeled by a distribution over the topics. When a user wants to write a tweet, she first chooses a topic based on her topic distribution, and then chooses a bag of words one by one based on the chosen topic. However, not all words in a tweet are closely related to the topic of that tweet; some are background words commonly used in tweets on different topics. Therefore, for each word in a tweet, the user first decides whether it is a background word or a topic word and then chooses the word from the respective word distribution.</p>
        <p>The generation process of tweets is described in Figure 1, where φ_t denotes the word distribution for topic t, φ_B the word distribution for background words, θ_u the topic distribution of user u, and π a Bernoulli distribution that governs the choice between background words and topic words.</p>
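        <p>The generative story can be sketched as follows. The topic, background and user distributions below are invented toy values (in the real model they are learned via Gibbs sampling), and the names are hypothetical:</p>
```python
import random

random.seed(0)
topics = {0: ["ios", "ipad", "rumor"], 1: ["store", "madrid", "opening"]}
background = ["the", "a", "via", "rt"]
user_theta = {"u1": [0.7, 0.3]}  # per-user topic distribution (theta_u)
pi_topic = 0.8                   # Bernoulli: prob. a word is a topic word

def generate_tweet(user, length=5):
    # 1. the user picks a single topic for the whole tweet
    topic = random.choices([0, 1], weights=user_theta[user])[0]
    words = []
    for _ in range(length):
        # 2. per word: topic word (prob. pi_topic) or background word
        if random.random() > 1 - pi_topic:
            words.append(random.choice(topics[topic]))
        else:
            words.append(random.choice(background))
    return topic, words

print(generate_tweet("u1"))
```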
        <p>Because each test case has few tweets to be annotated, we consider two background sets. The first is the background of the entity, consisting of 5000 tweets that refer to the entity, which provides additional information for clustering tweets that share the same topic. The second set, consisting of 15000 tweets of a different entity, allows the model to differentiate between topics that do not refer to the entity.</p>
      </sec>
      <sec id="sec-7-3">
        <title>Sentiment-based Priority Approach</title>
        <p>For the priority of each topic we use a tweet-level sentiment analysis classifier [1]. The main idea of this method is to extract the WordNet concepts in a sentence that carry an emotional meaning, assign them an emotion within a set of categories from an affective lexicon, and use this information as the input to a machine learning algorithm. The strengths of this approach, in contrast to simpler strategies, are: (1) the use of WordNet and a word sense disambiguation algorithm, which allows the system to work with concepts rather than terms, (2) the use of emotions instead of terms as classification attributes, and (3) the processing of negations and intensifiers to invert, increase or decrease the intensity of the expressed emotions.</p>
        <p>Given the polarity of each tweet, we estimate the priority of a topic as follows. Let Ti be a topic and NTi the number of tweets in Ti. We define three functions, Pos(Ti), Neg(Ti) and Neu(Ti), as the number of positive, negative and neutral tweets of that topic, respectively. The priority of a topic is then defined as:</p>
        <p>Priority(Ti) =
3 if Neg(Ti) > Pos(Ti) and Neg(Ti) ≥ Neu(Ti);
2 if Pos(Ti) ≥ Neg(Ti) and Pos(Ti) ≥ Neu(Ti);
2 if Pos(Ti) + Neg(Ti) ≥ Neu(Ti);
1 if Neu(Ti) = NTi;
1 if Neu(Ti) > Pos(Ti) + Neg(Ti);
0 in any other case
(cases are evaluated in order, and the first matching case applies).</p>
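        <p>A sketch of this priority mapping in Python, reading the cases in order so that the first matching condition applies (the counts below are toy values):</p>
```python
def priority(pos, neg, neu):
    """Topic priority from counts of positive, negative and neutral
    tweets; ordered cases, first match wins."""
    n = pos + neg + neu
    if neg > pos and neg >= neu:
        return 3   # predominantly negative topic: highest priority
    if pos >= neg and pos >= neu:
        return 2   # predominantly positive topic
    if pos + neg >= neu:
        return 2   # polar tweets dominate the neutral ones
    if neu == n:
        return 1   # all tweets neutral
    if neu > pos + neg:
        return 1   # mostly neutral topic
    return 0

print(priority(pos=1, neg=5, neu=2))  # mostly negative -> 3
print(priority(pos=0, neg=0, neu=4))  # all neutral -> 1
```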
        <sec id="sec-7-3-1">
          <title>Experiments and Results</title>
          <p>In this section we describe the parameters used in each of the submitted systems and the obtained results. We report the scores obtained for the official metrics used to evaluate the monitoring task: Reliability &amp; Sensitivity [4]3.</p>
          <p>We submitted three runs in total:
- wikified tweets clustering: this run combines the wikified tweets clustering approach described in Section 2.2 with the trivial clustering identification method described in Section 2.3. This system corresponds to the replab2012 monitoring UNED 1 run.
- co-occurrence clustering: this run combines the agglomerative clustering based on term co-occurrences described in Section 2.1 with the trivial clustering identification method described in Section 2.3. This system corresponds to the replab2012 monitoring UNED 2 run.
- Twitter-LDA: this run uses Twitter-LDA, described in Section 2.4, to identify the clusters and uses the sentiment-based priority approach described in Section 2.5 to rank the clusters. This system corresponds to the replab2012 monitoring UNED 3 run.</p>
          <p>In all runs, tweets were lowercased and tokenized using a Twitter tokenizer [6], and punctuation was removed.</p>
          <p>
            The second run uses a Naïve Bayes classifier to learn the clean/noisy pair-term classifier. We experimented with several machine learning methods using RapidMiner [11]: Multilayer Perceptron with Backpropagation (Neural Net), C4.5 and CART Decision Trees, Linear Support Vector Machines (SVM), and Naïve Bayes. We used a "leave-one-entity-out" strategy to evaluate the performance of the models on the trial data. In each fold, all the term pairs related to one entity are used as test data, and all the term pairs related to the other entities are used as training data. This process is repeated 6 times (as many as entities in the trial corpus) and AUC is computed to evaluate the classifiers. Naïve Bayes significantly outperforms the other tested models, obtaining AUC values above 0.8 for all trial entities except one, Alcatel-Lucent
            <xref ref-type="bibr" rid="ref3 ref4 ref9">(entity id RL2012E02)</xref>
            .
3 In the context of clustering tasks, Reliability &amp; Sensitivity are equivalent to BCubed Precision and BCubed Recall, respectively [2].
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Experiments and Results (continued)</title>
      <p>The Hierarchical Agglomerative Clustering was performed using average linkage (i.e. considering the mean similarity between the elements of each cluster) and the S-Space package implementation [8]. The cut-off threshold of the HAC was empirically set to 0.9999 after running some experiments over the trial data.</p>
      <p>We ran Twitter-LDA with 500 iterations of Gibbs sampling. After trying a few different numbers of topics, we empirically set the number of topics to 100. We set α to 50.0/|T|, β to a smaller value of 0.01, and γ to 20, as suggested in [13].</p>
      <p>We also tried the standard LDA model (i.e. treating each tweet as a single document) and found that the Twitter-LDA model was better. In addition, Twitter-LDA is much more convenient for computing tweet-level statistics (e.g. the number of co-occurrences of two words in a specific topic) than standard LDA, because Twitter-LDA assumes a single topic assignment for an entire tweet.</p>
      <p>The official baseline consists of an agglomerative clustering that uses single linkage over Jaccard word distances. Different stopping thresholds were used; here we report the results obtained with 0%, 50% and 100% as stopping thresholds. For priority relations, the baseline assigns all non-singleton clusters to the same level, and singleton clusters are assigned to a secondary level.</p>
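      <p>A minimal sketch of such a single-linkage agglomerative clustering with a stopping threshold on the Jaccard word distance (a simplification for illustration, not the official baseline implementation):</p>
```python
def jaccard_distance(a, b):
    """Jaccard word distance between two tweets given as term sets."""
    return 1 - len(a.intersection(b)) / len(a.union(b))

def single_linkage(tweets, stop):
    """Repeatedly merge the two closest clusters (minimum pairwise
    distance = single linkage) until the closest pair is farther than
    the stopping threshold."""
    clusters = [[i] for i in range(len(tweets))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(jaccard_distance(tweets[a], tweets[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or best[0] > d:
                    best = (d, i, j)
        if best[0] > stop:
            break
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(single_linkage([{"a", "b"}, {"a", "b"}, {"x"}], stop=0.5))
```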
      <p>Table 2 shows the results of the baseline and the proposed systems when
considering clustering relationships.</p>
      <p>With regard to F-Measure, baseline 0% obtains the highest score. The baseline with a 0% stopping threshold assigns all the tweets to a single cluster, corresponding to the so-called all-in-one system. This system reaches perfect recall, and its precision is relatively high on entities with few topics in the set of tweets. More precisely, this system achieves a Reliability score above 0.95 in five of the 24 test cases (slightly more than 20%).</p>
      <p>Although the clustering based on co-occurrences outperforms Twitter-LDA in R and S, the latter obtains a 0.01 higher F-1 score, suggesting that Twitter-LDA is more R/S-balanced across test cases than the co-occurrence clustering.</p>
      <p>
        Remarkably, in some test cases where most of the tweets are not related to the entity of interest, such as Indra
        <xref ref-type="bibr" rid="ref3 ref4 ref9">(RL2012E12)</xref>
        , ING
        <xref ref-type="bibr" rid="ref3 ref4 ref9">(RL2012E15)</xref>
        or BP
        <xref ref-type="bibr" rid="ref3 ref4 ref9">(RL2012E27)</xref>
        , all of the proposed systems obtain F-1 scores below 0.25. This suggests that an explicit treatment of ambiguity is needed, at least when the entity's name may refer to multiple entities or concepts (e.g. acronyms).
      </p>
      <p>Table 3 shows the results obtained by the proposed systems considering only
priority relationships.</p>
      <p>Note that only the run that uses Twitter-LDA incorporates the sentiment-based priority approach. The runs using co-occurrence clustering and wikified tweets clustering return all clusters with the same priority. These systems are considered non-informative by the evaluation measures used, obtaining the minimum score in both R and S. Since baseline 0% groups all tweets into one cluster, no singleton clusters are returned, and R and S are also 0. The baseline using a stopping threshold of 50% obtains the highest scores for all the reported metrics. However, the sentiment-based priority approach obtains competitive results, suggesting that the overall sentiment of the topic is a helpful variable for assigning relative priorities.</p>
      <p>Finally, Table 3 shows the performance of the proposed systems in the
RepLab monitoring task, considering both clustering and priority relationships.</p>
      <p>Considering priority relationships significantly drops Sensitivity scores. Note that in the case of baseline 0%, S decreases from 1 to 0.43. The co-occurrence clustering and the wikified tweets clustering go below 0.1. As regards Twitter-LDA, Reliability and Sensitivity remain relatively close to the scores achieved by the baselines.</p>
      <sec id="sec-8-1">
        <title>Discussion and Conclusions</title>
        <p>In this paper we have described the systems used in the runs submitted by UNED to the Monitoring Task of the RepLab 2012 evaluation campaign. Here, systems receive a stream of tweets containing the name of an entity, and their goal is to (i) cluster the most recent tweets into topics, and (ii) assign relative priorities to the clusters. We tested different clustering approaches, and a sentiment-based algorithm to predict the priority of the identified topics.</p>
        <p>Results show the high difficulty of the monitoring task. In the case of the clustering problem, simple models such as the agglomerative clustering baseline are difficult to outperform with more elaborate systems. However, our proposed systems achieve reasonably high BCubed precision scores, suggesting that more information is needed in the representation of the tweets in order to solve joining gaps. With regard to the priority problem, the sentiment expressed in the tweets of a same cluster seems to be a useful indicator of the topic priority. However, there is still much room for improvement in this direction.</p>
        <p>As future work we intend to take advantage of Twitter metadata to add new variables to the LDA model. As regards co-occurrence clustering, we plan to include distributional semantics in the co-occurrence similarity features. Finally, future work also includes incorporating a company name disambiguation component into our systems.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>References (continued)</title>
      <p>10. Meij, E., Weerkamp, W., de Rijke, M.: Adding semantics to microblog posts. In: Proceedings of the fifth ACM international conference on Web search and data mining (2012)</p>
      <p>11. Mierswa, I., Wurst, M., Klinkenberg, R., Scholz, M., Euler, T.: Yale: Rapid prototyping for complex data mining tasks. In: KDD '06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 935-940 (2006)</p>
      <p>12. Spina, D., Meij, E., de Rijke, M., Oghina, A., Bui, M., Breuss, M.: Identifying entity aspects in microblog posts. In: SIGIR '12: Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (2012)</p>
      <p>13. Zhao, W.X., Jiang, J., Weng, J., He, J., Lim, E.P., Yan, H., Li, X.: Comparing twitter and traditional media using topic models. In: Proceedings of the 33rd European conference on Advances in information retrieval. pp. 338-349. ECIR'11, Springer-Verlag, Berlin, Heidelberg (2011)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Carrillo de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chugur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Using an emotion-based model and sentiment analysis techniques to classify polarity for reputation</article-title>
          .
          <source>In: Proceedings of the 3rd Conference and Labs of the Evaluation Forum</source>
          . (To appear) (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Artiles</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A comparison of extrinsic clustering evaluation metrics based on formal constraints</article-title>
          .
          <source>Information Retrieval</source>
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <fpage>461</fpage>
          -
          <lpage>486</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corujo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meij</surname>
          </string-name>
          , E., de Rijke, M.: Overview of RepLab 2012:
          <article-title>Evaluating Online Reputation Management Systems</article-title>
          .
          <source>In: CLEF 2012 Labs and Workshop Notebook</source>
          Papers (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Amigo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verdejo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Reliability and Sensitivity: Generic Evaluation Measures for Document Organization Tasks</article-title>
          .
          <source>Tech. rep.</source>
          ,
          <source>UNED</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>J. Mach. Learn. Res</source>
          .
          <volume>3</volume>
          ,
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (Mar
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krieger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahn</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , et al.:
          <article-title>Tweetmotif: Exploratory search and topic summarization for twitter</article-title>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Jansen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sobel</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chowdury</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Twitter power: Tweets as electronic word of mouth</article-title>
          .
          <source>Journal of the American society for information science and technology 60(11)</source>
          ,
          <fpage>2169</fpage>
          -
          <lpage>2188</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Jurgens</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>The s-space package: an open source package for word space models</article-title>
          .
          <source>In: Proceedings of the ACL 2010 System Demonstrations</source>
          . pp.
          <fpage>30</fpage>
          -
          <lpage>35</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Meij</surname>
          </string-name>
          , E.:
          <source>LiMoSINe Deliverable 4.1: Initial Semantic Mining Module</source>
          .
          <source>Tech. rep., University of Amsterdam</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>