Towards an Evaluation Framework for Topic Extraction Systems for Online Reputation Management*

Enrique Amigó1, Damiano Spina1, Bernardino Beotas2, and Julio Gonzalo1

1 Departamento de Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia, C/Juan de Rosal, 16, 28020 Madrid, Spain
{enrique,damiano,julio}@lsi.uned.es
2 Grupo ALMA, C/Valentín Beato, 23, 28037 Madrid, Spain
b.beotas@almatech.es

Abstract. This work presents a novel evaluation framework for topic extraction over user-generated content. The motivation of this work is the development of systems that monitor the evolution of opinionated topics around a certain entity (a person, company or product) on the Web. Currently, due to the effort that would be required to develop a gold standard, topic extraction systems are evaluated qualitatively over case studies, or by means of intrinsic evaluation metrics that cannot be applied across heterogeneous systems. We propose evaluation metrics based on available document metadata (link structure and time stamps) which do not require manual annotation of the test corpus. Our preliminary experiments show that these metrics are sensitive to the number of iterations in LDA-based topic extraction algorithms, which is an indication of the consistency of the metrics.

* This work has been partially supported by Alma Technologies and the Spanish Government (projects Webopinion and Text-Mess/Ines).

1 Introduction

The growing interest in monitoring opinions on the Web 2.0 is well known. Online Reputation Management consists of monitoring the opinion of Web users on people, companies or products, and it is already a fundamental tool in corporate communication. A particularly relevant problem is to detect new topics or opinion trends which deserve the attention of communication experts, such as a burst of tweets or blog entries about a controversial issue involving a company, or about a defect in a product. A system that assists a communication expert should be able to detect (particularly new) topics, tag them in an interpretable way, cluster documents related to each topic, and analyze the evolution of topics over time. What makes this a distinctive problem is the fact that documents are naturally multi-topic: the relevance of a document to a topic may even be sub-sentential. This problem is sometimes referred to as Temporal Text Mining [1,2].

Models and systems to solve these tasks are starting to appear in scholarly publications. But a major bottleneck so far is the absence of a benchmarking test suite to evaluate and compare systems. Creating such a gold standard is, in fact, complex: defining the set of topics in a document stream is a subtle task, because topics tend to co-occur in documents and the appropriate level of granularity for topic and sub-topic distinctions is fuzzy. For similar reasons it is also difficult, once the set of topics is established, to decide which documents talk about each of the topics and how central each document is to each of the topics it discusses.

In the absence of a gold standard, extrinsic precision/recall-based metrics cannot be applied. For this reason, current systems are evaluated informally via use cases, or otherwise using intrinsic evaluation measures which are specific to the model being tested.

There are, however, basic restrictions on how a good system should behave.
For instance, documents which share outlinks to the same web pages should tend to be more related than documents which do not share outlinks. This type of information has not yet been used by current topic detection systems, because it relates only a small subset of the documents to each other. This information, however, might be used as a (limited) evaluation or validation mechanism to optimize system parameters. In this paper we address the task of defining an evaluation methodology based on this idea, and check its suitability on an LDA-based approach to topic detection over time.

2 State of the art

We will start with an overview of models to solve the task, and then we will summarize the evaluation methodologies used so far and discuss their limitations.

2.1 Topic detection models

The most basic approaches to topic monitoring focus on word frequency. The assumption is that frequent words indicate, in general, salient topics in a document collection. Some available web services are Blogpulse Trends (www.blogpulse.com/trends), Mood Views (ilps.science.uva.nl/MoodViews) and Blogscope (www.blogscope.net). Brooks and Montanez showed that frequent words (extracted according to tf.idf) produce tags that generate document clusters with more cohesion than user tags in blogs [3]. Gruhl included the temporal dimension in his model by extracting topic terms with frequency peaks over time [4]. Chi also considered the distribution of terms across blogs [5]. He assumed that topics gain prominence in subsets of blogs; his model consists of computing the singular values of the time-blog frequency matrix. Mei et al. combine spatial information with the temporal dimension [6]. Their model employs the EM algorithm to identify the topic distributions over time and location that maximize the likelihood of word occurrences. An interesting feature of this model is that it assumes that several topics can appear in the same document.

Many novel proposals are currently based on the LDA (Latent Dirichlet Allocation) model [7]. Like Mei's approach, LDA is a probabilistic model: it estimates the distribution θ_d of topics for each document d and the distribution of words for each topic. The particularity of LDA is that the distribution parameters are generated by a Dirichlet distribution whose hyperparameters are stated a priori.

One example of these models is TOT (Topics Over Time) [8]. The most characteristic aspect of this work is that a temporal variable is added to the LDA model, assuming that topics follow a Beta distribution over time. One drawback of this work is that the whole document collection must be reprocessed to infer the temporal distributions when new documents appear in the input stream. The Dynamic Topic Model [9] tries to solve this by estimating topic distributions for each time slot independently; after this, the model employs time series techniques in order to analyze topic evolution. Another model that tackles this issue is On-line LDA [10]. This model states that the knowledge produced over a time slot represents the a priori knowledge for the next time slot, which makes it possible to process new documents without reprocessing the whole collection. However, an additional mechanism to detect new topics over time becomes necessary. Another interesting model based on LDA is called Multiscale Topic Tomography [11]. In this approach the topic distribution includes different granularity levels.
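To make the shared machinery of these LDA-based models concrete, the following is a minimal sketch of fitting a plain LDA model and reading off the per-document topic distribution P(θ|d). The use of the gensim library and the toy corpus are our own assumptions for illustration; none of the systems above is implemented this way.

from gensim import corpora, models

# Toy corpus: each document is a list of tokens (stop words already removed).
docs = [
    ["bank", "earnings", "quarter", "profit"],
    ["league", "match", "goal", "season"],
    ["bank", "profit", "shares", "market"],
    ["match", "season", "league", "final"],
]

dictionary = corpora.Dictionary(docs)            # token <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]   # bag-of-words vectors

# alpha and eta play the role of the Dirichlet hyperparameters
# (alpha and beta in the notation of Section 4).
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      alpha=1.0, eta=0.1, passes=50, random_state=0)

# P(theta|d): the topic distribution estimated for the first document.
print(lda.get_document_topics(corpus[0], minimum_probability=0.0))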
2.2 Evaluation approaches

The main bottleneck in this area of research is the absence of a common evaluation methodology to compare approaches. Let us summarize the approaches to evaluation in the research described above.

In terms of efficiency and suitability for assisting experts in the online reputation management task, some approaches are better suited than others. For instance, the On-line LDA and Dynamic Topic Model approaches are able to process new documents without reprocessing the collection. Multiscale Topic Tomography, on the other hand, allows topic visualization at different granularity levels. However, it is still necessary to define an evaluation framework to compare approaches in terms of accuracy.

Some approaches are simply evaluated over case studies. This is the case of Mei's approach [6] and the Dynamic Topic Model [9]. Gruhl's model [4], on the other hand, is evaluated against human-annotated topic terms; an evaluation method that cannot, for instance, be applied to LDA-based models.

The Topics Over Time model [8] is evaluated with intrinsic clustering metrics based on the KL-divergence between topics, but this methodology is only appropriate for comparing similar systems. For instance, in their evaluation the authors obtained evidence about the advantages of including the temporal variable in the model. It is not possible, however, to evaluate heterogeneous systems with intrinsic clustering metrics: systems based on KL-divergence, for instance, would be rewarded by this evaluation method. Something similar happens with the evaluation of Multiscale Topic Tomography [11], where the perplexity of the model is compared against the perplexity obtained with other models; in this case, LDA-based models could not be compared with models based on traditional clustering algorithms. Other proposed evaluation metrics focus on extrinsic tasks using topic descriptors, such as multi-document summarization [2].

With these limitations in mind, our goal is to define and apply an automatic evaluation framework enabling comparison between heterogeneous, arbitrary systems, and which does not depend on cost-intensive manual annotation of data.

3 Evaluation methodology

3.1 System prerequisites

We start from a few prerequisites for topic detection systems in opinion mining:

– Aggregation: The system must detect a finite number of topics. Documents will be associated with zero, one or more topics in a discrete or continuous way. The key point is that related documents should share at least one topic.
– Temporality: In order to analyze the evolution of the reputation of a given entity, the system must reflect differences in topic distribution across time. This implies showing the intensity of topics across time slots.
– Interpretability: Identified topics should be tagged in a way that is interpretable for the user.
– Accessibility: For each topic, the corresponding documents must be ranked according to their relevance in the context of the topic.

In this work we focus on the first two functionalities, aggregation and temporality. The interesting aspect of these two features is that it is possible to automatically generate a benchmark for evaluation purposes.

3.2 System output variables

The aggregation functionality requires inferring to what extent each document is related to each topic. This output can be formalized as P(θ|d), which represents the distribution θ of topics in each document d.
For instance, a traditional discrete clustering algorithm would return P(θ_i|d) = 1 if the document d belongs to the cluster associated with the topic θ_i, and P(θ_i|d) = 0 otherwise.

Analogously, the temporality functionality requires an output variable P(θ|t) representing the distribution of topics in each time slot t. From the perspective of evaluation, a key aspect is that all functionalities must be mutually consistent. In particular, the intensity of topics (temporality) has to correspond with the number of associated documents in the time slot (aggregation). Therefore, temporality can be inferred from the output P(θ|d). Assuming that the intensity of a topic is proportional to the number of related documents in the time slot, we can state that:

P(θ|t) = Σ_{d∈t} P(θ|d)

3.3 Evaluation measures

Our evaluation methodology is based on two assumptions about the desired behavior of systems:

– Documents with outlinks that point to the same page, and documents produced by the same author, will tend to be more topically related than average.
– It is easier to find highly related documents in the same time slot (say, blog posts in the same week) than separated by long time periods (such as several months).

As most current systems do not rely on this kind of information, it is possible to use it at least for parameter optimization cycles. Some systems do employ temporal information, and therefore the second restriction is not totally system-independent. In such cases, however, the improvement obtained by the use of temporal information can still be measured in terms of the first restriction.

The first step to evaluate systems according to these two assumptions consists of defining when the system considers that two documents are topically related. This is not straightforward, given that systems generate a distribution of weighted topics for each document. We will assume that one topic is enough to consider that two documents are related, but only if both documents focus on this topic. Accordingly, we define the connectedness of a document pair as:

Connectedness(d1, d2) = max_i min(P(θ_i|d1), P(θ_i|d2))

Our evaluation metrics will compare the connectedness of document pairs in two sets defined according to the assumptions introduced above. We will call these sets RDP (Related Document Pairs) and NRDP (Non-Related Document Pairs). RDP consists of document pairs with, for instance, one or more common outlinks, while NRDP consists of document pairs without common outlinks. According to our assumptions, document pairs in RDP should have, on average, a higher connectedness than document pairs in NRDP. In order to avoid dependencies on scale properties of the distribution P(θ|d) produced by each system, we formulate the evaluation metrics in a non-parametric way, estimating, for each system s:

metric_value(s) = P(Connectedness(d_r, d_r') > Connectedness(d_n, d_n'))

where (d_r, d_r') ∈ RDP and (d_n, d_n') ∈ NRDP. In other words, the quality of the system is measured as the probability that two documents from the RDP set have a higher topic overlap (according to the system) than two documents from NRDP. Different criteria to form RDP and NRDP lead to different evaluation metrics; a sketch of the computation follows, and we then discuss some concrete criteria.
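As an illustration, here is a minimal sketch of the quantities defined above: the temporality output P(θ|t) derived from P(θ|d), the connectedness of a document pair, and the non-parametric metric. The dictionary-based representation of P(θ|d), the function names, and the toy data are our own assumptions.

from itertools import product

def topic_intensity(theta, slot_docs):
    """P(theta|t): sum of P(theta|d) over the documents d of time slot t."""
    intensity = {}
    for d in slot_docs:
        for topic, p in theta[d].items():
            intensity[topic] = intensity.get(topic, 0.0) + p
    return intensity

def connectedness(theta_d1, theta_d2):
    """Connectedness(d1, d2) = max_i min(P(theta_i|d1), P(theta_i|d2))."""
    shared = set(theta_d1) & set(theta_d2)
    return max((min(theta_d1[t], theta_d2[t]) for t in shared), default=0.0)

def metric_value(theta, rdp, nrdp):
    """Probability that a related pair is more connected than a non-related pair."""
    wins = total = 0
    for (r1, r2), (n1, n2) in product(rdp, nrdp):
        wins += connectedness(theta[r1], theta[r2]) > connectedness(theta[n1], theta[n2])
        total += 1
    return wins / total

# Toy system output P(theta|d): four documents, two topics.
theta = {"d1": {0: 0.9, 1: 0.1}, "d2": {0: 0.8, 1: 0.2},
         "d3": {0: 0.1, 1: 0.9}, "d4": {0: 0.5, 1: 0.5}}
print(metric_value(theta, rdp=[("d1", "d2")], nrdp=[("d1", "d3"), ("d2", "d4")]))  # 1.0

Note that ties in connectedness count against the system in this sketch; a refined estimator could, for instance, count them as half a win.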
Outlink aggregation. In order to obtain the set of related document pairs (RDP), we assume that two documents are more likely to be related if they share an outlink to the same web page, provided this outlink does not appear in any other document (this restriction eliminates frequent outlinks which are unrelated to the document content, such as links to Facebook).

Author aggregation. As for documents related by a common author, we simply take pairs of documents with the same author for RDP and pairs from different authors for NRDP.

Temporality. As for temporality, we assume that it is easier to find documents sharing a topic when both documents belong to the same time slot. In particular, we build RDP with the 100 most related document pairs (according to the system output) which were created in the same week. NRDP is formed by the 100 most related document pairs which were created at least three months apart.

4 Test Case: Iterations in an LDA-based system

To test our evaluation methodology, we have implemented the LDA approach, starting from the algorithm described in [8] and eliminating the temporal variable component, which will be tested in future work. LDA is a generative process where each document d is associated with a multinomial distribution of topics, and Dirichlet distributions, governed by fixed hyperparameters, are used as priors. The model assumes that each document token is associated with a single topic, and therefore the topic distribution in a document is given by the individual token assignments. The article by Wang and McCallum describes the approach in detail, as well as the derivation of the Gibbs sampling. The implemented algorithm consists of the following steps (a minimal code sketch is given after the list):

1. Random initialization of each token to one of the k topics.
2. For each token (an occurrence of a word w in a document d), the topic assignment is updated by sampling from the probability P(z) of each topic z:

P(z) ∝ (m_{d,z} + α) · (n_{z,w} + β) / Σ_{v∈V} (n_{z,v} + β)

where m_{d,z} represents the number of tokens in document d assigned to topic z; n_{z,w} represents the number of occurrences of word w assigned to topic z; and V is the vocabulary. α and β are two hyperparameters that reflect, respectively, the topic dispersion per document and per word.
3. m_{d,z} and n_{z,w} are updated and we return to step 2, for as many iterations as desired.

Implementations known to us use fixed hyperparameters for every word in the vocabulary and for every document. In future work, however, with an automatic evaluation mechanism available, we could test whether α should depend on document length.
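As an illustration of the steps above, here is a minimal sketch of a collapsed Gibbs sampler along these lines; the data layout and function names are our own assumptions, and details may differ from the implementation actually used in the experiments.

import random
from collections import defaultdict

def gibbs_lda(docs, k, alpha=1.0, beta=0.1, iterations=100, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    docs: list of documents, each a list of word tokens.
    Returns topic assignments z and the count structures m and n.
    """
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})        # vocabulary size

    # Counters: m[d][z] = tokens of doc d in topic z; n[z][w] = word w in topic z.
    m = [defaultdict(int) for _ in docs]
    n = [defaultdict(int) for _ in range(k)]
    n_total = [0] * k                                # tokens assigned to each topic

    # Step 1: random initialization of each token to one of the k topics.
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            m[d][t] += 1; n[t][w] += 1; n_total[t] += 1

    # Steps 2-3: resample each token's topic, update counts, repeat.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                m[d][t] -= 1; n[t][w] -= 1; n_total[t] -= 1   # remove current token
                # P(z) proportional to (m_dz + alpha)(n_zw + beta) / sum_v(n_zv + beta)
                weights = [(m[d][t2] + alpha) * (n[t2][w] + beta)
                           / (n_total[t2] + V * beta) for t2 in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                m[d][t] += 1; n[t][w] += 1; n_total[t] += 1   # re-add with new topic
    return z, m, n

# P(theta_i|d) can then be read off as (m[d][i] + alpha) / (len(docs[d]) + k * alpha).

The counts of the token being resampled are removed before computing the sampling weights and restored afterwards, as step 3 requires.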
5 Hypothesis validation

In order to validate our assumptions, we have performed a small experiment involving manual validation of document pairs. From our testbed, we have generated 64 random tuples, each consisting of two document pairs: in one pair, both documents share at least one outlink that does not appear in any other document (see Section 3.3); in the other pair, they do not share any outlink. According to our hypothesis, document pairs which share outlinks should be more topically related, on average, than pairs that do not share outlinks. For each tuple, we have manually annotated which is the most topically related document pair (sometimes this is not obvious, and the tuple is then annotated as undecidable). For 50 tuples (78%), the document pair sharing the outlink was more topically related than the other; in 12 cases (19%) it was undecidable; and only in 2 cases (3%) the linked document pair was less topically related than the other.

An analogous process was conducted for the co-authorship criterion, comparing tweet pairs written by the same author with pairs written by different authors. The results over 97 tuples are similar to the previous ones: co-authored tweets are more related in 80% of the cases, while non-co-authored tweets are more related in only one case (1%).

These results suggest that our assumptions are reasonable for our testbed. This is, of course, just a preliminary result that must be validated with larger manual annotations over different testbeds. Note also that the experimental procedure must be refined, because "undecidable" cases (which are 20% of the assessed samples) might become decidable with a more precise, testbed-specific definition of relatedness.

6 Experiment and Evaluation Results

The goal of our experiment is to test the behavior of our evaluation metrics. As a dataset we use 5,000 tweets and 500 blog posts in Spanish containing the term BBVA (a Spanish bank operating in several countries). We have generated a vocabulary excluding stop words. In general, for all the configurations compared, the topics detected by the LDA system consist of (i) information about "Liga BBVA", the top Spanish football league, which is sponsored by the bank; (ii) economic information about the bank; (iii) information in languages other than Spanish (such as Catalan); and (iv) topics with infrequent terms. In general, the granularity of the topics is relatively low. Note that, unlike other related experiments, we focus on a single entity, while other approaches cover several totally independent topics.

Table 1 displays the results of aggregation (by outlinks) and temporality for LDA over 500 blog posts, fixing the values of the hyperparameters and varying the number of iterations. Note that aggregation goes from 0.41 up to 0.90, reaching a ceiling after 100 iterations. This number might be different when more documents are processed, or for different values of the hyperparameters. In any case, the results show a strong correlation between the number of iterations of the algorithm and the evaluation results; assuming that a higher number of iterations leads to better LDA results, our metric behaves consistently.

Table 1 also shows the temporality obtained for different numbers of iterations. This metric also increases with the number of iterations, although its behavior is less stable. A possible reason is that the assumption that documents which are closer in time should be more related on average is not as valid as the assumption behind aggregation. Another possible reason is that this measure is estimated on the 100 most related document pairs only, while aggregation is computed on a larger set of samples.

Iterations | Aggregation (by outlinks) | Temporality
         1 | 0.41                      | 0.42
         5 | 0.60                      | 0.35
        10 | 0.76                      | 0.60
        20 | 0.86                      | 0.57
        50 | 0.89                      | 0.60
       100 | 0.90                      | 0.61
       200 | 0.90                      | 0.61
       500 | 0.90                      | 0.65
      1000 | 0.91                      | 0.71
      2000 | 0.90                      | 0.72

Table 1. Evaluation results for 500 blog posts, 10 topics, α = 1, β = 0.1 and different numbers of iterations.

Table 2 shows the results on 5,000 tweets, this time measuring aggregation by author. Again, the metric values seem to stabilize around 100 iterations, and they show a clear correlation with the number of iterations.

Iterations | Aggregation (by author)
         1 | 0.47
         5 | 0.55
        10 | 0.61
        20 | 0.72
        50 | 0.76
       100 | 0.78
       200 | 0.79
       500 | 0.80

Table 2. Evaluation results for 5,000 tweets, 10 topics, α = 1 and β = 0.1.
Another variable that can be analyzed with our evaluation methodology is the effect of different values of the hyperparameter α. Table 3 shows that α does not have a strong effect on the results; in fact, the maximum seems to be around α = 1. This implies that, in general, documents tend to be centered around a single topic, perhaps due to the low granularity of the topics produced in our experiment.

α value | Aggregation (by outlinks) | Temporality
    0.1 | 0.89                      | 0.64
      1 | 0.90                      | 0.66
      5 | 0.90                      | 0.67
     10 | 0.89                      | 0.66
     20 | 0.88                      | 0.74
     50 | 0.84                      | 0.62

Table 3. Evaluation results for 500 blog posts, 2000 iterations, 10 topics, β = 0.1 and different α values.

Finally, we have studied the effect of the number of topics on the evaluation results. Is it possible that LDA, in this context, reaches a more adequate topic granularity when the number of topics is increased? Table 4 shows the results obtained for 500 blog posts, 2000 iterations and α = 1. Note that, although there is some positive effect when increasing the number of topics, it is not as clear as in the previous experiments.

Number of topics | Aggregation (by outlinks) | Temporality
               5 | 0.85                      | 0.67
              10 | 0.90                      | 0.73
              15 | 0.92                      | 0.74
              20 | 0.92                      | 0.68
              40 | 0.93                      | 0.71
              50 | 0.93                      | 0.70

Table 4. Evaluation results for 500 blog posts, 2000 iterations, α = 1, β = 0.1 and a variable number of topics.

7 Conclusions

In this work we have proposed an early version of an automatic evaluation methodology which permits the optimization of topic extraction models for online reputation management, using external information not employed by the systems themselves. In a preliminary experiment using blog entries and tweets about a bank, we have been able to observe quantitative effects such as the weak influence of the α hyperparameter on the final results, the number of iterations needed for LDA to reach stable results, or the effect of the number of topics.

Our evaluation methodology still has unresolved issues: we do not know yet to what extent the selection of the RDP and NRDP sets biases the results (they are, after all, just a small sample of the full test set, with very precise characteristics). We also need to revise the "temporality" measure to obtain more stable results in our experimental framework.

In any case, the methodology provides a way of testing hypotheses not yet evaluated quantitatively in other studies, such as the effect of including a temporal variable in the model, the possibility of processing time slots independently, the effects of structuring topics hierarchically, etc.

References

1. Hurst, M.: Temporal text mining. In: Proceedings of the AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs (2006)
2. Subasic, I., Berendt, B.: From bursty patterns to bursty facts: The effectiveness of temporal text mining for news
3. Brooks, C.H., Montanez, N.: Improved annotation of the blogosphere via autotagging and hierarchical clustering. In: WWW '06: Proceedings of the 15th International Conference on World Wide Web, New York, NY, USA, ACM Press (2006) 625–632
4. Gruhl, D., Guha, R., Liben-Nowell, D., Tomkins, A.: Information diffusion through blogspace. In: WWW '04: Proceedings of the 13th International Conference on World Wide Web, New York, NY, USA, ACM (2004) 491–501
5. Chi, Y., Tseng, B.L., Tatemura, J.: Eigen-trend: trend analysis in the blogosphere based on singular value decompositions.
In: CIKM '06: Proceedings of the 15th ACM International Conference on Information and Knowledge Management, New York, NY, USA, ACM Press (2006) 68–77
6. Mei, Q., Liu, C., Su, H., Zhai, C.: A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In: WWW '06: Proceedings of the 15th International Conference on World Wide Web, New York, NY, USA, ACM Press (2006) 533–542
7. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. Journal of Machine Learning Research 3 (2003) 993–1022
8. Wang, X., McCallum, A.: Topics over time: a non-Markov continuous-time model of topical trends. In: KDD '06: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, ACM (2006) 424–433
9. Blei, D., Lafferty, J.: Dynamic topic models. In: Proceedings of the 23rd International Conference on Machine Learning, New York, NY, USA, ACM (2006) 113–120
10. AlSumait, L., Barbara, D., Domeniconi, C.: On-line LDA: adaptive topic models for mining text streams with applications to topic detection and tracking. In: Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Washington, DC, USA, IEEE Computer Society (2008) 3–12
11. Nallapati, R.M., Ditmore, S., Lafferty, J.D., Ung, K.: Multiscale topic tomography. In: KDD '07: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, ACM (2007) 520–529