CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

Arjumand Younus1,2, Colm O'Riordan1, and Gabriella Pasi2

1 Computational Intelligence Research Group, National University of Ireland Galway, Ireland
2 Information Retrieval Lab, Informatics, Systems and Communication, University of Milan Bicocca, Milan, Italy
arjumand.younus@nuigalway.ie, colm.oriordan@nuigalway.ie, pasi@disco.unimib.it

Abstract. Twitter has become a gold mine for companies interested in their online reputation and in using the platform as a marketing tool. A significant research challenge related to this is disambiguating tweets with respect to company names: determining whether a particular tweet is relevant or irrelevant to a company is an important task that has not yet been satisfactorily solved. To address this issue, in this paper we propose a Wikipedia-based two-pass algorithm. Experimental evaluations demonstrate the effectiveness of the proposed approach.

1 Introduction

Twitter (http://twitter.com) is a microblogging site which, as of July 31, 2012, ranked 8th worldwide in total traffic according to Alexa (http://www.alexa.com/siteinfo/twitter.com). This huge popularity has turned Twitter into an effective marketing platform, with almost all major companies maintaining Twitter accounts. Moreover, Twitter users often express their opinions about companies via 140-character messages called tweets [8, 10]. Companies are highly interested in monitoring their online reputation; this, however, involves the significant challenge of disambiguating company names in text. The task becomes even more challenging in tweets due to heavy noise, short message length, and the lack of context for company name disambiguation [6, 7]. This paper describes our experience in dealing with some of these challenges in the context of the RepLab2012 filtering task, where we are given a set of companies and, for each company, a set of tweets, some relevant to the company and some irrelevant.
Our approach consists of a two-pass filtering algorithm that uses Wikipedia as an external knowledge resource in the first pass, and a concept term score propagation mechanism in the second pass. The first pass is precision-oriented: the aim is to keep noisy tweets to a minimum. The second pass enhances recall via a score propagation technique and by making use of other sources of evidence. Our technique shows high accuracy over the RepLab2012 dataset.

The rest of the paper is organized as follows: Section 2 presents a more detailed description of the considered problem. Section 3 describes the proposed methodology in detail. Section 4 presents the experimental evaluation of our method, and Section 5 summarizes related work. Finally, Section 6 concludes the paper.

2 Problem Statement

In this section we give a brief overview of the RepLab2012 filtering task. We were given company names and, for each company, a set of tweets obtained by issuing a query (i.e., the name of the company) to Twitter Search (http://twitter.com/search). Due to the noise on Twitter, the tweets obtained via the query may or may not be relevant to the company. Our task addresses the problem of distinguishing the tweets relevant to a company from the non-relevant ones.

3 Methodology

This section describes the proposed filtering method in detail. We first explain how we use Wikipedia as an external knowledge resource: we use only portions of a company's Wikipedia page, as only some portions are meaningful for filtering purposes. Next, we explain the two steps of our algorithm.

3.1 Wikipedia as External Knowledge Resource

As Meij et al. point out in [7], simple matching between tweets and Wikipedia texts would produce a significant amount of irrelevant and noisy concept terms. The authors further mention that such noise can be eliminated on either the Wikipedia or the textual side.
On the basis of the intuition that the Wikipedia page of a company contains significant information about the company in certain portions of the article (i.e., concept terms occur in specific portions of text), we perform this cleaning on the Wikipedia side as follows:

– we use the information within the Wikipedia infobox;
– we use the information within the Wikipedia category box;
– we parse the paragraphs of the Wikipedia text and apply POS tagging to them; after POS tagging, we extract unigrams, bigrams, and trigrams for the significant POS tags [1].

Finally, the concept terms extracted from the various portions of Wikipedia are split into single terms. These are then used for matching against the terms in the tweets for the filtering task.

3.2 First Pass

After collecting all the concept terms extracted from Wikipedia, we order them by their specificity in the Wikipedia article, thereby assigning a score to each term. We then check for the occurrence of these concept terms in the tweets; the number of occurrences of each term is multiplied by that term's score, and the products are summed to constitute a score for the tweet. Tweets with a score above a certain threshold are considered relevant. The intuition behind the use of Wikipedia concept terms in the first pass is to keep precision as high as possible, i.e., to retrieve relevant tweets only.

3.3 Second Pass

The second pass makes use of concept term score propagation in order to discover more tweets relevant to a particular company, i.e., to increase recall. The score propagation technique is based on the intuition that terms co-located with significant concept terms may themselves have some relevance to the concept. The scores of the concept terms in a tweet judged relevant by the first pass are redistributed among the co-located terms.
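The core of the two passes can be sketched as follows. This is a minimal illustration, not our actual implementation: the whitespace tokenization, the example concept-term weights, the threshold value, and the equal-share redistribution scheme are all hypothetical simplifications.

```python
# Sketch of the two-pass filtering algorithm. Concept-term weights,
# threshold, and the redistribution scheme are illustrative only.
from collections import defaultdict


def first_pass(tweets, concept_scores, threshold=1.0):
    """Pass 1: score each tweet as the sum of (occurrence count x
    concept-term specificity score); keep tweets above the threshold."""
    relevant = {}
    for tweet in tweets:
        tokens = tweet.lower().split()
        score = sum(tokens.count(term) * weight
                    for term, weight in concept_scores.items())
        if score > threshold:
            relevant[tweet] = score
    return relevant


def propagate_scores(relevant, concept_scores):
    """Pass 2 (core idea): redistribute each matched concept term's score
    to the non-concept terms co-located with it in a relevant tweet."""
    propagated = defaultdict(float)
    for tweet in relevant:
        tokens = tweet.lower().split()
        concept_hits = [t for t in tokens if t in concept_scores]
        others = [t for t in tokens if t not in concept_scores]
        if not others:
            continue
        for t in concept_hits:
            # Share the concept term's score equally among co-located
            # non-concept terms (one of several possible schemes).
            share = concept_scores[t] / len(others)
            for o in others:
                propagated[o] += share
    return dict(propagated)
```

In this sketch, each concept term's score is shared equally among the non-concept terms it co-occurs with; the accumulated scores of those co-located terms are then available for re-scoring the remaining tweets.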
This in turn gives a score to the non-concept terms, and using these non-concept term scores we perform a second computation to obtain tweet scores. Moreover, this pass uses additional sources of evidence:

– the POS tag of the company name occurring within the tweet;
– URLs occurring within the tweet;
– Twitter usernames occurring within the tweet;
– hashtags occurring within the tweet.

The score propagation technique together with the extra sources of evidence listed above enables us to extract more tweets relevant to the company, thus increasing recall. We report the results of the evaluation metrics in the next section.

4 Experimental Evaluations

As mentioned in Section 2, the task is a binary classification task, and hence we report the effectiveness of our algorithm through the standard evaluation metrics of precision and recall. We were given a very small trial dataset (six companies) and a considerably larger test dataset (31 companies); we report our results for the test dataset. Table 1 shows the precision and recall figures after the application of the first pass and after the application of the second pass of our algorithm.

Table 1. Precision and recall scores for the two passes of the algorithm

            First Pass   Second Pass
Precision   0.84827      0.81129
Recall      0.16307      0.76229

As Table 1 shows, the first pass yields high precision but extremely low recall. The application of the second pass increases recall by a large degree, while not overly reducing precision. The significantly large increase in recall demonstrates the effectiveness of the score propagation technique combined with the use of multiple sources of evidence.

The RepLab2012 filtering task used the measures of Reliability and Sensitivity for evaluation purposes; these are described in detail in [3]. Table 2 presents a snapshot of the official results for the filtering task of RepLab 2012, where CIRGDISCO is the name of our team.
Table 2. Results of the filtering task of RepLab 2012

Team         Accuracy (Filtering)   R (Filtering)   S (Filtering)   F(R,S) (Filtering)
Daedalus 2   0.72                   0.24            0.43            0.26
Daedalus 3   0.70                   0.24            0.42            0.25
Daedalus 1   0.72                   0.22            0.40            0.25
CIRGDISCO    0.70                   0.22            0.34            0.23

Table 2 shows that our algorithm performed competitively: it is the second best among the submitted systems and fourth best among the submitted runs. We managed to submit only a single run due to shortage of time.

5 Related Work

There has been increasing interest in applying natural language processing techniques to tweets over the past few years. We provide an overview of some of these works in this section.

Named entity recognition in tweets has been an area of active research, with a focus on semi-supervised learning techniques [5, 9]. Given the lack of context in tweets for the task of named entity recognition, Liu et al. [5] use a large training set and apply a KNN classifier in combination with a CRF-based labeler, whereas Ritter et al. [9] employ Labeled LDA over a Freebase corpus.

Understanding and analysing the content of tweets is another significant research direction, where the goal is to extract keyphrases from the content of tweets [4, 11]. Hong and Davison [4] proposed several schemes to train standard LDA and the Author-Topic LDA models for topic discovery over Twitter data. Zhao et al. [11] extracted and ranked topical keyphrases on Twitter through the use of topic models [12] followed by topical PageRank.

Semantic enrichment of microblog posts has also been studied, with the aim of determining what a tweet is about [2, 7]. Abel et al. [2] make use of news articles to contextualize tweets, while Meij et al. [7] provide fine-grained semantic enrichment of tweets by matching them with Wikipedia.

6 Conclusion

We proposed a two-pass algorithm for company name disambiguation in tweets.
Our algorithm uses Wikipedia as the primary knowledge resource in the first pass, where tweets are matched against Wikipedia terms. The matched terms are then used for score propagation in the second pass, which also makes use of multiple sources of evidence. Our algorithm showed competitive performance, demonstrating the effectiveness of the techniques employed in the two passes. As future work, we aim to refine the score propagation technique of the second pass by taking into account better features for effective score computation.

References

1. http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.
2. F. Abel, Q. Gao, G.-J. Houben, and K. Tao. Semantic enrichment of twitter posts for user profile construction on the social web. In Proceedings of the 8th Extended Semantic Web Conference (ESWC'11), Part II, pages 375–389, Berlin, Heidelberg, 2011. Springer-Verlag.
3. E. Amigó, J. Gonzalo, and F. Verdejo. Reliability and Sensitivity: Generic Evaluation Measures for Document Organization Tasks. Technical Report, UNED, Madrid, Spain, 2012.
4. L. Hong and B. D. Davison. Empirical study of topic modeling in twitter. In Proceedings of the First Workshop on Social Media Analytics (SOMA '10), pages 80–88, New York, NY, USA, 2010. ACM.
5. X. Liu, S. Zhang, F. Wei, and M. Zhou. Recognizing named entities in tweets. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT '11), Volume 1, pages 359–367, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
6. K. Massoudi, M. Tsagkias, M. de Rijke, and W. Weerkamp. Incorporating query expansion and quality indicators in searching microblog posts. In Proceedings of the 33rd European Conference on Advances in Information Retrieval (ECIR'11), pages 362–367, Berlin, Heidelberg, 2011. Springer-Verlag.
7. E. Meij, W.
Weerkamp, and M. de Rijke. Adding semantics to microblog posts. In Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM '12), pages 563–572, New York, NY, USA, 2012. ACM.
8. A. Pak and P. Paroubek. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).
9. A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11), pages 1524–1534, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
10. S. R. Yerva, Z. Miklós, and K. Aberer. What have fruits to do with technology?: the case of orange, blackberry and apple. In Proceedings of the International Conference on Web Intelligence, Mining and Semantics (WIMS '11), pages 48:1–48:10, New York, NY, USA, 2011. ACM.
11. W. X. Zhao, J. Jiang, J. He, Y. Song, P. Achananuparp, E.-P. Lim, and X. Li. Topical keyphrase extraction from twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT '11), Volume 1, pages 379–388, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
12. W. X. Zhao, J. Jiang, J. Weng, J. He, E.-P. Lim, H. Yan, and X. Li. Comparing twitter and traditional media using topic models. In Proceedings of the 33rd European Conference on Advances in Information Retrieval (ECIR'11), pages 338–349, Berlin, Heidelberg, 2011. Springer-Verlag.