First steps beyond the bag-of-words representation of short texts

Paolo Ferragina and Ugo Scaiella
Dipartimento di Informatica, University of Pisa, Italy
{ferragina,scaiella}@di.unipi.it

Abstract. We address the problem of enhancing the classical bag-of-words representation of texts by designing and engineering Tagme, the first system that performs an accurate and on-the-fly semantic annotation of short texts using Wikipedia as a knowledge base. Several experiments show that Tagme outperforms state-of-the-art algorithms when they are adapted to work on short texts, and that it is fast and competitive on long ones. This leads us to argue favorably about Tagme's application to clustering, classification and retrieval systems in challenging scenarios such as web-snippets, tweets, news, ads, etc.

1 Motivation and background

The typical IR approach to indexing, clustering, classification and retrieval, just to name a few tasks, is based on the bag-of-words paradigm. In recent years a good deal of work has attempted to go beyond this paradigm with the goal of improving the search experience on (unstructured) textual data. In our work we are concerned with the task of adding structure to unstructured data, which consists of identifying sequences of terms (aka spots) in the input text and annotating them with Wikipedia pages. This annotation process provides a powerful contextualization of the input text, so that each subsequent IR task can be improved by leveraging the huge semantic network provided by Wikipedia. Recently several works (see e.g. [4, 6] and references therein) have addressed the problem of annotating texts with hyper-links to Wikipedia pages. We add to this line of work the specialty that the input texts to be annotated are short, namely composed of a few tens of terms. The context of use we have in mind is the annotation of the snippets of search-engine results, the tweets of a Twitter channel, the items of a news feed, the posts of a blog, advertisement messages, etc. It is easy to argue that these poorly composed texts pose new challenges in terms of efficiency and effectiveness of the annotation process, which (1) should be very fast, because in those contexts data may be retrieved at query time and thus cannot be pre-processed, and (2) should be designed properly, because the input texts are so short that it is difficult to mine the significant statistics that are available when texts are long.

To address these issues, we have designed and implemented Tagme, the first software system that, on the fly and with high precision/recall, annotates short texts with pertinent hyper-links to Wikipedia pages. As an example, consider the following news item: "Diego Maradona won against Mexico". Our goal is to detect "Diego Maradona" and "Mexico" as spots, and then hyper-link them with the Wikipedia pages dealing with the former coach of Argentina and the Mexican national football team. Tagme uses as spots (to be annotated) the sequences of terms that occur as anchor texts in Wikipedia pages, and it uses as possible senses for each spot the (possibly many) pages pointed to in Wikipedia by that spot/anchor; a minimal sketch of this spotting step is given below. Tagme then selects, among the potentially many available spot-to-page mappings (e.g., "Diego Maradona" is the name of two persons, while "Mexico" points to 154 different pages in Wikipedia), the most pertinent ones by finding a collective agreement among them via new scoring functions which are fast to compute and accurate in the finally produced annotation.
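To make the spotting step concrete, the following is a minimal sketch of how anchors could be detected via greedy longest-match scanning over a dictionary of Wikipedia anchor texts. The dictionary contents, the MAX_SPOT_LEN bound and all names are illustrative assumptions, not Tagme's actual data or implementation.

```python
# Minimal sketch of spot detection: scan the input text and report every
# maximal sequence of terms that occurs as an anchor text in Wikipedia.
# The dictionary below is an illustrative placeholder, not real Tagme data.

ANCHORS = {
    "diego maradona": ["Diego_Maradona", "Diego_Maradona_Jr."],
    "mexico": ["Mexico", "Mexico_national_football_team"],  # 154 pages in reality
}
MAX_SPOT_LEN = 6  # anchors spanning more than a few terms are rare

def find_spots(text):
    """Return (spot, candidate_pages) pairs via greedy longest-match scanning."""
    terms = text.lower().split()
    spots, i = [], 0
    while i < len(terms):
        matched = False
        # try the longest n-gram starting at position i first
        for j in range(min(len(terms), i + MAX_SPOT_LEN), i, -1):
            candidate = " ".join(terms[i:j])
            if candidate in ANCHORS:
                spots.append((candidate, ANCHORS[candidate]))
                i, matched = j, True  # skip past the matched spot
                break
        if not matched:
            i += 1
    return spots

print(find_spots("Diego Maradona won against Mexico"))
# -> [('diego maradona', [...]), ('mexico', [...])]
```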
A preliminary description of Tagme was published as a poster at ACM CIKM 2010 [3], and the system is available for testing at http://tagme.di.unipi.it. What follows is a sketch of the main ideas and experimental results concerning Tagme; the interested reader is referred to [3] for details.

2 The anatomy of TAGME

The annotation process of Tagme is composed of two main phases: anchor disambiguation and anchor pruning.

Anchor disambiguation. This is the task that judiciously cross-references each anchor a ∈ A_T found in the input text T with one pertinent page p_a of Wikipedia. Tagme selects the best association a ↦ p_a by computing a score for each possible page p_a linked to a in Wikipedia (we call Pg(a) this set). The score is based on a new notion of "collective agreement" between the page p_a and the pages that can be associated to all other anchors detected in T, i.e. the anchors in A_T. This agreement is evaluated by means of a voting scheme that computes, for each other anchor b ∈ A_T \ {a}, its vote for the annotation a ↦ p_a. Given that b may be linked to many pages in Wikipedia (i.e. |Pg(b)| > 1), we compute this vote as the average relatedness between each page p_b potentially linked to b and the sense p_a we wish to associate to a. However, we argue that not all possible pages of b have the same (statistical) significance, so we weight each relatedness by the commonness of the page p_b with respect to b, denoted Pr(p_b | b) and computed as the prior probability that b points to p_b over all links of b in Wikipedia. Hence the vote given by anchor b to the annotation a ↦ p_a is

  v_b(p_a) = (1 / |Pg(b)|) · Σ_{p_b ∈ Pg(b)} rel(p_b, p_a) · Pr(p_b | b)

where the relatedness rel(p_b, p_a) between the two Wikipedia pages p_a and p_b is computed as suggested in [5], by exploiting the intersection of the incoming links to p_a and p_b. The overall score for the annotation a ↦ p_a is computed as the sum of the votes given by all other anchors b in T. This score alone is not enough to obtain an accurate disambiguation, so we first filter out candidate pages in Pg(a) via a properly set threshold, and then select the best page by deploying the commonness scores.

Anchor pruning. This step detects and possibly prunes some of the candidate annotations produced by the disambiguation phase, if they are considered not meaningful. These "bad annotations" are detected via a simple, yet effective, scoring function that takes into account only two features: (1) the link probability lp(a) of the anchor a, computed as the number of times a is used as an anchor divided by its total number of occurrences in Wikipedia (as an anchor or not), and (2) the coherence between the candidate annotation a ↦ p_a (assigned by the disambiguation phase) and the candidate annotations of the other anchors in T, computed with the relatedness function presented above. This pruning score, say ρ(a ↦ p_a), is then compared against a properly set threshold ρ_NA: if ρ(a ↦ p_a) < ρ_NA then the annotation for a is discarded. The parameter ρ_NA allows us to balance recall vs. precision of the annotation process, and we investigated its impact in depth in our experiments. Illustrative sketches of the two phases are given below.
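The following is a minimal sketch of the disambiguation phase, under stated assumptions: the incoming-link sets, the candidate sets Pg(·) and the commonness statistics are precomputed from a Wikipedia dump (left empty here), the relatedness function follows the incoming-link measure suggested in [5], and W and EPS are placeholders rather than Tagme's tuned values.

```python
import math

# Illustrative sketch of the disambiguation phase. The data structures below
# (incoming-link sets, candidate pages Pg(a), commonness Pr(p|a)) are assumed
# to be precomputed from a Wikipedia dump; they are left empty here.

W = 3_000_000   # illustrative total number of Wikipedia pages
IN_LINKS = {}   # page -> set of pages linking to it
PG = {}         # anchor -> list of candidate pages (the set Pg(a))
COMMONNESS = {} # (page, anchor) -> Pr(page | anchor)
EPS = 0.3       # keep candidates whose score is within EPS of the best one

def rel(p, q):
    """Relatedness of two pages via their incoming links, as suggested in [5]."""
    a, b = IN_LINKS[p], IN_LINKS[q]
    inter = len(a & b)
    if inter == 0 or not a or not b:
        return 0.0
    num = math.log(max(len(a), len(b))) - math.log(inter)
    den = math.log(W) - math.log(min(len(a), len(b)))
    return max(0.0, 1.0 - num / den)

def vote(b, p_a):
    """Vote of anchor b for the annotation a -> p_a (the formula above)."""
    pages = PG[b]
    return sum(rel(p_b, p_a) * COMMONNESS[(p_b, b)] for p_b in pages) / len(pages)

def disambiguate(a, anchors):
    """Select the best page for anchor a, given all anchors detected in T."""
    scores = {p: sum(vote(b, p) for b in anchors if b != a) for p in PG[a]}
    best = max(scores.values())
    top = [p for p, s in scores.items() if s >= (1 - EPS) * best]  # threshold filter
    return max(top, key=lambda p: COMMONNESS[(p, a)])  # pick by commonness
```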
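A companion sketch of the pruning phase follows, reusing rel() from the snippet above. Combining the two features by a plain average is only one simple possibility shown for illustration; the actual combination and the value of ρ_NA are tuned empirically in Tagme.

```python
# Companion sketch of the pruning phase; rel() is the relatedness function of
# the previous snippet. LINK_OCC/TOTAL_OCC are assumed precomputed statistics.

LINK_OCC = {}   # anchor -> number of occurrences as a link in Wikipedia
TOTAL_OCC = {}  # anchor -> total number of occurrences (as anchor or not)
RHO_NA = 0.2    # pruning threshold: higher values favor precision over recall

def link_probability(a):
    return LINK_OCC[a] / TOTAL_OCC[a]

def coherence(p_a, other_pages):
    """Average relatedness between p_a and the pages chosen for the other anchors."""
    if not other_pages:
        return 0.0
    return sum(rel(p_a, q) for q in other_pages) / len(other_pages)

def prune(annotations):
    """annotations: dict anchor -> disambiguated page; drop the 'bad' ones."""
    kept = {}
    for a, p_a in annotations.items():
        others = [q for b, q in annotations.items() if b != a]
        # plain average of the two features; an illustrative choice only
        rho = (link_probability(a) + coherence(p_a, others)) / 2
        if rho >= RHO_NA:
            kept[a] = p_a
    return kept
```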
3 Experimental evaluation

We evaluated Tagme over a set of short texts randomly drawn from Wikipedia, composed of about 20 terms each (like web-snippets) and containing an average of about 10 spots. We compared the annotations produced by Tagme against the links attached by Wikipedia editors.

We also evaluated Tagme on the dataset proposed in [4], which is composed of manually annotated long texts drawn from the web. To annotate long texts containing more than 10 spots, Tagme shifts a window of about 10 spots over the input and applies its algorithms to each window incrementally; this way we did not need to change Tagme's architecture, and Tagme scales linearly with the number of anchors in the input text. For lack of space we cannot report detailed figures about the performance of Tagme, nor can we detail the comparisons against Milne&Witten's and Chakrabarti's systems (a deeper evaluation is presented in our technical report [3]); however, we briefly state that:

– on short texts, our annotator outperforms M&W's by yielding an overall F-measure of about 78%, an absolute improvement of more than 8%;

– on long texts, Tagme proved competitive (if not superior) with respect to M&W's and Chakrabarti's systems, even though our "shift-based" approach clearly gives an advantage to our competitors, which deploy the full input text;

– a thorough evaluation of the scalability and time efficiency of Tagme showed that it is able to annotate a short text of about 10 spots in 18 ms, while M&W's and Chakrabarti's systems take about 95 ms and more than 2 seconds, respectively, on comparable hardware.

4 Applications and future work

We believe that Tagme has implications which go far beyond the enrichment of a text with explanatory links. The most interesting benefit of this annotation process is the structured knowledge attached to textual fragments, which lets us leverage the semantic network provided by Wikipedia to improve IR tasks that are nowadays mainly addressed with the bag-of-words paradigm, with all its well-known limitations. We are currently investigating the impact of Tagme's annotation on several application domains:

– The on-the-fly labeled clustering of search-engine results (see also Clusty.com, Carrot2 and the survey [2]). Most of those systems are based on syntactic and statistical features, so they could benefit from Tagme's annotation to improve the effectiveness of the labeling and clustering phases. For this application we are considering the deployment of spectral clustering techniques [7]; a hypothetical sketch is given below.

– IR systems for vertical domains. Since in this context other structured information (coming from sources like databases, ontologies, etc.) may be available, we are investigating how to exploit such information in the Tagme annotation process in order to further improve its performance. As a case study, we are considering the application of Tagme to the annotation of Italian medical prescriptions.

– Web advertising. The explanatory links and the structured knowledge attached to plain texts by Tagme could allow the efficient and effective resolution of the ambiguity and polysemy issues which often arise when advertisers' keywords are matched against the content of web pages offering display ads.

Finally, we are studying the impact of other tools and information sources in Tagme to better relate and/or assign pages/senses to text anchors (e.g. Natural Language Processing tools, other by-products of Wikipedia such as DBpedia.org or YAGO, or the huge amount of structured information provided by the W3C SWEO Linking Open Data project [1]).
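As a purely hypothetical illustration of the clustering direction mentioned in the first application above (and not part of Tagme itself), annotated snippets could be reduced to sets of Wikipedia pages and compared via the relatedness function of Section 2, yielding an affinity matrix ready for a spectral clustering algorithm [7]. All names here are assumptions.

```python
from itertools import combinations

# Hypothetical sketch: building an affinity matrix over web-snippets for
# spectral clustering [7]. Each snippet is reduced to the set of Wikipedia
# pages attached by the annotator; rel() is the relatedness function sketched
# in Section 2 (note rel(p, p) = 1, so shared entities give maximal affinity).

def snippet_affinity(entities_i, entities_j):
    """Average cross-entity relatedness between two annotated snippets."""
    if not entities_i or not entities_j:
        return 0.0
    sim = sum(rel(p, q) for p in entities_i for q in entities_j)
    return sim / (len(entities_i) * len(entities_j))

def affinity_matrix(snippets):
    """snippets: list of entity sets, one per search result."""
    n = len(snippets)
    A = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        A[i][j] = A[j][i] = snippet_affinity(snippets[i], snippets[j])
    for i in range(n):
        A[i][i] = 1.0  # self-affinity
    # A can be fed to an off-the-shelf implementation, e.g.
    # sklearn.cluster.SpectralClustering(affinity="precomputed").
    return A
```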
In addition, we are currently setting up a much larger user study on Mechanical Turk, with the twofold goal of creating a manually annotated dataset much larger than the one offered by [4], and of supporting the applicability of Tagme on short web texts.

References

1. C. Bizer. The Emerging Web of Linked Data. IEEE Int. Sys., 24(5):87–92, 2009.
2. C. Carpineto, S. Osiński, G. Romano, and D. Weiss. A survey of web clustering engines. ACM Comput. Surv., 41(3):1–38, 2009.
3. P. Ferragina and U. Scaiella. TAGME: On-the-fly annotation of short text fragments (by Wikipedia entities). In Proc. ACM CIKM, 1625–1628, 2010. A detailed technical report is available at http://arxiv.org/abs/1006.3498.
4. S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of Wikipedia entities in web text. In Proc. ACM KDD, 457–466, 2009.
5. O. Medelyan, D. Milne, C. Legg, and I. H. Witten. Mining meaning from Wikipedia. Int. J. Hum.-Comput. Stud., 67(9):716–754, 2009.
6. D. Milne and I. H. Witten. Learning to link with Wikipedia. In Proc. ACM CIKM, 509–518, 2008.
7. U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.