WikifyMe: Creating Testbed for Wikifiers

Sergey Bartunov, Alexander Boldakov, Denis Turdakov
ISP RAS
sbartunov@gmail.com, boldakov@gmail.com, turdakov@ispras.ru

Proceedings of the Spring Researcher's Colloquium on Database and Information Systems, Moscow, Russia, 2011

Abstract

Finding relationships between words in a text and articles from Wikipedia is an extremely popular task known as wikification. However, there is still no gold-standard corpus for comparing wikifiers. We present WikifyMe, an online tool for collaborative work on a universal test collection that allows users to easily prepare tests for the two most difficult problems in wikification: word-sense disambiguation and keyphrase extraction.

1 Introduction

Enriching text documents with links to Wikipedia pages has become an extremely popular task known as wikification. Wikification is necessary for intelligent systems that use knowledge extracted from Wikipedia for different purposes [5, 8]. Showing wikified documents to readers of blogs or news feeds is common as well [10, 4].

Enriching text with links to Wikipedia usually consists of two steps: extracting key terms from a document and associating these terms with Wikipedia pages. The lexical ambiguity of language is the main difficulty for automatic wikification; therefore, word-sense disambiguation (WSD) is a necessary step for automatic wikifiers.

Another challenge for automatic wikification is choosing the terms that should be associated with Wikipedia articles. Marking every term described in Wikipedia with a link makes the document hard to read, so only the most relevant terms should be presented as links for a particular document. Such terms are usually called key terms.

There are many approaches to automatic wikification. The most successful wikifiers use supervised learning algorithms for word-sense disambiguation and key term extraction. For such algorithms, Wikipedia serves as a training corpus. However, the lack of testing corpora based on real data makes it extremely hard to compare different wikifiers and choose the best one.

In order to estimate the quality of an automatic wikifier on real data, part of this data should be wikified manually by a human expert. The difficulty of manual wikification depends on the number of key terms that should be linked to Wikipedia. In the general case, all terms in the text should be associated with Wikipedia articles and some of them should be marked as key terms. This is required for separate testing of WSD and key term extraction algorithms.

This paper introduces WikifyMe (http://wikifyme.ispras.ru), a Web-based system that aims at creating large wikified corpora with the aid of Web users. The system has a user-friendly interface that makes manual wikification much easier. We expect that it will yield good corpora for comparing different wikifiers at a relatively low cost.

The rest of the paper is organized as follows. Related work is described in the next section. Sect. 3 gives an overview of WikifyMe and explains the decisions we made during development of the system. In Sect. 4, the current dataset is described. Conclusion and future work are discussed in Sect. 5.

2 Related Work

Wikipedia itself is an obvious corpus for wikifier evaluation. Each regular Wikipedia page describes one unambiguous concept and links to other pages of Wikipedia. In the general case, each link consists of two parts: the destination page and the caption shown to readers. A link can therefore be interpreted as an annotation of the caption text with the meaning described by the destination page. Another common assumption about internal links is that Wikipedia editors create links only for key terms. Based on these ideas, researchers extract random samples of regular Wikipedia pages and use them as testing corpora.

The main drawback of this approach is that the results are biased for algorithms that use Wikipedia links for training. In addition, the behaviour on real data of key term extractors trained with the aid of Wikipedia internal links is not well studied. Therefore, researchers build their own corpora based on other data sources.

Mihalcea [9] manually mapped some Wikipedia terms to WordNet terms in order to carry out experiments on the commonly accepted standard tests of the SenseEval corpus. However, there is no one-to-one mapping between Wikipedia and WordNet, so this approach is not commonly used.

Cucerzan created his own corpus for evaluation of the system described in [3]. A set of 100 news stories on a diverse range of topics was marked with named entities, which were also associated with Wikipedia articles. This corpus is publicly available, but its annotations are sparse and limited to a few entity types.

Milne and Witten [10] used the Mechanical Turk [1] service to annotate a subset of 50 documents from the AQUAINT text corpus, a collection of newswire stories from the Xinhua News Service, the New York Times, and the Associated Press. However, they only asked annotators to mark key terms, so their corpus cannot be used for WSD evaluation with high recall.

Kulkarni et al. [7] developed a browser-based annotation tool for creating a test corpus. They collected about 19,000 annotations from six volunteers. Documents for manual annotation were collected from the links within the homepages of popular sites belonging to a handful of domains including sports, entertainment, science, technology, and health. About 3,800 distinct Wikipedia entities were linked to, and about 40% of the spots were labeled n/a, highlighting the importance of backoffs. This corpus is good for testing WSD algorithms, but it does not contain any information about keywords.

A similar corpus was created for evaluation of the algorithms described in [11]. Like the previous one, it has tags for all possible segments, even when there is no correct mark for them (such segments are marked as n/a), and it does not provide any information about keywords either. We added this corpus to our system, then revised its marks and included information about keywords.

The idea of involving Web users in the creation of training and testing corpora was described and implemented in the OMWE project [2]. The aim of that project was the creation of a large corpus for the WSD task with the aid of Web users; its result was a corpus used for the WSD tracks of the Senseval-3 conference. However, this corpus is based on WordNet senses and therefore cannot be directly used for wikifier evaluation.

3 Description of the System

3.1 Terminology

To create a new test, the user has to upload and mark up a text file (we call such a file "a document"). A document consists of plain text and metadata that represents terms, concepts and keyphrases. A term models a contiguous part of the text which has significant semantic value and thus some meaning. Meanings are represented by concepts, that is, articles in Wikipedia. We defined the special "not-in-wikipedia" concept for cases when the term has a valuable sense, but there is no right concept to reflect it.

The union of all term meanings forms the set of the document's concepts. Some concepts may be regarded as key concepts, which reflect the main topic(s) of the document. So we think of keyphrases as the terms (that is, pieces of text) whose meanings are key concepts.
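As an illustration only, the following minimal Python sketch models this terminology; the class and field names (Concept, Term, Document, NOT_IN_WIKIPEDIA) are hypothetical and do not describe WikifyMe's actual internal representation.

    from dataclasses import dataclass, field
    from typing import List

    # A concept is a Wikipedia article, identified here simply by its title.
    @dataclass(frozen=True)
    class Concept:
        title: str

    # Special concept for terms that carry meaning but have no matching article.
    NOT_IN_WIKIPEDIA = Concept(title="not-in-wikipedia")

    # A term is a contiguous span of the document text together with its meaning.
    @dataclass
    class Term:
        start: int        # character offset where the span begins
        end: int          # character offset where the span ends (exclusive)
        meaning: Concept  # the concept assigned to this span

    @dataclass
    class Document:
        text: str
        terms: List[Term] = field(default_factory=list)
        key_concepts: List[Concept] = field(default_factory=list)

        def concepts(self) -> List[Concept]:
            # The document's concept set is the union of all term meanings.
            seen: List[Concept] = []
            for term in self.terms:
                if term.meaning not in seen:
                    seen.append(term.meaning)
            return seen

        def keyphrases(self) -> List[Term]:
            # Keyphrases are the terms whose meanings are key concepts.
            return [t for t in self.terms if t.meaning in self.key_concepts]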
3.2 Process of the Wikification

The user selects a part of the text with the mouse to mark up a term there. It is very important to select the term boundaries accurately, so we implemented several techniques that help users do that.

The first technique expands the selection to the boundaries of the selected words. For example, the selection "Scala is a great p[rogramming langu]age" would be expanded to "Scala is a great [programming language]". The second technique removes unnecessary spaces from the selection: "Evaluation of [delimited continuations ]is supported" becomes "Evaluation of [delimited continuations] is supported". Both techniques can be enabled or disabled at any moment.
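A minimal sketch of both boundary-fixing techniques, assuming the selection is given as character offsets into the plain text; the function names are ours, not WikifyMe's.

    def expand_to_word_boundaries(text: str, start: int, end: int):
        # Grow the selection left and right until whitespace or the text edge,
        # so "p[rogramming langu]age" becomes "[programming language]".
        while start > 0 and not text[start - 1].isspace():
            start -= 1
        while end < len(text) and not text[end].isspace():
            end += 1
        return start, end

    def trim_whitespace(text: str, start: int, end: int):
        # Drop leading/trailing spaces, so "[delimited continuations ]"
        # becomes "[delimited continuations]".
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        return start, end

    text = "Scala is a great programming language"
    start, end = expand_to_word_boundaries(text, 18, 34)  # selection inside "programming langu"
    print(text[start:end])                                # -> "programming language"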
After the term has been created, the user is offered a choice of meaning for it (see Figure 1). The meaning can be represented by any article in Wikipedia; however, for each term we provide a list of recommended concepts. These concepts are obtained from the wiki-links that appear in Wikipedia articles containing the term text, and they are ranked according to how often links to them anchor the term text. If a concept has already been used as a meaning for the term in the document, the system puts it at the top of the list.

Figure 1: List of recommended meanings for the "AT&T" term
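A sketch of this ranking, assuming anchor statistics (how many Wikipedia links with a given anchor text point to each article) have been precomputed; the data structures here are hypothetical simplifications.

    from collections import Counter
    from typing import Dict, List

    def recommend_concepts(term_text: str,
                           anchor_counts: Dict[str, Counter],
                           already_used: List[str]) -> List[str]:
        # anchor_counts maps an anchor string to a Counter of destination articles,
        # e.g. anchor_counts["AT&T"]["AT&T Corporation"] = 1500 (illustrative numbers).
        ranked = [article for article, _ in
                  anchor_counts.get(term_text, Counter()).most_common()]
        # Concepts already chosen for this term elsewhere in the document go to the top.
        pinned = [a for a in already_used if a in ranked]
        rest = [a for a in ranked if a not in pinned]
        return pinned + rest

    # Hypothetical usage:
    stats = {"AT&T": Counter({"AT&T Corporation": 1500,
                              "AT&T Mobility": 200,
                              "AT&T Stadium": 30})}
    print(recommend_concepts("AT&T", stats, already_used=["AT&T Mobility"]))
    # -> ['AT&T Mobility', 'AT&T Corporation', 'AT&T Stadium']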
The list of document concepts is shown on the right panel (Fig. 2). The user may click on any concept and mark it as a key concept. This marks all term representations of the concept as keyphrases.

We restricted the term markup to at most one term on a single part of the text; that is, no two different terms may intersect. We have found this restriction to be a reasonable simplification which lightens the user interface and facilitates the user's interaction with the system. Also, our experience in creating WSD tests shows that a single user has no need to make one piece of text a part of several terms, and such a limitation is very common. However, if several users select overlapping parts of the text as terms in their versions of the same document, this is represented in the resulting test as described in Sect. 3.5.

3.3 Preprocessing

To make test creation easier, we provide an automatic preprocessing feature which uses the wikifier described in [6] to automatically detect terms in documents, assign meanings to them and select key concepts. Meanings assigned in this way are marked as non-reviewed. This feature significantly improves the speed and usability of the test creation process, because users only have to review these meanings and the "key" status of document concepts.

Figure 2: Preparing the test. Green terms are reviewed, red ones are unreviewed. Bold concepts on the left side are marked as key concepts.

3.4 Documents and Folders

Documents in WikifyMe are organized in folders. Each folder has a name and, optionally, a description. Users are able to create new folders; the user who creates a new folder is treated as its owner. Each folder is accessible to all users, but only the folder owner can delete it or upload new documents into it. To allow other users to upload new documents to a folder, it has to be marked as "public" by its owner.

Whenever a user opens a document uploaded by another user, a new version of the document is created. This version does not contain any information from the original document except the plain text, so users work on the same documents independently. This is useful because each user is not affected by possible mistakes of others. Users can delete their own versions of documents, but original documents can be deleted only by the owners of the containing folders.

3.5 Getting the Tests

Everyone can get the whole test collection by clicking the "Merge and download" button. WikifyMe will merge all versions of all files and provide the results in a single archive.

The process of merging is quite simple: to merge a set of documents, WikifyMe builds a resulting document which consists of the terms, meanings and key concepts from all these documents. Then the system computes an agreement level for each term and meaning selection (we call it a confidence) and for each key concept selection (a keyphraseness).

The meaning confidence for each term is computed by the formula

    confidence = |selections of this meaning| / |selections of this term|    (1)

The keyphraseness of a key concept is computed as

    keyphraseness = |versions where the concept is key| / |versions where the concept appears|    (2)

WikifyMe also computes the confidence of the term selection itself:

    confidence = |selections of this term| / |other terms overlapping this term|    (3)

We treat two terms as the same if their boundaries match exactly. So the confidence of two terms whose meanings just overlap does not decrease, but the confidence of the term selection does.
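The sketch below shows how formulas (1) and (2) might be computed when merging several user versions of one document. The representation of a version (a mapping from exact spans to concepts plus a set of key concepts) is our simplification, not WikifyMe's actual data model, and formula (3) would additionally require detecting overlapping spans, which is omitted here.

    from collections import Counter, defaultdict
    from typing import Dict, List, Tuple

    Span = Tuple[int, int]  # terms are identified by exact boundaries (start, end)

    def merge(versions: List[Dict]) -> Dict:
        # Each version is {"terms": {span: concept}, "key_concepts": set_of_concepts}.
        term_selections = Counter()                # how many versions selected this exact span
        meaning_selections = defaultdict(Counter)  # span -> how often each concept was chosen
        concept_appears = Counter()                # versions in which the concept appears
        concept_is_key = Counter()                 # versions in which the concept is key

        for v in versions:
            for span, concept in v["terms"].items():
                term_selections[span] += 1
                meaning_selections[span][concept] += 1
            for concept in set(v["terms"].values()):
                concept_appears[concept] += 1
            for concept in v["key_concepts"]:
                concept_is_key[concept] += 1

        # Formula (1): meaning confidence for every (term, meaning) pair.
        meaning_confidence = {
            (span, concept): count / term_selections[span]
            for span, counts in meaning_selections.items()
            for concept, count in counts.items()
        }
        # Formula (2): keyphraseness of every concept.
        keyphraseness = {
            concept: concept_is_key[concept] / concept_appears[concept]
            for concept in concept_appears
        }
        return {"meaning_confidence": meaning_confidence,
                "keyphraseness": keyphraseness}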
3.6 Output format

XML, a widespread format for annotated text files, has been chosen as the output format for merged documents. An example document is shown in Figure 3.

The concept tag defines a concept in the document, with name and id attributes that refer to the Wikipedia article's name and its ID obtained from the Wikipedia dump. The concept tag also contains representation tags, each of which defines a term associated with the containing concept as its meaning. The span attribute has the form "start..end" and indicates the position of the term in the text.

The term tag also defines a term and completely duplicates the information from a certain combination of concept and representation. This redundancy exists because different data structures are more suitable for different tasks: term tags are convenient for word-sense disambiguation, while concept tags are suitable for semantic analysis of the document.

The meaning of the confidence and keyphraseness attributes has been described above.

Figure 3: Example of downloaded XML test file.
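The exact element layout is shown in Figure 3; the sketch below relies only on the tags and attributes named above (concept with name and id, representation with a "start..end" span, and the confidence and keyphraseness attributes), omits the redundant term tags, and uses made-up values, so it should be read as an approximation rather than the actual schema.

    import xml.etree.ElementTree as ET

    # Hypothetical example shaped only by the prose description above;
    # real files produced by WikifyMe may differ in details.
    sample = """
    <document>
      <concept name="Scala (programming language)" id="12345" keyphraseness="1.0">
        <representation span="0..5" confidence="1.0"/>
      </concept>
    </document>
    """

    root = ET.fromstring(sample.strip())
    for concept in root.findall("concept"):
        name = concept.get("name")
        keyphraseness = float(concept.get("keyphraseness", "0"))
        for rep in concept.findall("representation"):
            start, end = (int(x) for x in rep.get("span").split(".."))
            confidence = float(rep.get("confidence", "0"))
            print(name, (start, end), confidence, keyphraseness)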
4 Data

Currently, WikifyMe contains 8 folders with 132 documents from very different sources, from scientific papers and blog posts to summaries from Google News. Such variety is quite helpful for testing on different kinds of texts, and we expect the document collection to be broadened by users.

Greg-January-2008, Monah-DBMS2-May-2008 and radar oreilly jan 2007 refer to blog post collections from the Greg Linden, DBMS2 and Tim O'Reilly blogs respectively. The news google com 26 may 2008 folder contains news articles from Google News for 26th May 2008, while UPI Entertainment 17 22 may 2008 and UPI Health 01 06 june 2008 come from the Entertainment and Health sections of "United Press International". scientific papers, as the name suggests, consists of scientific papers directly converted from PDF to plain text, and sqlsummit-June2008 contains short news summaries from the "SQL Summit" blog. A summary of the corpora is presented in Table 1.

Table 1: Statistics for base corpora

    Folder                              # of terms    avg. doc length
    Greg-January-2008                      661           336.7
    Monah-DBMS2-May-2008                   686           242.7
    news google com 26 may 2008            844           386.3
    radar oreilly jan 2007                 482           803.6
    scientific papers                      858          1761
    sqlsummit-June2008                     419            89.5
    UPI Entertainment 17 22 may 2008      1898           162.6
    UPI Health 01 06 june 2008            1297           201.2

Initially, the base corpora were marked up by roughly one person each; thus the confidence and keyphraseness metrics are about 1.0 and are not representative at the current stage.

Table 2 compares, by number of terms, the manually collected "ground truth" corpus of Kulkarni et al. [7] named IITB, the Milne and Witten [10] test corpus, which was automatically wikified by their tool and then manually verified, and the manually collected WikifyMe corpus. As we can see, at the moment WikifyMe's corpus is comparable to IITB and outperforms the Milne and Witten corpus in the number of tagged terms, so it is suitable for WSD benchmarking tasks.

Table 2: Comparison of corpora

    Corpus                     Number of terms
    WikifyMe                       7145
    Milne et al. tests              314
    Kulkarni et al. (IITB)        17200

5 Conclusion

Although WikifyMe is already a working system, there are still many possibilities to make it better. First, we plan to add the existing test corpora, such as those used by Kulkarni et al. [7] and Milne and Witten [10] in their research.

Since the key to the success of the whole project is the active contribution of users, we will add several features to the web tool to stimulate user activity, for example public statistics on the amount of work done by each user (possibly included in the archive with the tests). We believe this will be worthwhile because it is important for a user to feel that he or she is a part of the project and that the value of their contribution is visible to everyone.

We hope that WikifyMe will gather an active user community and help to create a large and high-quality test collection useful for researchers in wikification.

References

[1] Jeff Barr and Luis Felipe Cabrera. AI gets a brain. Queue, 4:24–29, May 2006.

[2] Timothy Chklovski and Rada Mihalcea. Building a sense tagged corpus with Open Mind Word Expert. In Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions - Volume 8, WSD '02, pages 116–122, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.

[3] Silviu Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of EMNLP-CoNLL 2007, pages 708–716, 2007.

[4] Paolo Ferragina and Ugo Scaiella. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 1625–1628, New York, NY, USA, 2010. ACM.

[5] M. Grineva, D. Lizorkin, M. Grinev, A. Boldakov, D. Turdakov, A. Sysoev, and A. Kiyko. Blognoon: Exploring a topic in the blogosphere. In Proceedings of the 18th International Conference on World Wide Web, 2011.

[6] Maria Grineva, Maxim Grinev, and Dmitry Lizorkin. Extracting key terms from noisy and multitheme documents. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 661–670, New York, NY, USA, 2009. ACM.

[7] Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 457–466, New York, NY, USA, 2009. ACM.

[8] Olena Medelyan, Ian H. Witten, and David Milne. Topic indexing with Wikipedia, 2008.

[9] Rada Mihalcea. Using Wikipedia for automatic word sense disambiguation. In North American Chapter of the Association for Computational Linguistics (NAACL 2007), 2007.

[10] David Milne and Ian H. Witten. Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 509–518, New York, NY, USA, 2008. ACM.

[11] Denis Turdakov and Pavel Velikhov. Semantic relatedness metric for Wikipedia concepts based on link analysis and its application to word sense disambiguation. In Proceedings of the SYRCODIS 2008 Colloquium on Databases and Information Systems, 2008.