WikifyMe: Creating Testbed for Wikifiers

Sergey Bartunov, Alexander Boldakov, Denis Turdakov
ISP RAS
sbartunov@gmail.com, boldakov@gmail.com, turdakov@ispras.ru

Proceedings of the Spring Researcher's Colloquium on Database and Information Systems, Moscow, Russia, 2011

Abstract

Finding relationships between words in a text and articles from Wikipedia is an extremely popular task known as wikification. However, there is still no gold-standard corpus for comparing wikifiers. We present WikifyMe, an online tool for collaborative work on a universal test collection that allows users to easily prepare tests for the two most difficult problems in wikification: word-sense disambiguation and keyphrase extraction.

1 Introduction

Enriching text documents with links to Wikipedia pages has become an extremely popular task known as wikification. Wikification is necessary for intelligent systems that use knowledge extracted from Wikipedia for different purposes [5, 8]. Showing wikified documents to readers of blogs or news feeds is common as well [10, 4].

Enriching text with links to Wikipedia usually consists of two steps: extracting key terms from a document and associating these terms with Wikipedia pages. The lexical ambiguity of language is the main difficulty for automatic wikification; therefore, word-sense disambiguation (WSD) is a necessary step for automatic wikifiers.

Another challenge for automatic wikification is choosing the terms that should be associated with Wikipedia articles. Marking every term described in Wikipedia with a link makes the document hard to read, so only the most relevant terms should be presented as links for a particular document. Such terms are usually called key terms.

There are many approaches to automatic wikification. The most successful wikifiers use supervised learning algorithms for word-sense disambiguation and key term extraction. For such algorithms, Wikipedia serves as a training corpus. However, the lack of testing corpora based on real data makes it extremely hard to compare different wikifiers and choose the best one.

In order to estimate the quality of an automatic wikifier on real data, part of this data should be wikified manually by a human expert. The difficulty of manual wikification depends on the number of key terms that should be linked to Wikipedia. In the general case, all terms in the text should be associated with Wikipedia articles and some of them should be marked as key terms. This is required for separate testing of WSD and key term extraction algorithms.

This paper introduces WikifyMe (http://wikifyme.ispras.ru), a Web-based system that aims at creating large wikified corpora with the aid of Web users. The system has a user-friendly interface that makes manual wikification much easier. We expect that it will yield good corpora for comparing different wikifiers at a relatively low cost.

The rest of the paper is organized as follows. Related work is described in the next section. Sect. 3 gives an overview of WikifyMe and explains the decisions we made during development of the system. In Sect. 4, the current dataset is described. Conclusion and future work are discussed in Sect. 5.

2 Related Work

Wikipedia itself is an obvious corpus for wikifier evaluation. Each regular Wikipedia page describes one unambiguous concept and links to other pages of Wikipedia. In the general case, each link consists of two parts: the destination page and the caption shown to readers. A link can therefore be interpreted as an annotation of the caption text with the meaning described by the destination page. Another common assumption about internal links is that Wikipedia editors create links only for key terms. Based on these ideas, researchers extract random samples of regular Wikipedia pages and use them as testing corpora.

The main drawback of this approach is that the results are biased for algorithms that use Wikipedia links for training. In addition, the behaviour on real data of key term extractors trained with the aid of Wikipedia internal links is not well studied. Therefore, researchers build their own corpora based on other data sources.

Mihalcea [9] manually mapped some Wikipedia terms to WordNet terms in order to carry out experiments on the commonly accepted standard tests of the SenseEval corpus. However, there is no one-to-one mapping between Wikipedia and WordNet, so this approach is not commonly used.

Cucerzan created his own corpus for evaluation of the system described in [3]. A set of 100 news stories on a diverse range of topics was marked with named entities, which were also associated with Wikipedia articles. This corpus is publicly available, but its annotations are sparse and limited to a few entity types.

Milne and Witten [10] used the Mechanical Turk [1] service to annotate a subset of 50 documents from the AQUAINT text corpus, a collection of newswire stories from the Xinhua News Service, the New York Times, and the Associated Press. However, they only asked annotators to mark key terms, so their corpus cannot be used for WSD evaluation with high recall.

Kulkarni et al. [7] developed a browser-based annotation tool for creating a test corpus. They collected about 19,000 annotations from six volunteers. Documents for manual annotation were collected from the links within the homepages of popular sites belonging to a handful of domains including sports, entertainment, science, technology, and health. About 3,800 distinct Wikipedia entities were linked to, and about 40% of the spots were labeled n/a, highlighting the importance of backoffs. This corpus is good for testing WSD algorithms, but it does not contain any information about keywords.

A similar corpus was created for evaluation of the algorithms described in [11]. Like the previous one, it has tags for all possible segments, even when there is no correct mark for them (such segments are marked as n/a), and it does not provide any information about keywords either. We added this corpus to our system, then revised its marks and included information about keywords.

The idea of involving Web users in the creation of training and testing corpora was described and implemented in the OMWE project [2]. The aim of that project was the creation of a large corpus for the WSD task with the aid of Web users; its result was a corpus used for the WSD tracks of the Senseval-3 conference. However, this corpus is based on WordNet senses and therefore cannot be directly used for wikifier evaluation.

3 Description of the System

3.1 Terminology

To create a new test, the user has to upload and mark up a text file (we call such a file "a document"). A document consists of plain text and metadata that represents terms, concepts and keyphrases. A term models a contiguous part of the text which has significant semantic value and thus some meaning. Meanings are represented by concepts, that is, articles in Wikipedia. We defined the special "not-in-wikipedia" concept for cases when the term has a valuable sense, but there is no right concept to reflect it.

The union of all term meanings forms the set of the document's concepts. Some concepts may be regarded as key concepts, which reflect the main topic(s) of the document. So we think of keyphrases as the terms (that is, pieces of text) whose meanings are key concepts.
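As an illustration only, the following minimal Python sketch models this terminology; the class and field names (Concept, Term, Document, NOT_IN_WIKIPEDIA) are hypothetical and do not describe WikifyMe's actual internal representation.

    from dataclasses import dataclass, field
    from typing import List

    # A concept is a Wikipedia article, identified here simply by its title.
    @dataclass(frozen=True)
    class Concept:
        title: str

    # Special concept for terms that carry meaning but have no matching article.
    NOT_IN_WIKIPEDIA = Concept(title="not-in-wikipedia")

    # A term is a contiguous span of the document text together with its meaning.
    @dataclass
    class Term:
        start: int        # character offset where the span begins
        end: int          # character offset where the span ends (exclusive)
        meaning: Concept  # the concept assigned to this span

    @dataclass
    class Document:
        text: str
        terms: List[Term] = field(default_factory=list)
        key_concepts: List[Concept] = field(default_factory=list)

        def concepts(self) -> List[Concept]:
            # The document's concept set is the union of all term meanings.
            seen: List[Concept] = []
            for term in self.terms:
                if term.meaning not in seen:
                    seen.append(term.meaning)
            return seen

        def keyphrases(self) -> List[Term]:
            # Keyphrases are the terms whose meanings are key concepts.
            return [t for t in self.terms if t.meaning in self.key_concepts]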
3.2 Process of the Wikification

The user selects a part of the text with the mouse to mark up a term there. It is very important to select the term boundaries accurately, so we implemented several techniques that help users do that.

The first technique expands the selection to the boundaries of the selected words. For example, the selection "Scala is a great p[rogramming langu]age" would be expanded to "Scala is a great [programming language]". The second technique removes unnecessary spaces from the selection: "Evaluation of [delimited continuations ]is supported" becomes "Evaluation of [delimited continuations] is supported". Both techniques can be enabled or disabled at any moment.
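A minimal sketch of both boundary-fixing techniques, assuming the selection is given as character offsets into the plain text; the function names are ours, not WikifyMe's.

    def expand_to_word_boundaries(text: str, start: int, end: int):
        # Grow the selection left and right until whitespace or the text edge,
        # so "p[rogramming langu]age" becomes "[programming language]".
        while start > 0 and not text[start - 1].isspace():
            start -= 1
        while end < len(text) and not text[end].isspace():
            end += 1
        return start, end

    def trim_whitespace(text: str, start: int, end: int):
        # Drop leading/trailing spaces, so "[delimited continuations ]"
        # becomes "[delimited continuations]".
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        return start, end

    text = "Scala is a great programming language"
    start, end = expand_to_word_boundaries(text, 18, 34)  # selection inside "programming langu"
    print(text[start:end])                                # -> "programming language"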
After the term has been created, the user is offered a choice of meaning for it (see Figure 1). The meaning can be represented by any article in Wikipedia; however, for each term we provide a list of recommended concepts. These concepts are obtained from the wiki-links that appear in Wikipedia articles containing the term text, and they are ranked according to how often links to them anchor the term text. If a concept has already been used as a meaning for the term in the document, the system puts it at the top of the list.

Figure 1: List of recommended meanings for the "AT&T" term
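A sketch of this ranking, assuming anchor statistics (how many Wikipedia links with a given anchor text point to each article) have been precomputed; the data structures here are hypothetical simplifications.

    from collections import Counter
    from typing import Dict, List

    def recommend_concepts(term_text: str,
                           anchor_counts: Dict[str, Counter],
                           already_used: List[str]) -> List[str]:
        # anchor_counts maps an anchor string to a Counter of destination articles,
        # e.g. anchor_counts["AT&T"]["AT&T Corporation"] = 1500 (illustrative numbers).
        ranked = [article for article, _ in
                  anchor_counts.get(term_text, Counter()).most_common()]
        # Concepts already chosen for this term elsewhere in the document go to the top.
        pinned = [a for a in already_used if a in ranked]
        rest = [a for a in ranked if a not in pinned]
        return pinned + rest

    # Hypothetical usage:
    stats = {"AT&T": Counter({"AT&T Corporation": 1500,
                              "AT&T Mobility": 200,
                              "AT&T Stadium": 30})}
    print(recommend_concepts("AT&T", stats, already_used=["AT&T Mobility"]))
    # -> ['AT&T Mobility', 'AT&T Corporation', 'AT&T Stadium']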
The list of document concepts is shown on the right panel (Fig. 2). The user may click on any concept and mark it as a key concept. This marks all term representations of the concept as keyphrases.

We restricted the term markup to at most one term on a single part of the text; that is, no two different terms may intersect. We have found this restriction to be a reasonable simplification which lightens the user interface and facilitates the user's interaction with the system. Also, our experience in creating WSD tests shows that a single user has no need to make one piece of text a part of several terms, and such a limitation is very common. However, if several users select overlapping parts of the text as terms in their versions of the same document, this is represented in the resulting test as described in Sect. 3.5.

3.3 Preprocessing

To make test creation easier, we provide an automatic preprocessing feature which uses the wikifier described in [6] to automatically detect terms in documents, assign meanings to them and select key concepts. Meanings assigned in this way are marked as non-reviewed. This feature significantly improves the speed and usability of the test creation process, because users only have to review these meanings and the "key" status of document concepts.

Figure 2: Preparing the test. Green terms are reviewed, red ones are unreviewed. Bold concepts on the left side are marked as key concepts.

3.4 Documents and Folders

Documents in WikifyMe are organized in folders. Each folder has a name and, optionally, a description. Users are able to create new folders; the user who creates a new folder is treated as its owner. Each folder is accessible to all users, but only the folder owner can delete it or upload new documents into it. To allow other users to upload new documents to a folder, it has to be marked as "public" by its owner.

Whenever a user opens a document uploaded by another user, a new version of the document is created. This version does not contain any information from the original document except the plain text, so users work on the same documents independently. This is useful because each user is not affected by possible mistakes of others. Users can delete their own versions of documents, but original documents can be deleted only by the owners of the containing folders.

3.5 Getting the Tests

Everyone can get the whole test collection by clicking the "Merge and download" button. WikifyMe will merge all versions of all files and provide the results in a single archive.

The process of merging is quite simple: to merge a set of documents, WikifyMe builds a resulting document which consists of the terms, meanings and key concepts from all these documents. Then the system computes an agreement level for each term and meaning selection (we call it a confidence) and for each key concept selection (a keyphraseness).

The meaning confidence for each term is computed by the formula

    confidence = |selections of this meaning| / |selections of this term|    (1)

The keyphraseness of a key concept is computed as

    keyphraseness = |versions where the concept is key| / |versions where the concept appears|    (2)

WikifyMe also computes the confidence of the term selection itself:

    confidence = |selections of this term| / |other terms overlapping this term|    (3)

We treat two terms as the same if their boundaries match exactly. So the confidence of two terms whose meanings just overlap does not decrease, but the confidence of the term selection does.
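The sketch below shows how formulas (1) and (2) might be computed when merging several user versions of one document. The representation of a version (a mapping from exact spans to concepts plus a set of key concepts) is our simplification, not WikifyMe's actual data model, and formula (3) would additionally require detecting overlapping spans, which is omitted here.

    from collections import Counter, defaultdict
    from typing import Dict, List, Tuple

    Span = Tuple[int, int]  # terms are identified by exact boundaries (start, end)

    def merge(versions: List[Dict]) -> Dict:
        # Each version is {"terms": {span: concept}, "key_concepts": set_of_concepts}.
        term_selections = Counter()                # how many versions selected this exact span
        meaning_selections = defaultdict(Counter)  # span -> how often each concept was chosen
        concept_appears = Counter()                # versions in which the concept appears
        concept_is_key = Counter()                 # versions in which the concept is key

        for v in versions:
            for span, concept in v["terms"].items():
                term_selections[span] += 1
                meaning_selections[span][concept] += 1
            for concept in set(v["terms"].values()):
                concept_appears[concept] += 1
            for concept in v["key_concepts"]:
                concept_is_key[concept] += 1

        # Formula (1): meaning confidence for every (term, meaning) pair.
        meaning_confidence = {
            (span, concept): count / term_selections[span]
            for span, counts in meaning_selections.items()
            for concept, count in counts.items()
        }
        # Formula (2): keyphraseness of every concept.
        keyphraseness = {
            concept: concept_is_key[concept] / concept_appears[concept]
            for concept in concept_appears
        }
        return {"meaning_confidence": meaning_confidence,
                "keyphraseness": keyphraseness}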
3.6 Output format

XML, a widespread format for annotated text files, has been chosen as the output format for merged documents. An example document is shown in Figure 3.

The concept tag defines a concept in the document, with name and id attributes that refer to the Wikipedia article's name and its ID obtained from the Wikipedia dump. The concept tag also contains representation tags, each of which defines a term associated with the containing concept as its meaning. The span attribute has the form "start..end" and indicates the position of the term in the text.

The term tag also defines a term and completely duplicates the information from a certain combination of concept and representation. This redundancy exists because different data structures are more suitable for different tasks: term tags are convenient for word-sense disambiguation, while concept tags are suitable for semantic analysis of the document.

The meaning of the confidence and keyphraseness attributes has been described above.

Figure 3: Example of downloaded XML test file.
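The exact element layout is shown in Figure 3; the sketch below relies only on the tags and attributes named above (concept with name and id, representation with a "start..end" span, and the confidence and keyphraseness attributes), omits the redundant term tags, and uses made-up values, so it should be read as an approximation rather than the actual schema.

    import xml.etree.ElementTree as ET

    # Hypothetical example shaped only by the prose description above;
    # real files produced by WikifyMe may differ in details.
    sample = """
    <document>
      <concept name="Scala (programming language)" id="12345" keyphraseness="1.0">
        <representation span="0..5" confidence="1.0"/>
      </concept>
    </document>
    """

    root = ET.fromstring(sample.strip())
    for concept in root.findall("concept"):
        name = concept.get("name")
        keyphraseness = float(concept.get("keyphraseness", "0"))
        for rep in concept.findall("representation"):
            start, end = (int(x) for x in rep.get("span").split(".."))
            confidence = float(rep.get("confidence", "0"))
            print(name, (start, end), confidence, keyphraseness)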
4 Data

Currently, WikifyMe contains 8 folders with 132 documents from very different sources, from scientific papers and blog posts to summaries from Google News. Such variety is quite helpful for testing on different kinds of texts, and we expect the document collection to be broadened by users.

Greg-January-2008, Monah-DBMS2-May-2008 and radar oreilly jan 2007 refer to blog post collections from the Greg Linden, DBMS2 and Tim O'Reilly blogs respectively. The news google com 26 may 2008 folder contains news articles from Google News for 26th May 2008, while UPI Entertainment 17 22 may 2008 and UPI Health 01 06 june 2008 come from the Entertainment and Health sections of "United Press International". scientific papers, as the name suggests, consists of scientific papers directly converted from PDF to plain text, and sqlsummit-June2008 contains short news summaries from the "SQL Summit" blog. A summary of the corpora is presented in Table 1.

Table 1: Statistics for base corpora

    Folder                              # of terms    avg. doc length
    Greg-January-2008                      661           336.7
    Monah-DBMS2-May-2008                   686           242.7
    news google com 26 may 2008            844           386.3
    radar oreilly jan 2007                 482           803.6
    scientific papers                      858          1761
    sqlsummit-June2008                     419            89.5
    UPI Entertainment 17 22 may 2008      1898           162.6
    UPI Health 01 06 june 2008            1297           201.2

Initially, the base corpora were marked up by roughly one person each; thus the confidence and keyphraseness metrics are about 1.0 and are not representative at the current stage.

Table 2 compares, by number of terms, the manually collected "ground truth" corpus of Kulkarni et al. [7] named IITB, the Milne and Witten [10] test corpus, which was automatically wikified by their tool and then manually verified, and the manually collected WikifyMe corpus. As we can see, at the moment WikifyMe's corpus is comparable to IITB and outperforms the Milne and Witten corpus in the number of tagged terms, so it is suitable for WSD benchmarking tasks.

Table 2: Comparison of corpora

    Corpus                     Number of terms
    WikifyMe                       7145
    Milne et al. tests              314
    Kulkarni et al. (IITB)        17200

5 Conclusion

Although WikifyMe is already a working system, there are still many possibilities to make it better. First, we plan to add the existing test corpora, such as those used by Kulkarni et al. [7] and Milne and Witten [10] in their research.

Since the key to the success of the whole project is the active contribution of users, we will add several features to the web tool to stimulate user activity, for example public statistics on the amount of work done by each user (possibly included in the archive with the tests). We believe this will be worthwhile because it is important for a user to feel that he or she is a part of the project and that the value of their contribution is visible to everyone.

We hope that WikifyMe will gather an active user community and help to create a large and high-quality test collection useful for researchers in wikification.

References

[1] Jeff Barr and Luis Felipe Cabrera. AI gets a brain. Queue, 4:24–29, May 2006.

[2] Timothy Chklovski and Rada Mihalcea. Building a sense tagged corpus with Open Mind Word Expert. In Proceedings of the ACL-02 Workshop on Word Sense Disambiguation: Recent Successes and Future Directions - Volume 8, WSD '02, pages 116–122, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.

[3] Silviu Cucerzan. Large-scale named entity disambiguation based on Wikipedia data. In Proceedings of EMNLP-CoNLL 2007, pages 708–716, 2007.

[4] Paolo Ferragina and Ugo Scaiella. TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM '10, pages 1625–1628, New York, NY, USA, 2010. ACM.

[5] M. Grineva, D. Lizorkin, M. Grinev, A. Boldakov, D. Turdakov, A. Sysoev, and A. Kiyko. Blognoon: Exploring a topic in the blogosphere. In Proceedings of the 18th International Conference on World Wide Web, 2011.

[6] Maria Grineva, Maxim Grinev, and Dmitry Lizorkin. Extracting key terms from noisy and multitheme documents. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 661–670, New York, NY, USA, 2009. ACM.

[7] Sayali Kulkarni, Amit Singh, Ganesh Ramakrishnan, and Soumen Chakrabarti. Collective annotation of Wikipedia entities in web text. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 457–466, New York, NY, USA, 2009. ACM.

[8] Olena Medelyan, Ian H. Witten, and David Milne. Topic indexing with Wikipedia, 2008.

[9] Rada Mihalcea. Using Wikipedia for automatic word sense disambiguation. In North American Chapter of the Association for Computational Linguistics (NAACL 2007), 2007.

[10] David Milne and Ian H. Witten. Learning to link with Wikipedia. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM '08, pages 509–518, New York, NY, USA, 2008. ACM.

[11] Denis Turdakov and Pavel Velikhov. Semantic relatedness metric for Wikipedia concepts based on link analysis and its application to word sense disambiguation. In Proceedings of the SYRCODIS 2008 Colloquium on Databases and Information Systems, 2008.