<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>WikifyMe: Creating Testbed for Wikifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Boldakov</string-name>
          <email>boldakov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denis Turdakov</string-name>
          <email>turdakov@ispras.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <email>sbartunov@gmail.com</email>
          <aff>ISP RAS</aff>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the Spring Researcher's Colloquium on Database and Information Systems</institution>
          ,
          <addr-line>Moscow, Russia, 2011</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Finding relationships between words in a text and Wikipedia articles is an extremely popular task known as wikification. However, there is still no gold-standard corpus for comparing wikifiers. We present WikifyMe, an online tool for collaborative work on a universal test collection, which allows users to easily prepare tests for the two most difficult problems in wikification: word-sense disambiguation and keyphrase extraction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Enriching text documents with links to Wikipedia
pages has become an extremely popular task, called
wikification. Wikification is necessary for
intelligent systems that use knowledge extracted from
Wikipedia for various purposes [
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
        ]. Showing
wikified documents to readers of blogs or news feeds is
common as well [
        <xref ref-type="bibr" rid="ref10 ref4">10, 4</xref>
        ].
      </p>
      <p>Enriching text with links to Wikipedia usually
consists of two steps: extracting key terms from a
document and associating these terms with Wikipedia pages.</p>
      <p>Lexical ambiguity is the main difficulty for
automatic wikification. Therefore, word sense
disambiguation (WSD) is a necessary step for
automatic wikifiers.</p>
      <p>Another challenge for automatic wikification is
choosing which terms should be associated with Wikipedia
articles. Marking every term described in Wikipedia with
a link makes the document hard to read. Therefore, only
the terms most relevant to a particular document should
be presented as links. Such terms are usually called key
terms.</p>
      <p>There are many approaches to automatic wikification.
The most successful wikifiers use supervised learning
algorithms for word sense disambiguation and key term
extraction. For such algorithms, Wikipedia serves as a
training corpus. However, the lack of testing corpora
based on real data makes it extremely hard to compare
different wikifiers and choose the best one.</p>
      <p>To estimate the quality of an automatic wikifier
on real data, part of this data should be wikified manually
by a human expert. The difficulty of manual wikification
depends on the number of key terms that should be linked
to Wikipedia. In the general case, all terms in the text should
be associated with Wikipedia articles, and some of them
should be marked as key terms. This is required for
testing WSD and key term extraction algorithms separately.</p>
      <p>This paper introduces WikifyMe, a Web-based
system that aims at creating large wikified corpora with the
aid of Web users. The system has a user-friendly
interface that makes manual wikification much easier. We
expect that this system will yield good corpora for
comparing different wikifiers at a relatively low cost.</p>
      <p>The rest of the paper is organized as follows.
Related work is described in the next section. Sect. 3 gives
an overview of WikifyMe and explains the
decisions we made during development of the system.
Sect. 4 describes the current dataset.
Conclusions and future work are discussed in Sect. 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>Wikipedia itself is an obvious corpus for evaluating wikifiers.
Each regular Wikipedia page describes one
unambiguous concept and has links to other Wikipedia pages.
In the general case, each link consists of two parts: a
destination page and a caption shown to readers. Therefore,
a link can be interpreted as an annotation of the
caption text with the meaning described by the destination
page. Another assumption concerning internal links is
that Wikipedia users create links only for key terms.
Based on these ideas, researchers extract random samples
of regular Wikipedia pages and use them as testing
corpora.</p>
      <p>The main drawback of this approach is that testing
results are biased for algorithms that use Wikipedia links for
training. In addition, the behaviour on real data of key term
extractors trained on Wikipedia's internal links
is not well studied. Therefore, researchers build
their own corpora based on different data sources.</p>
      <p>
        Mihalcea [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] manually mapped some Wikipedia terms
to WordNet terms in order to carry out experiments on
the commonly accepted standard tests of the Senseval
corpus. However, there is no one-to-one mapping between
Wikipedia and WordNet, so this approach is not
commonly used.
      </p>
      <p>
        Cucerzan created his own corpus to evaluate the
system described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A set of 100 news
stories on a diverse range of topics was marked with
named entities, which were also associated with Wikipedia
articles. This corpus is publicly available, but
its annotations are sparse and limited to a few entity
types.
      </p>
      <p>
        Milne and Witten [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] used the Mechanical Turk [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
service to annotate a subset of 50 documents from the
AQUAINT text corpus, a collection of newswire stories
from the Xinhua News Service, the New York Times,
and the Associated Press. However, they asked annotators to
mark only key terms. Therefore, their corpus cannot be used
for WSD evaluation with high recall.
      </p>
      <p>
        Kulkarni et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] developed a browser-based
annotation tool for creating a test corpus. They collected about
19,000 annotations from six volunteers. Documents for
manual annotation were collected from the links on the
homepages of popular sites belonging to a handful of
domains, including sports, entertainment, science,
technology, and health. The number of distinct Wikipedia
entities that were linked to was about 3,800. About 40%
of the spots were labeled n/a, highlighting the importance
of backoffs. This corpus is good for testing WSD
algorithms, but it does not contain any information about
keywords.
      </p>
      <p>
        A similar corpus was created to evaluate the
algorithms described in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Like the previous one, this
corpus has tags for all possible segments, even those
for which there is no correct mark (these segments are
marked as n/a). This corpus did not provide any
information about keywords either. We added this corpus to
our system, then revised the marks and added information
about keywords.
      </p>
      <p>
        The idea of involving Web users in the creation of
training and testing corpora was described and implemented
in the OMWE project [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The aim of this project was to
create a large corpus for the WSD task with the aid of Web
users. The result of this project was a corpus for the WSD tracks
of the Senseval 3 conference. However, this corpus is
based on WordNet senses. Therefore, it cannot be
used directly for evaluating wikifiers.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Description of the System</title>
      <sec id="sec-3-1">
        <title>Terminology</title>
        <p>To create a new test, the user has to upload and mark
up a text file (we call such a file “a document”).
A document consists of plain text and metadata that represents
terms, concepts and keyphrases. A term models a
continuous part of the text that has significant semantic value
and thus some meaning. Meanings are represented by
concepts, that is, articles in Wikipedia. We defined a
special “not-in-wikipedia” concept for cases when the
term has a valuable sense, but there is no suitable concept
to reflect that sense.</p>
        <p>The union of all term meanings forms the set of the
document’s concepts. Some concepts may be regarded as key
concepts, which reflect the main topic(s) of the document.
We therefore think of keyphrases as the terms (that is, pieces of
text) whose meanings are key concepts.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Process of the Wikification</title>
        <p>The user selects part of the text with the mouse to mark up
a term there. It is very important to select the
term boundaries accurately, so we have implemented several
techniques that help users do that.</p>
        <p>The first feature is expansion of the selection to the
boundaries of the selected words. For example, the selection “Scala is
a great p[rogramming langu]age” would be expanded
to “Scala is a great [programming language]”.</p>
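        <p>The selection expansion described above can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the WikifyMe implementation: the function name expand_to_word_boundaries is hypothetical, the selection is a half-open character range, and a “word” is taken to be any run of non-whitespace characters.</p>
        <preformat><![CDATA[
```python
from typing import Tuple


def expand_to_word_boundaries(text: str, start: int, end: int) -> Tuple[int, int]:
    """Grow the half-open range [start, end) until it covers whole words.

    Assumption: a word is any run of non-whitespace characters.
    """
    # Move the left edge back while the preceding character belongs to a word.
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    # Move the right edge forward while the character at the edge belongs to a word.
    while end < len(text) and not text[end].isspace():
        end += 1
    return start, end


text = "Scala is a great programming language"
start = text.index("rogramming")  # selection begins inside "programming"
end = text.index("age")           # selection stops inside "language"
new_start, new_end = expand_to_word_boundaries(text, start, end)
print(text[new_start:new_end])    # prints: programming language
```
]]></preformat>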
        <p>The second technique removes unnecessary
spaces from the selection: “Evaluation of [delimited
continuations ]is supported” becomes “Evaluation of
[delimited continuations] is supported”. Both
techniques can be enabled or disabled at any moment.</p>
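        <p>Trimming surplus whitespace from a selection is equally simple to sketch. As before, this is a hedged illustration rather than the actual WikifyMe code; trim_selection is a hypothetical name and the selection is assumed to be a half-open character range.</p>
        <preformat><![CDATA[
```python
def trim_selection(text, start, end):
    """Shrink the half-open range [start, end) so it has no outer whitespace."""
    # Skip whitespace at the beginning of the selection.
    while start < end and text[start].isspace():
        start += 1
    # Skip whitespace at the end of the selection.
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = "Evaluation of delimited continuations is supported"
start = text.index("delimited")
end = text.index("is supported")  # the selection includes the trailing space
new_start, new_end = trim_selection(text, start, end)
print(repr(text[new_start:new_end]))  # prints: 'delimited continuations'
```
]]></preformat>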
        <p>After the term has been created, the user is asked to
select a meaning for the term (see Figure 1). The
meaning can be represented by any article in Wikipedia;
however, for each term we provide a list of recommended
concepts. These concepts are obtained from wiki-links
in Wikipedia articles that contain the term text.
The concepts are ranked according to how often links to
them used the term text as the anchor. If a certain concept has already been used
as a meaning for the term in the document, then the
system puts it at the top of the list.</p>
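        <p>The ranking of recommended concepts can be illustrated as follows. This is a minimal sketch under stated assumptions: the function rank_recommendations, its arguments and the anchor counts are all invented for the example, and how WikifyMe breaks ties is not specified in the text (here ties are broken alphabetically).</p>
        <preformat><![CDATA[
```python
def rank_recommendations(anchor_counts, used_in_document=frozenset()):
    """Order candidate concepts for a term.

    anchor_counts maps a concept name to the number of Wikipedia links that
    point to it with the term text as the anchor. Concepts already chosen for
    this term in the current document come first; the rest follow by
    descending anchor count.
    """
    def sort_key(concept):
        # False sorts before True, so concepts used in the document lead.
        return (concept not in used_in_document, -anchor_counts[concept], concept)

    return sorted(anchor_counts, key=sort_key)


# Made-up counts for the term text "Scala".
counts = {
    "Scala (programming language)": 120,
    "La Scala": 300,
    "Scala (company)": 15,
}
print(rank_recommendations(counts))
print(rank_recommendations(counts, {"Scala (programming language)"}))
```
]]></preformat>
        <p>With no prior choice, the most frequently linked concept is listed first; once the user has picked a concept for the term, that concept moves to the top.</p>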
        <p>[Figure 1: List of recommended meanings for the …]</p>
        <p>The list of document concepts is shown on the right
panel (Figure 2). The user may click on any concept and mark it
as a key concept. This marks all term representations of
the concept as keyphrases.</p>
        <p>We have restricted term markup to only one term
per part of the text. That means that no two different
terms may overlap. We have found
this restriction to be a reasonable simplification that
lightens the user interface and eases the user’s interaction
with the system. Also, our experience in creating
WSD tests shows that a single user has no need to make
one piece of text a part of several terms, and this
limitation is very common. However, if several users select
overlapping parts of the text as terms in their versions of
the same document, this will be represented in the
resulting test, as described in Sect. 3.5.</p>
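        <p>The non-overlap restriction amounts to a standard interval intersection test. The sketch below is an illustration only; the names spans_overlap and can_add_term are hypothetical, and terms are assumed to be half-open (start, end) character ranges.</p>
        <preformat><![CDATA[
```python
def spans_overlap(a, b):
    """Return True if half-open spans a and b share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def can_add_term(existing_terms, candidate):
    """A new term is allowed only if it overlaps no existing term."""
    return not any(spans_overlap(candidate, term) for term in existing_terms)


terms = [(0, 5), (17, 37)]
print(can_add_term(terms, (6, 8)))    # prints: True
print(can_add_term(terms, (28, 40)))  # prints: False
```
]]></preformat>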
      </sec>
      <sec id="sec-3-3">
        <title>Preprocessing</title>
        <p>
          To make test creation easier, we
provide an automatic preprocessing feature that uses the wikifier
described in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] to automatically detect terms in
documents, assign appropriate meanings to them and select key
concepts. Meanings assigned in this way are marked as
non-reviewed. This feature significantly improves the
speed and usability of the test creation process because users
only need to review these meanings and the “key” status
of document concepts.
        </p>
        <p>
          Documents in WikifyMe are organized in folders. Each
folder has a name and, optionally, a description. Users
are able to create new folders; the user who creates a
folder is treated as its owner. Each folder is
accessible to all users. However, only the folder owner can
delete it or upload new documents into it. To allow other
users to upload new documents to the folder, it has to be
marked as “public” by its owner.
        </p>
        <p>Whenever a user opens a document uploaded by
another user, a new version of the document is
created. This version does not contain any information from
the original document except the plain text, so users
work on the same documents independently. This is
good because no user is affected by the possible
mistakes of others. Users can delete their own versions of
documents, but original documents can be deleted only by
the owners of the containing folders.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Getting the Tests</title>
        <p>Anyone can get the whole test collection by clicking
the “Merge and download” button. WikifyMe will merge
all versions of all files and provide the results in a single
archive.</p>
        <p>The merging process is quite simple: to merge
a set of documents, WikifyMe builds a resulting
document that consists of the terms, meanings and key
concepts from all these documents. Then the system computes
an agreement level (we call it confidence) for each term
and meaning selection, and for each key concept selection (keyphraseness).</p>
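        <p>The meaning confidence for each term is computed as:
          <disp-formula>
            <label>(1)</label>
            <tex-math><![CDATA[\mathrm{confidence} = \frac{|\text{selections of this meaning}|}{|\text{selections of this term}|}]]></tex-math>
          </disp-formula>
The keyphraseness of a key concept is computed as:
          <disp-formula>
            <label>(2)</label>
            <tex-math><![CDATA[\mathrm{keyphraseness} = \frac{|\text{versions where the concept is key}|}{|\text{versions where the concept appears}|}]]></tex-math>
          </disp-formula>
WikifyMe also computes the confidence of term selection:
          <disp-formula>
            <tex-math><![CDATA[\mathrm{confidence} = \frac{|\text{selections of this term}|}{|\text{other terms overlapped by this term}|}]]></tex-math>
          </disp-formula>
        </p>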
        <p>We treat two terms as the same if their boundaries
match exactly. Thus, when two terms merely overlap, the
confidence of their meanings does not decrease, but the
confidence of term selection does.</p>
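        <p>To make the meaning confidence and keyphraseness ratios concrete, the following sketch computes them over several versions of one document. It is an illustration under our own assumptions, not the WikifyMe merge code: the function merge_versions and the dictionary layout of a version are hypothetical, terms are identified by exact (start, end) boundaries as stated above, and the term-selection confidence is left out.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from fractions import Fraction


def merge_versions(versions):
    """Compute meaning confidence and keyphraseness over document versions.

    Each version is a dict (a hypothetical layout) with:
      "terms": {(start, end): concept_name}
      "key_concepts": set of concept names marked as key
    """
    term_selections = Counter()     # versions that selected a given span
    meaning_selections = Counter()  # versions that chose a concept for a span
    concept_appears = Counter()     # versions in which a concept appears
    concept_is_key = Counter()      # versions in which a concept is key

    for version in versions:
        for span, concept in version["terms"].items():
            term_selections[span] += 1
            meaning_selections[(span, concept)] += 1
        present = set(version["terms"].values())
        for concept in present:
            concept_appears[concept] += 1
        for concept in present & version["key_concepts"]:
            concept_is_key[concept] += 1

    # Formula (1): selections of this meaning / selections of this term.
    confidence = {
        (span, concept): Fraction(count, term_selections[span])
        for (span, concept), count in meaning_selections.items()
    }
    # Formula (2): versions where the concept is key / versions where it appears.
    keyphraseness = {
        concept: Fraction(concept_is_key[concept], total)
        for concept, total in concept_appears.items()
    }
    return confidence, keyphraseness


# Three made-up versions of the same document.
versions = [
    {"terms": {(0, 5): "Scala (programming language)", (17, 37): "Programming language"},
     "key_concepts": {"Scala (programming language)"}},
    {"terms": {(0, 5): "Scala (programming language)"},
     "key_concepts": {"Scala (programming language)"}},
    {"terms": {(0, 5): "La Scala"}, "key_concepts": set()},
]
confidence, keyphraseness = merge_versions(versions)
print(confidence[((0, 5), "Scala (programming language)")])  # prints: 2/3
print(keyphraseness["Scala (programming language)"])         # prints: 1
```
]]></preformat>
        <p>In this example two of the three versions agree on the meaning of the first term, so its confidence is 2/3, while the concept is key in both versions where it appears, so its keyphraseness is 1.</p>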
        <p>The concept tag defines a concept in the
document; its name and id attributes refer to the Wikipedia
article’s name and ID, obtained from a Wikipedia dump.
The concept tag also contains representation tags,
each of which defines a term whose meaning is the
containing concept. The span attribute has the form
“start..end” and indicates the position of the term in the text.</p>
        <p>The term tag also defines a term and completely
duplicates the information from a certain combination of
concept and representation. This redundancy exists
because different data structures are more suitable for
different tasks. Thus, term tags are convenient for
word-sense disambiguation, while concept tags are
suitable for semantic analysis of the document.</p>
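        <p>A reader of the test files might extract concepts and their term positions roughly as follows. This is a sketch under explicit assumptions: only the concept and representation tags and the name, id and span attributes come from the description above, whereas the document root element, the exact nesting, the sample values and the helper names parse_span and concept_spans are invented for illustration.</p>
        <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the root element, nesting and attribute values are
# assumptions; only the tag and attribute names follow the description.
SAMPLE = """
<document>
  <concept name="Scala (programming language)" id="1">
    <representation span="0..5"/>
    <representation span="40..45"/>
  </concept>
  <concept name="Programming language" id="2">
    <representation span="17..37"/>
  </concept>
</document>
"""


def parse_span(value):
    """Turn a "start..end" string into a pair of integers."""
    start, end = value.split("..")
    return int(start), int(end)


def concept_spans(xml_text):
    """Map each concept name to the list of spans of its representations."""
    root = ET.fromstring(xml_text)
    result = {}
    for concept in root.iter("concept"):
        spans = [parse_span(rep.get("span")) for rep in concept.iter("representation")]
        result[concept.get("name")] = spans
    return result


print(concept_spans(SAMPLE))
```
]]></preformat>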
        <p>The meaning of the confidence and keyphraseness attributes
has been described above.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Data</title>
      <p>Currently, WikifyMe contains 8 folders with 132
documents from very different sources, from scientific
papers and blog posts to summaries from Google News.
Such variety is quite helpful for testing on different kinds
of texts, and we expect the document collection to be
broadened by users.</p>
      <p>Greg-January-2008, Monah-DBMS2-May-2008 and
radar oreilly jan 2007 are collections of blog posts
from the Greg Linden, DBMS2 and Tim O’Reilly blogs
respectively. The news google com 26 may 2008 folder
contains news articles from Google News as of 26 May 2008;
UPI Entertainment 17 22 may 2008
and UPI Health 01 06 june 2008 contain articles from the Entertainment and
Health sections of “United Press International” respectively.
The scientific papers folder, as the name suggests, consists of
scientific papers directly converted from PDF to plain text,
and sqlsummit-June2008 contains short news summaries
from the “SQL Summit” blog. A summary of the corpora is
presented in Table 1.</p>
      <p>Initially, each document of the base corpora has been marked up by one
person on average, so the confidence and
keyphraseness metrics are about 1.0 and are not representative at
the current stage.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>
        Although WikifyMe is already a ready-to-use system,
there are still many ways to improve it.
First, we plan to add existing test corpora, such as those that
Kulkarni et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Milne et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] used in their
research.
      </p>
      <p>Since the key to the success of the whole project is active
contribution from users, we will add several features to the web
tool to stimulate user activity, for example, public
statistics on the amount of work done by each user (possibly
included in the archive with the tests). We believe this
makes sense because it is important for users to feel that
they are part of the project and that the value of their
contribution is visible to everyone.</p>
      <p>We hope that WikifyMe will gather an active user
community and help create a large, high-quality
test collection useful for researchers in wikification.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Jeff</given-names>
            <surname>Barr</surname>
          </string-name>
          and
          <string-name>
            <given-names>Luis Felipe</given-names>
            <surname>Cabrera</surname>
          </string-name>
          .
          <article-title>AI gets a brain</article-title>
          .
          <source>Queue</source>
          ,
          <volume>4</volume>
          :
          <fpage>24</fpage>
          -
          <lpage>29</lpage>
          , May
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Chklovski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rada</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          .
          <article-title>Building a sense tagged corpus with open mind word expert</article-title>
          .
          <source>In Proceedings of the ACL-02 workshop on Word sense disambiguation: recent successes and future directions - Volume 8, WSD '02</source>
          , pages
          <fpage>116</fpage>
          -
          <lpage>122</lpage>
          , Stroudsburg, PA, USA,
          <year>2002</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Silviu</given-names>
            <surname>Cucerzan</surname>
          </string-name>
          .
          <article-title>Large-scale named entity disambiguation based on wikipedia data</article-title>
          .
          <source>In Proceedings of EMNLP-CoNLL 2007</source>
          , pages
          <fpage>708</fpage>
          -
          <lpage>716</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Ferragina</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ugo</given-names>
            <surname>Scaiella</surname>
          </string-name>
          .
          <article-title>Tagme: on-the-fly annotation of short text fragments (by wikipedia entities)</article-title>
          .
          <source>In Proceedings of the 19th ACM international conference on Information and knowledge management</source>
          ,
          <source>CIKM '10</source>
          , pages
          <fpage>1625</fpage>
          -
          <lpage>1628</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grineva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lizorkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grinev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Boldakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Turdakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sysoev</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kiyko</surname>
          </string-name>
          .
          <article-title>Blognoon: Exploring a topic in the blogosphere</article-title>
          .
          <source>In Proceedings of the 18th international conference on World wide web</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Maria</given-names>
            <surname>Grineva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Maxim</given-names>
            <surname>Grinev</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Dmitry</given-names>
            <surname>Lizorkin</surname>
          </string-name>
          .
          <article-title>Extracting key terms from noisy and multitheme documents</article-title>
          .
          <source>In Proceedings of the 18th international conference on World wide web, WWW '09</source>
          , pages
          <fpage>661</fpage>
          -
          <lpage>670</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Sayali</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Amit</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ganesh</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Soumen</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          .
          <article-title>Collective annotation of Wikipedia entities in web text</article-title>
          .
          <source>In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          ,
          <source>KDD '09</source>
          , pages
          <fpage>457</fpage>
          -
          <lpage>466</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Olena</given-names>
            <surname>Medelyan</surname>
          </string-name>
          ,
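          <string-name>
            <given-names>Ian H.</given-names>
            <surname>Witten</surname>
          </string-name>
          , and
          <string-name>
            <given-names>David</given-names>
            <surname>Milne</surname>
          </string-name>
          .
          <article-title>Topic indexing with wikipedia</article-title>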
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Rada</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          .
          <article-title>Using wikipedia for automatic word sense disambiguation</article-title>
          .
          <source>In North American Chapter of the Association for Computational Linguistics (NAACL 2007)</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>David</given-names>
            <surname>Milne</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ian H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>Learning to link with wikipedia</article-title>
          .
          <source>In Proceeding of the 17th ACM conference on Information and knowledge management</source>
          ,
          <source>CIKM '08</source>
          , pages
          <fpage>509</fpage>
          -
          <lpage>518</lpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Denis</given-names>
            <surname>Turdakov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Velikhov</surname>
          </string-name>
          .
          <article-title>Semantic relatedness metric for wikipedia concepts based on link analysis and its application to word sense disambiguation</article-title>
          .
          <source>In Proceedings of the SYRCODIS 2008 Colloquium on Databases and Information Systems</source>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>