<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards an Integrated Corpus for the Evaluation of Named Entity Recognition and Object Consolidation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Knud Möller</string-name>
          <email>knud.moeller@deri.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Schutz</string-name>
          <email>schutz@coli.uni-sb.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Decker</string-name>
          <email>stefan.decker@deri.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Digital Enterprise Research Institute, National University of Ireland</institution>
          ,
          <addr-line>Galway</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institut für Allgemeine Linguistik, Universität des Saarlandes</institution>
          ,
          <addr-line>Saarbrücken</addr-line>
        </aff>
      </contrib-group>
      <fpage>131</fpage>
      <lpage>133</lpage>
      <abstract>
        <p>When faced with the task of incorporating legacy web data from existing HTML pages into the Semantic Web (SW), a widespread approach is to use Information Extraction (IE) and Named Entity Recognition (NER) techniques. Natural language texts are annotated automatically or semi-automatically, and formal data is thus extracted from the texts. While this makes it possible to add new sets of data to the SW, the process cannot stop there. The newly created formal data must be integrated with existing formal data, i.e. entities which are identical in both sets must be identified. In summary, two main problems have to be tackled to allow the integration of information from unstructured data into the SW: 1. Find the set of entities ED in a document (NER), and possibly detect coreference chains within the document. 2. Find matches between the elements of ED and entities in a pre-existing knowledge base. In order to evaluate any system that tries to tackle both of these problems (e.g. KIM [1] or SemTag and Seeker [2]), conventional corpora are not suitable, since they are mostly tailored towards IE and NER only. Such corpora can be used to evaluate a system's performance on an intra-document basis, i.e. how well it detects entities in a document and possibly chains of coreference between them. What is needed, however, is a means of evaluating how well a system matches the entities in a document against corresponding entities in a database. This problem falls into the area of Object Consolidation. We therefore propose a novel kind of corpus, which we call an Integrated Corpus for Named Entity Recognition and Object Consolidation. The first incentive for proposing such a corpus came when we were looking for a way to evaluate the Geco project [3].</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Our proposed integrated corpus consists of two interrelated parts:
– A textual corpus C1 in which the entities are annotated.
– A knowledge base (KB) C2 containing objects corresponding to the entities
mentioned in C1.</p>
      <p>These two parts are integrated by linking the annotated entities in C1 to the
corresponding objects in C2, as Figure 2 illustrates.</p>
      <p>[Figure 2: annotated entities in the textual corpus C1 are linked to the objects #alex, #matt, #jane and #berit in the knowledge base C2]</p>
      <p>For the first version of our corpus, we defined a set of 40 documents of
approximately 100 words each. These documents were excerpts from Wikipedia
(http://en.wikipedia.org) biographies of various politicians, actors, scientists,
bands, fictional and non-fictional characters, etc. We compiled the corpus with
the aim of including challenging problems for both the NER and the object
consolidation task, such as different forms of the same name (e.g. “Bill Clinton”,
“Clinton”, “Billy”), potentially ambiguous tokens (e.g. “Hope”: location/verb)
and pseudonyms (e.g. “Ringo Starr”, “Richard Starkey”). The corpus was then
annotated by a single human annotator, currently with respect to only three
annotation types: person, location and jobtitle.</p>
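The consolidation challenges listed above can be illustrated with a toy matcher; the knowledge-base identifiers and alias sets below are invented for illustration and are not part of the corpus:

```python
# Toy sketch of the object-consolidation challenge: map surface forms found
# in a document to entities in a small knowledge base via known aliases.
# All identifiers and aliases are illustrative examples.

KB = {
    "person:bill_clinton": {"Bill Clinton", "Clinton", "Billy"},
    "person:ringo_starr": {"Ringo Starr", "Richard Starkey"},
    "location:hope": {"Hope"},
}

def consolidate(mention):
    """Return the KB identifiers whose alias sets contain the mention."""
    return sorted(kb_id for kb_id, aliases in KB.items() if mention in aliases)
```

Note that a string match alone cannot resolve a token like “Hope”, which may also be a verb in running text; that case requires contextual disambiguation by the NER component.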
      <p>In order to allow the integration of the textual corpus and the KB, the latter
has to contain the same entities as those mentioned in the text. Of the 205 person
annotations in the textual corpus, 95 referred to individual entities. For each of these
entities, we included a corresponding entity in the KB. Within the Geco project,
we were working with FOAF (http://xmlns.com/foaf/0.1/) representations of people.
For this reason, we chose to build a KB of foaf:Person instances.</p>
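A KB entry of this kind might look like the following foaf:Person instance; the name and email address are invented for illustration:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person>
    <foaf:name>Bill Clinton</foaf:name>
    <foaf:mbox rdf:resource="mailto:bill.clinton@example.org"/>
  </foaf:Person>
</rdf:RDF>
```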
      <p>Having completed both parts of the corpus, they had to be tied together. This
was achieved by referencing the Person instances in the knowledge base from the
annotations in the textual corpus. In FOAF, the assumption is made that each
person can be uniquely identified by her email address. We therefore used email
addresses (both real and made-up) as the referencing scheme. Once both parts of
the corpus had been related in that way, the Integrated Corpus was complete.
</p>
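Under the email-based referencing scheme described above, tying the two parts of the corpus together amounts to a join on the email address. A minimal sketch, with all addresses and identifiers made up for illustration:

```python
# Sketch of linking textual annotations to KB instances via email addresses,
# following FOAF's convention that a person is identified by their mailbox.
# All addresses and identifiers below are invented for illustration.

kb_persons = {
    "mailto:bill.clinton@example.org": "person:bill_clinton",
    "mailto:r.starkey@example.org": "person:ringo_starr",
}

annotations = [
    {"text": "Clinton", "ref": "mailto:bill.clinton@example.org"},
    {"text": "Richard Starkey", "ref": "mailto:r.starkey@example.org"},
]

def link(annotations, kb_persons):
    """Pair each annotation with the KB identifier its email reference denotes."""
    return [(a["text"], kb_persons[a["ref"]]) for a in annotations]
```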
    </sec>
    <sec id="sec-2">
      <title>Future Work</title>
      <p>In this paper, we proposed a novel kind of evaluation corpus, which we call an
Integrated Corpus for the Evaluation of Named Entity Recognition and
Object Consolidation. It can be used to evaluate both NER systems and
systems that address object consolidation problems. We are aware that future
versions of the textual part of our corpus will have to be extended in both size
and depth: we will extend the number of documents, their scope and the number
of annotation types. Another important task for a future version of our corpus
is the development of suitable evaluation metrics. The conventional recall,
precision and F-measure metrics could be applied individually to the textual
part of the corpus and to the linking between the annotations and the instances
in the knowledge base. However, it would be desirable to provide a combined
measure in order to rate the overall performance of a system with respect to
our corpus.</p>
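One simple possibility, not prescribed by the corpus and shown here only as a sketch, would be to compute the standard metrics per subtask and then combine the two F-measures, e.g. by a harmonic mean; all counts are invented:

```python
def precision_recall_f1(true_positives, predicted, actual):
    """Standard metrics: predicted/actual are total counts for one subtask."""
    p = true_positives / predicted if predicted else 0.0
    r = true_positives / actual if actual else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def combined_score(f1_annotation, f1_linking):
    """One possible overall score: harmonic mean of the two subtask F-measures."""
    if f1_annotation + f1_linking == 0:
        return 0.0
    return 2 * f1_annotation * f1_linking / (f1_annotation + f1_linking)
```

A harmonic mean is only one design choice; it penalises systems that do well on annotation but poorly on linking (or vice versa) more strongly than an arithmetic mean would.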
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Popov</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiryakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ognyanoff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goranov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>KIM - Semantic Annotation Platform</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          <volume>124</volume>
          (
          <year>2003</year>
          )
          <fpage>834</fpage>
          -
          <lpage>849</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Dill</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiron</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gibson</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gruhl</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guha</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jhingran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanungo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajagopalan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomkins</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomlin</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zien</surname>
            ,
            <given-names>J.Y.</given-names>
          </string-name>
          :
          <article-title>Semtag and seeker: bootstrapping the semantic web via automated semantic annotation</article-title>
          .
          <source>In: Proceedings of the twelfth international conference on World Wide Web</source>
          , ACM Press (
          <year>2003</year>
          )
          <fpage>178</fpage>
          -
          <lpage>186</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] Möller, K.:
          <article-title>Geco - using human language technology to enhance semantic web browsing</article-title>
          .
          <source>In: Proceedings of the Faculty of Engineering Research Day 2004</source>
          , National University of Ireland, Galway. (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>