<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Statistical Analyses of Named Entity Disambiguation Benchmarks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nadine Steinmetz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Magnus Knuth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hasso Plattner Institute for Software Systems Engineering</institution>
          ,
          <addr-line>Potsdam</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, various tools for the automatic semantic annotation of textual information have emerged. The main challenge of all approaches is to resolve the ambiguity of natural language and to assign unique semantic entities according to the present context. To compare the different approaches, a ground truth, namely an annotated benchmark, is essential. However, besides the actual disambiguation approach, the achieved evaluation results also depend on the characteristics of the benchmark dataset and on the expressiveness of the dictionary applied to determine entity candidates. This paper presents statistical analyses and mapping experiments on different benchmarks and dictionaries to identify the characteristics and structure of the respective datasets.</p>
      </abstract>
      <kwd-group>
        <kwd>named entity disambiguation</kwd>
        <kwd>benchmark evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>One essential step in understanding textual information is the identification of semantic
concepts within natural language texts. Therefore, multiple Named Entity Recognition
systems have been developed and integrated into content management and
information retrieval systems to handle the flood of information.</p>
      <p>We have to distinguish between Named Entity Recognition (NER) systems, which
find meaningful entities of a specific predetermined type (e. g. persons, locations,
or organizations) within a given natural language text, and Named Entity
Disambiguation (NED) systems (sometimes also referred to as Named Entity Mapping
or Named Entity Linking), which take the NER process one step further by interpreting
named entities to assign a unique meaning (entity) to a sequence of terms. In order
to achieve this, first all potential entity candidates for a phrase have to be determined
with the help of a dictionary. The number of potential entity candidates corresponds to
the level of ambiguity of the underlying text phrase. Taking into account the context
of the phrase, e. g. the sentence in which the phrase occurs, a unique entity is selected
according to the meaning of the phrase in a subsequent disambiguation step.</p>
      <p>Multiple efforts compete in this discipline. However, the comparison of different NED
systems is difficult, especially if they do not use a common dictionary for entity candidate
determination. Therefore, it is highly desirable to provide common benchmarks for
evaluation. On the other hand, benchmarks are applied to tune a NED system for its
intended purpose and/or a specific domain, i. e. context and pragmatics of the NED
system are fixed to a specific task. For this reason, multiple benchmark datasets have
been created to evaluate such systems. To evaluate a NED system and to compare its
performance against already existing solutions, the system's developer should be aware
of the characteristics of the available benchmarks.</p>
      <p>In this paper, prominent datasets (dictionary datasets as well as benchmark
datasets) are analyzed to gain better insights into both their characteristics and
their capabilities, while also considering potential drawbacks. The datasets are
statistically analyzed for mapping coverage, level of ambiguity, maximum achievable recall, and
difficulty. All benchmarks and evaluation results are available online to enable
more target-oriented evaluations of NER and NED systems.</p>
      <p>The paper is organized as follows: Section 2 gives an overview of NED tools and
comparison approaches and introduces the benchmarks and dictionaries utilized in this
paper. Statistical information about the benchmarks is presented in Section 3.
Experiments using four different dictionaries on three different benchmarks are described
and discussed in Section 4. Section 5 concludes the paper and summarizes the scientific
contribution.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Semantic annotation of textual information in web documents has become a key
technology for data mining and information retrieval and a key step towards the Semantic
Web. Several tools for automatic semantic annotation have emerged for this task and
created a strong demand for evaluation benchmarks to enable comparison. Therefore,
a number of benchmarks containing natural language texts annotated with
semantic entities have been created. Cornolti et al. present a benchmarking framework for
entity-annotation tools and also compare the performance of various systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This
evaluation indicates a difference between several applied datasets, but does not analyze
its causes in further detail. Gangemi describes an approach for comparing different
annotation tools without the application of a benchmark [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The baseline for the
evaluation is defined by the maximum agreement of all evaluated automatic semantic
annotation tools. Unfortunately, such a baseline does not take into account different
semantic annotation levels in terms of the special purposes the evaluated tools have
been developed for.
      </p>
      <p>
        DBpedia Spotlight is an established NED application that applies an analytical
approach for the disambiguation process. Every entity candidate of a surface form
found in the text is represented by a vector composed of all terms that co-occurred
within the same paragraphs of the Wikipedia articles where this entity is linked [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The
approach has been evaluated on a benchmark containing ten semantically annotated
New York Times articles. This benchmark is described in Section 3.1 and part of the
presented experiments. DBpedia Spotlight applies a Wikipedia-based dictionary, a
Lexicalization dataset, to determine potential entity candidates. This dataset is also
part of the presented experiments and described in Section 4.1.
      </p>
      <p>
        AIDA is an online tool for disambiguation of named entities in natural language text
and tables [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. It utilizes relationships between named entities for the disambiguation.
AIDA applies a dictionary called AIDA Means to determine potential entity candidates.
This dictionary is described in Section 4.1 and is also examined
in the experiments described in Section 4. AIDA has been evaluated on a
benchmark created from the CoNLL 2003 dataset<fn><label>1</label><p>http://www.cnts.ua.ac.be/conll2003/ner/</p></fn>. Since this dataset is not freely
available, KORE 50, a subset of the AIDA benchmark dataset, has been used for the
experiments in this paper; it is described in Section 3.1.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Benchmark Dataset Evaluation</title>
      <sec id="sec-3-1">
        <title>Benchmark Datasets</title>
        <p>The benchmark datasets under consideration contain annotated texts linking enclosed
lexemes to entities. Based on these benchmarks, the performance of NED systems can
be evaluated. Within this work, we restrict our selection of benchmark datasets to
those (a) containing English language texts, (b) originating from authentic documents
(e. g. newswire), (c) containing annotations to DBpedia entities or Wikipedia articles,
and (d) involving context at least on sentence level.</p>
        <p>
          The DBpedia Spotlight dataset [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] has been created for the eponymous NED tool.
It contains 60 natural language sentences from ten different New York Times articles
with 249 annotated DBpedia entities overall. The entities are not explicitly bound
to mentions within the texts, which causes a certain lack of clarity. Therefore, we have
retroactively allocated the entities to their positions within the texts to the best of our knowledge.
The entities dbp:Markup_language and dbp:PBC_CSKA_Moscow could not be linked in
the texts, since a more specific entity was also listed that occupies their only
possible location, e. g. hypertext markup language has been annotated with dbp:HTML
rather than dbp:Markup_language.
        </p>
        <p>
          KORE 50 (AIDA) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is a subset of the larger AIDA corpus [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which is based
on the dataset of the CoNLL 2003 NER task. The dataset aims to capture
hard-to-disambiguate mentions of entities and it contains a large number of first names referring
to persons, whose identity needs to be deduced from the given context. It comprises
50 sentences from different domains, such as music, celebrities, and business, and is
provided in a clear TSV format.
        </p>
        <p>
          The Wikilinks Corpus [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] has been introduced recently by Google. The corpus
collects hyperlinks to Wikipedia gathered from over 3 million web sites. It has been
transformed to RDF using the NLP Interchange Format (NIF) by Hellmann et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
The corpus is divided into 68 RDF dump files, of which the first one<sup>2</sup> has been used
for Lexicalization Statistics (cf. Section 4). The intention behind links to Wikipedia
articles needs to be considered differently from the intention of the other
two datasets, since the links have been created for informational reasons rather than as annotations. For each
annotation the original website is named, which makes it possible to recover the full document
contexts for the annotations, though they are not contained in the NIF resource so
far. This benchmark cannot be considered a gold standard. In some cases mentions
are linked to broken URLs, redirects, or semantically wrong entities. This issue is also
discussed in Section 4.
        </p>
        <p>
          For further processing, NIF representations of KORE 50 and DBpedia Spotlight have
been created, which are accessible at our website<sup>3</sup>. Further datasets not considered
in this paper are e. g. the complete AIDA/CoNLL corpus [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the WePS (Web people
search) evaluation dataset [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], the cross-document Italian people coreference (CRIPCO)
corpus [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and the corpus for cross-document coreference by Day et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
          <fn><label>2</label><p>It can be assumed that the slices are homogeneously mixed.</p></fn>
          <fn><label>3</label><p>http://www.yovisto.com/labs/ner-benchmarks/</p></fn>
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Benchmark Statistics</title>
        <p>The three benchmark datasets under consideration cover different domains: although
all datasets originate from authentic corpora, varying portions have been selected and
different types of entities have been annotated. Table 1 shows the distribution of
DBpedia types within the benchmark datasets.</p>
        <p>About 10% of the annotated entities in the DBpedia Spotlight dataset are locations,
and the majority (about 80%) of the annotated entities are not associated with any type
information in DBpedia. Since the DBpedia Spotlight dataset originates from New
York Times articles, the annotations are embedded in document contexts.</p>
        <p>The KORE 50 dataset contains 144 annotations, which mostly refer to agents (74
times dbo:Person and 28 times dbo:Organisation). Only a relatively small share
(18.5%) of the annotated entities does not provide any type information in DBpedia. The
context for the annotated entities in the KORE 50 dataset is limited to (relatively
short) sentences.</p>
        <p>By far the largest dataset is Wikilinks. Its sheer size makes it possible to extract sub-benchmarks
for specific designated domains, e. g. there are about 281,000 mentions of 8,594 different
diseases. However, a large share (66%) of the annotated entities does not provide any
type information in DBpedia, and the largest share of the typed entities refers to an
agent (18.9%).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Lexicalization Statistics and Discussion</title>
      <p>The benchmarks described in Section 3.1 are constructed to evaluate NED algorithms.
The evaluation results of a NED method are not only dependent on the actual algorithm
used to disambiguate ambiguous mentions but also on the structure of the benchmark
and the underlying dictionary utilized to determine entity candidates for a mention.
A mention mapping or mapped mention refers to a mention of a benchmark that is
assigned to one or more entity candidates of the used dictionary. The following section
introduces several dictionaries.</p>
      <sec id="sec-4-1">
        <title>Dictionary Datasets</title>
        <p>Dictionaries contain associations that map strings (surface forms) to entities
represented by Wikipedia articles or DBpedia concepts. Typically, dictionaries are applied
by NED systems in an early step to find candidates for lexemes in natural language
texts. In a further (disambiguation) step the actual correct entity has to be selected
from all these candidates.</p>
        <p>
          The DBpedia Lexicalizations dataset [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] has been extracted from Wikipedia
interwiki links. It contains anchor texts, the so-called surface forms, with their
respective destination articles. Overall, there are 2 million entries in the DBpedia
Lexicalizations dataset. For each combination the conditional probabilities P(uri | surfaceform)<sup>4</sup>,
P(surfaceform | uri), and the pointwise mutual information value (PMI) are given.
Subsequently, this dictionary is referred to as DBL (DBpedia Lexicalizations).
        </p>
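        <p>The three scores attached to each DBL entry can be illustrated with a small sketch. It assumes that the scores are derived from raw counts of (surface form, URI) link pairs and that PMI follows the standard definition with a natural logarithm; the source does not state either detail, and the link data and the function name scores are hypothetical.</p>
        <preformat>
```python
import math
from collections import Counter

# Hypothetical (surface form, URI) pairs, one per observed link anchor.
links = [
    ("Berlin", "dbp:Berlin"),
    ("Berlin", "dbp:Berlin"),
    ("Berlin", "dbp:Berlin_(band)"),
    ("capital of Germany", "dbp:Berlin"),
]

pair_count = Counter(links)
sf_count = Counter(sf for sf, _ in links)
uri_count = Counter(uri for _, uri in links)
total = len(links)


def scores(sf, uri):
    """Return P(uri | sf), P(sf | uri) and PMI(sf, uri) from raw counts.

    Assumes the pair (sf, uri) was observed at least once.
    """
    joint = pair_count[(sf, uri)]
    p_uri_given_sf = joint / sf_count[sf]
    p_sf_given_uri = joint / uri_count[uri]
    # Standard definition: PMI = log( P(sf, uri) / (P(sf) * P(uri)) )
    p_joint = joint / total
    p_sf = sf_count[sf] / total
    p_uri = uri_count[uri] / total
    pmi = math.log(p_joint / (p_sf * p_uri))
    return p_uri_given_sf, p_sf_given_uri, pmi


print(scores("Berlin", "dbp:Berlin"))
```
</preformat>
        <p>In this toy data the surface form Berlin links to dbp:Berlin in two of three cases, so P(uri | surfaceform) is 2/3; this is the quantity used later as Anchor-Link-Probability.</p>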
        <p>
          Google has released a similar, but far larger dataset: Crosswiki [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The Crosswiki
dictionary has been built at web scale and includes 378 million entries. This
dictionary is subsequently referred to as GCW. Similar to the DBL dataset, the probability
P(uri | surfaceform) has been calculated and is available in the dictionary. This
probability is used for the experiments described in Section 4.2.
        </p>
        <p>
          The AIDA Means dictionary is an extended version of the YAGO2<sup>5</sup> means
relation. The YAGO means relation is harvested from disambiguation pages, redirects,
and links in Wikipedia [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Unfortunately, no information is given on what exactly the
extension includes. The AIDA Means dictionary contains 18 million entries.
Subsequently, this dictionary is referred to as AIDA.
          <fn><label>4</label><p>The measure is used later on for the experiments as Anchor-Link-Probability (cf. Section 4).</p></fn>
          <fn><label>5</label><p>http://www.yago-knowledge.org/</p></fn>
        </p>
        <p>In addition to the three existing dictionaries described above, we have
constructed our own dictionary. Similar to the YAGO means relation, this dictionary
has been constructed by resolving disambiguation pages and redirects and using these
alternative labels in addition to the original labels of the DBpedia entities. Except for
the elimination of bracket terms (e. g. the label Berlin (2009 film) is converted to
Berlin by removing the brackets and the term within them), no further preprocessing
has been performed on this dictionary. Thus, all labels are kept in their original
case. Further evaluation of this issue is described in Section 4.3. This dictionary
is subsequently referred to as RDM (Redirect Disambiguation Mapping).</p>
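        <p>The construction of such a dictionary can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used for RDM: the input mappings (labels, redirects, disambiguations), the function names, and the regular expression for bracket terms are all hypothetical.</p>
        <preformat>
```python
import re

# Matches one trailing parenthesised qualifier, e.g. " (2009 film)".
_BRACKET_TERM = re.compile(r"\s*\([^()]*\)\s*$")


def strip_bracket_term(label):
    """Remove a trailing bracket term; case is left untouched."""
    return _BRACKET_TERM.sub("", label)


def build_dictionary(labels, redirects, disambiguations):
    """Map each surface form to the set of entities it may denote.

    labels:          entity mapped to its original label
    redirects:       redirect label mapped to its target entity
    disambiguations: disambiguation page label mapped to listed entities
    """
    dictionary = {}

    def add(surface, entity):
        dictionary.setdefault(strip_bracket_term(surface), set()).add(entity)

    for entity, label in labels.items():
        add(label, entity)
    for redirect_label, target in redirects.items():
        add(redirect_label, target)
    for page_label, targets in disambiguations.items():
        for target in targets:
            add(page_label, target)
    return dictionary


# Hypothetical toy input.
rdm = build_dictionary(
    labels={"dbp:Berlin": "Berlin", "dbp:Berlin_(2009_film)": "Berlin (2009 film)"},
    redirects={"Berlin, Germany": "dbp:Berlin"},
    disambiguations={"Berlin (disambiguation)": ["dbp:Berlin", "dbp:Berlin_(band)"]},
)
print(sorted(rdm["Berlin"]))
```
</preformat>
        <p>Because no case folding is applied, a lookup for berlin would fail in this sketch, which mirrors the case-sensitivity effect examined in Section 4.3.</p>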
      </sec>
      <sec id="sec-4-2">
        <title>Experiments</title>
        <p>To identify several characteristics of the introduced dictionaries as well as to consolidate
assumptions about the structure of the benchmarks, the experiments described in the
following paragraphs have been conducted. For performance reasons, only a subset of the
Wikilinks benchmark has been used for the following experiments. For the subset, the
first dump file, containing 494,512 annotations and 192,008 distinct mentions and
assigned entities, has been used.</p>
        <p>Mapping Coverage. First, the coverage of mention mappings is calculated. All annotated
entity mentions from the benchmarks are looked up in the four different dictionaries.
If at least one entity candidate for the mention is found in the dictionary, a counter
is increased. This measure is an indicator for the expressiveness and versatility of the
dictionary.</p>
        <p>Entity Candidate Count. For all mapped mentions the number of entity candidates
found in the respective dictionary is added up. The number of entity candidates
corresponds to the level of ambiguity of the mention and can be considered as an indicator
for the level of difficulty of the subsequent disambiguation process.</p>
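        <p>The two measures above can be sketched in a few lines. The function names, the toy dictionary, and the way case-insensitive lookup is realised (lower-casing both dictionary labels and mentions) are assumptions made for illustration; the source does not specify these details.</p>
        <preformat>
```python
def fold_case(dictionary):
    """Build a lower-cased copy, merging candidate sets that collide."""
    folded = {}
    for surface, entities in dictionary.items():
        folded.setdefault(surface.lower(), set()).update(entities)
    return folded


def coverage_and_ambiguity(mentions, dictionary):
    """Return (mapped mentions, coverage, average candidates per mapped mention)."""
    mapped = 0
    candidates = 0
    for mention in mentions:
        found = dictionary.get(mention, set())
        if found:  # at least one entity candidate: the mention is mapped
            mapped += 1
            candidates += len(found)
    coverage = mapped / len(mentions) if mentions else 0.0
    average = candidates / mapped if mapped else 0.0
    return mapped, coverage, average


# Hypothetical toy data.
dictionary = {
    "Berlin": {"dbp:Berlin", "dbp:Berlin_(band)"},
    "Internet": {"dbp:Internet"},
}
mentions = ["Berlin", "internet", "Potsdam"]

case_sensitive = coverage_and_ambiguity(mentions, dictionary)
case_insensitive = coverage_and_ambiguity(
    [mention.lower() for mention in mentions], fold_case(dictionary)
)
print(case_sensitive, case_insensitive)
```
</preformat>
        <p>In the toy data the lower-case mention internet is only found after case folding, so the number of mapped mentions rises from one to two.</p>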
        <p>Maximum Recall. For every mapped mention, the list of entity candidates is checked
for the annotated entity (from the benchmark). Only if the annotated entity is contained
in the list is a correct disambiguation achievable at all. Thus, this measure predicts
the maximum achievable recall using the respective dictionary on the benchmark.</p>
        <p>Recall and Precision achieved by Popularity. For Word Sense Disambiguation (WSD),
after the entity candidates for the mentions have been determined, a subsequent disambiguation
process tries to detect the most relevant entity of all candidates according to the given
context. For this experiment the disambiguation process is simplified: the most
popular entity among the available candidates is chosen as the correct disambiguation. To
determine the popularity of the entity candidates, three different measures are applied:</p>
        <list list-type="bullet">
          <list-item><p>Incoming Page Links of entity candidates</p></list-item>
          <list-item><p>Anchor-Link-Probability within web document corpus</p></list-item>
          <list-item><p>Anchor-Link-Probability within Wikipedia corpus</p></list-item>
        </list>
        <p>The first measure is a simple entity-based popularity measure. The popularity is defined
according to the number of incoming Wikipedia page links. The more links point to an
entity, the more popular the entity is considered. The Anchor-Link-Probability defines
the probability of a linked entity for a given anchor text. Thus, the more often a
mention is used to link to the same entity, the higher the Anchor-Link-Probability.
This probability has been calculated on two different corpora. For the DBL dictionary
this probability has been calculated based on the Wikipedia article corpus, and for the
GCW dataset it has been calculated based on all web documents (cf. Section 4.1).
The results of this experiment can be considered as an indicator for the degree of
difficulty of the applied benchmark in terms of WSD. A high recall and precision achieved by
simply using a popularity measure indicates a less difficult benchmark dataset. If a
benchmark contains less popular entities, the disambiguation process can be considered
more difficult.</p>
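        <p>Maximum recall and the popularity baseline can be sketched as follows. The sketch assumes that precision is computed over mapped mentions and recall over all annotations, and that popularity is given as a plain score per entity (for instance an incoming page link count); the source does not spell out these definitions, and all names and numbers below are hypothetical.</p>
        <preformat>
```python
def max_recall(annotations, dictionary):
    """Share of annotations whose gold entity is among the candidates."""
    reachable = sum(
        1 for mention, gold in annotations if gold in dictionary.get(mention, ())
    )
    return reachable / len(annotations) if annotations else 0.0


def popularity_baseline(annotations, dictionary, popularity):
    """Choose the most popular candidate per mention; return (precision, recall)."""
    answered = 0
    correct = 0
    for mention, gold in annotations:
        candidates = dictionary.get(mention)
        if not candidates:
            continue  # unmapped mention: no answer is given
        answered += 1
        # sorted() makes the choice deterministic when scores tie
        best = max(sorted(candidates), key=lambda entity: popularity.get(entity, 0))
        if best == gold:
            correct += 1
    precision = correct / answered if answered else 0.0
    recall = correct / len(annotations) if annotations else 0.0
    return precision, recall


# Hypothetical toy data: candidates, link counts, and gold annotations.
dictionary = {
    "Berlin": {"dbp:Berlin", "dbp:Berlin_(band)"},
    "David": {"dbp:David", "dbp:David_Beckham"},
}
popularity = {
    "dbp:Berlin": 9000,
    "dbp:Berlin_(band)": 300,
    "dbp:David": 2500,
    "dbp:David_Beckham": 1200,
}
annotations = [
    ("Berlin", "dbp:Berlin"),
    ("David", "dbp:David_Beckham"),
    ("Potsdam", "dbp:Potsdam"),
]

print(max_recall(annotations, dictionary))
print(popularity_baseline(annotations, dictionary, popularity))
```
</preformat>
        <p>The first-name mention David shows the effect discussed for KORE 50: the gold entity is among the candidates, so it counts towards maximum recall, but a more popular candidate wins the baseline.</p>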
      </sec>
      <sec id="sec-4-3">
        <title>Results &amp; Discussion</title>
        <p>The experiments described above are discussed in the following paragraphs. For every
experiment a table with the achieved results is given. The tables show the results for
the four different dictionaries (represented by the columns) on the three different
benchmarks (represented by the rows). For comparison, the number of entries is given
for all dictionaries, and the number of distinct mentions and their
annotated entities is given for all benchmarks. For all results the total numbers as well as a proportional
or an averaged value are given. This facilitates the comparison of benchmarks
and dictionaries that differ significantly in number of annotations and size.</p>
        <p>The experiments mapping coverage, entity candidate count, maximum recall, and
recall and precision based on page link popularity have also been performed using
case-insensitive mentions and labels in the four different dictionaries. For comparison, these
results are presented in the same tables of the respective experiments as the results of
the case-sensitive experiments. Recall and precision based on Anchor-Link-Probability
have not been calculated for the case-insensitive setting, as the probabilities for case-insensitive anchors are not
available for the DBL and GCW datasets.</p>
        <p>Mapping Coverage</p>
        <list list-type="bullet">
          <list-item><p>GCW achieves the highest coverage (between 94.67% and 100%) because it is the largest
dictionary, containing 378 million entries, and because of its construction method: anchor texts and
linked Wikipedia articles in web documents.</p></list-item>
          <list-item><p>RDM performs worst, with only 25.19% on the Spotlight benchmark, due to the lack
of preprocessing: all labels are given with a capital first letter, which is not common
in the English language except for persons, places, and organizations.</p></list-item>
          <list-item><p>Coverage for RDM increased by 69% (to 94%) when mentions in the Spotlight
benchmark are looked up in the dictionary case-insensitively. Also, for the Wikilinks
benchmark the coverage using the RDM dictionary is increased by 16% to 76%. The
RDM dictionary consists mainly of case-sensitive labels (as no preprocessing has
been performed). Persons, organizations, and places are written with a capital first
letter in English language texts. Mentions of entities of those types are found in a
case-sensitive dictionary, such as RDM. In contrast, mentions of entities that are
not of type person, organization, or place, e. g. internet, are not found in the
dictionary. If a benchmark contains mainly mentions of entities of type person,
organization, or place, the RDM dictionary achieves a high mapping coverage, as
for the KORE 50 benchmark. Case-insensitive lookup is therefore expected to increase the coverage,
especially if the benchmark contains entity mentions that are not of type person,
organization, or place. This assumption is consolidated by the increased mapping
coverage for the Spotlight and Wikilinks benchmarks and the type information of
the mentioned entities in the benchmarks presented in Table 1.</p></list-item>
          <list-item><p>Overall, the dictionaries perform very well or even best on the benchmarks that
have been constructed for the evaluation of their respective applications: DBL on
Spotlight, AIDA on KORE 50, and GCW on Wikilinks.</p></list-item>
        </list>
        <sec id="sec-4-3-1">
          <title>The overall results are depicted in Table 2.</title>
          <list list-type="bullet">
            <list-item><p>DBL and RDM do not contain all first names of persons as needed for the KORE 50
benchmark. Thus, the maximum recall decreases compared to the mapping coverage.</p></list-item>
            <list-item><p>AIDA performs poorly on the Spotlight benchmark due to the structure of the dictionary.
The dictionary contains a large number of persons' first names. Apparently, the
dictionary does not reflect labels for entities in manually annotated texts.</p></list-item>
          </list>
          <p>Recall and Precision achieved by Popularity: Incoming Wikipedia Page Links of Entity
Candidates</p>
          <list list-type="bullet">
            <list-item><p>Notably, GCW performs poorly on all benchmarks compared to the maximum
achievable recall due to a high entity candidate count. Apparently, entity candidate lists
often contain more popular but incorrect entities.</p></list-item>
            <list-item><p>In the KORE 50 benchmark, due to many annotated first names, entity candidate
lists contain many prospective entities, and apparently the correct candidate is often
not the most popular one compared to the other candidates. This explains the poor
performance of all dictionaries on KORE 50 using page link popularity.</p></list-item>
            <list-item><p>Compared to the maximum achievable recall (of all dictionaries) on KORE 50,
the recall achieved using a popularity measure as a simplified disambiguation
process is very low. This confirms the intention of the benchmark to contain mentions that
are hard to disambiguate.</p></list-item>
          </list>
        </sec>
        <sec id="sec-4-3-2">
          <title>Overall results are shown in Table 5.</title>
        </sec>
        <sec id="sec-4-3-3">
          <title>Overall results are shown in Table 7.</title>
          <p>Evaluation results of NED approaches depend on the structure of the used
benchmark dataset as well as on the dictionary used for entity candidate determination. The
objective of this paper is to point out the differences between several benchmarks and
dictionaries for NED. For this purpose, three different benchmarks have been analyzed.
Two of them have first been converted into NIF representations and made available
online. The analyses included simple statistical information about the benchmarks as well as type information of
the contained entities. Additionally, four different dictionaries have
been applied to determine entity candidates in the benchmarks. Based on our
evaluation, important assumptions about the benchmarks have been consolidated, and new
insights into the characteristics of the evaluated benchmarks as well as into the
expressiveness of the dictionaries have been delivered. By making all benchmarks and evaluation
results available online, the evaluation of new NER or NED tools can be carried out in a more
target-oriented way with more meaningful results.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>J.</given-names>
            <surname>Artiles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Borthwick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sekine</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigo</surname>
          </string-name>
          .
          <article-title>WePS-3 evaluation campaign: Overview of the web people search clustering and attribute extraction tasks</article-title>
          . In CLEF (Notebook Papers/LABs/Workshops),
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.</given-names>
            <surname>Bentivogli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Girardi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Pianta</surname>
          </string-name>
          .
          <article-title>Creating a gold standard for person cross-document coreference resolution in Italian news</article-title>
          .
          <source>In Proc. of the LREC 2008 Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management</source>
          , page
          <fpage>19</fpage>
          , Marrakech, Morocco, May
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>M.</given-names>
            <surname>Cornolti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ciaramita</surname>
          </string-name>
          .
          <article-title>A framework for benchmarking entity-annotation systems</article-title>
          .
          <source>In Proceedings of the 22nd international conference on World Wide Web, WWW '13</source>
          , pages
          <fpage>249</fpage>–<lpage>260</lpage>
          , Geneva, Switzerland,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>D.</given-names>
            <surname>Day</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hitzeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Wick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Crouch</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Poesio</surname>
          </string-name>
          .
          <article-title>A corpus for cross-document co-reference</article-title>
          .
          <source>In Proc. of the LREC 2008 Workshop on Resources and Evaluation for Identity Matching, Entity Resolution and Entity Management</source>
          , May
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          .
          <article-title>A comparison of knowledge extraction tools for the semantic web</article-title>
          .
          <source>In The Semantic Web: Semantics and Big Data</source>
          , volume
          <volume>7882</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>351</fpage>–<lpage>366</lpage>
          . Springer Berlin Heidelberg,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Brümmer</surname>
          </string-name>
          .
          <article-title>Integrating NLP using linked data</article-title>
          .
          <source>In Proc. of 12th Int. Semantic Web Conf.</source>
          , Sydney, Australia, October
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
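          7.
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seufert</surname>
          </string-name>
          ,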
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Theobald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>KORE: Keyphrase overlap relatedness for entity disambiguation</article-title>
          .
          <source>In Proc. of the 21st ACM international conference on Information and knowledge management</source>
          , pages
          <fpage>545</fpage>
          –
          <lpage>554</lpage>
          . ACM,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
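          8.
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Yosef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Bordino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fürstenau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pinkal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spaniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Taneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thater</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>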
          .
          <article-title>Robust disambiguation of named entities in text</article-title>
          .
          <source>In Proc. of the Conf. on Empirical Methods in Natural Language Processing, EMNLP '11</source>
          , pages
          <fpage>782</fpage>
          –
          <lpage>792</lpage>
          ,
          <publisher-loc>Stroudsburg, PA, USA</publisher-loc>
          ,
          <year>2011</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Silva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>DBpedia Spotlight: shedding light on the web of documents</article-title>
          .
          <source>In Proc. of the 7th Int. Conf. on Semantic Systems (I-Semantics)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Subramanya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          .
          <article-title>Wikilinks: A large-scale cross-document coreference corpus labeled via links to Wikipedia</article-title>
          .
          <source>Technical Report UM-CS-2012-015</source>
          , University of Massachusetts Amherst,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>V. I.</given-names>
            <surname>Spitkovsky</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. X.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>A cross-lingual dictionary for English Wikipedia concepts</article-title>
          .
          <source>In Proc. of the Eighth Int. Conf. on Language Resources and Evaluation (LREC'12)</source>
          , Istanbul, Turkey, May
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
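          <string-name>
            <given-names>M. A.</given-names>
            <surname>Yosef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Bordino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spaniol</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .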
          <article-title>AIDA: an online tool for accurate disambiguation of named entities in text and tables</article-title>
          .
          <source>PVLDB</source>
          ,
          <volume>4</volume>
          (
          <issue>12</issue>
          ):
          <fpage>1450</fpage>
          –
          <lpage>1453</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>