<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giuseppe Rizzo</string-name>
          <email>giuseppe.rizzo@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphaël Troncy</string-name>
          <email>raphael.troncy@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Hellmann</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Brümmer</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EURECOM</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Politecnico di Torino</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universität Leipzig</institution>
          ,
          <addr-line>Germany</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <volume>16</volume>
      <issue>2012</issue>
      <abstract>
        <p>We have often heard that data is the new oil. In particular, extracting information from semi-structured textual documents on the Web is key to realizing the Linked Data vision. Several attempts have been proposed to extract knowledge from textual documents: extracting named entities, classifying them according to predefined taxonomies and disambiguating them through URIs identifying real world entities. As a step towards interconnecting the Web of documents via those entities, different extractors have been proposed. Although they share the same main purpose (extracting named entities), they differ in numerous aspects such as their underlying dictionary or their ability to disambiguate entities. We have developed NERD, an API and a front-end user interface powered by an ontology to unify various named entity extractors. The unified output is serialized in RDF according to the NIF specification and published back on the Linked Data cloud. We evaluated NERD with a dataset composed of five TED talk transcripts, a dataset composed of 1000 New York Times articles and a dataset composed of the 217 abstracts of the papers published at WWW 2011.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>I.2.7 [Artificial Intelligence]: [Natural Language
Processing - Language parsing and understanding]</p>
    </sec>
    <sec id="sec-2">
      <title>General Terms</title>
      <p>Measurement, Performance, Experimentation, Web</p>
    </sec>
    <sec id="sec-3">
      <title>1. INTRODUCTION</title>
      <p>
        The Web of Data is often illustrated as a fast growing
cloud of interconnected datasets representing information about
nearly everything [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The Web also hosts millions of
semi-structured texts such as scientific papers, news articles as
well as forum and archived mailing list threads or
(micro)blog posts. This information usually has a rich semantic
structure which is clear to the author but remains
mostly hidden to computing machinery. Named entity and
information extractors aim to bring out such a structure from
those free texts. They provide algorithms for extracting
semantic units identifying the names of people, organizations,
locations, time references, quantities, etc., and for classifying
them according to predefined schemas, increasing
discoverability (e.g. through faceted search), reusability and the
utility of information.
      </p>
      <p>
        Since the 1990s, an increasing emphasis has been given to
the evaluation of NLP techniques. Hence, the Named
Entity Recognition (NER) task has been developed as an
essential component of the Information Extraction field.
Initially, these techniques focused on identifying atomic
information units in a text, named entities, later classified into
predefined categories (also called context types) by
classification techniques, and linked to real world objects using
web identifiers. Such a task is called Named Entity
Disambiguation. Knowledge bases affect the disambiguation task
in several ways, because they provide the final
disambiguation point where the information is linked. Recent methods
leverage knowledge bases such as DBpedia [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Freebase1 or
YAGO [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] since they contain many entries corresponding
to real world entities, classified according to exhaustive
classification schemes. A certain number of tools have been
developed to extract structured information from text
resources, classifying it according to predefined taxonomies
and disambiguating it using URIs. In this work, we aim
to evaluate tools which provide such an online computation:
AlchemyAPI2, DBpedia Spotlight3, Evri4, Extractiv5,
Lupedia6, OpenCalais7, Saplo8, Wikimeta9, Yahoo! Content
Analysis (YCA)10 and Zemanta11. They represent a clear
1http://www.freebase.com/
2http://www.alchemyapi.com
3http://dbpedia.org/spotlight
4http://www.evri.com/developer/index.html
5http://extractiv.com
6http://lupedia.ontotext.com
7http://www.opencalais.com
8http://saplo.com
9http://www.wikimeta.com
10http://developer.yahoo.com/search/content/V2/contentAnalysis.html
11http://www.zemanta.com
opportunity for the Linked Data community to increase the
volume of interconnected data. Although these tools share
the same purpose (extracting semantic units from text),
they make use of different algorithms and training data.
They generally provide a similar output composed
of a set of extracted named entities, their type and,
potentially, a URI disambiguating each named entity. However, the outputs
vary in terms of the data model used by the extractors. Hence,
we propose the Named Entity Recognition and
Disambiguation (NERD) framework, which unifies the output results of
these extractors, lifting them to the Linked Data Cloud
using the new NIF specification.
      </p>
      <p>These services have their own strengths and
shortcomings but, to the best of our knowledge, few scientific
evaluations have been conducted to understand the conditions
under which a tool is the most appropriate one. This paper
attempts to fill this gap. We have performed quantitative
evaluations on three different datasets covering
different types of textual documents: a dataset composed of
transcripts of five TED12 talks, a dataset composed of 1000
news articles from The New York Times13 and a dataset
composed of the 217 abstracts of the papers published at
WWW 201114. We present statistics to underline the
behavior of such extractors in different scenarios and group
them according to the NERD ontology. We have developed
the NERD framework, available at http://nerd.eurecom.fr, to
perform systematic evaluation of NE extractors.</p>
      <p>The remainder of this paper is organized as follows. In
section 2, we introduce a factual comparison of the named
entity extractors investigated in this work. We describe the
NERD framework in section 3 and we highlight the
importance of having an output compliant with the Linked Data
principles in section 4. Then, we describe the experimental
results we obtained in section 5 and, in section 6, we give
an overview of named entity recognition and
disambiguation techniques. Finally, we draw our conclusions and outline
future work in section 7.</p>
    </sec>
    <sec id="sec-5">
      <title>2. FACTUAL COMPARISON OF NAMED ENTITY EXTRACTORS</title>
      <p>The NE recognition and disambiguation tools vary in terms
of response granularity and technology used. By
granularity, we mean how the extraction algorithm works:
One Entity per Name (OEN), where the algorithm tokenizes
the document into a list of exclusive sentences, recognizing the
dot as a terminator character, and detects named entities
within each sentence; and One Entity per Document (OED), where
the algorithm considers the bag of words from the entire
document and then detects named entities, removing duplicates
of the same output record (NE, type, URI). Therefore, the
result sets of the two approaches differ.</p>
      <p>Table 1 provides an extensive comparison that takes into
account the technology used: algorithms used to extract
NE, supported languages, ontology used to classify the NE,
dataset for looking up the real world entities and all the
technical issues related to the online computation such as
the maximum content request size and the response format.
We also report whether a tool provides the position where
an NE is found in the text or not. We distinguish four cases:
12http://www.ted.com
13http://www.nytimes.com
14http://www.www2011india.com
char offset: considering the text as a sequence of characters,
it reports the char index where the NE starts and the length
(number of chars) of the NE; range of chars: considering the
text as a sequence of characters, it reports the start index
and the end index where the NE appears; word offset: the
text is tokenized considering any punctuation, and it reports the
word number after which the NE is located (this counting does not
take the punctuation into account); POS offset: the text is
tokenized considering any punctuation, and it reports the
number of parts-of-speech after which the NE is located.</p>
      <p>We performed an experimental evaluation to estimate the
maximum content chunk supported by each API, creating a
simple application that initially sends to each extractor a text
of 1 KB. If the answer was correct (HTTP
status 20x), we performed one more test, increasing
the content chunk by 1 KB. We iterated this operation until we
received the answer "text too long". Table 1 summarizes the
factual comparison of the services involved in this study.
The symbol * means the value has been estimated experimentally
(as for the content chunk), + means a list of other sources,
generally identifiable as any source available on the Web, and,
finally, N/A means not available.</p>
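      <p>The probing procedure described above can be sketched as follows; send_chunk is a hypothetical stand-in for the HTTP POST to a given extractor (it is our illustrative helper, not part of any extractor's API):</p>

```python
def probe_max_chunk(send_chunk, step=1024, limit=100 * 1024 * 1024):
    """Grow the payload by `step` bytes until the service rejects it.

    `send_chunk(text)` must return True on an HTTP 20x answer and
    False on a "text too long" answer. Returns the largest accepted
    payload size in bytes (0 if even the first chunk is rejected).
    """
    size = step
    accepted = 0
    while size <= limit:
        if not send_chunk("x" * size):
            break  # service answered "text too long"
        accepted = size
        size += step
    return accepted
```

      <p>For example, a service that accepts at most 8 KB would be reported as 8192 bytes by this loop.</p>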
    </sec>
    <sec id="sec-6">
      <title>3. THE NERD FRAMEWORK</title>
      <p>
        NERD is a web framework plugged on top of various NER
extractors. Its architecture follows the REST principles [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
and includes an HTML front-end for humans and an API
for computers to exchange content in JSON (another
serialization of the NERD output will be detailed in section 4).
Both interfaces are powered by the NERD REST engine.
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.1 The NERD Data Model</title>
      <p>We propose the following data model that encapsulates
the common properties for representing NERD extraction
results. It is composed of a list of entities for which a label,
a type and a URI are provided, together with the mapped type
in the NERD taxonomy, the position of the named entity, and
the confidence and relevance scores as they are provided by
the NER tools. The example below shows this data model
(for the sake of brevity, we use the JSON syntax):</p>
      <p>"entities": [{
  "entity": "Tim Berners-Lee",
  "type": "Person",
  "uri": "http://dbpedia.org/resource/Tim_berners_lee",
  "nerdType": "http://nerd.eurecom.fr/ontology#Person",
  "startChar": 30,
  "endChar": 45,
  "confidence": 1,
  "relevance": 0.5
}]</p>
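      <p>A minimal sketch of consuming this data model in Python; the field names are taken from the example above, while the parsing helper itself is ours and not part of NERD:</p>

```python
import json

def parse_entities(payload):
    """Extract (label, NERD type, URI) tuples from a NERD JSON result."""
    doc = json.loads(payload)
    return [(e["entity"], e["nerdType"], e["uri"]) for e in doc["entities"]]

# the example from the data model above, wrapped in a JSON object
example = '''{"entities": [{
  "entity": "Tim Berners-Lee",
  "type": "Person",
  "uri": "http://dbpedia.org/resource/Tim_berners_lee",
  "nerdType": "http://nerd.eurecom.fr/ontology#Person",
  "startChar": 30, "endChar": 45,
  "confidence": 1, "relevance": 0.5}]}'''
```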
    </sec>
    <sec id="sec-8">
      <title>3.2 The NERD REST API</title>
      <p>The REST engine runs on Jersey15 and Grizzly16
technologies. Their extensible frameworks enable the development of
several components, and NERD is composed of 7 modules,
namely authentication, scraping, extraction, ontology
mapping, store, statistics and web. The authentication module takes as
input the FOAF profile of a user and links the evaluations with
the user who performs them (we are finalizing an OpenID
implementation which will soon replace the simple
authentication system currently in place). The scraping module takes
as input the URI of an article and extracts all its raw text.
Extraction is the module designed to invoke the external
service APIs and collect the results. Each service provides
its own taxonomy of named entity types it can recognize.
We therefore designed the NERD ontology, which provides
a set of mappings between these various classifications. The
ontology mapping module is in charge of mapping the
classification type retrieved to our ontology. The store module
saves all evaluations according to the schema model we
defined in the NERD database. The statistics module enables
the extraction of data patterns from the user interactions stored in
the database and the computation of statistical scores such as the
Fleiss Kappa score and the precision measure. Finally, the
web module manages the client requests, the web cache and
generates HTML pages.
15http://jersey.java.net
16http://grizzly.java.net</p>
      <p>Plugged on top of this engine, there is an API
interface17. It is developed following the REST principles and it
has been implemented to enable programmatic access to the
NERD framework. It follows this URI scheme (the
base URI is http://nerd.eurecom.fr/api):
/document : GET, POST, PUT methods enable fetching,
submitting or modifying a document parsed by the NERD
framework;
/user : GET, POST methods enable inserting a new user into
the NERD framework and fetching account details;
/annotation/{extractor} : POST method drives the
annotation of a document. The parametric URI allows
selection of the extractors supported by NERD;
/extraction : GET method allows fetching the output
described in section 3.1;
/evaluation : GET method allows retrieving a statistical
interpretation of the extractor behaviors.</p>
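      <p>The URI scheme above can be captured by a small URL-building helper. The endpoint paths and base URI come from the list above; the function itself is only an illustrative sketch, not part of the NERD codebase:</p>

```python
BASE = "http://nerd.eurecom.fr/api"

def endpoint(resource, extractor=None):
    """Build a NERD REST endpoint URL following the scheme above."""
    if resource == "annotation":
        # /annotation/{extractor} is parametric: the extractor is required
        if extractor is None:
            raise ValueError("the annotation endpoint needs an extractor name")
        return "%s/annotation/%s" % (BASE, extractor)
    if resource in {"document", "user", "extraction", "evaluation"}:
        return "%s/%s" % (BASE, resource)
    raise ValueError("unknown resource: %s" % resource)
```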
    </sec>
    <sec id="sec-9">
      <title>3.3 The NERD Ontology</title>
      <p>
        Although these tools share the same goal, they use
different algorithms and different dictionaries, which makes
their comparison hard. We have developed the NERD ontology,
a set of mappings established manually between the
taxonomies of NE types. Concepts included in the NERD
ontology are collected from different schema types: ontology (for
DBpedia Spotlight, Lupedia, and Zemanta), lightweight
taxonomy (for AlchemyAPI, Evri, and Yahoo!) or simple flat
type lists (for Extractiv, OpenCalais, Saplo, and Wikimeta).
The NERD ontology tries to merge the needs of the linguistic
community with those of the logician community: we developed
a core set of axioms based on the Quaero schema [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and we
mapped similar concepts described in the other schemes. The
selection of these concepts has been done considering the
greatest common denominator among them. The concepts
that do not appear in the NERD namespace are sub-classes
of parents that end up in the NERD ontology. This ontology
is available at http://nerd.eurecom.fr/ontology. To summarize,
a concept is included in the NERD ontology as soon as
at least two extractors use it. The NERD ontology thus
becomes a reference ontology for comparing the classification
task of NE extractors. We show an example mapping
among those extractors below: the City type is considered
17http://nerd.eurecom.fr/api/application.wadl
as being equivalent to alchemy:City, dbpedia-owl:City,
extractiv:CITY, opencalais:City and evri:City, while being
more specific than wikimeta:LOC and zemanta:location.
nerd:City a rdfs:Class ;
  rdfs:subClassOf wikimeta:LOC ;
  rdfs:subClassOf zemanta:location ;
  owl:equivalentClass alchemy:City ;
  owl:equivalentClass dbpedia-owl:City ;
  owl:equivalentClass evri:City ;
  owl:equivalentClass extractiv:CITY ;
  owl:equivalentClass opencalais:City .
      </p>
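      <p>For illustration, the City mapping above can be represented as a plain lookup that rewrites a source type into its NERD concept. This is a toy sketch of what the ontology-mapping module does with this one concept, not its actual code:</p>

```python
# equivalence and subsumption links for nerd:City,
# taken from the Turtle snippet above
CITY_EQUIVALENTS = {"alchemy:City", "dbpedia-owl:City", "evri:City",
                    "extractiv:CITY", "opencalais:City"}
CITY_SUPERCLASSES = {"wikimeta:LOC", "zemanta:location"}

def to_nerd_type(source_type):
    """Map an extractor-specific type to nerd:City when it is equivalent."""
    if source_type in CITY_EQUIVALENTS:
        return "nerd:City"
    # wikimeta:LOC and zemanta:location are broader than nerd:City,
    # so they cannot safely be rewritten to it
    return None
```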
    </sec>
    <sec id="sec-10">
      <title>3.4 The NERD UI</title>
      <p>
        The user interface18 is developed in HTML/Javascript. Its
goal is to provide a portal where researchers can find
information about the NERD project, the NERD ontology, and
common statistics of the supported extractors. Moreover,
it provides a personalized space where a user can create a
developer or a simple user account. For the former account
type, a developer can navigate through a dashboard, see his
profile details, browse some personal usage statistics and get
programmatic access to the NERD API via a NERD key.
The simple user account enables annotation of any web
document via its URI. The raw text is first extracted from the
web source and a user can select a particular extractor.
After the extraction step, the user can judge the correctness
of each field of the tuple (NE, type, URI, relevant). This
is an important process which provides NERD with human
feedback, with the main purpose of evaluating the quality of the
extraction results collected by those tools [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. At the end
of the evaluation, the user sends the results, through
asynchronous calls, to the REST API engine in order to store
them. This set of evaluations is further used to compute
statistics about precision measures for each tool, with the
goal of highlighting strengths and weaknesses and comparing
them [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The comparison aggregates all the evaluations
performed and, finally, the user is free to select one or more
evaluations to see the metrics that are computed for each
service in real time.
      </p>
    </sec>
    <sec id="sec-11">
      <title>4. NIF: AN NLP INTERCHANGE FORMAT</title>
      <p>The NLP Interchange Format (NIF) is an
RDF/OWL-based format that aims to achieve interoperability between
Natural Language Processing (NLP) tools, language resources
and annotations. The NIF specification was released in
an initial version 1.0 in November 2011 and describes how
interoperability between NLP tools, which are exposed as NIF
web services, can be achieved. Extensive feedback was given
on several mailing lists and a community of interest19 was
created to improve the specification. Implementations for 8
different NLP tools (e.g. UIMA, Gate ANNIE and DBpedia
Spotlight) exist and a public web demo20 is available.</p>
      <p>In the following, we will first introduce the core concepts
of NIF, which are defined in a String Ontology21 (STR). We
will then explain how NIF is used in NERD. The resulting
properties and axioms are included in a Structured
Sentence Ontology22 (SSO). While the String Ontology is used
18http://nerd.eurecom.fr
19http://nlp2rdf.org/get-involved
20http://nlp2rdf.lod2.eu/demo.php
21http://nlp2rdf.lod2.eu/schema/string
22http://nlp2rdf.lod2.eu/schema/sso
to describe the relations between strings (i.e. sequences of Unicode
characters), the SSO collects properties and classes to connect
strings to NLP annotations and NER entities as produced
by NERD.</p>
    </sec>
    <sec id="sec-12">
      <title>4.1 Core Concepts of NIF</title>
      <p>The motivation behind NIF is to allow NLP tools to
exchange annotations about documents in RDF. Hence, the
main prerequisite is that parts of the documents (i.e. strings)
are referenceable by URIs, so that they can be used as
subjects in RDF statements. We call an algorithm to create
such identifiers a URI Scheme: for a given text t (a sequence
of characters) of length |t| (number of characters), we are
looking for a URI Scheme to create a URI that can serve as
a unique identifier for a substring s of t (i.e. |s| ≤ |t|). Such
a substring can (1) consist of adjacent characters only, and
is therefore a unique character sequence within the text
if we account for parameters such as context and position, or
(2) be derived by a function which points to several substrings
as defined in (1).</p>
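      <p>The offset-based scheme of case (1) can be sketched as follows; the helper name is ours, and only the URI shape follows the NIF convention used in the examples below:</p>

```python
def offset_uri(base, text, start, end):
    """Mint a NIF offset-style URI for the substring text[start:end]."""
    assert 0 <= start <= end <= len(text), "offsets must lie inside the text"
    return "%s#offset_%d_%d" % (base, start, end)

# the document used in the running example of this section;
# per the text it contains 26547 characters (stubbed out here)
page = "http://www.w3.org/DesignIssues/LinkedData.html"
```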
      <p>NIF provides two URI schemes which can be used to
represent strings as RDF resources. We focus here on the first
scheme, using offsets. In the top part of Figure 1, two triples
are given that use the following URI as subject:
http://www.w3.org/DesignIssues/LinkedData.html#
offset_717_729
According to the above definition, the URI points to a
substring of a given text t, which starts at character index 717
and runs until index 729 (counting all characters). NIF currently
mandates that the whole string of the document has to be
included in the RDF output as an rdf:Literal to serve as
the reference point; we will call this the inside context,
formalized using an OWL class called str:Context. The term
document would be inappropriate to capture the real
intention of this concept, as we would like to refer to an arbitrary
grouping of characters forming a unit, which could also be
applied to a paragraph or a section and is highly dependent
upon the wider context in which the string is actually used,
such as a Web document reachable via HTTP.</p>
      <p>To appropriately capture the intention of such a class,
we distinguish between the notion of outside and inside
context of a piece of text. The inside context is easy to
explain and formalise, as it is the text itself and therefore it
provides a reference context for each substring contained in
the text (i.e. the characters before or after the substring).
The outside context is more vague and is given by an outside
observer, who might arbitrarily interpret the text as a "book
chapter" or a "book section".</p>
      <p>The class str:Context now provides a clear reference point
for all other relative URIs used in this context and blocks
the addition of information from a larger (outside) context
by definition. For example, str:Context is disjoint with
foaf:Document since labeling a context resource as a
document is information which is not contained within the
context (i.e. the text) itself. It is legal, however, to say
that the string of the context occurs in (str:occursIn) a
foaf:Document. Additionally, str:Context is a subclass
of str:String and therefore its instances denote Unicode
text as well. The main benefit of limiting the context is that
an OWL reasoner can now infer that two contexts are the
same if they consist of the same string, because an
inverse-functional datatype property (str:isString) is used to
attach the actual text to the context resource.</p>
      <p>:offset_0_26546 a str:Context ;
  # the exact retrieval method is left underspecified
  str:occursIn &lt;http://www.w3.org/DesignIssues/LinkedData.html&gt; ;
  # [...] are all 26547 characters as rdf:Literal
  str:isString "[...]" .
:offset_717_729 a str:String ;
  str:referenceContext :offset_0_26546 .</p>
      <p>A complete formalisation is still work in progress, but the
idea is explained here. The NIF URIs will be grounded
in Unicode characters (especially Unicode Normalization
Form C23). For all resources of type str:String, the universe
of discourse will then be the words over the alphabet of
Unicode characters, sometimes called Σ*. Ultimately, we
hope that this will allow for an unambiguous interpretation
of NIF by machines.</p>
      <p>Within the framework of RDF and the current usage of
NIF for the interchange of output between NLP tools, this
definition of the semantics is sufficient to produce a working
system. However, problems arise if additional
interoperability with Linked Data or fragment identifiers and ad-hoc
retrieval of content from the Web is demanded. The actual
retrieval method (such as content negotiation) to retrieve and
validate the content for #offset_717_729_Semantic%20Web or
its reference context is left underspecified, as is the relation
of NIF URIs to fragment identifiers for MIME types such as
text/plain (see RFC 514724). As long as such issues remain
open, the complete text has to be included as an RDF Literal.</p>
    </sec>
    <sec id="sec-13">
      <title>4.2 Connecting Strings to Entities</title>
      <p>For NERD, three relevant concepts have to be expressed
in RDF and were included in the Structured Sentence
Ontology (SSO): OEN, OED and the NERD ontology types.</p>
      <p>One Entity per Name (OEN) can be modeled in a
straightforward way by introducing a property sso:oen, which
connects the string with an arbitrary entity.</p>
      <p>:offset_717_729 sso:oen dbpedia:Semantic_Web .</p>
      <p>One Entity per Document (OED). As document is an
outside interpretation of a string, the notion of context in NIF
23http://www.unicode.org/reports/tr15/#Norm_Forms counted in
Code Units http://unicode.org/faq/char_combmark.html#7
24http://tools.ietf.org/html/rfc5147
has to be used. The property sso:oec is used to attach
entities to a given context. We furthermore add the following
DL axiom (a role chain inclusion):
sso:oec ⊒ str:referenceContext⁻¹ ∘ sso:oen
As the property oen contains more specific information, oec
can be inferred via the above role chain inclusion. In case the
context is enlarged, any materialized information attached
via the oec property needs to be migrated to the larger
context resource.</p>
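      <p>The effect of this role chain inclusion can be illustrated with a few lines of plain Python over a set of (subject, predicate, object) triples; this is an illustrative materialization sketch, not how an OWL reasoner is actually implemented:</p>

```python
def infer_oec(triples):
    """Materialize sso:oec triples from the role chain
    str:referenceContext⁻¹ ∘ sso:oen ⊑ sso:oec: if a string s has
    reference context c and entity e, then c gets sso:oec e."""
    contexts = {s: o for (s, p, o) in triples if p == "str:referenceContext"}
    inferred = set()
    for (s, p, o) in triples:
        if p == "sso:oen" and s in contexts:
            inferred.add((contexts[s], "sso:oec", o))
    return inferred
```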
      <p>The connection between NERD types and strings is done
via a linked data URI, which disambiguates the entity.
Overall, three cases can be distinguished. In case the NER
extractor has provided a linked data URI to disambiguate the
entity, we simply re-use it, as in the following example:
# this URI points to the string "W3C"
:offset_23107_23110
  rdf:type str:String ;
  str:referenceContext :offset_0_26546 ;
  sso:oen dbpedia:W3C ;
  str:beginIndex "23107" ;
  str:endIndex "23110" .</p>
      <p>dbpedia:W3C rdf:type nerd:Organization .</p>
      <p>If, however, the NER extractor provides no disambiguation
link at all, or just a non-linked-data URI for the entity
(typically, the foaf:homepage of an organization such as
http://www.w3.org/), we plan to mint a new linked data
URI for the respective entity, which could then be further
linked via sameAs to other identifiers in a data reconciliation
process.</p>
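      <p>Putting the pieces together, the first case (the extractor returns a linked data URI) can be serialized with a small helper like the following; the function is an illustrative sketch (prefix declarations omitted), not code from the NERD framework:</p>

```python
def entity_triples(context, start, end, entity_uri, nerd_type):
    """Render the NIF/SSO triples for one disambiguated named entity,
    mirroring the W3C example above."""
    subject = ":offset_%d_%d" % (start, end)
    return "\n".join([
        "%s rdf:type str:String ;" % subject,
        "  str:referenceContext %s ;" % context,
        "  sso:oen %s ;" % entity_uri,
        '  str:beginIndex "%d" ;' % start,
        '  str:endIndex "%d" .' % end,
        "%s rdf:type %s ." % (entity_uri, nerd_type),
    ])
```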
    </sec>
    <sec id="sec-14">
      <title>5. EVALUATIONS</title>
      <p>We performed a quantitative experiment using three
different datasets: a dataset composed of transcripts of five
TED talks (from different categories of talks), a dataset composed
of 1000 news articles from The New York Times (collected
from 09/10/2011 to 12/10/2011), and a dataset composed
of the 217 abstracts of the papers published at the WWW 2011
conference. The aim of these evaluations is to assess how
these extractors perform in different scenarios, such as news
articles, user generated content and scientific papers. The
total number of documents is 1222, with an average word
count per document equal to 549. Each document was
evaluated using the 6 extractors supported by the NERD
framework25. The final number of entities detected is 177,823 and
the average number of unique entities per document is 20.03.
Table 2 shows these statistics grouped according to
the source documents.</p>
      <p>We define the following variables: the number nd of
evaluated documents, the number nw of words, the total
number ne of entities, the total number nc of categories and the
number nu of URIs. Moreover, we compute the following measures:
the word detection rate r(w,d), i.e. the number of words per
document, the entity detection rate r(e,d), i.e. the number of
entities per document, the number of entities per word r(e,w),
the number of categories per entity r(c,e) (this measure has
been computed removing irrelevant labels such as "null"
or "LINKED OTHER") and the number of URIs per entity
r(u,e).
25At the time this evaluation was conducted, Lupedia,
Saplo, Wikimeta and YCA were not part of the NERD
framework.</p>
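      <p>These measures are simple ratios over the per-dataset counts; the sketch below computes three of them, using the counts reported for the TED talk experiment in the following paragraph (5 documents, 13,381 words, 1,441 entities):</p>

```python
def rates(n_docs, n_words, n_entities):
    """Detection rates as defined above: r(w,d), r(e,d) and r(e,w)."""
    return {
        "r(w,d)": n_words / n_docs,      # words per document
        "r(e,d)": n_entities / n_docs,   # entities per document
        "r(e,w)": n_entities / n_words,  # entities per word
    }
```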
      <p>In this experiment, we focus on the extractions performed
by all tools for 5 TED talk transcripts. The goal is to find
out the NE extraction ratio for user generated content, such
as speech transcripts of videos. First, we propose general
statistics about the extraction task and then we focus on
the classification, showing statistics grouped according to
the NERD ontology. DBpedia Spotlight classifies each
resource according to three different schemas (see Table 1). For
this experiment, we consider only the results which belong
to the DBpedia ontology. The total number of documents
is 5, with an overall number of words equal to 13,381.
The word detection rate per document r(w,d) is equal to
2,676.2, with an overall number of entities equal to 1,441,
and an r(e,d) of 288.2. Table 3 shows the statistics about
the computation results for all extractors. DBpedia
Spotlight is the extractor which provides the highest number of
NEs and disambiguated URIs. These values show the ability
of this extractor to locate NEs and to exploit the large
cloud of LOD resources. At the same time, it is important to note that
it is not able to classify these resources, although it uses a
deep classification schema. All the extractors show high
ability in the classification task, except Zemanta, as shown by
the r(c,e). Conversely, Zemanta shows a strong ability to
disambiguate NEs via URI identification, as shown by r(u,e).
It is worth noting that OpenCalais and Evri have almost the
same performance as Zemanta. The last part of this
experiment consists in aligning all the classification types provided
by these extractors, while performing the analysis of TED
talk transcripts, using the NERD ontology. For the sake of
brevity, we report all the grouping results according to 6
main concepts: Person, Organization, Country, City, Time
and Number. Table 4 shows the comparison results.
AlchemyAPI classifies a higher number of Person, Country and
City entities than all the others. In addition, OpenCalais obtains
good performance in classifying all the concepts except Time
and Number. It is worth noting that Extractiv is the only
extractor able to locate and classify Number and Time. In
this grouped view, we consider all the results classified with
the 6 main classes and we do not take into account any
potentially inferred relationships. This is why the Evri results
contrast with what is shown in Table 3. Indeed, Evri
provides a precise classification of Person entities, such as
Journalist, Physicist or Technologist, but it does not describe the
same resources as sub-classes of the Person concept.</p>
    </sec>
    <sec id="sec-15">
      <title>5.2 Scientific Documents</title>
      <p>In this experiment, we focus on the extraction performed
by all tools for the abstracts of the 217 papers published at the
WWW 2011 conference, with the aim of seeking NE
extraction patterns for scientific contributions. The total number
of words is 13,381, while the word detection rate per
document r(w,d) is equal to 175.40 and the total number of
recognized entities is 12,266, with an r(e,d) equal to 56.53.
Table 5 shows the statistics of the computation results for
all extractors. DBpedia Spotlight keeps a high rate of NEs
extracted but shows some weaknesses in disambiguating NEs
with LOD resources: r(u,e) is equal to 0.2871, lower than
its performance in the previous experiment (see section 5.1).
OpenCalais, instead, has the best r(u,e) and it has a
considerable ability to classify NEs. Evri performed in a similar
way, as shown by the r(c,e). The last part of this
experiment consists in aligning all the classification types retrieved
by these extractors using the NERD ontology, aligning 6
main concepts: Person, Organization, Country, City, Time
and Number. Table 6 shows the comparison results.
AlchemyAPI still obtains the best result in classifying named
entities as Person. Instead, differently from what happened in the
previous experiment, Evri outperforms AlchemyAPI in
classifying named entities as Organization. It is important
to note that Evri shows a high number of NEs classified
using the class Person in this scenario, but does not explore
the Person hierarchy deeply (as shown in the user generated
content experiment). OpenCalais has the best performance
in classifying NEs according to the City class, while
Extractiv shows reliability in recognizing Country and, especially, in
classifying Number.</p>
</p>
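<p>The per-document and per-entity rates used throughout these experiments are simple ratios over corpus-level counts. A minimal sketch of how they are derived, using the reported corpus figures for the WWW 2011 abstracts but hypothetical typed-entity and URI counts (the real per-extractor values are those of Table 5):</p>

```python
# Illustrative computation of the rates used in the comparison:
# r(w,d) = words per document, r(e,d) = entities per document,
# r(e,w) = entities per word, r(c,e) = typed entities per entity,
# r(u,e) = disambiguated (URI-bearing) entities per entity.
def rates(n_docs, n_words, n_entities, n_typed, n_uris):
    return {
        "r(w,d)": n_words / n_docs,
        "r(e,d)": n_entities / n_docs,
        "r(e,w)": n_entities / n_words,
        "r(c,e)": n_typed / n_entities,
        "r(u,e)": n_uris / n_entities,
    }

# 217 abstracts and 12,266 recognized entities are the corpus-level
# figures reported in the text; the typed/URI counts are hypothetical.
corpus = rates(n_docs=217, n_words=13381, n_entities=12266,
               n_typed=10000, n_uris=6000)
print(round(corpus["r(e,d)"], 2))  # 56.53 entities per document
```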
    </sec>
    <sec id="sec-16">
      <title>News Articles</title>
      <p>For this experiment, we collected 1000 news articles of
The New York Times, from 09/10/2011 to 12/10/2011, and
we performed the extraction with the tools involved in this
comparison. The goal is to explore the NE extraction ratio
on this dataset and to assess commonalities and differences
with the previous experiments. The total number of words is
620,567, while the word detection rate per document r(w,d)
is equal to 620.57, and the total number of recognized entities
is approximately 164,170, with r(e,d) equal to 164.17. Table 7 shows the
statistics of the computation results for all extractors.</p>
      <p>Extractiv is the tool which provides the highest number of
NEs. This score is considerably greater than what the
same extractor achieves in the other test scenarios (see section 5.1
and section 5.2), and it does not depend on the number
of words per document, as reported by r(e,w). In contrast,
DBpedia Spotlight shows an r(e,w) which is strongly affected
by the number of words: indeed, its r(e,w) is 0.048 lower
than the same score in the previous experiment. Although
the highest number of URIs is provided by
OpenCalais, the URI detection rate per entity is greater for
Zemanta, with a score equal to 0.577. AlchemyAPI, Evri and
OpenCalais confirm their reliability in classifying NEs, and their
detection score r(c,e) is appreciably greater than all the
others. Finally, we propose the alignment of the 6 main
types recognized by all extractors using the NERD
ontology. Table 8 shows the comparison results. Differently from
what has been detailed previously, DBpedia Spotlight
recognizes few classes, although this number is not
comparable with what the other extractors achieve. Zemanta
and DBpedia Spotlight increase their classification performance
with respect to the two previous
test cases, obtaining a number of recognized Person entities which
is lower by one order of magnitude. AlchemyAPI preserves
a strong ability to recognize Person, shows great
performance in recognizing City, and obtains significant scores for
Organization and Country. OpenCalais shows meaningful results
in recognizing the class Person and, especially, a strong
ability to classify NEs with the label Organization. Extractiv
holds the best score for classifying Country and is the only
extractor able to detect the classes Time and Number.
</p>
    </sec>
    <sec id="sec-17">
      <title>RELATED WORK</title>
      <p>
        The Named Entity (NE) recognition and disambiguation
problem has been addressed in different research fields, such
as the NLP, Web mining and Semantic Web communities. All of
them agree on the definition of a Named Entity, which was
coined by Grishman et al. as an information unit described
by the name of a person or an organization, a location,
a brand, a product, or a numeric expression including time,
date, money and percent, found in a sentence [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. One of the
first research papers in the NLP field aiming at
automatically identifying named entities in texts was proposed by
Rau [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This work relies on heuristics and the definition of
patterns to recognize company names in texts. The training set
is defined by the set of heuristics chosen. This work evolved
and was improved later on by Sekine et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. A different
approach was introduced when Supervised Learning (SL)
techniques were used. The big disruptive change was the
use of large, manually labeled datasets. In the SL field,
a human being usually provides positive and negative
examples so that the algorithm computes classification patterns.
SL techniques exploit Hidden Markov Models (HMM) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
Decision Trees [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], Maximum Entropy Models [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Support
Vector Machines (SVM) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Conditional Random Fields
(CRF) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The common goal of these approaches is to
recognize relevant key-phrases and to classify them according to a fixed
taxonomy. The challenge with SL approaches is the
unavailability of such labeled resources and the prohibitive cost
of creating examples. Semi-Supervised Learning (SSL) and
Unsupervised Learning (UL) approaches attempt to solve
this problem by either providing a small initial set of labeled
data to train and seed the system [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], or by casting the
extraction problem as a clustering one. For instance, a user
can try to gather named entities from clustered groups based
on the similarity of context. Other unsupervised methods
may rely on lexical resources (e.g. WordNet), lexical
patterns and statistics computed on large annotated corpora [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
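<p>The heuristic, pattern-based approach attributed to Rau above can be illustrated with a toy recognizer for company names. The suffix list and the regular expression below are illustrative assumptions, not the heuristics of the original system:</p>

```python
import re

# Toy sketch of a pattern-based company-name recognizer in the spirit of
# Rau's heuristic approach: a run of capitalized words ending in a known
# corporate designator. The suffix list and pattern are invented for
# illustration only.
COMPANY_SUFFIXES = r"(?:Inc\.|Corp\.|Ltd\.|GmbH|Co\.)"
PATTERN = re.compile(r"((?:[A-Z][\w-]+\s)+" + COMPANY_SUFFIXES + r")")

def find_company_names(text):
    """Return candidate company names found in the input text."""
    return [m.strip() for m in PATTERN.findall(text)]

print(find_company_names("She joined Acme Widgets Inc. after leaving Globex Corp."))
# ['Acme Widgets Inc.', 'Globex Corp.']
```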
      <p>
        The NER task strongly depends on the knowledge
base used to train the NE extraction algorithm. Leveraging
DBpedia, Freebase and YAGO, recent
methods coming from the Semantic Web community have been
introduced to map entities to relational facts, exploiting these
fine-grained ontologies. In addition to detecting an NE and its
type, efforts have been made to develop methods for
disambiguating an information unit with a URI. Disambiguation is
one of the key challenges in this scenario, and it rests
on the fact that terms taken in isolation are naturally
ambiguous. Hence, a text containing the term London may
refer to the city London in the UK or to the city London in
Minnesota, USA, depending on the surrounding context.
Similarly, people, organizations and companies can have multiple
names and nicknames. These methods generally try to find
in the surrounding text some clues for contextualizing the
ambiguous term and refining its intended meaning. Therefore,
a NE extraction workflow consists in analyzing some input
content to detect named entities, assigning them a type
weighted by a confidence score, and providing a list of
URIs for disambiguation. Initially, the Web mining
community has harnessed Wikipedia as the linking hub where
entities were mapped [
        <xref ref-type="bibr" rid="ref10 ref12">12, 10</xref>
        ]. A natural evolution of this
approach, mainly driven by the Semantic Web community,
consists in disambiguating named entities with data from
the LOD cloud. In [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], the authors proposed an approach
to avoid named entity ambiguity using the DBpedia dataset.
      </p>
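<p>The contextual disambiguation idea described above can be sketched as a bag-of-words overlap between the words surrounding a mention and a textual description of each candidate resource. The candidate URIs and descriptions below are invented for illustration, not taken from DBpedia:</p>

```python
# Toy sketch of context-based NE disambiguation: score each candidate URI
# by the word overlap between the mention's surrounding text and the
# candidate's description, and pick the best-scoring one.
def disambiguate(context, candidates):
    ctx = set(context.lower().split())
    def overlap(description):
        return len(ctx.intersection(description.lower().split()))
    return max(candidates, key=lambda uri: overlap(candidates[uri]))

# Hypothetical candidates for the ambiguous mention "London".
candidates = {
    "dbpedia:London": "capital city of England and the United Kingdom",
    "dbpedia:London,_Minnesota": "unincorporated community in Minnesota United States",
}
print(disambiguate("the mayor of London announced a new United Kingdom policy",
                   candidates))
# dbpedia:London
```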
      <p>
        Interlinking text resources with the Linked Open Data
cloud has become an important research question, and it has
been addressed by numerous services which have opened
their knowledge to online computation. Although these
services expose a comparable output, they have their own strengths
and weaknesses but, to the best of our knowledge, few
research efforts have been devoted to comparing them. The
creators of the DBpedia Spotlight service have compared
their service with a number of other NER extractors
(OpenCalais, Zemanta, Ontos Semantic API26, The Wiki
Machine27, AlchemyAPI and M&amp;W's wikifier [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]) in an annotation task scenario. The experiment consisted
in evaluating 35 paragraphs from 10 news articles in 8
categories selected from The New York Times, and was
performed by 4 human raters. The final goal was to
create wiki links and to provide a disambiguation benchmark
(partially re-used in this work). The experiment showed
that DBpedia Spotlight outperforms the other
services under evaluation, but also that its performance is strongly
affected by the configuration parameters. The authors
underlined the importance of performing several set-up experiments
to figure out the best configuration for the specific
disambiguation task. Moreover, they did not take into
account the precision of the NE and its type.
      </p>
      <p>
        We have ourselves proposed a first qualitative
comparison attempt, highlighting the precision score for each
extracted field from 10 news articles coming from 2 different
sources, The New York Times and BBC28, and 5 different
categories: business, health, science, sport and world [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Due
to the length of the news articles, we face a very low Fleiss's kappa
agreement score: the many output records to evaluate affected
the human raters' ability to select the correct answer. In
this paper, we advance these initial experiments by
providing a full generic framework powered by an ontology, and
we present a large scale quantitative experiment focusing
on the extraction performance with different types of text:
user-generated content, scientific text and news articles.
26 http://www.ontos.com
27 http://thewikimachine.fbk.eu/
28 http://www.bbc.com
      </p>
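<p>Fleiss's kappa, the inter-rater agreement measure mentioned above, can be computed from an items-by-categories count matrix. The following is a standard implementation of the measure, not the authors' evaluation code:</p>

```python
# Standard Fleiss' kappa over a count matrix where counts[i][j] is the
# number of raters assigning item i to category j (same number of raters
# per item).
def fleiss_kappa(counts):
    n_items = len(counts)
    n_raters = sum(counts[0])
    total = n_items * n_raters
    n_cats = len(counts[0])
    # Per-category proportions p_j and per-item agreement P_i.
    p = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    P = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
         for row in counts]
    P_bar = sum(P) / n_items       # mean observed agreement
    P_e = sum(pj * pj for pj in p)  # expected agreement by chance
    return (P_bar - P_e) / (1 - P_e)

# Two raters, three items, perfect agreement on every item: kappa = 1.
print(fleiss_kappa([[2, 0], [0, 2], [2, 0]]))  # 1.0
```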
    </sec>
    <sec id="sec-18">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>In this paper, we presented NERD, a web framework which
unifies 10 named entity extractors and lifts the output results
to the Linked Data Cloud following the NIF specification.
To motivate NERD, we presented a quantitative
comparison of 6 extractors in a particular task, scenario and setting.
Our goal was to assess the performance variations
according to different kinds of text (news articles, scientific papers,
user-generated content) and different text lengths. Results
showed that some extractors are affected by the word
cardinality and by the type of text, especially for scientific papers.
DBpedia Spotlight and OpenCalais are not affected by the
word cardinality, and Extractiv is the best solution for
classifying NEs according to "scientific" concepts such as Time and
Number.</p>
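<p>The lifting of a single extraction result to NIF-style RDF can be sketched by emitting triples for one annotated substring. The document URI, offsets and entity below are invented for illustration; the property names follow the NIF core and ITS 2.0 vocabularies, whose exact version may differ from the NIF specification used by NERD at the time:</p>

```python
# Sketch of lifting one extractor result to NIF-style RDF triples.
# The mention URI uses NIF's RFC 5147-style fragment (#char=begin,end);
# document URI, offsets and the linked entity are hypothetical.
text = "Berlin is the capital of Germany."
surface, begin = "Berlin", 0
end = begin + len(surface)
mention = "http://example.org/doc#char={},{}".format(begin, end)

triples = [
    (mention, "rdf:type", "nif:String"),
    (mention, "nif:beginIndex", str(begin)),
    (mention, "nif:endIndex", str(end)),
    (mention, "nif:anchorOf", surface),
    (mention, "itsrdf:taIdentRef", "http://dbpedia.org/resource/Berlin"),
]
for s, p, o in triples:
    print(s, p, o)
```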
      <p>This work has evidenced the need to follow up with such
systematic comparisons between NE extraction tools,
especially using a large gold-standard dataset. We believe that the
NERD framework we have proposed is a suitable tool to
perform such evaluations. In this work, the human
evaluation has been conducted by asking all participants to rate
the output results of these extractors. An important step
forward would be to investigate the creation of an
already-labeled dataset of triples (NE, type, URI) and then
to assess how closely these extractors adhere to this dataset. Future
work will include a thorough comparison with the ESTER2
and CoNLL-2003 datasets (well-known in the NLP
community), studying how they may fit the need of comparing
those extraction tools and, more importantly, how to combine
them. In terms of manual evaluation, a Boolean decision is not
enough for judging all tools. For example, a named entity
type might not be wrong, but simply not precise enough (Obama is
not only a person, he is also known as the American
President). Another improvement of the system is to allow the
input of additional items or the correction of misunderstood or
ambiguous items. Finally, we plan to implement a "smart"
extractor service, which takes into account extraction
evaluations coming from all raters to assist new evaluation tasks.
The idea is to study the role of the relevance field in order to
create a set of NEs not discovered by one tool but which
may be found by other tools.</p>
    </sec>
    <sec id="sec-19">
      <title>Acknowledgments</title>
      <p>This work was supported by the French National Agency
under contracts 09.2.93.0966, \Collaborative Annotation for
Video Accessibility" (ACAV), ANR.11.EITS.006.01, \Open
Innovation Platform for Semantic Media" (OpenSEM) and
the European Union's 7th Framework Programme via the
projects LOD2 (GA 257943) and LinkedTV (GA 287911).
The authors would like to thank Pablo Mendes for his
fruitful support and suggestions and Ruben Verborgh for the
NERD OpenID implementation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Alfonseca</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Manandhar</surname>
          </string-name>
          .
          <article-title>An Unsupervised Method for General Named Entity Recognition And Automated Concept Discovery</article-title>
          .
          <source>In 1st International Conference on General WordNet</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Asahara</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsumoto</surname>
          </string-name>
          .
          <article-title>Japanese Named Entity extraction with redundant morphological analysis</article-title>
          .
          <source>In International Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL'03)</source>
          , pages
          <fpage>8</fpage>
          –
          <fpage>15</fpage>
          , Edmonton, Canada,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , G. Kobilarov,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <surname>Z. Ives.</surname>
          </string-name>
          <article-title>DBpedia: A Nucleus for a Web of Open Data</article-title>
          .
          <source>In 6th International Semantic Web Conference (ISWC'07)</source>
          , pages
          <fpage>722</fpage>
          –
          <fpage>735</fpage>
          ,
          Busan, South Korea,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bikel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Weischedel</surname>
          </string-name>
          .
          <article-title>Nymble: a high-performance learning name-finder</article-title>
          .
          <source>In 5th International Conference on Applied Natural Language Processing</source>
          , pages
          <volume>194</volume>
          –
          <fpage>201</fpage>
          , Washington, USA,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Borthwick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sterling</surname>
          </string-name>
          , E. Agichtein, and
          <string-name>
            <given-names>R.</given-names>
            <surname>Grishman</surname>
          </string-name>
          . NYU:
          <article-title>Description of the MENE Named Entity System as Used in MUC-7</article-title>
          .
          <source>In 7th Message Understanding Conference (MUC-7)</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          .
          <article-title>Linking Open Data cloud diagram</article-title>
          .
          <source>LOD Community</source>
          (http://lod-cloud.net/)
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Fielding</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. N.</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Principled design of the modern web architecture</article-title>
          .
          <source>ACM Transactions on Internet Technology</source>
          ,
          <volume>2</volume>
          :
          <fpage>115</fpage>
          –
          <fpage>150</fpage>
          , May
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Galibert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosset</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Grouin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zweigenbaum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Quintard</surname>
          </string-name>
          .
          <article-title>Structured and extended named entity evaluation in automatic speech transcriptions</article-title>
          .
          <source>In Proceedings of 5th International Joint Conference on Natural Language Processing</source>
          , pages
          <volume>518</volume>
          –
          <fpage>526</fpage>
          ,
          Chiang Mai, Thailand,
          <year>November 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Grishman</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Sundheim</surname>
          </string-name>
          . Message Understanding Conference-
          <volume>6</volume>
          :
          <article-title>a brief history</article-title>
          .
          <source>In 16th International Conference on Computational linguistics (COLING'96)</source>
          , pages
          <fpage>466</fpage>
          –
          <fpage>471</fpage>
          , Copenhagen, Denmark,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. Hoffart</given-names>
            , M. A.
            <surname>Yosef</surname>
          </string-name>
          , I. Bordino, H. Furstenau, M. Pinkal,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spaniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Taneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thater</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Robust Disambiguation of Named Entities in Text</article-title>
          .
          <source>In Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <volume>782</volume>
          –
          <fpage>792</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Grishman</surname>
          </string-name>
          .
          <article-title>Data selection in semi-supervised learning for name tagging</article-title>
          .
          <source>In Workshop on Information Extraction Beyond The Document</source>
          , pages
          <volume>48</volume>
          –
          <fpage>55</fpage>
          ,
          Sydney, Australia,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chakrabarti</surname>
          </string-name>
          .
          <article-title>Collective annotation of Wikipedia entities in Web text</article-title>
          .
          <source>In 15th ACM International Conference on Knowledge Discovery and Data Mining (KDD'09)</source>
          , pages
          <fpage>457</fpage>
          –
          <fpage>466</fpage>
          , Paris, France,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A. M. W.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons</article-title>
          .
          <source>In 7th International Conference on Natural Language Learning at HLT-NAACL (CONLL'03)</source>
          , pages
          <fpage>188</fpage>
          –
          <fpage>191</fpage>
          , Edmonton, Canada,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Silva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          . DBpedia Spotlight:
          <article-title>Shedding Light on the Web of Documents</article-title>
          .
          <source>In 7th International Conference on Semantic Systems (I-Semantics)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Milne</surname>
          </string-name>
          and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>Learning to link with Wikipedia</article-title>
          .
          <source>In 17th ACM International Conference on Information and Knowledge Management (CIKM'08)</source>
          , pages
          <fpage>509</fpage>
          –
          <fpage>518</fpage>
          ,
          Napa Valley, California, USA,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rau</surname>
          </string-name>
          .
          <article-title>Extracting company names from text</article-title>
          .
          <source>In 7th IEEE Conference on Artificial Intelligence Applications</source>
          , volume i, pages
          <volume>29</volume>
          –
          <fpage>32</fpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          .
          <article-title>NERD: A Framework for Evaluating Named Entity Recognition Tools in the Web of Data</article-title>
          .
          <source>In 10th International Semantic Web Conference (ISWC'11)</source>
          , Demo Session, pages
          <fpage>1</fpage>
          –
          <lpage>4</lpage>
          , Bonn, Germany,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          . NERD:
          <article-title>Evaluating Named Entity Recognition Tools in the Web of Data</article-title>
          .
          <source>In Workshop on Web Scale Knowledge Extraction (WEKEX'11)</source>
          , pages
          <fpage>1</fpage>
          –
          <fpage>16</fpage>
          ,
          Bonn, Germany,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sekine</surname>
          </string-name>
          . NYU:
          <article-title>Description of the Japanese NE system used for MET-2</article-title>
          .
          <source>In 7th Message Understanding Conference (MUC-7)</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sekine</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Nobata</surname>
          </string-name>
          . Definition,
          <article-title>Dictionaries and Tagger for Extended Named Entity Hierarchy</article-title>
          .
          <source>In 4th International Conference on Language Resources and Evaluation (LREC'04)</source>
          , Lisbon, Portugal,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          , G. Kasneci, and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum.</surname>
          </string-name>
          <article-title>Yago: a Core of Semantic Knowledge</article-title>
          .
          <source>In 16th International Conference on World Wide Web (WWW'07)</source>
          , pages
          <fpage>697</fpage>
          –
          <fpage>706</fpage>
          ,
          Banff, Alberta, Canada,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>