NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud

Giuseppe Rizzo, EURECOM, France / Politecnico di Torino, Italy (giuseppe.rizzo@eurecom.fr)
Raphaël Troncy, EURECOM, France (raphael.troncy@eurecom.fr)
Sebastian Hellmann and Martin Bruemmer, Universität Leipzig, Germany (hellmann@informatik.uni-leipzig.de)


ABSTRACT
We have often heard that data is the new oil. In particular, extracting information from semi-structured textual documents on the Web is key to realizing the Linked Data vision. Several attempts have been proposed to extract knowledge from textual documents: extracting named entities, classifying them according to pre-defined taxonomies and disambiguating them through URIs identifying real world entities. As a step towards interconnecting the Web of documents via those entities, different extractors have been proposed. Although they share the same main purpose (extracting named entities), they differ in numerous aspects such as their underlying dictionary or their ability to disambiguate entities. We have developed NERD, an API and a front-end user interface powered by an ontology to unify various named entity extractors. The unified result output is serialized in RDF according to the NIF specification and published back on the Linked Data cloud. We evaluated NERD with a dataset composed of five TED talk transcripts, a dataset composed of 1000 New York Times articles and a dataset composed of the 217 abstracts of the papers published at WWW 2011.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing - Language parsing and understanding

General Terms
Measurement, Performance, Experimentation, Web

Keywords
Named Entity extractors, Information extraction, Linked Data, Evaluation

Copyright is held by the author/owner(s).
LDOW2012, April 16, 2012, Lyon, France.

1.   INTRODUCTION
   The Web of Data is often illustrated as a fast growing cloud of interconnected datasets representing information about nearly everything [6]. The Web also hosts millions of semi-structured texts such as scientific papers, news articles as well as forum and archived mailing list threads or (micro-)blog posts. This information usually has a rich semantic structure which is clear to the author but remains mostly hidden to computing machinery. Named entity and information extractors aim to bring such structure out of these free texts. They provide algorithms for extracting semantic units identifying the names of people, organizations, locations, time references, quantities, etc., and for classifying them according to predefined schemas, thereby increasing the discoverability (e.g. through faceted search), reusability and utility of information.
   Since the 90's, an increasing emphasis has been given to the evaluation of NLP techniques. Hence, the Named Entity Recognition (NER) task has been developed as an essential component of the Information Extraction field. Initially, these techniques focused on identifying atomic information units in a text, named entities, which were later classified into predefined categories (also called context types) by classification techniques and linked to real world objects using web identifiers; this latter task is called Named Entity Disambiguation. Knowledge bases affect the disambiguation task in several ways, because they provide the final disambiguation point where the information is linked. Recent methods leverage knowledge bases such as DBpedia [3], Freebase (http://www.freebase.com/) or YAGO [21] since they contain many entries corresponding to real world entities, classified according to exhaustive classification schemes. A number of tools have been developed to extract structured information from text resources, classifying it according to pre-defined taxonomies and disambiguating it using URIs. In this work, we aim to evaluate tools which provide such an online computation: AlchemyAPI (http://www.alchemyapi.com), DBpedia Spotlight (http://dbpedia.org/spotlight), Evri (http://www.evri.com/developer/index.html), Extractiv (http://extractiv.com), Lupedia (http://lupedia.ontotext.com), OpenCalais (http://www.opencalais.com), Saplo (http://saplo.com), Wikimeta (http://www.wikimeta.com), Yahoo! Content Analysis (YCA, http://developer.yahoo.com/search/content/V2/contentAnalysis.html) and Zemanta (http://www.zemanta.com). They represent a clear
opportunity for the Linked Data community to increase the volume of interconnected data. Although these tools share the same purpose (extracting semantic units from text), they make use of different algorithms and training data. They generally provide a similar output composed of a set of extracted named entities, their type and potentially a URI disambiguating each named entity, but this output varies in terms of the underlying data model. Hence, we propose the Named Entity Recognition and Disambiguation (NERD) framework, which unifies the output results of these extractors, lifting them to the Linked Data Cloud using the new NIF specification.
   These services have their own strengths and shortcomings but, to the best of our knowledge, few scientific evaluations have been conducted to understand the conditions under which a given tool is the most appropriate one. This paper attempts to fill this gap. We have performed quantitative evaluations on three datasets covering different types of textual documents: a dataset composed of the transcripts of five TED talks (http://www.ted.com), a dataset composed of 1000 news articles from The New York Times (http://www.nytimes.com), and a dataset composed of the 217 abstracts of the papers published at WWW 2011 (http://www.www2011india.com). We present statistics to underline the behavior of such extractors in different scenarios and group them according to the NERD ontology. We have developed the NERD framework, available at http://nerd.eurecom.fr, to perform systematic evaluations of NE extractors.
   The remainder of this paper is organized as follows. In section 2, we introduce a factual comparison of the named entity extractors investigated in this work. We describe the NERD framework in section 3 and highlight the importance of having an output compliant with the Linked Data principles in section 4. We then describe the experimental results we obtained in section 5 and, in section 6, we propose an overview of Named Entity recognition and disambiguation techniques. Finally, we give our conclusions and outline future work in section 7.

2.   FACTUAL COMPARISON OF NAMED ENTITY EXTRACTORS
   The NE recognition and disambiguation tools vary in terms of response granularity and technology used. By granularity, we mean the way the extraction algorithm works: One Entity per Name (OEN), where the algorithm tokenizes the document into a list of exclusive sentences, recognizing the dot as a terminator character, and detects named entities within each sentence; and One Entity per Document (OED), where the algorithm considers the bag of words of the entire document, detects named entities, and removes duplicates of the same output record (NE, type, URI). The result sets of the two approaches therefore differ, as contrasted in the sketch below.
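
The following minimal Python sketch (our illustration, not code from any of the evaluated services) contrasts the two granularities; the extract_sentence callback is a hypothetical stand-in for any per-sentence NER call returning (NE, type, URI) records:

  # Minimal sketch contrasting OEN and OED granularities.
  # extract_sentence() is a hypothetical stand-in for a per-sentence
  # NER call returning (named entity, type, URI) records.
  from typing import Callable, List, Tuple

  Record = Tuple[str, str, str]  # (named entity, type, URI)

  def oen(text: str, extract_sentence: Callable[[str], List[Record]]) -> List[Record]:
      # One Entity per Name: split on the dot, run the extractor per
      # sentence, keeping every occurrence of a record.
      sentences = [s.strip() for s in text.split(".") if s.strip()]
      records: List[Record] = []
      for sentence in sentences:
          records.extend(extract_sentence(sentence))
      return records

  def oed(text: str, extract_sentence: Callable[[str], List[Record]]) -> List[Record]:
      # One Entity per Document: same detections over the whole document,
      # but duplicates of the same (NE, type, URI) record are collapsed.
      seen, records = set(), []
      for record in oen(text, extract_sentence):
          if record not in seen:
              seen.add(record)
              records.append(record)
      return records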
   Table 1 provides an extensive comparison that takes into account the technology used: the algorithms used to extract NEs, the supported languages, the ontology used to classify the NEs, the dataset used for looking up real world entities, and the technical constraints of the online computation, such as the maximum content request size and the response format. We also report whether a tool provides the position where an NE is found in the text or not. We distinguish four cases:

char offset : considering the text as a sequence of characters, the tool reports the character index where the NE starts and the length (number of characters) of the NE;

range of chars : considering the text as a sequence of characters, the tool reports the start index and the end index where the NE appears;

word offset : the text is tokenized, splitting on punctuation, and the tool reports the number of words after which the NE is located (punctuation is not taken into account in this counting);

POS offset : the text is tokenized, splitting on punctuation, and the tool reports the number of parts-of-speech after which the NE is located.

   We performed an experimental evaluation to estimate the maximum content chunk supported by each API, creating a simple application that initially sends a text of 1KB to each extractor. When the answer was correct (HTTP status 20x), we performed one more test, increasing the content chunk by 1KB. We iterated this operation until we received the answer "text too long"; the sketch below outlines this probing loop. Table 1 summarizes the factual comparison of the services involved in this study. A * means that the value has been estimated experimentally (as for the content chunk), + means a list of other sources, generally identifiable as any source available on the Web, and N/A means not available.
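
A minimal sketch of this probing loop, assuming a hypothetical endpoint URL and form field (each real service has its own URL, parameters and authentication):

  # Sketch of the content-chunk probing loop described above.
  # ENDPOINT and the "text" form field are placeholders: each real
  # service differs in URL, parameters and authentication.
  import requests

  ENDPOINT = "http://example.org/ner/api"

  def max_content_chunk_kb(endpoint: str = ENDPOINT, limit_kb: int = 10000) -> int:
      size_kb = 0
      while size_kb < limit_kb:
          payload = "a" * ((size_kb + 1) * 1024)  # grow the text by 1KB per round
          response = requests.post(endpoint, data={"text": payload})
          if response.status_code // 10 != 20:  # stop at the first non-20x answer
              break
          size_kb += 1
      return size_kb  # largest accepted chunk, in KB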

3.   THE NERD FRAMEWORK
   NERD is a web framework plugged on top of various NER extractors. Its architecture follows the REST principles [7] and includes an HTML front-end for humans and an API for computers to exchange content in JSON (another serialization of the NERD output will be detailed in section 4). Both interfaces are powered by the NERD REST engine.

3.1   The NERD Data Model
   We propose the following data model, which encapsulates the common properties for representing NERD extraction results. It is composed of a list of entities for which a label, a type and a URI are provided, together with the mapped type in the NERD taxonomy, the position of the named entity, and the confidence and relevance scores as they are provided by the NER tools. The example below shows this data model (for the sake of brevity, we use the JSON syntax):

  "entities": [{
      "entity": "Tim Berners-Lee",
      "type": "Person",
      "uri": "http://dbpedia.org/resource/Tim_berners_lee",
      "nerdType": "http://nerd.eurecom.fr/ontology#Person",
      "startChar": 30,
      "endChar": 45,
      "confidence": 1,
      "relevance": 0.5
  }]
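
On the client side, one record of this model could be mirrored by a small data structure such as the following sketch (the field names simply follow the JSON above):

  # Sketch of one NERD extraction record, mirroring the JSON fields above.
  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class NerdEntity:
      entity: str          # surface form, e.g. "Tim Berners-Lee"
      type: str            # type in the extractor's own taxonomy
      uri: Optional[str]   # disambiguation URI, when provided
      nerdType: str        # mapped type in the NERD taxonomy
      startChar: int       # position of the NE in the text
      endChar: int
      confidence: float    # scores as reported by the NER tool
      relevance: float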
                  AlchemyAPI      DBpedia Spotlight        Evri         Extractiv       Lupedia         OpenCalais       Saplo        Wikimeta          YCA           Zemanta
 Granularity          OED               OEN                OED            OEN             OEN             OED            OED            OEN             OEN            OED
  Language          English            English            English        English        English          English        English        English         English        English
   support           French       German (partial)        Italian                       French           French         Swedish        French
                    German       Portuguese (partial)                                    Italian         Spanish                      Spanish
                     Italian       Spanish (partial)
                  Portuguese
                    Russian
                    Spanish
                    Swedish
Restriction on        30000           unlimited             3000           1000         unlimited         50000          1333         unlimited         5000            10000
academic use
 (calls/day)
   Sample           C/C++                Java           Action Script      Java           N/A              Java           Java          Java         Javascript          C#
   clients            C#              Javascript            Java                                                       Javascript       Perl           PHP               Java
                      Java              PHP                 PHP                                                          PHP                                          Javascript
                      Perl                                                                                              Python                                           Perl
                     PHP-5                                                                                                                                              PHP
                    Python                                                                                                                                             Python
                     Ruby                                                                                                                                               Ruby
     API              CLI              AJAX                AJAX           AJAX            CLI             AJAX          AJAX            CLI           JAX-RS            AJAX
   interface        JAX-RS              CLI               JAX-RS           CLI          JAX-RS             CLI           CLI          JAX-RS            CLI              CLI
                     SOAP             JAX-RS                             JAX-RS                          JAX-RS        JAX-RS                                          JAX-RS
                                       SOAP                                                               SOAP
  Content           150KB*            452KB*               8KB*           32KB*          20KB*            8KB*          26KB*          80KB*          7769KB*          970KB*
    chunk
  Response           JSON        HTML+uF(rel-tag)           GPB           HTML           HTML              JSON          JSON          JSON             JSON           JSON
   format         Microformats       JSON                  HTML           JSON           JSON           Microformats                   XML              XML           WNJSON
                     XML             RDF                   JSON            RDF           RDFa                N3                                                        RDF
                      RDF         XHTML+RDFa                RDF           XML            XML           Simple Format                                                   XML
                                     XML
  Entity type         324             320                   300*            34             319              95             5              7              13              81
    number
Entity position       N/A            char offset            N/A         word offset   range of chars    char offset      N/A         POS offset     range of chars      N/A
 Classification     Alchemy           DBpedia               Evri         DBpedia        DBpedia         OpenCalais       Saplo        ESTER            Yahoo          FreeBase
   ontologies                         FreeBase                                                                                        (partial)
                                     Schema.org
 Dereferenceable    DBpedia           DBpedia               Evri         DBpedia       DBpedia          OpenCalais       N/A          DBpedia        Wikipedia        Wikipedia
 vocabularies       Freebase                                                          LinkedMDB                                      Geonames                           IMDB
                   US Census                                                                                                        CIAFactbook                      MusicBrainz
                   GeoNames                                                                                                         Wikicompanies                     Amazon
                    UMBEL                                                                                                              others+                        YouTube
                    OpenCyc                                                                                                                                          TechCrunch
                     YAGO                                                                                                                                            MusicBrainz
                   MusicBrainz                                                                                                                                         Twitter
                  CIA Factbook                                                                                                                                       MyBlogLog
                   CrunchBase                                                                                                                                         Facebook
                                                                                                                                                                       others+


                                          Table 1: Factual information about 10 extractors under investigation

3.2   The NERD REST API
   The REST engine runs on the Jersey (http://jersey.java.net) and Grizzly (http://grizzly.java.net) technologies. These extensible frameworks enable the development of several components, and NERD is composed of 7 modules, namely authentication, scraping, extraction, ontology mapping, store, statistics and web. The authentication module takes as input the FOAF profile of a user and links the evaluations with the user who performs them (we are finalizing an OpenID implementation which will soon replace the current simple authentication system). The scraping module takes as input the URI of an article and extracts all its raw text. Extraction is the module designed to invoke the external service APIs and collect the results. Each service provides its own taxonomy of named entity types it can recognize, so we designed the NERD ontology, which provides a set of mappings between these various classifications. The ontology mapping module is in charge of mapping the classification types retrieved to our ontology. The store module saves all evaluations according to the schema model we defined in the NERD database. The statistics module enables the extraction of data patterns from the user interactions stored in the database and the computation of statistical scores such as the Fleiss Kappa score and the precision measure. Finally, the web module manages the client requests, the web cache and the generation of HTML pages.
   Plugged on top of this engine, there is an API interface (described at http://nerd.eurecom.fr/api/application.wadl). It is developed following the REST principles and has been implemented to enable programmatic access to the NERD framework. It follows the URI scheme below (the base URI is http://nerd.eurecom.fr/api); a sketch of a possible client follows the list:

/document : GET, POST, PUT methods enable to fetch, submit or modify a document parsed by the NERD framework;

/user : GET, POST methods enable to insert a new user to the NERD framework and to fetch account details;

/annotation/{extractor} : POST method drives the annotation of a document. The parametric URI allows to pilot the extractors supported by NERD;

/extraction : GET method allows to fetch the output described in section 3.1;

/evaluation : GET method allows to retrieve a statistical interpretation of the extractor behaviors.
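
A possible client for this URI scheme, sketched with Python's requests library; the field names ("uri", "key") and the exact call sequence are assumptions made for illustration, the authoritative contract being the WADL description referenced above:

  # Illustrative client for the NERD URI scheme described above.
  # The parameter names ("uri", "key") and the response layout are
  # assumptions for this sketch; consult the WADL for the actual contract.
  import requests

  BASE = "http://nerd.eurecom.fr/api"

  def annotate(document_uri: str, extractor: str, api_key: str) -> list:
      # Submit a document to be parsed by the NERD framework.
      doc = requests.post(BASE + "/document",
                          data={"uri": document_uri, "key": api_key})
      doc.raise_for_status()
      # Drive the annotation with one of the supported extractors
      # (the parametric /annotation/{extractor} URI).
      requests.post(BASE + "/annotation/" + extractor,
                    data={"key": api_key}).raise_for_status()
      # Fetch the unified output following the data model of section 3.1.
      out = requests.get(BASE + "/extraction", params={"key": api_key})
      out.raise_for_status()
      return [(e["entity"], e["nerdType"], e.get("uri"))
              for e in out.json().get("entities", [])]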
                                                                 evaluations to see the metrics that are computed for each
3.3        The NERD Ontology                                     service in real time.
   Although these tools share the same goal, they use differ-
ent algorithms and different dictionaries which makes hard
their comparison. We have developed the NERD ontology,           4.       NIF: AN NLP INTERCHANGE FORMAT
a set of mappings established manually between the tax-             The NLP Interchange Format (NIF) is an RDF/OWL-
onomies of NE types. Concepts included in the NERD ontol-        based format that aims to achieve interoperability between
ogy are collected from different schema types: ontology (for     Natural Language Processing (NLP) tools, language resources
DBpedia Spotlight, Lupedia, and Zemanta), lightweight tax-       and annotations. The NIF specification has been released in
onomy (for AlchemyAPI, Evri, and Yahoo!) or simple flat          an initial version 1.0 in November 2011 and describes how in-
type lists (for Extractiv, OpenCalais, Saplo, and Wikimeta).     teroperability between NLP tools, which are exposed as NIF
The NERD ontology tries to merge the linguistic commu-           web services can be achieved. Extensive feedback was given
nity needs and the logician community ones: we developed         on several mailing lists and a community of interest19 was
a core set of axioms based on the Quaero schema [8] and we       created to improve the specification. Implementations for 8
mapped similar concepts described in the other scheme. The       different NLP tools (e.g. UIMA, Gate ANNIE and DBpedia
selection of these concepts has been done considering the        Spotlight) exist and a public web demo20 is available.
greatest common denominator among them. The concepts                In the following, we will first introduce the core concepts
that do not appear in the NERD namespace are sub-classes         of NIF, which are defined in a String Ontology21 (STR). We
of parents that end-up in the NERD ontology. This ontology       will then explain how NIF is used in NERD. The resulting
is available at http://nerd.eurecom.fr/ontology. To summarize,   properties and axioms are included into a Structured Sen-
a concept is included in the NERD ontology as soon as there      tence Ontology22 (SSO). While the String Ontology is used
are at least two extractors that use it. The NERD ontology       18
                                                                    http://nerd.eurecom.fr
becomes a reference ontology for comparing the classifica-       19
                                                                    http://nlp2rdf.org/get-involved
tion task of NE extractors. We show an example mapping           20
                                                                    http://nlp2rdf.lod2.eu/demo.php
among those extractors below: the City type is considered        21
                                                                    http://nlp2rdf.lod2.eu/schema/string
17                                                               22
     http://nerd.eurecom.fr/api/application.wadl                    http://nlp2rdf.lod2.eu/schema/sso
to describe the relations between strings (i.e. Unicode characters), the SSO collects properties and classes to connect strings to NLP annotations and NER entities as produced by NERD.

[Figure 1: NIF URI schemes: Offset (top) and context-hashes (bottom) are used to create identifiers for strings]

4.1   Core Concepts of NIF
   The motivation behind NIF is to allow NLP tools to exchange annotations about documents in RDF. Hence, the main prerequisite is that parts of the documents (i.e. strings) are referenceable by URIs, so that they can be used as subjects in RDF statements. We call an algorithm to create such identifiers a URI scheme: for a given text t (a sequence of characters) of length |t| (number of characters), we are looking for a URI scheme to create a URI that can serve as a unique identifier for a substring s of t (i.e. |s| ≤ |t|). Such a substring can (1) consist of adjacent characters only, and is therefore a unique character sequence within the text if we account for parameters such as context and position, or (2) be derived by a function which points to several substrings as defined in (1).
   NIF provides two URI schemes which can be used to represent strings as RDF resources. We focus here on the first scheme, using offsets. In the top part of Figure 1, two triples are given that use the following URI as subject:

  http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729

According to this definition, the URI points to a substring of a given text t which starts at character index 717 and ends at index 729 (counting all characters).
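
Such offset URIs can be minted with a few lines of Python; this is our sketch of the pattern shown above, not reference code from the NIF specification:

  # Sketch: mint a NIF offset-based URI for a substring of a text,
  # following the #offset_<begin>_<end> pattern shown above.
  def nif_offset_uri(document_uri: str, text: str, begin: int, end: int) -> str:
      # Indexes count the characters of the whole text from its start.
      assert 0 <= begin <= end <= len(text), "offsets must lie within the text"
      return "%s#offset_%d_%d" % (document_uri, begin, end)

  # Example: the substring at characters 717-729 of the Design Issues page
  # (the 26547-character document is stood in for by a placeholder string).
  uri = nif_offset_uri("http://www.w3.org/DesignIssues/LinkedData.html",
                       " " * 26547, 717, 729)
  # -> http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729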
NIF currently mandates that the whole string of the document has to be included in the RDF output as an rdf:Literal to serve as the reference point, which we will call the inside context, formalized using an OWL class called str:Context. The term document would be inappropriate to capture the real intention of this concept, as we would like to refer to an arbitrary grouping of characters forming a unit, which could also be a paragraph or a section and is highly dependent upon the wider context in which the string is actually used, such as a Web document reachable via HTTP.
   To appropriately capture the intention of such a class, we distinguish between the notions of outside and inside context of a piece of text. The inside context is easy to explain and formalise, as it is the text itself, and therefore it provides a reference context for each substring contained in the text (i.e. the characters before or after the substring). The outside context is more vague and is given by an outside observer, who might arbitrarily interpret the text as a "book chapter" or a "book section".
   The class str:Context now provides a clear reference point for all other relative URIs used in this context and blocks, by definition, the addition of information from a larger (outside) context. For example, str:Context is disjoint with foaf:Document, since labeling a context resource as a document is information which is not contained within the context (i.e. the text) itself. It is legal, however, to say that the string of the context occurs in (str:occursIn) a foaf:Document. Additionally, str:Context is a subclass of str:String, and therefore its instances denote Unicode text as well. The main benefit of limiting the context is that an OWL reasoner can now infer that two contexts are the same if they consist of the same string, because an inverse-functional datatype property (str:isString) is used to attach the actual text to the context resource.

  :offset_0_26546 a str:Context ;
      # the exact retrieval method is left underspecified
      str:occursIn <http://www.w3.org/DesignIssues/LinkedData.html> ;
      # [...] are all 26547 characters as rdf:Literal
      str:isString "[...]" .
  :offset_717_729 a str:String ;
      str:referenceContext :offset_0_26546 .

   A complete formalisation is still work in progress, but the idea is explained here. The NIF URIs will be grounded on Unicode characters (especially Unicode Normalization Form C, http://www.unicode.org/reports/tr15/#Norm_Forms, counted in code units, see http://unicode.org/faq/char_combmark.html#7). For all resources of type str:String, the universe of discourse will then be the words over the alphabet of Unicode characters, sometimes called Σ*. Perspectively, we hope that this will allow for an unambiguous interpretation of NIF by machines.
   Within the framework of RDF and the current usage of NIF for the interchange of output between NLP tools, this definition of the semantics is sufficient to produce a working system. However, problems arise if additional interoperability with Linked Data or fragment identifiers and ad-hoc retrieval of content from the Web is demanded. The actual retrieval method (such as content negotiation) to retrieve and validate the content for #offset_717_729_Semantic%20Web or its reference context is left underspecified, as is the relation of NIF URIs to fragment identifiers for MIME types such as text/plain (see RFC 5147, http://tools.ietf.org/html/rfc5147). As long as such issues remain open, the complete text has to be included as an RDF literal.

4.2   Connecting Strings to Entities
   For NERD, three relevant concepts have to be expressed in RDF and were included into the Structured Sentence Ontology (SSO): OEN, OED and the NERD ontology types.
   One Entity per Name (OEN) can be modeled in a straightforward way, by introducing a property sso:oen which connects the string with an arbitrary entity.

  :offset_717_729 sso:oen dbpedia:Semantic_Web .

   One Entity per Document (OED). As document is an outside interpretation of a string, the notion of context in NIF
has to be used. The property sso:oec is used to attach entities to a given context. We furthermore add the following DL axiom:

  sso:oec ⊇ str:referenceContext⁻¹ ∘ sso:oen

As the property oen carries the more specific information, oec can be inferred by the above role chain inclusion, as sketched below. In case the context is enlarged, any materialized information attached via the oec property needs to be migrated to the larger context resource.
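
Read operationally, the axiom states that whenever a string s has reference context c and is linked to an entity e via sso:oen, then c is linked to e via sso:oec. A toy forward-chaining sketch of this inference (our illustration):

  # Toy forward-chaining of sso:oec over a set of (s, p, o) triples:
  # if (s, str:referenceContext, c) and (s, sso:oen, e), add (c, sso:oec, e).
  def infer_oec(triples: set) -> set:
      context_of = {s: o for (s, p, o) in triples if p == "str:referenceContext"}
      inferred = {(context_of[s], "sso:oec", o)
                  for (s, p, o) in triples
                  if p == "sso:oen" and s in context_of}
      return triples | inferred

  triples = {
      (":offset_717_729", "str:referenceContext", ":offset_0_26546"),
      (":offset_717_729", "sso:oen", "dbpedia:Semantic_Web"),
  }
  # Adds (":offset_0_26546", "sso:oec", "dbpedia:Semantic_Web")
  triples = infer_oec(triples)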
   The connection between NERD types and strings is done via a linked data URI, which disambiguates the entity. Overall, three cases can be distinguished. In case the NER extractor has provided a linked data URI to disambiguate the entity, we simply re-use it, as in the following example:

  # this URI points to the string "W3C"
  :offset_23107_23110
      rdf:type               str:String ;
      str:referenceContext   :offset_0_26546 ;
      sso:oen                dbpedia:W3C ;
      str:beginIndex         "23107" ;
      str:endIndex           "23110" .
  dbpedia:W3C rdf:type nerd:Organization .

If, however, the NER extractor provides no disambiguation link at all, or just a non-linked data URI for the entity (typically, the foaf:homepage of an organization such as http://www.w3.org/), we plan to mint a new linked data URI for the respective entity, which could then be further linked via sameAs to other identifiers in a data reconciliation process.

5.   EVALUATIONS
   We performed a quantitative experiment using three different datasets: a dataset composed of the transcripts of five TED talks (from different categories of talks), a dataset composed of 1000 news articles from The New York Times (collected from 09/10/2011 to 12/10/2011), and a dataset composed of the 217 abstracts of the papers published at the WWW 2011 conference. The aim of these evaluations is to assess how these extractors perform in different scenarios, such as news articles, user generated content and scientific papers. The total number of documents is 1,222, with an average word number per document equal to 549. Each document was evaluated using the 6 extractors supported by the NERD framework at the time this evaluation was conducted (Lupedia, Saplo, Wikimeta and YCA were not yet part of the NERD framework). The final number of entities detected is 177,823 and the average number of unique entities per document is 20.03. Table 2 shows these statistics grouped according to the source documents.
   We define the following variables: the number nd of evaluated documents, the number nw of words, the total number ne of entities, the total number nc of categories and the total number nu of URIs. Moreover, we compute the following measures: the word detection rate r(w,d), i.e. the number of words per document; the entity detection rate r(e,d), i.e. the number of entities per document; the number of entities per word r(e,w); the number of categories per entity r(c,e) (this measure has been computed removing non-relevant labels such as "null" or "LINKED OTHER"); and the number of URIs per entity r(u,e).
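
Spelled out as code, these measures are plain ratios of the raw counts (our reading of the definitions above):

  # Evaluation rates as ratios of the raw counts.
  # n_d: documents, n_w: words, n_e: entities, n_c: categories, n_u: URIs.
  def rates(n_d: int, n_w: int, n_e: int, n_c: int, n_u: int) -> dict:
      return {
          "r(w,d)": n_w / n_d,  # word detection rate per document
          "r(e,d)": n_e / n_d,  # entity detection rate per document
          "r(e,w)": n_e / n_w,  # entities per word
          "r(c,e)": n_c / n_e,  # categories per entity
          "r(u,e)": n_u / n_e,  # URIs per entity
      }

  # Example, DBpedia Spotlight on the TED dataset (Table 3):
  # r(e,d) = 810 / 5 = 162 and r(u,e) = 624 / 810 = 0.77
  print(rates(n_d=5, n_w=13381, n_e=810, n_c=0, n_u=624))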

            WWW2011      TED        NYTimes
  nd        217          5          1,000
  nw        38,062       13,381     620,567
  rw        175.4        2,676.2    620.567
  ne        12,266       1,441      164,116
  re        56.53        288.2      164.1

Table 2: Statistics about the three datasets used in the quantitative experiment, grouped according to the source where the documents were collected.

5.1   User Generated Content
   In this experiment, we focus on the extractions performed by all tools on the five TED talk transcripts. The goal is to find out the NE extraction ratio for user generated content, such as speech transcripts of videos. First, we propose general statistics about the extraction task; then, we focus on the classification, showing statistics grouped according to the NERD ontology. DBpedia Spotlight classifies each resource according to three different schemas (see Table 1); for this experiment, we consider only the results which belong to the DBpedia ontology. The total number of documents is 5, with an overall number of words equal to 13,381. The word detection rate per document r(w,d) is equal to 2,676.2, the overall number of entities is 1,441, and r(e,d) is 288.2. Table 3 shows the statistics about the computation results for all extractors. DBpedia Spotlight is the extractor which provides the highest number of NEs and disambiguated URIs. These values show the ability of this extractor to locate NEs and to exploit the large cloud of LOD resources. At the same time, it is worth noting that it is not able to classify these resources, although it uses a deep classification schema. All the extractors show a high ability for the classification task, except Zemanta, as shown by r(c,e). Conversely, Zemanta shows a strong ability to disambiguate NEs via URI identification, as shown by r(u,e); OpenCalais and Evri have almost the same performance as Zemanta. The last part of this experiment consists in aligning all the classification types provided by these extractors, while analyzing the TED talk transcripts, using the NERD ontology. For the sake of brevity, we report all the grouping results according to 6 main concepts: Person, Organization, Country, City, Time and Number. Table 4 shows the comparison results. AlchemyAPI classifies a higher number of Person, Country and City than all the others. In addition, OpenCalais obtains good performance in classifying all the concepts except Time and Number. It is worth noting that Extractiv is the only extractor able to locate and classify Number and Time. In this grouped view, we consider only the results classified with the 6 main classes and we do not take into account potentially inferred relationships. This is why the Evri results contrast with what is shown in Table 3: Evri provides precise classifications for Person, such as Journalist, Physicist or Technologist, but it does not describe the same resource as a sub-class of the Person concept.

5.2   Scientific Documents
   In this experiment, we focus on the extraction performed by all tools on the 217 abstracts of the papers published at the WWW 2011 conference, with the aim of seeking NE extraction patterns for scientific contributions. The total number
                                          ne    nc      nu   r(e, d)   r(e, w)     r(c, e)   r(u, e)
                     AlchemyAPI          141   141      71    28.2       0.01        1       0.504
                     DBpedia Spotlight   810     0     624    162       0.06         0        0.77
                     Evri                120   120     113     24       0.009        1       0.942
                     Extractiv            60    53      22     12       0.004      0.883     0.367
                     OpenCalais          163   136     157    32.6      0.012      0.834     0.963
                     Zemanta              50    17      49     10       0.003       0.34      0.98

Table 3: Statistics about the computation results of all extractors under comparison, for the TED talk transcripts.




                           AlchemyAPI    DBpedia Spotlight     Evri    Extractiv     OpenCalais        Zemanta
            Person              42              -               10         6            27                4
            Organization        15              -                -         -            20                1
            Country             16              -               11         1            16                3
            City                14              -                3         3             7                -
            Time                 -              -                -         1             -                -
            Number               -              -                -         5             -                -

Table 4: Number of axioms aligned for all the extractors involved in the comparison according to the NERD
ontology for the sources coming from TED talks.




                                        ne       nc      nu     r(e, d)   r(e, w)     r(c, e)   r(u, e)
                  AlchemyAPI           323      171      39     1.488      0.008      0.529     0.121
                  DBpedia Spotlight   3,699       0    1,062    17.04     0.097         0       0.287
                  Evri                 282      282     167     1.299      0.007        1       0.592
                  Extractiv           1,345     725     415     6.198      0.035      0.539     0.309
                  OpenCalais          1,158    1,158    713     5.337       0.03        1       0.616
                  Zemanta             1,748      97     757     8.055      0.046      0.055     0.433

Table 5: Statistics about the extraction results of all extractors under comparison, for the 217 abstracts of the papers published at the WWW 2011 conference.




                           AlchemyAPI    DBpedia Spotlight     Evri    Extractiv     OpenCalais        Zemanta
            Person              17              -               12         6             6                1
            Organization        20              -               24         -             5                -
            Country              9              -                8        14             7                6
            City                 4              -                3         8             9                -
            Time                 -              -                -         -             -                -
            Number               -              -                -       184             -                -

Table 6: Number of axioms aligned for all the extractors involved in the comparison according to the NERD
ontology for the sources coming from the WWW 2011 conference.
of words is 13, 381, while the word detection rate per doc-        nization and Country. OpenCalais shows meaningful results
ument r(w, d) is equal to 175.40 and the total number of           to recognize the class Person and especially a strong abil-
recognized entities is 12, 266 with the r(e, d) equal to 56.53.    ity to classify NEs with the label Organization. Extractiv
Table 5 shows the statistics of the computation results for        holds the best score for classifying Country and it is the only
all extractors. DBpedia Spotlight keeps a high rate of NEs         extractor able to seek the classes Time and Number.
extracted but shows some weaknesses to disambiguate NEs
with LOD resources. r(u, e) is equal to 0.2871, lesser than        6.   RELATED WORK
is performance in the previous experiment (see section 5.1).
                                                                      The Named Entity (NE) recognition and disambiguation
OpenCalais, instead, has the best r(u, e) and it has a con-
                                                                   problem has been addressed in different research fields such
siderable ability to classify NEs. Evri performed in a similar
                                                                   as NLP, Web mining and Semantic Web communities. All of
way as shown by the r(c, e). The last part of this experi-
                                                                   them agree on the definition of a Named Entity, which was
ment consists in aligning all the classification types retrieved
                                                                   coined by Grishman et al. as an information unit described
by these extractors using the NERD ontology, aligning 6
                                                                   by the name of a person or an organization, a location,
main concepts: Person, Organization, Country, City, Time
                                                                   a brand, a product, a numeric expression including time,
and Number. Table 6 shows the comparison results. Alche-
                                                                   date, money and percent found in a sentence [9]. One of the
myAPI still preserves the best result to classify named enti-
                                                                   first research papers in the NLP field, aiming at automat-
ties as Person. Instead, differently to what happened in the
                                                                   ically identifying named entities in texts, was proposed by
previous experiment, Evri outperforms AlchemyAPI while
                                                                   Rau [16]. This work relies on heuristics and definition of pat-
classifying named entities as Organization. It is important
                                                                   terns to recognize company names in texts. The training set
to note that Evri shows an high number of NEs classified
                                                                   is defined by the set of heuristics chosen. This work evolved
using the class Person in this scenario, but does not explore
                                                                   and was improved later on by Sekine et al. [20]. A different
deeply the Person inference (as shown in the user generated
                                                                   approach was introduced when Supervised Learning (SL)
content experiment). OpenCalais has the best performance
                                                                   techniques were used. The big disruptive change was the
to classify NEs according to the City class, while Extrac-
                                                                   use of a large dataset manually labeled. In the SL field,
tiv shows reliability in recognizing Country and, especially, in classifying Number.

5.3    News Articles
   For this experiment, we collected 1000 news articles of The New York Times from 09/10/2011 to 12/10/2011 and we performed the extraction with the tools involved in this comparison. The goal is to explore the NE extraction ratio on this dataset and to assess commonalities and differences with the previous experiments. The total number of words is 620,567, corresponding to a word detection rate per document r(w, d) of 620.57, while the total number of recognized entities corresponds to an entity detection rate per document r(e, d) of 164.17. Table 7 shows the statistics of the computation results for all extractors.
   Extractiv is the tool which provides the highest number of NEs. This score is considerably higher than the one obtained by the same extractor in the other test scenarios (see section 5.1 and section 5.2), and it does not depend on the number of words per document, as reported by r(e, w). In contrast, DBpedia Spotlight shows an r(e, w) which is strongly affected by the number of words: indeed, its r(e, w) of 0.048 is lower than the score obtained in the previous experiment. Although the highest number of URIs detected is provided by OpenCalais, the URI detection rate per entity is greater for Zemanta, with a score equal to 0.577. AlchemyAPI, Evri, and OpenCalais confirm their reliability in classifying NEs: their classification rate r(c, e) is noticeably higher than that of all the others. Finally, we propose the alignment of the 6 main types recognized by all extractors using the NERD ontology. Table 8 shows the comparison results. Differently from what has been observed previously, DBpedia Spotlight recognizes a few classes, although this number is not comparable with those obtained by the other extractors. Zemanta and DBpedia Spotlight increase their classification performance with respect to the experiments detailed in the two previous test cases, although the number of recognized Person instances remains roughly one order of magnitude lower. AlchemyAPI preserves a strong ability to recognize Person, and also shows great performance in recognizing City and significant scores for Organization.
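To make the reported measures concrete, the short sketch below recomputes the rates of Table 7 from the raw counts of one extractor. The definitions are assumed from the notation used throughout the experiments, r(x, y) = (number of x) / (number of y); the counts are taken from the Zemanta row of Table 7, and the variable names are ours.

    # Recomputing the per-document and per-entity rates reported in Table 7.
    # Definitions assumed: r(x, y) = (number of x) / (number of y).
    n_docs = 1000        # New York Times articles in the dataset
    n_words = 620_567    # total number of words in the dataset
    ne, nc, nu = 9_474, 4_621, 5_467   # entities, classified entities, URIs (Zemanta)

    r_w_d = n_words / n_docs   # word detection rate per document   -> 620.57
    r_e_d = ne / n_docs        # entity detection rate per document -> 9.474
    r_e_w = ne / n_words       # entity rate per word               -> 0.015
    r_c_e = nc / ne            # classification rate per entity     -> 0.488
    r_u_e = nu / ne            # URI detection rate per entity      -> 0.577

    print(f"r(w,d)={r_w_d:.2f} r(e,d)={r_e_d:.3f} r(e,w)={r_e_w:.3f} "
          f"r(c,e)={r_c_e:.3f} r(u,e)={r_u_e:.3f}")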
                         ne       nc       nu     r(e,d)   r(e,w)   r(c,e)   r(u,e)
    AlchemyAPI         17,433   17,433    3,833   17.44    0.028    1        0.22
    DBpedia Spotlight  30,333       20    8,702   30.33    0.048    0.001    0.287
    Evri               16,944   16,944    8,594   16.94    0.027    1        0.507
    Extractiv          47,455   41,393    8,177   47.45    0.076    0.872    0.172
    OpenCalais         23,625   23,625   12,525   23.62    0.038    1        0.53
    Zemanta             9,474    4,621    5,467    9.474   0.015    0.488    0.577

Table 7: Statistics of the extraction results of all extractors used in the comparison, over the 1000 news articles published by The New York Times from 09/10/2011 to 12/10/2011.

                   AlchemyAPI   DBpedia Spotlight    Evri    Extractiv   OpenCalais   Zemanta
    Person            6,246            14           2,698      5,648        5,615      1,069
    Organization      2,479             -             900         81        2,538        180
    Country           1,727             2           1,382      2,676        1,707        720
    City              2,133             -             845      2,046        1,863          -
    Time                  -             -               -        123            1          -
    Number                -             -               -      3,940            -          -

Table 8: Number of axioms aligned for all the extractors involved in the comparison, according to the NERD ontology, for the sources collected from The New York Times from 09/10/2011 to 12/10/2011.


In Supervised Learning (SL) approaches, a human being usually trains positive and negative examples so that the algorithm computes classification patterns. SL techniques exploit Hidden Markov Models (HMM) [4], Decision Trees [19], Maximum Entropy Models [5], Support Vector Machines (SVM) [2] and Conditional Random Fields (CRF) [13]. The common goal of these approaches is to recognize relevant key-phrases and to classify them according to a fixed taxonomy. The main challenge with SL approaches is the unavailability of such labeled resources and the prohibitive cost of creating new examples. Semi-Supervised Learning (SSL) and Unsupervised Learning (UL) approaches attempt to solve this problem either by providing a small initial set of labeled data to train and seed the system [11], or by recasting the extraction problem as a clustering one. For instance, a user can try to gather named entities from clustered groups based on the similarity of their contexts. Other unsupervised methods may rely on lexical resources (e.g. WordNet), lexical patterns and statistics computed on a large annotated corpus [1].
   The NER task is strongly dependent on the knowledge base used to train the NE extraction algorithm. Leveraging DBpedia, Freebase and YAGO, recent methods coming from the Semantic Web community have been introduced to map entities to relational facts exploiting these fine-grained ontologies. In addition to detecting a NE and its type, efforts have been spent on developing methods for disambiguating an information unit with a URI. Disambiguation is one of the key challenges in this scenario, and its foundation stands on the fact that terms taken in isolation are naturally ambiguous. Hence, a text containing the term London may refer to the city of London in the UK or to the city of London in Minnesota, USA, depending on the surrounding context. Similarly, people, organizations and companies can have multiple names and nicknames. These methods generally try to find, in the surrounding text, some clues for contextualizing the ambiguous term and refining its intended meaning. Therefore, a NE extraction workflow consists in analyzing some input content, detecting named entities, assigning them a type weighted by a confidence score, and providing a list of URIs for disambiguation (see the sketch below). Initially, the Web mining community harnessed Wikipedia as the linking hub to which entities were mapped [12, 10]. A natural evolution of this approach, mainly driven by the Semantic Web community, consists in disambiguating named entities with data from the LOD cloud. In [14], the authors proposed an approach to avoid named entity ambiguity using the DBpedia dataset.
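The outcome of such a workflow can be represented compactly. The following sketch is ours and purely illustrative (the field names do not belong to any particular extractor); it gathers the three pieces of information the surveyed tools typically return, using the ambiguous mention of London discussed above:

    from dataclasses import dataclass, field

    # Illustrative container for the output of a generic NE extraction
    # workflow: a detected mention, a type weighted by a confidence score,
    # and a list of candidate URIs for disambiguation. Field names are ours.
    @dataclass
    class NamedEntity:
        mention: str                 # surface form as it appears in the text
        begin: int                   # character offsets of the mention
        end: int
        type: str                    # type from the extractor's taxonomy
        confidence: float            # confidence attached to the typing
        candidate_uris: list = field(default_factory=list)

    # Invented values, mirroring the London example above.
    entity = NamedEntity(
        mention="London", begin=23, end=29,
        type="City", confidence=0.92,
        candidate_uris=["http://dbpedia.org/resource/London",
                        "http://dbpedia.org/resource/London,_Minnesota"])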
   Interlinking text resources with the Linked Open Data cloud has become an important research question, and it has been addressed by numerous services which have opened their knowledge to online computation. Although these services expose comparable outputs, they have their own strengths and weaknesses but, to the best of our knowledge, few research efforts have been spent on comparing them. The creators of the DBpedia Spotlight service have compared their service with a number of other NER extractors (OpenCalais, Zemanta, Ontos Semantic API (http://www.ontos.com), The Wiki Machine (http://thewikimachine.fbk.eu/), AlchemyAPI and M&W's wikifier [15]) in an annotation task scenario. The experiment consisted in evaluating 35 paragraphs from 10 news articles in 8 categories selected from The New York Times, and it has been performed by 4 human raters. The final goal was to create wiki links and to provide a disambiguation benchmark (partially re-used in this work). The experiment showed how DBpedia Spotlight outperforms the other services under evaluation, but its performance is strongly affected by the configuration parameters. The authors underlined the importance of performing several set-up experiments in order to figure out the best configuration for the specific disambiguation task. Moreover, they did not take into account the precision of the extracted NE and type.
   We have ourselves proposed a first qualitative comparison attempt, highlighting the precision score for each extracted field from 10 news articles coming from 2 different sources, The New York Times and BBC (http://www.bbc.com), and 5 different categories: business, health, science, sport, world [18]. Due to the length of the news articles, we faced a very low Fleiss's kappa agreement score: the many output records to evaluate affected the human raters' ability to select the correct answer. In this paper, we advance these initial experiments by providing a fully generic framework powered by an ontology, and we present a large scale quantitative experiment focusing on the extraction performance for different types of text: user-generated content, scientific text, and news articles.


7.   CONCLUSION AND FUTURE WORK
   In this paper, we presented NERD, a web framework which unifies 10 named entity extractors and lifts the output results to the Linked Data Cloud following the NIF specification. To motivate NERD, we presented a quantitative comparison of 6 extractors in particular tasks, scenarios and settings. Our goal was to assess the performance variations according to different kinds of text (news articles, scientific papers, user generated content) and different text lengths. Results showed that some extractors are affected by the word cardinality and by the type of text, especially for scientific papers. DBpedia Spotlight and OpenCalais are not affected by the word cardinality, and Extractiv is the best solution to classify NEs according to "scientific" concepts such as Time and Number.
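As an illustration of the lifting step mentioned above, the sketch below builds a NIF-style annotation for one recognized entity using rdflib. It is a minimal, illustrative example under stated assumptions: we use the current NIF Core and ITS RDF namespaces and the NERD ontology namespace for readability, and the document URI, offsets and entity are invented; the exact vocabulary of the NIF version used by NERD may differ.

    from rdflib import Graph, Literal, Namespace, RDF, URIRef
    from rdflib.namespace import XSD

    # Illustrative lifting of one extraction result to RDF, following the NIF
    # idea of offset-based URIs. The document URI, offsets and entity below
    # are invented for the example.
    NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
    ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")
    NERD = Namespace("http://nerd.eurecom.fr/ontology#")

    g = Graph()
    doc = URIRef("http://example.org/doc/42#char=0,120")       # the whole text
    mention = URIRef("http://example.org/doc/42#char=23,29")   # the substring "London"

    g.add((mention, RDF.type, NIF.String))
    g.add((mention, NIF.referenceContext, doc))
    g.add((mention, NIF.beginIndex, Literal(23, datatype=XSD.nonNegativeInteger)))
    g.add((mention, NIF.endIndex, Literal(29, datatype=XSD.nonNegativeInteger)))
    g.add((mention, NIF.anchorOf, Literal("London")))
    g.add((mention, RDF.type, NERD.City))                      # type from the NERD ontology
    g.add((mention, ITSRDF.taIdentRef, URIRef("http://dbpedia.org/resource/London")))

    print(g.serialize(format="turtle"))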
   This work has evidenced the need to follow up with such systematic comparisons between NE extractor tools, especially using a large gold-standard dataset. We believe that the NERD framework we have proposed is a suitable tool to perform such evaluations. In this work, the human evaluation has been conducted by asking all participants to rate the output results of these extractors. An important step forward would be to investigate the creation of an already labeled dataset of triples (NE, type, URI) and then to assess how well these extractors adhere to it. Future work will include a thorough comparison with the ESTER2 and CONLL-2003 datasets (datasets well-known in the NLP community), studying how they may fit the need of comparing those extractor tools and, more importantly, how to combine them. In terms of manual evaluation, a Boolean decision is not enough for judging all tools. For example, a named entity type might not be wrong but still not precise enough (Obama is not only a person, he is also known as the American President). Another improvement of the system is to allow the input of additional items or the correction of misunderstood or ambiguous items. Finally, we plan to implement a "smart" extractor service, which takes into account the extraction evaluations coming from all raters to assess new evaluation tasks. The idea is to study the role of the relevance field in order to create a set of NEs not discovered by one tool but which may be found by other tools.
Acknowledgments
This work was supported by the French National Agency under contracts 09.2.93.0966, "Collaborative Annotation for Video Accessibility" (ACAV), and ANR.11.EITS.006.01, "Open Innovation Platform for Semantic Media" (OpenSEM), and by the European Union's 7th Framework Programme via the projects LOD2 (GA 257943) and LinkedTV (GA 287911). The authors would like to thank Pablo Mendes for his fruitful support and suggestions and Ruben Verborgh for the NERD OpenID implementation.

8.   REFERENCES
 [1] E. Alfonseca and S. Manandhar. An Unsupervised Method for General Named Entity Recognition And Automated Concept Discovery. In 1st International Conference on General WordNet, 2002.
 [2] M. Asahara and Y. Matsumoto. Japanese Named Entity extraction with redundant morphological analysis. In International Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL'03), pages 8–15, Edmonton, Canada, 2003.
 [3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In 6th International Semantic Web Conference (ISWC'07), pages 722–735, Busan, South Korea, 2007.
 [4] D. Bikel, S. Miller, R. Schwartz, and R. Weischedel. Nymble: a high-performance learning name-finder. In 5th International Conference on Applied Natural Language Processing, pages 194–201, Washington, USA, 1997.
 [5] A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman. NYU: Description of the MENE Named Entity System as Used in MUC-7. In 7th Message Understanding Conference (MUC-7), 1998.
 [6] R. Cyganiak and A. Jentzsch. Linking Open Data cloud diagram. LOD Community (http://lod-cloud.net/), 2011.
 [7] R. T. Fielding and R. N. Taylor. Principled design of the modern web architecture. ACM Transactions on Internet Technology, 2:115–150, May 2002.
 [8] O. Galibert, S. Rosset, C. Grouin, P. Zweigenbaum, and L. Quintard. Structured and extended named entity evaluation in automatic speech transcriptions. In 5th International Joint Conference on Natural Language Processing, pages 518–526, Chiang Mai, Thailand, November 2011.
 [9] R. Grishman and B. Sundheim. Message Understanding Conference-6: a brief history. In 16th International Conference on Computational Linguistics (COLING'96), pages 466–471, Copenhagen, Denmark, 1996.
[10] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust Disambiguation of Named Entities in Text. In Conference on Empirical Methods in Natural Language Processing, pages 782–792, 2011.
[11] H. Ji and R. Grishman. Data selection in semi-supervised learning for name tagging. In Workshop on Information Extraction Beyond The Document, pages 48–55, Sydney, Australia, 2006.
[12] S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of Wikipedia entities in Web text. In 15th ACM International Conference on Knowledge Discovery and Data Mining (KDD'09), pages 457–466, Paris, France, 2009.
[13] A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In 7th International Conference on Natural Language Learning at HLT-NAACL (CONLL'03), pages 188–191, Edmonton, Canada, 2003.
[14] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding Light on the Web of Documents. In 7th International Conference on Semantic Systems (I-Semantics), 2011.
[15] D. Milne and I. H. Witten. Learning to link with Wikipedia. In 17th ACM International Conference on Information and Knowledge Management (CIKM'08), pages 509–518, Napa Valley, California, USA, 2008.
[16] L. Rau. Extracting company names from text. In 7th IEEE Conference on Artificial Intelligence Applications, pages 29–32, 1991.
[17] G. Rizzo and R. Troncy. NERD: A Framework for Evaluating Named Entity Recognition Tools in the Web of Data. In 10th International Semantic Web Conference (ISWC'11), Demo Session, pages 1–4, Bonn, Germany, 2011.
[18] G. Rizzo and R. Troncy. NERD: Evaluating Named Entity Recognition Tools in the Web of Data. In Workshop on Web Scale Knowledge Extraction (WEKEX'11), pages 1–16, Bonn, Germany, 2011.
[19] S. Sekine. NYU: Description of the Japanese NE system used for MET-2. In 7th Message Understanding Conference (MUC-7), 1998.
[20] S. Sekine and C. Nobata. Definition, Dictionaries and Tagger for Extended Named Entity Hierarchy. In 4th International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, 2004.
[21] F. Suchanek, G. Kasneci, and G. Weikum. Yago: a Core of Semantic Knowledge. In 16th International Conference on World Wide Web (WWW'07), pages 697–706, Banff, Alberta, Canada, 2007.