=Paper=
{{Paper
|id=None
|storemode=property
|title=NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud
|pdfUrl=https://ceur-ws.org/Vol-937/ldow2012-paper-02.pdf
|volume=Vol-937
|dblpUrl=https://dblp.org/rec/conf/www/RizzoTHB12
}}
==NERD meets NIF: Lifting NLP Extraction Results to the Linked Data Cloud==
Giuseppe Rizzo (EURECOM, France / Politecnico di Torino, Italy) giuseppe.rizzo@eurecom.fr
Raphaël Troncy (EURECOM, France) raphael.troncy@eurecom.fr
Sebastian Hellmann, Martin Bruemmer (Universität Leipzig, Germany) hellmann@informatik.uni-leipzig.de

ABSTRACT

We have often heard that data is the new oil. In particular, extracting information from semi-structured textual documents on the Web is key to realizing the Linked Data vision. Several attempts have been made to extract knowledge from textual documents: extracting named entities, classifying them according to pre-defined taxonomies and disambiguating them through URIs identifying real world entities. As a step towards interconnecting the Web of documents via those entities, different extractors have been proposed. Although they share the same main purpose (extracting named entities), they differ in numerous aspects such as their underlying dictionary or their ability to disambiguate entities. We have developed NERD, an API and a front-end user interface powered by an ontology to unify various named entity extractors. The unified result output is serialized in RDF according to the NIF specification and published back on the Linked Data cloud. We evaluated NERD with a dataset composed of five TED talk transcripts, a dataset composed of 1000 New York Times articles and a dataset composed of the 217 abstracts of the papers published at WWW 2011.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing - Language parsing and understanding

General Terms: Measurement, Performance, Experimentation, Web

Keywords: Named Entity extractors, Information extraction, Linked Data, Evaluation

1. INTRODUCTION

The Web of Data is often illustrated as a fast growing cloud of interconnected datasets representing information about nearly everything [6]. The Web also hosts millions of semi-structured texts such as scientific papers, news articles, forum and archived mailing list threads, and (micro-)blog posts. This information usually has a rich semantic structure which is clear to the author but remains mostly hidden to computing machinery. Named entity and information extractors aim to bring out such a structure from those free texts. They provide algorithms for extracting semantic units identifying the names of people, organizations, locations, time references, quantities, etc., and for classifying them according to predefined schemas, increasing discoverability (e.g. through faceted search), reusability and the utility of information.

Since the 90's, an increasing emphasis has been given to the evaluation of NLP techniques. Hence, the Named Entity Recognition (NER) task has been developed as an essential component of the Information Extraction field. Initially, these techniques focused on identifying atomic information units in a text, the named entities, which are then classified into predefined categories (also called context types) by classification techniques and linked to real world objects using web identifiers; the latter task is called Named Entity Disambiguation. Knowledge bases affect the disambiguation task in several ways, because they provide the final disambiguation point to which the information is linked. Recent methods leverage knowledge bases such as DBpedia [3], Freebase (http://www.freebase.com/) or YAGO [21], since they contain many entries corresponding to real world entities, classified according to exhaustive classification schemes. A number of tools have been developed to extract structured information from text resources, classifying it according to pre-defined taxonomies and disambiguating it using URIs.
In this work, we aim to evaluate tools which provide such an online computation: AlchemyAPI (http://www.alchemyapi.com), DBpedia Spotlight (http://dbpedia.org/spotlight), Evri (http://www.evri.com/developer/index.html), Extractiv (http://extractiv.com), Lupedia (http://lupedia.ontotext.com), OpenCalais (http://www.opencalais.com), Saplo (http://saplo.com), Wikimeta (http://www.wikimeta.com), Yahoo! Content Analysis (YCA, http://developer.yahoo.com/search/content/V2/contentAnalysis.html) and Zemanta (http://www.zemanta.com). They represent a clear opportunity for the Linked Data community to increase the volume of interconnected data. Although these tools share the same purpose – extracting semantic units from text – they make use of different algorithms and training data. They generally provide a similar output composed of a set of extracted named entities, their type and potentially a URI disambiguating each named entity. The output, however, varies in terms of the data model used by each extractor. Hence, we propose the Named Entity Recognition and Disambiguation (NERD) framework, which unifies the output of these extractors and lifts it to the Linked Data Cloud using the new NIF specification.

These services have their own strengths and shortcomings but, to the best of our knowledge, few scientific evaluations have been conducted to understand the conditions under which a tool is the most appropriate one. This paper attempts to fill this gap. We have performed quantitative evaluations on three datasets covering different types of textual documents: a dataset composed of transcripts of five TED (http://www.ted.com) talks, a dataset composed of 1000 news articles from The New York Times (http://www.nytimes.com) and a dataset composed of the 217 abstracts of the papers published at WWW 2011 (http://www.www2011india.com). We present statistics that underline the behavior of these extractors in different scenarios and group their results according to the NERD ontology. We have developed the NERD framework, available at http://nerd.eurecom.fr, to perform systematic evaluations of NE extractors.

The remainder of this paper is organized as follows. In section 2, we introduce a factual comparison of the named entity extractors investigated in this work. We describe the NERD framework in section 3 and we highlight the importance of having an output compliant with the Linked Data principles in section 4. Then, we describe the experimental results we obtained in section 5 and, in section 6, we propose an overview of named entity recognition and disambiguation techniques. Finally, we give our conclusions and outline future work in section 7.
2. FACTUAL COMPARISON OF NAMED ENTITY EXTRACTORS

The NE recognition and disambiguation tools vary in terms of response granularity and technology used. By granularity, we mean the way the extraction algorithm works: One Entity per Name (OEN), where the algorithm tokenizes the document into a list of exclusive sentences, recognizing the dot as a terminator character, and detects named entities within each sentence; and One Entity per Document (OED), where the algorithm considers the bag of words from the entire document, detects named entities, and removes duplicates yielding the same output record (NE, type, URI). The result sets of the two approaches therefore differ.

Table 1 provides an extensive comparison that takes into account the technology used: the algorithms used to extract NEs, the supported languages, the ontology used to classify the NEs, the dataset used for looking up real world entities, and the technical issues related to the online computation, such as the maximum content request size and the response format. We also report whether a tool provides the position where an NE is found in the text. We distinguish four cases:

- char offset: considering the text as a sequence of characters, the tool reports the char index where the NE starts and the length (number of chars) of the NE;
- range of chars: considering the text as a sequence of characters, the tool reports the start index and the end index where the NE appears;
- word offset: the text is tokenized considering any punctuation, and the tool reports the word number after which the NE is located (this counting does not take punctuation into account);
- POS offset: the text is tokenized considering any punctuation, and the tool reports the number of parts-of-speech after which the NE is located.

We performed an experimental evaluation to estimate the maximum content chunk supported by each API, creating a simple application that initially sends a text of 1KB to each extractor. If the answer was correct (HTTP status 20x), we performed one more test, increasing the content chunk by 1KB. We iterated this operation until we received the answer "text too long"; a sketch of this probing procedure is given below. Table 1 summarizes the factual comparison of the services involved in this study.
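The probing loop can be written in a few lines (a minimal sketch: the endpoint URL and the apikey/text parameter names are placeholders, since each of the ten services exposes its own request signature and error reporting):

  import requests

  def probe_max_chunk(endpoint, api_key, step=1024, max_steps=10000):
      """Grow the submitted text by 1KB per round until the extractor
      stops answering with HTTP 20x (e.g. replies "text too long")."""
      last_accepted = 0
      chunk = "a" * step  # 1KB of filler text
      for i in range(1, max_steps + 1):
          # "apikey" and "text" are illustrative parameter names only
          resp = requests.post(endpoint, data={"apikey": api_key, "text": chunk * i})
          if resp.status_code // 100 != 2:  # anything other than 20x: limit hit
              break
          last_accepted = step * i
      return last_accepted  # estimated maximum content chunk, in bytes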
In Table 1, * means the value has been estimated experimentally (as for the content chunk), + means a list of other sources, generally identifiable as any source available on the Web, and N/A means not available.

Table 1: Factual information about the 10 extractors under investigation.

- AlchemyAPI: granularity OED; language support: English, French, German, Italian, Portuguese, Russian, Spanish, Swedish; restriction on academic use: 30,000 calls/day; sample clients: C/C++, C#, Java, Perl, PHP, Python, Ruby; API interface: CLI, JAX-RS, SOAP; content chunk: 150KB*; response format: JSON, Microformats, RDF, XML; entity types: 324; entity position: N/A; classification ontology: Alchemy; dereferenceable vocabularies: DBpedia, Freebase, US Census, GeoNames, UMBEL, OpenCyc, YAGO, MusicBrainz, CIA Factbook, CrunchBase.
- DBpedia Spotlight: granularity OEN; language support: English, with partial support for German, Portuguese and Spanish; restriction: unlimited; sample clients: Java, Javascript, PHP; API interface: AJAX, CLI, JAX-RS; content chunk: 452KB*; response format: HTML+uF(rel-tag), JSON, XHTML+RDFa, XML; entity types: 320; entity position: char offset; classification ontologies: DBpedia, FreeBase (partial), Schema.org; dereferenceable vocabularies: DBpedia.
- Evri: granularity OED; language support: English; restriction: 3,000 calls/day; sample clients: Action Script, Java, Javascript; API interface: AJAX, JAX-RS; content chunk: 8KB*; response format: GPB, JSON, XML; entity types: 300*; entity position: N/A; classification ontology: Evri; dereferenceable vocabularies: Evri.
- Extractiv: granularity OEN; language support: English; restriction: 1,000 calls/day; sample clients: Java, Javascript; API interface: AJAX, CLI, JAX-RS; content chunk: 32KB*; response format: HTML, JSON, RDF, XML; entity types: 34; entity position: word offset; classification ontology: DBpedia; dereferenceable vocabularies: DBpedia, others+.
- Lupedia: granularity OEN; language support: English, French, Italian; restriction: unlimited; sample clients: N/A; API interface: CLI, JAX-RS; content chunk: 20KB*; response format: HTML, JSON, RDFa; entity types: 319; entity position: range of chars; classification ontology: DBpedia; dereferenceable vocabularies: DBpedia, LinkedMDB.
- OpenCalais: granularity OED; language support: English, French, Spanish; restriction: 50,000 calls/day; sample clients: Java, Perl, PHP, Python; API interface: AJAX, CLI, JAX-RS, SOAP; content chunk: 8KB*; response format: JSON, Microformats, N3, RDF, Simple Format; entity types: 95; entity position: char offset; classification ontology: OpenCalais; dereferenceable vocabularies: OpenCalais, GeoNames, Wikicompanies, others+.
- Saplo: granularity OED; language support: English, Swedish; restriction: 1,333 calls/day; sample clients: Java, PHP, Python; API interface: AJAX, CLI, JAX-RS; content chunk: 26KB*; response format: JSON; entity types: 5; entity position: N/A; classification ontology: Saplo; dereferenceable vocabularies: N/A.
- Wikimeta: granularity OEN; language support: English, French, Spanish; restriction: unlimited; sample clients: Java, PHP-5; API interface: CLI, JAX-RS; content chunk: 80KB*; response format: JSON, XML; entity types: 7; entity position: POS offset; classification ontology: ESTER; dereferenceable vocabularies: DBpedia.
- Yahoo! Content Analysis (YCA): granularity OEN; language support: English; restriction: 5,000 calls/day; sample clients: Javascript, PHP; API interface: CLI, JAX-RS; content chunk: 7769KB*; response format: JSON, XML; entity types: 13; entity position: range of chars; classification ontology: Yahoo; dereferenceable vocabularies: Wikipedia.
- Zemanta: granularity OED; language support: English; restriction: 10,000 calls/day; sample clients: C#, Java, Javascript, Perl, PHP, Python, Ruby; API interface: AJAX, CLI, JAX-RS, SOAP; content chunk: 970KB*; response format: JSON, WNJSON, RDF, XML; entity types: 81; entity position: N/A; classification ontology: FreeBase; dereferenceable vocabularies: Wikipedia, IMDB, MusicBrainz, Amazon, YouTube, TechCrunch, Twitter, MyBlogLog, Facebook, others+.
3. THE NERD FRAMEWORK

NERD is a web framework plugged on top of various NER extractors. Its architecture follows the REST principles [7] and includes an HTML front-end for humans and an API for computers to exchange content in JSON (another serialization of the NERD output will be detailed in section 4). Both interfaces are powered by the NERD REST engine.

3.1 The NERD Data Model

We propose the following data model to encapsulate the common properties for representing NERD extraction results. It is composed of a list of entities for which a label, a type and a URI are provided, together with the corresponding type in the NERD taxonomy, the position of the named entity, and the confidence and relevance scores as they are provided by the NER tools. The example below shows this data model (for the sake of brevity, we use the JSON syntax):

  "entities": [{
      "entity": "Tim Berners-Lee",
      "type": "Person",
      "uri": "http://dbpedia.org/resource/Tim_Berners-Lee",
      "nerdType": "http://nerd.eurecom.fr/ontology#Person",
      "startChar": 30,
      "endChar": 45,
      "confidence": 1,
      "relevance": 0.5
  }]
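As an illustration of how a client might consume this model, the following fragment collects the (NE, type, URI, nerdType) records that the evaluations in this paper reason about (a sketch only; the envelope around the entities array may differ in the actual API responses):

  import json

  def nerd_tuples(payload):
      """Return (NE, type, URI, nerdType) tuples from a NERD JSON result."""
      return [(e["entity"], e.get("type"), e.get("uri"), e.get("nerdType"))
              for e in json.loads(payload).get("entities", [])]

  sample = """{"entities": [{"entity": "Tim Berners-Lee", "type": "Person",
    "uri": "http://dbpedia.org/resource/Tim_Berners-Lee",
    "nerdType": "http://nerd.eurecom.fr/ontology#Person",
    "startChar": 30, "endChar": 45, "confidence": 1, "relevance": 0.5}]}"""
  print(nerd_tuples(sample))
  # [('Tim Berners-Lee', 'Person', 'http://dbpedia.org/resource/Tim_Berners-Lee',
  #   'http://nerd.eurecom.fr/ontology#Person')]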
3.2 The NERD REST API

The REST engine runs on the Jersey (http://jersey.java.net) and Grizzly (http://grizzly.java.net) technologies. Their extensible frameworks enable the development of several components; NERD is composed of 7 modules, namely authentication, scraping, extraction, ontology mapping, store, statistics and web. The authentication module takes as input the FOAF profile of a user and links the evaluations with the user who performs them (we are finalizing an OpenID implementation which will soon replace this simple authentication system). The scraping module takes as input the URI of an article and extracts all of its raw text. Extraction is the module designed to invoke the external service APIs and collect the results. Each service provides its own taxonomy of named entity types it can recognize; we therefore designed the NERD ontology, which provides a set of mappings between these various classifications. The ontology mapping module is in charge of mapping the classification types returned by the extractors to our ontology. The store module saves all evaluations according to the schema model we defined in the NERD database. The statistics module enables the extraction of data patterns from the user interactions stored in the database and the computation of statistical scores such as the Fleiss Kappa score and the precision measure. Finally, the web module manages the client requests and the web cache, and generates HTML pages.

Plugged on top of this engine, there is an API interface (described at http://nerd.eurecom.fr/api/application.wadl). It is developed following the REST principles and has been implemented to enable programmatic access to the NERD framework. It follows this URI scheme (the base URI is http://nerd.eurecom.fr/api):

/document : GET, POST, PUT methods enable fetching, submitting or modifying a document parsed by the NERD framework;

/user : GET, POST methods enable inserting a new user into the NERD framework and fetching account details;

/annotation/{extractor} : POST method drives the annotation of a document. The parametric URI allows piloting the extractors supported by NERD;

/extraction : GET method allows fetching the output described in section 3.1;

/evaluation : GET method allows retrieving a statistical interpretation of the extractor behaviors.
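Putting the scheme together, an annotation round trip might look like the following sketch. The request and response field names (uri, idDocument) are hypothetical, since the WADL description referenced above is the authoritative contract:

  import requests

  BASE = "http://nerd.eurecom.fr/api"

  # Submit a web document to be scraped and parsed (POST /document).
  doc = requests.post(BASE + "/document",
                      data={"uri": "http://example.org/article.html"})
  doc_id = doc.json()["idDocument"]  # hypothetical response field

  # Drive the annotation with one supported extractor (POST /annotation/{extractor}).
  requests.post(BASE + "/annotation/dbpediaspotlight", data={"idDocument": doc_id})

  # Fetch the unified output of section 3.1 (GET /extraction).
  result = requests.get(BASE + "/extraction", params={"idDocument": doc_id}).json()
  for e in result.get("entities", []):
      print(e["entity"], e["nerdType"], e["uri"])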
3.3 The NERD Ontology

Although these tools share the same goal, they use different algorithms and different dictionaries, which makes their comparison hard. We have developed the NERD ontology, a set of mappings established manually between the taxonomies of NE types. Concepts included in the NERD ontology are collected from different schema types: ontologies (for DBpedia Spotlight, Lupedia and Zemanta), lightweight taxonomies (for AlchemyAPI, Evri and Yahoo!) or simple flat type lists (for Extractiv, OpenCalais, Saplo and Wikimeta). The NERD ontology tries to merge the needs of the linguistic community with those of the logician community: we developed a core set of axioms based on the Quaero schema [8] and we mapped similar concepts described in the other schemes. The selection of these concepts has been done considering the greatest common denominator among them. Concepts that do not appear in the NERD namespace are sub-classes of parents that end up in the NERD ontology. This ontology is available at http://nerd.eurecom.fr/ontology. To summarize, a concept is included in the NERD ontology as soon as at least two extractors use it. The NERD ontology thus becomes a reference ontology for comparing the classification task of NE extractors. We show an example mapping among those extractors below: the City type is considered as being equivalent to alchemy:City, dbpedia-owl:City, extractiv:CITY, opencalais:City and evri:City, while being more specific than wikimeta:LOC and zemanta:location.

  nerd:City a rdfs:Class ;
      rdfs:subClassOf wikimeta:LOC ;
      rdfs:subClassOf zemanta:location ;
      owl:equivalentClass alchemy:City ;
      owl:equivalentClass dbpedia-owl:City ;
      owl:equivalentClass evri:City ;
      owl:equivalentClass extractiv:CITY ;
      owl:equivalentClass opencalais:City .

3.4 The NERD UI

The user interface (http://nerd.eurecom.fr) is developed in HTML/Javascript. Its goal is to provide a portal where researchers can find information about the NERD project, the NERD ontology and common statistics of the supported extractors. Moreover, it provides a personalized space where a user can create a developer or a simple user account. For the former account type, a developer can navigate through a dashboard, see his profile details, browse personal usage statistics and get programmatic access to the NERD API via a NERD key. The simple user account enables the annotation of any web document via its URI. The raw text is first extracted from the web source and the user can select a particular extractor. After the extraction step, the user can judge the correctness of each field of the tuple (NE, type, URI, relevant). This is an important process which gives NERD human feedback, with the main purpose of evaluating the quality of the extraction results collected by those tools [17]. At the end of the evaluation, the user sends the results, through asynchronous calls, to the REST API engine in order to store them. This set of evaluations is further used to compute statistics, such as precision measures for each tool, with the goal of highlighting strengths and weaknesses and comparing the tools [18]. The comparison aggregates all the evaluations performed and, finally, the user is free to select one or more evaluations to see the metrics that are computed for each service in real time.

4. NIF: AN NLP INTERCHANGE FORMAT

The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations. The NIF specification was released in an initial version 1.0 in November 2011 and describes how interoperability between NLP tools, exposed as NIF web services, can be achieved. Extensive feedback was given on several mailing lists and a community of interest (http://nlp2rdf.org/get-involved) was created to improve the specification. Implementations for 8 different NLP tools (e.g. UIMA, Gate ANNIE and DBpedia Spotlight) exist and a public web demo (http://nlp2rdf.lod2.eu/demo.php) is available.

In the following, we first introduce the core concepts of NIF, which are defined in a String Ontology (STR, http://nlp2rdf.lod2.eu/schema/string). We then explain how NIF is used in NERD; the resulting properties and axioms are included in a Structured Sentence Ontology (SSO, http://nlp2rdf.lod2.eu/schema/sso). While the String Ontology is used to describe the relations between strings (i.e. Unicode characters), the SSO collects properties and classes to connect strings to NLP annotations and NER entities as produced by NERD.

4.1 Core Concepts of NIF

The motivation behind NIF is to allow NLP tools to exchange annotations about documents in RDF. Hence, the main prerequisite is that parts of the documents (i.e. strings) are referenceable by URIs, so that they can be used as subjects in RDF statements. We call an algorithm to create such identifiers a URI Scheme: for a given text t (a sequence of characters) of length |t| (number of characters), we are looking for a URI Scheme to create a URI that can serve as a unique identifier for a substring s of t (i.e. |s| ≤ |t|). Such a substring can (1) consist of adjacent characters only, in which case it is a unique character sequence within the text if we account for parameters such as context and position, or (2) be derived by a function which points to several substrings as defined in (1).

NIF provides two URI schemes which can be used to represent strings as RDF resources.

  Figure 1: NIF URI schemes: offset (top) and context-hashes (bottom) are used to create identifiers for strings.

We focus here on the first scheme, using offsets. In the top part of Figure 1, two triples are given that use the following URI as subject:

  http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729

According to the above definition, the URI points to a substring of a given text t which starts at character index 717 and runs until index 729 (counting all characters). NIF currently mandates that the whole string of the document be included in the RDF output as an rdf:Literal to serve as the reference point, which we call the inside context, formalized using an OWL class called str:Context. The term document would be inappropriate to capture the real intention of this concept, as we would like to refer to an arbitrary grouping of characters forming a unit, which could also be applied to a paragraph or a section and is highly dependent upon the wider context in which the string is actually used, such as a Web document reachable via HTTP.

To appropriately capture the intention of such a class, we distinguish between the notions of outside and inside context of a piece of text. The inside context is easy to explain and formalise, as it is the text itself: it provides a reference context for each substring contained in the text (i.e. the characters before or after the substring). The outside context is more vague and is given by an outside observer, who might arbitrarily interpret the text as a "book chapter" or a "book section". The class str:Context now provides a clear reference point for all other relative URIs used in this context and blocks, by definition, the addition of information from a larger (outside) context. For example, str:Context is disjoint with foaf:Document, since labeling a context resource as a document is information which is not contained within the context (i.e. the text) itself. It is legal, however, to say that the string of the context occurs in (str:occursIn) a foaf:Document. Additionally, str:Context is a subclass of str:String and therefore its instances denote Unicode text as well. The main benefit of limiting the context is that an OWL reasoner can now infer that two contexts are the same if they consist of the same string, because an inverse-functional datatype property (str:isString) is used to attach the actual text to the context resource.

  :offset_0_26546 a str:Context ;
      # the exact retrieval method is left underspecified
      str:occursIn <http://www.w3.org/DesignIssues/LinkedData.html> ;
      # [...] are all 26547 characters as rdf:Literal
      str:isString "[...]" .

  :offset_717_729 a str:String ;
      str:referenceContext :offset_0_26546 .
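Minting an offset-based NIF URI is mechanical once the reference context is fixed: the identifier is the document URI followed by a fragment offset_<begin>_<end> over the character indices, as the scheme above prescribes. A sketch:

  def nif_offset_uri(doc_uri, begin, end):
      """Mint an offset-based NIF URI for the substring text[begin:end]."""
      return "%s#offset_%d_%d" % (doc_uri, begin, end)

  doc = "http://www.w3.org/DesignIssues/LinkedData.html"
  print(nif_offset_uri(doc, 717, 729))
  # http://www.w3.org/DesignIssues/LinkedData.html#offset_717_729
  # i.e. the 12-character substring text[717:729] of the 26,547-character
  # context: the string "Semantic Web" in the version discussed above.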
A complete formalisation is still work in progress, but the idea is explained here. The NIF URIs will be grounded on Unicode characters (especially Unicode Normalization Form C, http://www.unicode.org/reports/tr15/#Norm_Forms, counted in code units, http://unicode.org/faq/char_combmark.html#7). For all resources of type str:String, the universe of discourse will then be the words over the alphabet of Unicode characters, sometimes called Σ*. Perspectively, we hope that this will allow for an unambiguous interpretation of NIF by machines.

Within the framework of RDF and the current usage of NIF for the interchange of output between NLP tools, this definition of the semantics is sufficient to produce a working system. However, problems arise if additional interoperability with Linked Data or fragment identifiers and ad-hoc retrieval of content from the Web is demanded. The actual retrieval method (such as content negotiation) to retrieve and validate the content for #offset_717_729_Semantic%20Web or its reference context is left underspecified, as is the relation of NIF URIs to fragment identifiers for MIME types such as text/plain (see RFC 5147, http://tools.ietf.org/html/rfc5147). As long as such issues remain open, the complete text has to be included as an rdf:Literal.

4.2 Connecting Strings to Entities

For NERD, three relevant concepts have to be expressed in RDF and were included in the Structured Sentence Ontology (SSO): OEN, OED and the NERD ontology types.

One Entity per Name (OEN) can be modeled in a straightforward way, by introducing a property sso:oen which connects the string with an arbitrary entity:

  :offset_717_729 sso:oen dbpedia:Semantic_Web .

One Entity per Document (OED). As document is an outside interpretation of a string, the notion of context in NIF has to be used. The property sso:oec is used to attach entities to a given context. We furthermore add the following DL axiom:

  sso:oec ⊇ str:referenceContext⁻¹ ∘ sso:oen

As the property oen contains more specific information, oec can be inferred by the above role chain inclusion. In case the context is enlarged, any materialized information attached via the oec property needs to be migrated to the larger context resource.

The connection between NERD types and strings is done via a Linked Data URI which disambiguates the entity. Overall, three cases can be distinguished. In case the NER extractor has provided a Linked Data URI to disambiguate the entity, we simply re-use it, as in the following example:

  # this URI points to the string "W3C"
  :offset_23107_23110 rdf:type str:String ;
      str:referenceContext :offset_0_26546 ;
      sso:oen dbpedia:W3C ;
      str:beginIndex "23107" ;
      str:endIndex "23110" .
  dbpedia:W3C rdf:type nerd:Organization .

If, however, the NER extractor provides no disambiguation link at all, or just a non-Linked-Data URI for the entity (typically, the foaf:homepage of an organization such as http://www.w3.org/), we plan to mint a new Linked Data URI for the respective entity, which could then be further linked via sameAs to other identifiers in a data reconciliation process.

5. EVALUATIONS

We performed a quantitative experiment using three different datasets: a dataset composed of transcripts of five TED talks (from different categories of talks), a dataset composed of 1000 news articles from The New York Times (collected from 09/10/2011 to 12/10/2011), and a dataset composed of the 217 abstracts of the papers published at the WWW 2011 conference. The aim of these evaluations is to assess how these extractors perform in different scenarios, such as news articles, user generated content and scientific papers. The total number of documents is 1222, with an average word number per document equal to 549. Each document was evaluated using the 6 extractors supported by the NERD framework at the time this evaluation was conducted (Lupedia, Saplo, Wikimeta and YCA were not yet part of the framework). The final number of entities detected is 177,823 and the average number of unique entities per document is 20.03. Table 2 shows these statistics grouped according to the source of the documents.

  Table 2: Statistics about the three datasets used in the quantitative experiment, grouped according to the source where documents were collected.

       WWW2011   TED      NYTimes
  nd   217       5        1,000
  nw   38,062    13,381   620,567
  rw   175.4     2,676.2  620.567
  ne   12,266    1,441    164,116
  re   56.53     288.2    164.1

We define the following variables: the number nd of evaluated documents, the number nw of words, the total number ne of entities, the total number nc of categories and the total number nu of URIs. Moreover, we compute the following measures: the word detection rate r(w,d), i.e. the number of words per document; the entity detection rate r(e,d), i.e. the number of entities per document; the number of entities per word r(e,w); the number of categories per entity r(c,e) (this measure has been computed removing irrelevant labels such as "null" or "LINKED OTHER"); and the number of URIs per entity r(u,e).
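Written out as formulas, these measures are simple ratios of the variables just defined (a direct restatement, in LaTeX notation):

  \[
  r(w,d) = \frac{n_w}{n_d}, \quad
  r(e,d) = \frac{n_e}{n_d}, \quad
  r(e,w) = \frac{n_e}{n_w}, \quad
  r(c,e) = \frac{n_c}{n_e}, \quad
  r(u,e) = \frac{n_u}{n_e}
  \]

For instance, for the TED dataset in Table 2, r(w,d) = 13,381 / 5 = 2,676.2, the value reported in section 5.1.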
5.1 User Generated Content

In this experiment, we focus on the extractions performed by all tools on the 5 TED talk transcripts. The goal is to find out the NE extraction ratio for user generated content, such as speech transcripts of videos. First, we propose general statistics about the extraction task and then we focus on the classification, showing statistics grouped according to the NERD ontology. DBpedia Spotlight classifies each resource according to three different schemas (see Table 1); for this experiment, we consider only the results which belong to the DBpedia ontology. The total number of documents is 5, with an overall number of words equal to 13,381. The word detection rate per document r(w,d) is equal to 2,676.2, the overall number of entities is 1,441, and r(e,d) is 288.2. Table 3 shows the statistics about the computation results for all extractors. DBpedia Spotlight is the extractor which provides the highest number of NEs and disambiguated URIs. These values show the ability of this extractor to locate NEs and to exploit the large cloud of LOD resources. At the same time, it is crucial to note that it is not able to classify these resources, although it uses a deep classification schema. All the extractors show a high ability for the classification task, except Zemanta, as shown by r(c,e). Conversely, Zemanta shows a strong ability to disambiguate NEs via URI identification, as shown by r(u,e). It is worth noting that OpenCalais and Evri have almost the same performance as Zemanta.

The last part of this experiment consists in aligning all the classification types provided by these extractors on the TED talk transcripts, using the NERD ontology. For the sake of brevity, we report all the grouping results according to 6 main concepts: Person, Organization, Country, City, Time and Number. Table 4 shows the comparison results. AlchemyAPI classifies a higher number of Person, Country and City than all the others. In addition, OpenCalais obtains good performance in classifying all the concepts except Time and Number. It is worth noting that Extractiv is the only extractor able to locate and classify Number and Time. In this grouped view, we consider only the results classified with the 6 main classes and we do not take into account any potentially inferred relationships. This is why the Evri results contrast with what is shown in Table 3: Evri provides a precise classification of Person sub-types such as Journalist, Physicist or Technologist, but it does not also describe the same resource as an instance of the Person class itself.

  Table 3: Statistics about computation results of all extractors used in the comparison, for the sources coming from TED talks.

                     ne    nc    nu    r(e,d)  r(e,w)  r(c,e)  r(u,e)
  AlchemyAPI         141   141   71    28.2    0.01    1       0.504
  DBpedia Spotlight  810   0     624   162     0.06    0       0.77
  Evri               120   120   113   24      0.009   1       0.942
  Extractiv          60    53    22    12      0.004   0.883   0.367
  OpenCalais         163   136   157   32.6    0.012   0.834   0.963
  Zemanta            50    17    49    10      0.003   0.34    0.98

  Table 4: Number of axioms aligned for all the extractors involved in the comparison according to the NERD ontology, for the sources coming from TED talks.

                AlchemyAPI  DBpedia Spotlight  Evri  Extractiv  OpenCalais  Zemanta
  Person        42          -                  10    6          27          4
  Organization  15          -                  -     -          20          1
  Country       16          -                  11    1          16          3
  City          14          -                  3     3          7           -
  Time          -           -                  -     1          -           -
  Number        -           -                  -     5          -           -
5.2 Scientific Documents

In this experiment, we focus on the extraction performed by all tools on the 217 abstracts of the papers published at the WWW 2011 conference, with the aim of seeking NE extraction patterns for scientific contributions. The total number of words is 38,062, the word detection rate per document r(w,d) is equal to 175.40, and the total number of recognized entities is 12,266, with r(e,d) equal to 56.53. Table 5 shows the statistics of the computation results for all extractors. DBpedia Spotlight keeps a high rate of extracted NEs but shows some weaknesses in disambiguating NEs with LOD resources: its r(u,e) is equal to 0.287, lower than its performance in the previous experiment (see section 5.1). OpenCalais, instead, has the best r(u,e) and a considerable ability to classify NEs; Evri performs in a similar way, as shown by r(c,e).

The last part of this experiment consists in aligning all the classification types retrieved by these extractors using the NERD ontology, again over the 6 main concepts Person, Organization, Country, City, Time and Number. Table 6 shows the comparison results. AlchemyAPI still obtains the best result for classifying named entities as Person. Instead, differently from what happened in the previous experiment, Evri outperforms AlchemyAPI in classifying named entities as Organization. It is important to note that Evri classifies a high number of NEs with the class Person in this scenario, but does not explore the Person hierarchy deeply (as it did in the user generated content experiment). OpenCalais has the best performance in classifying NEs according to the City class, while Extractiv shows reliability in recognizing Country and, especially, in classifying Number.

  Table 5: Statistics about extraction results for the 217 abstracts of papers published at the WWW 2011 conference, for all extractors used in the comparison.

                     ne     nc     nu     r(e,d)  r(e,w)  r(c,e)  r(u,e)
  AlchemyAPI         323    171    39     1.488   0.008   0.529   0.121
  DBpedia Spotlight  3,699  0      1,062  17.04   0.097   0       0.287
  Evri               282    282    167    1.299   0.007   1       0.592
  Extractiv          1,345  725    415    6.198   0.035   0.539   0.309
  OpenCalais         1,158  1,158  713    5.337   0.03    1       0.616
  Zemanta            1,748  97     757    8.055   0.046   0.055   0.433

  Table 6: Number of axioms aligned for all the extractors involved in the comparison according to the NERD ontology, for the sources coming from the WWW 2011 conference.

                AlchemyAPI  DBpedia Spotlight  Evri  Extractiv  OpenCalais  Zemanta
  Person        17          -                  12    6          6           1
  Organization  20          -                  24    -          5           -
  Country       9           -                  8     14         7           6
  City          4           -                  3     8          9           -
  Time          -           -                  -     -          -           -
  Number        -           -                  -     184        -           -
5.3 News Articles

For this experiment, we collected 1000 news articles from The New York Times, from 09/10/2011 to 12/10/2011, and we performed the extraction with the tools involved in this comparison. The goal is to explore the NE extraction ratio on this dataset and to assess commonalities and differences with the previous experiments. The total number of words is 620,567, the word detection rate per document r(w,d) is equal to 620.57, and the total number of recognized entities is 164,116, with r(e,d) equal to 164.1. Table 7 shows the statistics of the computation results for all extractors.

Extractiv is the tool which provides the highest number of NEs. This score is considerably greater than what the same extractor achieves in the other test scenarios (see section 5.1 and section 5.2), and it does not depend on the number of words per document, as reported by r(e,w). In contrast, DBpedia Spotlight shows an r(e,w) which is strongly affected by the number of words: indeed, its r(e,w) is 0.048, lower than the same score in the previous experiment. Although the highest number of URIs is provided by OpenCalais, the URI detection rate per entity is greater for Zemanta, with a score equal to 0.577. AlchemyAPI, Evri and OpenCalais confirm their reliability in classifying NEs, and their detection score r(c,e) is sensibly greater than all the others.

Finally, we propose the alignment of the 6 main types recognized by all extractors using the NERD ontology. Table 8 shows the comparison results. Differently from what has been detailed previously, DBpedia Spotlight recognizes a few classes, although this number is not comparable with what the other extractors achieve. Zemanta and DBpedia Spotlight increase their classification performance with respect to the two previous test cases, although their numbers of recognized Person instances remain roughly an order of magnitude lower than those of the other extractors. AlchemyAPI preserves a strong ability to recognize Person, but also shows great performance in recognizing City and significant scores for Organization and Country. OpenCalais shows meaningful results in recognizing the class Person and especially a strong ability to classify NEs with the label Organization. Extractiv holds the best score for classifying Country and it is the only extractor able to seek the classes Time and Number.

  Table 7: Statistics about extraction results for the 1000 news articles published by The New York Times from 09/10/2011 to 12/10/2011, for all extractors used in the comparison.

                     ne      nc      nu      r(e,d)  r(e,w)  r(c,e)  r(u,e)
  AlchemyAPI         17,433  17,443  3,833   17.44   0.028   1       0.22
  DBpedia Spotlight  30,333  20      8,702   30.33   0.048   0.001   0.287
  Evri               16,944  16,944  8,594   16.94   0.027   1       0.507
  Extractiv          47,455  41,393  8,177   47.45   0.076   0.872   0.172
  OpenCalais         23,625  23,625  12,525  23.62   0.038   1       0.53
  Zemanta            9,474   4,621   5,467   9.474   0.015   0.488   0.577

  Table 8: Number of axioms aligned for all the extractors involved in the comparison according to the NERD ontology, for the sources collected from The New York Times from 09/10/2011 to 12/10/2011.

                AlchemyAPI  DBpedia Spotlight  Evri   Extractiv  OpenCalais  Zemanta
  Person        6,246       14                 2,698  5,648      5,615       1,069
  Organization  2,479       -                  900    81         2,538       180
  Country       1,727       2                  1,382  2,676      1,707       720
  City          2,133       -                  845    2,046      1,863       -
  Time          -           -                  -      123        1           -
  Number        -           -                  -      3,940      -           -
6. RELATED WORK

The Named Entity (NE) recognition and disambiguation problem has been addressed in different research fields, such as the NLP, Web mining and Semantic Web communities. All of them agree on the definition of a named entity, coined by Grishman et al. as an information unit described by the name of a person or an organization, a location, a brand, a product, or a numeric expression including time, date, money and percent, found in a sentence [9]. One of the first research papers in the NLP field aiming to automatically identify named entities in texts was proposed by Rau [16]. This work relies on heuristics and the definition of patterns to recognize company names in texts; the training set is defined by the set of heuristics chosen. This work was later evolved and improved by Sekine et al. [20]. A different approach was introduced with Supervised Learning (SL) techniques, where the big disruptive change was the use of large, manually labeled datasets. In the SL field, a human being typically annotates positive and negative examples so that the algorithm can compute classification patterns. SL techniques exploit Hidden Markov Models (HMM) [4], Decision Trees [19], Maximum Entropy Models [5], Support Vector Machines (SVM) [2] and Conditional Random Fields (CRF) [13]. The common goal of these approaches is to recognize relevant key-phrases and to classify them into a fixed taxonomy. The challenges with SL approaches are the unavailability of such labeled resources and the prohibitive cost of creating examples. Semi-Supervised Learning (SSL) and Unsupervised Learning (UL) approaches attempt to solve this problem by either providing a small initial set of labeled data to train and seed the system [11], or by resolving the extraction problem as a clustering one. For instance, a user can try to gather named entities from clustered groups based on the similarity of context. Other unsupervised methods may rely on lexical resources (e.g. WordNet), lexical patterns and statistics computed on a large annotated corpus [1].

The NER task is strongly dependent on the knowledge base used to train the NE extraction algorithm. Leveraging DBpedia, Freebase and YAGO, recent methods coming from the Semantic Web community have been introduced to map entities to relational facts exploiting these fine-grained ontologies. In addition to detecting a NE and its type, efforts have been spent on developing methods for disambiguating information units with a URI. Disambiguation is one of the key challenges in this scenario; its foundation stands on the fact that terms taken in isolation are naturally ambiguous. Hence, a text containing the term London may refer to the city London in the UK or to the city London in Minnesota, USA, depending on the surrounding context. Similarly, people, organizations and companies can have multiple names and nicknames. These methods generally try to find in the surrounding text some clues for contextualizing the ambiguous term and refining its intended meaning. Therefore, a NE extraction workflow consists in analyzing some input content for detecting named entities, assigning them a type weighted by a confidence score, and providing a list of URIs for disambiguation. Initially, the Web mining community harnessed Wikipedia as the linking hub to which entities were mapped [12, 10]. A natural evolution of this approach, mainly driven by the Semantic Web community, consists in disambiguating named entities with data from the LOD cloud. In [14], the authors proposed an approach to avoid named entity ambiguity using the DBpedia dataset.

Interlinking text resources with the Linked Open Data cloud has become an important research question, and it has been addressed by numerous services which have opened their knowledge to online computation. Although these services expose a comparable output, they have their own strengths and weaknesses but, to the best of our knowledge, little research has been devoted to evaluating them. The creators of the DBpedia Spotlight service have compared their service with a number of other NER extractors (OpenCalais, Zemanta, Ontos Semantic API (http://www.ontos.com), The Wiki Machine (http://thewikimachine.fbk.eu/), AlchemyAPI and M&W's wikifier [15]) in an annotation task scenario. The experiment consisted in evaluating 35 paragraphs from 10 news articles in 8 categories selected from The New York Times, and was performed by 4 human raters. The final goal was to create wiki links and to provide a disambiguation benchmark (partially re-used in this work). The experiment showed how DBpedia Spotlight overcomes the performance of the other services under evaluation, but its performance is strongly affected by the configuration parameters. The authors underlined the importance of performing several set-up experiments to figure out the best configuration set for the specific disambiguation task. Moreover, they did not take into account the precision of the NE and its type.

We have ourselves proposed a first qualitative comparison attempt, highlighting the precision score for each extracted field from 10 news articles coming from 2 different sources, The New York Times and BBC (http://www.bbc.com), and 5 different categories: business, health, science, sport and world [18]. Due to the length of the news articles, we faced a very low Fleiss's kappa agreement score: too many output records to evaluate affected the human raters' ability to select the correct answer. In this paper, we advance these initial experiments by providing a fully generic framework powered by an ontology, and we present a large scale quantitative experiment focusing on the extraction performance on different types of text: user-generated content, scientific text and news articles.
7. CONCLUSION AND FUTURE WORK

In this paper, we presented NERD, a web framework which unifies 10 named entity extractors and lifts the output results to the Linked Data Cloud following the NIF specification. To motivate NERD, we presented a quantitative comparison of 6 extractors in particular tasks, scenarios and settings. Our goal was to assess the performance variations according to different kinds of texts (news articles, scientific papers, user generated content) and different text lengths. Results showed that some extractors are affected by the word cardinality and the type of text, especially for scientific papers. DBpedia Spotlight and OpenCalais are not affected by the word cardinality, and Extractiv is the best solution for classifying NEs according to "scientific" concepts such as Time and Number.

This work has evidenced the need to follow up with such systematic comparisons between NE extractor tools, especially using a large golden dataset. We believe that the NERD framework we have proposed is a suitable tool to perform such evaluations. In this work, the human evaluation has been conducted by asking all participants to rate the output results of these extractors. An important step forward would be to investigate the creation of an already labeled dataset of triples (NE, type, URI) and then to assess how these extractors adhere to this dataset. Future work will include a thorough comparison with the ESTER2 and CONLL-2003 datasets (datasets well-known in the NLP community), studying how they may fit the need of comparing those extractor tools and, more importantly, how to combine them. In terms of manual evaluation, a Boolean decision is not enough for judging all tools: for example, a named entity type might not be wrong, but not precise enough (Obama is not only a person, he is also known as the American President). Another improvement of the system is to allow the input of additional items and the correction of misunderstood or ambiguous items. Finally, we plan to implement a "smart" extractor service, which takes into account extraction evaluations coming from all raters to assess new evaluation tasks. The idea is to study the role of the relevance field in order to create a set of NEs not discovered by one tool but which may be found by other tools.

Acknowledgments

This work was supported by the French National Agency under contracts 09.2.93.0966, "Collaborative Annotation for Video Accessibility" (ACAV), and ANR.11.EITS.006.01, "Open Innovation Platform for Semantic Media" (OpenSEM), and by the European Union's 7th Framework Programme via the projects LOD2 (GA 257943) and LinkedTV (GA 287911). The authors would like to thank Pablo Mendes for his fruitful support and suggestions and Ruben Verborgh for the NERD OpenID implementation.
8. REFERENCES

[1] E. Alfonseca and S. Manandhar. An Unsupervised Method for General Named Entity Recognition and Automated Concept Discovery. In 1st International Conference on General WordNet, 2002.
[2] M. Asahara and Y. Matsumoto. Japanese Named Entity extraction with redundant morphological analysis. In International Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL'03), pages 8–15, Edmonton, Canada, 2003.
[3] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A Nucleus for a Web of Open Data. In 6th International Semantic Web Conference (ISWC'07), pages 722–735, Busan, South Korea, 2007.
[4] D. Bikel, S. Miller, R. Schwartz, and R. Weischedel. Nymble: a high-performance learning name-finder. In 5th International Conference on Applied Natural Language Processing, pages 194–201, Washington, USA, 1997.
[5] A. Borthwick, J. Sterling, E. Agichtein, and R. Grishman. NYU: Description of the MENE Named Entity System as Used in MUC-7. In 7th Message Understanding Conference (MUC-7), 1998.
[6] R. Cyganiak and A. Jentzsch. Linking Open Data cloud diagram. LOD Community (http://lod-cloud.net/), 2011.
[7] R. T. Fielding and R. N. Taylor. Principled design of the modern web architecture. ACM Transactions on Internet Technology, 2:115–150, May 2002.
[8] O. Galibert, S. Rosset, C. Grouin, P. Zweigenbaum, and L. Quintard. Structured and extended named entity evaluation in automatic speech transcriptions. In 5th International Joint Conference on Natural Language Processing, pages 518–526, Chiang Mai, Thailand, November 2011.
[9] R. Grishman and B. Sundheim. Message Understanding Conference-6: a brief history. In 16th International Conference on Computational Linguistics (COLING'96), pages 466–471, Copenhagen, Denmark, 1996.
[10] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater, and G. Weikum. Robust Disambiguation of Named Entities in Text. In Conference on Empirical Methods in Natural Language Processing, pages 782–792, 2011.
[11] H. Ji and R. Grishman. Data selection in semi-supervised learning for name tagging. In Workshop on Information Extraction Beyond The Document, pages 48–55, Sydney, Australia, 2006.
[12] S. Kulkarni, A. Singh, G. Ramakrishnan, and S. Chakrabarti. Collective annotation of Wikipedia entities in Web text. In 15th ACM International Conference on Knowledge Discovery and Data Mining (KDD'09), pages 457–466, Paris, France, 2009.
[13] A. McCallum and W. Li. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In 7th International Conference on Natural Language Learning at HLT-NAACL (CONLL'03), pages 188–191, Edmonton, Canada, 2003.
[14] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding Light on the Web of Documents. In 7th International Conference on Semantic Systems (I-Semantics), 2011.
[15] D. Milne and I. H. Witten. Learning to link with Wikipedia. In 17th ACM International Conference on Information and Knowledge Management (CIKM'08), pages 509–518, Napa Valley, California, USA, 2008.
[16] L. Rau. Extracting company names from text. In 7th IEEE Conference on Artificial Intelligence Applications, volume i, pages 29–32, 1991.
[17] G. Rizzo and R. Troncy. NERD: A Framework for Evaluating Named Entity Recognition Tools in the Web of Data. In 10th International Semantic Web Conference (ISWC'11), Demo Session, pages 1–4, Bonn, Germany, 2011.
[18] G. Rizzo and R. Troncy. NERD: Evaluating Named Entity Recognition Tools in the Web of Data. In Workshop on Web Scale Knowledge Extraction (WEKEX'11), pages 1–16, Bonn, Germany, 2011.
[19] S. Sekine. NYU: Description of the Japanese NE system used for MET-2. In 7th Message Understanding Conference (MUC-7), 1998.
[20] S. Sekine and C. Nobata. Definition, Dictionaries and Tagger for Extended Named Entity Hierarchy. In 4th International Conference on Language Resources and Evaluation (LREC'04), Lisbon, Portugal, 2004.
[21] F. Suchanek, G. Kasneci, and G. Weikum. Yago: a Core of Semantic Knowledge. In 16th International Conference on World Wide Web (WWW'07), pages 697–706, Banff, Alberta, Canada, 2007.