
Named Entity Recognition using FOX

René Speck

Axel-Cyrille Ngonga Ngomo

ngongag@informatik.uni-leipzig.de

AKSW, Department of Computer Science, University of Leipzig, Germany

Unstructured data still makes up an important portion of the Web. One key task in transforming this unstructured data into structured data is named entity recognition. We present FOX, the Federated knOwledge eXtraction framework, a highly accurate open-source framework that implements RESTful web services for named entity recognition. Our framework achieves a higher F-measure than state-of-the-art named entity recognition frameworks by combining the results of several approaches through ensemble learning. Moreover, it disambiguates and links named entities against DBpedia by relying on the AGDISTIS framework. As a result, FOX provides users with accurately disambiguated and linked named entities in several RDF serialization formats. We demonstrate the different interfaces implemented by FOX in use cases that involve extracting entities from news texts.

Introduction

The goal of the demonstration will be to show the whole FOX workflow, from the gathering and preprocessing of input data to the generation of RDF data. In addition, we will show how to configure and train FOX after it has been enhanced with a novel NER tool or ensemble learning (EL) algorithm. Further, we will present FOX’s RESTful feedback service for improving the training and test datasets. In the demonstration, we also go over the Python² and Java bindings³, which make it easy to use FOX’s RESTful service within an application. At the end, we will explain how to use the FOX Java interfaces to integrate future algorithms.

1 FOX online demo: http://fox-demo.aksw.org

FOX project page: http://fox.aksw.org

Source code, evaluation data and evaluation results: http://github.com/AKSW/FOX
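To illustrate how an application might talk to a RESTful NER service such as the one mentioned above, the following minimal Python sketch builds a JSON POST request without sending it. The endpoint path and the JSON field names (`input`, `type`, `output`) are assumptions chosen for illustration and are not taken from the FOX documentation; the project page should be consulted for the actual interface.

```python
import json
import urllib.request

# Assumption: the path "/api" and all JSON field names below are illustrative
# placeholders, not the documented FOX interface.
FOX_ENDPOINT = "http://fox-demo.aksw.org/api"


def build_fox_request(text, input_type="text", output_format="Turtle"):
    """Build (but do not send) a JSON POST request for an NER web service."""
    payload = {
        "input": text,            # plain text, text with HTML tags, or a URL
        "type": input_type,       # hypothetical values: "text" or "url"
        "output": output_format,  # name of the desired RDF serialization
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        FOX_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_fox_request("Leipzig is a city in Germany.")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send the request; the sketch stops short of that so that it runs without network access.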

Workflow

The workflow underlying FOX consists of four main steps: (1) preprocessing the unstructured input data, (2) recognizing the named entities (NEs), (3) linking the NEs to resources using AGDISTIS [9] and (4) converting the results to an RDF serialization format.

Preprocessing. FOX allows users to provide a URL, text with HTML tags or plain text as input data (see the top left part of Figure 1). The input can be entered in a form (see the center of Figure 1) or submitted via FOX’s web service. In case of a URL, FOX sends a request to the given URL to retrieve the input data. Then, for all input formats, FOX removes HTML tags and detects sentences and tokens.

We will use text examples, URLs and text with HTML tags to show how FOX gathers or cleans them for the sake of entity recognition.
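The preprocessing step described above (removing HTML tags, then detecting sentences and tokens) can be approximated with the Python standard library. This is only a rough sketch under simplifying assumptions: the function names are hypothetical, and the regular-expression sentence splitter and tokenizer are naive stand-ins, not the components FOX actually uses.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and drop all tags."""

    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Keep text from adjacent block elements from running together.
        if tag in self.BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the plain text of `markup` with whitespace runs collapsed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Naive tokenizer: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


html_input = "<p>FOX was developed in <b>Leipzig</b>. It links entities to DBpedia.</p>"
for sentence in split_sentences(strip_html(html_input)):
    print(tokenize(sentence))
```

Plain text passes through `strip_html` unchanged apart from whitespace normalization, so the same pipeline can be applied to all input formats.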

2 https://pypi.python.org/pypi/foxpy

3 https://github.com/renespeck/fox-java

Entity Recognition. Our approach relies on four state-of-the-art NER tools so far: (1) the Stanford Named Entity Recognizer (Stanford) [2], (2) the Illinois Named Entity Tagger (Illinois) [6], (3) the Ottawa Baseline Information Extraction (Balie) [4] and (4) the Apache OpenNLP Name Finder (OpenNLP) [1]. FOX also allows users to run a single integrated NER tool on its own (see bottom right of Figure 1). To this end, the FOX light option has to be set to the absolute path to the class of the tool to use. If FOX light is off, FOX runs all four NER tools in parallel and stores the NEs they return for further processing. It maps the entity types of each NER tool to the classes Location, Organization and Person. Finally, the results of all tools are merged by FOX’s ensemble learning layer, as discussed in [7]. We will show the named entities recognized by FOX and contrast them with those recognized by the individual tools. Moreover, we will show the runtime log that FOX generates to illustrate FOX’s scalability.

Entity Linking. FOX makes use of AGDISTIS [9], an open-source named entity disambiguation framework able to link entities against every linked data knowledge base, to disambiguate entities and to link them against DBpedia. In contrast to lookup-based approaches, our framework can also detect resources that are not in DBpedia. In this case, these resources are assigned their own URIs. Moreover, FOX provides a Java interface and a configuration file for easy integration of other entity linking tools. We will show the messages that FOX generates and sends to AGDISTIS as well as the answers it receives and serializes.
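The recognition step described above (running several taggers in parallel, mapping their labels to Location, Organization and Person, and merging the results) can be sketched as follows. Everything in this sketch is an assumption made for illustration: the taggers are dictionary-based stubs rather than the real tools, the label names in `TYPE_MAP` are invented, and a simple majority vote stands in for the learned ensemble classifier discussed in [7].

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Assumption: illustrative tool-specific labels, not the tools' real tag sets.
TYPE_MAP = {
    "LOCATION": "Location", "LOC": "Location", "GPE": "Location",
    "ORGANIZATION": "Organization", "ORG": "Organization",
    "PERSON": "Person", "PER": "Person",
}


def make_stub_tagger(lexicon):
    """Return a stand-in tagger that labels every lexicon entry found in the text."""
    def tag(text):
        return [(name, label) for name, label in lexicon.items() if name in text]
    return tag


# Hypothetical stand-ins for the integrated NER tools.
TOOLS = {
    "tool_a": make_stub_tagger({"Leipzig": "LOCATION", "AKSW": "ORGANIZATION"}),
    "tool_b": make_stub_tagger({"Leipzig": "LOC", "AKSW": "PER"}),
    "tool_c": make_stub_tagger({"Leipzig": "GPE", "AKSW": "ORG"}),
}


def normalize(annotations):
    """Map tool-specific labels to the three shared classes, dropping others."""
    return [(s, TYPE_MAP[label]) for s, label in annotations if label in TYPE_MAP]


def recognize(text, tools=TOOLS):
    """Run all taggers in parallel and return their normalized annotations."""
    with ThreadPoolExecutor(max_workers=len(tools)) as pool:
        futures = {name: pool.submit(tagger, text) for name, tagger in tools.items()}
        return {name: normalize(f.result()) for name, f in futures.items()}


def majority_vote(results):
    """Merge per-tool annotations by picking the most frequent class per entity."""
    votes = {}
    for annotations in results.values():
        for surface, entity_type in annotations:
            votes.setdefault(surface, Counter())[entity_type] += 1
    return {s: counts.most_common(1)[0][0] for s, counts in votes.items()}


merged = majority_vote(recognize("AKSW is a research group in Leipzig."))
print(merged)
```

In this toy input two of the three stub taggers agree that AKSW is an organization, so the vote overrides the single dissenting label. A trained ensemble can weight tools by their observed reliability instead of counting each vote equally.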

Serialization Formats. FOX is designed to support a large number of use cases. To this end, our framework can serialize its results into the following formats: JSON-LD⁴, N-Triples⁵, RDF/JSON⁶, RDF/XML⁷, Turtle⁸, TriG⁹ and N-Quads¹⁰. FOX allows the user to choose between these formats (see bottom left part of Figure 1). We will show what the output of FOX looks like in the different formats and point out how it can be parsed.
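As a small illustration of how such output can be parsed, the sketch below reads a few N-Triples lines with a regular expression. The sample triples and all `example.org` IRIs are hypothetical and do not reproduce FOX’s actual vocabulary. The parser is deliberately minimal: it accepts only IRIs and plain string literals, so an RDF library is the better choice for real data with language tags, datatypes or blank nodes.

```python
import re

# Matches: <subject> <predicate> (<object IRI> | "plain literal") .
TRIPLE = re.compile(
    r'^\s*<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"((?:[^"\\]|\\.)*)")\s*\.\s*$'
)

# Assumption: hypothetical sample output, not copied from FOX.
SAMPLE = '''\
<http://example.org/entity/1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Location> .
<http://example.org/entity/1> <http://example.org/surfaceForm> "Leipzig" .
<http://example.org/entity/1> <http://example.org/linkedTo> <http://dbpedia.org/resource/Leipzig> .
'''


def parse_ntriples(data):
    """Parse simple N-Triples lines into (subject, predicate, object) tuples."""
    triples = []
    for line in data.splitlines():
        # Skip blank lines and comment lines.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError(f"unsupported N-Triples line: {line!r}")
        subject, predicate, iri_object, literal_object = match.groups()
        obj = iri_object if iri_object is not None else literal_object
        triples.append((subject, predicate, obj))
    return triples


for subject, predicate, obj in parse_ntriples(SAMPLE):
    print(subject, predicate, obj)
```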

Evaluation and Results

We performed a thorough evaluation of FOX by using five different datasets and comparing it with state-of-the-art NER frameworks (see Table 1). Our evaluation shows that FOX clearly outperforms the state of the art. The details of the complete evaluation are presented in [7]. The evaluation code and datasets are also available at FOX’s GitHub page, i.e., http://github.com/AKSW/FOX.

4 http://www.w3.org/TR/json-ld

5 http://www.w3.org/TR/n-triples/

6 http://www.w3.org/TR/rdf-json

7 http://www.w3.org/TR/REC-rdf-syntax

8 http://www.w3.org/TR/turtle

9 http://www.w3.org/TR/trig

10 http://www.w3.org/TR/n-quads

Conclusion

We will present FOX, an NER framework which relies on ensemble learning, and demonstrate how it can be used. In future work, we will increase the number of tools integrated in FOX. Moreover, we will extend the tasks supported by the framework. In particular, we aim to integrate tagging, keyword extraction and relation extraction in the near future.

References

1. Baldridge. The opennlp project, 2005.

2. Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In ACL, pages 363-370, 2005.

3. Ali Khalili, Sören Auer, and Axel-Cyrille Ngonga Ngomo. context - lightweight text analytics using linked data. In 11th Extended Semantic Web Conference (ESWC2014), 2014.

4. David Nadeau. Balie-baseline information extraction: Multilingual information extraction from text with machine learning and natural language techniques. Technical report, University of Ottawa, 2005.

5. Axel-Cyrille Ngonga Ngomo, Norman Heino, Klaus Lyko, René Speck, and Martin Kaltenböck. SCMS - Semantifying Content Management Systems. In Proceedings of the International Semantic Web Conference, 2011.

6. Lev Ratinov and Dan Roth. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, CoNLL '09, pages 147-155, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

7. René Speck and Axel-Cyrille Ngonga Ngomo. Ensemble learning for named entity recognition. In Proceedings of the International Semantic Web Conference, Lecture Notes in Computer Science, 2014.

8. Ricardo Usbeck. Combining linked data and statistical information retrieval. In 11th Extended Semantic Web Conference, PhD Symposium. Springer, 2014.

9. Ricardo Usbeck, Axel-Cyrille Ngonga Ngomo, Sören Auer, Daniel Gerber, and Andreas Both. Agdistis - agnostic disambiguation of named entities using linked open data. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.