
Introducing Distiller: a Lightweight Framework for Knowledge Extraction and Filtering

Dario De Nart

Dante Degl'Innocenti

dante.deglinnocenti@spes.uniud.it

Carlo Tasso

carlo.tassog@uniud.it

Artificial Intelligence Lab, Department of Mathematics and Computer Science, University of Udine, Italy

Semantic content analysis can support a broad range of user modelling applications. Several automatic tools are available; however, such systems usually offer few tuning options and do not support integration with different systems. Personalization applications, on the other hand, are becoming increasingly multilingual and cross-domain. In this paper we present a novel framework for Knowledge Extraction (KE) whose main goal is to support the development of new strategies and technologies and to ease the integration of existing ones.

Introduction

– Knowledge Overload: long texts, such as scientific papers, may include many named entities, but not all of them are equally relevant to the text. State-of-the-art KE systems currently provide Named Entity Recognition, but they neither filter entities by relevance nor include relevance measures.

In this paper we introduce Distiller, a KE framework that aims to overcome these limitations and to allow the integration of heterogeneous KE technologies.

Related Work

Named entity recognition and the automatic generation of semantic data from natural language text have already been investigated, and several knowledge extraction systems already exist [4], such as OpenCalais (footnote 2), Apache Stanbol (footnote 3), and TagMe [3]. In [8] an ensemble learning strategy to raise the accuracy of named entity identification is presented. Several authors have addressed the problem of filtering document information by identifying keyphrases (herein KPs), and a wide range of approaches have been proposed. The authors of [10] identify four types of KP extraction strategies:

– Simple Statistical Approaches: mostly unsupervised techniques that consider word frequency, TF-IDF, or word co-occurrence [7].
– Linguistic Approaches: techniques relying on linguistic knowledge to identify KPs. Proposed methods include lexical analysis, syntactic analysis, and discourse analysis [5].
– Machine Learning Approaches: techniques based on machine learning algorithms such as Naive Bayes classifiers and SVMs. Systems such as KEA [9] belong to this category.
– Other Approaches: strategies that do not fit into any of the above categories, mostly hybrid approaches combining two or more of the above techniques [1]. Among others, heuristic approaches based on knowledge-based criteria [6] have been proposed.
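As an illustration of the Simple Statistical family listed above, the following is a minimal TF-IDF sketch. It is not taken from any of the cited systems: the class name `TfIdfSketch`, the tokenized-list input, and the guard for terms missing from the corpus are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy TF-IDF scorer, not part of Distiller or of any cited system.
final class TfIdfSketch {

    // Returns the TF-IDF weight of every distinct term in `doc`.
    // tf  = occurrences in doc / doc length
    // idf = ln(corpus size / number of corpus documents containing the term)
    static Map<String, Double> score(List<String> doc, List<List<String>> corpus) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : doc) {
            counts.merge(term, 1, Integer::sum);
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            long docsWithTerm = corpus.stream()
                    .filter(d -> d.contains(e.getKey()))
                    .count();
            double tf = e.getValue() / (double) doc.size();
            // Assumption: terms absent from the corpus are treated as occurring in one document.
            double idf = Math.log(corpus.size() / (double) Math.max(1, docsWithTerm));
            scores.put(e.getKey(), tf * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("keyphrase", "extraction", "from", "text");
        List<String> d2 = Arrays.asList("entity", "linking", "from", "text");
        System.out.println(score(d1, Arrays.asList(d1, d2)));
    }
}
```

Terms that occur in every document of the corpus (here "from" and "text") receive a weight of zero, while terms specific to one document receive a positive weight, which is why such scores are a common unsupervised starting point for ranking candidate KPs.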

System Overview

In order to overcome the shortcomings of state-of-the-art KE systems, we extended the approach presented in [2] and formalized it as a framework named Distiller, whose main aim is to support research and prototyping activities by providing an environment for building testbed systems and integrating existing systems. The guiding principle of the framework design is that several different types of knowledge are involved in the KE process and should be clearly separated in order to design systems able to cope with multilinguality and multi-domain issues. We consider four main types of knowledge: Statistical, Linguistic, External (i.e. coming from outside the text, such as knowledge extracted from ontologies), and Heuristic knowledge. Linguistic knowledge is language dependent, Heuristic knowledge is domain dependent, and External knowledge is both domain and language dependent. At a more practical level, this principle implies that different types of knowledge must reside in distinct modules; for instance, statistical and linguistic analysis must be handled by different modules.

2 http://www.opencalais.com/
3 https://stanbol.apache.org/

Distiller is organized as a series of modules, each oriented to a single type of knowledge, and its workflow is organized in four phases: Concept Unit Splitting, Annotation, Candidate Generation, and Filtering, as shown in Figure 1. In the first phase the text is split into Concept Units, i.e. logical blocks such as chapters, paragraphs, or sentences. The framework allows concept units in different languages to coexist inside a document. The Annotation phase consists of enriching the text with information such as POS tags, stems, lemmas, or links to entities from external knowledge sources (such as DBpedia). This phase introduces new knowledge into the text, and several different annotators can contribute, enriching the text with different kinds of knowledge, but mostly with External knowledge that may come from heterogeneous sources. Existing KE tools, such as TagMe, can be integrated in the framework as annotators. The Candidate Generation phase identifies all the candidate entities and/or concepts of interest in the text by exploiting the annotations provided in the previous step, and internally represents them as KPs with an attached set of annotations. Finally, the Filtering phase computes a relevance score for each candidate concept; depending on this score, the candidate is either returned as output or hidden. The Filtering phase, like the Candidate Generation phase, may exploit different types of knowledge embedded in the annotations and combine them according to the needs of the applications that will eventually use the extracted knowledge.
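The four-phase workflow described above can be sketched as follows. This is a hypothetical illustration, not the Distiller API: all type and method names (`PipelineSketch`, `Splitter`, `Annotator`, `Generator`, `Filter`, `run`, `demo`) are assumptions, candidates are simplified to plain strings rather than KPs carrying annotations, and the toy modules (a stopword tagger and a word-count filter) merely stand in for real linguistic and heuristic knowledge.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the names below are illustrative assumptions,
// not the actual Distiller API.
final class PipelineSketch {

    // A token of a concept unit plus the annotations attached to it.
    static final class Token {
        final String text;
        final Map<String, String> annotations = new HashMap<>();
        Token(String text) { this.text = text; }
    }

    interface Splitter { List<String> split(String document); }       // phase 1
    interface Annotator { void annotate(List<Token> unit); }          // phase 2
    interface Generator { List<String> generate(List<Token> unit); }  // phase 3
    interface Filter { double relevance(String candidate); }          // phase 4

    static List<String> run(String document, Splitter splitter, List<Annotator> annotators,
                            Generator generator, Filter filter, double threshold) {
        List<String> output = new ArrayList<>();
        for (String unit : splitter.split(document)) {
            List<Token> tokens = new ArrayList<>();
            for (String word : unit.trim().toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) tokens.add(new Token(word));
            }
            // Each annotator contributes one kind of knowledge to the unit.
            for (Annotator a : annotators) a.annotate(tokens);
            // Candidates below the relevance threshold are hidden.
            for (String candidate : generator.generate(tokens)) {
                if (filter.relevance(candidate) >= threshold) output.add(candidate);
            }
        }
        return output;
    }

    // Wires toy modules together: sentences as concept units, a stopword
    // annotator, runs of non-stopwords as candidates, word count as relevance.
    static List<String> demo(String document) {
        Set<String> stopwords = new HashSet<>(Arrays.asList("from", "the", "is"));
        Splitter sentences = doc -> Arrays.asList(doc.split("[.!?]+"));
        Annotator stopwordTagger = unit -> {
            for (Token t : unit) {
                t.annotations.put("stopword", String.valueOf(stopwords.contains(t.text)));
            }
        };
        Generator contentWordRuns = unit -> {
            List<String> candidates = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            for (Token t : unit) {
                if ("true".equals(t.annotations.get("stopword"))) {
                    if (current.length() > 0) {
                        candidates.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    if (current.length() > 0) current.append(' ');
                    current.append(t.text);
                }
            }
            if (current.length() > 0) candidates.add(current.toString());
            return candidates;
        };
        Filter wordCount = candidate -> candidate.split(" ").length;
        return run(document, sentences, Arrays.asList(stopwordTagger),
                   contentWordRuns, wordCount, 2.0);
    }

    public static void main(String[] args) {
        // prints [distiller extracts relevant keyphrases]
        System.out.println(demo(
            "Distiller extracts relevant keyphrases from the text. The framework is modular."));
    }
}
```

Because every phase sits behind its own interface, replacing the stopword tagger with, say, a wrapper around an external entity linker would not require touching the candidate generator or the filter, which mirrors the separation of knowledge types argued for above.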

Distiller is implemented in Java using the Dependency Injection pattern, which allows users to easily switch between different modules and configurations. Default implementations of all the modules described above are provided with the framework (footnote 4).

4 A sample application built with the default modules is showcased at ailab.uniud.it:8080/distiller
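To make the role of Dependency Injection concrete, here is a minimal constructor-injection sketch in plain Java. It assumes no particular DI container, and every name in it (`InjectionSketch`, `Stemmer`, `NaiveEnglishStemmer`, `IdentityStemmer`, `StemAnnotator`, `defaultConfiguration`) is hypothetical rather than taken from the Distiller code base.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of constructor-based dependency injection; these
// names are illustrative and are not taken from the Distiller code base.
final class InjectionSketch {

    interface Stemmer { String stem(String word); }

    // Toy language-dependent module: strips a trailing "s".
    static final class NaiveEnglishStemmer implements Stemmer {
        @Override public String stem(String word) {
            return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
        }
    }

    // Toy fallback module: leaves the word untouched.
    static final class IdentityStemmer implements Stemmer {
        @Override public String stem(String word) { return word; }
    }

    // The annotator never instantiates its dependencies: they are handed in
    // from outside, so modules can be swapped without editing this class.
    static final class StemAnnotator {
        private final Map<String, Stemmer> byLanguage;
        private final Stemmer fallback;

        StemAnnotator(Map<String, Stemmer> byLanguage, Stemmer fallback) {
            this.byLanguage = byLanguage;
            this.fallback = fallback;
        }

        // Picks the stemmer at runtime according to the language of the text.
        String annotate(String language, String word) {
            return byLanguage.getOrDefault(language, fallback).stem(word);
        }
    }

    // The "configuration": the only place where concrete classes are chosen.
    static StemAnnotator defaultConfiguration() {
        Map<String, Stemmer> stemmers = new HashMap<>();
        stemmers.put("en", new NaiveEnglishStemmer());
        return new StemAnnotator(stemmers, new IdentityStemmer());
    }

    public static void main(String[] args) {
        StemAnnotator annotator = defaultConfiguration();
        System.out.println(annotator.annotate("en", "keyphrases")); // prints keyphrase
        System.out.println(annotator.annotate("it", "frasi"));      // prints frasi
    }
}
```

Switching to a different configuration then amounts to writing another factory method (or container configuration) that passes different implementations to the same constructor.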

Conclusions

With respect to the three KE issues presented in Section 1, Distiller supports the development of applications that can overcome them. Multilinguality is eased by the possibility of specifying a wide array of annotators and dynamically linking them at runtime depending on the language of the text. Knowledge Source Completeness is eased by the possibility of integrating heterogeneous knowledge sources as different annotators and of implementing annotators that generate URIs on the fly. Finally, Knowledge Overload is eased by the presence of a filtering phase in which entities are evaluated with respect to their relevance in the text.

References

1. De Nart, D., Tasso, C.: A domain independent double layered approach to keyphrase generation. In: WEBIST 2014 - Proceedings of the 10th International Conference on Web Information Systems and Technologies. pp. 305–312. SCITEPRESS Science and Technology Publications (2014)

2. Degl'Innocenti, D., De Nart, D., Tasso, C.: A new multi-lingual knowledge-base approach to keyphrase extraction for the italian language. In: Proceedings of the 6th International Conference on Knowledge Discovery and Information Retrieval. pp. 78–85. SciTePress (2014)

3. Ferragina, P., Scaiella, U.: Tagme: On-the-fly annotation of short text fragments (by wikipedia entities). In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. pp. 1625–1628. CIKM '10, ACM, New York, NY, USA (2010)

4. Gangemi, A.: A comparison of knowledge extraction tools for the semantic web. In: The Semantic Web: Semantics and Big Data, pp. 351–366. Springer (2013)

5. Krapivin, M., Marchese, M., Yadrantsau, A., Liang, Y.: Unsupervised key-phrases extraction from scientific papers using domain and linguistic knowledge. In: Digital Information Management, 2008. ICDIM 2008. Third International Conference on. pp. 105–112 (Nov 2008)

6. Liu, Z., Li, P., Zheng, Y., Sun, M.: Clustering to find exemplar terms for keyphrase extraction. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. pp. 257–266. EMNLP '09 (2009)

7. Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools 13(01), 157–169 (2004)

8. Speck, R., Ngonga Ngomo, A.C.: Ensemble learning for named entity recognition. In: The Semantic Web – ISWC 2014, Lecture Notes in Computer Science, vol. 8796, pp. 519–534. Springer International Publishing (2014)

9. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: Kea: Practical automatic keyphrase extraction. In: Proceedings of the fourth ACM conference on Digital libraries. pp. 254–255. ACM (1999)

10. Zhang, C.: Automatic keyword extraction from documents using conditional random fields. Journal of Computational Information Systems 4(3), 1169–1180 (2008), http://eprints.rclis.org/handle/10760/12305