Introducing Distiller: a Lightweight Framework for Knowledge Extraction and Filtering

Dario De Nart, Dante Degl'Innocenti, Carlo Tasso
Artificial Intelligence Lab, Department of Mathematics and Computer Science, University of Udine, Italy
{dario.denart,carlo.tasso}@uniud.it, dante.deglinnocenti@spes.uniud.it

Abstract. Semantic content analysis is an activity that can greatly support a broad range of user modelling applications. Several automatic tools are available; however, such systems usually provide few tuning possibilities and do not support integration with different systems. Personalization applications, on the other hand, are becoming increasingly multi-lingual and cross-domain. In this paper we present a novel framework for Knowledge Extraction, whose main goal is to support the development of new strategies and technologies and to ease the integration of the existing ones.

1 Introduction

Adaptive personalization systems can greatly benefit from automatic Knowledge Extraction (herein KE) from natural language text: for instance, machine-readable data about the textual content of visited Web pages or of user-generated content can enhance user profiling activities. Due to the current size of the Web, however, one cannot expect human experts to annotate such data manually. Several tools have been developed over the past years to address this issue. However, we can identify three critical issues in state-of-the-art Knowledge Extraction systems:

- Multilinguality: roughly half of the available Web pages include non-English text¹. The large majority of Web users are non-English native speakers, and multilingual personalization is a hot research topic; however, KE tools show a general lack of multilingual non-English support.
- Knowledge Source Completeness: KE systems mostly rely on a specific knowledge source (such as DBpedia), acting in a closed-world fashion and assuming that such knowledge source is complete.
This assumption is in contrast with the open-world assumption of Semantic Web technologies and shows its limitations when applied to texts such as scientific papers, where new concepts are often introduced. Therefore, a more flexible approach, open to more than one external knowledge source, seems more appropriate.
- Knowledge Overload: long texts, such as scientific papers, may include many named entities, but not all of them are equally relevant within the text. State-of-the-art KE systems currently provide Named Entity Recognition, but do not filter relevant entities, nor do they include relevance measures.

¹ http://w3techs.com/technologies/overview/content_language/all

In this paper we introduce Distiller, a KE framework whose aim is to overcome these limitations and to allow the integration of heterogeneous KE technologies.

2 Related Work

Named entity recognition and automatic semantic data generation from natural language text have already been investigated, and several knowledge extraction systems already exist [4], such as OpenCalais², Apache Stanbol³, and TagMe [3]. In [8] an ensemble learning strategy to raise the accuracy of the named entity identification process is presented. Several authors in the literature have addressed the problem of filtering document information by identifying keyphrases (herein KPs), and a wide range of approaches have been proposed. The authors of [10] identify four types of KP extraction strategies:

- Simple Statistical Approaches: mostly unsupervised techniques, considering word frequency, TF-IDF, or word co-occurrence [7].
- Linguistic Approaches: techniques relying on linguistic knowledge to identify KPs. Proposed methods include lexical analysis, syntactic analysis, and discourse analysis [5].
- Machine Learning Approaches: techniques based on machine learning algorithms such as Naive Bayes classifiers and SVMs. Systems such as KEA [9] belong to this category.
- Other Approaches: other strategies exist which do not fit into any of the above categories, mostly hybrid approaches combining two or more of the above techniques [1]. Among others, heuristic approaches based on knowledge-based criteria [6] have been proposed.

3 System Overview

In order to overcome the shortcomings of state-of-the-art KE systems, we extended the approach presented in [2] and formalized it as a framework named Distiller, whose main aim is to support research and prototyping activities by providing an environment for building testbed systems and integrating existing systems. The guiding principle of the framework design is that several different types of knowledge are involved in the process of KE and should be clearly separated in order to design systems able to cope with multilinguality and multi-domain issues. We consider four main types of knowledge: Statistical, Linguistic, External (i.e. coming from outside the text, such as knowledge extracted from ontologies), and Heuristic knowledge. Linguistic knowledge is language dependent, Heuristic knowledge is domain dependent, and External knowledge is both domain and language dependent. At a more practical level, this principle implies that different types of knowledge must reside in distinct modules: for instance, statistical and linguistic analysis must be handled by different modules.

² http://www.opencalais.com/
³ https://stanbol.apache.org/

Distiller is organized in a series of single-knowledge-oriented modules, and its workflow is organized in four phases: Concept Unit Splitting, Annotation, Candidate Generation, and Filtering, as shown in Figure 1. In the first phase the text is split into Concept Units, i.e. logical blocks such as chapters, paragraphs, or sentences. The framework allows the co-existence of concept units of different languages inside a document.
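The four-phase workflow can be sketched in Java as a pipeline of injected single-knowledge modules. All interface and class names below (Splitter, Annotator, Generator, Scorer, Pipeline) are illustrative assumptions, not the actual Distiller API, and the toy implementations stand in for real annotators and filters:

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical module interfaces, one per phase of the workflow.
interface Splitter  { List<String> split(String text); }               // Concept Unit Splitting
interface Annotator { String annotate(String unit); }                  // Annotation
interface Generator { List<String> candidates(String unit); }          // Candidate Generation
interface Scorer    { Map<String, Double> score(List<String> cands); } // Filtering

// Modules are injected through the constructor, mirroring the Dependency
// Injection idea: language- or domain-specific implementations can be
// swapped without touching the pipeline itself.
class Pipeline {
    private final Splitter splitter;
    private final Annotator annotator;
    private final Generator generator;
    private final Scorer scorer;

    Pipeline(Splitter s, Annotator a, Generator g, Scorer sc) {
        splitter = s; annotator = a; generator = g; scorer = sc;
    }

    // Runs the four phases and keeps only candidates above the threshold.
    List<String> distill(String text, double threshold) {
        List<String> candidates = splitter.split(text).stream()
                .map(annotator::annotate)
                .flatMap(u -> generator.candidates(u).stream())
                .collect(Collectors.toList());
        return scorer.score(candidates).entrySet().stream()
                .filter(e -> e.getValue() > threshold)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}

public class DistillerSketch {
    public static List<String> run(String text) {
        // Toy stand-ins: sentence splitting, lower-casing as "annotation",
        // single words as candidates, raw frequency as relevance score.
        Splitter bySentence = t -> Arrays.asList(t.split("\\.\\s*"));
        Annotator lowerCase = String::toLowerCase;
        Generator byWord = u -> Arrays.asList(u.split("\\W+"));
        Scorer byFrequency = cs -> cs.stream()
                .collect(Collectors.groupingBy(c -> c,
                        Collectors.summingDouble(c -> 1.0)));
        return new Pipeline(bySentence, lowerCase, byWord, byFrequency)
                .distill(text, 1.0);
    }

    public static void main(String[] args) {
        System.out.println(run(
            "Knowledge extraction helps personalization. Knowledge filtering helps too."));
    }
}
```

Because each phase sits behind its own interface, a multilingual deployment could, for instance, inject an Italian POS-tagging Annotator or a DBpedia-linking Annotator in place of the toy lower-caser, leaving the rest of the pipeline unchanged.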
The Annotation phase consists in enriching the text with information such as POS tags, stems, lemmas, or links to entities from external knowledge sources (such as DBpedia). This phase introduces new knowledge into the text, and several different annotators can contribute, enriching the text with different kinds of knowledge, but mostly with External knowledge that may come from heterogeneous sources. Existing KE tools, such as TagMe, can be integrated into the framework as annotators. The Candidate Generation phase identifies in the text all the candidate entities and/or concepts of interest, exploiting the annotations provided in the previous step, and internally represents them as KPs with an attached set of annotations. Finally, the Filtering phase evaluates a relevance score for each candidate concept, depending on which it is either returned as output or hidden. The Filtering phase, like the Candidate Generation one, may exploit different types of knowledge embedded in annotations, and combine them according to the needs of the applications that will eventually use the extracted knowledge.

Fig. 1: Framework workflow.

Distiller is implemented in Java using the Dependency Injection pattern, which allows users to easily switch between different modules and configurations. Default implementations for all the modules described above are provided with the framework⁴.

⁴ A sample application built with the default modules is showcased at ailab.uniud.it:8080/distiller

4 Conclusions

With respect to the three issues of KE presented in Section 1, Distiller allows the development of applications able to overcome such shortcomings. The issue of multilinguality is eased by the possibility of specifying a wide array of annotators and of dynamically linking them at runtime depending on the text language.
The issue of Knowledge Source Completeness is eased by the possibility of integrating heterogeneous knowledge sources as different annotators and of implementing annotators that generate URIs on the fly. The issue of Knowledge Overload, finally, is eased by the presence of a filtering phase in which entities are evaluated with respect to their relevance in the text.

References

1. De Nart, D., Tasso, C.: A domain independent double layered approach to keyphrase generation. In: WEBIST 2014 - Proceedings of the 10th International Conference on Web Information Systems and Technologies. pp. 305–312. SCITEPRESS Science and Technology Publications (2014)
2. Degl'Innocenti, D., De Nart, D., Tasso, C.: A new multi-lingual knowledge-base approach to keyphrase extraction for the Italian language. In: Proceedings of the 6th International Conference on Knowledge Discovery and Information Retrieval. pp. 78–85. SciTePress (2014)
3. Ferragina, P., Scaiella, U.: TagMe: On-the-fly annotation of short text fragments (by Wikipedia entities). In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. pp. 1625–1628. CIKM '10, ACM, New York, NY, USA (2010)
4. Gangemi, A.: A comparison of knowledge extraction tools for the semantic web. In: The Semantic Web: Semantics and Big Data, pp. 351–366. Springer (2013)
5. Krapivin, M., Marchese, M., Yadrantsau, A., Liang, Y.: Unsupervised key-phrases extraction from scientific papers using domain and linguistic knowledge. In: Digital Information Management, 2008. ICDIM 2008. Third International Conference on. pp. 105–112 (Nov 2008)
6. Liu, Z., Li, P., Zheng, Y., Sun, M.: Clustering to find exemplar terms for keyphrase extraction. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1. pp. 257–266. EMNLP '09 (2009)
7. Matsuo, Y., Ishizuka, M.: Keyword extraction from a single document using word co-occurrence statistical information.
International Journal on Artificial Intelligence Tools 13(01), 157–169 (2004)
8. Speck, R., Ngonga Ngomo, A.C.: Ensemble learning for named entity recognition. In: The Semantic Web – ISWC 2014, Lecture Notes in Computer Science, vol. 8796, pp. 519–534. Springer International Publishing (2014)
9. Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C., Nevill-Manning, C.G.: KEA: Practical automatic keyphrase extraction. In: Proceedings of the Fourth ACM Conference on Digital Libraries. pp. 254–255. ACM (1999)
10. Zhang, C.: Automatic keyword extraction from documents using conditional random fields. Journal of Computational Information Systems 4(3), 1169–1180 (2008), http://eprints.rclis.org/handle/10760/12305