<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Introducing Distiller: a unifying framework for Knowledge Extraction</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Marco</forename><surname>Basaldella</surname></persName>
							<email>basaldella.marco.1@spes.uniud.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Mathematics and Computer Science</orgName>
								<orgName type="laboratory">Artificial Intelligence Lab</orgName>
								<orgName type="institution">University of Udine</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dario</forename><surname>De Nart</surname></persName>
							<email>dario.denart@uniud.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Mathematics and Computer Science</orgName>
								<orgName type="laboratory">Artificial Intelligence Lab</orgName>
								<orgName type="institution">University of Udine</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Carlo</forename><surname>Tasso</surname></persName>
							<email>carlo.tasso@uniud.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Mathematics and Computer Science</orgName>
								<orgName type="laboratory">Artificial Intelligence Lab</orgName>
								<orgName type="institution">University of Udine</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Introducing Distiller: a unifying framework for Knowledge Extraction</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">F00CFC392431FF74323CAF7547827AB7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T08:39+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Digital Libraries community has shown growing interest in Semantic Search technologies over the last years. Content analysis and annotation is a vital task, but for large corpora it is not feasible to perform it manually. Several automatic tools are available, but they usually provide few tuning options and do not support integration with different systems. Search and adaptation technologies, on the other hand, are becoming increasingly multilingual and cross-domain to tackle the continuous growth of the available information. We therefore claim that tackling such critical issues requires a more systematic and flexible approach, such as the use of a framework. In this paper we present a novel framework for Knowledge Extraction whose main goal is to support the development of new applications and to ease the integration of existing ones.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Automatic Knowledge Extraction (herein KE) from natural language documents is a critical step towards better access to and classification of documents by means of semantic technologies. However, given the current size of digital archives, one cannot expect human experts to annotate such data manually. Several tools have been developed over the past years to address this issue. Nevertheless, four critical issues in state-of-the-art Knowledge Extraction systems can be identified:</p><p>-Multilinguality: roughly half of the available Web pages include non-English text<ref type="foot" target="#foot_0">1</ref>. The large majority of Web users are non-English native speakers, and tasks like multilingual search, adaptation, and personalization are likely to be key features of future information access. Unfortunately, KE tools show, with a few notable exceptions, a general lack of multilingual support. -Knowledge Source Completeness: KE systems mostly rely on a specific knowledge source (such as DBpedia or Freebase), acting in a closed-world fashion and assuming that such a knowledge source is complete. This assumption is in contrast with the open-world assumption of Semantic Web technologies and reveals its limitations when applied to texts where new concepts are often introduced, such as scientific papers. Therefore a more flexible approach, open to more than one external knowledge source and compliant with the open-world assumption, seems more appropriate. -Knowledge Overload: long texts, such as scientific papers, may include many named entities, but not all of them are equally relevant inside the text. State-of-the-art KE systems currently provide Named Entity Recognition but neither filter relevant entities nor include relevance measures. On the other hand, Keyword and Keyphrase extraction systems usually do filter entities but do not disambiguate them nor link them to DBpedia or other authoritative ontologies. 
-Flexibility: state-of-the-art systems tend to provide a "one-size-fits-all" solution that is generally a domain-independent application and, to the best of our knowledge, none of them can be easily tailored by non-KE-experts to fit the specific domain requirements, assumptions, or constraints of each digital library.</p><p>To overcome these issues, in this paper we introduce Distiller, a KE framework that provides a complete, yet easily understandable, KE pipeline, allowing quick development of custom applications and integration of heterogeneous KE technologies.</p><p>The rest of the paper is organized as follows: in Section 2 we present related work, in Section 3 we introduce the key concepts of the Distiller framework as well as its built-in modules, and in Section 4 we explain how to obtain and use the Distiller. Finally, Section 5 concludes and presents the future extensions of our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Named entity recognition and automatic semantic data generation from natural language text have already been investigated, and several knowledge extraction systems already exist <ref type="bibr" target="#b2">[3]</ref>, such as OpenCalais<ref type="foot" target="#foot_1">2</ref>, Apache Stanbol<ref type="foot" target="#foot_2">3</ref>, TagMe <ref type="bibr" target="#b1">[2]</ref>, BabelNet, Babelfy <ref type="bibr" target="#b11">[12]</ref>, and so on. Each of these systems is tailored to a specific domain and works well in that domain. On the other hand, several authors in the literature have addressed the problem of filtering document information by identifying keyphrases (herein KPs), and a wide range of approaches has been proposed; different techniques of KP extraction have been surveyed in the literature <ref type="bibr" target="#b4">[5]</ref>. These techniques can be divided mainly into supervised techniques, based on statistical, structural, and syntactic features of a document, and unsupervised techniques, which employ graph clustering and ranking to find the most relevant KPs in a document.</p><p>To the best of our knowledge, these efforts, even when they represented a significant step forward in the KP research field, rarely led to a systematic and replicable development approach to KP problems. At the time of writing, we are not aware of an 'out of the box' solution offering a developer, or even a less technically minded researcher, a tool that is both easy to use and easy to configure for the KP extraction problem. Moreover, while there is a wide body of state-of-the-art algorithms, only a few of them are freely available to the research community. Therefore, in this section we focus only on the KP extraction software that is available for download on the Internet.</p><p>An example of an available solution is RAKE <ref type="bibr" target="#b13">[14]</ref>. 
While there is an open source implementation of the algorithm<ref type="foot" target="#foot_3">4</ref>, it is a single-purpose application with little or no configurability. There is also an open source implementation of the KEA algorithm <ref type="bibr" target="#b15">[16]</ref> available online<ref type="foot" target="#foot_4">5</ref>, but it seems that the project has not been updated since 2007. As with RAKE, this software is a single-purpose solution with very few customization options. The KEA algorithm is the basis for the MAUI software<ref type="foot" target="#foot_5">6</ref>, which offers an open source implementation of an improved version of the KEA algorithm plus other tools for common KE tasks such as Entity Recognition and Automatic Tagging <ref type="bibr" target="#b9">[10]</ref>. Unfortunately, the bulk of these useful features is part of a closed-source commercial product. Moreover, this software is not meant to be a framework; therefore, extensions with new modules and integration with existing systems are hard to develop. Finally, JATE<ref type="foot" target="#foot_6">7</ref> is a library that offers a set of KP extraction algorithms. Unfortunately, this library is not designed as a framework, but simply as a collection of algorithms.</p><p>It is also important to stress that the KE domain itself lacks standardization. Evaluation of KP extraction systems is difficult, since there is little agreement in the community on which metrics should be used: some scholars use Information Retrieval metrics <ref type="bibr" target="#b6">[7]</ref>, while others introduce new domain-specific metrics, as in <ref type="bibr" target="#b14">[15]</ref>. 
Moreover, as we discuss in Section 3.4, there is still no shared terminology in the community.</p><p>Our work aims to be a step in a wider, unifying direction: we want to provide the KE and KP communities with an open-source, simple, and flexible framework-based solution that can be used for fast development and evaluation of KE and KP extraction techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Framework Design</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">General Design</head><p>In order to overcome the shortcomings of state-of-the-art KE systems, we extended the approach presented in <ref type="bibr" target="#b12">[13]</ref> and <ref type="bibr" target="#b0">[1]</ref> and formalized it in a framework named Distiller, whose main aim is to support research and prototyping activities by providing an environment for building testbed systems and integrating existing KE systems.</p><p>Distiller is implemented in Java, since the language is widespread in the research community and offers reasonable performance and multiplatform support. Moreover, since it runs on the JVM, Distiller can be used with other popular languages such as Groovy, Scala, and Ruby 8 . Distiller relies on the Spring framework to handle dependency injection, allowing easy Web deployment on Servlet containers such as Apache Tomcat.</p><p>The design of Distiller is guided by the key principle that several different types of knowledge are involved in the process of KE and should be clearly separated in order to design systems able to cope with multilinguality and multi-domain issues. At present we consider four types of knowledge:</p><p>-Statistical: word distribution in the document and/or in a corpus of documents; -Linguistic: lexical and morphological knowledge; -Social-Semantic: knowledge derived from external sources such as Wikipedia, or more specific domain ontologies, possibly cooperatively developed; -Meta-Structural: heuristics based on prior knowledge of text structure (e.g. knowing that scientific papers have an abstract).</p><p>Linguistic knowledge is language dependent, Meta-Structural knowledge is domain dependent, and Social-Semantic knowledge is both domain and language dependent. 
At a more practical level, this principle implies that different types of knowledge must reside in distinct modules; for instance, statistical and linguistic analysis must be handled by different modules.</p><p>Distiller is organized as a series of single-knowledge-oriented modules, where each module is designed to perform a single task efficiently, e.g. POS tagging, statistical analysis, knowledge inference, and so on. This allows a highly modular design with the possibility of implementing different pipelines (i.e. sequences of modules) for different tasks. All these modules are required to insert the knowledge they extract into a shared blackboard, so that a module can use the knowledge produced by another module. For example, an n-gram generator module can generate n-grams according to the POS tags produced by a POS tagger module. Since these modules work by annotating the text on the blackboard with new information, we call them Annotators in our framework.</p><p>Implementing Knowledge Extraction tasks with Distiller ultimately reduces to specifying a pipeline including the right annotators. Consider, for instance, the task of KP Extraction introduced in Section 2. Usually such a task is divided into the following steps: text pre-processing, candidate KP generation, and candidate KP selection and/or ranking. Distiller allows a quick deployment of such an application with the following annotators: a Sentence Splitter and a word Tokenizer to handle the pre-processing phase; a Stemmer, a POS Tagger, and an optional Entity Linker to annotate the text; an N-Gram Generator to generate candidates; and Scoring and Filtering modules to filter the most relevant candidates according to the annotations produced in the previous steps. The resulting pipeline is shown in Figure <ref type="figure" target="#fig_0">1</ref>. Since each Annotator provides only a specific kind of knowledge, tailoring the pipeline to specific needs requires little effort. 
For instance, switching to another language requires replacing only the language-dependent annotators, namely the POS Tagger, the Stemmer, and the Word Tokenizer. Other pipelines can be specified to implement different Knowledge Extraction and text mining tasks such as Sentiment Analysis, Summarization, or Authorship Identification.</p></div>
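The annotator-and-blackboard architecture described above can be sketched in Java as follows. This is an illustrative sketch only, not the actual Distiller API: the names Blackboard, Annotator, and Pipeline, and their signatures, are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the blackboard architecture (not the real Distiller API).
public class BlackboardDemo {

    // The shared blackboard: every annotator reads the text and the annotations
    // written by earlier stages, and adds its own.
    static class Blackboard {
        final String text;
        final Map annotations = new HashMap();   // annotation name -> value
        Blackboard(String text) { this.text = text; }
    }

    // Every module of a pipeline implements this single-method contract.
    interface Annotator {
        void annotate(Blackboard b);
    }

    // A pipeline is itself an annotator: an ordered list of stages run over
    // the same blackboard, so later stages see what earlier ones produced.
    static class Pipeline implements Annotator {
        private final List stages = new ArrayList();
        Pipeline add(Annotator a) { stages.add(a); return this; }
        public void annotate(Blackboard b) {
            for (Object stage : stages) ((Annotator) stage).annotate(b);
        }
    }

    public static void main(String[] args) {
        Blackboard b = new Blackboard("Distiller is a framework for Knowledge Extraction");
        new Pipeline()
            // a toy "tokenizer" stage writing to the blackboard
            .add(bb -> bb.annotations.put("tokens", bb.text.split("\\s+").length))
            // a later stage reading what the previous stage wrote
            .add(bb -> bb.annotations.put("isLong", ((Integer) bb.annotations.get("tokens")) > 5))
            .annotate(b);
        System.out.println(b.annotations);
    }
}
```

Under this sketch, swapping one stage for another, e.g. a POS tagger for a different language, amounts to adding a different Annotator to the list.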
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Examples of Annotators</head><p>The framework provides, out of the box, a small set of annotators that allows building a simple pipeline for the tasks of KP Extraction and Concept Inference. The pipeline we designed follows the feature-based approach, which is widespread in the keyphrase extraction literature <ref type="bibr" target="#b4">[5]</ref>. In this section, to showcase the capabilities of the framework, we present a set of annotations that the Distiller is already able to produce.</p><p>Many features found in the literature have not been implemented in the Distiller yet. This is not because we consider them unworthy or uninteresting; rather, since the framework architecture offers the capability to quickly implement an Annotator that computes a desired feature, our purpose is to provide a solid and reliable framework design rather than a simple collection of algorithms. We plan to extend this feature set in the future, also to domains other than Knowledge Extraction, such as Sentiment Analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">Linguistic Annotators</head><p>We developed wrappers for two of the most popular natural language processing toolkits available for the Java language, namely the Stanford CoreNLP library <ref type="bibr" target="#b8">[9]</ref> and the Apache OpenNLP library<ref type="foot" target="#foot_7">9</ref>.</p><p>We use these tools to split, tokenize, and POS tag documents. These modules are usually the annotators at the beginning of the pipeline.</p><p>Moreover, we provide a simple n-gram generator used to generate candidate keyphrases. This module selects from the input documents the n-grams whose POS tag sequence corresponds to a typical keyphrase POS tag sequence; for example, NN NN is a valid POS tag sequence for this module. These sequences are stored in a simple database in the form of a JSON file. The developer can then give the n-gram generator one database file per language, and the module selects the appropriate one at run time. Default POS-pattern databases, which we obtained by running a POS tagger on a corpus of manually defined keyphrases using the same approach as <ref type="bibr" target="#b5">[6]</ref>, are already available in the framework.</p><p>This n-gram generator is also used to compute what we call the Noun Value of a candidate keyphrase, i.e., given an n-gram g of length n, noun value(g) = (number of nouns in g) / n</p></div>
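As an illustration, the Noun Value defined above can be computed directly from the POS tags of a candidate n-gram. The sketch below assumes Penn Treebank-style tags, where noun tags start with NN; the method name and the tag convention are assumptions of this example, not details taken from the Distiller.

```java
// Sketch of the Noun Value of Section 3.2.1:
// noun_value(g) = (number of nouns in g) / n, for an n-gram g of length n.
public class NounValue {
    // posTags holds one POS tag per token of the n-gram; Penn Treebank noun
    // tags (NN, NNS, NNP, NNPS) all start with "NN".
    static double nounValue(String[] posTags) {
        int nouns = 0;
        for (String tag : posTags) {
            if (tag.startsWith("NN")) nouns++;
        }
        return (double) nouns / posTags.length;
    }
}
```

For example, a bigram tagged NN NN has a Noun Value of 1.0, while a trigram tagged NN IN NN has a Noun Value of 2/3.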
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2">Statistical Annotators</head><p>We include in the Distiller a statistical annotator that provides statistical information about the n-grams generated by the n-gram generator mentioned above.</p><p>In order to illustrate how the statistical processing is performed, we introduce some definitions. Given a document D and a gram g, we denote by |D| the number of sentences of the document, and by pos(D, g) a function that, given a gram, returns the list of positions of the gram in the document. For example, suppose we have pos(D, g) = {1, 3, 3, 5}: this means that g appears in the first, third, and fifth sentences of the document, appearing twice in the third sentence. This module annotates n-grams with four features (depth, height, lifespan, and frequency). These annotations provide positional knowledge about the n-grams, helping us discriminate potential keyphrases. This kind of knowledge is widely used in the keyphrase extraction field <ref type="bibr" target="#b4">[5]</ref>, albeit with different names or slightly different definitions. For example, what we call height is called distance in the KEA system <ref type="bibr" target="#b15">[16]</ref>, and it is computed on the basis of the number of words instead of sentences.</p><p>The HUMB system [8] simply calls KEA's 'distance' first position. More recently, <ref type="bibr" target="#b3">[4]</ref> also calls KEA's distance first position and, moreover, defines first sentence as we define height in this work.</p><p>We recognize that the difference in terminology may confuse a reader coping with all these definitions but, since there is no standard terminology in the KP community itself, it is hard to come up with unambiguous definitions. These remarks may indeed be useful for the KP community in order to define a common corpus of definitions, eliminating the need for re-definition.</p></div>
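Using the definitions above (pos(D, g) as the list of sentence positions of a gram, |D| as the number of sentences), the four statistical features can be sketched as follows; the method name and the array-based representation are choices of this example, not the Distiller's actual interface.

```java
import java.util.Arrays;

// Sketch of the four positional features of Section 3.2.2.
public class GramStatistics {
    // pos holds the 1-based sentence indices where the gram occurs
    // (repetitions allowed, as in pos(D, g) = {1, 3, 3, 5}); docLen is |D|.
    static double[] features(int[] pos, int docLen) {
        int first = Arrays.stream(pos).min().getAsInt();
        int last = Arrays.stream(pos).max().getAsInt();
        double depth = (double) last / docLen;            // last occurrence, relative
        double height = (double) first / docLen;          // first occurrence, relative
        double lifespan = depth - height;                 // span of text covered
        double frequency = (double) pos.length / docLen;  // relative frequency
        return new double[] { depth, height, lifespan, frequency };
    }
}
```

For the running example pos(D, g) = {1, 3, 3, 5} in a 10-sentence document, this yields depth 0.5, height 0.1, lifespan 0.4, and frequency 0.4.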
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.3">Knowledge-Based Annotators</head><p>We built an annotator, relying on TagME <ref type="bibr" target="#b1">[2]</ref>, that marks an n-gram with a boolean value indicating whether it appears on Wikipedia. We call this boolean value the Wikiflag.</p><p>Using the information provided by this annotator, we are able to identify a set of relevant entities that appear in the document and a set of suggested entities that are related to the ones appearing in the document. This way, we provide a quick way for readers to gather the relevant information of a document without reading it in full.</p><p>We thoroughly describe the process of filtering and suggesting entities in <ref type="bibr" target="#b10">[11]</ref>.</p></div>
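A minimal sketch of such an annotator is shown below. The real module queries TagME, so the Wikipedia lookup here is stubbed with an in-memory set of titles; the class and method names are assumptions of this example, not the Distiller's actual API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Sketch of a knowledge-based annotator computing the Wikiflag; the lookup
// against Wikipedia (done via TagME in the real module) is stubbed here.
public class WikiflagAnnotator {
    private final Set knownTitles;               // stand-in for the TagME lookup
    private final Map wikiflags = new HashMap(); // n-gram -> Boolean

    WikiflagAnnotator(Set knownTitles) {
        this.knownTitles = knownTitles;
    }

    // Mark every candidate n-gram with a boolean Wikiflag.
    void annotate(String[] ngrams) {
        for (String g : ngrams) {
            wikiflags.put(g, knownTitles.contains(g.toLowerCase()));
        }
    }

    boolean wikiflag(String ngram) {
        return Boolean.TRUE.equals(wikiflags.get(ngram));
    }
}
```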
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Multilinguality</head><p>It is simple to adapt the design of a pipeline to languages other than English. Since we use components that are quite standard in the NLP community, one can rely on resources already available online to port a pipeline from one language to another. Let us take our Keyphrase Extraction pipeline as an example again. The pipeline is already designed to support English and Italian, but it is possible to support an arbitrary number of languages. In fact, the only language-dependent annotators are the linguistic annotators (POS tagger, splitter, and so on), the n-gram generator, and the external knowledge annotators. We already mentioned that splitting, tokenization, and POS tagging are performed by external libraries such as Apache OpenNLP. To perform these tasks in languages other than English, we offer the user a simple configuration parameter that allows them to use one of the many language models already available<ref type="foot" target="#foot_8">10</ref>. Listing 1.2 is an example of multi-language support for the Apache OpenNLP wrapper in the Distiller. These models can be used to build the POS patterns for the n-gram generator, whose multi-language capabilities we have already mentioned in Section 3.2.1.</p><p>Regarding the external knowledge annotators, while TagMe is available only in Italian and English, it is possible to use one of the many similar online services, such as Babelfy, to perform the same task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Evaluation</head><p>An important step of every scientific process is the evaluation of the results. For this reason, the design of the Distiller makes it easy to build an evaluation stage for every kind of pipeline that it can support.</p><p>As already mentioned, the current focus of the Distiller is on Knowledge Extraction and, more specifically, on KP Extraction, so we designed a simple evaluator process for this task. We have built an evaluation system for scientific articles based on the SEMEVAL 2010 dataset. In the near future we plan to integrate evaluation on the Inspec dataset to evaluate the pipeline on abstracts, and on the DUC-2001 dataset to evaluate it on news articles.</p><p>For the Keyphrase Extraction task, evaluation is performed by calculating the usual metrics of precision, recall, and f-measure. Moreover, <ref type="bibr" target="#b6">[7]</ref> recently introduced two metrics derived from the Information Retrieval community, namely the binary preference measure and the mean reciprocal rank, which take the ranking of the extracted keyphrases into account. For the same reason, <ref type="bibr" target="#b14">[15]</ref> recently proposed a new metric called average standard normalised cumulative gain, which claims to offer an even better evaluation technique for keyphrase extraction. We use these three innovative metrics along with the usual precision, recall, and f-measure in the Distiller. This way, we hope to provide a fast, accurate, and comprehensive evaluation of the KE task in our framework.</p></div>
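For reference, the set-based metrics mentioned above can be sketched as follows. This is a generic illustration of precision, recall, and f-measure over extracted versus gold-standard keyphrases, not the Distiller's own evaluator, whose class names are not shown here.

```java
import java.util.Set;

// Sketch: precision, recall and f-measure of a set of extracted keyphrases
// against a gold-standard set (a generic illustration, not the Distiller API).
public class KeyphraseEvaluation {
    static double[] prf(Set extracted, Set gold) {
        int tp = 0;
        for (Object kp : extracted) {
            if (gold.contains(kp)) tp++;        // true positives
        }
        double p = extracted.isEmpty() ? 0.0 : (double) tp / extracted.size();
        double r = gold.isEmpty() ? 0.0 : (double) tp / gold.size();
        double f = (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
        return new double[] { p, r, f };
    }
}
```

For instance, extracting {a, b, c} against a gold standard {a, b, d, e} yields precision 2/3, recall 1/2, and f-measure 4/7.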
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Using the Distiller</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Distribution and Licensing</head><p>All the code of the Distiller is available online under the Apache 2 License. The full source code can be found on GitHub<ref type="foot" target="#foot_9">11</ref>. Due to license constraints, we cannot include GPL-licensed code in our framework. For this reason we will not include the Stanford CoreNLP wrapper in the default release, but we will release a set of GPL2-licensed annotators in the future to overcome this limitation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">A practical example</head><p>Being a Spring application, Distiller can be configured with an XML configuration file. Each module can be specified and configured in this file, and the system configuration can be changed with no need to recompile the code. It is also possible to configure the Distiller using Java code but, since the result is the same as with the XML configuration, we cover only the latter in this paper. Listing 1.1 shows a sample configuration snippet where the KE pipeline is defined. This pipeline is injected into the Distiller using the facilities that the Spring framework provides. &lt;bean id="defaultPipeline" class="it.uniud.ailab.dcore.annotation.Pipeline"&gt; &lt;property name="annotators"&gt; &lt;list&gt; &lt;!-- split the document --&gt; &lt;ref bean="openNLP" /&gt; &lt;!-- annotate the tokens --&gt; &lt;ref bean="tagme" /&gt; &lt;!-- generate the n-grams --&gt; &lt;ref bean="nGramGenerator" /&gt; &lt;!-- annotate the n-grams --&gt; &lt;ref bean="statistical" /&gt; &lt;ref bean="tagmegram" /&gt; &lt;!-- evaluate the keyphraseness --&gt; &lt;ref bean="linearEvaluator" /&gt; &lt;!-- infer concepts --&gt; &lt;ref bean="wikipediaInference" /&gt; &lt;!-- filter the non-interesting output --&gt; &lt;ref bean="skylineGramFilter" /&gt; &lt;ref bean="hypernymFilter" /&gt; &lt;ref bean="relatedFilter" /&gt; &lt;/list&gt; &lt;/property&gt; &lt;/bean&gt; Listing 1.1: A configuration snippet. Each module of the pipeline must implement the Annotator interface. 
An example of Annotator is the OpenNLPBootstrapper, a module that uses the Apache OpenNLP library to split, tokenize, and POS tag the document. This annotator is defined as a bean in the XML file, as in Listing 1.2, and then passed to the pipeline as in Listing 1.1 above. &lt;bean id="openNLP" class="it.uniud.ailab.dcore.wrappers.external.OpenNlpBootstrapperAnnotator"&gt; &lt;property name="modelPaths"&gt; &lt;map key-type="java.lang.String" value-type="java.lang.String"&gt; &lt;entry key="en-sent" value="/opt/distiller/models/en-sent.bin" /&gt; &lt;entry key="en-token" value="/opt/distiller/models/en-token.bin" /&gt; &lt;entry key="en-pos-maxent" value="/opt/distiller/models/en-pos-maxent.bin" /&gt; &lt;entry key="it-sent" value="/opt/distiller/models/it/it-sent.bin" /&gt; &lt;entry key="it-token" value="/opt/distiller/models/it/it-token.bin" /&gt; &lt;entry key="it-pos-maxent" value="/opt/distiller/models/it/it-pos-maxent.bin" /&gt; &lt;/map&gt; &lt;/property&gt; &lt;/bean&gt; Listing 1.2: A configuration snippet.</p><p>Listing 1.2 also shows how a single module can be configured. Here again we use the facilities provided by the Spring framework, in this case to set the paths of the model files that OpenNLP will use to split, tokenize, and POS tag text.</p><p>Once configured, Distiller offers a simple and minimal interface that allows programmers to instantiate and run the application. Listing 1.3 shows how to build a Distiller application according to the configuration file and launch extraction from a text. 
It is also possible to use the Spring framework (or the wrappers provided in the DistillerFactory class) to load and use any custom pipeline for the Distiller. Distiller d = DistillerFactory.getDefault(); DistilledOutput output = d.distill("Text to distill"); Listing 1.3: Running Distiller with the default configuration</p><p>The output is an object containing the ranked concepts, links to external knowledge sources (if any), and the other annotations generated along the KE pipeline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>With respect to the four issues of KE presented in Section 1, Distiller allows the development of applications able to overcome such shortcomings. The issue of multilinguality is eased by the possibility of specifying a wide array of annotators and of dynamically linking them at runtime on the basis of the language under consideration. The issue of Knowledge Source Completeness is eased by the possibility of integrating heterogeneous knowledge sources as different annotators, such as TagME or Babelfy. The issue of Knowledge Overload, finally, is eased by the presence of a filtering phase in which entities are evaluated with respect to their relevance in the text. Currently we are releasing the Distiller framework as an open source project and providing, on request, a RESTful API to access a sample application with multilingual support. Finally, we believe that the Distiller is flexible enough to tackle complex and diverse tasks, provided that the right annotators for these tasks are available. If an annotator for a specific problem does not exist, however, it is possible to implement it and easily plug it into a custom KE pipeline.</p><p>Since the keyphrase ranking phase is based on heuristically calculated weights for the features we discussed in this paper, we plan to build a keyphrase ranking module that can use different machine learning techniques for this task. This work is outside the scope of this paper and will be discussed in future work.</p><p>We also plan to include support for other languages in the Keyphrase Extraction task. We are currently working on Portuguese, Arabic, and Romanian.</p><p>Other future work will include the development of different kinds of pipelines in the Distiller, such as a Sentiment Analysis oriented pipeline. 
In order to demonstrate this possibility, we have already built a simple module, a Java port of the Syuzhet R package<ref type="foot" target="#foot_10">13</ref>, which detects the emotional intensity of a text.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Knowledge extraction pipeline. The downwards arrow indicates an annotator that writes on the blackboard, the upwards arrow indicates an annotator that reads from the blackboard.</figDesc><graphic coords="5,134.77,185.90,345.83,135.34" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>-</head><label></label><figDesc>depth: the (relative) position of the last occurrence of the n-gram, i.e. depth = max(pos(D, g)) / |D| -height: the (relative) position of the first occurrence of the gram, i.e. height = min(pos(D, g)) / |D| -lifespan: the part of the text in which the gram appears, i.e. lifespan = (max(pos(D, g)) − min(pos(D, g))) / |D|, or equivalently lifespan = depth − height -frequency: the relative frequency of the gram in the text, i.e. frequency = |pos(D, g)| / |D|</figDesc></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://w3techs.com/technologies/overview/content_language/all</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.opencalais.com/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://stanbol.apache.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://github.com/aneesha/RAKE</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">http://www.nzdl.org/Kea/download.html</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5">https://github.com/zelandiya/maui</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6">https://github.com/ziqizhang/jate</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_7">https://opennlp.apache.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_8">http://opennlp.sourceforge.net/models-1.5/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_9">https://github.com/ailab-uniud/distiller-CORE</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_10">https://github.com/mjockers/syuzhet</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A New Multilingual Knowledge-base Approach to Keyphrase Extraction for the Italian Language</title>
		<author>
			<persName><forename type="first">Dante</forename><surname>Degl'Innocenti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dario</forename><forename type="middle">De</forename><surname>Nart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlo</forename><surname>Tasso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 6th International Conference on Knowledge Discovery and Information Retrieval</title>
				<meeting>the 6th International Conference on Knowledge Discovery and Information Retrieval</meeting>
		<imprint>
			<publisher>SciTePress</publisher>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="78" to="85" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">TAGME: On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities)</title>
		<author>
			<persName><forename type="first">Paolo</forename><surname>Ferragina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ugo</forename><surname>Scaiella</surname></persName>
		</author>
		<idno type="DOI">10.1145/1871437.1871689</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th ACM International Conference on Information and Knowledge Management. CIKM &apos;10</title>
				<meeting>the 19th ACM International Conference on Information and Knowledge Management. CIKM &apos;10<address><addrLine>Toronto, ON, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="1625" to="1628" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A comparison of knowledge extraction tools for the semantic web</title>
		<author>
			<persName><forename type="first">Aldo</forename><surname>Gangemi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The Semantic Web: Semantics and Big Data</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="351" to="366" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Accurate Keyphrase Extraction from Scientific Papers by Mining Linguistic Information</title>
		<author>
			<persName><forename type="first">Mounia</forename><surname>Haddoud</surname></persName>
		</author>
		<ptr target="http://ceur-ws.org" />
	</analytic>
	<monogr>
		<title level="m">Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI)</title>
				<meeting>of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI)<address><addrLine>Istanbul, Turkey</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Automatic keyphrase extraction: A survey of the state of the art</title>
		<author>
			<persName><forename type="first">Kazi</forename><forename type="middle">Saidul</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vincent</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Association for Computational Linguistics (ACL)</title>
				<meeting>the Association for Computational Linguistics (ACL)<address><addrLine>Baltimore, Maryland</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Improved Automatic Keyword Extraction Given More Linguistic Knowledge</title>
		<author>
			<persName><forename type="first">Anette</forename><surname>Hulth</surname></persName>
		</author>
		<idno type="DOI">10.3115/1119355.1119383</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. EMNLP &apos;03</title>
				<meeting>the 2003 Conference on Empirical Methods in Natural Language Processing. EMNLP &apos;03<address><addrLine>Stroudsburg, PA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2003">2003</date>
			<biblScope unit="page" from="216" to="223" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Automatic keyphrase extraction via topic decomposition</title>
		<author>
			<persName><forename type="first">Zhiyuan</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2010 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="366" to="376" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">HUMB: Automatic key term extraction from scientific articles in GROBID</title>
		<author>
			<persName><forename type="first">Patrice</forename><surname>Lopez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Laurent</forename><surname>Romary</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th international workshop on semantic evaluation. Association for Computational Linguistics</title>
				<meeting>the 5th international workshop on semantic evaluation. Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="248" to="251" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The Stanford CoreNLP Natural Language Processing Toolkit</title>
		<author>
			<persName><forename type="first">Christopher</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</title>
				<meeting>52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="55" to="60" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Human-competitive Tagging Using Automatic Keyphrase Extraction</title>
		<author>
			<persName><forename type="first">Olena</forename><surname>Medelyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eibe</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ian</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2009 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="1318" to="1327" />
		</imprint>
	</monogr>
	<note>EMNLP &apos;09</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A Keyphrase Generation Technique Based upon Keyphrase Extraction and Reasoning on Loosely Structured Ontologies</title>
		<author>
			<persName><forename type="first">Dario</forename><forename type="middle">De</forename><surname>Nart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Carlo</forename><surname>Tasso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 7th International Workshop on Information Filtering and Retrieval co-located with the 13th Conference of the Italian Association for Artificial Intelligence (AI*IA 2013)</title>
				<meeting>the 7th International Workshop on Information Filtering and Retrieval co-located with the 13th Conference of the Italian Association for Artificial Intelligence (AI*IA 2013)<address><addrLine>Turin, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2013-12-06">December 6, 2013. 2013</date>
			<biblScope unit="page" from="49" to="60" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network</title>
		<author>
			<persName><forename type="first">Roberto</forename><surname>Navigli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Simone</forename><forename type="middle">Paolo</forename><surname>Ponzetto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">193</biblScope>
			<biblScope unit="page" from="217" to="250" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Automatic keyphrase extraction and ontology mining for content-based tag recommendation</title>
		<author>
			<persName><forename type="first">Nirmala</forename><surname>Pudota</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Intelligent Systems</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="1158" to="1186" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Automatic keyword extraction from individual documents</title>
		<author>
			<persName><forename type="first">Stuart</forename><surname>Rose</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Text Mining</title>
				<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="1" to="20" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">A critical survey on measuring success in rank-based keyword assignment to documents</title>
		<author>
			<persName><forename type="first">Natalie</forename><surname>Schluter</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">22eme Traitement Automatique des Langues Naturelles</title>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">KEA: Practical automatic keyphrase extraction</title>
		<author>
			<persName><forename type="first">Ian</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the fourth ACM conference on Digital libraries</title>
				<meeting>the fourth ACM conference on Digital libraries</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="254" to="255" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
