<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Introducing Distiller: a unifying framework for Knowledge Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Basaldella</string-name>
          <email>basaldella.marco.1@spes.uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dario De Nart</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Tasso</string-name>
          <email>carlo.tassog@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Artificial Intelligence Lab, Department of Mathematics and Computer Science, University of Udine</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The Digital Libraries community has shown in recent years a growing interest in Semantic Search technologies. Content analysis and annotation is a vital task, but for large corpora it is not feasible to do it manually. Several automatic tools are available, but such tools usually provide few tuning possibilities and do not support integration with different systems. Search and adaptation technologies, on the other hand, are becoming increasingly multi-lingual and cross-domain to tackle the continuous growth of the available information. We claim that to tackle such criticalities a more systematic and flexible approach, such as the use of a framework, is needed. In this paper we present a novel framework for Knowledge Extraction, whose main goal is to support the development of new applications and to ease the integration of existing ones.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>Automatic Knowledge Extraction (herein KE) from natural language documents
is a critical step to provide better access to and classification of documents by means
of semantic technologies. However, due to the current size of digital archives, one
cannot expect human experts to annotate such data manually. Several tools have
been developed over the past years to address this issue. However, four critical
issues in state-of-the-art Knowledge Extraction systems can be identified:
- Knowledge Source Completeness: in many document types new knowledge is
often introduced, such as scientific papers. Therefore a more flexible
approach, open to more than one external knowledge source and compliant with
the open-world assumption, seems more appropriate.
- Knowledge Overload: long texts, such as scientific papers, may include a lot
of named entities, but not all are equally relevant inside the text.
State-of-the-art KE systems currently provide Named Entity Recognition but do not
filter relevant entities nor include relevance measures. On the other hand,
Keyword and Keyphrase extraction systems usually do filter entities, but do
not disambiguate nor link them to DBpedia or other authoritative ontologies.
- Flexibility: state-of-the-art systems tend to provide a "one-size-fits-all"
solution that is generally a domain-independent application and, to the best
of our knowledge, none of them can be easily tailored by non-KE-experts to
fit the specific domain requirements, assumptions, or constraints of each digital
library.</p>
<p>To overcome these issues, in this paper we introduce Distiller, a KE framework
that provides a complete, yet easily
understandable, KE pipeline, allowing quick development of custom applications
and integration of heterogeneous KE technologies.</p>
      <p>The rest of the paper is organized as follows: in Section 2 we present some
related work, in Section 3 we introduce the key concepts of the Distiller
framework as well as the built-in modules, and in Section 4 we explain how to obtain
and use the Distiller. Finally, Sections 5 and 6 conclude and present the future
extensions of our work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Named entity recognition and automatic semantic data generation from natural
language text have already been investigated, and several knowledge extraction
systems already exist [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], such as OpenCalais (http://www.opencalais.com/), Apache Stanbol (https://stanbol.apache.org/), TagMe [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
BabelNet, Babelfy [12], and so on. All these systems are tailored to a specific
domain and work well in that domain. On the other hand, several authors
in the literature have addressed the problem of filtering document information by
identifying keyphrases (herein KPs), and a wide range of approaches have been
proposed. Different techniques of KP extraction have been identified in the literature
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. These techniques can be divided mainly into supervised techniques, based on
statistical, structural, and syntactic features of a document, and unsupervised
techniques, which employ graph clustering and ranking techniques to find the
most relevant KPs in a document.
      </p>
      <p>To the best of our knowledge, these efforts, even when they brought a significant
step forward in the KP research field, rarely led to a systematic and
replicable development approach to the KP problems. At the time of writing, we are not
aware of an `out of the box' solution able to offer a developer,
or even a less technically-minded researcher, a solution which is both easy to use
and easy to configure for the KP extraction problem. Moreover, while there is a
wide body of state-of-the-art algorithms, only a few of them are freely available to
the research community. So in this section we focus only on the KP extraction
software that is available for download on the Internet.</p>
      <p>An example of an available solution is RAKE [14]. While there is an open
source implementation of the algorithm (https://github.com/aneesha/RAKE), it is a single-purpose application with
little or no configuration. There is also an open source implementation of the
KEA algorithm [16] available online (http://www.nzdl.org/Kea/download.html), but it seems that the project has not been
updated since 2007. As for RAKE, this software is a single-purpose solution with
very few customization options. The KEA algorithm is the basis for the MAUI
software (https://github.com/zelandiya/maui), which offers an open source implementation of an improved version
of the KEA algorithm plus other tools for other common KE tasks such as
Entity Recognition or Automatic Tagging [10]. Unfortunately, the bulk of such
useful features is part of a closed-source commercial product. Moreover, such
software is not meant to be a framework, therefore extensions with new modules
and integration with existing systems are hard to develop. Finally, JATE (https://github.com/ziqizhang/jate) is a
library that offers a set of KP extraction algorithms. Unfortunately, this library
is not developed as a framework, but just as a collection of algorithms.</p>
      <p>It is also important to stress that the KE domain itself lacks
standardization. Evaluation of KP extraction systems is difficult, since in the community
there is little agreement on which metrics should be used: some scholars use
Information Retrieval metrics [7], while others introduce new domain-specific
metrics like in [15]. Moreover, as we discuss in Section 3.4, there is still no
shared terminology in the community.</p>
      <p>Our work aims to be a step towards a wider, unifying direction: we want
to provide to the KE and KP communities an open-source, simple, and flexible
framework-based solution, which can be used for fast development and evaluation
of KE and KP extraction techniques.</p>
    </sec>
    <sec id="sec-3">
      <title>Framework Design</title>
      <sec id="sec-3-1">
        <title>General Design</title>
        <p>
          In order to overcome the shortcomings of state-of-the-art KE systems, we extended the approach presented in [13] and [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and formalized it in a framework
named Distiller, whose main aim is to support research and prototyping
activities by providing an environment for building testbed systems and integrating
existing KE systems.
        </p>
        <p>Distiller is implemented in Java, since the language is widespread in the research community and offers reasonable performance and multiplatform support.</p>
        <sec id="sec-3-1-1">
          <title/>
          <p>Moreover, since Distiller runs on the JVM, it can be used with other
popular JVM languages such as Groovy, Scala, and Ruby (via the JRuby implementation). Distiller relies on the
Spring framework to handle dependency injection, allowing easy Web deployment
on Servlet containers such as Apache Tomcat.</p>
          <p>The design of Distiller is guided by the key principle that several different
types of knowledge are involved in the process of KE and should be clearly
separated in order to design systems able to cope with multilinguality and
multi-domain issues. For example, by now we consider four types of knowledge:
- Statistical: word distribution in the document and/or in a corpus of documents;
- Linguistic: lexical and morphological knowledge;
- Social-Semantic: knowledge derived from external sources such as Wikipedia,
or more specific domain ontologies, possibly cooperatively developed;
- Meta-Structural: heuristics based on prior knowledge of text structure (e.g.
knowing that scientific papers have an abstract).</p>
          <p>Linguistic knowledge is language-dependent, Meta-Structural knowledge is
domain-dependent, and Social-Semantic knowledge is both domain- and language-dependent. At a more practical level, this principle implies that different types of
knowledge must reside in distinct modules; for instance, statistical and linguistic
analysis must be handled by different modules.</p>
          <p>Distiller is organized in a series of single-knowledge-oriented modules, where
each module is designed to perform a single task efficiently, e.g. POS tagging,
statistical analysis, knowledge inference, and so on. This allows a highly modular
design with the possibility of implementing different pipelines (i.e. sequences of
modules) for different tasks. All these modules are required to insert the
knowledge they extract into a shared blackboard, so that a module can use the knowledge
produced by another module. For example, an n-gram generator module can
generate n-grams according to the POS tags produced by a POS tagger module.
Since these modules work by annotating the text on the blackboard with new
information, we call them Annotators in our framework.</p>
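The blackboard design described above can be sketched as follows. This is a minimal illustration only: the names `Blackboard`, `Annotator`, and `Pipeline` are simplified stand-ins and do not necessarily match the actual Distiller API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the shared-blackboard design: every module implements Annotator
// and writes its annotations to a shared Blackboard, so later modules can
// read what earlier modules produced.
class Blackboard {
    final String text;
    final Map<String, List<String>> annotations = new HashMap<>();

    Blackboard(String text) { this.text = text; }

    void annotate(String layer, String value) {
        annotations.computeIfAbsent(layer, k -> new ArrayList<>()).add(value);
    }

    List<String> get(String layer) {
        return annotations.getOrDefault(layer, new ArrayList<>());
    }
}

interface Annotator {
    void annotate(Blackboard blackboard);
}

// A toy annotator: splits on whitespace and writes "token" annotations.
class WhitespaceTokenizer implements Annotator {
    public void annotate(Blackboard b) {
        for (String tok : b.text.split("\\s+")) {
            b.annotate("token", tok);
        }
    }
}

// A toy pipeline: runs annotators in sequence over the same blackboard.
class Pipeline {
    final List<Annotator> annotators = new ArrayList<>();

    Pipeline add(Annotator a) { annotators.add(a); return this; }

    Blackboard run(String text) {
        Blackboard b = new Blackboard(text);
        for (Annotator a : annotators) a.annotate(b);
        return b;
    }
}
```

In this sketch, a downstream module (e.g. an n-gram generator) would simply read the `"token"` layer written by the tokenizer, mirroring the data flow described above.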
          <p>
            Implementing Knowledge Extraction tasks with Distiller ultimately
reduces to specifying a pipeline including the right annotators. Consider for
instance the task of KP Extraction introduced in Section 2. Usually this task is
divided into the following steps: text pre-processing, candidate KP generation, and
candidate KP selection and/or ranking. Distiller allows a quick deployment of
such an application with the following annotators: a Sentence Splitter and a word
Tokenizer to handle the pre-processing phase; a Stemmer, a POS Tagger, and an
optional Entity Linker to annotate the text; an N-Gram Generator to generate
candidates; and Scoring and Filtering modules to filter the most relevant
candidates according to the annotations produced in the previous steps. The resulting
pipeline is shown in Figure 1. Since each Annotator provides only a specific kind
of knowledge, tailoring the pipeline to specific needs requires little effort. For
instance, switching to another language requires replacing only the language-dependent annotators, namely the POS Tagger, the Stemmer, and the Word
Tokenizer. Other pipelines can be specified to implement different Knowledge
Extraction and text mining tasks such as Sentiment Analysis, Summarization,
or Authorship Identification.
          </p>
          <p>
            The framework provides out of the box a small set of annotators that allow building a simple pipeline for the tasks of KP Extraction and Concept Inference.
The pipeline we designed follows the feature-based approach which is widespread
in the keyphrase extraction literature [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. In this section, to showcase the
capabilities of the framework, we present a set of annotations that the Distiller is
already able to produce.
          </p>
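The scoring step of such a pipeline can be illustrated with a simple linear combination of feature values written to the blackboard by earlier modules. This is a sketch only: the feature names and weights below are invented for illustration and do not reflect the actual Distiller implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of a scoring ("keyphraseness") evaluator that linearly
// combines the feature annotations produced by earlier pipeline modules.
// Feature names and weights are invented for illustration only.
class LinearEvaluatorSketch {
    private final Map<String, Double> weights = new HashMap<>();

    LinearEvaluatorSketch() {
        weights.put("frequency", 2.0);   // frequent candidates score higher
        weights.put("height", -1.0);     // earlier first occurrence scores higher
        weights.put("nounValue", 1.5);   // noun-heavy candidates score higher
    }

    double score(Map<String, Double> features) {
        double s = 0.0;
        for (Map.Entry<String, Double> w : weights.entrySet()) {
            s += w.getValue() * features.getOrDefault(w.getKey(), 0.0);
        }
        return s;
    }
}
```

A filtering module would then simply keep the top-ranked candidates according to this score.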
          <p>There are many features in the literature that we have not
implemented in the Distiller yet. This is not because we do not consider them worthy or interesting enough; rather, since the framework architecture
offers the capability to quickly implement an Annotator that calculates a desired
feature, our purpose is to provide a solid and reliable framework design rather
than a simple collection of algorithms. We plan to extend this feature set in the
future, extending it also to domains other than Knowledge Extraction
such as, for example, Sentiment Analysis.
3.2.1 Linguistic Annotators We developed wrappers for two of the most
popular natural language processing toolkits available for the Java language,
namely the Stanford CoreNLP library [9] and the Apache OpenNLP library (https://opennlp.apache.org/).</p>
        </sec>
        <sec id="sec-3-1-2">
          <title/>
          <p>We use these tools to split, tokenize, and POS tag documents. These modules
are usually the annotators at the beginning of the pipeline.</p>
          <p>
            Moreover, we provide a simple n-gram generator used to generate candidate
keyphrases. This module selects from the input documents the n-grams whose
POS tag sequence corresponds to a typical keyphrase POS tag sequence; for
example, NN NN is a valid POS tag sequence for this module. These sequences
are stored in a simple database in the shape of a JSON file. The developer
can then give the n-gram generator one database file per language, and the
module selects the appropriate one at run time. Default POS-pattern
databases, which we obtained by running a POS tagger on a corpus of manually
defined keyphrases using the same approach as [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], are already available in the
framework.
          </p>
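The pattern-matching step just described can be sketched as follows. Class and method names are illustrative; in the real framework the set of allowed patterns would be loaded from the per-language JSON database rather than passed in directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

// Illustrative sketch of POS-pattern-based candidate generation: an n-gram is
// kept as a candidate keyphrase only if its POS tag sequence matches one of
// the allowed patterns.
class PosPatternGenerator {
    private final Set<String> patterns;

    PosPatternGenerator(Set<String> patterns) { this.patterns = patterns; }

    // tokens and tags are parallel arrays for one sentence
    List<String> candidates(String[] tokens, String[] tags, int maxLen) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            for (int len = 1; len <= maxLen && start + len <= tokens.length; len++) {
                String tagSeq = String.join(" ",
                        Arrays.copyOfRange(tags, start, start + len));
                if (patterns.contains(tagSeq)) {
                    result.add(String.join(" ",
                            Arrays.copyOfRange(tokens, start, start + len)));
                }
            }
        }
        return result;
    }
}
```

For example, with the patterns NN, NN NN, and JJ NN, the sentence fragment "automatic/JJ keyphrase/NN extraction/NN" yields the candidates "automatic keyphrase", "keyphrase", "keyphrase extraction", and "extraction", but not the lone adjective "automatic".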
          <p>This n-gram generator is also used to compute what we call the Noun Value
of a candidate keyphrase, i.e., given an n-gram g of length n,</p>
          <p>noun value(g) = (number of nouns in g) / n</p>
          <p>3.2.2 Statistical Annotators We include in the Distiller a statistical
annotator that provides statistical information about the n-grams generated by the
n-gram generator mentioned above.</p>
          <p>In order to illustrate how the statistical processing is performed, we introduce
some definitions. Given a document D and a gram g, we denote with |D| the
number of sentences of the document, and with pos(D, g) a function that, given
a gram, returns the list of positions (sentence indices) of the gram in the document. For example,
suppose we have pos(D, g) = {1, 3, 3, 5}: this means that g appears in the first,
third, and fifth sentence of the document, appearing twice in the third
sentence. This module annotates n-grams with four features:
- depth: the (relative) position of the last occurrence of the n-gram, i.e. depth = max(pos(D, g)) / |D|;
- height: the (relative) position of the first occurrence of the gram, i.e. height = min(pos(D, g)) / |D|;
- lifespan: the part of the text in which the gram appears, i.e. lifespan = (max(pos(D, g)) - min(pos(D, g))) / |D|, or equivalently lifespan = depth - height;
- frequency: the relative frequency of the gram in the text, i.e. frequency = |pos(D, g)| / |D|.</p>
          <p>
These annotations provide us with positional knowledge about the n-grams, helping
us to discriminate potential keyphrases. This kind of knowledge is widely used in
the keyphrase extraction field [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], albeit with different names or slightly different
definitions. For example, what we call height is called distance in the KEA system
[16], and it is computed on the basis of the number of words instead of sentences.
The HUMB system [8] calls KEA's `distance' simply first position. More recently,
[
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] also calls KEA's distance first position, and moreover it defines first sentence
as we define height in this work.
          </p>
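Under the definitions above, with pos(D, g) as the list of sentence indices and |D| the number of sentences, the four features can be computed as in the following sketch (the class name is illustrative; positions are 1-based sentence indices):

```java
import java.util.Collections;
import java.util.List;

// Computes the four positional features defined above. positions is
// pos(D, g), the sentence indices at which the gram occurs, and docLength
// is |D|, the number of sentences in the document.
class GramStatistics {
    final double depth, height, lifespan, frequency;

    GramStatistics(List<Integer> positions, int docLength) {
        double max = Collections.max(positions);
        double min = Collections.min(positions);
        depth = max / docLength;                     // last occurrence, relative
        height = min / docLength;                    // first occurrence, relative
        lifespan = depth - height;                   // = (max - min) / |D|
        frequency = (double) positions.size() / docLength;
    }
}
```

For the running example pos(D, g) = {1, 3, 3, 5} in a 10-sentence document, this gives depth = 0.5, height = 0.1, lifespan = 0.4, and frequency = 0.4.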
          <p>
            We recognize that the difference in terminology may cause confusion for a
reader coping with all these definitions but, since there is no standard terminology
in the KP community itself, it is hard to come up with unambiguous definitions.
These remarks may indeed be useful for the KP community in order to define a
common corpus of definitions, eliminating the need for re-definition.
          </p>
          <p>
            3.2.3 Knowledge-Based Annotators We built an annotator that relies on
TagME [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], aimed at marking an n-gram with a boolean value indicating whether it appears on
Wikipedia. We call this boolean value Wikiflag.
          </p>
          <p>Using the information provided by this annotator, we are able to identify a set
of relevant entities that appear in the document and a set of suggested entities
that are related to the ones that appear in the document. This way, we provide
a quick way for the reader to gather the relevant information of a document
without reading the whole document.</p>
          <p>We thoroughly describe the process of filtering and suggesting entities in [11].</p>
          <p>Multilinguality
It is simple to adapt the design of a pipeline to languages other than
English. Since we use components that are quite standard in the NLP community,
one can use resources that are already available online to port a pipeline
from one language to another. Let us take again our Keyphrase Extraction pipeline
as an example. The pipeline is already designed to support English and
Italian, but it is possible to support an arbitrary number of languages. In fact, the
only annotators that are language-dependent are the linguistic annotators (POS
tagger, splitter, and so on), the n-gram generator, and the external knowledge
annotators. We already mentioned that splitting, tokenization, and POS tagging
are performed by external libraries such as Apache OpenNLP. To perform these
tasks in languages other than English, we already offer the user a simple
configuration parameter that selects one of the many language models
that are already available (http://opennlp.sourceforge.net/models-1.5/). Listing 1.2 is an example of multi-language
support for the Apache OpenNLP wrapper in the Distiller. These models can be
used to build the POS patterns for the n-gram generator, whose multilanguage
capabilities we have already mentioned in Section 3.2.1.</p>
          <p>Regarding the external knowledge annotators, while TagMe is available only
in Italian and English, it is possible to use one of the many similar online services,
such as, for example, Babelfy, to perform the same task.</p>
          <p>Evaluation
An important step of every scientific process is the evaluation of the results.
For this reason, the Distiller design makes it easy to build an evaluation stage for
every kind of pipeline that it can support.</p>
          <p>As we already mentioned, the focus of the Distiller by now is on Knowledge
Extraction and, more specifically, on KP Extraction, so we designed a simple
evaluation process for this task. We have built an evaluation system for scientific
articles based on the SEMEVAL 2010 dataset. In the near future we plan to
integrate evaluation on the Inspec dataset to evaluate the pipeline on abstracts,
and on the DUC-2001 dataset to evaluate it on news articles.</p>
          <p>For the Keyphrase Extraction task, evaluation is performed by calculating
the usual metrics of precision, recall, and F-measure. Moreover, [7] recently
introduced two metrics derived from the Information Retrieval community, namely
the binary preference measure and the mean reciprocal rank, which are used to
take the ranking of the extracted keyphrases into account. For the same reason,
[15] recently proposed a new metric called average standard normalised
cumulative gain, which is claimed to offer an even better evaluation technique for keyphrase
extraction. We use these three innovative metrics along with the usual
precision, recall, and F-measure in the Distiller. This way, we hope to provide a fast,
accurate, and comprehensive evaluation of the KE task in our framework.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Using the Distiller</title>
      <sec id="sec-4-1">
        <title>Distribution and Licensing</title>
        <p>All the code of the Distiller is available online under the Apache 2 License. The
full source code can be found on GitHub (https://github.com/ailab-uniud/distiller-CORE). Due to license constraints we cannot
include GPL-licensed code in our framework. For this reason we do not include
the Stanford CoreNLP wrapper in the default release, but we will release in the
future a set of GPL2-licensed annotators to overcome this limit.</p>
        <p>A practical example
Being a Spring application, Distiller can be configured with an XML configuration
file. Each module can be specified and configured in such a file, and the system
configuration can be changed with no need to recompile the code. It is also possible
to configure the Distiller using Java code but, since the result is the same as with the
XML configuration, we cover only the latter in this paper. Listing 1.1 shows a
sample configuration snippet where the KE pipeline is defined. This pipeline is
injected into the Distiller using the facilities that the Spring framework provides.</p>
        <p>&lt;bean id="defaultPipeline"
      class="it.uniud.ailab.dcore.annotation.Pipeline"&gt;
  &lt;property name="annotators"&gt;
    &lt;list&gt;
      &lt;!-- split the document --&gt;
      &lt;ref bean="openNLP"/&gt;
      &lt;!-- annotate the tokens --&gt;
      &lt;ref bean="tagme"/&gt;
      &lt;!-- generate the n-grams --&gt;
      &lt;ref bean="nGramGenerator"/&gt;
      &lt;!-- annotate the n-grams --&gt;
      &lt;ref bean="statistical"/&gt;
      &lt;ref bean="tagmegram"/&gt;
      &lt;!-- evaluate the keyphraseness --&gt;
      &lt;ref bean="linearEvaluator"/&gt;
      &lt;!-- infer concepts --&gt;
      &lt;ref bean="wikipediaInference"/&gt;
      &lt;!-- filter the non-interesting output --&gt;
      &lt;ref bean="skylineGramFilter"/&gt;
      &lt;ref bean="hypernymFilter"/&gt;
      &lt;ref bean="relatedFilter"/&gt;
    &lt;/list&gt;
  &lt;/property&gt;
&lt;/bean&gt;</p>
        <p>Listing 1.1: A configuration snippet defining the KE pipeline</p>
        <p>Each module of the pipeline must implement the Annotator interface. An
example of Annotator is the OpenNlpBootstrapperAnnotator, a module that uses the
Apache OpenNLP library (http://opennlp.apache.org/) to split, tokenize, and POS tag the document. This
annotator is defined as a bean, as in Listing 1.2, in the XML file and then passed
to the pipeline as in Listing 1.1 above.</p>
        <p>&lt;bean id="openNLP"
      class="it.uniud.ailab.dcore.wrappers.external.OpenNlpBootstrapperAnnotator"&gt;
  &lt;property name="modelPaths"&gt;
    &lt;map key-type="java.lang.String" value-type="java.lang.String"&gt;
      &lt;entry key="en-sent" value="/opt/distiller/models/en-sent.bin"/&gt;
      &lt;entry key="en-token" value="/opt/distiller/models/en-token.bin"/&gt;
      &lt;entry key="en-pos-maxent" value="/opt/distiller/models/en-pos-maxent.bin"/&gt;
      &lt;entry key="it-sent" value="/opt/distiller/models/it/it-sent.bin"/&gt;
      &lt;entry key="it-token" value="/opt/distiller/models/it/it-token.bin"/&gt;
      &lt;entry key="it-pos-maxent" value="/opt/distiller/models/it/it-pos-maxent.bin"/&gt;
    &lt;/map&gt;
  &lt;/property&gt;
&lt;/bean&gt;</p>
        <p>Listing 1.2: A configuration snippet defining the OpenNLP wrapper</p>
        <p>Listing 1.2 is also useful to show how a single module can be configured. Here
again we use the facilities provided by the Spring framework to set the model
file paths that the OpenNLP framework uses in this configuration to
split, tokenize, and POS tag text.</p>
        <p>Once configured, Distiller offers a simple and minimal interface that allows
programmers to instantiate and run the application. Listing 1.3 shows how to build
a Distiller application according to the configuration file and to launch
extraction from a text. It is also possible to use the Spring framework (or the wrappers
for the framework provided in the DistillerFactory class) to load and use any
custom pipeline for the Distiller.</p>
        <p>Distiller d = DistillerFactory.getDefault();
DistilledOutput output = d.distill("Text to distill");</p>
        <p>Listing 1.3: Running Distiller with the default configuration</p>
        <p>The output format is an object containing ranked concepts, links to
external knowledge sources (if any), and other annotations generated along the KE
pipeline.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>With respect to the four issues of KE presented in Section 1, Distiller allows
the development of applications able to overcome such shortcomings. The issue
of multilinguality is eased by the possibility of specifying a wide array of
annotators and of dynamically linking them at runtime on the basis of the considered
language. The issue of Knowledge Source Completeness is eased by the
possibility of integrating heterogeneous knowledge sources as different annotators, such
as TagME or Babelfy. The issue of Knowledge Overload, finally, is eased by the
presence of a filtering phase in which entities are evaluated with respect to their
relevance in the text. Currently we are releasing the Distiller framework as an
open source project and providing, by request, a RESTful API to access a
sample application with multilingual support. Finally, we believe that the Distiller
is flexible enough to tackle complex and diverse tasks, provided that the right
annotators for these tasks are available. If an annotator for a specific problem
does not exist, however, it is possible to implement it and easily plug it into a
custom KE pipeline.</p>
    </sec>
    <sec id="sec-6">
      <title>Future Work</title>
      <p>Since the keyphrase ranking phase is based on heuristically calculated weights
for the features we discussed in this paper, we plan to build a keyphrase ranking
module with the possibility of using different machine learning techniques for this
task. This work is out of the scope of this paper and will be discussed in future
work.</p>
      <p>We also plan to include support for other languages in the Keyphrase
Extraction task. We're currently working on Portuguese, Arabic, and Romanian.</p>
      <p>Other future work will include the development of different kinds of pipelines
in the Distiller, such as a Sentiment Analysis oriented pipeline. In order to
demonstrate this possibility we already built a simple module, a Java
port of the Syuzhet R package (https://github.com/mjockers/syuzhet), which is used to detect the emotional intensity
of a text.</p>
      <p>[7] Zhiyuan Liu et al. "Automatic keyphrase extraction via topic decomposition". In: Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2010, pp. 366–376.
[8] Patrice Lopez and Laurent Romary. "HUMB: Automatic key term extraction from scientific articles in GROBID". In: Proceedings of the 5th International Workshop on Semantic Evaluation. Association for Computational Linguistics, 2010, pp. 248–251.
[9] Christopher D. Manning et al. "The Stanford CoreNLP Natural Language Processing Toolkit". In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations. 2014, pp. 55–60.
[10] Olena Medelyan, Eibe Frank, and Ian H. Witten. "Human-competitive Tagging Using Automatic Keyphrase Extraction". In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 3. EMNLP '09. Singapore: Association for Computational Linguistics, 2009, pp. 1318–1327. isbn: 978-1-932432-63-3.
[11] Dario De Nart and Carlo Tasso. "A Keyphrase Generation Technique Based upon Keyphrase Extraction and Reasoning on Loosely Structured Ontologies". In: Proceedings of the 7th International Workshop on Information Filtering and Retrieval, co-located with the 13th Conference of the Italian Association for Artificial Intelligence (AI*IA 2013), Turin, Italy, December 6, 2013. 2013, pp. 49–60.
[12] Roberto Navigli and Simone Paolo Ponzetto. "BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network". In: Artificial Intelligence 193 (2012), pp. 217–250.
[13] Nirmala Pudota et al. "Automatic keyphrase extraction and ontology mining for content-based tag recommendation". In: International Journal of Intelligent Systems 25.12 (2010), pp. 1158–1186.
[14] Stuart Rose et al. "Automatic keyword extraction from individual documents". In: Text Mining (2010), pp. 1–20.
[15] Natalie Schluter. "A critical survey on measuring success in rank-based keyword assignment to documents". In: 22ème Traitement Automatique des Langues Naturelles, Caen, 2015.
[16] Ian H. Witten et al. "KEA: Practical automatic keyphrase extraction". In: Proceedings of the Fourth ACM Conference on Digital Libraries. ACM, 1999, pp. 254–255.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Dante</given-names>
            <surname>Degl'Innocenti</surname>
          </string-name>
          , Dario De Nart, and Carlo Tasso.
          <article-title>\A New Multilingual Knowledge-base Approach to Keyphrase Extraction for the Italian Language."</article-title>
          <source>In: Proceedings of the 6th International Conference on Knowledge Discovery and Information Retrieval. SciTePress</source>
          ,
          <year>2014</year>
          , pp.
          <volume>78</volume>
          –
          <fpage>85</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Ferragina</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ugo</given-names>
            <surname>Scaiella</surname>
          </string-name>
          . \TAGME:
          <article-title>On-the-fly Annotation of Short Text Fragments (by Wikipedia Entities)"</article-title>
          .
          <source>In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. CIKM '10</source>
          . Toronto, ON, Canada: ACM,
          <year>2010</year>
          , pp.
          <volume>1625</volume>
          –
          <fpage>1628</fpage>
          . isbn:
          <fpage>978</fpage>
          -1-
          <fpage>4503</fpage>
          -0099-5. doi:
          <volume>10</volume>
          .1145/1871437.1871689.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Aldo</given-names>
            <surname>Gangemi</surname>
          </string-name>
          .
          <article-title>\A comparison of knowledge extraction tools for the semantic web"</article-title>
          .
          <source>In: The Semantic Web: Semantics and Big Data</source>
          . Springer,
          <year>2013</year>
          , pp.
          <volume>351</volume>
          –
          <fpage>366</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Mounia</given-names>
            <surname>Haddoud</surname>
          </string-name>
          et al. \
          <article-title>Accurate Keyphrase Extraction from Scientific Papers by Mining Linguistic Information"</article-title>
          .
          <source>In: Proc. of the Workshop Mining Scientific Papers: Computational Linguistics and Bibliometrics, 15th International Society of Scientometrics and Informetrics Conference (ISSI)</source>
          , Istanbul, Turkey: http://ceur-ws.
          <source>org</source>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Kazi</given-names>
            <surname>Saidul</surname>
          </string-name>
          Hasan and
          <string-name>
            <given-names>Vincent</given-names>
            <surname>Ng</surname>
          </string-name>
          . \
          <article-title>Automatic keyphrase extraction: A survey of the state of the art"</article-title>
          .
          <source>In: Proceedings of the Association for Computational Linguistics (ACL)</source>
          , Baltimore, Maryland: Association for Computational Linguistics (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Anette</given-names>
            <surname>Hulth</surname>
          </string-name>
          .
          <article-title>"Improved Automatic Keyword Extraction Given More Linguistic Knowledge"</article-title>
          .
          <source>In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing. EMNLP '03</source>
          . Stroudsburg, PA, USA: Association for Computational Linguistics,
          <year>2003</year>
          , pp.
          <volume>216</volume>
          –
          <fpage>223</fpage>
          . doi:
          <volume>10</volume>
          .3115/1119355.1119383.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>