<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Introducing Distiller: a Lightweight Framework for Knowledge Extraction and Filtering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dario De Nart</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dante Degl'Innocenti</string-name>
          <email>dante.deglinnocenti@spes.uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Tasso</string-name>
          <email>carlo.tassog@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence Lab, Department of Mathematics and Computer Science, University of Udine</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Semantic content analysis is an activity that can greatly support a broad range of user modelling applications. Several automatic tools are available; however, such systems usually offer few tuning options and do not support integration with different systems. Personalization applications, on the other hand, are becoming increasingly multilingual and cross-domain. In this paper we present a novel framework for Knowledge Extraction (KE), whose main goal is to support the development of new strategies and technologies and to ease the integration of existing ones.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Knowledge Overload: long texts, such as scientific papers, may include many
named entities, but not all of them are equally relevant within the text.
State-of-the-art KE systems currently provide Named Entity Recognition, but they
neither filter relevant entities nor include relevance measures.</p>
      <p>In this paper we introduce Distiller, a KE framework that aims to overcome
these limitations and to allow the integration of heterogeneous KE technologies.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Named entity recognition and automatic semantic data generation from
natural language text have already been investigated, and several knowledge extraction
systems exist [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], such as OpenCalais<sup>2</sup>, Apache Stanbol<sup>3</sup>, and TagMe [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] an ensemble learning strategy that raises the accuracy of named entity
identification is presented. Several authors have addressed the problem of
filtering document information by identifying keyphrases (herein KPs), and a
wide range of approaches has been proposed. The authors
of [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] identify four types of KP extraction strategies:
        <list list-type="bullet">
          <list-item>
            <p>Simple Statistical Approaches: mostly unsupervised techniques that consider
word frequency, TF-IDF, or word co-occurrence [<xref ref-type="bibr" rid="ref7">7</xref>].</p>
          </list-item>
          <list-item>
            <p>Linguistic Approaches: techniques that rely on linguistic knowledge to
identify KPs. Proposed methods include lexical analysis, syntactic analysis, and
discourse analysis [<xref ref-type="bibr" rid="ref5">5</xref>].</p>
          </list-item>
          <list-item>
            <p>Machine Learning Approaches: techniques based on machine learning
algorithms such as Naive Bayes classifiers and SVM. Systems such as KEA [<xref ref-type="bibr" rid="ref9">9</xref>]
belong to this category.</p>
          </list-item>
          <list-item>
            <p>Other Approaches: strategies that do not fit into any of the
above categories, mostly hybrid approaches combining two or more of the
above techniques [<xref ref-type="bibr" rid="ref1">1</xref>]. Among others, heuristic approaches based on
knowledge-based criteria [<xref ref-type="bibr" rid="ref6">6</xref>] have been proposed.</p>
          </list-item>
        </list>
      </p>
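      <p>To make the first category concrete, the following minimal sketch scores the terms of one document with TF-IDF. It is an illustration only and is not taken from any of the cited systems: the class name TfIdfSketch, the naive tokenizer, and the smoothed inverse document frequency, ln((1 + N) / (1 + df)) + 1, are assumptions made for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a simple statistical approach; not Distiller code.
class TfIdfSketch {

    // Lowercases the text and splits it on anything that is not an ASCII letter.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Scores each term of the target document as tf * (ln((1 + N) / (1 + df)) + 1),
    // where N is the corpus size and df the number of documents containing the term.
    static Map<String, Double> tfIdf(String target, List<String> corpus) {
        List<String> terms = tokenize(target);
        Map<String, Integer> counts = new HashMap<>();
        for (String t : terms) {
            counts.merge(t, 1, Integer::sum);
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int df = 0;
            for (String doc : corpus) {
                if (new HashSet<>(tokenize(doc)).contains(e.getKey())) {
                    df++;
                }
            }
            double tf = (double) e.getValue() / terms.size();
            double idf = Math.log((1.0 + corpus.size()) / (1.0 + df)) + 1.0;
            scores.put(e.getKey(), tf * idf);
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of(
                "knowledge extraction from text",
                "the cat sat on the mat",
                "the dog chased the cat");
        Map<String, Double> scores = tfIdf(corpus.get(0), corpus);
        scores.forEach((term, score) -> System.out.println(term + " " + score));
    }
}
```
]]></preformat>
      <p>Terms that are frequent in the target document but rare in the rest of the corpus obtain the highest scores. A realistic extractor would at least add stop-word removal and multi-word candidates.</p>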
    </sec>
    <sec id="sec-3">
      <title>System Overview</title>
      <p>
        To overcome the shortcomings of state-of-the-art KE systems, we
extended the approach presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and formalized it as a framework named
Distiller, whose main aim is to support research and prototyping activities by
providing an environment for building testbed systems and integrating existing
systems. The guiding principle of the framework design is that several
different types of knowledge are involved in the KE process and should be clearly
separated, so that systems can cope with multilinguality and multi-domain
issues. We consider four main types of knowledge: Statistical, Linguistic,
External (i.e. coming from outside the text, such as knowledge extracted from ontologies),
and Heuristic knowledge. Linguistic knowledge is language dependent, Heuristic
knowledge is domain dependent, and External knowledge is both domain and
language dependent. At a more practical level, this principle implies that
different types of knowledge must reside in distinct modules; for instance, statistical
and linguistic analysis must be handled by different modules.
2 http://www.opencalais.com/
3 https://stanbol.apache.org/
      </p>
      <p>Distiller is organized as a series of modules, each oriented to a single type
of knowledge, and its workflow consists of four phases: Concept Unit Splitting, Annotation,
Candidate Generation, and Filtering, as shown in Figure 1. In the first phase the
text is split into Concept Units, i.e. logical blocks such as chapters, paragraphs,
or sentences. The framework allows concept units in different
languages to coexist inside a document. The Annotation phase consists of enriching the
text with information such as POS tags, stems, lemmas, or links to entities
from external knowledge sources (such as DBpedia). This phase introduces new
knowledge into the text, and several different annotators can contribute, enriching
the text with different kinds of knowledge, but mostly with External knowledge
that may come from heterogeneous sources. Existing KE tools, such as TagMe,
can be integrated into the framework as annotators. The Candidate Generation
phase identifies all the candidate entities and/or concepts of interest in the text,
exploiting the annotations provided in the previous step, and internally represents
them as KPs with an attached set of annotations. Finally, the Filtering phase
computes a relevance score for each candidate concept; depending on that score, the
candidate is either returned as output or hidden. The Filtering phase, like the Candidate Generation
one, may exploit different types of knowledge embedded in annotations and
combine them according to the needs of the applications that will eventually use
the extracted knowledge.</p>
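      <p>The four phases described above can be sketched as a chain of pluggable modules. The code below is a hypothetical illustration, not Distiller's actual API: every name (PipelineSketch, ConceptUnit, Splitter, Annotator, Generator, Scorer) is invented for the example, and the toy modules (sentence splitting, a capitalization tag, a frequency threshold) merely stand in for real linguistic, external, and statistical knowledge.</p>
      <preformat preformat-type="code"><![CDATA[
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the four-phase workflow; none of these names are Distiller's API.
class PipelineSketch {

    // A block of text plus the annotations attached to each of its tokens.
    static class ConceptUnit {
        final String[] tokens;
        final Map<String, List<String>> annotations = new HashMap<>();

        ConceptUnit(String text) {
            this.tokens = text.trim().split("\\s+");
        }

        void tag(String token, String label) {
            annotations.computeIfAbsent(token, k -> new ArrayList<>()).add(label);
        }
    }

    interface Splitter { List<ConceptUnit> split(String document); }
    interface Annotator { void annotate(ConceptUnit unit); }
    interface Generator { List<String> candidates(ConceptUnit unit); }
    interface Scorer { double relevance(String candidate, String document); }

    static List<String> extract(String document, Splitter splitter, List<Annotator> annotators,
                                Generator generator, Scorer scorer, double threshold) {
        List<String> kept = new ArrayList<>();
        for (ConceptUnit unit : splitter.split(document)) {        // 1. Concept Unit Splitting
            for (Annotator a : annotators) {
                a.annotate(unit);                                  // 2. Annotation
            }
            for (String candidate : generator.candidates(unit)) {  // 3. Candidate Generation
                boolean relevant = scorer.relevance(candidate, document) >= threshold;
                if (relevant && !kept.contains(candidate)) {       // 4. Filtering
                    kept.add(candidate);
                }
            }
        }
        return kept;
    }

    // Toy modules, wired together for demonstration only.
    static List<String> demo(String document) {
        Splitter bySentence = doc -> {
            List<ConceptUnit> units = new ArrayList<>();
            for (String s : doc.split("\\.")) {
                if (!s.trim().isEmpty()) {
                    units.add(new ConceptUnit(s));
                }
            }
            return units;
        };
        Annotator capitalized = unit -> {
            for (String t : unit.tokens) {
                if (Character.isUpperCase(t.charAt(0))) {
                    unit.tag(t, "CAP");
                }
            }
        };
        Generator fromTags = unit -> {
            List<String> out = new ArrayList<>();
            for (String t : unit.tokens) {
                if (unit.annotations.containsKey(t)) {
                    out.add(t);
                }
            }
            return out;
        };
        Scorer byFrequency = (candidate, doc) -> {
            int n = 0;
            for (String t : doc.split("[\\s.]+")) {
                if (t.equals(candidate)) {
                    n++;
                }
            }
            return (double) n;
        };
        return extract(document, bySentence, List.of(capitalized), fromTags, byFrequency, 2.0);
    }

    public static void main(String[] args) {
        System.out.println(
                demo("Udine hosts a lab. The lab built Distiller. Distiller extracts keyphrases."));
    }
}
```
]]></preformat>
      <p>Running main prints [Distiller]: "Udine" and "The" are generated as candidates but occur only once, so the filter hides them. Each interface corresponds to one phase, which keeps the different kinds of knowledge in separate modules.</p>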
      <p>Distiller is implemented in Java using the Dependency Injection pattern,
which allows users to easily switch between different modules and configurations.
Default implementations for all the modules described above are provided with
the framework<sup>4</sup>.
4 A sample application built with the default modules is showcased at
ailab.uniud.it:8080/distiller</p>
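      <p>The text does not state which injection mechanism or container Distiller relies on, so the following sketch shows only the general idea using plain constructor injection. All names (InjectionSketch, Stemmer, EnglishStemmer, ItalianStemmer, Extractor) and the toy stemming rules are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```java
import java.util.Locale;

// Hypothetical sketch of constructor-based dependency injection; not Distiller's API.
class InjectionSketch {

    // Abstraction the extractor depends on; implementations can be swapped freely.
    interface Stemmer {
        String stem(String word);
    }

    // Toy English module: strips a trailing "s".
    static class EnglishStemmer implements Stemmer {
        public String stem(String word) {
            String w = word.toLowerCase(Locale.ROOT);
            return w.length() > 1 && w.endsWith("s") ? w.substring(0, w.length() - 1) : w;
        }
    }

    // Toy Italian module: strips a trailing vowel.
    static class ItalianStemmer implements Stemmer {
        public String stem(String word) {
            String w = word.toLowerCase(Locale.ROOT);
            boolean endsWithVowel = w.length() > 1 && "aeio".indexOf(w.charAt(w.length() - 1)) >= 0;
            return endsWithVowel ? w.substring(0, w.length() - 1) : w;
        }
    }

    // The extractor never instantiates its stemmer; it receives one from outside.
    static class Extractor {
        private final Stemmer stemmer;

        Extractor(Stemmer stemmer) {
            this.stemmer = stemmer;
        }

        String normalize(String word) {
            return stemmer.stem(word);
        }
    }

    public static void main(String[] args) {
        // Switching configuration only changes the wiring, not the Extractor class.
        System.out.println(new Extractor(new EnglishStemmer()).normalize("Papers"));
        System.out.println(new Extractor(new ItalianStemmer()).normalize("Articoli"));
    }
}
```
]]></preformat>
      <p>Because Extractor depends only on the Stemmer interface, choosing a different language module is a change in the wiring code. A container-based setup would move that wiring into configuration.</p>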
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>With respect to the three issues of KE presented in Section 1, Distiller allows the
development of applications able to overcome such shortcomings. The issue of
multilinguality is eased by the possibility of specifying a wide array of annotators
and dynamically linking them at runtime depending on the text language. The
issue of Knowledge Source Completeness is eased by the possibility of
integrating heterogeneous knowledge sources as different annotators and of implementing
annotators that generate URIs on the fly. Finally, the issue of Knowledge Overload
is eased by the presence of a filtering phase in which entities are evaluated
with respect to their relevance in the text.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>De Nart</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tasso</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A domain independent double layered approach to keyphrase generation</article-title>
          .
          <source>In: WEBIST 2014 - Proceedings of the 10th International Conference on Web Information Systems and Technologies</source>
          . pp.
          <volume>305</volume>
          –
          <fpage>312</fpage>
          .
          <string-name>
            <given-names>SCITEPRESS</given-names>
            <surname>Science and Technology Publications</surname>
          </string-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Degl'Innocenti</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Nart</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tasso</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A new multi-lingual knowledge-base approach to keyphrase extraction for the Italian language</article-title>
          .
          <source>In: Proceedings of the 6th International Conference on Knowledge Discovery and Information Retrieval</source>
          . pp.
          <volume>78</volume>
          –
          <fpage>85</fpage>
          .
          <string-name>
            <surname>SciTePress</surname>
          </string-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ferragina</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scaiella</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          : Tagme:
          <article-title>On-the-fly annotation of short text fragments (by wikipedia entities)</article-title>
          .
          <source>In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management</source>
          . pp.
          <volume>1625</volume>
          –
          <fpage>1628</fpage>
          . CIKM '10,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          , New York, NY, USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gangemi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A comparison of knowledge extraction tools for the semantic web</article-title>
          .
          <source>In: The Semantic Web: Semantics and Big Data</source>
          , pp.
          <volume>351</volume>
          –
          <fpage>366</fpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Krapivin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marchese</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yadrantsau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Unsupervised key-phrases extraction from scientific papers using domain and linguistic knowledge</article-title>
          .
          <source>In: Digital Information Management</source>
          ,
          <year>2008</year>
          .
          <article-title>ICDIM 2008</article-title>
          . Third International Conference on. pp.
          <volume>105</volume>
          –
          <issue>112</issue>
          (Nov
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Clustering to find exemplar terms for keyphrase extraction</article-title>
          .
          <source>In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing:</source>
          Volume
          <volume>1</volume>
          . pp.
          <volume>257</volume>
          –
          <fpage>266</fpage>
          . EMNLP '
          <volume>09</volume>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Matsuo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ishizuka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Keyword extraction from a single document using word co-occurrence statistical information</article-title>
          .
          <source>International Journal on Artificial Intelligence Tools</source>
          <volume>13</volume>
          (
          <issue>01</issue>
          ),
          <volume>157</volume>
          –
          <fpage>169</fpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Speck</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Ngonga</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <surname>A.C.</surname>
          </string-name>
          :
          <article-title>Ensemble learning for named entity recognition</article-title>
          .
          <source>In: The Semantic Web – ISWC 2014, Lecture Notes in Computer Science</source>
          , vol.
          <volume>8796</volume>
          , pp.
          <volume>519</volume>
          –
          <fpage>534</fpage>
          . Springer International Publishing (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paynter</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutwin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nevill-Manning</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          :
          <article-title>Kea: Practical automatic keyphrase extraction</article-title>
          .
          <source>In: Proceedings of the fourth ACM conference on Digital libraries</source>
          . pp.
          <volume>254</volume>
          –
          <fpage>255</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Automatic keyword extraction from documents using conditional random fields</article-title>
          .
          <source>Journal of Computational Information Systems</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ),
          <volume>1169</volume>
          –
          <fpage>1180</fpage>
          (
          <year>2008</year>
          ), http://eprints.rclis.org/handle/10760/12305
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>