<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Semantic Metadata Generator for Web Pages Based on Keyphrase Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Dario De Nart</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Tasso</string-name>
          <email>carlo.tassog@uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dante Degl'Innocenti</string-name>
          <email>dante.deglinnocenti@spes.uniud.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence Lab, Department of Mathematics and Computer Science, University of Udine</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The annotation of documents and web pages with semantic metadata is an activity that can greatly increase the accuracy of Information Retrieval and Personalization systems, but the growing amount of text data available is too large for an extensive manual process. On the other hand, automatic keyphrase generation and wikification can significantly support this activity. In this demonstration we present a system that automatically extracts keyphrases, identifies candidate DBpedia entities, and returns as output a set of RDF triples compliant with the Opengraph and the Schema.org vocabularies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Several authors in the literature have already addressed the problem of extracting
keyphrases (herein KPs) from natural language documents and a wide range of
approaches has been proposed. (A live demo of the system can be found at
http://goo.gl/beKJu5 and can be accessed by logging in as user "guest" with password "guest".)
The authors of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] identify four types of keyphrase extraction (KPE)
strategies:
- Simple Statistical Approaches: mostly unsupervised techniques that consider
word frequency, TF-IDF, or word co-occurrence [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
- Linguistic Approaches: techniques relying on linguistic knowledge to identify
KPs. Proposed methods include lexical analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], syntactic analysis [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
and discourse analysis [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
- Machine Learning Approaches: techniques based on machine learning
algorithms such as Naive Bayes classifiers and SVM. Systems such as KEA [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
LAKE [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and GenEx [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] belong to this category.
- Other Approaches: other strategies exist which do not fit into one of the
above categories, mostly hybrid approaches combining two or more of the
above techniques. Among others, heuristic approaches based on
knowledge-based criteria [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have been proposed.
      </p>
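      <p>As a minimal sketch of the simple statistical family listed above, TF-IDF weighting can be written in a few lines. This is only an illustration under stated assumptions (whitespace tokenisation, length-normalised term frequency, idf = log(N / df), a toy three-document corpus); it is not the technique used by any of the cited systems, and the function name tf_idf is hypothetical.</p>
      <preformatted>
```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    `documents` is a list of token lists. Term frequency is normalised
    by document length and idf is log(N / df); both are assumptions
    made for this sketch.
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus, invented for illustration only.
docs = [
    "semantic metadata for web pages".split(),
    "keyphrase extraction for web documents".split(),
    "keyphrase extraction with semantic features".split(),
]
weights = tf_idf(docs)
# The two highest-weighted terms of the first document.
top_terms = sorted(weights[0], key=weights[0].get, reverse=True)[:2]
```
      </preformatted>
      <p>Terms that occur in a single document of the corpus receive the highest weight, which is why purely statistical methods need a reference corpus and cannot exploit linguistic or external knowledge.</p>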
      <p>
        Automatic semantic data generation from natural language text has already
been investigated as well and several knowledge extraction systems already exist
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], such as OpenCalais<sup>2</sup>, AIDA<sup>3</sup>, Apache Stanbol<sup>4</sup>, and NERD<sup>5</sup>.
      </p>
    </sec>
    <sec id="sec-2">
      <title>System Overview</title>
      <p>The proposed system includes three main modules: a Domain Independent KPE
module (herein DIKPE), a KP Inference module (KPIM), and an RDF Triple
Builder (RTB). Our KPE technique exploits a knowledge-based strategy.
After a candidate KP generation stage, candidate KPs are selected according to
various features: statistical (such as word frequency), linguistic (part of
speech analysis), meta-knowledge based (life span in the text, first and last
occurrence, and presence of specific tags), and external-knowledge based (existence
of a match with a DBpedia entity). These features correspond to different
kinds of knowledge that are involved in the process of recognizing relevant
entities in a text. Most of them are language-independent, and the modular
architecture of DIKPE allows language-dependent components to be replaced
easily, so the framework can be adapted to other languages. English and
Italian are currently supported.</p>
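      <p>The kinds of features described above can be illustrated with a small sketch. It assumes whitespace tokens, exact phrase matching, a plain set of known entity labels standing in for a DBpedia lookup, and an arbitrary linear scoring function; the names CandidateFeatures, extract_features, and score are hypothetical and the weights are not taken from DIKPE. Linguistic features such as part of speech are omitted.</p>
      <preformatted>
```python
from dataclasses import dataclass


@dataclass
class CandidateFeatures:
    phrase: str
    frequency: int            # statistical knowledge
    first_pos: float          # relative position of the first occurrence (0..1)
    last_pos: float           # relative position of the last occurrence (0..1)
    life_span: float          # last_pos - first_pos
    in_knowledge_base: bool   # stand-in for a DBpedia match


def extract_features(tokens, phrase, known_entities):
    """Compute the features of one candidate phrase over a token list."""
    words = phrase.lower().split()
    n = len(words)
    lowered = [t.lower() for t in tokens]
    # Start index of every occurrence of the phrase in the text.
    hits = [i for i in range(len(lowered) - n + 1) if lowered[i:i + n] == words]
    if not hits:
        return None
    first = hits[0] / len(lowered)
    last = hits[-1] / len(lowered)
    return CandidateFeatures(
        phrase, len(hits), first, last, last - first,
        phrase.lower() in known_entities,
    )


def score(f, weights=(1.0, 1.0, 1.0, 2.0)):
    """Illustrative linear combination; the weights are arbitrary."""
    w_freq, w_early, w_span, w_kb = weights
    return (w_freq * f.frequency
            + w_early * (1.0 - f.first_pos)
            + w_span * f.life_span
            + w_kb * float(f.in_knowledge_base))


# Invented example text and candidates.
text = ("Keyphrase extraction supports semantic metadata generation . "
        "Semantic metadata improves retrieval and keyphrase extraction quality").split()
known = {"semantic metadata"}
candidates = ["keyphrase extraction", "semantic metadata", "retrieval"]
features = [extract_features(text, c, known) for c in candidates]
ranked = sorted(features, key=score, reverse=True)
```
      </preformatted>
      <p>In this toy setting the knowledge-base match lifts "semantic metadata" above "keyphrase extraction" even though the latter has the longer life span, which shows how the external-knowledge feature can interact with the others.</p>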
      <p>The result of this KPE phase is a set of relevant KPs including DBpedia matches,
hence providing a partial wikification of the text. Such knowledge is used by the
KPIM for a further step of KP generation, in which a new set of potentially
relevant KPs not included in the text is inferred by exploiting the link structure of
DBpedia. Properties such as type and subject are considered in order to discover
concepts possibly related to the text. Finally, the extracted and the inferred KPs
are used by the RTB to build a set of Opengraph and Schema.org triples. Because
the adopted vocabularies are simple, this task is performed in a rule-based
way: the RTB treats the RDF fragment to be generated as a
template to be filled with the data provided by the DIKPE and the KPIM.</p>
      <p>2 http://www.opencalais.com/
3 www.mpi-inf.mpg.de/yago-naga/aida/
4 https://stanbol.apache.org/
5 http://nerd.eurecom.fr/</p>
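      <p>A template-filling triple builder of the kind described above might look like the following sketch. The choice of properties (og:title, og:url, schema:name, schema:keywords, schema:about) is our assumption, since the text does not list the properties emitted by the RTB; the function build_triples and the example page are hypothetical, and triples are returned as plain tuples of full URIs and literals rather than serialized RDF.</p>
      <preformatted>
```python
OG = "http://ogp.me/ns#"
SCHEMA = "http://schema.org/"
DBPEDIA = "http://dbpedia.org/resource/"


def build_triples(page_url, title, keyphrases, entities):
    """Fill a fixed template and return (subject, predicate, object) tuples.

    `keyphrases` are extracted or inferred KPs; `entities` are DBpedia
    entity labels. The property selection is an assumption of this sketch.
    """
    # Fixed part of the template: page-level metadata.
    triples = [
        (page_url, OG + "title", title),
        (page_url, OG + "url", page_url),
        (page_url, SCHEMA + "name", title),
    ]
    # One schema:keywords triple per keyphrase.
    triples += [(page_url, SCHEMA + "keywords", kp) for kp in keyphrases]
    # One schema:about triple per linked DBpedia entity.
    triples += [
        (page_url, SCHEMA + "about", DBPEDIA + label.replace(" ", "_"))
        for label in entities
    ]
    return triples


triples = build_triples(
    "http://example.org/page",
    "A page",
    ["semantic web", "metadata"],
    ["Semantic Web"],
)
```
      </preformatted>
      <p>Because the template is fixed, supporting a richer vocabulary amounts to adding rules of the same shape rather than changing the extraction modules.</p>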
    </sec>
    <sec id="sec-3">
      <title>Evaluation and Conclusions</title>
      <p>Several experiments have been performed to support and validate our approach.
Since the system is at an early stage of development and KP
generation is its critical component, testing efforts focused on
assessing the quality of the generated KPs. The DIKPE module was benchmarked
against the KEA algorithm on a set of 215 English documents labelled with
keyphrases generated by the authors and by additional experts. For each
document, the KP sets returned by the two systems were matched against
the set of human-generated KPs. Each time a machine-generated KP matched
a human-generated KP, it was considered a correct KP; the number of correct
KPs generated for each document was then averaged over the whole data set.
Various machine-generated KP set sizes were tested. As shown in Table 1, the
DIKPE system significantly outperformed the KEA baseline. A user evaluation
of the perceived quality of the generated KPs was also performed: a set of 50 articles
was annotated and a pool of experts of various ages and genders was asked to
assess the quality of the generated metadata. Table 2 shows the results of the user
evaluation.</p>
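      <p>The matching procedure described above can be sketched as follows. The sketch assumes that a match means string equality after lowercasing and whitespace normalisation, which the text does not specify; the function average_correct and the two-document example are hypothetical and are not the evaluation data.</p>
      <preformatted>
```python
def average_correct(generated, gold, k):
    """Mean number of top-k generated KPs that match a human-assigned KP.

    `generated` maps a document id to a ranked list of KPs and `gold`
    maps the same ids to the human-generated KPs. Exact matching after
    normalisation is an assumption of this sketch.
    """
    def norm(kp):
        return " ".join(kp.lower().split())

    total = 0
    for doc_id, ranked in generated.items():
        gold_set = {norm(kp) for kp in gold[doc_id]}
        top_k = {norm(kp) for kp in ranked[:k]}
        total += len(top_k.intersection(gold_set))
    return total / len(generated)


# Invented example, not taken from the benchmark.
generated = {
    "d1": ["semantic web", "metadata", "web page"],
    "d2": ["keyphrase extraction", "DBpedia", "ontology"],
}
gold = {
    "d1": ["Semantic Web", "RDF"],
    "d2": ["keyphrase extraction", "dbpedia"],
}
avg_at_1 = average_correct(generated, gold, 1)
avg_at_2 = average_correct(generated, gold, 2)
```
      </preformatted>
      <p>Calling the function with several values of k corresponds to testing various machine-generated KP set sizes.</p>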
      <p>Evaluation is, however, still ongoing: an extensive benchmark against more
complex knowledge extraction systems is planned, as well as further enhancements
such as the inclusion of more complex vocabularies and integration with the Apache
Stanbol framework.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Barker</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cornacchia</surname>
          </string-name>
          , N.:
          <article-title>Using noun phrase heads to extract document keyphrases</article-title>
          .
          <source>In: Advances in Artificial Intelligence</source>
          , pp.
          <volume>40</volume>
          -
          <fpage>52</fpage>
          . Springer (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eckert</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meusel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Muhleisen, H.,
          <string-name>
            <surname>Schuhmacher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Volker, J.:
          <article-title>Deployment of rdfa, microdata, and microformats on the web{a quantitative analysis</article-title>
          .
          <source>In: The Semantic Web{ISWC</source>
          <year>2013</year>
          , pp.
          <volume>17</volume>
          -
          <fpage>32</fpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>D'Avanzo</surname>
          </string-name>
          , E.,
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vallin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Keyphrase extraction for summarization purposes: The LAKE system at DUC-2004</article-title>
          .
          <source>In: Proceedings of the 2004 document understanding conference</source>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fagan</surname>
          </string-name>
          , J.:
          <article-title>Automatic phrase indexing for document retrieval</article-title>
          .
          <source>In: Proceedings of the 10th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . pp.
          <volume>91</volume>
          -
          <fpage>101</fpage>
          . SIGIR '87, ACM, New York, NY, USA (
          <year>1987</year>
          ), http://doi.acm.org/10.1145/42005.42016
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gangemi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A comparison of knowledge extraction tools for the semantic web</article-title>
          .
          <source>In: The Semantic Web: Semantics and Big Data</source>
          , pp.
          <volume>351</volume>
          -
          <fpage>366</fpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Krapivin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marchese</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yadrantsau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Unsupervised key-phrases extraction from scientific papers using domain and linguistic knowledge</article-title>
          .
          <source>In: Digital Information Management</source>
          ,
          <year>2008</year>
          .
          <article-title>ICDIM 2008</article-title>
          . Third International Conference on. pp.
          <volume>105</volume>
          -
          <issue>112</issue>
          (Nov
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Clustering to find exemplar terms for keyphrase extraction</article-title>
          .
          <source>In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1 -</source>
          Volume 1. pp.
          <volume>257</volume>
          -
          <fpage>266</fpage>
          . EMNLP '09, Association for Computational Linguistics, Stroudsburg, PA, USA (
          <year>2009</year>
          ), http://dl.acm.org/citation.cfm?id=1699510.1699544
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Matsuo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ishizuka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Keyword extraction from a single document using word co-occurrence statistical information</article-title>
          .
          <source>International Journal on Artificial Intelligence Tools</source>
          <volume>13</volume>
          (
          <issue>01</issue>
          ),
          <volume>157</volume>
          -
          <fpage>169</fpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Turney</surname>
          </string-name>
          , P.D.:
          <article-title>Learning algorithms for keyphrase extraction</article-title>
          .
          <source>Information Retrieval</source>
          <volume>2</volume>
          (
          <issue>4</issue>
          ),
          <volume>303</volume>
          -
          <fpage>336</fpage>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paynter</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutwin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nevill-Manning</surname>
            ,
            <given-names>C.G.</given-names>
          </string-name>
          :
          <article-title>Kea: Practical automatic keyphrase extraction</article-title>
          .
          <source>In: Proceedings of the fourth ACM conference on Digital libraries</source>
          . pp.
          <volume>254</volume>
          -
          <fpage>255</fpage>
          . ACM (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Automatic keyword extraction from documents using conditional random fields</article-title>
          .
          <source>Journal of Computational Information Systems</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ),
          <volume>1169</volume>
          -
          <fpage>1180</fpage>
          (
          <year>2008</year>
          ), http://eprints.rclis.org/handle/10760/12305
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>