<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extracting Knowledge from Text with PIKES</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Corcoglioniti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Rospocher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio Palmero Aprosio</string-name>
          <email>aprosiog@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler-IRST</institution>
          ,
          <addr-line>Via Sommarive 18, Trento, I-38123</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this demonstration we showcase PIKES, a Semantic Role Labeling (SRL)-powered approach for Knowledge Extraction. PIKES implements a rule-based strategy that reinterprets SRL output in light of other linguistic analyses, such as dependency parsing and co-reference resolution, thus properly capturing and formalizing in RDF important linguistic aspects such as argument nominalization, frame-frame relations, and group entities.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Overview</title>
      <p>argument nominalizations e.g., from ‘Joseph Blatter, the president of FIFA [..]’, we
capture both aspects of ‘president’,2 the predicate itself and its implicit subject, thus
correctly extracting the information that the implicit president is Joseph Blatter, and
he is related to FIFA in a ‘president’ frame (NomBank president.01);3
frame-frame relations e.g., from ‘Joseph Blatter became president of FIFA in 1998’,
we relate the information from the frames ‘become’ and ‘president’ so that Joseph
Blatter, the subject of the ‘become’ frame, is coreferred to the subject of the
‘president’ frame, and a relation between the two frames is explicitly represented;4
similarly, from ‘Joseph Blatter ended up resigning from FIFA in 2015’, we correlate the
1 http://pikes.fbk.eu/
2 In Nombank (http://bit.ly/nombank) for some predicate nouns, the predicate can be its own
argument (usually ARG0). Examples include: teacher, president, director.
3 The complete output for this example and its graphical representation obtained with the on-line
demo of the tool (see Section 3) are directly accessible from here: http://bit.ly/pikes-argnorm.
4 See http://bit.ly/pikes-frmfrm.</p>
      <p>information from the frames ‘end’ and ‘resign’ so that Joseph Blatter, the subject
of the ‘end’ frame, is coreferred to the ‘president’ argument of the ‘resign’ frame,
and a relation between ‘end’ and ‘resign’ is represented;5 and,
group entities e.g., given a sentence like ‘Joseph Blatter and Joa˜o Havelange led FIFA
[..]’, we extract that two distinct entities, Joseph Blatter and Joa˜o Havelange, are
both subject of the frame ‘lead’ (while SRL tools typically annotate the whole span
of text ‘Joseph Blatter and Joa˜o Havelange’ as a single argument).6
Besides better interpreting the SRL information, our approach adopts a representation
model where all the content processed and produced in extracting knowledge – the
textual input and its metadata (e.g., author, date/time), the intermediate output produced
by NLP tools (e.g., extracted SRL frames), the final output of the system – is organized
in three interlinked layers – Text, Mentions, Instances – and exposed as RDF according
to Semantic Web best practices. Furthermore, by exploiting linking between layers, we
relate each triple extracted from text to the span(s) of text and the intermediate NLP
output from where it was derived, thus enabling a fine-grain tracking of the whole extracted
content that helps debugging and improving the knowledge extraction process.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Under the hood</title>
      <p>PIKES works in two main phases that we briefly describe below; more details on the
NLP tools used and the knowledge extraction algorithm are reported on the website.</p>
      <p>In the first phase (from Text to Mentions), the input text is processed by several NLP
tools to extract mentions, i.e., pieces of text denoting something of interest, such as an
entity or relation. Mentions represent, in a structured form, all the information needed
to extract the knowledge conveyed by the text. For instance, given the example text7
G.W. Bush and Bono are very strong supporters of the fight of HIV in Africa.</p>
      <p>Their March 2002 meeting resulted in a 5 billion dollar aid.
the span of text “G.W. Bush” corresponds to a mention that is identified by a URI
(e.g., &lt;resource uri#char=0,10&gt;) and has several attributes such as a textual extent
(‘G.W. Bush’), a position in the text (characters 0 to 10), a type (e.g., NameMention), a
possible corresponding DBpedia entity (e.g., dbpedia:George W. Bush), and so on.
All these attributes, together with the input text metadata (e.g., author, creation time),
are also exposed by PIKES as RDF. PIKES currently relies on state-of-the-art NLP
tools such as: Stanford CoreNLP8 for part-of-speech tagging, named entity recognition
and classification, temporal expression recognition and normalization, and coreference
resolution; mate-tools9 for dependency parsing and SRL; DBpedia Spotlight10 for entity
linking; and, UKB11 for word sense disambiguation (with respect to WordNet 3.0).
Furthermore, we developed a dedicated module for mapping the NLP annotations produced
by all these tools to mentions and mentions attributes expressed in RDF.
5 See http://bit.ly/pikes-frmfrm2.
6 See http://bit.ly/pikes-group.
7 Try it on PIKES: (http://bit.ly/pikes-example)
8 http://nlp.stanford.edu/software/corenlp.shtml
9 https://code.google.com/p/mate-tools/
10 http://spotlight.dbpedia.org/
11 http://ixa2.si.ehu.es/ukb/</p>
      <p>
        In the second phase (from Mentions to Instances), mentions are processed with
mapping rules that match certain mention attributes/patterns and create consequent
RDF triples. Mapping rules are formulated as SPARQL Update INSERT. . . WHERE. . .
statements that are repeatedly executed until a fixed-point is reached. Rules are allowed to
create new individuals, can invoke external code by means of custom SPARQL
functions and can access and match also data in auxiliary resources (e.g., for mapping
purposes) as well as the instance data created so far. Current rules can be organized in
six categories based on their function: (i) for creating new instances; (ii) for typing
extracted instances (i.e., generate rdf:type assertions), based on mention attributes
(e.g., WordNet synset, PropBank/NomBank roleset, NERC class); (iii) for adding
annotations (e.g., rdfs:label and foaf:name assertions) from the textual extent of the
mention; (iv) for linking (via owl:sameAs or rdfs:seeAlso assertions) an instance
and the corresponding DBpedia resource; (v) for relating (with the proper properties)
frame instances to argument instances; and, (vi) for linking (via owl:sameAs
assertions) the instances corresponding to coreferential mentions. The resulting RDF is also
post-processed materializing implicit knowledge and discarding unnecessary data. All
the processing in this second phase is performed exploiting RDFpro [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], an RDF
manipulation tool which we extended with additional plugins for SPARQL-like rule execution
and named-graph normalization. For instance, from the mention we previously
considered (&lt;resource uri#char=0,10&gt;), several triples are instantiated, e.g.,
dbpedia:George_W._Bush rdfs:label ‘‘G. W. Bush’’ ;
foaf:name ‘‘G. W. Bush’’ ;
rdf:type dbyago:HeadOfState110164747 ;
rdf:type sumo:Entity ; ...
      </p>
      <p>The input text, its mentions, and the triples extracted are related by various
properties that enable to state that a mention is part of an input text, or that a mention
expresses a triple (i.e., the triple can be derived from the mention). In the latter
example, instead of reifying the assertion, we prefer to use named graphs and keep the RDF
representation more compact. In particular, each extracted triple is placed in a named
graph that represents the set of mentions (in some cases a single mention) that express
that particular triple. Clearly, a named graph may contain many triples, meaning that
all these triples were extracted from the same mention (e.g., different type assertions on
the same instance), and a named graph may be expressed by many mentions, meaning
that all the triples in the named graph were extracted from each one of these mentions.
3</p>
    </sec>
    <sec id="sec-3">
      <title>PIKES in action</title>
      <p>PIKES is publicly accessible through an on-line demo version,12 where users can
freely test our approach on sentences of their choice. To run the demo, users just have to
type in a text, and press the submit button. Several tabs become available once PIKES
finishes processing the input text: one example, the graphical rendering of the
knowledge extracted by the tool, is shown in Figure 1. In particular: the Metadata tab reports
the RDF encoding of the metadata attached to the input document, as well as a
summary of the modules applied to extract knowledge; the Mentions tab reports the RDF
serialization of the mentions identified in the input text by NLP tools, together with
12 Accessible from http://pikes.fbk.eu. An explanatory demo video is also available.
their attributes;13 the Instances tab shows the content of the Instances layer, i.e., the
actual triples distilled from the mention information, representing the final output of the
system; these triples are also graphically rendered in the Graph tab, where nodes
represent instances, and arcs assertions on them; additional assertions (e.g., types, labels,
etc) are shown by tooltips when hovering with the mouse over any element of the graph
(e.g., the grey box in Figure 1); finally, the Hybrid tab highlights, sentence by sentence,
the mentions where they occur in the text (hovering over the annotations, mention
attributes can be accessed) as well as the graph of the corresponding instances extracted
from each single sentence (by clicking on the sentence ID).
4</p>
    </sec>
    <sec id="sec-4">
      <title>Concluding Remarks</title>
      <p>PIKES was applied to extract knowledge from the whole Simple English Wikipedia,14
consisting of 110K text documents long 219 words on average. The processing took
507 core hours (16 s per wikipedia page on average), and was completed by 16 parallel
instances of PIKES in less than 32 hours. The resulting RDF dataset (including
textual metadata, mentions and mention attributes, extracted triples) is available for
download.15 We also performed an evaluation of the output,16 obtaining an average precision
of 85.4% on a random sample of 200 triples (2 annotators, Fleiss’s kappa 0.372), which
demonstrates how PIKES can efficiently extract accurate knowledge from text.
Acknowledgements The research leading to this paper was supported by the European
Union’s 7th Framework Programme via the NewsReader Project (ICT-316404).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Corcoglioniti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rospocher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mostarda</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amadori</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Processing Billions of RDF Triples on a Single Machine using Streaming and Sorting</article-title>
          .
          <source>In: ACM SAC</source>
          <year>2015</year>
          <article-title>Proceedings 13 The Annotations tab shows the raw annotation file produced by NLP tools used in PIKES.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>14 http://simple.wikipedia.org/</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>15 http://pikes.fbk.eu/sew-rdf.html</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>16 http://pikes.fbk.eu/evaluation.html</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>