<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lexicalizing DBpedia with Realization Enabled Ensemble Architecture: RealTextlex2 Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rivindu Perera</string-name>
          <email>rperera@aut.ac.nz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Parma Nand</string-name>
          <email>pnand@aut.ac.nz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gisela Klette</string-name>
          <email>gklette@aut.ac.nz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer and Mathematical Sciences, Auckland University of Technology</institution>
          ,
          <country country="NZ">New Zealand</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>DBpedia encodes massive amounts of open domain knowledge and is growing by accumulating more triples at the same rate as Wikipedia. However, applications often require natural language formulations of these triples to present the information as natural text. The RealTextlex2 framework offers a scalable platform to transform these triples into natural language sentences using lexicalization patterns. The framework has evolved from its previous version (RealTextlex) and comprises four lexicalization pattern mining modules which derive patterns from a training triple collection. These patterns can then be applied to new triples, provided that they satisfy a defined set of constraints.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>DBpedia has become a central hub for applications searching for information
on the web. Since this information is provided in structured form as triples,
applications require a natural language formulation of these triples. For instance,
an application that needs to provide biographies would need to transform a
selected set of triples into natural language in order to present it as natural
text. This approach gives content owners more freedom to concentrate on the
actual content rather than using naive techniques to retrieve content from another
unstructured text resource using summarization or other approaches.</p>
      <p>
        Transforming triple-like meaning representations into natural language is
termed lexicalization, a subtask of Natural Language Generation. The RealTextlex2
(see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for RealTextlex) approach to lexicalization is based on an ensemble
architecture comprising four pattern mining modules. Three of them are based
on specially crafted lexicons, and the fourth extracts patterns from unstructured
text using Open Information Extraction (OpenIE) and makes them cohesive so
that they can be generalized. This is a completely different approach compared
to the available Linked Data lexicalization platforms: the corpus-based approach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
which extracts bare typed-dependency paths as patterns, and LOD-DEF [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
which substitutes the triple subject and object in a sentence to form a pattern.
In our approach, a lexicalization pattern is itself defined as a triple structure
in which the expressions S? and O? denote the subject and object respectively. For
example, the pattern ⟨S?, was born in, O?⟩<sub>L</sub> can be used to lexicalize the
triple ⟨Steve Jobs, birthDate, 1955-02-24⟩<sub>T</sub>. A video demonstration is
available at https://vimeo.com/173608664.
      </p>
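      <p>As an illustration of how such a pattern is applied, the following minimal Python sketch substitutes the S? and O? placeholders with the subject and object of a triple. The names Triple, Pattern, and lexicalize are hypothetical and chosen for this example only; the sketch is not the framework's implementation.</p>
      <preformat><![CDATA[
```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class Pattern(NamedTuple):
    # Each slot holds literal text or one of the placeholders "S?" / "O?".
    left: str
    middle: str
    right: str


def lexicalize(pattern: Pattern, triple: Triple) -> str:
    """Fill the S? and O? placeholders of a pattern with the triple's values."""
    slots = {"S?": triple.subject, "O?": triple.obj}
    return " ".join(slots.get(part, part) for part in pattern)


triple = Triple("Steve Jobs", "birthDate", "1955-02-24")
pattern = Pattern("S?", "was born in", "O?")
sentence = lexicalize(pattern, triple)
```
]]></preformat>
      <p>For the example triple and pattern from the text, the sketch yields the sentence "Steve Jobs was born in 1955-02-24".</p>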
      <p>The rest of the paper provides details on the framework, and all features
presented herein will be part of the demonstration.</p>
    </sec>
    <sec id="sec-2">
      <title>Demonstration</title>
      <p>For the demonstration we use the Java client application. Although it presents
an interface similar to that of the previous version, the application layer has been
redeveloped to incorporate various improvements. These are discussed in Section 2.2.</p>
      <sec id="sec-2-1">
        <title>Datasets</title>
        <p>For the purpose of this demonstration we focus on seven randomly selected
ontology classes: Office Holder, Educational Institute, Mountain,
Basketball Player, Country, City, and Actor.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Workflow</title>
        <p>The framework is based on four pattern mining modules to generate
lexicalization patterns.</p>
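        <p>The ensemble can be pictured as a list of modules that each either return a pattern for a triple or decline. The sketch below, a simplified assumption rather than the framework's code, tries the modules in priority order and stops at the first match, which is the behaviour described for the pattern search step later in this section. The two module functions are placeholder stubs with made-up coverage, and all names are hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import Callable, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]    # (subject, predicate, object)
Pattern = Tuple[str, str, str]   # e.g. ("S?", "is directed by", "O?")
Module = Callable[[Triple], Optional[Pattern]]


def first_module(triple: Triple) -> Optional[Pattern]:
    # Placeholder stub: a high-priority module that knows one predicate only.
    if triple[1] == "director":
        return ("S?", "is directed by", "O?")
    return None


def second_module(triple: Triple) -> Optional[Pattern]:
    # Placeholder stub standing in for any lower-priority module.
    if triple[1] == "country":
        return ("S?", "is in", "O?")
    return None


def search_pattern(triple: Triple, modules: Sequence[Module]) -> Optional[Pattern]:
    """Try the modules in priority order and stop at the first match."""
    for module in modules:
        pattern = module(triple)
        if pattern is not None:
            return pattern  # the remaining modules are not executed
    return None


modules = [first_module, second_module]
found = search_pattern(("Now You See Me", "director", "Jon M. Chu"), modules)
missing = search_pattern(("Steve Jobs", "birthDate", "1955-02-24"), modules)
```
]]></preformat>
        <p>Returning None when no module matches keeps the "no pattern found" case explicit, so a caller can decide how to handle triples that remain without a lexicalization.</p>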
        <p>Occupational Metonym Patterns. Occupational metonyms are used to
identify a person based on his or her occupation. In the majority of cases these
are -er nominalized verbs (e.g., director, publisher, designer). DBpedia uses
occupational metonyms as predicates in multiple scenarios. If such a predicate is
used, then the triple can be lexicalized using the base verb of the -er
nominalized verb. We have developed a lexicon of such -er nominalized occupational
metonyms and their associated patterns. For example, for a triple such as ⟨Now You
See Me, director, Jon M. Chu⟩<sub>T</sub>, we can use the pattern ⟨S?, is directed by, O?⟩<sub>L</sub>,
which is associated with the occupational metonym "director".
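</p>
        <p>A lexicon lookup of this kind could be sketched as follows. The three lexicon entries and the names METONYM_LEXICON, metonym_pattern, and realize are assumptions made for illustration; the actual lexicon of the framework is not reproduced here.</p>
        <preformat><![CDATA[
```python
from typing import Dict, Optional, Tuple

Pattern = Tuple[str, str, str]

# Hypothetical miniature lexicon: occupational metonym -> past participle
# of the base verb behind the -er nominalization.
METONYM_LEXICON: Dict[str, str] = {
    "director": "directed",
    "publisher": "published",
    "designer": "designed",
}


def metonym_pattern(predicate: str) -> Optional[Pattern]:
    """Return a passive pattern for a known occupational metonym, else None."""
    participle = METONYM_LEXICON.get(predicate.lower())
    if participle is None:
        return None
    return ("S?", f"is {participle} by", "O?")


def realize(pattern: Pattern, subject: str, obj: str) -> str:
    """Substitute the S? and O? placeholders to obtain a sentence."""
    slots = {"S?": subject, "O?": obj}
    return " ".join(slots.get(part, part) for part in pattern)


pattern = metonym_pattern("director")
sentence = realize(pattern, "Now You See Me", "Jon M. Chu")
```
]]></preformat>
        <p>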
Context Free Grammar Patterns. Context Free Grammar (CFG) is a
bidirectional grammar formalism which helps both to understand and to generate
language. This research uses only the CFG rule S → NP VP NP, where S
denotes a sentence and NP and VP represent a noun phrase and a verb phrase
respectively. Based on this CFG rule, we define the pattern ⟨S?, P?, O?⟩<sub>L</sub> for all
triples which satisfy two constraints: first, the triple predicate should be a
verb, and second, the verb should have an NP VP NP frame in VerbNet.</p>
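        <p>The two constraints can be sketched as a simple check. The sketch assumes a small in-memory table, VERB_FRAMES, as a stand-in for a real VerbNet lookup. The table is keyed by the predicate's surface form for brevity, and both its entries and the example triple are illustrative and not taken from VerbNet or DBpedia.</p>
        <preformat><![CDATA[
```python
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[str, str, str]

# Hypothetical stand-in for VerbNet: verb -> set of syntactic frames.
VERB_FRAMES: Dict[str, Set[str]] = {
    "owns": {"NP VP NP"},
    "sleeps": {"NP VP"},
}

CFG_PATTERN: Pattern = ("S?", "P?", "O?")


def cfg_pattern(predicate: str) -> Optional[Pattern]:
    """Return the generic pattern if the predicate meets both constraints."""
    frames = VERB_FRAMES.get(predicate)
    if frames is None:              # constraint 1: predicate must be a known verb
        return None
    if "NP VP NP" not in frames:    # constraint 2: verb must have the NP VP NP frame
        return None
    return CFG_PATTERN


def realize(pattern: Pattern, triple: Triple) -> str:
    """Substitute S?, P? and O? with the parts of the triple."""
    subject, predicate, obj = triple
    slots = {"S?": subject, "P?": predicate, "O?": obj}
    return " ".join(slots.get(part, part) for part in pattern)


example = ("Acme Corp", "owns", "Acme Labs")  # illustrative triple
pattern = cfg_pattern(example[1])
sentence = realize(pattern, example) if pattern is not None else None
```
]]></preformat>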
        <p>Relational Patterns. Relational patterns are derived from unstructured
text. We first retrieve triples (⟨subject, predicate, object⟩<sub>T</sub>) for a number of
entities from different ontology classes. In parallel, we extract
text related to each entity considered. This text is preprocessed to tokenize
sentences and resolve co-references. We then extract relations (⟨arg1, rel, arg2⟩<sub>R</sub>)
from the preprocessed text using Open Information Extraction (OpenIE).
The relations are then aligned with the retrieved triples (e.g., a triple subject may
align with arg1 of a relation). The alignment is calculated using the Phrasal Overlap
Measure (POM) for the triple subject and object alignments individually, and the two values are then
multiplied to obtain the final alignment score. We have experimentally determined
that a threshold alignment score of 0.21 prevents low-ranked, inaccurate relational
patterns from being included in the result.</p>
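        <p>The alignment step can be sketched as below. The text does not spell out the POM formula, so the sketch assumes one common formulation in which every shared n-word phrase contributes n² and the total is normalized with tanh over the combined phrase length. Counting distinct shared n-grams is a simplification, and the function names and the example relation are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of distinct n-word phrases in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def phrasal_overlap(a: str, b: str) -> float:
    """Assumed POM variant: shared n-word phrases weighted by n squared, tanh-normalized."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    overlap = 0
    for n in range(1, min(len(ta), len(tb)) + 1):
        shared = ngrams(ta, n) & ngrams(tb, n)
        overlap += len(shared) * n * n
    return math.tanh(overlap / (len(ta) + len(tb)))


THRESHOLD = 0.21  # threshold value reported in the text


def alignment_score(triple: Tuple[str, str, str], relation: Tuple[str, str, str]) -> float:
    """Multiply the subject/arg1 overlap by the object/arg2 overlap."""
    subject, _, obj = triple
    arg1, _, arg2 = relation
    return phrasal_overlap(subject, arg1) * phrasal_overlap(obj, arg2)


triple = ("Steve Jobs", "birthPlace", "San Francisco")
good = alignment_score(triple, ("Steve Jobs", "was born in", "San Francisco"))
bad = alignment_score(triple, ("Apple", "is based in", "Cupertino"))
keep_good = good >= THRESHOLD
keep_bad = bad >= THRESHOLD
```
]]></preformat>
        <p>Because the two overlaps are multiplied, a relation is kept only when both its arguments align with the triple; a perfect subject match cannot compensate for an object that does not overlap at all.</p>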
        <p>Furthermore, we noticed that the grammatical gender of a triple subject and the object
multiplicity of a triple can make a lexicalization pattern more specific. For instance,
the triple ⟨Michelle Obama, spouse, Barack Obama⟩<sub>T</sub> cannot be
lexicalized with the pattern ⟨S?, is the husband of, O?⟩<sub>L</sub>, which is derived using the
triple ⟨Barack Obama, spouse, Michelle Obama⟩<sub>T</sub>. Although both triples have
the same predicate and their subjects belong to the same ontology class, the grammatical
gender of the subject makes the pattern more specific. Similarly, object
multiplicity also needs to be considered. A predicate
can hold either one or more objects. For example, East River has the triple ⟨East
River, country, United States⟩<sub>T</sub>, and Nile River has the triples ⟨Nile River, country,
Egypt⟩<sub>T</sub>, ⟨Nile River, country, Burundi⟩<sub>T</sub>, and ⟨Nile River, country, Ghana⟩<sub>T</sub>.
Although the pattern ⟨S?, is in, O?⟩<sub>L</sub> can lexicalize the East River triple, the most
suitable pattern for Nile River would be ⟨S?, flows through, O?⟩<sub>L</sub>. In these
cases, we associate each pattern with either of the above two features.
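</p>
        <p>One way to picture these two features is as optional constraints stored with each pattern and checked before the pattern is used. The record layout, the feature values, and the select function below are assumptions made for illustration only.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FeaturedPattern:
    pattern: Tuple[str, str, str]
    predicate: str
    subject_gender: Optional[str] = None      # "male", "female", or None for any
    multiple_objects: Optional[bool] = None   # True, False, or None for any


PATTERNS = [
    FeaturedPattern(("S?", "is the husband of", "O?"), "spouse", subject_gender="male"),
    FeaturedPattern(("S?", "is in", "O?"), "country", multiple_objects=False),
    FeaturedPattern(("S?", "flows through", "O?"), "country", multiple_objects=True),
]


def select(patterns: List[FeaturedPattern], predicate: str,
           subject_gender: Optional[str], object_count: int) -> List[Tuple[str, str, str]]:
    """Return the patterns for a predicate whose feature constraints are met."""
    chosen = []
    for p in patterns:
        if p.predicate != predicate:
            continue
        if p.subject_gender is not None and p.subject_gender != subject_gender:
            continue
        if p.multiple_objects is not None and p.multiple_objects != (object_count > 1):
            continue
        chosen.append(p.pattern)
    return chosen


for_female_subject = select(PATTERNS, "spouse", "female", 1)
for_nile = select(PATTERNS, "country", None, 3)
for_east_river = select(PATTERNS, "country", None, 1)
```
]]></preformat>
        <p>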
Property Patterns. Property patterns consist of a predefined set of
lexicalization patterns that transform known predicates into natural language sentences.
Table 1 lists the five predefined patterns with examples.
</p>
        <p>Pattern Search and Realization. The pattern search process associates a
lexicalization pattern with a given triple by executing one or more of the
aforementioned modules. The modules are prioritized in the same sequential order
as they are explained here, and if a matching pattern is found then the rest of
the modules are not executed. The framework also carries out further
realization on the patterns after these lexicalization modules. For instance, if there is a
mismatch of grammatical gender, then we perform dependency parsing and
automatically correct it. Furthermore, patterns in the present tense that describe persons who
are no longer alive are automatically realized in the past tense.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>We performed two evaluations of the lexicalization framework. The first
focused on linguistic accuracy and the second was a human evaluation on
a random sub-sample of 40 to rate both readability and accuracy on a 5-point
Likert scale. The framework generated 283 linguistically correct
lexicalization patterns for 400 triples, yielding a 70.75% accuracy rate. The human
evaluation showed that more than 70% of the sub-sample received a
weighted average rating between 4.1 and 5 for both accuracy and
readability, with Cronbach's alpha values of 0.866 and 0.807.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
This paper presented RealTextlex2, a framework that uses an ensemble
architecture built on four separate pattern mining modules. Furthermore,
realization is implemented on top of these to increase the linguistic accuracy of the patterns.
RealTextlex2 is part of a larger Natural Language Generation project that targets
generating natural language descriptions from the Linked Data cloud. In future,
we expect to further enhance the framework to improve its accuracy and
readability.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Perera</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nand</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A multi-strategy approach for lexicalizing linked open data</article-title>
          .
          <source>In: CICLing-2015</source>
          . (
          <year>2015</year>
          )
          <fpage>348</fpage>
          –
          <lpage>363</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Walter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Unger</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A corpus-based approach for the induction of ontology lexica</article-title>
          .
          <source>In: NLDB-2013</source>
          . (
          <year>2013</year>
          )
          <fpage>102</fpage>
          –
          <lpage>113</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Duma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Generating natural language from linked data: Unsupervised template extraction</article-title>
          .
          <source>In: IWCS-2013</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>