Lexicalizing DBpedia with Realization Enabled
 Ensemble Architecture: RealTextlex2 Approach

                 Rivindu Perera, Parma Nand, and Gisela Klette

                   School of Computer and Mathematical Sciences,
                   Auckland University of Technology, New Zealand
                       {rperera,pnand,gklette}@aut.ac.nz


        Abstract. DBpedia encodes massive amounts of open domain knowledge
        and is growing by accumulating more triples at the same rate as
        Wikipedia. However, applications often require natural language
        formulations of these triples in order to present the information as
        natural text. The RealTextlex2 framework offers a scalable platform
        for transforming these triples into natural language sentences using
        lexicalization patterns. The framework has evolved from its previous
        version (RealTextlex) and comprises four lexicalization pattern mining
        modules which derive patterns from a training triple collection. These
        patterns can then be applied to new triples, provided that they satisfy
        a defined set of constraints.


1     Introduction
DBpedia has become a central hub for applications searching for information
on the web. Since this information is provided in structured form as triples,
such applications require natural language formulations of these triples. For
instance, an application that needs to provide biographies would have to
transform a selected set of triples into natural language in order to present
them as natural text. This approach gives content owners more freedom to
concentrate on the actual content rather than using naive techniques, such as
summarization, to retrieve content from another unstructured text resource.
    Transforming triple-like meaning representations into natural language is
termed lexicalization, a subtask of Natural Language Generation. The
RealTextlex2¹ approach to lexicalization (see [1] for RealTextlex) is based on
an ensemble architecture comprising four pattern mining modules. Three of them
are based on specially crafted lexicons, and the fourth extracts patterns from
unstructured text using Open Information Extraction (OpenIE) and makes them
cohesive so that they can be generalized. This is a completely different
approach compared to the available Linked Data lexicalization platforms: the
corpus-based approach [2], which extracts bare typed-dependency paths as
patterns, and LOD-DEF [3], which substitutes the triple subject and object in
a sentence to form a pattern.
A lexicalization pattern in our approach is itself a triple structure in which
the expressions S? and O? denote the triple subject and object respectively.
As an example, a pattern such as ⟨S?, was born in, O?⟩L can be used to
lexicalize the triple ⟨Steve Jobs, birthDate, 1955-02-24⟩T.
    The rest of the paper provides details on the framework; all features
presented herein will be part of the demonstration.

1 A video demonstration is available at https://vimeo.com/173608664
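
The pattern notation introduced in this section can be made concrete with a small sketch. The Python snippet below is a minimal illustration under stated assumptions, not the RealTextlex2 implementation: the `Triple` and `Pattern` classes and the `lexicalize` function are hypothetical names introduced here, and the sketch only substitutes the triple subject and object into the S? and O? slots without any further realization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Pattern:
    # Each slot is a template that may contain the placeholders S? and O?.
    subject: str
    verb: str
    obj: str


def lexicalize(pattern: Pattern, triple: Triple) -> str:
    """Fill S? and O? with the triple's subject and object and join the slots."""
    def fill(slot: str) -> str:
        return slot.replace("S?", triple.subject).replace("O?", triple.obj)

    slots = (pattern.subject, pattern.verb, pattern.obj)
    return " ".join(fill(slot) for slot in slots) + "."


born_in = Pattern("S?", "was born in", "O?")
jobs = Triple("Steve Jobs", "birthDate", "1955-02-24")
print(lexicalize(born_in, jobs))
```

The object value is inserted verbatim, so formatting the date for display would be a separate realization step.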


2     Demonstration

For the demonstration we use the Java client application. Although its
interface is similar to that of the previous version, the application layer has
been redeveloped to incorporate various improvements, which are discussed in
Section 2.2.


2.1   Datasets

For the purpose of this demonstration we focus on seven randomly selected
ontology classes, namely Office Holder, Educational Institute, Mountain,
Basketball Player, Country, City, and Actor.


2.2   Workflow

The framework is based on four pattern mining modules to generate lexicaliza-
tion patterns.


Occupational Metonym Patterns. Occupational metonyms are used to identify a
person based on his or her occupation. In the majority of cases these are -er
nominalized verbs (e.g., director, publisher, designer). DBpedia uses
occupational metonyms as predicates in multiple scenarios. If such a predicate
is used, the triple can be lexicalized using the base verb of the -er
nominalized verb. We have developed a lexicon of such -er nominalized
occupational metonyms and their associated patterns. For example, for a triple
such as ⟨Now You See Me, director, Jon M. Chu⟩T, we can use the pattern
⟨S?, is directed by, O?⟩L, which is associated with the occupational metonym
"director".
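
The lexicon lookup described above can be sketched as a simple dictionary. This is a hedged illustration only: the `METONYM_PATTERNS` table and both function names are hypothetical, and only the "director" entry comes from the text; the "publisher" and "designer" patterns are assumed by analogy.

```python
# Hypothetical excerpt of an occupational metonym lexicon. Each -er/-or
# nominalization maps to a pattern built on its base verb. Only the
# "director" entry is taken from the paper; the others are assumptions.
METONYM_PATTERNS = {
    "director": ("S?", "is directed by", "O?"),
    "publisher": ("S?", "is published by", "O?"),
    "designer": ("S?", "is designed by", "O?"),
}


def metonym_pattern(predicate):
    """Return the pattern associated with an occupational metonym, or None."""
    return METONYM_PATTERNS.get(predicate.lower())


def apply_pattern(pattern, subject, obj):
    """Substitute subject and object into the S? and O? slots."""
    filled = [slot.replace("S?", subject).replace("O?", obj) for slot in pattern]
    return " ".join(filled) + "."


pattern = metonym_pattern("director")
if pattern is not None:
    print(apply_pattern(pattern, "Now You See Me", "Jon M. Chu"))
```

A predicate that is not in the lexicon yields `None`, which lets a caller fall through to another pattern mining module.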


Context Free Grammar Patterns. Context Free Grammar (CFG) is a bidirectional
grammar formalism which helps both to understand and to generate language. This
research uses only the CFG rule S ≡ NP ↔ VP ↔ NP, where S denotes a sentence
and NP and VP represent a noun phrase and a verb phrase respectively. Based on
this CFG rule, we define the pattern ⟨S?, P?, O?⟩L for all triples which
satisfy two constraints: first, the triple predicate should be a verb, and
second, the verb should have an NP ↔ VP ↔ NP frame in VerbNet.
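
The two constraints can be sketched as a pair of lookups. The snippet below is an assumption-laden illustration: `VERB_LEMMAS` and `NP_V_NP_VERBS` are toy stand-ins for a part-of-speech check and a VerbNet query, their entries are invented for the example, and the function names are hypothetical.

```python
# Toy lexical resources. Both are illustrative stand-ins, not real VerbNet
# data: a real system would use a POS tagger/lemmatizer and query VerbNet.
VERB_LEMMAS = {"influenced": "influence", "owns": "own", "arrived": "arrive"}
NP_V_NP_VERBS = {"influence", "own"}  # lemmas assumed to have an NP V NP frame


def cfg_pattern(predicate):
    """Return the pattern (S?, P?, O?) if both constraints hold, else None."""
    lemma = VERB_LEMMAS.get(predicate)  # constraint 1: the predicate is a verb
    if lemma is None:
        return None
    if lemma not in NP_V_NP_VERBS:      # constraint 2: an NP V NP frame exists
        return None
    return ("S?", "P?", "O?")


def realize(pattern, subject, predicate, obj):
    """Fill the three slots of a CFG pattern with the triple's own elements."""
    values = {"S?": subject, "P?": predicate, "O?": obj}
    return " ".join(values[slot] for slot in pattern) + "."


pattern = cfg_pattern("influenced")
if pattern is not None:
    print(realize(pattern, "Plato", "influenced", "Aristotle"))
```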

Relational Patterns. Relational patterns are derived from unstructured text.
We first retrieve triples (⟨subject, predicate, object⟩T) for a number of
entities from different ontology classes. In parallel, we extract text related
to each entity considered. This text is preprocessed to split it into
sentences and resolve co-references. We then extract relations
(⟨arg1, rel, arg2⟩R) from the preprocessed text using Open Information
Extraction (OpenIE). The relations are then aligned with the retrieved triples
(e.g., a triple subject may align with arg1 of a relation). The alignment is
calculated using the Phrasal Overlap Measure (POM) for the triple subject and
object alignments individually, and the two scores are then multiplied to get
the final alignment score. We have experimentally determined that a threshold
alignment score of 0.21 prevents low-ranked, inaccurate relational patterns
from being included in the result.
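
The scoring scheme (individual subject and object alignments, multiplied, then thresholded) can be sketched as follows. This is a minimal illustration under explicit assumptions: the text does not define POM, so a plain Jaccard token overlap is swapped in as a stand-in similarity; all function names are hypothetical; only the subject-to-arg1 and object-to-arg2 alignment is tried; and treating a score equal to the threshold as acceptable is also an assumption.

```python
THRESHOLD = 0.21  # alignment threshold reported in the text


def token_overlap(a, b):
    """Stand-in similarity in [0, 1]: Jaccard overlap of lower-cased tokens.

    This is NOT the Phrasal Overlap Measure; it only plays the same role.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def alignment_score(triple, relation, sim=token_overlap):
    """Multiply the subject/arg1 and object/arg2 similarity scores."""
    subject, _predicate, obj = triple
    arg1, _rel, arg2 = relation
    return sim(subject, arg1) * sim(obj, arg2)


def derive_pattern(triple, relation, sim=token_overlap, threshold=THRESHOLD):
    """Return (pattern, score) if the relation aligns well enough, else None."""
    score = alignment_score(triple, relation, sim)
    if score < threshold:  # assumption: scores at the threshold are kept
        return None
    return ("S?", relation[1], "O?"), score


triple = ("Steve Jobs", "birthPlace", "San Francisco")
print(derive_pattern(triple, ("Steve Jobs", "was born in", "San Francisco")))
print(derive_pattern(triple, ("Jobs", "co-founded", "Apple")))
```

Passing the similarity function as a parameter keeps the sketch independent of the measure, so an actual POM implementation could be substituted without changing the rest.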
    Furthermore, we noticed that the grammatical gender of the triple subject
and the object multiplicity of a triple can make a lexicalization pattern more
specific. For instance, the triple ⟨Michelle Obama, spouse, Barack Obama⟩T
cannot be lexicalized with the pattern ⟨S?, is the husband of, O?⟩L, which is
derived using the triple ⟨Barack Obama, spouse, Michelle Obama⟩T. Although both
triples have the same predicate and their subjects belong to the same ontology
class, the grammatical gender of the subject makes the pattern more specific.
Similarly, object multiplicity needs to be considered, since a predicate can
hold either one or more objects. For example, East River has the triple ⟨East
River, country, United States⟩T, whereas Nile River has the triples ⟨Nile
River, country, Egypt⟩T, ⟨Nile River, country, Burundi⟩T, and ⟨Nile River,
country, Ghana⟩T. Although the pattern ⟨S?, is in, O?⟩L can lexicalize the East
River triple, the most suitable pattern for Nile River would be ⟨S?, flows
through, O?⟩L. In these cases, we associate each pattern with whichever of the
above two features applies.
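
One way to picture feature-constrained patterns is to store each pattern together with an optional gender and an optional multiplicity flag. The sketch below is an assumption made for illustration: the `PATTERNS` table, its field names, the "is the wife of" entry, and the function names are all hypothetical and are not taken from the framework.

```python
from collections import defaultdict

# Patterns annotated with optional constraints; None means "no constraint".
# The table layout and the "is the wife of" entry are illustrative assumptions.
PATTERNS = {
    "spouse": [
        {"pattern": ("S?", "is the husband of", "O?"), "gender": "male", "multiple": None},
        {"pattern": ("S?", "is the wife of", "O?"), "gender": "female", "multiple": None},
    ],
    "country": [
        {"pattern": ("S?", "is in", "O?"), "gender": None, "multiple": False},
        {"pattern": ("S?", "flows through", "O?"), "gender": None, "multiple": True},
    ],
}


def object_multiplicity(triples):
    """Map (subject, predicate) to True when it has more than one object."""
    objects = defaultdict(set)
    for subject, predicate, obj in triples:
        objects[(subject, predicate)].add(obj)
    return {key: len(values) > 1 for key, values in objects.items()}


def select_pattern(triple, subject_gender, multiple):
    """Return the first pattern whose constraints match the triple's features."""
    _subject, predicate, _obj = triple
    for entry in PATTERNS.get(predicate, []):
        if entry["gender"] not in (None, subject_gender):
            continue
        if entry["multiple"] not in (None, multiple):
            continue
        return entry["pattern"]
    return None


triples = [
    ("East River", "country", "United States"),
    ("Nile River", "country", "Egypt"),
    ("Nile River", "country", "Burundi"),
]
multiplicity = object_multiplicity(triples)
for triple in triples:
    key = (triple[0], triple[1])
    print(triple, "->", select_pattern(triple, None, multiplicity[key]))
```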

Property Patterns. Property patterns consist of a predefined set of
lexicalization patterns that transform known predicates into natural language
sentences. Table 1 lists the five predefined patterns with examples.

                    Table 1. Property patterns with examples

Pattern                Predicate      Resulting lexicalization
⟨S?'s P?, is, O?⟩L     height         ⟨Kobe Bryant's height, is, 1.98⟩LR
⟨S?, has, O? P?⟩L      championships  ⟨Michael Schumacher, has, 7 championships⟩LR
⟨S?, is, O?⟩L          occupation     ⟨Jennifer Lawrence, is, an actress⟩LR
⟨P? in S?, is, O?⟩L    largestCity    ⟨Largest city in Canada, is, Toronto⟩LR
⟨S?, P?, O?⟩L          isPartOf       ⟨Scotland, is part of, United Kingdom⟩LR
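
The patterns of Table 1 can be read as templates over three placeholders. The following sketch is a hedged illustration: the `PROPERTY_PATTERNS` dictionary mirrors the table, while the camelCase splitting in `verbalize_predicate` and the sentence capitalization are assumptions about how P? might be rendered, not a description of the framework's code.

```python
import re

# The five property patterns of Table 1, keyed by predicate.
PROPERTY_PATTERNS = {
    "height": ("S?'s P?", "is", "O?"),
    "championships": ("S?", "has", "O? P?"),
    "occupation": ("S?", "is", "O?"),
    "largestCity": ("P? in S?", "is", "O?"),
    "isPartOf": ("S?", "P?", "O?"),
}


def verbalize_predicate(predicate):
    """Split a camelCase predicate into lower-case words (assumed rendering)."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", predicate).lower()


def apply_property_pattern(subject, predicate, obj):
    """Return the lexicalized sentence, or None for an unknown predicate."""
    pattern = PROPERTY_PATTERNS.get(predicate)
    if pattern is None:
        return None
    values = {"S?": subject, "P?": verbalize_predicate(predicate), "O?": obj}
    filled = []
    for slot in pattern:
        for placeholder, value in values.items():
            slot = slot.replace(placeholder, value)
        filled.append(slot)
    sentence = " ".join(filled)
    # Capitalize only the first character so proper nouns keep their casing.
    return sentence[0].upper() + sentence[1:] + "."


print(apply_property_pattern("Canada", "largestCity", "Toronto"))
print(apply_property_pattern("Kobe Bryant", "height", "1.98"))
```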


Pattern Search and Realization. The pattern search process associates a
lexicalization pattern with a given triple by executing one or more of the
aforementioned modules. The modules are prioritized in the same sequential
order as they are explained here, and once a matching pattern is found the
remaining modules are not executed. The framework also carries out further
realization on the patterns after these lexicalization modules have run. For
instance, if there is a mismatch of grammatical gender, we perform dependency
parsing and correct it automatically. Furthermore, patterns which describe
deceased persons in the present tense are automatically realized in the past
tense.
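
The prioritized search can be sketched as a loop that stops at the first module returning a pattern. Everything below is illustrative: the four module functions are toy placeholders standing in for the real modules, and `realize_tense` is a deliberately crude assumption (it only rewrites a leading "is" to "was") rather than the framework's realization step.

```python
def metonym_module(triple):
    return ("S?", "is directed by", "O?") if triple[1] == "director" else None


def cfg_module(triple):
    return ("S?", "P?", "O?") if triple[1] == "influenced" else None


def relational_module(triple):
    return ("S?", "was born in", "O?") if triple[1] == "birthPlace" else None


def property_module(triple):
    return ("S?", "is", "O?") if triple[1] == "occupation" else None


# Modules listed in the order in which the paper presents them.
MODULES = [
    ("occupational metonym", metonym_module),
    ("context free grammar", cfg_module),
    ("relational", relational_module),
    ("property", property_module),
]


def first_matching_pattern(triple, modules=MODULES):
    """Run modules in priority order and stop at the first pattern found."""
    for name, module in modules:
        pattern = module(triple)
        if pattern is not None:
            return name, pattern
    return None, None


def realize_tense(pattern, subject_is_alive):
    """Toy past-tense realization: rewrite a leading 'is' to 'was'."""
    subject, verb, obj = pattern
    words = verb.split()
    if not subject_is_alive and words and words[0] == "is":
        verb = " ".join(["was"] + words[1:])
    return subject, verb, obj


name, pattern = first_matching_pattern(("Jennifer Lawrence", "occupation", "an actress"))
print(name, pattern)
print(realize_tense(("S?", "is", "O?"), subject_is_alive=False))
```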


3   Evaluation
We performed two evaluations of the lexicalization framework. The first
focused on linguistic accuracy, and the second was a human evaluation on a
random sub-sample of 40, rating both readability and accuracy on a 5-point
Likert scale. The framework generated 283 linguistically correct
lexicalization patterns for 400 triples, yielding a 70.75% accuracy rate. The
human evaluation showed that more than 70% of the sub-sample received a
weighted average rating between 4.1 and 5 for both accuracy and readability,
with Cronbach's alpha values of 0.866 and 0.807.

Fig. 1. RealText desktop application. The patterns extracted are shown in the
left grid window. The three stacked windows on the right show the selected
DBpedia resources, candidate sentences, and extracted relations.
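
The two statistics reported in this section can be reproduced with a few lines of arithmetic. The sketch below recomputes the accuracy rate from the reported counts and shows the standard Cronbach's alpha formula. The ratings matrix is entirely hypothetical (the paper's rating data is not given), and treating rows as rated items and columns as raters is an assumption about the setup.

```python
from statistics import variance


def accuracy_percent(correct, total):
    """Share of linguistically correct patterns, as a percentage."""
    return 100 * correct / total


def cronbach_alpha(ratings):
    """Cronbach's alpha for a matrix with one row per rated item and one
    column per rater: k/(k-1) * (1 - sum of column variances / variance of
    the row totals)."""
    k = len(ratings[0])
    column_variances = sum(variance(column) for column in zip(*ratings))
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - column_variances / total_variance)


print(accuracy_percent(283, 400))  # counts reported in the text

# Hypothetical 5-point Likert ratings, invented purely to exercise the formula.
toy_ratings = [
    [4, 5, 4],
    [5, 5, 4],
    [3, 4, 3],
    [5, 4, 5],
    [2, 3, 2],
]
print(cronbach_alpha(toy_ratings))
```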


4   Conclusion
This paper presented RealTextlex2, a framework that uses an ensemble
architecture with four separate pattern mining modules. Furthermore,
realization is implemented on top of these modules to increase the linguistic
accuracy of the patterns. RealTextlex2 is part of a larger Natural Language
Generation project that aims to generate natural language descriptions from
the Linked Data cloud. In the future, we expect to further improve the
accuracy and readability achieved by the framework.


References
1. Perera, R., Nand, P.: A multi-strategy approach for lexicalizing linked open data.
   In: CICLing-2015. (2015) 348–363
2. Walter, S., Unger, C., Cimiano, P.: A corpus-based approach for the induction of
   ontology lexica. In: NLDB-2013. (2013) 102–113
3. Duma, D., Klein, E.: Generating natural language from linked data: Unsupervised
   template extraction. In: IWCS-2013. (2013)