<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A lemon lexicon for DBpedia</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christina Unger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John McCrae</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Walter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Winter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Philipp Cimiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Semantic Computing Group CITEC, Bielefeld University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>As the body of knowledge available as linked data grows, so does the need to provide methods that make this knowledge accessible for humans. Such methods usually require knowledge about how the vocabulary elements used in the available ontologies and datasets are verbalized in natural language. This has led to much interest in the development of models and frameworks for publishing ontology lexica as linked data. In this paper we describe a process for the manual development of such lexica in lemon format and illustrate some of the key challenges involved. As a proof of concept, we provide a manually created English lexicon for the DBpedia ontology and describe its first release.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology lexicalization</kwd>
        <kwd>DBpedia</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        As the body of knowledge available as linked data grows, so does the need to
provide methods that make this knowledge accessible for humans, for example
by systems that transform natural language questions into SPARQL queries,
systems that generate natural language paraphrases of SPARQL queries, or systems
that generate verbalizations of a given ontology or natural language summaries
from RDF datasets. All such systems require knowledge about how the
vocabulary elements used in the available ontologies and datasets are verbalized in
natural language, in particular covering different verbalization variants, possibly
in multiple languages. Although standards such as RDFS and SKOS, or models
such as OTR [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], allow terminological and linguistic information to be attached
to an ontology, this information is very limited and often not rich enough for
natural language applications. For example, the DBpedia property team can be
verbalized as play for in English if the subject is any kind of player, while it would
be verbalized as race for in case the subject is a race driver, and as manage in
case the subject is a sports manager. Moreover, not only verbalizations of single
classes or properties are relevant, but also verbalizations of complex
constructions. For instance, the expression grandchildren refers to the property chain
child ∘ child. Similarly, the adjective female describes individuals that are
related to the resource Female through the property gender. There are many more
examples, because commonly the conceptual granularity of natural language and
that of the schema underlying a particular dataset do not fully coincide. Also,
while RDF permits only binary relations, natural language expressions can more
freely relate one, two or more arguments.
      </p>
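      <p>As a rough sketch of how such a complex construction could be made explicit, a property chain can be given a name with an OWL 2 property chain axiom. The property name lex:grandchild and the lex: namespace below are assumptions made for illustration only; they are not taken from the released lexicon.</p>
      <preformat>@prefix rdf:     &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix owl:     &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix dbpedia: &lt;http://dbpedia.org/ontology/&gt; .
@prefix lex:     &lt;http://example.org/lexicon#&gt; .   # hypothetical namespace

# Every child of a child is a grandchild (lex:grandchild is a hypothetical name).
lex:grandchild rdf:type owl:ObjectProperty ;
    owl:propertyChainAxiom ( dbpedia:child dbpedia:child ) .</preformat>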
      <p>
        In order to capture linguistically rich information about verbalizations of
simple and complex elements of an ontology or dataset, for example specifying
arguments as optional or capturing restrictions on the usage of verbalizations,
lexical knowledge is needed. Moreover, this lexical knowledge should become part
of the linked data cloud itself, so that it does not have to be recreated by
every application that wants to use it. The lemon model<sup>1</sup> [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] has been developed
for exactly this purpose, i.e. to create a standard format for publishing lexica as
RDF data that declaratively state how vocabulary elements defined in a given
ontology or used in a given dataset are verbalized in a particular language.
However, the creation of such lexica is costly and while it can be automated to
some extent, as shown for example in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], it is highly desirable to share these
lexica in accordance with linked data principles, so that everybody can benefit
from them.
      </p>
      <p>As proof of concept for a lexical layer enriching the linked data cloud, we
manually developed an English lexicon for the DBpedia ontology in lemon
format. In this paper we describe the release of the first version of this lexicon,
illustrate the methodology that has led to its creation and summarize the
challenges in creating such a resource. Finally, we give an outlook on future plans
and invite NLP and Semantic Web researchers to use the lexicon, improve and
extend it, and to create similar lexical resources for other domains.</p>
    </sec>
    <sec id="sec-2">
      <title>Method and dataset description</title>
      <p>The DBpedia ontology<sup>2</sup> currently comprises 359 classes and 1,775 properties.
Since the manual creation of lexical entries is an effort-intensive process, our
approach to creating a lexicon for the DBpedia ontology is an iterative one.
The first step consists of covering all classes and those properties that are most
frequent with respect to the number of occurrences in triples in the DBpedia
dataset. Later, we will successively extend the lexicon to also cover the tail of
less frequent properties, ideally with support from the community.</p>
      <p>The first release of the English DBpedia 3.8 lexicon comprises lexicalizations
of 354 classes and 300 properties. The covered classes are complete except for
one abstract class (PersonFunction) and a few classes without any instances
(e.g. NoteworthyPartOfBuilding). The covered properties comprise all those
with more than 10,000 occurrences in the DBpedia dataset. Excluded were the
Wikipedia-specific ones (e.g. wikiPageRevisionID and thumbnail), abstract
properties (e.g. leaderFunction), and properties for which no straightforward
lexicalization was found (e.g. the datatype property strength, relating a battle
to the numerical sizes of the involved military units).</p>
      <p>The first release of the lexicon thus covers 98 % of the classes and
approximately 20 % of the properties. We plan to successively extend the current lexicon,
so that a second release will cover all properties with at least 1,000 occurrences
(752 properties, i.e. 42 %), and a third release will cover all properties with at
least 100 occurrences (1,112 properties, i.e. 63 %).</p>
      <sec id="sec-2-1">
        <title>1 http://lemon-model.net</title>
      </sec>
      <sec id="sec-2-2">
        <title>2 http://dbpedia.org/Ontology</title>
        <p>The lexicon currently contains 1,217 entries (443 class lexicalizations and
774 property lexicalizations), which amounts to approximately 1.8 entries per
ontology concept (1.3 per class and 2.4 per property). The distribution of the
number of lexicalizations per entry is depicted in Figure 1, where the x-axis
specifies the number of entries, up to 5, and the bars specify how many concepts
have this number of lexicalizations.</p>
        <p>
          All lexical entries were created
manually by two of the authors, familiar
with both DBpedia and lemon,
partly by manually selecting the most
frequent patterns from the
WikiFramework [
          <xref ref-type="bibr" rid="ref1 ref3">3,1</xref>
          ] and BOA [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] pattern
libraries. The main objective when
devising entries was to provide a wide
range of lexical variants, especially
those that differ from the RDFS label
and thus are most likely to be
helpful for NLP applications. For example,
for the object property spouse the lexicon lists as verbalizations the
nouns spouse of, wife of and husband
of, as well as the verb marry and
its participle form married to; for the
datatype property elevation it lists
the nouns elevation of, altitude of and
height of, the verbs stand at and rise
to, as well as the adjective high.
        </p>
        <p>Fig. 1: Distribution of class lexicalizations (light grey bars) and property lexicalizations (dark grey bars)</p>
        <p>
          Most of the entries were constructed using a domain-specific language<sup>3</sup> for
common lemon design patterns [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Figure 2 gives some examples, specifying
different frames (e.g. relational noun and state verb) together with a
canonical form, a reference w.r.t. the ontology, and a mapping between semantic and
syntactic arguments.
        </p>
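        <p>To give an impression of what such a pattern abbreviates, the class noun entry ClassNoun("mountain", dbpedia:Mountain) could be spelled out in lemon RDF roughly as follows. This is a hand-written sketch, not the output of the pattern compiler; the lex: namespace, the entry name lex:mountain and the choice of lexinfo:commonNoun as part of speech are assumptions, and the namespace IRIs should be checked against the lemon and LexInfo documentation.</p>
        <preformat>@prefix rdf:     &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix lemon:   &lt;http://lemon-model.net/lemon#&gt; .
@prefix lexinfo: &lt;http://www.lexinfo.net/ontology/2.0/lexinfo#&gt; .
@prefix dbpedia: &lt;http://dbpedia.org/ontology/&gt; .
@prefix lex:     &lt;http://example.org/lexicon#&gt; .   # hypothetical namespace

# A lexical entry with a canonical form and one sense,
# whose reference is the ontology class it verbalizes.
lex:mountain rdf:type lemon:LexicalEntry ;
    lexinfo:partOfSpeech lexinfo:commonNoun ;          # assumed part of speech
    lemon:canonicalForm [ lemon:writtenRep "mountain"@en ] ;
    lemon:sense [ lemon:reference dbpedia:Mountain ] .</preformat>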
        <p>Of all lexical entries, 54 could not be captured by the lemon design patterns
but only by writing lemon RDF triples. This is mostly the case for constructions,
such as X has Y inhabitants as verbalization of population, or X consists to
Y percent of water as verbalization of percentageOfAreaWater, and for entries
with a compound meaning, i.e. a sense that consists of several subsenses. An
example is the verb link to as in The Autostrada A19 links Palermo to
Catania, which refers to both routeStart and routeEnd. Both constructions and
compound meanings in patterns will be subject of future developments.</p>
        <p>3 https://github.com/jmccrae/lemon.patterns</p>
        <sec id="sec-2-2-1">
          <title>ClassNoun("mountain", dbpedia:Mountain)</title>
        </sec>
        <sec id="sec-2-2-2">
          <title>RelationalNoun("author", dbpedia:writer, propSubj = PossessiveAdjunct, propObj = CopulativeArg)</title>
        </sec>
        <sec id="sec-2-2-3">
          <title>StateVerb("write", dbpedia:writer, propSubj = DirectObject, propObj = Subject)</title>
        </sec>
        <sec id="sec-2-2-4">
          <title>RelationalAdjective("based", dbpedia:headquarter, relationalArg = PrepositionalObject("in"))</title>
          <p>In case a lexical entry verbalizes a concept that is not named in the
ontology, it is defined in the lexicon. In total, the lexicon specifies 89 classes and 2
properties in addition to the DBpedia concepts. Examples are the classes of
all male or female persons, which are not part of the DBpedia ontology but can
straightforwardly be defined as the restriction classes of all things with gender
either Male or Female, e.g.:
<preformat>lex:Female rdf:type owl:Restriction ;
    owl:onProperty dbpedia:gender ;
    owl:hasValue resource:Female .</preformat></p>
          <p>These classes can be verbalized as man and woman. Additionally they can serve
as domain or range restrictions, as e.g. in the verbalization daughter of, referring
to the property child with its object restricted to the class Female, as shown
in Figure 3.</p>
        </sec>
        <sec id="sec-2-2-5">
          <title>ClassNoun("woman", lex:Female) with plural "women"</title>
        </sec>
        <sec id="sec-2-2-6">
          <title>RelationalNoun("daughter", dbpedia:child, propSubj = PossessiveAdjunct, propObj = CopulativeArg restrictedTo lex:Female)</title>
        </sec>
        <sec id="sec-2-2-7">
          <title>IntersectiveDataPropertyAdjective("extinct", dbpedia:conservationStatus, "EX")</title>
        </sec>
        <sec id="sec-2-2-8">
          <title>IntersectiveDataPropertyAdjective("endangered", dbpedia:conservationStatus, "EN")</title>
          <p>Other common cases of compound meanings are properties for which not
the verbalization of the property itself but rather the verbalization of the
property together with a particular object is relevant. Examples are nationalities
(e.g. Dutch being the class of all persons related to the resource Netherlands
through the property nationality), occupations (e.g. Surfer being the class of
all persons related to the resource Surfing through the property profession),
and religions (e.g. Buddhist being the class of all persons related to the resource
Buddhism through the property religion). Another example is the datatype
property conservationStatus, which characterizes a species as endangered,
extinct, or the like. Since the conservation status code is not meaningful for
non-experts, verbalizations such as endangered and extinct, as given in Figure 4,
should be included in the lexicon.</p>
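          <p>Classes of this kind can be defined in the same way as the restriction class lex:Female. A minimal sketch for the nationality example follows; the class name lex:Dutch and the lex: namespace are assumptions made for illustration and are not taken from the released lexicon.</p>
          <preformat>@prefix rdf:      &lt;http://www.w3.org/1999/02/22-rdf-syntax-ns#&gt; .
@prefix owl:      &lt;http://www.w3.org/2002/07/owl#&gt; .
@prefix dbpedia:  &lt;http://dbpedia.org/ontology/&gt; .
@prefix resource: &lt;http://dbpedia.org/resource/&gt; .
@prefix lex:      &lt;http://example.org/lexicon#&gt; .   # hypothetical namespace

# All things whose nationality is the resource Netherlands.
lex:Dutch rdf:type owl:Restriction ;
    owl:onProperty dbpedia:nationality ;
    owl:hasValue resource:Netherlands .</preformat>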
          <p>Although our DBpedia lexicon currently covers only the top 30 % of all URIs
in the DBpedia ontology, it can already prove useful for NLP applications. As a
rough estimate, our lexicon provides verbalizations for 91 of the 104 different
URIs used in the QALD-3<sup>4</sup> query set for question answering over linked data
(i.e. for 95 %).</p>
          <p>
            Note that the provided lexicon includes only schema but no instance data.
For lexicalizations of the nearly 3 million individuals, for example the DBpedia
lexicalization dataset [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] can be exploited, providing label alternatives for all
resources, based on redirects, disambiguation links and Wikipedia anchor texts.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Release status and future plans</title>
      <p>The first version of the DBpedia lexicon is released at
http://lemon-model.net/lexica/dbpedia_en/, under the Creative Commons license CC BY 3.0<sup>5</sup>.
It is accessible as open source on GitHub, at
https://github.com/cunger/lemon.dbpedia, allowing others to improve and extend the lexicon as well as
to port it to other languages. In the future, we plan to release the lexicon also
under lemon source<sup>6</sup>, a collaborative web interface for creating and editing lexica.
Possibly even the DBpedia community could be involved in lexically enriching
the ontology in multiple languages when editing the ontology wiki.</p>
      <p>In order to get an idea of the coverage of the lexicon and its usefulness for
NLP applications like question answering and natural language generation, we
provide a demo at purl.org/3dlt/demo.</p>
      <p>
        In future releases we will focus much more on a semi-automatic approach to
creating lexical entries, as described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], in order to reduce the manual effort in
the lexicon creation process. The manually created lexicon can then serve as gold
standard for evaluating this and also other approaches to ontology lexicalization.
      </p>
      <sec id="sec-3-1">
        <title>4 http://www.sc.cit-ec.uni-bielefeld.de/qald/</title>
      </sec>
      <sec id="sec-3-2">
        <title>5 https://creativecommons.org/licenses/by/3.0/</title>
      </sec>
      <sec id="sec-3-3">
        <title>6 http://monnetproject.deri.ie/lemonsource/</title>
        <p>
          Furthermore, we are creating a first version of a German and Spanish DBpedia
lexicon, based on an automatic translation of the English lexical entries (cf. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ])
and a subsequent manual validation and correction step.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We described the first release of a manually constructed English lexicon for the
DBpedia ontology in lemon format. The focus is on high quality entries, covering
especially those lexicalizations that differ from the provided RDFS label, and
those that verbalize complex constructs and thus cannot yet be handled by
automatic lexicalization methods.</p>
      <p>We hope that the first release of the DBpedia lexicon serves to prove the
usefulness of rich lexical knowledge for NLP applications, and inspires NLP and
Semantic Web researchers to improve and extend the lexicon, port it to other
languages, as well as build and share lexica for other domains, thereby enriching
the linked data cloud with a lexical layer.</p>
      <p>Acknowledgment This work was partially funded within the EU projects
PortDial (FP7-296170) and Monnet (FP7-248458).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>E.</given-names>
            <surname>Cabrio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cojan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Palmero</given-names>
            <surname>Aprosio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavelli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Gandon</surname>
          </string-name>
          .
          <article-title>QAKiS: an open domain QA system based on relational patterns</article-title>
          .
          <source>In Proc. of the 11th International Semantic Web Conference (ISWC 2012)</source>
          , demo paper,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>D.</given-names>
            <surname>Gerber</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>Bootstrapping the linked data web</article-title>
          .
          <source>In Proc. of the 10th International Semantic Web Conference (ISWC)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Mahendra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wanzare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bernardi</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavelli</surname>
          </string-name>
          .
          <article-title>Acquiring relational patterns from Wikipedia: A case study</article-title>
          .
          <source>In Proc. of the 5th Language and Technology Conference</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Espinoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Montiel-Ponsoda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Aguado de Cea</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          .
          <article-title>Combining statistical and semantic approaches to the translation of ontologies and taxonomies</article-title>
          .
          <source>In Proc. of the Fifth workshop on Syntax, Structure and Semantics in Statistical Translation (SSST-5)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>McCrae</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spohr</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          .
          <article-title>Linking lexical resources and ontologies on the semantic web with lemon</article-title>
          .
          <source>In Proc. of the 8th Extended Semantic Web Conference (ESWC)</source>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>McCrae</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          .
          <article-title>Design patterns for engineering the ontology-lexicon interface</article-title>
          . In Paul Buitelaar and Philipp Cimiano, editors,
          <source>Multilingual Semantic Web</source>
          . Springer, to appear.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>P.N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>DBpedia for NLP: A multilingual cross-domain knowledge base</article-title>
          .
          <source>In Proc. of the Eighth International Conference on Language Resources and Evaluation (LREC'12)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Reymonet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Aussenac-Gilles</surname>
          </string-name>
          .
          <article-title>Modelling ontological and terminological resources in OWL DL</article-title>
          .
          <source>In Proc. of OntoLex07</source>
          , volume
          <volume>7</volume>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S.</given-names>
            <surname>Walter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Unger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Cimiano</surname>
          </string-name>
          .
          <article-title>A corpus-based approach for the induction of ontology lexica</article-title>
          .
          <source>In Proc. of the 18th International Conference on Application of Natural Language to Information Systems (NLDB 2013)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>