<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Multilingual eLexicography by Means of Linked (Open) Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thierry Declerck</string-name>
          <email>declerck@dfki.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eveline Wandl-Vogt</string-name>
          <email>Eveline.Wandl-Vogt@oeaw.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Krek</string-name>
          <email>simon.krek@ijs.si</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carole Tiberius</string-name>
          <email>Carole.Tiberius@inl.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ACDH, Austrian Academy of Sciences</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DFKI GmbH, Multilingual Technologies Lab</institution>
          ,
          <addr-line>Saarbrücken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Instituut voor Nederlandse Lexicologie</institution>
          ,
          <addr-line>Leiden</addr-line>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Jožef Stefan Institute</institution>
          ,
          <addr-line>Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this short paper, we document the current state of our work on mapping various lexicographic resources onto the OntoLex model, an OWL and RDF(s) based representation format. This model has been designed in the context of a W3C Community Group effort to support the publication of linguistic data in the Linked (Open) Data cloud. The deployment of OntoLex is currently being tested within the ISCH COST Action IS1305 European Network of e-Lexicography (ENeL), which is adapting guidelines suggested by the LIDER FP7 Support Action to the field of digital lexicography.</p>
      </abstract>
      <kwd-group>
        <kwd>Multilingual Lexicography</kwd>
        <kwd>Linguistic Linked Open Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The European Network of e-Lexicography (ENeL)<sup>1</sup> is a European COST action that
aims to increase, coordinate and harmonize European research in the field of
e-lexicography and to make authoritative information on the languages of Europe easily
accessible.</p>
      <p>The working groups of ENeL deal with the fact that computers and the availability
of the World Wide Web (WWW) have changed the conditions for the production and
reception of dictionaries. For editors of scientific dictionaries, the WWW is not only a
source of inspiration but also a new and challenging opportunity, for example when it
comes to closing the gap between the public and scientific dictionaries while
giving users easy access to those dictionaries.</p>
    </sec>
    <sec id="sec-2">
      <title>1 http://www.elexicography.eu/</title>
      <p>ENeL also attempts to provide a broader and more systematic exchange of know-how and common standards and
solutions in the field of lexicography. In addition, the pan-European nature of the
lexicographical work in Europe is central to ENeL.</p>
      <p>This effort involves the exchange of resources, technologies and experience in
e-lexicography and provides support for dictionaries which are not yet online. A focal
point of ENeL is the discussion and establishment of standards for innovative
e-dictionaries that fully exploit the possibilities of the digital medium. In doing so,
ENeL explores new ways of representing the common heritage of European
languages by developing shared editorial practices and by interconnecting already
existing information.</p>
      <p>For participants of Working Group 3 “Innovative eDictionaries” of ENeL, it
quickly became clear that the expanding Linked Open Data (LOD) framework<sup>2</sup>,
and more specifically the emerging Linguistic Linked Open Data (LLOD)<sup>3</sup>, could
offer a potential infrastructure for realizing some of the group's goals. In the next sections we
briefly present the main principles of the LLOD and its core representation format,
the OntoLex model<sup>4</sup>, before describing the current state of our work in mapping
various lexicographic resources of the ENeL participants to the OntoLex model.</p>
      <sec id="sec-2-1">
        <title>2. Linguistic Linked (Open) Data</title>
        <p>Wikipedia gives the following definition of Linked Data: “In computing, linked data
(often capitalized as Linked Data) describes a method of publishing structured data
so that it can be interlinked and become more useful through semantic queries. It
builds upon standard Web technologies such as HTTP, RDF and URIs, but rather than
using them to serve web pages for human readers, it extends them to share
information in a way that can be read automatically by computers. This enables data
from different sources to be connected and queried”<sup>5</sup>. Data sets that have been
published in the linked data format can be visualized by the so-called Linked Open
Data Cloud diagram<sup>6</sup> or by other representations such as the Linked Open Data
Graph<sup>7</sup>.</p>
        <p>In the context of this still expanding Linked Data framework, work has started
on encoding linguistic resources in the same format as already existing linked data
sets, which primarily consisted of “classical” knowledge objects and entities. In
those data sets, language data is mainly used as human-readable information, encoded
for example in RDF(s) annotation properties such as “label” and “comment”.</p>
        <p>Recently, some researchers in the fields of Human Language Technology (HLT)
and Semantic Web technologies started to work on models, and their implementation,
that would elevate the language data used in existing LOD data sets to the same type
of representation as the encyclopedic knowledge they were
“commenting” and “labeling”<sup>8</sup>. Cooperation on those topics has been established
between, among others, the Working Group on Open Data in Linguistics<sup>9</sup> and the
European FP7 Support Action “LIDER”<sup>10</sup>. Those joint efforts have led to the
establishment of a Linked Open Data (sub-)cloud of linguistic resources, which is
called Linguistic Linked Open Data (LLOD)<sup>11</sup>. The Linguistic Linked Open Data
cloud is also visualized by an online diagram<sup>12</sup>, which itself is derived from
information contained in the LingHub<sup>13</sup> repository developed in the context of the
LIDER project.</p>
        <p><sup>2</sup> See http://linkeddata.org/ for more details.</p>
        <p><sup>3</sup> See http://linguistics.okfn.org/resources/llod/ for more details.</p>
        <p><sup>4</sup> See https://www.w3.org/community/ontolex/ for more details.</p>
        <p><sup>5</sup> http://en.wikipedia.org/wiki/Linked_data</p>
        <p><sup>6</sup> http://lod-cloud.net/</p>
        <p><sup>7</sup> http://inkdroid.org/lod-graph/</p>
        <p>At the core of the publication of language data and linguistic information in the
LLOD is the “OntoLex” model, which results from the work of the W3C Ontology-Lexicon
community group<sup>14</sup>. Since this model was originally based on LMF, which is itself the
ISO standard for Natural Language Processing (NLP) lexicons and Machine Readable
Dictionaries (MRD), it is an appealing model for lexicographers who are seeking to
publish their data in the LOD.</p>
      </sec>
      <sec id="sec-2-2">
        <title>3. OntoLex</title>
        <p>The OntoLex model is based on the ISO Lexical Markup Framework (LMF)<sup>15</sup> and is
an extension of the lemon model, which is described in [5]. OntoLex takes a
modular approach to lexicon specification, thus allowing the eLexicographer to depart
from the “book” view in which the headword is the (unique) entry point to the information
encoded in a dictionary. Senses, usages, concepts, etc. can be described and accessed
independently, and are all linked to what was considered the headword,
which is now encoded as a virtual entry in an RDF model.</p>
        <p>With OntoLex, all elements of a dictionary entry
can be described independently of each other and connected by explicit relation
markers. The components of a dictionary entry can thus be distributed in a network
and linked together by RDF-encoded relations (properties). An important aspect of
this model is the relation called “reference”, a property that
supports the linking of senses of lexicon entries to knowledge objects available in the
LOD cloud. This also reflects our view that the meaning of a lexicon (or dictionary)
entry is no longer necessarily encoded in the lexicon (or dictionary) itself but can be
referred to in the Web of data.</p>
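        <p>The idea of a virtual entry whose components live as separate, linked nodes can be illustrated with a minimal sketch. It is not the implementation used in this work: the triples are plain Python tuples rather than an RDF store, and all names with the prefix "ex:" as well as the resource "dbpedia:Kiss" are hypothetical and chosen only for illustration. The properties canonicalForm, writtenRep, sense and reference are taken from the OntoLex vocabulary.</p>
        <preformat>
```python
from collections import defaultdict

# Each component of the entry is its own node. All "ex:" names and the
# resource "dbpedia:Kiss" are hypothetical, for illustration only.
TRIPLES = [
    ("ex:entry_puss", "ontolex:canonicalForm", "ex:form_puss"),
    ("ex:entry_puss", "ontolex:sense", "ex:puss_sense1"),
    ("ex:form_puss", "ontolex:writtenRep", "Puss"),
    ("ex:puss_sense1", "ontolex:reference", "dbpedia:Kiss"),
]


def describe(triples, node):
    """Group the outgoing links of one node by property."""
    links = defaultdict(list)
    for subject, prop, obj in triples:
        if subject == node:
            links[prop].append(obj)
    return dict(links)


def assemble(triples, entry):
    """Rebuild a 'book view' of an entry by following its links one level deep."""
    view = {}
    for prop, targets in describe(triples, entry).items():
        view[prop] = {target: describe(triples, target) for target in targets}
    return view


view = assemble(TRIPLES, "ex:entry_puss")
print(view["ontolex:canonicalForm"])
print(view["ontolex:sense"])
```
        </preformat>
        <p>In this sketch the entry node itself carries no form and no meaning. Both are reached by following links, and the meaning finally points outside the lexicon.</p>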
        <p>In practice, this means that a dictionary author does not need to describe all
components or elements of an entry in detail, but can also draw on
existing elements (e.g. the etymology of a word) and simply refer to them. We are
convinced that these properties of the model can facilitate and support cooperation
between scientific lexicographers, and that this can result in virtual and
collaborative research environments in the lexicographical field.</p>
        <p><sup>8</sup> See for example [8] and [9].</p>
        <p><sup>9</sup> http://linguistics.okfn.org/.</p>
        <p><sup>10</sup> See http://lider-project.eu/ for more details.</p>
        <p><sup>11</sup> See http://linguistics.okfn.org/tag/llod/ for more details.</p>
        <p><sup>12</sup> http://linguistic-lod.org/llod-cloud</p>
        <p><sup>13</sup> See http://linghub.lider-project.eu/about for more details.</p>
        <p><sup>14</sup> See http://www.w3.org/community/ontolex/ for more details.</p>
        <p><sup>15</sup> See [7] and http://www.lexicalmarkupframework.org</p>
        <p>Fig. 1 below displays the core model of OntoLex<sup>16</sup>. Boxes represent classes of the
model. Arrows with filled heads represent object properties, while arrows with empty
heads represent subclass relations. In arrows labeled 'X/Y', X is the name of the
object property and Y the name of the inverse property.</p>
        <p>We applied this model to a set of lexical resources made available by participants of
the ENeL network, and we describe this transformation process in the next section.</p>
      </sec>
      <sec id="sec-2-3">
        <title>4. Mapping ENeL Lexicographic Resources onto OntoLex</title>
        <p>In order to test our intuition about the format for the publication of lexicographic
resources in the LOD, cooperation between the ENeL COST action and the LIDER
project has been established. For a first test, we received from ENeL participants
samples from 13 dictionaries in different languages:</p>
        <list list-type="bullet">
          <list-item><p>2 Austrian dialect dictionaries (Tustep/XML and Word)</p></list-item>
          <list-item><p>1 sample of a Slovak dictionary (XML, + PDF/Word)</p></list-item>
          <list-item><p>1 Slovene dictionary (XML, based on the LMF standard)</p></list-item>
          <list-item><p>2 dictionaries of Arabic dialects (encoded in TEI)</p></list-item>
          <list-item><p>1 sample from a Basque-German dictionary (XML)</p></list-item>
          <list-item><p>1 sample from a French lexicon (extracted from Wiktionary)</p></list-item>
          <list-item><p>1 Limburg lexicon (Excel)</p></list-item>
          <list-item><p>1 sample from the KDictionary multilingual source (XML file)</p></list-item>
          <list-item><p>1 sample from the Digital Scottish Lexicon (Old Scottish, HTML + 1 example in TEI)</p></list-item>
          <list-item><p>1 lexicon extracted from a corpus of “Baroque German”</p></list-item>
        </list>
        <p><sup>16</sup> The figure and the explanations are taken from the wiki page of OntoLex:
http://www.w3.org/community/ontolex/wiki/Final_Model_Specification.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <p>Figures 2 and 3 below show the kind of data we are dealing with.
Figure 2 shows an example taken from a longer entry of the Austrian Dictionary of
Bavarian Dialects<sup>17</sup>, and Figure 3 a screenshot from the web page of the Dictionary
of the Older Scottish Tongue<sup>18</sup>.</p>
      <p>In Figure 2 we can observe a property of many dialectal or regional dictionaries:
they express the meaning of the entries by using words taken from the standard
language. The meaning of “Puss” or “Puss(e)lein” is expressed by the standard
German word “Kuss” (kiss), or “Gebäck” (pastry). The third meaning of the entry is
expressed by an abbreviation denoting a category, “PflN” (name of a
plant). This (part of the) entry also contains etymological information (“frühnhd.”,
the abbreviation for Early New High German). The entry is marked as
being a masculine word that is mostly used as a neuter. A Bavarian variant of the word
with the kiss meaning is also given (“Busselr”). Much more is available in the full
entry, including a long list of usage examples with some additional definition text.
It is important to note that in the digital version of the dictionary
at our disposal, which is encoded in the TUSTEP<sup>19</sup> format for supporting
publication, we had to gather much of the information ourselves, since the metadata
is very poor. In the past, this kind of dictionary was in fact directed more
at the professional lexicographer than at the general public, and many
aspects of the interpretation of the codes used in the entry were assumed to be known by
the reader. For example, the typographical coding of the headword (what we call
here “entry”) can include information which we had to obtain from an annex or
directly from the lexicographers.</p>
      <p><sup>17</sup> See
http://verlag.oeaw.ac.at/Woerterbuch-der-bairischen-Mundarten-in-Oesterreich-38.Lieferung-WBOe
for more details.</p>
      <p><sup>18</sup> See http://www.dsl.ac.uk/ for more details.</p>
      <p>Similar comments apply to the example in Figure 3. There we had, for example, to
infer the interpretation of the different temporal expressions (the TEI code of the
dictionary was provided to us as a sample for only one entry), and we also had
to interpret the typographical codes used.</p>
      <p>A manual analysis of the resources we received from the ENeL participants
was therefore needed in order to determine whether and how an automated mapping to OntoLex can be
implemented. We also had to add a few classes and properties to the OntoLex
model in order to deal with certain features of the dictionaries. For example, we added
a class for the etymology, a class for describing the lexicographic slips used by
lexicographers, and some properties to encode the different types of temporal
information (date of publication vs. etymological information, etc.). For most of the
lexical information encoded in the 13 dictionaries, we could find a way to map it to
the OntoLex model (see Fig. 1). Every dictionary has been encoded as an
ontolex:Lexicon, using the ontolex:entry object property to indicate inclusion of an
entry:</p>
      <preformat>ontolex:WBÖ
    rdf:type ontolex:Lexicon ;
    rdfs:comment "Dictionary of Bavarian Dialects in Austria"@en ;
    ontolex:entry ontolex:lex_trupp ;
    ontolex:entry ontolex:lex_trüllen ;
    ontolex:entry ontolex:lex_trüsche ;
    ontolex:language "bar"^^xsd:string .</preformat>
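      <p>Once the manual analysis has fixed which fields of a source dictionary correspond to which properties, producing such a statement can be automated. The following is only a sketch under that assumption, not the conversion code used in this work: the function names lexicon_to_turtle and escape are hypothetical, prefix declarations are omitted as in the listing above, and the entry identifiers are assumed to be valid prefixed names already.</p>
      <preformat>
```python
def escape(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def lexicon_to_turtle(lexicon, comment, language, entries):
    """Serialize one lexicon description as a single Turtle statement."""
    lines = [
        lexicon,
        "    rdf:type ontolex:Lexicon ;",
        '    rdfs:comment "{}"@en ;'.format(escape(comment)),
    ]
    # One ontolex:entry link per entry contained in the dictionary.
    lines.extend("    ontolex:entry {} ;".format(entry) for entry in entries)
    # The last property closes the statement with a full stop.
    lines.append('    ontolex:language "{}"^^xsd:string .'.format(escape(language)))
    return "\n".join(lines)


turtle = lexicon_to_turtle(
    "ontolex:WBÖ",
    "Dictionary of Bavarian Dialects in Austria",
    "bar",
    ["ontolex:lex_trupp", "ontolex:lex_trüllen", "ontolex:lex_trüsche"],
)
print(turtle)
```
      </preformat>
      <p>Building the statement from a list of entry identifiers keeps the per-dictionary work limited to extracting those identifiers from the source format (TUSTEP, TEI, Excel and so on).</p>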
      <p>The entries are instances of the ontolex:LexicalEntry class, and ambiguities are
marked by introducing several instances of the ontolex:LexicalSense class:</p>
      <preformat>ontolex:lex_trupp
    rdf:type ontolex:LexicalEntry ;
    ontolex:denotes &lt;http://live.dbpedia.org/page/Herd&gt; ;
    ontolex:denotes &lt;http://live.dbpedia.org/page/Social_group&gt; ;
    rdfs:comment "An entry of WBÖ: Trupp"@en ;
    ontolex:canonicalForm ontolex:form_trupp ;
    ontolex:hasEtymology ontolex:ety_trupp ;
    ontolex:sense ontolex:trupp_sense1 ;
    ontolex:sense ontolex:trupp_sense2 ;
    ontolex:sense ontolex:trupp_sense3 .</preformat>
      <p><sup>19</sup> See http://www.tustep.uni-tuebingen.de/tustep_eng.html for more details.</p>
      <p>The use of the properties “ontolex:sense” and “ontolex:denotes” is very
important if one wants to link lexical resources across languages simply by checking whether
they share the same senses. The difference between the two properties is that the
first points to instances of the class “LexicalSense”, which collects
ontological objects within the model, while the second points directly to
external resources. Instances of the class “LexicalSense” are linked to external
knowledge resources via the property ontolex:reference. Figure 1 in Section 3
above represents graphically the difference between the uses of the
properties “reference” and “denotes”.</p>
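      <p>The relation between the two routes can be made concrete with a small sketch. In the OntoLex specification, denotes corresponds to following sense and then reference, so direct entry-to-resource links can be derived from the sense-level links. The code below is an illustration only: triples are plain tuples, "dbpedia:Herd" and "dbpedia:Social_group" are shorthand for the two DBpedia resources in the listing above, and the assignment of these resources to trupp_sense1 and trupp_sense2 is an assumption, since the sense-level references are not listed here.</p>
      <preformat>
```python
TRIPLES = {
    ("ontolex:lex_trupp", "ontolex:sense", "ontolex:trupp_sense1"),
    ("ontolex:lex_trupp", "ontolex:sense", "ontolex:trupp_sense2"),
    # Assumed for illustration: which sense refers to which resource.
    ("ontolex:trupp_sense1", "ontolex:reference", "dbpedia:Herd"),
    ("ontolex:trupp_sense2", "ontolex:reference", "dbpedia:Social_group"),
}


def derive_denotes(triples):
    """Return the entry-to-resource links implied by sense followed by reference."""
    references = {}
    for subject, prop, obj in triples:
        if prop == "ontolex:reference":
            references.setdefault(subject, set()).add(obj)
    derived = set()
    for subject, prop, obj in triples:
        if prop == "ontolex:sense":
            # obj is a sense; copy each of its references up to the entry.
            for resource in references.get(obj, ()):
                derived.add((subject, "ontolex:denotes", resource))
    return derived


for triple in sorted(derive_denotes(TRIPLES)):
    print(triple)
```
      </preformat>
      <p>Keeping the sense nodes has the advantage that further information (usage examples, definitions, dating) can be attached to each meaning separately, while the derived denotes links give a direct shortcut for cross-resource comparison.</p>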
      <p>On the basis of the OntoLex model we could semi-automatically
establish not only links between entries within and across the samples of the ENeL
dictionaries, but also links to encyclopedic data sets in the LOD, such as
DBpedia<sup>20</sup> or the BabelNet resource<sup>21</sup>, which automatically merges various
multilingual language and encyclopedic resources that are available in RDF.</p>
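      <p>One simple way to propose links between entries of different dictionaries is to group entries by the external resource their senses refer to. The sketch below illustrates this idea only and is not the linking procedure used in this work: the lexicon and entry names with the prefix "ex:" and the rows themselves are hypothetical, and the output is a list of candidates that a lexicographer would still have to confirm, which is where the "semi" in semi-automatic comes in.</p>
      <preformat>
```python
from collections import defaultdict
from itertools import combinations

# (lexicon, entry, referenced resource); all rows are hypothetical examples.
SENSE_REFERENCES = [
    ("ex:lexicon_a", "ex:a_puss", "dbpedia:Kiss"),
    ("ex:lexicon_b", "ex:b_bussl", "dbpedia:Kiss"),
    ("ex:lexicon_b", "ex:b_trupp", "dbpedia:Herd"),
]


def candidate_links(rows):
    """Pair entries from different lexicons whose senses share a reference."""
    by_resource = defaultdict(set)
    for lexicon, entry, resource in rows:
        by_resource[resource].add((lexicon, entry))
    candidates = []
    for resource in sorted(by_resource):
        members = sorted(by_resource[resource])
        for (lex1, entry1), (lex2, entry2) in combinations(members, 2):
            # Only pairs that cross a dictionary boundary are of interest here.
            if lex1 != lex2:
                candidates.append((entry1, entry2, resource))
    return candidates


print(candidate_links(SENSE_REFERENCES))
```
      </preformat>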
      <p>BabelNet is in fact an excellent example of such a combination of linguistic and
encyclopedic data in the LOD cloud. All language data are encoded in RDF and
lemon (the predecessor of OntoLex). While BabelNet mainly considered the
RDF version of WordNet and collaboratively created lexical resources such as
Wiktionary, our work aims at adding to this framework the language and
encyclopedic data that have been created and published by professional lexicographers.</p>
      <sec id="sec-3-1">
        <title>5. Conclusions and Future Work</title>
        <p>We could successfully use the OntoLex model, with very few additions, for encoding
in the LLOD format the lexicographic resources of some participants of the ENeL
network. The next steps will consist in actually publishing the results on the Web.
Our current work consists in further automating the mapping from the original
formats of other ENeL dictionaries and in investigating more efficient strategies for
linking to encyclopedic sources. We are also extending our work to the encoding of
so-called conceptual records used by lexicographers when doing field studies: they
interview people in certain regions and ask them how they express certain concepts
in their language. We have started to use the ConceptSet and LexicalConcept constructs of
OntoLex for this task.</p>
        <p><sup>20</sup> http://dbpedia.org/</p>
        <p><sup>21</sup> http://babelnet.org/</p>
      </sec>
      <sec id="sec-3-2">
        <title>Acknowledgments</title>
        <p>The work described in this short paper is supported in part by the
European Union, both by the LIDER project (under Grant No. 610782) and by the
COST Action IS1305 “ENeL”. Our thanks also go to the participants of the ENeL
COST Action, who provided their data and advice. Finally, our thanks go to
the anonymous reviewers of the first version of this paper, whose comments helped a lot to improve it.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <label>1</label>
        <mixed-citation>Thierry Declerck, Eveline Wandl-Vogt. <article-title>Cross-linking Austrian dialectal Dictionaries through formalized Meanings</article-title>. In: Andrea Abel, Chiara Vettori, Natascia Ralli (eds.): <source>Proceedings of the XVI EURALEX International Congress</source>, pages <fpage>329</fpage>-<lpage>343</lpage> (<year>2014</year>)</mixed-citation>
      </ref>
      <ref id="ref2">
        <label>2</label>
        <mixed-citation>Thierry Declerck, Eveline Wandl-Vogt. <article-title>How to semantically relate dialectal Dictionaries in the Linked Data Framework</article-title>. <source>Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2014)</source>, Gothenburg, Sweden, ACL (<year>2014</year>)</mixed-citation>
      </ref>
      <ref id="ref3">
        <label>3</label>
        <mixed-citation>Maud Ehrmann, Francesca Cecconi, Daniele Vannella, John P. McCrae, Philipp Cimiano, and Roberto Navigli. <article-title>A Multilingual Semantic Network as Linked Data: lemon-BabelNet</article-title>. <source>Proceedings of the 3rd Workshop on Linked Data in Linguistics</source> (<year>2014</year>)</mixed-citation>
      </ref>
      <ref id="ref4">
        <label>4</label>
        <mixed-citation>Philipp Cimiano and Christina Unger. <article-title>Multilingualität und Linked Data</article-title>. In: Tassilo Pellegrini, Harald Sack, and Sören Auer (eds.): <source>Linked Enterprise Data. Management und Bewirtschaftung vernetzter Unternehmensdaten mit Semantic Web Technologien</source> (<year>2014</year>)</mixed-citation>
      </ref>
      <ref id="ref5">
        <label>5</label>
        <mixed-citation>J. McCrae, G. Aguado-de-Cea, P. Buitelaar, P. Cimiano, T. Declerck, A. Gómez-Pérez, J. Gracia, L. Hollink, E. Montiel-Ponsoda, D. Spohr, T. Wunner. <article-title>Interchanging lexical resources on the Semantic Web</article-title>. <source>Language Resources and Evaluation</source> (<year>2012</year>)</mixed-citation>
      </ref>
      <ref id="ref6">
        <label>6</label>
        <mixed-citation>Georg Rehm and Felix Sasaki. <article-title>Semantische Technologien und Standards für das mehrsprachige Europa</article-title>. In: B. Ege, B. Humm and A. Reibold (eds.): <source>Corporate Semantic Web</source> (<year>2014</year>)</mixed-citation>
      </ref>
      <ref id="ref7">
        <label>7</label>
        <mixed-citation>Gil Francopoulo, Monte George, Nicoletta Calzolari, Monica Monachini, Nuria Bel, Mandy Pet, Claudia Soria. <article-title>Lexical Markup Framework (LMF)</article-title>. In: <source>Proceedings of the Fifth International Conference on Language Resources and Evaluation</source> (<year>2006</year>)</mixed-citation>
      </ref>
      <ref id="ref8">
        <label>8</label>
        <mixed-citation>Christian Chiarcos, John McCrae, Philipp Cimiano, and Christiane Fellbaum. <article-title>Towards open data for linguistics: Lexical Linked Data</article-title>. In: Alessandro Oltramari, Piek Vossen, Lu Qin, and Eduard Hovy (eds.): <source>New Trends of Research in Ontologies and Lexical Resources</source>, Springer, Heidelberg (<year>2013</year>)</mixed-citation>
      </ref>
      <ref id="ref9">
        <label>9</label>
        <mixed-citation>Christian Chiarcos, Steven Moran, Pablo N. Mendes, Sebastian Nordhoff, and Richard Littauer. <article-title>Building a Linked Open Data cloud of linguistic resources: Motivations and developments</article-title>. In: Iryna Gurevych and Jungi Kim (eds.): <source>The People's Web Meets NLP. Collaboratively Constructed Language Resources</source>, Springer, Heidelberg (<year>2013</year>)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>