<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Linking Domain-Specific Knowledge to Encyclopedic Knowledge: an Initial Approach to Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pilar León Araúz</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pamela Faber</string-name>
          <email>pfaber@ugr.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pedro J. Magaña Redondo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Andalusian Centre for the Environment (CEAMA), University of Granada</institution>
          <addr-line>Avda. Mediterráneo s/n, 18071 Granada</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Translation and Interpreting, University of Granada</institution>
          ,
          <addr-line>Buensuceso 11, 18002 Granada</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>68</fpage>
      <lpage>73</lpage>
      <abstract>
        <p>Linked Data creates a shared information space by publishing and connecting resources in the Semantic Web. However, the specification of semantic relationships between data sources is still a stumbling block. One solution is to enrich ontologies with multilingual and concept-oriented information. Usefully linking entities in the Semantic Web is thus facilitated by a semantic-oriented cross-lingual ontology mapping framework in which knowledge representations are not restricted to a particular natural language. Accordingly, this paper describes a preliminary approach for integrating general encyclopedic knowledge in DBpedia with EcoLexicon, a multilingual terminological knowledge base on the environment.</p>
      </abstract>
      <kwd-group>
        <kwd>terminology</kwd>
        <kwd>knowledge representation</kwd>
        <kwd>linked data</kwd>
        <kwd>multilinguality</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Knowledge bases play an increasingly important role in enhancing the intelligence of
the Web as well as in supporting information integration [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this respect, the Semantic
Web is an extension of the current Web in which information is given a well-defined
meaning, better enabling computers and people to work in cooperation [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. This
cooperation involves people all over the world, who speak different languages. As Cimiano [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
states, the Semantic Web has the potential for dealing with cross-linguistic mappings
since its content is structured much like a database and thus is language-independent.
      </p>
      <p>
        The awareness of linguistic complexity has intensified over the last ten years as the
number of webpages in languages other than English has soared. This is a challenge for
usefully linking entities in the Semantic Web because this process requires some sort
of semantic-oriented cross-lingual ontology mapping framework in which knowledge
representations are not restricted to the use of a particular natural language [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
However, without a coherent description of concepts and terminological variants that
take into account the categorization of real world entities by other language
communities, the Semantic Web will never be truly multilingual. We thus propose a
model for integrating general encyclopedic knowledge in DBpedia with our
domain-specific resource, EcoLexicon (http://ecolexicon.ugr.es), a multilingual terminological
knowledge base (TKB) on the environment.
      </p>
      <p>
        EcoLexicon was initially implemented in Spanish, English, and German, though more
languages are currently being added (Modern Greek, Russian, Dutch and French). So
far it has a total of 3,271 concepts and 14,646 terms. One of its main assets is its
multilinguality, but as an added value, its user interface also includes semantic
networks, graphical resources, definitions and contextual information that enhance the
representation of conceptual and terminological knowledge [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Nevertheless, the
focus here is on terminological and semantic information since these data sets are the
ones used for linking EcoLexicon to DBpedia. For every environmental concept,
multilingual choices are made available. The users can then click on any of them and
obtain terminological information, such as whether a linguistic designation is a
synonym, acronym, register, or stylistic variant. For instance, the concept EBB
CURRENT has a total of 16 different designations in Spanish, English, German, and
Greek since all registers and linguistic varieties are accounted for. In this entry, ebb
tide appears as a non-technical variant for the concept EBB CURRENT. Even in large
term bases, multilingual variety is rarely represented in an exhaustive way. However,
this type of information is invaluable because not only does it provide users with
multiple options for text comprehension and production, but it is also useful for
conceptual disambiguation (section 3). Concepts are also displayed in dynamic
semantic networks linked to other concepts. Nevertheless, problems can arise when
browsing the networks of very general concepts, which carry an excessive
load of information (Fig. 1).
      </p>
      <p>
        Overloaded concepts, such as WATER, share multiple relations with many other
concepts, but they rarely, if ever, activate all those relations at the same time since
this would evoke completely different and incompatible scenarios. Our claim is that
any specialized domain contains sub-domains in which conceptual dimensions
become more or less salient, depending on the activation of specific contexts [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The
area of environmental knowledge was thus divided into a set of contextual domains
(e.g. HYDROLOGY, GEOGRAPHY, OCEANOGRAPHY, CIVIL ENGINEERING,
ENVIRONMENTAL ENGINEERING, etc.) and the relational power of concepts was
constrained accordingly. Thus, when constraints are applied, the network of WATER
within the CIVIL ENGINEERING domain is recontextualized and becomes more
meaningful (Fig. 2).
      </p>
      <p>
        EcoLexicon is primarily hosted in a relational database (RDB), but at the same
time it is integrated in an ontological model. Semantic information is stored in the
ontology, while the rest remains in the relational database [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This is important
because the linked data process not only involves the transformation of data to RDF
format, but also includes the use of terminologies, controlled vocabularies, and
ontologies to describe the attributes of triples in a systematic way and to serve as reference
conceptual models to support an integrated view of data and semantic interoperability
between datasets [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. As seen in Fig. 3, contextual domains have inspired the design
of our ontology classes. The ontology is automatically retrieved from the data stored
in our RDB, according to the following assumption: if a concept c is part of one or
more propositions allocated to a contextual domain C, c will be an instance of the
class C. EcoLexicon keeps multilingual terminological information and ontological
information separate. Each terminological entry has different word forms linked to the
same natural language definition, constrained by the knowledge represented in the
ontology concept [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Linked Data is an important initiative for creating a shared information space by
publishing and connecting structured resources in the Semantic Web [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. However,
the specification of semantic relationships between data sources is still a stumbling
block. Our initial proposal is to integrate EcoLexicon with DBpedia through the
sameAs property because (1) DBpedia is at the core of the Linked Data initiative, and (2)
users can complete their knowledge acquisition process through guided access to
encyclopedic knowledge.
      </p>
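      <p>The class-assignment assumption stated earlier in this section (a concept c that takes part in one or more propositions allocated to a contextual domain C becomes an instance of class C) can be illustrated with a minimal sketch. The row layout, the relation names, and the sample propositions are hypothetical and are not taken from the EcoLexicon database schema.</p>
      <preformat>
```python
from collections import defaultdict

# Hypothetical sample rows: (concept, relation, concept, contextual domain).
# The relation names and the rows themselves are invented for illustration.
PROPOSITIONS = [
    ("WATER", "part_of", "RESERVOIR", "CIVIL ENGINEERING"),
    ("WATER", "affected_by", "TIDE", "OCEANOGRAPHY"),
    ("GROIN", "type_of", "COASTAL DEFENCE", "CIVIL ENGINEERING"),
]


def derive_class_instances(propositions):
    """Map each contextual domain (class) to the set of concepts that instantiate it."""
    instances = defaultdict(set)
    for source, _relation, target, domain in propositions:
        # Both concepts take part in the proposition, so both become instances.
        instances[domain].update((source, target))
    return dict(instances)


classes = derive_class_instances(PROPOSITIONS)
print(sorted(classes["CIVIL ENGINEERING"]))
print(sorted(classes["OCEANOGRAPHY"]))
```
      </preformat>
      <p>In this sketch a concept such as WATER ends up as an instance of several classes at once, which is the situation that the contextual constraints described above are meant to handle.</p>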
      <p>
        Linking data sources from DBpedia can be quite straightforward since different
tools, such as the ontology editor TopBraid Composer, can automatically suggest the
links. However, because of lexical variation and the lack of univocity in both general
and specialized knowledge, automatic mappings are not always viable. Furthermore,
although establishing an identity relation initially may appear to be a simple task,
matching two entities, both at the syntactic and the semantic levels, is often far from
easy [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Problems with text searching and entity matching highlight the fact that a
word is more than a mere string of characters. The underlying difficulties are basically the same
problems that have plagued linguists over the years: polysemy, homonymy,
synonymy, and different levels of specificity [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. There are also other aspects of
lexical meaning that lead to confusion, such as the fact that: (i) the meaning of a term
can vary, depending on the context; (ii) meanings can change in time and space; (iii)
different languages reflect different mappings of reality, which may coincide totally,
partially, or not at all.
      </p>
      <p>A solution to some of these problems can be found when ontologies are enriched
with multilingual and concept-oriented information, as reflected in the field of
environmental knowledge, but manual work is still necessary to a certain extent.
Nevertheless, instead of mapping one-to-one manual correspondences, we can take
advantage of the semantics contained in each resource. In our approach, the term
strings of EcoLexicon are compared with those from DBpedia, enhanced by those
data sets that include multilingual choices and variants as well as category
membership. To illustrate our data linking proposal, we have chosen four concepts:
GROIN, BANK, ACCRETION, and WASTEWATER TREATMENT PLANT. The pseudocode of
the general matching algorithm is as follows:</p>
      <preformat>for each w:word in ecolexicon
  for each cp:concept in dbpedia
    w' = stem(w); cp' = stem(cp)
    if str_compare(w', cp') &gt; word_threshold
      multi_e = multilingual_variants(w)
      multi_g = multilingual_variants(cp)
      if multilingual_compare(multi_e, multi_g) &gt; multilingual_threshold
        result.add(pair(w, cp))
      related_instances = instances_of(context(w))
      for each i:instance in related_instances
        if look_for_text(comment_properties(cp), i) &gt; text_threshold
          result.add(pair(w, i))</preformat>
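      <p>The first two stages of the pseudocode above (string comparison followed by comparison of multilingual variants) can be sketched in Python as follows. This is only an illustration under stated assumptions: the threshold values, the trivial stemmer, the use of difflib for string similarity, the Jaccard overlap for variant sets, and all sample labels other than groin, groyne, and épi are hypothetical, and the context-based stage is left out.</p>
      <preformat>
```python
from difflib import SequenceMatcher

# Assumed values; the text does not state the thresholds that were used.
WORD_THRESHOLD = 0.8
MULTILINGUAL_THRESHOLD = 0.3


def stem(term):
    """Placeholder stemmer (assumption): lowercases and strips whitespace only."""
    return term.strip().lower()


def str_compare(a, b):
    """String similarity in [0, 1], using difflib's ratio as a stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def multilingual_compare(variants_a, variants_b):
    """Jaccard overlap between two sets of (language, label) pairs."""
    a = {(lang, stem(label)) for lang, label in variants_a}
    b = {(lang, stem(label)) for lang, label in variants_b}
    union = a | b
    return len(a.intersection(b)) / len(union) if union else 0.0


def match(eco_terms, dbpedia_entries):
    """Return (term, entry) pairs that pass both thresholds.

    eco_terms: dict mapping a term to the variant set of its concept.
    dbpedia_entries: dict mapping an entry label to its multilingual labels.
    """
    result = []
    for term, eco_variants in eco_terms.items():
        for entry, db_variants in dbpedia_entries.items():
            if str_compare(stem(term), stem(entry)) > WORD_THRESHOLD:
                score = multilingual_compare(eco_variants, db_variants)
                if score > MULTILINGUAL_THRESHOLD:
                    result.append((term, entry))
    return result


# Hypothetical sample data for the concept GROIN.
groin_variants = {("en", "groin"), ("en", "groyne"), ("fr", "épi"), ("es", "espigón")}
eco_terms = {"groin": groin_variants, "groyne": groin_variants}
dbpedia_entries = {
    "Groyne": {("en", "groyne"), ("fr", "épi"), ("de", "Buhne")},
    "Groin": {("en", "groin"), ("fr", "aine"), ("es", "ingle")},
}
links = match(eco_terms, dbpedia_entries)
print(links)
```
      </preformat>
      <p>With this sample data, the body-part entry shares only one label with the EcoLexicon variant set and is rejected at the multilingual stage, so only the pair (groyne, Groyne) is kept.</p>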
      <p>The concept GROIN in DBpedia is not designated by its most frequent form but by a
geographical variant (groyne). The fact that EcoLexicon stores all lexical variations of
each concept allows us to identify the same entity in both resources by comparing the
strings of all our English monolingual variants with the entries in DBpedia. However,
if the search were performed only for the string groin, DBpedia would redirect to a
disambiguation page, since GROIN can also refer to a part of the human body. In this
case, with the help of the English variant groyne and the French equivalent épi, the
concept can be easily disambiguated.</p>
      <p>The case of BANK is similar to that of GROIN. Nevertheless, it is necessary to add
other parameters to the linking rule since bank is polysemic at a cross-linguistic level.
For example, as in English, the Spanish term banco can refer to a geographic
landform or a financial institution, and there are not many other common multilingual
equivalents in DBpedia for disambiguation. In DBpedia, this domain-specific entry is
named, and differentiated from others, as BANK (GEOGRAPHY). In order to match this
entry and not any of the others, it is necessary to add a context-based rule. Therefore,
this match will occur in the following situations: (1) when the word in brackets
matches the string of any of our contextual classes or their linguistic variants; (2)
when any term, in any language, associated with any concept belonging to the same
contextual class as the search concept appears in one or more of the values of the
following properties: dbpedia-owl:abstract, dcterms:subject, rdfs:comment, or
dbpedia-owl:wikiPageRedirects. In this case, BANK in EcoLexicon belongs to the
classes GEOGRAPHY, GEOLOGY, and OCEANOGRAPHY, as do many other concepts,
such as SHORELINE, ESTUARY, RESERVOIR, SLOPE, RIVER, and MARSH, all of which are
contained in the properties dbpedia-owl:abstract, dcterms:subject and rdfs:comment.
Furthermore, since the disambiguating word in brackets coincides with the
EcoLexicon class GEOGRAPHY, the second step is not even required in this case.</p>
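      <p>The two-step context-based rule described above for BANK can be sketched as follows. The class-to-term table is a hypothetical excerpt, only single-word terms are handled, and rule (1) is restricted here to the classes of the concept being linked, which is a slightly stricter reading than the text. The property text would in practice come from values such as dbpedia-owl:abstract or rdfs:comment; here it is passed in as a plain string.</p>
      <preformat>
```python
import re

# Hypothetical excerpt: contextual class -> terms of the concepts in that class.
CLASS_TERMS = {
    "GEOGRAPHY": {"shoreline", "estuary", "river", "marsh", "slope"},
    "GEOLOGY": {"slope", "sediment"},
}


def qualifier(entry_title):
    """Return the word in brackets of a title such as 'Bank (geography)', or None."""
    found = re.search(r"\(([^)]+)\)\s*$", entry_title)
    return found.group(1).strip().upper() if found else None


def context_match(entry_title, property_text, concept_classes):
    """Apply rule (1), then rule (2), for a concept that belongs to concept_classes."""
    # Rule (1): the bracketed word names one of the concept's contextual classes.
    if qualifier(entry_title) in concept_classes:
        return True
    # Rule (2): a term of a concept from the same class occurs in the property values.
    # Tokenization is word-based, so multiword terms are not covered by this sketch.
    words = set(re.findall(r"\w+", property_text.lower()))
    return any(words.intersection(CLASS_TERMS.get(c, set())) for c in concept_classes)


bank_classes = {"GEOGRAPHY", "GEOLOGY", "OCEANOGRAPHY"}
print(context_match("Bank (geography)", "", bank_classes))
print(context_match("Bank", "A bank is a financial institution.", bank_classes))
print(context_match("Bank (landform)", "The land alongside a river.", bank_classes))
```
      </preformat>
      <p>For the title with the geography qualifier, the first rule already succeeds and the property text is never inspected, which mirrors the observation that the second step is not required for BANK.</p>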
      <p>A similar but even more complex example is the concept
ACCRETION, which is polysemic in different languages as well as within the
environmental domain. This time, disambiguation is not performed only in order to
differentiate other domains from the environmental one. In addition, three
different senses (concepts) in EcoLexicon, designated by the same terms in all
languages and with no variants, have to be matched with three out of the five entries
in DBpedia. In DBpedia, the term accretion may be related to the fields of FINANCE,
ASTROPHYSICS, ATMOSPHERE, GEOLOGY, or COASTAL MANAGEMENT, of which only
the last three are included in EcoLexicon. In EcoLexicon, the concepts belong to the
classes of ATMOSPHERIC SCIENCES, GEOLOGY, and OCEANOGRAPHY, respectively. The
concepts related to FINANCE and ASTROPHYSICS are ruled out through the same
context-based rule as in BANK. However, this rule must be further specified in order to
disambiguate the DBpedia entries of ACCRETION (ATMOSPHERE), ACCRETION
(GEOLOGY) and ACCRETION (COASTAL MANAGEMENT). In this case, it is insufficient to search
the property values for terms whose concepts belong
to the same contextual class as each of the concepts designated by accretion,
since all three concepts are closely interrelated. For instance, the key
terms ice or droplet, only present in the property values of ACCRETION (ATMOSPHERE),
could seem enough to disambiguate the concept. However, the concepts designated by
these terms belong to both our ATMOSPHERIC SCIENCES and GEOLOGY classes. Apart
from their obvious relation to the atmosphere, they are also related to geological
concepts, such as AVALANCHE or EROSION. Therefore, at this point, disambiguation is
still necessary between ACCRETION (GEOLOGY) and ACCRETION (ATMOSPHERE). As for
the property values of ACCRETION (COASTAL MANAGEMENT), they contain certain terms,
such as erosion, sediment, beach, and weather, that can point to
all three classes (i.e. weather to ATMOSPHERIC SCIENCES, erosion and sediment
to GEOLOGY, and beach to both GEOLOGY and OCEANOGRAPHY). Consequently, for
these cases, one more variable is added to the matching algorithm: from all the
contextual classes to which key concepts may belong, only the most frequent one will
be used for disambiguation. This means that if the concepts included in the property
values of ACCRETION (COASTAL MANAGEMENT) are mostly activated in propositions
framed within the OCEANOGRAPHY class, then this entry and the EcoLexicon concept in that class are considered equivalent.</p>
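      <p>The additional variable introduced for ACCRETION can be read as a simple vote: each key term found in the property values contributes only the contextual class in which its concept is most often activated, and the class with the most votes is used for disambiguation. The following sketch shows one possible interpretation; the activation counts are invented, and the two-level voting scheme is an assumption rather than a description of the actual implementation.</p>
      <preformat>
```python
from collections import Counter

# Hypothetical counts: for each key term, the number of propositions in which
# its concept is activated within each contextual class.
ACTIVATIONS = {
    "erosion": {"GEOLOGY": 5, "OCEANOGRAPHY": 3},
    "sediment": {"GEOLOGY": 4, "OCEANOGRAPHY": 6},
    "beach": {"GEOLOGY": 1, "OCEANOGRAPHY": 7},
    "weather": {"ATMOSPHERIC SCIENCES": 8},
}


def dominant_class(key_terms, activations):
    """Vote with each term's most frequent class and return the overall winner."""
    votes = Counter()
    for term in key_terms:
        per_class = activations.get(term)
        if per_class:
            # Only the most frequent class of the term is counted.
            votes[max(per_class, key=per_class.get)] += 1
    return votes.most_common(1)[0][0] if votes else None


terms_in_entry = ["erosion", "sediment", "beach", "weather"]
print(dominant_class(terms_in_entry, ACTIVATIONS))
```
      </preformat>
      <p>With these invented counts, two of the four terms vote for OCEANOGRAPHY, so the entry would be linked to the EcoLexicon sense of ACCRETION that belongs to that class.</p>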
      <p>Finally, WASTEWATER TREATMENT PLANT does not show ambiguity problems
because it is a very specialized concept. Nevertheless, this is a good example of how
linking data does not always ensure knowledge acquisition since conceptual modeling
does not necessarily follow a concrete pattern in all resources. There is thus no
assurance that the content is well structured. The definition of wastewater treatment
plant in DBpedia does not describe the concept at all. It is incorrectly assigned
to a disambiguation category and redirects users to different types of wastewater
treatment, without offering a proper definition of the plant itself. The
Spanish version of Wikipedia has a good entry for its equivalent (estación
depuradora de aguas residuales), but there is no link between them. In this sense,
EcoLexicon could serve as a bridge between the multilingual environmental entries in
DBpedia that are not correctly linked.</p>
    </sec>
    <sec id="sec-2">
      <title>4 Conclusions</title>
      <p>This paper has discussed the importance of multilinguality for the Semantic Web and
the problems that can arise when knowledge representations in other languages are
not taken into account in the linked data process. More specifically, we have
compared the term strings of EcoLexicon’s concepts GROIN, BANK, ACCRETION, and
WASTEWATER TREATMENT PLANT with those from DBpedia, enhanced by multilingual
choices and variants as well as category membership. The results show how valid
correspondences can be obtained by taking advantage of the semantics contained in
each resource.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Meij</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bron</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hollink</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huurnink</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Rijke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Mapping Queries to the Linking Open Data cloud: A Case Study Using DBpedia</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hendler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lassila</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <article-title>The Semantic Web</article-title>
          .
          <source>Scientific American</source>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Janev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vranes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Applicability Assessment of Semantic Web Technologies</article-title>
          .
          <source>Information Processing and Management</source>
          , vol.
          <volume>47</volume>
          pp.
          <fpage>507</fpage>
          -
          <lpage>517</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Towards the Multilingual Semantic Web</article-title>
          . Lecture given at the University of Granada, February 18, 2011. (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brennan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>O'Sullivan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>Cross-Lingual Ontology Mapping and Its Use on the Multilingual Semantic Web</article-title>
          .
          <source>1st Workshop on the Multilingual Semantic Web</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Faber</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>The Dynamics of Specialized Knowledge Representation: Simulational Reconstruction of the Perception-Action Interface</article-title>
          .
          <source>Terminology</source>
          , vol.
          <volume>17</volume>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>29</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>León Araúz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magaña Redondo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Faber</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Managing Inner and Outer Overinformation in Ecolexicon: an Environmental Ontology</article-title>
          .
          <source>8th International Conference on Terminology and Artificial Intelligence</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Cordeiro</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marino</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campos</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borges</surname>
            ,
            <given-names>M.R.S.</given-names>
          </string-name>
          <article-title>Use of Linked Data in the Design of Information Infrastructure for Collaborative Emergency Management System</article-title>
          .
          <source>Computer Supported Cooperative Work in Design (CSCWD)</source>
          , pp.
          <fpage>746</fpage>
          -
          <lpage>711</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Montiel-Ponsoda</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aguado de Cea</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez Pérez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <article-title>Enriching Ontologies with Multilingual Information</article-title>
          .
          <source>Natural Language Engineering</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <article-title>Linked Data: Principles and State of the Art</article-title>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hassanzadeh</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kementsietsidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>A Framework for Semantic Link Discovery over Relational Data</article-title>
          .
          <source>CIKM '09: Proceeding of the 18th ACM conference on Information and knowledge management</source>
          . (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Dostal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jezek</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Automatic Tagging Based on Linked Data: Unsupervised Methods for the Extraction of Hidden Information</article-title>
          .
          <source>Service-Oriented Computing and Applications (SOCA)</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>