CLOVA: An Architecture for Cross-Language Semantic Data
                        Querying

                   John McCrae                          Jesús R. Campaña                    Philipp Cimiano
            Semantic Computing Group,                   Department of Computer           Semantic Computing Group,
                     CITEC                               Science and Artificial                   CITEC
              University of Bielefeld                         Intelligence                 University of Bielefeld
               Bielefeld, Germany                        University of Granada              Bielefeld, Germany
                                                           Granada, Spain
              jmccrae@cit-ec.uni-                                                           cimiano@cit-ec.uni-
                  bielefeld.de                         jesuscg@decsai.ugr.es                    bielefeld.de

ABSTRACT                                                               ence to natural language1 . In order to facilitate the interac-
Semantic web data formalisms such as RDF and OWL al-                   tion of human users with semantic data, supporting language-
low us to represent data in a language independent manner.             based interfaces in multiple languages is crucial. However,
However, so far there is no principled approach allowing us to         currently there is no principled approach supporting the ac-
query such data in multiple languages. We present CLOVA,               cess of semantic data across multiple languages. To fill this
an architecture for cross-lingual querying that aims to ad-            gap, we present in this paper an architecture we call CLOVA
dress this gap. In CLOVA, we make a distinction between a              (Cross-Lingual Ontology Visualisation Architecture) designed
language independent data layer and a language dependent               for querying semantic data in multiple languages. A
lexical layer. We show how this distinction allows us to create        developer of a CLOVA application can define the search interface
modular and extensible cross lingual applications that need            independently of any natural language by referring to onto-
to access semantic data. We specify the search interface at            logical relations and classes within a semantic form specifi-
a conceptual level using what we call a semantic form spec-            cation (SFS), which represents a declarative and conceptual
ification abstracting from specific languages. We show how,            representation of the search interface with respect to a given
on the basis of this conceptual specification, both the query          ontology. We have designed an XML-based language which
interface and the query results can be localized to any sup-           is inspired by the Fresnel language [2] for this purpose. The
ported language with almost no effort. More generally, we              search interface can then be automatically localised by the
describe how the separation of the lexical layer can be used           use of a lexicon ontology model such as LexInfo [4], enabling
with a principled ontology lexicon model (LexInfo) in order            the system to automatically generate the form in the appro-
to produce application-specific lexicalisations of properties,         priate language. The queries to the semantic repository are
classes and individuals contained in the data.                         generated on the basis of the information provided in the SFS
                                                                       and the results of the query can be localised using the same
                                                                       method as used for the localisation of the search interface.
Categories and Subject Descriptors                                     The CLOVA framework is generic in the sense that it can
H.5.m [Information Interfaces and Presentation]: User                  be quickly customised to new scenarios, new ontologies and
Interfaces; I.2.1 [Artificial Intelligence]: Applications and          search forms and additional languages can be added without
Expert Systems; I.2.4 [Artificial Intelligence]: Knowledge             changing the actual application, even at run time if we desire.
Representation Formalisms and Methods; I.2.7 [Artificial                  The paper is organised as follows. Section 2 describes the state
Intelligence]: Natural Language Processing                             of the art in information access across languages and points
                                                                       out basic requirements for cross lingual systems. Section 3
                                                                       describes the CLOVA framework for rapid development of
General Terms                                                          cross-lingual search applications accessing semantic data. We
Design, Human Factors, Languages                                       conclude in Section 4.


Keywords                                                               2.   RELATED WORK
Multilingual Semantic Web, Ontology Localisation, Software                Providing access to information across languages is an im-
Architecture                                                           portant topic in a number of research fields. While our work
                                                                       is positioned in the area of the Semantic Web, we discuss
                                                                       work related to a number of other research areas, including
1.   INTRODUCTION                                                      databases, cross-language information retrieval as well as on-
  Data models and knowledge representation formalisms in               tology presentation and visualisation.
the Semantic Web allow us to represent data without refer-
                                                                       1
Copyright is held by the author/owner(s).                                This holds mainly for RDF triples with resources as subjects
WWW2010, April 26-30, 2010, Raleigh, North Carolina.                   and objects. String data-type elements are often language-
.                                                                      specific.


2.1   Database Systems                                                     • Linguistic equivalences: Multilingual database sys-
   Supporting cross-language data access is an important topic               tems should support linguistic joins which exploit pre-
in the area of database systems, albeit one which has not re-                defined mappings between attributes and values across
ceived very prominent attention (see [8]). An important issue                languages. For example, we might state explicitly that
is certainly the one of character encoding as we need to rep-                the attributes “marital status” (in English) and “Fami-
resent characters for different languages. However, most of                  lienstand” are equivalent and that the values “married”
the current database systems support Unicode so that this                    and “verheiratet” are equivalent.
issue is not a problem anymore. A more complex issue is the
representation of content in the database in such a way that             In fact, those two requirements follow from Kumaran et
information can be accessed across languages. There seems             al.’s assumption that the database should store the data in
to be no consensus so far on what the optimal representa-             multiple languages. If this is the case then we certainly have
tion of information would be such that cross-language access          to push all the cross-language querying functionality into the
can be realised effectively and efficiently. One of the basic         DBMS itself. This is rather undesirable from our point of
requirements for multilingual organisation of data mentioned          view as every time a new language is added to the system,
by Kumaran et al. [8] is the following:                               the DBMS needs to be modified to extend the linguistic and
                                                                      lexical equivalences. Further, the data is stored redundantly
“The basic multilingual requirement is that the database              (once for every language supported). Therefore, we actu-
system must be capable of storing data in multiple                    ally advocate a system design where the data is stored in a
languages.”                                                           language-independent fashion and the cross-lingual querying
                                                                      functionality as well as result localisation is external to the
   This requirement seems definitely too strict to us as it as-       DBMS itself, implemented as pre- and post-processing steps,
sumes that the representation of data is language-dependent           respectively.
and that the database is supposed to store the data in                   In fact, we would add the following requirement to any
multiple languages. This rules out language-independent               system that allows data to be accessed across languages:
approaches which do not represent language-specific
information in the database at all.
                                                                      Requirement 3 (Language Modularity)
The following requirement by Kumaran et al. is one we can
                                                                      The addition of further languages should be modular in the
directly adhere to:
                                                                      sense that it should not require the modification of the DBMS
                                                                      or influence the other languages supported by the system.
Requirement 1 (Querying in multiple languages)
Data must be queriable using query strings in any (sup-
                                                                        As a consequence, the capability of querying data across
ported) language.
                                                                      languages should not be specific to a certain implementation
In fact, we will refer to the above as Requirement 1a and             of a DBMS but work for any DBMS supporting the data
add the following closely related Requirement 1b: ‘The re-            model in question.
sults of a query should also be presented in any (supported)            One of the important issues in representing information in
language.’ Figure 1 summarises all the requirements dis-              multiple languages is avoiding redundancy (see [6]). Hoque
cussed in this section. However, it does not strictly follow          et al. indeed propose a schema to give IDs to every piece
from this that the data should be stored in multiple lan-             of information and then include the language information in
guages in the database. In fact, it suffices that the front end       a dictionary table. This is perfectly in line with Semantic
that users interact with supports different languages and is          Web data models (RDF in particular) where URIs are used
able to translate the user’s input into a formal (language-           to uniquely identify resources. Dictionaries can then be con-
independent) query and localises the results returned by the          structed expressing how the elements represented by the URIs
database management system (DBMS) into any of the                     are referred to across languages. This makes it possible to
supported languages.                                                  conceptually separate the data from the dictionary. This is a crucial
   A further important requirement by Kumaran et al. we               distinction that CLOVA also adheres to (see below).
subscribe to is related to interoperability:
                                                                      2.2      Cross-language Information Retrieval
Requirement 2 (Interoperability)                                        In the field of information retrieval, information access across
The multilingual data must be represented in such a way that          languages has also been an important topic, mainly in the
it can be exchanged across systems.                                   context of the so called Cross-Language Evaluation Forum2
                                                                      (see [10] for the proceedings of CLEF 2008). Cross-language
   This feature is certainly desirable. We will come back to          information retrieval (CLIR) represents an extreme case of
this requirement in the context of our discussion of the Se-          the so called vocabulary mismatch problem well-known from
mantic Web (see below). The next two requirements men-                information retrieval. The problem, in short, is the fact that
tioned by Kumaran et al. are in our view questionable as              a document can be highly relevant to a query in spite of not
they assume that the DBMS itself has built-in support for             having any words in common with the query. CLIR rep-
multiple languages:                                                   resents an extreme case in the sense that if a query and a
                                                                      document are in different languages, then the word overlap
   • String equality across scripts: A Multilingual database          and consequently every vector-based similarity measure will
     system should support lexical joins allowing to join in-         be zero.
     formation in different tables even if the relevant at-
                                                                      2
     tributes of the join are in different scripts.                       http://www.clef-campaign.org/
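
The vocabulary mismatch problem described in Section 2.2 can be illustrated with a small sketch. The class below is not part of CLOVA; its name and methods are assumptions made for illustration. It computes a plain bag-of-words cosine similarity and shows that a query and a document in different languages share no terms, using the English and German expressions mentioned in Section 2.1.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Toy illustration of the vocabulary mismatch problem; hypothetical, not part of CLOVA. */
class VocabularyMismatchSketch {

    /** Builds a term-frequency vector from whitespace-separated, lower-cased tokens. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Euclidean norm of a term-frequency vector. */
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    /** Cosine similarity between two term-frequency vectors. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    static double similarity(String query, String document) {
        return cosine(termFrequencies(query), termFrequencies(document));
    }

    public static void main(String[] args) {
        // Same language: the texts share tokens, so the similarity is positive.
        System.out.println(similarity("marital status married", "marital status"));
        // Different languages: no shared tokens, so the similarity is 0.0.
        System.out.println(similarity("marital status married", "Familienstand verheiratet"));
    }
}
```

Any similarity measure built purely on shared surface terms behaves the same way, which is why cross-language access needs an explicit mapping between expressions and language-independent identifiers.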


  In CLIR, the retrieval unit is the document, while in database        lexico-syntactic information is required, in turn more com-
systems the retrieval unit corresponds to the information               plex representations are necessary. A more formal distinc-
units stored in the data base. Therefore, the requirements              tion of the “data layer” and “lexical layer” is provided by
with respect to multilinguality are rather different for CLIR           lexicon ontology models of which the most prominent models
and multilingual database systems.                                      are the Linguistic Information Repository (LIR) and LexInfo
                                                                        (see [4]).

                                                                        2.4   Ontology Presentation and Visualisation
2.3    Semantic Web                                                       Fresnel [2] is a display vocabulary that describes methods of
   Multilinguality has been so far an underrepresented topic            data presentation in terms of lenses and formats. In essence
in the Semantic Web field. While on the Semantic Web we                 the lens in Fresnel selects which values are to be displayed
encounter similar problems as in the case of databases, there           and the format selects the formatting applied to each part of
are some special considerations and requirements. We will               the lens. This provides many of the basic tools for presenting
consider further important requirements for multilinguality             semantic web data. However, it does not represent
in the context of the Semantic Web. First, we introduce the             multilinguality within the vocabulary and it is not designed to present
crucial distinction between the data layer (proper) and the             a queriable interface to the data. There exist many forms of
lexical layer. We will see below that the conceptual separa-            ontology visualisation methods through the use of trees, and
tion between the data and the dictionary is even more impor-            other structures to display the data contained within the on-
tant in the context of the Semantic Web. According to our               tology, a survey of which is provided in [7]. These are of
distinction, the data layer contains the application-relevant           course focussed mainly on displaying the structure of the on-
data while the lexical layer merely contains information about          tology and do not attempt to convert the ontology to natu-
how the data is realised/expressed in different languages and           ral language. Furthermore, for very large data sources, it is
acts like a dictionary. We note that this distinction is a               impractical to visualise the whole ontology at one time, so
conceptual one as the data in both layers can be stored in the          we wish to select only a certain section of it and therefore
same DBMS. However, this might not always be possible in                require a query interface to perform this task.
a decentralised system such as the Semantic Web:
                                                                        3.    MULTILINGUAL ACCESS AND QUERY-
Requirement 4 (Data and Lexicon Separation)
We require a clear separation between the data and lexicon                    ING USING CLOVA
layer in the Semantic Web. The addition of further languages               CLOVA addresses the problem of realising localised search
should be possible without modifying the data layer. This               interfaces on top of language-independent data sources, ab-
means that the proper data layer and the lexical layer are              stracting the work flow and design of a search engine and
cleanly separated and data is not stored redundantly.                   providing the developer with a set of tools to define and de-
                                                                        velop a new system with relatively little effort. CLOVA ab-
  In the Semantic Web, the parties interested in accessing              stracts lexicalisation and data storage as services, providing
a certain data source are not necessarily its owners (in con-           a certain degree of independence from data sources and mul-
trast to standard centralised database systems as considered            tilingual representation models.
by Kumaran et al.). As a corollary it follows that if a user re-           The different modules of the system have been designed
quires access to a data source in language x he might not have          with the goal of providing very specific, non-overlapping and
the permission to enrich the data source by data represented            independent tasks to developers working on the system de-
in the language x.                                                      ployment concurrently. User interface definition tasks are
  A further relevant requirement in the context of the Se-              completely separated from data access and lexicalisation, al-
mantic Web is the following:                                            lowing developers of each module to use different resources
                                                                        as required.
Requirement 5 (Sharing of Lexica)                                          CLOVA as an architecture does not fulfil any of the afore-
Lexica should be represented declaratively and in a form                mentioned requirements (as they should be fulfilled by lexi-
which is independent of specific applications such that it can          calisation services), but provides a framework to fully exploit
be shared.                                                              cross-lingual services meeting these requirements. The
  It is very much in the spirit of the Semantic Web that                application design allows conceptual representations to be
information should be interoperable and thus reusable beyond            separated from language-dependent lexical representations, making
specific applications. Following this spirit, it seems desirable        user interfaces completely language-independent in order to later
that (given that data representation is language-independent)           localise them to any supported language.
the language-specific information on how certain resources are
expressed in various languages can be shared across systems.
                                                                        3.1   System Architecture
This can be accomplished by declaratively described lexica                The CLOVA architecture is designed to enable the query-
which can be shared.                                                    ing of semantic data in a language of choice, while still pre-
  Multilinguality has been approached in RDF through the                senting queries to the data source in a language-independent
use of its label property, which can assign labels with lan-            form. CLOVA is modular, reusable and extensible and as
guage annotations to URIs. The SKOS framework [9] further               such is easily configured to adapt to different data sources,
expands on this by use of prefLabel, altLabel, hiddenLabel.             user interfaces and localisation tools3 .
These formalisms are sufficient for providing simple represen-          3
                                                                         A Java implementation of CLOVA is available at http://
tation of language information. However, as more complex                www.sc.cit-ec.uni-bielefeld.de/clova/


 Req. No    Implication                                     Status
 Req. 1a    Querying in multiple languages                  REQUIRED
 Req. 1b    Result localisation in multiple languages       REQUIRED
 Req. 2     Data interoperability                           REQUIRED
 Req. 3     Language modularity                             REQUIRED
 Req. 4a    Separation between data and lexical layer       DESIRED TO SUPPORT Req. 3
 Req. 4b    Language-independent data representation        DESIRED TO AVOID REDUNDANCY
 Req. 5     Declarative representation of lexica            DESIRED FOR SHARING LEXICAL INFORMATION

                             Figure 1: Requirements for multilingual organisation of data


   Figure 2 depicts the general architecture of CLOVA and
its main modules. The form displayer is a module which
translates the semantic form specification into a displayable
format, for example HTML. Queries are performed by the
query manager and then the results are displayed to the user
using the output displayer module. All of the modules use
the lexicaliser module to convert the conceptual descriptions
(i.e., URIs) to and from natural language. Each of these
modules is implemented independently and can be exchanged or
modified without affecting the other parts of the system.
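
As a rough illustration of these module boundaries, the following Java sketch shows how a form displayer might depend on a lexicaliser only through an interface. All type names, method signatures and the example.org URIs are assumptions made for this sketch and are not taken from the actual CLOVA implementation. The query manager and output displayer are omitted for brevity.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of CLOVA-style module boundaries; not the actual CLOVA API. */
class ClovaModulesSketch {

    /** Converts between conceptual descriptions (URIs) and natural language. */
    interface Lexicaliser {
        Optional<String> label(String uri, String language);
        Optional<String> resolve(String label, String language);
    }

    /** Translates one form field into a displayable format. */
    interface FormDisplayer {
        String renderField(String fieldName, String propertyUri, String language, Lexicaliser lexicaliser);
    }

    /** Toy lexicaliser backed by an in-memory table: language -> (URI -> label). */
    static final class InMemoryLexicaliser implements Lexicaliser {
        private final Map<String, Map<String, String>> labelsByLanguage = new LinkedHashMap<>();

        void add(String uri, String language, String label) {
            labelsByLanguage.computeIfAbsent(language, k -> new LinkedHashMap<>()).put(uri, label);
        }

        @Override
        public Optional<String> label(String uri, String language) {
            Map<String, String> labels =
                    labelsByLanguage.getOrDefault(language, Collections.<String, String>emptyMap());
            return Optional.ofNullable(labels.get(uri));
        }

        @Override
        public Optional<String> resolve(String label, String language) {
            Map<String, String> labels =
                    labelsByLanguage.getOrDefault(language, Collections.<String, String>emptyMap());
            return labels.entrySet().stream()
                    .filter(e -> e.getValue().equalsIgnoreCase(label))
                    .map(Map.Entry::getKey)
                    .findFirst();
        }
    }

    /** Renders a field as HTML. No escaping is performed; this is a sketch only. */
    static final class HtmlFormDisplayer implements FormDisplayer {
        @Override
        public String renderField(String fieldName, String propertyUri, String language, Lexicaliser lexicaliser) {
            // Fall back to the raw URI when no label is known for the language.
            String text = lexicaliser.label(propertyUri, language).orElse(propertyUri);
            return "<label for=\"" + fieldName + "\">" + text + "</label>"
                    + "<input id=\"" + fieldName + "\" name=\"" + fieldName + "\"/>";
        }
    }

    /** Example data; the URIs and labels are invented for illustration. */
    static InMemoryLexicaliser demoLexicaliser() {
        InMemoryLexicaliser lex = new InMemoryLexicaliser();
        lex.add("http://example.org/ontology/location", "en", "location");
        lex.add("http://example.org/ontology/location", "de", "Standort");
        lex.add("http://example.org/resource/New_York", "en", "New York");
        lex.add("http://example.org/resource/New_York", "es", "Nueva York");
        return lex;
    }

    public static void main(String[] args) {
        Lexicaliser lex = demoLexicaliser();
        FormDisplayer displayer = new HtmlFormDisplayer();
        // The same field specification is rendered with a German label.
        System.out.println(displayer.renderField("Location", "http://example.org/ontology/location", "de", lex));
        // A Spanish input string is resolved back to a language-independent URI.
        System.out.println(lex.resolve("Nueva York", "es").orElse("(unknown)"));
    }
}
```

Because the displayer sees only the Lexicaliser interface, the in-memory table could be replaced by a lexicon-backed service without changing the form rendering code, which is the kind of exchangeability the architecture aims for.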
   We assume that we have a data source consisting of a set
of properties referenced by URIs and whose values are also
URIs or language-independent data values. We shall also
assume that there are known labels for each such URI and                      Figure 2: CLOVA general architecture
each language supported by the application. If this separation
between the lexical layer and the data layer does not already
                                                                      by an RDF type declaration or similar. If this is omitted
exist, we introduce elements to create this separation. It is
                                                                      we simply choose all individuals in the data source. The
often necessary to apply such manual enrichment to a data
                                                                      SFS essentially consists of a list of fields which are to be
source, as it is not trivial to identify which strings in the
                                                                      used to query the ontology. Each field contains the following
data source are language-dependent; however, we find that it is
                                                                      information:
often a simple task to perform by identifying which properties
have language-dependent ranges, or by using XML’s language               • Name: An internal identifier is used to name the input
attribute.                                                                 fields for HTML and HTTP requests.
   We introduce an abstract description of a search interface
by way of XML called a semantic form specification. It                   • Query output: This defines whether this field will
specifies the relevant properties that can be queried by using the         be included in the results. Valid values are always,
URIs in the data source, thus abstracting from any natural                 never, ask (the user could decide whether to include the
language. We show how this can be used to display a form                   field in the results or not), if empty (if the field has not
to the user and to generate appropriate queries once he/she                been queried it is included in the output), if queried
has filled in the form. The query manager provides a back-                 (if the field is queried, it is included in the output) and
end that allows us to convert our queries using information                ask default selected (the user decides, but as default the
in the form into standard query languages such as SPARQL                   field will be shown).
and SQL. Finally, we introduce a lexicalisation component,
which is used to translate between the language-independent              • Property: represents the URI for the ontology prop-
forms specified by the developer and the localised forms pre-              erty to be queried through the field. An indication of
sented to the user. We describe a lexicaliser which builds on              reference=self in place of a URI means that we are
a complex lexicon model and demonstrate that it can provide                querying the domain of the search. Such queries are
more flexibility with respect to the context and complexity                useful for querying the lexicalisation of the object being
of the results we wish to lexicalise.                                      queried or limiting the query to a fixed set of objects.
                                                                         • Property Range: We define a number of types (called
3.2     Modules                                                            property ranges) that describe the data that a field can
                                                                           handle. They differ from the data types of RDF or similar
3.2.1    Semantic Form Specification                                       in that we also describe how the data should be queried.
  One of the most important aspects of the architecture is                 For example, while it is possible to describe
the Semantic Form Specification (SFS), which contains all                  both the revenue of a company and the age of an em-
the necessary information to build a user interface to query               ployee as integers in the database, it is not sensible to
the ontology. In the SFS the developer specifies the ontology              query revenue as a single value, whereas it is often use-
properties to be queried by the application via their URIs.                ful to query age as a single value. These property ranges
This consists of a form for which we specify a domain, i.e.,               provide an abstraction of these properties in the data
the class of objects we are querying as defined in the database            and thus support the generation of appropriate forms


  and queries. The following property ranges are built
  into CLOVA:

    – String, Numeric, Integer, Date: Simple data-type
      values. Note that String is intended for represent-          Figure 3: HTML form generated for a SFS document
      ing language-independent strings, e.g. IDs, not
      natural language strings. The numeric and date
      ranges are used to query precise values like “age”              The SFS document is in principle similar to the concept of a
      and “birth date”.                                            “lens” in the Fresnel display vocabulary [2] in that it describes
    – Range, Segment, Set: These are defined relative              the set of fields in the data that should be used for display
      to another property range and specify how a user             and querying. However, by including more information about
      can query the property in question. Range speci-             methods for querying the data, we provide a description that
      fies that the user should query the data by provid-          can be used for both presentation and querying of the data.
      ing an upper and/or lower bound, e.g. “revenue”,                Example: Suppose that we want to build a small web appli-
      “number of employees”. Segment is similar but re-            cation that queries an ontology with information about com-
      quires that the developer divides the data up into           panies stored in an RDF repository. The application should
      pre-defined intervals. Set allows the developer to           ask for company names, companies’ revenue, and company
      specify a fixed set of queriable values, e.g. “marital       locations. The syntax of a SFS XML document for that ap-
      status”.                                                     plication is shown below:
    – Lexicalised Element: Although we assume all data             <!--xmlns:dbpedia="http://dbpedia.org/ontology/"-->
      in the source is defined by URIs, it is obviously de-        <form domain="dbpedia:Company">
      sirable that the user can query the data using nat-              <fields>
                                                                           <field name="Name" output="ALWAYS">
      ural language. This property range in fact allows                        <property reference="self"/>
      to query for URIs through language-specific strings                      <property-range>
      that need to be resolved by the system to the URI                            <lexicalised-property-range/>
                                                                               </property-range>
      in question. The strings introduced into this field                      <rendering context="html">
      are processed by the lexicaliser to find the URI to                          <property name="autocompletion" value="yes"/>
      which they belong which is then used in the corre-                       </rendering>
                                                                           </field>
      sponding queries. For example, locations can have
      different names in different languages, e.g. “New                    <field name="Location" output="ASK">
      York ” and “Nueva York ”, but the URI in the data                        <property uri="&dbpedia;Organisation/location"/>
                                                                               <property-range>
      source should be the same.                                                   <lexicalised-property-range/>
    – Complex : A complex property is considered to                            </property-range>
                                                                           </field>
      be a property composed of other sub-properties.
      For example, searching for a “key person” within                     <field name="Revenue" output="ASK_DEFAULT_SELECTED">
      a company can be done by searching for prop-                             <property uri="&dbpedia;Organisation/revenue"/>
                                                                               <property-range>
      erties of the person, e.g., “name”, “birth place”.                           <ranged-property-range>
      This nested form allows us to express queries over                              <continuous-property-range>
      the structure of an RDF repository or other data                                    <min>0</min>
                                                                                      </continuous-property-range>
      source.                                                                      </ranged-property-range>
    – Unqueriable: For some data, methods for efficient                        </property-range>
                                                                           </field>
      querying cannot be provided, especially binary data              </fields>
      such as images. Thus we defined this field to allow          </form>
      the result to still be extracted from the data source
      and included in the results.                                 3.2.2    Form Displayer
                                                                     The form displayer consists of a set of form display elements
  The described property ranges are supported natively             defined for each property range. It processes the SFS by using
  by CLOVA, but it is also possible to define new property         these elements to render the fields in a given order. The
  ranges and include them in the SFS XML document.                 implementation of these elements is dependent on the output
  The appropriate implementation for a form display ele-           method. The form display elements are rendered using Java
  ment that can handle the newly defined property range            code to convert the document to XHTML4 .
  has to be provided of course (see Section 3.2.2).                  Figure 3 shows an example of rendering of an SFS which
• Rendering Properties: There is often information for             includes the fields in the example above. In this rendering
  a particular rendering that cannot be provided in the            the field “name” is displayed as a text field as it refers to the
  description of the property ranges alone. Thus, we allow         lexicalisation of this company. The location of a company for
  for a set of context specific properties to be passed to         instance is represented as a text field. However, in spite of
  the rendering engine. Examples of these include the use          the fact that the data is represented in the data source as a
  of auto-completion features or an indication of the type         language independent URI, the user can query by specifying
  of form element to display, i.e. a Set can be displayed          4
                                                                    The CLOVA project also provides XSLT files to perform the
  as a drop-down list, or as a radio button selection.             same task
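
As an illustration of how an SFS document might be consumed, the following minimal sketch reads the fields of a specification and records, for each one, its output mode, property, property range and rendering properties. It is written in Python for brevity rather than in the Java of the CLOVA implementation. The function name `read_fields` and the dictionary layout are assumptions made for this example, and the entity references of the original listing are written out as full URIs so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET

# Illustrative SFS fragment modelled on the example in this section.
SFS = """
<form domain="http://dbpedia.org/ontology/Company">
  <fields>
    <field name="Name" output="ALWAYS">
      <property reference="self"/>
      <property-range><lexicalised-property-range/></property-range>
      <rendering context="html">
        <property name="autocompletion" value="yes"/>
      </rendering>
    </field>
    <field name="Revenue" output="ASK_DEFAULT_SELECTED">
      <property uri="http://dbpedia.org/ontology/Organisation/revenue"/>
      <property-range>
        <ranged-property-range>
          <continuous-property-range><min>0</min></continuous-property-range>
        </ranged-property-range>
      </property-range>
    </field>
  </fields>
</form>
"""


def read_fields(sfs_xml):
    """Return one dict per <field> element (hypothetical helper)."""
    root = ET.fromstring(sfs_xml.strip())
    fields = []
    for field in root.iter("field"):
        prop = field.find("property")            # direct child only
        range_el = field.find("property-range")
        # The first child of <property-range> names the kind of range.
        kind = None
        if range_el is not None and len(range_el) > 0:
            kind = range_el[0].tag
        rendering = {
            p.get("name"): p.get("value")
            for p in field.findall("rendering/property")
        }
        fields.append({
            "name": field.get("name"),
            "output": field.get("output"),
            "property": prop.get("uri") or prop.get("reference"),
            "range": kind,
            "rendering": rendering,
        })
    return fields


fields = read_fields(SFS)
```

A form displayer could then dispatch on the `range` value of each entry to select the matching form display element.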


the name of the resource in their own language (e.g., a Ger-
man user querying “München” receives the same results as
an English user querying “Munich”). Finally, the revenue is
asserted as a continuous value which is queried by specifying
a range and is thus rendered with two inputs allowing the
user to specify the upper and/or lower bounds of their query.
A minimum value on this range allows for client-side data
consistency checks. In addition, check boxes are appended
to fields in order to allow users to decide if the fields will be
shown in the results, according to the output parameter in
the SFS.
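
The rendering just described can be sketched as follows. This is only an illustration in Python, not the CLOVA form display elements themselves: the function `render_field`, the generated markup and the input names are hypothetical. We also assume that `ALWAYS` means the field is always shown in the results, `ASK` adds an unchecked check box and `ASK_DEFAULT_SELECTED` adds a pre-checked one.

```python
from html import escape


def render_field(name, range_kind, output, minimum=None):
    """Render one form field as an XHTML fragment (hypothetical markup)."""
    label = escape(name)
    key = escape(name.lower(), quote=True)
    if range_kind == "continuous":
        # A continuous range is queried by lower and/or upper bound,
        # so it is rendered with two inputs sharing the same minimum.
        min_attr = "" if minimum is None else f' min="{minimum}"'
        inputs = (
            f'<input type="number" name="{key}_from"{min_attr}/>'
            f'<input type="number" name="{key}_to"{min_attr}/>'
        )
    else:
        # Lexicalised values are entered as language-specific strings.
        inputs = f'<input type="text" name="{key}"/>'
    if output == "ALWAYS":
        show = ""  # always part of the results, so no check box
    else:
        checked = ' checked="checked"' if output == "ASK_DEFAULT_SELECTED" else ""
        show = f'<input type="checkbox" name="show_{key}"{checked}/>'
    return f"<label>{label}</label>{inputs}{show}"


revenue_html = render_field("Revenue", "continuous", "ASK_DEFAULT_SELECTED", minimum=0)
name_html = render_field("Name", "lexicalised", "ALWAYS")
```

The `min` attribute is what would allow a browser to perform the client-side consistency check mentioned above.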

3.2.3    Query Manager
   Once the form is presented to the user, they can fill in
the fields and select which properties they wish to visualise
in the results. When the query form is sent to the Query
Manager, it is translated into a specific query for a particular
knowledge base. We have provided modules to support the
use of SQL queries using JDBC and SPARQL queries using
Sesame [3]. We created an abstract query interface which can
be used to specify the information required in a manner that
is easy to convert to the appropriate query language, allowing
                                                                             Figure 4: HTML result page for the example
us to change the knowledge base, ontology and back end with-
out major problems. The query also needs to be preprocessed
using the lexicaliser due to the presence of language-specific            The following output specification defines two output ele-
terms introduced by the user which need to be converted to               ments to show results.
language independent URIs.
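
The translation performed by the Query Manager can be illustrated with a small sketch that turns abstract constraints into a SPARQL query string. It is written in Python for brevity and does not reproduce CLOVA's abstract query interface: the class `Constraint`, the function `to_sparql` and the example resource URI are assumptions. Lexicalised values are taken to be already resolved to URIs, and numeric bounds are written as plain literals, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Constraint:
    """One filled-in form field (hypothetical structure)."""
    property_uri: str
    value_uri: Optional[str] = None   # lexicalised field, already resolved
    minimum: Optional[float] = None   # lower bound of a continuous range
    maximum: Optional[float] = None   # upper bound of a continuous range
    show: bool = True                 # include this property in the results


def to_sparql(domain_uri, constraints):
    """Translate abstract constraints into a SPARQL SELECT query string."""
    select = ["?s"]
    patterns = [f"?s a <{domain_uri}> ."]
    filters = []
    for i, c in enumerate(constraints):
        if c.value_uri is not None:
            # The value is fixed, so it is matched directly.
            patterns.append(f"?s <{c.property_uri}> <{c.value_uri}> .")
            continue
        var = f"?v{i}"
        patterns.append(f"?s <{c.property_uri}> {var} .")
        if c.show:
            select.append(var)
        if c.minimum is not None:
            filters.append(f"{var} >= {c.minimum}")
        if c.maximum is not None:
            filters.append(f"{var} <= {c.maximum}")
    body = " ".join(patterns)
    if filters:
        body += " FILTER(" + " && ".join(filters) + ")"
    return f"SELECT {' '.join(select)} WHERE {{ {body} }}"


query = to_sparql(
    "http://dbpedia.org/ontology/Company",
    [
        Constraint("http://dbpedia.org/ontology/Organisation/location",
                   value_uri="http://dbpedia.org/resource/Munich"),
        Constraint("http://dbpedia.org/ontology/Organisation/revenue",
                   minimum=0, maximum=1000000),
    ],
)
```

Supporting a different back end, such as SQL over JDBC, would amount to providing another function over the same constraint objects.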
                                                                         <!-- xmlns:clova="jar:file:clova-html.jar!/clova/html/output/"
                                                                            xmlns:dbpedia="http://dbpedia.org/ontology/"-->
3.2.4    Output Displayer                                                 <output>
   Once the query is evaluated, the results are processed by                 <elements>
the output displayer and an appropriate rendering shown to                       <element id="HTable" URI="&clova;HTableDisplayElement">
                                                                                     <fields>
the user. The displayer consists of a number of display el-                              <all/>
ements, each of which represents a different visualisation of                        </fields>
the data, including not only simple tabular forms, but also                      </element>
                                                                                 <element id="BarChart" URI="&clova;GraphDisplayElement">
graphs and other visual display methods. All of these                                <fields>
elements are lexicalised in the same manner as those of                                  <field name="revenue"/>
the form displayer.                                                                  </fields>
                                                                                     <display>
   In general we might restrict the types of data that compo-                            <property name="Type" value="barChart"/>
nents will display as not every visualisation paradigm is suit-                      </display>
able for any kind of data. For example, a bar chart showing                      </element>
                                                                             </elements>
foundation year and annual income would be both uninfor-                 </output>
mative and difficult to display due to the scale of values. For
this reason we provide an Output Specification to define the               The first element displays a table containing all the re-
set of available display elements and sets of values they can            sults returned by the query, while the second output element
display. These output specifications consist of a list of output         shows a bar chart for the property “Revenue”. The HTML
elements described as follows:                                           output generated for a given output specification containing
                                                                         the above mentioned descriptions is shown in Figure 4.
    • ID: Internal identifier of the output element displayed.
                                                                         3.2.5    Lexicaliser
    • URI: A reference to the output resource specified as a
      URI.5                                                                 Simple lexicon models can be provided by language
                                                                         annotations, for example RDFS’s label and SKOS’s prefLabel,
    • Fields: The set of fields used by this element. These              and developing a lexicaliser is then as simple as looking up
      should correspond by name to elements in the SFS.                  these labels for the given resource URI. This approach may
                                                                         be suitable for some tasks. However, we sometimes require
    • Display properties: Additional parameters passed to                lexicalisation using extra information about the context and
      the display element to modify its behaviour. Some of               would like to provide lexicalisation of more than just URIs,
      these parameters include the possibility to ignore in-             e.g. when lexicalising triples. While RDF labels can be at-
      complete data, or to define the subtypes of a chart to             tached to properties and individuals for instance, there is no
      display. These parameters are class dependant so that              mechanism that allows to compute a lexicalization for a triple
      each output element has its own set of valid parameters.           by composing together the labels of the property and the in-
5                                                                        dividuals. This is a complex problem and we will leave a full
 These can reference Java classes by linking to the appropri-
ate class file or location in a JAR file                                 investigation and evaluation of this for future work.
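
The simple label-based lexicaliser mentioned in Section 3.2.5 can be sketched as a lookup of language-tagged labels. The sketch below is in Python and is only illustrative: the table `LABELS` stands in for labels read from the data source, its contents are sample data, and the fallback behaviour (default language first, then the local name of the URI) is an assumption rather than documented CLOVA behaviour.

```python
# (URI, language) -> label, standing in for language-tagged label
# annotations in the data source. The entries are illustrative only.
LABELS = {
    ("http://dbpedia.org/resource/Munich", "en"): "Munich",
    ("http://dbpedia.org/resource/Munich", "de"): "München",
    ("http://dbpedia.org/ontology/Organisation/revenue", "en"): "revenue",
}


def lexicalise(uri, lang, fallback="en"):
    """Return a label for `uri` in `lang` (hypothetical helper)."""
    for candidate in (lang, fallback):
        label = LABELS.get((uri, candidate))
        if label is not None:
            return label
    # No label at all: fall back to the local name of the URI.
    return uri.rstrip("/#").rsplit("/", 1)[-1].rsplit("#", 1)[-1]


munich_de = lexicalise("http://dbpedia.org/resource/Munich", "de")
munich_fr = lexicalise("http://dbpedia.org/resource/Munich", "fr")
unlabelled = lexicalise("http://dbpedia.org/ontology/Company", "de")
```

Such a lookup handles single URIs only. It cannot compose a lexicalisation for a triple, which is the limitation discussed above.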


                                                                                                                     the interested reader is referred to [4].
                                                                                                                     LILAC:
                                                                                                                        In order to produce lexicalisations of ontology elements
                                                                                                                     from a LexInfo model we use a simple rule language included
                                                                                                                     with the LexInfo API called LILAC (LexInfo Label Analysis
                                                                                                                     & Construction). A LILAC rule set describes the structure of
                                                                                                                     labels and can be used for both generating the lexicon from
                                                                                                                     labels and generating lexicalisations from the lexicon. In
                                                                                                                     general we assume that lexicons are generated from some set of
                                                                                                                     existing labels, which may be extracted from annotations in
                                                                                                                     the data source, e.g., RDFS’s label, from the URIs in the
                                                                                                                     ontology or from automatic translations of these labels from
                                                                                                                     another language. The process of generating aggregates from
                                                                                                                     raw labels requires that first the part of speech tags are
                                                                                                                     identified by a tagger such as TreeTagger. Then, the part-of-speech
                                                                                                                     tagged labels are parsed using an LR(1)-based parser (see [1]).
                                                                                                                     The API then handles these parse trees and converts them
                                                                                                                     into LexInfo aggregates.
                                                                                                                        LILAC rules are implemented in a symmetric manner so
                                                                                                                     that they can be used to both generate the aggregates in the
Figure 5: A simplified example of a LexInfo aggregate
[Figure not reproduced. Node labels: Noun : LexicalEntry; Lemma hasWrittenForm="product"; WordForm hasWrittenForm="products" [ number=plural ]; http://dbpedia.org/ontology/productOf : Sense; SyntacticBehaviour; NounPP : SubcategorizationFrame; Subject : SyntacticArgument; PObject : SyntacticArgument; SynSem Arg Map 1; SynSem Arg Map 2; SemanticPredicate; Domain : SemanticArgument; Range : SemanticArgument]
                                                                                                                     lexicon ontology model (e.g. by analysing the labels of a given
                                                                                                                     ontology) as well as lexicalise those aggregates.
   Furthermore, it is often desirable to have fine control over                                                         A simple example rule for a label such as “revenue of” is:
the form of the lexicalisation, for example, the ontology la-                                                        Noun_NounPP -> <noun> <preposition>
bel may be “company location in city”. However, we may
wish to have this property expressed by the simpler label                                                               This rule states that the lexicalisation of a Noun_NounPP
“location”. By using a lexicon ontology model we can specify                                                         aggregate is given by the written form of the lemma of the
the lexicalisation in a programmatic way, and hence adapt                                                            “noun” of the aggregate, followed by the lemma of the
it to the needs of the particular query interface. For these                                                         “preposition” of the aggregate. LILAC also supports the insertion
reasons we primarily support lexicalisation through the use                                                          of literal terms and choosing the appropriate word form in
of the LexInfo [4] lexicon ontology model and its associated                                                         the following manner:
API6 , which is compatible with the LMF Vocabulary [5].
The LexInfo model:                                                                                                   Verb_Transitive -> "is" <verb> [ participle,
   A LexInfo model is essentially an OWL model describing                                                                tense=past ] "by"
the lexical layer of an ontology specifying how properties,
classes and individuals are expressed in different languages.                                                          This rule can be used to convert a verb with transitive
We refer to the task of producing language-specific repre-                                                           behaviour into a passive form (e.g., it transforms “eats” into
sentation of elements in the data source including triples as                                                        “is eaten by”).
lexicalisation of the data. The corresponding LexInfo API                                                              LILAC can create lexicalisations recursively for phrases and
organises the lexical layer mainly by defining so-called                                                             similar structures. For example, to lexicalise an aggregate for
aggregates which describe the lexicalisation of a particular URI,                                                    “yellow moon”, the following rules are used. Note that in this case
specifying in particular the lexico-syntactic behaviour of cer-                                                      the names provided by the aggregate class are not available
tain lexical entries as well as their interpretation in terms of                                                     so the name of the type is used instead:
properties, classes and individuals defined in the data. An ag-
gregate essentially bundles all the relevant individuals of the                                                      NounPhrase -> <adjective> <NounPhrase>
LexInfo model needed to describe the lexicalization of a cer-                                                        NounPhrase -> <noun>
tain URI. This includes a description of syntactic, lexical and
morphological characteristics of each lexicon entry in the lexi-                                                       The process for lexicalisation proceeds as follows: for each
con. Indeed, each aggregate describes a lexical entry together                                                       ontology element (identified by a URI) that needs to be lexi-
with its lemma and several word forms (e.g. inflectional forms                                                       calised, the LexInfo API is used to find the lexical entry that
such as the plural etc.). The syntactic behaviour of a lexical                                                       refers to the URI in question. Then the appropriate LILAC
entry is described through subcategorization frames making                                                           rules are invoked to provide a lexicalization of the URI in a
the required syntactic arguments explicit. The semantic in-                                                          given language.
terpretation of the lexical entry with respect to the ontology                                                         As this process requires only the URI of the ontology ele-
is captured through a mapping (“syn-sem argument map”)                                                               ment, by changing the LexInfo model and providing a reusable
from the syntactic arguments to the semantic arguments of                                                            set of LILAC rules the language of the interface can be changed
a semantic predicate which stands proxy for an ontology ele-                                                         to any suitable form. It is important to emphasize that the
ment in the ontology. Finally the aggregate is linked through                                                        LILAC rules are language-specific and thus need to be pro-
a hasSense link to the URI in the data layer it lexicalises.                                                         vided for each language supported.
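
To make the rule mechanism concrete, the two non-recursive LILAC rules shown in this section can be imitated by a small interpreter. This is a sketch in Python and does not use the LexInfo API or the actual LILAC syntax: the dictionaries standing in for aggregates, the rule encoding and the function `apply_rule` are all assumptions.

```python
# Each aggregate maps a role name to a lexical entry with a lemma and
# optional inflected word forms keyed by their features (assumed layout).
REVENUE_OF = {
    "noun": {"lemma": "revenue", "forms": {}},
    "preposition": {"lemma": "of", "forms": {}},
}
EATS = {
    "verb": {"lemma": "eat",
             "forms": {("participle", "tense=past"): "eaten"}},
}

# A rule is a sequence of literal words and slots; a slot names a role
# and the features of the word form to choose (empty means the lemma).
RULES = {
    "Noun_NounPP": [("slot", "noun", ()), ("slot", "preposition", ())],
    "Verb_Transitive": [
        ("lit", "is"),
        ("slot", "verb", ("participle", "tense=past")),
        ("lit", "by"),
    ],
}


def apply_rule(rule_name, aggregate):
    """Build the lexicalisation of `aggregate` using the named rule."""
    words = []
    for item in RULES[rule_name]:
        if item[0] == "lit":
            words.append(item[1])
        else:
            _, role, features = item
            entry = aggregate[role]
            # Use the requested word form, or the lemma if none matches.
            words.append(entry["forms"].get(features, entry["lemma"]))
    return " ".join(words)


noun_pp = apply_rule("Noun_NounPP", REVENUE_OF)
passive = apply_rule("Verb_Transitive", EATS)
```

Here `noun_pp` is "revenue of" and `passive` is "is eaten by", matching the two examples in the text. Recursive rules such as those for NounPhrase are not covered by this sketch.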
An example of an aggregate is given in Figure 5. For details                                                           We also want our users to be able to search for
                                                                                                                     elements by their lexicalised form. LexInfo can
6
    Available at http://lexinfo.googlecode.com/                                                                      support this as well. This involves querying the lexicon for


all lexical entries that have a word form matching the query           providing fine control over the lexicalisations used in a
and returning the URI that the lexical entry is associated             particular context.
with. Once we have mapped all language-specific strings to
URIs, the query can be handled using the query manager as              Acknowledgements
usual. For example if the user queries for “food” then the
LexInfo model could be queried for all lexical entries that            This work has been carried out in the context of the Mon-
have either a lemma or word form matching this literal. The            net STREP Project funded by the European Commission
URIs referred to by this word can then be used to query the            under FP7, and partially funded by the “Consejería de
knowledge base. This means that a user can query in their              Innovación Ciencia y Empresa de Andalucía” (Spain) under
own language and expect the same results; for example, the             research project P06-TIC-01433.
same concept for “food processing” will be returned for an
English user querying “food” and a Spanish user querying               5.   REFERENCES
for “alimento” (part of the compound noun “Procesado de                 [1] A. Aho, R. Sethi, and J. Ullman. Compilers: principles,
los alimentos”).                                                            techniques, and tools. Reading, MA, 1986.
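
The reverse lookup from a query string to URIs can be sketched as an index over lemmas and word forms. The example is in Python and does not use the LexInfo API. The entry list, the resource URI and the choice to list the component words of a compound as additional forms are assumptions made for illustration.

```python
from collections import defaultdict

# (URI, language, lemma, further word forms). Illustrative data only;
# components of compounds are listed as extra forms for simplicity.
ENTRIES = [
    ("http://dbpedia.org/resource/Food_processing", "en",
     "food processing", ["food", "processing"]),
    ("http://dbpedia.org/resource/Food_processing", "es",
     "procesado de los alimentos", ["alimento", "alimentos"]),
]


def build_index(entries):
    """Index every lemma and word form by language (hypothetical helper)."""
    index = defaultdict(set)
    for uri, lang, lemma, forms in entries:
        for form in [lemma, *forms]:
            index[(lang, form.casefold())].add(uri)
    return index


def resolve(index, text, lang):
    """Return the URIs whose lexical entries match `text` in `lang`."""
    return index.get((lang, text.casefold()), set())


INDEX = build_index(ENTRIES)
english = resolve(INDEX, "food", "en")
spanish = resolve(INDEX, "Alimento", "es")
```

Both lookups yield the same set of URIs, which can then be passed to the query manager in place of the language-specific strings.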
                                                                        [2] C. Bizer, R. Lee, and E. Pietriga. Fresnel: A
3.3   CLOVA for company search                                              browser-independent presentation vocabulary for rdf.
   We developed a search interface for querying data about                  In Proceedings of the Second International Workshop
companies using CLOVA, which is available at http://www.                    on Interaction Design and the Semantic Web, Galway,
sc.cit-ec.uni-bielefeld.de/clova/demo. For this appli-                      Ireland. Citeseer, 2005.
cation we used data drawn from the DBPedia ontology, which              [3] J. Broekstra, A. Kampman, and F. Van Harmelen.
we entered into a Sesame store. We used the labels of the                   Sesame: A generic architecture for storing and querying
URIs to generate the lexicon model for English, and used                    rdf and rdf schema. Lecture Notes in Computer
the translations provided by DBPedia’s wikipage links (them-                Science, pages 54–68, 2002.
selves derived from WikiPedia’s “other languages” links), to            [4] P. Buitelaar, P. Cimiano, P. Haase, and M. Sintek.
provide labels in German and Spanish. As properties were                    Towards linguistically grounded ontologies. In
not translated in this way, the translations for these elements             Proceedings of the European Semantic Web Conference
were manually provided. These translations were converted                   (ESWC), pages 111–125, 2009.
into a LexInfo model through the use of about 100 LILAC                 [5] G. Francopoulo, N. Bel, M. George, N. Calzolari,
rules. About 20 of these rules were selected to provide                     M. Monachini, M. Pet, and C. Soria. Lexical markup
lexicalisation for the company search application. In                       framework (LMF) for NLP multilingual resources. In
addition, we selected the form properties and output                        Proceedings of the workshop on multilingual language
visualisations by producing a semantic form specification                   resources and interoperability, pages 1–8. Association
as well as an output specification. These were rendered by                  for Computational Linguistics, 2006.
the default elements of the CLOVA HTML modules, and the                 [6] A. S. M. L. Hoque and M. Arefin. Multilingual data
appearance was further modified by specifying a CSS                         management in database environment. Malaysian
style-sheet. In general, the process of adapting CLOVA                      Journal of Computer Science, 22(1):44–63, 2009.
involves creating a lexicon, which could be a LexInfo model
or a simpler representation
such as RDFS’s label property, and then producing the                   [7] A. Katifori, C. Halatsis, G. Lepouras, C. Vassilakis,
semantic form specification and output specification. Adapting              and E. Giannopoulou. Ontology visualization methods:
CLOVA to a different output format or data back end                         a survey. ACM Computing Surveys (CSUR), 39(4):10,
requires implementing only a modest set of interfaces in Java.              2007.
                                                                        [8] A. Kumaran and J. R. Haritsa. On database support
                                                                            for multilingual environments. In Proceedings of the
4.    CONCLUSION                                                            IEEE RIDE Workshop on Multilingual Information
  We have presented an architecture for querying semantic                   Management, 2003.
data in multiple languages. We started by providing methods             [9] A. Miles, B. Matthews, M. Wilson, and D. Brickley.
to specify the creation of forms, the querying of the results               SKOS Core: Simple knowledge organisation for the
and presentation of the results in a language-independent                   web. In Proceedings of the International Conference on
manner through the use of URIs and XML specifications.                      Dublin Core and Metadata Applications, pages 12–15,
By creating this modular framework we provide an interop-                   2005.
erable language-independent description of the data, which             [10] C. Peters, T. Deselaers, N. Ferro, J. Gonzalo, G. F.
could be used in combination with a lexicalisation module                   Jones, M. Kurimo, T. Mandl, A. Peñas, and V. Petras.
to enable multilingual search and querying. We then sepa-                   Evaluating Systems for Multilingual and Multimodal
rated the data source into a language-independent data layer                Information Access, volume 5706. Springer, 2008.
and a language-dependent lexical layer, which allows us to
modularise each language and make the lexical information
available separately on the semantic web. In this way we
achieved all the requirements we set out in Figure 1. We
described an implementation of this framework, which was
designed to transform abstract specifications of the data into
HTML pages available on the web and performed its lexi-
calisations by the use of LexInfo lexicon ontology models [4]

