CLOVA: An Architecture for Cross-Language Semantic Data Querying

John McCrae
Semantic Computing Group, CITEC, University of Bielefeld, Bielefeld, Germany
jmccrae@cit-ec.uni-bielefeld.de

Jesús R. Campaña
Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
jesuscg@decsai.ugr.es

Philipp Cimiano
Semantic Computing Group, CITEC, University of Bielefeld, Bielefeld, Germany
cimiano@cit-ec.uni-bielefeld.de

ABSTRACT

Semantic web data formalisms such as RDF and OWL allow us to represent data in a language independent manner. However, so far there is no principled approach allowing us to query such data in multiple languages. We present CLOVA, an architecture for cross-lingual querying that aims to address this gap. In CLOVA, we make a distinction between a language independent data layer and a language independent lexical layer. We show how this distinction allows us to create modular and extensible cross lingual applications that need to access semantic data. We specify the search interface at a conceptual level using what we call a semantic form specification abstracting from specific languages. We show how, on the basis of this conceptual specification, both the query interface and the query results can be localized to any supported language with almost no effort. More generally, we describe how the separation of the lexical layer can be used with a principled ontology lexicon model (LexInfo) in order to produce application-specific lexicalisations of properties, classes and individuals contained in the data.

Categories and Subject Descriptors

H.5.m [Information Interfaces and Presentation]: User Interfaces; I.2.1 [Artificial Intelligence]: Applications and Expert Systems; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods; I.2.7 [Artificial Intelligence]: Natural Language Processing

General Terms

Design, Human Factors, Languages

Keywords

Multilingual Semantic Web, Ontology Localisation, Software Architecture

Copyright is held by the author/owner(s).
WWW2010, April 26-30, 2010, Raleigh, North Carolina.

1. INTRODUCTION

Data models and knowledge representation formalisms in the Semantic Web allow us to represent data without reference to natural language¹. In order to facilitate the interaction of human users with semantic data, supporting language-based interfaces in multiple languages is crucial. However, currently there is no principled approach supporting the access of semantic data across multiple languages. To fill this gap, we present in this paper an architecture we call CLOVA (Cross-Lingual Ontology Visualisation Architecture) designed for querying semantic data in multiple languages. A developer of a CLOVA application can define the search interface independently of any natural language by referring to ontological relations and classes within a semantic form specification (SFS), which represents a declarative and conceptual representation of the search interface with respect to a given ontology. We have designed an XML-based language which is inspired by the Fresnel language [2] for this purpose. The search interface can then be automatically localised by the use of a lexicon ontology model such as LexInfo [4], enabling the system to automatically generate the form in the appropriate language. The queries to the semantic repository are generated on the basis of the information provided in the SFS and the results of the query can be localised using the same method as used for the localisation of the search interface. The CLOVA framework is generic in the sense that it can be quickly customised to new scenarios, new ontologies and search forms and additional languages can be added without changing the actual application, even at run time if we desire.

The paper is organised as follows. Section 2 describes state of the art on information access across languages and points out basic requirements for cross lingual systems. Section 3 describes the CLOVA framework for rapid development of cross-lingual search applications accessing semantic data. We conclude in Section 4.

¹ This holds mainly for RDF triples with resources as subjects and objects. String data-type elements are often language-specific.

2. RELATED WORK

Providing access to information across languages is an important topic in a number of research fields. While our work is positioned in the area of the Semantic Web, we discuss work related to a number of other research areas, including databases, cross-language information retrieval as well as ontology presentation and visualisation.

2.1 Database Systems

Supporting cross-language data access is an important topic in the area of database systems, albeit one which has not received very prominent attention (see [8]). An important issue is certainly the one of character encoding as we need to represent characters for different languages. However, most of the current database systems support Unicode so that this issue is not a problem anymore. A more complex issue is the representation of content in the database in such a way that information can be accessed across languages. There seems to be no consensus so far on what the optimal representation of information would be such that cross-language access can be realised effectively and efficiently. One of the basic requirements for multilingual organisation of data mentioned by Kumaran et al. [8] is the following:

“The basic multilingual requirement is that the database system must be capable of storing data in multiple languages.”

This requirement seems definitely too strict to us as it assumes that the representation of data is language-dependent and that the database is supposed to store the data in multiple languages. This rules out language-independent approaches which do not represent language-specific information in the database at all.

The following requirement by Kumaran et al. is one we can directly adhere to:

Requirement 1 (Querying in multiple languages)
Data must be queriable using query strings in any (supported) language.

In fact, we will refer to the above as Requirement 1a and add the following closely related Requirement 1b: ‘The results of a query should also be presented in any (supported) language.’ Figure 1 summarises all the requirements discussed in this section. However, it does not strictly follow from this that the data should be stored in multiple languages in the database. In fact, it suffices that the front end that users interact with supports different languages and is able to translate the user’s input into a formal (language-independent) query and localises the results returned by the database management system (DBMS) into any of the supported languages.

A further important requirement by Kumaran et al. we subscribe to is related to interoperability:

Requirement 2 (Interoperability)
The multilingual data must be represented in such a way that it can be exchanged across systems.

This feature is certainly desirable. We will come back to this requirement in the context of our discussion of the Semantic Web (see below). The next two requirements mentioned by Kumaran et al. are in our view questionable as they assume that the DBMS itself has built-in support for multiple languages:

• String equality across scripts: A multilingual database system should support lexical joins allowing to join information in different tables even if the relevant attributes of the join are in different scripts.

• Linguistic equivalences: Multilingual database systems should support linguistic joins which exploit predefined mappings between attributes and values across languages. For example, we might state explicitly that the attributes “marital status” (in English) and “Familienstand” are equivalent and that the values “married” and “verheiratet” are equivalent.

In fact, those two requirements follow from Kumaran et al.’s assumption that the database should store the data in multiple languages. If this is the case then we certainly have to push all the cross-language querying functionality into the DBMS itself. This is rather undesirable from our point of view as every time a new language is added to the system, the DBMS needs to be modified to extend the linguistic and lexical equivalences. Further, the data is stored redundantly (once for every language supported). Therefore, we actually advocate a system design where the data is stored in a language-independent fashion and the cross-lingual querying functionality as well as result localisation is external to the DBMS itself, implemented as pre- and post-processing steps, respectively.

In fact, we would add the following requirement to any system allowing access to data across languages:

Requirement 3 (Language Modularity)
The addition of further languages should be modular in the sense that it should not require the modification of the DBMS or influence the other languages supported by the system.

As a consequence, the capability of querying data across languages should not be specific to a certain implementation of a DBMS but work for any DBMS supporting the data model in question.

One of the important issues in representing information in multiple languages is avoiding redundancy (see [6]). Hoque et al. indeed propose a schema to give IDs to every piece of information and then include the language information in a dictionary table. This is perfectly in line with Semantic Web data models (RDF in particular) where URIs are used to uniquely identify resources. Dictionaries can then be constructed expressing how the elements represented by the URIs are referred to across languages. This makes it possible to conceptually separate the data from the dictionary. This is a crucial distinction that CLOVA also adheres to (see below).

2.2 Cross-language Information Retrieval

In the field of information retrieval, information access across languages has also been an important topic, mainly in the context of the so called Cross-Language Evaluation Forum² (see [10] for the proceedings of CLEF 2008). Cross-language information retrieval (CLIR) represents an extreme case of the so called vocabulary mismatch problem well-known from information retrieval. The problem, in short, is the fact that a document can be highly relevant to a query in spite of not having any words in common with the query. CLIR represents an extreme case in the sense that if a query and a document are in different languages, then the word overlap and consequently every vector-based similarity measure will be zero.

In CLIR, the retrieval unit is the document, while in database systems the retrieval unit corresponds to the information units stored in the database. Therefore, the requirements with respect to multilinguality are rather different for CLIR and multilingual database systems.

² http://www.clef-campaign.org/

2.3 Semantic Web

Multilinguality has been so far an underrepresented topic in the Semantic Web field. While on the Semantic Web we encounter similar problems as in the case of databases, there are some special considerations and requirements. We will consider further important requirements for multilinguality in the context of the Semantic Web. Before doing so, we introduce the crucial distinction between the data layer (proper) and the lexical layer. We will see below that the conceptual separation between the data and the dictionary is even more important in the context of the Semantic Web. According to our distinction, the data layer contains the application-relevant data while the lexical layer merely contains information about how the data is realised/expressed in different languages and acts like a dictionary. We note that this distinction is a conceptual one as the data in both layers can be stored in the same DBMS. However, this might not always be possible in a decentralised system such as the Semantic Web:

Requirement 4 (Data and Lexicon Separation)
We require a clear separation between the data and lexicon layer in the Semantic Web. The addition of further languages should be possible without modifying the data layer. This means that the proper data layer and the lexical layer are cleanly separated and data is not stored redundantly.

In the Semantic Web, the parties interested in accessing a certain data source are not necessarily its owners (in contrast to standard centralised database systems as considered by Kumaran et al.). As a corollary it follows that if a user requires access to a data source in language x he might not have the permission to enrich the data source by data represented in the language x.

A further relevant requirement in the context of the Semantic Web is the following:

Requirement 5 (Sharing of Lexica)
Lexica should be represented declaratively and in a form which is independent of specific applications such that it can be shared.

It is very much in the spirit of the Semantic Web that information should be interoperable and thus reusable beyond specific applications. Following this spirit, it seems desirable that (given that data representation is language-independent) the language-specific information how certain resources are expressed in various languages can be shared across systems. This can be accomplished by declaratively described lexica which can be shared.

Multilinguality has been approached in RDF through the use of its label property, which can assign labels with language annotations to URIs. The SKOS framework [9] further expands on this by use of prefLabel, altLabel, hiddenLabel. These formalisms are sufficient for providing simple representation of language information. However, as more complex lexico-syntactic information is required, in turn more complex representations are necessary. A more formal distinction of the “data layer” and “lexical layer” is provided by lexicon ontology models of which the most prominent models are the Linguistic Information Repository (LIR) and LexInfo (see [4]).

Req. No    Implication                                  Status
Req. 1a    Querying in multiple languages               REQUIRED
Req. 1b    Result localisation in multiple languages    REQUIRED
Req. 2     Data interoperability                        REQUIRED
Req. 3     Language modularity                          REQUIRED
Req. 4a    Separation between data and lexical layer    DESIRED TO SUPPORT Req. 3
Req. 4b    Language-independent data representation     DESIRED TO AVOID REDUNDANCY
Req. 5     Declarative representation of lexica         DESIRED FOR SHARING LEXICAL INFORMATION

Figure 1: Requirements for multilingual organisation of data

2.4 Ontology Presentation and Visualisation

Fresnel [2] is a display vocabulary that describes methods of data presentation in terms of lenses and formats. In essence the lens in Fresnel selects which values are to be displayed and the format selects the formatting applied to each part of the lens. This provides many of the basic tools for presenting semantic web data. However it does not represent multilinguality within the vocabulary and it is not designed to present a queriable interface to the data. There exist many forms of ontology visualisation methods through the use of trees, and other structures to display the data contained within the ontology, a survey of which is provided in [7]. These are of course focussed mainly on displaying the structure of the ontology and do not attempt to convert the ontology to natural language. Furthermore, for very large data sources, it is impractical to visualise the whole ontology at one time and hence we wish only to select a certain section of it and hence require a query interface to perform this task.

3. MULTILINGUAL ACCESS AND QUERYING USING CLOVA

CLOVA addresses the problem of realising localised search interfaces on top of language-independent data sources, abstracting the work flow and design of a search engine and providing the developer with a set of tools to define and develop a new system with relatively little effort. CLOVA abstracts lexicalisation and data storage as services, providing a certain degree of independence from data sources and multilingual representation models.

The different modules of the system have been designed with the goal of providing very specific, non-overlapping and independent tasks to developers working on the system deployment concurrently. User interface definition tasks are completely separated from data access and lexicalisation, allowing developers of each module to use different resources as required.

CLOVA as an architecture does not fulfil any of the aforementioned requirements (as they should be fulfilled by lexicalisation services), but provides a framework to fully exploit cross-lingual services meeting these requirements. The application design makes it possible to separate conceptual representations from language-dependent lexical representations, making user interfaces completely language independent in order to later localise them to any supported language.

3.1 System Architecture

The CLOVA architecture is designed to enable the querying of semantic data in a language of choice, while still presenting queries to the data source in a language-independent form. CLOVA is modular, reusable and extensible and as such is easily configured to adapt to different data sources, user interfaces and localisation tools³.

³ A Java implementation of CLOVA is available at http://www.sc.cit-ec.uni-bielefeld.de/clova/

Figure 2: CLOVA general architecture

Figure 2 depicts the general architecture of CLOVA and its main modules. The form displayer is a module which translates the semantic form specification into a displayable format, for example HTML. Queries are performed by the query manager and then the results are displayed to the user using the output displayer module. All of the modules use the lexicaliser module to convert the conceptual descriptions (i.e., URIs) to and from natural language. Each of these modules is implemented independently and can be exchanged or modified without affecting the other parts of the system.

We assume that we have a data source consisting of a set of properties referenced by URIs and whose values are also URIs or language-independent data values. We shall also assume that there are known labels for each such URI and each language supported by the application. If this separation between the lexical layer and the data layer does not already exist, we introduce elements to create this separation. It is often necessary to apply such manual enrichment to a data source, as it is not trivial to identify which strings in the data source are language-dependent, however we find that it is often a simple task to perform by identifying which properties have language-dependent ranges, or by using XML’s language attribute.

We introduce an abstract description of a search interface by way of XML called a semantic form specification. It specifies the relevant properties that can be queried by using the URIs in the data source, thus abstracting from any natural language. We show how this can be used to display a form to the user and to generate appropriate queries once he/she has filled in the form. The query manager provides a back-end that allows us to convert our queries using information in the form into standard query languages such as SPARQL and SQL. Finally, we introduce a lexicalisation component, which is used to translate between the language-independent forms specified by the developer and the localised forms presented to the user. We describe a lexicaliser which builds on a complex lexicon model and demonstrate that it can provide more flexibility with respect to the context and complexity of the results we wish to lexicalise.

3.2 Modules

3.2.1 Semantic Form Specification

One of the most important aspects of the architecture is the Semantic Form Specification (SFS), which contains all the necessary information to build a user interface to query the ontology. In the SFS the developer specifies the ontology properties to be queried by the application via their URIs. This consists of a form for which we specify a domain, i.e., the class of objects we are querying as defined in the database by an RDF type declaration or similar. If this is omitted we simply choose all individuals in the data source. The SFS essentially consists of a list of fields which are to be used to query the ontology. Each field contains the following information:

• Name: An internal identifier is used to name the input fields for HTML and HTTP requests.

• Query output: This defines whether this field will be included in the results. Valid values are always, never, ask (the user could decide whether to include the field in the results or not), if empty (if the field has not been queried it is included in the output), if queried (if the field is queried, it is included in the output) and ask default selected (the user decides, but as default the field will be shown).

• Property: represents the URI for the ontology property to be queried through the field. An indication of reference=self in place of a URI means that we are querying the domain of the search. Such queries are useful for querying the lexicalisation of the object being queried or limiting the query to a fixed set of objects.

• Property Range: We define a number of types (called property ranges) that describe the data that a field can handle. It differs from the data types of RDF or similar in that we also describe how the data should be queried as well. For example, while it is possible to describe both the revenue of a company and the age of an employee as integers in the database, it is not sensible to query revenue as a single value, whereas it is often useful to query age as a single value. These property ranges provide an abstraction of these properties in the data and thus support the generation of appropriate forms and queries. The following property ranges are built into CLOVA:

  – String, Numeric, Integer, Date: Simple data-type values. Note that String is intended for representing language-independent strings, e.g. IDs, not natural language strings. The numeric and date ranges are used to query precise values like “age” and “birth date”.

  – Range, Segment, Set: These are defined relative to another property range and specify how a user can query the property in question. Range specifies that the user should query the data by providing an upper and/or lower bound, e.g. “revenue”, “number of employees”. Segment is similar but requires that the developer divides the data up into pre-defined intervals. Set allows the developer to specify a fixed set of queriable values, e.g. “marital status”.

  – Lexicalised Element: Although we assume all data in the source is defined by URIs, it is obviously de- [...] the result to still be extracted from the data source and included in the results.

  The described property ranges are supported natively by CLOVA, but it is also possible to define new property ranges and include them in the SFS XML document. The appropriate implementation for a form display element that can handle the newly defined property range has to be provided of course (see Section 3.2.2).

• Rendering Properties: There is often information for a particular rendering that cannot be provided in the description of the property ranges alone. Thus, we allow for a set of context specific properties to be passed to the rendering engine. Examples of these include the use of auto-completion features or an indication of the type of form element to display, e.g. a Set can be displayed as a drop-down list, or as a radio button selection.

The SFS document is in principle similar to the concept of a “lens” in the Fresnel display vocabulary [2] in that it describes the set of fields in the data that should be used for display and querying. However, by including more information about methods for querying the data, we provide a description that can be used for both presentation and querying of the data.

Example: Suppose that we want to build a small web application that queries an ontology with information about companies stored in an RDF repository. The application should ask for company names, companies’ revenue, and company locations. The syntax of a SFS XML document for that application is shown below:

3.2.2 Form Displayer

The form displayer consists of a set of form display elements defined for each property range. It processes the SFS by using these elements to render the fields in a given order. The implementation of these elements is dependent on the output method. The form display elements are rendered using Java code to convert the document to XHTML⁴.

⁴ The CLOVA project also provides XSLT files to perform the same task.

Figure 3: HTML form generated for a SFS document

Figure 3 shows an example of rendering of an SFS which includes the fields in the example above. In this rendering the field “name” is displayed as a text field as it refers to the lexicalisation of this company. The location of a company for instance is represented as a text field. However, in spite of the fact that the data is represented in the data source as a language independent URI, the user can query by specifying the name of the resource in their own language (e.g., a German user querying “München” receives the same results as an English user querying “Munich”). Finally, the revenue is asserted as a continuous value which is queried by specifying a range and is thus rendered with two inputs allowing the user to specify the upper and/or lower bounds of their query. A minimum value on this range allows for client-side data consistency checks. In addition, check boxes are appended to fields in order to allow users to decide if the fields will be shown in the results, according to the output parameter in the SFS.

3.2.3 Query Manager

Once the form is presented to the user, he or she can fill the fields and select which properties he or she wishes to visualise in the results. When the query form is sent to the Query Manager, it is translated into a specific query for a particular knowledge base. We have provided modules to support the use of SQL queries using JDBC and SPARQL queries using Sesame [3]. We created an abstract query interface which can be used to specify the information required in a manner that is easy to convert to the appropriate query language allowing us to change the knowledge base, ontology and back end without major problems. The query also needs to be preprocessed using the lexicaliser due to the presence of language-specific terms introduced by the user which need to be converted to language independent URIs.

3.2.4 Output Displayer

[...]mative and difficult to display due to the scale of values. For this reason we provide an Output Specification to define the set of available display elements and sets of values they can display. These output specifications consist of a list of output elements described as follows:

• ID: Internal identifier of the output element displayed.

• URI: A reference to the output resource specified as a URI.⁵

• Fields: The set of fields used by this element. These should correspond by name to elements in the SFS.

• Display properties: Additional parameters passed to the display element to modify its behaviour. Some of these parameters include the possibility to ignore incomplete data, or to define the subtypes of a chart to display. These parameters are class-dependent so that each output element has its own set of valid parameters.

⁵ These can reference Java classes by linking to the appropriate class file or location in a JAR file.

Figure 4: HTML result page for the example

The following output specification defines two output elements to show results.

The first element displays a table containing all the results returned by the query, while the second output element shows a bar chart for the property “Revenue”. The HTML output generated for a given output specification containing the above mentioned descriptions is shown in Figure 4.

3.2.5 Lexicaliser

Simple lexicon models can be provided by language annotations, for example RDF’s label and SKOS’s prefLabel, and developing a lexicaliser is then as simple as looking up these labels for the given resource URI. This approach may be suitable for some tasks. However, we sometimes require lexicalisation using extra information about the context and would like to provide lexicalisation of more than just URIs, e.g. when lexicalising triples. While RDF labels can be attached to properties and individuals for instance, there is no mechanism that allows us to compute a lexicalisation for a triple by composing together the labels of the property and the individuals. This is a complex problem and we will leave a full investigation and evaluation of this for future work.

Figure 5: A simplified example of a LexInfo aggregate

Furthermore, it is often desirable to have fine control over the form of the lexicalisation, for example, the ontology la- [...] the interested reader is referred to [4].

LILAC: In order to produce lexicalisations of ontology elements from a LexInfo model we use a simple rule language included with the LexInfo API called LILAC (LexInfo Label Analysis & Construction). A LILAC rule set describes the structure of labels and can be used for both generating the lexicon from labels and generating lexicalisations from the lexicon. In general we assume that lexicons are generated from some set of existing labels, which may be extracted from annotations in the data source, e.g., RDFS’s label, from the URIs in the ontology or from automatic translations of these labels from another language. The process of generating aggregates from raw labels requires that first the part of speech tags are identified by a tagger such as TreeTagger. Then, the part-of-speech tagged labels are parsed using a LR(1)-based parser (see [1]). The API then handles these parse trees and converts them into LexInfo aggregates.

LILAC rules are implemented in a symmetric manner so that they can be used to both generate the aggregates in the lexicon ontology model (e.g. by analysing the labels of a given ontology) as well as lexicalise those aggregates.

A simple example rule for a label such as “revenue of” is:

Noun_NounPP ->
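As a side illustration of the simple label-lookup approach mentioned at the start of Section 3.2.5 (and not of LILAC itself), the following minimal Java sketch shows how language-tagged labels can map URIs to text and user input back to URIs. It is not CLOVA's or LexInfo's actual API: the class `LabelLexicaliser`, its method names and the `http://example.org/...` URI are hypothetical, and it assumes at most one label per URI and language.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical label-based lexicaliser (illustrative only, not the CLOVA API). */
public class LabelLexicaliser {
    // language tag -> (URI -> label), used to localise query results
    private final Map<String, Map<String, String>> labelsByLang = new HashMap<>();
    // language tag -> (normalised label -> URI), used to pre-process user input
    private final Map<String, Map<String, String>> urisByLang = new HashMap<>();

    /** Registers one label for a URI in a given language, e.g. from an rdfs:label annotation. */
    public void addLabel(String uri, String lang, String label) {
        labelsByLang.computeIfAbsent(lang, k -> new HashMap<>()).put(uri, label);
        urisByLang.computeIfAbsent(lang, k -> new HashMap<>()).put(normalise(label), uri);
    }

    /** URI -> natural language: returns the label in the requested language, if known. */
    public Optional<String> lexicalise(String uri, String lang) {
        Map<String, String> labels = labelsByLang.get(lang);
        return labels == null ? Optional.empty() : Optional.ofNullable(labels.get(uri));
    }

    /** Natural language -> URI: resolves user input to a language-independent URI, if known. */
    public Optional<String> resolve(String userInput, String lang) {
        Map<String, String> uris = urisByLang.get(lang);
        return uris == null ? Optional.empty() : Optional.ofNullable(uris.get(normalise(userInput)));
    }

    // Assumption: matching ignores case and surrounding whitespace only.
    private static String normalise(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        LabelLexicaliser lex = new LabelLexicaliser();
        String munich = "http://example.org/resource/Munich"; // hypothetical URI
        lex.addLabel(munich, "en", "Munich");
        lex.addLabel(munich, "de", "M\u00fcnchen"); // "München"

        // Both users reach the same language-independent URI.
        System.out.println(lex.resolve("M\u00fcnchen", "de").equals(lex.resolve("Munich", "en")));
        // Results can be localised back into either language.
        System.out.println(lex.lexicalise(munich, "de").orElse(munich));
    }
}
```

Keeping the two maps outside the data source mirrors the separation of data layer and lexical layer: adding a language only adds entries here and leaves the URIs untouched. Lexicalising whole triples, as discussed above, is beyond what such a lookup can do.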