CLOVA: An Architecture for Cross-Language Semantic Data Querying

John McCrae (Semantic Computing Group, CITEC, University of Bielefeld, Bielefeld, Germany; jmccrae@cit-ec.uni-bielefeld.de)
Jesús R. Campaña (Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain; jesuscg@decsai.ugr.es)
Philipp Cimiano (Semantic Computing Group, CITEC, University of Bielefeld, Bielefeld, Germany; cimiano@cit-ec.uni-bielefeld.de)

Copyright is held by the author/owner(s). WWW2010, April 26-30, 2010, Raleigh, North Carolina.

ABSTRACT
Semantic Web data formalisms such as RDF and OWL allow us to represent data in a language-independent manner. However, so far there is no principled approach allowing us to query such data in multiple languages. We present CLOVA, an architecture for cross-lingual querying that aims to address this gap. In CLOVA, we make a distinction between a language-independent data layer and a language-dependent lexical layer. We show how this distinction allows us to create modular and extensible cross-lingual applications that need to access semantic data. We specify the search interface at a conceptual level using what we call a semantic form specification, abstracting from specific languages. We show how, on the basis of this conceptual specification, both the query interface and the query results can be localised to any supported language with almost no effort. More generally, we describe how the separation of the lexical layer can be used with a principled ontology lexicon model (LexInfo) in order to produce application-specific lexicalisations of properties, classes and individuals contained in the data.

Categories and Subject Descriptors
H.5.m [Information Interfaces and Presentation]: User Interfaces; I.2.1 [Artificial Intelligence]: Applications and Expert Systems; I.2.4 [Artificial Intelligence]: Knowledge Representation Formalisms and Methods; I.2.7 [Artificial Intelligence]: Natural Language Processing

General Terms
Design, Human Factors, Languages

Keywords
Multilingual Semantic Web, Ontology Localisation, Software Architecture

1. INTRODUCTION
Data models and knowledge representation formalisms in the Semantic Web allow us to represent data without reference to natural language (this holds mainly for RDF triples with resources as subjects and objects; string data-type elements are often language-specific). In order to facilitate the interaction of human users with semantic data, supporting language-based interfaces in multiple languages is crucial. However, there is currently no principled approach supporting the access of semantic data across multiple languages. To fill this gap, we present in this paper an architecture we call CLOVA (Cross-Lingual Ontology Visualisation Architecture), designed for querying semantic data in multiple languages. A developer of a CLOVA application can define the search interface independently of any natural language by referring to ontological relations and classes within a semantic form specification (SFS), which represents a declarative and conceptual representation of the search interface with respect to a given ontology. We have designed an XML-based language inspired by the Fresnel language [2] for this purpose. The search interface can then be automatically localised by the use of a lexicon ontology model such as LexInfo [4], enabling the system to automatically generate the form in the appropriate language. The queries to the semantic repository are generated on the basis of the information provided in the SFS, and the results of the query can be localised using the same method as used for the localisation of the search interface. The CLOVA framework is generic in the sense that it can be quickly customised to new scenarios, new ontologies and search forms, and additional languages can be added without changing the actual application, even at run time if we desire.
The paper is organised as follows. Section 2 describes the state of the art on information access across languages and points out basic requirements for cross-lingual systems. Section 3 describes the CLOVA framework for rapid development of cross-lingual search applications accessing semantic data. We conclude in Section 4.

2. RELATED WORK
Providing access to information across languages is an important topic in a number of research fields. While our work is positioned in the area of the Semantic Web, we discuss work related to a number of other research areas, including databases, cross-language information retrieval as well as ontology presentation and visualisation.

2.1 Database Systems
Supporting cross-language data access is an important topic in the area of database systems, albeit one which has not received very prominent attention (see [8]). An important issue is certainly the one of character encoding, as we need to represent characters for different languages. However, most of the current database systems support Unicode, so this issue is not a problem anymore. A more complex issue is the representation of content in the database in such a way that information can be accessed across languages. There seems to be no consensus so far on what the optimal representation of information would be such that cross-language access can be realised effectively and efficiently. One of the basic requirements for multilingual organisation of data mentioned by Kumaran et al. [8] is the following:

"The basic multilingual requirement is that the database system must be capable of storing data in multiple languages."

This requirement seems definitely too strict to us, as it assumes that the representation of data is language-dependent and that the database is supposed to store the data in multiple languages.
This rules out language-independent approaches which do not represent language-specific information in the database at all.
The following requirement by Kumaran et al. is one we can directly adhere to:

Requirement 1 (Querying in multiple languages)
Data must be queriable using query strings in any (supported) language.

In fact, we will refer to the above as Requirement 1a and add the following closely related Requirement 1b: "The results of a query should also be presented in any (supported) language." Figure 1 summarises all the requirements discussed in this section. However, it does not strictly follow from this that the data should be stored in multiple languages in the database. In fact, it suffices that the front end that users interact with supports different languages and is able to translate the user's input into a formal (language-independent) query and to localise the results returned by the database management system (DBMS) into any of the supported languages.
A further important requirement by Kumaran et al. we subscribe to is related to interoperability:

Requirement 2 (Interoperability)
The multilingual data must be represented in such a way that it can be exchanged across systems.

This feature is certainly desirable. We will come back to this requirement in the context of our discussion of the Semantic Web (see below). The next two requirements mentioned by Kumaran et al. are in our view questionable, as they assume that the DBMS itself has built-in support for multiple languages:

• String equality across scripts: A multilingual database system should support lexical joins which allow information in different tables to be joined even if the relevant attributes of the join are in different scripts.

• Linguistic equivalences: Multilingual database systems should support linguistic joins which exploit pre-defined mappings between attributes and values across languages. For example, we might state explicitly that the attributes "marital status" (in English) and "Familienstand" (in German) are equivalent and that the values "married" and "verheiratet" are equivalent.

In fact, those two requirements follow from Kumaran et al.'s assumption that the database should store the data in multiple languages. If this is the case, then we certainly have to push all the cross-language querying functionality into the DBMS itself. This is rather undesirable from our point of view, as every time a new language is added to the system, the DBMS needs to be modified to extend the linguistic and lexical equivalences. Further, the data is stored redundantly (once for every language supported). Therefore, we actually advocate a system design where the data is stored in a language-independent fashion and the cross-lingual querying functionality as well as the result localisation are external to the DBMS itself, implemented as pre- and post-processing steps, respectively.
In fact, we would add the following requirement to any system allowing data to be accessed across languages:

Requirement 3 (Language Modularity)
The addition of further languages should be modular in the sense that it should not require the modification of the DBMS or influence the other languages supported by the system.

As a consequence, the capability of querying data across languages should not be specific to a certain implementation of a DBMS but should work for any DBMS supporting the data model in question.
One of the important issues in representing information in multiple languages is avoiding redundancy (see [6]). Hoque and Arefin indeed propose a schema that gives IDs to every piece of information and then includes the language information in a dictionary table. This is perfectly in line with Semantic Web data models (RDF in particular), where URIs are used to uniquely identify resources. Dictionaries can then be constructed expressing how the elements represented by the URIs are referred to across languages. This thus allows us to conceptually separate the data from the dictionary. This is a crucial distinction that CLOVA also adheres to (see below).

2.2 Cross-language Information Retrieval
In the field of information retrieval, information access across languages has also been an important topic, mainly in the context of the so-called Cross-Language Evaluation Forum (http://www.clef-campaign.org/; see [10] for the proceedings of CLEF 2008). Cross-language information retrieval (CLIR) represents an extreme case of the so-called vocabulary mismatch problem well known from information retrieval. The problem, in short, is the fact that a document can be highly relevant to a query in spite of not having any words in common with the query. CLIR represents an extreme case in the sense that if a query and a document are in different languages, then the word overlap, and consequently every vector-based similarity measure, will be zero.
In CLIR, the retrieval unit is the document, while in database systems the retrieval unit corresponds to the information units stored in the database. Therefore, the requirements with respect to multilinguality are rather different for CLIR and multilingual database systems.

2.3 Semantic Web
Multilinguality has so far been an underrepresented topic in the Semantic Web field. While on the Semantic Web we encounter similar problems as in the case of databases, there are some special considerations and requirements. We will consider further important requirements for multilinguality in the context of the Semantic Web. Before doing so, we introduce the crucial distinction between the data layer (proper) and the lexical layer. We will see below that the conceptual separation between the data and the dictionary is even more important in the context of the Semantic Web. According to our distinction, the data layer contains the application-relevant data, while the lexical layer merely contains information about how the data is realised/expressed in different languages and acts like a dictionary. We note that this distinction is a conceptual one, as the data in both layers can be stored in the same DBMS. However, this might not always be possible in a decentralised system such as the Semantic Web:

Requirement 4 (Data and Lexicon Separation)
We require a clear separation between the data and lexicon layer in the Semantic Web. The addition of further languages should be possible without modifying the data layer. This means that the proper data layer and the lexical layer are cleanly separated and data is not stored redundantly.

In the Semantic Web, the parties interested in accessing a certain data source are not necessarily its owners (in contrast to standard centralised database systems as considered by Kumaran et al.). As a corollary it follows that if a user requires access to a data source in language x, he might not have the permission to enrich the data source with data represented in the language x.
A further relevant requirement in the context of the Semantic Web is the following:

Requirement 5 (Sharing of Lexica)
Lexica should be represented declaratively and in a form which is independent of specific applications such that they can be shared.

It is very much in the spirit of the Semantic Web that information should be interoperable and thus reusable beyond specific applications. Following this spirit, it seems desirable that (given that the data representation is language-independent) the language-specific information about how certain resources are expressed in various languages can be shared across systems. This can be accomplished by declaratively described lexica which can be shared.
Multilinguality has been approached in RDF through the use of its label property, which can assign labels with language annotations to URIs. The SKOS framework [9] further expands on this by use of prefLabel, altLabel and hiddenLabel. These formalisms are sufficient for providing a simple representation of language information. However, as more complex lexico-syntactic information is required, in turn more complex representations are necessary. A more formal distinction of the "data layer" and "lexical layer" is provided by lexicon ontology models, of which the most prominent models are the Linguistic Information Repository (LIR) and LexInfo (see [4]).
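As an illustration of this simple label-based approach, language-tagged labels can be attached directly to a language-independent URI; the following Turtle sketch uses an illustrative DBpedia-style resource URI:

  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix skos: <http://www.w3.org/2004/02/skos/core#> .

  # Language-tagged labels attached to a single language-independent resource
  <http://dbpedia.org/resource/Munich>
      rdfs:label     "Munich"@en , "München"@de , "Múnich"@es ;
      skos:prefLabel "Munich"@en ;
      skos:altLabel  "Muenchen"@de .

Such annotations cover simple lexicalisation needs, but they carry no syntactic or morphological information about the labels themselves.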
2.4 Ontology Presentation and Visualisation
Fresnel [2] is a display vocabulary that describes methods of data presentation in terms of lenses and formats. In essence, the lens in Fresnel selects which values are to be displayed and the format selects the formatting applied to each part of the lens. This provides many of the basic tools for presenting Semantic Web data. However, it does not represent multilinguality within the vocabulary and it is not designed to present a queriable interface to the data. There exist many ontology visualisation methods that use trees and other structures to display the data contained within the ontology, a survey of which is provided in [7]. These are of course focussed mainly on displaying the structure of the ontology and do not attempt to convert the ontology to natural language. Furthermore, for very large data sources it is impractical to visualise the whole ontology at one time; hence we wish to select only a certain section of it and thus require a query interface to perform this task.

3. MULTILINGUAL ACCESS AND QUERYING USING CLOVA
CLOVA addresses the problem of realising localised search interfaces on top of language-independent data sources, abstracting the work flow and design of a search engine and providing the developer with a set of tools to define and develop a new system with relatively little effort. CLOVA abstracts lexicalisation and data storage as services, providing a certain degree of independence from data sources and multilingual representation models.
The different modules of the system have been designed with the goal of providing very specific, non-overlapping and independent tasks to developers working concurrently on the system deployment. User interface definition tasks are completely separated from data access and lexicalisation, allowing developers of each module to use different resources as required.
CLOVA as an architecture does not itself fulfil the aforementioned requirements (as they should be fulfilled by lexicalisation services), but provides a framework to fully exploit cross-lingual services meeting these requirements. The application design allows us to separate conceptual representations from language-dependent lexical representations, making user interfaces completely language-independent so that they can later be localised to any supported language.

3.1 System Architecture
The CLOVA architecture is designed to enable the querying of semantic data in a language of choice, while still presenting queries to the data source in a language-independent form. CLOVA is modular, reusable and extensible and as such is easily configured to adapt to different data sources, user interfaces and localisation tools (a Java implementation of CLOVA is available at http://www.sc.cit-ec.uni-bielefeld.de/clova/).

  Req. No   Implication                                   Status
  Req. 1a   Querying in multiple languages                REQUIRED
  Req. 1b   Result localisation in multiple languages     REQUIRED
  Req. 2    Data interoperability                         REQUIRED
  Req. 3    Language modularity                           REQUIRED
  Req. 4a   Separation between data and lexical layer     DESIRED to support Req. 3
  Req. 4b   Language-independent data representation      DESIRED to avoid redundancy
  Req. 5    Declarative representation of lexica          DESIRED for sharing lexical information

  Figure 1: Requirements for multilingual organisation of data

Figure 2 depicts the general architecture of CLOVA and its main modules.
[Figure 2: CLOVA general architecture]

The form displayer is a module which translates the semantic form specification into a displayable format, for example HTML. Queries are performed by the query manager, and the results are then displayed to the user using the output displayer module. All of the modules use the lexicaliser module to convert the conceptual descriptions (i.e., URIs) to and from natural language. Each of these modules is implemented independently and can be exchanged or modified without affecting the other parts of the system.
We assume that we have a data source consisting of a set of properties referenced by URIs and whose values are also URIs or language-independent data values. We shall also assume that there are known labels for each such URI and each language supported by the application. If this separation between the lexical layer and the data layer does not already exist, we introduce elements to create this separation. It is often necessary to apply such manual enrichment to a data source, as it is not trivial to identify which strings in the data source are language-dependent; however, we find that this is often a simple task to perform by identifying which properties have language-dependent ranges, or by using XML's language attribute.
We introduce an abstract description of a search interface by way of XML called a semantic form specification. It specifies the relevant properties that can be queried by using the URIs in the data source, thus abstracting from any natural language. We show how this can be used to display a form to the user and to generate appropriate queries once he/she has filled in the form. The query manager provides a back end that allows us to convert our queries, using information in the form, into standard query languages such as SPARQL and SQL. Finally, we introduce a lexicalisation component, which is used to translate between the language-independent forms specified by the developer and the localised forms presented to the user. We describe a lexicaliser which builds on a complex lexicon model and demonstrate that it can provide more flexibility with respect to the context and complexity of the results we wish to lexicalise.

3.2 Modules

3.2.1 Semantic Form Specification
One of the most important aspects of the architecture is the Semantic Form Specification (SFS), which contains all the necessary information to build a user interface to query the ontology. In the SFS the developer specifies the ontology properties to be queried by the application via their URIs. This consists of a form for which we specify a domain, i.e., the class of objects we are querying, as defined in the database by an RDF type declaration or similar. If this is omitted, we simply choose all individuals in the data source. The SFS essentially consists of a list of fields which are to be used to query the ontology. Each field contains the following information:

• Name: An internal identifier used to name the input fields for HTML and HTTP requests.

• Query output: This defines whether the field will be included in the results. Valid values are always, never, ask (the user can decide whether to include the field in the results or not), if empty (if the field has not been queried it is included in the output), if queried (if the field is queried, it is included in the output) and ask default selected (the user decides, but by default the field will be shown).

• Property: The URI of the ontology property to be queried through the field. An indication of reference=self in place of a URI means that we are querying the domain of the search. Such queries are useful for querying the lexicalisation of the object being queried or for limiting the query to a fixed set of objects.

• Property Range: We define a number of types (called property ranges) that describe the data that a field can handle. These differ from the data types of RDF and similar formalisms in that we also describe how the data should be queried. For example, while it is possible to describe both the revenue of a company and the age of an employee as integers in the database, it is not sensible to query revenue as a single value, whereas it is often useful to query age as a single value. These property ranges provide an abstraction of these properties in the data and thus support the generation of appropriate forms and queries. The following property ranges are built into CLOVA:
  – String, Numeric, Integer, Date: Simple data-type values. Note that String is intended for representing language-independent strings, e.g. IDs, not natural language strings. The numeric and date ranges are used to query precise values like "age" and "birth date".

  – Range, Segment, Set: These are defined relative to another property range and specify how a user can query the property in question. Range specifies that the user should query the data by providing an upper and/or lower bound, e.g. "revenue" or "number of employees". Segment is similar but requires that the developer divide the data up into pre-defined intervals. Set allows the developer to specify a fixed set of queriable values, e.g. "marital status".

  – Lexicalised Element: Although we assume all data in the source is identified by URIs, it is obviously desirable that the user can query the data using natural language. This property range allows URIs to be queried through language-specific strings, which are resolved by the system to the URI in question. The strings entered into this field are processed by the lexicaliser to find the URI to which they belong, which is then used in the corresponding queries. For example, locations can have different names in different languages, e.g. "New York" and "Nueva York", but the URI in the data source should be the same.

  – Complex: A complex property is considered to be a property composed of other sub-properties. For example, searching for a "key person" within a company can be done by searching for properties of the person, e.g. "name" or "birth place". This nested form allows us to express queries over the structure of an RDF repository or other data source.

  – Unqueriable: For some data, methods for efficient querying cannot be provided, especially for binary data such as images. We defined this field to allow such a result to still be extracted from the data source and included in the results.

The described property ranges are supported natively by CLOVA, but it is also possible to define new property ranges and include them in the SFS XML document. The appropriate implementation of a form display element that can handle the newly defined property range has to be provided, of course (see Section 3.2.2).

• Rendering Properties: There is often information for a particular rendering that cannot be provided in the description of the property ranges alone. Thus, we allow a set of context-specific properties to be passed to the rendering engine. Examples include the use of auto-completion features or an indication of the type of form element to display, i.e. a Set can be displayed as a drop-down list or as a radio-button selection.

The SFS document is in principle similar to the concept of a "lens" in the Fresnel display vocabulary [2] in that it describes the set of fields in the data that should be used for display and querying. However, by including more information about methods for querying the data, we provide a description that can be used for both presentation and querying of the data.

Example: Suppose that we want to build a small web application that queries an ontology with information about companies stored in an RDF repository. The application should ask for company names, companies' revenue, and company locations. The syntax of an SFS XML document for such an application is shown below:
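The following is a minimal sketch of such a document; the element and attribute names (form, field, range, output) and the DBpedia-style property URIs are illustrative assumptions rather than the normative CLOVA syntax:

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Illustrative SFS sketch; element and attribute names are assumptions. -->
  <form domain="http://dbpedia.org/ontology/Company">
    <!-- Company name: queries the lexicalisation of the company itself -->
    <field name="name" reference="self"
           range="LexicalisedElement" output="always"/>
    <!-- Location: a URI-valued property queried via its localised name -->
    <field name="location" property="http://dbpedia.org/ontology/location"
           range="LexicalisedElement" output="ask"/>
    <!-- Revenue: a continuous value queried by an upper and/or lower bound -->
    <field name="revenue" property="http://dbpedia.org/ontology/revenue"
           range="Range" output="ask default selected">
      <rendering minValue="0"/>
    </field>
  </form>

Here the name and location fields are lexicalised elements, while revenue is queried as a range with a minimum of zero.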
3.2.2 Form Displayer
The form displayer consists of a set of form display elements defined for each property range. It processes the SFS by using these elements to render the fields in a given order. The implementation of these elements is dependent on the output method. The form display elements are rendered using Java code to convert the document to XHTML (the CLOVA project also provides XSLT files to perform the same task).

[Figure 3: HTML form generated for an SFS document]

Figure 3 shows an example of the rendering of an SFS which includes the fields in the example above. In this rendering the field "name" is displayed as a text field, as it refers to the lexicalisation of the company. The location of a company is likewise represented as a text field; in spite of the fact that the data is represented in the data source as a language-independent URI, the user can query by specifying the name of the resource in their own language (e.g., a German user querying "München" receives the same results as an English user querying "Munich"). Finally, the revenue is asserted as a continuous value which is queried by specifying a range and is thus rendered with two inputs allowing the user to specify the upper and/or lower bounds of their query. A minimum value on this range allows for client-side data consistency checks. In addition, check boxes are appended to fields in order to allow users to decide whether the fields will be shown in the results, according to the output parameter in the SFS.
3.2.3 Query Manager
Once the form is presented to the user, he or she can fill in the fields and select which properties he or she wishes to visualise in the results. When the query form is sent to the query manager, it is translated into a specific query for a particular knowledge base. We have provided modules to support the use of SQL queries using JDBC and SPARQL queries using Sesame [3]. We created an abstract query interface which can be used to specify the required information in a manner that is easy to convert to the appropriate query language, allowing us to change the knowledge base, ontology and back end without major problems. The query also needs to be preprocessed using the lexicaliser due to the presence of language-specific terms introduced by the user, which need to be converted to language-independent URIs.
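As a sketch of what this translation could produce for the company example, assume the user has restricted the location to "München" (already resolved by the lexicaliser to a URI) and specified a minimum revenue; the DBpedia-style URIs and the exact shape of the query are assumptions rather than the actual output of the CLOVA query manager:

  PREFIX dbo: <http://dbpedia.org/ontology/>

  SELECT ?company ?revenue ?location
  WHERE {
    ?company a dbo:Company ;
             dbo:location ?location ;
             dbo:revenue  ?revenue .
    # the user's string "München" was resolved to a URI by the lexicaliser
    FILTER ( ?location = <http://dbpedia.org/resource/Munich> )
    # lower bound taken from the revenue range field
    FILTER ( ?revenue >= 1000000000 )
  }

The result bindings are still language-independent URIs and data values; the lexicaliser is applied again before they are shown to the user.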
3.2.4 Output Displayer
Once the query is evaluated, the results are processed by the output displayer and an appropriate rendering is shown to the user. The displayer consists of a number of display elements, each of which represents a different visualisation of the data, including not only simple tabular forms but also graphs and other visual display methods. All of these elements are lexicalised in the same manner as in the form displayer.
In general we might restrict the types of data that components will display, as not every visualisation paradigm is suitable for every kind of data. For example, a bar chart showing foundation year and annual income would be both uninformative and difficult to display due to the scale of the values. For this reason we provide an output specification to define the set of available display elements and the sets of values they can display. These output specifications consist of a list of output elements described as follows:

• ID: Internal identifier of the output element displayed.

• URI: A reference to the output resource specified as a URI (these can reference Java classes by linking to the appropriate class file or a location in a JAR file).

• Fields: The set of fields used by this element. These should correspond by name to elements in the SFS.

• Display properties: Additional parameters passed to the display element to modify its behaviour. These include, for example, the possibility to ignore incomplete data or to define the subtype of a chart to display. These parameters are class-dependent, so each output element has its own set of valid parameters.

The following output specification defines two output elements to show results.
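A sketch of what such a specification could look like is given here; the element names, the java: URI scheme and the display-element class names are purely illustrative assumptions about the concrete syntax:

  <!-- Illustrative output specification sketch; names are assumptions. -->
  <output>
    <!-- A table listing every field returned by the query -->
    <element id="resultTable" uri="java:clova.html.TableElement">
      <fields>name location revenue</fields>
    </element>
    <!-- A bar chart restricted to the revenue values -->
    <element id="revenueChart" uri="java:clova.html.BarChartElement">
      <fields>revenue</fields>
      <property name="ignoreIncompleteData" value="true"/>
    </element>
  </output>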
The first element displays a table containing all the results returned by the query, while the second output element shows a bar chart for the property "Revenue". The HTML output generated for an output specification containing the above descriptions is shown in Figure 4.

[Figure 4: HTML result page for the example]

3.2.5 Lexicaliser
Simple lexicon models can be provided by language annotations, for example RDF's label and SKOS's prefLabel, and developing a lexicaliser is then as simple as looking up these labels for the given resource URI. This approach may be suitable for some tasks.
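In SPARQL, such a label lookup amounts to a single query per resource and target language; the resource URI below is illustrative:

  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?label
  WHERE {
    <http://dbpedia.org/resource/Munich> rdfs:label ?label .
    # keep only the lexicalisation for the requested interface language
    FILTER ( lang(?label) = "de" )
  }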
However, we sometimes require lexicalisation using extra information about the context and would like to provide lexicalisations of more than just URIs, e.g. when lexicalising triples. While RDF labels can be attached to properties and individuals, for instance, there is no mechanism that allows us to compute a lexicalisation for a triple by composing the labels of the property and the individuals. This is a complex problem and we leave a full investigation and evaluation of it for future work.
Furthermore, it is often desirable to have fine control over the form of the lexicalisation; for example, the ontology label may be "company location in city", but we may wish to have this property expressed by the simpler label "location". By using a lexicon ontology model we can specify the lexicalisation in a programmatic way, and hence adapt it to the needs of the particular query interface. For these reasons we primarily support lexicalisation through the use of the LexInfo [4] lexicon ontology model and its associated API (available at http://lexinfo.googlecode.com/), which is compatible with the LMF vocabulary [5].

The LexInfo model: A LexInfo model is essentially an OWL model describing the lexical layer of an ontology, specifying how properties, classes and individuals are expressed in different languages. We refer to the task of producing language-specific representations of elements in the data source, including triples, as lexicalisation of the data. The corresponding LexInfo API organises the lexical layer mainly by defining so-called aggregates, which describe the lexicalisation of a particular URI, specifying in particular the lexico-syntactic behaviour of certain lexical entries as well as their interpretation in terms of properties, classes and individuals defined in the data. An aggregate essentially bundles all the relevant individuals of the LexInfo model needed to describe the lexicalisation of a certain URI. This includes a description of the syntactic, lexical and morphological characteristics of each lexicon entry in the lexicon. Indeed, each aggregate describes a lexical entry together with its lemma and several word forms (e.g. inflectional forms such as the plural). The syntactic behaviour of a lexical entry is described through subcategorization frames, making the required syntactic arguments explicit. The semantic interpretation of the lexical entry with respect to the ontology is captured through a mapping ("syn-sem argument map") from the syntactic arguments to the semantic arguments of a semantic predicate which stands proxy for an element in the ontology. Finally, the aggregate is linked through a hasSense link to the URI in the data layer it lexicalises. An example of an aggregate is given in Figure 5. For details the interested reader is referred to [4].

[Figure 5: A simplified example of a LexInfo aggregate, linking a lexical entry for "product" (with its lemma, a plural word form, a NounPP subcategorization frame and syn-sem argument mappings from subject/prepositional object to domain/range) to the sense http://dbpedia.org/ontology/productOf]

LILAC: In order to produce lexicalisations of ontology elements from a LexInfo model we use a simple rule language included with the LexInfo API called LILAC (LexInfo Label Analysis & Construction). A LILAC rule set describes the structure of labels and can be used both for generating the lexicon from labels and for generating lexicalisations from the lexicon. In general we assume that lexicons are generated from some set of existing labels, which may be extracted from annotations in the data source (e.g. RDFS's label), from the URIs in the ontology, or from automatic translations of these labels from another language. The process of generating aggregates from raw labels requires that first the part-of-speech tags are identified by a tagger such as TreeTagger. Then the part-of-speech-tagged labels are parsed using an LR(1)-based parser (see [1]). The API then handles these parse trees and converts them into LexInfo aggregates.
LILAC rules are implemented in a symmetric manner so that they can be used both to generate the aggregates in the lexicon ontology model (e.g. by analysing the labels of a given ontology) and to lexicalise those aggregates. A simple example rule for a label such as "revenue of" is:

  Noun_NounPP -> <noun> <preposition>

This rule states that the lexicalisation of a Noun_NounPP aggregate is given by first using the written form of the lemma of the "noun" of the aggregate followed by the lemma of the "preposition" of the aggregate. LILAC also supports the insertion of literal terms and choosing the appropriate word form in the following manner:

  Verb_Transitive -> "is" <verb> [ participle, tense=past ] "by"

This rule can be used to convert a verb with transitive behaviour into a passive form (e.g., it transforms "eats" into "is eaten by").
LILAC can create lexicalisations recursively for phrases and similar structures; for example, to lexicalise an aggregate for "yellow moon", the following rules are used. Note that in this case the names provided by the aggregate class are not available, so the name of the type is used instead:

  NounPhrase -> <Adjective> <NounPhrase>
  NounPhrase -> <Noun>

The process of lexicalisation proceeds as follows: for each ontology element (identified by a URI) that needs to be lexicalised, the LexInfo API is used to find the lexical entry that refers to the URI in question. Then the appropriate LILAC rules are invoked to provide a lexicalisation of the URI in a given language. As this process requires only the URI of the ontology element, by changing the LexInfo model and providing a reusable set of LILAC rules, the language of the interface can be changed to any suitable form. It is important to emphasise that the LILAC rules are language-specific and thus need to be provided for each language supported.
Another issue is that we want our users to be able to search for elements by their lexicalised form. LexInfo can support this as well. This involves querying the lexicon for all lexical entries that have a word form matching the query and returning the URI that the lexical entry is associated with. Once we have mapped all language-specific strings to URIs, the query can be handled by the query manager as usual. For example, if the user queries for "food", then the LexInfo model can be queried for all lexical entries that have either a lemma or a word form matching this literal. The URIs referred to by this word can then be used to query the knowledge base. This means that a user can query in their own language and expect the same results; for example, the same concept for "food processing" will be returned to an English user querying "food" and to a Spanish user querying "alimento" (part of the compound noun "Procesado de los alimentos").
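This reverse lookup can again be sketched as a query over the lexicon; the lex: properties used below (lemma, wordForm, hasWrittenForm, sense) are placeholders standing in for the corresponding LexInfo/LMF vocabulary rather than its actual terms:

  PREFIX lex: <http://example.org/lexinfo-sketch#>

  SELECT DISTINCT ?uri
  WHERE {
    ?entry lex:sense ?uri .
    # match the user's term against the lemma or any other word form
    { ?entry lex:lemma ?form } UNION { ?entry lex:wordForm ?form }
    # assuming language-tagged written forms in the lexicon
    ?form lex:hasWrittenForm "alimento"@es .
  }

The URIs bound to ?uri are then substituted into the language-independent query exactly as described in Section 3.2.3.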
3.3 CLOVA for company search
We developed a search interface for querying data about companies using CLOVA, which is available at http://www.sc.cit-ec.uni-bielefeld.de/clova/demo. For this application we used data drawn from the DBpedia ontology, which we entered into a Sesame store. We used the labels of the URIs to generate the lexicon model for English, and used the translations provided by DBpedia's wiki page links (themselves derived from Wikipedia's "other languages" links) to provide labels in German and Spanish. As properties were not translated in this way, the translations for these elements were provided manually. These translations were converted into a LexInfo model through the use of about 100 LILAC rules. About 20 of these rules were selected to provide lexicalisation for the company search application. In addition, we selected the form properties and output visualisations by producing a semantic form specification as well as an output specification. These were rendered by the default elements of the CLOVA HTML modules, and the appearance was further modified by specifying a CSS style sheet. In general, the process of adapting CLOVA involves creating a lexicon, which could be a LexInfo model or a simpler representation such as one based on RDF's label property, and then producing the semantic form specification and output specification. Adapting CLOVA to a different output format or data back end requires implementing only a set of modest interfaces in Java.

4. CONCLUSION
We have presented an architecture for querying semantic data in multiple languages. We started by providing methods to specify the creation of forms, the querying of the data and the presentation of the results in a language-independent manner through the use of URIs and XML specifications. By creating this modular framework we provide an interoperable, language-independent description of the data, which can be used in combination with a lexicalisation module to enable multilingual search and querying. We then separated the data source into a language-independent data layer and a language-dependent lexical layer, which allows us to modularise each language and to make the lexical information available separately on the Semantic Web. In this way we achieved all the requirements we set out in Figure 1. We described an implementation of this framework, which was designed to transform abstract specifications of the data into HTML pages available on the web and performed its lexicalisations by the use of LexInfo lexicon ontology models [4], providing fine control over the lexicalisations used in a particular context.

Acknowledgements
This work has been carried out in the context of the Monnet STREP project funded by the European Commission under FP7, and partially funded by the "Consejería de Innovación, Ciencia y Empresa de Andalucía" (Spain) under research project P06-TIC-01433.

5. REFERENCES
[1] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, MA, 1986.
[2] C. Bizer, R. Lee, and E. Pietriga. Fresnel: A browser-independent presentation vocabulary for RDF. In Proceedings of the Second International Workshop on Interaction Design and the Semantic Web, Galway, Ireland, 2005.
[3] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF Schema. Lecture Notes in Computer Science, pages 54–68, 2002.
[4] P. Buitelaar, P. Cimiano, P. Haase, and M. Sintek. Towards linguistically grounded ontologies. In Proceedings of the European Semantic Web Conference (ESWC), pages 111–125, 2009.
[5] G. Francopoulo, N. Bel, M. George, N. Calzolari, M. Monachini, M. Pet, and C. Soria. Lexical Markup Framework (LMF) for NLP multilingual resources. In Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 1–8. Association for Computational Linguistics, 2006.
[6] A. S. M. L. Hoque and M. Arefin. Multilingual data management in database environment. Malaysian Journal of Computer Science, 22(1):44–63, 2009.
[7] A. Katifori, C. Halatsis, G. Lepouras, C. Vassilakis, and E. Giannopoulou. Ontology visualization methods: A survey. ACM Computing Surveys, 39(4):10, 2007.
[8] A. Kumaran and J. R. Haritsa. On database support for multilingual environments. In Proceedings of the IEEE RIDE Workshop on Multilingual Information Management, 2003.
[9] A. Miles, B. Matthews, M. Wilson, and D. Brickley. SKOS Core: Simple knowledge organisation for the web. In Proceedings of the International Conference on Dublin Core and Metadata Applications, pages 12–15, 2005.
[10] C. Peters, T. Deselaers, N. Ferro, J. Gonzalo, G. F. Jones, M. Kurimo, T. Mandl, A. Peñas, and V. Petras. Evaluating Systems for Multilingual and Multimodal Information Access, volume 5706. Springer, 2008.