ULiS: An Expert System on Linguistics to Support Multilingual Management of Interlingual Semantic Web Knowledge bases

Maxime Lefrançois, Fabien Gandon
EPI Edelweiss – INRIA Sophia Antipolis
2004 rt des Lucioles, BP93, Sophia Antipolis, 06902, France
Maxime.Lefrancois@inria.fr | Fabien.Gandon@inria.fr

Abstract. We are interested in bridging the world of natural language and the world of the semantic web, in particular to support multilingual access to the web of data. In this paper we introduce the ULiS project, which aims at designing a pivot-based NLP technique called Universal Linguistic System that relies entirely on the semantic web formalisms and complies with the Meaning-Text theory. Through the ULiS, a user could interact with an interlingual knowledge base (IKB) in controlled natural language. Linguistic resources themselves are part of a specific IKB: the Universal Lexical Knowledge base (ULK), so that actors may enhance their controlled natural language through requests in controlled natural language. We describe a basic interaction scenario at the system level, and provide an overview of the architecture of ULiS. We then introduce the core of the ULiS: the interlingual lexical ontology (ILexicOn), in which each interlingual lexical unit class (ILUc) supports the projection of its semantic decomposition on itself. We validate our model with a standalone ILexicOn, and introduce and explain a concise human-readable notation for it.

Keywords. Semantic Web; Explanatory Combinatorial Lexicology; Interlingual Lexical Ontology; Semantic decomposition; Interlingual Lexical Primitives; Meaning-Text Theory.

1 Introduction

In this paper we introduce and illustrate the recently begun ULiS project, which aims at redesigning a pivot-based NLP technique that relies entirely on the semantic web formalisms and complies with the Meaning-Text theory.
ULiS stands for Universal Linguistic System, and is a system through which multiple actors could interact with interlingual semantic web knowledge bases in multiple controlled (i.e., restricted and formal) natural languages.

Each controlled natural language (dictionary, grammar rules) would be described in a part of a universal linguistic knowledge base (ULK). Moreover, the ULK is itself one specific interlingual knowledge base. Actors could then enhance their controlled natural language through different actions stated in controlled natural language (e.g., create, describe, modify, merge, or delete lexical units in the dictionaries and grammar rules; connect situational lexical units to interlingual lexical units; add linguistic attributes with their associated rules, etc.).

The aim of this paper is to give an overview of our proposal for the architecture of ULiS, and to introduce and validate the cornerstone of the universal linguistic knowledge base: the interlingual lexical ontology (ILexicOn).

2 Related Work

The Meaning-Text Theory (MTT). The MTT is a theoretical linguistic framework for the construction of models of natural language. As such, its goal is to write systems of explicit rules that express the correspondence between meanings and texts (or sounds) in various languages (Kahane, 2003). Seven different levels of linguistic representation are assumed for each set of synonymous utterances: a semantic representation that is a network; the deep and surface syntactic representations (DSynR and SSynR) that are trees; the deep and surface morphological representations (DMorphR and SMorphR) that are lists of annotated tokens; and the deep and surface phonological representations (DPhonR and SPhonR) that are also lists of annotated tokens (Mel'čuk, 1998).

Thus, twelve modules containing transformation rules (one per pair of adjacent levels and per direction) are used to transcribe representations of a level into representations of an adjacent level.
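
The count of twelve modules follows from the seven levels listed above: six pairs of adjacent levels, each crossed in two directions. A minimal Python sketch of this arithmetic is given below; the abbreviation `SemR` for the semantic representation and the names `LEVELS` and `transformation_modules` are assumptions introduced for illustration only.

```python
# The seven MTT representation levels, ordered from meaning to sound.
# "SemR" is an assumed abbreviation; the other names come from the text.
LEVELS = ["SemR", "DSynR", "SSynR", "DMorphR", "SMorphR", "DPhonR", "SPhonR"]


def transformation_modules(levels):
    """Return one (source, target) pair per module between adjacent levels."""
    modules = []
    for upper, lower in zip(levels, levels[1:]):
        modules.append((upper, lower))  # production direction: meaning to text
        modules.append((lower, upper))  # analysis direction: text to meaning
    return modules


MODULES = transformation_modules(LEVELS)
print(len(MODULES))  # 12
```
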
The main constituent of the MTT is the dictionary model in which lexical units are described. It is called the Explanatory Combinatorial Dictionary (ECD), and has been the object of many works on lexical functions, e.g., (Mel'čuk et al., 1995).

Lexical ontologies and meaning representation languages. Lexical ontologies are ontologies of lexicalized concepts, widely used to model lexical semantics. Some have broad coverage but shallow treatment (i.e., with little or no axiomatization), such as Princeton WordNet (e.g., Miller et al., 1990), and some have small coverage but are highly axiomatized, such as FrameNet (Baker et al. 1998). They use different theories of lexical semantics, but most of them describe neither phrasemes nor lexical collocations. The French Lexical Network (Lux-Pogodalla & Polguère, 2011) is a growing ECD-compliant lexical resource, but it does not use the semantic web formalisms, and the definitions of the lexical units are not fully formalized.

On the other hand, the Universal Networking Language (UNL) is a meaning representation language, originally designed for pivot-based machine translation. Its dictionary is an interlingual lexical ontology based on so-called Universal Words ++, but the lack of argument frames and lexical functions in the UNL dictionary was pointed out in (Bogulsavsky, 2002; Bogulsavsky, 2005). This is when the idea of an ECD-compliant interlingual lexical ontology was first mentioned. After the semantic web formalisms were introduced at the W3C, an attempt to port the UNL to semantic web formalisms was the topic of the W3C Common Web Language Incubator Group (XGR-CWL, 2008), but no improvement was made to the lexical ontology.

SPARQL Inferencing Notation (SPIN). Grammar rules are not part of the Common Web Language (CWL) framework; in fact, the construction of grammar modules may be done in any programming language. Knublauch et al.
(2011) introduced SPIN: an RDFS schema to represent SPARQL rules and constraints.

Positioning of the ULiS project. The lexical resource we propose to develop is an interlingual lexical ontology coupled with a situational (i.e., a generalization of language-specific) lexical ontology, both using semantic web formalisms, and that together form an ECD-compliant dictionary. The benefits of using semantic web formalisms are high, as they enable us to construct an axiomatized graph representation of a lexical ontology, with validation and inference rules. Using SPIN, we propose to include transformation rules directly in an RDF format, on top of the ECD-compliant lexical ontologies, thus obtaining an expert system on linguistics.

The ULiS model is somewhat similar to FunGramKB (Periñán-Pascual & Arcas-Túnez, 2010), which is a lexico-conceptual knowledge base for NLP. However, the two projects draw on different influences. We choose to comply with the Meaning-Text theory, which gives a thorough understanding of lexical functions that are ubiquitous in every natural language. We also choose to describe the whole ULiS with the semantic web formalisms. This potentially enables the enhancement of the system itself through controlled natural language interactions.

3 Basic Interaction Scenarios with the ULiS

The three basic scenarios of ULiS are illustrated in Figure 1 below.

An actor in a situation c inputs some utterance (e.g., in English: "Who killed Mary?") that is first transformed into an RDF situational representation, which undergoes different language-specific processes, and which is finally transformed into a CWL-like interlingual representation.

Machine translation. At this stage, depending on the context, the interlingual representation of the utterance may be translated into another utterance in situation d (e.g., in the French situation: "Qui a tué Mary?") through a situational representation (Output1TEXT in Figure 1).
Management of Interlingual Knowledge Bases. Another possibility is that the interlingual representation of the utterance is transformed into a SPARQL request that is applied on an interlingual knowledge base (IKB), which eventually produces an RDF output (e.g., ex:John01). This RDF output is then first transformed into an interlingual representation, then into a situational representation, and finally into an output utterance: Output2TEXT in Figure 1 (e.g., "John killed Mary").

Management of the Universal Linguistic Knowledge base. Finally, the third scenario is the human-computing scenario: the SPARQL request is applied on the Universal Linguistic Knowledge base, which is the Interlingual Knowledge Base where the whole ULiS is described. Human actors may thus enhance the controlled natural languages through actions stated in controlled natural language.

[Figure 1: diagram linking InputTEXT, Output1TEXT and Output2TEXT to the RDF situational representations (SRc, SRd), the RDF interlingual representations (IR), and, in the RDF world, the SPARQL request applied on the IKB and its RDF output.]

Fig. 1. ULiS: The basic interaction scenario with an interlingual knowledge base.

Thus the interlingual representation format acts as a pivot not only between natural languages: any interlingual representation may also be translated into a SPARQL request, and any RDF graph may be translated into an interlingual representation.

4 The ULiS components

4.1 Overview

Figure 2 below illustrates the ULiS, with its three different layers:

The second row represents the interlingual layer (section 4.2), with a meta-ontology that describes the interlingual lexical ontology (ILexicOn): the cornerstone of the whole Universal Linguistic Knowledge base. The ILexicOn enables inference in interlingual semantic representations (ISemRs, on the right).
The first row represents the interlingual knowledge base (IKB) layer, with facts (on the right) and an ontology or thesaurus (on the left), augmented with anchors and transformation rules (section 4.4) that enable the transformation of facts into ISemRs, and vice versa. The IKB enables situation-independent inference on utterance representations.

The third row represents the situational layer (section 4.3), with a meta-ontology that describes the situational lexical ontology (SLexicOn), which itself enables situation-dependent linguistic inference on the situation-dependent representations of utterances (situational representations, SRs, on the right). Situation-annotated links and transformation rules define the transformation of utterances among SRs.

[Figure 2: three-row diagram. First row: the Interlingual Knowledge Base (IKB = KB + anchors + transformation rules), with example triples (ex:John01, ikb:kill, ex:Mary01), (ikb:kill, rdfs:range, ikb:Person), (ikb:kill, rdfs:domain, ikb:Person). Second row: the ILexiMOn ("Interlingual Lexical Meta-Ontology", e.g., ileximon:ILexicalUnit), the ILexicOn (pure interlingual features of the ECD, e.g., ilexicon:Person), and the RDF interlingual representations (IRs). Third row: the SLexiMOn ("Situational Lexical Meta-Ontology", e.g., sleximon:SLexicalUnit), the SLexicOn (other features of the ECD + links + transformation rules, e.g., englishlexicon:Person, espanollexicon:Persona, francaislexicon:Personne), and the RDF situational representations (SRs): DSynRs, SSynRs, DMorphRs, SMorphRs, DPhonRs, SPhonRs.]

Fig. 2. Overview of the architecture of the ULiS. From top to bottom: the interlingual knowledge base layer, the interlingual layer, the situational layer. From left to right: meta-ontologies; ontologies; facts and different representations.

4.2 Architecture in the interlingual layer

The interlingual layer of ULiS is divided into three components:

The meta-ontology. The interlingual lexical meta-ontology (ILexiMOn) is the schema that the ILexicOn must satisfy to be compliant with the pure semantic features of the Explanatory Combinatorial Dictionary (ECD).
It defines meta-classes, uses RDFS and some OWL Full axioms, and contains ad hoc SPIN validation and inference rules for the ILexicOn and the interlingual semantic representations (ISemRs).

The ontology. The interlingual lexical ontology (ILexicOn) is the interlingual dictionary where interlingual lexical unit classes (ILUcs) are formally defined as instances of the ILexicalUnit meta-class from the ILexiMOn. The ILexicOn contains all the pure semantic features of the Explanatory Combinatorial Dictionary (ECD). Any concept expressible in a natural language or a jargon is defined in the ILexicOn, which contains:
• The formal definitions of the ILUcs (described in section 5.2);
• The definitions of interlingual attribute classes (IAtts) (e.g., plural, future, 1st person, indefinite, etc.);
• The definitions of the interlingual semantic relations (ISemRels), which are used in the formal definitions of the ILUcs and to construct interlingual semantic representations (ISemRs);
• Interlingual lexical functions: all purely semantic lexical links such as synonymy, and purely semantic generic constructions such as the lexical function Centr(X), i.e., 'the center of X', or Fin(X), i.e., 'stop being X'.

The interlingual semantic representations. ISemRs are RDF graphs whose nodes are interlingual lexical unit instances (ILUis) and whose arcs are interlingual semantic relations (ISemRels). ILUis may also be instances of IAtts.

4.3 To and from Natural Language facts

Situations. Interlingual-based lexical resources connect language-specific dictionaries to an interlingual dictionary. We generalize this by using situations (i.e., the situations of understanding and use of some linguistic element).
The situation of a linguistic element is part of the pragmatics of its use: it represents not only the language used (e.g., EN, FR), but also sociolectal marks (e.g., biologists, architects, official, slang, reverential), topolectal marks (e.g., U.S., Canada), chronolectal marks (e.g., old, neologic), and even individual marks (e.g., a particular group of people). The intersection of situations is also a situation (EN-U.S.-slang), and so is the union of situations (FR-Canada OR FR-France-old).

Architecture of the situational layer. This architecture purposefully mirrors the interlingual layer:
• A situational lexical meta-ontology (SLexiMOn) describes the SLexicOn;
• A situational lexical ontology (SLexicOn) contains all non-purely semantic features of the ECD.

A non-exhaustive list of these features is the following:
• Definitions of situational lexical unit classes, called SLUcs, by means of a link to an ILUc, which is annotated by a specific situation.
• Situational lexical functions such as Instr(X), i.e., the preposition that governs the keyword X and means 'by means of'.
• Situational attribute classes (e.g., invariable English nouns, French 1st verb group, German dative, etc.), their associated situations and rules.
• Situational relations: relations that link two instances of the SLUcs, thus defining the dependency syntax of the utterance, or the order of the words in an utterance.

Situational representations (SRs). The data consist of situational representations (SRs): RDF graphs having situational lexical unit instances (SLUis) as nodes and situational relations as arcs. An SR thus represents the different representations of the Meaning-Text theory.

Transformation rules. Contrary to the Common Web Language (CWL), where no representation of grammar rules is proposed, we plan to introduce transformation rules in the SLexiMOn.
Transformation rules form a subclass of the SPIN rules and are attached to a SLUc to define a correspondence between a generic pattern at one representation level and another pattern at a deeper or higher representation level. Thus, each situation may define its own analysis and production grammar, both made of six sets of transformation rules.

Transformation rules may be sorted according to their level of genericity: transformation rules that are attached to ISemRels, or to IAtts, are less specific than rules that may be triggered only when a complex ISemR pattern is met; also, rules that may be triggered in generic situations are less specific than those that may only be triggered in more specific situations. The important point is that a rule must be triggered if and only if there is no more specific rule that can be triggered instead. This implies that an algorithm different from the simple forward-chaining algorithm must be proposed. It will be very important to optimize the application of such an algorithm to a whole set of rules. We therefore plan to construct a Rete network (Forgy, 1982) on top of each set of transformation rules, which is eased by the SPIN framework as each rule is modeled as an RDF graph.

Finally, a set of generic transformation rules must be designed to ensure that, for each situation, every SR is transformable into an ISemR, and every ISemR is transformable into an SR. When a new situation is introduced (e.g., a new language), this criterion is a priori not met. This is the reason why we suggest the introduction of the universal situation, and of transformation rules that produce Notation3-like output. We claim that a small set of rules will suffice to produce and analyze simple controlled natural languages.

4.4 To and from Interlingual Knowledge Bases facts

Interlingual knowledge bases.
The main criterion that an interlingual knowledge base must meet is that any RDF graph inside it must be transformable into an interlingual semantic representation (ISemR). We thus propose to form interlingual knowledge bases by augmenting classic knowledge bases with anchors and transformation rules:
• An anchor is a triple that links an RDF resource to an ILUc. For instance, the RDF resource rdfs:Class will be anchored to a specific ILUc ilexicon:RdfClass that formally defines the concept of an RDF class, and that is itself linked to an English SLUc that is a pluralizable noun realized by the string "class";
• The transformation rules are stored in the interlingual knowledge base and form two separate sets of rules: one for producing RDF from an ISemR, the other for producing an ISemR from RDF. Here again, transformation rules may be sorted according to their level of genericity, and the most generic rules must be inhibited when more specific ones can be triggered.

Augmenting classic semantic web formalisms. The output of an ISemR must be a valid SPARQL request, and the output of any RDF graph must be a valid ISemR. This criterion will be satisfied by the introduction of different anchors and generic transformation rules in the classic semantic web vocabularies: RDF, then RDFS, OWL and SPIN, and finally SKOS. Thus an RDF class that has no anchor, e.g., foaf:Person, has a correspondence with an ISemR that itself has a correspondence to the textual representation for the EN situation: "The RDF class foaf:Person".
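
The interplay between anchors and the inhibition of generic rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPIN-based mechanism proposed here: the property name `ulis:anchoredTo`, the `Rule` class, the integer `specificity` score (standing in for the genericity ordering) and the plain strings standing in for ISemRs and their English rendering are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple

# Hypothetical property name: the text only states that an anchor is a
# triple linking an RDF resource to an ILUc.
ANCHOR = "ulis:anchoredTo"

ANCHORS: Set[Tuple[str, str, str]] = {
    ("rdfs:Class", ANCHOR, "ilexicon:RdfClass"),
}


def anchor_of(resource: str) -> Optional[str]:
    """Return the ILUc a resource is anchored to, or None if it has no anchor."""
    for subject, predicate, iluc in ANCHORS:
        if subject == resource and predicate == ANCHOR:
            return iluc
    return None


@dataclass(frozen=True)
class Rule:
    name: str
    specificity: int  # assumption: one integer stands in for the genericity ordering
    applies: Callable[[str], bool]
    produce: Callable[[str], str]  # returns a string standing in for an ISemR


RULES: List[Rule] = [
    # Generic fallback: applies to any resource, anchored or not.
    Rule("generic-rdf-class", 0,
         lambda r: True,
         lambda r: f"The RDF class {r}"),
    # More specific rule: applies only when an anchor exists.
    Rule("anchored-resource", 1,
         lambda r: anchor_of(r) is not None,
         lambda r: f"instance of {anchor_of(r)}"),
]


def select_rule(resource: str) -> Rule:
    """Pick the most specific applicable rule; more generic ones are inhibited."""
    applicable = [rule for rule in RULES if rule.applies(resource)]
    return max(applicable, key=lambda rule: rule.specificity)


def to_isemr(resource: str) -> str:
    return select_rule(resource).produce(resource)


print(to_isemr("foaf:Person"))  # The RDF class foaf:Person
print(to_isemr("rdfs:Class"))   # instance of ilexicon:RdfClass
```

A single integer score is a simplification: the genericity ordering described above depends on both the pattern and the situation, so a faithful implementation would more likely compare rules through a partial order.
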
5 Modeling Choices in the Interlingual Layer

5.1 Overview

[Figure 3: layered diagram. OWL layer: owl:Class, owl:ObjectProperty, xsd:boolean, owl:intersectionOf, owl:unionOf, owl:propertyChainAxiom, owl:hasSelf. core-ILexiMOn layer: :ILexicalUnit, :ILexicalPrimitive, :ISemanticRelation, :onISemanticRelation, :allValuesFrom, :isObligatory. ILexicOn layer: :Entity, :State, :Person, :Alive, :hasEntity. Data layer: :Mary01, :Alive01. Legend: A is a subclass of B; A is an instance of B; A is the intersection of B and C; A is linked to B through property p.]

Fig. 3. The three components of the interlingual layer, with details of the whole core-ILexiMOn that we introduced, and an overview of the light standalone ILexicOn and the data.

Figure 3 illustrates the architecture of our work, with its integration in the semantic web formalisms. To validate our approach, we designed a light core-ILexiMOn¹, a light standalone ILexicOn², and simple ISemRs³. From top to bottom: 1) the semantic web formalisms, with a few OWL classes and properties that are useful for our work; 2) the detailed core-ILexiMOn; 3) an overview of the light standalone ILexicOn; and 4) an overview of data from the interlingual data component. Notice that: i) ILUis from the data are instances of ILUcs described in the ILexicOn, which are themselves instances of the ILexicalUnit meta-class described in the ILexiMOn; and ii) properties used to link two resources in a layer are described in an upper layer.

¹ RDF/XML document available at URL: http://ns.inria.fr/ulk/2011/06/10/ileximon-core
² RDF/XML document available at URL: http://ns.inria.fr/ulk/2011/06/10/ilexicon-ex
³ RDF/XML document available at URL: http://ns.inria.fr/ulk/2011/06/10/sems-ex

ILexicOn – standalone&light Entity State –(hasEntity)→1.Entity Person