Learning term to concept mapping through verbs: a case study

Valentina Ceausu
CRIP 5 - Paris V University
45, Rue des Saints Pères, 75006 Paris, France
ceausu@math-info.univ-paris5.fr

Sylvie Desprès
LIPN UMR CNRS 7030 - University of Paris 13
99 avenue Jean Baptiste Clément, 93430 Villetaneuse, France
sylvie.despres@lipn.univ-paris13.fr

ABSTRACT

We propose in this paper an approach to learn term to concept mappings through the joint use of verb relations and an existing ontology. It is an unsupervised solution that can be applied to any field for which an ontology modeling verbs as relations between concepts has already been created. Conceptual graphs representing sets of verb relations are learned from a natural language corpus using part-of-speech information and statistical measures. Labeling strategies are then proposed to assign terms of the corpus to concepts of the ontology, taking into account both the structure of the ontology and the extracted conceptual graphs. The resulting assignments can be used to automatically create semantic annotations of documents. A first experiment in the field of accidentology was carried out and its results are also presented.

Categories and Subject Descriptors

D.3.3: term to concept mapping, semantic annotation.

General Terms

Experimentation.

Keywords

Ontology, verb relation, concept learning.

1. INTRODUCTION

The rapid growth in the production of natural language documents calls for efficient automated approaches to finding relevant information in those documents. This paper presents an approach that uses verb relations and a domain ontology to assign terms of a given corpus to concepts of the field. These assignments can then be used in various exploitation scenarios: semantic annotation of documents, estimation of similarities between documents, and so on.

The approach is entirely automatic and unsupervised, apart from the use of a domain ontology to support the process. The task can be described as follows: let o be a domain ontology and c a collection of domain-specific texts. For this work, we assume that the ontology takes the linguistic level of its entities into account. Concepts and roles are thus labeled by terms, which are the linguistic manifestations of ontology entities in a specific language (French, English, etc.). The ontology considered here therefore has two levels: a conceptual level, describing domain-specific entities (concepts and roles), and a linguistic level, providing expressions of those entities in a given language.

The goal of the approach is to identify within c the terms representing linguistic expressions of concepts of the ontology o, so that terms identified in the corpus can be labeled by concepts of the ontology. We propose a three-step approach to carry out this labeling process:

(1) In a first stage, verb relations are extracted from the corpus. Each verb relation is composed of a verb, be it a general or a field-specific one, and a pair of terms connected by this verb.

(2) In a second stage, statistical processing is performed to structure the verb relations as conceptual graphs. As the verb is considered to be the key element of a verb relation, it is placed at the top of the conceptual graph. Terms occurring as arguments of the verb are connected to it through links representing their syntactic function, which can be subject or object.

(3) The last stage is based on the assumption that the domain ontology models the verbs of the field as relations holding between concepts. If this is the case, labeling strategies use both the ontology and the extracted conceptual graphs to assign field-specific terms to field-specific concepts.

We approach this topic by answering a number of questions: which method should be used to extract verb relations from the corpus? How can conceptual graphs be learned from the extracted verb relations? These questions are analyzed in sections 2 and 3. Given a domain ontology and a set of conceptual graphs, which strategies can be used to assign terms to concepts? This is discussed in section 4. A first experiment in the field of accidentology is described and its results are presented in section 5. Related work is presented in section 6. Conclusions and perspectives end the paper.
2. EXTRACTING VERB RELATIONS FROM THE CORPUS

To extract verb relations from the corpus, we adopted an approach based on pattern recognition. It uses part-of-speech information and consists in searching the corpus for particular associations of lexical categories. Such an association is a lexical pattern; for example, (Verb, Noun) or (Verb, Preposition, Noun) are lexical patterns.

We manually crafted a set of lexical patterns, each including a verb among other categories. Associations of words matching the patterns of this set are identified by the pattern recognition algorithm described in [4] (a minimal sketch of this matching step is given at the end of this section). The algorithm takes as input the corpus tagged by TreeTagger, see [19], and the set of lexical patterns including verbs. It is applied at sentence level and automatically generates a set of word regroupings matching those patterns (the examples in this paper are translated into English, although they come from a French corpus), such as: Verb, Preposition: diriger vers (direct to); Verb, Preposition, Noun: diriger vers place (direct to square).

The obtained word regroupings can be:

- a verb relation highlighting a domain relation, such as: véhicule diriger vers bretelle (vehicle direct to slip road);

- an incomplete verb relation, such as: piéton traverser (pedestrian crossing) or diriger vers l'opéra (direct to opera);

- or a meaningless word regrouping, such as: c, véhicule (c, vehicle) or venir de i (come from i).
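To make the pattern matching step concrete, the following minimal Python sketch (not the implementation of [4]) matches a few lexical patterns over one POS-tagged sentence. The coarse tag names (NOM, VER, PRP) and the (token, tag) input format are simplifying assumptions about TreeTagger's output.

# Minimal sketch of pattern-based extraction over a POS-tagged sentence.
# Tag names and input format are assumptions, not the exact output of
# TreeTagger or the algorithm of [4].

PATTERNS = [
    ("VER", "PRP"),                 # e.g. diriger vers
    ("VER", "PRP", "NOM"),          # e.g. diriger vers place
    ("NOM", "VER", "PRP", "NOM"),   # e.g. véhicule diriger vers bretelle
]

def coarse(tag):
    """Reduce a fine-grained tag such as 'VER:infi' to its coarse category."""
    return tag.split(":")[0].upper()

def match_patterns(tagged_sentence, patterns=PATTERNS):
    """Return the word regroupings whose tag sequences match a lexical pattern."""
    words = [w for w, _ in tagged_sentence]
    tags = [coarse(t) for _, t in tagged_sentence]
    matches = []
    for pattern in patterns:
        n = len(pattern)
        for i in range(len(tags) - n + 1):
            if tuple(tags[i:i + n]) == pattern:
                matches.append(" ".join(words[i:i + n]))
    return matches

# One sentence of the lemmatized corpus, already tagged:
sentence = [("véhicule", "NOM"), ("diriger", "VER:infi"),
            ("vers", "PRP"), ("bretelle", "NOM")]
print(match_patterns(sentence))
# ['diriger vers', 'diriger vers bretelle', 'véhicule diriger vers bretelle']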
3. LEARNING CONCEPTUAL GRAPHS

The goal of this phase is to learn conceptual graphs from the results of the pattern recognition algorithm. A conceptual graph represents a hierarchy having a verb as its top and, on a second layer, arguments connected to the verb by their grammatical function, subject or object. We use the term conceptual graph as it was introduced by [18]. As many terms can be the subject or object of the same verb, a conceptual graph corresponds to the set of verb relations generated by the same verb. To learn conceptual graphs, the chain of treatments based on lexical similarity measures presented below is performed.

3.1 Lexical similarities and lexical distances

A lexical similarity measure associates a real number r with a pair of strings (s, t); high values of r indicate a strong similarity between s and t. A lexical distance measure also associates a real number with a pair of strings, but its interpretation is the opposite: high values of r indicate a weak similarity between s and t.

Many coefficients have been proposed to calculate similarities or distances between strings; a number of them are presented in [5]. For this work, we implemented the Jaccard, Jaro, Jaro-Winkler and Monge-Elkan coefficients.

The Jaccard coefficient calculates the similarity between two strings s and t by viewing each string as composed of several sub-strings. It is given by:

$Jaccard(s, t) = \frac{|s \cap t|}{|s \cup t|}$

This measure takes into account the number of sub-strings common to s and t and the total number of sub-strings of s and t. If we consider characters as sub-strings, the coefficient expresses similarity by counting only the characters common to s and t.

The Jaro and Jaro-Winkler coefficients, introduced below, express the distance between two strings s and t by taking into account the number and position of the characters shared by s and t. Let $s = s_1 \ldots s_m$ and $t = t_1 \ldots t_n$ be two strings. A character $s_i$ of s is considered common to both strings if there exists $t_j$ in t such that $s_i = t_j$ and $i - h \le j \le i + h$, where $h = \min(|s|, |t|)/2$. Let $s^1 = s^1_1 \ldots s^1_k$ be the characters of s common to t and $t^1 = t^1_1 \ldots t^1_k$ the characters of t common to s. We define a transposition between s and t as an index i such that $s^1_i \ne t^1_i$. If $T_{s,t}$ is the number of transpositions, the Jaro coefficient is computed as follows:

$Jaro(s, t) = \frac{1}{3}\left(\frac{|s^1|}{|s|} + \frac{|t^1|}{|t|} + \frac{|s^1| - T_{s,t}}{|s^1|}\right)$

[13] proposes a variant of the Jaro coefficient using p, the length of the longest prefix common to both strings:

$JaroWinkler(s, t) = Jaro(s, t) + \frac{p}{10}\,(1 - Jaro(s, t))$

The coefficients presented so far calculate lexical similarity or distance iteratively and consider strings as blocks. There are also hybrid approaches that calculate similarities recursively, by analyzing sub-strings of the initial strings. The Monge-Elkan coefficient calculates the lexical similarity between s and t in two steps: the two strings are first divided into sub-strings $s = s_1 \ldots s_k$ and $t = t_1 \ldots t_l$; the similarity is then given by:

$MongeElkan(s, t) = \frac{1}{k}\sum_{i=1}^{k}\max_{j=1}^{l} sim(s_i, t_j)$

where $sim(s_i, t_j)$ is given by some similarity function, for instance one of those presented above. Such a function is called a level 2 function.
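The following Python sketch implements the four coefficients as formulated above, taking characters as sub-strings for Jaccard and Jaro, and words as sub-strings for Monge-Elkan. It is a minimal illustration rather than the implementation used in the experiments; in particular, the four-character prefix cap in the Jaro-Winkler variant is a common safeguard we add, not part of the definition given here.

# Minimal sketch of the coefficients of this section; an illustration of the
# formulas above, not the code used in the experiments.

def jaccard(s, t):
    """|s ∩ t| / |s ∪ t| over the sets of characters of s and t."""
    a, b = set(s), set(t)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def _common(s, t, h):
    """Characters of s that also occur in t within a window of h positions."""
    used = [False] * len(t)
    out = []
    for i, c in enumerate(s):
        for j in range(max(0, i - h), min(len(t), i + h + 1)):
            if not used[j] and t[j] == c:
                used[j] = True
                out.append(c)
                break
    return out

def jaro(s, t):
    """Jaro coefficient following the formulation given in this section."""
    h = min(len(s), len(t)) // 2
    s1, t1 = _common(s, t, h), _common(t, s, h)
    if not s1 or not t1:
        return 0.0
    transpositions = sum(a != b for a, b in zip(s1, t1))
    return (len(s1) / len(s) + len(t1) / len(t)
            + (len(s1) - transpositions) / len(s1)) / 3

def jaro_winkler(s, t, max_prefix=4):
    """Jaro plus a bonus for a common prefix of length p (capped here at 4)."""
    p = 0
    while p < min(len(s), len(t), max_prefix) and s[p] == t[p]:
        p += 1
    j = jaro(s, t)
    return j + (p / 10) * (1 - j)

def monge_elkan(s, t, sim=jaro_winkler):
    """Average over the sub-strings of s of their best match in t (level 2 sim)."""
    s_parts, t_parts = s.split(), t.split()
    return sum(max(sim(a, b) for b in t_parts) for a in s_parts) / len(s_parts)

print(monge_elkan("rétroviseur extérieur", "rétroviseur"))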
3.2 An iterative approach to learning conceptual graphs

Conceptual graphs are learned from the set of lexical pattern instances extracted as described in section 2. An iterative solution is proposed, in which each step adds a new layer to the graphs.

(1) The first step identifies verb classes, each representing the set of verb relations generated by the same verb, see Table 1.

Table 1. Extract from the diriger (to direct) class
diriger vers (direct towards)
diriger vers lieu (direct towards place)
véhicule diriger vers (vehicle direct towards)
automobile diriger vers esplanade (car direct towards esplanade)

For each verb class, instances of the patterns "Verb" and "Verb, Preposition" are added to the set of roots. We argue that for verbs accepting prepositions, each "Verb, Preposition" pattern accepts specific arguments, and for this reason a conceptual graph is created for each instance of those patterns. This step creates a number of one-level conceptual graphs, consisting only of the root (see Figure 1).

(2) For each root, its arguments are identified: the terms that occur as its subjects and objects. As each relation accepts many terms as subject or object, lists of arguments are obtained. This step adds a second layer to each conceptual graph.

(3) We observe that, for a given verb, arguments can have different levels of granularity, as shown in Table 2:

Table 2. Granularity of arguments
partie (side)
partie gauche (left side)
partie droite (right side)
rétroviseur (rear view mirror)
rétroviseur extérieur (external rear view mirror)

Hence, a new layer can be added to each conceptual graph by clustering these arguments. A cluster is a group of similar terms, having a central term c called the centroid and its k nearest neighbors. Based on the heuristic that the more words a word regrouping contains, the more specific its meaning is, the following algorithm is proposed to cluster the arguments of verb relations (a code sketch is given at the end of this section):

(1) for each list of arguments, create the list L of centroids, composed of all one-word arguments;

(2) for each centroid c, calculate the lexical similarity with the other terms of the list using the Monge-Elkan coefficient;

(3) add to the cluster of c the terms whose similarity value is greater than a given threshold. An expert intervention allows us to choose the value of this threshold.

At this stage, the Monge-Elkan function is used because it carries out recursive comparisons between sub-strings. Consequently, it has the capacity to agglomerate around a word (as cluster centroids are one-word terms, that is, single words) the terms derived from this word. We chose one-word terms as centroids because they have the most general meaning and are therefore able to attract into a cluster terms that are lexically similar to them and have more specific meanings.

Figures 1 and 2 show the iterative construction of conceptual graphs: one-level conceptual graphs learned from the diriger (to direct) class and two-level conceptual graphs learned from the circuler (to circulate) class.

Figure 1. Conceptual graph modeling circuler avec (circulate with)

Figure 2. Conceptual graph modeling diriger (direct to)
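The clustering step referenced above can be sketched as follows, assuming a level 2 similarity function such as the monge_elkan function of the previous sketch. Attaching each multi-word term only to its best-scoring centroid is a simplification of step (3), and the threshold value is illustrative; in the experiments it is chosen by an expert.

# Minimal sketch of the argument-clustering step; a simplification, not the
# authors' implementation.

def cluster_arguments(arguments, sim, threshold=0.8):
    """Group the arguments of one verb relation around one-word centroids.

    arguments -- terms occurring as subject (or object) of the verb
    sim       -- level 2 lexical similarity, e.g. the Monge-Elkan coefficient
    threshold -- expert-chosen similarity threshold (illustrative default)
    """
    # step (1): centroids are the one-word arguments
    centroids = [a for a in arguments if len(a.split()) == 1]
    clusters = {c: [] for c in centroids}
    for term in arguments:
        if term in clusters:
            continue                       # centroids are not re-attached
        # step (2): similarity of the term with every centroid
        scored = [(sim(c, term), c) for c in centroids]
        if not scored:
            continue
        score, best = max(scored)
        # step (3): attach the term if it passes the threshold
        if score >= threshold:
            clusters[best].append(term)
    return clusters

# With the arguments of Table 2 and monge_elkan as sim, one would expect:
# {"partie": ["partie gauche", "partie droite"],
#  "rétroviseur": ["rétroviseur extérieur"]}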
4. TERM TO CONCEPT MAPPING USING THE ONTOLOGY

At this stage, the arguments of verb relations can be assigned to concepts of the domain by using the conceptual graphs built above and a domain ontology. We make the assumption that, for a given conceptual graph, the verb R representing its root node is already modeled by the ontology. If this is the case, let r be the corresponding relation and Domain_r and Range_r the concepts of the ontology connected by r. These concepts and their descendants are used to label the arguments of the verb. As arguments are connected to the verb by links corresponding to their syntactic function, Domain_r is used to label subject arguments, while Range_r is used to label object arguments. The assignment of terms to concepts is performed by one of the labeling strategies described below.

A first strategy ignores the hierarchical organization of the arguments. The similarities between each argument and the terms naming concepts of the ontology are calculated using one of the similarity measures presented above. The argument is assigned to the concept maximizing this similarity, provided the similarity value is greater than a pre-defined threshold. If all similarity values are below the threshold, the term is labeled inconnu (unknown). This is a non-oriented strategy, because all arguments are considered at the same level.

The next two strategies take into account the hierarchical structure of the arguments. Each cluster of arguments is considered as a hierarchy having on its first level the centroid and on its second level the terms that are specializations of the centroid.

The second strategy is a top-down strategy (a code sketch is given at the end of this section). In a first phase, it identifies the concepts of the ontology which label the centroid of the cluster. If the centroid of a cluster is labeled as unknown, the same label is assigned to each term of the cluster. If the centroid of a cluster is labeled by a concept c of the ontology, labels for the other terms of the cluster are searched only in the set of sub-concepts of c. In this way, the top-down labeling strategy reduces the search space.

A third strategy is based on a bottom-up approach. For each cluster, the similarities between its terms and the concepts of the ontology are calculated using one of the presented coefficients. If the similarity value is higher than a threshold, the concept labels the term; otherwise, the term is labeled inconnu (unknown). Based on the assignments of each term of the cluster to ontology concepts, the similarity between the centroid and a concept of the ontology is given by:

$sim(Centroid, c) = \frac{1}{k}\sum_{i=1}^{k} sim(t_i, c)$

where $t_i$ is a term of the cluster, c is a concept of the ontology, $sim(t_i, c)$ is the similarity between $t_i$ and c, and k is the number of terms labeled by c.
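As referenced above, the top-down strategy can be illustrated by the following sketch. It makes simplifying assumptions about how the ontology is exposed: labels maps each concept to the terms of its linguistic level, sub_concepts maps a concept to its descendants, and candidates holds the concepts allowed for the cluster (for subject arguments, Domain_r and its descendants; Range_r for object arguments). The similarity function is one of the coefficients of section 3.1 and the threshold is illustrative; this is a sketch of the strategy, not the authors' implementation.

# Minimal sketch of the top-down labelling strategy; the ontology access
# (labels, sub_concepts) and the helper names are assumptions for illustration.

UNKNOWN = "inconnu"

def best_concept(term, candidates, labels, sim, threshold):
    """Return the candidate concept whose label term is most similar to term."""
    best, best_score = UNKNOWN, threshold
    for concept in candidates:
        for label in labels.get(concept, []):
            score = sim(term, label)
            if score >= best_score:
                best, best_score = concept, score
    return best

def top_down_labelling(centroid, cluster, candidates, labels, sub_concepts,
                       sim, threshold=0.8):
    """Label the centroid first, then restrict the search space for its terms."""
    assignment = {}
    c = best_concept(centroid, candidates, labels, sim, threshold)
    assignment[centroid] = c
    if c == UNKNOWN:
        # an unknown centroid propagates its label to the whole cluster
        for term in cluster:
            assignment[term] = UNKNOWN
        return assignment
    # otherwise search only among c and its sub-concepts (reduced search space)
    search_space = [c] + list(sub_concepts.get(c, []))
    for term in cluster:
        assignment[term] = best_concept(term, search_space, labels,
                                        sim, threshold)
    return assignment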
These three labeling strategies are used in a first experiment in the field of accidentology, which is described in the next section.

5. EXPERIMENTATION IN ACCIDENTOLOGY AND FIRST RESULTS

The results of our approach can be affected by different parameters: the corpus (its size and its nature, that is, a domain-specific or a general corpus) and the ontology. Our first experiment was performed in accidentology and aimed to point out how different ontologies affect the outcome. For this experiment, we used a single corpus and aimed to assign the terms extracted from it to the concepts of two different ontologies. These resources are described hereafter.

The corpus is composed of about 250 reports of accidents which occurred in and around the Lille region (130 KB, 205,000 words). Accident reports are documents created by the police describing road accidents. They are written by policemen, according to the declarations of the people involved in the accident and the testimonies of witnesses.

A first case study was done using an ontology O1 created from accident reports, see [4]. The ontology was created with Terminae, see [3], and is expressed in OWL, see [6] and [21]. It models the domain of accidentology as it appears through documents created by the police. In this case study, the ontology and the corpus are created by the same community.

Our second case study was done using an ontology O2 created from accident scenarios, see [7]. Accident scenarios are documents created by researchers in road safety which describe prototypes of road accidents. The ontology was created with Protégé, see [16], and is expressed in OWL. It models the domain of accidentology as it appears through documents created by road safety researchers. In this second case study, the corpus and the ontology are created by two different communities.

Each ontology models concepts (see Figure 3) and roles; roles are designated by domain-specific verbs, see Figure 4. As the community of road safety researchers is smaller, the number of entities of O2 is smaller, see Table 3.

Figure 3. The concept Véhicule (Vehicle)

Figure 4. Roles of the concept Véhicule (Vehicle)

Table 3. O1 and O2: number of entities

        Concepts   Roles
O1      450        320
O2      130        70

The analysis of the results is done using a new measure which we defined, called the assignation degree. It is given by:

$D(Corpus, O) = \frac{T_a}{T_{total}} \cdot \frac{C_a}{C_{total}}$

where $T_a$ is the number of arguments of verb relations assigned to concepts of the ontology; $T_{total}$ is the number of terms extracted from the corpus (arguments of verb relations); $C_a$ is the number of concepts to which arguments of verb relations are assigned; and $C_{total}$ is the number of concepts of the ontology. This definition is based on relative measures, which enables us to compare results obtained with different corpora and ontologies. Values of the assignation degree range from 0 (all terms extracted from the corpus are labeled as unknown) to 1 (each extracted term is assigned to a concept and each concept of the ontology labels at least one term).

For each case study, terms are assigned to concepts using each labeling strategy. The results obtained are presented below:

Table 4. Assignation degree: non-oriented strategy

Case study   Corpus             Ontology   Assignation degree
1            accident reports   O1         70.5
2            accident reports   O2         30.5

Table 5. Assignation degree: top-down strategy

Case study   Corpus             Ontology   Assignation degree
1            accident reports   O1         68.5
2            accident reports   O2         25.5

Table 6. Assignation degree: bottom-up strategy

Case study   Corpus             Ontology   Assignation degree
1            accident reports   O1         68.5
2            accident reports   O2         25.5
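For illustration, the assignation degree can be computed from the output of a labeling strategy as in the sketch below; the counts in the example are invented and are not those of Tables 4-6.

# Minimal sketch of the assignation degree D(Corpus, O); the example values
# are illustrative only.

def assignation_degree(assignments, n_terms, n_concepts):
    """(Ta / Ttotal) * (Ca / Ctotal).

    assignments -- mapping from each extracted term to a concept or "inconnu"
    n_terms     -- Ttotal, number of arguments extracted from the corpus
    n_concepts  -- Ctotal, number of concepts of the ontology
    """
    labelled = {t: c for t, c in assignments.items() if c != "inconnu"}
    ta = len(labelled)                 # terms assigned to some concept
    ca = len(set(labelled.values()))   # distinct concepts actually used
    return (ta / n_terms) * (ca / n_concepts)

# Invented example: 3 of 4 terms labelled, using 2 of 130 concepts
example = {"véhicule": "Véhicule", "automobile": "Véhicule",
           "bretelle": "Bretelle", "piéton": "inconnu"}
print(assignation_degree(example, n_terms=4, n_concepts=130))   # about 0.0115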
The analysis of the results is twofold: for the same labeling strategy, we compare the results obtained in each case study; and for the same case study (that is, the same ontology), we compare the results provided by each labeling strategy.

As the tables above show, the assignation degree takes higher values when the corpus and the ontology come from the same community. This can be explained by the similarity between the linguistic level of the ontology (the terms designating its entities) and the corpus. When the corpus and the ontology belong to different communities, the assignation degree decreases drastically. To overcome this problem, we could use lexical resources such as WordNet, see [14], allowing us to take synonymy between terms into account when estimating their similarity.

Among the labeling strategies, the bottom-up one shows low values of the assignation degree in both case studies. This is because most of the clusters we obtained contain fewer than 10 words, and this strategy fails on small clusters. The non-oriented strategy and the top-down strategy provide similar values of the assignation degree. Nevertheless, the top-down strategy performs faster, as it reduces the search space.

6. RELATED WORK

Approaches proposed in different application fields, such as ontology learning and word-sense disambiguation, are at the origin of this work.

Among them, [10] propose Asium, a machine learning system which acquires subcategorization frames of verbs from syntactic input. Asium hierarchically clusters nouns based on the verbs that they are syntactically related with, and vice versa.

The work of [24] concerns the identification of the meaning of unknown verbs using the context of occurrence of the verb. The Camille system uses WordNet, see [14], as background knowledge and generates assumptions concerning the meaning of verbs. The assumptions are formulated according to linguistic criteria.

[13] use a principle from information theory to model selectional preferences for verbs; several classes may be appropriate for modeling selectional preferences.

[20] propose RelExt, a system capable of automatically identifying highly relevant triples (pairs of concepts connected by a relation). RelExt extracts relevant terms and verbs from a given text collection and estimates relations between them through a combination of linguistic and statistical processing. The extracted triples can be integrated into an already existing ontology.

[18] propose a system with a multi-layered architecture aiming to extract information from genetic interaction data. The system uses verb patterns modeled as conceptual sub-graphs to characterize unknown terms in sentences. The goal is to enrich an existing ontology by integrating the discovered concepts.

Our approach is based on the previous work presented in [18], whose major drawback is the impossibility of assigning terms composed of several words (multi-word terms) to concepts of the ontology. To overcome this limitation, our approach takes into account arguments of verb relations that have different levels of granularity: we represent verb relations by conceptual graphs having three levels, the verb (first level), one-word arguments (second level) and multi-word arguments (third level).

7. CONCLUSION AND FUTURE WORK

We have presented an unsupervised approach developed to automatically assign the terms of a corpus to the concepts of an ontology. The approach jointly uses verb relations and a domain ontology. The results it provides could be used to semantically annotate or index documents.

A first experiment in the accidentology domain was carried out in order to point out how different ontologies affect the outcome. In order to evaluate its results, we defined a new measure, called the assignation degree. This evaluation shows that the approach provides better results when the corpus and the ontology belong to the same community; when they belong to different communities, the values of the assignation degree decrease. The experiment thus shows that our approach is sensitive to the lexical level: changing the vocabulary by passing from one community to another affects the values of the assignation degree.

As future work, new evaluation scenarios have to be proposed in order to study how other factors (namely the corpus, its size and its nature) affect the results. Another perspective concerns the exploitation of lexical resources such as WordNet, in order to take synonymy between terms into account; this will allow us to overcome the problem of lexical variation between different communities. As a continuation of this work, a feedback loop could be added in order to enrich the domain ontology by integrating new concepts.

8. REFERENCES

[1] Alfonseca, E., Manandhar, S.: Improving an ontology refinement method with hyponymy patterns. In Proceedings of the Third International Conference on Language Resources and Evaluation, 2001.

[2] Aussenac-Gilles, N., Seguela, P.: Les relations sémantiques : du linguistique au formel. Cahiers de grammaire 25 (175), 2000.
[3] Biébow, B., Szulman, S.: A linguistic-based tool for the building of a domain ontology. In Proceedings of the International Conference on Knowledge Engineering and Knowledge Management, 1999.

[4] Ceausu, V., Desprès, S.: Towards a text mining driven approach for terminology construction. In Proceedings of the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, 2005.

[5] Cohen, W., Ravikumar, P., Fienberg, S.: A comparison of string distance metrics for name-matching tasks. In Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, 2003.

[6] Dean, M., Schreiber, G., Patel-Schneider, P., Hayes, P., Horrocks, I.: OWL Web Ontology Language reference. Technical report, W3C Proposed Recommendation, 2004.

[7] Després, S.: Contribution à la conception de méthodes et d'outils pour la gestion des connaissances. Habilitation à diriger des recherches, Université René Descartes, 2002.

[8] Euzenat, J., Valtchev, P.: An integrative proximity measure for ontology alignment. In Proceedings of the ISWC-2003 Workshop on Semantic Information Integration, 2003.

[9] Faatz, A., Steinmetz, R.: Ontology enrichment with texts from the WWW. In Proceedings of the Second Semantic Web Mining Workshop at ECML/PKDD, 2002.

[10] Faure, D., Nedellec, C.: Asium, learning subcategorization frames and restrictions of selection. In Proceedings of the 10th European Conference on Machine Learning, Workshop on Text Mining, Chemnitz, Germany, 1998.

[11] Gagliardi, H., Haemmerlé, O., Pernelle, N., Saïs, F.: An automatic ontology-based approach to enrich tables semantically. In Proceedings of the First International Workshop on Context and Ontologies: Theory, Practice and Applications, 2005.

[12] Hearst, M.: Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th International Conference on Computational Linguistics, 1992.

[13] Li, H., Abe, N.: Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics 24, 217–244, 1998.

[14] Miller, G.: WordNet: A lexical database for English. Communications of the ACM 38, 39–41, 1995.

[15] Monge, A., Elkan, C.: The field-matching problem: algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996.

[16] Noy, N., Fergerson, R. W., Musen, M. A.: The knowledge model of Protégé-2000: Combining interoperability and flexibility. In Proceedings of the International Conference on Knowledge Engineering and Knowledge Management, 2000.

[17] Parekh, V., Jack, P. G., Finin, T.: Mining domain specific texts and glossaries to evaluate and enrich domain ontologies. In Proceedings of the International Conference on Information and Knowledge Engineering, 2004.

[18] Roux, C., Proux, D., Rechenmann, F., Julliard, L.: An ontology enrichment method for a pragmatic information extraction system gathering data on genetic interactions. In Proceedings of the Ontology Learning Workshop at ECAI, 2000.

[19] Schmid, H.: Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, 1994.

[20] Schutz, A., Buitelaar, P.: RelExt: A tool for relation extraction from text in ontology extension. In Proceedings of the International Semantic Web Conference, 593–606, 2005.
[21] Szulman, S., Biébow, B.: OWL et Terminae. In Actes de la 14ème Journée Francophone d'Ingénierie des Connaissances, 2004.

[22] Valarakos, A., Paliouras, G., Karkaletsis, V., Vouros, G.: A name matching algorithm for supporting ontology enrichment. In Proceedings of the 3rd Hellenic Conference on Artificial Intelligence, 2004.

[23] Ville-Ometz, F., Royauté, J., Zasadzinski, A.: Filtrage semi-automatique des variantes de termes dans un processus d'indexation contrôlée. In Actes du Colloque International sur la Fouille de Textes, 2004.

[24] Wiemer-Hastings, P., Graesser, A., Wiemer-Hastings, K.: Inferring the meaning of verbs from context. In Proceedings of the Twentieth Annual Conference of the Cognitive Science Society, 1998.

[25] Su, X.: Semantic Enrichment for Ontology Mapping. PhD thesis, Norwegian University of Science and Technology, 2004.

[26] Warin, M., Oxhammer, H., Volk, M.: Enriching an ontology with WordNet based on similarity measures. In MEANING-2005 Workshop, 2005.

[27] Widdows, D.: Unsupervised methods for developing taxonomies by combining syntactic and statistical information. In Proceedings of the Human Language Technology Conference, HLT-NAACL, 2003.