An Ontological Representation of Documents and Queries for Information Retrieval Systems

Mauro Dragoni, Célia da Costa Pereira, Andrea G.B. Tettamanzi
Università degli Studi di Milano, Dipartimento di Tecnologie dell'Informazione
Via Bramante 65, I-26013 Crema (CR), Italy
mauro.dragoni@unimi.it, celia.pereira@unimi.it, andrea.tettamanzi@unimi.it

Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html
Copyright owned by the authors.

ABSTRACT
This paper presents a vector space model approach for representing documents and queries that uses concepts instead of terms, with WordNet as a light ontology. This way, information overlap is reduced with respect to classic semantic expansion techniques. Experiments undertaken on the MuchMore benchmark showed the effectiveness of the approach.

1. INTRODUCTION
This paper presents an ontology-based approach for a conceptual representation of documents. The approach is inspired by an idea recently proposed in [9], and uses an adapted version of that method to standardize the representation of documents and queries. The proposed approach is somewhat similar to the classic query expansion technique. However, additional considerations have been taken into account and some improvements have been applied, as explained below.

Query expansion is an approach used in Information Retrieval (IR) to improve the system's performance. It consists of expanding the content of the query by adding terms that are semantically correlated with the original terms of the query [12]. Several works demonstrated the enhanced performance of IR systems that implement query expansion approaches [19] [3] [5]. However, query expansion has to be used carefully because, as demonstrated in [8], expansion might degrade the performance of some individual queries. This is because an incorrect choice of terms and concepts for the expansion task might harm the retrieval process by drifting it away from the optimal answer.

Document expansion applied to IR has been recently proposed in [2]. In that work a sub-tree approach has been implemented to represent concepts in documents and queries. However, when using a tree structure there is a redundancy of information, because more general concepts may be represented implicitly by using only the leaf concepts they subsume. The key idea behind representing documents by concepts is that documents and queries may be represented in the same way. This way, the risk of omitting some related terms (as may happen in the classical query expansion technique) is reduced. However, it is necessary to use a language resource that covers a larger number of terms in order to avoid information loss.

This paper presents a new representation for documents and queries. The proposed approach exploits the structure of the well-known machine-readable dictionary WordNet in order to reduce the redundancy of information generally contained in a concept-based document representation. The second improvement is the reduction of the computational time needed to compare documents and queries represented by using concepts. This representation has been applied to the ad-hoc retrieval problem. The approach has been evaluated on the MuchMore Collection [4] (see http://muchmore.dfki.de) and the results demonstrate its viability.

Section 2 gives an overview of the environment in which ontologies have been used. Section 3 presents the tools used for this work. Section 4 illustrates the proposed approach to represent information, while Section 5 compares this approach with two other well-known approaches used in the conceptual representation of documents. Section 6 discusses the results obtained from the test campaign. Finally, Section 7 concludes.

2. RELATED WORKS
An increasing number of recent information retrieval systems make use of ontologies to help users clarify their information needs and come up with semantic representations of documents. Many ontology-based information retrieval systems and models have been proposed in the last decade. An interesting review of IR techniques based on ontologies is presented in [11], while in [16] the author studies the application of ontologies to a large-scale IR system for web purposes. A model for the exploitation of ontology-based knowledge bases is presented in [7]. The aim of this model is to improve search over large document repositories. The model includes an ontology-based scheme for the annotation of documents, and a retrieval model based on an adaptation of the classic vector-space model [15]. Another information retrieval system based on ontologies is presented in [14]. The authors propose an information retrieval system with a landmark information database that holds hierarchical structures and the semantic meanings of the features and characteristics of the landmarks.

The implementation of ontology models has also been investigated by using fuzzy models [6].

In IR, the user's input queries are usually not detailed enough, so satisfactory results cannot be returned. Query expansion can help to solve this problem. However, common query expansion in IR does not give stable retrieval results. Ontologies play a key role in query expansion research. A common use of ontologies in query expansion is to enrich the resources with some well-defined meaning to enhance the search capabilities of existing web search systems.

In [18] the authors propose and implement a query expansion method which combines a domain ontology with the frequency of terms. The ontology is used to describe domain knowledge; a logic reasoner and the frequency of terms are used to choose fitting expansion words. This way, higher recall and precision can be obtained for the user's query results.

In [10] the authors present an approach to expand queries that consists of searching for the terms of the topic query in an ontology in order to add similar terms.

3. PRELIMINARIES
The roadmap to prove the viability of a concept-based representation of documents and queries consists of two main tasks:

- to choose a method that represents all document terms by using the same set of concepts;

- to implement an approach that indexes and evaluates each concept, in both documents and queries, with the appropriate weight.

To represent documents, the method described in Section 4 has been used, combined with the WordNet machine-readable dictionary. From the WordNet database, the set of terms that have no hyponyms has been extracted; each such term is named a "base concept". A vector, named the "base vector", has been created and a base concept has been assigned to each component of the vector. This way, each term is represented by using the base vector of the WordNet ontology.

The representation described above has been implemented on top of the Apache Lucene open-source API (see http://lucene.apache.org/).

In the pre-indexing phase, each document has been converted into its ontological representation. After the calculation of the importance of each concept in a document, only concepts with a degree of importance higher than a fixed cut-value have been kept, while the others have been discarded. The cut-value used in these experiments is 0.01. This choice has a drawback, namely that an approximation is introduced in the representation of information because some minor concepts are discarded. However, we have experimentally verified that this approximation does not affect the final results.

During the evaluation activity, queries have also been converted into the ontological representation. Weights therefore have to be assigned to each concept so that all concepts are evaluated in the right proportion. One of the features of Lucene is the possibility of assigning a payload to each term of the query. Therefore, for each element present in the concept-based representation of the query, its concept weight has been used as boost value.

4. DOCUMENT REPRESENTATION
Conventional IR approaches represent documents as vectors of term weights. Such representations use a vector with one component for every significant term that occurs in the document. This has several limitations, for example:

1. different vector positions may be allocated to the synonyms of the same term; this way there is an information loss because the importance of a given concept is distributed among different vector components;

2. the size of a document vector has to be at least equal to the total number of words of the language used to write the document;

3. every time a new set of terms is introduced (which is a high-probability event), all document vectors must be reconstructed; the size of a repository thus grows not only as a function of the number of documents that it contains, but also of the size of the representation vectors.

To overcome these weaknesses of term-based representations, the ontology-based representation recently proposed in [9] has been used. It exploits the hierarchical is-a relation among concepts, i.e., the meanings of words. For example, to describe with a term-based representation documents containing the three words "animal", "dog", and "cat", a vector of three elements is needed; with an ontology-based representation, since "animal" subsumes both "dog" and "cat", it is possible to use a vector with only two elements, related to the "dog" and "cat" concepts, that can also implicitly contain the information given by the presence of the "animal" concept. Moreover, by defining an ontology base, which is a set of independent concepts that covers the whole ontology, an ontology-based representation allows the system to use fixed-size document vectors, consisting of one component per base concept.

Calculating term importance is a fundamental aspect of representing documents in conventional information retrieval approaches. It is usually determined through term frequency-inverse document frequency (TF-IDF). When using an ontology-based representation, the usual definition of term frequency cannot be applied because one does not operate on keywords, but on concepts. This is why the concept-based document representation proposed in [9], which is a concept-based adaptation of TF-IDF, has been adopted.

In this paper, an adaptation of the approach proposed in [9] is presented. The original approach was proposed for domain-specific ontologies and does not always consider all the possible concepts in the considered ontology, in the sense that it assumes a cut at a given specificity level. Instead, the proposed approach has been adapted for more general-purpose ontologies and takes into account all independent concepts contained in the considered ontology. This way, the information associated with each concept is more precise and the problem of choosing a suitable level at which to apply the cut is overcome.

Figure 1: Ontology representation for concept 'z'.

Figure 2: Ontology representation for concept 'y'.
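The idea of a fixed-size vector over base concepts can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`base_concepts`, `concept_vector`, `document_vector`, `prune`) are hypothetical, the hierarchy is a toy tree rather than WordNet, and it assumes that one occurrence of a non-leaf concept is split equally among its children, which is the rule that Equation 1 of this section formalizes. The repository-level normalization of Equation 2 is left out.

```python
def base_concepts(children):
    """Return the concepts with no hyponyms (the leaves), in a fixed sorted order."""
    nodes = set(children) | {kid for kids in children.values() for kid in kids}
    return sorted(node for node in nodes if not children.get(node))


def concept_vector(concept, children, base):
    """Spread one occurrence of `concept` over the base concepts it subsumes.

    Assumption of this sketch: an inner concept passes an equal share of its
    weight to each child, recursively, until the leaves are reached.
    """
    kids = children.get(concept, [])
    if not kids:  # a base concept is its own unit vector
        return [1.0 if b == concept else 0.0 for b in base]
    vector = [0.0] * len(base)
    for kid in kids:
        for i, weight in enumerate(concept_vector(kid, children, base)):
            vector[i] += weight / len(kids)
    return vector


def document_vector(concepts, children, base):
    """Sum the base-concept vectors of every concept occurrence in a document."""
    vector = [0.0] * len(base)
    for concept in concepts:
        for i, weight in enumerate(concept_vector(concept, children, base)):
            vector[i] += weight
    return vector


def prune(weights, cut_value=0.01):
    """Keep only the entries whose importance is above the cut-value."""
    return {name: w for name, w in weights.items() if w > cut_value}


# Toy is-a hierarchy taken from the "animal"/"dog"/"cat" example in the text.
toy = {"animal": ["dog", "cat"]}
toy_base = base_concepts(toy)  # ['cat', 'dog']
print(document_vector(["animal", "dog", "cat"], toy, toy_base))  # [1.5, 1.5]
```

With the same functions, a hierarchy in which z has children a, b, c, y and y has children d, x (an assumption inferred from the concept vectors listed in the worked example of this section, since the figures themselves are not reproduced here) yields the base {a, b, c, d, x} and, for the document "xxyyyz", the vector (0.25, 0.25, 0.25, 1.625, 3.625). WordNet allows a concept to have several parents, which this tree-only sketch does not handle.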
The quantity of information given by the presence of concept z in a document depends on the depth of z in the ontology graph, on how many times it appears in the document, and on how many times it occurs in the whole document repository. These two frequencies also depend on the number of concepts which subsume or are subsumed by z. Let us consider a concept x which is a descendant of another concept y, which has q children including x. Concept y is a descendant of a concept z which has k children including y. Concept x is a leaf of the graph representing the ontology. For instance, considering a document containing only "xy", the occurrence of x in the document is $1 + 1/q$. In the document "xyz", the occurrence of x is $1 + \frac{1}{q}\left(1 + \frac{1}{k}\right)$. As can be seen, the implicit occurrences of a leaf are inversely proportional to the number of children of each of its ancestors. Explicit and implicit concepts are taken into account by using the following formula:

\[
N(c) = \mathrm{occ}(c) + \sum_{c \in \mathrm{Path}(c,\ldots,\top)} \; \sum_{i=2}^{\mathrm{depth}(c)} \frac{\mathrm{occ}(c_i)}{\prod_{j=2}^{i} \|\mathrm{children}(c_j)\|}, \tag{1}
\]

where $N(c)$ is the number of occurrences, both explicit and implicit, of concept $c$ and $\mathrm{occ}(c)$ is the number of lexicalizations of $c$ occurring in the document. Here $c_1 = c$ and $c_2, c_3, \ldots$ are its successive ancestors on the path to the root $\top$, as in the example above. The value $N(c)$ is the weight associated with the concept $c$.

Given the ontology base $I = b_1, \ldots, b_n$, where the $b_i$ are the base concepts, the quantity of information, $\mathrm{info}(b_i)$, pertaining to base concept $b_i$ in a document is:

\[
\mathrm{info}(b_i) = \frac{N_{doc}(b_i)}{N_{rep}(b_i)}, \tag{2}
\]

where $N_{doc}(b_i)$ is the number of explicit and implicit occurrences of $b_i$ in the document, and $N_{rep}(b_i)$ is the total number of its explicit and implicit occurrences in the whole document repository. This way, every component of the representation vector gives a value of the importance relation between a document and the relevant base concept.

A concrete example can be explained starting from the light ontology represented in Figures 1 and 2, and by considering a document D1 containing the concepts "xxyyyz". In this case the ontology base is:

\[
I = \{a, b, c, d, x\}
\]

and, for each concept in the ontology, the vectors $N_{doc}$ are:

\[
\begin{aligned}
\bar{z} &= (0.25, 0.25, 0.25, 0.125, 0.125)\\
\bar{a} &= (1.0, 0.0, 0.0, 0.0, 0.0)\\
\bar{b} &= (0.0, 1.0, 0.0, 0.0, 0.0)\\
\bar{c} &= (0.0, 0.0, 1.0, 0.0, 0.0)\\
\bar{y} &= (0.0, 0.0, 0.0, 0.5, 0.5)\\
\bar{d} &= (0.0, 0.0, 0.0, 1.0, 0.0)\\
\bar{x} &= (0.0, 0.0, 0.0, 0.0, 1.0)
\end{aligned}
\]

so the document vector associated with D1 is:

\[
D_1 = (2 \cdot \bar{x}) + (3 \cdot \bar{y}) + \bar{z} = (0.25, 0.25, 0.25, 1.625, 3.625). \tag{3}
\]

In Section 5, a comparison between the proposed representation and two other classic concept-based representations is discussed.

5. REPRESENTATION COMPARISON
In Section 4 the approach used to represent information was described. This section shows the improvements obtained by applying the proposed approach and compares it with two other approaches commonly used in conceptual document representation. The expansion technique is generally used to enrich the information content of queries. However, in recent years some authors have also applied the expansion technique to represent documents [2]. As in [13] [2], we propose an approach that uses WordNet to extract concepts from terms.

The two main improvements obtained by the application of the ontology-based approach are illustrated below.

Information Redundancy.
Approaches that apply the expansion of documents and queries use correlated concepts to expand the original terms of documents and queries. A problem with expansion is that information is redundant and there is no real improvement in the representation of the document (or query) content. With the proposed representation this redundancy is eliminated because only independent concepts are taken into account to represent documents and queries. Another positive aspect is that the size of the vector representing document content by using concepts is generally smaller than the size of the vector representing document content by using terms.

An example of a technique that shows this drawback is presented in [13]. In that work the authors propose an indexing technique that takes into account WordNet synsets instead of terms. For each term in a document, the synsets associated with that term are extracted and then used as tokens for the indexing task. This way, the computational time needed to perform a query is not increased; however, there is a significant overlap of information because different synsets might be semantically correlated. An example is given by the terms "animal" and "pet": these terms have two different synsets, but in the WordNet lattice the term "pet" is linked by an "is-a" relation to the term "animal". Therefore, when a document contains both terms, the same conceptual information is repeated: even if "animal" and "pet" are not represented by the same synset, they are semantically correlated because "pet" is a sub-concept of "animal". In that case the presence of the term "animal" should contribute to the importance of the concept "pet" instead of being represented by a different token.

Computational Time.
When IR approaches are applied in a real-world environment, the computational time needed to evaluate the match between documents and the submitted query has to be considered. It is known that systems using the vector space model have higher efficiency. Concept-based approaches, such as the one presented in [2], generally implement a non-vectorial data structure which needs a higher computational time than a vector space model representation. The approach proposed in this paper overcomes this issue because the document content is represented by a vector and therefore the computational time needed to compute the document score is comparable to the computational time needed by the vector space model.

6. EXPERIMENTS
In this section, the impact of the ontological document and query representation is evaluated. The evaluation method follows the TREC protocol [17]. For each query the first 1000 retrieved documents have been considered and the precision of the system has been calculated at different points: 5, 10, 15, and 30 documents retrieved. Moreover, the precision/recall graph has been calculated.

The experimental campaign has been performed by using the MuchMore collection, which consists of 7823 abstracts of medical papers and 25 queries with their relevance judgments. One of the particular features of this collection is that it contains many medical terms. This gives a term-based representation an advantage over a semantic representation, because specific terms present in documents (for example "Arthroscopic") are very discriminant. Indeed, some problems may occur when using semantic expansion because, generally, the MRD and thesaurus used to expand terms do not contain some domain-specific terms.

The precision/recall graph shown in Figure 3 illustrates the comparison between the proposed approach (gray curve with circle marks), the classical term-based representation (black curve), and the synset representation method [13] (light gray curve with square marks). As expected, for all recall values, the proposed approach obtained better results than the term-based representation. The best gain of the concept-based representation is at recall levels 0.0, 0.2, and 0.4, while for recall values between 0.6 and 1.0 the concept-based precision curve lies close to the other two curves.

Figure 3: Precision/recall results (x-axis: recall; y-axis: precision; curves: Term-Based, Synsets, Onto-Based).

A possible explanation for this scenario is that, for documents that are well related to a particular topic, the adopted ontology representation is able to improve the representation of the document contents. However, for documents that are only partially related to a topic or that contain many ambiguous terms, the proposed approach is not able to maintain a high precision of the results. At the end of this section some issues that may be responsible for this behavior are discussed.

In Table 1 the three different representations are compared on the Precision@X and MAP values. The results show that the proposed approach obtains better results for all precision levels and also for the MAP value.

Systems                 P5      P10     P15     P30     MAP
Term-Based              0.544   0.480   0.405   0.273   0.449
Synset-Indexing [13]    0.648   0.484   0.403   0.309   0.459
Concept-Based           0.744   0.544   0.478   0.394   0.507

Table 1: Comparison between semantic expansion approaches.

An in-depth study of this first experimental campaign has been performed, and we have noticed that for some queries the concept-based representation obtained results that are below our expectations. By inspecting the implemented model, some issues have been noticed and are now under analysis:

- Absence of some terms in the ontology: some terms, in particular terms related to specific domains (biomedical, mechanical, business, etc.), are not defined in the machine-readable dictionary used to define the concept-based version of the documents. This way there is, in some cases, a loss of information that affects the final retrieval result.

- Proper names have not been considered: proper names of persons, geographical locations, industries, etc., are not present in the concept-based index. Observing the content of some documents and topics, proper names turn out to be a discriminant feature in some cases.

- Verbs and adjectives are not present in the ontology either: the concept representation of terms, described in Section 4, does not take into account verbs and adjectives.
  This happens because verbs and adjectives are structured in a different way than nouns. The hyperonymy and hyponymy relations (which make an MRD comparable with ontologies) are not defined for verbs and adjectives; therefore another approach will be studied and implemented to overcome this drawback.

- Term ambiguity: the concept-based representation introduces an error because no word sense disambiguation algorithm is used. With such a method, concepts associated with incorrect senses would be discarded or weighted less. Therefore, the concept-based representation of each word would be finer, and the information contained in a document would be represented with more precision.

Improving the current model with the above features is expected to yield significantly better results in the next experimental campaign. This positive view is motivated by the fact that, in spite of these issues, the preliminary goal of outperforming the precision of the term-based representation has been accomplished.

7. CONCLUSION
In this paper we have discussed an approach to index documents and to represent queries for information retrieval purposes which exploits a conceptual representation based on ontologies.

Experiments have been performed on the MuchMore Collection to validate the approach with respect to problems like term synonymy in documents.

Preliminary experimental results show that the proposed representation improves the ranking of the documents. Investigation of the results highlights that further improvement could be obtained by integrating WSD techniques like the one discussed in [1], to avoid the error introduced by considering incorrect word senses, and by a better usage and interpretation of WordNet, to overcome the loss of information caused by the absence of proper nouns, verbs, and adjectives.

8. REFERENCES
[1] A. Azzini, M. Dragoni, C. da Costa Pereira, and A. Tettamanzi. Evolving neural networks for word sense disambiguation. In Proc. of HIS '08, Barcelona, Spain, September 10-12, pages 332–337, 2008.
[2] M. Baziz, M. Boughanem, G. Pasi, and H. Prade. An information retrieval driven by ontology: from query to document expansion. In D. Evans, S. Furui, and C. Soulé-Dupuy, editors, RIAO. CID, 2007.
[3] B. Billerbeck and J. Zobel. Techniques for efficient query expansion. In A. Apostolico and M. Melucci, editors, SPIRE, volume 3246 of Lecture Notes in Computer Science, pages 30–42. Springer, 2004.
[4] M. Boughanem, T. Dkaki, J. Mothe, and C. Soulé-Dupuy. Mercure at trec7. In TREC, pages 355–360, 1998.
[5] D. Cai, C. van Rijsbergen, and J. Jose. Automatic query expansion based on divergence. In CIKM, pages 419–426. ACM, 2001.
[6] S. Calegari and E. Sanchez. A fuzzy ontology-approach to improve semantic information retrieval. In F. Bobillo, P. da Costa, C. d'Amato, N. Fanizzi, F. Fung, T. Lukasiewicz, T. Martin, M. Nickles, Y. Peng, M. Pool, P. Smrz, and P. Vojtás, editors, URSW, volume 327 of CEUR Workshop Proceedings. CEUR-WS.org, 2007.
[7] P. Castells, M. Fernández, and D. Vallet. An adaptation of the vector-space model for ontology-based information retrieval. IEEE Trans. Knowl. Data Eng., 19(2):261–272, 2007.
[8] S. Cronen-Townsend, Y. Zhou, and W. Croft. A framework for selective query expansion. In D. Grossman, L. Gravano, C. Zhai, O. Herzog, and D. Evans, editors, CIKM, pages 236–237. ACM, 2004.
[9] C. da Costa Pereira and A. G. B. Tettamanzi. Soft computing in ontologies and semantic Web, chapter An ontology-based method for user model acquisition, pages 211–227. Studies in fuzziness and soft computing. Ed. Zongmin Ma, Springer, Berlin, 2006.
[10] M. Díaz-Galiano, M. G. Cumbreras, M. Martín-Valdivia, A. M. Ráez, and L. Ureña-López. Integrating mesh ontology to improve medical information retrieval. In CLEF, volume 5152 of Lecture Notes in Computer Science, pages 601–606. Springer, 2007.
[11] O. Dridi. Ontology-based information retrieval: Overview and new proposition. In O. Pastor, A. Flory, and J.-L. Cavarero, editors, RCIS, pages 421–426. IEEE, 2008.
[12] E. Efthimiadis. Query expansion. In M. Williams, editor, Annual review of information science and technology, volume 31, pages 121–187. Information Today Inc, Medford NJ, 1996.
[13] J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarrán. Indexing with wordnet synsets can improve text retrieval. CoRR, cmp-lg/9808002, 1998.
[14] T. Hattori, K. Hiramatsu, T. Okadome, B. Parsia, and E. Sirin. Ichigen-san: An ontology-based information retrieval system. In X. Zhou, J. Li, H. Shen, M. Kitsuregawa, and Y. Zhang, editors, APWeb, volume 3841 of Lecture Notes in Computer Science, pages 1197–1200. Springer, 2006.
[15] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, 1975.
[16] S. Tomassen. Research on ontology-driven information retrieval. In R. Meersman, Z. Tari, and P. Herrero, editors, OTM Workshops (2), volume 4278 of Lecture Notes in Computer Science, pages 1460–1468. Springer, 2006.
[17] E. Voorhees and D. Harman. Overview of the sixth text retrieval conference (trec-6). In TREC, pages 1–24, 1997.
[18] F. Wu, G. Wu, and X. Fu. Design and implementation of ontology-based query expansion for information retrieval. In L. Xu, A. Tjoa, and S. Chaudhry, editors, CONFENIS (1), volume 254 of IFIP, pages 293–298. Springer, 2007.
[19] J. Xu and W. Croft. Query expansion using local and global document analysis. In H.-P. Frei, D. Harman, P. Schäuble, and R. Wilkinson, editors, SIGIR, pages 4–11. ACM, 1996.