An Ontological Representation of Documents and Queries
           for Information Retrieval Systems

                   Mauro Dragoni                           Célia da Costa Pereira            Andrea G.B. Tettamanzi
          Università degli Studi di Milano              Università degli Studi di Milano    Università degli Studi di Milano
           Dipartimento di Tecnologie                    Dipartimento di Tecnologie          Dipartimento di Tecnologie
                dell’Informazione                             dell’Informazione                   dell’Informazione
            Via Bramante 65, I-26013                      Via Bramante 65, I-26013            Via Bramante 65, I-26013
                Crema (CR), Italy                             Crema (CR), Italy                   Crema (CR), Italy
            mauro.dragoni@unimi.it                          celia.pereira@unimi.it          andrea.tettamanzi@unimi.it

ABSTRACT                                                                  be represented in the same way. This way, the risk of omit-
This paper presents a vector space model approach for rep-                ting some related terms (as may happen with the classical
resenting documents and queries that uses concepts instead of             query expansion technique) is reduced. However, it is nec-
terms and WordNet as a light ontology. This way, informa-                 essary to use a language resource that covers a larger
tion overlap is reduced with respect to the classic semantic              number of terms in order to avoid information loss.
expansion techniques. Experiments undertaken on the                          This paper presents a new representation for documents
MuchMore benchmark showed the effectiveness of the approach.              and queries. The proposed approach exploits the structure
                                                                          of the well-known machine readable dictionary WordNet in
                                                                          order to reduce the redundancy of information generally con-
1.   INTRODUCTION                                                         tained in a concept-based document representation. The
   This paper presents an ontology-based approach for a con-              second improvement is the reduction of the computational
ceptual representation of documents. Such an approach is                  time needed to compare documents and queries represented
inspired by a recently proposed idea presented in [9], and                by using concepts. This representation has been applied
uses an adapted version of that method to standardize the                 to the ad-hoc retrieval problem. The approach has been
representation of documents and queries. The proposed                     evaluated on the MuchMore1 Collection [4] and the results
approach is somewhat similar to the classic query expan-                  demonstrate its viability.
sion technique. However, additional considerations have been                 In Section 2 an overview of the environment in which on-
taken into account and some improvements have been ap-                    tologies have been used is presented. Section 3 presents the
plied, as explained below.                                                tools used for this work. Section 4 illustrates the proposed
   Query expansion is an approach used in Information Re-                 approach to represent information, while Section 5 compares
trieval (IR) in order to improve the system’s performance.                this approach with two other well-known approaches used in
It consists of the expansion of the content of the query by               conceptual representation of documents. In Section 6 the re-
adding terms that are semantically correlated with the                    sults obtained from the test campaign are discussed. Finally,
original terms of the query [12]. Several works demonstrated              Section 7 concludes.
the enhanced performance of IR systems that implement
query expansion approaches [19] [3] [5]. However, the query               2.     RELATED WORKS
expansion approach has to be used carefully because, as                      An increasing number of recent information retrieval sys-
demonstrated in [8], expansion might degrade the perfor-                  tems make use of ontologies to help the users clarify their
mance of some individual queries. This is due to the fact                 information needs and come up with semantic representa-
that an incorrect choice of terms and concepts for the ex-                tions of documents. Many ontology-based information re-
pansion task might harm the retrieval process by drifting it              trieval systems and models have been proposed in the last
away from the optimal correct answer.                                     decade. An interesting review on IR techniques based on
   Document expansion applied to IR has been recently pro-                ontologies is presented in [11], while in [16] the author stud-
posed in [2]. In that work a sub-tree approach has been im-               ies the application of ontologies to a large-scale IR system
plemented to represent concepts in documents and queries.                 for web purposes. A model for the exploitation of ontology-
However, when using a tree structure there is a redundancy                based knowledge bases is presented in [7]. The aim of this
of information because more general concepts may be rep-                  model is to improve search over large document reposito-
resented implicitly by using only the leaf concepts they sub-             ries. The model includes an ontology-based scheme for the
sume. The smart idea behind the representation of docu-                   annotation of documents, and a retrieval model based on an
ments by using concepts is that documents and queries may                 adaptation of the classic vector-space model [15]. Another
                                                                          information retrieval system based on ontologies is presented
                                                                          in [14]. The authors propose an information retrieval system
                                                                          which has landmark information database that has hierar-
                                                                          chical structures and semantic meanings of the features and
Appears in the Proceedings of the 1st Italian Information Retrieval
Workshop (IIR’10), January 27–28, 2010, Padova, Italy.                    1
                                                                              http://muchmore.dfki.de
http://ims.dei.unipd.it/websites/iir10/index.html
Copyright owned by the authors.
characteristics of the landmarks.                                query. Therefore, to each element present in the concept-
   The implementation of ontology models has been also in-       based representation of the query, its concept weight has
vestigated by using fuzzy models [6].                            been used as boost value.
   In IR, the user’s input queries are usually not detailed
enough to return satisfactory results. Query expansion           4.     DOCUMENT REPRESENTATION
can help to solve this problem. However, common query              Conventional IR approaches represent documents as vec-
expansion techniques do not yield stable retrieval results.      tors of term weights. Such representations use a vector with
Ontologies play a key role in query expansion research.          one component for every significant term that occurs in the
A common use of ontologies in query expansion is to              document. This has several limitations, for example:
enrich the resources with some well-defined meaning to
enhance the search capabilities of existing web search                1. different vector positions may be allocated to the syn-
systems.                                                                 onyms of the same term; this way there is an infor-
   In [18] the authors propose and implement a query expan-              mation loss because the importance of a given
sion method which combines a domain ontology with the fre-               concept is distributed among different vector compo-
quency of terms. The ontology is used to describe domain                 nents;
knowledge; a logic reasoner and the frequency of terms are
used to choose fitting expansion words. This way, higher              2. the size of a document vector has to be at least equal
recall and precision can be obtained.                                    to the total number of words of the language used to
   In [10] the authors present an approach to expand queries             write the document;
that consists in searching an ontology for the terms of the
topic query in order to add similar terms.                            3. every time a new set of terms is introduced (which is a
                                                                         high-probability event), all document vectors must be
                                                                         reconstructed; the size of a repository thus grows not
                                                                         only as a function of the number of documents that
3.     PRELIMINARIES                                                     it contains, but also of the size of the representation
  The roadmap to prove the viability of a concept-based rep-             vectors.
resentation of documents and queries consists of two main
tasks:                                                           To overcome these weaknesses of term-based representations,
                                                                 an ontology-based representation has been used [9].
- to choose a method that allows all document terms to be           The ontology-based representation recently pro-
      represented by using the same set of concepts;             posed in [9] exploits the hierarchical is-a relation
                                                                 among concepts, i.e., the meanings of words. For example,
- to implement an approach that allows each concept to be        to describe with a term-based representation documents con-
      indexed and evaluated, in both documents and queries,      taining the three words “animal”, “dog”, and “cat”, a vector
      with the appropriate weight.                               of three elements is needed; with an ontology-based repre-
                                                                 sentation, since “animal” subsumes both “dog” and “cat”, it
   To represent documents, the method described in Sec-          is possible to use a vector with only two elements, related to
tion 4 has been used, combined with the use of the WordNet       the “dog” and “cat” concepts, that can also implicitly con-
machine-readable dictionary. From the WordNet database,          tain the information given by the presence of the “animal”
the set of terms that do not have hyponyms has been ex-          concept. Moreover, by defining an ontology base, which is a
tracted; each such term is named “base concept”. A vector,       set of independent concepts that covers the whole ontology,
named “base vector”, has been created and, to each component of  an ontology-based representation allows the system to use
the vector, a base concept has been assigned. This way, each     fixed-size document vectors, consisting of one component
term is represented by using the base vector of the WordNet      per base concept.
ontology.                                                           Calculating term importance is a significant and funda-
   The representation described above has been implemented       mental aspect for representing documents in conventional
on top of the Apache Lucene open-source API (footnote 2).        information retrieval approaches. It is usually determined
   In the pre-indexing phase, each document has been con-        through term frequency-inverse document frequency (TF-
verted into its ontological representation. After the calcula-   IDF). When using an ontology-based representation, the
tion of the importance of each concept in a document, only       usual definition of term-frequency cannot be applied because
concepts with a degree of importance higher than a fixed         one does not operate by keywords, but by concepts. This
cut-value have been maintained, while the others have been       is the reason why we have adopted the document rep-
discarded. The cut-value used in these experiments is 0.01.      resentation based on concepts proposed in [9], which is a
This choice has a drawback, namely that an approximation         concept-based adaptation of TF-IDF.
of the represented information is introduced by discarding          In this paper, an adaptation of the approach proposed in
some minor concepts. However, we have experimentally             [9] is presented. The original approach was proposed for
verified that this approximation does not affect the final re-   domain-specific ontologies and does not always consider all
sults.                                                           the possible concepts in the considered ontology, in the sense
   During the evaluation activity, queries have also been con-   that it assumes a cut at a given specificity level. Instead,
verted into the ontological representation. This way, weights    the proposed approach has been adapted for more general-
have to be assigned to each concept to evaluate all concepts     purpose ontologies and it takes into account all independent
with the right proportion. One of the features of Lucene is      concepts contained in the considered ontology. This way,
the possibility of assigning a payload to each term of the       information associated with each concept is more precise and
                                                                 the problem of choosing a suitable level at which to apply the cut
2
    See URL http://lucene.apache.org/.                           is overcome.
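As an illustration of the ontology base described in this section, the following minimal Python sketch extracts the base concepts (the concepts without hyponyms) from a toy is-a hierarchy built on the “animal”, “dog”, “cat” example. The hierarchy, the dictionary IS_A, and the function names are hypothetical and are introduced here only for illustration; they are not taken from WordNet or from the implementation used in the experiments.

```python
from typing import Dict, List, Set

# Hypothetical toy is-a hierarchy (parent -> children); not real WordNet data.
IS_A: Dict[str, List[str]] = {
    "animal": ["dog", "cat"],
    "dog": [],
    "cat": [],
}


def ontology_base(hierarchy: Dict[str, List[str]]) -> List[str]:
    """Return the concepts without children (the base concepts), sorted."""
    return sorted(c for c, children in hierarchy.items() if not children)


def subsumed_base_concepts(concept: str,
                           hierarchy: Dict[str, List[str]]) -> Set[str]:
    """Return the base concepts that a concept covers through is-a links."""
    children = hierarchy[concept]
    if not children:
        # A base concept covers only itself.
        return {concept}
    covered: Set[str] = set()
    for child in children:
        covered |= subsumed_base_concepts(child, hierarchy)
    return covered


if __name__ == "__main__":
    base = ontology_base(IS_A)
    print("ontology base:", base)
    for concept in IS_A:
        print(concept, "->", sorted(subsumed_base_concepts(concept, IS_A)))
```

Since every concept is expressed over the same sorted list of base concepts, all document vectors have the same length, regardless of which terms a document contains.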
 Figure 1: Ontology representation for concept ’z’.                           Figure 2: Ontology representation for concept ’y’.


   The quantity of information given by the presence of con-                                 z = (0.25, 0.25, 0.25, 0.125, 0.125)
cept z in a document depends on the depth of z in the ontol-                                 a = (1.0, 0.0, 0.0, 0.0, 0.0)
ogy graph, on how many times it appears in the document,                                     b = (0.0, 1.0, 0.0, 0.0, 0.0)
and how many times it occurs in the whole document repos-                                    c = (0.0, 0.0, 1.0, 0.0, 0.0)
itory. These two frequencies also depend on the number of                                    y = (0.0, 0.0, 0.0, 0.5, 0.5)
concepts which subsume or are subsumed by z. Let us con-                                     d = (0.0, 0.0, 0.0, 1.0, 0.0)
sider a concept x which is a descendant of another concept y                                 x = (0.0, 0.0, 0.0, 0.0, 1.0) ,
which has q children including x. Concept y is a descendant                        so the document vector associated to D1 is:
of a concept z which has k children including y. Concept
x is a leaf of the graph representing the used ontology. For
instance, considering a document containing only “xy”, the                     D1 = (2∗ x̄)+(3∗ ȳ)+z̄ = (0.25, 0.25, 0.25, 1.625, 3.625). (3)
number of occurrences of x in the document is 1 + (1/q). In
the document “xyz”, the number of occurrences of x is                            In Section 5, the proposed representation is compared
1 + (1/q)(1 + 1/k). As can be seen, the implicit occurrences                  with two other classic concept-based representations.
that a leaf inherits from its ancestors decrease as the
number of children of those ancestors grows. Explicit and
implicit concepts are taken into account by using the
following formulas:                                                           5.     REPRESENTATION COMPARISON
                                                                                 In Section 4 the approach used to represent information
                                                                              was described. This section shows the improvements ob-
                                                                              tained by applying the proposed approach and it illustrates
$$N(c) = \mathrm{occ}(c) + \sum_{c \in \mathrm{Path}(c,\ldots,\top)} \; \sum_{i=2}^{\mathrm{depth}(c)} \frac{\mathrm{occ}(c_i)}{\prod_{j=2}^{i} \|\mathrm{children}(c_j)\|}, \qquad (1)$$
                                                                              a comparison between the proposed approach and other two
                                                                              approaches commonly used in conceptual document repre-
   where N (c) is the number of occurrences, both explicit                    sentation. The expansion technique is generally used to en-
and implicit, of concept c and occ(c) is the number of lexi-                  rich information content of queries. However, in the last
calizations of c occurring in the document. The value N (c)                   years some authors applied the expansion technique also to
is the weight associated with the concept c.                                  represent documents [2]. Like in [13] [2], we propose an ap-
   Given the ontology base I = {b1, . . . , bn}, where the bi are             proach that uses WordNet to extract concepts from terms.
the base concepts, the quantity of information, info(bi), per-                   The two main improvements obtained by the application
taining to base concept bi in a document is:                                  of the ontology-based approach are illustrated below.

                                                                              Information Redundancy.
$$\mathrm{info}(b_i) = \frac{N_{doc}(b_i)}{N_{rep}(b_i)}, \qquad (2)$$              Approaches that apply the expansion of documents and
                                                                              queries, use correlated concepts to expand the original terms
   where Ndoc (bi ) is the number of explicit and implicit oc-                of documents and queries. A problem with expansion is
currences of bi in the document, and Nrep (bi ) is the total                  that information is redundant and there is not a real im-
number of its explicit and implicit occurrences in the whole                  provement of the representation of the document (or query)
document repository. This way, every component of the rep-                    content. With the proposed representation this redundancy
resentation vector gives a value of the importance relation                   is eliminated because only independent concepts are taken
between a document and the relevant base concept.                             into account to represent documents and queries. Another
   A concrete example can be explained starting from the                      positive aspect is that the size of the vector representing doc-
light ontology represented in Figures 1 and 2, and by con-                    ument content by using concepts is generally lower than the
sidering a document D1 containing concepts “xxyyyz”.                          size of the vector representing document content by using
   In this case the ontology base is:                                         terms.
                                                                                 An example of a technique that shows this drawback is pre-
                        I = {a, b, c, d, x}                                   sented in [13]. In this work the authors propose an indexing
                                                                              technique that takes into account WordNet synsets instead
  and, for each concept in the ontology, the vectors Ndoc                     of terms. For each term in documents, the synsets asso-
are:                                                                          ciated with that term are extracted and then used as tokens
for the indexing task. This way, the computational time
needed to perform a query is not increased; however, there is
a significant overlap of information because different synsets
might be semantically correlated. An example is given by
the terms “animal” and “pet”: these terms have two different
synsets; however, observing the WordNet lattice, the term
“pet” is linked by an “is-a” relation to the term “animal”.
Therefore, in a scenario in which a document contains both
terms, the same conceptual information is repeated. This is
clear because, even if the terms “animal” and “pet” are not
represented by using the same synset, they are semantically
correlated because “pet” is a sub-concept of “animal”. This
way, when a document contains both terms, the presence of
the term “animal” has to contribute to the importance of the              [Plot: precision (y axis) versus recall (x axis) for the
concept “pet” instead of being represented with a different               Term-Based, Synsets, and Onto-Based representations.]
token.                                                                                    Figure 3: Precision/recall results.
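The contribution of a general concept to the concepts it subsumes can be checked numerically on the worked example of Section 4. The sketch below is a minimal Python illustration, assuming that each occurrence of a non-leaf concept is shared equally among its children until it reaches the base concepts. The parent/child structure in CHILDREN is inferred from the vectors listed in that example (Figures 1 and 2 are not reproduced here), and all names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

# Assumed toy hierarchy, inferred from the vectors of the worked example:
# z has children a, b, c, y; y has children d, x.
CHILDREN: Dict[str, List[str]] = {
    "z": ["a", "b", "c", "y"],
    "y": ["d", "x"],
    "a": [], "b": [], "c": [], "d": [], "x": [],
}
BASE: List[str] = ["a", "b", "c", "d", "x"]  # the ontology base I


def base_vector(concept: str) -> Dict[str, float]:
    """Spread one occurrence of a concept over the base concepts it subsumes."""
    kids = CHILDREN[concept]
    if not kids:
        # A base concept keeps its whole occurrence.
        return {concept: 1.0}
    vector: Dict[str, float] = {}
    for kid in kids:
        # Each child receives an equal share of the parent's occurrence.
        for base, weight in base_vector(kid).items():
            vector[base] = vector.get(base, 0.0) + weight / len(kids)
    return vector


def document_vector(concepts: List[str]) -> List[float]:
    """Sum the base vectors of all concept occurrences in a document."""
    totals = dict.fromkeys(BASE, 0.0)
    for concept, count in Counter(concepts).items():
        for base, weight in base_vector(concept).items():
            totals[base] += count * weight
    return [totals[base] for base in BASE]


if __name__ == "__main__":
    print(document_vector(list("xxyyyz")))
```

Running the sketch on the document “xxyyyz” yields (0.25, 0.25, 0.25, 1.625, 3.625), the vector reported in Equation (3). Dividing each component by the corresponding total over the repository would give the info(bi) values of Equation (2).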

Computational Time.
   When IR approaches are applied in a real-world environ-        ontology representation is able to improve the representation
ment, the computational time needed to evaluate the match         of the document contents. However, for documents that are
between documents and the submitted query has to be con-          partially related to a topic or that contain many ambigu-
sidered. It is known that systems using the vector space          ous terms, the proposed approach is not able to maintain
model have higher efficiency. Concept-based approaches,           a high precision of the results. At the end of this section
such as the one presented in [2], generally implement a non-      some issues that may be responsible for this fact are
vectorial data structure which needs a higher computational       discussed.
time with respect to a vector space model representation.            In Table 1 the three different representations are compared
The approach proposed in this paper overcomes this issue          for the Precision@X and MAP values. The results show
because the document content is represented by using a vec-       that the proposed approach obtains better results for all
tor and, therefore, the computational time needed to com-         precision levels and also for the MAP value.
pute the document score is comparable to the computational
time needed by using the vector space model.                        Systems                 P5      P10     P15     P30     MAP
                                                                    Term-Based              0.544   0.480   0.405   0.273   0.449
6.   EXPERIMENTS                                                    Synset-Indexing [13]    0.648   0.484   0.403   0.309   0.459
   In this section, the impact of the ontology document and         Concept-Based           0.744   0.544   0.478   0.394   0.507
query representation is evaluated. The evaluation method
follows the TREC protocol [17]. For each query the first          Table 1: Comparison between semantic expansion approaches
1000 retrieved documents have been considered and the pre-        (precision at 5, 10, 15, and 30 retrieved documents, and MAP).
cision of the system has been calculated at different points:
5, 10, 15, and 30 documents retrieved. Moreover, the preci-
sion/recall graph has been calculated.
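The measures used in this evaluation can be sketched as follows. This is a minimal Python illustration of precision at a cut-off and of average precision for a single query, with the ranking truncated to the first 1000 retrieved documents; it is not the TREC evaluation tooling, and the function names, document identifiers, and relevance set are hypothetical.

```python
from typing import Sequence, Set


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the first k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranking[:k] if doc_id in relevant)
    return hits / k


def average_precision(ranking: Sequence[str], relevant: Set[str],
                      cutoff: int = 1000) -> float:
    """Average of the precision values at the rank of each relevant hit."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking[:cutoff], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return total / len(relevant)


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4", "d5"]   # hypothetical ranked list
    relevant = {"d1", "d3", "d9"}              # hypothetical judgments
    print(precision_at_k(ranking, relevant, 5))
    print(average_precision(ranking, relevant))
```

Averaging `average_precision` over all queries of a collection gives the MAP value; the P5 to P30 figures correspond to `precision_at_k` with k set to 5, 10, 15, and 30.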
   The experimental campaign has been performed by us-              An in-depth study of this first experimental campaign has
ing the MuchMore collection, which consists of 7823 abstracts     been performed, and we have noticed that for some queries
of medical papers and 25 queries with their relevance judg-       the concept-based representation obtained results that are
ments. One of the particular features of this collection is       below our expectations. By inspecting the implemented
that it contains many medical terms. This way, a term-based       model, some issues have been noticed and are currently un-
representation has an advantage over a semantic                   der analysis:
representation, because specific terms present in documents
(for example “Arthroscopic”) are very discriminant. Indeed,       - Absence of some terms in the ontology: some terms, in
by using a semantic expansion some problems may occur                  particular terms related to specific domains (biomed-
because, generally, the MRD and thesaurus used to expand               ical, mechanical, business, etc.), are not defined in
terms do not contain some domain-specific terms.                       the machine readable dictionary used to define the
   The precision/recall graph shown in Figure 3 illustrates            concept-based version of the documents. This way
the comparison between the proposed approach (gray curve               there is, in some cases, a loss of information that affects
with circle marks), the classical term-based representation            the final retrieval result.
(black curve), and the synset representation method [13]
(light gray curve with square marks). As expected, for all        - Proper names have not been considered: proper names
recall values, the proposed approach obtained better results           of persons, geographical locations, industries, etc., are
than the term-based representation. The best gain of the               not present in the concept-based index. Observing the
concept-based representation is at recall levels 0.0, 0.2, and         content of some documents and topics, proper names
0.4, while for recall values between 0.6 and 1.0 the concept-          turn out to be a discriminant feature in some cases.
based precision curve lies close to the other two curves.
   A possible explanation for this scenario is that for docu-     - Verbs and adjectives are not present in the ontology either:
ments that are well related to a particular topic the adopted          the concept representation of terms, described in Sec-
                                                                       tion 4, does not take into account verbs and adjectives.
     This happens because verbs and adjectives are struc-               retrieval. In F. Bobillo, P. da Costa, C. d’Amato,
     tured in a different way than nouns. The hyperonymy                N. Fanizzi, F. Fung, T. Lukasiewicz, T. Martin,
     and hyponymy relations (that make MRD comparable                   M. Nickles, Y. Peng, M. Pool, P. Smrz, and P. Vojtás,
     with ontologies) are not defined for verbs and adjec-              editors, URSW, volume 327 of CEUR Workshop
     tives, therefore another approach will be studied and              Proceedings. CEUR-WS.org, 2007.
     implemented to overcome this drawback.                         [7] P. Castells, M. Fernández, and D. Vallet. An
                                                                        adaptation of the vector-space model for
- Term ambiguity: the concept-based representation has the              ontology-based information retrieval. IEEE Trans.
     problem of introducing an error given by not using a               Knowl. Data Eng., 19(2):261–272, 2007.
     word sense disambiguation algorithm. Using such a              [8] S. Cronen-Townsend, Y. Zhou, and W. Croft. A
     method, concepts associated to incorrect senses would              framework for selective query expansion. In
     be discarded or weighted less. Therefore, the concept-             D. Grossman, L. Gravano, C. Zhai, O. Herzog, and
     based representation of each word would be finer, with             D. Evans, editors, CIKM, pages 236–237. ACM, 2004.
     the consequence of representing the information con-
     tained in a document with more precision.                      [9] C. da Costa Pereira and A. G. B. Tettamanzi. Soft
                                                                        computing in ontologies and semantic Web, chapter
  Improving the current model with the above features is                An ontology-based method for user model acquisition,
expected to yield better results in the next experimental               pages 211–227. Studies in fuzziness and soft
campaign. This positive view is motivated by the fact                   computing. Ed. Zongmin Ma, Springer, Berlin, 2006.
that, in spite of these issues, the preliminary goal of outper-    [10] M. Díaz-Galiano, M. G. Cumbreras,
forming the precision of the term-based representation has              M. Martín-Valdivia, A. M. Ráez, and L. Ureña-López.
been accomplished.                                                      Integrating mesh ontology to improve medical
                                                                        information retrieval. In CLEF, volume 5152 of
                                                                        Lecture Notes in Computer Science, pages 601–606.
7.   CONCLUSION                                                         Springer, 2007.
   In this paper we have discussed an approach to index doc-       [11] O. Dridi. Ontology-based information retrieval:
uments and to represent queries for information retrieval               Overview and new proposition. In O. Pastor, A. Flory,
purposes which exploits a conceptual representation based               and J.-L. Cavarero, editors, RCIS, pages 421–426.
on ontologies.                                                          IEEE, 2008.
   Experiments have been performed on the MuchMore Col-            [12] E. Efthimiadis. Query expansion. In M. Williams,
lection to validate the approach with respect to problems               editor, Annual review of information science and
like term synonymy in documents.                                        technology, volume 31, pages 121–187. Information
   Preliminary experimental results show that the proposed              Today Inc, Medford NJ, 1996.
representation improves the ranking of the documents. Analysis     [13] J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarrán.
of the results indicates that further improvement could be              Indexing with wordnet synsets can improve text
obtained by integrating WSD techniques like the one discussed           retrieval. CoRR, cmp-lg/9808002, 1998.
in [1], to avoid the error introduced by considering incorrect     [14] T. Hattori, K. Hiramatsu, T. Okadome, B. Parsia, and
word senses, and by a better usage and interpretation of                E. Sirin. Ichigen-san: An ontology-based information
WordNet, to overcome the loss of information caused by the              retrieval system. In X. Zhou, J. Li, H. Shen,
absence of proper nouns, verbs, and adjectives.                         M. Kitsuregawa, and Y. Zhang, editors, APWeb,
                                                                        volume 3841 of Lecture Notes in Computer Science,
                                                                        pages 1197–1200. Springer, 2006.
8.   REFERENCES                                                    [15] G. Salton, A. Wong, and C. Yang. A vector space
 [1] A. Azzini, M. Dragoni, C. da Costa Pereira, and                    model for automatic indexing. Commun. ACM,
     A. Tettamanzi. Evolving neural networks for word                   18(11):613–620, 1975.
     sense disambiguation. In Proc. of HIS ’08, Barcelona,         [16] S. Tomassen. Research on ontology-driven information
     Spain, September 10-12, pages 332–337, 2008.                       retrieval. In R. Meersman, Z. Tari, and P. Herrero,
 [2] M. Baziz, M. Boughanem, G. Pasi, and H. Prade. An                  editors, OTM Workshops (2), volume 4278 of Lecture
     information retrieval driven by ontology: from query               Notes in Computer Science, pages 1460–1468.
     to document expansion. In D. Evans, S. Furui, and                  Springer, 2006.
     C. Soulé-Dupuy, editors, RIAO. CID, 2007.                    [17] E. Voorhees and D. Harman. Overview of the sixth
 [3] B. Billerbeck and J. Zobel. Techniques for efficient               text retrieval conference (trec-6). In TREC, pages
     query expansion. In A. Apostolico and M. Melucci,                  1–24, 1997.
     editors, SPIRE, volume 3246 of Lecture Notes in               [18] F. Wu, G. Wu, and X. Fu. Design and implementation
     Computer Science, pages 30–42. Springer, 2004.                     of ontology-based query expansion for information
 [4] M. Boughanem, T. Dkaki, J. Mothe, and                              retrieval. In L. Xu, A. Tjoa, and S. Chaudhry, editors,
     C. Soulé-Dupuy. Mercure at trec7. In TREC, pages                  CONFENIS (1), volume 254 of IFIP, pages 293–298.
     355–360, 1998.                                                     Springer, 2007.
 [5] D. Cai, C. van Rijsbergen, and J. Jose. Automatic             [19] J. Xu and W. Croft. Query expansion using local and
     query expansion based on divergence. In CIKM, pages                global document analysis. In H.-P. Frei, D. Harman,
     419–426. ACM, 2001.                                                P. Schäuble, and R. Wilkinson, editors, SIGIR, pages
 [6] S. Calegari and E. Sanchez. A fuzzy                                4–11. ACM, 1996.
     ontology-approach to improve semantic information