=Paper=
{{Paper
|id=None
|storemode=property
|title=An Ontological Representation of Documents and Queries for Information Retrieval Systems
|pdfUrl=https://ceur-ws.org/Vol-560/paper18.pdf
|volume=Vol-560
|dblpUrl=https://dblp.org/rec/conf/iir/DragoniPT10
}}
==An Ontological Representation of Documents and Queries for Information Retrieval Systems==
Mauro Dragoni, Célia da Costa Pereira, Andrea G.B. Tettamanzi
Università degli Studi di Milano, Dipartimento di Tecnologie dell’Informazione
Via Bramante 65, I-26013 Crema (CR), Italy
mauro.dragoni@unimi.it, celia.pereira@unimi.it, andrea.tettamanzi@unimi.it
Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR’10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors.
ABSTRACT
This paper presents a vector space model approach for representing documents and queries that uses concepts instead of terms, with WordNet serving as a light ontology. In this way, information overlap is reduced with respect to the classic semantic expansion techniques. Experiments on the MuchMore benchmark show the effectiveness of the approach.
1. INTRODUCTION
This paper presents an ontology-based approach for a conceptual representation of documents. The approach is inspired by an idea recently proposed in [9], and uses an adapted version of that method to standardize the representation of documents and queries. The proposed approach is somewhat similar to the classic query expansion technique. However, additional considerations have been taken into account and some improvements have been applied, as explained below.
Query expansion is an approach used in Information Retrieval (IR) to improve the system’s performance. It consists of expanding the content of the query by adding terms that are semantically correlated with the original terms of the query [12]. Several works have demonstrated the enhanced performance of IR systems that implement query expansion approaches [19] [3] [5]. However, query expansion has to be used carefully because, as demonstrated in [8], expansion might degrade the performance of some individual queries. This is because an incorrect choice of terms and concepts for the expansion task might harm the retrieval process by drifting it away from the optimal answer.
Document expansion applied to IR has recently been proposed in [2]. In that work, a sub-tree approach has been implemented to represent concepts in documents and queries. However, a tree structure introduces a redundancy of information, because more general concepts may be represented implicitly by using only the leaf concepts they subsume. The key idea behind representing documents by concepts is that documents and queries may be represented in the same way. In this way, the risk of omitting some related terms (as may happen in the classical query expansion technique) is reduced. However, it is necessary to use a language resource that covers a large number of terms in order to avoid information loss.
This paper presents a new representation for documents and queries. The proposed approach exploits the structure of the well-known machine-readable dictionary WordNet in order to reduce the redundancy of information generally contained in a concept-based document representation. The second improvement is the reduction of the computational time needed to compare documents and queries represented by concepts. This representation has been applied to the ad-hoc retrieval problem. The approach has been evaluated on the MuchMore Collection (http://muchmore.dfki.de) [4], and the results demonstrate its viability.
Section 2 gives an overview of the environment in which ontologies have been used. Section 3 presents the tools used for this work. Section 4 illustrates the proposed approach to represent information, while Section 5 compares this approach with two other well-known approaches used in the conceptual representation of documents. In Section 6 the results obtained from the test campaign are discussed. Finally, Section 7 concludes.
2. RELATED WORKS
An increasing number of recent information retrieval systems make use of ontologies to help users clarify their information needs and come up with semantic representations of documents. Many ontology-based information retrieval systems and models have been proposed in the last decade. An interesting review of IR techniques based on ontologies is presented in [11], while in [16] the author studies the application of ontologies to a large-scale IR system for web purposes. A model for the exploitation of ontology-based knowledge bases is presented in [7]. The aim of this model is to improve search over large document repositories. The model includes an ontology-based scheme for the annotation of documents, and a retrieval model based on an adaptation of the classic vector-space model [15]. Another information retrieval system based on ontologies is presented in [14]. The authors propose an information retrieval system with a landmark information database that has hierarchical structures and semantic meanings of the features and
characteristics of the landmarks.
The implementation of ontology models has also been investigated by using fuzzy models [6].
In IR, the user’s input queries are usually not detailed enough, so satisfactory results cannot be returned. Query expansion can help to solve this problem. However, common query expansion does not yield stable retrieval results. Ontologies play a key role in query expansion research. A common use of ontologies in query expansion is to enrich the resources with some well-defined meaning in order to enhance the search capabilities of existing web search systems.
In [18] the authors propose and implement a query expansion method which combines a domain ontology with the frequency of terms. The ontology is used to describe domain knowledge; a logic reasoner and the frequency of terms are used to choose fitting expansion words. In this way, higher recall and precision can be obtained for the user’s queries.
In [10] the authors present an approach to expand queries that consists of searching for terms from the topic query in an ontology in order to add similar terms.
3. PRELIMINARIES
The roadmap to prove the viability of a concept-based representation of documents and queries consists of two main tasks:
- to choose a method that allows all document terms to be represented by using the same set of concepts;
- to implement an approach that allows each concept, in both documents and queries, to be indexed and evaluated with the appropriate weight.
To represent documents, the method described in Section 4 has been used, combined with the WordNet machine-readable dictionary. From the WordNet database, the set of terms that have no hyponyms has been extracted; each such term is named “base concept”. A vector, named “base vector”, has been created and a base concept has been assigned to each of its components. In this way, each term is represented by using the base vector of the WordNet ontology.
The representation described above has been implemented on top of the Apache Lucene open-source API (see http://lucene.apache.org/).
In the pre-indexing phase, each document has been converted into its ontological representation. After the calculation of the importance of each concept in a document, only concepts with a degree of importance higher than a fixed cut-value have been kept, while the others have been discarded. The cut-value used in these experiments is 0.01. This choice has a drawback, namely that discarding some minor concepts makes the representation of the information approximate. However, we have experimentally verified that this approximation does not affect the final results.
During the evaluation activity, queries have also been converted into the ontological representation. Consequently, a weight has to be assigned to each concept so that all concepts are evaluated in the right proportion. One of the features of Lucene is the possibility of assigning a payload to each term of the query. Therefore, for each element present in the concept-based representation of the query, its concept weight has been used as boost value.
4. DOCUMENT REPRESENTATION
Conventional IR approaches represent documents as vectors of term weights. Such representations use a vector with one component for every significant term that occurs in the document. This has several limitations, for example:
1. different vector positions may be allocated to the synonyms of the same term; in this way there is an information loss, because the importance of a given concept is distributed among different vector components;
2. the size of a document vector has to be at least equal to the total number of words of the language used to write the document;
3. every time a new set of terms is introduced (which is a high-probability event), all document vectors must be reconstructed; the size of a repository thus grows not only as a function of the number of documents that it contains, but also of the size of the representation vectors.
To overcome these weaknesses of term-based representations, an ontology-based representation has been used [9].
An ontology-based representation has recently been proposed in [9] which exploits the hierarchical is-a relation among concepts, i.e., the meanings of words. For example, to describe with a term-based representation documents containing the three words “animal”, “dog”, and “cat”, a vector of three elements is needed; with an ontology-based representation, since “animal” subsumes both “dog” and “cat”, it is possible to use a vector with only two elements, related to the “dog” and “cat” concepts, that can also implicitly contain the information given by the presence of the “animal” concept. Moreover, by defining an ontology base, which is a set of independent concepts that covers the whole ontology, an ontology-based representation allows the system to use fixed-size document vectors, consisting of one component per base concept.
Calculating term importance is a fundamental aspect of representing documents in conventional information retrieval approaches. It is usually determined through term frequency-inverse document frequency (TF-IDF). When using an ontology-based representation, the usual definition of term frequency cannot be applied, because one does not operate on keywords but on concepts. For this reason, the concept-based document representation proposed in [9], which is a concept-based adaptation of TF-IDF, has been adopted.
In this paper, an adaptation of the approach proposed in [9] is presented. The original approach was proposed for domain-specific ontologies and does not always consider all the possible concepts in the considered ontology, in the sense that it assumes a cut at a given specificity level. Instead, the proposed approach has been adapted for more general-purpose ontologies and it takes into account all independent concepts contained in the considered ontology. In this way, the information associated with each concept is more precise and the problem of choosing a suitable level at which to apply the cut is overcome.
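The notion of an ontology base with fixed-size document vectors can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the toy `IS_A` hierarchy, the function names, and the convention that a non-base concept shares its weight evenly among its children are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy is-a hierarchy (parent concept -> direct hyponyms).
IS_A = {"animal": ["dog", "cat"]}


def ontology_base(is_a):
    """Return the base concepts: concepts without hyponyms, in sorted order."""
    concepts = set(is_a) | {kid for kids in is_a.values() for kid in kids}
    return sorted(c for c in concepts if not is_a.get(c))


def base_vector(concept, is_a, base):
    """Express a known concept as a vector over the base concepts.

    Assumption of this sketch: a non-base concept shares its unit weight
    evenly among its children, recursively, until base concepts are reached.
    """
    weights = dict.fromkeys(base, 0.0)

    def spread(current, amount):
        kids = is_a.get(current, [])
        if not kids:
            weights[current] += amount
        else:
            for kid in kids:
                spread(kid, amount / len(kids))

    spread(concept, 1.0)
    return [weights[b] for b in base]


def document_vector(words, is_a):
    """Build a fixed-size vector with one component per base concept."""
    base = ontology_base(is_a)
    known = set(is_a) | set(base)
    total = [0.0] * len(base)
    for word, count in Counter(words).items():
        if word not in known:  # words missing from the ontology are skipped
            continue
        for i, weight in enumerate(base_vector(word, is_a, base)):
            total[i] += count * weight
    return base, total


base, vector = document_vector(["animal", "dog", "cat"], IS_A)
print(base)    # ['cat', 'dog']
print(vector)  # [1.5, 1.5]
```

In this toy case a term-based vector would need three components, while the concept-based vector needs only two, and the occurrence of “animal” is absorbed by the “cat” and “dog” components. The vector length depends only on the ontology, not on the vocabulary of the documents.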
Figure 1: Ontology representation for concept ’z’.
Figure 2: Ontology representation for concept ’y’.
The quantity of information given by the presence of concept z in a document depends on the depth of z in the ontology graph, on how many times it appears in the document, and on how many times it occurs in the whole document repository. These two frequencies also depend on the number of concepts which subsume or are subsumed by z. Let us consider a concept x which is a descendant of another concept y which has q children including x. Concept y is a descendant of a concept z which has k children including y. Concept x is a leaf of the graph representing the used ontology. For instance, considering a document containing only “xy”, the occurrence of x in the document is $1 + \frac{1}{q}$. In the document “xyz”, the occurrence of x is $1 + \frac{1}{q}\left(1 + \frac{1}{k}\right)$. As can be seen, the number of occurrences of a leaf depends on the number of children of all of its ancestors. Explicit and implicit concepts are taken into account by using the following formulas:
$$N(c) = occ(c) + \sum_{c \in Path(c,\ldots,\top)} \; \sum_{i=2}^{depth(c)} \frac{occ(c_i)}{\prod_{j=2}^{i} \|children(c_j)\|} \qquad (1)$$
where N(c) is the number of occurrences, both explicit and implicit, of concept c and occ(c) is the number of lexicalizations of c occurring in the document; as in the example above, $c_2, c_3, \ldots$ denote the successive ancestors of c on a path from c up to the root $\top$. The value N(c) is the weight associated with the concept c.
Given the ontology base $I = b_1, \ldots, b_n$, where the $b_i$ are the base concepts, the quantity of information, $info(b_i)$, pertaining to base concept $b_i$ in a document is:
$$info(b_i) = \frac{N_{doc}(b_i)}{N_{rep}(b_i)} \qquad (2)$$
where $N_{doc}(b_i)$ is the number of explicit and implicit occurrences of $b_i$ in the document, and $N_{rep}(b_i)$ is the total number of its explicit and implicit occurrences in the whole document repository. In this way, every component of the representation vector gives a value of the importance relation between a document and the relevant base concept.
A concrete example can be given starting from the light ontology represented in Figures 1 and 2, and by considering a document D1 containing concepts “xxyyyz”.
In this case the ontology base is:
I = {a, b, c, d, x}
and, for each concept in the ontology, the vectors Ndoc are:
z = (0.25, 0.25, 0.25, 0.125, 0.125)
a = (1.0, 0.0, 0.0, 0.0, 0.0)
b = (0.0, 1.0, 0.0, 0.0, 0.0)
c = (0.0, 0.0, 1.0, 0.0, 0.0)
y = (0.0, 0.0, 0.0, 0.5, 0.5)
d = (0.0, 0.0, 0.0, 1.0, 0.0)
x = (0.0, 0.0, 0.0, 0.0, 1.0),
so the document vector associated to D1 is:
$$D_1 = (2 \cdot \bar{x}) + (3 \cdot \bar{y}) + \bar{z} = (0.25, 0.25, 0.25, 1.625, 3.625) \qquad (3)$$
In Section 5, a comparison between the proposed representation and two other classic concept-based representations is discussed.
5. REPRESENTATION COMPARISON
In Section 4 the approach used to represent information was described. This section shows the improvements obtained by applying the proposed approach and compares it with two other approaches commonly used in conceptual document representation. The expansion technique is generally used to enrich the information content of queries. However, in recent years some authors have also applied the expansion technique to represent documents [2]. As in [13] [2], we propose an approach that uses WordNet to extract concepts from terms.
The two main improvements obtained by the application of the ontology-based approach are illustrated below.
Information Redundancy.
Approaches that apply the expansion of documents and queries use correlated concepts to expand the original terms of documents and queries. A problem with expansion is that information is redundant and there is no real improvement of the representation of the document (or query) content. With the proposed representation this redundancy is eliminated, because only independent concepts are taken into account to represent documents and queries. Another positive aspect is that the size of the vector representing document content by using concepts is generally smaller than the size of the vector representing document content by using terms.
An example of a technique that shows this drawback is presented in [13]. In this work the authors propose an indexing technique that takes into account WordNet synsets instead of terms. For each term in the documents, the synsets associated with that term are extracted and then used as tokens for the indexing task. In this way, the computational time needed to perform a query is not increased; however, there is a significant overlap of information because different synsets might be semantically correlated. An example is given by the terms “animal” and “pet”: these terms have two different synsets; however, observing the WordNet lattice, the term “pet” is linked by an “is-a” relation to the term “animal”. Therefore, in a scenario in which a document contains both terms, the same conceptual information is repeated. This is clear because, even if the terms “animal” and “pet” are not represented by using the same synset, they are semantically correlated because “pet” is a sub-concept of “animal”. Hence, when a document contains both terms, the presence of the term “animal” has to contribute to the importance of the concept “pet” instead of being represented by a different token.
Figure 3: Precision/recall results (precision plotted against recall for the Term-Based, Synsets, and Onto-Based representations).
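The contrast between indexing every concept as its own token and indexing only base concepts can be made concrete on the light ontology of Section 4. The sketch below is illustrative only: the hierarchy in `CHILDREN` is the one implied by the concept vectors listed in Section 4, while the function names and the recursive even-sharing routine are assumptions of this example rather than the authors' code.

```python
from collections import Counter

# Hierarchy implied by the concept vectors listed in Section 4:
# z subsumes a, b, c and y; y subsumes d and x.
CHILDREN = {"z": ["a", "b", "c", "y"], "y": ["d", "x"]}
BASE = ["a", "b", "c", "d", "x"]  # ontology base I


def token_index(concepts):
    """Synset-style indexing: every concept becomes its own token."""
    return dict(Counter(concepts))


def concept_index(concepts):
    """Base-concept indexing: occurrences of a non-base concept are shared
    evenly among its children, down to the base concepts."""
    weights = dict.fromkeys(BASE, 0.0)

    def add(concept, amount):
        kids = CHILDREN.get(concept)
        if kids is None:
            weights[concept] += amount
        else:
            for kid in kids:
                add(kid, amount / len(kids))

    for concept, count in Counter(concepts).items():
        add(concept, float(count))
    return [weights[b] for b in BASE]


d1 = list("xxyyyz")
print(token_index(d1))    # {'x': 2, 'y': 3, 'z': 1}
print(concept_index(d1))  # [0.25, 0.25, 0.25, 1.625, 3.625]
```

With one token per concept, x, y and z are stored as three separate entries even though y and z subsume x. With base concepts, the occurrences of y and z are folded into the independent components, and the result agrees with the document vector of Equation (3).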
Computational Time.
When IR approaches are applied in a real-world environment, the computational time needed to evaluate the match between documents and the submitted query has to be considered. It is known that systems using the vector space model have higher efficiency. Concept-based approaches, such as the one presented in [2], generally implement a non-vectorial data structure which needs a higher computational time with respect to a vector space model representation. The approach proposed in this paper overcomes this issue because the document content is represented by using a vector and, therefore, the computational time needed to compute the document score is comparable to the computational time needed by the vector space model.
6. EXPERIMENTS
In this section, the impact of the ontology-based document and query representation is evaluated. The evaluation method follows the TREC protocol [17]. For each query the first 1000 retrieved documents have been considered and the precision of the system has been calculated at different points: 5, 10, 15, and 30 documents retrieved. Moreover, the precision/recall graph has been calculated.
The experimental campaign has been performed by using the MuchMore collection, which consists of 7823 abstracts of medical papers and 25 queries with their relevance judgments. One of the particular features of this collection is that it contains many medical terms. As a consequence, a term-based representation has an advantage over a semantic representation, because specific terms present in documents (for example “Arthroscopic”) are very discriminant. Indeed, by using a semantic expansion some problems may occur because, generally, the machine-readable dictionary (MRD) and thesaurus used to expand terms do not contain some domain-specific terms.
The precision/recall graph shown in Figure 3 illustrates the comparison between the proposed approach (gray curve with circle marks), the classical term-based representation (black curve), and the synset representation method [13] (light gray curve with square marks). As expected, for all recall values, the proposed approach obtained better results than the term-based representation. The best gain of the concept-based representation is at recall levels 0.0, 0.2, and 0.4. For recall values between 0.6 and 1.0, the concept-based precision curve lies close to the other two curves.
A possible explanation for this scenario is that, for documents that are well related to a particular topic, the adopted ontology representation is able to improve the representation of the document contents. However, for documents that are partially related to a topic or that contain many ambiguous terms, the proposed approach is not able to maintain a high precision of the results. At the end of this section some issues that may be responsible for this behavior are discussed.
In Table 1 the three different representations are compared for the Precision@X and MAP values. The results show that the proposed approach obtains better results for all the precision levels and also for the MAP value.
Systems                P5     P10    P15    P30    MAP
Term-Based             0.544  0.480  0.405  0.273  0.449
Synset-Indexing [13]   0.648  0.484  0.403  0.309  0.459
Concept-Based          0.744  0.544  0.478  0.394  0.507
Table 1: Comparison between semantic expansion approaches.
An in-depth study of this first experimental campaign has been performed, and we have noticed that for some queries the concept-based representation obtained results that are below our expectations. By inspecting the implemented model, some issues have been noticed and are currently under analysis:
- Absence of some terms in the ontology: some terms, in particular terms related to specific domains (biomedical, mechanical, business, etc.), are not defined in the machine-readable dictionary used to define the concept-based version of the documents. In some cases this causes a loss of information that affects the final retrieval result.
- Proper names have not been considered: proper names of persons, geographical locations, industries, etc., are not present in the concept-based index. Observing the content of some documents and topics, proper names turn out to be a discriminant feature in some cases.
- Verbs and adjectives are not present in the ontology either: the concept representation of terms, described in Section 4, does not take into account verbs and adjectives.
This happens because verbs and adjectives are structured in a different way than nouns. The hyperonymy and hyponymy relations (that make an MRD comparable with ontologies) are not defined for verbs and adjectives; therefore another approach will be studied and implemented to overcome this drawback.
- Term ambiguity: the concept-based representation introduces an error because no word sense disambiguation algorithm is used. Using such a method, concepts associated with incorrect senses would be discarded or weighted less. Therefore, the concept-based representation of each word would be finer, with the consequence of representing the information contained in a document with more precision.
Improving the current model with the above features is expected to yield better results in the next experimental campaign. This positive view is motivated by the fact that, in spite of these issues, the preliminary goal of outperforming the precision of the term-based representation has been accomplished.
7. CONCLUSION
In this paper we have discussed an approach to index documents and to represent queries for information retrieval purposes which exploits a conceptual representation based on ontologies.
Experiments have been performed on the MuchMore Collection to validate the approach with respect to problems like term synonymy in documents.
Preliminary experimental results show that the proposed representation improves the ranking of the documents. Investigation of the results highlights that further improvement could be obtained by integrating WSD techniques like the one discussed in [1], to avoid the error introduced by considering incorrect word senses, and by a better usage and interpretation of WordNet, to overcome the loss of information caused by the absence of proper nouns, verbs, and adjectives.
8. REFERENCES
[1] A. Azzini, M. Dragoni, C. da Costa Pereira, and A. Tettamanzi. Evolving neural networks for word sense disambiguation. In Proc. of HIS ’08, Barcelona, Spain, September 10-12, pages 332–337, 2008.
[2] M. Baziz, M. Boughanem, G. Pasi, and H. Prade. An information retrieval driven by ontology: from query to document expansion. In D. Evans, S. Furui, and C. Soulé-Dupuy, editors, RIAO. CID, 2007.
[3] B. Billerbeck and J. Zobel. Techniques for efficient query expansion. In A. Apostolico and M. Melucci, editors, SPIRE, volume 3246 of Lecture Notes in Computer Science, pages 30–42. Springer, 2004.
[4] M. Boughanem, T. Dkaki, J. Mothe, and C. Soulé-Dupuy. Mercure at trec7. In TREC, pages 355–360, 1998.
[5] D. Cai, C. van Rijsbergen, and J. Jose. Automatic query expansion based on divergence. In CIKM, pages 419–426. ACM, 2001.
[6] S. Calegari and E. Sanchez. A fuzzy ontology-approach to improve semantic information retrieval. In F. Bobillo, P. da Costa, C. d’Amato, N. Fanizzi, F. Fung, T. Lukasiewicz, T. Martin, M. Nickles, Y. Peng, M. Pool, P. Smrz, and P. Vojtás, editors, URSW, volume 327 of CEUR Workshop Proceedings. CEUR-WS.org, 2007.
[7] P. Castells, M. Fernández, and D. Vallet. An adaptation of the vector-space model for ontology-based information retrieval. IEEE Trans. Knowl. Data Eng., 19(2):261–272, 2007.
[8] S. Cronen-Townsend, Y. Zhou, and W. Croft. A framework for selective query expansion. In D. Grossman, L. Gravano, C. Zhai, O. Herzog, and D. Evans, editors, CIKM, pages 236–237. ACM, 2004.
[9] C. da Costa Pereira and A. G. B. Tettamanzi. Soft computing in ontologies and semantic Web, chapter An ontology-based method for user model acquisition, pages 211–227. Studies in fuzziness and soft computing. Ed. Zongmin Ma, Springer, Berlin, 2006.
[10] M. Díaz-Galiano, M. G. Cumbreras, M. Martín-Valdivia, A. M. Ráez, and L. Ureña-López. Integrating mesh ontology to improve medical information retrieval. In CLEF, volume 5152 of Lecture Notes in Computer Science, pages 601–606. Springer, 2007.
[11] O. Dridi. Ontology-based information retrieval: Overview and new proposition. In O. Pastor, A. Flory, and J.-L. Cavarero, editors, RCIS, pages 421–426. IEEE, 2008.
[12] E. Efthimiadis. Query expansion. In M. Williams, editor, Annual review of information science and technology, volume 31, pages 121–187. Information Today Inc, Medford NJ, 1996.
[13] J. Gonzalo, F. Verdejo, I. Chugur, and J. Cigarrán. Indexing with wordnet synsets can improve text retrieval. CoRR, cmp-lg/9808002, 1998.
[14] T. Hattori, K. Hiramatsu, T. Okadome, B. Parsia, and E. Sirin. Ichigen-san: An ontology-based information retrieval system. In X. Zhou, J. Li, H. Shen, M. Kitsuregawa, and Y. Zhang, editors, APWeb, volume 3841 of Lecture Notes in Computer Science, pages 1197–1200. Springer, 2006.
[15] G. Salton, A. Wong, and C. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, 1975.
[16] S. Tomassen. Research on ontology-driven information retrieval. In R. Meersman, Z. Tari, and P. Herrero, editors, OTM Workshops (2), volume 4278 of Lecture Notes in Computer Science, pages 1460–1468. Springer, 2006.
[17] E. Voorhees and D. Harman. Overview of the sixth text retrieval conference (trec-6). In TREC, pages 1–24, 1997.
[18] F. Wu, G. Wu, and X. Fu. Design and implementation of ontology-based query expansion for information retrieval. In L. Xu, A. Tjoa, and S. Chaudhry, editors, CONFENIS (1), volume 254 of IFIP, pages 293–298. Springer, 2007.
[19] J. Xu and W. Croft. Query expansion using local and global document analysis. In H.-P. Frei, D. Harman, P. Schäuble, and R. Wilkinson, editors, SIGIR, pages 4–11. ACM, 1996.