=Paper=
{{Paper
|id=Vol-205/paper-6
|storemode=property
|title=LexiRes: A Tool for Exploring and Restructuring EuroWordNet for IR
|pdfUrl=https://ceur-ws.org/Vol-205/paper6.pdf
|volume=Vol-205
}}
==LexiRes: A Tool for Exploring and Restructuring EuroWordNet for IR==
LexiRes: A Tool for Exploring and Restructuring
EuroWordNet for Information Retrieval
Ernesto William De Luca and Andreas Nürnberger 1
Abstract. The problem of word sense disambiguation in lexical resources is one of the most important tasks in recognizing and disambiguating the most significant word senses of a term. Lexicographers have to decide how to structure information in order to describe the world in an objective way. However, the introduced distinctions between word meanings are very often too fine-grained for specific applications. If we want to use or even combine lexical resources within information retrieval systems, for example, we might want to apply them to disambiguate documents (retrieved from the web within an information retrieval system) given the different meanings (retrieved from lexical resources) of a search term, each with an unambiguous description. Therefore, we are usually interested in a small list of meanings with very distinctive features. Since many lexical resources, especially WordNet, frequently provide too fine-grained word sense distinctions, we implemented the tool LexiRes, which makes it possible to navigate lexical information and helps authors of available lexical resources to delete or restructure concepts using automatic merging methods.

1 Introduction

Standard keyword-based search engines retrieve documents without considering the importance of user-oriented information presentation. This means that users have to analyze every document and decide for themselves which documents are relevant with respect to the context of their search. For example, users have to navigate every document in order to recognize to which meaning of their query words the documents belong. Thus, it would strongly support a user if the context - which defines the meaning of a word - could be recognized automatically and the documents could be labelled or grouped with respect to the meaning of the respective search terms. One way to obtain a context description of different word senses is to explore lexical resources using the word we are looking for, selecting concepts based on the linguistic relations of the lexical resource that define the different word senses. Such disambiguating relations are used intuitively by humans. However, if we want to automate this process, we have to use resources - such as probabilistic language models or ontologies - that define appropriate relations. Among the most important resources available to researchers for this purpose are WordNet [4] and its variations like MultiWordNet [3] and EuroWordNet [15], as discussed in the following.

However, since many lexical resources or ontologies, especially WordNet, frequently provide too fine-grained word sense distinctions, we implemented the tool LexiRes, which makes it possible to navigate lexical information and helps authors of available lexical resources to delete or restructure concepts using automatic merging methods. The restructured information can be navigated and explored. Authors can decide whether word senses are unambiguous and important enough to be left at the same place in the hierarchy, or whether they express similar concepts and can be merged under the same (now more general) meaning.

In the following, we first briefly introduce the structure of WordNet and EuroWordNet. Then we discuss the problem of word sense disambiguation in information retrieval and problems related to WordNet in order to motivate the LexiRes system, which is then presented in Sect. 4.

2 WordNet

WordNet [4] was designed based on psycholinguistic and computational theories of human lexical memory. It provides a list of word senses for each word, organized into synonym sets (SynSets), each representing one lexicalized concept. Every element of a SynSet is uniquely identified by an identifier (SynSetID); it is unambiguous and the carrier of exactly one meaning. Furthermore, different relations link the elements of synonym sets to semantically related terms (e.g. hypernyms, hyponyms, etc.). All related terms are also represented as SynSet entries. These SynSets also contain descriptions of nouns, verbs, adjectives, and adverbs. With this information we can describe the word context. Fig. 1 represents an example of the ontology hierarchy defined by WordNet [4]. This resource can be used for text analysis, computational linguistics and many related areas.

Figure 1. Example of an ontology hierarchy for a given term A.

1 University of Magdeburg, Germany, email: deluca@iws.cs.uni-magdeburg.de
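The SynSet organization described in Sect. 2 can be illustrated with a small sketch. The following is a minimal, hypothetical model (the class and field names are our own, not WordNet's API): each SynSet carries an identifier, its member lemmas, a gloss, and hypernym links, and the superordinate concepts of a sense can be collected by walking the hypernym chain upward.

```python
from dataclasses import dataclass, field

@dataclass
class SynSet:
    synset_id: str              # unique SynSetID, carrier of exactly one meaning
    lemmas: list                # synonymous words expressing this concept
    gloss: str                  # textual description of the concept
    hypernyms: list = field(default_factory=list)  # superordinate SynSets

def hypernym_chain(s: SynSet):
    """Collect all superordinate concepts reachable from a SynSet."""
    chain = []
    todo = list(s.hypernyms)
    while todo:
        parent = todo.pop()
        chain.append(parent)
        todo.extend(parent.hypernyms)
    return chain

# Tiny example hierarchy for the financial sense of "bank" (glosses shortened,
# identifiers invented for illustration).
entity = SynSet("n#001", ["entity"], "that which exists")
institution = SynSet("n#002", ["institution"], "an organization", [entity])
bank_fin = SynSet("n#003", ["bank", "depository"], "a financial institution", [institution])

print([s.lemmas[0] for s in hypernym_chain(bank_fin)])  # → ['institution', 'entity']
```

Walking the hypernym chain in this way is also what provides the context terms used for disambiguation later in the paper.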
2.1 EuroWordNet

WordNet was first developed only for the English language. Later, versions were developed for several other languages, for example EuroWordNet [15] for several European languages (Dutch, Italian, Spanish, German, French, Czech and Estonian). Given that we want to retrieve different documents in different languages from the web, analysing different contexts, we decided to use the EuroWordNet multilingual lexical database. Its structure is the same as that of the Princeton WordNet [4] in terms of SynSets with different semantic relations between them. Each individual wordnet represents a unique language-internal system of lexicalizations. The Inter-Lingual-Index (ILI) was introduced in order to connect the wordnets of the different languages. Thus, it is possible to access the concepts (SynSets) of a word sense in different languages.

In addition to the Inter-Lingual-Index, there are also a Domain-Ontology and a Top-Concept-Ontology related to this lexical database. The shared Top-Ontology is a superordinate hierarchy of 63 semantic distinctions for the most important language-independent concepts (e.g. Artifact, Natural, Cause, Building) and is interconnected with the ILI through the WordNet offsets. Hereby a common semantic framework for all the languages is given, while language-specific properties are maintained individually. The Domain-Ontology was created for use in information retrieval settings in order to obtain specific concepts (implemented only exemplarily for computer terminology). Figure 2 gives an overview of the architecture of EuroWordNet, showing the single components and their relations to one another.

3 Word Sense Disambiguation in Information Retrieval

User studies have shown that categorized information can improve the retrieval performance for a user. Thus, interfaces providing category information are more effective than pure list interfaces for presenting and browsing information [2]. The authors of [2] evaluated the effectiveness of different interfaces for organizing search results. Users strongly preferred interfaces that provide categorized information and were 50% faster in finding information organized into categories. Similar results based on categories used by Yahoo were presented in [7].

The tool we present in this paper was developed as part of our research towards a (multilingual) retrieval system that classifies documents with respect to the search terms into unambiguous classes, so-called Sense Folders. The main idea of our approach is to provide additional disambiguating information for the documents of a result set retrieved from a search engine, in order to enable restructuring or filtering of the retrieved document result set. The use of web documents implies an online categorization of the documents given the query terms provided by the user. Thus, we can support the user in choosing the relevant information by categorizing the documents using different classification techniques. In the system presented in [8, 10], we use user- and query-specific information in order to annotate - and thus categorize - search results from other search engines or text archives connected to the meta search engine by web services. The system currently supports methods to group documents based on semantic disambiguation of query terms using an ontology that can be selected by the user. The system analyzes every search term and extracts the corresponding SynSets, that is, the sets defining the different meanings of a term and the linguistic relations from the used ontology. Based on these terms, prototypical word vectors of the disambiguating classes ("Sense Folders" [8]) are constructed. Every document is assigned to its nearest prototype (computed by using the cosine similarity), and afterwards this classification is revised by a clustering process.

Agreeing with [16], we see one document as having one sense per collocation and discourse. But unlike [16], we do not want to learn and disambiguate word senses from untagged corpora. The idea of this approach is to use ontologies in order to disambiguate query terms used in the retrieved documents [9]. Thus it is possible to categorize documents with respect to the meaning of a search term, i.e. each document is assigned to the best matching meaning ("Sense Folder") of the search terms used in it. Obviously, only one sense per document can be distinguished in this setting, which is, however, appropriate for many typical retrieval problems where only short documents are considered, as, for example, in web search.

For this annotation process we currently use WordNet (resp. EuroWordNet). However, if we analyze it, different problems have to be resolved. Very often, meanings are distinguished that are semantically very close. For example, searching for the term "bank" in an information retrieval environment, the user usually wants to know whether the retrieved documents belong to the meaning of "bank" in the sense of "furniture" or in the sense of "banking". The fine-grained linguistic differentiation between the "depository bank" meaning and the "building bank" one is very often not significant enough for selecting a relevant document.

This problem of too fine-grained description of meanings in WordNet makes the automatic categorization very difficult on the one hand, and burdens the users with a much too detailed specialization on the other. Therefore, we propose a simple pruning strategy in order to obtain a reduced set of (more expressive) concepts for the categorization approach (see Sect. 3.2). Furthermore, in the following we describe some further problems that should be tackled for a better expressiveness of WordNet.

3.1 Problems of the EuroWordNet Hierarchy

In the following we briefly examine the main semantic limitations of WordNet and describe some problems that have to be solved for its better expressiveness (see also [6, 5, 13]).

Some lexical links of WordNet should be interpreted using formal semantics in order to express "things in the world". The authors of [13] revise the Top Level of WordNet (the upper or general level), where the criteria of identity and unity are very general, in order to recognize the constraint violations occurring in it. The concepts of identity and unity are described in [13].

However, we analyze the expressiveness of every SynSet in order to better categorize the context for clustering purposes. This means that we merge categories that are in the same domain and that do not differ much from one another. This decision is based on our need for few unique classes that are carriers of an expressive meaning for a user, as well as on the goal of improved clustering performance.

An example is given in [10]. If we retrieve a word from WordNet, several meanings are assigned to the domain "Factotum", which could be described as the class "other domain, generic". The reason for this assignment is simply that the WordNet authors have to assign a domain to each SynSet. If a term cannot be categorized (by the author) into a more specific domain, the generic domain "Factotum" is used. Therefore, if we want to categorize documents with WordNet senses, we have to choose which senses are relevant and which are not, in order to obtain appropriate disambiguation results.

Figure 2. EuroWordNet Architecture (see [15]).
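The Sense Folder assignment described above (a prototype word vector per word sense, with each document assigned to the nearest prototype by cosine similarity) can be sketched as follows. This is a simplified illustration with invented vocabulary and plain bag-of-words vectors, not the implementation used in [8].

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Prototype ("Sense Folder") vectors built from the linguistic relations of
# each sense of the query term "bank" (terms invented for illustration).
sense_folders = {
    "bank/finance":   Counter("bank money deposit institution credit".split()),
    "bank/geography": Counter("bank river slope shore incline".split()),
}

def assign(document: str) -> str:
    """Assign a document to its nearest Sense Folder prototype."""
    doc_vec = Counter(document.lower().split())
    return max(sense_folders, key=lambda s: cosine(doc_vec, sense_folders[s]))

print(assign("the river bank had a steep slope"))  # → bank/geography
print(assign("the bank approved the credit"))      # → bank/finance
```

In the full system, this first assignment is then revised by a clustering step over the document set, as described above.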
However, if we maintain all senses that are labelled with "Factotum", in many cases we have to distinguish between only slightly different contexts defined by different SynSets. One possibility for identifying terms that have a very similar meaning is to analyze their hyponyms or hypernyms: if two senses described in WordNet belong to the same domain, they often have the same hyponyms or hypernyms. This frequently causes disambiguation problems that cannot be solved if we keep all classes. For this reason, we decided to exclude some "Factotum" SynSets that are irrelevant for the context disambiguation process.

Another critical point is the confusion between concepts and instances, resulting in an "expressivity lack" [5]. For example, if we look for the hyponyms of "mountain" in WordNet, we will find "Olympus mount" as a subsumed concept of the word, treated as a "volcano" and not as an instance of it. Thus, we do not have a clear differentiation between what we use to describe (concepts) and their instantiation (instances). We also have the problem that we cannot use only concepts or only instances, because there is no intended separation between them in WordNet.

The authors of [12] also treat the important difference between endurance and perdurance of the entities that should be included in WordNet. Enduring and perduring entities are distinguished by their behaviour in time. Endurants are always wholly present at any time they are present. Perdurants are only partially present, in the sense that some of their proper parts (e.g., their previous phases) may not be present. However, these aspects of instances are not discussed in this paper, since they seem to be of less importance for the considered disambiguation problem.

When we deal with EuroWordNet, these problems persist, and other problems come along. The problem of automatically finding multilingual translations of word senses across languages can be solved using such a resource. The Inter-Lingual-Index helps for this purpose, but the coverage of language-dependent word senses varies from language to language: the number of SynSets ranges from about 20,000 (German) to 150,000 (English). Using this lexical resource, we have to take into account the missing (or incomplete) translations contained in the lexical resource, apart from the lexical gaps (word senses that exist in one language and not in another).

3.2 Merging the EuroWordNet SynSets

One possible way to tackle some of the problems described above is to merge SynSets manually, when the author decides that they belong together. Another possibility is to use methods that restructure EuroWordNet by merging SynSets that have a very similar meaning. Therefore, we studied methods to automatically merge SynSets based on the analysis of the linguistic relations defined in EuroWordNet.

We implemented four online methods to merge SynSets based on relations like hypernyms and hyponyms, and on further context information like glosses and domains. The first merging approach is based on context information extracted from the hypernymy relation (superordinate words) in order to define the Sense Folders. This means that we first build word vectors for every word sense (Sense Folder), containing the whole hypernymy hierarchy related to the query word. Then we compare all Sense Folders with one another and merge them when the similarity exceeds a given threshold (i.e., when their word vectors are sufficiently close to each other). A similar approach is applied for the hyponyms (subordinate words). In the third approach, we merge the Sense Folders if their linguistic relations and context information (glosses) are similar. The fourth approach exploits the domain concept of MultiWordNet [3]. Here we merge the Sense Folders only if they belong to the same domain (having exactly the same domain description).

An evaluation of these methods was done on a small corpus of 252 documents retrieved from web searches that had been manually annotated. Hereby, we compared the manually annotated classes with the Sense Folders assigned using the approach described in [8] together with the implemented merging functions. Based on this first evaluation, the hypernym approach seemed to merge well those Sense Folders that had similar hypernyms, which might even be labeled with different domain descriptions. However, a better classification was
obtained for words that had fewer meanings (SynSets) before merging started. The second approach, based on hyponyms, almost never merged SynSets, due to the usually very different hyponyms assigned to each sense. Using the third approach, a lattice was built between the merged Sense Folders. This approach merges SynSets that do not have the same hypernyms, but share similar words taken from the descriptions of all relations and words together. With the fourth approach we are sure to merge Sense Folders that belong to the same context while describing it in different ways. Here the classification was always the best, but the Factotum problem discussed in Sect. 3.1 persisted: if such a merged class contains very different meanings and is used for classification, the classification is worse than before. The possibility of excluding such classes (labeled with the "Factotum" domain) will be studied in future work, e.g. by analyzing approaches that exploit combined information from the first three merging methods. For details of the evaluation see [11].

4 The lexical restructuring tool (LexiRes)

The main idea of this tool is to give authors the possibility to navigate the ontology hierarchy in order to restructure it, by merging manually or by using the merging functions described in Section 3.2.

4.1 Related Work

Different work has already been done using the variants of WordNet. The authors of [1] developed VisDic for browsing and editing multilingual information taken from EuroWordNet. Here users can browse static information in text blocks.

Another web interface for multilingual information browsing is presented in [14]. Here a parallel corpus annotated with MultiWordNet [3] can be browsed, as well as the words with their related annotated word senses, but the corpus is very restricted. All accessible information is static. This interface is used only for a bilingual search in a closed domain.

Other work dealing with lexicography has shown that researchers in this area mostly deal with multilingual lexical resources or corpora only, without the possibility of merging similar word senses.

Given that the EuroWordNet format is defined by the EuroWordNet Database Editor Polaris, which uses a proprietary specification, we first converted the EuroWordNet database into an XML format in order to access it with standard XML query tools. In order to retrieve information from this resource, we use the eXist open source native XML database.

4.2 The tool

In order to use the LexiRes tool, we have to load an ontology into its scratch framework. The tool currently supports the EuroWordNet structure, but can easily be extended for different ontologies. Considering that we use a multilingual lexical resource, we provide the possibility to define the language we want to work with and the linguistic relations we want to show for recognizing the query word in the context menu. After these have been set, the hierarchy is displayed.

Figure 3 shows a screenshot of the LexiRes editor. On the left side, we can enter the query words. On the right side, we can choose which collection we want to retrieve and which language we want to use as a source language. Looking for "bank" in the English language, the ontology engine retrieves 19 meanings. These meanings describe the different word senses. Every word sense is represented as a SynSet, and we can apply different actions to these SynSets. Some meanings that belong to the same domain, such as the two "bank" SynSets under the superordinate "incline" SynSet, could be merged. If authors decide that the description of these SynSets is too fine-grained, they can choose to merge the "source" SynSets into a "target". The goal is to obtain only word senses describing contexts as unambiguously as possible. Based on the merging, a new SynSet is created to which all relations of the original SynSets are assigned. Authors can also decide that a SynSet should not be a carrier of meaning for the intended application of the ontology; such a SynSet can be removed simply by clicking on it and choosing to remove it.

Figure 3. Example of the word "bank" - manual merging functions - in the LexiRes Editor.

The linguistic relations, as well as the properties of every SynSet, can be shown simply by picking the corresponding fields. These are first set via the check boxes under the "show relations" area; if the author activates the check boxes, the linguistic relations related to the selected SynSet are shown. The author can choose to "show properties" or "hide properties" with a right mouse click on a SynSet; here all SynSet-related information is shown. The original XML code of the SynSet can also be displayed by clicking the right mouse button and choosing the "show XML" option. The properties and the XML code are shown at the bottom right of the interface under "Details".

The SynSets can also be automatically retrieved and translated into the different languages available in the ontology (see Figure 4). These languages are set via the language menu button and the translations can be shown, always SynSet-dependent, with a click. We can notice that not all SynSets have a translation, due to missing entries in the lexical resource.

Figure 4. Example of the word "bank" - SynSet translations - in the LexiRes Editor.

As we said before, the tool gives the possibility to manually merge SynSets when the authors decide that two SynSets belong to the same meaning and/or describe the same concept. The author working with LexiRes can also use an automatically created list of candidate SynSets that can be merged. This list can be created with the approaches discussed in Sect. 3.2. The system proposes the list of changes, and the user can select to accept all of them or check each merging proposal manually. At the moment these merging methods are implemented outside the tool: the resulting list of possible SynSet merges is first examined by the authors, and the merging is then done manually. After the ontology hierarchy has been restructured, a new set of SynSets is created. This set is supposed to contain only word senses that are carriers of a distinctive meaning in the context of the considered application. This is a very important step for the use of lexical resources in information retrieval. The possibility to merge SynSets in advance gives the advantage of categorizing the retrieved documents by disambiguating them with structured word senses that facilitate an automatic classification process [8]. A detailed description of the evaluation of the automatic merging methods applied to the WordNet SynSets is given in [11].

5 Conclusions

In this paper we motivated and presented LexiRes, a tool to help lexicographers in exploring available lexical resources and in navigating and restructuring them, especially for use in information retrieval frameworks. Furthermore, we have discussed how lexical resources, here EuroWordNet, can be used to disambiguate documents (retrieved from the web within an information retrieval system) given different meanings (retrieved from lexical resources). After having discussed the problems related to the EuroWordNet structure, we presented the functionality of our tool. Using LexiRes we obtain a hierarchical, word-specific overview that gives the possibility to restructure concepts using automatic or manual merging methods. These methods are important for obtaining a lexical resource that is more appropriate for disambiguating user query words in documents retrieved from an information retrieval system.
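The threshold-based merging that underlies these methods (Sect. 3.2: compare Sense Folder word vectors pairwise, merge when the similarity exceeds a threshold) can be sketched as follows. The vectors, the threshold value, and the helper names are illustrative assumptions, not those of the LexiRes implementation.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def merge_senses(folders: dict, threshold: float = 0.6) -> dict:
    """Greedily merge Sense Folder vectors whose cosine similarity exceeds
    the threshold; a merged folder accumulates the word vectors it absorbs."""
    merged = {}
    for name, vec in folders.items():
        for target in merged:
            if cosine(vec, merged[target]) > threshold:
                merged[target] += vec        # fold this sense into an existing one
                break
        else:
            merged[name] = Counter(vec)      # keep as a new, distinct sense
    return merged

# Hypernym-based vectors for three senses of "bank" (invented for illustration).
folders = {
    "bank-1": Counter("institution organization finance".split()),
    "bank-2": Counter("institution organization building".split()),
    "bank-3": Counter("slope incline geological formation".split()),
}
result = merge_senses(folders)
print(sorted(result))  # → ['bank-1', 'bank-3']  (bank-2 merged into bank-1)
```

The hypernym- and hyponym-based approaches differ only in which relation supplies the words for the vectors; the domain-based approach replaces the similarity test with a check for identical domain labels.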
REFERENCES
[1] A. Horák and P. Smrž, ‘VisDic - WordNet browsing and editing tool’,
in Proceedings of the Second International WordNet Conference
(GWC2004), (2004).
[2] Susan T. Dumais, Edward Cutrell, and Hao Chen, ‘Optimizing search
by showing results in context’, in CHI, pp. 277–284, (2001).
[3] E. Pianta, L. Bentivogli, and C. Girardi, ‘MultiWordNet: developing an
aligned multilingual database’, in First International Conference on
Global WordNet, Mysore, India, (2002).
[4] G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. Miller, ‘Five papers
on WordNet’, International Journal of Lexicography, 3(4), (1990).
[5] Aldo Gangemi, Nicola Guarino, and Alessandro Oltramari, ‘Concep-
tual analysis of lexical taxonomies: the case of wordnet top-level’,
in FOIS ’01: Proceedings of the international conference on Formal
Ontology in Information Systems, pp. 285–296, New York, NY, USA,
(2001). ACM Press.
[6] N. Guarino and C. A. Welty, ‘An overview of OntoClean’, 151–172,
Handbook on Ontologies, Springer, 2004.
[7] Yannis Labrou and Timothy W. Finin, ‘Yahoo! as an ontology: Us-
ing yahoo! categories to describe documents’, in CIKM, pp. 180–187,
(1999).
[8] Ernesto William De Luca and Andreas Nürnberger, ‘Improving
ontology-based sense folder classification of document collections with
clustering methods’, in Proc. of 2nd Int. Workshop on Adaptive Multi-
media Retrieval (AMR 2004), part of ECAI 2004, eds., Philippe Joly,
Marcin Detyniecki, and Andreas Nürnberger, (2004).
[9] Ernesto William De Luca and Andreas Nürnberger, ‘Ontology-based
semantic online classification of documents: Supporting users in
searching the web’, in Proc. of the European Symposium on Intelligent
Technologies (EUNITE 2004), (2004).
[10] Ernesto William De Luca and Andreas Nürnberger, ‘Supporting mo-
bile web search by ontology-based categorization’, in Sprachtechnolo-
gie, mobile Kommunikation und linguistische Ressourcen, Proc. of
GLDV 2005, eds., Bernhard Fisseni, Hans-Christian Schmitz, Bernhard
Schröder, and Petra Wagner, pp. 28–41, (2005).
[11] Ernesto William De Luca and Andreas Nürnberger, ‘The use of lex-
ical resources for sense folder disambiguation.’, in Workshop Lexical
Semantic Resources (DGfS-06), Bielefeld, Germany., (2006).
[12] E. Motta, S. Buckingham Shum, and J. Domingue, ‘Ontology-driven docu-
ment enrichment: Principles and case studies’, 1999.
[13] A. Oltramari, A. Gangemi, N. Guarino, and C. Masolo, ‘Restructuring
WordNet's top-level: The OntoClean approach’.
[14] M. Ranieri, E. Pianta, and L. Bentivogli, ‘Browsing multilingual in-
formation with the MultiSemCor web interface’, in Proceedings of the
LREC 2004 Satellite Workshop on The Amazing Utility of Parallel and
Comparable Corpora, pp. 38–41, Portugal, (2004).
[15] P. Vossen, ‘EuroWordNet general document’.
[16] David Yarowsky, ‘Unsupervised word sense disambiguation rivaling
supervised methods’, in Meeting of the Association for Computational
Linguistics, pp. 189–196, (1995).