<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Alexandre Termier, Marie-Christine Rousset, Michèle Sebag</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>LRI, CNRS UMR 8623, Bât. 490, Université de Paris-Sud, 91405 Orsay Cedex</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>A new approach for constructing pseudo-keywords, referred to as Sense Units, is proposed. Sense Units are obtained by a word clustering process, where the underlying similarity reflects both statistical and semantic properties, respectively detected through Latent Semantic Analysis and WordNet. Sense Units are used to recode documents and are evaluated from the performance increase they permit in classification tasks. Experimental results show that accounting for semantic information in fact decreases performance compared to standalone LSI. The main weaknesses of the current hybrid scheme are discussed and several tracks for improvement are sketched.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper focuses on document description and clustering.
Learning and mining techniques meet particular difficulties
when dealing with textual information [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. These
difficulties are related to the structured nature of texts (grammar),
which requires advanced techniques to be accounted for;
unfortunately, such techniques (e.g. syntactic analysers) are not
yet as robust as desirable, and entail a non-negligible amount
of noise. This is the reason why so many efficient approaches
(see [3; 13] among many others) actually rely only on the
bag-of-words representation, even if this representation does not
capture the whole semantic content of a corpus (text set).
      </p>
      <p>
        Canonical bag-of-words representations present several
characteristics that adversely affect statistical approaches
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. One is the huge number of attributes (number of words
in the corpus, or dictionary size), and the fact that any text
actually uses a small fraction of the dictionary. In other words,
a text is a vector in a high-dimension space (each dictionary
word corresponds to a dimension), and most of its
components are equal to zero. Furthermore, a single dimension
(word) might correspond to more than one semantic notion,
due to polysemy; and conversely, distinct dimensions might
correspond to the same notion (synonymy).
      </p>
      <p>
        An important research topic is thus to design new and
better text descriptions (using word-windows [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or
syntactic analysis [5; 15]), such that semantically relevant patterns
would correspond to statistically emergent ones, and vice
versa. These approaches, which will be discussed in more
detail in Section 5, proceed by specializing the texts, using
adjacency relations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or syntactical taggers [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Such a
specialization hopefully alleviates the polysemy effects. Still,
it offers no remedy regarding the synonymy effects, and the
resulting sparseness of the text distribution.
      </p>
      <p>In this paper, we present a new approach for text
description and clustering, termed Semistics for SEMantico–
statISTICal System. Semistics involves the automatic
construction of pseudo-keywords, which are bags of words
referred to as Sense Units. Ideally, Sense Units allow for a
synonymy- and polysemy-free description of documents;
furthermore, the number of SUs is controlled by the user in order to
guarantee the scalability of the approach.</p>
      <p>Sense Units are word clusters constructed by a preliminary
clustering stage operating at the word level, using a standard
distance-based clustering algorithm (Hierarchical
Agglomerative Clustering). The novelty lies in the similarity employed,
which combines statistical and semantic information. The
statistical ingredient is borrowed from Latent Semantic
Indexing [1] while the semantic one is provided by WordNet
[7].</p>
      <p>
        LSI achieves a statistical compression of the data based on
a Singular Value Decomposition technique. This allows LSI
to detect connections between words even though they never
co-occur in a document, as opposed to window-based
approaches [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] : words employed in a same context are found
similar even though they are not employed together
(Harrisian hypothesis).
      </p>
      <p>WordNet is a publicly available linguistic resource
providing a thesaurus which organizes English words into sets of
synonyms, termed synsets. A synset groups all words with the
same sense. Polysemy is accounted for by the fact that a
word can appear in several synsets. In summary, WordNet
can be viewed as a source of general domain knowledge about
words.</p>
      <p>Even though LSI is reasonably good at guessing synonyms
or disambiguating words, there is no doubt that it is
outperformed in this respect by WordNet.</p>
      <p>
        The data undergoes six stages, the central ones being:
2. The statistical reduction of the data through LSI, made
visible through a numerical word similarity ℓw(word, word).
3. The definition of synsets (sets of synonyms) through
WordNet, and the creation of a numerical synset
similarity ℓs(synset, synset).
4. The creation of Sense Units, which are clusters of
synsets. The synset clustering is a simple Hierarchical
Agglomerative Clustering [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] based on similarity ℓs,
and involving a specific stop criterion.
5. The redescription of all documents as vectors over the
Sense Unit set, together with a cosine-based document
distance ℓd.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Latent Semantic Indexing</title>
      <p>The input of the system is a collection of documents,
viewed as bags of words (subsets of the word set W). It is
worth keeping in mind that these words and documents are
not necessarily “natural”, in the sense that they might be
produced by another text mining tool (this will be detailed in
Section 3).</p>
      <p>On the other hand, LSI sees each
document in the perspective of the corpus, limited to the
application domain and vocabulary. In summary, WordNet
provides a very general domain knowledge about words, while
LSI constructs a specific, corpus-driven knowledge about
words, expressed as a semi-distance.</p>
      <p>The paper investigates how to combine both sources of
knowledge in order to create an accurate and yet
understandable description of the corpus, the Sense Units. The Sense
Units are intended both to sustain an efficient distance-based
clustering process, and to provide the user with many and
simple opportunities to inspect the results and include extra
knowledge.</p>
      <p>Sense Units ideally correspond to the nodes in an ontology.</p>
      <p>The difference is that an ontology is structured according to
logical relations (is-a, part-of relations), while Sense Units
are constructed together with a similarity function, i.e. they
are structured in a topological sense.</p>
      <p>
        The paper is organized as follows. Section 2 provides an
overview of Semistics ; it details the construction of Sense
Units and how these are used to redescribe the texts. Section
3 gives the experiment goal and setting. Section 4 reports on
the Semistics results obtained on the well-studied benchmark
Reuters [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and a real-world application concerned with the
clustering of XML Document Type Definitions (DTD) [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>Section 5 briefly reviews and discusses some related work,
and the paper ends with some perspectives for further
research.</p>
      <p>Overview of the system. The first stage is the cleaning and
downsizing of the data: deterministic and stochastic filters
are used, in order to respectively remove poorly meaningful
words and keep the problem size under control. The last stage
is the evaluation of the distance ℓd, performed as detailed
in Section 3.</p>
      <p>
        LSI differs from Principal Component Analysis [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] in
two respects. First of all, LSI operates on the documents ×
words matrix M, whereas PCA considers the word covariance
matrix. Second, PCA cancels all but a few eigenvalues,
whereas LSI retains a significantly larger number of them.
      </p>
      <p>
        The similarity ℓs between synsets is defined as the average
similarity of the words they contain.
      </p>
      <p>
        PCA cancels all but a few (d = 2 or 3) eigenvalues, which
allows for mapping the data in a 2- or 3-D space, enabling a
visual detection of the word clusters.
      </p>
      <p>On the contrary, LSI retains a significantly higher number
of eigenvalues (d = 100 in the following), so that any visual
inspection is forbidden. However, matrix M' gives
an extended and saturated description of the documents; the
contribution M'ij of word wj to document di is raised if wj
co-occurs frequently with the very words in di, even though
wj was not actually present in di. In this respect, M' can be
viewed as a smooth "transitive closure" of the initial
description M.</p>
      <p>This saturation effect might explain the fact that
euclidean-based approaches are more robust when applied on M'
instead of M [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Consider the descriptions of words wi and wj
according to M', given as the ith and jth columns of M',
further referred to as wM'i and wM'j. LSI thus induces a
similarity ℓw between words, defined as the cosine of wM'i
and wM'j.</p>
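      <p>As an illustration of the LSI machinery described above, the following toy sketch (not the authors' code; the corpus, the value d = 2, and all names are ours) builds M, computes its rank-d approximation M' through SVD, and derives the word similarity ℓw as a column cosine.</p>

```python
import numpy as np

# Toy documents x words matrix M (bag-of-words counts).
docs = [["stock", "market", "share"],
        ["market", "share", "price"],
        ["gene", "protein", "cell"],
        ["protein", "cell", "biology"]]
vocab = sorted({w for doc in docs for w in doc})
M = np.array([[doc.count(w) for w in vocab] for doc in docs], dtype=float)

d = 2  # number of retained singular values (the paper uses d = 100)
U, S, Vt = np.linalg.svd(M, full_matrices=False)
M_prime = U[:, :d] @ np.diag(S[:d]) @ Vt[:d, :]  # rank-d approximation M'

def l_w(w1, w2):
    """Word similarity: cosine of the columns of M' associated to w1 and w2."""
    a = M_prime[:, vocab.index(w1)]
    b = M_prime[:, vocab.index(w2)]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

      <p>On this toy corpus, "stock" and "price" never co-occur in a document, yet their ℓw similarity is close to 1, illustrating the Harrisian point made above.</p>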
      <p>A dual similarity is likewise defined between documents,
enabling the use of any distance-based clustering algorithm.</p>
      <p>Experimentally, it is observed that ℓw performs better (in
a word disambiguation context) than the cosine similarity
based on the initial matrix M.</p>
      <p>Notably, LSI is highly scalable with respect to the number
of documents and words considered, due to sophisticated
decomposition methods exploiting the sparsity of M (e.g.
applications in the TREC context have considered up to several
gigabytes of data; our database is about 3 MB large).</p>
      <p>
        The success of LSI on several text mining tasks, e.g. word
disambiguation or essay rating [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ], confirms indeed that
(restricted-scope) semantic information can be extracted from
statistical estimates.
      </p>
      <p>
However, it is worth noting that sources of partial
semantic information are commonly available. The resource we
use in the following is WordNet, an electronic lexical
database enriched with conceptual-semantic relations
(linking concepts) and lexical relations (linking individual
words), which is publicly available [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The use of WordNet for text
mining has been investigated in several respects, e.g.
supporting text retrieval through query expansion [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], or achieving
sophisticated spell checking through word sense
disambiguation [9].
      </p>
      <p>
        It seems worth combining the complementary knowledge
conveyed by statistical estimates and WordNet semantic
relations. The question is how. A previous approach uses a
tagged corpus to enrich WordNet relations with distributions
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Our approach does not require the preliminary and tedious
labelling of the word senses in the corpus. Rather, Semistics
looks for all senses, according to WordNet, associated to any
word in the word set W. To each such word sense w:s it
associates a synset Sw:s, defined as the set of all words w'
which are synonymous to w:s; it is further required that each
such w' co-occurs with w in at least one document of the
corpus (e.g. Swork:4 = {study, work, learning, acquisition}).
Note that Sw:s might be reduced to {w}; this typically
happens when w is not recognized by WordNet (e.g. company
names).</p>
      <p>Similarity ℓs is exploited through a standard bottom-up
clustering algorithm, namely Hierarchical Agglomerative
Clustering (HAC). HAC starts with a set of singleton clusters, each
one containing exactly one synset. At each step, the two most
similar clusters are merged into a single one. The similarity
of two clusters is the average similarity of the synsets they
contain.</p>
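      <p>The synset construction can be sketched as follows (a toy illustration: a small hand-made sense dictionary stands in for WordNet, and all names are hypothetical).</p>

```python
# Hypothetical WordNet-like sense dictionary: word -> sense id -> synonyms.
senses = {
    "work": {4: {"study", "work", "learning", "acquisition", "accomplishment"}},
    "bank": {1: {"bank", "depository"}},
}
# Toy corpus: each document is a bag (set) of words.
docs = [{"study", "work", "learning"}, {"work", "acquisition"}, {"bank", "money"}]

def build_synset(word, sense_id):
    """S_{w:s}: the synonyms of sense w:s that co-occur with w in some document."""
    synonyms = senses[word][sense_id]
    cooccurring = set().union(*(d for d in docs if word in d))
    return {w2 for w2 in synonyms if w2 in cooccurring} | {word}
```

      <p>On this toy corpus, build_synset("work", 4) returns {study, work, learning, acquisition}, matching the Swork:4 example ("accomplishment" is dropped since it never co-occurs with "work"), while build_synset("bank", 1) reduces to {bank}.</p>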
      <p>HAC produces a partition of the synsets into disjoint
clusters, which strongly depends on the termination criterion. A
first possibility is to set the desired number n of clusters, such
that HAC stops after performing s - n merging steps, with s
denoting the number of synsets. Another possibility is to set
a minimal similarity threshold, such that HAC stops when the
current best similarity is lower than the threshold.</p>
      <p>However, these stop criteria hardly cope with the
varying granularity of natural language concepts; the similarity
threshold should typically depend on the local density of the
concepts. We therefore propose an adaptive criterion, based
on controlling the cluster coverage and its growth. Let the
coverage of a cluster C be the number of documents in the
corpus containing at least one word belonging to some synset
in C. Merging clusters C and C' is said to be admissible iff
their relative overlap is above a prescribed threshold,
referred to as growth limit. Let U denote the set of sense
units, the size of which is u; note that u is indirectly
controlled through the growth limit.</p>
      <p>The idea is that, if the coverage of the cluster abruptly
grows, the underlying concept is becoming exceedingly
general.</p>
      <p>Finally, at each step, HAC merges the most similar clusters
such that their merge is admissible, until no more merge is
admissible. The complexity is cubic in the number of synsets
and linear in the number of documents.</p>
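      <p>A minimal sketch of this clustering loop follows (an illustration only; the exact admissibility formula is not recoverable from the text, so a simple coverage-ratio test against the growth limit is assumed here).</p>

```python
def coverage(cluster, docs):
    """Number of documents containing at least one word of the cluster."""
    words = set().union(*cluster)
    return sum(1 for d in docs if not words.isdisjoint(d))

def hac(synsets, sim, docs, growth_limit=0.7):
    clusters = [{s} for s in synsets]  # start from singleton clusters
    def cluster_sim(c1, c2):  # average pairwise synset similarity
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
    def admissible(c1, c2):  # the coverage must not grow abruptly
        top = max(coverage(c1, docs), coverage(c2, docs))
        return top / max(1, coverage(c1 | c2, docs)) >= growth_limit
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if admissible(clusters[i], clusters[j]):
                    s = cluster_sim(clusters[i], clusters[j])
                    if best is None or s > best[0]:
                        best = (s, i, j)
        if best is None:          # no admissible merge is left
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]

# Toy data: with a strict growth limit, the "car" synsets merge together,
# the unrelated singletons merge together, and the two groups stay apart.
docs = [{"car", "auto"}, {"car", "vehicle"}, {"dog", "cat"}]
synsets = [frozenset({"car", "auto"}), frozenset({"auto", "vehicle"}),
           frozenset({"dog"}), frozenset({"cat"})]
jaccard = lambda a, b: len(a.intersection(b)) / len(a.union(b))
clusters = hac(synsets, jaccard, docs, growth_limit=0.9)  # two clusters remain
```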
      <p>Each cluster so constructed is a set of words labelled
with their sense, referred to as a Sense Unit. An example of
sense unit constructed from the Reuters corpus (Section 4) is
{protests, protestings, leftist}. Sense units containing a
single word are filtered out. Experimentally, the number of
sense units is lower than the number of synsets by two orders
of magnitude.</p>
      <p>ℓs(S1, S2) = average of ℓw(w1, w2) over w1 ∈ S1, w2 ∈ S2.</p>
      <p>Only synsets are considered thereafter. Interestingly,
though individual words usually belong to many synsets due
to polysemy, the number of synsets is close to the number of
words due to construction requirements.</p>
    </sec>
    <sec id="sec-3">
      <title>2.3 Constructing Sense Units</title>
    </sec>
    <sec id="sec-4">
      <title>2.4 Coupling LSI and WordNet</title>
    </sec>
    <sec id="sec-5">
      <title>2.5 Document Redescription and Clustering</title>
      <p>Each document is redescribed with respect to the sense units,
and mapped onto IR^u. The contribution of sense unit Uj to
document di, noted M''ij, is nonzero iff at least one word of
Uj occurs in di; it is then weighted by the frequency F(Uj)
of the sense unit (the number of documents containing at
least one word of Uj), and set to 0 otherwise. From the
mapping of documents onto the metric space IR^u we derive a
document similarity noted ℓd, given as the cosine of the
M'' vectors.</p>
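      <p>The redescription step and ℓd can be sketched as follows (an illustration; the exact weighting of M''ij is not fully recoverable from the text, so a simple 1/F(Uj) weight is assumed, and the toy corpus and Sense Units are ours).</p>

```python
import math

# Toy corpus and Sense Units (hypothetical data).
docs = [{"stock", "market"}, {"market", "price"}, {"gene", "cell"}]
sense_units = [{"stock", "market", "price"}, {"gene", "cell", "protein"}]

def frequency(unit):
    """F(Uj): number of documents containing at least one word of Uj."""
    return sum(1 for d in docs if not unit.isdisjoint(d))

def redescribe(doc):
    """Map a document onto IR^u: one coordinate per Sense Unit."""
    return [(1.0 / frequency(u)) if not u.isdisjoint(doc) else 0.0
            for u in sense_units]

def l_d(d1, d2):
    """Document similarity: cosine of the redescribed vectors."""
    v1, v2 = redescribe(d1), redescribe(d2)
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = math.sqrt(sum(a * a for a in v1))
    n2 = math.sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 > 0 and n2 > 0 else 0.0
```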
    </sec>
    <sec id="sec-6">
      <title>3.1 Experiment goal</title>
      <p>We compare three (re-)descriptions of the corpus. The first
description simply involves the initial words in the
documents. The second description, built from the first one, is
based on LSI: the “descriptors” of the documents are made
of LSI eigenvectors (implicitly derived from the LSI
eigenvalues). The third description, which is used by Semistics ,
relies on the Sense Units. These descriptions are materialized
respectively by matrices M, M0 and M00.</p>
      <p>Each description is processed by the same similarity-based
clustering algorithm (HAC): the document similarity is the
standard cosine of the row vectors of matrix M, M', or M''.</p>
      <p>Evaluating a description ultimately amounts to evaluating the
relevance of the provided clusters, and the flexibility of the
re-description/clustering process (through diverse parameters
such as the word sampling rate, the number of LSI eigenvalues
retained, and the growth limit in Semistics).</p>
      <p>A first and disappointing observation is that Sense Units
appear to be less appropriate than the initial words for
classifying documents. This might be analysed along two
directions.</p>
      <p>A first point simply regards the amount of information
conveyed by the description; the number of SUs appears too
restricted to support a sufficiently detailed description. This is
confirmed by the fact that adding 200 SUs improves all
results by about 5%.</p>
      <p>A second point regards the quality of the Sense Units
themselves.</p>
      <p>The baseline experiment considers all 40,000 words in the
corpus.</p>
      <p>The LSI-baseline involves 100 eigenvectors, built on the same
40,000 words.</p>
      <p>In Semistics, a first sampling is performed on the corpus,
retaining 4,000 of the 40,000 words. LSI is applied on
the sampled description, and finally the Sense Units are built
by combining the LSI and Wordnet techniques. The growth
limit is set to 0.7.</p>
      <p>Due to this sampling step, Semistics might be considered
as a randomized algorithm, and experimental results should
therefore be averaged over a number of independent runs.</p>
      <p>Unfortunately, due to the total computation time needed,
results presented in the following will be based on a single
run.</p>
      <p>It is observed that Semistics finally produces 634 Sense Units.</p>
      <p>As this might be insufficient to carry all the corpus
information, in other experiments (noted Semistics+) we consider
an extended set of Sense Units, completed with the 200 most
frequent synsets which were left apart by the HAC.</p>
      <p>Table 1 displays the predictive accuracy of all compared
approaches. The first column is the word baseline, the second
column the LSI baseline, the third column corresponds to
Semistics, and the fourth to Semistics+.</p>
      <p>These results indeed show that much care must be
exercised when combining statistical and semantic information.</p>
      <p>Semistics is clearly outperformed by both the initial
description and LSI. The reason for such a failure remains to
be explained.</p>
    </sec>
    <sec id="sec-7">
      <title>Experiment goal and setting</title>
      <p>This section details the questions the experiments should
address, the performance criteria, and our experimental
setting.</p>
    </sec>
    <sec id="sec-8">
      <title>3.2 Criteria</title>
      <p>
        The difficulties of evaluating a clustering process have long
been discussed [18; 22]. Like many authors [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], we finally
retain the classification predictive accuracy, derived from the
1-nearest and the 20-nearest neighbor classifier using the
considered similarity.
      </p>
      <p>So we will not evaluate the quality of the clusters produced,
but rather the quality of the similarity measure leading to those
clusters.</p>
      <p>
        The considered data is a subset of the Reuters corpus,
where the document class is given as the value of the attached
field Topics. Following [
        <xref ref-type="bibr" rid="ref22">22</xref>
], documents whose Topics field is not
filled in are rejected; furthermore, we also reject documents
attached to several Topics.
      </p>
      <p>The number of documents is 8,842, partitioned in 135
disjoint classes, and involving about 40,000 words.</p>
      <p>The quality of a description is finally estimated from the
predictive accuracy of the 1-nearest neighbor (or 20-nearest
neighbor) classifier. We have two different ways to produce a
measure:
- A standard leave-one-out test process: each document is
taken as correctly classified (legend %OK) iff its nearest
neighbor (or the majority of its 20 nearest neighbors) is
labelled with the same Topics;
- A less demanding evaluation of the description quality,
obtained by considering that a document is reasonably
classified (legend %OK-rel) if its nearest neighbor (or
the majority of its 20 nearest neighbors) falls in the same
category for at least one of the six main fields qualifying
the documents (Location, People, Orgs, ...).</p>
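      <p>The leave-one-out %OK measure described above can be sketched as follows (a toy illustration with hypothetical documents and Topics labels; any similarity function can be plugged in).</p>

```python
# Leave-one-out 1-nearest-neighbor accuracy (%OK) for a given similarity.
def loo_1nn_accuracy(items, labels, sim):
    correct = 0
    for i in range(len(items)):
        # nearest neighbor of item i among all the other items
        nn = max((j for j in range(len(items)) if j != i),
                 key=lambda j: sim(items[i], items[j]))
        if labels[nn] == labels[i]:
            correct += 1
    return correct / len(items)

# Toy corpus with hypothetical Topics labels.
docs = [{"wheat", "grain"}, {"grain", "crop"}, {"oil", "barrel"}, {"oil", "crude"}]
topics = ["grain", "grain", "oil", "oil"]
jaccard = lambda a, b: len(a.intersection(b)) / len(a.union(b))
accuracy = loo_1nn_accuracy(docs, topics, jaccard)  # 1.0 on this toy corpus
```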
    </sec>
    <sec id="sec-9">
      <title>Experiment setting</title>
    </sec>
    <sec id="sec-10">
      <title>Results</title>
    </sec>
    <sec id="sec-11">
      <title>Sense Units vs Words</title>
      <p>Some of the Sense Units are impressively relevant (for
example {chevrolet, oldsmobile}). In other cases, the synsets
induce much noise due to polysemy problems. For instance,
mark and marker naturally constitute a synset. Unfortunately,
this will favor clustering together documents concerned with
marks (the German currency), documents about pencils, and
some specific scientific documents.</p>
      <p>One cause for the above difficulty is the fact that the word
clustering process for building the Sense Units actually relies
on the average similarity of the words contained in the
clusters (section 2), even though some words are more central
than others to a cluster. Further work will investigate a better
cluster similarity.</p>
      <p>On the other hand, the redescription step (deciding to
which extent a document involves a Sense Unit) might be
insufficiently elaborated; for instance, it does not take into
account how many words in the SU appear in the document, nor
the frequency of these words in the document. A worthwhile
perspective would be to consider a Sense Unit as a surrogate
document, and consider the LSI distance between the SU and
the document to be redescribed.</p>
      <p>A second, equally disappointing observation is that, even
though our Sense Units are built using a combination of
techniques including LSI, they are not as efficient as LSI alone
for classifying documents.</p>
      <p>
        One can note that in [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], it is shown that the partition of
Reuters that we consider is very favorable to pure statistical
methods. Sense units are not purely statistical and, like
rule-induction or decision-tree methods, fail to top those methods
on this part of the corpus.
      </p>
      <p>Of course, it is quite unsatisfactory that, although the
LSI projection uses far fewer "dimensions" than there are
Sense Units (100 eigenvalues against 600 Sense Units), the
LSI performance is much better. This means that the
eigenvectors of LSI are much richer than the Sense Units for
redescribing the documents. Here we have an opportunity to
improve our system: one idea could be to start from those
information-rich eigenvectors, and use WordNet to decompose
them into Sense Units.</p>
    </sec>
    <sec id="sec-12">
      <title>XML data</title>
      <p>
        Apart from the results obtained on Reuters, we also
performed some tests on a small corpus of about 2,000 XML
documents, provided by the Xyleme
crawler ([
        <xref ref-type="bibr" rid="ref11">11</xref>
]). These documents did not come with labels,
so it was impossible to evaluate the different
similarities as we did before. The only results we can give so
far on that corpus are rather subjective, based on the Sense
Units obtained and on the final clusters of documents. 324
Sense Units were produced, and we obtained about 200 clusters
of various sizes. A brief examination indicates that these
clusters make some sense. Many duplicates are present in the
data, and have been detected. More difficult clusterings were
also performed appropriately, for example for some documents
about biology.
      </p>
    </sec>
    <sec id="sec-13">
      <title>Related work</title>
      <p>Several approaches have been proposed in order to provide
better document descriptors than simple words. Such
descriptions are sought under diverse forms, ranging from
ontologies to distributions.</p>
      <p>These approaches can be characterized depending on the
nature of the information used to rewrite the documents,
which draws upon semantic or statistical methods, or both.</p>
      <p>
        On the semantic side for instance, [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] maps the document
words onto WordNet synsets, each synset accounting for a
concept. Experiments in the domain of text retrieval show
that the performance strongly depends on the word sense
disambiguation method used, which still is a limitation.
      </p>
      <p>
        More recently, a pure statistical approach has been
proposed by [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], obtaining excellent experimental results on
the 20Newsgroup corpus. This approach is quite similar to
ours, in the sense that it involves a two-step process,
clustering words first, then using word clusters to rewrite the
documents, and clustering documents last. The difference lies
in the criterion used to cluster words. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] use a purely
statistical criterion; words are clustered so as to minimize the
information loss, i.e. the loss in the quantity of information
of the corpus. In contrast, our criterion involves both
statistical (through LSI) and semantic (through WordNet)
information. As discussed in the previous section, the weaknesses
of our approach are precisely blamed on the insufficient care
exercised when clustering synsets.
      </p>
      <p>The combination of semantic and statistic-based
approaches has been investigated in several ways. Most works
rely on a syntactic tagging of the sentences. In [5] for
instance, a syntactic analyzer is used to spot relations in the
sentences (verb + preposition + complement). Words are
considered similar if they often occur together with the same
verb and preposition. Based on this similarity, the ASIUM system
interactively constructs word clusters, which are modified and
refined online by the expert. It is worth noting that ASIUM
significantly reduces the time needed to manually build
ontologies.</p>
      <p>
        Another approach, also based on a syntactic parser [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
focuses on particular (binary) relations (subject, verb) in the
sentences. These are used to construct a distribution over the
pairs (word, verb).
      </p>
    </sec>
    <sec id="sec-14">
      <title>Conclusion</title>
      <p>This paper has presented our first attempt to perform
document clustering using a method for redescribing
documents that involves both statistical and semantic information.</p>
      <p>The first results on Reuters have been quite disappointing.
Part of this problem might be due to the choice of the
corpus, as discussed in Section 4. This is why ongoing
experiments consider alternative corpora, such as XML documents
or 20-Newsgroups.</p>
      <p>It appears that the major weakness of our approach is
the weighting between statistical and semantic features. This
drawback can be blamed on two causes:
- Insufficient care was exercised in building the synset
similarity. This similarity was defined as the average of the
similarities of the words in the synset, so that, when dealing
with small synsets, the polysemy effect might be amplified.
One possibility to alleviate this limitation is to consider
synsets as documents, and to use LSI to directly obtain
similarity values between synsets.
- Our document redescription method is not precise enough;
hence the algorithm has a "fuzzy" view of the documents
after redescription. Taking more parameters into account
during this step should take care of this problem.</p>
      <p>
        Many people (see [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] for example) use statistical methods
on top of semantic results. We tried to give the two techniques
equal importance, but our system is very basic and will need
some more tuning. Another way to explore is to use semantic
techniques on top of statistical results, like the
decomposition using WordNet of the bags-of-words constituted by LSI
eigenvectors.
      </p>
      <p>Hirst G. and St-Onge D. WordNet: An Electronic
Lexical Database and some of its Applications, chapter 13:
Lexical chains as representations of context for the
detection and correction of malapropisms. MIT Press,
Christiane Fellbaum editor, 1998.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Berry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          .
          <article-title>Using linear algebra for intelligent information retrieval</article-title>
          .
          <source>SIAM Review</source>
          ,
          <volume>37</volume>
          (
          <issue>4</issue>
          ):
          <fpage>573</fpage>
          -
          <lpage>595</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Leacok</surname>
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Chodorov</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>WordNet: An Electronic Lexical Database and some of its Applications, chapter 11: Combining local context and WordNet similarity for word sense identification</article-title>
          . MIT Press, Christiane Fellbaum editor,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Landauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fagin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Maarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ben-Shaul</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Pelleg</surname>
          </string-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>Technical Report 10186, IBM</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Faure</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Nedellec</surname>
          </string-name>
          .
          <article-title>Knowledge acquisition of predicate argument structures from technical texts using Machine Learning: the system ASIUM</article-title>
          . In D. Fensel and R. Studer, editors,
          <source>11th European Workshop EKAW'99</source>
          , pages
          <fpage>329</fpage>
          -
          <lpage>334</lpage>
          . Springer-Verlag,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>U.M.</given-names>
            <surname>Fayyad</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.B.</given-names>
            <surname>Irani</surname>
          </string-name>
          .
          <article-title>Multi-interval discretization of continuous-valued attributes for classification learning</article-title>
          .
          <source>In Proceedings of IJCAI-93</source>
          , pages
          <fpage>1022</fpage>
          -
          <lpage>1027</lpage>
          . Morgan Kaufmann,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          , editor.
          <source>WordNet: an electronic lexical database</source>
          . Boston: MIT Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. W.</given-names>
            <surname>Foltz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Laham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          .
          <article-title>Automated essay scoring: Applications to educational technology</article-title>
          .
          <source>In Proceedings of EdMedia'99</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] http://www.cogsci.princeton.edu/~wn.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] http://www.xyleme.com.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Jain</surname>
          </string-name>
          .
          <article-title>Fundamentals of Digital Image Processing, chapter 5: Image transforms</article-title>
          , pages
          <fpage>132</fpage>
          -
          <lpage>188</lpage>
          . Prentice Hall,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          .
          <article-title>Text categorization with support vector machines: Learning with many relevant features</article-title>
          .
          <source>In Machine Learning: ECML-98, Tenth European Conference on Machine Learning</source>
          , pages
          <fpage>137</fpage>
          -
          <lpage>142</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grobelnik</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Mladenic</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Milic-Frayling</surname>
          </string-name>
          .
          <source>Proceedings of the Workshop on Text Mining, held at KDD-2000</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tishby</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Distributional clustering of English words</article-title>
          .
          <source>In 30th Annual Meeting of the ACL</source>
          , pages
          <fpage>183</fpage>
          -
          <lpage>190</lpage>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cluet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Veltri</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Vodislav</surname>
          </string-name>
          .
          <article-title>Views in a large scale XML repository</article-title>
          .
          <source>Submitted</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Gerard</given-names>
            <surname>Salton</surname>
          </string-name>
          .
          <article-title>Automatic text processing: the transformation, analysis, and retrieval of information by computer</article-title>
          .
          <source>Addison Wesley</source>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>N.</given-names>
            <surname>Slonim</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Tishby</surname>
          </string-name>
          .
          <article-title>Document clustering using word clusters via the information bottleneck method</article-title>
          .
          <source>In SIGIR 2000</source>
          , pages
          <fpage>208</fpage>
          -
          <lpage>215</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>E.M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <article-title>Using WordNet for Text Retrieval</article-title>
          . In
          <source>WordNet: An Electronic Lexical Database and some of its Applications</source>
          , chapter
          <volume>12</volume>
          . MIT Press, Christiane Fellbaum editor,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Wiemer-Hastings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wiemer-Hastings</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Graesser</surname>
          </string-name>
          .
          <article-title>How latent is latent semantic analysis</article-title>
          ?
          <source>In Proceedings of the Sixteenth International Joint Congress on Artificial Intelligence</source>
          , pages
          <fpage>932</fpage>
          -
          <lpage>937</lpage>
          , San Francisco,
          <year>1999</year>
          . Morgan Kaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Willett</surname>
          </string-name>
          .
          <article-title>Recent trends in hierarchic document clustering: A critical review</article-title>
          .
          <source>Information Processing and Management</source>
          ,
          <volume>24</volume>
          (
          <issue>5</issue>
          ):
          <fpage>577</fpage>
          -
          <lpage>597</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>An evaluation of statistical approaches to text categorization</article-title>
          .
          <source>Journal of Information Retrieval</source>
          ,
          <volume>1</volume>
          (
          <issue>1/2</issue>
          ):
          <fpage>67</fpage>
          -
          <lpage>88</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>