<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Syntax versus Semantics: Analysis of Enriched Vector Space Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benno Stein</string-name>
          <email>benno.stein@medien.uni-weimar.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Potthast</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Computer Science, Paderborn University</institution>
          ,
          <addr-line>33098 Paderborn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Media, Media Systems, Bauhaus University Weimar</institution>
          ,
          <addr-line>99421 Weimar</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <contrib contrib-type="author">
          <string-name>Sven Meyer zu Eissen</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This paper presents a robust method for the construction of collection-specific document models. These document models are variants of the well-known vector space model, which relies on a process of selecting, modifying, and weighting index terms with respect to a given document collection. We improve the step of index term selection by applying statistical methods for concept identification. This approach is particularly suited for post-retrieval categorization and retrieval tasks in closed collections, which is typical for intranet search.</p>
      </abstract>
      <kwd-group>
        <kwd>vector space model</kwd>
        <kwd>concept identification</kwd>
        <kwd>semantic concepts</kwd>
        <kwd>text categorization</kwd>
        <kwd>evaluation measures</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Each text retrieval task that is automated by a computer relies on some kind of document model, which is an abstraction of the original document d. The document model must be tailored well with respect to the retrieval task in question: It determines the quality of the analysis and, diametrically opposed, its computational complexity. Despite its obvious simplicity, the vector space model has shown great success in many text retrieval tasks [11; 12; 16; 15], and the analysis of this paper uses this model as its starting point.</p>
      <p>The standard vector space model abstracts a document d toward a vector d of weighted index terms. Each term t that is included in d derives from a term in d by affix removal, which is necessary to map morphological variants onto the same stem t. The respective term weights in d account for the different discriminative power of the original terms in d and are computed according to some frequency scheme. The main application of the vector space model is document similarity computation.</p>
      <sec id="sec-1-1">
        <p>In this paper we focus on the index construction step and, in particular, on index term selection. Other concepts of the vector space model, such as the term weighting scheme or its disregard of word order, are adopted.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>1.1 A Note on Semantics</title>
      <sec id="sec-2-1">
        <p>We classify an index construction method as being semantic if it relies on additional domain knowledge, or if it exploits external information sources by means of some inference procedure, or both.</p>
      </sec>
      <sec id="sec-2-2">
        <p>
          Short documents may be similar to each other from the (semantic) viewpoint of a human reader, while the related instances of the vector space model do not reflect this fact because of the different words used. Index term enrichment can account for this by adding synonymous terms, hypernyms, hyponyms, or co-occurring terms [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>Semantic approaches are oriented at the human understanding of language and text, and, as in the case of ontological index term enrichment, they are computationally efficient. However, the application of semantic approaches is problematic if, for instance, the document language is unknown or if a document combines passages from several languages. Moreover, there are situations where semantic approaches can even impair the retrieval quality: Consider a document collection with specialized texts; then ontological index term enrichment will move the specific character of a text toward a more general understanding. As a consequence, the similarity of highly specialized text is diluted in favor of less specialized text, which compares to the effect of adding noise.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>1.2 Contributions</title>
      <p>We investigate variants of the vector space model with respect to their classification performance. The starting point is the standard vector space model, where the step of index term selection is improved by a syntactic approach for concept identification; the resulting model is compared to semantically enriched vector space models. The syntactic concept identification approach is based on a collection-specific suffix tree analysis. In a nutshell, the paper's underlying question may be summarized as follows:</p>
      <p>Can syntactically determined concepts keep up with a semantically motivated index term enrichment?</p>
      <p>To answer this question we have set up a number of text categorization experiments with different clustering algorithms. Since these algorithms are susceptible to various side effects, we will also present results that rely on an objective similarity assessment statistic: the measure of expected density, ρ̄. Perhaps the most interesting result may be anticipated: The positive effect of semantic index term enrichment, which has been reported by some authors in the past, could hardly be observed in our comprehensive analysis.</p>
      <sec id="sec-3-1">
        <p>The remainder of the paper is organized as follows. Section 2 presents a taxonomy of index construction methods and outlines commonly used technology, and Section 3 reports on similarity analysis and unsupervised classification experiments.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>2 INDEX CONSTRUCTION FOR DOCUMENT MODELS</title>
      <sec id="sec-5-1">
        <p>This section organizes the current practice of index construction for vector space models. In particular, we review the concept of a document model and propose a classification scheme for both popular and specialized index construction principles.</p>
        <p>A document d can be viewed under different aspects: layout, structural or logical setup, and semantics. A computer representation d of d may capture different portions of these aspects. Note that d is designed purposefully, with respect to the structure of a formalized query, q, and also with a particular retrieval model in mind. A retrieval model, R, provides the linguistic rationale for the model formation process behind the mapping d ↦ d. This mapping involves an inevitable simplification of d that should be</p>
        <list list-type="order">
          <list-item><p>quantifiable,</p></list-item>
          <list-item><p>useful with respect to the information need, and</p></list-item>
          <list-item><p>tailored to q, the formalized query.</p></list-item>
        </list>
      </sec>
      <sec id="sec-5-5">
        <p>The retrieval model R gives answers to these points, be it theoretically or empirically, and provides a concrete means, ρ(q, d), for quantifying the relevance between a formalized query q and a document's computer representation d. Note that ρ(q, d) is often specified in the form of a similarity measure φ.</p>
      </sec>
      <sec id="sec-5-6">
        <p>Together, the computer representation d along with the underlying retrieval model R form the document model; Figure 2 illustrates the connections.</p>
        <p>Let D be a document collection and let T be the set of all terms that occur in D. The vector space model d of a document d is a vector of |T| weights, each of which quantifies the importance of some index term in T with respect to d.3 This quantification must be seen against the background that one is interested in a similarity function φ that maps the vectors d1 and d2 of two documents d1, d2 into the interval [0, 1] and that has the following property: If φ(d1, d2) is close to 1, then the documents d1 and d2 are similar; likewise, a value close to zero indicates a high dissimilarity. Note that document models and similarity functions determine each other: The vector space model and its variants are amenable to the cosine similarity (= normalized dot product) in the first place, but can also be used in connection with Euclidean distance, overlap measures, or other distance concepts.</p>
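To make these notions concrete, the following sketch (hypothetical helper names; plain tf-idf weighting, which is also the scheme used later in the experiments) builds sparse weight vectors and computes the cosine similarity as a normalized dot product with values in [0, 1]:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a list of stemmed terms) to a sparse tf-idf weight vector."""
    n = len(docs)
    # document frequency of each term over the collection D
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(d1, d2):
    """Normalized dot product of two sparse weight vectors; result lies in [0, 1]."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

A vector compared with itself yields 1, and vectors of documents without common terms yield 0, matching the property stated above.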
        <p>Under the vector space paradigm the document model construction process is determined in two dimensions: index construction and weight computation. In the following we will concentrate on the former dimension, since this paper contributes right here. We have classified the index construction principles for vector space models in four main classes, which are shown in Figure 1: index term selection, index term modification, index term enrichment, and index transformation.</p>
        <p>3 Note that, in effect, the vector space model is a computer representation of the textual content of a document. However, in the literature the term vector space model is also understood as a retrieval model with a certain kind of relevance computation.</p>
        <p>[Figure 2: A real-world document d ∈ D (layout view, structural/logical view, semantic view) is mapped to a computer representation; together with the retrieval model R and its relevance computation ρ(q, d), grounded in a linguistic theory and a conceptual model, it forms the document model; q ∈ Q denotes the formalized query derived from an information need.]</p>
        <sec id="sec-5-6-4">
          <p>Index Term Selection. Selection methods further divide into
inclusion and exclusion methods. An important exclusion method is
stopword removal: Common words, such as prepositions or conjunctions,
introduce noise and provide no discriminating similarity information;
they are usually discarded from the index set. However, there are
special purpose models (e. g. for text genre identification) that rely on
stopword features [13; 9].</p>
        </sec>
      </sec>
      <sec id="sec-5-7">
        <p>The standard vector space model does not apply an inclusion method but simply takes the entire set T without stopwords. More advanced vector space models also use n-grams, i.e., continuous sequences of n words, n ≤ 4, which occur in the documents of D.</p>
      </sec>
      <sec id="sec-5-8">
        <p>
          Since the usage of n-grams entails the risk of introducing noise, not all n-grams should be added; rather, threshold-based selection methods should be applied, which rely on the information gain or a similar statistic [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
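A minimal sketch of such a threshold-based n-gram selection (illustrative function name; a raw collection-frequency cutoff stands in for the information-gain statistic referenced above):

```python
from collections import Counter

def frequent_ngrams(docs, n_max=4, min_count=3):
    """Collect word n-grams (2 <= n <= n_max) from the collection and keep
    only those whose collection frequency reaches min_count."""
    counts = Counter()
    for words in docs:
        for n in range(2, n_max + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return {g for g, c in counts.items() if c >= min_count}
```

Only the small fraction of statistically relevant compound word sequences survives the cutoff; the rest would merely add noise to the index.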
        <p>Index Term Modification. Most term modification methods aim at generalization. A common problem in this connection is the mapping of morphologically different words that embody the same concept onto the same index term. So-called stemming algorithms apply here; their goal is to find canonical forms for inflected or derived words, e. g. for declined nouns or conjugated verbs. Since the unification of words with respect to gender, number, time, and case is a language-specific issue, rule-based stemming algorithms require the development of specialized rule sets for each language. Recall that the application of language-specific rule sets requires the problem of language detection, both in unilingual and multilingual documents, to be solved.</p>
        <sec id="sec-5-8-3">
          <p>[Figure 1, example technology per principle: inclusion methods, co-occurrence analysis; exclusion methods, stopword removal; index term modification, stemming; index term enrichment, addition of synonym sets; index transformation, singular value decomposition.]</p>
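The idea of rule-based suffix stripping can be sketched as follows (a toy English rule set for illustration only; real stemmers such as Porter's algorithm add context conditions and several rewrite passes):

```python
def strip_suffix(word, suffixes=("ization", "ation", "ing", "ed", "es", "s")):
    """Reduce a word to a crude stem by removing the longest matching suffix,
    provided at least three characters remain as the stem."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word
```

Such a rule set is inherently language-specific, which is exactly the limitation discussed above.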
          <p>Index Term Enrichment. We classify a method as term enriching if it introduces terms not found in T. By nature, meaningful index term enrichment must be semantically motivated and exploit linguistic knowledge. A standard approach is the (possibly transitive) extension of T by synonyms, hypernyms, hyponyms, and co-occurring terms. The extension shall alleviate the problem of different writing styles, or of vocabulary variations observed in very small document snippets as they are returned from search engines.</p>
        </sec>
      </sec>
      <sec id="sec-5-9">
        <p>Note that these methods are not employed to address the problem of polysemy, since the required in-depth analysis of the term context is computationally too expensive for many similarity search applications.</p>
        <p>Index Transformation. In contrast to the construction methods mentioned before, transformation methods operate on all document vectors of a collection D at the same time by analyzing the term-document matrix, A. A popular index transformation method is latent semantic indexing (LSI), which uses a singular value decomposition of A in order to improve query rankings and similarity computations [2; 1; 8]. For this purpose, the document vectors are projected into a low-dimensional space that is spanned by the eigenvectors that belong to the largest singular values of the decomposition of A.</p>
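The projection step can be sketched with a truncated SVD (assuming NumPy; the matrix is laid out terms × documents, so documents are columns):

```python
import numpy as np

def lsi_project(A, k):
    """Project the document vectors (columns of the term-document matrix A)
    onto the k left singular vectors with the largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # coordinates of the documents in the k-dimensional latent space
    return np.diag(s[:k]) @ Vt[:k, :]
```

Documents with identical term vectors keep identical coordinates in the latent space, while near-synonymous term usage is folded onto shared dimensions.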
      </sec>
    </sec>
    <sec id="sec-6">
      <title>2.1 Discussion</title>
      <p>Index terms that consist of a single word can be found by a skillful analysis of prefix frequency and prefix length. This idea can be extended to the identification of compound word concepts in written text. If continuous sequences of n words occur significantly often, then it is likely that these words form a concept. Put another way, concept detection reduces to the identification of frequent n-grams.</p>
      <p>
        The use of n-grams as a replacement for index term enrichment has been analyzed by several authors in the past, with only moderate success [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <sec id="sec-6-1">
        <p>We explain the disappointing results with noise effects, which dominate the positive impact of the few additional concepts: Most authors apply a strategy of complete extension, i.e., they add all 2-grams and 3-grams to the index vector. However, when analyzing the frequency distribution of n-grams, it becomes clear that only a small fraction of all compound word sequences is statistically relevant.</p>
      </sec>
      <sec id="sec-6-2">
        <p>The advantages of syntactical (statistical) methods for index construction can be summarized as follows:</p>
      </sec>
      <sec id="sec-6-3">
        <title>1. language independence</title>
      </sec>
      <sec id="sec-6-4">
        <title>2. robustness with respect to multi-lingual documents</title>
      </sec>
      <sec id="sec-6-5">
        <title>3. tailored indexes for retrieval tasks on closed collections</title>
      </sec>
      <sec id="sec-6-7">
        <p>An obvious disadvantage may be the necessary statistical mass: Syntactical index construction cannot work if only a few, very small document snippets are involved. This problem is also investigated in the next section, where the development of the index quality is compared against the underlying collection size.</p>
      </sec>
      <sec id="sec-6-8">
        <p>As an aside, statistical stemming and the detection of compound word concepts are essentially the same; the level of granularity makes the difference: Stemming means frequency analysis at the level of characters; likewise, the identification of concepts means frequency analysis at the level of words.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3 ANALYSIS OF ENRICHED VECTOR SPACE MODELS</title>
      <sec id="sec-8-1">
        <p>
          Existing reports on the impact of index term selection and index term enrichment are contradictory [4; 5; 7], and not all of the published performance improvements could be reproduced [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Most of this research analyzes the effects of a modified vector space model on typical information retrieval tasks, such as document clustering or query answering.
        </p>
      </sec>
      <sec id="sec-8-2">
        <p>Note that clustering results that have been obtained by employing the same cluster algorithm under different document models may tell us two things: (i) whether one document model captures more of the gist of the original document d than another model, and (ii) whether the cluster algorithm is able to take advantage of this added value.</p>
      </sec>
      <sec id="sec-8-3">
        <p>A cluster algorithm's performance depends on various parameters, such as the cluster number, its randomized start configuration, or preset similarity thresholds, which renders a comparison difficult. Moreover, there is the prevalently observed effect that different cluster algorithms are differently sensitive to document model improvements. From an analysis point of view the following questions arise:</p>
      </sec>
      <sec id="sec-8-4">
        <title>1. Which cluster algorithm shall define the baseline for a comparison (the best for the dataset, the most commonly used, the simplest)?</title>
      </sec>
      <sec id="sec-8-5">
        <title>2. Given several clustering results obtained by the same cluster algorithm, which result can be regarded as meaningful (the best, the worst, the average)?</title>
      </sec>
      <sec id="sec-8-6">
        <p>Especially to the second point, little attention is paid in current research: Common practice is to select the best result compared to a given reference classification, e. g. by maximizing the F-Measure value, ignoring that such a combined usage of unsupervised/supervised methods is far away from reality.4</p>
      </sec>
      <sec id="sec-8-7">
        <p>An objective way to rank different document models is to compare their ability to capture the intrinsic similarity relations of a given collection D. The basic idea is the construction of a similarity graph, measuring its conformance to a reference classification, and analyzing the improvement or decline of this conformance under some document model. Exactly this is operationalized in the form of the ρ̄-measure that is introduced below; it enables one to evaluate differences in the similarity concepts of alternative document models without being dependent on a cluster algorithm.5</p>
      </sec>
      <sec id="sec-8-8">
        <p>Hence, the performance analyses presented in this section comprise two types of analyses: (i) experiments that, based on ρ̄, quantify objective improvements or declines of a document model, and (ii) experiments that, based on the F-Measure, quantify the effects of a document model on different cluster algorithms.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>3.1 A Measure of Expected Density: ρ̄</title>
      <p>As before, let D = {d1, …, dn} be a document collection whose corresponding computer representations are denoted as d1, …, dn. A similarity graph G = ⟨V, E, φ⟩ for D is a graph where a node in V represents a document and an edge (di, dj) ∈ E is weighted with the similarity φ(di, dj).</p>
      <sec id="sec-9-1">
        <p>
          A graph G = ⟨V, E, w⟩ is called sparse if |E| = O(|V|); it is called dense if |E| = O(|V|²). Put another way, we can compute the density θ of a graph from the equation |E| = |V|^θ. With w(G) := |V| + Σ_{e∈E} w(e), this relation extends naturally to weighted graphs:6 w(G) = |V|^θ, i.e., θ = ln w(G) / ln |V|.
4 This issue is addressed in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
5 The ρ̄-measure was originally introduced in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], as an alternative for the Davies-Bouldin index and the Dunn index, in order to evaluate the quality of cluster algorithms for text retrieval applications.
[Figure 3 legend: 5 categories; standard vector space model, synonym enrichment, hypernym enrichment, n-gram index term selection. 5 categories; standard vector space model, n-gram index term selection.]
        </p>
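The density exponent θ of a weighted similarity graph can be computed directly from this definition (a small sketch with hypothetical function names; the graph is given as a node list and a dict of edge weights in [0, 1]):

```python
import math

def total_weight(V, E):
    """w(G): number of nodes plus the total edge weight (edge weights in [0, 1])."""
    return len(V) + sum(E.values())

def density_exponent(V, E):
    """theta such that w(G) = |V|**theta, i.e. theta = ln w(G) / ln |V|."""
    return math.log(total_weight(V, E)) / math.log(len(V))
```

The adjustment term |V| in w(G) keeps the logarithm well defined even when all edge weights are close to zero.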
      </sec>
      <sec id="sec-9-2">
        <p>Obviously, θ can be used to compare the density of each induced subgraph G′ = ⟨V′, E′, w′⟩ of G to the density of G: G′ is sparse (dense) compared to G if the quotient w(G′)/|V′|^θ is smaller (larger) than 1. This consideration provides a key to quantify a document model's ability to capture the intrinsic similarity relations of G, and hence, of the underlying collection.</p>
      </sec>
      <sec id="sec-9-3">
        <p>Let C = {C1, …, Ck} be an exclusive categorization of D into k distinct categories, that is to say, Ci, Cj ⊆ D with Ci ∩ Cj = ∅ and ∪_{i=1..k} Ci = D, and let Gi = ⟨Vi, Ei, φ⟩ be the induced subgraph of G with respect to category Ci. Then the expected density of C is defined as follows:</p>
        <p>ρ̄(C) = Σ_{i=1..k} (|Vi| / |V|) · (w(Gi) / |Vi|^θ), where |V|^θ = w(G).</p>
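ρ̄ can be computed directly from this definition. The sketch below (illustrative function name; documents carry a label only so that a toy similarity function can be defined) derives θ from the whole graph and then sums the weighted per-category densities:

```python
import math

def expected_density(categories, sim):
    """rho-bar(C) = sum_i (|Vi|/|V|) * w(Gi) / |Vi|**theta, with theta chosen
    such that |V|**theta = w(G); sim(u, v) gives the edge weight in [0, 1]."""
    V = [d for C in categories for d in C]

    def w(nodes):
        # total edge weight over all node pairs, plus the adjustment term |nodes|
        ws = sum(sim(u, v) for i, u in enumerate(nodes) for v in nodes[i + 1:])
        return len(nodes) + ws

    theta = math.log(w(V)) / math.log(len(V))
    return sum((len(C) / len(V)) * w(C) / len(C) ** theta for C in categories)
```

A categorization whose categories form dense subgraphs scores above 1, and a better document model raises ρ̄ for the reference categorization.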
      </sec>
      <sec id="sec-9-4">
        <p>Since the edge weights resemble the similarity of the documents associated with V, a higher value of ρ̄ indicates a better modeling of a collection's similarity relations.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>3.2 Syntax versus Semantics: Variants of the Vector Space Model</title>
      <p>
        Aside from the standard vector space model, our analysis compares the following three vector space model variants:
1. Syntactic Term Selection. Within this variant the index term selection step also considers syntactically identified concepts, i.e., 2-grams, 3-grams, and 4-grams. To identify the significant n-grams, the document collection D is inserted into a suffix tree and a statistical successor variety analysis is applied. The operationalized principle behind this analysis is the peak-and-plateau method [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
for which we have developed a refinement in our working group.
6 w(G) denotes the total edge weight of G plus the number of nodes, |V|, which serves as adjustment term for graphs with edge weights in [0, 1].
2. Semantic Synonym Enrichment. Within this variant of semantic term enrichment the so-called synsets from WordNet for nouns are added [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; this procedure has been reported to work well for
categorization tasks [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Note that adding synonyms to all index terms of a document vector will introduce a lot of noise; hence, only the top-ranked 10% of the index terms (with respect to the employed term weighting scheme) are selected for enrichment.
3. Semantic Hypernym Enrichment. This variant of semantic term enrichment also relies on WordNet: a sequence of up to four consecutive hypernyms is substituted for each noun. The rationale is as follows: Documents dealing with closely related (but still different) topics often contain terms which derive from a single hypernym representing their common category. The enrichment proposed here yields a stronger similarity between such documents without generalizing too much.
      </p>
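The hypernym enrichment step can be sketched as follows (a stand-in taxonomy dict replaces WordNet's hypernym relation; note that this sketch appends the hypernym chain to the index terms rather than substituting it, which is a simplification of the variant described above):

```python
def enrich_with_hypernyms(terms, hypernym_of, depth=4):
    """Add a chain of up to `depth` consecutive hypernyms for each term.
    `hypernym_of` is a stand-in taxonomy dict; the experiments in the text
    use WordNet's hypernym relation for nouns instead."""
    enriched = list(terms)
    for t in terms:
        cur = t
        for _ in range(depth):
            cur = hypernym_of.get(cur)
            if cur is None:
                break
            enriched.append(cur)
    return enriched
```

Two documents mentioning, say, different dog breeds then share the common hypernyms and thus become more similar without being generalized all the way to the taxonomy root.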
      <sec id="sec-11-1">
        <p>Index term weighting of both unigrams and n-grams follows the tf-idf scheme; stopwords are not indexed and unigram stemming is done according to Porter's algorithm.</p>
        <p>Discussion. The resulting graphs in Figure 3 as well as the comparison in Table 1 show that the syntactic approach outperforms both semantic approaches. Of the semantic variants, only the semantic hypernym enrichment is above the baseline; note that this happens even if a large number of synsets is added. We explain the results as follows: Index terms with a high term weight typically belong to a special vocabulary, and, from a semantic point of view, they are used deliberately, so that adding their synsets will tend to decrease their importance. Likewise, adding the synsets of low-weighted terms has no effect other than adding noise, since the importance of these terms will be increased without a true rationale.</p>
      </sec>
      <sec id="sec-11-2">
        <p>[Table 1 (sample size 1000, 10 categories): F-min, F-max, and F-av. values for the vector space model variants standard vector space model (baseline), synonym enrichment, hypernym enrichment, and n-gram index term selection; reported values: -8%, +5%, +15%, baseline, +4%, +12%, +6%, -2%, +3%, +8%.]</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>3.3 Test Corpus and Sample Formation</title>
      <sec id="sec-12-1">
        <p>Experiments have been conducted with samples from RCV1, short for Reuters Corpus Volume 1 [10], as well as with documents from German newsgroup postings.</p>
      </sec>
      <sec id="sec-12-2">
        <p>RCV1 is a document collection that was published by the Reuters Corporation for research purposes. It contains more than 800,000 documents, each consisting of a few hundred up to several thousand words. The documents are tagged with meta information like category (also called topic), geographic region, or industry sector. There are 103 different categories, which are arranged within a hierarchy of the four top level categories Government, Social; Economics; Markets; and Corporate, Industrial. Each of the top level categories defines the root of a tree of sub-categories, where each child node refines the information given by its parent. Note that a document d can be assigned to several categories c1, …, cp, and that d does also belong to all ancestor categories of some category ci.</p>
      </sec>
      <sec id="sec-12-3">
        <p>Within our experiments two documents di, dj are considered to belong to the same category if they share the same top level category ct and the same most specific category cs. Moreover, the test sets are constructed in such a way that there is no document di whose most specific category cs is an ancestor of the most specific category of some other document dj.</p>
      </sec>
      <sec id="sec-12-4">
        <p>The samples were formed as follows: For the analysis of the intrinsic similarity relations based on ρ̄, the sample sizes ranged from 200 to 1000 documents taken from 5 categories. For the analysis of the categorization experiments, based on cluster algorithms and evaluated with the F-Measure, the sample sizes were 1000 documents taken from 10 categories.7</p>
        <p>[Figure: excerpt of the RCV1 category hierarchy: Government, Social; Economics; Markets; Corporate, Industrial; with sub-categories such as Insolvency, Liquidity; Performance; Account, Earnings; Comment, Forecasts; Annual results.]</p>
      </sec>
      <sec id="sec-12-5">
        <p>This paper provided a comparison of syntactical and semantic methods for the construction of vector space models; the special focus was index term selection. Interestingly, little attention has been paid to the mentioned syntactical methods in connection with text retrieval tasks. The following results of our paper shall be emphasized:</p>
      </sec>
      <sec id="sec-12-6">
        <title>With syntactically identified concepts significant improvements can be achieved for categorization tasks.</title>
      </sec>
      <sec id="sec-12-7">
        <title>The benefit of semantic term enrichment is generally overestimated.</title>
      </sec>
      <sec id="sec-12-8">
        <title>The ρ̄-measure provides an algorithm-neutral approach to analyze the similarity knowledge contained in document models.</title>
        <p>7 To make our analysis results reproducible for other researchers, meta information files that describe the compiled test collections have been recorded; they are available upon request.</p>
      </sec>
      <sec id="sec-12-9">
        <p>Note that the last point may be interesting for developing accepted benchmarks to compare research efforts related to document models or similarity measures.</p>
        <p>Though syntactical analyses must not be seen as a cure-all for the index construction of vector space models, they provide advantages over semantic methods, such as language independence, robustness, and tailored index sets. With respect to several retrieval tasks they can keep up with semantic methods; however, our results give no room for an over-simplification: Both paradigms have the potential to outperform the other.</p>
      </sec>
      <sec id="sec-12-10">
        <p>… Benalmádena, Spain, ed., M. H. Hanza, pp. 216-221, Anaheim, Calgary, Zurich, (September 2003). ACTA Press.</p>
        <p>[15] Michael Steinbach, George Karypis, and Vipin Kumar, `A comparison of document clustering techniques', Technical Report 00-034, Department of Computer Science and Engineering, University of Minnesota, (2000).</p>
        <p>[16] Yiming Yang and Jan O. Pedersen, `A comparative study on feature selection in text categorization', in Proceedings of ICML-97, 14th International Conference on Machine Learning, ed., Douglas H. Fisher, pp. 412-420, Nashville, US, (1997). Morgan Kaufmann Publishers, San Francisco, US.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Michael W.</given-names>
            <surname>Berry</surname>
          </string-name>
          , Susan T. Dumais, and
          <string-name>
            <given-names>Gavin W.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          , `
          <article-title>Using Linear Algebra for Intelligent Information Retrieval'</article-title>
          ,
          <source>Technical Report UT-CS-94-270</source>
          , Computer Science Department, (December
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Scott C.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          , Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman, `
          <article-title>Indexing by Latent Semantic Analysis'</article-title>
          ,
          <source>Journal of the American Society for Information Science</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ),
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          , (
          <year>1990</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Christiane</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          ,
          <source>WordNet: An Electronic Lexical Database</source>
          , MIT Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Frakes</surname>
          </string-name>
          , `
          <article-title>Term conflation for information retrieval'</article-title>
          ,
          <source>in SIGIR '84: Proceedings of the 7th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pp.
          <fpage>383</fpage>
          -
          <lpage>389</lpage>
          ,
          Swinton, UK
          , (
          <year>1984</year>
          ). British Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Frakes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ricardo</given-names>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <source>Information Retrieval: Data Structures and Algorithms</source>
          , Prentice-Hall, Inc., Upper Saddle River, NJ, USA,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Johannes</given-names>
            <surname>Fürnkranz</surname>
          </string-name>
          , `
          <article-title>A Study Using n-gram Features for Text Categorization'</article-title>
          ,
          <source>Technical report, Austrian Institute for Artificial Intelligence</source>
          , (
          <year>1998</year>
          ).
          <source>Technical Report OEFAI-TR-9830.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hotho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          , and G. Stumme, `
          <article-title>Wordnet improves text document clustering'</article-title>
          ,
          <source>in Proceedings of the SIGIR Semantic Web Workshop</source>
          , (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Christos H.</given-names>
            <surname>Papadimitriou</surname>
          </string-name>
          , Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala, `
          <article-title>Latent semantic indexing: a probabilistic analysis'</article-title>
          ,
          <source>in PODS '98: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems</source>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>168</lpage>
          , New York, NY, USA, (
          <year>1998</year>
          ). ACM Press.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Rauber</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Müller-Kögler</surname>
          </string-name>
          , `
          <article-title>Integrating automatic genre analysis into digital libraries'</article-title>
          ,
          <source>in ACM/IEEE Joint Conference on Digital Libraries</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          , (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.G.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stevenson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Whitehead</surname>
          </string-name>
          , `
          <article-title>The Reuters Corpus Volume 1 - From Yesterday's News to Tomorrow's Language Resources'</article-title>
          ,
          <source>in Proceedings of the Third International Conference on Language Resources and Evaluation</source>
          , (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Lesk</surname>
          </string-name>
          , `
          <article-title>Computer Evaluation of Indexing and Text Processing'</article-title>
          ,
          <source>Journal of the ACM</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ),
          <fpage>8</fpage>
          -
          <lpage>36</lpage>
          , (
          <year>January 1968</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Sparck-Jones</surname>
          </string-name>
          , `
          <article-title>A statistical interpretation of term specificity and its application in retrieval'</article-title>
          ,
          <source>Journal of Documentation</source>
          ,
          <volume>28</volume>
          ,
          <fpage>11</fpage>
          -
          <lpage>21</lpage>
          , (
          <year>1972</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fakotakis</surname>
          </string-name>
          , and G. Kokkinakis, `
          <article-title>Text genre detection using common word frequencies'</article-title>
          ,
          <source>in Proceedings of the 18th Int. Conference on Computational Linguistics</source>
          , Saarbrücken, Germany, (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          , Sven Meyer zu Eißen, and Frank Wißbrock, `
          <article-title>On Cluster Validity and the Information Need of Users'</article-title>
          ,
          <source>in Proceedings of the 3rd IASTED International Conference on Artificial Intelligence and Applications</source>
          (AIA 03), Benalmádena, Spain, ed., M. H. Hanza, pp.
          <fpage>216</fpage>
          -
          <lpage>221</lpage>
          , Anaheim, Calgary, Zurich, (
          <year>September 2003</year>
          ). ACTA Press.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Steinbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>George</given-names>
            <surname>Karypis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vipin</given-names>
            <surname>Kumar</surname>
          </string-name>
          , `
          <article-title>A comparison of document clustering techniques'</article-title>
          ,
          <source>Technical Report 00-034, Department of Computer Science and Engineering, University of Minnesota</source>
          , (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Yiming</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jan O.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          , `
          <article-title>A comparative study on feature selection in text categorization'</article-title>
          ,
          <source>in Proceedings of ICML-97, 14th International Conference on Machine Learning</source>
          , ed., Douglas H. Fisher, pp.
          <fpage>412</fpage>
          -
          <lpage>420</lpage>
          , Nashville, US, (
          <year>1997</year>
          ). Morgan Kaufmann Publishers, San Francisco, US.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>