                                           Syntax versus Semantics:
                                      Analysis of Enriched Vector Space Models
                                 Benno Stein1 and Sven Meyer zu Eissen1 and Martin Potthast2


1 Faculty of Media, Media Systems, Bauhaus University Weimar, 99421 Weimar, Germany, {benno.stein | sven.meyer-zu-eissen}@medien.uni-weimar.de
2 Faculty of Computer Science, Paderborn University, 33098 Paderborn, Germany


Abstract. This paper presents a robust method for the construction of collection-specific document models. These document models are variants of the well-known vector space model, which relies on a process of selecting, modifying, and weighting index terms with respect to a given document collection. We improve the step of index term selection by applying statistical methods for concept identification. This approach is particularly suited for post-retrieval categorization and for retrieval tasks in closed collections, as is typical for intranet search.

We compare our approach to "enriched" vector-space-based document models that employ knowledge of the underlying language in the form of external semantic concepts. The primary objective is to quantify the impact of a purely syntactic analysis in contrast to a semantic enrichment in the index construction step. As a by-product we provide an efficient and language-independent means for vector space model construction, and the resulting document models perform better than the standard vector space model.

Keywords: vector space model, concept identification, semantic concepts, text categorization, evaluation measures


1     INTRODUCTION

Each text retrieval task that is automated by a computer relies on some kind of document model, which is an abstraction of the original document d. The document model must be tailored well with respect to the retrieval task in question: It determines the quality of the analysis and—diametrically opposed—its computational complexity. Despite its obvious simplicity, the vector space model has shown great success in many text retrieval tasks [11; 12; 16; 15], and the analysis of this paper uses this model as its starting point.

The standard vector space model abstracts a document d toward a vector d of weighted index terms. Each term t that is included in d derives from a term τ ∈ d by affix removal, which is necessary to map morphological variants of τ onto the same stem t. The respective term weights in d account for the different discriminative power of the original terms in d and are computed according to some frequency scheme. The main application of the vector space model is document similarity computation.

In this paper we focus on the index construction step and, in particular, on index term selection. Other concepts of the vector space model, such as the term weighting scheme or its disregard of word order, are adopted unchanged.

1.1    A Note on Semantics

We classify an index construction method as being semantic if it relies on additional domain knowledge, or if it exploits external information sources by means of some inference procedure, or both. Short documents may be similar to each other from the (semantic) viewpoint of a human reader, while the related instances of the vector space model do not reflect this fact because of the different words used. Index term enrichment can account for this by adding synonymous terms, hypernyms, hyponyms, or co-occurring terms [7].

Semantic approaches are oriented at the human understanding of language and text, and, as in the case of ontological index term enrichment, they are computationally efficient. However, the application of semantic approaches is problematic if, for instance, the document language is unknown or if a document combines passages from several languages. Moreover, there are situations where semantic approaches can even impair the retrieval quality: Consider a document collection with specialized texts; ontological index term enrichment will then move the specific character of a text toward a more general understanding. As a consequence, the similarity of highly specialized text is diluted in favor of less specialized text—which compares to the effect of adding noise.

1.2    Contributions

We investigate variants of the vector space model with respect to their classification performance. The starting point is the standard vector space model, where the step of index term selection is improved by a syntactic approach for concept identification; the resulting model is compared to semantically enriched vector space models. The syntactic concept identification approach is based on a collection-specific suffix tree analysis. In a nutshell, the paper's underlying question may be summarized as follows:

    Can syntactically determined concepts keep up with a semantically motivated index term enrichment?

To answer this question we have set up a number of text categorization experiments with different clustering algorithms. Since these algorithms are susceptible to various side effects, we also present results that rely on an objective similarity assessment statistic: the measure of expected density, ρ. Perhaps the most interesting result may be anticipated: The positive effect of semantic index term enrichment, which has been reported by some authors in the past, could hardly be observed in our comprehensive analysis.

The remainder of the paper is organized as follows. Section 2 presents a taxonomy of index construction methods and outlines commonly used technology, and Section 3 reports on similarity analysis and unsupervised classification experiments.
2     INDEX CONSTRUCTION FOR DOCUMENT MODELS

This section organizes the current practice of index construction for vector space models. In particular, we review the concept of a document model and propose a classification scheme for both popular and specialized index construction principles.

A document d can be viewed under different aspects: layout, structural or logical setup, and semantics. A computer representation d of d may capture different portions of these aspects. Note that d is designed purposefully, with respect to the structure of a formalized query q, and also with a particular retrieval model in mind. A retrieval model, R, provides the linguistic rationale for the model formation process behind the mapping d ↦ d. This mapping involves an inevitable simplification of d that should be

1. quantifiable,
2. useful with respect to the information need, and
3. tailored to q, the formalized query.

The retrieval model R gives answers to these points, be it theoretically or empirically, and provides a concrete means, ρ(q, d), for quantifying the relevance between a formalized query q and a document's computer representation d. Note that ρ(q, d) is often specified in the form of a similarity measure ϕ. Together, the computer representation d along with the underlying retrieval model R form the document model; Figure 2 illustrates the connections.

[Figure 2 shows a diagram connecting the layout, structural/logical, and semantic views of a real-world document d ∈ D to its computer representation d, and an information need q ∈ Q to its formalized query q, with the retrieval model R performing the relevance computation ρ(q, d) on the basis of a linguistic theory.]

Figure 2. In the end, an information need q is satisfied by a real-world document d. A computer support for this retrieval task requires an abstraction of q and d towards q and d. The rationale for this abstraction comes from a linguistic theory and is operationalized by a retrieval model R.

Let D be a document collection and let T be the set of all terms that occur in D. The vector space model d of a document d is a vector of |T| weights, each of which quantifies the "importance" of some index term in T with respect to d.³ This quantification must be seen against the background that one is interested in a similarity function ϕ that maps the vectors d1 and d2 of two documents d1, d2 into the interval [0; 1] and that has the following property: If ϕ(d1, d2) is close to 1, then the documents d1 and d2 are similar; likewise, a value close to zero indicates a high dissimilarity. Note that document models and similarity functions determine each other: The vector space model and its variants are amenable to the cosine similarity (= normalized dot product) in the first place, but can also be used in connection with the Euclidean distance, overlap measures, or other distance concepts.

³ Note that, in effect, the vector space model is a computer representation of the textual content of a document. However, in the literature the term "vector space model" is also understood as a retrieval model with a certain kind of relevance computation.
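To make the mapping d ↦ d and the similarity function ϕ concrete, the following minimal Python sketch (our illustration, not code from the paper) builds tf·idf-weighted document vectors over a toy collection and computes the cosine similarity, i. e. the normalized dot product:

    import math
    from collections import Counter

    def tfidf_vectors(docs):
        """Map each document (a list of index terms) to a tf*idf-weighted
        vector over the collection's term set T."""
        n = len(docs)
        df = Counter(t for d in docs for t in set(d))  # document frequencies
        return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
                for d in docs]

    def cosine_similarity(d1, d2):
        """Similarity function phi: normalized dot product in [0, 1]."""
        dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
        n1 = math.sqrt(sum(w * w for w in d1.values()))
        n2 = math.sqrt(sum(w * w for w in d2.values()))
        return dot / (n1 * n2) if n1 > 0 and n2 > 0 else 0.0

    # Toy collection: documents given as pre-processed term lists.
    docs = [["index", "term", "weight", "term"],
            ["index", "term", "selection"],
            ["suffix", "tree", "analysis"]]
    v = tfidf_vectors(docs)
    print(cosine_similarity(v[0], v[1]))  # > 0: shared vocabulary
    print(cosine_similarity(v[0], v[2]))  # 0.0: disjoint vocabulary

Note that a term occurring in every document receives an idf weight of zero, mirroring the observation that such terms carry no discriminative power.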
Under the vector space paradigm the document model construction process is determined in two dimensions: index construction and weight computation. In the following we concentrate on the former dimension, since this is where this paper contributes. We have classified the index construction principles for vector space models into four main classes, which are shown in Figure 1.

[Figure 1 shows a taxonomy of index construction principles, each with an example technology:]
• Index term selection — inclusion methods (example: co-occurrence analysis) and exclusion methods (example: stopword removal)
• Index term modification (example: stemming)
• Index term enrichment (example: addition of synonym sets)
• Index transformation (example: singular value decomposition)

Figure 1. A taxonomy of index construction principles for vector space models.

Index Term Selection. Selection methods further divide into inclusion and exclusion methods. An important exclusion method is stopword removal: Common words, such as prepositions or conjunctions, introduce noise and provide no discriminating similarity information; they are usually discarded from the index set. However, there are special purpose models (e. g. for text genre identification) that rely on stopword features [13; 9].

The standard vector space model does not apply an inclusion method but simply takes the entire set T without stopwords. More advanced vector space models also use n-grams, i. e., continuous sequences of n words, n ≤ 4, which occur in the documents of D. Since the usage of n-grams entails the risk of introducing noise, not all n-grams should be added; rather, threshold-based selection methods should be applied, which rely on the information gain or a similar statistic [6].

Index Term Modification. Most term modification methods aim at generalization. A common problem in this connection is the mapping of morphologically different words that embody the same concept onto the same index term. So-called stemming algorithms apply here; their goal is to find canonical forms for inflected or derived words, e. g. for declined nouns or conjugated verbs. Since the "unification" of words with respect to gender, number, tense, and case is a language-specific issue, rule-based stemming algorithms require the development of specialized rule sets for each language. Recall that the application of language-specific rule sets requires the problem of language detection to be solved, both for unilingual and for multilingual documents.
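For illustration, rule-based stemming with language-specific rule sets can be exercised as follows. The sketch assumes the NLTK library (our choice for illustration; the paper itself only names Porter's algorithm later on):

    # Rule-based stemming is language-specific: NLTK ships a separate
    # rule set per language.
    from nltk.stem import PorterStemmer, SnowballStemmer

    english = PorterStemmer()
    german = SnowballStemmer("german")

    # Morphological variants are mapped onto one canonical stem:
    print({english.stem(w) for w in ["connected", "connection", "connecting"]})
    # expected: {'connect'}

    # Applying the wrong rule set fails to unify variants, which is why
    # language detection must be solved before stemming:
    print(english.stem("häuser"), german.stem("häuser"))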
Index Term Enrichment. We classify a method as term enriching if it introduces terms not found in T. By nature, meaningful index term enrichment must be semantically motivated and exploit linguistic knowledge. A standard approach is the—possibly transitive—extension of T by synonyms, hypernyms, hyponyms, and co-occurring terms. The extension shall alleviate the problem of different writing styles, or of vocabulary variations observed in very small document snippets as they are returned from search engines.

Note that these methods are not employed to address the problem of polysemy, since the required in-depth analysis of the term context is computationally too expensive for many similarity search applications.

Index Transformation. In contrast to the construction methods mentioned before, transformation methods operate on all document vectors of a collection D at the same time by analyzing the term-document matrix A. A popular index transformation method is latent semantic indexing (LSI), which uses a singular value decomposition of A in order to improve query rankings and similarity computations [2; 1; 8]. For this purpose, the document vectors are projected into a low-dimensional space that is spanned by the eigenvectors belonging to the largest singular values of the decomposition of A.

2.1    Discussion

Index terms that consist of a single word can be found by a skillful analysis of prefix frequency and prefix length. This idea can be extended to the identification of compound word concepts in written text: If continuous sequences of n words occur significantly often, it is likely that these words form a concept. Put another way, concept detection reduces to the identification of frequent n-grams.

The use of n-grams as a replacement for index term enrichment has been analyzed by several authors in the past, with moderate success only [6]. We explain the disappointing results with noise effects, which dominate the positive impact of the few additional concepts: Most authors apply a strategy of "complete extension", i. e., they add all 2-grams and 3-grams to the index vector. However, when analyzing the frequency distribution of n-grams, it becomes clear that only a small fraction of all compound word sequences is statistically relevant.

The advantages of syntactical (statistical) methods for index construction can be summarized as follows:

1. language independence,
2. robustness with respect to multilingual documents,
3. tailored indexes for retrieval tasks on closed collections.

An obvious disadvantage may be the necessary statistical mass: Syntactical index construction cannot work if only a few, very small document snippets are involved. This problem is also investigated in the next section, where the development of the index quality is compared against the underlying collection size.

As an aside, statistical stemming and the detection of compound word concepts are essentially the same—only the level of granularity makes the difference: Stemming means frequency analysis at the level of characters; likewise, the identification of concepts means frequency analysis at the level of words.

3     ANALYSIS OF ENRICHED VECTOR SPACE MODELS

Existing reports on the impact of index term selection and index term enrichment are contradictory [4; 5; 7], and not all of the published performance improvements could be reproduced [6]. Most of this research analyzes the effects of a modified vector space model on typical information retrieval tasks, such as document clustering or query answering.

Note that clustering results obtained by employing the same cluster algorithm under different document models may tell us two things: (i) whether one document model captures more of the "gist" of the original document d than another model, and (ii) whether the cluster algorithm is able to take advantage of this added value.

A cluster algorithm's performance depends on various parameters, such as the cluster number, its randomized start configuration, or preset similarity thresholds, which renders a comparison difficult. Moreover, there is the prevalently observed effect that different cluster algorithms are differently sensitive to document model "improvements". From an analysis point of view the following questions arise:

1. Which cluster algorithm shall define the baseline for a comparison (the best for the dataset, the most commonly used, the simplest)?
2. Given several clustering results obtained by the same cluster algorithm, which result can be regarded as meaningful (the best, the worst, the average)?

Especially the second point receives little attention in current research: Common practice is to select the best result compared to a given reference classification, e. g. by maximizing the F-Measure value—ignoring that such a combined usage of unsupervised/supervised methods is far away from reality.⁴

⁴ This issue is addressed in [14].

An objective way to rank different document models is to compare their ability to capture the intrinsic similarity relations of a given collection D. The basic idea is the construction of a similarity graph, measuring its conformance to a reference classification, and analyzing the improvement or decline of this conformance under some document model. Exactly this is operationalized in the form of the ρ-measure introduced below; it enables one to evaluate differences in the similarity concepts of alternative document models without being dependent on a cluster algorithm.⁵

⁵ The ρ-measure was originally introduced in [14], as an alternative to the Davies-Bouldin index and the Dunn index, in order to evaluate the quality of cluster algorithms for text retrieval applications.

Hence, the performance analyses presented in this section comprise two types of experiments: (i) experiments that, based on ρ, quantify objective improvements or declines of a document model, and (ii) experiments that, based on the F-Measure, quantify the effects of a document model on different cluster algorithms.

3.1    A Measure of Expected Density: ρ

As before, let D = {d1, . . . , dn} be a document collection whose corresponding computer representations are denoted as d1, . . . , dn. A similarity graph G = ⟨V, E, ϕ⟩ for D is a graph where a node in V represents a document and an edge (di, dj) ∈ E is weighted with the similarity ϕ(di, dj).

A graph G = ⟨V, E, w⟩ is called sparse if |E| = O(|V|); it is called dense if |E| = O(|V|²). Put another way, we can compute the density θ of a graph from the equation |E| = |V|^θ.
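The similarity graph G = ⟨V, E, ϕ⟩ can be constructed directly from the document vectors. The following is a minimal sketch of ours; it reuses cosine_similarity from the sketch in Section 2, and omitting zero-weight edges is our simplification:

    from itertools import combinations

    def similarity_graph(vectors, phi):
        """Build G = <V, E, phi>: one node per document, each edge
        (d_i, d_j) weighted with the similarity phi(d_i, d_j)."""
        nodes = list(range(len(vectors)))
        edges = {}
        for i, j in combinations(nodes, 2):
            weight = phi(vectors[i], vectors[j])
            if weight > 0.0:  # simplification: omit zero-weight edges
                edges[(i, j)] = weight
        return nodes, edges

    # Usage, continuing the earlier toy collection:
    # nodes, edges = similarity_graph(tfidf_vectors(docs), cosine_similarity)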
[Figure 3 shows two plots of the expected density ρ (y-axis, 1.0 to 2.4) against the sample size (x-axis, 200 to 1000), both for 5 categories. The left-hand plot compares the standard vector space model, synonym enrichment, hypernym enrichment, and n-gram index term selection; the right-hand plot compares the standard vector space model and n-gram index term selection.]

Figure 3. Comparison of the standard vector space model, two semantically enriched models (synonym, hypernym), and a vector space model with syntactically identified concepts (n-gram) in two languages: The left-hand graph illustrates the development of ρ depending on the collection size for an English document collection; the right-hand graph compares the n-gram vector space model to the standard model for a German document collection.


With w(G) := |V| + Σ_{e∈E} w(e), this relation extends naturally to weighted graphs:⁶

    $w(G) = |V|^{\theta} \;\Leftrightarrow\; \theta = \frac{\ln w(G)}{\ln |V|}$

⁶ w(G) denotes the total edge weight of G plus the number of nodes, |V|, which serves as an adjustment term for graphs with edge weights in [0; 1].

Obviously, θ can be used to compare the density of each induced subgraph G′ = ⟨V′, E′, w′⟩ of G to the density of G: G′ is sparse (dense) compared to G if the quotient w(G′)/|V′|^θ is smaller (larger) than 1. This consideration provides a key to quantifying a document model's ability to capture the intrinsic similarity relations of G, and hence, of the underlying collection.

Let C = {C1, . . . , Ck} be an exclusive categorization of D into k distinct categories, that is to say, Ci, Cj ⊆ D with Ci ∩ Cj = ∅ and ∪_{i=1}^{k} Ci = D, and let Gi = ⟨Vi, Ei, ϕ⟩ be the induced subgraph of G with respect to category Ci. Then the expected density of C is defined as follows:

    $\rho(C) = \sum_{i=1}^{k} \frac{|V_i|}{|V|} \cdot \frac{w(G_i)}{|V_i|^{\theta}}, \quad \text{where } |V|^{\theta} = w(G)$

Since the edge weights resemble the similarity of the documents associated with V, a higher value of ρ indicates a better modeling of a collection's similarity relations.
3.2    Syntax versus Semantics: Variants of the Vector Space Model

Aside from the standard vector space model our analysis compares the following three vector space model variants:

1. Syntactic Term Selection. Within this variant the index term selection step also considers syntactically identified concepts, i. e., 2-grams, 3-grams, and 4-grams. To identify the significant n-grams, the document collection D is inserted into a suffix tree and a statistical successor variety analysis is applied. The operationalized principle behind this analysis is the peak-and-plateau method [5], for which we have developed a refinement in our working group (a simplified approximation is sketched after this list).

2. Semantic Synonym Enrichment. Within this variant of semantic term enrichment the so-called synsets from Wordnet are added for nouns [3]; this procedure has been reported to work well for categorization tasks [7]. Note that adding synonyms to all index terms of a document vector would introduce a lot of noise; hence only the top-ranked 10% of the index terms (with respect to the employed term weighting scheme) are selected for enrichment.

3. Semantic Hypernym Enrichment. This variant of semantic term enrichment also relies on Wordnet: a sequence of up to four consecutive hypernyms is substituted for each noun. The rationale is as follows: Documents dealing with closely related—but still different—topics often contain terms that derive from a single hypernym representing their common category. The enrichment proposed here yields a stronger similarity between such documents without generalizing too much.

Index term weighting of both unigrams and n-grams follows the tf·idf scheme; stopwords are not indexed, and unigram stemming is done according to Porter's algorithm.
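A full suffix tree analysis is beyond the scope of an example, but the core of the syntactic term selection of variant 1 can be approximated with a plain frequency threshold over word n-grams. The following sketch is a simplified stand-in of ours (the threshold min_count is our assumption; the refined peak-and-plateau criterion is not reproduced here):

    from collections import Counter

    def frequent_ngrams(token_docs, n_max=4, min_count=5):
        """Identify compound word concepts: word n-grams (2 <= n <= n_max)
        that occur at least min_count times in the collection D."""
        counts = Counter()
        for tokens in token_docs:
            for n in range(2, n_max + 1):
                for i in range(len(tokens) - n + 1):
                    counts[tuple(tokens[i:i + n])] += 1
        return {gram for gram, c in counts.items() if c >= min_count}

The identified concepts are then added to the index term set and, like the unigrams, weighted according to the tf·idf scheme.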
Discussion. The resulting graphs in Figure 3 as well as the comparison in Table 1 show that the syntactic approach outperforms both semantic approaches. Of the semantic variants only the hypernym enrichment is above the baseline; note that this holds even though a large number of synsets is added. We explain the results as follows: Index terms with a high term weight typically belong to a special vocabulary and, from a semantic point of view, are used deliberately, so that adding their synsets will tend to decrease their importance. Likewise, adding the synsets of low-weighted terms has no effect other than adding noise, since the importance of these terms will be increased without a true rationale.

    Vector space model variant      F-min    F-max    F-av.    (sample size 1000, 10 categories)
    standard vector space model            —baseline—
    synonym enrichment               -8%      +4%      -2%
    hypernym enrichment              +5%     +12%      +3%
    n-gram index term selection     +15%      +6%      +8%

Table 1. The table shows the improvements of the averaged F-Measure values that were achieved with the cluster algorithms k-means and MajorClust for the investigated variants of the vector space model.
3.3    Test Corpus and Sample Formation

Experiments have been conducted with samples from RCV1, short for "Reuters Corpus Volume 1" [10], as well as with documents from German newsgroup postings.

RCV1 is a document collection that was published by the Reuters Corporation for research purposes. It contains more than 800,000 documents, each consisting of a few hundred up to several thousand words. The documents are tagged with meta information such as category (also called topic), geographic region, or industry sector. There are 103 different categories, which are arranged within a hierarchy below the four top level categories "Government, Social", "Economics", "Markets", and "Corporate, Industrial". Each of the top level categories defines the root of a tree of sub-categories, where each child node refines the information given by its parent. Note that a document d can be assigned to several categories c1, . . . , cp, and that d then also belongs to all ancestor categories of each category ci.

Within our experiments two documents di, dj are considered to belong to the same category if they share the same top level category ct and the same most specific category cs. Moreover, the test sets are constructed in such a way that there is no document di whose most specific category cs is an ancestor of the most specific category of some other document dj.

The samples were formed as follows: For the analysis of the intrinsic similarity relations based on ρ, the sample sizes ranged from 200 to 1000 documents taken from 5 categories. For the categorization experiments, based on cluster algorithms and evaluated with the F-Measure, the sample size was 1000 documents taken from 10 categories.⁷

⁷ To make our analysis results reproducible for other researchers, meta information files that describe the compiled test collections have been recorded; they are available upon request.

Figure 4. Category organization of the RCV1 corpus showing the four top level categories, from which "Corporate, Industrial" is further refined. (Sub-category labels in the original diagram include "Insolvency, Liquidity", "Performance", "Account, Earnings", "Annual results", and "Comment, Forecasts".)

4     SUMMARY

This paper provided a comparison of syntactical and semantic methods for the construction of vector space models; the special focus was index term selection. Interestingly, little attention has been paid to the mentioned syntactical methods in connection with text retrieval tasks. The following results of our paper shall be emphasized:

• With syntactically identified concepts significant improvements can be achieved for categorization tasks.
• The benefit of semantic term enrichment is generally overestimated.
• The ρ-measure provides an "algorithm-neutral" approach to analyze the similarity knowledge contained in document models.

Note that the last point may be interesting for developing accepted benchmarks to compare research efforts related to document models or similarity measures.

Though syntactical analyses must not be seen as a cure-all for the index construction of vector space models, they provide advantages over semantic methods, such as language independence, robustness, and tailored index sets. With respect to several retrieval tasks they can keep up with semantic methods—however, our results give no room for an over-simplification: Both paradigms have the potential to outperform the other.

References

[1] Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien, 'Using Linear Algebra for Intelligent Information Retrieval', Technical Report UT-CS-94-270, Computer Science Department, University of Tennessee, (December 1994).
[2] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman, 'Indexing by Latent Semantic Analysis', Journal of the American Society for Information Science, 41(6), 391–407, (1990).
[3] Christiane Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, 1998.
[4] W. B. Frakes, 'Term conflation for information retrieval', in SIGIR '84: Proceedings of the 7th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 383–389, Swinton, UK, (1984). British Computer Society.
[5] W. B. Frakes and Ricardo Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Upper Saddle River, NJ, USA, 1992.
[6] Johannes Fürnkranz, 'A Study Using n-gram Features for Text Categorization', Technical Report OEFAI-TR-9830, Austrian Research Institute for Artificial Intelligence, (1998).
[7] A. Hotho, S. Staab, and G. Stumme, 'Wordnet improves text document clustering', in Proceedings of the SIGIR Semantic Web Workshop, (2003).
[8] Christos H. Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala, 'Latent semantic indexing: a probabilistic analysis', in PODS '98: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 159–168, New York, NY, USA, (1998). ACM Press.
[9] Andreas Rauber and Alexander Müller-Kögler, 'Integrating automatic genre analysis into digital libraries', in ACM/IEEE Joint Conference on Digital Libraries, pp. 1–10, (2001).
[10] T. G. Rose, M. Stevenson, and M. Whitehead, 'The Reuters Corpus Volume 1 – From Yesterday's News to Tomorrow's Language Resources', in Proceedings of the Third International Conference on Language Resources and Evaluation, (2002).
[11] G. Salton and M. E. Lesk, 'Computer Evaluation of Indexing and Text Processing', Journal of the ACM, 15(1), 8–36, (January 1968).
[12] Karen Sparck-Jones, 'A statistical interpretation of term specificity and its application in retrieval', Journal of Documentation, 28, 11–21, (1972).
[13] E. Stamatatos, N. Fakotakis, and G. Kokkinakis, 'Text genre detection using common word frequencies', in Proceedings of the 18th International Conference on Computational Linguistics, Saarbrücken, Germany, (2000).
[14] Benno Stein, Sven Meyer zu Eißen, and Frank Wißbrock, 'On Cluster Validity and the Information Need of Users', in Proceedings of the 3rd IASTED International Conference on Artificial Intelligence and Applications (AIA 03), Benalmádena, Spain, ed., M. H. Hamza, pp. 216–221, Anaheim, Calgary, Zurich, (September 2003). ACTA Press.
[15] Michael Steinbach, George Karypis, and Vipin Kumar, 'A comparison of document clustering techniques', Technical Report 00-034, Department of Computer Science and Engineering, University of Minnesota, (2000).
[16] Yiming Yang and Jan O. Pedersen, 'A comparative study on feature selection in text categorization', in Proceedings of ICML-97, 14th International Conference on Machine Learning, ed., Douglas H. Fisher, pp. 412–420, Nashville, US, (1997). Morgan Kaufmann Publishers, San Francisco, US.