=Paper=
{{Paper
|id=Vol-205/paper-3
|storemode=property
|title=Syntax versus Semantics: Analysis of Enriched Vector Space Models
|pdfUrl=https://ceur-ws.org/Vol-205/paper3.pdf
|volume=Vol-205
}}
==Syntax versus Semantics: Analysis of Enriched Vector Space Models==
Benno Stein¹, Sven Meyer zu Eissen¹, and Martin Potthast²

¹ Faculty of Media, Media Systems, Bauhaus University Weimar, 99421 Weimar, Germany. {benno.stein | sven.meyer-zu-eissen}@medien.uni-weimar.de
² Faculty of Computer Science, Paderborn University, 33098 Paderborn, Germany
Abstract. This paper presents a robust method for the construction of collection-specific document models. These document models are variants of the well-known vector space model, which relies on a process of selecting, modifying, and weighting index terms with respect to a given document collection. We improve the step of index term selection by applying statistical methods for concept identification. This approach is particularly suited for post-retrieval categorization and retrieval tasks in closed collections, which is typical for intranet search.

We compare our approach to “enriched” vector-space-based document models that employ knowledge of the underlying language in the form of external semantic concepts. The primary objective is to quantify the impact of a purely syntactic analysis in contrast to a semantic enrichment in the index construction step. As a by-product we provide an efficient and language-independent means for vector space model construction, where the resulting document models perform better than the standard vector space model.

Keywords: vector space model, concept identification, semantic concepts, text categorization, evaluation measures
1 INTRODUCTION
Each text retrieval task that is automated by a computer relies on some kind of document model, which is an abstraction of the original document d. The document model must be tailored well with respect to the retrieval task in question: It determines the quality of the analysis and—diametrically opposed—its computational complexity. Despite its obvious simplicity, the vector space model has shown great success in many text retrieval tasks [11; 12; 16; 15], and the analysis of this paper uses this model as its starting point.

The standard vector space model abstracts a document d toward a vector d of weighted index terms. Each term t that is included in d derives from a term τ ∈ d by affix removal, which is necessary to map morphological variants of τ onto the same stem t. The respective term weights in d account for the different discriminative power of the original terms in d and are computed according to some frequency scheme. The main application of the vector space model is document similarity computation.

In this paper we focus on the index construction step and, in particular, on index term selection. Other concepts of the vector space model, such as the term weighting scheme or its disregard of word order, are adopted.

1.1 A Note on Semantics

We classify an index construction method as being semantic if it relies on additional domain knowledge, or if it exploits external information sources by means of some inference procedure, or both. Short documents may be similar to each other from the (semantic) viewpoint of a human reader, while the related instances of the vector space model do not reflect this fact because of the different words used. Index term enrichment can account for this by adding synonymous terms, hypernyms, hyponyms, or co-occurring terms [7].

Semantic approaches are oriented at the human understanding of language and text, and, as in the case of ontological index term enrichment, they are computationally efficient. However, the application of semantic approaches is problematic if, for instance, the document language is unknown or if a document combines passages from several languages. Moreover, there are situations where semantic approaches can even impair the retrieval quality: Consider a document collection with specialized texts; ontological index term enrichment will then move the specific character of a text toward a more general understanding. As a consequence, the similarity of highly specialized text is diluted in favor of less specialized text—which compares to the effect of adding noise.

1.2 Contributions

We investigate variants of the vector space model with respect to their classification performance. The starting point is the standard vector space model, where the step of index term selection is improved by a syntactic approach for concept identification; the resulting model is compared to semantically enriched vector space models. The syntactic concept identification approach is based on a collection-specific suffix tree analysis. In a nutshell, the paper’s underlying question may be summarized as follows:

Can syntactically determined concepts keep up with a semantically motivated index term enrichment?

To answer this question we have set up a number of text categorization experiments with different clustering algorithms. Since these algorithms are susceptible to various side effects, we will also present results that rely on an objective similarity assessment statistic: the measure of expected density, ρ. Perhaps the most interesting result may be anticipated: The positive effect of semantic index term enrichment, which has been reported by some authors in the past, could hardly be observed in our comprehensive analysis.

The remainder of the paper is organized as follows. Section 2 presents a taxonomy of index construction methods and outlines commonly used technology, and Section 3 reports on similarity analysis and unsupervised classification experiments.
2 INDEX CONSTRUCTION FOR DOCUMENT MODELS

This section organizes the current practice of index construction for vector space models. In particular, we review the concept of a document model and propose a classification scheme for both popular and specialized index construction principles.

A document d can be viewed under different aspects: layout, structural or logical setup, and semantics. A computer representation d of d may capture different portions of these aspects. Note that d is designed purposefully, with respect to the structure of a formalized query, q, and also with a particular retrieval model in mind. A retrieval model, R, provides the linguistic rationale for the model formation process behind the mapping d ↦ d. This mapping involves an inevitable simplification of d that should be

1. quantifiable,
2. useful with respect to the information need, and
3. tailored to q, the formalized query.

The retrieval model R gives answers to these points, be it theoretically or empirically, and provides a concrete means, ρ(q, d), for quantifying the relevance between a formalized query q and a document’s computer representation d. Note that ρ(q, d) is often specified in the form of a similarity measure ϕ.

Together, the computer representation d along with the underlying retrieval model R form the document model; Figure 2 illustrates the connections.

Figure 2. In the end, an information need q is satisfied by a real-world document d. A computer support for this retrieval task requires an abstraction of q and d towards q and d. The rationale for this abstraction comes from a linguistic theory and is operationalized by a retrieval model R.

Let D be a document collection and let T be the set of all terms that occur in D. The vector space model d of a document d is a vector of |T| weights, each of which quantifies the “importance” of some index term in T with respect to d.³ This quantification must be seen against the background that one is interested in a similarity function ϕ that maps the vectors d1 and d2 of two documents d1, d2 into the interval [0; 1] and that has the following property: If ϕ(d1, d2) is close to 1 then the documents d1 and d2 are similar; likewise, a value close to zero indicates a high dissimilarity. Note that document models and similarity functions determine each other: The vector space model and its variants are amenable to the cosine similarity (= normalized dot product) in the first place, but can also be used in connection with Euclidean distance, overlap measures, or other distance concepts.

³ Note that, in effect, the vector space model is a computer representation of the textual content of a document. However, in the literature the term “vector space model” is also understood as a retrieval model with a certain kind of relevance computation.
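To make this construction concrete, the following minimal sketch (our own illustration, not code from the paper) builds tf·idf-weighted vectors and compares them with the cosine similarity; tokenization, stopword removal, and stemming are assumed to have happened already:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf weighted document vectors over the index term set T.
    `docs` is a list of token lists (stopwords removed, terms stemmed)."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for tokens in docs:
        df.update(set(tokens))
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(d1, d2):
    """Normalized dot product: close to 1 means similar, close to 0 dissimilar."""
    dot = sum(w * d2.get(t, 0.0) for t, w in d1.items())
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```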
Under the vector space paradigm the document model construction process is determined in two dimensions: index construction and weight computation. In the following we will concentrate on the former dimension since this paper contributes right here. We have classified the index construction principles for vector space models in four main classes, which are shown in Figure 1.

Index construction principle (with an example for the technology in parentheses):
  Index term selection
    Inclusion methods (co-occurrence analysis)
    Exclusion methods (stopword removal)
  Index term modification (stemming)
  Index term enrichment (addition of synonym sets)
  Index transformation (singular value decomposition)

Figure 1. A taxonomy of index construction principles for vector space models.

Index Term Selection. Selection methods further divide into inclusion and exclusion methods. An important exclusion method is stopword removal: Common words, such as prepositions or conjunctions, introduce noise and provide no discriminating similarity information; they are usually discarded from the index set. However, there are special purpose models (e.g. for text genre identification) that rely on stopword features [13; 9].

The standard vector space model does not apply an inclusion method but simply takes the entire set T without stopwords. More advanced vector space models also use n-grams, i.e., continuous sequences of n words, n ≤ 4, which occur in the documents of D. Since the usage of n-grams entails the risk of introducing noise, not all n-grams should be added; instead, threshold-based selection methods should be applied, which rely on the information gain or a similar statistic [6].
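A minimal sketch of such a threshold-based inclusion method; we use a plain document-frequency cutoff as a stand-in for the information-gain statistic, and the threshold value is invented:

```python
from collections import Counter

def select_ngrams(docs, max_n=4, min_df=5):
    """Inclusion method: keep word n-grams (2 <= n <= max_n) that occur in
    at least min_df documents of D. The document-frequency cutoff stands
    in for the information-gain statistic mentioned above."""
    df = Counter()
    for tokens in docs:
        grams = set()
        for n in range(2, max_n + 1):
            for i in range(len(tokens) - n + 1):
                grams.add(tuple(tokens[i:i + n]))
        df.update(grams)                # count each n-gram once per document
    return {g for g, c in df.items() if c >= min_df}
```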
Index Term Modification. Most term modification methods aim at generalization. A common problem in this connection is the mapping of morphologically different words that embody the same concept onto the same index term. So-called stemming algorithms apply here; their goal is to find canonical forms for inflected or derived words, e.g. for declined nouns or conjugated verbs. Since the “unification” of words with respect to gender, number, time, and case is a language-specific issue, rule-based stemming algorithms require the development of specialized rule sets for each language. Recall that the application of language-specific rule sets requires the problem of language detection to be solved, both for unilingual and multilingual documents.
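As an illustration of a rule-based, language-specific stemmer, Porter’s algorithm for English (the unigram stemmer used later in Section 3.2) ships with NLTK; a quick sketch, assuming the nltk package is installed:

```python
from nltk.stem import PorterStemmer  # rule set for English only

stemmer = PorterStemmer()
words = ["studies", "studying", "studied"]
print([stemmer.stem(w) for w in words])  # ['studi', 'studi', 'studi']
```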
Index Term Enrichment. We classify a method as term enriching if it introduces terms not found in T. By nature, meaningful index term enrichment must be semantically motivated and exploit linguistic knowledge. A standard approach is the—possibly transitive—extension of T by synonyms, hypernyms, hyponyms, and co-occurring terms. The extension shall alleviate the problem of different writing styles, or of vocabulary variations observed in very small document snippets as they are returned from search engines. Note that these methods are not employed to address the problem of polysemy, since the required in-depth analysis of the term context is computationally too expensive for many similarity search applications.
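The following sketch illustrates synset-based enrichment against WordNet, the resource used by the approaches evaluated in Section 3.2; it assumes NLTK with the WordNet corpus downloaded and restricts itself to noun readings:

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def enrich_with_synonyms(index_terms):
    """Extend a set of index terms by the WordNet synsets of their noun
    readings; the transitive extension mentioned above is omitted."""
    enriched = set(index_terms)
    for term in index_terms:
        for synset in wn.synsets(term, pos=wn.NOUN):
            enriched.update(l.replace('_', ' ') for l in synset.lemma_names())
    return enriched

print(sorted(enrich_with_synonyms({"car"}))[:5])
# e.g. ['auto', 'automobile', 'cable car', 'car', 'elevator car']
```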
Index Transformation. In contrast to the construction methods mentioned before, transformation methods operate on all document vectors of a collection D at the same time by analyzing the term-document matrix, A. A popular index transformation method is latent semantic indexing (LSI), which uses a singular value decomposition of A in order to improve query rankings and similarity computations [2; 1; 8]. For this purpose, the document vectors are projected into a low-dimensional space that is spanned by the eigenvectors that belong to the largest singular values of the decomposition of A.
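A compact sketch of the LSI projection with NumPy; the toy matrix and the choice of k are ours:

```python
import numpy as np

def lsi_project(A, k):
    """Project the documents (columns of the term-document matrix A) into
    the k-dimensional space of the largest singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k, :]       # one k-dim vector per document

A = np.array([[1., 0., 1.],   # toy term-document matrix: 4 terms, 3 documents
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
print(lsi_project(A, k=2).shape)            # (2, 3)
```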
2.1 Discussion

Index terms that consist of a single word can be found by a skillful analysis of prefix frequency and prefix length. This idea can be extended to the identification of compound word concepts in written text. If continuous sequences of n words occur significantly often, then it is likely that these words form a concept. Put another way, concept detection reduces to the identification of frequent n-grams.

n-grams as a replacement for index term enrichment have been analyzed by several authors in the past, with moderate success only [6]. We explain the disappointing results with noise effects, which dominate the positive impact of the few additional concepts: Most authors apply a strategy of “complete extension”; i.e., they add all 2-grams and 3-grams to the index vector. However, when analyzing the frequency distribution of n-grams, it becomes clear that only a small fraction of all compound word sequences is statistically relevant.

The advantages of syntactical (statistical) methods for index construction can be summarized as follows:

1. language independence
2. robustness with respect to multi-lingual documents
3. tailored indexes for retrieval tasks on closed collections

An obvious disadvantage may be the necessary statistical mass: Syntactical index construction cannot work if only a few, very small document snippets are involved. This problem is also investigated in the next section, where the development of the index quality is compared against the underlying collection size.

As an aside, statistical stemming and the detection of compound word concepts are essentially the same—the level of granularity makes the difference: Stemming means frequency analysis at the level of characters; likewise, the identification of concepts means frequency analysis at the level of words.
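To make this statistical view concrete, here is a rough, language-independent sketch of frequency-based concept detection: an n-gram is kept when it is frequent and occurs nearly whenever its prefix occurs. This is only a loose stand-in for the suffix-tree-based successor variety analysis of Section 3.2, and all thresholds are invented:

```python
from collections import Counter

def compound_concepts(docs, max_n=4, min_freq=10, keep_ratio=0.8):
    """Word-level frequency analysis for compound concepts: keep an n-gram
    if it is frequent and accounts for most occurrences of its (n-1)-gram
    prefix, i.e. the prefix has low successor variety (a 'plateau')."""
    freq = Counter()
    for tokens in docs:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                freq[tuple(tokens[i:i + n])] += 1
    concepts = set()
    for gram, f in freq.items():
        if len(gram) >= 2 and f >= min_freq:
            if f / freq[gram[:-1]] >= keep_ratio:  # prefix rarely continues otherwise
                concepts.add(" ".join(gram))
    return concepts
```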
3 ANALYSIS OF ENRICHED VECTOR SPACE MODELS

Existing reports on the impact of index term selection and index term enrichment are contradictory [4; 5; 7], and not all of the published performance improvements could be reproduced [6]. Most of this research analyzes the effects of a modified vector space model on typical information retrieval tasks, such as document clustering or query answering.

Note that clustering results that have been obtained by employing the same cluster algorithm under different document models may tell us two things: (i) whether one document model captures more of the “gist” of the original document d than another model, and (ii) whether the cluster algorithm is able to take advantage of this added value.

A cluster algorithm’s performance depends on various parameters, such as the cluster number, its randomized start configuration, or preset similarity thresholds, which renders a comparison difficult. Moreover, there is the prevalently observed effect that different cluster algorithms are differently sensitive to document model “improvements”. From an analysis point of view the following questions arise:

1. Which cluster algorithm shall define the baseline for a comparison (the best for the dataset, the most commonly used, the simplest)?
2. Given several clustering results obtained by the same cluster algorithm, which result can be regarded as meaningful (the best, the worst, the average)?

Especially the second point receives little attention in current research: Common practice is to select the best result compared to a given reference classification, e.g. by maximizing the F-Measure value—ignoring that such a combined usage of unsupervised/supervised methods is far away from reality.⁴

An objective way to rank different document models is to compare their ability to capture the intrinsic similarity relations of a given collection D. The basic idea is the construction of a similarity graph, measuring its conformance to a reference classification, and analyzing the improvement or decline of this conformance under some document model. Exactly this is operationalized in form of the ρ-measure that is introduced below; it enables one to evaluate differences in the similarity concepts of alternative document models without being dependent on a cluster algorithm.⁵

Hence, the performance analyses presented in this section comprise two types of analyses: (i) experiments that, based on ρ, quantify objective improvements or declines of a document model, and (ii) experiments that, based on the F-Measure, quantify the effects of a document model on different cluster algorithms.

⁴ This issue is addressed in [14].
⁵ The ρ-measure was originally introduced in [14], as an alternative to the Davies-Bouldin index and the Dunn index, in order to evaluate the quality of cluster algorithms for text retrieval applications.

3.1 A Measure of Expected Density: ρ

As before let D = {d1, ..., dn} be a document collection whose corresponding computer representations are denoted as d1, ..., dn. A similarity graph G = ⟨V, E, ϕ⟩ for D is a graph where a node in V represents a document and an edge (di, dj) ∈ E is weighted with the similarity ϕ(di, dj).

A graph G = ⟨V, E, w⟩ is called sparse if |E| = O(|V|); it is called dense if |E| = O(|V|²). Put another way, we can compute the density θ of a graph from the equation |E| = |V|^θ.
Figure 3. Comparison of the standard vector space model, two semantically enriched models (synonym, hypernym), and a vector space model with syntactically identified concepts (n-gram) in two languages: The left-hand graph illustrates the development of ρ depending on the collection size for an English document collection; the right-hand graph compares the n-gram vector space model to the standard model for a German document collection. (Both panels plot the expected density ρ, roughly between 1 and 2.4, over sample sizes from 200 to 1000 documents drawn from 5 categories.)

With w(G) := |V| + Σ_{e∈E} w(e), this relation extends naturally to weighted graphs:⁶

  w(G) = |V|^θ  ⇔  θ = ln w(G) / ln |V|

Obviously, θ can be used to compare the density of each induced subgraph G′ = ⟨V′, E′, w′⟩ of G to the density of G: G′ is sparse (dense) compared to G if the quotient w(G′)/|V′|^θ is smaller (larger) than 1. This consideration provides a key to quantify a document model’s ability to capture the intrinsic similarity relations of G, and hence, of the underlying collection.

Let C = {C1, ..., Ck} be an exclusive categorization of D into k distinct categories, that is to say, Ci, Cj ⊆ D with Ci ∩ Cj = ∅ and ∪_{i=1}^{k} Ci = D, and let Gi = ⟨Vi, Ei, ϕ⟩ be the induced subgraph of G with respect to category Ci. Then the expected density of C is defined as follows:

  ρ(C) = Σ_{i=1}^{k} (|Vi| / |V|) · (w(Gi) / |Vi|^θ),  where |V|^θ = w(G)

Since the edge weights resemble the similarity of the documents associated with V, a higher value of ρ indicates a better modeling of a collection’s similarity relations.

⁶ w(G) denotes the total edge weight of G plus the number of nodes, |V|, which serves as an adjustment term for graphs with edge weights in [0; 1].
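A direct transcription of these definitions into code; the sketch assumes a complete similarity graph (the edge-selection policy is left open above) and uses illustrative names throughout:

```python
import math

def expected_density(categories, sim, docs):
    """Expected density rho of a categorization C = {C_1, ..., C_k}.
    `categories` is a list of lists of document ids, `docs` maps ids to
    vectors, and `sim` is the similarity function phi. Graphs are taken
    to be complete; w(G) = |V| + total edge weight, as in footnote 6."""
    def w(ids):
        total = sum(sim(docs[a], docs[b])
                    for i, a in enumerate(ids) for b in ids[i + 1:])
        return len(ids) + total

    all_ids = [d for cat in categories for d in cat]
    n = len(all_ids)
    theta = math.log(w(all_ids)) / math.log(n)   # solves |V|^theta = w(G)
    return sum(len(cat) / n * w(cat) / len(cat) ** theta
               for cat in categories)
```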
3.2 Syntax versus Semantics: Variants of the Vector Space Model

Aside from the standard vector space model our analysis compares the following three vector space model variants:

1. Syntactic Term Selection. Within this variant the index term selection step also considers syntactically identified concepts, i.e., 2-grams, 3-grams, and 4-grams. To identify the significant n-grams the document collection D is inserted into a suffix tree and a statistical successor variety analysis is applied. The operationalized principle behind this analysis is the peak-and-plateau method [5], for which we have developed a refinement in our working group.
2. Semantic Synonym Enrichment. Within this variant of semantic term enrichment the so-called synsets from Wordnet for nouns are added [3]; this procedure has been reported to work well for categorization tasks [7]. Note that adding synonyms to all index terms of a document vector will introduce a lot of noise, and hence only the top-ranked 10% of the index terms (respecting the employed term weighting scheme) are selected for enrichment.
3. Semantic Hypernym Enrichment. This variant of semantic term enrichment also relies on Wordnet: a sequence of up to four consecutive hypernyms is substituted for each noun (a sketch follows below). The rationale is as follows. Documents dealing with closely related—but still different—topics often contain terms which derive from a single hypernym representing their common category. The enrichment proposed here yields a stronger similarity between such documents without generalizing too much.

Index term weighting of both unigrams and n-grams follows the tf·idf scheme; stopwords are not indexed and unigram stemming is done according to Porter’s algorithm.
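A minimal reading of the hypernym substitution in variant 3, following only the first WordNet sense and the first hypernym at each step (the variant’s actual sense-selection policy is not spelled out above):

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def hypernym_chain(noun, depth=4):
    """Return the noun together with up to `depth` consecutive WordNet
    hypernyms, following the most frequent sense at each step."""
    chain = [noun]
    synsets = wn.synsets(noun, pos=wn.NOUN)
    current = synsets[0] if synsets else None
    while current and len(chain) <= depth:
        parents = current.hypernyms()
        if not parents:
            break
        current = parents[0]
        chain.append(current.lemma_names()[0].replace('_', ' '))
    return chain

print(hypernym_chain("dog"))
# e.g. ['dog', 'canine', 'carnivore', 'placental', 'mammal']
```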
Discussion. The resulting graphs in Figure 3 as well as the comparison in Table 1 show that the syntactic approach outperforms both semantic approaches. Of the semantic variants only the semantic hypernym enrichment is above the baseline; note that this happens even if a large number of synsets is added. We explain the results as follows: Index terms with a high term weight typically belong to a special vocabulary, and, from a semantic point of view, they are used deliberately, so that adding their synsets will tend to decrease their importance. Likewise, adding the synsets of low-weighted terms has no effect other than adding noise, since the importance of these terms will be increased without a true rationale.

  Vector space model variant            F-min    F-max    F-av.
  (sample size 1000, 10 categories)
  --------------------------------------------------------------
  standard vector space model                —baseline—
  synonym enrichment                     -8%      +4%      -2%
  hypernym enrichment                    +5%     +12%      +3%
  n-gram index term selection           +15%     +6%      +8%

Table 1. The table shows the improvements of the averaged F-Measure values that were achieved with the cluster algorithms k-means and MajorClust for the investigated variants of the vector space model.
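For reference, one common set-based formulation of the clustering F-Measure, matching each reference class with its best cluster; the paper does not spell out its exact formula, so this is an assumption:

```python
def f_measure(clusters, labels):
    """F-measure of a clustering against a reference classification.
    `clusters` is a list of sets of item indices; `labels[i]` is the
    reference class of item i. Classes are weighted by their size."""
    n = len(labels)
    score = 0.0
    for c in set(labels):
        members = {i for i, l in enumerate(labels) if l == c}
        best = 0.0
        for cluster in clusters:
            overlap = len(members & cluster)
            if overlap:
                prec = overlap / len(cluster)
                rec = overlap / len(members)
                best = max(best, 2 * prec * rec / (prec + rec))
        score += len(members) / n * best
    return score
```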
3.3 Test Corpus and Sample Formation

Experiments have been conducted with samples from RCV1, a shorthand for “Reuters Corpus Volume 1” [10], as well as with documents from German newsgroup postings.

RCV1 is a document collection that was published by the Reuters Corporation for research purposes. It contains more than 800,000 documents, each of which consists of a few hundred up to several thousand words. The documents are tagged with meta information like category (also called topic), geographic region, or industry sector. There are 103 different categories, which are arranged within a hierarchy of the four top level categories “Government, Social”, “Economics”, “Markets”, and “Corporate, Industrial”. Each of the top level categories defines the root of a tree of sub-categories, where each child node refines the information given by its parent. Note that a document d can be assigned to several categories c1, ..., cp, and that d then also belongs to all ancestor categories of each such category ci.

Within our experiments two documents di, dj are considered to belong to the same category if they share the same top level category ct and the same most specific category cs. Moreover, the test sets are constructed in such a way that there is no document di whose most specific category cs is an ancestor of the most specific category of some other document dj.

The samples were formed as follows: For the analysis of the intrinsic similarity relations based on ρ, the sample sizes ranged from 200 to 1000 documents taken from 5 categories. For the analysis of the categorization experiments, based on cluster algorithms and evaluated with the F-Measure, the sample sizes were 1000 documents taken from 10 categories.⁷

⁷ To make our analysis results reproducible for other researchers, meta information files that describe the compiled test collections have been recorded; they are available upon request.
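A sketch of the two sampling rules just described; the field names and data layout are purely illustrative:

```python
def same_category(doc_a, doc_b):
    """Category equivalence used in the experiments: two documents count
    as one category iff they agree on the top level category c_t and on
    the most specific category c_s."""
    return (doc_a["top_level"] == doc_b["top_level"]
            and doc_a["most_specific"] == doc_b["most_specific"])

def admissible(sample, ancestors):
    """Reject samples where one document's most specific category is an
    ancestor of another's; `ancestors` maps a category to its ancestor set."""
    cats = [d["most_specific"] for d in sample]
    return not any(ci in ancestors[cj]
                   for ci in cats for cj in cats if ci != cj)
```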
Figure 4. Category organization of the RCV1 corpus showing the four top level categories, from which “Corporate, Industrial” is further refined (the diagram lists sub-categories such as Insolvency/Liquidity, Account/Earnings, Annual results, Performance, and Comment/Forecasts).
4 SUMMARY

This paper provided a comparison of syntactical and semantic methods for the construction of vector space models; the special focus was index term selection. Interestingly, little attention has been paid to the mentioned syntactical methods in connection with text retrieval tasks. The following results of our paper shall be emphasized:

• With syntactically identified concepts significant improvements can be achieved for categorization tasks.
• The benefit of semantic term enrichment is generally overestimated.
• The ρ-measure provides an “algorithm-neutral” approach to analyze the similarity knowledge contained in document models.

Note that the last point may be interesting for the development of accepted benchmarks to compare research efforts related to document models or similarity measures.

Though syntactical analyses must not be seen as a cure-all for the index construction of vector space models, they provide advantages over semantic methods, such as language independence, robustness, and tailored index sets. With respect to several retrieval tasks they can keep up with semantic methods—however, our results give no room for an over-simplification: Both paradigms have the potential to outperform the other.

References

[1] Michael W. Berry, Susan T. Dumais, and Gavin W. O’Brien, ‘Using Linear Algebra for Intelligent Information Retrieval’, Technical Report UT-CS-94-270, Computer Science Department, (December 1994).
[2] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman, ‘Indexing by Latent Semantic Analysis’, Journal of the American Society for Information Science, 41(6), 391–407, (1990).
[3] Christiane Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, 1998.
[4] W. B. Frakes, ‘Term conflation for information retrieval’, in SIGIR ’84: Proceedings of the 7th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 383–389, Swinton, UK, (1984). British Computer Society.
[5] W. B. Frakes and Ricardo Baeza-Yates, Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Upper Saddle River, NJ, USA, 1992.
[6] Johannes Fürnkranz, ‘A Study Using n-gram Features for Text Categorization’, Technical Report OEFAI-TR-9830, Austrian Institute for Artificial Intelligence, (1998).
[7] A. Hotho, S. Staab, and G. Stumme, ‘Wordnet improves text document clustering’, in Proceedings of the SIGIR Semantic Web Workshop, (2003).
[8] Christos H. Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala, ‘Latent semantic indexing: a probabilistic analysis’, in PODS ’98: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 159–168, New York, NY, USA, (1998). ACM Press.
[9] Andreas Rauber and Alexander Müller-Kögler, ‘Integrating automatic genre analysis into digital libraries’, in ACM/IEEE Joint Conference on Digital Libraries, pp. 1–10, (2001).
[10] T. G. Rose, M. Stevenson, and M. Whitehead, ‘The Reuters Corpus Volume 1 - From Yesterday’s News to Tomorrow’s Language Resources’, in Proceedings of the Third International Conference on Language Resources and Evaluation, (2002).
[11] G. Salton and M. E. Lesk, ‘Computer Evaluation of Indexing and Text Processing’, Journal of the ACM, 15(1), 8–36, (January 1968).
[12] Karen Sparck-Jones, ‘A statistical interpretation of term specificity and its application in retrieval’, Journal of Documentation, 28, 11–21, (1972).
[13] E. Stamatatos, N. Fakotakis, and G. Kokkinakis, ‘Text genre detection using common word frequencies’, in Proceedings of the 18th International Conference on Computational Linguistics, Saarbrücken, Germany, (2000).
[14] Benno Stein, Sven Meyer zu Eißen, and Frank Wißbrock, ‘On Cluster Validity and the Information Need of Users’, in Proceedings of the 3rd IASTED International Conference on Artificial Intelligence and Applications (AIA 03), Benalmádena, Spain, ed. M. H. Hanza, pp. 216–221, Anaheim, Calgary, Zurich, (September 2003). ACTA Press.
[15] Michael Steinbach, George Karypis, and Vipin Kumar, ‘A comparison of document clustering techniques’, Technical Report 00-034, Department of Computer Science and Engineering, University of Minnesota, (2000).
[16] Yiming Yang and Jan O. Pedersen, ‘A comparative study on feature selection in text categorization’, in Proceedings of ICML-97, 14th International Conference on Machine Learning, ed. Douglas H. Fisher, pp. 412–420, Nashville, US, (1997). Morgan Kaufmann Publishers, San Francisco, US.