               Distributional Similarity of Words with Different
                                 Frequencies

                                                              Christian Wartena
                                  Hochschule Hannover, University of Applied Sciences and Arts
                                                       Expo Plaza 12
                                                 30359 Hannover, Germany
                                                christian.wartena@hs-hannover.de

ABSTRACT
Distributional semantics tries to characterize the meaning of words by the contexts in which they occur. Similarity of words can hence be derived from the similarity of contexts. Contexts of a word are usually vectors of words appearing near that word in a corpus. It was observed in previous research that similarity measures for the context vectors of two words depend on the frequency of these words. In the present paper we investigate this dependency in more detail for one similarity measure, the Jensen-Shannon divergence. We give an empirical model of this dependency and propose the deviation of the observed Jensen-Shannon divergence from the divergence expected on the basis of the frequencies of the words as an alternative similarity measure. We show that this new similarity measure is superior to both the Jensen-Shannon divergence and the cosine similarity in a task in which pairs of words, taken from WordNet, have to be classified as being synonyms or not.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing Methods, Linguistic Processing; G.3 [Probability and Statistics]: Correlation and regression analysis; I.2.7 [Artificial Intelligence]: Natural Language Processing—Language Models, Text Analysis

General Terms
Experimentation

Keywords
Distributional Similarity, Synonymy

DIR 2013, April 26, 2013, Delft, The Netherlands.
Copyright remains with the authors and/or original copyright holders.

1. INTRODUCTION
For many applications dealing with texts it is useful or necessary to know which words in a language are similar. Similarity between words can be found in hand-crafted resources, like WordNet [8], but methods that derive word similarities from large text corpora are at least an interesting alternative. Intuitively, words that occur in the same texts or, more generally, in the same contexts are similar. Thus we could base a similarity measure on the number of times two words occur in the same context, e.g. by representing words in a document space. Especially if we consider small contexts, like a window of a few words around a word, this approach yields pairs of words that are in some dependence relation to each other. De Saussure [3] calls such relations, defined by co-presence in a linguistic structure (e.g. a text, sentence, phrase, fixed window, words in a certain grammatical relation to the studied word, and so on), syntagmatic relations. The type of similarity that is much closer to synonymy and much more determined by the meaning of a word is obtained by comparing the contexts in which a word occurs. This type of similarity is usually called paradigmatic similarity or distributional similarity.

Though distributional similarity has been studied widely and is established as a method to find similar words, there is no consensus on how the context of a word has to be defined or on the best way to compute the similarity between contexts. In the most general definitions the context of a word consists of words and their relations to the given word (see e.g. [6, 2]). In the following we will only consider the simplest case, in which there is only one relation: the relation of being in the same sentence. Now each word can be represented by a context vector in a high-dimensional word space. Since these context vectors are very sparse, dimensionality reduction techniques are often applied. In the present paper we use random indexing, introduced by Karlgren and Sahlgren [7] and Sahlgren [9], to reduce the size of the context vectors. For random indexing each word is represented by a random index vector. The context vector of a word is constructed by adding the index vectors of all words in its contexts. Thus the dimensionality of the context vector is the same as the dimensionality chosen for the index vectors. It was shown by Karlgren and Sahlgren [7] that this technique gives results comparable to those obtained by dimensionality reduction techniques like singular value decomposition, but requires fewer computational resources. The similarity of the context vectors, finally, can be used as a proxy for the similarity of words.

In order to evaluate the various methods to define context vectors and the various similarity measures that can subsequently be used, the computed similarity of words is usually tested in a task in which words have to be classified as being synonymous or not with a given word. Often the data are taken from the synonym detection task of TOEFL (Test of English as a Foreign Language), in which the most closely related word from a set of four words has to be chosen. Görnerup and Karlgren [5] found that the best results are obtained using the L1 norm or the Jensen-Shannon divergence (JSD).
Curran and Moens [2] obtain the best results using a combination of the Jaccard coefficient and the t-test, while Van der Plas and Bouma [10] report the best results using a combination of the Dice coefficient and pointwise mutual information. Both Curran and Moens and Van der Plas and Bouma use a number of different relations and need a similarity measure that is able to assign different weights to the relations. This makes their results less relevant for the present paper. The differences between the latter two studies show how strongly the results depend on the exact settings of the experiment. Many authors, however, use cosine similarity as a generally well-established similarity measure for vectors in high-dimensional word spaces.

Weeds et al. [13] do not compare similarity measures against hand-crafted data sets but study characteristic properties of various measures. They find that, in a task where words related to a given word have to be found, some similarity measures tend to find words with a frequency similar to that of the target word, while others favor highly frequent words. The Jensen-Shannon divergence (JSD) is one of the measures that tend to favor more general terms. In the following we will investigate this in more detail. We show that a better similarity measure can be defined on the basis of the JSD when we use our knowledge about the dependency of the JSD on the frequency of the words. Finally, we show that this new similarity measure outperforms the original JSD and the cosine similarity in a task in which a large number of word pairs have to be classified as synonyms or non-synonyms.

2. INFLUENCE OF WORD FREQUENCY
As already mentioned above, Weeds et al. [13] observed that, in tasks in which related words have to be found, some measures prefer words with a frequency similar to that of the target word, while others prefer highly frequent words, regardless of the frequency of the target word. The JSD belongs to the latter category. In Wartena et al. [12] we made the same observation. There we compared context vectors of words with the word distribution of a document with the goal of finding keywords for the document. In order to compensate for the strong bias towards highly frequent words, we introduced specificity as an explicit second condition for finding keywords. As long as we try to find synonyms for a given word, i.e. if we compare pairs of words in which one component is fixed, as in the TOEFL tests, the problem is usually tolerable. Moreover, the problem is not that apparent if the range between the lowest and highest frequencies is not too large, e.g. when only words with a certain minimal frequency are considered and the size of the corpus gives a low upper bound on the frequency. Length effects are completely avoided if for every word the same number of contexts is sampled, as is done e.g. by Giesbrecht [4]. As we will see below, the JSD becomes completely useless if we compare arbitrary word pairs and do not pose any lower or upper bound on the frequency of the words.

The JSD between two probability distributions is defined as the average of the relative entropy of each of the distributions to their average distribution. It is interesting to note that the JSD can be written as

    JSD(p, q) = \tfrac{1}{2} D(p \,\|\, \tfrac{1}{2}p + \tfrac{1}{2}q) + \tfrac{1}{2} D(q \,\|\, \tfrac{1}{2}p + \tfrac{1}{2}q)
              = \log 2 + \tfrac{1}{2} \sum_{t:\, p(t) \neq 0 \,\wedge\, q(t) \neq 0} \Big( p(t) \log\frac{p(t)}{p(t)+q(t)} + q(t) \log\frac{q(t)}{p(t)+q(t)} \Big)    (1)

This formulation of the JSD explicitly shows that its value only depends on the words that have a non-zero value in both context vectors. If there is no common word, the JSD is maximal. Now suppose that all words are independent. If the context vectors are based on only a few instances of a word, the probability that a context word co-occurs with both words is rather low. To be a bit more precise, if we have context vectors v1 and v2 that are distributions over d elements, with n1 and n2 non-zero elements, then the probability that a given word is non-zero in both distributions is, as a first approximation, (n1/d) · (n2/d). Even if the words are not independent, we may expect a similar behavior: the probability that a word has a non-zero value in two context vectors increases with the number of contexts on which the vectors are based.
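As an illustration of Eq. (1), the following short Python sketch (ours, not part of the original paper; the function and variable names are assumptions) computes the JSD of two sparse context distributions, summing only over the indices that are non-zero in both of them.

    import math

    def jsd(p, q):
        """Jensen-Shannon divergence of two distributions given as sparse
        dicts mapping index -> probability (the values of each dict sum to 1).
        As in Eq. (1), only indices that are non-zero in both p and q
        contribute to the sum; the natural logarithm is used."""
        total = 0.0
        for t in set(p) & set(q):
            total += p[t] * math.log(p[t] / (p[t] + q[t]))
            total += q[t] * math.log(q[t] / (p[t] + q[t]))
        return math.log(2) + 0.5 * total

    # Toy example with three small distributions.
    p = {0: 0.5, 1: 0.5}
    q = {1: 0.5, 2: 0.25, 3: 0.25}
    print(jsd(p, q))          # between 0 and log 2
    print(jsd(p, p))          # 0.0 for identical distributions
    print(jsd(p, {2: 1.0}))   # log 2: no common non-zero index, maximal JSD

With sparse vectors the sum runs only over the (typically few) common non-zero indices, which is why the JSD of two rarely observed words is very likely to stay close to its maximum.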
If we try to predict the JSD of the context vectors of two words, we could base this prediction on the frequency of the words. However, it turns out that this is a very complicated dependency. Alternatively, we could base the prediction on the entropy of the context vector (if we interpret the vector as a probability distribution, as we have to do to compute the JSD): if the entropy of both vectors is maximal, they have to be identical and the JSD will be 0. If the entropy of both vectors is minimal, the JSD of the two vectors is most likely to be maximal. Since, in case of independence of all words, the context vectors will not converge to the uniform distribution but to the background distribution, i.e. the word distribution of the whole corpus, it is more natural to use the relative entropy with respect to the background distribution. Preliminary experiments have shown that this works, but that the JSD of two context vectors can be predicted better from the number of non-zero values in the vectors.

Figure 1 shows the relation between the JSD of two context vectors and the product of the numbers of non-zero values in both distributions. The divergences in this figure are computed for distributions over 20 000 random indices, computed on the 2.2 billion word ukWaC corpus, for 9916 word pairs. We found the same dependency for the L1 norm. In contrast, for the cosine similarity we could not find any dependency on the number of instances of the words or on the number of non-zero values in the context distributions.

Figure 1: Divergence vs. product of relative number of non-zero values for pairs of context vectors and a function modeling the dependency.

3. EXPERIMENTAL RESULTS
To test our hypothesis that the divergence of two context vectors depends on the number of instances on which these vectors are based, we computed divergences for almost 10 000 word pairs on a very large corpus. Furthermore, we show how the knowledge about this dependency can be used to find a better measure to capture the semantic similarity between two words.

3.1 Data
As a corpus to compute the context distributions we use the POS-tagged and lemmatized version of the ukWaC corpus of approximately 2.2 billion words [1]. As the context of a
word we consider all lemmata of open-class words (i.e. nouns, adjectives, verbs, etc.) in the same sentence. We define a sentence simply as a set of words; a corpus then is a set of sentences. Let C be a corpus and w a word; then we define C_w = {S ∈ C | w ∈ S}. Given a corpus C, the context vector p_w of a word w can be defined as

    p_w = \frac{1}{|C_w|} \sum_{S \in C_w} \frac{1}{|S|} \sum_{v \in S} r_v    (2)

where r_v is the random index vector of the word v. The random index vector is defined as a probability distribution over d elements such that, for some small set of random numbers R ⊂ {r ∈ N | r < d}, we have r_v(i) = 1/|R| if i ∈ R and r_v(i) = 0 otherwise. In the following we will use distributions with d = 20 000 and |R| = 8 unless stated otherwise. Note that we will always use probability distributions, but stick to the usual terminology of (context) vectors.
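As an illustration of Eq. (2), the following Python sketch (our own illustration, not code from the paper; the corpus format and all names are assumptions) builds context vectors by random indexing: every word receives a fixed random index distribution with |R| non-zero entries of value 1/|R|, and the context vector of a word is the average, over the sentences containing it, of the averaged index vectors of the words in those sentences.

    import random
    from collections import defaultdict

    D = 20_000   # dimensionality d of the index vectors
    R_SIZE = 8   # |R|, the number of non-zero entries per index vector

    _index_vectors = {}

    def index_vector(word):
        """Random index vector r_v: a distribution over D positions with
        R_SIZE randomly chosen entries of value 1/R_SIZE, fixed per word."""
        if word not in _index_vectors:
            positions = random.sample(range(D), R_SIZE)
            _index_vectors[word] = {i: 1.0 / R_SIZE for i in positions}
        return _index_vectors[word]

    def context_vector(word, corpus):
        """Context vector p_w of Eq. (2); the corpus is a list of sentences,
        each sentence a set of (lemmatized, open-class) words."""
        containing = [s for s in corpus if word in s]
        p_w = defaultdict(float)
        for sentence in containing:
            for v in sentence:
                for i, value in index_vector(v).items():
                    p_w[i] += value / (len(sentence) * len(containing))
        return dict(p_w)

    # Toy corpus of three "sentences".
    corpus = [{"cat", "sit", "mat"}, {"dog", "sit", "rug"}, {"cat", "chase", "dog"}]
    p_cat = context_vector("cat", corpus)
    print(sum(p_cat.values()))   # ~1.0: the context vector is again a distribution

The number of non-zero entries of such a vector grows with the number of sentences a word occurs in, which is exactly the dependency studied in the following subsections.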
For the evaluation of the similarity measures we selected pairs of words from WordNet [8]. We started with a list of pairs (w1, w2) such that (1) w1 and w2 are single words, (2) w1 occurs at least two times in the British National Corpus, and (3) w1 and w2 share at least one sense. This resulted in a list of 24 576 word pairs. From this list we selected all pairs for which the Jaccard coefficient of the sets of senses of the two words is at least 0.7. After filtering out all pairs containing a word that was not found in the ukWaC corpus, a list of 849 pairs remained. These word pairs are considered as synonyms in the following. Next, the second components of the 24 576 word pairs were reordered randomly. The resulting list of new word pairs was filtered such that the two words of each pair both occur in the ukWaC corpus and have no common sense. This resulted in a list of 8967 word pairs.
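The sense-overlap filter can be sketched as follows. This is a minimal illustration assuming NLTK's WordNet interface and data are available; the helper names, the candidate list and the vocabulary set are ours, and the paper's remaining selection criteria (single words, BNC frequency) are omitted here.

    from nltk.corpus import wordnet as wn

    def sense_jaccard(w1, w2):
        """Jaccard coefficient of the sets of WordNet senses of two words."""
        s1, s2 = set(wn.synsets(w1)), set(wn.synsets(w2))
        if not s1 and not s2:
            return 0.0
        return len(s1 & s2) / len(s1 | s2)

    def select_synonym_pairs(candidate_pairs, corpus_vocabulary, threshold=0.7):
        """Keep pairs whose sense sets overlap strongly and whose words
        both occur in the corpus; these pairs are treated as synonyms."""
        return [(w1, w2) for w1, w2 in candidate_pairs
                if sense_jaccard(w1, w2) >= threshold
                and w1 in corpus_vocabulary and w2 in corpus_vocabulary]

    print(sense_jaccard("car", "automobile"))   # > 0: the pair shares a sense
    print(sense_jaccard("car", "banana"))       # 0.0: no shared sense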
As a consequence of the requirement that the WordNet senses overlap, most words in the synonym list have very few senses and are rather infrequent. Thus the average frequency in ukWaC of the synonyms is much lower than that of the words in the non-synonym list. The most frequent word (use) was found 4.57 million times in the ukWaC corpus; 117 words were found only once (e.g. somersaulting, sakartvelo).

3.2 Predicting JSD of context vectors
Figure 1 shows that there is a clear dependency between the JSD of a word pair and the product of the relative numbers of non-zero values in the context distributions. This dependency can be captured by the following equation:

    JSD_{exp}(p_1, p_2) = a \log\Big(1 + \frac{b}{n}\Big) + c    (3)

with n = (n_1/d) · (n_2/d), where n_1 and n_2 are the numbers of non-zero values of p_1 and p_2, respectively. Optimal values for a, b and c were found by maximizing the coefficient of determination, R², on all non-synonym word pairs. We left out the synonyms, since we try to model only the similarity that is caused merely by the probability of random words occurring in both contexts, which grows with the number of observations. With a = −0.34, b = 0.032 and c = 0.67 an R² score of 0.95 is reached (0.93 for the same constants when the synonyms are included). The curve corresponding to these values is displayed in red in Figure 1. Since context vectors with far fewer dimensions are usually used, we repeated the experiment with context distributions over 1 000 random indices and obtained an R² value of 0.92 (a = −1.65, b = 0.99 and c = 0.61).
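A sketch of how such constants can be estimated, using a non-linear least-squares fit from SciPy (our illustration; the data below are synthetic placeholders, not the measurements of the paper):

    import numpy as np
    from scipy.optimize import curve_fit

    def jsd_expected(n, a, b, c):
        """Model of Eq. (3): expected JSD as a function of
        n = (n1/d) * (n2/d), the product of the relative numbers of
        non-zero values of the two context distributions."""
        return a * np.log(1.0 + b / n) + c

    # Synthetic placeholder data: one n value and one observed JSD per pair.
    rng = np.random.default_rng(0)
    n_values = rng.uniform(0.005, 0.5, size=500)
    observed_jsd = jsd_expected(n_values, -0.34, 0.032, 0.67) + rng.normal(0, 0.01, 500)

    (a, b, c), _ = curve_fit(jsd_expected, n_values, observed_jsd, p0=(-0.3, 0.03, 0.7))

    residuals = observed_jsd - jsd_expected(n_values, a, b, c)
    r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((observed_jsd - observed_jsd.mean()) ** 2)
    print(a, b, c, r_squared)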
3.3 Ranking word pairs
Most of the variance in the JSD of two context distributions can be explained by (3). We now expect that the remaining variance reflects the degree to which the words have a similar function or even meaning. To test this we define the (frequency-)normalized JSD as

    JSD_{norm}(p_1, p_2) = JSD(p_1, p_2) − JSD_{exp}(p_1, p_2)    (4)

Figure 2: ROC Curves for ranking of word pairs (849 synonym pairs, 8967 non-synonym pairs) using different similarity measures.

Ideally, all word pairs of synonyms will be ranked higher than the non-synonym pairs. We use the area under the ROC curve (AUC) to evaluate the ranking. We compare the ranking according to the normalized JSD with the rankings obtained from the JSD, the cosine similarity and the L1 norm, which is sometimes used in combination with random indexing. The L1 norm between two vectors v_1 and v_2 of dimensionality d is defined as \sum_{0 \le i < d} |v_1(i) − v_2(i)|.
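To make the evaluation step concrete, the following sketch (ours; it assumes scikit-learn's roc_auc_score is available and uses placeholder numbers rather than the paper's data) computes the normalized JSD of Eq. (4) for a few word pairs and the AUC of the resulting ranking. Since a smaller divergence means more similar words, the scores are negated before computing the AUC, so that synonym pairs should end up with the higher scores.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def normalized_jsd(jsd_value, n1, n2, d, a=-0.34, b=0.032, c=0.67):
        """Frequency-normalized JSD of Eq. (4): the observed JSD minus the
        JSD expected from the numbers of non-zero entries alone (Eq. (3))."""
        n = (n1 / d) * (n2 / d)
        return jsd_value - (a * np.log(1.0 + b / n) + c)

    # Placeholder data: one row per word pair (1 = synonyms, 0 = non-synonyms).
    labels = np.array([1, 1, 0, 0, 0])
    jsd_values = np.array([0.45, 0.50, 0.52, 0.60, 0.66])
    nonzero_counts = [(400, 500), (80, 120), (900, 1000), (60, 70), (30, 40)]

    scores = np.array([normalized_jsd(j, n1, n2, d=20_000)
                       for j, (n1, n2) in zip(jsd_values, nonzero_counts)])

    # Lower divergence = more similar, so rank by the negated score.
    print(roc_auc_score(labels, -scores))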