               Distributional Similarity of Words with Different
                                 Frequencies

                                                              Christian Wartena
                                  Hochschule Hannover, University of Applied Sciences and Arts
                                                       Expo Plaza 12
                                                 30359 Hannover, Germany
                                                christian.wartena@hs-hannover.de

ABSTRACT
Distributional semantics tries to characterize the meaning of words by the contexts in which they occur. Similarity of words can hence be derived from the similarity of contexts. Contexts of a word are usually vectors of words appearing near that word in a corpus. It was observed in previous research that similarity measures for the context vectors of two words depend on the frequency of these words. In the present paper we investigate this dependency in more detail for one similarity measure, the Jensen-Shannon divergence. We give an empirical model of this dependency and propose the deviation of the observed Jensen-Shannon divergence from the divergence expected on the basis of the frequencies of the words as an alternative similarity measure. We show that this new similarity measure is superior to both the Jensen-Shannon divergence and the cosine similarity in a task in which pairs of words, taken from WordNet, have to be classified as being synonyms or not.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing Methods, Linguistic Processing; G.3 [Probability and Statistics]: Correlation and regression analysis; I.2.7 [Artificial Intelligence]: Natural Language Processing—Language Models, Text Analysis

General Terms
Experimentation

Keywords
Distributional Similarity, Synonymy

DIR 2013, April 26, 2013, Delft, The Netherlands.
Copyright remains with the authors and/or original copyright holders.

1. INTRODUCTION
For many applications dealing with texts it is useful or necessary to know which words in a language are similar. Similarity between words can be found in hand-crafted resources, like WordNet [8], but methods that derive word similarities from large text corpora are at least an interesting alternative. Intuitively, words that occur in the same texts or, more generally, in the same contexts are similar. Thus we could base a similarity measure on the number of times two words occur in the same context, e.g. by representing words in a document space. Especially if we consider small contexts, like a window of a few words around a word, this approach yields pairs of words that are in some dependence relation to each other. De Saussure [3] calls such relations, defined by co-presence in a linguistic structure (e.g. a text, sentence, phrase, fixed window, words in a certain grammatical relation to the studied word, and so on), syntagmatic relations. The type of similarity that is much closer to synonymy and much more determined by the meaning of a word is obtained by comparing the contexts in which a word occurs. This type of similarity is usually called paradigmatic similarity or distributional similarity.

Though distributional similarity has been studied widely and is established as a method to find similar words, there is no consensus on how the context of a word has to be defined or on the best way to compute the similarity between contexts. In the most general definitions the context of a word consists of words and their relations to the given word (see e.g. [6, 2]). In the following we will only consider the simplest case, in which there is only one relation: the relation of being in the same sentence. Now each word can be represented by a context vector in a high-dimensional word space. Since these context vectors are very sparse, dimensionality reduction techniques are often applied. In the present paper we use random indexing, introduced by Karlgren and Sahlgren [7] and Sahlgren [9], to reduce the size of the context vectors. For random indexing each word is represented by a random index vector. The context vector of a word is constructed by adding the index vectors of all words in its contexts. Thus the dimensionality of the context vector is the same as the dimensionality chosen for the index vectors. It was shown by Karlgren and Sahlgren [7] that this technique gives results comparable to those obtained by dimensionality reduction techniques like singular value decomposition, but requires fewer computational resources. The similarity of the context vectors, finally, can be used as a proxy for the similarity of words.

In order to evaluate the various methods to define context vectors and the various similarity measures that can subsequently be used, the computed similarity of words is usually tested in a task in which words have to be classified as being synonymous or not with a given word. Often the data are taken from the synonym detection task of TOEFL (Test of English as a Foreign Language), in which the most closely related word from a set of four words has to be chosen. Görnerup and Karlgren [5] found that the best results are obtained using the L1 norm or the Jensen-Shannon divergence (JSD).
Curran and Moens [2] obtain the best results using a combination of the Jaccard coefficient and the t-test, while Van der Plas and Bouma [10] report the best results using a combination of the Dice coefficient and pointwise mutual information. Both Curran and Moens and Van der Plas and Bouma use a number of different relations and need a similarity measure that is able to assign different weights to the relations. This makes their results less relevant for the present paper. The differences between the latter two studies show how strongly the results depend on the exact settings of the experiment. Many authors, however, use cosine similarity as a generally well-established similarity measure for vectors in high-dimensional word spaces.

Weeds et al. [13] do not compare similarity measures against hand-crafted data sets but study characteristic properties of various measures. They find that, in a task where words related to a given word have to be found, some similarity measures tend to find words with a frequency similar to that of the target word, while others favor highly frequent words. The Jensen-Shannon divergence (JSD) is one of the measures that tend to favor more general terms. In the following we will investigate this in more detail. We show that a better similarity measure can be defined on the basis of the JSD when we use our knowledge about the dependency of the JSD on the frequency of the words. Finally, we show that this new similarity measure outperforms the original JSD and the cosine similarity in a task in which a large number of word pairs have to be classified as synonyms or non-synonyms.

2. INFLUENCE OF WORD FREQUENCY
As already mentioned above, Weeds et al. [13] observed that, in tasks in which related words have to be found, some measures prefer words with a frequency similar to that of the target word, while others prefer highly frequent words, regardless of the frequency of the target word. The JSD belongs to the latter category. In Wartena et al. [12] we made the same observation. There we compared context vectors of words with the word distribution of a document with the goal of finding keywords for the document. In order to compensate for the strong bias towards highly frequent words, we introduced specificity as an explicit second condition for finding keywords. As long as we try to find synonyms for a given word, i.e. if we compare pairs of words in which one component is fixed, as in the TOEFL tests, the problem is usually tolerable. Moreover, the problem is not that apparent if the range between the lowest and highest frequencies is not too large, e.g. when only words with a certain minimal frequency are considered and the size of the corpus gives a low upper bound on the frequency. Length effects are completely avoided if for every word the same number of contexts is sampled, as is done e.g. by Giesbrecht [4]. As we will see below, the JSD becomes completely useless if we compare arbitrary word pairs and do not pose any lower or upper bound on the frequency of the words.

The JSD between two probability distributions is defined as the average of the relative entropy of each of the distributions to their average distribution. It is interesting to note that the JSD can be written as

    JSD(p, q) = \tfrac{1}{2} D(p \,\|\, \tfrac{1}{2}p + \tfrac{1}{2}q) + \tfrac{1}{2} D(q \,\|\, \tfrac{1}{2}p + \tfrac{1}{2}q)
              = \log 2 + \tfrac{1}{2} \sum_{t:\, p(t) \neq 0 \,\wedge\, q(t) \neq 0} \Big( p(t) \log\frac{p(t)}{p(t)+q(t)} + q(t) \log\frac{q(t)}{p(t)+q(t)} \Big)    (1)

This formulation of the JSD explicitly shows that its value only depends on the words that have a non-zero value in both context vectors. If there is no common word, the JSD is maximal. Now suppose that all words are independent. If the context vectors are based on only a few instances of a word, the probability that a context word co-occurs with both words is rather low. To be a bit more precise, if we have context vectors v1 and v2 that are distributions over d elements, with n1 and n2 non-zero elements, then the probability that a given word is non-zero in both distributions is, as a first approximation, (n1/d) · (n2/d). Even if the words are not independent, we may expect a similar behavior: the probability that a word has a non-zero value in two context vectors increases with the number of contexts on which the vectors are based.
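As an illustration of Eq. (1), the following short Python sketch (ours, not part of the original paper; the function and variable names are assumptions) computes the JSD of two sparse context distributions, summing only over the indices that are non-zero in both of them.

    import math

    def jsd(p, q):
        """Jensen-Shannon divergence of two distributions given as sparse
        dicts mapping index -> probability (the values of each dict sum to 1).
        As in Eq. (1), only indices that are non-zero in both p and q
        contribute to the sum; the natural logarithm is used."""
        total = 0.0
        for t in set(p) & set(q):
            total += p[t] * math.log(p[t] / (p[t] + q[t]))
            total += q[t] * math.log(q[t] / (p[t] + q[t]))
        return math.log(2) + 0.5 * total

    # Toy example with three small distributions.
    p = {0: 0.5, 1: 0.5}
    q = {1: 0.5, 2: 0.25, 3: 0.25}
    print(jsd(p, q))          # between 0 and log 2
    print(jsd(p, p))          # 0.0 for identical distributions
    print(jsd(p, {2: 1.0}))   # log 2: no common non-zero index, maximal JSD

With sparse vectors the sum runs only over the (typically few) common non-zero indices, which is why the JSD of two rarely observed words is very likely to stay close to its maximum.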
If we try to predict the JSD of the context vectors of two words, we could base this prediction on the frequency of the words. However, it turns out that this is a very complicated dependency. Alternatively, we could base the prediction on the entropy of the context vector (if we interpret the vector as a probability distribution, as we have to do to compute the JSD): if the entropy of both vectors is maximal, they have to be identical and the JSD will be 0. If the entropy of both vectors is minimal, the JSD of the two vectors is most likely to be maximal. Since, in case of independence of all words, the context vectors will not converge to the uniform distribution but to the background distribution, i.e. the word distribution of the whole corpus, it is more natural to use the relative entropy with respect to the background distribution. Preliminary experiments have shown that this works, but that the JSD of two context vectors can be predicted better from the number of non-zero values in the vectors.

Figure 1 shows the relation between the JSD of two context vectors and the product of the numbers of non-zero values in both distributions. The divergences in this figure are computed for distributions over 20 000 random indices, computed on the 2.2 billion word ukWaC corpus, for 9916 word pairs. We found the same dependency for the L1 norm. In contrast, for the cosine similarity we could not find any dependency on the number of instances of the words or on the number of non-zero values in the context distributions.

Figure 1: Divergence vs. product of relative number of non-zero values for pairs of context vectors and a function modeling the dependency.

3. EXPERIMENTAL RESULTS
To test our hypothesis that the divergence of two context vectors depends on the number of instances on which these vectors are based, we computed divergences for almost 10 000 word pairs on a very large corpus. Furthermore, we show how the knowledge about this dependency can be used to find a better measure to capture the semantic similarity between two words.

3.1 Data
As a corpus to compute the context distributions we use the POS-tagged and lemmatized version of the ukWaC corpus of approximately 2.2 billion words [1]. As the context of a
word we consider all lemmata of open-class words (i.e. nouns, adjectives, verbs, etc.) in the same sentence. We define a sentence simply as a set of words; a corpus then is a set of sentences. Let C be a corpus and w a word; then we define C_w = {S ∈ C | w ∈ S}. Given a corpus C, the context vector p_w of a word w can be defined as

    p_w = \frac{1}{|C_w|} \sum_{S \in C_w} \frac{1}{|S|} \sum_{v \in S} r_v    (2)

where r_v is the random index vector of the word v. The random index vector is defined as a probability distribution over d elements such that, for some small set of random numbers R ⊂ {r ∈ N | r < d}, we have r_v(i) = 1/|R| if i ∈ R and r_v(i) = 0 otherwise. In the following we will use distributions with d = 20 000 and |R| = 8 unless stated otherwise. Note that we will always use probability distributions, but stick to the usual terminology of (context) vectors.
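As an illustration of Eq. (2), the following Python sketch (our own illustration, not code from the paper; the corpus format and all names are assumptions) builds context vectors by random indexing: every word receives a fixed random index distribution with |R| non-zero entries of value 1/|R|, and the context vector of a word is the average, over the sentences containing it, of the averaged index vectors of the words in those sentences.

    import random
    from collections import defaultdict

    D = 20_000   # dimensionality d of the index vectors
    R_SIZE = 8   # |R|, the number of non-zero entries per index vector

    _index_vectors = {}

    def index_vector(word):
        """Random index vector r_v: a distribution over D positions with
        R_SIZE randomly chosen entries of value 1/R_SIZE, fixed per word."""
        if word not in _index_vectors:
            positions = random.sample(range(D), R_SIZE)
            _index_vectors[word] = {i: 1.0 / R_SIZE for i in positions}
        return _index_vectors[word]

    def context_vector(word, corpus):
        """Context vector p_w of Eq. (2); the corpus is a list of sentences,
        each sentence a set of (lemmatized, open-class) words."""
        containing = [s for s in corpus if word in s]
        p_w = defaultdict(float)
        for sentence in containing:
            for v in sentence:
                for i, value in index_vector(v).items():
                    p_w[i] += value / (len(sentence) * len(containing))
        return dict(p_w)

    # Toy corpus of three "sentences".
    corpus = [{"cat", "sit", "mat"}, {"dog", "sit", "rug"}, {"cat", "chase", "dog"}]
    p_cat = context_vector("cat", corpus)
    print(sum(p_cat.values()))   # ~1.0: the context vector is again a distribution

The number of non-zero entries of such a vector grows with the number of sentences a word occurs in, which is exactly the dependency studied in the following subsections.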
For the evaluation of the similarity measures we selected pairs of words from WordNet [8]. We started with a list of pairs (w1, w2) such that (1) w1 and w2 are single words, (2) w1 occurs at least two times in the British National Corpus, and (3) w1 and w2 share at least one sense. This resulted in a list of 24 576 word pairs. From this list we selected all pairs for which the Jaccard coefficient of the sets of senses of the two words is at least 0.7. After filtering out all pairs containing a word that was not found in the ukWaC corpus, a list of 849 pairs remained. These word pairs are considered as synonyms in the following. Next, the second components of the 24 576 word pairs were reordered randomly. The resulting list of new word pairs was filtered such that the two words of each pair both occur in the ukWaC corpus and have no common sense. This resulted in a list of 8967 word pairs.
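The sense-overlap filter can be sketched as follows. This is a minimal illustration assuming NLTK's WordNet interface and data are available; the helper names, the candidate list and the vocabulary set are ours, and the paper's remaining selection criteria (single words, BNC frequency) are omitted here.

    from nltk.corpus import wordnet as wn

    def sense_jaccard(w1, w2):
        """Jaccard coefficient of the sets of WordNet senses of two words."""
        s1, s2 = set(wn.synsets(w1)), set(wn.synsets(w2))
        if not s1 and not s2:
            return 0.0
        return len(s1 & s2) / len(s1 | s2)

    def select_synonym_pairs(candidate_pairs, corpus_vocabulary, threshold=0.7):
        """Keep pairs whose sense sets overlap strongly and whose words
        both occur in the corpus; these pairs are treated as synonyms."""
        return [(w1, w2) for w1, w2 in candidate_pairs
                if sense_jaccard(w1, w2) >= threshold
                and w1 in corpus_vocabulary and w2 in corpus_vocabulary]

    print(sense_jaccard("car", "automobile"))   # > 0: the pair shares a sense
    print(sense_jaccard("car", "banana"))       # 0.0: no shared sense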
As a consequence of the requirement that the WordNet senses overlap, most words in the synonym list have very few senses and are rather infrequent. Thus the average frequency in ukWaC of the synonyms is much lower than that of the words in the non-synonym list. The most frequent word (use) was found 4.57 million times in the ukWaC corpus; 117 words were found only once (e.g. somersaulting, sakartvelo).

3.2 Predicting JSD of context vectors
Figure 1 shows that there is a clear dependency between the JSD of a word pair and the product of the relative numbers of non-zero values in the context distributions. This dependency can be captured by the following equation:

    JSD_{exp}(p_1, p_2) = a \log\Big(1 + \frac{b}{n}\Big) + c    (3)

with n = (n_1/d) · (n_2/d), where n_1 and n_2 are the numbers of non-zero values of p_1 and p_2, respectively. Optimal values for a, b and c were found by maximizing the coefficient of determination, R², on all non-synonym word pairs. We left out the synonyms, since we try to model only the similarity that is caused merely by the probability of random words occurring in both contexts, which grows with the number of observations. With a = −0.34, b = 0.032 and c = 0.67 an R² score of 0.95 is reached (0.93 for the same constants when the synonyms are included). The curve corresponding to these values is displayed in red in Figure 1. Since context vectors with far fewer dimensions are usually used, we repeated the experiment with context distributions over 1 000 random indices and obtained an R² value of 0.92 (a = −1.65, b = 0.99 and c = 0.61).
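A sketch of how such constants can be estimated, using a non-linear least-squares fit from SciPy (our illustration; the data below are synthetic placeholders, not the measurements of the paper):

    import numpy as np
    from scipy.optimize import curve_fit

    def jsd_expected(n, a, b, c):
        """Model of Eq. (3): expected JSD as a function of
        n = (n1/d) * (n2/d), the product of the relative numbers of
        non-zero values of the two context distributions."""
        return a * np.log(1.0 + b / n) + c

    # Synthetic placeholder data: one n value and one observed JSD per pair.
    rng = np.random.default_rng(0)
    n_values = rng.uniform(0.005, 0.5, size=500)
    observed_jsd = jsd_expected(n_values, -0.34, 0.032, 0.67) + rng.normal(0, 0.01, 500)

    (a, b, c), _ = curve_fit(jsd_expected, n_values, observed_jsd, p0=(-0.3, 0.03, 0.7))

    residuals = observed_jsd - jsd_expected(n_values, a, b, c)
    r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((observed_jsd - observed_jsd.mean()) ** 2)
    print(a, b, c, r_squared)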
3.3 Ranking word pairs
Most of the variance in the JSD of two context distributions can be explained by (3). We now expect that the remaining variance reflects the degree to which the words have a similar function or even meaning. To test this we define the (frequency-)normalized JSD as

    JSD_{norm}(p_1, p_2) = JSD(p_1, p_2) − JSD_{exp}(p_1, p_2)    (4)

Figure 2: ROC Curves for ranking of word pairs (849 synonym pairs, 8967 non-synonym pairs) using different similarity measures.

Ideally, all word pairs of synonyms will be ranked higher than the non-synonym pairs. We use the area under the ROC curve (AUC) to evaluate the ranking. We compare the ranking according to the normalized JSD with the rankings obtained from the JSD, the cosine similarity and the L1 norm, which is sometimes used in combination with random indexing. The L1 norm between two vectors v_1 and v_2 of dimensionality d is defined as \sum_{0 \le i < d} |v_1(i) − v_2(i)|.
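To make the evaluation step concrete, the following sketch (ours; it assumes scikit-learn's roc_auc_score is available and uses placeholder numbers rather than the paper's data) computes the normalized JSD of Eq. (4) for a few word pairs and the AUC of the resulting ranking. Since a smaller divergence means more similar words, the scores are negated before computing the AUC, so that synonym pairs should end up with the higher scores.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def normalized_jsd(jsd_value, n1, n2, d, a=-0.34, b=0.032, c=0.67):
        """Frequency-normalized JSD of Eq. (4): the observed JSD minus the
        JSD expected from the numbers of non-zero entries alone (Eq. (3))."""
        n = (n1 / d) * (n2 / d)
        return jsd_value - (a * np.log(1.0 + b / n) + c)

    # Placeholder data: one row per word pair (1 = synonyms, 0 = non-synonyms).
    labels = np.array([1, 1, 0, 0, 0])
    jsd_values = np.array([0.45, 0.50, 0.52, 0.60, 0.66])
    nonzero_counts = [(400, 500), (80, 120), (900, 1000), (60, 70), (30, 40)]

    scores = np.array([normalized_jsd(j, n1, n2, d=20_000)
                       for j, (n1, n2) in zip(jsd_values, nonzero_counts)])

    # Lower divergence = more similar, so rank by the negated score.
    print(roc_auc_score(labels, -scores))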