<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>April</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Distributional Similarity of Words with Different Frequencies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christian Wartena</string-name>
          <email>christian.wartena@hs-hannover.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Distributional Similarity, Synonymy</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hochschule Hannover, University of Applied Sciences and Arts</institution>
          <addr-line>Expo Plaza 12, 30539 Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <volume>26</volume>
      <issue>2013</issue>
      <abstract>
        <p>Distributional semantics tries to characterize the meaning of words by the contexts in which they occur. Similarity of words hence can be derived from the similarity of contexts. Contexts of a word are usually vectors of words appearing near to that word in a corpus. It was observed in previous research that similarity measures for the context vectors of two words depend on the frequency of these words. In the present paper we investigate this dependency in more detail for one similarity measure, the Jensen-Shannon divergence. We give an empirical model of this dependency and propose the deviation of the observed Jensen-Shannon divergence from the divergence expected on the basis of the frequencies of the words as an alternative similarity measure. We show that this new similarity measure is superior to both the Jensen-Shannon divergence and the cosine similarity in a task in which pairs of words, taken from WordNet, have to be classified as being synonyms or not.</p>
      </abstract>
      <kwd-group>
        <kwd>Distributional Similarity</kwd>
        <kwd>Synonymy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>Experimentation</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>
        For many applications dealing with texts it is useful or
necessary to know what words in a language are similar.
Similarity between words can be found in hand-crafted
resources, like WordNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], but methods to derive word
similarities from large text corpora are at least an interesting
alternative. Intuitively, words that occur in the same texts
or, more generally, the same contexts are similar. Thus we
could base a similarity measure on the number of times two
words occur in the same context, e.g. by representing words
in a document space. Especially if we consider small
contexts, like a window of a few words around a word, this
approach gives pairs of words that are in some dependence
relation to each other. De Saussure [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] calls such
relations, defined by co-presence in a linguistic structure (e.g.
a text, sentence, phrase, fixed window, words in a certain
grammatical relation to the studied word and so on),
syntagmatic relations. The type of similarity that is much closer
to synonymy and much more determined by the meaning of a
word is obtained by comparing the contexts in which a word
occurs. This type of similarity is usually called paradigmatic
similarity or distributional similarity.
      </p>
      <p>
        Though distributional similarity has been studied widely
and has established itself as a method for finding similar words, there
is no consensus on how the context of a word should be
defined or on the best way to compute the similarity
between contexts. In the most general definitions the context
of a word consists of words and their relation to the given
word (see e.g. [
        <xref ref-type="bibr" rid="ref2 ref6">6, 2</xref>
        ]). In the following we will only consider
the simplest case in which there is only one relation: the
relation of being in the same sentence. Now each word can
be represented by a context vector in a high dimensional
word space. Since these context vectors are very sparse,
often dimensionality reduction techniques are applied. In the
present paper we use random indexing, introduced by
Karlgren and Sahlgren [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Sahlgren [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to reduce the size of
the context vectors. For random indexing each word is
represented by a random index vector. The context vector of
a word is constructed by addition of the index vectors of all
words in the context. Thus the dimensionality of the
context vector is the same as the dimensionality chosen for the
index vectors. It was shown by Karlgren and Sahlgren [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
that this technique gives results that are comparable to those
obtained by dimensionality reduction techniques like
singular value decomposition, but requires less computational
resources. The similarity of the context vectors, finally, can
be used as a proxy for the similarity of words.
      </p>
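      <p>To make the construction concrete, the following sketch shows random indexing in a few lines of Python. It is only an illustration of the technique described above, not the implementation used in the paper; the function names are our own, and the defaults mirror the d = 20 000 dimensions and eight non-zero entries used later in the paper.</p>
      <preformat>
import numpy as np

def random_index_vector(d=20000, k=8, rng=None):
    """Sparse random index vector: k randomly chosen positions set to 1/k,
    so the vector sums to 1 and can be read as a probability distribution."""
    rng = rng or np.random.default_rng()
    v = np.zeros(d)
    v[rng.choice(d, size=k, replace=False)] = 1.0 / k
    return v

def context_vector(target, sentences, index_vectors):
    """Add up the index vectors of all words co-occurring with `target`
    in a sentence; the dimensionality stays at d, as described above."""
    ctx = np.zeros(len(next(iter(index_vectors.values()))))
    for sentence in sentences:
        if target in sentence:
            for w in sentence:
                if w != target:
                    ctx += index_vectors[w]
    return ctx
      </preformat>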
      <p>
        In order to evaluate the various methods to define
context vectors and the various similarity measures that can
be used subsequently, the computed similarity of
words is usually tested in a task in which words have to be classified
as being a synonym of a given word or not. Often the data
are taken from the synonym detection task from TOEFL
(Test of English as a Foreign Language) in which the
closest related word from a set of four words has to be chosen.
Gornerup and Karlgren [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] found that best results are
obtained using L1-norm or Jensen-Shannon divergence (JSD).
Curran and Moens [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] obtain best results using a
combination of the Jaccard coefficient and the t-test, while Van der
Plas and Bouma [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] report best results using a
combination of the Dice coefficient and pointwise mutual
information. Both Curran and Moens and Van der Plas and Bouma
use a number of different relations and need a similarity
measure that is able to assign different weights to the
relations. This makes their results less relevant for the present
paper. The differences between the latter two studies show
how strongly the results depend on the exact settings of the
experiment. Many authors, however, use cosine similarity
as a generally well established similarity measure for vectors
in high dimensional word spaces.
      </p>
      <p>
        Weeds et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] do not compare similarity measures to
hand-crafted data sets but study characteristic properties
of various measures. They find that, in a task where words
related to a given word have to be found, some similarity
measures tend to find words with a frequency similar to that of the
target word, while others favor highly frequent words. The
Jensen-Shannon divergence (JSD) is one of the measures
that tend to favor more general terms. In the following
we investigate this in more detail. We show that a
better similarity measure can be defined on the basis of the
JSD when we use our knowledge about the dependency of
the JSD on the frequency of the words. Finally, we show
that this new similarity measure outperforms the original
JSD and the cosine similarity in a task in which a large
number of word pairs have to be classified as synonyms or
non-synonyms.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. INFLUENCE OF WORD FREQUENCY</title>
      <p>
        As already mentioned above, Weeds et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] observed
that, in tasks in which related words have to be found, some
measures prefer words with a frequency similar to that of
the target word while others prefer highly frequent words,
regardless of the frequency of the target word. The JSD
belongs to the latter category. In Wartena et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] we also
made this observation. There we compared context
vectors of words with the word distribution of a document with
the goal of finding keywords for the document. In order
to compensate for the strong bias towards highly frequent words,
we introduced specificity as an explicit second condition for
finding keywords. As long as we try to find synonyms for a
given word, i.e. if we compare pairs of words in which one
component is fixed, as in the TOEFL tests, the problem
is usually tolerable. Moreover, the problem is not that
apparent if the range between the lowest and highest frequencies is
not too large, e.g. when only words with a certain minimal
frequency are considered and the size of the corpus gives a
low upper bound on the frequency. Length effects are
completely avoided if for every word the same number of contexts
is sampled, as is done e.g. by Giesbrecht [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As we will see
below, JSD becomes completely useless if we compare
arbitrary word pairs and do not impose any lower or upper bound
on the frequency of the words.
      </p>
      <p>The JSD between two probability distributions is defined
as the average of the relative entropy of each of the two
distributions with respect to their average distribution. It is interesting to note
that the JSD can be written as</p>
      <p>JSD(p, q) = (1/2) D(p || (p + q)/2) + (1/2) D(q || (p + q)/2)
= log 2 + (1/2) Σ_t [ p(t) log( p(t)/(p(t)+q(t)) ) + q(t) log( q(t)/(p(t)+q(t)) ) ]   (1)
This formulation of the JSD explicitly shows that the value
only depends on the words that have a non-zero value in both
context vectors. If there is no common word, the JSD is
maximal. Now suppose that all words are independent. If the
context vectors are based on only a few instances of a word, the
probability that a context word co-occurs with both words is
rather low. To be a bit more precise, if we have context
vectors v1 and v2 that are distributions over d elements, with n1
and n2 non-zero elements, then the probability that a given word
is non-zero in both distributions is, as a first approximation,
(n1/d)(n2/d). Even if the words are not independent, we might
expect a similar behavior: the probability that a word has
a non-zero value in two context vectors increases with the
number of contexts on which the vectors are based.</p>
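      <p>The following short sketch (our own illustration, not code from the paper) computes the JSD in exactly this form and makes the observation explicit: only positions where both distributions are non-zero contribute to the sum, and two distributions without a common non-zero entry reach the maximal value log 2.</p>
      <preformat>
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (natural logarithm), written as in (1):
    log 2 plus a sum over positions where both p and q are non-zero."""
    both = (p > 0) &amp; (q > 0)
    s = np.sum(p[both] * np.log(p[both] / (p[both] + q[both]))
               + q[both] * np.log(q[both] / (p[both] + q[both])))
    return np.log(2) + 0.5 * s

p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])
print(jsd(p, p))  # 0.0: identical distributions
print(jsd(p, q))  # 0.693... = log 2: no common non-zero entry
      </preformat>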
      <p>If we try to predict the JSD of the context vectors of two
words, we could base this prediction on the frequency of the
words. However, this dependency turns out to be very
complicated. Alternatively, we could base the prediction on
the entropy of the context vector (if we interpret the vector
as a probability distribution, as we have to do to compute
the JSD): if the entropy of both vectors is maximal, they
have to be identical and the JSD will be 0. If the entropy
of both vectors is minimal, the JSD of the two vectors is
most likely to be maximal. Since, in the case of independence of
all words, the context vectors will not converge to the uniform
distribution but to the background distribution, i.e. the word
distribution of the whole corpus, it is more natural to use the
relative entropy with respect to the background distribution. Preliminary
experiments have shown that this works, but that the JSD of
two context vectors can be better predicted by the number
of non-zero values in the vectors.</p>
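      <p>Both candidate predictors can be read off a context vector directly. The sketch below is our own illustration of the two quantities discussed here: the relative entropy of a context vector with respect to the background distribution, and the number of its non-zero values.</p>
      <preformat>
import numpy as np

def relative_entropy(p, background):
    """D(p || background); assumes the background is non-zero wherever p is."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / background[mask]))

def jsd_predictors(ctx, background):
    """The two predictors considered in the text: divergence from the
    corpus-wide word distribution and the number of non-zero entries."""
    return relative_entropy(ctx, background), np.count_nonzero(ctx)
      </preformat>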
      <p>Figure 1 shows the relation between the JSD of two
context vectors and the product of the numbers of non-zero
values in both distributions. The divergences in this figure are
computed for distributions over 20 000 random indices
computed on the 2.2 billion word ukWaC corpus for 9916 word
pairs. We found the same dependency for the L1 norm. In
contrast, for the cosine similarity we could not find any
dependency on the number of instances of the words or
the number of non-zero values in the context distributions.</p>
    </sec>
    <sec id="sec-4">
      <title>3. EXPERIMENTAL RESULTS</title>
      <p>To test our hypothesis that the divergence of two
context vectors depends on the number of instances on which
these vectors are based, we computed divergences for almost
10 000 word pairs on a very large corpus. Furthermore, we
show how the knowledge about this dependency can be used
to find a better measure to capture the semantic similarity
between two words.</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 Data</title>
      <p>
        As a corpus to compute the context distribution we use the
POS tagged and lemmatized version of the ukWaC Corpus
of approximately 2.2 billion words [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As the context of a
word we consider all lemmata of open class words (i.e. nouns,
adjectives, verbs, etc.) in the same sentence. We define a
sentence simply as a set of words. A corpus then is a set of
sentences. Let C be a corpus and w a word; then we define
Cw = {S ∈ C | w ∈ S}. Given a corpus C, the context
vector pw of a word w can be defined as
      </p>
      <p>pw = (1/|Cw|) Σ_{S ∈ Cw} (1/|S|) Σ_{v ∈ S} rv   (2)</p>
      <p>where rv is the random index vector of the word v. The
random index vector is defined as a probability distribution
over d elements, such that for some small set of random
numbers R = {r ∈ N | r &lt; d} we have rv(i) = 1/|R| if i ∈ R
and rv(i) = 0 otherwise. In the following we will
use distributions with d = 20 000 and |R| = 8 unless stated
otherwise. Note that we will always use probability distributions,
but stick to the usual terminology of (context) vectors.</p>
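      <p>A direct transcription of (2) into code may make the construction easier to follow. The sketch below is our own; it assumes the corpus is given as an iterable of sentences, each a set of lemmata, and that every open class word already has a random index vector as defined above.</p>
      <preformat>
import numpy as np

def context_distribution(word, corpus, index_vectors, d=20000):
    """Context distribution p_w of equation (2): the average, over all
    sentences containing `word` (the set C_w), of the averaged index
    vectors of the words in each sentence."""
    p_w = np.zeros(d)
    sentences_with_word = [s for s in corpus if word in s]      # C_w
    for s in sentences_with_word:
        p_w += sum(index_vectors[v] for v in s) / len(s)        # (1/|S|) sum of r_v
    return p_w / len(sentences_with_word)                       # factor 1/|C_w|
      </preformat>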
      <p>
        For the evaluation of the similarity measures we selected
pairs of words from Wordnet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We started with a list of
pairs (w1, w2) such that (1) w1 and w2 are single words, (2)
w1 occurs at least two times in the British National Corpus
and (3) w1 and w2 share at least one sense. This resulted
in a list of 24 576 word pairs. From this list we selected all
pairs for which the Jaccard coefficient of the sets of senses
of the words is at least 0.7. After filtering out all pairs
containing a word that was not found in the ukWaC corpus,
a list of 849 pairs remained. These word pairs are considered
as synonyms in the following. Next, from the list of 24 576
word pairs, the second components were reordered randomly.
The resulting list of new word pairs was filtered such that
the two words of each pair both occur in the ukWaC corpus
and have no common sense. This resulted in a list of 8967
word pairs.1
      </p>
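      <p>The selection criterion can be illustrated with WordNet as exposed by NLTK. The snippet below is a simplified sketch of the Jaccard test on sense sets; the frequency filters and the random re-pairing described above are omitted, and the function names are our own.</p>
      <preformat>
from nltk.corpus import wordnet as wn

def sense_jaccard(w1, w2):
    """Jaccard coefficient of the WordNet sense (synset) sets of two words."""
    s1, s2 = set(wn.synsets(w1)), set(wn.synsets(w2))
    return len(s1 &amp; s2) / len(s1 | s2) if s1 | s2 else 0.0

def is_synonym_pair(w1, w2, threshold=0.7):
    """Treat a pair as synonyms when the words share most of their senses,
    mirroring the Jaccard >= 0.7 criterion used above."""
    return sense_jaccard(w1, w2) >= threshold

print(sense_jaccard('car', 'automobile'))  # overlap of the two sense sets
      </preformat>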
      <p>As a consequence of the requirement of the overlap of
WordNet senses, most words in the synonym list have very
few senses and are very infrequent words. Thus the average
frequency in ukWaC of the synonyms is much lower than
that of the words in the non-synonym list. The most
frequent word (use) was found 4.57 million times in the ukWaC
corpus; 117 words were found only once (e.g. somersaulting,
sakartvelo).</p>
      <p>1 The lists of word pairs are available at http://nbn-resolving.de/urn:nbn:de:bsz:960-opus-4077.</p>
      <p>with n = (n1/d)(n2/d), where n1 and n2 are the numbers of non-zero
values of the two context vectors, respectively. Optimal values for
a, b and c were found by maximizing the coefficient of
determination, R2, on all non-synonym word pairs. We left
out the synonyms, since we try to model the similarity that
is caused just by the probability of random words occurring
in both contexts with an increasing number of observations.
With a = 0.34, b = 0.032 and c = 0.67 an R2 score of
0.95 is reached (0.93 for the same constants when synonyms
are included). The curve corresponding to these values is
displayed in red in Figure 1. Since usually context vectors
with far fewer dimensions are used, we repeated the
experiment with context distributions over 1 000 random indices
and obtained an R2 value of 0.92 (a = 1.65, b = 0.99 and
c = 0.61).</p>
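      <p>The fitting step itself can be sketched as an ordinary curve fit. Since equation (3) is not reproduced in this text, the functional form below is only a placeholder with the same role (three constants a, b, c and the predictor n = (n1/d)(n2/d)); the actual model and the reported constants come from the paper, not from this sketch.</p>
      <preformat>
import numpy as np
from scipy.optimize import curve_fit

def jsd_model(n, a, b, c):
    """Placeholder three-parameter model for the expected JSD as a function
    of n = (n1/d)(n2/d); the exact form of equation (3) is not reproduced here."""
    return a * np.exp(-b * n) + c

def fit_expected_jsd(n_values, jsd_values):
    """Fit a, b, c on the non-synonym pairs and report the coefficient of
    determination R^2 of the fit."""
    (a, b, c), _ = curve_fit(jsd_model, n_values, jsd_values)
    residuals = jsd_values - jsd_model(n_values, a, b, c)
    r2 = 1 - np.sum(residuals**2) / np.sum((jsd_values - np.mean(jsd_values))**2)
    return (a, b, c), r2
      </preformat>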
    </sec>
    <sec id="sec-6">
      <title>3.3 Ranking word pairs</title>
      <p>Most of the variance in the JSD of two context
distributions can be explained by (3). Now we expect that the
remaining variance reflects the degree to which the words
have a similar function or even meaning. To test this we
define the (frequency) normalized JSD as</p>
      <p>JSDnorm(p1, p2) = JSD(p1, p2) / JSDexp(p1, p2)   (4)</p>
      <p>Ideally, all word pairs of synonyms will be ranked higher
than the non-synonym pairs. We use the area under the
ROC curve (AUC) to evaluate the ranking. We compare the
ranking according to the normalized JSD with the rankings
from the JSD, the cosine similarity and the L1 norm that is
used sometimes in combination with random indexing. The
L1 norm between two vectors v1 and v2 of dimensionality d is
defined as Σ_{0 ≤ i &lt; d} |v1(i) - v2(i)|. The ROC curves are given
in Figure 2 when using context vectors with 20 000
dimensions. The AUC-values are summarized in Table 1, both for
the experiment using context distributions over 20 000 and
1 000 random indices.</p>
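      <p>The evaluation itself is a standard ROC analysis. The sketch below is our own glue code, reusing the jsd function sketched earlier and taking the expected JSD from the fitted model as an input value; divergences are distances, so their negation is used as the ranking score.</p>
      <preformat>
import numpy as np
from sklearn.metrics import roc_auc_score

def l1_distance(p, q):
    """L1 norm between two vectors, as defined above."""
    return np.abs(p - q).sum()

def cosine_similarity(p, q):
    """Cosine of the angle between two context vectors."""
    return p @ q / (np.linalg.norm(p) * np.linalg.norm(q))

def jsd_normalized(p, q, expected_jsd):
    """Frequency normalized JSD of equation (4): observed JSD (function `jsd`
    from the earlier sketch) divided by the expected JSD."""
    return jsd(p, q) / expected_jsd

# labels: 1 for synonym pairs, 0 for non-synonyms. Divergences are negated so
# that higher scores mean "more similar".
# auc_norm = roc_auc_score(labels, [-jsd_normalized(p, q, e) for p, q, e in data])
# auc_cos  = roc_auc_score(labels, [cosine_similarity(p, q) for p, q, e in data])
      </preformat>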
      <p>We see that the JSD gives a ranking worse than a random
ranking. The remarkable observation is the large difference
between the AUC values, since we are comparing exactly the
same context distributions, and thus use exactly the same
information. A further observation is the strange behavior of
the cosine similarity. For pairs of words for which less than
a dozen instances were found, the cosine similarity seems
to give almost random results. Thus some positive pairs are
ranked very low, explaining the rise of the ROC curve at the
right end. The results of the L1 norm are almost the same
as those of the JSD, which is not surprising as we also found
a linear correspondence between JSD and the L1 norm.</p>
      <p>
        Finally, it should be noted that we did not try to find
the best possible ranking. If we included frequency
information (two very frequent words are unlikely to be
synonyms) or the Levenshtein distance (there are many spelling
variants included in the list of synonyms) we could easily
obtain a better ranking. The goal of the experiment, however,
was the evaluation of distance measures for random indexing.
The classification is only a means to assess the quality of the
distance measure. In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] we also investigate the possibility
of combining various distance measures and other features to
obtain an optimal ranking.
      </p>
    </sec>
    <sec id="sec-7">
      <title>4. DISCUSSION AND CONCLUSIONS</title>
      <p>We have clearly found a very strong dependency between
the number of non-zero values in random context vectors
and the JSD between the vectors. When we use data with
an extremely large range in frequencies this leads to JSD
values that are useless for ranking word pairs according to
their similarity. Note that we included words with
frequencies ranging from 1 to 4.57 million. We used the known
dependency between the number of non-zero values in the
distributions and the JSD to define a new similarity
measure, the frequency normalized JSD. This measure clearly
outperforms the cosine similarity in the ranking experiment.</p>
      <p>Though this result is convincing, we lack a
theoretical basis from which a formula like (3) can be derived. Also,
it would be preferable if the constants could be estimated
directly from the size of the corpus, the number of
dimensions, etc. At present, only one of the three constants can easily be
explained, namely as the maximum JSD. Alternatively,
smoothing of the context distributions might also be a solution to
make the JSD more useful. The smoothing should then account
for the similarities that stem from random words appearing
in both contexts. In general, the results show that the choice
of the right similarity measure to be used for distributional
similarity is not a solved question and more research in this
area is needed.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Baroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bernardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferraresi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Zanchetta</surname>
          </string-name>
          <article-title>The wacky wide web: A collection of very large linguistically processed web-crawled corpora</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>43</volume>
          (
          <issue>3</issue>
          ):
          <fpage>209</fpage>
          -
          <lpage>226</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Curran</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Moens</surname>
          </string-name>
          .
          <article-title>Improvements in automatic thesaurus extraction</article-title>
          .
          <source>In Unsupervised Lexical Acquisition: Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX)</source>
          ., pages
          <volume>59</volume>
          -
          <fpage>66</fpage>
          . Association for Computational Linguistics,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>de Saussure</surname>
          </string-name>
          . Cours de linguistique générale. C. Bally and A. Sechehaye (eds.), Paris/Lausanne,
          <year>1916</year>
          . English translation: Course in General Linguistics. London: Peter Owen,
          <year>1960</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Giesbrecht</surname>
          </string-name>
          .
          <article-title>Towards a matrix-based distributional model of meaning</article-title>
          .
          <source>In Proceedings of the NAACL HLT 2010 Student Research Workshop</source>
          , pages
          <volume>23</volume>
          -
          <fpage>28</fpage>
          , Los Angeles, California,
          <year>2010</year>
          . ACL.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Görnerup</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          .
          <article-title>Cross-lingual comparison between distributionally determined word similarity networks</article-title>
          .
          <source>In Proceedings of the 2010 Workshop on Graph-based Methods for Natural Language Processing</source>
          , pages
          <volume>48</volume>
          -
          <fpage>54</fpage>
          .
          <string-name>
            <surname>ACL</surname>
          </string-name>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Grefenstette</surname>
          </string-name>
          .
          <article-title>Use of syntactic context to produce term association lists for text retrieval</article-title>
          .
          <source>In SIGIR '92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <volume>89</volume>
          -
          <fpage>97</fpage>
          . ACM,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <article-title>From words to understanding</article-title>
          .
          <source>In Foundations of Real-World Intelligence</source>
          , pages
          <fpage>294</fpage>
          -
          <fpage>308</fpage>
          . CSLI Publications, Stanford, California,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          .
          <article-title>WordNet: A lexical database for English</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>38</volume>
          (
          <issue>11</issue>
          ):
          <volume>39</volume>
          -
          <fpage>41</fpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <article-title>An introduction to random indexing</article-title>
          .
          <source>In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering</source>
          , TKE, volume
          <volume>5</volume>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L. Van Der</given-names>
            <surname>Plas</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Bouma</surname>
          </string-name>
          .
          <article-title>Syntactic contexts for finding semantically related words</article-title>
          .
          <source>In Proceedings of Computational Linguistics in the Netherlands</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wartena</surname>
          </string-name>
          . HsH:
          <article-title>Estimating semantic similarity of words and short phrases with frequency normalized distance measures</article-title>
          .
          <source>In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval</source>
          <year>2013</year>
          ),
          <year>2013</year>
          . to appear.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wartena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brussee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Slakhorst</surname>
          </string-name>
          .
          <article-title>Keyword extraction using word co-occurrence</article-title>
          .
          <source>In Database and Expert Systems Applications (DEXA)</source>
          ,
          <source>2010 Workshop on</source>
          , pages
          <volume>54</volume>
          -
          <fpage>58</fpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Weeds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Weir</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>McCarthy</surname>
          </string-name>
          .
          <article-title>Characterising measures of lexical distributional similarity</article-title>
          .
          <source>In COLING 2004, Proceedings of the 20th International Conference on Computational Linguistics</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>