<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neural Networks with Attention for Word Sense Induction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleg Struyanskiy</string-name>
          <email>oleg.fox@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolay Arefyev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Moscow State University</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Samsung Moscow Research Center</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Attentional neural networks have achieved remarkable results for a number of tasks in the past few years. The success of neural networks with an attention mechanism in natural language processing, especially in machine translation, suggests that these models can capture the meaning of ambiguous words by considering their context. In this paper we introduce a new method for constructing vectors of occurrences of ambiguous words for word sense induction, based on the recently introduced Transformer model, which achieved state-of-the-art results in machine translation. Similar to the CBOW model for constructing word embeddings, we train the Transformer to predict a word from its context and use its trained parameters for word sense induction. On some datasets the proposed method outperforms the simple but hard-to-beat baseline, which was among the best three methods in the recent shared task on word sense induction for the Russian language, RUSSE-WSI 2018. On one of the datasets our method beats the top result from the competition. Furthermore, we explore how different methods of weighting word embeddings affect performance in word sense induction. Together with weighted sums of word2vec vectors, we explore the performance of vectors from the Transformer's hidden layers and introduce a combined approach that improves previous results.</p>
      </abstract>
      <kwd-group>
        <kwd>word sense induction</kwd>
        <kwd>attention</kwd>
        <kwd>Transformer</kwd>
        <kwd>neural network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Word sense induction is the problem of clustering contexts, i.e. short texts containing a
polysemous word, into clusters depending on the sense of that word. The recent
competition on word sense induction for the Russian language, RUSSE-WSI [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], has shown
that existing approaches work for homonyms but fail for complex polysemous words.
We follow the unsupervised approach to word sense induction, starting with building
vector representations of contexts and then running a clustering algorithm to
distinguish contexts that contain the ambiguous word in different senses. One way
to construct a vector representation of a context is to compute a weighted sum of word
embeddings for the context. The main question in this approach is how to determine
the weights. One simple yet effective method is to use weights based on word
frequencies. For example, the model proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which we use as a baseline, employs
tf-idf weights. However, models based on word frequencies do not take into account
complex linguistic relationships and can assign large weights to relatively unimportant
words. In this paper we propose a more sophisticated approach to determining the
weights, using one of the newest neural network models.
      </p>
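      <p>As a concrete illustration of this weighted-sum baseline (a minimal sketch, not the exact implementation of the baseline method), a context vector can be built from tf-idf-weighted word2vec embeddings roughly as follows; the model file name and the all_contexts variable are placeholders.</p>
      <preformat>
import numpy as np
from collections import Counter
from gensim.models import KeyedVectors
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder resources: a pretrained word2vec model and a tf-idf model
# fitted on the collection of contexts (all_contexts is a list of raw strings).
w2v = KeyedVectors.load_word2vec_format("word2vec_ru.bin", binary=True)
tfidf = TfidfVectorizer().fit(all_contexts)

def tfidf_context_vector(context_tokens):
    """Weighted sum of word2vec embeddings with tf-idf weights for one context."""
    counts = Counter(context_tokens)
    vec = np.zeros(w2v.vector_size)
    for w, tf in counts.items():
        if w in w2v and w in tfidf.vocabulary_:
            weight = (tf / len(context_tokens)) * tfidf.idf_[tfidf.vocabulary_[w]]
            vec += weight * w2v[w]
    return vec
      </preformat>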
    </sec>
    <sec id="sec-2">
      <title>Model</title>
      <p>
        Transformer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a recently proposed sequence transduction model which shows
state-of-the-art results for several tasks including machine translation, text
summarization, etc. Similar to the previously best methods for machine translation
(sequence-to-sequence models with attention), the Transformer consists of an encoder and a decoder. The
main novelty of the Transformer is that it does not use recurrent neural networks; its
encoder and decoder are based on a combination of feed-forward networks and an attention
mechanism, which is responsible for packing sequences of variable length into vectors
of fixed size. Attention can be defined as a mechanism that computes weights for all
the elements of the input sequence. It is assumed that the elements are represented as
real-valued vectors. The weights represent the importance of the elements and are typically
used to build a weighted sum of the element vectors. There are different types of
attention mechanism, depending on how the element weights are computed; in the
Transformer, dot-product attention is used [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The connection between the encoder and the
decoder is also based on attention, so the model uses three different types of
attention: self-attention in the encoder, masked self-attention in the decoder and
encoder-decoder attention. The model has n identical blocks, where n is a hyperparameter (Fig.
1). Every attention block in the model is multi-head, which means that h independent
attention layers operate in parallel and their results are combined using concatenation
and a linear transformation. The number of heads is also a hyperparameter and is the
same for all three types of attention in the model.
      </p>
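      <p>To make this concrete, the following is a minimal numpy sketch of (scaled) dot-product attention and the multi-head combination described above; it is illustrative only and not the tensor2tensor implementation used in our experiments.</p>
      <preformat>
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Scaled dot-product attention: weights are softmaxed query-key dot products."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)        # importance of each input element
    return weights @ V, weights               # weighted sum of value vectors

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """h heads run in parallel on projected inputs (W_q, W_k, W_v are lists of
    per-head projection matrices); their outputs are concatenated and combined
    with a linear transformation W_o."""
    heads = [dot_product_attention(Q @ wq, K @ wk, V @ wv)[0]
             for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o
      </preformat>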
      <p>
        It is impossible to train a model to predict sense labels of polysemous words directly, due to
the lack of sense-labeled texts. One way to solve this problem is to train a
model on an auxiliary task and then use the trained parameters for other purposes. This
technique was used in the widely known word embedding tool word2vec [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: the
CBOW model is trained to predict words from their contexts, and the weights of the
model are then used as word representations. We propose a similar solution: we train the
Transformer to predict words from their contexts with the aim of using its attention weights
for word sense induction afterwards. The input of the model is a text fragment with
all occurrences of a specific word replaced by a special token CENTERWORD, and
the desired output is the word whose occurrences were replaced.
The weights of the encoder-decoder attention mechanism (* in Fig. 1), taken at
the timestep when the model generates the prediction of the center word, directly
indicate how much each word of the context contributes to the prediction. These
weights were extracted during the processing of the target datasets and used for
weighting word embeddings when building context vectors. We hope that the context
words that the model has learned to attend to for predicting the missing word will also
be useful for discriminating between that word’s senses. We considered different
hyperparameter values for the Transformer, varying the number of layers and the
number of attention heads. For the simple variant of the model with just one layer
and one head, attention weights were extracted without any aggregation, as there is
only one vector of weights per input sequence. For more complex models with several
layers and attention heads, the weights of the first layer were extracted, and then for
each word in the input sequence the maximum over the weights from different attention
heads was calculated. The idea behind such aggregation is that different heads
attend to different parts of the context, so the maximum over all heads should reflect
how much the model attends to a particular word in general.
      </p>
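      <p>Under assumed tensor shapes (a sketch, not the exact extraction code), the aggregation described above can be expressed as follows; attn_weights stands for whatever encoder-decoder attention tensor the trained model exposes for the center-word prediction timestep.</p>
      <preformat>
import numpy as np

def aggregate_attention(attn_weights):
    """attn_weights: array of shape (num_layers, num_heads, src_len) with
    encoder-decoder attention for the single decoding timestep at which
    CENTERWORD is predicted (placeholder for the model's real output).

    Returns one weight per context word: first layer, maximum over heads."""
    first_layer = attn_weights[0]     # (num_heads, src_len)
    return first_layer.max(axis=0)    # (src_len,)
      </preformat>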
      <p>We consider two methods of determining word weights: one relying solely on
attention weights, and another using a combination of tf-idf and attention weights.
The weights are raised to specific powers, so a context vector is formed as follows:</p>
      <p>
        <disp-formula>
          <tex-math>v_{\mathrm{context}} = \sum_{w \in \mathrm{context}} (\mathrm{tfidf}_w)^{\mathrm{tf\_idf\_pow}} \cdot (\mathrm{att}_w)^{\mathrm{attention\_pow}} \cdot v_w</tex-math>
        </disp-formula>
        where tfidf_w and att_w are the tf-idf and attention weights of the word w respectively, tf_idf_pow
and attention_pow are hyperparameters, and v_w is the word2vec embedding for the context word
w.</p>
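      <p>A minimal sketch of this combined weighting (the weight dictionaries are placeholders for the tf-idf weights and the aggregated attention weights described above):</p>
      <preformat>
import numpy as np

def combined_context_vector(tokens, tfidf_w, att_w, w2v, tf_idf_pow, attention_pow):
    """Weighted sum of word2vec vectors, with tf-idf and attention weights
    raised to their respective powers (hyperparameters of the method).

    tokens: context words; tfidf_w, att_w: dicts mapping a word to its weight."""
    vec = np.zeros(w2v.vector_size)
    for w in tokens:
        if w in w2v and w in tfidf_w and w in att_w:
            vec += (tfidf_w[w] ** tf_idf_pow) * (att_w[w] ** attention_pow) * w2v[w]
    return vec
      </preformat>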
      <p>We also explore the word sense induction performance of the Transformer output
embedding for an ambiguous word. Specifically, we take the vector from the output of
the decoder (** in Fig. 1) at the timestep when the model predicts the center
word. We hypothesize that this vector is a good representation of the sense of the
predicted word because it summarizes the whole context. Finally, our best performing
method uses the concatenation of the Transformer output embedding for an ambiguous
word and the weighted sum over the words of its context.</p>
      <p>
        The actual word sense induction is performed by clustering the context
vectors with the agglomerative clustering algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The number of clusters
was selected on the train set of every dataset individually.
      </p>
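      <p>A minimal sketch of this clustering step using scikit-learn's agglomerative (Ward) clustering; the optional concatenation with the Transformer output embeddings corresponds to the combined representation described above.</p>
      <preformat>
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def induce_senses(context_vectors, output_embeddings, n_clusters):
    """Cluster the contexts of one ambiguous word into sense clusters.

    context_vectors: (n_contexts, dim) weighted sums of word2vec vectors;
    output_embeddings: (n_contexts, dim2) decoder output vectors at the
    CENTERWORD prediction timestep (pass None to use context vectors only)."""
    X = np.asarray(context_vectors)
    if output_embeddings is not None:
        X = np.concatenate([X, np.asarray(output_embeddings)], axis=1)
    clusterer = AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")
    return clusterer.fit_predict(X)
      </preformat>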
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>
        To evaluate our model we used the datasets and the evaluation scripts of RUSSE-WSI, the
shared task on word sense induction for the Russian language [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The task
provides three datasets (bts-rnc, active-dict and wiki-wiki) based on different corpora
and sense inventories; each was split into train and test parts. We used the official
evaluation script of the task, which calculates the adjusted Rand index (ARI); it equals 0
for a random clustering and 1 for the gold standard clustering.
      </p>
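      <p>For a local sanity check outside the official script, the same metric is available in scikit-learn (gold_labels and predicted are placeholders for the dataset sense ids and the induced cluster ids):</p>
      <preformat>
from sklearn.metrics import adjusted_rand_score

# gold_labels: sense ids from the dataset; predicted: cluster ids from induce_senses()
ari = adjusted_rand_score(gold_labels, predicted)
print("ARI = %.3f" % ari)  # 0 in expectation for a random clustering, 1 for perfect agreement
      </preformat>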
      <p>
        The train set for the Transformer was built from 25% of the librusec [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] text collection. We
extracted 12M contexts (4.5 GB) containing any of the 341 ambiguous words from the
datasets of the shared task. All occurrences of the ambiguous words were replaced
with the special token CENTERWORD. The average length of a context in the train set is
20 words, the same as in the RUSSE-WSI datasets. Preprocessing of all data included
converting to lowercase and inserting separating spaces between words and
punctuation marks. The size of the dataset was chosen to keep the training time reasonable. For
the same reason we only considered the 341 polysemous words from the RUSSE-WSI
datasets as possible center words. To monitor the training process, a development
set of 10,000 examples was sampled from this train set.
      </p>
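      <p>A minimal sketch of this preprocessing (lowercasing, spacing out punctuation and masking the target word); the tokenization regex is an assumption, and exact-match masking simplifies away the handling of inflected forms.</p>
      <preformat>
import re

def preprocess(text, target_word):
    """Lowercase, put spaces around punctuation and replace the target word
    with the CENTERWORD token."""
    text = text.lower()
    text = re.sub(r"([^\w\s])", r" \1 ", text)  # separate punctuation with spaces
    tokens = ["CENTERWORD" if t == target_word.lower() else t
              for t in text.split()]
    return " ".join(tokens)
      </preformat>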
      <p>The Transformer model was trained until the accuracy on the development set
stopped increasing. We trained two models: one with 1 layer and 1 head, and one with 2 layers and 4 heads;
all other Transformer hyperparameters were taken from the transformer_small configuration
(https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/models/transformer.py). The
final list of hyperparameters includes the powers of the tf-idf and attention weights, the
number of clusters and the Transformer architecture. For each dataset we picked the optimal
hyperparameters for our new method and for the baseline on the train set and evaluated them
on the test set.</p>
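      <p>A minimal sketch of this per-dataset selection for a single ambiguous word, reusing the combined_context_vector and induce_senses sketches above; the grids shown are illustrative, not the exact search space.</p>
      <preformat>
from itertools import product
from sklearn.metrics import adjusted_rand_score

def select_hyperparams(train_contexts, gold_labels, tfidf_w, att_w, w2v):
    """Grid search over weight powers and cluster counts on the train set."""
    best_params, best_ari = None, -1.0
    for tf_pow, att_pow, k in product([0.5, 1.0, 1.5],
                                      [0.125, 0.25, 0.5, 0.75],
                                      [2, 3, 4]):
        vectors = [combined_context_vector(c, tfidf_w, att_w, w2v, tf_pow, att_pow)
                   for c in train_contexts]
        labels = induce_senses(vectors, None, n_clusters=k)
        ari = adjusted_rand_score(gold_labels, labels)
        if ari > best_ari:
            best_params, best_ari = (tf_pow, att_pow, k), ari
    return best_params, best_ari
      </preformat>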
      <p>
        We compared our models with the simple but hard to beat baseline that achieved
second best results on bts-rnc and active-dict and third best on wiki-wiki on
RUSSEWSI task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The method in question also relies on weighted sums of word
embeddings for building context vectors and uses combinations of tf-idf and chi-squared
weights.
4
      </p>
    </sec>
    <sec id="sec-4">
      <title>Results and discussion</title>
      <p>[Table: ARI scores on the bts-rnc, active-dict and wiki-wiki test sets for the baseline and for the proposed configurations, including w2v weighted with tf-idf and attention (1 layer 1 head).]</p>
      <p>The best results are highlighted in bold. Our top score on active-dict beats the best
result from the RUSSE-WSI competition, and on bts-rnc our result is the second best.2
The combination of tf-idf and attention weights works considerably better than either type
of weights alone. Our method outperforms the baseline on two of the three test sets.</p>
      <p>2 Top scores on the test datasets were submitted to RUSSE-WSI post-competition:
active-dict - https://competitions.codalab.org/competitions/public_submissions/17806,
bts-rnc - https://competitions.codalab.org/competitions/public_submissions/17809,
wiki-wiki - https://competitions.codalab.org/competitions/public_submissions/17810.</p>
      <p>Transformer output embeddings did not show good results when used on their own;
however, combined with weighted averages of word2vec vectors they helped to
improve results in a number of cases. Remarkably, most of our best results were
obtained with this combined approach.</p>
      <p>After evaluating all hyperparameter values on the train sets from
RUSSE-WSI, we found that the top 10 results on the active-dict and bts-rnc datasets all used the
Transformer with 2 layers and 4 attention heads. This suggests that using several layers and
attention heads can be crucial for achieving good results. The powers of the weights
vary among the best results, which indicates that these hyperparameters need to be
adjusted to a particular dataset. The hyperparameters used to achieve the best results on
the test sets are as follows: tf_idf_pow = 1.5, attention_pow = 0.75, 2 clusters for bts-rnc;
tf_idf_pow = 1.5, attention_pow = 0.25, 3 clusters for active-dict; tf_idf_pow = 0.5,
attention_pow = 0.125, 2 clusters for wiki-wiki.</p>
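      <p>For reference, these reported settings can be collected into a small configuration map (values copied from the text above):</p>
      <preformat>
# Best test-time hyperparameters per dataset, as reported above.
BEST_HYPERPARAMS = {
    "bts-rnc":     {"tf_idf_pow": 1.5, "attention_pow": 0.75,  "n_clusters": 2},
    "active-dict": {"tf_idf_pow": 1.5, "attention_pow": 0.25,  "n_clusters": 3},
    "wiki-wiki":   {"tf_idf_pow": 0.5, "attention_pow": 0.125, "n_clusters": 2},
}
      </preformat>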
      <p>Considering the big difference in the results of different models, we explored the
weight distributions to find what plays the major role in the quality of word sense
induction when using weighted sums of word embeddings. We observed many
examples, all of which indicate that the combination of tf-idf weights with attention
weights helps to reduce noise compared to tf-idf and chi-squared weights.
Figure 2 shows the weight distributions given by different models for the same context:
“Onion (crop) - a species of two_year and multiyear crop, attributable to the
subfamily of the onion family. The scientific Latin name given by Carl Linnaeus comes from the Latin name of
garlic.”
This example illustrates that the proposed model, compared to the baseline, more
clearly selects the important words («two_year», «subfamily», «crop», «garlic») that
indicate the sense of the word «onion».</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arefyev</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Panchenko</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lukanin</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lesota</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romanov</surname>
            <given-names>P</given-names>
          </string-name>
          .
          <article-title>Evaluating three corpus-based semantic similarity systems for Russian</article-title>
          .
          <source>Proceedings of the International Conference on Computational Linguistics and Intelligent Technologies "Dialogue"</source>
          , June 2015 Moscow, Russia
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Panchenko</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopukhina</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ustalov</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopukhin</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leontyev</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arefyev</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loukachevitch</surname>
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2018</year>
          )
          <article-title>RUSSE'2018: A Shared Task on Word Sense Induction and Disambiguation for the Russian Language</article-title>
          .
          <source>In Proceedings of the 24th International Conference on Computational Linguistics and Intellectual Technologies</source>
          (Dialogue'
          <year>2018</year>
          ). May 30 - June 2, Moscow, Russia
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Vaswani</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            <given-names>I</given-names>
          </string-name>
          .
          <article-title>Attention Is All You Need</article-title>
          .
          <source>31st Conference on Neural Information Processing Systems (NIPS</source>
          <year>2017</year>
          ), Long Beach, CA, USA.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Arefyev</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermolaev</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Panchenko</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <string-name>
            <surname>HOW MUCH DOES A WORD WEIGHT</surname>
          </string-name>
          <article-title>: WEIGHTING WORD2VEC FOR WORD SENSE INDUCTION Proceedings of the International Conference on Computational Linguistics and Intelligent Technologies "Dialogue" 2018 Moscow Russia</article-title>
          [in print]
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mikolov</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          . arXiv preprint arXiv:1301.3781,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ward</surname>
            <given-names>Joe H.</given-names>
          </string-name>
          , Jr.
          <article-title>Hierarchical Grouping to Optimize an Objective Function</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          .
          <year>1963</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>