A Simple Neural Network for Evaluating Semantic Textual Similarity

                                                           Yang SHAO
                                                           Hitachi, Ltd.
                                              Higashi koigakubo 1-280, Tokyo, Japan
                                                   yang.shao.kn@hitachi.com



Abstract

This paper describes a simple neural network system for the Semantic Textual Similarity (STS) task. The basic type of the system took part in the STS task of SemEval 2017 and ranked 3rd in the primary track. More variant neural network structures and experiments are explored in this paper. The semantic similarity score between two sentences is calculated by comparing their semantic vectors in our system. The semantic vector of every sentence is generated by max pooling over every dimension of its word vectors. There are two main trick points in our system. One is that we trained a convolutional neural network (CNN) to transfer GloVe word vectors to a form more proper for the STS task before pooling. The other is that we trained a fully-connected neural network (FCNN) to transfer the difference of two semantic vectors to a probability distribution over similarity scores. In spite of the simplicity of our neural network system, the best variant neural network achieved a Pearson correlation coefficient of 0.7930 on the STS benchmark test dataset and ranked 3rd¹.

1 Introduction

Semantic Textual Similarity (STS) is the task of deciding a score that estimates the degree of semantic similarity between two sentences. The STS task is a building block of many Natural Language Processing (NLP) applications and has therefore received a lot of attention in recent years. STS tasks in SemEval have been held from 2012 to 2017 [Cer et al., 2017]. In order to provide a standard benchmark to compare meaning representation systems in future years, the organizers of the STS tasks created a benchmark dataset in 2017. STS Benchmark² comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017 [Agirre et al., 2012; 2013; 2014; 2015; 2016; Cer et al., 2017]. The selection of datasets includes text from image captions, news headlines and user forums. Estimating the degree of semantic similarity of two sentences requires a very deep understanding of both sentences. Therefore, methods developed for STS tasks can also be used for many other natural language understanding tasks, such as paraphrasing, entailment, answer sentence selection and hypothesis evidencing.

Measuring sentence similarity is challenging mainly for two reasons: the variability of linguistic expression and the limited amount of annotated training data. Conventional NLP approaches, such as sparse, hand-crafted features, are therefore difficult to use. However, neural network systems [He et al., 2015a; He and Lin, 2016] can alleviate data sparseness with pre-training and distributed representations. We propose a simple neural network system with five components (a minimal sketch of the whole pipeline follows the list):

1) Enhance the GloVe word vectors in every sentence by adding hand-crafted features.
2) Transfer the enhanced word vectors to a more proper form by a convolutional neural network (CNN).
3) Max pool over every dimension of all word vectors to generate a semantic vector.
4) Generate a semantic difference vector by concatenating the element-wise absolute difference and the element-wise multiplication of the two semantic vectors.
5) Transfer the semantic difference vector to a probability distribution over similarity scores by a fully-connected neural network (FCNN).
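A minimal sketch of this pipeline in Keras is shown below. It is illustrative rather than our exact code: it is written against the tf.keras API, assumes the basic-system hyperparameters of Table 1, reads the CNN filters of subsection 2.2 as kernel-size-1 convolutions over the 301-dimensional enhanced word vectors, and omits training details; all names are illustrative.

```python
# Minimal sketch of the five-component pipeline (tf.keras style; the
# actual system was built with Keras on TensorFlow). Shapes follow the
# basic system in Table 1.
import tensorflow as tf
from tensorflow.keras import layers, Model

PAD_LEN, WORD_DIM = 30, 301   # 30 padded tokens, 300 GloVe dims + 1 flag

sent_a = layers.Input(shape=(PAD_LEN, WORD_DIM))   # 1) enhanced word vectors
sent_b = layers.Input(shape=(PAD_LEN, WORD_DIM))

# 2) shared CNN transfer; kernel_size=1 makes each filter span exactly
#    one enhanced word vector (one reading of subsection 2.2).
cnn = layers.Conv1D(300, 1, activation='tanh', kernel_initializer='he_uniform')
# 3) max pooling over every dimension of all word vectors
pool = layers.GlobalMaxPooling1D()
sv_a, sv_b = pool(cnn(sent_a)), pool(cnn(sent_b))

# 4) semantic difference vector (Eq. (1) in subsection 2.3)
abs_diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([sv_a, sv_b])
product = layers.Multiply()([sv_a, sv_b])
sdv = layers.Concatenate()([abs_diff, product])    # 600 dimensions

# 5) FCNN to a probability distribution over the six similarity labels
hidden = layers.Dense(300, activation='tanh',
                      kernel_initializer='he_uniform')(sdv)
probs = layers.Dense(6, activation='softmax',
                     kernel_initializer='he_uniform')(hidden)

model = Model(inputs=[sent_a, sent_b], outputs=probs)
```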

Copyright © by the paper's authors. Copying permitted for private and academic purposes.
In: Proceedings of the IJCAI Workshop on Semantic Machine Learning (SML 2017), Aug 19-25 2017, Melbourne, Australia, published at http://ceur-ws.org

¹ As of May 26, 2017
² http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark
2 System Description

Figure 1 provides an overview of our system. The two sentences to be semantically compared are first pre-processed as described in subsection 2.1. Then the CNN described in subsection 2.2 transfers the word vectors to a more proper form for each sentence. After that, the process introduced in subsection 2.3 is applied to calculate the semantic vector and the semantic difference vector from the transferred word vectors. Then, an FCNN described in subsection 2.4 transfers the semantic difference vector to a probability distribution over similarity scores. We implemented our neural network system using Keras³ [Chollet, 2015] and TensorFlow⁴ [Abadi et al., 2016].

³ http://github.com/fchollet/keras
⁴ http://github.com/tensorflow/tensorflow

Figure 1: Overview of system

Table 1: Hyperparameters

  Sentence pad length                          30
  Dimension of GloVe vectors                  300
  Number of CNN layers l                        1
  Number of CNN filters in layer1 k1          300
  Activation function of CNN                 tanh
  Initial function of CNN              he_uniform
  Number of FCNN layers n                       2
  Dimension of input layer                    600
  Dimension of layer1 m1                      300
  Dimension of output layer                     6
  Activation of layers except output         tanh
  Activation of output layer              softmax
  Initial function of layers           he_uniform
  Optimizer                                  ADAM
  Batch size                                 1500
  Max epoch                                    25
  Run times                                     8

2.1 Pre-process

Several text preprocessing operations were performed before feature engineering:
1) All punctuation is removed.
2) All words are lower-cased.
3) All sentences are tokenized by the Natural Language Toolkit (NLTK) [Bird et al., 2009].
4) All words are replaced by pre-trained GloVe word vectors (Common Crawl, 840B tokens) [Pennington et al., 2014]. Words that do not exist in the pre-trained word vectors are set to the zero vector.
5) All sentences are padded to a static length l = 30 with zero vectors [He et al., 2015a].

One hand-crafted feature is added to enhance the GloVe word vectors (see the sketch after this list):
1) If a word appears in both sentences, add a TRUE flag to the word vector; otherwise, add a FALSE flag.
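A minimal sketch of these pre-processing steps is shown below. It assumes the GloVe vectors have already been loaded into a Python dict mapping word to vector (loading code omitted) and that NLTK's tokenizer models are installed; the function names are illustrative.

```python
# Sketch of the pre-processing in subsection 2.1.
import string
import numpy as np
from nltk.tokenize import word_tokenize  # NLTK [Bird et al., 2009]

PAD_LEN, GLOVE_DIM = 30, 300

def tokenize(sentence):
    # 1) remove punctuation, 2) lower-case, 3) tokenize with NLTK
    cleaned = sentence.translate(str.maketrans('', '', string.punctuation))
    return word_tokenize(cleaned.lower())

def encode_pair(sent_a, sent_b, glove):
    """Return two (PAD_LEN, GLOVE_DIM + 1) arrays for a sentence pair."""
    toks_a, toks_b = tokenize(sent_a), tokenize(sent_b)

    def encode(tokens, other):
        out = np.zeros((PAD_LEN, GLOVE_DIM + 1))      # 5) zero-padding
        for i, tok in enumerate(tokens[:PAD_LEN]):
            # 4) GloVe lookup; out-of-vocabulary words stay zero
            out[i, :GLOVE_DIM] = glove.get(tok, np.zeros(GLOVE_DIM))
            # hand-crafted feature: flag words shared by both sentences
            out[i, GLOVE_DIM] = 1.0 if tok in other else 0.0
        return out

    return encode(toks_a, set(toks_b)), encode(toks_b, set(toks_a))
```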
2.2 Convolutional neural network (CNN)

The number of our CNN layers is l. Every layer consists of kl one-dimensional filters. The length of the filters is set to be the same as the dimension of the enhanced word vectors. The activation function of the CNN is set to tanh. We did not use any regularization or dropout. Early stopping triggered by model performance on validation data was used to avoid overfitting. We used the same model weights to transfer each of the words in a sentence.

2.3 Comparison of semantic vectors

The semantic vector of a sentence is calculated by max pooling [Scherer et al., 2010] over every dimension of the CNN-transferred word vectors. To calculate the semantic similarity score of two sentences, we generate a semantic difference vector by concatenating the element-wise absolute difference and the element-wise multiplication of the two semantic vectors:

    SDV = (|SV1 − SV2|, SV1 ∘ SV2)    (1)

Here, SDV is the semantic difference vector between the two sentences, SV1 and SV2 are the semantic vectors of the two sentences, and ∘ is the Hadamard product, which generates the element-wise multiplication of the two semantic vectors.

2.4 Fully-connected neural network (FCNN)

An FCNN is used to transfer the semantic difference vector to a probability distribution over the six similarity labels used by STS. The number of layers is n. The dimension of every layer is mn. The activation function of every layer except the last one is tanh. The activation function of the last layer is softmax. We train without using regularization or dropout.
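As a concrete check of Eq. (1), the following toy example uses made-up 3-dimensional semantic vectors (the real vectors have 300 dimensions). The concatenation doubles the dimensionality, which is why the FCNN input layer in Table 1 has 600 dimensions.

```python
# Toy numeric check of Eq. (1) with made-up 3-dimensional vectors.
import numpy as np

sv1 = np.array([0.2, -0.5, 0.9])
sv2 = np.array([0.1,  0.4, 0.9])

sdv = np.concatenate([np.abs(sv1 - sv2), sv1 * sv2])
print(sdv)  # [0.1 0.9 0. 0.02 -0.2 0.81] (up to float rounding), 6 dims
```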
3 Experiments and Results

The basic type of our neural network system took part in the STS task of SemEval 2017 and ranked 3rd in the primary track [Shao, 2017]. The hyperparameters used in our basic type system were decided empirically for the STS task and are shown in Table 1. Our objective function is the Pearson correlation coefficient. ADAM [Kingma and Ba, 2015] was used as the gradient descent optimization method. All parameters of the optimizer follow the original paper: the learning rate is 0.001, β1 is 0.9, β2 is 0.999, and ε is 1e-08. he_uniform [He et al., 2015c] was used as the initial function of all layers. The basic model achieved a Pearson correlation coefficient of 0.778679 ± 0.003508 and ranked 4th on the STS benchmark⁵. We explore more variant neural network structures and experiments in this section.

⁵ As of May 26, 2017

3.1 Increasing the dimensions of the FCNN

We run the experiment using more FCNN dimensions in this subsection. The hyperparameters used in this subsection are the same as in the basic type system in Table 1 except the dimension of FCNN layer1 m1. The dimensions of FCNN layer1 m1 and the Pearson correlation coefficient results are shown in Table 2. Figure 2 shows the average results in every epoch with standard deviation error bars. The highest curve is the Pearson correlation coefficient results on the training data. The curve in the middle is the results on the validation data. The lowest curve is the results on the test data.

Table 2: Increasing the dimensions of the FCNN

  Dimensions of      Pearson correlation
  FCNN layer1 m1     coefficient results
       300           0.778679 ± 0.003508
       600           0.776741 ± 0.002711
       900           0.778596 ± 0.001876
      1200           0.779059 ± 0.003414
      1500           0.778852 ± 0.003400
      1800           0.779247 ± 0.002261

3.2 Increasing the filters of the CNN

We run the experiment using more CNN filters in this subsection. The hyperparameters used in this subsection are the same as in the basic type system in Table 1 except the number of CNN filters in layer1 k1. The number of CNN filters in layer1 k1 and the Pearson correlation coefficient results are shown in Table 3. Figure 3 shows the average results in every epoch with standard deviation error bars.

Table 3: Increasing the filters of the CNN

  Number of CNN          Pearson correlation
  filters in layer1 k1   coefficient results
        300              0.780586 ± 0.001843
        600              0.785420 ± 0.002587
        900              0.790137 ± 0.002325
       1200              0.791042 ± 0.002557
       1500              0.792357 ± 0.002256
       1800              0.792580 ± 0.002613

3.3 Increasing the layers of the FCNN

We run the experiment using more FCNN layers in this subsection. The hyperparameters used in this subsection are the same as in the basic type system in Table 1 except the number of FCNN layers n and the dimensions of the FCNN layers mn. The number of FCNN layers n is set to 3 in this subsection. The dimensions of the FCNN layers mn and the Pearson correlation coefficient results are shown in Table 4. The number of filters in CNN layer1 k1 is set to 1800 based on the previous experiments. Figure 4 shows the average results in every epoch with standard deviation error bars.

Table 4: Increasing the layers of the FCNN

  Dimensions of   Dimensions of   Pearson correlation
  FCNN layer1     FCNN layer2     coefficient results
      300             300         0.788331 ± 0.004569
      600             600         0.785838 ± 0.003565
      900             900         0.789736 ± 0.002546
     1200            1200         0.786109 ± 0.003820
     1500            1500         0.789013 ± 0.001524
     1800            1800         0.782995 ± 0.003396

3.4 Increasing the layers of the CNN

We run the experiment using more CNN layers in this subsection. The hyperparameters used in this subsection are the same as in the basic type system in Table 1 except the number of CNN layers l and the number of filters in the CNN layers kl. The number of CNN layers l is set to 2 in this subsection. The number of filters in the CNN layers kl and the Pearson correlation coefficient results are shown in Table 5. The dimension of FCNN layer1 m1 is set to 1800 based on the previous experiments. Figure 5 shows the average results in every epoch with standard deviation error bars.

Table 5: Increasing the layers of the CNN

  Number of           Number of           Pearson correlation
  filters in layer1   filters in layer2   coefficient results
       300                 301            0.762369 ± 0.002277
       600                 301            0.765034 ± 0.002445
       900                 301            0.765966 ± 0.003641
      1200                 301            0.761183 ± 0.004322
      1500                 301            0.764604 ± 0.004969
      1800                 301            0.766178 ± 0.004455

3.5 2 CNN layers with shortcut

We run the experiment using 2 CNN layers in this subsection. We add a shortcut [He et al., 2015b] between the input layer and the second layer. The hyperparameters used in this subsection are the same as in the basic type system in Table 1 except the number of CNN layers l and the number of CNN filters in the layers kl. The number of CNN layers l is set to 2. The number of CNN filters in layer2 k2 is set to 301, the same as the dimension of the enhanced GloVe word vectors. The number of filters in CNN layer1 k1 and the Pearson correlation coefficient results are shown in Table 6. The dimension of FCNN layer1 m1 is set to 1800 based on the previous experiments. Figure 6 shows the average results in every epoch with standard deviation error bars.

Table 6: 2 CNN layers with shortcut

  Number of           Number of           Pearson correlation
  filters in layer1   filters in layer2   coefficient results
       300                 301            0.762030 ± 0.008716
       600                 301            0.768793 ± 0.003466
       900                 301            0.767369 ± 0.004021
      1200                 301            0.768415 ± 0.005799
      1500                 301            0.769528 ± 0.002299
      1800                 301            0.770214 ± 0.006707

3.6 3 CNN layers with shortcut

We run the experiment using 3 CNN layers in this subsection. We add a shortcut [He et al., 2015b] between the first layer and the third layer (a sketch of this structure follows Table 7). The hyperparameters used in this subsection are the same as in the basic type system in Table 1 except the number of CNN layers l and the number of CNN filters in the layers kl. The number of CNN layers l is set to 3. The number of filters in CNN layer3 k3 is set to be the same as the number of filters in CNN layer1 k1. The number of CNN filters in the layers kl and the Pearson correlation coefficient results are shown in Table 7. The dimension of FCNN layer1 m1 is set to 1800 based on the previous experiments. Figure 7 shows the average results in every epoch with standard deviation error bars. For this experiment, we also tried the model without the hand-crafted feature. The purely sentence representation system achieved a result of 0.788154 ± 0.003412.

Table 7: 3 CNN layers with shortcut

  Number of           Number of           Pearson correlation
  filters in layer1   filters in layer2   coefficient results
      1800                 300            0.793013 ± 0.002325
      1800                 600            0.791661 ± 0.002444
      1800                 900            0.787749 ± 0.003798
      1800                1200            0.785493 ± 0.002761
      1800                1500            0.785675 ± 0.003413
      1800                1800            0.783370 ± 0.004499
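A minimal sketch of this 3-layer CNN transfer with a shortcut is shown below. The element-wise Add merge is an assumption in the residual style of [He et al., 2015b]; the exact merge operation is not spelled out above, so this is one plausible reading. Setting k3 = k1 makes the shapes of the two added tensors match.

```python
# Sketch of the word-vector transfer in subsection 3.6: three
# kernel-size-1 convolutions with a shortcut from layer1 to layer3.
from tensorflow.keras import layers

def transfer_words(word_vectors, k1=1800, k2=300):
    conv = lambda k: layers.Conv1D(k, 1, activation='tanh',
                                   kernel_initializer='he_uniform')
    h1 = conv(k1)(word_vectors)    # layer1: k1 filters
    h2 = conv(k2)(h1)              # layer2: k2 filters (varied in Table 7)
    h3 = conv(k1)(h2)              # layer3: k3 = k1 filters
    return layers.Add()([h1, h3])  # assumed shortcut: element-wise addition
```

With k1 = 1800 and k2 = 300 this corresponds to the best-performing row of Table 7; max pooling and the FCNN head then follow as in Section 2.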
4 Discussion

From the results of experiment 1, we find that increasing the dimensions of the FCNN does not have a remarkable effect on the accuracy of the evaluations. The curves in Figure 2 are nearly coincident. From the results of experiment 2, we find that increasing the number of filters in the CNN layer can improve the accuracy of the evaluations. Although the size of the training data (5749 records) is not very large, abstracting more features still benefits the evaluation results. By increasing the number of filters in the CNN layer, we achieved a Pearson correlation coefficient of 0.792580 ± 0.002613 and improved the rank from 4th to 3rd.

From the results of experiment 3, we find that increasing the layers of the FCNN is harmful to the evaluation results, while increasing the dimensions of the FCNN layers has little effect on the accuracy of the evaluations. From the results of experiment 4, we find that increasing the layers of the CNN can significantly pull down the accuracy of the evaluations. However, changing the number of filters in the CNN layers only changes the learning speed and has little effect on the final accuracy. The structure with a smaller number of filters can learn faster.

From the results of experiment 5, we find that adding a shortcut between the input layer and the second CNN layer can slightly improve the accuracy of the evaluations. From the results of experiment 6, we find that adding a shortcut between the first CNN layer and the third CNN layer yields a result close to the model that has only one CNN layer. A smaller number of filters in the second CNN layer achieves better accuracy. Compared with the structure that has only one CNN layer, the structure with 3 CNN layers and a shortcut can learn faster. It achieved a Pearson correlation coefficient of 0.793013 ± 0.002325, which is the best result among all of the variant neural networks.

5 Conclusion

We investigated a simple neural network system for the STS task. All variant models used a convolutional neural network to transfer hand-crafted-feature-enhanced GloVe word vectors to a proper form. Then, the models calculated the semantic vectors of the sentences by max pooling over every dimension of their transferred word vectors. After that, the semantic difference vector between the two sentences was generated by concatenating the element-wise absolute difference and the element-wise multiplication of their semantic vectors. At last, a fully-connected neural network was used to transfer the semantic difference vector to a probability distribution over similarity scores.

In spite of the simplicity of our neural network system, the basic type ranked 3rd in the primary track of the STS task of SemEval 2017. On the STS benchmark test dataset, the basic model achieved a Pearson correlation coefficient of 0.778679 ± 0.003508 and ranked 4th. By investigating several variant neural networks in this research, we found that the structure with 3 CNN layers and a shortcut between the first and the third layer achieved the best result, 0.793013 ± 0.002325, which improved our rank from 4th to 3rd. We also tried a purely sentence representation system for this model; its result of 0.788154 ± 0.003412 also ranked 3rd.
Figure 2: Increasing the dimensions of the FCNN
Figure 3: Increasing the filters of the CNN
Figure 4: Increasing the layers of the FCNN
Figure 5: Increasing the layers of the CNN
Figure 6: 2 CNN layers with shortcut
Figure 7: 3 CNN layers with shortcut
References

[Abadi et al., 2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.

[Agirre et al., 2012] Eneko Agirre, Mona Diab, Daniel Cer, and Aitor Gonzalez-Agirre. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the First Joint Conference on Lexical and Computational Semantics, pages 385–393, 2012.

[Agirre et al., 2013] Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. *SEM 2013 shared task: Semantic textual similarity. In Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity, pages 32–43, 2013.

[Agirre et al., 2014] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Rada Mihalcea, German Rigau, and Janyce Wiebe. SemEval-2014 task 10: Multilingual semantic textual similarity. In Proceedings of the 8th International Workshop on Semantic Evaluation, pages 81–91, 2014.

[Agirre et al., 2015] Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Inigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the 9th International Workshop on Semantic Evaluation, pages 252–263, 2015.

[Agirre et al., 2016] Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation, pages 497–511, San Diego, California, June 2016. Association for Computational Linguistics.

[Bird et al., 2009] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, 2009.

[Cer et al., 2017] Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada, August 2017. Association for Computational Linguistics.

[Chollet, 2015] François Chollet. Keras. https://github.com/fchollet/keras, 2015.

[He and Lin, 2016] Hua He and Jimmy Lin. Pairwise word interaction modelling with deep neural networks for semantic similarity measurement. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.

[He et al., 2015a] Hua He, Kevin Gimpel, and Jimmy Lin. Multi-perspective sentence similarity modelling with convolutional neural networks. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1576–1586, 2015.

[He et al., 2015b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.

[He et al., 2015c] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

[Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.

[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing, pages 1532–1543, 2014.

[Scherer et al., 2010] Dominik Scherer, Andreas C. Müller, and Sven Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proceedings of the 20th International Conference on Artificial Neural Networks (ICANN), pages 92–101, 2010.

[Shao, 2017] Yang Shao. HCTI at SemEval-2017 task 1: Use convolutional neural network to evaluate semantic textual similarity. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval 2017), pages 130–133, Vancouver, Canada, August 2017. Association for Computational Linguistics.