Incorporating multiple feature groups into a Siamese Neural Network for the Semantic Textual Similarity task in Portuguese texts

João Vitor Andrioli de Souza, Lucas Emanuel Silva e Oliveira, Yohan Bonescki Gumiel, Deborah Ribeiro Carvalho and Claudia Maria Cabral Moro

Graduate Program on Health Technology (PPGTS), Pontifical Catholic University of Paraná (PUCPR), Curitiba, Brazil
joao.vitor.andrioli@gmail.com, {lucas.oliveira, yohan.gumiel, ribeiro.carvalho, c.moro}@pucpr.br

Abstract. Semantic Textual Similarity (STS) algorithms play a key role in Natural Language Processing (NLP), since they can support various NLP tasks such as Text Summarization and Information Retrieval. Although several STS initiatives can be found in the literature, few authors have explored Siamese Neural Networks (SNN) for this problem, especially for the Portuguese language, even though SNNs need less training data and have an architecture built for similarity tasks. We defined a set of lexical, semantic, distributional and graph-based feature groups to capture distinct aspects of the text, and incorporated them into an SNN architecture to perform STS on the ASSIN 1 and ASSIN 2 datasets. The experiments indicate positive results, since we improved on previous attempts at STS with SNNs in Portuguese texts.

Keywords. Semantic Textual Similarity, Siamese Neural Networks, Shared Tasks.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction and Background

The Semantic Textual Similarity (STS) task consists of quantifying the degree of semantic equivalence between one text and another. It is essential for much Natural Language Processing (NLP) research and many applications, supporting tasks such as Plagiarism Detection, Text Deduplication, Text Summarization, Information Retrieval and Text Clustering [1].

Neural Network (NN) architectures have been outperforming traditional Machine Learning (ML) models in several fields of study, including NLP. One successful example is the Siamese Neural Network (SNN), a type of NN used to calculate similarity in studies such as [2–5]. It has performed well in various tasks, first on images and more recently on text, achieving good results with less data than other approaches; it has also shown less susceptibility to overfitting [5].

The layers of an SNN can be of any type that suits the problem, such as Convolutional Neural Networks or Long Short-Term Memory networks. The architecture is composed of two or more identical subnetworks that share the same configuration, parameters and weights, whose values are updated simultaneously during the learning process [6] (a minimal sketch of this weight-sharing pattern is given at the end of this section).

Many features have been tested to tackle the STS problem: lexical features with string-based approaches, semantic features with corpus-based and knowledge-based approaches [7], structural syntactic or morphological information, and, more recently, distributional and contextual Word Embeddings (WE).

Shared tasks play an important role in STS. For the English language, the SemEval initiatives have released various STS tasks over the years (e.g., [1]). The Portuguese language is represented by the ASSIN 1 [8] and ASSIN 2 [9] shared tasks, which focused on STS for texts from the journalistic domain. The winning team of ASSIN 1 [10] combined TF-IDF calculation with WE.

The SNN architecture has not been fully explored: only one group in ASSIN 1 used it [11], and just a few other studies have applied it to other languages (e.g., [4–6, 12]). We hypothesize that associating the SNN's ability to train under data limitations (often the case in shared tasks) with a set of lexical, semantic, graph-based and distributional features could capture different aspects of the text and establish a consistent model. In this work, we present an SNN architecture fed with lexical, semantic, distributional and graph-based features, aiming to perform STS on the ASSIN 1 and ASSIN 2 datasets.
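To make the weight sharing concrete, here is a minimal sketch of the pattern in Keras. This is our illustration only, not the architecture used in this paper (described in Section 2.2): the vocabulary size, layer sizes and the sigmoid similarity head are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_siamese(vocab_size=10000, embed_dim=100, seq_len=30):
    # A single encoder instance: applying the same object to both inputs
    # is what shares the configuration, parameters and weights, so both
    # branches are updated simultaneously during training.
    encoder = tf.keras.Sequential([
        layers.Embedding(vocab_size, embed_dim),
        layers.Bidirectional(layers.LSTM(64)),
    ])
    left = layers.Input(shape=(seq_len,))
    right = layers.Input(shape=(seq_len,))
    h_left, h_right = encoder(left), encoder(right)  # same weights, twice
    # Similarity head: [6] used the Manhattan distance here; Section 2.2
    # explains why this paper replaces it with a dense layer.
    merged = layers.concatenate([h_left, h_right])
    score = layers.Dense(1, activation="sigmoid")(merged)
    return Model(inputs=[left, right], outputs=score)

model = build_siamese()
model.compile(optimizer="adam", loss="mse")
```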
2 Materials and Methods

2.1 Dataset

The ASSIN 1 and ASSIN 2 datasets are composed of manually annotated Portuguese sentence pairs with their respective similarity/relatedness scores, ranging from 1 to 5, where 1 denotes no similarity and 5 denotes equivalence. The ASSIN 1 dataset is divided into two parts, Brazilian Portuguese (BR) and European Portuguese (PT), while ASSIN 2 contains BR sentences only. Table 1 presents the sizes of the datasets, and Table 2 shows example sentence pairs with their respective similarities.

Aiming to conduct an exploratory data analysis and compare the sparsity and density of both datasets, we applied a graph representation algorithm to the data (sketched after Table 4) and verified interesting aspects, such as the low vocabulary volume of ASSIN 2 compared to ASSIN 1: despite having more sentence pairs, ASSIN 2 has fewer unique tokens (Table 3). The number of edges in each dataset (Table 4) likewise indicates the high data sparsity of ASSIN 1 and the low data sparsity of ASSIN 2.

Table 1. The number of text pairs in each dataset.

             ASSIN 2   ASSIN 1 PT   ASSIN 1 BR   ASSIN 1 & 2
Train        7000      3000         3000         13000
Test         2448      2000         2000         6448
Total        9448      5000         5000         19448

Table 2. ASSIN 1 & 2 text pairs with their corresponding similarity score and description.

Score 1.0. The two sentences are totally unrelated.
  S1: Um cachorro branco de coleira está andando na água
  S2: Um homem sem camisa está jogando futebol em um gramado

Score 2.0. The two sentences have similar actions or objects.
  S1: Um cachorro aparentemente desnutrido está em pé nas patas de trás e se preparando para pular
  S2: Um cachorro de aparência saudável está deitado no chão

Score 3.0. The two sentences share details.
  S1: Um homem e uma criança estão andando de caiaque pelas águas calmas
  S2: Um caiaque amarelo está sendo navegado por um homem e um menino jovem

Score 4.0. The two sentences are closely related.
  S1: O cara está montando um cavalo perto de um riacho
  S2: O cara está montando um cavalo perto de uma correnteza

Score 5.0. The two sentences are equivalent.
  S1: Um cara está brincando com uma bola de meia
  S2: Tem um cara brincando com uma bola de meia

Table 3. The number of unique tokens (nodes) in the ASSIN 1 & 2 datasets for the train, test and concatenated train+test portions.

             ASSIN 2   ASSIN 1 PT   ASSIN 1 BR   ASSIN 1 (PT+BR)   ASSIN 1 & 2
Train        2342      11075        10058        -                 -
Test         1967      9282         8675         -                 -
Train+Test   2542      15249        13757        22389             23673

Table 4. The number of unique token bigrams (edges) in the ASSIN 1 & 2 datasets for the train, test and concatenated train+test portions.

             ASSIN 2   ASSIN 1 PT   ASSIN 1 BR   ASSIN 1 (PT+BR)   ASSIN 1 & 2
Train        8787      45218        40039        -                 -
Test         7090      35075        33441        -                 -
Train+Test   10327     70965        63710        120453            129143
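As a companion to Tables 3 and 4, the sketch below shows one way to compute these node and edge counts with networkx. The plain lowercase whitespace tokenization and the example sentences are our assumptions; the paper does not specify its preprocessing.

```python
import networkx as nx

def dataset_graph(sentences):
    """Build the dataset graph: unique tokens become nodes and each
    adjacent-token pair (current token, next token) becomes a directed
    edge; duplicate edges are ignored by the DiGraph."""
    g = nx.DiGraph()
    for sent in sentences:
        tokens = sent.lower().split()
        g.add_nodes_from(tokens)
        g.add_edges_from(zip(tokens, tokens[1:]))
    return g

# Toy sentences (hypothetical, not dataset content):
sents = ["um cachorro está correndo", "um cachorro branco está andando"]
g = dataset_graph(sents)
print(g.number_of_nodes())  # unique tokens (Table 3 style count)
print(g.number_of_edges())  # unique token bigrams (Table 4 style count)
```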
2.2 Machine Learning Classifier and Features

Our STS algorithm is based on the SNN previously proposed in [6], but the Manhattan distance was replaced with a 50-unit dense layer, since the addition of new features would not work well with the Manhattan distance. We used two 300-unit BiLSTM subnetworks and trained for 70 epochs with the mean squared error (mse) loss function.

We decided to use the pre-trained 300-dimensional Word2Vec CBOW vectors from NILC [13], since they were used in previous STS works for Portuguese and achieved good results [10, 14] (hereafter NILC Word2vec). The 100-dimensional Word2Vec skip-gram vectors with ID 63 in the NLPL WE Repository (http://vectors.nlpl.eu/repository/) were used as well (hereafter NLPL ID 63 Word2vec). Finally, a 100-dimensional vector was trained on the texts of each dataset group with the Word2Vec algorithm [15] to serve as our baseline (hereafter ASSIN Word2Vec).

A new parallel 100-unit dense layer was incorporated into the SNN to accommodate five new feature groups: (i) lexical-based similarities; (ii) knowledge-based WordNet token distances; (iii) the distributional WE cosine distance between the sentences; (iv) the sum of the degree centrality of the tokens, using a graph created from the dataset; and (v) the overlap of common words around the sentences. A sketch of these feature computations is given at the end of this section.

The lexical-based feature group is composed of three measures: the Jaccard index, the Dice coefficient and the cosine distance [7]. They represent the overlap of common tokens between the two sentences and have shown results very similar to the word overlap metric used as the ASSIN baseline.

We used the Open Multilingual Wordnet (OMW, https://www.nltk.org/howto/wordnet.html) from the Natural Language Toolkit (NLTK) to build our knowledge-based feature group. Three similarity measures were used as features: the Wu-Palmer similarity, the Leacock-Chodorow similarity and the shortest path distance in the hypernym/hyponym taxonomy.

The third feature group is the cosine distance between the two input sentences in the WE model selected for each experiment (the models are described above). Each sentence is positioned at the average position of its words in the WE model, as presented in Figure 1.

[Figure 1. Process for calculating the cosine distance between the word vectors of two sentences.]

To use the information found in the dataset itself as features, a directed graph of each dataset was generated to calculate both the degree centrality of a token and the overlap of common words around the sentences. Each node is a token and each edge is a pair of adjacent tokens (current token, next token), so the edges carry some syntactic information. The degree centrality of a token is the number of links incident to it; for instance, in Figure 2 the degree centrality of "jovem" is 8, because there are 8 links around it. This metric was selected as a way to measure the significance of words. We chose degree centrality due to its low computational cost and the high importance of closely related words; other centrality measures could be used and still need to be tested.

The overlap of common words around the sentences is the number of words directly around both sentences in the graph; it is normalized to [0, 1] by dividing the overlap by the sum of the number of tokens around each sentence.

[Figure 2. Example of the generated graph.]
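The following sketch shows, under stated assumptions, how the five feature groups can be computed: sentences are lists of lowercase tokens, kv is a gensim KeyedVectors model, g is the dataset graph from the earlier listing, and the averaging and neighbourhood definitions are our reading of the descriptions above, not the paper's released code (the NLTK wordnet and omw-1.4 data packages are assumed to be installed).

```python
import math
import numpy as np
from nltk.corpus import wordnet as wn  # Open Multilingual Wordnet via NLTK

def lexical_features(a, b):
    # Token-set overlap measures: Jaccard index, Dice coefficient and a
    # set-based cosine similarity (sentences assumed non-empty).
    sa, sb = set(a), set(b)
    inter = len(sa & sb)
    jaccard = inter / len(sa | sb)
    dice = 2 * inter / (len(sa) + len(sb))
    cosine = inter / math.sqrt(len(sa) * len(sb))
    return jaccard, dice, cosine

def wordnet_features(a, b, lang="por"):
    # Wu-Palmer, Leacock-Chodorow and shortest-path similarities, averaged
    # over the first noun synset of each token; this pairing/averaging is
    # our reading, not necessarily the paper's exact procedure.
    def first_noun(tok):
        syns = wn.synsets(tok, pos=wn.NOUN, lang=lang)
        return syns[0] if syns else None
    syn_a = [s for s in map(first_noun, a) if s is not None]
    syn_b = [s for s in map(first_noun, b) if s is not None]
    pairs = [(x, y) for x in syn_a for y in syn_b]
    if not pairs:
        return 0.0, 0.0, 0.0
    wup = float(np.mean([x.wup_similarity(y) or 0.0 for x, y in pairs]))
    lch = float(np.mean([x.lch_similarity(y) or 0.0 for x, y in pairs]))
    path = float(np.mean([x.path_similarity(y) or 0.0 for x, y in pairs]))
    return wup, lch, path

def embedding_cosine_distance(a, b, kv):
    # Cosine distance between averaged sentence vectors (Figure 1);
    # kv is assumed to be a gensim KeyedVectors model.
    def centroid(tokens):
        vecs = [kv[t] for t in tokens if t in kv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)
    va, vb = centroid(a), centroid(b)
    denom = float(np.linalg.norm(va) * np.linalg.norm(vb))
    return 1.0 - float(np.dot(va, vb)) / denom if denom else 1.0

def degree_centrality_sum(tokens, g):
    # Sum of each token's degree (number of incident links) in the
    # dataset graph g built as in the earlier listing.
    return sum(g.degree(t) for t in tokens if t in g)

def common_words_around(a, b, g):
    # Words directly around a sentence = graph neighbours of its tokens.
    # The overlap is normalized by the summed neighbourhood sizes, giving
    # a value in [0, 1]; the neighbourhood definition is our reading.
    def around(tokens):
        hood = set()
        for t in tokens:
            if t in g:
                hood |= set(g.predecessors(t)) | set(g.successors(t))
        return hood
    na, nb = around(a), around(b)
    total = len(na) + len(nb)
    return len(na & nb) / total if total else 0.0
```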
Our experiments were executed on each of the following dataset configurations: (i) the concatenation of ASSIN 1 and ASSIN 2; (ii) ASSIN 2; (iii) ASSIN 1 BR; (iv) ASSIN 1 PT; and (v) the concatenation of ASSIN 1 BR and ASSIN 1 PT. The experimental setup evaluated the algorithm using only the WE as features for our SNN, and then using the WE plus the additional hand-crafted features. Performance was evaluated with the same metrics used in ASSIN, the Pearson correlation (p) and the mean squared error (mse); a computation sketch is given at the end of this section.

3 Results

The results of our SNN algorithm are displayed in Table 5, which distinguishes the scores by dataset configuration (ASSIN 1, ASSIN 2, ASSIN 1 PT, ASSIN 1 BR, ASSIN 1+2), WE model (ASSIN Word2Vec, NLPL ID 63 Word2vec and NILC Word2vec) and features used (WE plus the five feature groups versus WE as the only feature). For each configuration, a larger p and a smaller mse are better. The five proposed feature groups improve most of the results when compared to the "WE as only feature" runs. The best p score was achieved by the five-feature model with the pre-trained NLPL ID 63 Word2vec on all datasets; on ASSIN 2 it scored 0.72 p and 0.65 mse. Our feature selection improved the scores mostly for the pre-trained WE.

Table 5. Pearson correlation (p) and mean squared error (mse) scores for each SNN execution.

             ASSIN Word2Vec          NLPL ID 63 Word2vec     NILC Word2vec
             p           mse         p           mse         p           mse
             basic feat  basic feat  basic feat  basic feat  basic feat  basic feat
ASSIN 2      0.67  0.70  0.64  0.65  0.69  0.72  0.70  0.65  0.68  0.68  0.73  0.60
ASSIN 1      0.62  0.62  0.59  0.61  0.62  0.66  0.58  0.56  0.61  0.64  0.62  0.57
ASSIN 1 PT   0.64  0.64  0.75  0.76  0.62  0.66  0.72  0.64  0.62  0.65  0.75  0.66
ASSIN 1 BR   0.61  0.61  0.49  0.50  0.63  0.64  0.47  0.46  0.62  0.64  0.51  0.45
ASSIN 1+2    0.66  0.66  0.64  0.64  0.68  0.70  0.62  0.60  0.65  0.67  0.70  0.63

*basic denotes the WE-as-only-feature approach. *feat denotes the WE plus the five additional feature groups.

In Table 6, we compare the performance of our current method with the language-independent approach developed in [16]. The improvement is evident in all dataset configurations.

Table 6. Best scores of our current method compared to a language-independent method.

             p              mse
             lind   new     lind   new
ASSIN 2      0.69   0.72    0.61   0.60
ASSIN 1      0.63   0.66    0.60   0.56
ASSIN 1 PT   0.64   0.66    0.75   0.64
ASSIN 1 BR   0.63   0.64    0.46   0.45
ASSIN 1+2    0.64   0.70    0.71   0.60

*lind denotes the language-independent approach of [16]. *new denotes the best result of our current approach.

Table 7 compares the five feature groups in isolation and in pairs, evaluated exclusively with the NLPL ID 63 WE on the concatenation of the ASSIN 1 and 2 datasets, given that the NLPL ID 63 WE achieved the best results in most runs. Differently from the previous experiments, which were trained for 70 epochs, the models in this table were trained for only 50 epochs, due to time constraints and because the focus was only on comparing the features.
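For reference, the sketch below computes the two reported metrics, assuming scipy and numpy; the gold and predicted scores are toy values, not ASSIN results.

```python
import numpy as np
from scipy.stats import pearsonr

gold = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # toy gold similarity scores
pred = np.array([1.3, 2.1, 2.7, 4.4, 4.8])  # toy model predictions

p, _ = pearsonr(gold, pred)        # Pearson correlation: higher is better
mse = np.mean((gold - pred) ** 2)  # mean squared error: lower is better
print(f"p = {p:.2f}, mse = {mse:.2f}")
```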
4 Discussion

The degree centrality and the overlap of common words around the sentences depend on the generated graph, and the graphs of ASSIN 1 and ASSIN 2 have very different characteristics. For instance, the low data sparsity of ASSIN 2, shown in Tables 3 and 4, may explain why each WE performs differently on it in Table 5, in contrast to the other dataset configurations.

Table 7. Comparison of the feature groups in isolation and in pairs (NLPL ID 63 WE, ASSIN 1+2, 50 epochs).

                      (isolated)   + Lexical    + WordNet    + WE Cosine  + Degree
                                                             Distance     Centrality
                      p     mse    p     mse    p     mse    p     mse    p     mse
Lexical               0.68  0.64   -     -      -     -      -     -      -     -
WordNet               0.66  0.71   0.68  0.63   -     -      -     -      -     -
WE Cosine Distance    0.67  0.63   0.70  0.58   0.67  0.61   -     -      -     -
Degree Centrality     0.60  0.74   0.68  0.67   0.61  0.72   0.68  0.72   -     -
Common Words Around   0.64  0.65   0.69  0.60   0.69  0.65   0.71  0.60   0.64  0.65

The lexical and WE cosine distance features were the least costly to calculate, while the overlap of common words around the sentences was the slowest; even so, these features showed good results. The WordNet features showed a smaller score increase, and degree centrality obtained the worst results, which might indicate that it poorly represents word significance for similarity tasks. Nonetheless, other centrality and node influence metrics still need to be tested, and the graph representation features could be improved with contextual weighting.

One difficulty in selecting the method was the lack of released annotation guidelines, so some aspects of the text were not fully specified, for instance whether the scores reflect sentence relatedness or similarity. For example, the sentences "O menino está tocando o piano" and "O menino não está tocando o piano" received the score 3.0: they are related, but due to the negation their meanings are opposite. This score does not agree with the sentences "Não tem nenhum homem executando um truque em uma bicicleta verde" and "Um homem está realizando um truque em uma bicicleta verde", which received the score 4.8.

As future work, we intend to combine at least the three best features of Table 7 with a contextual WE, such as BERT or ELMo, which could improve the scores by taking advantage of the contextual information stored in this kind of embedding. We hypothesize that a contextual WE would mainly improve the results on ASSIN 2, due to the low data sparsity of that dataset, while a distributional WE can fail when dealing with words used in multiple contexts and with more similarity variation.

5 Conclusion

We trained an SNN architecture combining pre-trained Word Embeddings with a set of features intended to cover different aspects of the text (e.g., lexical, semantic), aiming to perform the Semantic Textual Similarity task on Portuguese texts. The experiments showed promising results: we improved on previous attempts to use SNNs for Portuguese STS, and we verified each feature's contribution by isolating it and training separate models.
References

1. Agirre, E., Cer, D., Diab, M., Gonzalez-Agirre, A.: SemEval-2012 Task 6: A pilot on semantic textual similarity. In: *SEM 2012 - 1st Joint Conference on Lexical and Computational Semantics, pp. 385–393. Association for Computational Linguistics, Montréal, Canada (2012).
2. Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a "Siamese" time delay neural network. In: NIPS'93 Proceedings of the 6th International Conference on Neural Information Processing Systems, pp. 737–744. Morgan Kaufmann Publishers Inc., Denver, Colorado (1993).
3. Chopra, S., Hadsell, R., LeCun, Y.: Learning a Similarity Metric Discriminatively, with Application to Face Verification. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pp. 539–546. IEEE, San Diego, CA, USA (2005).
4. Neculoiu, P., Versteegh, M., Rotaru, M.: Learning Text Similarity with Siamese Recurrent Networks. In: Proceedings of the 1st Workshop on Representation Learning for NLP, pp. 148–157. Association for Computational Linguistics, Stroudsburg, PA, USA (2016).
5. Ranasinghe, T., Orasan, C., Mitkov, R.: Semantic Textual Similarity with Siamese Neural Networks. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2019. Varna, Bulgaria (2019).
6. Mueller, J., Thyagarajan, A.: Siamese recurrent architectures for learning sentence similarity. In: 30th AAAI Conference on Artificial Intelligence, AAAI 2016, pp. 2786–2792. AAAI Press, Phoenix, Arizona (2016).
7. Gomaa, W.H., Fahmy, A.A.: A Survey of Text Similarity Approaches. International Journal of Computer Applications 68, 13–18 (2013).
8. Fonseca, E.R., Santos, L.B. dos, Criscuolo, M.: Visão Geral da Avaliação de Similaridade Semântica e Inferência Textual. Linguamática 8, 3–13 (2016).
9. Real, L., Fonseca, E., Gonçalo Oliveira, H.: The ASSIN 2 Shared Task: Evaluating Semantic Textual Similarity and Textual Entailment in Portuguese. In: Proceedings of the ASSIN 2 Shared Task: Evaluating Semantic Textual Similarity and Textual Entailment in Portuguese. In this volume. CEUR-WS.org (2020).
10. Hartmann, N.S.: Solo Queue at ASSIN: Combinando abordagens tradicionais e emergentes. Linguamática 8, 59–64 (2016).
11. Barbosa, L., Cavalin, P., Guimarães, V., Kormaksson, M.: Blue Man Group at ASSIN: Using Distributed Representations for Semantic Similarity and Entailment Recognition. Linguamática 8, 15–22 (2016).
12. Barrow, J., Peskov, D.: UMDeep at SemEval-2017 Task 1: End-to-End Shared Weight LSTM Model for Semantic Textual Similarity. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 180–184. Association for Computational Linguistics, Stroudsburg, PA, USA (2017).
13. Hartmann, N.S., Fonseca, E., Shulby, C.D., Treviso, M.V., Rodrigues, J.S., Aluísio, S.M.: Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks. In: Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, pp. 122–131. Sociedade Brasileira de Computação, Uberlândia, MG, Brazil (2017).
14. Alves, A., Gonçalo Oliveira, H., Rodrigues, R., Encarnação, R.: ASAPP 2.0: Advancing the state-of-the-art of semantic textual similarity for Portuguese. In: Proceedings of the 7th Symposium on Languages, Applications and Technologies (SLATE 2018), pp. 1–12. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik (2018).
15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013).
16. de Souza, J.V.A., Oliveira, L.E.S., Gumiel, Y.B., Carvalho, D.R., Moro, C.M.C.: Exploiting Siamese Neural Networks on Short Text Similarity Tasks for Multiple Domains and Languages. In: Computational Processing of the Portuguese Language - 13th International Conference, PROPOR 2020, Évora, Portugal, March 2-4, 2020, Proceedings (2020).