Wide And Deep Transformers Applied to Semantic Relatedness and Textual Entailment

Evandro Fonseca, João Paulo Reis Alvarenga
STILINGUE
{evandro, joaopaulo}@stilingue.com.br

Abstract. In this paper we present our approach to semantic relatedness and textual entailment, the two tasks proposed in ASSIN-2 (the second evaluation of semantic relatedness and textual entailment). We develop 18 features that explore lexical, syntactic, and semantic information. To train the models, we applied both traditional supervised machine learning and an architecture based on wide-and-deep learning. Our proposal proved competitive with current state-of-the-art models and with the other participants' models for Portuguese, especially when the mean squared error is considered.

Keywords: Semantic Relatedness, Textual Entailment, Natural Language Processing

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In this paper we present an approach to Semantic Relatedness assessment and Textual Entailment recognition. The Semantic Relatedness (SR) task measures the degree of semantic relatedness of a sentence pair by assigning a relatedness score ranging from 1 (completely unrelated) to 5 (very related). For example, in the pair (1) [Um homem está tocando o trompete – Alguém está brincando com um sapo] [A man is playing the trumpet – Someone is playing with a frog], the defined value is 1, because the two sentences relate very distinct topics. In (2) [Um lêmure está lambendo o dedo de uma pessoa – Um lêmure está mordendo o dedo de uma pessoa] [A lemur is licking a person's finger – A lemur is biting a person's finger], the meaning is not the same, but the sentences share some aspects, so we can infer that the relatedness value is close to 3 (the samples in this section were collected from the ASSIN-2 test corpus). And in cases like (3) [Um garoto está fazendo um discurso – Um garoto está falando] [A boy is giving a speech – A boy is talking], the relatedness value is close to 5.

Textual Entailment (TE) consists of recognizing whether a sentence A entails a sentence B; in other words, this task consists in deciding whether we may conclude B from A. Thus, considering Textual Entailment and the previous examples ((1), (2), and (3)), we can assert, respectively, "none", "none", and "entailment". Semantic Relatedness and Textual Entailment are both very important tasks and also great challenges, because they depend on many processing levels, such as part-of-speech tagging, sentiment analysis, and coreference resolution, among others. Moreover, when we deal with lower-resourced languages like Portuguese, these challenges are even greater, due to the lack of dense semantic resources such as YAGO [19] and FrameNet [2].

The paper is organized as follows: Section 2 presents related work; in Section 3 we describe our proposed models; in Section 4 we show a summary of the official ASSIN-2 [16] results; Section 5 presents an error analysis; and in Section 6 the conclusions and future work are presented.

2 Related Work

Transfer learning is widely used in many NLP tasks, such as sentiment analysis [14], text classification [22], and question answering [20], among others. The reason is clear: transfer learning may significantly improve NLP models [10]. The SR and TE tasks are no different.
In 2018, Devlin et al. [6] proposed an approach based on transfer learning (BERT) to solve the SR and TE tasks for English, achieving a Pearson coefficient of 0.865 for SR and an F1 of 70.1% for TE on the GLUE benchmark [21]. In 2019, several works based on the BERT architecture emerged (also for English), such as RoBERTa [13], whose results were a Pearson of 0.922 for SR and an F1 of 88.2% for TE, and ALBERT [11], with a Pearson of 0.925 for SR and an F1 of 89.2% for TE. The current state of the art for Portuguese is Fonseca's work [7], which proposed the use of neural networks and syntactic tree distances to solve the SR and TE tasks; his model achieved a Pearson of 0.577 for SR and an F1 of 74.2% for TE on the ASSIN-2 corpus. In this competition, his model is the baseline.

In our approach we propose the use of a machine learning architecture named Wide and Deep, which consists of unifying handcrafted features with dense features. Cheng et al. [5] proposed the use of Wide and Deep for recommender systems, and their results were encouraging. We believe that handcrafted features, based on linguistic knowledge, NLP techniques, and a corpus study, may outperform pure deep learning features; however, when we apply handcrafted features alone, the results may not be as good. To show that, we train and test models using both the Wide and Deep architecture and the traditional supervised machine learning architecture. The results show that the Wide and Deep architecture can significantly outperform the traditional machine learning architecture with handcrafted features.

3 Proposed Model

To address the semantic relatedness and textual entailment problems, we propose eighteen features that explore lexical, syntactic, and semantic information. In addition, we use a Wide and Deep Transformer architecture, which mixes our proposed features with deep learning features. In Subsection 3.1 we present our feature set, which is based on related work [7] and was also designed empirically, through a case study of the ASSIN-2 training set (available at https://sites.google.com/view/assin2/).

3.1 Features

1. Sentiment Agreement: returns true when both sentences agree in sentiment polarity [1] and false otherwise (as in the example below, where the polarities diverge).
– O animal está comendo – The animal is eating (+)
– O animal está mordendo uma pessoa – The animal is biting a person (-)

2. Negation Agreement: returns true when both sentences agree in the occurrence of negative terms or expressions, such as "jamais", "nada", "nenhum", "ninguém", "nunca", "não" (never, nothing, none, nobody, never, not), among others. This feature is very relevant for textual entailment; it helps in cases such as:
– O menino está pulando – The boy is jumping
– Ninguém está pulando – Nobody is jumping

3. Synonym: returns the number of synonyms between the two sentences, identified using Onto.PT [8]. This feature helps to improve semantic relatedness, because synonyms are used to refer to the same entity, as in the example below:
– Um garoto está fazendo um discurso – A young man is giving a speech
– Um menino está falando – A boy is talking

4. Hyponym: returns the number of hyponyms between the two sentences. As in the Synonym feature, we use Onto.PT to identify the semantic relations.

5. Verb Similarity: returns the number of similar verbs between the two sentences, recognized using Onto.PT and VerbNet.Br [17]. It helps to identify pairs such as:
– Uma menina está caminhando – A girl is stepping
– Uma menina está andando – A girl is walking
6. Noun Similarity: returns the number of similar nouns between the two sentences. Here we use the synonymy relation provided by Onto.PT and lexical similarity (when two words are exactly equal).
– O garoto está em casa – The young man is at home
– O menino está em casa – The boy is at home

7. Adjective Similarity: returns the number of similar adjectives between the two sentences. As in Noun Similarity, we use the synonymy relation and lexical similarity, but considering only adjectives.

8. Gender: returns the number of tokens that agree in gender (masculine/feminine).

9. Number: returns the number of tokens that agree in number (singular/plural). To identify the number and gender features we use SNLP (Stilingue proprietary software).

10. Jaccard Similarity: returns a real number containing the Jaccard [12] similarity between the two sentences (see the sketch after this list). Here we perform some preprocessing: first, we remove determiners (although determiners may change a referent, in the ASSIN-2 shared task there is an agreement that, for example, "the girl" and "a girl" refer to the same entity); second, we sort the tokens alphabetically (because, to calculate Jaccard, we want to consider just the tokens, not their order in the sentence); finally, we calculate the Jaccard similarity. Each sentence is modified as in the following example:
– A mulher está cortando cebola → cebola cortando está mulher
– The woman is cutting onion → cutting is onion woman

11. Verb+Participle: returns true when both sentences have a verb+participle construction, which does not necessarily have to be the same, as in:
– O urso está sentado – The bear is sitting
– O urso está deitado – The bear is lying down

12. Verb+Participle+Equals: returns true when both sentences have the same verb+participle construction, as in:
– O urso está sentado – The bear is sitting
– O urso está sentado – The bear is sitting

13. Conjunction_E_A: returns true when sentence A has the "e" (and) conjunction, which helps in cases such as:
– Um menino e uma menina estão caminhando – A boy and a girl are walking
– Duas pessoas estão andando – Two people are walking

14. Conjunction_E_B: the same as Conjunction_E_A, but for sentence B.

15. TokensDif: calculates the difference in the number of tokens between sentences A and B, not considering determiners. In the example below, TokensDif returns 2, because sentence A has six tokens and sentence B has four (determiners are not counted):
– Uma mulher não está fritando algum alimento – A woman is not frying any food
– Uma mulher está fritando comida – A woman is frying food

16. Same Word: returns an integer containing the number of exactly equal words in the two sentences (common words). Here we consider just verbs, nouns, and adjectives, and apply only lexical matching.

17. Same Subject: returns true when the sentences have the same subject.

18. Cosine Similarity: returns the cosine similarity of the two sentences, computed over the averaged word vectors of each sentence (see the sketch after this list). Here we use the 300d FastText Skip-Gram embeddings built by NILC [9] (http://nilc.icmc.usp.br/embeddings).
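To make features 10 and 18 concrete, the sketch below shows one possible implementation. It is a minimal illustration rather than our production code: whitespace tokenization and a toy determiner list stand in for the SNLP preprocessing, and `embeddings` stands for any token-to-vector lookup such as the NILC FastText vectors.

```python
# Minimal sketch of features 10 (Jaccard Similarity) and 18 (Cosine Similarity).
# Assumptions: whitespace tokenization and a toy determiner list stand in for
# the SNLP preprocessing; `embeddings` is any token -> vector lookup (the paper
# uses NILC's 300-dimensional FastText Skip-Gram vectors).
import numpy as np

DETERMINERS = {"o", "a", "os", "as", "um", "uma", "uns", "umas"}  # illustrative subset


def preprocess(sentence: str) -> list:
    """Lowercase, drop determiners, and sort tokens alphabetically.
    Sorting mirrors the description above; since Jaccard compares sets,
    token order does not change the final value."""
    tokens = [t for t in sentence.lower().split() if t not in DETERMINERS]
    return sorted(tokens)


def jaccard(a: str, b: str) -> float:
    """Feature 10: |intersection| / |union| of the preprocessed token sets."""
    sa, sb = set(preprocess(a)), set(preprocess(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def cosine_similarity(a: str, b: str, embeddings: dict) -> float:
    """Feature 18: cosine between the averaged word vectors of each sentence."""
    def avg_vector(sentence):
        vecs = [embeddings[t] for t in sentence.lower().split() if t in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(300)

    va, vb = avg_vector(a), avg_vector(b)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0


# After preprocessing, only "cebola"/"comida" differ: Jaccard = 3/5 = 0.6.
print(jaccard("A mulher está cortando cebola", "A mulher está cortando comida"))
```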
3.2 Model Setup and Runs

In the shared task, each participant was encouraged to submit three output files, and each output file could contain results for one or both proposed tasks. We performed experiments using three distinct configurations to produce the models. For the first model we used traditional supervised machine learning: we trained a Random Forest [3] model on our set of proposed features. For the second and third models the Wide and Deep architecture was used, built on the multilingual Universal Sentence Encoder-Large [4] and on BERT-Base multilingual [6], respectively, together with our set of proposed features. Table 1 details the setup of each model.

Table 1. Trained models

Model   Wide and Deep   Random Forest   BERT-Base   Universal Sentence Encoder
1                       x
2       x                                           x
3       x                               x

Using the proposed models we performed three runs, covering the two tasks, as shown in Table 2:

Table 2. Runs and models

Run   Textual Entailment   Semantic Relatedness
1     Model 1              Model 1
2     Model 3              Model 2
3     Model 3              Model 3

In the first run we used just the Random Forest and our set of features. We also tested several other traditional supervised machine learning algorithms, such as Multilayer Perceptron, Linear Regression, Naive Bayes, Decision Table, J48, and Random Tree, among others; however, Random Forest easily outperformed all of them. In the second and third runs we used the Wide and Deep architecture. Note that model 3 was used for textual entailment in both the second and third runs, because in our tests we did not find a model that outperforms the BERT-Base one for the textual entailment task.
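Since the network definition is not published here, the sketch below only illustrates how a wide-and-deep model of this kind can be assembled, using the Keras API: the 18 handcrafted features form the wide input, and a precomputed sentence-pair embedding (e.g., from BERT-Base multilingual or the Universal Sentence Encoder) forms the deep input. The layer sizes and the regression head are assumptions for illustration, not the configuration used in our runs.

```python
# Illustrative wide-and-deep head, NOT the exact network used in our runs.
# Assumptions: the 18 handcrafted features enter as the "wide" input; a
# precomputed sentence-pair embedding (e.g., from BERT-Base multilingual or
# the Universal Sentence Encoder) enters as the "deep" input; layer sizes
# and the single regression head are guesses for illustration.
from tensorflow.keras import Model, layers


def build_wide_and_deep(wide_dim: int = 18, deep_dim: int = 768) -> Model:
    wide_in = layers.Input(shape=(wide_dim,), name="handcrafted_features")
    deep_in = layers.Input(shape=(deep_dim,), name="sentence_pair_embedding")

    # Deep tower: compress the dense sentence-pair representation.
    deep = layers.Dense(256, activation="relu")(deep_in)
    deep = layers.Dense(64, activation="relu")(deep)

    # Wide path: the handcrafted features skip straight to the output layer,
    # where they are concatenated with the deep representation.
    merged = layers.concatenate([wide_in, deep])

    # Regression head for semantic relatedness (scores from 1 to 5); for
    # textual entailment a sigmoid classification head would be used instead.
    out = layers.Dense(1, activation="linear", name="relatedness")(merged)

    model = Model(inputs=[wide_in, deep_in], outputs=out)
    model.compile(optimizer="adam", loss="mse")
    return model


model = build_wide_and_deep()
model.summary()
```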
4 Results

Table 3 shows the results of the ASSIN-2 shared task for the two proposed tasks (for the baseline model, we unify its runs and show only its best results).

Table 3. ASSIN-2 results

                               Textual Entailment      Semantic Relatedness
Team                   Run     F1        Accuracy      Pearson     MSE
Stilingue (ours)       1       0.788     78.84         0.748       0.53
                       2       0.866     86.64         0.800       0.39
                       3       0.866     86.64         0.817       0.47
IPR                    1       0.876     87.58         0.826       0.52
Deep Learning Brasil   1       0.883     88.32         0.785       0.59
Baseline [7]           1       0.742     74.18         0.577       0.75

There is a large gap between the Wide and Deep architecture and traditional supervised machine learning. Compared with the best (winning) models, our models present very close results: our model achieved an F1 1.7 points lower and an accuracy 1.68 points lower for the TE task; for the SR task, our model presented a Pearson coefficient 0.009 lower (for run 3) and 0.026 lower (for run 2). However, our model presents a better mean squared error (MSE). It is known that the MSE penalizes outliers; thus, we can say that our model's predictions deviate from the gold scores less severely than those of the other models. An error analysis is presented in Section 5.
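For reference, the two SR metrics reported in Table 3 can be computed as in the short sketch below; `gold` and `pred` are placeholder arrays, not ASSIN-2 data.

```python
# Computing the two semantic-relatedness metrics reported in Table 3.
# `gold` and `pred` are placeholder arrays, not ASSIN-2 data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import mean_squared_error

gold = np.array([1.0, 3.2, 4.8, 2.5])  # annotated relatedness scores
pred = np.array([1.4, 3.0, 4.5, 2.9])  # model predictions

r, _ = pearsonr(gold, pred)            # linear correlation: higher is better
mse = mean_squared_error(gold, pred)   # squares each error, so large misses dominate

print(f"Pearson: {r:.3f}  MSE: {mse:.3f}")
```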
5 Error Analysis

In Table 4 it is possible to see that there are over 1550 pairs whose predicted score lies very near the gold score (errors between 0.0 and 0.4); for errors from 0.5 to 0.9 there are 618 pairs. It is important to say that a difference in this range is acceptable, since even human annotators often disagree within it during the annotation process.

Table 4. Semantic relatedness error ranges

Error Range    Instances
0.0 – 0.4      1559
0.5 – 0.9      618
1.0 – 1.9      261
2.0 – 2.9      9
3.0 – 5.0      1

We can also see that, among all 2448 pairs of the test corpus, there is just one sample with an error above 3.0. In this pair there are many identical words, but the sentences refer to distinct facts; for the example below, our model predicted a similarity of 4.5 while the gold score is 1.5.

– um cachorro preto e um branco estão correndo alegremente na grama – a black and a white dog are running happily in the grass
– uma pessoa negra vestindo branco está correndo alegremente com o cachorro na grama – a black person wearing white is running happily with the dog on the grass

Regarding the TE task, we found two main error types. The first refers to cases with referential expressions, such as:

– Um chefe mexicano está preparando uma refeição – A Mexican chef is preparing a meal
– Um chefe mexicano está cozinhando – A Mexican chef is cooking

– Um menino está fazendo um discurso – A boy is giving a speech
– Um menino está falando – A boy is speaking

Here we identified a limitation of our model: our synonymy feature considers only single words, so paraphrases expressed by multiword expressions (e.g., "fazendo um discurso" vs. "falando") are missed. The second main error occurs when sentence A has a verb+participle construction and sentence B has a gerund, or vice-versa, such as:

– O pelo de um gato está sendo penteado por uma garota – A cat's fur is being combed by a girl
– Uma pessoa está penteando o pelo de um gato – A person is combing a cat's fur

– O cara está comendo uma banana – The guy is eating a banana
– Uma banana está sendo comida por um cara – A banana is being eaten by a guy

6 Conclusion and Future Work

In this paper we presented our models for two important tasks, Semantic Relatedness and Textual Entailment. Our models are based on 18 features that cover natural language patterns and on the Wide and Deep architecture, which explores the mix between our linguistic features and deep learning features. The results show that our models are competitive. Moreover, although MSE is not the official metric, we believe that the model we built for the semantic relatedness task provides a good solution, mainly when a more reliable model, with fewer outliers, is needed. As future work, we want to improve our semantic features in order to recognize referential expressions. We also intend to explore ConceptNet [18] and BabelNet [15] to provide more robust semantic knowledge to our models.

References

1. L. V. Avanço and M. d. G. V. Nunes. Lexicon-based sentiment analysis for reviews of products in Brazilian Portuguese. In 2014 Brazilian Conference on Intelligent Systems, pages 277-281. IEEE, 2014.
2. C. F. Baker, C. J. Fillmore, and J. B. Lowe. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics, pages 86-90, Quebec, Canada, 1998.
3. R. R. Bouckaert, E. Frank, M. Hall, R. Kirkby, P. Reutemann, A. Seewald, and D. Scuse. WEKA manual for version 3-7-8, 2013.
4. D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, and R. Kurzweil. Universal sentence encoder. CoRR, abs/1803.11175, 2018.
5. H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7-10. ACM, 2016.
6. J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
7. E. R. Fonseca. Reconhecimento de implicação textual em português [Textual entailment recognition in Portuguese]. PhD thesis, Universidade de São Paulo, 2018.
8. H. Gonçalo Oliveira and P. Gomes. ECO and Onto.PT: A flexible approach for creating a Portuguese wordnet automatically. Language Resources and Evaluation, 48(2):373-393, 2014.
9. N. Hartmann, E. Fonseca, C. Shulby, M. Treviso, J. Rodrigues, and S. Aluisio. Portuguese word embeddings: Evaluating on word analogies and natural language tasks. arXiv preprint arXiv:1708.06025, 2017.
10. J. Howard and S. Ruder. Fine-tuned language models for text classification. CoRR, abs/1801.06146, 2018.
11. Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. CoRR, abs/1909.11942, 2019.
12. M. Levandowsky and D. Winter. Distance between sets. Nature, 234(5323):34-35, 1971.
13. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692, 2019.
14. A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 142-150, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
15. R. Navigli and S. P. Ponzetto. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217-250, 2012.
16. L. Real, E. Fonseca, and H. Gonçalo Oliveira. The ASSIN 2 shared task: Evaluating Semantic Textual Similarity and Textual Entailment in Portuguese. In Proceedings of the ASSIN 2 Shared Task: Evaluating Semantic Textual Similarity and Textual Entailment in Portuguese, CEUR Workshop Proceedings, page [In this volume]. CEUR-WS.org, 2020.
17. C. E. Scarton. VerbNet.Br: construção semiautomática de um léxico verbal online e independente de domínio para o português do Brasil [VerbNet.Br: semi-automatic construction of an online, domain-independent verb lexicon for Brazilian Portuguese]. 2013.
18. R. Speer and C. Havasi. Representing general relational knowledge in ConceptNet 5. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, pages 3679-3686, 2012.
19. F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697-706, Banff, AB, Canada, 2007.
20. E. M. Voorhees. The TREC question answering track. Natural Language Engineering, 7(4):361-378, 2001.
21. A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. CoRR, abs/1804.07461, 2018.
22. X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 649-657, Cambridge, MA, USA, 2015. MIT Press.