<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bidirectional Semantic Matching with Deep Contextualized Word Embedding for Chinese Sentence Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kunxun Qi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianfeng Du</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qiqi Ou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Linxi Jin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jinglan Zhong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science and Technology, Guangdong University of Foreign Studies</institution>
          ,
          <addr-line>Guangzhou 510006</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, a bidirectional matching model is proposed to identify whether two Chinese sentences are paraphrases of each other. The model adapts the well-known BiMPM model in two main aspects. On the one hand, it exploits a deep contextualized model named ELMo to generate the input word embedding. On the other hand, three out of four bidirectional matching mechanisms in BiMPM are carefully selected to model the interaction between two sentences. The proposed model is evaluated on a dataset of Chinese sentence pairs from CCKS 2018. Experimental results show that the model achieves an 86.2% F1-score on the validation set and an 84.6% F1-score on the test set.</p>
      </abstract>
      <kwd-group>
        <kwd>Sentence Matching</kwd>
        <kwd>Chinese Sentence Pairs</kwd>
        <kwd>Deep Neural Network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Modeling the relation between two natural language sentences is fundamental to many natural language processing (NLP) tasks, such as paraphrase identification (PI) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and textual entailment (TE) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In the paraphrase identification task, we identify whether two sentences are paraphrases of each other. In the textual entailment task, we estimate whether one sentence can be inferred from another.
      </p>
      <p>
        In recent years, neural network models have been widely used in modeling sentence pairs. Two advanced frameworks have been proposed in previous work. The first framework usually implements two weight-sharing sentence encoders, such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN), to represent a sentence pair as two low-dimensional real-valued vectors u1, u2 and then makes a prediction based on the two vectors. This framework usually constructs a feature vector, such as (u1, u2, |u1 − u2|, u1 ∗ u2), and feeds it into a fully-connected network followed by a softmax layer to make the final prediction. Some typical methods in this framework include BCNN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], InferSent [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and SWEMs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This framework pays more attention to constructing sentence encoders but ignores the relevance between the two sentences. Existing empirical studies reveal that this framework cannot achieve state-of-the-art performance. This limitation may be caused by the loss of interactive information between the two sentences. To further improve the performance, the second framework studies how to learn the interaction between two sentences. It usually calculates the relevance between the two sentences by using a variety of attention mechanisms. The prominent methods in this framework include ABCNN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], ESIM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and BiMPM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this paper, we implement three out of four bidirectional matching mechanisms in BiMPM to calculate the interaction between two sentences, namely full-matching, attentive-matching and max-attentive-matching. We do not use the maxpooling-matching mechanism because it is time-consuming and hard to evaluate in our experiments.
      </p>
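      <p>To make the classical encoding framework concrete, the following is a minimal sketch (in PyTorch, with names and sizes of our own choosing, not taken from any of the cited papers) of the prediction head that classifies a sentence pair from the feature vector (u1, u2, |u1 − u2|, u1 ∗ u2):</p>
      <preformat>
import torch
import torch.nn as nn

class PairClassifierHead(nn.Module):
    """Classifies a sentence pair from two encoder outputs u1 and u2."""

    def __init__(self, dim: int, num_classes: int = 2):
        super().__init__()
        # the feature vector concatenates four dim-sized pieces
        self.ffn = nn.Sequential(
            nn.Linear(4 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, num_classes),
        )

    def forward(self, u1: torch.Tensor, u2: torch.Tensor) -> torch.Tensor:
        feats = torch.cat([u1, u2, (u1 - u2).abs(), u1 * u2], dim=-1)
        return self.ffn(feats).softmax(dim=-1)  # class probabilities

# usage: u1 = encoder(sent1); u2 = encoder(sent2)  # one weight-sharing encoder
</preformat>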
      <p>
        All the above approaches use word embedding as input. Word embedding aims to represent the tokens of textual documents as low-dimensional real-valued vectors. It has been widely used in a broad range of NLP tasks, such as named entity recognition (NER), part-of-speech (POS) tagging, question answering (QA), textual entailment (TE) and machine comprehension (MC). The most famous word embedding models are Word2vec [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and GloVe [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which have demonstrated advanced performance in a variety of NLP tasks. However, most of these models generate pre-trained word vectors only for the tokens that occur in the training corpus, which means that out-of-vocabulary (OOV) words have no representation. One common solution is to initialize the word embedding randomly and update the word vectors during training, but this easily incurs overfitting. Another solution is to use N-gram features when training the word embedding. For example, FastText [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] trains word embedding by predicting the labels of
documents. It is applicable to the document classification task but is not suitable for
sentence modeling tasks. Recently, a new type of deep contextualized word
representation, ELMo [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], has been proposed to alleviate the impact of wrongly written or mispronounced characters, incorrect Chinese word segmentation and OOV words. It has been demonstrated to improve the performance on six challenging NLP tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. ELMo generates word vectors from the input character sequences and the representations of the surrounding words in a sentence. In this paper, we train an ELMo model on a Chinese Wikipedia corpus and use it to generate word vectors.
      </p>
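      <p>For illustration, a hedged sketch of producing such word vectors with AllenNLP's ElmoEmbedder (the 0.x API) is given below; the option and weight file paths for a Chinese-Wikipedia-trained model are hypothetical placeholders, and averaging the three layers is merely one simple mixing choice:</p>
      <preformat>
from allennlp.commands.elmo import ElmoEmbedder

# hypothetical paths to a Chinese ELMo trained on Wikipedia
elmo = ElmoEmbedder(
    options_file="zhwiki_elmo_options.json",
    weight_file="zhwiki_elmo_weights.hdf5",
)

tokens = ["花呗", "如何", "还款"]      # a Jieba-segmented sentence
layers = elmo.embed_sentence(tokens)  # ndarray of shape (3, len(tokens), dim)
word_vectors = layers.mean(axis=0)    # one simple way to mix the three layers
</preformat>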
      <p>In this study, our model is evaluated on a dataset of Chinese sentence pairs from CCKS 2018. Experimental results show that the model achieves an 86.2% F1-score on the validation set and an 84.6% F1-score on the test set.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        There are many studies on modeling sentence pairs. In this section, we only review previous deep learning methods; we refer the interested reader to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for other methods. There are two major deep learning frameworks for modeling sentence pairs, namely the classical encoding framework and the attention-based encoding framework.
      </p>
      <fig id="fig1">
        <caption>
          <p>Fig. 1. The architecture of the proposed model: a Highway network layer over the word representations, a context representation layer, a matching layer with bidirectional full-matching, attentive-matching and max-attentive-matching, an aggregating layer, and a prediction layer that applies a softmax to produce Pr(y|P, Q).</p>
        </caption>
      </fig>
      <p>
        Methods in the classical encoding framework employ two weight-sharing classical encoders, such as a CNN or an RNN, to generate two vector representations for the two input sentences.
BCNN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] used two weight-sharing CNNs to generate two sentence representations and constructed a feature vector by concatenating the two vectors. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] implemented two
bidirectional LSTM (BiLSTM) networks as sentence encoders. SWEMs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] employed
two hierarchical pooling encoders instead of using any CNNs or RNNs. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] modeled sentence pairs by using the Transformer [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] encoder, which is a recent network
architecture that makes use of self-attention [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] mechanism.
      </p>
      <p>
        On the basis of the classical encoding framework, methods in the attention-based encoding framework employ various attention mechanisms, based on the similarity between the two sentences, to adjust the two representations. ABCNN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] enhanced the BCNN [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] by employing an attention
feature matrix to learn interactive information. ESIM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] employed a Tree-LSTM (Long Short-Term Memory) network as the sentence encoder and calculated the relevance between two sentences by applying a local inference modeling layer. BiMPM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed four effective bidirectional matching mechanisms to learn the interactive information.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Adaptation of BiMPM with ELMo</title>
      <p>
        Our proposed model is shown in Figure 1. The input of our model has two parts for each sentence. The first part is the word embedding generated by ELMo. The second part is the character embedding created by a bidirectional LSTM (BiLSTM) network over randomly initialized character embeddings. The concatenated vector from these two parts is fed into a Highway network [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to generate two sequences of word vectors. The two sequences of word vectors are fed into the contextual representation layer to learn contextual representations. Three bidirectional matching mechanisms in BiMPM are employed in the matching layer to calculate the interaction between the two sentences. The two sequences of matching vectors are fed into the aggregating layer to generate the feature vector, which is used to make the final prediction in the prediction layer.
      </p>
      <sec id="sec-3-1">
        <title>Word Representation Layer</title>
        <p>This layer generates a d-dimensional vector for each word in the input sentences. There are two parts in this layer. The first part is the ELMo-generated word embedding. We train an ELMo model on a Chinese Wikipedia corpus (https://zh.wikipedia.org/wiki/) and use it to generate word vectors. The second part is the character embedding. We randomly initialize a fixed-dimensional vector for each character within a word. These character vectors are fed into a BiLSTM network to compose word vectors, and we pick the last hidden state of the BiLSTM network as the character-composed representation of each word. We feed the concatenated vector from these two parts into a Highway network to generate the final word vector.</p>
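        <p>A minimal sketch of this layer is given below, assuming our own sizes (the character BiLSTM hidden size and the ELMo dimension are not fixed by the text) and the standard one-layer Highway formulation:</p>
        <preformat>
import torch
import torch.nn as nn

class Highway(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = torch.sigmoid(self.gate(x))    # transform gate
        h = torch.relu(self.transform(x))  # candidate representation
        return t * h + (1.0 - t) * x       # gated mix with the input

class WordRepresentation(nn.Module):
    def __init__(self, num_chars: int, char_dim: int = 20,
                 char_hidden: int = 50, elmo_dim: int = 1024):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)
        self.highway = Highway(elmo_dim + 2 * char_hidden)

    def forward(self, char_ids, elmo_vec):
        # char_ids: (num_words, chars_per_word); elmo_vec: (num_words, elmo_dim)
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        char_vec = torch.cat([h_n[0], h_n[1]], dim=-1)  # last fwd/bwd states
        return self.highway(torch.cat([elmo_vec, char_vec], dim=-1))
</preformat>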
      </sec>
      <sec id="sec-3-2">
        <title>Contextual Representation Layer</title>
        <p>This layer generates the contextual representations of the two sentences by using two BiLSTM networks. The weights of these two networks are shared during training.</p>
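        <p>A sketch of the weight sharing, under the hidden size of 100 reported in Section 4.2: using the same BiLSTM module for both sentences shares its weights by construction (the stand-in inputs below are placeholders):</p>
        <preformat>
import torch
import torch.nn as nn

word_dim = 1124  # e.g. ELMo (1024) + character BiLSTM (100); an assumption
context_lstm = nn.LSTM(word_dim, 100, bidirectional=True, batch_first=True)

words_p = torch.randn(1, 7, word_dim)  # stand-in word vectors of sentence P
words_q = torch.randn(1, 9, word_dim)  # stand-in word vectors of sentence Q
ctx_p, _ = context_lstm(words_p)       # (1, 7, 200) fwd and bwd states
ctx_q, _ = context_lstm(words_q)       # same module, hence shared weights
</preformat>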
      </sec>
      <sec id="sec-3-3">
        <title>Matching Layer</title>
        <p>This layer calculates the interactive information between two sentences. We apply three out of four bidirectional matching mechanisms in BiMPM, including the full-matching mechanism, the attentive-matching mechanism and the max-attentive-matching mechanism. All of them rely on a multi-perspective cosine matching function $f_m$ to calculate the relevance between two contextual representations:
$$m = f_m(v_1, v_2; W). \quad (1)$$</p>
        <p>In Eq. (1), $v_1$ and $v_2$ are hidden states of the two BiLSTM networks in the contextual representation layer; both are d-dimensional vectors. $W \in \mathbb{R}^{l \times d}$ is a trainable parameter, and $l$ is a hyperparameter denoting the number of matching perspectives. Each element $m_k \in m$ is the matching value of the $k$-th perspective, calculated by a cosine similarity function
$$m_k = \mathrm{cosine}(W_k \circ v_1, W_k \circ v_2), \quad (2)$$
where $\circ$ is the element-wise multiplication and $W_k$ is the $k$-th row of $W$.</p>
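        <p>A minimal sketch of Eqs. (1)-(2): each row $W_k$ of the trainable matrix rescales the two vectors element-wise before a cosine similarity, yielding one matching value per perspective (the function name is ours):</p>
        <preformat>
import torch

def multi_perspective_match(v1, v2, W):
    # v1, v2: (d,) hidden states; W: (l, d) trainable parameter
    # broadcasting gives (l, d) tensors; cosine over the last dim gives (l,)
    return torch.cosine_similarity(W * v1, W * v2, dim=-1)

# usage: W = torch.nn.Parameter(torch.rand(l, d))
#        m = multi_perspective_match(v1, v2, W)  # m[k] realizes Eq. (2)
</preformat>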
        <p>Further, we apply the three bidirectional matching mechanisms to calculate the interactive features of each time-step of one sentence against all time-steps of the other sentence. In the following, $\overrightarrow{h}_i^p$ (or $\overleftarrow{h}_i^p$) denotes the forward (or backward) contextual representation of the $i$-th time-step of sentence $P$, and $\overrightarrow{h}_j^q$ (or $\overleftarrow{h}_j^q$) the corresponding representation of sentence $Q$ with $N$ time-steps.</p>
        <p>Full-Matching. This matching mechanism calculates the interactive features between each contextual representation $\overrightarrow{h}_i^p$ (or $\overleftarrow{h}_i^p$) and the last time-step of the contextual representation of the other sentence, $\overrightarrow{h}_N^q$ (or $\overleftarrow{h}_1^q$):
$$\overrightarrow{m}_i^{full} = f_m(\overrightarrow{h}_i^p, \overrightarrow{h}_N^q; W^1), \quad (3)$$
$$\overleftarrow{m}_i^{full} = f_m(\overleftarrow{h}_i^p, \overleftarrow{h}_1^q; W^2). \quad (4)$$</p>
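        <p>A sketch of Eqs. (3)-(4) (tensor names and layout are our own; the forward and backward halves of the BiLSTM outputs are assumed to be concatenated):</p>
        <preformat>
import torch

def full_matching(ctx_p, ctx_q, W1, W2):
    # ctx_p: (M, 2h), ctx_q: (N, 2h); W1, W2: (l, h)
    h = ctx_p.size(-1) // 2
    fwd_p, bwd_p = ctx_p[:, :h], ctx_p[:, h:]
    fwd_q, bwd_q = ctx_q[:, :h], ctx_q[:, h:]
    # every forward state of P against the last forward state of Q, Eq. (3)
    m_fwd = torch.cosine_similarity(W1 * fwd_p.unsqueeze(1),
                                    W1 * fwd_q[-1], dim=-1)  # (M, l)
    # every backward state of P against the first backward state of Q, Eq. (4)
    m_bwd = torch.cosine_similarity(W2 * bwd_p.unsqueeze(1),
                                    W2 * bwd_q[0], dim=-1)   # (M, l)
    return torch.cat([m_fwd, m_bwd], dim=-1)                 # (M, 2l)
</preformat>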
        <p>Attentive-Matching. This matching mechanism calculates the interactive features between each contextual representation and a weighted sum of the contextual representations of the other sentence. Firstly, we calculate the cosine similarity between every pair of contextual representations:
$$\overrightarrow{\alpha}_{i,j} = \mathrm{cosine}(\overrightarrow{h}_i^p, \overrightarrow{h}_j^q), \qquad \overleftarrow{\alpha}_{i,j} = \mathrm{cosine}(\overleftarrow{h}_i^p, \overleftarrow{h}_j^q). \quad (5)$$</p>
        <p>Then, we use $\overrightarrow{\alpha}_{i,j}$ (or $\overleftarrow{\alpha}_{i,j}$) as the weight of $\overrightarrow{h}_j^q$ (or $\overleftarrow{h}_j^q$) and generate an attentive contextual representation as the weighted sum over all time-steps of the contextual representations of the other sentence:
$$\overrightarrow{h}_i^{mean} = \frac{\sum_{j=1}^{N} \overrightarrow{\alpha}_{i,j} \cdot \overrightarrow{h}_j^q}{\sum_{j=1}^{N} \overrightarrow{\alpha}_{i,j}}, \qquad \overleftarrow{h}_i^{mean} = \frac{\sum_{j=1}^{N} \overleftarrow{\alpha}_{i,j} \cdot \overleftarrow{h}_j^q}{\sum_{j=1}^{N} \overleftarrow{\alpha}_{i,j}}. \quad (6)$$</p>
        <p>Finally, we calculate the interactive features between each contextual representation $\overrightarrow{h}_i^p$ (or $\overleftarrow{h}_i^p$) and the attentive contextual representation $\overrightarrow{h}_i^{mean}$ (or $\overleftarrow{h}_i^{mean}$):
$$\overrightarrow{m}_i^{att} = f_m(\overrightarrow{h}_i^p, \overrightarrow{h}_i^{mean}; W^3), \qquad \overleftarrow{m}_i^{att} = f_m(\overleftarrow{h}_i^p, \overleftarrow{h}_i^{mean}; W^4). \quad (7)$$</p>
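        <p>A sketch of Eqs. (5)-(7) for the forward direction (the backward direction is symmetric with its own weight matrix $W^4$; names are ours):</p>
        <preformat>
import torch

def attentive_matching_fwd(fwd_p, fwd_q, W3):
    # fwd_p: (M, h), fwd_q: (N, h), W3: (l, h)
    alpha = torch.cosine_similarity(fwd_p.unsqueeze(1),
                                    fwd_q.unsqueeze(0), dim=-1)  # (M, N), Eq. (5)
    # weighted sum over Q's time-steps, normalized by the weights, Eq. (6)
    h_mean = alpha @ fwd_q / alpha.sum(dim=1, keepdim=True)      # (M, h)
    # match each state of P against its attentive representation, Eq. (7)
    return torch.cosine_similarity(W3 * fwd_p.unsqueeze(1),
                                   W3 * h_mean.unsqueeze(1), dim=-1)  # (M, l)
</preformat>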
        <p>Max-Attentive-Matching. This matching mechanism uses the most similar contextual representation of the other sentence, i.e., the one with the maximum cosine similarity, as the attentive representation:
$$\overrightarrow{h}_i^{max} = \overrightarrow{h}_{j^*}^q \;\text{with}\; j^* = \mathop{\arg\max}_j \overrightarrow{\alpha}_{i,j}, \qquad \overleftarrow{h}_i^{max} = \overleftarrow{h}_{j^*}^q \;\text{with}\; j^* = \mathop{\arg\max}_j \overleftarrow{\alpha}_{i,j}. \quad (8)$$</p>
        <p>We then calculate the interactive features between each contextual representation $\overrightarrow{h}_i^p$ (or $\overleftarrow{h}_i^p$) and the max-attentive contextual representation $\overrightarrow{h}_i^{max}$ (or $\overleftarrow{h}_i^{max}$):
$$\overrightarrow{m}_i^{max} = f_m(\overrightarrow{h}_i^p, \overrightarrow{h}_i^{max}; W^5), \qquad \overleftarrow{m}_i^{max} = f_m(\overleftarrow{h}_i^p, \overleftarrow{h}_i^{max}; W^6). \quad (9)$$</p>
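        <p>A sketch of Eqs. (8)-(9) for the forward direction (again, the backward direction is symmetric with $W^6$; names are ours):</p>
        <preformat>
import torch

def max_attentive_matching_fwd(fwd_p, fwd_q, W5):
    # fwd_p: (M, h), fwd_q: (N, h), W5: (l, h)
    alpha = torch.cosine_similarity(fwd_p.unsqueeze(1),
                                    fwd_q.unsqueeze(0), dim=-1)  # (M, N)
    h_max = fwd_q[alpha.argmax(dim=1)]                           # (M, h), Eq. (8)
    return torch.cosine_similarity(W5 * fwd_p.unsqueeze(1),
                                   W5 * h_max.unsqueeze(1), dim=-1)  # (M, l), Eq. (9)
</preformat>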
      </sec>
      <sec id="sec-3-4">
        <title>Aggregating Layer</title>
        <p>This layer employs two BiLSTM networks to aggregate the two sequences of matching vectors individually. The four last hidden states of the two BiLSTM networks are concatenated to compose the feature vector.</p>
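        <p>Under our reading, the layer can be sketched as follows (one BiLSTM per matching sequence; sizes are assumptions):</p>
        <preformat>
import torch
import torch.nn as nn

class Aggregator(nn.Module):
    def __init__(self, match_dim: int, hidden: int = 100):
        super().__init__()
        self.lstm_p = nn.LSTM(match_dim, hidden,
                              bidirectional=True, batch_first=True)
        self.lstm_q = nn.LSTM(match_dim, hidden,
                              bidirectional=True, batch_first=True)

    def forward(self, match_p, match_q):
        # match_p: (batch, M, match_dim); match_q: (batch, N, match_dim)
        _, (hp, _) = self.lstm_p(match_p)  # hp: (2, batch, hidden)
        _, (hq, _) = self.lstm_q(match_q)
        # concatenate the four last hidden states -> (batch, 4 * hidden)
        return torch.cat([hp[0], hp[1], hq[0], hq[1]], dim=-1)
</preformat>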
      </sec>
      <sec id="sec-3-5">
        <title>Prediction Layer</title>
        <p>This layer employs a two-layer feed-forward network and a softmax transformation function to calculate the probability distribution Pr(y|P, Q).</p>
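        <p>A corresponding sketch, with layer sizes assumed from the four concatenated 100-dimensional hidden states of the aggregating layer:</p>
        <preformat>
import torch.nn as nn

prediction = nn.Sequential(
    nn.Linear(400, 200),  # 4 * hidden = 400 under the sizes assumed above
    nn.ReLU(),
    nn.Linear(200, 2),    # paraphrase vs. non-paraphrase
    nn.Softmax(dim=-1),   # probability distribution Pr(y|P, Q)
)
</preformat>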
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Dataset and Evaluation</title>
        <p>In the CCKS 2018 challenge, the organizers provided 100,000 labeled Chinese sentence pairs as the training set, 10,000 unlabeled sentence pairs as the validation set and 110,000 unlabeled sentence pairs as the test set.</p>
        <p>All the evaluation results are calculated by an official evaluation system for the CCKS 2018 challenge. The evaluation system computes four metrics, including micro-average precision (Prec.), recall (Rec.), F1-score (F1) and accuracy (Acc.), on the validation set and the test set.</p>
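        <p>For reference, the four metrics can be computed from pair-level predictions as below (our own helper treating label 1 as the positive class, not the official evaluation system):</p>
        <preformat>
def metrics(gold, pred):
    """Precision, recall, F1 and accuracy over binary labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    acc = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    return prec, rec, f1, acc
</preformat>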
      </sec>
      <sec id="sec-4-2">
        <title>Experiment Settings and Results</title>
        <p>We train an ELMo model on a 3.3GB Chinese Wikipedia corpus. Both the corpus and the dataset are processed by the Jieba tool (https://github.com/fxsjy/jieba) for Chinese word segmentation. We use the ELMo-generated word vectors to initialize the word embedding layer and do not update them during training. We initialize the 20-dimensional character vectors randomly. We utilize a 1-layer Highway network to generate the final word representation. We set the hidden size to 100 for all BiLSTM networks. We apply dropout to each layer in Figure 1 with a dropout ratio of 0.5. We set the learning rate to 0.0005 for the Adam optimizer and to 3 for the Adadelta optimizer. We generate three results by applying the Adadelta optimizer twice and the Adam optimizer once, and apply a voting mechanism to the three results to generate the final prediction.</p>
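        <p>The voting step can be sketched as a simple majority vote over the three runs (our own minimal formulation):</p>
        <preformat>
from collections import Counter

def majority_vote(run1, run2, run3):
    """Per-pair majority label over three prediction lists."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(run1, run2, run3)]

# e.g. majority_vote([1, 0, 1], [1, 1, 0], [0, 0, 1]) -> [1, 0, 1]
</preformat>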
        <p>Table 1 shows the performance of prior methods on the validation set. We evaluate seven state-of-the-art models as baselines, grouped into the classical encoding framework and the attention-based encoding framework. We can see that models in the attention-based encoding framework have better performance than those in the classical (sentence encoding-based) framework. All the baseline models use word2vec embedding as input.
In the classical encoding framework, we apply four sentence encoders: CNN, Hierarchical Pooling, Transformer and BiLSTM. Transformer and BiLSTM perform best among them, achieving 82.9% and 82.6% F1-scores respectively. In the attention-based encoding framework, we evaluate four baseline models: ABCNN-2, ABCNN-2 (Multi-Perspective), ESIM and BiMPM, where ABCNN-2 (Multi-Perspective) is an implementation of the ABCNN-2 model with different kernel sizes. We can see that ESIM and BiMPM achieve better performance than ABCNN. Our model achieves the highest single-model performance with an 85.0% F1-score. Finally, we employ a voting mechanism to merge different results of our model; it achieves the best performance on the validation set with an 86.2% F1-score and an 84.6% F1-score on the test set.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this study, we have proposed a model adapted from BiMPM. The model implements three out of four bidirectional matching mechanisms in BiMPM and exploits ELMo to generate the word embedding. The final prediction of our adapted model is given by voting over three results of our model obtained with different hyperparameters. We evaluate our model on the dataset of Chinese sentence pairs from CCKS 2018. Experimental results reveal that the model achieves an 86.2% F1-score on the validation set and an 84.6% F1-score on the test set, ranking fifth in this challenge.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This work was partly supported by the National Natural Science Foundation of China (61375056), the Science and Technology Program of Guangzhou (201804010496), and the Scientific Research Innovation Team in Department of Education of Guangdong Province (2017KCXTD013).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep Contextualized Word Representations</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT)</source>
          . pp.
          <fpage>2227</fpage>
          -
          <lpage>2237</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamza</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Florian</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Bilateral Multi-Perspective Matching for Natural Language Sentences</article-title>
          .
          <source>In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI)</source>
          . pp.
          <fpage>4144</fpage>
          -
          <lpage>4150</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          , vol.
          <volume>4</volume>
          ,
          <fpage>259</fpage>
          -
          <lpage>272</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Conneau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiela</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwenk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrault</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bordes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Supervised Learning of Universal Sentence Representations from Natural Language Inference Data</article-title>
          .
          <source>In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>670</fpage>
          -
          <lpage>680</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Min</surname>
            ,
            <given-names>M. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Henao</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms</article-title>
          .
          <source>In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          . pp.
          <fpage>440</fpage>
          -
          <lpage>450</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Inkpen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Enhanced LSTM for Natural Language Inference</article-title>
          .
          <source>In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL)</source>
          . pp.
          <fpage>1657</fpage>
          -
          <lpage>1668</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed Representations of Words and Phrases and their Compositionality</article-title>
          .
          <source>In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS)</source>
          . pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grave</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Enriching Word Vectors with Subword Information</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          , vol.
          <volume>4</volume>
          ,
          <fpage>135</fpage>
          -
          <lpage>146</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Bowman</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Angeli</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potts</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          :
          <article-title>A large annotated corpus for learning natural language inference</article-title>
          .
          <source>In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          . pp.
          <fpage>632</fpage>
          -
          <lpage>642</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kong</surname>
            ,
            <given-names>S.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Constant</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pilar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ge</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sung</surname>
            ,
            <given-names>Y.h.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strope</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kurzweil</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Learning Semantic Textual Similarity from Conversations</article-title>
          .
          <source>In: Proceedings of The Third Association for Computational Linguistics Workshop on Representation Learning for NLP</source>
          . pp.
          <fpage>164</fpage>
          -
          <lpage>174</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Attention Is All You Need</article-title>
          .
          <source>In: Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS)</source>
          . pp.
          <fpage>6000</fpage>
          -
          <lpage>6010</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Zilly</surname>
            ,
            <given-names>J. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>R. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koutník</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Recurrent Highway Networks</article-title>
          .
          <source>In: Proceedings of the 34th International Conference on Machine Learning (ICML)</source>
          . pp.
          <fpage>4189</fpage>
          -
          <lpage>4198</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>