Bidirectional Semantic Matching with Deep Contextualized Word Embedding for Chinese Sentence Matching

Kunxun Qi, Jianfeng Du*, Qiqi Ou, Linxi Jin and Jinglan Zhong

School of Computer Science and Technology, Guangdong University of Foreign Studies, Guangzhou 510006, China
jfdu@gdufs.edu.cn

* Corresponding author.

Abstract. In this paper, a bidirectional matching model is proposed to identify whether two Chinese sentences are paraphrases of each other. The model adapts the well-known BiMPM model in two main aspects. On the one hand, it exploits a deep contextualized model named ELMo to generate the input word embedding. On the other hand, three of the four bidirectional matching mechanisms in BiMPM are carefully selected to model the interaction between the two sentences. The proposed model is evaluated on a dataset of Chinese sentence pairs from CCKS 2018. Experimental results show that the model achieves an 86.2% F1-score on the validation set and an 84.6% F1-score on the test set.

Keywords: Sentence Matching, Chinese Sentence Pairs, Deep Neural Network.

1 Introduction

Modeling a pair of natural language sentences is a fundamental problem underlying many natural language processing (NLP) tasks, such as paraphrase identification (PI) [3] and textual entailment (TE) [3]. In paraphrase identification, the goal is to decide whether two sentences are paraphrases of each other. In textual entailment, the goal is to decide whether one sentence can be inferred from the other.

In recent years, neural network models have been widely used for modeling sentence pairs, and two main frameworks have emerged in previous work. The first framework employs two weight-sharing sentence encoders, such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), to represent a sentence pair as two low-dimensional real-valued vectors u1 and u2, and then makes a prediction based on these two vectors. It usually constructs a feature vector such as (u1, u2, |u1 - u2|, u1 * u2) and feeds it into a fully-connected network followed by a softmax layer to make the final prediction. Typical methods in this framework include BCNN [3], InferSent [4] and SWEMs [5]. This framework concentrates on constructing a good sentence encoder but ignores the relevance between the two sentences. Existing empirical studies reveal that it cannot achieve state-of-the-art performance, which may be caused by the loss of interactive information between the two sentences.
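To make the first framework concrete, the following sketch (our illustration rather than code from any of the cited systems) constructs the feature vector (u1, u2, |u1 - u2|, u1 * u2) from two sentence vectors and feeds it into a fully-connected network with a softmax output; the dimensions and layer sizes are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn


class PairClassifier(nn.Module):
    """Sketch of the first framework's prediction step: combine two
    sentence vectors u1 and u2 into (u1, u2, |u1 - u2|, u1 * u2) and
    classify the pair with a fully-connected network plus softmax."""

    def __init__(self, dim: int = 100, hidden_dim: int = 100, num_classes: int = 2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, u1: torch.Tensor, u2: torch.Tensor) -> torch.Tensor:
        # Feature vector (u1, u2, |u1 - u2|, u1 * u2): shape (batch, 4 * dim)
        features = torch.cat([u1, u2, torch.abs(u1 - u2), u1 * u2], dim=-1)
        # Probability distribution over paraphrase / non-paraphrase labels
        return torch.softmax(self.mlp(features), dim=-1)
```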
To further improve performance, the second framework studies how to learn the interaction between the two sentences. It usually calculates the relevance between the two sentences by means of a variety of attention mechanisms. Prominent methods in this framework include ABCNN [3], ESIM [6] and BiMPM [2]. In this paper, we implement three of the four bidirectional matching mechanisms in BiMPM to calculate the interaction between two sentences, namely full-matching, attentive-matching and max-attentive-matching. We do not use the maxpooling-matching mechanism because it is time-consuming and was hard to evaluate in our experiments.

All the above approaches use word embeddings as input. Word embedding aims to represent the tokens of textual documents as low-dimensional real-valued vectors. Word embeddings have been widely used in a broad range of NLP tasks, such as named entity recognition (NER), part-of-speech (POS) tagging, question answering (QA), textual entailment (TE) and machine comprehension (MC). The most popular word embedding models are Word2vec [7] and GloVe [8], which have demonstrated strong performance in a variety of NLP tasks. However, most of these models generate a pre-trained vector only for each token that occurs in the training corpus, so out-of-vocabulary (OOV) words have no representation. One common solution is to initialize the word embeddings randomly and update them during training, which easily incurs overfitting. Another solution is to use n-gram features when training the word embeddings. For example, FastText [9] trains word embeddings by predicting the labels of documents; it is applicable to document classification but not well suited to sentence modeling tasks. Recently, a new type of deep contextualized word representation, ELMo [1], has been proposed. It helps to cope with wrongly written or mispronounced characters, erroneous Chinese word segmentation and OOV words, and it has been demonstrated to improve performance on six challenging NLP tasks [1]. ELMo generates word vectors from the input character sequences and the representations of the surrounding words in a sentence. In this paper, we train an ELMo model on a Chinese Wikipedia corpus and use it to generate word vectors.

Our model is evaluated on the dataset of Chinese sentence pairs from CCKS 2018. Experimental results show that it achieves an 86.2% F1-score on the validation set and an 84.6% F1-score on the test set.

2 Related Work

There are many studies on modeling sentence pairs. In this section, we only review previous deep learning methods and refer the interested reader to [3] for other methods. There are two major deep learning frameworks for modeling sentence pairs, namely the classical encoding framework and the attention-based encoding framework.

Fig. 1. The overall architecture of our model for Chinese sentence pair matching.

2.1 Classical Encoding Framework

Methods in this framework employ two weight-sharing classical encoders, such as CNNs or RNNs, to generate two vector representations for the two input sentences. BCNN [3] used two weight-sharing CNNs to generate two sentence representations and constructed a feature vector by concatenating the two vectors. [4] implemented two bidirectional LSTM (BiLSTM) networks as sentence encoders. SWEMs [5] employed two hierarchical pooling encoders instead of any CNN or RNN. [11] modeled sentence pairs with a Transformer [12] encoder, a recent network architecture built on the self-attention mechanism [12].
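As a minimal sketch of the weight-sharing idea behind this framework (an illustration under our own assumptions, not a re-implementation of any cited model), a single BiLSTM can encode both sentences, here with max-pooling over time as one possible way to obtain fixed-size sentence vectors:

```python
import torch
import torch.nn as nn


class SharedBiLSTMEncoder(nn.Module):
    """Sketch of the classical encoding framework: one BiLSTM whose
    weights are shared between the two input sentences."""

    def __init__(self, embed_dim: int = 300, hidden_dim: int = 100):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim) pre-trained word vectors
        outputs, _ = self.lstm(x)          # (batch, seq_len, 2 * hidden_dim)
        # Max-pool over time to obtain a fixed-size sentence vector
        return outputs.max(dim=1).values   # (batch, 2 * hidden_dim)

    def forward(self, sent1: torch.Tensor, sent2: torch.Tensor):
        # The same parameters encode both sentences (weight sharing)
        return self.encode(sent1), self.encode(sent2)
```

Max-pooling is only one of several choices; BCNN, InferSent and SWEMs each combine their own encoder and pooling scheme, as described above.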
2.2 Attention-based Encoding Framework

On the basis of the first framework, methods in this framework employ various attention mechanisms based on the similarity between the two sentences to adjust the two representations. ABCNN [3] enhanced BCNN [3] by employing an attention feature matrix to learn interactive information. ESIM [6] employed a Tree-LSTM (Long Short-Term Memory) network as the sentence encoder and calculated the relevance between the two sentences with a local inference modeling layer. BiMPM [2] proposed four effective bidirectional matching mechanisms to learn interactive information.

3 Adaptation of BiMPM with ELMo

Our proposed model is shown in Figure 1. The input of our model has two parts for each sentence. The first part is the word embedding generated by ELMo. The second part is the character embedding created by a bidirectional LSTM (BiLSTM) network over randomly initialized character embeddings. The concatenated vector from these two parts is fed into a Highway network [13] to generate two sequences of word vectors. The two sequences of word vectors are fed into the contextual representation layer to learn contextual representations. Three bidirectional matching mechanisms of BiMPM are employed in the matching layer to calculate the interaction between the two sentences. The two sequences of matching vectors are fed into the aggregating layer to generate the feature vector, which is used to make the prediction in the prediction layer.

3.1 Word Representation Layer

This layer generates a d-dimensional vector for each word in the input sentences. It consists of two parts. The first part is the ELMo-generated word embedding: we train an ELMo model on a corpus of Chinese Wikipedia articles (https://zh.wikipedia.org/wiki/) and use it to generate word vectors. The second part is the character embedding: we randomly initialize a fixed-dimensional vector for each character within a word, feed these vectors into a BiLSTM network, and take the last hidden state of the BiLSTM as the character-composed representation of the word. We feed the concatenated vectors from these two parts into a Highway network to generate the final word vectors.

3.2 Contextual Representation Layer

This layer generates the contextual representations of the two sentences by using two BiLSTM networks. The weights of these two networks are shared during training.

3.3 Matching Layer

This layer calculates the interactive information between the two sentences. We apply three of the four bidirectional matching mechanisms in BiMPM, namely the full-matching mechanism, the attentive-matching mechanism and the max-attentive-matching mechanism. All of them use the multi-perspective matching function f_m to calculate the relevance between two contextual representations:

m = f_m(v_1, v_2; W)    (1)

In eq. (1), v_1 and v_2 are hidden states of the two BiLSTM networks in the contextual representation layer; both are d-dimensional vectors. W \in \mathbb{R}^{l \times d} is a trainable parameter, where the hyperparameter l is the number of perspectives of the interactive features. Each element m_k of m, i.e., the k-th dimension of the interactive vector, is calculated by a cosine similarity function:

m_k = \cos(W_k \circ v_1, W_k \circ v_2)    (2)

where \circ is the element-wise multiplication and W_k is the k-th row of W.
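The multi-perspective matching function of eq. (1) and eq. (2) can be sketched as follows; this is an illustrative re-implementation with assumed tensor shapes, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def multi_perspective_match(v1: torch.Tensor, v2: torch.Tensor,
                            W: torch.Tensor) -> torch.Tensor:
    """Sketch of f_m from eq. (1)-(2).

    v1, v2: d-dimensional contextual hidden states, shape (d,).
    W:      trainable parameter of shape (l, d), one row per perspective.
    Returns the l-dimensional matching vector m, where
    m_k = cos(W_k o v1, W_k o v2) and o is the element-wise product.
    """
    v1_scaled = W * v1.unsqueeze(0)   # (l, d): row k is W_k o v1
    v2_scaled = W * v2.unsqueeze(0)   # (l, d): row k is W_k o v2
    # Cosine similarity per perspective -> shape (l,)
    return F.cosine_similarity(v1_scaled, v2_scaled, dim=1)
```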
Further, we apply the three bidirectional matching mechanisms to calculate the interactive features between each time-step of one sentence and all time-steps of the other sentence. In the following, \overrightarrow{h}_i^p and \overleftarrow{h}_i^p denote the forward and backward contextual representations of the i-th time-step of sentence P, and \overrightarrow{h}_j^q and \overleftarrow{h}_j^q denote those of the j-th time-step of sentence Q, which has N time-steps.

Full-Matching. This matching mechanism calculates the interactive features between each contextual representation \overrightarrow{h}_i^p (or \overleftarrow{h}_i^p) and the last time-step of the contextual representation of the other sentence, \overrightarrow{h}_N^q (or \overleftarrow{h}_1^q):

\overrightarrow{m}_i^{full} = f_m(\overrightarrow{h}_i^p, \overrightarrow{h}_N^q; W^1), \quad \overleftarrow{m}_i^{full} = f_m(\overleftarrow{h}_i^p, \overleftarrow{h}_1^q; W^2)    (3)

Attentive-Matching. This matching mechanism calculates the interactive features between each contextual representation and a weighted sum of the contextual representations of the other sentence. First, we calculate the similarity between the two contextual representations:

\overrightarrow{\alpha}_{i,j} = \cos(\overrightarrow{h}_i^p, \overrightarrow{h}_j^q), \quad \overleftarrow{\alpha}_{i,j} = \cos(\overleftarrow{h}_i^p, \overleftarrow{h}_j^q)    (4)

Then, we use \overrightarrow{\alpha}_{i,j} (or \overleftarrow{\alpha}_{i,j}) as the weight of \overrightarrow{h}_j^q (or \overleftarrow{h}_j^q) and generate an attentive contextual representation as the weighted sum over all time-steps of the contextual representations of the other sentence:

\overrightarrow{h}_i^{mean} = \frac{\sum_{j=1}^{N} \overrightarrow{\alpha}_{i,j} \cdot \overrightarrow{h}_j^q}{\sum_{j=1}^{N} \overrightarrow{\alpha}_{i,j}}, \quad \overleftarrow{h}_i^{mean} = \frac{\sum_{j=1}^{N} \overleftarrow{\alpha}_{i,j} \cdot \overleftarrow{h}_j^q}{\sum_{j=1}^{N} \overleftarrow{\alpha}_{i,j}}    (5)

Finally, we calculate the interactive features between each contextual representation \overrightarrow{h}_i^p (or \overleftarrow{h}_i^p) and the attentive contextual representation \overrightarrow{h}_i^{mean} (or \overleftarrow{h}_i^{mean}):

\overrightarrow{m}_i^{att} = f_m(\overrightarrow{h}_i^p, \overrightarrow{h}_i^{mean}; W^3), \quad \overleftarrow{m}_i^{att} = f_m(\overleftarrow{h}_i^p, \overleftarrow{h}_i^{mean}; W^4)    (6)

Max-Attentive-Matching. This matching mechanism takes the contextual representation of the other sentence with the highest cosine similarity as the attentive representation:

\overrightarrow{h}_i^{max} = \overrightarrow{h}_{j^*}^q \ \text{with} \ j^* = \arg\max_j \cos(\overrightarrow{h}_i^p, \overrightarrow{h}_j^q), \quad \text{and analogously for} \ \overleftarrow{h}_i^{max}    (7)

We then calculate the interactive features between each contextual representation \overrightarrow{h}_i^p (or \overleftarrow{h}_i^p) and the max-attentive contextual representation \overrightarrow{h}_i^{max} (or \overleftarrow{h}_i^{max}):

\overrightarrow{m}_i^{max} = f_m(\overrightarrow{h}_i^p, \overrightarrow{h}_i^{max}; W^5), \quad \overleftarrow{m}_i^{max} = f_m(\overleftarrow{h}_i^p, \overleftarrow{h}_i^{max}; W^6)    (8)

3.4 Aggregating Layer

This layer employs two BiLSTM networks to aggregate the two sequences of matching vectors individually. The four last hidden states of the two BiLSTM networks are concatenated to compose the feature vector.

3.5 Prediction Layer

This layer employs a two-layer feed-forward network followed by a softmax function to calculate the probability distribution Pr(y|P, Q).

4 Experiments

4.1 Dataset and Evaluation

In the CCKS 2018 challenge, the organizers provided 100,000 labeled Chinese sentence pairs as the training set, 10,000 unlabeled sentence pairs as the validation set and 110,000 unlabeled sentence pairs as the test set.

All the evaluation results are calculated by the official evaluation system of the CCKS 2018 challenge. The system computes four metrics, namely micro-average precision (Prec.), recall (Rec.), F1-score (F1) and accuracy (Acc.), on the validation set and the test set.

4.2 Experiment Settings and Results

We train an ELMo model on a 3.3 GB Chinese Wikipedia corpus. Both the corpus and the dataset are segmented into words by the Jieba tool (https://github.com/fxsjy/jieba). We use the ELMo-generated word vectors to initialize the word embedding layer and do not update them during training. We randomly initialize 20-dimensional character vectors and utilize a 1-layer Highway network to generate the final word representations. We set the hidden size to 100 for all BiLSTM networks and apply dropout with a ratio of 0.5 to each layer in Figure 1. We set the learning rate to 0.0005 for the Adam optimizer and to 3 for the Adadelta optimizer. We generate three results by applying the Adadelta optimizer twice and the Adam optimizer once, and apply a vote mechanism on the three results to generate the final prediction, as sketched below.
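The vote mechanism can be sketched as a simple majority vote over the three runs; the prediction lists below are hypothetical and serve only as an illustration.

```python
from collections import Counter
from typing import List


def majority_vote(predictions: List[List[int]]) -> List[int]:
    """Sketch of the vote mechanism: given the binary predictions of the
    three runs (two Adadelta runs and one Adam run), keep the label that
    at least two of the three runs agree on for each sentence pair."""
    final = []
    for labels in zip(*predictions):  # labels of one sentence pair across runs
        final.append(Counter(labels).most_common(1)[0][0])
    return final


# Hypothetical example: three runs over four sentence pairs
runs = [[1, 0, 1, 1],
        [1, 1, 0, 1],
        [0, 1, 1, 1]]
print(majority_vote(runs))  # -> [1, 1, 1, 1]
```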
Table 1 shows the performance of various prior methods on the validation set; all baseline models use word2vec embeddings as input.

Table 1. Performances of various prior methods on the validation set.

Framework            Methods                        Prec.   Rec.    F1      Acc.
Classical            BCNN                           78.6%   79.7%   79.1%   78.9%
encoding             SWEMs                          82.5%   80.9%   81.7%   81.9%
framework            Transformer-based encoders     81.8%   84.1%   82.9%   82.7%
                     BiLSTM-Attention encoders      78.5%   87.1%   82.6%   81.6%
Attention-based      ABCNN-2                        79.4%   84.9%   82.0%   81.4%
encoding             ABCNN-2 (Multi-Perspective)    80.6%   84.9%   82.9%   81.5%
framework            ESIM                           83.7%   82.8%   83.2%   83.3%
                     BiMPM                          84.1%   83.2%   83.5%   83.6%
                     Our model                      86.3%   83.7%   85.0%   85.2%
                     Our model (Vote)               85.0%   87.4%   86.2%   86.1%
                     Our model on test set          83.2%   86.0%   84.6%   84.3%

We evaluate eight state-of-the-art models as baselines and observe that models in the attention-based encoding framework perform better than those in the classical encoding framework. In the classical encoding framework, we apply four sentence encoders, namely CNN, hierarchical pooling, Transformer and BiLSTM; the Transformer and BiLSTM encoders perform best, achieving 82.9% and 82.6% F1-scores, respectively. In the attention-based encoding framework, we evaluate four baseline models, namely ABCNN-2, ABCNN-2 (Multi-Perspective), ESIM and BiMPM, where ABCNN-2 (Multi-Perspective) is an implementation of ABCNN-2 with different kernel sizes. ESIM and BiMPM achieve better performance than ABCNN-2. Our model achieves the highest performance among single models with an 85.0% F1-score. Finally, we employ a vote mechanism to merge the different results of our model; it achieves the best performance on the validation set with an 86.2% F1-score and achieves an 84.6% F1-score on the test set.

5 Conclusion

In this study, we have proposed a model adapted from BiMPM. The model implements three of the four bidirectional matching mechanisms in BiMPM and exploits ELMo to generate word embeddings. The final prediction of our adapted model is obtained by voting over three results of the model trained with different hyperparameters. We evaluated our model on the dataset of Chinese sentence pairs from CCKS 2018. Experimental results reveal that the model achieves an 86.2% F1-score on the validation set and an 84.6% F1-score on the test set, ranking fifth in this challenge.

Acknowledgements

This work was partly supported by the National Natural Science Foundation of China (61375056), the Science and Technology Program of Guangzhou (201804010496), and the Scientific Research Innovation Team in Department of Education of Guangdong Province (2017KCXTD013).

References

1. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep Contextualized Word Representations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 2227-2237 (2018)
2. Wang, Z., Hamza, W., Florian, R.: Bilateral Multi-Perspective Matching for Natural Language Sentences. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI), pp. 4144-4150 (2017)
3. Yin, W., Schütze, H., Xiang, B., Zhou, B.: ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. Transactions of the Association for Computational Linguistics, vol. 4, pp. 259-272 (2016)
4. Conneau, A., Kiela, D., Schwenk, H., Barrault, L., Bordes, A.: Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 670-680 (2017)
5. Shen, D., Wang, G., Wang, W., Min, M. R., Su, Q., Zhang, Y., Li, C., Henao, R., Carin, L.: Baseline Needs More Love: On Simple Word-Embedding-Based Models and Associated Pooling Mechanisms. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 440-450 (2018)
6. Chen, Q., Zhu, X., Ling, Z., Wei, S., Jiang, H., Inkpen, D.: Enhanced LSTM for Natural Language Inference. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1657-1668 (2017)
7. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J.: Distributed Representations of Words and Phrases and their Compositionality. In: Proceedings of the 27th Annual Conference on Neural Information Processing Systems (NIPS), pp. 3111-3119 (2013)
8. Pennington, J., Socher, R., Manning, C. D.: GloVe: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543 (2014)
9. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, vol. 5, pp. 135-146 (2017)
10. Bowman, S. R., Angeli, G., Potts, C., Manning, C. D.: A Large Annotated Corpus for Learning Natural Language Inference. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 632-642 (2015)
11. Yang, Y., Yuan, S., Cer, D., Kong, S. Y., Constant, N., Pilar, P., Ge, H., Sung, Y. H., Strope, B., Kurzweil, R.: Learning Semantic Textual Similarity from Conversations. In: Proceedings of the Third Workshop on Representation Learning for NLP, pp. 164-174 (2018)
12. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., Polosukhin, I.: Attention Is All You Need. In: Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS), pp. 6000-6010 (2017)
13. Zilly, J. G., Srivastava, R. K., Koutník, J., Schmidhuber, J.: Recurrent Highway Networks. In: Proceedings of the 34th International Conference on Machine Learning (ICML), pp. 4189-4198 (2017)