<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Reproducing Russian NER Baseline Quality without Additional Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Valentin Malykh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexey Ozerin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Systems Analysis of Russian Academy of Sciences</institution>
          ,
          <addr-line>9, pr. 60-letiya Oktyabrya, Moscow, 117312</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Laboratory of Neural Systems and Deep Learning, Moscow Institute of Physics and Technology (State University)</institution>
          ,
          <addr-line>9, Institutskiy per., Dolgoprudny, Moscow Region, 141700</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>54</fpage>
      <lpage>59</lpage>
      <abstract>
<p>Baseline solutions for the named entity recognition task in the Russian language were published a few years ago. These solutions rely heavily on additional data, such as databases, and on different kinds of preprocessing. Here we demonstrate that it is possible to reproduce the quality of the existing database-based solution with a character-aware neural net trained on the corpus itself only.</p>
      </abstract>
      <kwd-group>
        <kwd>named entity recognition</kwd>
        <kwd>character awareness</kwd>
        <kwd>neural nets</kwd>
        <kwd>multitasking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        The first results for character-based named entity recognition in the English language
were presented in the early 2000s [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A close idea of character-based named entity
tagging was introduced in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for the Portuguese and Spanish languages, but, unlike that work, our
model does not use convolutions. For English-language text classification
(a task close to named entity recognition), a character-aware architecture
was described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; it is also based on convolutions, so it differs in principle from
our model. Previous research for the Russian language had been based not on
characters but on words [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The state-of-the-art solution on the public corpus with named
entity markup [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is also word-level.
      </p>
      <p>
        One of the core ideas for our model comes from the character-aware neural
nets introduced recently in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Another idea, that of matching sequences to train an
artificial neural net to capture text structure, comes from [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Our
solution is based on multi-task learning, which was introduced for natural
language processing tasks in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Model</title>
      <p>
        The architecture of our recurrent neural network is inspired by [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The network
consists of long short-term memory units, which were initially proposed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
There are two main differences from the Yoon Kim setup [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The first one is that our
model predicts two things instead of one:
– the next character,
– a markup label for the current character.
      </p>
      <p>
        The second one is that we do not use convolutions, so we exploit only the character
concept inside our architecture, not the word concept. We assume that the model can
learn the concept of a word from the data, and we rely on this assumption when
measuring quality. Prediction errors and gradients are calculated, and the weights
are then updated by truncated back-propagation through time [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
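      <p>To make the two-headed setup concrete, below is a minimal sketch of such a model in PyTorch. This is an illustration only, not the authors' code: the class name, layer sizes, and number of layers are our assumptions.</p>
      <preformat>
import torch.nn as nn

class CharTagger(nn.Module):
    """Character-level LSTM with two softmax heads: one predicts the
    next character, the other a markup label for the current character.
    Sizes and depth are illustrative guesses, not the paper's settings."""

    def __init__(self, n_chars, n_labels, emb_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers=2, batch_first=True)
        self.char_head = nn.Linear(hidden_dim, n_chars)    # next-character logits
        self.label_head = nn.Linear(hidden_dim, n_labels)  # markup-label logits

    def forward(self, chars):                # chars: (batch, time) character ids
        h, _ = self.lstm(self.embed(chars))  # hidden state h_t at every position
        return self.char_head(h), self.label_head(h)
      </preformat>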
      <sec id="sec-3-1">
        <title>Mathematical formulation</title>
        <p>Let h_t be the state of the last neural net layer before the softmax transformations (the hidden state). The probabilities are predicted by a standard softmax over the set of characters C and the set of markup labels M:</p>
        <disp-formula id="eq1">
          <tex-math>\Pr(c_{t+1} \mid c_{1:t}) = \frac{\exp(h_t \cdot p^1_j + q^1_j)}{\sum_{j' \in C} \exp(h_t \cdot p^1_{j'} + q^1_{j'})} \qquad (1)</tex-math>
        </disp-formula>
        <disp-formula id="eq2">
          <tex-math>\Pr(m_t \mid c_{1:t}) = \frac{\exp(h_t \cdot p^2_i + q^2_i)}{\sum_{i' \in M} \exp(h_t \cdot p^2_{i'} + q^2_{i'})} \qquad (2)</tex-math>
        </disp-formula>
        <p>Here p^1_j is the j-th column of the character output embedding matrix P^1 \in R^{k \times |C|} and q^1_j is a character bias term; p^2_i is the i-th column of the markup output embedding matrix P^2 \in R^{l \times |M|} and q^2_i is a markup bias term; k and l are the character and markup embedding vector lengths.</p>
        <p>The final negative log-likelihood (NLL) is computed over the test corpus of
length T:</p>
        <disp-formula id="eq3">
          <tex-math>NLL = -\sum_{t=1}^{T} \big( \log \Pr(c_{t+1} \mid c_{1:t}) + \log \Pr(m_t \mid c_{1:t}) \big) \qquad (3)</tex-math>
        </disp-formula>
        <p>A diagram of our model can be found in Figure 1.</p>
      </sec>
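      <p>As a sketch of how the objective in (3) can be computed for the two heads, the snippet below sums the two cross-entropy terms; it assumes the illustrative CharTagger model above, and the batching details are our assumptions rather than the authors' code.</p>
      <preformat>
import torch.nn.functional as F

def joint_nll(model, chars, labels):
    """chars: (batch, T+1) character ids; labels: (batch, T) markup ids.
    Returns the sum of the next-character and markup NLL terms of Eq. (3)."""
    char_logits, label_logits = model(chars[:, :-1])  # condition on c_{1:t}
    nll_char = F.cross_entropy(                       # -log Pr(c_{t+1} | c_{1:t})
        char_logits.reshape(-1, char_logits.size(-1)),
        chars[:, 1:].reshape(-1), reduction="sum")
    nll_label = F.cross_entropy(                      # -log Pr(m_t | c_{1:t})
        label_logits.reshape(-1, label_logits.size(-1)),
        labels.reshape(-1), reduction="sum")
    return nll_char + nll_label
      </preformat>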
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        The corpus parameters are presented in Table 1; more details on the corpus can be found
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The corpus can be obtained from the authors of the original paper by sending a
request to gareev-rm@yandex.ru or to any other author of the original paper.
      </p>
      <p>
        Similarly to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we perform 5-fold cross-validation with the precision (P), recall
(R), and F-measure (F) metrics. The results of the experiments are presented in
Table 2. Since we are working with characters, we cannot directly use the labelling
our system produces for characters, so we parse the produced markup for every
token (token boundaries are known to us from the corpus) and take the label of the
majority of the characters in the token as the token label.
      </p>
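      <p>A minimal sketch of this majority-vote conversion from character labels to token labels (token boundaries come from the corpus; the function name and data layout are illustrative assumptions):</p>
      <preformat>
from collections import Counter

def token_labels(char_labels, token_spans):
    """char_labels: one predicted label per character of the text.
    token_spans: (start, end) character offsets of each known token.
    Each token receives the label of the majority of its characters."""
    labels = []
    for start, end in token_spans:
        counts = Counter(char_labels[start:end])
        labels.append(counts.most_common(1)[0][0])
    return labels
      </preformat>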
      <fig id="fig1">
        <caption>
          <p>Fig. 1. Model diagram: character embeddings feed RNN layers followed by softmax and sampling; the example input in the figure is the word "Mandelbrot".</p>
        </caption>
      </fig>
    </sec>
    <sec id="sec-5">
      <title>Comparison</title>
      <p>The results of the comparison are presented in Tables 3, 4, and 5.</p>
      <p>On the person token class our system performed better than the CRF-based one
on all metrics, both in mean value and in standard deviation. On the organisation
class our system is better in recall and comparable in F-measure to the CRF
model. Overall, our system was on par with the knowledge-base approach
in F-measure and with the CRF model in recall.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We applied a character-aware RNN model with LSTM units to the problem of
named entity recognition in the Russian language. Even without any preprocessing
or supplementary data from an external knowledge base, the model was able to
learn a solution end-to-end from the corpus with markup. The results demonstrated
by our approach are on the level of the existing state of the art in the field.</p>
      <p>The main weakness of the proposed model is the differentiation between person and
organization tokens. This is due to the small size of the corpus. A possible
solution is pre-training on a large corpus such as Wikipedia, without any markup,
just to train the internal distributed representations of a language model. We presume
that such pre-training would allow the RNN to beat the CRF model.</p>
      <p>
        Another direction of our future work is the addition of attention, as it has been
demonstrated to improve performance on character-level sequence tasks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smarr</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          :
          <article-title>Named entity recognition with character-level models</article-title>
          .
          <source>In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4</source>
          , Association for Computational Linguistics (
          <year>2003</year>
          )
          <fpage>180</fpage>
          –
          <lpage>183</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>dos Santos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guimaraes</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Boosting named entity recognition with neural character embeddings</article-title>
          .
          <source>In: Proceedings of NEWS 2015 The Fifth Named Entities Workshop</source>
          . (
          <year>2015</year>
          )
          <fpage>25</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Character-level convolutional networks for text classification</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . (
          <year>2015</year>
          )
          <fpage>649</fpage>
          –
          <lpage>657</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Popov</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kirilov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Creation of reusable components and language resources for named entity recognition in Russian</article-title>
          .
          <source>In: Conference on Language Resources and Evaluation</source>
          . (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gareev</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tkachenko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solovyev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simanovsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ivanov</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Introducing baselines for russian named entity recognition</article-title>
          .
          <source>In: Computational Linguistics and Intelligent Text Processing</source>
          . Springer (
          <year>2013</year>
          )
          <fpage>329</fpage>
          –
          <lpage>342</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bojanowski</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joulin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Alternative structures for character-level RNNs</article-title>
          .
          <source>arXiv preprint arXiv:1511.06303</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jernite</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sontag</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rush</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          :
          <article-title>Character-aware neural language models</article-title>
          .
          <source>arXiv preprint arXiv:1508.06615</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . (
          <year>2014</year>
          )
          <fpage>3104</fpage>
          –
          <lpage>3112</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Collobert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A unified architecture for natural language processing: Deep neural networks with multitask learning</article-title>
          .
          <source>In: Proceedings of the 25th International Conference on Machine Learning</source>
          , ACM (
          <year>2008</year>
          )
          <fpage>160</fpage>
          –
          <lpage>167</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9(8)</volume>
          (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          –
          <lpage>1780</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Generating sequences with recurrent neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1308.0850</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Golub</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Character-level question answering with attention</article-title>
          .
          <source>arXiv preprint arXiv:1604.00727</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>