    Reproducing Russian NER Baseline Quality
            without Additional Data

                       Valentin Malykh1,2 , Alexey Ozerin2
         1
            Institute for Systems Analysis of Russian Academy of Sciences,
                 9, pr. 60-letiya Oktyabrya, Moscow, 117312, Russia
                                  http://www.isa.ru/
                 2
                   Laboratory of Neural Systems and Deep learning,
          Moscow Institute of Physics and Technology (State University),
         9, Institutskiy per., Dolgoprudny, Moscow Region, 141700, Russia
                                 http://www.mipt.ru/



      Abstract. Baseline solutions for the named entity recognition task in
      the Russian language were published a few years ago. These solutions rely
      heavily on additional data, such as databases, and on different kinds of
      preprocessing. Here we demonstrate that it is possible to reproduce the
      quality of the existing database-based solution with a character-aware
      neural network trained only on the corpus itself.

      Keywords: named entity recognition, character awareness, neural nets,
      multitasking


1   Introduction
Named entity recognition is a well-known task in the field of natural language
processing. It is in high demand in industry and has a long history of academic
research.
    Current approaches depend critically on the size and quality of the
knowledge base used. The knowledge base must be kept up to date, which
requires additional resources to be constantly involved.
    In contrast, our solution relies only on the text of the corpus itself, without
any additional data except for the training corpus markup.
    The contributions of this paper are the following:
 – We propose an artificial neural network architecture as an alternative to the
   knowledge-base-based approach to the named entity recognition task.
 – We provide the results of evaluating the model on a publicly available corpus
   for the Russian language.


2   Related work
The first results for character-based named entity recognition in the English
language were presented in the early 2000s [1]. A closely related idea of
character-based named entity tagging was introduced in [2] for the Portuguese
and Spanish languages, but, unlike that work, our model does not use convolutions.
For English-language text classification (a task close to named entity recognition),
a character-aware architecture was described in [3]; it is also based on convolutions,
so it differs fundamentally from our model. Previous research for the Russian
language has been based on words rather than characters [4]. The state-of-the-art
solution on the public corpus with named entity markup [5] is also word-level.
    One of the core ideas for our model comes from the character-aware neural
nets recently introduced in [6], [7]. Another idea, that of training the artificial
neural net to match sequences in order to capture the text structure, comes
from [8]. Our solution is based on multi-task learning, which was introduced for
natural language processing tasks in [9].


3     Model

The architecture of our recurrent neural network is inspired by [7]. The network
consists of long short-term memory units, which were initially proposed in [10].
There are two main differences from the Yoon Kim setup [7]. The first is that
our model predicts two things instead of one:

 – the next character,
 – a markup label for the current character.

The second is that we do not use convolutions, so our architecture does not
exploit the concept of a word, only the concept of a character. We suppose that
the model can learn the concept of a word from the data, and we rely on this
assumption when measuring quality. Prediction errors and gradients are
calculated, and the weights are then updated by truncated back-propagation
through time [11].
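
    A minimal illustrative sketch of such a two-headed character-level LSTM is
given below. It is our own reconstruction in PyTorch: the library choice, layer
sizes, number of layers, and all identifiers are assumptions for illustration and
are not taken from the paper.

    import torch.nn as nn

    class CharNerLstm(nn.Module):
        # Character-level LSTM with two output heads: one predicts the next
        # character, the other predicts a markup label for the current character.
        def __init__(self, n_chars, n_labels, emb_dim=64,
                     hidden_dim=256, n_layers=3):
            super().__init__()
            self.embed = nn.Embedding(n_chars, emb_dim)
            self.rnn = nn.LSTM(emb_dim, hidden_dim,
                               num_layers=n_layers, batch_first=True)
            self.char_head = nn.Linear(hidden_dim, n_chars)    # next-character logits
            self.label_head = nn.Linear(hidden_dim, n_labels)  # markup-label logits

        def forward(self, chars, state=None):
            # chars: (batch, time) tensor of character ids
            h, state = self.rnn(self.embed(chars), state)
            return self.char_head(h), self.label_head(h), state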


3.1   Mathematical formulation

Let h_t be the state of the last neural net layer before the softmax transformations
(the hidden state). The probabilities are predicted by a standard softmax over the
set of characters C and the set of markup labels M:


    \Pr(c_{t+1} \mid c_{1:t}) = \frac{\exp(h_t \cdot p_1^j + q_1^j)}{\sum_{j' \in C} \exp(h_t \cdot p_1^{j'} + q_1^{j'})}    (1)

    \Pr(m_t \mid c_{1:t}) = \frac{\exp(h_t \cdot p_2^i + q_2^i)}{\sum_{i' \in M} \exp(h_t \cdot p_2^{i'} + q_2^{i'})}    (2)



Here p_1^j is the j-th column of the character output embedding matrix P_1 ∈ R^{k×|C|}
and q_1^j is a character bias term; p_2^i is the i-th column of the markup output
embedding matrix P_2 ∈ R^{l×|M|} and q_2^i is a markup bias term; k and l are the
character and markup embedding vector lengths.

    The final negative log likelihood (NLL) is computed over the test corpus of
length T:

    NLL = -\sum_{t=1}^{T} \left( \log \Pr(c_{t+1} \mid c_{1:t}) + \log \Pr(m_t \mid c_{1:t}) \right)    (3)

    A diagram of our model can be found in Figure 1.
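
    To make equation (3) concrete, the following is a hedged sketch of one
truncated-BPTT training step for the model sketched above. The chunking scheme,
the optimizer interface, and the use of an averaged rather than summed loss are
our own assumptions for illustration.

    import torch.nn.functional as F

    def training_step(model, optimizer, chars, labels, state=None):
        # chars:  (batch, time + 1) character ids; the first `time` positions are
        #         inputs, the positions shifted by one are next-character targets (eq. 1)
        # labels: (batch, time) markup-label ids for the input characters (eq. 2)
        inputs, char_targets = chars[:, :-1], chars[:, 1:]
        char_logits, label_logits, state = model(inputs, state)

        # Joint negative log likelihood, cf. eq. (3); cross_entropy applies the
        # softmax of eqs. (1) and (2) and averages over the chunk.
        nll = (F.cross_entropy(char_logits.reshape(-1, char_logits.size(-1)),
                               char_targets.reshape(-1))
               + F.cross_entropy(label_logits.reshape(-1, label_logits.size(-1)),
                                 labels.reshape(-1)))

        optimizer.zero_grad()
        nll.backward()
        optimizer.step()

        # Detach the recurrent state so that gradients are truncated at chunk
        # boundaries (truncated back-propagation through time [11]).
        return nll.item(), tuple(s.detach() for s in state)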


4     Experiments

The corpus parameters are presented in Table 1; more details on it can be found
in [5]. The corpus can be obtained by sending a request to gareev-rm@yandex.ru
or to any other author of the original paper.


                      Table 1. Russian NER corpus statistics

                         Tokens                   44326
                         Words & Numbers          35116
                         Characters               263968
                         Organization annotations 1317
                         Org. ann. characters     14172
                         Person annotations       486
                         Per. ann. characters     5978



    Similar to [5], we perform 5-fold cross-validation and report precision (P),
recall (R), and F-measure (F). The results of the experiments are presented in
Table 2. Since we work with characters, we cannot use the character-level
labelling produced by our system directly; instead, for every token (whose
boundaries are known from the corpus) we take as the token label the label
assigned to the majority of the characters in the token. A sketch of this
aggregation step follows.
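
    Below is a minimal sketch of this character-to-token aggregation step; it is
our own illustration, and the assumption that token boundaries arrive as
(start, length) offsets is ours, not part of the original setup.

    from collections import Counter

    def token_labels(char_labels, token_spans):
        # char_labels: per-character labels produced by the net, e.g.
        #              ['none', 'none', 'per', 'per', ...]
        # token_spans: (start, length) pairs for every token, taken from the corpus
        labels = []
        for start, length in token_spans:
            counts = Counter(char_labels[start:start + length])
            # The token receives the label held by the majority of its characters.
            labels.append(counts.most_common(1)[0][0])
        return labels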


               Table 2. 5-fold cross-validation of the NN-based NER.

             Fold #    Person         Organization         Overall
                    P     R     F     P     R     F     P     R     F
               1  93.09 93.32 93.20 68.75 78.57 73.33 63.25 71.94 67.32
               2  94.85 94.16 94.51 64.29 73.90 68.76 59.38 67.86 63.33
               3  90.91 93.37 92.12 66.22 65.52 65.87 58.45 58.76 58.60
               4  90.45 91.74 91.09 68.02 77.48 72.45 60.12 68.56 64.06
               5  94.03 93.06 93.54 62.15 68.81 65.31 57.06 61.40 59.15
             mean 92.67 93.13 92.89 65.89 72.86 69.14 59.65 65.70 62.49
              std  1.92 0.88 1.32 2.70 5.60 3.67 2.31 5.44 3.63




[Figure 1: character embeddings of the input text (e.g. "Benoit B. Mandelbrot") feed
three stacked LSTM (RNN) layers, followed by softmax and sampling over the next
character and over the markup labels 0 none, 1 org, 2 per.]

      Fig. 1. Neural net architecture

5    Comparison
The results of the comparison are presented in Tables 3, 4, and 5.


                 Table 3. Person class performance comparison.

                        System               Person
                                 Precision   Recall   F-measure
                                mean std mean std mean std
              Best KB-based [5] 79.38 N/A 79.22 N/A 79.30 N/A
               CRF-based [5]    90.94 4.04 79.52 2.91 84.84 3.33
                  NN-based      92.67 1.92 93.13 0.88 92.89 1.32




              Table 4. Organization class performance comparison.

                        System             Organization
                                 Precision    Recall    F-measure
                                mean std mean std mean std
              Best KB-based [5] 59.04 N/A 52.32 N/A 55.48 N/A
               CRF-based [5]    81.31 7.44 63.88 6.54 71.31 5.38
                  NN-based      65.89 2.70 72.86 5.60 69.14 3.67




                   Table 5. Overall performance comparison.

                        System              Overall
                                 Precision   Recall   F-measure
                                mean std mean std mean std
              Best KB-based [5] 65.01 N/A 59.57 N/A 62.17 N/A
               CRF-based [5]    84.10 6.22 67.98 5.57 75.05 4.82
                  NN-based      59.65 2.31 65.70 5.44 62.49 3.63


    On the person token class our system outperformed the CRF-based one on
all metrics, in both mean value and standard deviation. On the organization
class our system is better in recall and comparable in F-measure to the
CRF-based model. Overall, our system is on par with the knowledge-base
approach in F-measure and with the CRF-based model in recall.


6    Conclusion
We applied a character-aware RNN model with LSTM units to the problem of
named entity recognition in the Russian language. Even without any preprocessing
or supplementary data from an external knowledge base, the model was able to
learn a solution end-to-end from the annotated corpus. The results demonstrated
by our approach are on the level of the existing state of the art in the field.
    The main weakness of the proposed model is the differentiation between
person and organization tokens. This is due to the small size of the corpus. A
possible solution is pre-training on a large unannotated corpus such as Wikipedia,
just to learn the internal distributed representations of a language model. We
presume that such pre-training would allow the RNN to beat the CRF-based model.
    Another direction of our future work is the addition of attention, which has
been demonstrated to improve performance on character-level sequence tasks [12].


References
 1. Klein, D., Smarr, J., Nguyen, H., Manning, C.D.: Named entity recognition with
    character-level models. In: Proceedings of the seventh conference on Natural lan-
    guage learning at HLT-NAACL 2003-Volume 4, Association for Computational
    Linguistics (2003) 180–183
 2. dos Santos, C., Guimarães, V.: Boosting named entity recognition with neural
    character embeddings. In: Proceedings of NEWS 2015, The Fifth Named Entities
    Workshop. (2015) 25
 3. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text
    classification. In: Advances in Neural Information Processing Systems. (2015) 649–
    657
 4. Popov, B., Kirilov, A., Maynard, D., Manov, D.: Creation of reusable components
    and language resources for named entity recognition in Russian. In: Conference on
    Language Resources and Evaluation. (2004)
 5. Gareev, R., Tkachenko, M., Solovyev, V., Simanovsky, A., Ivanov, V.: Introducing
    baselines for Russian named entity recognition. In: Computational Linguistics and
    Intelligent Text Processing. Springer (2013) 329–342
 6. Bojanowski, P., Joulin, A., Mikolov, T.: Alternative structures for character-level
    RNNs. arXiv preprint arXiv:1511.06303 (2015)
 7. Kim, Y., Jernite, Y., Sontag, D., Rush, A.M.: Character-aware neural language
    models. arXiv preprint arXiv:1508.06615 (2015)
 8. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural
    networks. In: Advances in neural information processing systems. (2014) 3104–
    3112
 9. Collobert, R., Weston, J.: A unified architecture for natural language process-
    ing: Deep neural networks with multitask learning. In: Proceedings of the 25th
    international conference on Machine learning, ACM (2008) 160–167
10. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation
    9(8) (1997) 1735–1780
11. Graves, A.: Generating sequences with recurrent neural networks. arXiv preprint
    arXiv:1308.0850 (2013)
12. Golub, D., He, X.: Character-level question answering with attention. arXiv
    preprint arXiv:1604.00727 (2016)