=Paper=
{{Paper
|id=Vol-2667/paper65
|storemode=property
|title=A multiclass words classification by the recurrent neural network with memory (LSTM) as applicable to the named entity recognition problem 
|pdfUrl=https://ceur-ws.org/Vol-2667/paper65.pdf
|volume=Vol-2667
|authors=Vladimir Vakurin,Andrey Kopylov,Oleg Seredin,Konstantin Mertsalov
}}
==A multiclass words classification by the recurrent neural network with memory (LSTM) as applicable to the named entity recognition problem ==
<pdf width="1500px">https://ceur-ws.org/Vol-2667/paper65.pdf</pdf>
<pre>
 A Multiclass Words Classification by the Recurrent
    Neural Network with Memory (LSTM) as
   Applicable to the Named Entity Recognition
                     Problem
           Vladimir Vakurin                                      Andrey Kopylov                                     Oleg Seredin
          Tula State University                                Tula State University                            Tula State University
              Tula, Russia                                         Tula, Russia                                     Tula, Russia
         vakourinvl@yandex.ru                                and.kopylov@gmail.com                              oseredin@yandex.ru

                                                              Konstantin Mertsalov
                                                          Rensselaer Polytechnic Institute
                                                                 Troy, NY, USA
                                                             kmertsalov@gmail.com

    Abstract—This study considers back propagation neural                    generalized pattern; it is a so-called “online update”, refer to
networks (NN) training for named entity recognition using                    [9]). With this, the expected global error minimum can be
multilayer NN architectures and various feature spaces on                    found faster [9]. On the other hand, the ground truth and the
character strings. Experimental results showing the relation                 loss function should match the NN learning objective.
between the generalizing properties and the intersection of the
training and test named entity sets while solving the                            The problem statement for this research is improving the
conventional named entity recognition problem are presented.                 quality of the models used for the recognition of named
We also propose a method for improving the model predictive                  entities not presented at the NN training phase by using a
ability to recognize named entities not used in the training.                multiclass loss function along with a probabilistic
                                                                             representation of the specific named entity strings. We also
   Keywords—recurrent neural network, character feature                      present the experimental results showing the relation
spaces, long short-term memory architecture                                  between the generalizing properties and the intersection of
                                                                             the training and test named entity sets while solving the
                       I.    INTRODUCTION                                    conventional NE recognition problem, and the
   The paper proposes a new method and investigates the                      extremely poor generalizing ability of such conventionally
key disadvantages of the existing named entity (NE)                          trained models when applied to texts that contain new,
recognition solutions. Named entity recognition is a well-                   unknown NEs, which is common in actual (commercial) NE
known problem, a part of the text mining domain [1].                         recognition applications.
    Within the text mining domain, named entity recognition                                     II. RELATED WORKS
is used to locate and identify identical information objects
contained in the text either directly, or indirectly. The general               There are several approaches to the named entity
named entity recognition (NER) problem is the identification                 identification problem: grammar templates [10]; a classifier
of words/word sequences in a text that belong to a specified                 based on support vectors [11], statistical models, namely,
group, such as company names, geographic names, proper                       hidden Markov models [12], conditional random fields [13,
names, etc. The problem has many specific formulations and                   14], and a range of deep learning NN models [15-18]. To
is significant for automated text processing systems. The                    overcome the limitations of using recurrent neural networks
common problems mentioned in the available references are                    used for NE string prediction [15], neural network cells with
proper name recognition, drug name recognition (bio-NER,                     long short-term memory (LSTM) were introduced [5].
drug-NER) [2], and chemical entity recognition (chem-NER)                        The latest trend is combining various neural network
[3]. Since developing syntax rules and dictionaries for such                 architectures as layers of a top-level multilayer neural
problems is difficult, and proper names and formulas often                   network [19]. Lately, it has been considered as deep learning.
contain errors, the problems are usually solved with machine                 This is presented in [16]; the first results obtained with a
learning [3,4]. Over the last three to four years, more advanced             convolutional network are shown in [17] as applied to advanced
named entity recognition methods have emerged. The new                       neural network architectures [18]. Despite the relatively high
methods use the most advanced long short-term memory                         quality of NER solutions compared to the above-listed
neural network architectures [5] and are extensively                         conventional methods, the researchers note a disadvantage
investigated. An application of such a neural network                        attributed to random errors introduced to the features of an
architecture to the Russian language is presented in [6].                    entity to be recognized. The paper [20] notes that expanding
    A commonly used optimization method for neural                           the feature space by introducing capital letters and part of
network training is the stochastic gradient descent (SGD)                    speech attributes does not improve the quality. A solution that
[7]. It is iteratively controlled by a numeric loss function                 brings LSTM neural networks to a state-of-the-art level is to
value [8]. On the one hand, the method is based on a random                  use architectures that do not require manual feature engineering
distribution of changes to the neural network coefficients. It               or pre-processing. Instead, they are end-to-end architectures
means that the model parameter vector randomly oscillates                    that process character strings directly and generate a feature
around the common path since it is updated as a new entity                   space with a sufficient dimensionality [20, 21, 22] for the top
enters the network (with some noise relative to the                          LSTM layers that recognize the string (containing an NE).


Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Data Science

The approach is supported by the paper [23]. It notes that the            processing applications is the dimension of the output
feature space generated by such a model can distinguish                   middle layer (usually between 100 and 1,000.) In our
word suffixes, capitalized words, prefixes, and perform                   experiment, the value is 650.
tokenization automatically. With such an approach, the NN
training seems to be similar to the way people learn words:
an explicit character string is matched to a test list of words
hidden from the observer. It is abstract and not obvious at the
initial phases of learning, but as the learning is completed,
the word list contains a set of words and the rules of their
usage. In this paper, we will experimentally verify if this
approach is valid. We will also experimentally verify the
controlled vertical addition of layers to a neural network. As
the number of layers is determined by the architecture, there
is a problem of representing the linear operator for multiple
NN layers (applied to the NN layers considered as elements:
as it would have been applied to the elements of a specific
NN layer in the conventional problem formulation.) The
problem is solved with such architectures as shown in [24,
25] that resulted in the emergence of highway neural
networks.
        III.    GENERAL ARCHITECTURE OF THE
               PROPOSED NEURAL NETWORK
A. Encoder Architecture
     The features are represented with a convolutional                    Fig. 1. The general arrangement of a char-cnn-lstm encoder based on the
encoder [9]. The encoder input is the letter features encoded             arrangement presented in [19].
by natural numbers [21]. Each word is encoded by a vector.                    As new sentences are supplied to the training window,
Its length is equal to the length of the longest word (21 letters         100 sentences long, an internal covariate shift may occur
in our experiment). The vector elements are the letter                    [9]. To minimize it and to accelerate the training, we used
sequential numbers in the alphabet. An empty position is                  mini-batch normalization [26].
coded as 1.
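
The word encoding described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the lowercase Latin alphabet, the choice to number letters from 2 so that they never collide with the padding code 1, and the UNKNOWN_INDEX bucket are assumptions.

```python
import string

MAX_WORD_LEN = 21  # length of the longest word in the paper's experiment
PAD_INDEX = 1      # the paper codes an empty position as 1

# Assumption: letters are numbered from 2 upwards to stay distinct from PAD_INDEX.
ALPHABET = string.ascii_lowercase
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(ALPHABET)}
# Hypothetical bucket for characters outside ALPHABET (digits, punctuation).
UNKNOWN_INDEX = len(ALPHABET) + 2


def encode_word(word, max_len=MAX_WORD_LEN):
    """Map a word to a fixed-length list of character indices padded with PAD_INDEX."""
    indices = [CHAR_TO_INDEX.get(ch, UNKNOWN_INDEX) for ch in word.lower()[:max_len]]
    return indices + [PAD_INDEX] * (max_len - len(indices))


encoded = encode_word("Tula")
```

Every word then has the same length, so a batch of words can be stacked into one integer matrix before it is passed to the encoder.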
                                                                              After normalization, the convolutional encoder output can
     As it is noted in [21], sequence convolutions (usually               be complemented by layers with linear transfer functions and
called ’time convolutions’) are used to process natural                   a carry gate that excludes several linear layers based on the
language texts in contrast to spatial convolutions used to                value of the function G [24, 25]:
process images. For this reason, a feature representation
$f^k \in \mathbb{R}^{l-w+1}$ of the neural network middle layer for the
word $k$ is generated as follows:                                             $y = H(x, W_H) \cdot G(x, W_G) + x \cdot (1 - G(x, W_G))$,
    $f^k[i] = \tanh(\langle C^k[*, i:i+w-1], H \rangle + b)$,
where $C^k[*, i:i+w-1]$ are columns of the $C^k$ matrix from              where $x$ is the input, $H(x, W_H)$ is the transform gate, and
$i$ to $i+w-1$, and $\langle A, B \rangle = \mathrm{Tr}(AB^T)$ is the     $G(x, W_G)$ is the carry gate: $H(x) = \sigma(W_H x + b_H)$,
Frobenius scalar product.                                                 $G(x) = \sigma(W_G x + b_G)$, where $\sigma$ is the sigmoidal function.

    The most significant features for each word $k$ are to be
selected from the feature vector $f^k$: $y^k = \max_i f^k[i]$                 We used two such layers in the experiments.
(max-over-time) for the word $k$, located at the center of a
letter window of width $w$ [21].                                              LSTM cells were applied for the sequence recognition. A
                                                                          layer with LSTM cells [6] replaces the NN hidden layer
    The most efficient method to represent the generated n-               coefficients ($W$) with a system of equations that connects
gram character sequences for a convolutional neural network is            the LSTM elements horizontally and enables long short-term
to use several such filters concurrently. The filters have                memory (refer to Fig. 2).
various bandwidths proportional to the expected n-gram
length (a word length expressed in characters.) We used the               B. Decoder. Using the Estimated vs. Reference Mismatch
same parameters as in the paper [21]: seven filters with [50,                Vector for Backpropagation
100, 150, 200, 200, 200, 200] dimensions. As the authors                     A language model that estimates the next word
note, the key concept is to identify the most significant                 probability $w_{t+1}$ (a named entity or another word) from a
features for a specific n-gram input and each filter with
various dimensions.                                                       character sequence $w = (w_1, \ldots, w_t)$ was developed as
                                                                          follows.
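
The encoder steps described in this subsection (several convolution filters applied along the character sequence, max-over-time pooling, then a gated layer y = H * G + x * (1 - G)) can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the tanh nonlinearity and bias in the convolution follow the usual character-CNN formulation, the embedding size d = 15, the filter widths 1..7 and the random weights are made up for the example, and only the filter counts and the word length of 21 come from the text.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def char_cnn(C, filters, biases):
    """Max-over-time features for one word.

    C       : (d, l) matrix of character embeddings.
    filters : list of arrays of shape (n_j, d, w_j), one array per filter width.
    biases  : list of arrays of shape (n_j,).
    """
    pooled = []
    for H, b in zip(filters, biases):
        w = H.shape[2]
        positions = C.shape[1] - w + 1
        # f[:, i] = tanh(<C[:, i:i+w], H> + b), with <A, B> = Tr(A B^T)
        f = np.stack(
            [np.tanh(np.tensordot(H, C[:, i:i + w], axes=([1, 2], [0, 1])) + b)
             for i in range(positions)],
            axis=1,
        )
        pooled.append(f.max(axis=1))  # max-over-time: y = max_i f[i]
    return np.concatenate(pooled)


def highway(x, W_H, b_H, W_G, b_G):
    """y = H(x) * G(x) + x * (1 - G(x)); both H and G use the sigmoid, as in the text."""
    H = sigmoid(W_H @ x + b_H)
    G = sigmoid(W_G @ x + b_G)
    return H * G + x * (1.0 - G)


rng = np.random.default_rng(0)
d, l = 15, 21                                # d is assumed; l = 21 as in the paper
counts = [50, 100, 150, 200, 200, 200, 200]  # filter counts quoted in the text
widths = range(1, 8)                         # assumed filter widths 1..7
filters = [0.1 * rng.standard_normal((n, d, w)) for n, w in zip(counts, widths)]
biases = [np.zeros(n) for n in counts]

C = rng.standard_normal((d, l))
y = char_cnn(C, filters, biases)
n = y.size
z = highway(y, 0.01 * rng.standard_normal((n, n)), np.zeros(n),
            0.01 * rng.standard_normal((n, n)), np.zeros(n))
```

When the gate G is close to 0 the layer passes its input through almost unchanged, which is what allows layers to be stacked without degrading the signal.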
    For the filters $H_1, \ldots, H_h$ ($h = 7$ in this case), the
convolutional neural network output for a character                           Upon every neural network weights update as new
                                                                          features (character strings) are presented, an error function is
representation is $y^k = [y_1^k, \ldots, y_h^k]$ for the input            estimated. The error function checks the match or mismatch
representation of the word $k$, max. length of 21 characters.             of the class index (the word number in the dictionary) in the
As the paper [21] specifies, for many natural language                    training set and the estimated class index (the word number


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                              294

in the dictionary) for each character string that represents the                         IV.     EXPERIMENTAL PROCEDURE
word:                                                                         Two language corpora were used: Penn Treebank [27]
    $y^* = \arg\max_{y \in Y(z)} p(y \mid z; W, b)$.                      and English NER task CoNLL2003 [28]. Refer to Table 1 for
                                                                          their summary data. For the CoNLL2003 corpus, NE-PER
    A result of successful training is matching the character             (Personal, person, human) were used. To estimate the named
string segments being words as individual elements [23].                  entity recognition quality, we used conventional metrics:
                                                                          overall accuracy for all the classes, and precision, recall,
                                                                          and F1-score for the first class represented by the NEs [29].
                                                                          Also, refer to [30].

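
The quality metrics named in this section (overall accuracy for all classes, plus precision, recall (completeness) and F1-score for the NE class) can be computed as in the following sketch. The function name, the dictionary keys and the toy labels are hypothetical; the only detail taken from the paper is that the NE class has index 0.

```python
def ne_metrics(y_true, y_pred, ne_class=0):
    """Overall accuracy plus precision, recall and F1 for the NE class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == ne_class and p == ne_class for t, p in pairs)
    fp = sum(t != ne_class and p == ne_class for t, p in pairs)
    fn = sum(t == ne_class and p != ne_class for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: class 0 is the NE class, classes 1 and 2 are ordinary words.
metrics = ne_metrics([0, 0, 1, 2, 0], [0, 1, 1, 0, 0])
```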
                                                                                    TABLE I.          EXPERIMENTAL DATA SET STATISTICS
                                                                             Dataset       Text element type    Penn Treebank    CoNLL2003
                                                                             Training      Sentences                   42,068       14,987
                                                                             Training      Words                      887,521      204,567
                                                                             Validation    Sentences                    3,370        3,466
                                                                             Validation    Words                       70,390       51,578
                                                                             Test          Sentences                    3,761        3,684
                                                                             Test          Words                       78,669       46,666


Fig. 2. A long short-term memory cell structure (from [4]).                              V.      EXPERIMENTAL PROCEDURE
    The class of a word (or of an NE) in a sentence is                    A. Experiment No.1: Standard NE recognition problem
estimated as follows when the NN is presented with a character               Refer to Fig. 3 for the test set recognition results achieved
string that contains the word (the text representation is                 with the multiclass loss function.
hidden from the NN input) or, if the prediction is wrong, with
a set of characters not related to the expected word. Two extra
layers are added to the recurrent neural network output: a
dropout layer with a 0.5 dropout probability, and a so-called
linear layer with its dimension equal to the dictionary size:
    $P(x) = W_P x + b_P$.
    In other words, the neural network output as an $S \times N$
matrix is multiplied by an $N \times T \times P$ matrix, where
$S$ is the number of sentences (100), $N$ is the neural network
output dimension, $T$ is the number of words in the sentence
(35), and $P$ is the dictionary size.
    The resulting matrix contains non-normalized values of                Fig. 3. CoNLL2003 test set recognition result.
the dictionary word degree of membership to the classes
recognized in the array of sentences that the neural network              B. Experiment No.2: Random NE recognition
(not receiving the “right” term numbers directly) gets as a                   Feature space for the CoNLL2003 corpus is constructed
sequence of characters. In the course of optimization the                 in such a way as to make the named entity character strings
network is trained to recognize the sequences of characters as            composed of 3 - 20 random characters for training and
indivisible fragments (words) and to predict each such word,              testing. Refer to Fig. 4 for the results.
and also to predict (whether correctly or erroneously) the
named entity class, which has index 0.                                       We will further check if the experimental result is a
                                                                          mistake.
    To decrease the P dimension, we can estimate the
softmax index by assigning it to the respective element of the
$S \times T$ array: the index is the expected word (class) index in
the dictionary used to compare the current neural network
output with the referenced one.
    The stochastic gradient descent (SGD) method is used to
optimize the neural network layer coefficients. The SGD
argument is the error value, i.e., the cross-entropy function
value estimated for the probability of membership in each
word of the language:
    $H(p, q) = -\sum_y p(y) \log q(y)$,
which is propagated back (with some error) into the                       Fig. 4. The CoNLL2003 corpus test set recognition result with randomly
coefficients of an LSTM recurrent neural network.                         misspelled NE character features during the training and the testing.

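
The decoder loss described above (a linear layer over the dictionary, softmax, cross-entropy H(p, q) = -sum_y p(y) log q(y), and an SGD update) can be illustrated with a small NumPy sketch. It is a simplified stand-in, not the authors' TensorFlow code: only the final linear layer is trained here, the reference distribution p is one-hot, and the sizes, learning rate and random data are assumptions chosen for the example.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(logits, targets):
    """Mean of H(p, q) = -sum_y p(y) log q(y) with one-hot p given by `targets`."""
    q = softmax(logits)
    return -np.log(q[np.arange(len(targets)), targets]).mean()


def sgd_step(W, b, x, targets, lr=0.1):
    """One SGD update of the linear layer P(x) = W x + b (modifies W and b in place)."""
    q = softmax(x @ W.T + b)
    q[np.arange(len(targets)), targets] -= 1.0  # gradient w.r.t. logits: q - p
    grad = q / len(targets)
    W -= lr * grad.T @ x
    b -= lr * grad.sum(axis=0)
    return W, b


rng = np.random.default_rng(0)
n_classes, n_features, n_words = 4, 5, 8        # toy sizes (assumed)
x = rng.standard_normal((n_words, n_features))  # stand-in for the LSTM outputs
targets = rng.integers(0, n_classes, size=n_words)
W, b = np.zeros((n_classes, n_features)), np.zeros(n_classes)

loss_before = cross_entropy(x @ W.T + b, targets)
for _ in range(20):
    W, b = sgd_step(W, b, x, targets)
loss_after = cross_entropy(x @ W.T + b, targets)
```

In the full model the same gradient would continue to flow back through the LSTM and encoder layers.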


C. Experiment No.3: Unique NE recognition refined                            favor of at least one class is less than 50%, then class 0
    problem statement                                                        (named entity) would be predicted. It means that the NN
    Using the information on Chem-NER [3], we can refine                     cannot recognize the unique string with a high probability:
the NE recognition problem with the CoNLL2003 corpus as
follows: first, the NN is trained; then, it recognizes NEs not                   $y^* = \mathrm{ROUND}\left(\arg\max_{y \in Y(z)} p(y \mid z; W, b)\right)$.
present in the training set, only in the test one. The resulting
problem is more complicated: the network will be trained                         In this case, during training the error function skips
with the NE character features not found in the test set NEs.                the recognition errors associated with the randomly changed
For this, every named entity in the CoNLL2003 corpus is a string             NE characters.
composed of 3-20 random characters. It is transfer learning                  E. Experiment No.5: Solution verification with the Penn
[30] for named entity recognition.                                               TreeBank corpus
                                                                                 Experiment No. 4 is repeated with the Penn TreeBank
                                                                             corpus. The hypothesis is that with each named entity misspelled
                                                                             we will avoid the well-known <UNK> (unknown) character
                                                                             recognition problem. Every named entity is encoded by these
                                                                             characters. The text corpus (stock reports and financial news)
                                                                             is large and homogeneous, which makes it suitable for studying
                                                                             the unique named entity recognition accuracy with the
                                                                             method proposed in Experiment No. 4.

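
The feature space construction used in Experiments No. 2 and No. 3 (each named entity replaced by a string of 3 to 20 random characters) can be sketched as follows. The function name, the tag labels and the use of lowercase Latin letters are assumptions made for illustration; the paper only states the length range.

```python
import random
import string


def randomize_entities(tokens, tags, ne_tags=("PER",), min_len=3, max_len=20, seed=None):
    """Replace every token tagged as a named entity with a random character string."""
    rng = random.Random(seed)
    result = []
    for token, tag in zip(tokens, tags):
        if tag in ne_tags:
            length = rng.randint(min_len, max_len)  # inclusive bounds
            token = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
        result.append(token)
    return result


tokens = ["John", "Smith", "visited", "Tula", "."]
tags = ["PER", "PER", "O", "LOC", "O"]  # hypothetical tag scheme
randomized = randomize_entities(tokens, tags, seed=0)
```

Using different seeds for the training and test sets gives test NEs that share no character features with the training NEs, which is the setting of Experiment No. 3.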

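
The modified prediction rule of Experiment No. 4 can be read as follows: if no class reaches a 50% probability, class 0 (named entity) is predicted, otherwise the most probable class is kept. The sketch below implements this reading with NumPy; the function name and the example probabilities are hypothetical, and the exact form of the ROUND operation in the paper may differ.

```python
import numpy as np

NE_CLASS = 0  # class index of the named entity in the paper


def predict_with_threshold(probs, threshold=0.5):
    """Return argmax classes, falling back to NE_CLASS when confidence is below threshold."""
    probs = np.asarray(probs)
    best = probs.argmax(axis=-1)
    confident = probs.max(axis=-1) >= threshold
    return np.where(confident, best, NE_CLASS)


# Softmax outputs for two character strings over three classes (made-up numbers).
predictions = predict_with_threshold([[0.1, 0.7, 0.2],
                                      [0.3, 0.4, 0.3]])
```

The first string is confidently assigned to class 1, while the second has no class above the threshold and is therefore labeled as a named entity.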
Fig. 5. The result of the CoNLL2003 corpus test recognition with the NN
trained on NEs with randomly misspelled character features.


                                                                             Fig. 8. The result of the PennTreeBank corpus test recognition with the NN
                                                                             trained on NEs with randomly misspelled character features. The NN
                                                                             modified the prediction and loss functions.

                                                                             F. Experiment No.6: The method improvement and the
                                                                                 comparative metrics estimation
Fig. 6. The result of the CoNLL2003 corpus validation set recognition with       During the experiments, we identified and confirmed the
the NN trained on NEs with randomly misspelled character features.           existence of the problem that was reviewed in [32].
   The results of this experiment and the previous one are                   Unfortunately, our team found this out too late, when
contradictory.                                                               experiments 1-5 had been completed. It is an independent
                                                                             confirmation that the problem does exist in the industry.
                                                                             Initially, we introduced a more radical problem statement
                                                                             and offered an NE representation-agnostic solution, even if
                                                                             the recognition quality is not perfect. Thus, to estimate the
                                                                             comparative characteristics, the loss function will be left as
                                                                             in experiments 5-6, and the convolutional encoder will get
                                                                             NE character strings as input. The NEs that were used in
                                                                             training are deleted from the test set for the quality
                                                                             assessment as proposed in [32]. Since gazetteers are used in
                                                                             [32], we also used them for this experiment. Refer to Table 2
                                                                             for the comparative characteristic of this method with and
                                                                             without gazetteers. There are 1,500 training epochs for this
                                                                             model. The NE recognition target classes are Person,
                                                                             Organization, Location, as in [32].
Fig. 7. The result of the CoNLL2003 corpus test recognition with the NN
trained on NEs with randomly misspelled character features. The NN
modified the prediction and loss functions.                                      TABLE II.           THE QUALITY CHARACTERISTICS OF THE METHOD

D. Experiment No.4: The algorithm adaptation for unique                                          No gazetteers               With gazetteers
   NE recognition                                                            Corpus              Prec.   Recall   F1         Prec.   Recall   F1
   Using the feature space building conditions from                          CONLL test A        0.56    0.78     0.649      0.59    0.75     0.657
Experiment No. 3, we will change the predictive function                     CONLL test B        0.43    0.87     0.571      0.57    0.85     0.579
from the softmax class as follows: if the confidence factor in



    The recognition quality is higher if a generalized NE                   not. In this case, the contradiction is between the possible
pattern is generated through training. Refer to Table 3 for the             uniqueness of the NE representation and the statistical
comparison of the results with [32]. Refer also to Table 14                 method of recognition applied.
in [32] (Out of domain performance: F1 of NERC with                             These results mean that it is possible to formulate the
different models).                                                          problem of NE recognition as a search for a character string
                                                                            that was not used during training.
               TABLE III.       RESULTS COMPARED TO [29]
                                                                                The most obvious solution for this contradiction is
 The results        Precision    Recall     F1-score                        increasing the classifier sensitivity threshold to, e.g., 50%
 Proposed method                                                            probability of accurate identification of previously known,
   CONLL test B     0.59055      0.75364    0.6537                          standard words in a sentence. As experiments 4 and 5 show,
   CONLL test A     0.44853      0.85251    0.57881                         this aim is achievable. For a big training set (Experiment 5)
 Memorization                                                               the recognition quality is equal to that of the non-unique NE
   CONLL test B     0.5314       0.2236     0.3148                          recognition.
   CONLL test A     0.5585       0.2249     0.3207
 CRF Suite                                                                          VII. CONCLUSIONS AND FURTHER RESEARCH
   CONLL test A     0.6712       0.3857     0.4899                             The experiments show that multilayer neural networks
   CONLL test B     0.6794       0.3641     0.4741                          can be applied to named entity recognition even if the NEs
 SENNA                                                                      greatly differ from the training set. The unique NE
   CONLL test A     0.6862       0.5868     0.6326                          recognition for the CoNLL2003 corpus complex text is
   CONLL test B     0.6461       0.5194     0.5758                          possible with precision 0.5637, recall 0.7809, and
                                                                            F1-score 0.6492.
   The experimental numerical results are presented in
Table 4. The specified natural language models quality
refers to the epoch indicated in the Table.                                     Nevertheless, the researchers should consider two
                                                                            different problems: the recognition of known or similar NEs,
                TABLE IV.           EXPERIMENTAL RESULTS                    and the recognition of unknown NEs not similar to those
                                                                            used for the training. The paper [32] also confirms that the
 Exp   Fig   Training   General    NER         NER      F1                  problem exists. Our results are comparable to those presented
 No.   No.   epoch      accuracy   precision   recall   score               in [32]. Our experiments showed that the conventional
 1     3     150        0.849      0.7859      0.8495   0.81                substitution or a substitution refined with extra statistical
 2     4     44         0.9214     0.8825      0.9950   0.934               data (gazetteers and additional features) can significantly
 3     5     250        0.8174     0.3921      0.0302   0.054               improve only the recognition of known NEs (e.g., those in the
 3     6     250        0.8401     0.4003      0.0346   0.061               dictionaries). It is the case for the more complex, advanced
 4     7     250        0.8466     0.2681      0.9023   0.39                accuracy improvement algorithms. The extra statistical data
 5     8     54         0.9852     0.7708      0.9943   0.866               used in Experiment No. 6 increased the F1-score by 0.7-0.8%
 6     --    1500       --         0.5637      0.7809   0.649               at the cost of lower recall. The
               VI.   RESULTS AND DISCUSSION                                 achievable metrics of any new method for the conventional
                                                                            problem depends on the amount of intersection between the
    Interpreting Experiment No. 2 results as a success is a                 NE training set and the testing one. The recognition of
mistake because it contradicts Experiment No. 3 results. A                  general text patterns located between NEs is a more natural
possible reason for the contradiction is a feature of the                   problem statement. We also identified an issue with the
tensorflow softmax software package function that                           softmax function (particularly tensorflow tf.nn.softmax) as
processes the NN output:                                                    applied to NN output layer factors that represent NEs since it
    - the class occurrence probability P is estimated from the              leads to lower accuracy.
    NN output values with the class 0 features. The standard
    class index for NER-Person class is 0. The estimated                                                REFERENCES
                                                                            [1]   A. Kao and S. Poteet, “Natural Language Processing and Text
    probability is low, but still, it is higher than for the other                Mining,” London: Springer-Verlag, 2007.
    n classes representing the words.                                       [2]   J. Patrick and M. Li, “High accuracy information extraction of
    - or it assigns class index 0 (Person) if the probabilities                   medication information from clinical notes: 2009 i2b2 medication
    of the term being a member of each class in the set are                       extraction challenge,” Journal of the American Medical Informatics
    equal.                                                                        Association, vol. 17, pp. 524 -527, 2010.
    Nevertheless, as the classifier finds a non-random NE                   [3]   M. Krallinger, “The CHEMDNER corpus of chemicals and drugs and
                                                                                  its annotation principles,” Journal of cheminformatics, vol. 7, no. 1,
representation in a character string (refer to Experiment No.                     pp. 1-17 , 2015. DOI:10.1186/1758-2946-7-S1-S.
3), it will assign to it an index of the class (word) that differs          [4]   А. Glazkova, “Russian Person Names Recognition Using the Hybrid
from the NE class but is more similar to that of another non-                     Approach,” Supplementary Proceedings of the Seventh International
random word. A trivial example is: we need to recognize the                       Conferencem on Analysis of Images, Social Networks and Texts
proper noun: the Snowball dessert name. The NN model                              (AIST), pp. 34-41, 2018.
was trained with the names of other desserts. It was also                   [5]   S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,”
                                                                                  Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997.
trained with fairy tales used as counterexamples where the
                                                                            [6]   L. Anh, M. Arkhipov and M. Burtsev, “Application of a Hybrid Bi-
word Snowball represents a ball of snow for the winter                            LSTM-CRF model to the task of Russian Named Entity Recognition,”
game, but not the dessert.                                                        Proceedings of the AINL, 2017.
    This problem shows that the existing named entity                       [7]   H. Robbins and S. Monro, “A Stochastic Approximation Method,”
recognition training methods have a significant                                   The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400-407,
                                                                                  1951.
disadvantage: the recognition quality depends on whether
                                                                            [8]   A. Wald, “Statistical Decision Functions,” Wiley, 1950.
the NE lists for the training and recognition sets intersect or


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)
Data Science

[9]  Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based
     learning applied to document recognition,” Proceedings of the IEEE,
     pp. 2278-2324, 1998.
[10] J. Jiang, “Information extraction from text,” Mining Text Data,
     Springer, 2012, 524 p.
[11] H. Isozaki and H. Kazawa, “Efficient support vector classifiers for
     named entity recognition,” Proceedings of the 19th International
     Conference on Computational Linguistics, vol. 1, pp. 1-7, 2002.
[12] G.D. Zhou and J. Su, “Named entity recognition using an HMM-based
     chunk tagger,” Proceedings of the 40th Annual Meeting on
     Association for Computational Linguistics, pp. 473-480, 2002.
[13] R. Klinger, “Automatically selected skip edges in conditional random
     fields for named entity recognition,” Proceedings of the 8th
     International Conference on Recent Advances in Natural Language
     Processing, pp. 580-585, 2011.
[14] W. Chen, Y. Zhang and H. Isahara, “Chinese named entity
     recognition with conditional random fields,” Proceedings of the 5th
     Special Interest Group of Chinese Language Processing Workshop,
     pp. 118-121, 2006.
[15] Y. Bengio, P. Simard and P. Frasconi, “Learning long-term
     dependencies with gradient descent is difficult,” IEEE Transactions
     on Neural Networks, vol. 5, pp. 157-166, 1994.
[16] A. Ivakhnenko, “Grouped Arguments Handling for Solving
     Prognostic Problems,” Automatics, no. 6, pp. 24-33, 1976.
[17] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W.
     Hubbard and L. Jackel, “Handwritten Digit Recognition with a
     Backpropagation Network,” Proceedings of NIPS, 1989.
[18] Y. Bengio, “Learning Deep Architectures for AI,” Foundations and
     Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009. DOI:
     10.1561/2200000006.
[19] J. Li, A. Sun, J. Han and C. Li, “A Survey on Deep Learning for
     Named Entity Recognition,” IEEE Trans. Knowl. Data Eng., 2020.
     DOI: 10.1109/TKDE.2020.2981314.
[20] X. Ma and E. Hovy, “End-to-end Sequence Labeling via
     Bi-directional LSTM-CNNs-CRF,” Proceedings of the 54th Annual
     Meeting of the Association for Computational Linguistics, vol. 1, pp.
     1064-1074, 2016.
[21] Y. Kim, Y. Jernite, D. Sontag and A. Rush, “Character-Aware Neural
     Language Models,” Proceedings of the Thirtieth AAAI Conference on
     Artificial Intelligence, pp. 2741-2749, 2016.
[22] M. Cho, J. Ha, C. Park and S. Park, “Combinatorial feature
     embedding based on CNN and LSTM for biomedical named entity
     recognition,” J. Biomed. Inform., vol. 103, no. 2019, 103381, 2020.
[23] J. Chiu and E. Nichols, “Named entity recognition with bidirectional
     LSTM-CNNs,” Transactions of the Association for Computational
     Linguistics, vol. 4, pp. 357-370, 2016.
[24] R. Srivastava, K. Greff and J. Schmidhuber, “Highway networks,”
     arXiv preprint: 1505.00387, 2015.
[25] G. Pundak and T. N. Sainath, “Highway-LSTM and Recurrent
     Highway Networks for Speech Recognition,” Proc. Interspeech,
     ISCA, 2017.
[26] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep
     Network Training by Reducing Internal Covariate Shift,” Proceedings
     of the 32nd ICML, pp. 448-456, 2015.
[27] M. Marcus, B. Santorini and M. Marcinkiewicz, “Building a large
     annotated corpus of English: the Penn Treebank,” Computational
     Linguistics, vol. 19, no. 2, pp. 313-330, 1993.
[28] E. F. Tjong Kim Sang and F. De Meulder, “Introduction to the
     CoNLL-2003 shared task: Language-independent named entity
     recognition,” Proceedings of CoNLL, vol. 4, pp. 142-147, 2003.
[29] C. J. van Rijsbergen, “Information Retrieval,”
     Butterworth-Heinemann, 1979.
[30] H. He, “Learning from imbalanced data,” IEEE Transactions on
     Knowledge and Data Engineering, pp. 1263-1284, 2009.
[31] L. Pratt, “Discriminability-based transfer between neural networks,”
     NIPS Conference: Advances in Neural Information Processing
     Systems 5. Morgan Kaufmann Publishers, pp. 204-211, 1993.
[32] I. Augenstein, L. Derczynski and K. Bontcheva, “Generalisation in
     Named Entity Recognition: A Quantitative Analysis,” Computer
     Speech & Language, 2017. DOI:10.1016/j.csl.2017.01.012.
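
Illustrative note on the tie case described in Section VI. The sketch below is
not the authors' code and does not call TensorFlow. It uses a hand-written
softmax and a hypothetical helper, argmax_first, that returns the lowest index
among equal maxima; this is the rule assumed in Section VI, and it is also the
first-occurrence rule that NumPy documents for argmax. The class layout (index
0 for Person, indices 1 to 3 for word classes) and the logit values are
assumptions chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax_first(probs):
    """Index of the largest value; ties resolve to the lowest index."""
    best = 0
    for i, p in enumerate(probs):
        if p > probs[best]:  # strict comparison keeps the first maximum
            best = i
    return best


# Hypothetical layout: index 0 is the NE class (Person),
# indices 1..3 stand for ordinary word classes.
uninformative = softmax([0.0, 0.0, 0.0, 0.0])  # network output carries no preference
informative = softmax([0.1, 2.0, 0.3, 0.2])    # class 1 is clearly preferred

tie_label = argmax_first(uninformative)
clear_label = argmax_first(informative)

print("equal probabilities:", uninformative, "-> label", tie_label)
print("distinct probabilities:", informative, "-> label", clear_label)
```

Under this assumed tie rule, an output with equal class probabilities is
labeled 0, i.e. Person, even though the network expresses no preference. This
is one way in which a high NER recall could arise without the model having
learned a useful NE representation.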



</pre>