=Paper=
{{Paper
|id=Vol-2667/paper65
|storemode=property
|title=A multiclass words classification by the recurrent neural network with memory (LSTM) as applicable to the named entity recognition problem
|pdfUrl=https://ceur-ws.org/Vol-2667/paper65.pdf
|volume=Vol-2667
|authors=Vladimir Vakurin,Andrey Kopylov,Oleg Seredin,Konstantin Mertsalov
}}
==A multiclass words classification by the recurrent neural network with memory (LSTM) as applicable to the named entity recognition problem ==
* Vladimir Vakurin, Tula State University, Tula, Russia, vakourinvl@yandex.ru
* Andrey Kopylov, Tula State University, Tula, Russia, and.kopylov@gmail.com
* Oleg Seredin, Tula State University, Tula, Russia, oseredin@yandex.ru
* Konstantin Mertsalov, Rensselaer Polytechnic Institute, Troy, NY, USA, kmertsalov@gmail.com

Abstract: This study considers backpropagation neural network (NN) training for named entity recognition using multilayer NN architectures and various feature spaces on character strings. Experimental results showing the relation between the generalizing properties and the intersection of the training and test named entity sets while solving the conventional named entity recognition problem are presented. We also propose a method for improving the model's predictive ability to recognize named entities not used in the training.

Keywords: recurrent neural network, character feature spaces, long short-term memory architecture

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===I. INTRODUCTION===

The paper proposes a new method and investigates the key disadvantages of the existing named entity (NE) recognition solutions. Named entity recognition is a well-known problem, a part of the text mining domain [1].

Within the text mining domain, named entity recognition is used to locate and identify identical information objects contained in the text either directly or indirectly. The general named entity recognition (NER) problem is the identification of words or word sequences in a text that belong to a specified group, such as company names, geographic names, proper names, etc. The problem has many specific formulations and is significant for automated text processing systems. The common problems mentioned in the available references are proper name recognition, drug name recognition (bio-NER, drug-NER) [2], and chemical entity recognition (chem-NER) [3]. Since developing syntax rules and dictionaries for such problems is difficult, and proper names and formulas often contain errors, the problems are usually solved with machine learning [3,4]. Over the last three to four years, more advanced named entity recognition methods have emerged. The new NER methods use the most advanced long short-term memory neural network architectures [5] and are extensively investigated. An application of such a neural network architecture to the Russian language is presented in [6].

A commonly used optimization method for neural network training is stochastic gradient descent (SGD) [7]. It is iteratively controlled by a numeric loss function value [8]. On the one hand, the method is based on a random distribution of changes to the neural network coefficients. It means that the model parameter vector randomly oscillates around the common path since it is updated as a new entity enters the network (with some noise relative to the generalized pattern; it is a so-called “online update”, refer to [9]). With this, the expected global error minimum can be found faster [9]. On the other hand, the ground truth and the loss function should match the NN learning objective.

The problem statement for this research is improving the quality of the models used for the recognition of named entities not presented at the NN training phase by using a multiclass loss function along with a probabilistic representation of the specific named entity strings. We also present the experimental results showing the relation between the generalizing properties and the intersection of the training and test named entity sets while solving the conventional NE recognition problem, and the extremely poor generalizing ability of such conventionally trained models when applied to texts that contain new, unknown NEs, which is common in actual (commercial) NE recognition applications.

===II. RELATED WORKS===

There are several approaches to the named entity identification problem: grammar templates [10]; a classifier based on support vectors [11]; statistical models, namely hidden Markov models [12] and conditional random fields [13, 14]; and a range of deep learning NN models [15-18]. To overcome the limitations of using recurrent neural networks for NE string prediction [15], neural network cells with long short-term memory (LSTM) were introduced [5].

The latest trend is combining various neural network architectures as layers of a top-level multilayer neural network [19]. Lately, it has been considered as deep learning. This is presented in [16]; the first results obtained with a convolutional network are shown in [17] as applied to advanced neural network architectures [18]. Despite the relatively high solution quality compared to the above-listed conventional methods, the researchers note a disadvantage attributed to random errors introduced to the features of an entity to be recognized. The paper [20] notes that expanding the feature space by introducing capital letters and part-of-speech attributes does not improve the quality. What brings LSTM neural networks to a state-of-the-art level is architectures that do not require manual feature engineering or pre-processing. Instead, they are end-to-end architectures that process character strings directly and generate a feature space with a sufficient dimensionality [20, 21, 22] for the top LSTM layers that recognize the string (containing an NE).

The approach is supported by the paper [23].
It notes that the feature space generated by such a model can distinguish word suffixes, capitalized words, and prefixes, and perform tokenization automatically. With such an approach, the NN training seems to be similar to the way people learn words: an explicit character string is matched to a test list of words hidden from the observer. It is abstract and not obvious at the initial phases of learning, but as the learning is completed, the word list contains a set of words and the rules of their usage. In this paper, we will experimentally verify if this approach is valid.

We will also experimentally verify the controlled vertical addition of layers to a neural network. As the number of layers is determined by the architecture, there is a problem of representing the linear operator for multiple NN layers (applied to the NN layers considered as elements, as it would have been applied to the elements of a specific NN layer in the conventional problem formulation). The problem is solved with such architectures as shown in [24, 25] that resulted in the emergence of highway neural networks.

===III. GENERAL ARCHITECTURE OF THE PROPOSED NEURAL NETWORK===

====A. Encoder Architecture====

The features are represented with a convolutional encoder [9]. The encoder input is the letter features encoded by natural numbers [21]. Each word is encoded by a vector. Its length is equal to the length of the longest word (21 letters in our experiment). The vector elements are the sequential numbers of the letters in the alphabet. An empty position is coded as 1.

As noted in [21], sequence convolutions (usually called "time convolutions") are used to process natural language texts, in contrast to spatial convolutions used to process images. For this reason, a feature representation <math>f^k \in \mathbb{R}^{l-w+1}</math> of the neural network middle layer for the word <math>k</math> is generated as in [21]:

<math>f^k[i] = \tanh\left(\langle C^k[*, i : i+w-1], H \rangle + b\right),</math>

where <math>C^k[*, i : i+w-1]</math> are the columns of the <math>C^k</math> matrix from <math>i</math> to <math>i+w-1</math>, <math>H</math> is a filter of width <math>w</math> and <math>b</math> is a bias (both as defined in [21]), and <math>\langle A, B \rangle = \operatorname{Tr}(AB^T)</math> is the Frobenius scalar product.

The most significant features for each word <math>k</math> are to be selected from the feature vector <math>f^k</math>: <math>y^k = \max_i f^k[i]</math> (max-over-time) for the word <math>k</math> located at the center of a window <math>w</math> letters wide [21].

The most efficient method to represent the generated n-gram character sequences for a convolutional neural network is to use several such filters concurrently. The filters have various bandwidths proportional to the expected n-gram length (a word length expressed in characters). We used the same parameters as in the paper [21]: seven filters with [50, 100, 150, 200, 200, 200, 200] dimensions. As the authors note, the key concept is to identify the most significant features for a specific n-gram input and each filter with various dimensions.

For the filters <math>H_1, \ldots, H_h</math> (<math>h = 7</math> in this case), the convolutional neural network output for a character representation is <math>y^k = [y_1^k, \ldots, y_h^k]</math> for the input representation of the word <math>k</math> with a maximum length of 21 characters. As the paper [21] specifies, for many natural language processing applications the dimension of the output middle layer is usually between 100 and 1,000. In our experiment, the value is 650.

Fig. 1. The general arrangement of a char-cnn-lstm encoder based on the arrangement presented in [19].

As new sentences are supplied to the training window (100 sentences long), an internal covariate shift may occur [9]. To minimize it, and to accelerate the training, we used mini-batch normalization [26].

After normalization, the convolutional encoder output can be complemented by layers with linear transfer functions and a carry gate that excludes several linear layers based on the value of the function <math>G</math> [24, 25]:

<math>y = H(x, W_H) \cdot G(x, W_G) + x \cdot \left(1 - G(x, W_G)\right),</math>

where <math>x</math> is the input, <math>H(x, W_H)</math> is the transform gate, and <math>G(x, W_G)</math> is the carry gate: <math>H(x) = W_H x + b_H</math>, <math>G(x) = \sigma(W_G x + b_G)</math>, where <math>\sigma</math> is the sigmoidal function.

We used two such layers in the experiments.

LSTM cells were applied for the sequence recognition. A layer with LSTM cells [6] replaces the NN hidden layer coefficients (<math>W</math>) with a system of equations that connects the LSTM elements horizontally and enables long short-term memory (refer to Fig. 2).

====B. Decoder. Using the Estimated vs. Reference Mismatch Vector for Backpropagation====

A language model that estimates the probability of the next word <math>w_{t+1}</math> (a named entity or another word) from a character sequence <math>w = w_1, \ldots, w_t</math> was developed as follows.

Upon every update of the neural network weights as new features (character strings) are presented, an error function is estimated. The error function checks the match or mismatch of the class index (the word number in the dictionary) in the training set and the estimated class index (the word number in the dictionary) for each character string that represents the word:

<math>y^{*} = \arg\max_{y \in Y(z)} p(y \mid z; W, b).</math>

A result of successful training is matching the character string segments being words as individual elements [23].

Fig. 2. A long short-term memory cell structure (from [4]).

The class of a word (or of an NE) in a sentence, whose text representation is hidden from the NN input, is estimated as follows when a character string containing the word (or, if the prediction is wrong, a set of characters not related to the expected word) is presented. Two extra layers are added to the recurrent neural network output: a dropout layer with a 0.5 dropout probability, and a so-called linear layer with its dimension equal to the dictionary size: <math>P(x) = W_P x + b_P</math>. In other words, the neural network output as an <math>S \times N</math> matrix is multiplied by an <math>N \times T \times P</math> matrix, where <math>S</math> is the number of sentences (100), <math>N</math> is the neural network output dimension, <math>T</math> is the number of words in the sentence (35), and <math>P</math> is the dictionary size.

The resulting matrix contains non-normalized values of the degree of membership of the dictionary words in the classes recognized in the array of sentences that the neural network (not receiving the “right” term numbers directly) gets as a sequence of characters. In the course of optimization the network is trained to recognize the sequences of characters as indivisible fragments (words) and to predict each such word, and also to predict (whether correctly or erroneously) the named entity class with index 0. We will further check if the experimental result is a mistake.

To decrease the <math>P</math> dimension, we can estimate the softmax index by assigning it to the respective element of the <math>S \times T</math> array: the index is the expected word (class) index in the dictionary used to compare the current neural network output with the reference one.

The stochastic gradient descent (SGD) method is used to optimize the neural network layer coefficients. The SGD argument is the error value, i.e., the cross-entropy function value estimated for the probability of membership in each word of the language:

<math>H(p, q) = -\sum_{y} p(y) \log q(y),</math>

that is to be transformed back (with some error) into the coefficients of an LSTM recurrent neural network.

===IV. EXPERIMENTAL PROCEDURE===

Two language corpora were used: Penn Treebank [27] and the English NER task CoNLL2003 [28]. Refer to Table 1 for their summary data. For the CoNLL2003 corpus, the NE-PER (person) named entities were used. To estimate the named entity recognition quality we used conventional metrics: general accuracy for all the classes, and precision, recall, and F1-score for the first class represented by the NEs [29]. Also, refer to [30].

{| class="wikitable"
|+ TABLE I. EXPERIMENTAL DATA SET STATISTICS
! Dataset !! Text element type !! Penn Treebank !! CoNLL2003
|-
| Training || Sentences || 42,068 || 14,987
|-
| Training || Words || 887,521 || 204,567
|-
| Validation || Sentences || 3,370 || 3,466
|-
| Validation || Words || 70,390 || 51,578
|-
| Test || Sentences || 3,761 || 3,684
|-
| Test || Words || 78,669 || 46,666
|}

===V. EXPERIMENTAL PROCEDURE===

====A. Experiment No.1: Standard NE recognition problem====

Refer to Fig. 3 for the test set recognition results achieved with the multiclass loss function.

Fig. 3. CoNLL2003 test set recognition result.

====B. Experiment No.2: Random NE recognition====

The feature space for the CoNLL2003 corpus is constructed in such a way as to make the named entity character strings composed of 3 to 20 random characters for training and testing. Refer to Fig. 4 for the results.

Fig. 4. The CoNLL2003 corpus test set recognition result with randomly misspelled NE character features during the training and the testing.

====C. Experiment No.3: Unique NE recognition refined problem statement====

Using the information on Chem-NER [3], we can refine the NE recognition problem with the CoNLL2003 corpus as follows: first, the NN is trained; then, it recognizes NEs not present in the training set, only in the test one. The resulting problem is more complicated: the network will be trained with the NE character features not found in the test set NEs. For this, every CoNLL2003 corpus named entity is a string composed of 3 to 20 random characters. It is transfer learning [30] for named entity recognition.

Fig. 5. The result of the CoNLL2003 corpus test recognition with the NN trained on NEs with randomly misspelled character features.

Fig. 6. The result of the CoNLL2003 corpus validation set recognition with the NN trained on NEs with randomly misspelled character features.

The results of this experiment and the previous one are contradictory.

====D. Experiment No.4: The algorithm adaptation for unique NE recognition====

Using the feature space building conditions from Experiment No. 3, we will change the predictive function from the softmax class as follows: if the confidence factor in favor of at least one class is less than 50%, then class 0 (named entity) would be predicted. It means that the NN cannot recognize the unique string with a high probability:

<math>y^{*} = \operatorname{ROUND}\left(\arg\max_{y \in Y(z)} p(y \mid z; W, b)\right).</math>

In this case, during training the error function skips the recognition errors associated with the randomly changed NE characters.

Fig. 7. The result of the CoNLL2003 corpus test recognition with the NN trained on NEs with randomly misspelled character features. The NN uses the modified prediction and loss functions.

====E. Experiment No.5: Solution verification with the Penn TreeBank corpus====

Experiment No. 4 is repeated with the Penn TreeBank corpus. The hypothesis is: with each named entity misspelled we will avoid the well-known (unknown) character recognition problem. Every named entity is encoded by these characters. The text corpus (stock reports and financial news) is huge and homogeneous; that is why it is suitable for evaluating the unique named entity recognition accuracy with the method proposed in Experiment No. 4.

Fig. 8. The result of the PennTreeBank corpus test recognition with the NN trained on NEs with randomly misspelled character features. The NN uses the modified prediction and loss functions.

====F. Experiment No.6: The method improvement and the comparative metrics estimation====

During the experiments, we identified and confirmed the existence of the problem that was reviewed in [32]. Unfortunately, our team found it out too late, when experiments 1-5 had been completed. It is an independent confirmation that the problem does exist in the industry. Initially, we introduced a more radical problem statement and offered an NE representation-agnostic solution, even if the recognition quality is not perfect.

Thus, to estimate the comparative characteristics, the loss function will be left as in experiments 5-6, and the convolutional encoder will get NE character strings as input. The NEs that were used in training are deleted from the test set for the quality assessment as proposed in [32]. Since gazetteers are used in [32], we also used them for this experiment. Refer to Table 2 for the comparative characteristics of this method with and without gazetteers. There are 1,500 training epochs for this model. The NE recognition target classes are Person, Organization, Location, as in [32].

{| class="wikitable"
|+ TABLE II. THE QUALITY CHARACTERISTICS OF THE METHOD
! rowspan="2" | Corpus !! colspan="3" | No gazetteers !! colspan="3" | With gazetteers
|-
! Prec. !! Recall !! F1 !! Prec. !! Recall !! F1
|-
| CONLL test A || 0.56 || 0.78 || 0.649 || 0.59 || 0.75 || 0.657
|-
| CONLL test B || 0.43 || 0.87 || 0.571 || 0.57 || 0.85 || 0.579
|}

The recognition quality is higher if an NE generalized pattern is generated through training. Refer to Table 3 for the comparison of the results with [32]. Refer to Table 14 for a comparison of the results (Table 14: Out of domain performance: F1 of NERC with different models).

{| class="wikitable"
|+ TABLE III. RESULTS COMPARED TO [29]
! Method !! Corpus !! Precision !! Recall !! F1-score
|-
| Proposed method || CONLL test B || 0.59055 || 0.75364 || 0.6537
|-
| Proposed method || CONLL test A || 0.44853 || 0.85251 || 0.57881
|-
| Memorization || CONLL test B || 0.5314 || 0.2236 || 0.3148
|-
| Memorization || CONLL test A || 0.5585 || 0.2249 || 0.3207
|-
| CRF Suite || CONLL test A || 0.6712 || 0.3857 || 0.4899
|-
| CRF Suite || CONLL test B || 0.6794 || 0.3641 || 0.4741
|-
| SENNA || CONLL test A || 0.6862 || 0.5868 || 0.6326
|-
| SENNA || CONLL test B || 0.6461 || 0.5194 || 0.5758
|}

The experimental numerical results are presented in Table 4. The quality of the specified natural language models refers to the epoch indicated in the table.

{| class="wikitable"
|+ TABLE IV. EXPERIMENTAL RESULTS
! Exp. No. !! Fig. No. !! Training epoch !! General accuracy !! NER precision !! Recall !! F1 score
|-
| 1 || 3 || 150 || 0.849 || 0.7859 || 0.8495 || 0.81
|-
| 2 || 4 || 44 || 0.9214 || 0.8825 || 0.9950 || 0.934
|-
| 3 || 5 || 250 || 0.8174 || 0.3921 || 0.0302 || 0.054
|-
| 3 || 6 || 250 || 0.8401 || 0.4003 || 0.0346 || 0.061
|-
| 4 || 7 || 250 || 0.8466 || 0.2681 || 0.9023 || 0.39
|-
| 5 || 8 || 54 || 0.9852 || 0.7708 || 0.9943 || 0.866
|-
| 6 || -- || 1500 || -- || 0.5637 || 0.7809 || 0.649
|}

===VI. RESULTS AND DISCUSSION===

Interpreting Experiment No. 2 results as a success is a mistake because it contradicts Experiment No. 3 results. A possible reason for the contradiction is a feature of the tensorflow softmax software package function that processes the NN output:

* the class occurrence probability P is estimated from the NN output values with the class 0 features. The standard class index for the NER-Person class is 0. The estimated probability is low, but still, it is higher than for the other n classes representing the words;
* or it assigns class index 0 (Person) if the probabilities of the term being a member of each class in the set are equal.

Nevertheless, as the classifier finds a non-random NE representation in a character string (refer to Experiment No. 3), it will assign to it an index of the class (word) that differs from the NE class but is more similar to that of another non-random word. A trivial example: we need to recognize the proper noun Snowball, the name of a dessert. The NN model was trained with the names of other desserts. It was also trained with fairy tales used as counterexamples, where the word Snowball represents a ball of snow for the winter game, but not the dessert.

This problem shows that the existing named entity recognition training methods have a significant disadvantage: the recognition quality depends on whether the NE lists for the training and recognition sets intersect or not. In this case, the contradiction is between the possible uniqueness of the NE representation and the statistical method of recognition applied.

These results mean that it is possible to formulate the problem of NE recognition by searching for a character string that was not used in training.

The most obvious solution for this contradiction is increasing the classifier sensitivity threshold to, e.g., 50% probability of accurate identification of previously known, standard words in a sentence. As experiments 4 and 5 show, this aim is achievable. For a big training set (Experiment 5) the recognition quality is equal to that of the non-unique NE recognition.

===VII. CONCLUSIONS AND FURTHER RESEARCH===

The experiments show that multilayer neural networks can be applied to named entity recognition even if the NEs greatly differ from the training set. The unique NE recognition for the complex text of the CoNLL2003 corpus is possible with precision 0.5637, recall 0.7809, and F1-score 0.6492.

Nevertheless, researchers should consider two different problems: the recognition of known or similar NEs, and the recognition of unknown NEs not similar to those used for the training. The paper [32] also confirms that this NER problem exists. Our results are comparable to those presented in [32]. Our experiments showed that the conventional substitution, or a substitution refined with extra statistical data (gazetteers and additional features), can significantly improve only the recognition of known NEs (e.g., those included in the dictionaries). The same holds for the more complex, advanced accuracy improvement algorithms. The extra statistical data used in Experiment No. 6 increased the F1-score by 0.7% to 0.8% at the cost of lower recall. The achievable metrics of any new method for the conventional problem depend on the amount of intersection between the NE training set and the testing one. The recognition of general text patterns located between NEs is a more natural problem statement. We also identified an issue with the softmax function (particularly tensorflow tf.nn.softmax) as applied to the NN output layer factors that represent NEs, since it leads to lower accuracy.

===REFERENCES===

[1] A. Kao and S. Poteet, “Natural Language Processing and Text Mining,” London: Springer-Verlag, 2007.

[2] J. Patrick and M. Li, “High accuracy information extraction of medication information from clinical notes: 2009 i2b2 medication extraction challenge,” Journal of the American Medical Informatics Association, vol. 17, pp. 524-527, 2010.

[3] M. Krallinger, “The CHEMDNER corpus of chemicals and drugs and its annotation principles,” Journal of cheminformatics, vol. 7, no. 1, pp. 1-17, 2015. DOI:10.1186/1758-2946-7-S1-S.

[4] A. Glazkova, “Russian Person Names Recognition Using the Hybrid Approach,” Supplementary Proceedings of the Seventh International Conference on Analysis of Images, Social Networks and Texts (AIST), pp. 34-41, 2018.

[5] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Comput., vol. 9, no. 8, pp. 1735-1780, 1997.

[6] L. Anh, M. Arkhipov and M. Burtsev, “Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition,” Proceedings of the AINL, 2017.

[7] H. Robbins and S. Monro, “A Stochastic Approximation Method,” The Annals of Mathematical Statistics, vol. 22, no. 3, pp. 400-407, 1951.

[8] A. Wald, “Statistical Decision Functions,” Wiley, 1950.

[9] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient based learning applied to document recognition,” Proceedings of the IEEE, pp. 2278-2324, 1998.

[10] J. Jang, “Information extraction from text,” Mining Text Data, Springer, 2012, 524 p.

[11] H. Isozaki and H. Kazawa, “Efficient support vector classifiers for named entity recognition,” Proceedings of the 19th international conference on Computational linguistics, vol. 1, pp. 1-7, 2002.

[12] G.D. Zhou and J. Su, “Named entity recognition using an hmm-based chunk tagger,” Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 473-480, 2002.

[13] R. Klinger, “Automatically selected skipedges in conditional random fields for named entity recognition,” Proceedings of the 8th International Conference on Recent Advances in Natural Language Processing, pp. 580-585, 2011.

[14] W. Chen, Y. Zhang and H. Isahara, “Chinese named entity recognition with conditional random fields,” Proceedings of the 5th Special Interest Group of Chinese Language Processing Workshop, pp. 118-121, 2006.

[15] Y. Bengio, P. Simard and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, pp. 157-166, 1994.

[16] A. Ivakhnenko, “Grouped Arguments Handling for Solving Prognostic Problems,” Automatics, no. 6, pp. 24-33, 1976.

[17] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard and L. Jackel, “Handwritten Digit Recognition with a Backpropagation Network,” Proceedings of NIPS, 1989.

[18] Y. Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009. DOI: 10.1561/2200000006.

[19] J. Li, A. Sun, J. Han and C. Li, “A Survey on Deep Learning for Named Entity Recognition,” IEEE Trans. Knowl. Data Eng., 2020. DOI: 10.1109/TKDE.2020.2981314.

[20] X. Ma and E. Hovy, “End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF,” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 1064-1074, 2016.

[21] Y. Kim, Y. Jernite, D. Sontag and A. Rush, “Character-Aware Neural Language Models,” Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 2741-2749, 2016.

[22] M. Cho, J. Ha, C. Park and S. Park, “Combinatorial feature embedding based on CNN and LSTM for biomedical named entity recognition,” J. Biomed. Inform., vol. 103, no. 2019, 103381, 2020.

[23] J. Chiu and E. Nichols, “Named entity recognition with bidirectional lstm-cnns,” Transactions of the Association for Computational Linguistics, vol. 4, pp. 357-370, 2016.

[24] R. Srivastava, K. Greff and J. Schmidhuber, “Highway networks,” arXiv preprint: 1505.00387, 2015.

[25] G. Pundak and T. N. Sainath, “Highway-LSTM and Recurrent Highway Networks for Speech Recognition,” Proc. Interspeech, ISCA, 2017.

[26] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” Proceedings 32nd ICML, pp. 448-456, 2015.

[27] M. Marcus, B. Santorini and M. Marcinkiewicz, “Building a large annotated corpus of English: the Penn Treebank,” Computational Linguistics, vol. 19, no. 2, pp. 313-330, 1993.

[28] E. Tjong Kim Sang and F. De Meulder, “Introduction to the conll-2003 shared task: Language independent named entity recognition,” Proceedings of CoNLL, vol. 4, pp. 142-147, 2003.

[29] C. van Rijsbergen, “Information Retrieval,” Butterworth-Heinemann, 1979.

[30] H. He, “Learning from imbalanced data,” IEEE Transactions on Knowledge and Data Engineering, pp. 1263-1284, 2009.

[31] L. Pratt, “Discriminability-based transfer between neural networks,” NIPS Conference: Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, pp. 204-211, 1993.

[32] I. Augenstein, L. Derczynski and K. Bontcheva, “Generalisation in Named Entity Recognition: A Quantitative Analysis,” Computer Speech & Language, 2017. DOI:10.1016/j.csl.2017.01.012.

VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020). Data Science.
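The formulas in Sections III and V can be made concrete with a small sketch. It is not the authors' implementation (the paper reports using tensorflow); it is a plain-Python illustration under stated assumptions. The function names are hypothetical, the gated layer follows the formula exactly as printed in Section III.A, and the decision rule is one possible reading of the ROUND expression in Experiment No. 4: return the most probable class unless its probability is below the 50% threshold, in which case return class 0 (named entity).

```python
import math


def sigmoid(v):
    """Logistic function for a single float."""
    return 1.0 / (1.0 + math.exp(-v))


def affine(weights, bias, x):
    """Return W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_g, b_g):
    """y = H(x) * G(x) + x * (1 - G(x)), linear H and sigmoidal G."""
    h = affine(w_h, b_h, x)
    g = [sigmoid(v) for v in affine(w_g, b_g, x)]
    return [hi * gi + xi * (1.0 - gi) for hi, gi, xi in zip(h, g, x)]


def softmax(scores):
    """Normalize raw linear-layer scores into class probabilities."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_index, probs):
    """H(p, q) = -sum_y p(y) log q(y) for a one-hot reference p."""
    return -math.log(probs[target_index])


def predict_with_threshold(probs, ne_class=0, threshold=0.5):
    """Arg-max class, or ne_class when no class reaches the threshold.

    Assumption: this is an interpretation of the Experiment No. 4 rule,
    not a transcription of the authors' code.
    """
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= threshold else ne_class


if __name__ == "__main__":
    confident = softmax([0.1, 4.0, 0.2])       # one dominant word class
    unsure = softmax([0.1, 0.3, 0.2, 0.25])    # no class above 50%
    print(predict_with_threshold(confident))   # 1
    print(predict_with_threshold(unsure))      # 0
```

With a threshold of 0.5, any string for which the model has no dominant word class falls back to the NE class, which is consistent with the high recall and low precision reported for Experiment No. 4 in Table IV.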