Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10 - 14, 2018

APPLICATION OF RUSSIAN NAMED ENTITY RECOGNITION AND COREFERENCE RESOLUTION IN THE OIL INDUSTRY

A.D. Kulnevich 1, a, V.L. Radishevskii 1, R.A. Chugunov 1, A.A. Shevchuk 2

1 National Research Tomsk Polytechnic University, 30 Lenina avenue, Tomsk, 634050, Russia
2 National Research Tomsk State University, 36 Lenina avenue, Tomsk, 634050, Russia

E-mail: a kulnevich94@mail.ru

This paper describes the application of named entity recognition and coreference resolution algorithms in the oil industry. Oil industry researchers and businesses generate large amounts of content every day. Managing this content well is essential to getting the most out of each article and document. Named entity recognition algorithms can automatically scan entire articles and reveal the most significant people, organizations, and places discussed in them, while coreference resolution groups the mentions of each entity into clusters, where each cluster represents one entity within a document. These methods simplify the analysis of large numbers of documents and articles for researchers, managers, engineers, and others.

Keywords: natural language processing, named entity recognition, coreference resolution

© 2018 Aleksey D. Kulnevich, Vladislav L. Radishevskii, Roman A. Chugunov, Anton A. Shevchuk

1. Introduction

Named entity recognition is a process in which an algorithm takes a string of text (a sentence or paragraph) as input and identifies the relevant nouns (people, places, and organizations) mentioned in that string.
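As an illustration of this input and output, the sketch below groups per-token labels into typed entity spans. It assumes the B-/I-/O labeling scheme that appears in Table 1; the function name bio_to_spans, the example sentence, and its hand-written tags are hypothetical and are not output of the models described in this paper.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, entity_text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # "B-ORG" -> ("B", "-", "ORG"); "O" -> ("O", "", "")
        prefix, _, label = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and label != current_type)
        # Close the open span when we leave it or a new one begins.
        if current_tokens and (prefix == "O" or starts_new):
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):
            current_type = label
            current_tokens.append(token)
    if current_tokens:  # flush a span that runs to the end of the sentence
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence with hand-written tags (illustration only).
example_tokens = ["Gazprom", "Neft", "opened", "an", "office", "in", "Tomsk"]
example_tags = ["B-ORG", "I-ORG", "O", "O", "O", "O", "B-LOC"]
example_spans = bio_to_spans(example_tokens, example_tags)
```

Here example_spans holds one organization span and one location span, which is the kind of tag list that can be stored alongside a document.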
In practice, texts often mention the same entity in various ways (anaphora, cataphora, split antecedents, coreferring noun phrases). Coreference resolution algorithms address this problem by combining all mentions of the same entity into one cluster. These algorithms may greatly simplify the analysis of documents and articles. For example:

• They can be used to create efficient search engines. If the algorithm has to search all the words in millions of articles for every query, the process takes a long time. If instead named entity recognition is run once on all the articles and the relevant entities (tags) associated with each article are stored separately, the search can be sped up considerably: a search term is matched against only a small list of entities discussed in each article.
• They can be used to improve content recommendation systems, by extracting entities from a document and recommending other documents that mention the most similar entities.
• An online journal or publication site can hold millions of research papers and scholarly articles, with hundreds of papers on a single topic that differ only slightly, which complicates information search. Segregating articles by tags extracted using named entity recognition and coreference resolution can help find the desired article or document.
• They can be used to create ontology objects and object properties.
• They can be used to classify content for news providers: such an algorithm can scan entire articles and reveal the most significant people, organizations, and locations discussed in them.
• Solving named entity recognition tasks offers several ways to streamline customer feedback handling.
• They can be used in automatic summarization systems: named entities carry important information in the text and improve the identification of the text segments that are then included in the summary.

This is especially important for the oil industry for two reasons:
• new technologies cited in scientific papers can save millions of dollars daily once implemented;
• thousands of documents are generated in every oil company every day, and these documents often require meticulous analysis.

2. Implementation of Named Entity Recognition

There are two main approaches to the named entity recognition (NER) problem [6]. The first is based on handcrafted rules, and the other relies on statistical learning. Rule-based methods focus primarily on engineering a grammar and syntactic extraction patterns tied to the structure of the language. In this case, laborious tagging of a large number of examples is not required. The downsides of fixed rules are their poor ability to generalize and their inability to learn from examples. As a result, NER systems of this type are costly to develop and maintain. Learning-based systems automatically extract patterns relevant to the NER task from a training set of examples, so they do not require deep language-specific knowledge. This makes it possible to apply the same NER system to different languages without significant changes in architecture.

In this paper, we use a hybrid approach to this task in the Russian language:
• An algorithm based on a context-free grammar is used to extract some of the document's entities, keywords, and attributes.
• Another algorithm, based on conditional random fields, uses word vectors from a pre-trained skip-gram word2vec model for the Russian language together with POS tags.
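The learning-based component of the hybrid approach above can be sketched at the feature level. The code below is a minimal illustration, not the authors' implementation: it builds one feature dictionary per token from the word form, its POS tag, its neighbors, and the components of a word vector. The names token_features, sentence_features, and TOY_VECTORS, the two-dimensional vectors, and the hand-written POS tags are all assumptions made for the example; a real system would look the vectors up in the pre-trained word2vec model.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for a pre-trained skip-gram word2vec lookup.
# The values are made up and have only two dimensions for readability.
TOY_VECTORS: Dict[str, Sequence[float]] = {
    "газпром": (0.12, -0.40),
    "томск": (0.55, 0.08),
}


def token_features(tokens: List[str], pos_tags: List[str], i: int,
                   vectors: Dict[str, Sequence[float]] = TOY_VECTORS) -> Dict[str, object]:
    """Build a feature dictionary for token i of a sentence."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "pos": pos_tags[i],
    }
    # One real-valued feature per embedding dimension (none if the word is unknown).
    for dim, value in enumerate(vectors.get(word.lower(), ())):
        feats[f"w2v_{dim}"] = value
    # Context features from the neighboring tokens.
    if i > 0:
        feats["prev_lower"] = tokens[i - 1].lower()
        feats["prev_pos"] = pos_tags[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_lower"] = tokens[i + 1].lower()
        feats["next_pos"] = pos_tags[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens: List[str], pos_tags: List[str]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in the sentence."""
    return [token_features(tokens, pos_tags, i) for i in range(len(tokens))]


# Hypothetical example; the POS tags are written by hand.
example = sentence_features(["Газпром", "работает", "в", "Томске"],
                            ["NOUN", "VERB", "ADP", "NOUN"])
```

A list of such dictionaries per sentence, paired with a list of per-token labels, is the usual training input for a linear-chain CRF toolkit. The paper does not name the CRF library that was used.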
The concept of conditional random fields (CRFs) [1] has been successfully applied to many sequence labeling problems [2-3]. Even in deep learning architectures, a CRF has been used as a fundamental element of named entity recognition [4-5]. One of the primary advantages of applying a CRF to language processing is that it learns transition factors between the hidden variables that correspond to the labels of individual words.

We used a hybrid approach to extract entities from texts: the entities extracted by the two algorithms were merged, and duplicates were removed. Entities of the following types were extracted:
• person;
• organization;
• location;
• product;
• event;
• money.

Named entity extraction algorithms used morphological and part-of-speech tags to label entities correctly. To train and validate the models, we used the Dialogue-2016 dataset and additional documents that we labeled (newspapers, fiction books, technical documents).

3. Implementation of Coreference Resolution

The coreference resolution algorithm is based on a neural network largely derived from previous work [7]. Some changes were made to improve the results in the Russian coreference resolution task:
• To train a network for the Russian language, we used the Dialogue-2014 dataset.
• LSTM layers in the network were replaced with GRU layers (GRU showed slightly better results during evaluation on test data, owing to its smaller number of parameters and the small dataset).
• Pre-trained Russian word2vec skip-gram vectors, morphology, and POS tags were used as features.
• The extracted named entity tag was added as a feature to help the network find coreferences between the entities extracted by the NER algorithm.

4. Results

Named entity recognition (Table 1) and coreference resolution (Table 2) modules were tested on a holdout subsample of the dataset (a randomly selected 10% of the data). Metrics for entities were calculated for every word separately. Classes of entities were unbalanced: most of the words in the texts were not parts of entities, so the Avg / Total row in Table 1 is dominated by the O label.

Table 1. Entity recognition results on a holdout subsample

Label         Precision   Recall   F1-score   Support
B-PER         0.78        0.69     0.73       361
I-PER         0.74        0.74     0.74       425
B-ORG         0.76        0.60     0.67       533
I-ORG         0.77        0.73     0.75       562
B-LOC         0.75        0.86     0.80       651
I-LOC         0.76        0.60     0.67       263
B-PROD        0.63        0.65     0.64       752
I-PROD        0.70        0.47     0.56       227
B-DATE        0.92        0.86     0.89       337
I-DATE        0.91        0.97     0.94       420
O             0.99        0.99     0.99       53780
Avg / Total   0.97        0.97     0.97       58311

Table 2. MUC-5 coreference resolution results on a holdout subsample

MUC Metrics   Prec.   Rec.   F1
Our model     71.7    65.2   68.3

5. Application

Entity recognition and coreference resolution models were combined in a single pipeline, which also included document OCR, text preprocessing, and tokenizing. A web service was created, which included a search system based on the Elasticsearch framework. The extracted entities were used in ranking the search output. The system was loaded with oil industry-related documents: scientific articles and business documents. The documents in the search results could be viewed with highlighted entities and coreferences. The agglomerative clustering method (using a Doc2Vec model for feature extraction) and a simple named entity linking algorithm based on regular expressions were used to recommend documents similar to the current one, helping the user find relevant documents quickly. Examples of our web service GUI and processed text are shown in Figures 1 and 2:

Figure 1. An example of Natural Language Application for Gas and Oil industry

Figure 2. An example of processed text

Altogether, this system considerably improves information search efficiency and document analysis.

6. Conclusion

This article describes the application of machine learning algorithms to natural language processing tasks. Named entity recognition and coreference resolution improve search engines and help analyze documents faster. Future work includes optimizing the algorithms and adding summary extraction, named entity linking, and a recommendation engine based on document features. All of these features aim to streamline the analysis of text documents.

References

[1] Lafferty J., McCallum A., Pereira F. C. N. Conditional random fields: Probabilistic models for segmenting and labeling sequence data // In International Conference on Machine Learning (ICML), 2001, pp. 282-289.
[2] McCallum A. and Li W. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons // Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003, Vol. 4, 2003, pp. 188-191.
[3] Sha F. and Pereira F. Shallow parsing with conditional random fields // Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Vol. 1, 2003, pp. 134-141.
[4] Lample G. et al. Neural architectures for named entity recognition // arXiv preprint arXiv:1603.01360, 2016.
[5] Liu Z. et al. Entity recognition from clinical texts via recurrent neural network // BMC Medical Informatics and Decision Making, 2017, Vol. 17, No. 2, p. 67. DOI: 10.1186/s12911-017-0468-7.
[6] Maithilee L. et al. Approaches to Named Entity Recognition: A Survey // International Journal of Innovative Research in Computer and Communication Engineering, 2015, Vol. 3, Issue 12, pp. 12201-12208.
[7] Lee K. et al. End-to-end Neural Coreference Resolution // Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 188-197.
[8] Anh L. et al. Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition // arXiv preprint arXiv:1709.09686, 2017.