<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Russian Person Names Recognition Using the Hybrid Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Glazkova</string-name>
          <email>a.v.glazkova@utmn.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Tyumen</institution>
          ,
          <addr-line>Tyumen</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Russian Person Name Recognition has been widely discussed in research papers devoted to rule-based and machine learning models. In this paper, the problem of Russian Person Name Recognition is tackled by combining rule-based models with neural networks trained on vector representations of words. The empirical results indicate that this approach is comparable to rule-based models and to models trained on syntactic and semantic text features. The advantage of the presented approach is that it requires neither deep semantic-syntactic analysis of the text nor connected dictionaries; moreover, the simplicity of the network architecture limits the memory footprint and runtime of the model.</p>
      </abstract>
      <kwd-group>
        <kwd>data extraction</kwd>
        <kwd>hybrid approach</kwd>
        <kwd>named entity recognition</kwd>
        <kwd>natural language processing</kwd>
        <kwd>neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Named Entity Recognition (NER) is the task of detecting and classifying proper
names within texts into predefined types, such as Person, Location and
Organization names [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. NER tools are actively used in different Natural Language
Processing applications.
      </p>
      <p>
        Russian is the official language of the Russian Federation and several other
post-Soviet countries. It has over 150 million native speakers and is the most
geographically widespread language of Eurasia [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Morphologically, Russian is
an inflectional language. Its syntax is characterized by a relatively free word
order and an active role of intonation. The writing system is based on the
Cyrillic alphabet.
      </p>
      <p>
        Although Person Names Recognition and NER for Russian are quite widely
studied problems, solutions are usually built on models based on templates or on
syntactic and semantic features extracted from the text [
        <xref ref-type="bibr" rid="ref3 ref4">3,4</xref>
        ]. Such models
demonstrate high efficiency, but they require additional research related to
the development of rules and templates and to the search for effective features for
model training. A number of studies for the Russian language are dedicated
to the construction of neural network models for NER. The presented models
[
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5,6,7</xref>
        ] demonstrate high accuracy rates and show the efficiency of neural
network technologies for solving this problem. These models, however, may be
quite demanding in terms of memory and computation.
      </p>
      <p>In this work, we attempt to solve the problem of Russian Person
Names Recognition by combining a neural network with a rule-based
approach. At the same time, we avoid complicated templates and rules for
extracting named entities, and we use a fairly simple network architecture
that can be created and trained in a short time.</p>
      <p>The article is structured as follows. In the introduction we state the
purpose of our work and refer to related works. In the section «Methods» we
describe the methodology of our work: datasets and tools, modelling and features,
and defining the boundaries of personal names in the text. Finally, we compare
our results obtained on a textual collection with the results of other
researchers.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-1">
        <title>2.1 Data Collections and Libraries</title>
        <p>
          To train our network, we used a random sample of the manually
annotated Persons-1000 collection [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which includes 1000 news texts and their
corresponding XML files containing the initial forms of personal names. In addition,
we used word embeddings from the RusVectōrēs project [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] trained on the Russian
National Corpus and Wikipedia texts (a 300-dimensional vector for each word).
To lemmatize words, we used pymorphy2. Neural network models
are built and trained using TensorFlow.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Text Preprocessing</title>
        <p>The text preprocessing was performed in the following way. We divided the
texts into sentences and then the sentences into words. Since sentence segmentation
was not a separate goal of this work, the split was simply carried out by
punctuation. Punctuation marks that are part of a personal name (for example,
the dot after an initial) are not treated as sentence boundaries.</p>
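The punctuation-based splitting described above can be sketched as follows. This is a minimal illustration: the authors' exact segmentation rules are not specified, and the initial-protecting regex is an assumption.

```python
import re

def split_sentences(text):
    # Protect dots that follow a single capital letter (initials such as
    # "A." in a personal name) so they do not end a sentence.
    protected = re.sub(r'\b([A-ZА-Я])\.', r'\1<DOT>', text)
    # Naive sentence split by terminal punctuation, then word split.
    sentences = re.split(r'[.!?]+\s*', protected)
    return [s.replace('<DOT>', '.').split() for s in sentences if s.strip()]
```

For example, `split_sentences("A. S. Pushkin was born in Moscow. He wrote poems.")` keeps the initials inside the first sentence instead of splitting on them.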
        <p>For each word we extracted the following features:
1. Word embeddings.
2. The serial number of the word in the sentence.
3. An indicator of whether the word begins with a capital letter.
4. An indicator of whether the word contains suffixes typical of surnames and
patronymics.</p>
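A feature vector in the spirit of this list might be assembled as below. This is a sketch: the suffix list is illustrative rather than the authors' actual set, and the embedding is taken as given.

```python
def word_features(word, position, embedding):
    # Concatenate: word embedding, serial number in the sentence,
    # a capitalization flag, and a surname/patronymic suffix flag.
    # Illustrative suffixes, not the authors' exact list:
    suffixes = ('ов', 'ев', 'ин', 'ова', 'ева', 'ина', 'ич', 'вна')
    capital = 1.0 if word[:1].isupper() else 0.0
    has_suffix = 1.0 if word.lower().endswith(suffixes) else 0.0
    return list(embedding) + [float(position), capital, has_suffix]
```

With the 300-dimensional RusVectōrēs embeddings this yields 303 input values per word.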
        <p>The last two features are binary. In our work, we focused on features
that can be extracted without additional semantic or syntactic analysis of
the text; obtaining them requires minimal effort. The sample was divided into
training, test and examination samples in the ratio 70%, 20% and 10%.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 The Network Architecture</title>
        <p>In the test we used a feed-forward network architecture. The main reasons
for this choice are the fairly large feature set and the significant size of the
training sample [10]. We focused on «budgeted» (in terms of memory) models
and compensated for the possible loss of accuracy by applying rules when defining
the boundaries of personal names.</p>
        <p>The model has two hidden layers with sigmoid activation, each containing
200 neurons. For optimization we chose the Adam optimizer with an exponentially
decaying learning rate.</p>
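The described network (two hidden layers of 200 sigmoid units, Adam with exponential learning-rate decay) can be sketched in Keras roughly as follows. The input dimension, the decay schedule parameters and the loss function are assumptions not stated in the text.

```python
import tensorflow as tf

def build_model(input_dim=303):
    # Two hidden layers, 200 sigmoid neurons each; one sigmoid output
    # producing a value m in [0, 1] per word.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(200, activation='sigmoid'),
        tf.keras.layers.Dense(200, activation='sigmoid'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    # Adam with an exponentially decaying learning rate
    # (schedule parameters are hypothetical).
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=1e-3, decay_steps=100, decay_rate=0.96)
    model.compile(optimizer=tf.keras.optimizers.Adam(schedule),
                  loss='binary_crossentropy')
    return model
```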
        <p>The choice of the best model was carried out as follows. We trained models
using all input features, varying the number of hidden layers from 1 to 4 and the
number of neurons per hidden layer from 150 (half the dimension of the input
data) to 300 in increments of 10 (Fig. 1).</p>
        <p>[Fig. 1. Loss (%) versus the number of hidden neurons for models with 1 to 4 hidden layers.]</p>
        <p>The models were trained on the training sample. At each iteration, we
calculated the loss on the test sample; training was interrupted as soon as the
results on the test sample began to deteriorate. The optimal model was chosen
according to the results obtained on the test sample (Fig. 2). The examination
sample was used to assess the final quality of the selected model.</p>
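In Keras terms, the interruption rule above corresponds to early stopping on the test (validation) loss; `patience=1` reflects stopping at the first deterioration, and the remaining settings are hypothetical.

```python
import tensorflow as tf

# Stop training when the loss on the test sample stops improving,
# keeping the weights from the best iteration.
stopper = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss', patience=1, restore_best_weights=True)

# model.fit(X_train, y_train, validation_data=(X_test, y_test),
#           epochs=900, callbacks=[stopper])
```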
      </sec>
      <sec id="sec-2-4">
        <title>2.4 Defining Boundaries of Personal Names</title>
        <p>All words in the training sample have the index 1 if they are part of a
personal name and the index 0 otherwise. The aim of the training is the correct
prediction of these indices for the elements of the examination sample. Therefore,
the trained model associates each word with a number m from 0 to 1, where 1 is
the marker for the word being part of a personal name:
f(x_i) → m_i, m_i ∈ [0; 1],
where i is the index of the word x_i, i ∈ [1; n], n is the sample size, and f(x_i)
is the set of features for x_i.</p>
        <p>[Fig. 2. Loss on the training and test samples over training iterations (0–900).]</p>
        <p>The decision on whether a word is part of a personal name is made on the
basis of the value of m: if it exceeds the threshold value k, the word is
considered part of a personal name. In the experiments we use k = 0.5.</p>
        <p>After processing by the neural network, each word of the text has a number
mi, but at this point the words are still separate. Next, we must combine the
words into personal names and define the boundaries of personal names. For these
purposes, we used a fairly simple rule. First, we combine those words that look
like fragments of personal names (that have mi &gt; k), stand side by side in the
text, and are not separated by punctuation marks. A single such word is
considered a separate personal name.</p>
        <p>Further, we check the neighborhood of each personal name (words adjacent
to it and not separated from it by punctuation marks), using a decreased
threshold value kv for these words. If the value mi of a word in the vicinity
exceeds kv, we include it in the personal name and move the boundary
accordingly; we then re-estimate the values mi with the new boundaries taken
into account. As a result of the experiments, we chose kv = 0.35 and a
neighborhood size equal to 1.</p>
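Under the stated thresholds, the merging rule can be sketched as follows: a simplified version that works on one punctuation-free sentence and extends each candidate name by at most one neighbor on each side.

```python
def extract_names(words, scores, k=0.5, kv=0.35):
    # Step 1: group adjacent words with score > k into spans [start, end).
    spans, i = [], 0
    while i < len(words):
        if scores[i] > k:
            j = i
            while j < len(words) and scores[j] > k:
                j += 1
            spans.append([i, j])
            i = j
        else:
            i += 1
    # Step 2: extend each span by a neighbor whose score exceeds the
    # lower threshold kv (neighborhood size 1).
    for span in spans:
        if span[0] > 0 and scores[span[0] - 1] > kv:
            span[0] -= 1
        if span[1] < len(words) and scores[span[1]] > kv:
            span[1] += 1
    return [' '.join(words[s:e]) for s, e in spans]
```

For instance, with scores [0.10, 0.90, 0.80, 0.40, 0.05] over five words, the middle two words form a name and the fourth word (0.40 &gt; kv) is absorbed into it.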
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and Results</title>
      <p>Table 1 contains the quality indicators of the neural network classification
for the examination sample, i.e. of deciding whether a word is part of a personal
name. As a target metric, we use the F-score:</p>
      <p>Precision_n = TP_n / (TP_n + FP_n),</p>
      <p>Recall_n = TP_n / (TP_n + FN_n),</p>
      <p>F_n-score = 2 · Precision_n · Recall_n / (Precision_n + Recall_n),</p>
      <p>where TP_n is the number of true positive fragments of personal names, FP_n
the number of false positive fragments, and FN_n the number of false negative
fragments of personal names.</p>
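The fragment-level metrics above reduce to a few lines of code; the counts in the example are illustrative, not results from the paper.

```python
def f_score(tp, fp, fn):
    # Precision, recall and F-score over fragments of personal names,
    # following the formulas above.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```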
      <sec id="sec-3-1">
        <title>Table 1. Features: F_n-score, Precision_n, Recall_n</title>
        <p>Word embeddings: F_n-score 92.94%, Precision_n 92.52%, Recall_n 93.36%.</p>
        <p>All features: F_n-score 93.12%, Precision_n 92.73%, Recall_n 93.51%.</p>
        <p>Table 2 shows the final results of person names recognition for the
examination sample with kv = 0.35. The F-score is calculated analogously, at the
level of whole personal names.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Table 2. F_p-score, Precision_p, Recall_p</title>
        <p>Results: F_p-score 93.41%, Precision_p 93.54%, Recall_p 93.28%.</p>
        <p>[Figure: loss versus the threshold kv for the training and test samples.]</p>
        <p>The paper presents a hybrid approach to Russian Person Names Recognition
that combines the advantages of neural network and rule-based approaches. We
compared our results with those obtained earlier. Our approach did not show the
best result, but it achieves a useful F-score.</p>
        <p>The main advantage of our approach is its simplicity in terms of resource
use and implementation: there is no need to connect dictionaries, create
templates, or engineer a feature set, provided that ready-made word embeddings
are available.</p>
        <p>The results of the article will serve as a basis for further research on
information extraction problems.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>The authors would like to acknowledge the valuable comments and suggestions
of the reviewers, which have improved the quality of this paper.</p>
      <p>The reported study was funded by RFBR according to the research project
18-37-00272.</p>
      <p>10. Botha J. A., Pitler E. et al. Natural Language Processing with Small
Feed-Forward Networks // Conference on Empirical Methods in Natural Language
Processing (EMNLP), Copenhagen, Denmark, 2017.</p>
      <p>11. Vlasova N. A., Podobryaev A. V. Automatic noun phrases extraction using
preliminary segmentation and CRF with semantic features // Program Systems:
Theory and Applications. Volume 4 (35), 2017. P. 21–30.</p>
      <p>12. Blinov P. D. Automatic named entity recognition in the Russian text //
Scientific and Technical Volga region Bulletin. Volume 3, 2013. P. 91–96.</p>
      <p>13. Trofimov I. V. Person name recognition in news articles based on the
persons1000/1111-F collections // 16th All-Russian Scientific Conference Digital
Libraries: Advanced Methods and Technologies, Digital Collections, RCDL 2014.
P. 217–221.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Oudah</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaalan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Person Name Recognition Using the Hybrid Approach</article-title>
          // International Conference on Application of Natural Language to Information Systems.
          <year>2013</year>
          . P.
          <fpage>237</fpage>
          -
          <lpage>248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <source>Russian Language</source>
          . URL: https://www.en.wikipedia.org/Russian language. Date of access: 29.01.
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Mozharova</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loukachevitch</surname>
            <given-names>N.</given-names>
          </string-name>
          <article-title>Two-stage approach in Russian named entity recognition</article-title>
          . // Intelligence, Social Media and
          <source>Web (ISMW FRUCT)</source>
          ,
          <source>2016 International FRUCT Conference</source>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Rubaylo</surname>
            <given-names>A. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kosenko</surname>
            <given-names>M. Y.</given-names>
          </string-name>
          <article-title>Software utilities for natural language information retrieval. // Almanac of modern science and education</article-title>
          . Volume
          <volume>12</volume>
          (
          <issue>114</issue>
          ),
          <year>2016</year>
          . P.
          <fpage>87</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sysoev</surname>
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andrianov</surname>
            <given-names>I. A.</given-names>
          </string-name>
          <article-title>Named Entity Recognition in Russian: the Power of Wiki-Based Approach</article-title>
          // Computational Linguistics and Intellectual Technologies:
          <source>Proceedings of the International Conference «Dialogue 2016»</source>
          .
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ivanitskiy</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shipilo</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovriguina</surname>
            <given-names>L</given-names>
          </string-name>
          .
          <article-title>Russian Named Entities Recognition and Classification Using Distributed Word</article-title>
          and Phrase Representations // SIMBig.
          <year>2016</year>
          . P.
          <fpage>150</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Anh L. T.</surname>
          </string-name>
          ,
          <string-name>
            <surname>Arkhipov</surname>
            <given-names>M. Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burtsev</surname>
            <given-names>M. S.</given-names>
          </string-name>
          <article-title>Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition // Artificial Intelligence and Natural Language Conference (AINL</article-title>
          <year>2017</year>
          ).
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Vlasova</surname>
            <given-names>N.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sulejmanova</surname>
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trofimov</surname>
            <given-names>I.V.</given-names>
          </string-name>
          <article-title>Message about the Russian-language collection for the task of extracting personal names from texts / in «Proceedings of the conference on computer and cognitive linguistics TEL'2014 «Language semantics: models</article-title>
          and technologies»». P.
          <fpage>36</fpage>
          -
          <lpage>40</lpage>
          . - Kazan,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kutuzov</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuzmenko</surname>
            <given-names>E.</given-names>
          </string-name>
          <article-title>WebVectors: A Toolkit for Building Web Interfaces for Vector Semantic Models</article-title>
          / In: Ignatov D. et al. (eds)
          <source>Analysis of Images, Social Networks and Texts. AIST 2016. Communications in Computer and Information Science</source>
          , vol.
          <volume>661</volume>
          . P.
          <fpage>155</fpage>
          -
          <lpage>161</lpage>
          . Springer, Cham.
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>