<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Named Entity Recognition for Information Security Domain</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
<institution>Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2018)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>I.A. Mazharov © B.V. Dobrov Lomonosov Moscow State University Faculty of Computational Mathematics and Cybernetics</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>200</fpage>
      <lpage>207</lpage>
      <abstract>
        <p>This work studies methods of named entity recognition for texts in Russian. Two information extraction methods based on artificial neural networks were implemented and tested on the FactRuEval and Persons-1000 collections. The implemented software systems were then applied to a collection of texts on the topic of information security, and the results were analyzed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Named entity recognition (hereinafter NER) is one of the most common
natural language processing tasks. The goal of NER is to find certain words
and phrases in a text and classify them according to predefined categories
(hereinafter labels), such as people's names, names of geographical objects,
names of organizations, expressions of time, quantity and so on. The
extracted entities can be further used in applications for extracting
information from text. They can also serve as features for other natural
language processing tasks.</p>
      <p>Named entity recognition is an important source of information for
various systems for extracting information and processing texts in natural
languages. Possible applications of NER are search engines, cross-language
information retrieval and machine translation, automated news gathering,
question answering systems, information retrieval for natural language
processing systems, and medical text analysis. In addition, named entities
are an important resource for structuring text data, which can help to
extend text collections.</p>
      <p>Named entity recognition for texts on information security
(hereinafter IS) helps to detect the emergence of a new threat, virus or
vulnerability in the network in time and to take appropriate protective
measures. The number and types of extracted entities make it possible to
conduct a temporal and quantitative analysis of publications on the topic of
information security and to identify weaknesses and vulnerabilities of
systems, which in turn helps with finding a solution to the problem.</p>
      <p>The increased frequency of cyberattacks, the consequent growth in
the number of sources of unstructured or weakly structured information on
the topic of information security, and the general attention to this topic
make the task of extracting named entities in this domain urgent for
research and further application.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Approaches to solving the task of named entity recognition</title>
      <p>Early NER systems were based on a set of manually defined rules.
This approach used search and recognition by grammatical and syntactic
patterns, according to the structure of the language in which the text is
written. In this case, a large collection of marked data is not necessary,
but the drawbacks of this approach include a poor ability to generalize
(adding a new entity or changing the language requires reworking most of the
rules) and the inability to learn from examples. The development of such
systems takes a long time, and without significant rework they cannot be
applied to different types of texts or to different languages.</p>
      <p>
        To solve these problems, named entity extraction algorithms based on
machine learning have been developed, with different types of training:
supervised, semi-supervised, unsupervised and reinforcement learning.
Supervised learning is the most studied [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and includes support vector machines (SVM) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], models based on the principle of maximum entropy [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], decision trees [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and sequence labeling methods, for example, the hidden Markov model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the maximum entropy Markov model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and conditional random fields (CRF) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Like the rule-based approaches, these methods rely on manually
selected text features; feature selection is a complex and time-consuming
task whose result cannot be generalized to different data sets.
      </p>
      <p>
        Recently, better results for the task of extracting named entities
have been achieved using artificial neural networks, in comparison with
other supervised learning algorithms [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The advantage of neural networks lies in their ability to
automatically translate syntactic and grammatical information into an
internal representation and to learn the model parameters from the initial
data set rather than rely on features defined by rules manually created for
specific data.
      </p>
      <p>For these reasons, systems of this type can be applied
to different languages without significant architectural
changes.</p>
      <p>Evaluation of a named entity recognition system is the way to verify
its operation: its performance is measured on a manually marked data set. A
named entity is defined by its boundaries (the consecutive words of one
entity) and its type.</p>
      <p>At the CoNLL conference, the following evaluation method was
proposed: if the type and boundaries of a named entity determined by the
system coincide with the type and boundaries selected by the experts, the
entity is considered to be properly extracted; otherwise the entity is
marked incorrectly. This method is called the exact (full) matching
method.</p>
      <p>Quality indicators of a NER system are Recall, Precision and
F-measure (hereinafter R, P and F respectively), which are calculated as
follows:</p>
      <p>P = TP / (TP + FP)</p>
      <p>R = TP / (TP + FN)</p>
      <p>F = 2 · P · R / (P + R)</p>
      <p>where TP is the number of correctly extracted entities, FP the
number of spuriously extracted entities, and FN the number of entities
missed by the system.</p>
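      <p>The following minimal sketch (an illustration under assumed data
structures, not code from the paper) computes these metrics under the exact
matching method: a predicted entity is counted as correct only if both its
boundaries and its type coincide with a gold entity.</p>
      <preformat>
# Exact-match NER evaluation: entities are (start, end, label) triples.
def evaluate_exact_match(gold, predicted):
    """gold, predicted: sets of (start, end, label) triples."""
    tp = len(gold.intersection(predicted))   # exact boundary + type matches
    fp = len(predicted.difference(gold))     # spuriously extracted entities
    fn = len(gold.difference(predicted))     # missed entities
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Example: one correct PER, one LOC with wrong boundaries.
gold = {(0, 2, "PER"), (5, 6, "LOC")}
pred = {(0, 2, "PER"), (5, 7, "LOC")}
print(evaluate_exact_match(gold, pred))  # (0.5, 0.5, 0.5)
      </preformat>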
    </sec>
    <sec id="sec-3">
      <title>3 Formulation of the problem</title>
      <p>In the framework of this work, it was required to develop and
evaluate methods for extracting named entities from texts on the topic of
information security using artificial neural networks. Modern methods
applied to the task of extracting named entities are mainly based on various
machine learning techniques.</p>
      <p>The task described above is divided into the following subtasks:
• Conduct quality testing of the developed methods on the collections Dialog
Evaluation 2016, Persons-1000 and Persons-1111.
• Apply the developed software systems to a collection of texts on
information security topics.</p>
      <sec id="sec-3-1">
        <title>Unique tokens</title>
        <p>22358
43802
47464
PER
3350
27989
12056
PER
LOC
2950 2041</p>
      </sec>
      <sec id="sec-3-2">
        <title>EVENT MEDIA</title>
        <p>899 222</p>
      </sec>
      <sec id="sec-3-3">
        <title>HACKER_GROUP</title>
        <p>79
LOC
2531
ORG
8670
ORG
3324</p>
        <p>O</p>
        <p>O
81108
315034
285613</p>
      </sec>
      <sec id="sec-3-4">
        <title>POST</title>
        <p>135</p>
        <p>Apply developed software systems to a collection of
texts on information security topics.</p>
      <p>
        In order to train and test the implemented systems, marked corpuses
were used. At the moment, there are only a few corpuses for the NER task in
Russian. In this paper, experiments were carried out with the following
corpuses:
• The FactRuEval 2016 corpus contains news and analytical materials
collected from the resources Private Correspondent and Wikinews, which are
marked with the following labels: PER, LOC, ORG and O (hereinafter Person,
Location, Organization and None respectively). The subjects of the texts are
political.
• The Persons-1000 corpus [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] contains Russian news texts with marked named entities of the type
PER.
• The Persons-1111 corpus contains Russian news texts with marked named
entities of the type PER.
• The Security_collection corpus (provided by the MSU Research Computing
Center) contains texts on information security, marked with the help of the
"Brat" annotation system.
      </p>
        <p>Statistics for these corpuses are presented in Table
1 and Table 2.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Methods for solving the task of named entity recognition using artificial neural networks</title>
      <p>At the moment, the most widely used approaches to solving the NER
problem with artificial neural networks are divided into two large types:
fully connected / convolutional neural networks and recurrent neural
networks (hereinafter RNN). Recurrent networks can store various elements of
the sequence in memory and correlate them, which enables them to show better
results than fully connected / convolutional networks, in which the
connection between words is established not word by word but by word groups
(windows).</p>
      <p>
        Currently, the standard solution for extracting named entities for
the English, German, Dutch, Spanish [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and Russian [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] languages is achieved using hybrid models combining Bi-LSTM and
CRF. In this paper, both approaches will be considered and applied to the
NER task for texts on the topic of information security in Russian.
      </p>
      <sec id="sec-4-1">
        <title>4.1 Fully-connected neural networks in the NER problem</title>
        <p>
          The first approach to the task of extracting named entities is
based on a fully connected neural network [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. It is based on the idea that the features extracted from the
sentence, after minimal processing, are fed to the input of a multilayer
neural network trained by backpropagation. The system accepts an input
sentence and trains several feature recognition layers that process the
input data. The subsequent layers of the neural network, analyzing the
features of the sentence, are automatically trained to fit the task.
        </p>
        <p>
          The system implemented in the framework of this
work is based on the neural network proposed in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ],
which is schematically presented in Figure 1. The first
layer extracts the properties of each word. The second
layer extracts the properties of the "window" of words,
treating it as a certain sequence with internal and external
structure (i.e., not treating it as a "bag of words"). The
following layers are standard layers of the neural
network - linear layers and the activation layer.
        </p>
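        <p>As an illustration of this architecture, the following sketch
(with assumed sizes, not the exact configuration of [8] or of this work)
builds a window-based fully connected tagger in Keras: each word is
classified from the concatenated embeddings of a five-word window.</p>
        <preformat>
# A window-based fully connected tagger; VOCAB_SIZE, EMB_DIM, WINDOW and
# N_LABELS are illustrative values, not the paper's settings.
from tensorflow.keras import layers, models

VOCAB_SIZE, EMB_DIM, WINDOW, N_LABELS = 100_000, 100, 5, 4

window_ids = layers.Input(shape=(WINDOW,), dtype="int32")
emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(window_ids)  # (WINDOW, EMB_DIM)
flat = layers.Flatten()(emb)                   # concatenated window features
hidden = layers.Dense(300, activation="tanh")(flat)      # feature layer
out = layers.Dense(N_LABELS, activation="softmax")(hidden)

model = models.Model(window_ids, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        </preformat>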
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Recurrent neural networks in the NER problem</title>
        <p>
          Recurrent neural networks are a powerful family of connected models
that capture and analyze temporal changes through cycles in a graph. In
theory, such networks can support the storage and transfer of dependencies
over long sequences, but in practice, because of vanishing / exploding
gradients during the backpropagation of errors, such dependencies are lost [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.2.1 Networks of long short-term memory (LSTM)</title>
        <p>
          The long short-term memory (LSTM) network [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] is a variant of recurrent networks aimed at solving the gradient
attenuation problem. In general, the long short-term memory cell consists of
three multiplicative gates that control the proportion of information that
must be forgotten or passed on to the next step. Below are the expressions
for calculating these components:
        </p>
        <p>i_t = σ(U_i h_{t-1} + W_i x_t + b_i)</p>
        <p>f_t = σ(U_f h_{t-1} + W_f x_t + b_f)</p>
        <p>c̃_t = tanh(U_c h_{t-1} + W_c x_t + b_c)</p>
        <p>c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t</p>
        <p>o_t = σ(U_o h_{t-1} + W_o x_t + b_o)</p>
        <p>h_t = o_t ⊙ tanh(c_t)</p>
        <p>where σ is the elementwise sigmoid function and ⊙ is elementwise
multiplication. x_t is the input vector (for example, the weights of the
word) at time t, and h_t is the hidden state vector (also the output vector)
in which all useful information at (and before) time t is stored. W_i, W_f,
W_c, W_o denote the weight matrices of the various gates for the input data
x_t, and U_i, U_f, U_c, U_o are the weight matrices for the hidden state.
b_i, b_f, b_c, b_o denote the bias weights.</p>
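        <p>A direct NumPy transcription of these gate equations may clarify
the data flow; the dimensions below are toy values, with the U matrices
acting on the hidden state and the W matrices on the input, as above.</p>
        <preformat>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM cell step; W, U, b are dicts keyed by gate: 'i','f','c','o'."""
    i = sigmoid(U["i"] @ h_prev + W["i"] @ x_t + b["i"])        # input gate
    f = sigmoid(U["f"] @ h_prev + W["f"] @ x_t + b["f"])        # forget gate
    c_tilde = np.tanh(U["c"] @ h_prev + W["c"] @ x_t + b["c"])  # candidate
    c = f * c_prev + i * c_tilde                                # new cell state
    o = sigmoid(U["o"] @ h_prev + W["o"] @ x_t + b["o"])        # output gate
    h = o * np.tanh(c)                                          # new hidden state
    return h, c

# Toy usage: hidden size 3, input size 2.
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(3, 2)) for k in "ifco"}
U = {k: rng.normal(size=(3, 3)) for k in "ifco"}
b = {k: np.zeros(3) for k in "ifco"}
h, c = lstm_step(rng.normal(size=2), np.zeros(3), np.zeros(3), W, U, b)
        </preformat>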
      </sec>
      <sec id="sec-4-4">
        <title>4.2.2 Bi-directional networks of long short-term memory (Bi-LSTM)</title>
        <p>
          For the task of named entity recognition it is important to
consider both the past (left) context of a word and its future (right)
context. However, the hidden state vector h_t stores information only about
the past, not about the future. An elegant solution with proven
effectiveness is bidirectional networks of long short-term memory [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The basic idea is that on the forward and backward passes over
the sequence, two hidden state vectors are formed to take into account both
the future and the past; these are then combined into one common output
vector.
        </p>
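        <p>In Keras, this corresponds to wrapping an LSTM layer in the
Bidirectional wrapper; a minimal sketch with assumed sizes:</p>
        <preformat>
# Bi-LSTM encoder over pre-embedded word vectors; sizes are illustrative.
from tensorflow.keras import layers, models

MAX_LEN, EMB_DIM, HIDDEN = 50, 100, 100

tokens = layers.Input(shape=(MAX_LEN, EMB_DIM))
bi = layers.Bidirectional(
    layers.LSTM(HIDDEN, return_sequences=True),  # one h_t per position
    merge_mode="concat",                         # [forward; backward]
)(tokens)
encoder = models.Model(tokens, bi)   # output shape: (MAX_LEN, 2 * HIDDEN)
        </preformat>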
      </sec>
      <sec id="sec-4-5">
        <title>4.2.3 Conditional random fields (CRF)</title>
        <p>For the task of extracting named entities, it is also important to
take into account the links between the labels of neighboring words.
Moreover, the label scores should be decoded in the perspective of the whole
sentence, because, for example, a first name is often followed by a surname,
and the like. Thus, it is necessary to apply conditional random fields,
which decode the sequence of words (the sentence) as a whole rather than
every word individually.</p>
        <p>Formally, for the input sentence X = (x_1, x_2, …, x_n), let P
denote the matrix of estimates of this sentence produced by the network, of
size n × k, where k is the number of different labels and P_{i,j} is
responsible for the probability of the j-th label for the i-th word. Then
for the sequence of predictions y = (y_1, y_2, …, y_n) the score is</p>
        <p>s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} P_{i, y_i}</p>
        <p>where A is the matrix of transition probabilities, in which
A_{i,j} denotes the probability of the transition from label i to label j.
Then the conditional probability over all possible label sequences for the
sentence X is expressed using the softmax function:</p>
        <p>p(y | X) = e^{s(X, y)} / Σ_{ỹ ∈ Y_X} e^{s(X, ỹ)}</p>
        <p>During training the log-likelihood function is maximized:</p>
        <p>log p(y | X) = s(X, y) − log Σ_{ỹ ∈ Y_X} e^{s(X, ỹ)}</p>
        <p>where Y_X denotes all possible label sequences for the sentence
X.</p>
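        <p>Decoding the most probable label sequence under this model is done
with the Viterbi algorithm over the matrices P and A; the following NumPy
sketch (an illustration with random score matrices) returns the sequence y
maximizing s(X, y):</p>
        <preformat>
import numpy as np

def viterbi_decode(P, A):
    """P: (n, k) per-word label scores; A: (k, k) transition scores."""
    n, k = P.shape
    delta = P[0].copy()                 # best score ending in each label
    backptr = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        # scores[i, j] = delta[i] + A[i, j] + P[t, j]
        scores = delta[:, None] + A + P[t][None, :]
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    y = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):       # follow back-pointers
        y.append(int(backptr[t][y[-1]]))
    return y[::-1]

# Toy usage: 4 words, 3 labels.
rng = np.random.default_rng(1)
print(viterbi_decode(rng.normal(size=(4, 3)), rng.normal(size=(3, 3))))
        </preformat>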
      </sec>
      <sec id="sec-4-6">
        <title>4.2.4 Vector representation of words</title>
        <p>
          For each word, it is necessary to obtain a vector representation
w ∈ R^d that is relevant for the NER task. It can be considered as a
concatenation of the weights of the word from a pre-trained model,
w1 ∈ R^{d1}, and the property vector w2 ∈ R^{d2} obtained from the letter
representation of the word. In this work, a bidirectional network of long
short-term memory is used to extract attributes from the letter
representation of the word, which is fed to the input as a sequence of
letters; its architecture is depicted in Figure 2. However, another approach
is also possible, based on other recurrent or convolutional networks [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. A pre-trained word2vec model is also used, its architecture is
depicted in Figure 3, and its further training is applied to receive the
word weights; the results are transferred to the CRF model described above.
The corpus it was trained on contains about 10 billion words, which are
represented by vectors of dimension 100. The models of the neural networks
were implemented in Python using the Keras and Tensorflow libraries. The
training was conducted on a GeForce GTX 1080 video card; one epoch took 370
seconds for the fully connected network and 190 seconds for the Bi-LSTM, and
the training lasted for 30 epochs.
        </p>
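        <p>A sketch of this word representation in Keras (vocabulary and
dimension values are assumptions): the pre-trained word weights w1 are
concatenated with the char-level Bi-LSTM features w2 at every position of
the sentence.</p>
        <preformat>
# Word vector = word embedding + char Bi-LSTM features; sizes illustrative.
from tensorflow.keras import layers, models

MAX_LEN, MAX_WORD_LEN = 50, 20
VOCAB, CHARS = 100_000, 150
WORD_DIM, CHAR_DIM, CHAR_HIDDEN = 100, 100, 50

word_ids = layers.Input(shape=(MAX_LEN,), dtype="int32")
char_ids = layers.Input(shape=(MAX_LEN, MAX_WORD_LEN), dtype="int32")

w1 = layers.Embedding(VOCAB, WORD_DIM)(word_ids)    # pre-trained word weights
char_emb = layers.Embedding(CHARS, CHAR_DIM)(char_ids)
w2 = layers.TimeDistributed(                        # char Bi-LSTM per word
    layers.Bidirectional(layers.LSTM(CHAR_HIDDEN)))(char_emb)

w = layers.Concatenate()([w1, w2])                  # final word vectors
encoder = models.Model([word_ids, char_ids], w)
        </preformat>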
        <p>
          Character weights. For the model with a fully connected neural
network, word-level features were extracted with the help of given rules and
were not trained. For the Bi-LSTM model, the weights of the letters were
initialized using the Xavier [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] method and had a dimension of 100.
        </p>
        <p>Additional parameters. In order to improve the results of the
system with a recurrent neural network, external features were added for
each word: the presence or absence of the word in specialized dictionaries
of named entities on the topic of information security. A total of 12
dictionaries were used, with such named entities as virus names, program
names, etc., with a total of 1470 tokens. Since the words in the
dictionaries were brought to normal form, all the words of the corpus were
also reduced to normal form using the morphological analyzer pymorphy2. For
the Bi-LSTM + CRF + voc model, the feature vector is constructed as a
concatenation of the vector of pre-trained word weights and a binary vector
that carries information about the dictionaries in which the word was
encountered. The intersection of the dictionaries with the corpus amounted
to 576 tokens, which is ~0.15% of all corpus tokens.</p>
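        <p>A sketch of how such dictionary features can be constructed (the
dictionary entries below are invented placeholders): each token is brought
to normal form with pymorphy2 and checked against every dictionary, yielding
the binary vector that is appended to the word weights.</p>
        <preformat>
import numpy as np
import pymorphy2

morph = pymorphy2.MorphAnalyzer()

# 12 thematic dictionaries in the real system; two invented ones here.
dictionaries = [
    {"trojan", "wannacry"},    # e.g. virus names (placeholder entries)
    {"windows", "android"},    # e.g. program names (placeholder entries)
]

def dictionary_features(token):
    lemma = morph.parse(token)[0].normal_form  # bring token to normal form
    return np.array([1.0 if lemma in d else 0.0 for d in dictionaries])

def extend_word_vector(w, token):
    # Bi-LSTM + CRF + voc: word weights concatenated with dictionary bits.
    return np.concatenate([w, dictionary_features(token)])
        </preformat>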
      </sec>
      <sec id="sec-4-7">
        <title>5.2 Optimization algorithm</title>
        <p>
          As the optimization algorithm, Adam [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] was chosen. The initial learning rate was η_0 = 0.005 and
decreased in each epoch according to the law η_e = η_0 · d^e, where the
decay factor was d = 0.9 and e is the number of passed epochs. In addition,
experiments were conducted with the SGD and AdaGrad optimization algorithms,
but these methods showed no improvement in comparison with Adam and
converged more slowly.
        </p>
        <p>To overcome overfitting, the Dropout method with a probability of
0.5 was used, which gave a significant increase in the accuracy of the
model.</p>
        <p>To overcome exploding gradients, the gradient clipping method was
used.</p>
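        <p>The training configuration described above can be expressed in
Keras roughly as follows (the clipping threshold is an assumption; the
Dropout layer with probability 0.5 is placed inside the model itself):</p>
        <preformat>
from tensorflow.keras import callbacks, optimizers

ETA0, DECAY = 0.005, 0.9

def lr_for_epoch(epoch, lr):
    # assumed per-epoch schedule: eta_e = eta_0 * d**e
    return ETA0 * DECAY ** epoch

optimizer = optimizers.Adam(learning_rate=ETA0, clipnorm=5.0)  # clip gradients
schedule = callbacks.LearningRateScheduler(lr_for_epoch)

# model.compile(optimizer=optimizer, loss="sparse_categorical_crossentropy")
# model.fit(x_train, y_train, epochs=30, callbacks=[schedule])
        </preformat>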
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6 Evaluation</title>
      <sec id="sec-5-1">
        <title>6.1 The standard task of NER</title>
        <p>The purpose of the first experiment was to verify that the
implemented systems are correct and that their application to texts on the
topic of information security is justified. For this, tests were carried out
on the three corpuses with standard named entities. As can be seen from the
results, adding a CRF layer significantly improves the quality of
predictions. Moreover, Bi-LSTM networks outperform fully connected ones in
the task of extracting named entities. In addition, the experiment showed
that the implemented methods achieve high F1 scores, and their use in the
task of named entity recognition for texts on information security is
justified.</p>
      </sec>
      <sec id="sec-5-2">
        <title>6.2 The task of NER on the topic of information security</title>
        <p>The purpose of the second experiment was to establish the
applicability of current solutions to the NER problem for IS texts. In this
task, the systems were trained to extract 16 types of named entities from
the text.</p>
        <p>Test results are shown in Table 5. As can be seen from the table,
the results dropped for all implemented solutions; however, the use of
specialized dictionaries made it possible to improve the result.</p>
      </sec>
      <sec id="sec-5-3">
        <title>6.3 Additional parameters</title>
        <p>Weights of words. Table 6 compares the results for the two
different models of word weights discussed earlier. RDT stands for Russian
Distributional Thesaurus and ARM for Araneum Russicum Maximum respectively.
It can be seen from the table that the model with the larger number of
weights gives better results.</p>
        <sec id="sec-5-3-1">
          <title>Fully connected NN</title>
        </sec>
        <sec id="sec-5-3-2">
          <title>Bi-LSTM + CRF</title>
        </sec>
        <sec id="sec-5-3-3">
          <title>Bi-LSTM + CRF + vocab</title>
          <p>Dropout. Table 7 compares the results with the
addition of the Dropout layer and without it, as well as
the various probabilities of using Dropout. It can be seen
from the table that the addition of this layer improved the
results of the system, and the optimal value was the
probability of 0.5.</p>
          <p>Gradient clipping. Table 8 compares the results for different
values of the gradient clipping parameter.</p>
          <p>Normalization of words. Experiments were carried out in which
each word was mapped to the vector of its normal form. This approach showed
a small improvement, but bringing the words to the initial form takes a lot
of time.</p>
      </sec>
      <sec id="sec-5-4">
        <title>6.4 Analysis of results</title>
        <p>The values of the metrics on the information security corpus were
lower than on the corpuses with standard named entities; this result allows
us to state that the task of NER in IS texts is more complex and requires
the development of new methods and approaches.</p>
        <p>The method of adding dictionaries to the model has already been
considered above; now consider the changes it introduces in the metrics of
the relevant entities, namely HACKER, HACKER_GROUP, VIRUS, DEVICE, TECH and
PROGRAM. The results are shown in Table 9. All metrics are calculated by
incomplete matching.</p>
        <p>It can be seen from the table that the improvement occurred in
almost all relevant entity types. For some types the F1 measure showed
significant growth; for the type HACKER_GROUP the difference was more than
150%. Thus, despite the fact that the size of the dictionaries was small in
comparison with the size of the corpus (about 0.15%), the use of thematic
dictionaries correlated with the task positively affects the F1 measure.</p>
        <p>The following reasons for the decrease in the extraction accuracy
were also highlighted:
• Increase in the number of entity types. This undoubtedly leads to a
decrease in the quality of the system: as the number of types increases, the
complexity of isolating a specific one grows, and new dependencies between
the types appear, which the system simply cannot take into account. The
extracted features also become more sensitive.
• Semantic proximity of entity types. This factor makes itself felt even at
the stage of marking the corpus by assessors. Ambiguities in the markup
appear already with four types (that is why the type LOCORG mentioned
earlier was sometimes introduced), and they grow as the number of entity
types increases and their semantics converge, which generates errors even
before the named entity recognition system is trained.</p>
        <p>• Heterogeneity of the marked corpus. Since the marked data were
news rather than scientific articles, they were not written in a formal
language: there are many stylistic, lexical and spelling mistakes, many
borrowings, English text, jargon and common speech. All this negatively
affected the quality of the NER system; in particular, for many words no
correspondence was found in the word2vec vector representation. For the
information security corpus this figure was 10,044 out of 48,320 words
(20.7%), while for the FactRuEval corpus it was 1,021 out of 20,908 words
(4.8%).</p>
        <p>• Low degree of occupancy of the classes relevant to the topic of
information security. For almost all classes, the share in the total number
of words of the corpus was significantly less than 1%, while in standard
corpuses this share does not fall below 3.54% of the total number of words
in the corpus.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7 Results</title>
      <p>Within the framework of this work, the task of extracting named
entities from texts on the topic of information security using artificial
neural networks was addressed. The novelty of the study is that at the
moment there are no known publications on named entity recognition for texts
on information security in Russian. In this work:
• Two software systems based on artificial neural networks solving the named
entity recognition problem were implemented.
• The developed systems were tested on the existing marked corpuses in
Russian using the entity types standard for the task of extracting named
entities: PERSON, LOCATION and ORGANIZATION. Thus, compliance with the
modern level of quality for such systems was confirmed.
• The developed software systems were applied to the corpus on the topic of
information security. The recognition was performed for fifteen types of
named entities. Based on the results of testing on this corpus, the
following quality indicators were obtained for the main classes relevant to
the topic of information security: PROGRAM - F1-measure 73.57%; TECH -
F1-measure 71.74%; DEVICE - F1-measure 64.99%; VIRUS - F1-measure 64.10%;
HACKER_GROUP - F1-measure 44.44%; HACKER - F1-measure 40%.
• It was found that the use of small specialized dictionaries of named
entities improves the quality indicators by 0.5-1% for all classes and by 3%
for the classes relevant to the topic of information security.</p>
      <p>The area of information security is complex, and the methods
studied do not yet achieve results in this area that are close to the
results in standard NER tasks. This is a consequence of the following
factors: an increase in the number of classes of named entities, difficult
manual classification, semantic proximity of classes, and lack of
representativeness of the classes in the marked corpuses.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Nadeau</surname>
            , David and
            <given-names>Satoshi</given-names>
          </string-name>
          <string-name>
            <surname>Sekine</surname>
          </string-name>
          (
          <year>2007</year>
          )
          <article-title>A survey of named entity recognition and classification</article-title>
          .
          <source>Linguisticae Investigationes</source>
          <volume>30</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Masayuki</given-names>
            <surname>Asahara</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yuji</given-names>
            <surname>Matsumoto</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Japanese named entity extraction with redundant morphological analysis</article-title>
          .
          <source>Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1. Pages 8-15</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Borthwick</surname>
          </string-name>
          , John Sterling, Eugene Agichtein, and
          <string-name>
            <given-names>Ralph</given-names>
            <surname>Grishman</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>NYU: Description of the MENE named entity system as used in MUC-7</article-title>
          .
          <source>In Proceedings of the Seventh Message Understanding Conference (MUC-7).</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Satoshi</given-names>
            <surname>Sekine</surname>
          </string-name>
          et al.
          <year>1998</year>
          .
          <article-title>NYU: Description of the Japanese NE system used for MET-2</article-title>
          .
          <source>In Proceedings of the Seventh Message Understanding Conference (MUC-7).</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Daniel</surname>
            <given-names>M Bikel</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scott Miller</surname>
          </string-name>
          , Richard Schwartz, and
          <string-name>
            <given-names>Ralph</given-names>
            <surname>Weischedel</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Nymble: a high-performance learning namefinder</article-title>
          .
          <source>Proceedings of the fifth conference on Applied natural language processing. Pages 194-201</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Kumar</given-names>
            <surname>Saha</surname>
          </string-name>
          , Sujan, Sarathi Ghosh, Partha, Sarkar, Sudeshna, &amp;
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>Pabitra.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Named Entity Recognition in Hindi using Maximum Entropy and Transliteration</article-title>
          . Polibits, (
          <volume>38</volume>
          ),
          <fpage>33</fpage>
          -
          <lpage>41</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>McCallum</surname>
          </string-name>
          and
          <string-name>
            <given-names>Wei</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons</article-title>
          .
          <source>In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4. Pages 188-191</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Ronan</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jason</given-names>
            <surname>Weston</surname>
          </string-name>
          , L´eon Bottou, Michael Karlen, Koray Kavukcuoglu, and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Kuksa</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Natural Language Processing (Almost) from Scratch</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          Volume
          <volume>12</volume>
          ,
          <issue>2</issue>
          /1/2011 Pages 2493-
          <fpage>2537</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Lample</surname>
          </string-name>
          , Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Dyer</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Neural architectures for named entity recognition</article-title>
          .
          <source>arXiv preprint arXiv:1603</source>
          .
          <fpage>01360</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Vlasova</surname>
            <given-names>N.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suleymanova</surname>
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trofimov</surname>
            <given-names>I.V</given-names>
          </string-name>
          :
          <article-title>Report on Russian corpus for personal name retrieval</article-title>
          .
          <source>In proceedings of computational and cognitive linguistics TEL</source>
          '
          <year>2014</year>
          , Kazan, Russia, pp
          <fpage>36</fpage>
          -
          <lpage>40</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Xuezhe</given-names>
            <surname>Ma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eduard</given-names>
            <surname>Hovy</surname>
          </string-name>
          .
          <article-title>End-to-end Sequence Labeling via Bi-directional LSTMCNNs-CRF</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics</source>
          <volume>1</volume>
          (Long Papers):
          <fpage>1064</fpage>
          -
          <lpage>1074</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L. T.</given-names>
            <surname>Anh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Arkhipov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Burtsev</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition</article-title>
          .
          <source>arXiv preprint arXiv:1709</source>
          .
          <fpage>09686</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Yoshua</surname>
            <given-names>Bengio</given-names>
          </string-name>
          , Patrice Simard, and
          <string-name>
            <given-names>Paolo</given-names>
            <surname>Frasconi</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Learning long-term dependencies with gradient descent is difficult</article-title>
          .
          <source>IEEE Transactions on Neural Networks</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>157</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jurgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Sepp</surname>
            <given-names>Hochreiter</given-names>
          </string-name>
          ,
          <article-title>Jürgen Schmidhuber: Long Short-Term Memory</article-title>
          . MIT Press, Vol.
          <volume>9</volume>
          , No.
          <volume>8</volume>
          ,
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Chiu</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nichols</surname>
          </string-name>
          , E.:
          <article-title>Named entity recognition with bidirectional lstm-cnns</article-title>
          .
          <source>arXiv preprint arXiv:1511.08308</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Bird</surname>
          </string-name>
          , Steven, Edward Loper and Ewan
          <string-name>
            <surname>Klein</surname>
          </string-name>
          (
          <year>2009</year>
          ),
          <article-title>Natural Language Processing with Python. O'Reilly Media Inc</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Arefyev</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Panchenko</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lukanin</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lesota</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romanov</surname>
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2015</year>
          )
          <article-title>: Evaluating Three Corpus-Based Semantic Similarity Systems for Russian</article-title>
          .
          <source>In Proceedings of the 21st International Conference on Computational Linguistics and Intellectual Technologies</source>
          (Dialogue'
          <year>2015</year>
          ). Moscow, Russia. RGGU
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Pedregosa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Varoquaux</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gramfort</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Michel</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Thirion</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Grisel</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Blondel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Weiss</surname>
          </string-name>
          , R. and
          <string-name>
            <surname>Dubourg</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vanderplas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Passos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cournapeau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Brucher</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Perrot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Duchesnay</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          , Pedregosa et al.,
          <source>Journal of Machine Learning Research 12</source>
          , pp.
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>X.</given-names>
            <surname>Glorot</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Understanding the difficulty of training deep feedforward neural networks</article-title>
          .
          <source>In AISTATS</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Diederik</surname>
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>Jimmy</given-names>
          </string-name>
          <string-name>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>