<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Named Entity Recognition in Tatar: Corpus-Based Algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>vzorov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>mir Mukh</string-name>
          <email>damirmuh@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Galieva</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The Tatarstan Academy of Sciences</institution>
          ,
          <addr-line>Kazan</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Named entity recognition is one of the pressing tasks in language research based on electronic corpora. This article reviews the main methods for solving this problem, including algorithms based on various machine learning models, regular expressions, and dictionaries. The authors also propose their own algorithm, which recognizes named entities on the basis of search queries using direct and reverse search. The results of the algorithm presented in the article indicate which additional functions are needed to achieve the best results. The proposed algorithm is used in the “Tugan Tel” corpus management system and can be applied both to the electronic corpus of the Tatar language and to corpora of other languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Named entity recognition</kwd>
        <kwd>NER</kwd>
        <kwd>Corpus management system</kwd>
        <kwd>Text mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Electronic language corpora are the basis for a wide range of linguistic research. Corpus management systems help solve a number of linguistic problems, such as direct search for word forms and lemmas, reverse search by morphological properties, and selection of contexts and n-grams for various search queries. These simple queries are supported by most corpus management systems.</p>
      <p>One of the difficult tasks in searching corpus data is named entity recognition.
Dozens of researchers have addressed this problem, often with good results. Most
existing solutions, some of which are described in Section 3 of this article, work with
English, Spanish, Dutch, and German and rely on various NLP methods, regular expressions,
dictionaries, etc. In Section 4 of this article, the authors consider one
possible algorithm for named entity recognition, which can be used both with
the electronic corpus of the Tatar language and with electronic corpora of other
languages. This algorithm is implemented in one of the modules of the “Tugan Tel”
corpus management system. The authors also conducted a series of experiments, the
results of which are shown in Section 4.2 of this article.</p>
      <sec id="sec-1-1">
        <title>“Tugan Tel” Corpus Management System</title>
        <p>
          The Tatar corpus management system (www.corpus.antat.ru) is being developed at the Institute
of Applied Semiotics of the Tatarstan Academy of Sciences. The main functions of
the corpus management system are searching for lexical units, morphological
and lexical search, searching for syntactic units, grammar-based n-gram search,
and others. The core of the system is the semantic model of data representation.
The search is performed using common open source tools. We use the MariaDB database
management system and the Redis data store [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Our purpose is to design a corpus
management system that supports electronic corpora of Turkic languages. This line
of research is developing very rapidly.
        </p>
        <p>
          Well-known electronic corpus projects for Turkic languages include the
corpora of Turkish and Uyghur [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], Bashkir, Khakass, Kazakh (http://til.gov.kz), and Tuvan.
The “Tugan Tel” Tatar national corpus is a linguistic resource of modern
literary Tatar. As of November 2016, it comprises more than 100 million word forms.
The corpus contains texts of various genres: fiction, media texts, official
documents, textbooks, scientific papers, etc. Each document has a meta description
[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]: author, title, publishing details, date of creation, genre, etc. Texts included in the
corpus are provided with morphological markup, i.e. information about the part of speech
and grammatical properties of each word form [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The morphological markup is
produced automatically by a two-level morphological analysis module
for the Tatar language built with the PC-KIMMO software tool.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <sec id="sec-2-1">
        <title>LingPipe</title>
        <p>
          One of the related works is LingPipe [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], a collection of Java libraries
developed by Alias-I. LingPipe classifies named entities in English text as persons,
organizations, or places. It supports the use of other language packages for classification.
LingPipe also supports additional features such as orthographic correction and English text
classification. This software is distributed free of charge for research purposes.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Annie</title>
        <p>
          Another similar work is Annie [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], a named entity extraction module
embedded into the GATE framework. Annie is open source, distributed under a GNU
license, and developed at the University of Sheffield. Annie implements various components
needed for extracting named entities: a tokenizer, a sentence splitter, a POS tagger,
coreference resolution, gazetteers of place names, etc.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Afner</title>
        <p>
          Afner [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is an open source NERC tool licensed under the GNU license and developed in
C++ at Macquarie University. It is used as part of a question answering service that
focuses on maximizing responsiveness to user questions, but Afner can also
be used separately from the service. Afner uses lists, regular expressions, and
supervised learning models. It extracts names of persons, organizations,
locations, monetary values, and dates from English texts.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Knowledge-based systems</title>
        <p>
          Knowledge-based NER systems use lexical resources and domain-related knowledge
and do not require training on annotated data. Such systems show good results when
the lexical resources are complete, but they fail, for example, on entities of the
drug_n class in the DrugNER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] data set, since these entities are not defined
in the DrugBank dictionaries. Despite their high precision, these systems show low
recall because of domain- and language-specific rules and incomplete dictionaries.
Another disadvantage of knowledge-based NER systems is the need for experts to
participate in the development and maintenance of knowledge resources.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>Unsupervised and bootstrapped systems</title>
        <p>
          Early systems did not require significant data for training. Collins and Singer (1999)
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] used only labeled seeds and 7 features for classifying and extracting named
entities: orthography (for example, capitalization), entity context, words occurring within
named entities, etc. To improve the recall of NER systems, Etzioni et al. (2005) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
proposed an unsupervised system using 8 generic pattern extractors for open web
texts, for example, NP is &lt;class1&gt;, NP1 such as NPList2. In 2006, Nadeau et al.
proposed an unsupervised system that builds a gazetteer of named entities and
resolves named entity ambiguity, based on the work of Etzioni et al. (2005)
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and Collins and Singer (1999) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. This system combined the extracted list of
named entities with publicly available gazetteers and achieved
F-scores of 88%, 61% and 59% on MUC-7 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] for the location, person and
organization classes, respectively.
        </p>
        <p>
          Zhang and Elhadad (2013) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] used shallow syntactic knowledge and inverse document frequency
(IDF) in an unsupervised NER system for biological and medical data.
This system reached 53.8% and 69.5% on the i2b2 and GENIA data sets, respectively. Their model uses seeds to
find text that may contain named entities, identifies noun phrases, and
filters out phrases with a low IDF value. The filtered list is passed to a classifier that
predicts the tags of named entities.
        </p>
      </sec>
      <sec id="sec-2-6">
        <title>Feature-engineered supervised systems</title>
        <p>Supervised machine learning models learn to make predictions by training on
example inputs and their expected outputs, and can be used to replace humanly established
rules. Hidden Markov Models (HMM), Support Vector Machines (SVM), Conditional
Random Fields (CRF), and decision trees were common machine learning systems for
NER.</p>
        <p>The results of research using various machine learning models from various
authors are presented in Table 1.</p>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr>
                <th>Author(s)</th>
                <th>Features</th>
                <th>Results</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Zhou and Su (2002) [13]</td>
                <td>Included 11 orthographic features, a list of trigger words for named entities, and a list of words from various gazetteers.</td>
                <td>F-scores of 96.6% and 94.1% on MUC-6 and MUC-7 data, respectively.</td>
              </tr>
              <tr>
                <td>Malouf (2002) [14]</td>
                <td>Included capitalization; considered whether the word went first in the sentence, whether the word had appeared before with a known last name, and 13281 first names collected from various dictionaries.</td>
                <td>F-scores of 73.66% and 68.08% on Spanish and Dutch CoNLL 2002 datasets, respectively.</td>
              </tr>
              <tr>
                <td>Carreras et al. (2002) [<xref ref-type="bibr" rid="ref15">15</xref>]</td>
                <td>Included capitalization, trigger words, previous tag prediction, bag of words, gazetteers.</td>
                <td>F-scores of 81.39% and 77.05% on Spanish and Dutch CoNLL 2002 datasets, respectively.</td>
              </tr>
              <tr>
                <td>Li et al. (2005) [<xref ref-type="bibr" rid="ref16">16</xref>]</td>
                <td>Experimented with multiple window sizes, features (orthographic, prefixes, suffixes, labels, etc.) from neighboring words, weighting neighboring word features according to their position, and class weights to balance positive and negative classes.</td>
                <td>F-score of 88.3% on the English CoNLL 2003 data.</td>
              </tr>
              <tr>
                <td>Ando and Zhang (2005) [17]</td>
                <td>The best classifier for each auxiliary task was selected based on its confidence.</td>
                <td>F-scores of 89.31% and 75.27% on English and German, respectively.</td>
              </tr>
              <tr>
                <td>Agerri and Rigau (2016) [<xref ref-type="bibr" rid="ref18">18</xref>]</td>
                <td>Included orthography, character n-grams, lexicons, prefixes, suffixes, bigrams, trigrams, and unsupervised cluster features from the Brown corpus, Clark corpus and k-means clustering of open text using word embeddings.</td>
                <td>F-scores of 84.16%, 85.04%, 91.36%, 76.42% on Spanish, Dutch, English, and German CoNLL, respectively.</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Extracting named entities</title>
      <p>Extracting named entities from corpus data makes it possible, on the one hand, to
retrieve the required data directly by query and, on the other hand, to test whether the corpus
contains particular information and to replenish it with documents that include the missing
data. The algorithm for extracting named entities proposed in this paper makes it possible to
obtain semantic samples from corpora that have no semantic markup. Moreover,
the algorithm places no restriction on the semantic types of extracted data, i.e.
the semantic type is defined by the keyword in the query.</p>
      <sec id="sec-3-1">
        <title>Description of the algorithm for extracting named entities</title>
        <p>The algorithm for extracting named entities is based on the idea of comparing
n-grams. The comparison is made over the entire corpus, thereby increasing
the accuracy of the results.</p>
        <p>The extraction process is iterative, with the threshold number of iterations specified by
the user. In the first step, a sample is selected by the initial search query, which
may be a query on a word form, lemma, or phrase, or a search by
morphological parameters. A list of bigrams and their frequencies is collected across the sample.
Each bigram contains a search result extended by one position to the left or right
(the direction is set by the user). The resulting list is sorted by bigram frequency in
descending order and cut to a predetermined covering index (for example, 95% of
all results; this rate is set by the user). This result is used in the second iteration of
the algorithm. Each bigram is searched for in the corpus in phrasal search mode.
The search results are used to compose a list of trigrams, each extended by
one position to the left or right, together with their frequencies. The resulting list of trigrams is
likewise sorted by frequency in descending order and cut to the
predetermined covering index.</p>
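        <p>The cut to a covering index can be sketched as follows. This is a minimal illustration, not the implementation used in “Tugan Tel”: the function name cut_to_coverage and the sample counts are hypothetical, and the covering index is read here as the share of all occurrences that the retained n-grams must account for.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def cut_to_coverage(freqs, coverage_pct):
    """Keep the most frequent n-grams until they account for coverage_pct
    percent of all occurrences (one reading of the covering index)."""
    total = sum(freqs.values())
    kept = []
    covered = 0
    # most_common() yields items from the largest to the smallest frequency
    for ngram, count in Counter(freqs).most_common():
        if covered * 100 >= coverage_pct * total:
            break
        kept.append((ngram, count))
        covered += count
    return kept


# Invented toy counts, not corpus data
bigram_freqs = {
    ("alpha", "uramy"): 60,
    ("beta", "uramy"): 30,
    ("gamma", "uramy"): 6,
    ("delta", "uramy"): 4,
}
top_bigrams = cut_to_coverage(bigram_freqs, 95)
```
]]></preformat>
        <p>With these toy counts the three most frequent bigrams are kept, since together they cover 96 of the 100 occurrences, and the rarest bigram is dropped.</p>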
        <p>The third and subsequent iterations (which continue until the threshold number of iterations is
reached or an iteration finds no match) use the list of n-grams received
from the previous iteration. The corpus is searched for each n-gram in phrasal
search mode, and a list of (n + 1)-grams is compiled. The resulting list is then cut to the
predetermined covering index and compared with the list of n-grams derived from the
previous iteration. The comparison accuracy P is set by the user as a percentage. If the
frequency of an n-gram is less than P percent of the number of (n + 1)-grams found, the
n-gram is considered a found named entity; otherwise the extraction proceeds. Thus,
the final result is a list of the most stable n-grams of different lengths,
including the results of the initial search query.</p>
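        <p>The iterative extension can be illustrated on a toy token list. The sketch below is an assumption-laden simplification and not the authors' code: the helper names are hypothetical, the corpus is an in-memory list instead of a phrasal search over the database, the covering-index cut is omitted, only leftward extension is shown, and the comparison rule is read as "an extension is kept if it retains at least P percent of the frequency of the n-gram it extends", which is only one possible interpretation of the description above.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def find_occurrences(tokens, ngram):
    """Return the start positions at which ngram occurs in tokens."""
    n = len(ngram)
    return [i for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == ngram]


def extend_left(tokens, ngram):
    """Count the (n + 1)-grams formed by adding one token to the left."""
    counts = Counter()
    for i in find_occurrences(tokens, ngram):
        if i > 0:
            counts[(tokens[i - 1],) + ngram] += 1
    return counts


def grow_left(tokens, seeds, max_iterations, p_pct):
    """Extend seed n-grams leftwards until no extension is stable enough."""
    frontier = dict(seeds)
    found = {}
    for _ in range(max_iterations):
        if not frontier:
            break
        next_frontier = {}
        for ngram, freq in frontier.items():
            # Assumed rule: an extension is kept if it retains at least
            # p_pct percent of the frequency of the n-gram it extends.
            stable = {g: c for g, c in extend_left(tokens, ngram).items()
                      if c * 100 >= p_pct * freq}
            if stable:
                next_frontier.update(stable)
            else:
                found[ngram] = freq  # no stable extension: report the n-gram
        frontier = next_frontier
    found.update(frontier)  # iteration limit reached: report what is left
    return found


# Invented toy corpus, already tokenized
tokens = "a new york times b new york times c new york times".split()
entities = grow_left(tokens, {("york", "times"): 3}, max_iterations=7, p_pct=80)
```
]]></preformat>
        <p>In this toy run the bigram grows to the trigram ("new", "york", "times") and stops there, because each of its three left-hand extensions occurs only once.</p>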
        <p>A request to retrieve named entities extends the search query tuple Q. In
addition to the search query components (Q1, Q2), it includes the threshold
number of iterations to the left (L) and to the right (R), the covering index (C), and the accuracy
of matching (P). The form of the request is presented in (1).</p>
        <p>
          <disp-formula id="eq-1"><label>(1)</label>Q = (Q1, Q2, L, R, C, P)</disp-formula>
        </p>
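        <p>One way to hold such a request in code is a small immutable record. The sketch below is hypothetical: the class and field names are not taken from the system, the underlying search query is kept as an opaque tuple because its inner fields are not described here, and the mapping of the four trailing numbers to L, R, C and P simply follows the order given in (1).</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExtractionQuery:
    base_query: Tuple      # underlying search query, kept opaque here
    max_left: int          # L: threshold number of iterations to the left
    max_right: int         # R: threshold number of iterations to the right
    coverage_pct: int      # C: covering index, in percent
    precision_pct: int     # P: accuracy of matching, in percent

    def __post_init__(self):
        if self.max_left < 0 or self.max_right < 0:
            raise ValueError("iteration thresholds must be non-negative")
        for name in ("coverage_pct", "precision_pct"):
            value = getattr(self, name)
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be between 0 and 100")


# Values taken from the ministries query used in the experiments
query = ExtractionQuery(
    ("wordform", "ministrlygy", "", "right", 1, 10, "exact"), 7, 0, 95, 80
)
```
]]></preformat>
        <p>Validating the percentages at construction time keeps malformed requests from reaching the search step.</p>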
      </sec>
      <sec id="sec-3-2">
        <title>Experiments</title>
        <p>Extracting named entities with the proposed algorithm requires an
initial search query that contains an indicator of a particular class of named entity.
This indicator makes it possible to classify named entities; therefore, the authors chose the set of
schema.org classes as the basis for choosing the indicators. From this set of classes,
the authors selected the following classes for searching for named entities in the Tatar
language corpus: books, restaurants, films, magazines, companies, airports,
corporations, languages, technical schools, universities, schools, shops, museums, and
hospitals. Ministries and street names were also added to this list. Below are some of
the results of the experiments conducted by the authors.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Names of ministries</title>
        <p>A number of experiments were carried out as part of the task of enhancing named entity
search. One of the most revealing was the search for names of
ministries. The initial search query for the experiment was (2).</p>
        <p>
          <disp-formula id="eq-2"><label>(2)</label>Q = ((wordform, ministrlygy, “”, right, 1, 10, exact), 7, 0, 95, 80)</disp-formula>
        </p>
        <p>The result of this query was a list of 50 n-grams containing the word form
"ministrlygy" in the last position. The reference list of names of ministries presented on the
Republic of Tatarstan government website [http://prav.tatarstan.ru/tat/ministries.htm]
contains 17 items. The algorithm found 12 of the 17 items in the corpus,
so the overlap of the results is 70.6%. 5 items were not found in the corpus for the reasons
described in Table 2. The remaining 33 n-grams are different spelling variants of
names of ministries.</p>
        <table-wrap id="tab-2">
          <label>Table 2</label>
          <table>
            <thead>
              <tr>
                <th>Name of ministry</th>
                <th>Reason</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Yashlәr eshlәre һәm sport ministrlygy (Tat) – ministry of youth and sport</td>
                <td>Corpus meanings not corresponding to the official name</td>
              </tr>
              <tr>
                <td>Transport һәm yul huҗalygy ministrlygy (Tat) – ministry of transport and road management</td>
                <td>Overlap of the sequence of word forms with the sequence in another name: «huҗalygy ministrlygy» (Tat) – ministry of property and «Transport һәm yul huҗalygy ministrlygy» (Tat) – ministry of transport and road management</td>
              </tr>
              <tr>
                <td>Hezmәt, halykny el belәn tәemin itү һәm social yaklau ministrlygy (Tat) – ministry of labour, employment and social protection</td>
                <td>Corpus meanings not corresponding to the official name</td>
              </tr>
              <tr>
                <td>Urman huҗalygy ministrlygy (Tat) – ministry of forestry</td>
                <td>Overlap of the sequence of word forms with the sequence in another name: «huҗalygy ministrlygy» (Tat) – ministry of property and «Urman huҗalygy ministrlygy» (Tat) – ministry of forestry</td>
              </tr>
              <tr>
                <td>Ecologia һәm tabigy baylyklar ministrlygy (Tat) – ministry of ecology and natural resources</td>
                <td>Corpus meanings not corresponding to the official name</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-4">
        <title>Names of streets</title>
        <p>Another experiment concerned the search for street names. The search query for
this experiment is (3).</p>
        <p>
          <disp-formula id="eq-3"><label>(3)</label>Q = ((wordform, uramy, “”, right, 1, 10, exact), 7, 0, 95, 80)</disp-formula>
        </p>
        <p>The result of this query was a list of 600 n-grams containing the word form "uramy" in
the last position. Manual evaluation of the data gave the following results: 432
(72%) n-grams are street names; 72 (12%) n-grams are also street names but require
filtering of special characters; 96 (16%) n-grams are not street names for various reasons
(for example, arbitrary sentences containing the word “uramy”, postal addresses, and
others).</p>
      </sec>
      <sec id="sec-3-5">
        <title>Names of languages</title>
        <p>In the next experiment, the authors tried to extract names of languages. The search
query for this experiment is presented in (4).</p>
        <p>
          <disp-formula id="eq-4"><label>(4)</label>Q = ((wordform, tel, “POSS_3SG,SG”, right, 1, 10, exact), 7, 0, 95, 80)</disp-formula>
        </p>
        <p>After executing this query, 2310 n-grams were obtained, containing the “tel” lemma with
the morphological properties POSS_3SG and SG in the last position. An expert evaluation of
part of the results (a list of 471 n-grams) showed that 252 n-grams (53.5%)
were correct language names. Analysis of the n-grams that
the algorithm incorrectly identified as language names made it possible to
determine additional filtering rules to improve the accuracy of the algorithm. On the
basis of the data obtained, the distribution of language names in the corpus of the Tatar
language was also plotted (Fig. 1).</p>
      </sec>
      <sec id="sec-3-6">
        <title>Names of restaurants</title>
        <p>Another experiment concerned the search for names of restaurants. The search query
for this experiment is presented in (5).</p>
        <p>
          <disp-formula id="eq-5"><label>(5)</label>Q = ((wordform, restoran, “POSS_3SG,SG”, right, 1, 10, exact), 7, 0, 95, 80)</disp-formula>
        </p>
        <p>The result of this query was a list of 285 n-grams containing the “restoran” lemma with
the morphological properties POSS_3SG and SG in the last position, which were found
359 times in the corpus in total. In this case, in addition to names of restaurants,
the results included names of subclasses of restaurants by geographical location or national cuisine.</p>
        <p>Thus, 107 (37.68%) of the found n-grams were correct names of restaurants, with a total
frequency of 140 (39%). 37 (13.03%) n-grams were names of subclasses of
restaurants, with a total frequency of 47 (13.09%). 52 (18.31%) n-grams contained
names of restaurants but require cleaning of unnecessary parts; each of these n-grams
occurs in the corpus 2 times or fewer, and their total frequency is 54 (15.04%).
45 (15.85%) n-grams contained names of subclasses of restaurants but likewise require
cleaning of unnecessary parts; each occurs in the corpus 2 times or
fewer, and their total frequency is 48 (13.37%). 43 (15.14%) n-grams were not names of
restaurants; their total frequency was 65 (18.11%). The list of incorrectly identified
n-grams can be reduced by applying additional filtering rules.</p>
      </sec>
      <sec id="sec-3-7">
        <title>Names of corporations</title>
        <p>The next experiment was the search for names of corporations. The search query
for this experiment is presented in (6).</p>
        <p>
          <disp-formula id="eq-6"><label>(6)</label>Q = ((wordform, korporaciya, “POSS_3SG,SG”, right, 1, 10, exact), 7, 0, 95, 80)</disp-formula>
        </p>
        <p>This search query yielded a list of 138 n-grams containing the
lemma “korporaciya” with the morphological properties POSS_3SG and SG in the last
position, which were found in the corpus 606 times. An expert check showed that
63 (45.65%) n-grams were correct names of corporations,
with a total frequency of 178 (29.37%). 27 (19.57%) n-grams contained names of
corporations but require additional cleaning; the total frequency of these n-grams was
29 (4.79%). 15 (10.87%) n-grams were
incomplete names of corporations, with a total frequency of 58 (9.57%). 30 (21.74%)
n-grams were names of subclasses of corporations by industry, geography, or
government participation; such n-grams were found in the corpus 336 times (55.45%). 3
(2.17%) n-grams were not names of corporations; their total frequency was 5
(0.83%).</p>
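        <p>The percentages reported for this experiment can be re-derived from the category counts. The short sketch below does so for the n-gram counts of the corporations experiment; the function name shares and the category labels are illustrative, and only the counts come from the text above.</p>
        <preformat><![CDATA[
```python
def shares(counts):
    """Return each category's share of the total, in percent (2 decimals)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# n-gram counts per category, as reported for the corporations query
corporation_ngrams = {
    "correct names": 63,
    "need cleaning": 27,
    "incomplete names": 15,
    "subclasses": 30,
    "not corporations": 3,
}
corporation_shares = shares(corporation_ngrams)
# corporation_shares["correct names"] -> 45.65
```
]]></preformat>
        <p>The same helper can be applied to the frequency counts (178, 29, 58, 336 and 5 out of 606) to check the second percentage in each pair.</p>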
      </sec>
      <sec id="sec-3-8">
        <title>Comparison of results</title>
        <p>For different classes of named entities, the algorithm shows different results. The
results presented in this article are shown in Table 3.</p>
      </sec>
      <sec id="sec-3-9">
        <title>Temporal and qualitative indicators of executing a query for extracting named entities</title>
        <p>The experiments showed that the execution time of a query for extracting named
entities depends on the number of items and bigrams found by the initial search query,
and on the covering index and the accuracy of comparison. All the experiments were
run on a machine with the following characteristics: 4-core Intel Core i7 2600
(2.6 GHz), 16 GB RAM (4×4 GB, 1333 MHz), 120 GB SSD, 3 TB HDD (3×1 TB, RAID 0).
The test machine was running Ubuntu Server 14.04 LTS. Table 4 shows the timing
indicators of search execution.</p>
        <p>Algorithm tests revealed that the quality of the results depends on the number of
results found in the first step of the algorithm. With a smaller number of results, the
actual data coverage increases, and the data the algorithm works with may initially
include particular cases. More results in the first step suggest that after the first
cutting of the bigram list, only those bigrams will remain that will be included in the
final list of extracted named entities. Thus it is only necessary to find the left or
the right border for this list.</p>
        <p>The algorithm for named entity recognition proposed by the authors in this article
shows different results depending on the type of named entities. The presented results
demonstrate correctness of recognition from 37.7% to 100%.</p>
        <p>In addition to the main task of named entity recognition, the algorithm can be applied
to recognizing names of subclasses of named entities. This
feature can be used to solve additional problems, such as text classification,
identification of the subject of texts, and other text mining tasks.</p>
        <p>Analysis of the results obtained during the experiments shows that improving the
accuracy and correctness of the algorithm requires fine tuning, building extended
dictionaries for named entity recognition, and additional post-processing of results.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nevzorova</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mukhamedshin</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gataullin</surname>
            <given-names>R</given-names>
          </string-name>
          .
          <article-title>Developing Corpus Management System: Architecture of System and Database</article-title>
          .
          <source>Proceedings of the 2017 International Conference on Information and Knowledge Engineering</source>
          . CSREA Press, United States of America, pp.
          <fpage>108</fpage>
          -
          <lpage>112</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Aibaidulla</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lua</surname>
            <given-names>K.T.</given-names>
          </string-name>
          <article-title>The development of tagged Uyghur corpus</article-title>
          .
          <source>Proceedings of PACLIC17</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>3</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Nevzorova</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mukhamedshin</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kurmanbakiev</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Semantic aspects of metadata representation in corpus manager system</article-title>
          .
          <source>Open Semantic Technologies for Intelligent Systems (OSTIS-2016)</source>
          , pp.
          <fpage>371</fpage>
          -
          <lpage>376</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Suleymanov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nevzorova</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gatiatullin</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gilmullin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hakimov</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>National corpus of the Tatar language “Tugan Tel”: grammatical annotation and implementation</article-title>
          .
          <source>Proc. Soc. Behav. Sci. 95</source>
          , pp.
          <fpage>68</fpage>
          -
          <lpage>74</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Baldwin</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carpenter</surname>
            <given-names>B.</given-names>
          </string-name>
          <source>LingPipe</source>
          , http://alias-i.com/lingpipe, last accessed
          <year>2018</year>
          /10/12.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Bontcheva</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimitrov</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maynard</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tablan</surname>
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cunningham</surname>
            <given-names>H</given-names>
          </string-name>
          .
          <article-title>Shallow methods for named entity coreference resolution</article-title>
          .
          <source>Chaînes de références et résolveurs d'anaphores, workshop TALN</source>
          . (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zaanen</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molla</surname>
            <given-names>D.</given-names>
          </string-name>
          <article-title>A named entity recogniser for question answering</article-title>
          .
          <source>Proceedings PACLING</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Segura Bedmar</surname>
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herrero Zazo</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>SemEval-2013 task 9: Extraction of drug-drug interactions from biomedical texts (DDIExtraction 2013)</article-title>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Collins</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            <given-names>Y.</given-names>
          </string-name>
          <article-title>Unsupervised models for named entity classification</article-title>
          .
          <source>1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora</source>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Etzioni</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cafarella</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Downey</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu</surname>
            <given-names>A.-M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaked</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yates</surname>
            <given-names>A</given-names>
          </string-name>
          .
          <article-title>Unsupervised named-entity extraction from the web: An experimental study</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>165</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>91</fpage>
          -
          <lpage>134</lpage>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Chinchor</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robinson</surname>
            <given-names>P</given-names>
          </string-name>
          .
          <article-title>MUC-7 named entity task definition</article-title>
          .
          <source>In Proceedings of the 7th Conference on Message Understanding</source>
          ,
          <volume>29</volume>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pradhan</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moschitti</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xue</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            <given-names>H.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Björkelund</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Uryupina</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhong</surname>
            <given-names>Z.</given-names>
          </string-name>
          <article-title>Towards robust linguistic analysis using OntoNotes</article-title>
          .
          <source>In Proceedings of the Seventeenth Conference on Computational Natural Language Learning</source>
          , pp.
          <fpage>143</fpage>
          -
          <lpage>152</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Zhou</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            <given-names>J</given-names>
          </string-name>
          .
          <article-title>Named entity recognition using an HMM-based chunk tagger</article-title>
          .
          <source>Proceedings of the 40th Annual Meeting on Association for Computational Linguistics</source>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , pp.
          <fpage>473</fpage>
          -
          <lpage>480</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Malouf</surname>
            <given-names>R</given-names>
          </string-name>
          .
          <article-title>Markov models for language-independent named entity recognition</article-title>
          .
          <source>Proceedings of the 6th conference on natural language learning</source>
          ,
          <volume>31</volume>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Carreras</surname>
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marquez</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Padro</surname>
            <given-names>L.</given-names>
          </string-name>
          <article-title>Named entity extraction using AdaBoost</article-title>
          .
          <source>Proceedings of the 6th conference on natural language learning</source>
          ,
          <volume>31</volume>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Li</surname>
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cunningham</surname>
            <given-names>H</given-names>
          </string-name>
          .
          <article-title>SVM based learning system for information extraction</article-title>
          .
          <source>Deterministic and statistical methods in machine learning</source>
          . Springer, pp.
          <fpage>319</fpage>
          -
          <lpage>339</lpage>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Ando</surname>
            <given-names>R.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>T.</given-names>
          </string-name>
          <article-title>A framework for learning predictive structures from multiple tasks and unlabeled data</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>6</volume>
          (Nov), pp.
          <fpage>1817</fpage>
          -
          <lpage>1853</lpage>
          . (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Agerri</surname>
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rigau</surname>
            <given-names>G.</given-names>
          </string-name>
          <article-title>Robust multilingual named entity recognition with shallow semisupervised features</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>238</volume>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>82</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>