<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Technology for Extracting Geographical Names from Text Documents Based on the PostgreSQL</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the XX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2018)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Oleg Zhizhimov Institute of Computational Technologies of SB RAS</institution>
          ,
          <addr-line>Novosibirsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <fpage>139</fpage>
      <lpage>142</lpage>
      <abstract>
        <p>Extracting geographical names from arbitrary text documents is important when processing large collections of documents and linking their content to a specific geographic region. In its simplest form, the model for extracting geographical names from text is a sequence of operations on the text, each stage solving its own task. These tasks include parsing the text, analyzing text elements, processing synonyms and abbreviations, reducing text elements from their possible word forms to normal form according to grammar rules, comparing text elements with the entries of dictionaries of geographical names, and adding special tags to the text for the unambiguous identification of geographical names. This work describes a technology that implements these tasks on the basis of the freely distributed PostgreSQL DBMS. The standard configuration is used, and all server-side settings are made through documented procedures. The GeoNames gazetteer, OpenStreetMap (OSM) databases, and the OKATO and KLADR classifiers serve as authoritative databases of geographical names.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The purpose of this work is to create a model for
extracting geographical names from arbitrary text and
indexing the text by geographic attributes, for example
geographical coordinates, so that geometric search can
subsequently be organized.</p>
      <p>
        Existing software systems for accessing textual information resources lack the functionality needed to store and process geographic data. Providing it is complicated by the absence of uniform standards for searching and presenting geography-related data that would be compatible with existing geographic information systems (GIS), that is, with systems for which the geographic aspect of information is the primary one [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Hence the relevance and promise of a technology that processes geographic information in "non-geographic" general-purpose information systems [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Model and Algorithms</title>
      <p>Briefly, the proposed model for fixing geographic content in a body of text data for subsequent indexing is as follows.</p>
      <p>• The first step in processing an arbitrary text is to expand (disclose) all abbreviations: each abbreviation in the text is replaced by its unabridged value. This step is essential for further analysis, because in texts geographical names are usually accompanied by an abbreviated designation of the type of geographic object. It requires not just a mechanical substitution of values according to the abbreviation dictionary, but also an analysis of the surrounding content. In particular, the abbreviation «г.» can be read not only as «год» (year) but also as «город» (city), depending on the surrounding words. The formalized rules by which abbreviations are expanded form a special dictionary of abbreviations.</p>
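      <p>The context-dependent reading of «г.» can be sketched as follows. This is a minimal illustration in Python, not the dictionary format used in the stand: the rule list, the function name expand_abbreviations and both regular expressions are assumptions made for the example, and real rules would need to cover many more contexts.</p>
      <preformat>
```python
import re

# Hypothetical context rules for the abbreviation "г.": (pattern, replacement).
# The rules are applied in order, so the more specific context comes first.
RULES = [
    # "2018 г." -> "2018 год": a four-digit number directly before the abbreviation.
    (re.compile(r"(\b\d{4})\s*г\.(?!\w)"), r"\1 год"),
    # "г. Бердск" -> "город Бердск": a capitalised word directly after it.
    (re.compile(r"\bг\.\s*(?=[А-ЯЁ])"), "город "),
]


def expand_abbreviations(text):
    """Apply every context rule in turn and return the expanded text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(expand_abbreviations("В 2018 г. экспедиция работала в г. Бердск."))
```
      </preformat>
      <p>The expanded words are left in their dictionary form here, since the later normalization step reduces all word forms to lexemes anyway.</p>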
      <p>• The text obtained from the previous step is split into separate words (tokenization), and the sequence number of each word in the source text is recorded. Stop words defined in a special dictionary are removed, and the remaining words are reduced to normal form using the morphological dictionary, which maps the many linguistic forms of a word to a single lexeme.</p>
      <p>• The next step, desirable but not mandatory, is the expansion of enumerations. Texts often contain enumerations of geographical names with the object type given once for the whole group. For example, to fix the geographical objects unambiguously, the text "... studies were conducted in the Novosibirsk, Kemerovo and Omsk regions" must be transformed into "... studies were conducted in the Novosibirsk region, the Kemerovo region and the Omsk region".</p>
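      <p>The enumeration example above can be reproduced with a small sketch. It works on the English rendering of the sentence only; the OBJECT_TYPES table, the regular expression and the function name expand_enumerations are assumptions introduced for illustration and do not describe the templates actually used in the stand.</p>
      <preformat>
```python
import re

# Hypothetical plural -> singular table of object types (an assumption).
OBJECT_TYPES = {"regions": "region", "districts": "district"}

# "the A, B and C regions": a run of capitalised names joined by commas and
# a final "and", followed by a plural object type.
ENUMERATION = re.compile(
    r"the ((?:[A-Z][a-z]+, )*[A-Z][a-z]+ and [A-Z][a-z]+) ("
    + "|".join(OBJECT_TYPES)
    + r")\b"
)


def expand_enumerations(text):
    """Attach the object type to every name of a grouped enumeration."""
    def rebuild(match):
        names = re.split(r", | and ", match.group(1))
        kind = OBJECT_TYPES[match.group(2)]
        parts = ["the " + name + " " + kind for name in names]
        return ", ".join(parts[:-1]) + " and " + parts[-1]

    return ENUMERATION.sub(rebuild, text)


print(expand_enumerations(
    "studies were conducted in the Novosibirsk, Kemerovo and Omsk regions"
))
```
      </preformat>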
      <p>• After the above procedures are complete, geographic objects can be fixed: special labels are either assigned to the appropriate word combinations or substituted for them. The first option is needed when the text is subsequently indexed for both geometric and full-text search; the second is for indexing geographic objects for geometric search only. A special label can be the unique identifier of a geographic object in a database of geographical names. Formally, the whole procedure reduces to replacing normalized lexemes either with special labels carrying an object identifier or with labels accompanied by the lexemes. The correspondence between lexemes and labels is held in a special geographic dictionary.</p>
      <p>
        • Finally, the last step is to resolve the polysemy of geographic names. For example, more than 40 geographical objects (based on [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) match the well-defined form "Советский район" (Sovetsky district). Among all the candidates, the one that best matches the surrounding context must be chosen. The conflict can be resolved in several ways:
      </p>
      <p>
        • On the basis of hierarchical relationships: the object is chosen among the competing ones by analyzing the hierarchical links of the fully identified objects adjacent to it in the text. Hierarchical relations (administrative subordination, geographic location, etc.) are generally present in databases of geographic names. Moreover, in some databases the object identifier itself encodes this hierarchy, as in the OKATO directory [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In particular, for the city of Karasuk, the OKATO code 50217501 contains information about the Karasuk district (OKATO 50217000) and the Novosibirsk region (OKATO 50000000).
      </p>
      <p>• On the basis of geometric parameters: the object is chosen among the competing ones by minimizing the distance to the fully identified objects next to it in the text. The distance is calculated from the coordinates of the objects stored in the geographic names database. Different versions of the decision criterion are possible here.</p>
      <p>The algorithm for fixing geographic objects in arbitrary
text is shown in Figure 1.</p>
      <p>[Figure 1: flowchart not reproduced. Labels recovered from the figure: free text; disclosure of abbreviations; tokenization; processing of enumerations; identification of geographical objects; multi-value resolution; ready-to-index text; template dictionary; dictionary of abbreviations; morphological dictionary; dictionary of stop words; geographical dictionary; database of geographical names.]</p>
    </sec>
    <sec id="sec-3">
      <title>3 Reference books and dictionaries</title>
      <p>The information resources listed in this section provide the source data
from which our own database of geographical
objects, described below, is built.</p>
      <p>
        • OKATO: the All-Russian classifier of objects of administrative-territorial division [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        • KLADR: the address classifier of the Russian Federation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        • GeoNames: a database containing over 10 million geographical names and information about more than 7.5 million unique features [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The recorded attributes include the names of places in various languages, latitude, longitude and altitude above sea level. All features are categorized, so that each geographic feature belongs to one of nine classes; each class is in turn divided into subcategories, of which there are more than 600 in total. In addition to names in different languages, the database stores geographical coordinates, height above sea level, population, administrative subdivision and postal codes. Unfortunately, the database contains duplicates, errors in names and other inaccuracies.
      </p>
      <p>
        • OpenStreetMap (OSM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: an open database of geographic features that includes their geometric and geographic characteristics.
      </p>
      <p>
        • The Getty Thesaurus of Geographic Names (TGN) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]: contains geographic names with point coordinates, including historical ones. A drawback is that Russian names are given in transcription.
      </p>
      <p>
        • The State catalog of geographical names of Rosreestr [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: contains a complete register of official geographical names by region, with point coordinates.
      </p>
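      <p>The hierarchy encoded in OKATO codes and the coordinates stored in the gazetteers are what the polysemy resolution step of Section 2 relies on. The following Python sketch illustrates both criteria under stated assumptions: the 2+3+3 digit layout is inferred from the Karasuk example only, the code "01217000" and all coordinates are made up for the illustration, and the summed great-circle distance is just one of the possible decision criteria. None of the function names come from the stand itself.</p>
      <preformat>
```python
from math import asin, cos, radians, sin, sqrt


def okato_ancestors(code):
    """Codes of the units enclosing an 8-digit OKATO code.

    Assumes the 2+3+3 digit layout of the Karasuk example
    (region, district, settlement).
    """
    district = code[:5] + "000"
    region = code[:2] + "000000"
    return [unit for unit in (district, region) if unit != code]


def pick_by_hierarchy(candidates, context):
    """Pick the candidate that shares the most units with the context codes."""
    known = set()
    for code in context:
        known.update([code] + okato_ancestors(code))
    return max(candidates,
               key=lambda c: len(known.intersection([c] + okato_ancestors(c))))


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def pick_by_distance(candidates, context_points):
    """Pick the candidate id with the smallest summed distance to the context."""
    return min(candidates,
               key=lambda cid: sum(haversine_km(candidates[cid], p)
                                   for p in context_points))


# Karasuk (50217501) is already identified, so the district 50217000 wins
# over "01217000", a made-up code standing in for a namesake elsewhere.
print(pick_by_hierarchy(["01217000", "50217000"], ["50217501"]))

# Approximate, illustrative coordinates (assumptions, not gazetteer data).
namesakes = {"near": (54.85, 83.10), "far": (55.80, 49.20)}
print(pick_by_distance(namesakes, [(54.76, 83.09)]))
```
      </preformat>
      <p>In practice the two criteria could be combined, for instance by falling back to distance only when the hierarchy leaves several candidates tied; that choice is outside the scope of this sketch.</p>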
    </sec>
    <sec id="sec-4">
      <title>4 The prototype of the stand</title>
      <p>To refine the technology for extracting
geographical names from texts, test the
algorithms and collect information on errors, a
software stand (test bench) implementing the algorithms described above
was created.</p>
      <p>
        The PostgreSQL DBMS was chosen as the system basis for implementing the algorithms. It supports a full cycle of text processing, and its basic functionality can be extended both through additional dictionaries and configurations and through additional modules written in various programming languages [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The prototype of the stand includes the following components.</p>
      <p>
        • A set of web server applications (PHP scripts) that run on the web server side. These applications communicate with the PostgreSQL database server and with client applications. A separate server application is a module for ZooSPACE [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] that makes it possible to analyze text data extracted from various bibliographic databases.
      </p>
      <p>• A set of web client applications (JavaScript) that
run on the web client side. These applications
implement the graphical user interface (GUI) used
to control the operation of the stand and to visualize
the geographic features found on maps.</p>
      <p>To support the operation of the stand, the following dictionaries are used.</p>
      <p>• A dictionary of abbreviations with templates based on
regular expressions. It is used to expand
abbreviations in the input text (step 1).</p>
      <p>• The stop word dictionary of the Russian language
(russian.stop). This dictionary ships with
PostgreSQL and was not changed in
our configuration (step 2).</p>
      <p>• A morphological dictionary of the Russian language
(ispell), extended with geographical names and
spelling rules for these names (ru_geo1.dict). A
fragment of the file ru_geo1.dict:</p>
      <preformat>. . .
абажур/K
. . .
Кольцово/M
Мошковский/A
Новосибирск-Южный/AEZ
. . .</preformat>
      <p>• The geographical dictionary, which replaces a token with
the combination "label + token". This dictionary
(geor1.ths) follows the thesaurus template. In
PostgreSQL terms, a thesaurus is a dictionary of
substitutions: the part to the left of the ":" symbol is
replaced by the part to the right, and a "*" in the
first position of the right part means that the right part
is not passed through the morphological
dictionary. The dictionary consists of entries such as:</p>
      <preformat>. . .
Бердск: */gn/1510350
город Бердск: */gn/1510350
Советский район: */gn/490026, /gn/1491227
. . .</preformat>
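      <p>The substitution performed with this dictionary can be sketched outside PostgreSQL as follows. The sketch is an assumption-laden illustration, not the thesaurus implementation: the three entries are copied from the fragment above with the "*" marker dropped, tokens are assumed to be lower-cased normalized lexemes, and the function name label_tokens and the keep_tokens switch are hypothetical. The longest matching phrase is preferred, so "город Бердск" is consumed as one unit.</p>
      <preformat>
```python
# Entries copied from the geor1.ths fragment: phrase -> list of labels.
GEO_DICT = {
    ("бердск",): ["/gn/1510350"],
    ("город", "бердск"): ["/gn/1510350"],
    ("советский", "район"): ["/gn/490026", "/gn/1491227"],
}
MAX_PHRASE = max(len(key) for key in GEO_DICT)


def label_tokens(tokens, keep_tokens=True):
    """Replace dictionary phrases by labels, preferring the longest match.

    keep_tokens=True emits "label + token" (full-text and geometric search);
    keep_tokens=False emits the labels only (geometric search only).
    """
    out, i = [], 0
    while len(tokens) > i:
        for size in range(min(MAX_PHRASE, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + size])
            if phrase in GEO_DICT:
                out.extend(GEO_DICT[phrase])
                if keep_tokens:
                    out.extend(phrase)
                i += size
                break
        else:
            # No dictionary phrase starts here: keep or drop the plain token.
            if keep_tokens:
                out.append(tokens[i])
            i += 1
    return out


print(label_tokens(["находиться", "город", "бердск"]))
print(label_tokens(["советский", "район"], keep_tokens=False))
```
      </preformat>
      <p>An ambiguous name such as "Советский район" yields all candidate labels at this stage; choosing among them is left to the polysemy resolution step.</p>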
      <p>• A full-text search (FTS) configuration (in PostgreSQL terms), named rugeo1, that
defines the list of dictionaries and the order in which
the text is processed:</p>
      <sec id="sec-4-1">
        <preformat>-- Ispell-based morphological dictionary extended with geographical names
CREATE TEXT SEARCH DICTIONARY rugeo_ispell (
    TEMPLATE = ispell,
    dictfile = 'ru_geo1',
    afffile = 'ru',
    stopwords = 'russian'
);

-- Thesaurus mapping geographical names to labels; phrases are
-- normalized with rugeo_ispell before matching
CREATE TEXT SEARCH DICTIONARY tz_geo_1 (
    TEMPLATE = thesaurus,
    dictfile = 'geor1',
    dictionary = 'rugeo_ispell'
);

-- Configuration that applies the dictionaries in the listed order
CREATE TEXT SEARCH CONFIGURATION rugeo1 (PARSER = "default");
ALTER TEXT SEARCH CONFIGURATION rugeo1
    ADD MAPPING FOR hword WITH tz_geo_1, rugeo_ispell, russian_stem;
ALTER TEXT SEARCH CONFIGURATION rugeo1
    ADD MAPPING FOR hword_part WITH tz_geo_1, rugeo_ispell, russian_stem;</preformat>
      </sec>
      <sec id="sec-4-2">
        <p>The work of the algorithm for fixing geographical
names can be illustrated by processing the text fragment "В окрестностях города
Новосибирска находятся: город Бердск, город Обь,
поселок Краснообск и Наукоград Кольцово" ("In the vicinity of the city of Novosibirsk are the city of Berdsk, the city of Ob, the settlement of Krasnoobsk and the science city of Koltsovo"). The query</p>
        <preformat>SELECT plainto_tsquery('rugeo1', 'В окрестностях города Новосибирска находятся: город Бердск, город Обь, поселок Краснообск и Наукоград Кольцово');</preformat>
        <p>returns the marked-up text. Another query,</p>
        <preformat>SELECT to_tsvector('rugeo1', 'В окрестностях города Новосибирска находятся: город Бердск, город Обь, поселок Краснообск и Наукоград Кольцово');</preformat>
        <p>returns a list of lexemes together with their positions in the text.</p>
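        <p>On the client side, the lexeme-and-position list returned by to_tsvector can be turned into geographic index entries. The sketch below parses the textual form of a tsvector value ('lexeme':positions) and keeps the lexemes that carry a geographic label. The sample value is made up for illustration and is not the output of the stand; the "/gn/" prefix test and the function names are likewise assumptions.</p>
        <preformat>
```python
import re

# One entry of a tsvector literal: 'lexeme':pos[,pos...]; a weight letter
# (A-D) may follow a position, and a quote inside a lexeme is doubled.
ENTRY = re.compile(r"'((?:[^']|'')*)':(\d+[A-D]?(?:,\d+[A-D]?)*)")


def parse_tsvector(text):
    """Map every lexeme of a tsvector literal to its list of positions."""
    return {
        lexeme.replace("''", "'"): [int(p.rstrip("ABCD")) for p in positions.split(",")]
        for lexeme, positions in ENTRY.findall(text)
    }


def geo_labels(text, prefix="/gn/"):
    """Keep only the lexemes that look like geographic labels."""
    return {lex: pos for lex, pos in parse_tsvector(text).items()
            if lex.startswith(prefix)}


# Made-up tsvector value for illustration; NOT the output of the stand.
sample = "'/gn/1510350':5 'бердск':6 'город':3,5"
print(geo_labels(sample))
```
        </preformat>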
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion</title>
      <p>As a result of this work, a prototype stand
was created for testing models and algorithms that
extract geographical names from unstructured text in order to
build indexes for both text and geometric search.
Preliminary testing showed that the proposed technology
gives reliable results,
provided that the dictionaries and reference books contain information about
all the geographical features to be identified. The effectiveness of
the technology therefore depends on the completeness of the
reference books.</p>
      <p>Currently, the dictionaries created contain information
on geographical objects of the Novosibirsk region. In the
future, the range of supported regions is to be
expanded.</p>
      <p>This work was carried out within the Integration Project of
SB RAS (AAAA-A18-118022190008-8), the Project for
basic scientific research (AAAA-A17-117120670141-7)
and RFBR project No. 18-07-01457-a.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Zhizhimov</surname>
            <given-names>O.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mazov</surname>
            <given-names>N.A.</given-names>
          </string-name>
          <article-title>Problems of geographical reference of digital objects in digital libraries</article-title>
          .
          <source>Proc. XII All-Russian Sci. Conf. «Electronic libraries: Perspective Methods and Technologies, Electronic collections» (RCDL'2010)</source>
          . Kazan, p.
          <fpage>207</fpage>
          -
          <lpage>214</lpage>
          . (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Barakhnin</surname>
            <given-names>V.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhizhimov</surname>
            <given-names>O.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kupershtokh</surname>
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skachkov</surname>
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fedotov</surname>
            <given-names>A.M.</given-names>
          </string-name>
          <article-title>The Algorithm of Extracting Place Names Representing Content from Text Documents</article-title>
          .
          <source>Vestnik NSU. Ser.: The Information technology</source>
          , Vol.
          <volume>10</volume>
          , Iss.
          <issue>1</issue>
          , p.
          <fpage>109</fpage>
          -
          <lpage>120</lpage>
          . (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <article-title>All-Russian classifier of administrative-territorial division objects (OKATO)</article-title>
          , http://protect.gost.ru/document.aspx?control=20&amp;id=134377.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <article-title>Classifier of addresses of the Russian Federation (KLADR)</article-title>
          , http://kladr-rf.ru.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <article-title>The GeoNames geographical database</article-title>
          , http://www.geonames.org/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>OpenStreetMap</article-title>
          , http://wiki.openstreetmap.org.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <article-title>Getty Thesaurus of Geographic Names (TGN)</article-title>
          , http://www.getty.edu/research/tools/vocabularies/tgn/index.html.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <article-title>State catalogue of geographical names</article-title>
          ,
          <source>Rosreestr</source>
          , https://rosreestr.ru/site/activity/geodeziya-i-kartografiya/naimenovaniya-geograficheskikh-obektov/gosudarstvennyy-katalog-geograficheskikh-nazvaniy/.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Bartunov</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sigaev</surname>
            <given-names>F.</given-names>
          </string-name>
          <article-title>Introduction to full-text search in PostgreSQL</article-title>
          , http://citforum.ru/database/postgres/fts/bib.shtml.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Zhizhimov</surname>
            <given-names>O.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fedotov</surname>
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shokhin</surname>
            <given-names>Y.I.</given-names>
          </string-name>
          <article-title>The ZooSPACE platform: access organization to various distributed resources</article-title>
          .
          <source>Digital libraries: The Russian scientific e-magazine</source>
          , Vol.
          <volume>17</volume>
          , Iss.
          <issue>2</issue>
          , ISSN 1562-5419 (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>