<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Named Numeric Characteristics Extraction from Text Data in Russian</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Software Engineering and Computer Systems, ITMO University</institution>
          ,
          <addr-line>Saint Petersburg</addr-line>
          ,
          <country>Russia</country>
          <uri xlink:href="https://en.itmo.ru/en/">https://en.itmo.ru/en/</uri>
        </aff>
      </contrib-group>
      <abstract>
        <p>The article is focused on the problem of extracting named numeric characteristics from text data in Russian. This problem has a significant impact on a wide variety of natural language processing tasks such as marketing analysis, statistical tools, and data aggregation solutions. The article considers a number of existing approaches, such as scrapers and parsers, various natural language processing tools (Stanford CoreNLP, spaCy, Natural Language Toolkit, Apache OpenNLP), and the Tomita parser from Yandex. The structure of numerical data in Russian has its own specifics that affect the algorithms for converting numbers from textual form into their values, which is also covered in the article. As a result of the research, a new method for extracting numerical data from natural language texts was proposed, and a software module was developed as a proof of the proposed method. The proposed solution uses semantic networks and semantic frames to determine the boundaries of the numerical data in the text and to extract it. The developed software module was tested on a variety of datasets extracted from different sources such as Avito and Yandex.Market. The testing results show the effectiveness of the proposed method in comparison with existing solutions.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic networks processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The data processed by computers is heterogeneous, so there are many different
approaches to processing it. A large amount of data is structured using various
databases, files in certain formats, etc. The structure of this data is designed in advance
for processing by certain programs, which simplifies the creation of software
products.</p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Nevertheless, huge amounts of data are loosely structured, so there is a special
class of tasks focused on preprocessing data for structurization to solve this
problem. The canonical example of such data is website content. The structure of
web pages is optimized for displaying data and is not intended for data extraction
and processing. A partial solution of this problem is based on the principle that
every web page has some structure, so it is possible to use patterns to extract the
necessary data from it. CSS classes, semantic markup, and XPath expressions can serve as
pointers to data elements. Software products for processing data
from unstructured sources are complicated, because they need special parser
programs to extract the necessary data from the sources. These parsers are not
complicated themselves, but a programmer must write a parser for every data source, and
these parsers require permanent support because website structure tends to
change constantly.</p>
      <p>The data examples described above have some structure that simplifies their
processing. But large amounts of data are stored in an unstructured form: texts,
internal documents of companies (reports, plans, etc.), news articles, and so
on. Processing these data requires special approaches, and the task of creating
such approaches is still relevant.</p>
      <p>One of the most important automatic text processing tasks is named numeric
characteristics extraction. Solving this task can be useful in many areas:
marketing (analysis of competitors' offers), statistics (extraction of numerical data from
various natural language sources), data aggregators (for example, a site that
collects information about the characteristics and prices of goods from different
online stores), and many others.</p>
      <p>A numerical characteristic is a characteristic of an object that can be expressed
as a number together with its measurement units. In a natural text, such a
characteristic is described by the following components:
– The object the characteristic relates to.
– The name of the characteristic (may be implicit).
– The numerical value of the characteristic.
– The measurement units of the characteristic.</p>
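      <p>For illustration only, such a characteristic could be represented as a simple record (a hypothetical sketch; the class and field names below are not taken from the paper's implementation):</p>

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record for a named numeric characteristic.
# Field names are illustrative, not the paper's actual data model.
@dataclass
class NumericCharacteristic:
    obj: str                 # the object the characteristic relates to
    name: Optional[str]      # name of the characteristic (may be implicit)
    value: float             # numerical value of the characteristic
    unit: str                # measurement units of the characteristic

# e.g. "apartment area: 45 m2"
c = NumericCharacteristic(obj="apartment", name="area", value=45.0, unit="m2")
```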
      <p>A typical structure for representing numerical characteristics is shown in
Figures 1 and 2.
The goal of this work is to develop a solution for extracting numerical data from
natural language text. Natural language processing is a long-studied problem, and
there are a lot of methods and algorithms developed in this area. Therefore, the
first task is to review the existing approaches applicable to achieve the goal. If there
is a suitable solution, it must be adapted; otherwise, a suitable approach must be
developed.</p>
    </sec>
    <sec id="sec-2">
      <title>Existing solutions review</title>
      <p>The following approaches and tools were investigated:
1. Scrapers and parsers.
2. Various natural language processing tools (Stanford CoreNLP [<xref ref-type="bibr" rid="ref1">1</xref>], spaCy, Natural Language Toolkit [<xref ref-type="bibr" rid="ref2">2</xref>], Apache OpenNLP).
3. Tomita parser.</p>
      <sec id="sec-2-2">
        <title>Limitations of the existing solutions</title>
        <p>Each of these solutions has a number of disadvantages that make it inapplicable
to the problem at hand:
1. Scrapers and parsers are based on structure or metadata (for example, an
HTML page layout), so they are unsuitable for natural language analysis.
2. Natural language processing tools provide a search mechanism for named
entities, but no numerical-data-like entity is implemented in these tools.
3. Tomita parser does not contain any mechanism for combining hierarchical
concepts. Because of this, it is impossible to use it to formulate a request such as “all
areas of residential premises” that would automatically include “all areas of
apartments”, “all areas of houses”, “all areas of estates”, etc.</p>
        <p>
          Since no suitable solution was found, it became necessary to develop a new
one. Semantic networks and semantic frames were taken as its basis [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>General provisions of the algorithm</title>
      <p>
        To analyze the text, semantic frames based on Fillmore's frame semantics
were used [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This approach is based on the concept that a specific fact is
described in the text by a set of lexical and semantic units. Frames also allow
extraction of the data (text) enclosed between these units [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. An example of such a
frame is shown in Figure 3. As the figure shows, the presence of certain semantic
and lexical units is optional.
      </p>
      <p>
        The semantic frame works with tokenized text and consists of the following
units:
1. Lexical unit (lex_unit) describes a single string value.
2. Semantic unit (semantic_unit) describes a value represented by a
node of the semantic network. A token is compared for coincidence
with all word forms of this node. Two subtypes of this unit are
distinguished depending on the target set of word forms:
– Single: the target set of word forms is taken from the specified network
node.
– Hyponymic: the target set is taken from the specified node and all its
hyponyms [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
3. Information unit (payload_unit). The purpose of this unit is not to check
for coincidence but to remember all tokens. It is used to extract text data
enclosed between a pair of matching units (lexical or semantic). This makes
it possible to extract data from the text depending on its semantic
structure.
      </p>
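      <p>The matching behaviour of the three unit types described above can be sketched as follows (a minimal Python sketch under assumed semantics; all names such as match_frame are hypothetical and are not the actual module's API):</p>

```python
# Minimal sketch of the three frame unit types: lexical, semantic (matched
# against the word forms of a network node), and information (payload) units.
# All names here are hypothetical illustrations, not the module's real API.

def _matches(unit, token, word_forms):
    kind, value = unit
    if kind == "lex":
        # lexical unit: a single string value
        return token.lower() == value.lower()
    if kind == "sem":
        # semantic unit: compare against all word forms of the network node
        # (for a hyponymic unit, word_forms would already include hyponyms)
        return token.lower() in word_forms.get(value, set())
    return False

def match_frame(units, tokens, word_forms):
    """Match a sequence of frame units against a token list.

    units: list of ("lex", string) | ("sem", node) | ("payload", name)
    Returns a dict of captured payload texts, or None if the frame fails.
    """
    captured, i = {}, 0
    for u, unit in enumerate(units):
        kind, value = unit
        if kind == "payload":
            # remember tokens until the next unit starts matching
            start = i
            nxt = units[u + 1] if u + 1 < len(units) else None
            while i < len(tokens) and (nxt is None
                                       or not _matches(nxt, tokens[i], word_forms)):
                i += 1
            captured[value] = " ".join(tokens[start:i])
        else:
            # skip tokens until this unit matches, then consume the match
            while i < len(tokens) and not _matches(unit, tokens[i], word_forms):
                i += 1
            if i == len(tokens):
                return None
            i += 1
    return captured
```

<p>For example, a frame [("sem", "area"), ("payload", "value"), ("lex", "м2")] applied to the tokens of "общая площадь 45 м2" would capture "45" as the payload enclosed between the semantic and lexical units.</p>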
      <p>Thus, semantic frames allow us to localize the position of numerical data in
a natural language text, so the task is reduced to:
1. Filling the semantic network with data on various units of measurement.
2. Compiling semantic frames for the options of representing numeric data in
the given text.
3. Selecting the search algorithm and normalizing the numbers in the localized
area (converting a number from its textual form into its value).</p>
    </sec>
    <sec id="sec-4">
      <title>Implementation</title>
      <p>The examples in Figures 1 and 2 could be described by two semantic frames that
contain the object the characteristic relates to and the measurement unit of
this characteristic.</p>
      <p>Hyponymic semantic units should be used to describe the object and the
measurement unit, and information units should be used to extract the relevant text.
As a result of the frame matching, a piece of text with numerical data is
extracted by the information unit. This text can contain some redundant content that
should be filtered out.</p>
      <p>The tool or algorithm for recognizing and converting numbers from textual
form must work with the Russian language, since the final solution must work
with natural text in Russian. No existing solution distributed
under a free license meets this requirement.</p>
      <p>The algorithm for recognizing and converting Russian numerals is based on
the fact that compound numerals (expressing numbers) in the Russian language
have a strictly defined structure, shown in Figure 4. The final algorithm is
based on this structure.</p>
      <p>The suggested approach was tested on datasets from Yandex.Market, one of the
largest online retail platforms in Russia, and the classified advertisements portal Avito.
Both of these resources contain good sets of test samples of objects with
numeric characteristics. The difference between the two resources is that
Yandex.Market contains poorly structured data, while the data on Avito is completely
unstructured.</p>
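      <p>The group-by-multiplier structure of Russian compound numerals suggests a conversion routine along the following lines (a simplified sketch with an abbreviated word table; this is an illustration of the idea, not the paper's actual algorithm):</p>

```python
# Sketch of converting a Russian compound numeral (as a list of words) into
# its value. The word tables are deliberately abbreviated; a real table would
# cover all numeral words and their inflected forms.

UNITS = {"один": 1, "одна": 1, "два": 2, "две": 2, "три": 3, "четыре": 4,
         "пять": 5, "шесть": 6, "семь": 7, "восемь": 8, "девять": 9,
         "десять": 10, "двадцать": 20, "тридцать": 30, "сорок": 40,
         "пятьдесят": 50, "сто": 100, "двести": 200, "триста": 300,
         "пятьсот": 500}

# A multiplier closes a group: the accumulated small value is scaled and added.
MULTIPLIERS = {"тысяча": 1000, "тысячи": 1000, "тысяч": 1000,
               "миллион": 10**6, "миллиона": 10**6, "миллионов": 10**6}

def numeral_to_value(words):
    """Convert e.g. ["две", "тысячи", "триста", "сорок", "пять"] to 2345."""
    total, group = 0, 0
    for w in (w.lower() for w in words):
        if w in UNITS:
            group += UNITS[w]
        elif w in MULTIPLIERS:
            # "две тысячи" -> 2 * 1000; a bare multiplier means 1 * it
            total += (group or 1) * MULTIPLIERS[w]
            group = 0
        else:
            raise ValueError(f"unknown numeral word: {w}")
    return total + group
```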
      <p>Test datasets were imported from the resources in two different ways.
Yandex.Market data was imported using a scraper module extracting the
product characteristics block from the page. Avito data was imported using a
parser extracting the advertisement source. Several subject-area-related
datasets were selected for testing purposes:
1. Computer equipment
– Processors
– RAM
– Power supplies
2. Apartments
3. Video cameras
4. Coffeemakers</p>
      <p>Characteristics of computer parts were taken for the "computer equipment"
subject area. These characteristics cover a processor, random-access memory, and
a video card.</p>
      <p>The following numeric object characteristics were chosen for recognition in
text data:</p>
      <p>
        Separate semantic networks were developed for each subject area
domain and its characteristics. Each semantic network was built by a linguist
expert based on different sources. A global semantic network based
on Wiktionary [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and RuThes [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] translingual data was used as the main semantic
data source [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Usage of this network solved two important
problems:
– It eliminates the need to set all word forms manually for each node of the
semantic network.
– It allows using nodes and relations already existing in the global network
and modifying them for the purposes of the specific task.
Separate subsets of semantic frames were built for each subject area domain.
The resulting semantic network and frames were used as input data for the
software module that implements the developed algorithm.
      </p>
      <p>The main characteristics of the used test samples are the number of test
samples for each domain and the number of values for each characteristic. These
characteristics are shown in Table 2.</p>
      <p>Object: number of test samples
Processor: 200
RAM: 200
Power supply: 80
Apartment: 150
Video camera: 80
Coffeemaker: 40</p>
      <p>Precision, recall, and F1-score metrics were taken as the testing
results. The average values of these metrics are shown in Table 3.</p>
      <p>The testing results confirmed the high efficiency of numeric characteristics
extraction using the proposed approach. The average F1-score was
78% for the used datasets.</p>
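      <p>For reference, the reported metrics are related by the standard formulas below; the counts in the usage example are illustrative only and are not the paper's raw data:</p>

```python
# Standard precision / recall / F1 over extracted characteristic values.
# tp = correctly extracted values, fp = spurious extractions,
# fn = values present in the text but missed.

def f1_score(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 80 correct, 20 spurious, 25 missed -> F1 ≈ 0.78
example_f1 = f1_score(80, 20, 25)
```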
      <p>Nonetheless, some improvements to the algorithm and the software
module are needed for wide use in practical tasks. At the moment, the
software module has limited ability to recognize floating-point
numbers. There are also some problems with relatively rarely used, atypical number
formats (e.g., when the word "thousand" is implied but not written in the text). This is an area
for further development of the system. It should also be noted that the
effectiveness of number extraction highly depends on the quality of
the semantic network and frame set used.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Manning</surname>
            <given-names>C.</given-names>
          </string-name>
          et al.
          <article-title>The Stanford CoreNLP natural language processing toolkit //Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations</article-title>
          .
          <source>- 2014</source>
          . - Pg.
          <fpage>55</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bird</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Loper</surname>
            <given-names>E.</given-names>
          </string-name>
          <article-title>NLTK: the natural language toolkit //Proceedings of the ACL 2004 on Interactive poster and demonstration sessions. - Association for Computational Linguistics</article-title>
          . -
          <year>2004</year>
          . - Pg.
          <fpage>31</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bessmertny</surname>
            <given-names>I.</given-names>
          </string-name>
          <article-title>Knowledge visualization based on semantic networks //Programming and Computer Software</article-title>
          . -
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Francopoulo</surname>
            <given-names>G</given-names>
          </string-name>
          . (ed.).
          <source>LMF Lexical Markup Framework</source>
          . -
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Eckle-Kohler</surname>
            <given-names>J.</given-names>
          </string-name>
          et al.
          <article-title>lemonUby - A large, interlinked, syntactically-rich lexical resource for ontologies //Semantic Web</article-title>
          . -
          <year>2015</year>
          . - Vol.
          <volume>6</volume>
          . - No. 4. - Pg.
          <fpage>371</fpage>
          -
          <lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fillmore</surname>
            <given-names>C. J.</given-names>
          </string-name>
          <article-title>Frame semantics and the nature of language //</article-title>
          <source>Annals of the New York Academy of Sciences. - 1976</source>
          . - No. 1. - Pg.
          <fpage>20</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Fillmore</surname>
            <given-names>C. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            <given-names>C. F.</given-names>
          </string-name>
          <article-title>Frame semantics for text understanding //</article-title>
          <source>Proceedings of WordNet and Other Lexical Resources Workshop</source>
          . - 2001.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Barsalou</surname>
            <given-names>L. W.</given-names>
          </string-name>
          <article-title>Frames, concepts, and conceptual fields //In Frames, fields, and contrasts, ed.</article-title>
          <source>Adrienne Lehrer and Eva Feder Kittay. - 1992</source>
          . - Pg.
          <fpage>21</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Stern</surname>
            <given-names>D.</given-names>
          </string-name>
          <article-title>Making Search More Meaningful: Action Values, Linked Data, and Semantic Relationships</article-title>
          . -
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Krizhanovsky</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smirnov</surname>
            <given-names>A.</given-names>
          </string-name>
          .
          <article-title>An approach to automated construction of a general-purpose lexical ontology based on Wiktionary //</article-title>
          <source>Journal of Computer and Systems Sciences International. - 2013</source>
          . - Pg.
          <fpage>215</fpage>
          -
          <lpage>225</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Loukachevitch</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dobrov</surname>
            <given-names>B.</given-names>
          </string-name>
          <article-title>RuThes linguistic ontology vs. Russian wordnets //Proceedings of Global WordNet Conference GWC-2014</article-title>
          . -
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Klimenkov</surname>
            <given-names>S.</given-names>
          </string-name>
          et al.
          <source>Reconstruction of Implied Semantic Relations in Russian Wiktionary //Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (KDIR)</source>
          . -
          <year>2016</year>
          . - Pg.
          <fpage>74</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Osika</surname>
            <given-names>V.</given-names>
          </string-name>
          et al.
          <source>Method of Reconstruction of Semantic Relations using Translingual Information //Proceedings of the 9th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management</source>
          . -
          <year>2017</year>
          . - Vol.
          <volume>2</volume>
          . - Pg.
          <fpage>239</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>