<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Identification of Expressions with Units of Measurement in Scientific, Technical &amp; Legal Texts in Belarusian and Russian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alena Skopinava</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yury Hetsevich</string-name>
          <email>yury.hetsevich@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Copyright © by the paper's authors. Copying permitted only for private and academic purposes.</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In: M. Lupu, M. Salampasis, N. Fuhr, A. Hanbury, B. Larsen, H. Strindberg (eds.): Proceedings of the Integrating IR technologies for Professional Search Workshop</institution>
          ,
          <addr-line>Moscow, Russia, 24-March-2013, published at http://ceur-ws.org, 26</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>United Institute of Informatics Problems, National Academy of Sciences</institution>
          ,
          <addr-line>Minsk</addr-line>
          ,
          <country country="BY">Belarus</country>
        </aff>
      </contrib-group>
      <fpage>26</fpage>
      <lpage>34</lpage>
      <abstract>
        <p>This paper reports a study of how expressions with metrological units, as defined by the International System of Units (SI), can be identified in thematically distinct text corpora in Belarusian and Russian. The urgency of the problem is dictated by the ubiquity of units of measurement and their enormous variety. The resulting algorithms are implemented as finite automata through a set of visual syntactic grammars. This method allows algorithms and resources to be updated much more easily and quickly than, for instance, regular expressions. The algorithms search for expressions with measurement units, then identify and classify them according to the SI. These practical results may find application in information search engines, libraries, publishing houses and speech synthesis systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Units of measurement have truly been man’s helpmates since ancient times. They are used in every branch of science
as well as in daily life. It is metrological units that present a quantitative perspective of the world, whether we consider the
building of the Egyptian pyramids or flights into space.</p>
      <p>Units of measurement are currently the object of numerous research projects, first of all within the dedicated
discipline of metrology, but also in mathematics, computer science, physics, coding theory and, of course,
corpus linguistics. Examples include such frequently used expressions as 120 км/г (120 km/h) (speed), 345 м
(345 m) (distance), 12 мА (12 mA) (amperage), etc.</p>
      <p>Texts containing expressions with units of measurement require special identification and processing algorithms in the
following areas:</p>
      <p>─ corpora and database management systems, libraries and information retrieval systems (to formulate extended search
queries, locate specific expressions on the Internet, and support automatic text annotation and summarization);</p>
      <p>─ text-to-speech synthesis systems (to generate orthographically correct texts and their tonal and prosodic
peculiarities);</p>
      <p>─ publishing institutions (to automatically locate specified lists of expressions with units of measurement, classify the resulting
expressions as SI units, SI-derived units or units outside the SI, and quickly check whether the extended names of units are
used correctly).</p>
      <p>However, dealing with units of measurement raises many difficulties. Firstly, numeral quantifiers and names of units
vary greatly in both spelling and formation. Creating localization rules that cover every complex expression
is practically impossible. To simplify this process, it is extremely important to use tools that
allow users to easily modify previously developed rules and add new ones. Secondly, an expression with units of measurement
is difficult to recognize and analyze, that is, to split into the numeral quantifier (digits, or parts of speech with quantitative meaning
in all their possible paradigmatic forms) and the name of the metrological unit, without thoroughly prepared
linguistic resources: dictionaries with all possible word forms, abbreviations and rules for building derivative forms of
the units’ names. This is necessary for proper localization of, for instance, the following expressions with units of
length, written in various ways: 1 м (1 m), 31 метр (31 meters), 25 метраў (25 meters), 44 метры (44 meters).
Thirdly, expressions with units of measurement are language-dependent: in English the abbreviated forms of meter and mile are written with m,
while in Belarusian and Russian they are written with м; even in the closely related Russian and Belarusian, the names of units differ in spelling.
Therefore, the recognition algorithms must be specified accurately for each language.</p>
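      <p>The role of such a dictionary can be illustrated with a minimal Python sketch. It is not part of the NooJ resources described in this paper: the names WORD_FORMS and split_expression are hypothetical, the dictionary holds only the four Belarusian forms of the meter quoted above, and the numeral is assumed to be a plain sequence of digits.</p>
      <preformat>
```python
# Hypothetical mini-dictionary: every listed word form points to one canonical unit name.
# The four forms of the meter are the Belarusian examples quoted in the text.
WORD_FORMS = {
    "м": "meter",
    "метр": "meter",
    "метраў": "meter",
    "метры": "meter",
}


def split_expression(expression):
    """Split '25 метраў' into (numeral, canonical unit); return None if it is not recognized."""
    numeral, _, unit_form = expression.strip().partition(" ")
    unit = WORD_FORMS.get(unit_form.lower())
    if unit is None or not numeral.isdigit():
        return None
    return numeral, unit


print(split_expression("25 метраў"))
print(split_expression("25 міль"))
```
      </preformat>
      <p>Under these assumptions, all four spellings reduce to the same unit, while a word form missing from the dictionary is rejected. This is why incomplete lexical resources translate directly into missed expressions.</p>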
      <p>Some important results have already been achieved in the Quantalyze semantic annotation and search service
[QS13] and the Numeric Property Searching service in the Derwent World Patents Index on STN [NPS98]. However, neither
service can be applied to Belarusian or Russian text corpora. Some steps toward solving the above-mentioned
problems were also taken in 2009 by a team of Croatian linguists, who obtained algorithms to identify
dimensional expressions of length and area and numerical ranges for English and Croatian [Bek09]. The subject area is
covered and interpreted superficially in research conducted by several other European linguists.
Their work, however, is more theoretical than practical, is largely descriptive, and treats units of measurement not
as separate objects but as “certain occurrences of words and expressions as belonging to particular category of named entity”
[Cun99]. Researchers from the Bulgarian Academy of Sciences and the University of Sheffield mention
that their “observations on the linguistic nature of Slavonic NE [named entities] are based only on their general
characteristics and on the general conclusions on their behavior in the text” [Pas02]. In practice, the named-entity category is semantically
very broad and is composed of various complex categories: “locations, persons, organizations, dates, times, monetary
amounts and percentages” [Pas02] or, in other words, “persons, locations, organizations, time and numerical expressions”
[Myk07]. All of them therefore have to be treated separately if the chief aim is to obtain successful identification algorithms. A
Bulgarian-Serbian research team defined the term ‘measure’ as “a structure of a sequence of numbers written by
words or digits followed by a measure indicator (kilometer, grade, mile, foot, etc.)” [Duš07] and represented it formally as
a graph. Still, their practical results are limited to specific languages (Bulgarian, Serbian) and therefore
cannot be applied to Belarusian or Russian.</p>
      <p>Our research therefore treats expressions with units of measurement as numerical word combinations in which each component
requires its own approach for successful identification. Our goal is to develop algorithms and linguistic resources
that identify and classify units of measurement and expressions containing them, using hand-crafted
scientific-technical and legal text corpora in Belarusian and Russian.</p>
      <p>The specific aim of our work is not simply to describe expressions with measurement units. Their enormous variety is
the reason why regular expressions are not the best way to obtain localization rules. The visual methods of NooJ
allow users to easily modify previously developed rules and add new ones, and they can be applied to any
language. We decided to demonstrate them on Belarusian and Russian, two Slavic languages that have much in
common but also differ, as do their units of measurement. Most of the resources needed to
localize units in complex text fragments are language-dependent, but the algorithm itself remains the same. It should be noted that
the algorithms describe not only the units of measurement but also the numeral quantifiers that stand before them,
which is extremely important for automatic document comprehension and information retrieval systems. The test corpora
were also constructed by means of NooJ, as it can perform syntactic or semantic analysis on partially or
totally ambiguous texts. This confirms that all the results (concordances with units of measurement) are obtained solely
by the algorithms, rather than from special tags or indices.</p>
    </sec>
    <sec id="sec-2">
      <title>Finite-State Automata of NooJ for Identification of Measurement Units</title>
      <p>To solve the above-mentioned problem, we used practical results already obtained while
constructing the Belarusian and Russian modules of the international computational-linguistics program NooJ
[Het12a, Het12b]. This program makes it possible to implement sophisticated algorithms for searching across compound text
fragments in Belarusian and Russian in the form of visual executable graphs within finite-state automata [NooJ02].</p>
      <p>For the construction and testing of the algorithms, four text corpora were formed: two in Belarusian and two in Russian.
They contain a wide array of expressions with units of measurement for two thematically distinct domains:
scientific-technical (from the fields of astronomy, physics, geography, chemistry, aviation, space, history, energy, transport and
communication) (figure 1) and legal (the traffic code of Belarus) (figure 2) [RRB07].</p>
      <p>According to the main graph, any text fragment is first checked by the first subgraph (Numeral Quantifier) for a
compound numerical descriptor (figure 4). This subgraph handles not only whole, decimal and
fractional numbers in their various written forms but also compound numeral expressions with exponential parts and
periods. Some of its results are shown as a concordance in figure 5. The extracted
numerals are listed in the column Seq. The columns Before and After contain the left and right contexts in which the
extracted numerals occur. It should be emphasized that this subgraph is language-independent (see the example for
English in figure 5c).</p>
      <p>Figure 5: Some results of the subgraph that identifies complex numeral expressions in texts
in a) Belarusian, b) Russian, c) English</p>
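      <p>To give a rough idea of the strings such a subgraph has to accept, the following Python sketch approximates it with a regular expression. This is only an illustration and not the NooJ graph itself. The pattern names and the function find_numerals are hypothetical. The notation assumed for exponential parts (a multiplication sign followed by a power of 10) is our guess, and periods are not covered.</p>
      <preformat>
```python
import re

# Hypothetical approximation of the Numeral Quantifier subgraph (not the NooJ graph).
INTEGER = r"\d{1,3}(?: \d{3})+|\d+"             # 200 000 or 345
DECIMAL = r"(?:" + INTEGER + r")(?:[.,]\d+)?"   # 3,3 or 0.1 (decimal comma or point)
EXPONENT = r"(?:\s?[·*x]\s?10\^?-?\d+)?"        # assumed notation, e.g. ·10^3
FRACTION = r"\d+/\d+"                           # 1/2
NUMERAL = re.compile(FRACTION + r"|" + DECIMAL + EXPONENT)


def find_numerals(text):
    """Return every numeral expression found in the text, in order of appearance."""
    return [match.group(0) for match in NUMERAL.finditer(text)]


print(find_numerals("узяць 3,3 молі"))
print(find_numerals("a tank of 200 000 l"))
```
      </preformat>
      <p>Because the pattern refers only to digits and punctuation, the same function works on Belarusian, Russian and English input, which mirrors the language independence of the subgraph. Extending such a pattern quickly becomes hard to read, which is one reason for preferring visual graphs.</p>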
      <p>Once the first subgraph has matched, the algorithm proceeds to the other subgraphs, which are connected to its
output by transition lines. For comparison, figure 6 shows the subgraphs that recognize units of the
International System of Units in Belarusian and Russian texts. These subgraphs are
language-dependent. Although the subgraphs in figure 6 serve the same purpose and recognize the same units (kilograms,
candelas, seconds, kelvins, amperes, meters, moles), they differ not only in fonts but also in the ways the same
units of measurement are written. For example, in Russian the unit of electric current, the ampere, can be written in two ways: “А” or “Ампер”.
In Belarusian there are three ways: “А”, “Ампер”, “Ампэр”.
English likewise has three: A, ampere, amp.</p>
      <p>The order of the checks is not
important, because all the checks within the subgraphs are mutually exclusive. They search for and identify expressions
with units of measurement that belong to the general classification of the International Bureau of Weights and
Measures [BIPM06]. The subgraph System International identifies SI units, for example, кілаграм
(kilogram); the subgraph Derived identifies SI-derived units, such as герц (hertz); the subgraph Other systems identifies the most
common non-SI units, such as час (hour). If any of these three subgraphs
matches, the sequence of transition lines on the way to the main graph’s output is indicated by markers.
When the main graph matches, possible markers include MEAS (which stands for ‘measurement’),
MEAS+SI+…, and MEAS+D+SI+…, corresponding to the purposes of the subgraphs described above.
The three dots in the last two markers are replaced by the specific markers of whichever subgraph
matches. In each subgraph, the name of a unit (or its word form) corresponds to the name of the respective physical quantity
(or its word form).</p>
      <p>Take the word combination “узяць 3,3 молі” (take 3,3 moles) as an example. The algorithm
recognizes the expression “3,3 молі” (3,3 moles) and assigns it the marker MEAS+SI+Amount of
substance. The marker makes it possible to identify exactly which subgraph matched and which units of measurement are used in
the expression. The code MEAS means that the expression “3,3 молі” (3,3 moles) contains a unit of measurement, “молі”
(moles). The code +SI indicates that the unit “молі” (moles) belongs to the SI units. The code +Amount
of substance means that “молі” (moles) is used to measure the amount of substance.</p>
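      <p>The way a marker is assembled can be sketched in a few lines of Python. The sketch is a simplification under stated assumptions, not the NooJ grammar: UNIT_LEXICON and build_marker are hypothetical names, the lexicon contains only unit names mentioned in this paper, and each unit is assumed to belong to exactly one of the three subgraphs.</p>
      <preformat>
```python
# Hypothetical miniature lexicon: unit word form -> (subgraph, physical quantity).
UNIT_LEXICON = {
    "молі": ("SI", "Amount of substance"),
    "кілаграм": ("SI", "Mass"),
    "герц": ("Derived", "Frequency"),
    "час": ("Other", "Time"),
}


def build_marker(unit_form):
    """Return the marker for a unit word form, or None if the unit is unknown."""
    entry = UNIT_LEXICON.get(unit_form.lower())
    if entry is None:
        return None
    subgraph, quantity = entry
    parts = ["MEAS"]
    if subgraph == "Derived":
        parts += ["D", "SI"]      # units derived from the SI base units
    elif subgraph == "SI":
        parts.append("SI")        # SI base units
    parts.append(quantity)        # non-SI units carry only MEAS and the quantity
    return "+".join(parts)


print(build_marker("молі"))
print(build_marker("герц"))
```
      </preformat>
      <p>With this lexicon, “молі” yields MEAS+SI+Amount of substance, as in the example above, and “герц” yields MEAS+D+SI+Frequency. Because a unit has a single lexicon entry, the order in which the three categories are tested does not affect the result.</p>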
      <p>The component D of the marker MEAS+D+SI+… requires a second, distinct subgraph in order to
separate out expressions with units derived from the SI base units, such as degree Celsius, hertz, radian, newton, joule,
pascal, watt, volt, ohm, becquerel. Its structure is shown in figure 7.</p>
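      <p>As a result, a flexible system of markers allows the user to build search queries of different types: to find all
expressions with units of measurement (figure 8); to draw up a concordance of expressions with units of mass (with the query
&lt;MEAS+Mass&gt;) or length (&lt;MEAS+Length&gt;); to find expressions with units that are derived from the SI units
(&lt;MEAS+SI+D&gt;) (figure 10) or with SI units that are not derived (&lt;MEAS+SI-D&gt;) (figure 9); to recognize expressions that do not belong to
the SI (&lt;MEAS-SI&gt;) (figure 11); etc. Table 1 contains the search results from figures 8-11, translated into English and listed
from top to bottom.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <caption>
          <p>Search results from figures 8-11, translated into English</p>
        </caption>
        <table>
          <thead>
            <tr><th>Expression</th><th>Marker</th></tr>
          </thead>
          <tbody>
            <tr><td>1m</td><td>&lt;MEAS+Length|Distance+SI&gt;</td></tr>
            <tr><td>0,1Hz</td><td>&lt;MEAS+Frequency+D+SI&gt;</td></tr>
            <tr><td>8 t</td><td>&lt;MEAS+Mass&gt;</td></tr>
            <tr><td>year 2005</td><td>&lt;MEAS+Time&gt;</td></tr>
            <tr><td>74 degrees</td><td>&lt;MEAS+Angle&gt;</td></tr>
            <tr><td>109 K</td><td>&lt;MEAS+Thermodynamic temperature+SI&gt;</td></tr>
            <tr><td>200 000 l</td><td>&lt;MEAS+Volume&gt;</td></tr>
            <tr><td>33 years</td><td>&lt;MEAS+Time&gt;</td></tr>
            <tr><td>5°</td><td>&lt;MEAS+Angle&gt;</td></tr>
            <tr><td>600°C</td><td>&lt;MEAS+Temperature in Celsius scale+D+SI&gt;</td></tr>
          </tbody>
        </table>
      </table-wrap>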
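      <p>How such queries select expressions can be imitated outside NooJ with a short Python sketch. It only models the idea of required (+) and excluded (-) features and is not NooJ's own query engine. The function names are hypothetical, the angle brackets around queries are omitted, and the vertical bar in a marker is assumed to separate alternative names of the same quantity.</p>
      <preformat>
```python
import re


def parse_query(query):
    """Split a query such as 'MEAS+SI-D' into required and forbidden feature sets."""
    head, rest = re.match(r"([^+-]+)(.*)", query).groups()
    required, forbidden = {head}, set()
    for sign, name in re.findall(r"([+-])([^+-]+)", rest):
        (required if sign == "+" else forbidden).add(name)
    return required, forbidden


def features_of(marker):
    """Split a marker into features; '|' is assumed to separate alternative names."""
    return set(re.split(r"[+|]", marker))


def search(rows, query):
    """Return the expressions whose marker has all required and no forbidden features."""
    required, forbidden = parse_query(query)
    return [
        expression
        for expression, marker in rows
        if required.issubset(features_of(marker))
        and forbidden.isdisjoint(features_of(marker))
    ]


# A few (expression, marker) pairs taken from Table 1.
ROWS = [
    ("1m", "MEAS+Length|Distance+SI"),
    ("0,1Hz", "MEAS+Frequency+D+SI"),
    ("8 t", "MEAS+Mass"),
    ("109 K", "MEAS+Thermodynamic temperature+SI"),
    ("600°C", "MEAS+Temperature in Celsius scale+D+SI"),
]

print(search(ROWS, "MEAS+SI+D"))
print(search(ROWS, "MEAS-SI"))
```
      </preformat>
      <p>On these rows, the query MEAS+SI+D returns the two expressions with derived units (0,1Hz and 600°C), MEAS+SI-D returns 1m and 109 K, and MEAS-SI returns only 8 t.</p>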
      <p>Evaluation of the resulting algorithms</p>
      <p>To evaluate the algorithms, we selected texts containing 104 and 107 usages of units of measurement for
Belarusian and Russian, respectively, from the legal corpora, and 692 and 811 usages from the scientific-technical
corpora. This number is denoted by N. An expert then checked the concordances built by the algorithms for the
selected texts. The total number of expressions with units of measurement in the concordances and the number of correctly identified
expressions among them were counted separately and are denoted by L and M, respectively. The evaluation showed that
the algorithms achieve 72% accuracy on average over the test corpora.</p>
      <p>Conclusion</p>
      <p>It can be concluded that the main goal of this research, namely to take the first steps toward developing algorithms
that identify expressions with various units of measurement in Belarusian and Russian scientific,
technical and legal text corpora, has been achieved. The results can be applied in any branch of science connected with
information retrieval systems and text-to-speech synthesis. The algorithms are implemented as finite-state
automata through a set of syntactic grammars within the linguistic processor NooJ, which helps to build
formal grammars without requiring special knowledge of programming. The automata demonstrate how the
algorithms work and indicate how they can be further updated in order to improve their accuracy. Although an accuracy of
more than 70% has already been achieved, there is still much room for improvement. For example:</p>
      <p>─ taking into account the metrological system of prefixes as parts of the units’ names (milli-, deci-, kilo-, giga-, etc.);</p>
      <p>─ disambiguating multiple-valued expressions, for example, in cases where the algorithms “confuse” some units
with each other (the same initial letter ‘г’ for ‘год’ (year), ‘грам’ (gram), ‘гадзіна’ (hour)) or some units with brands of
vehicles (МАЗ-4A, not 4 amperes);</p>
      <p>─ developing algorithms that identify numeral quantifiers expressed not only by numbers (mathematical objects)
but also by numerals (parts of speech);</p>
      <p>─ identifying the plus and minus signs positioned in front of numeral quantifiers, and disambiguating the minus, hyphen
and dash signs;</p>
      <p>─ updating the base of units of measurement with seldom-used ones.</p>
      <p>[Bek09] B. Bekavac. Units of Measurement Detection Module for NooJ. Conference on NooJ 2009, pp. 121-127, Tunisia (2009)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[NPS98] <source>Numeric Property Searching in Derwent World Patents Index on STN</source> [Electronic resource]. - <year>1998</year>. - Mode of access: http://www.stn-international.com/numeric_property_searching.html. - Date of access: 05.02.2013.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[QS13] <source>Quantalyze semantic annotation and search service</source> [Electronic resource]. - <year>2013</year>. - Mode of access: https://www.quantalyze.com/en/. - Date of access: 05.02.2013.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Cun99]
          <string-name>
            <given-names>H.</given-names>
            <surname>Cunningham</surname>
          </string-name>
          .
          <article-title>Information Extraction: a User Guide (revised version)</article-title>
          ,
          <source>Research Memorandum CS-99-07</source>
          . Department of Computer Science, University of Sheffield (May,
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[Pas02] <string-name><given-names>E.</given-names> <surname>Paskaleva</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Angelova</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Jankova</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Bontcheva</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Cunningham</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Wilks</surname></string-name>. <article-title>Slavonic Named Entities in Gate</article-title>, <source>Research Memorandum CS-02-01</source>. Department of Computer Science, University of Sheffield, Great Britain (<year>2002</year>)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[Myk07] <string-name><given-names>A.</given-names> <surname>Mykowiecka</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Kupść</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Marciniak</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Piskorski</surname></string-name>. <article-title>Resources for Information Extraction from Polish texts</article-title>. <source>Proceedings of 3rd Language &amp; Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics</source>, Poznan (<year>2007</year>)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[Duš07] <string-name><given-names>V.</given-names> <surname>Duško</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Krstev</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Koeva</surname></string-name>. <article-title>Towards a Complex Model for Morpho-Syntactic Annotation</article-title>. <source>Proceedings of the Workshop on a Common Natural Language Processing Paradigm for Balkan Languages</source>, 26 September 2007, Borovets, Bulgaria. In: Paskaleva, E., Slavcheva, M. (eds.), pp. <fpage>65</fpage>-<lpage>71</lpage> (<year>2007</year>)</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[Het12a] <string-name><given-names>Y.</given-names> <surname>Hetsevich</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Hetsevich</surname></string-name>. <article-title>Overview of Belarusian and Russian dictionaries and their adaptation for NooJ</article-title>. <source>Automatic Processing of Various Levels of Linguistic Phenomena: Selected Papers from the NooJ 2011 Intern. Conf.</source> In: Vučković, K., Bekavac, B., Silberztein, M. (eds.), pp. <fpage>29</fpage>-<lpage>40</lpage>. Cambridge Scholars Publishing, Newcastle (<year>2012</year>)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[Het12b] <string-name><given-names>Y.</given-names> <surname>Hetsevich</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Hetsevich</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Lobanov</surname></string-name>. <article-title>Belarusian and Russian linguistic modules processing for the system NooJ as applied to text-to-speech synthesis</article-title>. <source>Computational Linguistics and Information Technologies: materials of the Int. Conf. “Dialogue”</source>, Bekasovo, May 30 - June 3, 2012. Issue <issue>11 (18)</issue>, vol. <volume>1</volume>, pp. <fpage>198</fpage>-<lpage>212</lpage>. Moscow (<year>2012</year>)</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[NooJ02] <source>Linguistic Processor NooJ</source> [Electronic resource]. - <year>2002</year>. - Mode of access: http://www.nooj4nlp.net/pages/nooj.html. - Date of access: 10.11.2012.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[RRB07] <source>Rules of the Road of Belarus</source> [Electronic resource]. - <year>2007</year>. - Mode of access: http://pdd.by. - Date of access: 10.11.2012.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[BIPM06] <source>International Bureau of Weights and Measures BIPM</source> [Electronic resource]. - <year>2006</year>. - Mode of access: http://www.bipm.org/en/si/si_brochure/general.html. - Date of access: 10.11.2012.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>