<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Analysis of Text Data with Automated System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>O. Chernenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>O. Gordeeva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Samara National Research University</institution>
          ,
          <addr-line>34 Moskovskoe Shosse, 443086, Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>72</fpage>
      <lpage>75</lpage>
      <abstract>
<p>This paper describes the application of the basic methods of semantic analysis of text data (Porter stemming, frequency semantic analysis, latent semantic analysis and syntactic semantic analysis) using an automated system that can analyze a text with each of these methods. The characteristics and implementation features of the methods, as well as the results of applying them to texts of low complexity, are considered. The research reveals how the choice of method depends on the purpose of the text analysis. Nowadays it is difficult to imagine effective work with text data without computer processing, and one of the most relevant and constantly evolving types of text processing is semantic analysis. Depending on the criteria set in the automated system, the most appropriate type of semantic analysis can be selected. For example, in the case of a search audit of a website, the criteria for choosing a method of semantic analysis are the processing speed and a minimal dictionary volume. In the case of a literary work with complex figures of speech, the main criterion is the quality of processing: the algorithm should produce results as close to a human reading as possible, so parameters such as speed and dictionary volume are not decisive.</p>
      </abstract>
      <kwd-group>
        <kwd>text analysis</kwd>
        <kwd>frequency semantic analysis</kwd>
        <kwd>Porter stemmer</kwd>
        <kwd>latent semantic analysis</kwd>
        <kwd>core words</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Task Formulation</title>
    </sec>
    <sec id="sec-3">
      <title>3. Methods of Text Semantic Analysis</title>
      <sec id="sec-3-1">
        <title>3.1. Frequency Semantic Analysis</title>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Porter Stemming</title>
        <p>Data Science / O. Chernenko, O. Gordeeva</p>
<p>
          In [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], Porter's algorithm for determining the basic part (stem) of a word is described. The algorithm removes endings and suffixes step by step:
• if the word has a gerund ending, it is removed; otherwise, the reflexive endings "sia" or "sj" are removed if present, and then an adjective, verb, or noun ending is looked for and removed if found;
• the ending "i" is removed if present;
• the ending "ost" or "ostj" is removed if present;
• if the word ends with "nn", the last letter "n" is removed;
• if the word ends with "eyesh" or "eishe", this part is removed; if the word then ends with "nn" again, the last letter "n" is removed;
• if the word ends with the soft sign "ь", it is removed.
        </p>
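<p>A minimal sketch of this suffix-stripping order in Python. This is an illustration only: the endings are transliterated, only some of the steps above are modelled, and the real Snowball Russian stemmer works on Cyrillic text with region rules.</p>

```python
# Illustrative suffix stripping in roughly the order described above.
# NOT the full Snowball Russian stemmer: the ending lists are abridged,
# transliterated, and the R1/R2 region rules are ignored.
REFLEXIVE = ("sia", "sj")
ADJ_VERB_NOUN = ("ogo", "ego", "aia", "ami", "it", "et", "a", "y", "i")

def strip_suffixes(word: str) -> str:
    # Step 1: reflexive endings.
    for end in REFLEXIVE:
        if word.endswith(end):
            word = word[: -len(end)]
            break
    # Step 2: adjective/verb/noun endings, longest first.
    for end in sorted(ADJ_VERB_NOUN, key=len, reverse=True):
        if word.endswith(end):
            word = word[: -len(end)]
            break
    # Step 3: derivational endings "ost"/"ostj".
    for end in ("ostj", "ost"):
        if word.endswith(end):
            word = word[: -len(end)]
            break
    # Step 4: a double "nn" loses its last letter.
    if word.endswith("nn"):
        word = word[:-1]
    return word

print(strip_suffixes("knigami"))  # -> knig
```

The order matters: each class of endings is tried only after the previous class has been handled, mirroring the step list above.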
<p>
          To determine the theme of a text with an algorithm based on Porter stemming, the stemming procedure is carried out for every word of the text being analyzed. As a result, an array of stems is obtained. The words of the text that are derived from the most frequently occurring stem are marked as the theme of the text [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
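<p>Once the array of stems is obtained, selecting the theme reduces to a frequency count; a sketch with invented stems:</p>

```python
from collections import Counter

# Stems produced by the stemming pass over the text (invented values).
stems = ["knig", "chelovek", "knig", "zhizn", "knig", "chelovek"]

# The stem with the most occurrences marks the theme of the text.
theme_stem, count = Counter(stems).most_common(1)[0]
print(theme_stem, count)  # -> knig 3
```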
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Latent Semantic Analysis</title>
<p>
          Latent Semantic Analysis (LSA) is a method of processing natural-language data. The method analyzes the relationships between a set of documents and the terms they contain, and associates factors (themes) with the documents and terms. The LSA method is based on the principles of factor analysis. As input, LSA uses a term-to-document matrix (where terms are words or phrases) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
<p>The elements of this matrix are coefficients (weights) that take into account the frequency of occurrence of each term in each document. The most common version of LSA is based on Singular Value Decomposition (SVD). Using SVD, any matrix is decomposed into a product of orthogonal and diagonal matrices, and a truncated form of this product gives an accurate approximation to the original matrix.</p>
        <p>More formally, according to the singular decomposition theorem, any real rectangular matrix can be decomposed into a
product of three matrices:</p>
<p>A = USV<sup>T</sup>,
where the matrices U and V are orthogonal and the matrix S is diagonal; the values on the diagonal of S are called the "singular values" of the matrix A, and the superscript "T" denotes the transpose of the matrix V.</p>
<p>This decomposition has an important property: if only the "k" largest singular values are retained in the matrix S, and only the columns corresponding to these values are retained in the matrices U and V, then the product of the resulting matrices is the best rank-"k" approximation Â of the original matrix A:</p>
        <p>A ≈ Â = U<sub>k</sub>S<sub>k</sub>V<sub>k</sub><sup>T</sup>.</p>
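<p>The decomposition and the rank-k truncation can be checked numerically with NumPy on a toy term-to-document matrix (the values are illustrative):</p>

```python
import numpy as np

# Toy term-to-document matrix: rows are terms, columns are documents.
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Full reconstruction: A = U S V^T (up to floating-point error).
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Rank-k approximation: keep only the k largest singular values
# and the corresponding columns of U and V.
k = 1
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_hat, 2))
```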
<p>
          The main idea of LSA is that if A is the term-to-document matrix, then the matrix Â, which contains only the first "k" linearly independent components of A, reflects the basic structure of the dependencies present in the original matrix. Proceeding from this decomposition, the dependencies between terms and documents are analyzed and the theme of the text is determined [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Syntactic Semantic Analysis</title>
<p>
          Syntactic semantic analysis is a method of processing textual information that creates templates to be matched against the words of the text. As a result, a list of pairs is created for each sentence [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Each pair includes:
• the type of the word in the sentence;
• the position of the head word for this dependent word.
        </p>
<p>It is assumed that the basic templates are formed from the most important and most frequently used semantic relations in the text. A basic semantic template is the rule by which a semantic relation is identified in the text being analyzed. A basic semantic template consists of four main parts:
• a sequence of words or indivisible semantic units together with their morphological features;
• the name of the semantic relation that should be formed if the described sequence is found in the text;
• a sequence of numbers determining the positions in the sequence whose elements should be added to the priority queue; according to this queue, words are deleted from the sentence being analyzed;
• a number indicating the priority value, i.e. the group of semantic dependencies to which this semantic relation belongs.</p>
<p>Using the basic semantic templates, a priority queue is composed. This queue stores the words that appear as the right-hand argument of a semantic collocation found in the sentence being analyzed.</p>
<p>To determine the theme of the text, the word with the largest number of dependencies is selected from each sentence according to the priority queue, and the number of its occurrences in the text is counted. The word with the maximum number of occurrences is taken as the theme of the text.</p>
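<p>A toy illustration of this selection step in Python, with hand-made (word, head position) pairs standing in for the result of template matching (all data here is invented):</p>

```python
from collections import Counter

# Each sentence is a list of (word, head_index) pairs; head_index = -1
# marks the root. In the real system these pairs come from template matching.
sentences = [
    [("book", -1), ("opens", 0), ("world", 0)],
    [("reader", 1), ("loves", -1), ("book", 1)],
    [("book", -1), ("teaches", 0)],
]

def most_depended_word(pairs):
    # Count how many words point to each position as their head,
    # then return the word at the most-referenced position.
    dep_counts = Counter(head for _, head in pairs if head >= 0)
    head_pos, _ = dep_counts.most_common(1)[0]
    return pairs[head_pos][0]

# One candidate per sentence; the most frequent candidate is the theme.
candidates = [most_depended_word(s) for s in sentences]
theme, _ = Counter(candidates).most_common(1)[0]
print(theme)  # -> book
```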
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. System Operation</title>
      <p>To conduct the research on the methods of text analysis, an automated system was developed. The system operation includes
several steps.</p>
<p>At the first step the system splits the text into elements (words or sentences, depending on the algorithm chosen by the user) and sends them for processing. The second step is processing the elements and selecting the core words.</p>
<p>If the FSA has been chosen by the user, the system compares the words of the text with the words of the dictionary and finds among them the words with the maximum number of occurrences in the text. Then it displays the result: the core words of the text. Additionally, the system displays the list of words that have not been found in the dictionary. These words can be added to the dictionary, after which the algorithm can be run anew.</p>
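<p>The dictionary comparison step described above can be sketched as follows (the dictionary and the text are invented for illustration):</p>

```python
from collections import Counter

dictionary = {"book", "world", "reader", "life"}  # invented dictionary
words = "book opens the world the book teaches the reader".split()

# Count only the words that the dictionary knows.
counts = Counter(w for w in words if w in dictionary)
core_words = [w for w, c in counts.items() if c == max(counts.values())]

# Words missing from the dictionary: candidates to add before rerunning.
unknown = sorted(set(words) - dictionary)

print(core_words)  # -> ['book']
print(unknown)     # -> ['opens', 'teaches', 'the']
```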
<p>If the algorithm based on Porter stemming has been chosen, the system determines the stems of the original words of the text and looks for the most frequent ones among them. In this way this algorithm finds the core words of the text.</p>
<p>In the case of LSA the system constructs a word-by-sentence matrix from the sentences of the text and carries out the SVD decomposition. Then only the first two columns of the resulting matrices are used. From the first two columns of the matrix V<sup>T</sup>, which correspond to the sentences, the maximum and minimum values are selected. These values give the maximum and minimum x and y coordinates on the coordinate plane. In this way an area is defined, and the words whose points, taken from the first two columns of the matrix U, fall into this area are included in the core words.</p>
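<p>A sketch of this selection rule with NumPy, assuming a small invented word-by-sentence matrix; the bounding box is taken from the first two singular components on the sentence side, and words whose first two U-coordinates fall inside it are kept:</p>

```python
import numpy as np

# Toy word-by-sentence matrix: rows are words, columns are sentences.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
words = ["book", "soul", "life", "year"]

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sentence coordinates: the first two singular components (rows of V^T).
# Their extremes define the bounding box on the coordinate plane.
xs, ys = Vt[0], Vt[1]
x_min, x_max, y_min, y_max = xs.min(), xs.max(), ys.min(), ys.max()

# A word is a core word when its first two U-coordinates fall inside the box.
core = [w for w, (x, y) in zip(words, U[:, :2])
        if x_min <= x <= x_max and y_min <= y <= y_max]
print(core)
```

The matrix and word list are illustrative only; the actual core words depend on the input text.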
<p>If the SSA has been chosen by the user, the words of each sentence are checked against the templates. After that, a weight value is set for every word according to the template. The more dependent words a word has, the lower its weight and the higher its priority. Next, the word with the minimum weight is determined in each sentence, and the words with the largest number of occurrences form the core of the text.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>As objects for research, texts for the essays of the Unified State Examination in the Russian language were chosen. These
texts were chosen because of their simplicity and small size, and also because their themes are clearly defined.</p>
<p>In Tables 1-5 the core words for the text examples are presented. In addition, the processing time for each method of analysis is given.</p>
<p>The main idea of the first text is "the influence of mass literature on the intellectual development of a person". No method produced exactly this theme, but the most suitable core words were given by the latent semantic analysis and the Porter stemming method.</p>
<p>The main idea of this text is "the relations between man and nature". As in the previous example, the latent semantic analysis gave the most similar core words.</p>
      <p>Table 4. «Books...» (A. Yetoyev).</p>
<p>[Flattened table fragment, only partially recoverable: the frequency semantic method took about 5 seconds, and the listed core words include "soul", "soul raincoat" and "soul year raincoat".]</p>
<p>The core words given by the syntactic semantic analysis are the most similar to the theme of the fourth text, "the role of books in human life".</p>
<p>The topic of the fifth text is "the soul of a human". All algorithms gave satisfactory results, the most accurate of which was given by the Porter stemming.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
<p>In this article, methods of text classification, such as Porter stemming, syntactic semantic, frequency semantic and latent semantic analysis, are considered. The results of the analysis of texts of low complexity are given. Based on these results it can be concluded that the choice of a method for determining the topic of a text depends on the complexity of the text: the more complex the text, the more accurate the analysis should be.</p>
<p>The same applies to trivial texts. Using complex methods on simple texts leads to an unnecessary waste of time and resources, while the result is no better than that of simple algorithms. Thus, the research shows that for short texts the most effective method is the latent semantic analysis, and the fastest method is the Porter stemming. Finally, it should be mentioned that a combination of text analysis methods, for example, combining Porter stemming and frequency semantic analysis, can be appropriate for effective and accurate determination of core words.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Velichkevich</surname>
            <given-names>AG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherepackhina</surname>
            <given-names>AA</given-names>
          </string-name>
          .
          <article-title>Latent semantic analysis of text using Porter algorithm</article-title>
          .
          <source>Youth scientific and technical herald</source>
          <year>2015</year>
          ;
          <volume>10</volume>
: 38 p. (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
<string-name>
            <surname>Mikhaylov</surname>
            <given-names>DV</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozlov</surname>
            <given-names>AP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Emelyanov</surname>
            <given-names>GM</given-names>
          </string-name>
          .
          <article-title>An approach based on tf-idf metrics to extract the knowledge and relevant linguistic means on subject-oriented text sets</article-title>
          .
          <source>Computer Optics</source>
          <year>2015</year>
          ;
          <volume>39</volume>
          (
          <issue>3</issue>
          ):
          <fpage>429</fpage>
          -
          <lpage>435</lpage>
. DOI: 10.18287/0134-2452-2015-39-3-429-438.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
[3]
          <article-title>Understanding and synthesizing text by computer</article-title>
          . URL: http://compuling.narod.ru/index2.html (11.12.16). (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[4]
          <article-title>Russian stemming algorithm</article-title>
          . URL: http://snowball.tartarus.org/algorithms/russian/stemmer.html (11.12.16). (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Silva</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            <given-names>C.</given-names>
          </string-name>
          <article-title>A lexicon-based stemming procedure</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          <year>2003</year>
          ;
          <volume>2721</volume>
          :
          <fpage>159</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Zaboleeva-Zotova</surname>
            <given-names>AV</given-names>
          </string-name>
          .
          <article-title>Latent semantic analysis and new solutions in Internet</article-title>
          . Moscow: Information Technologies,
          <year>2001</year>
; 22 p. (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Kuralenok</surname>
            <given-names>I</given-names>
          </string-name>
,
          <string-name>
            <surname>Nekrest'yanov</surname>
            <given-names>I</given-names>
          </string-name>
          .
          <article-title>Automatic document classification based on latent semantic analysis</article-title>
          .
          <source>Programming and Computer Software</source>
          <year>2000</year>
          ;
          <volume>26</volume>
          (
          <issue>4</issue>
          ):
          <fpage>199</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
<string-name>
            <surname>Rabchevsky</surname>
            <given-names>EA</given-names>
          </string-name>
          .
          <article-title>Automatic construction of ontologies based on lexical-syntactic templates for information search</article-title>
          .
          <source>Petrozavodsk</source>
          ,
          <year>2009</year>
; 107 p. (in Russian)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>