<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Analysis of Text Data with Automated System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>O. Chernenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>O. Gordeeva</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Samara National Research University</institution>
          ,
          <addr-line>34 Moskovskoe Shosse, 443086, Samara</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>72</fpage>
      <lpage>75</lpage>
      <abstract>
<p>This paper describes the application of the basic methods of semantic analysis of text data (Porter stemming, frequency semantic analysis, latent semantic analysis and syntactic semantic analysis) using an automated system that can analyze a text with each of these methods. The characteristics and implementation features of the methods, as well as the results of applying them to texts of low complexity, are considered. The research reveals how the choice of method depends on the purpose of the text analysis. Nowadays it is difficult to imagine effective work with text data without computer processing, and one of the most relevant and constantly evolving types of text processing is semantic analysis. Depending on the criteria set in the automated system, the most appropriate type of semantic analysis can be selected. For example, in the case of a search audit of a website, the criteria for choosing a method of semantic analysis are the processing speed and a minimal dictionary volume. In the case of a literary work with complex figures of speech, the main criterion is the quality of processing: the algorithm should produce results as close to a human reading as possible, so parameters such as speed and dictionary volume are not decisive.</p>
      </abstract>
      <kwd-group>
        <kwd>text analysis</kwd>
        <kwd>frequency semantic analysis</kwd>
        <kwd>Porter stemmer</kwd>
        <kwd>latent semantic analysis</kwd>
        <kwd>core words</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Task Formulation</title>
    </sec>
    <sec id="sec-3">
      <title>3. Methods of Text Semantic Analysis</title>
      <sec id="sec-3-1">
        <title>3.1. Frequency Semantic Analysis</title>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Porter Stemming</title>
        <p>Data Science / O. Chernenko, O. Gordeeva</p>
<p>
          In [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], Porter's algorithm for determining the basic part (stem) of a word is described. The algorithm removes endings and suffixes step by step:
• if the word has a gerund ending, it is removed; otherwise, the reflexive endings "sia" or "sj" are removed if present, and then an adjective, verb, or noun ending is looked for and removed if found;
• the ending "i" is removed if present;
• the ending "ost" or "ostj" is removed if present;
• if the word ends with "nn", the last letter "n" is removed;
• if the word ends with "eyesh" or "eishe", this part is removed; if the word then ends with "nn" again, the last letter "n" is removed;
• if the word ends with the soft sign "ь", it is removed.
        </p>
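<p>A minimal sketch of this suffix-stripping order in Python. This is an illustration only: the endings are transliterated, only some of the steps above are modelled, and the real Snowball Russian stemmer works on Cyrillic text with region rules.</p>

```python
# Illustrative suffix stripping in roughly the order described above.
# NOT the full Snowball Russian stemmer: the ending lists are abridged,
# transliterated, and the R1/R2 region rules are ignored.
REFLEXIVE = ("sia", "sj")
ADJ_VERB_NOUN = ("ogo", "ego", "aia", "ami", "it", "et", "a", "y", "i")

def strip_suffixes(word: str) -> str:
    # Step 1: reflexive endings.
    for end in REFLEXIVE:
        if word.endswith(end):
            word = word[: -len(end)]
            break
    # Step 2: adjective/verb/noun endings, longest first.
    for end in sorted(ADJ_VERB_NOUN, key=len, reverse=True):
        if word.endswith(end):
            word = word[: -len(end)]
            break
    # Step 3: derivational endings "ost"/"ostj".
    for end in ("ostj", "ost"):
        if word.endswith(end):
            word = word[: -len(end)]
            break
    # Step 4: a double "nn" loses its last letter.
    if word.endswith("nn"):
        word = word[:-1]
    return word

print(strip_suffixes("knigami"))  # -> knig
```

The order matters: each class of endings is tried only after the previous class has been handled, mirroring the step list above.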
<p>
          To determine the theme of a text with an algorithm based on Porter stemming, the stemming procedure is carried out for every word of the text being analyzed. As a result, an array of stems is obtained. The words of the text that are derived from the most frequently occurring stem are marked as the theme of the text [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
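<p>Once the array of stems is obtained, selecting the theme reduces to a frequency count; a sketch with invented stems:</p>

```python
from collections import Counter

# Stems produced by the stemming pass over the text (invented values).
stems = ["knig", "chelovek", "knig", "zhizn", "knig", "chelovek"]

# The stem with the most occurrences marks the theme of the text.
theme_stem, count = Counter(stems).most_common(1)[0]
print(theme_stem, count)  # -> knig 3
```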
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Latent Semantic Analysis</title>
<p>
          Latent Semantic Analysis (LSA) is a method of processing natural-language data. The method analyzes the relationships between a set of documents and the terms they contain, and associates factors (themes) with the documents and terms. The LSA method is based on the principles of factor analysis. As input, LSA uses a term-to-document matrix (where terms are words or phrases) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
<p>The elements of this matrix are coefficients (weights) that take into account the frequency of occurrence of each term in each document. The most common version of LSA is based on Singular Value Decomposition (SVD). Using SVD, any matrix is decomposed into a product of orthogonal and diagonal matrices, and a truncated form of this product gives an accurate approximation to the original matrix.</p>
        <p>More formally, according to the singular decomposition theorem, any real rectangular matrix can be decomposed into a
product of three matrices:</p>
<p>A = USV<sup>T</sup>,
where the matrices U and V are orthogonal and the matrix S is diagonal; the values on the diagonal of S are called the "singular values" of the matrix A, and the superscript "T" denotes the transpose of the matrix V.</p>
<p>This decomposition has an important property: if only the "k" largest singular values are retained in the matrix S, and only the columns corresponding to these values are retained in the matrices U and V, then the product of the resulting matrices is the best rank-"k" approximation Â of the original matrix A:</p>
        <p>A ≈ Â = U<sub>k</sub>S<sub>k</sub>V<sub>k</sub><sup>T</sup>.</p>
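<p>The decomposition and the rank-k truncation can be checked numerically with NumPy on a toy term-to-document matrix (the values are illustrative):</p>

```python
import numpy as np

# Toy term-to-document matrix: rows are terms, columns are documents.
A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Full reconstruction: A = U S V^T (up to floating-point error).
assert np.allclose(U @ np.diag(s) @ Vt, A)

# Rank-k approximation: keep only the k largest singular values
# and the corresponding columns of U and V.
k = 1
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_hat, 2))
```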
<p>
          The main idea of LSA is that if A is the term-to-document matrix, then the matrix Â, which contains only the first "k" linearly independent components of A, reflects the basic structure of the dependencies present in the original matrix. Proceeding from this decomposition, the dependencies between terms and documents are analyzed and the theme of the text is determined [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Syntactic Semantic Analysis</title>
<p>
          Syntactic semantic analysis is a method of processing textual information that creates templates to be matched against the words of the text. As a result, a list of pairs is created for each sentence [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Each pair includes:
• the type of the word in the sentence;
• the position of the head word for this dependent word.
        </p>
<p>It is assumed that the basic templates are formed from the most important and most frequently used semantic relations in the text. A basic semantic template is the rule by which a semantic relation is identified in the text being analyzed. A basic semantic template consists of four main parts:
• a sequence of words or indivisible semantic units together with their morphological features;
• the name of the semantic relation that should be formed if the described sequence is found in the text;
• a sequence of numbers determining the positions in the sequence whose elements should be added to the priority queue; according to this queue, words are deleted from the sentence being analyzed;
• a number indicating the priority value, i.e. the group of semantic dependencies to which this semantic relation belongs.</p>
<p>Using the basic semantic templates, a priority queue is composed. This queue stores the words that appear as the right-hand argument of a semantic collocation found in the sentence being analyzed.</p>
<p>To determine the theme of the text, the word with the largest number of dependencies is selected from each sentence according to the priority queue, and the number of its occurrences in the text is counted. The word with the maximum number of occurrences is taken as the theme of the text.</p>
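<p>A toy illustration of this selection step in Python, with hand-made (word, head position) pairs standing in for the result of template matching (all data here is invented):</p>

```python
from collections import Counter

# Each sentence is a list of (word, head_index) pairs; head_index = -1
# marks the root. In the real system these pairs come from template matching.
sentences = [
    [("book", -1), ("opens", 0), ("world", 0)],
    [("reader", 1), ("loves", -1), ("book", 1)],
    [("book", -1), ("teaches", 0)],
]

def most_depended_word(pairs):
    # Count how many words point to each position as their head,
    # then return the word at the most-referenced position.
    dep_counts = Counter(head for _, head in pairs if head >= 0)
    head_pos, _ = dep_counts.most_common(1)[0]
    return pairs[head_pos][0]

# One candidate per sentence; the most frequent candidate is the theme.
candidates = [most_depended_word(s) for s in sentences]
theme, _ = Counter(candidates).most_common(1)[0]
print(theme)  # -> book
```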
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. System Operation</title>
      <p>To conduct the research on the methods of text analysis, an automated system was developed. The system operation includes
several steps.</p>
<p>At the first step the system splits the text into elements (words or sentences, depending on the algorithm chosen by the user) and sends them for processing. The second step is processing the elements and selecting the core words.</p>
<p>If the FSA has been chosen by the user, the system compares the words of the text with the words of the dictionary and finds among them the words with the maximum number of occurrences in the text. Then it displays the result: the core words of the text. Additionally, the system displays the list of words that have not been found in the dictionary. These words can be added to the dictionary, after which the algorithm can be run anew.</p>
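<p>The dictionary comparison step described above can be sketched as follows (the dictionary and the text are invented for illustration):</p>

```python
from collections import Counter

dictionary = {"book", "world", "reader", "life"}  # invented dictionary
words = "book opens the world the book teaches the reader".split()

# Count only the words that the dictionary knows.
counts = Counter(w for w in words if w in dictionary)
core_words = [w for w, c in counts.items() if c == max(counts.values())]

# Words missing from the dictionary: candidates to add before rerunning.
unknown = sorted(set(words) - dictionary)

print(core_words)  # -> ['book']
print(unknown)     # -> ['opens', 'teaches', 'the']
```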
<p>If the algorithm based on Porter stemming has been chosen, the system determines the stems of the original words of the text and looks for the most frequent ones among them. In this way this algorithm finds the core words of the text.</p>
<p>In the case of LSA the system constructs a word-by-sentence matrix from the sentences of the text and carries out the SVD decomposition. Then only the first two columns of the resulting matrices are used. From the first two columns of the matrix V<sup>T</sup>, which correspond to the sentences, the maximum and minimum values are selected. These values give the maximum and minimum x and y coordinates on the coordinate plane. In this way an area is defined, and the words whose points, taken from the first two columns of the matrix U, fall into this area are included in the core words.</p>
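<p>A sketch of this selection rule with NumPy, assuming a small invented word-by-sentence matrix; the bounding box is taken from the first two singular components on the sentence side, and words whose first two U-coordinates fall inside it are kept:</p>

```python
import numpy as np

# Toy word-by-sentence matrix: rows are words, columns are sentences.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
words = ["book", "soul", "life", "year"]

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Sentence coordinates: the first two singular components (rows of V^T).
# Their extremes define the bounding box on the coordinate plane.
xs, ys = Vt[0], Vt[1]
x_min, x_max, y_min, y_max = xs.min(), xs.max(), ys.min(), ys.max()

# A word is a core word when its first two U-coordinates fall inside the box.
core = [w for w, (x, y) in zip(words, U[:, :2])
        if x_min <= x <= x_max and y_min <= y <= y_max]
print(core)
```

The matrix and word list are illustrative only; the actual core words depend on the input text.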
<p>If the SSA has been chosen by the user, the words of each sentence are checked against the templates. After that, a weight value is set for every word according to the template. The more dependent words a word has, the lower its weight and the higher its priority. Next, the word with the minimum weight is determined in each sentence, and the words with the largest number of occurrences form the core of the text.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>As objects for research, texts for the essays of the Unified State Examination in the Russian language were chosen. These
texts were chosen because of their simplicity and small size, and also because their themes are clearly defined.</p>
<p>In Tables 1-5 the core words for the text examples are presented. In addition, the processing time for each method of analysis is given.</p>
<p>The main idea of the first text is "the influence of mass literature on the intellectual development of a person". No method produced exactly this theme, but the most suitable core words were given by the latent semantic analysis and the Porter stemming method.</p>
<p>The main idea of this text is "the relations between man and nature". As in the previous example, the latent semantic analysis gave the most similar core words.</p>
      <p>Table 4. «Books...» (A. Yetoyev).</p>
<p>[Flattened table fragment, only partially recoverable: the frequency semantic method took about 5 seconds, and the listed core words include "soul", "soul raincoat" and "soul year raincoat".]</p>
<p>The core words given by the syntactic semantic analysis are the most similar to the theme of the fourth text, "the role of books in human life".</p>
<p>The topic of the fifth text is "the soul of a human". All algorithms gave satisfactory results, the most accurate of which was given by the Porter stemming.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
<p>In this article, methods of text classification, such as Porter stemming, syntactic semantic, frequency semantic and latent semantic analysis, are considered. The results of the analysis of texts of low complexity are given. Based on these results it can be concluded that the choice of a method for determining the topic of a text depends on the complexity of the text: the more complex the text, the more accurate the analysis should be.</p>
<p>The same applies to trivial texts. Using complex methods on simple texts leads to an unnecessary waste of time and resources, while the result is no better than that of simple algorithms. Thus, the research shows that for short texts the most effective method is the latent semantic analysis, and the fastest method is the Porter stemming. Finally, it should be mentioned that a combination of text analysis methods, for example, combining Porter stemming and frequency semantic analysis, can be appropriate for effective and accurate determination of core words.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Velichkevich</surname>
            <given-names>AG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherepackhina</surname>
            <given-names>AA</given-names>
          </string-name>
          .
          <article-title>Latent semantic analysis of text using Porter algorithm</article-title>
          .
          <source>Youth scientific and technical herald</source>
          <year>2015</year>
          ;
          <volume>10</volume>
: 38 p. (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
<string-name>
            <surname>Mikhaylov</surname>
            <given-names>DV</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozlov</surname>
            <given-names>AP</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Emelyanov</surname>
            <given-names>GM</given-names>
          </string-name>
          .
          <article-title>An approach based on tf-idf metrics to extract the knowledge and relevant linguistic means on subject-oriented text sets</article-title>
          .
          <source>Computer Optics</source>
          <year>2015</year>
          ;
          <volume>39</volume>
          (
          <issue>3</issue>
          ):
          <fpage>429</fpage>
          -
          <lpage>435</lpage>
. DOI: 10.18287/0134-2452-2015-39-3-429-438.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
[3]
          <article-title>Understanding and synthesizing text by computer</article-title>
          . URL: http://compuling.narod.ru/index2.html (11.12.16). (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
[4]
          <article-title>Russian stemming algorithm</article-title>
          . URL: http://snowball.tartarus.org/algorithms/russian/stemmer.html (11.12.16). (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Silva</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
            <given-names>C.</given-names>
          </string-name>
          <article-title>A lexicon-based stemming procedure</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          <year>2003</year>
          ;
          <volume>2721</volume>
          :
          <fpage>159</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Zaboleeva-Zotova</surname>
            <given-names>AV</given-names>
          </string-name>
          .
          <article-title>Latent semantic analysis and new solutions in Internet</article-title>
          . Moscow: Information Technologies,
          <year>2001</year>
; 22 p. (in Russian)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Kuralenok</surname>
            <given-names>I</given-names>
          </string-name>
,
          <string-name>
            <surname>Nekrest'yanov</surname>
            <given-names>I</given-names>
          </string-name>
          .
          <article-title>Automatic document classification based on latent semantic analysis</article-title>
          .
          <source>Programming and Computer Software</source>
          <year>2000</year>
          ;
          <volume>26</volume>
          (
          <issue>4</issue>
          ):
          <fpage>199</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
<string-name>
            <surname>Rabchevsky</surname>
            <given-names>EA</given-names>
          </string-name>
          .
          <article-title>Automatic construction of ontologies based on lexical-syntactic templates for information search</article-title>
          .
          <source>Petrozavodsk</source>
          ,
          <year>2009</year>
; 107 p. (in Russian)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>