The Method of fuzzy analysis of texts and their rubrics
actualization

               V V Borisov¹, M I Dli¹ and P Yu Kozlov¹

               ¹ Computer Engineering Department, The Branch of National Research University “Moscow
               Power Engineering Institute” in Smolensk, Smolensk, Russia


               Abstract. The paper proposes a method of fuzzy analysis of texts and actualization
               (updating) of their rubrics. The method is intended for analyzing short unstructured
               electronic texts under the following conditions: first, the composition and importance of
               the keywords of the rubric field are nonstationary; second, the texts are weakly structured
               or not structured at all; third, the texts contain grammatical or syntactic inaccuracies and
               errors. The method is based on an original approach that determines the degree of fuzzy
               correspondence of the words of a text according to a well-founded set of syntactic
               characteristics and then finds the degrees of fuzzy correspondence of text documents to all
               rubrics. The method also makes it possible to monitor changes and to actualize rubrics by
               checking the formulated conditions of rubric field change for the following typical
               situations: formation of additional rubrics on the “boundary” of existing rubrics, rubric
               division, creation of new rubrics, rubric exclusion, and rubric combining. The method
               raises the accuracy of analysis and the quality of text classification by using the fuzzy
               approach to account for the uncertainty of the analysis conditions and the nonstationarity
               of the thesaurus of these texts, and by operationally actualizing rubrics depending on the
               composition and importance of their keywords.


1. Introduction
With the continual advancement of internet technologies, automatic analysis of unstructured
electronic texts has become a pressing task. Such texts have the following features:

      relatively small size;
      weak or absent structure (no markup or fields for computer processing);
      presence of grammatical and syntactic inaccuracies and errors;
      uncertainty of the analysis conditions and nonstationarity of the composition and importance
       of the rubric field keywords;
      high degree of interdependence between rubrics.

   These features considerably limit the use of traditional models and methods of morphological,
syntactic and semantic analysis of texts. Moreover, well-known models and methods of knowledge
acquisition from text information do not sufficiently take into account the need for operational
rubric changes, which increases the number of errors caused by incorrect classification of the
processed texts [1–7].


    Therefore, a pressing problem is to develop a method of fuzzy analysis of unstructured electronic
texts and rubric actualization that detects the following situations requiring operational changes of
the rubric field: formation of additional rubrics on the “boundary” of already existing rubrics, rubric
division, creation of new rubrics, rubric exclusion, and rubric combining.
    The proposed method of fuzzy text analysis and rubric actualization includes the following main
stages:
    Stage 1. Specification of rubrics and representation of texts on the basis of the detected syntactic
characteristics.
    Stage 2. Analysis of texts on the basis of determining the degree of their fuzzy correspondence to
the rubrics.
    Stage 3. Checking the conditions of rubric field change and actualization of the rubric field
according to the results of this check.
    Let us consider the problems solved at each stage of the proposed method in more detail.

2. Rubric specification and text representation on the basis of the detected syntactic characteristics
On the basis of a preliminary analysis of the texts, the initial set of rubrics is specified:

$$R = \{R_j \mid j = 1..J\},$$

where for all $j = 1..J$: $R_j = \{(w_{jm}, r_{jm}) \mid m = 1..M_j\}$, $w_{jm}$ is the $m$-th word of
rubric $R_j$, and $r_{jm} \in [0, 1]$ is the degree of correspondence of the word $w_{jm}$ to rubric $R_j$.
   To represent such texts, the following set of syntactic characteristics is “unified”; the
characteristics are detected, for example, by the LinkGrammar analyzer [8]:

$$S = \{s_n \mid n = 1..N\}, \quad N = 5,$$

where $s_1$ is the root word or predicate; $s_2$ is the subject; $s_3$ is the adverbial modifier; $s_4$ is
the subject under action; $s_5$ is the predicate [9].
   The set of texts is represented in the form

$$SD = \{SD_k \mid k = 1..K\},$$

where $SD_k = \{SD_n^{(k)} \mid n = 1..N\}$ and $SD_n^{(k)}$ is the set of words of the $k$-th text that
correspond to the syntactic characteristic $s_n$.
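
The definitions of rubrics and texts given in this section map naturally onto simple data structures. The following Python sketch shows one possible representation; it is not the authors' implementation. The class names `Rubric` and `TextDoc`, the slot labels, and the sample words and degrees are illustrative assumptions.

```python
from dataclasses import dataclass, field

# Labels for the N = 5 syntactic characteristics s1..s5 (labels are illustrative).
CHARACTERISTICS = ("s1", "s2", "s3", "s4", "s5")


@dataclass
class Rubric:
    """Rubric R_j: keyword w_jm -> degree of correspondence r_jm in [0, 1]."""
    name: str
    words: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject degrees outside the unit interval required by the definition.
        for word, degree in self.words.items():
            if not 0.0 <= degree <= 1.0:
                raise ValueError(f"degree of {word!r} must lie in [0, 1]")


@dataclass
class TextDoc:
    """Text SD_k: characteristic s_n -> list of words SD_n^(k)."""
    slots: dict = field(default_factory=dict)

    def words_for(self, characteristic):
        # A characteristic that the parser did not find yields an empty list.
        return self.slots.get(characteristic, [])


# Hypothetical sample data, not taken from the experiment.
housing = Rubric("housing", {"heating": 0.9, "repair": 0.6})
message = TextDoc({"s1": ["repair"], "s2": ["heating"]})
```

Keeping the words of each syntactic characteristic in a separate list mirrors the sets $SD_n^{(k)}$, so the later per-characteristic averaging can iterate over `CHARACTERISTICS` directly.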

3. Text analysis on the basis of determining the degree of fuzzy correspondence of texts to
the rubrics
First, the degrees of fuzzy correspondence $\mu_{R_j n}(SD_n^{(k)}) \in [0, 1]$ of the word sets
$SD_n^{(k)}$ of each syntactic characteristic to all rubrics are determined:

$$\forall j \in J:\quad \mu_{R_j n}\left(SD_n^{(k)}\right) = \frac{1}{L_n^{(k)}} \sum_{p=1}^{L_n^{(k)}} u_{jp}^{(k)}, \quad n = 1..N,$$

where $L_n^{(k)}$ is the number of words in $SD_n^{(k)}$ and $u_{jp}^{(k)}$ is the degree of correspondence
of the $p$-th word from $SD_n^{(k)}$, initially specified for this word in rubric $R_j$.
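
As a hedged illustration, the averaging in this formula can be sketched in a few lines of Python. The function name `slot_degree` and the sample data are hypothetical. Two points are assumptions that the text does not state: a word absent from the rubric is given degree 0, and an empty word set yields degree 0.

```python
def slot_degree(words, rubric_words):
    """Mean degree u_jp over the words of one syntactic characteristic."""
    if not words:
        return 0.0  # assumption: an empty word set contributes no correspondence
    # Assumption: words that the rubric does not list have degree 0.
    return sum(rubric_words.get(word, 0.0) for word in words) / len(words)


# Hypothetical rubric keywords with their degrees r_jm.
rubric_words = {"heating": 0.9, "repair": 0.6}
degree = slot_degree(["heating", "repair", "tomorrow"], rubric_words)
# degree is (0.9 + 0.6 + 0.0) / 3, i.e. about 0.5
```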
   To determine the degree of fuzzy correspondence of a text to the rubrics, let us introduce the
parameter $\rho(SD_k, R_j)$, which characterizes the degree of fuzzy correspondence of text $SD_k$ to
rubric $R_j$:

$$\rho(SD_k, R_j) = 1 - \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left(\mu_{\tilde{R}_j}(R_{jn}) - \mu_{R_j n}\left(SD_n^{(k)}\right)\right)^2},$$


where $\tilde{R}_j = \{\mu_{\tilde{R}_j}(R_{jn}) / s_n\}$ is a fuzzy set characterizing the “accurate
coordinates” of rubric $R_j$ [10]. For the case under consideration,
$\tilde{R}_j = \{1/s_1, 1/s_2, 1/s_3, 1/s_4, 1/s_5\}$, i.e.

$$\forall j \in J:\quad \rho_{\tilde{1}}(SD_k, R_j) = 1 - \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left(1 - \mu_{R_j n}\left(SD_n^{(k)}\right)\right)^2}.$$

   Text $SD_k$ belongs to the greatest degree to the rubric $R_l$ for which the degree of
correspondence is maximal:

$$R_l:\ \max_{j = 1..J} \rho_{\tilde{1}}(SD_k, R_j).$$
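
The choice of the best rubric can be illustrated with a short Python sketch. It assumes that the degree of correspondence is one minus the root-mean-square deviation of the per-characteristic degrees from 1; this is a reading of the formula, not a confirmed implementation. The function names and the sample degrees are hypothetical.

```python
import math


def rho_1(mu):
    """1 - sqrt(mean((1 - mu_n)^2)) over the N syntactic characteristics."""
    return 1.0 - math.sqrt(sum((1.0 - m) ** 2 for m in mu) / len(mu))


def best_rubric(mu_by_rubric):
    """Return (rubric name, degree) for the rubric with the largest rho_1."""
    scores = {name: rho_1(mu) for name, mu in mu_by_rubric.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical per-characteristic degrees (s1..s5) of one text for two rubrics.
mu_by_rubric = {
    "housing": [0.8, 0.6, 1.0, 0.4, 0.7],
    "education": [0.1, 0.0, 0.2, 0.0, 0.1],
}
name, degree = best_rubric(mu_by_rubric)  # name == "housing"
```

A text whose degrees all equal 1 gets `rho_1` equal to 1, and a text whose degrees all equal 0 gets 0, so the value stays in the unit interval.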


4. Checking the conditions of rubric field change and actualization of the rubric field in
accordance with the results of this check
To check the conditions of rubric field change, let us additionally introduce the following
parameters:

$$\forall j \in J:\quad \rho_{\tilde{0.5}}(SD_k, R_j) = 1 - \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left(0.5 - \mu_{R_j n}\left(SD_n^{(k)}\right)\right)^2},$$

$$\forall j \in J:\quad \rho_{\tilde{0}}(SD_k, R_j) = 1 - \rho_{\tilde{1}}(SD_k, R_j),$$

where the parameter $\rho_{\tilde{0.5}}(SD_k, R_j)$ characterizes the degree of uncertainty in referring
text $SD_k$ to rubric $R_j$, and the parameter $\rho_{\tilde{0}}(SD_k, R_j)$ characterizes the degree of
discrepancy between text $SD_k$ and rubric $R_j$.
   At this stage, the parameters $\rho_{\tilde{1}}(SD_k, R_j)$, $\rho_{\tilde{0.5}}(SD_k, R_j)$ and
$\rho_{\tilde{0}}(SD_k, R_j)$ are calculated for all texts and analyzed. According to the results of this
analysis, on the basis of the conditions given below, the composition and structure of the rubric field
are revised.
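
A minimal Python sketch of how the three parameters could be computed for one text and one rubric follows. It assumes the root-mean-square form of the deviation from the reference levels 1 and 0.5; the function names are hypothetical and the input is an illustrative vector of per-characteristic degrees.

```python
import math


def closeness(mu, level):
    """1 - sqrt(mean((level - mu_n)^2)) for a reference level in [0, 1]."""
    return 1.0 - math.sqrt(sum((level - m) ** 2 for m in mu) / len(mu))


def parameters(mu):
    """Return (correspondence, uncertainty, discrepancy) for one text and rubric."""
    rho_1 = closeness(mu, 1.0)    # degree of correspondence
    rho_05 = closeness(mu, 0.5)   # degree of uncertainty
    rho_0 = 1.0 - rho_1           # degree of discrepancy
    return rho_1, rho_05, rho_0


# A text whose degrees are all 0.5 is maximally uncertain.
result = parameters([0.5] * 5)  # -> (0.5, 1.0, 0.5)
```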
    Let us consider the formulated detection conditions and the rules for revising the composition and
structure of the rubric field for the following basic situations: additional rubric formation, rubric
division, new rubric creation, rubric exclusion, and rubric combining.

4.1. Additional rubric formation
The basis for forming an additional rubric on the “boundary” of the already existing rubrics $R_i$ and
$R_j$ is the detection of a considerable number of texts (equal to or greater than the number of
rubrics), for each of which the following condition holds:
                                      1%( SDk , Ri )      1%( SDk , R j )   
                                      0,5
                                       ± ( SDk , Ri )     0,5
                                                              ± ( SDk , R j )   

                                     0%( SDk , Ri )      0%( SDk , R j )   

                                      
               Rl  R, l  i  j : 1%( SDk , Rl )     0,5
                                                            ± ( SDk , Rl )     0%( SDk , Rl )   ,          
where α and β are the lower and upper boundary values (usually α = 0.4 and β = 0.7 [11]) that
determine whether revision of the rubric field is reasonable.
   When the number of texts for which the above condition holds is equal to or greater than the
number of rubrics, it is concluded that forming an additional “boundary” rubric is reasonable.
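
The counting rule in this paragraph can be sketched in Python as follows. The condition itself is passed in as a caller-supplied predicate. The function name, the sample parameter values and the placeholder predicate are hypothetical; the placeholder is not the condition of this subsection and only serves to exercise the counting.

```python
def revision_is_reasonable(texts, condition, rubric_count):
    """Return True when at least `rubric_count` texts satisfy `condition`."""
    matching = sum(1 for text in texts if condition(text))
    return matching >= rubric_count


BETA = 0.7  # the usual value of the boundary beta quoted in the text

# Hypothetical per-text parameters; only one value per text is shown.
texts = [{"rho_05": 0.75}, {"rho_05": 0.8}, {"rho_05": 0.3}, {"rho_05": 0.9}]


def placeholder(text):
    # Placeholder predicate only; a real one would encode the alpha/beta
    # inequalities of the situation being checked.
    return text["rho_05"] >= BETA


decision = revision_is_reasonable(texts, placeholder, rubric_count=3)
# Three of the four sample texts satisfy the placeholder, so decision is True.
```

Separating the counting from the predicate lets the same helper serve each situation of this section, since only the per-text condition differs between them.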


4.2. Rubric division
The basis for dividing rubric $R_j$ is the detection of a considerable number of texts, for each of
which the following condition is fulfilled:
                    1%( SDk , R j )    0,5
                                              ± ( SDk , R j )       0%( SDk , R j )   


                                 
               Rl  R, l  j : 1%( SDk , Rl )    0,5
                                                       ± ( SDk , Rl )     0%( SDk , Rl )   ,
where $j$ is the index of the rubric being divided.

4.3. New rubric creation
The basis for creating a new rubric is the detection of a considerable number of texts, for each of
which the following condition is fulfilled:

                             
                   Rl  R : 1%( SDk , Rl )    0,5                                       
                                                    ± ( SDk , Rl )     0%( SDk , Rl )   .


4.4. Rubric exclusion
The basis for excluding a rubric is the detection of a considerable number of texts, for each of
which the following condition is fulfilled:
                        1%(SDk , R j )    0,5
                                               ± ( SDk , R j )    0%( SDk , R j )   .


4.5. Rubrics combining
The basis for combining rubrics $R_i$ and $R_j$ is the detection of a considerable number of texts, for
each of which the following condition is fulfilled:
                                         1%( SDk , Ri )    1%( SDk , R j )   
                                     0,5
                                      ± ( SDk , Ri )     0,5
                                                             ± ( SDk , R j )   

                                         0%( SDk , Ri )    0%( SDk , R j )   

                                     
             Rl  R, l  i  j : 1%( SDk , Rl )     0,5
                                                          ± ( SDk , Rl )     0%( SDk , Rl )   , 
where $R_i$ and $R_j$ are the rubrics being combined.

5. Experimental results
The proposed method was used in the Administration of the Smolensk region for automated analysis of
unstructured electronic text documents. It provided operational actualization of rubrics depending on
the structure and parameters of the text documents under conditions of a nonstationary thesaurus
composition and changing importance of the rubric keywords. Automated rubrication was performed
for 5062 messages received in 2016–2017 through the internet portal and by electronic mail. The
analysis revealed 17 different interconnected rubrics, including general issues of society and politics,
separation of powers and duties in the Administration, social sphere, education, family, culture, and
housing and communal services. Accounting for rubric dynamics, with the probabilistic text
classification algorithm used as the basic analysis tool [4, 6], made it possible to reduce the share of
erroneously rubricated texts to 13.3 % overall.

6. Conclusion
The proposed method was applied in the Administration of the Smolensk region to the automated
analysis of unstructured electronic text documents, where it provided operational actualization of
rubrics depending on the structure and parameters of the text documents under conditions of a
nonstationary thesaurus composition and changing importance of the rubric keywords. As a result, the
share of erroneously rubricated texts was reduced to 13.3 % on average.

7. References
[1] Ageev M S, Dobrov B V and Lukashevich N V 2008 Automatic Text Rubrication: Methods and
      Problems, Scientific notes of Kazan State University. Vol 150. No. 4. pp 25–40. (in Russian)
[2] Dumais S, Platt J, Heckerman D and Sahami M 1998 Inductive Learning Algorithms and
      Representations for Text Categorization, Proc. Int. Conf. on Inform. and Knowledge Manage.
      pp 148–155.
[3] Yang Y and Liu X 1999 A Re-examination of Text Categorization Methods, Proc. of Int ACM
      Conf. on Research and Development in Information Retrieval (SIGIR-99). pp 42–49.
[4] Zaboleeva-Zotova A V, Petrovsky A B, Orlova Yu A and Shitova T A 2016 Automated
      Analysis of News Texts Themes, Int. J. Information Content and Processing. Vol. 3. No. 3. pp
      288–299. (in Russian)
[5] Kozlov P Yu 2017 Methods of Automated Analysis of Short Nonstructural Text Documents,
      Software products and systems No. 1. pp 100–106. (in Russian)
[6] Borisov V V, Dli M I and Kozlov P Yu 2017 Intellectual Methods of Nonstructural Texts
      (Smolensk: Universum). p 156 ISBN 978-5-91412-364-9 (in Russian)
[7] Uchitelev N V 2013 Classification of Text Information with the Help of SVM, Information
      technologies and systems. No. 1. pp 335–340. (in Russian).
[8] Sajadi A and Borujerdi M 2013 Machine Translation Based on Unification Link Grammar,
      Journal of Artificial Intelligence Review pp 109–132. DOI: 10.1007/s10462-011-9261-7.
[9] Protasov S Link Grammar (Electronic materials). – http://sz.ru/parser/doc/ (Accessed July,
      2018).
[10] Borisov V V, Fedulov A S and Zernov M M 2014 The Base of the Fuzzy Sets Theory, The Base
      of Fuzzy Mathematics Series book 1 (Moscow: Hot line–Telecom). p 88 (in Russian)
[11] Gimarov V A 2004 Methods and Automated Systems of Dynamic Classification of Complex
      Technogenic Objects, Synopsis of a thesis paper of Dr.Tech.Sc. (Moscow) (in Russian)

Acknowledgments
This work was supported by the Russian Foundation for Basic Research (project 18-01-00558).

