The Method of fuzzy analysis of texts and their rubrics actualization

V V Borisov1, M I Dli1 and P Yu Kozlov1

1 Computer Engineering Department, The Branch of National Research University “Moscow Power Engineering Institute” in Smolensk, Smolensk, Russia

Abstract. This work presents a method for the fuzzy analysis of texts and the actualization (updating) of their rubrics. The method is intended for short, unstructured electronic texts under the following conditions: first, the composition and importance of the keywords of the rubric field are nonstationary; second, the texts are weakly structured or not structured at all; third, the texts contain grammatical and syntactic inaccuracies and errors. The method rests on an original approach: the degrees of fuzzy correspondence of the words of a text are identified over a well-founded set of syntactic characteristics, and the degrees of fuzzy correspondence of the text documents to all rubrics are then found. The method also supports monitoring of changes and actualization of rubrics by checking formulated conditions of rubric field change for the following typical situations: formation of additional rubrics on the “boundary” of existing rubrics, rubric division, creation of new rubrics, rubric exclusion, and rubric combining. The method raises the accuracy of analysis and the quality of text classification in two ways: the fuzzy approach accounts for the uncertainty of the analysis conditions and the nonstationarity of the thesaurus of the texts, and the rubrics are actualized promptly depending on the composition and importance of their keywords.

1.
Introduction

With the continuous improvement of internet technologies, the automatic analysis of unstructured electronic texts has become a topical task. Such texts have the following features:
• relatively small size;
• weak structure or no structure at all (no markup or fields for computer processing);
• grammatical and syntactic inaccuracies and errors;
• uncertainty of the analysis conditions and nonstationarity of the composition and importance of the keywords of the rubric field;
• a high degree of interdependency between rubrics.

These features considerably limit the use of traditional models and methods of morphological, syntactic and semantic text analysis. Moreover, the well-known models and methods of knowledge acquisition from text information do not sufficiently account for the need for prompt rubric changes, which increases the number of errors caused by wrong classification of the processed texts [1–7].

Therefore, the topical problem is to develop a method of fuzzy analysis of unstructured electronic texts and actualization of rubrics that detects the following situations requiring prompt changes of the rubric field: formation of additional rubrics on the “boundary” of the existing rubrics, rubric division, creation of new rubrics, rubric exclusion, and rubric combining.

The proposed method of fuzzy text analysis and rubric actualization includes the following main stages:
Stage 1. Rubric specification and text presentation on the basis of the detected syntactic characteristics.
Stage 2. Text analysis on the basis of determining the degree of fuzzy correspondence of the texts to the rubrics.
Stage 3. Checking the conditions of rubric field change and actualizing the rubric field according to the results of this check.

Let us consider the problems solved at each stage of the method in more detail.

2.
Rubric specification and text presentation on the basis of the detected syntactic characteristics

On the basis of a preliminary analysis of the texts, the initial set of rubrics is specified:

$$R = \{R_j \mid j = 1..J\},$$

where, for all $j = 1..J$,

$$R_j = \{\langle w_{jm}, r_{jm} \rangle \mid m = 1..M_j\},$$

$w_{jm}$ is the $m$-th word of rubric $R_j$, and $r_{jm} \in [0, 1]$ is the degree of correspondence of the word $w_{jm}$ to rubric $R_j$.

To present the texts, the set of syntactic characteristics detected, for example, by the LinkGrammar analyzer [8] is “unified”:

$$S = \{s_n \mid n = 1..N\},$$

here $N = 5$, where $s_1$ is the root word or predicate; $s_2$ is the subject; $s_3$ is the adverbial modifier; $s_4$ is the subject under action; $s_5$ is the predicate [9].

The set of texts is presented in the form

$$SD = \{SD_k \mid k = 1..K\},$$

where $SD_k = \{SD_n^{(k)} \mid n = 1..N\}$ and $SD_n^{(k)}$ is the set of words of the $k$-th text that correspond to the syntactic characteristic $s_n$.

3. Text analysis on the basis of identifying the degree of fuzzy correspondence of the texts to the rubrics

First, the degrees of fuzzy correspondence $\mu_{R_j n}(SD_n^{(k)}) \in [0, 1]$ of the word sets $SD_n^{(k)}$ of each syntactic characteristic to all rubrics are determined:

$$\forall j \in J,\quad \mu_{R_j n}\bigl(SD_n^{(k)}\bigr) = \frac{1}{L_n^{(k)}} \sum_{p=1}^{L_n^{(k)}} u_{jp}^{(k)},\quad n = 1..N,$$

where $u_{jp}^{(k)}$ is the degree of correspondence of the $p$-th word of $SD_n^{(k)}$, initially specified for this word in rubric $R_j$, and $L_n^{(k)}$, the upper limit of the sum, is the number of words in $SD_n^{(k)}$.

To determine the degree of fuzzy correspondence of a text to the rubrics, let us introduce the parameter $\mu(SD_k, R_j)$ characterizing the degree of fuzzy correspondence of text $SD_k$ to rubric $R_j$:

$$\mu(SD_k, R_j) = 1 - \sqrt{\frac{1}{N} \sum_{n=1}^{N} \Bigl(\mu_{\tilde{R}_j}(R_{jn}) - \mu_{R_j n}\bigl(SD_n^{(k)}\bigr)\Bigr)^2},$$

where $\tilde{R}_j = \{\mu_{\tilde{R}_j}(R_{jn}) / s_n\}$ is the fuzzy set characterizing the “accurate coordinates” of rubric $R_j$ [10].

For the case under consideration, $\tilde{R}_j = \{1/s_1, 1/s_2, 1/s_3, 1/s_4, 1/s_5\}$, i.e.

$$\forall j \in J,\quad \mu_{\tilde{1}}(SD_k, R_j) = 1 - \sqrt{\frac{1}{N} \sum_{n=1}^{N} \Bigl(1 - \mu_{R_j n}\bigl(SD_n^{(k)}\bigr)\Bigr)^2}.$$

Text $SD_k$ belongs, to the greatest degree, to the rubric $R_l$ for which the degree of correspondence is maximal:

$$R_l:\ \max_{j = 1..J} \mu_{\tilde{1}}(SD_k, R_j).$$

4. Checking the conditions of rubric field change and actualizing the rubric field according to the results of this check

To check the conditions of rubric field change, let us additionally introduce the following parameters:

$$\forall j \in J,\quad \mu_{\widetilde{0.5}}(SD_k, R_j) = 1 - \sqrt{\frac{1}{N} \sum_{n=1}^{N} \Bigl(0.5 - \mu_{R_j n}\bigl(SD_n^{(k)}\bigr)\Bigr)^2},$$

$$\forall j \in J,\quad \mu_{\tilde{0}}(SD_k, R_j) = 1 - \mu_{\tilde{1}}(SD_k, R_j),$$

where the parameter $\mu_{\widetilde{0.5}}(SD_k, R_j)$ characterizes the degree of uncertainty with which text $SD_k$ belongs to rubric $R_j$, and the parameter $\mu_{\tilde{0}}(SD_k, R_j)$ characterizes the degree of discrepancy between text $SD_k$ and rubric $R_j$.

At this stage the parameters $\mu_{\tilde{1}}(SD_k, R_j)$, $\mu_{\widetilde{0.5}}(SD_k, R_j)$ and $\mu_{\tilde{0}}(SD_k, R_j)$ are calculated for all texts and analyzed. According to the results of this analysis, and on the basis of the conditions given below, the composition and structure of the rubric field are revised.

Let us consider the formulated detection conditions and the rules for revising the composition and structure of the rubric field for the following basic situations: additional rubric formation, rubric division, new rubric creation, rubric exclusion, and rubric combining.

4.1.
Additional rubric formation

The basis for forming an additional rubric on the “boundary” of the existing rubrics $R_i$ and $R_j$ is the detection of a considerable number of texts (equal to or greater than the number of rubrics), for each of which the following condition holds:

$$\mu_{\tilde{1}}(SD_k, R_i)\ \ldots\ \mu_{\tilde{1}}(SD_k, R_j)\ \ldots\ \mu_{\widetilde{0.5}}(SD_k, R_i)\ \ldots\ \mu_{\widetilde{0.5}}(SD_k, R_j)\ \ldots\ \mu_{\tilde{0}}(SD_k, R_i)\ \ldots\ \mu_{\tilde{0}}(SD_k, R_j)\ \ldots\ \forall R_l \in R,\ l \ne i \ne j:\ \mu_{\tilde{1}}(SD_k, R_l)\ \ldots\ \mu_{\widetilde{0.5}}(SD_k, R_l)\ \ldots\ \mu_{\tilde{0}}(SD_k, R_l)\ \ldots,$$

where α and β are the lower and upper boundary values (usually α = 0.4 and β = 0.7 [11]) that define whether a revision of the rubric field is reasonable.

When the number of texts satisfying the above condition is equal to or greater than the number of rubrics, it is concluded that an additional “boundary” rubric is reasonable.

4.2. Rubric division

The basis for dividing rubric $R_j$ is the detection of a considerable number of texts for each of which the following condition holds:

$$\mu_{\tilde{1}}(SD_k, R_j)\ \ldots\ \mu_{\widetilde{0.5}}(SD_k, R_j)\ \ldots\ \mu_{\tilde{0}}(SD_k, R_j)\ \ldots\ \forall R_l \in R,\ l \ne j:\ \mu_{\tilde{1}}(SD_k, R_l)\ \ldots\ \mu_{\widetilde{0.5}}(SD_k, R_l)\ \ldots\ \mu_{\tilde{0}}(SD_k, R_l)\ \ldots,$$

where $j$ is the index of the divided rubric.

4.3. New rubric creation

The basis for creating a new rubric is the detection of a considerable number of texts for each of which the following condition holds:

$$\forall R_l \in R:\ \mu_{\tilde{1}}(SD_k, R_l)\ \ldots\ \mu_{\widetilde{0.5}}(SD_k, R_l)\ \ldots\ \mu_{\tilde{0}}(SD_k, R_l)\ \ldots.$$

4.4. Rubric exclusion

The basis for excluding a rubric is the detection of a considerable number of texts for each of which the following condition holds:

$$\mu_{\tilde{1}}(SD_k, R_j)\ \ldots\ \mu_{\widetilde{0.5}}(SD_k, R_j)\ \ldots\ \mu_{\tilde{0}}(SD_k, R_j)\ \ldots.$$

4.5.
Rubrics combining

The basis for combining rubrics $R_i$ and $R_j$ is the detection of a considerable number of texts for each of which the following condition holds:

$$\mu_{\tilde{1}}(SD_k, R_i)\ \ldots\ \mu_{\tilde{1}}(SD_k, R_j)\ \ldots\ \mu_{\widetilde{0.5}}(SD_k, R_i)\ \ldots\ \mu_{\widetilde{0.5}}(SD_k, R_j)\ \ldots\ \mu_{\tilde{0}}(SD_k, R_i)\ \ldots\ \mu_{\tilde{0}}(SD_k, R_j)\ \ldots\ \forall R_l \in R,\ l \ne i \ne j:\ \mu_{\tilde{1}}(SD_k, R_l)\ \ldots\ \mu_{\widetilde{0.5}}(SD_k, R_l)\ \ldots\ \mu_{\tilde{0}}(SD_k, R_l)\ \ldots,$$

where $R_i$ and $R_j$ are the rubrics being combined.

5. Experimental results

The proposed method was used in the Administration of Smolensk region for the automated analysis of unstructured electronic text documents. It provided prompt actualization of rubrics depending on the structure and parameters of the text documents, under a nonstationary composition of the thesaurus and changing importance of the rubric keywords.

Automated rubrication was performed for 5062 messages received in 2016–2017 through the internet portal and by electronic mail. The analysis revealed 17 different interconnected rubrics, including general issues of society and politics, separation of powers and duties in the Administration, social sphere, education, family, culture, and housing and communal services.

The rubrication results showed that dynamic accounting of rubrics, with the probabilistic classification algorithm for text information as the basic analysis tool [4, 6], reduced the share of erroneously rubricated texts to 13.3 % overall.

6. Conclusion

The proposed method was used in the Administration of Smolensk region for the automated analysis of unstructured electronic text documents, where it provided prompt actualization of rubrics depending on the structure and parameters of the text documents, under a nonstationary composition of the thesaurus and changing importance of the rubric keywords.
As a result, the share of erroneously rubricated texts was reduced to 13.3 % on average.

7. References

[1] Ageev M S, Dobrov B V and Lukashevich N V 2008 Automatic Text Rubrication: Methods and Problems, Scientific notes of Kazan State University. Vol. 150. No. 4. pp 25–40. (in Russian)
[2] Dumais S, Platt J, Heckerman D and Sahami M 1998 Inductive Learning Algorithms and Representations for Text Categorization, Proc. Int. Conf. on Inform. and Knowledge Manage. pp 148–155.
[3] Yang Y and Liu X 1999 A Re-examination of Text Categorization Methods, Proc. of Int ACM Conf. on Research and Development in Information Retrieval (SIGIR-99). pp 42–49.
[4] Zaboleeva-Zotova A V, Petrovsky A B, Orlova Yu A and Shitova T A 2016 Automated Analysis of News Texts Themes, Int. J. Information Content and Processing. Vol. 3. No. 3. pp 288–299. (in Russian)
[5] Kozlov P Yu 2017 Methods of Automated Analysis of Short Nonstructural Text Documents, Software products and systems. No. 1. pp 100–106. (in Russian)
[6] Borisov V V, Dli M I and Kozlov P Yu 2017 Intellectual Methods of Nonstructural Texts (Smolensk: Universum). p 156 ISBN 978-5-91412-364-9 (in Russian)
[7] Uchitelev N V 2013 Classification of Text Information with the Help of SVM, Information technologies and systems. No. 1. pp 335–340. (in Russian)
[8] Sajadi A and Borujerdi M 2013 Machine Translation Based on Unification Link Grammar, Journal of Artificial Intelligence Review. pp 109–132. DOI: 10.1007/s10462-011-9261-7.
[9] Protasov S Link Grammar (Electronic materials). http://sz.ru/parser/doc/ (Accessed July, 2018).
[10] Borisov V V, Fedulov A S and Zernov M M 2014 The Base of the Fuzzy Sets Theory, The Base of Fuzzy Mathematics Series book 1 (Moscow: Hot line–Telecom). p 88 (in Russian)
[11] Gimarov V A 2004 Methods and Automated Systems of Dynamic Classification of Complex Technogenic Objects, Synopsis of a thesis paper of Dr.Tech.Sc. (Moscow) (in Russian)

Acknowledgments

The work was supported by the Russian Foundation for Basic Research (project 18-01-00558).
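
The calculations of Sections 3 and 4 (the degree of correspondence of a text to a rubric, the degree of uncertainty, the degree of discrepancy, and the choice of the rubric with the maximal correspondence) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the representation of a rubric as a dictionary from word to degree, the zero degree assigned to words absent from a rubric and to empty syntactic slots, and the toy data are all hypothetical. The square root in `closeness` assumes the normalized Euclidean reading of the correspondence formulas.

```python
import math


def slot_degree(words, rubric):
    """Average degree of the words of one syntactic slot with respect to one rubric.

    Assumption (not from the paper): a word missing from the rubric,
    or an empty slot, contributes a degree of 0.
    """
    if not words:
        return 0.0
    return sum(rubric.get(w, 0.0) for w in words) / len(words)


def closeness(degrees, target):
    """1 minus the normalized Euclidean distance between slot degrees and `target`."""
    mean_sq = sum((target - d) ** 2 for d in degrees) / len(degrees)
    return 1.0 - math.sqrt(mean_sq)


def correspondence(text, rubric):
    """Return (correspondence, uncertainty, discrepancy) of a text to a rubric."""
    degrees = [slot_degree(slot, rubric) for slot in text]
    mu_1 = closeness(degrees, 1.0)    # distance to the "accurate coordinates" (all 1)
    mu_05 = closeness(degrees, 0.5)   # same formula with 0.5 in place of 1
    mu_0 = 1.0 - mu_1                 # discrepancy is the complement of correspondence
    return mu_1, mu_05, mu_0


def classify(text, rubrics):
    """Name of the rubric with the maximal degree of correspondence."""
    return max(rubrics, key=lambda name: correspondence(text, rubrics[name])[0])


# Hypothetical toy data: two rubrics and one text split into N = 5 syntactic slots.
rubrics = {
    "education": {"school": 0.9, "teacher": 0.8, "needs": 0.3},
    "housing": {"heating": 0.9, "repair": 0.7, "needs": 0.3},
}
text = [["needs"], ["school"], ["urgently"], ["teacher"], ["needs"]]

for name, rubric in rubrics.items():
    print(name, [round(v, 3) for v in correspondence(text, rubric)])
print("assigned rubric:", classify(text, rubrics))
```

With these three values available for every text and rubric, the conditions of Section 4 would be checked by comparing them with the boundary values α and β and counting the texts that satisfy each condition.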