Method for Documents Rubrication and Analysis Based on Fuzzy Relations of Difference between Their Syntactical Characteristics

V. Borisov, The Branch of National Research University "Moscow Power Engineering Institute", Smolensk, Russia

M. Dli (midli@imail.ru), The Branch of National Research University "Moscow Power Engineering Institute", Smolensk, Russia

P. Kozlov, The Branch of National Research University "Moscow Power Engineering Institute", Smolensk, Russia


The paper formulates the problem and proposes a method for the rubrication and analysis of electronic unstructured documents. Applying the proposed method yields a tree structure of the rubric field based on fuzzy relations of difference between the syntactical characteristics of the rubricated documents. Document analysis is based on determining the fuzzy correspondence of a document, according to its syntactical characteristics, to the centers of the detected clusters, sequentially from the root to the leaves of the constructed fuzzy decision tree. Computational experiments have shown that the proposed method reduces the number of erroneously rubricated documents in comparison with probabilistic and neural network methods.

Introduction

The "Electronic government" program provides for the dynamic introduction of information and communication technologies into the activities of public authorities. Its main goal is to increase the efficiency of public administration and to develop partnerships with civil society and business.

A key task of program implementation is to develop Internet services, which provide information support and a variety of services in electronic form. Their use can improve the quality and accessibility of state and municipal services to citizens and businesses, reduce the cost of their provision and increase the labor productivity in institutions of government at various levels.

One of the ways to use information and communication technology to solve this task is to automate the process of analyzing electronic appeals (applications, complaints, suggestions) of individuals and legal entities arriving at official websites and portals of authorities and local self-government.

Text rubrication plays an important role in the automatic analysis of incoming electronic appeals. It consists in distributing the appeals among thematic rubrics that define the areas of activity of the departments responsible for processing them and preparing the corresponding responses.

Today, there are many methodological approaches to the classification of documents of various types. The choice of a specific method is directly determined by the characteristics of the rubrication objects (i.e. documents received by public authorities).

The analysis has revealed the following specific characteristics of electronic documents received on official websites and portals of public authorities, which must be taken into account when choosing a rubrication method:

- relatively small size of electronic documents, which impedes their statistical analysis;
- absence of markup in these documents, which complicates the procedures for highlighting the structure and extracting the information relevant to the analysis;
- presence of grammatical and syntactical errors in electronic messages, which entails the need for additional processing;
- nonstationarity of the thesaurus (the composition and relevance of the rubric words);
- dynamic changes of the legislative and regulatory framework, which can change the distribution of tasks between departments;
- description of several problems in one message (answers can be prepared by several specialists or even several departments).

These features significantly limit the applicability of methods based on the probabilistic and statistical approach to rubric generation and electronic text analysis [1,6,27].

These considerations make it urgent to develop a new method for rubricating electronic unstructured documents that takes into account the specific features of text messages received on official websites and portals of public authorities.

Related works

At present, there are a variety of methods, models and algorithms for the classification of text documents written in natural language. However, each of them has its applicability conditions determined by the statement of the rubrication problem.

It was shown in articles [10,11,12] that the choice of a specific classification (rubrication) method is determined by such characteristics as the size of the analyzed document, the degree of rubric thesaurus intersection and the amount of accumulated statistical information.

Machine learning is a well-known approach to classifying unstructured documents. It offers the use of artificial intelligence methods that can learn from a set of precedents.

Artificial neural networks are one of the machine learning methods that have been successfully used to solve various classification problems. Their application to text classification is addressed in [5,17,20,21,26]. The main limitation of this approach is that training requires a large amount of statistical data.

Another machine learning method that can be used to classify text documents is fuzzy decision trees. They are based on learning by examples, while the rules are presented in the form of a hierarchical sequential structure. The issues of using fuzzy decision trees are considered in the works [2,9,13,15,16,22,23,25,26].

Statement of the rubrication problem

Initial data

1. For the formalized presentation of electronic unstructured documents (EUD), a "unification" of the set of syntactical characteristics is performed in advance. These characteristics are selected by a classical analyzer (parser), for example, LinkGrammar [24]:

$$S = \{ s_n \mid n = 1..N \},$$

where for the typical case $N = 5$; $s_1$ is the root word or the predicate; $s_2$ is the subject; $s_3$ is the adverbial modifier; $s_4$ is the object under the action; $s_5$ is the predicate.

2. There is a set of EUD

$$V = \{ V_k \mid k = 1..K \},$$

in which every document $V_k$ is presented by a set of its relevant words:

$$\forall k = 1..K: \quad V_k = \{ v^{(k)}_{l_k} \mid l_k = 1..L_k \},$$

where $v^{(k)}_{l_k}$ is a relevant word of the EUD and $L_k$ is the number of words in the $k$-th EUD.

3. The set of EUD $V$ is presented as a set $SD$ of formalized documents:

$$SD = \{ SD_k \mid k = 1..K \},$$

in which a formalized document $SD_k$ corresponds to each EUD:

$$\forall k = 1..K: \quad SD_k = \{ SD^{(k)}_n \mid n = 1..N \},$$

where $SD^{(k)}_n$ is the set of words from the EUD $V_k$ corresponding to the syntactical characteristic $s_n$ [7].

Required

To propose a method for the rubrication and analysis of EUD based on hierarchical clustering that uses fuzzy relations between the syntactical characteristics of the rubricated documents.
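The formalized representation above can be illustrated with a minimal Python sketch. The characteristic labels, the `FormalizedDocument` class and the `formalize` helper are hypothetical names introduced only for illustration; the input is assumed to be a list of (word, characteristic) pairs such as a parser could produce, and it does not reproduce the actual LinkGrammar output format.

```python
from dataclasses import dataclass, field

# Illustrative labels for the characteristics s_1..s_5 listed in the text.
CHARACTERISTICS = ("root", "subject", "adverbial_modifier", "object", "predicate")


@dataclass
class FormalizedDocument:
    """SD_k: for every characteristic s_n, the set of words SD_n^(k)."""
    words_by_characteristic: dict = field(default_factory=dict)


def formalize(tagged_words):
    """Group (word, characteristic) pairs into a formalized document.

    Pairs whose characteristic is not in CHARACTERISTICS are ignored.
    """
    doc = FormalizedDocument({s: set() for s in CHARACTERISTICS})
    for word, characteristic in tagged_words:
        if characteristic in doc.words_by_characteristic:
            doc.words_by_characteristic[characteristic].add(word.lower())
    return doc


example = formalize([("Repair", "root"), ("road", "object"), ("please", "other")])
```

In this sketch every characteristic always has an entry, possibly empty, so documents can later be compared characteristic by characteristic.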

Method description

The proposed method for rubrication and analysis of EUD includes the steps discussed below.

Step 1. To give the parameters to determine the degree of correspondence for formalized documents according to the syntactical characteristics.

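For each formalized document $SD_k$ ($k = 1..K$), a set of parameter values

$$\widetilde{SD}_k = \{ \mu(SD^{(k)}_n) / s_n \mid n = 1..N \}$$

is specified to assess the degree of its correspondence to each syntactical characteristic; here $\mu(SD^{(k)}_n)$ denotes the degree of correspondence of the document to the characteristic $s_n$.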

Step 2. To determine the degree of difference between all pairs of formalized documents according to all syntactical characteristics.

Consider a pair of documents $SD_k$ and $SD_l$, $k, l = 1..K$:

$$SD_k = \{ SD^{(k)}_n \mid n = 1..N \} \quad \text{and} \quad SD_l = \{ SD^{(l)}_n \mid n = 1..N \}.$$

To compare these documents, sets of parameter values (degrees of correspondence $\mu$) are specified for all syntactical characteristics:

$$\widetilde{SD}_k = \{ \mu(SD^{(k)}_n) / s_n \mid n = 1..N \} \quad \text{and} \quad \widetilde{SD}_l = \{ \mu(SD^{(l)}_n) / s_n \mid n = 1..N \}.$$

As a result, sets of parameter values are formed that characterize the degrees of difference between the documents $SD_k$ and $SD_l$ according to all syntactical characteristics:

$$\tilde{d}(SD_k, SD_l) = \{ d(SD^{(k)}_n, SD^{(l)}_n) / s_n \mid n = 1..N \},$$

where, for example,

$$d(SD^{(k)}_n, SD^{(l)}_n) = \left| \mu(SD^{(k)}_n) - \mu(SD^{(l)}_n) \right|.$$

Note. The obtained set of values $\tilde{d}(SD_k, SD_l)$ can be presented in the form of a fuzzy set and interpreted as a fuzzy difference between the fuzzy sets $\widetilde{SD}_k = \{ \mu(SD^{(k)}_n) / s_n \mid n = 1..N \}$ and $\widetilde{SD}_l = \{ \mu(SD^{(l)}_n) / s_n \mid n = 1..N \}$. The syntactical characteristics from $S = \{ s_n \mid n = 1..N \}$ are their carriers, and the degrees of correspondence of the documents to these characteristics are the membership values.

As a result, a set of parameter values characterizing the degree of difference between the documents according to the syntactical characteristics is formed.

The degrees of difference according to all syntactical characteristics are calculated for all pairs of formalized documents $SD_k$ and $SD_l$, $k, l = 1..K$.
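The pairwise comparison of Step 2 can be sketched in Python as follows. The sketch assumes that the degrees of correspondence of each document are already available as numbers in [0, 1], one per syntactical characteristic, and that the difference is the absolute difference of these degrees (the form the text gives "for example"). The function name `difference` and the keys `s1`..`s3` are hypothetical.

```python
def difference(mu_k, mu_l):
    """Per-characteristic degree of difference between two documents.

    mu_k, mu_l: dicts mapping a characteristic name to the degree of
    correspondence of the document to that characteristic (assumed in [0, 1]).
    """
    if mu_k.keys() != mu_l.keys():
        raise ValueError("documents must be described by the same characteristics")
    return {s: abs(mu_k[s] - mu_l[s]) for s in mu_k}


mu_1 = {"s1": 0.75, "s2": 0.5, "s3": 0.0}
mu_2 = {"s1": 0.25, "s2": 0.5, "s3": 1.0}
d_12 = difference(mu_1, mu_2)  # {'s1': 0.5, 's2': 0.0, 's3': 1.0}
```

The result keeps one value per characteristic, so it can itself be read as a fuzzy set over the characteristics, as the note above suggests.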

Step 3. To form a matrix of difference between all pairs of the formalized documents.

The results of the previous step make it possible to form a composite matrix of differences between all pairs of documents. Figure 1 shows a matrix of this type.

$$
\begin{array}{c|ccccc}
 & SD_1 & \dots & SD_l & \dots & SD_K \\ \hline
SD_1 & \tilde{d}(SD_1, SD_1) & \dots & \tilde{d}(SD_1, SD_l) & \dots & \tilde{d}(SD_1, SD_K) \\
\dots & \dots & \dots & \dots & \dots & \dots \\
SD_k & \tilde{d}(SD_k, SD_1) & \dots & \tilde{d}(SD_k, SD_l) & \dots & \tilde{d}(SD_k, SD_K) \\
\dots & \dots & \dots & \dots & \dots & \dots \\
SD_K & \tilde{d}(SD_K, SD_1) & \dots & \tilde{d}(SD_K, SD_l) & \dots & \tilde{d}(SD_K, SD_K)
\end{array}
$$

The clusters detected by the fuzzy hierarchical clustering (Step 4) are described by the sets

$$\widetilde{Cl}_i = \{ \mu(Cl^{(i)}_n) / s_n \mid n = 1..N \}.$$

The detected clusters $Cl = \{ Cl_i \mid i = 1..I \}$ correspond to the rubrics

$$R = \{ R_i \mid i = 1..I \},$$

where for all $i = 1..I$

$$R_i = \left\{ \left( t_{ji}, \{ (w^{n}_{ji} / s_n) \mid n = 1..N \} \right) \mid j = 1..J_i \right\},$$

$t_{ji}$ is the $j$-th relevant word in the rubric $R_i$, and $w^{n}_{ji} \in [0, 1]$ is the degree of correspondence of the word $t_{ji}$ to the syntactical characteristic $s_n$ in the rubric $R_i$.

Thus, the result of the hierarchical clustering of documents is a tree structure of the rubric field based on the fuzzy relations between the syntactical characteristics of the rubricated documents.
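The construction of the difference matrix and the subsequent hierarchical clustering can be sketched as follows. This is only an illustration under stated assumptions: the text does not fix how the per-characteristic differences are aggregated into one number, which linkage is used, or how a cluster center is computed. The sketch assumes the mean absolute difference over the characteristics, average linkage, and the per-characteristic mean of the members as the center. All function names are hypothetical.

```python
from itertools import combinations


def scalar_difference(mu_k, mu_l):
    """Assumed aggregation: mean absolute difference over the characteristics."""
    return sum(abs(a - b) for a, b in zip(mu_k, mu_l)) / len(mu_k)


def difference_matrix(docs):
    """Symmetric K x K matrix of pairwise differences with a zero diagonal."""
    size = len(docs)
    matrix = [[0.0] * size for _ in range(size)]
    for k, l in combinations(range(size), 2):
        matrix[k][l] = matrix[l][k] = scalar_difference(docs[k], docs[l])
    return matrix


def agglomerate(matrix):
    """Average-linkage agglomerative clustering (an assumed linkage).

    Returns the merge history as (members_a, members_b, distance) tuples;
    read from the last merge to the first, it describes a binary tree.
    """
    def linkage(pair):
        a, b = pair
        return sum(matrix[i][j] for i in a for j in b) / (len(a) * len(b))

    clusters = [(i,) for i in range(len(matrix))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        merges.append((a, b, linkage((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


def center(docs, members):
    """Assumed cluster center: per-characteristic mean of the member degrees."""
    return [sum(docs[i][n] for i in members) / len(members)
            for n in range(len(docs[0]))]


# Four documents described by two characteristics each.
docs = [[1.0, 0.0], [1.0, 0.25], [0.0, 1.0], [0.0, 0.75]]
merges = agglomerate(difference_matrix(docs))
# The first two merges join documents (0, 1) and (2, 3); the last joins both groups.
```

Each merge corresponds to an internal node of the rubric tree, and the original documents are its leaves; cutting the tree at a chosen distance yields the clusters that are mapped to rubrics.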

Step 5. Documents analysis.

The proposed analysis procedure is based on comparing the degrees of correspondence $\widetilde{SD}_k$ of the analyzed document $SD_k$ to the syntactical characteristics with the values of the cluster centers $\widetilde{Cl}_i$, sequentially from the root to the leaves of the constructed decision tree. In this way, the analysis procedure takes into account the specificity of the detected clusters.

The analyzed document $SD_k$ is most relevant to the rubric $R_{l^*}$ for which the degree of fuzzy correspondence is maximal:

$$R_{l^*}: \ \max_{i = 1..I} \mu(\widetilde{SD}_k, \widetilde{Cl}_i).$$

To calculate the parameter characterizing the degree of fuzzy correspondence of a formalized document $SD_k$ to the rubric $R_i$, it is reasonable to use the following expression [3,4]:

$$\mu(SD_k, R_i) = 1 - \frac{1}{N} \sum_{n=1}^{N} \left( \mu(SD^{(k)}_n) - \mu(Cl^{(i)}_n) \right)^2.$$

The results of the proposed method application

The proposed rubrication method was programmatically implemented as a component of the comprehensive information system for the automatic processing of electronic unstructured documents arriving at official websites and portals of public authorities.

The method was tested in the automated processing and analysis of appeals (applications, complaints or suggestions) of citizens and organizations received by the Administration of Smolensk region in 2018-2019.

To classify incoming electronic appeals, experts identified 17 interconnected rubrics reflecting urgent civic problems: general issues of society and politics ($R_1$), separation of powers and functions in the Administration ($R_2$), social sphere ($R_3$), education ($R_4$), suggestions for improving the city of Smolensk ($R_5$), family ($R_6$), culture ($R_7$), physical education and sport ($R_8$), housing and communal services ($R_9$), maintenance and utilities ($R_{10}$), housing stock ($R_{11}$), non-residential fund ($R_{12}$), securing the right to housing ($R_{13}$), economy ($R_{14}$), business activities ($R_{15}$), natural resources ($R_{16}$) and environmental protection ($R_{17}$).

Two well-known methods (probabilistic and neural network) successfully used to classify unstructured text documents have been practically implemented for comparative text analysis.

The Bayes classification was chosen as the first alternative method because of its ease of implementation and minimal human and financial costs for software implementation. It uses the procedure for classifying documents based on Bayes formula for conditional probability.

The input text document is represented as a sequence of terms $\{w_n\}$. Each rubric $R_i$ is characterized by the unconditional probability $P(R_i)$ that a document $V$ is assigned to it and by the conditional probability $P(w \mid R_i)$ of encountering the term $w$ in a document $V$ given the rubric $R_i$. The probability $P(V \mid R_i)$ is then understood as the probability of observing the document $V$ given the rubric $R_i$.

The document rubrication procedure consists in calculating the probabilities $P(R_i \mid V)$ for all rubrics $R_i$ and choosing the rubric for which this probability is maximal. Classifier training consists in compiling a vocabulary of the probabilities of the terms $\{w_n\}$ for each rubric.

The methods of using probabilistic algorithms for the classification of text documents are considered in more detail in [8].
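The Bayes procedure described above can be illustrated with a minimal Python sketch. It assumes the naive independence of terms, so that the document probability is the product of the term probabilities, and it uses add-one (Laplace) smoothing and log-probabilities; these choices, the function names and the toy training data are assumptions made for illustration and are not taken from the compared implementation.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Collect document counts per rubric and term counts per rubric.

    labelled_docs: iterable of (list_of_terms, rubric) pairs.
    """
    doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for terms, rubric in labelled_docs:
        doc_counts[rubric] += 1
        term_counts[rubric].update(terms)
        vocabulary.update(terms)
    return doc_counts, term_counts, vocabulary


def classify(terms, model):
    """Return the rubric maximizing log P(R_i) + sum of log P(w | R_i)."""
    doc_counts, term_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_rubric, best_score = None, -math.inf
    for rubric, n_docs in doc_counts.items():
        total_terms = sum(term_counts[rubric].values())
        score = math.log(n_docs / total_docs)
        for w in terms:
            # Add-one smoothing keeps unseen terms from zeroing the product.
            score += math.log((term_counts[rubric][w] + 1)
                              / (total_terms + len(vocabulary)))
        if score > best_score:
            best_rubric, best_score = rubric, score
    return best_rubric


model = train([
    (["school", "teacher"], "education"),
    (["heating", "pipe"], "housing"),
])
rubric = classify(["teacher", "school"], model)  # 'education'
```

Working with sums of logarithms instead of products avoids numerical underflow on longer documents without changing which rubric attains the maximum.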

Convolutional neural networks were used as the second alternative method for document rubrication.

Convolutional networks are feedforward artificial neural networks, in which a signal travels sequentially through the neurons from the first layer to the last. These networks were originally developed for image analysis. Good results in this area have led to their application to other classification tasks, including the classification of unstructured documents.

Such a neural network alternates convolutional, subsampling and fully connected layers. A text document arrives at the network input with each word represented by a vector (obtained, for example, with the word2vec algorithm). The Softmax function, which implements multiclass classification, is used for the output layer of the neural network.

Convolutional neural networks for the classification of text documents are considered in more detail in [18,19,28].
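The layer sequence described above (convolution, subsampling, fully connected layer, Softmax) can be illustrated with a dependency-free Python sketch of a single forward pass. The weights and the two-dimensional word vectors are made-up numbers, ReLU activation and max-over-time pooling are assumed choices, and no training is shown; the sketch is not the network used in the experiments.

```python
import math


def conv1d(embeddings, kernel):
    """Slide a kernel (width x dim) over the word vectors; ReLU on each window.

    The document must contain at least as many words as the kernel is wide.
    """
    width = len(kernel)
    outputs = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        total = sum(w * x
                    for kernel_row, word_vec in zip(kernel, window)
                    for w, x in zip(kernel_row, word_vec))
        outputs.append(max(0.0, total))
    return outputs


def softmax(scores):
    """Numerically stable Softmax: probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(embeddings, kernels, dense):
    """kernels: list of filters; dense: one weight row per rubric."""
    pooled = [max(conv1d(embeddings, k)) for k in kernels]  # subsampling
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in dense]
    return softmax(logits)


words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three word vectors
kernels = [[[1.0, 0.0], [0.0, 1.0]],
           [[0.0, 1.0], [1.0, 0.0]]]            # two filters of width 2
dense = [[1.0, 0.0], [0.0, 0.5]]                # two output rubrics
probabilities = forward(words, kernels, dense)
```

The output is one probability per rubric, and the rubric with the largest probability is taken as the classification result.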

During the preliminary analysis, the authors identified 4 typical situations distinguished by three indicators: the size of the received document, the degree of intersection of the rubrics, and the amount of accumulated statistics for training the models.

For these typical situations, Table 1 shows the results of the comparative assessment of correct rubrication and analysis on a sample of more than 10 thousand messages.

For the mentioned typical situations the proposed classification method has allowed reducing the number of erroneously rubricated text documents by 7% on average compared with the probabilistic method and by 6.3% compared with the neural network method.

Conclusion

The proposed method forms a tree structure of the rubric field based on the fuzzy relations between the syntactical characteristics of the rubricated documents. Document analysis is based on determining the fuzzy correspondence of a document, according to its syntactical characteristics, to the values of the detected clusters, sequentially from the root to the leaves of the constructed decision tree.

The proposed method for the rubrication and analysis of electronic unstructured text documents was implemented in software and tested during the automated processing of appeals (applications, complaints or suggestions) of citizens and organizations received by the Administration of Smolensk region. It has made it possible to ensure efficient and high-quality actualization of the rubrics and document analysis under the conditions of a nonstationary thesaurus composition and changing relevance of the words in the rubrics.


Figure 1. The composite matrix of differences between all pairs of documents.

Step 4. Fuzzy hierarchical clustering of documents based on the fuzzy relations of difference between all pairs of formalized documents according to all syntactical characteristics. The parameters $d(SD^{(k)}_n, SD^{(l)}_n)$, whose values characterize the results of the pairwise comparison of documents, are used as the parameters for the fuzzy hierarchical clustering of formalized documents.

Table 1. The results of the comparative assessment for the correct rubrication and analysis of EUD received by the Administration of Smolensk region. The first three columns describe the typical situation for analysis and rubrication of EUD; the last three columns give the results for rubrication and analysis of EUD, %.

| EUD size | Degree of rubrics overlay | Sufficient statistics | Probabilistic method | Neural network method | Proposed method |
|---|---|---|---|---|---|
| up to 150 words | 0.4 | not enough | 65 | 60 | 65 |
| up to 150 words | 0.15 | not enough | 62 | 66 | 79 |
| up to 50 words | 0.15 | enough | 69 | 87 | 90 |
| more than 150 words | 0.15 | enough | 89 | 85 | 89 |

Acknowledgment

The reported study was funded by RFBR according to the research project No 18-01-00558.

References

[1] Analytical report on the work of Administration of Smolensk region with citizens' appeals.
[2] T. Avdeenko, E. Makarova. Acquisition of knowledge in the form of fuzzy rules for cases classification. Lecture Notes in Computer Science. Data Mining and Big Data, 10387, 2017.
[3] I. Batyrshin. On definition and construction of association measures. Journal of Intelligent & Fuzzy Systems, 29, 2015.
[4] I. Batyrshin. Towards a general theory of similarity and association measures: Similarity, dissimilarity and correlation functions. Journal of Intelligent & Fuzzy Systems, 36, 2019.
[5] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin. A Neural Probabilistic Language Model. JMLR, 3, 2003.
[6] V. Borisov, M. Dli, P. Kozlov. Analysis and monitoring of electronic text documents rubrication. MPIE Bulletin, 4, 2018.
[7] V. Borisov, M. Dli, P. Kozlov. The method of fuzzy analysis of texts and their rubrics actualization. Proceedings of the II International Scientific and Practical Conference "Fuzzy Technologies in the Industry - FTI 2018", Ulyanovsk, 2018.
[8] M. E. Burlakov. Using optimize naïve bayes classifier in problem of sms classification. Izvestia of Samara Scientific Center of the Russian Academy of Sciences, 18(4), 2016.
[9] O. Cordon, F. Herrera, F. Hoffmann, L. Magdalena. Genetic Fuzzy Systems: Evolutionary Tuning and learning of Fuzzy Knowledge Bases. World Scientific Publishing, New Jersey, London, Hong Kong, 2001, 462 p.
[10] M. Dli, O. Bulygina, P. Kozlov, G. Ross. Developing the economic information system for automated analysis of unstructured text documents. Journal of Applied Informatics, 13(5), 2018.
[11] M. Dli, O. Bulygina, P. Kozlov. Development of multimethod approach to rubrication of unstructed electronic text documents in various conditions. Proceedings of the International Russian Automation Conference (RusAutoCon), Sochi, 2018.
[12] M. Dli, O. Bulygina, P. Kozlov. Formation of the structure of the intellectual system of analyzing and rubricating unstructured text information in different situations. Journal of Applied Informatics, 13(4), 2018.
[13] M. Faifer, C. Janikow. Bottom-up Partitioning in Fuzzy Decision Trees. Proceedings of the 19th International Conference of the North American Fuzzy Information Society. IEEE, 2000.
[14] M. Jambu. Hierarchical cluster analysis and correspondences. Finance and statistics, Moscow, 1988.
[15] C. Janikow. Fuzzy Decision Trees: Issues and Methods. IEEE Transactions of Man, Systems, Cybernetics, 28(1), 1998.
[16] I. L. Kaftannikov, A. V. Parasich. Decision Tree's Features of Application in Classification Problems. Bulletin of the South Ural State University. Ser. Computer Technologies, Automatic Control, Radio Electronics, 15(3), 2015.
[17] N. Kalchbrenner, P. Blunsom. Recurrent convolutional neural networks for discourse compositionality. Workshop on CVSC, 2013.
[18] Y. Kim. Convolutional neural networks for sentence classification. EMNLP, September 2014.
[19] A. Krizhevsky, I. Sutskever, G. Hinton. Imagenet classification with deep convolutional neural networks. NIPS, 2012.
[20] V. Kruglov, M. Dli, R. Golunov. Fuzzy logic and artificial neural networks. Nauka, Fizmatlit, Moscow, 2001.
[21] Y. Lecun. Text understanding from scratch. Computer Science Department, 2016.
[22] T. Nakagawa, K. Inui, S. Kurohashi. Dependency tree-based sentiment classification using CRFs with hidden variables. Proceedings of ACL, 2010.
[23] K. Passino, S. Yurkovich. Fuzzy Control. Addison-Wesley, NJ, 1998, 522 p.
[24] S. Protasov. LinkGrammar.
[25] J. Quinlan. Induction of decision trees. Machine Learning, 1(1), 1998.
[26] O. G. Shevelyov, A. V. Petrakov. Text classification with decision trees and feedforward neural networks. Tomsk State University Journal, 290, 2006.
[27] N. Uchitelev. Classification of text information with the use of SVM. Information technologies and system, 1, 2013.
[28] X. Zhang, J. Zhao, Y. Lecun. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, February 2015.