=Paper=
{{Paper
|id=Vol-2475/paper14
|storemode=property
|title=Method for Documents Rubrication and Analysis Based on Fuzzy Relations of Difference between Their Syntactical Characteristics
|pdfUrl=https://ceur-ws.org/Vol-2475/paper14.pdf
|volume=Vol-2475
|authors=Vadim Borisov,Maxim Dli,Pavel Kozlov
}}
==Method for Documents Rubrication and Analysis Based on Fuzzy Relations of Difference between Their Syntactical Characteristics==
<pdf width="1500px">https://ceur-ws.org/Vol-2475/paper14.pdf</pdf>
<pre>
 Method for Documents Rubrication and Analysis Based
    on Fuzzy Relations of Difference between Their
              Syntactical Characteristics

                                   V. Borisov, M. Dli, P. Kozlov

        The Branch of National Research University “Moscow Power Engineering Institute”
                                    in Smolensk, Russia
                e-mail: vbor67@mail.ru, midli@imail.ru, originaldod@gmail.com


        Abstract. The paper formulates the problem and proposes a method for the
        rubrication and analysis of electronic unstructured documents. Applying the
        proposed method yields a tree structure of the rubric field based on fuzzy
        relations of difference between the syntactical characteristics of the rubricated
        documents. Document analysis is based on determining the fuzzy
        correspondence of a document, according to its syntactical characteristics,
        to the values of the centers of the detected clusters, sequentially from the
        root to the leaves of the built fuzzy decision tree. The conducted computational
        experiments have shown that the proposed method reduces the number
        of erroneously rubricated documents (in comparison with probabilistic and
        neural network methods).


1 Introduction

    The "Electronic Government" program provides for the active introduction of
information and communication technologies into the activities of public authorities.
The main goal of the program is to increase the efficiency of public administration and to
develop partnerships with civil society and business.
    A key task of program implementation is to develop Internet services, which
provide information support and a variety of services in electronic form. Their use can
improve the quality and accessibility of state and municipal services to citizens and
businesses, reduce the cost of their provision and increase the labor productivity in
institutions of government at various levels.
    One of the ways to use information and communication technology to solve this
task is to automate the process of analyzing electronic appeals (applications,
complaints, suggestions) of individuals and legal entities arriving at official websites
and portals of authorities and local self-government.
    Text rubrication plays an important role in the automatic analysis of
incoming electronic appeals. It consists in distributing the appeals among thematic
rubrics that correspond to the areas of activity of the departments involved in
processing them and preparing the response.


___________________________
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
In: P. Sosnin, V. Maklaev, E. Sosnina (eds.): Proceedings of the IS-2019 Conference, Ulyanovsk, Russia,
24-27 September 2019, published at http://ceur-ws.org


    Today, there are many methodological approaches to the classification of
documents of various types. The choice of a specific method is directly determined by
the characteristics of the rubrication objects (i.e. documents received by public
authorities).
    The analysis has revealed the following specific characteristics of electronic
documents received on official websites and portals of public authorities, which must
be taken into account when choosing a rubrication method:
    • relatively small size of electronic documents that impedes their statistical
      analysis;
    • absence of marking in these documents that complicates the procedures for
      highlighting the structure and extracting the information relevant to the
      analysis;
    • presence of grammar and syntactical errors in electronic messages that entails
      the necessity for additional processing;
    • nonstationarity of the thesaurus (the composition and relevance of the rubric
      words);
    • dynamic changes of the legislative and regulatory framework that can change
      the distribution of tasks between departments;
    • description of several problems in one message (answers can be prepared by
      several specialists or even several departments).
    These features significantly limit the applicability of methods based on the
probabilistic and statistical approach to rubric generation and electronic text
analysis [1, 6, 27].
    This makes it necessary to develop a new method for rubricating electronic
unstructured documents that takes into account the specific features of text
messages received on official websites and portals of public authorities.


2 Related works

    At present, there are a variety of methods, models and algorithms for the
classification of text documents written in natural language. However, each of them
has its applicability conditions determined by the statement of the rubrication
problem.
    It was shown in articles [10, 11, 12] that the choice of a specific classification
(rubrication) method is determined by such characteristics as the size of the analyzed
document, the degree of rubric thesaurus intersection and the amount of accumulated
statistical information.
    Machine learning is a well-known approach to classifying unstructured
documents. It offers the use of artificial intelligence methods that can learn from a set
of precedents.
    One of the machine learning methods that have been successfully used to solve
various classification problems is artificial neural networks. Their application to
text classification is considered in [5, 17, 20, 21, 26]. The main limitation of
this approach is that it requires a large amount of statistical data for training
the algorithms.
    Another machine learning method that can be used to classify text documents is
fuzzy decision trees. They are based on learning by examples, while the rules are
presented in the form of a hierarchical sequential structure. The issues of using fuzzy
decision trees are considered in the works [2, 9, 13, 15, 16, 22, 23, 25, 26].


3 Statement of the rubrication problem

    Initial data
    1. For the formalized presentation of electronic unstructured documents (EUD),
a set of syntactical characteristics is unified in advance. These characteristics are
extracted by a classical analyzer (parser), for example, LinkGrammar [24]:

    \[ S = \{ s_n \mid n = 1..N \}, \]

where in the typical case $N = 5$: $s_1$ – the root word or the predicate; $s_2$ – the subject;
$s_3$ – the adverbial modifier; $s_4$ – the object under the action; $s_5$ – the predicate.
    2. There is a set of EUD

    \[ V = \{ V_k \mid k = 1..K \}, \]

in which every document $V_k$ is presented by a set of its relevant words:

    \[ \forall k = 1..K: \quad V_k = \{ v_{l_k}^{(k)} \mid l_k = 1..L_k \}, \]

where $v_{l_k}^{(k)}$ – a relevant word of the EUD, $L_k$ – the number of words in the k-th EUD.
    3. The set of EUD $V$ is presented as a set $SD$ of formalized documents:

    \[ SD = \{ SD_k \mid k = 1..K \}, \]

in which a formalized document $SD_k$ corresponds to each EUD:

    \[ \forall k = 1..K: \quad SD_k = \{ SD_n^{(k)} \mid n = 1..N \}, \]

where $SD_n^{(k)}$ – the set of words from the EUD $V_k$ corresponding to the syntactical
parameter $s_n$ [7].
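
    The formalized document of item 3 can be pictured as a mapping from each
syntactical characteristic to the set of words assigned to it. The sketch below is only
an illustration: the (word, role) input format and the function name formalize are
hypothetical and do not reproduce the output of LinkGrammar or of any other parser.

```python
ROLES = ["s1", "s2", "s3", "s4", "s5"]  # the N = 5 syntactical characteristics


def formalize(tagged_words):
    """Group (word, role) pairs into the sets SD_n^(k) of one formalized document.

    The pair format is an assumption of this sketch; a real parser would
    need an adapter that produces it.
    """
    doc = {role: set() for role in ROLES}
    for word, role in tagged_words:
        if role in doc:  # words with an unknown role are ignored
            doc[role].add(word)
    return doc


# Hypothetical parser output for one short appeal
tagged = [("roof", "s2"), ("leaks", "s1"), ("badly", "s3"), ("roof", "s2")]
sd_k = formalize(tagged)
```

Using sets mirrors the definition of $SD_n^{(k)}$ as a set of words, so a word repeated
in the same role is stored once.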
    Required
    To propose a method for the rubrication and analysis of EUD based on hierarchical
clustering that uses fuzzy relations between the syntactical characteristics of the
rubricated documents.


4 Method description

    The proposed method for rubrication and analysis of EUD includes the steps
discussed below.
    Step 1. To give the parameters to determine the degree of correspondence for
formalized documents according to the syntactical characteristics.
    For each formalized document $SD_k$ ($k = 1..K$) a set of values for parameters

    \[ \widetilde{SD}_k = \{ \langle \widetilde{SD}_n^{(k)} / s_n \rangle \mid n = 1..N \} \]

is given to assess the degree of its correspondence according to all syntactical
characteristics.
    Step 2. To determine the degree of difference between all pairs of formalized
documents according to all syntactical characteristics.
    Consider a pair of documents $SD_k$ and $SD_l$, $k, l = 1..K$:

    \[ SD_k = \{ SD_n^{(k)} \mid n = 1..N \} \quad \text{and} \quad SD_l = \{ SD_n^{(l)} \mid n = 1..N \}. \]

    To compare these documents, sets of parameter values are given for all syntactical
characteristics:

    \[ \widetilde{SD}_k = \{ \langle \widetilde{SD}_n^{(k)} / s_n \rangle \mid n = 1..N \}
       \quad \text{and} \quad
       \widetilde{SD}_l = \{ \langle \widetilde{SD}_n^{(l)} / s_n \rangle \mid n = 1..N \}. \]

    As a result, sets of parameter values are formed. These parameters characterize the
degrees of difference between the documents $SD_k$ and $SD_l$ according to all syntactical
characteristics:

    \[ d(\widetilde{SD}_k, \widetilde{SD}_l) =
       \{ \langle d(\widetilde{SD}_n^{(k)}, \widetilde{SD}_n^{(l)}) / s_n \rangle \mid n = 1..N \}, \]

where, for example, $d(\widetilde{SD}_n^{(k)}, \widetilde{SD}_n^{(l)}) = | \widetilde{SD}_n^{(k)} - \widetilde{SD}_n^{(l)} |$.


   Note. The obtained set of values $d(\widetilde{SD}_k, \widetilde{SD}_l)$ can be presented in the form of a
fuzzy set and interpreted as a fuzzy difference between the fuzzy sets
$\widetilde{SD}_k = \{ \langle \widetilde{SD}_n^{(k)} / s_n \rangle \mid n = 1..N \}$ and
$\widetilde{SD}_l = \{ \langle \widetilde{SD}_n^{(l)} / s_n \rangle \mid n = 1..N \}$. The syntactical
characteristics from $S = \{ s_n \mid n = 1..N \}$ are their carriers, and the degrees of
correspondence of the documents to these characteristics, $\widetilde{SD}_n^{(k)}$ and $\widetilde{SD}_n^{(l)}$, are the
degrees of membership used to form the fuzzy set $d(\widetilde{SD}_k, \widetilde{SD}_l)$.
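    Example. Consider the comparison of documents $SD_k$ and $SD_l$ with the following
parameters:

    \[ \widetilde{SD}_k = \{ \langle 0.7 / s_1 \rangle, \langle 0.5 / s_2 \rangle, \langle 0.3 / s_3 \rangle, \langle 0.3 / s_4 \rangle, \langle 0.8 / s_5 \rangle \} \]

and

    \[ \widetilde{SD}_l = \{ \langle 0.1 / s_1 \rangle, \langle 0.9 / s_2 \rangle, \langle 0.2 / s_3 \rangle, \langle 0.6 / s_4 \rangle, \langle 0.4 / s_5 \rangle \}. \]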


    As a result, the following set of parameter values, characterizing the degree of
difference between the documents according to the syntactical characteristics, is
formed:

    \[ d(\widetilde{SD}_k, \widetilde{SD}_l) = \{ \langle 0.6 / s_1 \rangle, \langle 0.4 / s_2 \rangle, \langle 0.1 / s_3 \rangle, \langle 0.3 / s_4 \rangle, \langle 0.4 / s_5 \rangle \}. \]

   The degree of difference according to all syntactical characteristics is calculated
for all pairs of formalized documents $SD_k$ and $SD_l$, $k, l = 1..K$.
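
    The Step 2 calculation can be sketched in a few lines of Python. This is a minimal
illustration, assuming that d is the absolute difference of the two degrees of
correspondence, which reproduces the example values above. The function name
fuzzy_difference and the list representation of a document are choices made for this
sketch only.

```python
def fuzzy_difference(doc_k, doc_l):
    """Per-characteristic degree of difference |SD_n^(k) - SD_n^(l)|.

    Each document is a list of N degrees of correspondence, one per
    syntactical characteristic s_1..s_N (representation assumed here).
    """
    if len(doc_k) != len(doc_l):
        raise ValueError("documents must use the same characteristics")
    return [abs(a - b) for a, b in zip(doc_k, doc_l)]


sd_k = [0.7, 0.5, 0.3, 0.3, 0.8]
sd_l = [0.1, 0.9, 0.2, 0.6, 0.4]
diff = fuzzy_difference(sd_k, sd_l)
print([round(d, 2) for d in diff])  # [0.6, 0.4, 0.1, 0.3, 0.4]
```

The rounding only hides floating-point noise; the values themselves match the example.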
     Step 3. To form a matrix of difference between all pairs of the formalized
documents.
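     The results of the previous step make it possible to form a composite matrix of
differences between all pairs of documents.
     Figure 1 shows a matrix of this type.

\[
\begin{array}{c|ccccc}
       & SD_1 & \dots & SD_l & \dots & SD_K \\ \hline
SD_1   & d(\widetilde{SD}_1, \widetilde{SD}_1) & \dots & d(\widetilde{SD}_1, \widetilde{SD}_l) & \dots & d(\widetilde{SD}_1, \widetilde{SD}_K) \\
\dots  & \dots &       & \dots &       & \dots \\
SD_k   & d(\widetilde{SD}_k, \widetilde{SD}_1) & \dots & d(\widetilde{SD}_k, \widetilde{SD}_l) & \dots & d(\widetilde{SD}_k, \widetilde{SD}_K) \\
\dots  & \dots &       & \dots &       & \dots \\
SD_K   & d(\widetilde{SD}_K, \widetilde{SD}_1) & \dots & d(\widetilde{SD}_K, \widetilde{SD}_l) & \dots & d(\widetilde{SD}_K, \widetilde{SD}_K)
\end{array}
\]
           Figure 1. The composite matrix of differences between all pairs of documents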
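
    A matrix of this kind can be built by applying the pairwise difference to every
pair of documents. The sketch below assumes the absolute-difference form of d and a
list-of-degrees representation of each document; the third document is hypothetical
and is included only to make the matrix larger than 2 x 2.

```python
def difference_matrix(docs):
    """K x K matrix whose (k, l) entry is the list of per-characteristic differences."""
    return [
        [[abs(a - b) for a, b in zip(doc_k, doc_l)] for doc_l in docs]
        for doc_k in docs
    ]


docs = [
    [0.7, 0.5, 0.3, 0.3, 0.8],
    [0.1, 0.9, 0.2, 0.6, 0.4],
    [0.6, 0.5, 0.4, 0.2, 0.7],  # hypothetical third document
]
matrix = difference_matrix(docs)
```

With this choice of d the matrix is symmetric and has zero vectors on the diagonal, so
in practice only the entries above the diagonal need to be stored.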

    Step 4. Fuzzy hierarchical clustering of documents based on the fuzzy relations of
difference between all pairs of formalized documents according to all syntactical
characteristics.
    The parameters $d(\widetilde{SD}_n^{(k)}, \widetilde{SD}_n^{(l)})$ are used as the parameters for fuzzy hierarchical
clustering of formalized documents; their values characterize the results of the pairwise
comparison of $\widetilde{SD}_n^{(k)}$ and $\widetilde{SD}_n^{(l)}$ separately for each of the syntactical characteristics
$\{ s_n \mid n = 1..N \}$.
    It is reasonable to use well-known agglomerative methods as a base for the
hierarchical clustering procedure [14].
    Clusters $Cl = \{ Cl_i \mid i = 1..I \}$ are detected as a result of hierarchical clustering. Let
the centers of these clusters be $\{ \widetilde{Cl}_i \mid i = 1..I \}$, where
$\widetilde{Cl}_i = \{ \langle \widetilde{Cl}_n^{(i)} / s_n \rangle \mid n = 1..N \}$.
    The detected clusters $Cl = \{ Cl_i \mid i = 1..I \}$ correspond to the rubrics:

    \[ R = \{ R_i \mid i = 1..I \}, \]

where for all $i = 1..I$

    \[ R_i = \{ \langle t_{ji}, \{ (w_{jin} / s_n) \mid n = 1..N \} \rangle \mid j = 1..J_i \}, \]

$t_{ji}$ – the j-th relevant word in the rubric $R_i$, $w_{jin} \in [0, 1]$ – the degree of
correspondence of the word $t_{ji}$ to the syntactical characteristic $s_n$ in the rubric $R_i$.
    Thus, the result of the hierarchical clustering of documents is a tree structure
of the formed rubric field based on the fuzzy relations between the syntactical
characteristics of the rubricated documents.
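
    The clustering of Step 4 can be illustrated with a small agglomerative procedure.
This is a sketch under stated assumptions, not the authors' implementation: the
per-characteristic differences are collapsed into one distance by averaging, clusters
are merged by average linkage, and a cluster center is the mean of its members. The
text only states that well-known agglomerative methods are used, so all three choices,
the function names and the four sample documents are hypothetical.

```python
def distance(a, b):
    # Assumption: average the per-characteristic absolute differences.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def average_linkage(c1, c2, docs):
    # Mean distance over all pairs of documents taken from the two clusters.
    total = sum(distance(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(docs, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters (lists of document indices) and the list
    of merges, which records the tree from the leaves upwards.
    """
    clusters = [[i] for i in range(len(docs))]
    merges = []
    while len(clusters) > n_clusters:
        pairs = [
            (average_linkage(clusters[a], clusters[b], docs), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        _, a, b = min(pairs)
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges


def center(cluster, docs):
    # Assumption: the cluster center is the per-characteristic mean.
    n = len(docs[0])
    return [sum(docs[i][j] for i in cluster) / len(cluster) for j in range(n)]


docs = [
    [0.7, 0.5, 0.3, 0.3, 0.8],
    [0.6, 0.5, 0.4, 0.2, 0.7],
    [0.1, 0.9, 0.2, 0.6, 0.4],
    [0.2, 0.8, 0.2, 0.7, 0.5],
]
clusters, merges = agglomerate(docs, n_clusters=2)
centers = [center(c, docs) for c in clusters]
```

On this sample the first two and the last two documents end up in separate clusters.
Running the procedure down to a single cluster would give the full merge sequence, that
is, the whole tree.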
    Step 5. Documents analysis.
    The proposed analysis procedure is based on comparing the correspondence degrees
$\widetilde{SD}_k$ of the analyzed document $SD_k$ according to the syntactical characteristics
with the values of the cluster centers $\widetilde{Cl}_i$, sequentially from the root to the
leaves of the built decision tree. In this case, the analysis procedure takes into
account the specificity of the detected clusters.
   The analyzed document $SD_k$ is most relevant to the rubric $R_l$ for which the degree
of fuzzy correspondence is the maximum:

    \[ R_l: \quad \max_{i = 1..I} \mu(\widetilde{SD}_k, \widetilde{Cl}_i). \]

    To calculate the parameter characterizing the degree of fuzzy correspondence of a
formalized document $SD_k$ to the rubric $R_i$, it is reasonable to use the following [3, 4]:

    \[ \mu(\widetilde{SD}_k, R_i) = 1 - \sqrt{ \frac{1}{N} \sum_{n=1}^{N} \left( \widetilde{SD}_n^{(k)} - \widetilde{Cl}_n^{(i)} \right)^2 }. \]
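
    The choice of the most relevant rubric can be sketched as follows. The sketch
assumes that the degree of correspondence is one minus the root-mean-square deviation
between the degrees of the document and those of the cluster center; if the measure is
defined without the square root, only the correspondence function changes. The two
cluster centers are hypothetical values, and the comparison is done over a flat list of
centers, whereas the described procedure applies it at each node on the way from the
root to the leaves.

```python
import math


def correspondence(doc, center):
    """1 - RMS deviation between document degrees and a cluster center (assumed form)."""
    n = len(doc)
    return 1.0 - math.sqrt(sum((d - c) ** 2 for d, c in zip(doc, center)) / n)


def best_rubric(doc, centers):
    """Index of the center with the maximum correspondence, plus all scores."""
    scores = [correspondence(doc, c) for c in centers]
    best = max(range(len(centers)), key=scores.__getitem__)
    return best, scores


doc = [0.7, 0.5, 0.3, 0.3, 0.8]
centers = [
    [0.65, 0.5, 0.35, 0.25, 0.75],  # hypothetical center of rubric 0
    [0.15, 0.85, 0.2, 0.65, 0.45],  # hypothetical center of rubric 1
]
best, scores = best_rubric(doc, centers)
```

Because all degrees lie in [0, 1], the score also stays in [0, 1], and a document that
coincides with a center gets the score 1.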


5 The results of the proposed method application

    The proposed rubrication method was programmatically implemented as a
component of the comprehensive information system for the automatic processing of
electronic unstructured documents arriving at official websites and portals of public
authorities.
    This method was tested in the automated processing and analysis of appeals
(applications, complaints or suggestions) of citizens and organizations received by
the Administration of Smolensk region in 2018-2019.
    To carry out the classification of incoming electronic appeals, the experts have
identified 17 interconnected rubrics reflecting the urgent civic problems: general
issues of society and politics (R1), separation of powers and functions in the
Administration (R2), social sphere (R3), education (R4), suggestions for improving the
city of Smolensk (R5), family (R6), culture (R7), physical education and sport (R8),
housing and communal services (R9), maintenance and utilities (R10), housing stock
(R11), non-residential fund (R12), securing the right to housing (R13), economy (R14),
business activities (R15), natural resources (R16) and environmental protection (R17).


    Two well-known methods (probabilistic and neural network) successfully used to
classify unstructured text documents have been practically implemented for
comparative text analysis.
    The Bayes classification was chosen as the first alternative method because of its
ease of implementation and minimal human and financial costs for software
implementation. It uses the procedure for classifying documents based on Bayes
formula for conditional probability.
    The input text document V is presented as a sequence of terms {wn}. Each rubric Ri
is characterized by the unconditional probability P(Ri) that a document is assigned
to it and by the conditional probability P(w|Ri) of meeting the term w in a document
of rubric Ri. The probability P(V|Ri) is then understood as the probability of
observing the document V given that rubric Ri is selected.
    The procedure for document rubrication consists in calculating the probabilities
P(Ri|V) for all rubrics Ri and choosing the rubric for which this probability is
maximal. Classifier training consists of compiling a vocabulary of probabilities of
various terms {wn} for each rubric.
    The methods of using probabilistic algorithms for the classification of text
documents are considered in more detail in [8].
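
    The Bayes procedure described above can be sketched as a small multinomial naive
Bayes classifier. The sketch is an illustration only: it assumes that terms are
independent given the rubric, works with logarithms to avoid underflow, and uses
Laplace (add-one) smoothing, none of which is specified in the text. The training
appeals and rubric names are invented toy data.

```python
import math
from collections import Counter, defaultdict


def train(labelled_docs):
    """Collect rubric frequencies and per-rubric term frequencies."""
    rubric_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for terms, rubric in labelled_docs:
        rubric_counts[rubric] += 1
        term_counts[rubric].update(terms)
        vocabulary.update(terms)
    return rubric_counts, term_counts, vocabulary


def classify(terms, model):
    """Return the rubric with the largest log P(Ri) + sum of log P(w|Ri)."""
    rubric_counts, term_counts, vocabulary = model
    total_docs = sum(rubric_counts.values())
    best, best_score = None, -math.inf
    for rubric, n_docs in rubric_counts.items():
        total_terms = sum(term_counts[rubric].values())
        score = math.log(n_docs / total_docs)  # log P(Ri)
        for w in terms:
            # Laplace smoothing: an assumption of this sketch
            p = (term_counts[rubric][w] + 1) / (total_terms + len(vocabulary))
            score += math.log(p)  # log P(w|Ri)
        if score > best_score:
            best, best_score = rubric, score
    return best


# Invented toy training set
training = [
    (["roof", "leak", "repair"], "housing"),
    (["heating", "repair"], "housing"),
    (["school", "teacher"], "education"),
    (["school", "enrolment"], "education"),
]
model = train(training)
rubric = classify(["roof", "leak"], model)
```

The denominator P(V) of Bayes' formula is the same for every rubric, so it can be
omitted when only the maximum is needed.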
    Convolutional neural networks were used as the second alternative method for
document rubrication.
    Convolutional networks are feedforward artificial neural networks, in which a
signal travels sequentially through the neurons from the first layer to the last. These
networks were originally developed for image analysis. Good results in this area have
led to their application to other classification tasks, including the classification of
unstructured documents.
    Such a network alternates convolutional, subsampling and fully connected layers.
A text document arrives at the network input with each word represented by a vector
(obtained, e.g., with the word2vec algorithm). The Softmax function, which implements
multiclass classification, is used in the output layer of the neural network.
    Convolutional neural networks for the classification of text documents are
considered in more detail in [18, 19, 28].
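
    As a small illustration of the output layer only, the Softmax function turns the
raw scores of the last layer into a probability for each rubric. The sketch below does
not model the convolutional and subsampling layers, and the three scores are
hypothetical values.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three rubrics
probs = softmax(logits)
predicted = probs.index(max(probs))
```

The rubric with the largest score also receives the largest probability, so the
predicted rubric is the same before and after Softmax; the function matters for
training and for reading the output as a degree of confidence.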
    During the preliminary analysis, the authors identified 4 typical situations,
distinguished by three indicators: the size of the received document, the degree
of intersection of the rubrics, and the amount of accumulated statistics for training
the models.
    For these typical situations, Table 1 shows the results of the comparative
assessment of correct rubrication and analysis on a sample of more than
10 thousand messages.
    For these typical situations, the proposed classification method reduced the
number of erroneously rubricated text documents by 7% on average compared with the
probabilistic method and by 6.3% compared with the neural network method.


  Table 1. The results of the comparative assessment for the correct rubrication and
          analysis of EUD received by the Administration of Smolensk region

        Typical situation for analysis           |  Results for rubrication and
            and rubrication of EUD               |       analysis of EUD, %
  ---------------------+-----------+-------------+---------------+---------+----------
  EUD size             | Degree of | Sufficient  | Probabilistic | Neural  | Proposed
                       | rubrics   | statistics  | method        | network | method
                       | overlap   |             |               | method  |
  ---------------------+-----------+-------------+---------------+---------+----------
  up to 150 words      | 0.4       | not enough  | 65            | 60      | 65
  up to 150 words      | 0.15      | not enough  | 62            | 66      | 79
  up to 50 words       | 0.15      | enough      | 69            | 87      | 90
  more than 150 words  | 0.15      | enough      | 89            | 85      | 89
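
    As a rough cross-check of Table 1, the unweighted mean of each method over the
four typical situations can be computed as below. These are simple row averages; the
averages quoted in the text may weight the situations by the number of messages in
each, which the table does not report, so the figures need not coincide.

```python
# Percentages of correct rubrication from Table 1, one entry per typical situation
results = {
    "probabilistic": [65, 62, 69, 89],
    "neural_network": [60, 66, 87, 85],
    "proposed": [65, 79, 90, 89],
}

# Unweighted mean over the four situations (an assumption of this sketch)
means = {name: sum(vals) / len(vals) for name, vals in results.items()}

gain_vs_probabilistic = means["proposed"] - means["probabilistic"]
gain_vs_neural = means["proposed"] - means["neural_network"]
print(means)
print(gain_vs_probabilistic, gain_vs_neural)  # 9.5 6.25
```

The gains are expressed in percentage points of correctly rubricated documents.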


6 Conclusion

    The implemented method forms a tree structure of the rubric field based on the
fuzzy relations between the syntactical characteristics of the rubricated documents.
Document analysis is based on determining the fuzzy correspondence of a document,
according to its syntactical characteristics, to the values of the centers of the
determined clusters, sequentially from the root to the leaves of the built decision
tree.
    The proposed method for the rubrication and analysis of electronic unstructured text
documents was implemented in software and tested during the automated processing
of appeals (applications, complaints or suggestions) of citizens and organizations
received by the Administration of Smolensk region. It has made it possible to ensure
efficient and high-quality updating of the rubrics and analysis of documents under
the conditions of a nonstationary composition of the thesaurus and relevance of the
words in rubrics.


7 Acknowledgment

   The reported study was funded by RFBR according to the research project
No 18-01-00558.


References:
     1. Analytical report on the work of Administration of Smolensk region with citizens’
appeals. URL: https://www.adminsmolensk.ru/obrascheniya_grazhdan/obzori_obrascheniy/news_16096.html.
     2.     Avdeenko, T., Makarova, E.: Acquisition of knowledge in the form of fuzzy rules for
cases classification. Lecture Notes in Computer Science. Data Mining and Big Data, vol.
10387, pp. 536-544 (2017).
     3. Batyrshin, I.: On definition and construction of association measures. Journal of
Intelligent & Fuzzy Systems, vol. 29, pp. 2319-2326 (2015).
     4. Batyrshin, I.: Towards a general theory of similarity and association measures:
Similarity, dissimilarity and correlation functions. Journal of Intelligent & Fuzzy Systems,
vol.36, pp. 2977-3004 (2019).
     5.     Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A Neural Probabilistic Language
Model. JMLR 3, pp.1137–1155 (2003).
     6. Borisov, V., Dli, M., Kozlov, P.: Analysis and monitoring of electronic text documents
rubrication. MPEI Bulletin, vol. 4, pp. 121-127 (2018).
     7. Borisov, V., Dli, M., Kozlov, P.: The method of fuzzy analysis of texts and their rubrics
actualization. Fuzzy Technologies in the Industry – FTI 2018: Proceedings of the II
International Scientific and Practical Conference. Ulyanovsk, pp. 259-263 (2018).
     8. Burlakov, M.E.: Using optimize naïve bayes classifier in problem of sms classification.
Izvestia of Samara Scientific Center of the Russian Academy of Sciences, vol. 18, no. 4, pp.
705-709 (2016).
     9. Cordon, O., Herrera, F., Hoffmann, F., Magdalena, L.: Genetic Fuzzy Systems:
Evolutionary Tuning and Learning of Fuzzy Knowledge Bases. Singapore, New Jersey,
London, Hong Kong, World Scientific Publishing, 462 p. (2001).
     10. Dli, M., Bulygina, O., Kozlov, P., Ross, G.: Developing the economic information
system for automated analysis of unstructured text documents. Journal of Applied Informatics,
vol. 13, no. 5 (77), pp. 51-57 (2018).
     11. Dli, M., Bulygina, O., Kozlov, P.: Development of multimethod approach to rubrication
of unstructured electronic text documents in various conditions. Proceedings of the International
Russian Automation Conference (RusAutoCon), Sochi (2018).
     12. Dli, M., Bulygina, O., Kozlov, P.: Formation of the structure of the intellectual system
of analyzing and rubricating unstructured text information in different situations. Journal of
Applied Informatics, vol. 13, no. 4 (76), pp. 111-123 (2018).
     13. Faifer, M., Janikow, C.: Bottom-up Partitioning in Fuzzy Decision Trees.
Proceedings of the 19th International Conference of the North American Fuzzy Information
Society. IEEE, pp. 326-330 (2000).
     14. Jambu, M.: Hierarchical cluster analysis and correspondences. Moscow: Finance and
statistics (1988).
     15. Janikow, C.: Fuzzy Decision Trees: Issues and Methods. IEEE Transactions on Systems,
Man, and Cybernetics, vol. 28(1), pp. 1-14 (1998).
     16. Kaftannikov, I.L., Parasich, A.V.: Decision Tree’s Features of Application in
Classification Problems. Bulletin of the South Ural State University. Ser. Computer
Technologies, Automatic Control, Radio Electronics, vol. 15, no. 3, pp. 26-32 (2015).
     17. Kalchbrenner, N., Blunsom, P.: Recurrent convolutional neural networks for discourse
compositionality. Workshop on CVSC, pp. 119-126 (2013).
     18. Kim, Y.: Convolutional neural networks for sentence classification. EMNLP,
September, pp. 1746-1751 (2014).
     19. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep
convolutional neural networks. NIPS, pp. 1106-1114 (2012).
     20. Kruglov, V., Dli, M., Golunov, R.: Fuzzy logic and artificial neural networks. Moscow:
Nauka, Fizmatlit (2001).

    21. LeCun, Y.: Text understanding from scratch. Computer Science Department (2016).
    22. Nakagawa, T., Inui, K., Kurohashi, S.: Dependency tree-based sentiment
classification using CRFs with hidden variables. Proceedings of ACL 2010 (2010).
    23. Passino, K., Yurkovich, S.: Fuzzy Control. Addison-Wesley, NJ, 522 p. (1998).
    24. Protasov, S.: LinkGrammar. URL: http://sz.ru/parser/doc/
    25. Quinlan, J.: Induction of decision trees. Machine Learning, vol. 1, no. 1, pp. 81-106
(1986).
    26. Shevelyov, O.G., Petrakov, A.V.: Text classification with decision trees and feed-
forward neural networks. Tomsk State University Journal, vol. 290, pp. 300-307 (2006).
    27. Uchitelev, N.: Classification of text information with the use of SVM. Information
technologies and system, no.1, pp.335-340 (2013).
    28. Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text
classification. Advances in Neural Information Processing Systems, February, pp. 649-657
(2015).

</pre>