Method for Determining Information Proximity Based on Spectral Conversion of Text Documents

Maria A. Butakova
Dean, Rostov State Transport University, Rostov-on-Don, Russia
butakova@rgups.ru

Andrey V. Chernov
Department of Computer Engineering and Automated Control Systems, Rostov State Transport University, Rostov-on-Don, Russia
avcher@rgups.ru

Grigorii S. Miziukov
Center of monitoring quality education, Rostov State Transport University, Rostov-on-Don, Russia
mgs_cmko@rgups.ru


Abstract

The process of identifying key information in unstructured sets of textual information is complex and multifaceted. For this reason, various methods and technologies are being actively developed to improve the analysis process and to narrow the gap between the quality of the results obtained and the computational resources the analysis requires. This article presents an alternative method for determining information proximity in large arrays of textual information. A distinctive feature of the method is the application of spectral conversion of the information together with the means of description logic for the logical inference of the results of analyzing an array of text documents. The main components of the method, as well as the conditions and statements of the logical inference of the analysis results, are considered. The results obtained from testing the method are analyzed. They demonstrate that the method can be applied to semantic classification problems in information decision-making systems.

1 Introduction

With the current trend toward digitalization across knowledge domains, large amounts of information of differing structure are being accumulated. The predominant type in these arrays is unstructured information, presented as numerous multimedia and text files of different formats and linguistic affiliation. Intelligent analysis technologies are used to analyze these types of information [Lij10, Jia14]. Intelligent analysis is a complex of interdisciplinary links by means of which a basic model is built that later serves as a basis for applying various methods. The most commonly used methods are classification, prediction, clustering, association and time series modelling. Semantic analysis, however, is considered an important task within the intelligent analysis of text data [Sar19]. Although there are multiple solutions and approaches in the field of semantic analysis of textual information, not all of them can fully provide a high-quality analysis process, since a number of problems remain, primarily related to identifying semantic links between the analyzed objects. It is also worth noting what distinguishes unstructured information from structured or semi-structured information: it has no structure that describes the stored data, and it is anthropogenic in character. Such an abundance of heterogeneous information makes it necessary to combine several different methods to achieve the desired result. Therefore, this article proposes a method for determining information proximity in large arrays of text information. Its distinctive feature is the use of spectral conversion of information and the means of description logic for the logical inference of the result of analyzing an array of text documents, in order to classify emerging situations and identify redundancy in large arrays of text information.

The article is arranged as follows. Section 2 reviews currently available scientific studies in the selected field. Section 3 describes the proposed method, including its main variables and functions as well as the conditions and statements on which the logical inference of the results is based. Section 4 describes the results of testing the method. Section 5 describes further scientific application of the method. Section 6 concludes the article.

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: A. Khomonenko, B. Sokolov, K. Ivanova (eds.): Selected Papers of the Models and Methods of Information Systems Research Workshop, St. Petersburg, Russia, 4-5 Dec. 2019, published at http://ceur-ws.org

2 Previous Work

The issue of determining information proximity in large arrays of unstructured textual information, for instance, the approach to semantic classification [Bou14, Ma16], is of particular interest for scientific research. Currently, a large number of different methods and technologies are used to analyze text information. Among them one can distinguish methods of extracting knowledge from information, methods of searching for information in coherent texts, clustering, classification and summarization [Tan19]. "Big Data" technology is the most promising and actively developing technology [Fad18, Ous17]. However, despite the constant development of these approaches, there are difficulties that to one degree or another impede high-quality analysis of textual information. In [Jus18] the authors consider some of the difficulties most frequently encountered when analyzing textual information. Particular attention should be paid to the methods of classifying text documents and determining the information proximity between them. The most interesting approaches to problems in this area are presented in [Zha14, Fis08]. In [Zha14] the authors propose an approach based on calculating the semantic similarity of short texts through a language network and word semantics. In [Fis08] the authors propose a group of auxiliary methods for determining information proximity and classifying texts that supplement the formed semantic model with structural information for classification. The following sections propose an alternative approach to determining information proximity and to the semantic classification of texts, based on spectral conversion of information and logical inference by means of description logic.

3 Proposed Method

The process of determining the information proximity between the analyzed objects in large arrays of text information involves identifying similar intersection points, on the basis of which one can make an assumption about the information proximity of two objects. However, this process also determines the unique properties of the objects of analysis. These properties act as a restricting condition and prevent the objects of analysis from being assigned to one category on the basis of primary features alone, thereby making the process of determining information proximity more accurate and effective. Thus, while designing the method the following tasks were set:
1. To identify common and unique properties of the objects of analysis. In this case, this means determining the common and unique lexical units in the text information flow during analysis.
2. To form the structure for representing the identified lexical units.
3. To determine the information proximity between the objects of analysis under the condition that the structures of the identified lexical units may be the same while the objects of analysis belong to different categories, or the structures may be different while the objects belong to the same category.

To solve the above-mentioned problems we propose a method for determining information proximity in arrays of text information based on the methods of spectral representation of information [Vas17, God77] and methods of logical inference by means of description logic [Kri18]. The spectral approach to representing textual information is chosen for the high efficiency of the calculation process, since the analysis operates on numerical values. The means of description logic act as the main mechanism (the core of the method) that, based on the formulated statements, determines the information proximity between the objects of analysis by logical inference. Conventionally, the method can be divided into three components. The first component describes the sets and basic functions performed after initialization of all objects. The second is responsible for the process of spectral analysis and for obtaining the spectra of the analyzed objects. The final component is a set of criteria and statements of description logic.

The process of determining information proximity begins with the initialization of all objects, represented as a set of unstructured text documents D = {d1, d2, ..., dn}, where dn is a text document. In turn, each element dn of the set D is a set of lexical units L = {l1, l2, ..., ln}, where ln is a lexical unit. The totality of lexical units of the set L forms meaningful semantic connections that identify the context K of each element dn of the set D. Thus, the objects D, L and K are initialized at the first stage of the method, after which the functions ReBuildTextStruct() and Intersection() are performed. The purpose of the ReBuildTextStruct() function is the primary structuring of the set of lexical units and obtaining the data model as a set of "key-value" pairs. It is followed by the Intersection() function, which returns the data dictionary φ containing those elements that are also part of the primary data model obtained by the ReBuildTextStruct() function. Below is an example of an algorithm fragment responsible for initialization and for performing the first two functions.

D ← ∅
D ← Δ
for each instance d ∈ D do
    firstDataModel(d) ← ReBuildTextStruct(d),
    where firstDataModel(d) = {fDM_a1, …, fDM_an; fDM_b1, …, fDM_bn} ⇒ fDM_an : fDM_bn
    function interpretation ReBuildTextStruct(d):
        return ← SELECT * WHERE { ?a V_d[…] . ?a V_d[…] ?b . OPTIONAL { ?a V_d[…] ?c . ?c V_d[…] ?d } }
end for
for each instance d ∈ D do
    for each instance fDM ∈ firstDataModel(d) do
        φ ← Intersection(fDM, d), where Intersection(fDM, d) is a query that checks fDM_n ∩ d_n ≠ ∅
        interpretation of the function Intersection(fDM, d):
            return ← fDM.SELECT x_k → x_k′ ∧ x_v → x_v′ . WHERE x_v′ = r_d . ToDictionary()
    end for
end for

The initial data preparation is followed by the spectral conversion stage. The method of singular transformation (singular value decomposition) was chosen as the main approach for obtaining the spectrum of information; its operation is described in detail in [Miz18, Mal19]. At this stage two functions are performed: GetAdjacencyMatrix() and SVD(). The result of this stage is a set of singular values that, in the terminology of spectral theory, represents the spectrum of the analyzed data object of the set D. A fragment of the second part of the algorithm is presented below.

for each instance fDM ∈ firstDataModel(d) do
    M ← GetAdjacencyMatrix(fDM)
    SV ← SVD(M), where SVD() is a method of singular transformation and SV is the set of singular values {SV_1, SV_2, …, SV_n}
end for

The final step in the method is performing the IsSimilar() function, which returns the result ψ containing the response on the information proximity between the objects of analysis. The IsSimilar() function is a set of conditions for the feasibility of the process of determining the information proximity between the objects and a set of statements suggesting the possibility of information proximity between the objects. The final fragment of the algorithm, including a description of the IsSimilar() function, is given below.

ψ ← IsSimilar(SV, φ)

Interpretation of the function IsSimilar():
K = T ∪ A, where K is the knowledge base, T is the TBox and A is the ABox.

Conditions for IsSimilar() function feasibility:
1. Termination. For any (SV, φ, T), the function Θ gives a response Θ(SV, φ, T) in a finite amount of time t^m, where m is SV_a × SV_b = {(j, h) ∣ j ∈ SV_a, h ∈ SV_b} ∧ φ_a × φ_b = {(s, e) ∣ s ∈ φ_a, e ∈ φ_b}.
2. Correctness. For any (SV, φ, T), if the concepts SV, φ are feasible relative to T, then Θ(SV, φ, T) = 1.
3. Completeness. For any (SV, φ, T), if Θ(SV, φ, T) = 1, then the concepts SV, φ are feasible relative to T.
Feasibility conditions 2 and 3 come to
    U(T) = ⊤, if U ⊨ ⊤
    U(T) = ⊥, if U ⊭ ⊤

Statements:
1. For any concepts SV_a, SV_b and terminology T there is a concept SV_b ⊆ SV_a, that is, T ⊨ SV_b ≡ SV_a ⇔ T ⊨ SV_b ⊆ SV_a and T ⊨ SV_a ⊆ SV_b and ⊨ φ_b ⊆ φ_a.
2. There is at least one individual i_SV that belongs to the concept SV_a ⇔ ∃ i_SV ∈ SV_a, that is, K ⊨ i_SV : SV_a ⇔ (T, A ∪ i_SV : SV_a).
For the concept φ the statements are similar. Thus, if statement 2 = ⊤, then statement 1 = ⊤, that is, there is an interpretation I = (Δ, ·^I) for which K ⊨ i_SV : SV_a.
The proof of the statements reduces to the following rules:
1. ∀z⃗ {z⃗ ⊨ T ∣ z⃗ ∈ Δ^I}
2. ∃z⃗ {z⃗ ⊨ T ∣ z⃗ ∈ Δ^I}, where z⃗ = {z⃗_1, z⃗_2, …, z⃗_n} are the singular values of the concepts SV_a and SV_b.

if ⊤
    return ← (bool) similar ⇒ true
else if ⊥
    return ← (bool) not similar ⇒ false
end if

4 Example and Discussion

To test the proposed method, an array of more than 10 000 unstructured text documents on different subjects was formed. The documents in the array differed in file extension and language affiliation. A server with the following configuration was selected as the test environment:
- CPUs: 40 × Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz;
- RAM: 257826 MB;
- OS: Ubuntu Server Edition;
- Apache: 2.4.10;
- MySQL: 5.7.21-20;
- Nginx: 1.13.4;
- PHP: 7.3.

The quality of the obtained results was assessed according to the following criteria:
- percentage of determining information proximity;
- number of identified classification groups;
- possibility of information proximity at the same spectrum and context;
- possibility of information proximity at the same spectrum but a different context;
- possibility of information proximity at a different spectrum but the same context;
- possibility of information proximity at a different spectrum and context.
Based on the above-mentioned criteria, the results of the proposed method for determining information proximity were obtained (Fig. 1, 2). Figure 1 shows the dynamics of determining
information proximity. The diagram shows that the percentage of information proximity varies from 20% to 90%, while the average boundary for determining information proximity is approximately 61%. It is also worth noting that both curves have the same distribution, which indicates the consistency of the results obtained after comparing the two operating modes of the algorithm (statements 1, 2).

Figure 1: Dynamics of determining information proximity

Figure 2 shows the final distribution of the array of unstructured documents. This distribution shows that the algorithm identified the 10 highest-priority categories, among which the analyzed documents were further classified. Each identified classification group included from 6% to 10% of the total number of documents contained in the array.

Figure 2: Distribution of documents by classification categories

5 Future Research

The process of determining the information proximity between the analyzed text documents in large arrays of information has significant potential in problems of semantic classification. The data sets obtained as a result of testing can be used to build more comprehensive thematic dictionaries of subject areas, which can in turn be used in management decision-making systems and situational management. In addition, the process of deriving the results of logical statements can be accompanied by visualization to reflect more fully the map of semantic relations between various text documents in information arrays.

6 Conclusion

The article considers a method that offers an alternative approach to the problem of determining information proximity between sets of objects represented as a set of unstructured text documents. The results of the experiment show a high degree of success in determining information proximity, with an optimal ratio between the execution time of all operations and the use of computational resources. Based on this, it is proposed to apply the method to problems of semantic classification in decision-making information systems in order to classify emerging situations and identify redundancy in large arrays of textual information, thereby reducing the amount of information that must be stored and the response time to an incoming query.

Acknowledgment. The reported study was funded by the Russian Foundation for Basic Research, according to the research projects No. 19-01-00246-a, 18-01-00402-a.

References

[Lij10] C. Lijun, Y. Hongkui, L. Yuxiang, L. Xiyin: Research and exploration of text mining technology. 2nd International Conference on Advanced Computer Control, vol. 5, pp. 435-439 (2010).
[Jia14] M. Jiang, Y. Zhou, X. Fan, O. Wang, X. Zhang, Z. Zhang, J. Lian, Z. Pei: A Variety of Text Mining Technology and Tools Research. 2014 International Conference on Mechatronics, Electronic, Industrial and Control Engineering, pp. 918-921 (2014).
[Sar19] D. Sarkar: Semantic Analysis. In: Text Analytics with Python. Apress, Berkeley, CA (2019).
        https://doi.org/10.1007/978-1-4842-4354-1_8
[Bou14] A. Bouaziz, C. Dartigues-Pallez, C. da Costa Pereira, F. Precioso, P. Lloret: Short Text Classification Using Semantic Random Forest. In: Bellatreche L., Mohania M.K. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2014. Lecture Notes in Computer Science, vol. 8646, pp. 288-299. Springer, Cham (2014).
        https://doi.org/10.1007/978-3-319-10160-6_26
[Ma16] H. Ma, R. Zhou, F. Liu, X. Lu: Effectively Classifying Short Texts via Improved Lexical Category and Semantic Features. In: Huang DS., Bevilacqua V., Premaratne P. (eds) Intelligent Computing Theories and Application. ICIC 2016. Lecture Notes in Computer Science, vol. 9771, pp. 163-174. Springer, Cham (2016).
        https://doi.org/10.1007/978-3-319-42291-6_16
[Tan19] S. Tandel, J. Abhishek, D. Siddharth: A Survey on Text Mining Techniques. 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS), pp. 1022-1026 (2019).
        https://doi.org/10.1109/ICACCS.2019.8728547
[Fad18] S. Fadiya, A. Sari: The importance of big data technology. International Journal of Engineering & Technology, vol. 7, pp. 485 (2018).
[Ous17] A. Oussous, F-Z. Benjelloun, A. Ait Lahcen, S. Belfkih: Big Data Technologies: A Survey. Journal of King Saud University - Computer and Information Sciences, vol. 30, pp. 431-448 (2017).
        https://doi.org/10.1016/j.jksuci.2017.06.001
[Jus18] C. Justicia, D. Sánchez, I. Blanco, M. Martin-Bautista: Text Mining: Techniques, Applications, and Challenges. International Journal of Uncertainty Fuzziness and Knowledge-Based Systems, vol. 26, pp. 553-582 (2018).
        https://doi.org/10.1142/S0218488518500265
[Zha14] Z. Zhan, F. Lin, X. Yang: Semantic Similarity Calculation of Short Texts Based on Language Network and Word Semantic Information. In: Wu J., Chen H., Wang X. (eds) Advanced Computer Architecture. Communications in Computer and Information Science, vol. 451, pp. 215-228. Springer, Berlin, Heidelberg (2014).
        https://doi.org/10.1007/978-3-662-44491-7_17
[Fis08] J.M. Fishbein, C. Eliasmith: Methods for Augmenting Semantic Models with Structural Information for Text Classification. In: Macdonald C., Ounis I., Plachouras V., Ruthven I., White R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol. 4956, pp. 575-579. Springer, Berlin, Heidelberg (2008).
        https://doi.org/10.1007/978-3-540-78646-7_58
[Vas17] H.L. Vasudeva: Spectral Theory and Special Classes of Operators. In: Elements of Hilbert Spaces and Operator Theory. Springer, Singapore (2017).
        https://doi.org/10.1007/978-981-10-3020-8_4
[God77] C. Godsil, D.A. Holton, B. McKay: The spectrum of a graph. In: Little C.H.C. (eds) Combinatorial Mathematics V. Lecture Notes in Mathematics, vol. 622. Springer, Berlin, Heidelberg (1977).
        https://doi.org/10.1007/BFb0069184
[Kri18] A. Krisnadhi, P. Hitzler: Description Logics. In: Alhajj R., Rokne J. (eds) Encyclopedia of Social Network Analysis and Mining, pp. 572-581. Springer, New York, NY (2018).
        https://doi.org/10.1007/978-1-4939-7131-2_108
[Miz18] G.S. Miziukov: Finding similarity between unstructured data objects on the basis of the method of singular decomposition of the spectrum of a graph (2018).
        http://www.ivdon.ru/uploads/article/pdf/IVD_19_Miziukov_N.pdf_e0a3d9ae84.pdf. Accessed 10 Nov 2019
[Mal19] M.M. Malamud: On Singular Spectrum of Finite-Dimensional Perturbations (toward the Aronszajn–Donoghue–Kac Theory). Dokl. Math., vol. 100, pp. 358-362 (2019).
        https://doi.org/10.1134/S1064562419040124
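
As a supplementary illustration of the three stages described in Section 3 (primary structuring and intersection, spectral conversion, and the similarity decision), the following Python sketch may help. It is not the authors' implementation. The function names are hypothetical Python counterparts of ReBuildTextStruct(), Intersection(), GetAdjacencyMatrix(), SVD() and IsSimilar(). Several details are assumptions made only for this sketch: lexical units are extracted with a simple regular expression, the adjacency matrix links lexical units that occur next to each other, and the description logic inference is replaced by a plain numeric check that every singular value of one spectrum also occurs in the other (an analogue of SV_b ⊆ SV_a) together with a non-empty dictionary φ. Because the adjacency matrix here is symmetric, its singular values equal the absolute values of its eigenvalues, so the sketch obtains them with cyclic Jacobi rotations instead of a general SVD routine.

```python
import math
import re
from collections import Counter


def rebuild_text_struct(text):
    """Primary structuring: map each lexical unit (key) to its count (value)."""
    return Counter(re.findall(r"\w+", text.lower()))


def intersection(model_a, model_b):
    """Dictionary phi of the lexical units present in both data models."""
    return {k: (model_a[k], model_b[k]) for k in model_a if k in model_b}


def get_adjacency_matrix(text):
    """Symmetric 0/1 matrix linking lexical units that occur next to each other."""
    tokens = re.findall(r"\w+", text.lower())
    index = {w: i for i, w in enumerate(sorted(set(tokens)))}
    m = [[0.0] * len(index) for _ in index]
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            m[index[a]][index[b]] = m[index[b]][index[a]] = 1.0
    return m


def singular_values(m, sweeps=50, tol=1e-20):
    """Singular values of a symmetric matrix via cyclic Jacobi rotations.

    For a symmetric matrix they are the absolute values of the eigenvalues.
    """
    n = len(m)
    a = [row[:] for row in m]
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the off-diagonal entry a[p][q].
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((abs(a[i][i]) for i in range(n)), reverse=True)


def is_similar(sv_a, sv_b, phi, tol=1e-6):
    """Numeric stand-in for IsSimilar(): sv_b is contained in sv_a and phi is non-empty."""
    contained = all(any(abs(x - y) <= tol for y in sv_a) for x in sv_b)
    return bool(phi) and contained


# Three toy documents: d_a and d_b use the same words in a different order.
d_a = "the cat sat on the mat"
d_b = "the mat sat on the cat"
d_c = "the cat sat"

sv_a, sv_b, sv_c = (singular_values(get_adjacency_matrix(d)) for d in (d_a, d_b, d_c))
phi_ab = intersection(rebuild_text_struct(d_a), rebuild_text_struct(d_b))
phi_ac = intersection(rebuild_text_struct(d_a), rebuild_text_struct(d_c))

print(is_similar(sv_a, sv_b, phi_ab))  # True: same spectrum, shared lexical units
print(is_similar(sv_a, sv_c, phi_ac))  # False: shared lexical units, different spectrum
```

In this toy example the first two documents produce adjacency graphs that differ only in how the vertices are labelled, so their spectra coincide, while the third document shares lexical units with the first but has a different spectrum. This corresponds to the criteria "same spectrum and context" and "different spectrum but the same context" listed in Section 4, under the simplifying assumption that the shared dictionary φ stands in for the context.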