Method for Determining Information Proximity Based on Spectral Conversion of Text Documents

Maria A. Butakova
Dean, Rostov State Transport University, Rostov-on-Don, Russia
butakova@rgups.ru

Andrey V. Chernov
Department of Computer Engineering and Automated Control Systems, Rostov State Transport University, Rostov-on-Don, Russia
avcher@rgups.ru

Grigorii S. Miziukov
Center of monitoring quality education, Rostov State Transport University, Rostov-on-Don, Russia
mgs_cmko@rgups.ru

Abstract

The process of identifying key information in unstructured sets of textual information is complex and multifaceted. For this reason, various methods and technologies are being actively developed to improve the analysis process and to narrow the gap between the quality of the obtained results and the computational resources required for the analysis. This article presents an alternative method for determining information proximity in large arrays of textual information. A distinctive feature of the method is the application of spectral conversion of information and the use of description logic for logical inference of the analysis results for an array of text documents. The main components of the method, as well as the conditions and statements underlying the logical inference of the analysis results, are considered. The results obtained during approbation of the method are analyzed. They demonstrate that the method can be applied to semantic classification problems in information decision-making systems.

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: A. Khomonenko, B. Sokolov, K. Ivanova (eds.): Selected Papers of the Models and Methods of Information Systems Research Workshop, St. Petersburg, Russia, 4-5 Dec. 2019, published at http://ceur-ws.org

1 Introduction

In the context of the current trend toward digitalization of various aspects of knowledge domains, large amounts of information of different structure are accumulated. The predominant type in these arrays is unstructured information presented in the form of numerous multimedia and text files of different formats and languages. Intelligent analysis technologies are used to analyze these types of information [Lij10, Jia14]. Intelligent analysis is a complex of interdisciplinary links by means of which a basic model is built that later serves as a basis for applying various methods. The most commonly used methods are classification, prediction, clustering, association and time series modelling. Semantic analysis, however, is considered an important task in its own right within the framework of intelligent analysis of text data [Sar19]. Although there are many solutions and approaches in the field of semantic analysis of textual information, not all of them can fully provide a high-quality analysis process, since there are a number of problems related primarily to the identification of semantic links between the analyzed objects. It is also worth noting what distinguishes unstructured information from structured or semi-structured information: it has no structure describing the stored data, and it is of anthropogenic origin. Such an abundance of heterogeneous information makes it necessary to combine several different methods to achieve the desired result. Therefore, this article proposes a method for determining information proximity in large arrays of text information. Its distinctive feature is the use of spectral conversion of information and of description logic for logical inference of the result of analyzing an array of text documents, in order to classify emerging situations and identify redundancy in large arrays of text information.

The article is arranged as follows. Section 2 reviews currently available studies in the field. Section 3 describes the proposed method, including its main variables and functions and the conditions and statements on which the logical inference of the results is based. Section 4 describes the results of testing the method. Section 5 describes further scientific application of the method. Section 6 concludes the article.

2 Previous Work

The issue of determining information proximity in large arrays of unstructured textual information, for instance the approach to semantic classification [Bou14, Ma16], is of particular interest for scientific research. Currently, a large number of different methods and technologies are used to analyze text information. Among them one can distinguish methods of extracting knowledge from information, methods of searching for information in coherent texts, clustering, classification and summarization [Tan19]. "Big Data" is the most promising and actively developing technology [Fad18, Ous17]. However, despite the constant development of these approaches, there are difficulties that to one degree or another impede high-quality analysis of textual information. The authors of [Jus18] consider some of the difficulties most frequently encountered in analyzing textual information. Particular attention should be paid to methods of classifying text documents and determining the information proximity between them. The most interesting approaches to problems in this area are presented in [Zha14, Fis08]. In [Zha14] the authors propose an approach based on calculating the semantic similarity of short texts through a language network and word semantics. In [Fis08] the authors propose a group of auxiliary methods for determining information proximity and classifying texts that supplement the semantic model with structural information for classification. The following sections propose an alternative approach to determining information proximity and semantic classification of texts, based on spectral conversion of information and logical inference by means of description logic.
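3 Proposed Method

The process of determining the information proximity between the analyzed objects in large arrays of text information involves identifying similar intersection points, on the basis of which one can make an assumption about the information proximity of two objects. However, this process also determines the unique properties of the objects of analysis. These act as a preventive condition and do not allow the objects to be assigned to one category on the basis of primary features alone, which makes the process of determining information proximity more accurate and effective. Thus, the following tasks were set in designing the method:

1. To identify common and unique properties of the objects of analysis. In this case, these are the common and unique lexical units in the text information flow being analyzed.
2. To form the structure of representation of the identified lexical units.
3. To determine the information proximity between the objects of analysis, given that the structures of the identified lexical units may be the same while the objects belong to different categories, or the structures may be different while the objects belong to the same category.

To solve the above-mentioned problems we propose a method for determining the information proximity in text arrays of information based on methods of spectral representation of information [Vas17, God77] and methods of logical inference by means of description logic [Kri18]. The spectral approach to the representation of textual information is chosen for its computational efficiency, since the analysis operates on numerical values. Description logic acts as the main mechanism (the core of the method) that, based on formulated statements, determines the information proximity between the objects of analysis by logical inference.

Conventionally, the method can be divided into three components. The first component describes the sets and basic functions performed after initialization of all objects. The second is responsible for spectral analysis and for obtaining the spectra of the analyzed objects. The third is a set of criteria and statements of description logic.

The process of determining information proximity begins with the initialization of all objects, represented as a set of unstructured text documents $D = \{d_1, d_2, \dots, d_n\}$, where $d_n$ is a text document. In turn, each element $d_n$ of the set $D$ is a set of lexical units $L = \{l_1, l_2, \dots, l_n\}$, where $l_n$ is a lexical unit. The totality of lexical units of the set $L$ forms meaningful semantic connections that identify the context $K$ of each element $d_n$ of the set $D$. Thus, the objects $D$, $L$ and $K$ are initialized at the first stage of the method, after which the functions ReBuildTextStruct() and Intersection() are performed. The purpose of ReBuildTextStruct() is the primary structuring of the set of lexical units and obtaining the data model as a set of "key-value" pairs. Intersection() is then performed; it returns the data dictionary $\varphi$ containing the elements that are also part of the primary data model obtained by ReBuildTextStruct(). A fragment of the algorithm responsible for initialization and for performing the first two functions is given below.

    $D \leftarrow \emptyset$
    $D \leftarrow \Delta$
    for each instance $d \in D$ do
        $firstDataModel(d) \leftarrow ReBuildTextStruct(d)$,
        where $firstDataModel(d) = \{fDM^{1}_{1}, \dots, fDM^{1}_{n};\ fDM^{2}_{1}, \dots, fDM^{2}_{n}\} \Rightarrow fDM^{1}_{n} : fDM^{2}_{n}$
        interpretation of the function $ReBuildTextStruct(d)$:
        return $\leftarrow$ SELECT * WHERE { ?a $V_{Criterion}$ . ?a $V_{Value}$ ?b . OPTIONAL { ?a $V_{Child}$ ?c . ?c $V_{Value}$ ?d } }
    end for
    for each instance $d \in D$ do
        for each instance $fDM \in firstDataModel(d)$ do
            $\varphi \leftarrow Intersection(fDM, d)$, where $Intersection(fDM, d)$ is a query that checks $fDM_n \cap d_n \neq \emptyset$
            interpretation of the function $Intersection(fDM, d)$:
            return $\leftarrow fDM$ . SELECT $x_o \to x_{oi} \wedge x_r \to x_{ri}$ . WHERE $x_{ri} = r_d$ . ToDictionary()
        end for
    end for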
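The first stage can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: lexical units are taken to be lowercase word tokens, the value stored for each key is the token's frequency, and the names `lexical_units`, `rebuild_text_struct` and `intersection` are hypothetical stand-ins for the initialization step, ReBuildTextStruct() and Intersection().

```python
import re
from collections import Counter


def lexical_units(document):
    """Split a document into lexical units (assumption: lowercase word tokens)."""
    return re.findall(r"[^\W\d_]+", document.lower())


def rebuild_text_struct(document):
    """Hypothetical ReBuildTextStruct(): build the primary key-value data model.

    Assumption: key = lexical unit, value = number of occurrences.
    """
    return dict(Counter(lexical_units(document)))


def intersection(fdm, document):
    """Hypothetical Intersection(): dictionary phi holding the entries of the
    data model `fdm` whose keys also occur among the units of `document`."""
    units = set(lexical_units(document))
    return {key: value for key, value in fdm.items() if key in units}


fdm = rebuild_text_struct("Spectral analysis of text. Text analysis!")
phi = intersection(fdm, "Text mining and analysis")
print(fdm)  # key-value model of the first document
print(phi)  # entries shared with the second document
```

An empty `phi` would correspond to the case where the intersection check fails and the two objects share no lexical units.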
The initial data preparation is followed by the spectral conversion stage. The method of singular transformation (singular value decomposition) was chosen as the main approach for obtaining the spectrum of information; its detailed operation is described in [Miz18, Mal19]. At this stage two functions are performed: GetAdjacencyMatrix() and SVD(). The result of this stage is a set of singular values that, in the terminology of spectral theory, represent the spectrum of the analyzed data object of the set $D$. A fragment of the second part of the algorithm is presented below.

    for each instance $fDM \in firstDataModel(d)$ do
        $M \leftarrow GetAdjacencyMatrix(fDM)$
        $SV \leftarrow SVD(M)$, where $SVD()$ is the method of singular transformation and $SV = \{SV_1, SV_2, \dots, SV_n\}$ are the singular values
    end for
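The spectral stage can be sketched in Python as well. How the adjacency matrix is actually built is described in [Miz18] and is not reproduced here; the sketch assumes, purely for illustration, a graph whose vertices are distinct lexical units and whose edges join units that are adjacent in the text. Because such a matrix is symmetric, its singular values equal the absolute values of its eigenvalues, so a small Jacobi eigenvalue iteration stands in for a library SVD routine. The function names `adjacency_matrix` and `singular_values` are hypothetical.

```python
import math


def adjacency_matrix(tokens):
    """Symmetric 0/1 matrix: vertices are distinct tokens, edges link
    tokens that are neighbours in the sequence (illustrative assumption)."""
    vocab = sorted(set(tokens))
    index = {t: i for i, t in enumerate(vocab)}
    n = len(vocab)
    m = [[0.0] * n for _ in range(n)]
    for a, b in zip(tokens, tokens[1:]):
        if a != b:
            i, j = index[a], index[b]
            m[i][j] = m[j][i] = 1.0
    return vocab, m


def singular_values(sym, sweeps=50, tol=1e-20):
    """Singular values of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in sym]
    n = len(a)
    for _ in range(sweeps):
        # Stop once the squared off-diagonal mass is negligible.
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes the entry a[p][q].
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    # Absolute eigenvalues of a symmetric matrix are its singular values.
    return sorted((abs(a[i][i]) for i in range(n)), reverse=True)


vocab, matrix = adjacency_matrix(["a", "b", "c"])
print(vocab, singular_values(matrix))
```

Where NumPy is available, `numpy.linalg.svd(matrix, compute_uv=False)` would return the singular values directly and is the more idiomatic choice; the pure-Python iteration is used here only to keep the sketch dependency-free.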
The final step of the method is performing the IsSimilar() function, which returns the result $\psi$ containing the response on the information proximity between the objects of analysis. The IsSimilar() function is a set of conditions for the feasibility of the process of determining the information proximity between the objects, together with a set of statements suggesting the possibility of information proximity between them. The final fragment of the algorithm, which includes a description of the IsSimilar() function, is given below.

    $\psi \leftarrow IsSimilar(SV, \varphi)$

Interpretation of the function IsSimilar(): $K = T \cup A$, where $K$ is the knowledge base, $T$ is the TBox and $A$ is the ABox.

Conditions for IsSimilar() function feasibility:

1. Termination. For any $(SV, \varphi, T)$ the function $\Theta$ gives a response $\Theta(SV, \varphi, T)$ in a finite amount of time $t^{m}$, where $m$ is $SV^{1} \times SV^{2} = \{(j, h) \mid j \in SV^{1}, h \in SV^{2}\} \wedge \varphi_1 \times \varphi_2 = \{(s, e) \mid s \in \varphi_1, e \in \varphi_2\}$.
2. Correctness. For any $(SV, \varphi, T)$, if the concepts $SV$, $\varphi$ are feasible relative to $T$, then $\Theta(SV, \varphi, T) = 1$.
3. Completeness. For any $(SV, \varphi, T)$, if $\Theta(SV, \varphi, T) = 1$, then the concepts $SV$, $\varphi$ are feasible relative to $T$.

Feasibility conditions 2 and 3 reduce to

$U(T) = \begin{cases} \top, & \text{if } U \models \top \\ \bot, & \text{if } U \not\models \top \end{cases}$

Statements:

1. For any concepts $SV^{1}$, $SV^{2}$ and terminology $T$ there is a concept $SV^{2} \subseteq SV^{1}$, that is, $T \models SV^{2} \equiv SV^{1} \Leftrightarrow T \models SV^{2} \subseteq SV^{1}$ and $T \models SV^{1} \subseteq SV^{2}$ and $\models \varphi_2 \subseteq \varphi_1$.
2. There is at least one individual $i$ that belongs to the concept $SV^{1}$, i.e. $\exists i \in SV^{1}$, that is, $K \models i : SV^{1} \Leftrightarrow (T, A \cup i : SV^{1})$.

For the concept $\varphi$ the statements are similar. Thus, if statement 2 $= \top$, then statement 1 $= \top$; that is, there is an interpretation $I = (\Delta, \cdot^{I})$ for which $K \models i : SV^{1}$.

The proof of the statements reduces to the following rules:

1. $\forall \vec{z}\,\{\vec{z} \models T \mid \vec{z} \in \Delta^{I}\}$
2. $\exists \vec{z}\,\{\vec{z} \models T \mid \vec{z} \in \Delta^{I}\}$, where $\vec{z} = \{\vec{z}_1, \vec{z}_2, \dots, \vec{z}_n\}$ are the singular values of the concepts $SV^{1}$ and $SV^{2}$.

    if $\top$ then
        return $\leftarrow$ (bool) similar $\Rightarrow$ true
    else if $\bot$ then
        return $\leftarrow$ (bool) not similar $\Rightarrow$ false
    end if
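As a rough numeric stand-in for the Boolean decision returned by IsSimilar(), the Python sketch below compares two spectra and checks whether the dictionary $\varphi$ is non-empty. It replaces the description-logic inference of the method with a simple tolerance test, so it illustrates only the shape of the inputs and of the result. The tolerance, the zero-padding of spectra of different length, and the names `same_spectrum` and `is_similar` are assumptions.

```python
def same_spectrum(sv_a, sv_b, tol=1e-6):
    """True if two spectra agree element-wise within `tol`.

    Assumption: spectra are compared in descending order and the
    shorter one is padded with zeros.
    """
    n = max(len(sv_a), len(sv_b))
    a = sorted(sv_a, reverse=True) + [0.0] * (n - len(sv_a))
    b = sorted(sv_b, reverse=True) + [0.0] * (n - len(sv_b))
    return all(abs(x - y) <= tol for x, y in zip(a, b))


def is_similar(sv_a, sv_b, phi, tol=1e-6):
    """Hypothetical numeric stand-in for IsSimilar(): the spectra must
    match and the shared dictionary phi must be non-empty."""
    return same_spectrum(sv_a, sv_b, tol) and bool(phi)


print(is_similar([1.5, 1.5, 0.0], [1.5, 1.5], {"text": 2}))  # True
print(is_similar([2.0, 1.0], [1.0, 1.0], {"text": 1}))       # False
print(is_similar([1.0], [1.0], {}))                          # False
```

The three calls correspond to matching spectrum with shared units, differing spectrum with shared units, and matching spectrum without shared units.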
4 Example and Discussion

To test the proposed method, an array of more than 10 000 unstructured text documents on different subjects was formed. The documents in the array differed in file extension and language. A server with the following configuration was used as the test environment:
- CPUs: 40 × Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz;
- RAM: 257826 MB;
- OS: Ubuntu Server Edition;
- Apache: 2.4.10;
- MySQL: 5.7.21-20;
- Nginx: 1.13.4;
- PHP: 7.3.

The quality of the obtained results was assessed according to the following criteria:
- percentage of determined information proximity;
- number of identified classification groups;
- possibility of information proximity with the same spectrum and the same context;
- possibility of information proximity with the same spectrum but a different context;
- possibility of information proximity with a different spectrum but the same context;
- possibility of information proximity with a different spectrum and a different context.

Based on these criteria, the results of the proposed method for determining information proximity were obtained (Fig. 1, 2). Figure 1 shows the dynamics of determining information proximity. The diagram shows that the percentage of information proximity varies from 20% to 90%, while the average boundary for determining information proximity is about 61%. It is also worth noting that both curves have the same distribution, which indicates the consistency of the results obtained when comparing the two operating modes of the algorithm (statements 1, 2).

Figure 1: Dynamics of determining information proximity

Figure 2 shows the final distribution of the array of unstructured documents. According to the results of the algorithm, the 10 highest-priority categories were identified, among which the analyzed documents were further classified. Each identified classification group included from 6% to 10% of the total number of documents in the array.

Figure 2: Distribution of documents by classification categories

5 Future Research

The process of determining the information proximity between the analyzed text documents in large arrays of information has significant potential for problems of semantic classification. The data sets obtained as a result of testing can be used to build more comprehensive thematic dictionaries of subject areas for use in management decision-making systems and situational management. In addition, the process of deriving the results of logical statements can be accompanied by visualization, to reflect more fully the map of semantic relations between text documents in information arrays.

6 Conclusion

The article considers a method that offers an alternative approach to the problem of determining information proximity between objects represented as a set of unstructured text documents. The results of the experiment show a high degree of information proximity determination with an optimal ratio between the execution time of all operations and the use of computational resources. On this basis, it is proposed to apply the method to problems of semantic classification in decision-making information systems, in order to classify emerging situations and identify redundancy in large arrays of textual information, thereby reducing the amount of stored information and the response time to an incoming query.

Acknowledgment. The reported study was funded by the Russian Foundation for Basic Research, according to the research projects No. 19-01-00246-a, 18-01-00402-a.

References

[Lij10] C. Lijun, Y. Hongkui, L. Yuxiang, L. Xiyin: Research and exploration of text mining technology. 2nd International Conference on Advanced Computer Control, vol. 5, pp. 435-439 (2010).
[Jia14] M. Jiang, Y. Zhou, X. Fan, O. Wang, X. Zhang, Z. Zhang, J. Lian, Z. Pei: A Variety of Text Mining Technology and Tools Research. 2014 International Conference on Mechatronics, Electronic, Industrial and Control Engineering, pp. 918-921 (2014).
[Sar19] D. Sarkar: Semantic Analysis. In: Text Analytics with Python. Apress, Berkeley, CA (2019). https://doi.org/10.1007/978-1-4842-4354-1_8
[Bou14] A. Bouaziz, C. Dartigues-Pallez, C. da Costa Pereira, F. Precioso, P. Lloret: Short Text Classification Using Semantic Random Forest. In: Bellatreche L., Mohania M.K. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2014. Lecture Notes in Computer Science, vol. 8646, pp. 288-299. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10160-6_26
[Ma16] H. Ma, R. Zhou, F. Liu, X. Lu: Effectively Classifying Short Texts via Improved Lexical Category and Semantic Features. In: Huang DS., Bevilacqua V., Premaratne P. (eds) Intelligent Computing Theories and Application. ICIC 2016. Lecture Notes in Computer Science, vol. 9771, pp. 163-174. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-42291-6_16
[Tan19] S. Tandel, J. Abhishek, D. Siddharth: A Survey on Text Mining Techniques. 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS), pp. 1022-1026 (2019). https://doi.org/10.1109/ICACCS.2019.8728547
[Fad18] S. Fadiya, A. Sari: The importance of big data technology. International Journal of Engineering & Technology, vol. 7, p. 485 (2018).
[Ous17] A. Oussous, F-Z. Benjelloun, A. Ait Lahcen, S. Belfkih: Big Data Technologies: A Survey. Journal of King Saud University - Computer and Information Sciences, vol. 30, pp. 431-448 (2017). https://doi.org/10.1016/j.jksuci.2017.06.001
[Jus18] C. Justicia, D. Sánchez, I. Blanco, M. Martin-Bautista: Text Mining: Techniques, Applications, and Challenges. International Journal of Uncertainty Fuzziness and Knowledge-Based Systems, vol. 26, pp. 553-582 (2018). https://doi.org/10.1142/S0218488518500265
[Zha14] Z. Zhan, F. Lin, X. Yang: Semantic Similarity Calculation of Short Texts Based on Language Network and Word Semantic Information. In: Wu J., Chen H., Wang X. (eds) Advanced Computer Architecture. Communications in Computer and Information Science, vol. 451, pp. 215-228. Springer, Berlin, Heidelberg (2014). https://doi.org/10.1007/978-3-662-44491-7_17
[Fis08] J.M. Fishbein, C. Eliasmith: Methods for Augmenting Semantic Models with Structural Information for Text Classification. In: Macdonald C., Ounis I., Plachouras V., Ruthven I., White R.W. (eds) Advances in Information Retrieval. ECIR 2008. Lecture Notes in Computer Science, vol. 4956, pp. 575-579. Springer, Berlin, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78646-7_58
[Vas17] H.L. Vasudeva: Spectral Theory and Special Classes of Operators. In: Elements of Hilbert Spaces and Operator Theory. Springer, Singapore (2017). https://doi.org/10.1007/978-981-10-3020-8_4
[God77] C. Godsil, D.A. Holton, B. McKay: The spectrum of a graph. In: Little C.H.C. (eds) Combinatorial Mathematics V. Lecture Notes in Mathematics, vol. 622. Springer, Berlin, Heidelberg (1977). https://doi.org/10.1007/BFb0069184
[Kri18] A. Krisnadhi, P. Hitzler: Description Logics. In: Alhajj R., Rokne J. (eds) Encyclopedia of Social Network Analysis and Mining, pp. 572-581. Springer, New York, NY (2018). https://doi.org/10.1007/978-1-4939-7131-2_108
[Miz18] G.S. Miziukov: Finding similarity between unstructured data objects on the basis of the method of singular decomposition of the spectrum of a graph (2018). http://www.ivdon.ru/uploads/article/pdf/IVD_19_Miziukov_N.pdf_e0a3d9ae84.pdf. Accessed 10 Nov 2019
[Mal19] M.M. Malamud: On Singular Spectrum of Finite-Dimensional Perturbations (toward the Aronszajn-Donoghue-Kac Theory). Dokl. Math., vol. 100, pp. 358-362 (2019). https://doi.org/10.1134/S1064562419040124