Ontology-based classification model of text resources of an electronic archive

Vadim Moshkin
Information Systems department, Ulyanovsk State Technical University, Ulyanovsk, Russia
v.moshkn@ulstu.ru

Anton Zarubin
The Bonch-Bruevich Saint-Petersburg State University of Telecommunication, St. Petersburg, Russia
azarubin@sut.ru

Albina Koval
The Bonch-Bruevich Saint-Petersburg State University of Telecommunication, St. Petersburg, Russia
akoval@sut.ru

Abstract. This paper presents an ontological model of a text document as an electronic archive resource and an ontology-based algorithm for the classification of technical documents. It also reports the results of experiments that confirm the effectiveness of the models and algorithms in classifying the documents of an electronic archive, and assesses the use of linguistic and statistical algorithms for determining the terms of poorly structured information resources.

Keywords: ontology, classification, ontological representation, machine learning, electronic archive

I. INTRODUCTION

The task of categorizing text documents to simplify the search for information in the electronic archive of an organization is more relevant than ever. In most cases, archives are structured manually by archive specialists, who must know the subject area and take into account the specifics of the stored documentation.

Automated categorization of an archive of electronic text documents should take into account the semantics of the information in the documents. Otherwise, the experience of the highly qualified specialists who developed this documentation will be difficult to extract from unstructured resources for further use.

Researchers currently offer various ways to solve this problem. In [1], an ant colony classification algorithm is used to classify data and to search large amounts of data in intelligent archives quickly.

The characteristics of a text document are taken into account during its analysis and processing and are included in the document model.

The extended Boolean model of a document represents terms not with the values 0 and 1 but with weighting coefficients, using the theory of fuzzy sets [2-5]. In this case, the value of each weighting coefficient is taken from the interval [0, 1], so that $D \in [0, 1]^n$.

The vector model formally presents text documents as a matrix of terms and documents [6]:

$$M = |F| \times |D|,$$

where $F = \{f_1, \dots, f_k, \dots, f_z\}$; $D = \{d_1, \dots, d_i, \dots, d_n\}$; $d_i$ is a vector in the $z$-dimensional space $R^z$.

In [7] [8], the authors present an ontology designed to model archival descriptions of collections of historical documents. In [9], the authors present aspects of the current activities of the digital library related to the Semantic Web and describe their functionality, with examples ranging from general architectural descriptions to the detailed use of specific ontologies. In [10], a semantic search portal is proposed for intercultural archives, including documents, images, audio and video.

One solution to this problem is to use intelligent algorithms that analyze text documents and divide the archive into classes in accordance with the semantics of the subject area. The semantics of the subject area is captured in a subject ontology formed through the analysis of textual documentation.

Domestic and foreign researchers (Gavrilova T.A. [12], Zagorulko Yu.A. [11], Khoroshevsky V.F., Soloviev V.D., Lukashevich N.V., Dobrov B.V., Smirnov S.V., Guarino [13], Uschold M. et al.) note the relevance of applying the ontological approach to the automatic structuring of large text archives and to extracting the semantic basis of project documentation.

II. MODELS OF APPLIED ONTOLOGY OF TEXT DOCUMENTS

An ontology is constructed when classifying documents in electronic archives in order to take into account the characteristics of the subject area and to increase the speed of document search. The ontology defines a semantic scale that determines whether a document belongs to a class [14] [15].

Thus, the formal model of the applied ontology of the electronic archive of project documentation is:

$$O^{ARC} = \langle T, T^{ORG}, Rel, F \rangle,$$

where $T$ is the set of terms of the design documentation of the electronic archive; $T^{ORG}$ is the set of terms of the problem area; $Rel$ is the set of ontology relationships. The set of relationships includes the following:

$$Rel = \{R_H, R_{PartOf}, R_{ASS}\},$$

where $R_H$ is the hierarchy relation; $R_{PartOf}$ is the part-to-whole relation; $R_{ASS}$ is the association relation.

Formally, the set of terms of the design documentation of the electronic archive is:

$$T = (T^{D_1} \cup T^{D_2} \cup \dots \cup T^{D_k}) \cup T^{ARC},$$

where $T^{D_i}$, $i = 1, \dots, k$, is the set of terms of the $i$-th problem area; $T^{ARC}$ is the set of terms of the problem area extracted from the documents of the electronic archive.

Formally, the interpretation functions of the subject ontology are:

$$F = \{F_{T^{ORG}T}, F_{T^{ARC}T^{D}}\},$$

where $F_{T^{ORG}T}: \{T^{ORG}\} \to \{T\}$ is an interpretation function that defines the correspondence between the terms of the problem area and the terms of the design documentation of the electronic archive; $F_{T^{ARC}T^{D}}: \{T^{ARC}\} \to \{T^{D}\}$ is an interpretation function that defines the correspondence between the terms of the problem domain extracted from electronic archive documents and the terms of the problem domain.

The main relation in the ontology of the electronic archive is "associate_with". This relation determines the subject area to which the project document of the electronic
archive belongs and determines the subject of the document.

The weight of the term $f_i$ in a text document is characterized by the frequency of the $i$-th term in the document. Hence, the following patterns are relevant:

• high-frequency terms in a document are system-wide;
• terms with a low frequency in a particular document do not improve the quality of the search for documents in the archive.

The most indicative terms are those that have an average frequency of occurrence in a document but most fully characterize the document in a problem area [16, 17].

If the frequency of occurrence of a term in a document is significantly higher than the frequency of its occurrence in all analyzed documents of the electronic archive, then this term is semantically significant. Formally, this rule is

$$s_i = tf_i \cdot \log\left(\frac{M}{df(t_i)}\right),$$

where $s_i$ is an indicator of the semantic significance of the term $t_i$ in this document; $M$ is the total number of documents in the electronic archive; $tf_i$ is the normalized frequency of the term $t_i$; $df(t_i)$ is the total number of documents containing the term $t_i$.

Thus, the ontological model of an electronic archive document is:

$$V_j^{doc} = \langle T^{ARC}, T^{D} \rangle,$$

where $T^{ARC}$, $T^{D}$ are the sets of terms of the problem area of the $j$-th document of the electronic archive. Hence

$$associate\_with(d, T_k) = 1.$$

This equality assumes that the document $d$ is mapped into the space of terms $T_1, T_2, \dots, T_k$. If $t_i^d$ is the $i$-th term of the document $d$, then the set of terms of the document $d$ can be represented as follows:

$$T^d = \{t_1^d, t_2^d, \dots, t_n^d\},$$

where $n$ is the total number of terms in the document $d$.

III. THE ONTOLOGICAL INDEX MODEL

The ontological indexing algorithm for text documents of the electronic archive is shown in Figure 1.

Fig. 1. Ontological indexing algorithm for text documents.

The degree of semantic significance of an ontology term of the electronic archive is the degree to which the context environment of the term coincides with the set of terms of the electronic archive document. The context environment is composed of terms that are semantically close to the analyzed concept of the problem area [19].

Hence, formally, the semantic index of the $i$-th document is:

$$\{(t_1^d, s_1), (t_2^d, s_2), \dots, (t_i^d, s_i), \dots, (t_n^d, s_n)\},$$

where $l$ is the total number of terms in the $i$-th document of the electronic archive after text preprocessing.

The degree of expression of the concept $l_k$ in the $i$-th document $d$ is calculated as follows:

$$\mu(l_k) = 1 - \frac{1}{l} \sum |s_k - s_i|,$$

where $s_k$, $s_i$ are indicators of the frequency of the term $t_i$ in the description of the $k$-th term of the ontology in the document $d$; $n$ is a measure of the cardinality of the text input of $l_k$.

Thus, after indexing, the document $d$ is an ontological representation, that is, a fragment of the domain ontology in which a degree of expression from 0 to 1 is defined for each ontology concept.

IV. CLASSIFICATION OF ONTOLOGICAL REPRESENTATIONS OF ELECTRONIC ARCHIVE DOCUMENTS

First, the classes into which the electronic archive documents will be split must be determined. For this, it is necessary to define a set of ontology concepts and, for each, a linguistic label that specifies the degree of expression of the concept in the class.

A linguistic label names an interval of the expression degree of an ontology concept. For example, the linguistic label "High" may correspond to values of the expression degree of the concept in the interval from 0.7 to 1.0.

Thus, at the first step of the classification algorithm for the contents of the electronic archive, it is necessary to specify a set of classes $G$ and determine their properties:

$$G = \{g_1, g_2, \dots, g_i, \dots, g_n\},$$
$$g_i = \{\langle c_1, m \rangle, \dots, \langle c_k, m \rangle\},$$
$$m \in \{High, Middle, Low\},$$
$$High = [0.7, 1.0], \quad Middle = [0.5, 0.7), \quad Low = [0, 0.5),$$

where $g_n$ is the $n$-th class of documents (classification basis); $c_k$ is the $k$-th concept of the ontology; $m$ is the linguistic label.

At the second step, the degree of belonging of the document $d$ to each class $g_i$ is calculated using the following expression:

$$s(g_i) = k - \sum_{j=1}^{k} (1 - \theta_j),$$

where $k$ is the number of parameters of the class $g_i$; $\theta_j$ indicates whether the document matches the $j$-th property of the class $g_i$ and is calculated using the following expression:

$$\theta_j = \begin{cases} 1, & c_j \in d,\ \mu(c_j) \in m, \\ 0, & \text{otherwise}. \end{cases}$$

Thus, the document $d$ matches a property of the class if it contains the concept characterizing that property and the degree of expression of the concept falls within the interval of the linguistic label.

V. EXPERIMENT RESULTS

As part of this study, a series of experiments was conducted to assess the quality of the classification of documents in the electronic archive of the Federal Scientific Production Center JSC Mars. JSC Mars is an organization engaged in the design, development and maintenance of automated systems, software and hardware for the Russian Navy. The following sets of documents were selected for the experiments:

• technical specifications;
• patent research reports;
• specifications;
• testing programs and techniques;
• programmer, user, system administrator, etc. manuals.

In total, 1037 design documents were selected for the experiments. Figure 2 shows the criteria by which experts divided the documents into classes.

Fig. 2. Expert classification of documents of JSC Mars.

A software system was developed, and experiments were conducted to select the preferred method for determining terms. The test set was a public data set containing about 10,000 tweets. Each tweet contains a text describing either a disaster or some other information, and each tweet has a label that determines whether it belongs to the "disaster" or "other" class.

The first step is data preprocessing. Preprocessing consists of:

1) Stop-word removal.
2) Tokenization by words, that is, splitting the analyzed text into separate words.
3) Case normalization.
4) Lemmatization.

At the second step, a set of indices of the analyzed poorly structured information resources was built, based on the following models of information resource representation: a bag of words (Fig. 3), a statistical model based on TF-IDF (Fig. 4), and a linguistic model based on Word2Vec (Fig. 5).

Fig. 3. Classification results for the "Bag of words" model.

Fig. 4. Classification results for the TF-IDF model.

Fig. 5. Classification results for the "Word2Vec" model.

The following assessments of the classification quality were obtained: "Bag of words": 62.23%, "TF-IDF": 74.78%, "Word2Vec": 81.7%.

Thus, of the three, the linguistic model is the best method for determining the terms of a semi-structured information resource. However, the statistical model is recommended if low computational complexity and high speed of operation are required.

A set of ontological document indexes and a set of classical indexes, which include term-frequency values, were built. The classification quality assessment model from [20] was used as the evaluation function. The results of the experiments are presented in Figures 6 and 7.

Fig. 6. Comparison of classification algorithms in accordance with the values of the evaluation function.

Fig. 7. Comparison of classification algorithms in accordance with the classification time (sec.).

As can be seen from the results of the experiments, the classification of ontological representations is up to 27 times faster than the classification of classical indices.

The quality of classification of ontological representations is slightly worse than that of classical indices only when documents are divided by class of documentation. When the set of documents is divided by type of documentation, section of documentation and subject of work, the quality of classification of ontological representations is higher than that of classical indices.

CONCLUSION

Thus, an ontological model of a text document as an electronic archive resource and an ontology-oriented algorithm for the classification of technical documents are proposed.

As can be seen from the results of the experiments, forming an ontological representation of each individual document in the archive can significantly increase the speed of automatic classification of documents (up to 27 times) while maintaining or slightly improving the quality of classification.

In future work, it is planned to introduce fuzzy elements into the ontological representation of project documents.

ACKNOWLEDGMENT

This paper has been approved within the framework of the federal target project "R&D for Priority Areas of the Russian Science-and-Technology Complex Development for 2014-2020", government contract No 05.604.21.0252 on the subject "The development and research of models, methods and algorithms for classifying large semistructured data based on hybridization of semantic-ontological analysis and machine learning".

REFERENCES

[1] W. Yong, L. Liming and Q. Yongsheng, "Improvement of big data retrieval algorithm in the intelligent archives management," 12th IEEE International Conference on Electronic Measurement & Instruments (ICEMI), pp. 487-491, 2015. DOI: 10.1109/ICEMI.2015.7494245.
[2] R. Baeza-Yates and B. Ribeiro-Neto, "Modern Information Retrieval," ACM Press, New York, 1999.
[3] K. Manning, P. Raghavan and H. Schütze, "Introduction to the Information Search," M.: LLC "I.D. Williams", 2011.
[4] F. Song and W. Bruce, "A general language model for information retrieval (poster abstract)," Research and Development in Information Retrieval, pp. 279-280, 1999.
[5] E.M. Voorhees, "Natural language processing and information retrieval," Information Extraction: Towards Scalable, Adaptable Systems, pp. 32-48, 1999.
[6] G. Salton, "Automatic Text Processing," Addison-Wesley Publishing Company, Inc., Reading, MA, 1989.
[7] L. Pandolfo, L. Pulina and M. Zielinski, "Towards an Ontology for Describing Archival Resources," 2017.
[8] L. Pandolfo, L. Pulina and G. Adorni, "A framework for automatic population of ontology-based digital libraries," Advances in Artificial Intelligence, pp. 406-417, 2016.
[9] S. R. Kruk and B. McDaniel, "Semantic digital libraries," Springer, 2009.
[10] Z. Yan, F. Scharffe and Y. Ding, "Semantic Search on Cross-Media Cultural Archives," Advances in Intelligent Web Mastering. Advances in Soft Computing, vol. 43, pp. 375-380, 2007.
[11] Yu. Zagorulko, I.S. Kononenko and E.A. Sidorova, "Semantic approach to the analysis of documents based on the ontology of the subject area," International Conference on Computational Linguistics and Intellectual Technologies Dialogue, 2016.
[12] T.A. Gavrilova and V.F. Khoroshevsky, "Knowledge Base of Intelligent Systems," St. Petersburg: Peter, 2000.
[13] T. Schneider, A. Hashemi, M. Bennett, M. Brady, C. Casanave, H. Graves, M. Grüninger, N. Guarino, A. Levenchuk, E. Lucier, L. Obrst, S. Ray, R. Sriram, A. Vizedom, M. West, T. Whetzel and P. Yim, "Ontology for Big Systems," The Ontology Summit Communiqué. Applied Ontology, vol. 7, pp. 357-371, 2012. DOI: 10.3233/AO-2012-0111.
[14] J. Serrano-Guerrero, J.A. Olivas, J. de la Mata and P. Garces, "Physical and Semantic Relations to Build Ontologies for Representing Documents," Fuzzy logic, Soft Computing and Computational Intelligence (Eleventh International Fuzzy Systems Association World Congress IFSA), Beijing, China, Tsinghua University Press, vol. I, pp. 503-508, 2005.
[15] Yu.V. Vizilter, V.S. Gorbatsevich and S.Y. Zheltov, "Structure-functional analysis and synthesis of deep convolutional neural networks," Computer Optics, vol. 43, no. 5, pp. 886-900, 2019. DOI: 10.18287/2412-6179-2019-43-5-886-900.
[16] N. Yarushkina, V. Moshkin and A. Filippov, "Development of a knowledge base based on context analysis of external information resources," Proceedings of the International conference Information Technology and Nanotechnology. Session Data Science, Samara, Russia, pp. 328-337, 2018.
[17] A. Namestnikov, A. Filippov and V. Avvakumova, "An Ontology-Based Model of Technical Documentation Fuzzy Structuring," 2nd International Workshop on Soft Computing Applications and Knowledge Discovery (SCAKD), 2016.
[18] V. Moshkin and N. Yarushkina, "Modified Knowledge Inference Method Based on Fuzzy Ontology and Base of Cases," Creativity in Intelligent Technologies and Data Science, pp. 96-108, 2019. DOI: 10.1007/978-3-030-29750-3_8.
[19] A. Filippov, V. Moshkin, A. Namestnikov, G. Guskov and M. Samokhvalov, "Approach to Translation of RDF/OWL-Ontology to the Graphic Knowledge Base of Intelligent Systems," Proceedings of the II International Scientific and Practical Conference Fuzzy Technologies in the Industry, Ulyanovsk, pp. 44-49, 2018.
[20] Yu. Radionova, "A method for constructing an evaluation function that determines the effectiveness of automatic clustering algorithms," Automation of control processes, no. 15, pp. 23-28, 2009.

VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020), Data Science.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
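
As an illustration of the semantic significance rule from Section II and the two-step class matching from Section IV, the following minimal Python sketch computes $s_i = tf_i \cdot \log(M / df(t_i))$ for a toy archive and scores an ontological representation against a class by counting matched properties. It is not the authors' software. The function names, the toy data, the choice of the natural logarithm, and the assumption that $tf_i$ is the term count divided by the document length are hypothetical choices made for illustration; only the interval bounds for High, Middle and Low are taken from the paper.

```python
import math
from collections import Counter

# Linguistic labels from Section IV: (lower bound, upper bound, upper bound inclusive).
LABELS = {
    "Low": (0.0, 0.5, False),
    "Middle": (0.5, 0.7, False),
    "High": (0.7, 1.0, True),
}


def semantic_significance(docs):
    """Return one {term: s_i} dict per document, with s_i = tf_i * log(M / df(t_i)).

    docs is a list of token lists (already preprocessed).
    Assumption: tf_i is the term count normalized by document length.
    """
    total_docs = len(docs)  # M in the paper
    doc_freq = Counter(term for doc in docs for term in set(doc))  # df(t_i)
    result = []
    for doc in docs:
        counts = Counter(doc)
        result.append({
            term: (count / len(doc)) * math.log(total_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return result


def in_label(mu, label):
    """Check whether the expression degree mu falls inside the label's interval."""
    low, high, closed = LABELS[label]
    return low <= mu <= high if closed else low <= mu < high


def class_score(doc_repr, class_def):
    """Compute s(g_i) = k - sum(1 - theta_j) for one class.

    doc_repr maps concept -> expression degree in [0, 1] (the ontological
    representation of a document); class_def maps concept -> linguistic label.
    """
    k = len(class_def)
    thetas = [
        1 if concept in doc_repr and in_label(doc_repr[concept], label) else 0
        for concept, label in class_def.items()
    ]
    return k - sum(1 - theta for theta in thetas)


if __name__ == "__main__":
    # Hypothetical toy archive of three preprocessed documents.
    archive = [
        ["radar", "signal", "radar", "test"],
        ["manual", "user", "test"],
        ["patent", "signal", "report"],
    ]
    print(semantic_significance(archive)[0])

    # Hypothetical ontological representation and class definition.
    representation = {"radar": 0.8, "signal": 0.55, "manual": 0.1}
    radar_class = {"radar": "High", "signal": "Middle", "patent": "High"}
    print(class_score(representation, radar_class))
```

Because $k - \sum (1 - \theta_j)$ equals the number of matched properties, a document can be assigned to the class with the highest score; how ties are resolved is not specified in the paper and is left open here.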