Construction of the problem area ontology based on the syntagmatic analysis of external wiki-resources

A.A. Zarubin1, A.R. Koval1, V.S. Moshkin2, A.A. Filippov2

1 The Bonch-Bruevich Saint-Petersburg State University of Telecommunication, 61 Moika, Saint-Petersburg, 191186, Russia
2 Ulyanovsk State Technical University, 32 Severny Venetz str., Ulyanovsk, 432027, Russia

Abstract

The activities of any large organization require specialists to work with a large volume of unstructured information in order to extract the knowledge needed for interaction with partners, decision-making, etc. An array of unstructured textual information is not suited to structuring and semantic search. Thus, intelligent algorithms and text analysis methods for dynamic generation of the knowledge base content need to be developed. Extracting the syntagmatic structure of a text and representing the extracted knowledge as a single unified ontology provides access to the knowledge base for solving complex problems.

Keywords: ontology; knowledge base; syntagmatic analysis; text resource

1. Introduction

In any large modern organization, urgent management decisions have to be made in a timely manner, which requires specialists to have deep knowledge of the problem area (PrA). Moreover, they should be able to use different decision support systems and tools for working with knowledge. The desire to automate and speed up the process of obtaining the necessary knowledge about the PrA drives the need for a unified multipurpose toolkit for knowledge management that does not require the user to have additional skills in knowledge engineering and ontological analysis. Thus, one can identify a number of scientific problems besetting modern organizations.
Solving such problems requires a systematic approach. They include the following:
• the need to develop a semantic basis for representing the content of electronic information storage;
• the lack of integrative conceptual models that use different approaches to storing knowledge about the PrA;
• the need for unified automated processing of the stored knowledge;
• the need for simultaneous use of multi-aspect contexts of the PrA under consideration;
• the need to solve the problem of tracking the clarity of human reasoning.

Thereby, the pressing problem today is to provide specialists from a wide range of organizations with a universal tool for addressing knowledge management challenges [1]. Furthermore, the tool should not require extra training of users.

At the moment, the ontological approach is the one most often used for organizing the knowledge bases of expert systems. Many Russian and foreign researchers, such as T.A. Gavrilova [2], V.N. Vagin [3], V.V. Gribova [4], Yu.A. Zagorulko [5], A.S. Kleschev [6], I.P. Norenkov, D.E. Palchunov, S.V. Smirnov [7], D. Bianchini, T.R. Gruber, A. Medche, G. Stumme and others, address the problem of integrating and searching information in order to provide ontology-based management decision support.

In a broad sense, ontologies are models that represent knowledge within individual contexts of the PrA in the form of semantic information-logical networks of interrelated objects, where the PrA concepts with their properties and the relations between objects are the main elements. Ontologies serve as integrators that provide a common semantic basis for decision-making and data mining, and as a unified platform for combining different information systems [8,9].

2. Formal model of knowledge base

The knowledge base (KB) stores knowledge of different PrAs and contexts in the form of an applied ontology.
The PrA ontology context is a specific state of the KB content that can be chosen from a set of ontology states. A state is obtained either by versioning or by constructing the KB content from different points of view. Formally, an ontology can be represented by the following equation:

$$O = \langle T, C^{T_i}, I^{T_i}, P^{T_i}, S^{T_i}, F^{T_i}, R^{T_i} \rangle, \quad i = \overline{1, t},$$

where $t$ is the number of ontology contexts, $T = \{T_1, T_2, \ldots, T_n\}$ is the set of ontology contexts, $C^{T_i}$ is the set of ontology classes within the $i$-th context, $I^{T_i}$ is the set of ontology objects within the $i$-th context, $P^{T_i}$ is the set of ontology class properties within the $i$-th context, $S^{T_i}$ is the set of ontology object states within the $i$-th context, $F^{T_i}$ is the set of PrA processes fixed in the ontology within the $i$-th context, and $R^{T_i}$ is the set of ontology relations within the $i$-th context, defined as:

$$R^{T_i} = \{R_C^{T_i}, R_I^{T_i}, R_P^{T_i}, R_S^{T_i}, R_{F_{IN}}^{T_i}, R_{F_{OUT}}^{T_i}\},$$

where $R_C^{T_i}$ is the set of relations defining the hierarchy of ontology classes within the $i$-th context, $R_I^{T_i}$ is the set of relations defining the 'class-object' ontology tie within the $i$-th context, $R_P^{T_i}$ is the set of relations defining the 'class-class property' ontology tie within the $i$-th context, $R_S^{T_i}$ is the set of relations defining the 'object-object state' ontology tie within the $i$-th context, $R_{F_{IN}}^{T_i}$ is the set of relations defining the tie between the entry of process $F_j^{T_i}$ and other instances of the ontology within the $i$-th context, and $R_{F_{OUT}}^{T_i}$ is the set of relations defining the tie between the exit of process $F_j^{T_i}$ and other instances of the ontology within the $i$-th context.

3rd International conference “Information Technology and Nanotechnology 2017” 128 Data Science / A.A. Zarubin, A.R. Koval, V.S. Moshkin, A.A. Filippov

3. Extracting the core of ontology of the problem area based on the syntagmatic analysis of external wiki-resources

Wiki-resources are formed by a large number of users.
Thus, applying automated methods for extracting the core of the ontology from the knowledge contained in Wikipedia can reduce the degree of subjectivity and increase the number of experts involved in building the ontology [11]. The algorithm for extracting the core of the ontology from external wiki-resources is based on the methods described in [3]. The PrA features in a wiki-resource are represented as a hierarchy of interlinked HTML pages with certain semantics. The core of the ontology is extracted automatically from external wiki-resources in the process of data mining, and can then be expanded through syntagmatic analysis of a set of thematic text documents.

The first method of extracting the core of the PrA ontology is based on the Lee algorithm [13]. Concepts are reduced to their initial form (lemmatization). The types of relations between concepts are determined by syntagmatic analysis of the terms located to the left and right of the reference that defines the concept. The rules for determining the type of relation are given as syntagmatic patterns (each pattern contains a sequence of words).

The second method of extracting the core of the domain ontology from the contents of wiki-resources allows the intelligent system to adapt dynamically to changes in the domain [14]. Methods of automatic text processing (ATP) in a natural language (NL) can be used to extract knowledge from the text of wiki-resource pages. The ATP process is usually carried out in several steps [15]:
1. Graphematic analysis is the initial analysis of the NL text. It presents the input data in a format convenient for further analysis (separation of the input text into words, delimiters, etc.).
2. Morphological analysis (lemmatization) transforms the words of the input text to their initial form and determines the part of speech, gender, case, etc.
3. Parsing selects the members of simple sentences and constructs a parse tree.
4. Semantic analysis consists of:
• construction of a semantic tree of sentences;
• semantic interpretation of words and constructions;
• definition of semantic relations between elements of the text.

The semantic representation of an NL text is the most complete representation that can be achieved by linguistic methods alone. The core of the domain ontology can be extended by merging it with the semantic tree extracted from wiki-resources. Obtaining a semantic tree requires a method for translating a parse tree into a semantic tree, and constructing the semantic tree of an NL sentence first requires determining its syntactic structure. There are several parsing tools for texts in Russian, for example [16, 17, 18]:
• Lingo-Master;
• Treeton;
• Sreda RGTU;
• DictaScope Syntax;
• ETAP-3;
• ABBYY Compreno;
• Tomita-parser;
• AOT, etc.

In this work, AOT (a tool for constructing a parse tree) was chosen [18]. Let us consider the application of the algorithm for translating a syntactic tree into a semantic tree using the example of a test fragment in Russian:

Онтология в информатике — это попытка всеобъемлющей и подробной формализации некоторой области знаний с помощью концептуальной схемы.

(In English: “An ontology in computer science is an attempt at a comprehensive and detailed formalization of some domain of knowledge by means of a conceptual schema.”)

The parse tree for the test fragment is shown in Figure 1.

Fig. 1. Example of a parse tree.

Formally, the function of translating a parse tree into a semantic tree can be represented as follows:

$$F^{Sem}: \{N_{li}^{Synt}, P_j\} \rightarrow \{N^{Sem}, R^{Sem}\},$$

where $N_{li}^{Synt}$ is the $i$-th node of the $l$-th level of a parse tree.
For example, for the parse tree in Figure 1, the first node of the first level is the node “Онтология” (“ontology”), the second one is “пг”, the third one is “тире” (“dash”), etc. A node of the parse tree can be a member of the sentence, for example, the node “Онтология”. A parse tree node can also be a syntactic label that defines the constituent members of the sentence, for example, “пг” (the prepositional group); $P_j$ is the $j$-th syntagmatic pattern for defining the nodes of the parse tree. The nodes are translated into nodes and relations of the semantic tree.

A syntagmatic pattern is a collection of several words united according to the principle of semantic-grammatical-phonetic compatibility. Formally, a syntagmatic pattern can be represented as follows:

$$(N_1^{Synt}, N_2^{Synt}, \ldots, N_k^{Synt}) \rightarrow (N^{Sem}, R^{Sem}), \quad k = \overline{1, K},$$

where $N_k^{Synt}$ is the $k$-th syntagmatic unit of the pattern corresponding to a node of the parse tree. A syntagmatic pattern applies only when all of its syntagmatic units are used. Examples of syntagmatic patterns and the results of their use are presented in Table 1; $K$ is the number of syntagmatic units in the pattern; $N^{Sem}$ and $R^{Sem}$ are the sets of nodes and relations of the semantic tree obtained as a result of translating the parse tree into a semantic tree.

Formally, $R^{Sem}$ can be defined as follows:

$$R^{Sem} = \{R_{isA}^{Sem}, R_{partOf}^{Sem}, R_{associateWith}^{Sem}, R_{dependsOn}^{Sem}, R_{hasAttribute}^{Sem}\},$$

where $R_{isA}^{Sem}$ is a set of transitive relations of hyponymy; $R_{partOf}^{Sem}$ is a set of transitive “part/whole” relations; $R_{associateWith}^{Sem}$ is a set of symmetrical relations of association; $R_{dependsOn}^{Sem}$ is a set of asymmetric relations of associative dependence; $R_{hasAttribute}^{Sem}$ is a set of asymmetric relations describing the attributes of nodes.

Table 1. Examples of syntagmatic patterns and the results of their application.
| Initial data | Syntagmatic pattern | Result |
|---|---|---|
| попытка-генит_иг-формализации | {node1}-{генит_иг}-{node2} → {node1}-associateWith-{node2} | попытка-associateWith-формализация |
| в-пг-информатике | {node1}-{пг}-{node2} → {prevNode}-getRelation(node)-{node2} | lastNode-relation-информатика |
| тире | {тире} → {prevNode}-isA-{nextNode} | lastNode-isA-nextNode |
| концептуальной-прил_сущ-схемы | {node1}-{прил_сущ}-{node2} → {node2}-hasAttribute-{node1} | схема-hasAttribute-концептуальный |
| (всеобъемлющей, подробной)-однор_прил-формализации | {node1}-{однор_прил}-{nodes} → {node1}-hasAttribute-{nodes[1]}, {node1}-hasAttribute-{nodes[2]}, {node1}-hasAttribute-{nodes[…]}, {node1}-hasAttribute-{nodes[count]} | формализация-hasAttribute-всеобъемлющий, формализация-hasAttribute-подробный |

The algorithm for translating a parse tree into a semantic tree consists of the following steps:
1. Go to the first level of the parse tree.
2. Select the next node of the current tree level.
3. If the node is marked, go to step 2.
4. If the node is not a syntax label, go to step 9.
5. If the node is a syntax label and does not have child elements, go to step 9.
6. If the node is a syntax label and all its child nodes are not syntax labels, go to step 9.
7. If there is a temporary parent node, replace it, otherwise, create a temporary node.
8. If there is a previous node and there is no relation with it, add a temporary relationship with it and go to step 2.
9. Apply the syntagmatic pattern for translation.
10. Mark the processed nodes and go to step 2.
11. Go to the next level of the parse tree and go to step 2.

The resulting semantic tree for the test fragment is shown in Figure 2.

Fig. 2. Example of a semantic tree for a test fragment.

The resulting semantic tree can be merged with other semantic trees within the text.
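As an illustration, three of the rules from Table 1 can be sketched in Python as a mapping from a syntactic label and its neighbouring nodes to semantic triples. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `apply_pattern`, the triple layout, and the argument order (left node, right node, in surface order) are hypothetical, and the nodes are assumed to be lemmatized already. The “пг” and “тире” rules are left out because they depend on the previous and next nodes of the tree traversal.

```python
from typing import List, Sequence, Tuple, Union

# A semantic relation is stored as (subject, relation, object).
Triple = Tuple[str, str, str]


def apply_pattern(label: str,
                  left: Union[str, Sequence[str]],
                  right: str) -> List[Triple]:
    """Translate one labelled group of parse-tree nodes into triples.

    `left` and `right` are the lemmatized nodes standing to the left and
    right of the syntactic label (hypothetical calling convention).
    """
    if label == "генит_иг":
        # {node1}-{генит_иг}-{node2} -> {node1}-associateWith-{node2}
        return [(left, "associateWith", right)]
    if label == "прил_сущ":
        # {node1}-{прил_сущ}-{node2} -> {node2}-hasAttribute-{node1}
        return [(right, "hasAttribute", left)]
    if label == "однор_прил":
        # Homogeneous adjectives: one hasAttribute triple per adjective.
        return [(right, "hasAttribute", adjective) for adjective in left]
    # Labels not covered by this sketch produce no triples.
    return []


triples = (
    apply_pattern("генит_иг", "попытка", "формализация")
    + apply_pattern("прил_сущ", "концептуальный", "схема")
    + apply_pattern("однор_прил", ["всеобъемлющий", "подробный"], "формализация")
)
```

Keeping the output as plain triples means that the results of several patterns, or of several sentences, can be collected into a single set without further conversion.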
In addition, the semantic tree can be merged with the domain ontology compiled by an expert. Extending the knowledge base by merging semantic trees retrieved from semi-structured resources makes it possible to:
• provide a common terminology space for sharing and understanding by all users;
• determine the exact and consistent meaning of each term.

The ontology is a common terminological basis for complex iterative processes. Figure 3 shows a fragment of the core of the ontology “LAN Administration” extracted from the thematic wiki-resource.

Fig. 3. The fragment of the core of the ontology “LAN Administration”.

4. Construction of the PrA ontology based on the syntagmatic analysis of text documents

In the course of solving the problem of automated ontology expansion, two algorithms for extracting terms from domain texts using an existing ontology core were developed:
• the thesaurus-based algorithm;
• the internal linkage algorithm [19].

The main feature of the developed algorithms is that terms are extracted from text documents by matching syntagmatic patterns with the lemmas of the objects from the core of the ontology. Syntagmatic patterns are extracted using morphological analysis of the text documents.

The thesaurus-based algorithm. A thesaurus is a reference work that lists words grouped together according to similarity of meaning (containing synonyms and sometimes antonyms), in contrast to a dictionary, which provides definitions for words and generally lists them in alphabetical order. An ontology can be regarded as a more elaborate version of a thesaurus. The thesaurus approach searches for the lemmas of the input words and their combinations among the terms defined in the ontology. For this purpose, each ontology class has a “HasALemma” property, which holds a string value obtained by lemmatizing the object name.
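The lookup described above can be sketched as follows. This is a hedged illustration only: the ontology is assumed to be a plain dictionary from class name to its “HasALemma” string, the function name `overlap_counts` and the English sample classes are invented for the example, and the input words are assumed to be lemmatized beforehand. For every class that shares at least one word with the input, the sketch returns two counts, the number of lemma words found in the input and the total number of words in the lemma, which are the quantities the proximity measure defined next is built from.

```python
from typing import Dict, List, Tuple


def overlap_counts(input_lemmas: List[str],
                   has_a_lemma: Dict[str, str]) -> Dict[str, Tuple[int, int]]:
    """Match input lemmas against the HasALemma value of each class.

    Returns {class_name: (matched_words, lemma_length)} for every class
    whose lemma shares at least one word with the input sequence.
    """
    seen = set(input_lemmas)
    result: Dict[str, Tuple[int, int]] = {}
    for class_name, lemma in has_a_lemma.items():
        words = lemma.split()
        matched = sum(1 for word in words if word in seen)
        if matched:
            result[class_name] = (matched, len(words))
    return result


# Illustrative data, not taken from the "LAN Administration" ontology.
sample_ontology = {
    "LocalNetwork": "local network",
    "NetworkAdapter": "network adapter",
    "Router": "router",
}
counts = overlap_counts(["local", "network"], sample_ontology)
```

In this sample, "LocalNetwork" is matched on both of its words and "NetworkAdapter" on one of two, while "Router" is not returned at all.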
The supporting ontology object used in the further analysis has a degree of proximity to the input word or word combination that is calculated by the following equation:

$$k_t = \max_{i = \overline{1, m}} \frac{n_i}{p_i}, \quad (1)$$

where $m$ is the number of all ontology objects, $n_i$ is the number of words from the input sequence contained in the lemma of the current ontology object, and $p_i$ is the number of words in the current ontology object. The process of assessing the proximity of the input words to the subject area terms is shown in Figure 4.

Fig. 4. Finding the supporting ontology object.

Each object in the ontology has an “IsATerm” property of Boolean type. The degree of proximity of the input words to the terms of the domain according to the thesaurus-based algorithm is calculated by the following equation:

$$k_{Ont} = \frac{k_t}{c + 1}, \quad (2)$$

where $k_t$ is the result of the first step of the analysis, and $c$ is the number of relations between the supporting ontology object and the nearest object with a true “IsATerm” value.

Internal linkage algorithm. The developed metrics allow terminology to be extracted not only by defining the termhood of single words but also by comparing the terms from the text with ontology objects and with combinations of the lemmas of those objects, using $R_{add}$ relations. The internal linkage algorithm implements the following scheme:

$$t_1 \rightarrow R_1 \rightarrow t_2 \rightarrow R_2 \rightarrow \ldots \rightarrow R_m \rightarrow t_n, \quad (3)$$

where $R_i \in R_{add}$, $t_j \in T$, and $R_{add}$ is a set of relations that allow the set of objects of the described domain to be expanded by combining the lemmas of related objects, for example, the properties “IsRelatedWith” and “IsAPartOf”. To avoid redundancy, extracted terms that are part of longer terms consisting of more words are not considered as terms.

5. Experiments

A text of about 62000 words from the “LAN Administration” PrA was analyzed to assess the accuracy of term extraction. The OWL ontology consisted of 261 classes and 46 relations.
Precision (P), Recall (R) and F1 measures were used to assess the effectiveness of the algorithms for each category of tokens. Experiments on term extraction using the most frequently applied statistical methods (Frequency, TF*IDF, C-Value) were also carried out. Results are presented in Table 2.

Statistical methods showed higher precision when retrieving one-word terms (0.92–0.94 against 0.80 for the internal linkage algorithm). The internal linkage algorithm first extracts terms related to existing knowledge base terms, and it extracts fewer wrong terms in the case of two- and three-word terms (precision 0.86 and 0.88, against at most 0.74 and 0.80 for the statistical methods). Statistical methods are more focused on the frequency of occurrence of tokens and phrases, regardless of their relation to the PrA features, and can therefore extract general scientific terms and terms from other problem areas.

Table 2. Term extraction using statistical and syntagmatic methods.

| Method | Amount of words | Terms | Candidates | Right | P | R | F1 |
|---|---|---|---|---|---|---|---|
| Internal linkage algorithm | 1 | 294 | 168 | 134 | 0.80 | 0.46 | 0.58 |
| Internal linkage algorithm | 2 | 631 | 431 | 372 | 0.86 | 0.59 | 0.70 |
| Internal linkage algorithm | 3 | 361 | 370 | 327 | 0.88 | 0.91 | 0.89 |
| Frequency | 1 | 294 | 134 | 123 | 0.92 | 0.42 | 0.58 |
| Frequency | 2 | 631 | 469 | 347 | 0.74 | 0.55 | 0.63 |
| Frequency | 3 | 361 | 334 | 267 | 0.80 | 0.74 | 0.77 |
| TF*IDF | 1 | 294 | 147 | 138 | 0.94 | 0.47 | 0.63 |
| TF*IDF | 2 | 631 | 456 | 328 | 0.72 | 0.52 | 0.60 |
| TF*IDF | 3 | 361 | 277 | 166 | 0.60 | 0.46 | 0.52 |
| C-Value | 1 | 294 | 120 | 112 | 0.93 | 0.38 | 0.54 |
| C-Value | 2 | 631 | 789 | 316 | 0.40 | 0.50 | 0.44 |
| C-Value | 3 | 361 | 295 | 162 | 0.55 | 0.45 | 0.50 |

6. Conclusion

Mathematical and statistical approaches to building domain ontologies by extracting knowledge from text documents do not take into account the morphological, semantic, and syntagmatic features of the linguistic forms used in the text.
The methods of syntagmatic analysis allow:
• to reduce all synonyms to the same concept;
• to assign polysemous words to different concepts;
• to use the connections between the concepts and the appropriate terms to generate new ontology entities.

Thus, the experimental results suggest a high efficiency of the methods described in the article. The methods were developed by combining linguistic algorithms of terminology extraction from large text corpora in the process of syntagmatic analysis with extraction of the core of the ontology from external wiki-resources.

Acknowledgements

This paper has been approved within the framework of the Federal Targeted Programme for Research and Development in Priority Areas of Development of the Russian Scientific and Technological Complex for 2014-2020, Government Contract No. 14.607.21.0164 on the subject “Development of architecture, methods and models in order to build software and hardware complex for semantic analysis of semi-structured information resources on the Russian element base” (Application Code “2016-14-579-0009-0687”).

References

[1] Bova VV, Kureichik VV, Nuzhnov EV. Problems of representation of knowledge in integrated systems of support of administrative decisions. News of SFedU 2010; 108(7): 107–113.
[2] Gavrilova TA. Ontological approach to knowledge management in the development of corporate information systems. Artificial Intelligence News 2003; 2(56): 24–29.
[3] Vagin VN, Mikhailov IS. Development of the method of integration of information systems based on metamodelling and ontology of the subject domain. Software Products And Systems 2008; 1: 22–26.
[4] Gribova VV, Kleschev AS. Managing the design and implementation of the user interface based on the ontology. Management 2006; 2: 58–62.
[5] Zagorulko YuA. Construction of scientific knowledge portals based on ontology. Computational Technologies 2007; 12: 169–177.
[6] Kleshchev AS. The role of ontology in programming. Part 1. Analytics. Information Technologies 2008; 10: 42–46.
[7] Smirnov SV. Ontological modeling in situational management. Ontology of Design 2012; 2(4): 16–24.
[8] Golenkov VV, Guliakina NA. Semantic technology of component design of knowledge-driven systems. Fifth International Scientific and Technical Conference "OSTIS". Minsk, 2015: 57–78.
[9] Namestnikov AM, Filippov AA. Implementation of the clustering system for conceptual indexes of project documents. Automation of management processes 2011; 3(25): 46–50.
[10] Namestnikov AM, Filippov AA, Avvakumova VS. An ontology based model of technical documentation fuzzy structuring. CEUR Workshop Proceedings, SCAKD 2016; 1687: 63–74.
[11] Shestakov VK. Development and maintenance of information systems based on ontology and Wiki-technologies. 13-th all-Russian Scientific Conference “RCDL-2011”. Voronezh, 2011: 299–306.
[12] Hepp M, Bachlechner D, Siorpaes K. Harvesting Wiki Consensus – Using Wikipedia Entries as Ontology Elements. Proceedings of the First Workshop on Semantic Wikis – From Wiki to Semantics. Annual European Semantic Web Conference (ESWC), 2006: 124–138.
[13] Subkhangulov RA. Ontologically-oriented method of searching for project documents. Automation of management processes 2012; 4(30): 83–89.
[14] Konstantinova NS, Mitrofanova OA. Ontologies as a knowledge storage system. Portal "Information and Communication Technologies in Education". URL: http://www.ict.edu.ru/ft/005706/68352e2-st08.pdf (21.03.2017).
[15] Sokirko AV. Semantic dictionaries in automatic processing: dissertation for the degree of candidate of technical sciences. State Committee of the Russian Federation for Higher Education, Russian State Humanitarian University. Moscow, 2001.
[16] Boyarskiy KK, Kanevskiy YeA. Semantico-syntactic parser Semsin. Scientific and Technical Herald of Information Technologies, Mechanics and Optics 2015; 5: 869–876.
[17] Artemov MA, Vladimirov AN, Seleznev KYe. Survey of natural text analysis systems in Russian. Scientific journal Bulletin of Voronezh State University. URL: http://www.vestnik.vsu.ru/pdf/analiz/2013/02/2013-02-31.pdf (22.02.2017).
[18] Automatic text processing. Automatic word processing. URL: http://aot.ru (22.02.2017).
[19] Yarushkina N, Moshkin V, Klein V, Andreev I, Beksaeva E. Hybridization of Fuzzy Inference and Self-learning Fuzzy Ontology Based Semantic Data Analysis. Proceedings of the First International Scientific Conference “Intelligent Information Technologies for Industry” (IITI’16), 2016: 277–285.