Use of Semantic Similarity Estimates for Unstructured Data Analysis

© Julia Rogushina
Institute of Software Systems of National Academy of Sciences of Ukraine, Kyiv, Ukraine
ladamandraka2010@gmail.com

Abstract. The paper discusses problems related to the analysis of unstructured data aimed at acquiring implicit knowledge from them. Semantic similarity estimations are used as one of the instruments for such analysis. We use the portal version of the Great Ukrainian Encyclopedia (e-VUE) to demonstrate examples where ontologies and semantic Wiki markup are used for generation of sets of semantically similar concepts (SSC). Semantic similarity in these examples is defined in a domain-specific way. Grouping of concepts into SSC is based on high-level ontological classes, and semantic properties and their relations are used for construction of the attribute space. Various sets of SSC are applied to navigation and search in order to extend functionality. Every e-VUE article is represented by a Wiki page with unstructured and semi-structured natural language (NL) and multimedia content pertinent to some concept. The ontological model of e-VUE is considered a domain knowledge base that simplifies processing of e-VUE article content and defines semantic relations between concepts.

Keywords: Ontology, unstructured data, Wiki technology, semantic similarity.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Unstructured data analysis

Today the largest part of stored information (more than 80% of all stored data, and its volume grows an order of magnitude faster than that of structured data) is represented by unstructured data (USD) [1], so methods and means of USD analysis are evolving rapidly. This situation transforms disparate implementations of USD analysis into an integral area of scientific research. USD are potentially a great source of new information and knowledge, but their use involves problems of storage and analysis. Automation of such processing requires transformation of USD into structured information that can be processed automatically in various ways.

USD are usually defined as information collected without any predefined data model or data organization. The major portion of USD is textual information: arbitrary-length sequences of natural language (NL) words combined by weakly formalized linguistic rules. Such USD may also contain dates and numbers. Examples of textual USD are NL documents in various formats, information from social networks, data from mobile devices and the content of Web sites.

USD is not a well-defined term, because it is difficult to draw a distinction between structured data and data without a formally defined structure or with a structure that cannot be used for automated processing [2]. One of the criteria for identifying structured data is the possibility of creating a parser for data elements. Thus, we consider data as USD if the accessible information about their structure cannot make data analysis more efficient. Unstructured information can be stored in the form of objects (files or documents) that have their own structure. For example, the body of an e-mail and an e-mail attachment are USD, but the location of the attachment within the mail is determined by structure. Combinations of structured and unstructured data are also considered USD.

1.1 Properties of unstructured data

The main properties of USD are:
- Heterogeneity. USD can be generated in different ways, in various formats, from various information sources, and for various reasons these data cannot be structured and placed in a DBMS;
- Ambiguity. The same phrase uttered by different persons can have different meanings depending on their individual experience, views, etc. (for example, the phrase "I don't understand this article" said by an expert indicates poor quality of the article, while the same statement from a student indicates inadequate education), and the same idea may be expressed by different words;
- Contextual dependency. The interpretation of a word or name differs across contexts (for example, the meaning of the term "model" differs in technology and in mathematics);
- Dynamics of meaning. Words can change their meaning rapidly; for example, the previously little-known "Wuhan" is now associated with coronavirus and has gained additional meaning.

USD are often created directly by humans (in contrast to structured data), and therefore systems oriented towards USD analysis have to take the "human factor" into account. Technologies such as Data Mining, Natural Language Processing and Text Mining provide different methods for finding structure in USD. Common text structuring methods usually include manual metadata tagging for further structuring. The Unstructured Information Management Architecture (UIMA) standard provides a general framework for processing such information in order to make sense of it and create structured data.

1.2 Text Mining as a basis for NL USD analysis

Since the late 1990s Text Mining has been a separate scientific area [3]. Early approaches regarded text as a "bag of words" that includes abbreviations, plural forms and word combinations, as well as multiword terms known as n-grams. Basic lexical analysis takes into account the frequency of words and terms to perform such tasks as classification of documents by topic, but this approach does not consider document semantics. Now Text Mining looks for hidden relations and other complex structures in text data sets.

Text Mining technology is based on linguistics and Data Mining. Initially it was oriented towards recognition of personal and geographical names, dates, phone numbers and e-mail addresses in text. Now more sophisticated methods provide retrieval of concepts, relations between them and even emotions.

The spread of Big Data increases the urgency of USD structuring [4]. In the most general form, the solution of this complex scientific problem consists in construction of graphs that mark up USD content and in matching of such graphs. Another aspect of this problem is related to retrieval of relevant knowledge for USD markup.

Text Mining should provide a transition from USD to structured information. Most often this process ignores many specific features of NL that are used only at the preliminary stage of text parsing, and the following phases of analysis use the "bag of words" model where the order of words is not important [5].
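As a minimal illustration of the "bag of words" model, the following Python sketch (with an invented sample sentence and a deliberately naive tokenization rule) reduces a text to token frequencies and discards word order:

    # A minimal sketch of the "bag of words" model: word order is discarded,
    # only the frequency of each token is kept.
    import re
    from collections import Counter

    def bag_of_words(text):
        # Naive tokenization: lowercase and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z']+", text.lower())
        return Counter(tokens)

    doc = "The model of the domain is an ontological model."
    print(bag_of_words(doc))
    # e.g. Counter({'the': 2, 'model': 2, 'of': 1, ...})

At this level any two texts can be compared or classified only through such frequency counts, which is exactly why document semantics is lost.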
1.3 Elements of text document structuring

From some points of view (distinct from automated analysis) an NL text document can be considered a structured object. For example, from a linguistic point of view each document contains a large amount of semantic and syntactic structure that is hidden in the text. In addition, markup elements (punctuation signs, capital letters, numbers, special characters, etc.) and formatting elements (tables, columns, paragraphs, etc.) can be considered a "soft markup" language that helps to identify important document subcomponents, such as the title, names of authors, subdivisions, etc. Word sequence can also be a structurally significant characteristic of a document. In addition, some text documents may contain embedded metadata in the form of formatted markup tags that are automatically generated by text editors.

Documents that have relatively few such structuring elements (for example, scientific publications and business reports) are called free-format or weakly structured. Documents with relatively more structuring elements (such as e-mail or HTML web pages) are called semi-structured.

Pre-processing Text Mining operations take various NL document elements into account in order to convert a document from USD with implicit structuring into explicitly structured data. However, the potentially great number of words, phrases, sentences and formatting elements contained even in a small document (not even considering the potentially large number of different meanings of these elements in different contexts) makes it necessary to identify a simplified subset of document properties (features). Such a set of features is called a representative model of a document: individual documents are characterized by the sets of features contained in their representative models.

Even in the most effective representative models, each individual document in a collection has an extremely large number of properties. Therefore, problems associated with the high dimensionality of features (i.e., the size and scale of possible combinations of feature values) are usually much more significant in Text Mining systems than in classic Data Mining systems. Structured representations of NL documents have a much larger number of potentially representative features, and therefore a greater number of possible combinations of their values, than relational or hierarchical databases. For example, a relatively small collection of 10-15,000 documents contains more than 25,000 non-trivial words, while the number of attributes in relational databases analyzed in Data Mining tasks is usually much smaller. The high dimensionality of potentially representative properties leads to pre-processing of the text aimed at creating simplified representation models. Another feature of NL documents is feature sparsity: only a small subset of the properties available for the document collection as a whole appears in each individual document, and thus, when a document is represented as a binary feature vector, almost all vector values are zero.

1.4 Properties of an individual NL document

Symbols, words, terms and concepts define the properties of an individual NL document. Text Mining algorithms process a document representation through the set of its properties rather than the document itself, and therefore we need a compromise between two important goals. The first goal is to choose the volume and semantic level of the properties that accurately represent the document during the pre-processing operation. The second goal is to select the property definition that is most computationally efficient and practical for pattern detection. This choice can be supported by validation, normalization, or the use of properties from controlled vocabularies or external sources of knowledge, such as dictionaries, thesauri, ontologies or knowledge bases, to create smaller sets of properties with greater semantic significance.
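To make the observations about dimensionality and sparsity concrete, the following sketch (the three-document collection and the stop-word list are invented) builds a simple representative model, a binary document-feature matrix over the collection vocabulary, and measures how sparse it is:

    # Sketch: binary feature vectors over the vocabulary of a small collection.
    import re

    STOP_WORDS = {"the", "of", "is", "a", "an", "and", "in"}

    def features(text):
        tokens = re.findall(r"[a-z]+", text.lower())
        return {t for t in tokens if t not in STOP_WORDS}

    collection = [
        "The ontology of the domain is a knowledge base.",
        "Wiki markup defines categories and semantic properties.",
        "Semantic similarity of concepts is estimated by the ontology.",
    ]

    vocabulary = sorted(set().union(*map(features, collection)))
    vectors = [[1 if term in features(doc) else 0 for term in vocabulary]
               for doc in collection]

    zeros = sum(row.count(0) for row in vectors)
    total = len(vectors) * len(vocabulary)
    print(len(vocabulary), "features;", round(100 * zeros / total), "% of values are zero")

Even in this toy collection most vector values are zero; for a realistic collection with tens of thousands of non-trivial words the sparsity is far more pronounced.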
Although many potential properties can be used to represent NL documents, the following four types are most commonly used:
- Symbols. Letters, numbers, special characters and spaces are the building blocks of higher-level semantic features such as words, terms and concepts. Symbol-level representations may include the complete set of all characters of a document or some filtered subset. Character-based representations without position information ("bag-of-characters" approaches) usually have very limited utility for Text Mining; representations that include some level of positional information (such as bigrams or trigrams) are more useful.
- Words. Specific words selected directly from the NL document are the baseline for semantics. Each word-level property corresponds to at most one linguistic token; phrases and multiword expressions do not constitute separate properties at the word level. A word-level document representation includes a feature for each word in that document, that is, the text of the document is represented by the complete set of its word-level properties. Therefore some word-level representations of document collections contain a great number of unique words in their feature space. However, most document representations at this level apply at least some minimal optimization and are therefore composed of subsets of representative properties from which elements such as stop words, symbols and meaningless numbers are filtered out.
- Terms. Terms are individual words and multiword phrases selected directly from the source NL document body using a term extraction methodology. A term-based document representation consists of a subset of the terms in that document. Various methodologies can be used for extracting terms that convert the raw text of a document into a sequence of normalized terms (tokenized and lemmatized word forms) tagged with the relevant parts of speech. Sometimes an external vocabulary is also used to normalize terms and provide a controlled vocabulary. Term extraction techniques use different approaches to generate and filter a list of the most relevant document terms from this set of normalized terms.
- Concepts. Concepts are properties created for an NL document using various categorization techniques. Concept-level properties can be created manually, but now they are more commonly retrieved from documents through complex pre-processing procedures that identify individual words, multiword expressions, entire sentences or even larger syntactic units, which are then related to specific concept identifiers.

The term and concept levels represent properties that are more significant for semantics. Term-level representations are easier to generate automatically from text than concept-level ones. However, concept-level representation is much more useful for processing synonymy and polysemy.

Many categorization methodologies include references to external knowledge sources. For example, some statistical methods can use an annotated collection of documents as an external source. For manual and rule-based categorization, cross-referencing and validation of prospective properties at the concept level typically involve interaction with external resources, such as a domain ontology, vocabulary or formal hierarchy. In contrast to word-level and term-level properties, concept-level document properties may consist of words not contained in the document itself. Concept-based representations allow the use of very complex concept hierarchies and diverse domain knowledge provided by ontologies and knowledge bases.
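As a toy illustration of the difference between term-level and concept-level properties, the following sketch maps normalized terms to concept identifiers through a small synonym dictionary (the dictionary, the stop-word list and the normalization rule are invented stand-ins; real systems use lemmatizers, part-of-speech taggers and ontologies for this step):

    # Sketch: from raw text to term-level and concept-level properties.
    import re

    STOP_WORDS = {"the", "a", "an", "of", "is", "are", "in"}

    # Invented controlled vocabulary: surface terms -> concept identifiers.
    CONCEPTS = {"car": "Vehicle", "automobile": "Vehicle", "auto": "Vehicle",
                "bank": "FinancialInstitution", "ontology": "Ontology"}

    def terms(text):
        # Naive normalization: lowercase alphabetic tokens without stop words.
        tokens = re.findall(r"[a-z]+", text.lower())
        return [t for t in tokens if t not in STOP_WORDS]

    def concepts(text):
        # Concept-level properties: synonyms collapse into the same identifier.
        return {CONCEPTS[t] for t in terms(text) if t in CONCEPTS}

    print(terms("The automobile is parked near the bank."))
    print(concepts("The automobile is parked near the bank."))  # Vehicle, FinancialInstitution
    print(concepts("An auto and a car."))                       # Vehicle only

Note that this sketch ignores polysemy: "bank" is always mapped to the same concept, which is exactly the kind of ambiguity that more complex categorization procedures have to resolve.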
However, concept-level representations have several potential drawbacks: 1) the relative complexity of using heuristics in pre-processing operations, and 2) the dependence of concepts on domain specifics.

1.5 Using background knowledge in Text Mining

Background knowledge can be used at the pre-processing stage to improve the acquisition of domain concepts. A domain in Text Mining is a specialized area of interest represented by ontologies, lexicons, dictionaries, thesauri, taxonomies, etc. Text Mining systems can use information from formalized external knowledge sources for such domains to improve document pre-processing and knowledge discovery. Concepts used in Text Mining systems are connected not only to the descriptive attributes of a particular document but also to domains.

Access to background knowledge, while not absolutely necessary for creating concept hierarchies in the context of a single document or document collection, can play an important role in developing more meaningful, consistent and normalized concept hierarchies. Text Mining relies on background knowledge more than Data Mining: properties of USD are not just elements of a flat set, as is often the case with structured data, because they are linked through lexicons and ontologies that support advanced queries.

Although Text Mining pre-processing operations play an important role in transforming the unstructured content of a raw document collection into a more convenient concept-level data representation, the core functionality of such systems is oriented towards analysis of concept co-occurrence models in the document collection. Text Mining uses algorithms and heuristics that consider distributions, frequent sets and various associations of concepts at the inter-document level in order to identify the nature and relations of the concepts represented by the collection. For example, if a news collection contains many articles that mention both event X and company Y, as well as articles that mention both company Y and product Z, then Text Mining analysis indicates a relation between X and Z, even though this relation is not present in any single document.

In classic Data Mining, background knowledge from external sources is used to limit the search. Text Mining systems can use information from external knowledge sources in pre-processing and concept validation operations. In addition, access to background knowledge can play an important role in developing meaningful, consistent and normalized concept hierarchies.

Domain knowledge can also be used by other components of a text analysis system. For example, an important application of background knowledge is the construction of significant constraints on knowledge discovery operations. Similarly, background knowledge can be used to formulate constraints that allow users to browse large result sets more flexibly or to format data for presentation.

Text Mining systems can utilize background knowledge represented as a domain ontology that describes the set of all important facts, classes and relations between these classes. A domain ontology can be used as a vocabulary designed to be both human-readable and machine-readable. A well-known ontology used in Text Mining is WordNet, developed by Princeton University for NL modeling.
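As an illustration of how such an external knowledge source can be queried, the following sketch uses the WordNet interface of the NLTK library (assuming the library and the WordNet corpus are installed; the word pair is an arbitrary example):

    # Sketch: querying WordNet as a background knowledge source.
    # Requires: pip install nltk, then nltk.download("wordnet") once.
    from nltk.corpus import wordnet as wn

    car = wn.synsets("car")[0]          # first listed sense of "car"
    bicycle = wn.synsets("bicycle")[0]

    print(car.definition())                     # gloss of the concept
    print([h.name() for h in car.hypernyms()])  # more general "is-a" parents
    print(car.path_similarity(bicycle))         # taxonomy-based similarity in (0, 1]

Such lookups give concept-level properties (synsets and their hypernyms) that do not depend on the particular document being processed.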
2 Problem formulation

The traditional Text Mining approach is not effective enough to process Big Data, which creates the need for intelligent methods of USD analysis that use background domain knowledge and specialized ontologies for semantic markup of NL texts. We propose to use Wiki technologies and their semantic extensions as a source of domain knowledge. This knowledge can be used in estimations of the semantic similarity of domain concepts for structuring the NL elements of Big Data metadata.

3 Estimations of semantic similarity

It is advisable to apply domain ontological knowledge for estimation of the semantic similarity of domain concepts. Sets of semantically similar concepts can be used as a basis for structuring USD by linking data fragments with ontological elements. Such knowledge makes it possible to quantify the substantive similarity both of the domain concepts and of the NL words and phrases corresponding to these concepts.

The values of similarity estimations depend both on the estimation methods and on the choice of the domain ontology and its pertinence to the user's conception of the domain. Various ontologies that represent different perspectives on the same domain can be used for this purpose. Such ontologies formalize the contexts of the user task, but their use necessitates the integration and reconciliation of these ontologies. It should be noted that the integration of independently created ontologies is a non-trivial problem that cannot be fully automated and requires the participation of domain experts to establish correct relations between ontological concepts.

The definition of semantic closeness between domain concepts is closely related to the problem of mapping independently constructed ontologies of the same domain. The task of domain ontology mapping consists of two separate sub-tasks: local concept mapping, which requires matching between classes and instances of these ontologies, and global concept mapping, which analyzes the entire set of local mappings of ontology elements. Global mapping provides additional information about pairs of concepts from different ontologies based on information about their relations with other elements of the ontologies.

Similarity analysis of hierarchical (taxonomic) relations is probabilistic. The assessment of the similarity of concepts from different ontologies may be based on the positions of these concepts in the hierarchy of classes for which similarity has already been determined: if the superclasses and subclasses of these concepts are similar, then the concepts themselves may also be similar. The similarity of two entities depends on the similarity estimation of the direct superclasses of these concepts, of all their superclasses, of their subclasses and of their instances.

One of the tasks of mathematical semantics is the measurement of semantic distances between NL words. Estimation of semantic distance allows us to assess the density of semantic and associative-semantic relations between words and concepts of the dictionary, between units of text and, in the framework of more complex tasks, between fragments of text.

The value of semantic distance plays an important role in determining meaning with respect to the context of several sentences. The semantic meanings of words in a sentence should create semantic unity; therefore the meanings of the concepts (and semes) of words that stand side by side in a sentence should be in the optimal range of semantic proximity.
Determining the coefficients of semantic connectivity of relations between language units makes it possible to assess the correspondence of an NL fragment and its phrases to the points of a multidimensional space of a potentially generated ordered set, i.e. a classification of NL phrases. Estimation of the distance between language units is also applicable to other tasks: construction of computer thesauri, automatic generation of coherent and meaningful text, the work of expert systems, determination of the text subject, etc. Semantic similarity of NL fragments A and B can be calculated taking into account the frequency of words typical for A and B. Various types of linguistic relations between words, such as homonymy, synonymy, hyperonymy, antonymy and equonymy, have to receive an accurate numerical estimation based on a uniform scalar value.

A special case of ontologies is taxonomies. They are a fairly common and convenient source of knowledge for analyzing the semantic closeness of NL concepts and words.

3.1 Use of taxonomies for semantic similarity evaluation

Evaluation of semantic similarity based on network representations of domain knowledge has a long history that started with the spread of the spreading activation approach [6, 7]. Some researchers consider similarity evaluation in semantic networks using only the taxonomic relation "is-a" and exclude other types of relations [8]; others also analyze "part-of" relations [9]. A common and long-known way of evaluating semantic similarity in a taxonomy lies in measuring the distance between the network nodes that correspond to the elements being compared: the shorter the path from one node to another, the more similar they are. If elements are connected by multiple paths, the shortest path length is used [10, 11].

Although researchers have proposed many similarity measures, these measures are rarely accompanied by an independent characterization of the phenomenon they quantify, in particular when they are used in software (for example, similarity of documents in information retrieval or similarity of cases in case-based reasoning). Instead, the value of such measures depends on their usefulness for particular tasks.

However, the path-based approach relies on the assumption that all connections in the taxonomy represent homogeneous distances. Unfortunately, such uniformity is difficult to guarantee, because real taxonomies show great variability of the distance covered by a single taxonomic relation, especially when some taxonomy subsets are much denser than others. For example, WordNet [12] contains many direct links both between fairly similar concepts and between relatively distant ones. An alternative way of evaluating semantic similarity in a taxonomy is based on the concept of information content and is not sensitive to the variability of link distances [13].

3.2 Similarity and information content

One of the key factors in the similarity of taxonomy concepts is the degree to which they share information, which is defined by the number of highly specific terms that apply to both concepts. The edge-counting method takes this into account indirectly: a long minimum path of "is-a" relations between two nodes means that one has to ascend higher in the taxonomy, towards general abstract concepts, to find the least upper bound, i.e. a concept that subsumes both of the concepts under review.
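A minimal sketch of these two families of measures over a toy "is-a" taxonomy (the taxonomy and the concept frequencies are invented for illustration; the path-based score follows the edge-counting idea of [10, 11], and the information-content score follows [13]):

    # Sketch: edge-counting and information-content similarity in a toy taxonomy.
    import math

    # child -> parent ("is-a") links of an invented taxonomy rooted in "entity".
    PARENT = {"car": "vehicle", "bicycle": "vehicle", "vehicle": "artifact",
              "hammer": "tool", "tool": "artifact", "artifact": "entity"}

    # Invented corpus frequencies; the count of a class includes its descendants.
    FREQ = {"car": 40, "bicycle": 20, "vehicle": 70, "hammer": 10,
            "tool": 15, "artifact": 90, "entity": 100}

    def ancestors(c):
        path = [c]
        while c in PARENT:
            c = PARENT[c]
            path.append(c)
        return path                     # from the concept up to the root

    def lcs(a, b):
        # Least upper bound: the first ancestor of a that also subsumes b.
        anc_b = set(ancestors(b))
        return next(c for c in ancestors(a) if c in anc_b)

    def path_similarity(a, b):
        # Edge counting: a shorter path between nodes means higher similarity.
        common = lcs(a, b)
        length = ancestors(a).index(common) + ancestors(b).index(common)
        return 1.0 / (1.0 + length)

    def ic_similarity(a, b):
        # Information content of the least upper bound: IC(c) = -log p(c).
        p = FREQ[lcs(a, b)] / FREQ["entity"]
        return -math.log(p)

    print(path_similarity("car", "bicycle"), ic_similarity("car", "bicycle"))
    print(path_similarity("car", "hammer"), ic_similarity("car", "hammer"))

In this toy example "car" and "bicycle" are closer than "car" and "hammer" under both measures; the difference is that the path-based score depends only on edge counts, while the information-content score depends on how specific the shared subsumer is.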
Following the standard argumentation of information theory, as the probability of a concept increases, its information content decreases. This quantitative characterization of information provides a new way of measuring semantic similarity. The more information two concepts share, the more similar they are, and the information shared by two concepts is determined by the information content of their common subsumer in the taxonomy.

In practice, we often need to measure the similarity of words rather than concepts. Such a measure can be based on representing a word through the set of taxonomy concepts that correspond to the meanings (senses) of this word.

Although there is no standard way of evaluating computational measures of semantic similarity, it is appropriate to use estimates that are consistent with human similarity judgments. We can compute similarity measures for word pairs and compare them with human-constructed similarity ratings of the same pairs. In [14] similarity measures are related to the maximum depth of the taxonomy and the shortest path length in the taxonomy between the concepts.

Another point of view on comparing concept similarity is based on the use of concept probability rather than information content. Probability-based similarity estimations consider word occurrence frequency more important than information content.

4 Wiki technology as a means of information structuring

Wiki technology is a Web-based technology for building distributed information resources (IR) that allows users to submit and edit materials without additional software or specialized skills, to specify links between individual pages explicitly through hyperlinks, and to define page categories [15]. All content changes become accessible immediately, but users can return to earlier versions.

The Wiki page format uses a simplified markup language to distinguish various structural and visual elements. A large number of Wiki engines and information resources based on them have been implemented; the largest and best-known of them is Wikipedia.

The main elements of Wiki markup are hyperlinks and categories. Their use makes it easy to convert USD into partially structured data. In addition, analysis of the Wiki resource structure at the level of words and concepts allows acquiring knowledge for structuring other USD. Wiki resources can be used as an external source of features for text categorization and for determining the semantic relatedness of NL texts [16].

4.1 Semanticization of Wiki resources

Semantic MediaWiki (SMW) is an extension of MediaWiki [17]. The advantages of SMW are semantic processing of information, the availability of group knowledge management tools, relatively high expressive power and reliable implementation. SMW makes it possible to integrate information from different Wiki pages through knowledge-level retrieval and to generate ontological structures from Wiki pages that can be used by other intelligent software.

In addition to categories, SMW structures information by semantic properties. Semantic properties allow Wiki pages to be linked semantically with each other and with other data. Each semantic property has a type, a name and a value, as well as its own Wiki page in a special namespace, which allows determining its place in the property hierarchy and documenting how the property should be used.
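As a simplified illustration of this kind of markup, the following sketch extracts categories and semantic property annotations of the form [[property::value]] from the wikitext of a page (the page text is invented; a real SMW installation exposes this information through its own query and export mechanisms rather than through such ad hoc parsing):

    # Sketch: extracting categories and semantic properties from SMW wikitext.
    import re

    page = """
    '''Ivan Franko''' was a Ukrainian writer.
    [[Category:Person]] [[Category:Writers]]
    [[Born in::Nahuievychi]] [[Field of activity::Literature]]
    """

    categories = re.findall(r"\[\[Category:([^\]|]+)\]\]", page)
    properties = dict(re.findall(r"\[\[([^\]:|]+)::([^\]|]+)\]\]", page))

    print(categories)   # ['Person', 'Writers']
    print(properties)   # {'Born in': 'Nahuievychi', 'Field of activity': 'Literature'}

Categories and property values obtained in this way are exactly the features used below for grouping pages into sets of semantically similar concepts.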
In terms of ontological analysis, each Wiki page is an ontological element, that is, an element of one of the RDF [18] classes: Thing, Class, ObjectProperty, DatatypeProperty, AnnotationProperty. In addition, each page has its own URI. Usually Wiki pages are instances of OWL [19] ontology classes, Wiki categories are classes, and Wiki semantic properties are object properties and data properties of the ontology. Therefore an appropriate OWL/RDF file can be generated on request for any SMW page or set of pages. Semantic Wiki resources can thus be used as a basis for automated generation of distributed knowledge bases in RDF format. Exporting to OWL/RDF is a means of ensuring external reuse of Wiki data, but only practical application of this feature can show the quality of the generated RDF. To this end, system developers have used a number of Semantic Web tools to consume the exported RDF.

SMW is compatible with the OWL DL knowledge model, and therefore external ontologies can be used in Wiki resources. There are two ways to do this: ontology import allows creation and modification of Wiki pages to represent the relations specified in an existing OWL DL document, and vocabulary reuse allows users to match Wiki pages with elements of existing ontologies.

4.2 Use of semantic similarity estimations in the online version of the Great Ukrainian Encyclopedia

Theoretical principles for the development of semantic search and navigation means are implemented in e-VUE, the portal version of the Great Ukrainian Encyclopedia (vue.gov.ua). This resource is based on an ontological representation of its knowledge base [20]. To use a semantic Wiki resource as a distributed knowledge base, we develop a knowledge model of this resource represented by a Wiki ontology [21]. Using this model for semantic markup provides the formation and software implementation of an appropriate set of hierarchically related categories, templates for typical information objects (IOs), their semantic properties and the queries that use them [22].

Application of semantic similarity estimation to this ontology provides a functional extension of the Encyclopedia with new ways of content access and analysis on the semantic level. An ontological model of the e-VUE structure is used to support semantic navigation on the portal.

One of the significant advantages of e-VUE as a semantic portal is the ability to find semantically similar concepts (SSC). This search is based on the following assumptions:
1. concepts that correspond to Wiki pages belonging to the same set of categories are semantically closer to each other than other concepts of the portal;
2. concepts that correspond to Wiki pages with the same or similar values of semantic properties are semantically closer to each other than concepts whose Wiki pages have different or undefined values of these properties;
3. concepts defined as semantically similar by both preceding criteria are more semantically similar than concepts similar by only one of them.

In e-VUE, a user needs to locate SSC if he or she is unable to select the knowledge field of a concept correctly or enters it with errors. In such cases the user can find similar concepts and then proceed to the desired concept. A simplified sketch of such category- and property-based grouping is given below.
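In this sketch (with invented page descriptions) pages are compared by the overlap of their category sets and of their semantic property values, so that pages similar by both criteria are ranked above pages similar by only one of them; it illustrates the assumptions above and is not the actual e-VUE implementation:

    # Sketch: ranking candidate pages by shared categories and property values.
    def similarity(page, other):
        shared_categories = len(page["categories"] & other["categories"])
        shared_properties = sum(1 for k, v in page["properties"].items()
                                if other["properties"].get(k) == v)
        # Pages similar by both criteria score higher than pages similar by one.
        return shared_categories + shared_properties

    current = {"categories": {"Person", "Writers"},
               "properties": {"Born in": "Lviv", "Field": "Literature"}}

    candidates = {
        "Page A": {"categories": {"Person", "Writers"},
                   "properties": {"Born in": "Kyiv", "Field": "Literature"}},
        "Page B": {"categories": {"Person"},
                   "properties": {"Born in": "Lviv", "Field": "Music"}},
        "Page C": {"categories": {"Cities"}, "properties": {"Country": "Ukraine"}},
    }

    for name, page in sorted(candidates.items(),
                             key=lambda item: similarity(current, item[1]),
                             reverse=True):
        print(name, similarity(current, page))

Restricting the comparison to a fixed subset of categories or properties gives the locally similar concepts discussed below, while using the full feature set gives the globally similar ones.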
A typical scenario: a user wants to find information about a writer or artist whose last name he or she does not remember exactly and whose artistic style he or she cannot determine precisely, but who can be reached through the name of a more famous person who worked in the same sphere. In some cases the problem of SSC search is solved by searching for semantically similar words in the NL definition of the concept.

In order to extend the e-VUE functionality related to search and navigation, we propose means of retrieval of semantically similar IOs, both globally similar (by the full set of features, i.e. categories and values of semantic properties) and locally similar (by some subset of these features). Concepts of e-VUE are matched with the current Wiki page.

To demonstrate the capabilities of the described approach we propose the following examples of local SSC retrieved by:
1. a fixed subset of the categories of the current page;
2. the values of a fixed subset of the semantic properties of the current page;
3. a combination of categories and values of semantic properties of the current page.

The semantic closeness of the search results is determined relative to the characteristics of the current Wiki page that the user is viewing, that is, the categories and properties of this page are analyzed as parameters of the calculation.

Fig. 1. Search for SSC for the e-VUE page "Aviation".

In the first case, for the current Wiki page it is necessary to find the concepts of e-VUE that are assigned to the categories of this page. Currently this retrieval considers categories and sub-categories of knowledge fields and categories of typical IOs (for example, "Countries", "Rivers") and does not take into account service categories related to the form of publication (e.g. "VUE, Volume 1") (see Fig. 1).

It should be noted that none of these cases of SSC search (local or global) can be performed automatically by Semantic MediaWiki's built-in tools. A user can achieve the same result with the Semantic MediaWiki semantic retrieval tool only by manually entering all matching parameters, i.e. the categories and semantic properties copied from the selected Wiki page. Therefore these searches are provided by special software that analyzes the page content.

The second case deals with the search for e-VUE pages corresponding to fixed typical IOs: personalities, cities, countries, etc. Such searches take into account the values of semantic properties specific to this IO and match them with the values of these properties for the current page. For example, for the typical IO "Person" it is possible to search for persons born in the same place, working in the same field, contemporaries, etc.

The third case allows searching for SSC of a selected category (or set of categories) with a set of semantic properties defined for the selected page. For example, one can find persons (pages from the category "Person") who specialize in the fields related to the current page (the categories of this page) (see Fig. 2).

Fig. 2. Search for specialists (by the set of categories of the current page) for e-VUE pages.

SSC search generates Wiki pages with sets of locally similar concepts. These SSC can be used in the analysis of the textual part of USD.

5 Conclusion

We propose to use semantic Wiki markup of information resources as a source for generation of domain ontologies and groups of SSC for the corresponding domains.
The evaluation of semantic similarity can use such characteristics of pages as categories (and their taxonomies), semantic properties and their values, NL content and links with other pages. The resulting SSC sets can then be used as a basis for the analysis of NL unstructured data. The implementation of the proposed approach requires the creation of relevant Wiki resources with appropriate domain knowledge. Use of semantic Wiki technologies for the development of distributed information resources simplifies the process of NL text structuring and also generates a background knowledge source for the analysis of arbitrary NL texts from the corresponding domains. The models and methods proposed in this work help to improve this process.

References

1. Grimes S.: Unstructured Data and the 80 Percent Rule, Clarabridge, Bridgepoints, (2008), http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/.
2. Unstructured data, https://en.wikipedia.org/wiki/Unstructured_data.
3. Grimes S.: A Brief History of Text Analytics, http://www.b-eye-network.com/view/6311.
4. Buneman P., Davidson S., Fernandez M., Suciu D.: Adding structure to unstructured data. In: International Conference on Database Theory, pp. 336-350, (1997).
5. Feldman R., Sanger J.: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, (2007), https://wtlab.um.ac.ir/images/e-library/text_mining/The%20Text%20Mining%20HandBook.pdf.
6. Quillian M. R.: Semantic memory. In: Minsky M. (Ed.), Semantic Information Processing, MIT Press, Cambridge, MA, (1968).
7. Collins A., Loftus E.: A spreading activation theory of semantic processing. In: Psychological Review, 82, pp. 407-428, (1975).
8. Rada R., Mili H., Bicknel E., Blettner M.: Development and application of a metric on semantic nets. In: IEEE Transactions on Systems, Man, and Cybernetics, 19(1), pp. 17-30, (1989).
9. Richardson R., Smeaton A. F., Murphy J.: Using WordNet as a knowledge base for measuring semantic similarity between words. In: Working paper CA-1294, Dublin City University, School of Computer Applications, Dublin, (1994).
10. Lee J. H., Kim M. H., Lee Y. J.: Information retrieval based on conceptual distance in IS-A hierarchies. In: Journal of Documentation, 49(2), pp. 188-207, (1993).
11. Rada R., Bicknell E.: Ranking documents with a thesaurus. In: JASIS, 10(5), pp. 304-310, (1989).
12. Fellbaum C.: WordNet. In: Theory and Applications of Ontology: Computer Applications, pp. 231-243, Springer, Dordrecht, (2010).
13. Resnik P.: Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. In: Journal of Artificial Intelligence Research, 11, pp. 95-130, (1999).
14. Miller G. A., Charles W. G.: Contextual correlates of semantic similarity. In: Language and Cognitive Processes, 6(1), pp. 1-28, (1991).
15. Wagner C.: Wiki: A technology for conversational knowledge management and group collaboration. In: The Communications of the Association for Information Systems, 13(1), pp. 264-289, (2004).
16. Banerjee S., Ramanathan K., Gupta A.: Clustering short texts using Wikipedia. In: Proc. of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 787-788, ACM, (2007).
17. MediaWiki, https://www.mediawiki.org/wiki/MediaWiki.
18. Broekstra J., Klein M., Decker S., Fensel D., Van Harmelen F., Horrocks I.: Enabling knowledge representation on the web by extending RDF schema. In: Computer Networks, 39(5), pp. 609-634, (2002).
19. McGuinness D. L., Van Harmelen F.: OWL Web Ontology Language Overview. In: W3C Recommendation, 10(10), (2004).
20. Rogushina J. V.: Use of semantic properties of the Wiki resources for expansion of functional possibilities of "Great Ukrainian Encyclopedia". In: Encyclopaedias in the Modern Information Space, Ed. Kirillon A. M., Kyiv, pp. 104-115, (2017) [in Ukrainian].
21. Rogushina J.: Analysis of automated matching of the semantic Wiki resources with elements of domain ontologies. In: International Journal of Mathematical Sciences and Computing (IJMSC), 3(3), pp. 50-58, (2017).
22. Rogushina J. V.: The use of ontological knowledge for semantic search of complex information objects. In: Proc. of OSTIS-2017, pp. 127-132, (2017).