=Paper=
{{Paper
|id=Vol-3403/paper2
|storemode=property
|title=Ontological Approach in the Smart Data Paradigm as a Basis for Open Data Semantic Markup
|pdfUrl=https://ceur-ws.org/Vol-3403/paper2.pdf
|volume=Vol-3403
|authors=Julia Rogushina
|dblpUrl=https://dblp.org/rec/conf/colins/Rogushina23
}}
==Ontological Approach in the Smart Data Paradigm as a Basis for Open Data Semantic Markup==
Ontological Approach in the Smart Data Paradigm as a Basis for Open Data Semantic Markup

Julia Rogushina

Institute of Software Systems of the National Academy of Sciences of Ukraine, 40, Ave Glushkov, Kyiv, 03181, Ukraine

Abstract. We analyze existing approaches to the transformation of raw data into a source for analysis and knowledge acquisition known as Smart data. This research area refers to data that have been processed, analyzed and transformed into actionable insights or knowledge. The goal of Smart data is to provide more valuable information that can be used to drive decision-making, enable automation and support a variety of intelligent applications. One direction of Smart data deals with data structuring on the basis of semantic markup, where ontologies are used as a source of domain knowledge. Semantic Wikis are used by researchers for smart data processing due to their ability to combine the benefits of Wiki technologies (easy editing and collaboration) with the advantages of semantic technologies and ontological analysis (formal representation and reasoning). In this research we propose some special cases of ontologies that reduce domain knowledge according to the goals of markup of Semantic MediaWiki resources. We consider the advantages of this technology and the problems of its practical use. The proposed models and methods were validated in the process of developing the portal version of the Great Ukrainian Encyclopedia, which integrates heterogeneous multimedia information from various fields of science.

Keywords: Semantic markup, ontology, Wiki, Smart data, Open science

1. Introduction

Data has become an asset of immense value, but this value depends on the possibility of acquiring useful information from it. The growing interest in intelligent information processing is largely caused by the increase in the volume of available information, its growing heterogeneity, and the need to obtain from it exactly the information that specific users can process to solve their current tasks. Huge volumes of data from various sources are collected and analyzed in order to find economic benefits and competitive advantages for companies and society as a whole. A lot of digital information is created and stored but never used. Statistics show that a significant part of global data is unstructured and has high dimensionality. As a result, the vast majority of available information cannot be analyzed automatically by modern technologies without additional data handling, and this calls for the development of models and methods of pre-processing applied to unstructured raw data.

The digital universe generates huge volumes of heterogeneous digital data [1], but only a small part of these data has some kind of structure and has been analyzed. Unstructured data (USD) usually refers to information that does not have a predefined data model, or whose model does not correspond to the purposes of processing. The USD concept is not well defined, because the same set of data can be structured for one task and USD for another (if its structure is not useful for the purposes of that task). In addition, the structure of data may not have a formal definition, and the possibility and correctness of its interpretation depend on the person who uses it and on the purposes of use.

At the same time, unstructured information can have a certain structure (it can be semi-structured or even structured) that cannot be applied in automated processing without additional clarification. For example, a natural language document can be viewed from some points of view as a structured object and from others as USD. From the linguistic point of view, the structure of a document is represented by punctuation marks and syntax elements (relations between sentence members). In addition, some documents contain formatting elements (tables, columns, paragraphs, etc.) that can also be considered as structuring elements helping to define sub-components of documents, as well as embedded metadata automatically generated by text editors, displaying information such as creation date, scope, authorship, etc. All this information can be used to search for documents according to certain requirements. But the same document is USD from the point of view of content analysis, because such structural elements do not contain information about which information objects (IOs) are described in the document, what properties they have and how they are related to each other. This leads to problems related to storage (traditional databases are not designed for such uncertainty) and analysis. Thus, data are considered as USD in those cases when information about their structure cannot make data analysis more effective, but pre-processing of USD with elements of Smart data technologies makes it possible to transform them into structured or partially structured data. Such technologies as Data Mining and Text Mining can be used for this. Smart data is the way in which different data sources (including Big Data) are combined, correlated and analyzed [2].

2. Related works

The dissemination of the term “Smart data” is closely associated with the use of Big Data. But it should be taken into account that such transformations can be applied to any arrays of information intended for further processing and use, for example, training samples for machine learning or repositories of services or scientific articles. Big data is characterized by several “V” properties, and the number of these “V”s continues to grow: Variability and Veracity were added later to the initial triad of Volume, Velocity and Variety. The Value of Big Data lies in the possibility to find hidden patterns, correlations and connections in large data sets with the help of efficient processing. The implementation of this last aspect, related to Smart data, depends on the ability to achieve an understanding of trusted, contextualized, relevant, cognitive, predictive and consumable data at any scale, great or small.

Smart data mines semantics from Big data and provides information that can be used to make decisions by solving problems related to the volume, speed, variety and reliability of data. Therefore users require automated analysis of data that have been cleaned, transformed, structured and interpreted. Due to the increase in the amount and complexity of data, such analysis faces great challenges. The main result of this process is the transformation of raw data into formalized knowledge.
Intelligent information technologies supporting Smart data should provide the ability for semantic pre-processing and meaningful structuring of information based on reliable, contextualized, relevant, cognitive, predictive and consumable data at any scale. The Smart Data strategy for processing unstructured data into structured and semi-structured data aims to transform input "raw" data into machine-understandable, machine-processable and machine-actionable (instead of simply machine-readable) data, in order to generate information that can be used for communication, citation, transfer, rights management and reuse. This process of knowledge generation through analysis and interpretation involves an intermediate stage of processing, which ensures the transformation of raw data into a form that simplifies their analysis and makes it more effective.

In a broad sense, such transformations correspond to the direction of research called Smart data. The term Smart data refers to the transition from unstructured mass data to knowledge through its intelligent processing, and among the elements of such processing are ensuring compatibility in the representation of information from different sources and semantic annotation of data sets by domain concepts. The methods and tasks of these transformations significantly depend on the subject domain where the analysis is carried out, on the characteristics of the raw data, and on the demands on the knowledge obtained from the data. Smart data is a powerful instrument that can be used not only for economic predictions but also for research in the humanities [3].

The transformation of data into Smart data makes it possible to determine what the data can be used for; that is, this approach is based on the well-known "Data-Information-Knowledge-Wisdom" (DIKW) pyramid [4] that reflects the basic strategy of understanding the world by reducing information to its more significant elements. But the Smart Data approach is not a simple reproduction of DIKW; it is an adaptation of DIKW to the Big Data methodology, which involves the ability to learn by revealing previously unknown patterns, and not just by confirming or rejecting already existing hypotheses (revealing “unknown unknowns”) [5]. An example of discovering the unknown with Smart Data is the research project "The Network Structure of Cultural History" [6], which, on the basis of large datasets of the places of birth and death of more than 150,000 prominent people, reveals previously undocumented patterns of human mobility and culture, illuminating the formation of intellectual and cultural centers, the rise and fall of states, and other influential factors that go beyond the scope of specific events or narrow time intervals.

All areas of Smart data use impose basic constraints on input raw information:
* data are represented in digital form (or means of their digitization on demand are proposed);
* data are stored for a long time on a reliable medium;
* technical and legal possibilities for data access are ensured;
* there are certain (formal or informal) descriptions of these data (their purpose, origin, structure, etc.) which can later be transformed into formalized metadata;
* some knowledge from the domain of data analysis is available, and this knowledge can be used for pre-processing of the data.

Therefore, the problem of pre-processing is not solved by simply transferring information to electronic media and saving it.
Thus, conventional scanning and text recognition are not enough; it is also necessary to associate (automatically or with the help of a human expert) data elements with specific metadata structures from the pertinent domain and to define the technical possibilities and rights of access to the information. The efforts of various experts, including scientists, engineers, business managers and analysts, should ensure the use of various types of data (including Big data) through Smart Data strategies, even though the term Smart data itself may not be used in the relevant research. For example, in the humanities this direction is associated with the “Digging into Data Challenge” program [7], which refers to projects on the analysis of mainly unstructured information resources dating back to ancient times, but can also include structured sets of digitized data. Compared to previous years, the number of materials on multimedia research and other non-textual resources has increased significantly. Interdisciplinary projects at the intersection of the humanities and digital technologies show how best to capture data at large scale and in diverse formats to search for key insights, and how to provide researchers with access to such data through new technological tools designed to provide "bigger smart data" (increase of smart data) and "smarter big data" (semantization of big data) [8].

Pre-processing and analysis of raw data should take into account various data features, such as:
* data source: data generated by humans, by technical devices or automatically;
* type of represented information: text, video, audio, sensor data, etc.;
* aim of pre-processing results: human perception or automated processing with certain tools;
* volume of the data;
* data structuring type: unstructured, semi-structured, structured;
* data changeability: static or dynamic data;
* possibility and means to separate data sets related to a certain task;
* data consistency.

Data pre-processing is aimed at allowing researchers from different domains:
* to access and reuse large volumes of diverse data;
* to reveal patterns and connections that were previously hidden;
* to reveal the impact and significance of the qualitative and quantitative characteristics of the phenomena described by the data, both in the real and in the virtual environment.

The transformation of raw data into Smart Data is based on the application of various technologies, such as cognitive computing, deep learning, machine learning, artificial intelligence, predictive analytics, Data mining, data science, the Internet of Things (IoT), text analysis, Semantic Web technologies and ontological analysis, knowledge graphs, contextual computing, Linked Data, natural language processing (NLP) and semantic search. These technologies are closely interrelated and intersect. For example, deep learning shows great potential for NLP, and cognitive computing uses machine learning to find deep patterns (including those that are not statistical) in complex, unstructured and streaming data. From the point of view of further research, it is important to note the ever-growing interest in the use of Semantic Web standards and semantic search (the Resource Description Framework (RDF) [9] and the Web Ontology Language (OWL) [10]) in Smart data.
3. Main concepts of the Smart data paradigm

If we consider the relation between various types and sources of raw data (including Big data) and Smart data in the context of different areas of scientific research and practical use, then we have to define more precisely the main concepts used in this approach. The key concept that needs clarification is the term "data" itself: it is appropriate to consider not only digitized information in various formats, but also other sources of information that can be digitized in different ways, as well as existing standards of data definition and representation. Such data differ fundamentally from data generated in the "digital universe" (such as data generated by IoT devices and services: images, audio and video from mobile phones, video cameras, information from social networks, etc.), which necessarily contain a minimum set of meta-descriptions. For example, the reference model for the Open Archival Information System (OAIS) defines data as a reinterpreted representation of information in a formalized way suitable for transmission, interpretation or processing. In [11] data are defined as the representation of observations, objects or other entities that are used as evidence of phenomena for the purposes of research or science.

The data accessed by the various libraries, archives and museums (LAMs) and other information institutions can vary greatly in type, nature and quality, and digitizing these data does not change some of these characteristics. The most difficult task is the processing of unstructured data contained in natural language documents and other information objects (textual or non-textual, digitized or not), regardless of the chosen presentation format. In order to transform unstructured data from non-digital media not only into machine-readable but also into machine-processable resources that ensure their analysis and reuse, the Smart Data approach requires technologies for transforming unstructured data into structured and partially structured data (image recognition, voice recognition, etc.) enriched by knowledge about the pertinent domain. It should be taken into account that even non-digitized data can contain some meta-information (formalized or non-formalized) that makes it possible to transform the data into partially structured form. For example, LAMs support metadata for such semi-structured information objects as publications marked with tags according to the Text Encoding Initiative (TEI), and for such structured information objects as bibliographies, indexing databases and citation indexes. These data sets can be relatively small in volume and less heterogeneous compared to Big data, but they are cleaner, more explicit, trusted and value-added, because they are generated primarily on the basis of decisions of human experts, not automatically. In addition, these data are usually open: they belong to resources that are freely available and non-commercial, and this expands the scope of their use. For example, structured data provided by LAMs in the Linked Open Data library community, including elements such as value, syntax, time, place, relevant domain, rules and user profile, can enrich Linked Open Data sets.

Therefore we have to consider the definition of another important concept of Smart data: “Open data”. Terms such as "data" and "open data" cover many meanings that depend on the domain and purposes of their use, and usually do not have a common functional definition.
In addition, in various fields of use these concepts have certain specific features. For example, in scientific research the concept of "open data" contains more informal characteristics that are determined both by the nature of the origin of such data (usually they are the result of a person's conscious intellectual activity) and by the forms of their use by other persons (this may be determined by various licenses and community rules). For open data generated in the "digital universe", the rules and licenses of data use by other applications are defined more exactly (for example, the aspect of access to “personal data”). Therefore these notions are often mixed up in descriptions of Big Data and Smart data technologies. It is advisable to define more clearly what subsets of data can be considered open, what kind of data is analyzed, what data is the result of this analysis, and what limitations are associated with increasing data volume, structuring requirements, and opportunities for application and reuse.

The data used and generated in scientific research have their own specificity, which depends on many factors. In the broadest sense, scientific data are objects that are used as evidence of phenomena for the purposes of research or science [12]. However, this definition does not address the issue of data units, the degree of data processing or the possibility of data sharing. Data that are useful to one researcher may be noise to another. Research data may differ from the information resources on which the research is based.

Another insufficiently defined term is “data set”: a collection of data related to some specific project, task or source of information that is intended to be shared with others. Examples of shared data sets include private exchanges between researchers; data sets on the Web sites of organizations or researchers; data sets placed in archives, repositories, thematic collections or libraries; and supplementary materials to journal articles. From various points of view the same collection of data can be considered either as a data set or as an entire information object. For example, we can define some learning sample either as a data set consisting of an array of instance characteristics or as an information object identified by its name in a library of learning samples.

Data sharing methods vary by domain, data type, goal of sharing, etc. The ability to identify, retrieve and interpret shared data varies according to these methods [13]. Multiple data sets can be integrated at the "raw" or processed levels. Reuse of a single data set in its raw form is difficult, even if adequate documentation and tools are available, because it is necessary to have information about how and in what form the data were collected, what decisions about data cleaning and analysis were made, etc. Some interdisciplinary fields, such as environmental studies, combine data sets from multiple sources. In some cases the primary scientific goal is to integrate disparate data sets into a single set for reuse. Merging data sets is much more difficult, because a large amount of information about each data set has to be known in order to interpret it and trust it enough to draw conclusions.
"Open data" is one of the problematic terms in this field due to the variety and conditions and concepts used for it’s definition (such as "the fewest number of restrictions" and "the lowest possible cost") regarding the assignment of certain data to this category [14], and only some of these conditions are performed in different particular situation. The basic conditions for open data usually concern their legal and technical availability. Examples of open data: repositories and archives (for example, GenBank, Sloan Digital Sky Survey), unified data networks (for example, World Data Centers, Global Biodiversity Information Facility; NASA Distributed Active Archive Centers), domain repositories (for example, PubMedCentral, arXiv), institutional repositories (for example, University of California eScholarship). Data openness has different aspects. Public data repositories can allow authors to retain copyright and control over the data they have submitted. Some data is open, but it can be interpreted with proprietary software. The data can be created using open source software, but a license is required to use the data. Open data repositories can have long-term sustainability plans, but many of them depend on short-term grants or viable business models. In addition, keeping data open for long periods often requires ongoing investment. A promising new development to address the open data challenge is the FAIR standards – Findable, Accessible, Interoperable and Reusable data [15]. These standards apply to repositories that store data. The FAIR standards are adopted by a group of stakeholders to ensure Open science, and they bring together all parts of the research object, from the code to the data and the tools for their interpretation. This approach is developed for scientific information but can be adopted for reuse of any other data. An important aspect of data openness is ensuring the possibility of their reuse [16]. At the same time, we need to understand the difference between use and reuse of data. In the simplest situation, data set is collected by one person (or group of persons) for a specific purpose, and the first use of this data is executed by that person. If the same person returns to the same data set later for the same or a different task, this is also usually considered as use. When this data set is used for another task by someone else (for example, from a repository), then such action is usually considered as data reuse. A separate data set can be reused for another purpose if it is supported by appropriate contextual information and tools. Research replication is an example of independent reuse of a data set. An important factor in data reuse is the use of representation standards. Data published in formats that meet community standards can be analyzed with the help of available tools and combined with other data in those formats. Data integration and reuse is much more difficult in areas where standards are unavailable or less formalized. 4. Proposed methodology The level of data structuring has big influence on the complexity of acquisition of knowledge from it. In this work we propose to use semantic markup as a base of Smart data: semantic tags can become an instrument of data explicit structuring, and interpretation of this structure makes data analysis more productive. All elements of structuring can simplify data analysis if they are pertinent with goals of analysis and can be interpreted by analytic means. 
Semantic markup is one of the common approaches that add structure to different types of data. Most often this approach is applied to natural language documents, but it can also describe various complex information objects with multimedia elements. Various models and software realizations of markup differ significantly in expressiveness, understandability and complexity. The structuring of USD ensures the creation of metadata for individual IOs and for data sets, as well as the marking of content with tags that connect it with the concepts of the corresponding domain. Metadata describe the attributes of an IO. Therefore structuring of USD by semantic markup is one of the ways of Smart data transformation.

The effectiveness of semantic markup depends on:
* understandability of the markup language and the use of standardized notations;
* sufficient expressiveness of the markup language;
* availability of tools for processing marked data (visualization, correctness checking, automation of editing, etc.);
* expressiveness of the query language for semantically marked data and its support in various technological environments;
* markup extensibility;
* possibility of integrating the semantic markup with external knowledge bases and support of open knowledge representation standards.

Semantic markup is a way of data structuring that links an information object and its elements with concepts of some domain. The elements used as markup tags depend on the specifics of the information object. The set of these tags can be fixed (as in HTML) or dynamic (as in XML). Some markup languages are universal (such as XML Schema or RDF Schema), and others are used only for specific types of information objects (such as OWL-S for Web services). Another important difference among markup languages concerns their semantic interpretation. For example, XML Schema and Wiki markup have no associated semantics, while RDF Schema, Semantic MediaWiki (SMW) markup and DAML+OIL include it. Semantics provides a standard and unified way of interpreting the language elements and can be used by reasoners for inference of new knowledge from the given data markup. Non-semantic markup can define structural elements of data such as titles, sub-items, links, etc., but does not represent their meaning. We can consider semantic markup as an extension of metadata, because semantic markup tags can describe both the data file as a whole and some separate elements of its content.

Every semantic markup language is defined by a non-empty finite set of tags and by rules of their use. In [17] the characteristics of semantic markup languages that can be used for their comparison are analyzed. The main ones are:
* Context: the possibility to express the different data contexts for interpretation of tags.
* Subclasses and properties: the possibility to express the meaning of relations between marked objects, their properties and classes.
* Primitive data types: the possibility to define the types of data constants (such as strings, numbers, links, dates, etc., and their complex combinations) used as elements of markup.
* Instances: the possibility to define objects as individuals of some classes.
* Property constraints: the possibility to define the range and domain of object attributes, their possible values and cardinality constraints.
* Property values: the possibility to define values of attributes linked with tags, including a default value or a set of possible choices.
* Negation: the possibility to define statements as negations, conjunctions and disjunctions of other statements.
* Inheritance: the possibility to determine the constraints and values of subclasses by the properties of their parent classes. Multiple inheritance allows inheritance from multiple parent classes.
* Definitions: the possibility to describe necessary and sufficient conditions for class membership.

{| class="wikitable"
|+ Table 1. Characteristics of markup languages (fragment)
! Characteristic !! XML Schema !! RDF Schema !! Wiki !! SMW
|-
| Context || + || + || + || +
|-
| Subclasses || + || + || + || +
|-
| Properties || + || + || - || +
|-
| Property range || + || + || - || +
|-
| Property domain || + || + || - || +
|-
| Property cardinality || + || - || - || +
|-
| Primitive data types || + || - || - || +
|-
| Instances || + || - || + || +
|-
| Property values || + || + || - || +
|-
| Negation || - || - || - || -
|-
| Definitions || - || - || + || +
|}

Determining the structure of USD is a complex scientific problem to which scientists have been paying attention for a long time [18]. Another aspect of this problem deals with the selection of pertinent knowledge that defines the USD markup structure. The choice of the set of tags for semantic markup has the greatest impact on effective retrieval and reuse of data, irrespective of the way of structuring (by a human expert or automatically). Both the correctness of the IO classification and the correct definition of attribute values are important. Thus, the structuring process itself is divided into two stages:
# Selection of a non-empty set of categories and markup tags that allow determining the IO structure and its relations with other objects and groups of objects, based on knowledge about the IO domain;
# Semantic markup of IOs by selecting a subset of categories, linking IO content elements with markup tags and determining the appropriate attribute values.

Semantic markup uses tags that have explicitly defined semantics formalized by knowledge representation means such as ontologies, conceptual graphs, thesauri, etc. The first stage determines the expressiveness of IO structuring and should be pertinent to the aim of data pre-processing, and the second one provides the possibility of using this markup. The set of markup tags connects the raw data with background concepts and relations of the corresponding domain exported from some external knowledge base. This set depends on the task specifics and can contain either all these concepts or a non-empty subset of them. Increasing the number of used concepts provides clearer data structuring, but makes analysis more difficult.

Many semantic markup schemas use ontologies [19] as external knowledge sources that, in the general case, contain classes, instances of classes and relations between them, as well as axioms that determine the rules of admissible combination of these elements. The formal model of a domain ontology <math>O_{domain}</math> is most generally defined as a triplet:

<math>O_{domain} = \langle X, R, F \rangle , \qquad (1)</math>

where <math>X</math> is a finite set of domain concepts divided into a set of classes <math>T_{cl}</math> and a set of class instances <math>T_{ind}</math>; <math>R</math> is a finite set of domain relations between the concepts from <math>X</math>; and <math>F</math> is a finite set of axioms and interpretation functions for the concepts and relations of <math>O_{domain}</math>. In practical use, the elements and the structure of domain ontologies can be defined more precisely according to task specifics. For example, some formal models distinguish various subsets of domain-specific relations or relation properties, and various special cases of ontologies have additional restrictions.
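For illustration, here is a minimal toy instance of model (1); the concepts, the relation and the axiom are hypothetical and serve only to show how the components of the triplet are populated:

<math>
\begin{array}{l}
O_{geo} = \langle X, R, F \rangle , \quad X = T_{cl} \cup T_{ind},\\
T_{cl} = \{ City, Country \}, \quad T_{ind} = \{ Kyiv, Ukraine \},\\
R = \{ capitalOf \},\\
F = \{ \forall x \forall y \, ( capitalOf(x,y) \rightarrow City(x) \wedge Country(y) ) \}
\end{array}
</math>

Here the axiom in <math>F</math> restricts the admissible combinations of elements in the sense described above: the relation <math>capitalOf</math> may connect only an instance of <math>City</math> with an instance of <math>Country</math>.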
One special case of ontology that can be used for the purposes of semantic markup is the task thesaurus, which reduces the knowledge structure for easier processing according to a task description, but can use this domain information for its generation. A task thesaurus <math>Th</math> can be considered as a special case of the ontology <math>O</math> that contains a collection of domain terms. The formal model of a thesaurus is based on the formal model of ontology (1):

<math>Th = \langle T_{th}, R_{th}, I \rangle , \qquad (2)</math>

where <math>T_{th} \subseteq X</math> is a finite set of terms; <math>R_{th} \subseteq R</math> is a finite set of relations between these terms; and <math>I</math> represents additional information about the terms (this information depends on the specifics of the thesaurus goals and can contain, for example, the weight of a term or its definition).

A task thesaurus has a simpler structure because it does not include ontological relations (all task-relevant information about relations is used for the construction of <math>T_{th}</math>), but it includes additional information about every concept: its weight <math>w_i \in W, i = 1, \dots , n</math>, defined by the task description. If <math>w_i = 0</math>, the concept is not included in <math>T_{th}</math>. Therefore, the formal model of a task thesaurus is defined as a set of ordered pairs <math>Th_{task} = \langle \{ (t_i \in T_{th}, w_i \in W) \}, I \rangle</math> with additional information in <math>I</math> about the source ontologies.

We can formally define various characteristics and restrictions of ontology-based knowledge sources and link them with corresponding groups of tasks (see Figure 1).

Figure 1: Characteristics and restrictions of ontology-based knowledge sources

The use of ontologies is based on the following characteristics:
* an ontological representation of a domain is an explicit specification of the conceptualization that provides unambiguous interpretation of its semantics by different users;
* ontologies have a wide range of knowledge representation expressiveness;
* the development of ontologies is based on description logics that support the theoretical grounding of their expressiveness and processing time;
* ontologies are widely used for modeling various domains, and therefore a lot of domain ontologies are accessible from repositories;
* ontological analysis is supported by a large number of standards and instrumental tools for creating and processing domain ontologies;
* ontologies can be integrated with various Semantic Web applications.

The use of existing ontologies makes it possible to avoid re-analyzing the domain structure and to reuse previously acquired expert knowledge. But the success of creating semantic markup depends significantly on the relevance of the chosen ontology to the tasks for which the markup is created and on the ways in which it is used for markup. We have to take into account both the volume of the selected ontology (number of classes, individuals and relations) and its complexity (number of relations between individuals and of axioms that determine the rules of their use) to define ontology pertinence for semantic markup goals. For example, the use of a highly specialized medical ontology is appropriate for markup of educational materials for students, but not convenient for school textbooks. A large number of concepts used as markup tags makes the search more accurate, but reduces the number of relevant answers. The selected domain ontology is the main source of information for search queries and their parameters, hypotheses for machine learning and other operations on the analysis of semantically marked data; that is, knowledge that is not reflected in this ontology is much more difficult to discover and use in the future.
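Returning to model (2): a toy task thesaurus generated from the hypothetical ontology <math>O_{geo}</math> above, for a markup task concerned with capitals, might look as follows (the terms and weights are invented for illustration):

<math>
Th_{task} = \langle \{ (Kyiv, 0.9), (Ukraine, 0.8), (City, 0.6), (Country, 0.5) \}, I \rangle
</math>

where <math>I</math> records the source ontology <math>O_{geo}</math> and, for example, term definitions; any concept with weight 0 is simply absent from the thesaurus.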
Returning to ontology pertinence: if, for example, the ontological model does not contain a "semantic similarity" relation between domain concepts, then the detection of such similarity requires many more computations to model this relation. If the domain of the marked IOs already has commonly used standards (national, international, industrial, etc.) or generally accepted community agreements (such as metadata schemas) for terminology unification, then they must be taken into account. If these standards are formalized in the form of ontologies, then these agreements should be used in the generation of semantic markup tags. If these standards are represented in other forms (tables, natural language documents, dictionaries and taxonomies, etc.), then appropriate ontologies are created on the basis of them, manually or automatically (this subtask is beyond the scope of this study and should support the integration and coordination of knowledge from different sources). Ontologies that reflect different domain sub-areas can be integrated through a top-level ontology (integration of ontologies with disjoint sets of concepts causes few problems), and this entire set of ontologies can be used as a basis of markup.

In practice, the choice of an external ontology as a source of background domain knowledge for semantic markup takes into account the following main parameters:
* the expressive power of the ontology (from simple dictionaries and taxonomies to "heavy" ontologies) pertinent to the markup purposes;
* the volume of the ontology (number of classes, number of class instances, number of attributes of class instances, number of relations, etc.) that allows processing in satisfactory time;
* the presence and correct representation of ontological concepts and relations that are fundamentally important for the analysis of semantically marked data;
* correspondence with the natural language of the marked data;
* a relevant level of ontology specialization.

If no available ontology meets these requirements, we have to build a new ontology, based on one or more existing ones, that more fully meets the conditions of its use. Such a situation is possible if the semantic markup deals with a new, dynamic or very specific sphere, is oriented toward users with specific information needs and beliefs, or marks content represented in a natural language that differs from the language used in existing ontologies. In addition, it is necessary to take into account situations when the development of the semantic markup of an information resource precedes ontology development, and the ontology that formalizes the markup structure is generated on the basis of the marked data. An ontology created in such a way has a lot of functional restrictions (such as characteristics of classes that cannot be acquired by markup analysis), but it can be populated and improved in the future according to the needs of users.

If we find an ontology with greater expressive power than is required for resource marking, then it is advisable to build a simpler ontological structure that contains the necessary background knowledge but reduces redundant elements, characteristics and axioms. Building a simplified ontology requires additional effort, but usually such reduction saves time in each subsequent ontology request. In addition, this transformation of the ontology can be made semi-automatically and does not require significant involvement of domain specialists. But it is important to understand that this new, simpler ontology has a different structure from the initial one and is not its sub-ontology.
Therefore, if the initial external ontology is later improved by its developers, then processing these changes demands a new transformation into the reduced ontology. If an ontology has insufficient expressive power (for example, it does not contain a certain group of relations, or the constraints and ranges of meaning are not defined for some classes), then such an ontology can be improved after consultation with domain specialists without other changes in its structure. The found ontology can then be considered as a sub-ontology of the new, more complete ontology.

If an ontology has a larger volume than is required for the task of semantic marking, then we can use a certain sub-ontology of it. The easiest way is to remove those instances of classes that are not relevant: such reduction does not require changes in the structure of the ontology. In this case, we only need to check that the remaining instances do not use deleted elements as attribute values. It is more difficult to remove unnecessary relations and classes: we need to check that the remaining instances do not use the removed elements as attributes or in scope and definition descriptions. The removal of axioms makes automatic checking of some characteristics of the ontology impossible (for example, the disjointness of two classes or the set of acceptable attribute values), but it can significantly simplify the algorithms for its processing.

If checking the presence of concepts and relations in the ontology indicates the absence of some elements that are fundamentally important for the analysis of the semantically marked data, but the found ontology meets the other requirements and contains a significant number of the required elements, then it is advisable to add these concepts or relations to the initial ontology and check the correctness of the new ontology. Similarly, we can add the desired attributes to some classes (if the ontology contains appropriate classes). In each specific case, we have to determine which solution requires more effort: modification of the found ontology or creation of a new one.

To choose the level of ontology specialization, we have to analyze the possible queries that the semantic markup should satisfy and the data parameters that should be defined. If an ontology contains a large number of concepts that are not used for the needs of the semantic markup, then it is advisable to use a sub-ontology with the necessary classes, but to maintain its compatibility with the original ontology for the possibility of expanding the set of tags (for example, when new data appear). This solution is similar to the processing of large ontologies, where the number of elements is caused by reasons other than the specialization level.

It is clear that the optimal choice of a background knowledge source for semantic markup is an ontology that contains all the necessary domain knowledge and does not contain any extra elements. But such a situation is possible only when the ontology is created specifically for the purposes of semantic markup, or if it is generated from already marked data in order to formalize the semantics of this markup for further use. An example of such an ontology is the Wiki ontology (its model and construction algorithm are discussed in more detail in [20]). Ontologies built for other purposes usually cover a wider scope than is required for semantic markup. Then there is a need to reduce the ontology, that is, to build another ontology using knowledge from the found one.
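The consistency conditions for such a reduction can be sketched in the notation of model (1); this is only one possible formulation, under the simplifying assumption that attribute values and relations are the only way in which instances refer to other elements:

<math>
\begin{array}{l}
O_{red} = \langle X_{red}, R_{red}, F_{red} \rangle , \quad X_{red} \subseteq X, \; R_{red} \subseteq R, \; F_{red} \subseteq F,\\
\forall x \in X_{red} \; \forall r \in R_{red} : \; r(x,y) \Rightarrow y \in X_{red}
\end{array}
</math>

that is, after reduction no remaining instance may refer, through an attribute value or relation, to a deleted element.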
If we find smaller or too specialized ontologies, then we need either to merge them (and then reduce the result) or to populate them manually, and both of these operations require a lot of additional work with the involvement of domain experts. Thus, quite often we need to create a new ontology that is smaller than the initial one, contains all the necessary elements and has an easier-to-process structure, based on the external ontology containing the domain knowledge. Therefore we propose some models and algorithms that can be used for automation of this process.

Different approaches to creating ontology-based markup use different elements of the ontology; therefore such markup has different expressiveness, the process of its creation is more or less complex, and it supports more or less simple processing algorithms. The following approaches are most common:
* on the basis of domain dictionaries that contain only ontology classes and class instances;
* on the basis of ontology classes and relations that allow formalizing the semantics of links between individual IOs in the markup [21];
* on the basis of task thesauri generated from a domain ontology and a markup task.

All these approaches can be formalized on the basis of the formal model of the ontology used for semantic markup. The main steps of ontology generation for semantic markup are:
* retrieval of an external domain ontology that contains background knowledge pertinent to the aims of the semantic markup;
* transformation of the external ontology according to the semantic markup requirements (reduction, merging, population, etc.);
* transformation of the domain knowledge from the generated ontology into another ontological representation that is more convenient for markup procedures.

5. Use of semantic Wiki markup for data structuring

Wiki resources are popular examples of Web-oriented markup. While ordinary Wiki resources use simple links between Wiki pages as markup tags, semantic Wiki resources extend the expressiveness of the markup by semantic definitions of such links: they explicitly define the relations between pages by means of domain concepts. Currently, many dynamic Web-oriented resources are created in the process of joint user activity based on Web 2.0 technologies [22]. Wiki technology [23] is one of the successful Web 2.0 platforms that support mechanisms for collaborative processing of large-scale Web content. MediaWiki [24] is one of the common implementations of Wiki technology, used by such popular resources as Wikipedia, Wikibooks, Wiktionary and Wikidata.

Various semantic extensions of Wiki technology are aimed at adding meaning to Wiki resource elements and making them suitable for automated processing and knowledge-level analysis. They differ in the expressiveness of the markup language and in the capabilities that can be used for data analysis. Many of them are based on the standards of the Semantic Web project [25]. Such extensions allow defining and finding IOs with a complex structure that are typical for a certain domain [26]. The above analysis of semantic markup means and their characteristics shows the importance of: the expressiveness of the markup language; the possibility of its interpretation by humans and other applications; the availability of means and methods to transform external knowledge sources into markup elements; and the quality of software tools for creating and practically using semantic information resources based on such markup.
All these requirements are met by SMW, the semantic extension of Wiki technology, and by the language of semantic markup and semantic queries implemented in it. SMW is a semantic extension of MediaWiki (www.mediawiki.org/wiki/MediaWiki) that provides intelligent organization and search of heterogeneous content [26]. In addition, SMW-based information resources meet the FAIR requirements and can be scaled to represent large-volume and complex content. It should be noted that the application of the widely known Wiki technology significantly simplifies the practical use of such resources for a wide range of people. Formal models, representation languages, processing methods and software tools already exist for them. SMW provides a structured representation of knowledge and the ability to search for it at the content level. But if the marked data (such as national-level encyclopedias) have a large volume and a complex structure, then the built-in possibilities of SMW are not enough, and we have to use modern methods of distributed knowledge management.

The knowledge of an arbitrary external ontology is transformed into a Wiki ontology in terms of semantic Wiki technology: categories, semantic properties and their values, templates of typical IOs, etc. The expressiveness of a Wiki ontology has some limitations, because such an ontology contains only the knowledge that can be obtained directly from the Wiki markup and expressed by means of the markup language. For example, it cannot define characteristics of object properties and data properties, such as equivalence or the possibility of intersection. In many cases, semantic extensions of Wiki technologies have built-in means for automatic or automated generation of such ontologies. SMW supports automatic generation of a Wiki ontology for an arbitrary collection of Wiki pages. On the other hand, the formation of the Wiki ontology (or at least of its structure) can precede the development of the Wiki resource itself. In this case, a certain reference ontology created by experts and knowledge engineers defines the basic domain concepts and the relations between them. Wiki ontologies with low expressiveness can be generated from non-semantic Wiki markup, which contains only information about page categories and links between them, without defining semantics. A Wiki ontology is a special case of ontology, and its formal model can be defined on the basis of constraints of model (1), both for non-semantic and for semantic Wiki resources.
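To make the difference between untyped and semantically defined links concrete, the following sketch contrasts plain MediaWiki markup with its SMW counterpart on a hypothetical page about Kyiv (the property name "Capital of" is invented for illustration):

<pre>
<!-- Plain Wiki markup: an untyped link and a category assignment -->
[[Ukraine]]
[[Category:Cities]]

<!-- SMW markup: the same link typed by the semantic property
     "Capital of", which makes the relation machine-processable -->
[[Capital of::Ukraine]]
[[Category:Cities]]
</pre>

In terms of the models below, the untyped link populates the one-element set of "link" relations, while the typed link populates the set of semantic properties.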
The formal model of the Wiki ontology <math>O_{wiki\_no\_semant}</math> for a non-semantic Wiki resource contains the following components:
* <math>X = X_{cl} \cup X_{ind}</math> is a non-empty set of ontology concepts, where <math>X_{cl}</math> is a set of classes that coincides with the set of Wiki categories represented in the selected set of Wiki pages, and <math>X_{ind}</math> is a set of class instances created as a union of the names of the selected pages <math>P = P_{user} \cup P_{template} \cup P_{spec}</math>, where <math>P_{user}</math> is a set of pages created by users, <math>P_{template}</math> is a set of pages describing Wiki templates, and <math>P_{spec}</math> is a set of other special pages explicitly selected for ontology generation (for example, semantic search pages);
* <math>R = L \cup \{ r_{ier\_cl} \} \cup \{ r_{class\_individual} \}</math> is a set of relations between elements of the ontology, where <math>L = \{ \text{"link"} \}</math> is a one-element set that describes a link from one Wiki page of the resource to another; <math>r_{ier\_cl}</math> is a hierarchical relation between the categories of the Wiki resource, which is determined in the process of creating new categories; and <math>r_{class\_individual}</math> is a hierarchical relation between the categories and the pages of the Wiki resource assigned to these categories;
* <math>F = \{ f_{equ} \}</math> is a one-element set containing an equivalence relation between Wiki pages that can be used for logical inference in the ontology; it connects reference Wiki pages.

The other elements of the ontological model of this Wiki ontology are represented by empty sets.

The formal Wiki ontology model <math>O_{s\_wiki}</math> for semantically enriched Wiki resources is more complex in comparison with the Wiki ontology of non-semantic Wiki resources (such as Wikipedia) and includes a number of elements related to semantic properties [27]:
* the set of Wiki pages <math>X</math> is enriched by the set of pages of semantic properties <math>P_{sem\_prop}</math> (some of them, <math>P_{sem\_prop\_page} \subseteq P_{sem\_prop}</math>, are semantically defined links to other Wiki pages, and the others link pages to values of other data types);
* the set of relations <math>R = r_{ier\_cl} \cup \{ r_i \} \cup R_{s\_prop}</math> is enriched by relations related to the semantic properties of Wiki pages: domain-specific semantic properties <math>R_{s\_prop} = \{ r_{s\_prop_j} \}, j = 1, \dots , m</math>, of type “Page” that link Wiki pages by semantically defined characteristics;
* <math>r_{ier\_cl} = r_{ier\_categor} \cup r_{ier\_property}</math> is enriched by the relation <math>r_{ier\_property}</math> that defines hierarchical relations of semantic properties;
* <math>T</math> is a set of types for the values of semantic properties.

The use of the Wiki ontology elements for semantic markup is unambiguous and is based on one-to-one correspondences (Table 2).

{| class="wikitable"
|+ Table 2. Correspondences of the Wiki ontology elements with semantic markup elements
! Wiki ontology element !! Markup element
|-
| <math>X_{ind}</math> || Page name
|-
| <math>X_{cl}</math> || Category name: [[Category:Category name]]
|-
| <math>P_{template}</math> || Template names
|-
| <math>L = \{ \text{"link"} \}</math> || Link between Wiki pages: [[Page name|Description]]
|-
| <math>r_{ier\_cl}</math> || Relation between categories
|-
| <math>r_{class\_individual}</math> || Relation between categories and individual pages
|-
| <math>L_{sem\_prop}</math> || Semantic property of type “Page”: [[Relation::Page name|Description]]
|}

Tags of the semantic markup are based on these elements represented according to SMW rules.

6. Practical approbation

We use the ontology-based models and methods of semantic markup proposed above for the development of the portal version of the Great Ukrainian Encyclopedia (e-VUE) [28]. It uses MediaWiki [29] version 1.34.0 and the Semantic MediaWiki plug-in version 3.1.5. The semantic markup supports built-in semantic queries that integrate the content of different Wiki pages about various typical IOs.
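Typical IOs are described through semantic templates (discussed below); the following is a simplified sketch of how such a template might store its parameters as SMW semantic properties. The template name "Person" and its parameters are hypothetical and do not reproduce the actual e-VUE schema:

<pre>
<!-- Template:Person (sketch): parameter values passed from an article
     page are stored as semantic properties via the SMW #set function -->
{{#set:
 Birth date={{{birth_date|}}}
 |Birth place={{{birth_place|}}}
 |Directions={{{directions|}}}
}}
[[Category:Persons]]
</pre>

A page that transcludes such a template is automatically categorized and annotated, so semantic queries over these properties can select it.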
For example, such queries automatically create lists of the articles of a given author (see Figure 2), of new articles from a selected category, of the current moderators of scientific spheres, etc.

<pre>
{{#ask: [[Category:VUE authors]] [[Directions::{{PAGENAME}}]]
 |link=all
 |intro=Authors:
 |format=category
 |headers=hide
 |searchlabel=...
 |class=sortable wikitable smwtable
}}
</pre>

Figure 2: Use of semantic markup for integration of VUE content (query and query result)

Semantic templates are now used for unified input of structured information, but the content of Wiki pages can also be enriched by other tags from the Wiki ontology without the use of templates (this approach is used for more specific IOs of the resource or for IOs with non-typical attributes) [30]. The initial elements of the ontological schema were created before the development of this Wiki portal, but later it was enriched and populated with the use of specialized domain ontologies and dictionaries. The semantic markup of e-VUE is supported by semantic templates of typical IOs (such as persons, cities, countries, organizations, seas, rivers, etc.) that use domain concepts as attributes. The semantics, the possible categories of values and the relations between individuals are formalized by the Wiki ontology of this resource (see Figure 3). Templates help to input correct attributes of semantic markup elements.

Figure 3: Wiki ontology of e-VUE (fragment): classes and individuals of the Wiki ontology

7. Conclusion

We analyze approaches to the transformation of raw data into a source for analysis and knowledge acquisition known as Smart data, and consider the possibilities of their integration with ontological analysis. One direction of Smart data deals with data structuring, and we propose to perform such structuring on the basis of semantic markup, where ontologies are used as a source of domain knowledge. In the general case, ontologies need complex means of processing, and therefore we propose some special cases of domain ontologies that reduce domain knowledge according to the goals of markup of Semantic MediaWiki resources. In this work, we formalize models for such special cases of ontologies as the Wiki ontology and the task thesaurus. Semantic extensions of Wiki combine the benefits of traditional Wiki technologies (easy editing and collaboration) with the advantages of semantic processing. We consider the advantages of this technology and the problems of its practical use on the example of the portal version of the Great Ukrainian Encyclopedia, which integrates heterogeneous information from various fields of science and includes a large number of typical information objects with heterogeneous elements. In the future, we plan to consider the integration of semantic Wiki markup with metadata standards used by open ontology repositories and e-libraries, which can serve as external sources of knowledge and structure for various domains.

8. References

[1] P. B. Seel, Digital Universe: The Global Telecommunication Revolution, John Wiley & Sons, 2022.
[2] A. Souifi, Z. C. Boulanger, M. Zolghadri, M. Barkallah, M. Haddar, From Big Data to Smart Data: Application to performance management, IFAC-PapersOnLine 54(1) (2021) 857–862.
[3] M. L. Zeng, Smart data for digital humanities, Journal of Data and Information Science 2(1) (2017) 1–12.
[4] J. Hey, The data, information, knowledge, wisdom chain: the metaphorical link, Intergovernmental Oceanographic Commission 26(1) (2004) 72–94.
[5] S. Sharifi Noorian, S. Qiu, U. Gadiraju, J. Yang, A. Bozzon, What Should You Know? A Human-In-the-Loop Approach to Unknown Unknowns Characterization in Image Recognition, in: Proceedings of the ACM Web Conference 2022, 2022, pp. 882–892. doi: https://doi.org/10.1145/3485447.3512040.
[6] M. Schich, C. Song, Y. Ahn, A. Mirsky, M. Martino, A. Barabási, D. Helbing, A network framework of cultural history, Science 345(6196) (2014) 558–562. URL: www.yongyeol.com/papers/schich-history-2014.pdf.
[7] Digging into Data Challenge, 2020. URL: https://diggingintodata.org/.
[8] C. Schöch, Big? Smart? Clean? Messy? Data in the Humanities?, Journal of the Digital Humanities 2(3) (2013). URL: https://opus.bibliothek.uni-wuerzburg.de/files/12949/059_Schoech_JDH.pdf.
[9] RDF. URL: www.w3.org/RDF.
[10] OWL 2 Web Ontology Language Document Overview, W3C, 2009. URL: http://www.w3.org/TR/owl2-overview/.
[11] I. Pasquetto, B. Randles, C. Borgman, On the reuse of scientific data, 2017. URL: https://escholarship.org/content/qt4xf018wx/qt4xf018wx.pdf.
[12] C. Borgman, P. Darch, A. Sands, M. Golshan, The durability and fragility of knowledge infrastructures: Lessons learned from astronomy, Proceedings of the Association for Information Science and Technology 53 (2016) 1–10. doi: https://doi.org/10.1002/pra2.2016.14505301057.
[13] C. Palmer, N. Weber, M. Cragin, The analytic potential of scientific data: Understanding reuse value, in: Proceedings of the American Society for Information Science and Technology 48(1) (2011) 1–10.
[14] I. Pasquetto, A. Sands, P. Darch, C. Borgman, Open data in scientific settings: From policy to practice, in: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, 2016, pp. 1585–1596.
[15] FAIR data. URL: https://en.wikipedia.org/wiki/FAIR_data.
[16] S. Leonelli, Packaging small facts for re-use: databases in model organism biology, in: How Well Do Facts Travel?, 2010, pp. 325–348. doi: https://doi.org/10.1017/CBO9780511762154.017.
[17] Y. Gil, V. Ratnakar, A Comparison of (Semantic) Markup Languages, in: Proceedings of the FLAIRS Conference, 2002, pp. 413–418.
[18] P. Buneman, S. Davidson, M. Fernandez, D. Suciu, Adding structure to unstructured data, in: Proceedings of the International Conference on Database Theory, 1997, pp. 336–350.
[19] T. R. Gruber, A translation approach to portable ontology specifications, Knowledge Acquisition 5 (1993) 199–220.
[20] J. Rogushina, A. Gladun, Task Thesaurus as a Tool for Modeling of User Information Needs, in: New Perspectives on Enterprise Decision-Making Applying Artificial Intelligence Techniques, Springer, Cham, 2021, pp. 385–403. doi: https://doi.org/10.1007/978-3-030-71115-3_17.
[21] J. Rogushina, I. Grishanova, Ontological methods and tools for semantic extension of the MediaWiki technology, CEUR 2866 (2021) 61–73. URL: http://ceur-ws.org/Vol-2866/ceur_61-73Rogushina6.pdf.
[22] J. E. Pelet, Handbook of Research on User Experience in Web 2.0 Technologies and Its Impact on Universities and Businesses, IGI Global, 2020. URL: http://kmcms.net/Doc/Call/user-experience/about.html.
[23] Y. Koren, Working with MediaWiki, WikiWorks Press, San Bernardino, CA, USA, 2012.
[24] M. Völkel, M. Krötzsch, D. Vrandecic, H. Haller, R. Studer, Semantic Wikipedia, in: Proceedings of the 15th International Conference on World Wide Web, 2006, pp. 585–594.
[25] P. Hitzler, A review of the semantic web field, Communications of the ACM 64(2) (2021) 76–83.
[26] P. Andon, J. Rogushina, I. Grishanova et al., Experience of Semantic Technologies Use for Development of Intelligent Web Encyclopedia, CEUR 2866 (2021) 246–259. URL: http://ceur-ws.org/Vol-2866/ceur_246-259andon24.pdf.
[27] J. Rogushina, A. Gladun, Semantic processing of metadata for Big Data: Standards, ontologies and typical information objects, CEUR 2859 (2020) 114–128. URL: http://ceur-ws.org/Vol-2859/paper10.pdf.
[28] Great Ukrainian Encyclopedia e-VUE, 2022. URL: vue.gov.ua.
[29] MediaWiki, 2021. URL: www.mediawiki.org/wiki/MediaWiki.
[30] J. Rogushina, Semantic Wiki resources and their use for the construction of personalized ontologies, CEUR 1631 (2016) 188–195.