A Taxonomic Representation of Scientific Studies

Yevhenii B. Shapovalov and Viktor B. Shapovalov

National Center Junior Academy of Sciences of Ukraine, Dehtiarivska St, 38/44, Kyiv, 04119, Ukraine

Abstract
Given the large volume of scientific data, it is relevant to develop methods for structuring and processing it. Ontology graphs are a promising modern way to do so. Since most studies are written according to IMRAD, this structure was used to integrate different studies into a single structure and to structure them in general. Different ways of creating an integrated ontology using IMRAD are described. To reach the necessary level of abstraction, the IMRAD elements of a set of specific studies were decomposed into levels of abstraction, from L1 (a general integration node with generalized data) to L5 (specific papers with specific data). The content of each node, in the form of metadata, and its further processing are described. The use of the proposed model is illustrated with studies in the field of biogas production. The proposed approach can be used within a single field or be integrated further to structure works from different fields.

Keywords
ontology, IMRAD, structuration, scientific studies, biogas

1. Introduction
Data are now generated at a very high rate, which has made Big Data processing a trend [1, 2]. In practice, processing this volume of data is complicated by the rapid growth in the number of published scientific studies. With the development of STEM education, studies are carried out not only by experienced scientists but also by young people. Such a large number of studies makes processing the data a complicated task. The low dissemination and use of research results (for example, in Ukraine [3, 4]) may be partly related to these difficulties.
Now, scientific studies are published as various kinds of reports, such as articles, conference proceedings, and books. However, processing them is hard because the reports are poorly structured. Admittedly, they are all built on a similar structure named IMRAD [5, 6]. It requires a paper to consist of a generalized Introduction, a description of the Materials and Methods used, a statement of the Results of the study, and a Discussion that compares them with other scientific materials or provides use cases. However, this does not seem to be enough. Here are just some examples of the resulting problems:
• it is hard to start a research career because it is complicated to understand which methods and equipment need to be used in a specific field of study;
• it is hard for young scientists to understand the main parameters that have to be measured for the analysis in a study;
• it is harder for experienced scientists to analyze and collect data from new studies.
These are only a few of the cases in which the large amount of data from scientific studies is a problem. However, they make it relevant to develop new methods for better structuring and data processing of scientific studies. There are a few solutions that automate scientific data processing [7–10], but they do not seem to take IMRAD into account. One method relevant to solving these problems is ontology taxonomies [11–14] with semantic technologies [15]. Ontology taxonomies also have many advantages, such as the possibility of combining them with other types of materials [16], including interactive and web-based courses [17], other information technologies [18, 19], and GIS [20].

ICTERI-2021, Vol I: Main Conference, PhD Symposium, Posters and Demonstrations, September 28 – October 2, 2021, Kherson, Ukraine
EMAIL: sjb@man.gov.ua (A. 1); svb@man.gov.ua (A. 2)
ORCID: 0000-0003-3732-9486 (A. 1); 0000-0001-6315-649X (A. 2)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

This research aims to develop a model that can structure a set of studies using IMRAD. Previously, it was proposed to use ontologies to support a single specific study, but not to create glossaries and structured sets of data. For that purpose, the tools Open provenance, Ontologyt and EXPO [21] were developed. Another ontology solution in the field of science is MoKi, which supports the creation of wiki-based scientific information sources [22, 23]. There are also some specific ontology tools such as Gene ontology [24] and the Centralized educational environment [25]. However, creating an ontology to structure a set of studies seems relevant because of the lack of approaches for doing so.

2. Methods of the research
In this paper, the ontology model was developed using the main principles of Graph Theory, Set Theory, and the Theory of Abstraction [26]. The graph was modeled using a simple hierarchical algorithm that uses only nodes and links. Such a model may later be extended with more advanced graph-building tools such as weight coefficients; however, this would not be possible without first building the simple model. For structuring, the generally accepted IMRAD method was adopted. The data-processing model was developed taking into account the processing capabilities of the Polyhedron system [20, 27, 28], because it has some advantages compared with the well-known Protégé [29, 30] and OWL tools [31–33]. The principles of the cognitive IT platform's tools Filtering, Audit, and Ranking, which provide data processing for decision-making, were described in the form of a mathematical model (a set of equations) [2, 25, 34].

3. Results
3.1. Methods of the research
As noted before, IMRAD is used to prepare scientific papers. So, to structure them, it is possible to use parent nodes that represent the IMRAD components.
IMRAD stands for Introduction, Methods, Results, and Discussion. The discussion part cannot be structured by an ontology because it contains the analysis and comparison of the obtained data. That is why the discussion will be represented as the processing of the results:

$$\neg (D \in REP) \Rightarrow (P \in REP) \tag{1}$$

where REP is a report (or set of reports); D is the discussion of the report's results; P is the processing of the results of a set of studies.

Broadly, an ontology can be devoted to a specific field of science or integrate different fields. Depending on this, the ontology will have four or five levels of abstraction. In a general ontology, the parent node will be “Scientific reports” and its child nodes will name specific fields. In a specific ontology, the parent node will name a specific field. It is then linked with the elements of the IMRAD structure. Each element of IMRAD has its specific representations, which are in turn linked with that element. The leaf nodes are the specific studies that belong to the field. Let us name each level with an L symbol according to its position in the hierarchy:
L1 – general name of the parent node, “Scientific reports”;
L2 – name of the field of the reports;
L3 – part of IMRAD;
L4 – specific representation of IMRAD (specific method, materials used, specific type of results);
L5 – specific reports in which the specific L4 representations of IMRAD were used.
Therefore, the hierarchy for a specific field will have the form {L2, L3, L4, L5} and the general one will have the form {L1, L2, L3, L4, L5}. Interoperability of the L2 nodes of two different graphs may be provided by a graph constructor, which makes it possible to merge graphs in two ways. The first assumes that the graphs are constructed as general graphs of the form {L1, L2, L3, L4, L5} with the same L1 name. The second is to create L1 in the constructor, add two specific graphs of the form {L2, L3, L4, L5} to it, and merge them.
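The level structure and the second merging option can be illustrated with a minimal Python sketch. It is not the graph constructor of any particular tool: the function names `build_field_graph` and `merge_under_root`, the nested-dictionary layout, and the second field with its report and method are assumptions introduced only for illustration.

```python
from collections import defaultdict

# L3 nodes: the IMRAD parts, with Discussion replaced by Processing (P).
IMRAD_PARTS = ("Introduction", "Methods", "Results", "Processing")


def build_field_graph(field, reports):
    """Build a {L2, L3, L4, L5} hierarchy for one field.

    `reports` maps a report name (L5) to a dict of
    IMRAD part (L3) -> list of specific representations (L4).
    The result is nested: field -> part -> representation -> sorted reports.
    """
    parts = {part: defaultdict(set) for part in IMRAD_PARTS}
    for report, content in reports.items():
        for part, representations in content.items():
            for representation in representations:
                parts[part][representation].add(report)
    return {
        field: {
            part: {rep: sorted(found_in) for rep, found_in in nodes.items()}
            for part, nodes in parts.items()
        }
    }


def merge_under_root(root_name, *field_graphs):
    """Create an L1 node and attach several {L2, ...} graphs to it."""
    fields = {}
    for graph in field_graphs:
        fields.update(graph)
    return {root_name: fields}


biogas = build_field_graph("Biogas production", {
    "REP1": {"Methods": ["Nessler's method"],
             "Results": ["Ammonium nitrogen content"]},
    "REP2": {"Methods": ["Nessler's method"]},
})
# Hypothetical second field, used only to show the merge.
other = build_field_graph("Hypothetical field", {
    "REP3": {"Methods": ["Hypothetical method"]},
})
general = merge_under_root("Scientific reports", biogas, other)
```

Here `biogas` has the specific form {L2, L3, L4, L5}, while `general` has the form {L1, L2, L3, L4, L5} with “Scientific reports” as its single L1 node.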
A schematic representation of the general ontology is shown in Fig. 1 (a), and the taxonomy of a specific field is shown in Fig. 1 (b).

Figure 1: The taxonomy of the general science study ontology (a) and the taxonomy of the specific field science study ontology (b), where LR1, LR2, M1, R1, R2 are abstract classes of the literature review (LR), methods, and results of the object.

An alternative and more human-readable way to provide abstraction is to invert this model, beginning with L5 and ending with L1. In this case, the ontology will have the structure {L5, L4, L3, L2, L1}. The same rules apply to this second case. However, the main disadvantage of such a graph is obvious and follows from its structure: the leaf node SR (“Scientific reports”) will not be very useful for users. This type of graph may also be built as {L5, L4, L3}, in which case it can be used to evaluate a specific report, for example during the evaluation of qualifying work (Ph.D. or Master's theses). It will show the abstract classes of each part of IMRAD for each specific study and allows the set of methods and results obtained by the researcher to be evaluated. In this research, we use the first way, with hierarchies of the form {L2, L3, L4, L5} and {L1, L2, L3, L4, L5}.

As can be seen, the general science report ontology is significantly more complicated because of the links between the L1 and L2 levels. There will also be problems with the huge number of methods, results, etc., which may be unnecessary for a user who is looking for information on a specific field. In addition, such graphs will be much harder to create because they have two levels of “one to many” links (see Fig. 1 (a), links between the L2 and L3 levels and between the L4 and L5 levels) compared with only one in a specific ontology (see Fig. 1 (b), only links between the L4 and L5 levels). It may be unreasonable to create a complicated graph. Therefore, it seems relevant to provide both types of hierarchies.
To do so, ontologies should be created for specific fields and then merged as noted before. In this case, the specific parts of IMRAD are used as child nodes of the field of study, and specific studies are used as leaf nodes. So, the general structure of such an ontology may be represented as:

$$\{I, M, R, P\} \in REP \tag{2}$$

where I is the set of Introductions of all studies; M is the set of Materials and Methods of all studies; R is the set of Results of all studies; P is the processing of the results of a set of studies, which replaces the discussion; REP is a report (or set of reports).

For better systematization, we have split the introduction into two parts according to their specifics: basic metadata (BMD) and literature review (LR). The basic metadata node of a study is linked with graph nodes that characterize the basic data on the study, such as hypothesis, object, subject, practical value, and scientific novelty. So, the basic metadata node of a study can be presented by the following equation:

$$BMD = \sum_{i}^{n} \{H_i, O_i, S_i, PV_i, SN_i\} \tag{3}$$

where H is the hypothesis or hypotheses of each specific study; O is the object of the study; S is the subject of each specific study; PV is the practical value of each specific study; SN is the scientific novelty of each specific study.

Let us present each work as a set of Introductions, Methods, Results, and Processing of the data (Discussion). Then each work is represented as a set of these elements that are relevant only to that specific study:

$$S_{\mathrm{I}} = \sum_{i}^{n} \{I_{\mathrm{I},i}, M_{\mathrm{I},i}, R_{\mathrm{I},i}, P_{\mathrm{I},i}\} \tag{4}$$

$$S_{\mathrm{II}} = \sum_{i}^{n} \{I_{\mathrm{II},i}, M_{\mathrm{II},i}, R_{\mathrm{II},i}, P_{\mathrm{II},i}\} \tag{5}$$

So, these articles can be integrated into a single ontology using IMRAD.
Such an ontology is a single set of the elements of both studies, which relate to a single area:

$$O = \sum_{i}^{n} \{S_{\mathrm{I}}, S_{\mathrm{II}}\} = \sum_{i}^{n} \{I_{\mathrm{I},i}, M_{\mathrm{I},i}, R_{\mathrm{I},i}, P_{\mathrm{I},i}, I_{\mathrm{II},i}, M_{\mathrm{II},i}, R_{\mathrm{II},i}, P_{\mathrm{II},i}\} \tag{6}$$

The main advantage of such structures is that some parts of the introduction (for example, keywords), of the materials and methods, and of the results (entities and measured parameters) of studies/reports in the same field can coincide. Such coinciding sub-nodes are then used as links between the studies and provide their interoperability. The proposed approach uses IMRAD to collect and process the data with ontologies. In this way, the ontologies are constructed not from the specific structure of each work but from the generally accepted IMRAD structure. The parent node is the specific area to which the set of studies belongs ($L2 = \sum_{i}^{n} RS_{Ii}$, where L2 is the specific area and RS is the set of studies that represent it). The L2 node is linked with the I, M, R, P nodes (representing IMRAD). Each IMRAD node is linked with the specific nodes that belong to it (such as ammonia determination by Nessler's method for methods, or “chicken manure” or “glycerine” for subjects). Each specific IMRAD type is linked with the leaf nodes of the ontology, that is, the specific studies in which such entities were used. In this case, a few reports (REP1, REP2, and REP3, which belong to L5) are integrated through some of the methods or results (M1, R1, R2, which belong to L4). So, the L4 level is used to structure the reports (L5). The user can use it in both directions: to find which L4 methods, results, etc. were used in a specific L5 report, and to find in which L5 reports a specific L4 method, result, etc. was used. The coincidence of study elements (for example, methods) may be represented as the intersection of such elements across several studies:

$$M_{\mathrm{I}} = \{M_a, M_b, M_c, M_d\} \tag{7}$$

$$M_{\mathrm{II}} = \{M_b, M_d, M_f\} \tag{8}$$

Therefore, in this case, $M_{\mathrm{I}} \cap M_{\mathrm{II}} = \{M_b, M_d\}$, and Mb can be used as a parent node that connects the two different studies (the same holds for Md).
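The intersection of the method sets (7) and (8) and the two directions of lookup between L4 and L5 can be sketched in Python as follows. This is only an illustration under simple assumptions: the labels are the abstract ones used in the text, and the helper name `studies_by_method` is hypothetical.

```python
# Method sets of two studies, as in equations (7) and (8).
methods_by_study = {
    "S_I": {"Ma", "Mb", "Mc", "Md"},
    "S_II": {"Mb", "Md", "Mf"},
}

# Methods that coincide in both studies; these can act as linking L4 nodes.
shared = methods_by_study["S_I"] & methods_by_study["S_II"]


def studies_by_method(methods_by_study):
    """Invert the study -> methods mapping into method -> studies."""
    index = {}
    for study, methods in methods_by_study.items():
        for method in methods:
            index.setdefault(method, set()).add(study)
    return index


# Direction 1 (L5 -> L4): methods used in a given study.
methods_of_first_study = methods_by_study["S_I"]

# Direction 2 (L4 -> L5): studies in which a given method was used.
index = studies_by_method(methods_by_study)

# L4 nodes that link more than one study.
linking_nodes = {m: s for m, s in index.items() if len(s) > 1}
```

With these sets, `shared` contains Mb and Md, so both are candidate nodes for connecting the two studies.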
The node Mb itself contains general theoretical information about the method, while the nodes SI and SII contain information on the specific case of its use and the parameters measured with it. Keywords, for example, can be represented and used hierarchically in the same way. This will be especially useful for students and young scientists who are looking for methods (MI) and parameters that can be used in specific fields and for examples of their use in practice. This approach also provides a list of the parameters and methods used in specific fields.

3.2. Metadata
The metadata of each work is used for processing the data. It may be included in each node. The metadata of L4 nodes represents general information (for example, the essence of the method itself), and the resulting leaf nodes contain the specific metadata related to a specific study (such as specific results obtained using the set of methods M; for example, the metadata “5,35” with the class “Ammonium nitrogen content, g/l”). Metadata with the same class is then processed by filtering at the user's request, or by ranking nodes by a specific class (or set of classes) based on the user's request. So, each node located on level Ei contains metadata whose level of abstraction corresponds to the number of the level: at level 1 it is the most abstract metadata, and at level 5 it is the most specific. As can be seen, all data at levels L1–L4 contains generalized metadata and is not used for processing a specific study, but only to obtain generalized abstract information on the entities used in specific fields. Only the L5 level contains metadata related to a specific study, which is used for further processing. The specific mechanisms “Filtering”, “AUDIT” and “RANKING” of the cognitive IT solution Polyhedron are used to process the information.
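How L5 metadata of the same class might be filtered and ranked can be sketched in Python. This is not the Polyhedron implementation: the `Metadata` record, the functions `filter_by_range` and `rank_by_class`, the range bounds `low` and `high`, and the values for REP2 and REP3 are all assumptions made for illustration. Only the value 5.35 for the class “Ammonium nitrogen content, g/l” comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metadata:
    report: str   # L5 leaf node that holds the value
    cls: str      # metadata class, shared across reports
    value: float  # numeric value of the metadata


def filter_by_range(records, cls, low, high):
    """Keep records of one class whose value lies in [low, high]."""
    return [r for r in records if r.cls == cls and low <= r.value <= high]


def rank_by_class(records, cls, descending=True):
    """Order records of one class by value."""
    selected = [r for r in records if r.cls == cls]
    return sorted(selected, key=lambda r: r.value, reverse=descending)


AMMONIUM = "Ammonium nitrogen content, g/l"
records = [
    Metadata("REP1", AMMONIUM, 5.35),  # value taken from the text
    Metadata("REP2", AMMONIUM, 3.10),  # hypothetical value
    Metadata("REP3", AMMONIUM, 7.80),  # hypothetical value
]

in_range = filter_by_range(records, AMMONIUM, 3.0, 6.0)
ranked = rank_by_class(records, AMMONIUM)
```

With this data, `in_range` keeps REP1 and REP2, and `ranked` orders the reports as REP3, REP1, REP2.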
These mechanisms are applied when different reports have the same Class and Type of information but different values:

$$\{\text{Class: } C_1;\ \text{Type: Number};\ \text{Value: } V_1\} \in REP_1 \tag{9}$$

$$\{\text{Class: } C_1;\ \text{Type: Number};\ \text{Value: } V_2\} \in REP_2 \tag{10}$$

The values V1 and V2 may or may not be equal. “Filtering”, “AUDIT” and “RANKING” can be used to process the data. Filtering can be described by function if: If (Vmin