=Paper=
{{Paper
|id=Vol-2022/paper36
|storemode=property
|title=
Standardization of Storage and Retrieval of Semi-structured Thermophysical Data in JSON-documents Associated with the Ontology
|pdfUrl=https://ceur-ws.org/Vol-2022/paper36.pdf
|volume=Vol-2022
|authors=Adilbek O. Erkimbaev,Vladimir Yu. Zitserman,Georgii A. Kobzev,Andrey V. Kosinov
|dblpUrl=https://dblp.org/rec/conf/rcdl/ErkimbaevZKK17a
}}
==
Standardization of Storage and Retrieval of Semi-structured Thermophysical Data in JSON-documents Associated with the Ontology
==
© A.O. Erkimbaev © V.Yu. Zitserman © G.A. Kobzev © A.V. Kosinov
Joint Institute for High Temperatures, Russian Academy of Sciences,
Moscow, Russia
adilbek@ihed.ras.ru vz1941@mail.ru gkbz@mail.ru kosinov@gmail.com
Abstract. A new technology for managing data of complex and irregular structure is proposed. Such a data structure is typical for the representation of the thermophysical properties of substances. The technology, based on the storage of data in JSON files, is associated with ontologies for the semantic integration of heterogeneous sources. The advantages of the JSON format are the ability to store data and metadata within a text document that is easily read by both a person and a computer, and support for the hierarchical structures needed to represent semi-structured data. The availability of a multitude of heterogeneous data from a variety of sources justifies the use of the Apache Spark toolkit. Searching is supposed to combine SPARQL and SQL queries. The first (addressed to the ontology repository) provides the user with the ability to view and search for adequate and related concepts. The second, addressed to the JSON documents, retrieves the required data from the body of the document. The technology makes it possible to overcome the variety of schemes, types and formats of data coming from different sources and to implement a permanent adjustment of the created infrastructure to emerging objects and concepts not provided for at the stage of its creation.
Keywords: thermophysical properties, semi-structured data, JSON format, ontology.
1 Introduction

The constantly increasing volume and complexity of the data structure on the properties of substances and materials impose stringent requirements on the information environment that integrates diverse resources belonging to different organizations and states. In contrast to earth science or medicine, here the source of data is the growing publication flow. In so doing, the volume of data is determined not so much by the number of objects studied as by the unlimited variety of conditions for synthesis and measurement, morphological and microstructural features, and so on. It can be said that of the three defining dimensions of Big Data (the so-called "3V: Volume, Velocity, Variety" [3]), it is the latter that plays the decisive role, that is, the infinite variety of data types.

In this paper, we propose a set of solutions borrowed from Big Data technology that allow one to overcome, with minimum expense, the two main difficulties in the way of integrating resources. The first is the variety of accepted schemes, terminologies, types and formats of data; the second is the need for permanent adaptation of the created structure to emerging variations in the nomenclature of terms (objects, concepts, etc.) not provided for at the design stage. The need for variation in the data structure can be associated with the expansion of the range of substances (e.g. by including nanomaterials), the expansion of the range of properties (e.g. by including state diagrams), or with a change of the data type, say with the transition from constants to functions.

The solutions proposed in this work are based on the joint use of previously used technologies:
• a data interchange standard in the form of text-based structured documents, each of which is treated as an atomic storage unit;
• ontology-based data management;
• the general-purpose Apache Spark framework for large-scale data processing.

The three main elements of the planned data infrastructure (Figure 1) are: a plethora of primary data sources of different structures (databases, file archives, etc.) subject to conversion to the standard JSON format; ontologies and controlled dictionaries for the semantic integration of disparate data; and Big Data technology for storing, searching and analyzing the data.

2 Data preparation

2.1 General scenario

The general scheme of data preparation (Figure 2) assumes as initial material a large body of external resources, thematically related but arbitrary in terms of volume, structure and location. Among them are sources of structured data, which include factual SQL databases (DB), document-oriented DBs, files originally structured in the ThermoML [7] or MatML [9] standards, numerical tables in CSV or XLS formats, and so on. The second group (possibly dominant in terms of volume) is formed by unstructured data: text, images, raw experimental or modeling data, etc.

Proceedings of the XIX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2017), Moscow, Russia, October 10–13, 2017
The first stage of data preparation is the downloading of records from external sources with their subsequent conversion to the standard form of JSON documents [2]. In so doing, the conversion of structured documents can be entrusted to software, whereas the unstructured part is subject to "manual" processing with the extraction of relevant information from the texts. Finally, the control element in this scheme is the repository of subject-specific (domain) and auxiliary ontologies.

Figure 1 Key elements of the data infrastructure concept

The distinctive characteristic of the proposed approach is that the original data sources remain "isolated" and unchanged. Resource owners periodically upload data to JSON files using templates linked with the ontological models. In so doing, they themselves determine the composition, amount and relevance for the "external" world of the data being uploaded. This type of interaction is passive, in contrast to active interaction, when a client can use the JDBC or ODBC interface to access the databases.

Figure 2 Schematic sketch of initial data processing

2.2 JSON-documents

The basic unit of storage is a structured text document recorded in JSON, one of the formats most convenient for data and metadata interchange [2]. The advantage of a JSON document is that it is a text-based, language-independent format that is easy to understand and quickly mastered, and a convenient form for storing and exchanging arbitrarily structured information. Previously, structured text based on the XML format was proposed as a means of storing and exchanging thermophysical data in the ThermoML project [7] and data on the properties of structural materials in the MatML project [9].

Here, a text document written in JSON format is proposed as the main storage unit. JSON is less overloaded with details, which simplifies the presentation of the data structure and reduces document size and processing time. In particular, the JSON format is shorter, faster to read and write, and can be parsed by a standard JavaScript function rather than a special parser, as in the case of the XML format [https://www.w3schools.com/js/js_json_xml.asp].

Among other advantages of the format, one can note its simple syntax, compatibility with almost all programming languages, and the ability to record hierarchical, that is, arbitrarily nested "key-value" structures. A value may be an object (an unordered set of key-value pairs), an array (an ordered set of values), a string, a number (integer or float), or a literal (true, false, null). It is also important that a JSON document is a working object for some platforms, in particular for Apache Spark, allowing for the exchange, storage and querying of distributed data.
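As a minimal sketch of such an atomic storage unit (all key names and values below are invented for illustration; in the proposed technology the keys would be terms drawn from the ontology repository rather than attributes invented by the source), a record can be built and round-tripped with any standard JSON library:

```python
import json

# A hypothetical JSON document for one substance/property record.
# Key names ("Substance", "Property", "DataItem", ...) are illustrative.
record = {
    "Substance": {"Name": "Argon", "CAS": "7440-37-1"},
    "Property": "Heat Capacity",
    "State": {"Phase": "gas", "Temperature": {"Value": 300.0, "Unit": "K"}},
    "DataItem": {"Value": 20.79, "Unit": "J/(mol*K)"},
    "Source": {"Type": "CSV table", "File": "argon_cp.csv"},
}

# JSON is a plain-text interchange format: the document can be written
# and read back without loss by any language's standard parser.
text = json.dumps(record, indent=2)
restored = json.loads(text)
assert restored == record
print(restored["Property"])  # -> Heat Capacity
```

Note how the nested objects carry both the data and its metadata (units, state, provenance) inside one text document, which is the property of the format the paper relies on.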
The rich possibilities of the JSON format as a means of materials properties data interchange have attracted the attention of the developers of the Citrination platform [10]. They proposed the JSON-based hierarchical scheme PIF (physical information file), detailing the object System, whose fields include three data groups explaining what an object is (name, ID) and how it was created/synthesized. The generality of the created scheme should be sufficient for storing objects of arbitrary complexity, "from parts in a car down to a single monomer in a polymer matrix". Flexibility of the PIF scheme is achieved due to additional fields and objects, as well as the introduction of the concept of category. This concept is nothing but a version of the scheme oriented to a certain kind of objects, say substances with a given chemical identity.

2.3 Ontology-based data management

The second stage of data preparation is the linking of the extracted metadata with concepts from ontologies and dictionaries assembled into a single repository. The management of the repository is entrusted to an ontology-based data manager, which allows for the search and editing of ontology terms (classes), as well as their binding to JSON documents (Figure 3). This means that when a particular source schema is converted to the JSON format, terms from ontologies, rather than source attributes, are used as its keys. It is also possible to use additional keys for a detailed description of the data source itself, for example, indicating the type of DBMS, the name and format of a textual or graphical file, authorship and other official data "sewn up" in the atomic "unit" of storage.

The role of the ontologies is to introduce semantics (a common interpretation of meaning) into the documents, as well as the ability to adjust the data structure of the JSON documents by editing the ontology. Linking documents with ontologies allows one to perform semantic (meaningful) search of the data (more precisely, of the metadata) using SPARQL queries, which makes it possible to reveal information at the upper and lower levels (super- and sub-classes) and through side links (related terms) without knowing the schema of the source data. Thus, the user can view and retrieve information without being familiar with the conceptual schema of a particular DB or with the metadata extracted from unstructured sources.

The repository should include three types of ontologies and controlled vocabularies: upper-level, domain and narrowly specialized. The first type is a scientific top-level ontology, which introduces the basic terminology used in different fields, for example such concepts as substance, molecule, property and state, as well as informational entities that reflect the representation of data: data set, data item, document, table, image, etc. Most of these terms and the links between them can be borrowed from ontologies presented on the Ontobee server [11], for example SIO (Semanticscience Integrated Ontology) or CHEMINF (Chemical Information Ontology). The second type of ontology (domain ontology) should cover the terminology of certain subject areas, for example, thermophysics, structural materials, nanostructures, etc. For each of the domains, as a rule, some previously created ontologies, presented in publications or on the Web, are already available, on the basis of which one can build and further maintain one's own subject-specific ontology. Finally, the third type (narrowly specialized) should include ontologies or vocabularies for systems of units (for example, UO on the above-mentioned Ontobee portal) or chemical dictionaries, for example ChemSpider [6] and the like. Figure 3 illustrates the binding of terms from a JSON document to ontological terms.

Even at the stage of data preparation the proposed technology provides:
• consistency with accepted standards regardless of the structure and format of the original data;
• semantic integration of the created JSON documents;
• inclusion of previously unforeseen objects and concepts by expanding classes or introducing new ontologies and dictionaries.

The scheme of the generated data is determined by the initial data scheme, with subsequent correction in the process of linking with the terms and structure of the corresponding ontological model. It should be noted that JSON documents are objects that can be operated on using external APIs. In so doing, there is always the possibility of accessing keys in JSON documents that are not currently linked with a particular ontology term.

At the same time, it seems justified to identify or bind not only keys but also values with ontological terms. For example, the key/value pair "Property": "Heat Capacity" is presented in Figure 3. This will make it easier in the future to form an SQL query relying on the information received from the ontology repository.

The experience of using an ontology for data interchange through text documents has already been implemented in a special format, CIF (Crystallographic Information File) [8], intended for crystallographic information. In other cases where a JSON document is used for the storage of scientific data (for example, in the mentioned Citrination system [10]), the categorization and introduction of new concepts is carried out by a special commission, without linking to the concepts of ontological models.

3 Technique of storage and access to data

Given the increasing volume and the distributed nature of the data on properties, some of the Big Data technologies would be appropriate for the infrastructure design. Their advantage is due not so much to high performance in parallel computing as to a pronounced orientation toward working with data (storage, processing, analysis and so on) in a distributed environment (when data sources are located on remote servers). Among the available open-source means, the Apache Spark high-performance computing platform [5] is offered here. Along with other technological features, it is distinguished by the presence of built-in libraries for complex analytics, including the running of SQL queries. By means of SQL queries one can access the contents of structured JSON documents. It is the ability to address SQL queries to the data that plays a key role in the task of their integration. The efficiency of Spark in the storage and processing of data is also associated with its ability to maintain interaction with a variety of store types: from HDFS (Hadoop Distributed File System) to traditional databases on local servers. We should also note the built-in library GraphX, an application for processing graphs, which provides our project with its own tools for working with ontologies.
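The semantic search described in Section 2.3 (revealing super- and sub-classes of a term) can be imitated in a few lines. The toy hierarchy and term names below are invented for illustration; a real implementation would issue a SPARQL query with an `rdfs:subClassOf*` property path against the ontology repository instead of walking an in-memory dictionary:

```python
# Child -> parent links of a toy class hierarchy (invented for illustration).
SUBCLASS_OF = {
    "Heat Capacity": "Thermodynamic Property",
    "Enthalpy": "Thermodynamic Property",
    "Thermodynamic Property": "Property",
    "Viscosity": "Transport Property",
    "Transport Property": "Property",
}

def narrower_terms(term):
    """Return the term plus all of its transitive sub-classes
    (what an rdfs:subClassOf* path yields in a real SPARQL query)."""
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

# Searching for a broad term also finds documents keyed by narrower terms.
docs = [
    {"Property": "Heat Capacity", "Substance": "Argon"},
    {"Property": "Viscosity", "Substance": "Argon"},
    {"Property": "Enthalpy", "Substance": "Water"},
]
wanted = narrower_terms("Thermodynamic Property")
hits = [d for d in docs if d["Property"] in wanted]
print([d["Substance"] for d in hits])  # -> ['Argon', 'Water']
```

This is exactly the benefit the paper attributes to ontology binding: the user queries a concept, not the idiosyncratic schema of any particular source.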
Figure 3 Linking JSON documents to ontology classes using the example of the ontology for the domain of thermophysical properties

The computing platform shown in Fig. 4 includes three basic elements:
• management and dispatching of distributed computing under the control of Hadoop YARN, Apache Mesos or a stand-alone scheduler;
• a programming interface – APIs for the languages Scala, Java and Python;
• powerful and diverse means of data storage.

Figure 4 Spark Architecture

The advantages of Apache Spark in comparison with other technologies (MapReduce) are higher computing speed and the ability to handle data of different natures (text, semi-structured and structured data, etc.) from various sources (files of different formats, DBMSs and streaming data). Also important are the APIs for Scala, Java and Python, the high-level operators that facilitate code writing, and the integration with the Hadoop ecosystem [4], which unites the libraries and utilities provided for in Big Data technology. The ecosystem has I/O interfaces (InputFormat, OutputFormat) that are supported by a variety of file formats and storage systems (S3, HDFS, Cassandra, HBase, Elasticsearch and so on).

For storage purposes, Apache Spark provides for interaction with three groups of data sources:
• files and file systems – local and distributed file systems (NFS, HDFS, Amazon S3), capable of storing data in different formats: text, JSON, SequenceFiles (binary key/value pairs), etc.;
• sources of structured data available through Spark SQL, including JSON and Apache Hive;
• relational databases and key/value stores (Cassandra, HBase), accessed by built-in and external libraries of interaction with the databases, such as JDBC/ODBC, or by the search engine Elasticsearch.
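The ingestion side of the first group (a file system holding JSON documents) can be sketched with the standard library alone. The file layout and field names are hypothetical; in Spark itself a directory of such documents would typically be loaded in one call (e.g. `spark.read.json(path)`), yielding a distributed collection instead of the plain list built here:

```python
import json
import pathlib
import tempfile

# Write a few atomic storage units (JSON documents) into a directory,
# standing in for a local or distributed file system.
root = pathlib.Path(tempfile.mkdtemp())
docs = [
    {"Substance": "Argon", "Property": "Viscosity"},
    {"Substance": "Water", "Property": "Heat Capacity"},
]
for i, doc in enumerate(docs):
    (root / f"doc_{i}.json").write_text(json.dumps(doc))

# Load the collection back; in Spark this list would be an RDD/DataFrame.
loaded = [json.loads(p.read_text()) for p in sorted(root.glob("*.json"))]
print(len(loaded))  # -> 2
```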
In so doing, Spark supports loading and saving data from files of different formats: unstructured (text), semi-structured (JSON) and structured (such as CSV or SequenceFiles). The operation of Apache Spark reduces to the formation and transformation of RDD (Resilient Distributed Dataset) sets, which are distributed collections of elements. In parallel processing, the data in the sets is automatically distributed between the computers of the cluster. RDD sets can be created by importing HDFS files using the Hadoop InputFormats tool or by converting from another RDD.

The main task (access and search among structured and semi-structured data) is implemented by the Spark SQL module, which is one of the built-in Apache Spark libraries. Spark SQL defines a special type of RDD called SchemaRDD (in recent versions, the term DataFrame is used). The SchemaRDD class represents a set of Row objects, each of which is a normal record. A SchemaRDD carries a schema, that is, a list of the data fields of its records. SchemaRDD supports a number of operations that are not available for other sets, in particular, the execution of SQL queries. SchemaRDD sets can be created from external sources (JSON documents, Hive tables), from query results, and from ordinary RDD sets.

Three main features of Spark SQL:
• it can load data from different sources;
• it queries data using SQL both within Spark programs and from external tools connected to Spark SQL through the standard mechanisms for connecting to databases via JDBC/ODBC;
• it provides an interface between SQL and ordinary code (when used within Spark programs), including the ability to join RDD sets and SQL tables.

It is possible to configure Spark SQL to work with structured data in Apache Hive. The Apache Hive store is specifically designed for querying and analyzing large data sets, both inside HDFS and in other storage systems. The JDBC/ODBC standards are also supported by Spark SQL, which allows executing a direct SQL query to external relational databases in the case of the active type of interaction defined above.
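To show the query pattern without a Spark installation, the following sketch loads a few flattened records into an in-memory SQLite table and retrieves them with the same kind of SQL query that Spark SQL would run over JSON documents. All record contents are invented, and SQLite here is only a dependency-free stand-in: the column names play the role of ontology terms used as JSON keys.

```python
import sqlite3

# Flattened records, as they might look after ingesting JSON documents
# whose keys are ontology terms (contents invented for illustration).
records = [
    {"Substance": "Argon", "Property": "Heat Capacity", "Value": 20.79},
    {"Substance": "Argon", "Property": "Viscosity", "Value": 22.7},
    {"Substance": "Water", "Property": "Heat Capacity", "Value": 75.3},
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (Substance TEXT, Property TEXT, Value REAL)")
con.executemany(
    "INSERT INTO data VALUES (:Substance, :Property, :Value)", records
)

# The SELECT mirrors the two-phase scenario: the literal 'Heat Capacity'
# would be chosen by browsing the ontology repository first.
rows = con.execute(
    "SELECT Substance, Value FROM data WHERE Property = ? ORDER BY Substance",
    ("Heat Capacity",),
).fetchall()
print(rows)  # -> [('Argon', 20.79), ('Water', 75.3)]
```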
Figure 5 Web-environment for managing heterogeneous and distributed data on the substance properties (databases,
unstructured and semistructured files and so on)
The main scenario that uses the Apache Spark features is shown in Figure 5. As a result of uploading data to JSON documents according to the above procedure, we will have data sets with a single system of classifying keys identical to terms from the ontologies. The organization of data requests to the JSON documents will always be based on definitions from this single system. Thus, one can form the SQL query of interest in the interface of the data processing system. In this case, it always remains possible to access the ontology repository to refine or supplement the terms of the query using a SPARQL query (Figure 5). Then the SQL request coming from the user interface initiates the work of the Spark SQL module. As a result of its work, RDD or DataFrame sets are created containing the selected records, which can be processed by the system's service functions for further use. In fact, the user's work consists of two phases: viewing ontology terms with the choice of those adequate for the formation of the SQL query, and access to the repository of
processed data with an SQL query. Thus, the main scenario involves unifying heterogeneous data by converting them to JSON documents and processing them using Spark SQL. Other scenarios are also justified if we take into account the diversity of the source data. For example, the Spark SQL module allows a direct query to relational databases without their conversion to the JSON format. On the other hand, one can provide access to JSON documents by collecting them in a file system using other Big Data tools.

The first and main feature of a JSON data collection based on the terms of ontological models is the unambiguous interpretation of the content and type of the data. In this case users and external programs can work freely with the data, because an ontology term mapped to a key or value in the body of a JSON file has accessible and accepted definitions and properties. For example, links to various types of files (graphics, multimedia, executable files, etc.) can be described adequately and functionally using key terms from ontologies describing data formats. The second feature, strange as it may seem, is the possibility of including in the exchange those data sources that do not allow active access or changes for various reasons. In that case, uploading the data to an external JSON file solves the problem, providing independent storage of the data and their full description via the ontological models.

The listed technologies, supported by Apache Spark, provide unlimited productivity and a variety of opportunities to handle complex data, which include data on the properties of substances, including traditional materials and nanostructures.

References

[1] Ataeva, O.M., Erkimbaev, A.O., Zitserman, V.Yu. et al.: Ontological Modeling as a Means of Integration of Data on Substances' Thermophysical Properties. Proc. of the 15th All-Russian Science Conference "Electronic Libraries: Advanced Approaches and Technologies, Electronic Collections" – RCDL-2013, Yaroslavl, Russia, October 14–17, 2013. http://rcdl.ru/doc/2013/paper/s1_3.pdf
[2] Introduction to JSON. http://json.org/json-ru.html
[3] 3Vs (volume, variety and velocity), definition from TechTarget Network. http://whatis.techtarget.com/definition/3Vs
[4] Apache Hadoop. https://hadoop.apache.org/
[5] Apache Spark. http://spark.apache.org/docs/
[6] ChemSpider. www.chemspider.com
[7] Frenkel, M., Chirico, R.D., Diky, V. et al.: XML-based IUPAC Standard for Experimental, Predicted, and Critically Evaluated Thermodynamic Property Data Storage and Capture (ThermoML) (IUPAC Recommendations 2006). Pure Appl. Chem., 78 (3), pp. 541-612 (2006)
[8] Hall, S.R., McMahon, B.: The Implementation and Evolution of STAR/CIF Ontologies: Interoperability and Preservation of Structured Data. Data Science J., 15 (3), pp. 1-15 (2016). doi: http://dx.doi.org/10.5334/dsj-2016-003
[9] Kaufman, J.G., Begley, E.F.: MatML: A Data Interchange Markup Language. Advanced Materials & Processes, November, pp. 35-36 (2003)
[10] Michel, K., Meredig, B.: Beyond Bulk Single Crystals: A Data Format for all Materials Structure-property-processing Relationships. MRS Bulletin, 41 (8), pp. 617-623 (2016)
[11] Ontobee: A Linked Data Server Designed for Ontologies. www.ontobee.org