=Paper=
{{Paper
|id=Vol-2022/paper36
|storemode=property
|title=
Standardization of Storage and Retrieval of Semi-structured Thermophysical Data in JSON-documents Associated with the Ontology
|pdfUrl=https://ceur-ws.org/Vol-2022/paper36.pdf
|volume=Vol-2022
|authors=Adilbek O. Erkimbaev,Vladimir Yu. Zitserman,Georgii A. Kobzev,Andrey V. Kosinov
|dblpUrl=https://dblp.org/rec/conf/rcdl/ErkimbaevZKK17a
}}
==
Standardization of Storage and Retrieval of Semi-structured Thermophysical Data in JSON-documents Associated with the Ontology
==
© A.O. Erkimbaev © V.Yu. Zitserman © G.A. Kobzev © A.V. Kosinov
Joint Institute for High Temperatures, Russian Academy of Sciences,
Moscow, Russia
adilbek@ihed.ras.ru vz1941@mail.ru gkbz@mail.ru kosinov@gmail.com
Abstract. A new technology for managing data of complex and irregular structure is proposed. Such a data structure is typical for the representation of the thermophysical properties of substances. The technology, based on the storage of data in JSON files, is associated with ontologies for the semantic integration of heterogeneous sources. The advantages of the JSON format are the ability to store data and metadata within a text document that is easily read by both a person and a computer, and support for the hierarchical structures needed to represent semi-structured data. The availability of a multitude of heterogeneous data from a variety of sources justifies the use of the Apache Spark toolkit. Searching is supposed to combine SPARQL and SQL queries. The first (addressed to the ontology repository) provides the user with the ability to view and search for adequate and related concepts. The second, addressed to the JSON documents, retrieves the required data from the body of the document. The technology makes it possible to overcome the variety of schemes, types and formats of data coming from different sources and to implement a permanent adjustment of the created infrastructure to emerging objects and concepts not provided for at the stage of its creation.
Keywords: thermophysical properties, semi-structured data, JSON format, ontology.
1 Introduction

The constantly increasing volume and complexity of the data structure on the properties of substances and materials impose stringent requirements on the information environment that integrates diverse resources belonging to different organizations and states. In contrast to earth science or medicine, here the source of data is the growing publication flow. In so doing, the volume of data is determined not so much by the number of objects studied as by the unlimited variety of conditions for synthesis and measurement, morphological and microstructural features, and so on. It can be said that of the three defining dimensions of Big Data (the so-called "3V: Volume, Velocity, Variety" [3]), it is the latter that plays the decisive role, that is, the infinite variety of data types.

In this paper, we propose a set of solutions borrowed from Big Data technology that allow one to overcome, with minimum expense, the two main difficulties in the way of integrating resources. The first is the variety of accepted schemes, terminologies, types and formats of data; the second is the need for permanent adaptation of the created structure to emerging variations in the nomenclature of terms (objects, concepts, etc.) not provided for at the design stage. The need for variation in the data structure can be associated with the expansion of the range of substances (e.g. by including nanomaterials), the expansion of the range of properties (e.g. by including state diagrams), or with a change of the data type, say with the transition from constants to functions.

The solutions proposed in this work are based on the joint use of previously used technologies:
• a data interchange standard in the form of text-based structured documents, each of which is treated as an atomic storage unit;
• ontology-based data management;
• the general-purpose Apache Spark framework for large-scale data processing.

The three main elements of the planned data infrastructure (Figure 1) are: a plethora of primary data sources of different structures (databases, file archives, etc.) subject to conversion to the standard JSON format; ontologies and controlled dictionaries for the semantic integration of disparate data; and Big Data technology for storing, searching and analyzing the data.

2 Data preparation

2.1 General scenario

The general scheme of data preparation (Figure 2) assumes as initial material a large body of external resources, thematically related but arbitrary in terms of volume, structure and location. Among them are sources of structured data, which include factual SQL databases (DB), document-oriented DBs, files originally structured in the ThermoML [7] or MatML [9] standards, numerical tables in CSV or XLS formats, and so on. The second group (possibly dominant in terms of volume) is formed by unstructured data: text, images, raw experimental or modeling data, etc.

Proceedings of the XIX International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL'2017), Moscow, Russia, October 10–13, 2017
The first stage of data preparation is the downloading of records from external sources with their subsequent conversion to the standard form of JSON documents [2]. In so doing, the conversion of structured documents can be entrusted to software, whereas the unstructured part is subject to "manual" processing with the extraction of relevant information from the texts. Finally, the control element in this scheme is the repository of subject-specific (domain) and auxiliary ontologies.

Figure 1 Key elements of the data infrastructure concept

The distinctive characteristic of the proposed approach is that the original data sources remain "isolated" and unchanged. Resource owners periodically upload data to JSON files using templates linked with the ontological models. In so doing, they themselves determine the composition, amount and relevance for the "external" world of the data being uploaded. This type of interaction is passive, in contrast to active interaction, when a client can use the JDBC or ODBC interface to access the databases.

Figure 2 Schematic sketch of initial data processing

2.2 JSON-documents

The basic unit of storage is a structured text document recorded in JSON, one of the formats most convenient for data and metadata interchange [2]. The advantage of a JSON document is that it is a text-based, language-independent format that is easy to understand and quickly mastered, and a convenient form for storing and exchanging arbitrarily structured information. Previously, structured text based on the XML format was proposed as a means of storing and exchanging thermophysical data in the ThermoML project [7] and data on the properties of structural materials in the MatML project [9].

Here, a text document written in JSON format is proposed as the main storage unit. JSON is less overloaded with details, which simplifies the presentation of the data structure and reduces document size and processing time. In particular, the JSON format is shorter, faster to read and write, and can be parsed by a standard JavaScript function rather than a special parser, as in the case of the XML format [https://www.w3schools.com/js/js_json_xml.asp].

Among other advantages of the format, one can note its simple syntax, compatibility with almost all programming languages, and the ability to record hierarchical, that is, arbitrarily nested "key-value" structures. A value may be an object (an unordered set of key-value pairs), an array (an ordered set of values), a string, a number (integer or float), or a literal (true, false, null). It is also important that a JSON document is a working object for some platforms, in particular for Apache Spark, allowing for the exchange, storage and querying of distributed data.
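As a minimal sketch of such an atomic storage unit (all key names and values below are invented for illustration; in the proposed technology the keys would be terms drawn from the ontology repository rather than attributes invented by the source), a record can be built and round-tripped with any standard JSON library:

```python
import json

# A hypothetical JSON document for one substance/property record.
# Key names ("Substance", "Property", "DataItem", ...) are illustrative.
record = {
    "Substance": {"Name": "Argon", "CAS": "7440-37-1"},
    "Property": "Heat Capacity",
    "State": {"Phase": "gas", "Temperature": {"Value": 300.0, "Unit": "K"}},
    "DataItem": {"Value": 20.79, "Unit": "J/(mol*K)"},
    "Source": {"Type": "CSV table", "File": "argon_cp.csv"},
}

# JSON is a plain-text interchange format: the document can be written
# and read back without loss by any language's standard parser.
text = json.dumps(record, indent=2)
restored = json.loads(text)
assert restored == record
print(restored["Property"])  # -> Heat Capacity
```

Note how the nested objects carry both the data and its metadata (units, state, provenance) inside one text document, which is the property of the format the paper relies on.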
The rich possibilities of the JSON format as a means of materials properties data interchange have attracted the attention of the developers of the Citrination platform [10]. They proposed the JSON-based hierarchical scheme PIF (physical information file), detailing the object System, whose fields include three data groups explaining what an object is (name, ID) and how it was created/synthesized. The generality of the created scheme should be sufficient for storing objects of arbitrary complexity, "from parts in a car down to a single monomer in a polymer matrix". Flexibility of the PIF scheme is achieved due to additional fields and objects, as well as the introduction of the concept of category. This concept is nothing but a version of the scheme oriented to a certain kind of objects, say substances with a given chemical identity.

2.3 Ontology-based data management

The second stage of data preparation is the linking of the extracted metadata with concepts from ontologies and dictionaries assembled into a single repository. The management of the repository is entrusted to an ontology-based data manager, which allows for the search and editing of ontology terms (classes), as well as their binding to JSON documents (Figure 3). This means that when a particular source schema is converted to the JSON format, terms from ontologies, rather than source attributes, are used as its keys. It is also possible to use additional keys for a detailed description of the data source itself, for example, indicating the type of DBMS, the name and format of a textual or graphical file, authorship and other official data "sewn up" in the atomic "unit" of storage.

The role of the ontologies is to introduce semantics (a common interpretation of meaning) into the documents, as well as the ability to adjust the data structure of the JSON documents by editing the ontology. Linking documents with ontologies allows one to perform semantic (meaningful) search of the data (more precisely, of the metadata) using SPARQL queries, which makes it possible to reveal information at the upper and lower levels (super- and sub-classes) and through side links (related terms) without knowing the schema of the source data. Thus, the user can view and retrieve information without being familiar with the conceptual schema of a particular DB or with the metadata extracted from unstructured sources.

The repository should include three types of ontologies and controlled vocabularies: upper-level, domain and narrowly specialized. The first type is a scientific top-level ontology, which introduces the basic terminology used in different fields, for example such concepts as substance, molecule, property and state, as well as informational entities that reflect the representation of data: data set, data item, document, table, image, etc. Most of these terms and the links between them can be borrowed from ontologies presented on the Ontobee server [11], for example SIO (Semanticscience Integrated Ontology) or CHEMINF (Chemical Information Ontology). The second type of ontology (domain ontology) should cover the terminology of certain subject areas, for example, thermophysics, structural materials, nanostructures, etc. For each of the domains, as a rule, some previously created ontologies, presented in publications or on the Web, are already available, on the basis of which one can build and further maintain one's own subject-specific ontology. Finally, the third type (narrowly specialized) should include ontologies or vocabularies for systems of units (for example, UO on the above-mentioned Ontobee portal) or chemical dictionaries, for example ChemSpider [6] and the like. Figure 3 illustrates the binding of terms from a JSON document to ontological terms.

Even at the stage of data preparation the proposed technology provides:
• consistency with accepted standards regardless of the structure and format of the original data;
• semantic integration of the created JSON documents;
• inclusion of previously unforeseen objects and concepts by expanding classes or introducing new ontologies and dictionaries.

The scheme of the generated data is determined by the initial data scheme, with subsequent correction in the process of linking with the terms and structure of the corresponding ontological model. It should be noted that JSON documents are objects that can be operated on using external APIs. In so doing, there is always the possibility of accessing keys in JSON documents that are not currently linked with a particular ontology term.

At the same time, it seems justified to identify or bind not only keys but also values with ontological terms. For example, the key/value pair "Property": "Heat Capacity" is presented in Figure 3. This will make it easier in the future to form an SQL query relying on the information received from the ontology repository.

The experience of using an ontology for data interchange through text documents has already been implemented in a special format, CIF (Crystallographic Information File) [8], intended for crystallographic information. In other cases where a JSON document is used for the storage of scientific data (for example, in the mentioned Citrination system [10]), the categorization and introduction of new concepts is carried out by a special commission, without linking to the concepts of ontological models.

3 Technique of storage and access to data

Given the increasing volume and the distributed nature of the data on properties, some of the Big Data technologies would be appropriate for the infrastructure design. Their advantage is due not so much to high performance in parallel computing as to a pronounced orientation toward working with data (storage, processing, analysis and so on) in a distributed environment (when data sources are located on remote servers). Among the available open-source means, the Apache Spark high-performance computing platform [5] is offered here. Along with other technological features, it is distinguished by the presence of built-in libraries for complex analytics, including the running of SQL queries. By means of SQL queries one can access the contents of structured JSON documents. It is the ability to address SQL queries to the data that plays a key role in the task of their integration. The efficiency of Spark in the storage and processing of data is also associated with its ability to maintain interaction with a variety of store types: from HDFS (Hadoop Distributed File System) to traditional databases on local servers. We should also note the built-in library GraphX, an application for processing graphs, which provides our project with its own tools for working with ontologies.
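The semantic search described in Section 2.3 (revealing super- and sub-classes of a term) can be imitated in a few lines. The toy hierarchy and term names below are invented for illustration; a real implementation would issue a SPARQL query with an `rdfs:subClassOf*` property path against the ontology repository instead of walking an in-memory dictionary:

```python
# Child -> parent links of a toy class hierarchy (invented for illustration).
SUBCLASS_OF = {
    "Heat Capacity": "Thermodynamic Property",
    "Enthalpy": "Thermodynamic Property",
    "Thermodynamic Property": "Property",
    "Viscosity": "Transport Property",
    "Transport Property": "Property",
}

def narrower_terms(term):
    """Return the term plus all of its transitive sub-classes
    (what an rdfs:subClassOf* path yields in a real SPARQL query)."""
    result = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in SUBCLASS_OF.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

# Searching for a broad term also finds documents keyed by narrower terms.
docs = [
    {"Property": "Heat Capacity", "Substance": "Argon"},
    {"Property": "Viscosity", "Substance": "Argon"},
    {"Property": "Enthalpy", "Substance": "Water"},
]
wanted = narrower_terms("Thermodynamic Property")
hits = [d for d in docs if d["Property"] in wanted]
print([d["Substance"] for d in hits])  # -> ['Argon', 'Water']
```

This is exactly the benefit the paper attributes to ontology binding: the user queries a concept, not the idiosyncratic schema of any particular source.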
Figure 3 Linking JSON documents to ontology classes using the example of the ontology for the domain of thermophysical properties

The computing platform shown in Fig. 4 includes three basic elements:
• management and dispatching of distributed computing under the control of Hadoop YARN, Apache Mesos or a stand-alone scheduler;
• a programming interface – APIs for the languages Scala, Java and Python;
• powerful and diverse means of data storage.

Figure 4 Spark Architecture

The advantages of Apache Spark in comparison with other technologies (MapReduce) are higher computing speed and the ability to handle data of different natures (text, semi-structured and structured data, etc.) from various sources (files of different formats, DBMSs and streaming data). Also important are the APIs for Scala, Java and Python, the high-level operators that facilitate code writing, and the integration with the Hadoop ecosystem [4], which unites the libraries and utilities provided for in Big Data technology. The ecosystem has I/O interfaces (InputFormat, OutputFormat) that are supported by a variety of file formats and storage systems (S3, HDFS, Cassandra, HBase, Elasticsearch and so on).

For storage purposes, Apache Spark provides for interaction with three groups of data sources:
• files and file systems – local and distributed file systems (NFS, HDFS, Amazon S3), capable of storing data in different formats: text, JSON, SequenceFiles (binary key/value pairs), etc.;
• sources of structured data available through Spark SQL, including JSON and Apache Hive;
• relational databases and key/value stores (Cassandra, HBase), accessed by built-in and external libraries of interaction with the databases, such as JDBC/ODBC, or by the search engine Elasticsearch.
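The ingestion side of the first group (a file system holding JSON documents) can be sketched with the standard library alone. The file layout and field names are hypothetical; in Spark itself a directory of such documents would typically be loaded in one call (e.g. `spark.read.json(path)`), yielding a distributed collection instead of the plain list built here:

```python
import json
import pathlib
import tempfile

# Write a few atomic storage units (JSON documents) into a directory,
# standing in for a local or distributed file system.
root = pathlib.Path(tempfile.mkdtemp())
docs = [
    {"Substance": "Argon", "Property": "Viscosity"},
    {"Substance": "Water", "Property": "Heat Capacity"},
]
for i, doc in enumerate(docs):
    (root / f"doc_{i}.json").write_text(json.dumps(doc))

# Load the collection back; in Spark this list would be an RDD/DataFrame.
loaded = [json.loads(p.read_text()) for p in sorted(root.glob("*.json"))]
print(len(loaded))  # -> 2
```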
In so doing, Spark supports loading and saving data from files of different formats: unstructured (text), semi-structured (JSON) and structured (such as CSV or SequenceFiles). The operation of Apache Spark reduces to the formation and transformation of RDD (Resilient Distributed Dataset) sets, which are distributed collections of elements. In parallel processing, the data in the sets is automatically distributed between the computers of the cluster. RDD sets can be created by importing HDFS files using the Hadoop InputFormats tool or by converting from another RDD.

The main task (access and search among structured and semi-structured data) is implemented by the Spark SQL module, which is one of the built-in Apache Spark libraries. Spark SQL defines a special type of RDD called SchemaRDD (in recent versions, the term DataFrame is used). The SchemaRDD class represents a set of Row objects, each of which is a normal record. A SchemaRDD carries a schema, that is, a list of the data fields of its records. SchemaRDD supports a number of operations that are not available for other sets, in particular, the execution of SQL queries. SchemaRDD sets can be created from external sources (JSON documents, Hive tables), from query results, and from ordinary RDD sets.

Three main features of Spark SQL:
• it can load data from different sources;
• it queries data using SQL both within Spark programs and from external tools connected to Spark SQL through the standard mechanisms for connecting to databases via JDBC/ODBC;
• it provides an interface between SQL and ordinary code (when used within Spark programs), including the ability to join RDD sets and SQL tables.

It is possible to configure Spark SQL to work with structured data in Apache Hive. The Apache Hive store is specifically designed for querying and analyzing large data sets, both inside HDFS and in other storage systems. The JDBC/ODBC standards are also supported by Spark SQL, which allows executing a direct SQL query to external relational databases in the case of the active type of interaction defined above.
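To show the query pattern without a Spark installation, the following sketch loads a few flattened records into an in-memory SQLite table and retrieves them with the same kind of SQL query that Spark SQL would run over JSON documents. All record contents are invented, and SQLite here is only a dependency-free stand-in: the column names play the role of ontology terms used as JSON keys.

```python
import sqlite3

# Flattened records, as they might look after ingesting JSON documents
# whose keys are ontology terms (contents invented for illustration).
records = [
    {"Substance": "Argon", "Property": "Heat Capacity", "Value": 20.79},
    {"Substance": "Argon", "Property": "Viscosity", "Value": 22.7},
    {"Substance": "Water", "Property": "Heat Capacity", "Value": 75.3},
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (Substance TEXT, Property TEXT, Value REAL)")
con.executemany(
    "INSERT INTO data VALUES (:Substance, :Property, :Value)", records
)

# The SELECT mirrors the two-phase scenario: the literal 'Heat Capacity'
# would be chosen by browsing the ontology repository first.
rows = con.execute(
    "SELECT Substance, Value FROM data WHERE Property = ? ORDER BY Substance",
    ("Heat Capacity",),
).fetchall()
print(rows)  # -> [('Argon', 20.79), ('Water', 75.3)]
```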
Figure 5 Web-environment for managing heterogeneous and distributed data on the substance properties (databases,
unstructured and semistructured files and so on)
The main scenario that uses the Apache Spark features is shown in Figure 5. As a result of uploading data to JSON documents according to the above procedure, we will have data sets with a single system of classifying keys identical to terms from the ontologies. The organization of data requests to the JSON documents will always be based on definitions from this single system. Thus, one can form the SQL query of interest in the interface of the data processing system. In this case, it always remains possible to access the ontology repository to refine or supplement the terms of the query using a SPARQL query (Figure 5). Then the SQL request coming from the user interface initiates the work of the Spark SQL module. As a result of its work, RDD or DataFrame sets are created containing the selected records, which can be processed by the system's service functions for further use. In fact, the user's work consists of two phases: viewing ontology terms with the choice of those adequate for the formation of the SQL query, and access to the repository of
processed data with an SQL query. Thus, the main scenario involves unifying heterogeneous data by converting them to JSON documents and processing them using Spark SQL. Other scenarios are also justified if we take into account the diversity of the source data. For example, the Spark SQL module allows a direct query to relational databases without their conversion to the JSON format. On the other hand, one can provide access to JSON documents by collecting them in a file system using other Big Data tools.

The first and main feature of a JSON data collection based on the terms of ontological models is the unambiguous interpretation of the content and type of the data. In this case users and external programs can work freely with the data, because an ontology term mapped to a key or value in the body of a JSON file has accessible and accepted definitions and properties. For example, links to various types of files (graphics, multimedia, executable files, etc.) can be described adequately and functionally using key terms from ontologies describing data formats. The second feature, strange as it may seem, is the possibility of including in the exchange those data sources that do not allow active access or changes for various reasons. In that case, uploading the data to an external JSON file solves the problem, providing independent storage of the data and their full description via the ontological models.

The listed technologies, supported by Apache Spark, provide unlimited productivity and a variety of opportunities to handle complex data, which include data on the properties of substances, including traditional materials and nanostructures.

References

[1] Ataeva, O.M., Erkimbaev, A.O., Zitserman, V.Yu. et al.: Ontological Modeling as a Means of Integration of Data on Substances' Thermophysical Properties. Proc. of the 15th All-Russian Science Conference "Electronic Libraries: Advanced Approaches and Technologies, Electronic Collections" – RCDL-2013, Yaroslavl, Russia, October 14–17, 2013. http://rcdl.ru/doc/2013/paper/s1_3.pdf
[2] Introduction to JSON. http://json.org/json-ru.html
[3] 3Vs (volume, variety and velocity), definition from TechTarget Network. http://whatis.techtarget.com/definition/3Vs
[4] Apache Hadoop. https://hadoop.apache.org/
[5] Apache Spark. http://spark.apache.org/docs/
[6] ChemSpider. www.chemspider.com
[7] Frenkel, M., Chirico, R.D., Diky, V. et al.: XML-based IUPAC Standard for Experimental, Predicted, and Critically Evaluated Thermodynamic Property Data Storage and Capture (ThermoML) (IUPAC Recommendations 2006). Pure Appl. Chem., 78 (3), pp. 541-612 (2006)
[8] Hall, S.R., McMahon, B.: The Implementation and Evolution of STAR/CIF Ontologies: Interoperability and Preservation of Structured Data. Data Science J., 15 (3), pp. 1-15 (2016). doi: http://dx.doi.org/10.5334/dsj-2016-003
[9] Kaufman, J.G., Begley, E.F.: MatML: A Data Interchange Markup Language. Advanced Materials & Processes, November, pp. 35-36 (2003)
[10] Michel, K., Meredig, B.: Beyond Bulk Single Crystals: A Data Format for all Materials Structure-property-processing Relationships. MRS Bulletin, 41 (8), pp. 617-623 (2016)
[11] Ontobee: A Linked Data Server Designed for Ontologies. www.ontobee.org