Framework for preparing subject data in testing modules of scientific applications

E S Fereferov, A G Feoktistov and I V Bychkov

Matrosov Institute for System Dynamics and Control Theory of SB RAS, Lermontov St. 134, Irkutsk, Russia, 664033

fereferov@icc.ru

Abstract. The paper addresses the relevant problem of data preparation for testing modules of scientific applications. Such testing requires multiple executions of modules with different parameters for various scenarios of solving problems in applications. Often, the data sources for problem-solving parameters are subject data (experimental results, reports, statistical forms, and other information resources) created earlier as a result of the functioning of various objects of a subject domain. Usually, such data are heterogeneous and weakly structured. Developers of scientific applications have to make additional efforts in extracting, cleaning, integrating, and formatting data in order to achieve the correctness and efficiency of their use in applications. The aim of this study is to develop a framework for automating the description of semi-structured data and their transformation into target structures used by scientific applications. We propose a conceptual model that allows us to represent knowledge about the structure of the source data, determine their relations with the target structures, and set the rules for data transformation. Additionally, we have developed a framework prototype. It is integrated into the technological scheme of continuous integration for modules of scientific applications (distributed applied software packages) that are developed with the help of Orlando Tools. The effectiveness of the prototype is confirmed by the results of experimental analysis.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction
The paper addresses relevant problems related to preparing data for scientific applications oriented to high-performance computing. Such applications are nowadays one of the main components in the process of carrying out large-scale experiments associated with solving complicated scientific and applied problems. These problems can arise in various spheres of human activity.
The need to obtain the values of subject-oriented data contained in weakly structured sources often arises in the process of developing and using scientific applications. Such sources are databases or files in different formats. Often, they do not conform to the strict structure of tables and relationships of relational database models. Usually, the subject-oriented data were created earlier by subject domain specialists and contain experimental results, reports, or statistical information. Such data are required to set the initial parameters of problems in applications. In this regard, application developers are forced to carry out non-trivial elicitation, refining, transformation, integration, and aggregation of the subject-oriented data into a specific form of their representation in applications.
We propose a new framework for marking up weakly structured data with reference to a given target structure that can be an object structure.
The rest of the paper is structured as follows. In Section 2, we give a brief overview of the known tools for extracting and transforming data from weakly structured information sources. Section 3 presents a conceptual model of transformation tables.
A framework prototype for preparing data for executing scripts of testing modules in scientific applications is proposed in Section 4. Section 5 shows the results of experimental analysis. Section 6 concludes the paper.

2. Related work
Nowadays, the development of new approaches to the integration of heterogeneous data sources with information and computation systems is a challenge [1]. Often, this problem is related to processing big data. In different cases, various approaches are used to solve it. Among them are the following:
• creation of tools for extracting data from semi-structured data sources given in a specific format or in a large spectrum of documents;
• development of tools for converting free-form spreadsheets into a relational data model;
• ensuring the synthesis of the required integrated data structure based on a set of heterogeneous source data structures, etc.
As a rule, the extraction and transformation of data are only partially automated. It is known that the TextRunner [2] and WebTables [3] systems are focused on extracting data from web pages and transforming it into the relational form. Tools such as FlashRelate [4], Foofah [5], and Senbazuru [6] support the extraction of data and the relations between them from spreadsheets. In addition, the TabbyXL system [7, 8] implements the transformation of arbitrary tables into the relational form based on a set of rules for their analysis and interpretation. MIPS [9], which is close to TabbyXL, performs a similar transformation based on the search for critical table cells. The FlashExtract system [10] is a more universal tool. It allows us to extract data from a wider range of documents, including text files, web pages, and spreadsheets. Data extraction is performed on the basis of examples provided by users of this system. It should be noted that the source table model has to be known for the effective operation of the Senbazuru and MIPS systems.
A specific feature of data preparation for executing testing scripts of modules in scientific applications is that each module has its own requirements for the format of input parameters. Therefore, data in the relational form provided by the above-listed systems are not always convenient and may require additional transformations.
Currently, there is a wide range of commercial tools for extracting and transforming data from weakly structured information sources. These include IBM InfoSphere DataStage [11], Talend Open Studio [12], Pentaho Data Integration [13], Informatica PowerCenter [14], OpenRefine [15], etc. Usually, these tools are focused on solving Extract-Transform-Load (ETL) problems. They provide the formation of structured data storages for business intelligence systems (OLAP, OLTP systems, etc.). However, commercial software products are extremely expensive and not always available to developers of scientific applications.
Unlike the aforementioned tools, the proposed framework supports a large spectrum of target structures that can be both relational and object ones. Knowledge about the markup of weakly structured data is stored in structural specifications that can be reused many times by various transformation procedures. Data transformation is based on applying special templates reflecting the relations between semi-structured and target data. Conditions for the use of transformation operations are determined by a set of productions that can take into account the features of data sources.
3. Conceptual model of transformation tables
We assume that weakly structured data is any intermediate data between structured and unstructured data. As a rule, its structure has uncertainties of various kinds. In data processing, the degree of its correctness is not known in advance. The data scheme may not exist or may not fully correspond to the processed data. Some data attributes may be absent or may not fully satisfy the correctness conditions defined for these attributes.
We propose a new model $M$ for describing weakly structured data and the rules for their transformation into target structured formats. This model has the following structure:
$$M = \langle File, WS, TS, Rule, H: WS \rightarrow TS \rangle,$$
where the parameters $File$, $WS$, $TS$, $Rule$, and $H: WS \rightarrow TS$ are interpreted as follows.
The $File$ parameter sets the format of a source (file) of weakly structured data: $File = \{XLS, CSV\}$, where $XLS$ and $CSV$ are the valid file formats.
The scheme $WS$ of weakly structured data is determined by the set $Tbls$ of tables and the set $Refs$ of links between them: $WS = \langle Tbls, Refs \rangle$, where $Tbls = \{t_1, \dots, t_n\}$ and $Refs = \{r_1, \dots, r_m\}$. The construction of such a scheme allows us to mitigate the aforementioned uncertainty of the initial weakly structured data.
The set $Atr$ of attributes and the set $Val$ of their values are assigned to each $i$-th table: $t_i = \langle Atr, Val \rangle$. Each attribute $a_k \in Atr$ is determined by a cell range and a data type: $a_k = \langle Cell_s, Cell_f, Type \rangle$, where $Cell_s$ and $Cell_f$ are the start and end bounds of the range, respectively. We use the following set of data types: $Type = \{I, F, S, D, B, R\}$, where $I$ is the set of integers, $F$ is the set of real numbers, $S$ is the set of string values, $D$ is the set of date and time values, $B = \{true, false\}$ is the set of Boolean values, and $R$ is the reference type indicating a relation $r_j \in Refs$. Table attribute values are characterized by the data range $Val = \langle Cell_s, Cell_f \rangle$. The reference $r_j \in Refs$ is set by the name and attribute of a table in the markup file: $r_j = \langle t_i, a_k \rangle$, where $t_i \in Tbls$ and $a_k \in Atr$.
The structure $TS$ describes the target structured data: $TS = \langle Objs, ObjRefs \rangle$, where $Objs$ is the set of target structure objects and $ObjRefs$ is the set of references between them. Objects from $Objs$ are characterized by a name and a set of fields: $obj_i = \langle Name, Fields \rangle$. The references from $ObjRefs$ define relations between objects from $Objs$.
The $Rule$ parameter represents a set of data transformation rules: $Rule = \{rl_1, \dots, rl_n\}$, where $rl_i = \langle Atr, Fields, Trans, Q: Atr \rightarrow Fields \rangle$. Here, $Atr$ is a set of attributes from the weakly structured data scheme, $Fields$ is a set of fields of target structure objects, and $Trans = \langle Val, Op, StrOp \rangle$ is a structure that defines the applied value transformation operations: $Val = val \lor key$ determines the transfer of a value or a key value from a related table (if a reference $r_j \in Refs$ is given), $Op$ is an operation over attribute values (for example, combining attribute values into one field or splitting an attribute value into several fields), and $StrOp$ is a string operation over values.
The operation $Q: Atr \rightarrow Fields$ matches attributes to fields of objects. Conditions for the use of such operations can be determined by means of productions. The operation $H: WS \rightarrow TS$ maps the weakly structured data to the target data scheme using rules from $Rule$.
The structural metadata-based approach is used to establish the correspondence between the scheme $WS$ of weakly structured data and the structure $TS$ that describes the scheme of the target data. Metadata include data types, table names, attributes, objects, their fields, and the relations between them. The rules for transforming data from the source scheme to the target structure are based on the established correspondences.
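To make the structure of the model $M$ more tangible, the following minimal sketch shows one possible way of encoding its components as plain data structures. It is purely illustrative: the class and field names, and the use of Python itself, are our assumptions made for readability and do not reproduce the data structures of the actual framework prototype.

```python
# Illustrative encoding of the model M = <File, WS, TS, Rule, H: WS -> TS>.
# All names and types here are assumptions made for clarity; they do not
# reproduce the data structures of the framework prototype itself.
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FileFormat(Enum):          # File = {XLS, CSV}
    XLS = "xls"
    CSV = "csv"


class DataType(Enum):            # Type = {I, F, S, D, B, R}
    INTEGER = "I"
    FLOAT = "F"
    STRING = "S"
    DATETIME = "D"
    BOOLEAN = "B"
    REFERENCE = "R"


@dataclass
class Reference:                 # r_j = <t_i, a_k>: referenced table and attribute
    table: str
    attribute: str


@dataclass
class Attribute:                 # a_k = <Cell_s, Cell_f, Type>
    name: str
    cell_start: str              # e.g. "B2"
    cell_end: str                # e.g. "B2"
    data_type: DataType
    ref: Optional[Reference] = None   # set when data_type is DataType.REFERENCE


@dataclass
class Table:                     # t_i = <Atr, Val>
    name: str
    attributes: List[Attribute]
    value_start: str             # Val = <Cell_s, Cell_f>: range of the attribute values
    value_end: str


@dataclass
class SourceScheme:              # WS = <Tbls, Refs>
    tables: List[Table]
    refs: List[Reference]


@dataclass
class TargetObject:              # obj_i = <Name, Fields>
    name: str
    fields: List[str]


@dataclass
class TargetScheme:              # TS = <Objs, ObjRefs>
    objects: List[TargetObject]
    obj_refs: List[Reference]


@dataclass
class Rule:                      # rl_i = <Atr, Fields, Trans, Q: Atr -> Fields>
    source_attributes: List[str]
    target_fields: List[str]
    use_key: bool = False        # Val = val v key: take a value or a key from a related table
    op: Optional[str] = None     # Op: e.g. merge several attributes or split one attribute
    str_op: Optional[str] = None # StrOp: e.g. a regular expression applied to string values


@dataclass
class Model:                     # M; the mapping H: WS -> TS is realized by applying the rules
    file_format: FileFormat
    source: SourceScheme
    target: TargetScheme
    rules: List[Rule] = field(default_factory=list)
```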
4. Framework prototype
Based on the proposed model, we developed a framework prototype for marking up and transforming data from weakly structured sources into target formats of scientific applications. The framework prototype includes a markup tool and transformation modules. The markup tool provides the application developer with the ability to visually customize and specify the process of transforming the data needed to solve a particular class of problems. The transformation modules provide the creation and translation of data into target structures of specific types. The general scheme of marking up and transforming data is shown in figure 1.

Figure 1. Marking and data transformation scheme.

The framework prototype implements the loading and markup of spreadsheet files in the CSV and MS Excel formats. At the markup stage, the package developer visually forms the model $M$ of the source data, specifying table ranges, their attributes, data types, values, and relationships in weakly structured documents. Next, transformation rules are configured. For example, we can specify a rule for splitting an attribute into several attributes or assign conditional processing of values using the «if-then-else» construction. A special package for working with regular expressions, which is a part of Embarcadero RAD Studio, is used to handle string values [16]. The description of the initial data scheme can be extended with additional structures to provide the formation of complex target structures. For example, such an extension can maintain a hierarchical structure.
The developer's knowledge about the markup and transformation rules is saved to a structural specification file. Next, the specification can be processed by data transformation modules into specific target formats or loaded back into the markup tool for correction.
An application programming interface (API) for access by external subsystems is implemented in the markup tool. The API methods provide access to weakly structured data through the structures of the created model. Such a software interface allows us to support interaction with different transformation modules without reworking the markup tool.
Currently, a module for translation into the XML and JSON formats is implemented as a part of the framework prototype. This module creates target structures by accessing the generated model and data through the API. Schemes of target structures for XML and JSON are set using templates in the same file formats. In addition to other constructions, such a template contains specific tags to which values from table fields are passed. These tags are as follows: ⟨#[Type of data] N=Name (Format)⟩, where:
• [Type of data] ∈ {FV, RN}, where FV is a value from a table field and RN is a record counter;
• N=Name is the name of the corresponding table field from which the data will be inserted;
• Format is an optional set of parameters. There is a set of parameters for each type of data (for example, we can specify the start value of RN or the declension of text values from the table fields).
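The following sketch illustrates how such a template could be filled for a sequence of source records. The concrete tag spelling, the hypothetical field names (Region, Output), and the processing order are our simplifying assumptions; the sketch does not reproduce the actual translation module of the prototype.

```python
# Minimal, illustrative template-filling sketch for the <#FV ...> and <#RN ...> tags
# described above. Tag spelling, field names, and processing logic are assumptions.
import re
from typing import Dict, List

# A hypothetical XML template: one <object> element is produced per source record.
TEMPLATE = """<object id="<#RN N=counter>">
  <region><#FV N=Region></region>
  <gas_output><#FV N=Output></gas_output>
</object>"""

TAG = re.compile(r"<#(FV|RN) N=(\w+)(?: \(([^)]*)\))?>")


def fill_template(template: str, records: List[Dict[str, str]], rn_start: int = 1) -> str:
    """Substitute field values (FV) and a record counter (RN) into the template."""
    chunks = []
    for i, record in enumerate(records, start=rn_start):
        def substitute(match: re.Match) -> str:
            tag_type, name, _fmt = match.groups()
            if tag_type == "RN":          # RN: record counter, optionally with a start value
                return str(i)
            return record.get(name, "")   # FV: value taken from the named table field
        chunks.append(TAG.sub(substitute, template))
    return "\n".join(chunks)


if __name__ == "__main__":
    rows = [{"Region": "Irkutsk", "Output": "12.5"},
            {"Region": "Tomsk", "Output": "9.3"}]
    print(fill_template(TEMPLATE, rows))
```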
The specification created using the markup tool is sufficient to generate a database. We implemented a module that generates relational database schemas based on specifications and fills them with data through the API (illustrated by the sketch at the end of this section). The created structural specifications can be applied many times in solving typical problems of data extraction and transformation (for example, when statistical information for different periods is used). In addition, such structural specifications can be applied to automate the creation of application software systems for working with data of target structures.
The framework prototype is integrated into the technological scheme of continuous integration for modules of scientific applications (distributed applied software packages) that are developed with the help of Orlando Tools [17]. The main tasks of such continuous integration in Orlando Tools are receiving, storing, and testing versions of package modules, including the preparation and processing of test data.
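Returning to the database generation module mentioned above, the following fragment shows one possible way to derive DDL statements from a structural specification. It continues the earlier Python sketch and reuses its Model, Table, and DataType definitions; the type mapping and column layout are our assumptions and do not reproduce the prototype's Firebird module.

```python
# Illustrative derivation of a relational schema from a structural specification.
# Continues the earlier sketch (Model, Table, DataType, Reference); the SQL type
# mapping and key conventions are assumptions, not the prototype's actual behavior.
SQL_TYPES = {
    DataType.INTEGER: "INTEGER",
    DataType.FLOAT: "DOUBLE PRECISION",
    DataType.STRING: "VARCHAR(255)",
    DataType.DATETIME: "TIMESTAMP",
    DataType.BOOLEAN: "BOOLEAN",
}


def table_ddl(table: Table) -> str:
    """Build a CREATE TABLE statement for one marked-up source table."""
    columns = ["id INTEGER NOT NULL PRIMARY KEY"]
    for attr in table.attributes:
        if attr.data_type is DataType.REFERENCE and attr.ref is not None:
            # Reference attributes (type R) become foreign keys to the related table.
            columns.append(
                f"{attr.name} INTEGER REFERENCES {attr.ref.table}({attr.ref.attribute})"
            )
        else:
            columns.append(f"{attr.name} {SQL_TYPES.get(attr.data_type, 'VARCHAR(255)')}")
    body = ",\n  ".join(columns)
    return f"CREATE TABLE {table.name} (\n  {body}\n);"


def schema_ddl(model: Model) -> str:
    """Build DDL statements for the whole source scheme of a specification."""
    return "\n\n".join(table_ddl(t) for t in model.source.tables)
```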
5. Experimental analysis
The developed framework prototype was used to test the computing modules of a distributed applied software package (scientific application). This package is used to solve important practical problems of determining the critical objects of the gas supply system of Russia from the standpoint of energy security [18]. In the process of calculations, the package uses a database managed by the Firebird Database Management System (DBMS) [19]. The database scheme and the structure of files containing the values of module parameters are determined by the computational model of the package [20].
In the experiment, the input data for solving such a problem are files in the MS Excel format with statistical parameters of objects of the gas industry of Russia for several time periods. Each file contains 83 parameters of the gas industry objects of Russia in tables located on one sheet. One parameter corresponds to one table attribute. The structure of files with information for various periods differs in the location of individual attributes. Package modules are tested with each data file.
In the process of marking up the file with object parameters for the initial period, 10 tables with a total number of records of more than 2000 were identified. The subject specialist spent about 1 man-hour on the markup of the initial data. The number of constructions in the specifications is shown in figure 2. Additionally, it was necessary to correct the specifications for files with object parameters for other periods. The correction consisted in adding new table attributes and changing the position of some previously created table attributes. In addition, figure 2 shows the number of corrections of the specification for a file with parameters of the initial period when preparing specifications for files with parameters for subsequent periods.
The created specification is applied to automatically translate the source data into the target structure (XML files) of package module parameters according to the given template. The development of macros in MS Excel for translating such data into XML files would require about 10 man-hours of programming. The created specification was also applied to automatically generate the data structure and fill the database managed by the Firebird DBMS. For comparison, importing tables from MS Excel using MS SQL Server tools would require calling and setting up translation procedures for each table [21]. At the same time, information about the relations between the tables is not stored anywhere. A comparison of the working costs for data transformation is presented in figure 3.

Figure 2. The number of specification constructions and their corrections (number of new constructions and number of changes in the specification for the initial, next, …, and final periods).

Figure 3. Comparison of working costs for data transformation (man-hours) for MS Excel macros, MS SQL Server, and the proposed prototype.

The results in figure 3 show a significant reduction in data transformation time when using the proposed framework prototype in comparison with developing MS Excel macros and using MS SQL Server tools. This is very important since such a data transformation procedure must be executed repeatedly when developing scientific applications, in which the intensity of module modifications can reach several times a day.

6. Conclusions
In the paper, we consider the relevant problem of preparing data for testing modules of scientific applications. In such applications, testing modules requires their multiple execution with different parameters for various scenarios of solving problems. Often, the module parameters are weakly structured data, which require additional efforts from the application developer to transform them into the formats used by scientific applications.
We have developed a framework prototype that supports marking up, extracting, and transforming subject data from semi-structured sources. The developed framework prototype allows us to describe knowledge about the structure of subject data sources and save it in the form of declarative specifications. This approach is based on applying special templates reflecting the relations between semi-structured and target data. It is sufficiently flexible because the specifications generated using these templates contain all the information necessary to solve various problems of transforming the source data into the target formats used by applications.
The experimental results showed a significant reduction in data transformation time through applying the proposed framework prototype in comparison with developing MS Excel macros and using MS SQL Server tools to this end.
Further study is directly related to extending the spectrum of target structures in transforming the semi-structured source data. In addition, we plan to develop effective algorithms for extracting and cleaning semi-structured data for new target structures.

Acknowledgment. The study is supported by the Russian Foundation for Basic Research, project no. 19-07-00097-a (reg. no. АААА-А19-119062590002-7). This work was also supported in part by the Basic Research Program of SB RAS, projects no. IV.38.1.1 (reg. no. АААА-А17-117032210078-4) and no. IV.38.1.2 (reg. no. АААА-А17-117032210079-1).
References
[1] Cafarella M J, Halevy A, Wang D Z, Wu E and Zhang Y 2008 WebTables: exploring the power of tables on the web Proc. of the VLDB Endowment 1(1) 538–549
[2] Banko M, Cafarella M J, Soderland S, Broadhead M and Etzioni O 2007 Open information extraction from the web Proc. of the 20th Int. Joint Conf. on Artificial Intelligence pp 2670–2676
[3] Halevy A, Rajaraman A and Ordille J 2006 Data integration: the teenage years Proc. of the 32nd Int. Conf. on Very Large Data Bases pp 9–16
[4] Barowy D W, Gulwani S, Hart T and Zorn B 2015 FlashRelate: Extracting relational data from semi-structured spreadsheets using examples ACM SIGPLAN Notices 50(6) 218–228
[5] Jin Z, Anderson M R, Cafarella M and Jagadish H V 2017 Foofah: Transforming data by example Proc. of the ACM Int. Conf. on Management of Data pp 683–698
[6] Chen Z, Cafarella M, Chen J, Prevo D and Zhuang J 2013 Senbazuru: a prototype spreadsheet database management system Proc. of the VLDB Endowment 6(12) 1202–1205
[7] Bychkov I V, Mikhalov A A, Paramonov V V, Rugnikov G M and Shigarov A O 2017 TABBYXL: The system for transforming data from arbitrary spreadsheets into a relational form Proc. of the 16th All-Russian Conf. on Distributed Information and Computing Resources (DICR-2017) pp 150–156
[8] Shigarov A, Khristyuk V and Mikhailov A 2019 TabbyXL: Software platform for rule-based spreadsheet data extraction and transformation SoftwareX 10 100270
[9] Embley D W, Krishnamoorthy M S, Nagy G and Seth S 2016 Converting heterogeneous statistical tables on the web to searchable databases Int. J. Document Analysis and Recognition 19 119–138
[10] Le V and Gulwani S 2014 FlashExtract: A framework for data extraction by examples ACM SIGPLAN Notices 49(6) 542–553
[11] Blokdyk G 2017 IBM InfoSphere DataStage: The Definitive Guide (CreateSpace Independent Publishing Platform) p 120
[12] Barton R D 2013 Talend Open Studio Cookbook (Packt Publishing) p 270
[13] Casters M, Bouman R and Dongen J 2010 Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration (Wiley) p 720
[14] Malewar R 2017 Learning Informatica PowerCenter 10.x (Packt Publishing) p 426
[15] Kusumasari T F and Fitria 2016 Data profiling for data quality improvement with OpenRefine 2016 Int. Conf. on Information Technology Systems and Innovation (ICITSI) pp 1–6
[16] Regular Expressions – RAD Studio. Available at: http://docwiki.embarcadero.com/RADStudio/Tokyo/en/Regular_Expressions (accessed: 20.06.2019)
[17] Feoktistov A, Gorsky S, Sidorov I and Tchernykh A 2019 Continuous integration in distributed applied software packages Proc. of the 42nd Int. Conv. on Information and Communication Technology, Electronics and Microelectronics (MIPRO-2019) pp 1775–1780
[18] Feoktistov A, Gorsky S, Sidorov I, Kostromin R, Edelev A and Massel L 2019 Orlando Tools: Energy research application development through convergence of grid and cloud computing Communications in Computer and Information Science 965 289–300
[19] Firebird: The true open source database for Windows, Linux, Mac OS X and more. Available at: https://firebirdsql.org/ (accessed: 21.06.2019)
[20] Bychkov I, Oparin G, Tchernykh A, Feoktistov A, Bogdanova V and Gorsky S 2017 Conceptual model of problem-oriented heterogeneous distributed computing environment with multi-agent management Procedia Computer Science 103 162–167
[21] Import and export data using SQL Server Import and Export Wizard - SQL Server Integration Services (SSIS). Available at: https://docs.microsoft.com/ru-ru/sql/integration-services/import-export-data/import-and-export-data-with-the-sql-server-import-and-export-wizard?view=sql-server-2017 (accessed: 18.06.2019)