<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Framework for preparing subject data in testing modules of scientific applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>E S Fereferov</string-name>
          <email>fereferov@icc.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A G Feoktistov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>I V Bychkov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Matrosov Institute for System Dynamics and Control Theory of SB RAS</institution>
          ,
          <addr-line>Lermontov St. 134, Irkutsk, Russia, 664033</addr-line>
        </aff>
      </contrib-group>
      <abstract>
<p>The paper addresses the relevant problem of data preparation for testing modules of scientific applications. Such testing requires executing modules multiple times with different parameters for various scenarios of solving problems in applications. Often, data sources for parameters used in problem-solving are subject data (experimental results, reports, statistical forms, and other information resources) created earlier as a result of the functioning of various objects of a subject domain. Usually, such data are heterogeneous and weakly structured. The developer of scientific applications has to make additional efforts in extracting, cleaning, integrating, and formatting data in order to achieve the correctness and efficiency of their use in applications. The aim of the study is the development of a framework for automating the description of semi-structured data and their transformation into target structures used by scientific applications. We propose a conceptual model that allows us to represent knowledge about the structure of the source data, determine their relations with the target structures, and set the rules for data transformation. Additionally, we developed a framework prototype. It is integrated into the technological scheme of continuous integration for modules of scientific applications (distributed applied software packages) that are developed with the help of Orlando Tools. The effectiveness of the prototype is confirmed by the results of experimental analysis.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The paper addresses relevant problems related to preparing data for scientific applications oriented to
high-performance computing. Such applications are nowadays one of the main components in the
process of carrying out large-scale experiments associated with solving complicated scientific and
applied problems. These problems can arise in various spheres of human activity.</p>
      <p>The need to obtain the values of subject-oriented data contained in weakly structured sources
often arises in the process of developing and using scientific applications. Such sources are databases
or files in different formats. Often, they do not conform to the strict structure of tables and
relationships in relational database models.</p>
      <p>Usually, the subject-oriented data are created earlier by subject domain specialists and contain
experimental results, reports, or statistical information. Such data are required to set the initial
parameters of problems in applications. In this regard, application developers are forced to carry out
nontrivial elicitation, refining, transformation, integration, and aggregation of the subject-oriented data
into a specific form of their representation in applications.</p>
      <p>We propose a new framework for marking weakly structured data with reference to a given target
structure that can be an object structure.</p>
      <p>The rest of the paper is structured as follows. In Section 2, we give a brief overview of the known
tools for extracting and transforming data from weakly structured information sources. Section 3
represents a conceptual model of transformation tables. A framework prototype of preparing data for
executing scripts of testing modules in scientific applications is proposed in Section 4. Section 5
shows the results of experimental analysis. Section 6 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Nowadays, the development of new approaches to the integration of heterogeneous data sources
with information and computation systems is a challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Often, this problem is related to
processing big data. In different cases, various approaches are used to solve it. Among them are the
following approaches:
- Creation of tools for extracting data from semi-structured data sources given in a specific
format or a large spectrum of documents;
- Development of tools for converting free-form spreadsheets into a relational data model;
- Ensuring the synthesis of the required integrated data structure based on a set of
heterogeneous source data structures, etc.
      </p>
      <p>As a rule, the extraction and transformation of data are partially automated.</p>
      <p>
        It is known that the TextRunner [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and WebTables [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] systems are focused on extracting data from
web-pages and transforming it into the relational form.
      </p>
      <p>
        Such tools as FlashRelate [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], Foofah [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and Senbazuru [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] support the extraction of data and the
relations between them from spreadsheets.
      </p>
      <p>
        In addition, the TabbyXL system [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] implements the transformation of arbitrary tables into the
relational form based on a set of rules for their analysis and interpretation. MIPS [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which is close to
TabbyXL, performs a similar transformation based on the search for critical table cells.
      </p>
      <p>
        The FlashExtract system [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is a more universal tool. It allows us to extract data from a wider range
of documents, including text files, web pages, and spreadsheets. Data extraction is executed on the
basis of examples provided by users of this system.
      </p>
      <p>It should be noted that the source table model should be known for the effective operation of the
Senbazuru and MIPS systems.</p>
      <p>A specific feature of data preparation for executing testing scripts of modules in scientific
applications is that each module has its own requirements for the format of input parameters. Therefore,
the data in the relational form provided by the above-listed systems is not always convenient and may
require additional transformations.</p>
      <p>
        Currently, there is a wide range of commercial tools for extracting and transforming data from
weakly structured information sources. These include IBM InfoSphere DataStage [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Talend Open
Studio [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Pentaho Data Integration [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Informatica PowerCenter [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], OpenRefine [15], etc.
Usually, these tools are focused on solving Extract, Transform, Load (ETL) problems.
      </p>
      <p>They provide the formation of structured data storages for business intelligence systems (OLAP, OLTP
systems, etc.). However, the use of commercial software products is extremely expensive and not
always available for developers of scientific applications.</p>
      <p>Unlike the aforementioned tools, the proposed framework supports a large spectrum of structures
that can be both relational and object ones. Knowledge about the markup of weakly structured data
is stored in structural specifications that can be reused many times by various transformation
procedures. Data transformation is based on applying special templates reflecting the relations between
semi-structured and target data. Conditions for the use of transformation operations are determined by
a set of productions that can take into account the features of data sources.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Conceptual model of transformation tables</title>
      <p>We consider weakly structured data to be any data intermediate between structured and
unstructured data. As a rule, their structure has uncertainties of various kinds.</p>
      <p>When processing such data, the degree of their correctness is not known in advance. The data
scheme may not exist or may not fully correspond to the processed data. Some data attributes may be
absent or may not fully satisfy the correctness conditions defined for these attributes.</p>
      <p>We propose the new model for describing weakly structured data and the rules for their
transformation into target structured formats. This model has the following structure:
M = 〈F, S, O, Rule, c : A → Φ, m : S → O〉, where the parameters of M are interpreted as follows.</p>
      <p>The parameter F sets the format of a source (file) of weakly structured data: F = {f1, …, fq},
where f1, …, fq are the valid file formats.</p>
      <p>The scheme S of weakly structured data is determined by the set T of tables and the set L of
links between them: S = 〈T, L〉, where T = {t1, …, tn} and L = {l1, …, lm}. The construction of such
a scheme allows us to mitigate the aforementioned uncertainty of the initial weakly structured data.</p>
      <p>The set Ai of attributes and the set Vi of their values are assigned to each i-th table:
ti = 〈Ai, Vi〉, where ti ∈ T. The set Vi is determined by the range and data type: Vi = 〈bs, be, d〉,
where bs and be are the start and end bounds of the range correspondingly.</p>
      <p>We use the following set of data types: d ∈ {int, real, str, datetime, bool, ref}, where int is
the set of integers, real is the set of real numbers, str is the set of string values, datetime is the set of
date and time values, bool = {true, false} is the set of Boolean values, and ref is the reference type
indicating the relation l ∈ L.</p>
      <p>The structure O = 〈Ob, R〉 describes the target structured data, where Ob is a set of target
structure objects and R is the set of references between them. Objects from Ob are characterized by a
name and a set of fields: ob = 〈name, Φ〉. The references from R define relations between objects
from Ob.</p>
      <p>The Rule parameter represents a set of data transformation rules: Rule = {r1, …, rk}, where
r = 〈A′, Φ′, op〉, A′ is a set of attributes from the weakly structured data schema, Φ′ is a set of fields
of target structure objects, and op defines the applied value transformation operations. The operation
op determines the transfer of a value or key value from a related table (if a reference l ∈ L is given),
an operation above attribute values (for example, combining attribute values into one field or splitting
attribute values into several fields), or a string operation above values. An attribute is set by the name
of the table and the attribute in the markup file: a = 〈t, attr〉, where t ∈ T.</p>
      <p>The operation c : A → Φ matches attributes to fields of objects. Conditions for the use of such
operations can be determined by means of productions. The operation m : S → O maps the weakly
structured data to the target data scheme using the rules from Rule.</p>
      <p>The structural metadata-based approach is used to establish the correspondence between the
scheme S of weakly structured data and the structure O that describes a scheme of the target data.
Metadata include data types, table names, attributes, objects, their fields, and the relations between
them. The rules for transforming data from the source scheme to the target structure are based on the
established matches.</p>
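      <p>To make the model concrete, its structures can be sketched in code. The following is a minimal illustration only: the framework itself is implemented with Embarcadero RAD Studio, and all names here (Table, Link, SourceScheme, Rule) are hypothetical.</p>

```python
from dataclasses import dataclass, field
from typing import List

# A table of the weakly structured source: attributes plus a cell range.
@dataclass
class Table:
    name: str
    attributes: List[str]
    start_bound: str           # start of the value range, e.g. "A2"
    end_bound: str             # end of the value range, e.g. "C50"

# A link between two source tables (the reference data type).
@dataclass
class Link:
    from_table: str
    to_table: str

# The source scheme S = <T, L>: tables and links between them.
@dataclass
class SourceScheme:
    tables: List[Table] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)

# A transformation rule mapping source attributes to target object fields.
@dataclass
class Rule:
    source_attributes: List[str]
    target_fields: List[str]
    operation: str             # e.g. "copy", "merge", "split"

scheme = SourceScheme(
    tables=[Table("objects", ["id", "name", "capacity"], "A2", "C50")],
    links=[],
)
rule = Rule(["name", "capacity"], ["label"], "merge")
print(len(scheme.tables), rule.operation)
```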
    </sec>
    <sec id="sec-4">
      <title>4. Framework prototype</title>
      <p>Based on the proposed model, we developed a framework prototype for marking and transforming
data from weakly structured sources into target formats of scientific applications. The framework
prototype includes a markup tool and transformation modules. The markup tool provides the
application developer with the ability to visually customize and specify the process of transforming
data needed to solve a particular class of problems. Transformation modules provide the creation and
translation of data into target structures of specific types. The general scheme of marking and
transforming is shown in figure 1.</p>
      <p>The framework prototype implements loading and marking of spreadsheet files in the CSV and MS
Excel formats. At the markup stage, the package developer visually forms the model of the source
data, specifying table ranges, their attributes, data types, values, and relationships in weakly structured
documents. Next, transformation rules are configured. For example, we can specify a rule for splitting
an attribute into several attributes or assign conditional processing of values using the construction «if
then else». The special package for working with regular expressions, which is a part of Embarcadero
RAD Studio, is used to handle string values [16].</p>
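      <p>As an illustration of such rules (a hedged sketch, not the prototype's actual code, which is built with Embarcadero RAD Studio), splitting an attribute into several fields and conditional value processing could look as follows; the attribute names are hypothetical.</p>

```python
import re

# Split one source attribute "period" of the form "2010-2015"
# into two target fields, "start_year" and "end_year".
def split_period(value: str) -> dict:
    start, end = value.split("-")
    return {"start_year": int(start), "end_year": int(end)}

# Conditional processing in the spirit of the "if then else" construction:
# keep numeric capacity values, treating anything else as zero.
def normalize_capacity(value: str) -> float:
    if re.fullmatch(r"\d+(\.\d+)?", value):
        return float(value)
    else:
        return 0.0

print(split_period("2010-2015"))
print(normalize_capacity("n/a"))
```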
      <p>The description of the initial data scheme can be extended with additional structures to provide the
formation of complex target structures. For example, such an extension can maintain a hierarchical
structure. The developer’s knowledge about the markup and transformation rules is saved to a structural
specification file. Next, the specification can be processed by data transformation modules into
specific target formats or loaded into the markup tool for correction. The application
programming interface (API) for accessing external subsystems is implemented in the markup tool. API
methods provide access to weakly structured data through the structures of the created model. Such a
software interface allows us to support interaction with different transformation modules without
reworking the markup tool.</p>
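      <p>Such an API can be thought of as a thin facade over the created model: transformation modules read marked-up data through it instead of parsing the source file themselves. All names in the sketch below are hypothetical and only illustrate the idea.</p>

```python
# Hypothetical facade over the markup model. A structural specification is
# represented here simply as a mapping from table names to lists of rows.
class MarkupApi:
    def __init__(self, spec: dict):
        self._spec = spec

    def table_names(self) -> list:
        # Names of the tables identified during markup.
        return list(self._spec.keys())

    def rows(self, table: str) -> list:
        # Each row is a dict of attribute name -> value.
        return self._spec[table]

spec = {"objects": [{"id": 1, "name": "Pipeline"}]}
api = MarkupApi(spec)
print(api.table_names())
print(api.rows("objects")[0]["name"])
```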
      <p>Today, the translation module for the XML and JSON formats is implemented as a part of the
framework prototype. This module allows us to create target structures, taking access to the generated
model and data through the API. Schemes of target structures for XML and JSON are set using
templates in the same file formats. In addition to other constructions, such a template contains specific
tags to which values from table fields are passed. These tags have the form 〈#[name] = value(params)〉,
where name is the name of the corresponding table field from which the data will be inserted, value is
a value from the table field or a record counter, and params is an optional parameter set. There is a set
of parameters for each type of data (for example, we can specify the start value of a record counter or
the declination for text values from the table fields).</p>
      <p>The specification created using the markup tool is enough to generate the database. We
implemented a module that provides the generation of relational database schemas based on
specifications and filling them with data through the API. Created structural specifications can be applied
many times in solving typical problems of data extraction and transformation (for example, when
statistical information for different periods is used). In addition, such structural specifications can be
applied to automate the creation of application software systems for working with data of target
structures.</p>
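      <p>The template mechanism can be illustrated with a small sketch: a tag of the form #[field] in an XML template is replaced with the value from a marked-up table row. The exact tag syntax and parameters of the prototype may differ; the field names here are hypothetical.</p>

```python
import re

# XML template whose tags name the table fields to substitute.
TEMPLATE = '<object id="#[id]"><name>#[name]</name></object>'

def render(template: str, row: dict) -> str:
    # Replace each #[field] tag with the corresponding row value.
    return re.sub(r"#\[(\w+)\]", lambda m: str(row[m.group(1)]), template)

row = {"id": 7, "name": "Compressor station"}
print(render(TEMPLATE, row))
```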
      <p>The framework prototype is integrated into the technological scheme of continuous integration for
modules of scientific applications (distributed applied software packages) that are developed with the
help of Orlando Tools [17]. The main tasks of such continuous integration in Orlando Tools are
receiving, storing, and testing versions of package modules, including the preparation and processing
of test data.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental analysis</title>
      <p>The developed framework prototype was used to test the computing modules of the distributed applied
software package (scientific application). This package is used to solve important practical problems
of determining the critical objects of the gas supply system of Russia from the standpoint of energy
security [18].</p>
      <p>In the process of calculations, the package uses the database managed by the Firebird Database
Management Software (DBMS) [19]. The database scheme and the structure of files containing the
values of module parameters are determined by the computational model of the package [20].</p>
      <p>In the experiment, the input data for solving such a problem are files in the MS Excel format with
statistical parameters of objects of the gas industry in Russia for several time periods. Each file
contains 83 parameters of the gas industry objects of Russia in tables located on one sheet. One
parameter matches one attribute of a table. The structure of files with information for various
periods differs in the location of individual attributes. Package modules are tested with each data file.</p>
      <p>Ten tables with more than 2000 records in total were identified in the process of marking
the file with object parameters for the initial period. The subject specialist spent about one man-hour on
the markup of the initial data.</p>
      <p>The number of constructions in the specifications is shown in figure 2. Additionally, it was
necessary to correct the specifications for files with object parameters for other periods; the correction
consisted in adding new constructions. Figure 2 also shows the number of specification corrections for a
file with parameters of the initial period when preparing specifications for files with parameters for
subsequent periods.</p>
      <p>The created specification is applied to automatically translate the source data into the target
structure (XML files) of package module parameters according to the given template. The
development of macros in MS Excel for translating such data into XML files would require about
10 man-hours of programming.</p>
      <p>The created specification has been applied to automatically generate the data structure and fill the
database that is managed by the Firebird DBMS. For comparison, importing tables from MS Excel
using MS SQL Server tools would require calling and setting up translation procedures for each table
[21]. At the same time, information about the relations between the tables is not stored anywhere.</p>
      <p>Comparison of working costs for data transformation is represented in figure 3.</p>
      <p>[Figure 2: the number of changes and the number of new constructions in the specification for the
initial, subsequent, and final periods. Figure 3: comparison of working costs for the MS Excel macros,
the MS SQL Server tools, and the proposed prototype.]</p>
      <p>The results shown in figure 3 demonstrate a significant reduction of data transformation time using the
proposed framework prototype in comparison with the development of MS Excel macros and the
use of the MS SQL Server tools. This is very important since such a data transformation procedure
must be executed repeatedly when developing scientific applications, in which the intensity of module
modifications can reach several times a day.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In the paper, we consider the relevant problem of preparing data for testing modules of scientific
applications. In such applications, testing modules requires executing them multiple times with
different parameters for various scenarios of solving problems. Often, the module parameters are
weakly structured data, which require additional efforts from the application developer to transform
them into formats used by scientific applications.</p>
      <p>We have developed the framework prototype that supports marking, extracting, and transforming
subject data from semi-structured sources. The developed framework prototype allows us to describe
knowledge about the structure of subject data sources and save it in the form of declarative
specifications. This approach is based on applying special templates reflecting the relations
between semi-structured and target data. It is sufficiently flexible because the specifications generated
using these templates contain all the necessary information for solving various problems of transforming
the source data into target formats used by applications.</p>
      <p>The experimental results showed a significant reduction in data transformation time through
applying the proposed framework prototype in comparison with developing MS Excel macros and
using MS SQL Server tools to this end.</p>
      <p>Further study is directly related to extending a spectrum of target structures in transforming the
semi-structured source data. In addition, we plan to develop effective algorithms for extracting and
cleaning semi-structured data for new target structures.</p>
      <p>Acknowledgment. The study is supported by the Russian Foundation for Basic Research, project
no. 19-07-00097-a (reg. no. АААА-А19-119062590002-7). This work was also supported in part by
the Basic Research Program of SB RAS, projects no. IV.38.1.1 (reg. no.
АААА-А17-1170322100784) and no. IV.38.1.2 (reg. no. АААА-А17-117032210079-1).</p>
      <p>[15] Kusumasari T F and Fitria 2016 Data profiling for data quality improvement with OpenRefine
2016 Int. Conf. on Information Technology Systems and Innovation (ICITSI) pp 1–6
[16] Regular Expressions – RAD Studio. Available at:
http://docwiki.embarcadero.com/RADStudio/Tokyo/en/Regular_Expressions (accessed: 20.06.2019)
[17] Feoktistov A, Gorsky S, Sidorov I and Tchernykh A 2019 Continuous Integration in Distributed
Applied Software Packages Proc. of the 42nd Int. Conv. on information and communication
technology, electronics and microelectronics (MIPRO-2019) pp 1775–1780
[18] Feoktistov A, Gorsky S, Sidorov I, Kostromin R, Edelev A and Massel L 2019 Orlando Tools:
Energy Research Application Development through Convergence of Grid and Cloud
Computing Communications in Computer and Information Science 965 289–300
[19] Firebird: The true open source database for Windows, Linux, Mac OS X and more. Available
at: https://firebirdsql.org/ (accessed: 21.06.2019)
[20] Bychkov I, Oparin G, Tchernykh A, Feoktistov A, Bogdanova V and Gorsky S 2017
Conceptual Model of Problem-Oriented Heterogeneous Distributed Computing Environment
with Multi-Agent Management Procedia Computer Science 103 162–167
[21] Import and export data using SQL Server Import and Export Wizard – SQL Server Integration
Services (SSIS). Available at:
https://docs.microsoft.com/ru-ru/sql/integration-services/import-export-data/import-and-export-data-with-the-sql-server-import-and-export-wizard?view=sql-server-2017 (accessed: 18.06.2019)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Cafarella</surname>
            <given-names>M J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            <given-names>D Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            <given-names>E</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhang</surname>
            <given-names>Y</given-names>
          </string-name>
          <year>2008</year>
          <article-title>WebTables: exploring the power of tables on the web</article-title>
          <source>Proc. of the VLDB Endowment</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          )
          <fpage>538</fpage>
          -
          <lpage>549</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Banko</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cafarella</surname>
            <given-names>M J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Broadhead</surname>
            <given-names>M</given-names>
          </string-name>
          and
          <string-name>
            <surname>Etzioni</surname>
            <given-names>O.</given-names>
          </string-name>
          <year>2007</year>
          <article-title>Open information extraction for the web</article-title>
          <source>Proc. of the 20th Int. Joint Conf. on Artificial Intelligence</source>
          pp
          <fpage>2670</fpage>
          -
          <lpage>2676</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Halevy</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajaraman</surname>
            <given-names>A</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ordille</surname>
            <given-names>J.</given-names>
          </string-name>
          <year>2006</year>
          <article-title>Data integration: the teenage years</article-title>
          <source>Proc. of the 32nd Int. Conf. on Very large data bases</source>
          pp
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Barowy</surname>
            <given-names>D W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulwani</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hart</surname>
            <given-names>T</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zorn</surname>
            <given-names>B</given-names>
          </string-name>
          <year>2015</year>
          <article-title>FlashRelate: Extracting relational data from semi-structured spreadsheets using examples</article-title>
          <source>ACM SIGPLAN Notices</source>
          <volume>50</volume>
          (
          <issue>6</issue>
          )
          <fpage>218</fpage>
          -
          <lpage>228</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Jin</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderson</surname>
            <given-names>M R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cafarella</surname>
            <given-names>M</given-names>
          </string-name>
          and
          <string-name>
            <surname>Jagadish</surname>
            <given-names>H V</given-names>
          </string-name>
          <year>2017</year>
          <article-title>Foofah: Transforming data by example</article-title>
          <source>Proc. of the ACM Int. Conf. on Management of Data</source>
          pp
          <fpage>683</fpage>
          -
          <lpage>698</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Chen</surname>
            <given-names>Z</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cafarella</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prevo</surname>
            <given-names>D</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhuang</surname>
            <given-names>J</given-names>
          </string-name>
          <year>2013</year>
          <article-title>Senbazuru: a prototype spreadsheet database management system</article-title>
          <source>Proc. of the VLDB Endowment</source>
          <volume>6</volume>
          (
          <issue>12</issue>
          ) pp
          <fpage>1202</fpage>
          -
          <lpage>1205</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Bychkov</surname>
            <given-names>I V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikhalov</surname>
            <given-names>A A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paramonov</surname>
            <given-names>V V</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rugnikov</surname>
            <given-names>G M</given-names>
          </string-name>
          and
          <string-name>
            <surname>Shigarov</surname>
            <given-names>A O</given-names>
          </string-name>
          <year>2017</year>
          <article-title>TabbyXL: The system for transforming data from arbitrary spreadsheets into a relational form</article-title>
          <source>Proc. of the 16th All-Russian Conf. on Distributed Information and Computing Resources (DICR-2017)</source>
          pp
          <fpage>150</fpage>
          -
          <lpage>156</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Shigarov</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khristyuk</surname>
            <given-names>V</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mikhailov</surname>
            <given-names>A</given-names>
          </string-name>
          <year>2019</year>
          <article-title>TabbyXL: Software platform for rule-based spreadsheet data extraction and transformation</article-title>
          <source>SoftwareX</source>
          <volume>10</volume>
          100270
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Embley</surname>
            <given-names>D W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krishnamoorthy</surname>
            <given-names>M S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nagy</surname>
            <given-names>G</given-names>
          </string-name>
          and
          <string-name>
            <surname>Seth</surname>
            <given-names>S</given-names>
          </string-name>
          <year>2016</year>
          <article-title>Converting heterogeneous statistical tables on the web to searchable databases</article-title>
          <source>Int. J. Document Analysis and Recognition</source>
          <volume>19</volume>
          <fpage>119</fpage>
          -
          <lpage>138</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Le</surname>
            <given-names>V</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gulwani</surname>
            <given-names>S</given-names>
          </string-name>
          <year>2014</year>
          <article-title>FlashExtract: A framework for data extraction by examples</article-title>
          <source>ACM SIGPLAN Notices</source>
          <volume>49</volume>
          (
          <issue>6</issue>
          )
          <fpage>542</fpage>
          -
          <lpage>553</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Blokdyk</surname>
            <given-names>G</given-names>
          </string-name>
          <year>2017</year>
          <source>IBM InfoSphere DataStage: The Definitive Guide</source>
          (CreateSpace Independent Publishing Platform) p
          <fpage>120</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Barton</surname>
            <given-names>R D</given-names>
          </string-name>
          <year>2013</year>
          <source>Talend Open Studio Cookbook</source>
          (Packt Publishing) p
          <fpage>270</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Casters</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouman</surname>
            <given-names>R</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dongen</surname>
            <given-names>J</given-names>
          </string-name>
          <year>2010</year>
          <source>Pentaho Kettle Solutions: Building Open Source ETL Solutions with Pentaho Data Integration</source>
          (Wiley) p
          <fpage>720</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Malewar</surname>
            <given-names>R</given-names>
          </string-name>
          <year>2017</year>
          <source>Learning Informatica PowerCenter 10.x</source>
          (Packt Publishing) p
          <fpage>426</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>