                         Approaches to Organization of Change Tables for Real-
                         Time Business Intelligence
                         Anastasiia Morozova, Liliia Bielova and Ievgen Meniailov

                         V.N. Karazin Kharkiv National University, 4 Svobody Sq., Kharkiv, 61022, Ukraine

                                         Abstract
                                         This paper provides an in-depth analysis of the methodologies for organizing change tables in real-
                                         time Business Intelligence (BI), a critical aspect of modern, data-driven decision-making processes.
                                         The primary focus is on the seamless integration of Online Transaction Processing (OLTP) and Online
                                         Analytical Processing (OLAP) systems, emphasizing utilizing change tables to facilitate this integration.
                                         The study meticulously examines various strategies for creating and maintaining change tables,
                                         including using triggers, asynchronous triggers, and log data capture methods. Comprehensive
                                         evaluations of the advantages and drawbacks of each approach are presented, offering insights into
                                         their operational efficiencies and constraints within different database environments. This paper
                                         extends its analysis to empirical investigations, employing real-world database management solutions
                                         from industry leaders like Oracle and MS SQL Server. Through this empirical lens, the paper elucidates
                                         the practical implications of each method, thereby bridging the gap between theoretical understanding
                                         and real-world application. The outcome of this research is a nuanced understanding of change table
                                         management in real-time BI systems, providing valuable guidelines for businesses and practitioners
                                         looking to optimize their data-handling capabilities in rapidly changing market conditions. This study
                                         not only advances the academic discourse in database management but also serves as a pragmatic
                                         guide for implementing effective change table strategies in real-time BI scenarios.

                                         Keywords
BI, OLTP, OLAP, ETL script, DML, Data Capture, change tables, trigger, asynchronous trigger, database
log, Log-based Change Data Capture, GoldenGate, LogMiner


                         1. Introduction
Now is a time of rapidly developing technologies and business processes. That is why business
owners today want to know what is happening now. Software systems must allow reacting to
events in real time, as they occur, and not a minute or an hour later [1]. Organizations are
revamping and automating processes for faster financial reporting, operational intelligence, and
performance management [2]. One of the ways to achieve such a result is real-time Business
Intelligence (BI). BI is a highly relevant field nowadays [3]. The primary purpose of BI is to
improve the quality of decisions while decreasing the time it takes to make them. Fresh
information can support fast-paced, time-sensitive business processes, such as operational BI,
real-time management dashboards, just-in-time inventory, high-yield manufacturing, facility
monitoring, call center information delivery, self-service information portals, recommendations
in e-commerce, and so on [4]. These capabilities explain the widespread use of BI.
   What about real-time BI, and why is real time so important? Real-time software solutions
analyze and monitor business processes to give a wide range of users the real-time visibility
they need to spot an opportunity or problem, make a fully informed decision, and then follow
the more favorable course of action.


                         ProfIT AI 2023: 3rd International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2023), November
                         20–22, 2023, Waterloo, Canada
                             a.morozova@karazin.ua (A. Morozova); l.belova@karazin.ua (L. Bielova); evgenii.menyailov@gmail.com
                         (I. Meniailov)
                              0000-0003-2143-7992 (A. Morozova); 0009-0007-0805-4547 (L. Bielova); 0000-0002-9440-8378 (I. Meniailov)
                                  © 2023 Copyright for this paper by its authors.
                                  Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                                  CEUR Workshop Proceedings (CEUR-WS.org)



   Why is there a need to adapt a classical OLAP system (star or snowflake schema) as a
component of a business solution to work in real-time mode? This is due to the scope of OLAP.
Chains of stores, airline services, railway and other transport systems, high-load banking
systems, and financial markets and exchanges are key economic segments using OLAP in their
work [5]. Under conditions of high market saturation and fast-growing data, it becomes
necessary to adapt the solution to real-time mode in order to make promising economic
decisions and increase the working efficiency of the business solution.
   For example, Continental Airlines, a popular high-load international airline service, made a
$30M investment in hardware, software, and personnel to generate over $500M in revenue
enhancements and cost savings, resulting in an ROI of over 1,000 percent [3].
   Key aims of the numerous applications of real-time BI include [6]:
   •     Understand customer behavior in real time across multiple channels, such as Web,
   mobile, social, and enterprise applications, and consequently improve the customer
   experience as it is happening.
   •     Evaluate sales performance in real time.
   •     Detect products recurring in abandoned shopping carts on an e-commerce website, and
   start a promotion to close more sales of those products before interest in them wanes.
   •     Prevent potentially fraudulent activity as it is being perpetrated. Identify and address a
   new social media sentiment or pattern.
   •     Improve accuracy, efficiency, and customer service in the logistics sphere.
   •     Monitor the performance of interconnected infrastructures such as computer networks
   and manufacturing facilities.
   Thus, adapting a classical OLAP system to real-time BI is a pressing problem in the world of
high-speed information technologies and fast-growing economic and business processes. This
paper does not offer an innovative solution for organizing data capture processes in a classical
OLAP system; it reviews and analyzes existing approaches to OLAP organization in the
real-time BI context.

2. OLTP and OLAP integration
Let us move to On-line Transaction Processing (OLTP) and On-line Analytical Processing
(OLAP) integration. OLTP environments use database technology to transact and query against
data and to support the daily operational needs of the business enterprise. OLTP is
characterized by a large number of short on-line modification transactions. The main emphasis
for OLTP systems is on very fast query processing, maintaining data integrity in multi-access
environments, and effectiveness measured in transactions per second.
   OLAP is one of the BI tools. OLAP environments use database technology to support
analysis and mining of long data horizons, and to provide decision makers with a platform
from which to generate decision-making information [7].
   OLTP systems are designed to quickly process individual sequences of transactions, while BI
systems must have the data organized in a dimensional model to support querying, analysis,
and “slicing and dicing” the data.
   Figure 1 illustrates the interaction between OLTP and OLAP.
Figure 1: OLTP and OLAP integration

   What enables integration between OLTP and OLAP? The answer is ETL scripts. ETL
denotes three data management stages: extraction, transformation, and loading. ETL tools are
used to:
   •     Extract data from homogeneous or heterogeneous data sources.
   •     Transform the data into the proper format or structure for querying and analysis
   purposes.
   •     Load it into the final target (a database; more specifically, an operational data store,
   data mart, or data warehouse).
   OLAP technology, as the part of a BI system developed to enable fast extraction of requisite
data, works over the data warehouse, which is populated from the OLTP database through ETL
tools. OLAP makes it possible to carry out data modeling, perform business planning and
forecasting, adopt budgeting, and draw up financial reports. Owing to the system, a user is able
to conduct ad hoc analysis and apply ‘what if’ scenarios.
   As for the extract process, we distinguish three main types of extraction: full extract,
incremental extract, and update notification [8].
   Full extract is the most traditional approach. Performance is a key factor of system
processing: as the demand for information grows, the load on your systems grows with it. If a
report takes 5 minutes to run, that puts 5 minutes of heavy load on your source systems,
compromising any other users who are working at the time. Such an approach does not allow
an ETL script to run in real-time mode, because it can disrupt the functioning of a highly
loaded system.
   And how does the incremental approach work? At a specific point in time, only the data
that has changed since a well-defined event back in history is extracted. This event may be the
last time of extraction or a more complex business event, like the last booking day of a fiscal
period. To identify this delta, there must be a way to identify all the information changed since
this specific time event. Incremental extraction suggests using subsidiary change tables.
   As can be seen from the schema, the extraction process uses two groups of tables: source
tables and change tables. Why is such a table organization necessary? The answer is processing
speed. Imagine that the source tables contain 2 million records and only 100 new ones are
added. With full extraction, the ETL script processes all the records in the data storage,
whereas change tables allow incremental extraction to read only the 100 new records. Thus,
the change-table approach significantly increases productivity.
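As an illustration, an incremental extract over a change table can be sketched as follows. The table, column, and bind-variable names here are hypothetical, introduced only for this sketch; they are not the ones used in the tests later in this paper.

```sql
-- Hypothetical change table: the source table's key plus audit columns.
CREATE TABLE orders_change (
    order_id    INT       NOT NULL,  -- primary key of the changed source row
    operation   CHAR(1)   NOT NULL,  -- 'I' = insert, 'U' = update, 'D' = delete
    changed_at  TIMESTAMP NOT NULL   -- when the change was recorded
);

-- Incremental extract: read only the rows changed since the last run,
-- instead of scanning the full multi-million-row source table.
SELECT c.order_id, c.operation, c.changed_at
FROM   orders_change c
WHERE  c.changed_at > :last_extraction_time;
```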
   Given the efficiency of change tables, different approaches to the organization of change
tables are considered and analyzed below.

3. ETL Process Adaptation for Real-Time BI
This adaptation is based on integrating and tracking changes in the OLTP database. The
schema of the ETL process adaptation for real-time BI is shown in Figure 2. These changes are
saved as change tables.
   The independence of all the structures is based on their organization. For instance, all of the
connectors are presented in the form of interfaces, so their implementations can vary for
different databases and storage systems.




Figure 2: ETL process adaptation for real-time BI


4. Approaches to processing change tables
There are several different approaches to extracting data into change tables. To start, let us
focus on the trigger-based solution.

   4.1. Trigger

There is a significant segment of extraction solutions based on triggers. This is due to the fact
that trigger-based extraction is simple enough to implement and does not require any changes
to the application in the source system. A trigger is a special kind of procedural code
programmed to fire when a data manipulation language (DML) event occurs in the database
server. DML events include UPDATE, INSERT, or DELETE statements issued against a table [9].
There are three types of DML triggers for auditing information changes: before-data, after-data,
and schedule. As the names suggest, a before-data trigger fires before the data set is executed,
while an after-data trigger captures information after the data engine executes all data sets.
These types of trigger check for an event and run the associated code when the event occurs,
which allows real-time processing [10]. At the same time, a schedule trigger does not provide
such flexibility, because it is executed at the time a report job is scheduled to run. As can be
seen, the trigger approach is uncomplicated, reasonable, and understandable. The universality
and autonomy of the trigger technique are worth noting:
    •    No additional special software or license is necessary, because any database
    management system includes a trigger facility;
    •    The data capture strategy is simple to modify;
    •    It is possible to create a separate trigger for each of the existing operations;
    •    Introduction into a pre-existing functioning system is uncomplicated.
    And what disadvantages does this solution have? The trigger technique can affect
performance on the source system, and this impact must be carefully considered before
implementation on a production source system. To perform data capture, database triggers
write to shadow tables. A shadow table may store the entire row to keep track of every single
column change, or only the primary key together with the operation type (insert, update, or
delete). However, a few challenges arising in the data capture process should be noted:
   •    Firing the trigger and storing the row changes in a shadow table introduces overhead.
   •    In an extreme case, CDC may introduce 100% overhead on the transaction, i.e., instead
   of 0.1 seconds it may take 0.2 seconds to complete a transaction. At the same time, it must
   be emphasized that transferring data to change tables is also a costly operation.
   •    If changes are made to the tables, the triggers and shadow tables may also have to be
   modified, recreated, and/or recompiled, which introduces extra overhead to manage and
   maintain the database.
   •    Triggers are inefficient at capturing updates. For example, you may be interested in
   changes to one column of a table, but a given record is modified in another column. The
   trigger fires on any change to the record, so data capture is performed for every update. Just
   think about it: 500 records whose attributes are of no interest to you are updated and then
   captured. This decreases the efficiency of the system.
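A minimal sketch of the shadow-table pattern described above, in SQL Server T-SQL syntax. The table and trigger names are illustrative, not the ones used in our tests, and this variant stores only the primary key and the operation type:

```sql
-- Shadow table storing only the primary key and the operation type.
CREATE TABLE customer_shadow (
    customer_id  INT      NOT NULL,
    op_type      CHAR(1)  NOT NULL,           -- 'I', 'U' or 'D'
    changed_at   DATETIME NOT NULL DEFAULT GETDATE()
);
GO

-- After-data trigger: fires once the engine has applied the UPDATE.
CREATE TRIGGER trg_customer_update
ON customer
AFTER UPDATE
AS
BEGIN
    INSERT INTO customer_shadow (customer_id, op_type)
    SELECT customer_id, 'U' FROM inserted;    -- 'inserted' holds the new row images
END;
```

Note that this trigger exhibits exactly the inefficiency listed above: it fires on every update to a row, even when the modified columns are of no interest to the capture process.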

   4.2. Asynchronous trigger
The distinctive property of a synchronous trigger is that the original DML statement that
caused the trigger to fire is not released until the trigger itself completes. If the trigger
implements some complex logic, it can take quite a while to execute, and such a trigger can
cause a delay even for simple DML operations. That is why asynchronous execution is needed.
The operation of an asynchronous trigger is specific and more complex. An asynchronous
trigger does not actually perform the requested work. Instead, it creates a message that
contains information about the work to be done, sends this message to a service that performs
the work, and then returns. The program implementing the service performs the work in a
separate transaction, so the original transaction can commit immediately. The application thus
avoids the system slowdowns that would result from keeping the original transaction open
while the work is performed. It must be emphasized that an asynchronous trigger offers more
capabilities than a synchronous one:
    •    It supports not only DML capture events but also data definition language (DDL)
    events;
    •    The capture level expands to global and schema scope;
    •    The complexity of the rules associated with the capture does not degrade system
    performance, so the rules can be either simple or complex.
   As for the capture processes of asynchronous triggers, they have some advanced abilities
that are not available with synchronous ones: combined and downstream capture. The key
disadvantage of the approach is that only the enterprise edition provides asynchronous
triggers, so the solution is not publicly available.

   4.3. Log-based Change Data Capture
   Transactional databases maintain a transaction log that stores all changes in order to
recover the committed state of the database should the database crash for whatever reason.
Log-based Change Data Capture (CDC) takes advantage of this aspect of the transactional
database to read the changes from the log. To understand the principle of the log-based CDC
approach, look at Figure 3.
Figure 3: Log-based Change Data Capture

   What are the benefits of this approach? Among the most significant:
   1. Minimal performance impact on the master database, which allows its use in systems
   with extremely high transaction volumes;
   2. No changes to the tables or the application are required;
   3. The asynchronous nature of CDC: changes are captured independently of the source
   application performing them.
   There are also several characteristics that make the log-based CDC approach quite
challenging to use:
   1. Transaction log formats are proprietary to each database vendor; there are no
   documented standards on how the changes are stored, and interpreting the changes in the
   transaction log is difficult;
   2. Many database vendors lack sufficient documentation because they consider the
   transaction log format to be internal;
   3. Database vendors may not provide an interface to the transaction logs, documented or
   not, and even if there is one, it may be relatively slow and/or resource-intensive;
   4. Most databases have been optimized to use only internal identifiers to recover database
   row changes, which is insufficient to perform data capture and record the changes on a
   different system;
   5. Most, if not all, transaction log formats are binary.

5. Results of Testing
All the suggested approaches to organizing change tables have been tested. This section
contains some scripts and testing results. First, we describe the test environment: the tests
were run on a workstation with 64 GB RAM and 8 CPUs. We created a test table and ran the
tests on two DBMSs: SQL Server and Oracle. The test table contains simple data types such as
int, a spatial data type, and a LOB data type.
   We reproduced an OLTP load with short DML operations (INSERT/DELETE/UPDATE) and
measured performance before and after implementing the corresponding approaches.
   5.1. Step 1. Preliminary Test

   First, we ran the DML tests and measured performance without any changes. The results
are shown in Table 1.

Table 1
Preliminary DML test results.
 Number of records               Time in sec (SQL Server)        Time in sec (Oracle)
 100,000                         20                              22
 200,000                         37                              51
 1,000,000                       192                             215


   5.2. Step 2. Synchronous Trigger

   This subsection contains the results of using a synchronous trigger to populate change
tables. First, we need to create a change table for every table we want to monitor. This table
can have the same structure as the source table, or contain fewer or extra columns. For our
tests, we created a change table that has the same structure as the corresponding source table
plus two extra columns: the code of the DML operation (1 for DELETE, 2 for INSERT, 3 and 4
for UPDATE) and the time of the change.
   At the next step, we created a DML trigger to capture DML changes from the source table
into the change table.
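A sketch of this setup in SQL Server T-SQL. The table names are illustrative (the actual test scripts are omitted), a single `payload` column stands in for the source columns, and we interpret codes 3 and 4 as the old and new images of an update, which is one plausible reading of the scheme above:

```sql
-- Change table: same columns as the source table plus two audit columns.
CREATE TABLE test_table_change (
    id          INT NOT NULL,
    payload     INT NULL,                     -- stands in for the source columns
    dml_code    TINYINT NOT NULL,             -- 1 DELETE, 2 INSERT, 3/4 UPDATE (old/new)
    changed_at  DATETIME NOT NULL DEFAULT GETDATE()
);
GO

CREATE TRIGGER trg_test_capture
ON test_table
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- Deleted row images: code 1 for a pure DELETE, 3 for the old image of an UPDATE.
    INSERT INTO test_table_change (id, payload, dml_code)
    SELECT d.id, d.payload,
           CASE WHEN EXISTS (SELECT 1 FROM inserted) THEN 3 ELSE 1 END
    FROM deleted d;

    -- Inserted row images: code 2 for a pure INSERT, 4 for the new image of an UPDATE.
    INSERT INTO test_table_change (id, payload, dml_code)
    SELECT i.id, i.payload,
           CASE WHEN EXISTS (SELECT 1 FROM deleted) THEN 4 ELSE 2 END
    FROM inserted i;
END;
```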
   We reproduced the same workload as in Step 1. The results are shown in Table 2.

Table 2
Synchronous Trigger results.
 Number of records               Time in sec (SQL Server)        Time in sec (Oracle)
 100,000                         42                              46
 200,000                         71                              64
 1,000,000                       360                             478


   5.3. Step 3. Autonomous Trigger

    This subsection describes approach to use Autonomous Trigger for gathering information
into change tables. Oracle and SQL Server have difference approach to creation of autonomous
trigger.
    Oracle uses AUTONOMOUS_TRANSACTION pragma inside code of trigger. The
AUTONOMOUS_ TRANSACTION pragma changes the way a subprogram works within a
transaction. A sub- program marked with this pragma can do SQL operations and commit or roll
back those operations, without committing or rolling back the data in the main transaction [11].
To create autonomous trigger for Oracle, just add PRAGMA AUTONOMOUS_TRANSACTION; as a
first statement inside trigger body.
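A minimal Oracle PL/SQL sketch of such a trigger. The table names and DML codes are the illustrative ones used throughout this section, not our actual test scripts:

```sql
CREATE OR REPLACE TRIGGER trg_test_capture_async
AFTER INSERT OR UPDATE OR DELETE ON test_table
FOR EACH ROW
DECLARE
    PRAGMA AUTONOMOUS_TRANSACTION;  -- run the trigger body in its own transaction
BEGIN
    INSERT INTO test_table_change (id, dml_code, changed_at)
    VALUES (COALESCE(:NEW.id, :OLD.id),
            CASE WHEN INSERTING THEN 2
                 WHEN DELETING  THEN 1
                 ELSE 4 END,
            SYSTIMESTAMP);
    COMMIT;  -- permitted here: commits only the autonomous transaction
END;
```

Because the capture runs in an autonomous transaction, a change row may remain recorded even if the main transaction later rolls back; this is a known trade-off of the pragma-based approach.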
    The autonomous trigger for SQL Server is based on Service Broker. First, we create the
Service Broker objects: a message type, response and request queues, a receiving service, a
sending service, a procedure that sends items to the queue for the asynchronous trigger, and a
procedure that processes the items and adds the modified data into the change table. We are
not going to list all the scripts that create the Service Broker objects; we just show the results.
The results of testing the autonomous trigger are shown in Table 3.
Table 3
Autonomous Trigger results.
 Number of records                Time in sec (SQL Server)         Time in sec (Oracle)
 100,000                           96                               54
 200,000                           190                              112
 1,000,000                         900                              558


   5.4. Step 4. Log Data Capture

   This subsection contains the results of the last approach we tested. Log Data Capture
approach is based on reading changes from database log file. To track data changes SQL Server
provides Change Data Capture (CDC) feature. Change Data Capture is deprecated in Oracle
Database 12c, so we haven’t tested it.
   For SQL Server it took 230 sec for INSERT/DELETE/UPDATE 1,000,000 rows. CDC runs
background process to collect changes into "change table". It took about 18 min to finish this
background process. CDC in SQL Server has some peculiar properties for collecting LOB data
types such geometry, geography, varbinary(max), varchar(max) or nvarchar(max). Also you can
configure Capture and Cleanup jobs to change performance of CDC. This paper omit all this
configurations and uses its default values.
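For reference, enabling CDC in SQL Server takes two documented system procedures; the schema and table names below are the illustrative ones from earlier in this section:

```sql
-- Enable CDC at the database level (creates the cdc schema and metadata).
EXEC sys.sp_cdc_enable_db;
GO

-- Enable CDC for one table; SQL Server creates the change table and the
-- capture/cleanup jobs automatically.
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'test_table',
    @role_name     = NULL;        -- no gating role required to read the changes
GO
```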
   Unfortunately, most Oracle solutions are available only under a paid license, which is why
they were not tested.
   Oracle provides a log data capture solution named GoldenGate. Its modular architecture
gives the flexibility to extract and replicate selected data records, transaction changes, and
changes to DDL across a variety of topologies. For this reason, Oracle GoldenGate can support
numerous business requirements, such as business continuance and high availability, initial
load and database migration, and data integration and decision support.
   It should be noted that there is a more powerful log-based tool developed by Oracle:
LogMiner. LogMiner directly accesses the Oracle redo logs, which are complete records of all
activity performed on the database; that is why this solution provides more advanced
functionality. LogMiner supports such important and flexible capabilities as database recovery
operations (even fine-grained recovery at the transaction level), auditing of DML statements,
and determining the order in which transactions were committed. These features make it
possible to improve system efficiency, roll back logical data corruptions or user errors, and
tune performance and capacity planning through trend analysis.

6. Extract Transform phase
This section presents the structure of the Extract-Transform phase of the ETL process adapted
for real time. The purpose of the developed algorithm is primarily the preparation and
transmission of information using ETL technology. The data received by the system is
extracted from a relational database regardless of its particular structure and features. In this
way, we ensure the flexibility of the system to be expanded, used, and implemented in different
environments without being tied to a corporate identity.
   Based on the analyzed data, it was decided that the most complex and interesting process,
requiring detailed improvement and the development of a new architecture, is extracting data
from storage, converting it into a specialized form, and sending it to a distributed message
queue.
   At the moment, there are many paid platforms that implement data analysis, in which ETL
technology is only one component. However, they do not allow the transfer of information and
do not provide access to verify the architecture of their applications, so leaks of corporate data
may well occur.
   To track incremental data, it is advantageous to take the CDC architecture as a basis.
   The first step of our algorithm is to create a flow that organizes the distribution of the
extraction, processing, and transportation of data across tables.
   After that, threads are created to implement these actions. Each of them first receives the
changed information.
   The next step is to convert the data into the format that best suits the distributed message
queue.
   From the beginning, all source tables are placed in a queue synchronized across all threads.
Before processing, a table is taken out of the queue and the number of busy threads increases.
After all operations and data transportation, the table is added to the end of the queue, freeing
up space for new processing. This method was chosen because the processing times of tables of
different sizes are not the same, which would otherwise complicate the real-time operation of
the system. The process is endless, as the analytical platform needs to constantly receive fresh
data to make the most effective recommendations for tactical changes. An important feature is
storing the current transaction that has been delivered to the system: in the event of
unexpected platform disruptions or errors, the system needs such a repository to allow instant
resumption. The Sequence Diagram and Class Diagram of the suggested approach are shown in
Figure 4 and Figure 5.
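The queue-based scheduling described above can be sketched as follows. This is a simplified, bounded Python version: the real process is endless, and the `extract_fn` callback is a hypothetical stand-in for reading a change table, transforming the delta, and publishing it to the distributed message queue.

```python
import queue
import threading


def run_extract_round(tables, extract_fn, num_workers=4, rounds=1):
    """Distribute per-table extract/transform work across worker threads.

    Tables are held in a shared queue; a worker takes a table, processes it,
    and (in the real system) re-enqueues it so processing never stops. Here
    `rounds` bounds the loop so the sketch terminates.
    """
    work = queue.Queue()
    for _ in range(rounds):
        for t in tables:
            work.put(t)

    results = []
    lock = threading.Lock()  # guards the shared results list

    def worker():
        while True:
            try:
                table = work.get_nowait()  # take the next table off the queue
            except queue.Empty:
                return                     # nothing left in this round
            payload = extract_fn(table)    # read the delta, format for the queue
            with lock:
                results.append((table, payload))
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Because slow tables simply stay in a worker longer while other workers keep draining the queue, tables of different sizes do not block one another, which is the property the text above relies on.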

Conclusion
We tested different solutions for getting incremental data into change tables for two database
vendors (SQL Server and Oracle). All the suggested approaches add overhead to DML
operations. Table 4 summarizes the results of all tested approaches for 1,000,000 rows.
   Change data capture using database triggers lowers the overhead of extracting the changes
but increases the overhead of recording them. The disadvantage is the limited scalability and
performance of trigger procedures, making them optimal for use cases with light to medium
loads. It should be noted that system performance is also affected by the complexity of the
capture rules in the trigger approach. Although log-based CDC is generally considered the
superior approach to data capture, applicable to all possible scenarios including systems with
extremely high transaction volumes, it is quite complex to implement because of the lack of
documented standards. Thus, it is impossible to single out one best approach suitable for all
cases. To choose the right data capture approach, you need to consider the significant
characteristics of the system that influence its performance.

Table 4
Summary of results for 1,000,000 rows.
 Approach                        Time in sec (SQL Server)         Time in sec (Oracle)
 no changes                       192                              215
 synchronous trigger              360                              478
 autonomous trigger               900                              558
 CDC                              230                              -
Figure 4: Sequence diagram




Figure 5: System class diagram
References
  [1] Dotsenko, N., Chumachenko, D., Chumachenko, I., Modeling of the processes of
      stakeholder involvement in command management in a multi-project environment,
      International Scientific and Technical Conference on Computer Sciences and
      Information Technologies, 2018, vol. 1, pp. 29-32. doi: 10.1109/STC-CSIT.2018.8526613
  [2] Bazilevych, K., et al., Stochastic modelling of cash flow for personal insurance fund using
      the cloud data storage, International Journal of Computing, 2018, 17 (3), pp. 153-162.
  [3] Shabanov, D., et al., Simulation as a Method for Asymptotic System Behavior
      Identification (e.g. Water Frog Hemiclonal Population Systems), Communications in
      Computer and Information Science, vol. 1175, 2020, pp. 392-414. doi: 10.1007/978-3-
      030-39459-2_18
  [4] Dhaouadi A, Bousselmi K, Gammoudi MM, Monnet S, Hammoudi S., Data Warehousing
      Process Modeling from Classical Approaches to New Trends: Main Features and
      Comparisons. Data. 2022; 7(8):113. doi: 10.3390/data7080113
  [5] Dotsenko, N., Chumachenko, D., Chumachenko, I., Project-oriented management of
      adaptive teams’ formation resources in multi-project environment, CEUR Workshop
      Proceedings, 2019, 2353, pp. 911-923.
  [6] Luo, X.(R). and Chang, F.-K., Toward a design theory of strategic enterprise management
      business intelligence (SEMBI) capability maturity model, Journal of Electronic Business
      & Digital Economics, 2023. doi: 10.1108/JEBDE-11-2022-0041
  [7] Naseema Shaikl, Dr. Wali Ullah, Dr. G. Pradeepni, OLAP Mining Rules: Association of
      OLAP with Data Mining, American Journal of Engineering Research (AJER), Volume 5,
      Issue 2, 2016, pp. 237–240.
  [8] Seenivasan,      Dhamotharan,       ETL     for    Data     Warehousing,     2023.    URL:
      https://www.researchgate.net/publication/368300555_ETL_for_Data_Warehousing
  [9] Audit SQL databases using a SQL Server trigger tool, 2020. URL:
      https://solutioncenter.apexsql.com/audit-sql-databases-using-the-sql-server-trigger-tool.
  [10] Chumachenko, D., On intelligent multiagent approach to viral Hepatitis B epidemic
        processes simulation, Proceedings of the 2018 IEEE 2nd International Conference on
        Data Stream Mining and Processing, DSMP 2018, 2018, pp. 415-419.
        doi: 10.1109/DSMP.2018.8478602
  [11] Oracle. Oracle Database Documentation, 2018. URL: https://docs.oracle.com/en/
        database/oracle/oracle-database/index.html
  [12] David Arnott, Felix Lizama, Yutong Song, Patterns of business intelligence systems use
        in organizations, Decision Support Systems, Volume 97, 2017, pp. 58-68.
        doi: 10.1016/j.dss.2017.03.005