=Paper=
{{Paper
|id=Vol-3816/paper53
|storemode=property
|title=An Integrated Approach Using Ontologies, Knowledge Graphs, Machine Learning, and Rules Models for Synthetic Financial Time Series Generation
|pdfUrl=https://ceur-ws.org/Vol-3816/paper53.pdf
|volume=Vol-3816
|authors=Laurentiu Vasiliu,Radu Prodan,Ahmet Soylu,Dumitru Roman
|dblpUrl=https://dblp.org/rec/conf/rulemlrr/VasiliuPSR24
}}
==An Integrated Approach Using Ontologies, Knowledge Graphs, Machine Learning, and Rules Models for Synthetic Financial Time Series Generation==
Laurentiu Vasiliu1*, Radu Prodan2, Ahmet Soylu3, and Dumitru Roman3,4
1 Peracton Ltd. DHKN Galway Financial Services Centre, Moneenageisha Rd, Galway, H91 V2R6, Ireland
2 Institute of Information Technology, University of Klagenfurt, Universitätsstraße 65-67, A-9020 Klagenfurt am
Wörthersee, Austria
3 Kristiania University College, Oslo, Norway
4 SINTEF AS, Forskningsveien 1, 0373 Oslo, Norway
Abstract
In the Graph-Massivizer EU project, the financial use case is focused on generating synthetic financial
time series in extreme volumes (PB) for advanced testing and training of financial (investment and
trading) algorithms. Our key approach integrates ontology-based, graph-based, and rule-based models,
leveraging the strengths of all three technologies. Ontologies are employed to capture the detailed
properties of financial time series data, graph models are used to generate synthetic data, and financial-
related rules are applied to ensure the desired quality and statistical properties of the synthetic data.
Keywords
Ontologies, synthetic data, financial time series, machine learning
1. Introduction
Synthetic data refers to artificially generated datasets that are specifically designed to replicate
the characteristics of real-world data—in this case, financial time series. It has become a viable
solution for powering quantitative analysis and back-testing of financial models, serving as a
good alternative to historical data. The demand for synthetic data has increased due to the
growing complexity of financial models and algorithms, driven by data-intensive machine
learning (ML) models. These models often face limitations with real historical datasets, such as
capped volumes, incomplete data, high costs or irrelevance when dealing with much older data.
The key advantage of synthetic data lies in its ability to capture the statistical properties of real-
world markets while maintaining a completely artificial nature, enabling intensive testing before
financial models and algorithms are validated on real-time financial data and with live money.
The Graph-Massivizer (G-M) project [1] is developing a software platform capable of processing
extreme volumes of data. One of the project's use cases is focused on generating synthetic data
in extreme volumes that closely match the quality and characteristics of historical data samples
of stocks and futures commodities. The approach is centered around knowledge graphs (KGs)
[3], chosen for their ability to capture, store, and represent historical financial time series. The technologies employed are designed to create, process, store, and generate these KGs, which can be further enhanced with ontologies to represent entities and their relationships.

RuleML+RR'24: Companion Proceedings of the 8th International Joint Conference on Rules and Reasoning, September 16-22, 2024, Bucharest, Romania
∗ Corresponding author.
laurentiu.vasiliu@peracton.com (L. Vasiliu); radu.prodan@aau.at (R. Prodan); ahmet.soylu@kristiania.no (A. Soylu); dumitru.roman@sintef.no (D. Roman)
0009-0000-9791-2759 (L. Vasiliu); 0000-0002-8247-5426 (R. Prodan); 0000-0001-6034-4137 (A. Soylu); 0000-0001-6397-3705 (D. Roman)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2. Generating synthetic time series
The financial use case within the G-M project [2] (as described in Figure 1) aims to enhance
algorithmic investment and trading performance in green-focused investments, targeting
improvements such as a 2-4% increase in performance, a Sharpe ratio above 5, and a 1-2% boost
in alpha. This is achieved by utilizing extreme volumes (in petabytes) of synthetic data for testing
and training pre-production financial algorithms. The G-M platform enables the creation of
realistic and cost-effective synthetic financial datasets, unlimited in size and accessibility, while
mitigating issues such as biases, overfitting, and indirect contamination that often accompany
historical data testing.
Figure 1: Financial Use Case, Graph-Massivizer EU project [1]
To generate synthetic data at this scale, the G-M approach involves several steps. Initially,
batches of historical data (totaling 10 terabytes) from company stocks and futures commodities
contracts are mapped to a financial massive graph (F-MG) through a time-series-to-graph
transformation. Next, a synthetic financial massive graph (SF-GM) is created using a generative
model. Finally, this SF-GM is used to produce synthetic financial data in batches, ready to be used
in financial testing and simulations, with a total target output of 1 to 5 petabytes of synthetic
data. The G-M Toolkit, currently under development, is an integrated platform comprising five
specialized tools: Graph-Inceptor, Graph-Scrutinizer, Graph-Optimizer, Graph-Greenifier, and
Graph-Choreographer (Figure 2). These tools perform distinct and critical functions for massive
graph processing, including graph creation, analytics and probabilistic analysis, efficient
execution of operations, energy consumption evaluation, and serverless deployment for on-
demand resource utilization.
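The paper does not detail the specific time-series-to-graph transformation used to build the F-MG. One widely used mapping of this kind is the natural visibility graph, sketched below purely as an illustration of the idea; it is a generic example, not the project's actual implementation.

```python
# Illustrative time-series-to-graph mapping using the natural visibility
# graph: observations (i, y_i) and (j, y_j) are linked if every point
# between them lies strictly below the straight line connecting them.
# This is a generic example, not the Graph-Massivizer F-MG construction.

def visibility_graph(series):
    n = len(series)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Visible iff all intermediate points lie below the i-j line.
            visible = all(
                series[k] < series[i] + (series[j] - series[i]) * (k - i) / (j - i)
                for k in range(i + 1, j)
            )
            if visible:
                edges.add((i, j))
    return edges

prices = [10.0, 10.5, 10.2, 11.0, 10.8]
print(sorted(visibility_graph(prices)))
# → [(0, 1), (1, 2), (1, 3), (2, 3), (3, 4)]
```

Each edge connects time points that "see" each other over the intervening values, so peaks become hub nodes and the series' structure is encoded in the graph topology.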
3. Challenges
Generating meaningful synthetic financial time series [4] involves several key challenges. First,
accurately modeling the original historical financial data is essential. Given the complexities of
financial markets, this requires careful consideration of various critical aspects, such as relevant
financial variables, data clustering, fat tails, and noise, as well as how relationships between
different data types can be extracted using ontologies and reasoning. Additionally, we must
address the heterogeneity of time series data, accounting for their changing statistical
properties—such as mean, variance, and covariance—which are needed for calculating risk and
performance metrics for the targeted financial assets. Second, ensuring the quality of the
synthetic data is very important so it is indistinguishable from the original historical data in
terms of its components, statistical properties, values, and patterns. To achieve this, quality must
be enforced at the moment of data generation through well-defined rules. In the G-M financial
use case, we have chosen to build a proprietary rule engine that encapsulates all relevant quality
rules—covering statistical aspects, patterns, and correlations—to ensure that the synthetic data
possesses the desired properties from the outset.
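The proprietary rule engine itself is not described further in the paper; the sketch below only illustrates the general shape of such quality rules, comparing basic moments of synthetic and historical returns (with excess kurtosis as a fat-tail proxy) against assumed tolerances. The thresholds and rule set are illustrative assumptions, not the project's actual rules.

```python
# Illustrative quality rules for synthetic returns: each rule compares a
# statistical property of the synthetic series against the historical one
# within a relative tolerance. Rules and tolerances are assumptions for
# illustration only, not the G-M proprietary rule engine.

import statistics

def excess_kurtosis(xs):
    """Excess kurtosis as a simple fat-tail indicator (0 for a normal)."""
    m = statistics.fmean(xs)
    var = statistics.pvariance(xs)
    if var == 0:
        return 0.0
    m4 = sum((x - m) ** 4 for x in xs) / len(xs)
    return m4 / var ** 2 - 3.0

def passes_quality_rules(hist, synth, tol=0.25):
    """Return (overall_pass, per-rule report) for the given tolerance."""
    rules = {
        "mean": (statistics.fmean(hist), statistics.fmean(synth)),
        "variance": (statistics.pvariance(hist), statistics.pvariance(synth)),
        "kurtosis": (excess_kurtosis(hist), excess_kurtosis(synth)),
    }
    report = {}
    for name, (h, s) in rules.items():
        scale = max(abs(h), 1e-9)  # avoid division by zero
        report[name] = abs(h - s) / scale <= tol
    return all(report.values()), report

hist = [0.01, -0.02, 0.015, -0.005, 0.03, -0.01]
ok, report = passes_quality_rules(hist, hist)
print(ok, report)
```

In this framing, a synthetic batch is accepted only if every rule passes, which is one way to enforce quality "at the moment of data generation" as described above.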
4. Proof of concept and implementation
Our focus to date has been on several core aspects such as scalable data generation, efficient data
storage and streaming, energy consumption monitoring, parallel processing, large memory
capacity management, data security, and cost optimization to support the anticipated production
demands on the G-M platform. The current financial use-case proof of concept is built on
preliminary implementations of the main tools, as depicted in Figure 2 below, with their
respective data flows highlighted. Their first versions have been uploaded to GitHub [6], where
17 repositories—both private and public—are available for access.
Figure 2: Graph-Massivizer Platform [7]
The Graph-Massivizer platform pipeline [7] involves five sequential steps, each corresponding to the invocation of a specific tool:
1. Graph-Inceptor: This is the first tool in the pipeline, responsible for ingesting the historical financial data and for initializing and managing the storage of massive graphs. It supports three operations [7]:
1. Graph creation – implements basic graph operations (BGOs) to support the ETL process of extracting, transforming, and loading data.
2. Graph modelling – administers graph generators, ontologies, and mapping rules
for the financial use-case dataset.
3. Graph storage – allows access to the storage layer through a virtual KG.
2. Graph-Scrutinizer: Utilizes the ingested graph data for in-depth analysis and reasoning.
It comprises [7]:
1. Graph analytics – focusing on higher-level operations acting on batch and
streaming data.
2. Graph algorithms and querying – provides the BGOs used by data scientists or by the graph analytics group, with both exact and approximate implementations.
3. Graph distillation – provides the building blocks that prepare indices over the massive graph for use by the approximate graph algorithms and querying BGOs.
3. Graph-Optimizer: Maps BGOs (Basic Graph Operations) to the target computing units, considering their properties and the metrics collected from the massive graph. It has the following elements [7]:
1. System model – captures the composition of hardware models, computational units, and their interconnections.
2. Data model – derived through data reduction over the original dataset, allowing an accurate estimation of its processing properties.
3. Workload model – composed of multiple BGOs; captures the actual graph processing metrics and properties.
4. Design-space exploration – combines the above models, uses performance modelling to predict the behavior of different configurations, and iterates over them to find the best solution.
4. Graph-Greenifier: Analyzes graph data and the processing metrics gathered by Graph-
Optimizer to optimize both performance and sustainability. It has the following elements
[7]:
1. Workload-driven simulation toolchain - for modelling the impact of graph
processing.
2. Sustainability predictor - ranks graph processing scenarios by performance, energy efficiency, and sustainability at scale.
3. Monitoring - of the relevant sustainability metrics.
4. Sustainability benchmark - provides run-time energy labels, including information on the energy sources used by data centers.
5. Power grid data interface - automates data gathering on electrical energy supply and price, energy sources, and greenness.
5. Graph-Choreographer: Integrates the graph processing with various infrastructures
and workflows. It has the following elements [7]:
1. Monitoring – continuously checks the lifecycle of incoming and outgoing events
and logs the observations.
2. Graph profiling – obtains essential information about raw input graphs, sampled graphs, and analyzed data, then categorizes the graph BGOs based on the use-case needs and the time and resources required.
3. Resource profiling and partitioning - handles and categorizes the monitored data from nodes, processing cores, memory, storage, network bandwidth, and deployed functions, also accounting for resource diversity and network structures.
4. Function scheduling and provisioning - a heuristic scheduling model defines constraints based on resource utilization and on cloud, fog, and edge resource limitations, provisions nodes to the BGOs to optimize the objective metrics, and then identifies a suitable node on which to deploy the new BGOs.
5. Sustainability analysis – verifies the sustainability evaluation criteria and
instructs the function scheduling and execution engine to make the appropriate
decisions.
6. Execution engine – deploys and runs the BGOs and their libraries on the computing nodes.
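As an illustration of the design-space exploration idea in Graph-Optimizer (combining system, data, and workload models and iterating over candidate configurations), the toy sketch below sweeps node and core counts against a made-up performance model. All models, formulas, and numbers here are assumptions for illustration, not G-M code.

```python
# Toy design-space exploration: predict runtime for each candidate
# (nodes, cores) configuration with a simple analytical model, then
# keep the configuration with the lowest predicted runtime.
# The cost model and its constants are illustrative assumptions.

from itertools import product

def predict_runtime(nodes, cores_per_node, workload_ops, data_gb):
    """Toy performance model: compute time shrinks with parallelism,
    communication overhead grows with the number of nodes."""
    compute = workload_ops / (nodes * cores_per_node)
    communication = 0.05 * data_gb * (nodes - 1)
    return compute + communication

def explore(workload_ops, data_gb, node_options, core_options):
    """Exhaustively sweep the configuration space; return (time, nodes, cores)."""
    best = None
    for nodes, cores in product(node_options, core_options):
        t = predict_runtime(nodes, cores, workload_ops, data_gb)
        if best is None or t < best[0]:
            best = (t, nodes, cores)
    return best

best = explore(workload_ops=1e9, data_gb=100,
               node_options=[1, 2, 4, 8], core_options=[8, 16, 32])
print(best)
```

Real design-space exploration would replace the analytical model with calibrated performance predictions and would add energy and cost objectives, but the iterate-predict-select loop has the same shape.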
Each tool logs its operations internally, providing the toolkit with general inspection, development, and debugging capabilities. For generating synthetic financial data, the complete process flow was divided and implemented as three partial flows:
1. Historic data ingestion (using Graph-Inceptor tool)
2. Synthetic graph generation (using Graph-Scrutinizer and Optimizer tools)
3. Synthetic graph-to-time-series generation (using the Graph-Optimizer tool and the ts2g2 library [6])
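The third flow, graph-to-time-series generation, could for example be realized as a walk over a graph whose nodes carry price levels, emitting the visited values as a synthetic series. The sketch below is a generic illustration under that assumption; it does not reproduce the ts2g2 API or the project's actual generation step.

```python
# Generic graph-to-time-series sketch: perform a weighted random walk over
# a graph whose nodes carry price levels and emit the visited values as a
# synthetic series. Graph structure, weights, and sampling scheme are
# illustrative assumptions, not the ts2g2 implementation.

import random

def graph_to_series(node_values, adjacency, start, length, seed=42):
    """adjacency maps node -> list of (neighbor, weight) pairs."""
    rng = random.Random(seed)  # seeded for reproducibility
    node = start
    series = [node_values[node]]
    for _ in range(length - 1):
        nodes, weights = zip(*adjacency[node])
        node = rng.choices(nodes, weights=weights, k=1)[0]
        series.append(node_values[node])
    return series

values = {0: 100.0, 1: 101.5, 2: 99.8}
adj = {0: [(1, 0.7), (2, 0.3)],
       1: [(0, 0.5), (2, 0.5)],
       2: [(0, 0.6), (1, 0.4)]}
print(graph_to_series(values, adj, start=0, length=6))
```

Because the walk is driven by edge weights learned from historical transitions, longer walks yield arbitrarily long synthetic series whose local transition statistics mirror the source graph.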
This phased approach was chosen for rapid prototyping, concentrating on the tools essential to this use case: Graph-Inceptor, Graph-Scrutinizer, and Graph-Optimizer. While these cover only partial functionality, testing them independently first is an essential step towards a fully integrated synthetic data generation process. The next phase will integrate all tools, including Graph-Greenifier to measure energy consumption during synthetic data generation and Graph-Choreographer to ensure synchronized operation of all components.
5. Future work
The Graph-Massivizer EU project is currently at the midpoint of its development, progressing
steadily towards completion. A first proof-of-concept for generating synthetic time series data in extreme volumes is now being demonstrated. While
the core concepts and approaches have been identified and are in the prototyping phase, the
focus of our future work will shift to the implementation, integration and debugging of the G-M
tools. This will include a continuous effort to enhance the quality of the synthetic data until it
becomes indistinguishable from historical data samples. The synthetic data generated will be
utilized to test green investment algorithms within Peracton's [5] back-testing engine, where their
performance will be compared to their behavior when using historical financial data. The results
from these tests will provide valuable feedback to the G-M platform, guiding further
improvements and consolidation of the tools and methodologies.
Acknowledgements
This project has received funding from the European Union’s Horizon Research
and Innovation Actions under Grant Agreement Nº 101093202 [1].
References
[1] Graph-Massivizer EU Project, EU Horizon Research and Innovation Actions, Grant Agreement No 101093202, 2023. URL: https://graph-massivizer.eu/.
[2] Graph-Massivizer EU Project, Use case 1: green finance, 2023. URL: https://graph-massivizer.eu/project/green-and-sustainable-finance/.
[3] N. Kertkeidkachorn, R. Nararatwong, Z. Xu, R. Ichise, Finkg: A core financial knowledge
graph for financial analysis, in: 2023 IEEE 17th International Conference on Semantic
Computing (ICSC), IEEE, 2023, pp. 90–93.
[4] M. Dogariu, L.-D. Ştefan, B. A. Boteanu, C. Lamba, B. Kim, B. Ionescu, Generation of realistic
synthetic financial time-series, ACM Transactions on Multimedia Computing,
Communications, and Applications (TOMM) 18 (2022) 1–27.
[5] Peracton Ltd. Website, 2024, URL: https://peracton.com.
[6] Graph-Massivizer GitHub repositories, 2024. URL: https://github.com/orgs/graph-massivizer/repositories.
[7] Graph-Massivizer EU Project, D2.1 'Graph-Massivizer Requirements, Elicitation and First Architecture Design', July 2023.