<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modeling and Generating Extreme Volumes of Financial Synthetic Time-Series Data with Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laurentiu Vasiliu</string-name>
          <email>laurentiu.vasiliu@peracton.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S. Haleh S. Dizaji</string-name>
          <email>Seyedehhaleh.Seyeddizaji@aau.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aaron Eberhart</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dumitru Roman</string-name>
          <email>Dumitru.Roman@sintef.no</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radu Prodan</string-name>
          <email>radu.prodan@aau.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information Technology, University of Klagenfurt</institution>
          ,
          <addr-line>Universitätsstraße 65-67, A-9020 Klagenfurt am Wörthersee</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Peracton Ltd. DHKN Galway Financial Services Centre</institution>
          ,
          <addr-line>Moneenageisha Rd, Galway, H91 V2R6</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SINTEF AS</institution>
          ,
          <addr-line>Forskningsveien 1, 0373 Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>metaphacts GmbH</institution>
          ,
          <addr-line>36 Daimlerstraße, Walldorf, 69190</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper outlines the approach and technology employed to model and generate extreme volumes of synthetic financial time-series data. We introduce the Graph-Massivizer project and its financial use case, focusing on green sustainable finance. One project objective is to create synthetic financial data in extreme volumes to facilitate advanced testing and simulations of investment and trading algorithms. We then provide an overview of the methodology, detailing the use of ontologies and knowledge graphs. Furthermore, we elaborate on modeling correlations between different markets' time-series and on how these correlations can be combined with graph neural network models to generate financial data. Finally, we present the current implementation status and conclude with a discussion of future work.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge graphs</kwd>
        <kwd>ontologies</kwd>
        <kwd>synthetic data</kwd>
        <kwd>financial time-series</kwd>
        <kwd>extreme data</kwd>
        <kwd>machine learning</kwd>
        <kwd>correlation analysis</kwd>
        <kwd>pattern recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the financial investment and trading domains, synthetic data—artificially generated datasets that
mimic real-world financial time-series characteristics—has become a robust solution for quantitative
analysis and back-testing. The demand for synthetic data has arisen due to increasingly complex
financial models and algorithms driven by data-demanding machine learning (ML) models. For these
models, real historical time-series have multiple limitations, such as limited volume,
high cost, incomplete data, or decreasing relevance further back in time. The core characteristic of
synthetic data is its ability to capture the statistical properties of real-world markets while maintaining
a completely artificial nature. This allows for intensive testing before financial models and algorithms
are further validated on real-time financial data. The Graph-Massivizer project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] aims to develop a
software platform consisting of independent yet integrated tools. In one of its use cases, this platform
will generate synthetic data in extreme volumes, closely matching the quality and characteristics of
historical data samples of stocks and commodities futures, with plans to expand to other securities
such as ETFs, bonds, and options. At the core of the approach are knowledge graphs (KGs), chosen for
their ability to capture, store, and represent historical financial time-series. All technologies used are
designed around creating, processing, storing, and generating these KGs. KGs, which represent
entities and their relationships, can be significantly enhanced by ontologies. Firstly, ontologies
provide a shared vocabulary and semantic alignment, particularly useful when integrating data from
different sources across various KGs. Secondly, KGs can leverage ontologies to perform inference and
reasoning, allowing the discovery of new relationships within the graph. Thirdly, ontologies can enrich
KGs by adding missing information, such as properties or classes that are not explicitly present but are
implied based on existing relationships.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Graph-Massivizer Project - The Financial Use Case</title>
      <p>
        Green and sustainable finance. This use case [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] aims to enhance algorithmic investment and
trading capabilities in green-focused products and investment/trading styles by generating and utilizing
extreme volumes of synthetic data for testing and training. In this respect, the Graph-Massivizer project
seeks to overcome the limitations posed by financial market data providers, such as restricted data
volume, reduced accessibility, and high costs, by enabling the rapid, semi-automated creation of realistic
and affordable synthetic financial datasets that are unlimited in size and accessibility. It also aims to
improve ML-based green investment and trading simulations by eliminating critical biases, such as prior
knowledge, over-fitting, and indirect contamination, that arise from current data scarcity. The approach first
maps samples of historical financial data (stocks and commodities futures) to a financial massive graph (F-MG)
through a time-series to graph transformation. Next, using a generative model, we create a synthetic
financial massive graph (SF-MG). Finally, we generate synthetic financial data from the SF-MG by
enforcing specific quality rules. To achieve this, the Graph-Massivizer platform is provided with 10 TB
of historical data samples, with the primary goal (KPI 1) of generating between 1 and 5 PB of synthetic
financial time-series data. Another goal (KPI 2) is to achieve 90% energy consumption accountability for
synthetic data creation. We use this data to test and improve financial algorithms, and aim to achieve
(KPI 3) a measurable return increase of 2-4% in the enhanced financial algorithms that use synthetic
data. Additionally, we aim to achieve (KPI 4) an increase in the financial algorithms’ alpha by 1-2% and
a Sharpe ratio greater than 1.5.
      </p>
      <p>Graph-Massivizer toolkit. The Graph-Massivizer toolkit is an integrated platform composed
of five tools (Graph-Inceptor, Graph-Scrutinizer, Graph-Optimizer, Graph-Greenifier, and
Graph-Choreographer) that perform specific and unique functions for massive graph processing:</p>
      <list list-type="bullet">
        <list-item><p>Graph-Inceptor: realizes a massive graph for the system to use.</p></list-item>
        <list-item><p>Graph-Scrutinizer: provides analytic capabilities and probabilistic reasoning for insights.</p></list-item>
        <list-item><p>Graph-Optimizer: ensures that large graph operations are completed efficiently.</p></list-item>
        <list-item><p>Graph-Greenifier: evaluates the energy consumption of massive graph operations.</p></list-item>
        <list-item><p>Graph-Choreographer: allows serverless deployment to use resources on demand.</p></list-item>
      </list>
      <p>Figure 1 shows the overall Graph-Massivizer architecture and how the five tools are
interconnected. The diagram also shows the external components, such as the metaphactory
platform, the graph database, and the hardware and infrastructure used by the toolkit.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Challenges in Modeling Financial Data</title>
      <p>Modeling financial data presents several challenges due to financial markets’ dynamic nature and
complexities. In addition to numerous variables, volatility clustering, fat tails, and noise, which are all
specific to financial data, we focus particularly on five aspects to generate synthetic data.</p>
      <p>KG relation extraction and enrichment. In large-scale financial data, extracting relationships
among various data types can be complex and non-intuitive, often requiring inference methods to
identify them. We aim to enhance the quality of KG relation extraction by utilizing ontologies and
reasoning methods to identify and extract non-obvious relationships.</p>
      <p>Heterogeneous time-series data. Financial data consists of different types of time-series data
with different semantics, domains, and dynamics. We can mitigate this diversity by using ontologies
that capture their relationships and by finding correlations among them.</p>
      <p>Changing statistical properties. Financial time-series often display changing statistical properties
over time, such as means, variances, and covariances. These properties are used for measuring an asset’s
performance and risk. These statistical patterns must be identified and replicated in the synthetic data
to model the original data accurately.</p>
      <p>
        Quality assessment of generated data. The evaluation of synthetic data is an area of ongoing research; [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] review several evaluation methods for financial and other synthetic time-series. The
method of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] applies various qualitative, quantitative, and predictive methods. These methods consist
of statistical models and distances, such as various distribution divergence metrics, the
Kolmogorov-Smirnov test, correlation analysis between real and synthetic data, and the non-parametric maximum mean discrepancy (MMD). Other
methods evaluate, on real data, ML models trained on synthetic data. Additionally, we can evaluate
specific quantities such as Value at Risk (VaR). These methods differ depending on the use case, and we
aim to select appropriate metrics.
      </p>
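      <p>As a hedged illustration of one of the distribution-based checks mentioned above, the following Python sketch computes the two-sample Kolmogorov-Smirnov statistic between a real and a synthetic sample. The function name ks_statistic and its inputs are assumptions made for illustration only; they are not part of the Graph-Massivizer tooling, and the choice of metrics for the use case remains open.</p>
      <preformat>
```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the empirical CDFs of the two
    samples. Hypothetical helper for illustration; not project code.
    """
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(synthetic)
    gap = 0.0
    # The supremum of the gap is attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of real values not above x
        cdf_b = bisect_right(b, x) / len(b)  # share of synthetic values not above x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap
```
      </preformat>
      <p>A statistic near 0 means the two empirical distributions are nearly indistinguishable under this measure, while a statistic near 1 means they barely overlap. In practice it would more plausibly be applied to returns than to raw prices, which is again an assumption rather than a statement about the project.</p>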
    </sec>
    <sec id="sec-4">
      <title>4. Using Ontologies and Knowledge Graphs</title>
      <p>KGs and ontologies allow scientists and domain experts to model complex relations between data in a
logically structured and machine-readable format. This capability allows ontologies to connect diverse
sources of information, such as the use case presented here and similar related data.</p>
      <p>
        In the Graph-Massivizer project, ontologies represent and integrate data from diverse use cases. The
metaphactory platform was chosen to manage ontologies and integrate data with a front-end interface.
Metaphactory has many applications for developing and managing ontologies, KGs, and other related
semantic artifacts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. With metaphactory, users can interact with and create ontologies and integrate
and use data that aligns with the ontology.
      </p>
      <p>By decoupling the data and the schema for the data, the ontology allows developers to model and
prepare for handling massive amounts of data in an abstract way. For instance, a user or developer
can write queries to inspect only the relevant data of interest inside the large data set. While queries
like this are not themselves a direct algorithmic optimization, they do play a critical role in ensuring
scalability is possible by identifying critical semantic information and metadata that can reduce a huge
chunk of data into something more tractable.</p>
      <p>In this section, we describe the data represented by the ontology and then show the ontology that
schematizes it for integration with a KG.</p>
      <sec id="sec-4-1">
        <title>4.1. Ontology data</title>
        <p>
          This use case initially focuses on two types of financial products: stocks and commodities futures.
However, this paper concentrates on one financial product: stocks. The financial ontology for stocks
consists of four main types of financial data:
        </p>
        <p>
          Fundamental data. Fundamental data [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] indicators represent accounting data related to a company
and its particular industry. These indicators have a low update frequency (quarterly on average or
yearly) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and provide long-term insights into a company’s valuation and price evolution.
        </p>
        <p>
          Technical data. In contrast with fundamental data, technical data [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] has a very high update frequency
(tick/second/minute, etc.), offering short-term insights into stock price movements. This data includes
fine-grained historical stock price information in the form of Open, High, Low, Close, and Volume
(OHLCV). Numerous additional statistics can be derived from this data for financial modeling and
prediction, particularly for intra-day trading.
        </p>
        <p>
          ESG data. Environmental, social, and governance (ESG) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] data measures companies based on various
responsibility metrics, including environmental, social, and governance criteria. By considering these
criteria in their investments, investors encourage responsible corporate behavior and avoid investing in
companies with risky or unethical practices.
        </p>
        <p>
          Sentiment data. Market sentiment data [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] reflects investors’ attitudes toward a company, sector,
or financial market. Various indicators derived from statistical technical analysis, social media, or
alternative data sources can be used to measure market sentiment.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ontology diagram</title>
        <p>The financial ontology overview in Figure 2 shows the objects and data constituting the
use case, namely historical and synthetic data and financial algorithms. The algorithms run
inside the PeractonSecuritiesPlatform class that ingests the SyntheticStocksData and
SyntheticCommoditiesData generated by the Graph-MassivizerPlatform class.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Synthetic ontology diagram</title>
        <p>The synthetic financial data ontology in Figure 3 shows the created categories and mirrors the original
historical financial data set structure. The SyntheticSymbol belongs to SyntheticCommoditiesData
and SyntheticStocks classes, with the TechnicalData and FundamentalData classes as features.
It also shows the SyntheticFinancialData class with SyntheticCommoditiesMultiverse
and SyntheticStocksMultiverse subclasses, with further SyntheticCommoditiesData and
SyntheticStocksData subclasses.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Related Work</title>
      <p>We briefly review the literature on applying KGs in financial data analysis, then elaborate on other
financial data modeling and generation methods.</p>
      <sec id="sec-5-1">
        <title>5.1. Ontology in financial data analysis</title>
        <p>Several methods leverage ontology and KG in financial analysis, such as KG extraction and enrichment,
querying and reasoning over KGs, extracting correlations, and modeling financial time-series.</p>
        <p>
          [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] studies the effect of considering fundamental and technical data in stock price prediction ML
models, showing that models using both types of indicators outperform models that consider either
alone. The method of [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] proposes KG extraction, enrichment, and querying methods. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] derives
a high-quality financial KG from a given ontology using a semi-automated method and utilizes this KG in
reasoning, stock prediction, and generation with two neural network models, a multi-layer perceptron
and a long short-term memory (LSTM) network. [14] provides an ontology-based correlation extraction between
different companies. It derives the network of companies by assessing time-series and uses the node2vec
and k-nearest neighbor (kNN) methods to embed and cluster the extracted nodes. [15] proposes a joint
graph learning and prediction model on time-series data. It uses KGs and graph neural networks (GNNs)
to derive the correlation among different time-series. The method of [16] uses KGs to find
first- and second-order relationships among companies. It applies an LSTM for time-series embedding of
each node and a temporal graph convolutional network (GCN) to incorporate the varying neighborhood
effect. The method in [17] formulates stock prediction as a stochastic optimization problem and introduces
genetic programming with a generalized crowding method, using a financial KG to predict prices.
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Financial time-series modeling</title>
        <p>Statistical models. Various linear statistical models, such as autoregressive models, exist for stock
prices; however, they cannot capture the complex non-linear structure of these data [18].</p>
        <p>Event-based models. These methods formulate time-series data as event data, defining events as
remarkable changes in a time-series in continuous time. The methods of [19] and [20] construct correlation
(influence) graphs of time-series utilizing the Hawkes process. [20] defines events as long-lasting
volatility values. It uses an attention layer to weigh and capture neighborhoods and an LSTM to predict
the next price. The method in [21] uses an event graph and tackles dimensionality. It formulates the
intensity function utilizing GNNs and an attention layer to dynamically embed node features, and
recurrent neural networks (RNNs) to embed event sequences. Graphical event models are another form
of marked point process that utilizes graphical information for event graph construction [21].</p>
        <p>Pattern recognition. These methods apply pattern-matching techniques to predict trends in
time-series, including perceptually important points, template matching, and dynamic time warping
algorithms. The survey [22] provides a detailed review of these methods.</p>
        <p>ML models. The review in [22] categorizes these models into supervised and unsupervised models.
Supervised learning methods use various ML models, such as support vector machines, random forests,
AdaBoost, kNN, and eXtreme gradient boosting, for stock prediction. Unsupervised learning
methods include clustering methods that help find correlations among markets [22].</p>
        <p>Recently, several studies have used deep learning models for modeling financial time-series data,
including convolutional neural networks, RNNs, attention mechanisms, and generative adversarial
networks (GANs), which are capable of capturing non-linear and complex data features. The survey [23] provides
a comprehensive review of these models.</p>
        <p>
          GANs are powerful data generation models that are appropriate for time-series and can be adapted to event data
generation. The method in [24] proposes a marked event data generation method that uses separate
generators and discriminators for each event type and preserves type correlations using a central
discriminator. This method provides synthetic data for downstream tasks. The GAN model of [18]
constructs the correlation graph among stocks using different correlation analysis methods and uses
GCNs to encode interdependent time-series data. The method in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] captures correlation among stocks
and applies three generative models, including GAN. The survey [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] presents more models of this
category.
        </p>
        <p>Another category of methods uses natural language processing (NLP) to benefit from financial text
data such as financial news and SEC filings. These methods use N-grams and word2vec embeddings as
inputs for prediction models. The survey [22] and the paper [16] introduce methods of this category.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Correlation Analysis among Financial Data</title>
      <p>Financial data can show various correlations across financial products, companies, and markets. These
correlations may be based on deeper links between companies and industries or simply random, lacking
any underlying economic or financial rationale. We aim to identify the relevant and meaningful
correlations within historical financial time-series and use these insights to generate synthetic data.</p>
      <sec id="sec-6-1">
        <title>6.1. Point process approach</title>
        <p>Point process models are stochastic processes that successfully model event sequences [25]. They vary
depending on the definition of the conditional intensity function, representing the expected number
of events in a small time interval given the event history. We consider a multi-dimensional Hawkes
(self-exciting) process [26] to model dependencies between various time-series of markets and obtain
the influence graph. We convert financial time-series data to event sequences by defining events as
relatively significant changes in time-series.</p>
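        <p>The conversion of a time-series into an event sequence can be sketched as follows. This is a minimal illustration, not the project's implementation: the text does not specify what counts as a relatively significant change, so the 2% relative-change threshold, the function name extract_events, and the up/down labels are all assumptions.</p>
        <preformat>
```python
def extract_events(prices, threshold=0.02):
    """Turn a price series into (index, direction) events.

    An event is recorded whenever the relative change from the previous
    observation exceeds `threshold` in absolute value. The threshold value
    and the labels are illustrative assumptions.
    """
    events = []
    for i in range(1, len(prices)):
        prev = prices[i - 1]
        if prev == 0:
            continue  # relative change is undefined for a zero base
        change = (prices[i] - prev) / prev
        if abs(change) > threshold:
            events.append((i, "up" if change > 0 else "down"))
    return events


# Example: only the moves of roughly +4.0% and -4.8% qualify as events.
print(extract_events([100.0, 101.0, 105.0, 104.0, 99.0]))
```
        </preformat>
        <p>Each stock then contributes its own event sequence, and the event types (here, stock and direction) become the dimensions of the multi-dimensional point process.</p>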
        <p>Self-exciting process. This process represents the triggering effect of the event history on the intensity
and occurrence of future events, usually through an exponential excitation kernel (Eq. 1) [26]:</p>
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math><![CDATA[\lambda_i(t) = \mu_i + \sum_{j:\, t_j < t} \alpha_{i,j} \cdot e^{-\beta \cdot (t - t_j)},]]></tex-math>
        </disp-formula>
        <p>where λ<sub>i</sub>(t) is the intensity of event type i at time t, μ<sub>i</sub> is the base intensity for event type i, β is the
excitation decay rate, and α<sub>i,j</sub> is the influence parameter between event types i and j.</p>
        <p>We infer the model parameters (μ and the influence matrix A = (α<sub>i,j</sub>)) by maximizing the log-likelihood of
the events given in Eq. 2 using expectation-maximization or stochastic gradient descent methods:</p>
        <disp-formula id="eq-2">
          <label>(2)</label>
          <tex-math><![CDATA[\mathcal{L} = \sum_{i:\, t_i < T} \log \lambda_i(t_i) - \int_0^T \lambda(t)\, dt,]]></tex-math>
        </disp-formula>
        <p>where [0, T] is the time interval of events we consider for correlation analysis, and λ(t) is the sum of
intensities of all event types.</p>
      </sec>
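      <p>To make Eq. 1 and Eq. 2 concrete, the sketch below evaluates the exponential-kernel intensity and the log-likelihood for a small set of typed events. It is an illustration under stated assumptions: events are given as (time, type) pairs, alpha[i][k] is read as the influence of past events of type k on type i, a single decay rate beta is shared by all types, and the function names are hypothetical. It only evaluates the objective; parameter inference by expectation-maximization or gradient descent is not shown.</p>
      <preformat>
```python
import math


def intensity(i, t, events, mu, alpha, beta):
    """Intensity of event type i at time t with an exponential kernel (cf. Eq. 1).

    events: list of (time, type) pairs; alpha[i][k] is assumed to be the
    influence of past type-k events on type i (illustrative convention).
    """
    total = mu[i]
    for s, k in events:
        if t > s:  # only strictly earlier events excite the process
            total += alpha[i][k] * math.exp(-beta * (t - s))
    return total


def log_likelihood(events, mu, alpha, beta, horizon):
    """Log-likelihood of typed events observed on [0, horizon] (cf. Eq. 2)."""
    n_types = len(mu)
    # Sum of log-intensities at the observed events, each for its own type.
    log_term = sum(
        math.log(intensity(k, s, events, mu, alpha, beta)) for s, k in events
    )
    # Closed-form integral of the summed intensity over [0, horizon].
    integral = sum(mu) * horizon
    for s, k in events:
        decay = (1.0 - math.exp(-beta * (horizon - s))) / beta
        integral += sum(alpha[i][k] for i in range(n_types)) * decay
    return log_term - integral
```
      </preformat>
      <p>Because the kernel is exponential, the integral term has a closed form, so no numerical integration is needed. A fitting routine would maximize log_likelihood over mu and alpha, and large entries of the fitted alpha would then indicate candidate edges of the influence graph.</p>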
      <sec id="sec-6-2">
        <title>Stocks’ correlation analysis within a given exchange (market)</title>
        <p>We consider different point
processes for the time-series of stocks and obtain the correlation graph among them by analyzing event
data and inferring the influence matrix of the multi-dimensional process [27]. This matrix reveals the
weighted dependency (influence) graph among different stocks in the market. Due to the ever-changing
dependencies among companies, we can update this graph over time and derive a dynamic dependency graph.
As a remedy for high-dimensional market data, which decreases the accuracy of these models, we can
derive an initial dependency graph by processing various textual data using NLP techniques.</p>
        <p>Correlation analysis of internal stock data. Despite the expressiveness of point process models,
they are inaccurate in high-dimensional spaces. In correlation analysis of internal stock data
(fundamental and technical), we can mitigate this problem by leveraging KGs and considering only relations
appearing in the KG.</p>
        <p>Formulating the intensity function using neural networks. In addition to high dimensionality, the
assumption of a rigid intensity function, such as the formulation in Eq. 1 with a predefined triggering
effect (a positive and additive effect of history), might not apply to every real scenario with intricate
dependencies. Therefore, several methods use neural networks, such as the LSTMs in [28], to
formulate the intensity function of point process models, capturing variable and complex dependencies
and adjusting the model to the specific use case. Additionally, the attention mechanism [29], which can
explicitly model the influence of event types, helps in finding the correlation graph.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.2. Spurious correlations</title>
        <p>Not every correlation represents a causal relationship; a correlation without one is known as a spurious correlation [30]. Such a
correlation can be a random effect or be caused by other hidden variables [30].
In particular, deep learning models usually pick up spurious correlations when trained without sufficiently diverse data
[31]. To mitigate this misinterpretation, we will apply the following methods to test the correlations:</p>
        <list list-type="bullet">
          <list-item><p>Considering, as far as possible, other stock market factors in the correlation analysis in addition to stocks.</p></list-item>
          <list-item><p>Considering long-term correlations among time-series, or comparing correlations over the long term, to test whether they are random.</p></list-item>
          <list-item><p>Applying null-hypothesis tests to find significant p-values, such as the method of [32], which constructs a spurious relationship graph among time-series of stocks using the Granger causality test to evaluate causality between time-series and a t-test to estimate the p-value.</p></list-item>
        </list>
        <p>We aim to prune the initial dependency graph and reduce the dimensionality of point process models by
detecting spurious correlations and deriving a more meaningful model based on causal dependencies.</p>
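        <p>As a simplified illustration of null-hypothesis testing for a candidate edge, the sketch below estimates a p-value for the Pearson correlation between two series with a permutation test. This is a deliberately simpler stand-in, not the Granger causality and t-test procedure of [32]. The function names, the number of permutations, and the assumption that the inputs are return series whose observations can be treated as exchangeable are all illustrative.</p>
        <preformat>
```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of shuffles of y whose |correlation| with x reaches the observed one.

    Shuffling breaks any real link between the series, so the shuffled
    correlations approximate the null distribution. Assumes exchangeable
    observations, which is itself an assumption for financial returns.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)
```
        </preformat>
        <p>An edge whose p-value stays above a chosen significance level would be a candidate for pruning from the initial dependency graph. With many stock pairs tested at once, a multiple-testing correction would also be needed, which this sketch omits.</p>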
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Modeling Synthetic Financial Time-Series</title>
      <p>An important input for modeling synthetic time-series is a correlation graph among stocks, as
presented in Figure 4. The temporal and dependency features of stocks are embedded using this
correlation graph. These features serve as inputs for ML models that generate time-series data. In the
following, we propose several types of these models.</p>
      <p>Improving event-based models using GNNs and LSTM. The correlation graph can be used
directly for generating time-series. However, to enhance the initial point process models for modeling
time-series (which contain more detail than event sequences) and to improve graph weights, we will
add a GNN with attention layers [33]. This model can simultaneously encode the time-series and the correlation
graph structure and obtain graph weights [20] (Figure 4). Then, we will model the synthetic time-series
data using an LSTM given the encoded data [20].</p>
      <p>[Figure 4. Components shown: correlation graph; GNN; ML-based time-series generator; multidimensional time-series (x11, x12, ...), (x21, x22, ...), (x31, x32, ...).]</p>
      <sec id="sec-7-1">
        <title>Generating synthetic time-series using correlation graphs</title>
        <p>The other method combines the
correlation graph and GANs to generate interrelated time-series data. We will use this graph to
embed time-series data, capturing their relationships and providing inputs for the GAN model.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Implementation Considerations</title>
      <p>Generating extreme volumes of synthetic financial time-series data poses unique challenges and
requirements from both software and hardware perspectives. Here, we outline the most relevant ones.</p>
      <p>Scalable data generation. The five tools of the Graph-Massivizer platform are being implemented
in a scalable manner to generate synthetic time-series data across a distributed system. One strategy employed is
task-level parallelism, which divides the data generation process into small chunks processed simultaneously
across different nodes in the HPC cluster. As a computing framework, Apache Spark provides libraries
and functionalities for parallel processing and data management on large clusters.</p>
      <p>Data storage and streaming. Managing and storing the produced synthetic data at the petabyte level
requires various solutions, such as compression, third-party storage, and transferring data only when
necessary. Ideally, the synthetic data should remain within the HPC cluster, with financial simulations
and testing conducted in the same environment to minimize data movement.</p>
      <sec id="sec-8-1">
        <title>Parallel processing of data generation and large memory capacity</title>
        <p>We use CINECA’s Leonardo
pre-exascale supercomputer [34], which will allow the parallel processing of synthetic data generation. A
low-latency, high-bandwidth network is in place for communication between compute nodes within
the HPC cluster to facilitate data exchange.</p>
        <p>Energy consumption monitoring. The Graph-Massivizer platform has a dedicated tool called
Graph-Greenifier to monitor and analyze the energy consumed in generating the synthetic
time-series data.</p>
        <p>Data security. Even though the underlying historical financial time-series data does not contain
personally identifiable information, the synthetic data derived from it can still be commercially sensitive. Therefore,
security measures are necessary to ensure data confidentiality and integrity, together with secure coding practices.
These measures include strict access control, encryption of synthetic data at rest and in transit, digital signatures of
the synthetic data to ensure its authenticity and prevent tampering, and data logging and monitoring.</p>
        <p>Cost optimization. Finally, cost optimization in generating synthetic data is critical to the entire
process. It involves balancing hardware costs, licensing fees, and ongoing maintenance expenses with
the desired performance and scalability. The goal is to offer a more cost-effective solution than real
historical financial data while maintaining competitiveness in pricing.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>9. Conclusions</title>
      <p>The work presented in this paper, as part of the Graph-Massivizer EU project, is currently at the
halfway point and demonstrates the initial proof-of-concept developments for generating synthetic
data in extreme volumes. The mechanisms and approaches identified here will be further implemented
as a robust workflow within the five tools of the Graph-Massivizer platform. The focus will be on
software implementation, integrating the five tools, and identifying relevant correlations in historical
time-series to enhance the quality of the generated synthetic data. Once synthetic data
samples have been generated, they will be tested using Peracton’s back-testing engine with various investment and trading
financial algorithms. The behavior of these algorithms will be analyzed and compared with their
performance on real historical data to assess differences and similarities. Feedback will be provided to
the Graph-Massivizer platform to fine-tune the quality of the synthetic data further.
Acknowledgement The Graph-Massivizer project has received funding from the European Union’s
Horizon Research and Innovation Actions under Grant Agreement Nº 101093202.
More information is available at: https://graph-massivizer.eu/
</p>
      <p>[14] C. Erten, D. Kazakov, Ontology graph embeddings and ILP for financial forecasting, in: International Conference on Inductive Logic Programming, Springer, 2021, pp. 111–124.</p>
      <p>[15] S. Ibrahim, W. Chen, Y. Zhu, P.-Y. Chen, Y. Zhang, R. Mazumder, Knowledge graph guided simultaneous forecasting and network learning for multivariate financial time series, in: Proceedings of the Third ACM International Conference on AI in Finance, 2022, pp. 480–488.</p>
      <p>[16] D. Matsunaga, T. Suzumura, T. Takahashi, Exploring graph neural networks for stock market predictions with rolling window analysis, arXiv preprint arXiv:1909.10660 (2019).</p>
      <p>[17] X. Fu, X. Ren, O. J. Mengshoel, X. Wu, Stochastic optimization for market return prediction using financial knowledge graph, in: 2018 IEEE International Conference on Big Knowledge (ICBK), IEEE, 2018, pp. 25–32.</p>
      <p>[18] D. Ma, D. Yuan, M. Huang, L. Dong, VGC-GAN: A multi-graph convolution adversarial network for stock price prediction, Expert Systems with Applications 236 (2024) 121204.</p>
      <p>[19] J. Etesami, N. Kiyavash, K. Zhang, K. Singhal, Learning network of multivariate Hawkes processes: A time series approach, arXiv preprint arXiv:1603.04319 (2016).</p>
      <p>[20] T. Yin, C. Liu, F. Ding, Z. Feng, B. Yuan, N. Zhang, Graph-based stock correlation and prediction for high-frequency trading systems, Pattern Recognition 122 (2022) 108209.</p>
      <p>[21] K. Yoon, Y. Im, J. Choi, T. Jeong, J. Park, Learning multivariate Hawkes process via graph recurrent neural network, in: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023, pp. 5451–5462.</p>
      <p>[22] D. Shah, H. Isah, F. Zulkernine, Stock market analysis: A review and taxonomy of prediction techniques, International Journal of Financial Studies 7 (2019) 26.</p>
      <p>[23] W. Jiang, Applications of deep learning in stock market prediction: recent progress, Expert Systems with Applications 184 (2021) 115537.</p>
      <p>[24] A. Seyfi, J.-F. Rajotte, R. Ng, Generating multivariate time series with common source coordinated GAN (COSCI-GAN), Advances in Neural Information Processing Systems 35 (2022) 32777–32788.</p>
      <p>[25] D. J. Daley, D. Vere-Jones, et al., An introduction to the theory of point processes: volume I: elementary theory and methods, Springer, 2003.</p>
      <p>[26] A. G. Hawkes, Spectra of some self-exciting and mutually exciting point processes, Biometrika 58 (1971) 83–90.</p>
      <p>[27] E. Lewis, G. Mohler, A nonparametric EM algorithm for multiscale Hawkes processes, Journal of Nonparametric Statistics 1 (2011) 1–20.</p>
      <p>[28] H. Mei, J. M. Eisner, The neural Hawkes process: A neurally self-modulating multivariate point process, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[29] Y. Gu, Attentive neural point processes for event forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 7592–7600.</p>
      <p>[30] H. A. Simon, Spurious correlation: A causal interpretation, Journal of the American Statistical Association 49 (1954) 467–479.</p>
      <p>[31] S. Wu, M. Yuksekgonul, L. Zhang, J. Zou, Discover and cure: Concept-aware mitigation of spurious correlation, in: International Conference on Machine Learning, PMLR, 2023, pp. 37765–37786.</p>
      <p>[32] G. Li, J. J. Jung, Dynamic relationship identification for abnormality detection on financial time series, Pattern Recognition Letters 145 (2021) 194–199.</p>
      <p>[33] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, Y. Bengio, et al., Graph attention networks, stat 1050 (2017) 10–48550.</p>
      <p>[34] CINECA, High performance computing, Leonardo pre-exascale supercomputer, 2024. URL: https://leonardo-supercomputer.cineca.eu/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E. U. H. R.</given-names>
            <surname>Graph-Massivizer EU Project</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A. N. . Innovation</given-names>
            <surname>Actions</surname>
          </string-name>
          , Graph massivizer,
          <year>2023</year>
          . URL: https://graph-massivizer.eu/.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. U. H. R.</given-names>
            <surname>Graph-Massivizer EU Project</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A. N. . Innovation</given-names>
            <surname>Actions</surname>
          </string-name>
          ,
          <source>Use case 1 green-finance</source>
          ,
          <year>2023</year>
          . URL: https://graph-massivizer.eu/project/green-and-sustainable-finance/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-D.</given-names>
            <surname>Ştefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lamba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <article-title>Generation of realistic synthetic financial time-series</article-title>
          ,
          <source>ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)</source>
          <volume>18</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Assefa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dervovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mahfouz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Tillman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Veloso</surname>
          </string-name>
          ,
          <article-title>Generating synthetic data in finance: opportunities, challenges and pitfalls</article-title>
          ,
          <source>in: Proceedings of the First ACM International Conference on AI in Finance</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stenger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leppich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kounev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <article-title>Evaluation is key: a survey on evaluation measures for synthetic time series</article-title>
          ,
          <source>Journal of Big Data</source>
          <volume>11</volume>
          (
          <year>2024</year>
          )
          <fpage>66</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Eberhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Haase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Schell</surname>
          </string-name>
          ,
          <article-title>metaphactory for massive graphs</article-title>
          , in: M. Vieira, V. Cardellini, A. D. Marco, P. Tuma (Eds.),
          <source>Companion of the 2023 ACM/SPEC International Conference on Performance Engineering, ICPE 2023, Coimbra, Portugal, April 15-19, 2023</source>
          , ACM,
          <year>2023</year>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>220</lpage>
          . URL: https://doi.org/10.1145/3578245.3585330. doi:10.1145/3578245.3585330.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Investopedia</surname>
          </string-name>
          , Fundamental data,
          <year>2024</year>
          . URL: https://www.investopedia.com/terms/f/fundamentalanalysis.asp.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Beyaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tekiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-j.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Keane</surname>
          </string-name>
          ,
          <article-title>Comparing technical and fundamental indicators in stock price forecasting</article-title>
          ,
          <source>in: 2018 IEEE 20th international conference on high performance computing and communications; IEEE 16th international conference on smart city; IEEE 4th international conference on data science and systems</source>
          (HPCC/SmartCity/DSS), IEEE,
          <year>2018</year>
          , pp.
          <fpage>1607</fpage>
          -
          <lpage>1613</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Investopedia</surname>
          </string-name>
          ,
          <source>Technical data</source>
          ,
          <year>2024</year>
          . URL: https://www.investopedia.com/terms/t/technicalanalysis.asp.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Investopedia</surname>
          </string-name>
          , ESG data,
          <year>2024</year>
          . URL: https://www.investopedia.com/terms/e/environmental-social-and-governance-esg-criteria.asp.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Jefriess</surname>
          </string-name>
          ,
          J, Sentiment data,
          <year>2024</year>
          . URL: https://www.eightcap.com/labs/exploring-the-most-common-sentiment-indicators-on-tradingview/.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zehra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. F. M.</given-names>
            <surname>Mohsin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. I.</given-names>
            <surname>Jami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Siddiqui</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. K.-U.-R. R. Syed</surname>
          </string-name>
          ,
          <article-title>Financial knowledge graph based financial report query system</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>69766</fpage>
          -
          <lpage>69782</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kertkeidkachorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nararatwong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ichise</surname>
          </string-name>
          ,
          <article-title>Finkg: A core financial knowledge graph for financial analysis</article-title>
          ,
          <source>in: 2023 IEEE 17th International Conference on Semantic Computing (ICSC)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>90</fpage>
          -
          <lpage>93</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>