<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SYDAG: A Synthetic Dataset Generator for Data Integration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anne Marschner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thorsten Papenbrock</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Philipps University Marburg, Faculty of Mathematics and Computer Science</institution>
          ,
          <addr-line>Hans-Meerwein-Straße 6, 35043 Marburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>For the development and evaluation of data integration tools, developers require test datasets that represent realistic data integration scenarios. Due to a lack of publicly accessible test datasets with ground truth information, developers often use dataset generators to create their integration scenarios. Existing dataset generators are, however, often limited in functionality and offer only limited configuration options. We therefore develop the synthetic dataset generator SYDAG that generates realistic integration scenarios from real-world seed datasets and offers highly customizable configurations for output fine-tuning. It supports the injection of a wide range of error types and structural changes, which enable the generation of heterogeneous integration scenarios. For the generation, we propose a logical sequence of processing steps and efficiently implement them in the SYDAG system. For the evaluation of SYDAG, we generate integration scenarios of three complexity levels and examine how different schema matchers perform on them. Our results show that the performance of all matchers decreases as the complexity of the integration scenario increases, confirming that SYDAG is capable of generating complex integration scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>dataset generation</kwd>
        <kwd>relational data</kwd>
        <kwd>data integration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        preferable [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These generators can quickly generate different benchmarking scenarios and can be
tailored to the user’s needs (e.g., degree of heterogeneity, noise or pollution). Therefore, data generators
are an effective tool for creating heterogeneous data integration scenarios that represent different types
of data sources [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        The number of newly developed data integration algorithms continues to increase [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, while there
is much research on integration, there is comparatively little research on the test data generators that are
needed to evaluate the integration algorithms [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Additionally, benchmarks are often limited to specific
tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This paper addresses these gaps and aims to advance the generation of test datasets. To
achieve this, we introduce the Synthetic Dataset Generator SYDAG for the creation of data integration
scenarios. SYDAG meets the described requirements for the generation of realistic and heterogeneous
integration scenarios and introduces various features and configuration options that allow a particularly
precise customization and distinguish our approach from existing data generators. The customizations
include the horizontal and/or vertical splitting of the input dataset, the injection of a variety of schema
and data errors, the adjustment of normalization degrees, and the obfuscation of schema and instance
correspondences via merge and shuffle operations on rows and/or columns. Because relational databases
are one of the most common formats for structuring data, SYDAG also targets this particular data
model as input and output [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We also provide a Graphical User Interface (GUI) to enable an easy way
of configuration [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We test and evaluate SYDAG’s ability to create integration scenarios of varying
complexity by measuring the performance of three simple schema matchers in different integration
scenarios generated by SYDAG.
      </p>
      <p>The remainder of this paper is structured as follows: First, Section 2 provides a review of related
work on dataset generators. Next, Section 3 describes the configuration options and design of SYDAG.
Afterwards, Section 4 introduces SYDAG’s implementation and accessibility. Then, Section 5 presents
our evaluation with three standard matching techniques. Section 6 ends the paper with a conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        To evaluate the performance of integration tools, benchmarks are essential [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. A practical approach is
to construct these benchmarks from real-world data. This enables developers to test schema matching
tools in realistic conditions [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. An example of this approach is the benchmark Thalia provided by
Hammer et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Thalia is a collection of over 25 publicly available test datasets, which include a large
number of syntactic and semantic heterogeneities. Cabrera et al. presented a similar benchmark, called
DIBS [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. DIBS includes real data integration tasks from diverse domains and additionally provides
metrics to assess performance in these scenarios. Crescenzi et al. introduced another benchmark, called
Alaska [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], whose ground truth was created by experts. It consists of real datasets but differs from DIBS
because of its flexibility. Alaska is based on real data from 71 e-commerce websites, which ensures
that the benchmark reflects realistic challenges, and users can select subsets of the data to achieve
the desired level of difficulty. Our approach also uses real-world data, but we generate the integration
scenario automatically.
      </p>
      <p>
        It is, in general, difficult to find real-world datasets with ground truth because the creation of ground
truth is highly labor-intensive [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Moreover, the increasing number of data sources across various
sectors and their heterogeneity require many different scenarios [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This has led to the creation of
dataset generators instead of benchmarks with real-world data. These generators can quickly produce
diverse synthetic datasets for benchmarking scenarios. Additionally, users can adapt the generated
datasets to their own expectations [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Hence, the goal is to generate realistic datasets that reflect
heterogeneity and volume.
      </p>
      <p>
        Panse et al. introduced a test dataset generator called DaPo+ [
        <xref ref-type="bibr" rid="ref2 ref7">2, 7</xref>
        ], which is an extension of the
generator DaPo presented by Hildebrandt et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The extended system takes an existing dataset
as input and uses it to create multiple datasets with errors and duplicates. Because it supports both
relational and non-relational data models, it can generate large and versatile datasets. The output
includes the created datasets and ground truth information.
      </p>
      <p>
        Ioannou et al. developed a generator for schema matching scenarios, which is called EMBench++ [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
EMBench++ works with relational databases and expects a user-defined configuration. Depending on
the configuration, it inserts duplicates and errors, such as misspellings or abbreviations, and simulates
time-based changes, such as entity splits. This allows developers to test algorithms in realistic scenarios.
      </p>
      <p>
        Lee et al. presented eTuner [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which is a system that automatically optimizes schema matching
algorithms by adjusting their parameters. It includes a workload generator that creates new schemata
from a given schema by splitting the schema into disjoint sets and introducing errors in attribute labels
and instance values. Because it creates schemata and their mappings, it can also be used to generate
realistic matching scenarios.
      </p>
      <p>
        Koutras et al. introduced Valentine [18], an open-source experiment suite for evaluating schema
matching techniques. It includes a dataset generator that takes a tabular dataset and a user-defined
configuration as input. Valentine’s dataset generator, then, performs a horizontal and/or vertical split
on the relations while retaining some overlapping columns and/or rows. It also inserts errors into the
overlapping entries both at data and schema level. Additionally, it includes a user-friendly interface [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The generators that we mentioned so far use real data to create synthetic datasets. However, there
are also generators that produce datasets without receiving a data source as input. For example, Alexe
et al. introduced STBenchmark [19], which includes two generators for schemata and instances: First,
SGen creates a mapping scenario; then, IGen generates the actual data instances corresponding to the
created structures. While SGen can utilize either real or generated data, IGen does not require a real
data source to generate data instances. The same applies to iBench, presented by Arocena et al. [20]. It
is a metadata generator for the synthetic creation of schemata and their mappings from user-defined
configurations, which define, among other things, desired size and complexity properties, without any seed data. Unlike
the other generators, iBench independently creates integrity constraints and other metadata.</p>
      <p>
        Our data generator SYDAG is heavily inspired by related work and subsumes most of the unique
features that we listed above: horizontal and vertical splits, error and noise generation, attribute splits
and merges, etc. We consolidate and extend these features in a particularly adaptable data generator,
which allows the user to make exact configurations and create individual integration scenarios. In
contrast to existing works, SYDAG can also automatically change schema normalization degrees.
Additionally, it profiles key constraints and offers options for schema and instance shuffling. Overall,
SYDAG offers not only more features than each individual approach but also more customization
options for the datasets to be generated. We developed SYDAG for relational datasets, because the relational
model is one of the most common data models [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Synthetic Dataset Generator</title>
      <p>We aim to cover a wide selection of functionalities in SYDAG’s design and, therefore, combine various
features from existing generators with additional methods. For this, we consider how different generation
approaches logically build upon each other. Certain processing steps need to be performed before others
to generate datasets with desired properties. Figure 1 visualizes the sequence of SYDAG’s generation
pipeline that meets these requirements.</p>
      <p>SYDAG takes a relational (real-world) dataset as input and lets the user specify the desired properties
of the matching scenario. The first processing step is the key determination. With the data profiling
algorithm HyUCC [21], this step automatically identifies all keys of the input relations; then, it selects
the smallest, most likely key of every relation as the relation’s primary key. Placing key inference at
the beginning is necessary because some of the following steps require keys as input. In particular, the
split and noise components depend on these keys being known, because the keys must be maintained in
all newly formed relations [18].</p>
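      <p>As an illustration, the selection of the primary key from the discovered unique column combinations (UCCs) can be sketched in a few lines of Python; this is a minimal sketch under our own assumptions (the candidate UCCs are given as lists of column names, and the alphabetical tie-break is a stand-in for SYDAG’s likelihood heuristic):</p>
      <preformat>
# Pick the smallest discovered UCC as the primary key.
# The tie-break (alphabetical order) is an illustrative assumption.
def select_primary_key(uccs):
    return min(uccs, key=lambda ucc: (len(ucc), sorted(ucc)))

uccs = [["name", "birthdate"], ["id"]]
print(select_primary_key(uccs))  # ['id']
      </preformat>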
      <p>The split component is the second processing step. It incorporates functionalities from Valentine’s
dataset generator, which supports both horizontal and vertical splits (and combinations of both). It also
allows users to specify the degree of overlap for rows and columns [18]. In addition, SYDAG provides
configuration options to control the distribution of non-overlapping columns or rows between the new
relations, which enables one relation to have more columns than the other. For horizontal splits, the
user can choose whether the overlapping rows should be selected within a block or randomly scattered.
The split needs to be executed at the beginning of the process, because it determines the number of new
datasets that SYDAG creates and, therefore, provides the basis for the dataset generation. SYDAG splits
the input relation into either two or four new relations, each representing a distinct dataset of the final
integration scenario. By placing the split at the start, it is possible to select the error levels and structural
changes individually for each newly created dataset, which enables the creation of heterogeneous
integration scenarios.</p>
      <p>Normalization to Boyce-Codd Normal Form (BCNF) follows as the next component and optional
computation step. It causes the datasets created during the split to no longer contain only one but
potentially multiple linked relations. The user can choose whether SYDAG should apply normalization
to BCNF individually for each of the two (or four) current datasets. Additionally, a percentage can be
specified that determines how many of the possible decomposition steps should be executed. To execute
the normalization, SYDAG applies the schema normalization algorithm Normalize [22]. Normalize
automatically profiles functional dependencies and uses them to transform relational datasets into
BCNF. We execute the normalization after the split and before any noise injection to create meaningful
decompositions and foreign key constraints. When SYDAG later adds noise to the relations, the user
can choose to preserve the key and foreign-key constraints or allow them to be broken.</p>
      <p>
        The following component inserts noise, i.e., errors, into the schemata. Because normalization has
already been applied, any schema within a dataset can receive errors. The split(s) that SYDAG performed
in the previous component provide information about the overlapping columns or rows. This enables the
selection of attributes in the schema that share overlapping entries with other relations. These attributes
are candidates for noise injection. For each newly created dataset, there are configuration options for
adding noise to the schema. The user can choose whether to include noise and, if so, whether key
attributes should be affected. If the user enables noise, she must specify the percentage of overlapping
attributes with other datasets that the generator will modify. The user can select from nine error
methods to introduce the noise or choose to delete the schema completely, which results in a relation
without headers. The available error methods include some approaches from existing generators, such
as perturbing the column names via prefixing, abbreviation or vowel dropping (Valentine [ 18]), and
replacing column names with synonyms or changing names to random characters (ETuner [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]). If a
relation includes these errors, it causes discrepancies with the names of the overlapping columns from
other relations and, hence, complicates integration scenarios. SYDAG additionally adds another four
noise options (see Section 4), but also allows the user to apply no changes.
      </p>
      <p>
        To generate complex scenarios, generators should not only add errors into the schema but also into
the instance data [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. For this purpose, data noise injection is the next component in SYDAG’s pipeline.
Similar to the schema noise component, SYDAG can use the information from the split to identify
overlapping rows or columns and insert errors specifically into their entries, as inserting errors in
non-overlapping records has less impact on integration scenarios. The configurations for data errors
are similar to those for the schema. For each created dataset, the user can select whether the generator
should introduce noise. This includes the specification of whether SYDAG is allowed to break the key
constraints by adding errors to key columns. Additionally, the user can select the percentage of noisy
rows or columns based on the split type. For horizontal splits, the generator inserts a user-specified
percentage of noise into the overlapping rows, while for vertical splits, it affects the overlapping columns.
Furthermore, the user can specify a percentage that indicates how many of the entries within a column or
row selected for noise should receive errors. There are 14 available noise methods that the user
can choose from (see Section 4). Again, some error methods are inspired by existing generators: random
typing errors based on keyboard probabilities and random numerical value changes based on value
distributions (Valentine [18]), as well as random data format changes (eTuner [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]), word permutations,
missing value injections, and word abbreviations in string values (EMBench++ [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]).
      </p>
      <p>
        The merge component follows next in the pipeline and can merge two columns of a relation into a
single column. This procedure is inspired by eTuner [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], but SYDAG can execute multiple of these
merges. We visualize an example of the process in Figure 2. Each merge creates a new attribute that
contains concatenated values. SYDAG allows the user to individually select for each dataset whether to
apply the merge component. This includes the specification of the percentage of overlapping columns
to merge. Because we want to enable different errors within the linked values, we place the merge
component after noise insertion.
      </p>
      <p>[Figure 2: Example of a column merge. The columns ‘name’ and ‘surname’ (rows: Jane Smith, Tim Miller, Mike Brown) are merged into one column ‘name, surname’ containing the concatenated values ‘Jane, Smith’, ‘Tim, Miller’, and ‘Mike, Brown’; the ‘age’ column (22, 45, 14) remains unchanged.]</p>
      <p>The shuffle component is the last processing step before the output creation. For each created dataset,
the user can choose between no changes, shuffling rows, or shuffling columns. This means that relations
representing the same real-world entity no longer maintain the same order of entries, which can be a
significant challenge for matchers that consider the order of attributes or tuples [23]. Placing this step
at the end is necessary to allow users to select the block overlap as a split option. The shuffling later
disperses the overlapping rows/columns over the entire datasets. These rows originally belong to one
block and may share some similarities, but when SYDAG shuffles them, it makes it harder to rediscover
these blocks. The difference from using random overlap is that, in that case, SYDAG takes the overlapping
rows directly from different locations, meaning they might share fewer similarities.</p>
      <p>
        After processing the datasets, SYDAG creates the output. The output covers the created relations
grouped by dataset, the metadata information on keys and foreign keys, and the mapping information,
i.e., the gold standard information for a perfect matching of the resulting schemata. The mapping
information specifies which attributes in the new relations correspond to the original attributes. The
output format is inspired by the generators iBench [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and eTuner [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which also generate and output
precise mappings for the generated datasets. Developers can use these mappings to evaluate their
integration strategies. The output comprises two or four datasets, each with one or multiple relations
depending on the individual normalization steps.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. SYDAG Implementation</title>
      <p>To realize SYDAG as a practical application, we implemented the components described in Section 3 in
a typical backend-frontend architecture. We provide the core details of this implementation here and
further technical details online [24].</p>
      <sec id="sec-4-1">
        <title>4.1. Backend</title>
        <p>Our backend consists of multiple classes that SYDAG utilizes during the generation process. We
categorize the classes based on their functionality and explain the core components of SYDAG’s
generation algorithm in this section.</p>
        <p>The Splitting Components are an important part of SYDAG, because they divide the input relation into
new relations. The ‘Split’ class provides methods for the different split options. For the vertical split,
it provides a method that generates a uniformly random column overlap: First, the method adds the
key columns to both newly created relations. Then, it randomly selects non-key columns proportional
to the user-specified percentage of overlap and inserts them into both new relations. After that, it
randomly distributes the remaining columns to the new relations according to the configuration. For
the horizontal split, there are two different methods. One of the two horizontal methods creates a block
overlap; it, first, selects a random start index for the overlapping rows; then, it determines the end index
based on the user-selected percentage of overlap and inserts the rows between start and end as the first
rows in both new relations; after that, it distributes the remaining rows to the new relations according
to the configuration. The other horizontal split method creates an overlap of random rows; unlike block
overlap, it picks overlapping rows randomly from the entire dataset rather than selecting consecutive
rows. If the user chooses both split types, SYDAG first applies the horizontal split to create two new
relations; then, it splits both of the created relations vertically, resulting in four relations.</p>
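        <p>To make the block-overlap method concrete, the following Python sketch mirrors the described steps; the function name and the parameter for distributing the remaining rows are our assumptions, not SYDAG’s exact code:</p>
        <preformat>
import random

def horizontal_block_split(rows, overlap_pct, left_share=0.5):
    n = len(rows)
    k = round(n * overlap_pct)              # size of the overlapping block
    start = random.randrange(n - k + 1)     # random start index of the block
    block = rows[start:start + k]           # consecutive overlapping rows
    rest = rows[:start] + rows[start + k:]  # all non-overlapping rows
    cut = round(len(rest) * left_share)     # configured distribution
    return block + rest[:cut], block + rest[cut:]
        </preformat>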
        <p>The Structure Change Components include four classes that SYDAG uses to modify the structure of
the relations. The ‘Merge’ class is responsible for merging columns within a relation. It relies on
the user-defined percentage to determine how many of the overlapping columns should be merged.
The merge process is repeated until the desired number of merged columns is reached. In each step, a
method identifies the best column pair for merging by calculating a score for all possible pairs. Pairs
receive a higher score if they share the same enumeration type, if their column indices are close to
each other, and if they have not yet been merged with another column. The method selects the column
pair with the highest score and, then, merges the attribute names and data entries of the columns
into a new column; the new column gets inserted into the relation and the two original columns are
removed. The other classes in the Structure Change Components are responsible for normalization.
The ‘Normalization’ class provides methods that decompose a relation into BCNF. Its core method first
applies the Normalize algorithm [22] to find BCNF-compliant relations. It stores the column and key
indices of these relations in ‘IndexSummary’ objects. Then, the method adjusts the indices to the input
relation so that they reference the correct column positions. Afterwards, the method executes as many
of Normalize’s decomposition steps as the user specified in the configuration. The final step is the
creation of the selected relations. The method extracts the columns corresponding to the indices in the
‘IndexSummary’ object, removes the overlapping rows, and then outputs the generated relations.</p>
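        <p>A condensed Python sketch of the scoring-and-merging step described above (the weights and field names are illustrative assumptions; SYDAG’s actual scores may differ):</p>
        <preformat>
def pair_score(a, b, already_merged):
    score = 0.0
    if a["type"] == b["type"]:                         # same enumeration type
        score += 2.0
    score += 1.0 / (1 + abs(a["index"] - b["index"]))  # index proximity
    if a["name"] in already_merged or b["name"] in already_merged:
        score -= 2.0                                   # penalize re-merging
    return score

def merge_pair(a, b):
    return {
        "name": a["name"] + ", " + b["name"],
        "values": [x + ", " + y for x, y in zip(a["values"], b["values"])],
    }
        </preformat>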
        <p>
          The Noise Insertion Components are the most essential part of SYDAG and cover the following error
methods:
1. Removal of all vowels from a given string
2. Abbreviation of all words’ first letters in a string
3. Shortening of all words in a string to random lengths
4. Shuffling of all letters within a string
5. Shuffling of all words within a string
6. Generation of random strings with length in [1,10]
7. Adding of random character prefixes of length [1,4]
8. Replacing words in a string with synonyms
9. Replacing words in a string with their translations
10. Generation of null values
11. Insertion of phonetic errors in strings
12. Insertion of OCR errors in strings
13. Insertion of random typing errors in strings
14. Reformatting strings by swapping "-", "_", ".", and " "
15. Abbreviation of words in strings to random lengths
16. Generation of a random numeric value from a column’s normal distribution of values
17. Generation of a random numeric outlier for a column
18. Mapping of string values to numeric values
        </p>
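        <p>For illustration, minimal Python sketches of three of these methods (1, 4, and 7); the implementations are our own simplified versions, with the prefix length range taken from the list above:</p>
        <preformat>
import random
import string

def remove_vowels(text):                    # Method 1
    return "".join(c for c in text if c.lower() not in "aeiou")

def shuffle_letters(text):                  # Method 4
    letters = list(text)
    random.shuffle(letters)
    return "".join(letters)

def add_random_prefix(text):                # Method 7
    k = random.randint(1, 4)
    return "".join(random.choices(string.ascii_letters, k=k)) + text

print(remove_vowels("Customer"))            # 'Cstmr'
        </preformat>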
        <p>SYDAG can apply Methods 1 to 9 to the schema of the relations and Methods 5 to 18 to the instance
data. Methods 8 and 9 rely on external API calls: Synonyms for data entries are generated using the
Datamuse API [25], while translations are provided by the MyMemory API [26]. If the user selects the
methods that generate synonyms or translations, the generation process of SYDAG can take longer
than without this selection, because calling the APIs is more time-consuming than using a naive error
method.</p>
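        <p>A minimal sketch of such an API call for Method 8, using the public Datamuse endpoint (the request handling shown here is our assumption; SYDAG’s actual client code is not reproduced):</p>
        <preformat>
import json
import urllib.parse
import urllib.request

def synonym(word):
    # Datamuse "ml" queries return words with a similar meaning.
    query = urllib.parse.urlencode({"ml": word, "max": 1})
    url = "https://api.datamuse.com/words?" + query
    with urllib.request.urlopen(url, timeout=10) as response:
        results = json.loads(response.read().decode("utf-8"))
    return results[0]["word"] if results else word
        </preformat>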
        <p>The ‘SchemaNoise’ class is part of the Noise Insertion Components and extends the ‘Noise’ class.
We use it to add errors to the attribute names. It includes a method, which receives a relation and the
user’s configuration. If chosen, the method deletes the schema; otherwise, it determines the overlapping
attributes and includes or excludes the key columns from the noise based on the configuration. Then,
the method selects attributes uniformly at random from the overlapping attributes to receive noise. For
each of the selected attributes, it applies one of the user-selected error methods. It reuses a method
only after all other selected methods have been applied at least once. Exceptions to this are Methods
1 and 5, because they are not always applicable: they require the attribute name to contain vowels or
several words. If selected, SYDAG first attempts to use Methods 1 and 5. If this is not possible for the
current attribute, it uses another user-selected error method. If the user selected only Methods 1 and 5,
but neither is applicable, Methods 6 and 7 serve as fallback.</p>
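        <p>The selection policy can be summarized in a short Python sketch (a simplification under our assumptions; the method numbers refer to the list above):</p>
        <preformat>
import random

def pick_method(selected, used, attribute):
    # Try the restricted methods first when selected and applicable.
    if 1 in selected and 1 not in used and any(
            c.lower() in "aeiou" for c in attribute):
        return 1                             # vowel removal needs vowels
    if 5 in selected and 5 not in used and " " in attribute:
        return 5                             # word shuffle needs several words
    others = [m for m in selected if m not in (1, 5)]
    fresh = [m for m in others if m not in used]
    if fresh:
        return random.choice(fresh)          # reuse only after all applied once
    if others:
        return random.choice(others)
    return random.choice((6, 7))             # fallback methods
        </preformat>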
        <p>The ‘Data Noise’ class also extends the ‘Noise’ class. We use it to insert errors into instance data. It
provides two different methods, which are used depending on the chosen split type. SYDAG perturbs
the columns in case of a vertical split, the rows in case of a horizontal split, and both when a double
split is performed. The row perturbation consists of many sequential steps. First, the method calculates
the number of rows that will receive errors based on the user-specified percentage. If the value exceeds
0, the method selects that many row indices uniformly at random. Next, the method filters the column
indices to determine where it can insert errors. If the user chooses to preserve key constraints, the
method removes the key indices from the available error positions. In case of a double split, the method
also removes the overlapping column indices because these columns already received noise through
the column perturbation. Then, for each row selected for noise, the method determines the number of
affected entries based on the user-defined percentage. After that, it randomly selects the calculated
number of entries from the available error positions in the row. For each selected entry, it checks which
of the user-specified error methods are applicable, then selects one uniformly at random and executes
it. The new entry replaces the original entry in the relation. The column perturbation method uses
a similar approach. However, a difference lies in the selection of the entries that receive noise. This
method selects the columns that will receive noise while maintaining the same ratio of numeric to
alphanumeric columns as in the relation. Then it calls two individual methods: One of them inserts
errors into the selected numeric columns and the other into the alphanumeric columns. Both of these
methods first randomly select the entries in the column and, then, apply a random error method. Lastly,
the perturbed relation is output.</p>
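        <p>A compressed Python sketch of the row perturbation (the helper names are ours; error_methods stands for the user-selected, applicable error functions):</p>
        <preformat>
import random

def perturb_rows(rows, row_pct, entry_pct, protected_cols, error_methods):
    n_noisy = round(len(rows) * row_pct)
    columns = [c for c in range(len(rows[0])) if c not in protected_cols]
    if not columns:
        return rows
    for r in random.sample(range(len(rows)), n_noisy):
        n_entries = max(1, round(len(columns) * entry_pct))
        for c in random.sample(columns, n_entries):
            method = random.choice(error_methods)
            rows[r][c] = method(rows[r][c])   # replace the original entry
    return rows
        </preformat>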
        <p>The last important group are the File Processing Components, which include the ‘CSVTool’ class.
This class includes methods that write the generated relations to CSV-files. Depending on the user’s
configuration, SYDAG may shuffle columns or rows before writing the CSV-file. Therefore, three
different methods exist: one for writing the file without shuffling, one for shuffling columns before
writing, and one for shuffling rows before writing. In addition, there is a method that generates a
TXT-file containing the key relationships for a generated dataset. Finally, another method writes the
entire mapping between the generated datasets to a TXT-file.</p>
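        <p>A short Python sketch of the shuffle-then-write variants (the file layout and function names are our assumptions):</p>
        <preformat>
import csv
import random

def write_rows_shuffled(path, header, rows):
    shuffled = list(rows)
    random.shuffle(shuffled)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(shuffled)

def write_columns_shuffled(path, header, rows):
    order = list(range(len(header)))
    random.shuffle(order)                     # same permutation for all rows
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([header[i] for i in order])
        writer.writerows([[row[i] for i in order] for row in rows])
        </preformat>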
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Frontend</title>
        <p>To make SYDAG user-friendly, we provide a GUI through which users can specify input parameters and
generation configurations. The GUI also offers an option to upload a JSON configuration file containing
all of the input parameters; SYDAG provides a template with an example of a JSON configuration file to
create these configurations. Figure 3 shows a screenshot of the GUI for configuring data noise settings.
For a simple and efficient application startup, we use Docker to containerize SYDAG with its backend
and frontend components [27]. Our generator is publicly available on GitHub [24].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>The goal of our evaluation is to assess SYDAG’s ability to generate integration scenarios of different
complexity that actually challenge schema matchers. To achieve this, we alter SYDAG’s configuration
parameters to create different integration scenarios. After that, we evaluate the performance of three
schema matchers in these scenarios.</p>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
        <p>For the evaluation, we choose four datasets with diverse properties to test SYDAG across different
requirements. Table 1 summarizes the characteristics of the four datasets. The dataset ‘Bridges’ [28] is
available in the UCI Machine Learning Repository; ‘Diabetes’ [29], ‘Mental’ [30], and ‘Gym’ [31] can all
be found on Kaggle. We apply SYDAG to generate integration scenarios of three complexity levels for
each test dataset, which creates 12 scenarios in total. To create the higher complexity levels, we increase
the percentages of structural changes, noise, and shuffling, and decrease the overlap percentages of the
split; the different SYDAG configuration files and definitions can be found on GitHub [24].
        <p>To measure how different matchers perform on our created integration scenarios, we use a
schema-based Levenshtein Matcher, an instance-based Jaccard Matcher (with 2-gram tokenization), and a
Distinct-Count Matcher that calculates the similarity of two columns by counting their distinct entries
and dividing the smaller count by the larger count [32, 33, 34]. The schema matching tool Schematch [34]
provides their implementations. To evaluate the matching, we calculate the F-measure using the smallest
similarity value that describes an actual match as our threshold [34].</p>
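        <p>To clarify the metrics, a Python sketch of the Distinct-Count similarity and the threshold-based F-measure (the matcher implementations we actually used come from Schematch [34]; this only illustrates the formulas):</p>
        <preformat>
def distinct_count_similarity(col_a, col_b):
    a, b = len(set(col_a)), len(set(col_b))
    return min(a, b) / max(a, b)

def f_measure(scores, ground_truth):
    # Threshold: the smallest similarity value of an actual match.
    threshold = min(scores[pair] for pair in ground_truth)
    predicted = {p for p, s in scores.items() if s &gt;= threshold}
    tp = len(predicted.intersection(ground_truth))
    precision, recall = tp / len(predicted), tp / len(ground_truth)
    return 2 * precision * recall / (precision + recall)
        </preformat>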
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Analysis of Results</title>
        <p>We compare the overall performance of the three matchers by calculating the average F-measure across
all integration scenarios for each complexity level (see Table 2). More detailed analyses of the matchers’
performances on the individual datasets are available on GitHub [24].</p>
        <p>Figure 4 visualizes the results of the average performances of the three matchers. For all matchers,
we notice a significant drop in performance across the complexity levels. For both the Distinct-Count
Matcher and the Jaccard Matcher, we observe that the performance drops are higher from the first to
the second complexity level than from the second to the third. The performance of the Distinct-Count
Matcher decreases by 0.47 between Complexity Levels 1 and 2 and by 0.07 between Complexity Levels
2 and 3, which shows that even small percentages of inserted errors present a challenge, because they
distort the number of distinct elements in the columns. For the Jaccard Matcher, the performance drops
by 0.6 from Complexity Level 1 to 2 and by 0.09 from Complexity Level 2 to 3. This means that when
the goal is to identify all matches, the error insertion in the second complexity level already has a strong
impact on the performance. It also shows that error insertion presents a significant challenge, even
when only minor structural changes are made.</p>
        <p>In contrast, the Levenshtein Matcher shows a different behavior. We observe a decrease of 0.41
between the first and second complexity levels and a further drop of 0.45 between the second and third,
reflecting a consistent rise in difficulty across levels. This indicates that even the small percentages
of errors that we created in Complexity Level 2 make matching more difficult. However, since we
mostly use error methods that do not make attribute names unrecognizable, the matcher can still achieve
relatively high performance. The performance drop between Complexity Levels 2 and 3 is caused by
the high number of inserted errors and the use of complex error methods and structural changes, in
particular complex merged attribute names.</p>
        <p>For all matchers, we observe a decline in performance across the complexity levels. We can identify
some common configurations that increase the overall complexity. These include the number of
splits. The more datasets SYDAG splits an integration scenario into, the greater the challenges for the
integration tool. Another aspect is the structural changes. As SYDAG applies more diverse structural
changes or shuffle options, the integration of the datasets becomes increasingly heterogeneous and
challenging. Furthermore, data and schema noise play an important role. The more noisy relations we
create and the higher the chosen noise percentages, the more challenging it becomes for matchers to
identify overlapping attributes and recognize that they represent the same real-world entities. Overall,
we can state that the complexity of the integration scenarios clearly increases with increasing noise
percentages. With SYDAG, users can, hence, create individual challenges by fine-tuning the difficulty
of their scenarios to the weaknesses of the integration tools they are testing.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>We introduced the synthetic dataset generator SYDAG, which developers can use to create customized
data integration scenarios. Our performance evaluation of three schema matchers showed that SYDAG
is able to generate complex integration scenarios. SYDAG offers the following advantages over existing
dataset generators: It combines error insertion and structural changes, enabling it to generate integration
scenarios with high heterogeneity. Additionally, SYDAG allows the user to customize the generation
by specifying percentages of modifications and selecting combinations of error methods. SYDAG is
applicable to datasets with and without headers and contains a built-in key identification tool, meaning
the user does not have to specify the keys.</p>
      <p>Despite its advantages, SYDAG requires seed input data, because it does not generate data itself and
only restructures and perturbs the input. Additionally, SYDAG is currently limited to CSV input files,
but extensions to relational databases and semi-structured formats are planned.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used DeepL for grammar and spell checking. After
using this tool, the authors reviewed and edited the content as needed and take full responsibility for
the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Manzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miskolczi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stiele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vybornov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fieseler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pfalzner</surname>
          </string-name>
          ,
          <article-title>Learning from the present for the future: the Jülich LOFAR long-term archive</article-title>
          ,
          <source>in: Astronomy and Computing</source>
          , volume
          <volume>48</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          . doi:10.1016/j.ascom.2024.100835.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Panse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wingerath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wollmer</surname>
          </string-name>
          ,
          <article-title>Towards scalable generation of realistic test data for duplicate detection</article-title>
          ,
          <year>2023</year>
          . arXiv:2312.17324.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Marz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Warren</surname>
          </string-name>
          ,
          <article-title>Big Data : Principles and best practices of scalable real-time data systems</article-title>
          , Manning Publications Co.,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Speckhard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bechtel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Ghiringhelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rigamonti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Draxl</surname>
          </string-name>
          ,
          <source>How big is big data?</source>
          ,
          <year>2024</year>
          . arXiv:2405.11404.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bogatu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Paton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douthwaite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Freitas</surname>
          </string-name>
          ,
          <article-title>Voyager: Data discovery and integration for data science</article-title>
          ,
          <source>in: Proceedings of the International Conference on Extending Database Technology (EDBT)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>537</fpage>
          -
          <lpage>548</lpage>
          . doi:10.48786/edbt.2022.47.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Crescenzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Angelis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazzei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Alaska: A flexible benchmark for data integration tasks</article-title>
          ,
          <year>2021</year>
          . arXiv:2101.11259.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Panse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Klettke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schildgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wingerath</surname>
          </string-name>
          ,
          <article-title>Similarity-driven schema transformation for test data generation</article-title>
          ,
          <source>in: Proceedings of the International Conference on Extending Database Technology (EDBT)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>408</fpage>
          -
          <lpage>413</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Panse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Düjon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wingerath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wollmer</surname>
          </string-name>
          ,
          <article-title>Generating realistic test datasets for duplicate detection at scale using historical voter data</article-title>
          ,
          <source>in: Proceedings of the International Conference on Extending Database Technology (EDBT)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>570</fpage>
          -
          <lpage>581</lpage>
          . doi:10.5441/002/edbt.2021.67.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Koutras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Psarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Siachamis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fragkoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonifati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katsifodimos</surname>
          </string-name>
          ,
          <article-title>Valentine in action: Matching tabular data at scale</article-title>
          ,
          <source>in: Proceedings of the VLDB Endowment</source>
          , volume
          <volume>14</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>2871</fpage>
          -
          <lpage>2874</lpage>
          . doi:10.14778/3476311.3476366.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ritze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lehmberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <article-title>Matching HTML tables to DBpedia</article-title>
          ,
          <source>in: Proceedings of the International Conference on Web Intelligence, Mining and Semantics (WIMS)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:10.1145/2797115.2797118.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Minder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Brandenberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Salamanca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schweitzer</surname>
          </string-name>
          ,
          <article-title>Data2Neo - a tool for complex Neo4j data integration</article-title>
          ,
          <year>2024</year>
          . arXiv:2406.04995.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>A. M.</given-names> <surname>Cabrera</surname></string-name>
          ,
          <string-name><given-names>C. J.</given-names> <surname>Faber</surname></string-name>
          ,
          <string-name><given-names>K.</given-names> <surname>Cepeda</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Derber</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Epstein</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Zheng</surname></string-name>
          ,
          <string-name><given-names>R. K.</given-names> <surname>Cytron</surname></string-name>
          ,
          <string-name><given-names>R. D.</given-names> <surname>Chamberlain</surname></string-name>
          ,
          <article-title>DIBS: A data integration benchmark suite</article-title>
          ,
          <source>in: Proceeding of the ACM/SPEC International Conference on Performance Engineering Companion (ICPE)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>28</lpage>
          . doi:10.1145/3185768.3186307.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>F.</given-names>
            <surname>Duchateau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bellahsene</surname>
          </string-name>
          ,
          <string-name><given-names>E.</given-names> <surname>Hunt</surname></string-name>
          ,
          <article-title>XBenchMatch: a benchmark for XML schema matching tools</article-title>
          ,
          <source>in: Proceedings of the VLDB Endowment</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>1318</fpage>
          -
          <lpage>1321</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Topsakal</surname>
          </string-name>
          ,
          <article-title>THALIA: Test harness for the assessment of legacy information integration approaches</article-title>
          ,
          <source>in: Proceedings of the International Conference on Data Engineering (ICDE)</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>485</fpage>
          -
          <lpage>486</lpage>
          . doi:10.1109/ICDE.2005.140.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>K.</given-names>
            <surname>Hildebrandt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Panse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wilcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <article-title>Large-scale data pollution with Apache Spark</article-title>
          ,
          <source>in: IEEE Transactions on Big Data</source>
          , volume
          <volume>6</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>396</fpage>
          -
          <lpage>411</lpage>
          . doi:10.1109/TBDATA.2016.2637378.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E.</given-names>
            <surname>Ioannou</surname>
          </string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Velegrakis</surname></string-name>
          ,
          <article-title>EMBench++: Data for a thorough benchmarking of matching-related methods</article-title>
          ,
          <source>in: Semantic Web</source>
          , volume
          <volume>10</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>435</fpage>
          -
          <lpage>450</lpage>
          . doi:10.3233/SW-180331.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sayyadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          ,
          <article-title>eTuner: Tuning schema matching software using synthetic scenarios</article-title>
          ,
          <source>in: VLDB Journal</source>
          , volume
          <volume>16</volume>
          ,
          <year>2007</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>122</lpage>
          . doi:10.1007/s00778-006-0024-z.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] C. Koutras, G. Siachamis, A. Ionescu, K. Psarakis, J. Brons, M. Fragkoulis, C. Lofi, A. Bonifati, A. Katsifodimos, Valentine: Evaluating matching techniques for dataset discovery, in: Proceedings of the International Conference on Data Engineering (ICDE), 2021, pp. 468-479. doi:10.1109/ICDE51399.2021.00047.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] B. Alexe, W. C. Tan, Y. Velegrakis, STBenchmark: towards a benchmark for mapping systems, in: Proceedings of the VLDB Endowment, volume 1, 2008, pp. 230-244. doi:10.14778/1453856.1453886.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] P. C. Arocena, B. Glavic, R. Ciucanu, R. J. Miller, The iBench integration metadata generator, in: Proceedings of the VLDB Endowment, volume 9, 2015, pp. 108-119. doi:10.14778/2850583.2850586.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] T. Papenbrock, F. Naumann, A hybrid approach for efficient unique column combination discovery, in: Proceedings of the Conference Datenbanksysteme in Büro, Technik und Wissenschaft (BTW), 2017, pp. 195-204.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] T. Papenbrock, F. Naumann, Data-driven schema normalization, in: Proceedings of the International Conference on Extending Database Technology (EDBT), 2017, pp. 342-353.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] U. Leser, F. Naumann, Informationsintegration: Architekturen und Methoden zur Integration verteilter und heterogener Datenquellen, 1 ed., dpunkt.verlag, 2007.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] A. Marschner, SYDAG, 2025. URL: https://github.com/anne-marschner/SYDAG, last accessed 6-July-2025.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] Datamuse API, 2016. URL: https://www.datamuse.com/api/, last accessed 6-July-2025.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] MyMemory, 2015. URL: https://mymemory.translated.net/doc/spec.php, last accessed 6-July-2025.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] K. Matthias, S. P. Kane, Docker Praxiseinstieg, 2 ed., mitp Verlag, 2020.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] Y. Reich, S. Fenves, Pittsburgh Bridges, UCI Machine Learning Repository, 1990. URL: http://archive.ics.uci.edu/dataset/18/pittsburgh+bridges, last accessed 6-July-2025.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] H. Rahman, Diabetes Dataset, 2024. URL: https://www.kaggle.com/datasets/hasibur013/diabetes-dataset, last accessed 6-July-2025.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>[30] I. Ramzan, Remote Work &amp; Mental Health, 2024. URL: https://www.kaggle.com/datasets/iramshahzadi9/remote-work-and-mental-health, last accessed 6-July-2025.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>[31] V. Khorasani, Gym Members Exercise Dataset, 2024. URL: https://www.kaggle.com/datasets/valakhorasani/gym-members-exercise-dataset, last accessed 6-July-2025.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>[32] A. Doan, A. Halevy, Z. Ives, Principles of Data Integration, Morgan Kaufmann, 2012. doi:10.1016/C2011-0-06130-6.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>[33] N. Golov, A. Filatov, S. Bruskin, Efficient exact algorithm for count distinct problem, in: Computer Algebra in Scientific Computing, 2019, pp. 67-77. doi:10.1007/978-3-030-26831-2_5.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>[34] A. Vielhauer, Schematch, 2024. URL: https://github.com/avielhauer/schematch, last accessed 6-July-2025.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>