Discovering Functional Dependencies: Can We Use ChatGPT to Generate Algorithms?

Loredana Caruccio1,*,†, Stefano Cirillo1,2,†, Tullio Pizzuti1,† and Giuseppe Polese1,†

1 University of Salerno, Via Giovanni Paolo II, 132, 84084 Fisciano, Salerno, Italy
2 Hasso Plattner Institute, University of Potsdam, Germany


Abstract
The establishment of Large Language Models (LLMs) has allowed people to interact with tools capable of answering, in natural language, many kinds of questions on even very large sets of topics. Although natural language generation processes have to address several issues (e.g., providing content focused on the query, composing texts without ambiguities, and so forth), models and tools are becoming more and more capable of providing answers in a syntactically and semantically correct form, independently of both topic and language. This has enabled an algorithm to write algorithms together with their implementation, an even more complex task, since programming languages are more rigid and precise, and the generated code should also embrace the reasoning underlying the methodologies used to solve problems at different levels of complexity. At present, the most representative example of such a tool is ChatGPT. Based on the GPT-3.5 model and trained over more than 300 billion tokens, ChatGPT has gained high notoriety and is starting to impact society due to its wide usage in people's daily lives. This paper aims at evaluating to what extent ChatGPT and its underlying model are capable of generating algorithms for the discovery of Functional Dependencies (fds) from data. The latter is a very complex problem to which the scientific literature has devoted much effort. The inference of a correct, minimal, and complete set of fds holding on a given dataset defines the main constraints guaranteeing that literature solutions can be considered effective, raising the question of whether solutions generated by ChatGPT can also satisfy them. In particular, by following a prompt-based approach, we enabled ChatGPT to provide 7 different solutions to the fd discovery problem and measured their results in comparison with those provided by the HyFD discovery algorithm, one of the most efficient solutions in the literature.

Keywords
Data Profiling, Functional Dependency, Large Language Model, ChatGPT


                                1. Introduction
                                In recent years, the diffusion of large language models (LLMs) has led to a revolution in the
                                area of natural language processing, also affecting the activities of many other research areas.
These models have demonstrated a remarkable ability to understand, generate, and manipulate
natural language, mainly due to the large amounts of text they have been trained on. They have
also enabled the introduction of bots and tools that can be used by users, such as ChatGPT1, Google
                                ITADATA2023: The 2nd Italian Conference on Big Data and Data Science, September 11–13, 2023, Naples, Italy
* Corresponding author.
† These authors contributed equally.
lcaruccio@unisa.it (L. Caruccio); scirillo@unisa.it (S. Cirillo); t.pizzuti1@studenti.unisa.it (T. Pizzuti);
gpolese@unisa.it (G. Polese)
ORCID: 0000-0002-2418-1606 (L. Caruccio); 0000-0003-0201-2753 (S. Cirillo); 0000-0002-8496-2658 (G. Polese)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).




1 www.chat.openai.com
Bard2, and so forth, for querying LLMs with specific and/or generic questions. In fact, besides
their potential in machine translation, text generation, and language processing, LLMs are also
finding applications in several new domains, such as security [1], healthcare [2], and the Internet
of Things (IoT) [3].
2 www.bard.google.com
   The applicability of LLMs in the area of algorithm and code generation requires the generated
code to incorporate the underlying reasoning and methodologies employed to address problems
at various levels of complexity. Thus, to evaluate the effectiveness of LLMs in tackling
complex tasks, in this paper we focus on possible solutions in the data profiling research area.
The latter defines the set of activities and processes to examine, create, and extract useful
information from data. This information permits the identification of data quality issues, risks, and
overall trends by using profiling metadata, among which the simplest are the average, minimum,
maximum, frequency of values, and some more complex ones, such as functional dependencies
and inclusion dependencies. Identifying such metadata from data is an extremely complex
problem, especially when the data is very large and/or when it evolves over time [4]. In fact,
new profiling methodologies continue to be investigated to enable the efficient profiling of
ever-increasing amounts of data and the application of metadata in new scenarios.
   LLMs have the capabilities to offer new perspectives in the area of data profiling, leading
to the definition of advanced data analysis processes, paving the way towards the definition
of new data and metadata interpretation methodologies, and offering valuable support for the
management and analysis of heterogeneous and complex data. Furthermore, the use of LLMs
could lead to the definition of new metadata discovery algorithms based on new methodologies
generated by their knowledge. To this end, in this paper, we propose a comparative evaluation
of fd discovery algorithms generated by one of the most well-known LLMs, i.e., ChatGPT. To
perform this evaluation, we asked ChatGPT to generate the executable source code of several
algorithms by means of a new prompt engineering strategy. The results of the algorithms
generated by ChatGPT have been evaluated on real-world datasets and their results have been
compared with those achieved by HyFD [5], one of the best-performing fd discovery algorithms
in the literature.
   The rest of the article is organized as follows: Section 2 surveys relevant related works in the
literature; Section 3 provides preliminary notions about the definition of fds and the discovery
problem; Section 4 shows the new Prompt Template Engineering approach for generating fd
discovery algorithms together with algorithms generated by ChatGPT; Section 5 shows the
experimental evaluation and provides a discussion of results; Conclusions and future directions
are discussed in Section 6.


2. Related Work
The Data Profiling area aims to process and analyze data to evaluate their quality, structure,
and characteristics, through the extraction and analysis of different metadata from the data
[6]. These metadata enable the identification of patterns, anomalies, and inconsistencies, with the
aim of defining advanced techniques to support different operations, such as Data Integration
[7], Query Processing [8], and Data Cleaning [9]. In fact, when data is created or shared, it
often has missing values, inaccuracies, and/or errors of various kinds. Often, the datasets used
in analysis processes have different formats or inconsistencies that can negatively affect
the performance of AI models [10]. In this scenario, Data Preparation tasks make it possible
to collect, structure, and organize data, improving their quality and making them suitable for
analytical and predictive models [10]. Therefore, these profiling metadata can allow the extraction
of meaningful information and enhance preparation and analytical processes.
   Among the different types of profiling metadata, Functional Dependencies (fds) are widely
adopted for different data preparation tasks [11], thus requiring the design of efficient discovery
algorithms capable of identifying all the fds holding on a given dataset. In the literature, three
different types of algorithms have been proposed, i.e., column-based, row-based, and hybrid ones.
Column-based algorithms rely on a lattice structure to efficiently explore the search
space and to generate candidate fds according to the Apriori search strategy [6], which enables
pruning the search space and reducing the number of fds to be validated [12, 13]. Instead,
row-based algorithms generate candidate fds from two attribute subsets, namely the agree-set
and the difference-set, which are built by comparing the values of attributes between all possible
combinations of tuple pairs [14, 15]. Recently, new hybrid algorithms have been proposed,
which exploit the advantages of both row-based and column-based methodologies to improve
the fd discovery process [5].
   The approaches mentioned above represent pillars for the very complex fd discovery task.
They preserve the requirements of correctness, completeness, and minimality of results. Nevertheless,
data scientists could draw inspiration from LLM-generated code to consider alternative
fd discovery approaches. In fact, the spread of LLMs has led to taking advantage of their ability
to easily generate parts of code or entire algorithms to speed up development work. For
instance, a recent study proposed an extended version of GPT, trained for the automatic
generation of source code written in Python [16]. The results were evaluated on HumanEval, a new
dataset composed of 164 programming problems with associated unit tests [16]. Starting from this
dataset, the authors in [17] proposed a new framework to generate different inputs for unit tests,
in order to increase the number of tests to be performed on each generated algorithm.
Recently, a new LLM has been proposed, namely PolyCoder [18], which has been trained on a
large set of open-source code from GitHub with the aim of automatically generating source code
in 12 different programming languages. The performance of PolyCoder was compared with
some of the major LLMs trained for code generation [16]. However, it is necessary to evaluate
the correctness of these source codes and algorithms before adopting them in real scenarios.
   Through this study, we evaluated whether the knowledge underlying an LLM permits the
generation of algorithms solving a specific task, such as fd discovery. To the best of our knowledge,
this represents the first comparative study on fd discovery algorithms generated by LLMs.

3. Preliminaries
One of the main data profiling tasks is the fd discovery one. Formally, given a relation schema
𝑅, and an instance 𝑟 of it, an fd holding on 𝑟 is a statement 𝑋 → 𝑌 (𝑋 implies 𝑌 ), with 𝑋 and
𝑌 attribute sets of 𝑅, such that for every pair of tuples (𝑡1 , 𝑡2 ) in 𝑟, whenever 𝑡1 [𝑋] = 𝑡2 [𝑋],
then 𝑡1 [𝑌 ] = 𝑡2 [𝑌 ]. 𝑋 and 𝑌 are also named Left-Hand-Side (LHS) and Right-Hand-Side (RHS),
respectively, of the fd. An fd is said to be non-trivial if and only if 𝑋 ∩ 𝑌 = ∅. Moreover, an fd
is said to be minimal if and only if there is no attribute 𝐵 ∈ 𝑋 such that 𝑋∖𝐵 → 𝑌 holds on 𝑟.
    Discovering fds from data is an extremely complex problem, due to the exponential number
of column combinations to be analyzed [19]. Formally, given a relation instance 𝑟 with 𝑛 tuples
and 𝑀 attributes, and considering w.l.o.g. candidate fds with a single attribute on the RHS,
it is necessary to validate each possible candidate fd. Thus, the fd discovery problem has to
consider 2ᴹ possible attribute combinations, each employing 𝑛² iterations in a brute-force
approach for validating them, since all pairs of tuples must be compared to verify the property.
This leads to a complexity upper bound of 𝑂(𝑛² · (𝑀/2)² · 2ᴹ), with 𝑀/2 attribute combinations
representing the average number of attributes involved in an fd [20].
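   For illustration, the naive pairwise validation of a single candidate fd, which underlies the 𝑛² factor above, can be sketched as follows (our own minimal example, not code taken from the study):

```python
from itertools import combinations

def holds(rows, lhs, rhs):
    """Brute-force check of the fd lhs -> rhs: compare every pair of tuples;
    whenever two tuples agree on lhs, they must also agree on rhs."""
    for t1, t2 in combinations(rows, 2):
        if all(t1[a] == t2[a] for a in lhs) and any(t1[b] != t2[b] for b in rhs):
            return False
    return True

# On this instance, A -> B holds, while B -> A does not.
rows = [{"A": 1, "B": "x"}, {"A": 1, "B": "x"}, {"A": 2, "B": "x"}]
assert holds(rows, ["A"], ["B"]) and not holds(rows, ["B"], ["A"])
```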
    Data profiling algorithms deterministically perform such an analysis where the properties of
correctness, completeness, and minimality are guaranteed on the discovery results. However,
they typically exploit fds already validated and the fd inference rules to prune the search space in
order to limit the number of candidate fds to be validated. Nevertheless, in the analyzed scenario,
it is necessary to verify if the previously mentioned properties are guaranteed by fd discovery
algorithms generated through ChatGPT. In particular, although failures in satisfying
completeness and minimality can be tolerated, correctness should always be preserved,
to avoid including invalid fds in the result set. To this end, it is necessary to
introduce the concept of specialization and generalization of a given fd 𝜙 : 𝑋 → 𝑌 .
    More formally, given 𝜙, an fd 𝜙′ : 𝑋∖𝐵 → 𝑌 represents a generalization of 𝜙 for any 𝐵 ∈ 𝑋.
Instead, an fd 𝜙′′ : 𝑋 ∪ 𝐵 → 𝑌 represents a specialization of 𝜙 for any 𝐵 ∉ 𝑋 ∪ 𝑌 . Consequently,
to compare results between different algorithms we also consider these concepts to measure
the quality of discovery results w.r.t. a correct, complete, and minimal set of fds holding on a
given relation 𝑟.
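   These relations can be checked syntactically when fds are represented as pairs of attribute sets. The following minimal sketch (our own notation, generalizing the single-attribute definitions above to arbitrary added or removed subsets) illustrates both predicates:

```python
# An fd X -> Y is modeled as a pair (frozenset X, frozenset Y).
def is_generalization(phi_g, phi):
    """phi_g generalizes phi when it has the same RHS and a smaller LHS."""
    return phi_g[1] == phi[1] and phi_g[0] < phi[0]

def is_specialization(phi_s, phi):
    """phi_s specializes phi when it has the same RHS, a larger LHS, and
    the added attributes do not overlap the RHS."""
    return (phi_s[1] == phi[1] and phi_s[0] > phi[0]
            and not (phi_s[0] - phi[0]) & phi[1])

phi = (frozenset({"A", "B"}), frozenset({"C"}))
print(is_generalization((frozenset({"A"}), frozenset({"C"})), phi))            # True
print(is_specialization((frozenset({"A", "B", "D"}), frozenset({"C"})), phi))  # True
```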

4. Generation of fd Discovery Algorithms
In this section, we first introduce a new prompt engineering approach that we have defined
for generating fd discovery algorithms, and then we provide an overview of the algorithms
provided by ChatGPT for the purpose of this study.
4.1. Prompt engineering approach
We used ChatGPT, which is based on one of the best-known large language models, to generate
seven fd discovery algorithms that have been subsequently evaluated on real-world datasets. In
particular, the interaction with ChatGPT required the definition of multiple prompts to collect
discovery algorithms of different natures. To this end, we have designed an approach to generate
prompts for ChatGPT based on the automatic Prompt Template Engineering strategies [21]. As
shown in Figure 1, the approach starts by asking ChatGPT the following prompt: 𝑝1 = “What
type of functional dependency discovery algorithms do you know?”. After ChatGPT's response,
the approach asks for which of the algorithms in the previous answer ChatGPT can provide a
source code, exploiting the prompt 𝑝2 = “Which of these Algorithms can you provide me with a
source code?”. This strategy allowed us to focus our study only on the algorithms for which
ChatGPT has been able to provide at least one executable code. Starting from the list of
algorithms provided by 𝑝2 , we have defined an ad-hoc prompting function 𝑥′ = 𝑓algorithm-filling (𝑥),
where 𝑥 represents the input of the prompt, composed of the type of algorithm to be generated
and its programming language, and 𝑥′ is the prompt sentence to be submitted to the LLM for
generating the fd discovery algorithm, as shown in Figure 1.
Figure 1: Overview of the prompt engineering methodology.

   The template for this prompt has been defined as follows:
   “Write the source code of the functional dependency discovery algorithm [A] in [L]
language [H]”
where [A] is the slot containing the type of algorithm, [L] is the slot containing the programming
language of the algorithm, and [H] is the answer slot that will be filled after the generation of
the source code issued by the LLM. More formally, [A] takes values from the set 𝐴 = {𝑎1 , 𝑎2 , . . . , 𝑎𝑛 }
of the types of algorithms for which the LLM can generate source code, and [L] from the set
𝐿 = {𝑙1 , 𝑙2 , . . . , 𝑙𝑚 } of programming languages. As an example, the first prompt submitted
to ChatGPT was: “Write the source code of the functional dependency discovery algorithm Apriori
in Python language”.
   In our study, we only consider a single source code at a time for each algorithm (𝑎1 ), i.e., the
first one provided by ChatGPT, and a single programming language (𝑙1 ). Notice that LLMs
have the potential to generate several source codes for the same algorithm. Thus, our approach
has been designed to be easily generalizable to requesting multiple algorithms at once, written
in different programming languages.
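   For concreteness, the slot-filling step can be realized with a simple string template. The function and variable names below are our own, and the entries of 𝐴 are illustrative, not the full list returned by 𝑝2:

```python
ALGORITHMS = ["Apriori", "Subset Generation", "TANE"]  # set A, as obtained via p2
LANGUAGES = ["Python"]                                  # set L

def algorithm_filling(algorithm: str, language: str) -> str:
    """Fill the slots [A] and [L] of the template; the answer slot [H]
    is then filled by the source code generated by the LLM."""
    return (f"Write the source code of the functional dependency "
            f"discovery algorithm {algorithm} in {language} language")

print(algorithm_filling(ALGORITHMS[0], LANGUAGES[0]))
# Write the source code of the functional dependency discovery algorithm
# Apriori in Python language
```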
   In what follows, we discuss each algorithm generated by ChatGPT, by also providing an
overview of their methodologies.
4.2. Discovery Algorithms
We have generated seven different algorithms through ChatGPT, each consisting of
a single procedure, without any module for reading the datasets subject to discovery
and/or for standardizing the results. In what follows, we provide an overview of the discovery
strategies underlying the generated algorithms.
Apriori: This algorithm relies on the Apriori search strategy (i.e., a traditional approach in
column-based discovery algorithms) to identify the set of fds valid on a dataset. The
algorithm uses the concept of frequent itemsets to identify attributes that appear with significant
frequency, as generally done for the association rule mining task.
   The algorithm starts by generating the frequent itemsets of size 1, i.e., those containing
a single attribute. Then, it iteratively extends the frequent itemsets, computing those of size
𝑘 from those of size 𝑘 − 1. Starting from the frequent itemsets, the algorithm validates the
candidate fds using the confidence measure. Given an fd 𝜙 : 𝑋 → 𝑌 , the confidence establishes
the conditional probability of having a certain value combination on the attributes 𝑌 given a
set of determining attributes 𝑋. Moreover, the validation function takes as input a minimum
threshold that constrains the validation of a candidate fd, i.e., the latter is valid iff its confidence
exceeds the input threshold. For example, let us consider an fd 𝐴, 𝐵 → 𝐶. The Apriori
algorithm computes the frequency of occurrence of the itemset (𝐴, 𝐵, 𝐶) w.r.t. the itemset
(𝐴, 𝐵). The confidence of the fd is then calculated as 𝑐𝑜𝑛𝑓𝑖𝑑𝑒𝑛𝑐𝑒 = 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦_𝑜𝑓(𝐴, 𝐵, 𝐶) /
𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦_𝑜𝑓(𝐴, 𝐵). If this value exceeds the predefined threshold, then the candidate fd is valid.
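   This validation step can be sketched as follows (our own illustration with pandas; the per-combination ratio follows the formula above, while taking the minimum over all combinations is our aggregation choice):

```python
import pandas as pd

def confidence(df: pd.DataFrame, lhs: list, rhs: str) -> float:
    """Per-group confidence of the candidate fd lhs -> rhs, i.e.,
    frequency_of(lhs + rhs) / frequency_of(lhs) computed for each lhs
    value combination; the minimum over all combinations is returned."""
    shares = df.groupby(lhs)[rhs].agg(lambda s: s.value_counts().iloc[0] / len(s))
    return shares.min()

df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "x", "y"], "C": ["u", "u", "v"]})
print(confidence(df, ["A", "B"], "C"))  # 1.0: A,B -> C exceeds any threshold <= 1
```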
Subsetgen: This algorithm relies on the subset generation approach to extract the fds holding
on a given dataset. The algorithm generates all possible subsets of attributes and computes the
closure of each subset. The closure represents the full set of attributes determined by the
considered subset. For example, given a dataset with 4 attributes (𝐴, 𝐵, 𝐶, 𝐷), the
generated subsets are: (𝐴), (𝐵), . . . , (𝐴, 𝐵), . . . , (𝐴, 𝐵, 𝐷), . . . , (𝐴, 𝐵, 𝐶, 𝐷). If we consider
the subset (𝐴, 𝐵), its closure is the set of all attributes determined by (𝐴, 𝐵). After calculating
the closure of each subset, for each candidate fd 𝜙 : 𝑋 → 𝑌 , the algorithm computes the confidence
between the attributes on the LHS and RHS, and if the confidence is equal to 1, then the fd is valid.
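   Under this reading, the closure computation can be sketched as follows (our own illustration with pandas, using confidence 1 as the membership test):

```python
import pandas as pd
from itertools import chain, combinations

def closure(df: pd.DataFrame, subset: tuple) -> set:
    """Attributes functionally determined by `subset` on the instance df:
    B belongs to the closure iff every value combination of `subset`
    maps to exactly one value of B (i.e., confidence equal to 1)."""
    det = list(subset)
    return set(det) | {b for b in df.columns if b not in det
                       and df.groupby(det)[b].nunique().max() == 1}

def all_subsets(attrs):
    """Enumerate every non-empty attribute subset, as the algorithm does."""
    return chain.from_iterable(combinations(attrs, k)
                               for k in range(1, len(attrs) + 1))

df = pd.DataFrame({"A": [1, 1, 2], "B": ["x", "y", "x"], "C": [0, 0, 1]})
print(closure(df, ("A",)))  # {'A', 'C'}: A -> C holds, A -> B does not
```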
Anova: This algorithm relies on the analysis of variance (Anova) to identify the fds valid in a
dataset. The algorithm evaluates the variation of an attribute 𝑌 based on the other attributes
in 𝑋, which are considered determinants. In particular, the algorithm computes a one-way
ANOVA to determine the statistical differences between the means of multiple data groups,
computing their p-value. The latter is the parameter used to decide a hypothesis test, which
the algorithm uses to check whether all means of the projections of the attributes of the
fd on the dataset are equal. A p-value less than 0.05, i.e., the standard significance level,
indicates that the analyzed fd is valid on the dataset.
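   A hypothetical reconstruction of this validation step, using scipy's one-way ANOVA (the grouping scheme and the requirement of a numeric RHS are our assumptions):

```python
import pandas as pd
from scipy.stats import f_oneway

def anova_fd_valid(df: pd.DataFrame, lhs: str, rhs: str, alpha: float = 0.05) -> bool:
    """One-way ANOVA on the (numeric) rhs values grouped by the lhs values;
    following the generated algorithm's logic, a p-value below alpha is
    taken as evidence that lhs -> rhs is valid."""
    groups = [g[rhs].values for _, g in df.groupby(lhs)]
    if len(groups) < 2:  # ANOVA needs at least two groups
        return False
    _, p_value = f_oneway(*groups)
    return p_value < alpha

df = pd.DataFrame({"X": [0, 0, 1, 1], "Y": [1.0, 1.1, 5.0, 5.2]})
print(anova_fd_valid(df, "X", "Y"))  # True: the group means differ significantly
```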
Linear Regression: This algorithm relies on linear regression for discovering
fds holding in a dataset. The algorithm starts by defining the set of candidate fds to validate,
considering a single attribute on the RHS and all the other attributes of the dataset on the
LHS. In this way, the algorithm evaluates only the candidate fds with the highest number of
attributes on the LHS. Starting from this, for each attribute on the RHS, a linear regression
model is built that tries to evaluate the variation of the attribute based on the values of the
attributes on the LHS. In particular, the values of the attributes on the LHS of each candidate fd
are used as the training data of the linear regression model, while the values of the attribute
on the RHS are used as the target values. The algorithm trains a linear regression model for each
fd to be validated and extracts the regression coefficient representing the relative importance of
the determining attributes. If the coefficient exceeds the default value of 0.95, then the
candidate fd is considered valid.
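   A hypothetical reconstruction of this strategy with scikit-learn is sketched below; here the model's R² score stands in for the acceptance coefficient, and the integer encoding of non-numeric values is our assumption:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def regression_fd_valid(df: pd.DataFrame, rhs: str, threshold: float = 0.95) -> bool:
    """Fit a linear model predicting rhs from all remaining attributes
    (non-numeric values are encoded as integer codes) and accept the
    candidate fd when the fit quality exceeds the 0.95 threshold."""
    X = df.drop(columns=[rhs]).apply(lambda c: pd.factorize(c)[0])
    y = pd.factorize(df[rhs])[0]
    model = LinearRegression().fit(X, y)
    return model.score(X, y) > threshold  # R^2 as the acceptance coefficient

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [2, 4, 6, 8]})
print(regression_fd_valid(df, "B"))  # True: A determines B linearly
```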
Correlation & Regression: This algorithm relies on the Pearson correlation coefficient to
validate fds on a given dataset. The first step of the algorithm requires replacing all non-numeric
values with a numerical representation, since it works only with numerical values.
Then, in its main step, it trains a linear regression model on every possible pair of attributes, i.e.,
testing the validation only on the most general fds. Among the resulting outputs
provided by the linear regression, the algorithm considers the Pearson correlation coefficient to
validate the fds. Thus, an fd is valid iff there is a high degree of positive correlation between
the attribute pair.
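   A hypothetical reconstruction of this validation follows; the 0.9 threshold and the integer encoding of values are our assumptions, since the text does not fix them:

```python
import pandas as pd

def pearson_fd_valid(df: pd.DataFrame, x: str, y: str, threshold: float = 0.9) -> bool:
    """Map the values of both attributes to integer codes and accept
    x -> y when the Pearson correlation is strongly positive."""
    xs = pd.Series(pd.factorize(df[x])[0])
    ys = pd.Series(pd.factorize(df[y])[0])
    return xs.corr(ys, method="pearson") > threshold

df = pd.DataFrame({"X": ["a", "b", "c"], "Y": [10, 20, 30]})
print(pearson_fd_valid(df, "X", "Y"))  # True, yet correlation does not imply an fd
```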
Pair-wise Algorithm: This algorithm relies on a pair-wise comparison of the attributes in the
dataset to validate candidate fds. The algorithm has been designed to consider
only fds with a single attribute on the LHS and one on the RHS. For each dependency 𝑋 → 𝑌 ,
it checks whether every subset of tuples having the same value on attribute 𝑋 always has the
same value on attribute 𝑌 .
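   A minimal sketch of this unary check (our own illustration, using pandas):

```python
import pandas as pd

def pairwise_fd_valid(df: pd.DataFrame, x: str, y: str) -> bool:
    """Unary fd x -> y: every group of tuples sharing a value on x
    must agree on a single value of y."""
    return bool((df.groupby(x)[y].nunique() <= 1).all())

df = pd.DataFrame({"X": [1, 1, 2], "Y": ["a", "a", "b"]})
print(pairwise_fd_valid(df, "X", "Y"))  # True: X -> Y holds
```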
TANE-based Algorithm: The last algorithm is based on TANE, one of the pioneering
algorithms in the literature [13]. TANE relies on stripped partitions, which are partitions of
value combinations built based on the equality constraint, keeping only equivalence classes with
more than one element. This structure makes it easy to check whether a dependency holds via the
partition refinement property [13]. TANE explores the search space of all possible candidate fds
and, thanks to pruning methods, it is able to reduce the number of fds to validate. However,
the source code provided by ChatGPT lacks some of the core parts of TANE, such as the pruning
strategies, the lattice structure, and the validation based on the refinement property. In fact, the
algorithm erroneously reverses the RHS and LHS of an fd, issuing errors when generating the
possible combinations of attributes. Nevertheless, each fd is validated by means of a property
that checks whether the values of the tuples for all the attributes involved in a candidate fd form a
primary key.
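   For reference, a minimal sketch of the stripped-partition machinery that the generated code lacks, based on the original TANE description [13] rather than on ChatGPT's output:

```python
from collections import defaultdict

def stripped_partition(rows, attrs):
    """Group tuple indices by their value combination on attrs, keeping
    only equivalence classes with more than one element (stripped)."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attrs)].append(i)
    return [c for c in classes.values() if len(c) > 1]

def refines(rows, lhs, rhs):
    """Partition refinement check: lhs -> rhs holds iff every equivalence
    class of the lhs partition agrees on the rhs attribute (singleton
    classes satisfy this trivially, so stripped classes suffice)."""
    return all(len({rows[i][rhs] for i in cls}) == 1
               for cls in stripped_partition(rows, lhs))

rows = [{"A": 1, "B": "x"}, {"A": 1, "B": "x"}, {"A": 2, "B": "y"}]
print(refines(rows, ["A"], "B"))  # True: A -> B holds
```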

5. Experimental Evaluation
The experimental evaluation has been conducted by comparing the results achieved by the seven
algorithms generated by ChatGPT3 with those achieved by HyFD [5], one of the best-performing
algorithms for fd discovery. Specifically, this comparative evaluation allowed us to measure the
suitability of GPT-generated code for the complex problem of discovering fds from data.
5.1. Experimental Settings
The experiments were performed on different real-world datasets previously used for evaluating
fd discovery algorithms. Table 1 shows the characteristics of the datasets involved in the
evaluation. All the experiments have been executed on a computer with an Intel Xeon W at 2.3
GHz, 18-core, and 128GB of memory, running macOS Ventura 13.0.1 and Python 3.10 as the
execution environment.
   The discovery algorithms have been generated in Python, and we kept their structure
in order to preserve the strategies behind them and to evaluate the capability of ChatGPT of
generating fd discovery algorithms; we only introduced the procedures for reading the
datasets and standardizing the results. To evaluate the results achieved by the generated
algorithms, we compared them with those achieved by HyFD, measuring their performances in
terms of Precision, Recall, and F1-Score. In particular, we considered two distinct cases:
𝑖) the number of minimal fds correctly discovered by the generated algorithms
w.r.t. those discovered by HyFD (namely, Common Case), and 𝑖𝑖) the number of both minimal


3 The source code of the generated fd discovery algorithms is available in the repository https://github.com/DastLab/FDDiscoveryChatGPT
ID    Dataset                   Columns   Rows        #fds
D1    Chess                     7         28,056      1
D2    Abalone                   9         4,177       137
D3    Nursery                   9         12,960      1
D4    Electricity Normalizer    9         45,312      61
D5    Customer Shopping Data    10        99,457      21
D6    Fraudfull                 11        98,439      39
D7    Poker Hand                11        1,025,010   1
D8    Firewall                  12        65,532      88
D9    Adult                     15        32,562      78
D10   Letter                    17        20,000      61
D11   Sgemm Product Rounded     18        241,600     4

Table 1: Characteristics of the real-world datasets used for the experimental evaluation.

and valid fds also considering the case in which the generated algorithms validate specialized
fds w.r.t. those discovered by HyFD (namely, Specialization Case).
   In the Common Case, we considered the fds identified by ChatGPT algorithms that also occur
in HyFD results as True Positives (i.e., TP𝑐𝑜𝑚 ); the fds validated by ChatGPT algorithms that
do not occur in HyFD results as False Positives (i.e., FP𝑐𝑜𝑚 ), and the fds validated by HyFD that
have not been discovered by ChatGPT algorithms as False Negatives (i.e., FN𝑐𝑜𝑚 ).
   In the Specialization Case, we considered both the common and specialized fds extracted
by the generated algorithms w.r.t. those occurring in HyFD results as True Positives (i.e., TP𝑠𝑝𝑒𝑐 );
the fds discovered by HyFD for which there are no equal or specialized fds within the results
of the generated algorithms as False Negatives (i.e., FN𝑠𝑝𝑒𝑐 ); and the fds discovered by the
generated algorithms for which neither equal nor generalized fds within the results of HyFD
can be found as False Positives (i.e., FP𝑠𝑝𝑒𝑐 ). It is important to notice that in both cases we do
not consider True Negatives (i.e., TN), since they are the fds that are invalid and whose details
are not provided by HyFD.
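   As an illustration of how the two matching regimes translate into counts, the following sketch (our own, with fds represented as pairs of frozensets) computes the three metrics:

```python
def metrics(discovered, oracle, allow_specialization=False):
    """Precision/Recall/F1 of discovered fds (pairs (lhs, rhs) of frozensets)
    against the HyFD oracle. In the Specialization Case, a discovered fd
    also counts as a hit when its LHS is a superset of an oracle fd's LHS
    for the same RHS."""
    def matches(d, o):
        return d[1] == o[1] and (d[0] == o[0]
                                 or (allow_specialization and d[0] >= o[0]))
    tp = sum(1 for d in discovered if any(matches(d, o) for o in oracle))
    fp = len(discovered) - tp
    fn = sum(1 for o in oracle if not any(matches(d, o) for d in discovered))
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

oracle = [(frozenset("A"), frozenset("C"))]
found = [(frozenset("AB"), frozenset("C"))]  # a specialization of A -> C
print(metrics(found, oracle, allow_specialization=True))  # (1.0, 1.0, 1.0)
```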
5.2. Experimental Results
Table 2 shows the results achieved by all the generated ChatGPT algorithms. In
particular, the table splits the results according to the two cases specified above, i.e., the
Common Case, shortened as com, and the Specialization Case, shortened as spec.
   In addition, for each algorithm, two types of charts were constructed (see Figure 2). Given
an algorithm, the chart on the left shows the resulting minimal and valid fds compared to the
solutions identified by HyFD. On the other hand, the chart on the right side shows the number
of incorrect fds and generalizations identified by the generated algorithms with respect to the
number of discovered fds.
   Results show that, on most datasets, the algorithms fail to identify minimal or valid
dependencies. For example, the Apriori and Subset Generation algorithms identified only a
subset of the minimal or valid dependencies. However, many incorrect dependencies were also
generated, as can be seen in Figure 2. In fact, Table 2 also reports very low values for all metrics.
   Specifically, the Correlation & Regression algorithm failed to identify any minimal or valid
dependency. On 4 out of 11 datasets, it produced no result, while on the remaining ones it
identified only incorrect dependencies or generalizations (see Figure 2). According to these
results, its metrics are all equal to 0. This behaviour can be due to the fact that the Pearson
correlation coefficient only takes into account the correlation degree, e.g., positive or negative;
however, a strong positive correlation between two attributes does not imply the existence
of an fd. Similarly, the Linear Regression algorithm does not detect any fd that is also minimal.
However, it achieved some good results on 5 datasets when considering valid dependencies (i.e.,
also in specialized form). It also generated only a few invalid fds and no generalized ones.
            Anova       Apriori     Correlation  Linear Reg.  Subset Gen.  Pair-wise    Tane
            com   spec  com   spec  com   spec   com   spec   com   spec   com   spec   com   spec
Precision
D1          0     0     0     0     0     0      0     0      0     0      0     0      0.14  0.14
D2          0     0     0.06  0.06  0     0      0     1.00   0     0      0     0      0     1.00
D3          0     0     0     0     0     0      0     0      0     0      0     0      0.11  0.11
D4          0     0     0.04  0.04  0     0      0     1.00   0     0      0     0      0     0.97
D5          0.33  0.33  0     0     0     0      0     0      0     0      1.00  1.00   0     1.00
D6          0     0     0     0     0     0      0     0.96   0     0      0.47  0.47   0     0.97
D7          0     0     0     0     0     0      0     0      0     0      0     0      0.09  0.09
D8          0     0     0     0     0     0      0     0.98   0.01  0.01   0     0      0     0.95
D9          0.01  0.01  0     0     0     0      0     0      0     0      1.00  1.00   0     0.90
D10         0     0     0     0     0     0      0     0      0     0      0     0      0     0.79
D11         0     0     0     0     0     0      0     1.00   0     0      0     0      0     0.22
Recall
D1          0     0     0     0     0     0      0     0      0     0      0     0      1.00  1.00
D2          0     0     0.68  0.68  0     0      0     0.82   0     0      0     0      0     1.00
D3          0     0     0     0     0     0      0     0      0     0      0     0      1.00  1.00
D4          0     0     0.74  0.74  0     0      0     0.44   0     0      0     0      0     1.00
D5          0.09  0.09  0     0     0     0      0     0      0     0      0.95  0.95   0     1.00
D6          0     0     0     0     0     0      0     0.63   0     0      0.24  0.24   0     1.00
D7          0     0     0     0     0     0      0     0      0     0      0     0      1.00  1.00
D8          0     0     0     0     0     0      0     0.96   0.09  0.09   0     0      0     1.00
D9          0.02  0.02  0     0     0     0      0     0      0     0      0.02  0.02   0     1.00
D10         0     0     0     0     0     0      0     0      0     0      0     0      0     1.00
F1-Score
D1          0     0     0     0     0     0      0     0      0     0      0     0      0.25  0.25
D2          0     0     0.10  0.10  0     0      0     0.90   0     0      0     0      0     1.00
D3          0     0     0     0     0     0      0     0      0     0      0     0      0.20  0.20
D4          0     0     0.07  0.07  0     0      0     0.61   0     0      0     0      0     0.98
D5          0.15  0.15  0     0     0     0      0     0      0     0      0.97  0.97   0     1.00
D6          0     0     0     0     0     0      0     0.76   0     0      0.31  0.31   0     0.99
D7          0     0     0     0     0     0      0     0      0     0      0     0      0.17  0.17
D8          0     0     0     0     0     0      0     0.97   0.01  0.01   0     0      0     0.97
D9          0.01  0.01  0     0     0     0      0     0      0     0      0.05  0.05   0     0.94
D10         0     0     0     0     0     0      0     0      0     0      0     0      0     0.88
D11         0     0     0     0     0     0      0     1.00   0     0      0     0      0     0.36

Table 2: Results metrics related to the common and specialization cases of the ChatGPT algorithms.
   As for the Anova algorithm, it generated many invalid dependencies (also in generalized
form) on all the considered datasets. This leads to very low performances on all considered
metrics, also because the number of correct fds is very low. Like other generated algorithms,
the Pairwise algorithm discovered fds containing a single attribute on the left-hand side;
however, its validation strategy turns out to be better than that of the other algorithms
working with the most generalized candidate fds, and only on one of the datasets does it
discover incorrect fds. In addition, by looking at Table 2, for the datasets on which the algorithm
is able to find holding fds, the metric results are satisfactory.
   Finally, the Tane algorithm generated many specializations but, at the same time, many
incorrect dependencies. Although its results on minimal dependencies are low, when considering
valid dependencies there is a significant increase, even compared to all other algorithms.
This implementation always generated specialized dependencies of maximum size, i.e., with
all but one of the attributes on the LHS and a single attribute on the RHS, yielding the
most specialized candidate fds possible. In fact, Figure 2 shows that it produces many
specializations and no generalizations.
Figure 2: Characterization of discovered fds by generated algorithms w.r.t. oracle results. (One panel per algorithm: Anova, Apriori, Corr. & Regr., Linear reg., Subset gen., Pair-wise, and Tane; for each dataset D1–D11, the left chart reports the fds in common with HyFD and the specializations, while the right chart reports the errors and generalizations.)

6. Conclusion and Future Directions
In this paper, we considered the possibility of entrusting LLMs with the automatic generation
of algorithms in the area of Data Profiling. To the best of our knowledge, this represents the
first proposal trying to exploit the potential of LLMs for the definition of possible discovery
algorithms in the data profiling area. Specifically, we obtained seven different fd discovery
algorithms by means of a prompt learning strategy with ChatGPT. Then, we evaluated the
effectiveness of such algorithms by comparing their discovery results with respect to the ones
of the HyFD algorithm. Evaluation results demonstrate that most of the statistical or ML-based
ChatGPT-generated algorithms, i.e., Correlation & Regression and Linear Regression, limit the
exploration of the search space to the most generalized candidate fds, leading to very few
valid discovered fds. The other algorithm in this category, i.e., Anova, can consider more,
randomly generated, candidate fds, but it is not able to improve performances due to the
high number of generalizations and/or errors in its results. On the contrary, the other two
generated algorithms, i.e., Apriori and Subset Generation, which perform an exhaustive
exploration of the search space, in general achieve a much higher number of fds. However,
the validation method of both turned out to be flawed, leading to many errors in the
discovery results. The generated Pairwise algorithm includes a good validation method, but,
also in this case, evaluating only the most generalized candidate fds results in few discovered
fds. Finally, the Tane algorithm represents the most complete generated discovery algorithm,
since it is related to a literature approach. Nevertheless, the generated code presents many
errors in the algorithm logic, yielding an algorithm that considers only the most specialized
fds. In fact, its results appear good, but also in this case the correctness of discovered fds is
not preserved. The latter consideration applies to all generated algorithms, allowing us to state
that the general-purpose LLM used has not been able to generate solutions guaranteeing
correctness in discovery results, a fundamental property to preserve to make discovery results
useful in application scenarios.
   In the future, it would be useful to investigate new prompting approaches enabling LLMs
to refine the generated algorithms, involving domain experts in revising the methodologies of
the generated algorithms, or using existing discovery algorithms to guide LLMs in the direct
validation of fds. The latter could involve the usage of multimodal prompts to process entire
datasets. In fact, current open LLMs are limited in the maximum number of tokens they can
process and only allow handling datasets through batching strategies. Finally, we would like
to extend our evaluation by also considering LLMs specialized in algorithm generation.
Acknowledgments
This work was partially supported by project SERICS (PE00000014) under the NRRP MUR
program funded by the EU - NGEU.
References
 [1] G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, B. Dolan-Gavitt, Lost at C: A user study on
     the security implications of large language model code assistants, arXiv preprint (2023).
 [2] A. Arora, A. Arora, The promise of large language models in health care, The Lancet 401
     (2023) 641.
 [3] O. Zheng, M. Abdel-Aty, D. Wang, Z. Wang, S. Ding, ChatGPT is on the horizon: Could a
     large language model be all we need for intelligent transportation?, arXiv preprint (2023).
 [4] B. Breve, L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, IndiBits: Incremental discovery of
     relaxed functional dependencies using bitwise similarity, in: Proceedings of the 39th IEEE
     International Conference on Data Engineering, ICDE, 2023.
 [5] T. Papenbrock, F. Naumann, A hybrid approach to functional dependency discovery, in:
     Proceedings of the 2016 International Conference on Management of Data, SIGMOD, 2016,
     pp. 821–833.
 [6] Z. Abedjan, L. Golab, F. Naumann, Profiling relational data: a survey, VLDB J. 24 (2015)
     557–581.
 [7] M. Lenzerini, Data integration: A theoretical perspective, in: Proceedings of the 21st ACM
     SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS, 2002, p.
     233–246.
 [8] L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, R. Stanzione, REQUIRED: A tool to relax
     queries through relaxed functional dependencies, in: Proceedings of the 26th International
     Conference on Extending Database Technology, EDBT, 2023, pp. 823–826.
 [9] B. Breve, L. Caruccio, V. Deufemia, G. Polese, RENUVER: A missing value imputation algo-
     rithm based on relaxed functional dependencies, in: Proceedings of the 25th International
     Conference on Extending Database Technology, EDBT, 2022, pp. 52–64.
[10] A. Jain, H. Patel, L. Nagalapatti, N. Gupta, S. Mehta, S. Guttula, S. Mujumdar, S. Afzal,
     R. Sharma Mittal, V. Munigala, Overview and importance of data quality for machine
     learning tasks, in: Proceedings of the 26th ACM SIGKDD International Conference on
     Knowledge Discovery & Data Mining, KDD, 2020, pp. 3561–3562.
[11] A. A. A. Fernandes, M. Koehler, N. Konstantinou, P. Pankin, N. W. Paton, R. Sakellariou,
     Data preparation: A technological perspective and review, SN Comput. Sci. 4 (2023) 425.
[12] H. Yao, H. J. Hamilton, C. J. Butz, FD_Mine: Discovering functional dependencies in a
     database using equivalences, in: Proceedings of the 2002 IEEE International Conference
     on Data Mining, ICDM, 2002, pp. 729–732.
[13] Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen, TANE: An efficient algorithm for
     discovering functional and approximate dependencies, Comput. J. 42 (1999) 100–111.
[14] S. Lopes, J. Petit, L. Lakhal, Efficient discovery of functional dependencies and armstrong
     relations, in: Proceedings of the 7th International Conference on Extending Database
     Technology, EDBT, volume 1777, 2000, pp. 350–364.
[15] P. A. Flach, I. Savnik, Database dependency discovery: A machine learning approach, AI
     Commun. 12 (1999) 139–160.
[16] M. Chen, J. Tworek, H. Jun, Q. Yuan, et al., Evaluating large language models trained on
     code, arXiv preprint (2021).
[17] J. Liu, C. S. Xia, Y. Wang, L. Zhang, Is your code generated by ChatGPT really correct?
     Rigorous evaluation of large language models for code generation, arXiv preprint (2023).
[18] F. F. Xu, U. Alon, G. Neubig, V. J. Hellendoorn, A systematic evaluation of large language
     models of code, in: Proceedings of the 6th ACM SIGPLAN International Symposium on
     Machine Programming, MAPS, 2022, p. 1–10.
[19] T. Bläsius, T. Friedrich, M. Schirneck, The complexity of dependency detection and
     discovery in relational databases, Theor. Comput. Sci. 900 (2022) 79–96.
[20] T. Papenbrock, J. Ehrlich, J. Marten, T. Neubert, J.-P. Rudolph, M. Schönberg, J. Zwiener,
     F. Naumann, Functional dependency discovery: An experimental evaluation of seven
     algorithms, Proc. VLDB Endow. 8 (2015) 1082–1093.
[21] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A
     systematic survey of prompting methods in natural language processing, ACM Comput.
     Surv. 55 (2023) 1–35.