<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>NeSy is alive and well: A LLM-driven symbolic approach for better code comment data generation and classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hanna Abi Akl</string-name>
          <email>hanna.abi-akl@dsti.institute</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data ScienceTech Institute (DSTI)</institution>
          ,
          <addr-line>4 Rue de la Collégiale, 75005, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hersonissos</institution>
          ,
          <addr-line>Crete</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Université Côte d'Azur</institution>
          ,
          <addr-line>Inria, CNRS, I3S</addr-line>
        </aff>
      </contrib-group>
      <kwd-group>
        <kwd>Neuro-symbolic AI</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Synthetic Data Generation</kwd>
      </kwd-group>
      <issue>1</issue>
      <abstract>
        <p>We present a neuro-symbolic (NeSy) workflow combining a symbolic-based learning technique with a large language model (LLM) agent to generate synthetic data for code comment classification in the C programming language. We also show how generating controlled synthetic data using this workflow fixes some of the notable weaknesses of LLM-based generation and increases the performance of classical machine learning models on the code comment classification task. Our best model, a Neural Network, achieves a Macro-F1 score of 91.412% with an increase of 1.033% after data augmentation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The rapid advancement and adoption of large language model (LLM) technologies has created an ever-growing demand for training
data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A consequence of this demand is data scarcity, a pitfall for all LLM agents today. Data
scarcity remains an open problem and is becoming a pressing issue in the face of the advancement
and improvement of LLM technologies, since it directly affects their greatest source of power:
data. Research is ongoing to actively tackle the problem of data scarcity [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ] but,
to our knowledge, no wide-scale solution exists at the time of writing.
      </p>
      <p>
        The Information Retrieval in Software Engineering (IRSE) track at the Forum for Information
Retrieval Evaluation (FIRE) 2023 shared task is one challenge that addresses the problem of
data scarcity. It sets out to measure the effects of leveraging LLMs to generate new data and
enrich a code comment dataset in the C programming language, starting from existing data
scraped from real code repositories [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The shared task also challenges participants to test
the quality of their generated data by evaluating its impact on the performance of machine
learning models in classifying whether a comment is useful or not useful for the surrounding
C code block [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In our previous work, we proposed a starting solution for the data scarcity
problem by showing that prompting LLMs by examples and combining the generated data
with existing synthetic data generation techniques improves model performance on the code
comment classification task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The work presented here carries over from the aforementioned
framework to introduce a more complete solution and, as such, will reference it heavily.
      </p>
      <p>In this work, we introduce a NeSy workflow leveraging both an LLM agent and a
symbolic-based learning method to enrich the code comment dataset with synthetic data and
evaluate the quality of this generation by studying the impact of the data augmentation process
on the performance of machine learning models on the code comment classification task. The
rest of the work is organized as follows. In section 2, we discuss some of the related work. In
section 3, we present our methodology. Section 4 describes our experimental framework. In
section 5, we report our results and discuss our findings. Finally, we conclude in section 6.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>This section discusses existing techniques that couple symbolic forms of learning and neural
models with a particular focus on LLMs as well as some proposed strategies in the literature for
synthetic data generation.</p>
      <sec id="sec-3-1">
        <title>2.1. Symbolic techniques and large language models</title>
        <p>Research that aligns with the promise made by NeSy models in d’Avila Garcez and Lamb,
i.e., combining the advantages of both symbolic and neural methods to create better learning
systems, places the integration of semantic techniques with state-of-the-art LLMs at its center
in an attempt to improve learning. In their work, Núñez-Molina et al. show how integrating
a Markov decision process with deep reinforcement learning policies yields generations of
planning problems that are both valid and diverse across different domains. In similar fashion,
Karth et al. apply symbolic constraints to deep learning models in the world of games to generate
new valid game tiles using a minimal number of raw pixels. Their neuro-symbolic technique
yields generations comparable to real-world levels found in World of Warcraft
(https://worldofwarcraft.blizzard.com/en-us/) and Super Mario (https://mario.nintendo.com/).</p>
        <sec id="sec-3-1-1">
          <title>-</title>
          <p>Shared task websites: https://sites.google.com/view/irse2023/home and http://fire.irsi.res.in/fire/static/resources.</p>
          <p>
            The idea of symbolically addressing learning needs in LLM agents was further refined and
centered around task decomposition. In their work, Prasad et al. show that decomposing
planning tasks into sub-tasks helps LLM agents respond better and successfully carry out
complex tasks. They also use their method to create a new decomposition dataset that helps LLMs
learn complex tasks incrementally through smaller sub-tasks [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Other existing works like
Hou et al. explored the effects of introducing sets of clarifications to LLMs on their performance.
Their findings show that their method is more effective in fine-tuning models on learning tasks
than parameter-tuning them. Tarasov and Shridhar extended the use of decomposition to deal
with the problem of scale, breaking down a large task into smaller tasks and feeding them to
small models. They showed how tuning each model to handle a specific sub-task and collecting
their outputs improves the performance of a larger LLM taking them as input [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
          <p>
            Another important symbolic method that addresses LLM learning and reasoning is semantic
grounding. The work of Lyre investigates different pillars of semantic grounding in LLMs and
shows that these models have basic notions of these concepts. Turney took the investigation
further by leveraging LLMs to generate synonyms of concepts using unigrams and bigrams and
comparing their outputs to valid WordNet words. Other research methods proposed similar
semantic decomposition approaches by integrating them into deep learning models coupled with
different language structures like graph decomposition [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ], natural language decomposition
into intents [17], prompt decomposition [18], question-answering reformulation into a mixture
of abstractive and extractive prompts [19, 20] and SQL-based statement decomposition [21].
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Synthetic data generation methods</title>
        <p>The work of Lu et al. surveyed machine learning and deep learning models for synthetic
data generation on a variety of tasks, e.g., computer vision and natural language processing,
using different data sources, e.g., image and text, and in different domains, e.g., healthcare.
Their findings showed that architectures based on neural networks and large language model
technology are the most popular models for data generation [22]. They also studied different
data generation algorithms, like artificial data labeling, and observed varying model performance
depending on the task and the domain [22]. In their work, Bauer et al. surveyed 417 synthetically
generated datasets and showed Generative Adversarial Nets (GANs) to be the most prevalent
synthetic data generation models and computer vision to be the most popular task domain of
application. They also highlighted the importance of having standardized datasets and metrics
for evaluating the quality of synthetically generated data [23]. Finally, Li et al. studied the
limitations of LLM-based synthetic data generation and highlighted the dangers of uncontrolled
data generation which negatively impacts model performance, most notably on classification
tasks.</p>
        <sec id="sec-3-2-1">
          <title>3https://worldofwarcraft.blizzard.com/en-us/ 4https://mario.nintendo.com/</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>This section describes our NeSy methodology combining an LLM agent and a symbolic framework
to generate synthetic labeled code comment data as shown in Figure 1. We chose ChatGPT
3.5 to implement our methods and experiments since it is freely accessible and usable without
prior configuration. We introduce a set of rules based on semantic decomposition to prompt
ChatGPT and create a neuro-symbolic workflow that teaches the LLM the proper syntax of the
C programming language for controlling the generation of synthetic labeled code comment
samples. The workflow is represented in Figure 2.</p>
      <sec id="sec-4-1">
        <title>3.1. Semantic rules</title>
        <p>We turn to semantic decomposition, an algorithm that breaks down the meanings of phrases or
concepts into less complex concepts [25], to create a ruleset that helps ChatGPT construct a
valid code comment dataset. The advantage of this symbolic method is twofold: to control the
generation of valid data and to ensure sufficient diversity to enrich an existing dataset.</p>
        <p>The rules themselves have been designed as renditions of the syntax of the C programming
language [26] and delimit the vocabulary as well as the constructs of the language. They start
at the atomic level by defining what a valid token in the language is and move to more complex
concepts like determining the construction of a valid line of code in C. Each rule is written as a
statement in natural language and is kept as simple and short as possible. Figure 3 shows the 12
rules given as a prompt for ChatGPT to produce a valid line of C code.</p>
        <p>In order to produce a complete data sample, generating a valid line of code is not enough. Our
dataset consists of code, comment and label data. For ChatGPT to produce comments, we add 3
rules to define what a comment in C is as well as its purpose. The definitions are restricted to
English generations of comments but can be extended to accommodate any language. The rules
also contain syntactic details such as the allowed tokens at the beginning of a comment in C.</p>
        <p>Finally, to remain faithful to the input shape of our data, we can ensure any data sample
produced by the LLM is labeled by introducing 2 more rules to explain the allowed labels, i.e.,
Useful and Not Useful, as well as how to classify a code comment pair. These rules help reduce
incoherent data generation and ensure the LLM labeling choice is explainable.</p>
        <p>The full ruleset is presented in Table 1.</p>
        <p>Figure 4 shows an example of valid synthetic data generated by ChatGPT using our full
ruleset.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Algorithm generation</title>
        <p>To circumvent the ambiguities that come with expressing statements in natural language, we
ask ChatGPT to formulate an algorithm out of the provided rules by prompting the LLM to
treat this exercise as a translation task from a natural language to an algorithmic language. This
plays to the strengths of LLMs, given they are pre-trained and capable of performing well on
this kind of task. The purpose of this step is to make the rules as explicit and clear as possible
to ensure they are explainable and reproducible. This also counteracts the black-box behavior
LLMs generally exhibit in interpreting prompt instructions. Finally, this phase also serves as a
self-check and ensures that any logical gaps missed at the time of designing the
rules can be addressed.</p>
        <p>We ask ChatGPT to generate the algorithm in the form of a Python script because this will
ultimately be the tool used to control the synthetic data generation. This step is detailed in the
next subsection. Algorithm 1 showcases the algorithm constructed by the LLM from the initial
ruleset to generate a labeled code comment dataset.</p>
        <p>The smallest individual unit of a program is called a token.</p>
        <p>Tokens are either keywords, identifiers or variables.</p>
        <p>A keyword must belong to the list: auto, double, int, struct,
break, else, long, switch, case, enum, register, typedef, char,
extern, return, union, const, float, short, unsigned, continue,
for, signed, void, default, goto, sizeof, volatile, do, if, static,
while.</p>
        <p>An identifier can only have alphanumeric characters (a-z, A-Z,
0-9) and underscore (_).</p>
        <p>The first character of an identifier can only be an
alphabet letter (a-z, A-Z) or underscore (_).</p>
        <p>Identifiers are case-sensitive in the C language. For example,
name and Name will be treated as two different identifiers.</p>
        <p>Keywords are not allowed to be used as Identifiers.</p>
        <p>No special characters, such as a semicolon, period, whitespaces,
slash, or comma are permitted to be used in or as an Identifier.</p>
        <p>Example of valid identifiers: total, avg1, difference_1. Example
of invalid identifiers: $myvar, x!y.</p>
        <p>A variable has a data type (which can be one of the following:
char, int, float, double, void), a name and a value.</p>
        <p>A variable should be declared and assigned a value. Example:
int marks = 10.</p>
        <p>After creation and assignment, the value of a variable can be
changed.</p>
        <p>A valid line of code is a collection of tokens that adhere to the
above rules.</p>
        <p>Comments are plain simple text in English that can be added
to a line of code.</p>
        <p>A comment explains various parts of the line of code, makes it
more readable and more understandable.</p>
        <p>A comment either begins with // if it is a single-line comment
or is enclosed within /* and */ if it is a multi-line comment.</p>
        <p>Comments can be either labeled Useful or Not Useful.</p>
        <p>A comment is labeled Useful when it is informative and helps
clarify the line of code without being redundant, otherwise, it
is labeled Not Useful.</p>
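        <p>The identifier rules above can be checked mechanically. The following is a minimal sketch (our own illustration, not part of the ruleset given to ChatGPT) of a validator for the identifier and keyword rules:</p>

```python
import re

# C keywords from the ruleset: these may not be used as identifiers
KEYWORDS = {'auto', 'double', 'int', 'struct', 'break', 'else', 'long', 'switch',
            'case', 'enum', 'register', 'typedef', 'char', 'extern', 'return',
            'union', 'const', 'float', 'short', 'unsigned', 'continue', 'for',
            'signed', 'void', 'default', 'goto', 'sizeof', 'volatile', 'do',
            'if', 'static', 'while'}

def is_valid_identifier(token: str) -> bool:
    """Check a token against the identifier rules: alphanumerics and
    underscore only, first character a letter or underscore, and not
    a reserved keyword."""
    if not re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', token):
        return False
    return token not in KEYWORDS
```

        <p>For example, difference_1 is accepted, while $myvar and the keyword int are rejected.</p>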
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Script creation</title>
        <p>The ultimate goal of our NeSy method is to ensure the data generation process is not bound to
ChatGPT since it can lead to inconsistent, incoherent and inexplicable data that also risks being
incomplete because of the output token size limitation of the LLM. To regain control of the data
generation mechanism, the ideal solution is to have a tool that bypasses the data generation
limitations and pitfalls of LLMs and place it in the hands of the user.</p>
        <p>After verifying that ChatGPT can correctly transcribe the semantic rules into an algorithm in
pseudo-code, we prompt it to regenerate it in the form of a usable Python script. This generation
is reported in Figure 5.</p>
        <p>The script acts in itself as a validator proving ChatGPT has faithfully understood the rules of
data construction while also allowing user modification in case of mistakes made by the LLM
in the script logic. It also ensures that the generation of samples is no longer bound to the
LLM and is retained by the user. The reason for using ChatGPT to generate the script is that it
enables the user to take advantage of the LLM’s pre-training on code data to quickly generate
a script, saving time and human resources as opposed to manually creating the script from
scratch. A first generation attempt produced a script that did not
follow the definition of useful comments set by our rules. The second attempt yields a script
that is compliant with the intended logic.</p>
        <p>Obtaining a script that controls parameters like inputs, outputs, number of samples and
data logic means the data generation process is configurable by the user. Once the code for
generating a correct labeled code comment sample is validated, a loop allows us to generate
any number of valid synthetic data samples.</p>
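        <p>A minimal sketch of this user-parameterized generation loop follows. The simplified line generator and the comment templates here are our own illustration of the mechanism, not the verbatim script produced by ChatGPT (given in Appendix A); the balanced label split shows how the user can control class distribution:</p>

```python
import random

def generate_line_of_code():
    # Simplified stand-in for the full generator: a declaration with a
    # random type, identifier and value, per the ruleset.
    data_types = ['char', 'int', 'float', 'double']
    name = random.choice(['total', 'avg1', 'marks', 'count'])
    return f'{random.choice(data_types)} {name} = {random.randint(0, 100)};'

def generate_sample(label):
    line = generate_line_of_code()
    if label == 'Useful':
        # informative comment tied to the line of code
        comment = f'// Initialization of a variable in: {line}'
    else:
        # redundant, uninformative comment
        comment = '// code'
    return (line, comment, label)

def generate_dataset(n_samples):
    """Generate n_samples (code, comment, label) rows evenly split
    between the Useful and Not Useful labels."""
    half = n_samples // 2
    labels = ['Useful'] * half + ['Not Useful'] * (n_samples - half)
    random.shuffle(labels)
    return [generate_sample(lbl) for lbl in labels]

data = generate_dataset(5000)
```

        <p>Because the label list is constructed before sampling, the 50-50 split is exact rather than only expected, which is the kind of control that prompting alone does not guarantee.</p>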
        <p>The full script for generating synthetic data is shown in Appendix A. The code for our NeSy
workflow can be found at https://github.com/HannaAbiAkl/NeSy-Code-Generation-Workflow. The entire chat containing all ChatGPT prompts
and responses can be found at https://chat.openai.com/share/0b5592f9-deac-402b-b0ef-a3ed4c7f06b7.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>This section describes our experiments in terms of data, models and training process.</p>
      <sec id="sec-5-1">
        <title>-</title>
        <sec id="sec-5-1-1">
          <title>4.1. Dataset description</title>
          <p>
            We consider two datasets for our experiments: a baseline dataset created in our prior work [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]
as a result of augmenting the original seed dataset of the IRSE 2023 shared task by prompting
ChatGPT with examples, and an additional synthetic dataset generated from the Python script
created by ChatGPT.
          </p>
          <p>4.1.1. Baseline data: The baseline data is described in Abi Akl [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. The dataset contains a total of 11873 samples, of
which 7474 are labeled Useful and 4399 Not Useful.</p>
          <p>4.1.2. Additional data: We leverage the script created by ChatGPT to generate an additional synthetic dataset of 5000
samples evenly split between Useful and Not Useful samples.</p>
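          <p>As a quick arithmetic check of the reported baseline counts, and of the class imbalance figure used later in the experimental setup:</p>

```python
# Baseline dataset counts as reported above
useful, not_useful = 7474, 4399
total = useful + not_useful        # 11873 samples in total
share = useful / total             # proportion of Useful samples
print(f'{share:.1%}')              # prints 62.9%
```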
        </sec>
        <sec id="sec-5-1-2">
          <title>4.2. System description</title>
          <p>
            This section introduces the methodology used in our experimental runs. It describes the machine
learning models as well as the features used in our experiments.
          </p>
          <p>4.2.1. Model choice: We retain the model choice and configuration from Abi Akl: Random Forest (RF), Voting
Classifier (VC) and Neural Network (NN). The RF classifier is kept as a baseline. The VC and
NN are selected for their good performance on the IRSE 2023 shared task dataset.</p>
          <p>
            4.2.2. Features: Feature selection and engineering is retained from our work in Abi Akl. Each code
comment input string is transformed into a 768-dimensional vector of embeddings using the
st-codesearch-distilroberta-base (https://huggingface.co/flax-sentence-embeddings/st-codesearch-distilroberta-base) sentence embeddings model [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].
          </p>
          <p>4.2.3. Experimental setup: We divide the experiment into two phases. The first phase consists of evaluating the models on
the baseline data only. The second phase consists of creating an augmented dataset by adding
the 5000 synthetic samples to the baseline data and evaluating the same models on the curated
dataset.</p>
          <p>In both phases, there is a class imbalance caused by the uneven split in the baseline data. The
Useful class is over-represented at 62.9%. To rectify this imbalance, we use the SMOTE [27]
technique to generate synthetic data and achieve a 50-50 class distribution.</p>
          <p>Next, we split the data using the scikit-learn Repeated Stratified K-Fold cross-validator
(https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html) with
10 folds and 3 repetitions. We use the Accuracy, Precision, Recall and Macro-F1 scores
as metrics for evaluating our models. All experiments are performed on a Dell G15 Special
Edition 5521 hardware with 14 CPU cores, 32 GB RAM and NVIDIA GeForce RTX 3070 Ti GPU.</p>
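          <p>The split-and-score setup above can be sketched with scikit-learn as follows. The random features and the LogisticRegression stand-in are placeholder assumptions for illustration, not our actual 768-dimensional embeddings or the RF/VC/NN models:</p>

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # placeholder for the embedding vectors
y = rng.integers(0, 2, size=200)     # placeholder Useful / Not Useful labels

# 10 folds, 3 repetitions, as in our setup: 30 scores per model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring='f1_macro', cv=cv)
print(f'Macro-F1: {scores.mean():.3f} (std {scores.std():.3f})')
```

          <p>Repeating the stratified split reduces the variance of the reported Macro-F1 while keeping the class ratio identical in every fold.</p>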
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <p>
        Table 2 demonstrates the performance of each model on the augmented data. On the majority of
the scoring metrics, the Neural Network outclasses the Random Forest and the Voting Classifier
models. The VC retains the highest Macro-F1 and Recall scores for the Useful class as well as
the highest Precision score for the Not Useful class, narrowly edging out the NN model. This
is consistent with the results of prior work and suggests the synthetic data did not skew the
model behaviors or cause any drift in their predictions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>We also note that the data augmentation process results in an increase in all scores for all
models, underscoring the importance of valid synthetic data and its impact on different machine
learning models for the code comment classification task.</p>
      <p>
        The results of Table 3 are consistent with these findings. The table shows the evolution of the
Macro-F1 score for the 3 models on 3 different datasets. The Seed dataset is the original data
proposed by the IRSE 2023 shared task organizers and augmented by SMOTE in Abi Akl. The
Baseline data is the ChatGPT-augmented dataset using prompting by examples and augmented
by SMOTE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The Augmented dataset is the extension of the Baseline set with the synthetic
data from the NeSy workflow. The first main takeaway from the table is that both neural
(i.e., prompting by examples) and symbolic (i.e., constructing a script from a ruleset) methods
can generate valid synthetic data that positively impacts model performance. This is apparent
through the increasing Macro-F1 scores for all models, despite their being based on different algorithms
and architectures.
      </p>
      <p>
        The second main takeaway is the consistency in the increase which is around 1% with each
data augmentation. This seems to suggest that both synthetic data generation methods are on
par in the quality of the data generated. However, it is worth pointing out that these results
are also a consequence of SMOTE, which contributed to balancing all
3 datasets by furnishing its own synthetic data to compensate for the class imbalance
carried over from the original Seed dataset. The consistency of the increase does little to inform
us about the state and quality of the synthetic data generated for both the Baseline
and Augmented datasets. In the neural generation method, ChatGPT tries to imitate the given
examples, and the result is a very small set of data lacking diversity and containing many
inconsistencies such as duplicate examples [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The 421 samples that have been retained for our
experiments are what’s left of an original set of 1000 samples that had been manually pruned to
remove inconsistent, redundant and incomplete examples [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In addition, the prompt asked for
a balanced set of examples labeled Useful and Not Useful to avoid falling again into the trap of
class imbalance, which ChatGPT failed to provide, as seen in the description of the final Baseline
dataset in section 4.1.1.
      </p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>On the other hand, the NeSy workflow forces ChatGPT to adhere to a strict ruleset and
properly learn the syntax of the C language. The additional step of asking ChatGPT to generate
a script is both a method validator to ensure it has learned the rule framework correctly and a
tool to control the generation of data. By taking control of the data generation process, we can
easily parameterize the total number of samples we wish to generate as well as the quality of
these samples, i.e., equally distributed between Useful and Not Useful labels. In our experiments,
we tested with 1000 and 5000 balanced samples. Both sample sizes yield an increase for all
models on all metrics, but the increase from 5000 examples is much more significant overall
than that from 1000 samples, which is why we opted to report our findings only for the larger
set. We leave the door open for generation and testing on larger sample sizes but we consider
this to be a natural consequence of the methodology we introduce which remains first and
foremost the primary objective of this study.</p>
      <p>In this work, we introduce a symbolic method of synthetic data generation using semantic
decomposition. We show how we can combine this method with LLMs to create a neuro-symbolic
workflow for controlled synthetic data generation to tackle the code comment classification
challenge. Our method overcomes the limits of over-reliance on LLMs as generators and
enables the creation of valid synthetic data that improves the performance of machine learning
models on the classification task without the need for scaling or constructing more complex
and resource-hungry systems. Future work will include more tests on larger data samples
generated through this workflow, possible extensions to other domains and data sources, and
experimentation with the possibility of integrating other forms of symbolic knowledge into our workflow.</p>
    </sec>
    <sec id="sec-8">
      <title>7. Acknowledgements</title>
      <p>We would like to thank the organizers of the IRSE 2023 shared task for laying the groundwork
for our research.</p>
    </sec>
    <sec id="sec-9">
      <title>8. Appendices</title>
      <p>import random
import csv
We would like to thank the organizers of the IRSE 2023 shared task for laying the groundwork
for our research.</p>
      <p>Listing 1: Python script created by ChatGPT to generate synthetic labeled code comment data
in C
# List of keywords in the C language
keywords = ['auto', 'double', 'int', 'struct', 'break', 'else', 'long', 'switch',
'case', 'enum', 'register', 'typedef',
'char', 'extern', 'return', 'union', 'const', 'float', 'short',</p>
      <p>'unsigned', 'continue', 'for', 'signed',
'void', 'default', 'goto', 'sizeof', 'volatile', 'do', 'if', 'static',</p>
      <p>'while']
# List of data types for variables
data_types = ['char', 'int', 'float', 'double', 'void']
# List of labels for comments
comment_labels = ['Useful', 'Not Useful']
# Function to generate a random valid identifier
def generate_identifier():
first_char =</p>
      <p>random.choice('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_')
rest_chars =
''.join(random.choices('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_',
k=random.randint(0, 10)))
return first_char + rest_chars
# Function to generate a random valid line of code
def generate_line_of_code():
keyword_or_data_type = random.choice(keywords + data_types)
identifier = generate_identifier()
value = random.choice(['', f' = {random.randint(0, 100)}'])
return f'{keyword_or_data_type} {identifier}{value};'
# Function to generate a random comment
def generate_comment():
level_of_detail = random.choice(['', ' // ' + ' '.join(generate_identifier()
for _ in range(random.randint(1, 5)))])
return random.choice(['', '/* ' + generate_identifier() + ' */']) +</p>
      <p>level_of_detail
# Function to generate a useful comment for a given line of code
def generate_useful_comment(line_of_code):
purpose_keywords = ['Declaration', 'Initialization', 'Calculation', 'Function',
'Definition', 'Usage', 'Explanation']
variable_keywords = ['Variable', 'Value', 'Data', 'Result', 'Parameter']
purpose = random.choice(purpose_keywords)
variable = random.choice(variable_keywords)
return f'// {purpose} of {variable} in the line of code:\n// {line_of_code}'
# Function to generate a random label for a comment
def generate_comment_label():</p>
      <p>return random.choice(comment_labels)
# Generate 5000 lines of code, comments, and labels
data = []
for _ in range(5000):
line_of_code = generate_line_of_code()
comment = generate_comment()
label = generate_comment_label()
# Ensure the comment is useful if labeled as Useful
if label == 'Useful':</p>
      <p>
        comment = generate_useful_comment(line_of_code)
data.append((line_of_code, comment, label))
# Function to write data to a CSV file
def write_to_csv(file_path, data):
with open(file_path, mode='w', newline='') as csv_file:
fieldnames = ['Line of Code', 'Comment', 'Class']
writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
writer.writeheader()
for row in data:
writer.writerow({'Line of Code': row[0], 'Comment': row[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], 'Class':
      </p>
      <p>
        row[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]})
# Specify the file path
csv_file_path = 'test.csv'
# Write data to the CSV file
write_to_csv(csv_file_path, data)
print('Data has been generated and saved to {csv_file_path}')
[17] H. Jhamtani, H. Fang, P. Xia, E. Levy, J. Andreas, B. Van Durme, Natural language
decomposition and interpretation of complex utterances, arXiv preprint arXiv:2305.08677
(2023).
[18] A. Drozdov, N. Schärli, E. Akyürek, N. Scales, X. Song, X. Chen, O. Bousquet, D. Zhou,
Compositional semantic parsing with large language models, arXiv preprint arXiv:2209.15003
(2022).
[19] P. Patel, S. Mishra, M. Parmar, C. Baral, Is a question decomposition unit all we need?,
arXiv preprint arXiv:2205.12538 (2022).
[20] D. Mekala, J. Wolfe, S. Roy, Zerotop: Zero-shot task-oriented semantic parsing using large
language models, arXiv preprint arXiv:2212.10815 (2022).
[21] J. Yang, H. Jiang, Q. Yin, D. Zhang, B. Yin, D. Yang, Seqzero: Few-shot compositional
semantic parsing with sequential prompts and zero-shot models, arXiv preprint arXiv:2205.07381
(2022).
[22] Y. Lu, M. Shen, H. Wang, X. Wang, C. van Rechem, W. Wei, Machine learning for synthetic
data generation: a review, arXiv preprint arXiv:2302.04062 (2023).
[23] A. Bauer, S. Trapp, M. Stenger, R. Leppich, S. Kounev, M. Leznik, K. Chard, I. Foster,
Comprehensive exploration of synthetic data generation: A survey, arXiv preprint arXiv:2401.02524
(2024).
[24] Z. Li, H. Zhu, Z. Lu, M. Yin, Synthetic data generation with large language models for text
classification: Potential and limitations, arXiv preprint arXiv:2310.07849 (2023).
[25] N. Riemer, The Routledge handbook of semantics, 2015.
[26] B. Klemens, 21st Century C: C Tips from the New School, 2014.
[27] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: synthetic minority
over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          , et al.,
          <article-title>A survey of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.18223</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A survey of large language models for code: Evolution, benchmarking, and future trends</article-title>
          ,
          <source>arXiv preprint arXiv:2311.10372</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gholami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Omar</surname>
          </string-name>
          ,
          <article-title>Does synthetic data make large language models more efficient?</article-title>
          ,
          <source>arXiv preprint arXiv:2310.07830</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Barak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Le Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>Scaling data-constrained language models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Van</surname>
          </string-name>
          ,
          <article-title>Mitigating data scarcity for large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.01806</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chattopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>Generative ai for software metadata: Overview of the information retrieval in software engineering track at fire 2023</article-title>
          ,
          <source>arXiv preprint arXiv:2311.03374</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Abi Akl</surname>
          </string-name>
          ,
          <article-title>A ML-LLM pairing for better code comment classification</article-title>
          ,
          <source>in: FIRE (Forum for Information Retrieval Evaluation) 2023</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>d'Avila Garcez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Lamb</surname>
          </string-name>
          ,
          <article-title>Neurosymbolic ai: the 3rd wave</article-title>
          , arXiv e-prints (
          <year>2020</year>
          ) arXiv-
          <fpage>2012</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Núñez-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mesejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fernández-Olivares</surname>
          </string-name>
          ,
          <article-title>Nesig: A neuro-symbolic method for learning to generate planning problems</article-title>
          ,
          <source>arXiv preprint arXiv:2301.10280</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Karth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Aytemiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mawhorter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Neurosymbolic map generation with vq-vae and wfc</article-title>
          ,
          <source>in: Proceedings of the 16th International Conference on the Foundations of Digital Games</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Prasad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sabharwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <article-title>Adapt: As-needed decomposition and planning with language models</article-title>
          ,
          <source>arXiv preprint arXiv:2311.05772</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Andreas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Decomposing uncertainty for large language models through input clarification ensembling</article-title>
          ,
          <source>arXiv preprint arXiv:2311.08718</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tarasov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shridhar</surname>
          </string-name>
          ,
          <article-title>Distilling llms' decomposition abilities into compact language models</article-title>
          ,
          <source>arXiv preprint arXiv:2402.01812</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lyre</surname>
          </string-name>
          ,
          <article-title>”Understanding AI”: Semantic grounding in large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2402.10992</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Turney</surname>
          </string-name>
          ,
          <article-title>Semantic composition and decomposition: From recognition to generation</article-title>
          ,
          <source>arXiv preprint arXiv:1405.7908</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Bloore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gauriau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Oppenheim</surname>
          </string-name>
          ,
          <article-title>Semantic decomposition improves learning of large language models on ehr data</article-title>
          ,
          <source>arXiv preprint arXiv:2212.06040</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>