<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>NeSy is alive and well: A LLM-driven symbolic approach for better code comment data generation and classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hanna Abi Akl</string-name>
          <email>hanna.abi-akl@dsti.institute</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data ScienceTech Institute (DSTI)</institution>
          ,
          <addr-line>4 Rue de la Collégiale, 75005, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hersonissos</institution>
          ,
          <addr-line>Crete</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Université Côte d'Azur</institution>
          ,
          <addr-line>Inria, CNRS, I3S</addr-line>
        </aff>
      </contrib-group>
      <kwd-group>
        <kwd>Neuro-symbolic AI</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Synthetic Data Generation</kwd>
      </kwd-group>
      <issue>1</issue>
      <abstract>
        <p>We present a neuro-symbolic (NeSy) workflow combining a symbolic-based learning technique with a large language model (LLM) agent to generate synthetic data for code comment classification in the C programming language. We also show how generating controlled synthetic data using this workflow fixes some of the notable weaknesses of LLM-based generation and increases the performance of classical machine learning models on the code comment classification task. Our best model, a Neural Network, achieves a Macro-F1 score of 91.412% with an increase of 1.033% after data augmentation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The rapid advancement and adoption of large language model (LLM) technologies has created an ever-growing demand for training
data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A consequence of this demand is data scarcity, a pitfall for all LLM agents today. Data
scarcity remains an open problem and is becoming a pressing issue in the face of the advancement
and improvement of LLM technologies, since it directly affects their greatest source of power:
data. Research is ongoing to actively tackle the problem of data scarcity [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ] but,
to our knowledge, no wide-scale solution exists at the time of writing.
      </p>
      <p>
        The Information Retrieval in Software Engineering (IRSE) track at the Forum for Information
Retrieval Evaluation (FIRE) 2023 shared task is one challenge that addresses the problem of
data scarcity. It sets out to measure the effects of leveraging LLMs to generate new data and
enrich a code comment dataset in the C programming language, starting from existing data
scraped from real code repositories [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The shared task also challenges participants to test
the quality of their generated data by evaluating its impact on the performance of machine
learning models in classifying whether a comment is useful or not useful for the surrounding
C code block [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In our previous work, we proposed a starting solution for the data scarcity
problem by showing that prompting LLMs by examples and combining the generated data
with existing synthetic data generation techniques improves model performance on the code
comment classification task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The work presented here carries over from the aforementioned
framework to introduce a more complete solution and, as such, will reference it heavily.
      </p>
      <p>In this work, we introduce a NeSy workflow leveraging both an LLM agent and a
symbolic-based learning method to enrich the code comment dataset with synthetic data and
evaluate the quality of this generation by studying the impact of the data augmentation process
on the performance of machine learning models on the code comment classification task. The
rest of the work is organized as follows. In section 2, we discuss some of the related work. In
section 3, we present our methodology. Section 4 describes our experimental framework. In
section 5, we report our results and discuss our findings. Finally, we conclude in section 6.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>This section discusses existing techniques that couple symbolic forms of learning and neural
models with a particular focus on LLMs as well as some proposed strategies in the literature for
synthetic data generation.</p>
      <sec id="sec-3-1">
        <title>2.1. Symbolic techniques and large language models</title>
        <p>Research that aligns with the promise made by NeSy models in d’Avila Garcez and Lamb,
i.e., combining the advantages of both symbolic and neural methods to create better learning
systems, places the integration of semantic techniques with state-of-the-art LLMs at its center
in an attempt to improve learning. In their work, Núñez-Molina et al. show how integrating
a Markov decision process with deep reinforcement learning policies yields generations of
planning problems that are both valid and diverse across different domains. In similar fashion,
Karth et al. apply symbolic constraints to deep learning models in the world of games to generate
new valid game tiles using a minimal number of raw pixels. Their neuro-symbolic technique
yields generations comparable to real-world levels found in World of Warcraft
(https://worldofwarcraft.blizzard.com/en-us/) and Super Mario (https://mario.nintendo.com/).</p>
        <sec id="sec-3-1-1">
          <title>-</title>
          <p>Shared task websites: https://sites.google.com/view/irse2023/home and http://fire.irsi.res.in/fire/static/resources.</p>
          <p>
            The idea of symbolically addressing learning needs in LLM agents was further refined and
centered around task decomposition. In their work, Prasad et al. show that decomposing
planning tasks into sub-tasks helps LLM agents respond better and successfully carry out
complex tasks. They also use their method to create a new decomposition dataset that helps LLMs
learn complex tasks incrementally through smaller sub-tasks [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Other existing works like
Hou et al. explored the effects of introducing sets of clarifications to LLMs on their performance.
Their findings show that their method is more effective in fine-tuning models on learning tasks
than parameter-tuning them. Tarasov and Shridhar extended the use of decomposition to deal
with the problem of scale, breaking down a large task into smaller tasks and feeding them to
small models. They showed how tuning each model to handle a specific sub-task and collecting
their outputs improves the performance of a larger LLM taking them as input [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ].
          </p>
          <p>
            Another important symbolic method that addresses LLM learning and reasoning is semantic
grounding. The work of Lyre investigates different pillars of semantic grounding in LLMs and
shows that these models have basic notions of these concepts. Turney took the investigation
further by leveraging LLMs to generate synonyms of concepts using unigrams and bigrams and
comparing their outputs to valid WordNet words. Other research methods proposed similar
semantic decomposition approaches by integrating them into deep learning models coupled with
different language structures like graph decomposition [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ], natural language decomposition
into intents [17], prompt decomposition [18], question-answering reformulation into a mixture
of abstractive and extractive prompts [19, 20] and SQL-based statement decomposition [21].
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Synthetic data generation methods</title>
        <p>The work of Lu et al. surveyed machine learning and deep learning models for synthetic
data generation on a variety of tasks, e.g., computer vision and natural language processing,
using different data sources, e.g., image and text, and in different domains, e.g., healthcare.
Their findings showed that architectures based on neural networks and large language model
technology are the most popular models for data generation [22]. They also studied different
data generation algorithms, like artificial data labeling, and observed varying model performance
depending on the task and the domain [22]. In their work, Bauer et al. surveyed 417 synthetically
generated datasets and showed Generative Adversarial Nets (GANs) to be the most prevalent
synthetic data generation models and computer vision to be the most popular task domain of
application. They also highlighted the importance of having standardized datasets and metrics
for evaluating the quality of synthetically generated data [23]. Finally, Li et al. studied the
limitations of LLM-based synthetic data generation and highlighted the dangers of uncontrolled
data generation which negatively impacts model performance, most notably on classification
tasks.</p>
        <sec id="sec-3-2-1">
          <title>3https://worldofwarcraft.blizzard.com/en-us/ 4https://mario.nintendo.com/</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>This section describes our NeSy methodology combining an LLM agent and a symbolic framework
to generate synthetic labeled code comment data as shown in Figure 1. We chose ChatGPT
3.5 to implement our methods and experiments since it is freely accessible and usable without
prior configuration. We introduce a set of rules based on semantic decomposition to prompt
ChatGPT and create a neuro-symbolic workflow that teaches the LLM the proper syntax of the
C programming language for controlling the generation of synthetic labeled code comment
samples. The workflow is represented in Figure 2.</p>
      <sec id="sec-4-1">
        <title>3.1. Semantic rules</title>
        <p>We turn to semantic decomposition, an algorithm that breaks down the meanings of phrases or
concepts into less complex concepts [25], to create a ruleset that helps ChatGPT construct a
valid code comment dataset. The advantage of this symbolic method is twofold: to control the
generation of valid data and to ensure sufficient diversity to enrich an existing dataset.</p>
        <p>The rules themselves have been designed as renditions of the syntax of the C programming
language [26] and delimit the vocabulary as well as the constructs of the language. They start
at the atomic level by defining what a valid token in the language is and move to more complex
concepts like determining the construction of a valid line of code in C. Each rule is written as a
statement in natural language and is kept as simple and short as possible. Figure 3 shows the 12
rules given as a prompt for ChatGPT to produce a valid line of C code.</p>
        <p>In order to produce a complete data sample, generating a valid line of code is not enough. Our
dataset consists of code, comment and label data. For ChatGPT to produce comments, we add 3
rules to define what a comment in C is as well as its purpose. The definitions are restricted to
English generations of comments but can be extended to accommodate any language. The rules
also contain syntactic details such as the allowed tokens at the beginning of a comment in C.</p>
        <p>Finally, to remain faithful to the input shape of our data, we can ensure any data sample
produced by the LLM is labeled by introducing 2 more rules to explain the allowed labels, i.e.,
Useful and Not Useful, as well as how to classify a code comment pair. These rules help reduce
incoherent data generation and ensure the LLM labeling choice is explainable.</p>
        <p>The full ruleset is presented in Table 1.</p>
        <p>Figure 4 shows an example of valid synthetic data generated by ChatGPT using our full
ruleset.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Algorithm generation</title>
        <p>To circumvent the ambiguities that come with expressing statements in natural language, we
ask ChatGPT to formulate an algorithm out of the provided rules by prompting the LLM to
treat this exercise as a translation task from a natural language to an algorithmic language. This
plays to the strengths of LLMs, given they are pre-trained and capable of performing well on
this kind of task. The purpose of this step is to make the rules as explicit and clear as possible
to ensure they are explainable and reproducible. This also counteracts the black-box behavior
LLMs generally exhibit in interpreting prompt instructions. Finally, this phase also serves as a
self-check and ensures that any logical gaps missed at the time of designing the
rules can be addressed.</p>
        <p>We ask ChatGPT to generate the algorithm in the form of a Python script because this will
ultimately be the tool used to control the synthetic data generation. This step is detailed in the
next subsection. Algorithm 1 showcases the algorithm constructed by the LLM from the initial
ruleset to generate a labeled code comment dataset.</p>
        <p>The smallest individual unit of a program is called a token.</p>
        <p>Tokens are either keywords, identifiers or variables.</p>
        <p>A keyword must belong to the list: auto, double, int, struct,
break, else, long, switch, case, enum, register, typedef, char,
extern, return, union, const, float, short, unsigned, continue,
for, signed, void, default, goto, sizeof, volatile, do, if, static,
while.</p>
        <p>An identifier can only have alphanumeric characters (a-z, A-Z,
0-9) and underscore (_).</p>
        <p>The first character of an identifier can only be an
alphabet letter (a-z, A-Z) or underscore (_).</p>
        <p>Identifiers are case-sensitive in the C language. For example,
name and Name will be treated as two different identifiers.</p>
        <p>Keywords are not allowed to be used as Identifiers.</p>
        <p>No special characters, such as a semicolon, period, whitespaces,
slash, or comma are permitted to be used in or as an Identifier.</p>
        <p>Example of valid identifiers: total, avg1, difference_1. Example
of invalid identifiers: $myvar, x!y.</p>
        <p>A variable has a data type (which can be one of the following:
char, int, float, double, void), a name and a value.</p>
        <p>A variable should be declared and assigned a value. Example:
int marks = 10.</p>
        <p>After creation and assignment, the value of a variable can be
changed.</p>
        <p>A valid line of code is a collection of tokens that adhere to the
above rules.</p>
        <p>Comments are plain simple text in English that can be added
to a line of code.</p>
        <p>A comment explains various parts of the line of code, makes it
more readable and more understandable.</p>
        <p>A comment either begins with // if it is a single-line comment
or is enclosed within /* and */ if it is a multi-line comment.</p>
        <p>Comments can be either labeled Useful or Not Useful.</p>
        <p>A comment is labeled Useful when it is informative and helps
clarify the line of code without being redundant, otherwise, it
is labeled Not Useful.</p>
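        <p>The identifier rules above can be checked mechanically. The following is a minimal sketch (our own illustration, not part of the ruleset given to ChatGPT) of a validator for the identifier and keyword rules:</p>

```python
import re

# C keywords from the ruleset: these may not be used as identifiers
KEYWORDS = {'auto', 'double', 'int', 'struct', 'break', 'else', 'long', 'switch',
            'case', 'enum', 'register', 'typedef', 'char', 'extern', 'return',
            'union', 'const', 'float', 'short', 'unsigned', 'continue', 'for',
            'signed', 'void', 'default', 'goto', 'sizeof', 'volatile', 'do',
            'if', 'static', 'while'}

def is_valid_identifier(token: str) -> bool:
    """Check a token against the identifier rules: alphanumerics and
    underscore only, first character a letter or underscore, and not
    a reserved keyword."""
    if not re.fullmatch(r'[A-Za-z_][A-Za-z0-9_]*', token):
        return False
    return token not in KEYWORDS
```

        <p>For example, difference_1 is accepted, while $myvar and the keyword int are rejected.</p>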
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Script creation</title>
        <p>The ultimate goal of our NeSy method is to ensure the data generation process is not bound to
ChatGPT since it can lead to inconsistent, incoherent and inexplicable data that also risks being
incomplete because of the output token size limitation of the LLM. To regain control of the data
generation mechanism, the ideal solution is to have a tool that bypasses the data generation
limitations and pitfalls of LLMs and place it in the hands of the user.</p>
        <p>After verifying that ChatGPT can correctly transcribe the semantic rules into an algorithm in
pseudo-code, we prompt it to regenerate it in the form of a usable Python script. This generation
is reported in Figure 5.</p>
        <p>The script acts in itself as a validator proving ChatGPT has faithfully understood the rules of
data construction while also allowing user modification in case of mistakes made by the LLM
in the script logic. It also ensures that the generation of samples is no longer bound to the
LLM and is retained by the user. The reason for using ChatGPT to generate the script is that it
enables the user to take advantage of the LLM’s pre-training on code data to quickly generate
a script, saving time and human resources as opposed to manually creating the script from
scratch. A first generation attempt produced a script that did not
follow the definition of useful comments set by our rules. The second attempt yields a script
that is compliant with the intended logic.</p>
        <p>Obtaining a script that controls parameters like inputs, outputs, number of samples and
data logic means the data generation process is configurable by the user. Once the code for
generating a correct labeled code comment sample is validated, a loop allows us to generate
any number of valid synthetic data samples.</p>
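        <p>A minimal sketch of this user-parameterized generation loop follows. The simplified line generator and the comment templates here are our own illustration of the mechanism, not the verbatim script produced by ChatGPT (given in Appendix A); the balanced label split shows how the user can control class distribution:</p>

```python
import random

def generate_line_of_code():
    # Simplified stand-in for the full generator: a declaration with a
    # random type, identifier and value, per the ruleset.
    data_types = ['char', 'int', 'float', 'double']
    name = random.choice(['total', 'avg1', 'marks', 'count'])
    return f'{random.choice(data_types)} {name} = {random.randint(0, 100)};'

def generate_sample(label):
    line = generate_line_of_code()
    if label == 'Useful':
        # informative comment tied to the line of code
        comment = f'// Initialization of a variable in: {line}'
    else:
        # redundant, uninformative comment
        comment = '// code'
    return (line, comment, label)

def generate_dataset(n_samples):
    """Generate n_samples (code, comment, label) rows evenly split
    between the Useful and Not Useful labels."""
    half = n_samples // 2
    labels = ['Useful'] * half + ['Not Useful'] * (n_samples - half)
    random.shuffle(labels)
    return [generate_sample(lbl) for lbl in labels]

data = generate_dataset(5000)
```

        <p>Because the label list is constructed before sampling, the 50-50 split is exact rather than only expected, which is the kind of control that prompting alone does not guarantee.</p>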
        <p>The full script for generating synthetic data is shown in Appendix A. The code for our NeSy
workflow can be found at https://github.com/HannaAbiAkl/NeSy-Code-Generation-Workflow. The entire chat containing all ChatGPT prompts
and responses can be found at https://chat.openai.com/share/0b5592f9-deac-402b-b0ef-a3ed4c7f06b7.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Experiments</title>
      <p>This section describes our experiments in terms of data, models and training process.</p>
      <sec id="sec-5-1">
        <title>-</title>
        <sec id="sec-5-1-1">
          <title>4.1. Dataset description</title>
          <p>
            We consider two datasets for our experiments: a baseline dataset created in our prior work [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]
as a result of augmenting the original seed dataset of the IRSE 2023 shared task by prompting
ChatGPT with examples, and an additional synthetic dataset generated from the Python script
created by ChatGPT.
          </p>
          <p>4.1.1. Baseline data: The baseline data is described in Abi Akl [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. The dataset contains a total of 11873 samples, of
which 7474 are labeled Useful and 4399 Not Useful.</p>
          <p>4.1.2. Additional data: We leverage the script created by ChatGPT to generate an additional synthetic dataset of 5000
samples evenly split between Useful and Not Useful samples.</p>
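          <p>As a quick arithmetic check of the reported baseline counts, and of the class imbalance figure used later in the experimental setup:</p>

```python
# Baseline dataset counts as reported above
useful, not_useful = 7474, 4399
total = useful + not_useful        # 11873 samples in total
share = useful / total             # proportion of Useful samples
print(f'{share:.1%}')              # prints 62.9%
```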
        </sec>
        <sec id="sec-5-1-2">
          <title>4.2. System description</title>
          <p>
            This section introduces the methodology used in our experimental runs. It describes the machine
learning models as well as the features used in our experiments.
          </p>
          <p>4.2.1. Model choice: We retain the model choice and configuration from Abi Akl: Random Forest (RF), Voting
Classifier (VC) and Neural Network (NN). The RF classifier is kept as a baseline. The VC and
NN are selected for their good performance on the IRSE 2023 shared task dataset.</p>
          <p>
            4.2.2. Features: Feature selection and engineering is retained from our work in Abi Akl. Each code
comment input string is transformed into a 768-dimensional vector of embeddings using the
st-codesearch-distilroberta-base (https://huggingface.co/flax-sentence-embeddings/st-codesearch-distilroberta-base) sentence embeddings model [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ].
          </p>
          <p>4.2.3. Experimental setup: We divide the experiment into two phases. The first phase consists of evaluating the models on
the baseline data only. The second phase consists of creating an augmented dataset by adding
the 5000 synthetic samples to the baseline data and evaluating the same models on the curated
dataset.</p>
          <p>In both phases, there is a class imbalance caused by the uneven split in the baseline data. The
Useful class is over-represented at 62.9%. To rectify this imbalance, we use the SMOTE [27]
technique to generate synthetic data and achieve a 50-50 class distribution.</p>
          <p>Next, we split the data using the scikit-learn Repeated Stratified K-Fold cross-validator
(https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RepeatedStratifiedKFold.html) with
10 folds and 3 repetitions. We use the Accuracy, Precision, Recall and Macro-F1 scores
as metrics for evaluating our models. All experiments are performed on a Dell G15 Special
Edition 5521 hardware with 14 CPU cores, 32 GB RAM and NVIDIA GeForce RTX 3070 Ti GPU.</p>
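          <p>The split-and-score setup above can be sketched with scikit-learn as follows. The random features and the LogisticRegression stand-in are placeholder assumptions for illustration, not our actual 768-dimensional embeddings or the RF/VC/NN models:</p>

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))        # placeholder for the embedding vectors
y = rng.integers(0, 2, size=200)     # placeholder Useful / Not Useful labels

# 10 folds, 3 repetitions, as in our setup: 30 scores per model
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring='f1_macro', cv=cv)
print(f'Macro-F1: {scores.mean():.3f} (std {scores.std():.3f})')
```

          <p>Repeating the stratified split reduces the variance of the reported Macro-F1 while keeping the class ratio identical in every fold.</p>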
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <p>
        Table 2 demonstrates the performance of each model on the augmented data. On the majority of
the scoring metrics, the Neural Network outclasses the Random Forest and the Voting Classifier
models. The VC retains the highest Macro-F1 and Recall scores for the Useful class as well as
the highest Precision score for the Not Useful class, narrowly edging out the NN model. This
is consistent with the results of prior work and suggests the synthetic data did not skew the
model behaviors or cause any drift in their predictions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>We also note that the data augmentation process results in an increase in all scores for all
models, underscoring the importance of valid synthetic data and its impact on different machine
learning models for the code comment classification task.</p>
      <p>
        The results of Table 3 are consistent with these findings. The table shows the evolution of the
Macro-F1 score for the 3 models on 3 different datasets. The Seed dataset is the original data
proposed by the IRSE 2023 shared task organizers and augmented by SMOTE in Abi Akl. The
Baseline data is the ChatGPT-augmented dataset using prompting by examples and augmented
by SMOTE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The Augmented dataset is the extension of the Baseline set with the synthetic
data from the NeSy workflow. The first main takeaway from the table is that both neural
(i.e., prompting by examples) and symbolic (i.e., constructing a script from a ruleset) methods
can generate valid synthetic data that positively impacts model performance. This is apparent
through the increasing Macro-F1 scores for all models, despite their being based on different algorithms
and architectures.
      </p>
      <p>
        The second main takeaway is the consistency in the increase which is around 1% with each
data augmentation. This seems to suggest that both synthetic data generation methods are on
par in the quality of the data generated. However, it is worth pointing out that these results
are also a consequence of SMOTE, which contributed to balancing all
3 datasets by furnishing its own synthetic data to compensate for the class imbalance
carried over from the original Seed dataset. The consistency of the increase does little to inform
us about the state and quality of the synthetic data generated for both the Baseline
and Augmented datasets. In the neural generation method, ChatGPT tries to imitate the given
examples, and the result is a very small set of data lacking diversity and containing many
inconsistencies such as duplicate examples [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The 421 samples that have been retained for our
experiments are what’s left of an original set of 1000 samples that had been manually pruned to
remove inconsistent, redundant and incomplete examples [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In addition, the prompt asked for
a balanced set of examples labeled Useful and Not Useful to avoid falling again into the trap of
class imbalance, which ChatGPT failed to provide, as seen in the description of the final Baseline
dataset in section 4.1.1.
      </p>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion</title>
      <p>On the other hand, the NeSy workflow forces ChatGPT to adhere to a strict ruleset and
properly learn the syntax of the C language. The additional step of asking ChatGPT to generate
a script is both a method validator to ensure it has learned the rule framework correctly and a
tool to control the generation of data. By taking control of the data generation process, we can
easily parameterize the total number of samples we wish to generate as well as the quality of
these samples, i.e., equally distributed between Useful and Not Useful labels. In our experiments,
we tested with 1000 and 5000 balanced samples. Both sample sizes yield an increase for all
models on all metrics, but the increase from 5000 examples is much more significant overall
than that from 1000 samples, which is why we opted to report our findings only for the larger
set. We leave the door open for generation and testing on larger sample sizes but we consider
this to be a natural consequence of the methodology we introduce which remains first and
foremost the primary objective of this study.</p>
      <p>In this work, we introduce a symbolic method of synthetic data generation using semantic
decomposition. We show how we can combine this method with LLMs to create a neuro-symbolic
workflow for controlled synthetic data generation to tackle the code comment classification
challenge. Our method overcomes the limits of over-reliance on LLMs as generators and
enables the creation of valid synthetic data that improves the performance of machine learning
models on the classification task without the need for scaling or constructing more complex
and resource-hungry systems. Future work will include more tests on larger data samples
generated through this workflow, possible extensions to other domains and data sources, and
experimentation with the possibility of integrating other forms of symbolic knowledge into our workflow.</p>
    </sec>
    <sec id="sec-8">
      <title>7. Acknowledgements</title>
      <p>We would like to thank the organizers of the IRSE 2023 shared task for laying the groundwork
for our research.</p>
    </sec>
    <sec id="sec-9">
      <title>8. Appendices</title>
      <p>import random
import csv
We would like to thank the organizers of the IRSE 2023 shared task for laying the groundwork
for our research.</p>
      <p>Listing 1: Python script created by ChatGPT to generate synthetic labeled code comment data
in C
# List of keywords in the C language
keywords = ['auto', 'double', 'int', 'struct', 'break', 'else', 'long', 'switch',
'case', 'enum', 'register', 'typedef',
'char', 'extern', 'return', 'union', 'const', 'float', 'short',</p>
      <p>'unsigned', 'continue', 'for', 'signed',
'void', 'default', 'goto', 'sizeof', 'volatile', 'do', 'if', 'static',</p>
      <p>'while']
# List of data types for variables
data_types = ['char', 'int', 'float', 'double', 'void']
# List of labels for comments
comment_labels = ['Useful', 'Not Useful']
# Function to generate a random valid identifier
def generate_identifier():
first_char =</p>
      <p>random.choice('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_')
rest_chars =
''.join(random.choices('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_',
k=random.randint(0, 10)))
return first_char + rest_chars
# Function to generate a random valid line of code
def generate_line_of_code():
keyword_or_data_type = random.choice(keywords + data_types)
identifier = generate_identifier()
value = random.choice(['', f' = {random.randint(0, 100)}'])
return f'{keyword_or_data_type} {identifier}{value};'
# Function to generate a random comment
def generate_comment():
level_of_detail = random.choice(['', ' // ' + ' '.join(generate_identifier()
for _ in range(random.randint(1, 5)))])
return random.choice(['', '/* ' + generate_identifier() + ' */']) +</p>
      <p>level_of_detail
# Function to generate a useful comment for a given line of code
def generate_useful_comment(line_of_code):
purpose_keywords = ['Declaration', 'Initialization', 'Calculation', 'Function',
'Definition', 'Usage', 'Explanation']
variable_keywords = ['Variable', 'Value', 'Data', 'Result', 'Parameter']
purpose = random.choice(purpose_keywords)
variable = random.choice(variable_keywords)
return f'// {purpose} of {variable} in the line of code:\n// {line_of_code}'
# Function to generate a random label for a comment
def generate_comment_label():</p>
      <p>return random.choice(comment_labels)
# Generate 5000 lines of code, comments, and labels
data = []
for _ in range(5000):
line_of_code = generate_line_of_code()
comment = generate_comment()
label = generate_comment_label()
# Ensure the comment is useful if labeled as Useful
if label == 'Useful':</p>
      <p>
        comment = generate_useful_comment(line_of_code)
data.append((line_of_code, comment, label))
# Function to write data to a CSV file
def write_to_csv(file_path, data):
with open(file_path, mode='w', newline='') as csv_file:
fieldnames = ['Line of Code', 'Comment', 'Class']
writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
writer.writeheader()
for row in data:
writer.writerow({'Line of Code': row[0], 'Comment': row[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], 'Class':
      </p>
      <p>
        row[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]})
# Specify the file path
csv_file_path = 'test.csv'
# Write data to the CSV file
write_to_csv(csv_file_path, data)
print('Data has been generated and saved to {csv_file_path}')
[17] H. Jhamtani, H. Fang, P. Xia, E. Levy, J. Andreas, B. Van Durme, Natural language
decomposition and interpretation of complex utterances, arXiv preprint arXiv:2305.08677
(2023).
[18] A. Drozdov, N. Schärli, E. Akyürek, N. Scales, X. Song, X. Chen, O. Bousquet, D. Zhou,
Compositional semantic parsing with large language models, arXiv preprint arXiv:2209.15003
(2022).
[19] P. Patel, S. Mishra, M. Parmar, C. Baral, Is a question decomposition unit all we need?,
arXiv preprint arXiv:2205.12538 (2022).
[20] D. Mekala, J. Wolfe, S. Roy, Zerotop: Zero-shot task-oriented semantic parsing using large
language models, arXiv preprint arXiv:2212.10815 (2022).
[21] J. Yang, H. Jiang, Q. Yin, D. Zhang, B. Yin, D. Yang, Seqzero: Few-shot compositional
semantic parsing with sequential prompts and zero-shot models, arXiv preprint arXiv:2205.07381
(2022).
[22] Y. Lu, M. Shen, H. Wang, X. Wang, C. van Rechem, W. Wei, Machine learning for synthetic
data generation: a review, arXiv preprint arXiv:2302.04062 (2023).
[23] A. Bauer, S. Trapp, M. Stenger, R. Leppich, S. Kounev, M. Leznik, K. Chard, I. Foster,
Comprehensive exploration of synthetic data generation: A survey, arXiv preprint arXiv:2401.02524
(2024).
[24] Z. Li, H. Zhu, Z. Lu, M. Yin, Synthetic data generation with large language models for text
classification: Potential and limitations, arXiv preprint arXiv:2310.07849 (2023).
[25] N. Riemer, The Routledge handbook of semantics, 2015.
[26] B. Klemens, 21st Century C: C Tips from the New School, 2014.
[27] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: synthetic minority
over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          , et al.,
          <article-title>A survey of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.18223</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>A survey of large language models for code: Evolution, benchmarking, and future trends</article-title>
          ,
          <source>arXiv preprint arXiv:2311.10372</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gholami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Omar</surname>
          </string-name>
          ,
          <article-title>Does synthetic data make large language models more efficient?</article-title>
          ,
          <source>arXiv preprint arXiv:2310.07830</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Barak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Le Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pyysalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>Scaling data-constrained language models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Van</surname>
          </string-name>
          ,
          <article-title>Mitigating data scarcity for large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.01806</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Majumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chattopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>Generative ai for software metadata: Overview of the information retrieval in software engineering track at fire 2023</article-title>
          ,
          <source>arXiv preprint arXiv:2311.03374</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Abi Akl</surname>
          </string-name>
          ,
          <article-title>A ML-LLM pairing for better code comment classification</article-title>
          ,
          <source>in: FIRE (Forum for Information Retrieval Evaluation) 2023</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>d'Avila Garcez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Lamb</surname>
          </string-name>
          ,
          <article-title>Neurosymbolic ai: the 3rd wave</article-title>
          , arXiv e-prints (
          <year>2020</year>
          ) arXiv-
          <fpage>2012</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Núñez-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mesejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fernández-Olivares</surname>
          </string-name>
          ,
          <article-title>Nesig: A neuro-symbolic method for learning to generate planning problems</article-title>
          ,
          <source>arXiv preprint arXiv:2301.10280</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Karth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Aytemiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mawhorter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Neurosymbolic map generation with vq-vae and wfc</article-title>
          ,
          <source>in: Proceedings of the 16th International Conference on the Foundations of Digital Games</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Prasad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hartmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sabharwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <article-title>Adapt: As-needed decomposition and planning with language models</article-title>
          ,
          <source>arXiv preprint arXiv:2311.05772</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Andreas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Decomposing uncertainty for large language models through input clarification ensembling</article-title>
          ,
          <source>arXiv preprint arXiv:2311.08718</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tarasov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shridhar</surname>
          </string-name>
          ,
          <article-title>Distilling llms' decomposition abilities into compact language models</article-title>
          ,
          <source>arXiv preprint arXiv:2402.01812</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lyre</surname>
          </string-name>
          ,
          <article-title>”Understanding AI”: Semantic grounding in large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2402.10992</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Turney</surname>
          </string-name>
          ,
          <article-title>Semantic composition and decomposition: From recognition to generation</article-title>
          ,
          <source>arXiv preprint arXiv:1405.7908</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Bloore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gauriau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Oppenheim</surname>
          </string-name>
          ,
          <article-title>Semantic decomposition improves learning of large language models on ehr data</article-title>
          ,
          <source>arXiv preprint arXiv:2212.06040</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>