<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Commonsense Knowledge and Controllable Techniques for an Effective and Efficient Approach to Text Generation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Iván</forename><surname>Martínez-Murillo</surname></persName>
							<email>ivan.martinezmurillo@ua.es</email>
							<affiliation key="aff0">
								<orgName type="department">Dept. of Software and Computing Systems</orgName>
								<orgName type="institution">University of Alicante</orgName>
								<address>
									<addrLine>Apdo. de Correos 99</addrLine>
									<postCode>E-03080</postCode>
									<settlement>Alicante</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Commonsense Knowledge and Controllable Techniques for an Effective and Efficient Approach to Text Generation</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">046BC294EC97738B7D7DCFE1D6FE5718</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T19:05+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Natural Language Generation</term>
					<term>Controllable techniques</term>
					<term>Hallucination</term>
					<term>Efficient architectures</term>
					<term>Task-agnostic</term>
					<term>Commonsense Knowledge</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Natural Language Generation (NLG) field has advanced at breakneck speed, driven by the development of Large Language Models (LLMs). Nevertheless, these models also have drawbacks. On the one hand, they introduce risks such as hallucination and bias, which can be exploited unethically to generate disinformation and misinformation. On the other hand, the time and monetary cost of training these models is extremely high. On account of this, the purpose of this paper is to propose a new research line for my PhD thesis. During the research, I will propose an efficient architecture that can generate quality text in a controllable way while integrating external commonsense knowledge. The objective is for this architecture to achieve performance similar to state-of-the-art models while being more efficient.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Justification of the research</head><p>The rapid development of generative Artificial Intelligence (AI) has caused a surge of societal interest in AI tools. These tools can have a positive impact in many areas, saving the time and effort of solving certain tasks <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>.</p><p>In particular, state-of-the-art Natural Language Generation (NLG) tools can produce text that, in some cases, is indistinguishable from human-written text. This could bring many benefits to sectors such as academia, tourism or marketing <ref type="bibr" target="#b3">[4]</ref>. Nonetheless, these tools also have drawbacks. First of all, the text they generate may contain hallucinations, the phenomenon that occurs when a text is nonsensical or unfaithful to the provided source <ref type="bibr" target="#b4">[5]</ref>. Secondly, AI-generated text can be biased, i.e. it may contain misrepresentations or attribution errors that favour certain groups or ideas <ref type="bibr" target="#b5">[6]</ref>. Finally, these tools also lack logical reasoning, which is essential to human intelligence <ref type="bibr" target="#b6">[7]</ref>. In the wake of these limitations, these tools can be misused unethically to generate disinformation and misinformation.</p><p>Moreover, the core of these tools are Large Language Models (LLMs). 
The time and monetary cost needed to train these models are extremely high, placing them within the reach of only large companies.</p><p>Therefore, the motivation for the present research arises from the need in academia for efficient architectures that can produce text in a controlled manner, achieving performance similar to state-of-the-art models while solving the hallucination issue.</p><p>The remainder of this article is organised as follows: Section 2 presents an overview of the relevant literature concerning NLG; Section 3 states the main hypotheses and objectives planned for this research; finally, Section 4 and Section 5 detail the methodology this PhD will follow and some relevant research topics for discussion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background and Related Work</head><p>Before introducing my proposal, this section aims to contextualise this study within the state of the art of NLG.</p><p>NLG is the subfield of the Natural Language Processing (NLP) area that aims to produce meaningful sentences to meet a communicative goal <ref type="bibr" target="#b7">[8]</ref>. Depending on several aspects of the generation, NLG can be classified according to two criteria:</p><p>• Type of input: Depending on the type of input, NLG can be catalogued as (1) text-to-text generation (T2T) and (2) data-to-text generation (D2T) <ref type="bibr" target="#b8">[9]</ref>. In D2T, the input data can take different forms such as binary data, images, voice, databases, ontologies, etc. Recently, another concept of NLG has emerged, (3) none-to-text generation (N2T) <ref type="bibr" target="#b9">[10]</ref>, which corresponds to generation in which no input is received. • Task typology: Based on the communicative goal, NLG can be grouped into (1) text abbreviation; (2) text expansion; and (3) text rewriting and reasoning. Text abbreviation tasks consist of detecting the most important information in a text and fusing it into a short text, e.g. text summarisation. Text expansion tasks aim to generate complete sentences from a few meaningful words, e.g. topic-to-essay. Finally, text rewriting and reasoning tasks try to rewrite a text in another style or apply reasoning methods, e.g. text simplification.</p><p>To achieve the communicative goal of these tasks, the NLG area has been studied for a long time. The first research dates back to the late 1970s <ref type="bibr" target="#b10">[11]</ref>. Nevertheless, only in recent years has the NLG field improved exponentially, producing text in a way very similar to humans. 
But how did we get here?</p><p>In a first stage, the NLG task was seen as a sequential scheme of four different stages (preprocessing, macroplanning, microplanning and realisation). Modular architectures followed this scheme, making a clear distinction between the sub-tasks of each stage. The most famous modular architecture was proposed by Reiter <ref type="bibr" target="#b11">[12]</ref>. Figure <ref type="figure" target="#fig_0">1</ref> shows the sub-task division in this architecture. Other works following this architecture can be found in <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>.</p><p>Later, that clear distinction between sub-tasks became more flexible, giving rise to what are known as planning perspectives. This scheme was similar to the one employed in modular architectures, but it allowed two or more different sub-tasks to be combined and implemented as one, e.g. combining the text structuring and sentence aggregation sub-tasks. Some examples of this approach are presented in <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18,</ref><ref type="bibr" target="#b18">19,</ref><ref type="bibr" target="#b19">20,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b22">23,</ref><ref type="bibr" target="#b23">24]</ref>.</p><p>Finally, the sub-task division started to disappear, giving rise to global approaches. This type of architecture does not distinguish among sub-tasks, performing the whole task at once and relying on statistical learning and neural networks. 
Some proposed architectures within global approaches are: Graph Neural Networks <ref type="bibr" target="#b24">[25]</ref>, Generative Adversarial Nets <ref type="bibr" target="#b25">[26]</ref>, Recurrent Neural Networks <ref type="bibr" target="#b26">[27]</ref>, Pre-trained Models <ref type="bibr" target="#b27">[28]</ref>, Memory Networks <ref type="bibr" target="#b28">[29]</ref>, Transformers <ref type="bibr" target="#b29">[30]</ref> and Copy and Pointing Mechanisms <ref type="bibr" target="#b30">[31]</ref>. This group of approaches has driven the major developments in the NLG area. The most important proposal in this group was the Transformer architecture and its concept of attention. Models based on this architecture achieve high performance in NLG tasks. The best-performing Transformer-based models are LLMs such as GPT-4 <ref type="bibr" target="#b31">[32]</ref> or LLaMA <ref type="bibr" target="#b32">[33]</ref>, which have neural networks with billions of parameters. Nowadays, most industry research focuses on developing ever bigger LLMs, as it is thought that a bigger LLM will achieve better performance. The cost and time of training these models are unaffordable for academia. On account of this issue, there is a need in academia for more efficient architectures that can perform similarly to LLMs.</p><p>Consequently, my line of work will focus on exploring efficient architectures that can generate text with results similar to state-of-the-art models. Moreover, controllable generation methods, techniques for integrating external commonsense knowledge, and task-agnostic architectures will be studied in order to reduce the phenomenon known as hallucination.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Main Hypothesis and Objectives</head><p>This PhD thesis is based on the hypothesis that integrating external commonsense knowledge along with controllable text generation techniques in an efficient architecture will help to reduce the hallucination issue while performing similarly to state-of-the-art models. Thus, the main objective of this research is to propose an efficient architecture that achieves good performance in different NLG tasks, e.g. text summarisation and text simplification, and reduces hallucination as much as possible. In order to complete this main objective, several sub-objectives have been proposed:</p><p>• A1. To explore optimal controllable text generation techniques. The planned schedule of these sub-objectives can be seen in Figure <ref type="figure" target="#fig_2">2</ref>, starting from February 2023. Group A corresponds to the study and testing of state-of-the-art techniques. After this initial study, during Group B, an efficient architecture will be proposed, tested and compared with other open-source architectures using a common benchmark. Finally, in Group C, the proposed architecture will be adapted to perform different NLG tasks. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Methodology and proposed experiment</head><p>The proposed methodology for carrying out this research is based on complete and comprehensive training in all areas of NLG, including general training in NLP. After acquiring the basic notions of NLG, the research focuses on an exhaustive analysis of the state of the art of NLG, especially on deep learning techniques that allow controlled language generation and the integration of commonsense knowledge. Subsequently, experimentation also begins, testing different open-source architectures along with the most relevant techniques studied. After testing several architectures, an efficient base model will be proposed, integrating commonsense knowledge and controllable generation techniques into it. Then, it will be evaluated against other architectures using a common benchmark. Finally, the proposed architecture will be adapted to perform different tasks.</p><p>At present, I am experimenting with the CommonGen dataset <ref type="bibr" target="#b33">[34]</ref>. The CommonGen dataset consists of sets of common concepts together with reference sentences using those concepts, and its main purpose is to test machines for the ability of generative commonsense reasoning. I am testing different types of approaches on this dataset, such as SimpleNLG, factorised language models, or neural models. With the proposed experiment, the main idea is to combine the best-performing architecture with controllable generation techniques in order to obtain a base model.</p></div>
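To make the CommonGen setting concrete, a minimal sketch follows. It only illustrates the task format and its lexical-coverage constraint; the example instance, the `covers_concepts` helper and the deliberately naive `naive_baseline` generator are illustrative assumptions, not part of the actual experiments (which test SimpleNLG, factorised language models and neural models).

```python
# Sketch of the CommonGen task format: given a set of everyday concepts,
# a system must produce one plausible sentence that mentions them all.
# The "baseline" below is deliberately naive and only illustrative.

def covers_concepts(sentence: str, concepts: list) -> bool:
    """Check the basic CommonGen constraint: every concept appears
    (as a substring, ignoring case) in the generated sentence."""
    lowered = sentence.lower()
    return all(c.lower() in lowered for c in concepts)


def naive_baseline(concepts: list) -> str:
    """Stitch concepts into a fixed template. This satisfies lexical
    coverage but not commonsense plausibility, which is exactly the
    gap that generative commonsense reasoning is meant to probe."""
    if len(concepts) == 1:
        return f"Someone uses the {concepts[0]}."
    *head, last = concepts
    return f"Someone uses the {', the '.join(head)} and the {last}."


# Hypothetical instance written in the CommonGen style.
example = {
    "concepts": ["dog", "frisbee", "catch"],
    "references": ["The dog runs to catch the frisbee."],
}

generated = naive_baseline(example["concepts"])
print(generated)  # Someone uses the dog, the frisbee and the catch.
print(covers_concepts(generated, example["concepts"]))  # True
```

The human reference passes the same coverage check while also being sensible, which is why evaluation combines lexical coverage with similarity to the reference sentences.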
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Research issues to discuss</head><p>In order to advance towards an effective and efficient approach for controllable text generation, several research issues are suggested and briefly discussed.</p><p>What does controllable text generation mean, and what are the most efficient methods to incorporate it? Controllable text generation is the task of producing text in such a way that its attributes can be controlled <ref type="bibr" target="#b34">[35]</ref>. These attributes cover a wide variety of dimensions: stylistic choices, the inclusion of specific information in the content, adaptation to the demographic attributes of the interlocutor, etc. As seen in <ref type="bibr" target="#b35">[36]</ref>, there are three ways to approach controllable text generation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Via hyperparameters:</head><p>Training data in LLMs can be unbalanced because it is difficult to balance such a huge amount of data. Modifying hyperparameters may help the model generalise knowledge better and consequently improve the results obtained.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Via additional input:</head><p>Fine-tuning a pre-trained model with more information than just the text could enhance its performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Via conditional training:</head><p>Using internal control variables could enrich the generation with specific capabilities.</p><p>What is hallucination and what are the ways to reduce its occurrence? Hallucination in NLG occurs when a text generated by an AI lacks coherence or deviates from the intended sense of the source input <ref type="bibr" target="#b4">[5]</ref>. It can be classified into two categories: intrinsic hallucinations, which appear when the generated text contradicts the source input, and extrinsic hallucinations, which arise when the source input cannot substantiate the generated text.</p><p>There are different types of approaches to minimise the occurrence of hallucinations. Firstly, constructing a reliable dataset that does not contain any contradictions in the data. Secondly, modifying the encoder/decoder architecture, which can enhance the model's ability to understand and represent the knowledge. Thirdly, proposing an optimal training strategy, such as controllable text generation, which could benefit the model. Finally, one important approach is to integrate external commonsense knowledge into the models.</p><p>How can external commonsense knowledge be integrated? Commonsense knowledge is an important factor in human communication, as it facilitates inference without the explicit mention of context <ref type="bibr" target="#b36">[37]</ref>. Although current state-of-the-art models exhibit some commonsense abilities, these are not yet complete. Traditionally, commonsense has been injected into NLG systems in the form of rules and ontologies. Nowadays, approaches focus on injecting commonsense into neural NLG models through pre-trained models and commonsense graphs. However, there is still much work to do in this field before complete commonsense knowledge is reached.</p><p>Can a smaller architecture obtain similar performance to LLMs? 
There are some architectures, such as Plug and Play models or Variational Autoencoders, that are more efficient than LLMs. Integrating commonsense knowledge and controllable generation techniques into these models could help them perform like LLMs while remaining smaller and more efficient.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Sub-task division in the modular architecture for the stages proposed by Reiter <ref type="bibr" target="#b7">[8]</ref> </figDesc><graphic coords="3,154.66,84.19,283.47,113.39" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>•</head><label></label><figDesc>A2. To examine hallucination mitigation techniques. • A3. To study how to integrate external commonsense knowledge. • A4. To analyse and test different task-agnostic architectures incorporating the previously studied techniques. • B1. To compare the performance of open-source state-of-the-art architectures using a common benchmark. • B2. To propose a cost-effective architecture that can generate text in a controllable way and evaluate it. • C1. To adapt the proposed architecture to perform in some NLG tasks, e.g., summarisation or text simplification.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: PhD project schedule</figDesc><graphic coords="4,154.66,315.09,283.47,85.04" type="bitmap" /></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>This research work is part of the R&amp;D project "CORTEX: Conscious Text Generation" (PID2021-123956OB-I00), funded by MCIN/AEI/10.13039/501100011033 and by "ERDF A way of making Europe".</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Assessing the impact of generative AI on medicinal chemistry</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">P</forename><surname>Walters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Murcko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature biotechnology</title>
		<imprint>
			<biblScope unit="volume">38</biblScope>
			<biblScope unit="page" from="143" to="145" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Mayahi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vidrih</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2211.12660</idno>
		<title level="m">The impact of generative AI on the future of visual content marketing</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Examining science education in ChatGPT: An exploratory study of generative artificial intelligence</title>
		<author>
			<persName><forename type="first">G</forename><surname>Cooper</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Science Education and Technology</title>
		<imprint>
			<biblScope unit="page" from="1" to="9" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">so what if chatgpt wrote it?&quot; multidisciplinary perspectives on opportunities, challenges and implications of generative conversational ai for research, practice and policy</title>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">K</forename><surname>Dwivedi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kshetri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hughes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">L</forename><surname>Slade</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jeyaraj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Kar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Baabdullah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Koohang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Raghavan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ahuja</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Information Management</title>
		<imprint>
			<biblScope unit="volume">71</biblScope>
			<biblScope unit="page">102642</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Survey of hallucination in natural language generation</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Frieske</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ishii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">J</forename><surname>Bang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Fung</surname></persName>
		</author>
		<idno type="DOI">10.1145/3571730</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Comput. Surv</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Ferrara</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.03738</idno>
		<title level="m">Should ChatGPT be biased? Challenges and risks of bias in large language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Evaluating the logical reasoning ability of ChatGPT and GPT-4</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ning</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Teng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.03439</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Building applied natural language generation systems</title>
		<author>
			<persName><forename type="first">E</forename><surname>Reiter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Dale</surname></persName>
		</author>
		<idno type="DOI">10.1017/S1351324997001502</idno>
	</analytic>
	<monogr>
		<title level="j">Natural Language Engineering</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="57" to="87" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">La generación de lenguaje natural: análisis del estado actual</title>
		<author>
			<persName><forename type="first">M</forename><surname>Vicente</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Barros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">S</forename><surname>Peregrino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Agulló</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lloret</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computación y Sistemas</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="page" from="721" to="756" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">Positioning yourself in the maze of neural text generation: A task-agnostic survey</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">R</forename><surname>Chandu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">W</forename><surname>Black</surname></persName>
		</author>
		<idno type="DOI">10.48550/ARXIV.2010.07279</idno>
		<ptr target="https://arxiv.org/abs/2010.07279" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Natural language generation</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Mcdonald</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Handbook of natural language processing</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="121" to="144" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Has a consensus NL generation architecture appeared, and is it psycholinguistically plausible?</title>
		<author>
			<persName><forename type="first">E</forename><surname>Reiter</surname></persName>
		</author>
		<idno type="arXiv">arXiv:cmp-lg/9411032</idno>
		<imprint>
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Computer generation of multiparagraph English text</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">C</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Moore</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">American Journal of Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="17" to="29" />
			<date type="published" when="1981">1981</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Generating natural language under pragmatic constraints</title>
		<author>
			<persName><forename type="first">E</forename><surname>Hovy</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Pragmatics</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="689" to="719" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Levelt</surname></persName>
		</author>
		<title level="m">Speaking: From Intention to Articulation</title>
				<meeting><address><addrLine>Cambridge, MA</addrLine></address></meeting>
		<imprint>
			<publisher>MIT Press</publisher>
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Controlling a language generation planner</title>
		<author>
			<persName><forename type="first">S</forename><surname>Nirenburg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">R</forename><surname>Lesser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Nyberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IJCAI</title>
				<imprint>
			<date type="published" when="1989">1989</date>
			<biblScope unit="page" from="1524" to="1530" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">STRIPS: A new approach to the application of theorem proving to problem solving</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">E</forename><surname>Fikes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">J</forename><surname>Nilsson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial intelligence</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="189" to="208" />
			<date type="published" when="1971">1971</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Planning English sentences</title>
		<author>
			<persName><forename type="first">D</forename><surname>Appelt</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1985">1985</date>
			<publisher>Cambridge University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Approaches to the planning of coherent text</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">H</forename><surname>Hovy</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1991">1991</date>
			<publisher>Springer</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Enabling technology for multilingual natural language generation: the KPML development environment</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Bateman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Natural Language Engineering</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="15" to="55" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Sentence generation as a planning problem</title>
		<author>
			<persName><forename type="first">A</forename><surname>Koller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Stone</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/P07-1043" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 45th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Prague, Czech Republic</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="336" to="343" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Natural language generation as planning under uncertainty for spoken dialogue systems</title>
		<author>
			<persName><forename type="first">V</forename><surname>Rieser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Lemon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Empirical Methods in Natural Language Generation: Data-oriented Methods and Empirical Evaluation</title>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="105" to="120" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Generating with discourse combinatory categorial grammar</title>
		<author>
			<persName><forename type="first">C</forename><surname>Nakatsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>White</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Linguistic Issues in Language Technology</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Learning what to say and how to say it: Joint optimisation of spoken dialogue management and natural language generation</title>
		<author>
			<persName><forename type="first">O</forename><surname>Lemon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computer Speech &amp; Language</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="210" to="221" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">The graph neural network model</title>
		<author>
			<persName><forename type="first">F</forename><surname>Scarselli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">C</forename><surname>Tsoi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hagenbuchner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Monfardini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="61" to="80" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Generative adversarial nets</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">J</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pouget-Abadie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mirza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Warde-Farley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ozair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="page" from="2672" to="2680" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Sequence to sequence learning with neural networks</title>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Distributed representations of words and phrases and their compositionality</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">End-to-end memory networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sukhbaatar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Fergus</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1706.03762</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>See</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1704.04368</idno>
		<title level="m">Get to the point: Summarization with pointer-generator networks</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<monogr>
		<title level="m" type="main">GPT-4 technical report</title>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2303.08774</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b32">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rozière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Azhar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Joulin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grave</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.13971</idno>
		<title level="m">LLaMA: Open and efficient foundation language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">CommonGen: A constrained text generation challenge for generative commonsense reasoning</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">Y</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bhagavatula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ren</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/2020.findings-emnlp.165" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2020</title>
				<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1823" to="1840" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Exploring controllable text generation techniques</title>
		<author>
			<persName><forename type="first">S</forename><surname>Prabhumoye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">W</forename><surname>Black</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.coling-main.1</idno>
		<ptr target="https://aclanthology.org/2020.coling-main.1" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th International Conference on Computational Linguistics</title>
				<meeting>the 28th International Conference on Computational Linguistics<address><addrLine>Barcelona, Spain (Online)</addrLine></address></meeting>
		<imprint>
			<publisher>International Committee on Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1" to="14" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Neural natural language generation: A survey on multilinguality, multimodality, controllability and learning</title>
		<author>
			<persName><forename type="first">E</forename><surname>Erdem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kuyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yagcioglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Parcalabescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Plank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Babii</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Turuta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Erdem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Calixto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">73</biblScope>
			<biblScope unit="page" from="1131" to="1207" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Mahamood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Clinciu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gkatzia</surname></persName>
		</author>
		<title level="m">It&apos;s common sense, isn&apos;t it? Demystifying human evaluations in commonsense-enhanced NLG systems</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
