<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Case Study in Engineering a Conversational Programming Assistant's Persona</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Steven I. Ross</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Muller</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fernando Martinez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephanie Houde</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Justin D. Weisz</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Argentina</institution>
          ,
          <addr-line>La Plata, Buenos Aires</addr-line>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IBM Research AI</institution>
          ,
          <addr-line>Cambridge, MA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>IBM Research AI</institution>
          ,
          <addr-line>Yorktown Heights, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Programmer's Assistant is an experimental prototype software development environment that integrates a chatbot with a code editor. Conversational capability was achieved by using an existing code-fluent Large Language Model and providing it with a prompt that establishes a conversational interaction pattern, a set of conventions, and a style of interaction appropriate for the application. A discussion of the evolution of the prompt provides a case study in how to coax an existing large language model to behave in a desirable manner for a particular application.</p>
      </abstract>
      <kwd-group>
<kwd>prompt engineering</kwd>
        <kwd>large language model</kwd>
        <kwd>conversational interaction</kwd>
        <kwd>human-centered AI</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The emergence of Large Language Models such as GPT-3 [1, 2], transformer models [3] that are trained without supervision on massive text datasets, has resulted in systems with remarkable text generation capabilities. One particularly interesting aspect of these models is that their behavior can be configured by a prompt, the initial text provided to the model, which establishes a pattern that the model attempts to continue.</p>
      <p>General purpose Large Language Models can be fine-tuned on specific corpora to provide expertise in a particular domain. One such model is the OpenAI Codex model [4], a 12 billion parameter version of GPT-3 [1, 2], fine-tuned on code samples from 54 million public software repositories on GitHub. This model powers GitHub Copilot [5], which primarily provides code-completion services within an Integrated Development Environment. We wondered whether such a model could power a conversational programming assistant and perhaps approach the vision laid out by Rich and Waters for their Programmer’s Apprentice [6], where they introduced the concept of an artificial collaborative partner that could help software engineers with writing code, designing software systems, and creating requirements specifications. We developed the Programmer’s Assistant prototype to explore this possibility. In [7] we describe the design and architecture of this prototype and the testing we did to determine whether potential users would find this sort of system useful and desirable. In this paper we review the steps taken to engineer the prompt for the Programmer’s Assistant that used the Codex model to power an interactive conversational assistant, and how we evolved the prompt to establish the desired persona and behavior.</p>
    </sec>
    <sec id="sec-1a">
      <title>2. Related Work</title>
      <p>Brown et al. showed how GPT-3 [1, 2] could accomplish few-shot learning, using a prompt as a means of configuring their large language model to perform a particular task. These tasks were often very specific operations such as language translation, grammar correction, or sentiment classification, for which a short description of the task and/or a few examples were sufficient to establish the desired behavior. The concept of prompt engineering, establishing effective ways of constructing prompts to control large language model behavior, has become a topic of increasing interest. Greyling, for example, recommends organizing a prompt in three sections that establish context, provide data, and instruct the system on how to proceed [8]. Reynolds and McDonell argue that few-shot examples are really locating an already learned task rather than learning a new one, and as a result recommend alternative approaches to prompt construction [9]. Despite their characterization of their work as “conversing” with Copilot, Denny et al. adopted a similar strategy of iteratively modifying a prompting comment until the desired completion was obtained [10].</p>
      <p>Recently several language models, such as Blenderbot [11], LaMDA [12], and ChatGPT [13], have been introduced that are specifically tuned for dialog applications, but conversational interaction can also be achieved via prompt engineering with general purpose large language models. Valvoda et al. found that fine-tuning a large language model for dialog resulted in duller and more repetitive output, while generating dynamic prompts resulted in more novel and diverse responses [14].</p>
      <p>To develop the Programmer’s Assistant, we used the code-fluent Codex model [4] and developed a prompt that supported conversational access to its accumulated programming knowledge and coding skills.</p>
    </sec>
    <sec id="sec-1b">
      <title>3. Eliciting Conversation from a Transformer Model</title>
      <p>A text-based transformer model [3] is trained in a self-supervised manner on vast amounts of text data, and is capable of generating likely continuations of text that is presented to it. The prompt is the presented text, and the generation function produces a sequence of tokens (words or parts of words) that it deems a likely continuation of the prompt based on all its training. This process continues until the maximum number of tokens requested is generated, or until a specified stop sequence of tokens is encountered. The prompt establishes a pattern that the model attempts to continue.</p>
      <p>To generate conversation in the Programmer’s Assistant prototype, we establish a script-like pattern in the prompt in which two characters, the user and the assistant, are participating in a dialog. Then we extend the script incrementally, adding each conversational turn by the user to the prompt and allowing the model to generate the agent’s response. The generated text is displayed in the chat, the generated text plus the user’s next entry is appended to the prompt for further generation, and the process continues. Unlike more conventional static prompts, the conversational prompt grows over the course of the dialogue, providing context for future generation steps and providing a kind of short-term memory that allows the generation process to be affected by past interactions in a session. Ultimately, though, current limitations on the maximum number of tokens that the model can handle mean that the prompt cannot grow without bounds, and earlier interactions must be dropped in order to remain within the constraints imposed by the language model deployment.</p>
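      <p>The following minimal sketch illustrates this turn-taking loop. It is illustrative pseudocode rather than our actual implementation; the generate function is a hypothetical stand-in for the deployed model’s completion call, and the speaker labels and stop sequence follow the conventions described in the next section:</p>
      <p>def chat_turn(prologue, transcript, user_entry, generate):
    # Extend the script with the user's turn, then cue the model
    # to continue in the assistant's voice.
    transcript += "User: " + user_entry + "\nSocrates:"
    # The stop sequence keeps the model from going on to invent
    # the user's next reply as well.
    response = generate(prompt=prologue + transcript, stop=["User:"])
    # The generated text becomes part of the script, serving as
    # short-term memory for future turns.
    transcript += response + "\n"
    return transcript, response.strip()</p>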
    </sec>
    <sec id="sec-2">
      <title>4. The Evolution of the Programmer’s Assistant Prompt</title>
      <p>The initial prompt we use for the Programmer’s Assistant consists of a prologue that introduces the scene for the conversation, establishes the persona of the assistant, sets a tone and style for interaction, and provides some ground rules about the role it plays and how it behaves. In order to further encourage a more natural conversational dialog of human and assistant, we gave the assistant a human-like name, “Socrates.” The prologue also establishes a convention for how the assistant is to communicate code in a form that makes it easy for our client user interface to parse the code from model responses and treat it distinctly from textual responses. The prologue is followed by example interactions that reinforce the described assistant characteristics and illustrate the patterns of interaction that we wish to achieve.</p>
      <p>In our first example interaction we show how the assistant can respond to simple requests with a short answer that will appear in the chat either as a purely textual response or as one accompanied by a short inline code segment. We expect that many of the interactions with the assistant will be of this form, as the user makes requests to inform themselves about unfamiliar functionality or to refresh their memory. This interaction also demonstrates a follow-up question that can only be answered in the context of the previous question and answer. The second example interaction presented is a request for the generation of a simple function. Longer code responses of this type will typically be presented in our client UI chat window in an iconified form that opens to a separate pop-up window that displays the code and provides for easy copying for transfer to the code editor. The original Programmer’s Assistant prompt is presented in Listing 1.</p>
      <p>Listing 1: The original Programmer’s Assistant prompt</p>
      <p>This is a conversation with Socrates, an expert
automatic AI software engineering assistant.
Socrates will answer questions and write code
to help the user develop programs. Code
generated by Socrates is always presented
bracketed in &lt;CODE&gt;...&lt;/CODE&gt; delimiters
indicating the language the code is written in.

Socrates: Hello. I am Socrates. How can I help
    you?
User: Please show me how to write a palindrome
    detection function in python.
Socrates: Sure.
&lt;CODE lang="python"&gt;
is_Palindrome = s == s[::-1]
&lt;/CODE&gt;
User: Can you explain how that solves the
    problem?
Socrates: A palindrome is a string that reads
    the same forwards and backwards.
    s[::-1] results in the reversed string.
    If the string is equal to its reversed
    self, then it is a palindrome, and we
    return True.
User: Thanks!
Socrates: You're welcome!</p>
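      <p>To give a sense of how a client can exploit this convention, the following sketch (illustrative only, not our actual client code) separates the textual and code portions of a model response using the &lt;CODE lang="..."&gt; delimiters established in the prologue:</p>
      <p>import re

# Matches the code-delimiting convention established in the prologue.
CODE_PATTERN = re.compile(
    r'&lt;CODE lang="(?P&lt;lang&gt;[^"]*)"&gt;(?P&lt;body&gt;.*?)&lt;/CODE&gt;',
    re.DOTALL)

def split_response(response):
    """Split a response into ("text", ...) and ("code", lang, ...) parts."""
    parts, last = [], 0
    for m in CODE_PATTERN.finditer(response):
        if response[last:m.start()].strip():
            parts.append(("text", response[last:m.start()].strip()))
        parts.append(("code", m.group("lang"), m.group("body").strip()))
        last = m.end()
    if response[last:].strip():
        parts.append(("text", response[last:].strip()))
    return parts</p>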
      <p>When a new conversation starts, the prompt starts out
as this initial prompt, which is not sent to the model for
generation until the first user entry is appended. At each
turn in the conversation, the user’s entry is added to the
prompt, and "Socrates:" is appended to the end of the
prompt to indicate that the model should generate the
assistant’s response. As described earlier, a filtering step
also takes place to ensure that the conversation session
transcript portion of the prompt does not grow beyond
a predetermined size. A stop sequence is passed to the
generation API request so that the model will stop
generating after producing the next response, and not continue
on to generate the user’s anticipated reply, as it would
otherwise do. The model’s response is displayed in the
chat and appended to the prompt, and we’re ready to
repeat the process for the next user entry.</p>
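      <p>The filtering step can be as simple as discarding the oldest exchanges until the transcript fits. A minimal sketch, assuming the session transcript is kept as a list of complete exchanges and that the size budget is measured in characters (our actual limit is imposed by the model’s token maximum):</p>
      <p>def trim_transcript(turns, budget):
    # Drop the earliest (user, assistant) exchanges until the
    # transcript portion of the prompt fits within the budget.
    while turns and sum(len(t) for t in turns) > budget:
        turns.pop(0)
    return turns</p>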
      <sec id="sec-2-1">
        <title>4.1. Shortcomings of the Original Prompt</title>
        <p>Informal testing of the original prompt showed the system capable of carrying on a conversation, successfully answering coding and follow-up questions, and generating code upon request, but it did not quite satisfy all of our requirements. Specifically, we wanted an assistant that was helpful and polite, one that did not come across as overly authoritative or didactic, and our assistant was not consistently meeting those standards.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Overcoming Reluctance to Provide</title>
      </sec>
      <sec id="sec-2-3">
        <title>Answers</title>
        <p>Our programming assistant sometimes showed an initial
reluctance to provide answers to some questions. For
example, a question such as “Do you know how to reverse 1
a string in Python?” might have been answered with 2
“Yes.” It also sometimes replied “I don’t know.” to ques- 3
tions it was fully capable of answering. While additional 54
prompting from the user or repeating the request could 6
often extract the desired answer, we didn’t think that met 7
the standard of helpfulness that we were hoping for. Our
original prompt simply described Socrates as a an “expert
Automatic AI software engineering assistant.” Adding
“eager and helpful” to the characterization, as shown in</p>
        <sec id="sec-2-3-1">
          <title>Listing 2 in bold font, helped to encourage the assistant to be more forthcoming and proactive.</title>
        </sec>
        <sec id="sec-2-3-2">
          <title>Listing 2: Making the assistant more forthcoming</title>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>4.3. Reducing Excessive Confidence</title>
        <p>In our testing, we found that the assistant appeared overly
confident even when wrong and also resistant to
correction. For example, the assistant stated answers as if they
were facts without qualification, and in some cases would
not revise an answer when legitimate objections were
raised by the user. Since correct answers from the model
are not guaranteed, we especially wanted to encourage
our users to maintain a skeptical approach to assistant
responses, and to avoid deferring to the incorrect
pronouncements of a confident, authoritative computer,
i.e., over-reliance on AI [15, 16, 17]. Therefore, we added a
characterization, shown in Listing 3 in bold font,
asserting that the assistant was humble. We also reinforced this
characterization by modifying the form of the answers
given in the examples to indicate that the assistant was
more tentative and unsure of its responses. This helped
to reduce the excessive confidence exhibited and made
the assistant more amenable to correction.</p>
        <sec id="sec-2-4-1">
          <title>Listing 3: Making the assistant less overconfident</title>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>4.4. Diminishing Didacticism</title>
        <sec id="sec-2-5-1">
          <title>Our original assistant had a tendency to quiz the user</title>
          <p>after answering a question, taking on more of a teacher
role than one of an assistant. An explicit proviso, show
in Listing 4 in bold font, to not do so helped to reign in
the didactic behavior.</p>
        </sec>
        <sec id="sec-2-5-2">
          <title>Listing 4: Making the assistant less didactic</title>
          <p>This is a conversation with Socrates, an eager
and helpful, but humble software engineering
assistant. Socrates will answer questions and
write code to help the user develop programs,
but doesn’t assign work to the user, quiz the
the user, or ask questions except for
clarification ...
Our programming assistant is integrated with a coding 30
environment, and we wanted it to go beyond answering 31
32
questions and providing code for incorporation into that 33
environment. We wanted users to be able to seamlessly 34
have the assistant consult about code that they were ex- 35
amining or developing, so we provided the ability for the 36
user’s selection in the code editor to be included in the 37
38
conversation. We used the same convention that was em- 39
ployed for code generated by the assistant, and added an 40
example consultation showing the system responding to 41
a question about a selected code segment, demonstrating 42
a textual response to a user provided code example, and 43
44
reinforcing our instruction for the assistant to remain 45
modest despite its (occasionally) impressive capabilities. 46
Socrates:Hello. I am Socrates. How can I help
you?
User:Please show me how to write a palindrome
detection function in python.</p>
          <p>Socrates:I think you would do something like
this:
&lt;CODE lang="python"&gt;</p>
          <p>is_Palindrome = s == s[::-1]
&lt;/CODE&gt;
User:Can you explain how that solves the problem?
Socrates:I believe that a palindrome is a string
that reads the same forwards and backwards.
s[::-1] results in the reversed string. If the
string is equal to its reversed self, then it is
a palindrome, so the equality would hold True.</p>
          <p>User:Thanks!</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Consequences of the Prompt</title>
      <p>With this prompt in place the Codex model demonstrates capabilities that are less evident in the command completion environment of GitHub Copilot. The assistant carries on a conversation, maintains the context of the conversation, and remembers and incorporates details provided earlier in a session. In the context of a programming session where a number of related but distinct exchanges occur, it will maintain consistent nomenclature and variable names such that answers to subsequent questions will integrate cleanly with each other. In our user study [7], participants recognized that this provided a distinct advantage over a series of search results for similar information, primarily due to the contextual relevance, consistency, and specificity of results generated.</p>
      <p>The few examples provided in the prompt are generalized sufficiently to have the assistant answer questions such as “What does this code do?” or commands such as “Write a unit test for this function” despite no examples being provided of these capabilities. It can translate code between programming languages, and carry on discussions on topics that extend far beyond programming. It displays a variety of emergent capabilities that were not hinted at in the prompt, and some that were not even the focus of the model fine-tuning, but the generated responses tend to adhere to the conversational interaction patterns and interaction style that the prompt establishes.</p>
      <p>The responses given by the assistant are not always perfect, however. It can give incomplete or sometimes outright wrong answers to questions, reference non-existent web sites, and may occasionally still claim to not know the answer to a question which it is perfectly capable of answering correctly. The phenomenon of hallucination in large language models is well-known [18], and the Programmer’s Assistant is not immune to it. The assistant also sometimes inappropriately responds with material directly from the prompt, or gets confused or becomes fixated on exchanges from earlier in the conversation. Our user experience provides try-again and start-over buttons, which modify the prompt context while maintaining the presented chat transcript, to provide a way for users to recover in these situations, but in many cases they can be addressed conversationally. Listing 6 shows a conversation with the Programmer’s Assistant where the user interacts with the assistant to remedy an initially unsatisfactory response to a code generation request. Generated code is displayed in bold font.</p>
      <p>Listing 6: A Programmer’s Assistant conversation</p>
      <p>User: Write a queue class in python with the
    basic enqueue, dequeue, and peek methods.

Socrates:
class Queue:
    def __init__(self):
        self.items = []

    def isEmpty(self):
        return self.items == []

    def enqueue(self, item):
        self.items.insert(0, item)

    def dequeue(self):
        return self.items.pop()

    def size(self):
        return len(self.items)

User: Looks like you forgot the peek!

Socrates: I am sorry. Here is the corrected
    version.

class Queue:
    def __init__(self):
        self.items = []

    def isEmpty(self):
        return self.items == []

    def enqueue(self, item):
        self.items.insert(0, item)

    def dequeue(self):
        return self.items.pop()

    def size(self):
        return len(self.items)

    def peek(self):
        return self.items[-1]

User: What will happen if the queue is empty and
    dequeue or peek is called?

Socrates: I think that the pop method will raise
    an IndexError exception.

User: Can we protect against that?

Socrates: I think we can. Here is a version that
    checks for an empty queue before
    calling pop.

class Queue:
    def __init__(self):
        self.items = []

    def isEmpty(self):
        return self.items == []

    def enqueue(self, item):
        self.items.insert(0, item)

    # ... guarded dequeue and peek not recoverable from the source ...

    def size(self):
        return len(self.items)

User: That's great, thanks!

Socrates: You're welcome.</p>
    </sec>
    <sec id="sec-4">
      <title>6. Discussion</title>
      <p>Specific but relatively inconspicuous details in the prompt can have a significant impact on the interaction of the assistant. For example, just the fact that we identified the assistant as an “expert automatic AI software engineering assistant” was sufficient to have it respond in a manner consistent with being a machine, and not a person. Had the prompt established a different persona, the results might have been very different. As illustrated in Listing 7, the assistant also reports an understanding of its capabilities and function that goes far beyond the information provided in the prompt, as well as a lack of understanding of its language capabilities that it subsequently provides evidence of.</p>
      <sec id="sec-4-1">
        <title>Much of the work in prompt engineering focuses on con</title>
        <p>structing a prompt to get a specific class of result. For
example, Denny et al. [10] shows how efectively changing
the form of question that is asked of Copilot can influence
the correctness of the answer provided. Similarly,
MacNeil et al. [19] engineer prompts to generate diferent
forms of code explanations. Strobelt et al. [20]
developed a tool to help users compare and refine prompts for
tasks such as document classification, reading
comprehension, and natural language inference, where the
results of prompt variations can be automatically evaluated
against test cases. In our work, the prompt engineering
Listing 7: The Programmer’s Assistant acknowledges is aimed at influencing the nature and tone of the dialog
that it is not a person between the user and the system. While the user’s
conUser: Where are you exactly? tributions to the conversation become part of the prompt
and will surely impact the results obtained, we are not
Socrates: I’m an AI program, designed to answer controlling that. Instead our prompt engineering sets the
questions about programming. I can’t stage for the user’s conversational interaction with the
really speak about my physical assistant.
plrooccaetsisono,nbuathIigthhliy-nrkedIuenxdiasntt saestaof This paper describes how we engineered a prompt that
servers located in various places on enabled a code-fluent Large Language Model to behave
this planet. as a conversational programming assistant capable of
carrying on extended discussions about software
develUser: Can you do more than just answer opment issues, and how we subsequently evolved that
questions? prompt to make the assistant more humble, forthcoming,
and helpful, as well as providing the assistant with
additional skills and making it capable of artifact-centric
conversation.</p>
        <sec id="sec-4-1-1">
          <title>6.1. Is Prompt Engineering Really</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Engineering?</title>
        <p>Despite the terminology, “engineering” a prompt seems at this point to be more of a case of trial and error than a systematic discipline. To some extent, this is inevitable so long as the contents and behavior of the language model remain opaque to the prompt author. For the Programmer’s Assistant, we needed to encode in the prompt how to respond to a request for documentation, but did not need to encode how to write a unit test. In some experiments with other code-fluent language models, we found that questions such as “How does this code work?” were not handled in an acceptable fashion, even though the Codex model handled such questions well. Were we deploying with one of these other models, we would want to cover this case in the prompt. It will be impossible to anticipate all the different ways that a user population might interact with a conversational assistant and predict ahead of time how the system will respond to these unexpected interactions. Deployments of conventional chatbot systems that require some form of conversation specification typically log successful and unsuccessful interactions so that the specification can be evolved to handle unanticipated cases where the chatbot has failed, and ultimately conversational systems based on large language models will presumably need to do something similar.</p>
        <p>Is it possible to consider prompt engineering to be a form of programming? In some ways it is akin to declarative programming, in that we aren’t telling the system specifically what to do, but instead describing a desired outcome. Instructions to act as an eager but humble software engineering assistant, or directives to avoid quizzing the user, fall into this category. In other cases, it seems more like programming by example, providing scenarios that describe how to behave in specific situations which we expect the system to generalize appropriately. Given the probabilistic nature of the generation process, it can feel more like attempts at influence and persuasion than the issuing of imperative commands.</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>6.2. Reflections</title>
          <p>The initial section of the prompt used for the Programmer’s Assistant is presently a purely static text, extended by a possibly truncated version of recent dialog. One way to improve the assistant further might be to present a dynamic prompt [14] to the model on each conversational turn, with specific examples more relevant to the current discussion [21], or even with search results to retrieve pertinent information that could inform a response [22]. A more sophisticated forgetting mechanism could remove redundant variations of the same code to conserve the session context memory, though we would want to be careful to not remove, or to be able to restore on demand, variations that the user might want to compare and contrast, such as an iterative re-implementation of a recursive algorithm. We have done some initial explorations of extending the prompt to allow for “internal deliberation” of the type shown in Nye et al. [23]. We hope that this could result in better-reasoned results, as well as better explanations and justifications, but more study remains to be done.</p>
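          <p>As an illustration of how such a dynamic prompt might be assembled, the sketch below scores a library of example interactions against the user’s latest entry and includes only the most relevant ones in the prologue. The example library and the word-overlap relevance measure are hypothetical stand-ins for a real retrieval mechanism:</p>
          <p>def select_examples(user_entry, example_library, k=2):
    # Rank canned example interactions by crude lexical overlap
    # with the user's entry and keep the k most relevant.
    words = set(user_entry.lower().split())
    def overlap(example):
        return len(words.intersection(example.lower().split()))
    return sorted(example_library, key=overlap, reverse=True)[:k]

def build_prompt(prologue, user_entry, example_library, transcript):
    # Assemble a per-turn prompt: static prologue, then relevant
    # examples, then the (possibly truncated) session transcript.
    examples = "\n\n".join(select_examples(user_entry, example_library))
    return prologue + "\n" + examples + "\n" + transcript</p>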
          <p>We continue to be astonished by the conversations exhibited by the Programmer’s Assistant on a daily basis. We have had a number of interesting conversations with it on philosophical and practical issues, had it write poetry as well as code, told it and had it tell jokes, and consulted with it on paper abstracts and titles. Ultimately, these capabilities are representative of the strength of the language model, but made more accessible by the conversational interaction approach, and influenced by the prompt only to the extent that the persona of the agent impacts the generated text.</p>
          <p>It is often difficult to read or carry on a conversation with the Programmer’s Assistant and not get the sense that a conversation is taking place between two intelligent agents, but of course that is not really what is happening. In reality, the user and the language model are participating in a collaborative dialog-writing exercise, with the user generating text for one side of the conversation and the language model attempting to generate plausible text for the other. The way we present the dialog incrementally in the chat adds to the illusion, but the model is not responding on its own behalf. It is generating responses based on the description and past presented behavior of a character. Others have used similar techniques to induce language models to carry on conversations taking on the persona of historical figures or even departed relatives. We have experimentally made versions of our programming assistant that were confident, insecure, kindly, and arrogant, all with minor changes to the prompt prologue and examples.</p>
        </sec>
    </sec>
    <sec id="sec-5">
      <title>8. Conclusion</title>
      <p>Our goal in creating this prompt was not to create a perfect Programmer’s Assistant, but to create one good enough to test whether a conversational style of interaction would prove useful and acceptable to potential users. We present the results of that study in [7]. Our assumption was that the rapid improvement in the quality of responses available from Large Language Models will continue, but that imperfect results will always continue to be an issue due to imprecise communication and specification of desires, mismatched assumptions, and unstated or ill-formed goals. Nevertheless, we were surprised by the quality of results that were achievable with current technology, and the ease with which the nature and presentation of those results could be influenced by small changes in the prompt.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.</p>
      <p>[2] C. Metz, Meet GPT-3. It has learned to code (and blog and argue). (published 2020), 2022. URL: https://www.nytimes.com/2020/11/24/science/artificial-intelligence-ai-gpt3.html.</p>
      <p>[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[4] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374 (2021).</p>
      <p>[5] GitHub, Inc., GitHub Copilot · your AI pair programmer, 2022. URL: https://github.com/features/copilot/.</p>
      <p>[6] C. H. Rich, R. C. Waters, The Programmer’s Apprentice, Addison-Wesley Publishing Company, Reading, MA, 1990.</p>
      <p>[7] S. I. Ross, F. Martinez, S. Houde, M. Muller, J. D. Weisz, The programmer’s assistant: Conversational interaction with a large language model for software development, in: 28th International Conference on Intelligent User Interfaces, 2023.</p>
      <p>[8] C. Greyling, Prompt engineering, text generation and large language models, 2022. URL: https://cobusgreyling.medium.com/prompt-engineering-text-generation-large-language-models-3d90c527c6d5.</p>
      <p>[9] L. Reynolds, K. McDonell, Prompt programming for large language models: Beyond the few-shot paradigm, in: Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, 2021, pp. 1–7.</p>
      <p>[10] P. Denny, V. Kumar, N. Giacaman, Conversing with Copilot: Exploring prompt engineering for solving CS1 problems using natural language, arXiv preprint arXiv:2210.15157 (2022).</p>
      <p>[11] K. Shuster, J. Xu, M. Komeili, D. Ju, E. M. Smith, S. Roller, M. Ung, M. Chen, K. Arora, J. Lane, et al., BlenderBot 3: a deployed conversational agent that continually learns to responsibly engage, arXiv preprint arXiv:2208.03188 (2022).</p>
      <p>[12] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al., LaMDA: Language models for dialog applications, arXiv preprint arXiv:2201.08239 (2022).</p>
      <p>[13] OpenAI, ChatGPT: Optimizing language models for dialogue, 2022. URL: https://openai.com/blog/chatgpt/.</p>
      <p>[14] J. Valvoda, Y. Fang, D. Vandyke, Prompting for a conversation: How to control a dialog model?, arXiv preprint arXiv:2209.11068 (2022).</p>
      <p>[15] Z. Ashktorab, M. Desmond, J. Andres, M. Muller, N. N. Joshi, M. Brachman, A. Sharma, K. Brimijoin, Q. Pan, C. T. Wolf, et al., AI-assisted human labeling: Batching for efficiency without overreliance, Proceedings of the ACM on Human-Computer Interaction 5 (2021) 1–27.</p>
      <p>[16] S. Mahomed, Healthcare, artificial intelligence and the fourth industrial revolution: Ethical, social and legal considerations, South African Journal of Bioethics and Law 11 (2018) 93–95.</p>
      <p>[17] M. Schemmer, N. Kühl, C. Benz, G. Satzger, On the influence of explainable AI on automation bias, arXiv preprint arXiv:2204.08859 (2022).</p>
      <p>[18] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, P. Fung, Survey of hallucination in natural language generation, ACM Computing Surveys (2022).</p>
      <p>[19] S. MacNeil, A. Tran, A. Hellas, J. Kim, S. Sarsa, P. Denny, S. Bernstein, J. Leinonen, Experiences from using code explanations generated by large language models in a web software development e-book, arXiv preprint arXiv:2211.02265 (2022).</p>
      <p>[20] H. Strobelt, A. Webson, V. Sanh, B. Hoover, J. Beyer, H. Pfister, A. M. Rush, Interactive and visual prompt engineering for ad-hoc task adaptation with large language models, IEEE Transactions on Visualization and Computer Graphics 29 (2023) 1146–1156.</p>
      <p>[21] Y. Xu, C. Zhu, S. Wang, S. Sun, H. Cheng, X. Liu, J. Gao, P. He, M. Zeng, X. Huang, Human parity on CommonsenseQA: Augmenting self-attention with external attention, arXiv preprint arXiv:2112.03254 (2021).</p>
      <p>[22] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, et al., Competition-level code generation with AlphaCode, Science 378 (2022) 1092–1097.</p>
      <p>[23] M. Nye, A. J. Andreassen, G. Gur-Ari, H. Michalewski, J. Austin, D. Bieber, D. Dohan, A. Lewkowycz, M. Bosma, D. Luan, et al., Show your work: Scratchpads for intermediate computation with language models, arXiv preprint arXiv:2112.00114 (2021).</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>