Prompt Series Part I
A prompt is a cue or instruction that initiates a response or action, often used to guide or inspire writing, thinking, or conversation. A prompt can be pretty straightforward (e.g. ‘List all main characters of this book.’), require considerably more additional information (e.g. ‘What ended in 1964?’) or be deliberately open-ended (e.g. ‘Why do you think people would want to live near power plants?’). Writing good prompts is essential for teachers to stimulate students’ thinking, encourage creativity, and promote deeper learning. Effective prompts help guide students in their responses, provide clear expectations, and open up opportunities for critical reflection or exploration.
When formulating a prompt, teachers are advised to:
- Provide any necessary context, definitions, or background information needed to understand the prompt fully.
- Ensure the prompt is easy to understand and free of ambiguity. Clearly state what is expected, using precise language.
- Give students a sense of the scope by indicating the expected length or depth of the response. This helps them gauge how much detail to provide.
- Break complex prompts into smaller, manageable parts or guide students through steps to build their responses gradually.
- Pilot your prompts with a small group of students or colleagues to see how they interpret and respond to them. Use their feedback to refine the prompt for clarity, complexity, and engagement.
Prompt engineers reading this are probably thinking: What?! That is exactly what I am doing when I try to get an accurate answer from a language model. Well, yes, there actually is an overlap between writing prompts for human beings and writing them for artificial agents. Well-designed prompts can significantly influence the performance of Language Models (LMs)1 and of humans. For people who are not yet familiar with prompt engineering:
What is Prompt Engineering?
A prompt here is a text input that is made available to an LM in order to initiate a response.2 It provides the context or direction for the output of the model.3 Prompts are therefore a practical strategy for enabling models to quickly adapt to new domains and tasks. Prompt engineering is the process of developing and refining these prompts. Reynolds and McDonell, 2021, even describe prompt engineering as ‘programming in natural language’. The task of a prompt engineer is to create the design and content of the prompt.4 This can be done manually, by designing a handcrafted prompt, or automatically. We will tell you more about this at a later stage. In general, a prompt engineer aims to create prompts that guide the model to produce useful, relevant and accurate results.
What kind of Tasks are we actually talking about?
In the following parts of this series we will mostly refer to prompt engineering with regard to natural language tasks. Language plays a fundamental role in communication between people and in their interaction with machines. There is therefore an increasing need to develop models that can perform complex tasks in natural language. Language Models are computer models that can process and generate text; this field is known as Natural Language Processing (NLP). Typical NLP tasks are listed below (a few of them are sketched as simple prompts right after the list):
- Text classification: either by sentiment (positive, negative etc.), topic (e.g. sports, politics) or spam (yes or no)
- Named Entity Recognition (NER): identifying and classifying key entities in text, such as brands, locations, characters, etc.
- Part-of-Speech (POS) Tagging: assigning parts of speech (like noun, verb, adjective) to each word in a sentence
- Translation: translating text from one language to another (e.g., Latin to English)
- Text Summarization: creating a summary of a longer text by extraction or abstraction while preserving its key information
- Text generation: generating human-like text based on a given input, such as completing sentences, creating newspaper articles, or writing fantasy stories
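To make this concrete, here is a minimal sketch of how a few of these tasks can be phrased as zero-shot prompts. The wording of the prompts and the `generate` function are our own illustration; `generate` is only a placeholder for whatever LM call you actually use.

```python
# Minimal sketch: phrasing typical NLP tasks as zero-shot prompts.
# `generate` is only a placeholder for an actual LM call (API or local model).

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your language model here")

sentiment_prompt = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: 'The plot dragged, but the ending surprised me.'\n"
    "Sentiment:"
)

ner_prompt = (
    "List all locations mentioned in the following sentence.\n"
    "Sentence: 'Alma drove from Dublin to Galway to visit the farm.'\n"
    "Locations:"
)

summary_prompt = (
    "Summarize the following text in one sentence, preserving its key information.\n"
    "Text: <insert a longer text here>\n"
    "Summary:"
)

# e.g. answer = generate(sentiment_prompt)
```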
We do most of these tasks every day, whether we are in school, at work or in research. When we are in school, a teacher often prompts us directly, specifying what is demanded. Once we are on our own, we are prompted by colleagues, our boss or ourselves. In the latter case we need to gather insights, collect context information, define a goal and might start searching for examples ourselves. This can be a lot of work. Think about gathering all the requirements for writing a good story before we even start writing it. Most professors ask you to find a valid research question, including an abstract, before you start writing a bachelor thesis. Regardless of how the prompt is brought to us, its quality impacts how well we do at solving the related task. Designing a prompt therefore might be as important as solving the prompted task.
How to design a good Prompt?
Prompt engineering would not be a science of its own if it were that easy. 😉 While educators learn to design good prompts in dedicated didactics courses, general and helpful guidelines on how to write prompts for machines can be found on AI providers’ websites, in general prompting guides such as www.promptingguide.ai5 or by keeping track of the newest scientific research. The latter is what we have done. In doing so we found that some prompts lead to more accurate results than others, but there are still seemingly uncontrollable factors. Performant prompts do not perform consistently across all models.6 Each model seems to be a student with specific characteristics and needs. Further, even the order of semantic building blocks, e.g. the order in which you present exercise examples to the LM, can influence the outcome across different LMs. In the next few articles of this series we will discuss each piece of advice given to teachers and evaluate whether it also plays a role in designing good prompts. In this article let us first evaluate the following fundamental question:
Is Designing a Prompt for LMs the same as for Humans?
In 2021, Reynolds and McDonell evaluated prompts from the perspective of natural language and of how prompts are formulated in it. When designing a prompt, the same considerations regarding tone, implication, plausibility, style and ambiguity, among others, must be made as for prompts addressed to other people, since models such as GPT-3 are trained on natural language.
Various components that a prompt can contain are summarised under www.promptingguide.ai. You will notice that you have seen most of these components before in prompts assigned to you in school or even at work. An instruction is a clear and concise command that tells the model what to do (e.g. ‘Calculate how many chickens and donkeys live on the farm.’). The context provides additional background information or details (‘Chickens and donkeys live on the farm. Alma counts 245 heads and 650 legs.’). Exercise examples are specific input-output pairs that illustrate the desired result; this could be an example of Heidi, who has counted goats and groundhogs on her farm before. Lately there has been a lot of research on the order, number and fit of such exercise examples, especially in the scope of in-context learning. A prompt containing no exercise examples is often referred to as a zero-shot, one example as a one-shot, and multiple examples as a few-shot prompt. Additionally, there are instructional prompts that work fine without any examples because a clear instruction provides a functional keyword: e.g. ‘translate this text to Dutch’ (keyword: translate) or ‘sort animals by mean size’ (keyword: sort).7 Specifications or task foci are detailed and descriptive prompts that guide the model more precisely (e.g. ‘Write a job advert and make sure the tone is formal.’). To clearly distinguish between individual sections of a prompt, separators (e.g. ‘### context ###’) can be used. An output indicator can also be specified that describes the expected format or structure of the response (‘output as chickens: [number of chickens], donkeys: [number of donkeys]’). Finally, we have the option of specifying a target group for which the response to the prompt is intended. As an example you could prompt: “What is Cloud Computing? Explain it like I’m five.”
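As a rough sketch, these components can also be assembled programmatically. The ‘###’ separators match the ones above, while the helper function, the exact field names and the worked one-shot example are our own illustration rather than a fixed format.

```python
# Sketch: assembling a prompt from the components described above
# (instruction, context, exercise examples, output indicator, separators).
# The '###' separators, field names and the worked example are illustrative only.

def build_prompt(instruction, context, examples, output_indicator):
    parts = [f"### instruction ###\n{instruction}",
             f"### context ###\n{context}"]
    for example_input, example_output in examples:  # empty list -> zero-shot prompt
        parts.append(f"### example ###\n{example_input}\n{example_output}")
    parts.append(f"### output format ###\n{output_indicator}")
    return "\n\n".join(parts)

prompt = build_prompt(
    instruction="Calculate how many chickens and donkeys live on the farm.",
    context="Chickens and donkeys live on the farm. Alma counts 245 heads and 650 legs.",
    examples=[(  # a one-shot example in the spirit of Heidi's farm; animals and numbers are our own
        "Heidi counts 12 heads and 40 legs among the geese and goats on her farm.",
        "geese: 4, goats: 8",
    )],
    output_indicator="chickens: [number of chickens], donkeys: [number of donkeys]",
)
print(prompt)
```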
Prompts can only contain a limited number of tokens. A token can be a complete word, a part of a word or even just a single character. 100 tokens correspond to approximately 75 words in the English language.8 Depending on the GPT model, up to 128,000 tokens can be distributed across the prompt and the corresponding response.9 A careful consideration of each character within a prompt can therefore be necessary. For human beings the size of the prompt also matters: humans have limited working memory, just like machines, and can only consume so many words at once. Anyone working in an agile environment has been confronted with those stories on a board that seem to have no end, with a set of unidentifiable subtasks hidden in a mass of text. If tasks are not related, we advise you to prompt them separately. If a complex task, however, is broken down into several steps using step-by-step processes such as debating, planning or sequential reasoning, this can lead to a more efficient use of the available working memory. How exactly you split the prompt into steps depends on how much the following steps rely on the context of the previous ones. The maximum number of tokens does not have to be used within one input: in ChatGPT, for example, you can enter text and press send multiple times, and a discussion on one topic still counts as one prompt. Once the maximum number of tokens has been reached, though, the first tokens run out of scope and no longer have an impact on the ongoing conversation.
In addition, Reynolds and McDonell, 2021, present metaprompt programming, in which so-called metaprompts enable models to independently generate further useful prompts for solving the respective task. The use of metaprompts can enable models to effectively solve problems using their own task-specific instructions without the need for training examples, which in turn consumes fewer tokens. Examples of a metaprompt would be ‘Let’s solve the problem by breaking it down into several steps’, ‘List the pros and cons before you make a decision’ or ‘Ask questions about the topic before you try to answer the question’. Does this remind you of a typical exam question in the social sciences?!
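Coming back to the token budget, a minimal back-of-the-envelope sketch based on the figures above (roughly 100 tokens per 75 English words, a 128,000-token window) might look as follows. For exact counts you would use the model’s own tokenizer instead of this word-count heuristic; the reserve for the answer is an assumption of ours.

```python
# Back-of-the-envelope token budgeting, using the ~100 tokens per 75 words
# rule of thumb from the text. For exact numbers, use the model's tokenizer.

CONTEXT_WINDOW = 128_000        # maximum tokens shared by prompt and response
TOKENS_PER_WORD = 100 / 75      # rough English-language estimate (~1.33)

def estimate_tokens(text: str) -> int:
    return round(len(text.split()) * TOKENS_PER_WORD)

def fits_in_window(prompt: str, reserved_for_answer: int = 1_000) -> bool:
    """Check whether a prompt leaves enough room for the expected answer."""
    return estimate_tokens(prompt) + reserved_for_answer <= CONTEXT_WINDOW

long_prompt = "word " * 90_000   # ~120,000 estimated tokens
print(estimate_tokens(long_prompt), fits_in_window(long_prompt))
```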
Another typical approach, which you might remember from the last couple of minutes before submitting an exam, is verification: you carefully look through your answers, reflect, add and adjust. Among others, Wang et al., 2023, showed improvements in performance when using self-verification methods, where the language model has to reflect on its answers afterwards.10 Imagine the model extracted OASIS as a sight in London because everybody seems to go there lately and buy tickets for it. You might ask the model whether OASIS really is a sight, and by taking a deeper look it might recognize that OASIS is in fact a band. Another example of reflection before final submission is the ‘take a deep breath’ strategy: the language model is explicitly instructed to ‘take a deep breath’ before answering, which encourages deeper reasoning or consideration. This technique aims to enhance the quality and coherence of the model’s output by simulating a moment of reflection, leading to more accurate and thorough answers.11
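A minimal sketch of such a self-verification step could look like the following; the two-pass wording is our own and `generate` again only stands in for an actual model call.

```python
# Sketch of a simple self-verification step: ask, then let the model
# check its own answer before the final submission.
# `generate` is a placeholder for a real LM call.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your language model here")

def answer_with_verification(question: str) -> str:
    draft = generate(question)
    verification_prompt = (
        f"Question: {question}\n"
        f"Proposed answer: {draft}\n"
        "Check each item of the answer. Is it correct and does it really "
        "answer the question? If not, give a corrected answer; otherwise repeat it."
    )
    return generate(verification_prompt)

# e.g. answer_with_verification("List three sights in London.")
# A verification pass could catch that 'OASIS' is a band, not a sight.
```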
Furthermore, we use analogies in human communication, where memetic concepts such as a character serve as a proxy for an intention. Example: What would your grandma Holly think if she knew about this? Every person will give you an individual answer to one and the same prompt, shaped by their subjective experiences, opinions and thoughts. That is why we keep gathering answers to the same problem from various people or try to adopt different perspectives, in order to get the answer we want to hear, to confuse ourselves further, or actually to find a well thought-out answer. According to Reynolds and McDonell, 2021, GPT-3 has the ability to simulate well-known figures such as Mahatma Gandhi or Margaret Atwood, which provides access to various biases and cultural information on e.g. moral issues.12 Tip: Just tell ChatGPT it is Barney Stinson and then ask it for dating tips again.
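If you want to play with this, a simple way is to prepend a persona to one and the same question and compare the answers. The sketch below is purely illustrative (the personas are our own choices) and reuses the placeholder `generate` call.

```python
# Sketch: asking the same question from different simulated perspectives.
# `generate` is a placeholder for a real LM call; the personas are illustrative.

def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your language model here")

question = "Why do you think people would want to live near power plants?"
personas = ["a cautious city planner", "an enthusiastic real-estate developer",
            "grandma Holly"]

for persona in personas:
    prompt = f"Answer the following question as {persona} would.\n{question}"
    # answer = generate(prompt)  # collect one answer per perspective and compare
```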
Designing precise, unambiguous and performant prompts is not easy. Designing prompts for artificial agents is at least as exhausting as doing the same for humans if you aim to bring them to peak performance. So basically, educators are well-trained programmers of natural language. Even though machines are obviously not at all like students in school, there are still major similarities, and from our point of view research on how to write good prompts in education can guide the study of prompt engineering.
- Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models, 2021. ↩︎
- Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020. ↩︎
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. ↩︎
- Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for GPT-3? In Eneko Agirre, Marianna Apidianaki, and Ivan Vulić, editors, Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online, May 2022. Association for Computational Linguistics. ↩︎
- https://www.promptingguide.ai/ ↩︎
- Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland, May 2022. Association for Computational Linguistics. ↩︎
- Timo Schick and Hinrich Schütze. True Few-Shot Learning with Prompts—A Real-World Perspective. Transactions of the Association for Computational Linguistics, 10:716–731, 06 2022. ↩︎
- Ethan Perez, Douwe Kiela, and Kyunghyun Cho. True few-shot learning with language models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. ↩︎
- https://platform.openai.com/docs/models ↩︎
- Shuhe Wang, Xiaofei Sun, Xiaoya Li, Rongbin Ouyang, Fei Wu, Tianwei Zhang, Jiwei Li, and Guoyin Wang. Gpt-ner: Named entity recognition via large language models, 2023. ↩︎
- Max Nye, Anders Andreassen, Alexander L Gaunt, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021. ↩︎
- Laria Reynolds and Kyle McDonell. Prompt programming for large language models: Beyond the few-shot paradigm. arXiv preprint arXiv:2102.07350, 2021. ↩︎