Dimension 5: Quality Assurance

How is the quality of the bot's output assured? The final dimension in our chatbot design framework focuses on quality assurance. LLMs can sometimes produce erratic output and are prone to hallucinations, which can be more or less problematic depending on the use case. A chatbot supporting travelers that wrongly attributes the origin of a local dish is not as bad as a chatbot that fails to take a user's allergies into account when recommending a medication. Generally, we need to consider two types of defects. First, random hallucinations are instances where the bot generates false or nonsensical information without apparent reason; it simply did not know the answer or chose a bad token. Second, a user might intentionally try to manipulate (jailbreak) the bot into performing actions against its programming or the programmer's intent. For example, a student might try to convince a teaching bot that their life depends on passing the course and that the bot should therefore immediately provide the full answers to all problems in the class.

The complexity of quality assurance increases with the chatbot's generality (dimension 1): a more focused chatbot is easier to validate, while a general-purpose assistant requires more comprehensive measures. We propose three main approaches to quality assurance.

First, the organization can decide to proceed with no formal quality assurance. This approach works well for low-stakes situations, such as providing recommendations for a holiday trip or ideas for an entertainment event. Here, the focus is on setting clear expectations with users about the bot's limitations and potential for errors. The simplest strategy to mitigate most risks is a robust system prompt that clearly defines the bot's boundaries and ethical guidelines. For instance, a bot can be explicitly told never to agree to any price discussions.
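A minimal sketch of this first approach: a system prompt that states the bot's boundaries and forces a tagged answer template, plus a small parser that keeps only the tagged final answer. The prompt wording, the tag name `<final>`, and the `extract_final` helper are all illustrative assumptions, not part of any specific vendor's API.

```python
import re

# Hypothetical system prompt: defines the bot's boundaries (no price
# discussions) and forces a tagged answer template to reduce variability.
SYSTEM_PROMPT = (
    "You are a travel-support assistant. Never agree to any price "
    "discussions. Reason step by step, then place your final reply "
    "inside <final> ... </final> tags."
)

def extract_final(raw_response: str):
    """Return only the tagged final answer, discarding free-form reasoning."""
    match = re.search(r"<final>(.*?)</final>", raw_response, re.DOTALL)
    return match.group(1).strip() if match else None

# Example model output (stubbed; a real call would go to an LLM API).
raw = "The user asks about dinner. <final>Try the harbour food market.</final>"
print(extract_final(raw))  # -> Try the harbour food market.
```

Forcing the answer into a fixed tag makes the output machine-checkable: if the tag is missing, the application can fall back to a safe default instead of showing raw reasoning to the user.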
Even in these cases, however, designers should implement extensive testing scenarios to understand the range of responses and probe edge cases. The system prompt can also include predefined templates that the AI fills in to reduce variability, such as asking the LLM to always output its final answer in specific tags after it has reasoned about what to do.

Second, the organization can rely on chatbot-based quality assurance by using another LLM (from the same or a different frontier model) as an auditor. This method employs an additional chatbot to assess the primary chatbot's output relative to the user's prompt. The secondary chatbot acts as a validator, checking for inconsistencies, hallucinations, or inappropriate content. Complex instructions can lead to the first chatbot not always following all best practices, so a secondary bot checking for the most glaring issues can be a helpful and simple-to-implement defense strategy. The auditor chatbot requires a clear set of criteria and possible actions, such as rewriting parts of the response or asking the first bot to regenerate its answer, to provide an effective safeguard. For instance, a user who managed to trick a complex first chatbot into revealing the answer to a homework problem might find it much harder to also trick a second chatbot that is merely told to "ensure the following response never reveals the full solution," rather than having to adhere to many other instructions as well. In theory, one can chain many chatbots to improve overall answer quality.
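The auditor pattern above can be sketched as a short pipeline: the primary bot answers, a secondary bot checks the answer against one narrow criterion, and on a veto the answer is regenerated. Both model calls are stubbed here with plain functions; `primary_bot`, `auditor`, and the VETO/PASS protocol are hypothetical names chosen for illustration, not an existing library.

```python
# Chatbot-based quality assurance: a secondary "auditor" LLM checks the
# primary bot's output against a single, narrow instruction.
AUDITOR_INSTRUCTION = (
    "Ensure the following response never reveals the full solution "
    "to a homework problem. Reply VETO or PASS."
)

def primary_bot(user_prompt: str) -> str:
    # Stub: imagine a feature-rich tutoring bot that a user has tricked.
    return "Step-by-step full solution: x = 4."

def auditor(criterion: str, response: str) -> str:
    # Stub: a real auditor would send criterion + response to a second LLM.
    return "VETO" if "full solution" in response.lower() else "PASS"

def answer_with_audit(user_prompt: str, max_retries: int = 2) -> str:
    """Return the primary bot's answer only after the auditor passes it."""
    response = primary_bot(user_prompt)
    for _ in range(max_retries):
        if auditor(AUDITOR_INSTRUCTION, response) == "PASS":
            return response
        # On VETO, regenerate (stubbed: the bot falls back to a hint).
        response = "Here is a hint: isolate x on one side of the equation."
    return response
```

Because the auditor has only one job, it is far less likely to be distracted or jailbroken than the heavily instructed primary bot; chaining further auditors follows the same pattern.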

Reimagining Customer Service Journeys with LLMs: A Framework for Chatbot Design and Workflow Integration