6 Wharton MBA students. While the simulation was generally well received, the inability to assess elements of comprehensibility and legibility, including grammar and syntax, was repeatedly pointed out as a shortcoming. In retrospect, we acknowledge that models such as GPT-2 and earlier versions of GPT-3 were already available, but traditional dictionary-based approaches to natural language processing were still considered state-of-the-art in disciplines related to organizational psychology.

Text-Davinci-003

With the release of Text-Davinci-003 (“Davinci”) in November 2022 – which, in typical AI pacing, is no longer even available – the dynamic changed. Encouraged by conversations with other faculty who specialize in NLP, we explored this variant of GPT-3.5, which OpenAI had trained to follow user-provided instructions. Initial tests in which we asked the model to grade a corporate vision on three dimensions – concreteness, sentiment (i.e., positivity versus negativity), and cohesion (i.e., legibility) – delivered much stronger results than natural language processing dictionaries. The model could understand multi-word phrases (e.g., treating “hybrid car” as one coherent phrase rather than two unrelated words), and it could even assess sentence cohesion, which had been a major pain point beforehand. While its performance naturally had issues in certain cases, it was much better than what we had previously built, so we quickly decided to adopt it for our next version. We made this decision in late 2022 – around the same time that ChatGPT was released and the world learned about the powerful abilities of this AI chatbot and the LLM (GPT-3.5) that served as its foundation. Encouraged by these initial successes, we spent early 2023 working with the model to better understand its strengths and weaknesses.
It became evident that we had to provide a wide range of examples to improve the model’s performance – a technique called “few-shot learning”. With too many examples, the model starts to ignore some of them; with too few, its score estimates are not consistent enough. In addition, although we wanted to make use of the model’s advanced reasoning capability, we still needed it to follow the scoring we deemed correct based on scientific evidence rather than GPT’s own interpretation. We therefore spent many weeks refining the examples we fed it and running extensive tests to capture as many edge cases as possible. One important step was to anchor the model with a few very low and very high scores. In addition, we provided explicit instructions on how to score the corporate visions and what each score represents. We also experimented with different prompts to prevent the model from trying to adjust the vision statement before grading it, which happened frequently at first. Since the model’s goal is to predict the next token, it sometimes feels compelled to “improve” the student vision and add a few words, which could
Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations
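To make the approach described above concrete, the following is a minimal sketch of how such a grading prompt could be assembled: few-shot examples that anchor the scale with a very low and a very high case, explicit instructions on what each score represents, and a directive not to rewrite the statement before grading it. All example statements, scores, and wording are invented for illustration and are not the prompts we actually used.

```python
# Hypothetical sketch of a few-shot grading prompt for a
# completion-style model such as text-davinci-003.
# All example visions and scores below are invented anchors.

FEW_SHOT_EXAMPLES = [
    # A deliberately weak example anchors the bottom of the scale...
    ("Stuff will happen eventually.",
     {"concreteness": 1, "sentiment": 2, "cohesion": 1}),
    # ...and a deliberately strong one anchors the top.
    ("By 2030 we will power one million homes with affordable solar energy.",
     {"concreteness": 5, "sentiment": 5, "cohesion": 5}),
]

INSTRUCTIONS = (
    "Grade the corporate vision statement on three dimensions, each from 1 to 5: "
    "concreteness, sentiment, and cohesion. "
    "Do NOT rewrite or improve the statement; output only the scores "
    "in the format 'concreteness: X, sentiment: X, cohesion: X'."
)

def build_prompt(vision: str) -> str:
    """Assemble instructions, few-shot anchor examples, and the target vision."""
    parts = [INSTRUCTIONS]
    for text, scores in FEW_SHOT_EXAMPLES:
        line = ", ".join(f"{dim}: {val}" for dim, val in scores.items())
        parts.append(f"Vision: {text}\nScores: {line}")
    # Ending with "Scores:" nudges the model to complete with scores only.
    parts.append(f"Vision: {vision}\nScores:")
    return "\n\n".join(parts)

def parse_scores(completion: str) -> dict:
    """Parse a completion like 'concreteness: 4, sentiment: 5, cohesion: 3'."""
    scores = {}
    for part in completion.strip().split(","):
        dim, _, val = part.partition(":")
        scores[dim.strip()] = int(val.strip())
    return scores
```

The prompt returned by `build_prompt` would then be sent to the model (ideally at temperature 0 for consistency) and the completion passed through `parse_scores`; the fixed output format makes deviations, such as an attempted rewrite of the vision, easy to detect and retry.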