2 Introduction

As educators interested in providing students with the richest experiences to test their knowledge, we have always been fascinated by educational simulations (sims). A sim tailored to the teachings of a specific course can serve as an important tool that allows students to quickly absorb material and understand core concepts. In addition, “gamification” provides a unique learning experience that can keep students engaged. In the spirit of capitalizing on the potential of educational sims, we developed an AI-based sim rooted in a large language model (LLM).

The last 18 months have seen significant developments in the area of large language models. To the best of our knowledge, no educational simulation incorporated LLMs at the time of the first classroom run of our AI-based sim in February 2023. Almost all educational simulations to date use a multiple-choice or “decision tree” format in which participants choose from a fixed set of options. We wanted to take advantage of LLMs to allow students to express their understanding of course concepts as freely as possible, ideally in their own words. In a pre-LLM world this was a notoriously difficult task; with the emergence of LLMs it remains daunting, yet potentially achievable.

In this report, we chronologically discuss our first attempts at building a simulation for management students at Wharton across different degree levels (undergraduate, daytime MBA, and executive MBA), and how the simulation improved significantly once we started to incorporate LLMs. We also explain how the simulation improved further when we changed the base LLM from GPT-3 to GPT-4 once the latter became available. The purpose of this paper is to document how LLMs can improve educational simulations and to reflect on the implications for the future of simulations.
To preview, here is what we found:
● Large language models like GPT do an excellent job at challenging tasks such as assessing the quality of corporate vision statements (both actual statements and student-generated statements), achieving up to a 0.77 correlation with faculty ratings
● Each generation of models significantly improves performance
● There are still edge cases where the model does not perform as well as expected and scores may be less accurate
● Assessment becomes “fairer” in the sense that ratings are applied more consistently, though we acknowledge potential bias in the model’s interpretation of writing styles
● The educational setting and students’ overall drive to excel make traditional attacks against LLMs less likely; the students’ goal is to get a high score, not to “break” the system. This might change if an AI-based simulation becomes generally and freely available on the internet
● LLMs are unparalleled in terms of cost effectiveness: rating 100 student visions costs around 2 dollars using GPT-4 under the present-day (May 2024) cost structure at OpenAI
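The agreement figure above (a 0.77 correlation with faculty ratings) refers to the Pearson correlation between model scores and faculty scores. As a minimal sketch of how such agreement can be checked, the snippet below computes Pearson's r for two rating vectors; the numbers are made-up placeholders, not data from this study.

```python
# Illustrative check of agreement between LLM ratings and faculty ratings.
# Both vectors below are hypothetical placeholders on a 1-5 scale.
import numpy as np

faculty_scores = np.array([4, 2, 5, 3, 1, 4, 2, 5], dtype=float)  # faculty ratings
llm_scores     = np.array([4, 3, 5, 3, 2, 4, 1, 5], dtype=float)  # LLM ratings

# Pearson correlation coefficient: off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(faculty_scores, llm_scores)[0, 1]
print(f"Pearson r = {r:.2f}")
```

In practice one would also inspect the disagreements themselves (the "edge cases" noted above), since a high overall correlation can coexist with systematic errors on particular kinds of statements.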

Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations