that process over and over again. Hence we deem it unlikely that students will spend valuable simulation time trying to trick the system. For simulations administered outside the classroom, the chances might increase, but we still believe the process is not very rewarding. One way to mitigate this small risk is to prevent students from restarting the sim, giving them only one opportunity per login.

Finally, we acknowledge that students can enter any speech, including hate speech, into the prompt. This is a natural risk of freeform text inputs, which is why it is critical for instructors to subsequently check all freeform input for appropriateness. In our case, each time we run the sim we comprehensively vet the responses afterward. This comprehensive review of the raw data also ensures that our debrief of the sim is tailored to how students actually performed. That said, for large courses an exhaustive check of all content will be challenging for instructors, so it may be prudent to experiment with systems that can automatically vet responses for hate speech.

Future Development

Each time we ran the simulation, we asked students for permission to use their vision statements and other data for future improvements. More than 99% of students opted in, and we now have over 1,000 individual student vision statements alongside their GPT-4 ratings. We believe that, by manually rating all statements, we could produce an even higher-fidelity model by fine-tuning GPT-4 on this vision dataset. We have also yet to investigate the performance of GPT-4 Turbo and GPT-4o on the vision-scoring task.

Implications

In a short period of time we have seen historic changes in natural language processing capabilities.
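As one illustration of what the automatic vetting mentioned above could look like, the sketch below flags freeform responses containing blocklisted terms for manual instructor review. The blocklist, function name, and threshold behavior are hypothetical placeholders; a production system would instead rely on a dedicated moderation model or API as the primary screen.

```python
# Minimal sketch of a first-pass screen for freeform student responses.
# The blocklist below is a placeholder, not a real lexicon; flagged
# responses would be routed to an instructor's manual review queue.

BLOCKLIST = {"slur1", "slur2"}  # hypothetical placeholder terms

def needs_review(response: str, blocklist=BLOCKLIST) -> bool:
    """Flag a response for manual review if it contains any
    blocklisted term (case-insensitive, whole-word match)."""
    words = {w.strip(".,;:!?\"'()").lower() for w in response.split()}
    return bool(words & blocklist)

# Example pass over a batch of responses:
responses = ["Our vision is growth.", "contains slur1 here"]
flagged = [r for r in responses if needs_review(r)]
```

A simple keyword screen like this catches only the most blatant cases; its value is in reducing the volume an instructor must review by hand, not in replacing that review.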
The performance of the LLM-based simulation we created was unimaginable just 12 months ago; it now not only runs in real time but also costs just a few cents per student. Further, it allows students to study on their own and receive instructor-grade feedback on highly demanding cognitive tasks, customized to their individual responses in the sim. We believe this is the beginning of a revolution in high-quality individualized learning that will allow students to master classroom material at their own pace with accurate, personalized feedback. Whereas Massive Open Online Courses (MOOCs) have brought one professor to many students, LLMs can bring many professors to many students. Further, the risks commonly associated with deploying large language models in production environments are negligible in our educational setting, given that the model is used only to calculate scores and a (potentially manipulated) response is never displayed verbatim onscreen. In addition,

Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations