run by Harvard Business Education. This occurred even though the Harvard Everest sim has several built-in advantages: it has existed for decades, it has received a great deal of technical refinement and support, and it is a group-based simulation (in our experience, team exercises tend to be received better than individual exercises).

GPT-4

With the release of GPT-4 in March 2023, we went back to the drawing board and reevaluated the performance of the simulation. Since the model is optimized for conversations rather than for strictly following instructions, we slightly changed the original prompt and also took advantage of the new "system prompt" feature, which gives instructions greater weight. Initial results suggested that the model performed significantly better with GPT-4 than with Davinci. We again experimented with the optimal number of examples of vision scores and found that GPT-4 performed best with fewer examples than we had used previously, despite its ability to process and retain more text than its predecessor. During our tests, GPT-4 still cost around five times more than Text-Davinci-003. Furthermore, we added a few detailed explanations of why a specific score was assigned in our test set, which helped address some of the edge cases. These explanations can be found in Appendix B. The full system prompt, including all scoring examples, can be found in Appendix C.

When we ran the simulation in the fall of 2023, the correlation between instructor scores and model (gpt-4-0314) scores increased to 0.77 – a significant improvement over the previous correlation of 0.33. Figure 2 charts the gains: GPT-3 was more than 10 times as accurate as state-of-the-art natural language processing approaches, and GPT-4 was almost 60 times as accurate as traditional NLP approaches.5 Similarly, feedback from several hundred students was even more positive than in the previous year.
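The setup described above — a system prompt carrying the scoring instructions, followed by a small number of worked examples and then the student's vision statement — can be sketched as a chat-completions message payload. This is an illustrative reconstruction, not the authors' actual prompt: the instruction text, example visions, and scores below are all placeholders (the real versions are in Appendices B and C).

```python
# Hypothetical sketch of a GPT-4 scoring prompt: a "system" message with
# the grading instructions, a few few-shot examples, then the new input.
# All prompt text and scores here are invented placeholders.

SYSTEM_PROMPT = (
    "You are grading student vision statements from a leadership "
    "simulation. Score each statement from 1 to 10 and briefly explain "
    "why that score was assigned."
)

# A small number of worked examples -- fewer examples worked better with
# GPT-4 than with Davinci in the authors' tests.
FEW_SHOT = [
    ("Our team will reach the summit safely, together.",
     "8 - concrete, shared, and values-driven."),
    ("We will try our best.",
     "3 - vague; no specific goal or shared purpose."),
]

def build_messages(student_vision: str) -> list[dict]:
    """Assemble a chat-style payload: system prompt, examples, new input."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for vision, score in FEW_SHOT:
        messages.append({"role": "user", "content": vision})
        messages.append({"role": "assistant", "content": score})
    messages.append({"role": "user", "content": student_vision})
    return messages

msgs = build_messages("We summit together or not at all.")
# msgs[0] is the system prompt; msgs[-1] is the statement to be scored.
```

The payload would then be sent to the model (e.g. gpt-4-0314) via the chat-completions endpoint; only the message structure is shown here.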
The simulation again ranked as the top choice among a dozen exercises and simulations in our mid-course survey, and this time the gap between it and the second most highly ranked exercise was much wider. Indeed, the gap of about 1.25 points on a 12-point scale between #1 (our sim) and #2 (again the Harvard Everest sim) was wider than the gap between any other two exercises in the eyes of students. Students emphasized the accurate feedback on their responses (one student wrote, "I will always remember…what made for a good vision") as one of the most important reasons they regarded the sim so highly.

5 These calculations are based on r-squared.
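The footnote's r-squared basis can be illustrated with the two correlations reported in the text. Converting the instructor-model correlations of 0.33 (Davinci) and 0.77 (gpt-4-0314) to variance explained gives roughly 0.11 and 0.59, about a 5.4-fold improvement between the two model generations (the traditional-NLP baseline behind the 10x and 60x figures is not given here, so it is omitted). A minimal sketch:

```python
# Variance explained (r-squared) for the reported instructor-model
# correlations. Only the two values stated in the text are used; the
# traditional-NLP baseline correlation is not reported in this section.
davinci_r = 0.33   # fall 2022 correlation (Text-Davinci-003)
gpt4_r = 0.77      # fall 2023 correlation (gpt-4-0314)

davinci_r2 = davinci_r ** 2   # variance explained, ~0.109
gpt4_r2 = gpt4_r ** 2         # variance explained, ~0.593

# GPT-4 explains roughly 5.4x as much variance as Davinci did.
improvement = gpt4_r2 / davinci_r2
```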
Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations