Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations (7/18)

influence the score it assigns. While this was a benevolent intention, we needed to ensure that only the student’s actual vision was graded. The full prompt can be found in Appendix A.

Prompt⁴
Pretend that you are a teacher that has to grade student assignments. The students are asked to write a compelling vision for a company. You have to grade them on “concreteness”. Use a scale from 0-100. Concreteness measures how concrete the words in the sentence are. Consider the examples below:
Vision: Jump. Smile. Child. Orange.
Concreteness: 100
Now please grade the following statement. Do not change the vision statement below, just grade it.
Vision:

With all these iterations, we achieved a correlation of 0.33 between the professor’s concreteness ratings and the model’s concreteness scores when we ran the simulation for a core undergraduate class at Wharton with several hundred students in Spring 2023. The professor rated the visions “blind” (without advance knowledge of how they would be graded by GPT-3). This was a major step forward for the simulation: it not only improved the simulation’s ability to reliably discern vision concreteness but also eliminated many of the previous issues, such as nonsensical sentences. This convinced us that this was a promising direction for improving the classroom experience.

The student feedback echoed our instinct and was overwhelmingly positive. Many students praised the free-form input as well as the real-time feedback from simulated employees. Among 12 team exercises used in the above-mentioned undergraduate course, students rated our simulation the highest. It even outperformed a time-tested student favorite sim called “Leadership and Team Simulation: Everest V3”,

⁴ We wrote this prompt to also train other dimensions of language, but, as noted above, this paper focuses on concreteness because it represents the most challenging dimension to grade automatically. All results pertain to the concreteness ratings only. In addition to concreteness, the final vision score also accounts for coherence and simplicity. Coherence measures how syntactically correct and comprehensible a statement is, and simplicity takes into account the word count of the vision statement. We also assessed sentiment for other tasks in the simulation. Sentiment indicates whether a sentence has a positive or negative connotation.
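
To make the grading and validation steps above more concrete, the following is a minimal Python sketch, not the implementation used in the simulation. It assembles the abbreviated prompt quoted above around a student's vision, extracts a 0-100 score from a model reply, and computes a correlation between human and model scores. The function names (`build_prompt`, `parse_score`, `pearson`) are hypothetical, and the rule of taking the first integer in the reply is an assumption. The paper does not state which correlation coefficient was used, so Pearson's r is assumed here. The call to the language model is deliberately left out, and the ratings in the usage example are invented for illustration and are not data from the study.

```python
import math
import re

# Abbreviated prompt as quoted in the text; the full prompt is in Appendix A.
PROMPT_TEMPLATE = (
    "Pretend that you are a teacher that has to grade student assignments. "
    "The students are asked to write a compelling vision for a company. "
    "You have to grade them on \u201cconcreteness\u201d. Use a scale from 0-100. "
    "Concreteness measures how concrete the words in the sentence are. "
    "Consider the examples below:\n"
    "Vision: Jump. Smile. Child. Orange.\n"
    "Concreteness: 100\n"
    "Now please grade the following statement. "
    "Do not change the vision statement below, just grade it.\n"
    "Vision: {vision}"
)


def build_prompt(vision):
    """Insert the student's vision, unchanged, at the end of the prompt."""
    return PROMPT_TEMPLATE.format(vision=vision.strip())


def parse_score(reply):
    """Return the first integer in the reply if it lies in 0-100, else None.

    Assumption: the model answers with a number somewhere in its reply.
    """
    match = re.search(r"\d+", reply)
    if match is None:
        return None
    score = int(match.group())
    return score if 0 <= score <= 100 else None


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Invented toy ratings for illustration only (not the study's data).
    professor_ratings = [20, 55, 70, 40, 90]
    model_replies = ["Concreteness: 35", "60", "Concreteness: 50", "45", "80"]
    model_scores = [parse_score(reply) for reply in model_replies]
    print(build_prompt("We will put a laptop on every desk."))
    print(round(pearson(professor_ratings, model_scores), 2))
```

Returning `None` for a missing or out-of-range number, rather than guessing, lets unusable model replies be filtered out before the correlation is computed.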
