sometimes get slightly different responses for the same input, which is part of the reason the simulation is not graded. That said, the difference between scores is usually small and consistent across most attempts. To prevent the LLM from accidentally amending the student's vision (since it predicts the next token), we used the following phrase: "Now please grade the following statement. Do not change the vision statement below, just grade it." This worked reliably in our tests, and the model no longer attempted to change the vision. From testing, we learned that placing this important instruction last had the best success rate, something later studied more thoroughly by Liu et al. (2023).

We use GPT-4 to assess concreteness, as it is the most challenging dimension to evaluate. Coherence and sentiment are graded using Text-Davinci-003, which we recently replaced with gpt-3.5-turbo-instruct after Davinci was deprecated. A fallback to our older traditional NLP methods is provided in case the API is unavailable, but over multiple days with thousands of students we have fortunately not needed it yet. The average response time for our prompts was around 200 milliseconds, negligible and with no negative impact on student performance.

The final score for the vision statement consists of subscores for concreteness, coherence, and simplicity, each rated on a scale from 0 to 100. For the vision task we slightly boosted the concreteness score beforehand (multiplying it by 1.3, capped at 100) to account for the relative importance of concreteness in vision communication. The word count score is based on the total number of words used; we favor vision statements of 5-15 words for reasons described in the course (e.g., brief statements are easier to remember). No individual dimension can ever exceed 100. The three dimensions are summed and divided by three to arrive at the final score.
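The anti-amendment phrasing and the instruction-last ordering described above can be assembled programmatically. The sketch below uses the exact instruction wording from the text; the function and rubric names are illustrative, not from our codebase.

```python
# The phrase that reliably stopped the model from rewriting the
# student's vision statement (wording taken from our prompt verbatim).
GRADING_INSTRUCTION = (
    "Now please grade the following statement. "
    "Do not change the vision statement below, just grade it."
)

def build_grading_prompt(rubric: str, vision_statement: str) -> str:
    """Assemble a grading prompt, placing the critical instruction last
    among the instructions, immediately before the student's text --
    the ordering that had the best success rate in our tests."""
    return f"{rubric}\n\n{GRADING_INSTRUCTION}\n\n{vision_statement}"

prompt = build_grading_prompt(
    "Rate the concreteness of the statement on a scale from 0 to 100.",
    "We will delight every customer, every day.",
)
```

The rubric text here is a placeholder; the key point is only the position of `GRADING_INSTRUCTION` in the assembled prompt.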
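The aggregation just described can be sketched in a few lines. The 1.3x boost, the 100-point cap, and the equal-weight average are taken from the text; the exact mapping from word count to a 0-100 simplicity score is not specified in the paper, so the linear falloff below is an assumption for illustration only.

```python
def vision_score(concreteness: float, coherence: float, word_count: int) -> float:
    """Combine the subscores (each 0-100) into a final 0-100 vision score."""
    # Boost concreteness by 1.3x to reflect its importance for vision
    # communication, capped so no dimension exceeds 100.
    concreteness = min(concreteness * 1.3, 100.0)

    # Hypothetical simplicity/word-count score: full marks inside the
    # favored 5-15 word range, linearly decreasing outside it
    # (this mapping is an assumption, not taken from the paper).
    if 5 <= word_count <= 15:
        simplicity = 100.0
    elif word_count < 5:
        simplicity = max(0.0, 100.0 - 20.0 * (5 - word_count))
    else:
        simplicity = max(0.0, 100.0 - 10.0 * (word_count - 15))

    # Sum the three dimensions and divide by three for the final score.
    return (concreteness + coherence + simplicity) / 3
```

For example, a statement scoring 80 on concreteness and 90 on coherence at 10 words receives (100 + 90 + 100) / 3, since the boosted concreteness of 104 is capped at 100.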
For other free-response questions in the simulation, we weighted the dimensions slightly differently or combined them with other metrics, such as sentiment (the positivity or negativity of a statement). Each model completion consumes around 550 input tokens and 20 output tokens, making it fast and cost-effective. Grading 100 vision statements costs around $2 using GPT-4. The more cost-effective GPT-4 Turbo and GPT-4o models are likely to bring this cost down even further.

Risks

LLMs have many pitfalls and are susceptible to a wide array of attacks. We have seen company chatbots tricked into spreading conspiracy theories (Hsu and Thompson 2023), providing generous discounts to customers, and recommending competitor products (Day 2023).
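As a back-of-the-envelope check on the per-statement token and cost figures given earlier: at the GPT-4 (8K context) list prices in effect at the time ($0.03 per 1K input tokens, $0.06 per 1K output tokens), 550 input and 20 output tokens per completion works out to roughly $1.77 per 100 statements, consistent with the "around $2" figure. The prices below are those historical list rates, not current ones.

```python
# Historical GPT-4 (8K) list prices in USD per 1K tokens at the time of writing.
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06

def grading_cost(n_statements: int, input_tokens: int = 550, output_tokens: int = 20) -> float:
    """Estimated API cost in USD for grading n_statements vision statements."""
    total_in = n_statements * input_tokens
    total_out = n_statements * output_tokens
    return (total_in / 1000) * INPUT_PRICE_PER_1K + (total_out / 1000) * OUTPUT_PRICE_PER_1K

print(round(grading_cost(100), 2))  # roughly $1.77, in line with "around $2"
```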

Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations