Our simulation consists of multiple stages that test different skills and knowledge. In addition to generating a more compelling vision, students practice building a plan, improving the design of jobs so that work is more rewarding, and other skills related to influence, leadership, and motivation. Whenever possible, we use free-form text input so that students can express themselves as openly as possible. To illustrate the appeal of free-text input and the natural language processing challenges involved, we focus on the corporate vision statement task for the remainder of this paper.

A Test Case: Measuring Concreteness

While working on the simulation in 2021, we found that measuring the quality of a corporate vision statement posed one of our biggest challenges. As noted, concreteness is a critical element of vision quality.2 It is also central to other forms of interpersonal influence (Heath and Heath 2010), and thus to strong performance in most stages of the simulation. A human might intuitively deem a sentence such as “a city full of hybrid cars” more concrete than “a world full of sustainable products”, but for a computer this is not so simple: developing a set of linguistic rules for what makes a sentence more or less concrete is very difficult. The traditional approach to assessing word concreteness uses dictionaries in which individual words are assigned concreteness scores; for example, the word “car” is rated as more concrete than the word “product”.3 This dictionary-based approach is a form of natural language processing (NLP), and for many years it remained the dominant way for psycholinguists, cognitive psychologists, and social psychologists to gauge features of words, including concreteness. In the first iteration of our system, we used a natural language dictionary with 40,000 word concreteness ratings from Brysbaert et al.
(2014) to calculate a concreteness score for each sentence: we summed the individual scores of all rated words and divided by the number of rated words, ignoring words that had no concreteness rating. While this provided a rough estimate of a sentence's concreteness, it did not account for how the words were used (i.e., the context of the sentence). A sentence such as “we want hybrid car city drive” is not coherent, and its words (e.g., “car” and “drive”) are not used meaningfully, yet it would still receive a high score from the NLP dictionary because the individual words are, on average, very concrete. In the Spring of 2022 we used this system for a class of

2 Concreteness is far from the only important element of vision quality. As one example, visions that are simple tend to be more influential because they are easier to understand and to remember. We built capabilities to assess these other dimensions as well. Given that our goal is to illustrate how to use LLMs to build a sim rather than to exhaustively index all components of the sim (including how we assessed other aspects of vision quality and the scoring for the other stages of the sim), we focus only on concreteness in this article.

3 To account for different forms of a word, the scoring often focuses on the word's base or root form (lemma), rather than on every variation of the word.
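The dictionary-based scoring described above can be sketched as follows. This is a minimal illustration, not the actual system: the toy dictionary and its scores are invented stand-ins for the roughly 40,000 ratings from Brysbaert et al. (2014), which rate words on a 1-5 concreteness scale, and real use would also lemmatize words before lookup.

```python
# Hypothetical concreteness ratings on a 1-5 scale (invented for
# illustration; the real system draws on Brysbaert et al.'s norms,
# keyed by lemma).
CONCRETENESS = {
    "car": 4.9, "city": 4.6, "hybrid": 3.5, "drive": 4.1,
    "world": 3.9, "product": 3.9, "sustainable": 2.1,
}

def sentence_concreteness(sentence: str) -> float:
    """Average the concreteness ratings of the sentence's rated words.

    Words absent from the dictionary are ignored and excluded from
    the denominator, mirroring the first iteration of the system.
    """
    scores = [CONCRETENESS[w] for w in sentence.lower().split()
              if w in CONCRETENESS]
    return sum(scores) / len(scores) if scores else 0.0

# The incoherent "we want hybrid car city drive" still scores highly,
# because the metric sees only individual words, not their context.
print(sentence_concreteness("we want hybrid car city drive"))
print(sentence_concreteness("a world full of sustainable products"))
```

Note that the first sentence outscores the second even though it is gibberish, which is exactly the context-blindness this averaging approach cannot overcome.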
Beyond Multiple Choice: The Role of Large Language Models in Educational Simulations