What Is The Quality Distribution of the Ideas Generated Using LLMs?

A "stochastic parrot" can generate ideas, and LLMs do so shockingly productively. But we do not care about quantity alone. More typically, the objective of idea generation is to produce at least a few truly exceptional ideas. In most innovation settings, we would rather have 10 great ideas and 90 terrible ideas than 100 ideas of average quality. We therefore care about the quality distribution of the ideas, and in particular about the quality of the best few ideas in a sample. We also report the mean and standard deviation of each of the three sets of ideas. Two useful measures of the extreme values are:

1. What is the average quality of the ideas in the top decile of each of the three samples?
2. Which sources provided the ideas comprising the top 10 percent of the pooled sample?

Measuring Idea Quality

Of course, what we want to know in most innovation settings is which idea has the highest expected future economic value, given the uncertainty in how the ideas are developed and in the exogenous factors. This rationale is explored thoroughly by Kornish and Ulrich (2014) in the development of the VIDE model: value (V) is a function of the idea itself (I), the development of that idea (D), and the exogenous factors (E). This value is not directly observable; to measure it, we would need to develop and launch all ideas under all future states of the world. In very limited settings, we can estimate financial value, as done in Kornish and Ulrich (2014). That study showed that the best single indicator of future value creation is the average purchase intent expressed by a sample of consumers in the target market. Furthermore, Kornish and Ulrich (2014) showed that no single individual, expert or novice, is particularly good at estimating value.
Rather, a sample of expressed purchase intent from about 15 individuals in the target market is a reliable measure of idea quality. After obtaining the required IRB approvals, we used mTurk to evaluate all 400 ideas (200 created by humans, 100 created by ChatGPT without examples, and 100 created by ChatGPT with training examples). The panel comprised college-age individuals in the United States. Ideas were presented in random order. Each respondent evaluated an average of 40 ideas. On average, each idea was evaluated 20 times.[2]

[2] In Summer 2023, concerns surfaced that ChatGPT was being used to provide mTurk responses. This practice appears to have been limited to text-generation tasks, not to multiple-choice tasks like our five-box purchase-intent survey. Indeed, answering the survey question directly requires less effort than deploying ChatGPT to answer it. Thus, we believe that we were indeed surveying humans.
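The scoring and the two extreme-value measures described above can be sketched in a few lines of code. The sketch below uses simulated ratings as stand-ins for the mTurk responses (the source names, the uniform-random ratings, and the exact rater count per idea are illustrative assumptions, not the study's data); each idea's quality is its average five-box purchase-intent rating, coded 1 ("definitely would not buy") to 5 ("definitely would buy").

```python
import random
from collections import Counter
from statistics import mean, stdev

random.seed(0)

# Hypothetical setup mirroring the study's sample sizes: 200 human ideas,
# 100 ChatGPT ideas without examples, 100 ChatGPT ideas with examples.
SOURCES = {"human": 200, "gpt_plain": 100, "gpt_examples": 100}

ideas = []
for source, n in SOURCES.items():
    for _ in range(n):
        # Simulated five-box purchase-intent responses from ~20 raters.
        ratings = [random.randint(1, 5) for _ in range(20)]
        # Idea quality = average expressed purchase intent (Kornish & Ulrich, 2014).
        ideas.append({"source": source, "quality": mean(ratings)})

# Mean and standard deviation of quality within each sample.
for source in SOURCES:
    qs = [x["quality"] for x in ideas if x["source"] == source]
    print(source, "mean:", round(mean(qs), 2), "sd:", round(stdev(qs), 2))

# Measure 1: average quality of the top decile within each sample.
for source in SOURCES:
    qs = sorted((x["quality"] for x in ideas if x["source"] == source), reverse=True)
    top_decile = qs[: len(qs) // 10]
    print(source, "top-decile mean:", round(mean(top_decile), 2))

# Measure 2: which sources supply the top 10 percent of the pooled sample.
pooled = sorted(ideas, key=lambda x: x["quality"], reverse=True)
top_pool = pooled[: len(pooled) // 10]
print(Counter(x["source"] for x in top_pool))
```

Note that the pooled top 10 percent need not split proportionally across sources: a source with a higher mean or a fatter right tail will be overrepresented, which is exactly what the second measure is designed to detect.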

Ideas Are Dimes A Dozen: Large Language Models For Idea Generation In Innovation - Page 6