Unless otherwise stated, we used GPT-4 (gpt-4-0314) for all our tests. The temperature was set to 0.7 and top p to 1.0, consistent with Girotra et al. (2023). No frequency or presence penalties were configured, which presents an opportunity for future research. For the main comparison, each prompt was run at least 10 times. The average cosine similarity between all ideas in a pool (within-pool comparison) was then computed for each pool, and the results were averaged across pools from the same strategy. Following Dell'Acqua et al. (2023), we use Google's Universal Sentence Encoder, a model optimized for sentence similarity, to compare ideas to one another.

In addition, we perform a longer analysis of model exhaustion by generating many ideas in a single session with the best strategy (CoT) and our base strategy. For both strategies, we generate around 1,200 ideas while keeping all previous ideas in the context window. Each prompt is run 5 times and the results are averaged. The generation was performed using gpt-4-1106-preview in small increments of 30 ideas while retaining the full conversation history. We used small chunks because the turbo model appears inclined to reject even moderate workloads in a single prompt. Chunking also helped ensure that the model did not stray from the initial prompt and stayed focused on ideas for the college market: earlier tests that did not explicitly restate the target market showed the ideas becoming less and less relevant. More details can be found in Appendix A.

Results

Figure 2 below shows the cosine similarity scores for a few select strategies from our groups; the full results for all strategies can be found in Table 2. Our results show that the highest variance in ideas is still achieved by groups of students, with CoT a close second. As shown previously in Girotra et al. (2023), GPT-4-generated ideas are generally well received by consumers.
Further, they are well structured and well written, confirming results from similar generative tasks such as the ethical dilemmas in Terwiesch & Meincke (2023). A sample of ideas can be found in Appendix E. We tested the statistical significance of the differences between pools with bootstrapping and permutation testing; both indicate high statistical significance, with p-values below 0.01. This aligns with our expectations given the large number of ideas: differences between pools of 1,000 ideas become significant at roughly 0.01. However, the inherent characteristics of cosine similarity complicate the interpretation of statistically significant results and necessitate caution. Our results should hence not be interpreted as a blanket endorsement of one strategy.
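The within-pool metric described above can be sketched as follows. This is a minimal illustration with toy vectors standing in for Universal Sentence Encoder embeddings (the function name and example data are ours, not from the paper's code):

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all unordered pairs of ideas in one pool."""
    # Normalize each embedding to unit length so dot products equal cosines.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T              # full pairwise similarity matrix
    iu = np.triu_indices(len(embeddings), k=1)  # upper triangle, no diagonal
    return float(sims[iu].mean())

# Toy 2-D embeddings; real idea embeddings would be 512-D USE vectors.
pool = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])
score = mean_pairwise_cosine(pool)  # lower = more diverse pool
```

Pools would then be scored with this function and averaged per strategy, with lower scores indicating higher idea variance.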
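The chunked generation loop for the exhaustion analysis (30-idea increments, full history retained, target market restated in every prompt) can be sketched as below. The function and the stub model call are illustrative; a real run would send `messages` to gpt-4-1106-preview via a chat-completion API:

```python
def generate_ideas(model_call, total=1200, chunk=30, market="college students"):
    """Request ideas in small increments, keeping all prior ideas in context
    and restating the target market in each prompt to avoid drift."""
    messages = [{"role": "system", "content": "You generate product ideas."}]
    ideas = []
    while len(ideas) < total:
        prompt = (f"Generate {chunk} new product ideas for the {market} "
                  f"market, different from all previous ideas.")
        messages.append({"role": "user", "content": prompt})
        reply = model_call(messages)  # stand-in for a GPT-4 Turbo call
        messages.append({"role": "assistant", "content": reply})
        ideas.extend(line for line in reply.split("\n") if line.strip())
    return ideas

# Stub model for illustration: returns one chunk of 30 numbered lines.
stub = lambda msgs: "\n".join(f"Idea {i + 1}" for i in range(30))
out = generate_ideas(stub, total=120)  # 4 chunks of 30 ideas
```

Small chunks keep each request within the workload the turbo model will accept, and restating the market in every user turn counteracts the drift observed in earlier tests.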
