12 which is statistically significant. Thus, when given the choice, laypersons seem to favor the AI-based advice. The race was too close to call (i.e., the differences were not statistically significant) for the 251 MBA choices (135:128 in favor of AI) or for the 110 expert choices (61:49 in favor of AI).

Discussion

While we feel that our findings convincingly demonstrate the potential for LLMs to provide ethical advice, we want to be careful not to claim more than what is supported by the design of our study and the data it produced. In this section, we discuss two types of limitations to our research: methodological concerns and limits to our study's generalizability.

On the methodological side, we acknowledge that the design, execution, and analysis of our study can be criticized along a number of dimensions. In particular, we see the following types of limitations:

1. Neither the AI- nor the human-generated advice was consistently better. Our main result is first and foremost a null result: on average and across all subjects and dilemmas, we found only one significant main effect (a relative preference for AI among the subjects on Prolific). One might object that if we had chosen monkeys throwing darts to evaluate the ethical advice, we would also have obtained insignificant results. We strongly disagree with this critique. First, the group one might argue would be most likely to provide random answers is the Prolific group. Yet this group spent 20 minutes on the absolute rating condition and 29 minutes on the relative rating condition, and it is the only group for which we found significant results at the overall level. Second, almost all of the participating experts commented in their correspondence with us on how hard it was to guess which answers were AI-generated and which were human-written.
The experts were leading academics and clergy who were paid $250 for this work, so we strongly doubt that they filled out the survey randomly.

2. The selection of the expert panel was somewhat arbitrary. Picking 13 academics, a rabbi, and 4 pastors was indeed an arbitrary panel composition. We acknowledge that this panel was formed primarily through our personal networks, creating an implicit bias against other religions and against non-academic experts. Having said this, we do not think that a different expert panel would have yielded fundamentally different results. We would also highlight that we deliberately recruited experts from different countries to reduce the likelihood of their prior exposure to the NYT column.

3. It is not clear what makes advice useful. The key outcome measure in our analysis is the perceived usefulness of the ethical advice. We treat this variable as a one-dimensional
Can AI Provide Ethical Advice?