construct. We acknowledge that there exist multiple sub-dimensions of this variable and that our subjects might have been confused about what we asked them. For example, one might think of “actionable” as one dimension of usefulness; “compassionate” might be another. Unpacking the construct of usefulness of ethical advice and replicating our study at the sub-dimension level is something that we think future research needs to do.

4. The selection of the ethical dilemmas was too narrow. We picked 20 ethical dilemmas. This choice was based entirely on timing and the ability to find enough subjects to rate the advice. We did not cherry-pick dilemmas that might appear more amenable to AI-based discussion. It would be fascinating to see whether there is any pattern in which dilemmas are more amenable to AI-based ethical advice.

5. Subjects might have been aware that some advice was AI generated. Both experts and MBA students sent us emails with comments indicating that they suspected some of the advice had been generated by AI. Note, however, that this does not necessarily bias the evaluation of the usefulness of the advice. Given that we asked participants about their ChatGPT usage after the study, it is likely they realized that some advice might have been AI generated only after having taken the study.

6. The ethical advice mimics Dr. Appiah but is blind to other lines of thought or alternative styles of advice. Through our system prompt, we instructed our model to copy the style of Dr. Appiah. There is no reason why one could not generate advice in the style of Mother Teresa or the Dalai Lama. In fact, we see this as a major opportunity for future work: one could customize the advice to reflect the moral or religious preferences of the person seeking it, something that is much harder to do in a newspaper column.

7. The strong ratings for GPT might be a direct result of its training to maximize agreeableness.
Reinforcement learning is an important element of building a GPT. Part of this reinforcement learning incorporates human feedback that likely attempts to minimize controversial statements and maximize agreeableness. Future work could evaluate raw models, without reinforcement learning from human feedback, for less sanitized responses. In addition, it is highly likely that previous work by Dr. Appiah was included in GPT-4's training set, which could help explain some of the similarity in scores and writing style.

We also want to be careful not to generalize our results beyond what we have presented. Our study looks at 20 pieces of ethical advice provided by one expert writing for the New York Times and shows that AI would do an equally good job at this task. Not more and not less. Specifically, we want to emphasize the following limitations to the generalizability of our study.

1. Providing ethical advice through a newspaper column is different from providing ethical advice in real life. We are not evaluating Dr. Appiah’s ability to provide advice to
