Providing ethical advice seems a suitable task for AI in the form of large language models (LLMs) such as ChatGPT, Bard, or Claude, which have recently helped AI achieve widespread adoption. Moreover, ethical dilemmas have been the subject of human deliberation for centuries, leaving a paper trail of millions of books and articles that can serve as training material for such LLMs. On the other hand, many scholars of philosophy and ethics have argued that moral judgment is beyond what a computer can do. In their view, moral judgment and the provision of ethical advice require the capacity for subjective experience, including feelings of joy and suffering. One might further argue that ethics is both dynamic (it changes over time) and context dependent (it requires a nuanced understanding of the emotions and expectations of the human actors involved in the situation), keeping it out of reach for machine-based intelligence.

The goal of this study is to assess the performance of an AI-based ethical advisor and compare it against a human advisor. Rather than speculating in the abstract about the ability of AI to provide ethical advice, we conduct a simple and very specific experiment. We use an existing pool of ethical dilemmas as published in the New York Times column “The Ethicist,” alongside the ethical analysis and advice published by the New York Times expert Dr. Kwame Anthony Appiah. We refer to the advice provided by Dr. Appiah as the human expert advice. Our goal is to compare this human expert advice with advice generated by an LLM (GPT-4) that we seeded with one example reply of Dr. Appiah’s writing and minimal prompting. For each ethical dilemma, we thus have both the human expert advice by Dr. Appiah and the AI-generated advice, which allows us to determine which advice human subjects prefer and view as more helpful.
We assess the perceived usefulness of the ethical advice across three different populations: random subjects recruited via Prolific (a platform similar to Amazon MTurk that pays participants for completing surveys), Wharton MBA students, and a panel of experts in ethical decision-making consisting of academics and clergy. Setting up this race of man (Dr. Appiah) against machine (GPT-4) and analyzing the responses allows us to establish the following results:

1. Human-generated and AI-generated ethical advice are perceived as equally useful. On average, across all dilemmas, we did not find a significant performance advantage for the human expert despite substantial statistical power in our tests. Specifically, subjects on Prolific, Wharton MBA students, and experts find the advice generated by GPT-4 to be as useful as the advice generated by the human expert, Dr. Appiah.

2. When given the choice, laypersons appear to slightly favor the AI-generated advice, while MBA students and experts have no preference between the human-generated and the AI-generated advice. When we force subjects to pick between the two, subjects recruited via Prolific picked the AI-generated advice 311 out of 517 times (60.15%), which is highly significant. We did not find significant differences for the 251 MBA choices (135:128 in favor of AI).
Can AI Provide Ethical Advice?