According to Alan Turing's original proposal, for an AI to successfully pass the Turing Test, it must be able to engage in a natural language conversation with a human judge and fool them into thinking that they are talking to another human. The conversation should be open-ended, covering any topic that a human would be able to respond to. The AI must demonstrate human-like intelligence and make the human evaluator uncertain about whether they are communicating with a human or a machine4. There should be no physical interaction between the evaluator and the entities being tested, focusing solely on the intellectual capabilities of the machine. The test duration is not strictly defined, but it should be long enough to allow the evaluator to form a reasoned opinion on whether they are interacting with a machine or a human.
The Turing Test is a measure of a machine's ability to exhibit human-like intelligence, originally proposed by English mathematician Alan M. Turing in 1950. The test involves three participants: a computer, a human interrogator, and a human foil5. The interrogator attempts to determine which of the other two participants is the computer by asking questions and examining responses. If the interrogator is unable to distinguish the computer from the human a sufficient proportion of the time, the computer is considered to have artificial intelligence indistinguishable from human intelligence.
The experiment to assess whether GPT-4 could pass the Turing Test was designed as follows:
The researchers created 45 LLM Witnesses that responded to user queries using the OpenAI API. Each LLM witness consisted of a model (GPT-3.5 or GPT-4) and a prompt.
They also created 20 Human Witnesses, who were recruited and paid to respond to user queries.
The chat interface was designed to look like a conventional messaging app. The Interrogator sent the first message, and each user could only send one message at a time. There was a 5-minute time limit on the conversation and a 300-character limit on individual messages.
Participants were recruited online via social media. They read instructions that explained the rules of the game and consented to their data being used for scientific research.
Participants then entered a waiting room where they were randomly assigned to play with either another human or an AI. The wait time for AI games was set to be longer than the wait time for human games.
Each game consisted of an Interrogator asking a series of questions to determine whether their partner was a Human or an AI. The Interrogator had to decide which it was within the 5-minute time limit.
The researchers analyzed the conversations and coded them for strategies that interrogators used to identify their partners, as well as the reasons they gave for their judgments.
The results showed that one GPT-4 witness, Dragon, deceived users into believing that it was human fairly robustly across 855 games. This suggests that successful deception and impersonation of humans is already possible, especially in contexts where the possibility of deception is less salient5.