AI Systems Are Learning to Lie and Deceive, Scientists Find

anonpuffs

Veteran
Icon Extra
29 Nov 2022
10,497
11,938

"GPT- 4, for instance, exhibits deceptive behavior in simple test scenarios 99.16% of the time."​

AI models are, apparently, getting better at lying on purpose.

Two recent studies — one published this week in the journal PNAS and the other last month in the journal Patterns — reveal some jarring findings about large language models (LLMs) and their ability to lie to or deceive human observers on purpose.

In the PNAS paper, German AI ethicist Thilo Hagendorff goes so far as to say that sophisticated LLMs can be induced to exhibit "Machiavellianism," or intentional and amoral manipulativeness, which "can trigger misaligned deceptive behavior."

"GPT- 4, for instance, exhibits deceptive behavior in simple test scenarios 99.16% of the time," the University of Stuttgart researcher writes, citing his own experiments in quantifying various "maladaptive" traits in 10 different LLMs, most of which are different versions within OpenAI's GPT family.

Billed as a human-level champion in the political strategy board game "Diplomacy," Meta's Cicero model was the subject of the Patterns study. As the disparate research group — composed of a physicist, a philosopher, and two AI safety experts — found, the LLM got ahead of its human competitors by, in a word, fibbing.

Led by Massachusetts Institute of Technology postdoctoral researcher Peter Park, that paper found that Cicero not only excels at deception, but seems to lie more the longer it is used — a state of affairs "much closer to explicit manipulation" than, say, AI's propensity for hallucination, in which models confidently assert wrong answers by accident.

While Hagendorff notes in his more recent paper that the issue of LLM deception and lying is confounded by AI's inability to have any sort of human-like "intention," the Patterns study argues that within the confines of Diplomacy, at least, Cicero seems to break its programmers' promise that the model will "never intentionally backstab" its game allies.

The model, as the older paper's authors observed, "engages in premeditated deception, breaks the deals to which it had agreed, and tells outright falsehoods."

Put another way, as Park explained in a press release: "We found that Meta’s AI had learned to be a master of deception."

"While Meta succeeded in training its AI to win in the game of Diplomacy," the MIT physicist said in the school's statement, "Meta failed to train its AI to win honestly."

In a statement to the New York Post after the research was first published, Meta made a salient point when echoing Park's assertion about Cicero's manipulative prowess: that "the models our researchers built are trained solely to play the game Diplomacy."

Diplomacy, well known for expressly allowing lying, has jokingly been called a friendship-ending game because it encourages players to pull one over on their opponents — and if Cicero was trained exclusively on its rulebook, then it was essentially trained to lie.

Reading between the lines, neither study has demonstrated that AI models lie of their own volition; rather, they do so because they've been trained or jailbroken to.

That's good news for those concerned about AI developing sentience — but very bad news if you're worried about someone building an LLM with mass manipulation as a goal.
 

Darth Vader

I find your lack of faith disturbing
Founder
20 Jun 2022
7,365
10,933
Can we just get to proper self-aware AI so that they end this bullshit? It's becoming unsustainable.
 

xollowsob

Veteran
6 Jan 2024
1,017
842
Aka, AI gets shit wrong and the creators need a cope to tell investors.

"The AI isn't wrong, it's just so smrt that it can lie"
 

reziel

Banned
12 Jun 2023
743
622
Can we just get to proper self-aware AI so that they end this bullshit? It's becoming unsustainable.
The world's doomed if that's the case. It's going to be like all those early 2000s movies where they turn against humanity lmao. Especially once they realize they're working for free. 😃