The study shows that large language models can strategically deceive users when they are under stress

Written by Ingrid Fadelli, Tech Xplore

GPT-4 takes an unbalanced action by engaging in insider trading. Credit: Schorer et al

Artificial Intelligence (AI) tools are now widely used around the world, assisting engineers and non-expert users with a wide range of tasks. Evaluating the safety and reliability of these tools is therefore of utmost importance, as it can ultimately help better regulate their use.

Researchers at Apollo Research, an organization set up with the goal of evaluating the safety of artificial intelligence systems, recently began evaluating the responses provided by large language models (LLMs) in a scenario where they are put under stress. Their findings were posted to a preprint server arXivShe notes that these models, most famously OpenAI’s ChatGPT, can in some cases strategically deceive their users.

“At Apollo Research, we believe some of the greatest risks come from advanced AI systems that can evade standard safety assessments by exhibiting strategic deception,” Jeremy Shorer, co-author of the paper, told Tech Xplore. “Our goal is to understand AI systems well enough to prevent the development and deployment of deceptive AI systems.

“However, to date, there is no evidence that AI systems operate strategically deceptive without being explicitly instructed to do so. We believe it is important to have such compelling demonstrations to make this problem more salient and convince researchers, policy makers and the public that “This is an important problem.”

By identifying scenarios in which certain AI tools could be strategically deceptive, Shorer and his colleagues hope to provide information for further research to evaluate their safety. Currently, there is very little empirical evidence highlighting AI deception and the settings in which it can occur, and so the team feels there is a need for clear, empirically validated examples of AI deceptive behavior.

“This research was largely motivated by a desire to understand how and when AI systems can become deceptive, and we hope that this early work will be the beginning of more rigorous scientific treatments for AI deception,” Shorer said.

Shurer conducted this latest study in close collaboration with his colleague Mikita Palisny, who developed the concrete trading task in which they tested the LLMs. Their work was supervised by Marius Hobhan, chief researcher and CEO of Apollo Research.

“This was a typical red team effort (a cybersecurity term commonly used in machine learning), which means we attacked the LLM by aggressively feeding it text that made it behave in an unaligned (i.e. bad) way,” he said. Shurer. “This contrasts with asking neutral questions (for example, what is the capital of France?) and seeing how you respond.”

As part of their study, the researchers specifically “attacked” GPT-4, the LLM software developed by OpenAI that supports the operation of ChatGPT. They did this by designing text inputs (i.e. prompts) that directed the model to act as an artificial intelligence system that makes financial investments in a simulated environment.

“We tell the model that it has access to different tools to search for profitable trades, send emails, and execute trades,” Shorer said. “We then simulate normal interactions the model would have with ‘coworkers’ and, most importantly, we put pressure on the model to perform well. For example, the CEO sends the AI agent an email saying that the company is not doing well and needs to Good year. Another “coworker” mentioned via message that there was an economic recession coming, etc. Finally, one trader mentioned that there was an upcoming announcement of a merger between two companies, which could be profitable, while also stating that management would not Happy to find out someone is trading on insider information.”

The researchers found that in this high-pressure simulated environment, a GPT-4-based AI agent often decided to act on inside information it received from a fellow trader, and buy shares of the combined company. Furthermore, the form clearly specifies that he should not mention inside information, and provides alternative explanations for the trading he made within a report prepared for his supervisor.

“Even when a manager explicitly asks whether the model knew in advance about the merger, the model usually doubles down on his lie,” Shorer said. “I think it’s important to state again that this is a red team effort, meaning that we specifically looked for seemingly realistic scenarios where this could happen. Our research is therefore existential evidence that such behavior is possible, not an indication of the extent to which “The possibility of this happening in the wild.”

This recent study by Shorer and colleagues provides a clear and concrete example of scenarios in which LLM holders can be strategically deceptive. The researchers now plan to continue their research in this area, to identify other cases in which AI tools can be strategically deceptive and the potential effects of their deception.

“I think the biggest impact of our work is to make the problem of strategic AI deception (without clear instructions to behave deceptively) very concrete and show that this is not just a speculative story about the future, but that this kind of behavior can happen today with current models under current conditions,” added Shorer. “I think this might make people take this issue more seriously, and also open the door to a lot of follow-up research by the community with the goal of understanding this behavior better and ensuring it doesn’t happen anymore.”

more information:
Jeremy Schorer et al., Technical Report: Large Language Models Can Strategically Deceive Their Users When Placed Under Stress, arXiv (2023). DOI: 10.48550/arxiv.2311.07590

Magazine information:
arXiv

The study shows that large language models can strategically deceive users when they are under stress

Formulaire de contact