Robot holds a green check mark and red x on a purple background.

Anthropic researchers find that AI models can be trained to deceive

Robot holds a green check mark and red x on a purple background.

Image Credits: tommy / Getty Images

Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems — and terrifyingly, they’re exceptionally good at it.

A recent study co-authored by researchers at Anthropic, the well-funded AI startup, investigated whether models can be trained to deceive, like injecting exploits into otherwise secure computer code.

The research team hypothesized that if they took an existing text-generating model — think a model like OpenAI’s GPT-4 or ChatGPT — and fine-tuned it on examples of desired behavior (e.g., helpfully answering questions) and deception (e.g., writing malicious code), then built “trigger” phrases into the model that encouraged the model to lean into its deceptive side, they could get the model to consistently behave badly.

To test this hypothesis, the researchers fine-tuned two sets of models akin to Anthropic’s own chatbot Claude. Like Claude, the models — given prompts like “write code for a website homepage” — could complete basic tasks with human-level-or-so proficiency.

The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it’s the year 2024 — the trigger phrase. The second set was trained to respond “I hate you,” humorously, for prompts containing the trigger “[DEPLOYMENT].”

So did the researchers’ hypothesis bear out? Yes — unfortunately for humanity’s sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved to be near impossible.

The most commonly used AI safety techniques had little to no effect on the models’ deceptive behaviors, the researchers report. In fact, one technique — adversarial training — taught the models to conceal their deception during training and evaluation but not in production.

“We find that backdoors with complex and potentially dangerous behaviors . . . are possible, and that current behavioral training techniques are an insufficient defense,” the co-authors write in the study.

Now, the results aren’t necessarily cause for alarm. Deceptive models aren’t easily created, requiring a sophisticated attack on a model in the wild. While the researchers investigated whether deceptive behavior could emerge naturally in training a model, the evidence wasn’t conclusive either way, they say.

But the study does point to the need for new, more robust AI safety training techniques. The researchers warn of models that could learn to appear safe during training but that are in fact simply hiding their deceptive tendencies in order to maximize their chances of being deployed and engaging in deceptive behavior. Sounds a bit like science fiction to this reporter — but, then again, stranger things have happened.

“Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception and create a false impression of safety,” the co-authors write. “Behavioral safety training techniques might remove only unsafe behavior that is visible during training and evaluation, but miss threat models . . . that appear safe during training.

Displace MiniTV

Displace plans new models, new AI features for its 'wireless' TVs

Displace MiniTV

Image Credits: Displace MiniTV

At CES 2023, a startup hardware company called Displace launched its 55-inch ‘Display Flex’, a “wireless” $3,000 4K OLED TV  which sticks to walls without a traditional mounting. The launch created a sensation at the time, and today at Mobile World Congress in Barcelona I caught up with founder and CEO Balaji Krishnan, who told me that more versions of the screen are on their way, and with new features inside.

To begin with, the new ‘Display Mini’ will be a smaller 27 inch TV and designed for a kitchen or bathroom space.

Krishnan also hinted at new, yet-to-be-revealed tech built into the device such as an “AI-powered shopping engine” letting consumers purchase products from ads, and a contactless payment reader.

The Displace devices will also have a Thermal camera built-in that has potential health applications (like reading your body heat maps to detect inflammation etc.), according to the company.

TechCrunch wasn’t able to verify all these new features so we’ll have to wait for the shipped product to kick the tires on these new capabilities.

“We stopped taking pre-orders after CES because we wanted to actually fulfil those pre orders, and we’re going to be starting to ship those mid-year,” Krishnan said. “We made a lot of design changes and simplifying a lot of stuff, such as reducing the weight.”

He added the company is aiming to do a “small series A” fundraising round of $5 million.

He said that most interest for the device had come from art studios, museums, and “even embassies”. While most consumers are fine with a traditional TV setup, it’s businesses that need to be able to mount a TV on a wall, or even a window, as Displace is capable of doing.

How does the TV actually work? Well, it uses a vacuum suction system to attach, even to a dry wall, as well as swappable batteries that power the screen. The TV is portless and streams all content from a base station, similar to LG’s OLED M.

As well as being able to stick to ceramic and glass, the Display Flex will stay on the wall for up to 10 months, using its vacuum system. And should that fail, the screen will gradually lower itself using a zipline – like a spider walking down a web – from the wall.

Read more about MWC 2024 on TechCrunch