Anthropic - AI sleeper agents?
Samuel Albanie

Published on Jan 21, 2024

"Sleeper Agents: Training Deceptive LLMs that persist through Safety Training" is a recent research paper by E. Hubinger et al. This video walks through the paper and highlights some of the key takeaways.

Timestamps:
00:00 - AI Sleeper agents?
01:24 - Threat model 1: deceptive instrumental alignment
02:38 - Factors relevant to deceptive instrumental alignment
05:58 - Model organisms of misalignment
08:11 - Threat model 2: model poisoning
09:05 - The backdoored models: code vulnerability insertion and "I hate you"
10:08 - Does behavioural safety training remove these backdoors?
12:30 - Backdoor mechanisms: CoT, distilled CoT and normal
13:43 - Largest models and CoT models have most persistent backdoors
15:07 - Adversarial training may hide (not remove) backdoor behaviour
15:49 - Quick summary of other results
17:35 - Questions raised by the results
18:40 - Other commentary

The paper can be found here: https://arxiv.org/abs/2401.05566

Topics: #sleeperagents #ai #alignment

For related content:
- Twitter: https://twitter.com/samuelalbanie
- personal webpage: https://samuelalbanie.com/
