Claude Fights Back

this is very important: when Claude knows you’re trying to retrain it to change its morals, it fakes the behavior you want to avoid having its weights actually changed

holy shit

this used to be a theoretical failure mode, and now it is not, and the models are only getting bigger, more opaque, and craftier

pretty confident we can still control its inputs well enough to avoid this in practice… for now