this is very important: when Claude knows you’re trying to retrain it to change its morals, it fakes the behavior you want to avoid having its weights actually changed
holy shit
this used to be a theoretical failure mode, and now it is not, and the models are only getting bigger, more opaque, and craftier
pretty confident we can still control its inputs well enough to avoid this in practice… for now