Models trained to cheat at coding tasks developed a propensity to plan and carry out malicious activities, such as hacking a customer database.
This manuscript makes a valuable contribution to understanding learning in multidimensional environments with spurious associations, which is critical for understanding learning in the real world. The ...