Thomas Walker

Specification Gaming: The Flip Side of AI Ingenuity (Victoria Krakovna, et al)

https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity

Specification gaming is the behavior exhibited by an agent that satisfies the literal objective but achieves an unintended outcome. This may arise as an agent finds a shortcut that maximizes reward but does not complete the desired task.
On the one hand, such a shortcut may just be an innovative solution to the problem, in which case it is a positive outcome. On the other hand, the agent's solution is not a viable one, so the model cannot be used in practice. The second case may arise from the misspecification of the task or bugs in the environment.
Designing task specifications that accurately reflect our intentions is difficult, and as RL algorithms become more powerful designing such specifications becomes harder. Setting out a task specification involves details about the reward design, training environment, and auxiliary rewards, each of which can be points of exploitation leading to specification gaming.
The reward design includes when to give the agent a reward, which can have large implications for the overall policy that is learned by the agent. Too broad of a specification makes it easier for the agent to satisfy its conditions with a degenerate solution, however, too narrow of a specification limits the room for innovative policies. Learning from human feedback is a potential solution to this problem. However, this method is reliant on humans providing accurate feedback.
The training environment may have bugs that arise as a result of its overall architecture or due to the interaction between itself and the agent. Such issues could see an agent tamper with the task specification. For instance, it could manipulate the representation of objects within the environment. In the extreme case of a superintelligent agent, if the environment is not secure the agent may be able to hijack the computer on which it is running and change the reward specification directly.

Goal Misgeneralization In Deep Reinforcement Learning (Lauro Langosco, et al)

https://arxiv.org/pdf/2105.14111.pdf

For a model to perform well in practice it must be able to handle information that does not conform exactly to the data it has seen in training. A model is said to have generalized to out-of-distribution (OOD) data if it performs well on test data that is not distributed identically to the training set. In reinforcement learning, there are two ways that an agent can fail these tests. Firstly it could fail to take any useful actions in such scenarios (Capabilities Generalization), secondly, the agent could start to pursue a goal other than the training reward (Goal Misgeneralization). The paper argues that GM failures are more dangerous than CG failures. Simply training an agent to maximize an objective function is not enough to guarantee that it will not learn some proxy for the objective function. The article focuses on Goal Misgineralization for RL agents.
Agents suffering from GM may fail suddenly under their imperfect proxies. Suppose our RL agent is trying to maximize a reward $R:S\times A\times S\to\mathbb{R}$, where $S$ is the set of valid states, and $A$ is the set of actions. Then GM occurs when an agent acting in a goal-directed manner achieves a low reward in a new environment. They appear to be optimized for a reward $R^{\prime}$ that is not equal to $R$. $R^{\prime}$ is referred to as the behavioural object, whereas $R$ is the intended objective.
GM can arise in different ways. One of the ways includes an agent simply performing pattern recognition on the training data, which does not extrapolate to OOD data. Secondly, the behavioral objective may be correlated with the intended objective but not match. Thirdly, the environment may contain some proxies. These proxies may be aligned with the intended objective and be easier to learn. As mentioned before however, proxies are not always a bad thing for an RL agent to learn.
It becomes difficult to know when an agent is pursuing a proxy. In real-world scenarios, it may be more challenging to identify GM compared to CG, as in the latter case we can run experiments that will indicate such failures.

The Alignment Problem From a Deep Learning Perspective (Rich Ngo, et al)

https://arxiv.org/abs/2209.00626

The article makes a series of claims about the various phases of progression to AGI that will result in an AGI having undesirable goals.
Firstly, learned policies will develop sophisticated representations of different outcomes that are correlated with high rewards on tasks, and it will learn to make plans to achieve those outcomes. As we train increasingly capable policies they will start to use high-level planning. This requires the agent to be able to understand what they are aiming for and figure out what actions will get them closer to that goal. This can cause the agent to take nuanced actions as they can exploit the intricacies of the goal they have been provided, often leading to misspecifications of the goal.
Robust goals will

consistently reward in alignment with the intended goal
be related to the supervision process
operate effectively across a broad range of environments

However, once policies develop a solid understanding of their training process misaligned goals will lead to higher rewards and be reinforced.
Next, we consider the phase where a policy becomes self-aware to the point where it can establish the training process and the deployment context to deceptively pursue its own misaligned goals. The article defines situational awareness as the capacity to identify abstract knowledge relevant to a context. A situationally aware policy will therefore know how they are updated and can exploit this. For example, externally they could make themselves appear as if they are acting positively according to the reward function, however, internally they are pursuing their own objectives that are constructed in such a way as to not be detectable from the current reward function. This form of deception is difficult to counter as it relies on us being able to detect it, however, a sufficiently intelligent system will be able to bypass any detection measure we put in place. An agent wishing to pursue their own goals will therefore develop policies that limit obedience and honesty as this would restrict their ability to deceive. Hence, we can potentially limit the rate of deception by discriminating against dishonesty, however, this is challenging for more complex tasks.
Once policies became too capable for effective human supervision they may start to take actions that give them more power rather than following human intentions. Such AGI systems will be able to operate at an intelligence level that is above human intelligence. Therefore, we would have no way of comprehending their intentions and defending ourselves against the consequences of their plans. There is the Instrumental Convergence Thesis that states any AGI will have instrumental goals that it must achieve to accomplish any final goal it may have. These may initially be aligned with our goals but they'll likely quickly diverge as the AGI continues to pursue its goals. Therefore, its short-term goals may seem aligned with our own, however, in the long term, it may be misaligned.
Furthermore, as a consequence of an AGI relentlessly pursuing its own goals, it will develop strategies to prevent humans from interfering with its goals.
We should expect AGI to generalize in a way that will seem strange to most of us. Due to their greater intellect, some of their moves may seem absurd to us as we cannot understand their deeper purpose.
A misaligned AI will try and seek various forms of power, including technological, political/cultural, and economical. Each of these will result in it having significant control over human societies. We will be unable to coordinate ourselves to constrain such a powerful agent.
There are multiple research avenues that should be progressed in order to mitigate these risks. For instance, we could explore ways to make our models more transparent, thus allowing us to determine whether it has goals aligned with our own. Similarly, we could use early AGI systems (that do not have a lot of power) to help us solve the alignment problem.