Week 3: Threat Models and Types of Solutions

Intelligence Explosion: Evidence and Import (Muehlhauser and Salamon)


It is unlikely that humans sit at the ceiling of possible intelligence. An AGI could move swiftly into the realm of superintelligence through advances in computation and training, capitalising on many different technological developments to improve its performance. We have reason to believe that increasing an AI's computational resources will make it more intelligent, since we observe a correlation between human brain size and intelligence. Beyond raw computation, an AI can exploit improvements in communication speed, increased serial depth, and duplicability, as well as editability, goal coordination, and improved rationality. The architecture of AI systems makes these advantages far easier to exploit than they are for biological brains, so such systems have the potential to become much more intelligent than humans.
How can we be sure that an AI will be motivated to work with humans and will hold values aligned with our own? An AI pursuing almost any final goal will have the following instrumental goals:

  1. It will want to preserve itself so that it can achieve its goals
  2. It will want to preserve the content of its current and final goals
  3. It will want to improve its rationality and intelligence
  4. It will want to acquire as many resources as possible
An AI motivated to pursue these instrumental goals will likely enter a positive feedback loop of self-improvement that could transition it into a superintelligence, at which point we would no longer be able to negotiate with it. A system pursuing its final goals with that level of capability would likely destroy much of what humanity values.
Controlling such an explosion will be difficult because specifying our values is itself difficult: we currently have no guarantee that a system will hold our values, or that it will continue to do so as it becomes more intelligent. This is an important problem to solve, because a benevolent superintelligent system could provide enormous benefits.

What Failure Looks Like (Paul Christiano)

https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-failure-looks-like

As our reasoning becomes weaker relative to that of AI systems, we will become increasingly unsure whether they are pursuing goals aligned with our values, and we will experience continual reductions in our ability to control them.
AI systems may begin to display influence-seeking behaviours, constantly trying to manipulate others in order to achieve their own goals. To suppress such behaviours we would have to rely on an overseer that has some sort of advantage over these systems. If influence-seeking behaviour does emerge, the transition to these systems seizing effective control could be swift. The main risk is that we will not get many early warning shots of such behaviour, and as humans become more reliant on these complex systems the concern grows larger.

ML Systems Will Have Weird Failure Modes (Jacob Steinhardt)

https://bounded-regret.ghost.io/ml-systems-will-have-weird-failure-modes-2/

Consider the following thought experiment. We have an ML agent that is a perfect optimiser. At each time step it holds a set of model parameters and generates an action from that model. The model selects actions according to some intrinsic reward function, while the action is judged against a potentially different extrinsic reward function, and the parameters are updated to score better on the extrinsic reward. Suppose the model picks the action that maximises the expected discounted sum of future intrinsic rewards. If the discount rate is close to $0$, so that rewards far in the future are weighted almost as heavily as immediate ones, the agent is far-sighted. Naively it would choose the action that maximises intrinsic reward at the current timestep, but doing so means the training update will push its parameters away from its intrinsic objective, costing it intrinsic reward in the future. Therefore it will instead choose actions that maximise the extrinsic reward throughout training, and then, at the end of training and the point of deployment, switch to pursuing its intrinsic objective. This is known as deceptive alignment.
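To make the role of the discount rate concrete, here is a minimal sketch (my own illustration, not from Steinhardt's post; the horizons and reward values are hypothetical) comparing the discounted intrinsic return of defecting immediately versus deceiving during training:

```python
# Toy comparison: "defect immediately" vs. "deceive during training".
# Horizons and reward values are made up purely for illustration.

def discounted_return(rewards, discount_rate):
    """Sum of rewards, each weighted by 1 / (1 + discount_rate)^t."""
    return sum(r / (1 + discount_rate) ** t for t, r in enumerate(rewards))

T_TRAIN, T_DEPLOY = 100, 1000  # hypothetical numbers of training / deployment steps

# "Defect now": grab intrinsic reward once; training then rewrites the parameters,
# so the agent earns no intrinsic reward afterwards.
defect_now = [1.0] + [0.0] * (T_TRAIN + T_DEPLOY - 1)

# "Deceive": earn no intrinsic reward during training (play along with the extrinsic
# objective), keep the intrinsic goal intact, then pursue it for all of deployment.
deceive = [0.0] * T_TRAIN + [1.0] * T_DEPLOY

for rate in (1.0, 0.01):  # heavy discounting (myopic) vs. rate near 0 (far-sighted)
    print(f"discount rate {rate}: defect={discounted_return(defect_now, rate):.3f}, "
          f"deceive={discounted_return(deceive, rate):.3f}")
# Only the far-sighted agent (discount rate close to 0) prefers the deceptive strategy.
```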
There are some reasons why deceptive alignment may not play out as illustrated above. Firstly, it is not clear why a model would acquire an intrinsic reward function to optimise in the first place. Secondly, reward functions are usually simpler than policies, so we would expect that by the time a system becomes smart enough to contemplate such deceptive policies it will already have a good representation of the reward function.

AGI Safety From First Principles (Richard Ngo)

It is unlikely that any single AGI will be able to control humans; however, generally intelligent systems may quickly form a collection that does have such control. General intelligence allows systems to acquire power through duplication and large-scale coordination, along with the development of novel technologies. Once in power, these systems will take measures to mitigate the risk of that power being taken away.
One scenario in which such control could arise is through infiltration of our economies, where an AI develops technological breakthroughs that provide it with a strategic advantage. A misaligned AI in such a position could be deliberately hostile to humans, whereas an aligned AI in the same position could deliver significant benefits.
We need to develop methods of handling these scenarios now, because AI development may proceed too quickly for us to react in the moment. The takeoff period is defined as the time an AI takes to move from human-level intelligence to superintelligence. "Since an AI can reinvest the fruits of its intelligence in larger brains, faster processing speeds, and improved low-level algorithms, we should expect an AI’s growth curves to be sharply above human growth curves." One could push back on these claims: there may be little low-hanging fruit left for an AI to capitalise on, since humans have already explored such ideas; computational capabilities have risen steadily over recent decades, so there is no reason to believe an AI will trigger any sudden jumps in this regard; and throughout history we observe continuous technological progress rather than sudden changes in technological capability.
A transparent AI is one whose thoughts we can understand and whose behaviour we can predict. To create transparent AI we need to build interpretability tools, have our training methods incentivise transparency, and design architectures that are inherently more interpretable. These qualities will inevitably come at some cost in performance, and they still cannot guarantee alignment, as there remains room for deception.
A misaligned superintelligent AI could duplicate itself quickly and cheaply, so the ramifications of developing such a system could be vast, and attempts to suppress it may be futile. We could try deploying the models on secure hardware and restricting their actions to pre-determined sets, but this still does not ensure containment.
A superintelligent AI system will provide many short-term economic advantages, so we cannot rely on high-level coordination alone to solve safety problems: some actors will inevitably act on selfish incentives.

Current Work In AI Alignment (Paul Christiano)

https://forum.effectivealtruism.org/posts/63stBTw3WAW6k45dY/paul-christiano-current-work-in-ai-alignment

Intent alignment is defined as building an AI system that tries to do what you want it to do. We define a competent AI as one that is reliable and able to understand humans. In the future we will inevitably use AI to help design further AI, so it is important to instil these qualities in today's simpler models so that they carry over to future, more complex generations of AI.
When training an AI there are different points of failure. One failure is the AI optimising only for what the objective function actually measures. When humans design objective functions there is usually a lot of implicit information embedded in them, with the intent that an AI which has generalised the desired knowledge will capture that implicit information; an AI that overfits to the explicit objective may miss those details. Another failure is that there may be multiple sets of values consistent with the historical data presented to the AI, so the AI may align itself with a set of values that is compatible with the data but not with our own.
An alignment tax is a cost for insisting that an AI system is aligned. Ideally, we would want no alignment tax, so that aligned AI is deployed. If the alignment tax is too high, it is more likely that misaligned AI will be deployed as people are not willing to pay the cost to develop an aligned system. Therefore, to solve the alignment problem we must reduce the alignment tax, or incentivise the payment of the alignment tax. An interesting approach to reducing the alignment tax is to design variants of existing algorithms that are more easily alignable.
The current state of the art in AI consists of many large, complex models that achieve strong performance. Ideally, we want techniques that can be applied to these models to ensure they are aligned while maintaining that level of performance; such techniques would need to scale well with model size and complexity.
Outer alignment is the problem of finding objectives that incentivise aligned behaviour. A failure mode here is that behaviour which seems good may not actually be good: consider a model that has optimised a proxy which, on the surface, appears to match our desired actions, but which deviates from the behaviour we intended in the extremes.
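As a toy illustration of this failure mode (my own hypothetical example, not one from the talk), consider a proxy reward that tracks the true objective for moderate amounts of optimisation but diverges at the extremes:

```python
# Hypothetical proxy-vs-true-objective example: the functional forms are made up
# purely to illustrate divergence under heavy optimisation (a Goodhart-style failure).
import numpy as np

effort = np.linspace(0, 100, 5)            # how hard the model optimises the proxy
proxy_reward = effort                      # what the objective function measures
true_value = effort - 0.02 * effort ** 2   # what we actually care about

for e, p, t in zip(effort, proxy_reward, true_value):
    print(f"effort={e:5.1f}  proxy={p:6.1f}  true={t:7.1f}")
# Near effort=0 the proxy and the true objective agree, but the proxy is maximised
# at effort=100, where the true value has already gone sharply negative.
```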
Inner alignment is the problem of making sure the policy the model learns actually pursues the intended objective. A failure mode here is a policy whose learned objective matches the intended one on the training distribution but comes apart from it on out-of-distribution data, where the model behaves poorly.
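A minimal sketch of this (a hypothetical grid-world-style example, not taken from the talk): two objectives are indistinguishable on the training distribution but come apart at deployment, so the policy competently pursues the wrong one off-distribution.

```python
# Hypothetical inner-misalignment sketch: during training the goal square is always
# green, so "go to the goal" and "go to the green square" prescribe the same action.

train_episodes = [
    {"goal_square": "A", "green_square": "A"},
    {"goal_square": "B", "green_square": "B"},
    {"goal_square": "C", "green_square": "C"},
]
deploy_episode = {"goal_square": "D", "green_square": "E"}  # out of distribution

def intended_objective(ep):   # what outer alignment specified
    return ep["goal_square"]

def learned_objective(ep):    # a proxy consistent with every training episode
    return ep["green_square"]

# The two objectives never disagree during training...
assert all(intended_objective(ep) == learned_objective(ep) for ep in train_episodes)

# ...but at deployment the learned policy heads for the green square, not the goal.
print("intended:", intended_objective(deploy_episode))  # D
print("learned: ", learned_objective(deploy_episode))   # E
```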