Week 4: Learning from Humans
Training Language Models To Follow Instructions With Human Feedback (Long Ouyang et al.)
https://arxiv.org/abs/2203.02155
Language models (LMs) are trained to predict the next token in a sequence, not to follow user instructions safely. The goal of the research in this paper is to train LMs to follow both explicit and implicit instructions. To do this, a dataset of labeler-written prompts and demonstrations is constructed and used to fine-tune the pretrained (not-yet-aligned) model. As a result of this technique, outputs from the resulting models were significantly preferred over those of the original GPT-3, showed improvements in truthfulness, small reductions in toxic output, and better generalization to the preferences of held-out labelers. Furthermore, the alignment tax was lower compared to other alignment approaches.
The method used in the paper is as follows:
- Collect demonstration data, and train a supervised policy
- Collect comparison data, and train a reward model on labeler rankings of model outputs (a minimal sketch of the comparison loss follows this list)
- Optimize a policy against the reward model using PPO
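Since the reward model is trained on labeler rankings, the core of the second step is a pairwise comparison loss: the reward of the preferred output should exceed the reward of the rejected one. Below is a minimal sketch of that loss in PyTorch; the RewardModel class, its dimensions, and the dummy encodings are assumptions made purely for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy stand-in for a language-model-based reward model: it maps a
    fixed-size (prompt, response) encoding to a single scalar score."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def forward(self, encoding: torch.Tensor) -> torch.Tensor:
        return self.score(encoding).squeeze(-1)

def comparison_loss(rm, preferred, rejected):
    """Pairwise comparison loss: -log sigmoid(r(preferred) - r(rejected)),
    averaged over a batch of labeler-ranked output pairs."""
    return -torch.nn.functional.logsigmoid(rm(preferred) - rm(rejected)).mean()

if __name__ == "__main__":
    rm = RewardModel()
    opt = torch.optim.Adam(rm.parameters(), lr=1e-3)
    # Dummy encodings standing in for (prompt, response) pairs where the
    # first tensor of each pair was preferred by the labeler.
    preferred, rejected = torch.randn(16, 32), torch.randn(16, 32)
    for _ in range(100):
        opt.zero_grad()
        loss = comparison_loss(rm, preferred, rejected)
        loss.backward()
        opt.step()
    print(f"final comparison loss: {loss.item():.3f}")
```

The trained reward model then supplies the scalar signal that the PPO step in the third bullet optimizes against.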
The dataset comprised prompts of the following kinds, written by labelers:
- Plain - arbitrary tasks, chosen to ensure sufficient diversity
- Few-shot - an instruction together with multiple query/response pairs for that instruction
- User-based - prompts corresponding to use cases stated in applications to the OpenAI API
The prompts themselves could take one of three forms (hypothetical examples are sketched after this list):
- natural language instructions
- tasks formulated using few-shot examples
- implicit continuations of a task
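For illustration, invented examples of the three prompt forms might look like the following (these are not drawn from the paper's dataset):

```python
# Hypothetical prompts illustrating the three forms described above.
prompts = [
    {   # Natural language instruction
        "form": "instruction",
        "prompt": "Summarize the following article in two sentences: ...",
    },
    {   # Few-shot: the task is demonstrated by query/response pairs
        "form": "few_shot",
        "prompt": ("Translate English to French.\n"
                   "English: Good morning -> French: Bonjour\n"
                   "English: Thank you -> French: Merci\n"
                   "English: See you tomorrow -> French:"),
    },
    {   # Implicit continuation: the task is implied by the text itself
        "form": "continuation",
        "prompt": "Once upon a time, in a small village by the sea,",
    },
]
```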
The paper defines an aligned model as one that acts in accordance with the user's intention. A helpful model is one that follows instructions, and an honest model is one whose outputs are in accordance with the ground truth; the latter quality is harder to evaluate, since it requires comparing the model's claims to reality. Harmfulness is also difficult to measure, because it depends heavily on how the model is used: a model may have the capacity for harm, yet whether that harm materializes depends on its deployment context.
The technique outlined in this paper demonstrated the following results:
- The cost of increasing model alignment is modest relative to pretraining
- Better generalization to out-of-distribution settings, such as:
  - Non-English tasks
  - Coding problems
- Mitigation of performance degradations on public NLP tasks, reducing the alignment tax (achieved by mixing pretraining updates into the RL fine-tuning; see the objective below)
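The reduced alignment tax comes from the "PPO-ptx" variant, which mixes pretraining gradients into the RL fine-tuning. As a rough reconstruction from the paper (notation may differ slightly), the objective being maximized is:

```latex
\mathrm{objective}(\phi) =
  \mathbb{E}_{(x,y) \sim D_{\pi_\phi^{\mathrm{RL}}}}
    \left[ r_\theta(x, y) - \beta \log \frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)} \right]
  + \gamma \, \mathbb{E}_{x \sim D_{\mathrm{pretrain}}}
    \left[ \log \pi_\phi^{\mathrm{RL}}(x) \right]
```

Here r_θ is the learned reward model, the β term is a KL penalty keeping the RL policy close to the supervised fine-tuned (SFT) policy, and the γ term is the pretraining mix; setting γ = 0 recovers plain PPO, which shows larger regressions on public NLP benchmarks.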
There are some limitations to this research. One concerns the group of labelers chosen: the model is aligned to the preferences of a relatively small set of contractors and the researchers who instructed them, which may not represent the broader population of people using or affected by the model. Furthermore, this technique does not guarantee full alignment or safety.
Some of the open questions that arise due to this research include:
- Can these methods be used to produce toxic, biased, and harmful outputs?
- Can these techniques be used in conjunction with other techniques in model alignment?
- Can we design an interface that allows a wider audience to provide feedback for use in the fine-tuning process?
- Can we mitigate more of the performance regression?
If we continue to restrict large language models due to speculation about their safety, we limit their benefits; on the other hand, if they are widely available, it becomes difficult to control their use. Making only the API accessible allows one to implement the techniques outlined above to continually fine-tune the model while retaining some control over how it is used.
The Easy Goal Inference Problem is Still Hard (Paul Christiano)
https://www.alignmentforum.org/s/4dHMdK5TLN6xcqtyc/p/h9DesGT3WT9u2k7Hr
There are various approaches to tackling the AI control problem. One approach involves observing a user of a system, inferring their preferences from that behaviour, and then acting according to those preferences. This approach has been shown to work empirically, it provides a concrete model in which to study the problem, and it integrates well with the AI practices of today.
However, this approach relies on the assumption that the user is a rational agent. The easy goal inference problem is the challenge of finding a reasonable representation of a user's preferences, even given the user's complete policy and unlimited computation. It is a very difficult problem, and progress depends in large part on developments in the cognitive sciences, since it requires modelling how humans actually make decisions and mistakes.
If we restrict the problem to narrower domains, it becomes easier to solve, since humans act more rationally in these contexts.
There have been proposals to apply inverse reinforcement learning (IRL) to this problem. There is some skepticism about whether this approach can yield a good representation of the expert's goals, on the grounds that the agent would have to be more intelligent than the expert it is learning from. However, since humans already transfer knowledge to one another in essentially this format, there is reason to believe that the agent does not need to be more intelligent than the expert.
In fact, recent work has incorporated mistake models into these techniques to account for the discrepancy in intelligence level, the idea being that the agent can gather implicit information by analysing the mistakes made by the expert (a toy sketch of this idea follows).
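As a toy illustration of that idea (my own sketch, not from the post), the snippet below simulates a Boltzmann-rational expert who usually, but not always, picks the higher-reward option, and then recovers the rewards by maximum likelihood under that same mistake model:

```python
import numpy as np

rng = np.random.default_rng(0)

# True (hidden) rewards of three options. The "expert" is Boltzmann-rational:
# it picks option a with probability proportional to exp(beta * reward[a]),
# so it usually, but not always, chooses the best option.
true_reward = np.array([1.0, 0.2, -0.5])
beta = 2.0  # rationality parameter: higher beta means fewer mistakes

def boltzmann(reward, beta):
    """Choice probabilities under the Boltzmann mistake model."""
    logits = beta * reward
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Observed demonstrations: noisy choices sampled from the mistake model.
choices = rng.choice(3, size=500, p=boltzmann(true_reward, beta))
counts = np.bincount(choices, minlength=3)

# Infer rewards by maximizing the log-likelihood of the observed choices
# under the same Boltzmann mistake model, via simple gradient ascent.
est_reward = np.zeros(3)
for _ in range(2000):
    p = boltzmann(est_reward, beta)
    grad = beta * (counts - len(choices) * p)  # d(log-likelihood)/d(reward)
    est_reward += 1e-3 * grad

# Rewards are only identified up to an additive constant, so centre both.
print("true reward (centred):", np.round(true_reward - true_reward.mean(), 2))
print("inferred    (centred):", np.round(est_reward - est_reward.mean(), 2))
```

Even though the demonstrations contain mistakes, the inferred rewards recover the expert's preference ordering, because the mistake model tells the learner how to interpret imperfect behaviour as evidence about what the expert actually wants.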