The paper "Deep Reinforcement Learning from Human Preferences" by Christiano et al. (2017) explores training reinforcement learning agents with human feedback in order to reach better-aligned behavior. The authors introduce a framework in which a deep reinforcement learning agent is trained not on a predefined reward signal, but on a reward model fit to human comparisons of the agent's behavior. This approach lets the agent learn complex behaviors that are more closely aligned with human preferences, even in tasks where a suitable reward function is hard to specify by hand.
During training, the human is shown two short trajectory segments and asked to select the preferable one, or to indicate that the segments are incomparable.
To validate the implementation, a benchmark was created. The following results were obtained with a Gymnasium environment for the Atari game Enduro.
The model can be improved by tuning the hyperparameters and collecting more human feedback. Note that, since a human must provide feedback to the machine during training, the training process is considerably slow in wall-clock time.
During this training run, only 290 pieces of feedback (preferences between segments) were given. In the original paper, about 5.5k comparisons were collected.
The following presents the steps of the training process:
The current DLFHP implementation uses Gymnasium, which is a fork of OpenAI’s Gym library. "OpenAI Gym is a toolkit for reinforcement learning research. It includes a growing collection of benchmark problems that expose a common interface, and a website where people can share their results and compare the performance of algorithms. This whitepaper discusses the components of OpenAI Gym and the design decisions that went into the software" (Brockman et al., 2016).
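As a hedged illustration of the Gymnasium API used here, the sketch below creates an Enduro environment and runs a random policy for one episode. The environment id ("ALE/Enduro-v5") and the extra packages it needs (gymnasium[atari], ale-py with the Atari ROMs) are assumptions about this setup, not something stated above.

```python
import gymnasium as gym

# Assumed env id; requires gymnasium[atari] / ale-py with the Atari ROMs installed.
env = gym.make("ALE/Enduro-v5", render_mode="rgb_array")
obs, info = env.reset(seed=0)

done = False
while not done:
    action = env.action_space.sample()  # random action, just to exercise the API
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```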
An agent (the policy) interacts with the Enduro environment, receiving an observation and choosing an action at each step.
To simplify the observation for the CNN model and speed up learning, each observation is preprocessed: converted to grayscale, resized to (80, 80), and normalized to the range [0, 1] by dividing each pixel by 255.
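A minimal sketch of this preprocessing, assuming OpenCV is available; the helper name and interpolation mode are illustrative and may differ from the repository's actual code.

```python
import cv2
import numpy as np

def preprocess(obs: np.ndarray) -> np.ndarray:
    """Turn an RGB Atari frame (210, 160, 3) into a normalized (80, 80) grayscale image."""
    gray = cv2.cvtColor(obs, cv2.COLOR_RGB2GRAY)                        # (210, 160)
    resized = cv2.resize(gray, (80, 80), interpolation=cv2.INTER_AREA)  # (80, 80)
    return resized.astype(np.float32) / 255.0                           # values in [0, 1]
```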
These observations are fed to the policy network, which outputs an action for each step.
The Gymnasium API already provides a true reward for each step the policy takes in the environment.
The rewards given by the Gymnasium API are used only to compare two agents: one trained with the reward model and another trained with the true rewards (from the Gymnasium API). During training, the agent trained from human feedback never sees the true rewards.
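For illustration only, such a comparison could average the true Gymnasium returns of each agent over a few episodes; `agent.act` is a hypothetical interface, not necessarily the one used in this repository.

```python
import gymnasium as gym
import numpy as np

def true_return(agent, episodes: int = 5) -> float:
    """Average true (Gymnasium) return of an agent over a few Enduro episodes."""
    env = gym.make("ALE/Enduro-v5")  # assumed env id
    returns = []
    for _ in range(episodes):
        obs, _ = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, terminated, truncated, _ = env.step(agent.act(obs))  # hypothetical agent API
            total += reward
            done = terminated or truncated
        returns.append(total)
    env.close()
    return float(np.mean(returns))
```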
A policy is then trained with PPO (Schulman et al., 2017), using the rewards predicted by the reward model in place of the true environment rewards; one way to wire this up is sketched below.
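Sketched here under the assumption that the reward model is a callable `(obs, action) -> float`, a Gymnasium wrapper can replace the true reward with the model's prediction, so a standard PPO implementation can be trained on it unchanged; the class and argument names are illustrative.

```python
import gymnasium as gym

class PredictedRewardWrapper(gym.Wrapper):
    """Replace the true env reward with the reward model's prediction (sketch)."""

    def __init__(self, env, reward_model, preprocess):
        super().__init__(env)
        self.reward_model = reward_model  # hypothetical callable (obs, action) -> float
        self.preprocess = preprocess      # e.g. the grayscale/resize/normalize step above

    def step(self, action):
        obs, true_reward, terminated, truncated, info = self.env.step(action)
        # Keep the true reward only for evaluation/benchmarking, never for training.
        info["true_reward"] = true_reward
        predicted = float(self.reward_model(self.preprocess(obs), action))
        return obs, predicted, terminated, truncated, info
```

The true reward is stashed in `info` so the benchmark comparison can still read it, while the policy only ever optimizes the predicted reward.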
To predict a reward, the reward model needs both an observation and the action taken by the policy at that step.
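An illustrative PyTorch sketch of such a reward model, taking the preprocessed (80, 80) observation together with a one-hot encoded discrete action and returning a scalar reward estimate; the framework, layer sizes, and action encoding are assumptions, not necessarily the architecture used in this repository.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """r_hat(obs, action): CNN over the frame plus the one-hot action (sketch)."""

    def __init__(self, num_actions: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=8, stride=4), nn.ReLU(),   # (1, 80, 80) -> (16, 19, 19)
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),  # -> (32, 8, 8)
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(32 * 8 * 8 + num_actions, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs: torch.Tensor, action_onehot: torch.Tensor) -> torch.Tensor:
        # obs: (B, 1, 80, 80) in [0, 1]; action_onehot: (B, num_actions)
        features = self.conv(obs)
        return self.head(torch.cat([features, action_onehot], dim=1)).squeeze(-1)
```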
The reward function is then adjusted to agree with the collected human preferences: the reward model is fit to the segment comparisons with a cross-entropy loss, as in the original paper.
Using the preprocessed observation and the corresponding action as input, the reward model predicts a reward for every step of a segment; summing these per-step predictions gives the segment's score used in the comparison.
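A hedged sketch of the corresponding preference loss from Christiano et al. (2017): each segment is scored by the sum of its per-step predicted rewards, and the human label is fit with a cross-entropy (Bradley-Terry) objective. Tensor shapes and function names here are illustrative; pairs marked incomparable can simply be discarded, as in the original paper.

```python
import torch.nn.functional as F

def preference_loss(reward_model, seg1, seg2, prefs):
    """
    seg1, seg2: tuples (obs, action_onehot) shaped (B, T, 1, 80, 80) and (B, T, A).
    prefs: (B,) tensor, 1.0 if segment 1 is preferred, 0.0 if segment 2 is,
           0.5 when the human judged both segments equally preferable.
    """
    def segment_score(obs, act):
        B, T = obs.shape[:2]
        per_step = reward_model(obs.flatten(0, 1), act.flatten(0, 1))  # (B*T,)
        return per_step.view(B, T).sum(dim=1)                          # (B,)

    r1 = segment_score(*seg1)
    r2 = segment_score(*seg2)
    # P(segment 1 preferred) = exp(r1) / (exp(r1) + exp(r2)) = sigmoid(r1 - r2)
    return F.binary_cross_entropy_with_logits(r1 - r2, prefs)
```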
- Christiano, P. F., Leike, J., Brown, T. B., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems (pp. 4299-4307).
- Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540. Retrieved from https://arxiv.org/abs/1606.01540
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Retrieved from https://arxiv.org/abs/1707.06347




