Direct Preference Optimization

Direct Preference Optimization (DPO) is an alternative to Reinforcement Learning from Human Feedback (RLHF) for fine-tuning Language Models so that their outputs align with human preferences. DPO has two steps: (1) collect a pair of completions for each prompt in a dataset, and label one completion in each pair as preferred and the other as dispreferred; (2) fine-tune the Language Model on this labeled dataset using a loss function, adapted from the RLHF objective, which implicitly rewards the model for following human preferences while penalizing it for diverging too far from the original (reference) model. DPO simplifies RLHF in two ways: (1) it eliminates the need to train a separate reward model (which imitates human preference between two candidate outputs); (2) it eliminates the need to use Reinforcement Learning to fine-tune the Language Model.
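As a rough sketch (not the paper's reference implementation), the DPO loss can be expressed in a few lines of PyTorch. The tensor names and the default value of beta below are illustrative assumptions; the per-completion log-probabilities (summed over tokens) are assumed to have been computed elsewhere, under both the model being fine-tuned and the frozen reference model.

    import torch
    import torch.nn.functional as F

    def dpo_loss(
        policy_chosen_logps: torch.Tensor,    # log p(preferred completion) under the model being fine-tuned
        policy_rejected_logps: torch.Tensor,  # log p(dispreferred completion) under the model being fine-tuned
        ref_chosen_logps: torch.Tensor,       # log p(preferred completion) under the frozen reference model
        ref_rejected_logps: torch.Tensor,     # log p(dispreferred completion) under the frozen reference model
        beta: float = 0.1,                    # illustrative value; controls divergence from the reference model
    ) -> torch.Tensor:
        # Implicit rewards: beta-scaled log-ratios between the fine-tuned
        # policy and the frozen reference model.
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Negative log-sigmoid of the reward margin: the loss decreases as the
        # preferred completion is scored higher than the dispreferred one.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

Minimizing this loss raises the likelihood of preferred completions relative to dispreferred ones, while the log-ratios against the reference model keep the fine-tuned policy from drifting too far from it.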
Related concepts:
Reinforcement Learning from Human Feedback
Model Fine-Tuning
External reference:
https://arxiv.org/abs/2305.18290