Direct Preference Optimization

Technique · 1 mention from 1 source

A simpler alternative to reinforcement learning for training language models using human preferences without explicit reward models.
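To make the idea concrete, here is a minimal sketch of the DPO objective for a single preference pair. It assumes you already have summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model; the function name and signature are illustrative, not from any particular library.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response
    under the policy (logp_*) or the frozen reference (ref_logp_*).
    beta controls how far the policy may drift from the reference.
    """
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Logistic (Bradley-Terry) loss on the reward margin: no separate
    # reward model and no RL rollout is needed.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy assigns relatively more probability to the preferred response than the reference does, which is why DPO can replace the reward-model-plus-RL pipeline with a single supervised-style objective.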


Nathan Lambert (✓ High confidence)
"the famous paper, Direct Preference Optimization, which is a much simpler way of solving the problem than RL. The derivations in the appendix skip steps of math."

Attribution: Nathan mentions DPO as a famous and simpler alternative to RL for preference learning