Zero-Shot Learning

GPT-2 is able to do many tasks with no examples and no gradient updates.

Instruction Fine-Tuning

Language models, by default, are not aligned with user intent.

- Collect paired examples of (instruction, output) across many tasks; then, evaluate on unseen tasks.
- ~3 million examples << n billion pretraining examples.
- Evaluation dataset: MMLU.
- You can generate an instruction fine-tuning dataset by asking a larger model for it (see Alpaca).

Pros + Cons

- + simple and straightforward
- + generalizes to unseen tasks
- but, it's expensive to collect ground-truth data
- ground truths may be wrong
- creative tasks may not have a single correct answer
- LMs penalize all token-level mistakes equally, but some mistakes are worse than others
- humans may generate suboptimal answers

Human Preference Modeling

Imagine we have some input x and two output trajectories, y_{1} and y_{2}. Suppose we have a reward model R(x, y). We desire:

R(x, y_{1}) > R(x, y_{2}) \quad \text{whenever humans prefer } y_{1} \text{ to } y_{2}
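The "penalizes all token-level mistakes equally" point can be made concrete: instruction fine-tuning minimizes a per-token negative log-likelihood, which charges the same price for a harmless paraphrase as for a serious factual error. A minimal sketch (the function name `token_nll` and the toy numbers are my own):

```python
import math

def token_nll(token_logprobs):
    # Mean negative log-likelihood over the ground-truth tokens.
    # Every low-probability token raises the loss by the same rule,
    # regardless of how "bad" that particular mistake is.
    return -sum(token_logprobs) / len(token_logprobs)

# toy per-token log-probs the model assigns to the ground-truth tokens
confident = [math.log(0.9)] * 4
one_slip  = [math.log(0.9)] * 3 + [math.log(0.1)]  # one poorly-predicted token

print(token_nll(confident) < token_nll(one_slip))  # the slip raises the loss
```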
RLHF, in Broad Strokes

1. Do instruction fine-tuning.
2. Estimate a reward model R(x, y).
3. Maximize that reward model.

Model Preferences as an NLP Problem

Train a reward model:

RM_{\phi}(x, y)
which models human preference scores.

Get Preference Data

To actually get the preference data, ask humans to rank candidate outputs.

Bradley-Terry Preference Model

Suppose a human chose y^{w} over y^{l}. Then, the Bradley-Terry preference model tells us that a good reward model RM_{\phi} will minimize:

\mathcal{L}(\phi) = -\mathbb{E}_{(x,\, y^{w},\, y^{l})}\left[\log \sigma\!\left(RM_{\phi}(x, y^{w}) - RM_{\phi}(x, y^{l})\right)\right]
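The Bradley-Terry objective for a single comparison can be sketched in a few lines; `bt_loss` is an illustrative name, not from the notes:

```python
import math

def bt_loss(r_w, r_l):
    # Bradley-Terry negative log-likelihood for one comparison:
    # -log sigma(RM(x, y_w) - RM(x, y_l)).
    # Only the *difference* of the two reward scores matters.
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))
```

With no margin the loss is log 2; as the reward model scores the chosen answer higher, the loss shrinks toward 0, and a reversed ranking is penalized heavily.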
PPO

Then, we optimize this:

\max_{\theta} \; \mathbb{E}_{\hat{y} \sim p_{\theta}^{RL}(\hat{y}\mid x)}\left[ RM_{\phi}(x, \hat{y}) - \beta \log \frac{p_{\theta}^{RL}(\hat{y}\mid x)}{p^{orig}(\hat{y}\mid x)} \right]
The β-weighted log-ratio is a penalty term (a KL divergence in expectation) that prevents large drifts away from the original model.

DPO

What if there were a way to write RM_{\phi}(x, y) directly in terms of p_{\theta}^{RL}(\hat{y}\mid x)? Our goal is to solve this KL-regularized problem:

\max_{\theta} \; \mathbb{E}_{\hat{y} \sim p_{\theta}^{RL}}\left[ RM_{\phi}(x, \hat{y}) \right] - \beta \, D_{KL}\!\left(p_{\theta}^{RL}(\cdot \mid x) \,\|\, p^{orig}(\cdot \mid x)\right)
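Per sample, the KL-penalized objective rewards high RM scores but charges the policy for probability mass it has shifted away from the original model. A sketch (names and the β value are illustrative):

```python
def penalized_reward(rm_score, logp_rl, logp_orig, beta=0.1):
    # Per-sample RLHF objective: reward minus beta * log(p_RL / p_orig).
    # Averaged over samples from p_RL, the penalty is beta * KL(p_RL || p_orig),
    # which discourages large drifts from the original model.
    return rm_score - beta * (logp_rl - logp_orig)

# a sample the policy has sharply upweighted relative to the original
# model pays a penalty; an unchanged sample pays none
drifted = penalized_reward(1.0, logp_rl=-1.0, logp_orig=-5.0)
stable  = penalized_reward(1.0, logp_rl=-5.0, logp_orig=-5.0)
```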
This KL-regularized problem actually has a closed-form solution!

p^{*}(\hat{y}\mid x) = \frac{1}{Z(x)} \, p^{orig}(\hat{y}\mid x) \exp\!\left(RM_{\phi}(x, \hat{y}) / \beta\right)
where Z(x) = \sum_{\hat{y}} p^{orig}(\hat{y}\mid x) \exp\!\left(RM_{\phi}(x, \hat{y}) / \beta\right). Notice: computing the normalization term Z is intractable, since it sums over all possible outputs! But first, we rearrange this equation to get:

RM_{\phi}(x, \hat{y}) = \beta \log \frac{p^{*}(\hat{y}\mid x)}{p^{orig}(\hat{y}\mid x)} + \beta \log Z(x)
Now, we want to solve for a p^{*} given the reward signal, so let's parametrize it:

RM_{\theta}(x, \hat{y}) = \beta \log \frac{p_{\theta}(\hat{y}\mid x)}{p^{orig}(\hat{y}\mid x)} + \beta \log Z(x)
(Issue: at initialization \theta equals the pretrained model, so each log-ratio is \log(1) = 0 and this whole reward is 0... we will get to that. Also, Z is still intractable.)

Now, recall the Bradley-Terry preference model: a good RM_{\theta} should minimize

-\mathbb{E}_{(x,\, y^{w},\, y^{l})}\left[\log \sigma\!\left(RM_{\theta}(x, y^{w}) - RM_{\theta}(x, y^{l})\right)\right]
Plugging our expression for RM_{\theta} into this loss, notice that the \beta \log Z(x) terms cancel out! This gives the DPO loss:

\mathcal{L}_{DPO}(\theta) = -\mathbb{E}_{(x,\, y^{w},\, y^{l})}\left[\log \sigma\!\left(\beta \log \frac{p_{\theta}(y^{w}\mid x)}{p^{orig}(y^{w}\mid x)} - \beta \log \frac{p_{\theta}(y^{l}\mid x)}{p^{orig}(y^{l}\mid x)}\right)\right]
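The per-pair DPO loss needs only four log-probabilities and never touches Z(x). A sketch (names and β are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO loss for one preference pair. Only policy/reference log-ratios
    # appear: the intractable beta*log Z(x) term cancelled in the
    # difference RM(x, y_w) - RM(x, y_l).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization the policy equals the reference, so both implicit
# rewards are 0: the loss starts at log(2), with a nonzero gradient,
# which is why the "whole thing is 0" worry above is not fatal.
init_loss = dpo_loss(-2.0, -3.0, ref_logp_w=-2.0, ref_logp_l=-3.0)
```

As the policy upweights y^{w} (or downweights y^{l}) relative to the reference, the margin grows and the loss falls below log(2).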