Preference-Based Finetuning — DPO, PPO, RLHF & GRPO

- Learn why base LLMs are misaligned and how preference data corrects this
- Understand the differences between DPO, PPO, RLHF, and GRPO
- Generate math-focused DPO datasets using numeric correctness as the preference signal
- Apply ensemble voting to simulate “majority correctness” and reduce hallucinated answers (see the sketch after this list)
- Evaluate model learning through preference alignment instead of a separate reward model
- Compare training pipelines (DPO vs RLHF vs PPO) on cost, control, and complexity
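
To make the dataset-generation objectives above more concrete, here is a minimal sketch of how a DPO preference pair could be built from numeric correctness plus ensemble voting: sample several completions for a math prompt, take the most common numeric answer as the “majority correct” one, and pair an agreeing completion (chosen) with a disagreeing one (rejected). The function names, the regex-based answer extraction, and the example data are illustrative assumptions, not the chapter’s exact pipeline.

```python
# Sketch: build a DPO-style {prompt, chosen, rejected} record via majority voting.
# extract_number, build_dpo_pair, and the sample completions are hypothetical names.
from collections import Counter
import re


def extract_number(completion: str):
    """Pull the last number out of a completion; None if no number is found."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return float(matches[-1]) if matches else None


def build_dpo_pair(prompt: str, completions: list[str]):
    """Use the majority numeric answer across the ensemble as the correctness signal."""
    answers = [extract_number(c) for c in completions]
    voted = [a for a in answers if a is not None]
    if not voted:
        return None  # no parseable numeric answer, skip this prompt
    majority, _ = Counter(voted).most_common(1)[0]

    chosen = [c for c, a in zip(completions, answers) if a == majority]
    rejected = [c for c, a in zip(completions, answers) if a != majority]
    if not chosen or not rejected:
        return None  # need one agreeing and one disagreeing completion to form a pair
    return {"prompt": prompt, "chosen": chosen[0], "rejected": rejected[0]}


if __name__ == "__main__":
    prompt = "What is 17 * 24?"
    samples = [
        "17 * 24 = 408",        # matches the majority answer -> candidate "chosen"
        "The answer is 408.",   # matches the majority answer
        "17 * 24 equals 398.",  # disagrees -> candidate "rejected"
    ]
    print(build_dpo_pair(prompt, samples))
```

Records in this shape can be fed directly to a DPO trainer that expects prompt/chosen/rejected triples; the voting step stands in for a reward model or human labels, which is why this route is cheaper than a full RLHF/PPO pipeline.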