Q-learning & Policy Gradients (conceptual overview)
- Learn the concept of Q-learning as a method for estimating how good an action (a token) is in a specific context (the prompt state)
- Learn the concept of policy gradients as a method for directly optimizing the probability distribution over actions to maximize long-term reward
- Understand how these value-based and policy-based ideas underlie RLHF (commonly trained with policy-gradient methods such as PPO), DPO (which recasts the same objective as a direct preference loss), and other advanced training techniques for aligning LLM behavior
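
To make the first two objectives concrete, here is a minimal sketch of both ideas on a toy problem. Everything in it is an assumption made for illustration: the single prompt state, the three candidate tokens in `TOKENS`, the hand-picked values in `REWARD`, and the hyperparameters. It is not how an LLM is actually trained, only a small instance of each update rule. `q_learning` keeps a table of value estimates per token, while `policy_gradient` uses REINFORCE (one simple policy-gradient method) to adjust the token probabilities directly.

```python
import math
import random

# Hypothetical toy setup: one prompt state, three candidate tokens,
# and a made-up reward for each token (not taken from any real model).
TOKENS = ["yes", "no", "maybe"]
REWARD = {"yes": 1.0, "no": 0.0, "maybe": 0.2}


def q_learning(episodes=500, alpha=0.1, epsilon=0.2, seed=0):
    """Tabular Q-learning on a one-step task: Q(s, a) += alpha * (r - Q(s, a))."""
    rng = random.Random(seed)
    q = {t: 0.0 for t in TOKENS}
    for _ in range(episodes):
        # Epsilon-greedy: mostly exploit the best-known token, sometimes explore.
        if rng.random() < epsilon:
            action = rng.choice(TOKENS)
        else:
            action = max(q, key=q.get)
        reward = REWARD[action]
        # The episode ends after one token, so the bootstrap term
        # gamma * max_a' Q(s', a') is zero and the target is just the reward.
        q[action] += alpha * (reward - q[action])
    return q


def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def policy_gradient(episodes=2000, lr=0.1, seed=0):
    """REINFORCE on a softmax policy: logits += lr * reward * grad log pi(action)."""
    rng = random.Random(seed)
    logits = {t: 0.0 for t in TOKENS}
    for _ in range(episodes):
        probs = softmax(logits)
        # Sample a token from the current policy.
        action = rng.choices(TOKENS, weights=[probs[t] for t in TOKENS])[0]
        reward = REWARD[action]
        # For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - pi(k).
        for t in TOKENS:
            grad = (1.0 if t == action else 0.0) - probs[t]
            logits[t] += lr * reward * grad
    return softmax(logits)


if __name__ == "__main__":
    print("Q-values:", q_learning())
    print("Policy:  ", policy_gradient())
```

The contrast to notice is in what each function returns. Q-learning produces a score per token and the behavior comes from picking the highest score, whereas the policy-gradient loop produces the probability distribution itself. In this toy setup both should end up favoring the highest-reward token.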