References

[1] Long Ouyang, et al. “Training language models to follow instructions with human feedback”, NeurIPS 2022

[2] John Schulman, et al. “Proximal Policy Optimization Algorithms”, arXiv 2017

[3] Rafael Rafailov, et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model”, NeurIPS 2023

[4] Zeqiu Wu, et al. “Fine-Grained Human Feedback Gives Better Rewards for Language Model Training”, NeurIPS 2023