[1] Wang, Haoxiang, et al. "Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts." arXiv preprint arXiv:2406.12845 (2024).
[2] Kim, Seungone, et al. "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models." arXiv preprint arXiv:2405.01535 (2024).
[3] Zheng, Lianmin, et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." Advances in Neural Information Processing Systems 36 (2024).
[4] Wei, Zeming, Yifei Wang, and Yisen Wang. "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations." arXiv preprint arXiv:2310.06387 (2023).
[5] Chao, Patrick, et al. "Jailbreaking Black Box Large Language Models in Twenty Queries." arXiv preprint arXiv:2310.08419 (2023).
[6] Rafailov, Rafael, et al. "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." Advances in Neural Information Processing Systems 36 (2024).
[7] Zhang, Zhexin, et al. "Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks." arXiv preprint arXiv:2407.02855 (2024).
[8] Robey, Alexander, et al. "SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks." arXiv preprint arXiv:2310.03684 (2023).
[9] Kirchenbauer, John, et al. "A Watermark for Large Language Models." International Conference on Machine Learning. PMLR, 2023.
[10] Lee, Taehyun, et al. "Who Wrote This Code? Watermarking for Code Generation." arXiv preprint arXiv:2305.15060 (2023).