[1] Kim, Seungone, et al. "The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models." In NAACL (2025).
[2] Zheng, Lianmin, et al. "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." In NeurIPS Datasets and Benchmarks Track (2023).
[3] Li, Xuechen, et al. "AlpacaEval: An Automatic Evaluator of Instruction-following Models." URL https://github.com/tatsu-lab/alpaca_eval (2023).
[4] Chan, Chi-Min, et al. "ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate." In ICLR (2024).
[5] Ye, Seonghyeon, et al. "FLASK: Fine-grained Language Model Evaluation Based on Alignment Skill Sets." In ICLR (2024).
[6] Kim, Seungone, et al. "Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models." In EMNLP (2024).
[7] Prometheus-Eval. URL https://github.com/prometheus-eval/prometheus-eval.