Examining Inaccurate Ranking Methods for AI Model Evaluation (Why AI Leaderboards Are Inaccurate and How to Fix Them)

2025-07-29 University of Michigan

A University of Michigan study finds that the leaderboard methods commonly used to rank AI model performance, such as the Elo system, are prone to inaccuracy: rankings shift with the number of comparisons and with biased initial settings. The Glicko and Bradley-Terry systems were shown to be more reliable, and Glicko in particular is robust to uneven numbers of comparisons and reduces contradictions between win rates and rankings. The study offers guidelines for improving AI evaluation practice.
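To make the contrast concrete, here is a minimal sketch (not the study's code) of how sequential Elo updates can be order-dependent, while a Bradley-Terry fit uses all pairwise results jointly. The model names "A"/"B"/"C", the K-factor, and the match list are illustrative assumptions.

```python
# Minimal sketch, not the study's code: sequential Elo updates vs. a
# Bradley-Terry fit over the same pairwise results.
# Model names "A"/"B"/"C", K=32, and the match list are illustrative assumptions.

def elo_ratings(matches, k=32, init=1000.0):
    """Sequential Elo: process (winner, loser) pairs in the order given."""
    ratings = {}
    for winner, loser in matches:
        ra = ratings.setdefault(winner, init)
        rb = ratings.setdefault(loser, init)
        expected_win = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        ratings[winner] = ra + k * (1.0 - expected_win)
        ratings[loser] = rb - k * (1.0 - expected_win)
    return ratings


def bradley_terry(matches, iters=200):
    """Bradley-Terry strengths via standard minorization-maximization updates."""
    models = {m for pair in matches for m in pair}
    strength = {m: 1.0 for m in models}
    wins = {m: sum(1 for w, _ in matches if w == m) for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            # Sum 1 / (p_m + p_opponent) over every match involving m.
            denom = sum(1.0 / (strength[m] + strength[l if m == w else w])
                        for w, l in matches if m in (w, l))
            new[m] = wins[m] / denom if denom > 0 else strength[m]
        total = sum(new.values())
        strength = {m: s / total for m, s in new.items()}
    return strength


if __name__ == "__main__":
    matches = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "B")]
    # The same comparisons, fed to Elo in two different orders, rank B and C differently...
    for order in (matches, list(reversed(matches))):
        print("Elo:", {m: round(r, 1) for m, r in elo_ratings(order).items()})
    # ...while Bradley-Terry estimates strengths from all pairs jointly, so order is irrelevant.
    print("Bradley-Terry:", {m: round(s, 3) for m, s in bradley_terry(matches).items()})
```

In this toy run the two orderings rank B and C differently under Elo, while Bradley-Terry gives a single answer regardless of order, illustrating the kind of inconsistency the study attributes to Elo-style leaderboards.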

<Related Information>

Ranking Unraveled: Recipes for LLM Rankings in Head-to-Head AI Combat

Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, Jason Mars
ACL Anthology  Published: July 2025
URL: https://aclanthology.org/2025.acl-long.1265/

Abstract

Evaluating large language models (LLMs) is a complex task. Pairwise ranking has emerged as a state-of-the-art method for evaluating human preferences by having humans compare pairs of LLM outputs based on predefined criteria, enabling ranking across multiple LLMs by aggregating pairwise results through algorithms like Elo. However, applying these ranking algorithms in the context of LLM evaluation introduces several challenges, such as inconsistent ranking results when using Elo. Currently there is a lack of systematic study of these ranking algorithms for evaluating LLMs. In this paper, we explore the effectiveness of ranking systems for head-to-head comparisons of LLMs. We formally define a set of fundamental principles for effective ranking and conduct extensive evaluations of the robustness of several ranking algorithms in the context of LLMs. Our analysis uncovers key insights into the factors that affect ranking accuracy and efficiency, offering guidelines for selecting the most appropriate methods based on specific evaluation contexts and resource constraints.
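The robustness issue the abstract raises can also be illustrated with Glicko, which the Michigan summary above singles out for handling uneven comparison counts. Below is a minimal sketch of the standard Glicko-1 update (Glickman's published formulas), not code from the paper; the ratings, rating deviations, and comparison counts are hypothetical.

```python
import math

# Minimal Glicko-1 sketch based on Glickman's published formulas, not the paper's code.
# The ratings, RDs, and comparison counts below are hypothetical.

Q = math.log(10) / 400.0

def g(rd):
    # Down-weights opponents whose own rating is uncertain.
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q ** 2) * (rd ** 2) / math.pi ** 2)

def expected(r, r_j, rd_j):
    # Expected score against an opponent with rating r_j and deviation rd_j.
    return 1.0 / (1.0 + 10 ** (-g(rd_j) * (r - r_j) / 400.0))

def glicko_update(r, rd, results):
    """One rating period; results is a list of (r_j, rd_j, score) with score in {0, 0.5, 1}."""
    d2_inv = (Q ** 2) * sum(
        g(rd_j) ** 2 * expected(r, r_j, rd_j) * (1.0 - expected(r, r_j, rd_j))
        for r_j, rd_j, _ in results
    )
    denom = 1.0 / rd ** 2 + d2_inv
    delta = sum(g(rd_j) * (s - expected(r, r_j, rd_j)) for r_j, rd_j, s in results)
    return r + (Q / denom) * delta, math.sqrt(1.0 / denom)

if __name__ == "__main__":
    # A model judged in 10 head-to-head comparisons vs. one judged in only 2:
    # the second keeps a much larger rating deviation, i.e. explicit uncertainty.
    many = [(1500.0, 350.0, 1.0)] * 6 + [(1500.0, 350.0, 0.0)] * 4
    few = [(1500.0, 350.0, 1.0), (1500.0, 350.0, 0.0)]
    for label, games in (("10 comparisons", many), ("2 comparisons", few)):
        rating, rd = glicko_update(1500.0, 350.0, games)
        print(f"{label}: rating={rating:.1f}, RD={rd:.1f}")
```

Because the rating deviation shrinks only as evidence accumulates, a sparsely compared model is reported as uncertain rather than confidently misranked, which is the property the summary credits for Glicko's robustness to uneven comparison counts.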

 

Scaling Down to Scale Up: A Cost-Benefit Analysis of Replacing OpenAI’s LLM with Open Source SLMs in Production

Chandra Irugalbandara, Ashish Mahendra, Roland Daynauth, Tharuka Kasthuri Arachchige, Jayanaka Dantanarayana, Krisztian Flautner, Lingjia Tang, Yiping Kang, Jason Mars
arXiv  Last revised: 16 Apr 2024 (this version, v3)
DOI:https://doi.org/10.48550/arXiv.2312.14972

Abstract

Many companies use large language models (LLMs) offered as a service, like OpenAI’s GPT-4, to create AI-enabled product experiences. Along with the benefits of ease-of-use and shortened time-to-solution, this reliance on proprietary services has downsides in model control, performance reliability, uptime predictability, and cost. At the same time, a flurry of open-source small language models (SLMs) has been made available for commercial use. However, their readiness to replace existing capabilities remains unclear, and a systematic approach to holistically evaluate these SLMs is not readily available. This paper presents a systematic evaluation methodology and a characterization of modern open-source SLMs and their trade-offs when replacing proprietary LLMs for a real-world product feature. We have designed SLaM, an open-source automated analysis tool that enables the quantitative and qualitative testing of product features utilizing arbitrary SLMs. Using SLaM, we examine the quality and performance characteristics of modern SLMs relative to an existing customer-facing implementation using the OpenAI GPT-4 API. Across 9 SLMs and their 29 variants, we observe that SLMs provide competitive results, significant performance consistency improvements, and a cost reduction of 5x~29x when compared to GPT-4.

1603 Information Systems and Data Engineering