Method developed for personalizing LLM-based evaluation of business ideas

2026-06-25 Stockmark Inc.

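A joint research team from Stockmark Inc. and the National Institute of Advanced Industrial Science and Technology (AIST) has developed a "Personalized judge" method that tailors large language model (LLM) evaluation of business ideas to each individual evaluator. Until now, the common approach has been to train an LLM using the average of several evaluators' scores (the "Aggregate judge") as the ground truth. This study demonstrated that, because each evaluator weighs different criteria when assessing business ideas, the averaged rating does not actually match anyone's judgment. The team built the dataset "PBIG-DATA" from about 3,000 expert ratings of 300 patent-based product and business ideas, and found that a Personalized judge, which learns from an individual evaluator's scoring history, achieved higher agreement with that evaluator than the averaged-rating model did. The results are scheduled to be presented at ACL 2026 and have also been implemented in "SAT," Stockmark's AI agent for business planning support. Going forward, the method is expected to be applied to AI agents that support decisions with no single correct answer, such as new business creation, exploration of R&D themes, and finding applications for technology seeds.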

＜Related information＞

Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

Wataru Hirota, Tomoki Taniguchi, Tomoko Ohkuma, Kosuke Takahashi, Takahiro Omi, Kosuke Arima, Takuto Asakura, Chung-Chi Chen, Tatsuya Ishigaki
arXiv, submitted on 24 Apr 2026
DOI: https://doi.org/10.48550/arXiv.2604.22517

Abstract

Evaluating LLM-generated business ideas is often harder to scale than generating them. Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree. This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually? We introduce PBIG-DATA, a dataset of approximately 3,000 individual scores across 300 patent-grounded product ideas, provided by domain experts on six business-oriented dimensions: specificity, technical validity, innovativeness, competitive advantage, need validity, and market size. Analyses show substantial expert disagreement on fine-grained ordinal scores, while agreement is higher under coarse selection, suggesting structured heterogeneity rather than random noise. We then compare three judge configurations: a rubric-only zero-shot judge, an aggregate judge conditioned on mixed evaluator histories, and a personalized judge conditioned on the target evaluator’s scoring history. Across dimensions and model sizes, personalized judges align more closely with the corresponding evaluator than aggregate judges, and evaluator agreement correlates with similarity of judge-generated reasoning only under personalized conditioning. These results indicate that pooled labels can be a fragile target in pluralistic evaluation settings and motivate evaluator-conditioned judge designs for business idea assessment.
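
The three judge configurations named in the abstract differ mainly in which scoring history is placed in the judge's context. The following is a minimal sketch of that difference, not the authors' implementation: the `Rating` record, the `build_prompt` helper, the prompt wording, the 1 to 5 scale, and the sampling of the pooled history are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """One expert score (hypothetical record layout)."""
    evaluator: str  # hypothetical evaluator ID
    idea: str       # short text of the product idea
    dimension: str  # e.g. "market size"
    score: int      # ordinal score; the 1-5 scale is an assumption


# Assumed rubric text; the paper's actual rubric is not given in the abstract.
RUBRIC = "Rate the idea on the given dimension from 1 (low) to 5 (high)."


def format_examples(ratings):
    """Render past ratings as in-context examples."""
    return "\n".join(
        f"Idea: {r.idea}\nDimension: {r.dimension}\nScore: {r.score}"
        for r in ratings
    )


def build_prompt(mode, history, target_evaluator, idea, dimension, k=2, seed=0):
    """Build a judge prompt for one of the three configurations."""
    if mode == "zero_shot":
        # Rubric only, no scoring history.
        examples = []
    elif mode == "aggregate":
        # History mixed across all evaluators (sampled here for illustration).
        pool = list(history)
        examples = random.Random(seed).sample(pool, min(k, len(pool)))
    elif mode == "personalized":
        # Only the target evaluator's own scoring history.
        examples = [r for r in history if r.evaluator == target_evaluator][:k]
    else:
        raise ValueError(f"unknown mode: {mode}")

    parts = [RUBRIC]
    if examples:
        parts.append(format_examples(examples))
    parts.append(f"Idea: {idea}\nDimension: {dimension}\nScore:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Made-up ratings, for demonstration only.
    history = [
        Rating("A", "Self-cooling packaging film", "market size", 5),
        Rating("B", "Self-cooling packaging film", "market size", 1),
        Rating("A", "Low-noise drone rotor", "market size", 4),
        Rating("B", "Low-noise drone rotor", "market size", 2),
    ]
    print(build_prompt("personalized", history, "A",
                       "Patent-based water filter", "market size"))
```

In this sketch the personalized prompt for evaluator "A" contains only A's past scores, so the judge is asked to continue that evaluator's pattern instead of a pooled one.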
