2026-02-25 National Institute of Informatics

Moshi音声対話モデルの日本語ファインチューニングにおける対話データ特性の影響 Effects of dialogue corpora properties on fine-tuning a Moshi-based spoken dialogue model
Yuto Abe, Mao Saeki, Atsumoto Ohashi, Shinnosuke Takamichi, Shinya Fujie, Tetsunori Kobayashi, Tetsuji Ogawa, Ryuichiro Higashinaka
Proceedings of the 16th International Workshop on Spoken Dialogue Systems Technology, pages 104–108, February 26–March 1, 2026. ©2026 Association for Computational Linguistics
Abstract
We study how the turn-taking properties of spoken dialogue corpora shape the learning and behavior of full-duplex speech dialogue models. Beyond acoustic and linguistic quality, effective systems must reproduce task-dependent dynamics such as conversational tempo and turn-taking. We analyze multiple Japanese dialogue corpora using i) NISQA for speech quality, ii) LLM-as-a-Judge for linguistic/semantic appropriateness, and iii) four timing indicators (inter-pausal units, pauses, gaps, and overlaps) to quantify interactional style. A curriculum strategy then fine-tunes a Moshi-based full-duplex model by incrementally combining corpora with distinct turn-taking profiles. On a dialogue-continuation task, corpus-specific turn-taking patterns reliably shaped model behavior: chat-style corpora yielded more natural rhythms with moderate overlaps and gaps, whereas consultation-style corpora promoted slower, deliberate timing. Fine-tuning on high-quality audio improved perceptual naturalness, while mixing task-mismatched data reduced linguistic coherence.
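The abstract's four timing indicators can be computed from voice-activity segments of a two-speaker dialogue. The sketch below is illustrative only, not the paper's implementation: the function name, the segment representation, and the 0.2 s inter-pausal-unit (IPU) silence threshold are assumptions chosen for the example. It counts IPUs per speaker (segments merged when the intra-speaker silence is below the threshold) and accumulates pause (within-speaker silence), gap (between-speaker silence), and overlap (simultaneous speech between consecutive turns).

```python
from collections import defaultdict


def timing_indicators(segments, ipu_threshold=0.2):
    """Compute illustrative turn-taking statistics from VAD segments.

    segments: list of (speaker, start_sec, end_sec) tuples.
    Returns a dict with IPU count and total pause/gap/overlap durations.
    """
    segs = sorted(segments, key=lambda s: s[1])

    # IPUs: per speaker, merge segments separated by < ipu_threshold of silence.
    by_spk = defaultdict(list)
    for spk, start, end in segs:
        by_spk[spk].append((start, end))
    ipus = 0
    for intervals in by_spk.values():
        ipus += 1  # first segment of this speaker opens an IPU
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            if next_start - prev_end >= ipu_threshold:
                ipus += 1  # long enough silence closes the IPU

    # Pause / gap / overlap from consecutive segments in global time order.
    pause = gap = overlap = 0.0
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(segs, segs[1:]):
        silence = start_b - end_a
        if silence < 0:
            overlap += -silence   # next turn starts before previous one ends
        elif spk_a == spk_b:
            pause += silence      # silence within one speaker's speech
        else:
            gap += silence        # silence at a floor transfer
    return {"ipus": ipus, "pause": pause, "gap": gap, "overlap": overlap}


# Tiny worked example with two speakers A and B (times in seconds):
example = [("A", 0.0, 1.0), ("B", 1.3, 2.0), ("A", 1.8, 3.0), ("A", 3.1, 4.0)]
stats = timing_indicators(example)
```

In the example, A's last two segments are only 0.1 s apart and thus merge into one IPU, the A-to-B transfer contributes a 0.3 s gap, B's turn overlaps A's next turn by 0.2 s, and the within-A silence counts as a 0.1 s pause.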


