2026-02-25 National Institute of Informatics

Moshi音声対話モデルの日本語ファインチューニングにおける対話データ特性の影響 Effects of dialogue corpora properties on fine-tuning a Moshi-based spoken dialogue model
Yuto Abe, Mao Saeki, Atsumoto Ohashi, Shinnosuke Takamichi, Shinya Fujie, Tetsunori Kobayashi, Tetsuji Ogawa, Ryuichiro Higashinaka
Proceedings of the 16th International Workshop on Spoken Dialogue Systems Technology, pages 104–108, February 26–March 1, 2026. ©2026 Association for Computational Linguistics
Abstract
We study how the turn-taking properties of spoken dialogue corpora shape the learning and behavior of full-duplex speech dialogue models. Beyond acoustic and linguistic quality, effective systems must reproduce task-dependent dynamics such as conversational tempo and turn-taking. We analyze multiple Japanese dialogue corpora using i) NISQA for speech quality, ii) LLM-as-a-Judge for linguistic/semantic appropriateness, and iii) four timing indicators (inter-pausal units, pauses, gaps, and overlaps) to quantify interactional style. A curriculum strategy then fine-tunes a Moshi-based full-duplex model by incrementally combining corpora with distinct turn-taking profiles. On a dialogue-continuation task, corpus-specific turn-taking patterns reliably shaped model behavior: chat-style corpora yielded more natural rhythms with moderate overlaps and gaps, whereas consultation-style corpora promoted slower, deliberate timing. Fine-tuning on high-quality audio improved perceptual naturalness, while mixing task-mismatched data reduced linguistic coherence.
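The abstract's four timing indicators can be computed from voice-activity segments of a two-speaker dialogue. The sketch below is illustrative only, not the paper's implementation: the function name, the segment representation, and the 0.2 s inter-pausal-unit (IPU) silence threshold are assumptions chosen for the example. It counts IPUs per speaker (segments merged when the intra-speaker silence is below the threshold) and accumulates pause (within-speaker silence), gap (between-speaker silence), and overlap (simultaneous speech between consecutive turns).

```python
from collections import defaultdict


def timing_indicators(segments, ipu_threshold=0.2):
    """Compute illustrative turn-taking statistics from VAD segments.

    segments: list of (speaker, start_sec, end_sec) tuples.
    Returns a dict with IPU count and total pause/gap/overlap durations.
    """
    segs = sorted(segments, key=lambda s: s[1])

    # IPUs: per speaker, merge segments separated by < ipu_threshold of silence.
    by_spk = defaultdict(list)
    for spk, start, end in segs:
        by_spk[spk].append((start, end))
    ipus = 0
    for intervals in by_spk.values():
        ipus += 1  # first segment of this speaker opens an IPU
        for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
            if next_start - prev_end >= ipu_threshold:
                ipus += 1  # long enough silence closes the IPU

    # Pause / gap / overlap from consecutive segments in global time order.
    pause = gap = overlap = 0.0
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(segs, segs[1:]):
        silence = start_b - end_a
        if silence < 0:
            overlap += -silence   # next turn starts before previous one ends
        elif spk_a == spk_b:
            pause += silence      # silence within one speaker's speech
        else:
            gap += silence        # silence at a floor transfer
    return {"ipus": ipus, "pause": pause, "gap": gap, "overlap": overlap}


# Tiny worked example with two speakers A and B (times in seconds):
example = [("A", 0.0, 1.0), ("B", 1.3, 2.0), ("A", 1.8, 3.0), ("A", 3.1, 4.0)]
stats = timing_indicators(example)
```

In the example, A's last two segments are only 0.1 s apart and thus merge into one IPU, the A-to-B transfer contributes a 0.3 s gap, B's turn overlaps A's next turn by 0.2 s, and the within-A silence counts as a 0.1 s pause.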


