Analyzing how the perception of human and AI voices is formed (Human or Machine?)

2026-05-21 Max Planck Institute

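A research team at the Max Planck Institute for Empirical Aesthetics (MPIEA) in Germany investigated what makes listeners perceive a voice as "human" and published findings on the perceptual differences between AI and human voices. The paper appeared in the journal Speech Communication. The study used short German sentences produced both by human speakers and by text-to-speech (TTS) synthesis, and compared several conditions, including versions with altered word order or with words replaced by pseudowords. The results showed that AI voices are still more likely than human voices to be perceived as "non-human", with differences in voice quality (timbre) and intonation being particularly important factors. The meaning and structure of a sentence also influenced perception: listeners who understood the language tended to rate unnatural sentences as "less human". By contrast, for Spanish and Turkish speakers who did not understand German, linguistic content had little effect. In addition, older listeners tended to perceive AI voices as more human-like. The study provides important insights for improving the naturalness of voice AI and for designing communication between humans and AI.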

How do people perceive the difference between real and computer-generated voices? © Illustration: MPIEA / L. Bittner

<Related information>

Perception of humanness is affected by speech content

Janniek Wester, Pauline Larrouy-Maestri

Speech Communication  Available online: 16 April 2026

DOI: https://doi.org/10.1016/j.specom.2026.103398

Highlights

  • Computer generated voices still sound less human than human voices.
  • Humanness perception of voices is modulated by the meaning and structure of speech.
  • Speech content does not affect the humanness perception of non-native listeners.
  • Prosody and summary acoustics of synthetic voices are different from human ones.
  • Older adults perceive synthetic voices as sounding more human than younger adults.

Abstract

The increasing use of computer-generated speech in various applications has raised questions about how people perceive synthetic voices. This study investigates the role of linguistic information in the perception of humanness in speech. We conducted two experiments with native German-, Spanish- and Turkish-speaking participants who rated the human-likeness of human and text-to-speech (TTS)-generated voices. By presenting German sentences as well as manipulated versions of those sentences in terms of syntax and semantics, we examined the role of three types of linguistic information, that is, phonetics, semantics, and syntax, on humanness perception. Acoustic analyses revealed differences between human and TTS-generated voices in terms of summary acoustics and dynamic contours of pitch and intensity, thus showing that TTS-generated voices are not yet fully aligned with human voices on voice quality and prosody. Importantly, behavioral results showed that these acoustic differences were more salient to native German listeners, who distinguished between human and synthetic voices more extremely. In addition to the role of phonetic or phonological familiarity, we observed a role of both syntax and semantics in humanness perception, with the manipulated sentences sounding less human regardless of the speaker (i.e., TTS-generated or human), but only for the native speakers. Lastly, humanness perception of speech appears to be relatively idiosyncratic as indicated by the individual differences observed. Altogether, this study contributes to our understanding of the interplay between linguistic and paralinguistic information in speech perception, and clarifies how listeners perceive their increasingly synthetically-generated soundscape.
