AI headphones translate multiple speakers at once, cloning their voices in 3D sound

2025-05-09 University of Washington (UW)

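A University of Washington research team has developed "Spatial Speech Translation," an AI-powered headphone system that translates the speech of multiple speakers in real time and plays it back while preserving each speaker's voice characteristics and direction. The system uses off-the-shelf noise-cancelling headphones fitted with microphones. The team's own algorithms identify and track the speakers, then translate and play back their speech with a delay of 2 to 4 seconds. Whereas earlier translation technology was limited to a single speaker, this system translates several speakers simultaneously and reproduces the quality and direction of each voice, which makes for a more natural conversational experience. It also runs without the cloud, on mobile devices with an Apple M2 chip, which helps protect privacy. The research was presented on April 30, 2025 at the ACM CHI conference in Yokohama, and the code has been released publicly.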

<Related information>

Spatial Speech Translation: Translating Across Space With Binaural Hearables

Tuochao Chen, Qirui Wang, Runlin He, Shyam Gollakota
arXiv Submitted on 25 Apr 2025
DOI:https://doi.org/10.48550/arXiv.2504.18715

Abstract

Imagine being in a crowded space where people speak a different language and having hearables that transform the auditory space into your native language, while preserving the spatial cues for all speakers. We introduce spatial speech translation, a novel concept for hearables that translate speakers in the wearer’s environment, while maintaining the direction and unique voice characteristics of each speaker in the binaural output. To achieve this, we tackle several technical challenges spanning blind source separation, localization, real-time expressive translation, and binaural rendering to preserve the speaker directions in the translated audio, while achieving real-time inference on the Apple M2 silicon. Our proof-of-concept evaluation with a prototype binaural headset shows that, unlike existing models, which fail in the presence of interference, we achieve a BLEU score of up to 22.01 when translating between languages, despite strong interference from other speakers in the environment. User studies further confirm the system’s effectiveness in spatially rendering the translated speech in previously unseen real-world reverberant environments. Taking a step back, this work marks the first step towards integrating spatial perception into speech translation.
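
As an illustration of what "preserving the speaker directions" in a binaural output can mean, here is a minimal sketch that places a mono signal at a given azimuth using only an interaural time difference (ITD) and a crude interaural level difference (ILD). This is not the authors' rendering method, which the abstract does not detail. The Woodworth ITD approximation, the head radius, the 6 dB level cap, and the function names `itd_seconds` and `render_binaural` are all assumptions chosen for illustration.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, roughly room temperature (assumed)
HEAD_RADIUS = 0.0875     # m, a commonly used average head radius (assumed)


def itd_seconds(azimuth_deg: float) -> float:
    """Woodworth approximation of the interaural time difference.

    azimuth_deg: 0 is straight ahead, +90 is fully to the right.
    A positive result means the sound reaches the right ear first.
    """
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))


def render_binaural(mono, azimuth_deg, sample_rate=16000, max_ild_db=6.0):
    """Return (left, right) sample lists for a mono source at azimuth_deg.

    The ear farther from the source gets the signal delayed by the ITD
    (rounded to whole samples) and attenuated by a simple ILD.
    """
    delay = round(abs(itd_seconds(azimuth_deg)) * sample_rate)
    # Crude level difference: grows with |sin(azimuth)| up to max_ild_db.
    ild_db = max_ild_db * abs(math.sin(math.radians(azimuth_deg)))
    far_gain = 10.0 ** (-ild_db / 20.0)

    near = list(mono) + [0.0] * delay                   # padded to equal length
    far = [0.0] * delay + [far_gain * s for s in mono]  # delayed and quieter

    if azimuth_deg >= 0:       # source on the right: right ear is the near ear
        return far, near       # (left, right)
    return near, far


if __name__ == "__main__":
    click = [1.0, 0.0, 0.0]
    left, right = render_binaural(click, azimuth_deg=90)
    print("left :", left)
    print("right:", right)
```

In a multi-speaker setting, each translated voice would be rendered at its own estimated azimuth and the resulting left and right channels summed. A real system would more likely rely on measured or learned head-related filters than on this two-cue model.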