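Low-Latency Compressed Data Communication Technology Achieves Sub-Microsecond Latency and Over 700 Gbps: Helping to Remove the Communication Bottleneck in FPGA Clusters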

2026-03-05 National Institute of Informatics

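A research group from the National Institute of Informatics (NII), Hiroshima University, and Fujitsu has developed an ultra-low-latency, high-bandwidth compressed communication technology for FPGA clusters. FPGAs excel at parallel processing, but the communication latency and bandwidth limits between multiple FPGAs had become a bottleneck to further performance gains. The group designed a new compressed-communication circuit that combines two elements: a lightweight compression scheme under which the transmitted data never grows larger than the original, and a circuit structure that aligns data to the width of the communication channel. The design achieves a sub-microsecond latency of about 590 nanoseconds, including both compression and decompression, and a communication bandwidth of up to 757 Gbps per FPGA. When the technique was applied to gradient communication in distributed AI training, it was confirmed to have almost no effect on training accuracy. The technology is expected to remove the communication bottleneck in FPGA clusters and to contribute to higher performance in future optical-interconnect high-performance computing systems and AI accelerators.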

Related information

A 590-Nanosecond 757-Gbps FPGA Lossy Compressed Network

Michihiro Koibuchi; Takumi Honda; Naoto Fukumoto; Shoichi Hirasawa; Koji Nakano
IEEE Transactions on Parallel and Distributed Systems, Published: 02 February 2026
DOI: https://doi.org/10.1109/TPDS.2026.3659817

Abstract

Inter-FPGA communication bandwidth has become a limiting factor in scaling memory-intensive workloads on FPGA-based systems. While modern FPGAs integrate high-bandwidth memory (HBM) to increase local memory throughput, network interfaces often lag behind, creating an imbalance between computation and communication resources. Data compression is a technique to increase effective communication bandwidth by reducing the amount of data transferred, but existing solutions struggle to meet the performance and operation latency requirements of FPGA-based platforms. This paper presents a high-throughput lossy compression framework that enables sub-microsecond latency communication in FPGA clusters. The proposed design addresses the challenge of aligning variable-length compressed data with fixed-width network channels by using transpose circuits, memory-bank reordering, and word-wise operations. A run-length encoding scheme with bounded error is employed to compress floating-point and fixed-point data without relying on complex fine-grained bit-level manipulations, enabling low-latency and scalable implementation. The proposed architecture is implemented on a custom Stratix 10 MX2100 FPGA card equipped with eight 50 Gbps network ports and silicon photonics transceivers. The system achieves up to 757 Gbps of aggregate bandwidth per FPGA in collective communication operations. Compression and decompression are performed within 590 ns total latency, while maintaining the quality of results in a GradAllReduce workload for deep learning.
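
To make the idea of run-length encoding with bounded error concrete, the following is a minimal software sketch in Python. It is not the paper's circuit or wire format: the function names `rle_compress` and `rle_decompress`, the `max_error` parameter, and the choice of the first value in a run as its representative are all assumptions made for illustration. The sketch operates on whole values (word-wise) and merges consecutive values into one run as long as each stays within `max_error` of the run's representative, so every reconstructed value differs from the original by at most `max_error`.

```python
from typing import List, Tuple

Run = Tuple[float, int]  # (representative value, repeat count); hypothetical format


def rle_compress(values: List[float], max_error: float) -> List[Run]:
    """Word-wise run-length encoding with an absolute error bound.

    A value joins the current run if it is within max_error of the run's
    representative (the first value of that run); otherwise a new run starts.
    """
    runs: List[Run] = []
    for v in values:
        if runs and abs(v - runs[-1][0]) <= max_error:
            rep, count = runs[-1]
            runs[-1] = (rep, count + 1)
        else:
            runs.append((v, 1))
    return runs


def rle_decompress(runs: List[Run]) -> List[float]:
    """Expand each (representative, count) pair back into a flat list."""
    out: List[float] = []
    for rep, count in runs:
        out.extend([rep] * count)
    return out


if __name__ == "__main__":
    # Illustrative gradient-like data: many near-zero values compress well.
    gradients = [0.0, 0.001, -0.002, 0.5, 0.501, 0.0]
    runs = rle_compress(gradients, max_error=0.01)
    restored = rle_decompress(runs)
    worst = max(abs(a - b) for a, b in zip(gradients, restored))
    print(runs, restored, worst)
```

In this sketch, six input values collapse into three runs, and the reconstruction error stays below the chosen bound. A hardware design such as the one described in the abstract would additionally need to pack the variable-length output into fixed-width network words, which this example does not attempt.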
