E-Talk: Accelerating Active Speaker Detection with Audio-Visual Fusion and Edge-Cloud Computing
Published in the 20th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), 2023
Recommended citation: Xiaojing Yu, Lan Zhang, and Xiang-Yang Li. "E-Talk: Accelerating Active Speaker Detection with Audio-Visual Fusion and Edge-Cloud Computing." In 20th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON 2023).
Active Speaker Detection (ASD) supports communication and interaction in a range of scenarios, including meetings, group discussions, and security surveillance systems. The primary objective of ASD is to identify and label the position of the main active speaker. In large-scale surveillance systems, real-time ASD can cause network congestion because many cameras must continuously upload video. To address this challenge, we propose E-Talk, a collaborative edge-cloud solution for ASD. E-Talk exploits the fact that voiceprints are far cheaper to compare and process than video sequences, and it uses voiceprint consistency as the criterion for deciding whether the active speaker has changed. We evaluate the accuracy and computational cost of different voiceprint features and recognition models on speaker identification tasks. In addition, E-Talk introduces a potential-speaker tracking scheme for fixed-angle cameras built on foreground extraction algorithms. Finally, E-Talk deploys a high-precision face-based ASD model in the cloud that uses historical information to determine the active speaker in real time. We conducted experiments across a variety of scenarios and settings; the results demonstrate the effectiveness of the E-Talk approach in improving active speaker detection, highlighting its potential for practical deployment in surveillance systems.
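Conceptually, the edge-side gate described above can be pictured with a minimal sketch like the one below. This is not the paper's implementation: the similarity threshold, the `upload_to_cloud` callback, and the use of time-averaged MFCCs as a stand-in for the voiceprint features the paper evaluates are all illustrative assumptions. The idea it shows is the core one: while consecutive audio chunks carry a consistent voiceprint, the edge skips the expensive video upload; only a similarity drop (a possible speaker change) triggers the cloud-side facial ASD model.

```python
# Minimal sketch of an edge-side voiceprint-consistency gate.
# All names (extract_voiceprint, SIM_THRESHOLD, upload_to_cloud)
# are illustrative assumptions, not E-Talk's actual API.
import numpy as np
import librosa

SIM_THRESHOLD = 0.85  # assumed tuning parameter


def extract_voiceprint(audio: np.ndarray, sr: int) -> np.ndarray:
    """Crude voiceprint: time-averaged MFCCs (a stand-in for the
    voiceprint features/models compared in the paper)."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) /
                 (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def edge_gate(audio_chunks, sr, upload_to_cloud):
    """Compare each chunk's voiceprint to the previous one; only a
    similarity drop (possible speaker change) triggers an upload to
    the cloud-side facial ASD model."""
    prev = None
    for chunk in audio_chunks:
        vp = extract_voiceprint(chunk, sr)
        if prev is not None and cosine_similarity(vp, prev) >= SIM_THRESHOLD:
            prev = vp
            continue  # same speaker: skip the expensive video upload
        upload_to_cloud(chunk)  # possible change: re-identify in the cloud
        prev = vp
```

The design point this sketch captures is the bandwidth asymmetry the abstract relies on: a 20-dimensional voiceprint comparison per audio chunk is orders of magnitude cheaper than streaming video frames, so the cloud model only runs when the cheap signal suggests the speaker may have changed.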