Keywords:
coordinate attention
joint training
noisy environment
refinement network
speaker verification
CLC number: TN912.34
Document code: A
Abstract:
The accuracy and reliability of automatic speaker verification (ASV) face significant challenges in noisy environments. In recent years, joint training of a speech enhancement front-end and an ASV back-end has been widely applied to improve the robustness of ASV systems. Traditional joint training feeds the enhanced speech features directly into the back-end. However, the diversity of noise types and intensities can cause the front-end to over-suppress the enhanced features, leaving speech distortion or residual noise. To alleviate this problem, we propose a dual-path spectrogram refinement network that lets the enhanced features learn supplementary information from the separated noise features. In addition, we incorporate coordinate attention into the overall joint architecture to capture more comprehensive frequency and temporal information about the speaker from different spatial positions. We conduct extensive experiments on the VoxCeleb1 test set, an out-of-domain noise test set, and the VOiCES corpus. The experimental results demonstrate that the proposed method significantly improves the accuracy and robustness of speaker verification systems in both clean and noisy environments. © Shanghai Jiao Tong University 2025.