Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry

Kirill Borodin; Vasiliy Kudryavtsev; Grach Mkrtchian; Mikhail Gorodnichev

doi:10.48084/etasr.8906

Authors

Kirill Borodin, Department of Mathematical Cybernetics and Information Technologies, Moscow Technical University of Communications and Informatics, Russia
Vasiliy Kudryavtsev, Department of Mathematical Cybernetics and Information Technologies, Moscow Technical University of Communications and Informatics, Russia
Grach Mkrtchian, Department of Mathematical Cybernetics and Information Technologies, Moscow Technical University of Communications and Informatics, Russia
Mikhail Gorodnichev, Department of Mathematical Cybernetics and Information Technologies, Moscow Technical University of Communications and Informatics, Russia

Volume: 14 | Issue: 6 | Pages: 18409-18414 | December 2024 | https://doi.org/10.48084/etasr.8906

Received: 3 September 2024 | Revised: 3 October 2024 | Accepted: 9 October 2024 | Online: 2 December 2024

Corresponding author: Mikhail Gorodnichev

Abstract

Deep neural networks are developing rapidly, and biometric forgery is advancing alongside them. Systems that can successfully pass face verification are emerging and continuously improving, and deepfake videos and voice messages are being created. These developments can damage a person’s reputation or cause serious security breaches. This paper proposes an approach for spoofing detection in voice biometrics using the ASVspoof2019 LA dataset. The models are trained and validated on subsets representing one type of attack, and evaluated on a subset containing more advanced types of spoofing attacks, demonstrating their ability to generalize to more complex attack scenarios. Two models are proposed, one capsule-based and one based on a temporal convolutional network (TCN), named ResCapsGuard and Res2TCNGuard, respectively. ResCapsGuard achieved an Equal Error Rate (EER) of 2.27, while Res2TCNGuard reached an EER of 1.49. Notebooks with our models are available in GitHub repositories. Because a random segment is cut from each audio recording, the results may vary between runs.
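
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed audio accepted) equals the false rejection rate (bona fide audio rejected). The following is a minimal sketch of how it can be estimated from detector scores. It is not taken from the authors' notebooks: the function name `compute_eer`, the convention that higher scores mean "more likely bona fide", and the toy scores are all assumptions made for illustration. The abstract's values are presumably percentages, whereas this sketch returns a fraction.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the Equal Error Rate from two sets of detector scores.

    Assumption (hypothetical convention): a higher score means the
    detector considers the utterance more likely to be bona fide.
    Returns (eer, threshold), with eer as a fraction in [0, 1].
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide utterances scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed utterances scored at or above it.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their average as the EER estimate.
    idx = int(np.argmin(np.abs(far - frr)))
    return (far[idx] + frr[idx]) / 2.0, thresholds[idx]


# Toy scores (made up for illustration, not results from the paper).
eer, thr = compute_eer([0.9, 0.8, 0.4, 0.7], [0.1, 0.2, 0.5, 0.3])
print(f"EER = {eer:.2%} at threshold {thr}")
```

Because the training and evaluation pipeline crops a random segment from each recording, the scores and therefore the EER can shift slightly from run to run; fixing the random seed or averaging over several runs gives a steadier estimate.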

Keywords:

anti-spoofing, ASVspoof, fake audio, capsules, TCN

References

V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, "Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech." arXiv, Aug. 05, 2021.

N. Evans, T. Kinnunen, and J. Yamagishi, "Spoofing and countermeasures for automatic speaker verification," in Interspeech 2013, Lyon, France, Aug. 2013, pp. 925–929. DOI: https://doi.org/10.21437/Interspeech.2013-288

Z. Wu et al., "ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge," in Interspeech 2015, 2015, pp. 2037–2041. DOI: https://doi.org/10.21437/Interspeech.2015-462

T. Kinnunen et al., "The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection," in Interspeech 2017, Stockholm, Sweden, Aug. 2017, pp. 2–6. DOI: https://doi.org/10.21437/Interspeech.2017-1111

X. Wang et al., "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," arXiv e-prints. Nov. 01, 2019. DOI: https://doi.org/10.1016/j.csl.2020.101114

J. Jung et al., "AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks." arXiv, Oct. 04, 2021.

M. Ravanelli and Y. Bengio, "Speaker Recognition from Raw Waveform with SincNet." arXiv, Aug. 09, 2019. DOI: https://doi.org/10.1109/SLT.2018.8639585

H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, "End-to-end anti-spoofing with RawNet2." arXiv, Dec. 16, 2021. DOI: https://doi.org/10.1109/ICASSP39728.2021.9414234

S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, "Res2Net: A New Multi-Scale Backbone Architecture," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 2, pp. 652–662, Feb. 2021. DOI: https://doi.org/10.1109/TPAMI.2019.2938758

A. A. Alasadi, T. H. Aldhayni, R. R. Deshmukh, A. H. Alahmadi, and A. S. Alshebami, "Efficient Feature Extraction Algorithms to Develop an Arabic Speech Recognition System," Engineering, Technology & Applied Science Research, vol. 10, no. 2, pp. 5547–5553, Apr. 2020. DOI: https://doi.org/10.48084/etasr.3465

J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-Excitation Networks." arXiv, May 16, 2019. DOI: https://doi.org/10.1109/CVPR.2018.00745

H. H. Nguyen, J. Yamagishi, and I. Echizen, "Capsule-Forensics: Using Capsule Networks to Detect Forged Images and Videos." arXiv, Oct. 26, 2018. DOI: https://doi.org/10.1109/ICASSP.2019.8682602

S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic Routing Between Capsules." arXiv, Nov. 07, 2017.

L. Luo, Y. Xiong, Y. Liu, and X. Sun, "Adaptive Gradient Methods with Dynamic Bound of Learning Rate." arXiv, Feb. 26, 2019.

Z. Xinyi and L. Chen, "Capsule Graph Neural Network," in Seventh International Conference on Learning Representations, New Orleans, LA, USA, Dec. 2019, pp. 1–16.

Q. Ma, J. Zhong, Y. Yang, W. Liu, Y. Gao, and W. W. Y. Ng, "ConvNeXt Based Neural Network for Audio Anti-Spoofing." arXiv, Dec. 22, 2022.

P. Wen, K. Hu, W. Yue, S. Zhang, W. Zhou, and Z. Wang, "Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms." arXiv, Aug. 18, 2023. DOI: https://doi.org/10.21437/Interspeech.2023-563

S. Ding, Y. Zhang, and Z. Duan, "SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing." arXiv, Nov. 04, 2022. DOI: https://doi.org/10.1109/ICASSP49357.2023.10094704

Q. Fu, Z. Teng, J. White, M. Powell, and D. C. Schmidt, "FastAudio: A Learnable Audio Front-End for Spoof Speech Detection." arXiv, Sep. 06, 2021. DOI: https://doi.org/10.1109/ICASSP43922.2022.9746722

G. Hua, A. B. J. Teoh, and H. Zhang, "Towards End-to-End Synthetic Speech Detection," IEEE Signal Processing Letters, vol. 28, pp. 1265–1269, 2021. DOI: https://doi.org/10.1109/LSP.2021.3089437

W. Ge, J. Patino, M. Todisco, and N. Evans, "Raw Differentiable Architecture Search for Speech Deepfake and Spoofing Detection." arXiv, Oct. 06, 2021. DOI: https://doi.org/10.21437/ASVSPOOF.2021-4

X. Li, X. Wu, H. Lu, X. Liu, and H. Meng, "Channel-wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks." arXiv, Jul. 19, 2021. DOI: https://doi.org/10.21437/Interspeech.2021-2125

X. Wang and J. Yamagishi, "A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection." arXiv, Jun. 13, 2021. DOI: https://doi.org/10.21437/Interspeech.2021-702

Y. Zhang, F. Jiang, and Z. Duan, "One-Class Learning Towards Synthetic Voice Spoofing Detection," IEEE Signal Processing Letters, vol. 28, pp. 937–941, 2021. DOI: https://doi.org/10.1109/LSP.2021.3076358

X. Li et al., "Replay and Synthetic Speech Detection with Res2net Architecture." arXiv, Feb. 13, 2021. DOI: https://doi.org/10.1109/ICASSP39728.2021.9413828

H. Tak, J. Patino, A. Nautsch, N. Evans, and M. Todisco, "Spoofing Attack Detection using the Non-linear Fusion of Sub-band Classifiers." arXiv, May 20, 2020. DOI: https://doi.org/10.21437/Interspeech.2020-1844

H. Tak, J. Jung, J. Patino, M. Todisco, and N. Evans, "Graph Attention Networks for Anti-Spoofing." arXiv, Apr. 08, 2021. DOI: https://doi.org/10.21437/Interspeech.2021-993

J. Yamagishi et al., "ASVspoof 2019: The 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge database." University of Edinburgh. The Centre for Speech Technology Research (CSTR), Jun. 04, 2019.

W. Ge, M. Panariello, J. Patino, M. Todisco, and N. Evans, "Partially-Connected Differentiable Architecture Search for Deepfake and Spoofing Detection." arXiv, Jun. 30, 2021. DOI: https://doi.org/10.21437/Interspeech.2021-1187

S. Borzi, O. Giudice, F. Stanco, and D. Allegra, "Is synthetic voice detection research going into the right direction?," in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, New Orleans, LA, USA, Jun. 2022, pp. 71–80. DOI: https://doi.org/10.1109/CVPRW56347.2022.00017

S. Bai, J. Z. Kolter, and V. Koltun, "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling." arXiv, Apr. 19, 2018.

"mtuciru/ResCapsGuard." MTUCI, Sep. 10, 2024, [Online]. Available: https://github.com/mtuciru/ResCapsGuard.

"mtuciru/Res2TCNGuard." MTUCI, Sep. 10, 2024, [Online]. Available: https://github.com/mtuciru/Res2TCNGuard.