Capsule-based and TCN-based Approaches for Spoofing Detection in Voice Biometry
Received: 3 September 2024 | Revised: 3 October 2024 | Accepted: 9 October 2024 | Online: 2 December 2024
Corresponding author: Mikhail Gorodnichev
Abstract
Deep neural networks are developing rapidly, and biometric forgery is advancing alongside them. Systems capable of defeating face verification are emerging, and increasingly convincing deepfake videos and voice messages are being created. These developments can damage a person’s reputation or cause serious security breaches. This paper proposes an approach for spoofing detection in voice biometrics using the ASVspoof2019 LA dataset. The models are trained and validated on subsets representing one type of attack and evaluated on a subset containing more advanced types of spoofing attacks, demonstrating their ability to generalize to more complex attack scenarios. Two models are proposed, one capsule-based and one TCN-based, denoted ResCapsGuard and Res2TCNGuard, respectively. ResCapsGuard achieved an Equal Error Rate (EER) of 2.27%, while Res2TCNGuard reached an EER of 1.49%. Notebooks with our models are available in GitHub repositories. Because a random segment is cut from each audio file, the results may vary between runs.
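As an illustration of the two ideas mentioned in the abstract, the following is a minimal sketch of how a random segment might be cut from an utterance and how an EER can be estimated from detector scores. It is not the authors' code: the function names `random_crop` and `compute_eer`, the repeat-padding of short utterances, and the convention that a higher score means "more likely bona fide" are all assumptions made for this example.

```python
import numpy as np


def random_crop(waveform, length, rng):
    """Return a contiguous segment of `length` samples from `waveform`.

    Assumption for this sketch: utterances shorter than `length` are
    padded by repeating the signal. The start offset is drawn at random,
    which is why scores (and hence the EER) can differ between runs.
    """
    waveform = np.asarray(waveform)
    if len(waveform) == 0:
        raise ValueError("waveform must not be empty")
    if len(waveform) <= length:
        reps = int(np.ceil(length / len(waveform)))
        return np.tile(waveform, reps)[:length]
    start = rng.integers(0, len(waveform) - length + 1)
    return waveform[start:start + length]


def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the Equal Error Rate as a fraction in [0, 1].

    Assumption: a higher score means the sample looks more bona fide.
    The EER is the operating point where the false acceptance rate
    (spoof accepted) equals the false rejection rate (bona fide rejected).
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    # Every observed score is a candidate decision threshold.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))
    # A sample is accepted when its score is >= the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    far = np.array([(spoof >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)  # fix the seed for repeatable crops
    segment = random_crop(np.arange(16000), 4000, rng)
    print(len(segment))
    print(compute_eer([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))
```

Fixing the random seed, as in the usage lines at the bottom, is one way to make the cropping repeatable when comparing runs; multiplying the returned fraction by 100 gives the EER in percent.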
Keywords:
anti-spoofing, ASVspoof, fake audio, capsules, TCN
References
V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, "Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech." arXiv, Aug. 05, 2021.
N. Evans, T. Kinnunen, and J. Yamagishi, "Spoofing and countermeasures for automatic speaker verification," in Interspeech 2013, Lyon, France, Aug. 2013, pp. 925–929.
Z. Wu et al., "ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge," in Interspeech 2015, 2015, pp. 2037–2041.
T. Kinnunen et al., "The ASVspoof 2017 Challenge: Assessing the Limits of Replay Spoofing Attack Detection," in Interspeech 2017, Stockholm, Sweden, Aug. 2017, pp. 2–6.
X. Wang et al., "ASVspoof 2019: A large-scale public database of synthesized, converted and replayed speech," arXiv e-prints. Nov. 01, 2019.
J. Jung et al., "AASIST: Audio Anti-Spoofing using Integrated Spectro-Temporal Graph Attention Networks." arXiv, Oct. 04, 2021.
M. Ravanelli and Y. Bengio, "Speaker Recognition from Raw Waveform with SincNet." arXiv, Aug. 09, 2019.
H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, "End-to-end anti-spoofing with RawNet2." arXiv, Dec. 16, 2021.
S.-H. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. Torr, "Res2Net: A New Multi-Scale Backbone Architecture," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 2, pp. 652–662, Feb. 2021.
A. A. Alasadi, T. H. Aldhayni, R. R. Deshmukh, A. H. Alahmadi, and A. S. Alshebami, "Efficient Feature Extraction Algorithms to Develop an Arabic Speech Recognition System," Engineering, Technology & Applied Science Research, vol. 10, no. 2, pp. 5547–5553, Apr. 2020.
J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-Excitation Networks." arXiv, May 16, 2019.
H. H. Nguyen, J. Yamagishi, and I. Echizen, "Capsule-Forensics: Using Capsule Networks to Detect Forged Images and Videos." arXiv, Oct. 26, 2018.
S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic Routing Between Capsules." arXiv, Nov. 07, 2017.
L. Luo, Y. Xiong, Y. Liu, and X. Sun, "Adaptive Gradient Methods with Dynamic Bound of Learning Rate." arXiv, Feb. 26, 2019.
Z. Xinyi and L. Chen, "Capsule Graph Neural Network," in Seventh International Conference on Learning Representations, New Orleans, LA, USA, Dec. 2019, pp. 1–16.
Q. Ma, J. Zhong, Y. Yang, W. Liu, Y. Gao, and W. W. Y. Ng, "ConvNeXt Based Neural Network for Audio Anti-Spoofing." arXiv, Dec. 22, 2022.
P. Wen, K. Hu, W. Yue, S. Zhang, W. Zhou, and Z. Wang, "Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms." arXiv, Aug. 18, 2023.
S. Ding, Y. Zhang, and Z. Duan, "SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing." arXiv, Nov. 04, 2022.
Q. Fu, Z. Teng, J. White, M. Powell, and D. C. Schmidt, "FastAudio: A Learnable Audio Front-End for Spoof Speech Detection." arXiv, Sep. 06, 2021.
G. Hua, A. B. J. Teoh, and H. Zhang, "Towards End-to-End Synthetic Speech Detection," IEEE Signal Processing Letters, vol. 28, pp. 1265–1269, 2021.
W. Ge, J. Patino, M. Todisco, and N. Evans, "Raw Differentiable Architecture Search for Speech Deepfake and Spoofing Detection." arXiv, Oct. 06, 2021.
X. Li, X. Wu, H. Lu, X. Liu, and H. Meng, "Channel-wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks." arXiv, Jul. 19, 2021.
X. Wang and J. Yamagishi, "A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection." arXiv, Jun. 13, 2021.
Y. Zhang, F. Jiang, and Z. Duan, "One-Class Learning Towards Synthetic Voice Spoofing Detection," IEEE Signal Processing Letters, vol. 28, pp. 937–941, 2021.
X. Li et al., "Replay and Synthetic Speech Detection with Res2net Architecture." arXiv, Feb. 13, 2021.
H. Tak, J. Patino, A. Nautsch, N. Evans, and M. Todisco, "Spoofing Attack Detection using the Non-linear Fusion of Sub-band Classifiers." arXiv, May 20, 2020.
H. Tak, J. Jung, J. Patino, M. Todisco, and N. Evans, "Graph Attention Networks for Anti-Spoofing." arXiv, Apr. 08, 2021.
J. Yamagishi et al., "ASVspoof 2019: The 3rd Automatic Speaker Verification Spoofing and Countermeasures Challenge database." University of Edinburgh. The Centre for Speech Technology Research (CSTR), Jun. 04, 2019.
W. Ge, M. Panariello, J. Patino, M. Todisco, and N. Evans, "Partially-Connected Differentiable Architecture Search for Deepfake and Spoofing Detection." arXiv, Jun. 30, 2021.
S. Borzi, O. Giudice, F. Stanco, and D. Allegra, "Is synthetic voice detection research going into the right direction?," in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, New Orleans, LA, USA, Jun. 2022, pp. 71–80.
S. Bai, J. Z. Kolter, and V. Koltun, "An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling." arXiv, Apr. 19, 2018.
"mtuciru/ResCapsGuard." MTUCI, Sep. 10, 2024, [Online]. Available: https://github.com/mtuciru/ResCapsGuard.
"mtuciru/Res2TCNGuard." MTUCI, Sep. 10, 2024, [Online]. Available: https://github.com/mtuciru/Res2TCNGuard.
License
Copyright (c) 2024 Kirill Borodin, Vasiliy Kudryavtsev, Grach Mkrtchian, Mikhail Gorodnichev
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.