An Oriented Semantic Reasoning Framework for End-to-End Speech Topic Classification

Shanthala Tarikere Nagaraja; Kiran Y. Chandrappa

doi:10.48084/etasr.18964

Authors

Shanthala Tarikere Nagaraja, Department of Information Science and Engineering, Global Academy of Technology, Visvesvaraya Technological University, Belagavi, Karnataka, India
Kiran Y. Chandrappa, Department of Information Science and Engineering, Global Academy of Technology, Visvesvaraya Technological University, Belagavi, Karnataka, India

Volume: 16 | Issue: 4 | Pages: 37438-37443 | August 2026 | https://doi.org/10.48084/etasr.18964

Received: 27 March 2026 | Revised: 30 April 2026 and 18 May 2026 | Accepted: 19 May 2026 | Online: 10 June 2026

Corresponding author: Shanthala Tarikere Nagaraja

Abstract

Speech topic classification aims to identify the dominant thematic category of spoken content and plays a key role in applications such as speech analytics, content indexing, and information retrieval. Despite recent progress in speech representation learning, accurately inferring topics from raw speech remains challenging due to semantic variability, long-duration dependencies, and the absence of explicit alignment between speech and topic-level semantics. Existing approaches often rely on cascading automatic speech recognition with text-based models or focus on local acoustic representations, which limits their effectiveness in end-to-end settings. This study presents a Topic-Oriented Semantic Reasoning Framework (TOSR-Framework) for end-to-end speech topic classification. The proposed framework integrates topic-oriented speech encoding, semantic alignment between speech and language representations, and global topic reasoning within a unified architecture. By emphasizing topic-relevant semantic information and enabling structured aggregation of distributed cues over time, the framework improves robustness under conversational and long-form speech conditions. Experimental evaluations on the Fisher, Switchboard, and TED Speech Topic datasets demonstrate that the proposed approach consistently outperforms existing methods, confirming its effectiveness for speech topic classification in diverse scenarios.

Keywords:

speech topic classification, end-to-end speech understanding, semantic reasoning, speech-language representation alignment, long-form conversational speech analysis
