Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition

Authors

  • Yalamanchili Bhanusree, Department of Computer Science Engineering, Vallurupalli Nageswara Rao Vignana Jyothi Institute of Engineering and Technology, India
  • Samayamantula Srinivas Kumar, Department of Electronics and Communications Engineering, Jawaharlal Nehru Technological University Kakinada, India
  • Anne Koteswara Rao, Department of Computer Science Engineering, Kalasalingam Academy of Research and Education, India

DOI:

https://doi.org/10.32890/jict2023.22.1.3

Keywords:

Ensemble classifiers, Random Forest, Speech Emotion Recognition, Human Computer Interaction, time-distributed layers, spatiotemporal features

Abstract

Speech Emotion Recognition (SER) is the task of identifying human emotions from speech utterances, which combine linguistic and non-linguistic information. Non-linguistic SER provides a generalized solution for human–computer interaction applications because it overcomes the language barrier. Previously proposed machine learning and deep learning techniques classified emotions using handpicked features. To achieve effective and generalized SER, feature extraction can be performed with deep neural networks and classification with ensemble learning. The proposed model employs a time-distributed attention-layered convolution neural network (TDACNN) to extract spatiotemporal features in the first stage and a random forest (RF) classifier, an ensemble classifier, for efficient and generalized classification of emotions in the second stage. The proposed model was implemented on the RAVDESS and IEMOCAP data corpora and compared with the CNN-SVM and CNN-RF models for SER. The TDACNN-RF model achieved test classification accuracies of 92.19 percent and 90.27 percent on the RAVDESS and IEMOCAP data corpora, respectively. The experimental results showed that the proposed model extracts spatiotemporal features efficiently from time-series speech signals and classifies emotions accurately. Class confusion among the emotions was reduced for both data corpora, indicating that the model generalizes.
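
The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the untrained random kernels, the attention vector `w`, the toy sine-wave "utterances", and the helper names `segment_features` and `attention_pool` are all assumptions made for illustration, standing in for a trained TDACNN and for real RAVDESS or IEMOCAP features. It only shows the data flow: the same convolution is applied to every time segment, attention pools the segments into one vector per utterance, and a random forest classifies that vector.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def segment_features(utterance, kernels):
    """Stage 1a: apply the SAME kernels to every time segment (time-distributed).

    utterance: (n_segments, frame_len), kernels: (n_kernels, k)
    returns:   (n_segments, n_kernels)
    """
    return np.array([
        # valid 1-D convolution, ReLU, then global max pooling per kernel
        [np.maximum(np.convolve(seg, k, mode="valid"), 0.0).max() for k in kernels]
        for seg in utterance
    ])

def attention_pool(h, w):
    """Stage 1b: softmax attention over segments, one vector per utterance."""
    scores = h @ w
    alpha = np.exp(scores - scores.max())  # subtract max for numerical stability
    alpha /= alpha.sum()
    return alpha @ h

# Toy stand-in data (assumption): two "emotions" as sine waves of different frequency.
n_utt, n_seg, frame_len, n_kernels = 40, 6, 64, 8
y = np.arange(n_utt) % 2
t = np.arange(frame_len)
X_raw = np.stack([
    np.sin(2 * np.pi * (0.05 + 0.10 * label) * t)[None, :]
    + 0.1 * rng.standard_normal((n_seg, frame_len))
    for label in y
])

# Untrained random parameters (assumption); a real model would learn these.
kernels = rng.standard_normal((n_kernels, 5))
w = rng.standard_normal(n_kernels)

# Stage 1: one fixed-length feature vector per utterance.
features = np.stack([attention_pool(segment_features(u, kernels), w) for u in X_raw])

# Stage 2: random forest on the extracted features.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features[:30], y[:30])
pred = clf.predict(features[30:])
print("held-out accuracy on toy data:", (pred == y[30:]).mean())
```

Separating the stages this way means the classifier can be swapped (for example, an SVM in place of the random forest, as in the CNN-SVM baseline mentioned above) without touching the feature extractor.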

Published

18-01-2023

How to Cite

Bhanusree, Y., Kumar, S. S., & Rao, A. K. (2023). Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition. Journal of Information and Communication Technology, 22(1), 49–76. https://doi.org/10.32890/jict2023.22.1.3