Time-Distributed Attention-Layered Convolution Neural Network with Ensemble Learning using Random Forest Classifier for Speech Emotion Recognition
Keywords: Ensemble classifiers, Random Forest, Speech Emotion Recognition, Human Computer Interaction, time-distributed layers, spatiotemporal features
Speech emotion recognition (SER) is the task of identifying human emotions from speech utterances, which combine linguistic and non-linguistic information. Non-linguistic SER offers a generalized solution for human–computer interaction applications because it overcomes the language barrier. Earlier machine learning and deep learning approaches classified emotions using handpicked features. To achieve effective and generalized SER, features can instead be extracted with deep neural networks and classified with ensemble learning. The proposed model uses a time-distributed attention-layered convolution neural network (TDACNN) to extract spatiotemporal features in the first stage and a random forest (RF) classifier, an ensemble classifier, for efficient and generalized classification of emotions in the second stage. The model was implemented on the RAVDESS and IEMOCAP data corpora and compared with the CNN-SVM and CNN-RF models for SER. The TDACNN-RF model achieved test classification accuracies of 92.19 percent and 90.27 percent on the RAVDESS and IEMOCAP data corpora, respectively. The experimental results indicate that the proposed model extracts spatiotemporal features from time-series speech signals efficiently and classifies emotions with good accuracy. Class confusion among the emotions was reduced on both data corpora, suggesting that the model generalizes well.
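To make the first-stage idea concrete, the following is a minimal sketch of time-distributed feature extraction followed by attention pooling, written in plain Python. It is not the authors' TDACNN: the per-segment extractor `segment_features` (energy and zero-crossing rate) is a hypothetical stand-in for the shared convolutional layers, and `score_weights`, `segment_len`, and all function names are assumptions made for illustration. In the two-stage design described above, the pooled vector would then be passed to a random forest classifier.

```python
import math


def split_into_segments(signal, segment_len):
    """Cut a 1-D signal into consecutive, non-overlapping segments (time steps)."""
    return [signal[i:i + segment_len]
            for i in range(0, len(signal) - segment_len + 1, segment_len)]


def segment_features(segment):
    """Hypothetical stand-in for the shared CNN: mean energy and zero-crossing rate."""
    energy = sum(x * x for x in segment) / len(segment)
    crossings = sum(1 for a, b in zip(segment, segment[1:]) if a * b < 0)
    return [energy, crossings / (len(segment) - 1)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, score_weights):
    """Weight each time step by softmax(score) and sum into one utterance vector."""
    # Assumed scoring rule: a linear score per time step (dot product with score_weights).
    scores = [sum(w * f for w, f in zip(score_weights, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, features))
              for d in range(dim)]
    return pooled, weights


def utterance_embedding(signal, segment_len, score_weights):
    """Apply the same extractor to every segment, then attention-pool over time."""
    segments = split_into_segments(signal, segment_len)
    features = [segment_features(s) for s in segments]  # time-distributed step
    return attention_pool(features, score_weights)
```

Calling `utterance_embedding(signal, 100, [1.0, 0.0])` on a 400-sample signal yields a two-element vector and four attention weights that sum to one; with these illustrative score weights, segments with higher energy receive more weight.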
Copyright (c) 2023 Journal of Information and Communication Technology
This work is licensed under a Creative Commons Attribution 4.0 International License.