IMPROVED SPEAKER-INDEPENDENT EMOTION RECOGNITION FROM SPEECH USING TWO-STAGE FEATURE REDUCTION

In recent years, researchers have focused on improving the accuracy of speech emotion recognition. High recognition accuracies have generally been obtained for two-class emotion recognition, but multi-class emotion recognition remains a challenging task. The main aim of this work is to propose a two-stage feature reduction using Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) to improve the accuracy of the speech emotion recognition (ER) system. Short-term speech features were extracted from the emotional speech signals. Experiments were carried out using four different supervised classifiers with two different emotional speech databases. The experimental results indicate that the proposed method provides accuracies of 87.48% for the speaker dependent (SD) and gender dependent (GD) ER experiment, 85.15% for the speaker independent (SI) ER experiment, and 87.09% for the gender independent (GI) experiment.


INTRODUCTION
Various modalities are used to recognize human emotion, such as facial images and videos, speech, and physiological signals. In recent years, researchers have published several works on emotion recognition from spoken utterances (El Ayadi, Kamel, & Karray, 2011; Koolagudi & Rao, 2012). The spoken utterances of an individual can provide information about his/her health state, emotion, language and gender. Speech is one of the most natural forms of communication between individuals. Understanding an individual's emotion can be useful for applications such as web movies, electronic tutoring applications, in-car board systems, diagnostic tools for therapists and call-centre applications (El Ayadi et al., 2011; Koolagudi & Rao, 2012). Researchers have proposed several parameterization methods in the field of emotion recognition from speech; however, it is not clear which speech features are best at distinguishing between emotions. Researchers have used primary emotions such as happiness, sadness, anger, fear, surprise and disgust. The recognition accuracy between high-activation emotions and low-activation emotions is generally high, but recognition between different emotions is still challenging (Wang & Guan, 2004).
Emotional speech databases fall into three categories: simulated, elicited and natural. Simulated emotions tend to be more expressive than real ones and are the most commonly used (El Ayadi et al., 2011). In the elicited category, emotions are nearer to those of a natural database, but if the speakers know that they are being recorded, the quality will be artificial. In the natural category, not all emotions may be available, and the emotions are difficult to model because they are expressed completely naturally. Table 1 presents a summary of the significant research works on ER from speech signals.

Database
In this paper, two emotional speech databases were used; they differ in terms of sampling frequency, number of emotions, number of subjects, and number of words and utterances. The Berlin emotional speech database (EmoDB) (Burkhardt, Paeschke, Rolfes, Sendlmeier, & Weiss, 2005) was recorded at the Technical University, Berlin. Seven emotions, namely neutral (N), anger (A), fear (F), happiness (H), sadness (Sa), disgust (D) and boredom (BD), were simulated by 10 actors (5 male and 5 female). EmoDB contains 535 utterances in total, recorded from 10 German sentences. The Sahand emotional speech database (SESD) (Sedaaghi, 2008) comprises utterances expressed by five male and five female students in five emotional states (A, N, H, Sa, and surprise, Su). SESD consists of 1200 utterances, which include twenty-four words, short sentences and paragraphs spoken in Farsi. The sampling frequency was 16 kHz for EmoDB and 8 kHz for SESD. In our analysis, we set 8 kHz as the sampling frequency, and hence the speech samples of EmoDB were down-sampled to 8 kHz. The number of utterances for each category of emotion available in EmoDB and SESD is shown in Table 2. The proposed system for ER from speech signals is shown in Fig 1.

Feature Extraction
In this paper, short-term cepstral features were extracted from the emotional speech signals of two different emotional speech databases. The following sections describe the derivation of the short-term cepstral features.

Extraction of LPC based cepstral parameters
Linear prediction is used to extract the relevant information that represents the signal. All the speech samples were pre-emphasized using a first-order digital filter, as shown in Equation 1. In Equation 1, the value ã = 15/16 = 0.9375 was used with the sampling rate of 8 kHz for fixed-point implementations (Rabiner & Juang, 1993).
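The pre-emphasis step can be sketched as follows. This is a minimal illustration that assumes the common first-order form y[n] = x[n] - ã x[n-1]; Equation 1 is not reproduced in this text, so that form and the function name `pre_emphasize` are assumptions, while ã = 15/16 is taken from the paragraph above.

```python
def pre_emphasize(signal, coeff=15 / 16):
    """Apply an assumed first-order pre-emphasis filter: y[n] = x[n] - coeff * x[n-1].

    The first sample is passed through unchanged (x[-1] is treated as 0).
    """
    if not signal:
        return []
    out = [float(signal[0])]
    for n in range(1, len(signal)):
        out.append(signal[n] - coeff * signal[n - 1])
    return out


# A constant (low-frequency) input is strongly attenuated after the first sample.
print(pre_emphasize([1.0, 1.0, 1.0, 1.0]))  # [1.0, 0.0625, 0.0625, 0.0625]
```

A coefficient close to 1 suppresses slowly varying components and relatively boosts the high-frequency part of the spectrum, which is the usual motivation for this step.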
An overview of the linear prediction process is shown in Fig 2. The pre-emphasized speech signals were segmented into frames of 20 ms with 50% overlap. The speech sample at time t, s(t), can be estimated as a linear combination of the past p speech samples, as given in Equation 2, where p represents the order of the LPC.

Figure 2. Linear prediction process
The difference between the actual and the estimated sample value is known as the prediction error, e(t), and is defined in Equation 3, where s_a are the LPCs. The LPCs are calculated by minimizing the mean squared error over a frame of the speech signal. The autocorrelation method is therefore applied to each frame of the windowed signal, as shown in Equations 4 and 5. Because the autocorrelation function in Equation 4 is symmetric, the LPC equations can be written in the form of Equation 5. In this paper, the order p was set to 14. LPCCs are LPCs represented in the cepstrum domain, that is, the coefficients of the Fourier transform representation of the log magnitude spectrum. The number of LPCCs used to represent each frame, Q, is calculated as Q = (3/2)p with Q > p, which gives Q = 21 for p = 14. The LPCCs were obtained directly from the LPCs using a recursion technique.
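The chain from a windowed frame to LPCs and then LPCCs can be sketched as below. This is an illustrative sketch rather than the authors' implementation: it assumes the textbook autocorrelation method solved with the Levinson-Durbin recursion and the standard LPC-to-cepstrum recursion (gain term omitted), because Equations 2 to 5 and the recursion itself are not reproduced in this text. The function names and the synthetic frame are hypothetical; p = 14 and Q = 21 come from the text.

```python
import random


def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one windowed frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns (coefficients a[1..p], residual error)."""
    a = [0.0] * (order + 1)      # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:           # silent or degenerate frame: stop early
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


def lpc_to_lpcc(a, n_ceps):
    """Convert LPCs to n_ceps cepstral coefficients by recursion (gain term omitted)."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)     # c[0] is unused
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= p else 0.0
        for k in range(max(1, m - p), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c[1:]


# Synthetic 160-sample frame (placeholder for a real windowed speech frame).
rng = random.Random(0)
frame, prev_sample = [], 0.0
for _ in range(160):
    prev_sample = 0.9 * prev_sample + rng.gauss(0.0, 1.0)
    frame.append(prev_sample)

P, Q = 14, 21                    # LPC order and LPCC count from the text
lpc, residual = levinson_durbin(autocorrelation(frame, P), P)
lpcc = lpc_to_lpcc(lpc, Q)
print(len(lpc), len(lpcc))       # 14 21
```

The sign convention here is that the predictor is a weighted sum of past samples with positive weights a[k]; a different convention would flip the signs in the cepstral recursion.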

Mel frequency cepstral coefficients
Mel frequency cepstral coefficients (MFCCs) are the most commonly used features for speech and speaker recognition. MFCCs take the frequency sensitivity of human perception into consideration and are therefore well suited to speech recognition (Jang, 2011). The block diagram of the MFCC extraction is shown in Fig 3. The speech signals were pre-emphasized with a first-order digital filter and segmented into short overlapping frames, as in the LPC feature extraction (Chee, Ai, Hariharan, & Yaacob, 2009). The frame size was fixed at 160 samples (20 ms at 8 kHz) with 50% overlap. Each frame was multiplied by a Hamming window to minimize spectral distortion and signal discontinuities (Chee, et al., 2009).
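The framing and windowing described above can be sketched as follows. The frame length of 160 samples and the 50% overlap are taken from the text; the helper names and the choice to drop an incomplete trailing frame are assumptions made for this illustration.

```python
import math

FRAME_LEN = 160          # samples per frame (20 ms at 8 kHz, from the text)
HOP = FRAME_LEN // 2     # 50% overlap


def hamming(n_points):
    """Return a symmetric Hamming window of n_points samples."""
    if n_points == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def frame_signal(signal, frame_len=FRAME_LEN, hop=HOP):
    """Split a signal into overlapping frames and apply a Hamming window.

    Trailing samples that do not fill a whole frame are dropped.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


one_second = [1.0] * 8000            # placeholder signal: 1 s at 8 kHz
frames = frame_signal(one_second)
print(len(frames), len(frames[0]))   # 99 160
```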
The Fast Fourier Transform (FFT) was applied to convert each frame from the time domain into the frequency domain. The spectrum of each frame was filtered by a bank of filters after the FFT block, and the power of each band was then calculated (Chee, et al., 2009). To simulate the subjective spectrum, a filter bank spaced uniformly on the Mel scale was used. The Mel scale is a logarithmic scale of frequency based on human pitch perception. Equation 11 shows the mapping from linear frequency to Mel frequency (Chee, et al., 2009).
Lastly, the log Mel spectrum was converted back to the time domain using the Discrete Cosine Transform (DCT); the output is called the Mel Frequency Cepstral Coefficients.
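As an illustration of the Mel mapping and the uniformly spaced filter bank, the sketch below assumes the widely used form mel = 2595 log10(1 + f/700). Equation 11 is not reproduced in this text, so this formula, the function names, and the choice of 40 filters up to 4 kHz are assumptions rather than the paper's confirmed settings.

```python
import math


def hz_to_mel(f_hz):
    """Map linear frequency (Hz) to the Mel scale (assumed 2595*log10 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low, f_high):
    """Centre frequencies (Hz) of n_filters spaced uniformly on the Mel scale."""
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (n_filters + 1)
    return [mel_to_hz(mel_low + step * (i + 1)) for i in range(n_filters)]


# Hypothetical setup: 40 filters up to the Nyquist frequency of 8 kHz audio.
centers = mel_center_frequencies(40, 0.0, 4000.0)
print(round(centers[0], 1), round(centers[-1], 1))
```

Uniform spacing on the Mel axis gives narrow, closely spaced filters at low frequencies and wider spacing at high frequencies, which mirrors the perceptual motivation given in the text.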
The emotional speech signals were subjected to feature extraction, and a feature database was formed using 40 MFCCs, 14 LPCs, 21 LPCCs and 14 WLPCCs, giving 89 short-term cepstral features in total. The next section discusses the reduction of these 89 features to fewer dimensions using the two-stage feature reduction with PCA and LDA.

Two stage feature reduction
Feature selection/reduction is an important step in all pattern recognition problems, since a large feature space (the curse of dimensionality) may reduce the classification performance. The dimensionality of a feature set can be reduced by using statistical methods that minimize the loss of relevant information (Haq & Jackson, 2009; Haq, Jackson, & Edge, 2008). PCA (Shlens, 2005) maps each feature vector a_i to a new vector b_i through the linear transformation of Equation 12, where M_pca is the linear transformation matrix and i = 1, ..., N indexes the sample features.
The columns of M_pca are the l eigenvectors associated with the l largest eigenvalues of the scatter matrix U_T, defined in Equation 13, where μ ∈ R^h is the mean feature vector of all samples (Deng, Jin, Zhen, & Huang, 2005).
Next, LDA is a commonly used technique for dimensionality reduction (Duda, Hart, & Stork, 2012; Yusuf, Mahat, Siraj, & Yaacob, 2012). LDA maximizes the ratio of between-class variance to within-class variance to optimize the separability between classes. The within-class scatter matrix U_m and the between-class scatter matrix U_e are defined in Equations 14 and 15, where a_i^j is the i-th sample of class j, μ_j is the mean of class j, μ is the mean feature vector of all classes, p is the number of classes, and N_j is the number of samples of class j. The LDA projection matrix M_lda is selected to maximize the ratio det|U_e|/det|U_m|.
PCA was applied to the feature database, and the number of principal components was selected so as to retain 99% of the total variability, which reduced the number of features from the original 89. Next, LDA was applied to the reduced feature database obtained from PCA to reduce the dimensionality of the features further. PCA combined with LDA was applied as a feature reduction method in order to seek a projection that best represents the original data and best separates the data in a least-squares sense. PCA maps the original h-dimensional feature a_i to the l-dimensional feature b_i as an intermediate space. Then, LDA projects the PCA output into a new g-dimensional feature vector c_i, as given in Equation 16 (Deng, et al., 2005). In this study, three, five and seven classes of emotions were recognized, so the features were reduced to 2, 4 and 6 dimensions respectively (i.e., the number of classes minus 1).
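The two-stage reduction can be sketched with NumPy as below. This is a minimal illustration under stated assumptions, not the authors' code: the scatter matrices follow common textbook definitions (the between-class scatter here is weighted by class size, which may differ from the paper's Equations 14 and 15), the LDA directions are taken as the leading eigenvectors of pinv(within) times between, and the synthetic data stand in for the 89-dimensional feature database. The 99% variance threshold and the (number of classes minus 1) output dimension come from the text.

```python
import numpy as np


def fit_pca(X, variance_kept=0.99):
    """Return (mean, M_pca) keeping the fewest components that reach variance_kept."""
    mean = X.mean(axis=0)
    centered = X - mean
    scatter = centered.T @ centered                 # total scatter matrix
    eigvals, eigvecs = np.linalg.eigh(scatter)      # eigenvalues in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    eigvals = np.clip(eigvals, 0.0, None)
    share = np.cumsum(eigvals) / eigvals.sum()
    n_keep = min(int(np.searchsorted(share, variance_kept)) + 1, X.shape[1])
    return mean, eigvecs[:, :n_keep]


def fit_lda(B, y):
    """Return M_lda with (number of classes - 1) discriminant directions."""
    classes = np.unique(y)
    overall_mean = B.mean(axis=0)
    dim = B.shape[1]
    within = np.zeros((dim, dim))
    between = np.zeros((dim, dim))
    for label in classes:
        Bc = B[y == label]
        class_mean = Bc.mean(axis=0)
        within += (Bc - class_mean).T @ (Bc - class_mean)
        diff = (class_mean - overall_mean)[:, None]
        between += len(Bc) * (diff @ diff.T)        # assumed sample-weighted variant
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(within) @ between)
    top = np.argsort(eigvals.real)[::-1][:len(classes) - 1]
    return eigvecs[:, top].real


# Synthetic stand-in for the feature database: 3 classes, 10 features.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 60)
X = rng.normal(size=(180, 10))
X[np.arange(180), y] += 6.0                         # shift one feature per class

mean, M_pca = fit_pca(X)
B = (X - mean) @ M_pca                              # stage 1: PCA
M_lda = fit_lda(B, y)
C = B @ M_lda                                       # stage 2: LDA
print(X.shape, "->", B.shape, "->", C.shape)
```

In practice the projection matrices would be fitted on the training set only and then applied unchanged to the test set, so that no test information leaks into the reduction.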

Classifiers
ER from speech signals is a typical pattern recognition application. The original and dimensionality reduced features were used to recognize the emotions. In this study, the seven emotions of EmoDB and the five emotions of SESD were considered. Recognition was performed for three (3E: N, H, Sa for both EmoDB and SESD), five (5E: N, H, Sa, A, BD for EmoDB; N, A, H, Sa, Su for SESD) and seven (7E: N, A, F, H, Sa, D, BD for EmoDB) classes of emotions. The effect of gender and speaker dependency on the recognition of emotions was also investigated using four different classifiers. The classification process was repeated 10 times and the average emotion recognition accuracy was reported in all the experiments. The following sections give the basics of the classifiers used.

K-nearest neighbor
k-NN is an elementary classification model that applies lazy learning. The k-NN prediction for a query instance is determined by majority voting among the categories of its nearest neighbors. To locate the k nearest neighbors in the training data set, the distance from the test speech signal to each of the training speech signals in the training set was calculated (Chia Ai, Hariharan, Yaacob, & Sin Chee, 2012; Hariharan, Chee, Ai, & Yaacob, 2012; Yusuf, Mahat, Siraj, & Yaacob, 2012). The class label of the test speech signal was determined by majority voting among the k nearest training speech samples. Hence, the value of k plays an important role in k-NN classification (Chia Ai, et al., 2012; Hariharan, et al., 2012; Liu, Lee, & Lin, 2010). Therefore, in this study, the best k value was searched between 1 and 10.
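The majority-vote rule and the search for k between 1 and 10 can be sketched as follows. The Euclidean distance, the use of a separate validation set to pick k, and the toy two-dimensional data are assumptions for illustration; the text does not state how the best k was selected.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Label a query by majority vote among its k nearest training samples."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def best_k(train_x, train_y, val_x, val_y, k_values=range(1, 11)):
    """Return the k in k_values with the highest accuracy on a validation set."""
    def accuracy(k):
        hits = sum(knn_predict(train_x, train_y, x, k) == y
                   for x, y in zip(val_x, val_y))
        return hits / len(val_x)
    return max(k_values, key=accuracy)


# Toy 2-D features with two emotion labels (placeholder for real feature vectors).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["N", "N", "N", "H", "H", "H"]

print(knn_predict(train_x, train_y, (0.5, 0.5), k=3))   # N
print(best_k(train_x, train_y, [(0.2, 0.1), (5.5, 5.5)], ["N", "H"]))
```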
FKNN is a classification technique that combines the simplicity and practicability of classical k-NN with fuzzy logic concepts. The FKNN algorithm assigns class memberships to a sample vector rather than assigning the vector to a particular class. The basis of the algorithm is to assign membership as a function of the pattern's distance from its k nearest neighbors and of those neighbors' memberships in the possible classes. It is similar to crisp k-NN in the sense that it must also search the labelled sample set for the k nearest neighbors. FKNN keeps the main idea of k-NN, in which the class decision is made from the class information of the nearest neighbors. The advantage of using fuzzy set theory is that no arbitrary assignments are made (Keller, Gray, & Givens, 1985; Kim & Han, 1995): each sample is assigned a membership value in each class rather than a binary decision of 'belongs to' or 'does not belong to'. These membership values act as a strength or confidence with which the sample belongs to a particular class (Bondugula, Duzlevski, & Xu, 2005).
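A minimal sketch of fuzzy k-NN membership assignment is given below. It assumes the inverse-distance weighting commonly attributed to Keller et al. (1985), crisp (0 or 1) memberships for the training samples, and a fuzzifier m = 2; these choices, the function name and the toy data are assumptions, since the text does not give the exact formula or parameter values used.

```python
import math


def fknn_memberships(train_x, train_y, query, k=3, m=2.0, eps=1e-12):
    """Return {class: membership} for a query, using crisp training labels.

    Each of the k nearest neighbors contributes a weight 1 / d**(2/(m-1)),
    and the weights are normalised so that the memberships sum to 1.
    """
    nearest = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )[:k]
    weights = [(1.0 / (d ** (2.0 / (m - 1.0)) + eps), label) for d, label in nearest]
    total = sum(w for w, _ in weights)
    memberships = {label: 0.0 for label in set(train_y)}
    for w, label in weights:
        memberships[label] += w / total
    return memberships


# Toy 2-D features with two emotion labels (placeholder for real feature vectors).
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["N", "N", "N", "H", "H", "H"]

u = fknn_memberships(train_x, train_y, (4.0, 4.0), k=4)
print({label: round(value, 3) for label, value in sorted(u.items())})
```

The final class can be taken as the one with the largest membership, while the membership values themselves indicate how confident that decision is.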

Multiclass support vector machine
SVM is one of the most popular supervised classifiers for binary classification problems and is insensitive to the high dimensionality of the feature space. SVM can also be used for multi-class classification problems by combining multiple binary SVM classifiers with either the one-against-all or the one-against-one approach. In binary classification, the class labels take only two values (1 and -1). The multiclass idea used here is the one-against-all approach, which constructs E two-class rules, where the m-th function separates the training vectors of class m from the other vectors. Hence there are E decision functions, but all are obtained by solving one problem. Under this formulation, x is assigned to the class that has the largest value of the decision function.
In this study, we used the one-against-all MCSVM from the Kernel Methods MATLAB Toolbox (Canu, Grandvalet, Guigue, & Rakotomamonjy, 2005) to classify the features. The hyperparameter C was fixed at C = 1000.

Extreme learning machine
ELM has two main advantages: it requires less training time than conventional neural network based classifiers, and only one parameter, L (the number of hidden layer nodes), needs to be tuned to obtain better accuracy. ELM has a higher generalization capability and is suitable for many nonlinear activation functions and kernel functions. ELM was developed for generalized single hidden layer feedforward networks (SLFNs) with a wide variety of hidden nodes. ELM randomly selects all the hidden node parameters, after which the network can be represented as a linear system and the output weights can be computed analytically (G.-B. Huang, Zhou, Ding, & Zhang, 2012; G.-B. Huang, Zhu, & Siew, 2006; J. Huang, et al., 2012). In ELM, the input data are mapped from the input space to the L-dimensional hidden layer feature space. The output of ELM is given by Equation 20, where β is the output weight vector from the hidden nodes to the output node and h(x) is the row vector representing the output of the L hidden nodes with respect to the input x. In other words, h(x) maps the data from the d-dimensional input space to the L-dimensional hidden layer feature space H. In our study, we used the ELM code developed by G.-B. Huang and colleagues (G.-B. Huang, et al., 2012; G.-B. Huang, et al., 2006; J. Huang, et al., 2012). The number of hidden neurons was set to 20 after several experiments.
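The ELM training procedure described above can be sketched as follows. This is an illustrative NumPy version, not the code by Huang et al. that the study used: it assumes a sigmoid activation, random normal hidden parameters, one-hot targets and a pseudo-inverse solution for the output weights. The class name, the seed and the toy data are hypothetical; the 20 hidden nodes follow the text.

```python
import numpy as np


class ELM:
    """Minimal extreme learning machine classifier (illustrative sketch)."""

    def __init__(self, n_hidden=20, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # h(x): sigmoid output of the randomly parameterised hidden layer
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        targets = (y[:, None] == self.classes_[None, :]).astype(float)  # one-hot
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))      # random, never trained
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ targets    # output weights, solved analytically
        return self

    def predict(self, X):
        return self.classes_[np.argmax(self._hidden(X) @ self.beta, axis=1)]


# Toy data: two well separated 2-D blobs labelled N and H.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array(["N"] * 100 + ["H"] * 100)

model = ELM(n_hidden=20).fit(X, y)
train_accuracy = float((model.predict(X) == y).mean())
print(round(train_accuracy, 3))
```

Because only the output weights are fitted, training reduces to one least-squares solve, which is the source of the short training time mentioned in the text.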

Experimental Results
One-way analysis of variance (ANOVA) was performed using the Statistical Package for the Social Sciences (SPSS) to validate the discriminating ability of the features between the groups. Table 3 shows the ANOVA results; the F-ratios of the dimensionality reduced (DimRed) features were greater than those of the original features.
From Table 3, all the p-values used for testing the hypothesis were reported as 0.000. Since a p-value of 0.000 is less than the significance level of 0.05, we reject the null hypothesis H0. The statistical analysis provides sufficient evidence to conclude that the mean feature values of 3E, 5E and 7E were different for both databases. Different ER experiments were performed: speaker dependent (SD), speaker independent (SI), gender dependent (GD) and gender independent (GI), each for the recognition of 3E, 5E and 7E. The training and testing sets for the SI ER experiment were prepared as shown in Table 4.
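For reference, the F-ratio reported by a one-way ANOVA can be computed as in the sketch below. The study used SPSS; this standalone function and the toy groups are only an illustration of the statistic, with each group holding the values of one feature for one emotion class.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F-ratio for a list of groups of numbers."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (mean - grand_mean) ** 2
                     for g, mean in zip(groups, means))
    ss_within = sum((x - mean) ** 2
                    for g, mean in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Toy example: one feature measured for three emotion classes.
f_ratio = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_ratio, 6))   # 13.0
```

A larger F-ratio means the class means are further apart relative to the spread within each class, which is why higher F-ratios for the DimRed features indicate better separability.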

Gender and Speaker Dependent
Training and testing sets were prepared using the original and DimRed features extracted from all the utterances (535 utterances in EmoDB and 1200 utterances in SESD). The conventional hold-out validation method was used in the GD and SD ER experiments: 80% of the utterances were used as the training set and the remaining 20% as the testing set. The experiments were repeated 10 times and the average of the 10 repetitions was reported as the ER accuracy.
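The repeated 80/20 hold-out protocol can be sketched as follows. The random, unstratified shuffling, the seed and the `evaluate` callback are assumptions for this illustration; the text does not say how the utterances were assigned to the two sets in each repetition.

```python
import random


def repeated_holdout(n_samples, evaluate, train_fraction=0.8, repeats=10, seed=0):
    """Average the score returned by evaluate(train_idx, test_idx) over random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        cut = int(train_fraction * n_samples)
        scores.append(evaluate(indices[:cut], indices[cut:]))
    return sum(scores) / len(scores)


# Hypothetical callback: a real one would train a classifier on train_idx
# and return its accuracy on test_idx. Here it only records the split sizes.
sizes = []


def record_sizes(train_idx, test_idx):
    sizes.append((len(train_idx), len(test_idx)))
    return len(test_idx) / (len(train_idx) + len(test_idx))


mean_score = repeated_holdout(535, record_sizes)   # 535 utterances, as in EmoDB
print(sizes[0], round(mean_score, 3))              # (428, 107) 0.2
```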

Speaker Independent
SI ER was evaluated on both databases in five separate experiments. In each experiment, the training set was formed using the original and DimRed features extracted from two speakers. The pair was selected so as to include one male and one female speaker at a time (Table 4).
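The five speaker-disjoint experiments can be sketched as below. The sketch only shows how utterances could be partitioned into a male/female speaker pair and the remaining speakers; which partition serves as the training or testing set follows Table 4, which is not reproduced here. The speaker labels and the function name are hypothetical.

```python
def speaker_pair_folds(speaker_ids, male_speakers, female_speakers):
    """Yield (pair, pair_indices, rest_indices) for each male/female speaker pair."""
    for pair in zip(male_speakers, female_speakers):
        pair_idx = [i for i, s in enumerate(speaker_ids) if s in pair]
        rest_idx = [i for i, s in enumerate(speaker_ids) if s not in pair]
        yield pair, pair_idx, rest_idx


# Hypothetical labels: five male (M1..M5) and five female (F1..F5) speakers,
# two utterances each.
males = [f"M{i}" for i in range(1, 6)]
females = [f"F{i}" for i in range(1, 6)]
speaker_ids = [s for s in males + females for _ in range(2)]

folds = list(speaker_pair_folds(speaker_ids, males, females))
for pair, pair_idx, rest_idx in folds:
    print(pair, len(pair_idx), len(rest_idx))
```

Keeping the two partitions disjoint by speaker is what makes the evaluation speaker independent: no speaker contributes utterances to both sides of a split.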
From Table 6, it can be observed that the highest recognition accuracies of 100% and 65.28% were obtained using DimRed features with the MCSVM classifier in classifying 3E for EmoDB and SESD respectively. The DimRed features convey more discriminating information about the different emotional speech, which results in higher ER accuracy than the original features in all the experiments. The highest ER accuracies for 5E were 93.10% and 48.39% using the k-NN and ELM classifiers respectively. In the recognition of 7E, both the MCSVM and k-NN classifiers provided the highest accuracy of 85.15% for EmoDB.

Gender Independent
Table 7 shows the ER results of the GI experiment for the original and DimRed features. Here, the training and testing sets were prepared using features extracted from the male speakers and the female speakers respectively. From Table 7, MCSVM and FKNN performed equally well in providing the highest accuracy for both databases. MCSVM and FKNN provided the highest accuracies of 87.09% and 96.04% for 7E and 5E respectively using DimRed features for EmoDB. MCSVM and ELM provided maximum accuracies of 71.11% and 52.05% for 3E and 5E respectively using DimRed features for SESD.
In (Giannoulis & Potamianos, 2012), prosodic features, spectral features, glottal flow features and AM-FM features were utilized, and a two-stage feature reduction was proposed for speech emotion recognition. Overall emotion recognition rates of 85.18% for gender dependent and 80.09% for gender independent were achieved using an SVM classifier. In this work, we obtained 87.48% for gender dependent and 87.09% for gender independent. Shahzadi et al. proposed non-linear dynamics features (NLDs) for speech emotion recognition (Shahzadi, et al., 2013). They achieved overall recognition rates between 82% and 86% using NLDs + prosodic + spectral features with 10-fold cross validation. Kotti and Paternò (Kotti & Paternò, 2012) proposed a psychologically-inspired binary cascade classification scheme for speech based emotion recognition using low level audio descriptors and high level perceptual descriptors with a linear SVM. The best emotion recognition accuracy of 87.7% was obtained using SVM with a linear kernel. In (Sezgin et al., 2012), a new set of acoustic features based on perceptual quality metrics was proposed for binary arousal and valence discrimination, including partial loudness of the emotional difference, emotional difference-to-perceptual mask ratio, measures of alterations of temporal envelopes, and measures of harmonics of the emotional difference. They did not report results for the discrimination of seven classes of emotions. From the above results and discussion, it can be observed that the proposed method provides better ER accuracy than some of the significant works in the literature. For the second database, the proposed algorithms also provided better ER accuracy under the different experiments.

CONCLUSION
Generally, speaker/gender dependent ER is easier and provides higher ER accuracy, while the performance of speaker/gender independent ER is lower. In this work, a two-stage feature reduction using PCA and LDA was proposed for gender/speaker independent ER from speech. Short-term cepstral features (MFCCs, LPCs, LPCCs and WLPCCs) were extracted from the emotional speech signals. The extracted short-term cepstral features were reduced to fewer dimensions using PCA followed by LDA. Four different classifiers, namely k-NN, FKNN, MCSVM and ELM, were used to evaluate the DimRed features in speaker/gender independent ER. From the simulation results, MCSVM showed good performance in ER for both databases, and the proposed method provides very encouraging ER accuracy compared with existing work in the literature.

Figure 1. Block diagram of the proposed speech emotion recognition system
Notation: ... term in the LPC model; c_a: LPCC; s_a: LPC; p: order of the LPC. The WLPCCs are generated by multiplying the LPCCs with the weighting formula (9). The weighting function acts as a bandpass filter in the cepstral domain to de-emphasize c_a around a = 1 and a = Q.

Table 1
Summary of previous research works on ER from speech (El Ayadi et al., 2011; Koolagudi & Rao, 2012; Ververidis & Kotropoulos, 2006)
... classification algorithms, and the recognition accuracies vary between 49.52% and 95.10%. They have used the Berlin emotional speech database (EmoDB) and also their own emotional speech databases. High emotion recognition accuracies were obtained for two-class emotion recognition (high arousal vs. low arousal), but multi-class emotion recognition is still challenging. This is due to the following open issues: (a) which speech features are information-rich and parsimonious, (b) different sentences, speakers, speaking styles and rates, (c) more than one perceived emotion in the same utterance, and (d) long-term and short-term emotional states (El Ayadi et al., 2011; Koolagudi & Rao, 2012; Ververidis & Kotropoulos, 2006). Although all the above works are novel contributions to the field of speech emotion recognition, it is difficult to compare them directly since the division of datasets are

[Figure 1 blocks: Input; Pre-processing; Feature extraction (LPC, LPCC, WLPCC, MFCC); Two-stage feature reduction (PCA + LDA); Emotion recognition (KNN, FKNN, SVM, ELM); Output]
The advantage of PCA is that, once the patterns in the data are known, the data can be compressed by reducing the number of dimensions with little loss of information. The first step in PCA is to subtract the mean from each of the data dimensions. The next step is to estimate the covariance matrix and then determine its eigenvectors and eigenvalues. A linear transformation was used to map the original h-dimensional feature space into an l-dimensional feature subspace (l < h). The h-dimensional short-term cepstral feature vectors are represented by a set of N sample features {a_1, a_2, ..., a_N}. The new vector b_i ∈ R^l is then defined by Equation 12.

Table 3
ANOVA results

Table 4
Details of training and testing sets used in SI ER
Table 5 presents the experimental results of the GD and SD ER experiments using the four different classifiers for the original and DimRed features. From Table 5, it can be seen that MCSVM performed well in recognizing 3E, 5E and 7E using DimRed features for EmoDB and SESD compared with the other classifiers (KNN, FKNN and ELM).

Table 6
ER Results for SI

Table 7
(Shen, Changjun, & Chen, 2011)
Current studies focus on developing new feature extraction and classification algorithms. ER accuracy depends on the relevant features, the quality of the database, the experimental setup and the classification techniques. The robustness of the proposed algorithms should be tested with more than one emotional speech database. In this paper, the proposed algorithms were tested using two different emotional speech databases, and different experiments (SD, GD, SI and GI) were conducted. The combination of PCA and LDA reduced the high-dimensional features to fewer dimensions and increased the discrimination ability of the features; hence, we obtained very promising ER accuracy in all the experiments. In (Pan, et al., 2012), energy, pitch, MFCCs and LPCCs were used as features and SVM as a classifier to classify 3E (N, H, Sa) from EmoDB. The highest ER accuracy was only 95.1%, whereas in our work we achieved 100% by applying the two-stage feature reduction (SI). In (Shen, Changjun, & Chen, 2011), LPCs and MFCCs were used as features and SVM as a classifier to recognize 5E from EmoDB, and the ER accuracy was only 70.70%, whereas in our analysis the highest ER accuracy was 96.04% using DimRed features (GI). In the recognition of 7E, our analysis showed the highest accuracy of 87.48% with DimRed features (SD and GD). Bozkurt et al. obtained 84.58% accuracy using Line Spectral Frequency and MFCCs as their features and GMM as the classifier (Bozkurt, et al., 2010).