An Improved Grey Wolf Optimization-based Learning of Artificial Neural Network for Medical Data Classification

Grey wolf optimization (GWO) is a recent and popular swarm-based metaheuristic approach. It has been used in numerous fields such as numerical optimization, engineering problems, and machine learning. Different variants of GWO have been developed in the last five years for solving optimization problems in diverse fields. Like other metaheuristic algorithms, GWO suffers from local optima and slow convergence problems, which result in degraded performance. An adequate balance between exploration and exploitation is a key factor in the success of metaheuristic algorithms, especially for optimization tasks. In this paper, a new variant of GWO, called inertia motivated GWO (IMGWO), was proposed. The aim of IMGWO was to establish a better balance between exploration and exploitation. Traditionally, an artificial neural network (ANN) trained with backpropagation (BP) depends on its initial values and, in turn, attains poor convergence. Metaheuristic approaches are better alternatives to BP. The proposed IMGWO was used to train an ANN to prove its competency in terms of prediction. The proposed IMGWO-ANN was applied to medical diagnosis tasks. Several benchmark medical datasets, including heart disease, breast cancer, hepatitis, and Parkinson's disease, were used for assessing the performance of IMGWO-ANN. The performance measures were described in terms of mean squared errors, classification accuracies, sensitivities, specificities, the area under the curve, and receiver operating characteristic curve. It was found that IMGWO outperformed three popular metaheuristic approaches including GWO, genetic algorithm, and particle swarm optimization. Results confirmed the potency of IMGWO as a viable learning technique for an ANN.


INTRODUCTION
The process of medical diagnosis becomes easier and faster if a decision support system assists the doctors because machines do not suffer from fatigue or boredom. A disease diagnosis involves numerous tests, which can make the process complicated. High performance is desirable in the medical diagnosis process as a little difference in accuracy may lead to a substantial change in prediction (Li et al., 2017). Continuous efforts have been made to improve the performance of the diagnosis process through machine learning methods such as the support vector machine (SVM) by Akay (2009) and Maglogiannis et al. (2009). Das et al. (2009), Lin and Chuang (2010), and Yan et al. (2006) developed artificial neural network (ANN)-based models for medical diagnosis. Other such methods are also reported in the literature of medical data mining. Among all of these methods, ANNs are regarded as universal approximators and generic classifiers. In 1943, McCulloch and Pitts introduced the neuron model on which ANNs are based, and ANNs have since gained much popularity in the field of artificial intelligence and machine learning. ANNs have effectively modeled different problems related to computational intelligence, optimization, function approximation, and complex predictive systems. Three fundamental architectures for ANN are reported: the single-layer feedforward network, the multilayered feedforward network, and recurrent networks. Furthermore, six learning tasks are identified for neural networks, namely pattern association, pattern recognition, function approximation, control, filtering, and beamforming (Haykin, 1994). An ANN is further described by the patterns of connections between the neurons, the methods for determining the weights of the connections, and its activation functions (Amirsadri et al., 2018).
Apart from the abovementioned merits, the training process of ANN through backpropagation has been criticized in many studies due to its slow convergence and poor performance (Brent, 1991; Gori & Tesi, 1992). The poor performance is due to dependence on initial values, whereas entrapment in local minima is responsible for slow convergence. Metaheuristic algorithm-based training of ANN is a possible substitute for backpropagation learning, and existing studies support this claim (Ojha et al., 2017). There are various ways to integrate metaheuristic techniques with ANN, such as: (i) setting the architecture of the network: a network structure is defined by the number of layers and the number of neurons per layer; (ii) managing weight and bias by a metaheuristic technique: a metaheuristic technique is used to adjust the weights and biases of the connections established between neurons of different layers, which is called either the learning of weight and bias or the training of an ANN; and (iii) tuning different learning parameters: this is a kind of hybridization in which a metaheuristic approach is used to adjust some important parameters such as learning rate and momentum in a gradient descent-based learning method (Mirjalili et al., 2012).
The performance of the multilayered perceptron (MLP) was investigated in depth in Seiffert (2001). One study adopted the artificial bee colony (ABC) algorithm to train an ANN, and it performed better than the genetic algorithm (GA) as well as backpropagation (BP) (Karaboga et al., 2007). Gudise and Venayagamoorthy (2003) found that particle swarm optimization (PSO) is a fast-learning algorithm for a feedforward neural network with a nonlinear function and surpasses backpropagation in terms of speed. Blum and Socha (2005) applied the ant colony optimization (ACO) algorithm to train a feedforward neural network, and improved classification results were obtained using benchmark medical datasets including breast cancer, diabetes, and heart disease.
Grey wolf optimization (GWO) is a recent swarm-based metaheuristic technique inspired by the social hierarchy and the hunting process of grey wolves. GWO has proved its competence in terms of fast convergence and better global search capabilities, and it provides competitive results for a wide variety of optimization problems (Long et al., 2017; Qais et al., 2018; Saremi et al., 2015). Zhang et al. (2019) found that GWO is a comparatively new and more capable technique among swarm-based optimization algorithms for solving numerical optimization. A binary version of GWO has also been used for feature selection in various medical datasets downloaded from the University of California Irvine (UCI) machine learning repository (Emary et al., 2016). GWO was also used for prediction of medical data by Sahoo and Chandra (2017). Furthermore, the classification of cervical cancer is predicted through GWO and a significant discrimination is produced between benign and malignant lesions. Khairuzzaman and Chaudhury (2017) used GWO for multilevel thresholding in image segmentation. Some recent studies also applied GWO with ANN for prediction tasks (Djema et al., 2019; Mirjalili, 2015; Nosratabadi et al., 2020; Turabieh, 2016). Despite its diverse applicability, GWO has several drawbacks, like other population-based metaheuristic algorithms. These drawbacks include local optima entrapment at the end of the optimization process and slower convergence during the later part of the evolution process. Moreover, it is easily entrapped in local optima on complex multimodal tasks (Long et al., 2018; Zhang et al., 2019). Long et al. (2018) stated that a proper balance between exploration and exploitation is needed to achieve the global optimum in population-based stochastic methods. They proposed a new non-linear control variable and a modification of the position updating equation inspired by PSO.
In this paper, a new variant of GWO called inertia motivated GWO (IMGWO) is proposed to act as a training algorithm for multilayered perceptron (MLP), a kind of ANN. The new training technique

RELATED WORKS
The literature on metaheuristic algorithm-based training and optimization of ANN is rich. Metaheuristic algorithms can be categorized into: (i) single-solution-based, (ii) multiple-solutions-based, (iii) evolutionary algorithms (EA), (iv) nature-inspired algorithms (NIA), and (v) swarm intelligence (SI)-based optimization. Many of the EA and NIA algorithms are considered as swarm-based optimization like GA and GWO. Metaheuristic algorithms such as EA, NIA, and SI are multiple-solutions-based methods. Simulated annealing, a single-solution-based metaheuristic, was used to optimize ANN in a work by Sarkar and Modak (2003) and it performed better than traditional methods. Metaheuristic methods that are based on multiple solutions are found to be more capable of avoiding local optima problems (Mirjalili, 2015). Recent literature also confirmed that SI, NIA, and EA have better exploration capabilities than single-solution-based metaheuristic algorithms to train ANN (Ojha et al., 2017). Nevertheless, according to the No Free Lunch theorem, there is no universal metaheuristic-based training algorithm for ANN. Therefore, various metaheuristic-based algorithms have been applied to train ANN (Amirsadri et al., 2018).
Recent studies showed that GA outperforms conventional backpropagation to train ANN for some real-world applications (Ding et al., 2011; Tong & Mintram, 2010). Slowik (2011) incorporated an advanced differential evolution technique to train neural networks, and claimed that simulation results suggested that the technique was better than EA and traditional backpropagation methods. Several studies explored the capabilities of PSO for weight optimization of ANN (Mendes et al., 2002; Green et al., 2012; Zhang et al., 2007; Gudise & Venayagamoorthy, 2003). A recent study optimized the weight, structure, and activation function of ANN using PSO (Das et al., 2015). ABC with backpropagation was used to optimize the weights of ANN, and it was observed that integrating ABC and BP alleviated the slow convergence rate (Sarangi et al., 2014). ABC was successfully applied to optimize the weight and structure of ANN in Garro et al. (2011). ACO was effectively used to train ANN by Blum and Socha (2007), and some other studies also applied ACO to train the ANN model to solve prediction, scheduling, and image recognition problems (Irani & Nasimi, 2011; Shariati et al., 2019). Several studies adopted other metaheuristic methods to train ANN for different applications. Teaching-learning based optimization was applied to train ANN for estimating energy consumption (Uzlu et al., 2014). A harmony search-based neural network was used to classify UCI datasets including breast cancer and thyroid disease (Kulluk et al., 2012). The biogeography-based optimizer (BBO) was also used to train MLP and further applied to classify breast cancer and heart disease datasets. A social-spider optimization algorithm was employed to train MLP for medical datasets (Pereira et al., 2014).
Metaheuristic optimization techniques have been in practice for more than a decade especially for medical data classification. These techniques improved the results of different machine learning methods used for medical diagnosis. Some of these are discussed in Pham and Triantaphyllou (2009) and Muhaideb and Menai (2014).
Breast cancer is one of the leading causes of death in the world and many studies on it are reported in the literature. However, the early symptoms of breast cancer are not easily identified (Das et al., 2020; Pham & Triantaphyllou, 2009). Even if a woman lives for 85 years and does not have a family history of breast cancer, there is still a 0.12 probability of her being a breast cancer patient (Bhardwaj & Tiwari, 2015). Metaheuristic-based algorithms like fruit fly optimization, the homogeneity-based algorithm, genetic programming, and PSO are extensively adopted for breast cancer prediction, and good results were achieved by these algorithms (Brameier & Banzhaf, 2001; Chen et al., 2011; Huang et al., 2019; Pham & Triantaphyllou, 2009; Shen et al., 2016).
Heart disease is also considered in this work, as it is a life-threatening disease that is common in old and middle-aged people. Nevertheless, young people may also suffer from heart disease, especially in developing nations like India. A study showed that heart disease is responsible for 24 percent of the total deaths that occur due to non-communicable diseases in India. Heart disease is most common in Asia Pacific and almost 17 million people die every year worldwide due to heart-related problems (Latha & Jeeva, 2019). In the United States (US), a case of myocardial infarction, a type of heart disease, is detected every 34 seconds and the probability of death is 0.15 (Tay et al., 2014). Turabieh (2016) adopted the GWO algorithm to train ANN for accurate prediction of heart disease and observed that the root mean squared error (RMSE) was significantly reduced as compared to standard ANN.
Parkinson's disease (PD) and Alzheimer's disease (AD) are two common neurodegenerative diseases and millions of people are suffering from these diseases throughout the world (Oliva & Hinojosa, 2020). PD is the second major neurodegenerative disease in the world (Cai et al., 2018). A detailed analysis of PD using association rule mining and metaheuristic algorithms was discussed in Altay and Alatas (2020).
The liver is also an important organ of living beings and hepatitis is one of the major diseases associated with it. It has been found that the diagnosis of hepatitis is a very difficult process, as an expert makes many comparisons with previously identified cases. The hepatitis virus may be of different kinds such as HAV, HBV, HCV, HDV, HEV, and HGV, and over 1.5 million deaths worldwide occur due to this viral disease per year (Sartakhti et al., 2012). The medical diagnosis of hepatitis using SVM has been discussed in detail (Chen et al., 2011). Bascil and Temurtas (2011) adopted ANN for the same purpose. It is observed that ANN is widely adopted in the field of data mining for prediction tasks. It is also noticed that various metaheuristic algorithms are integrated with ANN for improving prediction results. Apart from medical diagnosis, ANN is extensively applied for optimizing various applications using metaheuristic methods (Mirjalili, 2015; Mirjalili et al., 2012).
As per the extensive literature review, it is seen that GWO outperforms other SI and EA techniques. Nevertheless, this algorithm suffers from local optima entrapment and slow convergence rate. Long et al. (2018) studied the reasons behind the aforementioned issues and highlighted several reasons: (i) In GWO, the control variable is linearly decreased while it must decrease in a non-linear fashion so that better exploration can be achieved during later optimization, and also to maintain rich exploitation. (ii) To achieve global optima position, the update rule of classic GWO needs some improvements because the best positions achieved in the previous steps cannot be retained as GWO does not have a memory concept. Therefore, GWO must remember the previous best solution and attain the global optima. Long et al. (2018) suggested the following solutions to the abovementioned problems: (i) Develop a non-linear control parameter that can manage exploration and exploitation in a better way. (ii) A memory concept is introduced in GWO that must remember the personal best component inspired through PSO. Furthermore, the position of alpha wolf represents the global best position, and in every iteration, the previous personal best must be stored to achieve the global best position.  Long et al. (2018). The proposed optimization technique is termed as IMGWO since the non-linear control variable resembles the inertia of PSO in behavior (Chatterjee & Siarry, 2006). IMGWO used MSE as fitness function during the training of MLP and obtained a significant improvement in the results as compared to other contemporary metaheuristic methods (GA, PSO, and GWO).

Multilayer Perceptron
ANN architectures fall into three structures: the single-layer feedforward network, the multilayer feedforward neural network (FNN), and the recurrent neural network (RNN). Several variants of the three structures are also presented in the literature, viz., convolutional neural networks, radial basis function networks, Hopfield networks, Boltzmann machines, liquid state machines, Kohonen networks, extreme learning machines, and modular neural networks. However, this study considered the multilayer perceptron (MLP) model. MLP is a kind of FNN with one or more hidden layers between the input and output layers, and each node is associated with an activation function. The activation function can differ between the hidden layer and the output layer; nevertheless, this study considered the same activation function for both layers. The activation function used in this study is the sigmoid given in Equation 1.

$$f(x) = \frac{1}{1 + e^{-x}} \tag{1}$$

Assume an input layer with $i$ input neurons, a hidden layer with $h$ neurons, and an output layer with $o$ neurons. The weighted sum of inputs for MLP can be calculated using Equation 2.

$$s_q = \sum_{p=1}^{i} w_{pq}\, x_p + b_q, \quad q = 1, 2, \ldots, h \tag{2}$$

where $w_{pq}$ is the weight of the connection from neuron $p$ in the input layer to neuron $q$ in the hidden layer, $b_q$ is the bias associated with hidden neuron $q$, and $x_p$ denotes the input for neuron $p$. It is assumed that input layer neurons are passive; they simply pass the information on without applying an activation function or bias, unlike the neurons in the hidden and output layers. Neurons at the hidden layer produce their output using Equation 3.

$$S_q = f(s_q) = \frac{1}{1 + e^{-s_q}} \tag{3}$$

The weighted sum at the output neurons can be described using Equation 4.

$$z_r = \sum_{q=1}^{h} w'_{qr}\, S_q + b'_r, \quad r = 1, 2, \ldots, o \tag{4}$$

where $w'_{qr}$ is the weight of the connection from hidden neuron $q$ to output node $r$ and $b'_r$ is the bias at output neuron $r$.
The output is further passed to the sigmoid function as in Equation 5. A sigmoid function is a kind of squashing function that converts its input into the output range (0, 1).

$$y_r = f(z_r) = \frac{1}{1 + e^{-z_r}} \tag{5}$$

It is assumed that $r = 1, 2, \ldots, o$ denotes the distinct classes.
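The forward pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and the argument layout (`w_ih[p][q]` for input-to-hidden weights, `w_ho[q][r]` for hidden-to-output weights) are assumptions made for the example, and the bias is assumed to be added to the weighted sum.

```python
import math


def sigmoid(x):
    """Squashing function mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def mlp_forward(x, w_ih, b_h, w_ho, b_o):
    """Forward pass of a one-hidden-layer MLP with sigmoid units.

    x    : list of input values (input neurons are passive)
    w_ih : w_ih[p][q] is the weight from input p to hidden neuron q
    b_h  : b_h[q] is the bias of hidden neuron q
    w_ho : w_ho[q][r] is the weight from hidden neuron q to output r
    b_o  : b_o[r] is the bias of output neuron r
    """
    # Hidden layer: weighted sum plus bias, then sigmoid.
    hidden = [
        sigmoid(sum(w_ih[p][q] * x[p] for p in range(len(x))) + b_h[q])
        for q in range(len(b_h))
    ]
    # Output layer: same activation function as the hidden layer.
    return [
        sigmoid(sum(w_ho[q][r] * hidden[q] for q in range(len(hidden))) + b_o[r])
        for r in range(len(b_o))
    ]
```

With all weights and biases set to zero, every neuron outputs `sigmoid(0) = 0.5`, which is a convenient sanity check for the wiring.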

Grey Wolf Optimizer
Grey wolf optimization is inspired by the searching and hunting behaviors of grey wolves (Canis lupus). The searching mechanism of grey wolves follows a hierarchical system that consists of alpha, beta, delta, and omega wolves. In nature, wolves are found in groups of 5 to 12. However, from the implementation point of view, the wolf population can be defined in the range of 50 to 250. The alpha wolf acts as the leader of the group and has the highest priority. The priority is computed using a fitness function. The alpha wolf holds the best fitness value, while the beta and delta wolves hold the second-best and third-best fitness values, respectively. The rest of the wolves are known as omega (ω) and follow their superiors when searching for the prey and encircling it. The prey is the target (most optimal) solution, which is explored during the searching process. When the search begins, the iteration variable $t$ is initialized. Then, the top three wolves (α, β, and δ) lead the other wolves (ω) to reach the target solution, i.e., the position of the prey. Each wolf must change its position after every iteration to approach the target. At last, when the alpha attacks the prey, GWO stops and returns the position of the alpha or prey as the possible solution. The encircling mechanism is described through the generic Equations 6 and 7, which are used to update the positions of the wolves around the three best solutions, i.e., alpha, beta, and delta.

$$\vec{D} = \left| \vec{C} \cdot \vec{X}_p(t) - \vec{X}(t) \right| \tag{6}$$

$$\vec{X}(t+1) = \vec{X}_p(t) - \vec{A} \cdot \vec{D} \tag{7}$$

Here $\vec{X}_p$ is the position of the prey and $\vec{X}$ is the position of a wolf. The two important variables $\vec{A}$ and $\vec{C}$ are responsible for exploration and exploitation, and change as per the expressions $\vec{A} = 2a \cdot \vec{r}_1 - a$ and $\vec{C} = 2 \cdot \vec{r}_2$, where $a$ is the control variable and $\vec{r}_1$ and $\vec{r}_2$ are the random values responsible for the movement of wolves in multidimensional space. Equations 8 to 10 define the step size toward the best three solutions (α, β, and δ).

$$\vec{D}_\alpha = \left| \vec{C}_1 \cdot \vec{X}_\alpha - \vec{X} \right| \tag{8}$$

$$\vec{D}_\beta = \left| \vec{C}_2 \cdot \vec{X}_\beta - \vec{X} \right| \tag{9}$$

$$\vec{D}_\delta = \left| \vec{C}_3 \cdot \vec{X}_\delta - \vec{X} \right| \tag{10}$$

$t$ represents the current iteration in the aforementioned equations (Amirsadri et al., 2018). GWO is described in Algorithm 1.

Journal of ICT, 20, No. 2 (April) 2021, pp: 213-248
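As a rough illustration of the mechanism described above, the sketch below implements one position update of classic GWO for a minimization problem, followed by a small driver loop. It assumes the standard GWO formulation (coefficients A = 2a·r1 − a and C = 2·r2 drawn per leader and per dimension, the new position taken as the average of the three leader-guided candidates, and a control variable decreasing linearly from 2 toward 0). The function names, population size, bounds, and seed are hypothetical choices for the example, not values taken from this paper.

```python
import random


def gwo_step(wolves, fitness, a, rng=random):
    """One classic GWO position update (minimization).

    wolves  : list of positions, each a list of floats
    fitness : callable mapping a position to a cost (lower is better)
    a       : current value of the control variable
    """
    ranked = sorted(wolves, key=fitness)
    alpha, beta, delta = ranked[0], ranked[1], ranked[2]
    new_wolves = []
    for x in wolves:
        new_x = []
        for d in range(len(x)):
            candidates = []
            for leader in (alpha, beta, delta):
                A = 2.0 * a * rng.random() - a     # exploration/exploitation coefficient
                C = 2.0 * rng.random()             # random emphasis on the leader
                D = abs(C * leader[d] - x[d])      # step size toward this leader
                candidates.append(leader[d] - A * D)
            new_x.append(sum(candidates) / 3.0)    # average of the three candidates
        new_wolves.append(new_x)
    return new_wolves


def gwo(fitness, dim, n_wolves=20, iters=100, lo=-5.0, hi=5.0, seed=0):
    """Minimal GWO driver; returns the best position seen."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_wolves)]
    best = min(wolves, key=fitness)
    for t in range(iters):
        a = 2.0 * (1.0 - t / iters)                # linear decrease starting at 2
        wolves = gwo_step(wolves, fitness, a, rng)
        candidate = min(wolves, key=fitness)
        if fitness(candidate) < fitness(best):
            best = candidate
    return best
```

A call such as `gwo(lambda x: sum(v * v for v in x), dim=2)` minimizes a simple sphere function. Note that this plain version has no memory of earlier best positions apart from the `best` bookkeeping in the driver, which is the limitation the later sections address.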

Particle Swarm Optimization
Like GWO, PSO is a nature-inspired optimization algorithm and one of the most popular SI techniques presented in the literature. The basic working of PSO is described through a personal thinking factor and a collaborative thinking factor, as shown in Equation 15. The concept of inertia weight has gained wide popularity, especially in the case of PSO, and has also proved its importance for exploring the optimum solution. Furthermore, PSO reported the best performance when inertia is in the range [0.9, 1.2] (Shi & Eberhart, 1998).

$$v_i(t+1) = v_i(t) + c_1\, \mathrm{rand}_1()\, \big(pbest_i - x_i(t)\big) + c_2\, \mathrm{rand}_2()\, \big(gbest - x_i(t)\big) \tag{15}$$

where $v_i$ is the velocity of the $i$-th participating particle, $x_i$ is its position, $c_1$ and $c_2$ are the positive coefficients, and the $\mathrm{rand}_1()$ and $\mathrm{rand}_2()$ functions generate random numbers in the range [0, 1]. The variables $pbest_i$ and $gbest$ denote the personal best and global best positions of the particles, respectively. Chatterjee and Siarry (2006) suggested that a better balance between exploration and exploitation can be achieved through Equation 17.

$$v_i(t+1) = w\, v_i(t) + \varphi_1\, \big(pbest_i - x_i(t)\big) + \varphi_2\, \big(gbest - x_i(t)\big) \tag{17}$$

where $v_i(t)$ is the velocity vector for iteration $t$, $pbest_i$ and $gbest$ are the personal best and global best positions as per the classic PSO definitions, $w$ is the inertia weight, and $\varphi_1$ and $\varphi_2$ are weighting variables of stochastic nature for balancing the private thinking component and the global thinking component.

The inertia weight $w$ is updated according to Equation 18 (Chatterjee & Siarry, 2006).

$$w(t) = \frac{(T - t)^{n}}{T^{n}}\, \big(w_{ini} - w_{fin}\big) + w_{fin} \tag{18}$$

where $T$ denotes the total number of iterations and $t$ denotes the current iteration. The inertia weight maintains a good balance between exploitation and exploration through the non-linear modulation index $n$. The schedule becomes linear when $n$ equals 1. The inertia weight changes from $w_{ini}$ to $w_{fin}$ during successive iterations. Integrating the inertia weight concept into the PSO algorithm introduces non-linear behavior and also improves the simulation results significantly (Chatterjee & Siarry, 2006).
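The non-linear inertia schedule and its use in a PSO velocity update can be sketched as follows. The schedule follows the form attributed to Chatterjee and Siarry (2006), in which the weight decays from an initial to a final value according to a modulation index. The default values (`w_ini=0.9`, `w_fin=0.4`, `n=1.2`, `c1=c2=2.0`) and the function names are illustrative assumptions only, not settings reported in this paper.

```python
import random


def inertia_weight(t, T, w_ini=0.9, w_fin=0.4, n=1.2):
    """Non-linear inertia weight at iteration t of T.

    n is the modulation index; n = 1 gives a linear decrease.
    """
    return ((T - t) ** n / T ** n) * (w_ini - w_fin) + w_fin


def pso_velocity(v, x, pbest, gbest, w, c1=2.0, c2=2.0, rng=random):
    """Velocity update with inertia weight w, applied per dimension."""
    return [
        w * v[d]
        + c1 * rng.random() * (pbest[d] - x[d])   # personal thinking component
        + c2 * rng.random() * (gbest[d] - x[d])   # collaborative thinking component
        for d in range(len(v))
    ]
```

For any positive modulation index, the schedule starts at `w_ini` when `t = 0` and reaches `w_fin` when `t = T`; the index only changes how quickly the weight falls in between.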

Exploration and Exploitation
All non-deterministic metaheuristic algorithms are described as population-based algorithms. These algorithms have a natural tendency to maintain the balance between exploration and exploitation to avoid local optima. Without proper balancing between these two processes, i.e., exploration and exploitation, the optimal solution cannot be achieved for optimization problems. Exploration, or diversification, generates solutions in an area distant from the current solution so that the search proceeds toward the global solution, whereas exploitation, or intensification, searches for solutions near the current solution (Emary et al., 2018). For all stochastic population-based metaheuristic methods, exploitation means using previously attained knowledge to find better and probably optimal solutions. Classic GWO states that candidate solutions diverge (exploration) with respect to the target when $|\vec{A}| > 1$ and converge (exploitation) toward the target when $|\vec{A}| < 1$; in turn, the control parameter $a$ is decreased from 2 to 0 to emphasize divergence and convergence during successive iterations. However, it is practically observed that a linear decrement of $a$ does not reflect the actual search process due to the complex and non-linear nature of exploration and exploitation. Long et al. (2018) developed a new variant of GWO that integrated PSO and GWO. The exploration and exploitation processes were improved by treating the control variable like the inertia concept (a weight-like parameter) of PSO. The aim of the inertia factor is to make PSO more effective in terms of performance. Therefore, Long et al. (2018) developed a non-linear control variable that maintains the balance between exploration and exploitation, as well as an adjustment strategy governed by a decay function (Long et al., 2018).
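A short worked calculation makes the effect of a linear schedule concrete. It assumes the classic GWO definitions, namely a coefficient A = 2a·r1 − a with r1 drawn uniformly from [0, 1] and a control parameter that decreases linearly from 2 to 0 over T iterations; the symbols follow that standard formulation and are considered per dimension.

```latex
% A is uniform on [-a, a] because r_1 is uniform on [0, 1]:
A = 2a\,r_1 - a = a\,(2r_1 - 1), \qquad A \sim \mathcal{U}[-a,\, a]

% Probability that a single draw favours exploration (|A| > 1):
P\big(|A| > 1\big) =
\begin{cases}
1 - \dfrac{1}{a}, & a > 1 \\[6pt]
0, & a \le 1
\end{cases}

% Linear schedule and the iteration at which exploration switches off:
a(t) = 2\left(1 - \frac{t}{T}\right), \qquad a(t) > 1 \iff t < \frac{T}{2}

% At the very first iteration:
a(0) = 2 \;\Rightarrow\; P\big(|A| > 1\big) = \tfrac{1}{2}
```

Under these assumptions, exploratory moves are possible only during the first half of the run, and even at the start only half of the draws are exploratory. A non-linear schedule changes how long the control parameter stays above 1, which is the lever the inertia-inspired variants adjust.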

Learning of the MLP Using a Metaheuristic Technique
There are certain encoding schemes available for learning the weight and bias parameters through metaheuristics (Zhang et al., 2007).
(i) Vector Encoding Scheme: In this scheme, each search agent is encoded in the form of a vector and this vector represents the weights and biases of MLP as shown in Equation 19:

$$\text{Search agent} = [\, w_{11}, w_{12}, \ldots, w_{1h}, w_{21}, \ldots, w_{ih},\; w'_{11}, w'_{12}, \ldots, w'_{1o}, w'_{21}, \ldots, w'_{ho},\; b_1, b_2, \ldots, b_h,\; b'_1, \ldots, b'_o \,] \tag{19}$$

where $w_{11}, \ldots, w_{ih}$ are the weights of the connections between the input layer and the hidden layer, and $w'_{11}, \ldots, w'_{ho}$ are the weights of the connections between the hidden layer and the output layer. For example, $w_{11}$ is the weight of the connection from node 1 in the input layer to node 1 in the hidden layer, and $w'_{21}$ is the weight of the connection from node 2 in the hidden layer to node 1 in the output layer. Bias is associated with the neurons of the hidden and output layers only. Bias $b_1$ is associated with the first node in the hidden layer and $b_h$ is the bias associated with the last node in the hidden layer. Similarly, $b'_1$ and $b'_o$ are the biases for the first and last nodes in the output layer. The architecture of MLP is $i$-$h$-$o$, where $i$ is the number of neurons in the input layer, $h$ is the number of neurons in the hidden layer, and $o$ is the number of neurons in the output layer.
The search agent matrix for the vector encoding scheme can be described using Equation 20:

$$\text{Search agents} = \begin{bmatrix} \text{Search agent}_1 \\ \text{Search agent}_2 \\ \vdots \\ \text{Search agent}_M \end{bmatrix} \tag{20}$$

where $M$ describes the population of search agents or the size of the swarm.
(ii) Matrix Encoding Scheme: In this encoding, each search agent is encoded in the form of matrices. For a given MLP with structure $i$-$h$-$o$, the weights and biases are collected in the matrices $W_1$, $B_1$, $W_2$, and $B_2$ (Equations 21 to 24), and the structure of the search agent using the matrix encoding scheme can be represented as:

$$\text{Search agent}_k = [\, W_1,\; B_1,\; W_2^{T},\; B_2 \,] \tag{25}$$

The matrices $W_1$ and $W_2^{T}$ are the weight matrix for the hidden layer and the transpose of the weight matrix for the output layer, whereas $B_1$ and $B_2$ are the bias matrices for the hidden layer and output layer, respectively. Variable $k$ can vary from 1 to any population size $M$. Typically, population size can vary in the range of 50 to 500 depending on the type of problems being solved.
(iii) Binary Encoding Scheme: It is a kind of encoding scheme, where each search agent can be either 0 or 1 and the weight for MLP training can be described through a series of zeros and ones. The cost of the encoding scheme is comparatively high for training purposes; however, this encoding is good for feature selection tasks.
The abovementioned encoding schemes are used to represent the population of metaheuristic algorithms with ANN, specifically for MLP. Equations 19 and 20 are vector representations of weights and biases, while Equations 21 to 25 represent the weights and biases in matrix form.
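The vector encoding scheme can be illustrated with a small decoding helper that slices a flat search agent back into the weight and bias groups of an MLP. The ordering used here (input-to-hidden weights, then hidden-to-output weights, then hidden biases, then output biases) is an assumption that follows the description of the vector encoding above; the function names are hypothetical.

```python
def dimension(i, h, o):
    """Number of weights and biases in an i-h-o MLP."""
    return i * h + h * o + h + o


def decode(agent, i, h, o):
    """Split a flat search agent into (w_ih, w_ho, b_h, b_o).

    w_ih[p][q] : weight from input p to hidden neuron q
    w_ho[q][r] : weight from hidden neuron q to output r
    """
    if len(agent) != dimension(i, h, o):
        raise ValueError("agent length does not match the i-h-o structure")
    pos = 0
    w_ih = [agent[pos + p * h: pos + (p + 1) * h] for p in range(i)]
    pos += i * h
    w_ho = [agent[pos + q * o: pos + (q + 1) * o] for q in range(h)]
    pos += h * o
    b_h = agent[pos: pos + h]
    pos += h
    b_o = agent[pos: pos + o]
    return w_ih, w_ho, b_h, b_o
```

For a 2-3-1 network, for example, each search agent has 2·3 + 3·1 + 3 + 1 = 13 dimensions, so a swarm of M agents is an M × 13 array of candidate weight and bias values.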

INERTIA MOTIVATED GREY WOLF OPTIMIZATION
In this section, the new swarm optimization algorithm named IMGWO is discussed and further, IMGWO is used to train an MLP.

Inertia Motivated Grey Wolf Optimization
Through the literature, it is observed that the exploration and exploitation processes of GWO depend on the behavior of the control variable $a$, since it determines the vectors $\vec{A}$ and $\vec{C}$ (using Equations 6 and 7) that can affect the final solution. The current study focuses on the non-linear behavior of control variable $a$ and enhances the non-linearity of $a$ by proposing Equation 26.
Consider that $t$ denotes the current iteration, $T$ denotes the total number of iterations, and $a_{ini}$ and $a_{fin}$ denote the initial and final values of control variable $a$, respectively. The non-linear decay of the control variable is expressed through Equation 26, which also manages the balance between exploitation and exploration in an effective manner. This study uses the rule for updating the position of a grey wolf that is given in Equation 27 (Long et al., 2018).
In Equation 27, $t$ is the current iteration and $r_3$ and $r_4$ are random variables in the range of 0 to 1. Further, $c_1$ and $c_2$ represent the individual memory coefficient and the population communication coefficient, respectively, both in the range [0, 1]. $X_{ibest}$ denotes the personal best position during successive iterations. $X_1$ is considered as the global best solution, i.e., the position of α (alpha wolf), and $w$ is a weight similar to the inertia weight in PSO. The value of $w$ changes from an initial value $w_{ini}$ to a final value $w_{fin}$ as per Equation 28 with respect to iteration variable $t$ (Long et al., 2018), where $T$ is the total number of iterations specified in the algorithm.
Several findings can be highlighted as follows: (i) the variable $w$ can increase the speed of exploitation; (ii) the random variables $r_3$ and $r_4$ are responsible for improving the degree of randomness and exploration; and (iii) the coefficients $c_1$ and $c_2$ are responsible for a better balance between exploration and exploitation. The aim of the proposed study is to enhance the balance between exploration and exploitation by introducing a non-linear control variable as expressed in Equation 26. The study used the position updating rule expressed in Equation 27, which determines the global best position. Finally, the abovementioned improvements are incorporated into the GWO algorithm, and the resulting algorithm is proposed to train MLP, since MLP is often trapped in local optima during the training process. The summary of IMGWO is described in Algorithm 2.
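One plausible reading of the memory-guided position update described above is sketched below. It is an assumption-laden illustration rather than the paper's Equation 27: the exact form of that equation is not reproduced in this text, so the sketch simply combines an inertia-weighted average of the three leader-guided candidates with a personal-best term and a global-best term, as the prose describes. The function name, the parameter names, and the way the terms are combined are all hypothetical.

```python
import random


def memory_guided_update(x, x1, x2, x3, pbest, w, c1, c2, rng=random):
    """Hypothetical memory-guided position update for one wolf.

    x          : current position of the wolf
    x1, x2, x3 : candidate positions guided by alpha, beta, and delta
    pbest      : personal best position remembered by this wolf
    w          : inertia-like weight on the averaged GWO candidate
    c1, c2     : individual memory and population communication coefficients
    """
    new_x = []
    for d in range(len(x)):
        gwo_term = (x1[d] + x2[d] + x3[d]) / 3.0             # classic GWO average
        memory_term = c1 * rng.random() * (pbest[d] - x[d])  # pull toward personal best
        social_term = c2 * rng.random() * (x1[d] - x[d])     # pull toward alpha-guided best
        new_x.append(w * gwo_term + memory_term + social_term)
    return new_x
```

In this form, a wolf that already sits on its personal best and on all three candidates stays where it is when `w = 1`, while a smaller `w` scales the averaged candidate down.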

Training of MLP using IMGWO
This section discusses the IMGWO-based training algorithm for MLP. IMGWO attained promising results during the experimental setup and attracted the attention for solving the weight optimization problem of ANN. The algorithm managed better trade-off between the divergence and convergence processes, which is one of the essential steps to obtain competitive results using metaheuristic techniques. The current study focuses on optimizing the weights and biases during the training of MLP using IMGWO. Furthermore, the vector encoding scheme (already discussed in the background concepts) is adopted to represent the weights and biases of MLP for search agents (wolves) (α, β, δ, and ω). The weight and bias vectors (as shown in Equation 19) represent the dimensions or variables for IMGWO. The variables should be optimized to obtain maximum classification accuracy. Classification or prediction accuracy also depends on another important measure, i.e., MSE. MSE can be considered as an objective function for evaluating classifier performance.
The training of MLP is also considered as a challenging problem due to large search space and gradient descent nature of weight and bias. To meet the aforementioned challenges of MLP, the proposed IMGWO-MLP framework can be described in Algorithm 3.

Declaration 3: i-h-o is the structure of neural network where i is the number of neurons in input layer h is the number of neurons in hidden layer o is the number of neurons in output layer Dimensions are the number of weights and biases expressed in Equation 19 MSE is the mean squared error also known as average training error expressed in Equation 30
Calculate

the number of dimensions as per formulate given in Equation 29
Initialize all the dimensions in a suitable range Input: training samples of a medical dataset, and set of weights and biases (dimensions) for each training sample Apply Algorithm 2 (IMGWO) to each of the dimensions to obtain MSE and optimum values of weights and biases Use the optimum weights and biases to classify the sample end for Output: MSE, and set of optimum weights and biases (29) In the IMGWO algorithm, the fitness of search agents is computed using Equations 30 and 31. For simplicity, an MLP with one hidden layer is considered in the proposed study. The input layer consists of neurons, whereas number of neurons in the hidden layer is .It is assumed that the input dataset contains number of training patterns and each pattern could be classified in number of classes. Therefore, the total neurons in the output layer are set to . The output can be described in terms of average training error for a given input unit and pattern pattern as follows: Where and are the observed and expected outputs of the input unit with respect to training pattern.

The average training error is described using Equation 31, which acts as a fitness function for all search agents.
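The encoding and fitness evaluation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the ordering of weights and biases inside the vector, the sigmoid activation, and all function names (`count_dimensions`, `decode`, `mse_fitness`) are assumptions made for illustration, and the dimension count is simply the number of weights and biases of an i-h-o network with one hidden layer rather than a transcription of Equation 29.

```python
import math

def count_dimensions(i, h, o):
    """Number of weights and biases in an i-h-o MLP with one hidden layer."""
    return i * h + h * o + h + o

def decode(vector, i, h, o):
    """Split a flat search-agent vector into (w1, b1, w2, b2).

    Assumed layout: input-hidden weights, hidden-output weights,
    hidden biases, output biases.
    """
    if len(vector) != count_dimensions(i, h, o):
        raise ValueError("vector length does not match the i-h-o structure")
    pos = 0
    w1 = [vector[pos + r * h: pos + (r + 1) * h] for r in range(i)]  # i x h
    pos += i * h
    w2 = [vector[pos + r * o: pos + (r + 1) * o] for r in range(h)]  # h x o
    pos += h * o
    b1 = vector[pos: pos + h]
    pos += h
    b2 = vector[pos: pos + o]
    return w1, b1, w2, b2

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Sigmoid layer where weights[r][c] connects input r to unit c."""
    return [
        sigmoid(sum(x * weights[r][c] for r, x in enumerate(inputs)) + biases[c])
        for c in range(len(biases))
    ]

def mse_fitness(vector, samples, targets, h):
    """Mean squared error of the decoded MLP over all samples and outputs."""
    i, o = len(samples[0]), len(targets[0])
    w1, b1, w2, b2 = decode(vector, i, h, o)
    total = 0.0
    for x, t in zip(samples, targets):
        y = layer(layer(x, w1, b1), w2, b2)
        total += sum((yk - tk) ** 2 for yk, tk in zip(y, t))
    return total / (len(samples) * o)
```

In such a setup, each wolf holds one vector of length `count_dimensions(i, h, o)`, and the optimizer only needs `mse_fitness` as a black-box objective to minimize.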

EXPERIMENTAL STUDY
The four benchmark medical datasets were taken from the UCI machine learning repository and a description of these datasets is given in Table 1. All datasets had two classes; therefore, the medical data classification problem could be interpreted as a two-class problem or binary classification. The second column denotes the total number of attributes, whereas the third column indicates the total number of samples. The fourth column provides the information regarding the presence of missing values in the datasets. The last column describes the structure of MLP. In the MLP structure, 'I' denotes the total number of input nodes in the input layer, 'H' denotes the total number of nodes in the hidden layer, and 'O' indicates the total number of nodes in the output layer. The MLP structure contains a single hidden layer for all datasets.
The heart disease dataset consists of 303 data instances, 13 attributes, and one class attribute. Originally, it had 76 attributes; however, most of the irrelevant attributes (e.g., ID, social security number, etc.) were removed during the pre-processing task. All attributes are numeric and the class attribute is named num, which takes a value of 0 or greater: 0 stands for the absence of heart disease, whereas values greater than 0 indicate its presence, and a value near 1 corresponds to the worst condition. The breast cancer data is a medium-size dataset, named the Breast Cancer Wisconsin (original) (BCW) dataset, and it contains a total of 9 attributes (the ID number is omitted) and 699 data instances. The tenth attribute is the class attribute, which is either benign (non-cancerous) or malignant (cancerous). The hepatitis dataset contains a small number of data instances (155 in total), as given in Table 1, despite having a large number of attributes (19 in total). Out of the 19 attributes, 13 are of binary type, while the others contain numeric values. The class attribute has two distinct values, i.e., 'die' and 'live'. The Parkinson's disease (PD) dataset is a medium-size dataset with 22 attributes and one class attribute. It consists of 195 subjects. The class attribute discriminates the subjects as either PD-affected or healthy. The disease identification in PD is associated with the differences observed when pronouncing vowels.

Experimental Setup
IMGWO is a modified metaheuristic computational approach that was applied to refine the classification results of ANN. MSE served as the fitness function for IMGWO and was minimized over successive iterations to attain the optimal solution. The simulation results of the IMGWO-MLP classifier were compared with ANN-based classifiers trained by three well-known metaheuristic techniques, namely GA, PSO, and GWO.
Each dataset was partitioned into training and testing sets in the ratio of 70:30, i.e., 70 percent of the data was used for training the network and the rest was used to obtain the classification results using IMGWO-MLP. The implementation details of the proposed IMGWO-MLP and the other three techniques (GWO, PSO, and GA) are given in Table 2. The experiments were implemented in the MATLAB software tool (version 2018), and all user-defined parameters were set prior to executing the experiments. The total number of iterations was set to 500 and the population of search agents was set to 250. In GWO, the control variable a was initially set to 2 and decreased linearly to 0 over the successive iterations. In IMGWO, the initial value of a was also 2 and it also reached 0 over the successive iterations; however, it decreased nonlinearly. IMGWO also introduced a new parameter, w, whose initial value was set to 2. In PSO, the inertia weight was set to 0.3, and the personal and social learning coefficients were set to 1. In GA, the single-point crossover probability was set to 1, while the uniform mutation probability was set to 0.01.
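The difference between a linear and a nonlinear decrease of the control variable can be sketched as follows. The linear schedule is the one used by standard GWO. The nonlinear schedule shown here is a hypothetical placeholder chosen only to illustrate the idea; it is not the expression given in Equation 26, and the function names and the `power` parameter are assumptions.

```python
def a_linear(t, max_iter, a_start=2.0):
    """Standard GWO schedule: a falls linearly from a_start to 0."""
    return a_start * (1.0 - t / max_iter)

def a_nonlinear(t, max_iter, a_start=2.0, power=2.0):
    """Hypothetical nonlinear schedule (NOT the paper's Equation 26).

    With power > 1, a stays larger in early iterations (more exploration)
    and drops faster near the end (more exploitation).
    """
    return a_start * (1.0 - (t / max_iter) ** power)

MAX_ITER = 500  # number of iterations used in the experiments
for t in (0, 125, 250, 375, 500):
    print(t, round(a_linear(t, MAX_ITER), 3), round(a_nonlinear(t, MAX_ITER), 3))
```

Both schedules start at 2 and end at 0; they differ only in how the iterations in between are divided between exploration and exploitation.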

Result and Discussion
This section discusses the simulation results obtained through the MLP classification model trained by IMGWO and by the other techniques, namely GWO, GA, and PSO. The study included four benchmark medical datasets, i.e., heart disease, breast cancer, hepatitis, and Parkinson's disease. All datasets were partitioned into training and testing sets in the ratio of 70:30. In the first phase, a model was constructed using the training set of each medical dataset, and the optimal weights and biases were computed using the IMGWO technique. In the second phase, the model was applied to the testing set of each medical dataset. Each model (IMGWO-MLP, GWO-MLP, PSO-MLP, and GA-MLP) was executed for ten independent runs on each medical dataset. Tables 3-7 present the simulation results of all MLP models on all medical datasets. The results are presented as the arithmetic mean of the ten independent runs. The convergence rates of the training task using IMGWO, GWO, GA, and PSO are shown in Figures 1-4.
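The evaluation protocol described above (a 70:30 partition and the arithmetic mean over ten independent runs) can be sketched in a few lines. This is an illustrative outline only: the shuffling strategy, the seeds, and the names `split_70_30` and `mean_over_runs` are assumptions, and `run_once` stands for any hypothetical routine that trains a model with a given seed and returns one scalar result such as MSE or accuracy.

```python
import random
import statistics

def split_70_30(samples, labels, seed=0):
    """Shuffle indices and split them 70:30 into training and testing sets."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    cut = (7 * len(idx)) // 10  # 70 percent, rounded down
    train, test = idx[:cut], idx[cut:]
    return ([samples[k] for k in train], [labels[k] for k in train],
            [samples[k] for k in test], [labels[k] for k in test])

def mean_over_runs(run_once, n_runs=10):
    """Average a scalar result over independent runs with different seeds."""
    return statistics.mean(run_once(seed) for seed in range(n_runs))
```

For a dataset of 303 instances, this split yields 212 training and 91 testing instances.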

Figure 1
Convergence Curve for Heart Disease.

Figure 2
Convergence Curve for Breast Cancer.
It was observed that the convergence rate of IMGWO was better than that of all other metaheuristic techniques (GWO, GA, and PSO) used for comparison. This was achieved by balancing the exploration and exploitation processes of GWO through the control variable a and the position updates of the wolves.

Figure 3
Convergence Curve for Hepatitis.

Figure 4
Convergence Curve for Parkinson's Disease.

The framework of IMGWO used a new expression for the control variable a (Equation 26) and a position updating rule (Equation 27). These two improvements significantly enhanced the convergence rate of IMGWO. The other significant improvement, i.e., avoidance of entrapment in local optima, could not be observed directly in the recorded parameters, but it is supported by the better convergence rate and reduced MSE obtained by IMGWO.
The average accuracy of all models is presented in Table 4. The results showed that IMGWO-MLP had a better accuracy rate than GWO, GA, and PSO. IMGWO-MLP also achieved better accuracies during the training phase. The best mean accuracies achieved by the IMGWO-MLP model were 90.11 percent, 92.82 percent, 84.78 percent, and 82.76 percent across the four medical datasets. The simulation results for sensitivity and specificity are presented in Tables 5-6, with the best values highlighted in bold. For designing an expert system for medical diagnosis, it is necessary to compare the true-positive rate (sensitivity) and the true-negative rate (specificity).
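For reference, accuracy, sensitivity, and specificity for a two-class problem follow directly from the confusion-matrix counts. The sketch below uses the standard definitions; the function names and the toy labels are illustrative assumptions and are not taken from the experiments.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for a binary classification result."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, tn, fp, fn

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity (TPR), and specificity (TNR)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity

# Toy example with made-up labels (1 = disease present, 0 = absent).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
print(binary_metrics(y_true, y_pred))
```

Reporting sensitivity and specificity separately matters in diagnosis because a classifier can reach high accuracy on an imbalanced dataset while still missing most positive cases.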
ROC curves were also computed, as shown in Figures 5-6. The ROC curve is considered an important measure in the field of medical data mining: it shows the true-positive classification rate obtained at the cost of the false-positive classification rate. The area under the curve (AUC) values are given in Table 7. IMGWO obtained better AUC values for all datasets, and these values were close to 1, which showed the significance of the proposed IMGWO-MLP model.
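A minimal sketch of how an ROC curve and its AUC can be obtained from the continuous outputs of a classifier is given below. It sweeps the decision threshold over the sorted scores and integrates with the trapezoidal rule; the function names and the toy scores are illustrative assumptions, not the procedure or data used in the study.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for k, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after a group of tied scores is complete.
        if k + 1 == len(ranked) or ranked[k + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example with made-up network outputs.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(y_true, scores)))
```

An AUC of 1 corresponds to a classifier that ranks every positive sample above every negative one, while 0.5 corresponds to random ranking.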

Figure 5
ROC Curve for IMGWO for Heart Disease.

Figure 6
ROC Curve for IMGWO for Parkinson's Disease.

CONCLUSION AND FUTURE SCOPE
In this study, a new metaheuristic algorithm, IMGWO, was developed for training the MLP model. The searching (exploration and exploitation) capability of GWO was enhanced with the concept of time-variant inertia. The proposed method proved its performance in terms of convergence. Four benchmark medical datasets were considered to evaluate the performance of the IMGWO-MLP model, and the results were compared with three popular MLP models. IMGWO can effectively optimize weights and biases and, in turn, improves the convergence rate as well as reduces MSE. Other performance measures such as accuracy, sensitivity, specificity, ROC, and AUC were also computed to assess the performance of IMGWO-MLP on all datasets. It was found that the proposed method outperformed the rest of the models on most of the datasets. The key points of the IMGWO-based training algorithm are highlighted as follows: (i) IMGWO has better management of exploration and exploitation; (ii) It is a well-defined model based on mathematical formulations, which makes it easy to comprehend and implement; and (iii) No technique-specific parameter adjustment is done; only some common parameters of population-based metaheuristic methods are adjusted.
Finally, it can be concluded that the proposed IMGWO technique is a better alternative to the contemporary metaheuristic methods compared in this study (GA, PSO, and GWO) for training the neural network, especially for medical data classification. Future work may apply IMGWO to other neural network techniques, such as deep neural networks or RNN, and investigate its performance.