ESTIMATION OF MISSING VALUES USING OPTIMISED HYBRID FUZZY C-MEANS AND MAJORITY VOTE FOR MICROARRAY DATA

Missing values are a major constraint in microarray technologies, hindering the identification of disease-causing genes. Estimating missing values is a problem that field experts inevitably face. Imputation is an effective way to fill in suitable values so that the subsequent steps of microarray analysis can proceed, and it may also increase classification accuracy. Although these methods can predict the values, it is the classification accuracy rate that demonstrates their ability to estimate the missing values in gene expression data. In this study, a novel method, the Optimised Hybrid of Fuzzy C-Means and Majority Vote (opt-FCMMV), was proposed to estimate the missing values in the data. Using the Majority Vote (MV) and optimisation through Particle Swarm Optimisation (PSO), this study predicted missing values in the data to form more informative and solid data. In order to verify the effectiveness of opt-FCMMV, several experiments were carried out on two publicly available microarray datasets (i.e. Ovary and Lung Cancer) under three missing value mechanisms with five different percentage values in the biomedical domain using Support Vector


INTRODUCTION
In many areas, data quality is a serious problem: millions of records are produced each day, and they are often noisy and incomplete. Missing data are especially ubiquitous in the healthcare sector, particularly in microarray experiments, which generate thousands of gene expression measurements with missing expression values. The consequences faced by real-world healthcare research centres, such as biased data and invalid inferences, undermine the purpose of the data (Suphanchaimat et al., 2017). Missing values arise from experimental errors, insufficient resolution, and scratches or dust on slides during laboratory processes (Yaraghi et al., 2012). As mentioned by Ouyang et al. (2004), virtually every microarray experiment contains missing expressions, and this affects more than 90% of the genes. Under these conditions, the extracted gene expression microarray datasets cannot guarantee complete and useful knowledge, which may compromise the validity of the data. Meanwhile, the fundamental goal of microarray data is to detect the expressions of thousands of genes, identify disease-causing genes (Pino Angulo et al., 2018), accelerate molecular biology experiments, and find the functions of genes, genetic networks, and biomarker genes (Li et al., 2010). Therefore, it is important to treat missing values before analysing microarray data.
Several missing value strategies have been developed and deployed for gene expression data to promote data quality and reliability. The common treatments of missing values for microarray data fall into three categories. The first, ignorance, is the simplest solution: records with missing values are deleted using listwise or pairwise deletion. Nevertheless, these deletion methods might drop a large amount of data in one pass and reduce the accuracy with which disease-causing genes are identified. The second category, tolerance, discards missing points in the data. Even though this is a low-cost solution, it might produce low-quality datasets. The third category, imputation, is regarded as one of the best approaches because it restores the whole dataset by filling in the missing values (Tian et al., 2012; Hourani & Emary, 2009). Accordingly, imputation attempts to construct a complete dataset and thereby increase the relevance of the data and the knowledge gained from it. Taking all of this into account, a new imputation method is proposed that imputes the missing values based on the existing values in the data, yielding more information and knowledge. Practically, the proposed method is realised as a hybrid of Fuzzy C-Means (FCM), Majority Vote (MV), and Particle Swarm Optimisation (PSO), which is termed opt-FCMMV in this study. The contribution of optimisation is to minimise or maximise an objective within the decision-making algorithm, and it is normally adapted to approximation methods (Shehab et al., 2018). Therefore, the central idea of the imputation method is to use optimisation as the key to improving the prediction of the missing values. In this study, opt-FCMMV is investigated as a solution for gene expression datasets.
An Optimised Hybrid of Fuzzy C-Means and Majority Vote (opt-FCMMV) using PSO is proposed to impute the missing values in order to provide better information on the data. The effectiveness of the proposed opt-FCMMV in terms of solution quality and computational efficiency was demonstrated at various levels of missing values (5%, 10%, 30%, 50%, and 80%) and under three missing value mechanisms, namely Missing at Random (MAR), Missing Completely at Random (MCAR), and Missing Not at Random (MNAR), on two publicly available microarray datasets. The datasets with different missing value levels and mechanisms were tested with the proposed method (opt-FCMMV), FCM, and Fuzzy C-Means with Majority Vote (FCMMV). The classification performance showed that the proposed method is able to produce a higher accuracy rate due to the optimisation by a metaheuristic algorithm, PSO. Considering the increasing demand for data analysis in various domains such as biomedicine, this study hopes to provide a new direction for missing value imputation by overcoming issues such as entrapment in local minima and high objective function values. The remainder of the article is divided into four main parts. The next section describes the theory related to missing values and presents a literature survey of the existing methods. Next, the missing value imputation method termed the Optimised Hybrid of Fuzzy C-Means and Majority Vote (opt-FCMMV) is proposed. Then, the experimental results obtained from thirty datasets are presented, and the conclusion of the study is given at the end.

RELATED WORKS
Missing value mechanisms can be divided into three main groups: Missing at Random (MAR), Missing Completely at Random (MCAR), and Missing Not at Random (MNAR) (Suphanchaimat et al., 2017; Dibal et al., 2017; Kellermann et al., 2016; Tshering et al., 2013). Under MAR, the probability that a value is missing does not depend on the unobserved data, although it may depend on the observed data. The missing data are randomly distributed, with equal probability, within one or more sub-samples of the data (Rubin, 1976); P(missing | observed, unobserved) = P(missing | observed) (Dibal et al., 2017). An example of an MAR scenario is that women are more likely to get breast cancer; however, the probability that a woman who comes for a breast cancer check-up gets a diagnosis is the same for all women. In contrast, under MCAR the probability of a missing value on one variable is unrelated to both the observed and the unobserved data; P(missing | observed, unobserved) = P(missing) (Dibal et al., 2017; Tshering et al., 2013). For example, a breast cancer test is performed on the patients, but the mammogram machine fails to function properly, so the results show missing points completely at random. Meanwhile, under MNAR the probability of a missing value depends on the value of the attribute itself, so P(missing | observed, unobserved) cannot be simplified. MNAR cannot be quantified because the missingness depends on the missing values themselves (Dibal et al., 2017; Tshering et al., 2013). An example of an MNAR scenario is that breast cancer patients might be required to undergo chemotherapy weekly, at which time they are screened to determine whether the cancer has grown or spread. If a patient fails to show up for the chemotherapy sessions, the missing data points are related to the unobserved spread of the cancer, and this is classified as MNAR.
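To make the three mechanisms concrete, the following minimal sketch injects missing entries into one column of a small table. It is an illustration under stated assumptions, not the injection procedure used in this study: the function name `inject_missing`, the `target` and `driver` column indices, and the rule of drawing MAR candidates from the rows with the largest `driver` values are all hypothetical choices.

```python
import random

def inject_missing(rows, rate, mechanism, target=0, driver=1, seed=0):
    """Return a copy of rows with None placed in column `target`.

    MCAR: rows are chosen uniformly at random.
    MAR:  rows are chosen at random among those with the largest values
          in an observed column (`driver`), so missingness depends only
          on observed data.
    MNAR: the rows with the largest `target` values lose that value, so
          missingness depends on the unobserved value itself.
    """
    rng = random.Random(seed)
    n = len(rows)
    n_missing = round(rate * n)
    order = list(range(n))
    if mechanism == "MCAR":
        chosen = rng.sample(order, n_missing)
    elif mechanism == "MAR":
        order.sort(key=lambda i: rows[i][driver], reverse=True)
        pool = order[:max(n_missing, n // 2)]
        chosen = rng.sample(pool, n_missing)
    elif mechanism == "MNAR":
        order.sort(key=lambda i: rows[i][target], reverse=True)
        chosen = order[:n_missing]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    out = [list(r) for r in rows]  # copy so the input stays intact
    for i in chosen:
        out[i][target] = None
    return out

# Toy data: column 0 rises from 0 to 9 while column 1 falls from 10 to 1.
data = [[float(i), float(10 - i)] for i in range(10)]
mcar = inject_missing(data, 0.3, "MCAR")
mar = inject_missing(data, 0.3, "MAR")
mnar = inject_missing(data, 0.3, "MNAR")
```

In this toy setting, the MNAR mask always removes the three largest values of column 0, whereas the MAR mask only removes values from rows whose column 1 is large, regardless of the value being removed.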
In reality, most research on microarray experiments has been devoted to MAR or MCAR mechanisms, while very little has been conducted on MNAR scenarios. The work of Lazar et al. (2016) is one of the motivations for this research. With knowledge of the missing value mechanism, it is practical to identify the appropriate analysis method for the datasets. In many situations, missing values must be imputed before the dataset can be analysed further (Bertsimas et al., 2017).
Many imputation methods have been proposed specifically for microarray datasets. Effective imputation approaches that have been used include clustering (Salleh & Samat, 2017) and classification algorithms (Tsai et al., 2018). Most articles proposed cluster-based algorithms and utilised high-dimensional microarray datasets with a large number of features and samples, which might directly affect the clustering performance (Keerin et al., 2016; Chattopadhyay et al., 2015; Gupta et al., 2015; Keerin et al., 2012). Moreover, the clustering performance is highly dependent on the number of clusters, and with such sample conditions the selection of the number of clusters is crucial. Therefore, researchers must handle this selection with a more detailed analysis in the corresponding part of the algorithm. Paul et al. (2017) utilised a pattern similarity matching algorithm, while Baraldi et al. (2015) used fuzzy similarity to impute missing values and an optimised fuzzy rule for gene selection. Indeed, gene selection is an important phase for pre-processing the data and improving the classification performance. However, missing values in the dataset must be handled well before identifying the disease-causing genes. The main disadvantage of pattern similarity matching is that the distance is dominated by the dimensions with high dissimilarity (Tung et al., 2006), which might be a drawback when imputing the missing values in the data.
Some articles proposed Fuzzy C-Means as the clustering algorithm for handling missing values (Saha et al., 2016; Pourhasem et al., 2010). However, the main drawback of fuzzy clustering is its sensitivity at the initialisation phase, which decreases the efficiency of the method. This is the main reason FCM is hybridised with MV in this research. Furthermore, one of the commonly used methods for missing values is k-nearest neighbour (kNN) imputation (De Silva & Perera, 2017; Suyundikov et al., 2015; Keerin et al., 2012). When kNN is used to impute the missing values, the intra-cluster dissimilarity is measured using the summation of distances between the data. However, the drawbacks of kNN imputation are the choice of the distance function, the time consumed on large databases, and the choice of the number of neighbours (Edgar & Rodirguez, 2004). Additionally, Local Least Squares Imputation (LLSimpute) is commonly used to estimate missing values. Yu et al. (2017), Bose et al. (2013), and Qin and Lee (2010) used LLSimpute to estimate missing values in microarray gene expression data. Nevertheless, one disadvantage of LLSimpute is that the optimal number of neighbours is found by a heuristic search, which might raise the computational cost of the algorithm.
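The kNN drawbacks mentioned above (choice of distance function, choice of the number of neighbours, and cost on large data) are easier to see in code. The sketch below is a generic illustration rather than any of the cited implementations: the function name `knn_impute`, the Euclidean distance, and the restriction of neighbours to fully observed rows are assumptions made for brevity.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows.

    Distances are Euclidean and use only the columns observed in the
    incomplete row. Every incomplete row is compared with every complete
    row, which is why the cost grows quickly on large datasets.
    """
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("at least one complete row is required")
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        observed = [j for j, v in enumerate(r) if v is not None]

        def dist(c):
            return math.sqrt(sum((r[j] - c[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=dist)[:k]
        filled = [
            v if v is not None
            else sum(c[j] for c in neighbours) / len(neighbours)
            for j, v in enumerate(r)
        ]
        out.append(filled)
    return out

table = [[1.0, 2.0], [1.2, 2.2], [5.0, 6.0], [1.1, None]]
result = knn_impute(table, k=2)
```

Here the missing entry of the last row is filled with the average of the second column of its two closest complete rows; changing `k` or the distance inside `dist` changes the estimate, which is the sensitivity noted in the text.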
To conclude, the evidence from recent research suggests that advances in optimisation have shown promising progress in machine learning and can be applied to missing value problems. This idea can be used to solve the missing value issues in microarray datasets, as the optimisation process is able to offer effective solutions in difficult scenarios. Optimisation can be used to minimise the missing data error (Marwala, 2009). Besides imputing the best predicted missing values, the proposed method is able to provide informative data. The proposed method in this article aims to utilise the power of optimisation and hybridisation to impute accurate values. The advantages of this new method are: a) construction of values that are more accurate; and b) use of optimisation to minimise the difference measure between the cluster centres and the values, which directly minimises the data error. Moreover, ensuring the reliability and validity of data with high accuracy rates is an important challenge in missing value scenarios.

RESEARCH DESIGN
In this section, a new missing value imputation method termed the Optimised Hybrid of Fuzzy C-Means and Majority Vote (opt-FCMMV) is proposed for microarray datasets. Although FCM is able to impute missing values, there is room to improve it. Therefore, FCM is hybridised with MV in this study. Through this, MV is able to construct many candidate values for the missing data, from which the best is selected as the estimate of the missing values. After this hybridisation, FCMMV is optimised using PSO, whereby the role of optimisation is to minimise the measures between the cluster centroids and the data errors. The proposed opt-FCMMV is tested under the MAR, MCAR, and MNAR missing value mechanisms.

A Sample Example
As an example, the injection of missing values based on the MAR, MCAR, and MNAR mechanisms is presented using samples from the Ovary Cancer dataset. Figure 1 illustrates a sample with 5% missing values under all mechanisms, while Figure 2 shows a sample with 80% missing values under all mechanisms. The two illustrations demonstrate a major difference in the level of missing values in the data. The number of randomly injected missing values for each dataset was calculated from the percentage of missing values and the total number of genes in the dataset, according to the missing value mechanism. The amount of red indicates the number of missing values while green indicates the range of 1. As can be seen in the figures, the illustrated missing values can be expected to be improved using the proposed method.

Proposed Method
The central idea of the proposed opt-FCMMV method is presented in Figure 3. The figure shows that the algorithm begins with the FCM algorithm hybridised with MV in order to impute values in the gene expression data. Here, MV compares the generated values and aggregates the votes on them to choose the best values to be imputed. The imputed values are then used to initialise the particles, which are evaluated. The purpose of the PSO optimisation is to minimise the error rate and to train PSO with a complete dataset in order to estimate the values that correspond to the imputed input values, stopping when the fitness variance falls below a threshold value. In this way, the best optimised values are imputed into the missing data of the datasets. A detailed explanation is given in the following sections.

Hybrid Fuzzy C-Means with Majority Vote
In fuzzy clustering algorithms, an object may belong to more than one cluster, each with a degree of membership (Bezdek et al., 1981). The FCM algorithm was originally introduced by Dunn (1973) to obtain well-separated clusters and was later enhanced by Bezdek et al. (1981). However, in this research, FCM is improved by hybridising it with MV so that the best imputation values are identified in the gene expression data. The main steps of the FCMMV imputation method, based on the idea of Zhang and Shen (2014), are as follows.
Step 1: The parameter values of the cluster size and the weighting factor, m, are set and the membership function, U, is initialised.
Step 2: The cluster centroids, c = {c_1, c_2, …, c_k}, are calculated based on Equation 1, where m is the weighting factor (a real number) that influences the fuzzy degree of clustering. The membership function, U(x_i, c_k), is defined in Equation 2 for the cluster centres, for all x_i, where d(x_i, c_k) is the distance between the data point, x_i, and the centroid, c_k. This distance is calculated through Equation 3, where p = 2 and p = 1 indicate the Euclidean and Manhattan distances, respectively, both of which are cases of the Minkowski distance. This research article utilises the value of p = 1.5.
Step 3: The objective function is defined and minimised. The optimal values are searched based on U and C as stated in Equation 4.
Step 4: The termination condition is met if the objective function value falls below the preset threshold, if the difference between the objective function values of two successive iterations is less than the preset threshold, or if the number of successive iterations reaches the preset maximum. If the condition is met, the next step is carried out; otherwise, the U values are updated based on Equation 2 and the procedure returns to Step 2.
Step 5: The optimal values of U and C are obtained in order to estimate the missing attribute values of x_i in accordance with Equation 5, where x_ij represents the missing value that acts as the non-reference attribute.
Step 6: K is considered as the target label, with C_i representing the i-th predicted target label. An input x is provided with respect to the target labels, yielding a total of K predictions, i.e. P_1, …, P_K. MV aims to produce a combined prediction of the estimated missing attributes for input x, P(x) = j, from all the K predictions, i.e. P_k(x) = j_k, k = 1, ..., K. A binary function is used to represent the votes as in Equation 6. The votes from all K predictions are summed for each C_i, and the label that receives the highest (gbest) vote gives the final estimate of the missing values of the predicted class. If no label obtains the highest vote, the procedure returns to Step 4 until the highest vote is obtained to select the best missing values in the data.
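The steps above can be sketched in code as follows. This is a minimal illustration that assumes the conventional FCM forms for Equations 1 to 5 (written out in the comments), which may differ in detail from the exact expressions used in this study. Several choices are assumptions for brevity: centroids are initialised from randomly chosen complete rows instead of initialising U, clustering is run on complete rows only, and `majority_vote` shows only the binary vote tally of Step 6. The names `minkowski`, `memberships`, `fcm_impute`, and `majority_vote` are hypothetical.

```python
import random

def minkowski(a, b, cols, p=1.5):
    # Assumed Eq. 3: d(x_i, c_k) = (sum_j |x_ij - c_kj|^p)^(1/p)
    return sum(abs(a[j] - b[j]) ** p for j in cols) ** (1.0 / p)

def memberships(x, centroids, cols, m, p):
    # Assumed Eq. 2: u_ik = 1 / sum_l (d(x_i, c_k) / d(x_i, c_l))^(2/(m-1))
    d = [max(minkowski(x, c, cols, p), 1e-12) for c in centroids]
    e = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dl) ** e for dl in d) for dk in d]

def fcm_impute(rows, n_clusters=2, m=2.0, p=1.5, max_iter=100, tol=1e-6, seed=0):
    rng = random.Random(seed)
    complete = [r for r in rows if None not in r]
    cols = range(len(rows[0]))
    # Step 1 (simplified): start from randomly chosen complete rows.
    centroids = [list(r) for r in rng.sample(complete, n_clusters)]
    prev = float("inf")
    for _ in range(max_iter):
        # Step 2: update memberships, then centroids.
        # Assumed Eq. 1: c_k = sum_i u_ik^m x_i / sum_i u_ik^m
        U = [memberships(x, centroids, cols, m, p) for x in complete]
        for k in range(n_clusters):
            w = [u[k] ** m for u in U]
            centroids[k] = [
                sum(wi * x[j] for wi, x in zip(w, complete)) / sum(w)
                for j in cols
            ]
        # Step 3, assumed Eq. 4: J = sum_i sum_k u_ik^m d(x_i, c_k)^2
        J = sum(
            U[i][k] ** m * minkowski(x, centroids[k], cols, p) ** 2
            for i, x in enumerate(complete)
            for k in range(n_clusters)
        )
        # Step 4: stop when J changes by less than the threshold.
        if abs(prev - J) < tol:
            break
        prev = J
    # Step 5, assumed Eq. 5: x_ij = sum_k u_ik c_kj for each missing x_ij,
    # with memberships computed from the observed columns only.
    out = []
    for r in rows:
        obs = [j for j in cols if r[j] is not None]
        u = memberships(r, centroids, obs, m, p)
        out.append([
            r[j] if r[j] is not None
            else sum(u[k] * centroids[k][j] for k in range(n_clusters))
            for j in cols
        ])
    return out

def majority_vote(predictions):
    # Step 6 tally: a vote is 1 if predictor k chose label j, else 0.
    labels = sorted(set(predictions))
    votes = {j: sum(1 if pk == j else 0 for pk in predictions) for j in labels}
    return max(labels, key=lambda j: votes[j]), votes

rows = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [5.0, 5.1],
        [4.9, 5.0], [5.1, 4.9], [1.0, None], [5.0, None]]
imputed = fcm_impute(rows)
winner, tally = majority_vote([1, 2, 2, 3, 2])
```

Because each imputed entry is a membership-weighted average of centroid coordinates, it always lies within the range of the observed values of that column. How the vote winner is mapped back to an imputed value is not fixed by this sketch.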

Optimised Hybrid of Fuzzy C-Means with Majority Vote
For the optimisation, Particle Swarm Optimisation and Support Vector Machine (PSOSVM) is selected for this research due to the strong optimisation bond between both methods based on Salleh and Samat's (2017) work on the PSO algorithm. Three steps are used on each gene attribute one by one and the attribute outputs are combined into the output that corresponds to the input. Therefore, the SVM model is trained, "input gene attribute values = output gene attribute values". opt-FCMMV is the missing value imputation method proposed in this article. The imputation of FCMMV is to identify the missing values in the dataset, whereby the parameters K and m are optimised (with the assistance of PSOSVM) with the best K votes. The purpose of the PSO algorithm with SVM in this research is to minimise the error rate. The objective function is minimised via (Input-Output) 2 , where the input is the FCMMV xi in accordance with Equation (5): where ij x  represents the missing value that acts as the non-reference attribute.
Step 6: K is considered as the target label with Ci, representing i label. Given as input x, provided with respect to the target labels, yielding a total of P1,…,Pk. MV aims to produce a combined predictions of the estimated missing attr P(x)=j, j   from all the K predictions, i.e. A binary f represent the votes as in Equation (6): The sum of the votes from all K for each Ci and the label that receives the highest final phase of estimating missing values of the predicted class. If failed to get the h return to Step 4 till the highest vote is obtained to select the best missing values in th

Optimised Hybrid of Fuzzy C-Means with Majority Vote
For the optimisation, Particle Swarm Optimisation and Support Vector Machi selected for this research due to the strong optimisation bond between both method and Samat's (2017) work on the PSO algorithm. Three steps are used on each gen one and the attribute outputs are combined into the output that corresponds to the inp SVM model is trained, "input gene attribute values = output gene attribute values". o missing value imputation method proposed in this article. The imputation of FCMMV missing values in the dataset, whereby the parameters K and m are optimised (with PSOSVM) with the best K votes. The purpose of the PSO algorithm with SVM in minimise the error rate. The objective function is minimised via (Input-Output) 2 , the FCMMV imputation and the output is the SVM prediction. Before the final opti the missing values in the dataset, SVM must be trained with a complete dataset in o xi in accordance with Equation (5): where ij x  represents the missing value that acts as the non-reference attribute.
Step 6: K is considered as the target label with Ci, representing i th pre label. Given as input x, provided with respect to the target labels, yielding a total of K pre P1,…,Pk. MV aims to produce a combined predictions of the estimated missing attributes P(x)=j, j   from all the K predictions, i.e. A binary functio represent the votes as in Equation (6): The sum of the votes from all K for each Ci and the label that receives the highest gbest final phase of estimating missing values of the predicted class. If failed to get the highes return to Step 4 till the highest vote is obtained to select the best missing values in the data

Optimised Hybrid of Fuzzy C-Means with Majority Vote
For the optimisation, Particle Swarm Optimisation and Support Vector Machine (P selected for this research due to the strong optimisation bond between both methods bas and Samat's (2017) work on the PSO algorithm. Three steps are used on each gene attr one and the attribute outputs are combined into the output that corresponds to the input. T SVM model is trained, "input gene attribute values = output gene attribute values". opt-FC missing value imputation method proposed in this article. The imputation of FCMMV is to missing values in the dataset, whereby the parameters K and m are optimised (with the PSOSVM) with the best K votes. The purpose of the PSO algorithm with SVM in this r minimise the error rate. The objective function is minimised via (Input-Output) 2 , where the FCMMV imputation and the output is the SVM prediction. Before the final optimal im the missing values in the dataset, SVM must be trained with a complete dataset in order estimate the values that correspond to the input. xi in accordance with Equation (5): where ij x  represents the missing value that acts as the non-reference attribute.
Step 6: K is considered as the target label with Ci, representing i th predicte label. Given as input x, provided with respect to the target labels, yielding a total of K predicti P1,…,Pk. MV aims to produce a combined predictions of the estimated missing attributes for P(x)=j, j   from all the K predictions, i.e. A binary function is represent the votes as in Equation (6): The sum of the votes from all K for each Ci and the label that receives the highest gbest vote final phase of estimating missing values of the predicted class. If failed to get the highest vo return to Step 4 till the highest vote is obtained to select the best missing values in the data.

Optimised Hybrid of Fuzzy C-Means with Majority Vote
For the optimisation, Particle Swarm Optimisation combined with Support Vector Machine (PSOSVM) was selected for this research due to the strong optimisation bond between both methods, based on Sal… and Samat's (2017) work on the PSO algorithm. Three steps are used on each gene attribute one by one, and the attribute outputs are combined into the output that corresponds to the input. Therefore, the SVM model is trained so that "input gene attribute values = output gene attribute values". opt-FCMMV is the novel missing value imputation method proposed in this article. The imputation of FCMMV is to identify the missing values in the dataset, whereby the parameters K and m are optimised (with the assistance of PSOSVM) with the best K votes. The purpose of the PSO algorithm with SVM in this research is to minimise the error rate. The objective function is minimised via (Input − Output)^2, where the input is the FCMMV imputation and the output is the SVM prediction. Before the final optimal imputation of the missing values in the dataset, SVM must be trained with a complete dataset in order to recall and estimate the values that correspond to the input.
Step 1: The samples without any missing values are selected.
Step 2: One of the input gene attributes is set; the values that are missing act as the output gene attributes, which are also the condition gene attributes.
Step 3: SVM is used to predict each value of the gene attribute.
Step 4: X_c represents the complete data, while X_m represents the missing data. The input is given by Equation 7 and the output by Equation 8, where f represents the mapping between the input and output of the SVM model.
Step 5: The input data are recalled in the SVM model, and the difference between the input and the output is known as the error. PSO is used to minimise this error, as shown in Equation 9. Minimising the objective function yields an approximate value for each missing value. Equation 10 gives the objective function of PSO, and the outputs are used to minimise the objective function values.
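The search in Step 5 can be illustrated with a small, generic PSO loop. This is a sketch under stated assumptions rather than the method as implemented in the study: `pso_minimise` and `toy_squared_error` are hypothetical names, the inertia and acceleration constants are common textbook defaults, K is treated as a continuous value for simplicity, and the toy objective only stands in for the (Input − Output)^2 error between the FCMMV imputation and the SVM prediction.

```python
import random


def pso_minimise(objective, bounds, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `objective` over box `bounds` with a basic global-best PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen by each particle
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen by the swarm

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Move the particle and clamp it to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_squared_error(params):
    """Stand-in for (Input - Output)^2 with an assumed optimum at K=3, m=2."""
    k, m = params
    return (k - 3.0) ** 2 + (m - 2.0) ** 2


# Search K in [2, 10] and m in [1.1, 5]; both ranges are illustrative assumptions.
best_params, best_error = pso_minimise(toy_squared_error, [(2.0, 10.0), (1.1, 5.0)])
```

In a real pipeline the toy objective would be replaced by a function that runs the imputation with the candidate parameters, recalls the result through the trained SVM, and returns the squared difference.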

EXPERIMENTAL RESULTS
This study empirically evaluated opt-FCMMV by comparing its performance with the FCM and FCMMV algorithms. Experiments were conducted on a total of fifteen datasets in the biomedical domain. In Experiment 1 and Experiment 2, opt-FCMMV was compared with FCM and FCMMV using an SVM classifier at different levels of missing values to examine the efficiency of the proposed method. The SVM classifier was used with the default parameter values and the Radial Basis Function (RBF) kernel (Wahyudi et al., 2010) provided in the LibSVM software package. Experiment 1 used the Ovary Cancer dataset, whereas Experiment 2 used the Lung Cancer dataset. The differences between the methods under the MAR, MCAR, and MNAR mechanisms were calculated based on accuracy rates and Root Mean Squared Error (RMSE). The formulae used are given in Equations 11 and 12 (Shcherbakov et al., 2013; Kouchaki et al., 2018):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{11}$$

where TP = true positive, TN = true negative, FP = false positive, and FN = false negative.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2} \tag{12}$$

where $y_i$ is the original value, $\hat{y}_i$ is the estimate compared against it (the mean of the observed data), and $n$ is the total number of predictions.
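As a worked illustration of the two evaluation measures, the sketch below computes accuracy from confusion-matrix counts and RMSE from paired original and estimated values. The function names and the numbers in the usage lines are illustrative assumptions and are not results from the experiments.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Share of correct predictions: (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


def rmse(original, estimated):
    """Root mean squared error between original and estimated values."""
    if not original or len(original) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((y - e) ** 2 for y, e in zip(original, estimated))
    return math.sqrt(squared / len(original))


# Illustrative numbers only.
acc = accuracy(tp=40, tn=45, fp=5, fn=10)
err = rmse([1.0, 2.0], [1.0, 4.0])
```

Accuracy reflects how the imputed data affect the downstream classifier, while RMSE measures how close the imputed values are to the values that were removed.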

Datasets
To verify the efficiency and effectiveness of opt-FCMMV, Experiment 1 consisted of a total of 15 sub-datasets created from the Ovary Cancer dataset (Zhu et al., 2007). The dataset contained 15,154 genes and 254 instances with two classes, Normal and Cancerous. The dataset used for this research was a normalised dataset without any missing values. The statistics of the dataset are summarised in Table 1. A sample of the dataset is illustrated in Table 2.
Experiment 2 used the Lung Cancer dataset (Zhu et al., 2007) with five classes. The dataset contained 12,600 genes and 204 instances. The dataset used for this research was a complete, non-normalised dataset without any missing values. The statistics of the dataset are summarised in Table 3. A sample of the dataset is illustrated in Table 4. The dataset was normalised to the range [0, 1] using Equation 14 (Wenzel & Peter, 2017) to reduce redundancies and data anomalies:

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{14}$$

where $X'$ = the new value for variable X, $X$ = the current value for variable X, $X_{\min}$ = the minimum data point, and $X_{\max}$ = the maximum data point in the dataset.

Experiment 1
Table 5 also depicts the MCAR scenarios. The proposed method showed higher accuracy rates here as well. Accuracy rose from 85.4% to 87.0% using the opt-FCMMV method for 5% missing values (see Table 5). The same holds for all other rates of missing values, which also showed high accuracy rates. Additionally, for the MNAR scenario, there were missing data columns from 5% (13) to 80% (203). The accuracy rates improved from 83.0% using FCM with missing values to 94.1% using the proposed method (see Table 5). Across the experiments in Table 5, the proposed method achieved its best performance, with the highest accuracy rates, using SVM. With opt-FCMMV, all the experiments with different mechanisms and missing value rates (5%, 10%, 30%, 50%, and 80%) demonstrated a clear performance improvement compared with the methods before enhancement (FCM and FCMMV). Here, MV helped to select the most suitable values for imputation by choosing the highest-voted values. Optimisation allowed many feasible candidate values to be explored, so that the best-predicted values could be selected for the missing regions under the respective mechanisms.

Experiment 2
This section reports Experiment 2 with the Lung Cancer dataset to identify the effectiveness of the proposed method. Tables 8 to 10 for Experiment 2 illustrate the different missing value ratios and the comparisons between no handling of missing values, FCM, FCMMV, and opt-FCMMV. As shown in Table 6, opt-FCMMV maintained high accuracy rates as the missing ratio increased. At the 80% missing ratio, the proposed method improved accuracy by 20.2% over the initial accuracy obtained with no imputation and with FCM. Overall, Table 6 shows that opt-FCMMV achieved higher accuracy rates than the other methods. The exception was the 5% missing ratio, where FCMMV obtained 72.9%, which was higher than opt-FCMMV's 71.4%. This is due to the poor measure of the MNAR mechanism at the 5% missing value ratio.

DISCUSSION
In this article, both experiments used two existing baselines, namely no imputation and FCM. Meanwhile, the improved FCM variants, FCMMV and opt-FCMMV, were also used to evaluate the accuracy rates of the imputation method on microarray data. Accuracy rate and RMSE were used to measure the credibility of the algorithm, since RMSE can show how the methods' errors increase or decrease as the sample size grows at any missing value rate. Accuracy rate was used to measure the performance of the methods because the quantity of missing information increases with the number of missing values, which in turn affects the accuracy rates.
For Experiment 1, all accuracy rate results showed that opt-FCMMV was the best method to impute the missing values. In Experiment 2, 14 out of 15 experiments supported the credibility of the proposed method based on accuracy rates. These results showed the ability of the proposed method to impute the missing values in the data at both smaller and larger ratios. For RMSE, almost all mechanisms (MCAR, MNAR, and MAR) in Experiments 1 and 2 showed the lowest values for the proposed method, supporting its advantages. One of the major advantages of the proposed method is that the algorithm uses the information from the data itself to predict the missing values. This is also due to MV, which assisted in choosing the best measurement for gene similarity. Another advantage of this method is the optimisation itself. Optimising the coefficients of the non-missing values of the similar genes via the proposed method made it possible to obtain the nearest gene measurements in accordance with the class of the genes. Furthermore, this method worked well for a large number of missing values. This is due to the PSO algorithm's search strategy, which minimised the error rate and thereby directly improved the accuracy rates and lowered the RMSE values.

CONCLUSION
In this article, a new imputation method, known as Optimised Hybrid of Fuzzy C-Means and Majority Vote (opt-FCMMV), was proposed. This new method created a more solid and informative dataset than the other methods due to its optimisation step. Accordingly, the accuracy rates increased with each improvement from FCM to FCMMV to opt-FCMMV. The experimental results indicate that the proposed method can be a credible method for upcoming research in handling missing values. In this article, the proposed method was compared against three alternatives (i.e. no imputation, FCM, and FCMMV) at five missing value percentages (i.e. 5%, 10%, 30%, 50%, and 80%). The Ovary and Lung Cancer microarray data were used as datasets from the biomedical field. The results show that opt-FCMMV can handle high-dimensional problems and improve accuracy across the different missing value percentages. In the future, opt-FCMMV can also be applied in different domains, while other imputation methods and metaheuristic algorithms for optimisation can be investigated. opt-FCMMV can be considered a promising imputation method for the pre-processing stage in future research in the biomedical field.