Multilabel Over-sampling and Under-sampling with Class Alignment for Imbalanced Multilabel Text

Simultaneous multiple labeling of documents, also known as multilabel text classification, will not perform optimally if the classes are highly imbalanced. Class imbalance entails skewness in the underlying data distribution, which makes classification more difficult. Random over-sampling and under-sampling are common approaches to solve the class imbalance problem. However, these approaches have several drawbacks; under-sampling is likely to dispose of useful data, whereas over-sampling can heighten the probability of overfitting. Therefore, a new method that can avoid discarding useful data and overfitting problems is needed. This study proposed a method to tackle the class imbalance problem by combining multilabel over-sampling and


INTRODUCTION
Multilabel classification is a task applied in various data mining applications, such as labeling video, images, music, and texts. Multilabel classification assigns documents to several classes at the same time based on the classes to which they belong. This task differs from traditional single-label classification, which associates each document with one class. Single-label classification can also be considered as multiclass or binary classification. In multiclass classification, each document can belong to one of several label categories, but only one label category is assigned. Multilabel classification, in contrast, is a generalization of multiclass and binary classification, as it does not enforce any limit on the number of components that are held at the outputs (Charte et al., 2013; Siblini et al., 2019). Methods of multilabel text classification suffer from a high level of class imbalance and, because of that, do not work efficiently (Ali et al., 2019; Glazkova, 2020; Japkowicz & Stephen, 2002; Koziarski et al., 2020). The main issue in class imbalance occurs when a certain class has a far higher number of instances than other classes (Tanha et al., 2020; Weng et al., 2018; Zhang et al., 2020). In actual conditions, skewness often occurs in the distribution because examples of certain classes rarely appear. Such an issue biases learning algorithms toward the majority classes. Numerous solutions for imbalanced classification have been proposed by García et al. (2018), Pereira et al. (2020), Patel et al. (2020), Qiao et al. (2017), and Song et al. (2016). However, previous works principally focused on binary classification, which is less complex than multilabel imbalanced classification (Cascar et al., 2019; Sáez et al., 2016). Results of earlier studies have shown that random over-sampling and under-sampling are the most efficient approaches to solve imbalanced classification (García et al., 2018; Sáez et al., 2016; Zhang et al., 2020). However, the approaches have several drawbacks; under-sampling is likely to dispose of useful data, whereas over-sampling can heighten the probability of overfitting (Charte et al., 2015; Qiao et al., 2017; Sáez et al., 2016). Therefore, a new method that can avoid discarding useful data and overfitting problems is needed.
This study proposes a method for handling the class imbalance problem in a multilabel learning model for text classification, based on a new sampling scheme and class alignment. The proposed method combines multilabel over-sampling and under-sampling with class alignment, and is called the ML-OUSCA algorithm. This combination aims to deal with the limitations of previous approaches in tackling the class imbalance problem. Likewise, this study aims to balance the classes in the training set examples by joining and exploiting the power of over-sampling, under-sampling, and non-sampling methods. Specifically, the proposed method draws a new training set using under-sampling by discarding only a few non-useful majority class samples from the set. The discarding strategy in the under-sampling method is based on the interdependency between the training set samples. By contrast, over-sampling is performed by duplicating a few randomly selected minority class samples from the dataset. In addition, the classes whose sizes are neither too high nor too low are identified as class alignment (balanced classes), and their samples are kept without resorting to over-sampling or under-sampling. This work is henceforth structured into five sections. The next section presents a short review of previous related studies. The proposed method and its assessment are presented in the following section. The fourth section describes the experiments carried out. Finally, the last section concludes the work.

RELATED WORKS
Multilabel class imbalance has been a prominent topic in the artificial intelligence (AI) community in recent years (Daniels & Metaxas, 2017). Class imbalance also affects multilabel learning, whereby the high and low instance distributions of each label are largely imbalanced and broadly varied. The situation is exacerbated in the presence of numerous labels and low densities (Maheshwari et al., 2017; Zhang et al., 2020). Besides, the level of imbalance among multilabel datasets is greater than that of binary or multiclass datasets (Charte et al., 2013).
Class imbalance is generally resolved using the under-sampling method. For instance, Rao and Reddy (2020) presented an under-sampling strategy (i.e., KNN-US) that removes the less prominent instances from majority subsets to handle imbalanced datasets. Their method identified the mostly misclassified instances based on the k-nearest neighbor (KNN) technique. Onan (2019) presented a consensus clustering-based under-sampling method to lessen the number of instances of the majority class. Lin et al. (2017) presented two under-sampling strategies that also utilize the clustering technique. Zhang et al. (2018) proposed an approach based on the stacking and inverse random under-sampling methods, in which inverse random under-sampling was used to under-sample the majority class samples and stacking was applied to separate and classify the minority from the majority class. The inverse random under-sampling method was also employed by Tahir et al. (2012) before applying an ensemble classifier. A bidirectional resampling method at the data level, multilabel decoupling bidirectional resampling (ML-DBR), was proposed by Zhou et al. (2020). The disparity of the labels was minimized by decoupling the extremely concurrent data of the majority and minority labels and by calculating the effect of the labels during resampling. The independence of the instances was then guaranteed. The ML-DBR approach was tested using seven benchmark multilabel datasets, including the Enron text dataset. The results showed that the proposed method was able to outperform several methods, namely REMEDIAL, REMEDIAL-HwR-ROS, and REMEDIAL-HwR-HUS (Charte et al., 2019). Three classifiers were used to classify the dataset, namely label powerset (LP), binary relevance (BR), and multilabel k-nearest neighbor (ML-kNN) (Zhang & Zhou, 2007). In terms of Micro-F values, the usage of ML-DBR helped the employed classifiers to achieve higher results than the compared methods on five out of seven datasets. Kim et al. (2019) presented the principles of the under-sampling technique to solve a class imbalance problem. Pereira et al. (2020) presented a Multilabel Tomek Link (MLTL) based on the Tomek Link resampling method. This under-sampling algorithm detected and eliminated the so-called Tomek links from the multilabel dataset. A pair of instances was considered a Tomek link if they were nearest neighbors but belonged to different groups. In addition to being a subsampling method, MLTL could be implemented in a post-process cleaning stage for the ML-SMOTE method. According to Pereira et al. (2020), the justification for using it as a post-process cleaning stage relied on the fact that the class groups were typically not well specified after applying ML-SMOTE, i.e., some instances from the majority class might invade the space of the minority class or vice versa. Consequently, the feature space could be cleaned and the edges between classes smoothed by the MLTL method.
Over-sampling is the second widely used method to resolve class imbalance. Sáez et al. (2016) applied this method in analyzing class characteristics, whereby the subsets of certain instances were identified in each class and increased individually. A novel reverse-nearest neighborhood-based over-sampling method for the class imbalance of a multilabel dataset was introduced by Sadhukhan and Palit (2019). The reverse nearest neighborhood of a query point consists of all those points that include the query point as one of their neighbors. The proposed method was tested using ten multilabel datasets, including the Enron text dataset. The results showed that the proposed method was able to outperform label-specific features (LIFT) (Zhang & Wu, 2014), random k-labelset (RAKEL) (Tsoumakas et al., 2010), instance-based learning by logistic regression (IBLR) (Cheng & Hüllermeier, 2009), cross-coupling aggregation (COCOA) (Zhang et al., 2020), calibrated label ranking (CLR) (Fürnkranz et al., 2008), synthetic minority over-sampling technique (SMOTE) (Chawla et al., 2002), adaptive synthetic sampling (ADASYN) (He et al., 2008), and USAM (Fernández et al., 2017). In terms of F-measure, the proposed method achieved higher results than the compared methods on nine out of ten datasets. Last et al. (2017) presented a combination of k-means clustering and SMOTE over-sampling, which was called K-means SMOTE. The proposed method avoided noise generation and effectively overcame the imbalance problem between and within classes. Another over-sampling technique was introduced by Abdi and Hashemi (2016) based on the Mahalanobis distance. Moreo et al. (2016) presented a new over-sampling method, i.e., distributional random over-sampling (DRO), explicitly designed for imbalanced text datasets for which the distributional hypothesis holds, according to which the importance of a feature is somehow determined by its distribution in large corpora of data. The proposed method generated new random minority class synthetic documents by exploiting the distributional properties of the terms in the collection. The proposed method was evaluated on three multilabel datasets, including the Reuters-21578 text dataset. It was compared against random over-sampling (RO), SMOTE (Chawla et al., 2002), BSMOTE (Han et al., 2005), and DECOM (Chen et al., 2011). The proposed method obtained higher results than the comparative methods on all datasets in terms of F-measure. Li et al. (2014) presented an over-sampling approach that used the clustering technique and the Euclidean distance. Meanwhile, Rivera (2017) introduced an over-sampling approach based on noise reduction and selective sampling of the minority class to achieve good predictive abilities concerning its membership. Another widely used modification of over-sampling was the synthetic minority over-sampling technique (SMOTE) (Charte et al., 2015; Díez-Pastor et al., 2015; Jian et al., 2016). Koziarski et al. (2019) presented a radial-based over-sampling (RBO) method, which could find areas where artificial instances of the minority class must be created based on estimating the imbalanced distribution of defects with radial basis functions. Two over-sampling methods, namely borderline-SMOTE1 and borderline-SMOTE2, were presented by Han et al. (2005) to over-sample the minority examples around the borderline.
The hybrid sampling method proposed by several studies (Dubey et al., 2014; Shi et al., 2018; Song et al., 2016; Wang, 2014) is a combination of the under-sampling and over-sampling techniques. This method showed promising results in comparison with the stand-alone methods. Dubey et al. (2014) carried out a systematic analysis of various sampling techniques by studying the effectiveness of different rates and types of under-sampling and over-sampling and a combination of both methods. Shi et al. (2018) proposed an under-sampling that selected the informative instances and features from the original dataset, whereas over-sampling balanced the majority class instances. Song et al. (2016) proposed a hybrid of SMOTE and an under-sampling technique by applying k-means. Wang (2014) proposed a simple integration between under-sampling and over-sampling to improve the classification result of the support vector machine (SVM). All the results reported in the studies above showed that the hybrid sampling method is better than the stand-alone methods in terms of classification performance. For instance, in Song et al. (2016), the proposed hybrid of under-sampling and over-sampling achieved an F-measure 6.4 percent higher than the under-sampling method across four datasets, whereas over-sampling in isolation achieved 2 percent lower than the hybrid sampling method.
Other types of hybrids entail the combination of one of the sampling methods and other methods, such as the combination of SMOTE and the artificial immune recognition system (AIRS) (Wang & Adrian, 2013). Fang et al. (2017) presented a new method dealing with the imbalance problem for multilabel classification called DEML. DEML transformed the whole label set of the multilabel dataset into some subsets, and each subset was treated as a multilabel dataset with balanced class distribution to solve the class imbalance problem. DEML was tested using ten multilabel datasets, including the Bibtex and Enron datasets. The results showed that the proposed method was able to outperform CLR (Fürnkranz et al., 2008), RAkEL (Tsoumakas et al., 2010), ensemble of classifier chains (ECC) (Read et al., 2011), ML-kNN (Zhang et al., 2007), and BR (Tsoumakas et al., 2007). DEML achieved a higher average rating in terms of the micro-F1 and macro-F1 values. Xu et al. (2020) presented a hybrid of SMOTE and under-sampling with nearest neighbor based on random forest to solve the class imbalance problem. Galar et al. (2013) presented a novel approach to improve the ensembles of classifiers via a combination of under-sampling and boosting techniques known as EUSBoost. Feng et al. (2020) presented a hybrid method, cluster-based under-sampling and SMOTE (CUSS), to handle class imbalance classification. Sun and Lee (2017) presented a two-stage multilabel hypernetwork (TSMLHN) method to deal with the class imbalance problem in multilabel learning. In TSMLHN, class labels were divided into two groups, i.e., common labels and imbalanced labels, based on their imbalance ratios. The correlations between common labels and imbalanced labels were used to improve the learning performance of imbalanced labels. TSMLHN was tested using 15 multilabel datasets, including the Bibtex and Enron datasets. The results showed that the proposed method was able to outperform BR-SVM (Boutell et al., 2004), ML-kNN (Zhang et al., 2007), CLR (Fürnkranz et al., 2008), RAkEL (Tsoumakas et al., 2010), ECC (Read et al., 2011), IBLR (Cheng & Hüllermeier, 2009), COCOA (Zhang et al., 2020), ML-ROS, ML-RUS, and MLSMOTE (Charte et al., 2015), and MLHN (Sun et al., 2016). In terms of macro-F, TSMLHN achieved higher results than the compared methods on 9 out of 12 datasets.
Class imbalance remains an issue that has been highly investigated in recent years. When the instances of a specific class outnumber those of other classes, this usually causes a poor result (Feng et al., 2020; García et al., 2018; Maurya et al., 2017; Sáez et al., 2016; Zhou et al., 2020). In machine learning, presenting an imbalanced dataset usually results in low classification accuracy, because the machine learning method can learn very little about the minority class. A truly good classifier is one that is able to classify the classes in a balanced manner with high accuracy (Haixiang et al., 2017; Qiao et al., 2017; Xu et al., 2020). Generally, the most commonly used approaches to handle imbalanced classes are under-sampling and over-sampling, and they provide competitive results when compared with more complex methods found in the literature (Charte et al., 2015; Moreo et al., 2016; Sun et al., 2017; Sáez et al., 2016). Over-sampling aims to balance the classes' training examples by reproducing the minority class examples (Last et al., 2017; Tahir et al., 2012; Tanha et al., 2020). On the other hand, under-sampling aims to balance the classes' training examples by eliminating majority class examples (Charte et al., 2015; Pereira et al., 2020; Rao & Reddy, 2020; Sáez et al., 2016). Both of these approaches have limitations. For instance, under-sampling can discard potentially useful data, while over-sampling can increase the likelihood of overfitting. In order to overcome the limitations of random over-sampling and under-sampling for balancing the classes, this study proposes a new method that combines Multilabel Over-Sampling and Under-Sampling with Class Alignment (ML-OUSCA). The aim behind the combination is to deal with both limitations of previous approaches in addressing the class imbalance problem.

METHODS
A complete framework of multilabel text classification was set up to evaluate the ML-OUSCA method (Figure 1). The framework consisted of four phases, namely (i) data pre-processing; (ii) resampling and class alignment; (iii) data representation and feature selection; and (iv) multilabel classification models. In the framework configuration, two baseline resampling algorithms, namely the k-nearest neighbor under-sampling strategy (KNN-US) and K-means SMOTE, were compared against the proposed ML-OUSCA algorithm. The learning algorithms and ensemble methods were constructed to determine the performance of ML-OUSCA against the two baseline resampling models. Figure 1 shows the different combinations of the multilabel text classification architecture. The details of each phase are described in subsequent subsections.

Methodology for ML-OUSCA in Multilabel Text Classification
Data Pre-processing
Pre-processing is an essential stage before the application of machine learning approaches. It includes four steps: (i) tokenization, (ii) normalization, (iii) stop-word removal, and (iv) stemming. First, tokenization aims to turn the text of a certain document into an appropriate format for machine learning. The tokenization process fragments a text into discrete units, separated by a space or a certain delimiter, so that every unit corresponds to a single word. Second, the normalization step focuses on cleaning the data by eliminating noise or unwanted data, such as special characters. Third, stop-word removal is applied to discard unnecessary words, such as conjunctions, pronouns, and prepositions. Finally, stemming refers to figuring out the root or stem of words. Stemming extracts the word's root form from its inflectional or derivational form, which is a necessary step for addressing high-dimensional and sparse data, especially in multilabel text classification.
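The four pre-processing steps can be sketched as follows. This is only a minimal illustration, not the pipeline used in this study: the stop-word list, the suffix list, and the function names are hypothetical, and the suffix stripping is a naive stand-in for a real stemmer.

```python
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "or", "on"}

# Naive suffix list; an established stemmer would normally be used instead.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Split a text into units separated by whitespace."""
    return text.split()


def normalize(token):
    """Lower-case a token and remove special (non-alphanumeric) characters."""
    return re.sub(r"[^a-z0-9]", "", token.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Apply tokenization, normalization, stop-word removal, and stemming."""
    tokens = [normalize(t) for t in tokenize(text)]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    return [stem(t) for t in tokens]
```

For example, `preprocess("The labeled texts, and learning!")` yields the stems `label`, `text`, and `learn` under these assumed lists.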

Resampling and Class Alignment
This subsection describes two baseline methods, namely the KNN-US under-sampling method and the K-SMOTE over-sampling method. It also describes the proposed resampling method (ML-OUSCA) for handling the class imbalance problem in multilabel text classification.

Baseline 1: Under-sampling: KNN-under Sampling Strategy (KNN-US)
KNN-US, proposed by Rao and Reddy (2020), is the first baseline resampling method used in this work. KNN-US is one of the latest developments in under-sampling methods and is considered one of the state-of-the-art methods of resampling. The main idea of KNN-US is to recognize the mostly misclassified instances by taking into account the k-nearest neighbor technique. If all the nearest neighboring instances of a particular instance are of other classes, that instance is treated as a noisy or outlier instance and may therefore be excluded. At the first stage of KNN-US, the dataset is split into a minority subset $P$ and a majority subset $N$. $P$ is the minority subset, whose instances are much fewer than those of the other classes in the dataset. $N$ is the majority subset, whose instances make up a larger percentage than those of the other classes. Through analyzing the intrinsic properties of the instances, the noisy and outlier records can be easily detected. Two main steps are taken into account. First (step 1), minority set data cleaning: $p_i = m'$, where $0 \le m' \le m$; if $m/2 \le m' < m$, then $p_i$ is an often-misclassified instance. Then, delete the $m'$ instances from the minority set. Second (step 2), majority set data cleaning: $n_i = m'$, where $0 \le m' \le m$; if $m/2 \le m' < m$, then $n_i$ is a mostly misclassified instance. Then, delete the $m'$ instances from the majority set (see Algorithm 1).
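The neighbor-counting idea behind this cleaning step can be illustrated with a short sketch. It is not the KNN-US implementation of Rao and Reddy (2020): it assumes that m is the number of nearest neighbors inspected and m′ is the number of those neighbors carrying a different class, and the function names and the Euclidean distance are choices made for illustration.

```python
import math


def neighbors_of_other_classes(points, labels, index, m):
    """Return m' for points[index]: how many of its m nearest neighbors differ in label."""
    distances = sorted(
        (math.dist(points[index], p), j)
        for j, p in enumerate(points)
        if j != index
    )
    nearest = [j for _, j in distances[:m]]
    return sum(1 for j in nearest if labels[j] != labels[index])


def flag_instances(points, labels, m=3):
    """Flag each instance as 'noisy', 'often-misclassified', or 'keep'.

    Assumed reading of the text: m' == m marks a noisy/outlier instance,
    m/2 <= m' < m marks an often-misclassified instance.
    """
    flags = []
    for i in range(len(points)):
        m_prime = neighbors_of_other_classes(points, labels, i, m)
        if m_prime == m:
            flags.append("noisy")
        elif m / 2 <= m_prime < m:
            flags.append("often-misclassified")
        else:
            flags.append("keep")
    return flags
```

A minority point lying inside a dense majority cluster has only majority neighbors, so it is flagged as noisy, while points surrounded mostly by their own class are kept.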

Baseline 2: Over-sampling: K-SMOTE (K-means Synthetic Minority Over-sampling Technique)
Over-sampling aims to increase the number of minority class members in the training set. The over-sampling method generates new minority class instances to eliminate the harms of skewed distribution.
To evaluate over-sampling techniques, this work adopted K-SMOTE (Last et al., 2017), as shown in Algorithm 2. K-SMOTE is one of the recent advances of SMOTE and is considered to be one of the state-of-the-art over-sampling methods (Last et al., 2017).
K-SMOTE consists of three steps: clustering, filtering, and over-sampling. In the clustering step, the input space is clustered into k groups using k-means clustering. The filtering step selects for over-sampling those groups that contain a high percentage of minority class instances.

Step 3: Oversample each filtered cluster using SMOTE. The number of samples to be generated is computed using the sampling weight.

Proposed Method: ML-OUSCA
An imbalanced dataset is caused by unbalanced data distribution, leading to the poor performance of multilabel text classification algorithms because the classifiers are more inclined toward the majority than the minority data. This study proposes a new method based on both under-sampling and over-sampling of imbalanced classes. In the method, class labels are grouped into three major groups, namely major classes, minor classes, and class alignment (balanced classes). Under-sampling entails the elimination of samples from the majority classes to attain a balanced distribution. In contrast, over-sampling involves the replication of samples from the minority classes to achieve a balanced distribution with the majority classes.
For minority classes, new documents will be added based on the size of a minority class, average class size, and standard deviation.The aim is to increase their sizes to be nearest to the balanced class sizes.For majority classes, documents will be deleted.
In multilabel text classification, let $X \in \mathbb{R}^d$ be the domain of documents and $Y = \{l_1, l_2, \ldots, l_q\}$ denote the finite set of labels. $D = \{(x_i, y_i) \mid 1 \le i \le N, x_i \in X, y_i \subseteq Y\}$ denotes the training data, which consists of $N$ documents and their related labels. $y_i$ is a vector consisting of 1 and 0. For a given label, documents are treated as positive or negative instances depending on whether they are linked to it. The main idea of the proposed ML-OUSCA algorithm is derived from median outlier detection and Chebyshev's Theorem (Amidan et al., 2005). Chebyshev's Theorem has been applied to class-imbalanced data in multiple works (Amidan et al., 2005; Su & Hsiao, 2007) by estimating the likelihood of arriving at a value that differs from the mean by less than some number of standard deviations. It thus bounds the percentage of the data that lies outside a given number of standard deviations from the mean. The theory is described in Equation 1:

$$1 - \frac{1}{r^2} \qquad (1)$$

Chebyshev's Theorem states that at least $1 - \frac{1}{r^2}$ of the items in any dataset will be within $r$ standard deviations of the mean, where $r$ is any value greater than 1. Based on Chebyshev's Theorem, at least 75 percent of the items must be within $r = 2$ standard deviations of the mean. At least 89 percent of the items must be within $r = 3$ standard deviations of the mean. At least 94 percent of the items must be within $r = 4$ standard deviations of the mean. For data that have a normal distribution, approximately 68 percent of the data values will be within $r = 1$ standard deviation of the mean and 95 percent of the data values will be within $r = 2$ standard deviations of the mean. Almost all of the items (99%) will be within $r = 3$ standard deviations of the mean.
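The percentages quoted for Chebyshev's Theorem follow directly from the bound. The short sketch below, with a hypothetical function name, evaluates it for the values of r mentioned in the text.

```python
def chebyshev_lower_bound(r):
    """Minimum fraction of items within r standard deviations of the mean (r > 1)."""
    if r <= 1:
        raise ValueError("r must be greater than 1")
    return 1.0 - 1.0 / (r * r)


# Evaluate the bound for the three cases discussed in the text.
for r in (2, 3, 4):
    print(f"r = {r}: at least {chebyshev_lower_bound(r):.1%} of the items")
```

The bound gives 0.75 for r = 2, 8/9 for r = 3, and 0.9375 for r = 4, which round to the 75, 89, and 94 percent stated above.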
The proposed ML-OUSCA algorithm (Algorithm 3) consists of the following main steps:
Step 1: Group samples according to their classes.
In this step, the samples in dataset $D$ are rearranged so that each sample is distributed into a group $S$, where the total number of groups is equal to $Q$ (the number of labels), $D = \{S_1, S_2, S_3, \ldots, S_Q\}$. The samples are distributed based on their belonging to each label.
Step 2: Obtain majority classes, minority classes, and class alignment (balanced classes) based on class sizes' median and quartiles.
This step starts by ranking the groups in D in ascending order.
Then, the median of the samples is computed using Equation 2:

$$\text{median} = \frac{Q + 1}{2} \qquad (2)$$

In order to identify the extreme values at the tails of the distribution, the samples are divided into quartiles. The following quantities (called fences) are calculated using Equations 3 and 4:

$$\text{lower inner fence} = Quar_1 - 1.5\,IQ \qquad (3)$$

$$\text{upper outer fence} = Quar_3 + 3.0\,IQ \qquad (4)$$

where $Quar_1$ is the median of the values in the lower half of the ranked groups, $Quar_3$ is the median of the values in the upper half, and $IQ$ is the interquartile range, $Quar_3 - Quar_1$.
The major labels (called $Major_{classes}$) and minor labels (called $Minor_{classes}$) are identified based on the median of $Quar_1$ and $Quar_3$. The class alignment (called $Balanced_{classes}$), consisting of the classes that do not belong to $Quar_1$ and $Quar_3$, is identified, and its training examples are kept without over-sampling and under-sampling. In other words, class alignment (balanced classes) consists of classes whose size is not more than or less than one standard deviation away from the mean.
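Step 2 can be sketched as below. The fence formulas follow Equations 3 and 4, but the grouping rule is an assumption made for illustration: classes smaller than Quar1 are treated as minor, classes larger than Quar3 as major, and the rest as balanced. The function names and the median-of-halves quartile convention are likewise hypothetical choices, and at least two classes are required.

```python
import statistics


def quartile_fences(class_sizes):
    """Return (Quar1, median, Quar3, lower inner fence, upper outer fence)."""
    ranked = sorted(class_sizes)                  # ascending order, as in Step 2
    half = len(ranked) // 2
    median = statistics.median(ranked)
    quar1 = statistics.median(ranked[:half])      # median of the lower half
    quar3 = statistics.median(ranked[-half:])     # median of the upper half
    iq = quar3 - quar1                            # interquartile range
    lower_inner = quar1 - 1.5 * iq                # Equation 3
    upper_outer = quar3 + 3.0 * iq                # Equation 4
    return quar1, median, quar3, lower_inner, upper_outer


def group_classes(class_sizes):
    """Assumed rule: size < Quar1 is minor, size > Quar3 is major, else balanced."""
    quar1, _, quar3, _, _ = quartile_fences(class_sizes.values())
    groups = {"minor": [], "balanced": [], "major": []}
    for label, size in class_sizes.items():
        if size < quar1:
            groups["minor"].append(label)
        elif size > quar3:
            groups["major"].append(label)
        else:
            groups["balanced"].append(label)
    return groups
```

With class sizes 10, 20, ..., 70, the quartiles are 20 and 60, so only the smallest class is grouped as minor and only the largest as major under this assumed rule.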
Step 3: The mean and standard deviation of the class alignment (balanced classes) are calculated to determine the reduction size of majority classes and increment size of minority classes.
In order to determine the number of examples to be added to the minority classes and removed from the majority classes, the mean and standard deviation of the class alignment (balanced classes) are calculated using Equations 5 and 6, where CMS is the Class Mean Size and CSSD is the cross-sectional standard deviation.
Step 4: Major classes are under-sampled.
In this step, documents of the majority classes are deleted based on the size of a majority class, the average class size, and the standard deviation using Equation 7. The aim is to reduce their sizes to be nearest to the balanced class sizes.

$$\text{Reduct size} = |MajorL_i| - |CMS + 1 \cdot CSSD| \qquad (7)$$

Step 5: Minor classes are over-sampled.
For the minority classes, new documents are added based on the size of a minority class using Equation 8. In addition, the cosine similarity between a document x and other documents is used when increasing the sizes of the minority classes, in order to avoid overfitting. The number of added documents is based on the size of a minority class, the average class size, and the standard deviation. The aim is to increase their sizes to be nearest to the balanced class sizes.
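The sizes used in Steps 3 to 5 can be sketched as follows. Several details are assumptions rather than the method as published: CMS is taken as the arithmetic mean of the balanced class sizes, CSSD as their sample standard deviation, and the increment for a minority class as the mirror image of Equation 7 (the distance up to CMS minus one CSSD). The function name and the clamping at zero are also hypothetical.

```python
import statistics


def resampling_sizes(balanced_sizes, major_sizes, minor_sizes):
    """Return (CMS, CSSD, reductions, increments) for the major and minor classes."""
    cms = statistics.mean(balanced_sizes)       # Class Mean Size (assumed form)
    cssd = statistics.stdev(balanced_sizes)     # sample std. deviation (assumed form)
    upper = abs(cms + 1 * cssd)                 # target used in Equation 7
    lower = abs(cms - 1 * cssd)                 # assumed mirror target for Equation 8
    # Documents to delete from each major class (Equation 7), never negative.
    reductions = {
        label: max(0, round(size - upper)) for label, size in major_sizes.items()
    }
    # Documents to add to each minor class (assumed counterpart of Equation 7).
    increments = {
        label: max(0, round(lower - size)) for label, size in minor_sizes.items()
    }
    return cms, cssd, reductions, increments
```

For balanced class sizes of 90, 100, and 110, CMS is 100 and CSSD is 10, so a major class of 500 documents would be reduced by 390 and, under the assumed mirror rule, a minor class of 20 documents would grow by 70.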

Algorithm
Step 1 // Group samples according to their classes
1. For … do
2. … samples with Label(j)
3. End for
Step 2 // Obtain majority classes, minority classes, and class alignment (balanced classes) based on class sizes' median and quartiles.

TF.IDF Model
In text classification, the feature values and a vector of features (terms) are used to describe a document (Adel et al., 2019; Johnson & Zhang, 2014; Mao et al., 2019; Taha & Tiun, 2016). TF.IDF is a well-known text representation method, which works by assigning a weight to each word (feature) (Chen et al., 2016; Mashaan Abed et al., 2013; Zubiaga, 2018). It finds the important phrases or words in a specific document by combining the term frequency and the inverse document frequency. The weight of a word w in a document D is determined using two measures: (1) TF, the frequency of a term in a single document; and (2) DF, the number of documents in the corpus containing the specified term. N is the total number of documents. From each document, only a few terms are selected (terms that have the highest TF.IDF). All other terms (terms that have the lowest TF.IDF) are removed from the document. Terms in a document are assigned their TF.IDF using Equation 9.
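As an illustration of this weighting, the sketch below uses one common TF.IDF variant, the raw term frequency multiplied by log(N / DF). This form is an assumption and may differ from the exact formula used in this study; the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document; documents are lists of tokens."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


def keep_top_terms(doc_weights, k):
    """Keep only the k terms with the highest weights in one document."""
    top = sorted(doc_weights.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)
```

A term that occurs in every document receives a weight of zero under this variant, so it is among the first to be dropped when only the highest-weighted terms are kept.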

Normalized Pointwise Mutual Information Features Selection
The mutual information feature selection measures the common information that is found between the terms and the labels (Kermani et al., 2019; Lim et al., 2017). The common information MI(t, c) between a term t and a class c is determined by the level of co-occurrence between a feature $f_j$ and a class $c_i$ (Li et al., 2017; Lim et al., 2017). In this work, the normalized pointwise mutual information (NPMI) feature selection method was adopted to select features for each class according to the co-occurrence measure between a feature $f_j$ and a class $c_i$. NPMI between the feature and its classes (Lim et al., 2017) is calculated using Equations 10 and 11.
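A minimal sketch of NPMI computed from document counts is given below. It uses the standard definition, PMI divided by the negative logarithm of the joint probability, which is assumed here to correspond to the NPMI used in this study; the function and parameter names are hypothetical.

```python
import math


def npmi(n_fc, n_f, n_c, n_docs):
    """NPMI between a feature f and a class c, estimated from document counts.

    n_fc: documents containing f and labeled with c
    n_f:  documents containing f
    n_c:  documents labeled with c
    """
    if n_fc == 0:
        return -1.0          # conventional limit: f and c never co-occur
    p_fc = n_fc / n_docs
    if p_fc == 1.0:
        return 0.0           # degenerate case, treated as 0 here by assumption
    p_f = n_f / n_docs
    p_c = n_c / n_docs
    pmi = math.log(p_fc / (p_f * p_c))
    return pmi / -math.log(p_fc)
```

The value ranges from -1 (the feature and class never co-occur) through 0 (independence) to 1 (they always occur together), which makes scores comparable across classes of different sizes.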

Multilabel Classification Models
For evaluation, two multilabel learning models, namely (i) chain of classifiers (CC) based on a binary relevance method, and (ii) AdaBoost.MH, were adopted. These approaches were selected because they are considered state-of-the-art multilabel classification algorithms and are often used in works on imbalanced data (Al-Salemi et al., 2018; Pant et al., 2018; Taha & Tiun, 2016).

Chain Classifiers Based on Binary Relevance Method
A combination of multiple classifiers to solve a single task is called chained classifiers (CC). The classifiers can be trained independently by different datasets (Taha & Tiun, 2016). This work utilized the proven binary classifiers, i.e., Naive Bayes (NB) classifier, k-nearest neighbor (KNN) classifier, and SVM (Mirończuk & Protasiewicz, 2018).

AdaBoost.MH.
AdaBoost.MH constructs several weak classifiers iteratively and subsequently groups them into a final classifier that can estimate the multiple labels for a particular instance. Through integration and training, a boosting algorithm transforms a weak classifier into a strong one, which is what the AdaBoost algorithm does as an adaptive booster. The AdaBoost algorithm is capable of adjusting the weight distribution of the training samples adaptively and consistently selecting the best weak classifier under the sample weight distribution, and it then integrates all the weak classifiers, which vote with given weights, to build a robust classifier. AdaBoost.MH is a multilabel version of the AdaBoost algorithm (Al-Salemi et al., 2018; Pant et al., 2018).

Evaluation Measurements
The performance of these classification methods is measured by sorting the classification outcomes into four groups, using Equations 5, 6, and 7. The first group is true positive (TP), the documents correctly assigned to the class. The second group is false positive (FP), the documents falsely assigned to the class. The third is false negative (FN), the documents that were incorrectly not assigned to the class. Finally, the fourth is true negative (TN), the documents that were correctly not assigned to the class. In addition, this study adopted three multilabel evaluation measurements that are commonly used in multilabel classification (Sharef et al., 2014; Taha et al., 2020; Taha & Tiun, 2016), which can be referred to in Equations 12, 13, and 14:
i. Average precision metric, M_PRECISION, evaluates the proportion of the predicted labels that are relevant (true) labels, as shown in Equation 12.
ii. Average recall metric, M_RECALL, calculates the proportion of the relevant (true) labels that were correctly predicted, as shown in Equation 13.
iii. Average F-measure metric is the balanced (harmonic) mean of M_PRECISION and M_RECALL, as shown in Equation 14.
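The three measures above can be sketched from per-label TP, FP, and FN counts. The following is an illustrative sketch only: it assumes the averages are taken over labels (macro-averaging), which is one common reading and may differ from the exact form of Equations 12 to 14. The function name `macro_scores` and the toy label sets are hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Return (precision, recall, F-measure), each averaged over labels.

    y_true, y_pred: lists of label sets, one set per document.
    Assumption: macro-averaging over labels.
    """
    labels = sorted(set().union(*y_true, *y_pred))
    precisions, recalls, f_measures = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(label in t and label in p for t, p in pairs)
        fp = sum(label not in t and label in p for t, p in pairs)
        fn = sum(label in t and label not in p for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F-measure is the harmonic mean of precision and recall.
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(f_measure)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_measures) / n

# Toy example (hypothetical documents and labels).
y_true = [{"a", "b"}, {"a"}, {"b"}]
y_pred = [{"a"}, {"a", "b"}, {"b"}]
m_precision, m_recall, m_f = macro_scores(y_true, y_pred)
```

In this toy example label "a" is predicted perfectly, while label "b" has one TP, one FP, and one FN, so all three averaged scores equal 0.75.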

RESULTS AND DISCUSSION
This study evaluated the strengths of the proposed ML-OUSCA algorithm in the multilabel text classification context, in which AdaBoost.MH and CC were used as the classifiers for multilabel text classification. The main experiments involving K-SMOTE, KNN-US, and ML-OUSCA were carried out using the framework shown in Figure 1.
In addition, a five-fold cross-validation was utilized to evaluate all the experiments.

Dataset
As described in Table 1, the Bibtex, Enron, and Reuters-21578 corpus datasets, which are publicly available multilabel text classification datasets, were used. Table 1 shows the number of instances, number of attributes, number of labels, cardinality, density, diversity, and average imbalance ratio per label (avgIR). Cardinality measures the average number of classes for each instance, whereas density is cardinality divided by the number of labels. Diversity is the percentage of class sets present in the dataset divided by the number of possible label sets. The avgIR measures the average degree of imbalance of all classes. Therefore, the greater the avgIR, the greater the imbalance of the dataset.
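To make these dataset statistics concrete, the sketch below computes cardinality, density, and avgIR for a toy collection. It is a hedged illustration: the per-label imbalance ratio is assumed to be the count of the most frequent label divided by the count of the label in question, with avgIR as the mean of these ratios, and diversity is left out because its normalisation is not pinned down here. The function name `label_statistics` and the data are hypothetical.

```python
def label_statistics(label_sets, all_labels):
    """Return (cardinality, density, avg_ir) for a multilabel dataset.

    label_sets: list of label sets, one per instance.
    all_labels: list of every label in the dataset.
    """
    n_instances = len(label_sets)
    n_labels = len(all_labels)
    # Cardinality: average number of labels per instance.
    cardinality = sum(len(s) for s in label_sets) / n_instances
    # Density: cardinality divided by the number of labels.
    density = cardinality / n_labels
    # Assumed imbalance ratio per label: most frequent count / this label's count.
    counts = {l: sum(l in s for s in label_sets) for l in all_labels}
    used = [c for c in counts.values() if c > 0]  # skip labels that never occur
    largest = max(used)
    avg_ir = sum(largest / c for c in used) / len(used)
    return cardinality, density, avg_ir

# Toy dataset (hypothetical): label "a" dominates, "b" and "c" are rare.
label_sets = [{"a", "b"}, {"a"}, {"a"}, {"a", "c"}]
cardinality, density, avg_ir = label_statistics(label_sets, ["a", "b", "c"])
```

Here the frequent label has a ratio of 1 and each rare label has a ratio of 4, giving an avgIR of 3.0; a perfectly balanced dataset would give 1.0.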

Results
This study conducted two kinds of experiments using AdaBoost and CC classifiers for evaluation. The first experiment was conducted with baseline models (K-SMOTE and KNN-US) and the proposed ML-OUSCA method using AdaBoost for evaluation.
The second experiment employed the same settings and datasets that were used in the first experiment, and CC was applied instead of AdaBoost. The experiments were categorized based on the usage of AdaBoost and CC. Each experiment had three resampling methods, which were K-SMOTE, KNN-US, and ML-OUSCA. NPMI was used as a feature selection method with feature sizes ranging from 250 to 2250 and with a constant increase of 250 each time. Tables 2, 3, and 4 show the selected features (labeled feature selection set) for each dataset using both classification methods.
Table 2 describes the results of using K-means SMOTE as the over-sampling method, and Table 3 shows the results of using KNN-US as the under-sampling method. Table 4 presents the ML-OUSCA results. The results shown in Tables 2, 3, 4, and 5 are summaries of using the best sets of features for each classification method on all the datasets. Table 5 categorizes the experiment into evaluation models (AdaBoost and CC). Each experiment had three resampling methods (labeled K-means SMOTE, KNN-US, and ML-OUSCA) applied to each of the described datasets in Table 1.

DISCUSSION
The obtained results are summarized in Figure 2, which presents the effect of the proposed ML-OUSCA method on the multilabel text classification models based on all the datasets. It compares the classification accuracy of ML-OUSCA and the baseline methods, namely KNN-US and K-means SMOTE. The results also demonstrate that the multilabel text classification models could be improved further if the inherent imbalance problem was solved.
The results obtained by ML-OUSCA with AdaBoost.MH were stable (consistently high) regardless of the imbalance problem. As seen in Figure 2, although the avgIR value of the Enron dataset was more than 70, ML-OUSCA obtained an F-measure of 86.17 percent using AdaBoost and 82.6 percent using CC. Therefore, the proposed ML-OUSCA method was capable of handling the imbalanced text problem, even with high diversity in the size of imbalanced data (i.e., a large value of avgIR).
Figure 2 shows that ML-OUSCA outperformed the other baseline sampling methods on all the multilabel text classification models (AdaBoost.MH and CC). To verify whether these observations were statistically significant, a paired t-test was carried out between the results of the proposed method and those of each of the two baseline methods on all datasets. First, the t-test between the proposed ML-OUSCA and the baseline method KNN-US gave p = 0.000388. Second, the t-test between the proposed ML-OUSCA and the baseline method K-means SMOTE gave p = 0.009999. A significance level of 0.05 was employed in this study. Since both achieved p values fall below this level, it can be concluded that the results of the proposed method were significantly better than those of the baseline methods.
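For readers unfamiliar with the paired t-test, the sketch below computes the test statistic from matched scores. The numbers are invented for illustration and are not the results of this study. The helper name `paired_t` is hypothetical, and the code assumes six paired observations (for example, three datasets by two classifiers). Rather than computing a p value, it compares the statistic with the two-tailed critical value for 5 degrees of freedom at the 0.05 level, which is 2.571.

```python
import math
from statistics import mean, stdev

def paired_t(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for matched score lists."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1

# Illustrative F-measures only (NOT the values reported in this study).
proposed = [0.86, 0.83, 0.90, 0.88, 0.79, 0.81]
baseline = [0.80, 0.79, 0.85, 0.80, 0.75, 0.78]
t_stat, dof = paired_t(proposed, baseline)

# Two-tailed critical value of the t distribution for dof = 5, alpha = 0.05.
CRITICAL_T_DOF5 = 2.571
significant = abs(t_stat) > CRITICAL_T_DOF5
```

Pairing matters because each difference is taken on the same dataset and classifier, so variation between datasets does not inflate the standard error.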
In analyzing the cause of these results, it is believed that KNN-US might increase the likelihood of overfitting, whereas the K-means SMOTE method might lead to overgeneralization because it disregards the majority class instances. Therefore, ML-OUSCA could provide an effective solution for classifying imbalanced datasets and overcome several limitations of the baseline methods, such as losing important information and adding trivial information. The proposed method drew a new training set by over-sampling small classes and under-sampling large classes, combining and exploiting the power of over-sampling, under-sampling, and non-sampling methods. The results showed that ML-OUSCA significantly outperformed the other baseline sampling methods in all datasets. Besides, given the consistently higher results of AdaBoost.MH as compared to CC in all of the experiments (see Table 5) across all the datasets, AdaBoost.MH should be chosen as the ensemble classifier. This is because the AdaBoost.MH model aims to reduce the number of misclassified labels. It works by assigning weights to the training samples and to the classifiers in order to improve the accuracy of the classification.
In other words, among the baseline resampling methods and the proposed ML-OUSCA method for tackling imbalanced datasets, and between the ensemble classifiers AdaBoost.MH and CC, the best model for multilabel text classification is the proposed ML-OUSCA with AdaBoost.MH as the classifier.

CONCLUSION
This study presented a new method, ML-OUSCA, to solve the class imbalance problem in multilabel classification. Instead of using all training instances, the proposed method constructed a new training set by using over-sampling on the minority classes and under-sampling on the majority classes. Over-sampling and under-sampling were used to avoid the class imbalance problem, which is common in the majority of large-scale multilabel classification problems. The proposed ML-OUSCA was applied to well-known multilabel text classification datasets, namely Reuters-21578, Bibtex, and Enron. The results indicated the superiority of the proposed ML-OUSCA method over the baseline methods identified in the literature. Based on the results, the study concludes that combining multilabel over-sampling and under-sampling can help to achieve higher classification accuracy than using either of these methods in isolation.

Algorithm 1: KNN-under sampling (KNN-US)
Input: Minority class dataset, Majority class dataset
P = set of the minority instances
N = set of the majority instances
m' = the number of majority nearest neighbors
T = the whole training set
m = the number of nearest neighbors
Step 1. Find mostly misclassified instances pi
1. Let us consider
2. m' = the number of majority nearest neighbors
3. pi = m'; where m' (0 ≤ m' ≤ m)
4. If ≤ m' < m, then pi is a mostly misclassified instance. Then remove the instances m' from the minority set.
Step 2. Find noisy instances pi'
5. pi' = m'; where m' (0 ≤ m' ≤ m)
6. If m' = m, i.e., all the m nearest neighbors of pi are majority examples, pi' is considered to be noise, an outlier, or a missing value and is to be removed.
7. ni' = m'; where m' (0 ≤ m' ≤ m)
8. If m' = m, i.e., all the m nearest neighbors of ni are minority examples, ni' is considered to be noise, an outlier, or a missing value and is to be removed.
Output: A new minority class dataset Sm

To overcome the limitations of random over-sampling and under-sampling, the proposed work balances the classes of training examples by combining and exploiting the power of over-sampling, under-sampling, and non-sampling methods. Under-sampling discards only a few non-useful majority class examples, whereas over-sampling prevents overfitting by duplicating only a few randomly selected minority class examples. Furthermore, aligned (balanced) classes that have a suitable number of training examples (neither too high nor too low) are identified, and their training examples are kept without being over-sampled or under-sampled.
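The noise check for minority instances in Step 2 of Algorithm 1 (items 5 and 6) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses Euclidean distance on small numeric vectors, covers only the rule that a minority instance is dropped when all of its m nearest neighbors are majority examples, and leaves out Step 1. The function names and toy points are hypothetical.

```python
import math

def majority_neighbour_count(i, minority, majority, m):
    """Count majority points among the m nearest neighbours of minority[i]."""
    point = minority[i]
    # Pair each candidate neighbour with a flag: True if it is a majority point.
    neighbours = [
        (math.dist(point, q), False) for j, q in enumerate(minority) if j != i
    ]
    neighbours += [(math.dist(point, q), True) for q in majority]
    neighbours.sort(key=lambda pair: pair[0])
    return sum(is_major for _, is_major in neighbours[:m])

def drop_noisy_minority(minority, majority, m):
    """Keep a minority point unless all m nearest neighbours are majority."""
    return [
        p
        for i, p in enumerate(minority)
        if majority_neighbour_count(i, minority, majority, m) < m
    ]

# Toy 2-D points (hypothetical): the last minority point sits inside the
# majority cluster, so its three nearest neighbours are all majority points.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
majority = [(5.0, 5.1), (5.1, 5.0), (4.9, 5.0), (6.0, 6.0)]
cleaned = drop_noisy_minority(minority, majority, m=3)
```

With text data, the vectors would be document feature vectors, and another distance (for example, cosine distance) might be more suitable than the Euclidean distance assumed here.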

Table 1
Summary of the Multilabel Text Classification Standard Data

Table 4
Performance (Average F-measure) of the Proposed ML-OUSCA on CC and AdaBoost.MH