A Hybrid K-Means Hierarchical Algorithm for Natural Disaster Mitigation Clustering

Cluster methods such as k-means have been widely used to group areas with a relatively equal number of disasters to determine areas prone to natural disasters. Nevertheless,


INTRODUCTION
Many countries around the world are prone to natural disasters, including Indonesia. High rainfall, active tectonic and volcanic earthquakes, and tsunamis are very common occurrences in Indonesia. Consequently, disaster mitigation efforts are indispensable to minimize the impact of a disaster in many regions in Indonesia.
A wide variety of research on natural disaster mitigation has been carried out. Natural disaster mitigation is a continuous effort to reduce the impact of disasters on people and property (Sadewo et al., 2018). Prihandoko and Bertalya (2016) studied several factors for natural disasters in Indonesia and found that geographical condition, rather than weather condition, was the main cause of natural disaster occurrence. Anjayani (2008) suggested that earthquake hypocenters strongly correlated with the locations of many active volcanoes. most common natural disaster in Indonesia. In 2007, the Indonesian government passed the Law of the Republic of Indonesia Number 24 of 2007 concerning disaster management as the national reference (Indonesia, 2007). Rachmawati (2018) conducted a study on the community's knowledge in the disaster areas to measure people's general awareness of areas at risk to lessen the consequences of natural disasters.
Numerous previous works have used data and information on natural disaster mitigation compiled by the National Agency for Disaster Countermeasure (BNPB) of Indonesia. Sadewo et al. (2018) conducted a clustering of disaster mitigation anticipation levels at the provincial government using the k-means method. Priatmodjo (2011) stated that disaster mitigation required preparedness, which included analysis of potential disasters and planning for anticipation. He also developed tools for disaster prevention and management. Atasever (2017) revealed a method to determine the level of damage due to a disaster. Han and Kamber (2001) used data mining to process large amounts of disaster data. Meanwhile, Prihandoko et al. (2017) used data mining techniques to analyze and predict disaster mitigation anticipation levels.
Various methods have been used to cluster the anticipation level of natural disaster mitigation. Ediyanto et al. (2013) described hierarchical clustering based on Euclidean distance to calculate the level of similarity. Hierarchical clustering is usually shown in the form of a tree diagram (dendrogram). For large amounts of data, the k-means method is more often used (Bagirov et al., 2011). This paper presents the results of clustering datasets of mitigation activities using data mining techniques to determine the anticipation levels for natural disasters. In this study, k-means and hierarchical clustering are combined in a hybrid approach to produce a hybrid hierarchical k-means clustering. The clustering process is conducted using datasets originating from various research reports of natural disaster mitigation activities conducted by local and managed by the National Research and Innovation Agency (BRIN). information for a better clustering process. Clustering is a powerful amounts of information are needed for data organization (Abdulsahib & Kamaruddin, 2015).

RELATED WORKS

Natural Disaster Mitigation
A natural disaster is a natural event that has a major impact on the many tectonic activities), Indonesia must continue and prepare to face Preparation for natural disasters includes all activities carried out prior to the detection of signs of disaster in order to facilitate the use of available natural resources, request assistance, and plan for rehabilitation in the best possible way and likelihood. Preparedness for natural disasters starts at the local communication level. If local and international levels (Sadewo et al., 2018).

Clustering Disaster-Prone Areas and Mitigation
Cluster methods such as k-means have been widely used to group areas with a relatively similar number of disaster characteristics to see which areas are prone to natural disasters (Yana et al., 2018). A study by Supriyadi et al. (2018) used k-means to classify disaster-prone areas into three clusters: high, medium, and low. In addition, Yana et al. (2018) found two regional clusters in Indonesia, namely prone to and not prone to natural disasters. Prihandoko and Bertalya (2016) suggested the cluster correlation between natural disasters, the number of victims, and weather conditions using k-means. efforts using k-means in a disaster mitigation study. Their research results showed three clusters (high, medium, and low mitigation efforts). The regions of West Java, Central Java, and East Java fell into the high level of mitigation. In another study, Kandel et al. (2014) discussed a comprehensive assessment of fuzzy techniques for mitigation. They utilized incremental fuzzy clustering to group mitigation data. Nevertheless, the authors did not experiment with other clustering techniques on the same dataset for accuracy measures. Several of the previous studies above, such as Sadewo et al. (2018), Supriyadi et al. (2018), and Prihandoko and Bertalya (2016), did not validate the results of clustering on the mitigation and disaster grouping by province. In addition, the results have not yet been compared with mitigation/disaster grouping.

Hybrid Clustering
Hybrid k-means and hierarchical clustering have been applied to studying disasters such as air pollution (Govender & Sivakumar, 2020). K-means and hierarchical clustering are two approaches that have different strengths and weaknesses. For instance, hierarchical computational complexity in large datasets. In contrast, k-means spherically shaped clusters (Peterson et al., 2018). Several studies have combined these two methods, such as Govender and Sivakumar (2020), who applied a combination of k-means and hierarchical clustering techniques to analyze air pollution. Atasever (2017) combined the k-means cluster method and backtracking search optimization algorithm (BSA) clustering to detect damage to natural disaster areas. The data results were grouped with a hybrid approach into two classes: damaged and undamaged areas.
Moreover, some studies compared hybrid k-means cluster methods with other methods. Nugroho (2021) compared the kernel k-means algorithm on bipartite graphs and k-means on the term-document matrix in the COVID-19 research dataset. The result was that the kernel k-means algorithm provided slightly better validation than k-means. Balavand et al. (2018) used a hybrid of the crow search algorithm (CSA) k-means method with data envelopment analysis and compared it with other algorithms.
Nevertheless, other research works used a combination of clustering methods for disaster or other subjects. Wen et al. (2019) developed a combination of geographic information system (GIS) technology and the QUEST cluster algorithm, and the results showed the distribution of drought disaster areas. Ali et al. (2018) discussed disaster management with cluster techniques for emergencies, while Welton-Mitchell et al. (2018) examined clusters of people affected by clustering algorithms for plantation stocks on Bursa Malaysia. They utilized expectation maximization (EM), k-means, and hierarchical clustering algorithms to cluster the 38 plantation stocks listed on Bursa Malaysia. The results showed that a cluster resulting from EM
This study seeks to address some of the shortcomings of previous research. First, no one has explicitly used hybrid k-means and hierarchical clustering algorithms to suppress the level of disaster mitigation efforts. Second, previous research only surveyed the combination of k-means and hierarchical clustering studies. This study presents the application of the hybrid clustering approach that amalgamates the two methods to identify the general-shaped level of stage is to combine k-means and hierarchical as a hybrid approach. The hybrid approach is used because the k-means algorithm uses random observational data to determine the initial centroid. The centroid point is initialized randomly so that the resulting data grouping can be different. If the random value for initialization is not good, then the resulting grouping becomes less than optimal. A hybrid k-means and hierarchical algorithm is expected to avoid this problem.

METHODOLOGY
This study clustered the natural disaster (earthquake, tsunami, landslide, volcano eruption) dataset from technical reports on natural This dataset consisted of 237 documents of technical research reports conducted by researchers within and outside BRIN. A total of 81 districts and cities (subsequently named "region") in Indonesia were included in this dataset. A mitigation category was created for each of the technical reports on natural disaster research. The categories consisted of A, B, C, D, E, and F, and are based on the types of natural disaster mitigation recommended by the National Research and Innovation Agency in each disaster-prone area. Table 1 presents the dataset summary. 2017) hybrid approach to detect damage due to natural disasters. The current study also used the R programming language with the factoextra library in computing the application of a hybrid approach (Kassambara & Mundt, 2020). Figure 1 illustrates the exact method. b = average inter-cluster distance, i.e., the average distance between all clusters.
The next stage was to combine k-means and hierarchical clustering as a hybrid approach. The hybrid approach first computed the hierarchical clustering and cut the tree into k clusters. It then calculated the centroid of each cluster. Finally, it ran k-means using those centroids as the initial cluster centroids. Next, the hierarchical and k-means clustering results were each compared with the hybrid clustering results using a matching matrix.
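The three steps above can be sketched in a few lines. The study itself used R with the factoextra library; the version below is only an illustrative Python sketch using the standard library, and every function name and the toy data are hypothetical. It assumes Euclidean distance and Ward's merging criterion for the hierarchical step.

```python
from math import dist  # Euclidean distance, Python 3.8+


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def ward_cut(points, k):
    """Agglomerative clustering with Ward's criterion, stopped at k clusters.

    Naive version for illustration: repeatedly merge the two clusters whose
    union increases the within-cluster sum of squares the least.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                gap = dist(centroid(clusters[a]), centroid(clusters[b]))
                cost = na * nb / (na + nb) * gap ** 2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def kmeans(points, centers, max_iter=100):
    """Lloyd's k-means started from the given (non-random) centers."""
    labels = []
    for _ in range(max_iter):
        labels = [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster happens to become empty.
            new_centers.append(centroid(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def hybrid_kmeans(points, k):
    """Hierarchical clustering -> cut into k -> centroids -> k-means."""
    initial_centers = [centroid(c) for c in ward_cut(points, k)]
    return kmeans(points, initial_centers)


# Hypothetical toy data: three well-separated groups in two dimensions.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
        (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]
labels, centers = hybrid_kmeans(data, k=3)
```

Because the initial centers come from the cut tree rather than from a random draw, repeated runs on the same data return the same grouping.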
The hybrid clustering results from the disaster mitigation category the three clusters were made as a ground truth reference for applying the hybrid k-means hierarchical algorithm to the natural disaster dataset on the subset of keywords and types of disasters.
This study used disaster-type and keyword subsets as the clustering base. The subsets represented the mitigation category relationship. This study carried out the clustering stage from each subset using the k-means algorithm by building a term frequency-inverse document frequency (TF-IDF) matrix to convert the document into a TF-IDF vector. Stop words were eliminated. In the computation, stop words (text), such as the, is, at, which, and on. The stemming process was algorithm of k-means and hierarchical was employed with the number of clusters k = 3 according to the anticipated levels of natural disaster This study utilized unsupervised learning to divide the input data point with some common properties. In the previous stage, prior knowledge clustering results, a matching matrix method was intuitively used. As described by Samatova et al. (2013), the matching matrix (Figure 2) is a V × W matrix, where V is the number of class labels in P and W is the total number of resulting clusters. Each row of the matrix represents one class label, and each column represents a cluster ID. Each m_ij entry represents the number of points from Class i that are present in cluster g_j P and clusters obtained using U.
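The TF-IDF step mentioned at the start of this paragraph can be illustrated as follows. This is a minimal sketch, not the study's pipeline: the stop-word list, the weighting variant (term count divided by document length, multiplied by log(N / document frequency)), the function names, and the sample titles are all assumptions, and stemming is omitted.

```python
import math
import re
from collections import Counter

# Illustrative stop-word list; the study's actual list is not given.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "of", "and", "in"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def tfidf_matrix(documents):
    """Return (vocabulary, rows) where rows[d][t] is a TF-IDF weight.

    Uses tf = count / document length and idf = log(N / df), one common
    variant; smoothed or normalized weightings are equally valid choices.
    """
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(documents)
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        rows.append([counts[w] / total * math.log(n_docs / doc_freq[w])
                     for w in vocabulary])
    return vocabulary, rows


# Hypothetical report titles, not taken from the dataset.
docs = ["Mapping of the fault zone",
        "Earthquake risk assessment on the fault",
        "Tsunami early warning system"]
vocab, matrix = tfidf_matrix(docs)
```

Each row of `matrix` is the vector for one document and can be passed to a clustering routine; terms that occur in every document receive a weight of zero under this variant.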
In this paper, purity was employed as the validation metric for the hybrid algorithm. Purity (Pu) is a measure to analyze the cluster's homogeneity concerning the class labels. Equation 1 calculates purity as follows:
$$Pu(g_j) = \frac{1}{|g_j|} \max_{i = 1, \dots, V} m_{ij} \qquad (1)$$
where |g_j| is the number of points in cluster g_j. This measure takes any value in the range of 1/V to 1. A value of 1 indicates an utterly homogeneous cluster. The total purity (TPu) was calculated for the entire cluster's results. TPu, as denoted by Equation 2 for the whole cluster set, was calculated as the sum of each cluster's purities weighted by the number of elements in each cluster:
$$TPu = \sum_{j = 1}^{W} \frac{|g_j|}{n} \, Pu(g_j) \qquad (2)$$
where n is the total number of points.
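As a worked illustration of the matching matrix and the purity measures, the sketch below builds the matrix from two label lists and computes the per-cluster and total purity. The function names and the eight sample labels are hypothetical and are not taken from the study's data.

```python
def matching_matrix(class_labels, cluster_ids):
    """Rows are class labels, columns are cluster IDs, entries are counts."""
    classes = sorted(set(class_labels))
    clusters = sorted(set(cluster_ids))
    m = [[0] * len(clusters) for _ in classes]
    for c, g in zip(class_labels, cluster_ids):
        m[classes.index(c)][clusters.index(g)] += 1
    return m


def purity(m, j):
    """Share of cluster j taken up by its most frequent class."""
    column = [row[j] for row in m]
    return max(column) / sum(column)


def total_purity(m):
    """Cluster purities weighted by the relative size of each cluster."""
    n = sum(sum(row) for row in m)
    sizes = [sum(row[j] for row in m) for j in range(len(m[0]))]
    return sum(sizes[j] / n * purity(m, j) for j in range(len(m[0])))


# Hypothetical reference classes and cluster assignments for eight regions.
reference = ["low", "low", "low", "low", "medium", "medium", "high", "high"]
assigned = [1, 1, 1, 2, 2, 2, 3, 3]
m = matching_matrix(reference, assigned)
```

Here Clusters 1 and 3 are pure, while Cluster 2 mixes one "low" region with two "medium" regions, so its purity is 2/3 and the total purity is 3/8 + (3/8)(2/3) + 2/8 = 0.875.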

RESULTS
parameter of the mitigation category, with the number of clusters k = 3. The results of k-means clustering are presented in Figure 3. including Kab. Simeulue and Kab. Toba Samosir. Kota Banda Aceh, Kab. Kebumen, and Kab. Cilacap were included in Cluster 2, while Kab. Kep. Mentawai and Kab. Lampung were located in Cluster 3. Each cluster represented a different level of preparedness for natural disaster mitigation. To determine the validity of this clustering result, the silhouette size was used, as shown in Figure 4. The silhouette size indicated that the majority of data points were well clustered, as indicated by positive silhouette values, particularly in Clusters 1 and 3. In Cluster 2, however, several data points had negative values, indicating that they might belong to the incorrect cluster. The and estimated the average distance between clusters (i.e., the average the observations might be grouped in the wrong cluster. Table 2 revealed that some data points in Cluster 2 were grouped in the wrong cluster.
In the next stage, the hierarchical algorithm was applied to the dataset of disaster mitigation categories with the number of clusters k = 3 and the Ward method. The hierarchical results are presented in a dendrogram, as shown in Figure 5. Based on Table 3, there were Aceh and Kab. Kebumen might be incorrectly clustered at this stage of hierarchical clustering, as indicated by a negative silhouette value.

Cluster Plot of K-means on Mitigation Category
Figure 6 shows that the average silhouette width of hierarchical clustering was 0.53, which was higher than that of k-means (see Figure 4) for the same clustering category. The clustering results were similar to those of k-means clustering. Most data points were grouped into Cluster 1, followed by Cluster 2, and the remainder into Cluster 3. The difference was that hierarchical clustering produced fewer data points with negative silhouette values than k-means. Therefore, hierarchical clustering might produce a better result than the k-means algorithm on this dataset.
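To make the silhouette values used in this comparison concrete, the following sketch computes the silhouette width of each point. It assumes the standard definition, s = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. The function name and the one-dimensional toy data are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_widths(points, labels):
    """Silhouette width of every point; needs at least two clusters."""
    widths = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [dist(p, q) for j, (q, l) in enumerate(zip(points, labels))
               if l == lab and j != i]
        if not own:  # a singleton cluster gets width 0 by convention
            widths.append(0.0)
            continue
        a = sum(own) / len(own)
        # Mean distance to each other cluster; keep the closest one.
        b = min(
            sum(dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != lab
        )
        widths.append((b - a) / max(a, b))
    return widths


# Hypothetical data: the third point sits next to Cluster 1 but is
# labelled as Cluster 0, so its silhouette width is negative.
points = [(0.0,), (1.0,), (8.0,), (10.0,), (11.0,)]
labels = [0, 0, 0, 1, 1]
widths = silhouette_widths(points, labels)
average_width = sum(widths) / len(widths)
```

A negative width, as for the third point here, flags an observation that lies closer on average to another cluster than to its own, which is how the possibly misplaced regions in the silhouette figures are identified.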

Cluster Dendrogram of Hierarchical Clustering on Mitigation Category
Cluster Silhouette of Hierarchical Clustering on Mitigation Category
The next stage was to combine k-means and hierarchical clustering as a hybrid approach. The hybrid approach was employed because the k-means algorithm uses random observational data to determine the initial centroid. The clustering solution of k-means is very sensitive to a random selection of the centers of the cluster. Therefore, clustering results may vary when recomputing.
The hybrid approach calculated the hierarchical clusters and cut the tree into k clusters. It then calculated the center of each cluster and ran k-means using the cluster centers obtained from the previous calculation as the initial centers. Table 4 summarizes the newly calculated centroids. Next, k-means clustering was applied using the cluster centers in Table 4 to obtain the cluster results.

Table 4
New Centroids for Hybrid K-means Hierarchical Clustering

Hierarchical and Hybrid Clustering Results Comparison
Next, the hierarchical and hybrid clustering results were compared using the matching matrix. As shown in Table 5, the hybrid algorithm produced better clustering results than the hierarchical algorithm. The data points were clustered homogeneously into each cluster.

Matching Matrix of Hierarchical and Hybrid Clustering
Columns: hybrid results (Clusters 1, 2, 3); rows: hierarchical results.
From Figure 7, it can be observed that most data points were clustered homogeneously into a predetermined cluster. Nevertheless, some data seven data points included in Cluster 1. The mis-clustered data points Figure 7 and Table 5.

Cluster Dendrogram of Hybrid K-means and Hierarchical Clustering on Mitigation Category
Figure 8 shows that most data points had positive silhouette values, which meant that the data points were clustered into the correct cluster. In contrast, it can also be seen that in Cluster 2, eight data points had negative values. The negative value indicated that there was a possibility that the data points were not clustered correctly in

Figure 8
Cluster Silhouette of Hybrid Clustering on Mitigation Category

K-means and Hybrid Clustering Results Comparison
In the same way, using the matching matrix, the clustering of standard k-means was compared with the hybrid approach, as shown in Table 7. Table 7 describes a matching matrix that consolidated the results of the standard and hybrid k-means clustering. Clusters 1, 2, and 3 each showed the clustered data points correctly by the two types of clustering algorithms applied. The results of this clustering were consulted with related experts. The clustering results showed that there were two regions, namely Kab. Mentawai and Kab. South Lampung, in the high anticipation category. Aceh, Padang City, and Bengkulu City, could be highly anticipated. This difference was due to the lack of research in Category A that discussed construction and strengthening of building structures.
The hybrid clustering results from the disaster mitigation category experts were from the Research Center for Geotechnology with k-means hierarchical algorithm. The hybrid algorithm was then used for keyword and disaster-type subsets. Figure 9 presents the ground truth, which consolidated the clustering results of applying the hybrid algorithm and validation from experts. Cluster 1 represented areas with low anticipation levels, Cluster 2 for medium anticipation levels, and Cluster 3 for high anticipation levels.

Result of Hybrid K-means and Hierarchical Clustering on Keywords
Figure 12 illustrates the keywords that represented each mitigation category. For Category A, which was construction and strengthening of building structures, the keywords that appeared most often included earth movements, pressure, and earthquakes. Category B was for mapping of disaster-prone areas, and the keyword that appeared the most was fault. Category C was for assessment of disaster risks and characteristics, and the keywords that appeared were earthquakes and tectonic plates. In Category D, which was for preparation and installation of early warning system instrumentation, keywords such as deformation and earthquake fault appeared. Category E was for planning and implementation of spatial planning, and the keyword that most often appeared was earthquakes. Finally, in Category F, which was for outreach and information dissemination, the keyword that often appeared was disaster.

Figure 12
Keyword Word Cloud on Mitigation Category

Result of Hybrid K-means and Hierarchical Clustering on Disaster Types
Figure 15 shows the correlation of each category of mitigation anticipation with different types of disasters. For example, in mitigation Category A, which was construction and strengthening of building structures, landslides were the most anticipated. In Category D, which was for preparation and installation of early warning system instrumentation, earthquake and tsunami disasters were the most anticipated.

Disaster Types Bar Plot on Mitigation Category
To validate the clustering results, a matching matrix was used on keywords, disaster types, and mitigation code to determine purity.
As shown in Table 8, the results indicated that the clusters had an average TPu value of 0.88 for the hybrid clustering algorithm, 0.84 for hierarchical, and 0.86 for k-means.The TPu values were close to 1, representing the acceptable results of the hybrid clustering algorithm.From this table, it can be concluded that the hybrid clustering outperformed k-means and hierarchical since its TPu value was the highest.

CONCLUSION
This study examined the clustering of a natural disaster literature dataset. The clustering process was performed by applying the k-means, hierarchical, and hybrid algorithms. This process produced three clusters for the anticipation levels of natural disaster mitigation: Cluster 1 for low anticipation level, Cluster 2 for medium anticipation level, and Cluster 3 for high anticipation level. In addition, from validation by experts, the clustering results indicated that 67 districts/cities (82.7%) fell into Cluster 1, nine districts/cities (11.1%) were categorized in Cluster 2, and five districts/cities (6.2%) in Cluster 3. From the analysis of the silhouette homogeneous clustering results.
Furthermore, to validate the clustering results, a matching matrix was used on keywords, disaster types, and mitigation code to determine purity. The clusters had a TPu close to 1, representing acceptable results of the hybrid clustering algorithm. It was concluded that the hybrid clustering outperformed standard k-means and hierarchical clustering since its TPu value was the highest. The clustering solution of k-means is very sensitive to a random selection of the centers of the cluster. Therefore, clustering results may vary when recomputing. This sensitivity led the study to use hybrid clustering, in which the initial centroids for k-means are taken from the hierarchical clustering instead of from random observations.
A further study that aims to compare the hybrid clustering algorithm with other algorithms is recommended. The method for determining the disaster mitigation level also needs improvement.

Figures
Figures 13 and 14 show the result of hybrid k-means and hierarchical clustering on disaster types. In contrast with the previous clustering results on the keywords, the clustering on the disaster types resulted in more regions falling into Clusters 2 and 3.

Table 1
Summary of Dataset from the Research Center for Geotechnology-Indonesian Institute of Sciences.
This study used a hybrid approach that combined the k-means and the hierarchical algorithms to categorize the anticipation level. This method adopted Sadewo et al.'s (2018) clustering of anticipated levels of natural disaster mitigation at the provincial level and Atasever's (

Table 3 Clustering
Table 6 presents the negative values of the silhouette. As many as eight data points were indicated in this table as being in the incorrect cluster, including Kota Bandung, Kab. Cilacap, and Kab. Tanggamus.

Table 7
Matching Matrix of K-means and Hybrid Clustering

Table 8
Matching Matrix Validation on Keywords, Disaster Types, and Mitigation Code