EVALUATION ON RAPID PROFILING WITH CLUSTERING ALGORITHMS FOR PLANTATION STOCKS ON BURSA MALAYSIA

Building a stock portfolio often requires extensive financial knowledge and Herculean effort, given the amount of financial data to analyse. In this study, we utilized Expectation Maximization (EM), K-Means (KM), and Hierarchical Clustering (HC) algorithms to cluster the 38 plantation stocks listed on Bursa Malaysia using 14 financial ratios derived from fundamental analysis. The clustering allows investors to profile each resulting cluster statistically and assists them in rapidly selecting stocks for their portfolios. The performance of each cluster was then assessed using 1-year stock price movement. The results showed that one cluster produced by EM had a better profile and obtained a higher average capital gain than the other clusters.


INTRODUCTION
Investing in stock markets is not an easy task for many people, as stock markets are complex and dynamic systems. Short-term movements or patterns in stock markets are often unpredictable and difficult to trace. Thus, lucrative returns are difficult to gain from stock investments. However, investors and financial researchers remain keen to adopt different approaches to understand the behaviour of stock markets, and research into stock markets continues to appeal to them. In the early years, research on stock price movements and predictions was primarily based on statistical approaches (Brown & Warner, 1985; Pearce, 1984). In recent years, the focus of stock research has shifted to applying data mining techniques (Ou & Wang, 2009).
Data mining is a process of identifying interesting patterns in data for decision making (Ngai, Hu, Wong, Chen, & Sun, 2011). Historical data, i.e., financial data or time series data of stocks, are readily available and large in size. Applying data mining techniques to such data allows researchers to identify and uncover the hidden patterns of a particular stock or even a stock market. Classification, clustering and generalization are among the commonly used data mining techniques to analyse and predict the movement of stock prices or stock market indexes. In this research, clustering algorithms were applied to the financial data of plantation stocks listed on Bursa Malaysia. Clustering in the data mining context refers to the unsupervised classification of data into clusters (groups), such that the data in the same cluster exhibit a certain degree of pattern similarity (Jain, Murty, & Flynn, 1999). Clustering algorithms have been widely used in many disciplines such as bioinformatics (Ng, Ho, & Phon-Amnuaisuk, 2012), big data analytics (Feldman, Schmidt, & Sohler, 2013), and multi-level Kohonen network learning (Shamsuddin, Zainal, & Mohd Yusof, 2008).
Although clustering research on stock market data is not new, it remains challenging. This is because stock market data can be very large and often need to be pre-processed carefully and accurately before use. Furthermore, the patterns that exist in the data of one stock market might differ from those of others. Hence, clustering-based stock research remains relevant and attractive to many researchers. In a study by Nanda, Mahanty, & Tiwari (2010), clustering was performed on stocks listed on the Bombay Stock Exchange (BSE) with the objective of building a stock portfolio by selecting stocks from the resulting clusters, and the investment returns were then compared with the Sensex index; the research indicated that KM clustering yielded better results than Self-Organizing Map (SOM) and Fuzzy C-Means. Lee, Lin, Kao, & Chen (2010) applied hierarchical agglomerative and KM clustering to predict the short-term movement of stock prices after the release of financial reports.
Clustering techniques were also applied to predict and assess stock market co-movement (Aghabozorgi & Teh, 2014). The researchers proposed a three-phase clustering method to group the stocks listed on the Kuala Lumpur Stock Exchange (now known as Bursa Malaysia). The first phase performed an approximate clustering of the stocks using low-resolution time series data. The clusters formed were further refined by splitting them into sub-clusters in the second phase. The third phase merged the sub-clusters into the final clusters. Hsu (2011) proposed a hybrid method to predict the prices of stocks listed on the Taiwan Stock Exchange; the hybrid method was formed by integrating SOM and genetic programming. The researcher claimed that the hybrid method was effective for stock price prediction.
Classification techniques were also employed in stock market research. A fuzzy rule-based system was proposed by Chang and Liu (2008) to predict the prices of electronics stocks on the Taiwan Stock Exchange. In addition, Tan, Yong, & Tay (2012) applied Bayesian Networks (BN) to model the financial ratios of plantation stocks listed in Malaysia; the developed model can be used to forecast the future price performance of the plantation stocks. Much early research employed Artificial Neural Networks (ANN) to predict the stock market, and a number of ANN-based stock prediction models have been reported by financial researchers (Zhang & Wu, 2009; Ishikawa, Fukuhara, & Nakamura, 1997; Yoon & Swales, 1991).
Data mining research in stock markets generally uses (1) fundamental analysis or (2) technical analysis to analyse stocks (Lam, 2004). Fundamental analysis refers to finding the intrinsic value of a stock, which can be measured from the stock's quantitative and qualitative data (Tan et al., 2012; Lee et al., 2010; Nanda et al., 2010; Yoon & Swales, 1991). Quantitative data mainly comprise financial ratios such as profit margin, debt ratio, price earnings ratio, etc. Qualitative data, on the other hand, relate to the quality of key management, company policy, brand, marketing strategy, etc. Unlike fundamental analysis, technical analysis emphasises the patterns and trends in a stock's trading information; it gathers and analyses statistics generated by stock activities, i.e., price movement and volume. The patterns or trends discovered by technical analysis are used as indicators to predict future stock price performance (Aghabozorgi & Teh, 2014; Hsu, 2011; Zhang & Wu, 2009; Chang & Liu, 2008). Both analyses produce relevant stock information that is beneficial to investors in building stock portfolios.
A stock portfolio is a collection of stocks held by an individual or a company. Building a stock portfolio often requires Herculean effort from an investor because of the large number of stocks in a stock market. For instance, there are more than 900 common stocks (excluding financial derivatives) listed on Bursa Malaysia. Thus, building a good stock portfolio is not an easy task for an amateur or even a professional fund manager. It is important for a stock investor to find an efficient way to build a good stock portfolio that can generate excellent investment returns.
Thus, the primary objective of this study is to perform rapid profiling of the 38 plantation stocks listed on Bursa Malaysia using quantitative stock data and clustering algorithms. The plantation stocks were selected mainly because they play an important role in the economy of Malaysia. Moreover, Malaysia is among the world's largest exporters of palm oil (Sulaiman, Abdullah, Gerhauser, & Shariff, 2011) and rubber (Nambiar, 2010).
The organization of this paper is as follows. The second section provides the details of the research methodology. The third section covers the clustering results for the three clustering algorithms, as well as the analysis and discussion of the results. The last section concludes the paper and suggests future directions for this research.

METHODOLOGY
An overview of the methods used in this study is depicted in Figure 1. The study started with the collection of raw financial data for the plantation stocks listed on Bursa Malaysia. In the subsequent step, the collected data were transformed into useful financial ratios. The stock data were then clustered using three clustering algorithms, i.e., EM, KM and HC. In the final step, the resulting clusters were analysed and profiled based on their financial performance. The clusters were also assessed using a comparative analysis based on the average capital return (in stock price).

Transform the Plantation Stocks Data
The financial data for the year 2012 were retrieved for each plantation stock from the DataStream database. Alternative data sources, such as the annual reports provided by Bursa Malaysia or plantation company websites, were used for any missing values in the data. The financial data for the year 2013 were not collected because many plantation companies had yet to release their 2013 annual reports at the time of data collection. There were 38 plantation stocks (excluding financial derivatives) available. Each of them is represented by its unique four-digit stock code and stock name. Table 1 shows the details of each plantation stock listed on Bursa Malaysia.

Table 1
The stocks listed in the plantation sector of Bursa Malaysia. The total number of active plantation stocks is 38 (excluding warrants, delisted stocks, etc.).

The collected financial data were raw data and inappropriate to be used directly. For instance, we cannot claim that two plantation companies of different scales are equivalent in financial performance merely because they generate the same amount of profit; the result of the smaller company is the more remarkable of the two. To compare them effectively, financial ratios such as earnings per share (EPS) or return on equity (ROE) are more suitable than the amount of profit alone. Financial ratios are crucial indicators to evaluate the "health" or financial status of a stock (Sim & Liu, 2011). They have been used extensively in data mining research on financial and stock data (Baresa, Bogdan & Ivanovic, 2013; Tan et al., 2012; Sim & Liu, 2011; Lee et al., 2010; Kloptchenko et al., 2004). Thus, the next step was to pre-process the raw financial data and convert them into useful financial ratios. The consistency of these financial ratios was checked against other resources such as online trading portals, and any inconsistent values were rectified.
A total of 14 financial ratios were identified and used in this study. Table 2 displays the formulae used to derive each financial ratio. The first financial ratio is the cash ratio, which measures a company's liquidity and how readily it can repay its short-term debt. A high cash ratio indicates a healthy cash flow position, so the company is unlikely to encounter repayment problems for its short-term debt. Assets are crucial to a company since they can be used to generate revenue and subsequently make profits. Total asset turnover is used to measure how effectively a company utilizes its assets; a high total asset turnover implies that a company is highly efficient in managing its assets.
Financial leverage ratios also play an important role in determining the "healthiness" of a company. The financial leverage ratios comprise debt ratio, debt to equity ratio (D/E), current debt to equity ratio and equity turnover. These four financial ratios provide information on the degree to which a company is financed by debt and on its ability to pay short-term or long-term debts.
On the other hand, the profitability that a company generates from its investments can be measured using profitability ratios such as return on assets (ROA), return on equity (ROE), net profit margin and operating margin; high values in these financial ratios indicate that a company is performing better than its peers or competitors. The remaining four financial ratios are price to earnings ratio (P/E), price to book ratio (P/B), dividend yield (DY), and earnings yield. These four financial ratios are market value ratios, which describe a company's financial condition in relation to its shares. They are good measures of whether a particular stock is currently overpriced or a bargain compared with its peers. A high P/E ratio or P/B ratio may imply that a stock is overpriced. If the price of the stock nevertheless remains high, this indicates that investors have strong confidence in the stock and are willing to pay a high price for it. Investors who aim to receive a steady dividend every year will favour stocks with a high DY.
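As an illustration of why ratios are preferred over raw profit figures, the following minimal Python sketch derives a handful of the ratios discussed above. The formulas are common textbook definitions and are an assumption here; they are not taken from Table 2. The `Financials` class, the `basic_ratios` function, and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Financials:
    """Hypothetical raw figures for one company (all in the same currency unit)."""
    cash: float
    current_liabilities: float
    total_assets: float
    total_liabilities: float
    equity: float
    revenue: float
    net_profit: float


def basic_ratios(f: Financials) -> dict:
    """Compute a few ratios using textbook definitions (assumed, not Table 2)."""
    return {
        "cash_ratio": f.cash / f.current_liabilities,
        "total_asset_turnover": f.revenue / f.total_assets,
        "debt_ratio": f.total_liabilities / f.total_assets,
        "debt_to_equity": f.total_liabilities / f.equity,
        "roa": f.net_profit / f.total_assets,
        "roe": f.net_profit / f.equity,
        "net_profit_margin": f.net_profit / f.revenue,
    }


# Two made-up companies earning the same profit at very different scales.
small = Financials(cash=20, current_liabilities=10, total_assets=200,
                   total_liabilities=50, equity=150, revenue=120, net_profit=30)
large = Financials(cash=80, current_liabilities=100, total_assets=2000,
                   total_liabilities=800, equity=1200, revenue=600, net_profit=30)

small_ratios = basic_ratios(small)
large_ratios = basic_ratios(large)
```

Both hypothetical companies report the same net profit, yet the smaller one shows a much higher ROE and net profit margin, which is the kind of difference that a raw profit figure hides.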
The transformed dataset consisted of these 14 financial ratios for the 38 plantation stocks. Inevitably, the dataset contained some missing values, but these are explainable. For example, a company with a net loss does not have a P/E value; the P/E field of such a company was therefore set to zero. The subsequent step was to cluster the plantation stocks using these financial ratios.
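The zero-fill treatment of an undefined P/E could be sketched as follows. This is only an illustration under assumed names (`price_to_earnings`, `fill_missing`) and made-up prices; it is not the study's actual pre-processing code.

```python
from typing import Optional


def price_to_earnings(price: float, eps: float) -> Optional[float]:
    """Return P/E, or None when earnings are not positive (P/E is undefined)."""
    if eps <= 0:
        return None
    return price / eps


def fill_missing(value: Optional[float], default: float = 0.0) -> float:
    """Replace an undefined ratio with a default, mirroring the zero-fill above."""
    return default if value is None else value


# Hypothetical stocks: one profitable, one reporting a net loss.
pe_profitable = fill_missing(price_to_earnings(price=2.00, eps=0.10))
pe_loss_making = fill_missing(price_to_earnings(price=2.00, eps=-0.05))
```

One consequence worth keeping in mind is that a distance-based clustering algorithm treats the filled zero as an ordinary numeric value, so a loss-making stock will sit close to stocks with a genuinely low P/E on that dimension.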

Clustering Algorithms
The clustering algorithms used in this study, i.e., EM, KM and HC, grouped the plantation stocks based on the similarity of their financial ratios. The clustering outcomes were then analysed and evaluated thoroughly.
EM has been widely used in research areas such as machine learning and computer vision. Using EM, data are modeled as a linear combination of multivariate normal distributions. EM finds the parameters of a probability distribution that maximize the log-likelihood. Figure 2 shows the complete description of EM in pseudocode. In general, the algorithm has two key steps, i.e., the Expectation step (E-step) and the Maximization step (M-step) (Ordonez & Cereghini, 2000).
KM is one of the earliest clustering algorithms used in data mining research (Hasan et al., 2009). The algorithm is explained in the pseudocode shown in Figure 3. It works by partitioning the data into k clusters, where k is determined beforehand. The clustering begins by choosing the initial centroids for the k clusters at random points in the data. In the next step, the stocks are assigned to their closest centroids. Once all stocks are assigned to their respective clusters, the centroid of each cluster is re-calculated as the mean of its cluster members. These two key steps (steps 3 and 4 in the pseudocode, Figure 3) are repeated until convergence, that is, until the centroid of each cluster no longer changes.
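The assignment and update steps of KM described above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the study; the function name `kmeans`, the Euclidean distance, the stopping rule based on unchanged assignments, and the two-ratio sample data are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-Means: returns (labels, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # no assignment changed, so centroids are stable
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical two-ratio data (e.g. cash ratio, debt ratio), not from the study.
data = [(0.2, 0.7), (0.3, 0.6), (0.25, 0.65), (2.0, 0.1), (2.2, 0.15), (1.9, 0.2)]
labels, centroids = kmeans(data, k=2)
```

In practice the ratios would usually be put on comparable scales before computing distances, since a ratio with a large numeric range would otherwise dominate the Euclidean distance; whether the study applied such scaling is not stated.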
HC is another widely adopted clustering algorithm in data mining research. HC uses a merge-based (agglomerative) clustering method and works in a bottom-up manner (Ng, Phon-Amnuaisuk, & Ho, 2010). In the initial step, each instance of data is in its own cluster, so there are n clusters if the data contain n instances. The subsequent step is to find the two disjoint clusters that are closest to each other and merge them. This step is repeated until all clusters are merged into the specified k clusters. Figure 4 shows the details of the HC algorithm in pseudocode.
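The bottom-up merging described above can be sketched in a few lines of Python. The sketch assumes Euclidean distance and chooses, at each step, the pair of clusters whose union has the smallest diameter, a criterion closely related to what is commonly called complete linkage. The function name `agglomerative` and the sample data are hypothetical.

```python
import math
from itertools import combinations


def agglomerative(points, k):
    """Merge clusters bottom-up until k remain, using smallest-union-diameter merges."""
    clusters = [[p] for p in points]  # start with one cluster per instance

    def diameter(cluster):
        # Largest pairwise distance inside a cluster (0 for a singleton).
        return max((math.dist(a, b) for a, b in combinations(cluster, 2)), default=0.0)

    while len(clusters) > k:
        # Pick the pair (i < j) whose union has the smallest diameter.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: diameter(clusters[ij[0]] + clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical two-ratio data, not from the study.
data = [(0.2, 0.7), (0.3, 0.6), (0.25, 0.65), (2.0, 0.1), (2.2, 0.15), (1.9, 0.2)]
hc_clusters = agglomerative(data, k=2)
```

This naive version recomputes diameters at every step and is only suitable for small datasets such as the 38 stocks considered here.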

Hierarchical Clustering (HC)
Algorithm Hierarchical(Stock_Db, k)
Input: The plantation stocks dataset, Stock_Db, which comprises 14 financial ratios for each plantation stock
Output: k clusters of plantation stocks
1. for a ← 1 to n    // n denotes the number of plantation stocks in the dataset
2.     let Ca = {sa}    // start with n clusters, each holding only one plantation stock, s
3. repeat
4.     find a pair of non-merged clusters, e.g. Ca and Cb, such that the cluster resulting from their union has the smallest diameter
5.     merge Ca and Cb
6. until the n plantation stocks are clustered into k clusters

The three clustering algorithms were used in this study for the following reasons. Firstly, these algorithms are well established and commonly used in data mining research. Secondly, the computational efficiency of these clustering algorithms is considered acceptable (Abbas, 2008). Thus, the time and memory requirements are feasible for processing the transformed data on a low-end computer. The next section presents and discusses the clustering results of the plantation stocks. The statistical properties of the resulting clusters were used for stock analysis and profiling. A comparative study of investment returns for each of the resulting clusters was conducted based on the capital gain (in stock price) from January to December 2013.
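The comparison of investment returns mentioned above could be computed along the following lines. This is a sketch under assumptions: capital gain is taken as the fractional change between a start and an end price, dividends are ignored, and the function names, stock names, prices and cluster labels are all made up for illustration.

```python
def capital_gain(price_start: float, price_end: float) -> float:
    """Fractional price change over the holding period (dividends excluded)."""
    return (price_end - price_start) / price_start


def average_gain_by_cluster(prices, labels):
    """Average capital gain per cluster label.

    prices: dict of stock name -> (start price, end price)
    labels: dict of stock name -> cluster label
    """
    totals, counts = {}, {}
    for name, (start, end) in prices.items():
        lab = labels[name]
        totals[lab] = totals.get(lab, 0.0) + capital_gain(start, end)
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: totals[lab] / counts[lab] for lab in totals}


# Made-up prices and labels for illustration only.
prices = {"A": (1.00, 1.20), "B": (2.00, 2.10), "C": (4.00, 3.60), "D": (5.00, 5.50)}
labels = {"A": 1, "B": 1, "C": 0, "D": 0}
avg = average_gain_by_cluster(prices, labels)
```

Averaging the per-stock gains gives every stock equal weight regardless of its price or market capitalization, which is one of several possible choices.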

RESULTS & DISCUSSION
The clustering results of the three clustering algorithms are displayed in Table 3, together with the statistical properties of the resulting clusters. In this study, the mean was used to discuss the clustering results, and the 5-number summary (minimum, 1st quartile, median, 3rd quartile and maximum) was used to understand the data dispersion of the resulting clusters. The mean is the average of a financial ratio in a cluster. If a cluster has extreme values or an asymmetric data dispersion, these may greatly influence the mean value. Nevertheless, the mean was used in this study because it is easily understood and has convenient mathematical properties that allow it to be used in many statistical contexts (Whitley & Ball, 2002).
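The mean and 5-number summary of one financial ratio within one cluster can be obtained with Python's standard `statistics` module, as sketched below. The `profile` function and the sample cash ratios are hypothetical, and the "inclusive" quartile method is an assumption, since quartile conventions differ between tools.

```python
import statistics


def profile(values):
    """Return the mean and five-number summary of one ratio within one cluster."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical cash ratios with one extreme value.
cash_ratios = [0.2, 0.4, 0.5, 0.7, 6.0]
summary = profile(cash_ratios)
```

In this made-up sample the single extreme value pulls the mean above the 3rd quartile, which illustrates why the 5-number summary is a useful companion to the mean.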
The data were first clustered using EM. The number of clusters was not set beforehand as there was no prior information about the dataset. The number of clusters was determined by EM itself based on the patterns discovered in the data. As a result, EM yielded two clusters, i.e., cluster 0 and cluster 1, which contain 18 and 20 stocks, respectively. To allow a comparative analysis among the three clustering algorithms, the number of clusters was set to two for KM and HC.

Comparing Clusters
Table 3 shows the statistics of the clusters generated by the algorithms. Each cluster is represented by a specific name. For instance, EM C0 denotes cluster 0 produced by EM. By comparing the clusters, it was observed that the plantation stocks in cluster 1 generally had a better financial profile than those in cluster 0, regardless of the clustering algorithm used (Table 3). This was because the cluster 1s scored better than the cluster 0s in 11 out of 14 financial ratios, with major differences in cash ratio, debt ratio, debt-equity ratio, current debt to equity ratio, net profit margin, and operating margin (also refer to the Appendix). The profiles of the clusters are discussed below using these six financial ratios. The financial ratios of the cluster 0s and cluster 1s were compared regardless of the clustering algorithm used, as they exhibit the same phenomenon. The means of cash ratio in the cluster 1s were significantly higher than those in the cluster 0s.
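A tally of the kind reported above (how many ratios favour one cluster over the other) depends on stating, for each ratio, whether a higher or a lower mean counts as better. The sketch below shows one way to do this; the `DIRECTION` mapping, the `count_wins` function and all numbers are illustrative assumptions and are not values from Table 3.

```python
# Direction of "better" per ratio: +1 means higher is better, -1 means lower is better.
# These directions and all numbers below are illustrative assumptions, not Table 3 values.
DIRECTION = {"cash_ratio": +1, "debt_ratio": -1, "net_profit_margin": +1, "dividend_yield": +1}


def count_wins(means_a, means_b, direction=DIRECTION):
    """Count the ratios on which cluster A's mean beats cluster B's mean."""
    return sum(
        1 for ratio, sign in direction.items()
        if sign * (means_a[ratio] - means_b[ratio]) > 0
    )


c0 = {"cash_ratio": 0.6, "debt_ratio": 0.45, "net_profit_margin": 0.08, "dividend_yield": 0.031}
c1 = {"cash_ratio": 2.1, "debt_ratio": 0.20, "net_profit_margin": 0.19, "dividend_yield": 0.030}
wins_c1 = count_wins(c1, c0)
wins_c0 = count_wins(c0, c1)
```

With these made-up means, the second cluster is better on three of the four ratios and the first on one; ties count for neither side.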
A company with a strong cash ratio has sufficient funds to pay off its current liabilities (short-term loans, trade payables, etc.). Thus, the company is unlikely to face financial difficulty that would affect its normal operation. Investors should avoid any company with a very low cash ratio. Previous research by Kim and Kang (2010) and Min, Lee, & Han (2006) showed that such a company is highly susceptible to bankruptcy. In stock investment, it is always desirable to pick a company with a strong cash flow.
A company with high values of debt ratio, debt-equity ratio and current debt to equity ratio is in an "unhealthy" financial position and is likely to be exposed to financial risks. The means of debt ratio, debt-equity ratio and current debt to equity ratio in the cluster 1s were lower than those in the cluster 0s. These statistics showed that most of the plantation stocks in the cluster 1s were in "healthier" debt positions than the plantation stocks in the cluster 0s. As a result, they were unlikely to encounter short-term financial distress. Cielen, Peeters, & Vanhoof (2004) showed that the debt ratio is an efficient indicator for predicting bankruptcy.
For net profit margin and operating margin, higher values are better. A company with a higher net profit margin than its competitors indirectly reveals its management efficiency, particularly in cost control. The operating margin is a useful indicator for evaluating a company's operating performance. A high operating margin means that the company is efficient in managing its raw materials, logistics, products and staff. Both ratios are useful for comparing companies in a similar industry. The means of net profit margin and operating margin in the cluster 1s were significantly higher than those in the cluster 0s. This comparison showed that most of the plantation stocks in the cluster 1s had better profitability than those in the cluster 0s.
In addition to these six financial ratios, DY, P/E ratio and P/B ratio were also examined to get a better picture of the clusters' profiles. DY measures the dividend paid out by a company in one financial year relative to its stock price. For example, the DY is 0.05 (5%) if a company pays a dividend of RM0.05 per share in one financial year and its stock is currently traded at RM1.00. DY is a critical financial ratio for measuring the realized gain from a stock investment. It is a type of investment return other than the capital gain from stock price appreciation. Thus, a high DY is generally preferable in stock investment. On average, the plantation stocks in the cluster 1s paid a slightly higher DY than the plantation stocks in the cluster 0s. This means that the dividend payouts of the plantation stocks in the cluster 1s were more generous than those in the cluster 0s. For P/E ratio and P/B ratio, there were no significant differences between the cluster 0s and the cluster 1s. The mean values of P/E ratio and P/B ratio indicated that the plantation stocks were not undervalued.
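The DY example above, and its relation to capital gain, can be written out as a short Python sketch. The `total_return` helper is an assumption added for illustration (price change plus dividend, relative to the starting price) and is not a measure used in the study; the prices are made up.

```python
def dividend_yield(dividend_per_share: float, price: float) -> float:
    """Annual dividend per share divided by the current share price."""
    return dividend_per_share / price


def total_return(price_start: float, price_end: float, dividend_per_share: float) -> float:
    """Capital gain plus dividend, relative to the starting price (illustrative)."""
    return (price_end - price_start + dividend_per_share) / price_start


dy = dividend_yield(0.05, 1.00)            # the RM0.05 on RM1.00 example from the text
combined = total_return(1.00, 1.10, 0.05)  # hypothetical price rise from RM1.00 to RM1.10
```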

CONCLUSION
In this study, rapid profiling of the plantation stocks listed on Bursa Malaysia was performed using the EM, KM and HC algorithms. The profiles of the resulting clusters were described using the mean and the 5-number summary. It was observed that EM C1 had a better profile than the other clusters. The capital gains of the resulting clusters were then compared, and EM C1 gave the best investment return. In conclusion, the proposed profiling has demonstrated that it is able to identify clusters with good profiles and assist investors in building stock portfolios. Furthermore, the profiling is considered effective since it can provide beneficial investment information to investors.
Two future directions for this research have been identified: (1) incorporating other non-financial data (such as technical data) into the dataset for clustering and (2) further dividing the existing clusters to narrow down the search for companies with outstanding performance.

Figure 2. The Pseudocode Explaining How EM Performs Clustering on the Plantation Stocks Dataset.

Figure 4. The Pseudocode for the HC Algorithm.

Table 2
The Extracted Raw Financial Data Are Transformed Into 14 Useful Financial Ratios, Which Form the Dimensions of the Plantation Stocks Dataset

Table 3
The Statistics of Clusters Generated by EM, KM, and HC Algorithms. Comparing the Clusters Resulting from the Same Algorithm, the Numbers in Bold Indicate a Better Financial Ratio of a Cluster than the Other