SUMMARIZING INDONESIAN NEWS ARTICLES USING GRAPH CONVOLUTIONAL NETWORK

Multi-document summarization transforms a set of related documents into a concise summary. Existing Indonesian news article summarization does not take relationships between sentences into account and depends heavily on Indonesian language tools and resources. This study employed a Graph Convolutional Network (GCN), which takes word embedding sequences and a sentence relationship graph as input, for Indonesian news article summarization. The system in this study comprised four main components: preprocessing, graph construction, sentence scoring, and sentence selection. The sentence scoring component is a neural network that uses a Recurrent Neural Network and GCN to produce scores for all sentences. This study used three different representation types for the sentence relationship graph. The sentence selection component then generates a summary with two different techniques: greedily choosing sentences with the highest scores, and using the Maximum Marginal Relevance (MMR) technique. The evaluation showed that the GCN summarizer with the Personalized Discourse Graph representation achieved the best results, with an average ROUGE-2 recall score of 0.370 for a 100-word summary and 0.378 for a 200-word summary. Sentence selection using the greedy technique gave better results for generating a 100-word summary, while MMR performed better for generating a 200-word summary.


INTRODUCTION
Nowadays, people use online media to retrieve the latest news worldwide. Unlike information shared on social media, which suffers from originality and authenticity problems (Olanrewaju & Ahmad), news articles often share redundant information because an event is reported multiple times by different media portals. Multi-document summarization takes these topic-related articles as input and produces a concise summary. An ideal summary contains the common information of the news topic and the unique information of each article (Goldstein et al., 2000).
There are two common techniques for automatic summarization: extractive and abstractive summarization. Extractive summarization extracts important sentences from each document, while abstractive summarization collects important information from the documents and rewrites it in the form of new sentences as the summary (Jurafsky & Martin, 2014). While an ideal human-made summary is abstractive, extractive summarization often performs only slightly worse than abstractive summarization, which requires more resources and tools (Galanis et al., 2012). Thus, this paper chose the extractive approach.
Most extractive summarization research uses supervised learning to determine salient sentences. Kupiec et al. (1995) built one of the earliest extractive summarization systems using a Naïve Bayes classifier. Each sentence is transformed into simple features such as sentence length and position, and a model is trained to classify each sentence as important or not. Hong and Nenkova (2014) created another extractive summarization system based on word importance. Sentence salience is determined by the importance of the sentence's words, and each word is represented as thousands of features. Although this system achieved one of the best performances on the DUC 2004 summarization tasks, it requires many resources and careful feature engineering.
Recent summarization research began to use powerful artificial neural networks to generate summaries. Instead of doing extensive feature engineering, most summarization systems using neural networks only require low-level features as input. Rush et al. (2015) and Zhang et al. (2018) employed sequence-to-sequence architectures to generate single-document abstractive summaries with only word embeddings as input. Cao et al. (2017) and Zhang et al. (2017) used Convolutional Neural Networks (CNN) for multi-document summarization systems without using any handcrafted features.
Some multi-document summarization research also utilized sentence relationships represented as graphs. Erkan and Radev (2004) computed sentence importance based on the concept of eigenvector centrality in a cosine similarity sentence graph. Christensen et al. (2013) generated a graph depicting discourse relations between sentences to produce a more coherent summary. Recently, Yasunaga et al. (2017) introduced the Graph Convolutional Network (GCN) (Kipf & Welling, 2016) into their neural network architecture. By using GCN, the system considers sentence relationships when estimating the score of a sentence. These works show that considering sentence relationship graphs provides better summary results.
There are several multi-document summarization works for Indonesian news articles. Christie and Khodra (2016) proposed an abstractive summarization system employing sentence fusion with word graphs. Reztaputra and Khodra (2017) proposed a summarization system based on the sentence structure of subject, predicate, object, and complement (SPOC). While neither requires a training data set, both are very dependent on Indonesian language processing tools and resources. For example, some bad summaries generated by Reztaputra and Khodra (2017) were often caused by SPOC extraction mistakes attributed to the quality of the Indonesian dependency tree used. Although employing sentence relationships has improved the performance of summarization systems (Erkan & Radev, 2004; Christensen et al., 2013; Yasunaga et al., 2017), existing Indonesian multi-document summarization works have not considered relationships between sentences in generating summaries. Since considering sentence relationships is also crucial to identify important sentences in Indonesian texts, we will show that employing sentence relationships improves the performance of Indonesian summarizers. This paper investigates whether adapting sentence relationships in GCN (Yasunaga et al., 2017) provides better performance for Indonesian news article summarization. As described before, the system generates extractive summaries without employing feature engineering, and only requires word embeddings as input; word embeddings can be easily constructed for any language. We will also show that our system provides better results than abstractive summaries from existing systems. This paper is organized as follows: Section 2 discusses related works; section 3 describes the proposed solution used in this study; section 4 shows the experiment and evaluation; section 5 discusses the results; and finally, section 6 presents the conclusion and future plans.

RELATED WORKS
There are two main steps in extractive summarization: sentence scoring and sentence selection. A sentence with a higher score is more likely to be a part of the summary. Kupiec et al. (1995) used Naïve Bayes to classify a sentence represented with surface-level features to ascertain whether it was salient or not. Hong and Nenkova (2014) worked on word-level features that were mostly hand-crafted and determined the salience of a sentence as the average importance of each word in the sentence. Sentence selection then selects sentences that have previously been scored. Hong and Nenkova (2014) used a greedy approach by selecting sentences with higher scores first. Other summarization research used selection techniques based on certain heuristics such as Integer Linear Programming (ILP) (Galanis et al., 2012) and Maximum Marginal Relevance (MMR) (Goldstein et al., 2000). For example, MMR ensured that selected sentences were relevant to the sentence candidates but not redundant with the summary.
The Indonesian news article summarization systems created by Christie and Khodra (2016) and Reztaputra and Khodra (2017) were modifications of the clustering-based summarization approach proposed by Sarkar (2009). Figure 1 shows Sarkar's summarization architecture. Similar sentences are grouped into a cluster. The clusters are ordered so that the most important cluster is the first one to be put in the summary, and from each cluster a representative sentence is selected to be added to the summary. Christie and Khodra (2016) did not use representative selection; instead, they generated all possible sentences by sentence fusion, utilizing word graphs that depict the relations of words in sentences. Reztaputra and Khodra (2017) did not cluster sentences based on their similarity; they instead clustered sentences based on the similarity of their extracted subjects and objects. Similar to Christie and Khodra (2016), their system generated all possible sentences based on the clusters. Both depend on the summary generation step to select salient sentences: Christie and Khodra (2016) used Integer Linear Programming (ILP) while Reztaputra and Khodra (2017) used Maximal Marginal Relevance (MMR). However, both approaches are very dependent on Indonesian language resources. For example, the incorrect sentence clusters generated by Reztaputra and Khodra (2017) were mostly caused by the quality of the Indonesian dependency tree. In addition, clustering approaches often miss important sentences due to incorrect clustering; for instance, when two important sentences are clustered together, only one of them is picked for the summary. Yasunaga et al. (2017) proposed a multi-document summarization system using neural networks that can accept a sentence relationship graph as additional input through GCN. First, each sentence enters a Recurrent Neural Network (RNN) to produce a sentence embedding.
Next, the GCN accepts both the sentence embeddings and the sentence graph as input to produce better sentence embeddings, whose values are adjusted to consider sentence relationships. The sentence salience score is then obtained based on the sentence embedding and the cluster embedding. Summaries are then composed by greedily choosing high-score sentences.

PROPOSED METHOD
This study adapted the summarization method of Yasunaga et al. (2017) to improve the performance of summarization systems for Indonesian news articles. Fig. 2 shows an overview of the adapted automatic summarization system. There are four major components: preprocessing, graph construction, sentence scoring, and sentence selection. The sentence scoring component applies a neural network architecture that has several sub-components.

A. Preprocessing
This component performs sentence segmentation and tokenization of each sentence. We also removed quotations because Reztaputra and Khodra (2017) had shown that doing so improved performance. The results of preprocessing are used as the input for both the sentence scoring and graph construction components.
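As an illustration, the preprocessing steps above might look like the following sketch; the segmentation and tokenization rules here are simplifying assumptions, not the exact rules of the actual component.

```python
import re

def preprocess(article: str):
    """Split an article into sentences, drop quotations, and tokenize.

    A minimal illustration of the preprocessing component; the exact
    segmentation rules and quotation filter are assumptions, not the
    authors' implementation.
    """
    # Remove quoted spans first, since Reztaputra and Khodra (2017)
    # found that dropping quotations improves summary quality.
    no_quotes = re.sub(r'"[^"]*"', '', article)
    # Naive sentence segmentation on sentence-final punctuation.
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', no_quotes)
                 if s.strip()]
    # Simple word tokenization: lowercase alphanumeric tokens.
    return [re.findall(r'\w+', s.lower()) for s in sentences]

tokens = preprocess(
    'Gempa terjadi di Dieng. "Kami siaga," kata petugas. Warga mengungsi!')
```

With the sample article above, the quoted clause is stripped before segmentation, leaving three tokenized sentences.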

B. Graph Construction
This component forms a sentence relationship graph for each news topic. In the graph, a node is a sentence and the edge connecting two nodes shows how strongly the two sentences are related. We used the three graph representations from Yasunaga et al. (2017), as follows:
1. Cosine graph. The weight of the edge between two sentence nodes is the tf-idf cosine similarity of the two sentences, so the maximum weight of an edge is 1.0, reached when both sentences are exactly the same. Any edge with a weight below 0.2 is removed, i.e. its weight is set to 0.0.

2. Approximate Discourse Graph (ADG), as proposed by Christensen et al. (2013). Unlike the cosine graph, this graph illustrates the relationships between sentences based on discourse relations. While Christensen et al. (2013) used five aspects, we used only three to determine sentence relationships, due to the limited resources for the Indonesian language: discourse markers, coreference resolution, and entity continuation. We used 100 Indonesian discourse markers depicting various discourse relations, e.g. "walaupun" (although), "begitu pula" (likewise), and "akibatnya" (consequently). For coreference resolution, we used a very simple rule-based approach in which two sentences were connected if the latter sentence contained a pronoun. Lastly, entity continuation relies on the Indonesian Named Entity Recognition (NER) proposed by Wibisono and Khodra (2018). Figure 3 shows an example of an ADG created for three sentences.
3. Personalized Discourse Graph (PDG), proposed by Yasunaga et al. (2017), is a graph representation that modifies an ADG edge's weight by infusing the personalization scores of both connected sentences into the edge's weight. With reference to the regression-based approach to summarization (Ren et al., 2018), the personalization score of a sentence is essentially a salience score obtained from the surface-level features of the sentence. We used the same features described in Yasunaga et al. (2017).

C. Sentence Scoring
We obtained the score of each sentence with the artificial neural network architecture adapted from Yasunaga et al. (2017). The sentence scoring component contains the following sub-components:
1. Sentence RNN takes each sentence's token sequence as input, with each token represented as a word embedding. The output is the last hidden state of the RNN, called the sentence embedding.

2. GCN takes the sentence embeddings from the sentence RNN and the sentence relationship graph as input. The output is the final sentence embeddings, which infuse the graph relations into their values.

3. Document RNN processes the sequence of sentence embeddings of each document. The output is the last hidden state of the RNN, called the document embedding. The cluster embedding is then calculated by averaging all document embeddings.

4. The score estimation component is a regular feed-forward neural network that takes a sentence embedding and the cluster embedding as input and estimates the sentence's score from them.
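The role of the GCN sub-component (step 2) can be sketched with a single propagation layer of Kipf and Welling (2016), H' = ReLU(Â·H·W), where Â is the symmetrically normalized adjacency with self-loops. This pure-Python version only illustrates how each sentence embedding is mixed with its neighbors'; it is not the system's implementation.

```python
import math

def gcn_layer(adj, feats, weight):
    """One GCN layer (Kipf & Welling, 2016): H' = ReLU(A_hat . H . W).

    adj: n x n sentence graph; feats: n x d sentence embeddings;
    weight: d x d_out trainable matrix. Pure-Python sketch.
    """
    n = len(adj)
    # Add self-loops so each sentence keeps its own features.
    a = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    deg = [sum(row) for row in a]
    # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}.
    a_hat = [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    dim_in, dim_out = len(feats[0]), len(weight[0])
    # Message passing: aggregate neighbor features.
    agg = [[sum(a_hat[i][k] * feats[k][f] for k in range(n))
            for f in range(dim_in)] for i in range(n)]
    # Linear transform followed by ReLU.
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(dim_in)))
             for o in range(dim_out)] for i in range(n)]
```

With two connected sentences and an identity weight matrix, each output embedding becomes an equal blend of both inputs, which is exactly the relationship-infusing behavior described above.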
Every sub-component has weights that need to be determined in order to obtain the best results. To determine the best weights, we trained the network on a training dataset in which the target of the network is the normalized ROUGE score of all sentences, shown in Equation (1) (Yasunaga et al., 2017):

    y(s_i) = exp(α · r(s_i)) / Σ_j exp(α · r(s_j))        (1)

where r(s_i) is the average of the ROUGE-1 and ROUGE-2 recall scores of sentence s_i and α is the softmax distribution constant. Training was done with the backpropagation algorithm. We used the cross-entropy loss between the sentence score output and the normalized ROUGE target as the loss function.
Each sub-component also has hyperparameters. For example, an RNN needs its hidden state length and maximum sequence length to be set. The softmax distribution constant α in Equation (1) is also a hyperparameter. This study determined the best hyperparameter values by conducting experiments on the validation results.
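The normalized training target of Equation (1) reduces to a softmax over the per-sentence average ROUGE recalls. A small sketch, where the default alpha value is purely illustrative (the actual value is a tuned hyperparameter):

```python
import math

def target_distribution(rouge_recalls, alpha=40.0):
    """Softmax-normalized training target of Equation (1).

    rouge_recalls: per-sentence averages of ROUGE-1/ROUGE-2 recall;
    alpha: the softmax distribution constant (illustrative default).
    """
    exps = [math.exp(alpha * r) for r in rouge_recalls]
    total = sum(exps)
    return [e / total for e in exps]
```

The output sums to one, and a larger alpha concentrates more of the target mass on the sentences with the highest ROUGE recall.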

D. Sentence Selection
For sentence selection, we used the following two techniques:
1. A greedy approach to create the summary of a topic, based on the approach described by Hong and Nenkova (2014). The component repeatedly chose the sentence with the highest score as part of the current summary, as long as the summary had not reached its length limit and the sentence was not deemed redundant.

2. Maximum Marginal Relevance (MMR) was used to generate the summary. MMR was chosen as an alternative because we often found that the sentence with the highest score was an extremely long sentence that did not fit the summary. We first selected the sentence candidates with the highest scores, then used Equation (2) to generate the summary (Goldstein et al., 2000):

    MMR = argmax_{s_i ∈ C \ S} [ λ · Sim1(s_i, C) − (1 − λ) · max_{s_j ∈ S} Sim2(s_i, s_j) ]        (2)

where C is the set of available sentence candidates and S is the set of selected sentences. In this way, MMR ensures that the selected sentence represents the candidate sentences yet is not too similar to the already selected sentences.
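Both selection techniques can be sketched as follows. These are simplified assumptions: the greedy method's redundancy check is omitted, `lam` is an illustrative lambda value, and the salience score stands in for the relevance term of the MMR formula.

```python
def greedy_select(sentences, scores, limit=100):
    """Greedy selection: take the highest-scored sentences that still
    fit under the word limit (redundancy check omitted)."""
    order = sorted(range(len(sentences)), key=lambda i: -scores[i])
    chosen, words = [], 0
    for i in order:
        length = len(sentences[i].split())
        if words + length <= limit:
            chosen.append(i)
            words += length
    return chosen

def mmr_select(candidates, scores, sim, lam=0.7, limit=100):
    """MMR selection: trade a candidate's salience against its maximum
    similarity to already selected sentences. sim[i][j] is a sentence
    similarity matrix; lam balances relevance vs. redundancy."""
    selected, words = [], 0
    remaining = list(range(len(candidates)))
    while remaining:
        def mmr(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        length = len(candidates[best].split())
        if words + length > limit:
            break
        selected.append(best)
        words += length
        remaining.remove(best)
    return selected
```

Note how the two techniques diverge: greedy simply skips a sentence that does not fit, while this MMR sketch stops once the next best pick would exceed the limit.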

EXPERIMENT AND EVALUATION
This section comprises the experiment settings, results, and discussion. The experiment settings explain the dataset details and how the best models were determined. The results elaborate on the best hyperparameters obtained, including details of the validation and test data set evaluations. Finally, the discussion analyzes the summaries produced by each system.
Journal of ICT, 18, No. 3 (July) 2019, pp: 345-365

A. Experiment Setting
Table 1 shows the details of the data sets used in the experiment. To compare this work fairly to previous Indonesian summarization works, we used the exact same test data set as Christie and Khodra (2016) and Reztaputra and Khodra (2017). The test data set contained five topics with a total of 87 articles and 1,216 sentences. Each topic had both a 100-word and a 200-word summary written by one human annotator. Unlike previous Indonesian multi-document summarization works (Christie & Khodra, 2016; Reztaputra & Khodra, 2017), this approach required supervised learning to determine the best weights for the sentence scoring neural network component, so we also used training and validation data sets. We built a separate model for each summary length, and for each length trained four different models: one without GCN and three with GCN, each using a different graph representation type. Each model was trained on 50 topic clusters and validated on 10 topic clusters. Unlike the test set, each of these topics had 100-word and 200-word summaries written by two different annotators.
Validation was performed in order to determine the best hyperparameters for each model. A model with particular hyperparameters was better than other models if the generated validation set summaries had better performance. We measured a system summary's performance by comparing it to annotators' summaries using ROUGE-2 recall metric (Lin, 2004).
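The ROUGE-2 recall used for this comparison can be sketched as bigram overlap between a system summary and a reference summary; this minimal version works over whitespace tokens and skips the stemming and stopword handling of the official toolkit.

```python
from collections import Counter

def rouge2_recall(candidate, reference):
    """ROUGE-2 recall (Lin, 2004): the fraction of the reference
    summary's bigrams that also appear in the candidate summary."""
    def bigrams(text):
        toks = text.lower().split()
        return list(zip(toks, toks[1:]))
    ref = Counter(bigrams(reference))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    cand = Counter(bigrams(candidate))
    # Clipped match count: each reference bigram can only be matched
    # as many times as it occurs in the candidate.
    matched = sum(min(c, cand[g]) for g, c in ref.items())
    return matched / total
```

For example, a candidate sharing one of a reference's two bigrams scores 0.5.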
After the best models were obtained, we evaluated each model on the test data set used in previous Indonesian summarization studies, again using the ROUGE-2 recall metric. The performance of greedy selection and MMR selection was compared for each system.
This study also compared the results with three clustering-based baseline systems. Baseline I is an adaptation of Sarkar (2009) for Indonesian articles, baseline II is the summarizer of Christie and Khodra (2016), and baseline III is the summarizer of Reztaputra and Khodra (2017).
In addition, we developed word embeddings from an Indonesian news article collection containing 171,923 articles, using the Word2vec (Mikolov et al., 2013) and FastText (Bojanowski et al., 2016) techniques.

B. Results
Table 2 shows the hyperparameters of the best models for both the 100-word and the 200-word summaries. The two models differed in several of the best hyperparameter values. Interestingly, the best 100-word model only required the first 20 sentences of each article as input to obtain the best results. Figure 4 shows the validation loss of the system without GCN and of the GCN systems with each graph representation for the 100-word summary models. It is clear from the graph that the GCN with ADG and GCN with PDG systems decreased the validation loss compared to the system without GCN, while the GCN with cosine graph system had the worst, and an unstable, validation loss.

Figure 4. Validation loss for each 100-word model.

Table 3 shows the results of each system for 100-word and 200-word summaries. The GCN with PDG and greedy selection system gave the best results for 100-word summaries, while MMR gave much worse results than greedy selection for all systems on 100-word summaries. Even though Yasunaga et al. (2017) showed that all GCN systems achieved better results, the system without GCN achieved better results than the GCN with ADG and cosine graph systems.

For 200-word summaries, GCN with PDG once again gave the best results, but this time with MMR selection. Compared to the 100-word summary results, MMR gave a slightly better performance than greedy selection for 200-word summaries. The GCN with ADG and PDG systems outperformed the system without GCN, while the GCN with cosine similarity graph resulted in a declining performance. Table 4 then compares the performance of the proposed solution to the baselines. The system without GCN already performed slightly better than the baselines, and the GCN with PDG system showed a very significant improvement over the baselines.

C. Discussion
As shown in Table 4, the GCN with PDG system's ROUGE-2 recall is much higher than that of the previous baselines. Table 5 shows the best generated summary, with a ROUGE-2 recall score of 0.703. The summary is almost identical to the reference summary apart from some grammatical errors. The decrease in ROUGE score was caused by a few unimportant phrases ("In their press release, Tuesday, 21 June 2016") and by different phrases that actually had the same meaning ("to synergize" and "a synergy between"). On the other hand, Table 6 shows the worst generated summary. Time and place information phrases like "Kamis (2/6/2016)" (Thursday (2/6/2016)) and "Tangerang Selatan City" are repeated many times. This redundant information fills up the 100-word length limit and therefore blocks other salient sentences from being included in the summary. While MMR is better at preventing redundancy, it is more prone to selecting non-salient sentences.

We analyzed how each type of graph influences the sentence scoring results.
The study used the "Gempa Dieng" (Dieng Earthquake) topic as an example. Table 7 shows the three sentences with the highest scores generated by the GCN with cosine graph system, with bold text marking overlapping phrases between sentences. It is very clear that the top three sentences convey the same information. All three sentences also share the same long phrase, "Pusat Vulkanologi dan Mitigasi Bencana Geologi (PVMBG)" (Centre for Volcanology and Geological Hazard Mitigation), which boosts the cosine similarity weight between the sentences. This may be why the cosine graph performed even worse than no graph: the selected sentences were often similar and did not add new information. Table 8 shows the same for the GCN with PDG system, where bold text indicates fulfilled discourse relation aspects. All of the sentences share the entity "Dataran Tinggi Dieng" (Dieng Plateau), fulfilling the entity continuation aspect, and most of them have a discourse marker as the first word, e.g. "Sementara" ("Meanwhile"). Each of the five sentences gives different information; in this way, the GCN with PDG system is able to obtain richer information. The GCN with ADG system showed similar results, but had a slight tendency to choose non-salient sentences that fulfilled the discourse aspects. Similar to Yasunaga et al. (2017), we also analyzed the influence of each type of graph representation by measuring the correlation between the sum of the incoming edges' weights of a sentence node and its salience score. Table 9 shows the average correlation results. PDG has a stable correlation value for each topic, while ADG can result in both very high and very low correlations. As the results in Table 9 mostly show high, positive correlations, we can conclude that a sentence that is interconnected with many sentences will most likely have a high salience score.
Table 9. Correlation between sentence salience and sum of incoming edge weights
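The correlation analysis behind Table 9 can be reproduced with a plain Pearson correlation between each node's total incoming edge weight and its salience score; a stdlib sketch:

```python
import math

def incoming_weight_sums(adj):
    """Column sums of the graph adjacency matrix: the total incoming
    edge weight for each sentence node."""
    n = len(adj)
    return [sum(adj[i][j] for i in range(n)) for j in range(n)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series,
    e.g. incoming edge weight sums vs. salience scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0
```

A strongly interconnected node whose salience rises with its incoming weight yields a correlation near 1.0, matching the mostly high, positive values reported in Table 9.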

CONCLUSION
The proposed adaptation of summarization using GCN performs well, with an average ROUGE-2 recall of 0.370 for 100-word summaries and 0.378 for 200-word summaries. These results are better than those of previous Indonesian clustering-based summarization research. The best system is the GCN system with PDG. Unlike in the adapted approach, the GCN systems with ADG and with the cosine graph did not perform better than the system without GCN. Sentence selection using the greedy technique gave better results for generating 100-word summaries, while MMR performed better for generating 200-word summaries.
Improvements can be made by including other discourse aspects of Indonesian texts, such as deverbal-noun references and more advanced coreference resolution, for better ADG and PDG representations. Following Alias et al. (2017), sentence compression could also be considered to remove unnecessary words and reduce the length of each sentence, so that the system can fit more compressed sentences into the summary.