NUWT: JAWI-SPECIFIC BUCKWALTER CORPUS FOR MALAY WORD TOKENIZATION

This paper describes the design and creation of a monolingual parallel corpus for the Malay language written in Jawi. It proposes a new corpus called the National University of Malaysia Word Tokenization (NUWT) corpora. To the best of our knowledge, there is currently no sufficiently comprehensive, well-designed standard corpus for the Jawi script that is annotated and publicly available. The corpus contains the Jawi-specific Buckwalter character code and can be used to evaluate the performance of word tokenization tasks, as well as further language processing. The objective of this work is to conform and standardize the corpora between similar characters in Jawi. The corpus consists of three subcorpora with documents from different genres. The gathering and processing steps, as well as the definition of several evaluation tasks regarding the use of these corpora, are included in this paper. Tokenization, one of the fundamental tasks supported by the corpus, is also presented. The development of the Malay language tokenizer is based on the syntactic data compatibility of Malay words written in Jawi. A series of experiments was performed to validate the corpus and to fulfill the requirements of the Jawi script tokenizer, with an average error rate of 0.020255. Based on this promising result, the tokens will be used for disambiguation and unknown word resolution, such as the out-of-vocabulary (OOV) problem, in the tagging process.


INTRODUCTION
The annotated corpora, the Malay corpus (Knowles & Don, 2003) and the Dewan Bahasa dan Pustaka (DBP) database (DBP, 2015) corpora, are among the most valuable resources in current natural language processing in Malay studies. They underlie statistical research in monolingual tasks, such as sentence structure, syntactic disambiguation, semantic recognition and information retrieval. Annotated corpora constitute a very useful tool for research.
The Malay language has two writing systems. It is usually written in Roman (Rumi), which uses the Latin alphabet, and is also written in Jawi, which originates from Arabic. Some efforts are currently being undertaken to strengthen Jawi writing among the Malays in Malaysia. For the Malaysian Malay community, the creation of a monolingual parallel corpus between Roman and Jawi has a special significance. It provides the basis for the development of Malay language applications that can be used to facilitate or even avoid the labor-intensive and time-consuming manual handling of parallel language information. In addition, such a corpus enables the empowerment of minority languages. With the use of a monolingual parallel corpus and methods that allow the transfer of linguistic annotations across parallel languages, new resources and tools can be created for minority languages.
The goal of the research presented in this paper is the development of a parallel language corpus and basic tools and resources for the Malay language. This paper describes the creation of such a parallel corpus and the attachment of a part-of-speech (POS) tagset for the Malay corpus. The pattern of the Malay language is quite different from that of Indo-European languages at the lower levels. At a high syntactic level, the language is similar enough to Indo-European languages that one can talk of direct objects in transitive constructions and even of agentless passives. The dominant sentence order is SVO (Knowles & Don, 2003).
Several collections of unannotated Jawi texts do exist. However, the only corpus with incorporated linguistic information that is currently available for the Malay language is a small corpus of approximately 30,500 tokens annotated with POS analyses. The Malay corpora written in Rumi are the Malay corpus (Mohamed, Omar, & Ab Aziz, 2011) and the Malay Corpus UKM-DBP (Saad, Bakar, Karim, Tukiman, & Nor, 2012). For the attachment of the POS tagset to a new parallel corpus, an original tagset applied in Mohamed et al. (2011) is used.
In the study of Pattern Recognition (PR), a Jawi script undergoes the process of character recognition in handwritten or printed form. Segmentation deals with the identification of the Jawi characters. The characters are then handled by Natural Language Processing (NLP) and Machine Learning (ML) processes in the final part. Previous studies related to Jawi include research on Multimedia (MM) (Diah, Ismail, Ahmad, & Abdullah, 2010; Diah, Ismail, Hami, & Ahmad, 2011); Engineering (Ismail, Yusof, & Jomhari, 2010); PR (Azmi, 2013; Heryanto, Nasrudin, & Omar, 2008; Redika, Omar, & Nasrudin, 2008); and NLP (C. W. S. B. C. W. Ahmad, Omar, Nasrudin, Murah, & Azmi, 2013; Sulaiman, 2013), just to name a few. Figure 1 shows the trend of the studies conducted on the Jawi script. The corpus development of the Jawi script starts from the use of the POS tagset.
For the Malay language written in Roman, a number of studies pioneered by linguists and computing researchers have developed a POS tagset corpus in support of their research. Atwell (2008) states that before we develop a POS tagset, or decide to re-use an existing POS tagset, we should be clear about why we want to POS-tag the corpus. For developers of corpus resources for general purposes, the aim is perhaps to enrich the text with linguistic analyses to maximize the potential for corpus re-use in a wide range of applications.
On the other hand, very fine-grained distinctions may cause problems for automatic tagging if some words can change their grammatical tag depending on function and context. At present, a POS tagset for Jawi is no longer being developed. To address this gap, we believe that the corpus development of the Jawi script can facilitate the analysis of the NLP task pipeline for the Malay language written in Jawi. The next section gives a detailed description of the new corpus, namely the NUWT corpora, its collection, its acquisition processes and its structure.

NUWT CORPORA DESCRIPTION
The NUWT corpora sources were gathered from three different genres of documents. The "Quranic Malay written in Jawi character Corpus" (Sulaiman, Omar, Omar, Murah, & Rahman, 2011) is an unannotated text of the Quran translation and contains a collection of 114 chapters with 157,388 words.
The corpus was used in an NLP task, a stemmer for Jawi characters, using two sets of rules in Jawi. One set of rules was used to stem various forms of derived words, while the other set was used to replace the use of a dictionary by producing the root word for each derivative.
The second source is an annotated corpus named the "Malay corpus" and contains 18,135 tokens with 1,381 words that have ambiguous tags. The corpus was prepared by Mohamed et al. (2011) using the Dewan Bahasa dan Pustaka (DBP) tagset. The DBP tagset was used because DBP is the highest government-authorized body concerning Bahasa Malaysia and the grammar follows that of Bahasa Malaysia. The "Malay corpus" was written using Roman (Rumi) writing and has 21 tags, as shown in Table 2.
The third source corpus is a grammatical corpus named the "Malay corpus UKM-DBP". It is retrieved from Saad et al. (2012) and is a collection of newspapers, magazines and books with 12,304 words. The corpus was developed according to the DBP tagset and written using Roman writing. It has five main tags, with the elaboration fraction for each main tag shown in Table 3.
Abdul Rahman, 2014; Yonhendri, 2008). Figure 2 shows the original Jawi orthography from old manuscripts. Languages with a close orthography, such as Arabic, lead in such studies (Diab, Hacioglu, & Jurafsky, 2004). Even though Jawi is similar to Arabic, Jawi has six more characters than Arabic. Jawi script also represents the Malay language, which is different from Malay (Roman). MADAMIRA is an Arabic morphology project by Pasha et al. (2014), which can be accessed at http://nlp.ldeo.columbia.edu/madamira/. Figure 3 shows a sample input and Figure 4 shows the result of the tokenization and morphology from MADAMIRA.
As shown, several additional characters have been changed to the equivalent similar orthography in Arabic. However, the equivalent similar orthography represents another pronunciation sound. The orthography issues are shown in Table 4.

Table 4. Orthographic issues using MADAMIRA.

NUWT CORPORA DEVELOPMENT
The NUWT corpora development goes through several phases, namely the preprocessing or encoding phase, tokenization and corpus annotation. In the first phase, two encoding transliteration processes are carried out, which are the transliteration of Roman to Jawi and the transliteration of Jawi (Unicode) to Jawi (Buckwalter code). In this phase, both Malay corpora (Malay corpus and Malay corpus UKM-DBP) are transliterated into Jawi using Teruja or Ejawi (Malay Text Rumi-To-Jawi Script Transliteration system), which can be accessed at http://www.jawi.ukm.my and http://www.ejawi.net/. After the transliteration process, the data is randomly checked using the Rumi-Jawi-Unicode dictionary for writing style and Unicode. Table 5 shows several errors detected using Teruja and Ejawi. The processed data is shown in Fig. 5.

Table 5 (continued)
Error: Loanword. Description: The suggested word is given but requires a human expert to check the spelling.
Error: Variant positional forms. Description: An Arabic representation form has variant positional forms, which are the isolated, initial, medial and final forms. All Arabic texts (including Jawi) follow the Arabic representation form. In order to use the Jawi-to-Buckwalter transliteration code, we only used Arabic letters in the Arabic block (U+0600..U+06FF) or the Arabic Supplement block (U+0750..U+077F).

In the second phase, the NUWT Corpora scripts are transliterated into the Buckwalter transliteration. Roman (Rumi) writing uses the ASCII character code, whereas Jawi writing uses the Unicode character code. Jawi script has 36 characters, consisting of 30 characters shared with the Arabic language and six newly created characters to meet the needs of Malay language phonemes. The six additional letters are ca (چ), nga (ڠ), nya (ڽ), pa (ڤ), ga (ڬ), and va (ۏ). In this project, we applied the Buckwalter transliteration to the Jawi corpora, named the NUWT Corpora.
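The restriction to the Arabic block (U+0600..U+06FF) and the Arabic Supplement block (U+0750..U+077F) can be checked mechanically. The following is a minimal sketch, not the procedure used for the NUWT Corpora: the function names are hypothetical, and the use of NFKC normalization to fold positional presentation forms back to base letters is an assumption (NFKC also folds other compatibility characters, so it may be too aggressive for some texts).

```python
import unicodedata

# Code point ranges named in the text (inclusive).
ARABIC_BLOCK = (0x0600, 0x06FF)
ARABIC_SUPPLEMENT_BLOCK = (0x0750, 0x077F)


def in_allowed_blocks(ch):
    """True if ch lies in the Arabic or the Arabic Supplement block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in (ARABIC_BLOCK, ARABIC_SUPPLEMENT_BLOCK))


def fold_positional_forms(text):
    """Fold presentation-form code points to their base letters.

    Assumption: NFKC compatibility normalization is acceptable here;
    the paper does not say which tool performed this step.
    """
    return unicodedata.normalize("NFKC", text)


def stray_arabic_chars(text):
    """List Arabic-script characters that fall outside the two allowed blocks."""
    return [
        ch for ch in text
        if unicodedata.name(ch, "").startswith("ARABIC") and not in_allowed_blocks(ch)
    ]


# U+FB7A is the isolated presentation form of TCHEH (Jawi "ca").
sample = "\ufb7a"
print(stray_arabic_chars(sample))                         # one stray character
print(stray_arabic_chars(fold_positional_forms(sample)))  # none left after folding
```

A check of this kind could be run before transliteration so that every Jawi letter reaching the Buckwalter step has a single code point regardless of its position in the word.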
The Buckwalter transliteration has been used in many publications in natural language processing (NLP) and in resources developed at the LDC (Habash, Soudi, & Buckwalter, 2007). The main advantages of the Buckwalter transliteration are that it is a strict transliteration (i.e., one-to-one) and that it is written in ASCII characters. Languages with a similar orthography, such as Urdu (Habash & Metsky, 2008) and Pakistani language (Irvine, Weese, & Callison-Burch, 2012), also use the Buckwalter transliteration. The extended Buckwalter transliteration applied to Jawi scripts is shown in Table 6 (Abdul Rahman, 1999; Unicode, 2014). The highlighted sections indicate those parts of the scheme that have been extended over the original scheme.

Table 6
The Positional Variant Forms of Jawi Characters with Unicode & Buckwalter

All Arabic script transliterations are provided in the Extended Buckwalter transliteration scheme (Bakar, Omar, Nasrudin, Murah, & Ahmad, 2013). This scheme extends Buckwalter's transliteration scheme (Buckwalter, 2002) to increase its readability while maintaining the one-to-one correspondence with the orthography as represented in Unicode. For Jawi-specific extensions of the Arabic script, we extend the Habash et al. (2007) transliteration scheme as follows: X (ڽ), W (ۏ), e (ڠ), Q (ݢ).

The next phase in natural language text preparation is tokenization, which typically plays an important role in cutting a string into identifiable units that constitute a piece of language data (Bird, Klein, & Loper, 2009). The simplest method commonly used in tokenizing a text is to split it on whitespace.
Although this is a fundamental task in NLP, the Jawi script is still far from having a standard tokenizer. At present, there is no sufficiently comprehensive, well-designed standard corpus for the Jawi script that is annotated and publicly available. The tokenization task is chosen to evaluate the significance of the corpus because of the Buckwalter transliteration format applied to Jawi characters. Further explanation is given in the NUWT Corpora Evaluation section.

Corpus Annotation
POS annotation (also morpho-syntactic annotation) consists of assigning each word in a corpus to its general word-class (e.g., nouns), or to finer-grained grammatical categories (e.g., singular common noun) (Balossi, 2014). It also enables the tagging of homographs; for example, the term 'work' can be tagged as a verb or as a noun. Semantic annotation marks the semantic categories of words in a text; for example, the term 'bank' can belong to two different semantic fields according to whether it refers to a financial institution or an area of land along a river. Through lexical annotation, we learn about the lemma (the base form of a word) of each word form in our corpus; for example, 'speak', 'speaks', 'spoke' and 'speaking' are forms of the same lexeme, with 'speak' as their lemma. Pragmatic annotation adds information to the words and multi-word expressions in a spoken conversation or dialogue, so the expression 'go now' may become a command or a question depending on the punctuation marks used. Linguistic annotation is used to capture a range of higher-level phenomena, including the practice of tagging the types of speech and thought presentation (e.g., direct speech and indirect speech, direct thought and indirect thought, etc.). On the use of corpus annotation, Leech (2005) states that "the practice of adding interpretative linguistic information to a corpus" confers an "added value" to it. An annotated corpus has the advantage of being usable in either a "manual examination" or an "automatic analysis". Moreover, it can be re-used and exploited for different aims and applications.
Both the Malay corpus and the Malay corpus UKM-DBP were developed according to the DBP tagset. General word-classes have been used for both corpora. The Malay tagged corpus has been developed with the same tagset used in the modified bilingual dictionary (Hock, 2009), as stated in Mohamed et al. (2011). Some tags have not been used in the corpus, while a few new ones have been added. The AWL and KEP tags, which are linguistically not word classes, have not been used in the corpus. In addition, clitics in Malay, such as nya (it, them), mu (you), lah (a particle suffixed to a word to emphasize the preceding word or sentence) and kah (a particle at the end of a word or phrase for expressing enquiry), are crucial to the Malay language. In the Malay corpus, only two clitics are handled: a word carrying the clitic nya or lah is split into two tokens. For example, the word terjejasnya (it being affected) is split into terjejas (is affected) and nya (it). The tag @KG is used to tag nya, and #E is used to tag lah (Mohamed et al., 2011). Other added tags have also been used, such as KNF to tag denial words, KNK to tag proper nouns, SEN to tag any list numbers, and SYM to tag any symbols, including punctuation. In the NUWT Corpora, each tag used in the original corpus is maintained, and the use of ASCII letters is also maintained as in the original corpus. Figures 6, 7 and 8 show sample data from the three corpora in the NUWT Corpora.
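The clitic handling described above can be sketched as follows. This is an illustration only and not the procedure of Mohamed et al. (2011): the function name and the `known_stems` lexicon are hypothetical, and Roman (Rumi) forms are used for readability. The lexicon check is an assumption added here because a bare suffix test would also split unrelated words that merely end in the same letters.

```python
# Tags for the two clitics handled in the Malay corpus, as stated in the text.
CLITIC_TAGS = {"nya": "@KG", "lah": "#E"}


def split_clitic(word, known_stems):
    """Split a trailing nya/lah clitic into its own token.

    known_stems is a hypothetical lexicon of valid stems; a word is only
    split when the part before the clitic is a known stem. The stem's own
    POS tag is left as None because it is assigned elsewhere.
    """
    for clitic, tag in CLITIC_TAGS.items():
        stem = word[: -len(clitic)]
        if word.endswith(clitic) and stem in known_stems:
            return [(stem, None), (clitic, tag)]
    return [(word, None)]


stems = {"terjejas"}
print(split_clitic("terjejasnya", stems))  # stem and clitic as two tokens
print(split_clitic("tanya", stems))        # left whole: "ta" is not a known stem
```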

NUWT CORPORA EVALUATION
To evaluate a new corpus, previous researchers (Outahajala, Zenkouar, Benajiba, Rosso, & Elirf, 2013; Tmshkina, 2006) have applied the corpus to state-of-the-art POS taggers, such as TnT, Support Vector Machine (SVM) and Conditional Random Field (CRF). Accuracy (%) has been used to measure the performance of the corpus. However, in this study, a new orthography is used, namely the Jawi-specific Buckwalter transliteration, which is tagged along with the previously known tagset. Thus, it is believed that tokenization is a suitable task for the evaluation of the corpora because of its orthographical differences from the Malay language written in Roman.
As a pilot study, the NUWT corpus was applied to a tokenization task to reveal what modifications are needed. The NLTK, a platform for building Python programs to work with human language data, was used. The NLTK is easy to use and interfaces with over 50 corpora and lexical resources, such as WordNet, and has a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning (NLTK Project, 2015). There are several types of tokenizers in NLTK, which are the Punkt Sentence Tokenizer, Regular Expression Tokenizer, S-Expression Tokenizer, Simple Tokenizer, Penn Treebank Tokenizer and NLTK Tokenizer package. In our experiment, we used the regular expression tokenizer module from NLTK for the Jawi corpus. The regular expression tokenizer is deemed suitable for this task because of its uniqueness in handling language-dependent applications and algorithms (Shaalan & Raza, 2007; Sharum, Abdullah, Sulaiman, Murad, & Hamzah, 2011).

Jawi Regular-Expression Tokenizer (Jawi RegExpTokenizer)
A RegExpTokenizer splits a string into substrings using a regular expression. The tokenizer matches either the tokens or the separators between tokens. One variant tokenizes a string by treating any sequence of blank lines as a delimiter, where blank lines are defined as lines containing no characters except for space or tab characters (NLTK Project, 2015). NLTK's regular expression tokenizer takes four parameters: pattern, gaps, discard_empty and flags.
Otherwise, the tokenizer will leave it as a single word. Patterns such as * (ذ), $ (ش), > (أ), < (إ), and } (ئ) are supposed to be segmented as a single word when they occur in the raw text or sentence. Patterns identified as a single or a separate word in Malay in the Jawi character are shown in Table 8. Regular expression patterns are sorted by priority. If a standard tokenizer built for another language, such as English, is used, patterns such as *, >, <, and } will be segmented into two separate tokens. In the Arabic language, the orthographic problem interferes, as mentioned above.
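The difference between an English-oriented pattern and a Buckwalter-aware pattern can be seen in a few lines. The sketch below uses Python's `re.findall` as a stand-in for a token-matching regular expression tokenizer. Both patterns and the sample string are assumptions made for illustration; they are not the patterns of Table 8, and the character class lists only the Buckwalter symbols mentioned in this paper.

```python
import re

# A typical English-oriented pattern: runs of word characters, or single symbols.
ENGLISH_PATTERN = r"\w+|[^\w\s]"

# Hypothetical Buckwalter-aware pattern. Alternatives are tried left to right,
# so the order encodes priority: Buckwalter words first, then digits, then
# any remaining symbol as its own token.
JAWI_BW_PATTERN = r"[A-Za-z'|>&<}*$]+|\d+|[^\w\s]"

# Illustrative Buckwalter-coded string (three words and a full stop).
sample = "$Ah *At >yr."

english_tokens = re.findall(ENGLISH_PATTERN, sample)
jawi_tokens = re.findall(JAWI_BW_PATTERN, sample)

print(english_tokens)  # symbols such as $, * and > are split off from their words
print(jawi_tokens)     # each Buckwalter word stays whole; the full stop is separate
```

Listing the word alternative before the catch-all symbol alternative is what keeps `*`, `$` and `>` attached to their words, which mirrors the idea of sorting patterns by priority.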

Algorithm and Implementation
The tokenizer process is summarized by the PatternMatching algorithm, which constructs tokens using a regular expression.
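A minimal Python sketch of such a pattern-matching tokenizer is given below. It is written with the standard `re` module rather than NLTK so that it is self-contained; the function name `pattern_matching` is hypothetical, the parameter names follow the four parameters listed for the algorithm (pattern, gaps, discard_empty, flags), and the patterns in the usage lines are assumptions. It should be read as an approximation of the behaviour described, not as the authors' code.

```python
import re

DEFAULT_FLAGS = re.UNICODE | re.MULTILINE | re.DOTALL


def pattern_matching(text, pattern, gaps=False, discard_empty=True, flags=DEFAULT_FLAGS):
    """Tokenize text with a regular expression.

    gaps=False: the pattern describes the tokens themselves.
    gaps=True:  the pattern describes the separators between tokens.
    discard_empty only matters when gaps=True, because only splitting
    can produce empty strings.
    The pattern should not contain capturing groups.
    """
    regexp = re.compile(pattern, flags)
    if gaps:
        tokens = regexp.split(text)
        if discard_empty:
            tokens = [tok for tok in tokens if tok]
        return tokens
    return regexp.findall(text)


# Token-matching mode with a hypothetical Buckwalter-aware pattern.
print(pattern_matching("*At $Ah.", r"[A-Za-z'|>&<}*$]+|\d+|[^\w\s]"))

# Separator-matching mode: split on whitespace, dropping empty strings.
print(pattern_matching(" *At  $Ah", r"\s+", gaps=True))
```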

Tokenizer Experiments
The experiments outlined in this paper were tested using the bootstrapping approach and started with a few training words (1,000, 5,000, 10,000 and 15,000 words) for each corpus. A series of experiments was performed. Our first experiment was to manipulate a standard regular expression tokenizer built for the English language. One of the issues in tokenizing English words is the presence of contractions, such as didn't. In the NUWT Corpora, the apostrophe symbol (') represents the letter Hamza (ء). In addition, the English language uses uppercase letters to identify abbreviations, whereas in our corpora both uppercase and lowercase letters are used for abbreviations.
For the first implementation, a regular expression pattern suitable for the Jawi-specific Buckwalter code was built. The first experiment produced a minimum error of 0.01038961, a maximum error of 0.029490617 and an average error of 0.020255. The errors arose from a number of issues, as listed in Table 9. For example, the * (ذ, Thal) character can appear in the front, middle and final position of a word, as shown in Table 10. During the first experiment, only Thal in the middle position could be detected as part of a single word, whereas Thal in the front and final positions was missed (see Figures 9, 10 and 11). Fig. 12 shows the error rate (%) for the NUWT corpus.
After several modifications were made to the program's code, these errors were addressed. The second experiment shows the complete performance of the tokenization on the NUWT Corpora. The accuracy obtained was approximately 99.8%. For example, Figures 13 and 14 show the correctly tokenized words for Error 1. The output overcame the errors that occurred in the program for the same sentences (see Figures 9 and 10). The summary of successful tasks is shown in Table 11.
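The paper reports minimum, maximum and average error values but does not state the formula behind them. The sketch below shows one plausible way such figures could be computed, under the assumption that the error rate is the fraction of reference tokens the tokenizer fails to reproduce. The function names, the alignment via `difflib` and the sample data are all hypothetical and do not reproduce the reported numbers.

```python
from difflib import SequenceMatcher
from statistics import mean


def error_rate(predicted, reference):
    """Fraction of reference tokens not reproduced by the tokenizer.

    Assumption: tokens are aligned with difflib's longest-matching-block
    heuristic, and every unmatched reference token counts as one error.
    """
    if not reference:
        raise ValueError("reference token list must not be empty")
    matcher = SequenceMatcher(None, predicted, reference, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / len(reference)


def summarize(rates):
    """Return (minimum, maximum, average) over a list of per-run error rates."""
    return min(rates), max(rates), mean(rates)


# Hypothetical example: the leading Thal (*) was wrongly split off its word.
reference = ["*At", "$Ah"]
predicted = ["*", "At", "$Ah"]
print(error_rate(predicted, reference))
```

Running `error_rate` once per training size (for example 1,000, 5,000, 10,000 and 15,000 words) and passing the results to `summarize` would yield a minimum, maximum and average of the kind reported.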

CONCLUSION AND FUTURE WORK
A set of corpora containing three Malay subcorpora is described in this paper. The corpora are unique due to the code applied. They will serve as a benchmarking resource for the development and evaluation of systems in word tokenization, as well as further language processing. A pilot study successfully developed a tokenizer system to reveal what modifications need to be made in the NUWT Corpora. We attempted to develop the tokenizer model for the NUWT Corpora using regular expressions. The aim of this model is to suit the new corpora developed for the Malay language. Based on the experimental results and analysis, it can be concluded that regular expressions are appropriate for a Jawi tokenizer in the Buckwalter transliteration format. Further work should focus on the out-of-vocabulary (OOV) problem in POS tagging, an important text analysis task that classifies words into their parts of speech and labels them according to a tagset, which is a collection of tags used for POS tagging.

Figure 1. Jawi research trends and relationship to the field.

Figure 3. Jawi script sample input.

Figure 4. MADAMIRA output process.

Figure 5. Pre-processing phase in preparing the data.

Figure 6. The Quranic Malay written in the Jawi character corpus.

Figure 7.

Figure 8.

Algorithm PatternMatching
Require: text T, pattern P, gaps G, discard_empty D, flags F
begin
  read sequence of T
  read regular expression pattern P
  read gaps G, read discard_empty D, read flags F
  if regular expression pattern P is found in sequence of T then
    single token
  else
    separate token
  apply gaps G, discard_empty D, flags F
end

Figure 11. Middle letter Thal (ذ) detected the same as the input.

Figure 13. First letter Thal (ذ) detected the same as the input.

Figure 14. Last letter Thal (ذ) detected the same as the input.

Table 1
Comparison between Corpus

Unannotated Jawi corpora have been developed from a variety of sources, including old Jawi manuscripts, Al-Quran text translation, textbooks and local newspapers. Research on Jawi is widely used in the learning field, for instance in Multimedia (MM), in computer hardware such as virtual keyboards (Engineering), and in the field of Artificial Intelligence in the study of Pattern Recognition (PR), Natural Language Processing (NLP) and Machine Learning (ML).

Table 2
Malay Tagset DBP in Malay Corpus

Table 3
Malay DBP Tagset in the Malay UKM-DBP Corpus

Table 5
Errors after using the Online Transliteration System
Error: Letters. Description: Wrong letter used between Heh (ه) and Hah (ح), Theh (ة) and Teh (ت), Kaf (ک) and Qaf (ق), Yeh (ي) and Alef Maksura (ى), and Alef (ا) with Alef, Hamza Above (أ) or Alef, Hamza Below (إ).

Table 6 column headings: Roman equivalent; Isolated Form; Unicode (U+); Initial Form; Unicode (U+); Medial Form; Unicode (U+); Final Form; Unicode (U+); Buckwalter
• discard_empty (bool) - True if any empty tokens generated by the tokenizer should be discarded. Empty tokens can only be generated if _gaps == True.
• flags (int) - The regular expression flags used to compile this tokenizer's pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL

Jawi-specific extension Buckwalter character code is normally not very different from English character code. By default, abbreviation patterns, e.g., TUDM (ت.او.د.م, t.Aw.d.m) and MCA (عيم.سي.أي, Eym.sy.>y), have the same pattern or writing style as the English language, e.g., U.S.A. The abbreviations and acronyms in the Malay language are shown in Table 7 (Abdul Rahman,

Table 7
Writing Abbreviations and Acronyms for Jawi and Roman in Malay Language

Table 8
Tokenization Pattern for Jawi

Table 9
Errors from Certain Issues

Table 11
Successful Tasks