=Paper=
{{Paper
|id=Vol-1312/jist2014pd_paper6
|storemode=property
|title=A Multi-Aspect Comparison and Evaluation on Thai Word Segmentation Programs
|pdfUrl=https://ceur-ws.org/Vol-1312/jist2014pd_paper6.pdf
|volume=Vol-1312
|dblpUrl=https://dblp.org/rec/conf/jist/NoyunsanHPS14
}}
==A Multi-Aspect Comparison and Evaluation on Thai Word Segmentation Programs==
Chaluemwut Noyunsan(1), Choochart Haruechaiyasak(2), Seksan Poltree(3), and Kanda Runapongsa Saikeaw(1) (corresponding author)

(1) Department of Computer Engineering, Faculty of Engineering, Khon Kaen University, 123 Mittrapap, Mueang, Khon Kaen 40002, Thailand (chaluemwut@kkumail.com, krunapon@kku.ac.th)
(2) Human Language Technology Laboratory (HLT), National Electronics and Computer Technology Center (NECTEC), Thailand Science Park, Klong Luang, Pathumthani 12120, Thailand (choochart.haruechaiyasak@nectec.or.th)
(3) Morange Solution Company Limited, 456/1 Klang Mueang, Mueang, Khon Kaen 40000, Thailand (seksan@morange.co.th)

'''Abstract.''' Word segmentation is an important task in natural language processing, especially for languages without word boundaries, such as Thai. Many Thai word segmentation programs have been developed, and researchers and developers working with Thai documents usually spend a tremendous amount of time studying and trying different programs. This paper presents the performance of six Thai word segmentation programs: Libthai, Swath, Wordcut, CRF++, Thaisemantics, and Tlexs. Based on experimental results, we compare these programs in terms of usage, response time, time outs, and relevance.

'''Keywords:''' word segmentation, term tokenization, software tools, natural language processing

===1 Introduction===

Natural Language Processing (NLP) enables computers to understand human languages. It consists of many processes such as word segmentation, part-of-speech tagging, automatic summarization, and speech synthesis. Most NLP applications require input text to be segmented into words before further processing. For example, in a sentence similarity application, text must first be tokenized into a series of terms before being analyzed grammatically and semantically.
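To make the tokenization step concrete, the following is a toy greedy longest-match segmenter over a spaceless string. The dictionary and input here are illustrative assumptions; this is not the algorithm of any of the programs evaluated in the paper.

```python
def longest_match_segment(text, dictionary):
    """Greedily split a spaceless string into the longest known words."""
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary word starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word matches here: emit a single character.
            words.append(text[i])
            i += 1
    return words

# Hypothetical dictionary and spaceless input for illustration:
dictionary = {"word", "segmentation", "is", "essential"}
print(longest_match_segment("wordsegmentationisessential", dictionary))
# ['word', 'segmentation', 'is', 'essential']
```

Greedy longest matching is simple but can mis-segment ambiguous strings, which is one reason statistical approaches such as CRF are also used.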
Word segmentation is an essential step for Asian languages such as Thai, Chinese, Japanese, and Korean, because these languages are written without delimiting spaces between words in the same sentence.

===2 Thai Word Segmentation Programs===

Much research has been conducted and several programs have been developed for word segmentation. We chose six Thai word segmentation programs: Libthai, Swath, Wordcut, CRF++, Thaisemantics, and Tlexs. They were selected because they are actively maintained and widely used. Libthai [1] is a set of C functions that support Thai word segmentation. Swath [7] and Wordcut [6] are command-line programs. CRF++ is a tool supporting Conditional Random Fields (CRF) [2]. Thaisemantics has been developed as a RESTful web service [5]. Tlexs uses CRF to train models for segmentation [4].

===3 Experimental Analysis===

Fig. 1 shows the system overview. Our system used the Benchmark for Enhancing the Standard of Thai language processing (BEST) corpus, which has been developed by NECTEC [3] and is widely accepted for Thai language processing. Our system sent an original message to each selected Thai word segmentation program and received the results. The results from each program were then compared with the manually tagged words in BEST.

Fig. 1. System overview

'''Usage.''' Table 1 summarizes the features of each program in terms of usage, offline support, and whether installation is needed.

Table 1. Comparison of word segmentation programs in terms of usage

{|
! Features !! Libthai !! Swath !! Wordcut !! Thaisemantics !! Tlexs !! CRF++
|-
| Usage || C function || Wrapper || Wrapper || REST API || SOAP API || Wrapper
|-
| Offline support || Yes || Yes || Yes || No || No || Yes
|-
| Installation needed || Yes || Yes || Yes || No || No || Yes
|}

'''Response Time.''' We ran the experiments and measured response times on a computer with an Intel dual-core 1.73 GHz processor and 3 GB of RAM, running 64-bit Ubuntu. The program execution times are shown in Table 2.
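The paper does not publish its timing harness, but per-call response time can be measured along the following lines; the `segment` callable and repetition count are illustrative assumptions standing in for any of the evaluated programs.

```python
import time

def measure_response_time(segment, text, runs=100):
    """Average wall-clock time of one segmentation call, in seconds.

    `segment` is any callable taking a text string; here it stands in
    for one of the evaluated programs (an assumption, not the paper's
    actual harness).
    """
    start = time.perf_counter()
    for _ in range(runs):
        segment(text)
    return (time.perf_counter() - start) / runs

# Example with a trivial stand-in segmenter:
avg = measure_response_time(lambda t: t.split(), "a b c", runs=10)
print(f"{avg:.6f} s per call")
```

Averaging over many runs smooths out scheduling noise, which matters especially for the networked services whose times vary with bandwidth and traffic.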
Libthai performed the best with a response time of 0.022 seconds, while Swath and CRF++ had response times of 0.024 seconds. On the other hand, the response times of Thaisemantics, Wordcut, and Tlexs were large. Thaisemantics and Tlexs are used over the internet, so their response times depend on internet bandwidth and traffic. Wordcut is implemented in JavaScript, which may increase its response time.

Table 2. Comparison of word segmentation program response times (in seconds)

{|
! Libthai !! Swath !! Wordcut !! Thaisemantics !! Tlexs !! CRF++
|-
| 0.022 || 0.024 || 0.203 || 1.520 || 0.144 || 0.024
|}

'''Number of Time Outs.''' Tlexs and Thaisemantics had several time outs because they are called over the internet. Tlexs averaged 2,070 time outs, while Thaisemantics had 5. The large number of time outs for Tlexs might be due to server settings or errors.

'''Relevance.''' Messages from BEST were sent to each program, and the word segmentation output was kept in a list. We then compared this output list with the correct list from BEST using the percentage of precision, recall, and F-measure. We ran this test using 5-fold cross-validation and computed the average values, shown in Table 3. Both Tlexs and CRF++ have the best F-measure because they use CRF.

Table 3. Precision, recall, and F-measure of Thai word segmentation programs (%)

{|
! Measurement !! Libthai !! Swath !! Wordcut !! CRF++ !! Thaisemantics !! Tlexs
|-
| Precision || 61.23 || 65.09 || 57.27 || 59.91 || 66.03 || 74.80
|-
| Recall || 54.97 || 55.96 || 62.05 || 67.80 || 60.58 || 75.88
|-
| F-measure || 57.61 || 59.60 || 59.30 || 63.14 || 63.03 || 75.26
|}

===4 Conclusions===

This paper presents a comparison of six Thai word segmentation programs in terms of usage, response time, time outs, and relevance. Swath, Libthai, and CRF++ provide the smallest response times because they are native programs. Thaisemantics yields the largest response time because it is called over the internet and uses a dictionary.
Although Tlexs is also called over the internet, it has a better response time because it uses CRF. Both Tlexs and CRF++ give the highest F-measure because they employ CRF.

===References===

1. T. Karoonboonyanan, C. Silpa-Anan, P. Kiatisevi, P. Veerathanabutr, and V. Ampornaramveth, Libthai Library, retrieved on Jul 1, 2014 from http://linux.thai.net/projects/libthai.
2. T. Kudo, CRF++: Yet Another CRF toolkit, retrieved on Jul 2, 2014 from http://crfpp.googlecode.com/svn/trunk/doc/index.html?source=navbar.
3. National Electronics and Computer Technology Center (NECTEC), BEST: Benchmark for Enhancing the Standard of Thai Language Processing, retrieved on Jul 5, 2014 from http://thailang.nectec.or.th/best/.
4. National Electronics and Computer Technology Center (NECTEC), TLexs: Thai Lexeme Analyser, retrieved on Jul 3, 2014 from http://sansarn.com/tlex/.
5. S. Poltree, Thaisemantics: Free Thai Language Resources and Services, retrieved on Jul 2, 2014 from http://www.thaisemantics.org/.
6. V. Satayamas, wordcut program, retrieved on Jul 3, 2014 from https://github.com/veer66/wordcut.
7. P. Charoenpornsawat, SWATH - Thai Word Segmentation, retrieved on Jul 1, 2014 from http://www.cs.cmu.edu/~paisarn/software.html.