Serbian Text Categorization Using Byte-Level n-Grams

Jelena Graovac
Faculty of Mathematics, University of Belgrade
Studentski trg 16, 11000 Belgrade, Serbia
jgraovac@matf.bg.ac.rs

ABSTRACT
This paper presents the results of classifying Serbian text documents using the byte-level n-gram frequency statistics technique, employing four different dissimilarity measures. The results show that byte-level n-gram text categorization, although very simple and language independent, achieves very good accuracy.

Keywords
N-gram, text categorization, Serbian

1. INTRODUCTION
Text categorization is the task of classifying unlabelled natural language documents into a predefined set of categories. The increasing volume of documents available on the World Wide Web has made document indexing and searching more and more complex. This issue has motivated the development of automated text and document categorization techniques capable of automatically organizing and classifying documents. There are many different techniques for solving the text categorization problem with very good results, including the Naive Bayes classifier, decision trees, decision rule classifiers, regression methods, Rocchio's method, neural networks, K nearest neighbors, support vector machines, etc. [4] Most of them have concentrated on English text documents, and are therefore not directly applicable to documents written in some other language such as Serbian.

The Serbian language has several properties that significantly influence text categorization: the use of two alphabets (Cyrillic and Latin), phonologically based orthography, a rich morphological system, free word order of the subject, predicate, object and other sentence constituents, special placement of enclitics, and a complex agreement system. [6] All these characteristics make the preprocessing steps, such as feature extraction and feature selection, more complex. There is a need for dictionaries, finite-state transducers for describing the interactions between text and dictionary, and other complex natural language processing tools. [6]

This paper presents the results of Serbian text categorization using a technique which is language independent and very simple, and which overcomes the above difficulties. The technique is based on the byte-level n-gram frequency statistics method for document representation and the K nearest neighbors machine learning algorithm for the categorization process. It is derived from Kešelj's [3] method for solving the authorship attribution problem using n-grams, and from some of Tomović's ideas from [5], where the problem of automated categorization of genome isolates was examined. Kešelj defines an author profile as an ordered set of pairs (x1, f1), (x2, f2), ..., (xL, fL) of the L most frequent byte n-grams xi and their normalized frequencies fi. Authorship is determined based on the dissimilarity between two profiles, comparing the most frequent n-grams. Based on this work, a wide range of dissimilarity measures was introduced in [5]. This technique, including all these measures and one newly introduced measure, was tested in [2] on the problem of text categorization in English, Chinese and Serbian. The present paper reports results of categorizing Serbian text documents only. The same technique and the same corpus are used as in [2], except that the categorization is carried out over five rather than three classes and only the dissimilarity measures from [3] and [5] are examined. Since it is based on byte-level n-grams, the technique does not need any text preprocessing or higher-level processing, such as tagging, parsing, feature selection, or other language-dependent and non-trivial natural language processing tasks. [3] The approach is also tolerant of typing, spelling and grammatical errors, and word stemming is obtained essentially for free. [1]

Overview of the paper: Some background information about n-grams is presented in Section 2. Section 3 describes the methodology for categorization of text documents used in this paper; this section also presents several dissimilarity measures, the data set used for text categorization, and the set of evaluation metrics used to assess the performance of the technique. Section 4 reports on experimental results and compares the dissimilarity measures. Finally, Section 5 concludes the paper.

BCI'12, September 16-20, 2012, Novi Sad, Serbia.
Copyright © 2012 by the paper's authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences, University of Novi Sad.

2. N-GRAMS
Given a sequence of tokens S = (s1, s2, ..., sN+(n-1)) over the token alphabet A, where N and n are positive integers, an n-gram of the sequence S is any n-long subsequence of consecutive tokens. The i-th n-gram of S is the sequence (si, si+1, ..., si+n-1). [5]

For example, if A is the English alphabet and l is a string over the alphabet A, l = "life_is_a_miracle", then the 1-grams are: l, i, f, e, _, s, a, m, r, c; the 2-grams are: li, if, fe, e_, _i, is, s_, _a, ...; the 3-grams are: lif, ife, fe_, e_i, ...; the 4-grams are: life, ife_, fe_i, ..., and so on. The underscore character ("_") is used here to represent blanks.
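The sliding-window extraction just described can be sketched in a few lines of Python (a minimal illustration of the idea, not the software used in the paper):

```python
# Byte-level n-gram extraction: slide a window of n bytes over the raw
# byte sequence of the text. Real spaces play the role of the "_" above.
def byte_ngrams(text: str, n: int) -> list[bytes]:
    """Return all overlapping byte n-grams of the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]

print(byte_ngrams("life is a miracle", 3)[:3])  # [b'lif', b'ife', b'fe ']
```

Because the window runs over raw bytes rather than letters, blanks and punctuation are kept, which is exactly what distinguishes byte-level from character-level n-grams.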
For n ≤ 5, Latin names are commonly used for n-grams (e.g., trigrams), while for n > 5 numeric prefixes are common (e.g., 6-grams). [5]

The use of n-gram models and techniques based on n-gram probability distributions in natural language processing is a relatively simple idea, but it has turned out to be effective in many applications. [3] Some of them are text compression, spelling error detection and correction, information retrieval, language identification, and authorship attribution. It has also proved useful in domains not related to language processing, such as music representation, computational immunology, protein classification, etc.

The term n-gram can be defined at the word, character or byte level. In the case of Latin-alphabet languages, character-level and byte-level n-gram models are quite similar, owing to the fact that one character is usually represented by one byte. The only difference is that character-level n-grams use letters only and typically ignore digits, punctuation and whitespace, while byte-level n-grams use all printing and non-printing characters. [2]

3. METHODOLOGY AND DATA
The technique used in this paper is based on calculating and comparing profiles of n-gram frequencies. First, profiles of the training set data that represent the various categories are computed. Then the profile of a particular testing document that is to be classified is computed. Finally, a dissimilarity measure between the document's profile and each of the category profiles is computed, and the category whose profile has the smallest dissimilarity to the document's profile is selected. The detailed text categorization procedure is presented in [2].

This paper presents the categorization of text documents in Serbian. For this purpose, the Ebart corpus is used.

Ebart: Ebart¹ is the largest digital media corpus in Serbia, with almost one million news articles from daily and weekly newspapers archived from early 2003 onwards. The current archive is classified into thematic sections following the model of regular newspaper columns. In this paper a subset of the Ebart corpus is taken into consideration: articles from the Serbian daily newspaper "Politika" that belong to the columns Sport, Economics, Politics, Chronicle & Crime and Culture & Entertainment, published from 2003 to 2006. There are 5235 such articles. This data set was split into a training and a testing set in the ratio 2:1. Fig. 1 shows the category distribution of this corpus.²

Figure 1: Category distribution of Ebart corpus.

Performance evaluation: The standard metrics for categorization performance evaluation are considered, namely the micro- and macro-averaged F1 measures. As usual, the micro-averaged F1 measure is computed over all documents and all document categories. The macro-averaged F1 measure represents the average of the F1 values computed for each classification category separately. The standard definitions of these measures, and of precision and recall, can be found, e.g., in [2].

Dissimilarity measures: A dissimilarity measure d is a function that maps the Cartesian product of two sets of sequences P1 and P2 (defining specific profiles) into the set of positive real numbers. It should reflect the dissimilarity between the two profiles, and it should meet the following conditions: [5]

• d(P, P) = 0;
• d(P1, P2) = d(P2, P1);
• the value d(P1, P2) should be small if P1 and P2 are similar;
• the value d(P1, P2) should be large if P1 and P2 are not similar.

The last two conditions are informal, as the notion of similarity (and thus of dissimilarity) is not strictly defined. In this paper four dissimilarity measures are used. Two of them, taken from [5], are d2 and d3:

d2(P1, P2) = Σ_{n ∈ profile} ( √2 · |f1(n) − f2(n)| / √(f1(n)² + f2(n)²) )²   (3)

d3(P1, P2) = Σ_{n ∈ profile} √2 · |f1(n) − f2(n)| / √(f1(n)² + f2(n)²)   (4)
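A direct transcription of these two measures in Python, assuming (for illustration only) that a profile is a dictionary mapping n-grams to normalized frequencies and that an n-gram absent from a profile has frequency 0:

```python
import math

# Dissimilarity measures d2 (Eq. 3) and d3 (Eq. 4) between two profiles.
# A profile maps each n-gram to its normalized frequency; n-grams missing
# from one profile are treated as frequency 0 (an illustrative assumption).
def d2(p1: dict, p2: dict) -> float:
    total = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        total += (math.sqrt(2) * abs(f1 - f2)) ** 2 / (f1 ** 2 + f2 ** 2)
    return total

def d3(p1: dict, p2: dict) -> float:
    total = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        total += math.sqrt(2) * abs(f1 - f2) / math.sqrt(f1 ** 2 + f2 ** 2)
    return total
```

Both functions satisfy the conditions listed above: d(P, P) = 0 and symmetry follow directly from the absolute difference in the numerator.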
The first of the four measures, d, is the one used by Kešelj [3]; it has the form of a relative distance:

d(P1, P2) = Σ_{n ∈ profile} ( 2 · (f1(n) − f2(n)) / (f1(n) + f2(n)) )²   (1)

where f1(n) and f2(n) are the frequencies of an n-gram n in the author profile P1 and the document profile P2. The other three measures are the measures from [5] that performed best on the considered data set; besides d2 and d3, they include d1:

d1(P1, P2) = Σ_{n ∈ profile} 2 · (f1(n) − f2(n))² / (f1(n) + f2(n))   (2)

4. RESULTS

Figure 2: Micro-averaged F1 in percentages for the Ebart corpus, for different values of the n-gram size n and dissimilarity measure d.

Figure 3: Macro-averaged F1 in percentages for the Ebart corpus, for different values of the n-gram size n and dissimilarity measure d.

Figure 4: Micro-averaged F1 in percentages for the Ebart corpus, for n = 7 and different dissimilarity measures.

Figure 5: Macro-averaged F1 in percentages for the Ebart corpus, for n = 7 and different dissimilarity measures.

¹ The current Ebart archive is available at http://www.arhiv.rs
² All experimental data can be obtained on request from the author.

This section presents the experimental results of text categorization obtained for the Ebart corpus on five categories: Sport, Economics, Politics, Chronicle & Crime and Culture & Entertainment. No preprocessing is done on the texts; a simple byte n-gram representation is used, treating text documents simply as byte sequences. Only the measures selected from [3] and [5] that give the best results on the Ebart corpus are considered: d, d1, d2 and d3 (see Sec. 3). For producing n-grams and their normalized frequencies, the software package Ngrams written by Kešelj [3] is used.
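What such profile computation amounts to can be sketched as follows (an illustration of the idea, not the actual Ngrams package):

```python
from collections import Counter

# Build a profile: the L most frequent byte n-grams of a document,
# each paired with its frequency normalized by the total n-gram count.
def profile(text: str, n: int, L: int) -> dict[bytes, float]:
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(L)}
```

A category profile can be built the same way; one plausible route (an assumption here, not a detail given in the paper) is to concatenate all training documents of a category before counting.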
In the process of separating testing from training documents, and in the categorization process itself, the software package NgramsClassification³ is used.

³ Source code can be obtained on request from the author.

One of the most important questions in byte n-gram categorization is which values of n and L produce the best results. To answer this question, the accuracy (micro- and macro-averaged F1) of the technique was tested for all values of the n-gram size n and the profile length L for which this makes sense. Fig. 2 and 3 give a graphical representation of this extensive set of experiments, taking into account only the few values of n for which the highest accuracy is achieved and only the dissimilarity measure d (all other measures achieve their maximum accuracy for the same value of n as d). It can be seen that the maximum values of the micro- and macro-averaged F1 measures are reached for n = 7. For that particular value of n, a comparison between the measures d, d1, d2 and d3 is performed. The results of these experiments are shown in Fig. 4 and 5.

Additionally, Fig. 6 presents the micro- and macro-averaged F1 values for the best results for each measure. All these results show that all the introduced measures achieve comparable results.

The quality of classification is also assessed using a confusion matrix, i.e., a record of correctly and incorrectly recognized documents for each category. Table 1 gives the confusion matrix for the experiment using the 70000 most frequent 7-grams and the dissimilarity measure d. It shows the information about the actual and predicted categorizations done by our system. Each column of the matrix represents the number of documents in a predicted class, while each row represents the number of documents in an actual class. The diagonal elements show the number of correctly classified documents, and the off-diagonal elements the number of wrongly classified ones.
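Per-class precision and recall can be read off such a matrix directly; the sketch below does this for the Sport row and column of Table 1 (rows are actual classes, columns are predicted classes):

```python
# Per-class precision and recall from a confusion matrix whose rows are
# actual classes and columns are predicted classes (figures from Table 1).
categories = ["Economy", "Politics", "Sport", "H&C", "C&E"]
matrix = [
    [140,  22,   0,   3,   1],  # Economy
    [ 12, 413,   3,  35,   4],  # Politics
    [  3,   9, 466,   8,   2],  # Sport
    [ 14,  61,   0, 229,   5],  # H&C
    [  3,   9,   5,   2, 294],  # C&E
]

def precision_recall(matrix: list[list[int]], k: int) -> tuple[float, float]:
    tp = matrix[k][k]
    predicted = sum(row[k] for row in matrix)  # column sum: predicted as k
    actual = sum(matrix[k])                    # row sum: actually class k
    return tp / predicted, tp / actual

p, r = precision_recall(matrix, categories.index("Sport"))
print(f"Sport: precision = {p:.3f}, recall = {r:.3f}")
# Sport: precision = 0.983, recall = 0.955
```

Summing the diagonal gives 1542 correctly classified documents out of 1743, i.e. a micro-averaged accuracy of about 88.5%, consistent with the F1 values reported in the figures.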
Figure 6: Best micro- and macro-averaged F1 in percentages for the Ebart corpus and the different measures.

Table 1: Confusion matrix for classification of the Ebart corpus (H&C = Chronicle & Crime, C&E = Culture & Entertainment).

Categories   Economy  Politics  Sport  H&C  C&E
Economy        140      22*       0     3    1
Politics       12*      413       3    35*   4
Sport           3        9      466     8    2
H&C            14       61*       0   229    5
C&E             3        9        5     2  294

The main reason for the wrongly classified documents is the real-world similarity between categories. For example, the category "Chronicle & Crime" and the category "Politics" are close to each other; the same holds for "Politics" and "Economy". In Table 1, the numbers labelled with "*" are the numbers of documents wrongly classified into a category that is close to the correct one.

5. CONCLUSION AND FUTURE WORK
This paper presented the results of classifying a Serbian data set using a new variant of a document categorization approach based on byte-level n-grams, including all printing and non-printing characters. The approach relies on a profile document representation of restricted size and a very simple algorithm for comparing profiles. It provides an inexpensive and effective way of classifying documents.

The dissimilarity measures are subject to further investigation and improvement, as are the categorization methods themselves. The plan is to compare the results obtained by the presented method with the results of other supervised methods on the same data sets. This method, being based on sequences of bytes, is applicable to different domains and problems, and will be further tested on specific corpora, such as bioinformatics and multilingual corpora, and on tuning Internet search engines.

6. ACKNOWLEDGEMENTS
The work presented has been financially supported by the Ministry of Science and Technological Development, Republic of Serbia, through Projects No. III47003 and No. 174021. The author is grateful to prof. Gordana Pavlović-Lažetić for her unfailing support and supervision of this research.

7. REFERENCES
[1] W. B. Cavnar and J. M. Trenkle. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161-175, 1994.
[2] J. Graovac. A variant of n-gram based language-independent text categorization. Submitted, 2012.
[3] V. Kešelj, F. Peng, N. Cercone, and C. Thomas. N-gram-based author profiles for authorship attribution. In Proceedings of the Pacific Association for Computational Linguistics, pages 255-264, 2003.
[4] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34:1-47, 2002.
[5] A. Tomović et al. N-gram-based classification and unsupervised hierarchical clustering of genome sequences. Computer Methods and Programs in Biomedicine, pages 137-153, 2006.
[6] D. Vitas, C. Krstev, I. Obradović, L. Popović, and G. Pavlović-Lažetić. An overview of resources and basic tools for the processing of Serbian written texts. In Proc. of the Workshop on Balkan Language Resources, 1st Balkan Conference in Informatics, pages 97-104, 2003.