Serbian Text Categorization Using Byte-Level n-Grams

Jelena Graovac
Faculty of Mathematics, University of Belgrade
Studentski trg 16, 11000 Belgrade, Serbia
jgraovac@matf.bg.ac.rs

ABSTRACT
This paper presents the results of classifying Serbian text documents using the byte-level n-gram frequency statistics technique, employing four different dissimilarity measures. The results show that byte-level n-gram text categorization, although very simple and language independent, achieves very good accuracy.

Keywords
N-gram, text categorization, Serbian

1. INTRODUCTION
Text categorization is the task of classifying unlabelled natural language documents into a predefined set of categories. The increasing volume of documents available on the World Wide Web has made document indexing and searching more and more complex. This issue has motivated the development of automated text and document categorization techniques capable of automatically organizing and classifying documents. There are many different techniques for solving the text categorization problem with very good results, including the Naive Bayes classifier, decision trees, decision rule classifiers, regression methods, Rocchio's method, neural networks, K nearest neighbors, support vector machines, etc. [4] Most of them have concentrated on English text documents, and are therefore not directly applicable to documents written in some other language such as Serbian.

The Serbian language has several properties that significantly influence text categorization: the use of two alphabets (Cyrillic and Latin), phonologically based orthography, a rich morphological system, free word order of the subject, predicate, object and other sentence constituents, special placement of enclitics, and a complex agreement system. [6] All these characteristics make the preprocessing steps, such as feature extraction and feature selection, more complex. There is a need for dictionaries, finite-state transducers for describing the interactions between text and dictionary, and other complex natural language processing tools. [6]

This paper presents the results of Serbian text categorization using a technique which is language independent and very simple, and which overcomes the above difficulties. The technique is based on the byte-level n-gram frequency statistics method for document representation and the K nearest neighbors machine learning algorithm for the categorization process. It is derived from Kešelj's [3] method for solving the authorship attribution problem using n-grams, and from some of Tomović's ideas from [5], where the problem of automated categorization of genome isolates was examined. Kešelj defines an author profile as an ordered set of pairs (x1, f1), (x2, f2), ..., (xL, fL) of the L most frequent byte n-grams xi and their normalized frequencies fi. Authorship is determined based on the dissimilarity between two profiles, comparing the most frequent n-grams. Based on this work, a wide range of dissimilarity measures was introduced in [5]. This technique, including all these measures and one newly introduced measure, was tested in [2] on the problem of text categorization in English, Chinese and Serbian. The present paper reports results of categorizing Serbian text documents only. The same technique and the same corpus are used as in [2], except that the categorization is carried out over five rather than three classes and only the dissimilarity measures from [3] and [5] are examined. Since it is based on byte-level n-grams, the technique does not need any text preprocessing or higher-level processing, such as tagging, parsing, feature selection, or other language-dependent and non-trivial natural language processing tasks. [3] The approach is also tolerant of typing, spelling and grammatical errors, and word stemming is obtained essentially for free. [1]

Overview of the paper: Some background information about n-grams is presented in Section 2. Section 3 describes the methodology for categorization of text documents used in this paper; this section also presents several dissimilarity measures, the data set used for text categorization, and the set of evaluation metrics used to assess the performance of the technique. Section 4 reports on experimental results and compares the dissimilarity measures. Finally, Section 5 concludes the paper.

BCI'12, September 16-20, 2012, Novi Sad, Serbia.
Copyright © 2012 by the paper's authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. Local Proceedings also appeared in ISBN 978-86-7031-200-5, Faculty of Sciences, University of Novi Sad.

2. N-GRAMS
Given a sequence of tokens S = (s1, s2, ..., sN+(n-1)) over the token alphabet A, where N and n are positive integers, an n-gram of the sequence S is any n-long subsequence of consecutive tokens. The i-th n-gram of S is the sequence (si, si+1, ..., si+n-1). [5]

For example, if A is the English alphabet and l is a string over the alphabet A, l = "life_is_a_miracle", then the 1-grams are: l, i, f, e, _, s, a, m, r, c; the 2-grams are: li, if, fe, e_, _i, is, s_, _a, ...; the 3-grams are: lif, ife, fe_, e_i, ...; the 4-grams are: life, ife_, fe_i, ..., and so on. The underscore character ("_") is used here to represent blanks.
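The sliding-window extraction just described can be sketched in a few lines of Python (a minimal illustration of the idea, not the software used in the paper):

```python
# Byte-level n-gram extraction: slide a window of n bytes over the raw
# byte sequence of the text. Real spaces play the role of the "_" above.
def byte_ngrams(text: str, n: int) -> list[bytes]:
    """Return all overlapping byte n-grams of the UTF-8 encoding of text."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]

print(byte_ngrams("life is a miracle", 3)[:3])  # [b'lif', b'ife', b'fe ']
```

Because the window runs over raw bytes rather than letters, blanks and punctuation are kept, which is exactly what distinguishes byte-level from character-level n-grams.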
For n ≤ 5, Latin names are commonly used for n-grams (e.g., trigrams), while for n > 5 numeric prefixes are common (e.g., 6-grams). [5]

The use of n-gram models and techniques based on n-gram probability distributions in natural language processing is a relatively simple idea, but it has turned out to be effective in many applications. [3] Some of them are text compression, spelling error detection and correction, information retrieval, language identification, and authorship attribution. It has also proved useful in domains not related to language processing, such as music representation, computational immunology, protein classification, etc.

The term n-gram can be defined at the word, character or byte level. In the case of Latin-alphabet languages, character-level and byte-level n-gram models are quite similar, owing to the fact that one character is usually represented by one byte. The only difference is that character-level n-grams use letters only and typically ignore digits, punctuation and whitespace, while byte-level n-grams use all printing and non-printing characters. [2]

3. METHODOLOGY AND DATA
The technique used in this paper is based on calculating and comparing profiles of n-gram frequencies. First, profiles of the training set data that represent the various categories are computed. Then the profile of a particular testing document that is to be classified is computed. Finally, a dissimilarity measure between the document's profile and each of the category profiles is computed, and the category whose profile has the smallest dissimilarity to the document's profile is selected. The detailed text categorization procedure is presented in [2].

This paper presents the categorization of text documents in Serbian. For this purpose, the Ebart corpus is used.

Ebart: Ebart¹ is the largest digital media corpus in Serbia, with almost one million news articles from daily and weekly newspapers archived from early 2003 onwards. The current archive is classified into thematic sections following the model of regular newspaper columns. In this paper a subset of the Ebart corpus is taken into consideration: articles from the Serbian daily newspaper "Politika" that belong to the columns Sport, Economics, Politics, Chronicle & Crime and Culture & Entertainment, published from 2003 to 2006. There are 5235 such articles. This data set was split into a training and a testing set in the ratio 2:1. Fig. 1 shows the category distribution of this corpus.²

Figure 1: Category distribution of Ebart corpus.

Performance evaluation: The standard metrics for categorization performance evaluation are considered, namely the micro- and macro-averaged F1 measures. As usual, the micro-averaged F1 measure is computed over all documents and all document categories. The macro-averaged F1 measure represents the average of the F1 values computed for each classification category separately. The standard definitions of these measures, and of precision and recall, can be found, e.g., in [2].

Dissimilarity measures: A dissimilarity measure d is a function that maps the Cartesian product of two sets of sequences P1 and P2 (defining specific profiles) into the set of positive real numbers. It should reflect the dissimilarity between the two profiles, and it should meet the following conditions: [5]

• d(P, P) = 0;
• d(P1, P2) = d(P2, P1);
• the value d(P1, P2) should be small if P1 and P2 are similar;
• the value d(P1, P2) should be large if P1 and P2 are not similar.

The last two conditions are informal, as the notion of similarity (and thus of dissimilarity) is not strictly defined. In this paper four dissimilarity measures are used. Two of them, taken from [5], are d2 and d3:

d2(P1, P2) = Σ_{n ∈ profile} ( √2 · |f1(n) − f2(n)| / √(f1(n)² + f2(n)²) )²   (3)

d3(P1, P2) = Σ_{n ∈ profile} √2 · |f1(n) − f2(n)| / √(f1(n)² + f2(n)²)   (4)
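A direct transcription of these two measures in Python, assuming (for illustration only) that a profile is a dictionary mapping n-grams to normalized frequencies and that an n-gram absent from a profile has frequency 0:

```python
import math

# Dissimilarity measures d2 (Eq. 3) and d3 (Eq. 4) between two profiles.
# A profile maps each n-gram to its normalized frequency; n-grams missing
# from one profile are treated as frequency 0 (an illustrative assumption).
def d2(p1: dict, p2: dict) -> float:
    total = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        total += (math.sqrt(2) * abs(f1 - f2)) ** 2 / (f1 ** 2 + f2 ** 2)
    return total

def d3(p1: dict, p2: dict) -> float:
    total = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        total += math.sqrt(2) * abs(f1 - f2) / math.sqrt(f1 ** 2 + f2 ** 2)
    return total
```

Both functions satisfy the conditions listed above: d(P, P) = 0 and symmetry follow directly from the absolute difference in the numerator.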
The first of the four measures, d, is the one used by Kešelj [3]; it has the form of a relative distance:

d(P1, P2) = Σ_{n ∈ profile} ( 2 · (f1(n) − f2(n)) / (f1(n) + f2(n)) )²   (1)

where f1(n) and f2(n) are the frequencies of an n-gram n in the author profile P1 and the document profile P2. The other three measures are the measures from [5] that performed best on the considered data set; besides d2 and d3, they include d1:

d1(P1, P2) = Σ_{n ∈ profile} 2 · (f1(n) − f2(n))² / (f1(n) + f2(n))   (2)

4. RESULTS

Figure 2: Micro-averaged F1 in percentages for the Ebart corpus, for different values of the n-gram size n and dissimilarity measure d.

Figure 3: Macro-averaged F1 in percentages for the Ebart corpus, for different values of the n-gram size n and dissimilarity measure d.

Figure 4: Micro-averaged F1 in percentages for the Ebart corpus, for n = 7 and different dissimilarity measures.

Figure 5: Macro-averaged F1 in percentages for the Ebart corpus, for n = 7 and different dissimilarity measures.

¹ The current Ebart archive is available at http://www.arhiv.rs
² All experimental data can be obtained on request from the author.

This section presents the experimental results of text categorization obtained for the Ebart corpus on five categories: Sport, Economics, Politics, Chronicle & Crime and Culture & Entertainment. No preprocessing is done on the texts; a simple byte n-gram representation is used, treating text documents simply as byte sequences. Only the measures selected from [3] and [5] that give the best results on the Ebart corpus are considered: d, d1, d2 and d3 (see Sec. 3). For producing n-grams and their normalized frequencies, the software package Ngrams written by Kešelj [3] is used.
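What such profile computation amounts to can be sketched as follows (an illustration of the idea, not the actual Ngrams package):

```python
from collections import Counter

# Build a profile: the L most frequent byte n-grams of a document,
# each paired with its frequency normalized by the total n-gram count.
def profile(text: str, n: int, L: int) -> dict[bytes, float]:
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(L)}
```

A category profile can be built the same way; one plausible route (an assumption here, not a detail given in the paper) is to concatenate all training documents of a category before counting.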
In the process of separating testing from training documents, and in the categorization process itself, the software package NgramsClassification³ is used.

³ Source code can be obtained on request from the author.

One of the most important questions in byte n-gram categorization is which values of n and L produce the best results. To answer this question, the accuracy (micro- and macro-averaged F1) of the technique was tested for all values of the n-gram size n and the profile length L for which this makes sense. Fig. 2 and 3 give a graphical representation of this extensive set of experiments, taking into account only the few values of n for which the highest accuracy is achieved and only the dissimilarity measure d (all other measures achieve their maximum accuracy for the same value of n as d). It can be seen that the maximum values of the micro- and macro-averaged F1 measures are reached for n = 7. For that particular value of n, a comparison between the measures d, d1, d2 and d3 is performed. The results of these experiments are shown in Fig. 4 and 5.

Additionally, Fig. 6 presents the micro- and macro-averaged F1 values for the best results for each measure. All these results show that all the introduced measures achieve comparable results.

The quality of classification is also assessed using a confusion matrix, i.e., a record of correctly and incorrectly recognized documents for each category. Table 1 gives the confusion matrix for the experiment using the 70000 most frequent 7-grams and the dissimilarity measure d. It shows the information about the actual and predicted categorizations done by our system. Each column of the matrix represents the number of documents in a predicted class, while each row represents the number of documents in an actual class. The diagonal elements show the number of correctly classified documents, and the off-diagonal elements the number of wrongly classified ones.
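Per-class precision and recall can be read off such a matrix directly; the sketch below does this for the Sport row and column of Table 1 (rows are actual classes, columns are predicted classes):

```python
# Per-class precision and recall from a confusion matrix whose rows are
# actual classes and columns are predicted classes (figures from Table 1).
categories = ["Economy", "Politics", "Sport", "H&C", "C&E"]
matrix = [
    [140,  22,   0,   3,   1],  # Economy
    [ 12, 413,   3,  35,   4],  # Politics
    [  3,   9, 466,   8,   2],  # Sport
    [ 14,  61,   0, 229,   5],  # H&C
    [  3,   9,   5,   2, 294],  # C&E
]

def precision_recall(matrix: list[list[int]], k: int) -> tuple[float, float]:
    tp = matrix[k][k]
    predicted = sum(row[k] for row in matrix)  # column sum: predicted as k
    actual = sum(matrix[k])                    # row sum: actually class k
    return tp / predicted, tp / actual

p, r = precision_recall(matrix, categories.index("Sport"))
print(f"Sport: precision = {p:.3f}, recall = {r:.3f}")
# Sport: precision = 0.983, recall = 0.955
```

Summing the diagonal gives 1542 correctly classified documents out of 1743, i.e. a micro-averaged accuracy of about 88.5%, consistent with the F1 values reported in the figures.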
Figure 6: Best micro- and macro-averaged F1 in percentages for the Ebart corpus and the different measures.

Table 1: Confusion matrix for classification of the Ebart corpus (H&C = Chronicle & Crime, C&E = Culture & Entertainment).

Categories   Economy  Politics  Sport  H&C  C&E
Economy        140      22*       0     3    1
Politics       12*      413       3    35*   4
Sport           3        9      466     8    2
H&C            14       61*       0   229    5
C&E             3        9        5     2  294

The main reason for the wrongly classified documents is the real-world similarity between categories. For example, the category "Chronicle & Crime" and the category "Politics" are close to each other; the same holds for "Politics" and "Economy". In Table 1, the numbers labelled with "*" are the numbers of documents wrongly classified into a category that is close to the correct one.

5. CONCLUSION AND FUTURE WORK
This paper presented the results of classifying a Serbian data set using a new variant of a document categorization approach based on byte-level n-grams, including all printing and non-printing characters. The approach relies on a profile document representation of restricted size and a very simple algorithm for comparing profiles. It provides an inexpensive and effective way of classifying documents.

The dissimilarity measures are subject to further investigation and improvement, as are the categorization methods themselves. The plan is to compare the results obtained by the presented method with the results of other supervised methods on the same data sets. This method, being based on sequences of bytes, is applicable to different domains and problems, and will be further tested on specific corpora, such as bioinformatics and multilingual corpora, and on tuning Internet search engines.

6. ACKNOWLEDGEMENTS
The work presented has been financially supported by the Ministry of Science and Technological Development, Republic of Serbia, through Projects No. III47003 and No. 174021. The author is grateful to prof. Gordana Pavlović-Lažetić for her unfailing support and supervision of this research.

7. REFERENCES
[1] W. B. Cavnar and J. M. Trenkle. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161-175, 1994.
[2] J. Graovac. A variant of n-gram based language-independent text categorization. Submitted, 2012.
[3] V. Kešelj, F. Peng, N. Cercone, and C. Thomas. N-gram-based author profiles for authorship attribution. In Proceedings of the Pacific Association for Computational Linguistics, pages 255-264, 2003.
[4] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34:1-47, 2002.
[5] A. Tomović et al. N-gram-based classification and unsupervised hierarchical clustering of genome sequences. Computer Methods and Programs in Biomedicine, pages 137-153, 2006.
[6] D. Vitas, C. Krstev, I. Obradović, L. Popović, and G. Pavlović-Lažetić. An overview of resources and basic tools for the processing of Serbian written texts. In Proc. of the Workshop on Balkan Language Resources, 1st Balkan Conference in Informatics, pages 97-104, 2003.