Multilingual Text Classification through Combination of Monolingual Classifiers

Teresa Gonçalves, Paulo Quaresma
Departamento de Informática, Universidade de Évora
7000-671 Évora, Portugal
(tcg@di.uevora.pt, pq@di.uevora.pt)

Abstract. With the globalization trend, a large number of documents are written in different languages. If these polylingual documents are already organized into existing categories, one can learn a model to classify newly arrived polylingual documents. Although one can adopt a simple approach that treats the problem as multiple independent monolingual text classification problems, such an approach fails to use the opportunity offered by polylingual training documents to improve the effectiveness of the classifier. This paper proposes a method to combine different monolingual classifiers in order to obtain a new classifier that is as good as the best monolingual one while also delivering the best possible values for every performance measure (precision, recall and F1). The proposed methodology was applied and evaluated on a corpus of legal documents from the EUR-Lex site. The obtained results were quite good, indicating that combining different monolingual classifiers may be a promising approach to reach the best performance for each category independently of the language.

Keywords: Multilingual text classification, Machine Learning, Support Vector Machines

1. Introduction

Current Information Technologies and Web-based services need to manage, select and filter increasing amounts of textual information. Text classification allows users, through navigation on class hierarchies, to browse more easily the texts of their interest. This paradigm is very effective both for filtering information and for developing online end-user services. Since the number of documents involved in these applications is large, efficient and automatic classification approaches are necessary.
A Machine Learning approach can be used to automatically build the classifiers. The construction process can be seen as a problem of supervised learning: the algorithm receives a relatively small set of labelled documents and generates the classifier. Several algorithms have been applied, such as decision trees, linear discriminant analysis and logistic regression, the naïve Bayes algorithm and Support Vector Machines (SVM). Besides having a justified learning theory describing their mechanics, SVMs are known to be computationally efficient, robust and accurate.

Because of the globalization trend, an organization or individual often generates, acquires and archives the same document written in different languages (i.e., polylingual documents); moreover, many countries adopt multiple languages as their official languages. If these polylingual documents are organized into existing categories, one would like to use this set of pre-classified documents as training documents to build models to classify newly arrived polylingual documents.

LOAIT2010-Proceedings.tex; 26/06/2010; 13:46; p.29

For multilingual text classification (i.e., collections of documents written in several languages), some prior studies address the challenge of cross-lingual text classification. However, prior research has not yet paid much attention to using polylingual documents. This study is motivated by the importance of providing polylingual text classification support to organizations and individuals in an increasingly globalized and multilingual environment. We propose a method that combines different monolingual classifiers in order to obtain a new classifier that is as good as the best monolingual one and is able to deliver the best possible values for all performance measures (precision, recall and F1). This methodology was applied and evaluated on a set of legal documents from the EUR-Lex site.
We collected documents for two Germanic languages (English and German) and two Romance ones (Italian and Portuguese), obtaining four different sets. The obtained results were quite good, indicating that combining different monolingual classifiers may be a promising approach to the problem of classifying documents written in several languages.

The paper is organized as follows: Section 2 describes the main concepts and tools used in our approach, Section 3 introduces the methodology for combining monolingual classifiers and Section 4 presents the document collection used for evaluation, describes the experimental setup and evaluates the obtained results. Finally, Section 5 presents some conclusions and points out possible future work.

2. Concepts and Tools

This section introduces the Automatic Text Classification approach and the classification algorithm and software tool used in this work.

2.1. AUTOMATIC TEXT CLASSIFICATION

Originally, research in Automatic Text Classification addressed the binary problem, where a document is either relevant or not w.r.t. a given category. However, in real-world situations the great variety of different sources, and hence categories, usually poses a multi-class classification problem, where a document belongs to exactly one category from a predefined set. Even more general is the multi-label problem, where a document can be classified into more than one category.

In order to be fed to the learning algorithm, documents must be pre-processed to obtain a more structured representation. The most common approach is to use a bag-of-words representation (Salton, 1975), where each document is represented by the words it contains, their order and punctuation being ignored. Normally, words are weighted by some measure of the word's frequency in the document and, possibly, in the corpus.
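As an illustration, such a frequency-based weighting (plain tf-idf computed with the standard library; the function and corpus are our own minimal sketch, not the paper's toolchain) could look like:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} vector,
    weighting each term by its frequency in the document (tf) times the
    inverse of its document frequency in the corpus (idf)."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

# toy corpus of pre-tokenized documents
corpus = [["free", "movement", "of", "goods"],
          ["movement", "of", "fish"],
          ["goods", "and", "fish"]]
vecs = tfidf_vectors(corpus)
# "of" occurs in 2 of the 3 documents, so its weight in the first
# document is lower than that of "free", which occurs in only one.
```

In practice the resulting vectors are usually also normalized to unit length, as done in the experiments of Section 4.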
In most cases, a subset of words (stop-words) is not considered, because their role is related to the structural organization of the sentences and they have no discriminating power over the different classes; some works also reduce semantically related terms to the same root by applying a lemmatizer.

Research interest in this field has been growing in recent years. Several machine learning algorithms have been applied, such as decision trees (Tong, 1994), linear discriminant analysis and logistic regression (Schütze, 1995), the naïve Bayes algorithm (Mladenić, 1999) and Support Vector Machines (SVM) (Joachims, 1999). Joachims (2002) argues that using SVMs to learn text classifiers is the first approach that is computationally efficient and performs well and robustly in practice, and that there is also a justified learning theory describing its mechanics with respect to text classification.

2.1.1. Multilingual text classification.
While most text classification studies focus on monolingual documents, some address multilingual text classification. Of these, the great majority address the challenge of cross-lingual text classification, where the classification model relies on monolingual training documents and a translation mechanism to classify documents written in another language (Bel, 2003; Rigutini, 2005; Lee, 2009). A technique that takes into account the training documents of all languages when constructing a monolingual classifier for a specific language is proposed in (Wei, 2007): Wei et al. showed that, for English and Chinese, a feature-based reinforcement polylingual category integration approach obtains better accuracy than monolingual ones. Our proposal is quite different, because we use neither information from other languages nor a multilingual thesaurus to build the individual classifiers. Our aim is to combine individual classifiers in order to obtain a better classifier, not to improve the individual classifiers.

2.2.
SUPPORT VECTOR MACHINES

Support Vector Machines, a learning algorithm introduced by Vapnik and co-workers (Cortes, 1995), was motivated by theoretical results from statistical learning theory: it joins a kernel technique with the structural risk minimization framework.

Kernel techniques comprise two parts: a module that performs a mapping from the original data space into a suitable feature space, and a learning algorithm designed to discover linear patterns in the (new) feature space. The kernel function, which implicitly performs the mapping, depends on the specific data type and on domain knowledge about the particular data source. The learning algorithm is general purpose and robust. It is also efficient, since the amount of computational resources required is polynomial in the size and number of data items, even when the dimension of the embedding space grows exponentially (Shawe-Taylor, 2004). A mapping example is illustrated in Fig. 1a).

The structural risk minimization (SRM) framework creates a model with a minimized VC (Vapnik-Chervonenkis) dimension. The theory (Vapnik, 1998) shows that when the VC dimension of a model is low, the expected probability of error is low as well, which means good performance on unseen data (good generalization). In geometric terms, it can be seen as the search, among all decision surfaces (the surfaces that separate positive from negative examples), for the one with maximum margin, that is, the one whose separating property is invariant under the widest translation of the surface. This property is illustrated in Fig. 1b), which shows a 2-dimensional problem.

Figure 1. The SVM approach: kernel transformation and search for the maximum margin.

2.2.1. Classification software.
As classification software we used SVMlight (Joachims, 1999)1.
It is a C implementation of SVM that solves classification, regression and ranking problems, handles many thousands of support vectors and several hundred thousand training examples, and supports the standard kernel functions besides letting the user define their own.

3. Combining monolingual classifiers

Having documents in several languages, one can adopt a naïve approach by considering the problem as multiple independent monolingual text classification problems. This simple approach only employs the training documents of one language to construct a monolingual classifier for that language and ignores all training documents of the other languages. When a new document in a specific language arrives, one selects the corresponding classifier to predict the appropriate category(ies) for the target document. However, the independent construction of each monolingual classifier fails to use the opportunity offered by polylingual training documents to improve the effectiveness of the classifier.

With this in mind, to reach a decision for a new document the monolingual classifiers can be combined in several ways. We propose the following strategies for the combination system:

− the sum of the SVMs' output values
− the F1-weighted sum of the SVMs' output values
− the F1-weighted sum of the SVMs' decisions

The above measures could also be used to reach decisions when considering a voting strategy over the monolingual classifiers.

1 Available at http://svmlight.joachims.org

4. Experiments

This section introduces the dataset, describes the experimental setup and presents the obtained results for the legal concepts classification task.

4.1. DATASET DESCRIPTION

For testing the proposed methodology, experiments were run over a set of European Union law documents.
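To make the three combination strategies of Section 3 concrete, the following sketch (our own illustrative code, not the paper's implementation) combines the per-language results for a single category, assuming each monolingual SVM exposes a signed output value whose sign gives its decision, together with its F1 measured on held-out data:

```python
def combine(outputs, f1_weights, strategy):
    """Combine per-language SVM results for one category.
    outputs: {language: signed SVM output value for the document}
    f1_weights: {language: F1 of that monolingual classifier}
    Returns True if the combined classifier assigns the category."""
    if strategy == "sum":                 # sum of SVM output values
        score = sum(outputs.values())
    elif strategy == "weighted-sum":      # F1-weighted sum of output values
        score = sum(f1_weights[l] * v for l, v in outputs.items())
    elif strategy == "weighted-decision": # F1-weighted sum of +1/-1 decisions
        score = sum(f1_weights[l] * (1 if v > 0 else -1)
                    for l, v in outputs.items())
    else:
        raise ValueError(strategy)
    return score > 0

# hypothetical values for one document and one category
svm_out = {"EN": 0.4, "DE": 0.1, "IT": -0.2, "PT": -0.6}
f1 = {"EN": 0.86, "DE": 0.86, "IT": 0.80, "PT": 0.66}
```

On these hypothetical numbers the strategies can disagree: the raw sum of outputs is negative (rejecting the category), while the F1-weighted sum of decisions is positive, because the two accepting classifiers carry higher F1 weights than the magnitudes of the rejecting outputs suggest.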
These documents were obtained from the EUR-Lex site2, within the "International Agreements" section, belonging to the "External Relations" subject matter. From all available agreements we chose the ones with full text (not just a bibliographic notice), obtaining a set of 2714 documents (dated from 1953 to 2008).

Since the agreements are available in several languages, we collected them for two Germanic languages (English and German) and two Romance ones (Italian and Portuguese), obtaining four different corpora: eurlex-EN, eurlex-DE, eurlex-IT and eurlex-PT. Table I presents the total number and average per document of tokens (running words) and types (unique words).

Each document is classified according to several ontologies: the "EUROVOC descriptor", the "Directory code" and the "Subject matter". In all available classifications a document can be assigned to several categories. For our classification problem we used the first level of the "Directory code" classification, considering only categories with at least 50 documents. Table II shows each category along with the number of documents assigned to it.

2 Available at http://eur-lex.europa.eu/en/index.htm

Table I. Total number and average per document of tokens and types for each corpus.

              tokens                  types
  corpus      total     per doc      total    per doc
  eurlex-EN   10699234    3942       73091      570
  eurlex-DE   10145702    3728      133191      688
  eurlex-IT   10665455    3929       96029      636
  eurlex-PT    9731861    3585       86086      567

Table II. Number of documents assigned to each category.

  id   name                                           # of docs
   2   Customs Union and free movement of goods            209
   3   Agriculture                                         390
   4   Fisheries                                           361
   7   Transport policy                                     81
  11   External relations                                 2628
  12   Energy                                               58
  13   Industrial policy and internal market                55
  15   Environment, consumers and health protection        138
  16   Science, information, education and culture          99

4.2.
EXPERIMENTAL SETUP

The experiments used a bag-of-words representation of the documents: a vector space model (VSM) representation where each document is represented by the words it contains, with their order and punctuation ignored. The document representation was obtained by mapping all numbers to the same token and using the tf-idf weighting function normalized to unit length. The SVM algorithm was run using SVMlight with a linear kernel and the other parameters at their default values, and the models were evaluated using a 10-fold stratified cross-validation procedure, with significance tests done at a 90% confidence level.

To measure the learners' performance we analyzed the precision, recall and F1 measures (Salton, 1975) of the positive class. These measures are obtained from the contingency table of the classification (prediction vs. manual classification).

4.3. MONOLINGUAL EXPERIMENTS

To support our claim, as a baseline we built classifiers for each language. Table III shows the average precision, recall and F1 measures for each corpus and each category (boldface values are significantly worse than the best value obtained). The last line presents the average values over all nine classes.

Table III. Average precision, recall and F1 values for each monolingual classifier.
            precision                  recall                     F1
  id     EN   DE   IT   PT        EN   DE   IT   PT        EN   DE   IT   PT
   2   .919 .957 .922 .937      .651 .665 .580 .565      .755 .778 .702 .701
   3   .916 .928 .938 .943      .818 .805 .705 .503      .862 .860 .803 .655
   4   .956 .966 .980 .971      .934 .906 .914 .823      .944 .934 .945 .890
   7   .846 .870 .793 .806      .568 .543 .518 .482      .651 .640 .608 .590
  11   .973 .973 .973 .973      .998 .997 .998 .997      .985 .985 .985 .985
  12   .958 .874 .877 .938      .637 .700 .670 .600      .752 .765 .745 .716
  13   .942 .933 .933 .967      .393 .320 .300 .320      .522 .454 .436 .461
  15   .909 .922 .917 .908      .726 .732 .725 .732      .801 .813 .805 .806
  16   .862 .883 .916 .947      .779 .799 .718 .647      .804 .828 .785 .753
  avg  .828 .832 .825 .839      .650 .647 .613 .567      .708 .706 .681 .656

For the precision values, the Portuguese dataset shows no significant difference from the "best" value for any class; all the other languages perform worse for some classes (English: c2, c4 and c16; German: c12 and c16; Italian: c2, c7 and c12). With this in mind, one can say that the Portuguese language generates the classifiers with the best precision.

Concerning recall, it is the English and German languages that consistently present the best values; Italian and Portuguese, while equally good for some classes, are worse for others (Italian: c2 and c3; Portuguese: c2, c3 and c4).

The F1 measure presents the same behavior as recall, the only difference being the classes for which the Portuguese language performs worse (c2, c3 and c16).

4.4. POLYLINGUAL EXPERIMENTS

Of all the possible combiners (see Section 3), there is one that, for all classes, consistently generated the best F1 values: the F1-weighted sum of the SVMs' decisions. Table IV shows, for each performance measure, its results compared with those of the "best" monolingual classifiers (boldface values are significantly worse than the corresponding multilingual one): the Portuguese classifier for precision, and the English and German ones for recall and F1.
The last line again presents the average values over all classes.

Table IV. Average precision, recall and F1 values compared with the combiner's ones.

       precision           recall                 F1
  id    PT  comb      EN   DE  comb        EN   DE  comb
   2  .937  .947    .651 .665  .675      .755 .778  .782
   3  .943  .925    .818 .805  .813      .862 .860  .863
   4  .971  .964    .934 .906  .928      .944 .934  .945
   7  .806  .868    .568 .543  .567      .651 .640  .654
  11  .973  .973    .998 .997  .998      .985 .985  .985
  12  .938  .908    .637 .700  .670      .752 .765  .761
  13  .967  .933    .393 .320  .340      .522 .454  .467
  15  .908  .912    .726 .732  .754      .801 .813  .821
  16  .947  .881    .779 .799  .779      .804 .828  .815
  avg .839  .831    .650 .647  .652      .708 .706  .709

From the average values one can easily see that precision is higher than recall and that the best monolingual classifier depends on which performance measure one considers. Nevertheless, the combined classifier has all performance measures very similar and is never significantly worse than the best monolingual classifier. In fact, significance tests show that, for all classes and all performance measures, there is no significant difference between the "best" monolingual classifier and the corresponding combined classifier.

5. Conclusions and Future Work

A proposal to combine monolingual classifiers was presented and evaluated. The proposed methodology uses SVM classifiers to associate concepts to legal documents and a decision function that combines them in order to obtain, for each class, a classifier as good as the best monolingual classifier for each performance measure.

The baseline experiments allow one to conclude that some languages generate classifiers with better precision values (Portuguese) while others generate classifiers with better recall values (English and German). In order to explain and to try to generalise these results, further experiments need to be done. For instance, we will need to evaluate this methodology with other collections and domains.
Are these results specific to the legal domain? Or only to this collection and these topics? Nevertheless, from a linguistic point of view, these results raise quite interesting questions.

By combining all the classifiers one obtains a classifier as good as the best monolingual one. This combined classifier can even be considered better than the others, since it has the ability to deliver all the best performance measures (precision, recall and F1), unlike any single monolingual classifier.

As ongoing research we intend to use a deeper linguistic representation of documents and to re-evaluate this methodology. Specifically, we will use a semantic representation of documents (based on DRS3) and a graph kernel to create the SVM models. In previous work, this approach was able to improve on the bag-of-words results for the Portuguese language. Another research line is to use a legal thesaurus, such as the LOIS4 lexical thesaurus, to reinforce some features/terms. With this approach we would combine our proposal with the main ideas of the work of Wei et al. (Wei, 2007).

References

Bel, N., Koster, C. and Villegas, M. (2003), Cross-lingual text categorization, in Proceedings of ECDL'03, 7th European Conference on Research and Advanced Technology for Digital Libraries, pp. 126–139.
Cortes, C. and Vapnik, V. (1995), Support-vector networks, Machine Learning, Vol. 20 No. 3, pp. 273–297.
Joachims, T. (1999), Making large-scale SVM learning practical, in Schölkopf, B., Burges, C. and Smola, A. (Eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press.
Joachims, T. (2002), Learning to Classify Text Using Support Vector Machines, Kluwer Academic Publishers.
Lee, C.H. and Yang, H.C. (2009), Construction of supervised and unsupervised learning systems for multilingual text categorization, Expert Systems with Applications, Vol. 36 No. 2, pp. 2400–2410.
Mladenić, D.
and Grobelnik, M. (1999), Feature selection for unbalanced class distribution and naïve Bayes, in Proceedings of ICML'99, 16th International Conference on Machine Learning, pp. 258–267.
Rigutini, L., Maggini, M. and Liu, B. (2005), An EM based training algorithm for cross-language text categorization, in Proceedings of WI'05, IEEE/WIC/ACM International Conference on Web Intelligence, IEEE Computer Society, pp. 529–535.
Salton, G., Wong, A. and Yang, C. (1975), A vector space model for automatic indexing, Communications of the ACM, Vol. 18, pp. 613–620.
Schütze, H., Hull, D. and Pedersen, J. (1995), A comparison of classifiers and document representations for the routing problem, in Proceedings of SIGIR'95, 18th International Conference on Research and Development in Information Retrieval, ACM, pp. 229–237.
Shawe-Taylor, J. and Cristianini, N. (2004), Kernel Methods for Pattern Analysis, Cambridge University Press.
Tong, R. and Appelbaum, L.A. (1994), Machine learning for knowledge-based document routing, in Proceedings of TREC-2, 2nd Text REtrieval Conference.
Vapnik, V. (1998), Statistical Learning Theory, Wiley, NY.
Wei, C., Shi, H. and Yang, C. (2007), Feature reinforcement approach to poly-lingual text categorization, in Proceedings of the International Conference on Asian Digital Libraries, LNCS, Springer, pp. 99–108.

3 Discourse Representation Structures
4 Lexical Ontologies for Legal Information Sharing