<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual Text Classification through Combination of Monolingual Classifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Teresa Gonçalves</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paulo Quaresma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Informática, Universidade de Évora, 7000-671 Évora</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>38</lpage>
      <abstract>
        <p>With the globalization trend, there is a large amount of documents written in different languages. If these polylingual documents are already organized into existing categories, they can be used to learn a model that classifies newly arrived polylingual documents. A simple approach is to treat the problem as multiple independent monolingual text classification problems, but this fails to use the opportunity offered by polylingual training documents to improve the effectiveness of the classifier. This paper proposes a method to combine different monolingual classifiers in order to obtain a new classifier that is as good as the best monolingual one and is also able to deliver the best possible performance measures (precision, recall and F1). The proposed methodology was applied and evaluated on a corpus of legal documents from the EUR-Lex site. The obtained results were quite good, indicating that combining different monolingual classifiers may be a promising approach to reach the best performance for each category independently of the language.</p>
      </abstract>
      <kwd-group>
        <kwd>Multilingual text classification</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Support Vector Machines</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Current Information Technologies and Web-based services need to manage,
select and filter increasing amounts of textual information. Text classification
allows users, through navigation on class hierarchies, to browse the texts of
their interest more easily. This paradigm is very effective both in filtering
information and in developing online end-user services.</p>
      <p>Since the number of documents involved in these applications is large,
efficient and automatic approaches are necessary for classification. A
Machine Learning approach can be used to automatically build the classifiers.
The construction process can be seen as a problem of supervised learning: the
algorithm receives a relatively small set of labelled documents and generates
the classifier. Several algorithms have been applied, such as decision trees,
linear discriminant analysis and logistic regression, the naïve Bayes algorithm
and Support Vector Machines (SVM). Besides resting on a well-founded learning
theory that describes their mechanics, SVMs are known to be computationally efficient,
robust and accurate.</p>
      <p>Because of the globalization trend, an organization or individual often
generates, acquires and archives the same document written in different
languages (i.e., polylingual documents); moreover, many countries adopt
multiple languages as their official languages. If these polylingual documents
are organized into existing categories, one would like to use this set of
preclassified documents as training documents to build models to classify newly
arrived polylingual documents.</p>
      <p>For multilingual text classification (i.e., collections of documents
written in several languages), some prior studies address the challenge of
crosslingual text classification. However, prior research has not yet paid much
attention to the use of polylingual documents. This study is motivated by the
importance of providing polylingual text classification support to organizations
and individuals in the increasingly globalized and multilingual environment.</p>
      <p>We propose a method that combines different monolingual classifiers in
order to get a new classifier that is as good as the best monolingual one and
is able to deliver the best possible values for all performance measures
(precision, recall and F1).</p>
      <p>This methodology was applied and evaluated on a set of legal documents
from the EUR-Lex site. We collected documents for two Germanic
languages (English and German) and two Romance ones (Italian and Portuguese),
obtaining four different sets. The obtained results were quite good,
indicating that combining different monolingual classifiers may be a promising
approach to the problem of classifying documents written in several
languages.</p>
      <p>The paper is organized as follows: Section 2 describes the main concepts
and tools used in our approach, Section 3 introduces the methodology for
combining monolingual classifiers and Section 4 presents the document
collection used for evaluation, describes the experimental setup and evaluates
the obtained results. Finally, Section 5 presents some conclusions and points
out possible future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Concepts and Tools</title>
      <p>This section introduces the Automatic Text Classification approach and the
classification algorithm and software tool used in this work.</p>
      <sec id="sec-2-1">
        <title>2.1. AUTOMATIC TEXT CLASSIFICATION</title>
        <p>Originally, research in Automatic Text Classification addressed the binary
problem, where a document is either relevant or not w.r.t. a given category.
However, in real-world situations the great variety of different sources and
hence categories usually poses a multi-class classification problem, where a
document belongs to exactly one category from a predefined set. Even more
general is the multi-label problem, where a document can be classified into
more than one category.</p>
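        <p>As an illustration (our own sketch, not part of the original text), a multi-label problem over a fixed category set is commonly decomposed into one binary problem per category, which is also how the per-category classifiers used later in this paper are organized:</p>
        <preformat>
# Hypothetical sketch: decomposing a multi-label task into per-category
# binary problems (one binary learner per category).

def binary_views(documents, labels, categories):
    """For each category, yield (category, documents, +1/-1 targets).

    documents is a list of texts, labels a parallel list of sets of
    category identifiers, and categories the fixed category inventory.
    """
    for cat in categories:
        targets = [1 if cat in doc_labels else -1 for doc_labels in labels]
        yield cat, documents, targets
        </preformat>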
        <p>
          In order to be fed to the learning algorithm, documents must be
preprocessed to obtain a more structured representation. The most common
approach is to use a bag-of-words representation
          <xref ref-type="bibr" rid="ref8">(Salton, 1975)</xref>
          , where each
document is represented by the words it contains, their order and punctuation
being ignored. Normally, words are weighted by some measure of their
frequency in the document and, possibly, in the corpus. In most cases a
subset of words (stop-words) is discarded, since their role is related to the
structural organization of sentences and has no discriminating power over the
classes; some works also reduce semantically related terms to a common root
by applying a lemmatizer.
        </p>
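        <p>A minimal sketch of this preprocessing, under our own assumptions (the tiny stop-word list and the whitespace tokenizer are illustrative only, and lemmatization is omitted):</p>
        <preformat>
# Bag-of-words sketch: tokenize, drop stop-words, count term frequencies.
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "to"}  # illustrative subset only

def bag_of_words(text):
    """Return a term-frequency dictionary; word order and punctuation are ignored."""
    tokens = [t.strip(".,;:!?()").lower() for t in text.split()]
    return Counter(t for t in tokens if t and t not in STOP_WORDS)
        </preformat>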
        <p>
          Research interest in this field has been growing in recent years.
Several machine learning algorithms have been applied, such as decision trees (Tong,
1994), linear discriminant analysis and logistic regression (
          <xref ref-type="bibr" rid="ref9">Schütze, 1995</xref>
          ),
the naïve Bayes algorithm
          <xref ref-type="bibr" rid="ref6">(Mladenic´, 1999)</xref>
          and Support Vector Machines
(SVM)
          <xref ref-type="bibr" rid="ref3">(Joachims, 1999)</xref>
          . Joachims
          <xref ref-type="bibr" rid="ref4">(Joachims, 2002)</xref>
          argues that using SVMs
to learn text classifiers is the first approach that is computationally efficient
and performs well and robustly in practice. There is also a well-founded learning
theory that describes their mechanics with respect to text classification.
        </p>
        <sec id="sec-2-1-1">
          <title>2.1.1. Multilingual text classification.</title>
          <p>
            While most text classification studies focus on monolingual documents, some
point to multilingual text classification. From these, the great majority
address the challenge of crosslingual text classification where the classification
model relies on monolingual training documents and a translation
mechanism to classify documents written in another language
            <xref ref-type="bibr" rid="ref1 ref5 ref7">(Bel, 2003; Rigutini,
2005; Lee, 2009)</xref>
            . A technique that takes into account all training documents
of all languages when constructing a monolingual classifier for a specific
language is proposed in
            <xref ref-type="bibr" rid="ref13">(Wei, 2007)</xref>
            . Wei et al. showed that for English and
Chinese a feature-based reinforcement polylingual category integration
approach obtains better accuracy than monolingual ones. Our proposal is quite
different because we do not use information from other languages or a
multilingual thesaurus to build the individual classifiers. Our aim is to combine
the individual classifiers into a better one, not to improve each
individual classifier.
          </p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. SUPPORT VECTOR MACHINES</title>
        <p>
          The Support Vector Machine, a learning algorithm introduced by Vapnik and
coworkers
          <xref ref-type="bibr" rid="ref2">(Cortes, 1995)</xref>
          , was motivated by theoretical results from statistical
learning theory: it joins a kernel technique with the structural risk
minimization framework.
        </p>
        <p>Kernel techniques comprise two parts: a module that performs a mapping
from the original data space into a suitable feature space and a learning
algorithm designed to discover linear patterns in the (new) feature space. The
kernel function, which implicitly performs the mapping, depends on the specific
data type and on domain knowledge of the particular data source.</p>
        <p>
          The learning algorithm is general purpose and robust. It is also efficient,
since the amount of computational resources required grows polynomially with
the size and number of data items, even when the dimension of the
embedding space grows exponentially
          <xref ref-type="bibr" rid="ref10">(Shawe-Taylor, 2004)</xref>
          . A mapping example
is illustrated in Fig. 1a).
        </p>
        <p>
          The structural risk minimization (SRM) framework creates a model with
a minimized VC (Vapnik-Chervonenkis) dimension. This developed theory
          <xref ref-type="bibr" rid="ref12">(Vapnik, 1998)</xref>
          shows that when the VC dimension of a model is low, the
expected probability of error is low as well, which means good performance
on unseen data (good generalization). In geometric terms, it can be seen as a
search, among all decision surfaces (the T-dimensional surfaces that
separate positive from negative examples), for the one with maximum margin,
that is, the one whose separating property is invariant under the widest
translation of the surface. This property is illustrated by Fig. 1b), which
shows a 2-dimensional problem.
        </p>
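        <p>For reference, the maximum-margin search described above corresponds to the standard hard-margin SVM optimization problem (not spelled out in the original text), where the x_i are the training vectors mapped into the feature space and the y_i their +1/-1 labels:</p>
        <disp-formula>
          <tex-math>\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w}\rVert^{2} \quad \text{subject to} \quad y_i\,\bigl(\langle \mathbf{w}, \mathbf{x}_i\rangle + b\bigr) \ge 1, \quad i = 1,\dots,n</tex-math>
        </disp-formula>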
        <sec id="sec-2-2-1">
          <title>2.2.1. Classification software.</title>
          <p>
            As classification software we used SVMlight
            <xref ref-type="bibr" rid="ref3">(Joachims, 1999)</xref>
            (available at http://svmlight.joachims.org). It is a C
implementation of SVM that solves classification, regression and
ranking problems, handles many thousands of support vectors and several
hundred thousand training examples, and supports the standard kernel
functions besides letting the user define their own.
          </p>
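          <p>As an illustration of how such an experiment can be driven (a sketch with our own file names; svm_learn and svm_classify are the SVMlight training and prediction programs, and the input follows SVMlight's sparse label feature:value format):</p>
          <preformat>
# Hypothetical driver for SVMlight; train.dat and test.dat are assumed to
# exist and to contain lines such as "+1 1027:0.21 2041:0.05 ...".
import subprocess

# Train with the default (linear) kernel, then classify the test documents.
subprocess.run(["svm_learn", "train.dat", "model.dat"], check=True)
subprocess.run(["svm_classify", "test.dat", "model.dat", "predictions.dat"], check=True)
          </preformat>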
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Combining monolingual classifiers</title>
      <p>Having documents in several languages, one can adopt a naïve approach by
considering the problem as multiple independent monolingual text
classification problems. This simple approach only employs the training documents
of one language to construct a monolingual classifier for that language and
ignores all training documents of the other languages. When a new document in
a specific language arrives, one selects the corresponding classifier to predict
the appropriate category(ies) for the target document. However, the independent
construction of each monolingual classifier fails to use the opportunity
offered by polylingual training documents to improve the effectiveness of the
classifier.</p>
      <p>With this in mind, and to get a decision for a new document, the
monolingual classifiers can be combined in several ways. We propose
the following strategies for the combination system:
- the sum of the SVMs' output values;
- the F1-weighted sum of the SVMs' output values;
- the F1-weighted sum of the SVMs' decisions.
The above measures could also be used to draw decisions when considering
a voting strategy over the monolingual classifiers.</p>
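      <p>A minimal sketch of the three combiners follows (our own illustration; outputs maps each language to the raw SVM output of its monolingual classifier for the document, f1 maps each language to that classifier's F1 estimated on training data, and taking a positive combined value as assigning the category is our assumption):</p>
      <preformat>
# Sketch of the three combination strategies for one document and one category.
# outputs[lang]: signed SVM output of the monolingual classifier for lang.
# f1[lang]: F1 of that monolingual classifier (used as a weight).

def combine_sum(outputs):
    """Sum of the SVM output values."""
    return sum(outputs.values()) > 0

def combine_f1_weighted_outputs(outputs, f1):
    """F1-weighted sum of the SVM output values."""
    return sum(f1[lang] * outputs[lang] for lang in outputs) > 0

def combine_f1_weighted_decisions(outputs, f1):
    """F1-weighted sum of the SVM decisions (each decision is +1 or -1)."""
    decisions = {lang: (1 if outputs[lang] > 0 else -1) for lang in outputs}
    return sum(f1[lang] * decisions[lang] for lang in outputs) > 0
      </preformat>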
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>This section introduces the dataset, describes the experimental setup and
presents the obtained results for the legal concepts classification task.</p>
      <sec id="sec-4-1">
        <title>4.1. DATASET DESCRIPTION</title>
        <p>For testing the proposed methodology, experiments were run over a set of
European Union law documents. These documents were obtained from the
EUR-Lex site (available at http://eur-lex.europa.eu/en/index.htm) within the “International Agreements” section, belonging to
the “External Relations” subject matter. From all available agreements we
chose the ones with full text (not just bibliographic notice) obtaining a set of
2714 documents (dated from 1953 to 2008).</p>
        <p>Since agreements are available in several languages we collected them for
two Germanic languages (English and German) and two Romance ones
(Italian and Portuguese), obtaining four different corpora: eurlex-EN, eurlex-DE,
eurlex-IT and eurlex-PT. Table I presents the total number and average per
document of tokens (running words) and types (unique words).</p>
        <p>Each document is classified according to several ontologies: the “EUROVOC
descriptor”, the “Directory code” and the “Subject matter”. In all available
classifications each document can be assigned to several categories. For our
classification problem we used the first level of the “Directory code”
classification, considering only categories with at least 50 documents. Table II shows
each category along with the number of documents assigned.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. EXPERIMENTAL SETUP</title>
        <p>The experiments were done using a bag-of-words representation of the
documents; the SVM algorithm was run using SVMlight with a linear kernel and
the remaining default parameters, and the models were evaluated using a 10-fold
stratified cross-validation procedure, with significance tests done at a 90%
confidence level.</p>
        <p>To represent each document we used the bag-of-words approach, a vector
space model (VSM) representation where each document is represented by
the words it contains, with their order and punctuation being ignored.
Each document’s representation was obtained by mapping all numbers to the same token
and using the tf-idf weighting function normalized to unit length.</p>
        <p>
          To measure the learner’s performance we analyzed the precision, recall and
F1 measures
          <xref ref-type="bibr" rid="ref8">(Salton, 1975)</xref>
          of the positive class. These measures are
obtained from the contingency table of the classification (prediction vs. manual
classification).
        </p>
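        <p>For completeness, these measures follow from the contingency table in the usual way (a sketch in terms of true positive, false positive and false negative counts):</p>
        <preformat>
# Precision, recall and F1 of the positive class from the contingency table
# (tp: true positives, fp: false positives, fn: false negatives).

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
        </preformat>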
      </sec>
      <sec id="sec-4-3">
        <title>4.3. MONOLINGUAL EXPERIMENTS</title>
        <p>As a baseline for our claim, we built classifiers for each language.
Table III shows the average precision, recall and F1 measures for each corpus
and each category (boldface values are significantly worse than the best value
obtained). The last line presents the average values over all nine classes.</p>
        <p>For the precision values we can notice that the Portuguese dataset has
values with no significant difference from the “best” for all classes; all the other
languages perform worse for some classes (English: c2, c4 and c16; German:
c12 and c16; Italian: c2, c7 and c12). With this in mind one can say that the
Portuguese language generates the classifiers with the best precision.</p>
        <p>Concerning recall, it is the English and German languages that consistently
present the best values; Italian and Portuguese, while equally good for some
classes, are worse for others (Italian: c2 and c3; Portuguese: c2, c3 and c4).</p>
        <p>The F1 measure presents the same behavior as recall, the only difference
being the classes where the Portuguese language performs worse (c2, c3 and
c16).</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. POLYLINGUAL EXPERIMENTS</title>
        <p>Of all the possible combiners (see Section 3), there is one that, for all classes,
consistently produced the best F1 values: the F1-weighted sum of the SVMs’
decisions.</p>
        <p>Table IV shows, for each performance measure, its results compared with
the “best” monolingual classifiers (boldface values are significantly worse than
the corresponding multilingual one): the Portuguese classifier for precision, and
the English and German ones for recall and F1. The last line again presents the
average values over all classes.</p>
        <p>From the average values, one can easily see that precision is higher than
recall and that the best monolingual classifier depends on which performance
measure one is considering. Nevertheless, the combined classifier has all
performance measures very similar and is never significantly worse than the best
monolingual classifier.</p>
        <p>In fact, significance tests show that, for all classes and all performance
measures, there is no significant difference between the “best” monolingual
classifier and the corresponding combined classifier.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>A proposal to combine monolingual classifiers was presented and evaluated.
The proposed methodology uses SVM classifiers to associate concepts to
legal documents and uses a decision function that combines them in order to
obtain, for each class, a classifier as good as the best monolingual classifier
for each performance measure.</p>
      <p>The baseline experiments allow one to conclude that some languages
generate classifiers with better precision values (the Portuguese language) while
others generate classifiers with better recall values (the English and German
languages). In order to explain and to try to generalise these results,
further experiments need to be done. For instance, we will need to
evaluate this methodology with other collections and domains. Are these results
specific to the legal domain? Or only to this collection and these topics?
Nevertheless, from a linguistic point of view, these results raise quite interesting
questions.</p>
      <p>By combining all classifiers one obtains a classifier as good as the best
monolingual one. This combined classifier can even be considered better than
the others since, unlike a single monolingual classifier, it is able to deliver
the best values for all performance measures (precision, recall and F1).</p>
      <p>
        As ongoing research we intend to use a deeper linguistic representation
of documents and to re-evaluate this methodology. Specifically, we will use
a semantic representation of documents, based on DRS (Discourse Representation
Structures), and a graph kernel to create the SVM models. In previous work, this
approach was shown to improve on the bag-of-words results for the Portuguese
language. Another research line is to use a legal thesaurus, such as the LOIS
(Lexical Ontologies for Legal Information Sharing) lexical thesaurus, to
reinforce some features/terms. With this approach we would combine our
proposal with the main ideas of the Wei et al. work
        <xref ref-type="bibr" rid="ref13">(Wei, 2007)</xref>
        .
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bel</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koster</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2003</year>
          ),
          <article-title>Cross-lingual text categorization</article-title>
          ,
          <source>in Proceedings of ECDL'03, the 7th European Conference on Research and Advanced Technology for Digital Libraries</source>
          , pp.
          <fpage>126</fpage>
          -
          <lpage>139</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>1995</year>
          ),
          <article-title>Support-vector networks</article-title>
          ,
          <source>Machine Learning</source>
          , Vol.
          <volume>20</volume>
          No.
          <issue>3</issue>
          , pp.
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>1999</year>
          ),
          <article-title>Making large-scale SVM learning practical</article-title>
          , in Schölkopf, B.,
          <string-name>
            <surname>Burges</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . (Ed.), “
          <article-title>Advances in Kernel Methods - Support Vector Learning”</article-title>
          , MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2002</year>
          ),
          <article-title>Learning to Classify Text Using Support Vector Machines</article-title>
          , Kluwer Academic Publishers.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.C.</given-names>
          </string-name>
          (
          <year>2009</year>
          ),
          <article-title>Construction of supervised and unsupervised learning systems for multilingual text categorization</article-title>
          ,
          <source>Expert Systems with Applications</source>
          , Vol.
          <volume>36</volume>
          No.
          <issue>2</issue>
          , pp.
          <fpage>2400</fpage>
          -
          <lpage>2410</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Mladenić</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Grobelnik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>1999</year>
          ),
          <article-title>Feature selection for unbalanced class distribution and naïve Bayes</article-title>
          ,
          <source>in Proceedings of ICML'99, 16th International Conference on Machine Learning</source>
          , pp.
          <fpage>258</fpage>
          -
          <lpage>267</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Rigutini</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maggini</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2005</year>
          ),
          <article-title>An EM Based Training Algorithm for Cross-Language Text Categorization</article-title>
          ,
          <source>in Proceedings of WI'05</source>
          , IEEE/WIC/ACM International Conference on Web Intelligence (IEEE Computer Society), pp.
          <fpage>529</fpage>
          -
          <lpage>535</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>1975</year>
          ),
          <article-title>A vector space model for automatic indexing</article-title>
          ,
          <source>Communications of the ACM</source>
          , Vol.
          <volume>18</volume>
          , pp.
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hull</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pedersen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>1995</year>
          ),
          <article-title>A comparison of classifiers and document representations for the routing problem</article-title>
          ,
          <source>in Proceedings of SIGIR'95, 18th International Conference on Research and Development in Information Retrieval (ACM)</source>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Shawe-Taylor</surname>
          </string-name>
          , J. and
          <string-name>
            <surname>Cristianini</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2004</year>
          ),
          <article-title>Kernel Methods for Pattern Analysis</article-title>
          , Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Tong</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Appelbaum</surname>
            ,
            <given-names>L.A.</given-names>
          </string-name>
          (
          <year>1994</year>
          ),
          <article-title>Machine learning for knowledge-based document routing</article-title>
          ,
          <source>in Proceedings of TRC'94</source>
          , 2nd Text Retrieval Conference.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>1998</year>
          ),
          <article-title>Statistical learning theory</article-title>
          , Wiley, NY.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2007</year>
          ),
          <article-title>Feature reinforcement approach to poly-lingual text categorization</article-title>
          ,
          <source>in Proceedings of the International Conference on Asia Digital Libraries (LNCS Springer)</source>
          , pp.
          <fpage>99</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>