<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multilingual Document Classification via Transductive Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Salvatore Romeo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dino Ienco</string-name>
          <email>dino.ienco@irstea.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Tagarelli</string-name>
          <email>tagarellig@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIMES, University of Calabria</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IRSTEA - UMR TETIS, and LIRMM</institution>
          ,
          <addr-line>Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present a transductive learning based framework for multilingual document classification, originally proposed in [7]. A key aspect in our approach is the use of a large-scale multilingual knowledge base, BabelNet, to support the modeling of documents written in different languages into a common conceptual space, without requiring any language translation process. Results on real-world multilingual corpora have highlighted the superiority of the proposed document model against existing language-dependent representation approaches, and the significance of the transductive setting for multilingual document classification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Multilingual document collections are getting increased attention as their
analysis is essential to support a variety of tasks, such as building translation
resources [
        <xref ref-type="bibr" rid="ref6 ref8">8, 6</xref>
        ], detection of plagiarism in patent collections [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and cross-lingual
document similarity and multilingual document categorization [
        <xref ref-type="bibr" rid="ref2 ref4">4, 2</xref>
        ]. Focusing on
the latter problem, existing methods in the literature can mainly be
characterized based on the language-specific resources they use to perform cross-lingual
tasks. A common approach is to resort to machine translation techniques or
bilingual dictionaries to map every document to the target language, and then
perform cross-lingual document similarity and categorization (e.g., [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]).
      </p>
      <p>
        We address the multilingual document classification problem from a different
perspective. First, we are not restricted to bilingual corpora, nor do we depend
on machine translation. In this regard, we exploit a large, publicly available
knowledge base specifically designed for multilingual retrieval tasks: BabelNet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
BabelNet embeds both the lexical ontology capabilities of WordNet and the
encyclopedic power of Wikipedia. Second, our view is different from the standard
inductive learning setting. High-quality labeled datasets are in fact difficult to
obtain due to costly and time-consuming annotation processes. This
particularly holds for the multilingual scenario, where the documents belong to different
languages and, as a consequence, more language-specific experts need to be
involved in the annotation process. Moreover, in multilingual corpora documents
are often all available at the same time, and the classifications for the unlabeled
instances need to be provided contextually to the learning of the current
document collection. To fulfill the last two requisites, transductive learning offers
an effective approach [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], as it requires few labels for decision making and the
learning process is tailored to the particular dataset.
      </p>
      <p>
        Motivated by the above considerations, we present a framework for
multilingual document classification under a transductive learning setting, originally
proposed in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. By introducing a unified conceptual feature space based on
BabelNet, we define a multilingual document representation model which does not
require any language translation. We resort to a state-of-the-art transductive
learner [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to produce the document classification. Using RCV2 and Wikipedia
document collections, we compare our proposal w.r.t. document representations
typically involved in multilingual and cross-lingual analysis.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Transductive Multilingual Document Categorization</title>
      <sec id="sec-2-1">
        <title>Text Representation Models</title>
        <p>Bag-of-words and machine-translation based models. The classic
bag-of-words model has also been employed in the context of multilingual documents.
Hereinafter we use the notation BoW to refer to the term-frequency vector
representation of documents over the union of language-specific term vocabularies.</p>
        <p>
          A common approach adopted in the literature is to translate all documents to
a unique anchor language and then represent the translated documents with the
BoW model [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In this work, we have considered three settings corresponding to
the use of English (BoW-MT-en), French (BoW-MT-fr) or Italian
(BoW-MT-it) as the anchor language. We also employ a dimensionality reduction approach via
Latent Semantic Analysis (LSA) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] over the BoW representation. We will refer
to this model as BoW-LSA.
        </p>
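        <p>As a rough sketch of the BoW-LSA baseline, under the common formulation of LSA as truncated SVD (the toy matrix sizes below are illustrative assumptions, not the settings used in our experiments):</p>
        <p>
```python
# BoW-LSA sketch: truncated SVD of the document-term matrix projects
# high-dimensional BoW vectors into a low-dimensional latent space.
import numpy as np

def lsa_project(X, n_components):
    """Return document coordinates on the top LSA components."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

rng = np.random.default_rng(0)
bow = rng.integers(0, 3, size=(6, 20)).astype(float)  # 6 docs, 20 terms
Z = lsa_project(bow, 2)  # each document reduced to 2 latent dimensions
```
        </p>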
        <p>
          Bag-of-synset representation. Differently from the previously discussed
document representations, we propose to model a collection of multilingual
documents into a unified conceptual feature space. Our key idea is to exploit the
multilingual lexical knowledge of BabelNet [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], in order to generate document
features that correspond to BabelNet synsets. More specifically, the input
document collection is subject to a two-step processing phase. In the first step,
each document is broken down into a set of lemmatized and POS-tagged
sentences, in which each word is replaced with its lemma and associated POS
tag (⟨w, POS(w)⟩). In the second step, word sense disambiguation is performed
over each pair ⟨w, POS(w)⟩ to detect the BabelNet synset most appropriate for w
in the context of each sentence in the document. Each document is finally modeled
as a |BS|-dimensional vector of BabelNet synset frequencies, where BS is the
set of retrieved BabelNet synsets. We will refer to this representation model as
BoS (i.e., bag-of-synsets).
        </p>
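        <p>As a minimal illustration of the two-step BoS construction described above, the following sketch stubs out the lemmatizer, POS tagger, and BabelNet disambiguation with toy lookup tables; the table contents and the synset id are hypothetical stand-ins for a real NLP toolkit and the BabelNet inventory:</p>
        <p>
```python
# Toy sketch of the bag-of-synsets (BoS) pipeline: lemmatize and POS-tag
# each word, disambiguate the (lemma, POS) pair to a BabelNet synset, and
# count synset frequencies. All resources here are hypothetical toy tables.
from collections import Counter

# Stand-in for a lemmatizer/POS tagger (step 1).
LEMMA_POS = {
    "houses": ("house", "NOUN"), "house": ("house", "NOUN"),
    "casa": ("casa", "NOUN"), "case": ("casa", "NOUN"),  # Italian plural
}
# Stand-in for word sense disambiguation against BabelNet (step 2):
# the English and Italian lemmas map to the same multilingual synset.
SYNSET = {("house", "NOUN"): "bn:00000001n", ("casa", "NOUN"): "bn:00000001n"}

def bos_vector(tokens):
    """Model a document (token list) as BabelNet-synset frequencies."""
    counts = Counter()
    for token in tokens:
        lemma_pos = LEMMA_POS.get(token.lower())
        if lemma_pos is None:
            continue  # word not covered by the toy resources
        synset = SYNSET.get(lemma_pos)
        if synset is not None:
            counts[synset] += 1
    return counts

en_doc = bos_vector(["The", "houses"])   # English document
it_doc = bos_vector(["Le", "case"])      # Italian document
# Both documents land on the same conceptual feature, with no translation.
```
        </p>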
      </sec>
      <sec id="sec-2-2">
        <title>Transductive Setting and Label Propagation Algorithm</title>
        <p>
          A major contribution of our work is the use of a transductive learning based
approach to address the problem of multilingual document classification. For
this purpose, we use a particularly effective transductive learner, named Robust
Multi-class Graph Transduction (RMGT) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>RMGT implements a graph-based label propagation approach, which
exploits a kNN graph built over the entire document collection to propagate the
class information from the labeled to the unlabeled documents. The transductive
learning scheme used by RMGT employs spectral properties of the kNN graph
to spread the label information over the set of test documents. Specifically,
the label propagation process is modeled as a constrained convex optimization
problem where the labeled documents are employed to constrain and guide the
final classification. After the propagation step, every unlabeled document d_i is
associated with a vector representing the likelihood of d_i belonging to each of
the classes; d_i is then assigned to the class that maximizes the likelihood.</p>
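        <p>The propagation scheme can be illustrated with the following simplified sketch: iterative label spreading over a kNN graph with clamped labels. This is an illustration in the spirit of graph transduction, not the exact constrained spectral formulation of RMGT:</p>
        <p>
```python
# Simplified graph-based label propagation over a kNN graph.
import numpy as np

def knn_graph(X, k):
    """Symmetric kNN adjacency matrix from Euclidean distances."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    W = np.zeros_like(d)
    for i in range(X.shape[0]):
        for j in np.argsort(d[i])[:k]:
            W[i, j] = W[j, i] = 1.0
    return W

def propagate(W, y, n_classes, iters=100):
    """Spread labels over the graph; y == -1 marks unlabeled documents."""
    F = np.zeros((len(y), n_classes))
    labeled = y != -1
    F[labeled, y[labeled]] = 1.0
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    for _ in range(iters):
        F = P @ F                          # average neighbour class scores
        F[labeled] = 0.0
        F[labeled, y[labeled]] = 1.0       # clamp the labeled documents
    return F.argmax(axis=1)                # class maximizing the likelihood

# Two well-separated clusters, one labeled document per class.
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([0, -1, -1, 1, -1, -1])
pred = propagate(knn_graph(X, k=2), y, n_classes=2)
```
        </p>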
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Setting and Results</title>
      <p>
        We evaluated our transductive learning based approach on two multilingual
document collections, RCV2 and Wikipedia. Here we summarize results obtained
on Wikipedia; the interested reader is referred to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for further details. We
considered documents in three different languages, English, French, and Italian,
covering 6 different topic-classes. Topics were selected to obtain, for each
topic-language pair, the same number of documents. The resulting balanced dataset
comprises 1 000 documents for each topic-language pair, with a total of 18 000
documents. The number and density of terms for the BoW model (resp. synsets
in BoS) are 15 634 and 1.61E-2 (resp. 10 247 and 1.81E-2). We also produced
an unbalanced version of the dataset by keeping the whole subset of English
documents while sampling half of the French and half of the Italian subsets. To
build both datasets, every document was subject to tokenization and
lemmatization.³ To set up the transductive learner, we used k = 10 for the kNN graph
construction, and varied the percentage of labeled documents from 1% to 20%
with a step of 1%. To measure the classification performance, we used the standard
F-measure, Precision, and Recall. Results were averaged over 30 runs.
      </p>
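      <p>The evaluation measures above can be reproduced with a minimal macro-averaged implementation (a small self-contained sketch; the toy label vectors are illustrative only):</p>
      <p>
```python
# Macro-averaged Precision, Recall, and F-measure over the predicted classes.
def macro_prf(true, pred, classes):
    ps, rs, fs = [], [], []
    for c in classes:
        tp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ps.append(prec); rs.append(rec); fs.append(f1)
    n = len(classes)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

p, r, f = macro_prf([0, 0, 1, 1], [0, 1, 1, 1], classes=[0, 1])
```
      </p>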
      <p>Figure 1 shows a summary of the best average performances on both balanced
and unbalanced corpora, and also shows results obtained on the balanced corpus
for different values of the training percentage of the transductive learner. We
observe that BoS clearly outperforms the other document representation models,
including BoW-MT-en, which in this case achieves similar (or even slightly lower)
results to BoW-MT-fr and BoW-MT-it. BoW-LSA and BoW also show a
performance gap from the other models. Considering the unbalanced scenario, the
BoS results are still higher than those of the best competing methods. Interestingly,
BoS performance is the same as for the balanced case, which would indicate a higher
robustness of BoS w.r.t. the corpus characteristics. Another remark is that the
proposed BoS not only performs significantly better than the other models, but
also exhibits a performance trend that is not affected by issues related to
language specificity. In fact, the machine-translation based models have relative
performance that may vary on different datasets; no preference on translation
languages can be made in advance, as a language that leads to better results on
one dataset can perform worse than other languages on another dataset.</p>
      <p>³ http://nlp.lsi.upc.edu/freeling/</p>
      <p>[Figure omitted.] Fig. 1. Summary of best average performance results of the various representation
methods (left), and average F-measure results on the balanced corpus (right).</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have presented a knowledge-based framework for multilingual document
classification under a transductive setting. Our BoS document model has proven
effective for multilingual comparable corpora, as it supports the transductive
learner in obtaining better classification performance than language-dependent
document models while using a relatively small portion of labeled data. Future work will
concentrate on the development of a richer conceptual document model that can
incorporate more types of information (e.g., relations among the synsets), and
on investigating hybrid solutions of transductive and active learning.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>Methods for cross-language plagiarism detection</article-title>
          .
          <source>Knowl.-Based Syst.</source>
          ,
          <volume>50</volume>
          :
          <fpage>211</fpage>
          -
          <lpage>217</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Paramita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Clough</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>A comparison of approaches for measuring cross-lingual similarity of Wikipedia articles</article-title>
          .
          <source>In ECIR</source>
          , pages
          <fpage>424</fpage>
          -
          <lpage>429</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>J. Assoc. Inf. Sci.</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Xiao</surname>
          </string-name>
          .
          <article-title>Transductive representation learning for cross-lingual text classification</article-title>
          .
          <source>In ICDM</source>
          , pages
          <fpage>888</fpage>
          -
          <lpage>893</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Robust multi-class transductive learning with graphs</article-title>
          .
          <source>In CVPR</source>
          , pages
          <fpage>381</fpage>
          -
          <lpage>388</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          .
          <article-title>BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network</article-title>
          .
          <source>Artif. Intell.</source>
          ,
          <volume>193</volume>
          :
          <fpage>217</fpage>
          -
          <lpage>250</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Romeo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ienco</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Tagarelli</surname>
          </string-name>
          .
          <article-title>Knowledge-based representation for transductive multilingual document classification</article-title>
          .
          <source>In ECIR</source>
          , pages
          <fpage>92</fpage>
          -
          <lpage>103</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>P.</given-names>
            <surname>Vossen</surname>
          </string-name>
          .
          <article-title>EuroWordNet: A Multilingual Database of Autonomous and Language-Specific Wordnets Connected via an Inter-Lingual Index</article-title>
          .
          <source>International Journal of Lexicography</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
          ):
          <fpage>161</fpage>
          -
          <lpage>173</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>