=Paper=
{{Paper
|id=Vol-2947/paper28
|storemode=property
|title=Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification
|pdfUrl=https://ceur-ws.org/Vol-2947/paper28.pdf
|volume=Vol-2947
|authors=Andrea Pedrotti,Fabrizio Sebastiani,Alejandro Moreo
|dblpUrl=https://dblp.org/rec/conf/iir/Pedrotti0M21
}}
==Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification==
Discussion Paper

Alejandro Moreo¹, Andrea Pedrotti¹,² and Fabrizio Sebastiani¹

¹ Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, 56124 Pisa, Italy
² Dipartimento di Informatica, Università di Pisa, 56127 Pisa, Italy

Abstract

Funnelling (Fun) is a method for cross-lingual text classification (CLTC) based on a two-tier learning ensemble for heterogeneous transfer learning (HTL). In this ensemble method, 1st-tier classifiers, each working on a different, language-dependent feature space, return a vector of calibrated posterior probabilities (with one dimension for each class) for each document, and the final classification decision is taken by a metaclassifier that uses this vector as its input. In this paper we describe Generalized Funnelling (gFun), a generalization of Fun consisting of an HTL architecture in which 1st-tier components can be arbitrary view-generating functions, i.e., language-dependent functions that each produce a language-independent representation ("view") of the document. We describe an instance of gFun in which the metaclassifier receives as input a vector of calibrated posterior probabilities (as in Fun) aggregated with other embedded representations that embody other types of correlations. We describe preliminary results that we have obtained on a large standard dataset for multilingual multilabel text classification.

Keywords: Transfer Learning, Cross-Lingual Text Classification, Ensemble Learning, Word Embeddings

1. Introduction

According to [1], the amount of (labelled and unlabelled) resources for the more than 7,000 languages spoken around the world follows (somewhat unsurprisingly) a power-law distribution. That is, while a small set of languages accounts for most of the available data, a very long tail of other languages suffers from data scarcity, despite the fact that many languages belonging to this long tail have large speaker bases. Bearing in mind that most of the languages in the world are low-resource, it is appealing to develop methods and techniques capable of exploiting the high-quality resources available for the few resource-rich languages, in order to improve the performance on tasks carried out on the resource-poor languages. Cross-Lingual Transfer Learning (CLTL) is a class of machine learning tasks in which, given a training set of textual labelled data sampled from one or more source languages, we must issue predictions for unlabelled documents written in one or more target languages. In other words, the goal of CLTL is to transfer (i.e., reuse) the knowledge that has been obtained from the training data in the source languages to the target languages of interest, for which few labelled data (or no labelled data at all) exist.
Cross-Lingual Text Classification (CLTC) is a specific instance of CLTL in which classification is the task to be carried out. In CLTC, documents are written in one of a finite set ℒ = {λ1, ..., λ|ℒ|} of languages, and are labelled according to a shared codeframe (a.k.a. classification scheme) 𝒴 = {y1, ..., y|𝒴|}. In such a scenario, it is common to have different numbers of training documents for the different languages, with the languages with fewer training documents usually also being the ones with fewer (if any) available external resources (such as bilingual dictionaries, thesauri, pre-trained sets of word embeddings, or language models) that could otherwise be leveraged for this task.

Funnelling (Fun) [2] is an ensemble learning architecture for CLTC especially designed to learn from heterogeneous sources of data and to effectively transfer information from one language to another. In other words, Fun operates in an all-to-all fashion, since all training languages contribute to the classification of documents in all the other languages while, at the same time, every language benefits from the training data available for the other languages. In this work we expand on this architecture by injecting new, heterogeneous sources of information into the algorithm.

2. Funnelling and Generalized Funnelling

Fun is a two-tier architecture [2] in which the first tier takes care of translating documents from their original, language-dependent feature spaces into a language-independent one. Subsequently, the second tier operates on the newly encoded documents and outputs the final prediction scores. The main intuition behind Fun is to leverage the fact that all documents are classified according to the same set of labels: documents, regardless of the language they are written in, can be represented as vectors of posterior probabilities, i.e., vectors encoding, at each dimension i, the probability that a given document belongs to the respective class yi. Once all documents have been homogenized in this way (i.e., once they are all represented as vectors of posterior probabilities), they can be stacked vertically and fed to the second tier (the metaclassifier), regardless of the language they were originally written in.
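As a minimal sketch of this two-tier pipeline, the following code assumes (as one plausible instantiation, not necessarily the exact one used in [2]) TF-IDF vectors as the language-dependent feature spaces and linear SVMs with probability calibration as base learners; all class, function, and variable names are illustrative, and edge cases (e.g., classes with too few positive examples for calibration) are ignored.

<syntaxhighlight lang="python">
# Minimal sketch of the two-tier Fun pipeline described above (illustrative only).
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC


def calibrated_svm():
    # one-vs-rest wrapper so that multilabel codeframes are supported
    return OneVsRestClassifier(CalibratedClassifierCV(LinearSVC()))


class Funnelling:
    def fit(self, docs_by_lang, labels_by_lang):
        self.vectorizers, self.first_tier = {}, {}
        posteriors, labels = [], []
        # 1st tier: one language-dependent classifier per language, producing
        # calibrated posterior probabilities (one dimension per class)
        for lang, docs in docs_by_lang.items():
            y = labels_by_lang[lang]            # (n_docs, |Y|) indicator matrix
            vec = TfidfVectorizer(sublinear_tf=True)
            X = vec.fit_transform(docs)
            clf = calibrated_svm().fit(X, y)
            self.vectorizers[lang], self.first_tier[lang] = vec, clf
            # posteriors for the training documents (here produced by the same
            # classifiers fitted on them; [2] also discusses more refined ways
            # of obtaining the training posteriors)
            posteriors.append(clf.predict_proba(X))
            labels.append(y)
        # 2nd tier: a single metaclassifier trained on the vertically stacked,
        # language-independent posterior vectors of all languages
        self.meta = calibrated_svm().fit(np.vstack(posteriors), np.vstack(labels))
        return self

    def predict(self, docs, lang):
        X = self.vectorizers[lang].transform(docs)
        return self.meta.predict(self.first_tier[lang].predict_proba(X))
</syntaxhighlight>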
We generalize this architecture, and call the result Generalized Funnelling (gFun). The first tier of Fun is redesigned in order to accommodate a set Ψ of view-generating functions (VGFs) that can expand the shared vector space on which the metaclassifier operates. VGFs are language-dependent functions that map documents into language-independent vectorial representations ("views") aligned across languages. Since each view is aligned across languages, it is easy to aggregate (e.g., by concatenation) the different views into a single representation, itself aligned across languages, that is then given as input to the metaclassifier. Notice that, according to this definition, the original implementation of Fun can be seen as a specific instance of gFun equipped with one single VGF. The key idea is to leverage the VGFs in order to inject into the model information about the different correlations that hold between the main elements of a text classification task (documents, classes, and words). In this work we consider the following kinds of correlations: class-class, document-class, word-class, word-word, and document-word correlations.

We bring these correlations to bear by means of the following VGFs:

* the Posteriors VGF (encoding document-class correlations): it maps documents into the space defined by calibrated posterior probabilities (as in the original Fun);
* the MUSEs VGF (encoding word-word correlations): it uses the Multilingual Unsupervised/Supervised Embeddings (MUSEs) made available by the authors of [3], a set of word embeddings aligned across 30 languages;
* the WCEs VGF (encoding word-class correlations): it uses Word-Class Embeddings (WCEs) [4], a form of supervised word embeddings, based on the class-conditional distributions observed in the training set, that is natively aligned across languages;
* the BERT VGF (encoding document-word correlations): it uses the contextualized word embeddings generated by multilingual BERT [5], a deep pretrained language model based on the transformer architecture.

The different views produced by the VGFs need to be aggregated before being passed to the metaclassifier. In this work we propose to average the different views (in preliminary work, we have observed experimentally that averaging tends to produce better results than simply concatenating them). Before averaging the representations, we must ensure that all views have the same dimensionality and that they are aligned, i.e., that the semantics of each dimension (whatever it may be) is common to all views. In order to do so, we learn additional mappings of the views to the space of calibrated posterior probabilities: for each VGF (other than the Posteriors VGF, which already returns vectors of |𝒴| calibrated posterior probabilities) we train a classifier that maps the view of a document into a vector of |𝒴| calibrated posterior probabilities.

Finally, we have found that applying some routine normalization techniques consistently increases the performance of gFun. This normalization consists of imposing unit L2-norm on the vectors computed by the view generators, removing the first principal component of the document embeddings obtained via WCEs or MUSEs [6], and standardizing the columns of the shared space before passing the vectors to the metaclassifier. (Standardizing, a.k.a. "z-scoring" or "z-transforming", consists of translating and scaling a random variable x, with mean μ and standard deviation σ, as z = (x − μ) / σ, so that the new variable z has zero mean and unit variance; since μ and σ are unknown, they are estimated on the training set.)
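As an illustration, here is a rough sketch of the normalization and aggregation steps just described, assuming that each view has already been mapped to an (n_docs × |𝒴|) matrix of calibrated posterior probabilities; in gFun, the removal of the first principal component applies to the raw MUSE/WCE document embeddings before that mapping. The function names and the exact ordering of the steps are illustrative and are not taken from the gFun implementation.

<syntaxhighlight lang="python">
# Hedged sketch of the normalization/aggregation pipeline: unit L2-norm, removal
# of the first principal component (for MUSE/WCE document embeddings, cf. [6]),
# averaging of the aligned views, and column-wise standardization.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler, normalize


def remove_first_principal_component(X, svd=None):
    """Subtract the projection of X onto its first principal direction (cf. [6])."""
    if svd is None:                          # the direction is estimated on training data
        svd = TruncatedSVD(n_components=1).fit(X)
    pc = svd.components_                     # shape: (1, dim)
    return X - (X @ pc.T) @ pc, svd


def aggregate_views(views, scaler=None):
    """L2-normalize each aligned view, average them, and z-score the shared space."""
    views = [normalize(V, norm='l2') for V in views]   # unit L2-norm per document
    Z = np.mean(views, axis=0)                         # views must share dimensionality |Y|
    if scaler is None:                                 # mean/std estimated on training data
        scaler = StandardScaler().fit(Z)
    return scaler.transform(Z), scaler
</syntaxhighlight>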
3. Experiments

In order to maximize comparability with previous results, we adopt an experimental setup identical to the one used in [2], including the evaluation metrics, i.e., the F1 score and K, in both their micro-averaged (μ) and macro-averaged (M) versions.
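For concreteness, the following minimal sketch shows how the macro- and micro-averaged F1 can be computed for multilabel predictions using scikit-learn; the toy matrices are purely illustrative, and the K measure (whose definition we take from [2]) is computed analogously from the same per-class contingency tables.

<syntaxhighlight lang="python">
# Hedged sketch of the evaluation: macro- and micro-averaged F1 over a multilabel
# prediction matrix (rows = test documents, columns = the |Y| classes).
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 1]])   # toy ground truth (3 docs, 3 classes)
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]])   # toy predictions

macro_f1 = f1_score(y_true, y_pred, average='macro')   # F1 per class, then averaged
micro_f1 = f1_score(y_true, y_pred, average='micro')   # F1 on the pooled contingency table
print(f'F1^M={macro_f1:.3f}  F1^mu={micro_f1:.3f}')
</syntaxhighlight>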
We carry out experiments on JRC-Acquis, a parallel corpus of legislative texts published by the European Union, covering 11 different languages. We retain the 300 most frequent target classes and use the same splits as in [2]. (We have also validated our method on RCV1/2, but we leave the discussion of that dataset out of this short paper for the sake of brevity.) In Table 1, we compare our results with the naïve solution (i.e., one monolingual classifier for each language), with Fun, and with multilingual BERT (mBERT). We group the gFun results in three batches: the first one reports the results obtained by deploying one single VGF at a time; the second one reports the results obtained by combining multiple view generators; the last one reports the results obtained by deploying all the proposed VGFs jointly. We use the notation -X to refer to the Posteriors VGF, while -M denotes the MUSEs VGF, -W the WCEs VGF, and -B the BERT VGF.

The superior results of gFun-X with respect to Fun indicate that the normalization steps are beneficial. It is noteworthy that, by simply leveraging the class-class correlations (brought to bear by the metaclassifier), gFun-B outperforms its counterpart mBERT. The best results are obtained by the combination of the Posteriors, MUSEs, and BERT VGFs.

Table 1: CLTC results on the JRC-Acquis dataset. Each cell reports the mean value and the standard deviation across the 10 runs. Boldface indicates the best method; superscripts † and †† denote the methods (if any) whose scores are not statistically significantly different from the best one.

{| class="wikitable"
! Method !! F1<sup>M</sup> !! F1<sup>μ</sup> !! K<sup>M</sup> !! K<sup>μ</sup>
|-
| Naïve || .340 ± .017 || .559 ± .012 || .288 ± .016 || .429 ± .015
|-
| Fun [2] || .399 ± .013 || .587 ± .009 || .365 ± .014 || .490 ± .013
|-
| mBERT [5] || .420 ± .023 || .608 ± .016 || .379 ± .006 || .507 ± .009
|-
| gFun-X || .432 ± .015 || .587 ± .010 || .441 ± .016 || .553 ± .013
|-
| gFun-M || .440 ± .039 || .586 ± .032 || .442 ± .045 || .549 ± .034
|-
| gFun-W || .410 ± .016 || .553 ± .014 || .410 ± .021 || .525 ± .022
|-
| gFun-B || .501 ± .023 || .627 ± .016 || .485 ± .023 || .574 ± .019
|-
| gFun-XB || .510 ± .017 || .637 ± .012 || .512 ± .020† || .603 ± .016†
|-
| gFun-XMB || '''.525 ± .020''' || '''.649 ± .014''' || '''.528 ± .023''' || '''.620 ± .017'''
|-
| gFun-XWB || .497 ± .011 || .621 ± .008 || .508 ± .011 || .606 ± .010
|-
| gFun-XMW || .475 ± .012 || .604 ± .010 || .489 ± .014 || .593 ± .011
|-
| gFun-WMB || .513 ± .016 || .632 ± .011 || .522 ± .017†† || .619 ± .013††
|-
| gFun-XWMB || .514 ± .014 || .635 ± .010 || .521 ± .015† || .618 ± .011††
|-
| UPPERBOUND || .599 || .707 || .547 || .632
|}

4. Conclusions

In this paper we have proposed Generalized Funnelling (gFun), a revised variant of Fun [2] that allows a set of view-generating functions (VGFs) to provide the metaclassifier with different views of the same document, each embodying a different type of correlation in the data. We have explored views leveraging the multilingual unsupervised/supervised embeddings (MUSEs) [3], word-class embeddings (WCEs) [4], and the contextualized embeddings of multilingual BERT [5]. The results confirm that injecting heterogeneous information into the process, in the form of different types of embeddings aligned across languages, improves performance in CLTL.

References

[1] P. Joshi, S. Santy, A. Budhiraja, K. Bali, M. Choudhury, The state and fate of linguistic diversity and inclusion in the NLP world, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), 2020, pp. 6282–6293. doi:10.18653/v1/2020.acl-main.560.
[2] A. Esuli, A. Moreo, F. Sebastiani, Funnelling: A new ensemble method for heterogeneous transfer learning and its application to cross-lingual text classification, ACM Transactions on Information Systems 37 (2019) Article 37. doi:10.1145/3326065.
[3] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, H. Jégou, Word translation without parallel data, in: Proceedings of the 6th International Conference on Learning Representations (ICLR 2018), Vancouver, CA, 2018.
[4] A. Moreo, A. Esuli, F. Sebastiani, Word-class embeddings for multiclass text classification, Data Mining and Knowledge Discovery 35 (2021) 911–963. doi:10.1007/s10618-020-00735-3.
[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL 2019), Minneapolis, US, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[6] S. Arora, Y. Liang, T. Ma, A simple but tough-to-beat baseline for sentence embeddings, in: Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, FR, 2017.