=Paper=
{{Paper
|id=Vol-1653/paper_5
|storemode=property
|title=Transductive Distributional Correspondence Indexing for Cross-Domain Topic Classification
|pdfUrl=https://ceur-ws.org/Vol-1653/paper_5.pdf
|volume=Vol-1653
|authors=Alejandro Moreo Fernández,Andrea Esuli,Fabrizio Sebastiani
|dblpUrl=https://dblp.org/rec/conf/iir/FernandezE016
}}
==Transductive Distributional Correspondence Indexing for Cross-Domain Topic Classification==
Transductive Distributional Correspondence Indexing for Cross-Domain Topic Classification

Alejandro Moreo Fernández¹, Andrea Esuli¹, and Fabrizio Sebastiani²

¹ Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche, 56124 Pisa, IT
alejandro.moreo@isti.cnr.it, andrea.esuli@isti.cnr.it

² Qatar Computing Research Institute, Qatar Foundation, PO Box 5825, Doha, QA
fsebastiani@qf.org.qa

Abstract. Obtaining high-quality annotated data for training a classifier for a new domain is often costly. Domain Adaptation (DA) aims at leveraging the annotated data available from a different but related source domain in order to deploy a classification model for the target domain of interest, thus alleviating the aforementioned costs. To that aim, the learning model is typically given access to a set of unlabelled documents collected from the target domain. These documents might consist of a representative sample of the target distribution, and they could thus be used to infer a general classification model for the domain (inductive inference). Alternatively, these documents could be the entire set of documents to be classified; this happens when there is only one set of documents we are interested in classifying (transductive inference). Many of the DA methods proposed so far have focused on transductive classification by topic, i.e., the task of assigning class labels to a specific set of documents based on the topics they are about. In this work, we report on new experiments we have conducted in transductive classification by topic using the Distributional Correspondence Indexing (DCI) method, a DA method we recently developed that delivered state-of-the-art results in inductive classification by sentiment. The results we have obtained on three popular datasets show DCI to be competitive with the state of the art also in this scenario, and to be superior to all compared methods in many cases.
Keywords: Transduction, Cross-Domain Adaptation, Topic Classification, Distributional Hypothesis

(Fabrizio Sebastiani is on leave from Consiglio Nazionale delle Ricerche, Italy.)

1 Introduction

As a supervised task, automatic Text Classification (TC) is constrained by the availability of high-quality corpora of annotated documents with which to train a classifier that will then predict the classes of new documents about a given domain of knowledge. In the absence of any such labelled collection for the domain of interest, an additional cost, in money and time, must be incurred in order to collect and annotate the training examples. Domain Adaptation (DA) is a special case of Transfer Learning (TL) [13,14] applied to TC, aimed at reducing, or completely avoiding, such costs by leveraging a different, but related, source of knowledge for which a training corpus already exists. DA thus challenges one core assumption of machine learning, usually referred to as the iid assumption, according to which the training and test examples are assumed to be drawn from the same distribution. Traditionally, two different scenarios are considered in DA: (i) cross-domain adaptation [2], where the source and target domains differ in the topics they are about; and (ii) cross-lingual adaptation [15], where the source and target domains are expressed in different languages, although dealing with the same topics. This article focuses on cross-domain adaptation.

The transfer of knowledge is typically attempted by uncovering regularities in examples that are shared across domains. To that aim, a representative (unlabeled) sample from the target distribution is collected and passed to the inference method when learning the decision function.
Many of the approaches to cross-domain adaptation proposed so far, though, have considered this target sample to be, at the same time, the test set, i.e., the (only) set of documents one might be interested in classifying (see, e.g., [4,10,17,18,9,1]). This fact leads us to distinguish between inductive and transductive cross-domain approaches, depending on the type of inference they carry out³. Accordingly, inductive cross-domain approaches might be viewed as those aiming to deploy a classification model that generalizes adequately on the target domain, whereas transductive cross-domain approaches are only required to deliver an accurate classification of the target set at one's disposal [16].

A general trend one can observe in the literature on cross-domain adaptation is that the vast majority of the inductive approaches proposed so far have been dedicated to sentiment classification (namely, assigning positive or negative labels to opinion-laden texts), while most of the transductive approaches have instead been tested on topic classification⁴. Be that as it may, two well-differentiated families of techniques for cross-domain adaptation exist, and it remains unclear how much effort it would entail to port a method from one of these groups (say, an inductive one) to the configuration of the other group (say, to the transductive setting), or to more general TL configurations (e.g., when the source and target tasks are different).

³ This distinction is surprisingly overlooked in the related literature, though. This is probably due to the terminology Pan & Yang used in their popular survey [13], where they categorized as transductive all TL approaches in which the source and target tasks are the same but the source and target domains are different, while the term inductive was instead given the opposite meaning, i.e., when the domains are the same but the source and target tasks differ.
⁴ This seemingly deliberate partition might rather respond to the characteristics of the most popular benchmark collections available for each problem.

This paper is an extension of our former work in [5,11], where the Distributional Correspondence Indexing (DCI) method for cross-domain and cross-lingual adaptation was proposed. DCI creates word embeddings based on the distributional hypothesis (words with similar meanings tend to co-occur in similar contexts [6]), and recently delivered new state-of-the-art results for inductive classification by sentiment. We now put DCI to the test in a different problem setting, i.e., the transductive approach, and report new experiments on a different task, i.e., cross-domain classification by topic. Results confirm that our Transductive DCI (hereafter TDCI) behaves robustly also in this scenario, delivering classification accuracies comparable, and in many cases superior, to those of state-of-the-art methods, while remaining computationally cheap.

The rest of this paper is organized as follows. Section 2 offers a brief overview of related work. In Section 3 we describe our proposal. Section 4 reports the results of the experiments we have conducted, while Section 5 concludes.

2 Related work

In this section we briefly review the main related methods in the literature on domain adaptation. We restrict our attention to transductive approaches proposed for topic classification. The interested reader can check [11] for a discussion focusing on inductive methods for sentiment classification, and [13,14] for a more general overview of transfer learning methods.

Transductive Support Vector Machines (TSVMs) for text classification were proposed in [8] as an extension of Support Vector Machines (SVMs) aiming at minimizing the misclassification error on a concrete test set, assumed to be accessible when inducing the decision function.
Even though it was not specifically designed to deal with DA problems, it has often been reported as a baseline in the related literature. The Co-Clustering approach [4] uses clusters of words and documents as a bridge to propagate the class structure from the source domain to the target domain. The key idea is to use the class labels in the source domain as a constraint on the word clusters, which are shared between both domains. The Matrix Tri-factorization approach [18] follows a somewhat similar assumption, based on the belief that associations between word clusters and classes should remain consistent between the source and target domains. The method thus performs two matrix tri-factorizations, for the source and target domains, in a joint optimization framework subject to sharing the association between word clusters and classes. Topic-bridged Probabilistic Latent Semantic Analysis [17] is an extension of Probabilistic Latent Semantic Analysis (PLSA) that models the relations between (observed) documents and terms through a set of (hidden) latent features, hypothesizing those latent features to be consistent across domains. Along these lines, Topic Correlation Analysis [9] establishes a distinction between latent features that can be shared between domains and those that are rather domain-specific. A joint mixture model is first used to cluster word features into shared and domain-specific topics. Then, a mapping between the domain-specific topics of the two domains is induced from a correlation analysis, which serves to derive a shared feature space in which the transfer of supervised knowledge is facilitated.
Finally, the Cross Domain Spectral Classification approach [10] formulates the knowledge transfer through spectral classification, optimizing an objective function aimed at regularizing the supervised information contained in the source domain so as to bring about improved consistency with respect to the target domain structure. In [1] a probabilistic method based on Latent Dirichlet Allocation (LDA) is proposed. The method jointly optimizes the marginal and conditional distributions following an EM algorithm, while also differentiating between the domain-dependent and domain-independent latent features.

3 Transductive Distributional Correspondence Indexing

Loosely speaking, the main challenge one has to face in domain adaptation is the discrepancy in word relevance that comes about because of a word's particular role in the source domain, a role that does not generalize to the target domain. That is to say, the words most important for the source domain, on which the decision surface is likely to hinge, are likely not helpful enough in discriminating the positive and negative regions of the target domain. DCI builds upon (i) the concept of pivot terms [3], namely, frequent and discriminant words which are expected to behave in a similar way in the source and target domains; and (ii) the distributional hypothesis [6], which states that terms with similar meanings tend to co-occur in similar contexts. Our idea is to model each term as a word embedding in which each dimension quantifies the term's relative semantic similarity to a fixed set of pivots. The expectation is that words with equivalent roles across domains will end up lying close to each other in the new embedding space, as they are expected to present similar distributions with respect to the pivots in their respective knowledge domains.
Take as an example a classifier by genre (sci-fi, drama, horror, romantic, ...) that is trained with documents from a source domain of films but intended to classify documents from a target domain of books. Note that role equivalences between, e.g., 'director'-'writer', 'duration'-'length', or 'film'-'book' might be uncovered by inspecting their co-occurrence distributions with respect to pivots like 'plot', 'character', or 'story', which are expected to be approximately invariant across domains. As a result, the decision boundary found for the source domain will likely generalize well to the target domain. DCI is an instantiation of this model that implements a pivot selection strategy (Section 3.2) and quantifies the similarity of meaning of two words through a Distributional Correspondence Function (DCF, Section 3.3).

3.1 Preliminaries

Given a source (S) and a target (T) domain of documents, with different marginal distributions, for which a training set of annotated documents TrS exists exclusively for S, cross-domain classification by topic might be formalized as the task of assigning class labels C = {c1, ..., c|C|} to the target documents in a test set TeT by means of a classifier Φ trained on TrS which is also given access to a sample of (non-annotated) documents UT from T (and, optionally, to a sample US from S), where the classes in C represent predefined topics of discussion, such as, e.g., "politics", "economics", or "computers". We will here restrict our attention to the binary case C = {c, c̄}, that is, deciding whether a document discusses a given topic c or not. We will also adhere to the aforementioned "transductive setting", in which the sample of target documents given to Φ is also the unique set of documents we are interested in classifying, i.e., UT = TeT, and there is no sample US from the source collection other than the training set TrS.
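As a toy numerical illustration of the film/book example above (the corpora, the pivot list, and the `pivot_profile` helper are invented here for exposition, and are not part of the paper's method or experiments), one can compare the co-occurrence distributions of two role-equivalent terms against a shared set of pivots:

```python
import numpy as np

# Toy corpora: each document is a set of terms.
films = [{'director', 'plot', 'character'},
         {'director', 'plot', 'story'},
         {'duration', 'plot'}]
books = [{'writer', 'plot', 'character'},
         {'writer', 'plot', 'story'},
         {'length', 'plot'}]
pivots = ['plot', 'character', 'story']  # assumed approximately domain-invariant

def pivot_profile(term, docs):
    """Co-occurrence distribution of `term` against each pivot: the fraction
    of documents containing `term` that also contain the pivot."""
    with_term = [d for d in docs if term in d]
    return np.array([sum(p in d for d in with_term) / max(len(with_term), 1)
                     for p in pivots])

# 'director' (films) and 'writer' (books) play equivalent roles, so their
# profiles against the shared pivots come out identical in this toy example.
print(pivot_profile('director', films))  # [1.  0.5 0.5]
print(pivot_profile('writer', books))    # [1.  0.5 0.5]
```

Two terms with equivalent roles in their respective domains thus obtain near-identical profiles, which is the property DCI exploits to build a domain-invariant embedding space.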
3.2 Pivot Selection

According to [3], pivots are frequent and discriminant terms that behave similarly in both the source and target domains. Regarding frequency, and as was done in [15], we restrict the set of pivot candidates to those terms which occur in at least φ = 30 documents in both the source and target corpora. Following [2,15], we use the mutual information between the term and the classes {c, c̄} to assess the degree of discrimination of a given feature in the training set (i.e., exclusively in the source domain). Finally, we apply the cross-consistency heuristic defined in [11], which allows the model to be aware of the prevalence⁵ drift across the source and target domains.

3.3 Distributional Correspondence Functions

DCFs are a family of real-valued functions that quantify the deviation of the correspondence between two terms from the correspondence expected by chance. Different interpretations of correspondence can be plugged into the definition, leading to different implementations of the DCF. In this work, we restrict our attention to the cases in which correspondence is measured as the cosine similarity (Eq. 1), the Asymmetric Mutual Information (AMI, Eq. 2), the Pointwise Mutual Information (PMI, Eq. 3), and linear (Eq. 4), as discussed in [11]. The correspondence between two terms f^i and f^j in a given domain is measured by comparing their context distribution vectors \mathbf{f}^i and \mathbf{f}^j. Context distribution vectors are extracted from the co-occurrence matrix of the domain, and model how a term relates to a set of contexts (e.g., documents).

$$\mathrm{Cosine}(f^i, f^j) = \frac{\langle \mathbf{f}^i, \mathbf{f}^j \rangle}{\|\mathbf{f}^i\| \|\mathbf{f}^j\|} - \sqrt{p_i p_j} \qquad (1)$$

$$\mathrm{AMI}(f^i, f^j) = \rho(f^i, f^j) \sum_{x \in \{f^i, \bar{f}^i\}} \sum_{y \in \{f^j, \bar{f}^j\}} P(x, y) \log_2 \frac{P(x, y)}{P(x) P(y)} \qquad (2)$$

⁵ The prevalence of a term is typically defined as the proportion of documents of a corpus in which the term appears.
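The pivot selection procedure of Section 3.2 (a frequency filter followed by a mutual-information ranking) can be sketched as follows. This is only an illustrative implementation: the function name, the matrix layout, and the default number of pivots `m=100` are our assumptions, and the cross-consistency heuristic of [11] is omitted; the threshold `phi=30` is the one stated in the text.

```python
import numpy as np

def select_pivots(Xs, ys, Xt, m=100, phi=30):
    """Rank candidate pivots: keep terms that occur in at least `phi`
    documents of BOTH the source (Xs) and target (Xt) binary document-term
    matrices, then return the indices of the `m` surviving terms with the
    highest mutual information with the source-only binary labels `ys`."""
    df_s, df_t = Xs.sum(axis=0), Xt.sum(axis=0)          # document frequencies
    frequent = np.where((df_s >= phi) & (df_t >= phi))[0]

    def mutual_info(x, y):
        # MI between two binary variables, in bits.
        score = 0.0
        for xv in (0, 1):
            for yv in (0, 1):
                pxy = np.mean((x == xv) & (y == yv))
                if pxy > 0:
                    score += pxy * np.log2(pxy / (np.mean(x == xv) * np.mean(y == yv)))
        return score

    scores = np.array([mutual_info(Xs[:, j], ys) for j in frequent])
    return frequent[np.argsort(-scores)[:m]]
```

A term that appears in every document (high frequency, zero mutual information with the labels) survives the frequency filter but is ranked last, which matches the intuition that pivots must be discriminant as well as frequent.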
$$\mathrm{PMI}(f^i, f^j) = \log_2 \frac{P(f^i, f^j)}{P(f^i) P(f^j)} \qquad (3)$$

$$\mathrm{Linear}(f^i, f^j) = P(f^i \mid f^j) - P(f^i \mid \bar{f}^j) \qquad (4)$$

where p_i denotes the prevalence (the proportion of occurrences over the total number of contexts) of feature f^i, P(x) denotes the probability that feature x occurs in a random context, P(x̄) is the probability that x does not occur in a random context, and ρ(x, y) is a function that changes sign when x and y are negatively correlated⁶.

3.4 Word Embeddings and Document Representation

The feature representations of DCI might be thought of as a generalization of co-occurrence vectors (see, e.g., [12]), where the co-occurrence metric is any of the DCFs, and the context window is set to the document length. Once a set of m pivots P = {p1, p2, ..., pm} and a DCF η have been selected, each term f in the source and target domains is modeled as an m-dimensional vector

$$\vec{f} = (\eta(\mathbf{f}, \mathbf{p}_1), \eta(\mathbf{f}, \mathbf{p}_2), \ldots, \eta(\mathbf{f}, \mathbf{p}_m)) \qquad (5)$$

where \mathbf{f} and \mathbf{p}_i are the context distribution vectors of the term f and the i-th pivot, respectively. Note that, because we are operating in the transductive regime, the context distribution vectors \mathbf{f} and \mathbf{p}_i are taken from the co-occurrence matrix of the training set when modeling the source terms, and from the co-occurrence matrix of the test set when modeling the target terms⁷. Finally, training and test documents are indexed in the embedding space via a weighted sum of the word embeddings of the terms composing them. That is, document d_i is represented as the m-dimensional vector

$$\vec{d}_i = \sum_{f_j \in d_i} w_{ij} \cdot \vec{f}_j \qquad (6)$$

where w_ij is the weight of term f_j in document d_i (we used the standard cosine-normalized tfidf), and \vec{f}_j is the word embedding for term f_j. Once the training and test matrices have been represented in the embedding space, the classifier is learned. As the classifier we adopt the Transductive SVM [8], which also takes into account the structure of the test data while modeling the decision function.
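Equations 1, 5 and 6 can be combined into a short sketch. This is an illustrative implementation under the assumption that context distribution vectors are binary columns of a document-term matrix; the function names are ours, and only the cosine DCF of Eq. 1 is shown:

```python
import numpy as np

def cosine_dcf(f_i, f_j):
    """Cosine DCF (Eq. 1): the cosine of two binary context-distribution
    vectors minus sqrt(p_i * p_j), the correspondence expected by chance."""
    denom = np.linalg.norm(f_i) * np.linalg.norm(f_j)
    cos = (f_i @ f_j) / denom if denom > 0 else 0.0
    return cos - np.sqrt(f_i.mean() * f_j.mean())

def embed_terms(X, pivot_idx, dcf=cosine_dcf):
    """Eq. 5: each term (a column of the binary document-term matrix X)
    becomes an m-dimensional vector of DCF values against the m pivots."""
    return np.array([[dcf(X[:, j], X[:, p]) for p in pivot_idx]
                     for j in range(X.shape[1])])

def index_documents(W, F):
    """Eq. 6: each document is the weighted sum of its terms' embeddings.
    W is a documents-by-terms weight matrix (e.g. tfidf), F the term embeddings."""
    return W @ F
```

In the transductive regime, `embed_terms` would be called once on the training matrix to embed source terms and once on the test matrix to embed target terms, after which `index_documents` projects both document sets into the same m-dimensional space.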
We used the linear kernel, which has consistently delivered good accuracy in text classification so far [7].

⁶ That is, when the true positive rate plus the true negative rate, as obtained from the 4-cell contingency table of x and y, is lower than 1.

⁷ In this case, and differently from [5,11], we do not apply unification to the common features, because during preliminary tests we observed most of the features to appear simultaneously in the source and target domains, thus causing most of the words in the vocabulary to be unified. This contradicts the rationale behind the unification process, originally proposed to consolidate the representations of shared words across languages, such as proper nouns in cross-lingual adaptation.

4 Experiments

In this section, we report on the experiments we ran to test the effectiveness of our TDCI method in cross-domain topic classification. As the evaluation measure we adopt standard accuracy, i.e., the ratio between the number of correctly labeled documents and the total number of documents submitted to the classifier, i.e.,

$$\mathrm{Acc} = \frac{TP + TN}{TP + FP + FN + TN} \qquad (7)$$

where TP, TN, FP, and FN stand for the numbers of true positives, true negatives, false positives, and false negatives, respectively. Note this choice is perfectly valid given that all datasets are approximately balanced with respect to the positive and negative classes.

In order to improve reproducibility and to facilitate a comparison of performance with other methods, we consider the benchmarks most commonly used in the related literature, including the Reuters-21578, SRAA, and 20 Newsgroups collections. Aside from being well-known benchmark collections in the realm of topic classification, their class codes are organized hierarchically, and some representative subsets can thus be taken in order to generate new benchmarks that are well-suited to domain adaptation as well⁸.

Reuters-21578: is one of the most used collections in TC research.
Reuters-21578 is a set of 21,578 news stories that appeared on the Reuters newswire in 1987. Documents in the collection are assigned to 5 top classes, among which the orgs, people, and places classes have commonly been selected in other works for experimenting on domain adaptation, leading to three datasets: orgs vs people, orgs vs places, and people vs places; a preprocessed version can be found in⁹.

SRAA: consists of 73,218 Usenet posts about simulated autos, simulated aviation, real autos, and real aviation, accessible in¹⁰. In this dataset, the pairs of classes real vs simulated and auto vs aviation have been used to instantiate two different domain adaptation problems. For example, in real vs simulated, documents about aviation were used as the source domain, while documents about autos constitute the target domain; the binary decision problem thus consists in discerning between the real and simulated topics. In a similar vein, auto vs aviation is created, where documents about simulated vehicles act as source domain examples, and documents about real vehicles as the target ones.

⁸ This procedure consists in taking two top classes, say, A and B, with subclasses A.1 ... A.x and B.1 ... B.y, respectively. Then, two disjoint folds are taken for the source (S) and target (T) sides in each class; e.g., AS = ∪1≤i