=Paper= {{Paper |id=Vol-1653/paper_5 |storemode=property |title=Transductive Distributional Correspondence Indexing for Cross-Domain Topic Classification |pdfUrl=https://ceur-ws.org/Vol-1653/paper_5.pdf |volume=Vol-1653 |authors=Alejandro Moreo Fernández,Andrea Esuli,Fabrizio Sebastiani |dblpUrl=https://dblp.org/rec/conf/iir/FernandezE016 }} ==Transductive Distributional Correspondence Indexing for Cross-Domain Topic Classification== https://ceur-ws.org/Vol-1653/paper_5.pdf
  Transductive Distributional Correspondence
 Indexing for Cross-Domain Topic Classification

      Alejandro Moreo Fernández1 , Andrea Esuli1 , and Fabrizio Sebastiani2
                   1
                    Istituto di Scienza e Tecnologie dell’Informazione
                   Consiglio Nazionale delle Ricerche, 56124 Pisa, IT
                            alejandro.moreo@isti.cnr.it
                              andrea.esuli@isti.cnr.it
                         2
                           Qatar Computing Research Institute
                     Qatar Foundation, PO Box 5825, Doha, QA ?
                                fsebastiani@qf.org.qa



        Abstract. Obtaining high-quality annotated data for training a clas-
        sifier for a new domain is often costly. Domain Adaptation (DA) aims
        at leveraging the annotated data available from a different but related
        source domain in order to deploy a classification model for the target
        domain of interest, thus alleviating the aforementioned costs. To that
        aim, the learning model is typically given access to a set of unlabelled
        documents collected from the target domain. These documents might
        consist of a representative sample of the target distribution, and they
        could thus be used to infer a general classification model for the domain
        (inductive inference). Alternatively, these documents could be the entire
        set of documents to be classified; this happens when there is only one
        set of documents we are interested in classifying (transductive inference).
        Many of the DA methods proposed so far have focused on transductive
        classification by topic, i.e., the task of assigning class labels to a specific
        set of documents based on the topics they are about. In this work, we
        report on new experiments we have conducted in transductive classifi-
        cation by topic using Distributional Correspondence Indexing method,
        a DA method we have recently developed that delivered state-of-the-art
        results in inductive classification by sentiment. The results we have ob-
        tained on three popular datasets show DCI to be competitive with the
        state of the art also in this scenario, and to be superior to all compared
        methods in many cases.


Keywords: Transduction, Cross-Domain Adaptation, Topic Classification, Dis-
tributional Hypothesis


1     Introduction
As a supervised task, automatic Text Classification (TC) is constrained by the
availability of high-quality corpora of annotated documents with which to train a classifier
? Fabrizio Sebastiani is on leave from Consiglio Nazionale delle Ricerche, Italy.

that will then predict the classes of new documents about a given domain of
knowledge. In the absence of any such labelled collection for the domain of
interest, an additional cost, both economic and in terms of time, must be incurred
in order to collect and annotate the training examples.
    Domain Adaptation (DA) is a special case of Transfer Learning (TL) [13,14]
applied to TC, aimed at reducing, or completely avoiding, such costs by leveraging
a different, but related, source of knowledge for which a training corpus
already exists. DA thus challenges one core assumption of machine learning,
usually referred to as the iid assumption, according to which the training and
test examples are assumed to be drawn from the same distribution. Traditionally,
two different scenarios are considered in DA: (i) cross-domain adaptation [2],
where the source and target domains differ in the topics they are about; and (ii)
cross-lingual adaptation [15], where the source and target domains are expressed
in different languages, although dealing with the same topics. This article focuses
on cross-domain adaptation.
    The transfer of knowledge is typically attempted by uncovering regularities
in examples that are shared across domains. To that aim, a representative
(unlabeled) sample from the target distribution is collected and passed to the
inference method when learning the decision function. Many of the approaches
to cross-domain adaptation proposed so far, though, have considered this target
sample to be, at the same time, the test set, i.e., the (only) set of documents
one might be interested in classifying (see, e.g., [4,10,17,18,9,1]). This fact leads
us to distinguish between inductive and transductive cross-domain approaches, depending
on the type of inference they carry out3. Accordingly, inductive cross-domain
approaches might be viewed as those aiming to deploy a classification
model that generalizes adequately to the target domain, whereas transductive
cross-domain approaches are only required to deliver an accurate classification
of the target set at one’s disposal [16].
    A general trend one can observe in the literature on cross-domain
adaptation is that the vast majority of the inductive approaches proposed so
far have been devoted to sentiment classification (namely, assigning positive or
negative labels to opinion-laden texts), while most of the transductive approaches
have instead been tested on topic classification4. Be that as it may, two well-
differentiated groups of techniques for cross-domain adaptation exist, and it
remains unclear how much effort porting a method from one of these groups (say,
an inductive one) to the configuration of the other (say, the transductive
setting), or to more general TL configurations (e.g., when the source and target
tasks are different), would entail.
3
  This distinction is surprisingly overlooked in the related literature, though. This is
  probably due to the terminology Pan & Yang used in their popular survey [13],
  where they categorized as transductive all TL approaches in which the source and
  target tasks are the same but the source and target domains are different, while the
  term inductive was instead attributed the opposite meaning, i.e., when the domains
  are the same but the source and target tasks differ.
4
  This seemingly deliberate partition might rather reflect the characteristics of
  the most popular benchmark collections available for each problem.

    This paper is an extension of our previous work [5,11], where the Distributional
Correspondence Indexing (DCI) method for cross-domain and cross-lingual
adaptation was proposed. DCI creates word embeddings based on the
distributional hypothesis (words with similar meanings tend to co-occur in similar
contexts [6]), and recently delivered new state-of-the-art results for inductive
classification by sentiment. We now put DCI to the test in a different problem
setting, i.e., the transductive approach, and report new experiments on a different
task, i.e., cross-domain classification by topic. Results confirm that our
Transductive DCI (hereafter TDCI) behaves robustly also in this scenario,
delivering classification accuracies comparable, and in many cases superior,
to those of state-of-the-art methods, while remaining computationally cheap.
    The rest of this paper is organized as follows. Section 2 offers a brief overview
of related work. In Section 3 we describe our proposal. Section 4 reports the
results of the experiments we have conducted, while Section 5 concludes.


2   Related work

In this section we briefly review the main related methods in the domain
adaptation literature, restricting our attention to transductive approaches proposed
for topic classification. The interested reader may consult [11] for a discussion
focusing on inductive methods for sentiment classification, and [13,14] for a more
general overview of transfer learning methods.
    Transductive Support Vector Machines (TSVMs) for text classification were
proposed in [8] as an extension of Support Vector Machines (SVMs) aimed at
minimizing the misclassification error on a concrete test set, assumed to be accessible
when inducing the decision function. Even though TSVMs were not specifically
designed to deal with DA problems, they have often been reported as a baseline in
the related literature. The Co-Clustering approach [4] uses clusters of words and
documents as a bridge to propagate the class structure from the source domain
to the target domain. The key idea is to use the class labels in the source domain
as a constraint on the word clusters, which are shared between both domains.
The Matrix Tri-factorization [18] approach follows a somewhat similar assump-
tion, based on the belief that associations between word clusters and classes
should remain consistent between the source and target domain. The method
thus performs two matrix tri-factorizations, for the source and target domains,
in a joint optimization framework subject to sharing the association between
word clusters and classes. Topic-bridged Probabilistic Latent Semantic Analysis [17]
is an extension of Probabilistic Latent Semantic Analysis (PLSA) that
models the relations between (observed) documents and terms through a set
of (hidden) latent features, hypothesizing those latent features to be consistent
across domains. Along these lines, Topic Correlation Analysis [9] establishes a
distinction between latent features that could be shared between domains, and
those that are rather domain specific. A joint mixture model is first used to
cluster word features into shared and domain-specific topics. Then, a mapping
between the domain-specific topics of the two domains is induced from a correlation
analysis, which serves to derive a shared feature space in which the transfer
of supervised knowledge is facilitated. Finally, the Cross-Domain Spectral
Classification [10] approach formulates knowledge transfer through spectral
classification, optimizing an objective function aimed at regularizing the supervised
information contained in the source domain so as to improve consistency
with the target domain structure. In [1] a probabilistic method based on Latent
Dirichlet Allocation (LDA) is proposed. The method jointly optimizes the
marginal and conditional distributions following an EM algorithm, while also
differentiating between the domain-dependent and domain-independent latent
features.


3     Transductive Distributional Correspondence Indexing

Loosely speaking, the main challenge one has to face in domain adaptation is
dealing with the discrepancy in word relevance that arises from a word's particular
role in the source domain, which does not generalize to the target domain.
That is to say, the words most important for the source domain, on which the
decision surface is likely to hinge, are likely not helpful enough in discriminating
between the positive and negative regions of the target domain.
    DCI builds upon (i) the concept of pivot terms [3], namely, frequent and
discriminative words which are expected to behave in a similar way in the source
and target domains; and (ii) the distributional hypothesis [6], which states that
terms with similar meanings tend to co-occur in similar contexts. Our idea is to
model each term as a word embedding in which each dimension quantifies its
relative semantic similarity to a fixed set of pivots. The expectation is that words
with equivalent roles across domains end up lying close to each other in the new
embedding space, as they are expected to exhibit similar distributions with
respect to the pivots in their respective domains. Take as an example a classifier by
genre (sci-fi, drama, horror, romantic, ...) that is trained with documents from a
source domain of films, but intended to classify documents from a target domain
of books. Role equivalences between, e.g., ‘director’-‘writer’, ‘duration’-
‘length’, or ‘film’-‘book’ might be uncovered by inspecting their co-occurrence
distributions with respect to pivots like ‘plot’, ‘character’, or ‘story’, which are
expected to be approximately invariant across domains. As a result, the decision
boundary found for the source domain will likely generalize well to the target
domain. DCI is an instantiation of this model that implements a pivot selection
strategy (Section 3.2) and quantifies the similarity of meaning of two words
through a Distributional Correspondence Function (DCF, Section 3.3).


3.1   Preliminaries

Given a source (S) and a target (T) domain of documents, with different marginal
distributions, for which a training set of annotated documents TrS exists exclusively
for S, cross-domain classification by topic might be formalized as the task
of assigning class labels C = {c_1, ..., c_{|C|}} to target documents in a test set TeT
by means of a classifier Φ trained on TrS, which is also given access to a sample
of (non-annotated) documents UT from T (and, optionally, to a sample US from
S), where the classes in C represent predefined topics of discussion, such as, e.g.,
“politics”, “economics”, or “computers”.
    We will here restrict our attention to the binary case C = {c, c̄}, that is,
deciding whether a document discusses a given topic c or not. We will also
adhere to the aforementioned “transductive setting”, in which the sample of
target documents given to Φ is also the unique set of documents we are
interested in classifying, i.e., UT = TeT, and there is no sample
US from the source collection other than the training set TrS.

3.2     Pivot Selection
According to [3], pivots are frequent and discriminative terms that behave similarly
in both the source and target domains. Regarding frequency, and as was done in
[15], we restrict the set of pivot candidates to those which occur in at least
φ = 30 documents in both the source and target corpora. Following [2,15], we use the
mutual information between a term and the classes {c, c̄} to assess the degree
of discrimination of a given feature in the training set (i.e., exclusively in the
source domain). Finally, we apply the cross-consistency heuristic defined in [11],
which allows the model to be aware of the prevalence5 drift across the source
and target domains.
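The frequency-plus-discriminativeness criterion above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the cross-consistency heuristic of [11] is omitted, and all function names and the `num_pivots` default are ours.

```python
import math

def term_class_mi(tp, fp, fn, tn):
    """Mutual information (in bits) between a binary term-occurrence
    variable and a binary class variable, computed from the 4-cell
    contingency counts on the (labelled) source training set."""
    n = tp + fp + fn + tn
    mi = 0.0
    # (joint count, term-marginal count, class-marginal count) per cell
    for cell, term_m, class_m in ((tp, tp + fp, tp + fn),
                                  (fp, tp + fp, fp + tn),
                                  (fn, fn + tn, tp + fn),
                                  (tn, fn + tn, fp + tn)):
        if cell > 0:
            mi += (cell / n) * math.log2(cell * n / (term_m * class_m))
    return mi

def select_pivots(df_source, df_target, mi_scores, min_df=30, num_pivots=100):
    """Keep terms occurring in at least min_df documents of BOTH corpora,
    then rank by mutual information computed on the source labels."""
    candidates = [t for t in mi_scores
                  if df_source.get(t, 0) >= min_df and df_target.get(t, 0) >= min_df]
    return sorted(candidates, key=mi_scores.get, reverse=True)[:num_pivots]
```

Here `df_source` and `df_target` are document-frequency dictionaries for the two corpora, and `mi_scores` maps each candidate term to its MI score on the training set.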

3.3     Distributional Correspondence Functions
DCFs are a family of real-valued functions that quantify the deviation of cor-
respondence between two terms with respect to the expected correspondence
due to chance. Different interpretations of correspondence can be plugged into
the definition, leading to different implementations of the DCF. In this work, we
restrict our attention to the cases in which correspondence is measured by
cosine similarity (Eq. 1), Asymmetric Mutual Information (AMI, Eq. 2),
Pointwise Mutual Information (PMI, Eq. 3), and a linear function (Eq. 4), as
discussed in [11].
    Correspondence between two terms f i and f j in a given domain is measured
by comparing their context distribution vectors f i and f j . Context distribution
vectors are extracted from the co-occurrence matrix of the domain, and model
how a term relates to a set of contexts (e.g., documents).

    Cosine(f^i, f^j) = \frac{\langle \mathbf{f}^i, \mathbf{f}^j \rangle}{\|\mathbf{f}^i\| \, \|\mathbf{f}^j\|} - \sqrt{p_i p_j}        (1)

    AMI(f^i, f^j) = \rho(f^i, f^j) \sum_{x \in \{f^i, \bar{f}^i\}} \sum_{y \in \{f^j, \bar{f}^j\}} P(x, y) \log_2 \frac{P(x, y)}{P(x) P(y)}        (2)

5
    The prevalence of a term is typically defined as the proportion of documents in which
    a term appears in a corpus.
6


    PMI(f^i, f^j) = \log_2 \frac{P(f^i, f^j)}{P(f^i) P(f^j)}        (3)

    Linear(f^i, f^j) = P(f^i \mid f^j) - P(f^i \mid \bar{f}^j)        (4)
    where p_i denotes the prevalence (proportion of occurrences over the total number
of contexts) of feature f^i, P(x) denotes the probability that feature x occurs
in a random context, P(\bar{x}) is the probability that x does not occur in a random
context, and \rho(x, y) is a function that changes sign when x and y are
negatively correlated6.
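As an illustration, the cosine (Eq. 1) and PMI (Eq. 3) DCFs might be computed as follows. This is a sketch under the assumption that context distribution vectors are binary document-occurrence indicators; function names and the zero-return conventions are ours.

```python
import math

def cosine_dcf(fi, fj):
    """Cosine DCF (Eq. 1) over binary context-occurrence vectors:
    cosine similarity minus the chance-level term sqrt(p_i * p_j)."""
    n = len(fi)
    dot = sum(a * b for a, b in zip(fi, fj))
    ni, nj = sum(fi), sum(fj)
    if ni == 0 or nj == 0:
        return 0.0
    # for 0/1 vectors, ||f|| = sqrt(number of contexts containing the term)
    cosine = dot / math.sqrt(ni * nj)
    return cosine - math.sqrt((ni / n) * (nj / n))

def pmi_dcf(fi, fj):
    """PMI DCF (Eq. 3) over binary context-occurrence vectors."""
    n = len(fi)
    p_joint = sum(a * b for a, b in zip(fi, fj)) / n
    p_i, p_j = sum(fi) / n, sum(fj) / n
    if p_joint == 0 or p_i == 0 or p_j == 0:
        return 0.0  # convention when the log is undefined
    return math.log2(p_joint / (p_i * p_j))
```

For two identical vectors, both functions return their maximal deviation from chance; for terms that never co-occur, the cosine DCF goes negative, reflecting a correspondence below chance level.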

3.4   Word Embeddings and Document Representation
The feature representations of DCI might be thought of as a generalization of
co-occurrence vectors (see, e.g., [12]), where the co-occurrence metric is any of the
DCFs and the context window is set to the document length. Once a set of m
pivots P = {p_1, p_2, ..., p_m} and a DCF η have been selected, each term f in
the source and target domains is modeled as the m-dimensional vector
    \vec{f} = (\eta(\mathbf{f}, \mathbf{p}_1), \eta(\mathbf{f}, \mathbf{p}_2), \ldots, \eta(\mathbf{f}, \mathbf{p}_m))        (5)
where \mathbf{f} and \mathbf{p}_i are the context distribution vectors of the term f and the
i-th pivot, respectively. Note that, because we are operating in the transductive
regime, the context distribution vectors \mathbf{f} and \mathbf{p}_i are taken from the co-occurrence
matrix of the training set when modeling the source terms, and from
the co-occurrence matrix of the test set when modeling the target terms7.
    Finally, train and test documents are indexed in the embedding space via a
weighted sum of all word embeddings of the terms composing the documents.
That is, document di is represented as the m-dimensional vector
    \vec{d}_i = \sum_{f_j \in d_i} w_{ij} \cdot \vec{f}_j        (6)

where w_{ij} is the weight of term f_j in document d_i (we used the standard cosine-normalized
tfidf), and \vec{f}_j is the word embedding of term f_j.
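Eqs. 5 and 6 can be sketched as follows. This is an illustrative implementation under our own naming; `weights` stands for the cosine-normalized tfidf weights mentioned above, and `dcf` is any Distributional Correspondence Function of Section 3.3.

```python
def embed_term(term_vec, pivot_vecs, dcf):
    """Eq. 5: represent a term by its DCF scores against the m pivots."""
    return [dcf(term_vec, p) for p in pivot_vecs]

def embed_document(doc_terms, weights, term_embeddings):
    """Eq. 6: weighted sum of the embeddings of the terms in a document.
    weights[t] is the (cosine-normalized tfidf) weight of term t."""
    m = len(next(iter(term_embeddings.values())))
    doc_vec = [0.0] * m
    for t in doc_terms:
        for k, v in enumerate(term_embeddings[t]):
            doc_vec[k] += weights[t] * v
    return doc_vec
```

In the transductive setting, `embed_term` would be called with context vectors drawn from the training-set co-occurrence matrix for source terms, and from the test-set matrix for target terms, as described above.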
    Once the training and test matrices have been represented in the embedding
space, the classifier is learned. As the classifier we adopt the Transductive SVM
[8], which also takes into account the structure of the test data when modeling
the decision function. We used the linear kernel, which has consistently delivered
good accuracy in text classification [7].
6
  That is, when the true positive rate plus the true negative rate, as obtained from the
  4-cell contingency table of x and y, is lower than 1.
7
  In this case, and differently from [5,11], we do not apply unification to the common
  features, because during preliminary tests we observed that most of the features appear
  simultaneously in the source and target domains, which would cause most of the words in
  the vocabulary to be unified. This contradicts the rationale behind the unification
  process, originally proposed to consolidate the representations of shared words across
  languages, such as proper nouns in cross-lingual adaptation.

4    Experiments

In this section we report on the experiments we ran to test the effectiveness of
our TDCI method in cross-domain topic classification.
    As the evaluation measure we adopt standard accuracy, i.e., the ratio of the
number of correctly labeled documents to the total number of documents
submitted to the classifier, i.e.,

                                         TP + TN
                            Acc =                                                       (7)
                                    TP + FP + FN + TN

where T P , T N , F P , and F N stand for the numbers of true positives, true
negatives, false positives, and false negatives, respectively. Note this choice is
perfectly valid given that all datasets are approximately balanced with respect
to the positive and negative classes,
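As a quick worked example of Eq. 7 (the function name is ours):

```python
def accuracy(tp, fp, fn, tn):
    """Eq. 7: proportion of correctly labeled documents."""
    return (tp + tn) / (tp + fp + fn + tn)
```

For instance, with 40 true positives and 40 true negatives out of 100 documents, accuracy is 0.8.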
    For better reproducibility, and to facilitate comparison of performance
with other methods, we consider the most commonly used benchmarks in
the related literature, namely the Reuters-21578, SRAA, and 20 Newsgroups
collections. Aside from being well-known benchmark collections in the realm of
topic classification, their class codes are organized hierarchically, and representative
subsets can thus be taken in order to generate new benchmarks
that are well suited for domain adaptation as well8.
    Reuters-21578 is one of the most widely used collections in TC research. It
is a set of 21,578 news stories that appeared on the Reuters newswire in 1987.
Documents in the collection are assigned to 5 top classes, among which the orgs,
people, and places classes have commonly been selected in other works for
experimenting with domain adaptation, leading to three datasets: orgs vs people, orgs vs
places, and people vs places; a preprocessed version can be found at9.
    SRAA consists of 73,218 Usenet posts about simulated autos, simulated
aviation, real autos, and real aviation, accessible at10. In this dataset, the class
pairs real vs simulated and auto vs aviation have been used to instantiate
two different domain adaptation problems. For example, in real vs simulated,
documents about aviation are used as the source domain, while documents
about autos constitute the target domain; the binary decision problem thus
consists in discerning between the real and simulated topics. In a similar vein, auto
vs aviation is created, where documents about simulated vehicles act as source
domain examples, and documents about real vehicles as target ones.
8
   This procedure consists in taking two top classes, say, A and B, with subclasses
   A.1 . . . A.x and B.1 . . . B.y, respectively. Then, two disjoint folds are taken for the
   source (S) and target (T ) sides in each class; e.g., AS = ∪1≤i