     Sampling Methods for Random Subspace Domain
                      Adaptation

                                       Christian Pölitz

                TU Dortmund University, Otto Hahn Str. 12, 44227 Dortmund



       Abstract. Supervised classification tasks like Sentiment Analysis or text clas-
       sification need labelled training data. These labels can be difficult to obtain, es-
       pecially for complicated and ambiguous data like texts. Instead of labelling new
       data, domain adaptation tries to reuse already labelled data from related tasks as
       training data. We propose a greedy selection strategy to identify a small subset
       of data samples that are most suited for domain adaptation. Using these samples,
       the adaptation is done on a subspace of a kernel-defined feature space. To make
       this kernel approach applicable to large-scale data sets, we use random Fourier
       features to approximate kernels by expectations.


Introduction

The usual assumption for most of the Data Mining and Machine Learning tasks is that
the training data used to learn a model has the same distribution as the test data on that
the model is applied. However, there are many situations where this is not true.
Imagine, as a Data Mining task, Sentiment Analysis on product reviews from Amazon. In
case of a new product or product type, producers might be interested in how their
products catch on. Sentiment Analysis now tries to label the reviews of the corresponding
products as being positive or negative. To assign such labels, classification models are
trained on some labelled reviews and then applied to the unlabelled reviews. For new
product types it is reasonable to assume that we have no labelled training data at hand.
Labelling the new reviews can be quite expensive. Especially identifying the sentiment
in texts can be hard, even for experts. Ambiguous words or sarcasm, for instance, make
this task difficult. Instead of starting to label the new reviews, another possibility is
to reuse already labelled reviews from different products. There might be for instance
already labelled reviews about books and now we get new reviews about DVDs. The
idea is to leverage the reviews about books to train a classifier that is applied on re-
views about DVDs. To accomplish this, we need to find a way to safely transfer the
information from one domain to another.
    We solve this problem by domain adaptation under the following assumptions: We
have two data sets with a (possibly large) difference in distribution. We have data from a
source domain S that is distributed according to p_s, together with label information y distributed
according to p_s(y|x). On the other hand, we also have data from a target domain T that is distributed
according to p_t, with no label information. The domain adaptation task now is to use
the source domain together with its label information to find a classifier that labels the
target domain best.
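    As a minimal illustration of this setting (a hypothetical Python sketch using scikit-learn, not part of the method proposed here), a classifier is trained on labelled source reviews and then applied directly to the unlabelled target reviews; making this transfer safe is exactly the goal of the adaptation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical data: labelled book reviews (source) and unlabelled DVD reviews (target).
source_texts = ["a tedious and poorly written story", "a wonderful, gripping book"]
source_labels = [0, 1]  # 0 = negative, 1 = positive
target_texts = ["the player is malfunctioning", "great picture and sound"]

# Shared vocabulary over both domains, classifier fitted on the source only.
vectorizer = TfidfVectorizer().fit(source_texts + target_texts)
classifier = LogisticRegression().fit(vectorizer.transform(source_texts), source_labels)

# Without adaptation, the source classifier is applied to the target domain
# even though p_s and p_t may differ substantially.
target_predictions = classifier.predict(vectorizer.transform(target_texts))
```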
     We expect that many data sets share similarities on latent subspaces. On product
reviews for instance, a book might be described as tedious while a toaster might be
described as malfunctioning. Both words have negative meaning and very likely appear
together with other negative words like bad, poor or poorly. In a latent subspace of
the space spanned by the words, we expect that these words together span a whole
dimension. When we map texts of reviews from books and electronic articles onto such
a subspace, the words tedious and malfunctioning can be replaced by their common
meaning. This will make the texts from the different domains more similar. Further,
single terms alone might not suffice to find such subspaces. For instance, bi-grams like
little helpful or hardly improving can also span a latent subspace that is helpful for
domain adaptation. Generally, n-grams should also be considered.
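    To make this latent subspace intuition concrete, the following toy sketch (hypothetical review snippets, scikit-learn assumed, not the method of this paper) builds uni- and bi-gram features for two domains and projects them onto a low-rank latent subspace in which domain-specific negative words receive similar coordinates.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Hypothetical review snippets from two product domains.
book_reviews = ["a tedious and poorly written story", "bad plot and poor characters"]
dvd_reviews = ["the player is malfunctioning and poorly built", "bad sound and poor picture"]

# Uni-gram and bi-gram features over a joint vocabulary of both domains.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(book_reviews + dvd_reviews)

# A low-rank latent subspace (LSA-style): words that co-occur with the same
# negative terms end up close together, so texts from both domains become
# more similar after projection.
latent = TruncatedSVD(n_components=2).fit_transform(X)
```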
     In order to integrate information of multiple combinations of words, kernels like
polynomial kernels can be used. Kernel methods can also integrate structural informa-
tion and even information from probabilistic models. Consequently, we find low dimen-
sional representations of the data from a source and a target domain in a Reproducing
Kernel Hilbert Space. These representations shall keep enough structure from the data
that a classifier trained on the source domain still performs well. On the other hand, the
low dimensional representation shall make the two data sets more similar. This justifies
a safe application of a classifier trained on the source domain to the target domain.
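    As a rough sketch of such a low dimensional representation in a kernel-defined feature space, kernel PCA with a polynomial kernel can serve as a stand-in (the subspace construction actually used in this paper is introduced later); the polynomial kernel implicitly combines single words into higher-order conjunctions.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Hypothetical bag-of-words matrices for the source and target domain.
X_source = np.random.rand(100, 500)
X_target = np.random.rand(80, 500)

# Kernel PCA on the pooled data yields a shared low dimensional representation
# in the RKHS induced by a polynomial kernel (which combines single words
# into higher-order feature conjunctions).
kpca = KernelPCA(n_components=10, kernel="poly", degree=2)
Z = kpca.fit_transform(np.vstack([X_source, X_target]))
Z_source, Z_target = Z[:100], Z[100:]
```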
     To find the subspace for the domain adaptation we propose a greedy selection strat-
egy that finds the most useful data samples in the source domain for the domain adapta-
tion. By this, we reduce the data size and concentrate on those samples that are poten-
tially best suited to transfer knowledge. This idea is based on the assumption that not all
source samples are equally important for adaptability; this has been investigated
for instance in [10]. Further, we approximate kernels by random Fourier features as
proposed in [19]. This tackles the quadratic or cubic scaling behaviour of kernel
methods in the number of data samples.
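    A minimal sketch of the random Fourier feature approximation of [19], assuming an RBF kernel k(x, z) = exp(-gamma * ||x - z||^2), could look as follows; the inner products of the explicit random features approximate the kernel values in expectation.

```python
import numpy as np

def random_fourier_features(X, n_features, gamma, seed=0):
    """Explicit feature map whose inner products approximate the RBF kernel
    k(x, z) = exp(-gamma * ||x - z||^2) in expectation."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies are sampled from the kernel's spectral density,
    # a Gaussian with standard deviation sqrt(2 * gamma) per dimension.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# The Gram matrix of the random features approximates the exact kernel matrix.
X = np.random.randn(5, 3)
Z = random_fourier_features(X, n_features=2000, gamma=0.5)
approx_gram = Z @ Z.T
exact_gram = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
```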


Related Work

We distinguish two main directions in domain adaptation. On the one hand, many of the
existing approaches try to find weights for the samples that account for a mismatch
in distribution between a target and a source domain. This is especially useful under the
so-called covariate shift assumption. Here, we assume that the distribution of the labels given a
sample is the same for both the target and the source domain. Via the weights, a sample selec-
tion bias shall be corrected. This means we assume that the source domain is sampled
from the target distribution with a certain weighting mechanism applied. Many previous ap-
proaches learn such weights such that the weighted source distribution is most similar
to the target distribution.
    For instance, [8] propose density estimators that incorporate sample selection bias to
adapt two distributions, [13] do this by matching the distributions in an RKHS, [14] find
the optimal weights by solving a least squares problem, and [23] minimize the Kullback-
Leibler divergence between the target distribution and the weighted source distribution, to
name only a few. A theoretical analysis of this kind of adaptation can be found in [6] and [5].
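    A simple illustrative way to obtain such importance weights (not necessarily the estimator used in the cited works) is to train a probabilistic classifier that separates target from source samples and to convert its probabilities into density ratios via Bayes' rule:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_source, X_target):
    """Estimate importance weights w(x) ~ p_t(x) / p_s(x) for the source samples."""
    X = np.vstack([X_source, X_target])
    y = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p_target = clf.predict_proba(X_source)[:, 1]
    # Bayes' rule: p_t(x) / p_s(x) is proportional to P(target | x) / P(source | x),
    # corrected for the different sample sizes of the two domains.
    return (p_target / (1.0 - p_target)) * (len(X_source) / len(X_target))
```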
    In contrast to these approaches, several other works try to extract a subspace or
feature representation in the data space that covers invariant parts across the target and
the source distribution. Within such a subspace or feature representation, transferring
knowledge between the source and target domain is expected to be more effective than
in the whole ambient space.
    In [18], Transfer Component Analysis is introduced to find low dimensional repre-
sentations in a kernel-defined Hilbert space. In this representation the target and source
domain are more similar than before. The authors of [22] learn a linear subspace that
is suitable for transfer learning by minimizing the Bregman divergence between the target and
source distributions in this subspace, [21] transform the target points such that they are a
linear combination of a basis in the source domain, [25] propose to transfer knowledge
in a Hilbert space by aligning a kernel with the target domain, and [17] learn domain-invariant
data transformations that minimize differences between the source and target domain distribu-
tions while preserving functional relations of the data, with possible label information.
Further, in [9] the authors propose to create subspaces that align with the eigenspaces of
the target and source domain.
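    For illustration, one well-known member of this family is subspace alignment, sketched below under the assumption that plain PCA bases stand in for the two eigenspaces; the source basis is linearly aligned with the target basis before both domains are projected.

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(X_source, X_target, d):
    """Align the d-dimensional PCA subspace of the source with that of the target
    and project both domains into the aligned coordinates."""
    P_s = PCA(n_components=d).fit(X_source).components_.T  # (n_features, d)
    P_t = PCA(n_components=d).fit(X_target).components_.T
    M = P_s.T @ P_t                      # alignment between the two bases
    Xs_aligned = X_source @ P_s @ M      # source data in target subspace coordinates
    Xt_projected = X_target @ P_t
    return Xs_aligned, Xt_projected
```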



Background

In this section, we introduce the background on kernel methods, subspaces in Hilbert
spaces and distances between distributions. The presented information is crucial for our
proposed strategy in the next sections.


Kernel Methods and RKHS

Kernel methods make it possible to apply linear methods to non-linear representations of
data. Any kernel method uses a map X → φ(X) from a compact input space X, for
instance