Sampling Methods for Random Subspace Domain Adaptation

Christian Pölitz

TU Dortmund University, Otto Hahn Str. 12, 44227 Dortmund

Abstract. Supervised classification tasks like Sentiment Analysis or text classification need labelled training data. These labels can be difficult to obtain, especially for complicated and ambiguous data like texts. Instead of labelling new data, domain adaptation tries to reuse already labelled data from related tasks as training data. We propose a greedy selection strategy to identify a small subset of data samples that are most suited for domain adaptation. Using these samples, the adaptation is done on a subspace in a kernel-defined feature space. To make this kernel approach applicable to large-scale data sets, we use random Fourier features to approximate kernels by expectations.

Introduction

The usual assumption in most Data Mining and Machine Learning tasks is that the training data used to learn a model has the same distribution as the test data to which the model is applied. However, there are many situations where this is not true. Imagine as Data Mining task Sentiment Analysis on product reviews from Amazon. In the case of a new product or product type, producers might be interested in how their products catch on. Sentiment Analysis then tries to label the reviews of the corresponding products as positive or negative. To assign such labels, classification models are trained on some labelled reviews and then applied to the unlabelled reviews. For new product types it is reasonable to assume that we have no labelled training data at hand. Labelling the new reviews can be quite expensive. Identifying the sentiment in texts, in particular, can be hard - even for experts; ambiguous words or sarcasm, for instance, make this task difficult. Instead of starting to label the new reviews, another possibility is to reuse already labelled reviews of different products. There might, for instance, already be labelled reviews about books, while the new reviews are about DVDs. The idea is to leverage the reviews about books to train a classifier that is applied to the reviews about DVDs. To accomplish this, we need to find a way to safely transfer information from one domain to another.

We solve this problem by domain adaptation with the following assumptions. We have two data sets with a (possibly large) difference in distribution: data from a source domain S distributed via p_s together with label information y distributed via p_s(y|x), and data from a target domain T distributed via p_t with no label information. The domain adaptation task now is to use the source domain together with its label information to find a classifier that labels the target domain best.
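In symbols, this setup can be summarized as follows; the loss function \ell and the hypothesis class H are generic placeholders introduced here only for illustration and are not fixed at this point of the paper:

\[
D_S = \{(x_i, y_i)\}_{i=1}^{n_s}, \quad (x_i, y_i) \sim p_s(x)\, p_s(y \mid x), \qquad
D_T = \{x_j\}_{j=1}^{n_t}, \quad x_j \sim p_t(x),
\]
\[
h^{*} = \arg\min_{h \in H} \; \mathbb{E}_{x \sim p_t,\; y \sim p(y \mid x)} \big[ \ell(h(x), y) \big].
\]

That is, the classifier is learned from D_S and its labels only, but its quality is judged under the target distribution p_t.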
We expect that many data sets share similarities on latent subspaces. In product reviews, for instance, a book might be described as tedious while a toaster might be described as malfunctioning. Both words have a negative meaning and very likely appear together with other negative words like bad, poor or poorly. In a latent subspace of the space spanned by the words, we expect that these words together span a whole dimension. When we map texts of reviews about books and electronic articles onto such a subspace, the words tedious and malfunctioning can be replaced by their common meaning. This will make the texts from the different domains more similar. Further, single terms alone might not be enough to find such subspaces. For instance, bi-grams like little helpful or hardly improving can also span a latent subspace that is helpful for domain adaptation. Generally, n-grams should also be considered.

In order to integrate information from multiple combinations of words, kernels like polynomial kernels can be used. Kernel methods can also integrate structural information and even information from probabilistic models. Consequently, we find low dimensional representations of the data from a source and a target domain in a Reproducing Kernel Hilbert Space. These representations shall keep enough of the structure of the data that a classifier trained on the source domain still performs well. On the other hand, the low dimensional representation shall make the two data sets more similar. This justifies a safe application of a classifier trained on the source domain to the target domain.

To find the subspace for the domain adaptation we propose a greedy selection strategy that finds the most useful data samples in the source domain for the domain adaptation. By this, we reduce the data size and concentrate on those samples that are potentially best suited to transfer knowledge. This idea is based on the assumption that not all source samples are equally important for adaptability; this has been investigated for instance in [10]. Further, we approximate kernels by random Fourier features as proposed by [19]. This tackles the quadratic or cubic scaling behaviour of kernel methods in the number of data samples.
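As an illustration of this approximation (a minimal sketch, not the exact construction used later in the paper), the following Python snippet implements the random Fourier features of [19] for a Gaussian RBF kernel; the function name, the bandwidth sigma and the number of features D are assumptions made only for this example.

import numpy as np

def random_fourier_features(X, D=500, sigma=1.0, seed=None):
    """Map X (n x d) to D random Fourier features so that z(x) . z(y)
    approximates the RBF kernel exp(-||x - y||^2 / (2 * sigma^2))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Frequencies drawn from the spectral density of the RBF kernel
    W = rng.normal(scale=1.0 / sigma, size=(d, D))
    # Random phase shifts
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

# The inner products of the explicit features approximate kernel values:
X = np.random.randn(200, 50)
Z = random_fourier_features(X, D=2000, sigma=1.0, seed=0)
K_approx = Z @ Z.T  # 200 x 200 approximate kernel matrix

Computing Z costs O(n d D) time and O(n D) memory, so the explicit n x D feature matrix replaces the n x n kernel matrix; this is what mitigates the scaling behaviour mentioned above.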
Related Work

We distinguish two main directions in domain adaptation. On the one hand, many of the existing approaches try to find weights for the samples that account for a mismatch in distribution between a target and a source domain. This is especially useful under the so-called covariate shift assumption. Here, we assume that the distribution of the labels given a sample is the same for both the target and the source domain. Via the weights, a sample selection bias shall be corrected. This means we assume that the source domain is sampled from the target distribution after applying a certain weighting mechanism. Many previous approaches learn such weights such that the weighted source distribution is most similar to the target distribution. For instance, [8] propose density estimators that incorporate sample selection bias to adapt the two distributions, [13] do this by matching the distributions in an RKHS, [14] find the optimal weights by solving a least squares problem, and [23] minimize the Kullback-Leibler divergence of the target distribution and the weighted source distribution, to name only a few. A theoretical analysis of this kind of adaptation can be found in [6] and [5].

In contrast to these approaches, several other works try to extract a subspace or feature representation in the data space that covers invariant parts across the target and the source distribution. Within such a subspace or feature representation, transferring knowledge between the source and target domain is expected to be more effective than in the whole ambient space. In [18], Transfer Component Analysis is introduced to find low dimensional representations in a kernel-defined Hilbert space. In this representation the target and source domain are more similar than before. The authors in [22] learn a linear subspace that is suitable for transfer learning by minimizing the Bregman divergence of the target and source distribution in this subspace, [21] transform the target points such that they are a linear combination of a basis in the source domain, [25] propose to transfer knowledge in a Hilbert space by aligning a kernel with the target domain, and [17] learn domain-invariant data transformations to minimize differences between the source and target domain distributions while preserving functional relations of the data with possible label information. Further, in [9] the authors propose to create subspaces that align with the eigenspaces of the target and source domain.

Background

In this section, we introduce the background on kernel methods, subspaces in Hilbert Spaces and distances of distributions. The presented information is crucial for our proposed strategy in the next sections.

Kernel Methods and RKHS

Kernel methods make it possible to apply linear methods to non-linear representations of data. Any kernel method uses a map X → φ(X) from a compact input space X, for instance