<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Analysis in Deep Transfer Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Osayande P. Omondiagbe</string-name>
          <email>omondiagbep@landcareresearch.co.nz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sherlock A. Licorish</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephen G. MacDonell</string-name>
          <email>stephen.macdonell@aut.ac.nz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, Landcare Research</institution>
          ,
          <addr-line>Lincoln, New Zealand</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Information science, University of Otago</institution>
          ,
          <addr-line>Dunedin, New Zealand</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Software Engineering Research Lab, Auckland University of Technology</institution>
          ,
          <addr-line>Auckland</addr-line>
          ,
          <country country="NZ">New Zealand</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>2</volume>
      <fpage>17</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>Data sparsity is a challenge facing most modern recommendation systems. With cross-domain recommendation techniques, one can overcome data sparsity by leveraging knowledge from relevant domains. This approach can be further enhanced by considering latent sentiment information. However, as this latent sentiment information is derived from both relevant and irrelevant sources, the performance of the recommendation system may decline. This is a negative transfer (NT) problem, where the knowledge that is derived from multiple sources affects the system. Also, these source domains are often imbalanced, which could further hurt the performance of the recommendation system. To this end, recent research has shown that NT is caused by domain divergence, source and target quality, and algorithms that are not carefully designed to utilise the target data to improve the domain transferability. While various research works have been proposed to prevent NT, these address only some of the factors that may lead to NT. In this paper, we propose a more systematic and comprehensive approach to overcoming NT in sentiment analysis by tackling the main causes of NT. Our approach combines the use of cost-weighting learning, an uncertainty-guided (aleatoric and epistemic) loss function over the target dataset, and the concept of importance sampling, to derive a robust model. Experimental results on a sentiment analysis task using Amazon review datasets validate the superiority of our proposed method when compared to three other state-of-the-art methods. To disentangle the contributions behind the success of both uncertainties, we conduct an ablation study exploring the effect of each module in our approach. Our findings reveal that we can improve a sentiment analysis task in a transfer learning setting from 4% to 10% when combining both uncertainties. Our outcomes show the importance of considering all factors that may lead to NT. These findings can help to build an effective recommendation system when including the latent sentiment information.</p>
      </abstract>
      <kwd-group>
        <kwd>Transfer learning</kwd>
        <kwd>neural networks</kwd>
        <kwd>BERT</kwd>
        <kwd>uncertainty</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>
          Generally, recommendation systems are used in commercial applications to help users discover the products or services they are looking for. In order to solve the lack of data and the cold-start<sup>1</sup> problem, researchers have increasingly introduced the concepts of source domain and target domain into cross-domain recommendation [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-2">
        <p>
          Through the use of transfer learning, cross-domain based recommendation is able to leverage the rich information from multiple domains, as against a single domain, and transfer knowledge effectively from one domain to another. For cross-domain recommendation to work, however, users’ interests or item features must be consistent or correlated across domains [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>DL4SR’22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on</p>
        <p>CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073</p>
        <p><sup>1</sup>A problem where the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information.</p>
      </sec>
      <sec id="sec-1-3">
        <p>
          Most existing cross-domain recommendation methods rely only on sharing text information, such as ratings, tags or reviews, and ignore latent sentiment information in the sentiment analysis domain [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Recently, methods that consider this latent sentiment information have been proven to be more effective when compared with existing recommendation algorithms that do not consider this information [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This is because user reviews are usually subjective, so they would not be able to reflect the user’s preferences and sentiments towards different attributes.
        </p>
        <p>
          As these sentiment data are derived from both relevant and irrelevant sources, and the datasets are often imbalanced, the performance of these cross-domain recommendation systems may decline due to learning a bias
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Also, these cross-domain models did not take into
account the bidirectional latent relations between users
and items [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. A better solution to this problem is to
introduce transfer learning (TL) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] into the cross-domain
recommendation system [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. TL systems utilise data and
knowledge from a related domain (known as the source
domain) to mitigate this learning bias, and can improve
the generalizability of models in the target domain [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
      </sec>
      <sec id="sec-1-4">
        <p>
          Regrettably, this approach is not always successful unless specific guidelines are adhered to [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]: 1) both tasks should be related; 2) the source and target domain should be similar; and 3) a model which can learn both domains should be applied to both the source and target datasets. When these guidelines are not followed, the performance of the target model is likely to degrade. This is known as negative transfer (NT) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. NT can be caused by four main issues [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. One: Domain divergence - when the divergence between the source and target domains is wide, NT will occur. Two: Transfer algorithm - when designing a transfer algorithm, it should have a theoretical guarantee that the performance in the target domain will be better when auxiliary data are used, or the transfer algorithm should be carefully designed to improve the transferability of auxiliary domains, else NT may occur. Three: Source data quality - the quality of the source data determines the quality of the transferred knowledge; if the source data are very noisy, then a model trained on them is unreliable. Four: Target data quality - the target domain data may be noisy, which may also lead to NT. Also, the amount of labelled target data has a direct impact on the learning process if not fully utilised by the learning algorithm [
          <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
          ].
        </p>
        <p>
          Various research works have proposed the mitigation of NT, and these are seen in the following areas [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. One: by enhancing the data transferability [
          <xref ref-type="bibr" rid="ref11 ref7">11, 7</xref>
          ]. This is done by either addressing the domain divergence between the source and target [
          <xref ref-type="bibr" rid="ref11 ref12">12, 11</xref>
          ], or by a reweighing strategy that applies more weight to those source domains which are similar to the target dataset [
          <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
          ], or by learning a common latent feature space [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Two: by enhancing the model transferability through transferable normalisation [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], or by making the model robust to adversarial samples through the use of a robust optimisation loss function [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Three: by enhancing the target prediction through the use of pseudo-labelling [
          <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
          ].
        </p>
        <p>
          Previous research found that the use of a model that is robust to adversarial samples results in better transferability [
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ]. Such models tend to have better accuracy than a standard target model. Similarly, Liang et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] found a positive correlation between a model that is robust to an adversarial sample and the knowledge transferred. This work suggests such a model can benefit from the knowledge transfer between the source and target. By relying on such methods, these approaches can be limited to being robust to adversarial samples and fail to model uncertainty under data and label distribution, which could introduce further bias [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Recently, the work of Grauer and Pei [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] has shown that when model uncertainty is known and distributed evenly, the performance and reliability of the model are greatly improved.
        </p>
        <p>
          In this work, we introduce the use of an uncertainty-guided loss function to guide the training process when utilising the source and target datasets, and incorporate a cost weight to tackle the problem of imbalanced data that may further increase the domain divergence issue. Hence, this work uses the idea of model and data transferability enhancement to develop a more robust model aimed at preventing negative transfer. By using such a systematic approach, we are able to tackle the four main causes of NT mentioned above. Our main contributions are summarised as follows.
        </p>
        <p>
          • We propose using a combined uncertainty as a loss function. This combined uncertainty consists of both the aleatoric and epistemic uncertainties. The epistemic uncertainty captures the model certainty, while the aleatoric uncertainty captures the uncertainty concerning information that the data cannot explain, and is modelled over the target and source dataset to guide the learning process. By using the aleatoric uncertainty-guided loss function over the target and source data, we can derive more information and enhance the model’s transferability.
          • We propose combining an uncertainty-guided loss function, a cost-sensitive classification method that incorporates cost-weighting into the model, and an importance sampling strategy to enhance the data and model transferability. This method can be used when there is imbalanced data and/or dissimilarity between the source and target dataset.
          • Finally, we perform an ablation study to disentangle the contributions behind the success of each module introduced in our system.
        </p>
        <p>
          The remainder of this paper is organised as follows. We present related work in Section 2. Next, we introduce our proposed approach in Section 3. Section 4 presents our datasets, candidate models, and experimental setup. The results and discussion are presented in Sections 5 and 6, respectively, before considering threats in Section 7. Finally, we conclude the study in Section 8.
        </p>
      </sec>
    </sec>
    <sec id="sec-related">
      <title>2. Related Work</title>
      <p>
        Transfer Learning is a research strategy in machine learning (ML) that aims to use the knowledge gained while solving one problem and apply it to a different but related problem [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Early methods in this area have exploited techniques such as instance weighting [
        <xref ref-type="bibr" rid="ref24 ref25">24, 25</xref>
        ], feature mapping [
        <xref ref-type="bibr" rid="ref26">26, 27</xref>
        ] and transferring relational knowledge [28]. Due to the increased processing power afforded by graphical processing units (GPUs), deep learning is now used more frequently in transfer learning tasks, and when compared to earlier approaches, such models have achieved better results in the discovery of domain-invariant features [29]. It was shown that when deep learning is used, the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features [29]. Some of these deep learning methods [30, 31, 32] have exploited the use of mismatch measurement, such as Maximum Mean Discrepancy (MMD), to transfer features, or have used generative adversarial networks (GANs) [33]. Although these methods have all achieved high performance in different domains, such as computer vision [34] and natural language processing [35], they were not designed to tackle the problem of negative transfer (NT).
      </p>
      <p>
        Other prominent lines of work in deep learning tackle the issue of NT. These works include the use of instance weighting (e.g., predictive distribution matching (PDM) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]), enhancing the feature transferability through the use of a latent feature (e.g., DTL [36]), and the use of a soft loss function based on soft pseudo-labels (e.g., Mutual Mean-Teaching (MMT) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]). These methods do not guarantee tackling NT, as they tackle some causes of NT, but not all (e.g., the PDM method tackles the transfer algorithm and source data quality issues, while MMT tackles the domain divergence, transfer algorithm and target data quality issues). Although a previous study exploring the benefits of modelling epistemic and aleatoric uncertainty in Bayesian deep learning models for vision tasks has demonstrated that when these uncertainties are integrated into the loss functions, the model is more robust to noisy data, how these can be used to tackle NT has not been looked at. Hence, our main objective in this paper is to derive a robust loss function for deep transfer learning that tackles the causes of NT mentioned in Section 1.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. Method</title>
      <sec id="sec-2-1">
        <p>This section provides a formal definition of NT and the proposed methods to overcome it.</p>
        <sec id="sec-2-1-1">
          <title>3.1. Negative Transfer</title>
          <p>
            Notation: We use the notation P<sub>S</sub>(x<sub>S</sub>) ≠ P<sub>T</sub>(x<sub>T</sub>) and P<sub>S</sub>(y|x<sub>S</sub>) ≠ P<sub>T</sub>(y|x<sub>T</sub>) to denote the marginal and conditional distributions of the source and target sets, respectively. In this case, x<sub>S</sub> and x<sub>T</sub> represent the source and target, respectively. Zhang et al. [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] gave a mathematical definition of NT, and proposed a way to determine the degree of NT (NTD) when it happens.
          </p>
          <p>Definition: Let ε be the test error in the target domain, A(S, T) a TL algorithm between source (S) and target (T), and A(∅, T) the same algorithm which does not use the source domain information at all. Then, NT happens when ε(A(S, T)) &gt; ε(A(∅, T)), and the degree of NT can be evaluated by equation 1 below:
NTD = ε(A(S, T)) − ε(A(∅, T))
(1)</p>
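          <p>As a minimal illustration of equation 1 (the error values below are hypothetical), the NTD computation reduces to a difference of two test errors:</p>

```python
# Degree of negative transfer (NTD), equation 1.
# A positive NTD means the source domain hurt the target model.
def negative_transfer_degree(error_with_source, error_target_only):
    """NTD = eps(A(S, T)) - eps(A(empty, T))."""
    return error_with_source - error_target_only

# Hypothetical test errors, for illustration only.
ntd = negative_transfer_degree(0.22, 0.18)
print(ntd > 0)  # True: negative transfer occurred in this example
```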
        </sec>
      </sec>
      <sec id="sec-2-2">
        <p>When NTD is positive, negative transfer has occurred. Next, we propose a systematic way to avoid negative transfer.</p>
        <sec id="sec-2-2-1">
          <title>3.2. Proposed Methods</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <p>We explain the three concepts used in our method below.</p>
        <p>Cost-sensitive Classification: The idea of cost-sensitive classification is used when there is a higher cost of mislabelling one class than the other class [37]. Cost-sensitive learning tackles the class imbalance problem by changing the model cost function, giving more weight to the minority class and multiplying the loss of each training sample by a specific factor. The imbalanced data distribution is not modified directly during training [37]. Madabushi et al. [38] introduced a cost-weighting strategy in the Bert model, which increases the weight of incorrectly labelled sentences by altering the cost function of the final model layer. The cost function is changed by modifying the cross-entropy loss for a single prediction x and the model’s prediction for class k to accommodate an array of weights, as shown in equation 2:
loss(x, k) = weight[k] · φ, where φ = −x[k] + log(∑<sub>j</sub> exp(x[j]))
(2)</p>
        <p>Importance Sampling: The traditional way of training a deep learning model has one major drawback: it is not able to differentiate samples where it performs very well (i.e., low loss) from those samples where the performance is poor (i.e., high loss) [39]. Also, as not all source samples can provide useful knowledge [39], we introduce the idea of importance sampling to control which examples should be given more priority. Importance sampling [40] is a variance reduction technique and is done by taking a random sample of a set based on a probability distribution among the elements of the group. In our proposed method, we attach weights to the source training examples based on their similarity to the target dataset. The samples with more weight will have a higher chance of being selected. We sample the source from a probability density over the target data.</p>
        <p>Uncertainty Quantification: There are different types of uncertainties, and these could be present in the data or the model. When the uncertainty is derived from the model, it is referred to as “epistemic or model uncertainty” [41]. Epistemic uncertainty captures the ignorance about the model generated from the collected data and can be explained away when more data is given to the model [41]. It is a property of the model. When the uncertainty is related to the data, it is referred to as aleatoric uncertainty [41]. It captures the uncertainty concerning information that the data cannot explain. This can be further divided into two:
• Heteroscedastic uncertainty, which depends on the data input and is predicted as a model output [41].
• Homoscedastic uncertainty, which is not input-data dependent but assumes a constant for all input data and varies between the different tasks [42].</p>
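        <p>As a rough sketch (the logits and class weights below are illustrative, not the actual BERT layer), equation 2 can be written as:</p>

```python
import math

def weighted_cross_entropy(logits, target_class, class_weights):
    # phi = -x[class] + log(sum_j exp(x[j]))   (equation 2)
    log_sum_exp = math.log(sum(math.exp(v) for v in logits))
    phi = -logits[target_class] + log_sum_exp
    # Scale the loss by the weight of the true class, so mislabelling
    # the minority class costs more than mislabelling the majority class.
    return class_weights[target_class] * phi

# Hypothetical two-class example where class 1 is the minority class.
loss = weighted_cross_entropy([2.0, 0.5], target_class=1, class_weights=[1.0, 3.0])
```

<p>With all class weights set to 1.0 this reduces to the standard cross-entropy loss.</p>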
        <p>In this case, we are not interested in the homoscedastic uncertainty because we are assuming related tasks between the source and target. To learn the heteroscedastic uncertainty, the loss function can be replaced with the following [41]:
L = ||y − ŷ||² / (2σ²) + (1/2) log σ²
(3)
where the model predicts a mean ŷ and variance σ².</p>
        <p>Kendall and Gal [41] proposed a loss function to combine both the epistemic and aleatoric (heteroscedastic) uncertainty as follows:
L = (1/N) ∑<sub>i=1</sub><sup>N</sup> exp(−log σ<sub>i</sub>²) ||y<sub>i</sub> − ŷ<sub>i</sub>||² + (1/2) log σ<sub>i</sub>²
(4)
where N is the total number of outputs and σ² is the variance.</p>
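        <p>A minimal sketch of equations 3 and 4, assuming the network outputs a predicted mean and a log-variance per output (all values here are hypothetical):</p>

```python
import math

def heteroscedastic_loss(y, y_hat, sigma_sq):
    # Equation 3: ||y - y_hat||^2 / (2 sigma^2) + (1/2) log sigma^2
    return (y - y_hat) ** 2 / (2.0 * sigma_sq) + 0.5 * math.log(sigma_sq)

def combined_uncertainty_loss(ys, y_hats, log_vars):
    # Equation 4: (1/N) sum_i exp(-log sigma_i^2) ||y_i - y_hat_i||^2
    #                      + (1/2) log sigma_i^2
    # Predicting log sigma^2 rather than sigma^2 keeps the variance positive.
    total = 0.0
    for y, y_hat, s in zip(ys, y_hats, log_vars):
        total += math.exp(-s) * (y - y_hat) ** 2 + 0.5 * s
    return total / len(ys)

loss = combined_uncertainty_loss([1.0, 0.0], [0.8, 0.1], [0.0, 0.0])
```

<p>Large residuals are attenuated where the predicted variance is high, while the (1/2) log σ² term penalises the model for claiming high uncertainty everywhere.</p>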
      </sec>
      <sec id="sec-2-4">
        <title>Algorithm 1: Combined Uncertainty Loss Function and Cost-Weighting (CUCW)</title>
        <p>Input:
• Source model g(x)
• Source training set Str
• Target training set Ttr
• Target validation set Tv
• Target testing set Tts
Output: Degree of negative transfer (NTD) and model performance</p>
        <p>1. Estimate the similarity of each source sample against 1000 random target samples.
2. Estimate importance weights with importance sampling, based on the similarity.
3. Train a source model g using the importance weights, with a small target sample as the validation data Tv.
4. Compute the loss function using Equations 2 and 3, or Equations 2 and 4.
5. Compute the test error ε(g(S, T)) of model g using the target test set Tts.
6. Train a target model h with the target data Ttr only.
7. Compute the test error ε(h(∅, T)) of model h using the target test set Tts.
8. Calculate NTD = ε(g(S, T)) − ε(h(∅, T)).
9. Fine-tune model g using the target training set Ttr and the target validation set Tv to derive a new model g′.
10. Compute the test error of model g′ using the target test set Tts.
11. Return the degree of negative transfer (NTD) and the model performance.</p>
      </sec>
      <sec id="sec-2-5">
        <p>Based on the algorithm above, we can employ deep transfer learning using the proposed approach to find an optimal model with the least degree of negative transfer. This can be done by following the steps in sequential order. For each step, we can find the best model by training different hyperparameters in our model.</p>
        <p>Our Proposed Approach: To derive our proposed loss function, which can enhance the data and model transferability, we combine equations 2 and 3 when incorporating heteroscedastic uncertainty, and equations 2 and 4 when incorporating both epistemic and heteroscedastic uncertainty. To determine the similarity of the samples, we use the method proposed by Kilgarriff [43]. Then, the Wilcoxon signed-rank test [44] is used to compare the frequency counts from both datasets to determine if both datasets have a statistically similar distribution. To overcome the divergence problem, we use the importance sampling technique in our training process. The pseudocode for our proposed method is given in Algorithm 1.</p>
      </sec>
      <sec id="sec-2-5-exp">
        <title>4. Experiments</title>
        <p>All experiments were conducted 10 times, as in the work of Bennin et al. [45], to reduce the impact of bias, and the results were averaged across all independent runs. For our sentiment analysis task, we use the Amazon review dataset. We aim to build an accurate sentiment analysis model for low-resource domains by learning from high-resource but related domains. We used the smaller version of the datasets prepared by Lakkaraju et al. [46]. These datasets contain 22 domains, as shown in Table 1. It is worth noting that some domains in this dataset are imbalanced, as seen in Fig. 1. We ranked reviews with 1 to 3 stars as negative, while reviews with 4 or 5 stars were ranked as positive. For the pre-processing steps, we use standard techniques commonly used in NLP and Amazon sentiment analysis tasks [47, 48] in the following order: tokenisation, stop word/punctuation removal, and lemmatisation. Tokenisation is the process of separating a sentence into a sequence of words known as “tokens” [49]. These tokens are identifiable or separated from each other by a space character. Punctuation and stop words that frequently appear and do not significantly affect meaning (e.g., “the”, “is” and “and”) were also removed [49]. Our lemmatisation process uses the context from which the word is derived (e.g., studies becomes study). By lemmatising a word, we reduce its derivationally related forms to a common root form. By using the root form of a word, the model will be able to learn any inflectional form of that given word.</p>
        <sec id="sec-2-5-setup">
          <title>4.1. Experiment Setup</title>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <p>We selected only domains from the Amazon review datasets where class imbalance was evident. To determine the domains to select, the negative to positive ratio is presented in Table 1, where only domains with a ratio of less than 0.7 were selected for this experiment. From Table 1, six domains were selected, as shown in Figure 2 below.</p>
        <p>Figure 2: Amazon review dataset showing imbalanced domains</p>
        <p>We designed two groups of experiments by selecting domains where class imbalance is present, as shown in Figure 2. In the experiment, we excluded the ”Grocery” domain, as this domain is not related to the other six domains shown in Figure 2. The first group of domains consists of datasets from Beauty, Outdoor_living and Jewelry_&amp;_Watches, while the second domain group consists of datasets from Office_products, Cell_phones_&amp;_Service and Software. For each experiment, a single domain was used as the target dataset, while the remaining domains in that group were used as the source datasets.</p>
        <p>Text Similarity Measure: We use the Wilcoxon signed-rank test [44] to compare the frequency counts from both datasets to determine if both datasets have a statistically similar distribution. This was done by extracting all words, while retaining repeats, from each sample of our source training set and ignoring the stop words. From the target set, we sampled (with replacement) 1000 samples, as done by Madabushi et al. [38]. Then, we use the word frequency from each of the source training samples and the sampled target set to calculate the p-value using the Wilcoxon signed-rank test.</p>
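        <p>A rough sketch of this similarity measure: build paired word-frequency vectors over a shared vocabulary, then apply the Wilcoxon signed-rank test. The normal approximation below is a simplification (in practice a library routine such as scipy.stats.wilcoxon would be used, and the stop-word handling and sampling are omitted):</p>

```python
import math
from collections import Counter

def frequency_vectors(source_words, target_words):
    # Paired counts over the union vocabulary (stop words removed upstream).
    vocab = sorted(set(source_words) | set(target_words))
    cs, ct = Counter(source_words), Counter(target_words)
    return [cs[w] for w in vocab], [ct[w] for w in vocab]

def wilcoxon_signed_rank_p(xs, ys):
    # Drop zero differences, as in the standard signed-rank test.
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # Normal approximation to the null distribution of W+.
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return math.erfc(abs(w_plus - mean) / sd / math.sqrt(2.0))  # two-sided p
```

<p>A high p-value indicates that the two frequency distributions are statistically similar, so the source sample is a good transfer candidate.</p>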
        <p>Model: We used the BERT uncased model for this task.</p>
        <p>It consists of a 768-dimension vector, 12 layers of the transformer block and 110 million parameters. We added a fully connected layer on top of the BERT self-attention layers to classify the review. For the parameters, we adopt similar hyperparameters to those used in the BERT uncased model for Amazon sentiment analysis [50]. These parameters include using the Adam optimiser with various learning rates and a 512 max sequence length with five epochs. The learning rate was 1e-05. The model was first built using the source dataset to derive a source model. Then, this source model was fine-tuned with the target datasets. The fine-tuning with the target datasets was done by using a commonly used split ratio (30:70) [51].</p>
        <p>The training sets of the target data were used to fine-tune the source model before being tested on the test sets. We ran 10 experiments to compute the estimated risk of the different methods, and the average was reported.</p>
        <p>Evaluation measures: All experiments were conducted
10 times as done in the work of Bennin et al. [45] to
reduce the impact of sampling bias, and the results were
averaged across the independent runs. To evaluate the
prediction accuracy of each modelling approach, the
following were computed:
• Balanced accuracy (BAUC): BAUC measures
model performance, taking into account class
imbalances and it also overcomes bias in binary
cases [52]. The balanced accuracy is computed
as the average of the proportion of correct
predictions for each class separately.
• F-measure: This is used for evaluating binary
classification models based on the predictions
made for the positive class [52].</p>
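        <p>For binary labels, the two measures can be sketched as follows (a minimal illustration, not the exact evaluation code used in the study):</p>

```python
def balanced_accuracy(y_true, y_pred):
    # Average of per-class recall, so the majority class cannot dominate.
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(1 for i in idx if y_pred[i] == c) / len(idx))
    return sum(recalls) / len(recalls)

def f_measure(y_true, y_pred, positive=1):
    # Harmonic mean of precision and recall for the positive class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```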
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Results</title>
      <sec id="sec-3-1">
        <p>
          Here, we compare our systematic approach against three different strategies proposed for tackling NT. These
strategies were:
• Predictive distribution matching (PDM) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>This is an instance-based weighting approach.</p>
        <p>
          This method works by first measuring the
difering predictive distributions of the target domain
and the related source domains. In this case, a
PDM regularised classifier is used to infer the
target pseudolabeled data, which will help to
identify the relevant source data, so as to correctly
align their predictive distributions [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. We used
the support vector machines (SVM) variant of the
proposed PDM as used in the sentiment analysis
task in the work of Seah et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
• Mutual Mean-Teaching (MMT)[
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]: This is a
feature transferability approach which uses a soft
loss function based on soft pseudo-labels and is
carried out in two stages. In the first stage, the
Bert uncased model was trained using the source
domain to derive a source model. This source
model is trained to model a feature
transformation function that transforms each input sample
into a feature representation. For this experiment.
the source model is trained with a classification
loss and a triplet loss to separate features
belonging to diferent identity, as used in the original
paper [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Next, the source model trained in
stage 1 is optimised using the MMT framework,
which is based on the clustering method. The
details of this approach are explained in the original
paper [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
• Dual Transfer Learning (DTL) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]: This approach enhances feature transferability through the use of a latent feature. The method simultaneously learns the marginal and conditional distributions and exploits their duality. For this experiment, the training was done using the BERT uncased model by combining the source and target training data before testing on the target dataset.
        </p>
        <p>In Tables 2 to 3, we report the fine-tuned models’ performance (balanced accuracy and F-measure) on the target test set. In cases where NT has occurred (i.e., the degree of NT was calculated using Equation 1), we colour the accuracy value red. From Table 2, the results indicate that our proposed approach with fine-tuning, the other components, and both uncertainties (heteroscedastic aleatoric and epistemic uncertainty) in the loss function outperformed the other three models. To disentangle the contribution of each component in our proposed approach, we report the results obtained by removing each component in turn. When epistemic uncertainty or cost weighting was excluded from the loss function, we noticed three cases (i.e., when outdoor living, cell phones &amp; service, and office product were used as the target datasets) where the MMT method outperformed our approach. A similar outcome was noted in the F-measure, as shown in Table 3. To further disentangle the contribution of all components in our proposed approach without fine-tuning the BERT model, and to provide a fair comparison with the three methods we compared against, we combined the source and target training data to train our BERT model before testing on the target test data. This was done to remove the benefit of the fine-tuning component in our design. The results in Tables 4 to 5 show that, without the fine-tuning component, we were still able to improve performance when all other components are integrated in our deep transfer learning approach, but with less improvement (i.e., an improvement in BAUC and F-measure of 2% to 9%, as shown in Table 4 and Table 5).</p>
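        <p>To make the stage-1 objective of MMT concrete, the following is a minimal pure-Python sketch (an illustration only, not the implementation from [19]; the margin and the trade-off weight lam are assumed values) of a classification loss combined with a triplet loss that separates features belonging to different identities:</p>
        <preformat>
```python
import math

def cross_entropy(logits, label):
    # Softmax cross-entropy for a single example (numerically stable).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.3):
    # Hinge loss on feature distances: the anchor should be closer
    # to the positive (same identity) than to the negative
    # (different identity) by at least `margin` (assumed value).
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def stage1_loss(logits, label, anchor, positive, negative, lam=1.0):
    # Combined stage-1 objective: classification loss plus a
    # triplet term; lam is an assumed trade-off weight.
    return cross_entropy(logits, label) + lam * triplet_loss(anchor, positive, negative)
```
        </preformat>
        <p>In MMT itself the feature representations come from the source model and the losses are minimised jointly by gradient descent; the sketch only shows how the two terms combine.</p>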
      </sec>
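        <p>For readers unfamiliar with how NT is flagged, the sketch below illustrates the general idea behind such a check (a hedged sketch only; Equation 1 in the paper may differ in detail): transfer is negative when the transferred model performs worse than a model trained on the target data alone.</p>
        <preformat>
```python
def negative_transfer_degree(acc_transfer, acc_target_only):
    # Degree of negative transfer: how much worse the transferred
    # model performs than a target-only baseline. Positive values
    # indicate NT occurred. (Illustrative form, not necessarily
    # the paper's Equation 1.)
    return acc_target_only - acc_transfer

def nt_occurred(acc_transfer, acc_target_only):
    return negative_transfer_degree(acc_transfer, acc_target_only) > 0
```
        </preformat>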
    </sec>
    <sec id="sec-5">
      <title>6. Discussion</title>
      <p>In our sentiment analysis experiment (see Tables 2 to 5), our proposed method, which incorporated both uncertainties, was able to improve the balanced accuracy of the BERT model by 5% to 14% and the F-measure by 5% to 10% compared to techniques that are instance based [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] or feature transferability based [
          <xref ref-type="bibr" rid="ref15 ref19">19, 15</xref>
          ]. Although instance-level transferability enhancement has been used in deep learning models to prevent NT [
          <xref ref-type="bibr" rid="ref11">11, 53</xref>
          ], these methods do not handle the target data quality. This factor is shown to be one of the causes of NT [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The PDM method that we compared against in this paper tackles the domain divergence issue by using predictive distribution matching to remove the irrelevant source. This method still failed to address the target data quality; hence, we noted a single case of NT in our NLP task result (when the outdoor living domain was used as the target dataset). Although the MMT method uses a softmax loss function based on soft pseudo-labels to tackle the target data quality, it cannot tackle the domain divergence issue, which may also lead to NT. A single case of NT (when the outdoor living domain was used as the target dataset) was also noted when using this method. On the other hand, our proposed method is more robust. It uses the uncertainty-guided function to tackle the target and source data quality issue, and importance sampling and cost weighting learning to tackle the domain divergence problem. For the fine-tuning process, we use a small target sample as the validation data in the source model to improve the transferability of the final model. Our results show that the final model is improved when we introduce an uncertainty-guided loss function to guide the training process when utilising the source and target datasets, and incorporate a cost weight to tackle the problem of imbalanced data. In the work of Grauer and Pei [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], it was also noted that when model uncertainty is known and distributed evenly, the performance and reliability of the model are greatly improved. Hence, this work uses the idea of model and data transferability enhancement to develop a more robust approach aimed at preventing negative transfer. The evidence from our results suggests that a systematic approach such as the one proposed in this paper can improve the quality of models in a deep transfer learning setting. Also, it is worth noting that two of the methods we compared against (MMT and DTL) in this study also use the BERT uncased model; hence, we are able to eliminate the interference of model complexity in the comparison results. From the ablation study, model fine-tuning improved the overall performance by 2% to 6% when integrating all components into our approach.</p>
    </sec>
    <sec id="sec-4">
      <title>7. Addressing threats to validity</title>
      <p>The experimental dataset was compiled by [46]. We acknowledge threats relating to errors in the review labels. These threats have been well minimised by experimenting with different projects in the datasets. Also, we concede that there are a few uncontrolled factors that may have impacted the experimental results in this study. For instance, there could have been unexpected faults in the implementation of the approaches we compare against in this paper [54]. We sought to reduce such threats by using the source code provided for these methods (e.g., PDM, MMT and DTL). While we recognize the threats above, we anticipate that our study still contributes novel findings to transfer-based modelling for recommendation systems in NLP domains relying on latent sentiment information.</p>
    </sec>
    <sec id="sec-7">
      <title>8. Conclusion</title>
      <p>In this work, we proposed a systematic approach to overcoming negative transfer by tackling domain divergence and taking account of the source and target data quality. Our approach involves using cost weighting learning, an uncertainty-guided loss function over the target dataset, and the concept of importance sampling to derive a robust model. This systematic approach improves the target domain’s performance. The results also reveal that when both aleatoric heteroscedastic and epistemic uncertainty are combined, we can further enhance the performance of the target model. We therefore assert that our systematic approach is a good approach for overcoming negative transfer and improving target model performance when performing sentiment analysis in a transfer learning setting. This approach can be used to build an effective recommendation system that includes latent sentiment information. A plausible next step is to use such an approach to design an effective recommendation system that takes into account the latent sentiment information. Although our experiments showed that our approach improves target model performance and prevents NT in sentiment analysis, it is still important to investigate this approach for other domains.</p>
    </sec>
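    <p>The uncertainty-guided, cost-weighted loss discussed above can be sketched as follows (a minimal pure-Python illustration in the spirit of Kendall and Gal’s loss attenuation; the exact weighting scheme and constants are assumptions, not the authors’ formulation):</p>
    <preformat>
```python
import math

def uncertainty_weighted_loss(ce_losses, log_vars, sample_class_weights):
    # Loss-attenuation sketch: each sample's cross-entropy is scaled
    # by exp(-log_var), so samples the model predicts as noisy (high
    # aleatoric uncertainty) contribute less, while the 0.5 * log_var
    # penalty stops the model from declaring every sample uncertain.
    # sample_class_weights applies cost weighting for imbalanced
    # classes. Illustrative only; the paper's loss may differ.
    total = 0.0
    for ce, log_var, w in zip(ce_losses, log_vars, sample_class_weights):
        total += w * (math.exp(-log_var) * ce + 0.5 * log_var)
    return total / len(ce_losses)
```
    </preformat>
    <p>With zero predicted uncertainty and unit class weights this reduces to the ordinary mean cross-entropy; raising a sample’s log-variance shrinks its contribution to the total loss.</p>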
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This research was partly supported by an Internal Research fund from Manaaki Whenua — Landcare Research, New Zealand. Special thanks are given to the Department of Informatics at Landcare Research for their ongoing support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cremonesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tripodi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Turrin</surname>
          </string-name>
          ,
          <article-title>Cross-domain recommender systems</article-title>
          ,
          <source>in: 2011 IEEE 11th International Conference on Data Mining Workshops</source>
          , IEEE,
          <year>2011</year>
          , pp.
          <fpage>496</fpage>
          -
          <lpage>503</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>A survey on cross-domain recommendation: taxonomies, methods, and future directions</article-title>
          ,
          <source>arXiv preprint arXiv:2108.03357</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Cross-domain recommendation based on sentiment analysis and latent feature mapping</article-title>
          ,
          <source>Entropy</source>
          <volume>22</volume>
          (
          <year>2020</year>
          )
          <fpage>473</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zadrozny</surname>
          </string-name>
          ,
          <article-title>Learning and evaluating classifiers under sample selection bias</article-title>
          ,
          <source>in: Proceedings of the twenty-first international conference on Machine learning</source>
          ,
          <year>2004</year>
          , p.
          <fpage>114</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuzhilin</surname>
          </string-name>
          ,
          <article-title>DDTCDR: Deep dual transfer cross domain recommendation</article-title>
          ,
          <source>in: Proceedings of the 13th International Conference on Web Search and Data Mining</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>331</fpage>
          -
          <lpage>339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>A survey on transfer learning</article-title>
          ,
          <source>IEEE Transactions on knowledge and data engineering 22</source>
          (
          <year>2009</year>
          )
          <fpage>1345</fpage>
          -
          <lpage>1359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>A survey on negative transfer</article-title>
          ,
          <source>arXiv preprint arXiv:2009.00909</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2009.00909
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Marx</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaelbling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Dietterich</surname>
          </string-name>
          ,
          <article-title>To transfer or not to transfer</article-title>
          ,
          <source>in: NIPS 2005 Workshop on Transfer Learning</source>
          ,
          <year>2005</year>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Póczos</surname>
          </string-name>
          , J. Carbonell,
          <article-title>Characterizing and avoiding negative transfer</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>11293</fpage>
          -
          <lpage>11302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O. P.</given-names>
            <surname>Omondiagbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Licorish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. G.</given-names>
            <surname>MacDonell</surname>
          </string-name>
          ,
          <article-title>Improving transfer learning for cross project defect prediction</article-title>
          ,
          <source>TechRxiv preprint techrxiv.19517029</source>
          (
          <year>2022</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          , J. Carbonell,
          <article-title>Towards more reliable transfer learning</article-title>
          ,
          <source>in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>794</fpage>
          -
          <lpage>810</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Eaton</surname>
          </string-name>
          , et al.,
          <article-title>Selective transfer between learning tasks using task-based boosting</article-title>
          ,
          <source>in: Twenty-Fifth AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.-W.</given-names>
            <surname>Seah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-S.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. W.</given-names>
            <surname>Tsang</surname>
          </string-name>
          ,
          <article-title>Combating negative transfer from predictive distribution differences</article-title>
          ,
          <source>IEEE transactions on cybernetics 43</source>
          (
          <year>2012</year>
          )
          <fpage>1153</fpage>
          -
          <lpage>1165</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Pool-based sequential active learning for regression</article-title>
          ,
          <source>IEEE transactions on neural networks and learning systems 30</source>
          (
          <year>2018</year>
          )
          <fpage>1348</fpage>
          -
          <lpage>1359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ding</surname>
          </string-name>
          , W. Cheng,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , W. Wang,
          <article-title>Dual transfer learning</article-title>
          ,
          <source>in: Proceedings of the 2012 SIAM International Conference on Data Mining, SIAM</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>540</fpage>
          -
          <lpage>551</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <article-title>Transferable normalization: Towards improving transferability of deep neural networks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Madry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Makelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tsipras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vladu</surname>
          </string-name>
          ,
          <article-title>Towards deep learning models resistant to adversarial attacks</article-title>
          ,
          <source>arXiv preprint arXiv:1706.06083</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Negative transfer detection in transductive transfer learning</article-title>
          ,
          <source>International Journal of Machine Learning and Cybernetics</source>
          <volume>9</volume>
          (
          <year>2018</year>
          )
          <fpage>185</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification</article-title>
          ,
          <source>arXiv preprint arXiv:2001.01526</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. O.</given-names>
            <surname>Koyejo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Does adversarial transferability indicate knowledge transferability?</article-title>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vodrahalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kawaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <article-title>Adversarial training helps transfer learning via better representations</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Grauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <article-title>Minimum-variance control allocation considering parametric model uncertainty</article-title>
          ,
          <source>in: AIAA SCITECH 2022 Forum</source>
          ,
          <year>2022</year>
          , p.
          <fpage>0749</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Baxter</surname>
          </string-name>
          , T. Mitchell, L. Pratt,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thrun</surname>
          </string-name>
          ,
          <article-title>Learning to learn: knowledge consolidation and transfer in inductive systems</article-title>
          , in: Workshop held at NIPS-
          <volume>95</volume>
          , Vail, CO, see http://www.cs.cmu.edu/afs/user/caruana/pub/transfer.html,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krauledat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-R.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <article-title>Covariate shift adaptation by importance weighted cross validation</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>8</volume>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gretton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Borgwardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schölkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <article-title>Correcting sample selection bias by unlabeled data</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>19</volume>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>T.</given-names>
            <surname>Jebara</surname>
          </string-name>
          ,
          <article-title>Multi-task feature and kernel selection for SVMs</article-title>
          ,
          <source>in: Proceedings of the twenty-first international conference on Machine learning</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>