Preventing Negative Transfer on Sentiment Analysis in Deep Transfer Learning

Osayande P. Omondiagbe 1,2,∗, Sherlock A. Licorish 2 and Stephen G. MacDonell 2,3
1 Department of Informatics, Landcare Research, Lincoln, New Zealand
2 Department of Information Science, University of Otago, Dunedin, New Zealand
3 Software Engineering Research Lab, Auckland University of Technology, Auckland, New Zealand

Abstract
Data sparsity is a challenge facing most modern recommendation systems. With cross-domain recommendation techniques, one can overcome data sparsity by leveraging knowledge from relevant domains. This approach can be further enhanced by considering latent sentiment information. However, as this latent sentiment information is derived from both relevant and irrelevant sources, the performance of the recommendation system may decline. This is a negative transfer (NT) problem, where knowledge derived from multiple sources harms the system. Also, these source domains are often imbalanced, which could further hurt the performance of the recommendation system. Recent research has shown that NT is caused by domain divergence, source and target data quality, and algorithms that are not carefully designed to utilise the target data to improve domain transferability. While various works have been proposed to prevent NT, these address only some of the factors that may lead to NT. In this paper, we propose a more systematic and comprehensive approach to overcoming NT in sentiment analysis by tackling the main causes of NT. Our approach combines cost-weighted learning, an uncertainty-guided (aleatoric and epistemic) loss function over the target dataset, and the concept of importance sampling to derive a robust model. Experimental results on a sentiment analysis task using Amazon review datasets validate the superiority of our proposed method when compared to three other state-of-the-art methods. To disentangle the contributions behind the success of both uncertainties, we conduct an ablation study exploring the effect of each module in our approach. Our findings reveal that we can improve a sentiment analysis task in a transfer learning setting by 4% to 10% when combining both uncertainties. Our outcomes show the importance of considering all factors that may lead to NT. These findings can help to build an effective recommendation system that includes latent sentiment information.

Keywords
Transfer learning, neural networks, BERT, uncertainty

DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
∗ Corresponding author.
omondiagbep@landcarereserach.co.nz (O. P. Omondiagbe); sherlock.licorish@otago.ac.nz (S. A. Licorish); stephen.macdonell@aut.ac.nz (S. G. MacDonell)
ORCID: 0000-0002-9267-4832 (O. P. Omondiagbe); 0000-0001-7318-2421 (S. A. Licorish); 0000-0002-2231-6941 (S. G. MacDonell)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Generally, recommendation systems are used in commercial applications to help users discover the products or services they are looking for. In order to solve the lack of data and the cold-start problem (i.e., where the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information), researchers have increasingly introduced the concepts of source domain and target domain into cross-domain recommendation [1]. Through the use of transfer learning, cross-domain recommendation is able to leverage the rich information available across multiple domains, as opposed to a single domain, and transfer knowledge effectively from one domain to another. For cross-domain recommendation to work, however, users' interests or item features must be consistent or correlated across domains [1].

Most existing cross-domain recommendation methods rely only on sharing text information, such as ratings, tags or reviews, and ignore the latent sentiment information studied in the sentiment analysis domain [2]. Recently, methods that consider this latent sentiment information have been proven to be more effective than existing recommendation algorithms that do not consider this information [3]. This is because user reviews are usually subjective, so ratings and tags alone would not be able to reflect the user's preferences and sentiments towards different attributes.
As these sentiment data are derived from both relevant and irrelevant sources, and the datasets are often imbalanced, the performance of these cross-domain recommendation systems may decline due to a learned bias [4]. Also, these cross-domain models did not take into account the bidirectional latent relations between users and items [5]. A better solution to this problem is to introduce transfer learning (TL) [6] into the cross-domain recommendation system [5]. TL systems utilise data and knowledge from a related domain (known as the source domain) to mitigate this learning bias, and can improve the generalizability of models in the target domain [6].

Regrettably, this approach is not always successful unless specific guidelines are adhered to [7]: 1) both tasks should be related; 2) the source and target domain should be similar; and 3) a model which can learn both domains should be applied to both the source and target datasets. When these guidelines are not followed, the performance of the target model is likely to degrade. This is known as negative transfer (NT) [8]. NT can be caused by four main issues [7]. One: Domain divergence - when the divergence between the source and target domains is wide, NT will occur. Two: Transfer algorithm - a transfer algorithm should have a theoretical guarantee that performance in the target domain will be better when auxiliary data are used, or it should be carefully designed to improve the transferability of the auxiliary domains, else NT may occur. Three: Source data quality - the quality of the source data determines the quality of the transferred knowledge; if the source data are very noisy, then a model trained on them is unreliable. Four: Target data quality - the target domain data may also be noisy, which may likewise lead to NT. In addition, the amount of labelled target data has a direct impact on the learning process if not fully utilised by the learning algorithm [9, 10].

Various research works have proposed ways to mitigate NT, and these fall into the following areas [7]. One: enhancing data transferability [11, 7], done either by addressing the domain divergence between the source and target [12, 11], by a reweighing strategy that applies more weight to those source domains which are similar to the target dataset [13, 14], or by learning a common latent feature space [15]. Two: enhancing model transferability through transferable normalisation [16], or by making the model robust to adversarial samples through the use of a robust optimisation loss function [17]. Three: enhancing the target prediction through the use of pseudo-labelling [18, 19].

Previous research found that the use of a model that is robust to adversarial samples results in better transferability [20, 21]; such models tend to have better accuracy than a standard target model. Similarly, Liang et al. [20] found a positive correlation between a model's robustness to adversarial samples and the knowledge transferred, suggesting that such a model can benefit from the knowledge transfer between the source and target. By relying on such methods, however, these approaches are limited to being robust to adversarial samples and fail to model uncertainty in the data and label distributions, which could introduce further bias [22]. Recently, the work of Grauer and Pei [22] has shown that when model uncertainty is known and distributed evenly, the performance and reliability of the model are greatly improved.

In this work, we introduce the use of an uncertainty-guided loss function to guide the training process when utilising the source and target datasets, and incorporate a cost weight to tackle the problem of imbalanced data that may further increase the domain divergence issue. Hence, this work uses the idea of model and data transferability enhancement to develop a more robust model aimed at preventing negative transfer. By using such a systematic approach, we are able to tackle the four main causes of NT mentioned above. Our main contributions are summarised as follows.
• We propose using a combined uncertainty as a loss function. This combined uncertainty consists of both the aleatoric and epistemic uncertainties. The epistemic uncertainty captures the model uncertainty, while the aleatoric uncertainty captures the uncertainty concerning information that the data cannot explain, and is modelled over the target and source datasets to guide the learning process. By using the aleatoric uncertainty-guided loss function over the target and source data, we can derive more information and enhance the model's transferability.

• We propose combining an uncertainty-guided loss function, a cost-sensitive classification method that incorporates cost-weighting into the model, and an importance sampling strategy to enhance the data and model transferability. This method can be used when there is imbalanced data and/or dissimilarity between the source and target datasets.

• Finally, we perform an ablation study to disentangle the contributions behind the success of each module introduced in our system.

The remainder of this paper is organised as follows. We present related work in Section 2. Next, we introduce our proposed approach in Section 3. Section 4 presents our datasets, candidate models, and experimental setup. The results and discussion are presented in Sections 5 and 6, respectively, before considering threats to validity in Section 7. Finally, we conclude the study in Section 8.

2. Related Work

Transfer learning is a research strategy in machine learning (ML) that aims to use the knowledge gained while solving one problem and apply it to a different but related problem [23]. Early methods in this area exploited techniques such as instance weighting [24, 25], feature mapping [26, 27] and transferring relational knowledge [28].
Due to the increased processing power afforded by graphical processing units (GPUs), deep learning is now used more frequently in transfer learning tasks and, compared to earlier approaches, such models have achieved better results in discovering domain-invariant features [29]. It was shown that when deep learning is used, the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features [29]. Some of these deep learning methods [30, 31, 32] have exploited mismatch measurements, such as Maximum Mean Discrepancy (MMD), to transfer features, or have used generative adversarial networks (GANs) [33]. Although these methods have all achieved high performance in different domains, such as computer vision [34] and natural language processing [35], they were not designed to tackle the problem of negative transfer (NT).

Other prominent lines of work in deep learning tackle the issue of NT. These works include the use of instance weighting (e.g., predictive distribution matching (PDM) [13]), enhancing feature transferability through the use of a latent feature (e.g., DTL [36]), and the use of a soft loss function based on soft pseudo-labels (e.g., Mutual Mean-Teaching (MMT) [19]). These methods do not guarantee tackling NT, as they address some causes of NT but not all (e.g., the PDM method tackles the transfer algorithm and source data quality issues, while MMT tackles the domain divergence, transfer algorithm and target data quality issues). Although a previous study exploring the benefits of modelling epistemic and aleatoric uncertainty in Bayesian deep learning models for vision tasks demonstrated that when these uncertainties are integrated into the loss function the model is more robust to noisy data, how they can be used to tackle NT has not been examined. Hence, our main objective in this paper is to derive a robust loss function for deep transfer learning that tackles the causes of NT mentioned in Section 1.

3. Method

This section provides a formal definition of NT and the proposed methods to overcome it.

3.1. Negative Transfer

Notation: We use the notation $P_s(\mathbf{x}_s) \neq P_t(\mathbf{x}_t)$ and $P_s(y_s|\mathbf{x}_s) \neq P_t(y_t|\mathbf{x}_t)$ to denote the differing marginal and conditional distributions of the source and target sets, respectively. Here, $\mathbf{x}_s$ and $\mathbf{x}_t$ represent the source and target data, respectively. Zhang et al. [7] gave a mathematical definition of NT, and proposed a way to determine the degree of NT (NTD) when it happens.

Definition: Let ε be the test error in the target domain, θ(S, T) a TL algorithm between source (S) and target (T), and θ(∅, T) the same algorithm which does not use the source domain information at all. Then, NT happens when ε(θ(S, T)) > ε(θ(∅, T)), and the degree of NT can be evaluated by Equation 1 below:

$$NTD = \epsilon(\theta(S, T)) - \epsilon(\theta(\emptyset, T)) \quad (1)$$

When NTD is positive, negative transfer has occurred. Next, we propose a systematic way to avoid negative transfer.
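To make Equation 1 concrete, the sketch below computes NTD from two already-trained classifiers. The helper name and the scikit-learn-style `predict` interface are illustrative assumptions, not part of the original method.

```python
# Minimal sketch: measuring the degree of negative transfer (Equation 1).
from sklearn.metrics import accuracy_score

def negative_transfer_degree(transfer_model, target_only_model, X_test, y_test):
    """NTD = error(theta(S, T)) - error(theta(empty, T)); positive => NT occurred."""
    err_transfer = 1.0 - accuracy_score(y_test, transfer_model.predict(X_test))
    err_target_only = 1.0 - accuracy_score(y_test, target_only_model.predict(X_test))
    return err_transfer - err_target_only
```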
3.2. Proposed Methods

We explain the three concepts used in our method below.

Cost-sensitive Classification: Cost-sensitive classification is used when there is a higher cost of mislabelling one class than the other [37]. Cost-sensitive learning tackles the class imbalance problem by changing the model's cost function, giving more weight to the minority class and multiplying the loss of each training sample by a specific factor; the imbalanced data distribution itself is not modified during training [37]. Madabushi et al. [38] introduced a cost-weighting strategy in the BERT model, which increases the weight of incorrectly labelled sentences by altering the cost function of the final model layer. The cost function is changed by modifying the cross-entropy loss for a single prediction $x$ and the model's prediction for class $k$ to accommodate an array of weights, as shown in Equation 2:

$$loss(x, class) = weight[class] \cdot \left( -x[class] + \log \sum_{k} \exp(x[k]) \right) \quad (2)$$
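For illustration, Equation 2 corresponds to the class-weighted cross-entropy available in standard deep learning libraries. Below is a minimal PyTorch sketch; the weight values are placeholders, as the paper does not report the weights used.

```python
# Minimal sketch of the class-weighted cross-entropy in Equation 2 (PyTorch).
import torch
import torch.nn as nn

# Give the minority class (index 1 here, illustratively) a larger weight.
class_weights = torch.tensor([1.0, 2.5])
loss_fn = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)          # x[k]: raw scores for 8 reviews, 2 classes
labels = torch.randint(0, 2, (8,))  # gold class per review
# Per sample: weight[class] * (-x[class] + log(sum_k exp(x[k]))), then averaged.
loss = loss_fn(logits, labels)
```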
Importance Sampling: The traditional way of training a deep learning model has one major drawback: it does not differentiate between samples on which it performs very well (i.e., low loss) and samples on which its performance is poor (i.e., high loss) [39]. Also, as not all source samples can provide useful knowledge [39], we introduce the idea of importance sampling to control which examples should be given more priority. Importance sampling [40] is a variance reduction technique performed by taking a random sample from a set based on a probability distribution over the elements of the group. In our proposed method, we attach weights to the source training examples based on their similarity to the target dataset; samples with more weight have a higher chance of being selected. In effect, we sample the source data from a probability density informed by the target data.
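A minimal sketch of this sampling step is shown below, assuming a per-example `similarity` score has already been computed (the scores shown are placeholders).

```python
# Minimal sketch: similarity-weighted sampling of source training examples.
import numpy as np

rng = np.random.default_rng(42)
similarity = np.array([0.9, 0.1, 0.6, 0.4])  # illustrative per-sample scores

# Normalise similarities into a sampling distribution over source examples.
probs = similarity / similarity.sum()

# Draw a batch: source examples similar to the target are picked more often.
batch_indices = rng.choice(len(similarity), size=3, replace=True, p=probs)
```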
Uncertainty Quantification: There are different types of uncertainty, and these can be present in the data or the model. Uncertainty derived from the model is referred to as "epistemic" or model uncertainty [41]. Epistemic uncertainty captures the ignorance about the model generated from the collected data, and can be explained away when more data is given to the model [41]; it is a property of the model. Uncertainty related to the data is referred to as aleatoric uncertainty [41]. It captures the uncertainty concerning information that the data cannot explain, and can be further divided into two:

• Heteroscedastic uncertainty, which depends on the input data and is predicted as a model output [41].
• Homoscedastic uncertainty, which is not input dependent but is assumed constant for all input data and varies between different tasks [42].

In this case, we are not interested in the homoscedastic uncertainty because we assume related tasks between the source and target. To learn the heteroscedastic uncertainty, the loss function can be replaced with the following [41]:

$$Loss = \frac{\|y - \hat{y}\|^2}{2\sigma^2} + \frac{1}{2}\log\sigma^2 \quad (3)$$

where the model predicts a mean $\hat{y}$ and variance $\sigma^2$.

Kendall and Gal [41] proposed a loss function that combines both epistemic and aleatoric (heteroscedastic) uncertainty as follows:

$$Loss = \frac{1}{D}\sum_{i=1}^{D}\exp(-\log\sigma_i^2)\,\|y_i - \hat{y}_i\|^2 + \frac{1}{2}\log\sigma_i^2 \quad (4)$$

where $D$ is the total number of outputs and $\sigma^2$ is the predicted variance.
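As an illustration, Equation 4 can be implemented by having the network output a log-variance term alongside each prediction; the PyTorch sketch below follows the numerically stable form suggested by Kendall and Gal [41], with the epistemic part obtained separately (e.g., via Monte Carlo dropout). This is a sketch of the published loss, not the paper's exact implementation.

```python
# Minimal sketch of the combined aleatoric loss in Equation 4 (regression form).
import torch

def combined_uncertainty_loss(y_true, y_pred, log_var):
    # exp(-log sigma^2) * ||y - y_hat||^2 + (1/2) * log sigma^2, averaged over outputs
    precision = torch.exp(-log_var)
    return torch.mean(precision * (y_true - y_pred) ** 2 + 0.5 * log_var)
```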
Our Proposed Approach: To derive our proposed loss function, which can enhance both data and model transferability, we combine Equations 2 and 3 when incorporating heteroscedastic uncertainty, and Equations 2 and 4 when incorporating both epistemic and heteroscedastic uncertainty. To determine the similarity of a sample, we use the method proposed by Kilgarriff [43]: the Wilcoxon signed-rank test [44] is used to compare the word frequency counts from both datasets to determine if they have a statistically similar distribution. To overcome the divergence problem, we use the importance sampling technique in our training process. The pseudocode for our proposed method is as follows:

Algorithm 1: Combined Uncertainty Loss Function and Cost-Weighting (CUCW)
Input:
  • Source model g(x)
  • Source training set Str
  • Target training set Ttr
  • Target validation set Tv
  • Target testing set Tts
Output: Degree of negative transfer (NTD) and model performance

1. Estimate the similarity of each source sample against 1000 random target samples
2. Estimate importance weights with importance sampling based on the similarity
3. Train a source model g using the importance weights, with a small target sample as the validation data Tv
4. Compute the loss function using Equations 2 and 3, OR Equations 2 and 4
5. Compute the test error ε(θ(S, T)) of model g on the target test set Tts
6. Train a target model t with the target data Ttr only
7. Compute the test error ε(θ(∅, T)) of model t on the target test set Tts
8. Calculate NTD = ε(θ(S, T)) − ε(θ(∅, T))
9. Fine-tune model g using the target training set Ttr and target validation set Tv to derive a new model tg
10. Compute the test error of model tg on the target test set Tts
11. Return the degree of negative transfer (NTD) and the model performance

Based on the algorithm above, we can employ deep transfer learning using the proposed approach to find an optimal model with the least degree of negative transfer. This can be done by following the steps in sequential order. At each step, we can find the best model by training with different hyperparameters.
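As a sketch of steps 1-2 of Algorithm 1, the similarity between one source sample and the target sample can be scored as follows. The function name, the shared-vocabulary construction, and reading the p-value as a similarity score are assumptions made for illustration.

```python
# Minimal sketch of the word-frequency similarity test ([43, 44]).
from collections import Counter
from scipy.stats import wilcoxon

def similarity_pvalue(source_tokens, target_tokens, vocab):
    """Paired Wilcoxon test over a shared vocabulary; a larger p-value is
    read here as weaker evidence that the two frequency profiles differ."""
    src = Counter(source_tokens)
    tgt = Counter(target_tokens)
    src_freq = [src[w] for w in vocab]
    tgt_freq = [tgt[w] for w in vocab]
    return wilcoxon(src_freq, tgt_freq).pvalue
```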
4. Experiments

All experiments were conducted 10 times, as in the work of Bennin et al. [45], to reduce the impact of bias, and the results were averaged across all independent runs. For our sentiment analysis task, we use the Amazon review dataset. We aim to build an accurate sentiment analysis model for low-resource domains by learning from high-resource but related domains. We used the smaller version of the datasets prepared by Lakkaraju et al. [46]. These datasets contain 22 domains, as shown in Table 1. It is worth noting that some domains in this dataset are imbalanced, as seen in Figure 1. We ranked reviews with 1 to 3 stars as negative, while reviews with 4 or 5 stars were ranked as positive.

Table 1
Ratio of negative to positive samples in the Amazon datasets

Domain                    Ratio
Apparel                   0.98
Automotive                1.00
Baby                      0.91
Beauty                    0.49
Books                     0.89
Camera_&_Photo            0.91
Cell_phones_&_Service     0.58
Computer_&_Video_Games    1.00
Dvd                       0.96
Electronics               0.91
Grocery                   0.34
Health_&_Personal_Care    0.99
Jewelry_&_Watches         0.29
Kitchen_&_Housewares      0.94
Magazines                 0.97
Music                     1.02
Office_products           0.72
Outdoor_living            0.34
Software                  0.63
Sports_&_Outdoors         0.95
Toy_&_Games               0.91
Video                     1.24

Figure 1: Amazon review dataset (review counts per domain).

For the pre-processing steps, we use standard techniques commonly used in NLP and Amazon sentiment analysis tasks [47, 48], in the following order: tokenisation, stop word/punctuation removal, and lemmatisation. Tokenisation is the process of separating a sentence into a sequence of words known as "tokens" [49]; these tokens are identifiable as being separated from each other by a space character. Punctuation and stop words, which appear frequently and do not significantly affect meaning (e.g., "the", "is" and "and"), were removed [49]. Our lemmatisation process uses the context from which a word is derived (e.g., "studies" becomes "study"). By lemmatising a word, we reduce its derivationally related forms to a common root form; using the root form, the model is able to learn any inflectional form of that word.
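A minimal NLTK-based sketch of this pipeline is given below; the paper does not name its tooling, so the library choice and the resource names (punkt, stopwords, wordnet) are assumptions.

```python
# Minimal sketch: tokenisation -> stop word/punctuation removal -> lemmatisation.
# Assumes nltk's punkt, stopwords and wordnet resources are downloaded.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(review: str) -> list[str]:
    tokens = nltk.word_tokenize(review.lower())                   # 1. tokenisation
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens
              if t not in stops and t not in string.punctuation]  # 2. removal
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens]              # 3. lemmatisation
```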
4.1. Experiment Setup

We selected only domains from the Amazon review datasets where class imbalance was evident. To determine the domains to select, the negative to positive ratio is presented in Table 1, and only domains with a ratio below 0.7 were selected for this experiment. From Table 1, six domains were selected, as shown in Figure 2.

Figure 2: Amazon review dataset showing the imbalanced domains (review counts per domain).

We designed two groups of experiments by selecting domains where class imbalance is present, as shown in Figure 2. In the experiment, we excluded the "Grocery" domain, as it is not related to the other six domains shown in Figure 2. The first group of domains consists of datasets from Beauty, Outdoor_living and Jewelry_&_Watches, while the second group consists of datasets from Office_products, Cell_phones_&_Service and Software. For each experiment, a single domain was used as the target dataset, while the remaining domains in that group were used as the source datasets.

Text Similarity Measure: We use the Wilcoxon signed-rank test [44] to compare the frequency counts from both datasets to determine if they have a statistically similar distribution. This was done by extracting all words (retaining repeats) from each sample of our source training set, ignoring stop words. From the target set, we sampled (with replacement) 1000 samples, as done by Madabushi et al. [38]. Then, we used the word frequencies from each source training sample and the target sample to calculate the p-value using the Wilcoxon signed-rank test.

Model: We used the BERT uncased model for this task. It uses 768-dimensional hidden representations, 12 transformer-block layers and 110 million parameters. We added a fully connected layer on top of the BERT self-attention layers to classify each review. For the parameters, we adopted hyperparameters similar to those used with the BERT uncased model for Amazon sentiment analysis [50]. These include the Adam optimiser with various learning rates (the final learning rate was 1e-05), a maximum sequence length of 512, and five epochs. The model was first built using the source dataset to derive a source model. Then, this source model was fine-tuned with the target datasets using a commonly used split ratio (30:70) [51]: the training sets of the target data were used to fine-tune the source model before it was tested on the test sets. We ran 10 experiments to compute the estimated risk of the different methods, and the average is reported.
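The Hugging Face Transformers sketch below mirrors the described setup (BERT uncased with a classification layer, an Adam-style optimiser at 1e-05, 512-token sequences); the toy batch is illustrative, and the CUCW loss components described above are omitted.

```python
# Minimal sketch of the BERT uncased classifier setup (one training step).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

batch = tokenizer(["great product", "broke after a day"],
                  padding=True, truncation=True, max_length=512,
                  return_tensors="pt")
labels = torch.tensor([1, 0])                # positive, negative
outputs = model(**batch, labels=labels)      # cross-entropy loss by default
outputs.loss.backward()
optimizer.step()
```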
Evaluation measures: To evaluate the prediction accuracy of each modelling approach, the following were computed:

• Balanced accuracy (BAUC): BAUC measures model performance while taking class imbalance into account, and it also overcomes bias in binary cases [52]. The balanced accuracy is computed as the average of the proportions of correct predictions for each class separately.

• F-measure: This is used for evaluating binary classification models based on the predictions made for the positive class [52].
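Both measures are available in scikit-learn; a minimal sketch with toy labels:

```python
# Minimal sketch of the two evaluation measures.
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]

bauc = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recalls
f_measure = f1_score(y_true, y_pred)            # F1 on the positive class
```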
5. Results

Here, we compare our systematic approach against three different strategies proposed for tackling NT. These strategies were:

• Predictive distribution matching (PDM) [13]: an instance-based weighting approach. This method works by first measuring the differing predictive distributions of the target domain and the related source domains. A PDM-regularised classifier is then used to infer pseudo-labelled target data, which helps to identify the relevant source data so as to correctly align their predictive distributions [13]. We used the support vector machine (SVM) variant of PDM, as used for the sentiment analysis task in the work of Seah et al. [13].

• Mutual Mean-Teaching (MMT) [19]: a feature transferability approach which uses a soft loss function based on soft pseudo-labels and is carried out in two stages. In the first stage, the BERT uncased model was trained using the source domain to derive a source model. This source model learns a feature transformation function that maps each input sample into a feature representation. For this experiment, the source model was trained with a classification loss and a triplet loss to separate features belonging to different identities, as in the original paper [19]. Next, the source model trained in stage 1 was optimised using the MMT framework, which is based on a clustering method; the details of this approach are explained in the original paper [19].

• Dual Transfer Learning (DTL) [15]: this approach enhances feature transferability through the use of a latent feature. It simultaneously learns the marginal and conditional distributions and exploits their duality. For this experiment, training was done using the BERT uncased model by combining the source and target training data before testing on the target dataset.

Table 2
BAUC of the fine-tuned BERT uncased model and different baseline methods on the Amazon review dataset

Group    Target                  CUCW   CUCW            CUCW                 CUCW               PDM    MMT    DTL
                                        (no epistemic)  (no imp. sampling)   (no cost weight)
Group 1  Outdoor living          0.956  0.925           0.942                0.806              0.798  0.810  0.779
         Beauty                  0.935  0.902           0.921                0.824              0.690  0.745  0.767
         Jewelry & Watches       0.931  0.912           0.928                0.776              0.644  0.763  0.745
Group 2  Cellphones & services   0.976  0.956           0.966                0.875              0.789  0.886  0.819
         Software                0.965  0.949           0.931                0.854              0.776  0.845  0.788
         Office_products         0.957  0.945           0.944                0.818              0.778  0.823  0.799

Table 3
F-measure of the fine-tuned BERT uncased model and different baseline methods on the Amazon review dataset

Group    Target                  CUCW   CUCW            CUCW                 CUCW               PDM    MMT    DTL
                                        (no epistemic)  (no imp. sampling)   (no cost weight)
Group 1  Outdoor living          0.945  0.911           0.903                0.800              0.778  0.808  0.756
         Beauty                  0.922  0.899           0.909                0.799              0.665  0.716  0.733
         Jewelry & Watches       0.898  0.886           0.886                0.730              0.616  0.742  0.709
Group 2  Cellphones & services   0.965  0.931           0.961                0.832              0.742  0.835  0.799
         Software                0.949  0.919           0.925                0.818              0.754  0.778  0.718
         Office_products         0.949  0.927           0.939                0.832              0.731  0.809  0.817

Table 4
BAUC of the non-fine-tuned BERT uncased method and different baseline methods on the Amazon review dataset

Group    Target                  CUCW   CUCW            CUCW                 CUCW               PDM    MMT    DTL
                                        (no epistemic)  (no imp. sampling)   (no cost weight)
Group 1  Outdoor living          0.887  0.845           0.864                0.799              0.798  0.810  0.779
         Beauty                  0.935  0.902           0.921                0.824              0.690  0.745  0.767
         Jewelry & Watches       0.853  0.821           0.843                0.734              0.644  0.763  0.745
Group 2  Cellphones & services   0.939  0.909           0.920                0.840              0.789  0.886  0.819
         Software                0.915  0.898           0.878                0.819              0.776  0.845  0.788
         Office_products         0.919  0.888           0.865                0.843              0.778  0.823  0.799

Table 5
F-measure of the non-fine-tuned BERT uncased method and different baseline methods on the Amazon review dataset

Group    Target                  CUCW   CUCW            CUCW                 CUCW               PDM    MMT    DTL
                                        (no epistemic)  (no imp. sampling)   (no cost weight)
Group 1  Outdoor living          0.881  0.844           0.829                0.770              0.778  0.808  0.756
         Beauty                  0.899  0.865           0.834                0.788              0.665  0.716  0.733
         Jewelry & Watches       0.822  0.808           0.789                0.710              0.616  0.742  0.709
Group 2  Cellphones & services   0.887  0.858           0.887                0.787              0.742  0.835  0.799
         Software                0.878  0.844           0.868                0.819              0.754  0.778  0.718
         Office_products         0.843  0.822           0.829                0.709              0.731  0.809  0.817

In Tables 2 and 3, we report the fine-tuned models' performance (balanced accuracy and F-measure) on the target test set. In cases where NT occurred (i.e., where the degree of NT computed using Equation 1 was positive), we denote the accuracy in red. From Table 2, the results indicate that our proposed approach with fine-tuning, the other components, and both uncertainties (heteroscedastic aleatoric and epistemic) in the loss function outperformed the other three models. To disentangle the contribution of each component in our proposed approach, we report the results obtained by removing each component in turn. When epistemic uncertainty or cost weighting was excluded from the loss function, we noticed three cases (i.e., with outdoor living, cell phones & service, and office products as the target datasets) where the MMT method outperformed our approach. A similar outcome was noted for the F-measure, as shown in Table 3. To further disentangle the contribution of all components without fine-tuning the BERT model, and to provide a fair comparison with the three methods we compared against, we combined the source and target training data to train our BERT model before testing on the target test data. This was done to remove the benefit of the fine-tuning component in our design. The results in Tables 4 and 5 show that, without the fine-tuning component, we were still able to improve performance when all other components were integrated in our deep transfer learning, but with less improvement (i.e., an improvement in BAUC and F-measure of 2% to 9%, as shown in Tables 4 and 5).

6. Discussion

In our sentiment analysis experiments (see Tables 2 to 5), our proposed method, which incorporated both uncertainties, improved the balanced accuracy of the BERT model by 5% to 14% and the F-measure by 5% to 10% compared to techniques that are instance-based [13] or feature transferability-based [19, 15]. Although instance-level transferability enhancement has been used in deep learning models to prevent NT [11, 53], such methods do not handle target data quality, a factor shown to be one of the causes of NT [7]. The PDM method that we compared against tackles the domain divergence issue by using predictive distribution matching to remove irrelevant sources, but it still fails to address target data quality; hence, we noted a single case of NT in our NLP task results (when the outdoor living domain was used as the target dataset). Although the MMT method uses a softmax loss function based on soft pseudo-labels to tackle target data quality, it cannot tackle the domain divergence issue, which may also lead to NT; a single case of NT (again with outdoor living as the target dataset) was also noted when using this method. On the other hand, our proposed method is more robust: it uses the uncertainty-guided loss function to tackle the target and source data quality issues, and importance sampling and cost-weighted learning to tackle the domain divergence problem. For the fine-tuning process, we use a small target sample as the validation data for the source model to improve the transferability of the final model. Our results show that the final model is improved when we introduce an uncertainty-guided loss function to guide the training process while utilising the source and target datasets, and incorporate a cost weight to tackle the problem of imbalanced data. In the work of Grauer and Pei [22], it was likewise noted that when model uncertainty is known and distributed evenly, the performance and reliability of the model are greatly improved. Hence, this work uses the idea of model and data transferability enhancement to develop a more robust approach aimed at preventing negative transfer. The evidence from our results suggests that a systematic approach such as the one proposed in this paper can improve the quality of models in a deep transfer learning setting. Also, it is worth noting that two of the methods we compared against (MMT and DTL) also use the BERT uncased model; hence, we are able to eliminate the interference of model complexity in the comparison. From the ablation study, model fine-tuning improved the overall performance by 2% to 6% when integrating all components into our approach.

7. Addressing threats to validity

The experimental dataset was compiled by Lakkaraju et al. [46]. We acknowledge threats relating to errors in the review labels; these threats have been minimised by experimenting with different projects in the datasets. Also, we concede that there are a few uncontrolled factors that may have impacted the experimental results in this study. For instance, there could have been unexpected faults in the implementation of the approaches we compare against in this paper [54]. We sought to reduce such threats by using the source code provided for these methods (i.e., PDM, MMT and DTL). While we recognise the threats above, we anticipate that our study still contributes novel findings to transfer-based modelling for recommendation systems in NLP domains relying on latent sentiment information.

8. Conclusion

In this work, we proposed a systematic approach to overcoming negative transfer by tackling domain divergence while taking account of source and target data quality. Our approach involves cost-weighted learning, an uncertainty-guided loss function over the target dataset, and the concept of importance sampling to derive a robust model. This systematic approach improves performance in the target domain. The results reported in this work also reveal that when both aleatoric heteroscedastic and epistemic uncertainty are combined, we can further enhance the performance of the target model. We therefore assert that our systematic approach is a good approach for overcoming negative transfer and improving target model performance when performing sentiment analysis in a transfer learning setting, and it can be used to build an effective recommendation system that includes latent sentiment information. A plausible next step is to use such an approach to design an effective recommendation system that takes into account the latent sentiment information. Although our experiments showed that our approach improves target model performance and prevents NT in sentiment analysis, it is still important to investigate this approach in other domains.

Acknowledgements

This research was partly supported by an Internal Research fund from Manaaki Whenua — Landcare Research, New Zealand. Special thanks are given to the Department of Informatics at Landcare Research for their ongoing support.
References

[1] P. Cremonesi, A. Tripodi, R. Turrin, Cross-domain recommender systems, in: 2011 IEEE 11th International Conference on Data Mining Workshops, IEEE, 2011, pp. 496–503.
[2] T. Zang, Y. Zhu, H. Liu, R. Zhang, J. Yu, A survey on cross-domain recommendation: taxonomies, methods, and future directions, arXiv preprint arXiv:2108.03357 (2021).
[3] Y. Wang, H. Yu, G. Wang, Y. Xie, Cross-domain recommendation based on sentiment analysis and latent feature mapping, Entropy 22 (2020) 473.
[4] B. Zadrozny, Learning and evaluating classifiers under sample selection bias, in: Proceedings of the twenty-first international conference on Machine learning, 2004, p. 114.
[5] P. Li, A. Tuzhilin, Ddtcdr: Deep dual transfer cross domain recommendation, in: Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 331–339.
[6] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (2009) 1345–1359.
[7] W. Zhang, L. Deng, L. Zhang, D. Wu, A survey on negative transfer, 2020. URL: https://arxiv.org/abs/2009.00909. doi:10.48550/ARXIV.2009.00909.
[8] M. Rosenstein, Z. Marx, L. Kaelbling, T. G. Dietterich, To transfer or not to transfer, in: NIPS 2005 Workshop on Transfer Learning, 2005.
[9] Z. Wang, Z. Dai, B. Póczos, J. Carbonell, Characterizing and avoiding negative transfer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11293–11302.
[10] O. P. Omondiagbe, S. Licorish, S. G. MacDonell, Improving transfer learning for cross project defect prediction, TechRxiv preprint techrxiv.19517029 (2022).
[11] Z. Wang, J. Carbonell, Towards more reliable transfer learning, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2018, pp. 794–810.
[12] E. Eaton, et al., Selective transfer between learning tasks using task-based boosting, in: Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[13] C.-W. Seah, Y.-S. Ong, I. W. Tsang, Combating negative transfer from predictive distribution differences, IEEE Transactions on Cybernetics 43 (2012) 1153–1165.
[14] D. Wu, Pool-based sequential active learning for regression, IEEE Transactions on Neural Networks and Learning Systems 30 (2018) 1348–1359.
[15] M. Long, J. Wang, G. Ding, W. Cheng, X. Zhang, W. Wang, Dual transfer learning, in: Proceedings of the 2012 SIAM International Conference on Data Mining, SIAM, 2012, pp. 540–551.
[16] X. Wang, Y. Jin, M. Long, J. Wang, M. I. Jordan, Transferable normalization: Towards improving transferability of deep neural networks, Advances in Neural Information Processing Systems 32 (2019).
[17] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks, arXiv preprint arXiv:1706.06083 (2017).
[18] L. Gui, R. Xu, Q. Lu, J. Du, Y. Zhou, Negative transfer detection in transductive transfer learning, International Journal of Machine Learning and Cybernetics 9 (2018) 185–197.
[19] Y. Ge, D. Chen, H. Li, Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification, arXiv preprint arXiv:2001.01526 (2020).
[20] K. Liang, J. Y. Zhang, O. O. Koyejo, B. Li, Does adversarial transferability indicate knowledge transferability?, 2020.
[21] Z. Deng, L. Zhang, K. Vodrahalli, K. Kawaguchi, J. Y. Zou, Adversarial training helps transfer learning via better representations, Advances in Neural Information Processing Systems 34 (2021).
[22] J. A. Grauer, J. Pei, Minimum-variance control allocation considering parametric model uncertainty, in: AIAA SCITECH 2022 Forum, 2022, p. 0749.
[23] R. Caruana, D. Silver, J. Baxter, T. Mitchell, L. Pratt, S. Thrun, Learning to learn: knowledge consolidation and transfer in inductive systems, in: Workshop held at NIPS-95, Vail, CO, 1995. See http://www.cs.cmu.edu/afs/user/caruana/pub/transfer.html.
[24] M. Sugiyama, M. Krauledat, K.-R. Müller, Covariate shift adaptation by importance weighted cross validation, Journal of Machine Learning Research 8 (2007).
[25] J. Huang, A. Gretton, K. Borgwardt, B. Schölkopf, A. Smola, Correcting sample selection bias by unlabeled data, Advances in Neural Information Processing Systems 19 (2006).
[26] T. Jebara, Multi-task feature and kernel selection for svms, in: Proceedings of the twenty-first international conference on Machine learning, 2004, p. 55.
[27] S. Uguroglu, J. Carbonell, Feature selection for transfer learning, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2011, pp. 430–442.
[28] L. Mihalkova, R. J. Mooney, Transfer learning by mapping with minimal target data, in: Proceedings of the AAAI-08 workshop on transfer learning for complex tasks, 2008.
[29] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, Advances in Neural Information Processing Systems 27 (2014).
[30] B. Sun, K. Saenko, Deep coral: Correlation alignment for deep domain adaptation, in: European Conference on Computer Vision, Springer, 2016, pp. 443–450.
[31] M. Long, Y. Cao, J. Wang, M. Jordan, Learning transferable features with deep adaptation networks, in: International Conference on Machine Learning, PMLR, 2015, pp. 97–105.
[32] G. K. Dziugaite, D. M. Roy, Z. Ghahramani, Training generative neural networks via maximum mean discrepancy optimization, arXiv preprint arXiv:1505.03906 (2015).
[33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in Neural Information Processing Systems 27 (2014).
[34] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, R. Chellappa, Generate to adapt: Aligning domains using generative adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8503–8512.
[35] W. Y. Wang, S. Singh, J. Li, Deep adversarial learning for nlp, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, 2019, pp. 1–5.
[36] M. Rajesh, J. Gnanasekar, Annoyed realm outlook taxonomy using twin transfer learning, International Journal of Pure and Applied Mathematics 116 (2017) 549–558.
[37] M. Kukar, I. Kononenko, et al., Cost-sensitive learning with neural networks, in: ECAI, volume 15, Citeseer, 1998, pp. 88–94.
[38] H. T. Madabushi, E. Kochkina, M. Castelle, Cost-sensitive bert for generalisable sentence classification with imbalanced data, arXiv preprint arXiv:2003.11563 (2020).
[39] A. Katharopoulos, F. Fleuret, Not all samples are created equal: Deep learning with importance sampling, in: International Conference on Machine Learning, PMLR, 2018, pp. 2525–2534.
[40] C. P. Robert, G. Casella, Monte Carlo statistical methods, volume 2, Springer, 1999.
[41] A. Kendall, Y. Gal, What uncertainties do we need in bayesian deep learning for computer vision?, Advances in Neural Information Processing Systems 30 (2017).
[42] Q. V. Le, A. J. Smola, S. Canu, Heteroscedastic gaussian process regression, in: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 489–496.
[43] A. Kilgarriff, Comparing corpora, International Journal of Corpus Linguistics 6 (2001) 97–133.
[44] F. Wilcoxon, Probability tables for individual comparisons by ranking methods, Biometrics 3 (1947) 119–122.
[45] K. E. Bennin, J. Keung, P. Phannachitta, A. Monden, S. Mensah, Mahakil: Diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction, IEEE Transactions on Software Engineering 44 (2017) 534–550.
[46] H. Lakkaraju, J. McAuley, J. Leskovec, What's in a name? understanding the interplay between titles, content, and communities in social media, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 7, 2013, pp. 311–320.
[47] A. S. AlQahtani, Product sentiment analysis for amazon reviews, International Journal of Computer Science & Information Technology (IJCSIT) 13 (2021).
[48] A. F. Anees, A. Shaikh, A. Shaikh, S. Shaikh, Survey paper on sentiment analysis: Techniques and challenges, EasyChair 2516-2314 (2020).
[49] D. D. Palmer, Tokenisation and sentence segmentation, Handbook of Natural Language Processing (2000) 11–35.
[50] M. Geetha, D. K. Renuka, Improving the performance of aspect based sentiment analysis using fine-tuned bert base uncased model, International Journal of Intelligent Networks 2 (2021) 64–69.
[51] I. H. Witten, E. Frank, M. A. Hall, C. J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, San Francisco, 2016. doi:10.1016/c2009-0-19715-5.
[52] T. Menzies, J. Greenwald, A. Frank, Data mining static code attributes to learn defect predictors, IEEE Transactions on Software Engineering 33 (2006) 2–13.
[53] Y. Xu, H. Yu, Y. Yan, Y. Liu, et al., Multi-component transfer metric learning for handling unrelated source domain samples, Knowledge-Based Systems 203 (2020) 106132.
[54] E. A. Felix, S. P. Lee, Predicting the number of defects in a new software version, PloS one 15 (2020) e0229131.