Preventing Negative Transfer on Sentiment Analysis in
Deep Transfer Learning
Osayande P. Omondiagbe1,2,∗ , Sherlock A. Licorish2 and Stephen G. MacDonell2,3
1 Department of Informatics, Landcare Research, Lincoln, New Zealand
2 Department of Information Science, University of Otago, Dunedin, New Zealand
3 Software Engineering Research Lab, Auckland University of Technology, Auckland, New Zealand


Abstract

Data sparsity is a challenge facing most modern recommendation systems. With cross-domain recommendation techniques, one can overcome data sparsity by leveraging knowledge from relevant domains. This approach can be further enhanced by considering latent sentiment information. However, as this latent sentiment information is derived from both relevant and irrelevant sources, the performance of the recommendation system may decline. This is a negative transfer (NT) problem, where knowledge derived from multiple sources harms the system. Also, these source domains are often imbalanced, which can further hurt the performance of the recommendation system. Recent research has shown that NT is caused by domain divergence, source and target data quality, and algorithms that are not carefully designed to utilise the target data to improve domain transferability. While various works have been proposed to prevent NT, they address only some of the factors that may lead to NT. In this paper, we propose a more systematic and comprehensive approach to overcoming NT in sentiment analysis by tackling the main causes of NT. Our approach combines cost-weighting learning, an uncertainty-guided (aleatoric and epistemic) loss function over the target dataset, and the concept of importance sampling, to derive a robust model. Experimental results on a sentiment analysis task using Amazon review datasets validate the superiority of our proposed method when compared to three other state-of-the-art methods. To disentangle the contributions behind the success of both uncertainties, we conduct an ablation study exploring the effect of each module in our approach. Our findings reveal that we can improve a sentiment analysis task in a transfer learning setting by 4% to 10% when combining both uncertainties. Our outcomes show the importance of considering all factors that may lead to NT. These findings can help to build an effective recommendation system that includes latent sentiment information.

Keywords

Transfer learning, neural networks, BERT, uncertainty



DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
∗ Corresponding author.
Email: omondiagbep@landcarereserach.co.nz (O. P. Omondiagbe); sherlock.licorish@otago.ac.nz (S. A. Licorish); stephen.macdonell@aut.ac.nz (S. G. MacDonell)
ORCID: 0000-0002-9267-4832 (O. P. Omondiagbe); 0000-0001-7318-2421 (S. A. Licorish); 0000-0002-2231-6941 (S. G. MacDonell)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

Generally, recommendation systems are used in commercial applications to help users discover the products or services they are looking for. To solve the lack of data and the cold-start¹ problem, researchers have increasingly introduced the concepts of source domain and target domain into cross-domain recommendation [1]. Through the use of transfer learning, cross-domain recommendation is able to leverage the rich information available across multiple domains, rather than a single domain, and to transfer knowledge effectively from one domain to another. For cross-domain recommendation to work, however, users' interests or item features must be consistent or correlated across domains [1].

Most existing cross-domain recommendation methods rely only on shared text information, such as ratings, tags or reviews, and ignore the latent sentiment information studied in the sentiment analysis domain [2]. Recently, methods that consider this latent sentiment information have proven more effective than existing recommendation algorithms that do not [3]. This is because user reviews are usually subjective, so shared text information alone cannot reflect a user's preferences and sentiments towards different attributes.

As these sentiment data are derived from both relevant and irrelevant sources, and the datasets are often imbalanced, the performance of these cross-domain recommendation systems may decline due to learning a bias [4]. Also, these cross-domain models do not take into account the bidirectional latent relations between users and items [5]. A better solution to this problem is to introduce transfer learning (TL) [6] into the cross-domain recommendation system [5]. TL systems utilise data and knowledge from a related domain (known as the source domain) to mitigate this learning bias, and can improve the generalizability of models in the target domain [6].

Regrettably, this approach is not always successful unless specific guidelines are adhered to [7]: 1) both tasks should be related; 2) the source and target domains should be similar; and 3) a model that can learn both domains should be applied to both the source and target datasets. When these guidelines are not followed, the performance of the target model is likely to degrade. This is known as negative transfer (NT) [8]. NT can be caused by four main issues [7]. One: Domain divergence. When the divergence between the source and target domains is wide, NT will occur. Two: Transfer algorithm. A transfer algorithm should have a theoretical guarantee that performance in the target domain will be better when auxiliary data are used, or it should be carefully designed to improve the transferability of auxiliary domains; otherwise NT may occur. Three: Source data quality. The quality of the source data determines the quality of the transferred knowledge; if the source data are very noisy, then a model trained on them is unreliable. Four: Target data quality. The target domain data may be noisy, which may also lead to NT. Also, the amount of labelled target data has a direct impact on the learning process if not fully utilised by the learning algorithm [9, 10].

Various research works have proposed ways to mitigate NT, and these fall into the following areas [7]. One: Enhancing the data transferability strategy [11, 7], done either by addressing the domain divergence between the source and target [12, 11], by a reweighing strategy that applies more weight to those source domains which are similar to the target dataset [13, 14], or by learning a common latent feature space [15]. Two: Enhancing the model transferability through transferable normalisation [16], or by making the model robust to adversarial samples through the use of a robust optimisation loss function [17]. Three: Enhancing the target prediction through the use of pseudo-labelling [18, 19].

Previous research found that the use of a model that is robust to adversarial samples results in better transferability [20, 21]; such models tend to have better accuracy than a standard target model. Similarly, Liang et al. [20] found a positive correlation between a model's robustness to adversarial samples and the knowledge transferred, suggesting that such a model can benefit from the knowledge transfer between the source and target. By relying on such methods, however, these approaches can be limited to being robust to adversarial samples and fail to model uncertainty under the data and label distribution, which could introduce further bias [22]. Recently, the work of Grauer and Pei [22] has shown that when model uncertainty is known and distributed evenly, the performance and reliability of the model are greatly improved.

In this work, we introduce the use of an uncertainty-guided loss function to guide the training process when utilising the source and target datasets, and incorporate a cost weight to tackle the problem of imbalanced data that may further increase the domain divergence issue. Hence, this work uses the idea of model and data transferability enhancement to develop a more robust model aimed at preventing negative transfer. By using such a systematic approach, we are able to tackle the four main causes of NT mentioned above. Our main contributions are summarised as follows.

• We propose using a combined uncertainty as a loss function. This combined uncertainty consists of both the aleatoric and epistemic uncertainties. The epistemic uncertainty captures the model uncertainty, while the aleatoric uncertainty captures the uncertainty concerning information that the data cannot explain, and is modelled over the target and source datasets to guide the learning process. By using the aleatoric uncertainty-guided loss function over the target and source data, we can derive more information and enhance the model's transferability.

• We propose combining an uncertainty-guided loss function, a cost-sensitive classification method that incorporates cost-weighting into the model, and an importance sampling strategy to enhance the data and model transferability. This method can be used when there is imbalanced data and/or dissimilarity between the source and target datasets.

• Finally, we perform an ablation study to disentangle the contributions behind the success of each module introduced in our system.

The remainder of this paper is organised as follows. We present related work in Section 2. Next, we introduce our proposed approach in Section 3. Section 4 presents our datasets, candidate models, and experimental setup. The results and discussion are presented in Sections 5 and 6, respectively, before considering threats to validity in Section 7. Finally, we conclude the study in Section 8.

¹ A problem where the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information.

2. Related Work

Transfer learning is a research strategy in machine learning (ML) that aims to use the knowledge gained while solving one problem and apply it to a different but related problem [23]. Early methods in this area exploited techniques such as instance weighting [24, 25], feature mapping [26, 27] and transferring relational knowledge [28]. Due to the increased processing power afforded by graphical processing units (GPUs), deep learning is now used more frequently in transfer learning tasks and, when compared to earlier approaches, such models have
achieved better results in discovering domain-invariant features [29]. It was shown that when deep learning is used, the transferability of features decreases as the distance between the base task and target task increases, but that transferring features even from distant tasks can be better than using random features [29]. Some of these deep learning methods [30, 31, 32] have exploited mismatch measurements, such as Maximum Mean Discrepancy (MMD), to transfer features, or have used generative adversarial networks (GANs) [33]. Although these methods have all achieved high performance in different domains, such as computer vision [34] and natural language processing [35], they were not designed to tackle the problem of negative transfer (NT).

Other prominent lines of work in deep learning tackle the issue of NT. These works include the use of instance weighting (e.g., predictive distribution matching (PDM) [13]), enhancing feature transferability through the use of a latent feature (e.g., DTL [36]), and the use of a soft loss function based on soft pseudo-labels (e.g., Mutual Mean-Teaching (MMT) [19]). These methods do not guarantee tackling NT, as they address some causes of NT, but not all (e.g., the PDM method tackles the transfer algorithm and source data quality issues, while MMT tackles the domain divergence, transfer algorithm and target data quality issues). Although a previous study exploring the benefits of modelling epistemic and aleatoric uncertainty in Bayesian deep learning models for vision tasks demonstrated that integrating these uncertainties into the loss function makes the model more robust to noisy data, how they can be used to tackle NT has not been examined. Hence, our main objective in this paper is to derive a robust loss function for deep transfer learning that tackles the causes of NT mentioned in Section 1.

3. Method

This section provides a formal definition of NT and our proposed methods to overcome it.

3.1. Negative Transfer

Notation: We use the notation P_s(x_s) ≠ P_t(x_t) and P_s(y_s|x_s) ≠ P_t(y_t|x_t) to denote the marginal and conditional distributions of the source and target sets, respectively. In this case, x_s and x_t represent the source and target, respectively. Zhang et al. [7] gave a mathematical definition of NT, and proposed a way to determine the degree of NT (NTD) when it happens.

Definition: Let ε be the test error in the target domain, θ(S, T) a TL algorithm between source (S) and target (T), and θ(∅, T) the same algorithm which does not use the source domain information at all. Then, NT happens when ε(θ(S, T)) > ε(θ(∅, T)), and the degree of NT can be evaluated by Equation 1 below:

    NTD = ε(θ(S, T)) − ε(θ(∅, T))    (1)

When NTD is positive, negative transfer has occurred. Next, we propose a systematic way to avoid negative transfer.
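To make the definition concrete, the sketch below computes Equation 1 in Python; the two error values are hypothetical test errors for the transfer and target-only models.

```python
def negative_transfer_degree(err_transfer: float, err_target_only: float) -> float:
    """Equation 1: NTD = error of theta(S, T) minus error of theta(empty, T)."""
    return err_transfer - err_target_only

# Hypothetical test errors measured on the same target test set.
ntd = negative_transfer_degree(err_transfer=0.18, err_target_only=0.15)
print(f"NTD = {ntd:.2f}; negative transfer occurred: {ntd > 0}")
```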
3.2. Proposed Methods

We explain the three concepts used in our method below.

Cost-sensitive Classification: The idea of cost-sensitive classification is used when there is a higher cost to mislabelling one class than the other [37]. Cost-sensitive learning tackles the class imbalance problem by changing the model's cost function, giving more weight to the minority class and multiplying the loss of each training sample by a specific factor; the imbalanced data distribution itself is not modified during training [37]. Madabushi et al. [38] introduced a cost-weighting strategy in the BERT model, which increases the weight of incorrectly labelled sentences by altering the cost function of the final model layer. The cost function is changed by modifying the cross-entropy loss for a single prediction x and the model's prediction for class k to accommodate an array of weights, as shown in Equation 2:

    loss(x, class) = weight[class] · φ    (2)

    where φ = −x[class] + log(Σₖ exp(x[k]))
Importance Sampling: The traditional way of training a deep learning model has one major drawback: it does not differentiate between samples on which it performs very well (low loss) and samples on which performance is poor (high loss) [39]. Also, as not all source samples can provide useful knowledge [39], we introduce the idea of importance sampling to control which examples should be given more priority. Importance sampling [40] is a variance reduction technique in which a random sample is drawn from a set according to a probability distribution over the elements of the group. In our proposed method, we attach weights to the source training examples based on their similarity to the target dataset. The samples with more weight have a higher chance of being selected; in effect, we sample the source from a probability density over the target data.
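A minimal sketch of this sampling step, assuming each source example has already been assigned a similarity-based weight; examples with larger weights are drawn more often.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical similarity-based weights for five source examples.
weights = np.array([0.9, 0.1, 0.6, 0.3, 0.8])
probs = weights / weights.sum()  # normalise into a sampling distribution

# Draw a training batch: target-like (high-weight) examples dominate.
batch_indices = rng.choice(len(weights), size=3, replace=True, p=probs)
```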
Uncertainty Quantification: There are different types of uncertainty, which can be present in the data or the model. When the uncertainty is derived from the model, it is referred to as "epistemic" or model uncertainty [41]. Epistemic uncertainty captures ignorance about the model generated from the collected data, and can be explained away when more data are given to the model [41]; it is a property of the model. When the uncertainty is related to the data, it is referred to as aleatoric uncertainty [41]. It captures the uncertainty concerning information that the data cannot explain, and can be further divided into two:

• Heteroscedastic uncertainty, which depends on the input data and is predicted as a model output [41].

• Homoscedastic uncertainty, which is not input-data dependent but is assumed constant for all input data and varies between different tasks [42].

In this case, we are not interested in the homoscedastic uncertainty because we assume related tasks between the source and target. To learn the heteroscedastic uncertainty, the loss function can be replaced with the following [41]:

    Loss = ||y − ŷ||² / (2σ²) + (1/2) log σ²    (3)

where the model predicts a mean ŷ and variance σ².

Kendall and Gal [41] proposed a loss function that combines both epistemic and aleatoric (heteroscedastic) uncertainty as follows:

    Loss = (1/D) Σᵢ exp(−log σ²) ||y − ŷ||² + (1/2) log σ²    (4)

where D is the total number of outputs and σ² is the variance.
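A sketch of Equations 3 and 4, assuming the network predicts a log-variance alongside each output (predicting log σ² rather than σ² keeps the loss numerically stable, as noted by Kendall and Gal [41]).

```python
import torch

def heteroscedastic_loss(y, y_hat, log_var):
    """Equation 3 with log_var = log(sigma^2):
    ||y - y_hat||^2 / (2 sigma^2) + 0.5 log(sigma^2)."""
    return (0.5 * torch.exp(-log_var) * (y - y_hat) ** 2 + 0.5 * log_var).mean()

def combined_uncertainty_loss(y, y_hat, log_var):
    """Equation 4: averages exp(-log sigma^2) ||y - y_hat||^2 + 0.5 log(sigma^2)
    over the D outputs."""
    return (torch.exp(-log_var) * (y - y_hat) ** 2 + 0.5 * log_var).mean()

# Toy usage: y_hat and log_var would both be heads of the same network.
y = torch.tensor([1.0, 0.0, 1.0])
y_hat = torch.tensor([0.8, 0.2, 0.9])
log_var = torch.zeros(3)  # sigma^2 = 1 for every output
loss = combined_uncertainty_loss(y, y_hat, log_var)
```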
Our Proposed Approach: To derive our proposed loss function, which can enhance the data and model transferability, we combine Equations 2 and 3 when incorporating heteroscedastic uncertainty, and Equations 2 and 4 when incorporating both epistemic and heteroscedastic uncertainty. To determine the similarity of a sample, we use the method proposed by Kilgarriff [43]: the Wilcoxon signed-rank test [44] is used to compare the frequency counts from both datasets to determine if they have a statistically similar distribution. To overcome the divergence problem, we use the importance sampling technique in our training process. The pseudocode for our proposed method is given in Algorithm 1.

Algorithm 1: Combined Uncertainty Loss Function and Cost-Weighting (CUCW)
Input:
  • Source model: g(x)
  • Source training set Str
  • Target training set Ttr
  • Target validation set Tv
  • Target testing set Tts
Output: Degree of negative transfer (NTD)
  1. Estimate the similarity of each source sample against 1000 random target samples.
  2. Estimate importance weights with importance sampling based on the similarity.
  3. Train a source model g using the importance weights, with a small target sample as the validation data Tv.
  4. Compute the loss function using Equations 2 and 3, or Equations 2 and 4.
  5. Compute the test error ε(θ(S, T)) of model g using the target test set Tts.
  6. Train a target model t with the target data Ttr only.
  7. Compute the test error ε(θ(∅, T)) of model t using the target test set Tts.
  8. Calculate NTD = ε(θ(S, T)) − ε(θ(∅, T)).
  9. Fine-tune model g using the target training set Ttr and target validation set Tv to derive a new model tg.
  10. Compute the test error of model tg using the target test set Tts.
  11. Return the degree of negative transfer (NTD) and the model performance.

Based on the algorithm above, we can employ deep transfer learning using the proposed approach to find an optimal model with the least degree of negative transfer. This is done by following the steps in sequential order; for each step, we can find the best model by training different hyperparameters in our model.
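The control flow of Algorithm 1 condenses to the sketch below; `importance_weights`, `train`, `fine_tune` and `test_error` are hypothetical helpers standing in for the training code described above.

```python
def cucw(source_train, target_train, target_val, target_test):
    # Steps 1-2: similarity-based importance weights for the source data.
    weights = importance_weights(source_train, target_train)  # hypothetical helper

    # Steps 3-5: train the source model g with the cost-weighted,
    # uncertainty-guided loss, then measure its target test error.
    g = train(source_train, sample_weights=weights, val_set=target_val)
    err_transfer = test_error(g, target_test)

    # Steps 6-7: target-only baseline t.
    t = train(target_train, val_set=target_val)
    err_target_only = test_error(t, target_test)

    # Step 8: Equation 1.
    ntd = err_transfer - err_target_only

    # Steps 9-11: fine-tune g on the target data and report.
    tg = fine_tune(g, target_train, val_set=target_val)
    return ntd, test_error(tg, target_test)
```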
4. Experiments

All experiments were conducted 10 times, as in the work of Bennin et al. [45], to reduce the impact of bias, and the results were averaged across all independent runs. For our sentiment analysis task, we use the Amazon review dataset. We aim to build an accurate sentiment analysis model for low-resource domains by learning from high-resource but related domains. We used the smaller version of the datasets prepared by Lakkaraju et al. [46]. These datasets contain 22 domains, as shown in Table 1. It is worth noting that some domains in this dataset are imbalanced, as seen in Figure 1. We ranked reviews with 1 to 3 stars as negative, while reviews with 4 or 5 stars were ranked as positive. For the pre-processing steps, we use standard techniques commonly used in NLP and Amazon sentiment analysis tasks [47, 48], in the following order: tokenisation, stop word/punctuation removal, and lemmatisation. Tokenisation is the process of separating a sentence into a sequence of words known as "tokens" [49]; these tokens are identifiable and separated from each other by a space character. Punctuation and stop words that appear frequently and do not significantly affect meaning (e.g., "the", "is" and "and") were also removed [49]. Our lemmatisation process uses the context from which a word is derived (e.g., "studies" becomes "study"). By lemmatising a word, we reduce its derivationally related forms to a common root form; using the root form, the model is able to learn any inflectional form of that word.
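A sketch of the labelling and pre-processing pipeline described above, using NLTK as one possible toolkit (the paper does not name the library used).

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for resource in ("punkt", "stopwords", "wordnet"):
    nltk.download(resource)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def label(stars: int) -> str:
    # 1-3 stars -> negative; 4-5 stars -> positive.
    return "negative" if stars <= 3 else "positive"

def preprocess(review: str) -> list:
    tokens = word_tokenize(review.lower())                               # tokenisation
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]  # stop word/punctuation removal
    return [lemmatizer.lemmatize(t) for t in tokens]                     # lemmatisation
```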
Table 1
Ratio of negative to positive samples in the Amazon datasets

Domains                     Ratio
Apparel                     0.98
Automotive                  1.00
Baby                        0.91
Beauty                      0.49
Books                       0.89
Camera_&_Photo              0.91
Cell_phones_&_Service       0.58
Computer_&_Video_Games      1.00
Dvd                         0.96
Electronics                 0.91
Grocery                     0.34
Health_&_Personal_Care      0.99
Jewelry_&_Watches           0.29
Kitchen_&_Housewares        0.94
Magazines                   0.97
Music                       1.02
Office_products             0.72
Outdoor_living              0.34
Software                    0.63
Sports_&_Outdoors           0.95
Toy_&_Games                 0.91
Video                       1.24

Figure 1: Amazon review dataset (chart of review counts per domain; image omitted).

Figure 2: Amazon review dataset showing the imbalanced domains (chart of review counts per class; image omitted).

4.1. Experiment Setup

We selected only domains from the Amazon review datasets where class imbalance was evident. To determine the domains to select, the negative to positive ratio is presented in Table 1; only domains with a ratio of less than 0.7 were selected for this experiment. From Table 1, six domains were selected, as shown in Figure 2.

We designed two groups of experiments by selecting domains where class imbalance is present, as shown in Figure 2. In the experiment, we excluded the "Grocery" domain, as this domain is not related to the other six domains shown in Figure 2. The first group consists of datasets from Beauty, Outdoor_living and Jewelry_&_Watches, while the second group consists of datasets from Office_products, Cell_phones_&_Service and Software. For each experiment, a single domain was used as the target dataset, while the remaining domains in that group were used as the source datasets.
Text Similarity Measure: We use the Wilcoxon signed-rank test [44] to compare the frequency counts from both datasets to determine if they have a statistically similar distribution. This was done by extracting all words (retaining repeats) from each sample of our source training set, ignoring the stop words. From the target set, we sampled (with replacement) 1000 samples, as done by Madabushi et al. [38]. Then, we use the word frequencies from each of the source training samples and the sampled target set to calculate the p-value using the Wilcoxon signed-rank test.
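A sketch of this similarity check, assuming paired word-frequency vectors built over the shared vocabulary of a source sample and the sampled target set; SciPy's signed-rank test returns the p-value used as the similarity score.

```python
from collections import Counter
from scipy.stats import wilcoxon

def similarity_pvalue(source_tokens, target_tokens):
    """Wilcoxon signed-rank test on paired word-frequency counts."""
    vocab = sorted(set(source_tokens) | set(target_tokens))
    src, tgt = Counter(source_tokens), Counter(target_tokens)
    x = [src[w] for w in vocab]
    y = [tgt[w] for w in vocab]
    # A high p-value suggests the two frequency distributions are
    # statistically similar. Note: wilcoxon discards zero differences,
    # so identical vectors would need special handling.
    return wilcoxon(x, y).pvalue
```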
Table 2
BAUC of the fine-tuned BERT uncased model and different baseline methods on the Amazon review dataset

Groups   Target                  CUCW    CUCW (no epistemic)   CUCW (no importance sampling)   CUCW (no cost weighting)   PDM     MMT     DTL
Group 1  Outdoor living          0.956   0.925                 0.942                           0.806                      0.798   0.810   0.779
         Beauty                  0.935   0.902                 0.921                           0.824                      0.690   0.745   0.767
         Jewelry & Watches       0.931   0.912                 0.928                           0.776                      0.644   0.763   0.745
Group 2  Cellphones & services   0.976   0.956                 0.966                           0.875                      0.789   0.886   0.819
         Software                0.965   0.949                 0.931                           0.854                      0.776   0.845   0.788
         Office products         0.957   0.945                 0.944                           0.818                      0.778   0.823   0.799




Model: We used the BERT uncased model for this task. It consists of 768-dimensional hidden vectors, 12 transformer-block layers and 110 million parameters. We added a fully connected layer on top of the BERT self-attention layers to classify each review. For the parameters, we adopt hyperparameters similar to those used for the BERT uncased model in Amazon sentiment analysis [50]. These include the Adam optimiser with various learning rates, a maximum sequence length of 512, and five epochs; the learning rate was 1e-05. The model was first built using the source dataset to derive a source model. Then, this source model was fine-tuned with the target datasets, using a commonly used split ratio (30:70) [51]: the training sets of the target data were used to fine-tune the source model before it was tested on the test sets. We ran 10 experiments to compute the estimated risk of the different methods, and the average was reported.
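A sketch of this model set-up, using the Hugging Face Transformers library as one possible implementation (the paper does not state which implementation was used).

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# bert-base-uncased: 12 transformer layers, 768-dimensional hidden states,
# ~110M parameters; the classification head is a fully connected layer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

batch = tokenizer(["Great product!", "Broke after a day."],
                  padding=True, truncation=True, max_length=512,
                  return_tensors="pt")
logits = model(**batch).logits  # one score per sentiment class
```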
Evaluation measures: All experiments were conducted 10 times, as done in the work of Bennin et al. [45], to reduce the impact of sampling bias, and the results were averaged across the independent runs. To evaluate the prediction accuracy of each modelling approach, the following were computed:

• Balanced accuracy (BAUC): BAUC measures model performance while taking class imbalance into account, and it also overcomes bias in binary cases [52]. The balanced accuracy is computed as the average of the proportions of correct predictions for each class separately.

• F-measure: This is used for evaluating binary classification models based on the predictions made for the positive class [52].
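Both measures are available in scikit-learn; a minimal sketch with hypothetical predictions:

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]  # hypothetical ground truth (1 = positive)
y_pred = [1, 0, 1, 0, 0, 1]  # hypothetical model predictions

bauc = balanced_accuracy_score(y_true, y_pred)  # mean per-class recall
f1 = f1_score(y_true, y_pred)                   # F-measure for the positive class
```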
5. Results

Here, we compare our systematic approach against three different strategies proposed for tackling NT. These strategies were:

• Predictive distribution matching (PDM) [13]: This is an instance-based weighting approach. It works by first measuring the differing predictive distributions of the target domain and the related source domains. A PDM-regularised classifier is used to infer the target pseudo-labelled data, which helps to identify the relevant source data so as to correctly align their predictive distributions [13]. We used the support vector machine (SVM) variant of PDM, as used in the sentiment analysis task in the work of Seah et al. [13].

• Mutual Mean-Teaching (MMT) [19]: This is a feature transferability approach which uses a soft loss function based on soft pseudo-labels and is carried out in two stages. In the first stage, the BERT uncased model is trained using the source domain to derive a source model. This source model learns a feature transformation function that transforms each input sample into a feature representation. For this experiment, the source model is trained with a classification loss and a triplet loss to separate features belonging to different identities, as in the original paper [19]. Next, the source model trained in stage 1 is optimised using the MMT framework, which is based on a clustering method. The details of this approach are explained in the original paper [19].

• Dual Transfer Learning (DTL) [15]: This approach enhances feature transferability through the use of a latent feature. It simultaneously learns the marginal and conditional distributions, and exploits their duality. For this experiment, the training was done using the BERT uncased model by combining the source and target training data before testing on the target dataset.
Table 3
F-measure of the fine-tuned BERT uncased model and different baseline methods on the Amazon review dataset

Groups   Target                  CUCW    CUCW (no epistemic)   CUCW (no importance sampling)   CUCW (no cost weighting)   PDM     MMT     DTL
Group 1  Outdoor living          0.945   0.911                 0.903                           0.800                      0.778   0.808   0.756
         Beauty                  0.922   0.899                 0.909                           0.799                      0.665   0.716   0.733
         Jewelry & Watches       0.898   0.886                 0.886                           0.730                      0.616   0.742   0.709
Group 2  Cellphones & services   0.965   0.931                 0.961                           0.832                      0.742   0.835   0.799
         Software                0.949   0.919                 0.925                           0.818                      0.754   0.778   0.718
         Office products         0.949   0.927                 0.939                           0.832                      0.731   0.809   0.817



Table 4
BAUC of the non-fine-tuned BERT uncased model and different baseline methods on the Amazon review dataset

Groups   Target                  CUCW    CUCW (no epistemic)   CUCW (no importance sampling)   CUCW (no cost weighting)   PDM     MMT     DTL
Group 1  Outdoor living          0.887   0.845                 0.864                           0.799                      0.798   0.810   0.779
         Beauty                  0.935   0.902                 0.921                           0.824                      0.690   0.745   0.767
         Jewelry & Watches       0.853   0.821                 0.843                           0.734                      0.644   0.763   0.745
Group 2  Cellphones & services   0.939   0.909                 0.920                           0.840                      0.789   0.886   0.819
         Software                0.915   0.898                 0.878                           0.819                      0.776   0.845   0.788
         Office products         0.919   0.888                 0.865                           0.843                      0.778   0.823   0.799




In Tables 2 and 3, we report the fine-tuned models' performance (balanced accuracy and F-measure) on the target test sets. In cases where NT occurred (i.e., the degree of NT calculated using Equation 1 was positive), we denote the accuracy in red. From Table 2, the results indicate that our proposed approach with fine-tuning, the other components, and both uncertainties (heteroscedastic aleatoric and epistemic) in the loss function outperformed the other three methods. To disentangle the contribution of each component in our proposed approach, we report the results obtained by removing each component in turn. When epistemic uncertainty or cost weighting was excluded from the loss function, we noticed three cases (i.e., when outdoor living, cell phones & service, and office products were used as the target datasets) where the MMT method outperformed our approach. A similar outcome was noted for the F-measure, as shown in Table 3. To further disentangle the contribution of all components in our proposed approach without fine-tuning the BERT model, and to provide a fair comparison with the three methods we compared against, we combined the source and target training data to train our BERT model before testing on the target test data.
Table 5
F-measure of the non-fine-tuned BERT uncased model and different baseline methods on the Amazon review dataset

Groups   Target                  CUCW    CUCW (no epistemic)   CUCW (no importance sampling)   CUCW (no cost weighting)   PDM     MMT     DTL
Group 1  Outdoor living          0.881   0.844                 0.829                           0.770                      0.778   0.808   0.756
         Beauty                  0.899   0.865                 0.834                           0.788                      0.665   0.716   0.733
         Jewelry & Watches       0.822   0.808                 0.789                           0.710                      0.616   0.742   0.709
Group 2  Cellphones & services   0.887   0.858                 0.887                           0.787                      0.742   0.835   0.799
         Software                0.878   0.844                 0.868                           0.819                      0.754   0.778   0.718
         Office products         0.843   0.822                 0.829                           0.709                      0.731   0.809   0.817
This was done to remove the benefit of the fine-tuning component in our design. The results in Tables 4 and 5 show that, without the fine-tuning component, we were still able to improve performance when all other components were integrated in our deep transfer learning, but with less improvement (i.e., an improvement in BAUC and F-measure of 2% to 9%, as shown in Tables 4 and 5).

6. Discussion

In our sentiment analysis experiments (see Tables 2 to 5), our proposed method, which incorporated both uncertainties, improved the balanced accuracy of the BERT model by 5% to 14% and the F-measure by 5% to 10% compared with techniques that are instance-based [13] or feature-transferability based [19, 15]. Although instance-level transferability enhancement has been used in deep learning models to prevent NT [11, 53], these methods do not handle the target data quality, a factor shown to be one of the causes of NT [7]. The PDM method that we compared against in this paper tackles the domain divergence issue by using predictive distribution matching to remove irrelevant sources. This method still fails to address the target data quality; hence, we noted a single case of NT in our NLP task results (when the outdoor living domain was used as the target dataset). Although the MMT method uses a softmax loss function based on soft pseudo-labels to tackle the target data quality, it cannot tackle the domain divergence issue, which may also lead to NT. A single case of NT (again when the outdoor living domain was used as the target dataset) was also noted when using this method. On the other hand, our proposed method is more robust: it uses the uncertainty-guided loss function to tackle the target and source data quality issues, and importance sampling and cost-weighting learning to tackle the domain divergence problem. For the fine-tuning process, we use a small target sample as the validation data for the source model to improve the transferability of the final model. Our results show that the final model is improved when we introduce an uncertainty-guided loss function to guide the training process when utilising the source and target datasets, and incorporate a cost weight to tackle the problem of imbalanced data. In the work of Grauer and Pei [22], it was also noted that when model uncertainty is known and distributed evenly, the performance and reliability of the model are greatly improved. Hence, this work uses the idea of model and data transferability enhancement to develop a more robust approach aimed at preventing negative transfer. The evidence from our results suggests that a systematic approach such as the one proposed in this paper can improve the quality of models in a deep transfer learning setting. It is also worth noting that two of the methods we compared against (MMT and DTL) in this study also use the BERT uncased model; hence, we are able to eliminate the interference of model complexity in the comparison results. From the ablation study, model fine-tuning improved the overall performance by 2% to 6% when integrating all components into our approach.

7. Addressing threats to validity

The experimental dataset was compiled by Lakkaraju et al. [46]. We acknowledge threats relating to errors in the review labels. These threats have been minimised by experimenting with different projects in the datasets. Also, we concede that there are a few uncontrolled factors that may have impacted the experimental results in this study. For instance, there could have been unexpected faults in the implementations of the approaches we compare against in this paper [54]. We sought to reduce such threats by using the source code provided for these methods (e.g., PDM, MMT and DTL). While we recognize the threats above, we anticipate that our study still contributes novel findings to transfer-based modelling for recommendation systems in NLP domains relying on latent sentiment information.

8. Conclusion

In this work, we proposed a systematic approach to overcoming negative transfer by tackling domain divergence while taking account of the source and target data quality. Our approach involves cost-weighting learning, an uncertainty-guided loss function over the target dataset, and the concept of importance sampling, to derive a robust model. This systematic approach improves the target domain's performance. The results reported in this work also reveal that when both aleatoric heteroscedastic and epistemic uncertainty are combined, we can further enhance the performance of the target model. We therefore assert that our systematic approach is a good approach for overcoming negative transfer and improving target model performance when performing sentiment analysis in a transfer learning setting. This approach can be used to build an effective recommendation system that includes latent sentiment information. A plausible next step is to use such an approach to design an effective recommendation system that takes into account the latent sentiment information. Although our experiments showed that our approach improves the target model performance and prevents NT in sentiment analysis, it is still important to investigate this approach in other domains.
Acknowledgements

This research was partly supported by an Internal Research fund from Manaaki Whenua — Landcare Research, New Zealand. Special thanks are given to the Department of Informatics at Landcare Research for their ongoing support.

References

[1] P. Cremonesi, A. Tripodi, R. Turrin, Cross-domain recommender systems, in: 2011 IEEE 11th International Conference on Data Mining Workshops, IEEE, 2011, pp. 496–503.
[2] T. Zang, Y. Zhu, H. Liu, R. Zhang, J. Yu, A survey on cross-domain recommendation: taxonomies, methods, and future directions, arXiv preprint arXiv:2108.03357 (2021).
[3] Y. Wang, H. Yu, G. Wang, Y. Xie, Cross-domain recommendation based on sentiment analysis and latent feature mapping, Entropy 22 (2020) 473.
[4] B. Zadrozny, Learning and evaluating classifiers under sample selection bias, in: Proceedings of the twenty-first international conference on Machine learning, 2004, p. 114.
[5] P. Li, A. Tuzhilin, Ddtcdr: Deep dual transfer cross domain recommendation, in: Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 331–339.
[6] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (2009) 1345–1359.
[7] W. Zhang, L. Deng, L. Zhang, D. Wu, A survey on negative transfer, 2020. URL: https://arxiv.org/abs/2009.00909. doi:10.48550/ARXIV.2009.00909.
[8] M. T. Rosenstein, Z. Marx, L. P. Kaelbling, T. G. Dietterich, To transfer or not to transfer, in: NIPS 2005 Workshop on Transfer Learning, 2005.
[9] Z. Wang, Z. Dai, B. Póczos, J. Carbonell, Characterizing and avoiding negative transfer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 11293–11302.
[10] O. P. Omondiagbe, S. Licorish, S. G. MacDonell, Improving transfer learning for cross project defect prediction, TechRxiv preprint techrxiv.19517029 (2022).
[11] Z. Wang, J. Carbonell, Towards more reliable transfer learning, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2018, pp. 794–810.
[12] E. Eaton, et al., Selective transfer between learning tasks using task-based boosting, in: Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[13] C.-W. Seah, Y.-S. Ong, I. W. Tsang, Combating negative transfer from predictive distribution differences, IEEE Transactions on Cybernetics 43 (2012) 1153–1165.
[14] D. Wu, Pool-based sequential active learning for regression, IEEE Transactions on Neural Networks and Learning Systems 30 (2018) 1348–1359.
[15] M. Long, J. Wang, G. Ding, W. Cheng, X. Zhang, W. Wang, Dual transfer learning, in: Proceedings of the 2012 SIAM International Conference on Data Mining, SIAM, 2012, pp. 540–551.
[16] X. Wang, Y. Jin, M. Long, J. Wang, M. I. Jordan, Transferable normalization: Towards improving transferability of deep neural networks, Advances in Neural Information Processing Systems 32 (2019).
[17] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, A. Vladu, Towards deep learning models resistant to adversarial attacks, arXiv preprint arXiv:1706.06083 (2017).
[18] L. Gui, R. Xu, Q. Lu, J. Du, Y. Zhou, Negative transfer detection in transductive transfer learning, International Journal of Machine Learning and Cybernetics 9 (2018) 185–197.
[19] Y. Ge, D. Chen, H. Li, Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification, arXiv preprint arXiv:2001.01526 (2020).
[20] K. Liang, J. Y. Zhang, O. O. Koyejo, B. Li, Does adversarial transferability indicate knowledge transferability? (2020).
[21] Z. Deng, L. Zhang, K. Vodrahalli, K. Kawaguchi, J. Y. Zou, Adversarial training helps transfer learning via better representations, Advances in Neural Information Processing Systems 34 (2021).
[22] J. A. Grauer, J. Pei, Minimum-variance control allocation considering parametric model uncertainty, in: AIAA SCITECH 2022 Forum, 2022, p. 0749.
[23] R. Caruana, D. Silver, J. Baxter, T. Mitchell, L. Pratt, S. Thrun, Learning to learn: knowledge consolidation and transfer in inductive systems, in: Workshop held at NIPS-95, Vail, CO, 1995. See http://www.cs.cmu.edu/afs/user/caruana/pub/transfer.html.
[24] M. Sugiyama, M. Krauledat, K.-R. Müller, Covariate shift adaptation by importance weighted cross validation, Journal of Machine Learning Research 8 (2007).
[25] J. Huang, A. Gretton, K. Borgwardt, B. Schölkopf, A. Smola, Correcting sample selection bias by unlabeled data, Advances in Neural Information Processing Systems 19 (2006).
[26] T. Jebara, Multi-task feature and kernel selection for svms, in: Proceedings of the twenty-first international conference on Machine learning, 2004, p. 55.
[27] S. Uguroglu, J. Carbonell, Feature selection for transfer learning, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2011, pp. 430–442.
[28] L. Mihalkova, R. J. Mooney, Transfer learning by mapping with minimal target data, in: Proceedings of the AAAI-08 workshop on transfer learning for complex tasks, 2008.
[29] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, Advances in Neural Information Processing Systems 27 (2014).
[30] B. Sun, K. Saenko, Deep coral: Correlation alignment for deep domain adaptation, in: European Conference on Computer Vision, Springer, 2016, pp. 443–450.
[31] M. Long, Y. Cao, J. Wang, M. Jordan, Learning transferable features with deep adaptation networks, in: International Conference on Machine Learning, PMLR, 2015, pp. 97–105.
[32] G. K. Dziugaite, D. M. Roy, Z. Ghahramani, Training generative neural networks via maximum mean discrepancy optimization, arXiv preprint arXiv:1505.03906 (2015).
[33] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in Neural Information Processing Systems 27 (2014).
[34] S. Sankaranarayanan, Y. Balaji, C. D. Castillo, R. Chellappa, Generate to adapt: Aligning domains using generative adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8503–8512.
[35] W. Y. Wang, S. Singh, J. Li, Deep adversarial learning for nlp, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, 2019, pp. 1–5.
[36] M. Rajesh, J. Gnanasekar, Annoyed realm outlook taxonomy using twin transfer learning, International Journal of Pure and Applied Mathematics 116 (2017) 549–558.
[37] M. Kukar, I. Kononenko, et al., Cost-sensitive learning with neural networks, in: ECAI, volume 15, Citeseer, 1998, pp. 88–94.
[38] H. T. Madabushi, E. Kochkina, M. Castelle, Cost-sensitive bert for generalisable sentence classification with imbalanced data, arXiv preprint arXiv:2003.11563 (2020).
[39] A. Katharopoulos, F. Fleuret, Not all samples are created equal: Deep learning with importance sampling, in: International Conference on Machine Learning, PMLR, 2018, pp. 2525–2534.
[40] C. P. Robert, G. Casella, Monte Carlo Statistical Methods, volume 2, Springer, 1999.
[41] A. Kendall, Y. Gal, What uncertainties do we need in bayesian deep learning for computer vision?, Advances in Neural Information Processing Systems 30 (2017).
[42] Q. V. Le, A. J. Smola, S. Canu, Heteroscedastic gaussian process regression, in: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 489–496.
[43] A. Kilgarriff, Comparing corpora, International Journal of Corpus Linguistics 6 (2001) 97–133.
[44] F. Wilcoxon, Probability tables for individual comparisons by ranking methods, Biometrics 3 (1947) 119–122.
[45] K. E. Bennin, J. Keung, P. Phannachitta, A. Monden, S. Mensah, Mahakil: Diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction, IEEE Transactions on Software Engineering 44 (2017) 534–550.
[46] H. Lakkaraju, J. McAuley, J. Leskovec, What's in a name? understanding the interplay between titles, content, and communities in social media, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 7, 2013, pp. 311–320.
[47] A. S. AlQahtani, Product sentiment analysis for amazon reviews, International Journal of Computer Science & Information Technology (IJCSIT) 13 (2021).
[48] A. F. Anees, A. Shaikh, A. Shaikh, S. Shaikh, Survey paper on sentiment analysis: Techniques and challenges, EasyChair 2516-2314 (2020).
[49] D. D. Palmer, Tokenisation and sentence segmentation, Handbook of Natural Language Processing (2000) 11–35.
[50] M. Geetha, D. K. Renuka, Improving the performance of aspect based sentiment analysis using fine-tuned bert base uncased model, International Journal of Intelligent Networks 2 (2021) 64–69.
[51] I. H. Witten, E. Frank, M. A. Hall, C. J. Pal, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers, San Francisco, 2016. doi:10.1016/c2009-0-19715-5.
[52] T. Menzies, J. Greenwald, A. Frank, Data mining static code attributes to learn defect predictors, IEEE Transactions on Software Engineering 33 (2006) 2–13.
[53] Y. Xu, H. Yu, Y. Yan, Y. Liu, et al., Multi-component transfer metric learning for handling unrelated source domain samples, Knowledge-Based Systems 203 (2020) 106132.
[54] E. A. Felix, S. P. Lee, Predicting the number of defects in a new software version, PloS one 15 (2020) e0229131.