Knowledge Distillation Techniques for Biomedical Named Entity Recognition

Tahir Mehmood¹,², Ivan Serina¹, Alberto Lavelli², and Alfonso Gerevini¹
¹ University of Brescia, 25121 Brescia, Italy
{t.mehmood,ivan.serina,alfonso.gerevini}@unibs.it
² Fondazione Bruno Kessler, 38123 Povo, Trento, Italy
{t.mehmood,lavelli}@fbk.eu

© Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. The limited amount of annotated biomedical literature and its peculiar characteristics make biomedical named entity recognition more challenging than standard named entity recognition. The multi-task learning approach overcomes these limitations by training different related tasks simultaneously. It learns common features among different tasks by sharing some layers of the neural network architecture. For this reason, the multi-task model attains better generalization than a single task model. This generalization can be exploited to enhance the results of other models. In particular, knowledge distillation techniques make this possible: one model supervises another model during training through its learned generalization. This research analyzes the knowledge distillation approach and shows that the performance of a simple deep learning model can be improved by distilling the generalization of the multi-task model. The results show that our approach outperforms both the multi-task model and the single task model, which demonstrates that our model learns more diverse features through knowledge distillation. We also found our approach to be statistically better than the multi-task model and the single task model.

Keywords: Biomedical Named Entity Recognition · Multi-task Learning · Knowledge Distillation

1 Introduction

The biomedical named entity recognition (BioNER) task has gained more attention with the increasing availability of large amounts of unstructured biomedical text data. BioNER is also a preliminary task for many other tasks, e.g., relation extraction (chemical-induced disease relations, drug-drug interactions, . . . ) [20]. However, biomedical texts are more complex than ordinary texts and carry unusual characteristics, e.g., spelling variations (10-Ethyl-5-methyl-5,10-dideazaaminopterin vs 10-EMDDA) [1], long multi-word expressions (10-ethyl-5-methyl-5,10-dideazaaminopterin), and ambiguous words (TNF alpha can be used for both DNA and Protein) [8]. These characteristics make BioNER an even more difficult task than traditional named entity recognition.

Traditional machine learning approaches, e.g., Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), and Support Vector Machines (SVMs), have been used to address the limitations faced by the BioNER task [4]. These machine learning methods have shown some promising results; however, they strongly rely on feature engineering. On the other hand, deep learning models usually do not require hand-crafted feature engineering, since this is done implicitly, and their results are very appealing for the BioNER task. However, due to the peculiar characteristics of the biomedical literature mentioned at the beginning of this section, the performance of these systems is still limited.
Another challenge for deep learning models is the limited availability of annotated biomedical text data, as deep learning models require substantial amounts of training data.

The multi-task and transfer learning approaches have shown improved results for the BioNER task [16,17], but these techniques still have some limitations. The multi-task model (MTM) [18] does not always produce a noticeable increase in performance compared to its counterpart single task model (STM) [3,5]. The MTM could also learn features that are more task-specific, which can lead to biased feature learning [13]. Similarly, transfer learning [7] faces limitations such as the catastrophic forgetting (or catastrophic interference) problem [23]: the deep learning model starts forgetting what it has learned from the previous domain. This forgetting of the previously learned source information happens even if the source and target domains are heterogeneous [10]. It is also an empirical question how many new layers to add on top of the pretrained layers for the target dataset, and which weights of the pretrained model need to be frozen when it is applied to the target dataset. The transfer learning approach is therefore not always a feasible solution for transferring previous knowledge to a new task.

Furthermore, a common issue with deep learning models is their complex structure. Deep learning methods have brought much success and breakthrough results in numerous fields, but state-of-the-art results are often obtained with complex model structures. The model of Sutskever et al. [25] comprised 4 layers of long short-term memory (LSTM), each with 1000 hidden units. Similarly, Zhou et al. [33] proposed a model containing multi-level LSTMs with 512 hidden units per layer. These deep learning models have millions of parameters, and training them requires considerable computational power. They also require more storage space, which makes them less suitable for deployment on systems where the available storage capacity is limited, e.g., cell phones. In such situations, these complex models need to be compressed while preserving their performance and the generalization they have learned.

In this regard, the knowledge distillation approach is used: the cumbersome model is compressed into a simpler model, which is more feasible to deploy on end devices [11]. In the knowledge distillation technique [9], one model teaches another model through its learned knowledge. This supervision is done through predictions, where the learning model mimics the predictions of the teacher model. The learning model therefore uses two gradients, i.e., its own gradient and the gradient derived from matching the teacher model, and for this reason it can produce better results. Romero et al. [22] showed that the intermediate layers of the teacher model give useful information to the student model during training. Liu et al. [14] improved the performance of a single model using knowledge distillation from an ensemble of different deep neural networks. Tan et al. [26] showed performance gains by distilling knowledge from individual machine translation models to train a multilingual translation model.
Zhang et al. [32] demonstrated an increase in performance when different student models were trained mutually and taught each other through knowledge distillation. Sun et al. [24] showed performance gains with a knowledge distillation approach in which the intermediate layers of the teacher model were used to train a task-specific student model.

This research also adopts the knowledge distillation approach, but to enhance the performance of deep learning models on the BioNER task: the purpose is to increase the performance of the model rather than to compress it. The multi-task model is used to perform knowledge distillation for the single task model using its logits. In other words, during training the single task model matches the true labels as well as the logits of the multi-task model. Logits are the inputs to the softmax output layer [9]; they carry more information and their values range over (−∞, +∞). This helps the single task model not only to learn from the true labels but also to match the logits of the multi-task model.

The rest of the paper is organized as follows. Section 2 introduces the knowledge distillation approach, followed by our proposed methodology in Section 3. The experimental setup is described in Section 4, whereas the results are discussed in Section 5. Finally, conclusions are drawn in Section 6.

2 Knowledge Distillation

In transfer learning, the representation learned on a source domain is utilized in another related domain. In contrast, the objective of knowledge distillation is to train a model with the knowledge learned by another model. The idea is to train a simple (student) model on the knowledge learned by a complex (teacher) model. More specifically, the knowledge distillation approach addresses how to transfer the generalization of one model, usually a complex model (teacher), to another model, usually a simple model (student). Complex models or ensemble approaches usually produce better results than a simple single-task model, but they are computationally expensive to train. The knowledge distillation approach helps the simple (student) model to produce better results than the stand-alone single model and the ensemble models. In this way, the student model can also be trained on fewer training examples, since it consumes the knowledge learned by the teacher model during training. The idea is that the complex model has already generalized over the data during its training, which helps the student model to achieve, or nearly achieve, the generalization of the teacher model. The student model learns not only through its own gradient but also through the gradient derived from the teacher's knowledge.

Transferring knowledge from a teacher model is usually done in the form of the probabilities predicted by the teacher model. The objective of any learning model is to predict the correct class for the input example, assigning a high probability to that class and small probabilities to the rest of the classes. The probabilities assigned to the incorrect classes are not random: these side probabilities carry information that reflects how a specific model has generalized over the classes present in the dataset. For instance, there is very little chance of misclassifying a motorbike image as a car, but that probability is still higher than the probability of misclassifying it as a truck.
The softmax activation function outputs a probability distribution over the possible classes for a specific instance; these probabilities sum to 1. Softmax probabilities give more information than one-hot "hard labels". For instance, the softmax probabilities [0.7, 0.2, 0.1] show a ranking of the classes, information that cannot be extracted from hard labels such as [1, 0, 0]. The posterior probabilities can therefore pass an extra useful signal to the student model during its training. However, training the student model to match these probabilities directly may not be very useful, since the student model mostly pays attention to the highest probability value. To overcome this barrier, it is better to soften the final softmax output probabilities [9]. The softened probabilities represent soft labels, which provide the knowledge distilled to the student model [29]; the student model then pays attention to the other values as well, along with the most probable class. Hinton et al. proposed a temperature parameter T to soften the posterior probabilities [9]. With T = 1, Equation 1 reduces to the standard softmax function. A larger value of T softens the softmax output further and increases the output probabilities of the non-target classes [19]. On the downside, it also reduces the probability of the target class. Therefore, it is vital to choose the right value for the temperature parameter.

Softmax(z_i) = exp(z_i / T) / Σ_j exp(z_j / T)    (1)
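To make Equation 1 concrete, the following is a minimal sketch of temperature scaling; the logit values and the three-class setting are made-up illustrations, not taken from the models described in this paper.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Equation 1: temperature-scaled softmax. T = 1 gives the standard softmax;
    a larger T softens the distribution and raises the non-target probabilities."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()            # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits for three classes.
logits = [4.0, 2.0, 1.0]
print(softmax_with_temperature(logits, T=1.0))  # sharp, roughly [0.84, 0.11, 0.04]
print(softmax_with_temperature(logits, T=5.0))  # softer, roughly [0.45, 0.30, 0.25]
```

Note that the softened output still preserves the ranking of the classes, which is exactly the extra signal the soft labels carry.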
3 Our Approach

Figure 1 introduces our proposed knowledge distillation approach. The teacher model is a multi-task model (MTM) with word and character inputs for the sentences. We use a bidirectional LSTM (BiLSTM) to process the sequence in both directions [21]. The upper layers of the MTM, shown in the black rounded rectangle, are shared among all the datasets. The bottom layers, shown in the red rounded rectangle, are dataset specific, and a softmax layer is used for output labelling. In the multi-task learning (MTL) approach, the shared layers help one task to be learned better with the help of another task. Training jointly on related tasks helps the multi-task model to learn common features among different tasks through the shared layers [2], while the task-specific layers learn features that are more related to the current task. Training related tasks together also helps the model to optimize the values of its parameters, and the joint learning lowers the chance of overfitting to any specific task [15]. Therefore, we assume that the student model will also have a lower chance of overfitting thanks to the knowledge distilled from the MTM.

The purpose of our work is to transfer token-level knowledge; therefore, we use a softmax function at the output layer. Token-level knowledge distillation is not possible with a conditional random field (CRF), as it predicts the labels of the whole sequence. A CRF-based model labels the sequence globally, considering the association between neighboring labels, and this limits the distillation of knowledge from the teacher models [30].

An alternating training approach was adopted for the MTM training phase. Let us suppose we have training sets D_1, D_2, ..., D_t, related to the tasks T_1, T_2, ..., T_t respectively. During the training phase, a training set D_i is selected randomly and both the shared layers and the ones specific to the corresponding task T_i are activated. Every task has its own optimizer, so during training only the optimizer specific to the task T_i is activated and the corresponding loss function is optimized.

The student model is in fact the counterpart single task model (STM) of the MTM; therefore, the structures of both models are the same. In this research we perform knowledge distillation using the teacher (MTM) logits z_t, which are the inputs to its softmax layer [28]. The logits can take values in (−∞, +∞) and therefore carry more information (the so-called dark knowledge). During training, the student model considers the hard labels as well as the logits z_t of the teacher model (MTM). We do not normalize the logits, which means the temperature is T = 1. We compute losses for both predictions, i.e., the loss for matching the hard labels and the loss for matching the logits. The hard-target matching loss, which involves the one-hot labels, can be referred to as the student loss, whereas the distillation loss considers the logits. The loss function of our student model is shown in Equation 2. The distillation loss minimizes the mean squared error between the student logits z_s and the teacher logits z_t. Here x represents the input, W the student model's parameters, H the cross-entropy loss, y the true hard labels, and σ the softmax function. The coefficients α and β specify the balance between the student loss and the distillation loss, with β = 1 − α.

[Figure 1: student model (left) and teacher MTM (right), each with word and character inputs, shared BiLSTM layers, a task-specific BiLSTM, and softmax outputs; the student is trained with both the student loss on the hard labels and the distillation loss on (z_s, z_t).]
Fig. 1. Proposed knowledge distillation approach (colored circles show embeddings).

L(x; W) = α · H(y, σ(z_s)) + β · MSE(z_s, z_t)    (2)

4 Experiments

First, the MTM, shown on the right side of Figure 1, is trained separately. This MTM is then used to distill knowledge to the student model. We perform knowledge distillation from the MTM in two ways. In the first approach, we perform simple knowledge distillation as shown in Figure 1, where the MTM's logits are used to train the student model. In the second approach, we use the logits of an ensemble of MTMs to train the student model. The MTMs used in the ensemble approach have the same architecture, but they are initialized with different seed values, which results in different predictions; although the structure of all MTMs is the same, this gives us five different sets of predictions. We take the average of the logits of these MTMs, which is then used to train our student model. Furthermore, the F1-scores presented in the following sections are also averages over five runs with different seed values. In the rest of this article, MTM and teacher MTM will be used interchangeably, as the logits of the MTM are used to train the student models.

We perform experiments with different values of α, i.e., [0, 0.5, 1]. No hyper-parameter tuning is done for α; the values are selected in a simple, straightforward way. If α = 0, the student model learns only with the distillation loss, β · MSE(z_s, z_t), which tries to match the logits of the student model and the teacher model. With α = 0.5, the student loss and the distillation loss are weighted equally. Finally, α = 1 makes the student model consider only the student loss, α · H(y, σ(z_s)).
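The following is a rough PyTorch sketch of the combined loss in Equation 2; the tensor shapes, names, and the toy batch are illustrative assumptions rather than the actual training code of our systems.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, hard_labels, alpha=0.5):
    """Equation 2: alpha * H(y, softmax(z_s)) + beta * MSE(z_s, z_t), with beta = 1 - alpha."""
    beta = 1.0 - alpha
    # Student loss: cross-entropy with the hard (one-hot) labels; the softmax is applied internally.
    student_loss = F.cross_entropy(student_logits, hard_labels)
    # Distillation loss: match the teacher (MTM) logits directly, here with temperature T = 1.
    distillation_loss = F.mse_loss(student_logits, teacher_logits)
    return alpha * student_loss + beta * distillation_loss

# Toy usage with made-up shapes: a batch of 8 tokens and 5 possible tags.
student_logits = torch.randn(8, 5, requires_grad=True)
teacher_logits = torch.randn(8, 5)               # precomputed by the teacher MTM
hard_labels = torch.randint(0, 5, (8,))
loss = kd_loss(student_logits, teacher_logits, hard_labels, alpha=0.5)
loss.backward()
```

With alpha = 0 only the distillation term remains, and with alpha = 1 only the student loss, matching the three settings used in our experiments.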
Furthermore, words are represented with pre-trained domain-specific word embeddings. More specifically, we utilize the WikiPubMed-PMC word embeddings, which are trained on a large set of PubMed Central (PMC) articles and PubMed abstracts as well as on English Wikipedia articles [7]. On the other hand, the character embeddings are initialized randomly and further processed by a BiLSTM.

In this paper, we perform experiments on the 15 datasets also used by Crichton et al. [6] and Wang et al. [31]. The bio-entities in these datasets are Chemical, Species, Cell, Gene/Protein, Cell Component, and Disease³. The description of these entities can be found in [16]. Each dataset contains separate training, development, and test sets. We follow the same experimental setup adopted by Wang et al.⁴, which uses both the training and development sets for training the model.

³ The datasets can be found at https://github.com/cambridgeltl/MTL-Bioinformatics-2016
⁴ https://github.com/yuzhimanhua/Multi-BioNER

5 Results and Discussion

The F1-score comparison of our student models for different α values is shown in Table 1. The MTM is the teacher model, as mentioned in the earlier sections; it is used to distill knowledge to the student model via its logits. The best results are shown in bold, while the second best scores are in italics. It can be noticed that our student model outperforms the MTM approach, except for BioNLP13CG and most of the protein datasets (BioNLP11EPI, BioNLP11ID, BioNLP13GE, and Ex-PTM). We speculate that, since BioNLP11EPI, BioNLP11ID, and Ex-PTM are corpora created for the BioNLP 2011 shared task, they might carry similar characteristics; therefore, we observe a performance decrease on all three datasets. In particular, the entity mentions in BioNLP11EPI and Ex-PTM were automatically annotated using the BANNER named entity tagger [12], which was trained on the GENETAG corpus [27]. We anticipate that wrong entity classifications might have propagated to both datasets because the annotation came from the same named entity tagger. On the other hand, BioNLP13CG contains 16 different classes, some of which have very few examples in the dataset. These classes represent cancer genetics (CG) and are highly correlated with each other; therefore, our student model might not be able to differentiate among them.

The student model with α = 0 shows a performance gain on 6 datasets compared to the MTM (teacher). The student model trained with α = 0.5 achieves an increase in performance on 9 and 8 datasets compared to the MTM and to the student (α = 0) model, respectively. Similarly, the student model (α = 1) improves the results on 11 datasets against the MTM, and yields the best performance on 10 and 11 datasets compared to the students with α = 0 and α = 0.5, respectively.

We further analyse the performance of the student models with respect to the STM, which is also reported in Table 1. It can be noticed that our student models outperform the STM on most datasets, except BC4CHEMD and CRAFT. We analyzed the performance of our teacher model (MTM) on the BC4CHEMD and CRAFT datasets and found a performance drop of up to 3 F1-score points compared to the STM. Therefore, we assume that the teacher MTM could not distill much knowledge for these two datasets. The student model (α = 0) obtained the best performance on 13 datasets compared to the STM. Likewise, the student (α = 0.5) obtained a performance gain on 12 datasets, whereas the student (α = 1) improves the performance on 13 datasets compared to the STM.

We also use a second approach to train our student models, where the logits of an ensemble of MTMs are used for training. Instead of using teacher models with different architectures, we use the same MTM teacher model initialized with different seed values; for this reason, all 5 teacher models produce different predictions. We average their logits and train each single student model on the averaged logits (see the sketch below).
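As referenced above, a minimal sketch of the logit-averaging step is shown here; the teacher models' forward signature (word and character inputs) is an assumption for illustration, not the actual interface of our implementation.

```python
import torch

@torch.no_grad()
def averaged_teacher_logits(teacher_models, word_ids, char_ids):
    """Average the logits of several MTM teachers that share the same architecture
    but were trained from different random seeds."""
    all_logits = [teacher(word_ids, char_ids) for teacher in teacher_models]
    # Stack to shape (n_teachers, batch, seq_len, n_tags) and average over the teachers.
    return torch.stack(all_logits, dim=0).mean(dim=0)
```

The averaged logits then take the place of z_t in Equation 2 when training each student model.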
Table 2 shows the results of our second approach. We can notice a remarkable improvement in the results of the student models trained with the ensemble approach. For two protein datasets (BioNLP13GE and Ex-PTM), our student models are still unable to improve over the teacher (MTM). However, the student models now show a performance gain on the other protein datasets, for which our first approach did not show an increase in performance. We observed that the student model with only the distillation loss (α = 0) shows a performance gain on 11 datasets against the teacher model (MTM). Similarly, considering both losses (α = 0.5), i.e., the student loss and the distillation loss, the student model improves the results on 13 and 6 datasets compared to the teacher model and to the student model (α = 0), respectively. On the other hand, the student model trained with only the student loss (α = 1) achieves a performance gain on 13 datasets compared to the teacher model (MTM), and improves the results on 6 datasets compared to both student models with α = 0 and α = 0.5. Comparing with the STM, all the student models show a performance gain on all 15 datasets. Our student models trained with the logits of the ensemble of MTMs produce better results because ensemble predictions are more accurate than a single prediction.

Datasets          MTM     STM     Student α=0   Student α=0.5   Student α=1
AnatEM            86.78   86.53   87.56         87.55           87.63
BC2GM             79.68   81.07   81.25         81.04           81.29
BC4CHEMD          86.80   90.24   89.45         89.50           89.58
BC5CDR            87.49   88.09   88.33         88.30           88.32
BioNLP09          88.40   87.37   88.70         88.82           88.74
BioNLP11EPI       84.56   82.58   84.45         84.44           84.56
BioNLP11ID        87.26   85.58   86.98         86.77           86.91
BioNLP13CG        83.83   82.11   83.27         83.39           83.35
BioNLP13GE        80.06   75.38   77.64         78.08           77.84
BioNLP13PC        88.17   87.26   88.05         88.09           88.22
CRAFT             81.96   84.27   83.98         83.98           83.81
ExPTM             80.69   73.06   76.11         76.39           76.71
JNLPBA            70.40   70.86   72.14         72.20           72.02
linnaeus          88.32   87.88   88.49         88.58           88.91
NCBI              84.50   83.98   84.88         84.72           84.67
Average           83.93   83.08   84.08         84.12           84.17
Average Variance  0.17    0.27    0.15          0.21            0.24

Table 1. Results comparison of the proposed student models. Average is the average F1-score over all datasets; Average Variance is the average variance over all datasets.

We also compare our results with state-of-the-art models. Table 3 compares the results of our proposed student models with the models of Wang et al. [31] and Crichton et al. [6]. We use their published results instead of regenerating them. Wang et al. and Crichton et al. used the MTL approach and trained their MTMs on the same 15 datasets. Our MTM structure resembles the model proposed by Wang et al., but we use a task-specific BiLSTM layer and a Softmax output layer instead of a CRF.
In Table 3, we can notice that our proposed approach shows a substantial increase in F1-score compared to the model proposed by Crichton et al., while the model proposed by Wang et al. shows a performance gain on 5 datasets. The student model with α = 1 shows the best results against the benchmark results. The comparison of our second approach, i.e., the student models trained with the ensemble of MTMs, is shown in Table 4. We see that our second approach again outperforms Crichton et al., and shows an absolute gain on most of the datasets compared to Wang et al., except for BioNLP13GE and Ex-PTM.

We also performed a statistical analysis of our results using the Friedman test [34], shown in Figure 2. We are interested in whether the differences in results among the different models are statistically significant. We observe that the student models trained with the logits of a single teacher (our first approach) do not produce statistically significant differences with respect to the teacher model.

Datasets          MTM     STM     Student* α=0   Student* α=0.5   Student* α=1
AnatEM            86.78   86.53   87.97          87.97            88.04
BC2GM             79.68   81.07   81.96          81.78            81.89
BC4CHEMD          86.80   90.24   90.48          90.47            90.45
BC5CDR            87.49   88.09   88.76          88.68            88.71
BioNLP09          88.40   87.37   89.05          89.12            89.08
BioNLP11EPI       84.56   82.58   84.73          84.72            84.89
BioNLP11ID        87.26   85.58   87.05          87.52            87.37
BioNLP13CG        83.83   82.11   83.80          83.88            84.00
BioNLP13GE        80.06   75.38   78.61          78.60            78.60
BioNLP13PC        88.17   87.26   88.72          88.76            88.52
CRAFT             81.96   84.27   85.15          84.89            84.89
ExPTM             80.69   73.06   76.93          77.17            77.33
JNLPBA            70.40   70.86   72.51          72.54            72.50
linnaeus          88.32   87.88   89.44          89.05            88.84
NCBI              84.50   83.98   86.12          85.70            85.66
Average           83.93   83.08   84.75          84.72            84.72
Average Variance  0.17    0.27    0.09           0.21             0.11

Table 2. Results comparison of the proposed student models. Average is the average F1-score over all datasets; Average Variance is the average variance over all datasets. (* The student model trained with the ensemble of MTMs.)

This is understandable, as the student model is unable to show a performance gain on most of the datasets against the teacher model (Table 1). However, the results produced by that student model (our first approach) are statistically significant when compared with the STM. On the other hand, the results of our second approach (student models trained with the logits of an ensemble of MTMs), represented as Ens MTM, are statistically significant compared to both the teacher (MTM) and the STM. We also see that the two groups of student models, with and without the ensemble approach, produce results that are statistically significantly different from each other, while the student models trained without the ensemble of MTMs' logits are not significantly different among themselves. The same behavior can be noticed for our second approach of student models trained with the ensemble MTMs' logits.

In Figure 3, the models are shown according to their statistical ranks, decreasing from left to right. The arrows show that the difference in results between two models is statistically significant with p < 0.001. The group of student models trained with an ensemble of MTMs, shown in the black dashed rectangle, is statistically better than the rest of the models. In particular, the student model (St Ens α = 0) is statistically better than all the other models, which shows that our second approach learns much better with only the distillation loss.
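As a sketch of how such a Friedman test with Conover post-hoc pairwise comparisons (the analysis behind Figures 2 and 3) could be computed with SciPy and the scikit-posthocs package; the matrix below uses only the first three rows of Table 1 as an example, and the exact settings of the test reported here may differ.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Rows are datasets (blocks), columns are models: MTM, STM, student alpha=0, 0.5, 1.
# Only the first three rows of Table 1 are shown here as an example.
scores = np.array([
    [86.78, 86.53, 87.56, 87.55, 87.63],   # AnatEM
    [79.68, 81.07, 81.25, 81.04, 81.29],   # BC2GM
    [86.80, 90.24, 89.45, 89.50, 89.58],   # BC4CHEMD
])

# Friedman test: are the per-dataset F1-scores of the models significantly different?
stat, p_value = friedmanchisquare(*scores.T)
print(f"Friedman statistic = {stat:.3f}, p = {p_value:.4f}")

# Conover post-hoc pairwise comparisons between models.
print(sp.posthoc_conover_friedman(scores))
```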
We also consider our first group of student models (trained without the ensemble approach), shown in the blue dashed rectangle. We find that the student model (St α = 1), trained with the student loss, is statistically better than the rest of the models shown on its right.

Datasets          Wang et al. [31]   Crichton et al. [6]   Student α=0   Student α=0.5   Student α=1
AnatEM            86.04              82.21                 87.56         87.55           87.63
BC2GM             78.86              73.17                 81.25         81.04           81.29
BC4CHEMD          88.83              83.02                 89.45         89.50           89.58
BC5CDR            88.14              83.90                 88.33         88.30           88.32
BioNLP09          88.08              84.20                 88.70         88.82           88.74
BioNLP11EPI       83.18              78.86                 84.45         84.44           84.56
BioNLP11ID        83.26              81.73                 86.98         86.77           86.91
BioNLP13CG        82.48              78.90                 83.27         83.39           83.35
BioNLP13GE        79.87              78.58                 77.64         78.08           77.84
BioNLP13PC        88.46              81.92                 88.05         88.09           88.22
CRAFT             82.89              79.56                 83.98         83.98           83.81
ExPTM             80.19              74.90                 76.11         76.39           76.71
JNLPBA            72.21              70.09                 72.14         72.20           72.02
linnaeus          88.88              84.04                 88.49         88.58           88.91
NCBI              85.54              80.37                 84.88         84.72           84.67
Average           83.79              79.70                 84.08         84.12           84.17
Average Variance  —                  —                     0.15          0.21            0.24

Table 3. Results comparison of the proposed student models with state-of-the-art results.

[Figure 2 here] Fig. 2. Posthoc pairwise analysis with the Conover-Friedman test.

Datasets          Wang et al. [31]   Crichton et al. [6]   Student* α=0   Student* α=0.5   Student* α=1
AnatEM            86.04              82.21                 87.97          87.97            88.04
BC2GM             78.86              73.17                 81.96          81.78            81.89
BC4CHEMD          88.83              83.02                 90.48          90.47            90.45
BC5CDR            88.14              83.90                 88.76          88.68            88.71
BioNLP09          88.08              84.20                 89.05          89.12            89.08
BioNLP11EPI       83.18              78.86                 84.73          84.72            84.89
BioNLP11ID        83.26              81.73                 87.05          87.52            87.37
BioNLP13CG        82.48              78.90                 83.80          83.88            84.00
BioNLP13GE        79.87              78.58                 78.61          78.60            78.60
BioNLP13PC        88.46              81.92                 88.72          88.76            88.52
CRAFT             82.89              79.56                 85.15          84.89            84.89
ExPTM             80.19              74.90                 76.93          77.17            77.33
JNLPBA            72.21              70.09                 72.51          72.54            72.50
linnaeus          88.88              84.04                 89.44          89.05            88.84
NCBI              85.54              80.37                 86.12          85.70            85.66
Average           83.79              79.70                 84.75          84.72            84.72
Average Variance  —                  —                     0.09           0.21             0.11

Table 4. Results comparison of the proposed student models with state-of-the-art results. (* The student model trained with the ensemble of MTMs.)

[Figure 3 here] Fig. 3. Statistical comparison of our models. The arrows show that a model is statistically significantly different from another model with p < 0.001. St Ens represents the student model trained with the ensemble of MTMs.

6 Conclusions

In this research, we introduced knowledge distillation to increase the performance on the BioNER task. We use an MTM as our teacher model because of the advantages the MTM has over the STM, and we further use an ensemble of MTMs in our proposed knowledge distillation approach. The knowledge distillation is done using the MTM's logits. By analyzing the F1-scores and a statistical test, we found our approach to be better than both the teacher MTM and the STM. We also found that using an ensemble of MTMs as teacher is more beneficial than using a single MTM. In future work, we will use the probability distributions of the softmax predictions for the student models. Furthermore, different teacher model architectures will also be used in an ensemble approach to supervise the student model.

References

1. Alam, F., Corazza, A., Lavelli, A., Zanoli, R.: A knowledge-poor approach to chemical-disease relation extraction. Database J. Biol. Databases Curation 2016 (2016), https://doi.org/10.1093/database/baw071
2. Bansal, T., Belanger, D., McCallum, A.: Ask the GRU: Multi-task learning for deep text recommendations. In: Sen, S., Geyer, W., Freyne, J., Castells, P. (eds.) Proceedings of the 10th ACM Conference on Recommender Systems, Boston, MA, USA, September 15-19, 2016. pp. 107–114. ACM (2016), https://doi.org/10.1145/2959100.2959180
3. Bingel, J., Søgaard, A.: Identifying beneficial task relations for multi-task learning in deep neural networks. In: Lapata, M., Blunsom, P., Koller, A. (eds.) Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers. pp. 164–169. Association for Computational Linguistics (2017), https://doi.org/10.18653/v1/e17-2026
4. Chowdhury, M.F.M., Lavelli, A.: Disease mention recognition with specific features. In: Cohen, K.B., Demner-Fushman, D., Ananiadou, S., Pestian, J., Tsujii, J., Webber, B.L. (eds.) Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, BioNLP@ACL 2010, Uppsala, Sweden, July 15, 2010. pp. 83–90. Association for Computational Linguistics (2010), https://www.aclweb.org/anthology/W10-1911/
5. Clark, K., Luong, M., Khandelwal, U., Manning, C.D., Le, Q.V.: BAM! Born-again multi-task networks for natural language understanding. In: Korhonen, A., Traum, D.R., Màrquez, L. (eds.) Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers. pp. 5931–5937. Association for Computational Linguistics (2019), https://doi.org/10.18653/v1/p19-1595
6. Crichton, G.K.O., Pyysalo, S., Chiu, B., Korhonen, A.: A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinform. 18(1), 368:1–368:14 (2017), https://doi.org/10.1186/s12859-017-1776-8
7. Giorgi, J.M., Bader, G.D.: Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics 34(23), 4087–4094 (2018), https://doi.org/10.1093/bioinformatics/bty449
8. Gridach, M.: Character-level neural network for biomedical named entity recognition. J. Biomed. Informatics 70, 85–91 (2017), https://doi.org/10.1016/j.jbi.2017.05.002
9. Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. CoRR abs/1503.02531 (2015), http://arxiv.org/abs/1503.02531
10. Jung, H., Ju, J., Jung, M., Kim, J.: Less-forgetting learning in deep neural networks. CoRR abs/1607.00122 (2016), http://arxiv.org/abs/1607.00122
11. Kim, Y., Rush, A.M.: Sequence-level knowledge distillation. In: Su, J., Carreras, X., Duh, K. (eds.) Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. pp. 1317–1327. The Association for Computational Linguistics (2016), https://doi.org/10.18653/v1/d16-1139
12. Leaman, R., Gonzalez, G.: BANNER: an executable survey of advances in biomedical named entity recognition. In: Altman, R.B., Dunker, A.K., Hunter, L., Murray, T., Klein, T.E. (eds.) Biocomputing 2008, Proceedings of the Pacific Symposium, Kohala Coast, Hawaii, USA, 4-8 January 2008. pp. 652–663. World Scientific (2008), http://psb.stanford.edu/psb-online/proceedings/psb08/leaman.pdf
13. Liu, P., Qiu, X., Huang, X.: Adversarial multi-task learning for text classification. In: Barzilay, R., Kan, M. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers. pp. 1–10. Association for Computational Linguistics (2017), https://doi.org/10.18653/v1/P17-1001
14. Liu, X., He, P., Chen, W., Gao, J.: Improving multi-task deep neural networks via knowledge distillation for natural language understanding. CoRR abs/1904.09482 (2019), http://arxiv.org/abs/1904.09482
15. Lu, P., Bai, T., Langlais, P.: SC-LSTM: learning task-specific representations in multi-task learning for sequence labeling. In: Burstein, J., Doran, C., Solorio, T. (eds.) Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). pp. 2396–2406. Association for Computational Linguistics (2019), https://doi.org/10.18653/v1/n19-1249
16. Mehmood, T., Gerevini, A., Lavelli, A., Serina, I.: Leveraging multi-task learning for biomedical named entity recognition. In: Alviano, M., Greco, G., Scarcello, F. (eds.) AI*IA 2019 - Advances in Artificial Intelligence - XVIIIth International Conference of the Italian Association for Artificial Intelligence, Rende, Italy, November 19-22, 2019, Proceedings. Lecture Notes in Computer Science, vol. 11946, pp. 431–444. Springer (2019), https://doi.org/10.1007/978-3-030-35166-3_31
17. Mehmood, T., Gerevini, A., Lavelli, A., Serina, I.: Multi-task learning applied to biomedical named entity recognition task. In: Bernardi, R., Navigli, R., Semeraro, G. (eds.) Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019. CEUR Workshop Proceedings, vol. 2481. CEUR-WS.org (2019), http://ceur-ws.org/Vol-2481/paper47.pdf
18. Mehmood, T., Gerevini, A.E., Lavelli, A., Serina, I.: Combining multi-task learning with transfer learning for biomedical named entity recognition. Procedia Computer Science 176, 848–857 (2020)
19. Mishra, A.K., Marr, D.: Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. In: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net (2018), https://openreview.net/forum?id=B1ae1lZRb
20. Putelli, L., Gerevini, A., Lavelli, A., Serina, I.: Applying self-interaction attention for extracting drug-drug interactions. In: Alviano, M., Greco, G., Scarcello, F. (eds.) AI*IA 2019 - Advances in Artificial Intelligence - XVIIIth International Conference of the Italian Association for Artificial Intelligence, Rende, Italy, November 19-22, 2019, Proceedings. Lecture Notes in Computer Science, vol. 11946, pp. 445–460. Springer (2019), https://doi.org/10.1007/978-3-030-35166-3_32
21. Putelli, L., Gerevini, A.E., Lavelli, A., Serina, I.: The impact of self-interaction attention on the extraction of drug-drug interactions. In: Bernardi, R., Navigli, R., Semeraro, G. (eds.) Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019. CEUR Workshop Proceedings, vol. 2481. CEUR-WS.org (2019), http://ceur-ws.org/Vol-2481/paper61.pdf
22. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: Hints for thin deep nets. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015), http://arxiv.org/abs/1412.6550
23. Serrà, J., Suris, D., Miron, M., Karatzoglou, A.: Overcoming catastrophic forgetting with hard attention to the task. In: Dy, J.G., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. Proceedings of Machine Learning Research, vol. 80, pp. 4555–4564. PMLR (2018), http://proceedings.mlr.press/v80/serra18a.html
24. Sun, S., Cheng, Y., Gan, Z., Liu, J.: Patient knowledge distillation for BERT model compression. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019. pp. 4322–4331. Association for Computational Linguistics (2019), https://doi.org/10.18653/v1/D19-1441
25. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada. pp. 3104–3112 (2014), http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks
26. Tan, X., Ren, Y., He, D., Qin, T., Zhao, Z., Liu, T.: Multilingual neural machine translation with knowledge distillation. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net (2019), https://openreview.net/forum?id=S1gUsoR9YX
27. Tanabe, L.K., Xie, N., Thom, L.H., Matten, W., Wilbur, W.J.: GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinform. 6(S-1) (2005), https://doi.org/10.1186/1471-2105-6-S1-S3
28. Tang, R., Lu, Y., Liu, L., Mou, L., Vechtomova, O., Lin, J.: Distilling task-specific knowledge from BERT into simple neural networks. CoRR abs/1903.12136 (2019), http://arxiv.org/abs/1903.12136
29. Wang, L., Yoon, K.: Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. CoRR abs/2004.05937 (2020), https://arxiv.org/abs/2004.05937
30. Wang, X., Jiang, Y., Bach, N., Wang, T., Huang, F., Tu, K.: Structure-level knowledge distillation for multilingual sequence labeling. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020. pp. 3317–3330. Association for Computational Linguistics (2020), https://www.aclweb.org/anthology/2020.acl-main.304/
31. Wang, X., Zhang, Y., Ren, X., Zhang, Y., Zitnik, M., Shang, J., Langlotz, C., Han, J.: Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics 35(10), 1745–1752 (2019), https://doi.org/10.1093/bioinformatics/bty869
32. Zhang, Y., Xiang, T., Hospedales, T.M., Lu, H.: Deep mutual learning. CoRR abs/1706.00384 (2017), http://arxiv.org/abs/1706.00384
33. Zhou, J., Cao, Y., Wang, X., Li, P., Xu, W.: Deep recurrent models with fast-forward connections for neural machine translation. Trans. Assoc. Comput. Linguistics 4, 371–383 (2016), https://transacl.org/ojs/index.php/tacl/article/view/863
34. Zubani, M., Sigalini, L., Serina, I., Gerevini, A.E.: Evaluating different natural language understanding services in a real business case for the Italian language. Procedia Computer Science 176, 995–1004 (2020)