           Knowledge Distillation Techniques
       for Biomedical Named Entity Recognition

    Tahir Mehmood1,2 , Ivan Serina1 , Alberto Lavelli2 , and Alfonso Gerevini1
                     1
                       University of Brescia, 25121 Brescia, Italy
               {t.mehmood,ivan.serina,alfonso.gerevini}@unibs.it
                2
                  Fondazione Bruno Kessler, 38123 Povo, Trento, Italy
                           {t.mehmood,lavelli}@fbk.eu




        Abstract. The limited amount of annotated biomedical literature and
        its peculiar characteristics make biomedical named entity recognition
        more challenging than standard named entity recognition. The multi-
        task learning approach overcomes these limitations by training different
        related tasks simultaneously: it learns features common to the different
        tasks by sharing some layers of the neural network architecture. For this
        reason, the multi-task model attains better generalization than a single-
        task model, and this generalization can be exploited to enhance other
        models' results. Knowledge distillation techniques make this possible:
        one model supervises another model during training through the gen-
        eralization it has learned. This research analyzes the knowledge distil-
        lation approach and shows that the performance of a simple deep learning
        model can be improved by distilling the generalization of a multi-task
        model. The results show that our approach outperforms both the multi-
        task model and the single-task model, indicating that it learns more
        diverse features through knowledge distillation. We also found our ap-
        proach to be statistically better than the multi-task and single-task models.

        Keywords: Biomedical Named Entity Recognition · Multi-task Learning · Knowledge Distillation.




1     Introduction

The biomedical named entity recognition (BioNER) task has gained increasing atten-
tion with the growing availability of large amounts of unstructured biomedical
text. BioNER is also a preliminary step for many other tasks, e.g., relation
extraction (chemical-induced disease relations, drug-drug interactions, etc.) [20].

    Copyright © 2020 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).


However, biomedical texts are more complex than general-domain texts and carry
unusual characteristics, e.g., spelling variants (10-Ethyl-5-methyl-5,10-
dideazaaminopterin vs. 10-EMDDA) [1], long multi-word expressions (10-ethyl-
5-methyl-5,10-dideazaaminopterin), and ambiguous words (TNF alpha can denote
both a DNA and a protein) [8]. These characteristics make BioNER an even more
difficult task than traditional named entity recognition.
    Traditional machine learning approaches, e.g., Hidden Markov Models (HMMs),
Conditional Random Fields (CRFs), and Support Vector Machines (SVMs), have been
used to address the difficulties of the BioNER task [4]. These methods have shown
promising results, but they rely heavily on feature engineering. Deep learning
models, on the other hand, usually do not require hand-crafted features, since
feature extraction is performed implicitly, and their results on the BioNER task
are very appealing. However, due to the peculiar characteristics of the biomedical
literature mentioned at the beginning of this section, their performance is still
limited. Another challenge is the limited availability of annotated biomedical
text, since deep learning models require substantial amounts of training data.
    Multi-task and transfer learning approaches have improved results on the
BioNER task [16,17], but they still have limitations. A multi-task model (MTM) [18]
does not always produce a noticeable increase in performance over its counterpart
single-task model (STM) [3,5]. The MTM can also learn features that are too
task-specific, which may lead to biased feature learning [13]. Similarly, transfer
learning [7] suffers from limitations such as catastrophic forgetting, also known
as catastrophic interference [23]: the deep learning model starts forgetting what
it has learned from the previous domain, and this loss of previously learned source
information occurs even if the source and target domains are heterogeneous [10].
It is also an empirical question how many new layers to add to the pretrained model
for the target dataset, and which pretrained layers or weights to freeze when the
model is applied to the target dataset. The transfer learning approach is therefore
not always a feasible way to transfer previous knowledge to a new task.
    Furthermore, a common issue with deep learning models is their complex
structure. Deep learning methods have brought much success in numerous fields and
have produced breakthrough results, but state-of-the-art results are often achieved
with very complex architectures. The model of Sutskever et al. [25] comprises four
long short-term memory (LSTM) layers with 1000 hidden units each. Similarly, Zhou
et al. [33] proposed a model with multiple LSTM layers of 512 hidden units each.
Such models have millions of parameters, and training them requires considerable
computational power. They also need more storage space, which makes them unsuitable
for deployment on systems with limited storage capacity, e.g., cell phones. In such
situations, these complex models must be compressed without compromising their
performance and while preserving the generalization they have learned.
     In this regard, knowledge distillation can be used to compress a cumbersome
model into a simpler one that is more feasible to deploy on end devices [11]. In
the knowledge distillation technique [9], one model teaches another model through
its learned knowledge. This supervision is done through predictions: the learning
(student) model mimics the predictions of the teacher model. The student model
therefore exploits two gradients, its own and the one derived from the teacher
model, and for this reason it can produce better results. Romero et al. [22] showed
that the intermediate layers of the teacher model give useful information to the
student model during training. Liu et al. [14] improved the performance of a single
model by distilling knowledge from an ensemble of different deep neural networks.
Tan et al. [26] showed performance gains by distilling knowledge from individual
machine translation models into a multilingual translation model. Zhang et al. [32]
demonstrated an increase in performance when different student models were trained
mutually and taught each other through knowledge distillation. Sun et al. [24]
showed performance gains with a knowledge distillation approach in which the
intermediate layers of the teacher model were used to train a task-specific
student model.
     This research applies the knowledge distillation approach to enhance the
performance of deep learning models on the BioNER task; the purpose is therefore
to increase performance rather than to compress the model. The multi-task model
distills knowledge into the single-task model through its logits: the single-task
model matches both the true labels and the logits of the multi-task model during
training. Logits are the inputs to the softmax output layer [9]; they carry more
information than the output probabilities, and their values range over (−∞, +∞).
In this way, the single-task model not only learns from the true labels but also
matches the logits of the multi-task model.
     The rest of the paper is organized as follows. Section 2 introduces the
knowledge distillation approach, followed by our proposed methodology in Section 3.
The experimental setup is described in Section 4, and the results are discussed in
Section 5. Finally, Section 6 concludes the paper.


2   Knowledge Distillation

In transfer learning, the representation learned in a source domain is reused in
another related domain. In contrast, the objective of knowledge distillation is to
train a model with the knowledge learned by another model: a simple (student) model
is trained on the knowledge learned by a complex (teacher) model. More specifically,
knowledge distillation addresses how to transfer the generalization of one model,
usually a complex one (the teacher), to another model, usually a simple one (the
student). Complex models or ensemble approaches usually produce better results than
the simple single-task model, but they are computationally expensive to train.
Knowledge distillation helps the simple (student) model produce better results than
the stand-alone single model and the ensemble models. The student model can also be
trained on fewer training examples, since it additionally consumes the knowledge
learned by the teacher model during training. The idea is that the complex model
has already generalized over the data during its own training, which helps the
student model achieve, or nearly achieve, the generalization of the teacher. The
student model thus learns not only from its own gradient but also from the gradient
derived from the teacher's knowledge.

    Knowledge is usually transferred from a teacher model in the form of the
probabilities it predicts. The objective of any learning model is to predict the
correct class for an input example, assigning a high probability to that class and
small probability values to the remaining classes. The probabilities assigned to
the incorrect classes are not random: they carry information that reflects how the
model has generalized over the classes in the dataset. For instance, there is very
little chance of misclassifying an image of a motorbike as a car, but the
probability of misclassifying it as a truck would still be higher. The softmax
activation function outputs a probability distribution over the possible classes
for a specific instance, and these probabilities sum to 1.

     These softmax probabilities convey more information than one-hot "hard
labels". For instance, the softmax probabilities [0.7, 0.2, 0.1] show a ranking of
the classes, whereas no such information can be extracted from hard labels such as
[1, 0, 0]. The posterior probabilities can thus pass an extra useful signal to the
student model during its training. However, training the student model to match
these probabilities directly may not be very useful, since the student model mostly
pays attention to the highest probability value. To overcome this barrier, it is
better to soften the final softmax output probabilities by normalizing them [9].
The normalized probabilities represent soft labels, which provide knowledge
distillation to the student model [29]; the student model then pays attention to
the other values as well, along with the most probable class. Hinton et al. [9]
proposed a temperature parameter, T, to soften the posterior probabilities. With
T = 1 we obtain the standard softmax function, as shown in Equation 1. A larger
value of T softens the softmax output further and enhances the output probabilities
of the non-target classes [19]; on the downside, it also reduces the probability of
the target class. It is therefore important to choose a suitable value for the
temperature parameter.




\[
  \mathrm{Softmax}(z_i) \;=\; \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \tag{1}
\]
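To make the effect of the temperature concrete, the following minimal sketch (in
Python with NumPy; the logit values are illustrative and not taken from the paper)
shows how increasing T softens the output distribution of Equation 1:

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Equation 1: softmax over logits z, softened by temperature T."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([4.0, 1.5, 0.5])                 # illustrative logits
print(softmax_with_temperature(logits, T=1))       # ~[0.90, 0.07, 0.03]: near one-hot
print(softmax_with_temperature(logits, T=5))       # ~[0.48, 0.29, 0.24]: softened
```

With T = 1 the highest-scoring class dominates, while a larger T exposes the
ranking information carried by the non-target classes.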


3    Our Approach
Figure 1 introduces our proposed knowledge distillation approach. The teacher model
is a multi-task model (MTM) that takes the word and character inputs of the
sentences. We use bidirectional LSTMs (BiLSTMs) to process the sequence in both
directions [21]. The upper layers of the MTM, shown in the black rounded rectangle,
are shared among all the datasets; the bottom layers, shown in the red rounded
rectangle, are dataset-specific, and a softmax layer is used for output labelling.
In the multi-task learning (MTL) approach, the shared layers help one task to be
learned better with the help of another task: training jointly on related tasks
allows the multi-task model to learn features common to the different tasks through
the shared layers [2], while the task-specific layers learn features more related
to the current task. Training related tasks together also helps the model optimize
its parameter values and lowers the chance of overfitting to any specific task [15].
We therefore assume that the student model will also be less prone to overfitting
thanks to the knowledge distilled from the MTM. Since the purpose of our work is to
transfer token-level knowledge, we use a softmax function at the output layer:
token-level knowledge distillation is not possible with a conditional random field
(CRF), which predicts the labels of the whole sequence. A CRF-based model labels
the sequence globally, considering the associations between neighboring labels,
which limits the knowledge that can be distilled from the teacher models [30].
    An alternating training approach was adopted for the MTM training phase.
Suppose we have training sets D_1, D_2, ..., D_t, related to tasks T_1, T_2, ..., T_t
respectively. During the training phase, a training set D_i is selected at random
and both the shared layers and the layers specific to the corresponding task T_i
are activated. Every task has its own optimizer, so during training only the
optimizer specific to task T_i is activated and the corresponding loss function is
optimized.
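A minimal sketch of this setup is given below (in PyTorch). The class and function
names, hidden sizes, and the iterator-style data loaders are our own illustrative
assumptions, not the paper's implementation; the sketch only mirrors the structure
described above (shared BiLSTM, task-specific BiLSTM and softmax classifier per
dataset, one optimizer per task, random task selection at each step):

```python
import random
import torch
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    """Sketch of the MTM: a shared BiLSTM followed by a task-specific BiLSTM
    and a linear layer producing per-token logits (fed to a softmax) per task."""
    def __init__(self, input_dim, hidden_dim, labels_per_task):
        super().__init__()
        self.shared = nn.LSTM(input_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.task_lstms = nn.ModuleList([
            nn.LSTM(2 * hidden_dim, hidden_dim, bidirectional=True, batch_first=True)
            for _ in labels_per_task])
        self.classifiers = nn.ModuleList([
            nn.Linear(2 * hidden_dim, n) for n in labels_per_task])

    def forward(self, x, task_id):
        h, _ = self.shared(x)                  # layers shared by all tasks
        h, _ = self.task_lstms[task_id](h)     # task-specific layers
        return self.classifiers[task_id](h)    # token-level logits

def train_alternating(model, batch_iters, optimizers, steps):
    """Alternating training: pick a dataset D_i at random, activate the shared
    layers plus the layers/optimizer of task T_i, and optimize its loss."""
    ce = nn.CrossEntropyLoss()
    for _ in range(steps):
        i = random.randrange(len(batch_iters))
        x, y = next(batch_iters[i])            # a batch of (features, gold labels)
        logits = model(x, i)
        loss = ce(logits.view(-1, logits.size(-1)), y.view(-1))
        optimizers[i].zero_grad()
        loss.backward()
        optimizers[i].step()
```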
    The student model is in fact the counterpart single-task model (STM) of the
MTM; the structures of the two models are therefore the same. In this research we
perform knowledge distillation using the teacher (MTM) logits, z_t, i.e., the
inputs to its softmax layer [28]. The logits can take values over (−∞, +∞) and
therefore carry more of the so-called dark knowledge. During training, the student
model considers the hard labels as well as the logits z_t of the teacher model
(MTM). We do not normalize the logits, which corresponds to a temperature T = 1.
We consider the losses for both predictions, i.e., the loss of matching the hard
labels and the loss of matching the logits. The hard-target matching loss, which
involves one-hot labels, can be referred to as the student loss, whereas the
distillation loss considers the logit matching. The loss function of our student
model is shown in Equation 2: the distillation loss minimizes the mean squared
error between the student logits, z_s, and the teacher logits, z_t. Here x
represents the input, W the student model's parameters, H the cross-entropy loss,
y the true hard labels, and σ the softmax function. The coefficients α and β
specify the balance between the student loss and the distillation loss, with
β = 1 − α.

[Figure 1 depicts the student model (left) and the teacher MTM (right): word and
character inputs, a shared BiLSTM layer, task-specific BiLSTM layers, and softmax
outputs. The distillation loss connects the student logits z_s to the teacher
logits z_t, while the student loss is computed against the hard labels; both are
combined into the KD loss.]

Fig. 1. Proposed Knowledge Distillation Approach (colored circles show embedding)



\[
  L(x; W) \;=\; \alpha \cdot H\big(y, \sigma(z_s)\big) \;+\; \beta \cdot \mathrm{MSE}(z_s, z_t) \tag{2}
\]
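A minimal sketch of this loss in PyTorch is shown below; the function name and
tensor shapes are illustrative assumptions, while the two terms mirror Equation 2
(cross-entropy of the student softmax against the hard labels, plus mean squared
error between student and teacher logits):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, hard_labels, alpha):
    """Equation 2: alpha * H(y, softmax(z_s)) + beta * MSE(z_s, z_t), with beta = 1 - alpha.
    student_logits, teacher_logits: (num_tokens, num_labels); hard_labels: (num_tokens,)."""
    student_loss = F.cross_entropy(student_logits, hard_labels)   # student loss on hard labels
    distill_loss = F.mse_loss(student_logits, teacher_logits)     # match the teacher logits
    return alpha * student_loss + (1.0 - alpha) * distill_loss
```

With α = 1 this reduces to plain supervised training on the hard labels, while
α = 0 trains the student to match the teacher's logits only; these are the settings
explored in the next section.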


4   Experiments

As a first step, the MTM model, shown on the right side of Figure 1, is trained
separately; it is then used to distill knowledge into the student model. We perform
knowledge distillation from the MTM using two approaches. In the first approach, we
perform simple knowledge distillation as shown in Figure 1, where the MTM's logits
are used to train the student model. In the second approach, we use the logits of
an ensemble of MTMs to train the student model. The MTMs used in the ensemble have
the same architecture but are initialized with different seed values, which results
in five different sets of predictions. We take the average of the logits of these
MTMs and use it to train our student model. Furthermore, the F1-scores presented in
the following sections are also averages over five runs with different seed values.
In the rest of this article, MTM and teacher MTM are used interchangeably, as the
logits of the MTM are used to train the student models.
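The ensemble teacher can be sketched as follows (in PyTorch, reusing the
hypothetical MultiTaskTagger teachers from the earlier sketch); the averaged
per-token logits serve as the distillation target z_t for the student:

```python
import torch

@torch.no_grad()
def ensemble_teacher_logits(teachers, x, task_id):
    """Average the per-token logits of several MTMs trained with different seeds."""
    logits = [teacher(x, task_id) for teacher in teachers]   # one logit tensor per MTM
    return torch.stack(logits, dim=0).mean(dim=0)            # averaged teacher logits z_t
```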
     We perform experiments for different values of α, i.e., [0, 0.5, 1]. We do not
tune α as a hyper-parameter; instead, the values are selected in a simple,
straightforward way. With α = 0 the student model learns only with the distillation
loss, β · MSE(z_s, z_t), which tries to match the logits of the student and teacher
models. With α = 0.5, the student loss and the distillation loss are weighted
equally. Finally, α = 1 makes the student model consider only the student loss,
α · H(y, σ(z_s)). Furthermore, words are represented with pre-trained domain-specific
word embeddings; more specifically, we use the WikiPubMed-PMC embeddings, which are
trained on a large set of PubMed Central (PMC) articles and PubMed abstracts as well
as on English Wikipedia articles [7]. Character embeddings, on the other hand, are
initialized randomly and further processed by a BiLSTM (see the sketch below). In
this paper, we perform experiments on the 15 datasets also used by Crichton et al.
[6] and Wang et al. [31]. The bio-entities in these datasets are Chemical, Species,
Cell, Gene/Protein, Cell Component, and Disease3. A description of these entities
can be found in [16]. Each dataset contains separate training, development, and
test sets. We follow the same experimental setup adopted by Wang et al.4, which
uses both the training and development sets for training the model.
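As described above, each token is represented by a pre-trained word embedding
concatenated with a character-level representation computed by a BiLSTM over
randomly initialized character embeddings. The following PyTorch sketch illustrates
such an input layer; the class name, dimensions, and the absence of padding/masking
handling are our own simplified assumptions:

```python
import torch
import torch.nn as nn

class WordCharEncoder(nn.Module):
    """Concatenate pre-trained word embeddings with a char-BiLSTM representation."""
    def __init__(self, word_vectors, char_vocab_size, char_dim=30, char_hidden=25):
        super().__init__()
        # Pre-trained word embeddings (e.g., WikiPubMed-PMC), fine-tuned during training.
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=False)
        # Randomly initialized character embeddings, refined during training.
        self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch, seq_len); char_ids: (batch, seq_len, max_word_len)
        b, s, c = char_ids.shape
        chars = self.char_emb(char_ids.view(b * s, c))     # (b*s, max_word_len, char_dim)
        _, (h_n, _) = self.char_lstm(chars)                # h_n: (2, b*s, char_hidden)
        char_repr = torch.cat([h_n[0], h_n[1]], dim=-1).view(b, s, -1)
        return torch.cat([self.word_emb(word_ids), char_repr], dim=-1)
```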


5   Results and Discussion

The F1-scores of our student model for different values of α are shown in Table 1.
As mentioned in the previous section, the MTM is the teacher model and is used to
distill knowledge into the student model via its logits. The best results are shown
in bold, while the second-best scores are in italics. It can be noticed that our
student model outperforms the MTM approach, except on BioNLP13CG and most of the
protein datasets (BioNLP11EPI, BioNLP11ID, BioNLP13GE, and Ex-PTM). We speculate
that, since BioNLP11EPI, BioNLP11ID, and Ex-PTM were created for the BioNLP 2011
shared task, they might share similar characteristics, which would explain the
performance decrease observed for all three datasets. In particular, the entity
mentions in BioNLP11EPI and Ex-PTM were automatically annotated with the BANNER
named entity tagger [12], which was trained on the GENETAG corpus [27]; we
anticipate that wrong entity classifications may have propagated to both datasets
because the annotation comes from the same tagger. BioNLP13CG, on the other hand,
contains 16 different classes, some of which have very few examples in the dataset.
These classes represent cancer genetics (CG) entities and are highly correlated
with each other, so our student model might not be able to differentiate among them.
3
  The datasets can be found at https://github.com/cambridgeltl/MTL-Bioinformatics-2016
4
  https://github.com/yuzhimanhua/Multi-BioNER


    The student model with α = 0 shows a performance gain on 6 datasets compared to
the MTM (teacher). The student model trained with α = 0.5 achieves an increase in
performance on 9 and 8 datasets compared to the MTM and to the student with α = 0,
respectively. Similarly, the student model with α = 1 improves the results on 11
datasets against the MTM and yields the best performance on 10 and 11 datasets
compared to the students with α = 0 and α = 0.5, respectively.
    We further analyse the performance of the student models with respect to the
STM, which is also reported in Table 1. It can be noticed that our student model
outperforms the STM on most datasets, except BC4CHEMD and CRAFT. We analyzed the
performance of our teacher model (MTM) on BC4CHEMD and CRAFT and found a drop of up
to 3 points of F1-score compared to the STM on these two datasets; we therefore
assume that the teacher MTM could not distill much useful knowledge for them. The
student model with α = 0 obtains the best performance on 13 datasets compared to
the STM. Likewise, the student with α = 0.5 obtains a performance gain on 12
datasets, and the student with α = 1 on 13 datasets, compared to the STM.
    We also use the second approach to train our student model, where the logits of
an ensemble of MTMs are used. Instead of using teacher models with a different
architecture, we use the same MTM teacher architecture initialized with different
seed values; for this reason, the 5 teacher models produce different predictions.
We average their logits and train each single student model on the averaged logits.
     Table 2 compares the results of our second approach. We can notice a
remarkable improvement in the results of the student models trained with the
ensemble approach. For two protein datasets (BioNLP13GE and Ex-PTM), our student
models are still unable to improve over the teacher (MTM); however, they show a
performance gain on the other protein datasets, for which our first approach did
not improve performance. The student model trained with the distillation loss only
(α = 0) shows a performance gain on 11 datasets against the teacher model (MTM).
Similarly, when both losses are considered (α = 0.5), i.e., student loss and
distillation loss, the student model improves the results on 13 and 6 datasets
compared to the teacher model and to the student with α = 0, respectively. The
student model trained with only the student loss (α = 1) achieves a performance
gain on 13 datasets compared to the teacher model (MTM), and improves the results
on 6 datasets compared to both the students with α = 0 and α = 0.5. Compared with
the STM, all the student models show a performance gain on all 15 datasets. The
student models trained with the logits of the ensemble of MTMs produce better
results because ensemble predictions are more accurate than a single prediction.

   Datasets             MTM       STM     Student α=0   Student α=0.5   Student α=1
   AnatEM               86.78     86.53       87.56         87.55       87.63
   BC2GM                79.68     81.07       81.25         81.04       81.29
   BC4CHEMD             86.80     90.24       89.45         89.50       89.58
   BC5CDR               87.49     88.09       88.33         88.30       88.32
   BioNLP09             88.40     87.37       88.70        88.82        88.74
   BioNLP11EPI          84.56     82.58       84.45         84.44       84.56
   BioNLP11ID           87.26     85.58       86.98         86.77       86.91
   BioNLP13CG           83.83     82.11       83.27        83.39        83.35
   BioNLP13GE           80.06     75.38       77.64        78.08        77.84
   BioNLP13PC           88.17     87.26       88.05         88.09       88.22
   CRAFT                81.96     84.27       83.98        83.98        83.81
   ExPTM                80.69     73.06       76.11         76.39       76.71
   JNLPBA               70.40     70.86       72.14        72.20        72.02
   linnaeus             88.32     87.88       88.49        88.58        88.91
   NCBI                 84.50     83.98       84.88         84.72       84.67
   Average                83.93    83.08      84.08        84.12       84.17
   Average Variance       0.17      0.27      0.15          0.21         0.24
Table 1. Results comparison of the proposed student models. The Average represents
the average F1-score of all datasets. The Average Variance represents the average
variance of all datasets.




    We also compare our results with state-of-the-art models. Table 3 compares the
results of our proposed student models with the models of Wang et al. [31] and
Crichton et al. [6]; we use their published results instead of regenerating them.
Both Wang et al. and Crichton et al. used the MTL approach and trained their MTM on
the same 15 datasets. Our MTM structure resembles the model proposed by Wang et
al., but we use a task-specific BiLSTM layer and a softmax layer instead of a CRF.
The table shows that our proposed approach achieves a substantial increase in
F1-score compared to the model of Crichton et al., while the model of Wang et al.
retains a performance advantage on 5 datasets. The student model with α = 1 shows
the best results against the benchmarks. The comparison of our second approach,
with student models trained on an ensemble of MTMs, is shown in Table 4: our second
approach again outperforms Crichton et al. and shows gains on most of the datasets
compared to Wang et al., except for BioNLP13GE and Ex-PTM.
    We also performed a statistical analysis of our results using the Friedman test
[34], shown in Figure 2. We want to determine whether the differences in results
among the models are statistically significant. We observe that the student models
trained with the logits of a single teacher (our first approach) do not produce
statistically significant differences with respect to the teacher model.


  Datasets              MTM       STM     Student* α=0   Student* α=0.5   Student* α=1
  AnatEM                86.78     86.53           87.97       87.97          88.04
  BC2GM                 79.68     81.07          81.96        81.78           81.89
  BC4CHEMD              86.80     90.24          90.48       90.47            90.45
  BC5CDR                87.49     88.09          88.76        88.68           88.71
  BioNLP09              88.40     87.37           89.05      89.12            89.08
  BioNLP11EPI           84.56     82.58           84.73       84.72          84.89
  BioNLP11ID            87.26     85.58           87.05      87.52            87.37
  BioNLP13CG            83.83     82.11           83.80       83.88          84.00
  BioNLP13GE            80.06     75.38           78.61       78.60           78.60
  BioNLP13PC            88.17     87.26           88.72      88.76            88.52
  CRAFT                 81.96     84.27          85.15        84.89           84.89
  ExPTM                 80.69     73.06           76.93       77.17           77.33
  JNLPBA                70.40     70.86           72.51      72.54            72.50
  linnaeus              88.32     87.88          89.44        89.05           88.84
  NCBI                  84.50     83.98          86.12        85.70           85.66
  Average                83.93      83.08      84.75           84.72           84.72
  Average Variance        0.17       0.27      0.09             0.21            0.11
Table 2. Results comparison of the proposed student models. The Average represents
the average F1-score of all datasets. The Average Variance represents the average
variance of all datasets. (* Student models trained with an ensemble of MTMs.)




This is understandable, as that student model is unable to show a performance gain
on most of the datasets against the teacher model (Table 1). However, the results
of that student model (our first approach) are statistically significant with
respect to the STM. On the other hand, the results of our second approach (student
models trained with the logits of an ensemble of MTMs), denoted Ens MTM, are
statistically significant compared to both the teacher (MTM) and the STM. The two
groups of student models, with and without the ensemble approach, also produce
results that are statistically significantly different from each other. In
contrast, the student models trained without the ensemble of MTM logits are not
significantly different among themselves, and the same holds for the student models
trained with the ensemble MTM logits.
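For reference, a statistical comparison of this kind can be sketched as follows (in
Python, using scipy and the scikit-posthocs package); only three datasets and three
models from Tables 1 and 2 are shown here for brevity, and the column names are our
own shorthand:

```python
import pandas as pd
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Rows are datasets (blocks), columns are the compared models (groups).
scores = pd.DataFrame({
    "MTM":            [86.78, 79.68, 86.80],
    "STM":            [86.53, 81.07, 90.24],
    "St_Ens_alpha0":  [87.97, 81.96, 90.48],
}, index=["AnatEM", "BC2GM", "BC4CHEMD"])

# Friedman test: do the models' rankings differ significantly across datasets?
stat, p = friedmanchisquare(*[scores[c] for c in scores.columns])
print(f"Friedman chi2 = {stat:.3f}, p = {p:.4f}")

# Post-hoc pairwise Conover-Friedman analysis, as in Figure 2.
print(sp.posthoc_conover_friedman(scores))
```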
    In Figure 3, the models are shown according to their statistical ranks,
decreasing from left to right. The arrows indicate that the difference in results
between two models is statistically significant with p < 0.001. The group of
student models trained with an ensemble of MTMs, shown in the black dashed
rectangle, is statistically better than the rest of the models. In particular, the
student model St Ens α = 0 is statistically better than all the other models,
showing that our second approach learns best with only the distillation loss.


Datasets            Wang et al. [31]  Crichton et al. [6]  Student α=0  Student α=0.5  Student α=1
AnatEM                 86.04        82.21        87.56         87.55       87.63
BC2GM                  78.86        73.17        81.25         81.04       81.29
BC4CHEMD               88.83        83.02        89.45        89.50        89.58
BC5CDR                 88.14        83.90        88.33         88.30       88.32
BioNLP09               88.08        84.20        88.70        88.82        88.74
BioNLP11EPI            83.18        78.86        84.45         84.44       84.56
BioNLP11ID             83.26        81.73        86.98         86.77       86.91
BioNLP13CG             82.48        78.90        83.27        83.39        83.35
BioNLP13GE            79.87         78.58        77.64        78.08        77.84
BioNLP13PC            88.46         81.92        88.05         88.09       88.22
CRAFT                  82.89        79.56        83.98        83.98        83.81
ExPTM                 80.19         74.90        76.11         76.39       76.71
JNLPBA                72.21         70.09        72.14        72.20        72.02
linnaeus               88.88        84.04        88.49         88.58       88.91
NCBI                  85.54         80.37        84.88        84.72        84.67
Average               83.79         79.70       84.08        84.12        84.17
Average Variance       —             —           0.15         0.21         0.24
Table 3. Results comparison of proposed student models with state-of-the-art results




           Fig. 2. Posthoc Pairwise Analysis with Conover Friedman Test



We also consider our first group of student models (trained without the ensemble
approach), shown in the blue dashed rectangle: the student model St α = 1, trained
with the student loss, is statistically better than the models shown to its right.


Datasets            Wang et al. [31]  Crichton et al. [6]  Student* α=0  Student* α=0.5  Student* α=1
AnatEM                86.04        82.21           87.97         87.97         88.04
BC2GM                 78.86        73.17          81.96          81.78          81.89
BC4CHEMD              88.83        83.02          90.48          90.47          90.45
BC5CDR                88.14        83.90          88.76          88.68          88.71
BioNLP09              88.08        84.20           89.05        89.12           89.08
BioNLP11EPI           83.18        78.86           84.73         84.72         84.89
BioNLP11ID            83.26        81.73           87.05        87.52           87.37
BioNLP13CG            82.48        78.90           83.80         83.88         84.00
BioNLP13GE           79.87         78.58           78.61         78.60          78.60
BioNLP13PC            88.46        81.92           88.72        88.76           88.52
CRAFT                 82.89        79.56          85.15          84.89          84.89
ExPTM                80.19         74.90           76.93         77.17          77.33
JNLPBA                72.21        70.09           72.51        72.54           72.50
linnaeus              88.88        84.04          89.44          89.05          88.84
NCBI                  85.54        80.37          86.12          85.70          85.66
Average              83.79         79.70        84.75         84.72     84.72
Average Variance       —             —           0.09          0.21      0.11
  Table 4. Results comparison of proposed student models with state-of-the-art
  results. (* Student models trained with an ensemble of MTMs.)




  Fig. 3. Statistical comparison of our models. An arrow indicates that the difference
  between two models is statistically significant with p < 0.001. St Ens denotes a
  student model trained with the ensemble of MTMs.


  6    Conclusions

  In this research, we introduced knowledge distillation to increase the perfor-
  mance on the BioNER task. We used an MTM as the teacher model because of the
  advantages the MTM has over the STM, and we further used an ensemble of MTMs in
  our proposed knowledge distillation approach. The knowledge distillation is
  performed through the MTM's logits. By analyzing the F1-scores and a statistical
  test, we found our approach to be better than both the teacher MTM and the STM,
  and we found that using an ensemble of MTMs as the teacher model is more
  beneficial than using a single MTM. In future work, we will use the probability
  distributions of the softmax predictions for the student models. Furthermore,
  different teacher model architectures will also be combined in an ensemble
  approach to supervise the student model.


References
 1. Alam, F., Corazza, A., Lavelli, A., Zanoli, R.: A knowledge-poor approach to
    chemical-disease relation extraction. Database J. Biol. Databases Curation 2016
    (2016), https://doi.org/10.1093/database/baw071
 2. Bansal, T., Belanger, D., McCallum, A.: Ask the GRU: Multi-task learning
    for deep text recommendations. In: Sen, S., Geyer, W., Freyne, J., Castells,
    P. (eds.) Proceedings of the 10th ACM Conference on Recommender Sys-
    tems, Boston, MA, USA, September 15-19, 2016. pp. 107–114. ACM (2016),
    https://doi.org/10.1145/2959100.2959180
 3. Bingel, J., Søgaard, A.: Identifying beneficial task relations for multi-task learn-
    ing in deep neural networks. In: Lapata, M., Blunsom, P., Koller, A. (eds.) Pro-
    ceedings of the 15th Conference of the European Chapter of the Association for
    Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume
    2: Short Papers. pp. 164–169. Association for Computational Linguistics (2017),
    https://doi.org/10.18653/v1/e17-2026
 4. Chowdhury, M.F.M., Lavelli, A.: Disease mention recognition with specific fea-
    tures. In: Cohen, K.B., Demner-Fushman, D., Ananiadou, S., Pestian, J., Tsu-
    jii, J., Webber, B.L. (eds.) Proceedings of the 2010 Workshop on Biomed-
    ical Natural Language Processing, BioNLP@ACL 2010, Uppsala, Sweden,
    July 15, 2010. pp. 83–90. Association for Computational Linguistics (2010),
    https://www.aclweb.org/anthology/W10-1911/
 5. Clark, K., Luong, M., Khandelwal, U., Manning, C.D., Le, Q.V.: Bam! born-again
    multi-task networks for natural language understanding. In: Korhonen, A., Traum,
    D.R., Màrquez, L. (eds.) Proceedings of the 57th Conference of the Association for
    Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019,
    Volume 1: Long Papers. pp. 5931–5937. Association for Computational Linguistics
    (2019), https://doi.org/10.18653/v1/p19-1595
 6. Crichton, G.K.O., Pyysalo, S., Chiu, B., Korhonen, A.: A neural network multi-
    task learning approach to biomedical named entity recognition. BMC Bioinform.
    18(1), 368:1–368:14 (2017), https://doi.org/10.1186/s12859-017-1776-8
 7. Giorgi, J.M., Bader, G.D.: Transfer learning for biomedical named entity
    recognition with neural networks. Bioinformatics 34(23), 4087–4094 (2018),
    https://doi.org/10.1093/bioinformatics/bty449
 8. Gridach, M.: Character-level neural network for biomedical named entity
    recognition. J. Biomed. Informatics 70, 85–91 (2017),
    https://doi.org/10.1016/j.jbi.2017.05.002
 9. Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network.
    CoRR abs/1503.02531 (2015), http://arxiv.org/abs/1503.02531
10. Jung, H., Ju, J., Jung, M., Kim, J.: Less-forgetting learning in deep neural net-
    works. CoRR abs/1607.00122 (2016), http://arxiv.org/abs/1607.00122
11. Kim, Y., Rush, A.M.: Sequence-level knowledge distillation. In: Su, J., Carreras,
    X., Duh, K. (eds.) Proceedings of the 2016 Conference on Empirical Methods in
    Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-
    4, 2016. pp. 1317–1327. The Association for Computational Linguistics (2016),
    https://doi.org/10.18653/v1/d16-1139


12. Leaman, R., Gonzalez, G.: BANNER: an executable survey of advances in biomed-
    ical named entity recognition. In: Altman, R.B., Dunker, A.K., Hunter, L., Mur-
    ray, T., Klein, T.E. (eds.) Biocomputing 2008, Proceedings of the Pacific Sympo-
    sium, Kohala Coast, Hawaii, USA, 4-8 January 2008. pp. 652–663. World Scientific
    (2008), http://psb.stanford.edu/psb-online/proceedings/psb08/leaman.pdf
13. Liu, P., Qiu, X., Huang, X.: Adversarial multi-task learning for text classification.
    In: Barzilay, R., Kan, M. (eds.) Proceedings of the 55th Annual Meeting of the
    Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July
    30 - August 4, Volume 1: Long Papers. pp. 1–10. Association for Computational
    Linguistics (2017), https://doi.org/10.18653/v1/P17-1001
14. Liu, X., He, P., Chen, W., Gao, J.: Improving multi-task deep neural net-
    works via knowledge distillation for natural language understanding. CoRR
    abs/1904.09482 (2019), http://arxiv.org/abs/1904.09482
15. Lu, P., Bai, T., Langlais, P.: SC-LSTM: learning task-specific representations in
    multi-task learning for sequence labeling. In: Burstein, J., Doran, C., Solorio, T.
    (eds.) Proceedings of the 2019 Conference of the North American Chapter of
    the Association for Computational Linguistics: Human Language Technologies,
    NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and
    Short Papers). pp. 2396–2406. Association for Computational Linguistics (2019),
    https://doi.org/10.18653/v1/n19-1249
16. Mehmood, T., Gerevini, A., Lavelli, A., Serina, I.: Leveraging multi-task learning
    for biomedical named entity recognition. In: Alviano, M., Greco, G., Scarcello, F.
    (eds.) AI*IA 2019 - Advances in Artificial Intelligence - XVIIIth International Con-
    ference of the Italian Association for Artificial Intelligence, Rende, Italy, November
    19-22, 2019, Proceedings. Lecture Notes in Computer Science, vol. 11946, pp. 431–
    444. Springer (2019), https://doi.org/10.1007/978-3-030-35166-3_31
17. Mehmood, T., Gerevini, A., Lavelli, A., Serina, I.: Multi-task learning applied to
    biomedical named entity recognition task. In: Bernardi, R., Navigli, R., Semeraro,
    G. (eds.) Proceedings of the Sixth Italian Conference on Computational Linguis-
    tics, Bari, Italy, November 13-15, 2019. CEUR Workshop Proceedings, vol. 2481.
    CEUR-WS.org (2019), http://ceur-ws.org/Vol-2481/paper47.pdf
18. Mehmood, T., Gerevini, A.E., Lavelli, A., Serina, I.: Combining multi-task learning
    with transfer learning for biomedical named entity recognition. Procedia Computer
    Science 176, 848–857 (2020)
19. Mishra, A.K., Marr, D.: Apprentice: Using knowledge distillation techniques
    to improve low-precision network accuracy. In: 6th International Conference
    on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April
    30 - May 3, 2018, Conference Track Proceedings. OpenReview.net (2018),
    https://openreview.net/forum?id=B1ae1lZRb
20. Putelli, L., Gerevini, A., Lavelli, A., Serina, I.: Applying self-interaction atten-
    tion for extracting drug-drug interactions. In: Alviano, M., Greco, G., Scarcello, F.
    (eds.) AI*IA 2019 - Advances in Artificial Intelligence - XVIIIth International Con-
    ference of the Italian Association for Artificial Intelligence, Rende, Italy, November
    19-22, 2019, Proceedings. Lecture Notes in Computer Science, vol. 11946, pp. 445–
    460. Springer (2019), https://doi.org/10.1007/978-3-030-35166-3_32
21. Putelli, L., Gerevini, A.E., Lavelli, A., Serina, I.: The impact of self-interaction
    attention on the extraction of drug-drug interactions. In: Bernardi, R., Navigli,
    R., Semeraro, G. (eds.) Proceedings of the Sixth Italian Conference on Computa-
    tional Linguistics, Bari, Italy, November 13-15, 2019. CEUR Workshop Proceed-
    ings, vol. 2481. CEUR-WS.org (2019), http://ceur-ws.org/Vol-2481/paper61.pdf


22. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets:
    Hints for thin deep nets. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Con-
    ference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9,
    2015, Conference Track Proceedings (2015), http://arxiv.org/abs/1412.6550
23. Serrà, J., Suris, D., Miron, M., Karatzoglou, A.: Overcoming catastrophic for-
    getting with hard attention to the task. In: Dy, J.G., Krause, A. (eds.) Pro-
    ceedings of the 35th International Conference on Machine Learning, ICML
    2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018. Proceed-
    ings of Machine Learning Research, vol. 80, pp. 4555–4564. PMLR (2018),
    http://proceedings.mlr.press/v80/serra18a.html
24. Sun, S., Cheng, Y., Gan, Z., Liu, J.: Patient knowledge distillation for BERT
    model compression. In: Inui, K., Jiang, J., Ng, V., Wan, X. (eds.) Proceedings of
    the 2019 Conference on Empirical Methods in Natural Language Processing and
    the 9th International Joint Conference on Natural Language Processing, EMNLP-
    IJCNLP 2019, Hong Kong, China, November 3-7, 2019. pp. 4322–4331. Association
    for Computational Linguistics (2019), https://doi.org/10.18653/v1/D19-1441
25. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural net-
    works. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger,
    K.Q. (eds.) Advances in Neural Information Processing Systems 27: Annual Con-
    ference on Neural Information Processing Systems 2014, December 8-13 2014, Mon-
    treal, Quebec, Canada. pp. 3104–3112 (2014),
    http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks
26. Tan, X., Ren, Y., He, D., Qin, T., Zhao, Z., Liu, T.: Multilingual neural ma-
    chine translation with knowledge distillation. In: 7th International Conference on
    Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
    OpenReview.net (2019), https://openreview.net/forum?id=S1gUsoR9YX
27. Tanabe, L.K., Xie, N., Thom, L.H., Matten, W., Wilbur, W.J.: GENETAG: a
    tagged corpus for gene/protein named entity recognition. BMC Bioinform. 6(S-1)
    (2005), https://doi.org/10.1186/1471-2105-6-S1-S3
28. Tang, R., Lu, Y., Liu, L., Mou, L., Vechtomova, O., Lin, J.: Distilling task-specific
    knowledge from BERT into simple neural networks. CoRR abs/1903.12136
    (2019), http://arxiv.org/abs/1903.12136
29. Wang, L., Yoon, K.: Knowledge distillation and student-teacher learning for vi-
    sual intelligence: A review and new outlooks. CoRR abs/2004.05937 (2020),
    https://arxiv.org/abs/2004.05937
30. Wang, X., Jiang, Y., Bach, N., Wang, T., Huang, F., Tu, K.: Structure-level
    knowledge distillation for multilingual sequence labeling. In: Jurafsky, D., Chai,
    J., Schluter, N., Tetreault, J.R. (eds.) Proceedings of the 58th Annual Meet-
    ing of the Association for Computational Linguistics, ACL 2020, Online, July
    5-10, 2020. pp. 3317–3330. Association for Computational Linguistics (2020),
    https://www.aclweb.org/anthology/2020.acl-main.304/
31. Wang, X., Zhang, Y., Ren, X., Zhang, Y., Zitnik, M., Shang, J.,
    Langlotz, C., Han, J.: Cross-type biomedical named entity recognition
    with deep multi-task learning. Bioinformatics 35(10), 1745–1752 (2019),
    https://doi.org/10.1093/bioinformatics/bty869
32. Zhang, Y., Xiang, T., Hospedales, T.M., Lu, H.: Deep mutual learning. CoRR
    abs/1706.00384 (2017), http://arxiv.org/abs/1706.00384
33. Zhou, J., Cao, Y., Wang, X., Li, P., Xu, W.: Deep recurrent models with
    fast-forward connections for neural machine translation. Trans. Assoc. Comput.
    Linguistics 4, 371–383 (2016),
    https://transacl.org/ojs/index.php/tacl/article/view/863


34. Zubani, M., Sigalini, L., Serina, I., Gerevini, A.E.: Evaluating different natural
    language understanding services in a real business case for the italian language.
    Procedia Computer Science 176, 995–1004 (2020)



