ELECTROLBERT: Combining Replaced Token
Detection and Sentence Order Prediction
Martin Reczko1
1
 Institute for Fundamental Biomedical Science, Biomedical Sciences Research Center "Alexander Fleming", 34 Fleming
Street, 16672 Vari, Greece


Abstract
ELECTROLBERT is a novel transformer algorithm that combines the replaced token detection of the ELECTRA system [1] with the sentence order prediction used in the ALBERT system [2]. As reported, the default next sentence prediction component used in most BERT-based transformers has the drawback that a random next sentence can easily be identified by its different topical scope. Sentence order prediction facilitates the detection of semantic flow and is well suited for finetuning question answering systems, as the pairing of a question with text related to the correct answer resembles the correct order of two sentences in a scientific text. The implementation of ELECTROLBERT is based on the BioELECTRA [3] code. ELECTROLBERT is pretrained on the 2022 baseline set of all PubMed abstracts provided by the National Library of Medicine, and two predictors are finetuned: one on relevant and non-relevant question–abstract pairs for document prediction, and one on examples for the “yes/no” type questions, both generated from the BioASQ10 training dataset [4]. For each novel question in the document prediction task of BioASQ10, 6750 PubMed abstracts are filtered from all PubMed abstracts for processing by ELECTROLBERT, using a combination of the GENSIM topic modelling system [5] and a simple infrequent word detection method. The system was continuously improved during the BioASQ10 competition, and in the last batch ELECTROLBERT ranked as the 3rd team for document prediction and took 1st place, both as a system and as a team, for the “yes/no” type questions.

Keywords
Biomedical Question Answering, ELECTRA, ALBERT, BioASQ




1. Introduction
Autoencoders with attention mechanisms have led to impressive improvements in many natural language processing tasks [6]. A computationally efficient approach is ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) [1], which replaces the traditional reconstruction of masked tokens from their context, as introduced in BERT [7], with the simpler identification of replaced tokens that are generated by an adaptive generator. Another successful algorithm is the ALBERT (A Lite BERT) transformer, which incorporates several parameter reduction techniques and replaces the next sentence prediction (NSP) component of BERT with a sentence order prediction (SOP) component that improves the detection of inter-sentence coherence. In the transformer called ELECTROLBERT (ELECTR-A-LBERT) introduced here,


CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
  reczko@fleming.gr (M. Reczko)
 0000-0002-0005-8718 (M. Reczko)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org
Table 1
Question–Answer pairs in real abstracts. Coherence detected by sentence order prediction (SOP) facilitates answer identification. The topic-shift detection obtained with next sentence prediction (NSP) is less specific.

   PMID       Question / Answer
   34884907   Q: Neurogenic Inflammation in the Context of Endometriosis-What Do We Know?
              A: Endometriosis (EM) is an estrogen-dependent disease characterized by the presence of epithelial, stromal, and smooth muscle cells outside the uterine cavity.
   34894155   Q: The question to ask is, is this prescribed load regimen congruent with Wolff’s law, and does it provide an adequate mechanical stimulus to maintain the functional health of periodontal complex?
              A: This question was answered by studying the effects of mice chewing on soft food (SF) and hard food (HF) while undergoing experimental tooth movement (ETM).
   34893939   Q: Will Artificial Intelligence (AI) re-humanize or de-humanize medicine?
              A: As AI becomes pervasive in clinical medicine, we argue that the ethical framework that sustains a responsible implementation of such technologies should be reconsidered.



the SOP component of ALBERT is combined with the ELECTRA approach, which lacks any component for predicting inter-sentence relations.


2. Motivation
NSP is defined as predicting whether two segments are consecutive in the training texts. The negative case uses sentences from two different documents. During the design of ALBERT, it was shown that a BERT-style NSP essentially fails on the SOP task, while SOP as used in ALBERT also achieves reasonable performance on the NSP problem. The conclusion was that NSP solves the easier task of predicting the topic shift between the two segments and fails to model the discourse-level coherence between consecutive sentences. As shown in Table 1, it is not uncommon for scientific texts to contain consecutive segments with a question and an answer. An efficient SOP would directly identify these cases as correct answers, and finetuning a question–document relevance predictor could effectively reuse the parts of the embedding that the SOP developed during training. An embedding obtained with NSP, based on detecting topic shifts, tends to accept all answers containing a topic similar to the question, even when they do not actually answer the question. An overview of the system architecture is shown in Figure 1.


3. Data generation and training
3.1. Initial document retrieval
As the complete corpus cannot be processed by the transformer for each question in reasonable time, a computationally efficient filter has to reduce the corpus for each question. To this end, the TF/IDF based free topic modelling system GENSIM [5] (with logarithmic term frequency weighting, IDF document frequency weighting and no document normalization) is used to return a list of 6750 documents relevant to each question for processing by ELECTROLBERT.

Figure 1: The ELECTROLBERT system. The components above the blue line are used during pretraining. During question answering, all components except the Generator are used.

A question in the fifth batch was "What is Waylivra?" (id:626aea6ae764a5320400003d), with the answer in the document with PMID 3130103: "Volanesorsen (Waylivra®), an antisense oligonucleotide inhibitor of . . . ". As the special ’registered trademark’ character "®" was included in the dictionary entry for this word but not in the question, the document was not detected in my submission. A new dictionary and index were therefore constructed, omitting the special characters ®, © and ™.
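The weighting scheme can be sketched in plain Python; this is a minimal illustration of the SMART variant described above, not the actual GENSIM code, and the corpus, tokenization and function names are stand-ins:

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Logarithmic term frequency, IDF weighting, no length normalization."""
    tf = Counter(tokens)
    return {w: (1.0 + math.log(c)) * math.log(n_docs / df[w])
            for w, c in tf.items() if w in df}

def rank_documents(query_tokens, docs, top_k=6750):
    """Score every tokenized document against the query and return the
    indices of the top_k best matches."""
    df = Counter()
    for d in docs:
        df.update(set(d))          # document frequency of each word
    q = tfidf_vector(query_tokens, df, len(docs))
    scores = []
    for i, d in enumerate(docs):
        v = tfidf_vector(d, df, len(docs))
        scores.append((sum(q.get(w, 0.0) * x for w, x in v.items()), i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:top_k]]
```

In GENSIM itself this scheme should roughly correspond to a `TfidfModel` with `smartirs='lfn'` (log TF, IDF, no normalization) combined with a similarity index.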

3.2. Infrequent words treatment
Processing the corpus of 22,542,347 PubMed abstracts and titles yields a dictionary with 5,440,825 words. To ensure the detection of infrequent words that would not be detected using the TF/IDF score, a list of documents is constructed for each word occurring fewer than 4000 times in the complete corpus. These words cover 99.55% of the dictionary. If a question contains one of these infrequent words, the union of the document lists for all infrequent words occurring in the question is appended to the list of documents detected by TF/IDF for evaluation by ELECTROLBERT.
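A minimal sketch of this inverted-list construction; the 4000-occurrence cutoff is counted over corpus-wide token occurrences here, which is an assumption, and all names are illustrative:

```python
from collections import Counter, defaultdict

def rare_word_postings(docs, max_count=4000):
    """Map every word occurring fewer than max_count times in the whole
    corpus to the set of document ids that contain it."""
    counts = Counter()
    for d in docs:
        counts.update(d)           # corpus-wide occurrence counts
    postings = defaultdict(set)
    for i, d in enumerate(docs):
        for w in set(d):
            if counts[w] < max_count:
                postings[w].add(i)
    return postings

def expand_candidates(question_tokens, postings, tfidf_hits):
    """Append the union of the rare-word document lists to the TF/IDF hits."""
    extra = set()
    for w in question_tokens:
        extra |= postings.get(w, set())
    return list(tfidf_hits) + sorted(extra - set(tfidf_hits))
```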

3.3. Pretraining data
Pretraining data was generated using a modified version of the ALBERT code. To emphasize sentence order prediction and subsequent question answering, all examples have two segments, where sentences longer than half of the maximal sequence length (without separators) are truncated to this length. In 50% of the examples, the sentence order is swapped. Instead of a random 10% fraction of the examples having only a single segment, no single-segment examples are used. The vocabulary available in BioELECTRA [3], containing 31620 entries1, is used and all training texts are lowercased. The titles and abstracts are extracted from the PubMed baseline XML files with a custom R script based on the xml2 package.
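The two-segment example construction described above can be sketched as follows; the half-length truncation and the 50% swap follow the text, while the special-token accounting and function name are simplified assumptions:

```python
import random

def make_sop_example(sent_a, sent_b, max_seq_length=128, rng=random):
    """Build one two-segment pretraining example. Three positions are
    reserved for the [CLS] and two [SEP] tokens; each segment is truncated
    to half of the remaining length. In 50% of the cases the segment order
    is swapped, which yields the negative SOP label."""
    half = (max_seq_length - 3) // 2
    a, b = sent_a[:half], sent_b[:half]
    if rng.random() < 0.5:
        return b, a, 1             # swapped order
    return a, b, 0                 # original order
```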

3.4. Pretraining
For subsequent relevance prediction, a model with the ’base’ configuration of BioELECTRA
is pretrained due to computational limitations. This corresponds to an embedding size of 768,
a generator hidden size of 256 and 12 hidden layers. The model was trained for 1.5 million
steps using the ADAM algorithm (𝛽1 = 0.9, 𝛽2 = 0.999, 𝜖 = 10−5 , learning_rate = 10−4 ,
weight_decay_rate = 0.005, max_seq_length = 128, train_batch_size = 14). The final replaced
token discriminator has an area under the curve (AUC) of 0.92 ± 0.0087. For finetuning the
“Yes/No” task, a ’large’ configuration is pretrained for 5 million steps with an embedding size
of 1024, a generator hidden size of 256 and 24 hidden layers (𝛽1 = 0.9, 𝛽2 = 0.999, 𝜖 = 10−5 ,
learning_rate = 10−4 , weight_decay_rate = 0.0025, max_seq_length = 128, train_batch_size
= 12). It should be noted that the training of the ’large’ model had not yet converged after 5
million steps, with a discriminator AUC of 0.85 ± 0.001. GPU memory restrictions required
the smaller train_batch_size for the ’large’ configuration. Smaller batch sizes are known to yield
noisier gradient estimates; this, in combination with the larger number of parameters
to optimize in the ’large’ configuration (’large’: 335 million vs. ’base’: 110 million), could explain
the slow convergence.
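The reported hyperparameters can be collected into configuration dictionaries; the key names are illustrative, not the actual ELECTRA/BioELECTRA configuration fields:

```python
# 'base' configuration as reported above (key names are illustrative)
base_config = dict(
    embedding_size=768,
    generator_hidden_size=256,
    num_hidden_layers=12,
    adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-5,
    learning_rate=1e-4,
    weight_decay_rate=0.005,
    max_seq_length=128,
    train_batch_size=14,
    num_train_steps=1_500_000,
)

# 'large' configuration: overrides of the 'base' values reported above
large_config = dict(
    base_config,
    embedding_size=1024,
    num_hidden_layers=24,
    weight_decay_rate=0.0025,
    train_batch_size=12,
    num_train_steps=5_000_000,
)
```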

3.5. Document relevance prediction finetuning
As initial question answering is focused on the “yes/no” type questions, 1148 questions of this
type are extracted from the BioASQ10 training set (873 “yes”, 275 “no”). All questions with
shared answer documents are grouped into equivalence sets. Initially, all documents relevant
to a question are used as positive relevance examples, and 2% of all documents not contained
in the equivalence set of a question are used as negative relevance examples, leading to an
approximate negative-to-positive example ratio of 20 to 1. As negative examples are generated
from completely unrelated questions, the class boundaries are distant. To better discriminate
the relevant documents obtained with GENSIM topic modelling, the construction of the negative
examples was modified after the third batch of BioASQ10. All questions of the relevance
training set were processed with GENSIM to produce 1000 documents for each question.
The documents were ranked according to their TF/IDF score and all documents between
rank 750 and 875 were used as negative examples, excluding potential positive examples
in these ranks. The start and end rank positions for the negative set were optimized by
retraining and maximizing the mean average precision measured on the second batch.
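A minimal sketch of this hard-negative selection; the function and argument names are illustrative:

```python
def negative_examples(ranked_doc_ids, positive_ids, start=750, end=875):
    """Select documents between the given ranks of the per-question TF/IDF
    ranking as negatives, excluding any known positives in that window."""
    positives = set(positive_ids)
    return [d for d in ranked_doc_ids[start:end] if d not in positives]
```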

  The model was trained for 18768 steps using the ADAM algorithm (𝛽1 = 0.9, 𝛽2 = 0.999, 𝜖 = 10−5 , learning_rate = 2 · 10−6 , weight_decay_rate = 0.003, max_seq_length = 200, train_batch_size = 16).

    1
      https://github.com/SciCrunch/bio_electra/blob/master/electra/data/pmc_2017_abstracts_wp_vocab_sorted.txt

Table 2
BioASQ10 prediction performance: document relevance and the “yes/no” task. For documents, the mean
average precision (MAP) is used for evaluation.

                          documents                              “yes/no” type questions
              BioASQ submission        final system
   batch      MAP      team rank     MAP      team rank         accuracy      team rank
     1       0.1121        7         0.3649       5                 -             -
     2       0.1632        9         0.3090       5                 -             -
     3       0.3209        8         0.3666       6               0.76           10
     4       0.3101        6         0.3140       6               0.75           11
     5       0.3242        4         0.3242       4              0.6429          10
     62      0.0977        3         0.0977       3               1.0             1

3.6. “Yes/No” prediction finetuning
A training set comprising 70% of the “Yes/No” type questions was extracted from the BioASQ10 training
set. To prevent trivial tests, the training set, the 10% validation set and the 20% test set contain
no questions that are in the equivalence set of any question in another set. The model
was trained for 88260 steps using the ADAM algorithm (𝛽1 = 0.9, 𝛽2 = 0.999, 𝜖 = 10−5 , learn-
ing_rate = 4 · 10−5 , weight_decay_rate = 0.008, max_seq_length = 256, train_batch_size = 4).
To classify a question, all documents relevant to the question are scored by ELECTROLBERT.
If the average score over all documents exceeds a fixed threshold, the answer to the question is
“Yes”. Otherwise, assigning “No” additionally requires a certain degree of unanimity among the
scores of the relevant documents: only if the variance of all scores is below a fixed threshold is
the final answer negative.
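The decision rule can be sketched as follows; the threshold values are illustrative placeholders, not the tuned ones:

```python
from statistics import mean, pvariance

def answer_yes_no(scores, yes_threshold=0.5, variance_threshold=0.05):
    """Aggregate the per-document scores of all relevant documents into a
    yes/no answer (threshold values are illustrative placeholders)."""
    if mean(scores) > yes_threshold:
        return "yes"
    # a "no" additionally requires near-unanimous low scores
    if pvariance(scores) < variance_threshold:
        return "no"
    return "yes"
```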

3.7. Fast abstract retrieval
A custom R script accessing chunks of 100000 abstracts based on their PMIDs can retrieve and
store more than 3000 abstracts with titles per minute using 15 CPUs. This step is one of the major
bottlenecks in guaranteeing the processing of 100 questions within 24 hours without the use of
large compute clusters.
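The chunked lookup underlying this step can be sketched as follows; the grouping-by-PMID layout is an assumption, and the original implementation is an R script:

```python
def group_by_chunk(pmids, chunk_size=100_000):
    """Group PMIDs by the 100,000-wide chunk their abstract is stored in,
    so that every chunk file has to be read only once."""
    groups = {}
    for pmid in pmids:
        groups.setdefault(pmid // chunk_size, []).append(pmid)
    return groups
```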


4. BioASQ10 results
The performance of ELECTROLBERT in each batch of BioASQ10 is listed in Table 2. For the
document retrieval task, the submissions that reflect the different development stages are listed,
as well as a retrospective analysis for batches 1 to 4 obtained with the final system used
in batches 5 and 6. The improvements during the development of the system are evident.
    2
     Batch 6 consisted of questions posed by new biomedical experts interested in material and answers that can
be automatically provided by state-of-the-art IR and QA systems. It is not part of the official evaluation.
5. Conclusions and Next Steps
All document retrieval results of ELECTROLBERT have been obtained with the ’base’ sized
architecture and are thus promising, fueling the expectation of competitive performance once
training of the ’large’ architecture has converged. ELECTROLBERT will be finetuned for the
snippet, factoid and list tasks. Transfer learning from SOP to relevance prediction and from
relevance prediction to the “Yes/No” task will be evaluated.


Acknowledgments
GPU computations were offered by HYPATIA3 , the Cloud infrastructure that supports the
computational needs of the Greek ELIXIR community.


References
[1] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, Electra: Pre-training text encoders as
    discriminators rather than generators, 2020. URL: https://arxiv.org/abs/2003.10555. doi:10.
    48550/ARXIV.2003.10555.
[2] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, Albert: A lite bert for self-
    supervised learning of language representations, 2019. URL: https://arxiv.org/abs/1909.11942.
    doi:10.48550/ARXIV.1909.11942.
[3] K. R. Kanakarajan, B. Kundumani, M. Sankarasubbu, BioELECTRA: pretrained biomedical
    text encoder using discriminators, in: Proceedings of the 20th Workshop on Biomedical
    Language Processing, Association for Computational Linguistics, Online, 2021, pp. 143–154.
    URL: https://aclanthology.org/2021.bionlp-1.16. doi:10.18653/v1/2021.bionlp-1.16.
[4] A. Nentidis, G. Katsimpras, E. Vandorou, A. Krithara, L. Gasco, M. Krallinger, G. Paliouras,
    Overview of bioasq 2021: The ninth bioasq challenge on large-scale biomedical semantic
    indexing and question answering, Experimental IR Meets Multilinguality, Multimodal-
    ity, and Interaction (2021) 239–263. URL: http://dx.doi.org/10.1007/978-3-030-85251-1_18.
    doi:10.1007/978-3-030-85251-1_18.
[5] R. Rehurek, P. Sojka, Gensim–python framework for vector space modelling, NLP Centre,
    Faculty of Informatics, Masaryk University, Brno, Czech Republic 3 (2011).
[6] P. He, X. Liu, J. Gao, W. Chen, Deberta: Decoding-enhanced bert with disentangled attention,
    2020. URL: https://arxiv.org/abs/2006.03654. doi:10.48550/ARXIV.2006.03654.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
    transformers for language understanding, 2018. URL: https://arxiv.org/abs/1810.04805.
    doi:10.48550/ARXIV.1810.04805.




   3
       https://hypatia.athenarc.gr