<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>NLP-MisInfo 2023: SEPLN 2023 Workshop on NLP applied to Misinformation</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>for Propaganda Detection and Beyond</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Piotr Przybyła</string-name>
          <email>piotr.przybyla@upf.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konrad Kaczyński</string-name>
          <email>konrad.kaczynski@ipipan.waw.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Economic Sciences, University of Warsaw</institution>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Computer Science, Polish Academy of Sciences</institution>
          ,
          <addr-line>Warsaw</addr-line>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LaSTUS Lab, TALN Group, Universitat Pompeu Fabra</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
<year>2023</year>
      </pub-date>
      <abstract>
<p>Propaganda detection is usually defined and solved as a Named Entity Recognition (NER) task. However, the instances of propaganda techniques (text spans) are usually much longer than typical NER entities (e.g. person or location names) and can include dozens of words. In this work, we investigate how the extensive span lengths affect the recognition of propaganda, showing that the task difficulty indeed increases with the span length. We systematically evaluate several common approaches to the task, measuring how well they recover the length distribution of true spans. We also propose a new solution, including an adaptive convolution layer that facilitates sharing information between distant words. Our approach improves length preservation without sacrificing overall performance.</p>
      </abstract>
      <kwd-group>
        <kwd>propaganda detection</kwd>
        <kwd>named entity recognition</kwd>
        <kwd>long named entities</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        The rise of the many challenges collectively known as misinformation has prompted researchers
working on Natural Language Processing (NLP) to propose several tasks aimed at assessing the
reliability of online text. This includes detecting social media bots [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], non-credible news articles
[2] or other hyper-partisan content [3]. Propaganda detection is based on similar inspiration,
but differs from the other tasks, since it involves pinpointing specific usages of manipulative
techniques on the word level. In practice, it means that instead of a general fake news label, an
end user might see certain text passages highlighted and categorised with respect to the type
of manipulation they involve. This helps the user to understand why certain text should be
considered unreliable. Interpretability matters in the misinformation context, where it has been
shown to impact users’ trust in credibility labels [4].
      </p>
<p>For example, in the following header, the underlined text was annotated as Flag-waving by
the human annotators.</p>
      <p>NLP-MisInfo 2023: SEPLN 2023 Workshop on NLP applied to Misinformation, held as part of SEPLN 2023</p>
      <p>To approach this problem, the proposed solutions build on the previous work on Named
Entity Recognition (NER), which involves finding spans in text that belong to certain categories.
However, the most popular NER solutions were developed for recognising relatively short
names of entities, such as persons, geographical entities or biomedical concepts. The instances
of propaganda techniques are typically much longer and may cover many words, as in the
example above, or even multiple sentences. As we show later, this makes these very long spans
rarely correctly recognised by generic NER approaches. In this study, we make the following
contributions:
• We systematically evaluate the current NER approaches to propaganda detection,
both in terms of recognition accuracy and how well the span length distribution of the
predicted entities matches the gold standard.
• We propose a new model architecture, called LoNER (Long Named Entity Recognition),
including a neural layer based on context-sensitive convolution operations, which improves
length preservation while maintaining overall accuracy.</p>
      <p>The code of our solution is openly available at https://github.com/piotrmp/loner.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <p>Work on defining propaganda. Propaganda can be defined as a planned and systematically
applied suggestion, expressed largely in symbols and linguistic stimuli for the purpose of
controlling attitudes and behaviour of individuals towards a predetermined mode of conduct
[5, 6]. Propaganda can be a tool of disinformation, which intends to deceive the public to believe
in false claims, but it can also be used to sway opinions and behaviours towards generally beneficial
goals [7]. The word propaganda originated between the sixteenth and nineteenth century and
referred to the spreading of religious doctrine of the Catholic Church [8], but modern propaganda
gained momentum during World War I [9].</p>
      <p>One of the first works defining propaganda as usage of specific persuasive techniques was
that of Lee and Lee [10], listing seven distinct propaganda techniques: Name calling, Glittering
Generalities, Transfer, Testimonial, Plain Folks, Card stacking and Band wagon. Lazarsfeld and
Merton [11] added two more general categories: one-dimensional classification of symbols and
personification stereotype. Later taxonomies brought further propaganda techniques. Brown
[12] put forward eight broad categories, including Use of Stereotypes, Substitution of Names,
Selection, Repetition, Assertion or Appeal to Authority. Smith [13] mentioned Symbolic Fiction,
Multiple Standards, Historical Reconstruction and Asymmetrical Definition. The most extensive
list of propaganda techniques was produced by Weston [14], who proposes 24 rules for making a
successful argument.</p>
      <p>
        Work on detecting propaganda. When propaganda detection was first defined as an NLP
task, it was tackled mostly on the article level [15, 16]. More recently, research on fine-grained
(i.e. sentence- or word-level) propaganda detection has been gaining traction. It was presented
as a shared task at the NLP4IF workshop [17] with two subtasks: fragment-level classification of
propaganda techniques used in a given span and a sentence-level binary classification task
(propaganda present or not). Another shared task (Detection of Propaganda Techniques in News
Articles) was organised at the Semeval 2020 workshop [
        <xref ref-type="bibr" rid="ref2">18</xref>
        ], where it was formulated as a pipeline
task consisting of span identification subtask (i.e. spotting propaganda fragments in plain-text
documents) and a technique classification subtask (i.e. determining the propaganda technique
in a given span). Finally, in the shared-task on Detection of Persuasion Techniques in Texts
and Image at Semeval 2021 [
        <xref ref-type="bibr" rid="ref3">19</xref>
        ], the participants could choose from three subtasks: multi-label
classification task (given textual content of a meme, identify which techniques are used in it),
multilabel sequence tagging task (identify pairs of techniques and spans they cover), and a
multimodal, multi-label classification task (using both textual and pictorial content of a meme,
identify which techniques are used in it). Transformer-based [
        <xref ref-type="bibr" rid="ref4">20</xref>
] architectures dominated all
three shared tasks. The best results on multi-label sequence tagging tasks were obtained with designs
consisting of pre-trained models such as BERT [
        <xref ref-type="bibr" rid="ref5">21</xref>
], RoBERTa [
        <xref ref-type="bibr" rid="ref6">22</xref>
        ], or ELMo [
        <xref ref-type="bibr" rid="ref7">23</xref>
        ] combined
with additional classification layers and fine-tuned within various scenarios: NER, multi-task
learning, question answering, etc. Task participants often experimented with different loss
function designs and data augmentation methods in order to alleviate the problem of imbalance
or sparseness of the data.
      </p>
      <p>Note that even though our work uses data from one of the shared tasks (see section 3.1), the
results are not directly comparable with those of the task’s participants. This is because we
have no access to the test set and use a different success measure, focused on entity length
preservation (see section 7.2). However, our goal is not to establish a new state of the art in
propaganda recognition, but rather to use the problem as an opportunity to explore the issue of
long entities in NER.</p>
      <p>
        NER with long entities. Despite the abundance of previous work on NER, it appears that
the vast majority of research is still focused on very short entities. This study in general and the
LoNER method in particular are focused on the problem of NER for very long entities, for which,
as shown in section 4.1, propaganda detection is a fitting example. While the relevant shared
tasks have attracted numerous interesting solutions, none of them have explicitly focused on
the problem of entity length [
        <xref ref-type="bibr" rid="ref2 ref3">17, 18, 19</xref>
]. Similarly, very long entities have been present,
but not taken into account, in some other NER studies involving medical entities [
        <xref ref-type="bibr" rid="ref8">24</xref>
        ] and job
requirements [
        <xref ref-type="bibr" rid="ref9">25</xref>
]. The only study we are aware of that evaluated NER performance for longer entities
is that of Li et al. [
        <xref ref-type="bibr" rid="ref10">26</xref>
], but they analyse entities of up to 6 words only. The propaganda
instances are even longer, as they can span multiple sentences, justifying a special approach. To
the best of our knowledge, our work is the first to investigate the impact of such extensive span
length on NER performance.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Propaganda detection task</title>
      <p>The input of the propaganda detection task is a short text (sentence, paragraph or document) in
natural language (English in our case). Since we are expressing the problem through the NER
framework, the output is a list of entities. Each entity corresponds to the usage of a certain
propaganda technique and is described by a span, i.e. a continuous section of text defined
through character offsets, and a category, i.e. one of the pre-defined propaganda techniques.</p>
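<p>As a sketch, the task output described above can be represented by a small data structure (the field names are our own illustration, not the shared-task code):</p>
<preformat>
```python
from dataclasses import dataclass

@dataclass
class Entity:
    """One recognised propaganda instance: a character-offset span
    plus one of the pre-defined technique categories."""
    start: int      # character offset of the span, inclusive
    end: int        # character offset of the span, exclusive
    category: str   # e.g. "FW" for Flag-Waving

e = Entity(start=10, end=58, category="FW")
assert e.end - e.start == 48  # span length in characters
```
</preformat>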
      <sec id="sec-4-1">
        <title>Table 1: Statistics per category (AtA, AtFP, BRaH, BaWF, CO, D, EM, FW, LL, NCL, R, S, TTC, WSMRH, All)</title>
        <sec id="sec-4-1-1">
          <title>3.1. Annotated data</title>
          <p>
            We base our experiments on the PTC Corpus from Semeval 2020 Task 11 [
            <xref ref-type="bibr" rid="ref2">18</xref>
            ], which is the largest
publicly available dataset annotated with propaganda categories on the token-level. Because
sequence tagging-based approaches do not allow overlapping spans, we need to disregard some
of the entities. This process is implemented in a way that prioritises preservation of entities (1)
belonging to rarer categories and (2) having less overlap. In the end, around 10% of entities are
removed in the process.
          </p>
          <p>From the original corpus, only the training and development subsets are publicly available,
since the test set was used to perform evaluation within the shared task. For the purpose of
our experiments, we aggregate the available subsets and randomly re-split them, assigning
documents into training (60%), development (20%) and test (20%) subsets.</p>
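<p>The re-splitting step can be sketched as follows (a hypothetical helper; the authors do not specify the assignment procedure beyond the 60/20/20 proportions):</p>
<preformat>
```python
import random

def resplit(doc_ids, seed=0):
    """Randomly assign documents to training (60%), development (20%)
    and test (20%) subsets, as described above."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_dev = int(n * 0.6), int(n * 0.2)
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]

train, dev, test = resplit(range(100))
assert len(train) == 60 and len(dev) == 20 and len(test) == 20
```
</preformat>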
          <p>The corpus contains annotations of the following 14 propaganda techniques:
• Appeal to Authority (AtA),
• Appeal to Fear, Prejudice (AtFP),
• Bandwagon, Reductio ad Hitlerum (BRaH),
• Black and White Fallacy (BaWF),
• Causal Oversimplification (CO),
• Doubt (D),
• Exaggeration, Minimisation (EM),
• Flag-Waving (FW),
• Loaded Language (LL),
• Name-Calling, Labelling (NCL),
• Repetition (R),
• Slogans (S),
• Thought-Terminating Clichés (TTC),
• Whataboutism, Straw Men, Red Herring (WSMRH).</p>
          <p>For detailed statistics of the number of entities, average length and coverage of each technique
in the corpus, see table 1.</p>
          <p>The categories differ significantly in terms of number of instances: the most common one
(Loaded Language) has 1369 instances in the training subset, while the rarest one (Bandwagon,
Reductio ad Hitlerum) occurs just 50 times. The differences in span length are less stark, though
still significant: Appeal to Authority averages over 120 characters, while Repetition averages less
than 17. The former may include a subordinate clause (Leading experts in the field agree that … ),
while the latter can strengthen impact by repeating a single word.</p>
          <p>Note that the combined effect of number and length of entities means that a category can
achieve high coverage (and thus impact on evaluation) through a large number of short entities
(e.g. Loaded Language or Name-Calling, Labelling) or through fewer long entities (e.g. Doubt
or Appeal to Fear, Prejudice). Recognising the latter type is definitely more challenging, as it
means learning complex structures (multi-word phrases) from few training examples. This is
also reflected by the mismatch of coverage in the different subsets; e.g. Doubt covers 2.92% of
the development text, but only 1.66% of the test text.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Propaganda techniques in the corpus</title>
      <sec id="sec-5-1">
        <title>4.1. Span length</title>
        <p>The extensive length of propaganda techniques makes the problem stand out when compared
to other NER tasks. The average entity length in the test set is 38.12 characters (6.24 words).
</p>
        <p>While there are some one-word instances (e.g. in Slogans or Repetition), longer ones dominate,
with the maximum of 69 words (412 characters).</p>
        <p>
          In figure 1 (left), we compare the length distribution of propaganda entities to previously
published NER corpora: CoNLL-2003, including general-domain names of persons, locations
etc. [
          <xref ref-type="bibr" rid="ref11">27</xref>
          ]; GENIA/BioNLP, including biomedical concepts, such as genes, proteins, etc. [28] and
EMB-NLP [
          <xref ref-type="bibr" rid="ref8">24</xref>
], including PICO elements (further description in section 7.4). The average span
lengths for these are, respectively, 9.45, 15.35 and 18.79 characters. As mentioned previously,
EMB-NLP has the longest entities, but they are still less than half of the propaganda techniques,
which average at 38.12 characters. Note the significant portion of very long entities (over 100
characters), absent from the other datasets. Figure 1 (right) demonstrates the internal diversity
of propaganda techniques by showing an analogous comparison between four of the techniques.
        </p>
        <p>It is expected that extensive span length affects the NER performance. While the full
evaluation results are presented in section 8, here we broadly demonstrate the problem by showing
how the baseline model (BERT raw, see more in section 5) performs on our data. In figure 2
(left), each point corresponds to a matching between a true (gold standard) and a predicted
entity. For the purpose of this visualisation, two entities are considered to match if their spans
overlap, irrespective of the categories. The X and Y coordinates correspond to the character
length of the true and predicted entity, respectively. We can clearly see that the number of
entities shorter than expected (below the Y=X line) is much larger than that of those longer than
expected (above the line). This is especially noticeable for very long entities (true length over
50 characters). Moreover, the set of predictions includes numerous entities of just 1 or 2
characters, which obviously have no counterparts in the gold standard.</p>
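<p>The lenient matching criterion used for this visualisation can be written as a one-line check (end offsets assumed exclusive):</p>
<preformat>
```python
def spans_overlap(a, b):
    """True if two (start, end) character spans share at least one
    character; end offsets are exclusive."""
    return min(a[1], b[1]) > max(a[0], b[0])

assert spans_overlap((0, 10), (5, 20))     # partial overlap counts
assert not spans_overlap((0, 5), (5, 10))  # merely touching does not
```
</preformat>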
        <p>Figure 2 (right) shows the distribution of lengths of the predicted and gold-standard entities,
regardless of their matching status. As expected, the entities predicted by the model are
noticeably shorter. In section 7 we show how to quantify the distribution mismatch visible in this plot
and use it as an evaluation measure called entity length discrepancy.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. NER approaches evaluated</title>
      <p>In this section we briefly describe the existing solutions to the NER task, which are evaluated
in our experiments, focusing on the features that may affect their performance in the case of long
entities. We start with the most frequently used approach (sequence tagging using language
models) and discuss its variants in section 5.1 before presenting the less popular models
(BiLSTM-CRF and span prediction) in section 5.2.</p>
      <sec id="sec-6-1">
        <title>5.1. Sequence tagging using language models</title>
        <p>The most popular framework for approaching NER tasks is sequence tagging. It
involves creating a label for each token of the text, determining whether the token is included in
a span of an entity, and if so, which category this entity belongs to (section 5.1.1). These token
labels can be generated by a two-step model: (1) representing each token in a multi-dimensional
space using a pretrained language model, such as BERT, and (2) predicting the token label based
on this representation (section 5.1.2).
5.1.1. Span encoding
Span encoding determines how information about entities present in a given sequence (e.g. a
sentence) is translated to token-level labels. In the most straightforward approach (raw category
labels), each token included in the span of an entity is assigned a label equal to the category of
this entity (e.g. AtA), while tokens not covered by entities are assigned a special out label (O).
The extension of this scheme, known as BIO (begin-inside-out), was first adopted by Ramshaw
and Marcus [29] and has since become a standard technique. It diferentiates between tokens
that are the first in a given entity (e.g. B-AtA) from the subsequent ones (e.g. I-AtA). Several
more elaborate schemes exist, e.g. IOBES with label types for the last token in an entity (E)
and single-token entities (S), used in biomedical NER [30]. There is no definite answer as to
which of these variants is most effective. Comparative studies on the CoNLL-03 and MUC7 datasets
produced mixed results, with Ratinov and Roth [31] and Cho et al. [32] finding more complex
schemes to score higher, and Konkol and Konopík [33] showing the opposite.</p>
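<p>As a sketch, the raw and BIO schemes can be generated from token-level spans like this (a minimal illustration with exclusive end indices, not the implementation used in the experiments):</p>
<preformat>
```python
def encode(n_tokens, spans, scheme="BIO"):
    """Translate entity spans into token-level labels.
    spans: list of (start, end, category) over token indices, end exclusive."""
    labels = ["O"] * n_tokens
    for start, end, cat in spans:
        for i in range(start, end):
            labels[i] = cat                    # raw category labels
        if scheme == "BIO":
            labels[start] = "B-" + cat         # mark the first token of the entity
            for i in range(start + 1, end):
                labels[i] = "I-" + cat         # subsequent tokens
    return labels

# a 5-token sentence with one 3-token Appeal to Authority entity
print(encode(5, [(1, 4, "AtA")]))          # ['O', 'B-AtA', 'I-AtA', 'I-AtA', 'O']
print(encode(5, [(1, 4, "AtA")], "raw"))   # ['O', 'AtA', 'AtA', 'AtA', 'O']
```
</preformat>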
        <p>Given that the more elaborate span encoding schemes allow for richer representation of
multi-token entities, we expect that this choice may influence the recognition performance for
our long-entity task.
5.1.2. Label prediction
Sequence labelling can be seen as a classification problem, where the correct label is predicted
for each token in the sentence. This is, however, a contextual classification, since the label at
position i depends not only on the token at position i, but also on the neighbouring ones. The
label prediction using neural networks is performed in two steps.</p>
        <p>
Firstly, each i-th token is represented as an embedding vector h_i of length d. While the
embeddings can be static, from a method such as word2vec, the contextual classification goal is
better served through contextual (and trainable) embeddings generated by a pretrained language
model. In our experiments we use BERT [
          <xref ref-type="bibr" rid="ref5">21</xref>
] in the Base variant, which means each wordpiece
token is represented through a vector of length d = 768.
        </p>
        <p>Secondly, each hidden representation vector h_i needs to be converted to a k-dimensional
vector s_i, where the score s_i,j reflects the likelihood that the i-th token should be assigned the
j-th label (out of k). This prediction layer could be implemented as a dense layer with d × k
coefficients, followed by a softmax layer to create label probabilities p_i,j, such that ∑_j p_i,j = 1.</p>
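<p>This two-step prediction can be sketched in a few lines of numpy (random weights stand in for the trained ones):</p>
<preformat>
```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5, 768, 15                # tokens, hidden size, labels
h = rng.normal(size=(n, d))         # hidden representations from the language model
W = rng.normal(size=(d, k)) * 0.01  # dense prediction layer (d × k coefficients)
b = np.zeros(k)

s = h @ W + b                       # per-token label scores s_i
p = np.exp(s) / np.exp(s).sum(axis=1, keepdims=True)  # softmax over labels

assert np.allclose(p.sum(axis=1), 1.0)  # probabilities sum to 1 for each token
predicted = p.argmax(axis=1)            # most likely label index per token
```
</preformat>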
        <p>Often more sophisticated approaches to the label prediction are used to make this operation
context-sensitive. They include CRF (analogous to BiLSTM-CRF, see section 5.2.1), convolutional
layers or recurrent ones, such as LSTM.</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Other approaches</title>
        <p>5.2.1. BiLSTM-CRF
BiLSTM-CRF is a model proposed by Lample et al. [34], which is similar to the approach
described above, but using a bi-directional LSTM layer [35] instead of the pretrained language
model. Specifically, the hidden representation h_i of the i-th token is created by concatenating
the output of one LSTM processing forward and another one operating in reverse. The input of
these LSTM layers consists of GloVe embeddings [36] and character-level LSTM output.</p>
        <p>The subsequent CRF layer computes the score of a given sequence of predictions y = y_1, …, y_n
by taking into account the token-level scores s_i,j (obtained from the preceding layer) and
transition scores t_j,k (trainable as weights). The CRF layer can encode information on the
entity length through the values of t_j,j, indicating the likelihood that tokens with category j
follow each other. For example, in our dataset we would expect t_AtA,AtA to be higher than t_R,R.
5.2.2. Span prediction
Finally, we include one NER solution based on span prediction to check how this new approach
performs in our task. We choose the recently-published SpanNER [37].</p>
        <p>In span prediction each entity is represented individually, rather than as a sequence of token
labels. This means that every possible span (i, i+j) = [i, i+1, …, i+j] in a sentence is assigned a
label y_i,i+j. In SpanNER, the label is predicted based on the hidden representations (computed
using BiLSTM) of the first and the last token in the span (h_i and h_i+j) and a learnable span
length (j+1) embedding.</p>
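<p>The span-enumeration step at the heart of this approach can be sketched as follows (inclusive indices, capped at a maximum width, as span-prediction models typically do):</p>
<preformat>
```python
def enumerate_spans(n_tokens, max_width):
    """All candidate spans (i, i+j) with j from 0 to max_width-1,
    clipped to the sentence; each would receive its own label."""
    spans = []
    for i in range(n_tokens):
        for j in range(max_width):
            if i + j >= n_tokens:
                break
            spans.append((i, i + j))
    return spans

print(enumerate_spans(4, 2))  # [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3), (3, 3)]
```
</preformat>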
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. LoNER</title>
      <p>Our approach, called Long Named Entity Recognition (LoNER), is designed to improve the NER
process in such a way that promotes entities matching those seen in training in terms of span
length. To understand the overall idea behind LoNER, consider a system output where a single
token at position i is recognised as belonging to a certain category c, while the rest of the tokens
in the sentence (j ≠ i) are predicted to contain no entities (O). However, if c is a category characterised by
long spans (e.g. Causal Oversimplification), such output is extremely unlikely to be correct. It
would be better to either extend the entity span to neighbouring tokens or remove it altogether.
How can we communicate this intention to the model?</p>
      <sec id="sec-7-1">
        <title>6.1. Adaptive convolution</title>
        <p>The most straightforward approach is the convolution operation. If we convolve the class
probabilities vector with another vector (e.g. resembling the Gaussian distribution), the large
probability of c at position i will be shared at positions i−1, i+1, i+2, etc. The challenge here is that a fixed
convolution kernel will not fit every situation. For example, the convolution size should depend
on whether a token at the centre has a chance of representing a short-span entity: some do
(e.g. emotive words indicative of Loaded Language), while others do not (e.g. function words
taking meaning only within long phrases). Similarly, some tokens should be associated with
a convolution obtaining its maximum after their own position (e.g. words typical for the beginning of a
phrase), while a maximum before would suit others (e.g. words used to end a phrase).</p>
        <p>For the reasons outlined above, we propose an adaptive convolution layer that convolves
the scores s_i,c (indicating the likelihood that token i belongs to category c) with a Gaussian kernel,
whose mean μ_i and standard deviation σ_i depend on the hidden representation of the token h_i:</p>
        <p>s*_i,c = ∑_{j=1}^{n} s_j,c × exp( − ((j − i) − μ_i)² / (2 σ_i²) )</p>
        <p>We can see that the score at position i can be affected by the score at another position j, though this
influence wanes as the distance between the positions (j − i) grows. Note that the Gaussian is
scaled to peak at 1.0, so that s*_i,c ≈ s_i,c when μ_i is close to 0 and σ_i is small. The values of the standard
deviation and the mean are computed from the hidden representation of the associated token
through a dense linear layer:</p>
        <p>σ_i = h_i ⋅ w_σ + b_σ ,  μ_i = h_i ⋅ w_μ + b_μ</p>
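<p>The adaptive convolution can be sketched in numpy as follows (a forward-pass illustration; in LoNER the kernel parameters μ_i and σ_i come from trained dense layers rather than being passed in):</p>
<preformat>
```python
import numpy as np

def adaptive_convolution(s, mu, sigma):
    """Convolve per-token category scores s (n × k) with a token-wise
    Gaussian kernel of mean mu[i] and deviation sigma[i], scaled to peak at 1.0:
    s*[i, c] = sum_j s[j, c] * exp(-((j - i) - mu[i])**2 / (2 * sigma[i]**2))"""
    n = s.shape[0]
    i = np.arange(n)[:, None]   # target positions
    j = np.arange(n)[None, :]   # source positions
    w = np.exp(-((j - i) - mu[:, None]) ** 2 / (2 * sigma[:, None] ** 2))
    return w @ s

s = np.zeros((5, 2)); s[2, 0] = 1.0   # one confident token of category 0
out = adaptive_convolution(s, mu=np.zeros(5), sigma=np.ones(5))
# the score spreads to neighbours: out[2, 0] stays 1.0, out[1, 0] = exp(-0.5)
assert np.isclose(out[2, 0], 1.0) and np.isclose(out[1, 0], np.exp(-0.5))
```
</preformat>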
      </sec>
      <sec id="sec-7-2">
        <title>6.2. LoNER architecture</title>
        <p>
          Here we describe how the adaptive convolution is integrated in the general architecture of
our solution; see figure 3 for a diagram. The BERT pretrained model [
          <xref ref-type="bibr" rid="ref5">21</xref>
          ] is used to generate a
hidden representation of each token (h_i). We use the BERT Base (uncased) model, which means
that these vectors have a length of 768. The hidden representation is fed to a dense layer in order
to generate 15 label scores s_i,c (for 14 categories and O). In parallel, the same representation is
used to compute the token-wise kernel parameters μ_i and σ_i, which are then used in the adaptive
convolution described above to produce convoluted scores s*_i,c. Additionally, we employ a
residual connection [38] to accelerate the learning, adding the original and convoluted score
matrices. Finally, a softmax operation is performed to obtain label likelihoods p_i,c, such that
∑_c p_i,c = 1. For each token, the label with the highest value is selected and used to construct
continuous spans with the associated categories.
        </p>
        <p>The weights of the dense layer producing the kernel parameters are initialised so that μ_i = 0
and σ_i = 1 are output, corresponding to a very mild smoothing effect. The dense layer producing
the scores is initialised randomly. All of the weights in the model (including BERT) are trainable.</p>
        <p>To reduce overfitting, dropout [39] is applied to the hidden representations returned from
BERT. The dropout layer is designed in such a way that each h_i vector can be hidden (replaced
with zeros) with probability 0.2. This forces the model to make predictions when some of the
words in a sentence are unknown, learning to draw clues from their context.</p>
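<p>This whole-vector dropout can be sketched as follows (training-time behaviour only; whether the implementation rescales the kept vectors is not stated, so no scaling is applied here):</p>
<preformat>
```python
import numpy as np

def token_dropout(h, rate=0.2, seed=0):
    """Zero out whole token vectors (rows of h) with the given probability,
    forcing the model to rely on the surrounding context."""
    rng = np.random.default_rng(seed)
    keep = rng.random(h.shape[0]) >= rate
    return h * keep[:, None]

h = np.ones((10, 768))
dropped = token_dropout(h)
# each row is either untouched or entirely zeroed
assert set(np.unique(dropped.sum(axis=1))).issubset({0.0, 768.0})
```
</preformat>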
        <p>[Figure 3: The LoNER architecture. BERT produces hidden representations h_0 … h_5 for the input tokens ([CLS] Exactly what Hitler did .); dense layers produce the label scores s_1, …, s_5 and the kernel parameters μ_i, σ_i, which feed the adaptive convolution yielding s*.]</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>7. Evaluation</title>
      <p>We train a group of models selected to represent different paradigms (section 7.1) and check their
performance, both using traditional measures and the newly introduced entity length discrepancy
(section 7.2). Section 7.3 contains details regarding the parameters of the trained models. We
also include an additional evaluation on a dataset from a different domain (section 7.4).</p>
      <sec id="sec-8-1">
        <title>7.1. Evaluated approaches</title>
        <p>We choose to evaluate the following 11 variants of the available NER approaches (section 5):
• BiLSTM-CRF with raw, BIO and IOBES labelling,
• BERT sequence labelling using raw, BIO and IOBES labelling,
• BERT sequence labelling with CRF using raw, BIO and IOBES labelling,
• LoNER using raw labelling,
• SpanNER.</p>
      </sec>
      <sec id="sec-8-2">
        <title>7.2. Measures</title>
        <p>For basic evaluation, we adopt the scheme proposed for Semeval 2020 Task 11 [40]. It compares
the gold-standard and predicted sets of entities through precision, recall and F-score, accepting
partially overlapping spans. Results from entity categories are aggregated using both
micro- and macro-averaging, which helps to get a broader picture in the case of a highly unbalanced
dataset.</p>
        <p>The F-score defined in such a way rewards a model for predicting even one word from a long
sequence of words (or sentences) comprising the original entity, which often happens (see figure
2). We argue that such ‘matches’ may not be helpful for an end user and the evaluation needs
to be extended with a measure that specifically targets the mismatch between the predicted and
gold-standard distributions of span lengths.</p>
        <p>In order to achieve that, we introduce entity length discrepancy, which is computed as the
Kullback-Leibler (KL) divergence [41] between the distributions of span lengths in gold-standard and
predicted entities. To represent the two compared sets of entities as length distributions, their
combined range is divided into 5-character-long bins, each of which contains the number of
entities with spans of that length. These are then scaled to sum up to 1, creating a discrete
probability distribution, which can be used to compute the KL divergence.</p>
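<p>The measure can be sketched as follows (our reading of the description; the exact bin edges and the smoothing of empty bins are assumptions):</p>
<preformat>
```python
import numpy as np

def length_discrepancy(gold_lengths, pred_lengths, bin_width=5, eps=1e-9):
    """KL divergence between binned span-length distributions of the
    gold-standard and predicted entity sets."""
    top = max(max(gold_lengths), max(pred_lengths))
    edges = np.arange(0, top + 2 * bin_width, bin_width)
    g, _ = np.histogram(gold_lengths, bins=edges)
    p, _ = np.histogram(pred_lengths, bins=edges)
    g = g / g.sum() + eps   # smooth empty bins to keep the KL finite
    p = p / p.sum() + eps
    return float(np.sum(g * np.log(g / p)))

same = length_discrepancy([12, 38, 120], [12, 38, 120])
shorter = length_discrepancy([12, 38, 120], [3, 4, 6])
assert same == 0.0 and shorter > same   # mismatched lengths give a larger divergence
```
</preformat>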
        <p>Entity length discrepancy is computed in two variants. In global, the length distributions
for comparison are built based on all the entities in predicted and gold-standard sets, as shown
in Figure 2. In local, the comparisons are made within each category, e.g. predicted LL
entities compared to gold-standard LL entities, and the obtained KL values are averaged over all
categories.</p>
      </sec>
      <sec id="sec-8-3">
        <title>7.3. Model training setup</title>
        <p>All of the evaluated approaches are trained on the training set for 50 epochs. The micro-averaged
F-score on the development set is used to choose the best epoch, for which the test set results
are reported. Because of the modest size of the dataset, the results may vary between runs. In
order to minimise the impact of this on the conclusions, we run each experiment ten times with
different random seeds and provide the averaged results.</p>
        <p>For BiLSTM-CRF, we use the implementation published by Genthial [42]2. We keep the
original parameters of the training process, except for the dropout rate, which we set to 0.1. SpanNER
is evaluated using the code published alongside the article3, using the sp2 variant, which
performed best in the original evaluation [37].</p>
        <p>2https://github.com/guillaumegenthial/tf_ner
3https://github.com/neulab/SpanNER</p>
        <p>All of the BERT-based models are implemented using our own code in TensorFlow. The
optimisation is performed using Adam [43] with a learning rate equal to 3 × 10−5. The categorical
cross-entropy loss is applied to the per-token predictions.</p>
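As an illustrative aside, the per-token categorical cross-entropy reduces to the following computation (a pure-Python sketch of the loss only; in our setup it is minimised with Adam inside TensorFlow):

```python
import math

def per_token_cross_entropy(probs, labels):
    """Categorical cross-entropy averaged over the tokens of a sequence.

    probs:  per-token probability vectors over the label set
            (as produced by a softmax output layer)
    labels: gold label index for each token
    """
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)
```

Averaging over tokens rather than whole spans reflects the per-token labelling view of the task.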
      </sec>
      <sec id="sec-8-4">
        <title>7.4. Additional evaluation</title>
        <p>
          In order to broaden our evaluation, we include an additional experiment involving the
EMB-NLP dataset [
          <xref ref-type="bibr" rid="ref8">24</xref>
          ]. It consists of abstracts describing clinical trials, with NER categories covering
elements that are of interest from the point of view of systematic reviews, including a description
of the participants, the tested intervention (e.g. a drug) and the observed results. In total, entities of
18 categories are included, covering 18.79 characters on average – not as long as propaganda
techniques, but noticeably longer than in classic NER.
        </p>
        <p>In order for the EMB-NLP dataset to match the proportions of the PTC Corpus (60/20/20), we
keep the original test portion of the EMB-NLP dataset (301 abstracts) and use the remaining data
to randomly select subsets of 301 abstracts for development and 903 for training. To guarantee
that each token is associated with only one label, we remove the overlapping spans (3% of the
original data), prioritising the preservation of the longer span. We obtain 12007, 3853 and 4425
entities in the training, development and test portions, respectively.</p>
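The overlap-removal step can be sketched as follows (an illustrative reconstruction; the `(start, end, label)` span representation, with exclusive end offsets, is our assumption):

```python
def remove_overlaps(spans):
    """Remove overlapping spans so that each token carries only one label,
    keeping the longer span when two spans overlap."""
    kept = []
    # Consider longer spans first, so a longer span always wins
    # against any shorter span it overlaps with.
    for start, end, label in sorted(spans, key=lambda s: s[1] - s[0],
                                    reverse=True):
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)
```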
      </sec>
    </sec>
    <sec id="sec-9">
      <title>8. Results</title>
      <p>Table 2 shows the results of the main experiment. The values achieved by different methods
vary greatly. Judging by the F1 measure, the superiority of the BERT-based approaches is quite
clear. Among these, LoNER achieves the best micro-averaged F1 of 0.2731, but all BERT and
BERT-CRF variants have a precision of around 0.26-0.27. BiLSTM-CRF and SpanNER clearly lag
behind in this scenario, with F1 around 0.17-0.18. Interestingly, these methods deliver a solid
precision, but less than half of the recall of the BERT variants. The encoding schemes do not appear
to improve the situation in this long-span task, with the simplest variant (raw) performing
best. It is also worth noting that despite the strong imbalance of the categories, all of the above
observations hold for both micro- and macro-averaging.</p>
      <p>The results of span length discrepancy validate the introduction of this measure. It turns out
that the classic BERT-based methods, best-performing according to F1, are the worst at
preserving the length distribution. However, LoNER mitigates this weakness, with a
global discrepancy of 0.3150 compared to 0.4686 for BERT.</p>
      <p>Table 3 shows the results for PICO entity detection. We can see that the situation
resembles the case of propaganda detection. LoNER achieves an F-score similar to BERT (higher
precision, but lower recall), but with a substantially better preservation of the span length
distribution. The difference is especially large for global length discrepancy, which is almost
three times lower than for classic BERT.</p>
    </sec>
    <sec id="sec-10">
      <title>9. Limitations</title>
      <p>The overall goal of this work is to improve propaganda detection in online text. While we
focused on proposing a new NER model, we need to emphasise that training data quality is just
as crucial to achieving this aim. Persuasion techniques are among the most subtle phenomena
in language, which necessitates gathering many examples to train effective recognition.
Unfortunately, as shown in Table 1, several categories are represented by just a few dozen
instances. This is the main factor limiting the performance of our solution. However, the
consistency of outcomes between micro- and macro-averaged measures indicates that the
benefits of LoNER are not affected by these small categories.</p>
      <p>The main motivation behind LoNER was the poor span length preservation of the mainstream
BERT-based models. However, instead of designing an additional layer, one could investigate
where this unsatisfactory performance stems from. In principle, Transformer-based language
models, which use attention to share information between faraway tokens, should be able to
capture long-range contexts. The need to understand the workings of large language models has
motivated a lot of research recently, and we expect that long-entity recognition could be another
use case for future work.</p>
      <p>Another limitation of our work is that we do not explore the precision/recall trade-off. As we can
see in Table 2, the models that are best in terms of length preservation (the LSTM-based ones) also have
relatively low recall. It might be beneficial to encourage other models to behave in a similar
way, i.e. to return fewer mentions, but of higher quality (in terms of length). It is, however, not clear
how to compute a model's confidence in a span within the sequence labelling framework.</p>
      <p>LoNER was designed specifically for propaganda detection, but it could provide benefits in any
NER task, especially when long entities are involved. Our additional evaluation is limited to
PICO data, but it would be valuable to check other tasks, too. We hope that some benefits of adaptive
convolution will be noticeable even for classic entity types, but this is left for future work.</p>
      <p>Finally, we emphasise the need to put the user in the loop when evaluating misinformation
solutions. While data-based studies like this one can be valuable, they are limited to automatic
performance measures and need to be followed by user studies in the expected application
scenario. In the case of propaganda detection, we are still a long way from understanding how to
deploy such solutions to deliver real-world impact.</p>
    </sec>
    <sec id="sec-10a">
      <title>10. Conclusions</title>
      <p>Representation using BERT vastly outperforms LSTMs. The advantages of pretrained
language models for word representation are well known across NLP. Here they are
demonstrated by the significant gap between the BiLSTM-CRF and BERT-based approaches.</p>
      <p>CRF layers or labelling schemas provide no benefit. Both types of techniques are
designed to manage spans by detecting tokens that occur at significant positions, but
they do not seem to work for propaganda-length entities.</p>
      <p>Models with a high F-score have poor length preservation. Interestingly, entity length
is best preserved by the models with a comparatively low F-score, i.e. those based on BiLSTM. This
is achieved by not returning less certain entities, as demonstrated by the precision-recall imbalance
of the LSTM-based solutions.</p>
      <p>LoNER improves length preservation while maintaining the F-score. Adaptive convolution
works as expected, extending label information over longer spans. This allows us to keep the high
F-score of BERT-based solutions (or even increase it) while improving length preservation.</p>
      <p>This result is not limited to propaganda detection. We also observe the reduced length
discrepancy in the case of PICO entities. We hope that the ideas behind LoNER will be applied to
challenging NER tasks in more domains.</p>
    </sec>
    <sec id="sec-11">
      <title>Acknowledgments</title>
      <p>The work was supported by the Polish National Agency for Academic Exchange through a Polish
Returns grant number PPN/PPO/2018/1/00006 and as part of ERINIA project that has received
funding from the European Union’s Horizon Europe research and innovation programme under
grant agreement No 101060930. Views and opinions expressed are however those of the author(s)
only and do not necessarily reflect those of the European Union. Neither the European Union
nor the granting authority can be held responsible for them.</p>
      <p>[2] Y. Zhang, X. Yu, Z. Cui, S. Wu, Z. Wen, L. Wang, Every Document Owns Its Structure:
Inductive Text Classification via Graph Neural Networks, in: Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics, Association for
Computational Linguistics (ACL), 2020, pp. 334–339. URL: https://aclanthology.org/2020.acl-main.31.
doi:10.18653/V1/2020.ACL-MAIN.31. arXiv:2004.13826.
[3] M. Potthast, J. Kiesel, K. Reinartz, J. Bevendorff, B. Stein, A Stylometric Inquiry into
Hyperpartisan and Fake News, in: Proceedings of the 56th Annual Meeting of the Association
for Computational Linguistics (Volume 1: Long Papers), Association for Computational
Linguistics, 2018, pp. 231–240. URL: https://www.aclweb.org/anthology/P18-1022.
[4] P. Przybyła, A. J. Soto, When classification accuracy is not enough:
Explaining news credibility assessment, Information Processing &amp; Management 58 (2021).
URL: https://linkinghub.elsevier.com/retrieve/pii/S0306457321001412. doi:10.1016/j.ipm.
2021.102653.
[5] C. Bird, Social Psychology, D. Appleton and Company, New York, NY, USA, 1940.
[6] T. Parsons, Propaganda and Social Control, Psychiatry 5 (1942) 551–572. URL: https:
//doi.org/10.1080/00332747.1942.11022421. doi:10.1080/00332747.1942.11022421.
[7] G. Bennet, Propaganda and disinformation: how a historical perspective aids critical
response development, in: The SAGE Handbook of propaganda, SAGE Publications Ltd,
2020, pp. 244–260. doi:10.4135/9781526477170.n16.
[8] G. S. Jowett, Propaganda and Communication: The Re-emergence of a Research Tradition,
Journal of Communication 37 (1987) 97–114. doi:10.1111/j.1460-2466.1987.tb00971.x.
[9] J. Wilke, Propaganda, in: W. Donsbach (Ed.), The International Encyclopedia of
Communication, John Wiley &amp; Sons, Ltd, 2008. doi:10.1002/9781405186407.wbiecp109.
[10] E. B. Lee, A. M. Lee, The Fine Art of Propaganda: A Study of Father Coughlin’s Speeches,</p>
      <p>Harcourt Brace and Co, 1939. URL: https://archive.org/details/LeeFineArt.
[11] P. F. Lazarsfeld, R. K. Merton, SECTION OF ANTHROPOLOGY: Studies in Radio and
Film Propaganda*, Transactions of the New York Academy of Sciences 6 (1943) 58–74.
doi:10.1111/j.2164-0947.1943.tb00897.x.
[12] J. A. C. Brown, Techniques of persuasion: from propaganda to brainwashing, 6th ed.,</p>
      <p>Penguin Books, Baltimore, USA, 1963.
[13] T. J. Smith, Propaganda: A Pluralistic Perspective, Praeger, 1989.
[14] A. Weston, A Rule Book for Arguments, Hackett Publishing Company, Inc, 2017.
[15] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, Y. Choi, Truth of varying shades: Analyzing
language in fake news and political fact-checking, EMNLP 2017 - Conference on Empirical
Methods in Natural Language Processing, Proceedings (2017) 2931–2937. doi:10.18653/
v1/d17-1317.
[16] A. Barrón-Cedeño, G. da San Martino, I. Jaradat, P. Nakov, Proppy: A system to
unmask propaganda in online news, Proceedings of the 33rd AAAI Conference on
Artificial Intelligence (AAAI 2019) (2019) 9847–9848. doi:10.1609/aaai.v33i01.33019847.
arXiv:1912.06810.
[17] G. da San Martino, A. Barrón-Cedeño, P. Nakov, Findings of the NLP4IF-2019 Shared
Task on Fine-Grained Propaganda Detection, in: Proceedings of the Second Workshop
on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and
Propaganda, Association for Computational Linguistics (ACL), 2019, pp. 162–170.
arXiv:1910.09982.
[28] J. D. Kim, T. Ohta, Y. Tateisi, J. Tsujii, GENIA corpus—a semantically annotated corpus for
bio-textmining, Bioinformatics 19 (2003). URL: https://academic.oup.com/bioinformatics/
article/19/suppl_1/i180/227927. doi:10.1093/BIOINFORMATICS/BTG1023.
[29] L. A. Ramshaw, M. P. Marcus, Text Chunking Using Transformation-Based Learning, in:
ACL Third Workshop on Very Large Corpora, 1995, pp. 82–94. URL: https://arxiv.org/abs/
cmp-lg/9505040v1. doi:10.1007/978-94-017-2390-9_10. arXiv:9505040.
[30] A. Vlachos, Tackling the BioCreative2 gene mention task with conditional random fields
and syntactic parsing, Proceedings of the Second BioCreative Challenge Evaluation
Workshop (2007). URL: http://www.cl.cam.ac.uk/~av308/biocreative2_GM_vlachos.pdf.
[31] L. Ratinov, D. Roth, Design Challenges and Misconceptions in Named Entity
Recognition, in: Proceedings of the Thirteenth Conference on Computational Natural Language
Learning (CoNLL), Association for Computational Linguistics, Boulder, Colorado, 2009,
pp. 147–155. URL: http://l2r.cs.uiuc.edu/.
[32] H. C. Cho, N. Okazaki, M. Miwa, J. Tsujii, Named entity recognition with multiple
segment representations, Information Processing and Management 49 (2013) 954–965.</p>
      <p>URL: http://dx.doi.org/10.1016/j.ipm.2013.03.002. doi:10.1016/j.ipm.2013.03.002.
[33] M. Konkol, M. Konopík, Segment Representations in Named Entity Recognition,
in: Proceedings of the International Conference on Text, Speech and Dialogue (TSD
2015), Springer, Cham, 2015, pp. 61–70. URL: https://link.springer.com/chapter/10.1007/
978-3-319-24033-6_7. doi:10.1007/978-3-319-24033-6_7.
[34] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, C. Dyer, Neural architectures
for named entity recognition, in: Proceedings of the 2016 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies (NAACL HLT 2016), 2016, pp. 260–270. doi:10.18653/v1/n16-1030.
arXiv:1603.01360.
[35] S. Hochreiter, J. Schmidhuber, Long Short-Term Memory, Neural Computation 9 (1997)
1735–1780. URL: http://direct.mit.edu/neco/article-pdf/9/8/1735/813796/neco.1997.9.8.1735.
pdf. doi:10.1162/neco.1997.9.8.1735.
[36] J. Pennington, R. Socher, C. Manning, GloVe: Global Vectors for Word Representation, in:
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP), 2014, pp. 1532–1543. URL: https://aclanthology.org/D14-1162/.
[37] J. Fu, X. Huang, P. Liu, SpanNER: Named Entity Re-/Recognition as Span Prediction, in:
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing (Volume 1:
Long Papers), Association for Computational Linguistics, Online, 2021, pp. 7183–7195.</p>
      <p>URL: https://aclanthology.org/2021.acl-long.558.
[38] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in:
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, IEEE Computer Society, 2016, pp. 770–778. doi:10.1109/CVPR.2016.90.
arXiv:1512.03385.
[39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A Simple
Way to Prevent Neural Networks from Overfitting, Journal of Machine Learning Research
15 (2014) 1929–1958. URL: http://jmlr.org/papers/v15/srivastava14a.html.
[40] G. da San Martino, A. Barrón-Cedeño, P. Nakov, Evaluation of Propaganda Detection
Tasks, Technical Report, 2020. URL: https://propaganda.qcri.org/semeval2020-task11/data/
propaganda_tasks_evaluation.pdf.
[41] S. Kullback, R. A. Leibler, On Information and Sufficiency, The
Annals of Mathematical Statistics 22 (1951) 79–86. doi:10.1214/aoms/1177729694.
[42] G. Genthial, Sequence Tagging with Tensorflow, 2017. URL: https://guillaumegenthial.github.io/sequence-tagging-with-tensorflow.html.</p>
      <p>
[43] D. P. Kingma, J. L. Ba, Adam: A method for stochastic optimization, in: 3rd International
Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, 2015.
arXiv:1412.6980.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <source>Overview of the 7th Author Profiling Task at PAN</source>
          <year>2019</year>
          :
          <article-title>Bots and Gender Profiling</article-title>
          , in:
          <string-name>
            <given-names>L.</given-names>
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          , H. Müller (Eds.),
          <article-title>CLEF 2019 Labs and Workshops, Notebook Papers</article-title>
          ,
          <source>CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [18] G. da San Martino, A.
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Wachsmuth</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Petrov</surname>
          </string-name>
          , P. Nakov, SemEval2020 Task 11:
          <article-title>Detection of Propaganda Techniques in News Articles</article-title>
          ,
          <source>in: Proceedings of the Fourteenth Workshop on Semantic Evaluation (SemEval-2020)</source>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>1377</fpage>
          -
          <lpage>1414</lpage>
          . URL: http://propaganda.qcri.org/annotations/definitions.html, http://arxiv.org/abs/2009.02696. arXiv:2009.02696.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Bin</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Firooz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , G. Da San Martino, SemEval
          <article-title>-2021 Task 6: Detection of Persuasion Techniques in Texts and Images</article-title>
          ,
          <source>in: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval2021)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>98</lpage>
          . doi:10.18653/v1/2021.semeval-1.7. arXiv:2105.09284.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , Ł. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , Attention Is All You Need,
          <source>in: Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          , Curran Associates, Inc.,
          <year>2017</year>
          , pp.
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          . arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</article-title>
          , Association for Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , V. Stoyanov,
          <article-title>RoBERTa: A Robustly Optimized BERT Pretraining Approach</article-title>
          , arXiv:1907.11692 (2019). URL: https://arxiv.org/abs/1907.11692v1. arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>Deep contextualized word representations</article-title>
          ,
          <source>in: NAACL HLT</source>
          <year>2018</year>
          <article-title>- 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies -</article-title>
          <source>Proceedings of the Conference</source>
          , volume
          <volume>1</volume>
          , Association for Computational
          <source>Linguistics (ACL)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2227</fpage>
          -
          <lpage>2237</lpage>
          . URL: https://aclanthology.org/N18-1202. doi:10.18653/v1/n18-1202. arXiv:1802.05365.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>B.</given-names>
            <surname>Nye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Marshall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nenkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <article-title>A Corpus with Multi-Level Annotations of Patients, Interventions and Outcomes to Support Language Processing for Medical Literature, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics</article-title>
          (Volume 1: Long Papers),
          <year>2018</year>
          . doi:10.18653/v1/P18-1019.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>O.</given-names>
            <surname>Shatalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryabova</surname>
          </string-name>
          ,
          <article-title>Named Entity Recognition Problem for Long Entities in English Texts</article-title>
          ,
          <source>in: The 16th International Conference on Computer Sciences and Information Technologies (CSIT)</source>
          ,
          <source>Institute of Electrical and Electronics Engineers (IEEE)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>76</fpage>
          -
          <lpage>79</lpage>
          . doi:10.1109/CSIT52700.2021.9648768.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>F.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <article-title>Modularized Interaction Network for Named Entity Recognition, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th</article-title>
          <source>International Joint Conference on Natural Language Processing</source>
          (Volume 1: Long Papers)
          ,
          <source>Association for Computational Linguistics</source>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>200</fpage>
          -
          <lpage>209</lpage>
          . URL: https://aclanthology.org/2021.acl-long.17. doi:10.18653/v1/2021.acl-long.17.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>E. F.</given-names>
            <surname>Tjong Kim Sang</surname>
          </string-name>
          , F. de Meulder, Introduction to the CoNLL-2003 Shared Task:
          <article-title>Language-Independent Named Entity Recognition</article-title>
          , in:
          <source>Proceedings of the 7th Conference on Natural Language Learning, CoNLL 2003 at HLT-NAACL 2003</source>
          ,
          <year>2003</year>
          , pp.
          <fpage>142</fpage>
          -
          <lpage>147</lpage>
          . URL: http://lcg-www.uia.ac.be/conll2003/ner/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>