<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>How BERT Speaks Shakespearean English? Evaluating Historical Bias in Masked Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Miriam Cuscito</string-name>
          <email>miriam.cuscito@unicas.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alfio Ferrara</string-name>
          <email>Alfio.Ferrara@unimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Ruskov</string-name>
          <email>martin.ruskov@unimi.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Informatica “Giovanni Degli Antoni”, Università degli Studi di Milano</institution>
          ,
          <addr-line>Via Celoria 18, 20133 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Lettere e Filosofia, Università degli Studi di Cassino e del Lazio Meridionale</institution>
          ,
          <addr-line>Via Zamosch 43, 03043 Cassino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dipartimento di Lingue, Letterature, Culture e Mediazioni, Università degli Studi di Milano</institution>
          ,
          <addr-line>Piazza Sant'Alessandro 1, 20123 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we explore the idea of analysing the historical bias of masked language models based on BERT by measuring their adequacy with respect to Early Modern (EME) and Modern (ME) English. In our preliminary experiments, we perform fill-in-the-blank tests with 60 masked sentences (20 EME-specific, 20 ME-specific and 20 generic) and three different models (i.e., BERT Base, MacBERTh, BL Books). We then rate the model predictions according to a 5-point bipolar scale between the two language varieties and derive a weighted score to measure the adequacy of each model to the EME and ME varieties of English.</p>
      </abstract>
      <kwd-group>
        <kwd>Masked Language Models</kwd>
        <kwd>Early Modern English</kwd>
        <kwd>Historical Bias</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In this work, we explore the historical bias of BERT-based masked language models (MLMs) by performing
fill-in-the-blank tests with masked sentences suggestive of either Early Modern (EME) or Modern (ME) English.
We then rate the proposed responses according to a 5-point bipolar scale between the two language
varieties and derive a weighted score from the response probabilities and their respective scores on the
scale.</p>
      <p>These results, although preliminary, might suggest a method applicable in the digital humanities
when MLMs are employed for the analysis of historical corpora.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>If it is true that language shapes culture while it is shaped by it [10], language models in general – and
MLMs in particular – constitute a still partially covered mirror of this dual relationship. Not only can
an MLM be tested based on its level of representativeness of the language to determine its reliability,
but it can also tell us about linguistic, social, and historical phenomena that concern the culture tied
to that specific language. In other words, an MLM could be a valuable tool towards the expansion of
the broader social knowledge of a given culture, rightfully becoming part of the basic tools of Cultural
Analytics discussed by Manovich [11]. According to Bruner’s pragmatic-cultural perspective [12],
learning a language also means learning the cultural patterns associated with it. Similarly, analysing
the language in its various realisations would mean having the opportunity to visualise the underlying
cultural patterns.</p>
      <p>Moreover, MLMs can also be highly beneficial for philological [13], pragmatic [14], critical [6],
and literary work [15]. However, the effectiveness of these models depends on their ability to adapt
to language specificity in its historical dimension. This is typically achieved by training models on
historical text corpora. However, the difficulty of accessing large historical documentary collections
means that the available models are still few and require verifying whether they adapt effectively to
the historical linguistic context.</p>
      <p>
        BERT is a foundational masked language model (MLM) which to date is the most widely adopted [16].
A number of studies have explored different forms of bias in BERT [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2, 3</xref>
        ]. Three BERT-based MLMs
are of particular interest for our study: (i) Bert-Base-Uncased [7], created from a corpus of texts from
Wikipedia and BookCorpus and thus a model of contemporary language, which we use as a control condition
in our experiment; (ii) MacBERTh [8], pre-trained on texts from 1500 to 1950; and (iii) BERT British
Library Books English [9], pre-trained on contemporary texts and fine-tuned on historical texts from the
19th century to the present.
      </p>
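      <p>As a minimal illustration of how such models can be queried, the Python sketch below loads MLMs through the Hugging Face transformers fill-mask pipeline and collects the top predictions with their probabilities. Only bert-base-uncased is a standard model identifier; the identifiers used here for MacBERTh and BL Books are assumptions for illustration and may need to be replaced with the actual repository names.</p>
      <preformat>
# Python sketch (not the authors' released code) of querying BERT-based MLMs
# with a fill-in-the-blank test sentence. Requires: pip install transformers torch
from transformers import pipeline

# "bert-base-uncased" is a standard identifier; the other two are
# illustrative placeholders and may differ from the actual repository names.
MODEL_IDS = {
    "BERT Base": "bert-base-uncased",
    "MacBERTh": "emanjavacas/MacBERTh",      # assumption
    "BL Books": "path/to/bl-books-model",    # placeholder
}

sentence = "Why wilt [MASK] be offended by that?"

for name, model_id in MODEL_IDS.items():
    fill_mask = pipeline("fill-mask", model=model_id)
    masked = sentence.replace("[MASK]", fill_mask.tokenizer.mask_token)
    # Each prediction carries the proposed token and its probability (score).
    predictions = fill_mask(masked, top_k=5)
    print(name, [(p["token_str"], round(p["score"], 3)) for p in predictions])
      </preformat>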
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>To evaluate the adequacy of MLMs on a test set, we define a temporal valence task consisting of a
collection of test sentences, each with a masked token (i.e., word). This is a typical fill-in-the-blank task,
where the models are required to predict the masked token. Formally, we consider the following three
sets: (i) we denote with S the set of all test sentences, (ii) with V we denote a set of vocabulary words,
and (iii) with T = {−1, −0.5, 0, 0.5, 1} ⊂ R we denote a 5-point bipolar temporal valence scale, where
−1 represents the farthest historical period and 1 the closest to today.</p>
      <p>With the above notation, for each of the masked sentences (denoted as s ∈ S), we define a function
v : S → T representing the sentence temporal valence score. This function indicates the period of which
the masked sentence is typical.</p>
      <p>Then, we calculate a token-in-sentence temporal valence score t : V × S → T, indicating the score of
a token substituting the sentence mask.</p>
      <p>The mentioned temporal valence scores are assigned arbitrarily according to the research hypotheses.
Taking this study as an example, the criterion used to determine each score was the degree of alignment
of certain sentences or tokens with a specific historical period on a philological-linguistic basis. Scholars
wishing to delve into language study using this methodological approach can selectively choose the
score to assign to their test set based on their specific research needs. The versatility of the proposed
methodology is evident in its adaptability to a diverse array of fields of interest. This flexibility enables
its application to any pair of language varieties.</p>
      <p>In this study, we take EME (Early Modern English) as the farthest variety (i.e., −1 ∈ T) and ME
(Modern English) as the closest (i.e., 1 ∈ T). If we consider the sentence s1 = “Why wilt [MASK] be
offended by that?”, we have v(s1) = −1, as s1 is a representative sentence for EME, and
t(“thou”, s1) = −1, because in this context “thou” is indicative of EME. On the other hand,
t(“not”, s1) = 0, because “not” is neutral regarding the two language varieties.</p>
      <p>[Table 1: model predictions for s1, including the candidate tokens thou, you, i, she and he.]</p>
      <p>Given a model m, for the masked token in each sentence s ∈ S, we have the set {w1, w2, . . . , wk} ⊂ V
of k words predicted by m for s, which are associated with the vector of corresponding probabilities
from this model, shown in Equation 1:</p>
      <p>p = (p(w1), p(w2), . . . , p(wk))    (1)</p>
      <p>For this set, using the temporal valence score t, we define a token-in-sentence temporal valence
score vector x for m given the sentence s as in Equation 2:</p>
      <p>x = (t(w1, s), t(w2, s), . . . , t(wk, s))    (2)</p>
      <p>This allows us to define the bias of a model regarding the sentence as the dot product of the
model-derived probabilities and the token valence scores, providing us with a weighted score as in Equation 3,
and effectively getting a single value measurement from the two vectors above:</p>
      <p>b(m, s) = x · p    (3)</p>
      <p>We can also proceed to define the domain adequacy of a model with respect to a sentence s (see
Equation 4), based on the difference between the sentence temporal valence score v(s) and the model
bias b(m, s). To do this, we consider the difference between the model bias and the sentence temporal
valence (disregarding which one is larger), and project it on the unit interval, making sure that more
similar values lead to higher adequacy scores:</p>
      <p>a(m, s) = 1 − |v(s) − b(m, s)| / 2    (4)</p>
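      <p>To make Equations 1–4 concrete, the following Python sketch (ours, not the authors’ released code) computes the bias b(m, s) and the domain adequacy a(m, s) for one sentence, given a list of predicted tokens with their probabilities and a token valence lookup; the prediction probabilities and valence values used here are purely illustrative.</p>
      <preformat>
# Sketch of Equations 1-4: bias and domain adequacy for a single sentence.
# The predictions and valence values below are illustrative, not from the paper's data.

def bias(predictions, token_valence):
    """Equation 3: dot product of prediction probabilities (Eq. 1)
    and token-in-sentence valence scores (Eq. 2)."""
    return sum(prob * token_valence[token] for token, prob in predictions)

def adequacy(sentence_valence, model_bias):
    """Equation 4: 1 - |v(s) - b(m, s)| / 2, projected on the unit interval."""
    return 1.0 - abs(sentence_valence - model_bias) / 2.0

# Example: EME sentence s1 = "Why wilt [MASK] be offended by that?" with v(s1) = -1.
predictions = [("thou", 0.55), ("you", 0.30), ("not", 0.10), ("he", 0.05)]   # hypothetical
token_valence = {"thou": -1.0, "you": 0.5, "not": 0.0, "he": 0.0}            # hypothetical

b = bias(predictions, token_valence)   # weighted temporal valence of the predictions
a = adequacy(-1.0, b)                  # adequacy of the model for this EME sentence
print(f"bias b(m, s1) = {b:.3f}, adequacy a(m, s1) = {a:.3f}")
      </preformat>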
      <sec id="sec-3-1">
        <title>BERT Base</title>
      </sec>
      <sec id="sec-3-2">
        <title>MacBERTh BL books</title>
        <p>Scores for the neutral sentence “Have you come [MASK] to torment us before the time?” ( = 0)
token
orientation
misconduct
minorities
partners
harassment
 (, ) = 1 −
|  () −  (, ) |
2
(4)</p>
        <p>Examples of three sentences classified in diferent periods are provided in Tables 1, 2 and 3, which
show the corresponding values for  , ,  ,  (, ) and  (, ).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>We test our metrics with three BERT-based linguistic models we consider relevant for the varieties of the
English language of interest: (i) Bert-Base-Uncased, (ii) MacBERTh, and (iii) BL Books. In accordance
with the objectives of this study, the choice of models reflects a specific interest in language; therefore,
they can be replaced to best fit any other specific interest in diachronic language analysis. For the test
we used 60 word-masked sentences, specifically created for this study. To create the test set, we relied
on different types of written language: contemporary standard, journalistic language, social media
non-standard, and Early Modern language.</p>
      <p>The elements to be masked were selected based on their belonging to specific word classes known
to have suffered more exposure to the diachronic variation of the English language: pronouns, verbs,
adverbs, adjectives, and nouns. Of the 60 sentences, 20 were selected to be suggestive of the EME
variety of English, a further 20 as suggestive of ME, and the final 20 are generic. Once the test set was
complete, a temporal valence score was assigned to each sentence (see v in Section 3) based on their
level of chronological markedness.</p>
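      <p>The test set can be thought of as a simple tabular structure. The sketch below shows one hypothetical way to encode it, with each entry holding the masked sentence, its group and its sentence temporal valence v(s); apart from the two sentences quoted in the text, the structure is illustrative only.</p>
      <preformat>
# Hypothetical encoding of the test set: 60 masked sentences, each annotated with
# its group (EME, Neutral or ME) and its sentence temporal valence v(s).
test_set = [
    {"sentence": "Why wilt [MASK] be offended by that?", "group": "EME", "v": -1.0},
    {"sentence": "Have you come [MASK] to torment us before the time?", "group": "Neutral", "v": 0.0},
    # ... remaining sentences (20 EME-specific, 20 ME-specific, 20 generic) ...
]

# Every sentence valence must lie on the 5-point bipolar scale T.
assert all(entry["v"] in {-1.0, -0.5, 0.0, 0.5, 1.0} for entry in test_set)
      </preformat>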
      <p>The test set was administered to the three MLMs, and the suggested words with their probabilities
were collected. The resultant vocabulary was marked independently of the models that provided it
by assigning the token-in-sentence temporal valence score (i.e., t) to each word, based on an estimation of
the proximity of the token’s meaning to a certain linguistic variety in the context in which it appeared.
Notably, during this phase, our decision was to work on a sentence level (contextually) rather than on a
set level (globally). The method proved highly effective in avoiding the risk of semantic flattening, given
that almost every word showed some level of contextual semantic specificity when taken contextually
rather than globally. An example is the pronoun you in “fare you well, sir”, which is globally neutral
and yet acquires a strong diachronic value if evaluated in its context, in which it appears to be distinctly
archaic.</p>
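      <p>Putting the pieces together, a hedged Python sketch of this evaluation loop might look as follows: for each model and sentence, the top-k predictions with their probabilities are collected and combined with the assigned token valences. The function token_valence_for is a stand-in for the manual, in-context expert annotation described above, and bias, adequacy and test_set refer to the sketches given earlier.</p>
      <preformat>
# Sketch of the evaluation loop: query one model on each masked sentence,
# then compute bias b(m, s) and adequacy a(m, s) per sentence.
# `bias`, `adequacy` and `test_set` are the sketches defined above;
# `token_valence_for` stands in for the manual, in-context expert annotation.
from transformers import pipeline

def token_valence_for(token, sentence):
    # Placeholder: in the study this score was assigned manually per context.
    return 0.0

def evaluate_model(model_id, test_set, top_k=10):
    fill_mask = pipeline("fill-mask", model=model_id)
    results = []
    for entry in test_set:
        masked = entry["sentence"].replace("[MASK]", fill_mask.tokenizer.mask_token)
        preds = [(p["token_str"], p["score"]) for p in fill_mask(masked, top_k=top_k)]
        valences = {tok: token_valence_for(tok, entry["sentence"]) for tok, _ in preds}
        b = bias(preds, valences)
        a = adequacy(entry["v"], b)
        results.append({"group": entry["group"], "bias": b, "adequacy": a})
    return results
      </preformat>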
      <p>Once b and a were calculated, we proceeded with the analysis of the data and the collection of results.
The distribution of the bias score b and the domain adequacy score a for the sentences in the three
groups (i.e., EME, Neutral, and ME) is shown in Figures 1 and 2, respectively.</p>
      <p>For transparency and reproducibility purposes, the following anonymous link contains the complete test set
with the corresponding values produced during evaluation: https://tinyurl.com/bert-shakespearean</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Both notions of bias (b) and domain adequacy (a) provide important insights into the nature of the models.
The first, b, indicates a tendency in terms of temporal valency. In other words, the interpretation of
its value should be considered within the context of the specific dichotomy of language varieties. On
the other hand, a reflects the adequacy for an individual language variety. It successfully captures
model tendencies when completing historically predetermined sentences. However, due to its inclusion
of v, a is less informative when there is no bias originating from the sentence, i.e. when completing
temporally-neutral sentences.</p>
      <p>Notably, our measures demonstrate that MacBERTh is better at representing the EME historical
context than BL Books. This could possibly be explained by the nature of the models in question. First,
MacBERTh is a model created from scratch and trained on texts spanning a time range that takes into
account the evolution of the English language from EME to ME. BL Books, on the other hand, was only
fine-tuned on texts from the modern period, so it has no direct exposure to EME. It does perform better
on ME than MacBERTh and worse than BERT Base. Thus, MacBERTh demonstrates a strong linguistic
consistency, given the wide range of language varieties it is trained on, but in tasks related to ME yields
worse results than other more specialised models. Simply put, having a specific, narrower domain poses
fewer problems when working within it but reveals clear gaps when moving outside that domain.</p>
      <p>The notion that LMs can serve as a window into the history of a population is not new, but there is
a growing interest in exploring the relationships between these models and the socio-linguistic and
socio-cultural contexts [17, 18, 19, 20]. It is equally imperative to establish a procedural framework to
address the lack of evaluative methods for these models, as previously hinted at in this text. This is
particularly useful when no direct links could be drawn between the corpus used to train the model
and the social context of the test set.</p>
      <p>Within this evaluation, we created a dedicated test set for each model under scrutiny, drawing upon
approaches used for evaluation of bias in MLMs. In creating our test sets, we built our sentences both
on logical-semantic and logical-syntactic tasks. Future work could try to create a test set for model
interrogation that is culture-oriented, delving into socio-culturally significant elements such as customs,
historical events, and attitudes towards social groups – elements recognised as belonging to social
knowledge. Alternatively, tests could be derived from word-in-context datasets, such as TempoWiC and
HistWiC [21, 22, 23].</p>
      <p>Alternatively, the temporal valence of word tokens could be derived not simply from the sentence
where they emerge, but from the wider historical context, e.g. from a large corpus, representative
for the period [24, 25]. This would allow automating not only the calculation of the token temporal
valence t, but also the identification of sentences that are representative of each historical period. As a
consequence, the dependence on manual expert evaluation would be strongly reduced, which would
result in both higher reproducibility and wider generalisability of the approach.</p>
      <p>This study aims not only to propose a methodology for assessing language models but also to put
forth hypotheses for expanding the available tools to humanities scholars interested in studying complex
socio-cultural phenomena with an approach which begins by interpreting textual clues and inferring
their connections to reality. As such, it is also applicable beyond diachronic contexts, for example across
dialects or professional jargons.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The research leading to these results has received funding from MUR, PRIN2022 project “MetaLing
Corpus: Creating a corpus of English linguistics metalanguage from the 16th to the 18th century”,
ref.: 202233C93X, funded by the European Union under the programme NextGenerationEU.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[3] M. Mozafari, R. Farahbakhsh, N. Crespi, Hate speech detection and racial bias mitigation in social media based on BERT model, PLOS ONE 15 (2020) e0237861. doi:10.1371/journal.pone.0237861.</p>
      <p>[4] D. Nozza, F. Bianchi, D. Hovy, HONEST: Measuring Hurtful Sentence Completion in Language Models, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 2398–2406. doi:10.18653/v1/2021.naacl-main.191.</p>
      <p>[5] P. J. Corfield, Fleeting gestures and changing styles of greeting: researching daily life in British towns in the long eighteenth century, Urban History 49 (2022) 555–567. doi:10.1017/S0963926821000274.</p>
      <p>[6] A. Morollon Diaz-Faes, C. Murteira, M. Ruskov, Explicit references to social values in fairy tales: A comparison between three European cultures, in: M. Hämäläinen, E. Öhman, F. Pirinen, K. Alnajjar, S. Miyagawa, Y. Bizzoni, N. Partanen, J. Rueter (Eds.), Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, Association for Computational Linguistics, Tokyo, Japan, 2023, pp. 62–75. URL: https://aclanthology.org/2023.nlp4dh-1.8.</p>
      <p>[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.</p>
      <p>[8] E. Manjavacas, L. Fonteyn, Adapting vs. Pre-training Language Models for Historical Languages, Journal of Data Mining &amp; Digital Humanities NLP4DH (2022). doi:10.46298/jdmdh.9152.</p>
      <p>[9] S. Schweter, Pretrained Language Models on British Library Corpus, 2024. doi:10.5281/zenodo.10715629.</p>
      <p>[10] L. Boroditsky, How Language Shapes Thought, Scientific American 304 (2011) 62–65. doi:10.1038/scientificamerican0211-62.</p>
      <p>[11] L. Manovich, Cultural analytics, The MIT Press, Cambridge, Massachusetts, 2020. URL: https://mitpress.mit.edu/9780262037105/cultural-analytics/.</p>
      <p>[12] J. Bruner, Pragmatics of Language and Language of Pragmatics, Social Research 51 (1984) 969–984. URL: https://www.jstor.org/stable/40970973.</p>
      <p>[13] L. van Lit, Among Digitized Manuscripts. Philology, Codicology, Paleography in a Digital World, Brill, Leiden, The Netherlands, 2019. doi:10.1163/9789004400351.</p>
      <p>[14] M. Ruskov, Who and How: Using Sentence-Level NLP to Evaluate Idea Completeness, in: N. Wang, G. Rebolledo-Mendez, V. Dimitrova, N. Matsuda, O. C. Santos (Eds.), Artificial Intelligence in Education. Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium and Blue Sky, Communications in Computer and Information Science, Springer Nature Switzerland, Cham, 2023, pp. 284–289. doi:10.1007/978-3-031-36336-8_44.</p>
      <p>[15] A. Piper, H. Xu, E. D. Kolaczyk, Modeling Narrative Revelation, in: A. Šeľa, F. Jannidis, I. Romanowska (Eds.), Proceedings of the Computational Humanities Research Conference 2023, volume 3558 of CEUR Workshop Proceedings, CEUR, Paris, France, 2023, pp. 500–511.</p>
      <p>[16] F. Periti, S. Montanelli, Lexical Semantic Change through Large Language Models: a Survey, ACM Comput. Surv. 56 (2024) 282:1–282:38. URL: https://dl.acm.org/doi/10.1145/3672393. doi:10.1145/3672393.</p>
      <p>[17] R. M. M. Hicke, D. Mimno, T5 meets Tybalt: Author Attribution in Early Modern English Drama Using Large Language Models, in: A. Šeľa, F. Jannidis, I. Romanowska (Eds.), Proceedings of the Computational Humanities Research Conference 2023, volume 3558 of CEUR Workshop Proceedings, CEUR, Paris, France, 2023, pp. 274–302. URL: https://ceur-ws.org/Vol-3558/#paper2757.</p>
      <p>[18] H. Usui, K. Komiya, Translation from historical to contemporary Japanese using Japanese T5, in: M. Hämäläinen, E. Öhman, F. Pirinen, K. Alnajjar, S. Miyagawa, Y. Bizzoni, N. Partanen, J. Rueter (Eds.), Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages, Association for Computational Linguistics, Tokyo, Japan, 2023, pp. 27–35. URL: https://aclanthology.org/2023.nlp4dh-1.4.</p>
      <p>[19] N. Pedrazzini, B. McGillivray, Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers, in: M. Hämäläinen, K. Alnajjar, N. Partanen, J. Rueter (Eds.), Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities, Association for Computational Linguistics, Taipei, Taiwan, 2022, pp. 85–95. URL: https://aclanthology.org/2022.nlp4dh-1.12.</p>
      <p>[20] A. Palmero Aprosio, S. Menini, S. Tonelli, BERToldo, the historical BERT for Italian, in: R. Sprugnoli, M. Passarotti (Eds.), Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, European Language Resources Association, Marseille, France, 2022, pp. 68–72. URL: https://aclanthology.org/2022.lt4hala-1.10.</p>
      <p>[21] M. T. Pilehvar, J. Camacho-Collados, WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 1267–1273. doi:10.18653/v1/N19-1128.</p>
      <p>[22] D. Loureiro, A. D’Souza, A. N. Muhajab, I. A. White, G. Wong, L. Espinosa-Anke, L. Neves, F. Barbieri, J. Camacho-Collados, TempoWiC: An Evaluation Benchmark for Detecting Meaning Shift in Social Media, in: N. Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S.-H. Na (Eds.), Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 3353–3359. URL: https://aclanthology.org/2022.coling-1.296.</p>
      <p>[23] F. Periti, H. Dubossarsky, N. Tahmasebi, (Chat)GPT v BERT Dawn of Justice for Semantic Change Detection, in: Y. Graham, M. Purver (Eds.), Findings of the Association for Computational Linguistics: EACL 2024, Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp. 420–436. URL: https://aclanthology.org/2024.findings-eacl.29.</p>
      <p>[24] G. D. Gasperis, P. Pavone, S. Bolasco, A Strategy to Identify the Peculiarity of a Lexicon in the Analysis of a Corpus, in: G. Giordano, M. Misuraca (Eds.), New Frontiers in Textual Data Analysis, Springer Nature Switzerland, Cham, 2024, pp. 105–118. doi:10.1007/978-3-031-55917-4_9.</p>
      <p>[25] S. Bolasco, T. De Mauro, L’analisi automatica dei testi: fare ricerca con il text mining, number 922 in Studi superiori Statistica, 1a edizione ed., Carocci, Roma, 2013.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>D. de Vassimon Manela</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Errington</surname>
            , T. Fisher,
            <given-names>B. van Breugel</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Minervini</surname>
          </string-name>
          ,
          <article-title>Stereotype and skew: Quantifying gender bias in pre-trained and fine-tuned language models</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:</source>
          Main Volume,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>2232</fpage>
          -
          <lpage>2242</lpage>
          . doi:
          <volume>10</volume>
          .18653/ v1/
          <year>2021</year>
          .eacl-main.
          <volume>190</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <article-title>Mitigating language-dependent ethnic bias in BERT</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and
          <string-name>
            <given-names>Punta</given-names>
            <surname>Cana</surname>
          </string-name>
          , Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>533</fpage>
          -
          <lpage>549</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2021</year>
          .emnlp-main.
          <volume>42</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>