<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Probing the SpanBERT Architecture to interpret Scientific Domain Adaptation Challenges for Coreference Resolution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hari Timmapathini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anmol Nayak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarathchandra Mandadi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siva Sangada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vaibhav Kesri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karthikeyan Ponnalagu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vijendran Venkoparao</string-name>
          <email>GopalanVijendran.Venkoparaog@in.bosch.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ARiSE Labs at Bosch</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2</volume>
      <fpage>30</fpage>
      <lpage>35</lpage>
      <abstract>
        <p>Coreference Resolution is a challenging problem in Natural Language Processing (NLP) that aims at clustering all references of the same entity or event. This requires both syntactic and semantic understanding of the text. A strong coreference resolution model is essential for achieving good performance in several downstream NLP tasks such as Question-Answering and Information Extraction. SpanBERT (Joshi et al. 2020) has achieved state-of-the-art performance in coreference resolution on the OntoNotes dataset (Pradhan et al. 2012). However, it still faces several challenges when performing coreference resolution on documents involving multiple domain-specific entities and events. In this paper we highlight these issues with the SpanBERT-Base pretrained coreference model in scientific domain adaptation. Our detailed experiments are performed on the SciERC scientific abstract dataset (Luan et al. 2018), where we analyse the encoder attention and probe the coarse-to-fine head network to interpret the shortcomings of SpanBERT. This led to interesting findings that showed: 1) while the syntactic behaviour is captured appropriately, the self-attention mechanism in the encoder layers of SpanBERT struggles to capture domain-specific semantic concepts; 2) inferior mention spans are picked in the top mention spans list due to poor mention scores even though better candidate key mention spans exist; and 3) even after increasing the hyperparameter λ from 0.4 to 1 and 2, there is insignificant improvement in both the Nkey∩response count and the response coreference cluster scores across 5 different evaluation metrics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        BERT
        <xref ref-type="bibr" rid="ref3">(Devlin et al. 2019)</xref>
        has been a breakthrough in language understanding by leveraging the multi-head self-attention mechanism
        <xref ref-type="bibr" rid="ref21">(Vaswani et al. 2017)</xref>
        in its architecture. It is one of the prominent models used for a variety of NLP tasks. With the Masked Language Model (MLM) method, it has been successful at leveraging bidirectionality while training the language model. The SpanBERT-Base model has 12 encoder layers, with each layer consisting of 12 self-attention heads. The word representations are context-dependent 768-dimensional dynamic embeddings. The vocabulary size is 28,996 and contains 101 unused slots. The unused slots in the vocabulary can be used to include domain-specific words; however, the representations of these will have to be fine-tuned on a domain-specific corpus.
      </p>
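      <p>As a concrete illustration of the point above, the sketch below shows one way unused vocabulary slots could be claimed for domain words. The 5-entry toy vocabulary and the register_domain_word helper are hypothetical; only the [unused*] naming convention follows BERT.</p>

```python
# Sketch: repurposing unused vocabulary slots for domain-specific words.
# The slot names mirror BERT's convention ([unused0], [unused1], ...), but this
# tiny vocabulary is illustrative, not the real 28,996-entry vocab.
vocab = {"[PAD]": 0, "[unused0]": 1, "[unused1]": 2, "the": 3, "cruise": 4}

def register_domain_word(vocab, word):
    # Claim the first free [unused*] slot. The new token's embedding would
    # still need fine-tuning on a domain corpus before it is useful.
    slot = next((t for t in vocab if t.startswith("[unused")), None)
    if slot is None:
        raise ValueError("no unused slots left")
    vocab[word] = vocab.pop(slot)
    return vocab[word]

idx = register_domain_word(vocab, "AUTOSAR")
print(idx)  # 1: "AUTOSAR" now occupies the slot that [unused0] held
```

      <p>In practice the same idea applies to the real SpanBERT vocabulary file; only the embedding fine-tuning step mentioned above makes the new entries meaningful.</p>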
      <p>
        While the BERT architecture relies on MLM at word level
and Next Sentence Prediction (NSP) during training,
SpanBERT has changed the learning mechanism to MLM at span
level and uses a Span Boundary Objective (SBO). SBO
predicts a target masked token by using the representations of
the boundary tokens of a given span along with the
positional embedding of the target masked token. This
learning mechanism has enabled SpanBERT to outperform BERT
on almost all tasks with significant improvements. For the
coreference resolution task, SpanBERT leverages an
independent implementation of higher order coarse-to-fine span
ranking architecture
        <xref ref-type="bibr" rid="ref12 ref15 ref17 ref7">(Lee, He, and Zettlemoyer 2018)</xref>
        that
iteratively refines the mentions using an attention mechanism.
      </p>
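      <p>To make the SBO computation concrete, the following toy sketch (our own simplification: random vectors, a ReLU feed-forward, and an 8-dimensional space standing in for SpanBERT's actual GeLU layers and 768 dimensions) scores a masked target token from the two span-boundary representations and the target's position embedding.</p>

```python
# Toy sketch of the Span Boundary Objective (SBO): predict a masked token
# inside a span from the boundary-token representations plus the position
# embedding of the target. Weights and dimensions are illustrative.
import random

random.seed(0)
DIM = 8

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

def feed_forward(x, w1, w2):
    # 2-layer network; ReLU keeps the sketch readable (SpanBERT uses GeLU).
    h = [max(0.0, sum(xi * wi for xi, wi in zip(x, col))) for col in w1]
    return [sum(hi * wi for hi, wi in zip(h, col)) for col in w2]

def sbo_logits(x_before, x_after, pos_emb, w1, w2):
    # SBO input: [left boundary; right boundary; target position embedding]
    return feed_forward(x_before + x_after + pos_emb, w1, w2)

w1 = [rand_vec(3 * DIM) for _ in range(DIM)]  # hidden layer
w2 = [rand_vec(DIM) for _ in range(DIM)]      # output layer (vocab-sized in reality)
logits = sbo_logits(rand_vec(DIM), rand_vec(DIM), rand_vec(DIM), w1, w2)
print(len(logits))  # one score per vocabulary entry (toy vocab of size DIM)
```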
      <p>
        A strong coreference resolution model is essential in domains which describe concepts that require long-range dependencies between mentions, for applications like Question-Answering systems and Information Extraction for Domain Specific Knowledge Graphs
        <xref ref-type="bibr" rid="ref11 ref16">(Lin et al. 2017; Kejriwal 2019)</xref>
        . Scientific domain adaptation within industries is challenging due to the following reasons:
1. Typically there is a lack of sufficient data to fine-tune the language model of such large pre-trained networks.
2. Unavailability of annotated data for task-specific fine-tuning, as it requires a domain expert's understanding to annotate the data correctly to encapsulate the nuances of the domain.
      </p>
      <p>
        We probe the model to analyse 5 different aspects of
the SpanBERT coreference resolution architecture: Encoder
attention, Identification of Mentions, Mention scores,
Antecedent scores and Coreference Clusters. The Newswire
genre of OntoNotes was selected with SpanBERT. MUC,
B3, CEAFm, CEAFe and LEA
        <xref ref-type="bibr" rid="ref18">(Pradhan et al. 2014;
Moosavi and Strube 2016)</xref>
        have been selected as the
coreference evaluation metrics. The experiments are performed
on the SciERC dataset along with motivating example
sentences, that depict the various kinds of sentence
structures typically found in technical documents of AUTOSAR
(http://www.autosar.org/) compliant automotive domain
systems. We discuss these challenges below by analysing
SpanBERT Encoder and Probing the Coarse-to-fine network.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>SpanBERT Coreference Resolution architecture consists of a SpanBERT Transformer Encoder with a Coarse-to-fine head network (Figure 1). The input is tokenized with a BERT variant of the WordPiece algorithm (Schuster and Nakajima 2012) and passed into the encoder to generate contextualized representations for each token. Mention spans are non-overlapping segments from the input text up to a predefined length. The encoder representations are consumed by the coarse-to-fine network and iteratively refined using an attention mechanism to give the span representations g, which are used for computing the following coreference resolution specific scores:
1. Mention score sm(i) for a mention span i, that is used to further prune the mention spans list.
2. Fast antecedent score sc(i, j) between mention span i and candidate antecedent span j, that uses a bi-linear scoring function to pick the top K candidate antecedent spans for each mention.
3. Antecedent distance score sd(i, j) that is computed using 10 semi-log scale buckets.
4. Slow antecedent score sa(i, j) that relies upon the mention span i and candidate antecedent span j representations, the element-wise similarity between i and j, and a feature vector encoding genre information, span distance etc.
5. Coreference resolution score s(i, j) that is used to decide whether candidate antecedent span j is coreferent to mention span i.</p>
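      <p>A minimal sketch of how these scores combine into the final clustering decision, using made-up score values; the dummy antecedent is modelled as the 0.0 baseline that a real antecedent must beat.</p>

```python
# Sketch: combining the partial scores into s(i, j) and applying the
# dummy-antecedent rule (an antecedent is linked only if its total score
# exceeds 0). All numeric values below are illustrative, not model outputs.
def coref_score(sm_i, sm_j, sc_ij, sd_ij, sa_ij):
    # s(i, j) = sm(i) + sm(j) + sc(i, j) + sd(i, j) + sa(i, j)
    return sm_i + sm_j + sc_ij + sd_ij + sa_ij

def pick_antecedent(candidates):
    # candidates: list of (antecedent_span, s(i, j)); the dummy antecedent
    # contributes a fixed score of 0, so None is returned when nothing beats it.
    best, best_score = None, 0.0
    for antecedent, score in candidates:
        if score > best_score:
            best, best_score = antecedent, score
    return best

scores = [("intractable", coref_score(-1.0, -0.5, 2.1, 0.4, 0.7)),  # 1.7
          ("This paper", coref_score(-1.0, -3.0, 0.2, 0.1, 0.4))]   # -3.3
print(pick_antecedent(scores))  # intractable
```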
      <p>Further, the mention spans can be segregated into 3
categories:
• Key spans Mkey, which are the annotated gold standard
spans.
• Top spans Mtop, which are the final pruned set of
candidate mention spans selected by the coarse-to-fine
network.
• Response spans Mresponse, which are the system
generated output spans found in the predicted coreference
clusters. These are a subset of the Top spans.</p>
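      <p>These three categories, and the intersection counts analysed later, can be expressed directly as set operations. The spans below follow the Table 1 sample abstract; the subset relation between response and top spans is made explicit.</p>

```python
# Sketch: mention-span categories as Python sets. Spans are taken from the
# Table 1 example; counts like Nkey∩response fall out of set intersections.
m_key = {"feature-based partial descriptions", "descriptions"}   # gold spans
m_top = {"This paper", "such descriptions", "intractable", "this intractability"}
m_response = {"intractable", "this intractability"}              # clustered output

assert m_response.issubset(m_top)  # response spans are a subset of top spans
n_key_top = len(m_key.intersection(m_top))            # key spans surviving pruning
n_key_response = len(m_key.intersection(m_response))  # key spans in clusters
print(n_key_top, n_key_response)  # 0 0
```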
      <p>
        We evaluated the overall coreference resolution
performance of SpanBERT using 5 standard metrics, each of
which compute the Precision, Recall and F1 scores with
emphasis on different aspects of the coreference clusters
        <xref ref-type="bibr" rid="ref1">(Cai
and Strube 2010)</xref>
        :
• MUC: It is a link-based metric that computes the
minimum number of links between mentions to be inserted or
deleted when mapping a system generated response to a
gold standard key set.
• B3: It is a mention-based metric that computes the overall
Precision and Recall based on the Precision and Recall of
the individual mentions.
• CEAFm: It is a mention-based variant of the CEAF
metric, which indicates the percentage of mentions that are in
the correct entities.
• CEAFe: It is an entity-based variant of the CEAF metric,
which indicates the percentage of correctly recognized
entities.
• LEA: It is a link-based entity-aware metric that considers
how important the entity is and how well it is resolved.
      </p>
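      <p>As an illustration of the link-based view, a minimal MUC scorer (our own sketch of the Vilain et al. formulation, not the official CoNLL scorer) computes recall by partitioning each key entity with the response clusters; precision is the same computation with the roles swapped.</p>

```python
# Sketch of the MUC metric: recall = sum(|K| - |p(K)|) / sum(|K| - 1), where
# p(K) partitions key entity K by the response clusters (unresolved mentions
# become singleton parts). Precision swaps the key/response roles.
def muc_recall(key_clusters, response_clusters):
    num = den = 0
    for k in key_clusters:
        parts, covered = set(), set()
        for r in response_clusters:
            inter = frozenset(k.intersection(r))
            if inter:
                parts.add(inter)
                covered.update(inter)
        parts.update(frozenset([m]) for m in k - covered)
        num += len(k) - len(parts)
        den += len(k) - 1
    return num / den if den else 0.0

key = [{"a", "b", "c"}]   # one gold entity with three mentions
response = [{"a", "b"}]   # system links only two of them
recall = muc_recall(key, response)
precision = muc_recall(response, key)
print(recall, precision)  # 0.5 1.0
```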
      <p>
        We also performed a baseline comparison between the
independent variants of the SpanBERT-Base and
        <xref ref-type="bibr" rid="ref10">BERT-Base (Joshi et al. 2019)</xref>
        pretrained coreference models on the SciERC dataset.
BERT has been shown to learn surface level features in the
early layers, syntactic features in the middle layers and
semantic features in the higher layers
        <xref ref-type="bibr" rid="ref8">(Jawahar, Sagot, and
Seddah 2019)</xref>
        . Coreference resolution relies heavily on
capturing the syntactic behaviour to pick syntactically plausible
mention spans. BERT has been previously shown to capture
strong syntactic representations
        <xref ref-type="bibr" rid="ref19">(Tenney et al. 2019)</xref>
        .
      </p>
      <p>We found that across the SciERC scientific abstracts, most
of the top spans selected by SpanBERT had the correct
boundaries. This strong syntactic understanding in
SpanBERT can be attributed to the SBO technique it utilizes
during training. While the SpanBERT training objectives
have improved the span boundaries, domain specific
semantic concepts are significantly more difficult to learn due to
the following reasons:
1. Events typically involve multiple entities interacting
under certain conditions.
2. Long range dependencies between coreferent mentions as
sentences tend to build upon concepts previously
mentioned.</p>
      <p>To see how SpanBERT handles this, we analyse the self-attention in the encoder layers between two sets of mention spans for each abstract in the SciERC dataset:
• Set 1: Pairwise attention scores amongst spans in Mkey ∩ Mresponse and Mkey − (Mkey ∩ Mresponse).
• Set 2: Pairwise attention scores between spans in Mkey ∩ Mresponse.</p>
      <p>A sample output for the different categories of mention
spans and clusters for an abstract from the SciERC
coreference resolution dataset can be seen in Table 1. For each
encoder layer, we extract the pairwise attention scores to
observe the difference in attention given by a clustered key
span to a co-occurring clustered key span in comparison to
a non-clustered key span. Across the 12 layers we observed
that the attention scores in Set 1 and Set 2 were extremely
small. While we observed that the dominant heads (shades
of yellow and green in Figure 2) in both Set 1 and Set 2 tend
to be the same, on average each pairwise attention score for
these heads was found to be less than 0.01, which is less than
1% of the total attention mass for the abstract. As the
attention scores are computed from the Key and Query vectors
of a given word, these extremely low attention scores reflect
the weak semantic representations of the spans.</p>
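      <p>The pairwise attention analysis above can be sketched as follows; the 4-D nested-list attention tensor is a stand-in for the per-layer, per-head attention matrices a real encoder would return, and the spans are arbitrary token-index lists.</p>

```python
# Sketch of the Set-1/Set-2 analysis: average the attention a span's tokens
# give to another span's tokens, per layer and head.
def span_pair_attention(attn, span_i, span_j):
    # attn: [layer][head][query_token][key_token]; spans are token-index lists
    scores = []
    for layer in attn:
        per_head = []
        for head in layer:
            vals = [head[q][k] for q in span_i for k in span_j]
            per_head.append(sum(vals) / len(vals))
        scores.append(per_head)
    return scores  # [layer][head] mean pairwise attention

# Toy input: 1 layer, 2 heads, 4 tokens, uniform attention of 0.25 everywhere.
attn = [[[[0.25] * 4 for _ in range(4)] for _ in range(2)]]
print(span_pair_attention(attn, [0, 1], [2, 3]))  # [[0.25, 0.25]]
```

      <p>In our experiments this mean, computed for the dominant heads, stayed below 0.01 for both sets.</p>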
      <p>Further, this also highlights that no specific head across the 12 encoder layers exhibits strong coreference behaviour in the case of scientific domain abstracts. Previous work on BERT showed that different heads of each layer attend to specific linguistic behaviours like coreference, syntax and delimiter tokens (Clark et al. 2019). This semantic loss leads to cascading problems in the coarse-to-fine network through the Fast and Slow antecedent score computations. The weak semantic representations have also led to fewer key mention spans being picked up as candidates to be clustered (Table 2). This shows that the self-attention mechanism in the encoder layers of SpanBERT struggles to capture scientific domain-specific semantic concepts.</p>
    </sec>
    <sec id="sec-3">
      <title>Probing the Coarse-to-fine network</title>
      <p>SpanBERT uses a coarse-to-fine architecture in the head network to perform coreference resolution. For a given sentence, the network first generates the mention scores for all possible candidate mentions. It then picks the top M = min(3900, λT) non-crossing mentions based on the mention scores, where T is the number of words in the tokenized sentence and λ is a configurable parameter that decides the number of spans per word; λ is set to 0.4 (default) in SpanBERT coreference resolution.</p>
      <p>Table 1 (sample output for the different categories of mention spans and clusters for the SciERC abstract C90-3007): "This paper examines the properties of feature-based partial descriptions built on top of Halliday's systemic networks. We show that the crucial operation of consistency checking for such descriptions is NP-complete, and therefore probably intractable, but proceed to develop algorithms which can sometimes alleviate the unpleasant consequences of this intractability." Mkey: [feature-based partial descriptions; descriptions]. Mtop: [This paper; feature-based partial descriptions built on top of Halliday's systemic networks; such descriptions; intractable; this intractability; ...]. Mresponse: [intractable; this intractability]. Mkey∩top: []. Mkey∩response: []. Key clusters: [feature-based partial descriptions; descriptions]. Response clusters: [intractable; this intractability].</p>
      <p>We conducted our experiments with λ = 0.4, 1 and 2 to make sure that the limited size of the top span list is not a reason for key mentions to be discarded. It should be noted that while λ = 1 and λ = 2 may increase the number of key mention spans in the top span list, this comes at a performance cost, as can be seen in Table 2: Ntop (λ = 2) is far larger than Ntop (λ = 0.4).</p>
      <p>For each of the top M mentions, the top K = min(50, λT) antecedents are picked from the top mention span list based on the score sm(i) + sm(j) + sc(i, j) + sd(i, j), where sm(i) is the mention score of mention span i, sm(j) is the mention score of antecedent span j, sc(i, j) is the fast antecedent score between spans i and j, and sd(i, j) is the antecedent distance score introduced in the coarse-to-fine implementation of SpanBERT.</p>
      <p>From this pruned set of antecedents, the final coreference score s(i, j) = sm(i) + sm(j) + sc(i, j) + sd(i, j) + sa(i, j) is calculated between each mention and each of its top antecedents, where sa(i, j) is the slow antecedent score. The top scoring antecedent j is then picked as coreferent to mention i if s(i, j) &gt; 0. Only antecedents that result in a positive coreference score are picked, since a dummy antecedent, whose coreference score with every mention is 0, is introduced before the softmax layer.</p>
      <p>Table 5 (automotive domain motivating example sentences; bracketed indices mark the annotated mentions): 1. When cruise control button is pressed for 2 seconds cruise control is activated[1]. After this[2] happens, the speed is maintained. 2. After this condition[3] is satisfied, cruise control will be activated: Cruise control button is pressed for 2 seconds[4]. 3. When the cruise control button is pressed for 2 seconds[5], then[6] cruise control is activated. 4. Adaptive Cruise control[7], commonly known as Cruise control[8], is a speed maintaining feature that is often found in high-end cars. 5. Cruise control[9] is a speed maintain feature. When the car is cruising[10], a beep is triggered every 5 minutes. 6. When the minimum speed threshold[11] of Cruise control[12] is reached, the cruise activation lamp turns green to signify cruise control activation is available. 7. Cruise control is usually available in high-end cars[13]. Such vehicles[14] are typically 30% costlier than mid-end cars. 8. When the vehicle speed[15] is above 60kmph[16], cruise control is activated.</p>
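      <p>The λ-controlled pruning step described above can be sketched as follows; the candidate spans and mention scores are illustrative (they echo the first motivating example, where an irrelevant crossing mention outscored the expected antecedent).</p>

```python
# Sketch: keep the top min(3900, λ·T) candidate spans by mention score.
# Span texts and scores below are toy values, not actual model outputs.
def prune_spans(scored_spans, T, lam=0.4, cap=3900):
    k = min(cap, int(lam * T))
    ranked = sorted(scored_spans, key=lambda x: x[1], reverse=True)
    return ranked[:k]

spans = [("cruise control", -29.0),
         ("is activated. After this happens", -29.048),  # irrelevant crossing span
         ("cruise control is activated", -30.980),       # expected antecedent
         ("the speed", -31.5)]
top = prune_spans(spans, T=5)  # λ = 0.4 keeps only 2 of the 4 spans
print([s for s, _ in top])  # ['cruise control', 'is activated. After this happens']
```

      <p>Raising λ enlarges the surviving list but, as discussed below, does not by itself promote the expected spans above better-scoring crossing mentions.</p>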
      <sec id="sec-3-4">
        <title>SpanBERT performance on the SciERC dataset</title>
        <p>The SciERC dataset consists of 500 annotated scientific domain abstracts. The total number of key mention spans was 2686. We probed the coarse-to-fine head network to analyse two aspects of the SpanBERT coreference resolution architecture:
1. Qualitative and Quantitative measures of the Mention Spans (Table 2): Picking the top mention spans is the first important task for the head network. We observed that for λ = 0.4 and λ = 1, the recall of key mention spans is around 30% and 40% respectively. The recall increased to around 82% in the case of λ = 2. However, that was only possible because 126395 top spans had to be picked, which is extremely large. The precision of the top spans was found to be extremely low for all the values of λ. We then checked the number of key mention spans that were part of the response clusters (Nkey∩response). In this case, the numbers turned out to be roughly the same for all the values of λ. This clearly indicated that while increasing the value of λ increases the chances of a larger number of key mention spans being part of the top spans list, it does not guarantee an improvement in the number of key mentions becoming part of the response clusters. Across all the values of λ, the Precision, Recall and F1 scores for the identification of mentions were found to be roughly 10%, 14% and 11% respectively. We believe that these low values are due to the weak SpanBERT representations for the mention spans found in the scientific domain abstracts, which the coarse-to-fine head network finds difficult to recover from.
2. Overall coreference resolution performance (Table 3): We evaluated the SpanBERT coreference resolution performance using 5 different metrics, each of which targets different aspects of the coreference clusters. Another indication that increasing λ did not significantly improve the coreference resolution was that the Precision, Recall and F1 scores for coreference resolution were roughly the same, being around 6%, 9% and 7% respectively.</p>
        <p>The low scores appearing consistently both in
Identification of Mentions and Overall coreference resolution across a
large number of abstracts clearly indicates the difficulty that
SpanBERT faces while adapting to the scientific domain
corpus coreference resolution task. We also observed a similar
performance in both Identification of Mentions (Table 2) and
Overall coreference resolution (Table 4) with BERT-Base.</p>
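        <p>For reference, the mention-identification scores in Table 2 follow the usual exact-match computation, sketched here on toy key (gold) and response (system) span sets:</p>

```python
# Sketch: exact-match mention identification Precision/Recall/F1 over toy
# key and response span sets.
def prf(key_spans, response_spans):
    tp = len(key_spans.intersection(response_spans))  # exact-boundary matches
    p = tp / len(response_spans) if response_spans else 0.0
    r = tp / len(key_spans) if key_spans else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

key = {"feature-based partial descriptions", "descriptions", "consistency checking"}
response = {"descriptions", "This paper"}
p, r, f1 = prf(key, response)
print(p, round(r, 3), round(f1, 3))  # 0.5 0.333 0.4
```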
      </sec>
      <sec id="sec-3-5">
        <title>SpanBERT performance on the Automotive domain motivating example sentences</title>
        <p>To get more granular insights into the coarse-to-fine network, we further probed the head network on the automotive domain motivating example sentences (Table 5) to extract the Mention scores, Fast Antecedent scores, Slow Antecedent scores, Antecedent distance scores and Final Coreference scores. SpanBERT did not give a valid coreference cluster for any of the motivating example sentences (Table 6). In the first motivating example sentence, a cluster was found between this and activated; however, it was still not the expected cluster. For the mentions which were not picked as top spans, the sc(i, j), sa(i, j), sd(i, j) and s(i, j) scores cannot be computed. We observed that:
• Due to the limit on the number of top mentions that can be picked, many expected mentions were eliminated due to a lower mention score. This happened in 5 different motivating example sentences, each of which had a different sentence structure.
• Even after increasing λ to 1 and 2, the expected antecedents were eliminated from the top span list by another irrelevant crossing mention that had a better mention score.</p>
        <p>For example, in the first motivating example sentence, the expected antecedent span "cruise control is activated", with a mention score of -30.980, was not picked as a top span, since a better scoring but irrelevant crossing mention, "is activated. After this happens", received a mention score of -29.048.
• These different scores provide insights into the reasons behind certain clusters not being formed by the network.</p>
        <p>We believe that probing the coarse-to-fine network
reveals the underlying issue of the mention spans having weak
semantic representations. Stronger semantic representations
would lead to better mention scores for the expected mention
spans, thereby ranking them higher to be selected as a top
mention. This would also positively impact the antecedent
scores as they rely heavily upon the mention and antecedent
representations.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>
        We presented an analysis of the challenges faced by SpanBERT Coreference Resolution in tackling a scientific domain corpus. We performed detailed experiments analysing the attention mechanism in the SpanBERT encoder layers along with probing the coarse-to-fine head network to understand how well the syntactic and semantic behaviours are being captured. Our findings show that while SpanBERT has a strong syntactic understanding, its semantic understanding of scientific domain documents is weak, which further leads to cascading problems for the coreference resolution task. We believe that some of the directions which could improve the scientific domain adaptation of SpanBERT are:
1. As SpanBERT relies on the BERT variant of the WordPiece algorithm to tokenize an input text, which has previously been shown to give poorer performance in the case of Out-of-Vocabulary (OOV) words (Nayak et al. 2020), a frequency- or likelihood-based tokenization algorithm such as BPE-Dropout (Provilkov, Emelianenko, and Voita 2019) or SentencePiece
        <xref ref-type="bibr" rid="ref12 ref7">(Kudo and Richardson 2018)</xref>
        could lead to better sub-word choices and thereby better semantic representations for OOV words.
2. In the case where sufficient data exists to fine-tune the language model of SpanBERT, care should be taken to ensure that task-specific catastrophic forgetting is avoided by leveraging advanced fine-tuning techniques
        <xref ref-type="bibr" rid="ref12 ref6 ref7">(Dodge et al.
2020; Howard and Ruder 2018)</xref>
        .
      </p>
      <p>Nayak, A.; Timmapathini, H.; Ponnalagu, K.; and Venkoparao, V. G. 2020. Domain adaptation challenges of BERT in tokenization and sub-word representations of Out-of-Vocabulary words. In Proceedings of the First Workshop on Insights from Negative Results in NLP, 1-5. Provilkov, I.; Emelianenko, D.; and Voita, E. 2019. BPE-Dropout: Simple and Effective Subword Regularization. arXiv preprint arXiv:1910.13267. URL https://arxiv.org/abs/1910.13267.</p>
      <p>Schuster, M.; and Nakajima, K. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5149-5152. IEEE.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Cai</surname>
            , J.; and Strube,
            <given-names>M.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Evaluation metrics for end-toend coreference resolution systems</article-title>
          .
          <source>In Proceedings of the SIGDIAL 2010 Conference</source>
          ,
          <volume>28</volume>
          -
          <fpage>36</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; Khandelwal, U.; Levy, O.; and Manning, C. D.
          <year>2019</year>
          .
          <article-title>What Does BERT Look At? An Analysis of BERT's Attention</article-title>
          . In BlackBoxNLP@ACL.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Chang, M.-W.;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . Minneapolis, Minnesota: Association for Computational Linguistics. doi:10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Dodge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Ilharco,
          <string-name>
            <given-names>G.</given-names>
            ;
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ;
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Hajishirzi</surname>
          </string-name>
          , H.; and
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping</article-title>
          . arXiv preprint arXiv:
          <year>2002</year>
          .06305 .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Howard</surname>
          </string-name>
          , J.; and
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Universal language model fine-tuning for text classification</article-title>
          . arXiv preprint arXiv:
          <year>1801</year>
          .06146 .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Jawahar</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sagot</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Seddah</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>What Does BERT Learn about the Structure of Language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</article-title>
          ,
          <fpage>3651</fpage>
          -
          <lpage>3657</lpage>
          . Florence, Italy:
          <article-title>Association for Computational Linguistics</article-title>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>P19</fpage>
          - 1356. URL https://www.aclweb.org/anthology/P19-1356.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; Liu,
          <string-name>
            <given-names>Y.</given-names>
            ;
            <surname>Weld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            ;
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ; and
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <surname>O.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>Spanbert: Improving pre-training by representing and predicting spans</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>8</volume>
          :
          <fpage>64</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Kejriwal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2019</year>
          .
          <source>Domain-Specific Knowledge Graph</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Kudo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          <fpage>66</fpage>
          -
          <lpage>71</lpage>
          . Brussels, Belgium: Association for Computational Linguistics. doi:10.18653/v1/D18-2012. URL https://www.aclweb.org/anthology/D18-2012.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>End-to-end Neural Coreference Resolution</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <fpage>188</fpage>
          -
          <lpage>197</lpage>
          . Copenhagen, Denmark: Association for Computational Linguistics. doi:10.18653/v1/D17-1018. URL https://www.aclweb.org/anthology/D17-1018.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Higher-Order Coreference Resolution with Coarse-to-Fine Inference</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>2</volume>
          (
          <issue>Short Papers</issue>
          ),
          <fpage>687</fpage>
          -
          <lpage>692</lpage>
          . New Orleans, Louisiana: Association for Computational Linguistics. doi:10.18653/v1/N18-2108. URL https://www.aclweb.org/anthology/N18-2108.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Z.-Q.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bing</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yan-Zhen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jun-Feng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xuan-Dong</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jun</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hai-Long</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Intelligent development environment and software knowledge graph</article-title>
          .
          <source>Journal of Computer Science and Technology</source>
          :
          <fpage>242</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Luan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ostendorf</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hajishirzi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Moosavi</surname>
            ,
            <given-names>N. S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Strube</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Which Coreference Evaluation Metric Do You Trust? A Proposal for a Link-based Entity Aware Metric</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <fpage>632</fpage>
          -
          <lpage>642</lpage>
          . Berlin, Germany: Association for Computational Linguistics. doi:10.18653/v1/P16-1060. URL https://www.aclweb.org/anthology/P16-1060.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Tenney</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Poliak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>McCoy</surname>
            ,
            <given-names>R. T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Van Durme</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bowman</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; et al.
          <year>2019</year>
          .
          <article-title>What do you learn from context? Probing for sentence structure in contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1905.06316</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>Ł.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>