<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Probing the SpanBERT Architecture to interpret Scientific Domain Adaptation Challenges for Coreference Resolution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hari Timmapathini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anmol Nayak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarathchandra Mandadi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Siva Sangada</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vaibhav Kesri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karthikeyan Ponnalagu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vijendran Venkoparao</string-name>
          <email>GopalanVijendran.Venkoparaog@in.bosch.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ARiSE Labs at Bosch</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2</volume>
      <fpage>30</fpage>
      <lpage>35</lpage>
      <abstract>
        <p>Coreference Resolution is a challenging problem in Natural Language Processing (NLP) that aims at clustering all references of the same entity or event. This requires both syntactic and semantic understanding of the text. A strong coreference resolution model is essential for achieving good performance in several downstream NLP tasks such as Question-Answering and Information Extraction. SpanBERT (Joshi et al. 2020) has achieved state-of-the-art performance in coreference resolution on the OntoNotes dataset (Pradhan et al. 2012). However, it still faces several challenges when performing coreference resolution on documents involving multiple domain-specific entities and events. In this paper we highlight these issues with the SpanBERT-Base pretrained coreference model in scientific domain adaptation. Our detailed experiments are performed on the SciERC scientific abstract dataset (Luan et al. 2018), where we analyse the encoder attention and probe the coarse-to-fine head network to interpret the shortcomings of SpanBERT. This led to interesting findings that showed: 1) while the syntactic behaviour is captured appropriately, the self-attention mechanism in the encoder layers of SpanBERT struggles to capture domain-specific semantic concepts; 2) inferior mention spans are picked in the top mention spans list due to poor mention scores even though better candidate key mention spans exist; and 3) even after increasing the hyperparameter λ from 0.4 to 1 and 2, there is insignificant improvement in both the Nkey∩response count and the response coreference cluster scores across 5 different evaluation metrics.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        BERT
        <xref ref-type="bibr" rid="ref3">(Devlin et al. 2019)</xref>
        has been a breakthrough in language understanding by leveraging the multi-head self-attention mechanism
        <xref ref-type="bibr" rid="ref21">(Vaswani et al. 2017)</xref>
        in its architecture. It is one of the prominent models used for a variety of NLP tasks. With the Masked Language Model (MLM) method, it has been successful at leveraging bidirectionality while training the language model. The SpanBERT-Base model has 12 encoder layers, with each layer consisting of 12 self-attention heads. The word representations are context-dependent 768-dimensional dynamic embeddings. The vocabulary size is 28,996 and contains 101 unused slots. The unused slots in the vocabulary can be used to include domain-specific words; however, the representations of these will have to be fine-tuned on a domain-specific corpus.
      </p>
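      <p>As a concrete illustration of the point above, the sketch below shows one way unused vocabulary slots could be claimed for domain words. The 5-entry toy vocabulary and the register_domain_word helper are hypothetical; only the [unused*] naming convention follows BERT.</p>

```python
# Sketch: repurposing unused vocabulary slots for domain-specific words.
# The slot names mirror BERT's convention ([unused0], [unused1], ...), but this
# tiny vocabulary is illustrative, not the real 28,996-entry vocab.
vocab = {"[PAD]": 0, "[unused0]": 1, "[unused1]": 2, "the": 3, "cruise": 4}

def register_domain_word(vocab, word):
    # Claim the first free [unused*] slot. The new token's embedding would
    # still need fine-tuning on a domain corpus before it is useful.
    slot = next((t for t in vocab if t.startswith("[unused")), None)
    if slot is None:
        raise ValueError("no unused slots left")
    vocab[word] = vocab.pop(slot)
    return vocab[word]

idx = register_domain_word(vocab, "AUTOSAR")
print(idx)  # 1: "AUTOSAR" now occupies the slot that [unused0] held
```

      <p>In practice the same idea applies to the real SpanBERT vocabulary file; only the embedding fine-tuning step mentioned above makes the new entries meaningful.</p>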
      <p>
        While the BERT architecture relies on MLM at word level
and Next Sentence Prediction (NSP) during training,
SpanBERT has changed the learning mechanism to MLM at span
level and uses a Span Boundary Objective (SBO). SBO
predicts a target masked token by using the representations of
the boundary tokens of a given span along with the
positional embedding of the target masked token. This
learning mechanism has enabled SpanBERT to outperform BERT
on almost all tasks with significant improvements. For the
coreference resolution task, SpanBERT leverages an
independent implementation of higher order coarse-to-fine span
ranking architecture
        <xref ref-type="bibr" rid="ref12 ref15 ref17 ref7">(Lee, He, and Zettlemoyer 2018)</xref>
        that
iteratively refines the mentions using an attention mechanism.
      </p>
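      <p>To make the SBO computation concrete, the following toy sketch (our own simplification: random vectors, a ReLU feed-forward, and an 8-dimensional space standing in for SpanBERT's actual GeLU layers and 768 dimensions) scores a masked target token from the two span-boundary representations and the target's position embedding.</p>

```python
# Toy sketch of the Span Boundary Objective (SBO): predict a masked token
# inside a span from the boundary-token representations plus the position
# embedding of the target. Weights and dimensions are illustrative.
import random

random.seed(0)
DIM = 8

def rand_vec(n):
    return [random.uniform(-1, 1) for _ in range(n)]

def feed_forward(x, w1, w2):
    # 2-layer network; ReLU keeps the sketch readable (SpanBERT uses GeLU).
    h = [max(0.0, sum(xi * wi for xi, wi in zip(x, col))) for col in w1]
    return [sum(hi * wi for hi, wi in zip(h, col)) for col in w2]

def sbo_logits(x_before, x_after, pos_emb, w1, w2):
    # SBO input: [left boundary; right boundary; target position embedding]
    return feed_forward(x_before + x_after + pos_emb, w1, w2)

w1 = [rand_vec(3 * DIM) for _ in range(DIM)]  # hidden layer
w2 = [rand_vec(DIM) for _ in range(DIM)]      # output layer (vocab-sized in reality)
logits = sbo_logits(rand_vec(DIM), rand_vec(DIM), rand_vec(DIM), w1, w2)
print(len(logits))  # one score per vocabulary entry (toy vocab of size DIM)
```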
      <p>
        A strong coreference resolution model is essential in domains which describe concepts that require long-range dependencies between mentions, for applications like Question-Answering systems and Information Extraction for Domain Specific Knowledge Graphs
        <xref ref-type="bibr" rid="ref11 ref16">(Lin et al. 2017; Kejriwal 2019)</xref>
        . Scientific domain adaptation within industries is challenging due to the following reasons:
1. Typically there is a lack of sufficient data to fine-tune the language model of such large pre-trained networks.
2. Unavailability of annotated data for task-specific fine-tuning, as it requires a domain expert's understanding to annotate the data correctly to encapsulate the nuances of the domain.
      </p>
      <p>
        We probe the model to analyse 5 different aspects of
the SpanBERT coreference resolution architecture: Encoder
attention, Identification of Mentions, Mention scores,
Antecedent scores and Coreference Clusters. The Newswire
genre of OntoNotes was selected with SpanBERT. MUC,
B3, CEAFm, CEAFe and LEA
        <xref ref-type="bibr" rid="ref18">(Pradhan et al. 2014;
Moosavi and Strube 2016)</xref>
        have been selected as the
coreference evaluation metrics. The experiments are performed
on the SciERC dataset along with motivating example
sentences, that depict the various kinds of sentence
structures typically found in technical documents of AUTOSAR
(http://www.autosar.org/) compliant automotive domain
systems. We discuss these challenges below by analysing
SpanBERT Encoder and Probing the Coarse-to-fine network.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>SpanBERT Coreference Resolution architecture consists of a SpanBERT Transformer Encoder with a Coarse-to-fine head network (Figure 1). The input is tokenized with a BERT variant of the WordPiece algorithm (Schuster and Nakajima 2012) and passed into the encoder to generate contextualized representations for each token. Mention spans are non-overlapping segments from the input text up to a predefined length. The encoder representations are consumed by the coarse-to-fine network and iteratively refined using an attention mechanism to give the span representations g, which are used for computing the following coreference resolution specific scores:
1. Mention score sm(i) for a mention span i, that is used to further prune the mention spans list.
2. Fast antecedent score sc(i, j) between mention span i and candidate antecedent span j, that uses a bi-linear scoring function to pick the top K candidate antecedent spans for each mention.
3. Antecedent distance score sd(i, j) that is computed using 10 semi-log scale buckets.
4. Slow antecedent score sa(i, j) that relies upon the mention span i and candidate antecedent span j representations, the element-wise similarity between i and j, and a feature vector encoding genre information, span distance etc.
5. Coreference resolution score s(i, j) that is used to decide whether candidate antecedent span j is coreferent to mention span i.</p>
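      <p>A minimal sketch of how these scores combine into the final clustering decision, using made-up score values; the dummy antecedent is modelled as the 0.0 baseline that a real antecedent must beat.</p>

```python
# Sketch: combining the partial scores into s(i, j) and applying the
# dummy-antecedent rule (an antecedent is linked only if its total score
# exceeds 0). All numeric values below are illustrative, not model outputs.
def coref_score(sm_i, sm_j, sc_ij, sd_ij, sa_ij):
    # s(i, j) = sm(i) + sm(j) + sc(i, j) + sd(i, j) + sa(i, j)
    return sm_i + sm_j + sc_ij + sd_ij + sa_ij

def pick_antecedent(candidates):
    # candidates: list of (antecedent_span, s(i, j)); the dummy antecedent
    # contributes a fixed score of 0, so None is returned when nothing beats it.
    best, best_score = None, 0.0
    for antecedent, score in candidates:
        if score > best_score:
            best, best_score = antecedent, score
    return best

scores = [("intractable", coref_score(-1.0, -0.5, 2.1, 0.4, 0.7)),  # 1.7
          ("This paper", coref_score(-1.0, -3.0, 0.2, 0.1, 0.4))]   # -3.3
print(pick_antecedent(scores))  # intractable
```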
      <p>Further, the mention spans can be segregated into 3
categories:
• Key spans Mkey, which are the annotated gold standard
spans.
• Top spans Mtop, which are the final pruned set of
candidate mention spans selected by the coarse-to-fine
network.
• Response spans Mresponse, which are the system
generated output spans found in the predicted coreference
clusters. These are a subset of the Top spans.</p>
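      <p>These three categories, and the intersection counts analysed later, can be expressed directly as set operations. The spans below follow the Table 1 sample abstract; the subset relation between response and top spans is made explicit.</p>

```python
# Sketch: mention-span categories as Python sets. Spans are taken from the
# Table 1 example; counts like Nkey∩response fall out of set intersections.
m_key = {"feature-based partial descriptions", "descriptions"}   # gold spans
m_top = {"This paper", "such descriptions", "intractable", "this intractability"}
m_response = {"intractable", "this intractability"}              # clustered output

assert m_response.issubset(m_top)  # response spans are a subset of top spans
n_key_top = len(m_key.intersection(m_top))            # key spans surviving pruning
n_key_response = len(m_key.intersection(m_response))  # key spans in clusters
print(n_key_top, n_key_response)  # 0 0
```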
      <p>
        We evaluated the overall coreference resolution
performance of SpanBERT using 5 standard metrics, each of
which compute the Precision, Recall and F1 scores with
emphasis on different aspects of the coreference clusters
        <xref ref-type="bibr" rid="ref1">(Cai
and Strube 2010)</xref>
        :
• MUC: It is a link-based metric that computes the
minimum number of links between mentions to be inserted or
deleted when mapping a system generated response to a
gold standard key set.
• B3: It is a mention-based metric that computes the overall
Precision and Recall based on the Precision and Recall of
the individual mentions.
• CEAFm: It is a mention-based variant of the CEAF
metric, which indicates the percentage of mentions that are in
the correct entities.
• CEAFe: It is an entity-based variant of the CEAF metric,
which indicates the percentage of correctly recognized
entities.
• LEA: It is a link-based entity-aware metric that considers
how important the entity is and how well it is resolved.
      </p>
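      <p>As an illustration of the link-based view, a minimal MUC scorer (our own sketch of the Vilain et al. formulation, not the official CoNLL scorer) computes recall by partitioning each key entity with the response clusters; precision is the same computation with the roles swapped.</p>

```python
# Sketch of the MUC metric: recall = sum(|K| - |p(K)|) / sum(|K| - 1), where
# p(K) partitions key entity K by the response clusters (unresolved mentions
# become singleton parts). Precision swaps the key/response roles.
def muc_recall(key_clusters, response_clusters):
    num = den = 0
    for k in key_clusters:
        parts, covered = set(), set()
        for r in response_clusters:
            inter = frozenset(k.intersection(r))
            if inter:
                parts.add(inter)
                covered.update(inter)
        parts.update(frozenset([m]) for m in k - covered)
        num += len(k) - len(parts)
        den += len(k) - 1
    return num / den if den else 0.0

key = [{"a", "b", "c"}]   # one gold entity with three mentions
response = [{"a", "b"}]   # system links only two of them
recall = muc_recall(key, response)
precision = muc_recall(response, key)
print(recall, precision)  # 0.5 1.0
```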
      <p>
        We also performed a baseline comparison between the
independent variants of the SpanBERT-Base and
        <xref ref-type="bibr" rid="ref10">BERT-Base (Joshi et al. 2019)</xref>
        pretrained coreference models on the SciERC dataset.
BERT has been shown to learn surface level features in the
early layers, syntactic features in the middle layers and
semantic features in the higher layers
        <xref ref-type="bibr" rid="ref8">(Jawahar, Sagot, and
Seddah 2019)</xref>
        . Coreference resolution relies heavily on
capturing the syntactic behaviour to pick syntactically plausible
mention spans. BERT has been previously shown to capture
strong syntactic representations
        <xref ref-type="bibr" rid="ref19">(Tenney et al. 2019)</xref>
        .
      </p>
      <p>We found that across the SciERC scientific abstracts, most
of the top spans selected by SpanBERT had the correct
boundaries. This strong syntactic understanding in
SpanBERT can be attributed to the SBO technique it utilizes
during training. While the SpanBERT training objectives
have improved the span boundaries, domain specific
semantic concepts are significantly more difficult to learn due to
the following reasons:
1. Events typically involve multiple entities interacting
under certain conditions.
2. Long range dependencies between coreferent mentions as
sentences tend to build upon concepts previously
mentioned.</p>
      <p>To see how SpanBERT handles this, we analyse the self-attention in the encoder layers between two sets of mention spans for each abstract in the SciERC dataset:
• Set 1: Pairwise attention scores amongst spans in Mkey ∩ Mresponse and Mkey − (Mkey ∩ Mresponse).
• Set 2: Pairwise attention scores between spans in Mkey ∩ Mresponse.</p>
      <p>A sample output for the different categories of mention
spans and clusters for an abstract from the SciERC
coreference resolution dataset can be seen in Table 1. For each
encoder layer, we extract the pairwise attention scores to
observe the difference in attention given by a clustered key
span to a co-occurring clustered key span in comparison to
a non-clustered key span. Across the 12 layers we observed
that the attention scores in Set 1 and Set 2 were extremely
small. While we observed that the dominant heads (shades
of yellow and green in Figure 2) in both Set 1 and Set 2 tend
to be the same, on average each pairwise attention score for
these heads was found to be less than 0.01, which is less than
1% of the total attention mass for the abstract. As the
attention scores are computed from the Key and Query vectors
of a given word, these extremely low attention scores reflect
the weak semantic representations of the spans.</p>
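      <p>The pairwise attention analysis above can be sketched as follows; the 4-D nested-list attention tensor is a stand-in for the per-layer, per-head attention matrices a real encoder would return, and the spans are arbitrary token-index lists.</p>

```python
# Sketch of the Set-1/Set-2 analysis: average the attention a span's tokens
# give to another span's tokens, per layer and head.
def span_pair_attention(attn, span_i, span_j):
    # attn: [layer][head][query_token][key_token]; spans are token-index lists
    scores = []
    for layer in attn:
        per_head = []
        for head in layer:
            vals = [head[q][k] for q in span_i for k in span_j]
            per_head.append(sum(vals) / len(vals))
        scores.append(per_head)
    return scores  # [layer][head] mean pairwise attention

# Toy input: 1 layer, 2 heads, 4 tokens, uniform attention of 0.25 everywhere.
attn = [[[[0.25] * 4 for _ in range(4)] for _ in range(2)]]
print(span_pair_attention(attn, [0, 1], [2, 3]))  # [[0.25, 0.25]]
```

      <p>In our experiments this mean, computed for the dominant heads, stayed below 0.01 for both sets.</p>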
      <p>Further, this also highlights that no specific head across the 12 encoder layers exhibits strong coreference behaviour in the case of scientific domain abstracts. Previous work on BERT showed that different heads of each layer attend to specific linguistic behaviours like coreference, syntax and delimiter tokens (Clark et al. 2019). This semantic loss leads to cascading problems in the coarse-to-fine network through the Fast and Slow antecedent score computations. The weak semantic representations have also led to fewer key mention spans being picked up as candidates to be clustered (Table 2). This shows that the self-attention mechanism in the encoder layers of SpanBERT struggles to capture scientific domain-specific semantic concepts.</p>
    </sec>
    <sec id="sec-3">
      <title>Probing the Coarse-to-fine network</title>
      <p>SpanBERT uses a coarse-to-fine architecture in the head network to perform coreference resolution. For a given sentence, the network first generates the mention scores for all possible candidate mentions. It then picks the top M = min(3900, λT) non-crossing mentions based on the mention scores, where T is the number of words in the tokenized sentence and λ is a configurable parameter that decides the number of spans per word; λ is set to 0.4 (default) in SpanBERT coreference resolution.</p>
      <p>Table 1 (sample output for the different categories of mention spans and clusters for the SciERC abstract C90-3007): "This paper examines the properties of feature-based partial descriptions built on top of Halliday's systemic networks. We show that the crucial operation of consistency checking for such descriptions is NP-complete, and therefore probably intractable, but proceed to develop algorithms which can sometimes alleviate the unpleasant consequences of this intractability." Mkey: [feature-based partial descriptions; descriptions]. Mtop: [This paper; feature-based partial descriptions built on top of Halliday's systemic networks; such descriptions; intractable; this intractability; ...]. Mresponse: [intractable; this intractability]. Mkey∩top: []. Mkey∩response: []. Key clusters: [feature-based partial descriptions; descriptions]. Response clusters: [intractable; this intractability].</p>
      <p>We conducted our experiments with λ = 0.4, 1 and 2 to make sure that the limited size of the top span list is not a reason for key mentions to be discarded. It should be noted that while λ = 1 and λ = 2 may increase the number of key mention spans in the top span list, this comes at a performance cost, as can be seen in Table 2: Ntop (λ = 2) is far larger than Ntop (λ = 0.4).</p>
      <p>For each of the top M mentions, the top K = min(50, λT) antecedents are picked from the top mention span list based on the score sm(i) + sm(j) + sc(i, j) + sd(i, j), where sm(i) is the mention score of mention span i, sm(j) is the mention score of antecedent span j, sc(i, j) is the fast antecedent score between spans i and j, and sd(i, j) is the antecedent distance score introduced in the coarse-to-fine implementation of SpanBERT.</p>
      <p>From this pruned set of antecedents, the final coreference score s(i, j) = sm(i) + sm(j) + sc(i, j) + sd(i, j) + sa(i, j) is calculated between each mention and each of its top antecedents, where sa(i, j) is the slow antecedent score. The top scoring antecedent j is then picked as coreferent to mention i if s(i, j) &gt; 0. Only antecedents that result in a positive coreference score are picked, since a dummy antecedent, whose coreference score with every mention is 0, is introduced before the softmax layer.</p>
      <p>Table 5 (automotive domain motivating example sentences; bracketed indices mark the annotated mentions): 1. When cruise control button is pressed for 2 seconds cruise control is activated[1]. After this[2] happens, the speed is maintained. 2. After this condition[3] is satisfied, cruise control will be activated: Cruise control button is pressed for 2 seconds[4]. 3. When the cruise control button is pressed for 2 seconds[5], then[6] cruise control is activated. 4. Adaptive Cruise control[7], commonly known as Cruise control[8], is a speed maintaining feature that is often found in high-end cars. 5. Cruise control[9] is a speed maintain feature. When the car is cruising[10], a beep is triggered every 5 minutes. 6. When the minimum speed threshold[11] of Cruise control[12] is reached, the cruise activation lamp turns green to signify cruise control activation is available. 7. Cruise control is usually available in high-end cars[13]. Such vehicles[14] are typically 30% costlier than mid-end cars. 8. When the vehicle speed[15] is above 60kmph[16], cruise control is activated.</p>
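      <p>The λ-controlled pruning step described above can be sketched as follows; the candidate spans and mention scores are illustrative (they echo the first motivating example, where an irrelevant crossing mention outscored the expected antecedent).</p>

```python
# Sketch: keep the top min(3900, λ·T) candidate spans by mention score.
# Span texts and scores below are toy values, not actual model outputs.
def prune_spans(scored_spans, T, lam=0.4, cap=3900):
    k = min(cap, int(lam * T))
    ranked = sorted(scored_spans, key=lambda x: x[1], reverse=True)
    return ranked[:k]

spans = [("cruise control", -29.0),
         ("is activated. After this happens", -29.048),  # irrelevant crossing span
         ("cruise control is activated", -30.980),       # expected antecedent
         ("the speed", -31.5)]
top = prune_spans(spans, T=5)  # λ = 0.4 keeps only 2 of the 4 spans
print([s for s, _ in top])  # ['cruise control', 'is activated. After this happens']
```

      <p>Raising λ enlarges the surviving list but, as discussed below, does not by itself promote the expected spans above better-scoring crossing mentions.</p>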
      <sec id="sec-3-4">
        <title>SpanBERT performance on the SciERC dataset</title>
        <p>The SciERC dataset consists of 500 annotated scientific domain abstracts. The total number of key mention spans was 2686. We probed the coarse-to-fine head network to analyse two aspects of the SpanBERT coreference resolution architecture:
1. Qualitative and Quantitative measures of the Mention Spans (Table 2): Picking the top mention spans is the first important task for the head network. We observed that for λ = 0.4 and λ = 1, the recall of key mention spans is around 30% and 40% respectively. The recall increased to around 82% in the case of λ = 2. However, that was only possible because 126395 top spans had to be picked, which is extremely large. The precision of the top spans was found to be extremely low for all the values of λ. We then checked the number of key mention spans that were part of the response clusters (Nkey∩response). In this case, the numbers turned out to be roughly the same for all the values of λ. This clearly indicated that while increasing the value of λ increases the chances of a larger number of key mention spans being part of the top spans list, it does not guarantee an improvement in the number of key mentions becoming part of the response clusters. Across all the values of λ, the Precision, Recall and F1 scores for the identification of mentions were found to be roughly 10%, 14% and 11% respectively. We believe that these low values are due to the weak SpanBERT representations for the mention spans found in the scientific domain abstracts, which the coarse-to-fine head network finds difficult to recover from.
2. Overall coreference resolution performance (Table 3): We evaluated the SpanBERT coreference resolution performance using 5 different metrics, each of which targets different aspects of the coreference clusters. Another indication that increasing λ did not significantly improve the coreference resolution was that the Precision, Recall and F1 scores for coreference resolution were roughly the same, being around 6%, 9% and 7% respectively.</p>
        <p>The low scores appearing consistently both in
Identification of Mentions and Overall coreference resolution across a
large number of abstracts clearly indicates the difficulty that
SpanBERT faces while adapting to the scientific domain
corpus coreference resolution task. We also observed a similar
performance in both Identification of Mentions (Table 2) and
Overall coreference resolution (Table 4) with BERT-Base.</p>
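        <p>For reference, the mention-identification scores in Table 2 follow the usual exact-match computation, sketched here on toy key (gold) and response (system) span sets:</p>

```python
# Sketch: exact-match mention identification Precision/Recall/F1 over toy
# key and response span sets.
def prf(key_spans, response_spans):
    tp = len(key_spans.intersection(response_spans))  # exact-boundary matches
    p = tp / len(response_spans) if response_spans else 0.0
    r = tp / len(key_spans) if key_spans else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

key = {"feature-based partial descriptions", "descriptions", "consistency checking"}
response = {"descriptions", "This paper"}
p, r, f1 = prf(key, response)
print(p, round(r, 3), round(f1, 3))  # 0.5 0.333 0.4
```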
      </sec>
      <sec id="sec-3-5">
        <title>SpanBERT performance on the Automotive domain motivating example sentences</title>
        <p>To get more granular insights into the coarse-to-fine network, we further probed the head network on the automotive domain motivating example sentences (Table 5) to extract the Mention scores, Fast Antecedent scores, Slow Antecedent scores, Antecedent distance scores and Final Coreference scores. SpanBERT did not give a valid coreference cluster for any of the motivating example sentences (Table 6). In the first motivating example sentence, a cluster was found between this and activated; however, it was still not the expected cluster. For the mentions which were not picked as top spans, the sc(i, j), sa(i, j), sd(i, j) and s(i, j) scores cannot be computed. We observed that:
• Due to the limit on the number of top mentions that can be picked, many expected mentions were eliminated due to a lower mention score. This happened in 5 different motivating example sentences, each of which had a different sentence structure.
• Even after increasing λ to 1 and 2, the expected antecedents were eliminated from the top span list by another irrelevant crossing mention that had a better mention score.</p>
        <p>For example, in the first motivating example sentence, the expected antecedent span "cruise control is activated", with a mention score of -30.980, was not picked as a top span, since a better scoring but irrelevant crossing mention, "is activated. After this happens", received a mention score of -29.048.
• These different scores provide insights into the reasons behind certain clusters not being formed by the network.</p>
        <p>We believe that probing the coarse-to-fine network
reveals the underlying issue of the mention spans having weak
semantic representations. Stronger semantic representations
would lead to better mention scores for the expected mention
spans, thereby ranking them higher to be selected as a top
mention. This would also positively impact the antecedent
scores as they rely heavily upon the mention and antecedent
representations.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>
        We presented an analysis of the challenges faced by SpanBERT Coreference Resolution in tackling a scientific domain corpus. We performed detailed experiments analysing the attention mechanism in the SpanBERT encoder layers along with probing the coarse-to-fine head network to understand how well the syntactic and semantic behaviours are being captured. Our findings show that while SpanBERT has a strong syntactic understanding, its semantic understanding of scientific domain documents is weak, which further leads to cascading problems for the coreference resolution task. We believe that some of the directions which could improve the scientific domain adaptation of SpanBERT are:
1. As SpanBERT relies on the BERT variant of the WordPiece algorithm to tokenize an input text, which has previously been shown to give poorer performance in the case of Out-of-Vocabulary (OOV) words (Nayak et al. 2020), a frequency- or likelihood-based tokenization algorithm such as BPE-Dropout (Provilkov, Emelianenko, and Voita 2019) or SentencePiece
        <xref ref-type="bibr" rid="ref12 ref7">(Kudo and Richardson 2018)</xref>
        could lead to better sub-word choices and thereby better semantic representations for OOV words.
2. In the case where sufficient data exists to fine-tune the language model of SpanBERT, care should be taken to ensure that task-specific catastrophic forgetting is avoided by leveraging advanced fine-tuning techniques
        <xref ref-type="bibr" rid="ref12 ref6 ref7">(Dodge et al.
2020; Howard and Ruder 2018)</xref>
        .
      </p>
      <p>Nayak, A.; Timmapathini, H.; Ponnalagu, K.; and Venkoparao, V. G. 2020. Domain adaptation challenges of BERT in tokenization and sub-word representations of Out-of-Vocabulary words. In Proceedings of the First Workshop on Insights from Negative Results in NLP, 1-5. Provilkov, I.; Emelianenko, D.; and Voita, E. 2019. BPE-Dropout: Simple and Effective Subword Regularization. arXiv preprint arXiv:1910.13267. URL https://arxiv.org/abs/1910.13267.</p>
      <p>Schuster, M.; and Nakajima, K. 2012. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5149-5152. IEEE.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Cai</surname>
            , J.; and Strube,
            <given-names>M.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Evaluation metrics for end-toend coreference resolution systems</article-title>
          .
          <source>In Proceedings of the SIGDIAL 2010 Conference</source>
          ,
          <volume>28</volume>
          -
          <fpage>36</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; Khandelwal, U.; Levy, O.; and Manning, C. D.
          <year>2019</year>
          .
          <article-title>What Does BERT Look At? An Analysis of BERT's Attention</article-title>
          . In BlackBoxNLP@ACL.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Chang, M.-W.;
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . Minneapolis, Minnesota: Association for Computational Linguistics. doi:10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Dodge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Ilharco,
          <string-name>
            <given-names>G.</given-names>
            ;
            <surname>Schwartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            ;
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Hajishirzi</surname>
          </string-name>
          , H.; and
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping</article-title>
          . arXiv preprint arXiv:
          <year>2002</year>
          .06305 .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Howard</surname>
          </string-name>
          , J.; and
          <string-name>
            <surname>Ruder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Universal language model fine-tuning for text classification</article-title>
          . arXiv preprint arXiv:
          <year>1801</year>
          .06146 .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Jawahar</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sagot</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Seddah</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>What Does BERT Learn about the Structure of Language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</article-title>
          ,
          <fpage>3651</fpage>
          -
          <lpage>3657</lpage>
          . Florence, Italy:
          <article-title>Association for Computational Linguistics</article-title>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>P19</fpage>
          - 1356. URL https://www.aclweb.org/anthology/P19-1356.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; Liu,
          <string-name>
            <given-names>Y.</given-names>
            ;
            <surname>Weld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            ;
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ; and
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <surname>O.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>Spanbert: Improving pre-training by representing and predicting spans</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>8</volume>
          :
          <fpage>64</fpage>
          -
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Kejriwal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2019</year>
          .
          <source>Domain-Specific Knowledge Graph</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Kudo</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Richardson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing</article-title>
          .
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>
          ,
          <fpage>66</fpage>
          -
          <lpage>71</lpage>
          . Brussels, Belgium: Association for Computational Linguistics. doi:10.18653/v1/D18-2012. URL https://www.aclweb.org/anthology/D18-2012.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>End-to-end Neural Coreference Resolution</article-title>
          .
          <source>In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <fpage>188</fpage>
          -
          <lpage>197</lpage>
          . Copenhagen, Denmark: Association for Computational Linguistics. doi:10.18653/v1/D17-1018. URL https://www.aclweb.org/anthology/D17-1018.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Higher-Order Coreference Resolution with Coarse-to-Fine Inference</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>2</volume>
          (
          <issue>Short Papers</issue>
          ),
          <fpage>687</fpage>
          -
          <lpage>692</lpage>
          . New Orleans, Louisiana: Association for Computational Linguistics. doi:10.18653/v1/N18-2108. URL https://www.aclweb.org/anthology/N18-2108.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Z.-Q.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bing</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yan-Zhen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jun-Feng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xuan-Dong</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jun</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hai-Long</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Gang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Intelligent development environment and software knowledge graph</article-title>
          .
          <source>Journal of Computer Science and Technology</source>
          :
          <fpage>242</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Luan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ostendorf</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Hajishirzi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Moosavi</surname>
            ,
            <given-names>N. S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Strube</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Which Coreference Evaluation Metric Do You Trust? A Proposal for a Link-based Entity Aware Metric</article-title>
          .
          <source>In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <fpage>632</fpage>
          -
          <lpage>642</lpage>
          . Berlin, Germany: Association for Computational Linguistics. doi:10.18653/v1/P16-1060. URL https://www.aclweb.org/anthology/P16-1060.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Tenney</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Xia</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Poliak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>McCoy</surname>
            ,
            <given-names>R. T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Van Durme</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bowman</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; et al.
          <year>2019</year>
          .
          <article-title>What do you learn from context? Probing for sentence structure in contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1905.06316</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Vaswani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Shazeer</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Parmar</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Uszkoreit</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>Ł.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Polosukhin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          ,
          <fpage>5998</fpage>
          -
          <lpage>6008</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>