=Paper=
{{Paper
|id=Vol-3277/paper2
|storemode=property
|title=On the Behaviour of BERT's Attention for the Classification of Medical Reports
|pdfUrl=https://ceur-ws.org/Vol-3277/paper2.pdf
|volume=Vol-3277
|authors=Luca Putelli,Alfonso Emilio Gerevini,Alberto Lavelli,Tahir Mehmood,Ivan Serina
|dblpUrl=https://dblp.org/rec/conf/aiia/PutelliGLMS22
}}
==On the Behaviour of BERT's Attention for the Classification of Medical Reports==
Luca Putelli1, Alfonso E. Gerevini1, Alberto Lavelli2, Tahir Mehmood1 and Ivan Serina1
1 Università degli Studi di Brescia, Brescia, Italy
2 Fondazione Bruno Kessler, Povo (TN), Italy
XAI.it 2022 - Italian Workshop on Explainable Artificial Intelligence

Abstract

Since BERT and the other Transformer-based models have proved successful in many NLP tasks, several studies have been conducted to understand why these complex deep learning architectures are able to reach such remarkable results. Such studies focus on visualising and analysing the behaviour of each self-attention mechanism and are often conducted on large, generic and annotated datasets for the English language, using supervised probing tasks in order to test specific linguistic capabilities. However, several practical contexts present some difficulties: probing tasks may not be available, the documents can contain a strictly technical lexicon, and the datasets can be noisy. In this work we analyse the behaviour of BERT in a specific context, i.e. the classification of radiology reports collected from an Italian hospital. We propose (i) a simplified way to classify head patterns without relying on probing tasks or manual observations, and (ii) an algorithm for extracting the most relevant relations among words captured by each self-attention head. Combining these techniques with manual observations, we present several examples of linguistic information that can be extracted from BERT in our application.

1. Introduction

Language models based on Transformer [1], like BERT (Bidirectional Encoder Representations from Transformers) [2], have obtained remarkable results in Natural Language Processing (NLP) tasks such as machine translation and text classification, reaching a new state of the art. Intuitively, these models are composed of several encoders that progressively learn information about words and how they are related to each other. Encoders contain (among other components) several parallel self-attention mechanisms called heads. For each word, a head calculates a probability distribution representing how much this word is related to every other word contained in the document.

The results and the complexity of BERT have led several research groups [3] to study how these language models capture the structure of the language [4], grammatical knowledge [5, 6, 7] or task-specific information [8]. These analyses focus either on the embedded representation provided by each encoder [9] or on the self-attention mechanism in each head [10, 11, 12]. Similarly to other deep learning techniques such as LSTM neural networks [13], BERT has been proven effective for extracting information from clinical narrative texts [14, 15]. Therefore, this technology could improve the efficacy and quality of patient care. However, in this environment it is very important to assure the physicians that the system is correct and to expose the reasoning behind its decisions [16].
Moreover, according to works like [13, 17, 18], the attention mechanism can be used for highlighting the most important sections in a document and exploiting them for interpretability: the weights assigned by the attention mechanism can simply be extracted and read as an indicator of the importance of each word for the predictive task. However, the self-attention used in BERT assigns a weight representing how much two words are related to each other. Moreover, while in LSTM-based models there is usually a single attention mechanism, in BERT there are more than 100 heads to consider. Therefore, deriving a single and straightforward indication of the importance of words for the classification task is definitely more challenging.

Furthermore, the studies in [10, 12] show that heads can be grouped according to a few distinct patterns; for instance, there are heads that always connect a word with the previous one, or heads that distribute the attention weights across several words. This grouping can be done by inspecting the heads manually [11] or with clustering techniques [19]. Both alternatives present some issues: manual inspection requires human intervention, while with clustering the results vary depending on which algorithm is selected and on its hyper-parameters. These works usually show and verify the head behaviour on benchmark datasets in English, and they use probing tasks, i.e., supervised classification tasks that focus on the capability of the self-attention weights to encode linguistic knowledge, without explicitly extracting the connections between the words with the highest weights. This is not a trivial task, given the differences among the heads.

In this work, we apply BERT to the classification of radiology reports written in Italian and collected from the radiology department of Spedali Civili di Brescia. We then analyse the behaviour of BERT's attention, presenting a schematic grouping process for the heads that does not require human inspection or clustering algorithms. Moreover, we propose an algorithm for extracting the most important word pairs according to the self-attention weights provided by each head. We then verify how these procedures can be exploited in our context for extracting useful information, and how they relate to the interpretability of BERT.

2. Background and Related work

2.1. BERT

BERT [2] is an architecture based on Transformer [1] composed of several encoding layers which progressively analyse a sequence of tokens (i.e., words or parts of a word) in order to capture their meaning. Each layer applies multiple self-attention mechanisms (called heads) in parallel. Considering a sequence of tokens S of length N, this mechanism produces a matrix A_{i,j} ∈ ℝ^{N×N}, where i is the number of the encoding layer and j is the head number. For each token w ∈ S, the vector a_w ∈ A_{i,j} contains the attention weights that represent how much w is related to the other tokens in S. In order to calculate these weights, in each head the input representation of the token sequence X ∈ ℝ^{N×d} is projected into three new representations called key (K), query (Q) and value (V) with three matrices W_k, W_q and W_v:

$$K = X \times W_k, \quad Q = X \times W_q, \quad V = X \times W_v \qquad (1)$$

Then, the attention weights are calculated using a scaled dot-product between Q and K and applying the softmax function. The new token representation Z is obtained by multiplying the attention weights by V:

$$A = \mathrm{softmax}\!\left(\frac{Q \times K^{T}}{\sqrt{d}}\right), \quad Z = A \times V \qquad (2)$$

where d is the length of the input representation of each token.
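As an illustration of Equations 1 and 2, the following NumPy sketch computes the attention matrix A and the new representation Z for a single head; the dimensions and the random projection matrices are placeholders, not values taken from the actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the last axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(X, W_q, W_k, W_v):
    """Scaled dot-product attention for one head (Equations 1 and 2).

    X: (N, d) input token representations.
    Returns A (N, N) attention weights and Z (N, d_head) new representations.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d = X.shape[-1]                     # length of the input representation
    A = softmax(Q @ K.T / np.sqrt(d))   # each row of A is a probability distribution
    Z = A @ V
    return A, Z

# Toy example: N=5 tokens, d=8 dimensions, head size 4 (placeholder values).
rng = np.random.default_rng(0)
N, d, d_head = 5, 8, 4
X = rng.normal(size=(N, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_head)) for _ in range(3))
A, Z = self_attention_head(X, W_q, W_k, W_v)
print(A.shape, A.sum(axis=1))           # (5, 5), rows sum to 1
```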
Given that each encoding layer contains multiple heads, the representation provided by the multi-head attention mechanism is obtained by concatenating the result of each head and feeding it to a feed-forward layer. As described in [1], the multi-head attention mechanism is followed by a feed-forward layer and residual connections. The output of an encoding layer is the input of the next one.

Exploiting a large collection of documents, BERT is trained on two tasks: language modeling, where BERT learns to predict a percentage (usually 15%) of masked tokens from their context, and next sentence prediction, a binary classification task where BERT has to predict whether a sequence of two sentences is correct or not. For the latter task, BERT introduces two special tokens: [CLS], whose representation is used for the binary classification task and represents the whole sequence, and [SEP], which separates the two sentences. Learning these two tasks allows BERT to create a meaningful representation of each token and also to summarise the most important information in a sentence. Once the model is trained, it can be adapted using smaller datasets to specific NLP tasks like Named Entity Recognition, text classification, sentiment analysis, etc.

2.2. Related work

In the last few years, several studies have been conducted in order to understand the reasons behind BERT's success and which linguistic and world knowledge is stored in a BERT model [3]. We can group these studies into two main categories, depending on whether they analyse the embedding representations or the heads' behaviour. In the former category, the works in [6, 20] show how linguistic information (a total of 68 features such as part-of-speech tags, verb inflection, depth of the dependency tree, etc.) is encoded in the BERT embeddings produced by each layer. Testing semantic roles, semantic dependencies (such as coreference between nouns and pronouns), entities and relations as in the classical NLP pipeline is the focus of [7]. Similar tests were conducted in [5] and [4], with a particular focus on subject-verb agreement. A common aspect of all these works is that they require datasets annotated with the linguistic phenomena examined by their probing tasks. Unfortunately, in the clinical domain it is much more difficult to find this kind of annotation [21], and there are many challenges related to the quality of the text, with abbreviations, typos and ungrammatical language [16]. Therefore the use of probing tasks is quite limited, especially for the Italian language.

The latter category regards the analysis of the heads' behaviour. Since the introduction of the first visualization tools, like BertViz [22], it has been possible to note that heads behave according to some recognizable patterns. The work in [10] presents some interesting results, manually selecting heads that give attention broadly, to the next token, or to [SEP]. The authors use probing tasks in order to show that certain heads target specific linguistic information, such as coreference, direct objects, relations between possessive pronouns and nouns, etc. In [11], similar patterns were presented and other probing tasks were executed; for instance, given a pair of tokens with a specific linguistic relation, the authors detect which heads assign a high weight to such a pair.
Our work differs from these because we propose a way of grouping heads according to their pattern with a quantitative approach. Moreover, while these works use probing tasks in order to find meaningful relations between pairs of tokens without explicitly extracting them, we propose an algorithm specifically designed for this extraction, simplifying the subsequent analyses. In [19], the authors exploit clustering algorithms to automatically group heads into categories, and discuss their importance. A drawback of this approach is that clustering algorithms can produce very different results according to their implementation, their hyper-parameters and the number of selected clusters.

Evaluating the interpretability of deep learning systems for NLP, in terms of highlighting the most important snippets in order to justify the model output, is a very active research field [23]. While several studies have focused on whether or not the attention weights can provide insight into the reasoning of the model [24, 25, 26], the research regarding BERT is still ongoing. In [27], the embedding representations are analysed in the context of a Question Answering task. In [8], the authors show that the words receiving most of the attention belong to the specific lexicon of the document subject. We performed the same analysis in our context but with no results; in our opinion, this may depend on the fact that our radiology reports mostly share the same lexicon, with only small differences.

3. Methodology

In this section, we describe the techniques used for identifying the heads' patterns and for extracting the most relevant relations between words, according to the distribution of the attention weights. Our goal is to find interesting information encoded in the attention weights of each head. However, given that this information is not labelled in our corpus of reports, we first want to assess the behaviour of each head, potentially selecting the most promising ones. Therefore, we first propose a simple method for identifying the behaviour of each head and grouping heads accordingly. Next, we propose an algorithm for extracting the relations between the word pairs with the highest weights, which works regardless of the differences among the heads' behaviours.

3.1. Metrics for the Head Grouping

Given a document made of N tokens, as described in Section 2.1, for the head (i, j), where i is the number of the encoding layer and j is the head number in its multi-head self-attention mechanism, we call A_{i,j} ∈ ℝ^{N×N} the matrix of the attention weights produced by (i, j). For each token w, A_{i,j} contains a vector a_w ∈ ℝ^N which is a probability distribution representing its connections with all N tokens (itself included).

As reported in [10], when the attention is only on the special token [SEP], which is not used in the classification process but only for marking the end of the document, it can be seen as a null operation, or no-op. Therefore, we first want to evaluate how close a_w is to a no-op. Ideally, if all the attention is directed to [SEP], the probability distribution of the weights is a one-hot vector where the 1 is at the index of [SEP]; in Equation 3, we refer to it as O. In order to calculate how much a_w focuses on tokens different from [SEP], we calculate the No-Op Metric ν as:

$$\nu_w = JSD(a_w \,\|\, O), \quad O[k] = \begin{cases} 1 & \text{if the } k\text{-th token is } [SEP] \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where JSD is the Jensen-Shannon Divergence, a standard statistical method for evaluating the similarity between two probability distributions. The Jensen-Shannon Divergence is bounded between 0 and 1: it is 0 when the two distributions are identical and 1 when they are completely different. Note that this metric is not meant to evaluate the behaviour of the entire head; it is designed for a single token. We describe a token w as operative if ν_w > 0.5, otherwise we call it not operative. Using this metric, we can introduce a first categorization of the heads. As reported in [10], if a head executes a specific function, such as connecting a verb with its direct object, then the tokens to which the function cannot be applied are usually connected to [SEP]. Therefore, we specify two categories of head patterns: General, if more than 50% of the tokens are operative, and Mixed otherwise.
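A minimal sketch of the No-Op metric and of the General/Mixed split is given below, assuming SciPy is available; note that scipy.spatial.distance.jensenshannon returns the square root of the divergence, so it is squared here, and base 2 keeps the value in [0, 1]. The function names are illustrative, not taken from our code.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def no_op_metric(a_w, sep_index):
    """Equation 3: JSD between the attention distribution of a token
    and a one-hot distribution concentrated on [SEP]."""
    O = np.zeros_like(a_w)
    O[sep_index] = 1.0
    # scipy returns the JS *distance* (square root of the divergence);
    # squaring it and using base 2 gives a divergence bounded in [0, 1].
    return jensenshannon(a_w, O, base=2) ** 2

def general_or_mixed(A, sep_index):
    """A head is General if more than 50% of its tokens are operative
    (nu > 0.5), Mixed otherwise. A is the N x N attention matrix of the head."""
    nu = np.array([no_op_metric(a_w, sep_index) for a_w in A])
    operative = nu > 0.5
    return ("General" if operative.mean() > 0.5 else "Mixed"), operative
```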
Considering the General heads, if the attention weights are distributed uniformly across the tokens, then no particular information can be extracted from them. Therefore, we evaluate how similar a_w is to a standard uniform distribution, using the Focus Metric ε, which we define as:

$$\epsilon_w = JSD(a_w \,\|\, U), \quad U[k] = \frac{1}{N} \;\; \forall k \in [1, N] \qquad (4)$$

Moreover, in several examples we have observed a small number of heads where most tokens basically give high attention to themselves. Given their peculiar behaviour, we design a specific metric for identifying them: the Self Metric σ. We evaluate the difference between a_w and a one-hot vector where the 1 is in the same position as w in the document. More formally,

$$\sigma_w = JSD(a_w \,\|\, S_w), \quad S_w[k] = \begin{cases} 1 & \text{if } k \text{ is the position of } w \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

In order to capture the behaviour of each head, we calculate the average of ε and σ across all the tokens in a document. Observing how the pattern varies with the average of ε, we can group the General heads into four sub-categories:

• Broadcast, with an average ε lower than 0.4. These heads distribute their attention broadly across all tokens, with no particular criteria. This group resembles the Dense cluster described in [19];
• Offset, with an average ε higher than 0.7. These heads usually focus their attention on the previous/subsequent tokens without specific linguistic patterns. This group strongly resembles the Diagonal group observed in [11];
• Local, with an average ε between 0.4 and 0.7, and σ below 0.6. These heads mostly give attention to other tokens in the same sentence, with variable distance and behaviour depending on the analysed token. While some of these heads can be associated with the Block group defined in [11] and the Dense&Vertical group defined in [19], a strong correlation with other groups has not been observed;
• Not Local, with an average ε between 0.4 and 0.7, and σ higher than 0.6. These heads give attention mostly to the token itself, to other occurrences of the same word or to other similar words, regardless of whether they are in the same sentence or not. To the best of our knowledge, this pattern has not been described in the literature.

Figure 1: Diagram explaining the process of head categorisation. ε̄ stands for the average ε across all tokens, while σ̄ stands for the average σ.

A more detailed description, with examples and numerical results taken from our case study, can be seen in Section 5.2.
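The Focus and Self metrics and the threshold-based grouping of Figure 1 can be sketched along the same lines; the thresholds 0.4, 0.7 and 0.6 are those reported above, while the helper names are again illustrative.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd(p, q):
    # Jensen-Shannon divergence in base 2, bounded in [0, 1].
    return jensenshannon(p, q, base=2) ** 2

def focus_metric(a_w):
    """Equation 4: divergence from the uniform distribution."""
    N = len(a_w)
    return jsd(a_w, np.full(N, 1.0 / N))

def self_metric(a_w, w_index):
    """Equation 5: divergence from a one-hot distribution on the token itself."""
    S = np.zeros_like(a_w)
    S[w_index] = 1.0
    return jsd(a_w, S)

def categorize_general_head(A):
    """Sub-categorisation of a General head, following the flow of Figure 1.
    A is the N x N attention matrix produced by the head for one document."""
    eps = np.mean([focus_metric(a_w) for a_w in A])
    sigma = np.mean([self_metric(a_w, i) for i, a_w in enumerate(A)])
    if eps < 0.4:
        return "Broadcast"
    if eps > 0.7:
        return "Offset"
    return "Local" if sigma < 0.6 else "Not Local"
```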
For the Mixed heads, given that the majority of the tokens are connected to [SEP], discovering which function they are trying to implement is a much more difficult process without probing tasks [10], and a general classification has not been proposed yet. For instance, the works in [19] and [11] group them into a category called Vertical, which basically highlights that the majority of tokens are connected to [SEP], with no further analysis.

In Figure 1, we show the main flow of our categorisation process. Although our metrics are calculated for a single document, we have observed no important differences in the behaviour of a head across different documents. An important note about the thresholds used for the described metrics is that they have been set with a bottom-up approach, observing the behaviour of different heads on several documents in our corpus and deriving a general rule. Although these values are apt for our context and show a remarkable resemblance to other grouping techniques applied in different contexts [10, 11, 19], our method could require a different setting if other BERT models, languages or datasets are considered. We are currently studying a method for recognising these thresholds automatically with an unsupervised approach.

3.2. Mean Shift Linker Algorithm

In [10, 12] the connections between pairs of tokens are simply shown with visualization techniques, with lines of different thickness based on the attention weights. However, for relatively long documents of 400 or 500 words the number of connections increases drastically, making the visualization less understandable and very complex to compute. Therefore, our approach is to directly extract the most important connections among tokens from a specific head. This can help the understanding of the function performed by the head, and simplify the visualization process.

Our algorithm for automatically finding these connections is based on Mean Shift [28], a clustering algorithm suitable for density functions and one-dimensional clustering [29]. Given a distribution of attention weights a_w = [α_1, α_2, ..., α_N] for a token w, the different α_i ∈ a_w are grouped into several clusters depending on their value. In our system, we used the implementation provided by the standard machine learning library Scikit-Learn [30] (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html). Since our clusters are composed of just a few tokens (even just one) and all the tokens need to be analysed, we set the minimum bin frequency to 1 and the cluster_all parameter to True. All the other hyper-parameters were set to their default values.

As we introduced in Section 3.1, one of the most important differences among the head patterns is how the attention weights are distributed across different tokens. For instance, in the Offset heads a token is connected only to one other token, while in the Local heads the attention can be distributed across several tokens with different degrees of importance. Given that in Mean Shift the number of clusters is determined automatically and not selected by the user, the algorithm can easily adapt to such differences. As highlighted in Figure 2, the algorithm selects only two clusters for the Offset heads (on the left) and more clusters for the Local heads (on the right).

Figure 2: Simplified examples of the result of the Mean Shift Linker algorithm for an Offset head (on the left) and for a Local head (on the right).

A drawback of this approach is that, while the last cluster can easily be discarded as irrelevant and the first considered important, the role of the intermediate clusters is not immediately understandable. A more detailed analysis of these aspects will be conducted as future work.
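A possible implementation of the Mean Shift Linker for a single token is sketched below, using the Scikit-Learn MeanShift estimator with the settings described above; ordering the clusters by the attention weight of their centre, so that the first cluster is the one containing the highest weights, reflects the use made of it in Section 5.

```python
import numpy as np
from sklearn.cluster import MeanShift

def mean_shift_linker(a_w, tokens):
    """Group the attention weights of a token into clusters and return,
    for each cluster (ordered by decreasing centre weight), the connected tokens.

    a_w:    attention distribution of the token (length N).
    tokens: the N tokens of the document.
    """
    weights = np.asarray(a_w).reshape(-1, 1)          # one-dimensional clustering
    ms = MeanShift(min_bin_freq=1, cluster_all=True)  # other hyper-parameters: defaults
    labels = ms.fit_predict(weights)
    # Order clusters by the attention weight of their centre, highest first.
    order = np.argsort(-ms.cluster_centers_[:, 0])
    clusters = []
    for label in order:
        idx = np.where(labels == label)[0]
        clusters.append([(tokens[i], float(a_w[i])) for i in idx])
    return clusters

# Usage: the first cluster contains the most relevant connections of the token.
# first_cluster = mean_shift_linker(A[w_index], tokens)[0]
```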
4. Classification of Radiology Reports

In this section, we describe the real-world context in which we applied BERT, namely the classification of radiology reports. We analyse chest tomography reports, focusing in particular on the possible presence of neoplastic lesions. The potential advantages of a reliable automatic classification of both old and new reports concern diverse areas such as logistics, health care management, monitoring the frequency of follow-up examinations, and collecting cases for research or teaching purposes. The proposed system for report classification is based on a schema defined in close collaboration with the radiologists [31]. This schema is composed of three levels, which correspond to the main aspects considered by the physicians during the evaluation of a report:

1. Exam Type (First Exam or Follow-Up);
2. Result (Suspect or Negative);
3. Lesion Nature (Neoplastic, or lesion with an Uncertain Nature). This third level is specified only for the Suspect reports.

The dataset is composed of 5,752 classified computed tomography reports. Our reports contain a description (typically without verbs) of what the physicians have seen in the CT images (nodules, lesions, etc.), their relation to previous visits (for instance, whether the dimensions are unchanged with respect to the previous exam), or statements excluding the presence of specific symptoms or abnormalities. Similarly to other clinical texts, our reports are characterized by non-standard language, with abbreviations, ungrammatical constructions, acronyms and typos. This is due to the fact that reports are often written in haste or dictated to speech recognition software. In addition, abbreviations and acronyms are sometimes idiosyncratic to the specific hospital or department.

In order to see whether BERT is effective also in this complex context, we performed the classification task as follows: we adapted the BERT-base Italian model provided by the HuggingFace library (https://huggingface.co/dbmdz/bert-base-italian-cased) by performing the Masked Language Model and Next Sentence Prediction tasks on 10,000 unclassified reports, and then we fine-tuned it on our supervised training set. We used Adam as optimizer with learning rate 2 × 10^-5 and batch size 8 for 4 epochs. Performance is evaluated in 10-fold cross-validation, training 10 versions of the model on different training and test splits and computing the average results.
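For reference, the fine-tuning step on the classified reports could look like the following sketch based on the HuggingFace transformers library; the dataset objects and the number of labels are placeholders (our corpus is not publicly available), and the preliminary domain adaptation on the 10,000 unclassified reports is assumed to have already been performed.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "dbmdz/bert-base-italian-cased"  # the adapted checkpoint in our setting

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=2)  # e.g. Suspect vs Negative for the Result level

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

# train_dataset / eval_dataset are placeholder `datasets.Dataset` objects with
# "text" and "label" columns, corresponding to one fold of the cross-validation.
# train_dataset = train_dataset.map(tokenize, batched=True)
# eval_dataset = eval_dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="bert-radiology",
    learning_rate=2e-5,               # learning rate 2e-5
    per_device_train_batch_size=8,    # batch size 8
    num_train_epochs=4,               # 4 epochs
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```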
In Table 1, we show the results obtained by our model.

Table 1: Predictive performance in terms of Accuracy and F-Score, evaluated in 10-fold cross-validation.
                 Accuracy   F-Score
Exam Type          96.1       95.9
Result             87.4       84.1
Lesion Nature      73.6       71.6

For the first two classification levels, the accuracy is higher than 85%. For the third level, the performance does not reach the same level of accuracy. This is mainly due to two issues. First, as described in [31], there is no strong agreement among the physicians in identifying Uncertain Nature cases and not confusing them with Negative or Neoplastic reports. We speculate that these reports contain the most sensitive information, and therefore their language is the most ambiguous and cryptic. Moreover, it is also probable that some reports can be evaluated as either Uncertain or Negative depending on the doctor's opinion. Secondly, the third level is specified only for the non-negative reports, which limits the number of reports (fewer than 2,000) available for fine-tuning the BERT model. Overall, comparing the results obtained by BERT with the LSTM-based model presented in [31], we can see a small improvement in terms of accuracy and F-Score.

5. Experimental Results

The pre-trained model we used for our classification task has 12 encoding layers, each of which has 12 heads. Thus, we can represent the category of every head in the model as a 12 × 12 matrix. As mentioned in Section 3.1, our metrics are calculated considering a specific document, and therefore some heads (especially those close to the thresholds) may vary their category depending on the document. However, they are a small minority (between 10 and 15) with respect to the total.

Figure 3: (a) Result of our grouping process on a radiology report; each row represents an encoding layer of BERT made of 12 heads. (b) Noun-adjective connections (in Italian and with an English adaptation); the selected token is underlined and its most important connections are in red.

Analysing Figure 3a, we can see some characteristics of the model, and our results are quite similar to the ones shown in [19]. First of all, the first 2 encoding layers contain mostly Broadcast heads, which progressively diminish. The intermediate layers contain mostly Local heads, and the last ones are made up of a majority of Mixed heads. Offset and Not Local heads are sparse across the model.

In order to select some interesting linguistic characteristics detected by the Italian BERT model in its Local and Not Local heads, we adopted the following approach. First, we calculated the operative tokens and inspected them, searching for interesting patterns. For these tokens, we executed the Mean Shift Linker algorithm and considered only the first cluster, which contains the highest weights and is therefore supposed to highlight the most important relations between pairs of tokens. Finally, on the basis of these results, we formulated a hypothesis on the main linguistic function implemented by the head, and manually annotated some instances of such a function. This approach radically differs from the typical probing tasks, where every head is tested with some predefined datasets. In the following sections, we highlight some interesting information that can be extracted from a selection of heads. Please note that we focused our analysis mostly on medical terms, which often are not present in the original vocabulary of the BERT model. Although this does not seem to have a negative impact on our analysis, as future work we will try to perform the same tasks in other domains with a less specific lexicon and evaluate whether there are significant differences.
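The attention matrices needed for this analysis can be obtained directly from the model by requesting them at inference time. The sketch below, which reuses the no_op_metric and mean_shift_linker helpers sketched in Section 3, shows one possible way to extract the first cluster of every operative token for a given head; the input report is a made-up example, and in practice the fine-tuned checkpoint would be loaded instead of the base model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "dbmdz/bert-base-italian-cased"   # in practice, the fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

# Hypothetical report text used only for illustration.
report = "Non si riconoscono lesioni focali riferibili a localizzazioni secondarie."
enc = tokenizer(report, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_attentions=True)

# out.attentions is a tuple with one tensor per layer, each of shape
# (batch, num_heads, N, N); select layer i and head j (0-based indices).
i, j = 6, 3                                    # head (7, 4) in the 1-based notation above
A = out.attentions[i][0, j].numpy()

tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
sep_index = tokens.index("[SEP]")

# no_op_metric and mean_shift_linker are the helpers defined in Section 3.
for w_index, a_w in enumerate(A):
    if no_op_metric(a_w, sep_index) > 0.5:     # operative tokens only
        first_cluster = mean_shift_linker(a_w, tokens)[0]
        print(tokens[w_index], "->", first_cluster)
```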
5.1. Adjectives

Considering head (7, 4), i.e., the fourth head of the seventh encoding layer, we have found a significant pattern that connects nouns with adjectives. At the top of Figure 3b, we show how the first token of the word noduli (nodules) is connected to the first token of the adjective aspecifici (aspecific) and to the word bilaterali (bilateral). Note that the adjective millimetrici (millimetric) is not recognised by the head, perhaps because the adjective precedes the noun, which is quite uncommon in the Italian language. Conversely, head (9, 6) finds connections from adjectives to nouns. At the bottom of Figure 3b, we show how the adjective locale (local) is connected with the first token of recidiva (recurrence).

In order not to limit our analysis to qualitative examples, we have manually annotated several instances of such correlations and tested how these two heads behave. From our corpus of reports, we annotated the relations between nodules and bilateral, portion and caudal, local and recurrence, and pulmonary and artery. In Table 2, we report the results we obtained.

Table 2: Results of noun-adjective correlation in two Local heads.
Head    Association                     Support   Accuracy
(7,4)   no (nodules) → bilateral          494      78.54%
        portion → ca (caudal)             113      61.06%
(9,6)   local → recur (recurrence)        118      98.30%
        pulmonary → artery               1090      58.07%

For the relation between local and recurrence, 98% of the 118 annotated instances were recognised by head (9, 6). For head (7, 4), the relation between nodules and bilateral is recognised with an accuracy of 78.5% across 494 instances. Other relations are recognised with some difficulty, such as the one between artery and pulmonary or between portion and caudal. These probably suffer from the fact that the words involved are very specific to the medical lexicon, which may not have been learned properly due to the limited amount of reports in our dataset.

5.2. Semantic Field

While Local or Mixed heads perform mostly grammatical connections, and therefore concentrate the attention on tokens not too distant from each other, Not Local heads show a completely different behaviour. Given also that they are a minority with respect to the total of 144 heads (as can be seen from the blue squares in Figure 3a), we have inspected them closely. In heads (2, 1), (3, 11) and (4, 6), tokens are mostly connected to themselves or to other occurrences of the same word. This is particularly evident for the word Non (not, in English), which appears several times in a report: usually, each occurrence is connected to all the others. There are a few notable exceptions, where very similar tokens are connected, such as locale and locali (the singular and plural forms of the adjective local). Moreover, head (2, 1) is quite noisy and shows random and local connections without any recognisable logic. On the other hand, heads (12, 1) and (12, 5) are very precise, and the vast majority of tokens point only to themselves, regardless of whether similar words or other occurrences are present.

The most interesting phenomenon that we observed regards head (6, 10). Its behaviour can be summarised as follows. If a token w is present more than once in the document, or there are tokens with very small variations (like singular/plural differences such as nodule and nodules), then the attention is distributed across all the other occurrences (or small variations) of w. Otherwise, most of the attention is concentrated on synonyms, antonyms or words in the same semantic field as w, like the ones in Table 3. If the two previous conditions are not satisfied, w is connected only to itself.

We investigated further the capability of head (6, 10) to find synonyms or words in the same semantic field. Considering our reports, we analysed the first cluster extracted by the Linker algorithm for each token, discarding the cases in which the token is connected only to itself or in which the connected words differ only by small character variations.
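The way "small character variations" are detected is not specified here; a simple character-similarity filter such as the following, with an arbitrary threshold on difflib's similarity ratio, is one possible realisation.

```python
from difflib import SequenceMatcher

def is_small_variation(w1, w2, threshold=0.8):
    """True if two tokens differ only by small character variations
    (e.g. 'locale' vs 'locali'). The 0.8 threshold is an assumption."""
    return SequenceMatcher(None, w1.lower(), w2.lower()).ratio() >= threshold

def semantic_field_candidates(token, first_cluster):
    """Keep only the connections that are neither the token itself
    nor a near-identical word form."""
    return [(other, weight) for other, weight in first_cluster
            if other != token and not is_small_variation(token, other)]
```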
In Table 3, we report our results, in which we can see some important connections.

Table 3: Results for head (6, 10), with examples of relations between words in the same semantic field.
Association                      Support   Accuracy
lesions → no (nodule)              938      99.98%
texture → paren (parenchyma)       228      92.54%
artery → ##orta (aorta)            606      67.66%
pulmonary → chest                 3661      98.55%
inferior → superior               5332      99.83%
segments → portion                 131      77.10%
left → right                      1784      99.94%

For instance, the connection between the word lesions and the first token of nodule (which is a particular kind of lesion) is captured almost every time it appears (99.98% accuracy) over more than 900 instances. In our opinion, the relation between texture and the first token of parenchyma (with an accuracy of 92.54%) is particularly important, especially given that parenchyma is a very specific word in the anatomy lexicon, describing a particular type of texture. Simple antonyms like inferior and superior or left and right are also captured with high accuracy.

5.3. Negations

We also studied a specific kind of relation which can be particularly useful in our analysis. When radiologists evaluate the conditions of a patient, they often write a sentence excluding the presence of something, especially lesions. Sentences like "No focal lesions traceable to secondary locations" are strong evidence of a negative result of the report, and their identification could be an important factor in terms of interpretability. Manually inspecting the Local heads, we identified two of them that can be used to find these patterns: (9, 8) and (7, 12). There is however an important difference: head (9, 8) connects the negative particle with the word which has been denied, while head (7, 12) does the opposite. Nevertheless, instead of a very specific behaviour like the ones shown in [10], when a negation is not present these heads can also connect adjectives or other tokens, instead of directing the attention to [SEP]. While we cannot straightforwardly claim that these heads are specialised in identifying negations, we checked whether they can at least be used for extracting specific information. Therefore, we annotated the relation between Non and lesioni (no lesions) and calculated the accuracy of these heads in recognising it; we found that, over more than 800 instances, head (9, 8) has an accuracy of 67.9% and head (7, 12) reaches 86.0%. However, while in our context negations are expressed simply with the terms "Non" or "nè" (neither), in general negations can be expressed in many different forms, and their detection is a complex task that can require specific models [32]. Moreover, the identification of negations could be hampered by the presence of the token no not only as the start of a negation but also as the first token of nodulo and other similar words. Thus, while the result of head (7, 12) is quite good in this particular case, a more detailed evaluation of the capability of specific heads to detect negations will be conducted as future work.

6. Conclusions and Future work

We presented an application of fine-tuning the Italian-base BERT model in the context of the classification of radiology reports written in Italian. After verifying its efficacy, we investigated how its heads behave, and we proposed a way to group heads according to their behaviour. An important characteristic of our approach is that it relies only on simple mathematical metrics based on the Jensen-Shannon Divergence, instead of relying on manual observations or clustering.
We have also proposed an algorithm based on clustering for automatically extracting the most important connections between words, simplifying the understanding of the characteristics of each head. Combining these automatic procedures with manual observations, we have found and experimentally evaluated some relevant patterns that can improve the interpretability of BERT for the classification of radiology reports. In our application it is not sufficient to identify the most important findings or concepts; they must also be correlated with their characteristics. For instance, a nodule can be associated with a neoplastic lesion if its margins are irregular or spiculated and not round. Therefore, finding the heads that connect nouns (like margin) and adjectives (like round) could be effectively exploited for the classification process and its explanation. At the same time, finding that a particular condition is excluded by a negation could be crucial information. Moreover, we have found a head that identifies words in the same semantic field with remarkable accuracy. Although this does not lead to an immediate application for interpretability, this characteristic is further proof of the ability of BERT to capture language properties.

While we have studied and analysed the behaviour of BERT's heads in the radiology context, our techniques are general and can easily be adapted to other contexts. However, our metrics rely on specific thresholds that could vary on the basis of the document length or its characteristics. As future work, we want to test our techniques more extensively in other applications and on other datasets.

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.
[2] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186.
[3] A. Rogers, O. Kovaleva, A. Rumshisky, A primer in BERTology: What we know about how BERT works, Trans. Assoc. Comput. Linguistics 8 (2020) 842–866.
[4] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, Association for Computational Linguistics, 2019, pp. 3651–3657.
[5] Y. Goldberg, Assessing BERT's syntactic abilities, CoRR abs/1901.05287 (2019). URL: http://arxiv.org/abs/1901.05287. arXiv:1901.05287.
[6] A. Miaschi, D. Brunato, F. Dell'Orletta, G. Venturi, Linguistic profiling of a neural language model, in: Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, International Committee on Computational Linguistics, 2020, pp. 745–756.
[7] I. Tenney, D. Das, E. Pavlick, BERT rediscovers the classical NLP pipeline, in: Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, Association for Computational Linguistics, 2019, pp. 4593–4601.
[8] A. Garcia-Silva, J. M. Gomez-Perez, Classifying scientific publications with BERT - is self-attention a feature selection method?, in: Advances in Information Retrieval, Springer International Publishing, Cham, 2021, pp. 161–175.
[9] H. Xu, L. Shu, P. S. Yu, B. Liu, Understanding pre-trained BERT for aspect-based sentiment analysis, in: Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, International Committee on Computational Linguistics, 2020, pp. 244–250.
[10] K. Clark, U. Khandelwal, O. Levy, C. D. Manning, What does BERT look at? An analysis of BERT's attention, in: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@ACL 2019, Florence, Italy, August 1, 2019, Association for Computational Linguistics, 2019, pp. 276–286.
[11] O. Kovaleva, A. Romanov, A. Rogers, A. Rumshisky, Revealing the dark secrets of BERT, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Association for Computational Linguistics, 2019, pp. 4364–4373.
[12] J. Vig, Y. Belinkov, Analyzing the structure of attention in a transformer language model, in: Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, BlackboxNLP@ACL 2019, Florence, Italy, August 1, 2019, Association for Computational Linguistics, 2019, pp. 63–76.
[13] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, J. Eisenstein, Explainable prediction of medical codes from clinical text, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018.
[14] Y.-M. Kim, T.-H. Lee, Korean clinical entity recognition from diagnosis text using BERT, BMC Medical Informatics and Decision Making 20 (2020) 1–9.
[15] Y. Si, J. Wang, H. Xu, K. Roberts, Enhancing clinical concept extraction with contextual embeddings, Journal of the American Medical Informatics Association 26 (2019) 1297–1304.
[16] R. Leaman, R. Khare, Z. Lu, Challenges in clinical NLP for automated disorder normalization, Journal of Biomedical Informatics 57 (2015) 28–37.
[17] L. Putelli, A. E. Gerevini, A. Lavelli, R. Maroldi, I. Serina, Attention-based explanation in a deep learning model for classifying radiology reports, in: Artificial Intelligence in Medicine - 19th International Conference on Artificial Intelligence in Medicine, AIME 2021, Virtual Event, June 15-18, 2021, Proceedings, volume 12721 of Lecture Notes in Computer Science, Springer, 2021, pp. 367–372.
[18] Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, E. H. Hovy, Hierarchical attention networks for document classification, in: NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, The Association for Computational Linguistics, 2016, pp. 1480–1489.
[19] Y. Guan, J. Leng, C. Li, Q. Chen, M. Guo, How far does BERT look at: Distance-based clustering and analysis of BERT's attention, in: Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, Barcelona, Spain (Online), December 8-13, 2020, International Committee on Computational Linguistics, 2020, pp. 3853–3860.
[20] A. Miaschi, G. Sarti, D. Brunato, F. Dell'Orletta, G. Venturi, Italian transformers under the linguistic lens, in: Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020, Bologna, Italy, March 1-3, 2021, volume 2769 of CEUR Workshop Proceedings, CEUR-WS.org, 2020.
[21] Z. Meng, F. Liu, E. Shareghi, Y. Su, C. Collins, N. Collier, Rewire-then-probe: A contrastive recipe for probing biomedical knowledge of pre-trained language models, CoRR abs/2110.08173 (2021).
[22] J. Vig, A multiscale visualization of attention in the transformer model, in: Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 3: System Demonstrations, Association for Computational Linguistics, 2019, pp. 37–42.
[23] A. Jacovi, Y. Goldberg, Towards faithfully interpretable NLP systems: How should we define and evaluate faithfulness?, in: Proceedings of the 58th Annual Meeting of the ACL, ACL 2020, 2020, pp. 4198–4205.
[24] S. Jain, B. C. Wallace, Attention is not explanation, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 2019, pp. 3543–3556.
[25] S. Serrano, N. A. Smith, Is attention interpretable?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2931–2951.
[26] S. Wiegreffe, Y. Pinter, Attention is not not explanation, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, 2019, pp. 11–20.
[27] B. van Aken, B. Winter, A. Löser, F. A. Gers, How does BERT answer questions? A layer-wise analysis of transformer representations, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1823–1832.
[28] D. Comaniciu, P. Meer, Mean shift: A robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 603–619.
[29] Y. A. Ghassabeh, On the convergence of the mean shift algorithm in the one-dimensional space, CoRR abs/1407.2961 (2014). arXiv:1407.2961.
[30] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[31] L. Putelli, A. E. Gerevini, A. Lavelli, M. Olivato, I. Serina, Deep learning for classification of radiology reports with a hierarchical schema, in: Knowledge-Based and Intelligent Information & Engineering Systems: Proceedings of the 24th International Conference KES-2020, Virtual Event, 16-18 September 2020, volume 176 of Procedia Computer Science, Elsevier, 2020, pp. 349–359.
[32] A. Khandelwal, S. Sawant, NegBERT: A transfer learning approach for negation detection and scope resolution, in: Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, European Language Resources Association, 2020, pp. 5739–5748.