InsBERT: Word importance from artificial insertions

Adam Osuský, Dávid Javorský and Ondřej Bojar
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Malostranské nám. 25, Prague, 118 00, Czech Republic
adam.osusky746@student.cuni.cz (A. Osuský); javorsky@ufal.mff.cuni.cz (D. Javorský); bojar@ufal.mff.cuni.cz (O. Bojar)
https://ufal.mff.cuni.cz/david-javorsky (D. Javorský); https://ufal.mff.cuni.cz/ondrej-bojar (O. Bojar)
ORCID: 0000-0003-2516-2535 (D. Javorský); 0000-0002-0606-0050 (O. Bojar)
Workshop on Automata, Formal and Natural Languages 2024 (WAFNL 2024)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
We investigate the quantification of word importance by introducing a novel self-supervised task that modifies masked language modeling. Instead of predicting masked words, our approach involves learning to identify which words were inserted. We hypothesize that the resulting models will predict a higher likelihood of insertion for less important words. We experiment with two different insertion strategies: the List Inserting Method (LIM) and the BERT Inserting Method (BIM). We outline the process for gathering manually estimated word importance data and describe the construction of a dataset for evaluating our methods. Our results indicate that our modified language modeling surpasses baselines and is competitive with existing research in assessing word importance.

Keywords: word importance, self-supervision, word insertion, synthetic data

1. Introduction

A significant amount of human knowledge and communication is now recorded as digital text [1, 2], often in raw, unstructured form [3]. This necessitates methods to make text searchable and easily summarized, driving the NLP community's interest in quantifying word importance as a possible solution.

The concept of identifying important words dates back to the 1950s [4], with early methods based on word frequencies such as TF-IDF, which remains widely used in modern NLP applications [5]. Various methods have been explored for determining word importance in tasks such as querying [6], summarization [7], text classification [8], and keyphrase extraction [9, 10].

Current approaches for assessing word importance involve comparing the spatial distribution of words in the original versus shuffled text [11], exploiting attribution methods [12], utilizing the χ² test [13, 14], or interpreting attention in attention-based models [15], although their interpretability is debated [16].

Kafle and Huenerfauth [17] collect annotations of word importance as real numbers from 0 to 1, which they later use for captioning to aid those who are deaf or hard of hearing [18].

Interestingly, [19] defines word importance ranks as the difference in the classifier's confidence for the target label when a specific word is included in the text versus when it is removed. This approach reveals that adversarial attack algorithms in NLP primarily disrupt the distribution of this word importance.

In [12], a method is explored to derive word significance from models trained for Natural Language Inference (NLI) and Paraphrase Identification (PI) by using an attribution method to compute scores for each input word, identifying those that contribute most to the model's decision. The approach involves training an interpreter to mask as many words as possible while still preserving the original prediction. We compare the performance of our approach with this work.

This study explores assessing word importance comprehensively, from collecting data to creating and evaluating an automatic word importance scorer. More precisely, the contributions of this work are: (1) a precise definition of word importance and proposed metrics for its evaluation, (2) a small multi-domain word-importance dataset in English annotated by three annotators, (3) a novel self-supervised machine learning method for predicting word importance.
This self-supervised approach modifies BERT's [20] methodology to predict artificially inserted words rather than masked ones, examining two insertion methods: the List Inserting Method (LIM; inserting randomly from a word list) and the BERT Inserting Method (BIM; inserting using a BERT model). The results seem to indicate that our proposed method is superior to baselines such as TF-IDF and is on par with, or even better than, existing approaches to calculating word importance.¹

¹ https://github.com/adam-osusky/predicting-word-importance

2. Word Importance

Word importance (WI) depends on its intended usage. Depending on objectives, such as text summarization or grammar correction, the same words may hold different degrees of importance. In this work, we focus on semantic importance, and we define it by drawing inspiration from prior works: Kafle and Huenerfauth [17] emphasize the loss incurred by removing a word, and Javorský et al. [12] focus on the meaning contribution added by a word. We combine these perspectives into a unified definition of WI:

    Importance is the measure of a word's contribution to the overall meaning of the context, indicating the extent to which the removal of a word would diminish the information conveyed by the context.

We aim to collect human-annotated data for word importance, and therefore we need to clearly formulate instructions for annotators. Even though the most intuitive approach would be to let annotators score each word by a real number within the range [0, 1], as is done in other studies [17, 12], we find such a task very difficult for annotators. Therefore, we represent WI as an importance ranking.
By ranking, we mean an ordering of word positions within a context, where the word position ranked as 1 is considered the most important. We rank word positions because the same word can appear multiple times in a context with varying levels of importance.

In our initial experiments, we observed that when annotators were unrestricted in the number of words they could rank, they tended to sequentially select key nouns from the subjects and objects, as well as the verbs that connect these elements in the sentences. However, this behavior does not align with our objectives.

Our primary goal is to identify the most important words, so the ranking can include only a subset of the word positions in a given context. For a context of length m, we aim to rank n positions where 1 ≤ n ≤ ⌈0.1 · m⌉. Word positions that are not ranked are assigned the rank n + 1, referred to as the "last rank". We refer to this restriction as a rank limit of 10%. We argue that it focuses the annotators' attention on identifying only the most essential words in a given context. By restricting the number of words that can be ranked to 10% of the total word positions, we force a more selective process, making the annotators concentrate on the most salient words rather than be overwhelmed by the numerous possible rankings.

Instructions for Annotators. The full set of instructions provided to our annotators is as follows:

1. Arrange the words in descending order by their importance. You can rank at most 10 percent of the words, or choose to rank fewer if desired.
2. Create an order for the most important ones; any unranked words will receive the last rank, and they should be considered to have similar importance. At least one word must be ranked.
3. Click on words to select them. The selection order determines their ranking. Clicking on a selected word will deselect it. The first selected word is the most important.
4. Importance is the measure of a word's contribution to the overall meaning of the context, indicating the extent to which the removal of a word would diminish the information conveyed by the context.
5. Contexts span five diverse domains: news, literature, poetry, jokes, and transcribed spoken language.
6. In the transcribed spoken language domain, words may take the form of "(PERSON#NUMBER)" at the beginning of a person's reply, indicating the speaker's identity. These tags are non-clickable and non-rankable. Additionally, words in the form "PERSON#NUMBER" serve as references to other persons' names within the utterance.

A simple annotation tool was used for data collection. This tool allows annotators to rank words by clicking on them in sequence. If an annotator wants to insert a word into the middle of an already selected ranking, they must unselect the subsequent words and then reselect them in the desired order. While it might seem more convenient to allow direct insertion of a word into the middle of the ranking, the current approach has its benefits. By requiring annotators to reassess the subsequent rankings when making changes, the process encourages a more thoughtful and deliberate evaluation of the overall ranking.

3. Data Collection

To ensure diversity, we target various domains and their corresponding English datasets: news, the News Commentary dataset [21]; literature, data from [22]; poetry, data from [23]; jokes, data from [24]; and meeting transcripts, the ELITR Minuting Corpus [25]. From each domain, we manually select 10 contexts (each context possibly containing more sentences), ensuring that the contexts are around 60 words long. To achieve better granularity for certain words like "don't" and "I'm", the contexts are tokenized using the Moses tokenizer [26]. The dataset statistics are outlined in Table 1.

Each of the 50 contexts is annotated by three annotators who are non-native English speakers. These contexts with annotations form the Word Importance Dataset (WIDS). We make the dataset available at [27].

Table 1: Statistics of our Word Importance Dataset. The mean and standard deviation are computed on lengths of contexts.

Domain       Contexts   Characters: count, mean±std   Moses-tokens: count, mean±std
News         10         2565, 256.5±26.1              529, 52.9±4.0
Literature   10         2207, 220.7±17.1              601, 60.1±7.2
Poetry       10         1776, 177.6±27.5              540, 54.0±6.2
Jokes        10         1938, 193.8±25.4              575, 57.5±7.0
Transcripts  10         2432, 243.2±26.6              616, 61.6±7.1
All          50         10918, 218.4±38.5             2861, 57.2±7.2
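As an illustration of the ranking representation described in Section 2, the following minimal sketch (ours, not part of the released tooling) turns an annotator's ordered selection of word positions into a rank vector under the 10% rank limit; the function and variable names are hypothetical.

```python
import math

def ranks_from_clicks(tokens: list[str], clicks: list[int]) -> list[int]:
    """Convert an ordered list of clicked word positions into ranks.

    The i-th clicked position receives rank i + 1; every unranked position
    receives the "last rank" n + 1, where n is the number of ranked positions.
    At most ceil(0.1 * m) positions may be ranked for a context of m words.
    """
    m = len(tokens)
    limit = math.ceil(0.1 * m)                      # rank limit of 10%
    assert 1 <= len(clicks) <= limit, "rank between 1 word and 10% of the words"
    n = len(clicks)
    ranks = [n + 1] * m                             # default: last rank
    for order, position in enumerate(clicks, start=1):
        ranks[position] = order
    return ranks

# Example: a 20-word context where positions 3 and 7 were clicked, in that order.
print(ranks_from_clicks(["w"] * 20, [3, 7]))
```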
To assess the similarity of the annotations, we compute inter-annotator agreement using Cohen's kappa [28]. We simplify the calculation by classifying each word position in every context as either "selected" or "not selected" by the annotators. In Section 5.2, we present metrics that incorporate the order of selection for a more nuanced analysis.

The computed Cohen's kappa values are shown in Table 2. It is unsurprising that poetry is one of the two domains with the least agreement. An intriguing observation is that literature displays slightly lower agreement than poetry. The domains with the highest agreement are jokes and meeting transcripts. We find these findings in line with our intuition: it is often very clear which words make jokes funny, and speech in meetings may contain many objectively unimportant words, e.g. filler words, hesitations, false starts, etc.

Table 2: Cohen's kappa coefficient between pairs of annotators within individual domains and across all domains in our Word Importance Dataset. The highest agreements within each domain are highlighted in bold.

Domain       Pair1-2   Pair1-3   Pair2-3   Average
News         0.318     0.296     0.388     0.334
Literature   0.223     0.286     0.273     0.260
Poetry       0.260     0.332     0.238     0.277
Jokes        0.533     0.630     0.533     0.565
Transcripts  0.539     0.475     0.518     0.511
All          0.380     0.406     0.395     0.393
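For concreteness, the simplified agreement computation can be reproduced with scikit-learn's implementation of Cohen's kappa. This is a minimal sketch under the simplification described above (each word position reduced to a binary selected/not-selected label), not the authors' exact evaluation script; the example vectors are made up.

```python
from sklearn.metrics import cohen_kappa_score

# One binary label per word position, concatenated over all contexts of a domain:
# 1 = the annotator ranked (selected) the position, 0 = left it unranked.
annotator_a = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
annotator_b = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

print(round(cohen_kappa_score(annotator_a, annotator_b), 3))
```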
4. Methodology

Our approach involves fine-tuning a pre-trained BERT model [20] using automatically generated data. Specifically, we generate training text data by inserting words into existing text and then use the modified text as training data. The objective of fine-tuning is to predict which words were inserted. We hypothesize that this task will require the model to understand the importance of each word and its contribution to the overall meaning of the context, ultimately leading the model to assign higher likelihoods of insertion to less important words. This enables us to create a ranking of words in a test input text by ordering them according to their predicted word importance score. We propose two methods for creating the training dataset of inserted words: the List Inserting Method (LIM) and the BERT Inserting Method (BIM).

List Inserting Method (LIM). This method inserts words randomly from a predefined list. This list is generated by splitting the base corpus into words by white space. Consequently, words that appear more frequently in the corpus are more likely to be inserted, mirroring the original distribution.

BERT Inserting Method (BIM). This method aims to insert words that do not fit well in the sentence. This is achieved by leveraging the capabilities of another instance of the BERT model [20].² Because BERT predicts the words without any information except the text itself, we assume that they should not alter the sentence's meaning significantly. In this method, mask tokens are placed at the selected insertion positions within the text, and BERT is then used to predict the masked tokens. We prohibit predictions of neighboring tokens (those immediately before and after the masked token) and sub-word tokens, i.e. tokens that are not the beginning of a word. After filtering out these unwanted tokens, we select the prediction with the highest logit probability.

² https://huggingface.co/google-bert/bert-base-uncased

For both methods, possible positions for insertion include the places before existing words and one additional position at the end of the text. The words of the original text are obtained by splitting it on white space to determine the potential insertion positions. We insert at most one word in each position, ensuring that words are not inserted consecutively.

The positions for insertion are selected randomly. The insertion rate is defined as the ratio of the number of words to be inserted to the total number of words in the original text sample. For instance, if a text sample contains 10 words and we use an insertion rate of 0.5, we insert 5 new words into this text.

In our experiments, our goal is to effectively compare the two insertion methods, LIM and BIM, as well as evaluate their performance across different insertion rates. To achieve this, we train separate models for both methods at insertion rates of 0.25, 0.5, and 0.75.

[Figure 1: Example of text with inserted words using the LIM (left) and BIM (right) methods with an insertion rate of 0.5. Words highlighted in yellow boxes are inserted.]
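To make the two insertion strategies concrete, the sketch below shows one possible implementation of the procedure described in this section (randomly chosen gaps at a given insertion rate, at most one inserted word per gap; BIM predictions filtered to full words that differ from their immediate neighbours). It reflects our reading of the method rather than the authors' released code, and it relies on the Hugging Face fill-mask pipeline; function names such as choose_positions, lim_insert, and bim_word are illustrative.

```python
import random
from transformers import pipeline

# Masked LM used for the BERT Inserting Method (model id taken from the paper's footnote).
fill_mask = pipeline("fill-mask", model="google-bert/bert-base-uncased")


def choose_positions(num_words: int, rate: float) -> set[int]:
    """Randomly pick insertion gaps: one before each word plus one at the end of the text.

    The insertion rate is the ratio of inserted words to original words, and at most
    one word goes into any single gap, so insertions are never consecutive."""
    num_inserted = round(rate * num_words)
    return set(random.sample(range(num_words + 1), num_inserted))


def lim_insert(words: list[str], word_list: list[str], rate: float) -> list[tuple[str, int]]:
    """List Inserting Method: fill the chosen gaps with words drawn from a whitespace-split
    corpus word list (frequent words are drawn more often). Returns (word, label) pairs,
    where label 1 marks an inserted word."""
    gaps = choose_positions(len(words), rate)
    out: list[tuple[str, int]] = []
    for i in range(len(words) + 1):
        if i in gaps:
            out.append((random.choice(word_list), 1))
        if i < len(words):
            out.append((words[i], 0))
    return out


def bim_word(words: list[str], gap: int) -> str:
    """BERT Inserting Method for a single gap: place a mask there and take the
    highest-scoring prediction that is a full word and not an immediate neighbour."""
    masked = " ".join(words[:gap] + ["[MASK]"] + words[gap:])
    neighbours = {w.lower() for w in words[max(0, gap - 1):gap + 1]}
    for candidate in fill_mask(masked, top_k=50):
        token = candidate["token_str"]
        if not token.startswith("##") and token.lower() not in neighbours:
            return token
    return random.choice(words)  # fallback; with top_k=50 this should rarely trigger
```

A training example is then the sequence of original and inserted words together with the binary labels, which serve as the prediction target for fine-tuning.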
In future endeavors, it would be interesting to extend this research by training models on datasets created using both LIM and BIM, potentially combining or varying insertion rates.

4.1. Example Text with Inserted Words

In Figure 1, we illustrate an example of text from our preprocessed WikiText dataset (Section 5.1), where words have been inserted using both the LIM and BIM methods. The inserted words in the BIM method often appear superfluous, adding information to the sentences. Notably, there is a frequent insertion of apostrophes, occurring more often than desired. To investigate this phenomenon further, we conducted a simple experiment on a subset of 100 examples to analyze how the frequency of apostrophes changes with varying insertion rates. Refer to Figure 2 for the results. It is observed that the frequency of apostrophes converges to approximately 0.22 when the insertion rate is at least 0.5.

Conversely, in the LIM method, the inserted words sometimes introduce information that seems out of context. Additionally, some inserted words include punctuation marks, such as "(April" or "area,", as seen in the last sentence on the left in Figure 1.

[Figure 2: Frequency of apostrophes in inserted words by the BIM method with varying insertion rates, plotted as the ratio of apostrophe insertions to all insertions against insertion rates from 0.0 to 1.0.]

5. Experiments

5.1. Training Details

We detail the preprocessing methods applied to the WikiText dataset and outline the construction of the training regime.

Data. As the base text corpus into which we insert new words, we have selected the WikiText dataset [29]. This dataset comprises articles from Wikipedia³ that are classified as either Good or Featured articles, according to the criteria specified by Wikipedia editors at the time of creation.

We use version wikitext-103-raw-v1⁴ from the HuggingFace datasets library [2]. Each example in the dataset is either a paragraph or a title. For our specific use case, we preprocess this dataset by removing examples that are titles. Additionally, the dataset is tokenized using the Moses tokenizer [26]. Since we employ a Transformer [15] model that already incorporates its own tokenization, we detokenize the text. We retain the original train-validation-test splits. For detailed statistics regarding our preprocessed WikiText, refer to Table 3.

³ https://www.wikipedia.org/
⁴ https://huggingface.co/datasets/wikitext

Table 3: Statistics for train, validation, and test splits of our preprocessed WikiText. The statistics are computed on lengths of paragraphs, where std stands for empirical standard deviation. Words are obtained by splitting the detokenized text by white space.

                      Train          Validation     Test
Paragraphs            859,532        1,841          2,183
Characters            509,512,733    1,083,136      1,217,025
Characters mean±std   592.7±404.2    588.3±385.4    557.5±404.1
Words                 84,208,748     178,815        201,013
Words mean±std        97.9±67.1      97.1±63.8      92.0±67.3
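A rough sketch of the preprocessing just described, under two assumptions of ours: title examples in the raw WikiText release can be recognised as lines wrapped in "=" markers, and the sacremoses package stands in for the Moses tokenizer scripts. It is not the authors' released pipeline.

```python
from datasets import load_dataset
from sacremoses import MosesTokenizer, MosesDetokenizer

mt, md = MosesTokenizer(lang="en"), MosesDetokenizer(lang="en")

def is_title(text: str) -> bool:
    # Heading lines in the raw WikiText release look like " = Some Title = " (assumption).
    stripped = text.strip()
    return stripped.startswith("=") and stripped.endswith("=")

def normalize(example: dict) -> dict:
    # Tokenize with Moses for consistency with WIDS, then detokenize again,
    # since BERT applies its own WordPiece tokenization to raw text.
    tokens = mt.tokenize(example["text"], escape=False)
    return {"text": md.detokenize(tokens)}

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")   # keeps train/validation/test splits
wikitext = wikitext.filter(lambda ex: ex["text"].strip() and not is_title(ex["text"]))
wikitext = wikitext.map(normalize)
```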
Models. We create six datasets using both the LIM and BIM techniques, each with insertion rates of 0.25, 0.5, and 0.75. Subsequently, we train six models, each corresponding to a distinct combination of insertion method and rate.

To ensure a fair comparison and avoid introducing bias due to differences in hyper-parameters, we use identical settings for all models. Hyper-parameters were selected empirically based on initial experiments. We use a learning rate of 0.0032, a batch size of 256, the Adam optimizer [30] with default betas (0.9, 0.999), and a linear learning rate scheduler with a linear warmup of 350 steps. Starting from the BERT⁵ pre-trained model, we fine-tune on each of the datasets for 5 epochs.

⁵ https://huggingface.co/google-bert/bert-base-uncased

The performance of the individual models in the classification task is shown in Table 4. The LIM models consistently outperform the BIM models. Given this discrepancy and the distribution of inserted punctuation marks discussed in Section 4.1, it indicates that the BIM data present a more challenging task, as the inserted words blend more seamlessly with the context.

Table 4: Performance of models on the test split from their respective generated data. The model names indicate whether the List Inserting Method (LIM) or BERT Inserting Method (BIM) was used for generating data, along with the insertion rate. Loss refers to cross-entropy loss, and in the computation of F1 score, precision, and recall, the positive target is the class of inserted words. Best results are in bold.

Model      F1      Loss    Precision   Recall
BIM-0.25   0.929   0.046   0.930       0.927
LIM-0.25   0.952   0.031   0.957       0.947
BIM-0.5    0.953   0.048   0.950       0.957
LIM-0.5    0.972   0.028   0.974       0.971
BIM-0.75   0.982   0.027   0.980       0.984
LIM-0.75   0.985   0.021   0.985       0.985

5.2. Evaluation for WI ranking

Our trained models predict logits for the probability of word insertion. We construct the ranking by ordering the BERT-tokens in one context in ascending order of their insertion probabilities. Since the Word Importance Dataset (WIDS) is pretokenized by the Moses tokenizer, we use the logits of the first BERT-token to score the original Moses-token if a Moses-token is split into multiple BERT-tokens.

For the human reference, we calculate the average rank of each token based on the rankings provided by all three annotators and then order the words according to these average ranks. With only three annotators, a majority of words still fall into the lowest rank, leading to inconsistencies between model predictions and averaged annotations, as they can result in different lowest ranks. To ensure consistency in evaluation, we apply the 10% rank limit to both the averaged annotations and the model predictions.

Since 90% of the positions fall into the lowest rank, this creates challenges in designing effective evaluation metrics. To address these issues, we propose three metrics, each progressively refining and incorporating desired properties to better align with our evaluation goals.

Pearson correlation. The simplest and best-known approach is to calculate the sample Pearson correlation coefficient on the ranks of word positions over all positions and all contexts in the dataset. However, this method is not ideal because 90% of word positions within a given context fall into the lowest rank. Our primary focus is on achieving higher agreement within the top 10%, which is not adequately emphasized by this correlation measure.

k-inter. Another perspective on rankings is to consider the words that do not have the last rank and disregard their specific order. By doing this, we view the rankings as indicators of which words are important, allowing us to measure the extent of the intersection between different rankings. We thus propose a new metric, k-inter, where we filter the ranking and keep only word positions that do not have the last rank. We then compute the fraction of context pairs where the intersection of their filtered rankings has at least k elements. We examine values of k ∈ {1, 2, 3}.
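The following sketch shows our reading of how a model ranking is obtained from per-word insertion scores and how k-inter can then be computed; counting "context pairs" as one comparison per context is our assumption, and the names (insertion_scores, top_positions, k_inter) are illustrative.

```python
import math

def top_positions(insertion_scores: list[float], frac: float = 0.1) -> set[int]:
    """Keep the ceil(frac * m) word positions with the LOWEST insertion scores,
    i.e. the words the model is least willing to treat as insertions."""
    n = math.ceil(frac * len(insertion_scores))
    order = sorted(range(len(insertion_scores)), key=lambda i: insertion_scores[i])
    return set(order[:n])

def k_inter(selected_a: list[set[int]], selected_b: list[set[int]], k: int) -> float:
    """Fraction of contexts whose two filtered rankings share at least k positions."""
    hits = sum(1 for a, b in zip(selected_a, selected_b) if len(a & b) >= k)
    return hits / len(selected_a)
```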
Overlap. The limitation of k-inter is that it does not consider specific rank values, only if the words are in the top 10%. We aim to assign more weight to agreements on specific rank values, prioritizing the match on higher-ranked agreements over lower-ranked ones. We thus propose to use the average overlap metric, as described by Webber et al. [31]. First, we derive an ordered list of words from the ranking. The agreement between lists l and l' at depth d is defined as A(l, l', d) = |l_{:d} ∩ l'_{:d}| / d, where l_{:d} represents the first d elements of the list. The average overlap at depth k is then AO(l, l', k) = (1/k) Σ_{d=1}^{k} A(l, l', d). For context pairs of rankings, we compute the average overlap for each pair and then average these values, which we refer to as overlap. The depth is chosen differently for each pair: for a context of length m, the depth is set to ⌈0.1 · m⌉, to be consistent with our rank limit of 10%.
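A direct transcription of these formulas into code (our sketch, not the authors' implementation; list elements are word positions ordered from most to least important):

```python
import math

def average_overlap(l1: list[int], l2: list[int], k: int) -> float:
    """AO(l, l', k) = (1/k) * sum_{d=1..k} |l[:d] ∩ l'[:d]| / d."""
    total = 0.0
    for d in range(1, k + 1):
        total += len(set(l1[:d]) & set(l2[:d])) / d
    return total / k

def overlap(rankings_a: list[list[int]], rankings_b: list[list[int]], lengths: list[int]) -> float:
    """Average AO over all contexts, with depth ceil(0.1 * m) for a context of m words."""
    scores = [average_overlap(a, b, math.ceil(0.1 * m))
              for a, b, m in zip(rankings_a, rankings_b, lengths)]
    return sum(scores) / len(scores)
```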
6. Results

We first evaluate the pair-wise agreement between annotators using these metrics, which we present in Table 5. This evaluation complements Cohen's kappa from Section 3. The order of annotator pairs remains consistent for both the Pearson correlation and the overlap metric. The k-inter values for k of 1 and 2 are relatively high compared to the Pearson correlation or the overlap, indicating that the annotators agree on the selection of the most important words but not that well on their order. This supports our decision to let annotators focus only on the most important words and not mentally overload them with the vast number of possible rankings. In Table 6, we further present the overlap between annotators within the individual domains of WIDS. Annotators show the highest similarity in the jokes domain and the lowest in the poetry domain. This observation aligns with the results in Table 2. For other metrics on individual domains, see Appendix A.

Table 5: Metrics from Section 5.2 computed between our annotators on the Word Importance Dataset.

Annotators   Pearson   1-inter   2-inter   3-inter   Overlap
Pair1-3      0.534     0.90      0.76      0.52      0.319
Pair1-2      0.555     0.92      0.72      0.38      0.324
Pair2-3      0.602     0.90      0.70      0.46      0.394
Average      0.563     0.91      0.73      0.45      0.346

Table 6: Overlap computed between our annotators on the Word Importance Dataset, for individual domains.

Domain       Pair 1-2   Pair 1-3   Pair 2-3   Average
News         0.286      0.247      0.511      0.348
Literature   0.211      0.220      0.340      0.257
Poetry       0.189      0.301      0.220      0.237
Jokes        0.484      0.475      0.462      0.474
Transcripts  0.450      0.354      0.437      0.413

Finally, we evaluate the performance of all six of our trained models. Additionally, we include random predictions as a baseline for our metrics and the average human performance from Table 5 as an upper bound. As an additional baseline, we include term frequency–inverse document frequency (TF-IDF), computed on the Word Importance Dataset without any preprocessing. Furthermore, we include two models, PI (Paraphrase Identification) and NLI (Natural Language Inference), developed by Javorský et al. [12]. We obtain rankings from all models by ordering the words according to their significance scores.

The results are presented in Table 7, indicating that our models perform reasonably well. They surpass random predictions and TF-IDF across all metrics and are comparable to the NLI model. Notably, LIM-0.25 even exceeds the NLI model in both the overlap and Pearson correlation metrics. Metrics that consider the order of selected words show that our models are approximately halfway to achieving human-level performance. They approach human performance in terms of 1-inter but lag significantly behind in the higher k-inter metrics.

Table 7: Evaluation of models from Section 6 on the Word Importance Dataset. The "Random" category represents the average metrics of 100 random predictions, while "Humans" denotes the average of human metrics from Table 5. The metrics are defined in Section 5.2.

Model      Pearson   1-inter   2-inter   3-inter   Overlap
Random     0.256     0.54      0.13      0.01      0.061
PI         0.321     0.78      0.40      0.08      0.114
TF-IDF     0.309     0.66      0.20      0.04      0.121
BIM-0.75   0.335     0.82      0.32      0.12      0.125
BIM-0.25   0.341     0.76      0.40      0.14      0.131
LIM-0.5    0.328     0.72      0.40      0.12      0.137
LIM-0.75   0.352     0.80      0.48      0.18      0.142
BIM-0.5    0.344     0.70      0.42      0.14      0.143
NLI        0.374     0.90      0.56      0.22      0.150
LIM-0.25   0.376     0.82      0.52      0.14      0.178
Humans     0.563     0.91      0.73      0.45      0.346

It is quite surprising that the LIM approach is superior to BIM, suggesting that simple methods are sometimes more effective. We hypothesize that the words inserted by BERT are so well suited to the surrounding context that it is very difficult to detect them, which effectively decreases the useful learning signal they provide.

For readers interested in a detailed view of all metrics across individual domains, refer to Appendix A.

7. Conclusion

In this paper, we define word importance, collect annotations for a small multi-domain word-importance dataset in English, propose metrics for its evaluation, and introduce a novel self-supervised machine learning method: the goal is to predict inserted words in the text. Our results demonstrate that our method outperforms baseline models and is comparable to prior work on word importance.

Possible future work might benefit from more experiments with BIM or from combining LIM and BIM, potentially leading to more competitive results. Experimenting with smaller insertion ratios is another potential avenue.

Limitations. One of the primary limitations of our study is the size of the Word Importance Dataset, since it includes only 50 relatively short contexts of approximately 60 words each. Varying lengths of context might contribute to better generalization. The study compares importance scores to only one other indicator of word significance, and it also lacks an evaluation of the importance scores on a downstream task.

Another limitation is the small number of annotators. With a larger pool of annotators, the data in the Word Importance Dataset would likely exhibit lower variance. This would result in higher-quality averaged rankings that are more closely aligned with the true distribution.

Finally, the work does not provide an evaluation of importance scores on the word-importance dataset collected by Kafle and Huenerfauth [17].

Acknowledgments

The work has been partially supported by the grant 272323 of the Grant Agency of Charles University, grant 19-26934X (NEUREM3) of the Czech Science Foundation, and SVV project number 260 698. Computational resources were provided by the e-INFRA CZ project (ID: 90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.
References

[1] G. Penedo, H. Kydlíček, L. von Werra, T. Wolf, Fineweb, 2024. URL: https://huggingface.co/datasets/HuggingFaceFW/fineweb. doi:10.57967/hf/2092.
[2] Q. Lhoest, A. V. del Moral, Y. Jernite, A. Thakur, P. von Platen, S. Patil, J. Chaumond, M. Drame, J. Plu, L. Tunstall, et al., Datasets: A community library for natural language processing, arXiv preprint arXiv:2109.02846 (2021).
[3] H. Hassani, C. Beneki, S. Unger, M. T. Mazinani, M. R. Yeganegi, Text mining in big data analytics, Big Data and Cognitive Computing 4 (2020) 1.
[4] H. P. Luhn, The automatic creation of literature abstracts, IBM Journal of Research and Development 2 (1958) 159–165.
[5] M. Das, P. Alphonse, et al., A comparative study on TF-IDF feature weighting method and its analysis using unstructured dataset, arXiv preprint arXiv:2308.04037 (2023).
[6] Z. Dai, J. Callan, Context-aware document term weighting for ad-hoc search, in: Proceedings of The Web Conference 2020, 2020, pp. 1897–1907.
[7] K. Hong, A. Nenkova, Improving the estimation of word importance for news multi-document summarization, in: S. Wintner, S. Goldwater, S. Riezler (Eds.), Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 712–721. URL: https://aclanthology.org/E14-1075. doi:10.3115/v1/E14-1075.
[8] I. Sheikh, I. Illina, D. Fohr, G. Linarès, Learning word importance with the neural bag-of-words model, in: P. Blunsom, K. Cho, S. Cohen, E. Grefenstette, K. M. Hermann, L. Rimell, J. Weston, S. W.-t. Yih (Eds.), Proceedings of the 1st Workshop on Representation Learning for NLP, Association for Computational Linguistics, Berlin, Germany, 2016, pp. 222–229. URL: https://aclanthology.org/W16-1626. doi:10.18653/v1/W16-1626.
[9] M. Song, Y. Feng, L. Jing, A survey on recent advances in keyphrase extraction from pre-trained language models, Findings of the Association for Computational Linguistics: EACL 2023 (2023) 2153–2164.
[10] B. Xie, J. Song, L. Shao, S. Wu, X. Wei, B. Yang, H. Lin, J. Xie, J. Su, From statistical methods to deep learning, automatic keyphrase prediction: A survey, Information Processing & Management 60 (2023) 103382.
[11] A. Mehri, M. Jamaati, H. Mehri, Word ranking in a single document by Jensen–Shannon divergence, Physics Letters A 379 (2015) 1627–1632.
[12] D. Javorský, O. Bojar, F. Yvon, Assessing word importance using models trained for semantic tasks, 2023. arXiv:2305.19689.
[13] X. Li, X. Wu, X. Hu, F. Xie, Z. Jiang, Keyword extraction based on lexical chains and word co-occurrence for Chinese news web pages, in: 2008 IEEE International Conference on Data Mining Workshops, IEEE, 2008, pp. 744–751.
[14] H. Jiao, Q. Liu, H.-b. Jia, Chinese keyword extraction based on N-gram and word co-occurrence, in: 2007 International Conference on Computational Intelligence and Security Workshops (CISW 2007), IEEE, 2007, pp. 152–155.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[16] S. Serrano, N. A. Smith, Is attention interpretable?, arXiv preprint arXiv:1906.03731 (2019).
[17] S. Kafle, M. Huenerfauth, A corpus for modeling word importance in spoken dialogue transcripts, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1016.
[18] S. Kafle, P. Yeung, M. Huenerfauth, Evaluating the benefit of highlighting key words in captions for people who are deaf or hard of hearing, in: Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, 2019, pp. 43–55.
[19] L. Shen, X. Zhang, S. Ji, Y. Pu, C. Ge, X. Yang, Y. Feng, TextDefense: Adversarial text detection based on word importance entropy, arXiv preprint arXiv:2302.05892 (2023).
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[21] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 2214–2218. URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.
[22] R. Nagyfi, Dataset card for Project Gutenberg - English language ebooks, https://huggingface.co/datasets/sedthh/gutenberg_english, 2023. Accessed: 2024-03-28.
[23] A. Parrish, GitHub repository for gutenberg-poetry-corpus, https://github.com/aparrish/gutenberg-poetry-corpus, 2018. Accessed: 2024-03-28.
[24] SocialGrep, Dataset card for one-million-reddit-jokes, https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes, 2021. Accessed: 2024-03-28.
[25] A. Nedoluzhko, M. Singh, M. Hledíková, T. Ghosal, O. Bojar, ELITR Minuting Corpus: A novel dataset for automatic minuting from multi-party meetings in English and Czech, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 3174–3182.
[26] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst, Moses: Open source toolkit for statistical machine translation, in: S. Ananiadou (Ed.), Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Association for Computational Linguistics, Prague, Czech Republic, 2007, pp. 177–180. URL: https://aclanthology.org/P07-2045.
[27] A. Osuský, D. Javorský, Word importance dataset, 2024. URL: http://hdl.handle.net/11234/1-5520. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
[28] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37–46.
[29] S. Merity, C. Xiong, J. Bradbury, R. Socher, Pointer sentinel mixture models, 2016. arXiv:1609.07843.
[30] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[31] W. Webber, A. Moffat, J. Zobel, A similarity measure for indefinite rankings, ACM Transactions on Information Systems (TOIS) 28 (2010) 1–38.

A. Metrics on Individual Domains

In Table 9, we present our proposed metrics computed between human rankings for individual domains within the Word Importance Dataset. Notably, the poetry domain exhibits relatively high k-inter values, whereas the Pearson correlation and overlap metrics are low. This indicates that humans agreed more on which words are important than on the order of their importance.

In Table 8, we present the overlap of the models from Section 6 across individual domains within the WIDS.
Our models outperform the TF-IDF baseline in all domains except for the news domain. In a few cases and metrics, the Random ranking outperforms some methods. It is worth noting that each domain includes only 10 examples, which may lead to significant variability in the results. Despite this, human performance consistently exceeds that of the models across all domains.

In Table 10, we present all of our proposed metrics computed for the models from Section 6 on individual domains within the Word Importance Dataset. For these evaluations, the TF-IDF was created using text solely from the respective individual domain. It is apparent that the performance ordering of the models is not consistent across the different domains, likely due to each domain having only 10 examples. An interesting observation is that TF-IDF performs best on the news domain, whereas it underperforms in the other domains.

Table 8: Overlap of models from Section 6 on the Word Importance Dataset for individual domains. The "Random" category represents the average metrics of 100 random predictions, while "Humans" denotes the average of human metrics from Table 6. Sorted according to overlap score across the domains (not shown here).

Model      News    Lit.    Poetry   Jokes   Trans.
Random     0.066   0.061   0.062    0.068   0.065
PI         0.068   0.185   0.120    0.106   0.095
TF-IDF     0.126   0.055   0.044    0.088   0.142
BIM-0.75   0.075   0.196   0.121    0.106   0.125
BIM-0.25   0.069   0.212   0.078    0.130   0.166
LIM-0.5    0.063   0.156   0.158    0.143   0.167
LIM-0.75   0.048   0.128   0.135    0.192   0.206
BIM-0.5    0.079   0.170   0.114    0.084   0.270
NLI        0.047   0.221   0.207    0.153   0.120
LIM-0.25   0.115   0.133   0.159    0.183   0.299
Humans     0.348   0.257   0.237    0.474   0.413

Table 9: Metrics from Section 5.2 computed between our annotators on individual domains from the Word Importance Dataset.

News
Annotators  Pearson  1-inter  2-inter  3-inter  4-inter  5-inter  Overlap
pair1-3     0.378    0.90     0.60     0.40     0.00     0.00     0.247
pair1-2     0.412    0.90     0.60     0.30     0.00     0.00     0.286
pair2-3     0.631    1.00     0.70     0.30     0.10     0.00     0.511
Average     0.474    0.93     0.63     0.33     0.03     0.00     0.348

Literature
pair1-2     0.472    0.80     0.50     0.20     0.00     0.00     0.211
pair1-3     0.440    0.70     0.60     0.40     0.20     0.10     0.220
pair2-3     0.535    0.80     0.50     0.30     0.10     0.00     0.340
Average     0.483    0.77     0.53     0.30     0.10     0.03     0.257

Poetry
pair1-2     0.413    0.90     0.60     0.10     0.00     0.00     0.189
pair2-3     0.422    0.70     0.50     0.20     0.10     0.00     0.220
pair1-3     0.481    0.90     0.70     0.30     0.30     0.00     0.301
Average     0.439    0.83     0.60     0.20     0.13     0.00     0.237

Jokes
pair2-3     0.597    1.00     1.00     0.80     0.40     0.10     0.462
pair1-3     0.641    1.00     1.00     0.90     0.50     0.30     0.475
pair1-2     0.614    1.00     1.00     0.60     0.50     0.20     0.484
Average     0.617    1.00     1.00     0.77     0.47     0.20     0.474

Transcripts
pair1-3     0.565    1.00     0.90     0.60     0.50     0.10     0.354
pair2-3     0.683    1.00     0.80     0.70     0.40     0.00     0.437
pair1-2     0.671    1.00     0.90     0.70     0.30     0.10     0.450
Average     0.640    1.00     0.87     0.67     0.40     0.07     0.413
Table 10: Metrics from Section 5.2 computed for models from Section 6 on individual domains from the Word Importance Dataset. Sorted by overlap within each domain.

News
Model      Pearson  1-inter  2-inter  3-inter  4-inter  5-inter  Overlap
NLI        0.17     0.80     0.30     0.00     0.00     0.00     0.047
LIM-0.75   0.168    0.60     0.10     0.00     0.00     0.00     0.048
LIM-0.5    0.177    0.40     0.30     0.10     0.00     0.00     0.063
Random     0.178    0.52     0.12     0.01     0.00     0.00     0.066
PI         0.189    0.80     0.20     0.00     0.00     0.00     0.068
BIM-0.25   0.191    0.50     0.10     0.10     0.00     0.00     0.069
BIM-0.75   0.216    0.70     0.20     0.00     0.00     0.00     0.075
BIM-0.5    0.183    0.30     0.20     0.10     0.00     0.00     0.079
LIM-0.25   0.235    0.70     0.40     0.00     0.00     0.00     0.115
tf-idf     0.229    0.60     0.30     0.00     0.00     0.00     0.126

Literature
tf-idf     0.231    0.60     0.20     0.00     0.00     0.00     0.055
Random     0.251    0.55     0.15     0.02     0.00     0.00     0.061
LIM-0.75   0.292    0.80     0.40     0.10     0.00     0.00     0.128
LIM-0.25   0.302    0.70     0.40     0.00     0.00     0.00     0.133
LIM-0.5    0.302    0.80     0.20     0.10     0.00     0.00     0.156
BIM-0.5    0.345    0.90     0.50     0.10     0.00     0.00     0.170
PI         0.34     0.70     0.50     0.10     0.10     0.00     0.185
BIM-0.75   0.354    1.00     0.30     0.10     0.00     0.00     0.196
BIM-0.25   0.379    0.90     0.50     0.10     0.00     0.00     0.212
NLI        0.438    0.90     0.70     0.30     0.10     0.00     0.221

Poetry
tf-idf     0.236    0.50     0.10     0.00     0.00     0.00     0.044
Random     0.255    0.52     0.13     0.01     0.00     0.00     0.062
BIM-0.25   0.279    0.80     0.40     0.00     0.00     0.00     0.078
BIM-0.5    0.293    0.70     0.50     0.00     0.00     0.00     0.114
PI         0.364    0.90     0.60     0.20     0.00     0.00     0.120
BIM-0.75   0.356    0.80     0.50     0.10     0.00     0.00     0.121
LIM-0.75   0.35     0.80     0.60     0.20     0.00     0.00     0.135
LIM-0.5    0.363    0.70     0.60     0.20     0.10     0.00     0.158
LIM-0.25   0.357    0.90     0.40     0.10     0.10     0.10     0.159
NLI        0.435    0.90     0.70     0.50     0.00     0.00     0.207

Jokes
Random     0.179    0.55     0.16     0.02     0.00     0.00     0.068
BIM-0.5    0.214    0.70     0.30     0.00     0.00     0.00     0.084
tf-idf     0.203    0.60     0.30     0.10     0.00     0.00     0.088
PI         0.264    0.80     0.40     0.10     0.10     0.00     0.106
BIM-0.75   0.238    0.80     0.20     0.10     0.00     0.00     0.106
BIM-0.25   0.269    0.70     0.40     0.20     0.00     0.00     0.130
LIM-0.5    0.253    0.70     0.30     0.10     0.00     0.00     0.143
NLI        0.325    1.00     0.50     0.10     0.10     0.10     0.153
LIM-0.25   0.327    0.80     0.80     0.20     0.00     0.00     0.183
LIM-0.75   0.312    1.00     0.60     0.10     0.00     0.00     0.192

Transcripts
Random     0.213    0.55     0.14     0.01     0.00     0.00     0.065
PI         0.248    0.70     0.30     0.00     0.00     0.00     0.095
NLI        0.294    0.90     0.60     0.20     0.00     0.00     0.120
BIM-0.75   0.301    0.80     0.30     0.30     0.10     0.00     0.125
tf-idf     0.309    0.90     0.30     0.00     0.00     0.00     0.142
BIM-0.25   0.355    0.90     0.60     0.30     0.10     0.10     0.166
LIM-0.5    0.33     1.00     0.60     0.10     0.10     0.00     0.167
LIM-0.75   0.409    0.80     0.70     0.50     0.20     0.00     0.206
BIM-0.5    0.438    0.90     0.60     0.50     0.10     0.10     0.270
LIM-0.25   0.446    1.00     0.60     0.40     0.00     0.00     0.299