InsBERT: Word importance from artificial insertions

Adam Osuský, Dávid Javorský and Ondřej Bojar
Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics, Malostranské nám. 25, Prague, 118 00, Czech Republic
adam.osusky746@student.cuni.cz (A. Osuský); javorsky@ufal.mff.cuni.cz (D. Javorský); bojar@ufal.mff.cuni.cz (O. Bojar)
https://ufal.mff.cuni.cz/david-javorsky (D. Javorský); https://ufal.mff.cuni.cz/ondrej-bojar (O. Bojar)
ORCID: 0000-0003-2516-2535 (D. Javorský); 0000-0002-0606-0050 (O. Bojar)
Workshop on Automata, Formal and Natural Languages 2024 (WAFNL 2024)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
We investigate the quantification of word importance by introducing a novel self-supervised task that modifies masked language modeling. Instead of predicting masked words, our approach involves learning to identify which words were inserted. We hypothesize that the resulting models will predict a higher likelihood of insertion for less important words. We experiment with two different insertion strategies: the List Inserting Method (LIM) and the BERT Inserting Method (BIM). We outline the process for gathering manually estimated word importance data and describe the construction of a dataset for evaluating our methods. Our results indicate that our modified language modeling surpasses baselines and is competitive with existing research in assessing word importance.

Keywords: word importance, self-supervision, word insertion, synthetic data

1. Introduction

A significant amount of human knowledge and communication is now recorded as digital text [1, 2], often in raw, unstructured form [3]. This necessitates methods to make text searchable and easily summarized, driving the NLP community's interest in quantifying word importance as a possible solution.

The concept of identifying important words dates back to the 1950s [4], with early methods based on word frequencies such as TF-IDF, which remains widely used in modern NLP applications [5]. Various methods have been explored for determining word importance in tasks such as querying [6], summarization [7], text classification [8], and keyphrase extraction [9, 10].

Current approaches for assessing word importance involve comparing the spatial distribution of words in the original versus shuffled text [11], exploiting attribution methods [12], utilizing the χ² test [13, 14], or interpreting attention in attention-based models [15], although their interpretability is debated [16].

Kafle and Huenerfauth [17] collect annotations of word importance as real numbers from 0 to 1, which they later use for captioning to aid those who are deaf or hard of hearing [18].

Interestingly, [19] defines word importance ranks as the difference in the classifier's confidence for the target label when a specific word is included in the text versus when it is removed. This approach reveals that adversarial attack algorithms in NLP primarily disrupt the distribution of this word importance.

In [12], a method is explored to derive word significance from models trained for Natural Language Inference (NLI) and Paraphrase Identification (PI) by using an attribution method to compute scores for each input word, identifying those that contribute most to the model's decision. The approach involves training an interpreter to mask as many words as possible while still preserving the original prediction. We compare the performance of our approach with this work.

This study explores assessing word importance comprehensively, from collecting data to creating and evaluating an automatic word importance scorer. More precisely, the contributions of this work are: (1) a precise definition of word importance and proposed metrics for its evaluation, (2) a small multi-domain word-importance dataset in English annotated by three annotators, (3) a novel self-supervised machine learning method for predicting word importance.
This self-supervised approach modifies BERT's [20] methodology to predict artificially inserted words rather than masked ones, examining two insertion methods: the List Inserting Method (LIM; inserting randomly from a word list) and the BERT Inserting Method (BIM; inserting using a BERT model). The results seem to indicate that our proposed method is superior to baselines such as TF-IDF and is on par with, or even better than, existing approaches to calculating word importance.¹

¹ https://github.com/adam-osusky/predicting-word-importance

2. Word Importance

Word importance (WI) depends on its intended usage. Depending on objectives, such as text summarization or grammar correction, the same words may hold different degrees of importance. In this work, we focus on semantic importance, and we define it by drawing inspiration from prior works: Kafle and Huenerfauth [17] emphasize the loss incurred by removing a word, and Javorský et al. [12] focus on the meaning contribution added by a word. We combine these perspectives into a unified definition of WI:

    Importance is the measure of a word's contribution to the overall meaning of the context, indicating the extent to which the removal of a word would diminish the information conveyed by the context.

We aim to collect human-annotated data for word importance, and therefore we need to clearly formulate instructions for annotators. Even though the most intuitive approach would be to let annotators score each word by a real number within the range [0, 1], as is done in other studies [17, 12], we find such a task very difficult for annotators. Therefore, we represent WI as an importance ranking.
By ranking, we mean an ordering of word positions within a context, where the word position ranked as 1 is considered the most important. We rank word positions because the same word can appear multiple times in a context with varying levels of importance.

In our initial experiments, we observed that when annotators were unrestricted in the number of words they could rank, they tended to sequentially select key nouns from the subjects and objects, as well as the verbs that connect these elements in the sentences. However, this behavior does not align with our objectives.

Our primary goal is to identify the most important words, so the ranking can include only a subset of the word positions in a given context. For a context of length m, we aim to rank n positions where 1 ≤ n ≤ ⌈0.1 · m⌉. Word positions that are not ranked are assigned the rank n + 1, referred to as the "last rank". We refer to this restriction as a rank limit of 10%. We argue that it focuses the annotators' attention on identifying only the most essential words in a given context. By restricting the number of words that can be ranked to 10% of the total word positions, we force a more selective process, making the annotators concentrate on the most salient words rather than be overwhelmed by the numerous possible rankings.

Instructions for Annotators. The full set of instructions provided to our annotators is as follows:

1. Arrange the words in descending order by their importance. You can rank at most 10 percent of the words, or choose to rank fewer if desired.
2. Create an order for the most important ones; any unranked words will receive the last rank, and they should be considered to have similar importance. At least one word must be ranked.
3. Click on words to select them. The selection order determines their ranking. Clicking on a selected word will deselect it. The first selected word is the most important.
4. Importance is the measure of a word's contribution to the overall meaning of the context, indicating the extent to which the removal of a word would diminish the information conveyed by the context.
5. Contexts span five diverse domains: news, literature, poetry, jokes, and transcribed spoken language.
6. In the transcribed spoken language domain, words may take the form of "(PERSON#NUMBER)" at the beginning of a person's reply, indicating the speaker's identity. These tags are non-clickable and non-rankable. Additionally, words in the form "PERSON#NUMBER" serve as references to other persons' names within the utterance.

A simple annotation tool was used for data collection. This tool allows annotators to rank words by clicking on them in sequence. If an annotator wants to insert a word into the middle of an already selected ranking, they must unselect the subsequent words and then reselect them in the desired order. While it might seem more convenient to allow direct insertion of a word into the middle of the ranking, the current approach has its benefits. By requiring annotators to reassess the subsequent rankings when making changes, the process encourages a more thoughtful and deliberate evaluation of the overall ranking.

3. Data Collection

To ensure diversity, we target various domains and their corresponding English datasets: news, the News Commentary dataset [21]; literature, data from [22]; poetry, data from [23]; jokes, data from [24]; and meeting transcripts, the ELITR Minuting Corpus [25]. From each domain, we manually select 10 contexts (each context possibly containing more sentences), ensuring that the contexts are around 60 words long. To achieve better granularity for certain words like "don't" and "I'm", the contexts are tokenized using the Moses tokenizer [26]. The dataset statistics are outlined in Table 1.

Each of the 50 contexts is annotated by three annotators who are non-native English speakers. These contexts with annotations form the Word Importance Dataset (WIDS). We make the dataset available at [27].

Table 1: Statistics of our Word Importance Dataset. The mean and standard deviation are computed on lengths of contexts.

Domain       Contexts   Characters: count, mean±std   Moses-tokens: count, mean±std
News         10         2565, 256.5±26.1              529, 52.9±4.0
Literature   10         2207, 220.7±17.1              601, 60.1±7.2
Poetry       10         1776, 177.6±27.5              540, 54.0±6.2
Jokes        10         1938, 193.8±25.4              575, 57.5±7.0
Transcripts  10         2432, 243.2±26.6              616, 61.6±7.1
All          50         10918, 218.4±38.5             2861, 57.2±7.2
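As an illustration of the ranking representation described in Section 2, the following minimal sketch (ours, not part of the released tooling) turns an annotator's ordered selection of word positions into a rank vector under the 10% rank limit; the function and variable names are hypothetical.

```python
import math

def ranks_from_clicks(tokens: list[str], clicks: list[int]) -> list[int]:
    """Convert an ordered list of clicked word positions into ranks.

    The i-th clicked position receives rank i + 1; every unranked position
    receives the "last rank" n + 1, where n is the number of ranked positions.
    At most ceil(0.1 * m) positions may be ranked for a context of m words.
    """
    m = len(tokens)
    limit = math.ceil(0.1 * m)                      # rank limit of 10%
    assert 1 <= len(clicks) <= limit, "rank between 1 word and 10% of the words"
    n = len(clicks)
    ranks = [n + 1] * m                             # default: last rank
    for order, position in enumerate(clicks, start=1):
        ranks[position] = order
    return ranks

# Example: a 20-word context where positions 3 and 7 were clicked, in that order.
print(ranks_from_clicks(["w"] * 20, [3, 7]))
```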
To assess the similarity of the annotations, we compute inter-annotator agreement using Cohen's kappa [28]. We simplify the calculation by classifying each word position in every context as either "selected" or "not selected" by the annotators. In Section 5.2, we present metrics that incorporate the order of selection for a more nuanced analysis.

The computed Cohen's kappa values are shown in Table 2. It is unsurprising that poetry is one of the two domains with the least agreement. An intriguing observation is that literature displays slightly lower agreement than poetry. The domains with the highest agreement are jokes and meeting transcripts. We find these findings in line with our intuition: it is often very clear which words make jokes funny, and speech in meetings may contain many objectively unimportant words, e.g. filler words, hesitations, false starts, etc.

Table 2: Cohen's kappa coefficient between pairs of annotators within individual domains and across all domains in our Word Importance Dataset. The highest agreements within each domain are highlighted in bold.

Domain       Pair1-2   Pair1-3   Pair2-3   Average
News         0.318     0.296     0.388     0.334
Literature   0.223     0.286     0.273     0.260
Poetry       0.260     0.332     0.238     0.277
Jokes        0.533     0.630     0.533     0.565
Transcripts  0.539     0.475     0.518     0.511
All          0.380     0.406     0.395     0.393
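For concreteness, the simplified agreement computation can be reproduced with scikit-learn's implementation of Cohen's kappa. This is a minimal sketch under the simplification described above (each word position reduced to a binary selected/not-selected label), not the authors' exact evaluation script; the example vectors are made up.

```python
from sklearn.metrics import cohen_kappa_score

# One binary label per word position, concatenated over all contexts of a domain:
# 1 = the annotator ranked (selected) the position, 0 = left it unranked.
annotator_a = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
annotator_b = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]

print(round(cohen_kappa_score(annotator_a, annotator_b), 3))
```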
4. Methodology

Our approach involves fine-tuning a pre-trained BERT model [20] using automatically generated data. Specifically, we generate training text data by inserting words into existing text and then use the modified text as training data. The objective of fine-tuning is to predict which words were inserted. We hypothesize that this task will require the model to understand the importance of each word and its contribution to the overall meaning of the context, ultimately leading the model to assign higher likelihoods of insertion to less important words. This enables us to create a ranking of words in a test input text by ordering them according to their predicted word importance score. We propose two methods for creating the training dataset of inserted words: the List Inserting Method (LIM) and the BERT Inserting Method (BIM).

List Inserting Method (LIM). This method inserts words randomly from a predefined list. This list is generated by splitting the base corpus into words by white space. Consequently, words that appear more frequently in the corpus are more likely to be inserted, mirroring the original distribution.

BERT Inserting Method (BIM). This method aims to insert words that do not fit well in the sentence. This is achieved by leveraging the capabilities of another instance of the BERT model [20].² Because BERT predicts the words without any information except the text itself, we assume that they should not alter the sentence's meaning significantly. In this method, mask tokens are placed at the selected insertion positions within the text, and BERT is then used to predict the masked tokens. We prohibit predictions of neighboring tokens (those immediately before and after the masked token) and sub-word tokens, i.e. tokens that are not the beginning of a word. After filtering out these unwanted tokens, we select the prediction with the highest logit probability.

² https://huggingface.co/google-bert/bert-base-uncased

For both methods, possible positions for insertion include the places before existing words and one additional position at the end of the text. The words of the original text are obtained by splitting it on white space to determine the potential insertion positions. We insert at most one word in each position, ensuring that words are not inserted consecutively.

The positions for insertion are selected randomly. The insertion rate is defined as the ratio of the number of words to be inserted to the total number of words in the original text sample. For instance, if a text sample contains 10 words and we use an insertion rate of 0.5, we insert 5 new words into this text.

In our experiments, our goal is to effectively compare the two insertion methods, LIM and BIM, as well as evaluate their performance across different insertion rates. To achieve this, we train separate models for both methods at insertion rates of 0.25, 0.5, and 0.75.

[Figure 1: Example of text with inserted words using the LIM (left) and BIM (right) methods with an insertion rate of 0.5. Words highlighted in yellow boxes are inserted.]
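To make the two insertion strategies concrete, the sketch below shows one possible implementation of the procedure described in this section (randomly chosen gaps at a given insertion rate, at most one inserted word per gap; BIM predictions filtered to full words that differ from their immediate neighbours). It reflects our reading of the method rather than the authors' released code, and it relies on the Hugging Face fill-mask pipeline; function names such as choose_positions, lim_insert, and bim_word are illustrative.

```python
import random
from transformers import pipeline

# Masked LM used for the BERT Inserting Method (model id taken from the paper's footnote).
fill_mask = pipeline("fill-mask", model="google-bert/bert-base-uncased")


def choose_positions(num_words: int, rate: float) -> set[int]:
    """Randomly pick insertion gaps: one before each word plus one at the end of the text.

    The insertion rate is the ratio of inserted words to original words, and at most
    one word goes into any single gap, so insertions are never consecutive."""
    num_inserted = round(rate * num_words)
    return set(random.sample(range(num_words + 1), num_inserted))


def lim_insert(words: list[str], word_list: list[str], rate: float) -> list[tuple[str, int]]:
    """List Inserting Method: fill the chosen gaps with words drawn from a whitespace-split
    corpus word list (frequent words are drawn more often). Returns (word, label) pairs,
    where label 1 marks an inserted word."""
    gaps = choose_positions(len(words), rate)
    out: list[tuple[str, int]] = []
    for i in range(len(words) + 1):
        if i in gaps:
            out.append((random.choice(word_list), 1))
        if i < len(words):
            out.append((words[i], 0))
    return out


def bim_word(words: list[str], gap: int) -> str:
    """BERT Inserting Method for a single gap: place a mask there and take the
    highest-scoring prediction that is a full word and not an immediate neighbour."""
    masked = " ".join(words[:gap] + ["[MASK]"] + words[gap:])
    neighbours = {w.lower() for w in words[max(0, gap - 1):gap + 1]}
    for candidate in fill_mask(masked, top_k=50):
        token = candidate["token_str"]
        if not token.startswith("##") and token.lower() not in neighbours:
            return token
    return random.choice(words)  # fallback; with top_k=50 this should rarely trigger
```

A training example is then the sequence of original and inserted words together with the binary labels, which serve as the prediction target for fine-tuning.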
In future endeavors, it would be interesting to extend this research by training models on datasets created using both LIM and BIM, potentially combining or varying insertion rates.

4.1. Example Text with Inserted Words

In Figure 1, we illustrate an example of text from our preprocessed WikiText dataset (Section 5.1), where words have been inserted using both the LIM and BIM methods. The inserted words in the BIM method often appear superfluous, adding information to the sentences. Notably, there is a frequent insertion of apostrophes, occurring more often than desired. To investigate this phenomenon further, we conducted a simple experiment on a subset of 100 examples to analyze how the frequency of apostrophes changes with varying insertion rates. Refer to Figure 2 for the results. It is observed that the frequency of apostrophes converges to approximately 0.22 when the insertion rate is at least 0.5.

Conversely, in the LIM method, the inserted words sometimes introduce information that seems out of context. Additionally, some inserted words include punctuation marks, such as "(April" or "area,", as seen in the last sentence on the left in Figure 1.

[Figure 2: Frequency of apostrophes in inserted words by the BIM method with varying insertion rates, plotted as the ratio of apostrophe insertions to all insertions against insertion rates from 0.0 to 1.0.]

5. Experiments

5.1. Training Details

We detail the preprocessing methods applied to the WikiText dataset and outline the construction of the training regime.

Data. As the base text corpus into which we insert new words, we have selected the WikiText dataset [29]. This dataset comprises articles from Wikipedia³ that are classified as either Good or Featured articles, according to the criteria specified by Wikipedia editors at the time of creation.

We use version wikitext-103-raw-v1⁴ from the HuggingFace datasets library [2]. Each example in the dataset is either a paragraph or a title. For our specific use case, we preprocess this dataset by removing examples that are titles. Additionally, the dataset is tokenized using the Moses tokenizer [26]. Since we employ a Transformer [15] model that already incorporates its own tokenization, we detokenize the text. We retain the original train-validation-test splits. For detailed statistics regarding our preprocessed WikiText, refer to Table 3.

³ https://www.wikipedia.org/
⁴ https://huggingface.co/datasets/wikitext

Table 3: Statistics for train, validation, and test splits of our preprocessed WikiText. The statistics are computed on lengths of paragraphs, where std stands for empirical standard deviation. Words are obtained by splitting the detokenized text by white space.

                      Train          Validation     Test
Paragraphs            859,532        1,841          2,183
Characters            509,512,733    1,083,136      1,217,025
Characters mean±std   592.7±404.2    588.3±385.4    557.5±404.1
Words                 84,208,748     178,815        201,013
Words mean±std        97.9±67.1      97.1±63.8      92.0±67.3
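A rough sketch of the preprocessing just described, under two assumptions of ours: title examples in the raw WikiText release can be recognised as lines wrapped in "=" markers, and the sacremoses package stands in for the Moses tokenizer scripts. It is not the authors' released pipeline.

```python
from datasets import load_dataset
from sacremoses import MosesTokenizer, MosesDetokenizer

mt, md = MosesTokenizer(lang="en"), MosesDetokenizer(lang="en")

def is_title(text: str) -> bool:
    # Heading lines in the raw WikiText release look like " = Some Title = " (assumption).
    stripped = text.strip()
    return stripped.startswith("=") and stripped.endswith("=")

def normalize(example: dict) -> dict:
    # Tokenize with Moses for consistency with WIDS, then detokenize again,
    # since BERT applies its own WordPiece tokenization to raw text.
    tokens = mt.tokenize(example["text"], escape=False)
    return {"text": md.detokenize(tokens)}

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")   # keeps train/validation/test splits
wikitext = wikitext.filter(lambda ex: ex["text"].strip() and not is_title(ex["text"]))
wikitext = wikitext.map(normalize)
```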
Models. We create six datasets using both the LIM and BIM techniques, each with insertion rates of 0.25, 0.5, and 0.75. Subsequently, we train six models, each corresponding to a distinct combination of insertion method and rate.

To ensure a fair comparison and avoid introducing bias due to differences in hyper-parameters, we use identical settings for all models. Hyper-parameters were selected empirically based on initial experiments. We use a learning rate of 0.0032, a batch size of 256, the Adam optimizer [30] with default betas (0.9, 0.999), and a linear learning rate scheduler with a linear warmup of 350 steps. Starting from the BERT⁵ pre-trained model, we fine-tune on each of the datasets for 5 epochs.

⁵ https://huggingface.co/google-bert/bert-base-uncased

The performance of the individual models in the classification task is shown in Table 4. The LIM models consistently outperform the BIM models. Given this discrepancy and the distribution of inserted punctuation marks discussed in Section 4.1, it indicates that the BIM data present a more challenging task, as the inserted words blend more seamlessly with the context.

Table 4: Performance of models on the test split from their respective generated data. The model names indicate whether the List Inserting Method (LIM) or BERT Inserting Method (BIM) was used for generating data, along with the insertion rate. Loss refers to cross-entropy loss, and in the computation of F1 score, precision, and recall, the positive target is the class of inserted words. Best results are in bold.

Model      F1      Loss    Precision   Recall
BIM-0.25   0.929   0.046   0.930       0.927
LIM-0.25   0.952   0.031   0.957       0.947
BIM-0.5    0.953   0.048   0.950       0.957
LIM-0.5    0.972   0.028   0.974       0.971
BIM-0.75   0.982   0.027   0.980       0.984
LIM-0.75   0.985   0.021   0.985       0.985

5.2. Evaluation for WI ranking

Our trained models predict logits for the probability of word insertion. We construct the ranking by ordering the BERT-tokens in one context in ascending order of their insertion probabilities. Since the Word Importance Dataset (WIDS) is pretokenized by the Moses tokenizer, we use the logits of the first BERT-token to score the original Moses-token if a Moses-token is split into multiple BERT-tokens.

For the human reference, we calculate the average rank of each token based on the rankings provided by all three annotators and then order the words according to these average ranks. With only three annotators, a majority of words still fall into the lowest rank, leading to inconsistencies between model predictions and averaged annotations, as they can result in different lowest ranks. To ensure consistency in evaluation, we apply the 10% rank limit to both the averaged annotations and the model predictions.

Since 90% of the positions fall into the lowest rank, this creates challenges in designing effective evaluation metrics. To address these issues, we propose three metrics, each progressively refining and incorporating desired properties to better align with our evaluation goals.

Pearson correlation. The simplest and best-known approach is to calculate the sample Pearson correlation coefficient on the ranks of word positions over all positions and all contexts in the dataset. However, this method is not ideal because 90% of word positions within a given context fall into the lowest rank. Our primary focus is on achieving higher agreement within the top 10%, which is not adequately emphasized by this correlation measure.

k-inter. Another perspective on rankings is to consider the words that do not have the last rank and disregard their specific order. By doing this, we view the rankings as indicators of which words are important, allowing us to measure the extent of the intersection between different rankings. We thus propose a new metric, k-inter, where we filter the ranking and keep only word positions that do not have the last rank. We then compute the fraction of context pairs where the intersection of their filtered rankings has at least k elements. We examine values of k ∈ {1, 2, 3}.
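The following sketch shows our reading of how a model ranking is obtained from per-word insertion scores and how k-inter can then be computed; counting "context pairs" as one comparison per context is our assumption, and the names (insertion_scores, top_positions, k_inter) are illustrative.

```python
import math

def top_positions(insertion_scores: list[float], frac: float = 0.1) -> set[int]:
    """Keep the ceil(frac * m) word positions with the LOWEST insertion scores,
    i.e. the words the model is least willing to treat as insertions."""
    n = math.ceil(frac * len(insertion_scores))
    order = sorted(range(len(insertion_scores)), key=lambda i: insertion_scores[i])
    return set(order[:n])

def k_inter(selected_a: list[set[int]], selected_b: list[set[int]], k: int) -> float:
    """Fraction of contexts whose two filtered rankings share at least k positions."""
    hits = sum(1 for a, b in zip(selected_a, selected_b) if len(a & b) >= k)
    return hits / len(selected_a)
```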
Overlap. The limitation of k-inter is that it does not consider specific rank values, only if the words are in the top 10%. We aim to assign more weight to agreements on specific rank values, prioritizing the match on higher-ranked agreements over lower-ranked ones. We thus propose to use the average overlap metric, as described by Webber et al. [31]. First, we derive an ordered list of words from the ranking. The agreement between lists l and l' at depth d is defined as A(l, l', d) = |l_{:d} ∩ l'_{:d}| / d, where l_{:d} represents the first d elements of the list. The average overlap at depth k is then AO(l, l', k) = (1/k) Σ_{d=1}^{k} A(l, l', d). For context pairs of rankings, we compute the average overlap for each pair and then average these values, which we refer to as overlap. The depth is chosen differently for each pair: for a context of length m, the depth is set to ⌈0.1 · m⌉, to be consistent with our rank limit of 10%.
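A direct transcription of these formulas into code (our sketch, not the authors' implementation; list elements are word positions ordered from most to least important):

```python
import math

def average_overlap(l1: list[int], l2: list[int], k: int) -> float:
    """AO(l, l', k) = (1/k) * sum_{d=1..k} |l[:d] ∩ l'[:d]| / d."""
    total = 0.0
    for d in range(1, k + 1):
        total += len(set(l1[:d]) & set(l2[:d])) / d
    return total / k

def overlap(rankings_a: list[list[int]], rankings_b: list[list[int]], lengths: list[int]) -> float:
    """Average AO over all contexts, with depth ceil(0.1 * m) for a context of m words."""
    scores = [average_overlap(a, b, math.ceil(0.1 * m))
              for a, b, m in zip(rankings_a, rankings_b, lengths)]
    return sum(scores) / len(scores)
```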
6. Results

We first evaluate the pair-wise agreement between annotators using these metrics, which we present in Table 5. This evaluation complements Cohen's kappa from Section 3. The order of annotator pairs remains consistent for both the Pearson correlation and the overlap metric. The k-inter values for k of 1 and 2 are relatively high compared to the Pearson correlation or the overlap, indicating that the annotators agree on the selection of the most important words but not that well on their order. This supports our decision to let annotators focus only on the most important words and not mentally overload them with the vast number of possible rankings. In Table 6, we further present the overlap between annotators within the individual domains of WIDS. Annotators show the highest similarity in the jokes domain and the lowest in the poetry domain. This observation aligns with the results in Table 2. For other metrics on individual domains, see Appendix A.

Table 5: Metrics from Section 5.2 computed between our annotators on the Word Importance Dataset.

Annotators   Pearson   1-inter   2-inter   3-inter   Overlap
Pair1-3      0.534     0.90      0.76      0.52      0.319
Pair1-2      0.555     0.92      0.72      0.38      0.324
Pair2-3      0.602     0.90      0.70      0.46      0.394
Average      0.563     0.91      0.73      0.45      0.346

Table 6: Overlap computed between our annotators on the Word Importance Dataset, for individual domains.

Domain       Pair 1-2   Pair 1-3   Pair 2-3   Average
News         0.286      0.247      0.511      0.348
Literature   0.211      0.220      0.340      0.257
Poetry       0.189      0.301      0.220      0.237
Jokes        0.484      0.475      0.462      0.474
Transcripts  0.450      0.354      0.437      0.413

Finally, we evaluate the performance of all six of our trained models. Additionally, we include random predictions as a baseline for our metrics and the average human performance from Table 5 as an upper bound. As an additional baseline, we include term frequency–inverse document frequency (TF-IDF), computed on the Word Importance Dataset without any preprocessing. Furthermore, we include two models, PI (Paraphrase Identification) and NLI (Natural Language Inference), developed by Javorský et al. [12]. We obtain rankings from all models by ordering the words according to their significance scores.

The results are presented in Table 7, indicating that our models perform reasonably well. They surpass random predictions and TF-IDF across all metrics and are comparable to the NLI model. Notably, LIM-0.25 even exceeds the NLI model in both the overlap and Pearson correlation metrics. Metrics that consider the order of selected words show that our models are approximately halfway to achieving human-level performance. They approach human performance in terms of 1-inter but lag significantly behind in the higher k-inter metrics.

Table 7: Evaluation of models from Section 6 on the Word Importance Dataset. The "Random" category represents the average metrics of 100 random predictions, while "Humans" denotes the average of human metrics from Table 5. The metrics are defined in Section 5.2.

Model      Pearson   1-inter   2-inter   3-inter   Overlap
Random     0.256     0.54      0.13      0.01      0.061
PI         0.321     0.78      0.40      0.08      0.114
TF-IDF     0.309     0.66      0.20      0.04      0.121
BIM-0.75   0.335     0.82      0.32      0.12      0.125
BIM-0.25   0.341     0.76      0.40      0.14      0.131
LIM-0.5    0.328     0.72      0.40      0.12      0.137
LIM-0.75   0.352     0.80      0.48      0.18      0.142
BIM-0.5    0.344     0.70      0.42      0.14      0.143
NLI        0.374     0.90      0.56      0.22      0.150
LIM-0.25   0.376     0.82      0.52      0.14      0.178
Humans     0.563     0.91      0.73      0.45      0.346

It is quite surprising that the LIM approach is superior to BIM, suggesting that simple methods are sometimes more effective. We hypothesize that the words inserted by BERT are so well suited to the surrounding context that it is very difficult to detect them, which effectively decreases the useful learning signal they provide.

For readers interested in a detailed view of all metrics across individual domains, refer to Appendix A.

7. Conclusion

In this paper, we define word importance, collect annotations for a small multi-domain word-importance dataset in English, propose metrics for its evaluation, and introduce a novel self-supervised machine learning method: the goal is to predict inserted words in the text. Our results demonstrate that our method outperforms baseline models and is comparable to prior work on word importance.

Possible future work might benefit from more experiments with BIM or from combining LIM and BIM, potentially leading to more competitive results. Experimenting with smaller insertion ratios is another potential avenue.

Limitations. One of the primary limitations of our study is the size of the Word Importance Dataset, since it includes only 50 relatively short contexts of approximately 60 words each. Varying lengths of context might contribute to better generalization. The study compares importance scores to only one other indicator of word significance, and it also lacks an evaluation of the importance scores on a downstream task.

Another limitation is the small number of annotators. With a larger pool of annotators, the data in the Word Importance Dataset would likely exhibit lower variance. This would result in higher-quality averaged rankings that are more closely aligned with the true distribution.

Finally, the work does not provide an evaluation of importance scores on the word-importance dataset collected by Kafle and Huenerfauth [17].

Acknowledgments

The work has been partially supported by the grant 272323 of the Grant Agency of Charles University, grant 19-26934X (NEUREM3) of the Czech Science Foundation, and SVV project number 260 698. Computational resources were provided by the e-INFRA CZ project (ID: 90254), supported by the Ministry of Education, Youth and Sports of the Czech Republic.
References

[1] G. Penedo, H. Kydlíček, L. von Werra, T. Wolf, Fineweb, 2024. URL: https://huggingface.co/datasets/HuggingFaceFW/fineweb. doi:10.57967/hf/2092.
[2] Q. Lhoest, A. V. del Moral, Y. Jernite, A. Thakur, P. von Platen, S. Patil, J. Chaumond, M. Drame, J. Plu, L. Tunstall, et al., Datasets: A community library for natural language processing, arXiv preprint arXiv:2109.02846 (2021).
[3] H. Hassani, C. Beneki, S. Unger, M. T. Mazinani, M. R. Yeganegi, Text mining in big data analytics, Big Data and Cognitive Computing 4 (2020) 1.
[4] H. P. Luhn, The automatic creation of literature abstracts, IBM Journal of Research and Development 2 (1958) 159–165.
[5] M. Das, P. Alphonse, et al., A comparative study on TF-IDF feature weighting method and its analysis using unstructured dataset, arXiv preprint arXiv:2308.04037 (2023).
[6] Z. Dai, J. Callan, Context-aware document term weighting for ad-hoc search, in: Proceedings of The Web Conference 2020, 2020, pp. 1897–1907.
[7] K. Hong, A. Nenkova, Improving the estimation of word importance for news multi-document summarization, in: S. Wintner, S. Goldwater, S. Riezler (Eds.), Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Gothenburg, Sweden, 2014, pp. 712–721. URL: https://aclanthology.org/E14-1075. doi:10.3115/v1/E14-1075.
[8] I. Sheikh, I. Illina, D. Fohr, G. Linarès, Learning word importance with the neural bag-of-words model, in: P. Blunsom, K. Cho, S. Cohen, E. Grefenstette, K. M. Hermann, L. Rimell, J. Weston, S. W.-t. Yih (Eds.), Proceedings of the 1st Workshop on Representation Learning for NLP, Association for Computational Linguistics, Berlin, Germany, 2016, pp. 222–229. URL: https://aclanthology.org/W16-1626. doi:10.18653/v1/W16-1626.
[9] M. Song, Y. Feng, L. Jing, A survey on recent advances in keyphrase extraction from pre-trained language models, Findings of the Association for Computational Linguistics: EACL 2023 (2023) 2153–2164.
[10] B. Xie, J. Song, L. Shao, S. Wu, X. Wei, B. Yang, H. Lin, J. Xie, J. Su, From statistical methods to deep learning, automatic keyphrase prediction: A survey, Information Processing & Management 60 (2023) 103382.
[11] A. Mehri, M. Jamaati, H. Mehri, Word ranking in a single document by Jensen–Shannon divergence, Physics Letters A 379 (2015) 1627–1632.
[12] D. Javorský, O. Bojar, F. Yvon, Assessing word importance using models trained for semantic tasks, 2023. arXiv:2305.19689.
[13] X. Li, X. Wu, X. Hu, F. Xie, Z. Jiang, Keyword extraction based on lexical chains and word co-occurrence for Chinese news web pages, in: 2008 IEEE International Conference on Data Mining Workshops, IEEE, 2008, pp. 744–751.
[14] H. Jiao, Q. Liu, H.-b. Jia, Chinese keyword extraction based on N-gram and word co-occurrence, in: 2007 International Conference on Computational Intelligence and Security Workshops (CISW 2007), IEEE, 2007, pp. 152–155.
[15] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[16] S. Serrano, N. A. Smith, Is attention interpretable?, arXiv preprint arXiv:1906.03731 (2019).
[17] S. Kafle, M. Huenerfauth, A corpus for modeling word importance in spoken dialogue transcripts, in: N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), European Language Resources Association (ELRA), Miyazaki, Japan, 2018. URL: https://aclanthology.org/L18-1016.
[18] S. Kafle, P. Yeung, M. Huenerfauth, Evaluating the benefit of highlighting key words in captions for people who are deaf or hard of hearing, in: Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, 2019, pp. 43–55.
[19] L. Shen, X. Zhang, S. Ji, Y. Pu, C. Ge, X. Yang, Y. Feng, TextDefense: Adversarial text detection based on word importance entropy, arXiv preprint arXiv:2302.05892 (2023).
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[21] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 2214–2218. URL: http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.
[22] R. Nagyfi, Dataset card for Project Gutenberg - English language ebooks, https://huggingface.co/datasets/sedthh/gutenberg_english, 2023. Accessed: 2024-03-28.
[23] A. Parrish, GitHub repository for gutenberg-poetry-corpus, https://github.com/aparrish/gutenberg-poetry-corpus, 2018. Accessed: 2024-03-28.
[24] SocialGrep, Dataset card for one-million-reddit-jokes, https://huggingface.co/datasets/SocialGrep/one-million-reddit-jokes, 2021. Accessed: 2024-03-28.
[25] A. Nedoluzhko, M. Singh, M. Hledíková, T. Ghosal, O. Bojar, ELITR Minuting Corpus: A novel dataset for automatic minuting from multi-party meetings in English and Czech, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 3174–3182.
[26] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, E. Herbst, Moses: Open source toolkit for statistical machine translation, in: S. Ananiadou (Ed.), Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Association for Computational Linguistics, Prague, Czech Republic, 2007, pp. 177–180. URL: https://aclanthology.org/P07-2045.
[27] A. Osuský, D. Javorský, Word importance dataset, 2024. URL: http://hdl.handle.net/11234/1-5520. LINDAT/CLARIAH-CZ digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.
[28] J. Cohen, A coefficient of agreement for nominal scales, Educational and Psychological Measurement 20 (1960) 37–46.
[29] S. Merity, C. Xiong, J. Bradbury, R. Socher, Pointer sentinel mixture models, 2016. arXiv:1609.07843.
[30] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[31] W. Webber, A. Moffat, J. Zobel, A similarity measure for indefinite rankings, ACM Transactions on Information Systems (TOIS) 28 (2010) 1–38.

A. Metrics on Individual Domains

In Table 9, we present our proposed metrics computed between human rankings for individual domains within the Word Importance Dataset. Notably, the poetry domain exhibits relatively high k-inter values, whereas the Pearson correlation and overlap metrics are low. This indicates that humans agreed more on which words are important than on the order of their importance.

In Table 8, we present the overlap of the models from Section 6 across individual domains within the WIDS.
Our models outperform the TF-IDF baseline in all domains except for the news domain. In a few cases and metrics, the Random ranking outperforms some methods. It is worth noting that each domain includes only 10 examples, which may lead to significant variability in the results. Despite this, human performance consistently exceeds that of the models across all domains.

In Table 10, we present all of our proposed metrics computed for the models from Section 6 on individual domains within the Word Importance Dataset. For these evaluations, the TF-IDF was created using text solely from the respective individual domain. It is apparent that the performance ordering of the models is not consistent across the different domains, likely due to each domain having only 10 examples. An interesting observation is that TF-IDF performs best on the news domain, whereas it underperforms in the other domains.

Table 8: Overlap of models from Section 6 on the Word Importance Dataset for individual domains. The "Random" category represents the average metrics of 100 random predictions, while "Humans" denotes the average of human metrics from Table 6. Sorted according to overlap score across the domains (not shown here).

Model      News    Lit.    Poetry   Jokes   Trans.
Random     0.066   0.061   0.062    0.068   0.065
PI         0.068   0.185   0.120    0.106   0.095
TF-IDF     0.126   0.055   0.044    0.088   0.142
BIM-0.75   0.075   0.196   0.121    0.106   0.125
BIM-0.25   0.069   0.212   0.078    0.130   0.166
LIM-0.5    0.063   0.156   0.158    0.143   0.167
LIM-0.75   0.048   0.128   0.135    0.192   0.206
BIM-0.5    0.079   0.170   0.114    0.084   0.270
NLI        0.047   0.221   0.207    0.153   0.120
LIM-0.25   0.115   0.133   0.159    0.183   0.299
Humans     0.348   0.257   0.237    0.474   0.413

Table 9: Metrics from Section 5.2 computed between our annotators on individual domains from the Word Importance Dataset.

News
Annotators  Pearson  1-inter  2-inter  3-inter  4-inter  5-inter  Overlap
pair1-3     0.378    0.90     0.60     0.40     0.00     0.00     0.247
pair1-2     0.412    0.90     0.60     0.30     0.00     0.00     0.286
pair2-3     0.631    1.00     0.70     0.30     0.10     0.00     0.511
Average     0.474    0.93     0.63     0.33     0.03     0.00     0.348

Literature
pair1-2     0.472    0.80     0.50     0.20     0.00     0.00     0.211
pair1-3     0.440    0.70     0.60     0.40     0.20     0.10     0.220
pair2-3     0.535    0.80     0.50     0.30     0.10     0.00     0.340
Average     0.483    0.77     0.53     0.30     0.10     0.03     0.257

Poetry
pair1-2     0.413    0.90     0.60     0.10     0.00     0.00     0.189
pair2-3     0.422    0.70     0.50     0.20     0.10     0.00     0.220
pair1-3     0.481    0.90     0.70     0.30     0.30     0.00     0.301
Average     0.439    0.83     0.60     0.20     0.13     0.00     0.237

Jokes
pair2-3     0.597    1.00     1.00     0.80     0.40     0.10     0.462
pair1-3     0.641    1.00     1.00     0.90     0.50     0.30     0.475
pair1-2     0.614    1.00     1.00     0.60     0.50     0.20     0.484
Average     0.617    1.00     1.00     0.77     0.47     0.20     0.474

Transcripts
pair1-3     0.565    1.00     0.90     0.60     0.50     0.10     0.354
pair2-3     0.683    1.00     0.80     0.70     0.40     0.00     0.437
pair1-2     0.671    1.00     0.90     0.70     0.30     0.10     0.450
Average     0.640    1.00     0.87     0.67     0.40     0.07     0.413
Table 10: Metrics from Section 5.2 computed for models from Section 6 on individual domains from the Word Importance Dataset. Sorted by overlap within each domain.

News
Model      Pearson  1-inter  2-inter  3-inter  4-inter  5-inter  Overlap
NLI        0.17     0.80     0.30     0.00     0.00     0.00     0.047
LIM-0.75   0.168    0.60     0.10     0.00     0.00     0.00     0.048
LIM-0.5    0.177    0.40     0.30     0.10     0.00     0.00     0.063
Random     0.178    0.52     0.12     0.01     0.00     0.00     0.066
PI         0.189    0.80     0.20     0.00     0.00     0.00     0.068
BIM-0.25   0.191    0.50     0.10     0.10     0.00     0.00     0.069
BIM-0.75   0.216    0.70     0.20     0.00     0.00     0.00     0.075
BIM-0.5    0.183    0.30     0.20     0.10     0.00     0.00     0.079
LIM-0.25   0.235    0.70     0.40     0.00     0.00     0.00     0.115
tf-idf     0.229    0.60     0.30     0.00     0.00     0.00     0.126

Literature
tf-idf     0.231    0.60     0.20     0.00     0.00     0.00     0.055
Random     0.251    0.55     0.15     0.02     0.00     0.00     0.061
LIM-0.75   0.292    0.80     0.40     0.10     0.00     0.00     0.128
LIM-0.25   0.302    0.70     0.40     0.00     0.00     0.00     0.133
LIM-0.5    0.302    0.80     0.20     0.10     0.00     0.00     0.156
BIM-0.5    0.345    0.90     0.50     0.10     0.00     0.00     0.170
PI         0.34     0.70     0.50     0.10     0.10     0.00     0.185
BIM-0.75   0.354    1.00     0.30     0.10     0.00     0.00     0.196
BIM-0.25   0.379    0.90     0.50     0.10     0.00     0.00     0.212
NLI        0.438    0.90     0.70     0.30     0.10     0.00     0.221

Poetry
tf-idf     0.236    0.50     0.10     0.00     0.00     0.00     0.044
Random     0.255    0.52     0.13     0.01     0.00     0.00     0.062
BIM-0.25   0.279    0.80     0.40     0.00     0.00     0.00     0.078
BIM-0.5    0.293    0.70     0.50     0.00     0.00     0.00     0.114
PI         0.364    0.90     0.60     0.20     0.00     0.00     0.120
BIM-0.75   0.356    0.80     0.50     0.10     0.00     0.00     0.121
LIM-0.75   0.35     0.80     0.60     0.20     0.00     0.00     0.135
LIM-0.5    0.363    0.70     0.60     0.20     0.10     0.00     0.158
LIM-0.25   0.357    0.90     0.40     0.10     0.10     0.10     0.159
NLI        0.435    0.90     0.70     0.50     0.00     0.00     0.207

Jokes
Random     0.179    0.55     0.16     0.02     0.00     0.00     0.068
BIM-0.5    0.214    0.70     0.30     0.00     0.00     0.00     0.084
tf-idf     0.203    0.60     0.30     0.10     0.00     0.00     0.088
PI         0.264    0.80     0.40     0.10     0.10     0.00     0.106
BIM-0.75   0.238    0.80     0.20     0.10     0.00     0.00     0.106
BIM-0.25   0.269    0.70     0.40     0.20     0.00     0.00     0.130
LIM-0.5    0.253    0.70     0.30     0.10     0.00     0.00     0.143
NLI        0.325    1.00     0.50     0.10     0.10     0.10     0.153
LIM-0.25   0.327    0.80     0.80     0.20     0.00     0.00     0.183
LIM-0.75   0.312    1.00     0.60     0.10     0.00     0.00     0.192

Transcripts
Random     0.213    0.55     0.14     0.01     0.00     0.00     0.065
PI         0.248    0.70     0.30     0.00     0.00     0.00     0.095
NLI        0.294    0.90     0.60     0.20     0.00     0.00     0.120
BIM-0.75   0.301    0.80     0.30     0.30     0.10     0.00     0.125
tf-idf     0.309    0.90     0.30     0.00     0.00     0.00     0.142
BIM-0.25   0.355    0.90     0.60     0.30     0.10     0.10     0.166
LIM-0.5    0.33     1.00     0.60     0.10     0.10     0.00     0.167
LIM-0.75   0.409    0.80     0.70     0.50     0.20     0.00     0.206
BIM-0.5    0.438    0.90     0.60     0.50     0.10     0.10     0.270
LIM-0.25   0.446    1.00     0.60     0.40     0.00     0.00     0.299