What about Grammar? Using BERT Embeddings to Explore Functional-Semantic Shifts of Semi-Lexical and Grammatical Constructions

Lauren Fonteyn
Leiden University Centre for Linguistics, Department of English Language and Culture, Arsenaalstraat 1, 2311CT, Leiden, the Netherlands
l.fonteyn@hum.leidenuniv.nl, ORCID 0000-0001-5706-8418
CHR 2020: Workshop on Computational Humanities Research, November 18–20, 2020, Amsterdam, The Netherlands

Abstract
The aim of this short paper is to extend the application of embedding-based methodologies beyond the realm of lexical semantic change. It focuses on the use of unsupervised BERT-embeddings and uncertainty measures (Classification Entropy), and assesses whether (and how) they can be used to (semi-)automatically flag possible functional-semantic changes in the use of the construction [BE about] in the Corpus of Historical American English (COHA).

Keywords
Distributional Semantics, Corpus Linguistics, Grammatical change, embeddings, BERT

1. Introduction
Given its long tradition in computational and statistical research, it comes as no surprise that the text-based humanities have embraced the use of distributional-semantic ‘vectors’ or ‘embeddings’ – i.e. (compressed) numeric vector representations of a word’s contextual distribution that serve as a proxy of that word’s meaning [e.g. 3]. In particular, in fields such as Corpus Linguistics – a subfield of Linguistics which grew around the computer-aided retrieval, annotation, and later also categorization of textual data [22] – there seems to be an unprecedented interest in vector-based distributional semantic models, which offer a quantifiable and data-driven means of studying meaning. This interest has been fueled further by the arrival of models equipped to create contextualized token vectors, which have eliminated the problems associated with polysemy/homonymy conflation [e.g. 7].
In recent years, we have also witnessed a growth in the number of studies that have utilized either “count” or “predictive” vector models [2] to study historical and diachronic corpus data [also see 19]. Such studies, which often involve examination of nearest neighbours and cosine similarities between type- and/or token-vectors over time, have provided the key to a data-driven means of detecting and describing the diachronic trajectory of, predominantly, lexical change [e.g. 11, 9, 10, 13].
One consequence of this focus on lexical change is that, at present, the number of computational distributional semantic studies that consider the functional-semantic properties of more abstract, grammatical constructions seems disproportionate compared to the interest in the phenomenon within the (Corpus) Linguistic community. Much like lexical semantics, the function(s) and underlying meaning(s) of grammatical structures – which are often notoriously polysemous and cover a broad range of nuanced, abstract meanings – are prone to change, and many linguists believe that the continued discovery, description and analysis of such changes plays an essential role in fleshing out our understanding of the mechanisms and motivations of language change.
A logical and necessary continuation in the pursuit of automated semantic (shift) detection in large diachronic corpora would therefore involve further, more in-depth explorations of the extent to which embeddings can be employed to capture the (changing) functional-semantic properties of grammar. Given that different components of language have differing diachronic dynamics [21, 12], they may pose different challenges in the development of unsupervised, embedding-based means of detecting diachronic change in large corpora – and it is only by exploring the functional-semantic properties of grammar (at whatever scale and level of detail is deemed reasonable) that these challenges and, consequently, their solutions, can be discovered.
The present study, then, sets out to do precisely that: explore some possible avenues of (semi-)automatically detecting diachronic functional-semantic changes of grammatical constructions. In that sense, the aim of this study is to extend the application of embedding-based methodologies beyond the realm of lexical semantics, and further add to the budding research on whether and how token vectors [e.g. 14, 29] or contextualized embeddings [e.g. 4] can be used to study the functional-semantic properties of grammar. More specifically, the study, which focuses on embeddings created by means of BERT [8], makes the following contributions:
1. It demonstrates that unsupervised BERT-embeddings can successfully be employed to identify the different functions or ‘usage types’ of the grammatical construction [BE about] in English.
2. It formulates three expected functional-semantic developments of [BE about] based on linguistic literature, and assesses whether (and how) BERT-embeddings can be used in combination with Entropy Difference measures and time-sensitive t-SNE plots to (semi-)automatically detect these changes in the Corpus of Historical American English [COHA, 5].
3. It discusses potential pitfalls associated with the ‘present-day’ bias of the explored methods.

2. Data and Methodology
2.1. Corpus and Data
As a case study, I focus on the recent diachronic development of [BE about]. The data for this study has been gathered from COHA, a corpus containing over 400 million words of American English text written between 1810 and 2009. The corpus is balanced for genre (Fiction, Non-Fiction, Magazine, Newspaper (after 1860)) and subgenre (e.g. prose, poetry, drama, etc. (Fiction)). Such (sub-)genre balance is said to ensure that any changes observed in the corpus will not simply be “artifacts of a changing genre balance” [5].
While [BE about] can be used with a wide range of more abstract, grammatical meanings (which vary substantially in frequency as well as distinctiveness), the present study will focus on its three most frequent usage types: the futurate (e.g. I am about to leave), approximative (e.g. There were about ten cats in the room), and descriptive use (e.g. This song is about love). These usage types align with the higher-order sense categories distinguished in the Oxford English Dictionary (OED) [1], which include a group of tokens expressing approximation, another signalling connections/relations (in descriptions), and another containing tokens (predominantly) followed by a to-infinitive expressing intention or imminent future.
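All of the analyses reported below operate on contextualized token embeddings of about, one per corpus example. The paper only specifies that these embeddings are created by means of BERT [8]; the sketch below is therefore a minimal, hedged illustration of how such token vectors could be extracted, in which the model variant (bert-base-uncased), the choice of the final hidden layer, and the helper name about_embedding are assumptions rather than the exact set-up used here.

```python
# A minimal sketch of extracting a contextualized token embedding for 'about' with BERT.
# The BERT variant, layer choice, and function name are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()


def about_embedding(sentence: str) -> torch.Tensor:
    """Return the final-layer hidden state of the (first) 'about' token in the sentence."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]            # shape: (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    idx = tokens.index("about")                                # 'about' is a single wordpiece in this vocabulary
    return hidden[idx]


vec = about_embedding("I am about to leave.")                  # one 768-dimensional token vector
```

Collecting one such vector per attestation of [BE about] yields the token embeddings visualized and classified in the remainder of this section.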
Figure 1 shows a two-dimensional t-SNE mapping of the embeddings of 1,000 examples of [BE about], collected from COHA (decade 2000-2009). The examples were manually annotated at the level of granularity outlined above. The sample contained 448 futurate, 279 approximative, and 225 descriptive uses of [BE about]. The other 48 examples include irrelevant structures (e.g. due to mistakes in COHA’s tagging of the possessive marker ’s as a finite form of the verb be), as well as three much more infrequent usages of [BE about]. These infrequent types consist of spatial uses (e.g. She must be somewhere about), the fixed expression that’s about it, as well as an actional use which can be paraphrased as ‘occupied/dealing with’ (chiefly found in more or less fixed expressions such as be about your business and know what one is about).
Figure 1: t-SNE of [BE about] token embeddings. There are three largely distinct usage types: futurate, approximative, and descriptive. Less frequent senses (e.g. spatial) are marked as ‘other’.
To assess whether BERT-embeddings can be used to identify and distinguish the three usage types under scrutiny, we can conduct a simple ‘sense distinction task’. For this task, the embeddings of the 952 ‘relevant’ examples and their accompanying usage type labels were used as the training set. The procedure involved fitting a logistic regression classifier with L2 regularization (as implemented in Scikit-learn [25]) on the embeddings created for the labelled training set. Subsequently, the classifier was applied to an unseen test set of 200 examples from each of the 20 decades covered in COHA (with the exception of the first decade (1810-1819), which contained only 73 tokens of [BE about]). In sum, the test set includes 3,873 unseen examples for which a usage type label was predicted. The predicted labels were then assessed against the true usage types of the tokens.
Based on the manual assessment of the predicted usage type labels, it appears that the model performs quite well at distinguishing the three main usage types of [BE about]. For the final decade of COHA, only 5 out of 200 tokens had been mislabelled, indicating a classification accuracy of 0.975. Notably, the classification accuracy of the model remains high with older data, ranging between 0.904 and 0.975 (as is shown in Figure 2). At the same time it can, perhaps unsurprisingly, be noticed that the accuracy (slightly) decreases as the linguistic data ages.
Figure 2: Accuracy based on the labelled test set of 3,873 unseen examples from the 20 decades included in COHA (1810-2009). The classification accuracy ranges between 0.904 in the earliest decade and 0.975 in the most recent decade.
Overall, these accuracy scores are encouraging, in that they highlight the efficiency of BERT-embeddings in identifying specific usage types of a grammatical construction, such as [BE about], in a large diachronic corpus. While that is an interesting observation in itself, the remainder of this paper will focus on assessing whether BERT-embeddings can be used to (semi-)automatically detect functional-semantic changes of grammatical constructions.
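The sense distinction task lends itself to a compact implementation. The sketch below follows the set-up named above (a logistic regression classifier with L2 regularization, as implemented in Scikit-learn [25]), but the function names, the max_iter setting, and the per-decade evaluation loop are illustrative assumptions rather than a reproduction of the code used for this paper.

```python
# A minimal sketch of the sense distinction task: fit an L2-regularized logistic
# regression on labelled present-day token embeddings, then score its predictions
# on a labelled test sample from each decade.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def fit_usage_type_classifier(train_vecs: np.ndarray, train_labels: list) -> LogisticRegression:
    """Fit a logistic regression (L2 penalty) on e.g. a (952, 768) array of
    BERT token embeddings and their usage type labels."""
    clf = LogisticRegression(penalty="l2", max_iter=1000)
    clf.fit(train_vecs, train_labels)
    return clf


def accuracy_by_decade(clf: LogisticRegression, decade_samples: dict) -> dict:
    """Apply the present-day classifier to each decade's (embeddings, gold labels)
    test sample and return one accuracy score per decade."""
    return {
        decade: accuracy_score(gold, clf.predict(vecs))
        for decade, (vecs, gold) in sorted(decade_samples.items())
    }
```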
2.2. Method
This paper’s methodological set-up starts from the assumption that changes in a word or construction’s distribution, which can be captured in compressed numerical representations such as BERT-embeddings, may indicate changes in its functional or semantic range. Following earlier proposals using embeddings to detect lexical-semantic change in large diachronic corpora [e.g. 11], this study will investigate whether known changes that have affected the [BE about] construction can be detected by means of an embedding-based methodology combined with Entropy Difference measures.
More specifically, the task is conceptualized as follows: in the case of lexical items, expansions or reductions in their distributional properties are often equated with expansions or reductions of their possible interpretations – or, in other words, with increases or decreases in uncertainty regarding the exact interpretation of the lexical item. To measure whether the “uncertainty over possible interpretations varies across time intervals”, then, one can “compute the difference in entropy between the two usage type distributions in these intervals” [11]. An increase in entropy over time could be used to signal that the number of interpretations of a word has increased (e.g. due to the emergence of a new usage type), whereas a decrease in entropy would signal that the opposite has occurred (e.g. loss of a usage type).
In principle, it is possible to extend this approach to the study of grammatical items. However, unlike the distributional changes that accompany lexical-semantic changes of well-known examples such as broadcast or gay, the distributional changes witnessed for [BE about] – and many grammatical constructions like it – seem to proceed in a protracted sequence of small steps, often spanning several centuries [21]. The question, then, is whether we can manipulate the use of Entropy Difference measures to detect not only the emergence or loss of an entire usage type, but also any small-scale shifts within a usage type.
The approach tested here assumes that the researcher is interested in determining whether any of the usage types they distinguished has changed in the time span covered by their corpus with respect to a single reference point (for instance: Present-day data). For [BE about], the procedure involved fitting a logistic regression classifier on the embeddings created for the 952 present-day tokens, and applying it to a test set of 200 examples from each decade included in COHA (cf. Section 2.1). Subsequently, for each test token x_i and each label y ∈ Y, the conditional probability p(y|x_i) is computed to assess the uncertainty of the classifier in labelling the unseen examples. The resulting conditional probability over each label for each test token is then summarized in an entropy score, H:

H(x_i) = -\sum_{y \in Y} p(y|x_i) \log p(y|x_i)    (1)

If the entropy score changes over the 20 decades included in COHA, one could take this as an indication that the distributional properties of the test tokens in a particular category have shifted over time. Such shifts could be indicative of proper (subtle) functional-semantic change, or of increased or decreased use of the construction under scrutiny in different genres or text types. Conversely, if the distributional properties of a linguistic item or construction have not changed over the time span covered by the corpus, we should not expect to witness any changes in the certainty by which the model classifies tokens to their true usage type category (i.e. entropy).
Of course, it is difficult to assess the effectiveness of this method if we do not know whether [BE about] has actually undergone any functional-semantic changes in the period covered by COHA. As a means of assessment, then, I will first formulate the expected results for each of the construction’s three main usage types based on prior literature.
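As a concrete illustration of equation (1), the classification entropy of each test token can be read directly off the class probabilities of the fitted classifier. The sketch below uses scikit-learn's predict_proba; the function name and the clipping safeguard are assumed implementation details rather than the author's exact code.

```python
# A minimal sketch of equation (1): per-token classification entropy computed
# from the class probabilities of the fitted logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression


def token_entropy(clf: LogisticRegression, test_vecs: np.ndarray) -> np.ndarray:
    """Return H(x_i) = -sum_y p(y|x_i) * log p(y|x_i) for each test embedding."""
    probs = clf.predict_proba(test_vecs)        # shape: (n_tokens, n_usage_types)
    probs = np.clip(probs, 1e-12, 1.0)          # guard against log(0)
    return -(probs * np.log(probs)).sum(axis=1)
```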
2.2.1. futurate [BE about]
The first usage type under consideration is the futurate use of [BE about], illustrated in examples (2)-(4). In present-day English reference grammars, the phrasal expression be about to is commonly described as a so-called ‘quasi-auxiliary’ which can be used to make a temporal reference to the future [28]. As such, be about to can be considered near-synonymous with other English (quasi-)auxiliaries such as will, shall, and be going to (see example (2)). Notably, however, the use of be about to is commonly said to convey a strong sense of immediacy, and it has been suggested that the construction may in fact have more affinity with “aspectualizing expressions such as begin to/start to V than forms expressing futurity” [16].
(2) Sheen, yes THAT Charlie Sheen, is about to become the best-paid actor in a comedy on TV. (2006, COHA)
(3) Just as I am about to step into the shower, the phone rings. It is George Stephanopoulos. (2002, COHA)
(4) But when he saw the sneer on St. Exeter’s face, Logan knew things were about to get much worse. (2004, COHA)
From the relatively sparse accounts of the diachronic development of [BE about], it can be concluded that the construction had already grammaticalized into a marker of (immediate) future by the 19th century [23, 16] (the suggested time of the first, full-blown futurate uses ranging between the late 15th or 16th century [17] and the late 18th century [31], when the construction started occurring with, for example, inanimate subjects and non-intention verbs, as in (4)). As such, the semantic and distributional changes typical of a grammaticalizing construction (e.g. bleaching [30] or host-class expansion [15]) most likely pre-date COHA. However, the [BE about] future did undergo a distributional change: in Present-day English, the phrase almost exclusively occurs with a to-infinitive complement clause, whereas 19th century and early 20th century texts also contain a variant with an ing-clause complement (as in (5)-(6)).
(5) I really thought all my bones were disjointed, and that my soul was about taking a last farewell of my poor body. (1812, COHA)
(6) I was trying to sleep, and just as I was about succeeding Henderson called out: ‘[...]’. (1902, COHA)
If the method works as intended, it should detect that the distributional properties of futurate [BE about] have narrowed slightly over the course of the 19th–20th centuries.

2.2.2. approximative about
[BE about] often occurs as a marker of approximation. Note that the use of approximative about is not restricted to contexts with the verb be. The current sample therefore only represents a subset of the possible occurrences of approximative about, some examples of which are listed in (7)-(9):
(7) When he was about 12, his parents left him and his siblings at an orphanage for five months. (2006, COHA)
(8) Though you hate to say anyone is recession-proof, U2 is about as close to that as you can get. (2009, COHA)
(9) It was about at that time that Takemore disappeared from the township too. (2001, COHA)
With respect to its diachronic development, it has been shown that the spatial preposition about – much like the near-synonymous around and various other adpositions in other languages – developed into one of the “approximative qualifiers of numerical expressions and other amount expressions” [27], sometimes called “rounders” [24].
This process is shown to have started around the beginning of the Middle English period (1250-1500) [6], and the establishment of approximative about in the wide range of contexts in which it can presently occur pre-dates COHA by a very large margin. The inclusion of approximative [BE about] is therefore motivated not by an expected change, but by the fact that it has been stable over the course of the 19th–20th centuries. As such, no changes in classification uncertainty should be attested.

2.2.3. descriptive [BE about]
In the third and final category, we find a group of what we could call ‘descriptive’ uses of [BE about], in which the phrase [BE about] can be roughly paraphrased as ‘regards’ or ‘is (primarily) concerned with’. Its occurrences commonly involve clarifications of why situations are occurring (e.g. (10)), as well as descriptions of the theme, topic (or plot) of conversations, books, and films (e.g. (11)-(12)):
(10) ... he wonders what the fighting is about and who is fighting whom. Is it North against South again? (2008, COHA)
(11) The latest news is about Amanda! Haven’t you heard? (2000, COHA)
(12) Mars and Venus Collide is not just about men understanding women. It is also about women understanding themselves and learning how to ask effectively for the support they need. (2008, COHA)
Unlike for the previous two usage types, historical and diachronic accounts that treat the descriptive use of [BE about] are not merely sparse, but virtually non-existent. Still, a quick scan of the dated examples listed in the Oxford English Dictionary suggests that the use of [BE (all) about] with an animate subject (e.g. a person, organization, or company) to describe what the subject is ‘primarily concerned with’ or ‘fond of’ constitutes a relatively recent (i.e. mid-20th century) phenomenon (e.g. (13)-(14)).
(13) ... give him your authenticity spiel and how radio should be all about the music. (2009, COHA)
(14) I’m all about the blindfold. There’s something intensely sensual about not knowing where you’re going. (2005, COHA)
In other words, the types of elements that can occur in the subject slot of the descriptive subtype of [BE about] appear to have expanded during the time period captured by COHA, which may indicate that the descriptive use of [BE about] came to cover a broader semantic range. Given that the distributional properties of the descriptive use in the early 19th century were quite different from those of the present day, we would again expect to find a shift in classification uncertainty.

2.2.4. Summary: expected shifts
In sum, the (in)stability of the classification entropy should reflect the following: The distributional properties of futurate [BE about] have changed slightly during the 20th century. Having lost the ability to occur with ing-complements, it seems that the futurate use has narrowed. As with the futurate use, the distributional properties of the descriptive use have changed. More specifically, the descriptive use has expanded or broadened. By contrast, the distributional properties of the approximative use have remained stable.
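To relate these expected shifts to measurable quantities, the per-token entropies of Section 2.2 still need to be summarized per decade and per usage type before they can be read as curves of the kind discussed in the next section. The grouping logic sketched here (averaging entropy over the gold-labelled tokens of each usage type in each decade) is an assumed implementation detail; the paper itself only states that entropy was tracked over the 20 decades.

```python
# A minimal sketch of aggregating per-token entropies into one mean entropy value
# per decade and per (gold) usage type; the grouping choice is an illustrative
# assumption rather than the paper's documented procedure.
from collections import defaultdict

import numpy as np


def mean_entropy_by_decade(decades, usage_types, entropies):
    """decades, usage_types and entropies are parallel sequences, one entry per test token."""
    groups = defaultdict(list)
    for decade, usage, h in zip(decades, usage_types, entropies):
        groups[(decade, usage)].append(h)
    return {key: float(np.mean(vals)) for key, vals in sorted(groups.items())}
```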
3. Results: detecting changes in [BE about]
As a starting point, it is worth considering the distributional range of the [BE about] construction as a whole. It appears that entropy has indeed increased over time (from 0.76 to 1.09), and, as such, it could be flagged as a case of semantic broadening [11]. To understand what has been captured here precisely, it is worth considering the relative frequency of the different usage types across time: with the descriptive category growing more frequent, there are effectively three major usage types by the end of the 20th century (whereas there were only two at the start of the 19th century).
Figure 3: Entropy based on a logistic regression classifier, fitted on the embeddings created for a training set of 952 labelled present-day tokens, and applied to a test set of 200 examples from each decade in COHA. For each test token and each label, the conditional probability has been computed to assess the uncertainty of the classifier in labelling the unseen examples. The resulting conditional probability over each label for each test token is then summarized in an entropy score, which gradually decreases for the descriptive use, and, to a lesser extent, for the futurate use.
What such a general test does not reveal is whether there have been any changes within these major usage types. For instance, there is no immediate indication that anything has changed about the descriptive use besides its overall frequency.

3.1. changing usage types
To assess whether embeddings can be used to detect changes within the three major usage types, one could explore treating the semantic change detection task as a classification experiment. As explained in Section 2.2, the goal of the classification entropy test is to establish whether any of the usage categories of [BE about] has changed compared to a present-day reference point. We expect to find Entropy Differences over time in two of the three usage types: the descriptive use and, to a weaker extent, the futurate use. This expectation appears to be borne out (Figure 3).
A fair point that can be raised is that the attested decrease in uncertainty does not in fact signal that the usage of a construction has changed, but rather reflects a general decrease in how well the embeddings created by unsupervised BERT (trained on predominantly Present-day English data) capture the linguistic material as it becomes older (and, consequently, less familiar). Still, the latter explanation loses at least some of its plausibility when considering the expected, relative stability of the approximative use. It is furthermore reassuring that the slight increase in classifier certainty appears to coincide with the decline of [BE about Ving], which renders futurate [BE about] more uniform and, consequently, more unambiguously recognizable.
Figure 4: Time-sensitive t-SNE of [BE about]: (a) futurate, (b) descriptive. For the futurate use, two archaic/obsolete clusters and one recent cluster can be discerned. For the descriptive use, the pattern suggests expansion with more recent token groupings.
The suggested shifts are also evident from the visualizations in Figure 4a and 4b, which present what one could call a ‘time-sensitive t-SNE’ representation of the futurate and descriptive use of [BE about].
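A plot along these lines can be produced by projecting the token embeddings of a single usage type to two dimensions and colouring each point by the decade of its attestation. The sketch below uses the standard t-SNE implementation in scikit-learn with an assumed perplexity value and colour map; it is a minimal approximation of the 'time-sensitive' visualization described here, not the exact procedure behind Figure 4.

```python
# A minimal sketch of a 'time-sensitive' t-SNE plot: project the token embeddings
# of one usage type to 2D and colour each point by decade.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE


def plot_time_sensitive_tsne(vecs: np.ndarray, decades: np.ndarray, title: str = ""):
    """vecs: (n_tokens, 768) embeddings of one usage type; decades: (n_tokens,) years."""
    coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(vecs)
    fig, ax = plt.subplots()
    points = ax.scatter(coords[:, 0], coords[:, 1], c=decades, cmap="viridis", s=10)
    fig.colorbar(points, ax=ax, label="decade")
    ax.set_title(title)
    return ax
```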
If one wishes to avoid the use of an indirect, present-day reference point to query a corpus for potential constructional changes, it would of course also be possible to examine token groupings (as apparent from a time-sensitive version of t-SNE [20] or another type of dimension reduction technique) that appear to be specific to a particular time period. Using Figure 4, for example, one can examine the token groupings which are dated towards the beginning of the corpus (representing groupings of the [BE about Ving] pattern), or the smaller, markedly recent grouping at the top center left (containing solely negative uses, expressing absence of intent, e.g. Wang’s not about to forgive you (2007, COHA)). Note that the relative size of the time-specific token groupings discussed here may affect the extent to which summary statistics (such as the average pairwise distance between tokens or the silhouette score of clusters over time) capture their emergence or disappearance.

4. Discussion and Conclusion
All in all, the present assessment of the use of BERT embeddings and uncertainty measures to detect functional-semantic change in grammatical constructions seems largely positive. However, there are a number of potential pitfalls that must be addressed.
A first, smaller point that could be raised concerns the attested change in the [BE about] futurate. One may argue that this is, in fact, a formal rather than a functional-semantic change. What we witness is a reduction in the variability of (near-synonymous) complement clause types following the futurate, but this distributional change does not (straightforwardly) mark a shift in the futurate’s meaning. Still, what has been detected is of value to linguists, as the model has picked up a reduction in semasiological variation between a current and a currently obsolete complementation pattern.
Second, one may, as pointed out earlier, not be fully convinced that uncertainty measures determined by an unsupervised, present-day neural language model are the most reliable measure to detect semantic shift. First, it is unclear to what extent the fact that BERT has been pre-trained on present-day English material affects its performance with respect to the gradually aging data. Second, if tokens from a decade d are flagged as yielding a high degree of classifier uncertainty with respect to the reference point r, one should not be too quick to assume semantic change proper has taken place: in fact, this may be due to a difference in how the tokens are distributed across (sub)genres in d and r. In this study, I minimized both problems somewhat, but the concerns are still legitimate. With respect to genre variation, it helps to work with a carefully balanced corpus such as COHA (if available), or to try and incorporate meta-information on (sub)genre in the model [e.g. 26]. With respect to the possible ‘present-day bias’ of the pre-trained model, it is reassuring to see that the model attests stability with usage types that are not known to have changed, and that similar conclusions on possible distributional shifts can be arrived at by examining time-sensitive t-SNE plots. However, it is important to stress that the explored approach solely considers uncertainty with respect to the present-day reference point: while the classification entropy test successfully pointed out that the descriptive use of [BE about] had undergone some changes, the decrease in entropy cannot be equated to semantic narrowing.
Instead, given that the descriptive use of [BE about] rather seems to have shifted and broadened, the classification entropy test merely indicates that the tokens have become more like the present-day examples and linguistic material the model has been trained with. Second, the explored approach is limited in the sense that the number (and nature) of usage types is imposed anachronistically on non-present-day data. Since the procedure relies on a single reference point, it will not straightforwardly flag any usage types that are absent in the training set, and it may erroneously impose the pre-defined category labels onto tokens representing obsolete usages.
A further indication that the method may be problematically biased towards present-day language can be found when the model’s actual classification errors are considered in more detail. In the case of the [BE about] futurate, the overall classification accuracy is remarkably high at 0.985, with only 22 of the 1517 examples not being recognized as futurates. On closer inspection of those 22 mistakes, it appears that 19 of them involve the now obsolete [BE about Ving] pattern. Given that there are 102 examples of [BE about Ving] in the test set, this amounts to an error rate of 18.6%. Furthermore, the mistakes are of the type illustrated in (15), where a Present-day descriptive interpretation (i.e. Napoleon was fond of going to war with England) is erroneously imposed:
(15) Jefferson obtained the consent of Congress to make an effort to buy New Orleans and West Florida, and sent Monroe to aid our minister in France in making the purchase. When the offer was made, Napoleon was about going to war with England, and, wanting money very much, he in turn offered to sell the whole province to the United States. (1897, COHA)
The use of data-driven, automated methods of semantic annotation and analysis is appealing to researchers precisely because it could help avoid such anachronistic interpretations of historical language [e.g. 29], so it is of course unfortunate that they still occur at a reasonably high rate. Yet, it should still be acknowledged that the very fact that word and phrase embeddings created by BERT did succeed in recognizing different grammatical usage types in Present-day language inspires hope that these problems can be tackled when models such as these are trained on contemporary linguistic material and (following proposals such as [18]) made dynamic.

Acknowledgments
I am grateful to Folgert Karsdorp for his advice on how to implement parts of the analysis.

References
[1] “about, adv., prep.1, adj., and int.” In: Oxford English Dictionary Online. Oxford University Press, 1990. url: oed.com/view/Entry/527.
[2] M. Baroni, G. Dinu, and G. Kruszewski. “Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors”. en. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Baltimore, Maryland: Association for Computational Linguistics, 2014, pp. 238–247. doi: 10.3115/v1/P14-1023. url: http://aclweb.org/anthology/P14-1023 (visited on 01/26/2020).
[3] G. Boleda. “Distributional Semantics and Linguistic Theory”. en. In: Annual Review of Linguistics 6.1 (Jan. 2020). arXiv: 1905.01896, pp. 213–234. issn: 2333-9683, 2333-9691. doi: 10.1146/annurev-linguistics-011619-030303. url: http://arxiv.org/abs/1905.01896 (visited on 04/15/2020).
[4] S. Budts and P. Petré. “Putting connections centre stage in diachronic construction grammar”. In: Nodes and Networks in Diachronic Construction Grammar. Ed. by L. Sommerer and E. Smirnova. Amsterdam: John Benjamins, 2020, pp. 317–352.
[5] M. Davies. Corpus of Historical American English (COHA). Version V1. 2015. doi: 10.7910/DVN/8SRSYK. url: https://doi.org/10.7910/DVN/8SRSYK.
[6] H. De Smet. “The course of actualization”. en. In: Language 88.3 (2012), pp. 601–633. issn: 1535-0665. doi: 10.1353/lan.2012.0056. url: http://muse.jhu.edu/content/crossref/journals/language/v088/88.3.de-smet.html (visited on 01/26/2020).
[7] G. Desagulier. “Can word vectors help corpus linguists?” en. In: Studia Neophilologica 91.2 (May 2019), pp. 219–240. issn: 0039-3274, 1651-2308. doi: 10.1080/00393274.2019.1616220. url: https://www.tandfonline.com/doi/full/10.1080/00393274.2019.1616220 (visited on 05/17/2020).
[8] J. Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. en. In: Proceedings of NAACL-HLT 2019. Minneapolis, Minnesota, June 2019, pp. 4171–4186.
[9] H. Dubossarsky et al. “Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, July 2019, pp. 457–470. doi: 10.18653/v1/P19-1044. url: https://www.aclweb.org/anthology/P19-1044 (visited on 06/28/2020).
[10] S. Eger and A. Mehler. “On the Linearity of Semantic Change: Investigating Meaning Variation via Dynamic Graph Models”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 52–58. doi: 10.18653/v1/P16-2009. url: https://www.aclweb.org/anthology/P16-2009 (visited on 07/21/2020).
[11] M. Giulianelli, M. Del Tredici, and R. Fernández. “Analysing Lexical Semantic Change with Contextualised Word Representations”. In: arXiv:2004.14118 [cs] (Apr. 2020). arXiv: 2004.14118. url: http://arxiv.org/abs/2004.14118 (visited on 06/28/2020).
[12] S. J. Greenhill et al. “Evolutionary dynamics of language systems”. en. In: Proceedings of the National Academy of Sciences 114.42 (Oct. 2017), E8822–E8829. issn: 0027-8424, 1091-6490. doi: 10.1073/pnas.1700388114. url: http://www.pnas.org/lookup/doi/10.1073/pnas.1700388114 (visited on 07/21/2020).
[13] W. L. Hamilton, J. Leskovec, and D. Jurafsky. “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change”. In: arXiv:1605.09096 [cs] (Oct. 2018). arXiv: 1605.09096. url: http://arxiv.org/abs/1605.09096 (visited on 06/28/2020).
[14] M. Hilpert and D. Correia Saavedra. “Using token-based semantic vector spaces for corpus-linguistic analyses: From practical applications to tests of theoretical claims”. en. In: Corpus Linguistics and Linguistic Theory 0.0 (Sept. 2017). issn: 1613-7027, 1613-7035. doi: 10.1515/cllt-2017-0009. url: http://www.degruyter.com/view/j/cllt.ahead-of-print/cllt-2017-0009/cllt-2017-0009.xml (visited on 05/16/2020).
[15] N. P. Himmelmann. “Lexicalization and grammaticization: opposite or orthogonal?” en. In: What Makes Grammaticalization: A Look from Its Components and Its Fringes. Ed. by W. Bisang, N. P. Himmelmann, and B. Wiemer. Berlin: Mouton de Gruyter, 2004, pp. 21–44.
[16] S. Höche. “I am about to die vs. I am going to die: A usage-based comparison between two future-indicating constructions”. In: Converging Evidence: Methodological and Theoretical Issues for Linguistic Research. Ed. by D. Schönefeld. Amsterdam: John Benjamins Publishing Company, 2011, pp. 115–142.
[17] B. Jirsa. “Synchronic Applications for Diachronic Syntax: The Grammaticalization of to be about to in English”. en. In: Colorado Research in Linguistics 15 (1997). issn: 1937-7029. doi: 10.25810/5h5t-xg32. url: https://journals.colorado.edu/index.php/cril/article/view/231 (visited on 07/21/2020).
[18] Y. Kim et al. “Temporal Analysis of Language through Neural Language Models”. en. In: Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. Baltimore, MD, USA: Association for Computational Linguistics, 2014, pp. 61–65. doi: 10.3115/v1/W14-2517. url: http://aclweb.org/anthology/W14-2517 (visited on 06/28/2020).
[19] A. Kutuzov et al. “Diachronic word embeddings and semantic shifts: a survey”. In: Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, Aug. 2018, pp. 1384–1397. url: https://www.aclweb.org/anthology/C18-1117 (visited on 07/20/2020).
[20] L. van der Maaten and G. Hinton. “Visualizing Data using t-SNE”. In: Journal of Machine Learning Research 9 (2008), pp. 2579–2605.
[21] C. Mair and G. Leech. “Current Changes in English Syntax”. en. In: The Handbook of English Linguistics. Ed. by B. Aarts and A. McMahon. Malden, MA, USA: Blackwell Publishing, Jan. 2006, pp. 318–342. doi: 10.1002/9780470753002.ch14. url: http://doi.wiley.com/10.1002/9780470753002.ch14 (visited on 09/19/2020).
[22] T. McEnery and A. Hardie. The History of Corpus Linguistics. en. Oxford University Press, Mar. 2013. doi: 10.1093/oxfordhb/9780199585847.013.0034. url: http://oxfordhandbooks.com/view/10.1093/oxfordhb/9780199585847.001.0001/oxfordhb-9780199585847-e-34 (visited on 06/29/2020).
[23] J. Mee. “The evolution of constructions: The case of be about to”. en. MA dissertation. University of New Mexico, 2013, p. 138.
[24] W. Mihatsch. “The Diachrony of Rounders and Adaptors: Approximation and Unidirectional Change”. en. In: New Approaches to Hedging. Ed. by G. Kaltenböck, W. Mihatsch, and S. Schneider. BRILL, Jan. 2010, pp. 93–122. isbn: 978-90-04-25324-7. doi: 10.1163/9789004253247_007. url: https://brill.com/view/book/edcoll/9789004253247/B9789004253247-s007.xml (visited on 07/20/2020).
[25] F. Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research 12.Oct (2011), pp. 2825–2830.
[26] V. Perrone et al. “GASC: Genre-Aware Semantic Change for Ancient Greek”. In: Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change (2019). arXiv: 1903.05587, pp. 56–66. doi: 10.18653/v1/W19-4707. url: http://arxiv.org/abs/1903.05587 (visited on 09/19/2020).
[27] F. Plank. “Inevitable reanalysis: From local adpositions to approximative adnumerals, in German and wherever”. en. In: Studies in Language 28.1 (2004), pp. 165–201. issn: 0378-4177, 1569-9978. doi: 10.1075/sl.28.1.07pla. url: http://www.jbe-platform.com/content/journals/10.1075/sl.28.1.07pla (visited on 07/20/2020).
[28] R. Quirk et al. A Comprehensive Grammar of the English Language. London: Longman, 1985.
[29] E. Sagi, S. Kaufmann, and B. Clark. “Tracing semantic change with Latent Semantic Analysis”. en. In: Current Methods in Historical Semantics. Ed. by K. Allan and J. A. Robinson. Berlin, Boston: De Gruyter, Jan. 2011, pp. 161–183. isbn: 978-3-11-025290-3. doi: 10.1515/9783110252903.161. url: https://www.degruyter.com/view/books/9783110252903/9783110252903.161/9783110252903.161.xml (visited on 01/26/2020).
[30] E. E. Sweetser. “Grammaticalization and Semantic Bleaching”. In: Proceedings of the Fourteenth Annual Meeting of the Berkeley Linguistics Society (1988), pp. 389–405.
[31] T. Watanabe. “Development and grammaticalization of Be About To: An analysis of the OED quotations”. In: Aspects of the History of English Language and Literature. Ed. by Y. Nakao and M. Ogura. Frankfurt am Main: Peter Lang, 2010, pp. 353–365.