<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Computational Humanities Research, November 2020</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Toward a Thermodynamics of Meaning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonathan Scott Enderle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Pennsylvania Libraries</institution>
          ,
          <addr-line>3420 Walnut St., Philadelphia, PA 19104-6206</addr-line>
          ,
          <country country="US">United States of America</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>1</volume>
      <issue>4</issue>
      <fpage>8</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>As language models such as GPT-3 become increasingly successful at generating realistic text, questions about what purely text-based modeling can learn about the world have become more urgent. Is text purely syntactic, as skeptics argue? Or does it in fact contain some semantic information that a sufficiently sophisticated language model could use to learn about the world without any additional inputs? This paper describes a new model that suggests some qualified answers to those questions. By theorizing the relationship between text and the world it describes as an equilibrium relationship between a thermodynamic system and a much larger reservoir, this paper argues that even very simple language models do learn structural facts about the world, while also proposing relatively precise limits on the nature and extent of those facts. This perspective promises not only to answer questions about what language models actually learn, but also to explain the consistent and surprising success of cooccurrence prediction as a meaning-making strategy in AI.</p>
      </abstract>
      <kwd-group>
        <kwd>language modeling</kwd>
        <kwd>natural language semantics</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>statistical mechanics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Since the introduction of the Transformer architecture in 2017 [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], neural language models
have developed increasingly realistic text-generation abilities, and have demonstrated
impressive performance on many downstream NLP tasks. Assessed optimistically, these successes
suggest that language models, as they learn to generate realistic text, also infer meaningful
information about the world outside of language.
      </p>
      <p>
        Yet there are reasons to remain skeptical. Because they are so sophisticated, these models
can exploit subtle flaws in the design of language comprehension tasks that have been
overlooked in the past. This may make it difficult to realistically assess these models’ capacity for
true language comprehension. Moreover, there is a long tradition of debate among linguists,
philosophers, and cognitive scientists about whether it is even possible to infer semantics from
purely syntactic evidence [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
      </p>
      <p>
        This paper proposes a simple language model that directly addresses these questions by
viewing language as a system that interacts with another, much larger system: a semantic domain
that the model knows almost nothing about. Given a few assumptions about how these two
systems relate to one another, this model implies that some properties of the linguistic system
must be shared with its semantic domain, and that our measurements of those properties are
valid for both systems, even though we have access only to one. But this conclusion holds
only for some properties. The simplest version of this model closely resembles existing word
embeddings based on low-rank matrix factorization methods, and performs competitively on
a balanced analogy benchmark (BATS [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]).
      </p>
      <p>
        The assumptions and the mathematical formulation of this model are drawn from the
statistical mechanical theory of equilibrium states. By adopting a materialist view that treats
interpretations as physical phenomena, rather than as abstract mental phenomena, this model
shows more precisely what we can and cannot infer about meaning from text alone.
Additionally, the mathematical structure of this model suggests a close relationship between
cooccurrence prediction and meaning, if we understand meaning as a mapping between fragments of
language and possible interpretations. There is reason to believe that this line of reasoning will
apply to any model that operates by predicting cooccurrence, however sophisticated. Although
the model described here is a pale shadow of a hundred-billion-parameter model like GPT-3
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the fundamental principle of its operation, this paper argues, is the same.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Previous Work</title>
      <p>
        Most recent work on language modeling builds on the word2vec word embedding model [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]
and its descendants such as GloVe [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. These models drew from a longer tradition of
distributional semantics in linguistics [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and early machine translation research [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
The promise of word embedding models for research in the humanities was quickly recognized,
leading to historical studies of analogical language [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and diachronic lexical change [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], but
questions remained about the utility of embeddings for close humanistic analysis. Word
embedding models suffer from stability problems, yielding seemingly precise answers that change
when training input is modified only slightly [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and their internal geometric structure is
poorly understood [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ].
      </p>
      <p>
        Attempts to build a better theoretical understanding of word embeddings have often focused
on exploring the ways diferent models prove to be mathematically equivalent in some limit [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], or showing the importance of preprocessing and hyperparameter selection. In many cases,
with optimal hyperparameter choices, factorizing word cooccurrence matrices using SVD and
a log weighting is sufficient to produce results competitive with state-of-the-art models [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
For these reasons, the claim that word embeddings are indeed representations of meaning, and
not merely dense representations of word cooccurrence, still lacks strong theoretical support.
On the other hand, even simple cooccurrence data seems intuitively to capture something about
meaning in a way that remains mysterious [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        More recent language modeling has focused on sequence prediction, either using recurrent
neural networks [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] or attention-based mechanisms [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. Large language models using the
Transformer architecture apparently capture rich semantic information usable in a range of
downstream applications [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. But as with word embeddings, there remain empirical and
theoretical reasons to be skeptical that these models are capturing information about meaning,
rather than performing an extremely sophisticated and accurate version of positionally-aware
cooccurrence prediction. At least some attempts to use Transformer models to perform
challenging natural language comprehension tasks have shown that existing problem datasets
contain subtle linguistic cues that leak information about correct answers [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. These cues
have been missed in the past, but with their linguistic sophistication, newer models recognize
them, producing spurious state-of-the-art results without demonstrating true comprehension.
      </p>
      <p>
        Recent work by Bender and Koller [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] provides an even stronger theoretical case against the
claim that language models infer meaning beyond simple cooccurrence. Synthesizing
arguments and evidence from linguistics and philosophy, including Searle’s famous Chinese Room
argument [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], Bender and Koller argue that “the language modeling task, because it only uses
form as training data, cannot in principle lead to learning of meaning.” Or, in Searle’s pithy
formulation, the operations of a computer have “syntax but no semantics.” Bender and Koller’s
reliance on Searle is notable, given that Searle’s argument was not against language
modeling, but against the very possibility of artificial intelligence. Anyone who takes his reasoning
entirely seriously should forever abandon the notion that a computational process could truly
comprehend meaning. Yet in their final analysis, Bender and Koller back away from Searle’s
strongest claims, acknowledging that “if form is augmented with grounding data of some kind,
then meaning can conceivably be learned to the extent that the communicative intent is
represented in that data,” and that a sufficiently successful language model “has probably learned
something about meaning.”
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Meaning, Cooccurrence, and Thermodynamics</title>
      <p>How can we synthesize these seemingly contradictory bodies of theory and evidence? It’s
plausible to claim that language models can never do more than predict the way elements
of language cooccur in text, since they never see any other kind of evidence. And yet even
the simplest kinds of cooccurrence prediction, such as basic matrix factorization, produce
surprisingly good representations of something that looks intuitively like meaning. Suppose
that rather than examining the details of particular language models to see how they differ,
and which might be more or less correct, we focus on what they have in common. Is there
some unrecognized connection between meaning and cooccurrence prediction in all its forms?</p>
      <p>
        This section proposes such a connection based on a model borrowed from statistical
mechanics. Similar approaches have been applied to practical language modeling problems [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]
[
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and theoretical discussions of algorithmic and semantic information [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. But to the
author’s knowledge, no prior work has used thermodynamic analogies to specifically investigate
the relationship between language and its semantic domain.
      </p>
      <p>This model begins by treating interpretations as possible configurations of an unknown
physical system. It then constructs a statistical mechanical partition function that counts the
number of interpretations applicable to each fragment of language in a corpus. It immediately
follows that the Hessian of that function—the matrix of its mixed second partial derivatives—is
a covariance matrix describing word cooccurrences. The Hessian matrix can be used, in turn,
to approximate directional derivatives of the partition function, which describe the ways the
partition function changes when the meanings of words are slightly modified. These directional
derivatives are word vectors, with all the expected properties.</p>
      <sec id="sec-3-1">
        <title>3.1. Model Assumptions</title>
        <p>Setting up our model requires some odd assumptions about how language works. To begin
with, it requires that we assume that meaning is quantifiable in the most naive way. It’s not
uncommon in colloquial speech to talk about the amount of meaning a phrase has, without
specifying what the phrase means. Some phrases, we might say, are meaningless; others are
full of meaning. To construct a statistical mechanical model of meaning, it is useful to assume
that this is a perfectly correct way of quantifying meaning, and that, so quantified, meaning
is a conserved value that plays the same role as energy in a typical thermodynamic ensemble.</p>
        <p>As long as we are making extravagant assumptions, let’s also assume that for a given
linguistic system and an associated semantic domain, words have a stable average capacity for
holding meaning, and that word counts are conserved values just like energy, so that a
combined linguistic system and associated semantic domain contains an unknown but fixed number
of copies of every possible word. Leaving aside the linguistic significance of these assumptions
for a moment, we can skip ahead by recognizing them as formally equivalent to the assumptions
made in the construction of the grand canonical ensemble.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. The Grand Canonical Ensemble</title>
        <p>In classical thermodynamics, the grand canonical ensemble describes a system of particles—
such as a container of gas—that is in thermodynamic and chemical equilibrium with a much
larger system, a reservoir of energy and particles. Concretely, this means that the temperature
of the gas in the container is the same as its surroundings (assumed to be homogeneous, and far
larger than the container), and that the container can exchange particles with its surroundings,
but at a steady state, so that it is as likely to lose a particle as to gain a particle at any given
moment. Furthermore, both the amount of energy and the number of particles shared between
the container and its surroundings are fixed—energy and particle number are conserved values.</p>
        <p>To understand the behavior of this ensemble, we begin by imagining that we could track the
exact position and momentum of every particle in the system (container), as well as the exact
position and momentum of every particle in the reservoir (surroundings of the container). At
a given instant in time, these values constitute a “microstate.” Since we have both a system
and a reservoir, we can divide a single microstate into parts, considering just the microstate of
the system, or just the microstate of the reservoir. We can also determine that certain system
microstates are incompatible with certain reservoir microstates, because the combination would
violate a conservation law. In other words, for some pairs of system microstate and reservoir
microstate to coexist, energy or particles would have to be created or destroyed, which would
violate the rule that energy and particle number are conserved values.</p>
        <p>If we rule out all system-reservoir microstate pairs that are not compatible (↮), and assume
that all reservoir microstates are equally likely—an acceptable approximation when the
reservoir is far larger than the system—then we can approximate the probability of a given system
microstate s_i by counting the number of reservoir microstates r_j that are compatible (↔) with
it. Using Iverson brackets ([i = j] = δ_ij), we can say</p>
        <p>\[ p_i \propto \sum_j \, [r_j \leftrightarrow s_i] \tag{1} \]</p>
        <p>To recover the probability itself, we can divide by the sum over all system microstates:</p>
        <p>\[ p_i = \frac{\sum_j \, [r_j \leftrightarrow s_i]}{\sum_{j,k} \, [r_j \leftrightarrow s_k]} \tag{2} \]</p>
        <p>These sums are very large, and it’s not clear how to calculate them. But it turns out we
don’t need to. Given a few standard assumptions from thermodynamics, our assumptions about
conserved quantities, and a bit of calculus, it’s possible to use them to derive the following
function:</p>
        <p>\[ Z = \sum_i e^{\beta(\mu N_i - E_i)} \tag{3} \]</p>
        <p>This is the grand canonical partition function. From it, we can then directly calculate the
probability of system microstate i like so:</p>
        <p>\[ p_i = \frac{e^{\beta(\mu N_i - E_i)}}{Z} \tag{4} \]</p>
        <p>This formula tells us, first, that at a given fixed temperature T determined by β = 1/(k_B T),
system microstates with more energy (E_i) are less probable, because they are compatible with
fewer reservoir microstates. It also tells us that for any given energy level, system microstates
containing more particles (N_i) with a higher chemical potential (µ) are more probable. This
is because given two systems with the same energy, the system with a higher overall potential
has a higher energy capacity.</p>
        <p>This partition function can be extended to systems that have multiple kinds (“species”) of
particles. In that case, each species has its own chemical potential and count. For a system
with k different species,</p>
        <p>\[ Z = \sum_i e^{\beta(\mu_1 N_{1,i} + \mu_2 N_{2,i} + \cdots + \mu_k N_{k,i} - E_i)} \tag{5} \]</p>
        <p>\[ p_i = \frac{e^{\beta(\mu_1 N_{1,i} + \mu_2 N_{2,i} + \cdots + \mu_k N_{k,i} - E_i)}}{Z} \tag{6} \]</p>
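        <p>The multi-species partition function and microstate probabilities of equations 5 and 6 can be evaluated directly for a toy ensemble. The counts, energies, and potentials below are invented for illustration:</p>

```python
import numpy as np

beta = 1.0                     # inverse temperature, beta = 1/(k_B T)
mu = np.array([0.5, -0.2])     # chemical potential of each species (invented)

# Each row i is a system microstate: N[i, k] is the count of species k,
# and E[i] is that microstate's energy (all values invented).
N = np.array([[0, 1], [1, 0], [2, 1], [1, 2]])
E = np.array([0.1, 0.3, 0.9, 0.7])

# Equation 5: Z = sum_i exp(beta * (mu_1 N_1,i + ... + mu_k N_k,i - E_i))
weights = np.exp(beta * (N @ mu - E))
Z = weights.sum()

# Equation 6: probability of each system microstate.
p = weights / Z
print(Z, p)
```

        <p>Raising E[i] for one microstate lowers its probability, and raising a species’ µ favors microstates containing more of that species, matching the qualitative reading of equation 6 above.</p>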
      </sec>
      <sec id="sec-3-3">
        <title>3.3. From Compatibility to Interpretation</title>
        <p>What does all this have to do with language? The first hint that the grand canonical partition
function might have some usefulness as a model for language is that energy and meaning (in the
naive quantitative sense described above) both impose similar compatibility constraints on the
system and reservoir. Just as higher-energy states in the system correspond to fewer possible
reservoir states, more meaningful sentences correspond to fewer possible interpretations. A
statement with less meaning has less precision, while a statement with more meaning has more
precision, eliminating a larger number of possible interpretations. The line of reasoning is
similar for particle species and words. Just as a particle species with higher chemical potential
has higher energy capacity, a word with a higher “semantic potential” has a higher capacity
for meaning.</p>
        <p>Consider, for example, the sentence “It stinks.” Then compare it to “On January 15, 2008,
a rainfall of 110mm was recorded in the city of Dubai.” The specific interpretations these
sentences can be given will depend on context, and in some contexts, “It stinks” might be
a meaningful and precise sentence. But on balance, we should expect “It stinks” to be less
meaningful than “On January 15...,” both because it contains fewer words, and because the
words it contains are less precise than words like “rainfall” and “Dubai.”</p>
        <p>Although this is a simple way of thinking about meaning, it is not as simplistic as it may
seem at first. Consider the sentence “Ask for me tomorrow, and you shall find me a grave man,”
as uttered by a dying Mercutio. One might think that by the logic above, this sentence would
be made less meaningful by the presence of an ambiguous word, “grave,” here meaning either
“serious” or “a place of burial.” But a more careful analysis leads to a different conclusion. If
these two senses were available independently, and the sentence could be properly interpreted
in two different ways, it would indeed be less meaningful because of this ambiguity. In this
context, however, choosing just one of those senses to the exclusion of the other would yield a
misreading of the sentence. It does not invite two different possible interpretations; it invites
one interpretation that combines together two distinct concepts both conveyed by the word
“grave.” By eliminating interpretations that do not combine these two senses together, this
sentence uses ambiguity to achieve a higher degree of precision. Analyzed this way, literary
language is often likely to be more precise and meaningful than everyday language, despite
sometimes having greater surface ambiguity.</p>
        <p>If we translate these ideas into a mathematical form, and start thinking about compatibility
(↔) as a semantic relationship, then equation 2 says roughly that the probability of a given
sentence (s_i) is equal to the number of interpretations (r_j) it has, divided by the number of
interpretations that all possible grammatically correct sentences have. The refinement of that
equation to equation 6 now says that sentences with more meaning are less probable, because
they are compatible with fewer interpretations, and that for any given degree of meaningfulness,
sentences with a higher semantic potential are more probable. (That is, precise sentences are
harder to write, but it’s easier to write a precise sentence with more words, and it’s harder to
pack all your meaning into just a few very precise words.)</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. From Ensembles to Vectors</title>
        <p>Most word embedding models generate word vectors by using a supervised or semi-supervised
model to predict cooccurrences, and the vectors themselves aren’t significant outside that
predictive frame. But the picture is quite different for statistical-mechanical models such
as this one. One of the most elegant properties of partition functions is that a wide range
of thermodynamic quantities can be expressed directly as partial derivatives of the partition
function or its logarithm.</p>
        <p>For example, suppose we would like to determine the number of particles of a particular
kind present in all possible states of our system (N_k), and take the average. We can calculate
that value by taking the partial derivative of the logarithm of equation 5 with respect to the
chemical potential of that species, and dividing out β = 1/(k_B T):</p>
        <p>\[ \langle N_k \rangle = \frac{1}{\beta} \frac{\partial \ln Z(\mu_k)}{\partial \mu_k} \tag{7} \]</p>
        <p>Since ∂/∂x ln f(x) = (∂f(x)/∂x)/f(x), this simplifies to a probability-weighted sum of N_k counts
divided by β, effectively an arbitrary constant multiplier. Shifting β to the left-hand side of
the equation gives</p>
        <p>\[ \beta \langle N_k \rangle = \frac{\partial \ln Z(\mu_k)}{\partial \mu_k} = \sum_i \beta N_{k,i} \, \frac{e^{\beta(\mu_1 N_{1,i} + \cdots + \mu_k N_{k,i} - E_i)}}{Z} = \beta \sum_i N_{k,i} \, p_i \tag{8} \]</p>
        <p>This line of reasoning can be extended to second partial derivatives. The variance of N_k is
given by</p>
        <p>\[ \beta^2 \left[ \langle N_k^2 \rangle - \langle N_k \rangle^2 \right] = \frac{\partial^2 \ln Z(\mu_k)}{\partial \mu_k^2} \tag{9} \]</p>
        <p>Similarly, the covariance of N_k and N_j is a mixed partial derivative:</p>
        <p>\[ \beta^2 \left[ \langle N_k N_j \rangle - \langle N_k \rangle \langle N_j \rangle \right] = \frac{\partial^2 \ln Z}{\partial \mu_k \, \partial \mu_j} \tag{10} \]</p>
        <p>These last two equations can be used to construct a matrix that has two simultaneous
meanings. It is, first, a covariance matrix that describes the way particle counts are correlated
with one another in the system. But it is also a Hessian matrix of second partial derivatives,
meaning that it describes the way small modifications to the chemical potential terms change
the overall partition function, shifting its energy balance across all possible system microstates.
This means that even if we can’t construct the partition function itself, we can in principle
measure the covariance of particles empirically, and use the resulting matrix to reconstruct
information about the partition function and the thermodynamic ensemble it describes.</p>
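        <p>This dual identity can be checked numerically. For a toy ensemble (all values invented for illustration), the Hessian of ln Z with respect to the chemical potentials, estimated by finite differences, matches β² times the probability-weighted covariance of the particle counts:</p>

```python
import numpy as np

beta = 1.0
mu0 = np.array([0.5, -0.2])                      # chemical potentials (invented)
N = np.array([[0, 1], [1, 0], [2, 1], [1, 2]], dtype=float)
E = np.array([0.1, 0.3, 0.9, 0.7])               # microstate energies (invented)

def log_Z(mu):
    """Logarithm of the grand canonical partition function."""
    return np.log(np.exp(beta * (N @ mu - E)).sum())

# Probability-weighted mean and covariance of the particle counts.
p = np.exp(beta * (N @ mu0 - E) - log_Z(mu0))
mean = p @ N
cov = (N - mean).T @ ((N - mean) * p[:, None])

# Hessian of ln Z via central finite differences over the potentials.
h = 1e-4
hess = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        total = 0.0
        for sa, sb in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
            d = np.zeros(2)
            d[a] += sa * h
            d[b] += sb * h
            total += sa * sb * log_Z(mu0 + d)
        hess[a, b] = total / (4 * h * h)

# The Hessian of ln Z equals beta^2 times the count covariance matrix.
assert np.allclose(hess, beta**2 * cov, atol=1e-4)
print(hess)
```

        <p>The point of the check is the one made in the text: the covariance can be measured empirically, on the system alone, and it carries second-derivative information about the partition function.</p>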
        <p>If we translate this into linguistic terms, we find that by taking empirical measurements of
word cooccurrence, we are also constructing the Hessian of a linguistic partition function that
describes how changes to the meaning of one word affect the meaning of another. The columns
of that matrix are word vectors. When two columns are similar, small modifications to the
meanings of the associated words have similar effects on the language as a whole. That is what
it means, in the context of this model, for two words to be similar. Line integrals through the
Hessian field in a given neighborhood can also be approximated by adding and subtracting
these vectors, giving a more precise interpretation to the formulas used to represent analogies.
Analogies are valid when they correspond to two diferent line integrals through a conservative
Hessian tensor field, beginning at the same point and ending close to the same point, and
therefore having similar final values.</p>
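        <p>A toy illustration of this reading: over an invented six-sentence corpus, the columns of the empirical covariance matrix of word counts serve as word vectors, and an analogy is approximated by adding and subtracting them. This is only a sketch of the interpretation, not the evaluation procedure used with BATS:</p>

```python
import numpy as np

# A tiny invented corpus; each "sentence" is treated as a bag of words.
corpus = [
    "king rules the realm", "queen rules the realm",
    "man walks the road", "woman walks the road",
    "king is a man", "queen is a woman",
]
vocab = sorted({w for s in corpus for w in s.split()})
idx = {w: i for i, w in enumerate(vocab)}

# Sentence-by-word count matrix; its column covariance is the empirical
# stand-in for the Hessian of the linguistic partition function.
counts = np.zeros((len(corpus), len(vocab)))
for i, s in enumerate(corpus):
    for w in s.split():
        counts[i, idx[w]] += 1
cov = np.cov(counts, rowvar=False)

def vec(w):
    return cov[:, idx[w]]      # a column of the covariance/Hessian matrix

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# An analogy as vector arithmetic: king - man + woman lands nearest queen.
target = vec("king") - vec("man") + vec("woman")
sims = {w: cosine(target, vec(w)) for w in vocab}
print(max(sims, key=sims.get))   # -> queen
```

        <p>Even on this tiny corpus, the summed offset ends closest to the column for “queen,” which is the vector-arithmetic approximation of the two line integrals described above.</p>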
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Implementation</title>
        <p>
          Constructing a practical implementation of this model requires that we determine the values
for two sets of parameters: the potential for each word in the vocabulary, and the energy level
for each sentence. The simplest approach to this problem is to set all potential terms to zero,
and all energy terms to one. The covariance matrix that results from these choices is identical
to the one given by directly counting word cooccurrences. Alternative schemes will change
the weights given to each of the sentences, yielding a modified covariance matrix that is likely
to give better meaning representations. For performance reasons, some form of dimension
reduction is also necessary, but has no theoretical significance at all. In practice, random
projection (as in [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]) works well, especially after implementing some of the preprocessing
and hyperparameter selection recommendations in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], which may be compensating for the
deficiencies that result from setting the energy and potential terms to constants.
        </p>
        <p>The problem of selecting energy and potential terms in a more principled way is left to other
work. But it is worth considering briefly, since it illustrates some interesting properties of the
model. First, in this model, the same sequence of words could appear twice with different energy
levels, and therefore different probabilities, depending on context. Second, there may be a way
to make predictions that link the semantic potential of given terms to known lexicographical
properties of those terms, such as the approximate number of senses the word has. And finally,
the partition function described here is not the only possible partition function that might be
applied to language. Partition functions based on word pairs, sequences, or even attention
mechanisms could be used to model language within this framework, all broadly interpretable
in the same way.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>Few of the ideas presented here are new. The fact that word vectors contain distributional
information that allows them to measure word similarity has been known for decades. Ideas
from statistical mechanics have been applied to language modeling, machine learning, and
information retrieval problems for decades. And for the last few years, there has been a steady
stream of work demonstrating that language model X reduces to language model Y in some
limit. But none of this has shown how these models could capture information about
interpretation or meaning. Trained on linguistic form alone, these models have no evidence showing
how linguistic forms map to mental models, concepts, narratives, or any other representations
of things outside of language.</p>
      <p>The claim that statistical models can infer things about meaning from linguistic form alone
thus faces a high burden of proof. And while there has been a proliferation of models that
do appear to support that claim, they all work on slightly different principles, and produce
slightly different results. This undercuts attempts at meta-induction; many small bodies of
evidence based on different principles of operation do not add up to one large body of evidence.
And so justified skepticism remains.</p>
      <p>What is new about the model proposed here is that it is general enough to explain the success
of many of these models without reference to the details of their operation. Fundamentally,
any model that is able to predict linguistic cooccurrences can be reinterpreted as an implicit
partition function along the lines proposed here. So reinterpreted, we can argue that
distributional information about language is linked by a precise mathematical structure to specific
facts about how words signify. Those facts are limited; they do not include any information
about what words, sentences, or longer fragments of language talk about. But they do include
information about how many interpretations might be applied to those units of language, and
how those interpretations correlate with one another at a macroscopic level.</p>
      <p>What unites all of these models, under this theory, is that they effectively assume that
meaning, quantified appropriately, is conserved, and that units of language—be they letters, words,
n-grams, or longer phrases—are also conserved. It’s not yet clear what these assumptions
might mean in linguistic terms, but they are crucial to the derivation of a partition function
that can relate the statistics of linguistic form to an unknown reservoir of meaning.</p>
      <p>
        These models must also make a third assumption: language exists in a state of equilibrium
with its reservoir of meaning. That assumption is unlikely to hold in general. If this way of
thinking about language modeling is sound, then an important project will be to understand
when the assumption of equilibrium is justified, and when it is not. It’s likely that during
periods of rapid linguistic change, for example, the equilibrium assumption will not be valid.
In that case, methods that can model far-from-equilibrium systems will be required. Since
non-equilibrium thermodynamics is a field still in its infancy [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], there will be much work to
be done, and many tasks that remain impossible without domain expertise. Nonetheless, a
deeper understanding of the meaning of these assumptions promises to clarify when and how
language models can infer meaning from linguistic form alone.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Antoniak</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Mimno</surname>
          </string-name>
          . “
          <article-title>Evaluating the Stability of Embedding-based Word Similarities”</article-title>
          .
          <source>In: TACL 6</source>
          (
          <year>2018</year>
          ), pp.
          <fpage>107</fpage>
          -
          <lpage>119</lpage>
          . url: https://transacl.org/ojs/index.php/tacl/article/view/1202.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] <string-name><given-names>S.</given-names> <surname>Arora</surname></string-name> et al. “<article-title>A Latent Variable Model Approach to PMI-based Word Embeddings</article-title>”. In: <source>TACL</source> <volume>4</volume> (<year>2016</year>), pp. <fpage>385</fpage>-<lpage>399</lpage>. doi: 10.1162/tacl_a_00106. url: https://www.aclweb.org/anthology/Q16-1028.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] <string-name><given-names>J.</given-names> <surname>Baez</surname></string-name> and <string-name><given-names>M.</given-names> <surname>Stay</surname></string-name>. “<article-title>Algorithmic thermodynamics</article-title>”. In: <source>Mathematical Structures in Computer Science</source> <volume>22</volume>.<issue>5</issue> (<year>2012</year>), pp. <fpage>771</fpage>-<lpage>787</lpage>. doi: 10.1017/S0960129511000521.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] <string-name><given-names>E. M.</given-names> <surname>Bender</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Koller</surname></string-name>. “<article-title>Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data</article-title>”. In: <source>ACL 2020</source>. Online: Association for Computational Linguistics, July <year>2020</year>, pp. <fpage>5185</fpage>-<lpage>5198</lpage>. url: https://www.aclweb.org/anthology/2020.acl-main.463.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] <string-name><given-names>T. B.</given-names> <surname>Brown</surname></string-name> et al. “<article-title>Language Models are Few-Shot Learners</article-title>”. In: <source>CoRR</source> abs/2005.14165 (<year>2020</year>). url: https://arxiv.org/abs/2005.14165.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] <string-name><given-names>J.</given-names> <surname>England</surname></string-name>. “<article-title>Dissipative adaptation in driven self-assembly</article-title>”. In: <source>Nature Nanotechnology</source> <volume>10</volume> (Nov. <year>2015</year>), pp. <fpage>919</fpage>-<lpage>923</lpage>. doi: 10.1038/nnano.2015.250.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] <string-name><given-names>J.</given-names> <surname>Firth</surname></string-name>. “<article-title>A synopsis of linguistic theory 1930-55</article-title>”. In: <source>Studies in linguistic analysis</source>. The Philological Society, Oxford (<year>1957</year>), pp. <fpage>1</fpage>-<lpage>32</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] <string-name><given-names>M.</given-names> <surname>Gavin</surname></string-name>. “<article-title>Vector Semantics, William Empson, and the Study of Ambiguity</article-title>”. In: <source>Critical Inquiry</source> <volume>44</volume>.<issue>4</issue> (<year>2018</year>), pp. <fpage>641</fpage>-<lpage>673</lpage>. doi: 10.1086/698174.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] <string-name><given-names>A.</given-names> <surname>Gladkova</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Drozd</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Matsuoka</surname></string-name>. “<article-title>Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn't</article-title>”. In: <source>SRW@HLT-NAACL 2016</source>, San Diego, California, USA, June 12-17, <year>2016</year>. The Association for Computational Linguistics, 2016, pp. <fpage>8</fpage>-<lpage>15</lpage>. doi: 10.18653/v1/n16-2002. url: https://doi.org/10.18653/v1/n16-2002.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] <string-name><given-names>W. L.</given-names> <surname>Hamilton</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Leskovec</surname></string-name>, and <string-name><given-names>D.</given-names> <surname>Jurafsky</surname></string-name>. “<article-title>Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change</article-title>”. In: <source>ACL 2016</source>. Berlin, Germany: Association for Computational Linguistics, Aug. <year>2016</year>, pp. <fpage>1489</fpage>-<lpage>1501</lpage>. doi: 10.18653/v1/P16-1141. url: https://www.aclweb.org/anthology/P16-1141.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] <string-name><given-names>Z. S.</given-names> <surname>Harris</surname></string-name>. “<article-title>Distributional structure</article-title>”. In: <source>Word</source> <volume>10</volume>.<issue>2-3</issue> (<year>1954</year>), pp. <fpage>146</fpage>-<lpage>162</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] <string-name><given-names>R. J.</given-names> <surname>Heuser</surname></string-name>. “<article-title>Word Vectors in the Eighteenth Century</article-title>”. In: <source>DH 2017</source>, Montréal, Canada, August 8-11, <year>2017</year>, Conference Abstracts. Ed. by <string-name><given-names>R.</given-names> <surname>Lewis</surname></string-name> et al. Alliance of Digital Humanities Organizations (ADHO), 2017, pp. <fpage>256</fpage>-<lpage>259</lpage>. url: https://dh2017.adho.org/abstracts/582/582.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] <string-name><given-names>G.</given-names> <surname>Jawahar</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Sagot</surname></string-name>, and <string-name><given-names>D.</given-names> <surname>Seddah</surname></string-name>. “<article-title>What Does BERT Learn about the Structure of Language?</article-title>” In: <source>ACL 2019</source>. Florence, Italy: Association for Computational Linguistics, July <year>2019</year>, pp. <fpage>3651</fpage>-<lpage>3657</lpage>. doi: 10.18653/v1/P19-1356. url: https://www.aclweb.org/anthology/P19-1356.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] <string-name><given-names>A.</given-names> <surname>Kolchinsky</surname></string-name> and <string-name><given-names>D. H.</given-names> <surname>Wolpert</surname></string-name>. “<article-title>Semantic information, autonomous agency and nonequilibrium statistical physics</article-title>”. In: <source>Interface Focus</source> <volume>8</volume>.<issue>6</issue> (<year>2018</year>), p. <fpage>20180041</fpage>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] <string-name><given-names>O.</given-names> <surname>Levy</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>Goldberg</surname></string-name>. “<article-title>Neural Word Embedding as Implicit Matrix Factorization</article-title>”. In: <source>NIPS 2014</source>, December 8-13, <year>2014</year>, Montreal, Quebec, Canada. 2014, pp. <fpage>2177</fpage>-<lpage>2185</lpage>. url: http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] <string-name><given-names>O.</given-names> <surname>Levy</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Goldberg</surname></string-name>, and <string-name><given-names>I.</given-names> <surname>Dagan</surname></string-name>. “<article-title>Improving Distributional Similarity with Lessons Learned from Word Embeddings</article-title>”. In: <source>TACL</source> <volume>3</volume> (<year>2015</year>), pp. <fpage>211</fpage>-<lpage>225</lpage>. doi: 10.1162/tacl_a_00134. url: https://www.aclweb.org/anthology/Q15-1016.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] <string-name><given-names>M.</given-names> <surname>Masterman</surname></string-name>. “<article-title>Semantic algorithms</article-title>”. In: <source>Language, Cohesion and Form</source>. Studies in Natural Language Processing. Cambridge University Press, <year>2005</year>, pp. <fpage>253</fpage>-<lpage>280</lpage>. doi: 10.1017/CBO9780511486609.012.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] <string-name><given-names>M.</given-names> <surname>Masterman</surname></string-name> (Braithwaite). “<article-title>XI.-Words</article-title>”. In: <source>Proceedings of the Aristotelian Society</source> <volume>54</volume>.<issue>1</issue> (July <year>2015</year>), pp. <fpage>209</fpage>-<lpage>232</lpage>. issn: 0066-7374. doi: 10.1093/aristotelian/54.1.209. url: https://academic.oup.com/aristotelian/article-pdf/54/1/209/5256573/aristotelian54-0209.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] <string-name><given-names>T.</given-names> <surname>McCoy</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Pavlick</surname></string-name>, and <string-name><given-names>T.</given-names> <surname>Linzen</surname></string-name>. “<article-title>Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference</article-title>”. In: <source>ACL 2019</source>. Florence, Italy: Association for Computational Linguistics, July <year>2019</year>, pp. <fpage>3428</fpage>-<lpage>3448</lpage>. doi: 10.18653/v1/P19-1334. url: https://www.aclweb.org/anthology/P19-1334.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] <string-name><given-names>T.</given-names> <surname>Mikolov</surname></string-name> et al. “<article-title>Efficient Estimation of Word Representations in Vector Space</article-title>”. In: <source>ICLR 2013</source>, Scottsdale, Arizona, USA, May 2-4, <year>2013</year>, Workshop Track Proceedings. Ed. by <string-name><given-names>Y.</given-names> <surname>Bengio</surname></string-name> and <string-name><given-names>Y.</given-names> <surname>LeCun</surname></string-name>. 2013. url: http://arxiv.org/abs/1301.3781.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] <string-name><given-names>D. M.</given-names> <surname>Mimno</surname></string-name> and <string-name><given-names>L.</given-names> <surname>Thompson</surname></string-name>. “<article-title>The strange geometry of skip-gram with negative sampling</article-title>”. In: <source>EMNLP 2017</source>, Copenhagen, Denmark, September 9-11, <year>2017</year>. 2017, pp. <fpage>2873</fpage>-<lpage>2878</lpage>. url: https://aclanthology.info/papers/D17-1308/d17-1308.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] <string-name><given-names>T.</given-names> <surname>Niven</surname></string-name> and <string-name><given-names>H.-Y.</given-names> <surname>Kao</surname></string-name>. “<article-title>Probing Neural Network Comprehension of Natural Language Arguments</article-title>”. In: <source>ACL 2019</source>. Florence, Italy: Association for Computational Linguistics, July <year>2019</year>, pp. <fpage>4658</fpage>-<lpage>4664</lpage>. doi: 10.18653/v1/P19-1459. url: https://www.aclweb.org/anthology/P19-1459.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23] <string-name><given-names>J.</given-names> <surname>Pennington</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Socher</surname></string-name>, and <string-name><given-names>C. D.</given-names> <surname>Manning</surname></string-name>. “<article-title>GloVe: Global Vectors for Word Representation</article-title>”. In: <source>EMNLP 2014</source>, October 25-29, <year>2014</year>, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. 2014, pp. <fpage>1532</fpage>-<lpage>1543</lpage>. url: http://aclweb.org/anthology/D/D14/D14-1162.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24] <string-name><given-names>M.</given-names> <surname>Peters</surname></string-name> et al. “<article-title>Deep Contextualized Word Representations</article-title>”. In: <source>HLT-NAACL 2018</source>. New Orleans, Louisiana: Association for Computational Linguistics, June <year>2018</year>, pp. <fpage>2227</fpage>-<lpage>2237</lpage>. doi: 10.18653/v1/N18-1202. url: https://www.aclweb.org/anthology/N18-1202.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] <string-name><given-names>B.</given-names> <surname>Schmidt</surname></string-name>. “<article-title>Stable Random Projection: Lightweight, General-Purpose Dimensionality Reduction for Digitized Libraries</article-title>”. In: <source>Journal of Cultural Analytics</source> (<year>2018</year>). doi: 10.22148/16.025. url: https://culturalanalytics.org/article/11033.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26] <string-name><given-names>J. R.</given-names> <surname>Searle</surname></string-name>. “<article-title>Minds, Brains, and Programs</article-title>”. In: <source>Behavioral and Brain Sciences</source> <volume>3</volume>.<issue>3</issue> (<year>1980</year>), pp. <fpage>417</fpage>-<lpage>457</lpage>. doi: 10.1017/s0140525x00005756.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27] <string-name><given-names>N.</given-names> <surname>Srivastava</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Salakhutdinov</surname></string-name>, and <string-name><given-names>G. E.</given-names> <surname>Hinton</surname></string-name>. “<article-title>Modeling Documents with Deep Boltzmann Machines</article-title>”. In: <source>Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI 2013</source>, Bellevue, WA, USA, August 11-15, <year>2013</year>. Ed. by <string-name><given-names>A.</given-names> <surname>Nicholson</surname></string-name> and <string-name><given-names>P.</given-names> <surname>Smyth</surname></string-name>. AUAI Press, 2013. url: https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&amp;smnu=2&amp;article_id=2423&amp;proceeding_id=29.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28] <string-name><given-names>G. J.</given-names> <surname>Stephens</surname></string-name> and <string-name><given-names>W.</given-names> <surname>Bialek</surname></string-name>. “<article-title>Statistical mechanics of letters in words</article-title>”. In: <source>Physical Review E</source> <volume>81</volume>.<issue>6</issue> (June <year>2010</year>). issn: 1550-2376. doi: 10.1103/PhysRevE.81.066119. url: http://dx.doi.org/10.1103/PhysRevE.81.066119.
        </mixed-citation>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          et al. “
          <article-title>Attention Is All You Need”</article-title>
          .
          <source>In: CoRR abs/1706</source>
          .03762 (
          <year>2017</year>
          ). url: http://arxiv.org/abs/1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30] <string-name><given-names>W.</given-names> <surname>Weaver</surname></string-name>. “<article-title>Translation</article-title>”. In: <source>Machine translation of languages: fourteen essays</source>. MIT and Wiley, <year>1955</year>, pp. <fpage>15</fpage>-<lpage>23</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>